I updated an older version of protobuf-net to the current one in a huge project (the version we were using is around 1-2 years old; I don't know the exact revision).
Sadly the newer version throws an exception
CreateWireTypeException in ProtoReader.cs line 292
in the following test case:
enum Test
{
    test1 = 0,
    test2
};

static public void Test1()
{
    Test original = Test.test2;
    using (MemoryStream ms = new MemoryStream())
    {
        Serializer.SerializeWithLengthPrefix<Test>(ms, original, PrefixStyle.Fixed32, 1);
        ms.Position = 0;
        Test obj;
        obj = Serializer.DeserializeWithLengthPrefix<Test>(ms, PrefixStyle.Fixed32);
    }
}
I found out enums are not supposed to be serialized directly outside of a class, but our system is too huge to simply wrap all the enums in classes. Are there any other solutions to this problem? It works fine with Serialize and Deserialize; only DeserializeWithLengthPrefix throws exceptions.
The testcase works fine in older revisions e.g. r262 of protobuf-net.
Simply, a bug; this is fixed in r640 (now deployed to both NuGet and google-code), along with an additional test based on your code above so that it can't creep back in.
Re performance (comments): the first hint I would look at would be "prefer groups". Basically, the protobuf specification includes 2 different ways of including sub-objects - "groups" and "length-prefix". Groups were the original implementation, but Google has since moved towards "length-prefix" and tries to advise people not to use "groups". However! Because of how protobuf-net works, "groups" are actually noticeably cheaper to write; this is because, unlike the Google implementation, protobuf-net does not know the length of things in advance. This means that to write a length-prefix, it needs to do one of:
calculate the length as needed (almost as much work as actually serializing the data, but it adds an entire duplicate of the code); write the length, then actually serialize the data
serialize to a buffer, write the length, write the buffer
leave a place-holder, serialize, then loop back and write the actual length into the place-holder, adjusting the padding if needed
I've implemented all 3 approaches at different times, but v2 uses the 3rd option. I keep toying with adding a 4th implementation:
leave a place-holder, serialize, then loop back and write the actual length using an overlong form (so no padding adjustments ever needed)
but... consensus seems to be that the "overlong form" is a bit risky; still, it would work nicely for protobuf-net to protobuf-net.
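To make the "overlong form" concrete (this is only a sketch of the idea, not protobuf-net's actual code): a varint normally uses the fewest bytes possible, but you can pad it to a fixed width by setting the continuation bit on the leading bytes, so a reserved 5-byte placeholder can always be patched in place once the real length is known:

// Hypothetical helper: write a 32-bit length as a fixed 5-byte "overlong" varint,
// so a reserved placeholder can be overwritten later without shifting any data.
// Decoders that tolerate non-minimal varints read the same value; the spec
// prefers the minimal encoding, which is where the perceived risk comes from.
static void WriteOverlongVarint32(byte[] buffer, int offset, uint value)
{
    for (int i = 0; i < 4; i++)
    {
        buffer[offset + i] = (byte)((value & 0x7F) | 0x80); // low 7 bits + continuation bit
        value >>= 7;
    }
    buffer[offset + 4] = (byte)(value & 0x7F); // final byte: continuation bit clear
}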
But as you can see: length-prefix always has some overhead. Now imagine fairly deeply nested objects, and you can see a few blips. Groups work very differently; the encoding format for a group is:
write a start marker; serialize; write an end marker
that's it; no length needed; really, really, really cheap to write. On the wire, the main difference between them is:
groups: cheap to write, but you can't skip them if you encounter them as unexpected data; you have to parse the payload's field headers to find the matching end marker
length-prefix: more expensive to write, but cheap to skip if you encounter them as unexpected data - you just read the length and copy/move that many bytes
But! too much detail!
What does that mean for you? Well, imagine you have:
[ProtoContract]
public class SomeWrapper
{
[ProtoMember(1)]
public List<Person> People { get { return people; } }
private readonly List<Person> people = new List<Person>();
}
You can make the super complex change:
[ProtoContract]
public class SomeWrapper
{
[ProtoMember(1, DataFormat=DataFormat.Group)]
public List<Person> People { get { return people; } }
private readonly List<Person> people = new List<Person>();
}
and it'll use the cheaper encoding scheme. All your existing data will be fine as long as you are using protobuf-net.
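If you want to reassure yourself before flipping that switch, a quick round-trip sketch is enough; here OldWrapper/NewWrapper stand in for the "before" and "after" versions of SomeWrapper, and Person is assumed to be some [ProtoContract] type of your own:

[ProtoContract]
public class OldWrapper
{
    [ProtoMember(1)] // default: length-prefixed sub-objects
    public List<Person> People { get { return people; } }
    private readonly List<Person> people = new List<Person>();
}

[ProtoContract]
public class NewWrapper
{
    [ProtoMember(1, DataFormat = DataFormat.Group)] // group-encoded sub-objects
    public List<Person> People { get { return people; } }
    private readonly List<Person> people = new List<Person>();
}

// Data written with the old layout still deserializes against the new one,
// because protobuf-net accepts either wire form for sub-objects when reading.
using (var ms = new MemoryStream())
{
    var old = new OldWrapper();
    old.People.Add(new Person());
    Serializer.Serialize(ms, old);
    ms.Position = 0;
    NewWrapper migrated = Serializer.Deserialize<NewWrapper>(ms);
}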
So, to put it as briefly as possible, I am trying to save my character's essential information. Not only have I heard that PlayerPrefs is ill-advised, but it also won't work right for some of the information I have (for instance, I can't put my Profession and its accompanying stats and inherited class info into PlayerPrefs), so I've pretty much assumed the best, if not only, way to accomplish this is through serialization.
Now, I am pretty positive I understand Serialization in a core way, but I wouldn't claim that I know it very intimately, and thus, I'm in a bit of a bind.
I have quite a few scripts written, and here's the gist for them.
Note: My scripts very well may be a mess, but if that's so, please tell me. I don't claim that they're great, only that I have a lot there, and AFAIK, they're all alright, it's just doing the Serializing that is difficult for me for whatever reason.
Slight description of them: I am simply trying to make a character script for a Guard that will take both the Job: Mercenary, as well as the Type: GuardPrototype, and then, I want to be able to save that. In theory, the GameControl.cs script would accomplish that, but I'm having troubles (obviously), and I have a bunch of things commented out because I am fairly clueless, lol.
So, that said, I did do the Persistence and Saving tutorial from Unity, but I'm not only using/calling different scripts, I'm also not handling simple floats, so I've had a hard time modifying that. Ultimately, I just want to know two things: Is the code I am trying to save sensible? And if it is, how on earth would I use serialization to save the info?
Thanks in advance, I appreciate any help I get.
TL;DR How in the hell does Serialization work with things that aren't simple floats, that are in separate scripts?
Notes:
The following are the chains of scripts I intend to use
ClassParadigm -> Mercenary //this is the job that gets used
TypeParadigm //because there are multiple it could be -> StandardParadigm -> GuardPrototype //of all the standard types, it is of a guard
Then, I want to have a script call them.
- Character (in this case, GuardA), which will then take a job, and a type (both established above), as well as StandardPlayerParadigm //What a standard player will possess
Finally, this is all supposed to be placed on an object in Unity, which I could then make a prefab of. So, in other words, if this were a character in my game, whenever that prefab was on the field, it'd be a GuardPrototype + Mercenary.
Edit: Thanks to Mike Hunt because they definitely assisted me big time with the main problem at hand. I now have a slightly different issue, but this seems MUCH more feasible.
Firstly, I updated my gist.
Secondly, I am having a thing in Unity where, when I attach the XMLSerialization script to a gameObject, it has some child profiles in it (like a nested menu that I don't want it to have).
I'm not quite sure how to combat that, and what's more, it certainly doesn't seem like it's actually assigning the values I want it to have (as in, I want the GuardA script to have stats assigned from its "type" script, but I don't think it's working). I'm positive I just did something a bit excessive and somewhere in my code it's calling something extra, but I can't for the life of me figure out where that would be.
So two questions now: A) What is going on with that?
B) Is this an effective use? Did I not quite implement this as intended?
Also, third question: This seems like an impeccable method for having duplicate enemies with minor variance in stats, but what exactly would I need to do to just save my standard player? Seems like it's still not quite hitting the mark for that, but I could be wrong and just not realize it.
If you want to use binary serialization, the best approach is to implement ISerializable.
You need to provide two things:
A GetObjectData method that tells the serializer what to save and how:
void GetObjectData(SerializationInfo info, StreamingContext context)
A custom constructor. The ISerializable interface implies a deserialization constructor with the signature:
(SerializationInfo info, StreamingContext context)
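As a rough sketch of what that might look like for a simple character class (CharacterData and its fields are placeholders for your own scripts, not an actual Unity API):

using System;
using System.Runtime.Serialization;

[Serializable]
public class CharacterData : ISerializable
{
    public string Name { get; set; }
    public int Level { get; set; }

    public CharacterData() { }

    // Tells the serializer what to save and under which names.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("name", Name);
        info.AddValue("level", Level);
    }

    // Deserialization constructor: the serializer calls this to rebuild the object.
    protected CharacterData(SerializationInfo info, StreamingContext context)
    {
        Name = info.GetString("name");
        Level = info.GetInt32("level");
    }
}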
And if you need another example, see the article Object serialization.
But for a game I would suggest looking at a custom XML-based serializer, so you don't need to write directions for binary serialization on every class change, and can serialize only the properties you need. Of course, there might be some trouble with properties in Unity :(.
Create a class that will store the info to save and decorate it with the Serializable attribute:
[Serializable]
public class Storage
{
    public string name;
    public int score;
}
When you want to save data, create an instance of this class, populate it, use .NET serialization and save with PlayerPrefs:
// Create
Storage storage = new Storage();
// Populate
storage.name = "Geoff";
storage.score = 10000;
// .NET serialization
BinaryFormatter bf = new BinaryFormatter();
MemoryStream ms = new MemoryStream();
bf.Serialize(ms, storage);
// use PlayerPrefs
PlayerPrefs.SetString("data", Convert.ToBase64String(ms.GetBuffer()));
You can retrieve it with the inverse process:
if (PlayerPrefs.HasKey("data") == false) { return null; }
string str = PlayerPrefs.GetString("data");
BinaryFormatter bf = new BinaryFormatter();
MemoryStream ms = new MemoryStream(Convert.FromBase64String(str));
Storage storage = bf.Deserialize(ms) as Storage;
I would suggest converting that into a generic method so you can use any type with any key:
public static class Save
{
    public static void SaveData<T>(string key, object value) where T : class
    {
        BinaryFormatter bf = new BinaryFormatter();
        MemoryStream ms = new MemoryStream();
        bf.Serialize(ms, value as T);
        PlayerPrefs.SetString(key, Convert.ToBase64String(ms.GetBuffer()));
    }

    public static T GetData<T>(string key) where T : class
    {
        if (PlayerPrefs.HasKey(key) == false) { return null; }
        string str = PlayerPrefs.GetString(key);
        BinaryFormatter bf = new BinaryFormatter();
        MemoryStream ms = new MemoryStream(Convert.FromBase64String(str));
        return bf.Deserialize(ms) as T;
    }
}
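For example, usage might look like this (the key name is arbitrary):

// Save
Storage storage = new Storage { name = "Geoff", score = 10000 };
Save.SaveData<Storage>("data", storage);

// Load (returns null if nothing has been saved under that key yet)
Storage loaded = Save.GetData<Storage>("data");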
I know this is a basic question, but I wanted to know some key details about their differences and how this affects performance.
Below is a C# example.
Suppose I have a Person class with properties like
public class Person
{
    public int PersonId { get; set; }
    public string Name { get; set; }
    ...
    ...
}
and I have defined somewhere
List<Person> myFamily; // I later initialize it to contain all my family members.
Now I have 2 functions fun1 and fun2
List<Person> fun1(List<Person> myFamily)
{
    ... // here some logic occurs and a smaller list of Person comes back
    ...
}
somewhere else
List<Person> selectedPersons = fun1(myFamily);
VS
List<int> fun2(List<int> myFamilyPersonIds)
{
    ... // here the same logic as in fun1 occurs, but it only needs PersonId to perform it
    ...
}
somewhere else
List<int> selectedPersonIds = fun2(myFamilyPersonIds);
List<Person> selectedPersons = myFamily.Where(a => selectedPersonIds.Contains(a.PersonId)).ToList();
I want to know in what ways this affects performance.
Tips and suggestions are also welcome.
The only way to know the correct answer is to use a tool to profile the code and see where your program is really spending its time. Until you do that, you're just doing guess-work and premature optimization.
However, assuming you already have these objects constructed, I tend to prefer the fun1() option. This is because it's passing around references to the same objects in memory. The fun2() option needs to make copies of the integer IDs. As a reference is the same size as an integer, I'd expect copying integers to be about the same amount of work as copying references. That part is a wash.
However, by staying with references you can save the later step of finding the whole object based on the ID. That should simplify your code, making it easier to read and maintain, and save work for the computer (improve performance), too.
Also, imagine for a moment that your property were something larger than an integer... say, a string. In this case, passing the object could be a huge performance win. But again, what I expect is just a naive guess without results from a real profiling tool.
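If you do want numbers rather than guesses, even a crude Stopwatch harness (before reaching for a full profiler) will tell you whether the difference matters in your case; fun1, fun2, myFamily and myFamilyPersonIds below are the members from the question, so this is only a sketch:

using System;
using System.Diagnostics;
using System.Linq;

// Crude timing sketch: run each variant many times so the measured work
// dominates the measurement overhead, then compare the totals.
const int iterations = 100000;

var sw = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
    var selected = fun1(myFamily);
}
sw.Stop();
Console.WriteLine("fun1 (pass objects): " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
for (int i = 0; i < iterations; i++)
{
    var ids = fun2(myFamilyPersonIds);
    var selected = myFamily.Where(p => ids.Contains(p.PersonId)).ToList();
}
sw.Stop();
Console.WriteLine("fun2 (pass ids, then look up): " + sw.ElapsedMilliseconds + " ms");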
We have a Hashtable (specifically the C# Dictionary class) that holds several thousands/millions of (Key,Value) pairs for near O(1) search hits/misses.
We'd like to be able to flush this data structure to disk (serialize it) and load it again later (deserialize) such that the internal hashtable of the Dictionary is preserved.
What we do right now:
Load from Disk => List<KVEntity>. (KVEntity is serializable. We use Avro to serialize - can drop Avro if needed)
Read every KVEntity from array => dictionary. This regenerates the dictionary/hashtable internal state.
< System operates, Dictionary can grow/shrink/values change etc >
When saving, read from the dictionary into an array (via myKVDict.Values.ToList() into a new List<KVEntity>)
We serialize the array (List<KVEntity>) to disk to save the raw data
Notice that during our save/restore, we lose the internal hashtable/dictionary state and have to rebuild it each time.
We'd like to directly serialize to/from the Dictionary (including its internal "live" state) instead of using an intermediate array just for the disk I/O. How can we do that?
Some pseudo code:
// The actual "node" that has information. Both myKey and myValue hold actual data worth storing
public class KVEntity
{
    public string myKey {get;set;}
    public DataClass myValue {get;set;}
}
// unit of disk IO/serialization
public List<KVEntity> myKVList {get;set;}
// unit of run time processing. The string key is KVEntity.myKey
public Dictionary<string,KVEntity> myKVDict {get;set;}
Storing the internal state of the Dictionary instance would be bad practice - a key tenet of OOP is encapsulation: that internal implementation details are deliberately hidden from the consumer.
Furthermore, the mapping algorithm used by Dictionary might change across different versions of the .NET Framework, especially given that CIL assemblies are designed to be forward-compatible (i.e. a program written against .NET 2.0 will generally work against .NET 4.5).
Finally, there are no real performance gains from serialising the internal state of the dictionary. It is much better to use a well-defined file format with a focus on maintainability than on speed. Besides, if the dictionary contains "several thousands" of entries, it should load from disk in under 15ms by my reckoning (assuming you have an efficient on-disk format). And a data structure optimised for RAM will not necessarily work well on disk, where sequential reads/writes are better.
Your post is very adamant about working with the internal state of the dictionary, but your existing approach seems fine (albeit it could do with some optimisations). If you reveal more details we can help you make it faster.
Optimisations
The main issues I see with your existing implementation is the conversion to/from Arrays and Lists, which is unnecessary given that Dictionary is directly enumerable.
I would do something like this:
Dictionary<String,TFoo> dict = ... // where TFoo : new() and implements arbitrary Serialize(BinaryWriter) and Deserialize(BinaryReader) methods
using(FileStream fs = File.OpenWrite("filename.dat"))
using(BinaryWriter wtr = new BinaryWriter(fs, Encoding.UTF8)) {
    wtr.Write( dict.Count );
    foreach(String key in dict.Keys) {
        wtr.Write( key );
        wtr.Write('\0');
        dict[key].Serialize( wtr );
        wtr.Write('\0'); // assuming NULL characters can work as record delimiters for safety.
    }
}
Assuming that your TFoo's Serialize method is fast, I really don't think you'll get any faster speeds than this approach.
Implementing a de-serializer is an exercise for the reader, but should be trivial. Note how I stored the size of the dictionary to the file, so the returned dictionary can be set with the correct size when it's created, thus avoiding the re-balancing problem that #spender describes in his comment.
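For reference, a matching read-side sketch (under the same assumptions about TFoo) might look like this:

Dictionary<String,TFoo> dict;
using(FileStream fs = File.OpenRead("filename.dat"))
using(BinaryReader rdr = new BinaryReader(fs, Encoding.UTF8)) {
    int count = rdr.ReadInt32();
    dict = new Dictionary<String,TFoo>(count); // pre-size so it never re-hashes while loading
    for(int i = 0; i < count; i++) {
        String key = rdr.ReadString();
        rdr.ReadChar();                 // skip the '\0' written after the key
        TFoo value = new TFoo();
        value.Deserialize(rdr);         // mirror of TFoo.Serialize(BinaryWriter)
        rdr.ReadChar();                 // skip the '\0' record delimiter
        dict.Add(key, value);
    }
}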
So we're going to stick with our existing strategy given Dai's reasoning and that we have C# and Java compatibility to maintain (which means the extra tree-state bits of the C# Dictionary would be dropped on the Java side anyways which would load only the node data as it does right now).
For later readers still interested in this, I found a very good response here that somewhat answers the question posed. A critical difference is that this answer is for B+ Trees, not Dictionaries, although in practical applications those two data structures are very similar in performance; B+ Tree performance is closer to Dictionaries than to regular trees (binary, red-black, AVL etc.). Specifically, Dictionaries deliver near O(1) performance (but no "select from a range" abilities), while B+ Trees are O(log_b(X)) where the base b is usually large, which makes them very performant compared to regular trees where b = 2. I'm copy-pasting it here for completeness, but all credit goes to csharptest.net for the B+ Tree code, tests, benchmarks and writeup(s).
For completeness I'm going to add my own implementation here.
Introduction - http://csharptest.net/?page_id=563
Benchmarks - http://csharptest.net/?p=586
Online Help - http://help.csharptest.net/
Source Code - http://code.google.com/p/csharptest-net/
Downloads - http://code.google.com/p/csharptest-net/downloads
NuGet Package - http://nuget.org/List/Packages/CSharpTest.Net.BPlusTree
Currently, I am working on a project where I need to bring GBs of data onto a client machine to do some work; the task needs the whole data set, as it does analysis on the data and helps in the decision-making process.
So the question is: what are the best practices and suitable approaches to manage that much data in memory without hampering the performance of the client machine and the application?
Note: at application load time we can spend time bringing data from the database to the client machine; that's totally acceptable in our case. But once the data is loaded into the application at startup, performance is very important.
This is a little hard to answer without a problem statement, i.e. what problems you are currently facing, but the following is just some thoughts, based on some recent experiences we had in a similar scenario. It is, however, a lot of work to change to this type of model - so it also depends how much you can invest trying to "fix" it, and I can make no promise that "your problems" are the same as "our problems", if you see what I mean. So don't get cross if the following approach doesn't work for you!
Loading that much data into memory is always going to have some impact, however, I think I see what you are doing...
When loading that much data naively, you are going to have many (millions?) of objects and a similar-or-greater number of references. You're obviously going to want to be using x64, so the references will add up - but in terms of performance the biggest problem is going to be garbage collection. You have a lot of objects that can never be collected, but the GC is going to know that you're using a ton of memory, and is going to try anyway periodically. This is something I looked at in more detail here; the graph in that post shows the impact - in particular, the "spikes" are all GC killing performance.
For this scenario (a huge amount of data loaded, never released), we switched to using structs, i.e. loading the data into:
struct Foo {
    private readonly int id;
    private readonly double value;
    public Foo(int id, double value) {
        this.id = id;
        this.value = value;
    }
    public int Id { get { return id; } }
    public double Value { get { return value; } }
}
and stored those directly in arrays (not lists):
Foo[] foos = ...
The significance of that is that because some of these structs are quite big, we didn't want them being copied lots of times on the stack; but with an array you can do:
private void SomeMethod(ref Foo foo) {
    if(foo.Value == ...) {blah blah blah}
}
// call ^^^
int index = 17;
SomeMethod(ref foos[index]);
Note that we've passed the object directly - it was never copied; foo.Value is actually looking directly inside the array. The tricky bit starts when you need relationships between objects. You can't store a reference to a Foo here, because Foo is a struct, so there is no reference to store. What you can do, though, is store the index (into the array). For example:
struct Customer {
    ... more not shown
    public int FooIndex { get { return fooIndex; } }
}
Not quite as convenient as customer.Foo, but the following works nicely:
Foo foo = foos[customer.FooIndex];
// or, when passing to a method, SomeMethod(ref foos[customer.FooIndex]);
Key points:
we're now using half the size for "references" (an int is 4 bytes; a reference on x64 is 8 bytes)
we don't have several-million object headers in memory
we don't have a huge object graph for GC to look at; only a small number of arrays that GC can look at incredibly quickly
but it is a little less convenient to work with, and needs some initial processing when loading
additional notes:
strings are a killer; if you have millions of strings, then that is problematic; at a minimum, if you have strings that are repeated, make sure you do some custom interning (not string.Intern, that would be bad; a minimal sketch follows these notes) to ensure you only have one instance of each repeated value, rather than 800,000 strings with the same contents
if you have repeated data of finite length, rather than sub-lists/arrays, you might consider a fixed array; this requires unsafe code, but avoids another myriad of objects and references
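As a sketch of the custom interning mentioned above (a simple pool keyed by value; the StringPool name is just illustrative):

using System.Collections.Generic;

// Keeps exactly one string instance per distinct value; call Intern() on every
// string as you load it, and duplicates collapse onto the pooled instance.
public sealed class StringPool
{
    private readonly Dictionary<string, string> pool = new Dictionary<string, string>();

    public string Intern(string value)
    {
        if (value == null) return null;
        string existing;
        if (pool.TryGetValue(value, out existing)) return existing;
        pool.Add(value, value);
        return value;
    }
}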
As an additional footnote, with that volume of data, you should think very seriously about your serialization protocols, i.e. how you're sending the data down the wire. I would strongly suggest staying far away from things like XmlSerializer, DataContractSerializer or BinaryFormatter. If you want pointers on this subject, let me know.
I've written a helper class that takes a string in the constructor and provides a lot of Get properties to return various aspects of the string. Currently the only way to set the line is through the constructor and once it is set it cannot be changed. Since this class only has one internal variable (the string) I was wondering if I should keep it this way or should I allow the string to be set as well?
Some example code my help why I'm asking:
StreamReader stream = new StreamReader("ScannedFile.dat");
ScannerLine line = null;
int responses = 0;
while (!stream.EndOfStream)
{
    line = new ScannerLine(stream.ReadLine());
    if (line.IsValid && !line.IsKey && line.HasResponses)
        responses++;
}
Above is a quick example of counting the number of valid responses in a given scanned file. Would it be more advantageous to code it like this instead?
StreamReader stream = new StreamReader("ScannedFile.dat");
ScannerLine line = new ScannerLine();
int responses = 0;
while (!stream.EndOfStream)
{
    line.RawLine = stream.ReadLine();
    if (line.IsValid && !line.IsKey && line.HasResponses)
        responses++;
}
This code is used in the back end of a ASP.net web application and needs to be somewhat responsive. I am aware that this may be a case of premature optimization but I'm coding this for responsiveness on the client side and maintainability.
Thanks!
EDIT - I decided to include the constructor of the class as well (Yes, this is what it really is.) :
public class ScannerLine
{
    private string line;

    public ScannerLine(string line)
    {
        this.line = line;
    }

    /// <summary>Gets the date the exam was scanned.</summary>
    public DateTime ScanDate
    {
        get
        {
            DateTime test = DateTime.MinValue;
            DateTime.TryParseExact(line.Substring(12, 6).Trim(), "MMddyy", CultureInfo.InvariantCulture, DateTimeStyles.None, out test);
            return test;
        }
    }

    /// <summary>Gets a value indicating whether to use raw scoring.</summary>
    public bool UseRaw { get { return line.Substring(112, 1) == "R"; } }

    /// <summary>Gets the raw points per question.</summary>
    public float RawPoints
    {
        get
        {
            float test = float.MinValue;
            float.TryParse(line.Substring(113, 4).Insert(2, "."), out test);
            return test;
        }
    }
}
EDIT 2 - I included some sample properties of the class to help clarify. As you can see, the class takes a fixed-width string from a scanner and simply makes it easier to break the line apart into more useful chunks. The file is a line-delimited file from a Scantron machine, and the only way to parse it is a bunch of string.Substring calls and conversions.
I would definitely stick with the immutable version if you really need the class at all. Immutability makes it easier to reason about your code - if you store a reference to a ScannerLine, it's useful to know that it's not going to change. The performance difference is almost certain to be insignificant - the IO involved in reading the line is likely to be more significant than creating a new object. If you're really concerned about performance, you should benchmark/profile the code before you make a design decision based on those performance worries.
However, if your state is just a string, are you really providing much benefit over just storing the strings directly and having appropriate methods to analyse them later? Does ScannerLine analyse the string and cache that analysis, or is it really just a bunch of parsing methods?
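For illustration, a caching variant might look like this (parsing once in the constructor instead of on every property read); the CachingScannerLine name is just for the example, and whether it buys anything depends on how often each property is read:

using System;
using System.Globalization;

public class CachingScannerLine
{
    private readonly string line;
    private readonly DateTime scanDate; // parsed once, up front, and cached

    public CachingScannerLine(string line)
    {
        this.line = line;
        DateTime parsed;
        DateTime.TryParseExact(line.Substring(12, 6).Trim(), "MMddyy",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed);
        scanDate = parsed;
    }

    public DateTime ScanDate { get { return scanDate; } }
    // ...the other properties could be cached the same way.
}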
Your first approach is clearer. Performance-wise you might gain something with the second, but I don't think it's worth it.
I would go with the second option. It's more efficient, and they're both equally easy to understand IMO. Plus, you probably have no way of knowing how many times those statements in the while loop are going to be called. So who knows? It could be a .01% performance gain, or a 50% performance gain (not likely, but maybe)!
Immutable classes have a lot of advantages. It makes sense for a simple value class like this to be immutable. The object creation time for classes is small for modern VMs. The way you have it is just fine.
I'd actually ditch the "instance" nature of the class entirely, and use it as a static class, not an instance as you are right now. Every property is entirely independent from each other, EXCEPT for the string used. If these properties were related to each other, and/or there were other "hidden" variables that were set up every time that the string was assigned (pre-processing the properties for example), then there'd be reasons to do it one way or the other with re-assignment, but from what you're doing there, I'd change it to be 100% static methods of the class.
If you insist on having the class be an instance, then for pure performance reasons I'd allow re-assignment of the string, as then the CLR isn't creating and destroying instances of the same class continually (except for the string itself obviously).
At the end of the day, IMO this is something you can really do any way you want since there are no other class instance variables. There may be style reasons to do one or the other, but it'd be hard to be "wrong" when solving that problem. If there were other variables in the class that were set upon construction, then this'd be a whole different issue, but right now, code for what you see as the most clear.
I'd go with your first option. There's no reason for the class to be mutable in your example. Keep it simple unless you actually have a need to make it mutable. If you're really that concerned with performance, then run some performance analysis tests and see what the differences are.