I'm trying to understand and decide on the best approach to my problem.
I have an XSD that represents the schema of the information I agreed on with a client.
Now, in my application (C#, .NET 3.5) I use and consume an object that has been deserialized from an XML document created according to that XSD schema.
As soon as I fill the object with data, I want to pass it to another application and also store it in a db. I have two questions:
I'd like to serialize the object to pass it quickly to the other application: is binary or XML serialization better?
Unfortunately the database field I have to store the info in is of limited size, so I need some sort of compression of the serialized object. Does binary serialization produce smaller data than XML serialization, or do I need to compress the data in any case? If so, how?
Thanks!
I'd like to serialize the object to pass it quickly to the other application: is binary or XML serialization better?
Neither option is specific enough on its own; binary can be good or bad, and XML can be good or bad. Generally speaking, binary is smaller and faster to process, but switching to it makes the data unusable from code that expects XML.
Does binary serialization produce smaller data than XML serialization, or do I need to compress the data in any case?
It can be smaller; or it can be larger; indeed, compression can make things smaller or larger too.
If space is your primary concern, I would suggest running it through something like protobuf-net (a binary serializer without the versioning issues common to BinaryFormatter), and then speculatively trying to compress it with GZipStream. If the compressed version is smaller, store that along with a marker - perhaps a preamble - that says "I'm compressed". If the compressed version is bigger than the original, store the original (again with a preamble).
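For illustration, here is a rough sketch of that "serialize, then speculatively compress" idea using protobuf-net and GZipStream; the MyData type and the single-byte preamble values (0 = raw, 1 = gzip) are placeholders I've assumed for the example:

using System.IO;
using System.IO.Compression;
using ProtoBuf;

static byte[] Pack(MyData obj)
{
    // protobuf-net binary serialization
    byte[] raw;
    using (var ms = new MemoryStream())
    {
        Serializer.Serialize(ms, obj);
        raw = ms.ToArray();
    }

    // speculative compression
    byte[] zipped;
    using (var ms = new MemoryStream())
    {
        using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        zipped = ms.ToArray();
    }

    // keep whichever is smaller, prefixed with a marker byte so the reader knows
    byte[] payload = zipped.Length < raw.Length ? zipped : raw;
    var result = new byte[payload.Length + 1];
    result[0] = (byte)(payload == zipped ? 1 : 0);
    payload.CopyTo(result, 1);
    return result;
}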
Here's a recent breakdown of the performance (speed and size) of the common .NET serializers: http://theburningmonk.com/2013/09/binary-and-json-serializer-benchmarks-updated/
Is it currently possible to use IsPacked=true for user defined structures? If not, then is it planned in the future?
I get the following exception when I try to apply that attribute to a field of type ColorBGRA8[]: System.InvalidOperationException: Only simple data-types can use packed encoding
My scenario is as follows: I'm writing a game and have tons of blittable structures for various things such as colors, vectors, matrices, vertices, and constant buffers. Their memory layout needs to be precisely defined at compile time in order to match, for example, the constant buffer layout from a shader (where fields generally need to be aligned on a 16-byte boundary).
I don't mean to waste anyone's time, but I couldn't find any recent information about this particular question.
Edit after it has been answered
I am currently testing a solution which uses protobuf-net for almost everything except large arrays of user-defined but blittable structures. All my fields that were arrays of custom structures have been replaced by arrays of bytes, which can be packed. After protobuf-net has finished deserializing the data, I use memcpy via P/Invoke to turn the bytes back into arrays of custom structures.
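For reference, here is roughly how such a byte[] <-> struct[] copy can be done; this sketch uses GCHandle pinning plus Marshal.Copy rather than a P/Invoked memcpy, and the ColorBGRA8 layout shown is just an assumption for illustration:

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct ColorBGRA8 { public byte B, G, R, A; }

static byte[] ToBytes(ColorBGRA8[] colors)
{
    var bytes = new byte[colors.Length * Marshal.SizeOf(typeof(ColorBGRA8))];
    var handle = GCHandle.Alloc(colors, GCHandleType.Pinned);  // pin the struct array
    try
    {
        Marshal.Copy(handle.AddrOfPinnedObject(), bytes, 0, bytes.Length);
    }
    finally { handle.Free(); }
    return bytes;
}

static ColorBGRA8[] FromBytes(byte[] bytes)
{
    var colors = new ColorBGRA8[bytes.Length / Marshal.SizeOf(typeof(ColorBGRA8))];
    var handle = GCHandle.Alloc(colors, GCHandleType.Pinned);
    try
    {
        Marshal.Copy(bytes, 0, handle.AddrOfPinnedObject(), bytes.Length);
    }
    finally { handle.Free(); }
    return colors;
}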
The following numbers are from a test which serializes one instance containing one field of either the byte[] or the ColorBGRA8[]. The raw test data is ~38 MiB, i.e. 1,000,000 entries in the color array. Serialization was done in memory using a MemoryStream.
Writing
Platform.Copy + Protobuf: 51 ms, size: 38.15 MiB
Protobuf: 2093 ms, size: 109.45 MiB
Reading
Platform.Copy + Protobuf: 43 ms
Protobuf: 2307 ms
The test shows that for huge arrays of more or less random data, a noticeable memory overhead can occur. This wouldn't have been such a big deal, if not for the (de)serialization times. I understand protobuf-net might not be designed for my extreme case, let alone optimized for it, but it is something I am not willing to accept.
I think I will stick with this hybrid approach, as protobuf-net works extremely well for everything else.
Simply "does not apply". To quote from the encoding specification:
Only repeated fields of primitive numeric types (types which use the varint, 32-bit, or 64-bit wire types) can be declared "packed".
This doesn't work with custom structures or classes. The two approaches that apply here are strings (length-prefixed) and groups (start/end tokens). The latter is often cheaper to encode, but Google prefer the former.
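To illustrate (the field numbers and type names are arbitrary), packed encoding is accepted on primitive numeric arrays but not on arrays of custom types, which is what produces the InvalidOperationException quoted in the question:

using ProtoBuf;

[ProtoContract]
struct ColorBGRA8
{
    [ProtoMember(1)] public byte B;
    [ProtoMember(2)] public byte G;
    [ProtoMember(3)] public byte R;
    [ProtoMember(4)] public byte A;
}

[ProtoContract]
class Mesh
{
    [ProtoMember(1, IsPacked = true)]   // fine: float uses a simple numeric wire type
    public float[] Vertices { get; set; }

    [ProtoMember(2)]                    // custom struct array: cannot be packed; length-prefixed or grouped instead
    public ColorBGRA8[] Colors { get; set; }
}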
Protobuf is not designed to arbitrarily match some other byte layout. It is its own encoding format and is only designed to process / output protobuf data. It would be like saying "I'm writing XML, but I want it to look like {non-xml} instead".
Our application at work basically has to create over a million objects each night to run a numerical simulation involving some weather observations that were recorded during the day.
Each object contains a few string properties (and one very large XML property, about 2 MB). Because of the size of the large XML property we don't load it up front, and instead prefer to go to the database when we need access to this XML blob (which we do for each object).
I was wondering if it makes sense to somehow retrieve the XML data (which is 2 MB), compress it in memory, and store it in the object. This prevents us from having to do a database query for each object when we come to process it.
I would much rather zip the data, store it in the object and, at processing time, unzip and process it.
Is it possible to zip a string in process and how can I do this without creating millions of MemoryStreams / zip streams for each object?
I would think that compression is not a good idea - it adds quite an overhead to processing, which already appears to be quite intensive.
Perhaps a light-weight format would be better - JSON or a binary serialized object representing the data.
Without more detail, it is difficult to give a definite answer, or better options.
Well, there is DotNetZip which has a simple API so you can do something like this:
// Ionic.Zlib.DeflateStream from DotNetZip
byte[] compressedProperty;
public string MyProperty
{
    get { return DeflateStream.UncompressString(compressedProperty); }
    set { compressedProperty = DeflateStream.CompressString(value); }
}
Not sure if it will work out performance wise for you though.
Update:
I only know of the GZipStream and DeflateStream classes. Neither of them exposes a string interface. Even DotNetZip uses a stream under the hood when you call the functions above; it's just wrapped in a nicer interface (which you could do with the System.IO.Compression classes on your own). I'm not sure what your problem is with streams.
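For example, a wrapper of that sort over the built-in System.IO.Compression classes might look roughly like this (GZipStream chosen here; UTF-8 is my assumption about the encoding you want):

using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] CompressString(string text)
{
    var raw = Encoding.UTF8.GetBytes(text);
    using (var ms = new MemoryStream())
    {
        using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
        {
            gzip.Write(raw, 0, raw.Length);   // compress the UTF-8 bytes
        }
        return ms.ToArray();
    }
}

static string UncompressString(byte[] compressed)
{
    using (var input = new MemoryStream(compressed))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (var reader = new StreamReader(gzip, Encoding.UTF8))
    {
        return reader.ReadToEnd();            // decompress back to a string
    }
}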
If you really want to avoid streams then you probably have to roll your own compression. Here is a guy who rolled a simple Huffman encoder to encode strings in F#. I don't know how well it works, but if you want to avoid third-party libs and streams then you could give it a crack.
Below is something I read and was wondering if the statement is true.
Serialization is the process of converting a data structure or object into a sequence of bits so that it can be stored in a file or memory buffer, or transmitted across a network connection link to be "resurrected" later in the same or another computer environment.[1] When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward.
Serialization is just a fancy way of describing what you do when you want a certain data structure, class, etc. to be transmitted.
For example, say I have a structure:
struct Color
{
    int R, G, B;
};
When you transmit this over a network you don't just say "send Color". You create a sequence of bits and send it. I could create an unsigned char*, concatenate R, G, and B into it, and send those bytes. I just did serialization.
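Since the rest of this thread is .NET-centric, here is roughly the same hand-rolled idea in C#: packing the three int fields into a byte buffer before sending it.

using System;

static byte[] SerializeColor(int r, int g, int b)
{
    var buffer = new byte[12];                       // 3 x 4-byte int
    BitConverter.GetBytes(r).CopyTo(buffer, 0);
    BitConverter.GetBytes(g).CopyTo(buffer, 4);
    BitConverter.GetBytes(b).CopyTo(buffer, 8);
    return buffer;                                   // ready to write to a socket
}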
Serialization of some kind is required, but this can take many forms. It can be something like dotNET serialization, that is handled by the language, or it can be a custom built format. Maybe a series of bytes where each byte represents some "magic value" that only you and your application understand.
For example, in dotNET I can create a class with a single string property, mark it as serializable, and the dotNET framework takes care of most everything else.
I can also build my own custom format where the first 4 bytes represent the length of the data being sent and all subsequent bytes are characters in a string. But then of course you need to worry about byte ordering, Unicode vs. ANSI encoding, and so on.
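A rough sketch of that home-grown format (UTF-8 and the machine's native byte order are my assumptions here, and for brevity the reads assume the buffer is filled in one call):

using System;
using System.IO;
using System.Text;

static void WriteLengthPrefixed(Stream stream, string text)
{
    var payload = Encoding.UTF8.GetBytes(text);
    stream.Write(BitConverter.GetBytes(payload.Length), 0, 4);  // 4-byte length header
    stream.Write(payload, 0, payload.Length);                   // string bytes
}

static string ReadLengthPrefixed(Stream stream)
{
    var header = new byte[4];
    stream.Read(header, 0, 4);                                  // read the length
    var payload = new byte[BitConverter.ToInt32(header, 0)];
    stream.Read(payload, 0, payload.Length);                    // read that many bytes
    return Encoding.UTF8.GetString(payload);
}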
Typically it is easier to make use of whatever framework your language/OS/dev framework uses, but it is not required.
Yes, serialization is the only way to transmit data over the wire. Consider what the purpose of serialization is: you define the way the class is stored. In memory, though, you have no way to know exactly where each portion of the class is. If you have, for instance, a list that was allocated early and then reallocated, it's likely to be fragmented all over the place rather than sitting in one contiguous block of memory. How do you send that fragmented class over the line?
For that matter, if you send a List<ComplexType> over the wire, how does the receiver know where each ComplexType begins and ends?
The real problem here is not getting over the wire; the problem is ending up with the same semantic object on the other side of the wire. For properly transporting data between dissimilar systems -- whether via TCP/IP, floppy, or punch card -- the data must be encoded (serialized) into a platform-independent representation.
Because of alignment and type-size issues, if you attempted to do a straight binary transfer of your object it would cause Undefined Behavior (to borrow the definition from the C/C++ standards).
For example the size and alignment of the long datatype can differ between architectures, platforms, languages, and even different builds of the same compiler.
Is serialization a must in order to transfer data across the wire?
Literally no.
It is conceivable that you can move data from one address space to another without serializing it. For example, a hypothetical system using distributed virtual memory could move data / objects from one machine to another by sending pages ... without any specific serialization step.
And within a machine, objects could be transferred by switching pages from one virtual address space to another.
But in practice, the answer is yes. I'm not aware of any mainstream technology that works that way.
For anything more complex than a primitive or a homogeneous run of primitives, yes.
Binary serialization is not the only option. You can also serialize an object as an XML file, for example, or as JSON.
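For instance, the built-in XmlSerializer can produce an XML representation of an ordinary class; the Person type here is only a placeholder:

using System.IO;
using System.Xml.Serialization;

public class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
}

static string ToXml(Person p)
{
    var serializer = new XmlSerializer(typeof(Person));
    using (var writer = new StringWriter())
    {
        serializer.Serialize(writer, p);
        return writer.ToString();   // e.g. <Person><Name>...</Name><Age>...</Age></Person>
    }
}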
I think you're asking the wrong question. Serialization is a concept in computer programming and there are certain requirements which must be satisfied for something to be considered a serialization mechanism.
Any means of preparing data such that it can be transmitted or stored in such a way that another program (including but not limited to another instance of the same program on another system or at another time) can read the data and re-instantiate whatever objects the data represents.
Note I slipped the term "objects" in there. If I write a program that stores a bunch of text in a file, and I later use some other program, or another instance of that first program, to read that data ... I haven't really used a "serialization" mechanism. If I write it in such a way that the text is also stored with some state about how it was being manipulated ... that might entail serialization.
The term is used mostly to convey the concept that active combinations of behavior and state are being rendered into a form which can be read by another program/instance and instantiated. Most serialization mechanisms are bound to a particular programming language or virtual machine system (in the sense of a Java VM or a C# VM, not in the sense of "VMware" virtual machines). JSON (and YAML) are a notable exception to this: they represent data for which there are reasonably close object classes with reasonably similar semantics, such that they can be instantiated in multiple different programming languages in a meaningful way.
It's not that all data transmission or storage entails "serialization"; it's that certain ways of storing and transmitting data can be used for serialization. At the very least it must be possible to disambiguate among the types of data that the programming language supports. If it reads 1, it has to know whether that's text, an integer, a real (equivalent to 1.0), or a bit.
Strictly speaking it isn't the only option; you could make an argument that "remoting" meets the meaning in the text: here a fake object is created at the receiver that contains no state. All calls (methods, properties, etc.) are intercepted and only the call and result are transferred. This avoids the need to transfer the object itself, but can get very expensive if overly "chatty" usage is involved (i.e. lots of calls), as each call incurs the latency of the speed of light (which adds up).
However, "remoting" is now rather out of fashion. Most often, yes: the object will need to be serialised and deserialized in some way (there are lots of options here). The paragraph is then pretty-much correct.
Having messages as objects and serializing them into bytes is a better way of understanding and managing what is transmitted over the wire. In the old days protocols and data were much simpler; often, programmers just put bytes onto the output stream. Common understanding was shared by having well-known and simple specifications.
I would say serialization is needed to store objects in a file for persistence, but dynamically allocated pointers inside objects need to be rebuilt when we deserialize. Whether serialization is needed for transfer, though, depends on the physical protocol and mechanism used: for example, if I use a UART to transfer data it is serialized bit by bit, but if I use a parallel port then 8 bits are transferred together, which is not serialized.
I am looking for the protobuf-net equivalent to the C++ API Message::ByteSize to find out the serialized message length in bytes.
I haven't played with the C++ API, so you'll have to give me a bit more context / information. What does this method do? Perhaps a sample usage?
If you are consuming data from a stream, there are "WithLengthPrefix" versions to automate limiting to discrete messages, or I believe the method to just read the next length from the stream is on the public API.
If you want to get a length in place of serializing, then currently I suspect the easiest option is to serialize to a dummy stream and track the length. Oddly enough, an early version of protobuf-net did have "get the length without doing the work" methods, but after discussion I removed them. The data that gets serialized is still tracked, obviously; however, because of how the API works, the binary data length for an object is not available "for free".
If you clarify what the use-case is, I'm sure we can make it easily available (if it isn't already).
Re the comment: that is what I suspected. Because protobuf-net defers the binary translation to the last moment (because it is dealing with regular .NET types, not some self-generated code), there is no automatic way of getting this value without doing the work. I could add a mechanism to let you get this value by writing to Stream.Null, but if you need the data anyway you might benefit from just writing to a MemoryStream and checking its .Length before copying the data.
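A minimal sketch of that MemoryStream approach, assuming a [ProtoContract] type called MyMessage: serialize once, read the length, then reuse the already-serialized bytes instead of serializing again.

using System.IO;
using ProtoBuf;

static void SerializeAndMeasure(MyMessage msg, Stream destination)
{
    using (var ms = new MemoryStream())
    {
        Serializer.Serialize(ms, msg);
        long byteSize = ms.Length;   // the serialized size in bytes (what Message::ByteSize reports in C++)
        ms.WriteTo(destination);     // reuse the buffer rather than serializing twice
    }
}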