I'm trying to understand my options and decide on the best approach to a problem.
I have an XSD that represents the schema of the information I agreed on with a client.
Now, in my application (C#, .NET 3.5) I use and consume an object that has been deserialized from an XML document created according to the XSD schema.
As soon as I fill the object with data, I want to pass it to another application and also store it in a database. I have two questions:
I'd like to serialize the object to pass it quickly to the other application: is binary or XML serialization better?
Unfortunately, the database field where I have to store the info has a limited size, so I need some sort of compression of the serialized object. Does binary serialization create smaller data than XML serialization, or do I need to compress the data in either case? If so, how?
Thanks!
I'd like to serialize the object to pass it quickly to the other application: is binary or XML serialization better?
Neither option is specific enough; binary can be good or bad, and XML can be good or bad. Generally speaking, binary is smaller and faster to process, but switching to it will make the data unusable from code that expects XML.
Does binary serialization create smaller data than XML serialization, or do I need to compress the data in either case?
It can be smaller, or it can be larger; indeed, compression can make things smaller or larger too.
If space is your primary concern, I would suggest running it through something like protobuf-net (a binary serializer without the versioning issues common to BinaryFormatter), and then speculatively try compressing it with GZipStream. If the compressed version is smaller: store that (and a marker - perhaps a preamble - that says "I'm compressed"). If the compressed version gets bigger than the original version, store the original (again with a preamble).
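Here's a minimal sketch of that speculative-compression idea, assuming protobuf-net's Serializer and a [ProtoContract]-annotated type; SerializeForDb is a hypothetical helper name:

using System.IO;
using System.IO.Compression;
using ProtoBuf;

static byte[] SerializeForDb<T>(T obj)
{
    // protobuf-net binary serialization into a memory buffer
    byte[] raw;
    using (var ms = new MemoryStream())
    {
        Serializer.Serialize(ms, obj);
        raw = ms.ToArray();
    }

    // speculatively gzip the payload
    byte[] zipped;
    using (var ms = new MemoryStream())
    {
        using (var gzip = new GZipStream(ms, CompressionMode.Compress))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        zipped = ms.ToArray();
    }

    // one-byte preamble: 1 = compressed, 0 = raw (keep whichever is smaller)
    bool useZipped = zipped.Length < raw.Length;
    byte[] payload = useZipped ? zipped : raw;
    var result = new byte[payload.Length + 1];
    result[0] = useZipped ? (byte)1 : (byte)0;
    payload.CopyTo(result, 1);
    return result;
}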
Here's a recent breakdown of the performance (speed and size) of the common .NET serializers: http://theburningmonk.com/2013/09/binary-and-json-serializer-benchmarks-updated/
Related
I am receiving a binary stream from an application I am running in Python.
From that binary stream, I want to recreate a C# object that is contained in the stream as a byte array.
How do I deserialize the object and retrieve it from the binary stream?
We can ignore that it's a Python application. I am more interested in how binary streaming works.
You seem to think that all languages automatically use the same serialization scheme.
This is not so.
It is not even theoretically possible, because different programming languages have different notions of what it means to be an object.
If you are specifically interested in how to read a Python serialized stream in C#, then ask that. Otherwise, this question is unanswerable because it is based on a false premise.
FOLLOW UP - Out of curiosity, I did some searching for a Python pickle reader in C#. Nothing in the first 3 pages of search results ... though there was a reference to a pickle reader in C++.
Just to add a little general info:
In C#/.NET, the general approach is NOT to serialize objects to a binary form, because a binary form needs a lot of protocol-like headers to carry the metadata, and this forces the receiver to know the .NET/CLR internals very well.
Instead, objects today are usually serialized to XML (when type information is crucial) or JSON (when only the data matters), so that any receiver can read them quite easily, and, more often, so that any third party can easily generate new object-like data that our application can "just deserialize", regardless of who generated it and on what platform.
However, binary serialization is still used: XML/JSON data, even if compressed, is usually larger than the binary image. Binary serialization is strictly used when we don't want the data published to the outside world, or when we somehow know for certain that it will only ever be processed on .NET with our assemblies.
C# object
C# itself does not have objects; what you have is a .NET object.
Secondly, we absolutely CANNOT ignore that it's a Python application, because that implies it is likely not running on .NET, and therefore the .NET binary format is not native to your Python runtime. That's not to say .NET serialization can't be available to you in this case: if you're running IronPython - the .NET Python implementation - then you can simply use the binary serialization APIs from within it and get back the .NET object that was serialized.
If, however, it's Python running on a different platform, then you can still decode the information in the binary stream, but you need to know the format; for that, go straight to the horse's mouth and read through the Binary Format Data Structure spec from MSDN.
This will, of course, require (quite a lot) more work!
If the project you're working on allows you to change the way the original object is serialized, then I strongly suggest changing over to XML serialization or something similar that is designed to be portable.
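For instance, here's a minimal sketch of a portable round-trip with XmlSerializer; Message is a hypothetical stand-in type:

using System.IO;
using System.Xml.Serialization;

public class Message
{
    public string Body { get; set; }
}

class Demo
{
    static void Main()
    {
        var serializer = new XmlSerializer(typeof(Message));

        // Serialize: the output is plain text, readable on any platform.
        string xml;
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, new Message { Body = "hello" });
            xml = writer.ToString();
        }

        // Deserialize: any consumer that knows the schema can do the same.
        using (var reader = new StringReader(xml))
        {
            var roundTripped = (Message)serializer.Deserialize(reader);
        }
    }
}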
OK. I know how to use serialization and such, but since that only applies to objects that have been marked with the Serializable attribute, how can I, for example, load data and use it in an application without using serialization? Say, a data file.
Or, create a data container with serialization that holds files that are not themselves serialized.
The methods I've used are binary serialization and XML serialization. Are there any other ways to load unknown data and perhaps somehow use it in C#?
JSON serialization using JSON.NET
This eats everything! Including anonymous types.
Edit
I know you said "you don't want serialization", but based on your statement "[...] objects that have been marked with the Serializable attribute", I believe you didn't try JSON serialization using JSON.NET!
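As a minimal sketch, assuming the Newtonsoft.Json package; Settings is a hypothetical type, and note that no [Serializable] attribute is needed:

using Newtonsoft.Json;

public class Settings
{
    public string Theme { get; set; }
    public int FontSize { get; set; }
}

class Demo
{
    static void Main()
    {
        var settings = new Settings { Theme = "dark", FontSize = 12 };

        // Serialize to a human-readable JSON string.
        string json = JsonConvert.SerializeObject(settings, Formatting.Indented);

        // Deserialize back into a strongly typed object.
        var roundTripped = JsonConvert.DeserializeObject<Settings>(json);
    }
}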
Maybe a definition of terms is in order; serialization is "the process of converting a data structure or object state into a format that can be stored and "resurrected" later in the same or another computer environment". Pretty much any method of converting "volatile" memory into persistent data and back is "serialization", so even if you roll your own scheme to do it, you're "serializing".
That said, it sounds like you simply don't want to use .NET binary serialization. That's actually the right idea; binary serialization is simple, but very code- and environment-dependent. Moving a serializable class to a different namespace, or serializing a file using the Microsoft CLR and then trying to deserialize it in Mono, can break binary serialization.
First and foremost, you MUST be able to determine what type of object you should try to create based on the file. You simply cannot open some "random" file and expect to be able to get anything meaningful out of it without knowing how the data is structured within the file. The easiest way is for the file to tell you, by specifying the type name of the object it was created from (which you will hopefully have available in your codebase). Most built-in serializers do it this way. Other ways the file can inform consumers of its format include file, row and/or field header codes (very common in older standards as they economize on file size) and extension/MIME type.
With that sorted out, deserialization can take place. If the file was serialized using a built-in serializer, simply use that, but if it's an older format (CSV, fixed-length) then you will have to parse the file, line by line, into objects representing lines, collected within a main object representing the file.
Have a look at the ETL (Extract-Transform-Load) process pattern. This is a modular, scalable architecture pattern for taking files and turning them into data the program can work with; a minimal sketch follows the three steps below:
Extract - This part of the system is pointed at the filesystem, or other incoming "pipe" for raw data, and its job is to open the file, extract the data into a very basic object format that can be further manipulated, and put those objects into an in-memory "queue" for the Transform step. The goal is to get data from the pipe as fast and efficiently as possible, but you are required at this point to have some knowledge of the data you are working with so that you can effectively encapsulate it for further processing; actually turning the data into the format you really want happens later.
Transform - This part of the system takes the extracted data, and performs the logic that will put that data into a hydrated object from your codebase. This is where, given information from the Extract step about the type of file the data was extracted from, you instantiate a domain object that represents the data model, slice the raw data up into the chunks that will be stored as data members, perform any type conversions (data you get from a file is usually either in string format or in raw bits and must be marshalled or otherwise converted into data types that better represent the concept of the data), and validate that the internal structure of the new object is consistent and meets known business rules. Hydrated, valid objects are placed in an output queue to be processed by the Load step.
Load - This step takes the hydrated, valid business objects from the Transform step and persists them into the data store that is used by your system (such as a SQL database or the program's native flat file format).
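Here's a minimal sketch of the three steps, under the assumption that the input is a simple CSV file of people; PersonRecord and the file layout are hypothetical stand-ins:

using System;
using System.Collections.Generic;
using System.IO;

public class PersonRecord
{
    public string Name { get; set; }
    public int Age { get; set; }
}

class EtlPipeline
{
    // Extract: pull raw lines off the filesystem as fast as possible.
    static IEnumerable<string> Extract(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

    // Transform: slice raw lines into hydrated, validated domain objects.
    static IEnumerable<PersonRecord> Transform(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            var fields = line.Split(',');
            var record = new PersonRecord
            {
                Name = fields[0],
                Age = int.Parse(fields[1])   // string-to-int type conversion
            };
            if (record.Age >= 0)             // an example business rule
                yield return record;
        }
    }

    // Load: persist the valid objects (here, just write them out).
    static void Load(IEnumerable<PersonRecord> records)
    {
        foreach (var r in records)
            Console.WriteLine("{0}, {1}", r.Name, r.Age);
    }

    static void Main()
    {
        Load(Transform(Extract("people.csv")));
    }
}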
Well, the old-fashioned way was to use stream access operations and read out the data you wanted. This way you could read/write pretty much any file.
Serialization simply automates this process based on some contract.
Based on your comment, I'm guessing that your requirement is to read any kind of file without having a contract in the first place.
Let's say you have a raw file where the first byte specifies the length of a string and the next set of bytes represents the string:
For example, 5 | H | e | l | l | o
using System.IO;   // File, FileMode
using System.Text; // Encoding

using (var stream = File.Open(filename, FileMode.Open))
{
    var length = stream.ReadByte();            // first byte: string length
    var buffer = new byte[length];
    stream.Read(buffer, 0, length);            // next bytes: the characters
    var text = Encoding.ASCII.GetString(buffer);
}
Binary I/O is as raw as it gets.
Check MSDN for more.
I'm looking for a simple solution to serialize and store objects that contain configuration, application state, and data. It's a simple application and there isn't a lot of data. Speed is no issue. I want it to be in-process, and I want the result to be easier to edit in a text editor than XML.
I can't find any document database for .NET that can handle it in-process.
I'm not sure I want to simply serialize to XML, because it's... XML.
Serializing to JSON seems very JavaScript-specific, and I won't use this data in JavaScript.
I figure there are very neat ways to do this, but at the moment I'm leaning toward JSON despite its JavaScript inclination.
Just because "JSON" it's an acronym for JavaScript Object Notation, has no relevance on if it fits your needs or not as a data format. JSON is lightweight, text based, easily human readable / editable and it's a language agnostic format despite the name.
I'd definitely lean toward using it, as it sounds pretty ideal for your situation.
I will give a couple of choices:
Binary serialization: this depends on the content of your objects; if you have a complicated dependency tree, it can create problems during serialization. It's also not very flexible, as the standard binary serialization provided by Microsoft stores type information too. That means that if you save a type to a binary file and one month later decide to reorganize your code - say, move that same class to another namespace - deserialization from the previously saved binary file will fail, because the type is no longer the same. There are several workarounds for that, but I personally try to avoid this kind of serialization as much as I can.
ORM mapping and storing it in a small database. SQLite is an awesome choice for this kind of stuff, as it is a small (single-file) database with full ACID support. You need a mapper, or you need to implement the mapper yourself; a minimal sketch follows this list.
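A minimal sketch of the SQLite route, assuming the System.Data.SQLite ADO.NET provider; the Settings table and columns are hypothetical:

using System.Data.SQLite;

class SqliteStore
{
    static void Save(string theme, int fontSize)
    {
        using (var conn = new SQLiteConnection("Data Source=app.db"))
        {
            conn.Open();

            // Create the (hypothetical) table on first use.
            using (var cmd = new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS Settings (Theme TEXT, FontSize INT)", conn))
            {
                cmd.ExecuteNonQuery();
            }

            // Insert a row via parameters (no string concatenation).
            using (var cmd = new SQLiteCommand(
                "INSERT INTO Settings (Theme, FontSize) VALUES (@t, @f)", conn))
            {
                cmd.Parameters.AddWithValue("@t", theme);
                cmd.Parameters.AddWithValue("@f", fontSize);
                cmd.ExecuteNonQuery();
            }
        }
    }
}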
I'm sure that you will get some other choice from the folks in a couple of minutes.
So choice is up to you.
Good luck.
I'm confused - when should I be using XML Serialization and when should I be using Binary Serialization in the .NET framework?
Both of the existing answers focus on "cross platform", but that is an unrelated issue. The point they are making there is "don't use BinaryFormatter if you are doing cross-platform" - which I entirely support. However there are a range of binary serialization formats that are very much cross-platform - protobuf / ASN.1 being prime examples.
So, let's look instead at what each has to offer:
Binary is typically smaller, typically faster to process (at both ends), and not easily human-readable/editable.
Text formats (XML/JSON) tend to be more verbose than binary (although they often compress well), but are pretty easy to work with by hand; all that text processing and mapping tends to make them slower, though.
XML is very common in web services, and benefits from tooling support such as XSD, XSLT, and robust XML editors.
JSON is the main player in browser-based comms (although it is also used in web services); it tends to be less formal but is still very effective.
Notice how interoperability is neither a strength nor weakness of either, as long as you choose an appropriate binary format!
Here's an answer that compares the serialization time, deserialization time, and space metrics of most of the .NET serializers, for your reference.
Specific to .NET: if you have two applications that use the same type system, then you can use binary serialization. On the other hand, if you have applications on different platforms, then XML serialization is recommended. So if I am writing a chat application (client and server), I might use binary serialization; but if I later decide to write a client in Python, then I may not.
If you want user friendly or cross platform output, then use XML.
Also, XML serialization uses public fields and properties to serialize, and you can change/reformat the output by using attributes and custom serialization at the class level if you want.
Binary serialization uses private fields to serialize.
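A minimal sketch of reshaping XmlSerializer output with attributes; Person is a hypothetical type:

using System;
using System.IO;
using System.Xml.Serialization;

public class Person
{
    [XmlAttribute("id")]        // serialized as an XML attribute
    public int Id { get; set; }

    [XmlElement("full-name")]   // serialized as a renamed child element
    public string Name { get; set; }
}

class Demo
{
    static void Main()
    {
        var serializer = new XmlSerializer(typeof(Person));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, new Person { Id = 123, Name = "Ada" });
            // Produces something like: <Person id="123"><full-name>Ada</full-name></Person>
            Console.WriteLine(writer.ToString());
        }
    }
}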
I am writing code that parses XML.
I would like to know what is faster to parse: elements or attributes.
This will have a direct effect over my XML design.
Please target the answers to C# and the differences between LINQ and XmlReader.
Thanks.
Design your XML schema so that the representation of the information actually makes sense. Usually, the decision between making something an attribute or an element will not affect performance.
Performance problems with XML are in most cases related to large amounts of data represented in a very verbose XML dialect. A typical countermeasure is to zip the XML data when storing it or transmitting it over the wire.
If that is not sufficient then switching to another format such as JSON, ASN.1 or a custom binary format might be the way to go.
Addressing the second part of your question: The main difference between the XDocument (LINQ) and the XmlReader class is that the XDocument class builds a full document object model (DOM) in memory, which might be an expensive operation, whereas the XmlReader class gives you a tokenized stream on the input document.
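A minimal sketch contrasting the two on the same document; people.xml is a hypothetical file like <people><person id="1">Ada</person></people>:

using System;
using System.Xml;
using System.Xml.Linq;

class Demo
{
    static void Main()
    {
        // XDocument: builds the whole DOM in memory, then you query it.
        var doc = XDocument.Load("people.xml");
        foreach (var person in doc.Descendants("person"))
            Console.WriteLine(person.Attribute("id").Value);

        // XmlReader: forward-only token stream, minimal memory footprint.
        using (var reader = XmlReader.Create("people.xml"))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "person")
                    Console.WriteLine(reader.GetAttribute("id"));
            }
        }
    }
}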
With XML, speed is dependent on a lot of factors.
With regards to attributes or elements, pick the one that more closely matches the data. As a guideline, we use attributes for, well, attributes of an object, and elements for contained sub-objects.
Depending on the amount of data you are talking about, using attributes can save you a bit on the size of your XML streams. For example, <person id="123" /> is smaller than <person><id>123</id></person>. This doesn't really impact the parsing, but it will impact the speed of sending the data across a network wire or loading it from disk. If we are talking about thousands of such records, then it may make a difference to your application.
Of course, if that actually does make a difference then using JSON or some binary representation is probably a better way to go.
The first question you need to ask is whether XML is even required. If it doesn't need to be human readable then binary is probably better. Heck, a CSV or even a fixed-width file might be better.
With regards to LINQ vs XmlReader, this is going to boil down to what you do with the data as you are parsing it. Do you need to instantiate a bunch of objects and handle them that way or do you just need to read the stream as it comes in? You might even find that just doing basic string manipulation on the data might be the easiest/best way to go.
Point is, you will probably need to examine the strengths of each approach beyond just "what parses faster".
Without having any hard numbers to prove it, I know that the WCF team at Microsoft chose to make the DataContractSerializer their standard for WCF. It's limited in that it doesn't support XML attributes, but it is indeed up to 10-15% faster than the XmlSerializer.
From that information, I would assume that using XML attributes will be slower to parse than if you use only XML elements.
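For reference, a minimal sketch of DataContractSerializer usage; Person is a hypothetical type, and note that both members come out as child elements, never as XML attributes:

using System;
using System.IO;
using System.Runtime.Serialization;
using System.Text;

[DataContract]
public class Person
{
    [DataMember] public int Id { get; set; }
    [DataMember] public string Name { get; set; }
}

class Demo
{
    static void Main()
    {
        var serializer = new DataContractSerializer(typeof(Person));
        using (var ms = new MemoryStream())
        {
            serializer.WriteObject(ms, new Person { Id = 123, Name = "Ada" });
            // Id and Name appear as elements in the output XML.
            Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()));
        }
    }
}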