Versioning serialized files (C#)

I have a working app that serializes a document (of type IDocument) to disk. A second app that I wrote can open that document for viewing (IDocument implements IPrintDocument).
Suppose I write an IDocument to disk, and a week later a field is added to the IDocument type. Both the program that writes the files and the one that opens them are updated to this new 'version' of IDocument. I assume that opening a file written with the previous IDocument version will then break (I haven't had a chance to check; I'm looking ahead here). Is there a known pattern that alleviates this kind of problem?

Yes - use a serialization mechanism which is tolerant of versioning.
Predictably enough, I'm going to suggest using Google's Protocol Buffers, for which there are at least two viable .NET implementations. So long as you're careful, Protocol Buffers are both backward and forward compatible - you can read a new message with old code and vice versa, and the old code will still be able to preserve the information it doesn't understand.
Another alternative is XML, whether using .NET's built-in XML serialization or not. The built-in serialization isn't particularly flexible in terms of versioning as far as I'm aware.

The .NET built-in serialization is an option, but it does require you to add placeholders to the specific pieces that you want to extend in the future.
You add placeholders for the extra elements/attributes like this:
[XmlAnyElement()]
public XmlElement[] ExtendedElements { get; set; }
[XmlAnyAttribute()]
public XmlAttribute[] ExtendedAttributes { get; set; }
By adding the above to the classes involved, you can read information that was saved with extra elements/attributes, modify the normal properties that the software knows how to handle, and save it again. This allows for both backward and forward compatibility. When adding a new field, just add the desired property.
Note that this only allows extension at the specified hooks.
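To make the round trip concrete, here is a minimal sketch. It assumes a hypothetical DocumentV1 type that only knows a Title property; the type, the RoundTripDemo helper and the sample XML are illustrative and not taken from the question.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical "old" version of a document type: it only knows Title,
// but keeps whatever else it finds in the file.
[XmlRoot("Document")]
public class DocumentV1
{
    public string Title { get; set; }

    // Catch-all hooks for elements/attributes this version does not know.
    [XmlAnyElement]
    public XmlElement[] ExtendedElements { get; set; }

    [XmlAnyAttribute]
    public XmlAttribute[] ExtendedAttributes { get; set; }
}

public static class RoundTripDemo
{
    // Reads XML with the old type, changes a known property, writes it back.
    public static string Retitle(string xml, string newTitle)
    {
        var serializer = new XmlSerializer(typeof(DocumentV1));
        DocumentV1 doc;
        using (var reader = new StringReader(xml))
        {
            doc = (DocumentV1)serializer.Deserialize(reader);
        }

        doc.Title = newTitle;

        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, doc);
            return writer.ToString();
        }
    }
}
```

If the input was written by a newer version and contains an extra Author element and a revision attribute, both should still be present in the string returned by Retitle, even though DocumentV1 has no matching members.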
Update: As Jon mentioned in the comments, the above only works for XML serialization. As far as I know, binary serialization doesn't support anything similar. With binary serialization (.NET 2.0+) you can get old and new versions of the app to read each other's serialized data, but if you save it back you will lose the extra data that the running version doesn't handle.
Starting with .NET 2.0, deserialization ignores extra data; if you combine that with optional fields, you can effectively get both apps to read the other version's format. The problem is that the extra data isn't held by the class the way it is with the XML hooks.
Some related links: http://msdn.microsoft.com/en-us/library/system.runtime.serialization.serializationbinder.aspx, http://msdn.microsoft.com/en-us/library/ms229752.aspx
If you don't want XML serialization, I would go with Jon's approach.
P.S. I am not aware of a good, available third-party implementation that extends binary serialization to hold and save the extra data.

Built-in serialization should give you some minimal tolerance for version updates via the [OptionalField] attribute. But things can get tricky really fast, so you are better off using a framework that has already solved these issues, such as Protocol Buffers as Jon suggests.
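As a rough sketch of what [OptionalField] looks like in practice, assuming a hypothetical Document type whose author field was added in its second version (the names and the "unknown" default are illustrative). Note that BinaryFormatter is obsolete in recent .NET versions, so this only matters for code that still uses it.

```csharp
using System;
using System.Runtime.Serialization;

// Hypothetical version 2 of a document type; author did not exist in version 1.
[Serializable]
public class Document
{
    private string title;

    // Streams written by version 1 lack this field; [OptionalField] tells the
    // formatter not to throw when it is missing.
    [OptionalField(VersionAdded = 2)]
    private string author;

    public string Title { get { return title; } set { title = value; } }
    public string Author { get { return author; } set { author = value; } }

    // Runs before the fields are populated, so a value present in the stream
    // overwrites this default and a missing one leaves it in place.
    [OnDeserializing]
    private void SetDefaults(StreamingContext context)
    {
        author = "unknown";
    }
}
```

The attribute applies to fields only, which is why the sketch uses explicit backing fields rather than auto-properties.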
Another couple of options: use an embedded database such as SQLite for your document store, and map the properties/fields of your object to columns in a table, either manually or with an ORM.
Or use Lucene, which will also give you full-text search through your documents.

Related

Get version when deserializing binary object

I'm serializing a class using a BinaryFormatter. When I open the created file in a text editor, I can see that some attributes such as namespace, version, cultureInfo, ... are written at the beginning. How can I read this version string when deserializing the file again?
Thank you in advance!

You should take a look at these articles on MSDN:
Run-time Serialization, Part 1
Run-time Serialization, Part 2
Run-time Serialization, Part 3
The BinaryFormatter has two properties: Binder and SurrogateSelector.
With these you can hook into the serialization/deserialization process and access this information. More about this can be found in the articles above.
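A minimal sketch of the Binder route, assuming a hypothetical VersionRecordingBinder class (the name and the SeenVersions dictionary are illustrative). It records the assembly version stored in the stream for each type the formatter asks about.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.Serialization;

// Records the assembly version stored in the stream for each type it is asked to bind.
public class VersionRecordingBinder : SerializationBinder
{
    public Dictionary<string, Version> SeenVersions { get; } = new Dictionary<string, Version>();

    public override Type BindToType(string assemblyName, string typeName)
    {
        // assemblyName is the full display name written by the serializer,
        // e.g. "MyApp, Version=1.2.0.0, Culture=neutral, PublicKeyToken=null".
        SeenVersions[typeName] = new AssemblyName(assemblyName).Version;

        // As far as I know, returning null lets the formatter fall back to
        // its default type lookup.
        return null;
    }
}
```

You would assign an instance to the formatter's Binder property before calling Deserialize and inspect SeenVersions afterwards.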

You could probably read that part like a normal file (read and check the bytes).
However, why are you interested in that part? If you need a version, it's best to add your own version field, serialized in the normal way like the rest of the data, and to retrieve it the normal way (by deserialization, like all other data).
In reply to your comment:
If this is the first time the format changes, you could write an 'updater' that reads the old file and transforms it into the new format (for example, by changing the enum values). In the new serialized object, add a version (always, and update it for each version you publish). That way you can always tell the formats apart, and with such an update function you can always convert older versions of the data to newer ones. In this case, since a file without a version can only be in the old format, you can assume it is the old version.
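The idea of storing your own version and upgrading on load can be sketched as follows. This assumes a hypothetical settings payload written with BinaryWriter rather than BinaryFormatter; the Settings type, the field layout and the default of 30 are illustrative.

```csharp
using System;
using System.IO;

public class Settings
{
    public string Name { get; set; }
    public int TimeoutMinutes { get; set; }
}

public static class VersionedFile
{
    public const int CurrentVersion = 2;

    public static byte[] Save(Settings settings)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(CurrentVersion);          // the version always comes first
            writer.Write(settings.Name);
            writer.Write(settings.TimeoutMinutes); // added in version 2
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static Settings Load(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            int version = reader.ReadInt32();
            if (version > CurrentVersion)
                throw new NotSupportedException("File was written by a newer version: " + version);

            var settings = new Settings { Name = reader.ReadString() };
            // Upgrade step: version 1 files have no timeout, so supply a default.
            settings.TimeoutMinutes = version >= 2 ? reader.ReadInt32() : 30;
            return settings;
        }
    }
}
```

Files that predate the version field need one extra rule, as described above: treat a file with no recognizable version as the oldest format.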

Xml Reading with backward Compatibility

The XML that I am currently working with is produced directly by the XML serializer (serializing a class and its nested counterparts).
The addition of a new property is handled directly by the serializer, but problems arise when a property (value type) is deleted, or when an entire class is removed or added.
I want to read the old as well as the new XML files, and I can't figure out how.
Some approaches I have considered, though I don't think they make for maintainable code:
1) Write a custom XML parser (this is less flexible, as every time a change is made the parser has to be updated and tested again).
2) Use multiple models, then migrate from old to new (taking the essential components).
3) Export the old file and import the new file (this will also require another XML file and may be related to point 2).
4) Any other means (please suggest).
I am not well versed in XML and its versioning.
Also, is XML a good choice for this, or is there another file type/DB that I can use in place of XML?
Any help in this regard would be appreciated.

In most ways, XmlSerializer already has pretty good version support built in. In most cases, adding or removing elements isn't a problem: extra (unexpected) data is silently ignored, or put into the [XmlAnyElement] / [XmlAnyAttribute] member (if there is one) for round-tripping. Any missing data just won't be initialized. The only noticeable problem is with sub-types, but adding and removing sub-types (or entire types) is going to be fairly fundamental to any serializer. One common option in the case of sub-types is to use a single model but never remove any sub-types (adding sub-types is fine, assuming you don't need to be forwards compatible). If this is not possible, multiple models (one model per revision) is not a bad approach.

I usually follow your solution #2, where I version my models by namespace (Myapp.Models.V1.MyModel). This way you can maintain backward compatibility with clients still using the older schema (or, in your case, load an older file).
As suggested in the comments, you can use a simple attribute on the root node to identify the version, and use either XmlReader or even a simple regex on the first line of the file to read the version number.
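Reading the version from the root node with XmlReader might look like the sketch below. The attribute name "version" and the XmlVersionSniffer helper are assumptions for illustration; use whatever attribute your files actually carry.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlVersionSniffer
{
    // Returns the value of the root element's "version" attribute, or
    // defaultVersion when the attribute is absent (e.g. files written
    // before versioning was introduced).
    public static int ReadVersion(TextReader input, int defaultVersion)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // skip the XML declaration, comments and whitespace
            string raw = reader.GetAttribute("version");
            return raw == null ? defaultVersion : XmlConvert.ToInt32(raw);
        }
    }
}
```

The returned number can then select which namespaced model (V1, V2, ...) to hand to XmlSerializer, without parsing the whole file twice.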
As for your second question, about file type/DB: depending on your needs, I would highly recommend looking at a document database like MongoDB or RavenDB, as the implementation is straightforward and does not require an ORM tool like Entity Framework to handle proper separation of concerns. If you need something portable, such as a desktop app "save file", SQLite is a good file-based database, but you will likely want to use an ORM to map your model to your database.
Links:
MongoDB: http://www.mongodb.org/
RavenDB: http://ravendb.net/
SQLite: http://www.sqlite.org/

Good coding practice when saving data to files in .net

To give a bit of a background.
I have created an application that allows users to save settings and then recall the settings at a later date. To do this I have created some serializable objects. I have gotten this to work using the BinaryFormatter without much trouble.
Where I start to run into problems is when I upgrade the software and add new settings. Now my serializable objects do not match, and so I have to update the files. I have done this successfully for a few versions. But to do this I try deserializing the file, and if it throws an exception, I try with the next version... and then the next... and then the next... until I find the right one. Then I have to write conversion functions for each of the old versions to convert it into the newest version. I did create a "revision" file as well, so I can check up front what version they have and then upgrade it, but I still have to keep a lot of different "versions" alive and write conversion functions for all of them... which seems inherently messy to me and prone to bloat later on if I keep going this route.
There has to be a better way to do this, I just am not sure how.
Thanks

You need to write a serialization binder to resolve assemblies.
For settings, I use a Dictionary<string, byte[]> that I save to file. I serialize the dictionary and all is well. When I add new settings, I supply a default value if the setting is not found in the settings file.
Also, if you are adding fields to a serialized object, you can decorate them with [OptionalField].
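The dictionary-with-defaults idea can be sketched like this. For brevity the sketch stores string values instead of byte[], and the SettingsBag name is hypothetical; how the dictionary gets to and from disk is left out.

```csharp
using System.Collections.Generic;

// Wraps the dictionary loaded from the settings file. Settings added in a
// later release are simply absent from older files, so every read names
// the default to fall back on.
public class SettingsBag
{
    private readonly Dictionary<string, string> values;

    public SettingsBag(Dictionary<string, string> loaded)
    {
        values = loaded ?? new Dictionary<string, string>();
    }

    public string Get(string key, string defaultValue)
    {
        string value;
        return values.TryGetValue(key, out value) ? value : defaultValue;
    }

    public void Set(string key, string value)
    {
        values[key] = value;
    }
}
```

Because the serialized shape is always a plain dictionary, adding a setting never changes the type being deserialized.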

This is exactly what the Settings class is for. You define default values in your app.config, and then a user can change them and when you save, their changes will save to a location in their user profile. When you read them you'll just get the modified settings.
This link is for VS 2005, but it works exactly the same in VS 2012: http://msdn.microsoft.com/en-us/library/aa730869(v=vs.80).aspx
Found the link for VS2012: http://msdn.microsoft.com/EN-US/library/k4s6c3a0(v=VS.110,d=hv.2).aspx

XML is the format for such cases. A newer version can still find the old settings it needs in a very early version of the settings file, and even an old version can handle XML settings created by a newer version. It does not work "automatically", i.e. with methods like Serialize/Deserialize, but writing conversion functions is not easier or faster.

Actually, this can be done by adding a [DefaultValue()] attribute to the newer properties on your settings objects, at least for XML serialization. I haven't attempted this with binary serialization. For XML, this means that they are "optional", and deserialization will not break when loading old versions of the files. Note that the attribute does not itself assign the value: a property missing from the file keeps whatever value the object was constructed with, so initialize it in code as well. You can find this attribute in the System.ComponentModel namespace:
using System.ComponentModel;
public class MySettings
{
    public int MaxNumLogins { get; set; }
    // Documents the default for a property that is missing from older files;
    // XmlSerializer also skips writing a value that equals this default.
    [DefaultValue(0)]
    public int CacheTimeoutMinutes { get; set; }
}
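A short sketch of loading an older file with XmlSerializer. It assumes a non-zero default of 30, chosen here only so the effect is visible; the property initializer is what supplies the value when the element is missing, and the SettingsLoader helper is illustrative.

```csharp
using System.ComponentModel;
using System.IO;
using System.Xml.Serialization;

public class MySettings
{
    public int MaxNumLogins { get; set; }

    // The initializer supplies the value when the element is missing from the file;
    // [DefaultValue] tells XmlSerializer to skip writing it while it still equals 30.
    [DefaultValue(30)]
    public int CacheTimeoutMinutes { get; set; } = 30;
}

public static class SettingsLoader
{
    public static MySettings Load(string xml)
    {
        var serializer = new XmlSerializer(typeof(MySettings));
        using (var reader = new StringReader(xml))
        {
            return (MySettings)serializer.Deserialize(reader);
        }
    }
}
```

Loading an old file that contains only MaxNumLogins leaves CacheTimeoutMinutes at its initial value instead of failing.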

In addition to naming the fields as others have suggested this sort of thing just cries out for version numbers.

If you are doing binary serialization, you could have a look at protobuf-net (http://code.google.com/p/protobuf-net/wiki/GettingStarted), which covers versioning and related concerns. It is also very compact and actively developed, and if you have cross-platform requirements, you can meet them by using a .proto file.
If you would like people to be able to edit the settings outside of your program, you could use the XML serialization methods instead.

Is there a way to emulate Jackson's Mixins in JSON.Net?

I'm currently working on a few utility libraries to aid in the integration between two existing systems. As part of the integration process, I need to be able to convert objects to JSON.
For various reasons, I need to be able to modify the serialized field names (e.g. convert camel case to snake case, and in some instances change the field name altogether).
One half of the system is written (mostly) in Java, and is entirely under my control. My preferred solution for serializing / deserializing JSON is to use Jackson. For a variety of reasons, it is considered a risk for us to modify the existing entity classes in order to apply the required attributes for Jackson to produce the correct JSON. Fortunately, Jackson provides Mixins, which essentially allow me to apply annotations dynamically. This is far, far superior to writing custom serializers and deserializers to do the same job.
The other half of the system is an ASP.Net application, and again I would like to modify as little of the existing code as I can get away with. I am currently using JSON.Net for serialization / deserialization, and it seems to support everything I need, including defining attributes to override property names.
However, one thing I can't seem to work out is whether JSON.Net supports the same concept of Mixins as Jackson does. If I can get away with it, I'd like to avoid modifying the existing .NET entity classes to include new attributes, but I can't find any documentation suggesting that this feature exists within JSON.Net.
So, does anybody know if there is a (documented / undocumented) way to apply Jackson-like mixins using JSON.Net, or will I need to write custom serializers / deserializers?

Not sure if this helps, but there is a sort of external implementation of Jackson's mix-in handling, as part of the ClassMate project. The library does many other things too, so I don't know how easy it would be to extract the part that handles merging of regular annotations and mix-ins.
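On the JSON.Net side, one way to approximate mix-ins without touching the entity classes is a custom contract resolver that renames properties from an external map. The sketch below is not an official mix-in feature; RenamingContractResolver, its Rename method and the Customer type are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;

// Renames properties at serialization time from an external map, so the
// entity classes themselves carry no JSON attributes.
public class RenamingContractResolver : DefaultContractResolver
{
    private readonly Dictionary<Type, Dictionary<string, string>> renames =
        new Dictionary<Type, Dictionary<string, string>>();

    public RenamingContractResolver Rename(Type type, string clrName, string jsonName)
    {
        Dictionary<string, string> map;
        if (!renames.TryGetValue(type, out map))
        {
            map = new Dictionary<string, string>();
            renames[type] = map;
        }
        map[clrName] = jsonName;
        return this;
    }

    protected override JsonProperty CreateProperty(MemberInfo member, MemberSerialization memberSerialization)
    {
        JsonProperty property = base.CreateProperty(member, memberSerialization);
        Dictionary<string, string> map;
        string jsonName;
        if (renames.TryGetValue(member.DeclaringType, out map) &&
            map.TryGetValue(member.Name, out jsonName))
        {
            property.PropertyName = jsonName;
        }
        return property;
    }
}

// Hypothetical existing entity that must not be modified.
public class Customer
{
    public string FirstName { get; set; }
}
```

The resolver is passed in through JsonSerializerSettings.ContractResolver. If blanket snake case is all you need, more recent Json.NET releases also ship a SnakeCaseNamingStrategy that can be set on a DefaultContractResolver.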

What is the best approach to generalize and aggregate XML dumps in C#?

Here is the business part of the issue:
Several different companies send an XML dump of the information to be processed.
The information sent by the companies is similar, but not exactly the same.
Several more companies will soon be enlisted and will start sending information.
Now, the technical part of the problem is I want to write a generic solution in C# to accommodate this information for processing. I would be transforming the XML in my C# class(es) to fit in to my database model.
Is there any pattern or solution for this issue to be handled generically without needing to change my solution in case of addition of many companies later?
What would be the best approach to write my parser/transformer?

This is how I have done something similar in the past.
As long as each company has its own fixed format for its XML dump:
1) Have a specific XSLT for each company.
2) Have a way of indicating which dump came from which company (for example, a different dump folder per company).
3) In your program, based on 2, select 1 and apply it to the dump.
4) All the XSLTs transform the XML to your one standard database schema.
5) Save this to your DB.
Each new company then costs at most one new XSLT.
Where the schemas are very similar, an existing XSLT can be reused and then adjusted for the specific differences.
Drawback of this approach: debugging XSLT can be painful if you do not have the right tools. However, a lot of XML editors (e.g. XML Spy) have excellent XSLT debugging capabilities.
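The per-company XSLT approach described above can be sketched with XslCompiledTransform. The CompanyDumpConverter class, the company ids and the idea of passing stylesheets as strings are assumptions made for illustration; in practice the stylesheets would more likely be loaded from files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public class CompanyDumpConverter
{
    private readonly Dictionary<string, XslCompiledTransform> transforms =
        new Dictionary<string, XslCompiledTransform>();

    // Registers (and compiles) the stylesheet for one company.
    public void Register(string companyId, string xsltText)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader reader = XmlReader.Create(new StringReader(xsltText)))
        {
            transform.Load(reader);
        }
        transforms[companyId] = transform;
    }

    // Picks the stylesheet by company and returns the dump converted
    // to the standard schema.
    public string ToStandardXml(string companyId, string dumpXml)
    {
        XslCompiledTransform transform = transforms[companyId];
        using (XmlReader input = XmlReader.Create(new StringReader(dumpXml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Adding a company then means one more Register call with its stylesheet; the conversion code itself does not change.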

Sounds to me like you are just asking for a design pattern (or set of patterns) that you could use to do this in a generic, future-proof manner, right?
Ideally, these are some of the attributes that you probably want:
Each "transformer" is decoupled from the others.
You can easily add new "transformers" without having to rewrite your main "driver" routine.
You don't need to recompile / redeploy your entire solution every time you modify a transformer, or at least add a new one.
Each "transformer" should ideally implement a common interface that your driver routine knows about; call it IXmlTransformer. The responsibility of this interface is to take in an XML file and return whatever object model/dataset you use to save to the database. For logic that is shared by all transformers, you could either create a base class that they all inherit from or (my preferred choice) have a set of helper methods that any of them can call.
I would start by using a Factory to create each "transformer" from your main driver routine. The factory could use reflection to interrogate all the assemblies it can see, or use something like MEF, which could do a lot of the work for you. Your driver logic should use the factory to create all the transformers and store them.
Then you need some mechanism to match each XML file received to a given transformer; perhaps each XML file has a header that you could use to identify it, or something similar. Again, you want to keep this decoupled from your main logic so that you can easily add new transformers without modifying the driver routine. You could, for example, supply the XML file to each transformer and ask it "can you transform this file?", leaving it up to each transformer to "take responsibility" for a given file.
Every time your driver routine gets a new XML file, it looks up the appropriate transformer and runs the file through it; the result gets sent to the DB processing area. If no transformer can be found, you dump the file in a directory for inspection later.
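A minimal sketch of that design follows. Only the name IXmlTransformer comes from the text above; the CanTransform/Transform signatures, the CustomerRecord type and the CompanyATransformer are assumptions, and the transformers are handed to the driver directly rather than discovered by a reflection or MEF based factory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical unified record that every transformer produces.
public class CustomerRecord
{
    public string Name { get; set; }
}

public interface IXmlTransformer
{
    bool CanTransform(XDocument dump);
    CustomerRecord Transform(XDocument dump);
}

// One transformer per company; this one claims dumps whose root is <companyA>.
public class CompanyATransformer : IXmlTransformer
{
    public bool CanTransform(XDocument dump)
    {
        return dump.Root != null && dump.Root.Name.LocalName == "companyA";
    }

    public CustomerRecord Transform(XDocument dump)
    {
        return new CustomerRecord { Name = (string)dump.Root.Element("custName") };
    }
}

public class Driver
{
    private readonly List<IXmlTransformer> transformers;

    public Driver(IEnumerable<IXmlTransformer> transformers)
    {
        this.transformers = transformers.ToList();
    }

    // Returns null when no transformer claims the file, so the caller can
    // set the file aside for later inspection.
    public CustomerRecord Process(XDocument dump)
    {
        IXmlTransformer match = transformers.FirstOrDefault(t => t.CanTransform(dump));
        return match == null ? null : match.Transform(dump);
    }
}
```

The driver never names a concrete transformer, so supporting a new company means adding a class that implements the interface and including it in the collection passed to the constructor.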
I would recommend reading a book like Agile Principles, Patterns and Practices by Robert Martin (http://www.amazon.co.uk/Agile-Principles-Patterns-Practices-C/dp/0131857258), which gives good examples of appropriate design patterns for situations like yours e.g. Factory and DIP etc.
Hope that helps!

The solution proposed by InSane is likely the most straightforward and definitely the most XML-friendly approach.
If you are looking to write your own code to convert the different data formats, then implement multiple reader entities, each of which reads data from one distinct format and transforms it to a unified format; your main code then works with these entities in a unified way, e.g. by saving to the database.
Search for ETL (Extract-Transform-Load) to get more information: What model/pattern should I use for handling multiple data sources?, http://en.wikipedia.org/wiki/Extract,_transform,_load

Using XSLT, as proposed in the currently most upvoted answer, just moves the problem from C# to XSLT.
You are still changing the pieces that process the XML, and you are still exposed to how well or poorly the code is structured, whether it is C# code or rules in the XSLT.
Regardless of whether you keep it in C# or go with XSLT for those bits, the key is to separate out the transformation of the XML you receive from the various companies into a single format, whether that's an intermediate XML or a set of classes into which you load the data you are processing.
Whatever you do, avoid getting clever and trying to define your own generic transformation layer; if that's what you want, do use XSLT, since that's what it is for. If you go with C#, keep it simple, with a transformation class for each company that implements the simplest possible interface.
If you go the C# way, limit any reuse between the transformations to composition; don't even think of using inheritance for it. This is one of the areas where things get very ugly very quickly if you go that way.

Have you considered BizTalk server?

Just playing the fence here and offering another solution for other readers.
The easiest way to get the data into your models within C# is to use XSLT to convert each company's data into a serialized form of your models. These are the basic steps I would take:
1) Create a complete model of all your data and use XmlSerializer to write out the model.
2) Create an XSLT that takes Company A's data and converts it into valid serialized XML for your model. Use the previously created XML file as a reference.
3) Call Deserialize on the new XML you just created. You will now have a reference to your model object containing all the data from the company.
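The first and last of these steps can be sketched with XmlSerializer as below. CustomerModel and the ModelXml helper are hypothetical; the XSLT step in between is assumed to produce XML of the same shape as the reference output.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical target model that all company data is converted into.
public class CustomerModel
{
    public string Name { get; set; }
    public string City { get; set; }
}

public static class ModelXml
{
    // Writes a sample model out, giving the reference XML shape that
    // each company's XSLT has to produce.
    public static string WriteReference(CustomerModel sample)
    {
        var serializer = new XmlSerializer(typeof(CustomerModel));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, sample);
            return writer.ToString();
        }
    }

    // Turns XML in the reference shape back into the model.
    public static CustomerModel Read(string xml)
    {
        var serializer = new XmlSerializer(typeof(CustomerModel));
        using (var reader = new StringReader(xml))
        {
            return (CustomerModel)serializer.Deserialize(reader);
        }
    }
}
```

Feeding the output of WriteReference straight back into Read is a quick check that the model round-trips before any stylesheet is written.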
