Protobuf Exception When Deserializing Large File - c#

I'm using protobuf to serialize large objects to binary files to be deserialized and used again at a later date. However, I'm having issues when I'm deserializing some of the larger files. The files are roughly ~2.3 GB in size and when I try to deserialize them I get several exceptions thrown (in the following order):
1. Sub-message not read correctly
2. Invalid wire-type; this usually means you have over-written a file without truncating or setting the length; see "Using Protobuf-net, I suddenly got an exception about an unknown wire-type"
3. Unexpected end-group in source data; this usually means the source data is corrupt
I've looked at the question referenced in the second exception, but that doesn't seem to cover the problem I'm having.
I'm using Microsoft's HPC pack to generate these files (they take a while) so the serialization looks like this:
using (var consoleStream = Console.OpenStandardOutput())
{
    Serializer.Serialize(consoleStream, dto);
}
And I'm reading the files in as follows:
private static T Deserialize<T>(string file)
{
    using (var fs = File.OpenRead(file))
    {
        return Serializer.Deserialize<T>(fs);
    }
}
The files are of two different types: one is about 1 GB in size, the other about 2.3 GB. The smaller files all work; the larger files do not. Any ideas what could be going wrong here? I realise I've not given a lot of detail, and can give more as requested.

Here I need to refer to a recent discussion on the protobuf list:
Protobuf uses int to represent sizes so the largest size it can possibly support is <2G. We don't have any plan to change int to size_t in the code. Users should avoid using overly large messages.
I'm guessing that the cause of the failure inside protobuf-net is basically the same. I can probably change protobuf-net to support larger files, but I have to advise that this is not recommended, because it looks like no other implementation is going to work well with such huge data.
The fix is probably just a case of changing a lot of int to long in the reader/writer layer. But: what is the layout of your data? If there is an outer object that is basically a list of the actual objects, there is probably a sneaky way of doing this using an incremental reader (basically, spoofing the repeated support directly).
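That "sneaky way" can be sketched with protobuf-net's length-prefix helpers: instead of one giant root object, each item is written as if it were field 1 of an implicit outer message, and read back lazily. This is a minimal sketch, assuming the outer object is essentially a list; `Record` and `IncrementalIo` are illustrative names, not from the original question.

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class Record
{
    [ProtoMember(1)]
    public int Id { get; set; }
}

public static class IncrementalIo
{
    // Write each item with a field-1 length prefix instead of building one
    // giant root object; the resulting file is byte-compatible with a root
    // message whose only member is "repeated Record = 1".
    public static void WriteAll(Stream dest, IEnumerable<Record> items)
    {
        foreach (var item in items)
            Serializer.SerializeWithLengthPrefix(dest, item, PrefixStyle.Base128, 1);
    }

    // Stream the items back one at a time, so the whole multi-GB graph is
    // never materialised in memory at once.
    public static IEnumerable<Record> ReadAll(Stream source)
    {
        return Serializer.DeserializeItems<Record>(source, PrefixStyle.Base128, 1);
    }
}
```

Because `ReadAll` is lazily enumerated, the per-item memory cost stays flat regardless of file size.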

Related

Serializing a very large List of items into Azure blob storage using C#

I have a large list of objects that I need to store and retrieve later. The list will always be used as a unit and list items are not retrieved individually. The list contains about 7000 items totaling about 1GB, but could easily escalate to ten times that or more.
We have been using BinaryFormatter.Serialize() (System.Runtime.Serialization.Formatters.Binary.BinaryFormatter) to do the serialization, and the serialized data was then uploaded as a blob to Azure blob storage. We found it to be generally fast and efficient, but it became inadequate as we tested with larger file sizes, throwing an OutOfMemoryException. From what I understand, although I'm using a stream, my problem is that the BinaryFormatter.Serialize() method must first serialize everything to memory before I can upload the blob, causing my exception.
The binary serializer looks as follows:
public void Upload(object value, string blobName, bool replaceExisting)
{
    CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);
    var formatter = new BinaryFormatter()
    {
        AssemblyFormat = FormatterAssemblyStyle.Simple,
        FilterLevel = TypeFilterLevel.Low,
        TypeFormat = FormatterTypeStyle.TypesAlways
    };
    using (var stream = blockBlob.OpenWrite())
    {
        formatter.Serialize(stream, value);
    }
}
The OutOfMemoryException occurs on the formatter.Serialize(stream, value) line.
I therefore tried using a different protocol, Protocol Buffers. I tried both implementations in the NuGet packages protobuf-net and Google.Protobuf, but the serialization was horribly slow (roughly 30 minutes) and, from what I have read, Protobuf is not optimized for serializing data larger than 1 MB. So, I went back to the drawing board and came across Cap'n Proto, which promised to solve my speed issues by using memory mapping. I am trying to use Marc Gravell's C# bindings, but I am having some difficulty implementing a serializer, as the project does not have thorough documentation yet. Moreover, I'm not 100% sure that Cap'n Proto is the correct choice of protocol - but I am struggling to find any alternative suggestions online.
How can I serialize a very large collection of items to blob storage, without hitting memory issues, and in a reasonably fast way?
Perhaps you should switch to JSON?
Using the JSON Serializer, you can stream to and from files and serialize/deserialize piecemeal (as the file is read).
Would your objects map to JSON well?
This is what I use to read a network stream into a JSON object.
private static async Task<JObject> ProcessJsonResponse(HttpResponseMessage response)
{
    // Open the stream from the network
    using (var s = await ProcessResponseStream(response).ConfigureAwait(false))
    using (var sr = new StreamReader(s))
    using (var reader = new JsonTextReader(sr))
    {
        var serializer = new JsonSerializer { DateParseHandling = DateParseHandling.None };
        return serializer.Deserialize<JObject>(reader);
    }
}
Additionally, you could GZip the stream to reduce the file transfer times. We stream directly to GZipped JSON and back again.
Edit: although this example is a deserialize, the same approach should work for a serialize.
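A sketch of the serialize direction, using Json.NET's JsonTextWriter so each item is written straight to the stream rather than building one huge string first (`JsonStreaming` and the generic signature are illustrative, not from the original answer):

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class JsonStreaming
{
    // Writes items one at a time directly to the stream, so only the current
    // item is held in memory instead of the entire serialized payload.
    public static void SerializeToStream<T>(Stream destination, IEnumerable<T> items)
    {
        var serializer = new JsonSerializer();
        using (var sw = new StreamWriter(destination))
        using (var writer = new JsonTextWriter(sw))
        {
            writer.WriteStartArray();
            foreach (var item in items)
                serializer.Serialize(writer, item);
            writer.WriteEndArray();
        }
    }
}
```

Pointing `destination` at a blob's write stream (e.g. `blockBlob.OpenWrite()`) would stream the JSON straight to storage.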
JSON serialization can work, as the previous poster mentioned, although on a large enough list this was also causing OutOfMemoryException exceptions to be thrown, because the string was simply too big to fit in memory. You might be able to get around this by serializing in pieces if your object is a list, but if you're okay with binary serialization, a much faster/lower-memory way is to use Protobuf serialization.
Protobuf has faster serialization than JSON and requires a smaller memory footprint, but at the cost of it being not human readable. Protobuf-net is a great C# implementation of it. Here is a way to set it up with annotations and here is a way to set it up at runtime. In some instances, you can even GZip the Protobuf serialized bytes and save even more space.
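The runtime setup amounts to registering your types and field numbers with protobuf-net's type model, so no attributes are needed on the classes. A minimal sketch, assuming a plain `Item` class (the type and member names are illustrative):

```csharp
using ProtoBuf.Meta;

// A plain class with no protobuf annotations on it.
public class Item
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class ProtoSetup
{
    // Registers Item with protobuf-net at runtime; the field numbers play
    // the same role as [ProtoMember(n)] attributes would.
    public static void Configure()
    {
        var meta = RuntimeTypeModel.Default.Add(typeof(Item), applyDefaultBehaviour: false);
        meta.Add(1, "Id");
        meta.Add(2, "Name");
    }
}
```

After `Configure()` has run once, `Serializer.Serialize` and `Serializer.Deserialize<Item>` work as usual.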

Serializing a Dictionary to disk?

We have a Hashtable (specifically the C# Dictionary class) that holds several thousands/millions of (Key,Value) pairs for near O(1) search hits/misses.
We'd like to be able to flush this data structure to disk (serialize it) and load it again later (deserialize) such that the internal hashtable of the Dictionary is preserved.
What we do right now:
Load from Disk => List<KVEntity>. (KVEntity is serializable. We use Avro to serialize - can drop Avro if needed)
Read every KVEntity from array => dictionary. This regenerates the dictionary/hashtable internal state.
< System operates, Dictionary can grow/shrink/values change etc >
When saving, read from the dictionary into array (via myKVDict.Values.SelectMany(x => x) into a new List<KVEntity>)
We serialize the array (List<KVEntity>) to disk to save the raw data
Notice that during our save/restore we lose the internal hashtable/dictionary state and have to rebuild it each time.
We'd like to directly serialize to/from the Dictionary (including its internal "live" state) instead of using an intermediate array just for the disk I/O. How can we do that?
Some pseudo code:
// The actual "node" that has information. Both myKey and myValue have actual data worth storing.
public class KVEntity
{
    public string myKey { get; set; }
    public DataClass myValue { get; set; }
}

// Unit of disk IO/serialization
public List<KVEntity> myKVList { get; set; }

// Unit of run-time processing. The string key is KVEntity.myKey
public Dictionary<string, KVEntity> myKVDict { get; set; }
Storing the internal state of the Dictionary instance would be bad practice - a key tenet of OOP is encapsulation: that internal implementation details are deliberately hidden from the consumer.
Furthermore, the mapping algorithm used by Dictionary might change across different versions of the .NET Framework, especially given that CIL assemblies are designed to be forward-compatible (i.e. a program written against .NET 2.0 will generally work against .NET 4.5).
Finally, there are no real performance gains from serialising the internal state of the dictionary. It is much better to use a well-defined file format with a focus on maintainability than speed. Besides, if the dictionary contains "several thousands" of entries then it should load from disk in under 15ms by my reckoning (assuming you have an efficient on-disk format). Also, a data structure optimised for RAM will not necessarily work well on disk, where sequential reads/writes are better.
Your post is very adamant about working with the internal state of the dictionary, but your existing approach seems fine (albeit it could do with some optimisations). If you reveal more details, we can help you make it faster.
Optimisations
The main issues I see with your existing implementation is the conversion to/from Arrays and Lists, which is unnecessary given that Dictionary is directly enumerable.
I would do something like this:
Dictionary<String, TFoo> dict = ... // where TFoo : new() and implements arbitrary Serialize(BinaryWriter) and Deserialize(BinaryReader) methods

using (FileStream fs = File.OpenWrite("filename.dat"))
using (BinaryWriter wtr = new BinaryWriter(fs, Encoding.UTF8))
{
    wtr.Write(dict.Count);
    foreach (String key in dict.Keys)
    {
        wtr.Write(key);
        wtr.Write('\0');
        dict[key].Serialize(wtr);
        wtr.Write('\0'); // assuming NULL characters can work as record delimiters, for safety
    }
}
Assuming that your TFoo's Serialize method is fast, I really don't think you'll get any faster speeds than this approach.
Implementing a de-serializer is an exercise for the reader, but should be trivial. Note how I stored the size of the dictionary to the file, so the returned dictionary can be set with the correct size when it's created, thus avoiding the re-balancing problem that #spender describes in his comment.
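A sketch of that deserializer, mirroring the writer above; `TFoo` here is a stand-in type carrying an illustrative int payload, and `DictLoader` is a hypothetical name:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public class TFoo
{
    public int Value { get; set; }
    public void Deserialize(BinaryReader rdr) { Value = rdr.ReadInt32(); }
}

public static class DictLoader
{
    // Mirrors the writer: count first, then per entry: key, '\0', payload, '\0'.
    public static Dictionary<string, TFoo> Load(Stream source)
    {
        using (var rdr = new BinaryReader(source, Encoding.UTF8))
        {
            int count = rdr.ReadInt32();
            // Pre-sizing from the stored count avoids re-balancing on load.
            var dict = new Dictionary<string, TFoo>(count);
            for (int i = 0; i < count; i++)
            {
                string key = rdr.ReadString();
                rdr.ReadChar();            // consume the '\0' after the key
                var foo = new TFoo();
                foo.Deserialize(rdr);
                rdr.ReadChar();            // consume the trailing '\0' delimiter
                dict[key] = foo;
            }
            return dict;
        }
    }
}
```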
So we're going to stick with our existing strategy, given Dai's reasoning and that we have C# and Java compatibility to maintain (which means the extra tree-state bits of the C# Dictionary would be dropped on the Java side anyway, which loads only the node data as it does right now).
For later readers still interested in this, I found a very good response here that somewhat answers the question posed. A critical difference is that this answer is for B+ trees, not Dictionaries, although in practical applications the two data structures are very similar in performance: B+ tree performance is closer to that of Dictionaries than to regular trees (binary, red-black, AVL, etc.). Specifically, Dictionaries deliver near-O(1) performance (but no "select from a range" abilities), while B+ trees have O(log_b(X)) where the base b is usually large, which makes them very performant compared to regular trees where b = 2. I'm copy-pasting it here for completeness, but all credit goes to csharptest.net for the B+ tree code, tests, benchmarks and writeup(s).
For completeness I'm going to add my own implementation here.
Introduction - http://csharptest.net/?page_id=563
Benchmarks - http://csharptest.net/?p=586
Online Help - http://help.csharptest.net/
Source Code - http://code.google.com/p/csharptest-net/
Downloads - http://code.google.com/p/csharptest-net/downloads
NuGet Package - http://nuget.org/List/Packages/CSharpTest.Net.BPlusTree

.net Serialization: How to use raw binary writer while maintaining which thing is which

I'm making a roguelike game in XNA with procedurally generated levels.
It takes about a second to generate a whole new level, but about 4 seconds to serialize one and about 8 seconds to deserialize one with my current methods. The files are also massive (about 10 MB, depending on how big the level is).
I serialize like this.
private void SerializeLevel()
{
    string name = Globals.CurrentLevel.LvLSaveString;
    using (Stream stream = new FileStream("SAVES\\" + name + ".lvl", FileMode.Create, FileAccess.Write, FileShare.None))
    {
        formatter.Serialize(stream, Globals.CurrentLevel);
    }
}
My game engine architecture is basically a load of nested Lists which might go..
Level\Room\Interior\Interiorthing\sprite
This hierarchy is important to maintain for the game/performance. For instance usually only things in the current room are considered for updates and draws.
I want to try something like the Raw Binary formatter shown in this post to improve serialization/deserialization performance
I can just save the ints and floats and bools which correspond to all the positions of/configurations of things and reinstantiate everything when I load a level (which only takes a second)
My question is how do I use this Raw Binary serializer while also maintaining which object is which, what type it is and which nested list it is in.
In the example cited, the OP is just serializing a huge list of ints, and every third one is taken as the start of a new coordinate.
I could have a new stream for each different type of thing in each room, but that would result in loads of different files (I think). Is there a way to segregate the raw binary stream with some kind of hierarchy? I.e. split it up into different sections pertaining to different rooms and different lists of things?
UPDATE
Ok, one thing that was throwing me off was that in the question I referenced, the OP refers to "manual serialization" as "raw binary serialization", which I couldn't find any info on.
If you want to serialize each member of Globals independently, and upon deserialization to manually update the member value, you need to know which member you are currently processing upon deserialization. I can suggest you these:
Process items in the same order. The code in your example will put binary data in the stream that is nearly impossible to extract unless you deserialize members in the order they were serialized. This becomes maintenance hell when new items are added, and it is not a good solution regarding code clarity, maintainability or backwards compatibility.
Use dictionary. As per comments, Globals appears to be a static class, therefore it itself is not serializable. When serializing, put all members of the Globals class in a dictionary, and serialize it. Upon deserialization, you will know that you have a dictionary (not a random mess of objects). Then from the deserialized dictionary restore the Globals object
Use custom class. Create a class with all settings (a better approach). Use a single static instance of the class to access settings. You can serialize and deserialize that class
Settings. The second approach gets closer to an already built-in concept in .NET - Settings. Take a look at it, it seems that the Globals class is in fact a custom variant of a settings configuration
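Coming back to the nested-list hierarchy in the question: a single raw binary stream can preserve the structure if each level writes its count before its elements, so the reader can rebuild the same nesting without separate files. A minimal sketch, with `Room`/`Thing` as illustrative stand-ins for the real level types:

```csharp
using System.Collections.Generic;
using System.IO;

public class Thing { public int X; public int Y; }
public class Room { public List<Thing> Things = new List<Thing>(); }

public static class LevelIo
{
    // One file, but the nesting survives: each list writes its length before
    // its elements, which acts as the "section header" for that level.
    public static void Save(Stream dest, List<Room> rooms)
    {
        using (var w = new BinaryWriter(dest))
        {
            w.Write(rooms.Count);
            foreach (var room in rooms)
            {
                w.Write(room.Things.Count);
                foreach (var t in room.Things) { w.Write(t.X); w.Write(t.Y); }
            }
        }
    }

    public static List<Room> Load(Stream source)
    {
        using (var r = new BinaryReader(source))
        {
            int roomCount = r.ReadInt32();
            var rooms = new List<Room>(roomCount);
            for (int i = 0; i < roomCount; i++)
            {
                var room = new Room();
                int thingCount = r.ReadInt32();
                for (int j = 0; j < thingCount; j++)
                    room.Things.Add(new Thing { X = r.ReadInt32(), Y = r.ReadInt32() });
                rooms.Add(room);
            }
            return rooms;
        }
    }
}
```

The same count-then-elements pattern extends to deeper levels (Interior, Interiorthing, etc.) by adding a loop per level.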

Google Protocol Buffers Serialization hangs writing 1GB+ data

I am serializing a large data set using protocol buffer serialization. When my data set contains 400000 custom objects of combined size around 1 GB, serialization returns in 3~4 seconds. But when my data set contains 450000 objects of combined size around 1.2 GB, serialization call never returns and CPU is constantly consumed.
I am using .NET port of Protocol Buffers.
Looking at the new comments, this appears to be (as the OP notes) MemoryStream capacity limited. A slight annoyance in the protobuf spec is that since sub-message lengths are variable and must prefix the sub-message, it is often necessary to buffer portions until the length is known. This is fine for most reasonable graphs, but if there is an exceptionally large graph (except for the "root object has millions of direct children" scenario, which doesn't suffer) it can end up doing quite a bit in-memory.
If you aren't tied to a particular layout (perhaps due to .proto interop with an existing client), then a simple fix is as follows: on child (sub-object) properties (including lists / arrays of sub-objects), tell it to use "group" serialization. This is not the default layout, but it says "instead of using a length-prefix, use a start/end pair of tokens". The downside of this is that if your deserialization code doesn't know about a particular object, it takes longer to skip the field, as it can't just say "seek forwards 231413 bytes" - it instead has to walk the tokens to know when the object is finished. In most cases this isn't an issue at all, since your deserialization code fully expects that data.
To do this:
[ProtoMember(1, DataFormat = DataFormat.Group)]
public SomeType SomeChild { get; set; }
....
[ProtoMember(4, DataFormat = DataFormat.Group)]
public List<SomeOtherType> SomeChildren { get { return someChildren; } }
The deserialization in protobuf-net is very forgiving (by default there is an optional strict mode), and it will happily deserialize groups in place of length-prefix, and length-prefix in place of groups (meaning: any data you have already stored somewhere should work fine).
1.2 GB of memory is dangerously close to the managed memory limit for 32-bit .NET processes. My guess is the serialization triggers an OutOfMemoryException and all hell breaks loose.
You should try to use several smaller serializations rather than a gigantic one, or move to a 64bit process.
Cheers,
Florian

Persist List<int> through App Shutdowns

Short Version
I have a list of ints that I need to persist through application shutdown. Not forever but (you get the idea) I can't have the list disappear before it is dealt with. The method for dealing with it will remove entries from the list.
What are my options? XML?
Background
We have a WinForms app that uses local SQL Express DBs that participate in merge replication with a central server. This will be difficult to explain, but we also have (kind of) an iSeries 400 server that a small portion of data gets written to as well. For various reasons the iSeries is not available through replication, and as such all "writes" to it need to be done while it is available.
My first thought to solve this was to simply have a List object that stored the PKs that needed to be updated. Then, after a successful sync, I would have a method that checks that list and calls UpdateISeries() once for each PK in the list. I am pretty sure this would work, except in a case where the app shuts down inappropriately, loses power, etc. So, does anyone have better ideas on how to solve this? An XML file maybe, though I have never done that. I worry about actually creating a table in SQL Express because of replication... maybe unfounded, but...
For reference, UpdateISeries(int PersonID) is an existing method in a DLL that is used internally. Rewriting it, as a potential solution to this issue, really isn't viable at this time.
Sounds like you need to serialize and deserialize some objects.
See these .NET topics to find out more.
From the linked page:
Serialization is the process of converting the state of an object into a form that can be persisted or transported. The complement of serialization is deserialization, which converts a stream into an object. Together, these processes allow data to be easily stored and transferred.
If it is not important for the on-disk format to be human readable, and you want it to be as small as possible, look at binary serialization.
Using the serialization mechanism is probably the way to go. Here is an example using the BinaryFormatter.
public void Serialize(List<int> list, string filePath)
{
    using (Stream stream = File.OpenWrite(filePath))
    {
        var formatter = new BinaryFormatter();
        formatter.Serialize(stream, list);
    }
}

public List<int> Deserialize(string filePath)
{
    using (Stream stream = File.OpenRead(filePath))
    {
        var formatter = new BinaryFormatter();
        return (List<int>)formatter.Deserialize(stream);
    }
}
If you already interact with a SQL database, use it: you get simpler code with fewer dependencies. Replication can be configured to ignore additional tables (even if you have to place them in another schema). This way, you can avoid a number of potential data-corruption problems.
