I've been sifting through the posts and forums but could not find a way to achieve this.
I have an array of 10,000,000 Person objects. I'm sending these objects over the network using a Streamed WCF Net.Tcp web service.
The problem is that I want to read, for example, the first 5000 Person objects of the array as the stream arrives and process only those. After that I will advance the stream and read another 5000, and so on...
I haven't been able to find a way to do this because, as far as I can tell, there is no explicit size for objects in C#. That is, I can't just read the first 312 bytes of the stream and say "Yes, this is the first Person object. Now read the next 312 bytes to get the next person."
Ideally I would like to use protobuf-net to serialize my objects, but the .NET BinaryFormatter is fine as well.
I'm also open to sending the data in chunks, such as arrays of 5000, but I want to do so without opening a brand new TCP connection every time. If only there was a way to tell the code that reads the stream: "OK, deserialize everything I just sent you (an array of 5000) and then I will continue writing another 5000 to the stream".
Any ideas?
Thanks.
There may not be an explicit size for most objects in .NET but you can find the size of a serialized object. First send the size (in bytes) of the serialized object, then send the serialized object.
// pseudo-code
byte[] serializedObj = DoSerialization(person); // arrays expose their length
using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true)) // keep the stream open for further chunks
{
    writer.Write(serializedObj.Length); // 4-byte length prefix
    writer.Write(serializedObj);        // then the payload bytes
}
You can also do this in bulk by changing what and how you send your objects: create a List<Person>, add N Person objects to it, serialize the List, and send it as before.
Sending the size before sending the data may not be strictly necessary, but it helps when you are reading the stream, because you know exactly how many bytes to expect.
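On the receiving side, a minimal sketch of a matching read loop could look like the following (this assumes the BinaryWriter-style framing shown above; expectedCount, DeserializePerson and ProcessBatch are placeholders, not part of the original answer):
using (var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true))
{
    var batch = new List<Person>();
    for (int i = 0; i < expectedCount; i++)
    {
        int length = reader.ReadInt32();            // read the 4-byte length prefix
        byte[] payload = reader.ReadBytes(length);  // then exactly that many bytes
        batch.Add(DeserializePerson(payload));      // placeholder for your deserializer

        if (batch.Count == 5000)
        {
            ProcessBatch(batch);                    // process 5000 at a time
            batch.Clear();
        }
    }
    if (batch.Count > 0)
        ProcessBatch(batch);                        // process the remainder
}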
You can do this with protobuf-net simply by using an ObservableCollection<Person> in your receiving system. When the collection grows larger than 5000 objects during deserialization, remove and process the items in an ObservableCollection<T>.CollectionChanged callback, then process any remaining items in an [OnDeserialized] callback.
For instance, consider the following root object:
[ProtoContract]
public class RootObject
{
    public RootObject()
    {
        this.People = new ObservableCollection<Person>();
    }

    [ProtoMember(1)]
    public ObservableCollection<Person> People { get; private set; }

    public event EventHandler<EventArgs<StreamingContext>> OnDeserialized;

    [OnDeserialized]
    internal void OnDeserializedMethod(StreamingContext context)
    {
        var onDeserialized = OnDeserialized;
        if (onDeserialized != null)
            onDeserialized(this, new EventArgs<StreamingContext> { Value = context });
    }
}

public class EventArgs<T> : EventArgs
{
    public T Value { get; set; }
}
Say you have a method you would like to call to process each batch of 5000 Person objects as they are added to the collection, for instance:
const int ProcessIncrement = 5000;

void ProcessItems(ICollection<Person> people, bool force)
{
    if (people == null || people.Count == 0)
        return;
    if (people.Count >= ProcessIncrement || force)
    {
        // Remove and process the items, possibly on a different thread.
        Console.WriteLine(string.Format("Processing {0} people.", people.Count));
        people.Clear();
    }
}
You can pre-allocate your RootObject, add listeners with the necessary logic, and merge the contents of the serialization stream into the root:
// Allocate a new RootObject
var newRoot = new RootObject();

// Add listeners to process chunks of Person objects as they are added
newRoot.People.CollectionChanged += (o, e) =>
{
    // Process each chunk of 5000.
    var collection = (ICollection<Person>)o;
    ProcessItems(collection, false);
};
newRoot.OnDeserialized += (o, e) =>
{
    // Forcibly process any remaining items, no matter how many.
    ProcessItems(((RootObject)o).People, true);
};

// Deserialize from the stream onto the pre-allocated newRoot
Serializer.Merge(stream, newRoot);
As required, ProcessItems will be called every time an object is added to the collection, processing the items in increments of 5000 and then processing the remainder unconditionally.
Now, the only question is, does protobuf-net load the entire stream into memory before deserializing the collection, or does it do streaming deserialization? As it turns out, it does the latter, as shown by this sample fiddle that shows the stream position being gradually incremented as the items in the People collection are added, processed and removed.
Here I added the listeners to RootObject manually before deserialization. If you were to add them in the constructor itself, you could use ProtoBuf.Serializer.Deserialize<RootObject>(Stream stream) instead of Serializer.Merge onto a pre-allocated root object, which might be easier to integrate into your current architecture.
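As a rough sketch of that alternative (the static ChunkProcessor callback below is purely illustrative and not part of the original answer), the wiring could live in the constructor so that Serializer.Deserialize<RootObject> works directly:
[ProtoContract]
public class RootObject
{
    // Assigned by the consumer before deserialization begins (illustrative only).
    public static Action<ICollection<Person>, bool> ChunkProcessor { get; set; }

    public RootObject()
    {
        this.People = new ObservableCollection<Person>();
        this.People.CollectionChanged += (o, e) =>
        {
            if (ChunkProcessor != null)
                ChunkProcessor((ICollection<Person>)o, false);
        };
    }

    [ProtoMember(1)]
    public ObservableCollection<Person> People { get; private set; }

    [OnDeserialized]
    internal void OnDeserializedMethod(StreamingContext context)
    {
        if (ChunkProcessor != null)
            ChunkProcessor(this.People, true);
    }
}

// Usage:
RootObject.ChunkProcessor = ProcessItems;
var root = Serializer.Deserialize<RootObject>(stream);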
Incidentally, this technique should work with XmlSerializer and Json.NET as well.
I have an application I'm writing that runs script plugins to automate what a user used to have to do manually through a serial terminal. So, I am basically implementing the serial terminal's functionality in code. One of the functions of the terminal was to send a command which kicked off the terminal receiving continuously streamed data from a device until the user pressed space bar, which would then stop the streaming of the data. While the data was streaming, the user would then set some values in another application on some other devices and watch the data streamed in the terminal change.
Now, the streamed data can take different shapes, depending on the particular command that's sent. For instance, one response may look like:
---RESPONSE HEADER---
HERE: 1
ARE: 2 SOME:3
VALUES: 4
---RESPONSE HEADER---
HERE: 5
ARE: 6 SOME:7
VALUES: 8
....
another may look like:
here are some values
in cols and rows
....
So, my idea is to have a different parser based on the command I send. So, I have done the following:
public class Terminal
{
    private SerialPort port;
    private IResponseHandler pollingResponseHandler;
    private object locker = new object();
    private List<Response1Clazz> response1;
    private List<Response2Clazz> response2;

    // setter omitted for brevity
    // get a snapshot of the data at any point in time while the response is polling
    public List<Response1Clazz> Response1 { get { lock (locker) return new List<Response1Clazz>(response1); } }

    // setter omitted for brevity
    public List<Response2Clazz> Response2 { get { lock (locker) return new List<Response2Clazz>(response2); } }

    public Terminal()
    {
        port = new SerialPort(){ /*initialize data*/ }; // open port etc etc
    }

    void StartResponse1Polling()
    {
        Response1 = new List<Response1Clazz>();
        Parser<List<Response1Clazz>> parser = new KeyValueParser(Response1); // parser is of type T
        pollingResponseHandler = new PollingResponseHandler(parser);
        // write command to start polling response 1 in a task
    }

    void StartResponse2Polling()
    {
        Response2 = new List<Response2Clazz>();
        Parser<List<Response2Clazz>> parser = new RowColumnParser(Response2); // parser is of type T
        pollingResponseHandler = new PollingResponseHandler(parser); // this accepts a parser of type T
        // write command to start polling response 2
    }

    void OnSerialDataReceived(object sender, Args a)
    {
        lock (locker)
        {
            // do some processing yada yada
            // we pass in the serial data to the handler, which in turn delegates to the parser.
            pollingResponseHandler.Handle(processedSerialData);
        }
    }
}
The caller of the class would then be something like:
public class Plugin : BasePlugin
{
    public override void PluginMain()
    {
        Terminal terminal = new Terminal();
        terminal.StartResponse1Polling();

        // update some other data
        List<Response1Clazz> response = terminal.Response1;
        // process response

        // update more data
        response = terminal.Response1;
        // process response

        // terminal.StopPolling();
    }
}
My question is quite general, but I'm wondering if this is the best way to handle the situation. Right now I am required to pass in an object/List that I want modified, and it's modified via a side effect. For some reason this feels a little ugly because there is really no indication in code that this is what is happening. I am purely doing it because the "Start" method is the location that knows which parser to create and which data to update. Maybe this is Kosher, but I figured it is worth asking if there is another/better way. Or at least a better way to indicate that the "Handle" method produces side effects.
Thanks!
I don't see problems in modifying List<>s that are received as a parameter. It isn't the most beautiful thing in the world but it is quite common. Sadly C# doesn't have a const modifier for parameters (compare this with C/C++, where unless you declare a parameter to be const, it is ok for the method to modify it). You only have to give the parameter a self-explaining name (like outputList), and put a comment on the method (you know, an xml-comment block, like /// <param name="outputList">This list will receive...</param>).
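For example, a hypothetical parser method documented that way (the method and parameter names here are purely illustrative) might look like:
/// <summary>Parses raw serial data into response objects.</summary>
/// <param name="outputList">This list will receive the parsed responses.</param>
public void Parse(string rawData, List<Response1Clazz> outputList)
{
    // ...parse rawData and add each result to outputList...
}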
To give a more complete response, I would need to see the whole code. You have omitted an example of Parser and an example of Handler.
Instead, I see a problem with your lock in { lock (locker) return new List<Response1Clazz>(response1); }. It seems like nonsense, considering that you then do Response1 = new List<Response1Clazz>();, yet Response1 only has a getter.
Assuming the following case:
public Hashtable map = new Hashtable();

public void Cache(String fileName) {
    if (!map.ContainsKey(fileName))
    {
        map.Add(fileName, new Object());
        _Cache(fileName);
    }
}

private void _Cache(String fileName) {
    lock (map[fileName])
    {
        if (File Already Cached)   // pseudo-code
            return;
        else {
            // cache file
        }
    }
}
When having the following consumers:
Task.Run(() => {
    Cache("A");
});

Task.Run(() => {
    Cache("A");
});
Would it be possible in any way that the Cache method would throw a duplicate key exception, meaning that both tasks hit map.Add and try to add the same key?
Edit:
Would using the following data structure solve this concurrency problem?
public class HashMap<Key, Value>
{
    private HashSet<Key> Keys = new HashSet<Key>();
    private List<Value> Values = new List<Value>();

    public int Count => Keys.Count;

    public Boolean Add(Key key, Value value) {
        int oldCount = Keys.Count;
        Keys.Add(key);
        if (oldCount != Keys.Count) {
            Values.Add(value);
            return true;
        }
        return false;
    }
}
Yes, of course it would be possible. Consider the following fragment:
if (!map.ContainsKey(fileName))
{
map.Add(fileName, new Object());
Thread 1 may execute if (!map.ContainsKey(fileName)) and find that the map does not contain the key, so it will proceed to add it, but before it gets the chance to add it, Thread 2 may also execute if (!map.ContainsKey(fileName)), at which point it will also find that the map does not contain the key, so it will also proceed to add it. Of course, that will fail.
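As an aside (not part of the original code), one standard way to make the check-and-add atomic is ConcurrentDictionary<TKey, TValue>.GetOrAdd, which never throws on a duplicate key:
// ConcurrentDictionary lives in System.Collections.Concurrent
private readonly ConcurrentDictionary<string, object> map =
    new ConcurrentDictionary<string, object>();

public void Cache(string fileName)
{
    // If two tasks race on the same key, only one value is stored
    // and both callers observe the same entry; no exception is thrown.
    map.GetOrAdd(fileName, _ => new object());
    _Cache(fileName);
}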
EDIT (after clarifications)
So, the problem seems to be how to keep the main map locked for as little as possible, and how to prevent cached objects from being initialized twice.
This is a complex problem, so I cannot give you a ready-to-run answer that will work (especially since I do not currently even have a C# development environment handy), but generally speaking, I think that you should proceed as follows:
Fully guard your map with lock().
Keep your map locked as little as possible; when an object is not found to be in the map, add an empty object to the map and exit the lock immediately. This will ensure that this map will not become a point of contention for all requests coming in to the web server.
After the check-if-present-and-add-if-not fragment, you are holding an object which is guaranteed to be in the map. However, this object may or may not be initialized at this point. That's fine. We will take care of that next.
Repeat the lock-and-check idiom, this time with the cached object: every single incoming request interested in that specific object will need to lock it, check whether it is initialized, and if not, initialize it. Of course, only the first request will suffer the penalty of initialization. Also, any requests that arrive before the object has been fully initialized will have to wait on their lock until the object is initialized. But that's all very fine, that's exactly what you want.
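Putting those steps together, a minimal sketch of the idea might look like the following (CachedFile, IsInitialized and LoadFrom are illustrative names, not from the original post):
using System.Collections.Generic;

public class CachedFile
{
    public bool IsInitialized;
    public void LoadFrom(string fileName) { /* read and cache the file contents */ }
}

public class FileCache
{
    private readonly object mapLock = new object();
    private readonly Dictionary<string, CachedFile> map = new Dictionary<string, CachedFile>();

    public CachedFile GetOrCache(string fileName)
    {
        CachedFile entry;

        // Step 1: keep the shared map locked as briefly as possible.
        lock (mapLock)
        {
            if (!map.TryGetValue(fileName, out entry))
            {
                entry = new CachedFile();   // empty, not yet initialized
                map.Add(fileName, entry);
            }
        }

        // Step 2: lock the individual entry; only the first caller initializes it.
        // Later callers either wait here briefly or see IsInitialized and skip.
        lock (entry)
        {
            if (!entry.IsInitialized)
            {
                entry.LoadFrom(fileName);   // the expensive work happens exactly once
                entry.IsInitialized = true;
            }
        }

        return entry;
    }
}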
I am using the ProtoWriter/ProtoReader classes to implement something similar to the DataTableSerializer included with the Protobuf-net source. One difference is that after the initial transfer of the table contents all future updates are serialised incrementally.
Currently I'm not disposing the ProtoWriter instance until the program ends (as I want all future updates to be serialised with the same writer). This has the effect of delaying all writing to the output stream until the internal buffer size of 1024 bytes is reached.
Should I be creating a new ProtoWriter for each incremental update? Is there another way to force the writer to write to the stream?
Sample code:
private readonly ProtoWriter _writer;

private void WriteUpdate(IEnumerable<IReactiveColumn> columns, int rowIndex)
{
    // Start the row group
    ProtoWriter.WriteFieldHeader(ProtobufOperationTypes.Update, WireType.StartGroup, _writer);
    var token = ProtoWriter.StartSubItem(rowIndex, _writer);

    var rowId = rowIndex;
    // Send the row id so that it can be matched against the local row id at the other end.
    ProtoWriter.WriteFieldHeader(ProtobufFieldIds.RowId, WireType.Variant, _writer);
    ProtoWriter.WriteInt32(rowId, _writer);

    foreach (var column in columns)
    {
        var fieldId = _columnsToFieldIds[column.ColumnId];
        WriteColumn(column, fieldId, rowId);
    }

    ProtoWriter.EndSubItem(token, _writer);
}
Interesting question. The flush method isn't exposed because internally it is not always the case that it is appropriate to flush, but I guess there's not a huge reason not to expose this and just let it no-op. On the other hand:
it is already a lightweight wrapper around a stream: you could dispose and recreate (see the sketch below)
or you could just keep writing and make full use of the extra buffering
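A rough sketch of the dispose-and-recreate option mentioned above (this assumes the v2-era public ProtoWriter(Stream, TypeModel, SerializationContext) constructor, and that disposing the writer flushes its internal buffer without closing the underlying stream; verify against the version you are using):
private readonly Stream _outputStream;                      // long-lived output stream
private readonly RuntimeTypeModel _model = RuntimeTypeModel.Default;

private void WriteUpdateAndFlush(IEnumerable<IReactiveColumn> columns, int rowIndex)
{
    // Short-lived writer per incremental update; disposing it pushes
    // its buffered bytes out to _outputStream.
    using (var writer = new ProtoWriter(_outputStream, _model, null))
    {
        ProtoWriter.WriteFieldHeader(ProtobufOperationTypes.Update, WireType.StartGroup, writer);
        var token = ProtoWriter.StartSubItem(rowIndex, writer);

        ProtoWriter.WriteFieldHeader(ProtobufFieldIds.RowId, WireType.Variant, writer);
        ProtoWriter.WriteInt32(rowIndex, writer);

        // ...write the columns exactly as in WriteUpdate above...

        ProtoWriter.EndSubItem(token, writer);
    }
}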
I have a Windows Store app. I am trying to save the state when the system does a suspend and shutdown. I am serializing an Estimate object which was generated by Entity Framework 6. It has child entities that are stored in DataServiceCollections in the Estimate entity. When I deserialize it, I get the following error:
An item could not be added to the collection. When items in a DataServiceCollection are tracked by the DataServiceContext, new items cannot be added before items have been loaded into the collection.
The Serialization functions I am using are:
public static string SerializeObject(object obj)
{
    MemoryStream stream = new MemoryStream();
    System.Runtime.Serialization.Json.DataContractJsonSerializer ser = new System.Runtime.Serialization.Json.DataContractJsonSerializer(obj.GetType(), Common.SuspensionManager.KnownTypes);
    ser.WriteObject(stream, obj);
    byte[] data = new byte[stream.Length];
    stream.Seek(0, SeekOrigin.Begin);
    stream.Read(data, 0, data.Length);
    return Convert.ToBase64String(data);
}

public static T DeserializeObject<T>(string input)
{
    System.Runtime.Serialization.Json.DataContractJsonSerializer ser = new System.Runtime.Serialization.Json.DataContractJsonSerializer(typeof(T));
    MemoryStream stream = new MemoryStream(Convert.FromBase64String(input));
    return (T)ser.ReadObject(stream);
}
The calls to these functions are as follows:
Serialize in the save state:
e.PageState["SelectedEstimate"] = StringSerializer.SerializeObject(Est);
Deserialize in the Restore State:
Est = StringSerializer.DeserializeObject<Estimate>((string)e.PageState["SelectedEstimate"]);
Is there a better Serialization method that can be used with EF Entities? Any help would be greatly appreciated.
Jim
This was posted a while back, but I figure it deserves an answer. The answer is to not use Entity Framework and just create your own entities. I used ObservableCollections for the child members of my entity. One thing that EF does for you is track the state. You can do this yourself by adding an enumeration to your WCF interface definition:
[DataContract(Name = "RecordState")]
public enum CustomerRecordStateEnum
{
    [EnumMember]
    Unchanged = 0,
    [EnumMember]
    Added = 1,
    [EnumMember]
    Modified = 2,
    [EnumMember]
    Deleted = 3
}
Then, in the Customer (or whatever) object, add a property for RecordState:
[DataMember]
public Nullable<CustomerRecordStateEnum> RecordState { get; set; }
In your WCF service, where you generate the object, initialize it to Unchanged. In your client, the generated objects will implement INotifyPropertyChanged. You can hook this and set the state to Modified. You only want to do this if nothing else has been done to it: if it has been added or deleted and you change the state to Modified, things get messy:
void Cust_PropertyChanged(object sender, System.ComponentModel.PropertyChangedEventArgs e)
{
    if (e.PropertyName != "RecordState")
    {
        if (Cust.RecordState == RecordState.Unchanged)
            Cust.RecordState = RecordState.Modified;
    }
}
After you create your class Cust use:
Cust.PropertyChanged += Cust_PropertyChanged;
The only thing to remember is that if from your app, you update the object (Cust in this case) be sure to set the RecordState back to RecordState.Unchanged.
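For example (purely illustrative; SaveCustomerAsync is a hypothetical proxy call, not part of the original answer):
// After a successful save, mark the local copy as clean again so the next
// PropertyChanged notification can flip it back to Modified when appropriate.
await service.SaveCustomerAsync(Cust);      // hypothetical WCF proxy call
Cust.RecordState = RecordState.Unchanged;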
The whole point of this is that using this scenario you can create entities that keep track of their record state and also serialize them when you need to save the state for a suspension or otherwise. They will deserialize just fine. That way, on a resume, you are not loading them from the service. Loading them from the service would overwrite any changes you had made locally but hadn't saved yet.
Hope this is useful to someone.
Jim
In our application we have some data structures which amongst other things contain a chunked list of bytes (currently exposed as a List<byte[]>). We chunk bytes up because if we allow the byte arrays to be put on the large object heap then over time we suffer from memory fragmentation.
We've also started using Protobuf-net to serialize these structures, using our own generated serialization DLL.
However we've noticed that Protobuf-net is creating very large in-memory buffers while serializing. Glancing through the source code it appears that perhaps it can't flush its internal buffer until the entire List<byte[]> structure has been written because it needs to write the total length at the front of the buffer afterwards.
This unfortunately undoes our work with chunking the bytes in the first place, and eventually gives us OutOfMemoryExceptions due to memory fragmentation (the exception occurs at the time where Protobuf-net is trying to expand the buffer to over 84k, which obviously puts it on the LOH, and our overall process memory usage is fairly low).
If my analysis of how Protobuf-net is working is correct, is there a way around this issue?
Update
Based on Marc's answer, here is what I've tried:
[ProtoContract]
[ProtoInclude(1, typeof(A), DataFormat = DataFormat.Group)]
public class ABase
{
}

[ProtoContract]
public class A : ABase
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public B B { get; set; }
}

[ProtoContract]
public class B
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<byte[]> Data { get; set; }
}
Then to serialize it:
var a = new A();
var b = new B();
a.B = b;
b.Data = new List<byte[]>
{
    Enumerable.Range(0, 1999).Select(v => (byte)v).ToArray(),
    Enumerable.Range(2000, 3999).Select(v => (byte)v).ToArray(),
};
var stream = new MemoryStream();
Serializer.Serialize(stream, a);
However if I stick a breakpoint in ProtoWriter.WriteBytes() where it calls DemandSpace() towards the bottom of the method and step into DemandSpace(), I can see that the buffer isn't being flushed because writer.flushLock equals 1.
If I create another base class for ABase like this:
[ProtoContract]
[ProtoInclude(1, typeof(ABase), DataFormat = DataFormat.Group)]
public class ABaseBase
{
}

[ProtoContract]
[ProtoInclude(1, typeof(A), DataFormat = DataFormat.Group)]
public class ABase : ABaseBase
{
}
Then writer.flushLock equals 2 in DemandSpace().
I'm guessing there is an obvious step I've missed here to do with derived types?
I'm going to read between some lines here... because List<T> (mapped as repeated in protobuf parlance) doesn't have an overall length-prefix, and byte[] (mapped as bytes) has a trivial length-prefix that shouldn't cause additional buffering. So I'm guessing what you actually have is more like:
[ProtoContract]
public class A {
    [ProtoMember(1)]
    public B Foo {get;set;}
}

[ProtoContract]
public class B {
    [ProtoMember(1)]
    public List<byte[]> Bar {get;set;}
}
Here, the need to buffer for a length-prefix actually arises when writing A.Foo, basically to declare "the following complex data is the value of A.Foo". Fortunately there is a simple fix:
[ProtoMember(1, DataFormat=DataFormat.Group)]
public B Foo {get;set;}
This changes between 2 packing techniques in protobuf:
the default (google's stated preference) is length-prefixed, meaning you get a marker indicating the length of the message to follow, then the sub-message payload
but there is also an option to use a start-marker, the sub-message payload, and an end-marker
When using the second technique it doesn't need to buffer, so: it doesn't. This does mean it will be writing slightly different bytes for the same data, but protobuf-net is very forgiving, and will happily deserialize data from either format here. Meaning: if you make this change, you can still read your existing data, but new data will use the start/end-marker technique.
This raises the question: why do google prefer the length-prefix approach? Probably this is because it is more efficient when reading to skip through fields (either via a raw reader API, or as unwanted/unexpected data) when using the length-prefix approach, as you can just read the length-prefix and then progress the stream [n] bytes; by contrast, to skip data with a start/end-marker you still need to crawl through the payload, skipping the sub-fields individually. Of course, this theoretical difference in read performance doesn't apply if you expect that data and want to read it into your object, which you almost certainly do. Also, in the google protobuf implementation, because it isn't working with a regular POCO model, the size of the payloads are already known, so they don't really see the same issue when writing.
Additionally, regarding your edit: the [ProtoInclude(..., DataFormat=...)] looks like it simply wasn't being processed. I have added a test for this in my current local build, and it now passes:
[Test]
public void Execute()
{
    var a = new A();
    var b = new B();
    a.B = b;
    b.Data = new List<byte[]>
    {
        Enumerable.Range(0, 1999).Select(v => (byte)v).ToArray(),
        Enumerable.Range(2000, 3999).Select(v => (byte)v).ToArray(),
    };

    var stream = new MemoryStream();
    var model = TypeModel.Create();
    model.AutoCompile = false;
#if DEBUG
    // this is only available in debug builds; if set, an exception is
    // thrown if the stream tries to buffer
    model.ForwardsOnly = true;
#endif
    CheckClone(model, a);
    model.CompileInPlace();
    CheckClone(model, a);
    CheckClone(model.Compile(), a);
}

void CheckClone(TypeModel model, A original)
{
    int sum = original.B.Data.Sum(x => x.Sum(b => (int)b));
    var clone = (A)model.DeepClone(original);
    Assert.IsInstanceOfType(typeof(A), clone);
    Assert.IsInstanceOfType(typeof(B), clone.B);
    Assert.AreEqual(sum, clone.B.Data.Sum(x => x.Sum(b => (int)b)));
}
This commit is tied into some other, unrelated refactorings (some rework for WinRT / IKVM), but should be committed ASAP.