Is there a way to deserialize a bytes field to a Stream member, without protobuf-net allocating a new (and potentially big) byte[] upfront?
I'm looking for something like this:
[ProtoContract]
public class Message
{
    [ProtoMember(1)]
    Stream Payload { get; set; }
}
Where the Stream could be backed by a pre-allocated buffer pool e.g. Microsoft.IO.RecyclableMemoryStream. Even after dropping down to ProtoReader for deserialization all I see is AppendBytes, which always allocates a buffer of field length. One has to drop even further to DirectReadBytes, which only operates directly on the message stream -- I'd like to avoid that.
As background, I'm using protobuf-net to serialize/deserialize messages across the wire. This is a middle-layer component for passing messages between clients, so the messages are really an envelope for an enclosed binary payload:
message Envelope {
    required string messageId = 1;
    map<string, string> headers = 2;
    bytes payload = 3;
}
The size of the payload is restricted to ~2 MB, but that is still large enough for the byte[] to land on the LOH.
Using a surrogate as in Protobuf-net: Serializing a 3rd party class with a Stream data member doesn't work because it simply wraps the same monolithic array.
One technique that should work is mentioned in Memory usage serializing chunked byte arrays with Protobuf-net, changing bytes to repeated bytes and relying on the sender to limit each chunk. This solution may be good enough, it'll prevent LOH allocation, but it won't allow buffer pooling.
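For reference, here is roughly what that chunked contract might look like on the protobuf-net side (a sketch only; the names are illustrative, and the sender would be responsible for keeping each chunk well under the ~85 KB LOH threshold):

[ProtoContract]
public class ChunkedEnvelope
{
    [ProtoMember(1)]
    public string MessageId { get; set; }

    [ProtoMember(2)]
    public Dictionary<string, string> Headers { get; set; }

    // Maps to "repeated bytes payload = 3;": each chunk is an independent
    // byte[], so no single allocation has to reach the large object heap.
    [ProtoMember(3)]
    public List<byte[]> PayloadChunks { get; set; } = new List<byte[]>();
}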
The question here is about the payload field. No, there is not currently a mechanism to handle that, but it is certainly something that could be investigated for options. It could be that we add something like an ArraySegment<byte> AllocateBuffer(int size) callback on the serialization-context that the caller could use to take control of the allocations. The nice thing about this is that protobuf-net doesn't currently work with ArraySegment<byte>, so it would be a purely incremental change that wouldn't impact any existing working code; if no callback is provided, we would presumably just allocate a flat buffer as it does today. I'm open to other suggestions, but right now: no - it will allocate a byte[] internally.
Related
I'm creating a client/server application using C# and I would like to know the best approach to deal with data packets in C# code.
From what I read I could A) use the Marshal class to construct a safe class from a data packet raw buffer into or B) use raw pointers in unsafe context to directly cast the data packet buffer into a struct and use it on my code.
I can see big problems in both approaches. For instance, using Marshal seems very cumbersome and hard to maintain (compared to the pointer solution), and on the other hand the use of pointers in C# is very limited (e.g. I can't even declare a fixed-size array of a non-primitive type, which is a deal breaker in such an application).
That said I would like to know which of these two approaches is better to deal with network data packets and if there is any other better approach to this. I'm open to any other possible solution.
PS.: I'm actually creating a custom client for an already existing client/server application, so the communication protocol between client and server is already defined and cannot be modified. My C# client therefore needs to adhere to this existing protocol down to its lower-level details (e.g. the binary offsets of each piece of information in the data packets are fixed and need to be respected).
The best approach for dealing with network packets is length-prefix framing: attach the payload length as a header on the actual packet, and have the receiving client always read the header first, convert it to the actual length, and then size its receive buffer from that value. It works reliably without any loss of packets, and it also helps with memory: you don't need a hard-coded array for the receive buffer, because the length in the header tells you exactly how many bytes to allocate.
For example:
public void SendPackettoNetwork(AgentTransportBO ObjCommscollection, Socket socket)
{
    try
    {
        // Serialize the transport object to its wire representation.
        byte[] actualBufferMessage = PeersSerialization(ObjCommscollection);

        // 4-byte length prefix for the payload.
        byte[] lengthPrefix = BitConverter.GetBytes(actualBufferMessage.Length);

        // Header (4 bytes) + payload, copied into a single send buffer.
        byte[] actualBuffer = new byte[actualBufferMessage.Length + 4];
        Buffer.BlockCopy(lengthPrefix, 0, actualBuffer, 0, lengthPrefix.Length);
        Buffer.BlockCopy(actualBufferMessage, 0, actualBuffer, lengthPrefix.Length, actualBufferMessage.Length);

        Logger.WriteLog("Bytes to send: " + actualBuffer.Length, LogLevel.GENERALLOG);
        socket.Send(actualBuffer);
    }
    catch (Exception ex)
    {
        Logger.WriteException(ex);
    }
}
Just pass your transport class object and use whatever serialization approach you prefer; here I used BinaryFormatter serialization of the class object.
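For completeness, here is a hedged sketch of the matching receive side, assuming the same 4-byte length prefix, System.Net.Sockets/System.IO, and a hypothetical PeersDeserialization helper that is the inverse of PeersSerialization above:

public AgentTransportBO ReceivePacketFromNetwork(Socket socket)
{
    // Read the 4-byte length header first.
    byte[] header = ReadExactly(socket, 4);
    int payloadLength = BitConverter.ToInt32(header, 0);

    // Then read exactly that many payload bytes and deserialize them.
    byte[] payload = ReadExactly(socket, payloadLength);
    return PeersDeserialization(payload); // hypothetical inverse of PeersSerialization
}

private static byte[] ReadExactly(Socket socket, int count)
{
    var buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int read = socket.Receive(buffer, offset, count - offset, SocketFlags.None);
        if (read == 0)
            throw new EndOfStreamException("Connection closed before the full packet arrived.");
        offset += read;
    }
    return buffer;
}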
I am currently working on a socket server and I was wondering
Why do serializers like
XmlSerializer
BinaryFormatter
Protobuf-net
DataContractSerializer
all require a Stream instead of a byte array?
It means you can stream to arbitrary destinations rather than just to memory.
If you want to write something to a file, why would you want to create a complete copy in memory first? In some cases that could cause you to use a lot of extra memory, possibly causing a failure.
If you want to create a byte array, just use a MemoryStream:
var memoryStream = new MemoryStream();
serializer.Write(foo, memoryStream); // Or whatever you're using
var bytes = memoryStream.ToArray();
So with an abstraction of "you use streams" you can easily work with memory - but if the abstraction is "you use a byte array" you are forced to work with memory even if you don't want to.
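For example, to serialize straight to a file without ever materializing the whole document in memory (a minimal sketch, assuming an XmlSerializer and an arbitrary Foo instance):

var serializer = new XmlSerializer(typeof(Foo));
using (var fileStream = File.Create("foo.xml"))
{
    // The serializer writes to the file as it walks the object graph;
    // no intermediate byte[] of the full document is ever created.
    serializer.Serialize(fileStream, foo);
}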
You can easily make a stream over a byte array... but a byte array is inherently size-constrained, whereas a stream is open-ended: as big as you need. Some serialized output can be pretty enormous.
Edit: Also, if I need to implement some kind of serialization, I want to do it for the most basic abstraction, and avoid having to do it over multiple abstractions. Stream would be my choice, as there are stream implementations over lots of things: memory, disk, network and so forth. As an implementer, I get those for "free".
If you use a byte array / buffer, you are working entirely in memory and you are limited in size.
A stream, by contrast, can send data to disk, to other computers (across the internet, over a serial port, etc.); streams often use buffers internally to optimize transmission speed.
So streaming is useful when you are dealing with large amounts of data.
Jon Skeet's answer is the correct one, but as an addendum, if the issue you're having with making a temporary stream is "I don't like it because it's effort", then consider writing an extension method:
namespace Project.Extensions
{
    public static class XmlSerialiserExtensions
    {
        // Note: MemoryStream(byte[]) is a fixed-size, writable stream, so the
        // supplied buffer must be large enough to hold the serialized output.
        public static void Serialise(this XmlSerializer serialiser, byte[] bytes, object obj)
        {
            using (var temp = new MemoryStream(bytes))
                serialiser.Serialize(temp, obj);
        }

        public static object Deserialise(this XmlSerializer serialiser, byte[] bytes)
        {
            using (var temp = new MemoryStream(bytes))
                return serialiser.Deserialize(temp);
        }
    }
}
So you can go ahead and do
serialiser.Serialise(buffer, obj);
socket.Write(buffer);
Or
socket.Read(buffer);
var obj = serialiser.Deserialise(buffer);
Byte arrays were used more often when manipulating ASCII (i.e. single-byte) character strings in machine-dependent code, such as buffers. They lend themselves to low-level applications, whereas streams are a more generalized way of dealing with data, which enables a wider range of applications. Streams are also a more abstract view of the data, which allows concerns such as character encoding (UTF-8, UTF-16, ASCII, etc.) to be handled by code that is invisible to the consumer of the stream.
I have a large list of objects that I need to store and retrieve later. The list will always be used as a unit and list items are not retrieved individually. The list contains about 7000 items totaling about 1GB, but could easily escalate to ten times that or more.
We have been using BinaryFormatter.Serialize() to do the serialization (System.Runtime.Serialization.Formatters.Binary.BinaryFormatter), and the output was uploaded as a blob to Azure blob storage. We found this generally fast and efficient, but it became inadequate once we started testing with larger data sets, throwing an OutOfMemoryException. From what I understand, although I'm using a stream, the problem is that the BinaryFormatter.Serialize() method must first serialize everything to memory before I can upload the blob, causing my exception.
The binary serializer looks as follows:
public void Upload(object value, string blobName, bool replaceExisting)
{
    CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);

    var formatter = new BinaryFormatter()
    {
        AssemblyFormat = FormatterAssemblyStyle.Simple,
        FilterLevel = TypeFilterLevel.Low,
        TypeFormat = FormatterTypeStyle.TypesAlways
    };

    using (var stream = blockBlob.OpenWrite())
    {
        formatter.Serialize(stream, value);
    }
}
The OutOfMemoryException occurs on the formatter.Serialize(stream, value) line.
I therefore tried using a different format, Protocol Buffers. I tried both the implementations in the NuGet packages protobuf-net and Google.Protobuf, but the serialization was horribly slow (roughly 30 minutes) and, from what I have read, Protobuf is not optimized for serializing data larger than 1 MB. So I went back to the drawing board and came across Cap'n Proto, which promises to solve my speed issues by using memory mapping. I am trying to use Marc Gravell's C# bindings, but I am having some difficulty implementing a serializer, as the project does not yet have thorough documentation. Moreover, I'm not 100% sure that Cap'n Proto is the correct choice of protocol, but I am struggling to find any alternative suggestions online.
How can I serialize a very large collection of items to blob storage, without hitting memory issues, and in a reasonably fast way?
Perhaps you should switch to JSON?
Using the JSON Serializer, you can stream to and from files and serialize/deserialize piecemeal (as the file is read).
Would your objects map to JSON well?
This is what I use to take a NetworkStream and put into a Json Object.
private static async Task<JObject> ProcessJsonResponse(HttpResponseMessage response)
{
    // Open the stream from the network
    using (var s = await ProcessResponseStream(response).ConfigureAwait(false))
    using (var sr = new StreamReader(s))
    using (var reader = new JsonTextReader(sr))
    {
        var serializer = new JsonSerializer { DateParseHandling = DateParseHandling.None };
        return serializer.Deserialize<JObject>(reader);
    }
}
Additionally, you could GZip the stream to reduce the file transfer times. We stream directly to GZipped JSON and back again.
Edit: although this shows a deserialize, the same approach should work for serialization.
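For the serialize direction, here is a minimal sketch of streaming straight into a GZip-compressed blob (assuming Json.NET, System.IO.Compression, the CloudBlockBlob from the question, and an illustrative ItemType element type):

public void UploadAsJson(IEnumerable<ItemType> items, CloudBlockBlob blockBlob)
{
    var serializer = new JsonSerializer();

    using (var blobStream = blockBlob.OpenWrite())
    using (var gzip = new GZipStream(blobStream, CompressionMode.Compress))
    using (var writer = new StreamWriter(gzip))
    using (var jsonWriter = new JsonTextWriter(writer))
    {
        // The serializer writes each item to the (compressed) blob stream as it
        // enumerates the collection, so the whole list is never held as one string.
        serializer.Serialize(jsonWriter, items);
    }
}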
JSON serialization can work, as the previous poster mentioned, although on a large enough list this was also causing OutOfMemoryException to be thrown, because the string was simply too big to fit in memory. You might be able to get around this by serializing in pieces if your object is a list, but if you're okay with binary serialization, a much faster / lower-memory way is to use Protobuf serialization.
Protobuf has faster serialization than JSON and requires a smaller memory footprint, but at the cost of it being not human readable. Protobuf-net is a great C# implementation of it. Here is a way to set it up with annotations and here is a way to set it up at runtime. In some instances, you can even GZip the Protobuf serialized bytes and save even more space.
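As an illustration, here is a rough sketch of the runtime setup with protobuf-net (the Person type, its Name and Age members, and the field numbers are all illustrative):

var model = RuntimeTypeModel.Create();

// Register the type without applying attribute-based behaviour (false),
// then map members to protobuf field numbers explicitly.
model.Add(typeof(Person), false)
     .Add(1, "Name")
     .Add(2, "Age");

using (var stream = File.Create("people.bin"))
{
    model.Serialize(stream, people); // people is e.g. a List<Person>
}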
I have a function which generates and returns a MemoryStream. After generation the size of the MemoryStream is fixed; I don't need to write to it any more, only output is required (write to a MailAttachment or to a database, for example).
What is the best way to hand the object around: MemoryStream or byte array? If I use MemoryStream I have to reset the position before it can be read.
If you have to hold all the data in memory, then in many ways the choice is arbitrary. If you have existing code that operates on Stream, then MemoryStream may be more convenient, but if you return a byte[] you can always just wrap that in a new MemoryStream(blob) anyway.
It might also depend on how big it is and how long you are holding it for; MemoryStream can be oversized, which has advantages and disadvantages. Forcing it to a byte[] may be useful if you are holding the data for a while, since it will trim off any excess; however, if you are only keeping it briefly, it may be counter-productive, since it will force you to duplicate most (at an absolute minimum: half) of the data while you create the new copy.
So; it depends a lot on context, usage and intent. In most scenarios, "whichever works, and is clear and simple" may suffice. If the data is particularly large or held for a prolonged period, you may want to deliberately tweak it a bit.
One additional advantage of the byte[] approach: if needed, multiple threads can access it safely at once (as long as they are reading) - this is not true of MemoryStream. However, that may be a false advantage: most code won't need to access the byte[] from multiple threads.
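To make the trade-off concrete, a small illustrative sketch of the two hand-off styles (SendMail stands in for any consumer that accepts a Stream):

// Hand around a byte[]: trims any excess capacity, but duplicates the data once.
byte[] data = memoryStream.ToArray();

// Hand around the stream: no copy, but the internal buffer may be oversized
// and the caller must remember to rewind before reading.
memoryStream.Position = 0;
SendMail(memoryStream); // hypothetical Stream-accepting consumer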
The MemoryStream class is designed for writing data to, and reading data from, an in-memory stream. It maintains a current position and works essentially sequentially; a MemoryStream is not primarily designed for reaching an arbitrary element at an arbitrary time.
A byte array, by contrast, allows random access to any element at any time for as long as it is allocated.
Like a byte[], a MemoryStream lives entirely in memory (as the class name suggests), and its capacity is an Int32, so it tops out at roughly 2 GB.
In short: use a byte[] if you need to access the data by index; use a MemoryStream when you need to hand the data to something that requires a stream as input.
Use a byte[]: it's a fixed-size object, which makes allocation and cleanup easier, and it carries essentially no overhead, especially since you don't need the functionality of the MemoryStream. Furthermore, you want that stream disposed of as soon as possible so it can release any unmanaged resources it may be using.
I am serializing a large data set using protocol buffer serialization. When my data set contains 400,000 custom objects with a combined size of around 1 GB, serialization returns in 3-4 seconds. But when my data set contains 450,000 objects with a combined size of around 1.2 GB, the serialization call never returns and the CPU is constantly consumed.
I am using the .NET port of Protocol Buffers.
Looking at the new comments, this appears to be (as the OP notes) MemoryStream capacity limited. A slight annoyance in the protobuf spec is that since sub-message lengths are variable and must prefix the sub-message, it is often necessary to buffer portions until the length is known. This is fine for most reasonable graphs, but if there is an exceptionally large graph (except for the "root object has millions of direct children" scenario, which doesn't suffer) it can end up doing quite a bit in-memory.
If you aren't tied to a particular layout (perhaps due to .proto interop with an existing client), then a simple fix is as follows: on child (sub-object) properties (including lists / arrays of sub-objects), tell it to use "group" serialization. This is not the default layout, but it says "instead of using a length-prefix, use a start/end pair of tokens". The downside of this is that if your deserialization code doesn't know about a particular object, it takes longer to skip the field, as it can't just say "seek forwards 231413 bytes" - it instead has to walk the tokens to know when the object is finished. In most cases this isn't an issue at all, since your deserialization code fully expects that data.
To do this:
[ProtoMember(1, DataFormat = DataFormat.Group)]
public SomeType SomeChild { get; set; }
....
[ProtoMember(4, DataFormat = DataFormat.Group)]
public List<SomeOtherType> SomeChildren { get { return someChildren; } }
The deserialization in protobuf-net is very forgiving by default (there is an optional strict mode), and it will happily deserialize groups in place of length-prefix, and length-prefix in place of groups (meaning: any data you have already stored somewhere should still work fine).
1.2 GB of memory is dangerously close to the managed memory limit for 32-bit .NET processes. My guess is that the serialization triggers an OutOfMemoryException and all hell breaks loose.
You should try to use several smaller serializations rather than one gigantic one, or move to a 64-bit process.
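As a hedged sketch of the "several smaller serializations" option with protobuf-net, each item can be written with its own length prefix and streamed back lazily (MyItem and Process are placeholders for your own type and handling code):

// Write each item as its own length-prefixed message (field number 1).
using (var stream = File.Create("data.bin"))
{
    foreach (var item in items)
    {
        Serializer.SerializeWithLengthPrefix(stream, item, PrefixStyle.Base128, 1);
    }
}

// Read them back one at a time; DeserializeItems yields lazily, so the
// whole collection never has to be materialized at once.
using (var stream = File.OpenRead("data.bin"))
{
    foreach (var item in Serializer.DeserializeItems<MyItem>(stream, PrefixStyle.Base128, 1))
    {
        Process(item);
    }
}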