Why do most serializers use a stream instead of a byte array? - c#

I am currently working on a socket server and I was wondering
Why do serializers like
XmlSerializer
BinaryFormatter
Protobuf-net
DataContractSerializer
all require a Stream instead of a byte array?

It means you can stream to arbitrary destinations rather than just to memory.
If you want to write something to a file, why would you want to create a complete copy in memory first? In some cases that could cause you to use a lot of extra memory, possibly causing a failure.
If you want to create a byte array, just use a MemoryStream:
var memoryStream = new MemoryStream();
serializer.Write(foo, memoryStream); // Or whatever you're using
var bytes = memoryStream.ToArray();
So with an abstraction of "you use streams" you can easily work with memory - but if the abstraction is "you use a byte array" you are forced to work with memory even if you don't want to.
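To make that contrast concrete, here is a minimal sketch (the `Foo` class and the file name are invented for illustration) showing that the exact same `Serialize` call can target a file or memory, precisely because both are just `Stream`s:

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

public class Foo
{
    public int Id { get; set; }
}

public static class Demo
{
    public static void Main()
    {
        var serializer = new XmlSerializer(typeof(Foo));
        var foo = new Foo { Id = 42 };

        // Stream straight to disk: no complete in-memory copy is made first.
        using (var file = File.Create("foo.xml"))
            serializer.Serialize(file, foo);

        // The same call targets memory when a byte[] is what you want.
        byte[] bytes;
        using (var ms = new MemoryStream())
        {
            serializer.Serialize(ms, foo);
            bytes = ms.ToArray();
        }

        Console.WriteLine(bytes.Length > 0 && new FileInfo("foo.xml").Length > 0); // prints True
    }
}
```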

You can easily make a stream over a byte array... but a byte array is inherently size-constrained, whereas a stream is open-ended: as big as you need. Some serialized output can be pretty enormous.
Edit: Also, if I need to implement some kind of serialization, I want to do it for the most basic abstraction, and avoid having to do it over multiple abstractions. Stream would be my choice, as there are stream implementations over lots of things: memory, disk, network and so forth. As an implementer, I get those for "free".

If you use a byte array/buffer, you are working entirely in memory, and you are limited in size.
A stream, on the other hand, can store things on disk or send data to other computers (over the internet, a serial port, etc.), and streams often use buffers internally to optimize transmission speed.
So streaming is especially useful when you are dealing with a large file.

@JonSkeet's answer is the correct one, but as an addendum, if the issue you're having with creating a temporary stream is "I don't like it because it's effort", then consider writing an extension method:
namespace Project.Extensions
{
    public static class XmlSerialiserExtensions
    {
        // Note: the MemoryStream(byte[]) constructor creates a fixed-size stream,
        // so the caller's buffer must be pre-allocated large enough to hold the output.
        public static void Serialise(this XmlSerializer serialiser, byte[] bytes, object obj)
        {
            using (var temp = new MemoryStream(bytes))
                serialiser.Serialize(temp, obj);
        }

        public static object Deserialise(this XmlSerializer serialiser, byte[] bytes)
        {
            using (var temp = new MemoryStream(bytes))
                return serialiser.Deserialize(temp);
        }
    }
}
So you can go ahead and do
serialiser.Serialise(buffer, obj);
socket.Write(buffer);
Or
socket.Read(buffer);
var obj = serialiser.Deserialise(buffer);

Byte arrays were used more often when manipulating ASCII (i.e. 1-byte) character strings, often in machine-dependent applications such as buffers. They lend themselves to low-level applications, whereas streams are a more generalized way of dealing with data, which enables a wider range of applications. Streams are also a more abstract view of the data, which allows concerns such as character encoding (UTF-8, UTF-16, ASCII, etc.) to be handled by code that is invisible to the consumer of the stream.

Related

Can I use multiple BinaryWriters on the same Stream?

Can I create a new BinaryWriter and write to a Stream while the stream is already being used by another BinaryWriter?
I need to write some data recursively, but I would like to avoid passing a BinaryWriter to a method as a parameter, since the methods currently receive a Stream instead. So each method that writes data to the stream may need to create its own BinaryWriter instance, but I don't know if this is right. For now it works well on a FileStream, but I don't know if it could lead to unexpected results on users' machines.
I wrote a simple example of what I want to achieve. Is this use of the BinaryWriter wrong?
Example:
public Main()
{
    using (var ms = new MemoryStream())
    {
        // Write data on the stream.
        WriteData(ms);
    }
}

private void WriteData(Stream output)
{
    // Create a BinaryWriter for use only in this method.
    using (var bWriter = new BinaryWriter(output, Encoding.UTF8, true))
    {
        // Write some data using this BinaryWriter.
        bWriter.Write("example data string");

        // Send the stream to another method and write some more data there.
        WriteMoreData(output);

        // Write some more data using this BinaryWriter.
        bWriter.Write("another example data string");
    }
}

private void WriteMoreData(Stream output)
{
    // Create a BinaryWriter for use only in this method.
    using (var bWriter = new BinaryWriter(output, Encoding.Unicode, true))
    {
        // Write some data with this BinaryWriter.
        bWriter.Write("write even more example data here");
    }
}
Is this use of the BinaryWriter wrong?
It should work fine. BinaryWriter does no buffering itself, so each instance won't interfere with data written by other instances. You're passing true for the leaveOpen parameter, so when each instance is disposed, it won't close the underlying stream.
But "wrong" is to some degree in the eye of the beholder. I would say it's better to pass the BinaryWriter.
MemoryStream isn't buffered, but other types are. Each instance of BinaryWriter, when it's disposed, will flush the stream. This could be considered inefficient by some people, as it negates the benefit of the buffering, at least partially. Not an issue here, but may not be the best habit to get into.
In addition, each instance of BinaryWriter is going to create additional work for the garbage collector. If there's really only a few, that's probably not an issue. But if the real-world example involves a lot more calls, that could start to get noticeable, especially when the underlying stream is a MemoryStream (i.e. you're not dealing with some slow device I/O).
More to the point, I don't see any clear advantage to using multiple BinaryWriter instances on the same stream here. It seems like the natural, readable, easily-maintained thing to do would be to create a single BinaryWriter and reuse it until you're done writing.
Why do you want to avoid passing it as a parameter? You're already passing the Stream. Just pass the BinaryWriter instead. If you ever did need direct access to the underlying stream, it's always available via BinaryWriter.BaseStream.
Bottom line: I can't say there's anything clearly wrong per se with your proposal. But it's a deviation from normal conventions without (to me, anyway) a clear benefit. If you have a really good rationale for doing it this way, it should work. But I'd recommend against it.
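For illustration, here is a sketch of what the recursive-writing example could look like with a single `BinaryWriter` passed down instead. Note the assumption that all methods can share one encoding: the original used UTF-8 in one method and Unicode in another, which a single writer cannot do.

```csharp
using System;
using System.IO;
using System.Text;

public static class Demo
{
    // Pass the BinaryWriter itself rather than the raw Stream.
    static void WriteData(BinaryWriter writer)
    {
        writer.Write("example data string");
        WriteMoreData(writer);                  // reuse the same writer
        writer.Write("another example data string");
    }

    static void WriteMoreData(BinaryWriter writer)
    {
        writer.Write("write even more example data here");
    }

    public static byte[] WriteAll()
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
                WriteData(writer);
            return ms.ToArray();
        }
    }

    public static void Main()
    {
        Console.WriteLine(WriteAll().Length > 0); // prints True
    }
}
```

If you ever need the raw stream inside one of the methods, it is still reachable via `writer.BaseStream`.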

Deserializing packed bytes to a Stream member with protobuf-net

Is there a way to deserialize a bytes field to a Stream member, without protobuf-net allocating a new (and potentially big) byte[] upfront?
I'm looking for something like this:
[ProtoContract]
public class Message
{
    [ProtoMember(1)]
    Stream Payload { get; set; }
}
Where the Stream could be backed by a pre-allocated buffer pool e.g. Microsoft.IO.RecyclableMemoryStream. Even after dropping down to ProtoReader for deserialization all I see is AppendBytes, which always allocates a buffer of field length. One has to drop even further to DirectReadBytes, which only operates directly on the message stream -- I'd like to avoid that.
As background, I'm using protobuf-net to serialize/deserialize messages across the wire. This is a middle-layer component for passing messages between clients, so the messages are really an envelope for an enclosed binary payload:
message Envelope {
    required string messageId = 1;
    map<string, string> headers = 2;
    bytes payload = 3;
}
The size of payload is restricted to ~2 MB - small, but comfortably above the 85,000-byte threshold, so the byte[] lands on the LOH.
Using a surrogate as in Protobuf-net: Serializing a 3rd party class with a Stream data member doesn't work because it simply wraps the same monolithic array.
One technique that should work is mentioned in Memory usage serializing chunked byte arrays with Protobuf-net, changing bytes to repeated bytes and relying on the sender to limit each chunk. This solution may be good enough, it'll prevent LOH allocation, but it won't allow buffer pooling.
The question here is about the payload field. No, there is currently no mechanism to handle that, but it is certainly something that could be investigated for options. It could be that we add something like an ArraySegment<byte> AllocateBuffer(int size) callback on the serialization-context that the caller could use to take control of the allocations. The nice thing about this is that protobuf-net doesn't currently work with ArraySegment<byte>, so it would be a purely incremental change that wouldn't impact any existing working code; if no callback is provided, we would presumably just allocate a flat buffer as it does currently. I'm open to other suggestions, but right now: no - it will allocate a byte[] internally.
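For reference, the chunking workaround mentioned above amounts to changing the schema so the payload arrives as a sequence of smaller bytes fields (the 64 KB chunk size here is an arbitrary example the sender would enforce):

```proto
message Envelope {
    required string messageId = 1;
    map<string, string> headers = 2;
    repeated bytes payload = 3; // sender splits the payload into chunks of e.g. 64 KB
}
```

protobuf-net can then deserialize payload into a list of small byte arrays, each staying well below the LOH threshold; the receiver still has to stitch the chunks together (or feed them into a pooled stream) itself.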

.Net streams: Returning vs Providing

I have always wondered what the best practice for using a Stream class in C# .Net is. Is it better to provide a stream that has been written to, or be provided one?
i.e.:
public Stream DoStuff(...)
{
    var retStream = new MemoryStream();
    // Write to retStream
    return retStream;
}
as opposed to:
public void DoStuff(Stream myStream, ...)
{
    // Write to myStream directly
}
I have always used the former for the sake of lifecycle control, but I have a feeling that it is a poor way of "streaming" with Streams, for lack of a better word.
I would prefer "the second way" (operate on a provided stream) since it has a few distinct advantages:
You can have polymorphism (assuming as evidenced by your signature you can do your operations on any type of Stream provided).
It's easily abstracted into a Stream extension method now or later.
You clearly divide responsibilities. This method should not care on how to construct a stream, only on how to apply a certain operation to it.
Also, if you return a new stream (option 1), it feels a bit strange that the caller has to Seek back to the beginning before reading from it. (You could do that in the method itself, but that is suboptimal too, since it might not always be required - the stream might not be read from afterwards in all cases.) Having to Seek after passing an already existing stream to a method that clearly writes to the stream seems much less awkward.
I see the benefit of Streams is that you don't need to know what you're streaming to.
In the second example, your code could be writing to memory, it could be writing directly to file, or to some network buffer. From the function's perspective, the actual output destination can be decided by the caller.
For this reason, I would prefer the second option.
The first function is just writing to memory. In my opinion, it would be clearer if it did not return a stream, but the actual memory buffer. The caller can then attach a Memory Stream if he/she wishes.
public byte[] DoStuff(...)
{
    var retStream = new MemoryStream();
    // Write to retStream
    return retStream.ToArray();
}
100% the second one. You don't want to make assumptions about what kind of stream they want. Do they want to stream to the network or to disk? Do they want it to be buffered? Leave these up to them.
They may also want to reuse the stream to avoid creating new buffers over and over. Or they may want to stream multiple things end-to-end on the same stream.
If they provide the stream, they have control over its type as well as its lifetime. Otherwise, you might as well just return something like a string or array. The stream isn't really giving you any benefit over these.
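As a sketch of that point (the method and file names are invented for the example), the same writing code can target memory or a file, with the caller deciding the destination:

```csharp
using System;
using System.IO;
using System.Text;

public static class Demo
{
    // The method only knows "a writable stream"; the caller picks the destination.
    static void DoStuff(Stream output)
    {
        var payload = Encoding.UTF8.GetBytes("hello");
        output.Write(payload, 0, payload.Length);
    }

    public static void Main()
    {
        // Same method, memory destination...
        using (var ms = new MemoryStream())
        {
            DoStuff(ms);
            Console.WriteLine(ms.Length); // prints 5
        }

        // ...or a file, with no change to DoStuff.
        using (var fs = File.Create("out.bin"))
            DoStuff(fs);
        Console.WriteLine(new FileInfo("out.bin").Length); // prints 5
    }
}
```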

c# MemoryStream vs Byte Array

I have a function which generates and returns a MemoryStream. After generation the size of the MemoryStream is fixed; I don't need to write to it anymore, only read from it - to write it to a MailAttachment or to a database, for example.
What is the best way to hand the object around: MemoryStream or byte array? If I use a MemoryStream, I have to reset the position before it can be read.
If you have to hold all the data in memory, then in many ways the choice is arbitrary. If you have existing code that operates on Stream, then MemoryStream may be more convenient, but if you return a byte[] you can always just wrap that in a new MemoryStream(blob) anyway.
It might also depend on how big it is and how long you are holding it for; MemoryStream can be oversized, which has advantages and disadvantages. Forcing it to a byte[] may be useful if you are holding the data for a while, since it will trim off any excess; however, if you are only keeping it briefly, it may be counter-productive, since it will force you to duplicate most (at an absolute minimum: half) of the data while you create the new copy.
So; it depends a lot on context, usage and intent. In most scenarios, "whichever works, and is clear and simple" may suffice. If the data is particularly large or held for a prolonged period, you may want to deliberately tweak it a bit.
One additional advantage of the byte[] approach: if needed, multiple threads can access it safely at once (as long as they are reading) - this is not true of MemoryStream. However, that may be a false advantage: most code won't need to access the byte[] from multiple threads.
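A small illustration of the "wrap it when you need it" point above: a byte[] converts to a readable stream essentially for free, and a freshly written MemoryStream just needs its Position reset before re-reading.

```csharp
using System;
using System.IO;

public static class Demo
{
    public static void Main()
    {
        byte[] blob = { 1, 2, 3, 4 };

        // A byte[] can be re-wrapped as a readable stream at any time, cheaply.
        using (var ms = new MemoryStream(blob))
        {
            Console.WriteLine(ms.ReadByte()); // prints 1
        }

        // Going the other way, remember to rewind before reading back.
        var writable = new MemoryStream();
        writable.Write(blob, 0, blob.Length);
        writable.Position = 0;                    // reset position after writing
        Console.WriteLine(writable.ReadByte());   // prints 1
    }
}
```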
A MemoryStream is designed for reading and writing elements through an internal position pointer; random access is only simulated by seeking, depending on how the stream is implemented. A MemoryStream is therefore not designed for accessing arbitrary items at arbitrary times.
A byte array, by contrast, allows direct random access to any element for as long as it is allocated.
Like a byte[], a MemoryStream lives entirely in memory (as the class name suggests), and its maximum capacity is about 2 GB (its length is a 32-bit integer).
In short: use a byte[] if you need to access the data at arbitrary indices. Otherwise, a MemoryStream is designed for feeding the data to something that expects a Stream as input.
Use a byte[]: it's a fixed-size object, which makes allocation and cleanup easier, and it carries almost no overhead - especially since you don't need the functions of the MemoryStream. Furthermore, you want that stream disposed of ASAP so it can release any unmanaged resources it may be using.

Reading parts of large files from drive

I'm working with large files in C# (they can be up to 20%-40% of available memory) and I will only need small parts of a file loaded into memory at a time (around 1-2% of the file). I was thinking that using a FileStream would be the best option, but I'm not sure. I need to give a starting point (in bytes) and a length (in bytes) and copy that region into a byte[]. Access to the file might need to be shared between threads and will be at random spots in the file (non-linear access). It also needs to be fast.
The project already has unsafe methods, so feel free to suggest things from the more dangerous side of C#
A FileStream will allow you to seek to the portion of the file you want, no problem. It's the recommended way to do it in C#, and it's fast.
Sharing between threads: You will need to create a lock to prevent other threads from changing the FileStream position while you're trying to read from it. The simplest way to do this:
// This really needs to be a member-level variable:
private static readonly object fsLock = new object();

// Instantiate this in a static constructor or an Initialize() method
private static FileStream fs = new FileStream("myFile.txt", FileMode.Open);

public byte[] ReadFile(long fileOffset, int bufferSize)
{
    byte[] buffer = new byte[bufferSize];
    int arrayOffset = 0;
    lock (fsLock)
    {
        fs.Seek(fileOffset, SeekOrigin.Begin);
        int numBytesRead = fs.Read(buffer, arrayOffset, bufferSize);
        // Typically used if you're in a loop, reading blocks at a time
        arrayOffset += numBytesRead;
    }
    // Do what you want with the byte array and return it
    return buffer;
}
Add try..catch statements and other code as necessary. Everywhere you access this FileStream, put a lock on the member-level variable fsLock... this will keep other methods from reading/manipulating the file pointer while you're trying to read.
Speed-wise, I think you'll find you're limited by disk access speeds, not code.
You'll have to think through all the issues about multi-threaded file access... who initializes/opens the file, who closes it, etc. There's a lot of ground to cover.
I know nothing about the structure of these files, but reading a portion of a file with FileStream or similar sounds like the best and fastest way to do it.
You will not need to copy the byte[] since FileStream can read directly into a byte array.
It sounds like you might know more about the structure of the file, which could bring up additional techniques as well. But if you need to read only a portion of the file, then this would probably be the way to do it.
If you are using .Net 4 look into using memory mapped files in the System.IO.MemoryMappedFiles namespace.
They are perfect for reading small chunks out of large files. There are samples in the MSDN documentation.
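As a sketch of that approach (the file name, offset, and window size are invented for the example), reading a small window out of a large file through a memory-mapped view looks roughly like this:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class Demo
{
    public static void Main()
    {
        // Stand-in for a pre-existing large file.
        File.WriteAllBytes("big.bin", new byte[1024]);

        // Map the file once; separate views can then be read concurrently.
        using (var mmf = MemoryMappedFile.CreateFromFile("big.bin", FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor(512, 16, MemoryMappedFileAccess.Read))
        {
            var chunk = new byte[16];
            accessor.ReadArray(0, chunk, 0, chunk.Length); // 16 bytes starting at file offset 512
            Console.WriteLine(chunk.Length); // prints 16
        }
    }
}
```

Each thread can create its own view accessor over a different region, which avoids the shared-file-pointer locking that a single FileStream requires.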
You can also do this in earlier versions of .NET, but then you need to wrap the Win32 API yourself (or use http://winterdom.com/dev/net).
