Efficiently write ReadOnlySequence to Stream - c#

Reading data using a PipeReader returns a ReadResult containing the requested data as a ReadOnlySequence<byte>. Currently I am using this (simplified) snippet to write the data fetched from the reader to my target stream:
var data = (await pipeReader.ReadAsync(cancellationToken)).Buffer;
// lots of parsing, advancing, etc.
var position = data.GetPosition(0);
while (data.TryGet(ref position, out var memory))
{
    await body.WriteAsync(memory);
}
This seems to be a lot of code for such a basic task, which I would usually expect to be a one-liner in .NET. Analyzing the overloads provided by Stream I fail to see how this functionality can be achieved with less code.
Is there some kind of extension method I am missing?

Looks like what you are looking for is planned for .NET 7:
Stream wrappers for more types. Developers have asked for the ability to create a Stream around the contents of a ReadOnlyMemory, a ReadOnlySequence...
Not the greatest answer, but at least you can stop worrying that you are missing something obvious.
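In the meantime, the loop can be hidden behind a small extension method. This is only a sketch of the pattern from the question (the extension name and class are mine, not a framework API):
using System.Buffers;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class StreamSequenceExtensions
{
    // Writes every segment of the sequence to the stream, in order.
    public static async ValueTask WriteAsync(
        this Stream stream,
        ReadOnlySequence<byte> sequence,
        CancellationToken cancellationToken = default)
    {
        // ReadOnlySequence<byte> enumerates its segments as ReadOnlyMemory<byte>.
        foreach (ReadOnlyMemory<byte> segment in sequence)
        {
            await stream.WriteAsync(segment, cancellationToken);
        }
    }
}
The call site then becomes the one-liner from the question: await body.WriteAsync(data, cancellationToken);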

Related

How to unit test the following scenario?

I have the following method
public override void SendSync(int frame, byte[] syncData)
{
    using (var ms = new MemoryStream(4 + syncData.Length))
    {
        ms.WriteArray(BitConverter.GetBytes(frame));
        ms.WriteArray(syncData);
        queuedSyncPackets.Add(ms.GetBuffer());
    }
}
and I would like to unit test it to guarantee 2 things:
The ms is disposed at the end (OK, I will use a Roslyn analyzer for this).
4 + syncData.Length is equal to the length of the buffer added at queuedSyncPackets.Add(ms.GetBuffer());
How can I do that?
Philosophical part:
You really should not be unit testing how a method is implemented internally. Test that it behaves correctly, but do not demand a particular implementation. In this case it would arguably be clearer to implement the method with a new array and two Copy operations instead of a MemoryStream; demanding that the method creates and disposes a memory stream, or uses a particular byte array, is likely wrong. If you do expect some particular caching/pooling to be used, you can assert that.
The ms is disposed at the end:
Testing for the presence of using on a MemoryStream: since MemoryStream has no unmanaged resources, absolutely nothing changes in the code shown whether you dispose it or not (Dispose will prevent future read/write operations, which may impact other usage patterns). As a result, this particular case is pretty much a style preference and should be caught by code analyzers or code review.
If you really want to test that - one option would be to move creation of the stream to a factory method and capture that created stream for later verification. Something like
public override void SendSync(
    int frame, byte[] syncData, Func<MemoryStream> streamFactory)
{
    using (var ms = streamFactory())...
}

...MyTest()
{
    var ms = new MemoryStream();
    SendSync(..., () => ms);
    Assert.IsFalse(ms.CanRead,
        "Expecting stream to be disposed and hence no more readable");
}
Note: as mentioned in the comments and the question, it is probably overkill for this particular case to test that the stream was disposed. If someone removes it as unnecessary and someone else adds it back as "required by our coding standard", this can't be solved with unit tests - talk to people instead.
Check parameters to a method:
Assuming queuedSyncPackets is an injected dependency, you just need to add a check, at the moment of the call, that the data is as expected. This is usually done with a mocking framework like Moq, but you can also implement your own class that simply checks the array size in that call.
Note that the code shown in the post will happen to pass that check, but it is not guaranteed to (that part of the behavior is not documented). GetBuffer() returns the internal array of the MemoryStream; its size is the same as or bigger than the data actually written, because the stream grows its buffer when it runs out of space. If you must use GetBuffer(), you also have to pass the size of the stream (ms.Length) along with it.
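To make the difference concrete, here is a small standalone sketch (mine, not from the post) showing why GetBuffer() is not a reliable stand-in for the written data, while ToArray() is:
using System;
using System.IO;

class GetBufferVsToArray
{
    static void Main()
    {
        var syncData = new byte[10];
        // No capacity hint: the stream grows its internal buffer in steps,
        // so GetBuffer() can be larger than the data actually written.
        using (var ms = new MemoryStream())
        {
            ms.Write(BitConverter.GetBytes(42), 0, 4);
            ms.Write(syncData, 0, syncData.Length);

            Console.WriteLine(ms.Length);             // 14 - bytes actually written
            Console.WriteLine(ms.GetBuffer().Length); // typically larger (e.g. 256)
            Console.WriteLine(ms.ToArray().Length);   // 14 - right-sized copy
        }
    }
}
With the constructor from the question, new MemoryStream(4 + syncData.Length), the internal buffer happens to be exactly the requested size, which is why the posted code passes the check today - but nothing documents that it must stay that way.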
Approximate sample for the test using moq:
byte[] theBuffer = null;
mockQueuedSyncPackets.Setup(c => c.Add(It.IsAny<byte[]>()))
    .Callback<byte[]>((obj) => theBuffer = obj);
var theSender = new Sender(mockQueuedSyncPackets.Object);
theSender.SendSync(123, syncData);
Assert.AreEqual(4 + syncData.Length, theBuffer.Length);

Serializing a very large List of items into Azure blob storage using C#

I have a large list of objects that I need to store and retrieve later. The list will always be used as a unit and list items are not retrieved individually. The list contains about 7000 items totaling about 1GB, but could easily escalate to ten times that or more.
We have been using BinaryFormatter.Serialize() to do the serialization (System.Runtime.Serialization.Formatters.Binary.BinaryFormatter). The serialized data was then uploaded as a blob to Azure blob storage. We found this to be generally fast and efficient, but it became inadequate as we tested with greater file sizes, throwing an OutOfMemoryException. From what I understand, although I'm using a stream, my problem is that the BinaryFormatter.Serialize() method must first serialize everything to memory before I can upload the blob, causing my exception.
The binary serializer looks as follows:
public void Upload(object value, string blobName, bool replaceExisting)
{
    CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);
    var formatter = new BinaryFormatter()
    {
        AssemblyFormat = FormatterAssemblyStyle.Simple,
        FilterLevel = TypeFilterLevel.Low,
        TypeFormat = FormatterTypeStyle.TypesAlways
    };
    using (var stream = blockBlob.OpenWrite())
    {
        formatter.Serialize(stream, value);
    }
}
The OutOfMemoryException occurs on the formatter.Serialize(stream, value) line.
I therefore tried using a different protocol, Protocol Buffers. I tried both of the implementations in the NuGet packages protobuf-net and Google.Protobuf, but the serialization was horribly slow (roughly 30 minutes) and, from what I have read, Protobuf is not optimized for serializing data larger than 1 MB. So I went back to the drawing board and came across Cap'n Proto, which promises to solve my speed issues by using memory mapping. I am trying to use @marc-gravell's C# bindings, but I am having some difficulty implementing a serializer, as the project does not have thorough documentation yet. Moreover, I'm not 100% sure that Cap'n Proto is the correct choice of protocol, but I am struggling to find any alternative suggestions online.
How can I serialize a very large collection of items to blob storage, without hitting memory issues, and in a reasonably fast way?
Perhaps you should switch to JSON?
Using the JSON Serializer, you can stream to and from files and serialize/deserialize piecemeal (as the file is read).
Would your objects map to JSON well?
This is what I use to take a NetworkStream and turn it into a JSON object.
private static async Task<JObject> ProcessJsonResponse(HttpResponseMessage response)
{
    // Open the stream from the network
    using (var s = await ProcessResponseStream(response).ConfigureAwait(false))
    {
        using (var sr = new StreamReader(s))
        {
            using (var reader = new JsonTextReader(sr))
            {
                var serializer = new JsonSerializer { DateParseHandling = DateParseHandling.None };
                return serializer.Deserialize<JObject>(reader);
            }
        }
    }
}
Additionally, you could GZip the stream to reduce the file transfer times. We stream directly to GZipped JSON and back again.
Edit: although this is a deserialize, the same approach should work for a serialize.
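A rough sketch of what that might look like for the upload direction (assuming Json.NET and the same CloudBlockBlob/BlobContainer setup as in the question; the method name is mine) would write straight through a GZip stream into the blob, so the whole payload never has to exist in memory at once:
using System.IO;
using System.IO.Compression;
using Newtonsoft.Json;

public void UploadAsJson(object value, string blobName)
{
    // Same blob setup as the question's Upload method.
    CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);

    using (var blobStream = blockBlob.OpenWrite())
    using (var gzip = new GZipStream(blobStream, CompressionLevel.Optimal))
    using (var textWriter = new StreamWriter(gzip))
    using (var jsonWriter = new JsonTextWriter(textWriter))
    {
        // The serializer walks the object graph and emits tokens as it goes,
        // writing them through the GZip stream to the blob.
        var serializer = new JsonSerializer();
        serializer.Serialize(jsonWriter, value);
    }
}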
JSON serialization can work, as the previous poster mentioned, although on a large enough list this was also causing OutOfMemoryException to be thrown, because the string was simply too big to fit in memory. You might be able to get around this by serializing in pieces if your object is a list, but if you're okay with binary serialization, a much faster/lower-memory way is to use Protobuf serialization.
Protobuf has faster serialization than JSON and requires a smaller memory footprint, but at the cost of not being human readable. protobuf-net is a great C# implementation of it. Here is a way to set it up with annotations and here is a way to set it up at runtime. In some instances, you can even GZip the Protobuf-serialized bytes and save even more space.
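As a quick illustration of the attribute-based setup mentioned above (the type and its members are invented for the example):
using ProtoBuf;

[ProtoContract]
public class CatalogEntry
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public string Name { get; set; }
    [ProtoMember(3)] public double Price { get; set; }
}
Each member gets a stable field number; protobuf-net writes those numbers, not the property names, on the wire, which is what keeps the format compact and version-tolerant.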

.Net streams: Returning vs Providing

I have always wondered what the best practice for using the Stream class in C# .NET is. Is it better to provide a stream that has been written to, or to be provided one?
i.e:
public Stream DoStuff(...)
{
    var retStream = new MemoryStream();
    //Write to retStream
    return retStream;
}
as opposed to:
public void DoStuff(Stream myStream, ...)
{
    //write to myStream directly
}
I have always used the former example for the sake of lifecycle control, but I have a feeling that it is a poor way of "streaming" with Streams, for lack of a better word.
I would prefer "the second way" (operate on a provided stream) since it has a few distinct advantages:
You can have polymorphism (assuming, as evidenced by your signature, that you can do your operations on any type of Stream provided).
It's easily abstracted into a Stream extension method now or later (see the sketch after this answer).
You clearly divide responsibilities. This method should not care on how to construct a stream, only on how to apply a certain operation to it.
Also, if you're returning a new stream (option 1), it feels a bit strange that the caller has to Seek back to the start before being able to read from it (unless you do that in the method itself, which is suboptimal once more, since it might not always be required; the stream might not be read from afterwards in all cases). Having to Seek after passing an already existing stream to a method that clearly writes to the stream does not seem so awkward.
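For instance, a minimal sketch of that extension-method shape (the WriteGreeting operation is purely illustrative):
using System.IO;
using System.Text;

public static class StreamOperations
{
    // The operation is expressed over any writable Stream the caller supplies.
    public static void WriteGreeting(this Stream destination, string name)
    {
        byte[] bytes = Encoding.UTF8.GetBytes($"Hello, {name}!");
        destination.Write(bytes, 0, bytes.Length);
    }
}
The caller then decides whether destination is a MemoryStream, a FileStream, or a network stream, and owns its lifetime.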
I see the benefit of Streams is that you don't need to know what you're streaming to.
In the second example, your code could be writing to memory, it could be writing directly to file, or to some network buffer. From the function's perspective, the actual output destination can be decided by the caller.
For this reason, I would prefer the second option.
The first function just writes to memory. In my opinion, it would be clearer if it did not return a stream but the actual memory buffer. The caller can then wrap it in a MemoryStream if they wish.
public byte[] DoStuff(...)
{
    var retStream = new MemoryStream();
    //Write to retStream
    return retStream.ToArray();
}
100% the second one. You don't want to make assumptions about what kind of stream they want. Do they want to stream to the network or to disk? Do they want it to be buffered? Leave these up to them.
They may also want to reuse the stream to avoid creating new buffers over and over. Or they may want to stream multiple things end-to-end on the same stream.
If they provide the stream, they have control over its type as well as its lifetime. Otherwise, you might as well just return something like a string or array. The stream isn't really giving you any benefit over these.

Protobuf Exception When Deserializing Large File

I'm using protobuf to serialize large objects to binary files to be deserialized and used again at a later date. However, I'm having issues when I'm deserializing some of the larger files. The files are roughly ~2.3 GB in size and when I try to deserialize them I get several exceptions thrown (in the following order):
Sub-message not read correctly
Invalid wire-type; this usually means you have over-written a file without truncating or setting the length; see Using Protobuf-net, I suddenly got an exception about an unknown wire-type
Unexpected end-group in source data; this usually means the source data is corrupt
I've looked at the question referenced in the second exception, but that doesn't seem to cover the problem I'm having.
I'm using Microsoft's HPC pack to generate these files (they take a while) so the serialization looks like this:
using (var consoleStream = Console.OpenStandardOutput())
{
    Serializer.Serialize(consoleStream, dto);
}
And I'm reading the files in as follows:
private static T Deserialize<T>(string file)
{
    using (var fs = File.OpenRead(file))
    {
        return Serializer.Deserialize<T>(fs);
    }
}
The files are of two different types: one is about 1 GB in size, the other about 2.3 GB. The smaller files all work; the larger files do not. Any ideas what could be going wrong here? I realise I've not given a lot of detail; I can give more as requested.
Here I need to refer to a recent discussion on the protobuf list:
Protobuf uses int to represent sizes so the largest size it can possibly support is <2G. We don't have any plan to change int to size_t in the code. Users should avoid using overly large messages.
I'm guessing that the cause of the failure inside protobuf-net is basically the same. I can probably change protobuf-net to support larger files, but I have to advise that this is not recommended, because it looks like no other implementation is going to work well with such huge data.
The fix is probably just a case of changing a lot of int to long in the reader/writer layer. But: what is the layout of your data? If there is an outer object that is basically a list of the actual objects, there is probably a sneaky way of doing this using an incremental reader (basically, spoofing the repeated support directly).
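A sketch of that "sneaky way" (my own illustration; the Row type and field number are invented, since the real DTOs are not shown in the question): write the inner items one at a time with a length prefix, using the same field number the outer list would have had, and read them back lazily. The resulting bytes match what a single wrapper message with a repeated field 1 would contain, but no single multi-gigabyte message is ever materialized:
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class Row
{
    [ProtoMember(1)] public double[] Values { get; set; }
}

public static class IncrementalRows
{
    // Write each row as "field 1" of an implicit outer message.
    public static void WriteAll(Stream destination, IEnumerable<Row> rows)
    {
        foreach (var row in rows)
            Serializer.SerializeWithLengthPrefix(destination, row, PrefixStyle.Base128, 1);
    }

    // Stream the rows back without ever holding the whole file in memory.
    public static IEnumerable<Row> ReadAll(Stream source)
        => Serializer.DeserializeItems<Row>(source, PrefixStyle.Base128, 1);
}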

I have a Single File And need to serialize multiple objects randomly. How can I in c#?

I have a single file and need to serialize multiple objects of the same class whenever a new object is created. I can't store them in arrays, as I need to serialize each one the instant it is created. Please help me.
What serialization mechanism are you using? XmlSerializer might be a problem because of the root node and things like namespace declarations, which are a bit tricky to get shot of - plus it isn't great at partial deserializations. BinaryFormatter is very brittle to begin with - I don't recommend it in most cases.
One option might be protobuf-net; this is a binary serializer (using Google's "protocol buffers" format - efficient, portable, and version-tolerant). You can serialize multiple objects to a stream with Serializer.SerializeWithLengthPrefix. To deserialize the same items, Serializer.DeserializeItems returns an IEnumerable<T> of the deserialized items - or you could easily make TryDeserializeWithLengthPrefix public (it is currently private, but the source is available).
Just write each object to file after you have created it - job done.
If you want an example, please say - although the unit tests here give an overview.
It would basically be something like (untested):
using (Stream s = File.Create(path))
{
    Serializer.SerializeWithLengthPrefix(s, command1, PrefixStyle.Base128, 0);
    ... your code etc
    Serializer.SerializeWithLengthPrefix(s, commandN, PrefixStyle.Base128, 0);
}
...
using (Stream s = File.OpenRead(path))
{
    foreach (Command command in
        Serializer.DeserializeItems<Command>(s, PrefixStyle.Base128, 0))
    {
        ... do something with command
    }
}
See answer here.
In short, just serialize everything to the same file stream, and then deserialize; .NET would know the size of each object.
For every object that arrives, convert it into a Base64-encoded string and store it as one line in a text file, so every row of the file holds one serialized object. When reading, read the file one line at a time and deserialize each Base64-encoded string back into an object. The full walkthrough is in the linked article:
http://www.codeproject.com/KB/cs/serializedeserialize.aspx?display=Print
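A rough sketch of that per-line approach (my own illustration; it reuses protobuf-net as the byte serializer rather than the article's formatter, so T must be a protobuf contract type):
using System;
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

public static class Base64LineStore
{
    // Append one object as a single Base64 line.
    public static void Append<T>(string path, T item)
    {
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, item);
            File.AppendAllText(path, Convert.ToBase64String(ms.ToArray()) + Environment.NewLine);
        }
    }

    // Read the file back one line (one object) at a time.
    public static IEnumerable<T> ReadAll<T>(string path)
    {
        foreach (var line in File.ReadLines(path))
        {
            using (var ms = new MemoryStream(Convert.FromBase64String(line)))
            {
                yield return Serializer.Deserialize<T>(ms);
            }
        }
    }
}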
