Using BinaryWriter or BinaryReader in async code - c#

I have a list of floats to write to a file. The code below does the job, but it is synchronous.
List<float> samples = GetSamples();
using (FileStream stream = File.OpenWrite("somefile.bin"))
using (BinaryWriter binaryWriter = new BinaryWriter(stream, Encoding.Default, true))
{
    foreach (var sample in samples)
    {
        binaryWriter.Write(sample);
    }
}
I want to do the operation asynchronously, but BinaryWriter does not support async operations, which is understandable since it only writes a few bytes at a time. But most of the time the operation is file I/O, and I think it can and should be asynchronous.
I tried writing to a MemoryStream with the BinaryWriter and, when that finished, copying the MemoryStream to the FileStream with CopyToAsync; however, this caused a performance degradation (in total time) of up to 100% with big files.
How can I convert the whole operation to asynchronous?

Normal write operations usually end up being completed asynchronously anyway. The OS accepts writes immediately into the write cache and flushes them to disk at some later time. Your application isn't blocked by the actual disk writes.
Of course, if you are writing to a removable drive, then the write cache is typically disabled and your program will be blocked.
I would recommend dramatically reducing the number of write operations by transferring a large block at a time (a sketch follows the list below). To wit:
1. Allocate a new T[BlockSize] of your desired block size.
2. Allocate a new byte[BlockSize * sizeof(T)].
3. Use List<T>.CopyTo(index, buffer, 0, buffer.Length) to copy a batch out of the list.
4. Use Buffer.BlockCopy to get the data into the byte[].
5. Write the byte[] to your stream in a single operation.
6. Repeat steps 3-5 until you reach the end of the list. Careful about the final batch, which may be a partial block.
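A rough, untested sketch of those steps applied to the question's List<float>; the block size, the async FileStream options, and the variable names are my own choices:
// Untested sketch of the batching steps above. Block size is arbitrary;
// the final batch is handled by 'count', which may be smaller than BlockSize.
const int BlockSize = 8192;
List<float> samples = GetSamples();
var block = new float[BlockSize];
var bytes = new byte[BlockSize * sizeof(float)];
using (FileStream stream = new FileStream("somefile.bin", FileMode.Create,
    FileAccess.Write, FileShare.None, 4096, useAsync: true))
{
    for (int index = 0; index < samples.Count; index += BlockSize)
    {
        int count = Math.Min(BlockSize, samples.Count - index);
        samples.CopyTo(index, block, 0, count);                       // batch out of the list
        Buffer.BlockCopy(block, 0, bytes, 0, count * sizeof(float));  // float[] -> byte[]
        await stream.WriteAsync(bytes, 0, count * sizeof(float));     // one large async write
    }
}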

Your memory stream approach makes sense; just make sure to write in batches rather than waiting for the memory stream to grow to the full size of the file and then writing it all at once.
Something like this should work fine:
var data = new float[10 * 1024];
var helperBuffer = new byte[4096];
using (var fs = File.Create(@"D:\Temp.bin"))
using (var ms = new MemoryStream(4096))
using (var bw = new BinaryWriter(ms))
{
    var iteration = 0;
    foreach (var sample in data)
    {
        bw.Write(sample);
        iteration++;
        if (iteration == 1024)
        {
            iteration = 0;
            ms.Position = 0;
            ms.Read(helperBuffer, 0, 1024 * 4);
            await fs.WriteAsync(helperBuffer, 0, 1024 * 4).ConfigureAwait(false);
            ms.Position = 0; // rewind so the next batch overwrites the buffer instead of appending
        }
    }
    // If the number of samples is not a multiple of 1024, the final partial batch
    // still sitting in the MemoryStream must be flushed the same way.
}
This is just sample code - make sure to handle errors properly etc.

Sometimes, these helper classes are anything but helpful.
Try this:
List<float> samples = GetSamples();
using (FileStream stream = File.OpenWrite("somefile.bin"))
{
    foreach (var sample in samples)
    {
        await stream.WriteAsync(BitConverter.GetBytes(sample), 0, 4);
    }
}

Related

Memory Fragmentation with byte[] in C#

The C#/.NET application I am working on makes use of huge byte arrays and is having memory fragmentation issues. We checked memory usage using CLRMemory.
The code we use is as follows:
PdfLoadedDocument loadedDocument = new PdfLoadedDocument("myLoadedDocument.pdf");
// Operations on pdf document
using (var stream = new MemoryStream())
{
    loadedDocument.Save(stream);
    loadedDocument.Close(true);
    return stream.ToArray(); //byte[]
}
We use similar code in multiple places across our application, and we call it in a loop to generate bulk audits ranging from a few hundred to tens of thousands.
Now, is there a better way to handle this to avoid fragmentation?
And as part of audits, we also download large files from Amazon S3 using the following code
using (var client = new AmazonS3Client(_accessKey, _secretKey, _region))
{
    var getObjectRequest = new GetObjectRequest();
    getObjectRequest.BucketName = "bucketName";
    getObjectRequest.Key = "keyName";
    using (var downloadStream = new MemoryStream())
    {
        using (var response = await client.GetObjectAsync(getObjectRequest))
        {
            using (var responseStream = response.ResponseStream)
            {
                await responseStream.CopyToAsync(downloadStream);
            }
            return downloadStream.ToArray(); //byte[]
        }
    }
}
Is there a better alternative for downloading large files without them going to the LOH, which is taking a toll on the garbage collector?
There are two different things here:
the internals of MemoryStream
the usage of .ToArray()
For what happens inside MemoryStream: it is implemented as a simple byte[], but you can mitigate a lot of the overhead of that by using RecyclableMemoryStream instead via the Microsoft.IO.RecyclableMemoryStream nuget package, which re-uses buffers between independent usages.
For ToArray(), frankly: don't do that. When using vanilla MemoryStream, the better approach is TryGetBuffer(...), which gives you the oversized backing buffer, along with the start/end tokens:
if (!memStream.TryGetBuffer(out var segment))
throw new InvalidOperationException("Unable to obtain data segment; oops?");
// see segment.Offset, .Count, and .Array
It is then your job to not look outside those bounds. If you want to make that easier: consider treating the segment as a span (or memory) instead:
ReadOnlySpan<byte> muchSafer = segment;
// now you can't read out of bounds, and you don't need to apply the offset yourself
This TryGetBuffer(...) approach, however, does not work well with RecyclableMemoryStream - as it makes a defensive copy to prevent problems with independent data; in that scenario, you should treat the stream simply as a stream, i.e. Stream - just write to it, rewind it (Position = 0), and have the consumer read from it, then dispose it when they are done.
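For illustration, a minimal sketch of that stream-based usage, assuming the Microsoft.IO.RecyclableMemoryStream package (the manager would normally be one shared instance; the method and names are my own):
// Minimal sketch assuming the Microsoft.IO.RecyclableMemoryStream package
// (requires: using Microsoft.IO;). The manager is typically one shared instance.
private static readonly RecyclableMemoryStreamManager _streamManager =
    new RecyclableMemoryStreamManager();

public async Task<Stream> DownloadToStreamAsync(Stream responseStream)
{
    var ms = _streamManager.GetStream();   // rents pooled buffers instead of allocating new ones
    await responseStream.CopyToAsync(ms);
    ms.Position = 0;                       // rewind so the consumer can read from the start
    return ms;                             // the consumer disposes it when done
}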
As a side note: when reading (or writing) using the Stream API: consider using the array-pool for your scratch buffers; so instead of:
var buffer = new byte[1024];
int bytesRead;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{...}
instead try:
var buffer = ArrayPool<byte>.Shared.Rent(1024);
try
{
    int bytesRead;
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {...}
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}
In more advanced scenarios, it may be wise to use the pipelines API rather than the stream API; the point here is that pipelines allows discontiguous buffers, so you never need ridiculously large buffers even when dealing with complex scenarios. This is a niche API, however, and has very limited support in public APIs.
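For a flavour of what that looks like (System.IO.Pipelines on newer runtimes; the names, buffer size, and processing step are illustrative only):
// Very rough sketch of the pipelines idea: one side fills the pipe, the other
// consumes possibly discontiguous buffers. Error handling omitted.
var pipe = new Pipe();

async Task FillAsync(Stream source)                     // producer
{
    while (true)
    {
        Memory<byte> memory = pipe.Writer.GetMemory(4096);
        int read = await source.ReadAsync(memory);
        if (read == 0) break;
        pipe.Writer.Advance(read);
        await pipe.Writer.FlushAsync();
    }
    pipe.Writer.Complete();
}

async Task DrainAsync()                                 // consumer
{
    while (true)
    {
        ReadResult result = await pipe.Reader.ReadAsync();
        ReadOnlySequence<byte> buffer = result.Buffer;  // may span several segments
        // ... process 'buffer' here ...
        pipe.Reader.AdvanceTo(buffer.End);
        if (result.IsCompleted) break;
    }
    pipe.Reader.Complete();
}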

How to correctly benchmark I/O performance in CSharp in Mono on Linux?

I'd like to benchmark I/O performance for a hard disk and I'd like to do that in C#. I've done that before in other programming languages and I'm doing that now in C# in the context of some evaluation of API performance in programming languages.
What I am doing is creating a large buffer of 1 MByte filled with random data. I then write the buffer repeatedly to a file using a regular FileStream, creating many GBytes of data in seconds. Strangely, whatever I do - whether I invoke Flush() during writes, at the end of the writes, or even create a write-through stream - I get results much better than theoretically possible: at least about three times faster than the I/O this disk is capable of. The logical conclusion is that the I/O is still asynchronous even though I invoke Flush() or use the write-through option during stream creation.
That brings me to the conclusion that Flush() on a FileStream is not implemented in Mono. My question now is: how can I perform unbuffered file I/O while ensuring that all data is written as expected? Am I missing something, or is there some special trick I'm not aware of?
using System;
using System.IO;
using System.Text;
namespace Test
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            byte[] mem = new byte[1024 * 1024];
            Random r = new Random();
            r.NextBytes(mem);
            FileStream fs = new FileStream("/home/myuser/benchmark.tmp", FileMode.Create, FileAccess.Write, FileShare.None);
            DateTime t1 = DateTime.Now;
            for (int i = 0; i < 2048; i++) {
                fs.Write(mem, 0, mem.Length);
                fs.Flush();
            }
            fs.Flush();
            DateTime t2 = DateTime.Now;
            fs.Close();
            double t = (1024.0 * 1024 * 2048) / (t2 - t1).TotalSeconds;
            Console.WriteLine(t + " bytes/s");
            t = t / (1024 * 1024);
            Console.WriteLine(t + " mbytes/s");
        }
    }
}
This writes 2 GByte of data.
Output:
3000980511.29551 bytes/s
2861.95803765823 mbytes/s
The disk is capable of doing about 500 mbyte/s so Flush() seems to have no effect.
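For reference, this is how a write-through stream is normally requested on .NET; whether Mono actually honors these options is exactly what is in question here (a sketch, not a verified fix):
// Sketch only: FileOptions.WriteThrough requests write-through at creation time,
// and Flush(true) (available since .NET 4) additionally asks the OS to flush its
// cache to disk. Whether Mono implements either is the open question.
using (var fs = new FileStream("/home/myuser/benchmark.tmp", FileMode.Create,
    FileAccess.Write, FileShare.None, 1024 * 1024, FileOptions.WriteThrough))
{
    fs.Write(mem, 0, mem.Length);
    fs.Flush(true);
}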

File Chunking Performance in C#

I am trying to empower users to upload large files. Before I upload a file, I want to chunk it up. Each chunk needs to be a C# object. The reason is for logging purposes. It's a long story, but I need to create actual C# objects that represent each file chunk. Regardless, I'm trying the following approach:
public static List<FileChunk> GetAllForFile(byte[] fileBytes)
{
    List<FileChunk> chunks = new List<FileChunk>();
    if (fileBytes.Length > 0)
    {
        FileChunk chunk = new FileChunk();
        for (int i = 0; i < (fileBytes.Length / 512); i++)
        {
            chunk.Number = (i + 1);
            chunk.Offset = (i * 512);
            chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();
            chunks.Add(chunk);
            chunk = new FileChunk();
        }
    }
    return chunks;
}
Unfortunately, this approach seems to be incredibly slow. Does anyone know how I can improve the performance while still creating objects for each chunk?
thank you
I suspect this is going to hurt a little:
chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();
Try this instead:
byte[] buffer = new byte[512];
Buffer.BlockCopy(fileBytes, chunk.Offset, buffer, 0, 512);
chunk.Bytes = buffer;
(Code not tested)
And the reason this code is likely slow is that Skip doesn't do anything special for arrays (though it could). This means that every pass through your loop iterates over the first 512*n items in the array, which results in O(n^2) performance where you should just be seeing O(n).
Try something like this (untested code):
public static List<FileChunk> GetAllForFile(string fileName)
{
    var chunks = new List<FileChunk>();
    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        int i = 0;
        while (stream.Position < stream.Length)
        {
            var chunk = new FileChunk();
            chunk.Number = i;
            chunk.Offset = (i * 512);
            byte[] buffer = new byte[512];
            int bytesRead = stream.Read(buffer, 0, 512);
            if (bytesRead < 512)
            {
                Array.Resize(ref buffer, bytesRead);
            }
            chunk.Bytes = buffer;
            chunks.Add(chunk);
            i++;
        }
    }
    return chunks;
}
The above code skips several steps in your process, preferring to read the bytes from the file directly.
Note that, if the file is not an even multiple of 512, the last chunk will contain less than 512 bytes.
Same as Robert Harvey's answer, but using a BinaryReader; that way I don't need to specify an offset. If you use a BinaryWriter on the other end to reassemble the file, you won't need the Offset member of FileChunk.
public static List<FileChunk> GetAllForFile(string fileName) {
    var chunks = new List<FileChunk>();
    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    using (BinaryReader reader = new BinaryReader(stream)) {
        int i = 0;
        bool eof = false;
        while (!eof) {
            var chunk = new FileChunk();
            chunk.Number = i;
            chunk.Offset = (i * 512);
            chunk.Bytes = reader.ReadBytes(512);
            chunks.Add(chunk);
            i++;
            if (chunk.Bytes.Length < 512) { eof = true; }
        }
    }
    return chunks;
}
Have you thought about what you're going to do to compensate for packet loss and data corruption?
Since you mentioned that the load is taking a long time, I would use asynchronous file reading to speed up the loading process. The hard disk is the slowest component of a computer. Google does asynchronous reads and writes in Google Chrome to improve their load times. I had to do something like this in C# in a previous job.
The idea would be to spawn several asynchronous requests over different parts of the file. Then when a request comes in, take the byte array and create your FileChunk objects taking 512 bytes at a time. There are several benefits to this:
If you have this run in a separate thread, then you won't have the whole program waiting to load the large file you have.
You can process a byte array, creating FileChunk objects, while the hard disk is still trying to fulfill read requests on other parts of the file.
You will save on RAM space if you limit the number of pending read requests you can have. This causes less page faulting to the hard disk and uses the RAM and CPU cache more efficiently, which speeds up processing further.
You would want to use the following methods in the FileStream class.
[HostProtectionAttribute(SecurityAction.LinkDemand, ExternalThreading = true)]
public virtual IAsyncResult BeginRead(
byte[] buffer,
int offset,
int count,
AsyncCallback callback,
Object state
)
public virtual int EndRead(
IAsyncResult asyncResult
)
Also this is what you will get in the asyncResult:
// Extract the FileStream (state) out of the IAsyncResult object
FileStream fs = (FileStream) ar.AsyncState;
// Get the result
Int32 bytesRead = fs.EndRead(ar);
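A minimal, untested sketch of how those pieces could be wired together; the file name, buffer size, and the chunk-slicing step are placeholders:
// Untested sketch of the BeginRead/EndRead pattern described above.
FileStream fs = new FileStream("largefile.dat", FileMode.Open, FileAccess.Read,
    FileShare.Read, 4096, useAsync: true);
byte[] buffer = new byte[512 * 64]; // one request covers 64 chunks' worth of data

fs.BeginRead(buffer, 0, buffer.Length, ar =>
{
    // Extract the FileStream (state) out of the IAsyncResult object
    FileStream stream = (FileStream)ar.AsyncState;
    int bytesRead = stream.EndRead(ar);

    // Slice 'buffer' into 512-byte FileChunk objects here, then issue the
    // next BeginRead (at the next offset) until bytesRead is 0.
}, fs);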
Here is some reference material for you to read.
This is a code sample of working with Asynchronous File I/O Models.
This is a MS documentation reference for Asynchronous File I/O.

How to hash a single file multiple ways at same time?

I'm trying to design a simple application to be used for calculating a file's CRC32/md5/sha1/sha256/sha384/sha512, and I've run into a bit of a roadblock. This is being done in C#.
I would like to be able to do this as efficiently as possible, so my original thought was to read the file into a MemoryStream first before processing, but I soon found out that very large files cause me to run out of memory very quickly. So it would seem that I have to use a FileStream instead. The problem, as I see it, is that only one hash function can be run at a time, and doing so with a FileStream will take a while for each hash to complete.
How might I go about reading a small bit of a file into memory, processing it with all 6 algorithms, and then going on to another chunk... Or does hashing not work that way?
This was my original attempt at reading a file into memory. It failed when I tried to read a CD image into memory prior to running the hashing algorithms on the memorystream:
private void ReadToEndOfFile(string filename)
{
    if (File.Exists(filename))
    {
        FileInfo fi = new FileInfo(filename);
        FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read);
        byte[] buffer = new byte[16 * 1024];
        //double step = Math.Floor((double)fi.Length / (double)100);
        this.toolStripStatusLabel1.Text = "Reading File...";
        this.toolStripProgressBar1.Maximum = (int)(fs.Length / buffer.Length);
        this.toolStripProgressBar1.Value = 0;
        using (MemoryStream ms = new MemoryStream())
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                ms.Write(buffer, 0, read);
                this.toolStripProgressBar1.Value += 1;
            }
            _ms = ms;
        }
    }
}
You're most of the way there; you just don't need to read the whole thing into memory at once.
All of the hashes in .NET derive from the HashAlgorithm class. This has two methods on it: TransformBlock and TransformFinalBlock. So, you should be able to read a chunk from your file, stuff it into the TransformBlock method of whichever hashes you want to use, and then move on to the next block. Just remember to call TransformFinalBlock for your last chunk from the file, as that is what gets you the byte array containing the hash.
For now, I would just do each hash one at a time, until it's working, then worry about running the hashes concurrently (using something like the Task Parallel Library)
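A minimal sketch of that incremental approach, feeding the same chunk to several HashAlgorithm instances (untested; the file name and buffer size are placeholders, error handling omitted):
// Feed each chunk to every HashAlgorithm via TransformBlock,
// then finish each one with TransformFinalBlock.
var algorithms = new HashAlgorithm[] { MD5.Create(), SHA1.Create(), SHA256.Create() };
byte[] buffer = new byte[16 * 1024];
using (FileStream fs = File.OpenRead("someimage.iso"))
{
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        foreach (var algorithm in algorithms)
            algorithm.TransformBlock(buffer, 0, read, null, 0);
    }
}
foreach (var algorithm in algorithms)
{
    algorithm.TransformFinalBlock(new byte[0], 0, 0);
    // algorithm.Hash now contains the finished hash value
}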
Hash algorithms are designed in a way that you can calculate the hash value incrementally. You can find a C#/.NET example for that here. You can easily modify the provided code to update multiple hash algorithm instances in each step.
This might be a great opportunity to get your feet wet with the TPL data flow objects. Read the file in one thread and post the data to a BroadcastBlock<T>. The BroadcastBlock<T> will be linked to 6 different ActionBlock<T> instances. Each ActionBlock<T> will correspond to one of your 6 hash strategies.
var broadcast = new BroadcastBlock<byte[]>(x => x);
var strategy1 = new ActionBlock<byte[]>(input => DoHash(input, SHA1.Create()));
var strategy2 = new ActionBlock<byte[]>(input => DoHash(input, MD5.Create()));
// Create the other 4 strategies.
broadcast.LinkTo(strategy1);
broadcast.LinkTo(strategy2);
// Link the other 4.
using (var fs = File.Open(@"yourfile.txt", FileMode.Open, FileAccess.Read))
using (var br = new BinaryReader(fs))
{
    while (br.PeekChar() != -1)
    {
        broadcast.Post(br.ReadBytes(1024 * 16));
    }
}
The BroadcastBlock<T> will forward each chunk of data to all linked ActionBlock<T> instances.
Since your question focused more on how to get this all to occur concurrently I will leave the implementation of DoHash up to you.
private void DoHash(byte[] input, HashAlgorithm algorithm)
{
    // You will need to implement this. To hash a file incrementally, keep one
    // HashAlgorithm instance alive per ActionBlock (create it outside the lambda
    // rather than per chunk), feed each chunk to TransformBlock, and call
    // TransformFinalBlock after the last chunk to obtain the hash.
}

Best way to read a large file into a byte array in C#?

I have a web server which will read large binary files (several megabytes) into byte arrays. The server could be reading several files at the same time (different page requests), so I am looking for the most optimized way for doing this without taxing the CPU too much. Is the code below good enough?
public byte[] FileToByteArray(string fileName)
{
    byte[] buff = null;
    FileStream fs = new FileStream(fileName,
                                   FileMode.Open,
                                   FileAccess.Read);
    BinaryReader br = new BinaryReader(fs);
    long numBytes = new FileInfo(fileName).Length;
    buff = br.ReadBytes((int) numBytes);
    return buff;
}
Simply replace the whole thing with:
return File.ReadAllBytes(fileName);
However, if you are concerned about memory consumption, you should not read the whole file into memory all at once; you should do it in chunks.
I might argue that the answer here generally is "don't". Unless you absolutely need all the data at once, consider using a Stream-based API (or some variant of reader / iterator). That is especially important when you have multiple parallel operations (as suggested by the question) to minimise system load and maximise throughput.
For example, if you are streaming data to a caller:
Stream dest = ...
using(Stream source = File.OpenRead(path)) {
    byte[] buffer = new byte[2048];
    int bytesRead;
    while((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0) {
        dest.Write(buffer, 0, bytesRead);
    }
}
I would think this:
byte[] file = System.IO.File.ReadAllBytes(fileName);
Your code can be factored to this (in lieu of File.ReadAllBytes):
public byte[] ReadAllBytes(string fileName)
{
    byte[] buffer = null;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        buffer = new byte[fs.Length];
        fs.Read(buffer, 0, (int)fs.Length);
    }
    return buffer;
}
Note the Int32.MaxValue file-size limitation imposed by the Read method; in other words, you can only read a 2 GB chunk at once.
Also note that one of the FileStream constructor overloads takes a buffer size as its last argument.
I would also suggest reading about FileStream and BufferedStream.
As always, a simple sample program that profiles each approach will be most useful for determining which is fastest.
Also, your underlying hardware will have a large effect on performance. Are you using server-based hard disk drives with large caches and a RAID card with an onboard memory cache? Or are you using a standard drive connected to the IDE port?
Depending on the frequency of operations, the size of the files, and the number of files you're looking at, there are other performance issues to take into consideration. One thing to remember is that each of your byte arrays will be released at the mercy of the garbage collector. If you're not caching any of that data, you could end up creating a lot of garbage and losing most of your performance to % Time in GC. If the chunks are larger than 85K, you'll be allocating to the Large Object Heap (LOH), which requires a collection of all generations to free up (this is very expensive, and on a server will stop all execution while it's going on). Additionally, if you have a ton of objects on the LOH, you can end up with LOH fragmentation (the LOH is never compacted), which leads to poor performance and out-of-memory exceptions. You can recycle the process once you hit a certain point, but I don't know if that's a best practice.
The point is, you should consider the full life cycle of your app before necessarily just reading all the bytes into memory the fastest way possible or you might be trading short term performance for overall performance.
I'd say BinaryReader is fine, but can be refactored to this, instead of all those lines of code for getting the length of the buffer:
public byte[] FileToByteArray(string fileName)
{
    byte[] fileData = null;
    using (FileStream fs = File.OpenRead(fileName))
    {
        using (BinaryReader binaryReader = new BinaryReader(fs))
        {
            fileData = binaryReader.ReadBytes((int)fs.Length);
        }
    }
    return fileData;
}
This should be better than using .ReadAllBytes(), since in the comments on the top response (the one that uses .ReadAllBytes()) a commenter reported problems with files larger than 600 MB, and a BinaryReader is meant for this sort of thing. Also, putting it in a using statement ensures the FileStream and BinaryReader are closed and disposed.
In case 'a large file' means beyond the 4 GB limit, the following code logic is appropriate. The key issue to notice is the LONG data type used with the Seek method, as a LONG is able to point beyond the 2^32 data boundary.
In this example, the code first processes the large file in chunks of 1 GB; after the whole 1 GB chunks are processed, the leftover (< 1 GB) bytes are processed. I use this code to calculate the CRC of files beyond the 4 GB size.
(using https://crc32c.machinezoo.com/ for the crc32c calculation in this example)
private uint Crc32CAlgorithmBigCrc(string fileName)
{
    uint hash = 0;
    byte[] buffer = null;
    FileInfo fileInfo = new FileInfo(fileName);
    long fileLength = fileInfo.Length;
    int blockSize = 1024000000;
    int blocks = (int)(fileLength / blockSize);
    int restBytes = (int)(fileLength - ((long)blocks * blockSize)); // long math so large files don't overflow
    long offsetFile = 0;
    uint interHash = 0;
    Crc32CAlgorithm Crc32CAlgorithm = new Crc32CAlgorithm();
    bool firstBlock = true;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        buffer = new byte[blockSize];
        using (BinaryReader br = new BinaryReader(fs))
        {
            while (blocks > 0)
            {
                blocks -= 1;
                fs.Seek(offsetFile, SeekOrigin.Begin);
                buffer = br.ReadBytes(blockSize);
                if (firstBlock)
                {
                    firstBlock = false;
                    interHash = Crc32CAlgorithm.Compute(buffer);
                    hash = interHash;
                }
                else
                {
                    interHash = Crc32CAlgorithm.Append(interHash, buffer); // carry the running CRC forward
                    hash = interHash;
                }
                offsetFile += blockSize;
            }
            if (restBytes > 0)
            {
                Array.Resize(ref buffer, restBytes);
                fs.Seek(offsetFile, SeekOrigin.Begin);
                buffer = br.ReadBytes(restBytes);
                hash = Crc32CAlgorithm.Append(interHash, buffer);
            }
            buffer = null;
        }
    }
    //MessageBox.Show(hash.ToString());
    //MessageBox.Show(hash.ToString("X"));
    return hash;
}
Overview: if your image is added as an embedded resource (Build Action = Embedded Resource), then use GetExecutingAssembly to retrieve the jpg resource into a stream, then read the binary data in the stream into a byte array.
public byte[] GetAImage()
{
    byte[] bytes = null;
    var assembly = Assembly.GetExecutingAssembly();
    var resourceName = "MYWebApi.Images.X_my_image.jpg";
    using (Stream stream = assembly.GetManifestResourceStream(resourceName))
    {
        bytes = new byte[stream.Length];
        stream.Read(bytes, 0, (int)stream.Length);
    }
    return bytes;
}
Use the BufferedStream class in C# to improve performance. A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance.
See the following for a code example and additional explanation:
http://msdn.microsoft.com/en-us/library/system.io.bufferedstream.aspx
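For illustration, a minimal sketch of the BufferedStream idea; the file name and the 64 KB buffer size are arbitrary choices of mine, and it assumes the file contains whole 4-byte floats:
// A BufferedStream batches many small reads into fewer, larger OS calls;
// each 4-byte ReadSingle below is served from the in-memory buffer.
using (var fs = File.OpenRead("somefile.bin"))
using (var bs = new BufferedStream(fs, 64 * 1024))
using (var reader = new BinaryReader(bs))
{
    while (bs.Position < fs.Length)
    {
        float sample = reader.ReadSingle();
        // ... use the sample ...
    }
}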
use this:
bytesRead = responseStream.ReadAsync(buffer, 0, buffer.Length).Result;
I would recommend trying the Response.TransmitFile() method, then a Response.Flush() and Response.End(), for serving your large files.
If you're dealing with files above 2 GB, you'll find that the above methods fail.
It's much easier just to hand the stream off to MD5 and allow that to chunk your file for you:
private byte[] computeFileHash(string filename)
{
    MD5 md5 = MD5.Create();
    using (FileStream fs = new FileStream(filename, FileMode.Open))
    {
        byte[] hash = md5.ComputeHash(fs);
        return hash;
    }
}
