We're trying to compare the performance of reading a series of files with sync methods versus async. We expected the two to take about the same time, but it turns out the async version is about 5.5x slower.
This might be due to the overhead of managing the threads, but we just wanted a second opinion. Maybe we're measuring the timings wrong.
These are the methods being tested:
static void ReadAllFile(string filename)
{
    var content = File.ReadAllBytes(filename);
}

static async Task ReadAllFileAsync(string filename)
{
    using (var file = File.OpenRead(filename))
    {
        using (var ms = new MemoryStream())
        {
            byte[] buff = new byte[file.Length];
            await file.ReadAsync(buff, 0, (int)file.Length);
        }
    }
}
And this is the method that runs them and starts the stopwatch:
static void Test(string name, Func<string, Task> gettask, int count)
{
    Stopwatch sw = new Stopwatch();
    Task[] tasks = new Task[count];
    sw.Start();
    for (int i = 0; i < count; i++)
    {
        string filename = "file" + i + ".bin";
        tasks[i] = gettask(filename);
    }
    Task.WaitAll(tasks);
    sw.Stop();
    Console.WriteLine(name + " {0} ms", sw.ElapsedMilliseconds);
}
Which is all run from here:
static void Main(string[] args)
{
    int count = 10000;
    for (int i = 0; i < count; i++)
    {
        Write("file" + i + ".bin");
    }
    Console.WriteLine("Testing read...!");
    Test("Read Contents", (filename) => Task.Run(() => ReadAllFile(filename)), count);
    Test("Read Contents Async", (filename) => ReadAllFileAsync(filename), count);
    Console.ReadKey();
}
And the helper write method:
static void Write(string filename)
{
    Data obj = new Data()
    {
        Header = "random string size here"
    };
    int size = 1024 * 20; // 1024 * 256;
    obj.Body = new byte[size];
    for (var i = 0; i < size; i++)
    {
        obj.Body[i] = (byte)(i % 256);
    }
    Stopwatch sw = new Stopwatch();
    sw.Start();
    MemoryStream ms = new MemoryStream();
    Serializer.Serialize(ms, obj);
    ms.Position = 0;
    using (var file = File.Create(filename))
    {
        ms.CopyToAsync(file).Wait();
    }
    sw.Stop();
    //Console.WriteLine("Writing file {0}", sw.ElapsedMilliseconds);
}
The results:
-Read Contents 574 ms
-Read Contents Async 3160 ms
We'd really appreciate it if anyone could shed some light on this; we've searched Stack Overflow and the web but can't find a proper explanation.
There are lots of things wrong with the testing code. Most notably, your "async" test does not use async I/O; with file streams, you have to explicitly open them as asynchronous or else you're just doing synchronous operations on a background thread. Also, your file sizes are very small and can be easily cached.
I modified the test code to write out much larger files, to have comparable sync vs async code, and to make the async code asynchronous:
static void Main(string[] args)
{
    Write("0.bin");
    Write("1.bin");
    Write("2.bin");

    ReadAllFile("2.bin"); // warmup
    var sw = new Stopwatch();
    sw.Start();
    ReadAllFile("0.bin");
    ReadAllFile("1.bin");
    ReadAllFile("2.bin");
    sw.Stop();
    Console.WriteLine("Sync: " + sw.Elapsed);

    ReadAllFileAsync("2.bin").Wait(); // warmup
    sw.Restart();
    ReadAllFileAsync("0.bin").Wait();
    ReadAllFileAsync("1.bin").Wait();
    ReadAllFileAsync("2.bin").Wait();
    sw.Stop();
    Console.WriteLine("Async: " + sw.Elapsed);

    Console.ReadKey();
}
static void ReadAllFile(string filename)
{
    // last argument (useAsync) is false: a plain synchronous file handle
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, false))
    {
        byte[] buff = new byte[file.Length];
        file.Read(buff, 0, (int)file.Length);
    }
}

static async Task ReadAllFileAsync(string filename)
{
    // last argument (useAsync) is true: the handle is opened for asynchronous I/O
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, true))
    {
        byte[] buff = new byte[file.Length];
        await file.ReadAsync(buff, 0, (int)file.Length);
    }
}
static void Write(string filename)
{
    int size = 1024 * 1024 * 256;
    var data = new byte[size];
    var random = new Random();
    random.NextBytes(data);
    File.WriteAllBytes(filename, data);
}
On my machine, this test (built in Release, run outside the debugger) yields these numbers:
Sync: 00:00:00.4461936
Async: 00:00:00.4429566
All I/O operations are asynchronous at the OS level; with a synchronous call the thread just waits (it gets suspended) for the I/O operation to finish. That's why, when you read Jeffrey Richter, he always tells you to do I/O asynchronously, so your thread is not wasted waiting around.
From Jeffrey Richter:
Also, creating a thread is not cheap. Each thread gets 1 MB of address space reserved for user mode and another 12 KB for kernel mode. After this, the OS has to notify all the DLLs in the system that a new thread has been spawned. The same happens when you destroy a thread. Also think about the complexities of context switching.
Found a great SO answer here
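To make the point concrete, here is a minimal sketch (not from the book or from the answers above) of a read that really is asynchronous; the method name ReadTrulyAsync is made up for illustration. Because the handle is opened with FileOptions.Asynchronous, the await releases the calling thread while the OS completes the read, so no thread sits blocked on the I/O.

static async Task<byte[]> ReadTrulyAsync(string path)
{
    // FileOptions.Asynchronous asks for an overlapped (async) handle; without it,
    // ReadAsync on a FileStream just runs a synchronous read on a pool thread.
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, FileOptions.Asynchronous))
    {
        var buffer = new byte[stream.Length];
        int total = 0;
        while (total < buffer.Length)
        {
            // the calling thread is free between starting the read and its completion
            int read = await stream.ReadAsync(buffer, total, buffer.Length - total);
            if (read == 0) break; // unexpected end of file
            total += read;
        }
        return buffer;
    }
}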
Related
I'm trying to transfer a large file in "chunks" that then have their hashes validated. I am looking into some performance issues, particularly with UNC paths, and I wrote an IO test that exhibits strange behavior.
Here is the code:
string path = "\\\\unc\\path\\test.txt";
long fileSize = 1000000000;
int chunkSize = 1000000;
if (File.Exists(path))
{
File.Delete(path);
}
using (FileStream fs = File.Create(path))
{
fs.SetLength(fileSize);
}
byte[] data = new byte[chunkSize];
for (long i = 0; i < fileSize; i+= chunkSize)
{
for (int j = 0; j < chunkSize; j++)
{
data[j] = (byte)i; // this is just to write different data each time
}
int thisChunkSize = (int)Math.Min(fileSize - i, chunkSize);
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
{
fs.Lock(i, thisChunkSize);
fs.Seek(i, SeekOrigin.Begin);
fs.Write(data, 0, thisChunkSize);
//fs.Seek(i, SeekOrigin.Begin);
//fs.Read(data, 0, thisChunkSize);
}
using (SHA1 sha1 = SHA1.Create())
{
sha1.ComputeHash(data);
}
}
Running the code as-is, it completes in about 2.5 minutes. When I uncomment the fs.Seek and fs.Read, it completes in about 30 seconds. Running on a local path, it takes about 6.5 seconds either way.
My main theory is that there is some OS bottleneck that is slowing me down when repeatedly opening and closing FileStreams back-to-back. Is there any explanation for why a more expensive operation would result in better performance?
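For comparison, here is a hedged sketch (not part of the original test) of the same chunked write with a single FileStream kept open for the whole transfer; it uses the same path, fileSize, and chunkSize variables, and calls FileStream.Unlock so each byte-range lock is released before the next chunk.

byte[] data = new byte[chunkSize];
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
{
    for (long i = 0; i < fileSize; i += chunkSize)
    {
        int thisChunkSize = (int)Math.Min(fileSize - i, chunkSize);
        fs.Lock(i, thisChunkSize);
        fs.Seek(i, SeekOrigin.Begin);
        fs.Write(data, 0, thisChunkSize);
        fs.Unlock(i, thisChunkSize); // release the byte range before locking the next one

        using (SHA1 sha1 = SHA1.Create())
        {
            sha1.ComputeHash(data, 0, thisChunkSize);
        }
    }
}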
I'm using the method below for asynchronous file copy with progress notifications and cancellation. It works well for copying files locally, but for cross-drive copies its performance drops, because at any given moment only one drive is working. The worst case is copying large files from an SSD to a slow flash drive, or vice versa.
Can anybody advise a better solution? Maybe something based on the producer-consumer pattern (there is a rough sketch after the code below), or an existing library? (I have searched, but without result.)
P.S.: This is not a method for direct use; it is wrapped by others that prepare the file list and choose bufferSize.
private static async Task<long> CopyFileAsync(
    [NotNull] string sourcePath,
    [NotNull] string destPath,
    [NotNull] IProgress<FileCopyProgress> progress,
    CancellationToken cancellationToken,
    long bufferSize = 1024 * 1024 * 10)
{
    if (bufferSize <= 0)
    {
        throw new ArgumentException(nameof(bufferSize));
    }

    long totalRead = 0;
    long fileSize;
    var buffer = new byte[bufferSize];

    using (var reader = File.Open(sourcePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        fileSize = reader.Length;
        using (var writer = File.Create(destPath, Convert.ToInt32(bufferSize), FileOptions.Asynchronous))
        {
            while (totalRead < fileSize)
            {
                var readCount = await reader.ReadAsync(buffer, 0, Convert.ToInt32(bufferSize), cancellationToken).ConfigureAwait(false);
                await writer.WriteAsync(buffer, 0, readCount, cancellationToken).ConfigureAwait(false);
                totalRead += readCount;
                progress.Report(new FileCopyProgress(totalRead, fileSize, null));
                cancellationToken.ThrowIfCancellationRequested();
            }
        }
    }

    progress.Report(new FileCopyProgress(fileSize, fileSize, null));
    return fileSize;
}
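Not a drop-in replacement, but a hedged sketch of the producer-consumer idea mentioned above (the name CopyFilePipelinedAsync and the bound of four queued buffers are made up, and progress, cancellation, and error handling are omitted): one task reads chunks from the source into a bounded queue while another task writes them to the destination, so both drives can work at the same time.

private static async Task CopyFilePipelinedAsync(string sourcePath, string destPath, int bufferSize = 1024 * 1024)
{
    using (var queue = new System.Collections.Concurrent.BlockingCollection<byte[]>(boundedCapacity: 4))
    using (var reader = new FileStream(sourcePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize, FileOptions.Asynchronous))
    using (var writer = new FileStream(destPath, FileMode.Create, FileAccess.Write, FileShare.None, bufferSize, FileOptions.Asynchronous))
    {
        // producer: read chunks from the source drive and queue them
        var producer = Task.Run(async () =>
        {
            while (true)
            {
                var buffer = new byte[bufferSize];
                int read = await reader.ReadAsync(buffer, 0, bufferSize).ConfigureAwait(false);
                if (read == 0) break;
                if (read < bufferSize) Array.Resize(ref buffer, read);
                queue.Add(buffer); // blocks when four chunks are already waiting
            }
            queue.CompleteAdding();
        });

        // consumer: write queued chunks to the destination drive
        var consumer = Task.Run(async () =>
        {
            foreach (var chunk in queue.GetConsumingEnumerable())
            {
                await writer.WriteAsync(chunk, 0, chunk.Length).ConfigureAwait(false);
            }
        });

        await Task.WhenAll(producer, consumer).ConfigureAwait(false);
    }
}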
I need to read data from a file, process it, and write the result to another file. I use a BackgroundWorker to show the progress, and I wrote something like this to use in the BackgroundWorker's DoWork event:
private void ProcData(string fileToRead, string fileToWrite)
{
    byte[] buffer = new byte[4 * 1024];
    // fileToRead & fileToWrite have the same size
    FileInfo fileInfo = new FileInfo(fileToRead);

    using (FileStream streamReader = new FileStream(fileToRead, FileMode.Open))
    using (BinaryReader binaryReader = new BinaryReader(streamReader))
    using (FileStream streamWriter = new FileStream(fileToWrite, FileMode.Open))
    using (BinaryWriter binaryWriter = new BinaryWriter(streamWriter))
    {
        while (streamWriter.Position < fileInfo.Length)
        {
            if (streamWriter.Position + buffer.Length > fileInfo.Length)
            {
                buffer = new byte[fileInfo.Length - streamWriter.Position];
            }
            // read
            buffer = binaryReader.ReadBytes(buffer.Length);
            // process
            Proc(buffer);
            // write
            binaryWriter.Write(buffer);
            // report if percentage changed
            // ...
        }
    }
}
But it is about 5x slower than just reading from fileToRead and writing to fileToWrite, so I thought about threading. I read some questions on the site and tried something like this, based on this question:
private void ProcData2(string fileToRead, string fileToWrite)
{
    int threadNumber = 4; // for example
    Task[] tasks = new Task[threadNumber];
    long[] startByte = new long[threadNumber];
    long[] length = new long[threadNumber];
    // divide the file into threadNumber (4) parts
    // and update startByte & length

    var parentTask = Task.Run(() =>
    {
        for (int i = 0; i < threadNumber; i++)
        {
            // NOTE: the lambda captures the loop variable i itself, so by the time a
            // task actually runs, i may already have been incremented; copy it to a
            // local (e.g. int index = i;) before using it inside the lambda.
            tasks[i] = Task.Factory.StartNew(() =>
            {
                Proc2(fileToRead, fileToWrite, startByte[i], length[i]);
            });
        }
    });
    parentTask.Wait();
    Task.WaitAll(tasks);
}
//
private void Proc2(string fileToRead, string fileToWrite, long fileStartByte, long partLength)
{
    byte[] buffer = new byte[4 * 1024];

    using (FileStream streamReader = new FileStream(fileToRead, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (BinaryReader binaryReader = new BinaryReader(streamReader))
    using (FileStream streamWriter = new FileStream(fileToWrite, FileMode.Open, FileAccess.Write, FileShare.Write))
    using (BinaryWriter binaryWriter = new BinaryWriter(streamWriter))
    {
        streamReader.Seek(fileStartByte, SeekOrigin.Begin);
        streamWriter.Seek(fileStartByte, SeekOrigin.Begin);
        while (streamWriter.Position < fileStartByte + partLength)
        {
            if (streamWriter.Position + buffer.Length > fileStartByte + partLength)
            {
                buffer = new byte[fileStartByte + partLength - streamWriter.Position];
            }
            // read
            buffer = binaryReader.ReadBytes(buffer.Length);
            // process
            Proc(buffer);
            // write
            binaryWriter.Write(buffer);
            // report if percentage changed
            // ...
        }
    }
}
But I think it has some problems, and each time a task switches it needs to seek again. I thought about reading the file, using threading for Proc(), and then writing the result, but it seems wrong. How can I do this properly (read a buffer from one file, process it, and write it to another file using tasks)?
//===================================================================
Based on Pete Kirkham's post I modified my method. I do not know why, but it did not work for me. I am adding the new method here in case it helps someone. Thanks, everybody.
private void ProcData3(string fileToRead, string fileToWrite)
{
    int bufferSize = 4 * 1024;
    int threadNumber = 4; // example
    List<byte[]> bufferPool = new List<byte[]>();
    Task[] tasks = new Task[threadNumber];
    // fileToRead & fileToWrite have the same size
    FileInfo fileInfo = new FileInfo(fileToRead);

    using (FileStream streamReader = new FileStream(fileToRead, FileMode.Open))
    using (BinaryReader binaryReader = new BinaryReader(streamReader))
    using (FileStream streamWriter = new FileStream(fileToWrite, FileMode.Open))
    using (BinaryWriter binaryWriter = new BinaryWriter(streamWriter))
    {
        while (streamWriter.Position < fileInfo.Length)
        {
            // read
            // NOTE: bufferPool is never cleared between iterations of the outer while
            // loop, so it keeps growing; on later passes bufferPool.Count exceeds
            // tasks.Length (index out of range) and the write loop below would also
            // rewrite buffers from earlier passes. Clearing bufferPool at the top of
            // the loop would address both problems.
            for (int g = 0; g < threadNumber; g++)
            {
                if (streamWriter.Position + bufferSize <= fileInfo.Length)
                {
                    bufferPool.Add(binaryReader.ReadBytes(bufferSize));
                }
                else
                {
                    bufferPool.Add(binaryReader.ReadBytes((int)(fileInfo.Length - streamWriter.Position)));
                    break;
                }
            }
            // process
            var parentTask = Task.Run(() =>
            {
                for (int th = 0; th < bufferPool.Count; th++)
                {
                    int index = th;
                    // one task per buffer
                    tasks[index] = Task.Factory.StartNew(() =>
                    {
                        Proc(bufferPool[index]);
                    });
                }
            });
            // wait for the parent task to queue the children
            parentTask.Wait();
            // wait until all tasks are done
            Task.WaitAll(tasks);
            // write
            for (int g = 0; g < bufferPool.Count; g++)
            {
                binaryWriter.Write(bufferPool[g]);
            }
            // report if percentage changed
            // ...
        }
    }
}
Essentially you want to split the processing of the data into parallel tasks, but you don't want to split the IO up.
How this happens depends on the size of your data. If it is small enough to fit into memory, then you can read it all into an input array and create an output array, then create tasks that each process part of the input array and populate the corresponding part of the output array, then write the whole output array to the file.
If the data is too large for this, then you need to put a limit on the amount of data read and written at a time. So you have your main flow, which starts off by reading N blocks of data and creating N tasks to process them. You then wait for the tasks to complete in order, and each time one completes you write its block of output, read a new block of input, and create another task. Some experimentation will be required to find a value of N and a block size at which the tasks tend to complete at about the same rate as the IO.
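A rough sketch of that second approach, under some assumptions (the method name ProcDataPipelined is made up, it reuses the question's Proc(byte[]) method, and short reads are ignored for brevity): keep at most N blocks in flight, always wait for the oldest one, write its output so the file stays in order, then read and start the next block.

private void ProcDataPipelined(string fileToRead, string fileToWrite, int blockSize = 4 * 1024, int n = 4)
{
    // each queue entry pairs a processing task with the block it works on
    var pending = new Queue<Tuple<Task, byte[]>>();

    using (var reader = new FileStream(fileToRead, FileMode.Open, FileAccess.Read))
    using (var writer = new FileStream(fileToWrite, FileMode.Open, FileAccess.Write))
    {
        while (true)
        {
            // start processing new blocks until n of them are in flight
            if (pending.Count < n)
            {
                int len = (int)Math.Min((long)blockSize, reader.Length - reader.Position);
                if (len > 0)
                {
                    var block = new byte[len];
                    reader.Read(block, 0, len);
                    pending.Enqueue(Tuple.Create(Task.Run(() => Proc(block)), block));
                    continue;
                }
            }

            if (pending.Count == 0) break; // nothing left to read or write

            // wait for the oldest block and write it, so output stays in file order
            var oldest = pending.Dequeue();
            oldest.Item1.Wait();
            writer.Write(oldest.Item2, 0, oldest.Item2.Length);
        }
    }
}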
I have this piece of code that computes the MD5 hash of a given input file.
public static String ComputeMD5(String filename)
{
    using (var md5 = MD5.Create())
    {
        try
        {
            using (var stream = File.OpenRead(filename))
            {
                return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
            }
        }
        catch (Exception)
        {
            // File is not accessible, return String.Empty
            return String.Empty;
        }
    }
}
I'm running this time-consuming operation in a separate thread. For very big files it may take seconds or minutes. What I want is to be able to stop the operation from another thread, for example via a "Stop" button in the GUI. Any suggestions?
You can read the file in parts and apply MD5.TransformBlock to each part you read. (Note that the last part should be processed with MD5.TransformFinalBlock.)
Between blocks you can check whether cancellation has been requested; you are free to use any synchronization primitive you like.
Here is an example that uses a CancellationToken:
using System;
using System.IO;
using System.Threading;
using System.Security.Cryptography;

namespace Stack
{
    class Program
    {
        static void Main(string[] args)
        {
            using (var cancellationTokenSource = new CancellationTokenSource())
            {
                var thread = new Thread(() =>
                {
                    try
                    {
                        var hash = CalcHash("D:/Image.iso", cancellationTokenSource.Token);
                        Console.WriteLine($"Done: hash is {BitConverter.ToString(hash)}");
                    }
                    catch (OperationCanceledException)
                    {
                        Console.WriteLine("Canceled :(");
                    }
                });

                // Start background thread
                thread.Start();

                Console.WriteLine("Working, press any key to exit");
                Console.ReadLine();
                cancellationTokenSource.Cancel();
            }
            Console.WriteLine("Finished");
            Console.ReadLine();
        }

        static byte[] CalcHash(string path, CancellationToken ct)
        {
            using (var stream = File.OpenRead(path))
            using (var md5 = MD5.Create())
            {
                const int blockSize = 1024 * 1024 * 4;
                var buffer = new byte[blockSize];
                long offset = 0;
                while (true)
                {
                    ct.ThrowIfCancellationRequested();
                    var read = stream.Read(buffer, 0, blockSize);
                    if (stream.Position == stream.Length)
                    {
                        // last block: finalize the hash
                        md5.TransformFinalBlock(buffer, 0, read);
                        break;
                    }
                    // hash only the bytes actually read (Read may return fewer than blockSize)
                    offset += md5.TransformBlock(buffer, 0, read, buffer, 0);
                    Console.WriteLine($"Processed {offset * 1.0 / 1024 / 1024} MB so far");
                }
                return md5.Hash;
            }
        }
    }
}
I am writing code that starts multiple instances of the same task at once and waits for them all to finish. Each task reads from a file and uploads a byte array containing a portion of that file.
var requests = new Task[parts.Count];
foreach (var part in parts)
{
    var partNumber = part.Item1;
    var partSize = part.Item2;
    var ms = new MemoryStream(partSize);
    var bw = new BinaryWriter(ms);
    var offset = (partNumber - 1) * partMaxSize;
    var count = partSize;
    bw.Write(assetContentBytes, offset, count);
    ms.Position = 0;
    Console.WriteLine("beginning upload of part " + partNumber);
    requests[partNumber - 1] = uploadClient.UploadPart(uploadResult.AssetId, partNumber, ms);
}
await Task.WhenAll(requests);
I would like to close these MemoryStreams after the related task is complete, but if I write stream.Close() into the loop, the streams close before the task is complete. Is it possible to close each stream after the task is complete? Thanks.
Just extract the part that uses the stream into a separate async method:
var requests = new Task[parts.Count];
foreach (var part in parts)
{
    var partNumber = part.Item1;
    var partSize = part.Item2;
    requests[partNumber - 1] = UploadPartAsync(partNumber, partSize);
}
await Task.WhenAll(requests);

...

async Task UploadPartAsync(int partNumber, int partSize)
{
    using (var ms = new MemoryStream(partSize))
    using (var bw = new BinaryWriter(ms))
    {
        var offset = (partNumber - 1) * partMaxSize;
        var count = partSize;
        bw.Write(assetContentBytes, offset, count);
        ms.Position = 0;
        Console.WriteLine("beginning upload of part " + partNumber);
        await uploadClient.UploadPart(uploadResult.AssetId, partNumber, ms);
    }
}