How to properly parallelise job heavily relying on I/O - c#

I'm building a console application that have to process a bunch of data.
Basically, the application grabs references from a DB. For each reference, parse the content of the file and make some changes. The files are HTML files, and the process is doing a heavy work with RegEx replacements (find references and transform them into links). The results in then stored on the file system and sent to an external system.
If I resume the process, in a sequential way :
var refs = GetReferencesFromDB(); // ~5000 Datarow returned
foreach(var ref in refs)
{
var filePath = GetFilePath(ref); // This method looks up in a previously loaded file list
var html = File.ReadAllText(filePath); // Read html locally, or from a network drive
var convertedHtml = ParseHtml(html);
File.WriteAllText(destinationFilePath); // Copy the result locally, or a network drive
SendToWs(ref, convertedHtml);
}
My program is working correctly but is quite slow. That's why I want to parallelise the process.
By now, I made a simple Parallelization adding AsParallel :
var refs = GetReferencesFromDB().AsParallel();
refs.ForAll(ref=>
{
var filePath = GetFilePath(ref);
var html = File.ReadAllText(filePath);
var convertedHtml = ParseHtml(html);
File.WriteAllText(destinationFilePath);
SendToWs(ref, convertedHtml);
});
This simple change decrease the duration of the process (25% less time). However, what I understand with parallelization is that there won't be much benefits (or worse, less benefits) if parallelyzing over resources relying on I/O, because the i/o won't magically doubles.
That's why I think I should change my approach not to parallelize the whole process, but to create dependent chained queued tasks.
I.E., I should create a flow like :
Queue read file. When finished, Queue ParseHtml. When finished, Queue both send to WS and write locally. When finished, log the result.
However, I don't know how to realize such think.
I feel it will ends in a set of consumer/producer queues, but I didn't find a correct sample.
And moreover, I'm not sure if there will be benefits.
thanks for advices
[Edit] In fact, I'm the perfect candidate for using c# 4.5... if only it was rtm :)
[Edit 2] Another thing making me thinking it's not correctly parallelized, is that in the resource monitor, I see graphs of CPU, network I/O and disk I/O not stable. when one is high, others are low to medium

You're not leveraging any async I/O APIs in any of your code. Everything you're doing is CPU bound and all your I/O operations are going to waste CPU resources blocking. AsParallel is for compute bound tasks, if you want to take advantage of async I/O you need to leverage the Asynchronous Programming Model (APM) based APIs today in <= v4.0. This is done by looking for BeginXXX/EndXXX methods on the I/O based classes you're using and leveraging those whenever available.
Read this post for starters: TPL TaskFactory.FromAsync vs Tasks with blocking methods
Next, you don't want to use AsParallel in this case anyway. AsParallel enables streaming which will result in an immediately scheduling a new Task per item, but you don't need/want that here. You'd be much better served by partitioning the work using Parallel::ForEach.
Let's see how you can use this knowledge to achieve max concurrency in your specific case:
var refs = GetReferencesFromDB();
// Using Parallel::ForEach here will partition and process your data on separate worker threads
Parallel.ForEach(
refs,
ref =>
{
string filePath = GetFilePath(ref);
byte[] fileDataBuffer = new byte[1048576];
// Need to use FileStream API directly so we can enable async I/O
FileStream sourceFileStream = new FileStream(
filePath,
FileMode.Open,
FileAccess.Read,
FileShare.Read,
8192,
true);
// Use FromAsync to read the data from the file
Task<int> readSourceFileStreamTask = Task.Factory.FromAsync(
sourceFileStream.BeginRead
sourceFileStream.EndRead
fileDataBuffer,
fileDataBuffer.Length,
null);
// Add a continuation that will fire when the async read is completed
readSourceFileStreamTask.ContinueWith(readSourceFileStreamAntecedent =>
{
int soureFileStreamBytesRead;
try
{
// Determine exactly how many bytes were read
// NOTE: this will propagate any potential exception that may have occurred in EndRead
sourceFileStreamBytesRead = readSourceFileStreamAntecedent.Result;
}
finally
{
// Always clean up the source stream
sourceFileStream.Close();
sourceFileStream = null;
}
// This is here to make sure you don't end up trying to read files larger than this sample code can handle
if(sourceFileStreamBytesRead == fileDataBuffer.Length)
{
throw new NotSupportedException("You need to implement reading files larger than 1MB. :P");
}
// Convert the file data to a string
string html = Encoding.UTF8.GetString(fileDataBuffer, 0, sourceFileStreamBytesRead);
// Parse the HTML
string convertedHtml = ParseHtml(html);
// This is here to make sure you don't end up trying to write files larger than this sample code can handle
if(Encoding.UTF8.GetByteCount > fileDataBuffer.Length)
{
throw new NotSupportedException("You need to implement writing files larger than 1MB. :P");
}
// Convert the file data back to bytes for writing
Encoding.UTF8.GetBytes(convertedHtml, 0, convertedHtml.Length, fileDataBuffer, 0);
// Need to use FileStream API directly so we can enable async I/O
FileStream destinationFileStream = new FileStream(
destinationFilePath,
FileMode.OpenOrCreate,
FileAccess.Write,
FileShare.None,
8192,
true);
// Use FromAsync to read the data from the file
Task destinationFileStreamWriteTask = Task.Factory.FromAsync(
destinationFileStream.BeginWrite,
destinationFileStream.EndWrite,
fileDataBuffer,
0,
fileDataBuffer.Length,
null);
// Add a continuation that will fire when the async write is completed
destinationFileStreamWriteTask.ContinueWith(destinationFileStreamWriteAntecedent =>
{
try
{
// NOTE: we call wait here to observe any potential exceptions that might have occurred in EndWrite
destinationFileStreamWriteAntecedent.Wait();
}
finally
{
// Always close the destination file stream
destinationFileStream.Close();
destinationFileStream = null;
}
},
TaskContinuationOptions.AttachedToParent);
// Send to external system **concurrent** to writing to destination file system above
SendToWs(ref, convertedHtml);
},
TaskContinuationOptions.AttachedToParent);
});
Now, here's few notes:
This is sample code so I'm using a 1MB buffer to read/write files. This is excessive for HTML files and wasteful of system resources. You can either lower it to suit your max needs or implement chained reads/writes into a StringBuilder which is an excercise I leave up to you since I'd be writing ~500 more lines of code to do async chained reads/writes. :P
You'll note that on the continuations for the read/write tasks I have TaskContinuationOptions.AttachedToParent. This is very important as it will prevent the worker thread that the Parallel::ForEach starts the work with from completing until all the underlying async calls have completed. If this was not here you would kick off work for all 5000 items concurrently which would pollute the TPL subsystem with thousands of scheduled Tasks and not scale properly at all.
I call SendToWs concurrent to writing the file to the file share here. I don't know what is underlying the implementation of SendToWs, but it too sounds like a good candidate for making async. Right now it's assumed it's pure compute work and, as such, is going to burn a CPU thread while executing. I leave it as an excercise to you to figure out how best to leverage what I've shown you to improve throughput there.
This is all typed free form and my brain was the only compiler here and SO's syntax higlighting is all I used to make sure syntax was good. So, please forgive any syntax errors and let me know if I screwed up anything too badly that you can't make heads or tails of it and I'll follow up.

The good news is your logic could be easily separated into steps that go into a producer-consumer pipeline.
Step 1: Read file
Step 2: Parse file
Step 3: Write file
Step 4: SendToWs
If you are using .NET 4.0 you can use the BlockingCollection data structure as the backbone for the each step's producer-consumer queue. The main thread will enqueue each work item into step 1's queue where it will be picked up and processed and then forwarded on to step 2's queue and so on and so forth.
If you are willing to move on to the Async CTP then you can take advantage of the new TPL Dataflow structures for this as well. There is the BufferBlock<T> data structure, among others, that behaves in a similar manner to BlockingCollection and integrates well with the new async and await keywords.
Because your algorithm is IO bound the producer-consumer strategies may not get you the performance boost you are looking for, but at least you will have a very elegant solution that would scale well if you could increase the IO throughput. I am afraid steps 1 and 3 will be the bottlenecks and the pipeline will not balance well, but it is worth experimenting with.

Just a suggestion, but have you looked into the Consumer / Producer pattern ? A certain number of threads would read your files on disk and feed the content to a queue. Then another set of threads, known as the consumers, would "consume" the queue as its filled. http://zone.ni.com/devzone/cda/tut/p/id/3023

Your best bet in these kind of scenario is definitely the producer-consumer model. One thread to pull the data and a bunch of workers to process it. There's no easy way around the I/O so you might as well just focus on optimizing the computation itself.
I will now try to sketch a model:
// producer thread
var refs = GetReferencesFromDB(); // ~5000 Datarow returned
foreach(var ref in refs)
{
lock(queue)
{
queue.Enqueue(ref);
event.Set();
}
// if the queue is limited, test if the queue is full and wait.
}
// consumer threads
while(true)
{
value = null;
lock(queue)
{
if(queue.Count > 0)
{
value = queue.Dequeue();
}
}
if(value != null)
// process value
else
event.WaitOne(); // event to signal that an item was placed in the queue.
}
You can find more details about producer/consumer in part 4 of Threading in C#: http://www.albahari.com/threading/part4.aspx

I think your approach to split up the list of files and process each file in one batch is ok.
My feeling is that you might get more performance gain if you play with degree of parallelism.
See: var refs = GetReferencesFromDB().AsParallel().WithDegreeOfParallelism(16); this would start processing 16 files at the same time. Currently you are processing probably 2 or 4 files depending on number of cores you have. This is only efficient when you have only computation without IO. For IO intensive tasks adjustment might bring incredible performance improvements reducing processor idle time.
If you are going to split up and join tasks back using producer-consumer look at this sample: Using Parallel Linq Extensions to union two sequences, how can one yield the fastest results first?

Related

Recommended approach in C# to read from multiple locations in a single file concurrently and asynchronously

I am working with read-only data files (e.g. 100s of gig) containing blocks of data ~64K each. I would like to build an in-memory cache to service the 10s-100s of block reads required to handle each service request.
A basic async non-threadsafe read looks like:
public async Task<byte[]> Read(int id)
{
FStream.Seek(CalcOffset(id), SeekOrigin.Begin);
var ba = new byte[64 * 1024];
await FStream.ReadAsync(ba, 0, ba.Length);
return ba;
}
I can't lock on the FStream to make the above thread-safe (C# error "Cannot await in the body of a lock statement"). I can't remove the await without losing the async behaviour. My current workaround has Read drawing from a cache of FileStreams:
private BufferBlock<FileStream> StreamRead;
public async Task<FileStream> GetReadStream()
{
return await StreamRead.ReceiveAsync(TimeSpan.FromMilliseconds(-1));
}
public async Task ReleaseReadStream(FileStream stream)
{
await StreamRead.SendAsync(stream);
}
Is this the best approach to building a multithreaded async-friendly cache? Any other suggestions?
I would like to build an in-memory cache
Are you sure? :)
Windows has invested considerable amounts of work over several decades to implement an extremely performant file cache built-in to the OS.
There are corner cases where you can do more efficient caching for a specific use case, but the vast majority of the time it's not worth the effort. I recommend measuring first.
I can't lock on the FStream to make the above thread-safe (C# error "Cannot await in the body of a lock statement").
My question pertains on whether/how to perform multiple concurrent reads on a FileStream in an async manner
You can use a SemaphoreSlim to act as an async-compatible lock. The syntax is a bit more awkward, but it works.
On a side note, I also recommend looking into memory mapped files.
What it seems like you desire, is to somehow seek and then read the file at the same time. The fancy term for this is performing an "atomic seek and read" operation.
Windows and Linux support this exact type of operation. On Linux, there's a function called pread, and on Windows, there's a function called ReadFile. All that's left is to go through the mess of hooking into these calls. Yeah, not fun.
I had this exact same problem, and so I made the solution into a library. Presenting, my library pread. It allows you to atomically seek and read at the same time, and is faster than locking on the FileStream.
using pread;
using var fileStream = new FileStream("my_file.txt", FileMode.OpenOrCreate);
var data = new byte[123];
var bytesWritten = P.Write(fileStream, (ReadOnlySpan<byte>)data, fileOffset: 0);
var bytesRead = P.Read(fileStream, (Span<byte>)data, fileOffset: 0);

How to use threading effective way in a console application .net

I have a 8 core systems and i am processing number of text files contains millions of lines say 23 files contain huge number of lines which takes 2 to 3 hours to finish.I am thinking of using TPL task for processing text files.As of now the code which i am using is sequentially processing text files one by one so i am thinking of split it like 5 text files in one thread 5 in another thread etc.Is it a good approach or any other way ? I am using .net 4.0 and code i am using is as shown below
foreach (DataRow dtr in ds.Tables["test"].Rows)
{
string filename = dtr["ID"].ToString() + "_cfg";
try
{
foreach (var file in
Directory.EnumerateFiles(Path.GetDirectoryName(dtr["FILE_PATH"].ToString()), "*.txt"))
{
id = file.Split('\\').Last();
if (!id.Contains("GMML"))
{
strbsc = id.Split('_');
id = strbsc[0];
}
else
{
strbsc = file.Split('-');
id = ("RC" + strbsc[1]).Replace("SC", "");
}
ProcessFile(file, id, dtr["CODE"].ToString(), dtr["DOR_CODE"].ToString(), dtr["FILE_ID"].ToString());
}
}
How to split text files in to batches and each batch should run in threads rather one by one.Suppose if 23 files then 7 in one thread 7 in one thread 7 in one thread and 2 in another thread. One more thing is i am moving all these data from text files to oracle database
EDIT
if i use like this will it worth,but how to seperate files in to batches
Task.Factory.StartNew(() => {ProcessFile(file, id, dtr["CODE"].ToString(), dtr["DOR_CODE"].ToString(), dtr["FILE_ID"].ToString()); });
Splitting the file into multiple chunks does not seem to be a good idea because its performance boost is related to how the file is placed on your disk. But because of the async nature of disk IO operations, I strongly recommend async access to the file. There are several ways to do this and you can always choose a combination of those.
At the lowest level you can use async methods such as StreamWriter.WriteAsync() or StreamReader.ReadAsync() to access the file on disk and cooperatively let the OS know that it can switch to a new thread for disk IO and let the thread out until the Disk IO operation is finished. While it's useful to make async calls at this level, it alone does not have a significant impact on the overall performance of your application, since your app is still waiting for the disk operation to finish and does nothing in the meanwhile! (These calls can have a big impact on your software's responsiveness when they are called from the UI thread)
So, I recommend splitting your software logic into at least two separate parts running on two separate threads; One to read data from the file, and one to process the read data. You can use the provider/consumer pattern to help these threads interact.
One great data structure provided by .net is System.Collections.Concurrent.ConcurrentQueue which is specially useful in implementing multithreaded provider/consumer pattern.
So you can easily do something like this:
System.Collections.Concurrent.ConcurrentQueue<string> queue = new System.Collections.Concurrent.ConcurrentQueue<string>();
bool readFinished = false;
Task tRead = Task.Run(async () =>
{
using (FileStream fs = new FileStream())
{
using (StreamReader re = new StreamReader(fs))
{
string line = "";
while (!re.EndOfStream)
queue.Enqueue(await re.ReadLineAsync());
}
}
});
Task tLogic = Task.Run(async () =>
{
string data ="";
while (!readFinished)
{
if (queue.TryDequeue(out data))
//Process data
else
await Task.Delay(100);
}
});
tRead.Wait();
readFinished = true;
tLogic.Wait();
This simple example uses StreamReader.ReadLineAsync() to read data from file, while a good practice can be reading a fixed-length of characters into a char[] buffer and adding that data to the queue. You can find the optimized buffer length after some tests.
All ,the real bottle neck was when i was doing mass insertion , i was checking whether the inserting data is present in the database or what,I have a status column where if data is present ,it will be 'Y' or 'N' by doing update statement.So the update statement in congestion with insert was the culprit.After doing indexing in database the result reduced from 4 hours to 10 minute, what an impact, but it wins:)

Program doesn't use all hardware resources

I'm working on one program that takes information from files and then stores them in MySQL database. This MySQL database is located in another dedicated server which is much more powerful than this server here. Data is being sent over LAN using 1gbps connection.
It is using 8 threads because my server has 8 cores, but somehow it runs so slowly.
CPU is: Intel Xeon E3-1270 v 3 # 3.50Ghz
RAM: 16 GB ECC
HDD: SATA 3 1TB
My program's CPU usage is only 0-5%
CPU affinity is all 8 cores
So, do you have any ideas what's wrong or how can I increase the speed of my program?
UPDATE:
I updated my code and it appears to be faster:
Parallel.For(0, this.data_files.Count, new ParallelOptions { MaxDegreeOfParallelism = this.MaxThreads }, i =>
{
this.ThreadCount++;
this.ParseFile(this.GetSource());
});
Here's a code snippet that deploys threads:
while (true)
{
if (this.ThreadCount < this.MaxThreads)
{
Task.Factory.StartNew(() =>
this.ParseFile(this.GetFile())
);
this.ThreadCount++;
}
else
{
Thread.Sleep(1);
}
this.UpdateConsole();
}
GetFile function:
private string GetFile()
{
string file = "";
string source = "";
while (true)
{
if (this.data_files.Count() != 0)
{
file = this.data_files[0];
this.data_files.RemoveAt(0);
if (File.Exists(file) == true)
{
source = File.ReadAllText(file);
File.Delete(file);
break;
}
}
}
return source;
}
I'm working on one program that takes information from files and then stores them in MySQL database.
Clearly your program is not CPU bound, it's IO bound. The bottlenecks are going to be based on your hard disk(s) and your network connection. Odds are even a single thread is going to be able to ensure proper utilization of these resources (in a well designed application). Adding extra threads generally won't help, it'll just create a bunch of threads that will spend their time waiting on various IO operations.
To use all the hardware resources is not the right goal for a program to have.
Instead, a better goal is to be as fast as possible. This is significantly different. While using more hardware resources can help, it is not always sufficient.
Sometimes, adding more resources to a problem doesn't help. In those cases, don't. Adding threads makes your program more complex, but not necessarily faster as you've seen.
C# already has good Asynchronous programming features with the TPL (which you are already using), so why not take advantage of that?
This will mean that the .NET framework will automatically manage the threads for you in an efficient way.
Here's what I propose:
foreach (var file in GetFilesToRead()) {
var task = PerformOperation(file);
// Keep a list of tasks, if you wish.
}
...
Task PerformOperation (string filename) {
var file = await ReadFile(file);
await ParseFile(file);
DoSomething();
}
Note that even in CPU-bound programs, threads (and tasks) may not help you if you're using locks.
Although locks help keep programs well-behaved, they come at a significant performance cost.
Within a lock, only one thread may be executing at a time.
This means that the first thread is locking your _lock instance, and then the other threads are waiting for that lock to be released.
In your program, only one thread is active at a time.
To solve this, don't use locks. Instead, write programs that do not need locks at all. Copy variables instead of sharing them. Use immutable collections instead of mutable collections and so on.
My program above uses exactly zero locks and, as such, will better utilize your threads.

How to handle large numbers of concurrent disk write requests as efficiently as possible

Say the method below is being called several thousand times by different threads in a .net 4 application. What’s the best way to handle this situation? Understand that the disk is the bottleneck here but I’d like the WriteFile() method to return quickly.
Data can be can be up to a few MB. Are we talking threadpool, TPL or the like?
public void WriteFile(string FileName, MemoryStream Data)
{
try
{
using (FileStream DiskFile = File.OpenWrite(FileName))
{
Data.WriteTo(DiskFile);
DiskFile.Flush();
DiskFile.Close();
}
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
If you want to return quickly and not really care that operation is synchronous you could create some kind of in memory Queue where you will be putting write requests , and while Queue is not filled up you can return from method quickly. Another thread will be responsible for dispatching Queue and writing files. If your WriteFile is called and queue is full you will have to wait until you can queue and execution will become synchronous again, but that way you could have a big buffer so if process file write requests is not linear , but is more spiky instead (with pauses between write file calls spikes) such change can be seen as an improvement in your performance.
UPDATE:
Made a little picture for you. Notice that bottleneck always exists, all you can possibly do is optimize requests by using a queue. Notice that queue has limits, so when its filled up , you cannot insta queue files into, you have to wait so there is a free space in that buffer too. But for situation presented on picture (3 bucket requests) its obvious you can quickly put buckets into queue and return, while in first case you have to do that 1 by one and block execution.
Notice that you never need to execute many IO threads at once, since they will all be using same bottleneck and you will just be wasting memory if you try to parallel this heavily, I believe 2 - 10 threads tops will take all available IO bandwidth easily, and will limit application memory usage too.
Since you say that the files don't need to be written in order nor immediately, the simplest approach would be to use a Task:
private void WriteFileAsynchronously(string FileName, MemoryStream Data)
{
Task.Factory.StartNew(() => WriteFileSynchronously(FileName, Data));
}
private void WriteFileSynchronously(string FileName, MemoryStream Data)
{
try
{
using (FileStream DiskFile = File.OpenWrite(FileName))
{
Data.WriteTo(DiskFile);
DiskFile.Flush();
DiskFile.Close();
}
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
The TPL uses the thread pool internally, and should be fairly efficient even for large numbers of tasks.
If data is coming in faster than you can log it, you have a real problem. A producer/consumer design that has WriteFile just throwing stuff into a ConcurrentQueue or similar structure, and a separate thread servicing that queue works great ... until the queue fills up. And if you're talking about opening 50,000 different files, things are going to back up quick. Not to mention that your data that can be several megabytes for each file is going to further limit the size of your queue.
I've had a similar problem that I solved by having the WriteFile method append to a single file. The records it wrote had a record number, file name, length, and then the data. As Hans pointed out in a comment to your original question, writing to a file is quick; opening a file is slow.
A second thread in my program starts reading that file that WriteFile is writing to. That thread reads each record header (number, filename, length), opens a new file, and then copies data from the log file to the final file.
This works better if the log file and the final file are are on different disks, but it can still work well with a single spindle. It sure exercises your hard drive, though.
It has the drawback of requiring 2X the disk space, but with 2-terabyte drives under $150, I don't consider that much of a problem. It's also less efficient overall than directly writing the data (because you have to handle the data twice), but it has the benefit of not causing the main processing thread to stall.
Encapsulate your complete method implementation in a new Thread(). Then you can "fire-and-forget" these threads and return to the main calling thread.
foreach (file in filesArray)
{
try
{
System.Threading.Thread updateThread = new System.Threading.Thread(delegate()
{
WriteFileSynchronous(fileName, data);
});
updateThread.Start();
}
catch (Exception ex)
{
string errMsg = ex.Message;
Exception innerEx = ex.InnerException;
while (innerEx != null)
{
errMsg += "\n" + innerEx.Message;
innerEx = innerEx.InnerException;
}
errorMessages.Add(errMsg);
}
}

What is the most efficient method to write data to the file?

I need to write a large amount of data in my application. Data arrive periodically and pushed into the queue.
//producer
queue_.push( new_data );// new_data is a 64kb memory range
PostThreadMessage(worker_thread_id_, MyMsg, 0, 0);//notify the worker thread
after that, i need to write this data to the file. I have another thread, that get data from the queue and writes it.
//consumer
while(GetMessage(msg, 0, 0))
{
data = queue_.pop();
WriteFile(file_handle_, data.begin(), data.size(), &io_bytes, NULL);
}
This is simple, but not efficient.
Another oportunity is to use IO completion ports.
//producer
//file must be associated with io completion port
WriteFile( file_handle_
, new_data.begin()
, new_data.size()
, &io_bytes
, new my_overlaped(new_data) );
consumer is just delete unused buffers
my_completion_key* ck = 0;
my_overlapped* op = 0;
DWORD res = GetQueuedIOCompletionStatus(completion_port_, &io_bytes, &ck, &op);
//process errors
delete op->data;
What is the preferred way, to perform large amount of file i/o operation on windows?
Hard to tell because it is not clear what kind of problem you are experiencing with your current approach. Going async in a dedicated "writer" thread shouldn't result in considerable performance improvement.
If you have small but frequent writes (so that frequent WriteFile's cost becomes a bottleneck) I would suggest implementing a kind of lazy writes - collect data and flush it to file when some volume threshold is reached.

Categories