I need to write a large amount of data in my application. Data arrives periodically and is pushed into a queue.
//producer
queue_.push( new_data );// new_data is a 64kb memory range
PostThreadMessage(worker_thread_id_, MyMsg, 0, 0);//notify the worker thread
After that, I need to write this data to a file. I have another thread that gets the data from the queue and writes it.
//consumer
while(GetMessage(&msg, NULL, 0, 0) > 0)
{
data = queue_.pop();
WriteFile(file_handle_, data.begin(), data.size(), &io_bytes, NULL);
}
This is simple, but not efficient.
Another option is to use I/O completion ports.
//producer
//file must be associated with io completion port
WriteFile( file_handle_
, new_data.begin()
, new_data.size()
, &io_bytes
, new my_overlapped(new_data) );
The consumer just deletes the buffers that are no longer needed:
my_completion_key* ck = 0;
my_overlapped* op = 0;
BOOL res = GetQueuedCompletionStatus(completion_port_, &io_bytes, (PULONG_PTR)&ck, (LPOVERLAPPED*)&op, INFINITE);
//process errors
delete op->data;
What is the preferred way to perform a large number of file I/O operations on Windows?
Hard to tell, because it is not clear what kind of problem you are experiencing with your current approach. Going async in a dedicated "writer" thread shouldn't result in a considerable performance improvement.
If you have small but frequent writes (so that the cost of frequent WriteFile calls becomes a bottleneck), I would suggest implementing a kind of lazy write: collect the data and flush it to the file when some volume threshold is reached.
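The lazy-write idea can be sketched roughly as follows. This is a minimal C# illustration rather than Win32 code, and the class name BatchingWriter, its FlushCount property and the threshold parameter are hypothetical names introduced only for this example. It assumes a single writer thread, so it does no locking.

```csharp
using System;
using System.IO;

// Hypothetical helper: accumulates small writes in memory and flushes them
// to the underlying stream once a size threshold is reached.
public sealed class BatchingWriter : IDisposable
{
    private readonly Stream _target;
    private readonly MemoryStream _pending = new MemoryStream();
    private readonly int _flushThreshold;

    public BatchingWriter(Stream target, int flushThreshold)
    {
        _target = target;
        _flushThreshold = flushThreshold;
    }

    // Number of physical writes issued so far (handy for measuring the effect).
    public int FlushCount { get; private set; }

    public void Write(byte[] data)
    {
        _pending.Write(data, 0, data.Length);
        if (_pending.Length >= _flushThreshold)
            Flush();
    }

    public void Flush()
    {
        if (_pending.Length == 0) return;
        _pending.WriteTo(_target);   // one large write instead of many small ones
        _pending.SetLength(0);       // empty the in-memory buffer for reuse
        _pending.Position = 0;
        FlushCount++;
    }

    public void Dispose()
    {
        Flush();                     // do not lose the tail below the threshold
        _pending.Dispose();          // the target stream is left open for the caller
    }
}
```

Whether batching helps at all depends on the write sizes involved; with 64 KB blocks, as in the question, the gain may be small, so it is worth measuring before adopting it.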
I have the following function to write to a file asynchronously from multiple threads in parallel:
static int startOffset = 0; // This variable will store the offset at which the thread begins to write
static int blockSize = 10; // size of block written by each thread
static async Task<long> WriteToFile(Stream dataToWrite)
{
var startOffset = getStartOffset(); // Definition of this function is given later
using(var fs = new FileStream(fileName,
FileMode.OpenOrCreate,
FileAccess.ReadWrite,
FileShare.ReadWrite))
{
fs.Seek(startOffset,SeekOrigin.Begin);
await dataToWrite.CopyToAsync(fs);
}
return startOffset;
}
/**
*I use reader writer lock here so that only one thread can access the value of the startOffset at
*a time
*/
static int getStartOffset()
{
int result = 0;
try
{
rwl.AcquireWriterLock(Timeout.Infinite);
result = startOffset;
startOffset+=blockSize; // increment the startOffset for the next thread
}
finally
{
rwl.ReleaseWriterLock();
}
return result;
}
I then use the above function to write some strings from multiple threads.
var tasks = new List<Task>();
for(int i=1;i<=4;i++)
{
tasks.Add(Task.Run( async() => {
String s = "aaaaaaaaaa";
byte[] buffer = new byte [10];
buffer = Encoding.Default.GetBytes(s);
Stream data = new MemoryStream(buffer);
long offset = await WriteToFile(data);
Console.WriteLine($"Data written at offset - {offset}");
}));
}
Task.WaitAll(tasks.ToArray());
Now, this code executes well most of the time. But sometimes, randomly, it writes some Japanese characters or other symbols to the file. Is there something that I am doing wrong in the multithreading?
Your calculation of startOffset assumes that each thread is writing exactly 10 bytes. There are several issues with this.
One, the data has unknown length:
byte[] buffer = new byte [10];
buffer = Encoding.Default.GetBytes(s);
The assignment doesn't put data into the newly allocated 10 byte array, it leaks the new byte[10] array (which will be garbage collected) and stores a reference to the return of GetBytes(s), which could have any length at all. It could overflow into the next Task's area. Or it could leave some content that existed in the file beforehand (you use OpenOrCreate) which lies in the area for the current Task, but past the end of the actual dataToWrite.
Two, you try to seek past the areas that other threads are expected to write to, but if those writes haven't completed, they haven't increased the file length. So you attempt to seek past the end of the file, which is allowed by the Windows API but might cause problems with the .NET wrappers. However, the FileStream.Seek documentation does indicate that this is OK:
When you seek beyond the length of the file, the file size grows
although this might not be precisely correct, since the Windows API documentation says:
It is not an error to set a file pointer to a position beyond the end of the file. The size of the file does not increase until you call the SetEndOfFile, WriteFile, or WriteFileEx function. A write operation increases the size of the file to the file pointer position plus the size of the buffer written, which results in the intervening bytes uninitialized.
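One way to avoid the fixed 10-byte assumption is to reserve exactly as many bytes as each writer actually has. The following is a minimal sketch under stated assumptions, not the asker's code: OffsetWriter and WriteAsync are hypothetical names, the data is passed as a byte[] rather than a Stream so its length is known up front, and Interlocked.Add replaces the reader-writer lock.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class OffsetWriter
{
    private static long _nextOffset;   // next free position in the file

    // Reserves exactly data.Length bytes and writes them at the reserved offset.
    public static async Task<long> WriteAsync(string path, byte[] data)
    {
        // Interlocked.Add returns the value after the addition, so subtracting
        // the length gives the start of the region reserved for this caller.
        long start = Interlocked.Add(ref _nextOffset, data.Length) - data.Length;

        using (var fs = new FileStream(path, FileMode.OpenOrCreate,
                                       FileAccess.Write, FileShare.ReadWrite,
                                       4096, useAsync: true))
        {
            fs.Seek(start, SeekOrigin.Begin);
            await fs.WriteAsync(data, 0, data.Length);
        }
        return start;
    }
}
```

This sketch does not deal with stale content in a file that already existed before the run; truncating or deleting the file first would be a separate step.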
I think that asynchronous file I/O is not usually meant to be utilized with multithreading. Just because something is asynchronous does not mean that an operation should have multiple threads assigned to it.
To quote the documentation for async file I/O: Asynchronous operations enable you to perform resource-intensive I/O operations without blocking the main thread. Basically, instead of using a bunch of threads on one operation, it dispatches a new thread to accomplish a less meaningful task. Eventually with a big enough application, nearly everything can be abstracted to be a not-so-meaningful task and computers can run massive apps pretty quickly utilizing multithreading.
What you are likely experiencing is undefined behavior due to multiple threads overwriting the same location in memory. These Japanese characters you are referring to are likely malformed ascii/unicode that your text editor is attempting to interpret.
If you would like to remedy the undefined behavior and remain using asynchronous operations, you should be able to await each individual task before the next one can start. This will prevent the offset variable from being in the incorrect position for the newest task. Although, logically it will run the same as a synchronous version.
I am using C#'s async Sockets and use BeginReceive to read data from the Socket into a byte[] buffer of 8192 bytes. But what happens when new packets come in before BeginReceive is called again? My current setup handles about 3 incoming messages before it stops. I'm assuming that the Socket must store the incoming data somewhere before it can be processed by BeginReceive.
Do I have any control over how much data the Socket buffers before it stops?
Do I have to rely on processing the incoming messages fast enough in order not to "miss" any?
What happens when the ProcessMessageBuffer method in the example below takes so long (for some reason) that the incoming data starts to pile up in the Socket?
public void ReadCallback(IAsyncResult ar)
{
// We have a new TCP Packet!
int bytesReceived = 0;
try
{
// The amount of bytes we have just received
bytesReceived = Socket.EndReceive(ar);
}
catch (SocketException ex)
{
// The client closed the connection
OnSocketException(new SocketExceptionEventArgs(ex));
}
if (bytesReceived > 0)
{
// We have received some data. Write it to the MessageBuffer
MessageBuffer.Write(ReceiveBuffer, 0, bytesReceived);
// Process the Messages that may be stored in the MessageBuffer
// What happens, if this takes too long?
ProcessMessageBuffer(MessageBuffer.ToArray());
// Get ready to receive more data
Socket.BeginReceive(ReceiveBuffer, 0, ReceiveBuffer.Length, SocketFlags.None, new AsyncCallback(ReadCallback), null);
}
}
Network I/O is buffered at every step of the way. So it's hard to know which "buffer" you are worried about.
From Socket.ReceiveBufferSize:
…The default is 8192. A larger buffer size potentially reduces the number of empty acknowledgements (TCP packets with no data portion), but might also delay the recognition of connection difficulties. Consider increasing the buffer size if you are transferring large files, or you are using a high bandwidth, high latency connection (such as a satellite broadband provider.)
From your question:
Do I have any control over how much data the Socket buffers before it stops?
You have at least a few strategies available:
Modify the ReceiveBufferSize property value. This will change the size of the buffer in the socket object.
Use a larger buffer in your call to BeginReceive(). This will provide the Socket object with more space into which it can write before it can no longer empty its own buffer. Note that the buffer you pass to Socket will be pinned until the receive operation completes, which can have implications on memory heap management.
Issue multiple BeginReceive() calls. This has a similar effect as providing a larger buffer, but gives you finer granularity of control over the buffers. It comes with the complication that, due to how Windows schedules threads, you may wind up executing the callback for the receive operation completions in a different order than you called BeginReceive() in the first place. The data will be in the right order, according to the order of the BeginReceive() calls and each buffer you passed, but those buffers may appear to your code to get filled out of order (they aren't really, but the thread handling a later-filled buffer might get to run before the thread handling an earlier-filled buffer).
See socket buffer size: pros and cons of bigger vs smaller for some additional details.
Do I have to rely on processing the incoming messages fast enough in order not to "miss" any?
No. TCP is reliable. If you don't process data quickly enough, all that will happen is that the remote endpoint will have to wait to send more data.
That said, you should work very hard to make your socket I/O code work as quickly as possible. If you have some processing that is slow enough to delay receive operations, you should off-load that processing to a completely independent thread, buffering the received data yourself (e.g. with a MemoryStream, FileStream, a queue of some sort, etc.).
If you do it that way, then you likely won't have to do anything beyond the simple, default handling of socket. You'll be able to have a single BeginReceive() outstanding at once, you won't have to adjust the socket's buffer, and you'll be able to allocate "normal-sized" byte[] objects (or keep a single one around for reuse).
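The off-loading described above might look something like this. It is only a sketch: ReceiveQueue and its members are hypothetical names, no real socket is involved, and the processChunk delegate stands in for whatever ProcessMessageBuffer does in the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper: the receive callback only copies and enqueues data;
// a separate worker does the slow processing.
public sealed class ReceiveQueue : IDisposable
{
    private readonly BlockingCollection<byte[]> _chunks = new BlockingCollection<byte[]>();
    private readonly Task _worker;

    public ReceiveQueue(Action<byte[]> processChunk)
    {
        // LongRunning hints that this task should get a dedicated thread.
        _worker = Task.Factory.StartNew(() =>
        {
            // Blocks without spinning until a chunk arrives or adding is completed.
            foreach (byte[] chunk in _chunks.GetConsumingEnumerable())
                processChunk(chunk);
        }, TaskCreationOptions.LongRunning);
    }

    // Intended to be called from the receive callback with the shared buffer
    // and the byte count returned by EndReceive.
    public void Enqueue(byte[] receiveBuffer, int bytesReceived)
    {
        // Copy, because the receive buffer is reused by the next BeginReceive.
        var copy = new byte[bytesReceived];
        Buffer.BlockCopy(receiveBuffer, 0, copy, 0, bytesReceived);
        _chunks.Add(copy);
    }

    public void Dispose()
    {
        _chunks.CompleteAdding();
        _worker.Wait();              // let the worker drain what is already queued
        _chunks.Dispose();
    }
}
```

With something like this in place, ReadCallback would call Enqueue(ReceiveBuffer, bytesReceived) and then immediately issue the next BeginReceive. The queue here is unbounded, so a processor that is permanently slower than the network would still grow memory without limit.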
I am pretty new to coding with some experience in ASM and C for PIC. I am still learning high level programming with C#.
Question
I have a serial port data reception and processing program in C#. To avoid losing data and to know when it arrives, I set up a DataReceived event handler and looped inside the handler until there were no more bytes to read.
When I attempted this, the loop continued endlessly and blocked my program from other tasks (such as processing the retrieved data) whenever I was receiving data continuously.
I read about threading in C# and created a thread that constantly checks the SerialPort.BytesToRead property so that it knows when to retrieve available data.
I created a second thread that can process data while new data is still being read. Even if ReadSerial() still has more bytes to read and the timeout (restarted every time a new byte is read from the serial port) has not elapsed, the bytes already read can still be processed and the frames assembled via a method named DataProcessing(), which reads from the same variable being filled by ReadSerial().
This gave me the desired results, but I noticed that with my solution (both the ReadSerial() and DataProcessing() threads alive), CPU usage skyrocketed all the way to 100%!
How do you approach this problem without causing such high CPU usage?
public static void ReadSerial() //Method that handles Serial Reception
{
while (KeepAlive) // Bool variable used to keep alive the thread. Turned to false
{ // when the program ends.
if (Port.BytesToRead != 0)
{
for (int i = 0; i < 5000; i++)
{
/* I Don't know any other way to
implement a timeout to wait for
additional characters so i took what
i knew from PIC Serial Data Handling. */
if (Port.BytesToRead != 0)
{
RxList.Add(Convert.ToByte(Port.ReadByte()));
i = 0;
if (RxList.Count > 20) // In case the method is stuck still reading
BufferReady = true; // signal the Data Processing thread to
} // work with that chunk of data.
BufferReady = true; // signals the DataProcessing Method to work
} // with the current data in RxList.
}
}
}
I don't completely understand what you mean by "DataReceived" and the "loop". I also work a lot with serial ports as well as other interfaces. In my application I attach to the DataReceived event and read based on BytesToRead, but I don't use a loop there:
int bytesToRead = this._port.BytesToRead;
var data = new byte[bytesToRead];
this._port.BaseStream.Read(data , 0, bytesToRead);
If you are using a loop to read the bytes, I recommend adding something like:
System.Threading.Thread.Sleep(...);
Otherwise the thread you are using to read the bytes is busy all the time, which means that other threads cannot run or your CPU sits at 100%.
But I think you don't need a loop to poll for the data if you are using the DataReceived event. If my understanding is not correct or you need further information, please ask.
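One detail worth guarding against when reading through a stream: in general, Stream.Read is allowed to return fewer bytes than were requested. A small helper along these lines copes with that. StreamReading and ReadAvailable are hypothetical names, and the sketch works on any Stream so that it can be tried without hardware.

```csharp
using System;
using System.IO;

public static class StreamReading
{
    // Reads up to 'count' bytes, looping because a single Read call may
    // return fewer bytes than requested. Returns only the bytes actually read.
    public static byte[] ReadAvailable(Stream stream, int count)
    {
        var data = new byte[count];
        int total = 0;
        while (total < count)
        {
            int n = stream.Read(data, total, count - total);
            if (n == 0) break;       // end of stream: nothing more to read
            total += n;
        }
        if (total < count) Array.Resize(ref data, total);
        return data;
    }
}
```

Inside a DataReceived handler, the call would presumably be something like StreamReading.ReadAvailable(port.BaseStream, port.BytesToRead), where port is your SerialPort instance. The handler then hands the bytes to the processing code and returns, with no polling loop involved.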
I'm building a console application that has to process a bunch of data.
Basically, the application grabs references from a DB. For each reference, it parses the content of the file and makes some changes. The files are HTML files, and the process does heavy work with RegEx replacements (finding references and transforming them into links). The result is then stored on the file system and sent to an external system.
To summarize the process in its sequential form:
var refs = GetReferencesFromDB(); // ~5000 Datarow returned
foreach(var ref in refs)
{
var filePath = GetFilePath(ref); // This method looks up in a previously loaded file list
var html = File.ReadAllText(filePath); // Read html locally, or from a network drive
var convertedHtml = ParseHtml(html);
File.WriteAllText(destinationFilePath, convertedHtml); // Copy the result locally, or a network drive
SendToWs(ref, convertedHtml);
}
My program is working correctly but is quite slow. That's why I want to parallelize the process.
So far, I have made a simple parallelization by adding AsParallel:
var refs = GetReferencesFromDB().AsParallel();
refs.ForAll(ref=>
{
var filePath = GetFilePath(ref);
var html = File.ReadAllText(filePath);
var convertedHtml = ParseHtml(html);
File.WriteAllText(destinationFilePath, convertedHtml);
SendToWs(ref, convertedHtml);
});
This simple change decreases the duration of the process (25% less time). However, my understanding of parallelization is that there won't be much benefit (or it may even be worse) when parallelizing over resources that rely on I/O, because the I/O throughput won't magically double.
That's why I think I should change my approach: not parallelize the whole process, but create dependent, chained, queued tasks.
I.e., I should create a flow like:
Queue read file. When finished, Queue ParseHtml. When finished, Queue both send to WS and write locally. When finished, log the result.
However, I don't know how to implement such a thing.
I feel it will end up as a set of consumer/producer queues, but I didn't find a good sample.
Moreover, I'm not sure there will be any benefit.
Thanks for any advice.
[Edit] In fact, I'm the perfect candidate for using c# 4.5... if only it was rtm :)
[Edit 2] Another thing that makes me think it's not correctly parallelized is that, in the resource monitor, the graphs of CPU, network I/O and disk I/O are not stable: when one is high, the others are low to medium.
You're not leveraging any async I/O APIs in any of your code. Everything you're doing is CPU bound and all your I/O operations are going to waste CPU resources blocking. AsParallel is for compute bound tasks, if you want to take advantage of async I/O you need to leverage the Asynchronous Programming Model (APM) based APIs today in <= v4.0. This is done by looking for BeginXXX/EndXXX methods on the I/O based classes you're using and leveraging those whenever available.
Read this post for starters: TPL TaskFactory.FromAsync vs Tasks with blocking methods
Next, you don't want to use AsParallel in this case anyway. AsParallel enables streaming, which will result in immediately scheduling a new Task per item, but you don't need/want that here. You'd be much better served by partitioning the work using Parallel::ForEach.
Let's see how you can use this knowledge to achieve max concurrency in your specific case:
var refs = GetReferencesFromDB();
// Using Parallel::ForEach here will partition and process your data on separate worker threads
Parallel.ForEach(
refs,
ref =>
{
string filePath = GetFilePath(ref);
byte[] fileDataBuffer = new byte[1048576];
// Need to use FileStream API directly so we can enable async I/O
FileStream sourceFileStream = new FileStream(
filePath,
FileMode.Open,
FileAccess.Read,
FileShare.Read,
8192,
true);
// Use FromAsync to read the data from the file
Task<int> readSourceFileStreamTask = Task<int>.Factory.FromAsync(
sourceFileStream.BeginRead,
sourceFileStream.EndRead,
fileDataBuffer,
0,
fileDataBuffer.Length,
null);
// Add a continuation that will fire when the async read is completed
readSourceFileStreamTask.ContinueWith(readSourceFileStreamAntecedent =>
{
int sourceFileStreamBytesRead;
try
{
// Determine exactly how many bytes were read
// NOTE: this will propagate any potential exception that may have occurred in EndRead
sourceFileStreamBytesRead = readSourceFileStreamAntecedent.Result;
}
finally
{
// Always clean up the source stream
sourceFileStream.Close();
sourceFileStream = null;
}
// This is here to make sure you don't end up trying to read files larger than this sample code can handle
if(sourceFileStreamBytesRead == fileDataBuffer.Length)
{
throw new NotSupportedException("You need to implement reading files larger than 1MB. :P");
}
// Convert the file data to a string
string html = Encoding.UTF8.GetString(fileDataBuffer, 0, sourceFileStreamBytesRead);
// Parse the HTML
string convertedHtml = ParseHtml(html);
// This is here to make sure you don't end up trying to write files larger than this sample code can handle
if(Encoding.UTF8.GetByteCount(convertedHtml) > fileDataBuffer.Length)
{
throw new NotSupportedException("You need to implement writing files larger than 1MB. :P");
}
// Convert the file data back to bytes for writing, remembering how many bytes were produced
int convertedHtmlByteCount = Encoding.UTF8.GetBytes(convertedHtml, 0, convertedHtml.Length, fileDataBuffer, 0);
// Need to use FileStream API directly so we can enable async I/O
FileStream destinationFileStream = new FileStream(
destinationFilePath,
FileMode.OpenOrCreate,
FileAccess.Write,
FileShare.None,
8192,
true);
// Use FromAsync to write the data to the file
Task destinationFileStreamWriteTask = Task.Factory.FromAsync(
destinationFileStream.BeginWrite,
destinationFileStream.EndWrite,
fileDataBuffer,
0,
convertedHtmlByteCount, // write only the converted bytes, not the whole 1MB buffer
null);
// Add a continuation that will fire when the async write is completed
destinationFileStreamWriteTask.ContinueWith(destinationFileStreamWriteAntecedent =>
{
try
{
// NOTE: we call wait here to observe any potential exceptions that might have occurred in EndWrite
destinationFileStreamWriteAntecedent.Wait();
}
finally
{
// Always close the destination file stream
destinationFileStream.Close();
destinationFileStream = null;
}
},
TaskContinuationOptions.AttachedToParent);
// Send to external system **concurrent** to writing to destination file system above
SendToWs(ref, convertedHtml);
},
TaskContinuationOptions.AttachedToParent);
});
Now, here are a few notes:
This is sample code so I'm using a 1MB buffer to read/write files. This is excessive for HTML files and wasteful of system resources. You can either lower it to suit your max needs or implement chained reads/writes into a StringBuilder, which is an exercise I leave up to you since I'd be writing ~500 more lines of code to do async chained reads/writes. :P
You'll note that on the continuations for the read/write tasks I have TaskContinuationOptions.AttachedToParent. This is very important as it will prevent the worker thread that the Parallel::ForEach starts the work with from completing until all the underlying async calls have completed. If this was not here you would kick off work for all 5000 items concurrently which would pollute the TPL subsystem with thousands of scheduled Tasks and not scale properly at all.
I call SendToWs concurrent to writing the file to the file share here. I don't know what is underlying the implementation of SendToWs, but it too sounds like a good candidate for making async. Right now it's assumed it's pure compute work and, as such, is going to burn a CPU thread while executing. I leave it as an exercise to you to figure out how best to leverage what I've shown you to improve throughput there.
This is all typed free form and my brain was the only compiler here and SO's syntax highlighting is all I used to make sure syntax was good. So, please forgive any syntax errors and let me know if I screwed up anything too badly that you can't make heads or tails of it and I'll follow up.
The good news is your logic could be easily separated into steps that go into a producer-consumer pipeline.
Step 1: Read file
Step 2: Parse file
Step 3: Write file
Step 4: SendToWs
If you are using .NET 4.0 you can use the BlockingCollection data structure as the backbone for each step's producer-consumer queue. The main thread will enqueue each work item into step 1's queue, where it will be picked up and processed and then forwarded on to step 2's queue, and so on.
If you are willing to move on to the Async CTP then you can take advantage of the new TPL Dataflow structures for this as well. There is the BufferBlock<T> data structure, among others, that behaves in a similar manner to BlockingCollection and integrates well with the new async and await keywords.
Because your algorithm is IO bound the producer-consumer strategies may not get you the performance boost you are looking for, but at least you will have a very elegant solution that would scale well if you could increase the IO throughput. I am afraid steps 1 and 3 will be the bottlenecks and the pipeline will not balance well, but it is worth experimenting with.
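A pipeline of this shape could be sketched with BlockingCollection as follows. The Pipeline and PipelineDemo classes are hypothetical, the read and parse delegates are stand-ins for File.ReadAllText and ParseHtml from the question, and the capacity of 16 per stage is an arbitrary choice. Each stage here runs on a single thread, so item order is preserved.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Pipeline
{
    // Starts a stage that takes items from 'input', applies 'work', and adds
    // the results to a new bounded queue, which it completes when input ends.
    public static BlockingCollection<TOut> Stage<TIn, TOut>(
        BlockingCollection<TIn> input, Func<TIn, TOut> work, int capacity = 16)
    {
        var output = new BlockingCollection<TOut>(capacity);
        Task.Factory.StartNew(() =>
        {
            try
            {
                foreach (TIn item in input.GetConsumingEnumerable())
                    output.Add(work(item));   // blocks while the next stage is behind
            }
            finally
            {
                output.CompleteAdding();      // lets the next stage finish its loop
            }
        }, TaskCreationOptions.LongRunning);
        return output;
    }
}

public static class PipelineDemo
{
    // Wires two stages together and collects the final results in order.
    public static List<string> Run(IEnumerable<string> paths,
                                   Func<string, string> read,
                                   Func<string, string> parse)
    {
        var source = new BlockingCollection<string>();
        BlockingCollection<string> htmlQueue = Pipeline.Stage(source, read);
        BlockingCollection<string> parsedQueue = Pipeline.Stage(htmlQueue, parse);

        foreach (string path in paths) source.Add(path);
        source.CompleteAdding();

        return new List<string>(parsedQueue.GetConsumingEnumerable());
    }
}
```

The write and SendToWs steps would be further stages of the same kind. Error handling is deliberately left out: if a stage's delegate throws, the sketch simply completes that stage's output and stops.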
Just a suggestion, but have you looked into the Consumer / Producer pattern? A certain number of threads would read your files on disk and feed the content to a queue. Then another set of threads, known as the consumers, would "consume" the queue as it's filled. http://zone.ni.com/devzone/cda/tut/p/id/3023
Your best bet in this kind of scenario is definitely the producer-consumer model. One thread to pull the data and a bunch of workers to process it. There's no easy way around the I/O so you might as well just focus on optimizing the computation itself.
I will now try to sketch a model:
// producer thread
var refs = GetReferencesFromDB(); // ~5000 Datarow returned
foreach(var reference in refs) // "ref" and "event" are reserved words in C#, hence the renamed variables
{
lock(queue)
{
queue.Enqueue(reference);
}
itemAdded.Set(); // signal that an item was placed in the queue
// if the queue is limited, test if the queue is full and wait.
}
// consumer threads
while(true)
{
value = null;
lock(queue)
{
if(queue.Count > 0)
{
value = queue.Dequeue();
}
}
if(value != null)
{
// process value
}
else
{
itemAdded.WaitOne(); // wait for the signal that an item was placed in the queue
}
}
You can find more details about producer/consumer in part 4 of Threading in C#: http://www.albahari.com/threading/part4.aspx
I think your approach to split up the list of files and process each file in one batch is ok.
My feeling is that you might get more performance gain if you play with degree of parallelism.
See: var refs = GetReferencesFromDB().AsParallel().WithDegreeOfParallelism(16); this would start processing 16 files at the same time. Currently you are processing probably 2 or 4 files depending on number of cores you have. This is only efficient when you have only computation without IO. For IO intensive tasks adjustment might bring incredible performance improvements reducing processor idle time.
If you are going to split up and join tasks back using producer-consumer look at this sample: Using Parallel Linq Extensions to union two sequences, how can one yield the fastest results first?
Say the method below is being called several thousand times by different threads in a .NET 4 application. What's the best way to handle this situation? I understand that the disk is the bottleneck here, but I'd like the WriteFile() method to return quickly.
Data can be up to a few MB. Are we talking thread pool, TPL or the like?
public void WriteFile(string FileName, MemoryStream Data)
{
try
{
using (FileStream DiskFile = File.OpenWrite(FileName))
{
Data.WriteTo(DiskFile);
DiskFile.Flush();
DiskFile.Close();
}
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
If you want to return quickly and don't really care whether the operation is synchronous, you could create some kind of in-memory queue into which you put write requests; as long as the queue is not full, you can return from the method quickly. Another thread is responsible for draining the queue and writing the files. If WriteFile is called while the queue is full, you will have to wait until there is room, and execution becomes synchronous again. But with a big buffer, if the write requests do not arrive at a steady rate but in bursts (with pauses between the bursts of WriteFile calls), such a change can be seen as an improvement in your performance.
UPDATE:
Made a little picture for you. Notice that the bottleneck always exists; all you can possibly do is optimize requests by using a queue. Notice that the queue has limits, so when it is filled up you cannot instantly queue files into it; you have to wait until there is free space in that buffer too. But for the situation presented in the picture (3 bucket requests), it's obvious that you can quickly put the buckets into the queue and return, while in the first case you have to do that one by one and block execution.
Notice that you never need to run many I/O threads at once, since they will all be using the same bottleneck and you will just be wasting memory if you try to parallelize this heavily. I believe 2 to 10 threads at most will easily take all available I/O bandwidth, and will limit application memory usage too.
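The bounded queue described above could be sketched like this. QueuedFileWriter is a hypothetical class name and a single writer thread is assumed; the answer suggests a handful of threads could also work. Note that File.WriteAllBytes overwrites the target file, which is not identical to the File.OpenWrite behaviour in the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public sealed class QueuedFileWriter : IDisposable
{
    private readonly BlockingCollection<KeyValuePair<string, byte[]>> _queue;
    private readonly Task _writer;

    public QueuedFileWriter(int maxPendingFiles)
    {
        // The bounded capacity is what limits memory use when the disk falls behind.
        _queue = new BlockingCollection<KeyValuePair<string, byte[]>>(maxPendingFiles);
        _writer = Task.Factory.StartNew(() =>
        {
            foreach (var request in _queue.GetConsumingEnumerable())
            {
                try { File.WriteAllBytes(request.Key, request.Value); }
                catch (Exception e) { Console.WriteLine(e.Message); }
            }
        }, TaskCreationOptions.LongRunning);
    }

    // Returns quickly while the queue has room; blocks when it is full.
    public void WriteFile(string fileName, MemoryStream data)
    {
        _queue.Add(new KeyValuePair<string, byte[]>(fileName, data.ToArray()));
    }

    public void Dispose()
    {
        _queue.CompleteAdding();
        _writer.Wait();              // finish the writes that are already queued
        _queue.Dispose();
    }
}
```

Copying the data with ToArray means the caller may reuse or dispose its MemoryStream right after WriteFile returns, at the cost of one extra copy per file.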
Since you say that the files don't need to be written in order nor immediately, the simplest approach would be to use a Task:
private void WriteFileAsynchronously(string FileName, MemoryStream Data)
{
Task.Factory.StartNew(() => WriteFileSynchronously(FileName, Data));
}
private void WriteFileSynchronously(string FileName, MemoryStream Data)
{
try
{
using (FileStream DiskFile = File.OpenWrite(FileName))
{
Data.WriteTo(DiskFile);
DiskFile.Flush();
DiskFile.Close();
}
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
}
The TPL uses the thread pool internally, and should be fairly efficient even for large numbers of tasks.
If data is coming in faster than you can log it, you have a real problem. A producer/consumer design that has WriteFile just throwing stuff into a ConcurrentQueue or similar structure, and a separate thread servicing that queue works great ... until the queue fills up. And if you're talking about opening 50,000 different files, things are going to back up quick. Not to mention that your data that can be several megabytes for each file is going to further limit the size of your queue.
I've had a similar problem that I solved by having the WriteFile method append to a single file. The records it wrote had a record number, file name, length, and then the data. As Hans pointed out in a comment to your original question, writing to a file is quick; opening a file is slow.
A second thread in my program starts reading that file that WriteFile is writing to. That thread reads each record header (number, filename, length), opens a new file, and then copies data from the log file to the final file.
This works better if the log file and the final file are on different disks, but it can still work well with a single spindle. It sure exercises your hard drive, though.
It has the drawback of requiring 2X the disk space, but with 2-terabyte drives under $150, I don't consider that much of a problem. It's also less efficient overall than directly writing the data (because you have to handle the data twice), but it has the benefit of not causing the main processing thread to stall.
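The record layout described above (record number, file name, length, then the data) might be sketched as follows. RecordLog, Append and Split are hypothetical names, and this is not the answerer's actual code. The sketch assumes calls to Append are serialized by the caller, that file names contain no directory parts, and that the three-argument BinaryWriter and BinaryReader constructors are available (.NET 4.5 or later).

```csharp
using System;
using System.IO;
using System.Text;

public static class RecordLog
{
    // Appends one record: number, file name, length, then the raw data.
    public static void Append(Stream log, int recordNumber, string fileName, byte[] data)
    {
        var writer = new BinaryWriter(log, Encoding.UTF8, leaveOpen: true);
        writer.Write(recordNumber);
        writer.Write(fileName);      // written as a length-prefixed string
        writer.Write(data.Length);
        writer.Write(data);
        writer.Flush();
    }

    // Reads records from the current position to the end of the log and writes
    // each one to its own file under targetDirectory. Returns the record count.
    public static int Split(Stream log, string targetDirectory)
    {
        var reader = new BinaryReader(log, Encoding.UTF8, leaveOpen: true);
        int count = 0;
        while (log.Position < log.Length)
        {
            reader.ReadInt32();                       // record number, unused here
            string fileName = reader.ReadString();
            int length = reader.ReadInt32();
            byte[] data = reader.ReadBytes(length);
            File.WriteAllBytes(Path.Combine(targetDirectory, fileName), data);
            count++;
        }
        return count;
    }
}
```

In the design the answer describes, Split would run on a second thread while the log is still being appended to, which needs extra coordination (for example, never reading a record that is only partially written). The sketch leaves that part out.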
Encapsulate your complete method implementation in a new Thread(). Then you can "fire-and-forget" these threads and return to the main calling thread.
foreach (var file in filesArray)
{
try
{
System.Threading.Thread updateThread = new System.Threading.Thread(delegate()
{
WriteFileSynchronous(fileName, data);
});
updateThread.Start();
}
catch (Exception ex)
{
string errMsg = ex.Message;
Exception innerEx = ex.InnerException;
while (innerEx != null)
{
errMsg += "\n" + innerEx.Message;
innerEx = innerEx.InnerException;
}
errorMessages.Add(errMsg);
}
}