I have the following function to write to a file asynchronously from multiple threads in parallel:
static int startOffset = 0; // offset at which the next thread begins to write
static int blockSize = 10; // size of the block written by each thread
static async Task<long> WriteToFile(Stream dataToWrite)
{
    var offset = getStartOffset(); // definition of this function is given later
    using (var fs = new FileStream(fileName,
        FileMode.OpenOrCreate,
        FileAccess.ReadWrite,
        FileShare.ReadWrite))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        await dataToWrite.CopyToAsync(fs);
    }
    return offset;
}
/**
 * I use a reader-writer lock here so that only one thread at a time can
 * read and update startOffset.
 */
static int getStartOffset()
{
    int result = 0;
    rwl.AcquireWriterLock(Timeout.Infinite);
    try
    {
        result = startOffset;
        startOffset += blockSize; // advance startOffset for the next thread
    }
    finally
    {
        rwl.ReleaseWriterLock();
    }
    return result;
}
I then call the above function from multiple threads to write some strings.
var tasks = new List<Task>();
for (int i = 1; i <= 4; i++)
{
    tasks.Add(Task.Run(async () => {
        String s = "aaaaaaaaaa";
        byte[] buffer = new byte [10];
        buffer = Encoding.Default.GetBytes(s);
        Stream data = new MemoryStream(buffer);
        long offset = await WriteToFile(data);
        Console.WriteLine($"Data written at offset - {offset}");
    }));
}
Task.WaitAll(tasks.ToArray());
Now, this code executes well most of the time. But sometimes, at random, it writes some Japanese characters or other symbols to the file. Is there something that I am doing wrong in the multithreading?
Your calculation of startOffset assumes that each thread is writing exactly 10 bytes. There are several issues with this.
One, the data has unknown length:
byte[] buffer = new byte [10];
buffer = Encoding.Default.GetBytes(s);
The assignment doesn't put data into the newly allocated 10 byte array, it leaks the new byte[10] array (which will be garbage collected) and stores a reference to the return of GetBytes(s), which could have any length at all. It could overflow into the next Task's area. Or it could leave some content that existed in the file beforehand (you use OpenOrCreate) which lies in the area for the current Task, but past the end of the actual dataToWrite.
Two, you try to seek past the areas that other threads are expected to write to, but if those writes haven't completed, they haven't increased the file length. So you attempt to seek past the end of the file, which is allowed by the Windows API but might cause problems with the .NET wrappers. However, the documentation for FileStream.Seek does indicate that you are OK:
"When you seek beyond the length of the file, the file size grows"
although this might not be precisely correct, since the Windows API documentation says:
"It is not an error to set a file pointer to a position beyond the end of the file. The size of the file does not increase until you call the SetEndOfFile, WriteFile, or WriteFileEx function. A write operation increases the size of the file to the file pointer position plus the size of the buffer written, which results in the intervening bytes uninitialized."
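One way to address the first issue is to reserve exactly as many bytes as each task actually writes, instead of a fixed block size. The following is a minimal sketch under that assumption, not the asker's code: the class name OffsetWriter and its members are hypothetical, and Interlocked.Add stands in for the reader-writer lock.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class OffsetWriter
{
    // Next free position in the file (hypothetical replacement for startOffset).
    private static long nextOffset = 0;

    // Atomically reserve `length` bytes and return the start of the reserved region.
    public static long Reserve(int length)
    {
        return Interlocked.Add(ref nextOffset, length) - length;
    }

    // Write the payload into its own reserved region and return that region's offset.
    public static async Task<long> WriteAsync(string path, byte[] payload)
    {
        long offset = Reserve(payload.Length);
        using (var fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write,
                                       FileShare.ReadWrite, 4096, useAsync: true))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            await fs.WriteAsync(payload, 0, payload.Length);
        }
        return offset;
    }
}
```

Because the reservation is sized from payload.Length, a payload that encodes to more or fewer than 10 bytes can no longer spill into a neighbouring task's region. Stale content left in the file by an earlier run is a separate concern that this sketch does not handle.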
I think that asynchronous file I/O is not usually meant to be combined with multithreading. Just because something is asynchronous does not mean that an operation should have multiple threads assigned to it.
To quote the documentation for async file I/O: "Asynchronous operations enable you to perform resource-intensive I/O operations without blocking the main thread." In other words, rather than throwing several threads at one operation, asynchronous I/O hands a slow, low-value task off so that the main thread is not blocked. In a large enough application, most of the work can be broken into such small tasks, which is how multithreaded programs stay fast.
What you are likely experiencing is undefined behavior due to multiple threads overwriting the same location. The Japanese characters you are referring to are likely malformed ASCII/Unicode byte sequences that your text editor is attempting to interpret.
If you would like to remedy the undefined behavior and keep using asynchronous operations, you can await each individual task before starting the next one. This will prevent the offset variable from being in the wrong position for the newest task, although logically it will then run the same as a synchronous version.
I have a library that is connected to a network service over TCP sockets and receives data from it at unpredictable times. I need to process this data line by line, and I have two options:
Create a new thread with a never-ending loop that calls StreamReader.ReadLine(). I don't want to do that: this is a library, and threads should be fully under the control of the main program.
Use an async callback that gets called every time some data arrives in the stream buffer. I currently use this option, but I am having trouble getting whole lines only. I hacked together this simple solution:
// This function reset the callback after it's processed
private void resetCallback()
{
if (this.networkStream == null)
return;
if (!string.IsNullOrEmpty(this.lineBuffer) && this.lineBuffer.EndsWith("\n"))
{
this.processOutput(this.lineBuffer);
this.lineBuffer = "";
}
AsyncCallback callback = new AsyncCallback(OnReceive);
this.networkStream.BeginRead(buffer, 0, buffer.Length, callback, this.networkStream);
}
// This function gets called every time some data arrives to buffer
private void OnReceive(IAsyncResult data)
{
if (this.networkStream == null)
return;
int bytes = this.networkStream.EndRead(data);
string text = System.Text.Encoding.UTF8.GetString(buffer, 0, bytes);
if (!text.Contains("\n"))
{
this.lineBuffer += text;
}
else
{
List<string> parts = new List<string>(text.Split('\n'));
while (parts.Count > 0)
{
this.lineBuffer += parts[0];
if (parts.Count > 1)
{
this.processOutput(this.lineBuffer + "\n");
this.lineBuffer = "";
}
parts.RemoveAt(0);
}
}
this.resetCallback();
}
As you can see, I am using a very nasty solution in which I check, for every "packet" of data received in the buffer:
Whether the data in the buffer is a whole line (ends with a new line)
Whether the data in the buffer contains more than one line (a new line is somewhere in the middle of the data, or there is more than one new line)
Whether the data in the buffer contains only part of a line (no new line in the text)
The problem here is that the callback function can be called at any time when some data is received, and this data can be a line, part of a line, or even multiple lines.
Based on where the new line is, I store the data in another buffer, and when I finally get a new line, I process it.
This actually works just fine, but I am still wondering whether there is a cleaner solution that doesn't require this kind of hack to read the stream line by line without using threads.
Please note commenter Damien_The_Unbeliever's point about the issue with partial UTF8 characters. As he says, there's nothing in TCP that would guarantee that you only receive whole characters; a sequence of bytes in the stream can be interrupted at any point, including mid-character.
The usual way to address this would be to use an instance of a Decoder (which you can retrieve from the appropriate Encoding subclass, e.g. Encoding.UTF8.GetDecoder()). A decoder instance will buffer incomplete byte sequences for you, returning only whole characters as they become available.
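As a rough illustration of the Decoder approach, here is a minimal sketch. The class and method names (DecoderDemo, DecodeChunks) are made up for this example, and the byte arrays stand in for whatever each read from the network stream returns.

```csharp
using System;
using System.Text;

static class DecoderDemo
{
    // Decode a sequence of byte chunks with one stateful decoder, so a character
    // split across two chunks is emitted only once it is complete.
    public static string DecodeChunks(params byte[][] chunks)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var text = new StringBuilder();
        foreach (byte[] chunk in chunks)
        {
            // GetCharCount takes any bytes held over from the previous chunk into account.
            char[] chars = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            int produced = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            text.Append(chars, 0, produced);
        }
        return text.ToString();
    }
}
```

For example, the two-byte UTF-8 encoding of "é" (0xC3 0xA9) can be split across two chunks: DecodeChunks(new byte[] { 0x61, 0xC3 }, new byte[] { 0xA9, 0x62 }) yields "aéb", whereas calling Encoding.UTF8.GetString on each chunk separately would produce replacement characters.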
But in your case, there is a much easier way: use the TextReader.ReadLineAsync() method.
For example, here's an asynchronous method which will process each line of text read from the stream, with the returned task for the method completing only when the stream itself has reached the end (i.e. graceful closure of the socket):
async Task ProcessLines()
{
using (StreamReader reader = new StreamReader(
this.networkStream, Encoding.UTF8, false, 1024, true))
{
string line;
while ((line = await reader.ReadLineAsync()) != null)
{
this.processOutput(line);
}
}
// Clean up here. I.e. send any remaining response to remote endpoint,
// call Socket.Shutdown(SocketShutdown.Both), and then close the
// socket.
}
You would call that (preferably awaiting the result in another async method, though that would depend on the exact context of the caller) from wherever you call resetCallback() now. Given the lack of a good, minimal, complete code example a more specific explanation than that can't be provided.
The key is that, being an async method, the method will return as soon as you call ReadLineAsync() (assuming the call doesn't complete immediately), and will resume execution later once that operation completes, i.e. a complete line of text is available and can be returned.
This is the standard idiom now in C# for dealing with this kind of asynchronous operation. It allows you to write the code practically as if you are doing everything synchronously, while the compiler rewrites the code for you to actually implement it asynchronously.
(As an aside: you may want to consider using the usual .NET conventions, i.e. Pascal casing, for method names, instead of the Java-style camel-casing. That will help readers more readily understand your code examples).
I am pretty new to coding with some experience in ASM and C for PIC. I am still learning high level programming with C#.
Question
I have a serial port data reception and processing program in C#. To avoid losing data and to know when it arrives, I subscribed to the DataReceived event and looped in the handler until there were no more bytes to read.
When I attempted this with data arriving continuously, the loop never ended and blocked my program from doing other tasks (such as processing the retrieved data).
After reading about threading in C#, I created a thread that constantly checks the SerialPort.BytesToRead property so it knows when to retrieve available data.
I created a second thread that processes data while new data is still being read. Even while ReadSerial() still has bytes to read and its timeout (restarted every time a new byte is read from the serial port) has not expired, the bytes already read can be processed and the frames assembled by a method named DataProcessing(), which reads from the same variable that ReadSerial() fills.
This gave me the desired results, but I noticed that with my solution (both the ReadSerial() and DataProcessing() threads alive), CPU usage skyrocketed all the way to 100%!
How do you approach this problem without causing such high CPU usage?
public static void ReadSerial() //Method that handles Serial Reception
{
while (KeepAlive) // Bool variable used to keep alive the thread. Turned to false
{ // when the program ends.
if (Port.BytesToRead != 0)
{
for (int i = 0; i < 5000; i++)
{
/* I Don't know any other way to
implement a timeout to wait for
additional characters so i took what
i knew from PIC Serial Data Handling. */
if (Port.BytesToRead != 0)
{
RxList.Add(Convert.ToByte(Port.ReadByte()));
i = 0;
if (RxList.Count > 20) // In case the method is stuck still reading
BufferReady = true; // signal the Data Processing thread to
} // work with that chunk of data.
BufferReady = true; // signals the DataProcessing Method to work
} // with the current data in RxList.
}
}
}
I don't completely understand what you mean by "DataReceived" and the "loop". I also work a lot with serial ports as well as other interfaces. In my application I attach to the DataReceived event and read based on the number of bytes to read, but I don't use a loop there:
int bytesToRead = this._port.BytesToRead;
var data = new byte[bytesToRead];
this._port.BaseStream.Read(data , 0, bytesToRead);
If you are using a loop to read the bytes I recommend something like:
System.Threading.Thread.Sleep(...);
Otherwise the thread you are using to read the bytes is busy all the time, which means that other threads are starved or your CPU sits at 100%.
But I think you don't have to poll for the data in a loop if you are using the DataReceived event. If my understanding is not correct or you need further information, please ask.
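To hand received bytes to a processing thread without either side spinning, one option is a blocking queue. The sketch below is only an illustration under assumptions: SerialPipeline and its members are hypothetical names, and the wiring to the DataReceived event is shown in a comment rather than executed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

static class SerialPipeline
{
    // Chunks handed from the receive side to the processing side.
    private static readonly BlockingCollection<byte[]> Chunks =
        new BlockingCollection<byte[]>();

    // Call this from the DataReceived handler with the bytes just read, e.g.
    //   port.DataReceived += (s, e) => {
    //       var data = new byte[port.BytesToRead];
    //       int n = port.Read(data, 0, data.Length);
    //       Array.Resize(ref data, n);
    //       SerialPipeline.OnChunk(data);
    //   };
    public static void OnChunk(byte[] data)
    {
        Chunks.Add(data);
    }

    // Signal that no more data will arrive (for example when the program ends).
    public static void Complete()
    {
        Chunks.CompleteAdding();
    }

    // Run this on the processing thread. GetConsumingEnumerable blocks while the
    // queue is empty, so the thread sleeps instead of burning CPU in a loop.
    public static List<byte> ProcessAll()
    {
        var received = new List<byte>();
        foreach (byte[] chunk in Chunks.GetConsumingEnumerable())
        {
            received.AddRange(chunk); // frame assembly would go here
        }
        return received;
    }
}
```

The design choice here is that neither thread polls: the receive side runs only when the event fires, and the processing side wakes only when a chunk is queued.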
I've been trying to make a program to transfer a file with bandwidth throttling (after zipping it) to another computer on the same network.
I need to get its bandwidth throttled in order to avoid saturation (Kind of the way Robocopy does).
Recently, I found the ThrottledStream class, but it doesn't seem to be working: I can send a 9 MB file with the throttle set to 1 byte per second and it still arrives almost instantly. So I need to know whether I am misapplying the class.
Here's the code:
using (FileStream originStream = inFile.OpenRead())
using (MemoryStream compressedFile = new MemoryStream())
using (GZipStream zippingStream = new GZipStream(compressedFile, CompressionMode.Compress))
{
originStream.CopyTo(zippingStream);
using (FileStream finalDestination = File.Create(destination.FullName + "\\" + inFile.Name + ".gz"))
{
ThrottledStream destinationStream = new ThrottledStream(finalDestination, bpsLimit);
byte[] buffer = new byte[bufferSize];
int readCount = compressedFile.Read(buffer,0,bufferSize);
while(readCount > 0)
{
destinationStream.Write(buffer, 0, bufferSize);
readCount = compressedFile.Read(buffer, 0, bufferSize);
}
}
}
Any help would be appreciated.
The ThrottledStream class you linked to uses a delay calculation to determine how long to wait before performing the current write. This delay is based on the amount of data sent before the current write and on how much time has elapsed. Once the delay period has passed, it writes the entire buffer in a single chunk.
The problem with this is that it doesn't do any checks on the size of the buffer being written in a particular write operation. If you ask it to limit throughput to 1 byte per second, then call the Write method with a 20MB buffer, it will write the entire 20MB immediately. If you then try to write another block of data that is 2 bytes long, it will wait for a very long time (20*2^20 seconds) before writing those two bytes.
In order to get the ThrottledStream class to work more smoothly, you have to call Write with very small blocks of data. Each block will still be written immediately, but the delays between the write operations will be smaller and the throughput will be much more even.
In your code you use a variable named bufferSize to determine the number of bytes to process per read/write in the internal loop. Try setting bufferSize to 256, which will result in many more reads and writes, but will give the ThrottledStream a chance to actually introduce some delays.
If you set bufferSize to be the same as bpsLimit you should see a single write operation complete every second. The smaller you set bufferSize, the more write operations you'll get per second and the more smoothly the bandwidth throttling will work.
Normally we like to process as much of a buffer as possible in each operation to decrease the overheads, but in this case you're explicitly trying to add overheads to slow things down :)
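The small-block loop described above could be sketched as follows. This is an illustration, not the linked ThrottledStream's API: ChunkedCopy and CopyInChunks are hypothetical names, and the destination can be any Stream, including a throttling wrapper.

```csharp
using System;
using System.IO;

static class ChunkedCopy
{
    // Copy source to destination in blocks of at most chunkSize bytes and
    // return how many Write calls were made.
    public static int CopyInChunks(Stream source, Stream destination, int chunkSize)
    {
        byte[] buffer = new byte[chunkSize];
        int writes = 0;
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Write only the bytes actually read, not the whole buffer.
            destination.Write(buffer, 0, read);
            writes++;
        }
        return writes;
    }
}
```

With a 256-byte chunk size, a throttling destination gets one chance to delay per 256 bytes rather than one chance per multi-megabyte buffer. Note that a source MemoryStream has to be positioned at the start (Position = 0) before it is read back.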
I'm building a console application that has to process a bunch of data.
Basically, the application grabs references from a DB. For each reference, it parses the content of a file and makes some changes. The files are HTML files, and the process does heavy work with RegEx replacements (finding references and transforming them into links). The result is then stored on the file system and sent to an external system.
To summarize the process sequentially:
var refs = GetReferencesFromDB(); // ~5000 Datarow returned
foreach(var ref in refs)
{
var filePath = GetFilePath(ref); // This method looks up in a previously loaded file list
var html = File.ReadAllText(filePath); // Read html locally, or from a network drive
var convertedHtml = ParseHtml(html);
File.WriteAllText(destinationFilePath, convertedHtml); // Copy the result locally, or a network drive
SendToWs(ref, convertedHtml);
}
My program works correctly but is quite slow. That's why I want to parallelize the process.
So far, I have made a simple parallelization by adding AsParallel:
var refs = GetReferencesFromDB().AsParallel();
refs.ForAll(ref=>
{
var filePath = GetFilePath(ref);
var html = File.ReadAllText(filePath);
var convertedHtml = ParseHtml(html);
File.WriteAllText(destinationFilePath, convertedHtml);
SendToWs(ref, convertedHtml);
});
This simple change decreased the duration of the process (25% less time). However, my understanding of parallelization is that there won't be much benefit (or worse, there may even be a loss) when parallelizing over resources that rely on I/O, because the I/O throughput won't magically double.
That's why I think I should change my approach: rather than parallelizing the whole process, create dependent, chained, queued tasks.
That is, I should create a flow like:
Queue read file. When finished, Queue ParseHtml. When finished, Queue both send to WS and write locally. When finished, log the result.
However, I don't know how to implement such a thing.
I feel it will end up as a set of consumer/producer queues, but I couldn't find a suitable sample.
Moreover, I'm not sure whether there will be any benefit.
Thanks for any advice.
[Edit] In fact, I'm the perfect candidate for using c# 4.5... if only it was rtm :)
[Edit 2] Another thing that makes me think it's not correctly parallelized is that, in the resource monitor, the graphs of CPU, network I/O and disk I/O are not stable: when one is high, the others are low to medium.
You're not leveraging any async I/O APIs in any of your code. Everything you're doing is CPU bound, and all your I/O operations are going to waste CPU resources by blocking. AsParallel is for compute-bound tasks; if you want to take advantage of async I/O, you need to leverage the Asynchronous Programming Model (APM) based APIs available today in <= v4.0. This is done by looking for BeginXXX/EndXXX methods on the I/O classes you're using and leveraging those whenever available.
Read this post for starters: TPL TaskFactory.FromAsync vs Tasks with blocking methods
Next, you don't want to use AsParallel in this case anyway. AsParallel enables streaming, which will result in immediately scheduling a new Task per item, but you don't need or want that here. You'd be much better served by partitioning the work using Parallel::ForEach.
Let's see how you can use this knowledge to achieve max concurrency in your specific case:
var refs = GetReferencesFromDB();
// Using Parallel::ForEach here will partition and process your data on separate worker threads
Parallel.ForEach(
refs,
ref =>
{
string filePath = GetFilePath(ref);
byte[] fileDataBuffer = new byte[1048576];
// Need to use FileStream API directly so we can enable async I/O
FileStream sourceFileStream = new FileStream(
filePath,
FileMode.Open,
FileAccess.Read,
FileShare.Read,
8192,
true);
// Use FromAsync to read the data from the file
Task<int> readSourceFileStreamTask = Task<int>.Factory.FromAsync(
sourceFileStream.BeginRead,
sourceFileStream.EndRead,
fileDataBuffer,
0,
fileDataBuffer.Length,
null);
// Add a continuation that will fire when the async read is completed
readSourceFileStreamTask.ContinueWith(readSourceFileStreamAntecedent =>
{
int sourceFileStreamBytesRead;
try
{
// Determine exactly how many bytes were read
// NOTE: this will propagate any potential exception that may have occurred in EndRead
sourceFileStreamBytesRead = readSourceFileStreamAntecedent.Result;
}
finally
{
// Always clean up the source stream
sourceFileStream.Close();
sourceFileStream = null;
}
// This is here to make sure you don't end up trying to read files larger than this sample code can handle
if(sourceFileStreamBytesRead == fileDataBuffer.Length)
{
throw new NotSupportedException("You need to implement reading files larger than 1MB. :P");
}
// Convert the file data to a string
string html = Encoding.UTF8.GetString(fileDataBuffer, 0, sourceFileStreamBytesRead);
// Parse the HTML
string convertedHtml = ParseHtml(html);
// This is here to make sure you don't end up trying to write files larger than this sample code can handle
if(Encoding.UTF8.GetByteCount(convertedHtml) > fileDataBuffer.Length)
{
throw new NotSupportedException("You need to implement writing files larger than 1MB. :P");
}
// Convert the converted HTML back to bytes for writing, keeping the actual byte count
int bytesToWrite = Encoding.UTF8.GetBytes(convertedHtml, 0, convertedHtml.Length, fileDataBuffer, 0);
// Need to use FileStream API directly so we can enable async I/O
FileStream destinationFileStream = new FileStream(
destinationFilePath,
FileMode.OpenOrCreate,
FileAccess.Write,
FileShare.None,
8192,
true);
// Use FromAsync to write the data to the file
Task destinationFileStreamWriteTask = Task.Factory.FromAsync(
destinationFileStream.BeginWrite,
destinationFileStream.EndWrite,
fileDataBuffer,
0,
bytesToWrite,
null);
// Add a continuation that will fire when the async write is completed
destinationFileStreamWriteTask.ContinueWith(destinationFileStreamWriteAntecedent =>
{
try
{
// NOTE: we call wait here to observe any potential exceptions that might have occurred in EndWrite
destinationFileStreamWriteAntecedent.Wait();
}
finally
{
// Always close the destination file stream
destinationFileStream.Close();
destinationFileStream = null;
}
},
TaskContinuationOptions.AttachedToParent);
// Send to external system **concurrent** to writing to destination file system above
SendToWs(ref, convertedHtml);
},
TaskContinuationOptions.AttachedToParent);
});
Now, here are a few notes:
This is sample code, so I'm using a 1MB buffer to read/write files. This is excessive for HTML files and wasteful of system resources. You can either lower it to suit your max needs or implement chained reads/writes into a StringBuilder, which is an exercise I leave up to you since I'd be writing ~500 more lines of code to do async chained reads/writes. :P
You'll note that on the continuations for the read/write tasks I have TaskContinuationOptions.AttachedToParent. This is very important as it will prevent the worker thread that the Parallel::ForEach starts the work with from completing until all the underlying async calls have completed. If this was not here you would kick off work for all 5000 items concurrently which would pollute the TPL subsystem with thousands of scheduled Tasks and not scale properly at all.
I call SendToWs concurrently with writing the file to the file share here. I don't know what underlies the implementation of SendToWs, but it too sounds like a good candidate for making async. Right now it's assumed to be pure compute work and, as such, it is going to burn a CPU thread while executing. I leave it as an exercise for you to figure out how best to leverage what I've shown you to improve throughput there.
This is all typed free form: my brain was the only compiler here, and SO's syntax highlighting is all I used to make sure the syntax was good. So please forgive any syntax errors, and let me know if I screwed up anything so badly that you can't make heads or tails of it, and I'll follow up.
The good news is your logic could be easily separated into steps that go into a producer-consumer pipeline.
Step 1: Read file
Step 2: Parse file
Step 3: Write file
Step 4: SendToWs
If you are using .NET 4.0 you can use the BlockingCollection data structure as the backbone for each step's producer-consumer queue. The main thread will enqueue each work item into step 1's queue, where it will be picked up, processed, and then forwarded on to step 2's queue, and so on.
If you are willing to move on to the Async CTP then you can take advantage of the new TPL Dataflow structures for this as well. There is the BufferBlock<T> data structure, among others, that behaves in a similar manner to BlockingCollection and integrates well with the new async and await keywords.
Because your algorithm is IO bound the producer-consumer strategies may not get you the performance boost you are looking for, but at least you will have a very elegant solution that would scale well if you could increase the IO throughput. I am afraid steps 1 and 3 will be the bottlenecks and the pipeline will not balance well, but it is worth experimenting with.
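A BlockingCollection-based pipeline along these lines might be sketched as follows. This is an illustration only: Pipeline, RunStage and Demo are hypothetical names, the lambdas are stand-ins for the real read and parse steps, and only two of the four stages are shown.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class Pipeline
{
    // Start a stage that applies `work` to every item from `input` and forwards
    // the result to `output`. When the input is exhausted the output is completed,
    // which lets the next stage finish in turn.
    public static Task RunStage<TIn, TOut>(
        BlockingCollection<TIn> input,
        BlockingCollection<TOut> output,
        Func<TIn, TOut> work)
    {
        return Task.Factory.StartNew(() =>
        {
            try
            {
                foreach (TIn item in input.GetConsumingEnumerable())
                {
                    output.Add(work(item));
                }
            }
            finally
            {
                output.CompleteAdding();
            }
        }, TaskCreationOptions.LongRunning);
    }

    public static List<string> Demo(IEnumerable<string> paths)
    {
        // Bounded queues keep a fast stage from running far ahead of a slow one.
        var toRead = new BlockingCollection<string>(boundedCapacity: 16);
        var toParse = new BlockingCollection<string>(boundedCapacity: 16);
        var done = new BlockingCollection<string>(); // unbounded final queue

        // Stand-ins for "read file" and "parse file".
        Task read = RunStage(toRead, toParse, path => "<html>" + path + "</html>");
        Task parse = RunStage(toParse, done, html => html.ToUpperInvariant());

        foreach (string path in paths)
        {
            toRead.Add(path);
        }
        toRead.CompleteAdding();

        Task.WaitAll(read, parse);
        return new List<string>(done);
    }
}
```

Each stage here has a single consumer. Running several consumers for a slow stage is possible, but then the stage's output must be completed only after all of its consumers have finished.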
Just a suggestion, but have you looked into the consumer/producer pattern? A certain number of threads would read your files from disk and feed the content to a queue. Then another set of threads, known as the consumers, would "consume" the queue as it's filled. http://zone.ni.com/devzone/cda/tut/p/id/3023
Your best bet in this kind of scenario is definitely the producer-consumer model: one thread to pull the data and a bunch of workers to process it. There's no easy way around the I/O, so you might as well focus on optimizing the computation itself.
I will now try to sketch a model:
// assumes: Queue<DataRow> queue; AutoResetEvent itemAvailable
// producer thread
var refs = GetReferencesFromDB(); // ~5000 DataRows returned
foreach (var reference in refs)
{
    lock (queue)
    {
        queue.Enqueue(reference);
        itemAvailable.Set(); // signal that an item was placed in the queue
    }
    // if the queue is limited, test if the queue is full and wait.
}
// consumer threads
while (true)
{
    DataRow value = null;
    lock (queue)
    {
        if (queue.Count > 0)
        {
            value = queue.Dequeue();
        }
    }
    if (value != null)
    {
        // process value
    }
    else
    {
        itemAvailable.WaitOne(); // wait until an item is placed in the queue
    }
}
You can find more details about producer/consumer in part 4 of Threading in C#: http://www.albahari.com/threading/part4.aspx
I think your approach of splitting up the list of files and processing each file in one batch is OK.
My feeling is that you might get more of a performance gain if you play with the degree of parallelism.
See: var refs = GetReferencesFromDB().AsParallel().WithDegreeOfParallelism(16); this would start processing 16 files at the same time. Currently you are probably processing 2 or 4 files at a time, depending on the number of cores you have. That default is only efficient when you have pure computation without I/O. For I/O-intensive tasks, adjusting it can bring large performance improvements by reducing processor idle time.
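To experiment with the degree of parallelism in isolation, a small sketch like the following may help. DopDemo is a hypothetical name, and the Thread.Sleep call is only a stand-in for blocking file or network I/O.

```csharp
using System;
using System.Linq;
using System.Threading;

static class DopDemo
{
    // Process `items` work items with the requested degree of parallelism.
    // Result order is not guaranteed because AsOrdered() is not used.
    public static int[] Run(int items, int degree)
    {
        return Enumerable.Range(0, items)
            .AsParallel()
            .WithDegreeOfParallelism(degree)
            .Select(i =>
            {
                Thread.Sleep(10); // simulated blocking I/O
                return i * 2;
            })
            .ToArray();
    }
}
```

Timing DopDemo.Run(32, 2) against DopDemo.Run(32, 16) with a Stopwatch gives a feel for how much blocked time a higher degree can hide. The right value for real file and network I/O has to be measured.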
If you are going to split up and join tasks back together using producer-consumer, look at this sample: Using Parallel Linq Extensions to union two sequences, how can one yield the fastest results first?
I need to write a large amount of data in my application. Data arrives periodically and is pushed into a queue.
//producer
queue_.push( new_data );// new_data is a 64kb memory range
PostThreadMessage(worker_thread_id_, MyMsg, 0, 0);//notify the worker thread
After that, I need to write this data to a file. I have another thread that gets data from the queue and writes it.
//consumer
while(GetMessage(&msg, NULL, 0, 0) > 0)
{
data = queue_.pop();
WriteFile(file_handle_, data.begin(), data.size(), &io_bytes, NULL);
}
This is simple, but not efficient.
Another option is to use I/O completion ports.
//producer
//file must be associated with io completion port
WriteFile( file_handle_
, new_data.begin()
, new_data.size()
, &io_bytes
, new my_overlapped(new_data) );
The consumer just deletes the unused buffers:
my_completion_key* ck = 0;
my_overlapped* op = 0;
BOOL ok = GetQueuedCompletionStatus(completion_port_, &io_bytes, (PULONG_PTR)&ck, (LPOVERLAPPED*)&op, INFINITE);
//process errors
delete op->data;
What is the preferred way to perform a large number of file I/O operations on Windows?
Hard to tell because it is not clear what kind of problem you are experiencing with your current approach. Going async in a dedicated "writer" thread shouldn't result in considerable performance improvement.
If you have small but frequent writes (so that the cost of frequent WriteFile calls becomes a bottleneck), I would suggest implementing a kind of lazy write: collect data and flush it to the file when some volume threshold is reached.
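The lazy-write idea can be sketched as below. It is written in C# rather than against the Win32 API used in the question, BatchingWriter is a hypothetical name, and the class assumes a single writer thread (it is not thread-safe).

```csharp
using System;
using System.IO;

sealed class BatchingWriter : IDisposable
{
    private readonly Stream target;
    private readonly MemoryStream pending = new MemoryStream();
    private readonly int threshold;

    // Number of times data was actually pushed to the target stream.
    public int FlushCount { get; private set; }

    public BatchingWriter(Stream target, int thresholdBytes)
    {
        this.target = target;
        this.threshold = thresholdBytes;
    }

    // Collect data in memory; touch the target only once enough has accumulated.
    public void Write(byte[] data)
    {
        pending.Write(data, 0, data.Length);
        if (pending.Length >= threshold)
        {
            Flush();
        }
    }

    // Push everything collected so far to the target in a single write.
    public void Flush()
    {
        if (pending.Length == 0)
        {
            return;
        }
        target.Write(pending.GetBuffer(), 0, (int)pending.Length);
        pending.SetLength(0);
        FlushCount++;
    }

    // Write out any remainder; the target stream itself stays open.
    public void Dispose()
    {
        Flush();
        target.Flush();
    }
}
```

With a threshold of, say, 1 MB, sixteen 64 KB blocks turn into one write to the file instead of sixteen. The trade-off is that data held in memory is lost if the process dies before the next flush.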