Multithread Write byte[] into file - c#

I'm hoping someone can help me; I have a question about writing into a file using multiple threads/Tasks. See my code sample below...
AddFile returns an array of longs holding the blob number, the offset inside the blob, and the size of the data written into the blob:
public long[] AddFile(byte[] data){
    long[] values = new long[3];
    values[0] = WorkingIndex = getBlobIndex(data); // blobNumber
    values[1] = blobFS[WorkingIndex].Position;     // offset
    values[2] = data.Length;                       // size
    // blobFS is a collection of FileStreams
    blobFS[WorkingIndex].Write(data, 0, data.Length);
    return values;
}
So let's say I use the AddFile function inside a foreach loop like the one below:
List<Task> tasks = new List<Task>(System.Environment.ProcessorCount);
foreach(var file in Directory.GetFiles(@"C:\Documents"))
{
    var task = Task.Factory.StartNew(() => {
        byte[] data = File.ReadAllBytes(file);
        long[] info = blob.AddFile(data);
        return info;
    });
    task.ContinueWith(t => { /* do some stuff */ });
    tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
return result;
I can imagine that this will totally fail, in the sense that files will overwrite each other inside the blob, because the Write call for file1 hasn't finished while another task is writing file2 at the same time.
So what is the best way to solve this problem? Maybe using asynchronous write functions...
Your help would be appreciated!
Kind regards,
Martijn

My suggestion here would be to not run these tasks in parallel. It's likely that disk IO will be the bottleneck for any file-based operation, and so running them in parallel will just cause each thread to be blocked accessing the disk. Ultimately, you'll quite possibly find that your code runs significantly slower as you've written it than it would run in serial.
Is there a particular reason that you want these in parallel? Can you handle the disk writes serially and just call ContinueWith() on separate threads instead? This would have the benefit of removing the problem you're posting about, too.
EDIT: an example naive reimplementation of your foreach loop:
foreach(var file in Directory.GetFiles(@"C:\Documents"))
{
    byte[] data = File.ReadAllBytes(file); // this happens on the main thread
    // processing of each file is handled in multiple threads in parallel to disk IO
    var task = Task.Factory.StartNew(() => {
        long[] info = blob.AddFile(data);
        return info;
    });
    task.ContinueWith(t => { /* do some stuff */ });
    tasks.Add(task);
}
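If you really do need to keep calling AddFile from multiple tasks, another option is to make AddFile itself thread-safe. This is only a minimal sketch (the writeLock field is my addition, not from your post): taking a lock around the position read and the write guarantees that the offset you record always matches where the data actually lands.

private readonly object writeLock = new object();

public long[] AddFile(byte[] data)
{
    // serialize the position read and the write so parallel callers can't interleave
    lock (writeLock)
    {
        long[] values = new long[3];
        values[0] = WorkingIndex = getBlobIndex(data); // blobNumber
        values[1] = blobFS[WorkingIndex].Position;     // offset
        values[2] = data.Length;                       // size
        blobFS[WorkingIndex].Write(data, 0, data.Length);
        return values;
    }
}

The writes are still serialized, of course, so this only helps if the work around AddFile (reading and processing the files) is the expensive part.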

Related

Spawn new thread inside each foreach(), but do not return until all complete

I have a foreach() that loops through 15 reports and generates a PDF for each. The PDF generation process is slow (3 seconds each). But if I could generate them all concurrently with threads, maybe all 15 could be done in 4-5 seconds total. One constraint is that the function must not return until ALL PDFs have been generated. Also, will 15 concurrent worker threads cause problems or instability for .NET/Windows?
Here is my pseudocode:
private void makePDFs(string path) {
    string[] folders = Directory.GetDirectories(path);
    foreach(string folderPath in folders) {
        generatePDF(...);
    }
    // DO NOT RETURN UNTIL ALL PDFs HAVE BEEN GENERATED
}
What is the simplest way to achieve this?
The most straightforward approach is to use Parallel.ForEach:
private void makePDFs(string path)
{
    string[] folders = Directory.GetDirectories(path);
    Parallel.ForEach(folders, (folderPath) =>
    {
        generatePDF(folderPath);
    });
    // WILL NOT RETURN UNTIL ALL PDFs HAVE BEEN GENERATED
}
This way you avoid having to create, keep track of, and await each separate task; the TPL does it all for you.
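One caveat worth adding (my note, not part of the original answer): if any generatePDF call throws, Parallel.ForEach collects the exceptions and rethrows them wrapped in a single AggregateException, so you may want to handle it like this:

try
{
    Parallel.ForEach(folders, folderPath => generatePDF(folderPath));
}
catch (AggregateException ex)
{
    foreach (var inner in ex.InnerExceptions)
        Console.WriteLine(inner.Message); // log or report each failed folder
}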
You need to get a list of tasks and then use Task.WhenAll to wait for their completion:
var tasks = folders.Select(folder => Task.Run(() => generatePDF(...)));
await Task.WhenAll(tasks);
If you can't or don't want to use async/await you can use:
Task.WaitAll(tasks.ToArray());
It will block the current thread until all tasks are completed, so I'd recommend using the first approach if you can.
You can also run your PDF generation in parallel using the C# Parallel class:
Parallel.ForEach(folders, folder => generatePDF(...));
Please see this answer to choose which approach works best for your problem.
.NET has a handy method just for this: Task.WhenAll(IEnumerable<Task>)
It will wait for all tasks in the IEnumerable to finish before continuing. It is an async method, so you need to await it.
var tasks = new List<Task>();
foreach(string folderPath in folders) {
    tasks.Add(Task.Run(() => generatePDF(folderPath)));
}
await Task.WhenAll(tasks);
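Since await can only be used inside an async method, the whole thing might look roughly like this (the method name and the Task return type are my assumption, not from the question):

private async Task makePDFsAsync(string path)
{
    string[] folders = Directory.GetDirectories(path);
    var tasks = new List<Task>();
    foreach (string folderPath in folders)
    {
        tasks.Add(Task.Run(() => generatePDF(folderPath)));
    }
    await Task.WhenAll(tasks); // does not resume until every PDF has been generated
}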

What is the fastest possible way to read a serial port in .net?

I need a serial port program to read data coming in at 4800 baud. Right now I have a simulator sending 15 lines of data every second. The output seems to fall "behind" and can't keep up with the speed/amount of data coming in.
I have tried using ReadLine() with a DataReceived event, which did not seem to be reliable, and now I am using an async method with serialPort.BaseStream.ReadAsync:
okToReadPort = true;
Task readTask = new Task(startAsyncRead);
readTask.Start();

// this method starts the async read process and the "nmeaList" is what
// is used by the other thread to display data
public async void startAsyncRead()
{
    while (okToReadPort)
    {
        Task<string> task = ReadLineAsync(serialPort);
        string line = await task;
        NMEAMsg tempMsg = new NMEAMsg(line);
        if (tempMsg.sentenceType != null)
        {
            nmeaList[tempMsg.sentenceType] = tempMsg;
        }
    }
}

// extension method, so it needs to live in a static class
public static async Task<string> ReadLineAsync(this SerialPort serialPort)
{
    // Console.WriteLine("Entering ReadLineAsync()...");
    byte[] buffer = new byte[1];
    string ret = string.Empty;
    while (true)
    {
        await serialPort.BaseStream.ReadAsync(buffer, 0, 1);
        ret += serialPort.Encoding.GetString(buffer);
        if (ret.EndsWith(serialPort.NewLine))
            return ret.Substring(0, ret.Length - serialPort.NewLine.Length);
    }
}
This still seems inefficient; does anyone know of a better way to ensure that every piece of data is read from the port and accounted for?
Generally speaking, your issue is that you are performing the IO and the data processing in lockstep on the same thread. It doesn't help that your data processing is relatively expensive (string concatenation).
To fix the general problem, when you read a byte, put it into a processing buffer (a BlockingCollection works great here, as it solves the producer/consumer handoff) and have another thread read from that buffer. That way the serial port can immediately begin reading again instead of waiting for your processing to finish.
As a side note, you would likely see a benefit from using a StringBuilder in your code instead of string concatenation. You should still process via a queue, though.
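A minimal sketch of that idea, reusing the names from the question (the lineQueue field and the consumer task are my additions; this needs System.Collections.Concurrent):

private readonly BlockingCollection<string> lineQueue = new BlockingCollection<string>();

// Producer: keep the port-reading loop as tight as possible
public async void startAsyncRead()
{
    while (okToReadPort)
    {
        string line = await ReadLineAsync(serialPort);
        lineQueue.Add(line); // hand the line off immediately, then go back to reading
    }
    lineQueue.CompleteAdding();
}

// Consumer: runs on another thread and does the relatively expensive parsing
Task.Run(() =>
{
    foreach (string line in lineQueue.GetConsumingEnumerable())
    {
        NMEAMsg tempMsg = new NMEAMsg(line);
        if (tempMsg.sentenceType != null)
            nmeaList[tempMsg.sentenceType] = tempMsg;
    }
});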

Reading a lot of files "at the same time"

I'm using FileSystemWatcher to catch every created, changed, deleted and renamed event for any file in a folder.
For each of these changes I need to compute a simple checksum of the file's contents. I simply open a FileStream and pass it to the MD5 class:
private byte[] calculateChecksum(string frl)
{
    using (FileStream stream = File.Open(frl, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        return this.md5.ComputeHash(stream);
    }
}
The problem is the number of files I have to handle. For example, imagine 200 files have been created over time in a folder, and then I copy all of them and paste them back into the same folder. This action is going to cause 200 events and 200 calculateChecksum() calls.
How could I solve this kind of problem?
In the FileSystemWatcher handler, put tasks into a queue that will be processed by some worker. The worker can process the checksum-calculation tasks at a targeted speed and/or frequency. One worker will probably be better, because many readers can slow down an HDD with many read seeks.
Try reading about BlockingCollection:
https://msdn.microsoft.com/ru-ru/library/dd997371(v=vs.110).aspx
and Producer-Consumer Dataflow Pattern
https://msdn.microsoft.com/ru-ru/library/hh228601(v=vs.110).aspx
var workerCount = 2;
BlockingCollection<String>[] filesQueues = new BlockingCollection<String>[workerCount];
for (int i = 0; i < workerCount; i++)
{
    filesQueues[i] = new BlockingCollection<String>(500);
    int queueIndex = i; // capture a copy of the loop variable for the lambda
    // Worker
    Task.Run(() =>
    {
        while (!filesQueues[queueIndex].IsCompleted)
        {
            string url = null;
            try
            {
                url = filesQueues[queueIndex].Take();
            }
            catch (InvalidOperationException) { }
            if (!string.IsNullOrWhiteSpace(url))
            {
                calculateChecksum(url);
            }
        }
    });
}

// inside of FileSystemWatcher handler
var queueIndex = hash(fileName) % workerCount;
// Warning!!
// Add blocks if the queue already holds BoundedCapacity (500) items
filesQueues[queueIndex].Add(fileName);

// call CompleteAdding only once no more files will ever be queued (e.g. on shutdown),
// not after every Add:
// filesQueues[queueIndex].CompleteAdding();
Also, you can make multiple consumers: just call Take or TryTake concurrently, and each item will only be consumed by a single consumer. But take into account that in that case the same file can end up being processed by several workers (if it is queued more than once), and multiple HDD readers can slow down the HDD.
UPD: in the case of multiple workers, it is better to make multiple BlockingCollections and push each file into the queue chosen by its index, as in the code above.
I've sketched a consumer-producer pattern to solve that, and I've tried to use a thread pool to smooth out the large amount of work, sharing a BlockingCollection:
BlockingCollection & ThreadPool:
private BlockingCollection<Index.ResourceIndexDocument> documents;

this.pool = new SmartThreadPool(SmartThreadPool.DefaultIdleTimeout, 4);
this.documents = new BlockingCollection<Index.ResourceIndexDocument>();
As you can see, I've created a thread pool with concurrency set to 4, so only 4 threads will work at the same time regardless of whether there are more than 4 work units waiting in the pool.
Producer:
public void warn(string channel, string frl)
{
    this.pool.QueueWorkItem<string, string>(
        (ch, file) => this.files.Add(file),
        channel,
        frl
    );
}
Consumer:
Task.Factory.StartNew(() =>
{
    Index.ResourceIndexDocument document = null;
    while (this.documents.TryTake(out document, TimeSpan.FromSeconds(1)))
    {
        IEnumerable<Index.ResourceIndexDocument> documents = this.documents.Take(this.documents.Count);
        Index.IndexEngine.Instance.index(documents);
    }
},
TaskCreationOptions.LongRunning
);
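For comparison, a single-consumer version built on GetConsumingEnumerable is quite a bit simpler. This is only a sketch reusing calculateChecksum from the question (the bounded capacity of 500 is an arbitrary choice), and it keeps the disk access sequential:

// needs System.Collections.Concurrent
var fileQueue = new BlockingCollection<string>(500);

// single consumer: one reader at a time keeps the HDD from seeking back and forth
Task.Run(() =>
{
    foreach (string path in fileQueue.GetConsumingEnumerable())
    {
        byte[] checksum = calculateChecksum(path);
        // store or compare the checksum here
    }
});

// inside the FileSystemWatcher handler
fileQueue.Add(e.FullPath);

// only when no more events will ever arrive (e.g. on shutdown)
fileQueue.CompleteAdding();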

Advice on processing giant text file and processing URL's

I'm currently trying to loop through a text file that is about 1.5 GB in size and then use the URLs grabbed from it to pull down the HTML from each site.
For speed I'm trying to process all the HTTP requests on new threads, but since C# is not my strongest language (though it is a requirement for what I'm doing), I'm a bit confused about good threading practice.
This is how I'm processing the list:
private static void Main()
{
    const Int32 BufferSize = 128;
    using (var fileStream = File.OpenRead("dump.txt"))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
    {
        String line;
        var progress = 0;
        while ((line = streamReader.ReadLine()) != null)
        {
            var stuff = line.Split('|');
            getHTML(stuff[3]);
            progress += 1;
            Console.WriteLine(progress);
        }
    }
}
And I'm pulling down the HTML like so:
private static void getHTML(String url)
{
    new Thread(() =>
    {
        var client = new DecompressGzipResponse();
        var html = client.DownloadString(url);
    }).Start();
}
Though the speed is fast initially, after about 20 thousand requests it slows down, and eventually after 32 thousand the application hangs and crashes. I was under the impression that C# threads terminated when the function completed?
Can anyone give any examples/ suggestions on how to do this better?
One very reliable way to do this is by using the producer-consumer pattern. You create a thread-safe queue of URLs (for example, BlockingCollection<Uri>). Your main thread is the producer, which adds items to the queue. You then have multiple consumer threads, each of which reads URLs from the queue and does the HTTP requests. See BlockingCollection.
Setting it up isn't terribly difficult:
BlockingCollection<Uri> UrlQueue = new BlockingCollection<Uri>();

// Main thread starts the consumer threads
Task t1 = Task.Factory.StartNew(() => ProcessUrls(), TaskCreationOptions.LongRunning);
Task t2 = Task.Factory.StartNew(() => ProcessUrls(), TaskCreationOptions.LongRunning);
// create more tasks if you think necessary.

// Now read your file
foreach (var line in File.ReadLines(inputFileName))
{
    var theUri = ExtractUriFromLine(line);
    UrlQueue.Add(theUri);
}

// when done adding lines to the queue, mark the queue as complete
UrlQueue.CompleteAdding();

// now wait for the tasks to complete.
t1.Wait();
t2.Wait();
// You could also use Task.WaitAll if you have an array of tasks
The individual threads process the URLs with this method:
void ProcessUrls()
{
    foreach (var uri in UrlQueue.GetConsumingEnumerable())
    {
        // code here to do a web request on that url
    }
}
That's a simple and reliable way to do things, but it's not especially quick. You can do much better by using a second queue of WebClient objects that make asynchronous requests. For example, say you want to have 15 asynchronous requests in flight. You start the same way with a BlockingCollection, but you only have one persistent consumer thread.
const int MaxRequests = 15;
BlockingCollection<WebClient> Clients = new BlockingCollection<WebClient>();

// start a single consumer thread
var ProcessingThread = Task.Factory.StartNew(() => ProcessUrls(), TaskCreationOptions.LongRunning);

// Create the WebClient objects and add them to the queue
for (var i = 0; i < MaxRequests; ++i)
{
    var client = new WebClient();
    // Add an event handler for the DownloadDataCompleted event
    client.DownloadDataCompleted += DownloadDataCompletedHandler;
    // And add this client to the queue
    Clients.Add(client);
}

// add the code from above that reads the file and populates the queue
Your processing function is somewhat different:
void ProcessUrls()
{
    foreach (var uri in UrlQueue.GetConsumingEnumerable())
    {
        // Wait for an available client
        var client = Clients.Take();
        // and make an asynchronous request
        client.DownloadDataAsync(uri, client);
    }
    // When the queue is empty, you need to wait for all of the
    // clients to complete their requests.
    // You know they're all done when you dequeue all of them.
    for (int i = 0; i < MaxRequests; ++i)
    {
        var client = Clients.Take();
        client.Dispose();
    }
}
Your DownloadDataCompleted event handler does something with the data that was downloaded, and then adds the WebClient instance back to the queue of clients.
void DownloadDataCompletedHandler(Object sender, DownloadDataCompletedEventArgs e)
{
    // The data downloaded is in e.Result
    // be sure to check the e.Error and e.Cancelled values to determine if an error occurred
    // do something with the data
    // And then add the client back to the queue
    WebClient client = (WebClient)e.UserState;
    Clients.Add(client);
}
This should keep you going with 15 concurrent requests, which is about all you can do without getting a bit more complicated. Your system can likely handle many more concurrent requests, but the way that WebClient starts asynchronous requests requires some synchronous work up front, and that overhead makes 15 about the maximum number you can handle.
You might be able to have multiple threads initiating the asynchronous requests. In that case, you could potentially have as many threads as you have processor cores. So on a quad core machine, you could have the main thread and three consumer threads. With three consumer threads this technique could give you 45 concurrent requests. I'm not certain that it scales that well, but it might be worth a try.
There are ways to have hundreds of concurrent requests, but they're quite a bit more complicated to implement.
You need thread management.
My advice is to use Tasks instead of creating your own Threads.
By using the Task Parallel Library, you let the runtime deal with the thread management. By default, it will allocate your tasks on threads from the ThreadPool, and will allow a level of concurrency which is contingent on the number of CPU cores you have. It will also reuse existing Threads when they become available instead of wasting time creating new ones.
If you want to get more advanced, you can create your own task scheduler to manage the scheduling aspect yourself.
See also What is difference between Task and Thread?
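As a rough sketch of what the Task-based approach could look like for the original problem (DecompressGzipResponse is the question's own WebClient subclass; the degree of parallelism of 8 is an arbitrary choice), letting the TPL bound the concurrency instead of spawning one thread per URL:

var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

// File.ReadLines streams the 1.5 GB file lazily instead of loading it all at once
Parallel.ForEach(File.ReadLines("dump.txt"), options, line =>
{
    var url = line.Split('|')[3];
    using (var client = new DecompressGzipResponse())
    {
        var html = client.DownloadString(url);
        // process the html here
    }
});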

Parallel.ForEach behaving like a regular for each towards the end of the iteration

I am having this issue when I run something like this:
Parallel.ForEach(dataTable.AsEnumerable(), row =>
{
    //do processing
});
Assume there are 500+ records, say 870. Once Parallel.ForEach has completed about 850 of them, it seems to run sequentially, i.e. only one operation at a time. It completed those 850 operations very fast, but as it comes close to the end of the iteration it becomes very slow and seems to perform like a regular foreach. I even tried 2000 records.
Is something wrong in my code? Please give suggestions.
Below is the code I am using
Sorry I just posted the wrong example. This is the correct code:
Task newTask = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(dtResult.AsEnumerable(), dr =>
    {
        string extractQuery = "";
        string downLoadFileFullName = "";
        lock (foreachObject)
        {
            string fileName = extractorConfig.EncodeFileName(dr);
            extractQuery = extractorConfig.GetExtractQuery(dr);
            if (string.IsNullOrEmpty(extractQuery)) throw new Exception("Extract Query not found. Please check the configuration");
            string newDownLoadPath = CommonUtil.GetFormalizedDataPath(sDownLoadPath, uKey.CobDate);
            // create folder if it doesn't exist
            if (!Directory.Exists(newDownLoadPath)) Directory.CreateDirectory(newDownLoadPath);
            downLoadFileFullName = Path.Combine(newDownLoadPath, fileName);
        }
        Interlocked.Increment(ref index);
        ExtractorClass util = new ExtractorClass(SourceDbConnStr);
        util.LoadToFile(extractQuery, downLoadFileFullName);
        Interlocked.Increment(ref uiTimerIndex);
    });
});
My guess:
This looks like it has a high degree of potential IO, from:
Database+Disk
Network communication to DB and back
Writing results to disk
Therefore a lot of time is going to be spent waiting for IO. My guess is that the waiting is only getting worse as more threads are being added to the mix and IO is being further stressed. For instance a disk only has one set of heads, so you cannot write to it concurrently. If you have a large number of threads trying to write concurrently, performance degrades.
Try limiting the maximum number of threads you are using:
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(dtResult.AsEnumerable(), options, dr =>
{
//Do stuff
});
Update
After your code edit, I would suggest the following, which includes a couple of changes:
Reduce the maximum number of threads - this can be experimented with.
Only perform the directory check and creation once.
Code:
private static bool isDirectoryCreated;

//...

var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(dtResult.AsEnumerable(), options, dr =>
{
    string fileName, extractQuery, newDownLoadPath;
    lock (foreachObject)
    {
        fileName = extractorConfig.EncodeFileName(dr);
        extractQuery = extractorConfig.GetExtractQuery(dr);
        if (string.IsNullOrEmpty(extractQuery))
            throw new Exception("Extract Query not found. Please check the configuration");
        newDownLoadPath = CommonUtil.GetFormalizedDataPath(sDownLoadPath, uKey.CobDate);
        if (!isDirectoryCreated)
        {
            if (!Directory.Exists(newDownLoadPath))
                Directory.CreateDirectory(newDownLoadPath);
            isDirectoryCreated = true;
        }
    }
    string downLoadFileFullName = Path.Combine(newDownLoadPath, fileName);
    Interlocked.Increment(ref index);
    ExtractorClass util = new ExtractorClass(SourceDbConnStr);
    util.LoadToFile(extractQuery, downLoadFileFullName);
    Interlocked.Increment(ref uiTimerIndex);
});
It’s hard to give details without the relevant code but in general this is the expected behaviour. .NET tries to schedule the tasks such that every processor is evenly busy.
But this can only ever be approximated, since not all of the tasks take the same amount of time. At the end some processors will be done working and some won't, and re-distributing the work is costly and not always beneficial.
I don't know the details of the load balancing used by PLINQ, but the bottom line is that this behaviour can never be fully prevented.
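One thing that can reduce the slow tail in practice (my note, not part of the answer above): when Parallel.ForEach is given a plain IEnumerable, the default partitioner hands items out in growing chunks, so a few unlucky chunks at the end can leave most threads idle. A no-buffering partitioner gives each worker one row at a time, at the cost of a little more overhead per item:

// needs System.Collections.Concurrent (.NET 4.5+)
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
var partitioner = Partitioner.Create(dtResult.AsEnumerable(),
    EnumerablePartitionerOptions.NoBuffering);

Parallel.ForEach(partitioner, options, dr =>
{
    // same body as in the answer above
});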
