Parallel programming in C#

I'm interested in learning about parallel programming in C#.NET (not everything there is to know, but the basics and maybe some good practices), so I've decided to rewrite an old program of mine called ImageSyncer. ImageSyncer is a really simple program: all it does is scan through a folder and find all files ending with .jpg, then calculate the new location of each file based on the date it was taken (parsed from its Exif data). After a location has been generated, the program checks for an existing file at that location, and if one exists it compares the last write time of the file to copy and the file "in its way". If those are equal the file is skipped. If not, an MD5 checksum of both files is created and compared. If there is no match, the file to be copied is given a new location to be copied to (for instance, if it was to be copied to "C:\test.jpg" it's copied to "C:\test(1).jpg" instead). The result of this operation is added to a queue of a struct type that contains two strings: the original file and the location to copy it to. Then that queue is iterated over until it is empty and the files are copied.
In other words there are 4 operations:
1. Scan directory for jpegs
2. Parse files for Exif data and generate the copy location
3. Check for file existence and if needed generate new path
4. Copy files
I want to rewrite this program to make it parallel and able to perform several of the operations at the same time, and I was wondering what the best way to achieve that would be. I've come up with two different models, but neither of them may be any good at all. The first is to parallelize the 4 steps of the old program, so that when step 1 is executed it's done on several threads, and when all of step 1 is finished, step 2 begins. The other (which I find more interesting, because I have no idea how to do it) is to create a sort of producer and consumer model, so that when a thread is finished with step 1 another one takes over and performs step 2 on that object (or something like that). But as I said, I don't know if either of these is a good solution. Also, I don't know much about parallel programming at all. I know how to create a thread and make it run a function taking an object as its only parameter, and I've used the BackgroundWorker class on one occasion, but I'm not that familiar with any of it.
Any input would be appreciated.

There are a few options:
Parallel LINQ: Running Queries On Multi-Core Processors
Task Parallel Library (TPL): Optimize Managed Code For Multi-Core Machines
If you are interested in basic threading primitives and concepts: Threading in C#
[But as @John Knoeller pointed out, the example you gave is likely to be bound by sequential I/O]

This is the reference I use for C# threading: http://www.albahari.com/threading/
As a single PDF: http://www.albahari.com/threading/threading.pdf
For your second approach:
I've worked on some producer/consumer multithreaded apps where each task is some code that loops forever. An external "initializer" starts a separate thread for each task and initializes an EventWaitHandle for each task. Each task also has a global queue that can be used to produce/consume its input.
In your case, your external program would add each directory to the queue for Task 1 and set the EventWaitHandle for Task 1. Task 1 would "wake up" from its EventWaitHandle, get the count of directories in its queue, and then, while the count is greater than 0, get a directory from the queue, scan it for all the .jpgs, add each .jpg location to a second queue, and set the EventWaitHandle for Task 2. Task 2 reads its input, processes it, forwards it to a queue for Task 3...
It can be a bit of a pain getting all the locking to work right (I basically lock any access to the queue, even something as simple as getting its count). .NET 4.0 is supposed to have data structures that will automatically support a producer/consumer queue with no locks.
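A minimal sketch of one such stage, assuming one lock-guarded Queue<string> per task and an AutoResetEvent to signal it (all names here are placeholders, not from the original apps):
using System.Collections.Generic;
using System.IO;
using System.Threading;

// inside some pipeline class
static readonly Queue<string> directoryQueue = new Queue<string>(); // input for Task 1
static readonly Queue<string> jpegQueue = new Queue<string>();      // input for Task 2
static readonly object queueLock = new object();
static readonly AutoResetEvent task1Signal = new AutoResetEvent(false);
static readonly AutoResetEvent task2Signal = new AutoResetEvent(false);

static void Task1Loop() // started once on its own thread, loops forever
{
    while (true)
    {
        task1Signal.WaitOne(); // sleep until the producer signals new input
        while (true)
        {
            string directory;
            lock (queueLock) // lock every queue access, even reading the count
            {
                if (directoryQueue.Count == 0) break;
                directory = directoryQueue.Dequeue();
            }
            foreach (var jpg in Directory.GetFiles(directory, "*.jpg"))
            {
                lock (queueLock) jpegQueue.Enqueue(jpg);
                task2Signal.Set(); // wake up Task 2
            }
        }
    }
}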

Interesting problem.
I came up with two approaches. The first is based on PLinq and the second is based on the Rx Framework.
The first one iterates through the files in parallel.
The second one generates the files from the directory asynchronously.
Here is how it looks in a much simplified version (the first method requires .NET 4.0 since it uses PLinq):
string directory = "MyDirectory";
var jpegFiles = System.IO.Directory.EnumerateFiles(directory, "*.jpg");
// -- PLinq --------------------------------------------
jpegFiles
    .AsParallel()
    .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
    .ForAll(fileInfo =>
    {
        // compare the source and the target, not the target with itself
        if (!File.Exists(fileInfo.NewLocation) ||
            File.GetLastWriteTime(fileInfo.OldLocation) != File.GetLastWriteTime(fileInfo.NewLocation))
            File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
    });
// -----------------------------------------------------
//-- Rx Framework ---------------------------------------------
var resetEvent = new AutoResetEvent(false);
var doTheWork =
    jpegFiles.ToObservable()
        .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
        .Subscribe(fileInfo =>
        {
            // same comparison fix as above
            if (!File.Exists(fileInfo.NewLocation) ||
                File.GetLastWriteTime(fileInfo.OldLocation) != File.GetLastWriteTime(fileInfo.NewLocation))
                File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
        }, () => resetEvent.Set());
resetEvent.WaitOne();
doTheWork.Dispose();
// -----------------------------------------------------

Related

Is it possible to have any dataflow block type send multiple intermediate results as a result of a single input?

Is it possible to get TransformManyBlock to send intermediate results as they are created to the next step, instead of waiting for the entire IEnumerable<T> to be filled?
All testing I've done shows that TransformManyBlock only sends a result to the next block when it is finished; the next block then reads those items one at a time.
It seems like basic functionality but I can't find any examples of this anywhere.
The use case is processing chunks of a file as they are read. In my case I need a certain number of lines before I can process anything, so a direct stream won't work.
The kludge I've come up with is to create two pipelines:
a "processing" dataflow network that processes the chunks of data as they become available
a "producer" dataflow network that ends where the file is broken into chunks, which are then posted to the start of the "processing" network that actually transforms the data.
The "producer" network needs to be seeded with the starting point of the "processing" network.
Not a good long-term solution, since additional processing options will be needed and it's not flexible.
Is it possible to have any dataflow block type send multiple intermediate results, as they are created, from a single input? Any pointers to working code?
You probably need to create your IEnumerables using an iterator. That way an item will be propagated downstream after every yield return. The only problem is that yielding from lambda functions is not supported in C#, so you'll have to use a local function instead. Example:
var block = new TransformManyBlock<string, string>(filePath => ReadLines(filePath));

IEnumerable<string> ReadLines(string filePath)
{
    // File.ReadLines is lazy, so each line is read and yielded one at a time
    // (File.ReadAllLines would buffer the whole file before anything is offered)
    foreach (var line in File.ReadLines(filePath))
    {
        yield return line; // Immediately offered to any linked block
    }
}
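To see the immediate propagation, one can link the block to a consumer (the file name here is hypothetical):
var printer = new ActionBlock<string>(line => Console.WriteLine(line));
block.LinkTo(printer, new DataflowLinkOptions { PropagateCompletion = true });

block.Post("input.txt"); // lines start arriving at printer as they are yielded
block.Complete();
printer.Completion.Wait();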

Parallelizing producer and consumer with internal state

I'd like to know if the following approach is a good way to implement a producer and consumer pattern in C# .NET 4.6.1
Description of what I want to do:
I want to read files, perform calculations on the data within, and save the result. Each file has an origin (a device, e.g. a data logger), and depending on that origin different calculations as well as output formats should be used. The file contains different values, e.g. temperature readings from several sensors. It is important that the calculations have state. For instance, this could be the last value of the previous calculation, e.g. if I want to sum all values of one origin.
I want to parallelize the processing per origin. All files from one origin need to be processed sequentially (or, more specifically, chronologically) and cannot be parallelized.
I think the TPL Dataflow might be an appropriate solution for this.
This is the process I came up with:
The reading would be done by a TransformBlock. Next I would create instances of the classes performing operations on the data, one for each origin. They get initialized with the necessary parameters, so that they know how to process files from their origin.
Then I would create a TransformBlock for each created object (so basically for each origin). Each TransformBlock would execute a function of the corresponding object. The TransformBlock reading the files would be linked to a BufferBlock, which is linked to each per-origin processing TransformBlock. The links would be conditional, so that each processing TransformBlock only receives data meant for its origin. The output of the processing blocks would be linked to an ActionBlock for writing the output files.
MaxDegreeOfParallelism is set to 1 for every block.
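A minimal sketch of the network described above (FileData, Result, OriginProcessor, ReadFile, and WriteOutput are illustrative placeholders, not an established API):
using System.Threading.Tasks.Dataflow;

// FileData carries the parsed file plus its origin id; OriginProcessor holds
// the per-origin state between files.
var oneAtATime = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 };

var readBlock = new TransformBlock<string, FileData>(path => ReadFile(path), oneAtATime);
var buffer = new BufferBlock<FileData>();
readBlock.LinkTo(buffer, new DataflowLinkOptions { PropagateCompletion = true });

var writeBlock = new ActionBlock<Result>(r => WriteOutput(r), oneAtATime);

foreach (var origin in origins)
{
    var processor = new OriginProcessor(origin); // stateful, one instance per origin
    var processBlock = new TransformBlock<FileData, Result>(f => processor.Process(f), oneAtATime);

    // Conditional link: only files from this origin enter this block, so each
    // origin is processed sequentially while different origins run in parallel.
    buffer.LinkTo(processBlock, f => f.Origin == origin);
    processBlock.LinkTo(writeBlock);
}
// (completion propagation through the conditional links is omitted here)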
Is that a viable solution? I thought about implementing this with Tasks and a BlockingCollection, but this seems to be the easier approach.
Additional Information:
The files processed may be too large in size or number to be loaded all at once.
Reading and writing should happen concurrently with the processing. Since I/O takes time, and because data needs to be collected after processing to form an output file, buffering is essential.
Since the origins are independent of each other and the items within each origin are fully dependent, this problem has an easy solution:
var origins = (from f in files
               group f by f.origin into g
               orderby g.Count() descending
               select g);
var results =
    Partitioner.Create(origins, EnumerablePartitionerOptions.NoBuffering) // disable chunking
        .AsParallel()
        .AsOrdered() // try to process the biggest groups first
        .Select(originGroup => {
            foreach (var x in originGroup.OrderBy(...)) Process(x);
            return someResult;
        })
        .ToList();
Process each origin sequentially and origins in parallel.
If you need to limit IO in some way, you can throw in a SemaphoreSlim to guard the IO paths.
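For example, a minimal sketch of such a guard (the limit of 4 is arbitrary):
var ioGuard = new SemaphoreSlim(4); // at most 4 concurrent IO operations

// inside Process(x), around the actual file access:
ioGuard.Wait();
try
{
    // read or write the file here
}
finally
{
    ioGuard.Release();
}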

What is the best way to measure inserts per second in MongoDB?

I have a multithreaded (using a threadpool) C# program that reads from a text file containing logs and batch inserts them into a MongoDB collection. I want a consistent and precise way to measure how long it takes to insert the whole file into the collection.
I can't really call Thread.Join (because it's a thread pool), and I can't use a single Stopwatch because the inserts are running on separate threads.
What's the next best thing?
The way I'm currently doing it is with the timer on my smartphone: I repeatedly call db.collection.stats() and wait until the count is the same as the number of logs in the file...
If you're using .NET 4.0+, I'd recommend the CountdownEvent class. Using that class, you can just create an instance with, for example, the number of logs as the counter:
var countdown = new CountdownEvent(numberOfLogs);
Then, each time you complete a write to MongoDB, you signal from the worker thread:
countdown.Signal(); // decrement counter
And then, in your main process (or another thread):
countdown.Wait(); // returns when the count is zero
// All writes complete
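Putting it together with a Stopwatch gives the total insert time; a sketch, where numberOfLogs and the worker code are placeholders:
var countdown = new CountdownEvent(numberOfLogs);
var stopwatch = Stopwatch.StartNew();

// each worker thread calls countdown.Signal() after every completed insert

countdown.Wait(); // blocks until the counter reaches zero
stopwatch.Stop();
Console.WriteLine("Inserted {0} logs in {1} ms", numberOfLogs, stopwatch.ElapsedMilliseconds);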
With mongostat (a command-line tool) you can see exactly what is going on in the MongoDB server. It will give you inserts/queries etc. per second. It won't automatically stop when the inserting is done, but it will definitely give you insight into performance. The "inserts" column will drop to 0 once you're done importing.

C# read text file lines multi thread

I want to write a fast multithreaded program in C# that reads a file.
The file must be split into parts, and each part processed in a different thread. For example:
Line1
Line2
Line3
Line4
must be split into 4 lines like this:
Line1 => thread 1
Line2 => thread 2
Line3 => thread 3
Line4 => thread 4
I used StreamReader.ReadLine(), but it can't read a specific line.
Comment: it's necessary to speed up the program, so I want to read the file in separate threads.
Unless you're using fixed-length lines, this isn't possible.
Why? Because in order to determine where the "lines" split, you need to find the newline characters... which means you need to read the file first.
Now, if you simply want to perform some extra "processing" after you read in each line, that is possible and relatively straightforward using a ThreadPool.
You should read the file in a single thread, but then hand off the processing of each line to a different thread, e.g. by adding it to a producer/consumer queue, as sketched below.
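A minimal sketch of that single-reader, multi-consumer pattern using BlockingCollection<string> from .NET 4.0 (ProcessLine and the file name are placeholders):
var lines = new BlockingCollection<string>(boundedCapacity: 1000);

// Single reader: sequential IO stays fast.
var reader = Task.Factory.StartNew(() =>
{
    foreach (var line in File.ReadLines("input.txt"))
        lines.Add(line);
    lines.CompleteAdding(); // tells consumers no more items are coming
});

// Several consumers process lines as they arrive.
var consumers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Factory.StartNew(() =>
    {
        foreach (var line in lines.GetConsumingEnumerable())
            ProcessLine(line); // placeholder for the per-line work
    }))
    .ToArray();

Task.WaitAll(consumers);
reader.Wait();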
Even if you could seek to a specific line in a text file (which in general you can't) you really don't want the disk thrashing around - that'll only slow things down. The fastest way to get the data off the disk is to read it sequentially. By all means defer everything about handling the line beyond "decoding the binary data to text" to other threads, but you really don't want the IO to be in multiple threads.
AFAIK .NET doesn't support parallel stream reading. If you want to process every line, you may use File.ReadAllLines. It returns an array of strings. Then you can use PLINQ:
var result = File.ReadAllLines("path")
    .AsParallel()
    .Select(s => DoSthWithString(s))
    .ToList();
You're not going to be able to speed up the actual reading because you're going to have tremendous locking issues keeping everything straight.
Since a text file is an unstructured file, i.e. each line can be of a different length, you have no choice but to read the lines one after the other, one by one.
Now, what you can do is process those lines on different threads, but the actual reading, keep that to one thread.
But, before you do that, are you sure you even have to do this? Is this a bottleneck? If not, fix the bottleneck first and see how far you get.
Your StreamReader is connected to a stream class. Using the stream class you can .Seek to a particular byte location.
Like others have said, this probably isn't a good idea, but it can be done.
I would split the file beforehand. Say the file is 1000 lines: split it into 10 files of 100 lines each and have a thread process each file.

c# multithreading file reading and page parsing

I have a file with more than 500,000 URLs. Now I want to read the file and parse every URL with my function, which returns a string message. For now everything is working fine, but the performance is not good, so I need to start the parsing in simultaneous threads (for example, 100 threads):
ParserEngine parseEngine = new ParserEngine(parseFormulas);
StreamReader reader = new StreamReader("urls.txt");
string line = string.Empty;
while ((line = reader.ReadLine()) != null)
{
    string result = parseEngine.Parse(line);
    Console.WriteLine(result);
}
reader.Close();
It would also be good if I could stop all the threads with a button click and change the number of threads. Any help and tips?
Be sure to check out this article on PLINQ performance compared to other techniques for parsing a text file, line-by-line, using multi-threading.
Not only does it provide sample source code for doing something almost identical to what you want, but they also discovered a "gotcha" with PLINQ that can result in abnormally slow times. In a nutshell, if you try to use File.ReadAllLines() or StreamReader.ReadLine() you'll spoil the performance because PLINQ can't properly divide the file up that way. They solved the problem by reading all the lines into an indexed array, and THEN processing it with PLINQ.
Honestly, for the performance difference I would just try Parallel.ForEach in .NET 4.0 if that is an option:
using System.Threading.Tasks;

Parallel.ForEach(enumerableList, p =>
{
    parseEngine.Parse(p);
});
It's a decent start to running things in parallel and should minimize your thread troubleshooting headaches.
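For the stop-by-button-click requirement, Parallel.ForEach can also take a CancellationToken via ParallelOptions; a sketch (the degree of parallelism shown is arbitrary):
var cts = new CancellationTokenSource();
var options = new ParallelOptions
{
    CancellationToken = cts.Token,
    MaxDegreeOfParallelism = 8 // the adjustable "number of threads"
};

try
{
    Parallel.ForEach(enumerableList, options, p => parseEngine.Parse(p));
}
catch (OperationCanceledException)
{
    // reached after the stop button calls cts.Cancel()
}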
A producer/consumer setup would be good for this: one thread reading from the file and writing to a queue, and the other threads reading from the queue.
You mentioned an example of 100 threads. If you had that many threads, you would want to read from the queue in batches, since you'd probably have to lock the queue around reads; a Queue is only thread-safe for a single reader plus writer.
.NET 4.0 adds a ConcurrentQueue<T> (and BlockingCollection<T>) for exactly this.
You really only want one reader to the file.
You could use Parallel.ForEach() to schedule a thread for each item in the list. That would spread the threads out among all available processors, assuming that parseEngine takes some time to run. If parseEngine runs pretty quickly (defined as less than 250ms), increase the number of "on-demand" threads by calling ThreadPool.SetMinThreads(), which will result in more threads executing at once.
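For example (the counts here are purely illustrative):
ThreadPool.SetMinThreads(workerThreads: 50, completionPortThreads: 50); // pre-warm the pool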
