Parallelizing producer and consumer with internal state

I'd like to know if the following approach is a good way to implement a producer and consumer pattern in C# .NET 4.6.1.
Description of what I want to do:
I want to read files, perform calculations on the data within, and save the results. Each file has an origin (a device, e.g. a data logger), and depending on that origin different calculations as well as output formats should be used. The file contains different values, e.g. temperature readings of several sensors. It is important that the calculations have state. For instance, this could be the last value of the previous calculation, e.g. if I want to sum all values of one origin.
I want to parallelize the processing per origin. All files from one origin need to be processed sequentially (or, more specifically, chronologically) and cannot be parallelized.
I think the TPL Dataflow might be an appropriate solution for this.
This is the process I came up with:
The reading would be done by a TransformBlock. Next I would create instances of the classes performing operations on the data for each origin. They get initialized with the necessary parameters, so that they know how to process files for their origin.
Then I would create a TransformBlock for each created object (so basically for each origin). Each TransformBlock would execute a function of the corresponding object. The TransformBlock reading the files would be linked to a BufferBlock, which is linked to each TransformBlock for the processing per origin. The linking would be conditional, so that only data that is meant to reach the processing TransformBlock of an origin would be received. The output of the processing blocks would be linked to an ActionBlock for writing the output files.
MaxDegreeOfParallelism is set to 1 for every block.
Is that a viable solution? I thought about implementing this with Tasks and the BlockingCollection, but it seems this would be the easier approach.
Additional Information:
The files to be processed may be too large in size or number to be loaded all at once.
Reading and writing should happen concurrently with the processing. As I/O takes time, and because data needs to be collected after processing to form an output file, buffering is essential.

Since the origins are independent, and the items for each origin are fully dependent on each other, this problem has an easy solution:
var origins = (from f in files
               group f by f.origin into g
               orderby g.Count() descending
               select g);
var results =
    Partitioner.Create(origins, EnumerablePartitionerOptions.NoBuffering) // disable chunking
        .AsParallel()
        .AsOrdered() // try to process the biggest groups first
        .Select(originGroup => {
            foreach (var x in originGroup.OrderBy(...)) Process(x);
            return someResult;
        })
        .ToList();
Process each origin sequentially and the origins in parallel.
If you need to limit IO in some way, you can throw in a SemaphoreSlim to guard the IO paths.
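The SemaphoreSlim idea can be sketched roughly as follows. This is only an illustration under assumptions: InputFile, GuardedRead, SumPerOrigin and the limit of two concurrent reads are hypothetical names and values, and the "read" is simulated by returning a value instead of touching the disk.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class OriginProcessor
{
    // Hypothetical input record: the origin a file belongs to, its
    // chronological position, and a value standing in for its content.
    public sealed class InputFile
    {
        public string Origin;
        public int Sequence;
        public int Value;
    }

    // Assumed limit: at most two reads at once, however many origins run.
    private static readonly SemaphoreSlim IoGate = new SemaphoreSlim(2);

    // Stand-in for reading a file from disk, guarded by the semaphore.
    private static int GuardedRead(InputFile file)
    {
        IoGate.Wait();
        try { return file.Value; }
        finally { IoGate.Release(); }
    }

    // Origins are processed in parallel; files within one origin in order.
    public static Dictionary<string, int> SumPerOrigin(IEnumerable<InputFile> files)
    {
        return files
            .GroupBy(f => f.Origin)
            .AsParallel()
            .Select(originGroup =>
            {
                int runningSum = 0; // state private to this origin
                foreach (var file in originGroup.OrderBy(f => f.Sequence))
                    runningSum += GuardedRead(file);
                return new KeyValuePair<string, int>(originGroup.Key, runningSum);
            })
            .ToDictionary(pair => pair.Key, pair => pair.Value);
    }
}
```

Because the running sum lives inside the per-origin lambda, no locking is needed for the calculation state; only the shared IO path is guarded.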

Related

Is it possible to have any dataflow block type send multiple intermediate results as a result of a single input?

Is it possible to get a TransformManyBlock to send intermediate results to the next step as they are created, instead of waiting for the entire IEnumerable<T> to be filled?
All testing I've done shows that TransformManyBlock only sends a result to the next block when it is finished; the next block then reads those items one at a time.
It seems like basic functionality but I can't find any examples of this anywhere.
The use case is processing chunks of a file as they are read. In my case a set number of lines has to accumulate before I can process anything, so a direct stream won't work.
The kludge I've come up with is to create two pipelines:
a "processing" dataflow network that processes the chunks of data as they become available
a "producer" dataflow network that ends where the file is broken into chunks, which are then posted to the start of the "processing" network that actually transforms the data.
The "producer" network needs to be seeded with the starting point of the "processing" network.
This is not a good long-term solution, since additional processing options will be needed and it's not flexible.
Is it possible to have any dataflow block type send multiple intermediate results, as they are created, for a single input? Any pointers to working code?

You probably need to create your IEnumerables by using an iterator. This way an item will be propagated downstream after every yield command. The only problem is that yielding from lambda functions is not supported in C#, so you'll have to use a local function instead. Example:
var block = new TransformManyBlock<string, string>(filePath => ReadLines(filePath));
IEnumerable<string> ReadLines(string filePath)
{
    string[] lines = File.ReadAllLines(filePath);
    foreach (var line in lines)
    {
        yield return line; // Immediately offered to any linked block
    }
}

How to limit consuming sequence with Reactive?

We have an application wherein we have a materialized array of items which we are going to process through a Reactive pipeline. It looks a little like this:
EventLoopScheduler eventLoop = new EventLoopScheduler();
IScheduler concurrency = new TaskPoolScheduler(
    new TaskFactory(
        new LimitedConcurrencyLevelTaskScheduler(threadCount)));
IEnumerable<int> numbers = Enumerable.Range(1, itemCount);
// 1. transform on single thread
IConnectableObservable<byte[]> source =
    numbers.Select(Transform).ToObservable(eventLoop).Publish();
// 2. naive parallelization, restricts parallelization to Work
// only; chunk up sequence into smaller sequences and process
// in parallel, merging results
IObservable<int> final = source.
    Buffer(10).
    Select(
        batch =>
            batch.
                ToObservable(concurrency).
                Buffer(10).
                Select(
                    concurrentBatch =>
                        concurrentBatch.
                            Select(Work).
                            ToArray().
                            ToObservable(eventLoop)).
                Merge()).
    Merge();
final.Subscribe();
source.Connect();
Await(final).Wait();
If you are really curious to play with this, the stand-in methods look like this:
private async static Task Await(IObservable<int> final)
{
    await final.LastOrDefaultAsync();
}
private static byte[] Transform(int number)
{
    if (number == itemCount)
    {
        Console.WriteLine("numbers exhausted.");
    }
    byte[] buffer = new byte[1000000];
    Buffer.BlockCopy(bloat, 0, buffer, 0, bloat.Length);
    return buffer;
}
private static int Work(byte[] buffer)
{
    Console.WriteLine("t {0}.", Thread.CurrentThread.ManagedThreadId);
    Thread.Sleep(50);
    return 1;
}
A little explanation. Range(1, itemCount) simulates raw inputs, materialized from a data-source. Transform simulates an enrichment process each input must go through, and results in a larger memory footprint. Work is a "lengthy" process which operates on the transformed input.
Ideally, we want to minimize the number of transformed inputs held concurrently by the system, while maximizing throughput by parallelizing Work. The number of transformed inputs in memory should be batch size (10 above) times concurrent work threads (threadCount).
So for 5 threads, we should retain 50 Transform items at any given time; and if, as here, the transform is a 1MB byte buffer, then we would expect memory consumption to be at about 50MB throughout the run.
What I find is quite different. Namely, Reactive eagerly consumes all the numbers and Transforms them up front (as evidenced by the "numbers exhausted." message), resulting in a massive memory spike up front (about 1GB for an itemCount of 1000).
My basic question is: is there a way to achieve what I need (i.e. minimized consumption, throttled by multi-threaded batching)?
UPDATE: sorry for the reversal, James; at first, I did not think paulpdaniels' and Enigmativity's composition of Work(Transform) applied (this has to do with the nature of our actual implementation, which is more complex than the simple scenario provided above); however, after some further experimentation, I may be able to apply the same principles, i.e. defer Transform until the batch executes.

You have made a couple of mistakes in your code that throw off all of your conclusions.
First up, you've done this:
IEnumerable<int> numbers = Enumerable.Range(1, itemCount);
You've used Enumerable.Range which means that when you call numbers.Select(Transform) you are going to burn through all of the numbers as fast as a single thread can take it. Rx hasn't even had a chance to do any work because up till this point your pipeline is entirely enumerable.
The next issue is in your subscriptions:
final.Subscribe();
source.Connect();
Await(final).Wait();
Because you call final.Subscribe() & Await(final).Wait(); you are creating two separate subscriptions to the final observable.
Since there is a source.Connect() in the middle the second subscription may be missing out on values.
So, let's try to remove all of the cruft that's going on here and see if we can work things out.
If you go down to this:
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .Select(bs => Work(bs));
Things work well. The numbers get exhausted right at the end, and processing 20 items on my machine takes about 1 second.
But this is processing everything in sequence. And the Work step provides back-pressure on Transform to slow down the speed at which it consumes the numbers.
Let's add concurrency.
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .SelectMany(bs => Observable.Start(() => Work(bs)));
This processes 20 items in 0.284 seconds, and the numbers exhaust themselves after 5 items are processed. There is no longer any back-pressure on the numbers. Basically the scheduler is handing all of the work to the Observable.Start so it is ready for the next number immediately.
Let's reduce the concurrency.
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .Select(n => Transform(n))
        .SelectMany(bs => Observable.Start(() => Work(bs), concurrency));
Now the 20 items get processed in 0.5 seconds. Only two get processed before the numbers are exhausted. This makes sense as we've limited concurrency to two threads. But still there's no back pressure on the consumption of the numbers so they get chewed up pretty quickly.
Having said all of this, I tried to construct a query with the appropriate back pressure, but I couldn't find a way. The crux comes down to the fact that Transform(...) performs far faster than Work(...) so it completes far more quickly.
So then the obvious move for me was this:
IObservable<int> final =
    Observable
        .Range(1, itemCount)
        .SelectMany(n => Observable.Start(() => Work(Transform(n)), concurrency));
This doesn't complete the numbers until the end, and it limits processing to two threads. It appears to do the right thing for what you want, except that I've had to do Work(Transform(...)) together.

The very fact that you want to limit the amount of work you are doing suggests you should be pulling data, not having it pushed at you. I would forget using Rx in this scenario, as fundamentally, what you have described is not a reactive application. Also, Rx is best suited to processing items serially; it uses sequential event streams.
Why not just keep your data source enumerable, and use PLinq, Parallel.ForEach or DataFlow? All of those sound better suited for your problem.
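As a rough sketch of the pull-based alternative, the following uses Parallel.ForEach with a capped degree of parallelism and composes Transform and Work inside the loop body, so a buffer is only allocated when a worker is ready for it. Transform and Work here are simplified stand-ins for the question's methods, and BoundedWork.Run plus the peak counter are hypothetical additions used only to observe how many buffers are live at once.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedWork
{
    // Stand-in for the question's Transform: allocates a large buffer.
    private static byte[] Transform(int number) { return new byte[1000000]; }

    // Stand-in for the question's Work: slow processing of one buffer.
    private static int Work(byte[] buffer) { Thread.Sleep(5); return 1; }

    // Returns (items processed, peak number of buffers alive at the same time).
    public static Tuple<int, int> Run(int itemCount, int threadCount)
    {
        object gate = new object();
        int inFlight = 0, peak = 0, processed = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = threadCount };

        // The range is pulled lazily; each body transforms and works on one item.
        Parallel.ForEach(Enumerable.Range(1, itemCount), options, n =>
        {
            lock (gate) { inFlight++; if (inFlight > peak) peak = inFlight; }
            byte[] buffer = Transform(n);
            int result = Work(buffer);
            lock (gate) { inFlight--; processed += result; }
        });

        return Tuple.Create(processed, peak);
    }
}
```

With this shape the number of transformed buffers held at any moment is bounded by MaxDegreeOfParallelism rather than by itemCount.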

As @JamesWorld said, it may very well be that you want to use PLinq to perform this task; it really depends on whether you are actually reacting to data in your real scenario or just iterating through it.
If you choose to go the Reactive route you can use Merge to control the level of parallelization occurring:
var source = numbers
    .Select(n =>
        Observable.Defer(() => Observable.Start(() => Work(Transform(n)), concurrency)))
    //Maximum concurrency
    .Merge(10)
    //Schedule all the output back onto the event loop scheduler
    .ObserveOn(eventLoop);
The above code will consume all the numbers first (sorry, no way to avoid that); however, by wrapping the processing in a Defer and following it up with a Merge that limits parallelization, only x items can be in flight at a time. Start() takes a scheduler as its second argument, which it uses to execute the provided method. Finally, since you are basically just pushing the values of Transform into Work, I composed them within the Start method.
As a side note, you can await an Observable and it will be equivalent to the code you have, i.e:
await source; //== await source.LastAsync();

ThreadPool with speed execution control

I need to process several lines from a database (possibly millions) in parallel in C#. The processing is quite quick (50 or 150 ms per line), but I cannot know this speed before runtime as it depends on hardware and network.
The ThreadPool or the newer Task Parallel Library seems to be what fits my needs, as I am new to threading and want the most efficient way to process the data.
However, these methods do not provide a way to control the execution speed of my tasks (lines/minute): I want to be able to set a maximum speed limit for the processing or run it at full speed.
Please note that setting the number of threads of the ThreadPool/TaskFactory does not provide sufficient accuracy for my needs, as I would like to be able to set a speed limit below the 'one thread speed'.
Using a custom scheduler for the TPL seems to be a way to do that, but I did not find a way to implement it.
Furthermore, I'm worried about the efficiency cost of such a setup.
Could you suggest a way to achieve this, or offer any advice?
Thanks in advance for your answers.

The TPL provides a convenient programming abstraction on top of the Thread Pool. I would always select TPL when that is an option.
If you wish to throttle the total processing speed, there's nothing built-in that would support that.
You can measure the total processing speed as you proceed through the file and regulate speed by introducing (non-spinning) delays in each thread. The size of the delay can be dynamically adjusted in your code based on observed processing speed.
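One way to sketch the delay-based regulation described above is a small shared throttle that each worker calls after finishing a line. RateThrottle, ComputeDelay and OnItemCompleted are hypothetical names, and the policy (sleep until the completed count fits the configured rate) is just one possible choice, not something built into the TPL.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public sealed class RateThrottle
{
    private readonly double _maxItemsPerSecond;
    private readonly Stopwatch _clock = Stopwatch.StartNew();
    private long _completed;

    public RateThrottle(double maxItemsPerSecond)
    {
        _maxItemsPerSecond = maxItemsPerSecond;
    }

    // How long to pause so that 'completed' items do not exceed the rate
    // limit, given the time that has elapsed since processing started.
    public static TimeSpan ComputeDelay(long completed, TimeSpan elapsed, double maxItemsPerSecond)
    {
        double earliestSeconds = completed / maxItemsPerSecond;
        double waitSeconds = earliestSeconds - elapsed.TotalSeconds;
        return waitSeconds > 0 ? TimeSpan.FromSeconds(waitSeconds) : TimeSpan.Zero;
    }

    // Call after each processed line; sleeps (no spinning) when running ahead.
    public void OnItemCompleted()
    {
        long completed = Interlocked.Increment(ref _completed);
        TimeSpan delay = ComputeDelay(completed, _clock.Elapsed, _maxItemsPerSecond);
        if (delay > TimeSpan.Zero) Thread.Sleep(delay);
    }
}
```

A single instance would be shared by all workers, for example by calling throttle.OnItemCompleted() at the end of a Parallel.ForEach body. Because the limit applies to the total count, it can be set below the speed of a single thread.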

I am not seeing the advantage of limiting the speed, but I suggest you look into limiting the maximum degree of parallelism of the operation. That can be done via MaxDegreeOfParallelism in the ParallelOptions passed to Parallel.ForEach as the code works over the disparate lines of data. That way you can control the slots, for lack of a better term, which can be increased or reduced depending on the criteria you are working under.
Here is an example using a ConcurrentBag to process lines of disparate data with 2 parallel tasks.
var myLines = new List<string> { "Alpha", "Beta", "Gamma", "Omega" };
var stringResult = new ConcurrentBag<string>();
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = 2;
Parallel.ForEach( myLines, parallelOptions, line =>
{
    if (line.Contains( "e" ))
        stringResult.Add( line );
} );
Console.WriteLine( string.Join( " | ", stringResult ) );
// Outputs Beta | Omega (ConcurrentBag is unordered, so the order may vary)
Note that ParallelOptions also has a TaskScheduler property with which you can further refine the processing. Finally, for more control, maybe you want to cancel the processing when a specific threshold is reached? If so, look into the CancellationToken property to exit the process early.

C# read text file lines multi thread

I want to write a fast multithreaded program in C# that reads a file.
So the file must be split into parts, with each part processed in a different thread. For example:
Line1
Line2
Line3
Line4
must be split across 4 threads like this:
Line1 => thread 1
Line2 => thread 2
Line3 => thread 3
Line4 => thread 4
I used StreamReader.ReadLine(), but it can't read a specific line.
Comment: it's necessary to speed up the program, so I want to read the file in separate threads.

Unless you're using fixed-length lines, this isn't possible.
Why? Because in order to determine where the "lines" split, you need to find the newline characters... which means you need to read the file first.
Now, if you simply want to perform some extra "processing" after you read in each line, that is possible and relatively straightforward using a ThreadPool.

You should read the file in a single thread - but then spawn the processing of each line to a different thread, e.g. by adding it to a producer/consumer queue.
Even if you could seek to a specific line in a text file (which in general you can't) you really don't want the disk thrashing around - that'll only slow things down. The fastest way to get the data off the disk is to read it sequentially. By all means defer everything about handling the line beyond "decoding the binary data to text" to other threads, but you really don't want the IO to be in multiple threads.
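The single-reader, multiple-worker arrangement described here might look roughly like the sketch below, using BlockingCollection as the producer/consumer queue. LineProcessor, ProcessLines, the worker count and the capacity of 100 are illustrative assumptions; a TextReader is used so that a StreamReader over a real file or a StringReader in a test can be passed in.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class LineProcessor
{
    // Reads lines sequentially on the calling thread and hands each one to
    // one of 'workerCount' consumer tasks. Result order is not preserved.
    public static List<string> ProcessLines(
        TextReader reader, int workerCount, Func<string, string> process)
    {
        // Bounded so the reader cannot run arbitrarily far ahead of the workers.
        var queue = new BlockingCollection<string>(boundedCapacity: 100);
        var results = new ConcurrentBag<string>();

        Task[] workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Run(() =>
            {
                // Blocks while the queue is empty; ends after CompleteAdding.
                foreach (string line in queue.GetConsumingEnumerable())
                    results.Add(process(line));
            }))
            .ToArray();

        string current;
        while ((current = reader.ReadLine()) != null)
            queue.Add(current); // the only place where IO happens
        queue.CompleteAdding();

        Task.WaitAll(workers);
        return results.ToList();
    }
}
```

Only the thread calling ProcessLines touches the file, so the disk is read sequentially while the per-line work is spread over the workers.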

AFAIK .NET doesn't support parallel stream reading. If you want to process every line, you may use File.ReadAllLines. It returns an array of strings. Then you can use PLINQ.
var result = File.ReadAllLines("path")
    .AsParallel()
    .Select(s => DoSthWithString(s))
    .ToList();

You're not going to be able to speed up the actual reading because you're going to have tremendous locking issues keeping everything straight.
Since a text file is an unstructured file, ie. each line can be of different length, you have no choice but to read each line after the other, one by one.
Now, what you can do is process those lines on different threads, but the actual reading, keep that to one thread.
But, before you do that, are you sure you even have to do this? Is this a bottleneck? If not, fix the bottleneck first and see how far you get.

Your StreamReader is connected to a stream class. Using the stream class you can .Seek to a particular byte location.
Like others have said, this probably isn't a good idea, but it can be done.
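For illustration only, seeking before reading could be sketched as below. SeekDemo and ReadLineAt are hypothetical names, and the sketch assumes the caller already knows a byte offset at which a line starts, which in practice only holds for fixed-length lines or a prebuilt index.

```csharp
using System.IO;
using System.Text;

public static class SeekDemo
{
    // Positions the stream at 'byteOffset' and reads one line from there.
    // The offset is assumed to be the first byte of a line.
    public static string ReadLineAt(Stream stream, long byteOffset)
    {
        stream.Seek(byteOffset, SeekOrigin.Begin);
        // leaveOpen: true keeps the underlying stream usable afterwards.
        using (var reader = new StreamReader(stream, Encoding.UTF8, false, 1024, leaveOpen: true))
        {
            return reader.ReadLine();
        }
    }
}
```

Seeking must be done on the stream before the reader is created, because a StreamReader buffers ahead and its position does not follow later Seek calls on the stream.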

I would split the file beforehand. Say the file is 1000 lines. Split it into 10 files of 100 lines. Have a thread process each file.

Parallel programming in C#

I'm interested in learning about parallel programming in C#.NET (not everything there is to know, but the basics and maybe some good practices), so I've decided to rewrite an old program of mine called ImageSyncer. ImageSyncer is a really simple program: all it does is scan through a folder and find all files ending with .jpg, then it calculates the new position of the files based on the date they were taken (parsing of EXIF data, or whatever it's called). After a location has been generated, the program checks for any existing file at that location, and if one exists it looks at the last write time of both the file to copy and the file "in its way". If those are equal, the file is skipped. If not, an MD5 checksum of both files is created and compared. If there is no match, the file to be copied is given a new location to be copied to (for instance, if it was to be copied to "C:\test.jpg", it's copied to "C:\test(1).jpg" instead). The result of this operation is added to a queue of a struct type that contains two strings: the original file and the position to copy it to. Then that queue is iterated over until it is empty and the files are copied.
In other words there are 4 operations:
1. Scan directory for jpegs
2. Parse files for EXIF and generate copy-location
3. Check for file existence and if needed generate new path
4. Copy files
So I want to rewrite this program to make it parallel and able to perform several of the operations at the same time, and I was wondering what the best way to achieve that would be. I've come up with two different models, but neither of them might be any good at all. The first one is to parallelize the 4 steps of the old program, so that when step 1 is to be executed it's done on several threads, and when all of step 1 is finished, step 2 begins. The other one (which I find more interesting because I have no idea how to do it) is to create a sort of worker and consumer model, so when a thread is finished with step 1, another one takes over and performs step 2 on that object (or something like that). But as I said, I don't know if either of these is a good solution. Also, I don't know much about parallel programming at all. I know how to make a thread and how to make it perform a function taking an object as its only parameter, and I've also used the BackgroundWorker class on one occasion, but I'm not that familiar with any of them.
Any input would be appreciated.

There are a few options:
Parallel LINQ: Running Queries On Multi-Core Processors
Task Parallel Library (TPL): Optimize Managed Code For Multi-Core Machines
If you are interested in basic threading primitives and concepts: Threading in C#
[But as @John Knoeller pointed out, the example you gave is likely to be sequential and I/O bound]

This is the reference I use for C# thread: http://www.albahari.com/threading/
As a single PDF: http://www.albahari.com/threading/threading.pdf
For your second approach:
I've worked on some producer/consumer multithreaded apps where each task is some code that loops forever. An external "initializer" starts a separate thread for each task and initializes an EventWaitHandle for each task. Each task has a global queue that can be used to produce/consume input.
In your case, your external program would add each directory to the queue for Task 1, and Set the EventWaitHandle for Task 1. Task 1 would "wake up" from its EventWaitHandle, get the count of directories in its queue, and then, while the count is greater than 0, get the directory from the queue, scan for all the .jpgs, add each .jpg location to a second queue, and set the EventWaitHandle for Task 2. Task 2 reads its input, processes it, forwards it to a queue for Task 3...
It can be a bit of a pain getting all the locking to work right (I basically lock any access to the queue, even something as simple as getting its count). .NET 4.0 is supposed to have data structures that will automatically support a producer/consumer queue with no locks.
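With the concurrent collections that shipped in .NET 4.0, the chained-queue design above can be sketched without explicit locks or wait handles. StagePipeline.Run and its two-stage shape are illustrative assumptions; a real version of the image program would have one stage per operation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class StagePipeline
{
    // Runs two stages concurrently. Each stage consumes its own queue and
    // marks the next queue as complete once its input has run dry.
    public static List<TOut> Run<TIn, TMid, TOut>(
        IEnumerable<TIn> inputs, Func<TIn, TMid> stage1, Func<TMid, TOut> stage2)
    {
        var toStage1 = new BlockingCollection<TIn>();
        var toStage2 = new BlockingCollection<TMid>();
        var output = new List<TOut>(); // written only by the second stage

        Task first = Task.Run(() =>
        {
            try
            {
                foreach (TIn item in toStage1.GetConsumingEnumerable())
                    toStage2.Add(stage1(item));
            }
            finally
            {
                toStage2.CompleteAdding(); // lets the second stage finish
            }
        });

        Task second = Task.Run(() =>
        {
            foreach (TMid item in toStage2.GetConsumingEnumerable())
                output.Add(stage2(item));
        });

        foreach (TIn input in inputs)
            toStage1.Add(input);
        toStage1.CompleteAdding();

        Task.WaitAll(first, second);
        return output;
    }
}
```

GetConsumingEnumerable blocks while a queue is empty, which replaces the wake-up role of the EventWaitHandle, and CompleteAdding replaces the "loop forever" convention with a clean shutdown signal.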

Interesting problem.
I came up with two approaches. The first is based on PLinq and the second is based on the Rx Framework.
The first one iterates through the files in parallel.
The second one produces the files from the directory asynchronously.
Here is how it looks in a much simplified version (the first method requires .NET 4.0 since it uses PLinq):
string directory = "Mydirectory";
var jpegFiles = System.IO.Directory.EnumerateFiles(directory, "*.jpg");
// -- PLinq --------------------------------------------
jpegFiles
    .AsParallel()
    .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
    .ForAll(fileInfo =>
    {
        // Copy when the target is missing or its creation time differs from the source's.
        if (!File.Exists(fileInfo.NewLocation) ||
            File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
            File.Copy(fileInfo.OldLocation, fileInfo.NewLocation, true); // simplification: overwrites an existing target
    });
// -----------------------------------------------------
//-- Rx Framework ---------------------------------------------
var resetEvent = new AutoResetEvent(false);
var doTheWork =
    jpegFiles.ToObservable()
        .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
        .Subscribe(fileInfo =>
        {
            // Same check as in the PLinq version.
            if (!File.Exists(fileInfo.NewLocation) ||
                File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
                File.Copy(fileInfo.OldLocation, fileInfo.NewLocation, true);
        }, () => resetEvent.Set());
resetEvent.WaitOne();
doTheWork.Dispose();
// -----------------------------------------------------
