Parallel.ForEach poor performance

I wrote a little program that converts a bunch of files to pdf.
The program does the following:
Get an Array of FileInfo objects from a Folder (10'000 docs)
For each FileInfo
Create a backup copy with FileInfo.CopyTo(),
Convert the Document to PDF by using some Aspose Libraries
After conversion, copy the PDF to a new destination
Inside the foreach an Event is raised and handled by a WinForm UI to show some progress
Depending on the size of the Document the conversion of a Document can take 0-3 seconds.
I thought that would be a perfect candidate for Parallel.ForEach, so I modified the program.
However, the conversion took 1.5 hours with Parallel.ForEach instead of 1 hour with a conventional foreach (the server I tried it on has 2 x Intel Xeon processors).
What did I do wrong or what do I need to consider to get better performance?

I recommend checking whether your operation is CPU bound or I/O bound by looking at CPU usage in Task Manager and at disk I/O response time and queue length in Resource Monitor, and/or by looking at the various available performance counters.
I suspect your problem is most likely that you are now doing multiple file copies (both for creating the backup and writing the converted file) at the same time. Hard disks are much faster for sequential access (if you only write/read one file at a time) compared to random access.

I can think of several problems that could make Parallel.ForEach slower:
Running more threads than processors.
The Aspose libraries may not support multithreading.
Multiple threads accessing the GUI, which is not thread-safe and cannot be reached from different threads at the same time.
I also recommend reading my previous answer about the Task Parallel Library: Parallelism on single core
It talks about a single core, but it may apply to your problem as well.

It would depend on quite a few things. I would certainly try setting MaxDegreeOfParallelism to 2, in the hope that if the conversion is CPU-bound and single-threaded, then having one per core should be close to ideal, though certainly experiment further.
But your very approach assumes that the conversion doesn't itself make good use of multiple cores. If it does, and it's CPU-bound, then it's already doing the sort of parallel use of cores that you are trying to introduce, and you're likely just going to make the whole thing less efficient for that reason.
Edit: Thought made clearer in light of svick's comment. If the library doesn't support multi-threaded use then it's unlikely to have gotten this far without erroring, but its support for multi-threading may involve a lot of internal locking that could be fine when there are occasional concurrent calls, but very expensive if you've got long-term heavy concurrency.
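For what it's worth, capping the degree of parallelism is done through ParallelOptions. The following is only a minimal sketch: the class name, the ConvertAll method and the convert delegate are hypothetical stand-ins for the real backup/convert/copy step, not anything from the Aspose API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledConversion
{
    // Runs `convert` for every file with at most `maxParallel` concurrent calls
    // and returns the number of files processed.
    public static int ConvertAll(IEnumerable<FileInfo> files,
                                 Action<FileInfo> convert,
                                 int maxParallel)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        int done = 0;

        Parallel.ForEach(files, options, file =>
        {
            convert(file);                    // hypothetical stand-in for the real conversion
            Interlocked.Increment(ref done);  // thread-safe counter shared by all workers
        });

        return done;
    }
}
```

Calling ConvertAll(files, ConvertOne, 2) and then repeating the run with 1, 4 and so on is a cheap way to see which setting, if any, beats the plain foreach on a given machine.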

Related

Parallel Programing with Threads

Okay, I am a bit confused about what I should do and how. I know the theory of parallel programming and threading, but here is my case:
We have a number of log files in a given folder. We read these log files into a database. Reading these files usually takes a couple of hours, as we do it serially, i.e. we iterate through each file, open a SQL transaction for each file and insert the log into the database, then read the next file and do the same.
Now, I am thinking of using parallel programming so I can use all cores of the CPU. However, I am still not clear whether using a thread for each file will make any difference to the system. I mean, if I create, say, 30 threads, will they run on a single core or will they run in parallel? How can I use both, if they are not already doing that?
EDIT: I am using a single server with a 10K-speed HDD, a 4-core CPU and 4 GB RAM, with no network operations; SQL Server is on the same machine, with Windows 2008 as the OS [I can change the OS if that helps too :)].
EDIT 2: I ran some tests based on your feedback to be sure. Here is what I found on my i3 Quad Core CPU with 4 GB RAM:
CPU1 remains at 24-50%, CPU2 remains under 50% usage, CPU3 remains at 75% usage and CPU4 remains around 0%. Yes, I have Visual Studio, an email client and a lot of other applications open, but this tells me that the application is not using all cores, as CPU4 remains at 0%.
RAM remains constantly at 74% [it was around 50% before the test]; that is how we designed the read, so nothing to worry about.
HDD read/write usage remains below 25%, spiking up to 25% in a sine-wave pattern, as our SQL transaction is first stored in memory and then written to disk when memory reaches a threshold, so again, nothing to worry about.
So all resources are underutilized here, and hence I think I can distribute the work to make it more efficient. Your thoughts again? Thanks.

First of all, you need to understand your code and why it is slow. If you're thinking something like “my code is slow and uses one CPU, so I'll just make it use all 4 CPUs and it will be 4 times faster”, then you're most likely wrong.
Using multiple threads makes sense if:
Your code (or at least a part of it) is CPU bound. That is, it's not slowed down by your disk, your network connection or your database server, it's slowed down by your CPU.
Or your code has multiple parts, each using a different resource. E.g. one part reads from a disk, another part converts the data, which requires lots of CPU and last part writes the data to a remote database. (Parallelizing this doesn't actually require multiple threads, but it's usually the simplest way to do it.)
From your description, it sounds like you could be in situation #2. A good solution for that is the producer consumer pattern: Stage 1 thread reads the data from the disk and puts it into a queue. Stage 2 thread takes the data from the queue, processes them and puts them into another queue. Stage 3 thread takes the processed data from the second queue and saves them to the database.
In .Net 4.0, you would use BlockingCollection<T> for the queue between the threads. And when I say “thread”, I pretty much mean Task. In .Net 4.5, you could use blocks from TPL Dataflow instead of the threads.
If you do it this way then you can get up to three times faster execution (if each stage takes the same time). If Stage 2 is the slowest part, then you can get another speedup by using more than one thread for that stage (since it's CPU bound). The same could also apply to Stage 3, depending on your network connection and your database.
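The three-stage pipeline described above could be sketched roughly like this with BlockingCollection<T>. Everything named here (LogPipeline, Run, the read and parse delegates, the capacity of 16) is a placeholder assumption; the third stage just collects rows in a list where the real code would insert into the database, and error handling is kept to a bare minimum.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class LogPipeline
{
    public static List<string> Run(IEnumerable<string> paths,
                                   Func<string, string> read,    // stage 1: disk
                                   Func<string, string> parse,   // stage 2: CPU
                                   int capacity = 16)
    {
        // Bounded queues keep a fast stage from running far ahead of a slow one.
        var raw = new BlockingCollection<string>(capacity);
        var parsed = new BlockingCollection<string>(capacity);
        var saved = new List<string>();

        var stage1 = Task.Factory.StartNew(() =>
        {
            try { foreach (var path in paths) raw.Add(read(path)); }
            finally { raw.CompleteAdding(); }        // tells stage 2 there is no more input
        });

        var stage2 = Task.Factory.StartNew(() =>
        {
            try { foreach (var text in raw.GetConsumingEnumerable()) parsed.Add(parse(text)); }
            finally { parsed.CompleteAdding(); }     // tells stage 3 there is no more input
        });

        var stage3 = Task.Factory.StartNew(() =>
        {
            // Stand-in for the database insert; only this task touches `saved`.
            foreach (var row in parsed.GetConsumingEnumerable()) saved.Add(row);
        });

        Task.WaitAll(stage1, stage2, stage3);
        return saved;
    }
}
```

With one task per stage the items come out in input order. Starting several stage 2 tasks that all consume from raw would be the way to scale the CPU-bound part, at the cost of that ordering and of having to call parsed.CompleteAdding() only after the last of them finishes.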

There is no definite answer to this question and you'll have to test because, as mentioned in my comments:
if the bottleneck is the disk I/O then you won't gain a lot by adding more threads and you might even worsen performance because more threads will be fighting to get access to the disk
if you think disk I/O is OK but CPU load is the issue then you can add some threads, but no more than the number of cores because here again things will worsen due to context switching
if you can do more disk and network I/O and CPU load is not high (very likely) then you can oversubscribe with (far) more threads than cores: typically if your threads are spending much of their time waiting for the database
So you should profile first, and then (or directly if you're in a hurry) test different configurations, but chances are you'll be in the third case. :)

First, you should check what is taking the time. If the CPU actually is the bottleneck, parallel processing will help. Maybe it's the network and a faster network connection will help. Maybe buying a faster disc will help.
Find the problem before thinking about a solution.

Your problem is not about using all the CPUs; your operations are mainly I/O (reading files, sending data to the DB).
Using Thread/Parallel will make your code run faster since you are processing many files at the same time.
To answer your question, the framework/OS will take care of running your code across the different cores.

It varies from machine to machine, but generally speaking, if you have a dual-core processor and 2 threads, the operating system will pass one thread to one core and the other thread to the other. It doesn't matter how many cores you use; what matters is whether your approach is the fastest. If you want to make use of parallel programming, you need a way of sharing the workload that logically makes sense. You also need to consider where your bottleneck is actually occurring. Depending on the size of the file, it may simply be the maximum read/write speed of the storage medium that is taking so long. As a test, I suggest you log where most of the time in your code is being spent.
A simple way to test whether a non-serial approach will help you is to sort your files in some order, divide the workload between 2 threads doing the same job simultaneously, and see if it makes a difference. If a second thread doesn't help you, then I guarantee 30 threads will only make it take longer due to the OS having to switch threads back and forth.
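A quick way to run that kind of experiment is to time the same workload at different thread counts. This is just a sketch: ParallelismProbe and Measure are made-up names, and the work delegate stands for whatever you do per file.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ParallelismProbe
{
    // Runs `work` over all items with at most `threads` concurrent workers
    // and returns the wall-clock time the whole batch took.
    public static TimeSpan Measure<T>(IList<T> items, Action<T> work, int threads)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = threads };
        var stopwatch = Stopwatch.StartNew();
        Parallel.ForEach(items, options, work);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Comparing Measure(files, ProcessFile, 1) with Measure(files, ProcessFile, 2) on a representative sample of files shows whether a second worker helps at all. Disk caches can skew the second run, so it is worth repeating the measurements in both orders.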

Using the latest constructs in .Net 4 for parallel programming, threads are generally managed for you... take a read of getting started with parallel programming
(pretty much the same as what has happened more recently with async versions of functions to use if you want it async)
e.g.
for (int i = 2; i < 20; i++)
{
    var result = SumRootN(i);
    Console.WriteLine("root {0} : {1} ", i, result);
}
becomes
Parallel.For(2, 20, (i) =>
{
    var result = SumRootN(i);
    Console.WriteLine("root {0} : {1} ", i, result);
});
EDIT: That said, it would perhaps also be productive / faster to put intensive tasks into separate threads... but manually making your application 'multi-core' and having things like certain threads running on particular cores isn't currently possible; that's all managed under the hood...
have a look at plinq for example
and .Net Parallel Extensions
and look into
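// ProcessorAffinity is an IntPtr bitmask; 4 (binary 100) restricts the process to the third logical processor
System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)4;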
Edit2:
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.

Why do .NET threads have inferior performance to separate .NET processes?

Lately I've been observing an interesting phenomenon, and before I reengineer my whole software architecture based on it, I'd like to know why this happens, and if it's perhaps possible to make thread performance on par with process performance.
Generally, the task is to download certain data. If we make one process with 6 threads, based on the Parallel library, the downloads take around 10s.
If we, however, make 6 processes, each being single threaded, and download the same data, the whole thing will only take around 6s.
The numbers are thoroughly verified and statistically significant, so do take them for granted.
The observation holds over a large (100s of trials) dataset and I've observed no deviation from this behavior.
Basically, the question is, why a non-synchronizing multithreaded process is slower than a few separate processes with the exact same working code, and how it can be fixed?
Thanks in advance!
Note: I've read similar questions but the answers haven't been satisfactory and practical.

My guess is the same as svick's: you probably have some kind of bottleneck inserted by the runtime.
In general, you can use a tool like Fiddler or Wireshark to see how your downloads are interleaving. In your case, I would expect that there would only be two active at any one time and that once one finishes, another will start immediately.
Before you go and change the setting, you should understand why it's there. It is written into the HTTP spec as suggested client behavior so as to not overwhelm the server. If your code is going to be distributed out to hundreds/thousands/millions of machines, you should consider the effects of 10 simultaneous downloads per client.
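The answer does not name the setting. My assumption is that it is ServicePointManager.DefaultConnectionLimit, which on the .NET Framework defaults to 2 concurrent connections per host for client applications using HttpWebRequest or WebClient. The limit is per process, which would explain why six separate processes are not throttled the way six threads in one process are. A minimal sketch (ConnectionLimitDemo and Raise are made-up names):

```csharp
using System;
using System.Net;

public static class ConnectionLimitDemo
{
    // Sets the process-wide default for connections per host and returns the
    // value now in effect. Call this before the first request is created,
    // because existing ServicePoint instances keep the limit they were created with.
    public static int Raise(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
        return ServicePointManager.DefaultConnectionLimit;
    }
}
```

For example, ConnectionLimitDemo.Raise(6) at startup would match the six download threads from the question. On newer .NET versions this class is considered legacy and HttpClient may not honour it, so treat this as applying to the .NET Framework only.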

Multithreaded application does not reach 100% of processor usage

My multithreaded application takes some files from the HD and then processes the data in these files. I reuse the same instance of a class (dataProcessing) to create threads (I just change the parameters of the calling method).
processingThread[i] = new Thread(new ThreadStart(dataProcessing.parseAll));
I am wondering if the cause could be all threads reading from the same memory.
It takes about half a minute to process each file. The files are read quickly since they are just 200 KB. After I process the files I write all the results to a single destination file. I don't think the problem is reading from or writing to the disk. All the threads are working on the task, but for some reason the processor is not being fully used. I tried adding more threads to see if I could reach 100% processor usage, but there comes a point where it slows down and processor usage decreases instead of being fully used. Does anyone have an idea what could be wrong?

Here are some points you might want to consider:
Most CPUs today are hyper-threaded. Even though the OS assumes that each hyper-threaded core has 2 pipelines, this is not the case and is very dependent on the CPU and the arithmetic operations you are performing. While on most CPUs there are 2 integer units on each pipeline, there is only one FP unit, so most FP operations do not gain any benefit from the hyper-threaded architecture.
Since the file is only 200k I can only assume that it is all copied to the cache so this is not a memory/disk issue.
Are you using external DLLs? Some operations, like reading/saving JPEG files using the native Bitmap class, are not parallel and you won't see any speed-up if you are doing multiple executions at once.
Performance decreases once you reach the point where switching between the threads costs more than the operation they are doing.
Are you only reading the data or are you also modifying it? If each thread also modifies the data then there are many locks on the cache. It would be better for each thread to gather its own data in its own memory and combine all the data together only after all the threads have done their job.
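The last point (each thread gathers its own data and the results are combined only at the end) maps onto the Parallel.ForEach overload with thread-local state. A small sketch under assumed names; summing string lengths is just a placeholder for the real per-file processing.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class LocalAccumulation
{
    public static long SumOfLengths(IEnumerable<string> lines)
    {
        long total = 0;

        Parallel.ForEach(
            lines,
            () => 0L,                                     // one private subtotal per worker
            (line, state, local) => local + line.Length,  // touches no shared memory
            local => Interlocked.Add(ref total, local));  // one combine step per worker, at the end

        return total;
    }
}
```

The shared variable is written only once per worker instead of once per item, which is what keeps the cores from fighting over the same cache line.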

How to compile C# for multiple processor machines? (With VS 2010 or csc.exe)

Greetings!
I've searched for compiler (csc.exe) options at MSDN and I found an answer here, at Stackoverflow, about compiling with multiple processors. But my problem is about compiling for multiple processors, as follows.
The university where I'm graduating has an 11-machine cluster (which has 6 quad-cores and 5 four-core bi-processed machines). It runs under Linux, but I can install Mono there. And instead of compiling with multiple processors or cores, I want to compile for a multiple-processor machine. So:
Is there any particular detail on how to do it or the CLR on that system should handle the execution to spread it across the cores?
If there's a way to do this, how can I do it with VS2010 or with csc.exe command line compiler?
Thanks in advance and I'm sorry if this question makes no sense. I really don't know how to handle multiple cores, as I'm a mere physicist, not a computer scientist! :)

You don't need to compile any differently to take account of multiple cores. You need to write your code differently though, to use multiple threads. If you can use classes from .NET 4 in your environment (a recent version of Mono should support this) you can use the Task Parallel Library which makes this a bit easier.
Basically you don't get concurrency for free - you have to think about which bits of your code can sensibly run in parallel. You might want to read the output of the Patterns and Practices group for parallel programming. (The book is a very good starting point.)

Your supposition is correct; your question makes no sense.
It is not possible to magically parallelize arbitrary code; you need to modify the code to use multiple threads.
You can explicitly use multiple cores in C# by using the Thread or ThreadPool classes, or by using Parallel LINQ or the TPL.
There is no special compiler involved.
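To illustrate that the change is in the code and not in the compiler, here is a tiny Parallel LINQ sketch. PlinqDemo and SumOfSquares are made-up names, and the arithmetic stands in for a real computation; it compiles with the ordinary csc.exe (or Mono's compiler) with no special switches.

```csharp
using System;
using System.Linq;

public static class PlinqDemo
{
    // Sums i*i for i = 1..n; AsParallel() lets the runtime split the range
    // across the available cores.
    public static long SumOfSquares(int n)
    {
        return Enumerable.Range(1, n)
                         .AsParallel()
                         .Select(i => (long)i * i)
                         .Sum();
    }
}
```

Removing the AsParallel() call gives the sequential version with the same result. For work this cheap the sequential version is probably faster, since the parallel overhead only pays off when each item takes real time.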

The CLR, by default, will not do anything special to spread the work out across multiple cores. YOU, in developing the application, are responsible for making the best use of your machine's resources. The .NET Framework does have several libraries and technologies that make multithreaded operations simple to implement: look up the Thread class, Delegate.BeginInvoke/EndInvoke, and the Task Parallel Library.

Since it is a cluster, you have to rely on some form of a message-passing parallelism, no compiler will transform your code automatically. At least, a good old MPI is supported: http://osl.iu.edu/research/mpi.net/

The answer to your question comes in the form of two seemingly-contradictory statements:
1: It already does
and
2: You can't
Modern operating systems and, thus, development environments, use threads. A thread, fundamentally, represents a single series of sequential steps (and words that don't start with "S") that the processor will execute. These threads are managed by the operating system and by the processor architecture, wherein the processor will execute some portion (or all) of a thread, save its state, then switch to another thread.
In the presence of multiple cores (whether by multi-core processors or simply multiple processors or both), it's actually possible for the computer to execute two threads at the same time, assuming that they read and write different locations in memory (threads that utilize the same resources require synchronization, which is a complex ballgame inside this one), by distributing threads across cores.
At the risk of using an overly simplistic simile, think of it this way: your code, as it stands right now, is just a very long list of steps to execute to accomplish a particular task. You've now taken this list of instructions into a room full of people (each representing a processing core), and you'd like to use each of these people as efficiently as possible. While a room full of PhD students might have the context and subject-matter knowledge to figure out how to break out your instructions into tasks for each individual person, you've taken your list to a room full of people who are excellent at following directions but entirely stupid when it comes to deduction. In this case, you need to bring a different set of instructions for each person that when all of them are executed, you end up with the same result.
Put simply, in order to have your code take advantage of multiple cores or processors, you have to break your work down into small, preferably atomic chunks of code. The specific method that you use to break up your code into multiple threads can vary; using System.Threading.ThreadPool or the more recently introduced Task Parallel Library can make some of these things easier, though there's always a tradeoff (as with everything) in either efficiency or control.
Going into much more detail than that would require looking at your actual code. You'd do better finding someone with experience writing solid, performant multithreaded code (if possible, someone with recent .NET experience doing this, as this will help make the determination about which libraries would be appropriate).

C# - Moving files - to queue or multi-thread

I have an app that moves a project and its files from preview to production using a Flex front-end and a .NET web service. Currently, the process takes about 5-10 minutes per project. Aside from latency concerns, it really shouldn't take that long. I'm wondering whether or not this is a good use case for multi-threading. Also, considering the user may want to push multiple projects at once or one right after another, is there a way to queue the jobs?
Any suggestions and examples are greatly appreciated.
Thanks!

Something that does heavy disk IO typically isn't a good candidate for multithreading since the disks can really only do one thing at a time. However, if you're pushing to multiple servers or the servers have particularly good disk subsystems some light threading may be beneficial.

As a note - regardless of whether or not you decide to queue the jobs, you will use multi-threading. Queueing is just one way of handling what is ultimately solved using multi-threading.
And yes, I'd recommend you build a queue to push out each project.

You should compare the speed of your code to just copying in Windows (i.e., Explorer or the command line) and to copying with something advanced like TeraCopy. If your code is significantly slower than Windows, then use a profiler to find the parts of your code to optimize. If your code is about as fast as Windows but slower than TeraCopy, then multithreading could help.
Multithreading is not generally helpful when the operation is I/O bound, but copying files involves reading from the disk AND writing over the network. These are two I/O operations, so if you separate them onto different threads, it could increase performance. For something like this you need a producer/consumer setup where you have a circular queue with one thread reading from disk and writing to the queue, and another thread reading from the queue and writing to the network. It'll be important to keep in mind that the two threads will not run at the same speed, so if the queue gets full, wait before writing more data, and if it's empty, wait before reading. Also, the locking strategy could have a big impact on performance here and could cause the performance to degrade to slower than a single-threaded implementation.
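A rough sketch of that reader/writer split follows. Note that it swaps the hand-written circular queue for a bounded BlockingCollection<T>, which does the "wait when full / wait when empty" part and the locking for you. QueuedCopy, Copy, the chunk size and the queue length are all assumptions, and error handling is minimal.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class QueuedCopy
{
    // Copies source to destination with reads and writes on different threads.
    // Returns the number of bytes written.
    public static long Copy(Stream source, Stream destination,
                            int chunkSize = 64 * 1024, int maxQueuedChunks = 8)
    {
        using (var queue = new BlockingCollection<byte[]>(maxQueuedChunks))
        {
            var reader = Task.Factory.StartNew(() =>
            {
                try
                {
                    var buffer = new byte[chunkSize];
                    int read;
                    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        var chunk = new byte[read];
                        Buffer.BlockCopy(buffer, 0, chunk, 0, read);
                        queue.Add(chunk);            // blocks while the queue is full
                    }
                }
                finally { queue.CompleteAdding(); }  // lets the writer loop end
            });

            long written = 0;
            foreach (var chunk in queue.GetConsumingEnumerable())  // blocks while the queue is empty
            {
                destination.Write(chunk, 0, chunk.Length);
                written += chunk.Length;
            }

            reader.Wait();   // surfaces any exception thrown while reading
            return written;
        }
    }
}
```

Allocating a fresh array per chunk keeps the sketch simple. A real implementation would probably recycle buffers, which is where a circular buffer earns its keep.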

If you're moving things between just two computers, the network is going to be the bottleneck, so you may want to queue these operations.
Likewise, on the same machine, the I/O is going to be the bottleneck, so you'd want to queue there, too.

You should try using the ThreadPool.
ThreadPool.QueueUserWorkItem(MoveProject, project);
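For that call to compile, the callback has to match the WaitCallback signature (a single object parameter), and you usually want a way to know when everything queued has finished. A minimal sketch, assuming projects are identified by strings; PoolDemo and MoveAll are made-up names and the "move" itself is only simulated.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class PoolDemo
{
    // Queues one work item per project on the thread pool, waits for all of
    // them to finish, and returns the recorded results in sorted order.
    public static List<string> MoveAll(IEnumerable<string> projects)
    {
        var list = projects.ToList();
        var moved = new ConcurrentBag<string>();

        using (var pending = new CountdownEvent(list.Count))
        {
            foreach (var project in list)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try { moved.Add("moved:" + (string)state); }  // stand-in for the real move
                    finally { pending.Signal(); }                 // always count the item as done
                }, project);
            }
            pending.Wait();   // block until every queued item has signalled
        }

        return moved.OrderBy(s => s).ToList();
    }
}
```

Keep in mind the other answers here: if the disk or the network is the bottleneck, queuing the items so they run one after another may well be faster than letting the pool run them all at once.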

Agreed with everyone over the limited performance of running the tasks in parallel.
If you have full control over your deployment environment, you could use Rhino Queues:
http://ayende.com/Blog/archive/2008/08/01/Rhino-Queues.aspx
This will allow you to produce a queue of jobs asynchronously (say from a WCF service being called from your Silverlight/Flex app) and consume them synchronously.
Alternatively you could use WCF and MSMQ, but the learning curve is greater.

When dealing with multiple files, using multiple threads usually IS a good idea as far as performance is concerned. The main reason is that most disks nowadays support native command queuing.
I recently wrote an article on ddj.com about reading/writing files with multiple threads.
See http://www.ddj.com/go-parallel/article/showArticle.jhtml?articleID=220300055.
Also see related question
Will using multiple threads with a RandomAccessFile help performance?
In particular, my experience is that when dealing with a great many files it IS a good idea to use a number of threads. On the contrary, using many threads in many cases does not slow down applications as much as commonly expected.
Having said that, I'd say there is no other way to find out than trying all the different approaches. It depends on a great many conditions: hardware, OS, drivers, etc.

The very first thing you should do is point any kind of profiling tool towards your software. If you can't do that (like, if you haven't got such a tool), insert logging code.
The very first thing you need to do is figure out what is taking a long time to complete, and then why is it taking a long time to complete. That your "copy" operation as a whole takes a long time to complete isn't good enough, you need to pinpoint the reason for this down to a method or a set of methods.
Until you do that, all the other things you can do to your code will likely be guesswork. My experience has taught me that when it comes to performance, 9 out of 10 reasons for things running slowly come as a surprise to the guy(s) that wrote the code.
So measure first, then change.
For instance, you might discover that you're in fact reporting progress of copying the file on a byte-per-byte basis, to a GUI, using a synchronous call to the UI, in which case it wouldn't matter how fast the actual copying can run, you'll still be bound by message handling speed.
But that's just conjecture until you know, so measure first, then change.
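If measuring does show that progress reporting is the bottleneck, one common remedy is to report only every N items instead of on every step. A sketch under assumed names (ThrottledProgress and ItemCompleted are hypothetical); the report callback is where a WinForms app would marshal to the UI thread, for example with Control.BeginInvoke.

```csharp
using System;
using System.Threading;

public sealed class ThrottledProgress
{
    private readonly int _total;
    private readonly int _every;
    private readonly Action<int, int> _report;   // receives (done, total)
    private int _done;

    public ThrottledProgress(int total, int every, Action<int, int> report)
    {
        if (every < 1) throw new ArgumentOutOfRangeException("every");
        _total = total;
        _every = every;
        _report = report;
    }

    // Safe to call from several worker threads at once.
    public void ItemCompleted()
    {
        int done = Interlocked.Increment(ref _done);

        // Report on every N-th item and on the very last one only.
        if (done % _every == 0 || done == _total)
            _report(done, _total);
    }
}
```

With total = 10000 and every = 100 the UI would receive 100 updates instead of 10000, which is still plenty for a progress bar.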
