My multithreaded application takes some files from the hard disk and then processes the data in those files. I reuse the same instance of a class (dataProcessing) to create the threads (I just change the parameters of the method being called).
processingThread[i] = new Thread(new ThreadStart(dataProcessing.parseAll));
I am wondering if the cause could be all threads reading from the same memory.
It takes about half a minute to process each file. The files are quickly read since they are only 200 KB. After I process the files I write all the results to a single destination file. I don't think the problem is reading from or writing to the disk. All the threads are working on the task, but for some reason the processor is not being fully used. I tried adding more threads to see if I could reach 100% processor usage, but it comes to a point where adding threads slows things down and decreases processor usage instead of increasing it. Does anyone have an idea what could be wrong?
Here are some points you might want to consider:
Most CPUs today are hyper-threaded. Even though the OS assumes that each hyper-threaded core has two pipelines, this is not really the case and depends heavily on the CPU and the arithmetic operations you are performing. While most CPUs have two integer units per pipeline, there is only one FP unit, so most FP operations gain no benefit from the hyper-threaded architecture.
Since each file is only 200 KB, I can only assume it is entirely copied into the cache, so this is not a memory/disk issue.
Are you using external DLLs? Some operations, like reading/saving JPEG files using the native Bitmap class, are not parallel, and you won't see any speed-up from running multiple of them at once.
Performance decreases once you reach the point where switching between threads costs more than the work they are doing.
Are you only reading the data, or are you also modifying it? If each thread also modifies the data, there will be many locks on the cache. It would be better for each thread to gather its results in its own memory and combine everything only after all the threads have done their job.
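The last point can be sketched as follows; Parse is a placeholder for the real per-record work, and each worker fills a private list that is merged only after all workers finish:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class LocalThenCombine
{
    // Placeholder for the real per-record work.
    static string Parse(string line) => line.ToUpperInvariant();

    static void Main()
    {
        string[] input = { "a", "b", "c", "d", "e", "f" };
        const int workers = 2;

        // One private result list per worker: no shared writes,
        // so no locking while the work is running.
        var tasks = new Task<List<string>>[workers];
        for (int w = 0; w < workers; w++)
        {
            int id = w; // capture the loop variable
            tasks[id] = Task.Run(() =>
            {
                var local = new List<string>();
                for (int i = id; i < input.Length; i += workers)
                    local.Add(Parse(input[i]));
                return local;
            });
        }
        Task.WaitAll(tasks);

        // Combine only after every worker has finished.
        var combined = new List<string>();
        foreach (var t in tasks) combined.AddRange(t.Result);
        Console.WriteLine(combined.Count); // all 6 records processed
    }
}
```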
Related
Okay, I am a bit confused about what I should do and how. I know the theory of parallel programming and threading, but here is my case:
We have a number of log files in a given folder. We read these log files into a database. Reading these files usually takes a couple of hours because we do it serially, i.e. we iterate through each file, open a SQL transaction for that file and insert the logs into the database, then read the next file and do the same.
Now I am thinking of using parallel programming so I can use all the cores of the CPU, but I am still not clear whether using a thread per file will make any difference. I mean, if I create, say, 30 threads, will they run on a single core or in parallel? How can I make use of all the cores, if that isn't happening already?
EDIT: I am using a single server with a 10K RPM HDD, a 4-core CPU and 4 GB RAM, with no network operations; SQL Server is on the same machine with Windows 2008 as the OS [I can change the OS if that would help too :)].
EDIT 2: I ran some tests based on your feedback to be sure. Here is what I found on my i3 quad-core CPU with 4 GB RAM:
CPU remains at 24-50%: CPU1 and CPU2 remain under 50% usage, CPU3 remains at 75% usage and CPU4 remains around 0%. Yes, I have Visual Studio, an email client and lots of other applications open, but this tells me that the application is not using all cores, since CPU4 remains at 0%.
RAM remains constantly at 74% [it was around 50% before the test]; that is how we designed the read, so nothing to worry about there.
HDD read/write usage remains below 25%, and even the spikes to 25% come in a sine-wave pattern, since our SQL transactions are first stored in memory and only written to disk when memory reaches a threshold. So again, nothing alarming.
So all resources are underutilized here, and hence I think I can distribute the work to make it more efficient. Your thoughts again. Thanks.
First of all, you need to understand your code and why is it slow. If you're thinking something like “my code is slow and uses one CPU, so I'll just make it use all 4 CPUs and it will be 4 times faster”, then you're most likely wrong.
Using multiple threads makes sense if:
Your code (or at least a part of it) is CPU bound. That is, it's not slowed down by your disk, your network connection or your database server, it's slowed down by your CPU.
Or your code has multiple parts, each using a different resource. E.g. one part reads from a disk, another part converts the data, which requires lots of CPU and last part writes the data to a remote database. (Parallelizing this doesn't actually require multiple threads, but it's usually the simplest way to do it.)
From your description, it sounds like you could be in situation #2. A good solution for that is the producer consumer pattern: Stage 1 thread reads the data from the disk and puts it into a queue. Stage 2 thread takes the data from the queue, processes them and puts them into another queue. Stage 3 thread takes the processed data from the second queue and saves them to the database.
In .Net 4.0, you would use BlockingCollection<T> for the queue between the threads. And when I say “thread”, I pretty much mean Task. In .Net 4.5, you could use blocks from TPL Dataflow instead of the threads.
If you do it this way then you can get up to three times faster execution (if each stage takes the same time). If Stage 2 is the slowest part, then you can get another speedup by using more than one thread for that stage (since it's CPU bound). The same could also apply to Stage 3, depending on your network connection and your database.
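The three-stage pipeline described above could be sketched with BlockingCollection<T> like this (a sketch: the "logs" folder, the parsing step and the database call are all stand-ins):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class Pipeline
{
    // Stand-in for the real SQL insert.
    static void SaveToDatabase(string record) { /* SQL insert would go here */ }

    static void Main()
    {
        // Bounded queues keep any one stage from racing too far ahead.
        var rawLines = new BlockingCollection<string>(boundedCapacity: 100);
        var parsed = new BlockingCollection<string>(boundedCapacity: 100);

        // Stage 1: read from disk (single thread, sequential I/O).
        var reader = Task.Run(() =>
        {
            foreach (var file in Directory.EnumerateFiles("logs")) // hypothetical folder
                foreach (var line in File.ReadLines(file))
                    rawLines.Add(line);
            rawLines.CompleteAdding();
        });

        // Stage 2: CPU-bound parsing (could be several of these tasks).
        var parser = Task.Run(() =>
        {
            foreach (var line in rawLines.GetConsumingEnumerable())
                parsed.Add(line.Trim()); // placeholder for real parsing
            parsed.CompleteAdding();
        });

        // Stage 3: write the parsed records to the database.
        var writer = Task.Run(() =>
        {
            foreach (var record in parsed.GetConsumingEnumerable())
                SaveToDatabase(record);
        });

        Task.WaitAll(reader, parser, writer);
    }
}
```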
There is no definitive answer to this question and you'll have to test, because as mentioned in my comments:
if the bottleneck is the disk I/O then you won't gain a lot by adding more threads and you might even worsen performance because more threads will be fighting to get access to the disk
if you think disk I/O is OK but CPU load is the issue then you can add some threads, but no more than the number of cores, because here again things will worsen due to context switching
if you can do more disk and network I/Os and CPU load is not high (very likely) then you can oversubscribe with (far) more threads than cores: typically if your threads are spending much of their time waiting for the database
So you should profile first, and then (or directly if you're in a hurry) test different configurations, but chances are you'll be in the third case. :)
First, you should check what is taking the time. If the CPU actually is the bottleneck, parallel processing will help. Maybe it's the network and a faster network connection will help. Maybe buying a faster disc will help.
Find the problem before thinking about a solution.
Your problem is not CPU usage; your operations are mainly I/O-bound (reading files, sending data to the DB).
Using threads/Parallel will make your code run faster, since you will be processing many files at the same time.
To answer your question: the framework/OS will take care of spreading your code across the different cores.
It varies from machine to machine, but generally speaking, if you have a dual-core processor and two threads, the operating system will put one thread on each core. It doesn't matter how many cores you use; what matters is whether your computation itself is fast. If you want to make use of parallel programming you need a way of sharing the workload that logically makes sense. Also, you need to consider where your bottleneck is actually occurring. Depending on the size of the file, it may simply be the maximum read/write speed of your storage medium that is taking so long. As a test, I suggest you log where the most time in your code is being consumed.
A simple way to test whether a non-serial approach will help you is to sort your files in some order, divide the workload between 2 threads doing the same job simultaneously, and see if it makes a difference. If a second thread doesn't help you, then I guarantee 30 threads will only make it take longer due to the OS having to switch threads back and forth.
Using the latest constructs in .Net 4 for parallel programming, threads are generally managed for you... take a read of getting started with parallel programming
(pretty much the same as what has happened more recently with async versions of functions to use if you want it async)
e.g.
for (int i = 2; i < 20; i++)
{
var result = SumRootN(i);
Console.WriteLine("root {0} : {1} ", i, result);
}
becomes
Parallel.For(2, 20, (i) =>
{
var result = SumRootN(i);
Console.WriteLine("root {0} : {1} ", i, result);
});
EDIT: That said, it would be productive/faster to put intensive tasks into separate threads... but manually making your application 'multi-core', with particular threads running on particular cores, isn't currently possible; that's all managed under the hood...
have a look at plinq for example
and .Net Parallel Extensions
and look into
System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)4; // bitmask: bit 2 set, i.e. run only on the third core
Edit2:
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.
Recently I've been analyzing how my parallel computations actually speed up on a 16-core processor. And the general rule I arrived at - the more threads you have, the less speed per core you get - is embarrassing me. Here are the diagrams of my CPU load and processing speed:
So, you can see that processor load increases, but speed increases much slower. I want to know why such an effect takes place and how to get the reason of unscalable behaviour.
I've made sure to use Server GC mode.
I've made sure that I'm parallelizing the appropriate code, given that the code does nothing more than:
Loads data from RAM (server has 96 GB of RAM, swap file shouldn't be hit)
Performs fairly simple calculations
Stores data in RAM
I've profiled my application carefully and found no bottlenecks - looks like each operation becomes slower as thread number grows.
I'm stuck, what's wrong with my scenario?
I use .Net 4 Task Parallel Library.
You will always get this kind of curve, it's called Amdahl's law.
The question is how soon it will level off.
You say you checked your code for bottlenecks, let's assume that's correct. Then there is still the memory bandwidth and other hardware factors.
The key to linear scalability - in the sense that going from one to two cores doubles the throughput - is to use shared resources as little as possible. This means:
don't use hyperthreading (because the two threads share the same core's resources)
tie every thread to a specific core (otherwise the OS will juggle the threads between cores)
don't use more threads than there are cores (otherwise the OS will swap them in and out)
stay inside the core's own caches - nowadays the L1 & L2 caches
don't venture into the L3 cache or RAM unless it is absolutely necessary
minimize/economize on critical section/synchronization usage
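The "no more threads than cores" point can be sketched with the TPL; note that Environment.ProcessorCount reports logical cores, so on a hyper-threaded CPU you may want to halve it (the halving here assumes 2-way SMT):

```csharp
using System;
using System.Threading.Tasks;

class CappedParallelism
{
    static void Main()
    {
        // Cap the worker count at the core count so the OS isn't forced
        // to swap threads in and out. Environment.ProcessorCount counts
        // logical cores; divide by 2 on a 2-way hyper-threaded CPU if you
        // want one thread per physical core.
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.For(0, 1000, options, i =>
        {
            // CPU-bound work goes here.
        });
    }
}
```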
If you've come this far you've probably profiled and hand-tuned your code too.
Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
Don't worry about the OS scheduler. If your application is CPU-bound with long computations that mostly do local L1 & L2 memory accesses, it's a better performance bet to tie each thread to its own core. Sure, the OS will come in, but compared to the work being performed by your threads, the OS work is negligible.
Also I should say that my threading experience is mostly from Windows NT-engine machines.
_______EDIT_______
Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code is a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).
A general decrease in returns as you add more threads could indicate some kind of bottleneck.
Are there ANY shared resources, like a collection or a queue? Or are you using external functions that might depend on some limited resource?
The sharp break at 8 threads is interesting and in my comment I asked if the CPU is a true 16 core or an 8 core with hyper threading, where each core appears as 2 cores to the OS.
If it is hyper-threading, either you have so much work that the hyper-threading cannot double the performance of the core, or the memory pipe to the core cannot handle twice the data throughput.
Is the work performed by the threads even, or are some threads doing more than others? That could also indicate resource starvation.
Since you added that the threads query for data very often, that indicates a very large risk of waiting.
Is there any way to let the threads get more data each time? Like reading 10 items instead of one?
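The batching idea could be sketched like this: each round trip to the shared queue pulls up to 10 items instead of one, cutting synchronization overhead roughly tenfold (Process is a placeholder for the real work):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class BatchedConsumer
{
    // Placeholder for the real per-batch processing.
    static void Process(List<int> batch)
        => Console.WriteLine($"processing {batch.Count} items");

    static void Main()
    {
        var queue = new ConcurrentQueue<int>();
        for (int i = 0; i < 95; i++) queue.Enqueue(i);

        // Accumulate up to 10 items per batch instead of handling
        // each item individually.
        var batch = new List<int>(10);
        while (queue.TryDequeue(out int item))
        {
            batch.Add(item);
            if (batch.Count == 10)
            {
                Process(batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0) Process(batch); // leftover partial batch
    }
}
```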
If you are doing memory intensive stuff, you could be hitting cache capacity.
You could test this with a mock algorithm that just processes the same small bit of data over and over, so that it all fits in cache.
If it indeed is the cache, possible solutions include making the threads work on the same data somehow (like different parts of a small data window), or tweaking the algorithm to be more local (for example, in sorting, merge sort is generally slower than quicksort, but it is more cache-friendly, which still makes it better in some cases).
Are your threads reading and writing to items close together in memory? Then you're probably running into false sharing. If thread 1 works with data[1] and thread2 works with data[2], then even though in an ideal world we know that two consecutive reads of data[2] by thread2 will always produce the same result, in the actual world, if thread1 updates data[1] sometime between those two reads, then the CPU will mark the cache as dirty and update it. http://msdn.microsoft.com/en-us/magazine/cc872851.aspx. To solve it, make sure the data each thread is working with is adequately far away in memory from the data the other threads are working with.
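A minimal sketch of the padding fix, assuming 64-byte cache lines: each counter is forced onto its own line, so one thread's writes no longer invalidate the line another thread is using.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

class FalseSharingDemo
{
    // Pad each counter to a full cache line (64 bytes is typical).
    [StructLayout(LayoutKind.Explicit, Size = 64)]
    struct PaddedCounter
    {
        [FieldOffset(0)] public long Value;
    }

    static void Main()
    {
        const int threads = 4;

        // Adjacent longs share a cache line: every write by one thread
        // marks the line dirty for the others. Kept here for comparison.
        long[] shared = new long[threads];

        // Padded counters each own a full line, so writes don't interfere.
        var padded = new PaddedCounter[threads];

        Parallel.For(0, threads, t =>
        {
            for (int i = 0; i < 10_000_000; i++)
                padded[t].Value++; // compare timing against shared[t]++
        });
    }
}
```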
That could give you a performance boost, but likely won't get you to 16x: there are lots of things going on under the hood and you'll just have to knock them out one by one. And really it's not that your algorithm is running at 30% speed when multithreaded; it's more that your single-threaded algorithm is running at 300% speed, enabled by all sorts of CPU and caching awesomeness that a multithreaded run has a harder time taking advantage of. So there's nothing to be "embarrassed" about. But with some diligence, you can perhaps get the multithreaded version working at nearly 300% speed as well.
Also, if you're counting hyperthreaded cores as real cores, well, they're not. They only allow threads to swap really fast when one is blocked. But they'll never let you run at double speed unless your threads are getting blocked half the time anyway, in which case that already means you have opportunity for speedup.
I wrote a little program that converts a bunch of files to pdf.
The program does the following:
Get an Array of FileInfo objects from a Folder (10'000 docs)
For each FileInfo
Create a backup copy with FileInfo.CopyTo(),
Convert the Document to PDF by using some Aspose Libraries
After conversion, copy the PDF to a new destination
Inside the foreach, an event is raised and handled by a WinForms UI to show progress
Depending on the size of the Document the conversion of a Document can take 0-3 seconds.
I thought that would be a perfect candidate for Parallel.ForEach, so I modified the program.
However, the conversion took 1.5 hours with Parallel.ForEach instead of 1 hour with the conventional foreach (the server I tried it on has 2 x Intel Xeon processors).
What did I do wrong or what do I need to consider to get better performance?
I recommend checking if your operation is CPU bound or I/O bound by looking at the CPU in taskmanager and Disk I/O response time/queue length in Resource Monitor and/or looking at the various available performance counters.
I suspect your problem is most likely that you are now doing multiple file copies (both for creating the backup and writing the converted file) at the same time. Hard disks are much faster for sequential access (if you only write/read one file at a time) compared to random access.
I can think of several problems that could cause the Parallel.ForEach to be slower:
Running more threads than processors.
The Aspose libraries may not support multithreading.
Multiple accesses to the GUI thread, which is not thread-safe and cannot be reached from different threads at the same time.
Also, I recommend you read my previous answer about the Task Parallel Library - parallelism on a single core.
It talks about a single core, but it is relevant to your problem.
It would depend on quite a few things. I would certainly try setting MaxDegreeOfParallelism to 2, in the hope that if the conversion is CPU-bound and single-threaded, then having one per CPU should be close to ideal, though certainly experiment further.
But your very approach assumes that the conversion doesn't itself make good use of multiple cores. If it does, and it's CPU-bound, then it's already doing to sort of parallel use of cores that you are trying to introduce, and you're likely just going to make the whole thing less efficient for that reason.
Edit: Thought made clearer in light of svick's comment. If the library doesn't support multi-threaded use, it's unlikely to have gotten this far without erroring; but its support for multi-threading may involve a lot of internal locking, which could be fine with occasional concurrent calls but very expensive under long-term heavy concurrency.
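A sketch of the throttled conversion, assuming a hypothetical "docs" folder; MaxDegreeOfParallelism = 2 gives one worker per physical CPU, so the CPU-bound conversion runs in parallel while the disk isn't hammered by too many simultaneous file copies:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class ThrottledConversion
{
    static void Main()
    {
        var files = new DirectoryInfo("docs").GetFiles(); // hypothetical folder

        // Limit parallelism to the number of physical CPUs (two Xeons here).
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

        Parallel.ForEach(files, options, file =>
        {
            // Backup copy, PDF conversion and final copy would go here.
        });
    }
}
```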
I have a Parallel.ForEach loop creating binary readers on the same group of large data files.
I was just wondering if it hurts performance that these readers are reading the same files in parallel (i.e. if they were reading exclusively different files, would it go faster?).
I am asking because there is a lot of disk I/O involved (I guess...).
Edit: I forgot to mention: I am using an Amazon EC2 instance and the data is on the C:\ disk assigned to it. I have no idea how that affects this issue.
Edit 2: I'll make measurements by duplicating the data folder, reading from the 2 different sources, and seeing what that gives.
It's not a good idea to read from the same disk using multiple threads. Since the disk's mechanical head needs to move to seek to each new reading location, with multiple threads you are basically bouncing it around, thus hurting performance.
The best approach is actually to read the files sequentially using a single thread and then handing the chunks to a group of threads to process them in parallel.
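A sketch of that approach, assuming a hypothetical "data.bin" file: one task reads sequentially (so the disk head moves linearly) and hands fixed-size chunks to a pool of parallel workers through a bounded queue.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class SequentialReadParallelProcess
{
    static void Main()
    {
        // Bounded queue: the reader stays at most 8 chunks ahead.
        var chunks = new BlockingCollection<byte[]>(boundedCapacity: 8);

        // Single reader thread: sequential disk access only.
        var reader = Task.Run(() =>
        {
            using (var stream = File.OpenRead("data.bin")) // hypothetical file
            {
                var buffer = new byte[64 * 1024];
                int read;
                while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    var chunk = new byte[read];
                    Array.Copy(buffer, chunk, read);
                    chunks.Add(chunk);
                }
            }
            chunks.CompleteAdding();
        });

        // A pool of workers processes the chunks in parallel.
        var workers = Task.Run(() =>
            Parallel.ForEach(chunks.GetConsumingEnumerable(), chunk =>
            {
                // CPU-bound processing of one chunk goes here.
            }));

        Task.WaitAll(reader, workers);
    }
}
```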
It depends on where your files are. If you're using one mechanical hard-disk, then no - don't read files in parallel, it's going to hurt performance. You may have other configurations, though:
On a single SSD, reading files in parallel will probably not hurt performance, but I don't expect you'll gain anything either.
On two mirrored disks using RAID 1 and a half-decent RAID controller, you can read two files at once and gain considerable performance.
If your files are stored on a SAN, you can most definitely read a few at a time and improve performance.
You'll have to try it, but you have to be careful with this - if the files aren't large enough, the OS caching mechanisms are going to affect your measurements, and the second test run is going to be really fast.
I have a quad core machine and would like to write some code to parse a text file that takes advantage of all four cores. The text file basically contains one record per line.
Multithreading isn't my forte so I'm wondering if anyone could give me some patterns that I might be able to use to parse the file in an optimal manner.
My first thoughts are to read all the lines into some sort of queue and then spin up threads to pull the lines off the queue and process them, but that means the queue would have to exist in memory and these are fairly large files so I'm not so keen on that idea.
My next thoughts are to have some sort of controller that will read in a line and assign it a thread to parse, but I'm not sure if the controller will end up being a bottleneck if the threads are processing the lines faster than it can read and assign them.
I know there's probably another simpler solution than both of these but at the moment I'm just not seeing it.
I'd go with your original idea. If you are concerned that the queue might get too large, implement a buffer zone for it (i.e. if it gets above 100 lines, stop reading the file, and if it gets below 20, start reading again; you'd need to do some testing to find the optimal barriers). Make it so that any of the threads can potentially be the "reader thread": since a thread has to lock the queue to pull an item out anyway, it can also check whether the "low buffer" mark has been hit and start reading again. While it's doing this, the other threads can work through the rest of the queue.
Or if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this so I don't know how hard it is.
Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine its start point (and stop point). Each thread can then work entirely independently.
The only reason to add a special thread to handle reading is if you expect some lines to take a very long time to process and you expect that these lines are clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.
This will eliminate bottlenecks of having a single thread do the reading:
open file
for each thread n=0,1,2,3:
seek to file offset n/4 * filesize
scan to next complete line
process all lines in your part of the file
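The pseudocode above could be fleshed out like this (a sketch: "records.txt" is a hypothetical input, and the byte counting that decides when a worker's chunk ends assumes single-byte characters, so treat the stop condition as approximate):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class ChunkedLineParser
{
    static void ProcessLine(string line) { /* parse one record */ }

    static void Main()
    {
        const string path = "records.txt"; // hypothetical input
        const int workers = 4;
        long fileSize = new FileInfo(path).Length;

        Parallel.For(0, workers, n =>
        {
            // Each worker owns one quarter of the file by byte offset.
            long start = n * fileSize / workers;
            long length = (n + 1) * fileSize / workers - start;

            using (var stream = File.OpenRead(path))
            {
                stream.Seek(start, SeekOrigin.Begin);
                using (var sr = new StreamReader(stream))
                {
                    // Skip the partial line at the start of this chunk;
                    // the previous worker finishes it by reading past
                    // its own boundary.
                    if (n > 0) sr.ReadLine();

                    long consumed = 0; // approximate: assumes 1 byte per char
                    string line;
                    while (consumed <= length && (line = sr.ReadLine()) != null)
                    {
                        ProcessLine(line);
                        consumed += line.Length + Environment.NewLine.Length;
                    }
                }
            }
        });
    }
}
```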
My experience is with Java, not C#, so apologies if these solutions don't apply.
The immediate solution I can think up off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job at the executor (using ExecutorService.submit). The executor will queue requests for you, and allocate between the 3 threads.
Probably better solutions exist, but hopefully that will do the job. :-)
ETA: Sounds a lot like Wolfbyte's second solution. :-)
ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!
Since the bottleneck will generally be in the processing and not the reading when dealing with files I'd go with the producer-consumer pattern. To avoid locking I'd look at lock free lists. Since you are using C# you can take a look at Julian Bucknall's Lock-Free List code.
#lomaxx
#Derek & Mark: I wish there was a way to accept 2 answers. I'm going to end up going with Wolfbyte's solution, because if I split the file into n sections there is the potential for a thread to come across a batch of "slow" transactions. However, if I were processing a file where each record was guaranteed to require an equal amount of processing, then I really like your solution of just splitting the file into chunks, assigning each chunk to a thread, and being done with it.
No worries. If clustered "slow" transactions are an issue, then the queuing solution is the way to go. Depending on how fast or slow the average transaction is, you might also want to look at assigning multiple lines at a time to each worker. This will cut down on synchronization overhead. Likewise, you might need to optimize your buffer size. Of course, both of these are optimizations that you should probably only do after profiling. (No point in worrying about synchronization if it's not a bottleneck.)
If the text that you are parsing is made up of repeated strings and tokens, break the file into chunks and for each chunk you could have one thread pre-parse it into tokens consisting of keywords, "punctuation", ID strings, and values. String compares and lookups can be quite expensive and passing this off to several worker threads can speed up the purely logical / semantic part of the code if it doesn't have to do the string lookups and comparisons.
The pre-parsed data chunks (where you have already done all the string comparisons and "tokenized" it) can then be passed to the part of the code that would actually look at the semantics and ordering of the tokenized data.
Also, you mention you are concerned with the size of your file occupying a large amount of memory. There are a couple things you could do to cut back on your memory budget.
Split the file into chunks and parse them. Read in only as many chunks as you are working on at a time, plus a few for "read ahead", so you do not stall on disk when you finish processing one chunk before moving to the next.
Alternatively, large files can be memory-mapped and loaded on demand. If you have more threads working on processing the file than CPUs (usually threads = 1.5-2x the number of CPUs is a good number for demand-paging apps), the threads that stall on I/O for the memory-mapped file will be blocked automatically by the OS until their memory is ready, and the other threads will continue to process.