Parallel Programming with Threads - C#

Okay, I am a bit confused about what I should do and how. I know the theory of parallel programming and threading, but here is my case:
We have a number of log files in a given folder. We read these log files into a database. Reading these files usually takes a couple of hours, as we do it serially: we iterate through each file, open a SQL transaction for that file and insert the log into the database, then read the next file and do the same.
Now I am thinking of using parallel programming so I can use all cores of the CPU. However, I am still not clear: if I use a thread for each file, will that make any difference to the system? I mean, if I create, say, 30 threads, will they run on a single core or will they run in parallel? How can I make use of all the cores, if that is not already happening?
EDIT: I am using a single server with a 10K-RPM HDD, a 4-core CPU and 4 GB RAM; there are no network operations, and SQL Server is on the same machine, with Windows 2008 as the OS [I can change the OS if that helps too :)].
EDIT 2: I ran some tests based on your feedback. Here is what I found on my i3 quad-core CPU with 4 GB RAM:
CPU stays at 24-50% overall: CPU1 and CPU2 remain under 50% usage, CPU3 stays around 75%, and CPU4 remains around 0%. Yes, I have Visual Studio, an email client and a lot of other applications open, but this tells me the application is not using all cores, as CPU4 stays at 0%.
RAM stays constantly at 74% [it was around 50% before the test]; that is how we designed the read, so nothing to worry about there.
HDD read/write usage remains under 25%, spiking to 25% in a sine-wave pattern, as our SQL transactions are first stored in memory and only written to disk when memory reaches a threshold. So again, nothing to worry about there.
So all resources are under-utilized here, and hence I think I can distribute the work to make it efficient. Your thoughts again. Thanks.

First of all, you need to understand your code and why it is slow. If you're thinking something like “my code is slow and uses one CPU, so I'll just make it use all 4 CPUs and it will be 4 times faster”, then you're most likely wrong.
Using multiple threads makes sense if:
Your code (or at least a part of it) is CPU bound. That is, it's not slowed down by your disk, your network connection or your database server, it's slowed down by your CPU.
Or your code has multiple parts, each using a different resource. E.g. one part reads from a disk, another part converts the data, which requires lots of CPU, and the last part writes the data to a remote database. (Parallelizing this doesn't actually require multiple threads, but it's usually the simplest way to do it.)
From your description, it sounds like you could be in situation #2. A good solution for that is the producer consumer pattern: Stage 1 thread reads the data from the disk and puts it into a queue. Stage 2 thread takes the data from the queue, processes them and puts them into another queue. Stage 3 thread takes the processed data from the second queue and saves them to the database.
In .Net 4.0, you would use BlockingCollection<T> for the queue between the threads. And when I say “thread”, I pretty much mean Task. In .Net 4.5, you could use blocks from TPL Dataflow instead of the threads.
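A minimal sketch of that pipeline, assuming hypothetical ParseLine and InsertIntoDatabase helpers standing in for the real per-file logic (the folder path and queue bounds are made up too):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class LogPipeline
{
    static void Main()
    {
        var rawLines = new BlockingCollection<string>(10000);
        var parsed = new BlockingCollection<LogEntry>(10000);

        // Stage 1: read the log files from disk into the first queue.
        var reader = Task.Factory.StartNew(() =>
        {
            foreach (var file in Directory.GetFiles(@"C:\logs"))   // hypothetical folder
                foreach (var line in File.ReadLines(file))
                    rawLines.Add(line);
            rawLines.CompleteAdding();
        });

        // Stage 2: CPU-bound parsing; duplicate this stage if it turns out to be the bottleneck.
        var parser = Task.Factory.StartNew(() =>
        {
            foreach (var line in rawLines.GetConsumingEnumerable())
                parsed.Add(ParseLine(line));
            parsed.CompleteAdding();
        });

        // Stage 3: drain the second queue into the database.
        var writer = Task.Factory.StartNew(() =>
        {
            foreach (var entry in parsed.GetConsumingEnumerable())
                InsertIntoDatabase(entry);   // stand-in for the existing SQL insert
        });

        Task.WaitAll(reader, parser, writer);
    }

    static LogEntry ParseLine(string line) { return new LogEntry(); } // stand-in
    static void InsertIntoDatabase(LogEntry entry) { }                // stand-in
}

class LogEntry { }

The bounded capacities keep a fast stage from racing ahead and filling memory while a slower stage catches up.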
If you do it this way then you can get up to three times faster execution (if each stage takes the same time). If Stage 2 is the slowest part, then you can get another speedup by using more than one thread for that stage (since it's CPU bound). The same could also apply to Stage 3, depending on your network connection and your database.

There is no definite answer to this question and you'll have to test, because as mentioned in my comments:
if the bottleneck is disk I/O, then you won't gain much by adding more threads, and you might even worsen performance because more threads will be fighting for access to the disk;
if you think disk I/O is fine but CPU load is the issue, then you can add some threads, but no more than the number of cores, because here again things will worsen due to context switching;
if you can do more disk and network I/O and CPU load is not high (very likely), then you can oversubscribe with (far) more threads than cores: typically, if your threads spend much of their time waiting for the database.
So you should profile first, and then (or directly, if you're in a hurry) test different configurations; chances are you'll be in the third case. :)

First, you should check what is taking the time. If the CPU actually is the bottleneck, parallel processing will help. Maybe it's the network, and a faster network connection will help. Maybe buying a faster disk will help.
Find the problem before thinking about a solution.

Your problem is not using all of the CPU; your actions are mainly I/O (reading files, sending data to the DB).
Using threads/Parallel will make your code run faster, since you are processing many files at the same time.
To answer your question: the framework/OS will optimize running your code over the different cores.

It varies from machine to machine, but generally speaking, if you have a dual-core processor and two threads, the operating system will pass one thread to one core and the other thread to the other core. It doesn't matter how many cores you use; what matters is whether your approach is the fastest. If you want to make use of parallel programming, you need a way of sharing the workload that logically makes sense. You also need to consider where your bottleneck actually occurs. Depending on the size of the files, it may simply be the maximum read/write speed of your storage medium that is taking so long. As a test, I suggest you log where the most time in your code is being consumed.
A simple way to test whether a non-serial approach will help you is to sort your files in some order, divide the workload between two threads doing the same job simultaneously, and see if it makes a difference. If a second thread doesn't help you, then I guarantee 30 threads will only make it take longer, due to the OS having to switch threads back and forth.
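A rough sketch of that test, assuming a hypothetical ImportFile standing in for the existing per-file read-and-insert logic:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;

class TwoThreadTest
{
    static void Main()
    {
        var files = Directory.GetFiles(@"C:\logs").OrderBy(f => f).ToArray(); // hypothetical folder
        int half = files.Length / 2;

        var sw = Stopwatch.StartNew();
        var t1 = new Thread(() => Import(files.Take(half)));
        var t2 = new Thread(() => Import(files.Skip(half)));
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        Console.WriteLine("Two threads took " + sw.Elapsed);
    }

    static void Import(IEnumerable<string> files)
    {
        foreach (var file in files)
            ImportFile(file);   // stand-in for the existing per-file read + SQL insert
    }

    static void ImportFile(string path) { }
}

Compare the elapsed time against the serial run; if it barely moves, the disk, not the CPU, is the limit.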

Using the latest constructs in .Net 4 for parallel programming, threads are generally managed for you... take a read of Getting Started with Parallel Programming.
(Pretty much the same as what has happened more recently with async versions of functions, which you use if you want async behaviour.)
e.g.
for (int i = 2; i < 20; i++)
{
    var result = SumRootN(i);
    Console.WriteLine("root {0} : {1} ", i, result);
}
becomes
Parallel.For(2, 20, (i) =>
{
    var result = SumRootN(i);
    Console.WriteLine("root {0} : {1} ", i, result);
});
EDIT: That said, it would be productive / faster to perhaps also put intensive tasks into separate threads... but manually making your application 'multi-core' and having certain threads run on particular cores isn't currently possible; that's all managed under the hood...
Have a look at PLINQ for example,
and the .Net Parallel Extensions,
and look into
System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)4; // bitmask binary 100: restrict the process to the third core
Edit2:
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.

Related

Parallel code bad scalability

Recently I've been analyzing how my parallel computations actually speed up on a 16-core processor. The general pattern I concluded - the more threads you have, the less speed per core you get - is embarrassing me. Here are the diagrams of my CPU load and processing speed:
So, you can see that processor load increases, but speed increases much more slowly. I want to know why this effect takes place and how to find the reason for the unscalable behaviour.
I've made sure to use Server GC mode.
I've made sure that I'm parallelizing appropriate code, since the code does nothing more than:
load data from RAM (the server has 96 GB of RAM, so the swap file shouldn't be hit)
perform some not-too-complex calculations
store data in RAM
I've profiled my application carefully and found no bottlenecks - it looks like each operation becomes slower as the thread count grows.
I'm stuck, what's wrong with my scenario?
I use .Net 4 Task Parallel Library.
You will always get this kind of curve; it's called Amdahl's law.
The question is how soon it will level off.
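For reference: if p is the fraction of the work that can run in parallel and N is the number of cores, Amdahl's law caps the speedup at S(N) = 1 / ((1 - p) + p/N). For example, with p = 0.9 and N = 16 the cap is 1 / (0.1 + 0.9/16) = 6.4x, far from the ideal 16x, and no number of cores can ever push it past 1 / (1 - p) = 10x.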
You say you checked your code for bottlenecks, let's assume that's correct. Then there is still the memory bandwidth and other hardware factors.
The key to linear scalability - in the context where going from one to two cores doubles the throughput - is to use shared resources as little as possible. This means:
don't use hyperthreading (because the two threads share the same core's resources)
tie every thread to a specific core (otherwise the OS will juggle the threads between cores)
don't use more threads than there are cores (the OS will swap them in and out)
stay inside the core's own caches - nowadays the L1 & L2 caches
don't venture into the L3 cache or RAM unless it is absolutely necessary
minimize/economize on critical section/synchronization usage
If you've come this far you've probably profiled and hand-tuned your code too.
Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
Don't worry about the OS scheduler. If your application is CPU-bound, with long computations that mostly do local L1 & L2 memory accesses, it's a better performance bet to tie each thread to its own core. Sure, the OS will come in, but compared to the work being performed by your threads, the OS work is negligible.
Also I should say that my threading experience is mostly from Windows NT-engine machines.
_______EDIT_______
Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code is a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).
A general decrease in return with more threads could indicate some kind of bottleneck.
Are there ANY shared resources, like a collection or queue, or are you using some external functions that might depend on some limited resource?
The sharp break at 8 threads is interesting, and in my comment I asked whether the CPU is a true 16-core or an 8-core with hyperthreading, where each core appears as 2 cores to the OS.
If it is hyperthreading, then either you have so much work that the hyperthreading cannot double the performance of the core, or the memory pipe to the core cannot handle twice the data throughput.
Is the work performed by the threads even, or are some threads doing more than others? That could also indicate resource starvation.
Since you added that the threads query for data very often, that indicates a very large risk of waiting.
Is there any way to let the threads get more data each time, like reading 10 items instead of one?
If you are doing memory-intensive stuff, you could be hitting cache capacity.
You could maybe test this with a mock algorithm which just processes the same small bit of data over and over, so it all fits in cache.
If it indeed is the cache, possible solutions could be making the threads work on the same data somehow (like different parts of a small data window), or just tweaking the algorithm to be more local (as in sorting, where merge sort is generally slower than quicksort, but it is more cache-friendly, which still makes it better in some cases).
Are your threads reading and writing to items close together in memory? Then you're probably running into false sharing. If thread1 works with data[1] and thread2 works with data[2], then even though in an ideal world we know that two consecutive reads of data[2] by thread2 will always produce the same result, in the actual world, if thread1 updates data[1] sometime between those two reads, then the CPU will mark the cache line as dirty and reload it. http://msdn.microsoft.com/en-us/magazine/cc872851.aspx. To solve it, make sure the data each thread works with is adequately far away in memory from the data the other threads work with.
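To make the effect concrete, here is a hypothetical sketch (the counter layout and iteration count are made up): four threads each increment their own counter, but in the packed layout all four counters sit on one cache line, so every increment invalidates that line in the other cores' caches; padding them apart removes the contention.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingDemo
{
    const int Iterations = 100000000;

    static void Main()
    {
        // Packed: the four counters are adjacent, i.e. on the same cache line.
        long[] packed = new long[4];
        Time("packed", () => Parallel.For(0, 4, t =>
        {
            for (int i = 0; i < Iterations; i++) packed[t]++;
        }));

        // Padded: 16 longs (128 bytes) between counters puts each on its own cache line.
        long[] padded = new long[4 * 16];
        Time("padded", () => Parallel.For(0, 4, t =>
        {
            for (int i = 0; i < Iterations; i++) padded[t * 16]++;
        }));
    }

    static void Time(string label, Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        Console.WriteLine(label + ": " + sw.ElapsedMilliseconds + " ms");
    }
}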
That could give you a performance boost, but likely won't get you to 16x; there are lots of things going on under the hood and you'll just have to knock them out one by one. And really it's not that your algorithm is running at 30% speed when multithreaded; it's more that your single-threaded algorithm is running at 300% speed, enabled by all sorts of CPU and caching awesomeness that running multithreaded has a harder time taking advantage of. So there's nothing to be "embarrassed" about. But with some diligence, you can perhaps get the multithreaded version working at nearly 300% speed as well.
Also, if you're counting hyperthreaded cores as real cores, well, they're not. They only allow threads to swap really fast when one is blocked. But they'll never let you run at double speed unless your threads are getting blocked half the time anyway, in which case that already means you have opportunity for speedup.

Multithreaded application does not reach 100% of processor usage

My multithreaded application takes some files from the HD and then processes the data in these files. I reuse the same instance of a class (dataProcessing) to create the threads (I just change the parameters of the method being called).
processingThread[i] = new Thread(new ThreadStart(dataProcessing.parseAll));
I am wondering if the cause could be all the threads reading from the same memory.
It takes about half a minute to process each file. The files are quickly read, since they are just 200 KB. After I process the files, I write all the results to a single destination file. I don't think the problem is reading from or writing to the disk. All the threads are working on the task, but for some reason the processor is not being fully used. I tried adding more threads to see if I could reach 100% processor usage, but at some point it slows down and decreases the processor usage instead of fully using it. Does anyone have an idea what could be wrong?
Here are some points you might want to consider:
Most CPUs today are hyperthreaded. Even though the OS assumes that each hyperthreaded core has 2 pipelines, this is not the case, and it is very dependent on the CPU and the arithmetic operations you are performing. While on most CPUs there are 2 integer units on each pipeline, there is only one FP unit, so most FP operations gain no benefit from the hyperthreaded architecture.
Since each file is only 200 KB, I can only assume that it is all copied to the cache, so this is not a memory/disk issue.
Are you using external DLLs? Some operations, like reading/saving JPEG files using the native Bitmap class, are not parallel, and you won't see any speed-up when doing multiple executions at once.
Performance decreases when you reach a point where switching between the threads costs more than the operations they are doing.
Are you only reading the data, or are you also modifying it? If each thread also modifies the data, then there are many locks on the cache. It would be better for each thread to gather its own data in its own memory and combine all the data only after all the threads have done their job; a sketch of that pattern follows.
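One way to sketch that pattern is Parallel.ForEach's thread-local state overload (Parse and Record below are hypothetical stand-ins):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class LocalGather
{
    static Record Parse(string line) { return new Record(); } // stand-in for real parsing

    static void Main()
    {
        IEnumerable<string> lines = new List<string> { "a", "b", "c" }; // assumed input

        var results = new ConcurrentBag<Record>();

        Parallel.ForEach(
            lines,
            () => new List<Record>(),                                  // one private list per thread
            (line, loopState, local) => { local.Add(Parse(line)); return local; },
            local => { foreach (var r in local) results.Add(r); });    // merge once per thread
    }
}

class Record { }

Each thread fills a private list with no locking; the only synchronized step is the final merge, which runs once per thread.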

Balance between Number of Threads and Web Requests

I have a program that executes multiple threads. Each thread simply executes an HttpWebRequest and then screen-scrapes the page looking for some text. I am in a race against other users to find this text. I could execute 1,000,000 threads, all looking for the same thing.
My thought is that that would put a lot of load on my processor and would actually cause the requests to execute more slowly. How can I find a balance between the number of threads and the performance of the web requests? Basically, what I want to do is find the optimal number of threads to spawn so that the amount of data they pull down is greatest.
The application uses .NET 4 and is written in C#.
You are right to assume that 1,000,000 threads will put undue pressure on your CPU. The work your CPU would have to do to manage and switch between that many threads would probably make your system very slow indeed.
Obviously you are not serious about 1,000,000 threads, but it demonstrates that you cannot simply throw more threads at the problem. You don't really want to write your own load balancer - that will not be easy and will not perform as well as the classes that come with the base class library. Have a look at using ThreadPool threads - the CLR will manage them for you. You can also look at the Task Parallel Library that is new in .NET 4.0 (since you mention that is what you are using).
Also check out this great article about multi-threading:
http://www.albahari.com/threading/
C# has a ThreadPool. Submit your web-scraping tasks to the pool. You can tweak the number of threads in the pool to tune your app - you will probably need to increase it well above the default for best performance with a requirement like yours.
Huge numbers of threads are wasteful, as posted by @M Babcock.
I'm not sure if the number of threads in a C# ThreadPool can be changed at run-time (I see no reason why not, but M$...). If it is tweakable during the run, tuning will be even easier!
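For what it's worth, the pool size can in fact be adjusted at run-time via ThreadPool.SetMinThreads and ThreadPool.SetMaxThreads; the numbers below are illustrative, not recommendations:
ThreadPool.SetMinThreads(50, 50);     // worker threads, I/O completion-port threads
ThreadPool.SetMaxThreads(200, 200);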
You need to use Parallel.ForEach to manage your threads properly...
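A hedged sketch of that (the URLs and the search string are placeholders; MaxDegreeOfParallelism is the knob to tune, since this work is network-bound rather than CPU-bound):

using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

class Scraper
{
    static void Main()
    {
        var urls = new List<string> { "http://example.com/a", "http://example.com/b" }; // placeholders

        var options = new ParallelOptions { MaxDegreeOfParallelism = 16 }; // worth tuning
        Parallel.ForEach(urls, options, url =>
        {
            using (var client = new WebClient())
            {
                string page = client.DownloadString(url);
                if (page.Contains("the text"))            // placeholder search string
                    Console.WriteLine("Found it at " + url);
            }
        });
    }
}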
You are asking a performance question without providing any estimates of your actual requirements... so let me try doing it for you.
How much data can you pull in? Assuming an awesome network and a regular network card: 100 Mb/s at max, probably less than 10 Mb/sec in practice. This gives fewer than 10,000 requests per second (assuming ~10K request/response pairs).
Can one thread handle that much data? Searching through 100 Mb a second should not be a problem even for a single thread. It is super easy to prototype/measure.
How many threads do I need to read the data? Likely 1 - starting an asynchronous request is fast, and reading the response or posting it to a queue for processing is fast, even at 10,000 items a second.
So my estimate: 1 thread for simple code, or (1 + one thread per core) if you have more cores and are willing to run the processing in parallel.

Increase performance in Long Operations

I have a file encryption program. When the program is encrypting files, it doesn't exceed 25% CPU usage, hence it is slow.
How can I make the OS assign more CPU to it? (WinRAR, for example, reaches 100% CPU load when compressing files.)
[Edit]: As I have 4 cores, it doesn't use more than one core. How can I make it use the rest of the cores?
Unless you are otherwise throttling the application it will use as much CPU as the OS allows it to - which should be up to 100% by default. I would guess that some other resource is the bottleneck.
Are you streaming the data to encrypt from a remote location? From a disk that is for some reason quite slow?
If your tool is a single-threaded program, then it only consumes one core! The usage will reach 100% on that core if your program does nothing but a pure computation loop; if the tool must do I/O, it will never reach maximum usage. The 25% you see is the percentage across all CPU cores. As I remember, there are posts showing how to display the usage percentage of each CPU core.
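Along those lines, a small sketch that prints per-core usage (using the standard Windows "Processor" performance counters; the sampling loop is illustrative, and the very first reading of each counter may come back as 0):

using System;
using System.Diagnostics;
using System.Threading;

class PerCoreCpu
{
    static void Main()
    {
        int cores = Environment.ProcessorCount;
        var counters = new PerformanceCounter[cores];
        for (int i = 0; i < cores; i++)
            counters[i] = new PerformanceCounter("Processor", "% Processor Time", i.ToString());

        for (int sample = 0; sample < 10; sample++)
        {
            Thread.Sleep(1000);   // counters need an interval between reads
            for (int i = 0; i < cores; i++)
                Console.Write("CPU" + i + ": " + counters[i].NextValue().ToString("F0") + "%  ");
            Console.WriteLine();
        }
    }
}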
Just in case: if you are using v4.0, rather than assigning more CPU load, try using the Parallel Framework (PFX). It is optimized for multi-core processors.
Parallel.Invoke(() => DoCompress());
Also, Threading in C# is the best threading related resource in the universe.
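If the tool encrypts many independent files, one option is to parallelize across the files. A hypothetical sketch (the folder, the file pattern and the key handling are all made up; real code must persist the key/IV, or nothing could ever be decrypted):

using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

class ParallelEncryptor
{
    static void Main()
    {
        string[] files = Directory.GetFiles(@"C:\data", "*.txt"); // hypothetical folder/pattern

        // One independent encryption job per file; Parallel.ForEach spreads them over the cores.
        Parallel.ForEach(files, file => EncryptFile(file, file + ".enc"));
    }

    static void EncryptFile(string inputPath, string outputPath)
    {
        using (var aes = Aes.Create())            // key/IV management omitted for brevity
        using (var input = File.OpenRead(inputPath))
        using (var output = File.Create(outputPath))
        using (var crypto = new CryptoStream(output, aes.CreateEncryptor(), CryptoStreamMode.Write))
        {
            input.CopyTo(crypto);
        }
    }
}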
Sometimes people think a high CPU percent means an efficient program.
If that were so, an infinite loop would be the most efficient of all.
When you have a program that basically processes files off a mechanical hard drive, ideally it should be IO bound, because reading the file simply has to be done.
i.e. The CPU part should be efficient enough that it takes a low percent of time compared to moving the file off disk.
Anything you can do to reduce CPU time will reduce the CPU percent, because I/O takes a larger percentage of the total, and vice-versa.
If you can go back and forth between the two, reducing first CPU (program tuning), then I/O (e.g. a solid-state drive), you can make it really fly.
Then, if the CPU part is still taking longer than you would like, by all means, farm it out over multiple cores.
This has nothing to do with the processor or assigning resources: the tool you are using is simply not designed to use all (I guess) 4 cores of the CPU.

Will multi-threading increase the speed of a calculation on a single processor

On a single processor, will multi-threading increase the speed of a calculation? As we all know, multi-threading is used to increase user responsiveness, achieved by separating the UI thread from the calculation thread. But let's talk only about console applications: will multi-threading increase the speed of the calculation? Do we get the calculation result faster when we calculate with multiple threads?
What about on multiple cores: will multi-threading increase the speed or not?
Please help me. If you have any material to learn more about threading, please post it.
Edit:
I have been asked a question: at any given time, only one thread is allowed to run on a single core. If so, why do people use multithreading in a console application?
Thanks in advance,
Harsha
In general terms, no it won't speed up anything.
Presumably the same work overall is being done, but now there is the overhead of additional threads and context switches.
On a single processor with HyperThreading (two virtual processors), the answer becomes "maybe".
Finally, even though there is only one CPU, perhaps some of the threads can be pushed to the GPU or other hardware? This is kind of getting away from the "single processor" scenario, but it could technically be a way of achieving a speed increase from multithreading on a single-core PC.
Edit: your question now mentions multithreaded apps on a multicore machine.
Again, in very general terms, this will provide an overall speed increase to your calculation.
However, the increase (or lack thereof) will depend on how parallelizable the algorithm is, the contention for memory and cache, and the skill of the programmer when it comes to writing parallel code without locking or starvation issues.
Few threads on 1 CPU:
may increase performance, in case you continue with another thread instead of waiting for an I/O-bound operation
may decrease performance if, say, there are too many threads and work is wasted on context switching
Few threads on N CPUs:
may increase performance, if you are able to cut the job into independent chunks and process them independently
may decrease performance, if you rely heavily on communication between threads and the bus becomes a bottleneck.
So it's really task-specific - some things are very easy to parallelize, while for others it's almost impossible. Perhaps it's a bit of advanced reading for a newcomer, but there are two great resources on this topic in the C# world:
Joe Duffy's web log
the PFX team blog - they have a very good set of articles on parallel programming in the .NET world, including patterns and practices.
What is your calculation doing? You won't be able to speed it up by using multithreading if it is processor-bound, but if for some reason your calculation writes to disk or waits for some other sort of I/O, you may be able to improve performance using threading. However, when you say "calculation" I assume you mean some sort of processor-intensive algorithm, so adding threads is unlikely to help, and could even slow you down, as the context switching between threads adds extra work.
If the task is compute-bound, threading will not make it faster unless the calculation can be split into multiple independent parts. Even so, you will only achieve performance gains if you have multiple cores available. From the background in your question, it will just add overhead.
However, you may still want to run any complex and long-running calculations on a separate thread in order to keep the application responsive.
No, no and no.
Unless you write parallelized code to take advantage of multiple cores, it will always be slower, assuming there are no blocking operations.
Exactly as in the user-input example: one thread might be waiting for a disk operation to complete while other threads take that CPU time.
As described in the other answers, multi-threading on a single core won't give you any extra performance (hyperthreading notwithstanding). However, if your machine sports an Nvidia GPU, you should be able to use CUDA to push calculations to the GPU. See http://www.hoopoe-cloud.com/Solutions/CUDA.NET/Default.aspx and C#: Perform Operations on GPU, not CPU (Calculate Pi).
The answers above cover most of it.
Running multiple threads on one processor can increase performance if you manage to get more work done at the same time, instead of letting the processor wait between different operations. However, it can also be a severe loss of performance, due to, for example, synchronization, or because the processor is overloaded and can't keep up with the requirements.
As for multiple cores: threading can improve performance significantly. However, much depends on finding the hotspots and not overdoing it. Using threads everywhere, with the synchronization that entails, can even lower performance. Optimizing with threads on multiple cores takes a lot of up-front study and planning to get a good result. You need, for example, to think about how many threads to use in different situations. You do not want threads to sit and wait for information used by another thread.
http://www.intel.com/intelpress/samples/mcp_samplech01.pdf
https://computing.llnl.gov/tutorials/parallel_comp/
https://computing.llnl.gov/tutorials/pthreads/
http://en.wikipedia.org/wiki/Superscalar
http://en.wikipedia.org/wiki/Simultaneous_multithreading
I have been doing some intensive C++ mathematical simulation runs using 24-core servers. If I run 24 separate simulations in parallel on the 24 cores of a single server, I get a runtime for each of my simulations of, say, X seconds.
The bizarre thing I have noticed is that when running only 12 simulations, using 12 of the 24 cores with the other 12 cores idle, each simulation runs in Y seconds, where Y is much greater than X! Viewing the task manager graph of processor usage makes it obvious that a process does not stick to one core, but alternates between a number of cores. That is to say, the switching between cores slows down the calculation.
The way I maintained the runtime when running only 12 simulations was to run another 12 "junk" simulations on the side, using the remaining 12 cores!
Conclusion: when using multiple cores, use them all at 100%; at lower utilisation the runtime increases!
For a single-core CPU:
Actually, the performance depends on the job you are referring to.
In your case, for a calculation done by the CPU, overclocking would help if your motherboard supports it. Otherwise there is no way for the CPU to do calculations faster than its clock speed allows.
For a multi-core CPU:
As the above answers say, if properly designed, performance may increase when all cores are fully used.
On a single-core CPU, if the threads are implemented at user level, then multithreading won't help if there are blocking system calls in a thread, like an I/O operation, because the kernel won't know about the user-level threads.
So if the process does I/O, then you can implement the threads in kernel space and use different threads for different jobs.
(This answer is theory-based.)
Even a CPU-bound task might run faster multi-threaded if it is properly designed to take advantage of the cache memory and the pipelining done by the processor. Modern processors spend a lot of time twiddling their thumbs, even when nominally fully "busy".
Imagine a process that uses a small chunk of memory very intensively. Processing the same chunk of memory 1000 times would be much faster than processing 1000 chunks of similar memory.
You could certainly design a multi-threaded program that would be faster than a single thread.
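A single-threaded illustration of that cache effect (sizes are arbitrary): both loops below do the same additions, but the row-order walk touches memory sequentially while the column-order walk jumps about 20 KB per access, so it typically runs several times slower.

using System;
using System.Diagnostics;

class CacheLocality
{
    static void Main()
    {
        const int n = 5000;
        var data = new int[n, n];   // ~100 MB

        var sw = Stopwatch.StartNew();
        long sum = 0;
        for (int i = 0; i < n; i++)        // row order: walks memory sequentially
            for (int j = 0; j < n; j++)
                sum += data[i, j];
        Console.WriteLine("Row order:    " + sw.ElapsedMilliseconds + " ms (" + sum + ")");

        sw.Restart();
        sum = 0;
        for (int j = 0; j < n; j++)        // column order: strided access, constant cache misses
            for (int i = 0; i < n; i++)
                sum += data[i, j];
        Console.WriteLine("Column order: " + sw.ElapsedMilliseconds + " ms (" + sum + ")");
    }
}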
Threads don't increase performance. Threads sacrifice performance in favor of keeping parts of the code responsive.
The only exception is if you are doing a computation that is so parallelizable that you can run different threads on different cores (which is the exception, not the rule).
