Multiprocessor and Performance

I'm facing a really strange problem with a .Net service.
I developed a multithreaded x64 windows service.
I tested this service in a x64 server with 8 cores. The performance was great!
Now I have moved the service to a production server (x64, 32 cores). During testing I found that performance is at least 10 times worse than on the test server.
I've checked lots of performance counters looking for the reason for this poor performance, but I couldn't pinpoint a cause.
Could it be a GC problem? Have you ever faced a problem like this?
Thank you in advance!
Alexandre

This is a common problem which people are generally unaware of, because very few people have experience on many-CPU machines.
The basic problem is contention.
As the CPU count increases, contention increases in all shared data structures. For low CPU counts, contention is low and the fact you have multiple CPUs improves performance. As the CPU count becomes significantly larger, contention begins to drown out your performance improvements; as the CPU count becomes large, contention actually starts reducing performance below that of a lower number of CPUs.
You are basically facing one of the aspects of the scalability problem.
I'm not sure, however, where this problem lies: in your data structures or in the operating system's data structures. The former you can address - lock-free data structures are an excellent, highly scalable approach. The latter is difficult, since it essentially requires avoiding certain OS functionality.
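To make the contention point concrete, here is a minimal sketch (not the asker's code; the class and method names are made up for illustration) that contrasts a sum where every iteration takes one shared lock with a sum where each worker accumulates privately and touches shared state only once at the end. It assumes the Task Parallel Library from .NET 4.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration: two ways to sum an array in parallel.
public static class ContentionSketch
{
    // Every iteration takes the same lock, so all cores queue up on it.
    public static long SumWithSharedLock(int[] data)
    {
        long total = 0;
        object gate = new object();
        Parallel.For(0, data.Length, i =>
        {
            lock (gate) { total += data[i]; }
        });
        return total;
    }

    // Each worker accumulates into its own local value; shared state is
    // touched only once per worker, when the partial sums are merged.
    public static long SumWithThreadLocals(int[] data)
    {
        long total = 0;
        Parallel.For<long>(0, data.Length,
            () => 0L,                                     // per-worker initial value
            (i, state, local) => local + data[i],         // no shared writes here
            local => Interlocked.Add(ref total, local));  // one merge per worker
        return total;
    }
}
```

Both methods return the same result; the difference is how often the cores have to fight over the same memory. On a machine with many cores, the first version tends to get slower as cores are added, which is the effect described above.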

There are way too many variables to know why one machine is slower than the other. 32-core machines are usually more specialized, whereas an eight-core machine could simply be a dual-processor, quad-core box. Are there VMs or other things running at the same time? Usually with that many cores, I/O bandwidth becomes the limiting factor (even if the CPUs still have plenty of headroom).
To start off, you should probably add lots of timers to your code (or use a profiler) to figure out which part of your code takes the most time.
Performance troubleshooting 101: what is the bottleneck (where in the code, and in which subsystem: memory, disk, or CPU)?
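As a rough sketch of the "add timers" suggestion, something like the helper below can be wrapped around each stage of the service. StageTimer and the stage names are hypothetical; only Stopwatch is a real framework class.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for coarse-grained timing of one stage of the work.
public static class StageTimer
{
    // Runs the stage, prints how long it took, and returns the elapsed time.
    public static TimeSpan Measure(string stageName, Action stage)
    {
        Stopwatch watch = Stopwatch.StartNew();
        stage();
        watch.Stop();
        Console.WriteLine("{0}: {1:F1} ms", stageName, watch.Elapsed.TotalMilliseconds);
        return watch.Elapsed;
    }
}
```

Usage would look like StageTimer.Measure("parse", () => ParseBatch(batch)), where ParseBatch stands for whatever your own code does. Comparing the per-stage numbers on the 8-core and the 32-core box should show which stage degrades.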

There are so many factors here:
are you actually using the cores?
are your extra threads causing locking issues to be more obvious?
do you not have enough memory to support all the extra stacks / data you can process?
can your IO (disk/network/database) stack keep up with the throughput?
etc

Could it be down to differences in memory or the disk? If either were the bottleneck, you'd get no benefit from the additional processing power. Can't really tell without more details of your application/configuration.

With that many threads running concurrently, you're going to have to be really careful to get around issues of threads fighting with each other to access your data. Read up on Non-blocking synchronization.

How many threads are you using? Using too many thread pool threads could cause thread starvation, which would make your program slower.
Some articles:
http://www2.sys-con.com/ITSG/virtualcd/Dotnet/archives/0112/gomez/index.html
http://codesith.blogspot.com/2007/03/thread-starvation-in-shared-thread-pool.html
(search for thread starvation in them)
You could use a .NET profiler to find your bottlenecks; here is a good free one:
http://www.eqatec.com/tools/profiler

I agree with Blank, it's likely to be some form of contention. It's likely to be very hard to track down, unfortunately. It could be in your application code, the framework, the OS, or some combination thereof. Your application code is the most likely culprit, since Microsoft has expended significant effort on making the CLR and the OS scale on 32P boxes.
The contention could be in some hot locks, but it could be that some processor cache lines are sloshing back and forth between CPUs.
What's your metric for 10x worse? Throughput?
Have you tried booting the 32-proc box with fewer CPUs? Use the /NUMPROC option in boot.ini or BCDedit.
Do you achieve 100% CPU utilization? What's your context switch rate like? And how does this compare to the 8P box?
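If rebooting the production box with /NUMPROC is not an option, a similar experiment can be approximated from inside the process by restricting its affinity to the first few logical CPUs. This is a different technique from the boot option (the OS still sees all 32 CPUs), and the class below is only a sketch with made-up names. It assumes a 64-bit process and at most 63 cores in the mask; machines with more than 64 logical processors use processor groups, which this does not handle.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: run the current process on the first N logical CPUs only.
public static class CoreLimiter
{
    // Builds a bit mask with the lowest coreCount bits set (one bit per logical CPU).
    public static long MaskForFirstCores(int coreCount)
    {
        if (coreCount < 1 || coreCount > 63)
            throw new ArgumentOutOfRangeException("coreCount");
        return (1L << coreCount) - 1;
    }

    // Restricts the current process to the first coreCount logical CPUs.
    public static void LimitCurrentProcess(int coreCount)
    {
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)MaskForFirstCores(coreCount);
    }
}
```

Calling CoreLimiter.LimitCurrentProcess(8) at startup and re-running the test would show whether the service behaves like it did on the 8-core machine.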

Related

Benchmarking RAM performance - UWP and C#

I'm developing a benchmarking application using Universal Windows Platform that evaluates CPU and RAM performance of a Windows 10 system.
Although I found various algorithms for benchmarking a CPU, I still haven't found any solid algorithm or solution for evaluating the read and write speeds of memory.
How can I achieve this in C#?
Thanks in advance :)

I don't see why this would not be possible from managed code. Array access code turns into normal x86 memory instructions. It's a thin abstraction. In particular I don't see why you would need a customized OS.
You should be able to test sequential memory speed by performing memcpy on big arrays. They must be bigger than the last level cache size.
You can test random access by randomly indexing into a big array. The index calculation must be cheap, unpredictable and there must be a dependency chain that serializes the memory instructions so that the CPU cannot parallelize them.
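A minimal sketch of both measurements follows. Buffer.BlockCopy stands in for memcpy, and the random-access part uses a single random cycle through the array (built with Sattolo's algorithm) so that every load depends on the previous one. The class name, method names and sizes are my own choices, not an established benchmark.

```csharp
using System;
using System.Diagnostics;

// Hypothetical memory benchmark helpers.
public static class MemoryBench
{
    // Sequential bandwidth: copy one large array into another and report GB/s.
    public static double CopyGigabytesPerSecond(int byteCount)
    {
        byte[] source = new byte[byteCount];
        byte[] target = new byte[byteCount];
        Buffer.BlockCopy(source, 0, target, 0, byteCount); // warm-up pass touches every page
        Stopwatch watch = Stopwatch.StartNew();
        Buffer.BlockCopy(source, 0, target, 0, byteCount);
        watch.Stop();
        return byteCount / watch.Elapsed.TotalSeconds / 1e9;
    }

    // Builds one random cycle over all slots (Sattolo's algorithm), so that
    // following next[i] repeatedly visits every element once per lap.
    public static int[] BuildRandomCycle(int length, int seed)
    {
        int[] next = new int[length];
        for (int i = 0; i < length; i++) next[i] = i;
        Random random = new Random(seed);
        for (int i = length - 1; i > 0; i--)
        {
            int j = random.Next(i); // 0 <= j < i
            int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }
        return next;
    }

    // Random-access latency: each load depends on the previous one, so the CPU
    // cannot overlap them. Returns the average nanoseconds per access.
    public static double ChaseNanosecondsPerAccess(int[] next, int steps, out int finalIndex)
    {
        int index = 0;
        Stopwatch watch = Stopwatch.StartNew();
        for (int s = 0; s < steps; s++) index = next[index];
        watch.Stop();
        finalIndex = index; // handed back so the loop has an observable result
        return watch.Elapsed.TotalMilliseconds * 1e6 / steps;
    }
}
```

For meaningful numbers the arrays need to be much larger than the last-level cache (for example, 256 MB for the copy and tens of millions of ints for the chase), and the measurement should be repeated several times.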
Honestly I don't think it's possible. RAM benchmarks usually run on dedicated OSes.
RAM testing is different from RAM benchmarking.
C# doesn't give you that kind of control over RAM
Of course, just new up a big array and access it. Also, understand the overheads that are present. The only overhead is a range check.
The GC has no impact during the benchmark. It might be triggered by an allocation.

Parallel code bad scalability

Recently I've been analyzing how my parallel computations actually speed up on a 16-core processor. And the general formula that I arrived at - the more threads you have, the less speed per core you get - is embarrassing me. Here are the diagrams of my CPU load and processing speed:
So, you can see that the processor load increases, but the speed increases much more slowly. I want to know why this effect takes place and how to find the cause of the unscalable behaviour.
I've made sure to use Server GC mode.
I've made sure that I'm parallelizing appropriate code, since the code does nothing more than:
Load data from RAM (the server has 96 GB of RAM, so the swap file shouldn't be hit)
Perform simple calculations
Store data in RAM
I've profiled my application carefully and found no bottlenecks - looks like each operation becomes slower as thread number grows.
I'm stuck, what's wrong with my scenario?
I use .Net 4 Task Parallel Library.

You will always get this kind of curve, it's called Amdahl's law.
The question is how soon it will level off.
You say you checked your code for bottlenecks, let's assume that's correct. Then there is still the memory bandwidth and other hardware factors.
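For reference, Amdahl's law says that if a fraction p of the work can run in parallel on n cores, the speedup is at most 1 / ((1 - p) + p / n). A small sketch (the class name is made up):

```csharp
using System;

// Hypothetical helper that evaluates Amdahl's law.
public static class Amdahl
{
    // Upper bound on the speedup when parallelFraction of the work
    // (between 0 and 1) can be spread over the given number of cores.
    public static double Speedup(double parallelFraction, int cores)
    {
        if (parallelFraction < 0.0 || parallelFraction > 1.0)
            throw new ArgumentOutOfRangeException("parallelFraction");
        if (cores < 1)
            throw new ArgumentOutOfRangeException("cores");
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / cores);
    }
}
```

For example, if 95% of the work is parallelizable, Amdahl.Speedup(0.95, 16) is about 9.14 rather than 16, and no number of cores can push it past 20.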

The key to linear scalability - in the sense that going from one to two cores doubles the throughput - is to use shared resources as little as possible. This means:
don't use hyperthreading (because the two threads share the same core resource)
tie every thread to a specific core (otherwise the OS will juggle the threads between cores)
don't use more threads than there are cores (the OS will swap them in and out)
stay inside the core's own caches - nowadays the L1 & L2 caches
don't venture into the L3 cache or RAM unless it is absolutely necessary
minimize/economize on critical section/synchronization usage
If you've come this far you've probably profiled and hand-tuned your code too.
Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.
Don't worry about the OS scheduler. If your application is CPU-bound with long computations that mostly does local L1 & L2 memory accesses it's a better performance bet to tie each thread to its own core. Sure the OS will come in but compared to the work being performed by your threads the OS work is negligible.
Also I should say that my threading experience is mostly from Windows NT-engine machines.
_______EDIT_______
Not all memory accesses have to do with data reads and writes (see comment above). An often overlooked memory access is that of fetching code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code generation department), OO and interpreted code is a lot less WYSIWYG than perhaps C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).

A general decrease in return with more threads could indicate some kind of bottleneck.
Are there ANY shared resources, like a collection or a queue, or are you using external functions that might depend on some limited resource?
The sharp break at 8 threads is interesting, and in my comment I asked whether the CPU is a true 16-core or an 8-core with hyperthreading, where each core appears as 2 cores to the OS.
If it is hyperthreading, either you have so much work that hyperthreading cannot double the performance of the core, or the memory pipe to the core cannot handle twice the data throughput.
Is the work performed by the threads evenly distributed, or are some threads doing more than others? That could also indicate resource starvation.
Since you added that the threads query for data very often, there is a very large risk of waiting.
Is there any way to let the threads get more data each time? Like reading 10 items instead of one?

If you are doing memory intensive stuff, you could be hitting cache capacity.
You could maybe test this with a mock algorithm which just processes the same small bit of data over and over, so that it should all fit in cache.
If it indeed is cache, possible solutions could be making the threads work on same data somehow (like different parts of small data window), or just tweaking the algorithm to be more local (like in sorting, merge sort is generally slower than quick sort, but it is more cache friendly which still makes it better in some cases).

Are your threads reading and writing to items close together in memory? Then you're probably running into false sharing. Say thread1 works with data[1] and thread2 works with data[2]. In an ideal world, two consecutive reads of data[2] by thread2 would always hit its cache. In the actual world, data[1] and data[2] usually sit on the same cache line, so if thread1 updates data[1] between those two reads, thread2's cached copy of that line is invalidated and has to be fetched again. See http://msdn.microsoft.com/en-us/magazine/cc872851.aspx for details. To solve it, make sure the data each thread is working with is adequately far away in memory from the data the other threads are working with.
That could give you a performance boost, but likely won't get you to 16x—there are lots of things going on under the hood and you'll just have to knock them out one-by-one. And really it's not that your algorithm is running at 30% speed when multithreaded; it's more that your single-threaded algorithm is running at 300% speed, enabled by all sorts of CPU and caching awesomeness that running multithreaded has a harder time taking advantage of. So there's nothing to be "embarrassed" about. But with some diligence, you can perhaps get the multithreaded version working at nearly 300% speed as well.
Also, if you're counting hyperthreaded cores as real cores, well, they're not. They only allow threads to swap really fast when one is blocked. But they'll never let you run at double speed unless your threads are getting blocked half the time anyway, in which case that already means you have opportunity for speedup.
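A small sketch of the "far away in memory" fix: each worker gets its own counter in a shared long[] array, and the stride parameter controls how far apart the counters are. With a stride of 1 the counters are adjacent; with a stride of 8 they are 64 bytes apart, which matches a common cache-line size (that size is an assumption about the hardware, and the class name is made up).

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical demonstration of spacing per-thread data apart.
public static class FalseSharingSketch
{
    // Each worker increments only its own slot. stride is the distance
    // between slots, measured in longs (8 bytes each).
    public static long[] Count(int workers, int incrementsPerWorker, int stride)
    {
        long[] slots = new long[workers * stride];
        Parallel.For(0, workers, w =>
        {
            int slot = w * stride;
            for (int n = 0; n < incrementsPerWorker; n++) slots[slot]++;
        });

        // Collect the per-worker results into a compact array.
        long[] result = new long[workers];
        for (int w = 0; w < workers; w++) result[w] = slots[w * stride];
        return result;
    }
}
```

Timing Count(16, 100000000, 1) against Count(16, 100000000, 8) with a Stopwatch is one way to see whether false sharing matters on a given machine; the results are identical, and how large the timing gap is depends on the hardware and on what the JIT does with the inner loop.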

how much cpu should a single thread application use?

I have a single thread console application.
I am confused with the concept of CPU usage. Should a good single thread application use ~100% of cpu usage (since it is available) or it should not use lots of cpu usage (since it can cause the computer to slow down)?
I have done some research but haven't found an answer to my confusion. I am a student and still learning so any feedback will be appreciated. Thanks.

It depends on what the program needs the CPU for. If it has to do a lot of work, it's common to use all of one core for some period of time. If it spends most of its time waiting for input, it will naturally tend to use the CPU less frequently. I say "less frequently" instead of "less" because:
Single threaded programs are, at any given time, either running, or they're not, so they are always using either 100% or 0% of one CPU core. Programs that appear to be only using 50% or 30% or whatever are actually just balancing periods of computational work with periods of waiting for input. Devices like hard drives are very slow compared to the CPU, so a program that's reading a lot of data from disk will use less CPU resources than one that crunches lots of numbers.
It's normal for a program to use 100% of the CPU sometimes, often even for a long time, but it's not polite to use it if you don't need it (i.e. busylooping). Such behavior crowds out other programs that could be using the CPU.
The same goes with the hard drive. People forget that the hard drive is a finite resource too, mostly because the task manager doesn't have a hard drive usage by percentage. It's difficult to gauge hard drive usage as a percentage of the total since disk accesses don't have a fixed speed, unlike the processor. However, it takes much longer to move 1GB of data on disk than it does to use the CPU to move 1GB of data in memory, and the performance impacts of HDD hogging are as bad or worse than those of CPU hogging (they tend to slow your system to a crawl without looking like any CPU usage is going on. You have probably seen it before)
Chances are that any small academic programs you write at first will use all of one core for a short period of time, and then wait. Simple stuff like prompting for a number at the command prompt is the waiting part, and doing whatever operation ad academia on it afterwards is the active part.

It depends on what it's doing. Different types of operations have different needs.
There is no non-subjective way to answer this question that applies across the board.
The only answer that's true is "it should use only the amount of CPU necessary to do the job, and no more."
In other words, optimize as much as you can and as is reasonable. In general, the lower the CPU the better, the faster it will perform, and the less it will crash, and the less it will annoy your users.

Typically an algorithmically heavy task such as predicting weather will have to be managed by the OS, because it will need all of the CPU for as long as it is allowed to run (until it's done).
On the other hand, a graphical application with a static user interface, like a windows forms application for storing a bit of data for record-keeping should require very low cpu usage, since it's mainly waiting for the user to do something.

Increase performance in Long Operations

I have a file encryption program. When the program is encrypting files, it doesn't exceed 25% CPU usage, hence it is slow.
How can I make the OS assign to it more CPU load? (Such as WinRAR, when it compresses files, it reaches 100% from CPU load).
[Edit]: As my cores are 4, it doesn't use more than one core. How can I make it use the rest of cores?

Unless you are otherwise throttling the application it will use as much CPU as the OS allows it to - which should be up to 100% by default. I would guess that some other resource is the bottleneck.
Are you streaming the data to encrypt from a remote location? From a disk that is for some reason quite slow?

If your tool is a single-threaded program, then it only uses one core! It will reach 100% on that core if the program only runs a loop. If the tool must do I/O, it will never reach full CPU usage. The 25% you see is averaged over all CPU cores: one fully busy core out of four. As I remember, there are some posts that show how to display the usage of each CPU core.

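Just in case: if you are using .NET 4.0, rather than trying to assign more CPU load, try the Parallel Extensions (PFX). They are optimized for multi-core processors.
Parallel.Invoke(() => DoCompress());
Note that Parallel.Invoke with a single action runs only that one action; to use several cores, the work has to be split into several actions (for example, one per file).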
Also, Threading in C# is the best threading related resource in the universe.
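As a sketch of splitting the work, the helper below applies a transform to many independent blocks in parallel. TransformAll and the transform delegate are hypothetical; the delegate stands in for your own encryption routine. It assumes the blocks really are independent, which depends on how the encryption is done (a chained mode such as CBC cannot simply be split within one file, but separate files can still be processed in parallel).

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: process independent blocks on all available cores.
public static class ParallelBlocks
{
    // Applies transform to every block. Results keep the input order
    // because each iteration writes only to its own slot.
    public static byte[][] TransformAll(byte[][] blocks, Func<byte[], byte[]> transform)
    {
        byte[][] results = new byte[blocks.Length][];
        Parallel.For(0, blocks.Length, i =>
        {
            results[i] = transform(blocks[i]);
        });
        return results;
    }
}
```

If the disk is the bottleneck, as other answers suggest it might be, this will not help much; it only pays off when the CPU part dominates.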

Sometimes people think a high CPU percent means an efficient program.
If that were so, an infinite loop would be the most efficient of all.
When you have a program that basically processes files off a mechanical hard drive, ideally it should be IO bound, because reading the file simply has to be done.
i.e. The CPU part should be efficient enough that it takes a low percent of time compared to moving the file off disk.
Anything you can do to reduce CPU time will reduce the CPU percent, because I/O takes a larger percentage of the total, and vice-versa.
If you can go back-and-forth between the two, reducing first CPU (program tuning), then I/O (ex. solid-state drive), you can make it really fly.
Then, if the CPU part is still taking longer than you would like, by all means, farm it out over multiple cores.

This has nothing to do with the processor and assigning resources: the tool you use is simply not designed to use all (I guess) 4 cores of the cpu.

Will Multi threading increase the speed of the calculation on Single Processor

On a single processor, will multi-threading increase the speed of a calculation? As we all know, multi-threading is used to improve user responsiveness, which is achieved by separating the UI thread from the calculation thread. But let's talk only about console applications. Will multi-threading increase the speed of the calculation? Do we get the calculation result faster when we calculate with multiple threads?
What about multiple cores: will multi-threading increase the speed or not?
Please help me. If you have any material to learn more about threading, please post it.
Edit:
I have been asked a question: at any given time, only one thread is allowed to run on a single core. If so, why do people use multithreading in a console application?
Thanks in advance,
Harsha

In general terms, no it won't speed up anything.
Presumably the same work overall is being done, but now there is the overhead of additional threads and context switches.
On a single processor with HyperThreading (two virtual processors) then the answer becomes "maybe".
Finally, even though there is only one CPU perhaps some of the threads can be pushed to the GPU or other hardware? This is kinda getting away from the "single processor" scenario but could technically be way of achieving a speed increase from multithreading on a single core PC.
Edit: your question now mentions multithreaded apps on a multicore machine.
Again, in very general terms, this will provide an overall speed increase to your calculation.
However, the increase (or lack thereof) will depend on how parallelizable the algorithm is, the contention for memory and cache, and the skill of the programmer when it comes to writing parallel code without locking or starvation issues.

Few threads on 1 CPU:
may increase performance in case you continue with another thread instead of waiting for I/O bound operation
may decrease performance if, let's say, there are too many threads and work is wasted on context switching
Few threads on N CPUs:
may increase performance if you are able to cut job in independent chunks and process them in independent manner
may decrease performance if you rely heavily on communication between threads and bus becomes a bottleneck.
So actually it's very task specific - you can parallelize some things very easily, while it's almost impossible for others. Perhaps it's a bit advanced for a newcomer, but there are 2 great resources on this topic in the C# world:
Joe Duffy's web log
PFX team blog - they have a very good set of articles for parallel programming in .NET world including patterns and practices.

What is your calculation doing? You won't be able to speed it up by using multithreading if it is processor bound, but if for some reason your calculation writes to disk or waits for some other sort of I/O, you may be able to improve performance using threading. However, when you say "calculation" I assume you mean some sort of processor-intensive algorithm, so adding threads is unlikely to help, and could even slow you down, as the context switches between threads add extra work.

If the task is compute bound, threading will not make it faster unless the calculation can be split in multiple independent parts. Even so you will only be able to achieve any performance gains if you have multiple cores available. From the background in your question it will just add overhead.
However, you may still want to run any complex and long running calculations on a separate thread in order to keep the application responsive.

No, no and no.
Unless you write parallelizing code to take advantage of multicores, it will always be slower if you have no other blocking functions.

Exactly like the user input example, one thread might be waiting for a disk operation to complete, and other threads can take that CPU time.
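A tiny sketch of that effect: Thread.Sleep stands in for a blocking disk read (an assumption made purely for illustration), and several threads wait at the same time. Because a waiting thread uses no CPU, the waits overlap even on a single core.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical demonstration that blocked threads overlap their waits.
public static class OverlappedWaits
{
    // Starts threadCount threads that each block for waitMs milliseconds,
    // waits for all of them, and returns the total elapsed time.
    public static TimeSpan RunWaits(int threadCount, int waitMs)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() => Thread.Sleep(waitMs)); // simulated blocking I/O
            threads[i].Start();
        }
        foreach (Thread thread in threads) thread.Join();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

With four threads each waiting 200 ms, the total should come out close to 200 ms rather than 800 ms. A pure computation offers no such gaps to fill, which is why the same trick does not speed up CPU-bound work on one core.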

As described in the other answers, multi-threading on a single core won't give you any extra performance (hyperthreading notwithstanding). However, if your machine sports an Nvidia GPU you should be able to use the CUDA to push calculations to the GPU. See http://www.hoopoe-cloud.com/Solutions/CUDA.NET/Default.aspx and C#: Perform Operations on GPU, not CPU (Calculate Pi).

The answers above cover most of it.
Running multiple threads on one processor can increase performance if you manage to get more work done at the same time, instead of letting the processor wait between different operations. However, it can also cause a severe loss of performance, for example due to synchronization, or because the processor is overloaded and can't keep up with the demand.
As for multiple cores, threading can improve performance significantly. However, much depends on finding the hotspots and not overdoing it. Using threads everywhere, with the synchronization that requires, can even lower performance. Optimizing with threads on multiple cores takes a lot of preliminary study and planning to get a good result. You need, for example, to think about how many threads to use in different situations. You do not want threads to sit and wait for information used by another thread.
http://www.intel.com/intelpress/samples/mcp_samplech01.pdf
https://computing.llnl.gov/tutorials/parallel_comp/
https://computing.llnl.gov/tutorials/pthreads/
http://en.wikipedia.org/wiki/Superscalar
http://en.wikipedia.org/wiki/Simultaneous_multithreading

I have been doing some intensive C++ mathematical simulation runs using 24 core servers. If I run 24 separate simulations in parallel on the 24 cores of a single server, then I get a runtime for each of my simulations of say X seconds.
The bizarre thing I have noticed is that, when running only 12 simulations, using 12 of the 24 cores, with the other 12 cores idle, then each of the simulations runs at a runtime of Y seconds, where Y is much greater than X! When viewing the task manager graph of the processor usage, it is obvious that a process does not stick to only one core, but alternates between a number of cores. That is to say, the switching between cores to use all the cores slows down the calculation process.
The way I maintained the runtime when running only 12 simulations, is to run another 12 "junk" simulations on the side, using the remaining 12 cores!
Conclusion: When using multi-cores, use them all at 100%, for lower utilisation, the runtime increases!

For a single-core CPU:
The performance actually depends on the job you are referring to.
In your case, for a calculation done by the CPU, overclocking would help if your motherboard supports it. Otherwise there is no way for the CPU to do calculations faster than its own speed.
For a multi-core CPU:
As the answers above say, if the program is properly designed, performance may increase when all cores are fully used.
On a single-core CPU, if the threads are implemented at user level, then multithreading won't help when there are blocking system calls in a thread, like an I/O operation, because the kernel doesn't know about the user-level threads.
So if the process does I/O, you can implement the threads in kernel space and then use different threads for different jobs.
(The answer here is on theory based.)

Even a CPU-bound task might run faster multi-threaded if it is properly designed to take advantage of cache memory and the pipelining done by the processor. Modern processors spend a lot of time twiddling their thumbs, even when nominally fully "busy".
Imagine a process that uses a small chunk of memory very intensively. Processing the same chunk of memory 1000 times would be much faster than processing 1000 chunks of similar memory.
You could certainly design a multi threaded program that would be faster than a single thread.

Threads don't increase performance. Threads sacrifice performance in favor of keeping parts of the code responsive.
The only exception is if you are doing a computation that is so parallelizeable that you can run different threads on different cores (which is the exception, not the rule).
