ParallelQuery core balancing - C#

I have 200,000 tasks that run in parallel for speed gains. I'm using ParallelEnumerable.Range(0, 200000).Sum(a => /*do_something*/).
As the task counter a goes from 0 to 200,000, the required number of iterations decreases: the task with a=0 requires the most iterations, while tasks with a>100,000 finish after one iteration or none.
Because of this, my quad-core machine doesn't stay at peak CPU utilization as the tasks progress. It seems the workload is distributed to all 4 cores up front, and some cores go idle early because their portion consisted mainly of tasks with high values of a. CPU utilization starts at 100% but drops gradually through 75%, 50%, 25%. How can I achieve full CPU utilization from start to end?

Very simple self-answer: replacing ParallelEnumerable.Range(0, 200000) with Enumerable.Range(...).AsParallel() was enough to solve the problem. It seems the workload is distributed dynamically with Enumerable.Range(...).AsParallel().
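For illustration, a minimal sketch of the two variants (the DoSomething workload below is made up to mimic the skew; the usual explanation is that ParallelEnumerable.Range uses static range partitioning, while Enumerable.Range(...).AsParallel() falls back to chunk partitioning that hands out work on demand):

using System;
using System.Linq;

class PartitioningDemo
{
    static void Main()
    {
        // Range-partitioned up front: each core receives one contiguous
        // block, so the block holding the small 'a' values dominates.
        long staticSum = ParallelEnumerable.Range(0, 200000).Sum(a => DoSomething(a));

        // Chunk-partitioned: workers pull small chunks as they finish,
        // keeping all cores busy despite the skewed workload.
        long dynamicSum = Enumerable.Range(0, 200000).AsParallel().Sum(a => DoSomething(a));

        Console.WriteLine($"{staticSum} {dynamicSum}");
    }

    // Stand-in for the real work: iteration count shrinks as 'a' grows.
    static long DoSomething(int a)
    {
        long acc = 0;
        for (int i = 0; i < Math.Max(1, 200000 - a); i++) acc += i;
        return acc;
    }
}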

Related

Limit CPU usage in conjunction with Parallel.ForEach

What I need:
I need to run an array of methods in parallel with a maximum of 1 thread per core. Each of these threads needs a throttling feature so the CPU usage of all threads can be limited: if I specify 10% on a 4-core machine running 3 threads (because I chose 3 as the maximum), none of the 3 threads may take more than 10% CPU per core. Think of how an application like Steam lets you limit the download speed, except imagine multiple internet connections standing in for the multiple cores.
What I tried:
I tried using Task, but I couldn't figure out how to manage the maximum number of threads it creates, much less which core each one uses.
What I ended up with:
I found that Parallel.ForEach combined with ParallelOptions.MaxDegreeOfParallelism gives me exactly the control I want over the number of threads. I know how many cores I have, and I can easily let the user pick how many to use. My tests suggest that Parallel.ForEach hands work to each core in sequence, and once every core has received something it loops back and assigns again, which is perfect. If I set a parallelism of 3 on a 4-core machine I get 3 threads on 3 different cores, while a parallelism of 5 on 4 cores gives 3 cores 1 thread each and 1 core 2 threads. That behavior is exactly what I need. This method is not set in stone; I am open to changes.
CPU limitation
Using Parallel.ForEach makes the CPU work 100% of the time, which is a big problem in this tight environment: any usage over 40-50% sustained over a long period across many computers raises the power draw enough to trigger a critical alarm in the building, and we need to stay below that. So far the only workaround I have found is a Thread.Sleep in the tight loop, but that still runs the CPU at 100% for a couple of seconds, then near 0% for a longer period, over and over. It did not work: I ran a small test in one of the rooms with barely 150 computers, and half of them spiked to 100% at nearly the same time, drawing WAY too much power. My second test was to take only the machines with 8 or more cores and run on 1 core only, and that reduced the usage by a lot.
Priority
I have also tested process priority, and it's no better. Yes, the process receives far less CPU time, but the CPU still runs at 100% whenever the process does get scheduled. Unless I have missed a critical option, priority cannot limit CPU usage.
Finally
I have the feeling I am very close with Parallel.ForEach and am just missing the CPU limitation per core. Am I missing just one thing? I know that using Thread.Sleep is a blunt instrument, since it limits when the CPU is used and not how much.
For those who want code
Again, this is a prototype, so the code is pretty much useless. All it does is run code on many cores until it finishes:
static void Main(string[] args)
{
    var items = Enumerable.Range(0, int.MaxValue - 100);
    Parallel.ForEach(
        items,
        new ParallelOptions { MaxDegreeOfParallelism = 4 }, // at most 4 worker threads
        value => { Calculate(value); }
    );
}

private static void Calculate(int value)
{
    System.Threading.Thread.SpinWait(100000); // busy-wait to simulate CPU-bound work
    var test = value;
    test *= -1;
    Console.WriteLine(test);
}
You can use Windows Job Objects to limit the CPU usage of an entire process (or process tree). You can limit your whole process, or start a child worker process and limit that.
This requires P/Invoke (which is not too hard here) and does not allow you to control specific threads (which does not seem needed here).
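A minimal P/Invoke sketch of that approach, hard-capping the current process to a given percentage of total CPU (CPU rate control requires Windows 8 / Server 2012 or later; error handling omitted):

using System;
using System.Runtime.InteropServices;

static class CpuRateLimiter
{
    const int JobObjectCpuRateControlInformation = 15;
    const uint JOB_OBJECT_CPU_RATE_CONTROL_ENABLE = 0x1;
    const uint JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP = 0x4;

    [StructLayout(LayoutKind.Sequential)]
    struct JOBOBJECT_CPU_RATE_CONTROL_INFORMATION
    {
        public uint ControlFlags;
        public uint CpuRate; // percentage of total CPU times 100 (1000 = 10%)
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateJobObject(IntPtr lpJobAttributes, string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool SetInformationJobObject(IntPtr hJob, int jobObjectInfoClass,
        ref JOBOBJECT_CPU_RATE_CONTROL_INFORMATION lpJobObjectInfo, int cbJobObjectInfoLength);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool AssignProcessToJobObject(IntPtr hJob, IntPtr hProcess);

    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentProcess();

    // Caps the current process (and any children) at 'percent' of total CPU.
    public static void LimitCurrentProcess(uint percent)
    {
        IntPtr job = CreateJobObject(IntPtr.Zero, null);
        var info = new JOBOBJECT_CPU_RATE_CONTROL_INFORMATION
        {
            ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE
                         | JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP,
            CpuRate = percent * 100
        };
        SetInformationJobObject(job, JobObjectCpuRateControlInformation,
            ref info, Marshal.SizeOf<JOBOBJECT_CPU_RATE_CONTROL_INFORMATION>());
        AssignProcessToJobObject(job, GetCurrentProcess());
    }
}

Calling CpuRateLimiter.LimitCurrentProcess(10) before starting the Parallel.ForEach should keep the whole process at roughly 10% of total CPU, regardless of thread count.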
Alternatively, you need to do your own throttling (since you stated that merely controlling the number of threads does not work for you). I would base it on a PauseToken (for example, the PauseTokenSource from Nito.AsyncEx). An easy way to keep the workers paused 75% of the time would be this:
var pts = new PauseTokenSource(); // e.g. Nito.AsyncEx; PauseToken itself has no Pause()/Resume()
// Hand pts.Token to all threads that consume a lot of CPU; inside their
// tight loops they should periodically await pts.Token.WaitWhilePausedAsync().
Task.Run(async () => {
    while (true) {
        pts.IsPaused = false;
        await Task.Delay(250);  // run for 25% of each second
        pts.IsPaused = true;
        await Task.Delay(750);  // pause for 75% of each second
    }
});
You could even adapt the throttling percentage dynamically: read the power usage of the device each second; if it is above a certain threshold, increase the pause percentage by 10%, and if it is below, decrease it by 5%. That way you can very easily target a specific wattage.
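A sketch of that feedback loop, extending the Task.Run loop above (ReadPowerUsageWatts is a hypothetical sensor helper, not a real API, and the target wattage is an assumption):

Task.Run(async () => {
    double pausedFraction = 0.75;   // start by pausing 75% of each second
    const double targetWatts = 400; // assumed target, tune for the building
    while (true) {
        pts.IsPaused = false;
        await Task.Delay((int)((1 - pausedFraction) * 1000)); // run
        pts.IsPaused = true;
        await Task.Delay((int)(pausedFraction * 1000));       // pause

        double watts = ReadPowerUsageWatts(); // hypothetical sensor read
        if (watts > targetWatts)
            pausedFraction = Math.Min(0.95, pausedFraction + 0.10);
        else
            pausedFraction = Math.Max(0.05, pausedFraction - 0.05);
    }
});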

Why would a fully CPU bound process work better with hyperthreading?

Given:
a fully CPU-bound, very large (i.e., more than a few CPU cycles) job, and
a CPU with 4 physical and a total of 8 logical cores,
is it possible that 8, 16, and 28 threads perform better than 4 threads? My understanding is that 4 threads would have fewer context switches to perform and less overhead in every sense than 8, 16, or 28 threads would have on a 4-physical-core machine. However, the timings are:
Threads    Time Taken (in seconds)
      4    78.82
      8    48.58
     16    51.35
     28    52.10
The code used to get the timings is given in the Original Question section below. The CPU specifications are also given at the bottom.
After reading the answers that various users have provided and the information given in the comments, I was finally able to boil the question down to what I wrote above. If the question above gives you the complete context, you can skip the original question below.
Original Question
What does it mean when we say the following?
Hyper-threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a hyper-threading processor to appear as the usual "physical" processor and an extra "logical" processor to the host operating system.
This question was asked on SO today, and it basically tests the performance of multiple threads doing the same work. It has the following code:
private static void Main(string[] args)
{
    int threadCount;
    if (args == null || args.Length < 1 || !int.TryParse(args[0], out threadCount))
        threadCount = Environment.ProcessorCount;
    int load;
    if (args == null || args.Length < 2 || !int.TryParse(args[1], out load))
        load = 1;
    Console.WriteLine("ThreadCount:{0} Load:{1}", threadCount, load);
    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < threadCount; i++)
    {
        int i1 = i;
        threads.Add(new Thread(() => DoWork(i1, threadCount, load)));
    }
    var timer = Stopwatch.StartNew();
    foreach (var thread in threads) thread.Start();
    foreach (var thread in threads) thread.Join();
    timer.Stop();
    Console.WriteLine("Time:{0} seconds", timer.ElapsedMilliseconds / 1000.0);
}

static void DoWork(int seed, int threadCount, int load)
{
    var mtx = new double[3, 3];
    for (var i = 0; i < ((10000000 * load) / threadCount); i++)
    {
        mtx = new double[3, 3];
        for (int k = 0; k < 3; k++)
            for (int l = 0; l < 3; l++)
                mtx[k, l] = Math.Sin(i + (k * 3) + l + seed); // the original had an undefined 'j' here; 'i' was meant
    }
}
(I have cut out a few braces to bring the code in a single page for quick readability.)
I ran this code on my machine to replicate the issue. My machine has 4 physical cores and 8 logical ones. The method DoWork() in the code above is completely CPU-bound. I felt that hyper-threading could contribute maybe a 30% speedup (because here we have as many CPU-bound threads as physical cores, i.e., 4). But it attains nearly a 64% performance gain. When I ran this code with 4 threads it took about 82 seconds, and with 8, 16, and 28 threads it ran in about 50 seconds in all cases.
To summarize the timings:
Threads    Time Taken (in seconds)
      4    78.82
      8    48.58
     16    51.35
     28    52.10
I could see that CPU usage was ~50% with 4 threads. Shouldn't it be ~100%? After all, my processor has only 4 physical cores. And the CPU usage was ~100% for 8 and 16 threads.
If somebody can explain the quoted text at the start, I hope to understand hyper-threading better and in turn get the answer to: why would a fully CPU-bound process work better with hyper-threading?
For the sake of completeness:
I have an Intel Core i7-4770 CPU @ 3.40 GHz, 3401 MHz, 4 core(s), 8 logical processor(s).
I ran the code in Release mode.
I know that the way the timings are measured is bad; it only gives the time of the slowest thread. I took the code as-is from the other question. Still, what is the justification for 50% CPU usage when running 4 CPU-bound threads on a 4-physical-core machine?
CPU pipeline
Each instruction has to go through several steps in the pipeline to be fully executed. At the very least, it must be decoded, sent to an execution unit, and actually executed there. There are several execution units on modern CPUs, and they can execute instructions fully in parallel. Note that the execution units are not interchangeable: some operations can only be done on a single execution unit. For example, memory loads are usually specialized to one or two units, memory stores are exclusively sent to another unit, and all the calculations are done by some other units.
Knowing about the pipeline, we may wonder: how can the CPU work so fast if we write purely sequential code and each instruction has to go through so many pipeline stages? Here is the answer: the processor executes instructions out of order. It has a large reorder buffer (e.g., for 200 instructions), and it pushes many instructions through its pipeline in parallel. If at any moment some instruction cannot be executed for any reason (it waits for data from slow memory, depends on another instruction not yet finished, whatever), then it is delayed for some cycles. During this time the processor executes newer instructions, located after the delayed instruction in our code, provided they do not depend on the delayed instruction in any way.
Now we can see the problem of latency. Even if an instruction is decoded and all of its inputs are already available, it takes several cycles to execute completely. This delay is called instruction latency. However, we know that at this moment the processor can execute many other independent instructions, if there are any.
If an instruction loads data from the L2 cache, it has to wait about 10 cycles for the data. If the data is located only in RAM, it takes hundreds of cycles to load it into the processor. In this case we say the instruction has high latency. For maximum performance, it is important to have some other independent operations to execute in the meantime. This is sometimes called latency hiding.
At the very end, we have to admit that most real code is sequential in nature. It has some independent instructions to execute in parallel, but not too many. Having no instructions to execute causes pipeline bubbles, which leads to inefficient use of the processor's transistors. On the other hand, the instructions of two different threads are almost always independent of each other. This leads us directly to the idea of hyper-threading.
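To make latency hiding concrete, here is a small illustration (mine, not the answerer's): two loops doing the same total number of floating-point additions, where the second splits the work into two independent dependency chains that the out-of-order core can overlap. Exact numbers depend on the JIT and the hardware.

using System;
using System.Diagnostics;

class IlpDemo
{
    const int N = 1_000_000_000;

    // One dependency chain: every add must wait for the previous one,
    // so throughput is bounded by the latency of a single floating-point add.
    static double SumSingleChain()
    {
        double sum = 0;
        for (int i = 0; i < N; i++) sum += 1.0;
        return sum;
    }

    // Two independent chains: the CPU can keep both adds in flight
    // at once, hiding part of the latency.
    static double SumTwoChains()
    {
        double a = 0, b = 0;
        for (int i = 0; i < N; i += 2) { a += 1.0; b += 1.0; }
        return a + b;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        SumSingleChain();
        Console.WriteLine($"single chain: {sw.ElapsedMilliseconds} ms");
        sw.Restart();
        SumTwoChains();
        Console.WriteLine($"two chains:   {sw.ElapsedMilliseconds} ms");
    }
}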
P.S. You might want to read Agner Fog's manuals to better understand the internals of modern CPUs.
Hyper-threading
When two threads execute in hyper-threading mode on a single core, the processor can interleave their instructions, filling bubbles left by the first thread with instructions from the second. This allows better utilization of the processor's resources, especially in ordinary programs. Note that HT may help not only when you have a lot of memory accesses, but also in heavily sequential code. A well-optimized computational code may fully utilize all resources of the CPU, in which case you will see no profit from HT (e.g., the dgemm routine from a well-optimized BLAS).
P.S. You might want to read Intel's detailed explanation of hyper-threading, including information about which resources are duplicated or shared, and a discussion of performance.
Context switches
The context is the internal state of the CPU, which at least includes all the registers. When the executing thread changes, the OS has to do a context switch (detailed description here). According to this answer, a context switch takes about 10 microseconds, while the scheduler's time quantum is 10 milliseconds or more (see here). So context switches do not affect total time much, because they happen rarely enough. Note that competition for CPU caches between threads can increase the effective cost of switches in some cases.
However, in the case of hyper-threading, each core has two states internally: two sets of registers, shared caches, and one set of execution units. As a result, the OS has no need to do any context switches when you run 8 threads on 4 physical cores. When you run 16 threads on a quad-core, the context switches are performed, but they take a small part of the overall time, as explained above.
Process manager
As for the CPU utilization you see in the process manager: it does not measure the internals of the CPU pipeline. Windows can only notice when a thread returns execution to the OS in order to sleep, wait for a mutex, wait for the HDD, or do other slow things. As a result, it considers a core fully used if there is a thread working on it that does not sleep or wait for anything. For instance, you can check that running an endless loop while (true) {} leads to full utilization of a core.
I could see that CPU usage was ~50% with 4 threads. Shouldn't it be ~100%?
No, it shouldn't.
what is the justification for 50% CPU usage when running 4 CPU bound threads on a 4 physical core machine?
This is simply how CPU utilization is reported in Windows (and on at least some other OSes too, by the way). A HT CPU shows up as two cores to the operating system and is reported as such.
Thus, Windows sees an eight-core machine when you have four HT cores. You'll see eight different CPU graphs if you look at the "Performance" tab in Task Manager, and the total CPU utilization is computed with 100% being the full utilization of those eight logical cores.
If you are only using four threads, then those threads cannot fully utilize the available CPU resources, and that explains the timings. They can use at most four of the eight available cores, so of course your utilization maxes out at 50%. Once you go past the number of logical cores (8), runtime increases again; at that point you are adding scheduling overhead without adding any new computational resources.
By the way…
Hyper-threading has improved quite a lot since the old days of shared caches and other limitations, but it will still never provide the same throughput benefit a full CPU could, as some contention remains within the CPU. So even ignoring OS overhead, your 35% improvement in speed seems pretty good to me. I often see no more than a 20% speed-up from adding the extra HT cores to a computationally bottlenecked process.
I can't explain the sheer size of the speed-up you observed: 100% seems way too much of an improvement for hyper-threading. But I can explain the principles in play.
The main benefit of hyper-threading comes when a processor has to switch between threads. Whenever there are more threads than there are CPU cores (true 99.9997% of the time) and the OS decides to switch to a different thread, it has to perform (most of) the following steps:
Save the state of the current thread: this includes the stack, the state of the registers, and the program counter. Where they get saved depends on the architecture, but generally speaking they'll get saved either in cache or in memory. Either way, this step takes time.
Put the thread into the "Ready" state (as opposed to the "Running" state).
Load the state of the next thread: again including the stack, the registers, and the program counter. Once again, this step takes time.
Flip the thread into the "Running" state.
In a normal (non-HT) CPU, the number of cores is the number of processing units. Each of these contains registers, a program counter (a register), a stack pointer (a register), (usually) an individual cache, and a complete set of execution resources. So if a normal CPU has 4 cores, it can run 4 threads simultaneously. When a thread is done (or the OS has decided that it's taking too much time and needs to wait its turn to start again), the CPU follows those four steps to unload the thread and load the new one before execution of the new one can begin.
In a hyper-threading CPU, on the other hand, the above holds true, but in addition each core has a duplicated set of registers, program counter, stack pointer, and (sometimes) cache. What this means is that a 4-core CPU can still only have 4 threads running simultaneously, but the CPU can have threads "preloaded" onto the duplicated registers. So 4 threads are running, but 8 threads are loaded onto the CPU: 4 active, 4 inactive. Then, when it's time for the CPU to switch threads, instead of performing the load/unload at the moment the threads need to switch, it simply "toggles" which thread is active and performs the unloading/loading in the background on the newly "inactive" registers. Remember the two steps I said take time? In a hyper-threaded system, steps 2 and 4 are the only ones that need to be performed in real time, whereas steps 1 and 3 are performed in the background in hardware (divorced from any concept of threads, processes, or CPU cores).
Now, this process doesn't completely speed up multithreaded software, but in an environment where threads often have extremely small workloads that they perform very frequently, the sheer number of thread switches can get expensive. Even in environments that don't fit that paradigm, hyper-threading can bring benefits.
Let me know if you need any clarification. It's been a few years since CS250, so I may be mixing up terminology here or there; let me know if I'm using the wrong terms for something. I'm 99.9997% certain that everything I'm describing is accurate in terms of the logic of how it all works.
Hyper-threading works by interleaving instructions in the processor's execution pipeline. While the processor is performing read/write operations on one 'thread', it is performing logical evaluation on the other 'thread', keeping them separate and giving you a perceived doubling of performance.
The reason you get such a big speedup is that there is no branching logic in your DoWork method. It is all one big loop with a very predictable execution sequence.
A processor's execution pipeline has to go through several clock cycles to execute a single calculation. The processor attempts to optimize performance by pre-loading the execution buffer with the next few instructions. If a loaded instruction is actually a conditional jump (such as an if statement), this is bad news, because the processor has to flush the entire pipeline and fetch instructions from a different part of memory.
You may find that if you put if statements in your DoWork method, you will not get a 100% speedup...
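To see the effect this answer describes, one can compare a data-dependent branch on unpredictable versus predictable data. This sketch (mine, not the answerer's) usually shows the sorted pass running noticeably faster, though the exact outcome depends on the JIT and the CPU:

using System;
using System.Diagnostics;

class BranchDemo
{
    static void Main()
    {
        var rng = new Random(42);
        int[] data = new int[20_000_000];
        for (int i = 0; i < data.Length; i++) data[i] = rng.Next(256);

        // Unpredictable branch: roughly 50% mispredictions, frequent pipeline flushes.
        Console.WriteLine($"random: {TimeSum(data)} ms");

        // Same data sorted: the branch becomes predictable and much cheaper.
        Array.Sort(data);
        Console.WriteLine($"sorted: {TimeSum(data)} ms");
    }

    static long TimeSum(int[] data)
    {
        var sw = Stopwatch.StartNew();
        long sum = 0;
        foreach (int v in data)
            if (v >= 128) sum += v; // data-dependent conditional jump
        sw.Stop();
        GC.KeepAlive(sum); // prevent the JIT from discarding the loop
        return sw.ElapsedMilliseconds;
    }
}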

CPU on IIS server pegged at 100% in BufferAllocator.GetBuffer()

I have an IIS7 application pool that handles a large number of RESTful requests. When things heat up, CPU usage hits 100% and requests take many seconds to process.
When profiling with ANTS, we found that very often most of the CPU time goes here:
System.Web.Hosting.PipelineRuntime.ProcessRequestNotification
System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper
(Unmanaged code)
System.Web.Hosting.PipelineRuntime.ProcessRequestNotification
System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper
System.Web.HttpRuntime.ProcessRequestNotificationPrivate
System.Web.HttpApplication.BeginProcessRequestNotification
System.Web.HttpApplication+PipelineStepManager.ResumeSteps
System.Web.HttpApplication.ExecuteStep
System.Web.HttpApplication+CallFilterExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute
System.Web.HttpResponse.FilterOutput
System.Web.HttpWriter.FilterIntegrated
System.Web.Hosting.IIS7WorkerRequest.GetBufferedResponseChunks
System.Web.Hosting.RecyclableArrayHelper.GetIntegerArray
>> System.Web.BufferAllocator.GetBuffer
(Thread blocked)
>> System.Web.BufferAllocator.ReuseBuffer
(Thread blocked)
There are actually several different stack traces, but they all invariably end in GetBuffer() or ReuseBuffer().
Both GetBuffer() and ReuseBuffer() begin with a lock(), so I figure the CPU spends a lot of time in a spinlock (my understanding is that a lock spins for a bit before putting the thread to sleep).
My question: is this a common place for the CPU to be spending its time? This is all entirely IIS code, so what can I do to reduce the CPU load? Is this a configuration issue, or is it the result of actions our application took earlier?
The machines are pretty beefy: they have 4 quad-core processors. I don't have the current thread count available.
It was pretty much what we suspected: the threads were all spending a stupid amount of time in the spinlock, since they all compete for the same lock. We had hundreds of threads.
The fix was to have more processes with fewer threads each; now the CPU usage is reasonable.
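For reference, one way to get "more processes with fewer threads" on IIS is a web garden, i.e., raising the worker-process count of the application pool (the pool name below is a placeholder, and a web garden is only appropriate if the app tolerates running in multiple processes, e.g., no in-process session state):

%windir%\system32\inetsrv\appcmd.exe set apppool "MyAppPool" /processModel.maxProcesses:4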

Why is the CPU usage of my ASP.NET app often about 25%?

I have an ASP.NET project with many reports. Some of the reports involve heavy calculation, which I do in memory using LINQ. When I test these reports on my client, CPU usage is about 25%.
My question is: why does the CPU usage not increase to 80% or more?
Will it behave the same way when I publish the project to the server?
You have 4 cores (or 2 hyper-threaded cores), meaning each single thread can take up to 25% of the total computing power (which is shown as 25% CPU in the Task Manager).
Your calculation is probably single-threaded.
Can you break your calculation into several threads? That would spread the load across the cores of your CPU more evenly.
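As a sketch of that suggestion (the Row type and the aggregation are made-up stand-ins for the real report, and whether the real query parallelizes safely depends on the code), PLINQ can spread a LINQ aggregation across cores with one AsParallel() call:

using System;
using System.Linq;

class ReportDemo
{
    record Row(string Category, decimal Amount); // stand-in for real report data

    static void Main()
    {
        var rows = Enumerable.Range(0, 1_000_000)
            .Select(i => new Row("cat" + (i % 10), i % 100))
            .ToArray();

        // Single-threaded LINQ: runs on one core (~25% CPU on 4 logical cores).
        var sequential = rows.GroupBy(r => r.Category)
                             .ToDictionary(g => g.Key, g => g.Sum(r => r.Amount));

        // PLINQ: the same aggregation spread across all cores.
        var parallel = rows.AsParallel()
                           .GroupBy(r => r.Category)
                           .ToDictionary(g => g.Key, g => g.Sum(r => r.Amount));

        Console.WriteLine(sequential.Count == parallel.Count);
    }
}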

MaxDegreeOfParallelism: deciding on the optimal value

Simple question.
How do you decide the optimal value of MaxDegreeOfParallelism for a given algorithm? What are the factors to consider, and what are the trade-offs?
It depends: if all your tasks are CPU-bound, the optimal value is probably equal to the number of CPUs in your machine. If your tasks are IO-bound, however, you can start to increase the number.
When the CPU has to switch from one thread to another (a context switch), there is a cost, so if you use too many threads and the CPU is switching all the time, you decrease performance. On the other hand, if you limit that parameter too much and the operations are long IO-bound operations, the CPU sits idle a lot of the time waiting for those tasks to complete, and you are not making the most of your machine's resources (which is what multithreading is about).
I think it depends on each algorithm, as Amdahl's Law points out; there is no master rule of thumb you can follow. You will have to plan it and tune it. :D
Cheers.
For local compute-intensive processes you should try two values:
the number of physical cores or processors;
the number of virtual cores (physical, including hyper-threading).
One of these is optimal in my experience. Sometimes hyper-threading slows things down; usually it helps. In C#, use Environment.ProcessorCount to find the number of cores including the hyper-threading "fake" cores. The actual number of physical cores is more difficult to determine; check other questions for that. A sketch of both rules follows below.
For processes that have to wait for resources (DB queries, file retrieval), scaling up can help. If a process spends 80% of its time waiting and is busy 20%, the rule of thumb is to start 5 threads, so that one thread is busy at any given time, maximizing the 5 x 20% each process requires locally.
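A small sketch of both heuristics (the wait fraction is an assumed measurement; for CPU-bound work the reliable approach is to benchmark both candidate values):

using System;
using System.Linq;
using System.Threading.Tasks;

class DopHeuristic
{
    // CPU-bound: candidates are the logical-core count (below) and the
    // physical-core count; benchmark both and keep the faster one.
    static int CpuBoundDop() => Environment.ProcessorCount;

    // IO-bound: if a task waits a fraction 'waitFraction' of the time,
    // roughly 1 / (1 - waitFraction) threads keep one core busy
    // (e.g., 80% waiting => 5 threads per core).
    static int IoBoundDop(double waitFraction) =>
        (int)Math.Ceiling(Environment.ProcessorCount / (1 - waitFraction));

    static void Main()
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = CpuBoundDop() };
        Parallel.ForEach(Enumerable.Range(0, 1000), options, i =>
        {
            // compute-heavy body goes here
        });
        Console.WriteLine($"Suggested DOP at 80% IO wait: {IoBoundDop(0.8)}");
    }
}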
