Given:
A fully CPU bound very large (i.e. more than a few CPU cycles) job, and
A CPU with 4 physical cores and 8 logical cores in total,
is it possible that 8, 16 and 28 threads perform better than 4 threads? My understanding is that 4 threads would have fewer context switches to perform and less overhead in every sense than 8, 16 or 28 threads would have on a machine with 4 physical cores. However, the timings are:
Threads   Time taken (seconds)
4         78.82
8         48.58
16        51.35
28        52.10
The code used to get the timings is given in the Original Question section below. The CPU specifications are also given at the bottom.
After reading the answers that various users have provided and information given in the comments, I am able to finally boil down the question to what I wrote above. If the question above gives you the complete context, you can skip the original question below.
Original Question
What does it mean when we say
Hyper-threading works by duplicating certain sections of the
processor—those that store the architectural state—but not duplicating
the main execution resources. This allows a hyper-threading processor
to appear as the usual "physical" processor and an extra "logical"
processor to the host operating system
?
This question was asked on SO today; it basically tests the performance of multiple threads doing the same work. It has the following code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

class Program
{
    private static void Main(string[] args)
    {
        // First argument: thread count (defaults to the number of logical processors).
        int threadCount;
        if (args == null || args.Length < 1 || !int.TryParse(args[0], out threadCount))
            threadCount = Environment.ProcessorCount;
        // Second argument: load multiplier (defaults to 1).
        int load;
        if (args == null || args.Length < 2 || !int.TryParse(args[1], out load))
            load = 1;
        Console.WriteLine("ThreadCount:{0} Load:{1}", threadCount, load);
        List<Thread> threads = new List<Thread>();
        for (int i = 0; i < threadCount; i++)
        {
            int i1 = i; // copy so that each lambda captures its own value
            threads.Add(new Thread(() => DoWork(i1, threadCount, load)));
        }
        var timer = Stopwatch.StartNew();
        foreach (var thread in threads) thread.Start();
        foreach (var thread in threads) thread.Join();
        timer.Stop();
        Console.WriteLine("Time:{0} seconds", timer.ElapsedMilliseconds/1000.0);
    }

    static void DoWork(int seed, int threadCount, int load)
    {
        var mtx = new double[3,3];
        for (var i = 0; i < ((10000000 * load)/threadCount); i++)
        {
            mtx = new double[3,3];
            for (int k = 0; k < 3; k++)
                for (int l = 0; l < 3; l++)
                    // The original listing used an undeclared 'j' here; 'i' is assumed.
                    mtx[k, l] = Math.Sin(i + (k*3) + l + seed);
        }
    }
}
(I have cut out a few braces to bring the code in a single page for quick readability.)
I ran this code on my machine to replicate the issue. My machine has 4 physical cores and 8 logical ones. The method DoWork() in the code above is completely CPU bound. I felt that hyper-threading could contribute maybe a 30% speedup (because here we already have as many CPU bound threads as physical cores, i.e. 4). But it attains nearly a 64% performance gain. When I ran this code with 4 threads, it took about 82 seconds, and when I ran it with 8, 16 and 28 threads, it took about 50 seconds in every case.
To summarize the timings:
Threads   Time taken (seconds)
4         78.82
8         48.58
16        51.35
28        52.10
I could see that CPU usage was ~50% with 4 threads. Shouldn't it be ~100%? After all my processor has only 4 physical cores. And the CPU usage was ~100% for 8 and 16 threads.
If somebody can explain the quoted text at the start, I hope to understand hyper-threading better and, in turn, to get an answer to: why would a fully CPU bound process work better with hyper-threading?
For the sake of completion,
I have an Intel Core i7-4770 CPU @ 3.40 GHz, 3401 MHz, 4 Core(s), 8 Logical Processor(s).
I ran the code in Release mode.
I know that the way timings are measured is bad. This will only give the time for slowest thread. I took the code as it is from the other question. However, what is the justification for 50% CPU usage when running 4 CPU bound threads on a 4 physical core machine?
CPU pipeline
Each instruction has to go through several steps in the pipeline to be fully executed. At the very least, it must be decoded, sent to execution unit, then actually executed there. There are several execution units on modern CPUs, and they can execute instructions completely in parallel. By the way, the execution units are not interchangeable: some operations can only be done on a single execution unit. For example, memory loads are usually specialized to one or two units, memory stores are exclusively sent to another unit, all the calculations are done by some other units.
Knowing about the pipeline, we may wonder: how can a CPU work so fast if we write purely sequential code and each instruction has to go through so many pipeline stages? Here is the answer: the processor executes instructions out of order. It has a large reorder buffer (e.g. for 200 instructions), and it pushes many instructions through its pipeline in parallel. If at any moment some instruction cannot be executed for any reason (it waits for data from slow memory, depends on another instruction that has not finished yet, whatever), then it is delayed for some cycles. During this time the processor executes newer instructions, which are located after the delayed instruction in our code, provided that they do not depend on the delayed instruction in any way.
Now we can see the problem of latency. Even if an instruction is decoded and all of its inputs are already available, it would take it several cycles to be executed completely. This delay is called instruction latency. However, we know that at this moment processor can execute many other independent instructions, if there are any.
If an instruction loads data from L2 cache, it has to wait about 10 cycles for the data to be loaded. If the data is located only in RAM, then it would take hundreds of cycles to load it to processor. In this case we can say that the instruction has high latency. It is important for maximum performance to have some other independent operations to execute at this moment. This is sometimes called latency hiding.
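The effect of dependency chains can be tried out from C#. What follows is only a minimal sketch: the class and method names (LatencyHidingDemo, SumOneChain, SumFourChains) are made up for illustration. Both methods add up the same numbers, but the first forms one long chain in which every addition waits for the previous one, while the second keeps four independent chains that an out-of-order core is free to overlap. Whether a difference shows up in practice depends on the JIT, the CPU and memory bandwidth, so no timings are claimed here.

```csharp
using System;
using System.Diagnostics;

public static class LatencyHidingDemo
{
    // One accumulator: each addition depends on the result of the previous one.
    public static double SumOneChain(double[] data)
    {
        double sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    // Four accumulators: four independent dependency chains.
    public static double SumFourChains(double[] data)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < data.Length; i += 4)
        {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        for (; i < data.Length; i++) // leftover elements
            s0 += data[i];
        return (s0 + s1) + (s2 + s3);
    }

    public static void Main()
    {
        var data = new double[10000000];
        for (int i = 0; i < data.Length; i++) data[i] = 1.0;

        var sw = Stopwatch.StartNew();
        double a = SumOneChain(data);
        Console.WriteLine("one chain:   {0} in {1} ms", a, sw.ElapsedMilliseconds);

        sw.Restart();
        double b = SumFourChains(data);
        Console.WriteLine("four chains: {0} in {1} ms", b, sw.ElapsedMilliseconds);
    }
}
```

Two hyper-threads on one core give the hardware a similar opportunity without any change to the code: the second thread's instructions form an extra, independent chain.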
In the end, we have to admit that most real code is sequential in nature. It has some independent instructions to execute in parallel, but not too many. Having no instructions to execute causes pipeline bubbles, which leads to inefficient use of the processor's transistors. On the other hand, instructions of two different threads are automatically independent in almost all cases. This leads us directly to the idea of hyper-threading.
P.S. You might want to read Agner Fog's manual to better understand internals of modern CPUs.
Hyper-threading
When two threads are executed in hyper-threading mode on a single core, the processor can interleave their instructions, filling bubbles from the first thread with instructions of the second thread. This makes better use of the processor's resources, especially for ordinary programs. Note that HT may help not only when you have a lot of memory accesses, but also in heavily sequential code. Well-optimized computational code may fully utilize all resources of the CPU, in which case you will see no benefit from HT (e.g. the dgemm routine from a well-optimized BLAS).
P.S. You might want to read Intel's detailed explanation of hyper-threading, including info about which resources are duplicated or shared, and discussion about performance.
Context switches
The context is the internal state of the CPU, which includes at least all the registers. When the executing thread changes, the OS has to do a context switch (detailed description here). According to this answer, a context switch takes about 10 microseconds, while the scheduler's time quantum is 10 milliseconds or more (see here). So context switches do not affect the total time much, because they happen seldom enough. Note that competition for CPU caches between threads can increase the effective cost of switches in some cases.
However, with hyper-threading each core holds two states internally: two sets of registers, but shared caches and one set of execution units. As a result, the OS has no need to do any context switches when you run 8 threads on 4 physical cores. When you run 16 threads on a quad-core, context switches are performed, but they take a small part of the overall time, as explained above.
Process manager
Speaking of CPU utilization that you see in the process manager, it does not measure the internals of CPU pipeline. Windows can only notice when a thread returns execution to OS in order to: sleep, wait for mutex, wait for HDD, and do other slow things. As a result, it thinks that a core is fully used if there is a thread working on it, which does not sleep or wait for anything. For instance, you may check that running endless loop while (true) {} leads to full utilization of CPU.
I could see that CPU usage was ~50% with 4 threads. Shouldn't it be ~100%?
No, it shouldn't.
what is the justification for 50% CPU usage when running 4 CPU bound threads on a 4 physical core machine?
This is simply how CPU utilization is reported in Windows (and on at least some other OS's too, by the way). A HT CPU shows up as two cores to the operating system, and is reported as such.
Thus, Windows sees an eight-core machine, when you have four HT CPUs. You'll see eight different CPU graphs if you look at the "Performance" tab in Task Manager, and the total CPU utilization is computed with 100% utilization being the full utilization of these eight cores.
If you are only using four threads, then these threads cannot fully utilize the available CPU resources and that explains the timings. They can, at most, use four of the eight cores available and so of course your utilization will max out at 50%. Once you go past the number of logical cores (8), runtime increases again; you are adding scheduling overhead without adding any new computational resources in that case.
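The arithmetic behind these numbers can be sketched in a few lines. This assumes fully CPU bound threads that never block, so each one keeps exactly one logical processor busy; the class and method names are made up for illustration, and the timings are the ones from the question.

```csharp
using System;

public static class UtilizationDemo
{
    // Utilization as reported per logical processor:
    // busy logical processors divided by all logical processors.
    public static double ReportedUtilization(int busyThreads, int logicalProcessors)
    {
        int busy = Math.Min(busyThreads, logicalProcessors);
        return 100.0 * busy / logicalProcessors;
    }

    // How many times faster a run is compared to the baseline run.
    public static double Speedup(double baselineSeconds, double seconds)
    {
        return baselineSeconds / seconds;
    }

    public static void Main()
    {
        Console.WriteLine("4 threads on 8 logical processors: {0}%", ReportedUtilization(4, 8));
        Console.WriteLine("8 threads on 8 logical processors: {0}%", ReportedUtilization(8, 8));

        // Timings from the question, with the 4-thread run as the baseline.
        double baseline = 78.82;
        int[] threads = { 8, 16, 28 };
        double[] seconds = { 48.58, 51.35, 52.10 };
        for (int i = 0; i < threads.Length; i++)
            Console.WriteLine("{0} threads: {1:F2}x the speed of 4 threads",
                              threads[i], Speedup(baseline, seconds[i]));
    }
}
```

With the question's timings, 8 threads come out at roughly 1.6 times the speed of 4 threads, and 16 and 28 threads slightly below that.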
By the way…
HyperThreading has improved quite a lot from the old days of shared cache and other limitations, but it will still never provide the same throughput benefit that a full CPU could, as there remains some contention within the CPU. So even ignoring OS overhead, your 35% improvement in speed seems pretty good to me. I often see no more than a 20% speed-up adding the extra HT cores to a computationally-bottlenecked process.
I can't explain the sheer volume of speed-up that you observed: 100% seems way too much of an improvement for Hyperthreading. But I can explain the principles in place.
The main benefit to Hyperthreading is when a processor has to switch between threads. Whenever there are more threads than there are CPU cores (true 99.9997% of the time) and the OS decides to switch to a different thread, it has to perform (most of) the following steps:
1. Save the state of the current thread: this includes the stack, the state of the registers, and the program counter. Where they get saved depends on the architecture, but generally speaking they'll either get saved in cache or in memory. Either way, this step takes time.
2. Put the thread into the "Ready" state (as opposed to the "Running" state).
3. Load the state of the next thread: again, including the stack, the registers, and the program counter, which, once again, is a step that takes time.
4. Flip the thread into the "Running" state.
In a normal (non-HT) CPU, the number of cores it has is the quantity of processing units. Each of these contain registers, program counters (registers), stack counters (registers), (usually) individual cache, and complete processing units. So if a normal CPU has 4 cores, it can run 4 threads simultaneously. When a thread is done (or the OS has decided that it's taking too much time and needs to wait its turn to start again), the CPU needs to follow those four steps to unload the thread and load in the new one before execution of the new one can begin.
In a HyperThreading CPU, on the other hand, the above holds true, but in addition, each core has a duplicated set of registers, program counters, stack counters, and (sometimes) cache. What this means is that a 4-core CPU can still only have 4 threads running simultaneously, but the CPU can have "preloaded" threads on the duplicated registers. So 4 threads are running, but 8 threads are loaded onto the CPU, 4 active, 4 inactive. Then, when it's time for the CPU to switch threads, instead of having to perform the loading/unloading at the moment the threads need to switch out, it simply "toggles" which thread is active, and performs the unloading/loading in the background on the newly "inactive" registers. Remember the two steps I described as taking time? In a Hyperthreaded system, steps 2 and 4 are the only ones that need to be performed in real-time, whereas steps 1 and 3 are performed in the background in the hardware (divorced from any concept of threads or processes or CPU cores).
Now, this process doesn't completely speed up multithreaded software, but in an environment where threads often have extremely small workloads that they perform very frequently, the quantity of thread-switches can be expensive. Even in environments that don't conform to that paradigm, there can be benefits from Hyperthreading.
Let me know if you need any clarifications. It's been a few years since CS250, so I may be mixing up terminology here-or-there; let me know if I'm using the wrong terms for something. I'm 99.9997% certain that everything I'm describing is accurate in terms of the logic of how it all works.
Hyper-threading works by interleaving instructions in the processor execution pipeline. While the processor is performing read-write operations on one 'thread' it is performing logical evaluation on the other 'thread', keeping them separate and giving you a perceived doubling in performance.
The reason you get such a big speedup is because there is no branching logic in your DoWork method. It is all a big loop with a very predictable execution sequence.
A processor execution pipeline has to go through several clock cycles to execute a single calculation. The processor attempts to optimise the performance by pre-loading the execution buffer with the next few instructions. If the instruction loaded is actually a conditional jump (such as an if statement), this is bad news, because the processor has to flush the entire pipeline and fetch instructions from a different part of memory.
You may find that if you put if statements in your DoWork method, you will not get 100% speedup...
I am developing an application which analyses real-time financial data. Currently my main computational cycle has the following design:
long cycle_counter = 0;
while (process_data)
{
    // analyse data, issue instruction: 5000 lines of straightforward code with computations
    cycle_counter++;
    Thread.Sleep(5);
}
When I run this application on my notebook (one Core i5 processor), the cycle runs 200-205 times per second, more or less as expected (if you don't worry about why it runs more than 200 times a second).
But when I deploy the application on "real" workstation, which has 2 6-core Xeon processors and 24 GB of fast RAM, and which loads Win7 in about 3 seconds, the application runs the cycle about 67 times per second.
My questions are:
why is this happening?
how can I influence the number of runs per second in this situation?
are there any better solutions for running the cycle 200-1000 times per second? I am now thinking about just removing Thread.Sleep() (the way I use it here is criticised a lot). With 12 cores I have no problem using one core just for this cycle. But there may be some downside to such a solution?
Thank you for your ideas.
The approach you're taking is simply fundamentally broken. Polling strategies are in general a bad way to go, and any time you do a Sleep for a reason other than "I want to give the rest of my timeslice back to the operating system", you're probably doing something wrong.
A better way to approach the problem is:
Make a threadsafe queue of unprocessed work
Make one thread that puts new work in the queue
Make n threads that take work out of the queue and do the work. n should be the number of CPUs you have minus one. If you have more than n threads then at least two threads are trading off CPU time, which is making them both slower!
The worker threads do nothing but sit in a loop taking work out of the queue and doing the work.
If the queue is empty then the "take work out" blocks.
When new work arrives, one of the blocked threads is reactivated.
How to build a queue with these properties is a famous problem called the Producer/Consumer Problem. There are lots of articles on how to do it and many implementations of blocking producer-consumer queues. I recommend finding an existing debugged one rather than trying to write your own; getting it right can be tricky.
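One existing implementation in .NET 4 is BlockingCollection<T>. The sketch below is only an illustration of the steps listed above, not a drop-in solution: the class name WorkQueueDemo, the queue capacity of 100 and the integer "work items" are assumptions, the real analysis is replaced by a counter, and for brevity the calling thread acts as the producer instead of a dedicated thread.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class WorkQueueDemo
{
    // Pushes itemCount work items through workerCount consumers
    // and returns how many items were processed.
    public static int Run(int itemCount, int workerCount)
    {
        int processed = 0;
        using (var queue = new BlockingCollection<int>(boundedCapacity: 100))
        {
            var workers = new Task[workerCount];
            for (int w = 0; w < workerCount; w++)
            {
                workers[w] = Task.Factory.StartNew(() =>
                {
                    // Blocks while the queue is empty; the loop ends once
                    // CompleteAdding has been called and the queue is drained.
                    foreach (int item in queue.GetConsumingEnumerable())
                    {
                        // Placeholder for the real work on 'item'.
                        Interlocked.Increment(ref processed);
                    }
                }, TaskCreationOptions.LongRunning);
            }

            // Producer: Add blocks when the bounded queue is full.
            for (int i = 0; i < itemCount; i++)
                queue.Add(i);
            queue.CompleteAdding();

            Task.WaitAll(workers);
        }
        return processed;
    }

    public static void Main()
    {
        // n = number of CPUs minus one, as suggested above (at least one worker).
        int workers = Math.Max(1, Environment.ProcessorCount - 1);
        Console.WriteLine("processed {0} items", Run(1000, workers));
    }
}
```

No Sleep call is needed anywhere: an idle worker simply stays blocked inside GetConsumingEnumerable until new work arrives.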
Windows is not an RTOS (Real Time Operating System), so you cannot precisely determine when your thread will resume. Thread.Sleep(5) really means "wake me up no sooner than 5 ms from now". The actual sleep time is determined by the specific hardware and mostly by the system load. You can try to work around the system load issue by running your application at a higher priority.
BTW, System.Threading.Timer is a better approach (above comments still apply though).
The resolution of Sleep is dictated by the current timer tick interval, which is usually either 10 or 15 milliseconds depending on the edition of Windows. This can be changed, however, by calling timeBeginPeriod. See this answer.
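A quick way to see which tick interval a given machine is currently using is to measure how long Thread.Sleep(5) really takes. This is a minimal sketch with made-up names (SleepResolutionDemo, AverageSleepMs); the result depends on the machine and on whatever other running programs have requested from the system timer. As a sanity check against the question: 67 cycles per second is about 15 ms per cycle, and 200 cycles per second is about 5 ms per cycle.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class SleepResolutionDemo
{
    // Average wall-clock duration, in milliseconds, of Thread.Sleep(requestedMs).
    public static double AverageSleepMs(int requestedMs, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            Thread.Sleep(requestedMs);
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Main()
    {
        double avg = AverageSleepMs(5, 200);
        Console.WriteLine("Sleep(5) took {0:F2} ms on average, i.e. at most {1:F0} cycles per second",
                          avg, 1000.0 / avg);
    }
}
```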
Check your timer's actual frequency: many hardware timers have an actual resolution of
65536 ticks per hour = 65536 / 3600 = 18.204 ticks per second
This is the so-called "18.2" constant, which is why the actual timer resolution is 1/18.2 s = 55 ms; for Sleep(5) it means that the call could behave as either Sleep(0) or Sleep(55), depending on rounding.
Not sure it is the best approach, but it is another approach.
Try BlockingCollection; all you do in the producer is add and sleep.
The consumer then has the option to work full time if needed.
This still does not explain why the higher-powered PC ran fewer cycles.
Is it OK for you to run your loop 200 times per second on average?
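var delay = TimeSpan.FromMilliseconds(5);
// Assumed initialisation: nextDue was not declared in the original snippet.
var nextDue = DateTime.Now + delay;
while (process_data) {
    Console.WriteLine("do work");
    var now = DateTime.Now;
    if (now < nextDue)
        System.Threading.Thread.Sleep(nextDue - now);
    nextDue = nextDue.Add(delay);
}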
With this technique your loop will run somewhat unevenly, but it should be OK on average, as the code depends neither on the resolution of Sleep nor on the resolution of DateTime.Now.
You might even combine this approach with a Timer.
Okay, I am a bit confused about what I should do and how. I know the theory of parallel programming and threading, but here is my case:
We have a number of log files in a given folder. We read these log files into a database. Reading these files usually takes a couple of hours, as we do it serially: we iterate through the files, open a SQL transaction for each file, insert the log into the database, then read the next file and do the same.
Now I am thinking of using parallel programming so that I can use all the cores of the CPU. However, I am still not clear: if I use a thread for each file, will that make any difference to the system? I mean, if I create, say, 30 threads, will they run on a single core or will they run in parallel? How can I use both of them, if they are not already doing that?
EDIT: I am using a single server with a HDD speed of 10K, a 4-core CPU and 4 GB RAM; there are no network operations, and SQL Server is on the same machine with Windows 2008 as the OS [I can change the OS if that helps too :)].
EDIT 2: I ran some tests to be sure, based on your feedback; here is what I found on my i3 quad-core CPU with 4 GB RAM:
Overall CPU usage remains at 24-50%: CPU1 and CPU2 remain under 50% usage, CPU3 remains at 75% usage and CPU4 remains around 0%. Yes, I have Visual Studio, an email client and a lot of other applications open, but this tells me that the application is not using all cores, as CPU4 remains at 0%.
RAM usage remains constantly at 74% [it was around 50% before the test]; that is how we designed the read, so nothing to worry about.
HDD read/write usage remains below 25%, and it only spikes to 25% in a sine-wave pattern, as our SQL transactions are first stored in memory and then written to disk when memory reaches a threshold, so again nothing to worry about.
So all resources are underutilized here, and hence I think I can distribute the work to make it more efficient. Your thoughts again. Thanks.
First of all, you need to understand your code and why it is slow. If you're thinking something like “my code is slow and uses one CPU, so I'll just make it use all 4 CPUs and it will be 4 times faster”, then you're most likely wrong.
Using multiple threads makes sense if:
Your code (or at least a part of it) is CPU bound. That is, it's not slowed down by your disk, your network connection or your database server, it's slowed down by your CPU.
Or your code has multiple parts, each using a different resource. E.g. one part reads from a disk, another part converts the data, which requires lots of CPU and last part writes the data to a remote database. (Parallelizing this doesn't actually require multiple threads, but it's usually the simplest way to do it.)
From your description, it sounds like you could be in situation #2. A good solution for that is the producer consumer pattern: Stage 1 thread reads the data from the disk and puts it into a queue. Stage 2 thread takes the data from the queue, processes them and puts them into another queue. Stage 3 thread takes the processed data from the second queue and saves them to the database.
In .Net 4.0, you would use BlockingCollection<T> for the queue between the threads. And when I say “thread”, I pretty much mean Task. In .Net 4.5, you could use blocks from TPL Dataflow instead of the threads.
If you do it this way then you can get up to three times faster execution (if each stage takes the same time). If Stage 2 is the slowest part, then you can get another speedup by using more than one thread for that stage (since it's CPU bound). The same could also apply to Stage 3, depending on your network connection and your database.
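The three stages can be sketched with two BlockingCollection<string> queues. Everything specific here is a stand-in chosen for illustration: the class name LogPipelineDemo, the queue capacity, an in-memory sequence of lines instead of files on disk, Trim/ToUpperInvariant instead of the real conversion, and a List<string> instead of the database. Error handling is left out; if one stage throws, the others would need to be cancelled.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class LogPipelineDemo
{
    public static List<string> Run(IEnumerable<string> source)
    {
        var saved = new List<string>(); // stand-in for the database
        using (var raw = new BlockingCollection<string>(1000))
        using (var processed = new BlockingCollection<string>(1000))
        {
            // Stage 1: read the input and queue it.
            var stage1 = Task.Factory.StartNew(() =>
            {
                foreach (string line in source)
                    raw.Add(line);
                raw.CompleteAdding();
            });

            // Stage 2: take raw lines, process them, queue the results.
            var stage2 = Task.Factory.StartNew(() =>
            {
                foreach (string line in raw.GetConsumingEnumerable())
                    processed.Add(line.Trim().ToUpperInvariant());
                processed.CompleteAdding();
            });

            // Stage 3: take processed lines and save them.
            var stage3 = Task.Factory.StartNew(() =>
            {
                foreach (string line in processed.GetConsumingEnumerable())
                    saved.Add(line);
            });

            Task.WaitAll(stage1, stage2, stage3);
        }
        return saved;
    }

    public static void Main()
    {
        foreach (string line in Run(new[] { " first entry ", "second entry" }))
            Console.WriteLine(line);
    }
}
```

Because each stage here runs on a single task, the order of the lines is preserved. Running Stage 2 on several tasks would give up that ordering and would require calling CompleteAdding on the second queue only after all of them have finished.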
There is no definite answer to this question and you'll have to test because, as mentioned in my comments:
if the bottleneck is the disk I/O then you won't gain a lot by adding more threads and you might even worsen performance because more threads will be fighting to get access to the disk
if you think disk I/O is OK but CPU load is the issue, then you can add some threads, but no more than the number of cores, because here again things will worsen due to context switching
if you can do more disk and network I/Os and CPU load is not high (very likely) then you can oversubscribe with (far) more threads than cores: typically if your threads are spending much of their time waiting for the database
So you should profile first, and then (or directly if you're in a hurry) test different configurations, but chances are you'll be in the third case. :)
First, you should check what is taking the time. If the CPU actually is the bottleneck, parallel processing will help. Maybe it's the network and a faster network connection will help. Maybe buying a faster disc will help.
Find the problem before thinking about a solution.
Your problem is not about using all the CPU; your work is mainly I/O (reading files, sending data to the DB).
Using Thread/Parallel will make your code run faster since you are processing many files at the same time.
To answer your question, the framework/OS will optimize running your code over the different cores.
It varies from machine to machine, but generally speaking, if you have a dual-core processor and 2 threads, the operating system will pass one thread to one core and the other thread to the other. It doesn't matter how many cores you use; what matters is whether your equation is the fastest. If you want to make use of parallel programming, you need a way of sharing the workload that logically makes sense. You also need to consider where your bottleneck is actually occurring. Depending on the size of the file, it may simply be the maximum read/write speed of the storage medium that is taking so long. As a test, I suggest you log where the most time in your code is being consumed.
A simple way to test whether a non-serial approach will help you is to sort your files in some order, divide the workload between 2 threads doing the same job simultaneously, and see if it makes a difference. If a second thread doesn't help you, then I guarantee 30 threads will only make it take longer due to the OS having to switch threads back and forth.
Using the latest constructs in .Net 4 for parallel programming, threads are generally managed for you... take a read of getting started with parallel programming
(pretty much the same as what has happened more recently with async versions of functions to use if you want it async)
e.g.
for (int i = 2; i < 20; i++)
{
var result = SumRootN(i);
Console.WriteLine("root {0} : {1} ", i, result);
}
becomes
Parallel.For(2, 20, (i) =>
{
var result = SumRootN(i);
Console.WriteLine("root {0} : {1} ", i, result);
});
EDIT: That said, it would perhaps also be productive / faster to put intensive tasks into separate threads... but manually making your application 'multi-core' and having things like certain threads running on particular cores isn't currently possible; that's all managed under the hood...
have a look at plinq for example
and .Net Parallel Extensions
and look into
System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)4; // bit mask: 4 = third logical processor only
Edit2:
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.
I have a program that executes multiple threads. Each thread simply executes an HttpWebRequest and then screen scrapes the page looking for some text. I am in a race against other users to find this text. I could execute 1000000 threads, all looking for the same thing.
My thought is that this would put a lot of work on my processor and would actually cause the requests to execute more slowly. How can I find a balance between the number of threads to execute and the performance of the web requests? Basically, what I want to do is find the optimal number of threads to spawn so that the amount of data they pull down is greatest.
The application is using .NET4 and written in C#.
You are right to assume that 1000000 threads will put undue pressure on your CPU. The work that your CPU would have to do to manage and switch between that many threads would probably cause your system to be very slow indeed.
Obviously you are not serious about 1000000 threads, but it demonstrates that you cannot simply throw more threads at the problem. You don't really want to write your own load balancer: that will not be easy and will not perform as well as the classes that come with the base class library. Have a look at using ThreadPool threads; the CLR will manage them for you. You can also look at the Task Parallel Library that is new in .NET 4.0 (since you mention that is what you are using).
Also check out this great article about multi-threading:
http://www.albahari.com/threading/
C# has a ThreadPool. Submit your web-scraping tasks to the pool. You can tweak the number of threads in the pool to tune your app - you will probably need to increase it well above the default for best performance with such a requirement as yours.
Huge numbers of threads are wasteful, as posted by @M Babcock.
I'm not sure if the number of threads in a C# ThreadPool can be changed at run-time, (I see no reason why not, but M$...). If it is tweakable during the run, tuning will be even easier!
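For what it's worth, the ThreadPool class does expose static GetMinThreads/SetMinThreads and GetMaxThreads/SetMaxThreads methods, and they can be called while the program is running. A minimal sketch follows; the helper name RaiseMinWorkerThreads and the value 64 are arbitrary choices for illustration, and a suitable value has to be found by measurement.

```csharp
using System;
using System.Threading;

public static class ThreadPoolTuningDemo
{
    // Raises the minimum number of worker threads, leaving the I/O
    // completion thread minimum untouched. Returns false if the pool
    // rejected the request (for example, because it exceeds the maximum).
    public static bool RaiseMinWorkerThreads(int desiredMinWorkers)
    {
        int minWorkers, minIo;
        ThreadPool.GetMinThreads(out minWorkers, out minIo);
        if (desiredMinWorkers <= minWorkers)
            return true; // already at least that high, nothing to do
        return ThreadPool.SetMinThreads(desiredMinWorkers, minIo);
    }

    public static void Main()
    {
        int maxWorkers, maxIo;
        ThreadPool.GetMaxThreads(out maxWorkers, out maxIo);
        Console.WriteLine("max worker threads: {0}, max I/O threads: {1}", maxWorkers, maxIo);

        bool ok = RaiseMinWorkerThreads(64);
        Console.WriteLine("raising the minimum to 64 {0}", ok ? "succeeded" : "was rejected");
    }
}
```

Raising the minimum mainly affects how quickly the pool adds threads when many blocking requests are queued at once; it does not force that many threads to exist.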
You need to use Parallel.ForEach to manage your threads properly...
You are asking performance question and not providing any estimates on your actual requirements... so let me try doing it for you.
How much data can you pull in? Assuming an awesome network and a regular network card: 100Mb/s at most, probably less than 10Mb/sec. This gives fewer than about 10000 requests per second (assuming ~10K request/response pairs).
Can one thread handle that much data? Searching through 100Mb a second should not be a problem even for a single thread. Super easy to prototype/measure.
How many threads do I need to read the data? Likely 1: starting an asynchronous request is fast, and reading the response OR posting the response to a queue for processing is fast enough for 10000 items a second.
So my estimate: 1 thread for simple code, (1 + one thread per core) if you have more cores and are willing to run the processing in parallel.
On a single processor, will multi-threading increase the speed of a calculation? As we all know, multi-threading is used to increase user responsiveness, which is achieved by separating the UI thread and the calculation thread. But let's talk only about console applications. Will multi-threading increase the speed of the calculation? Do we get the calculation result faster when we calculate with multiple threads?
What about multiple cores: will multi-threading increase the speed or not?
Please help me. If you have any material for learning more about threading, please post it.
Edit:
I have been asked a question: at any given time, only one thread is allowed to run on a single core. If so, why do people use multithreading in a console application?
Thanks in advance,
Harsha
In general terms, no it won't speed up anything.
Presumably the same work overall is being done, but now there is the overhead of additional threads and context switches.
On a single processor with HyperThreading (two virtual processors) then the answer becomes "maybe".
Finally, even though there is only one CPU, perhaps some of the threads can be pushed to the GPU or other hardware? This is kind of getting away from the "single processor" scenario, but it could technically be a way of achieving a speed increase from multithreading on a single-core PC.
Edit: your question now mentions multithreaded apps on a multicore machine.
Again, in very general terms, this will provide an overall speed increase to your calculation.
However, the increase (or lack thereof) will depend on how parallelizable the algorithm is, the contention for memory and cache, and the skill of the programmer when it comes to writing parallel code without locking or starvation issues.
Few threads on 1 CPU:
may increase performance in case you continue with another thread instead of waiting for I/O bound operation
may decrease performance if, say, there are too many threads and work is wasted on context switching
Few threads on N CPUs:
may increase performance if you are able to cut job in independent chunks and process them in independent manner
may decrease performance if you rely heavily on communication between threads and bus becomes a bottleneck.
So actually it's very task specific: you can parallelize some things very easily, while it's almost impossible for others. Perhaps it's a bit advanced reading for a newcomer, but there are 2 great resources on this topic in the C# world:
Joe Duffy's web log
PFX team blog - they have a very good set of articles for parallel programming in .NET world including patterns and practices.
What is your calculation doing? You won't be able to speed it up by using multithreading if it is processor bound, but if for some reason your calculation writes to disk or waits for some other sort of IO, you may be able to improve performance using threading. However, when you say "calculation" I assume you mean some sort of processor-intensive algorithm, so adding threads is unlikely to help, and could even slow you down, as the context switches between threads add extra work.
If the task is compute bound, threading will not make it faster unless the calculation can be split in multiple independent parts. Even so you will only be able to achieve any performance gains if you have multiple cores available. From the background in your question it will just add overhead.
However, you may still want to run any complex and long running calculations on a separate thread in order to keep the application responsive.
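As an illustration of "split in multiple independent parts", here is a minimal sketch that sums squares sequentially and then with Parallel.For, where every worker keeps its own partial sum and the partial sums are combined only at the end. The class and method names are made up, and the calculation itself is just a stand-in for a real compute-bound job. On a single core the parallel version does the same total work plus some overhead, which is the point made above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ChunkedSumDemo
{
    // Sum of i*i for i = 1..n on one thread.
    public static long SumSequential(int n)
    {
        long total = 0;
        for (int i = 1; i <= n; i++)
            total += (long)i * i;
        return total;
    }

    // Same sum, split into independent chunks by Parallel.For.
    public static long SumParallel(int n)
    {
        long total = 0;
        Parallel.For(1, n + 1,
            () => 0L,                                    // per-worker partial sum
            (i, state, local) => local + (long)i * i,    // independent work, no sharing
            local => Interlocked.Add(ref total, local)); // combine once per worker
        return total;
    }

    public static void Main()
    {
        Console.WriteLine("sequential: {0}", SumSequential(1000));
        Console.WriteLine("parallel:   {0}", SumParallel(1000));
    }
}
```

The only synchronization is the single Interlocked.Add per worker at the end, so the workers do not wait on each other while computing.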
No, no and no.
Unless you write parallel code that takes advantage of multiple cores, it will always be slower when there are no blocking calls involved.
Exactly like the user input example, one thread might be waiting for a disk operation to complete, and other threads can take that CPU time.
As described in the other answers, multi-threading on a single core won't give you any extra performance (hyperthreading notwithstanding). However, if your machine sports an Nvidia GPU you should be able to use CUDA to push calculations to the GPU. See http://www.hoopoe-cloud.com/Solutions/CUDA.NET/Default.aspx and C#: Perform Operations on GPU, not CPU (Calculate Pi).
The answers above cover most of it.
Running multiple threads on one processor can increase performance if you manage to get more work done at the same time instead of letting the processor wait between operations. However, it can also cause a severe loss of performance, for example due to synchronization, or because the processor is overloaded and can't keep up with the demand.
With multiple cores, threading can improve performance significantly. However, much depends on finding the hotspots and not overdoing it. Using threads everywhere, with the synchronization that requires, can even lower performance. Optimizing with threads on multiple cores takes a lot of up-front study and planning to get a good result. For example, you need to think about how many threads to use in different situations. You do not want threads to sit and wait for information used by another thread.
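One way to act on both points (a bounded number of threads, and as little synchronization as possible) is sketched below using `Parallel.For` with thread-local state. The class and method names are made up for illustration, and the workload (a sum of squares) is only an assumption standing in for a real hotspot.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedParallel
{
    // Hypothetical example: sums i*i for i in [0, n) using at most maxThreads workers.
    public static long SumOfSquares(int n, int maxThreads)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };
        long total = 0;

        Parallel.For(0, n, options,
            () => 0L,                                    // each worker starts with its own running total
            (i, state, local) => local + (long)i * i,    // hot loop touches only worker-local data
            local => Interlocked.Add(ref total, local)); // one synchronized add per worker at the end

        return total;
    }

    public static void Main()
    {
        // Environment.ProcessorCount reports logical processors, a common starting point.
        Console.WriteLine(SumOfSquares(100000, Environment.ProcessorCount));
    }
}
```

The thread count is a parameter here on purpose: the best value is usually found by measuring, not by assuming that more threads are better.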
http://www.intel.com/intelpress/samples/mcp_samplech01.pdf
https://computing.llnl.gov/tutorials/parallel_comp/
https://computing.llnl.gov/tutorials/pthreads/
http://en.wikipedia.org/wiki/Superscalar
http://en.wikipedia.org/wiki/Simultaneous_multithreading
I have been doing some intensive C++ mathematical simulation runs using 24 core servers. If I run 24 separate simulations in parallel on the 24 cores of a single server, then I get a runtime for each of my simulations of say X seconds.
The bizarre thing I have noticed is that, when running only 12 simulations, using 12 of the 24 cores, with the other 12 cores idle, then each of the simulations runs at a runtime of Y seconds, where Y is much greater than X! When viewing the task manager graph of the processor usage, it is obvious that a process does not stick to only one core, but alternates between a number of cores. That is to say, the switching between cores to use all the cores slows down the calculation process.
The way I maintained the runtime when running only 12 simulations, is to run another 12 "junk" simulations on the side, using the remaining 12 cores!
Conclusion: When using multiple cores, use them all at 100%; at lower utilisation, the runtime increases!
For a single-core CPU:
Actually, the performance depends on the job you are referring to.
In your case, where the calculation is done by the CPU, overclocking would help if your motherboard supports it. Otherwise there is no way for the CPU to do calculations faster than its own speed.
For a multicore CPU:
As the above answers say, if properly designed, the performance may increase when all cores are fully used.
On a single-core CPU, if the threads are implemented at user level, multithreading won't help when a thread makes a blocking system call such as an I/O operation, because the kernel doesn't know about the user-level threads and blocks the whole process.
So if the process does I/O, you can implement the threads in kernel space and then use different threads for different jobs.
(This answer is based on theory.)
Even a CPU bound task might run faster multi-threaded if properly designed to take advantage of cache memory and the pipelining done by the processor. Modern processors spend a lot of time twiddling their thumbs, even when nominally fully "busy".
Imagine a process that used a small chunk of memory very intensively. Processing the same chunk of memory 1000 times would be much faster than processing 1000 chunks of similar memory.
You could certainly design a multi threaded program that would be faster than a single thread.
Threads don't increase performance. Threads sacrifice performance in favor of keeping parts of the code responsive.
The only exception is if you are doing a computation that is so parallelizable that you can run different threads on different cores (which is the exception, not the rule).