I wrote a LINQ query to find the frequencies of the unique characters in a text file. I also transform the initial result into an object with the help of Select. The final result comes out as a List.
Below is the query I used.
charNodes = inputString.GroupBy(ch => ch)
.Select((ch) => new TNode(ch.Key.ToString(),ch.Count()))
.ToList<TNode>();
I have a quad-core machine, and the above query ran in 15ms. But strangely it took more time when I PLINQ'ed the same query. The one below took about 40ms.
charNodes = inputString.GroupBy(ch => ch).AsParallel()
.Select((ch) => new TNode(ch.Key.ToString(),ch.Count()))
.ToList<TNode>();
The worst case was the next query, which took about 83ms:
charNodes = inputString.AsParallel().GroupBy(ch => ch)
.Select((ch) => new TNode(ch.Key.ToString(), ch.Count()))
.ToList<TNode>();
What is going wrong here?
When this type of question comes up the answer is always the same: The PLINQ overhead is higher than the gains.
This happens because the work items are extremely small (grouping by a char, or creating a new object from trivial inputs). It works much better when they are bigger.
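A rough way to check this on your own input is to time both versions after a warm-up call. This is only a sketch: the class and method names (CharFrequencyDemo, CountSequential, CountParallel, Time) are made up for illustration, and it returns a Dictionary instead of the TNode list from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Hypothetical helper, not part of the question's code.
public static class CharFrequencyDemo
{
    // Sequential LINQ: one pass, no cross-thread coordination.
    public static Dictionary<char, int> CountSequential(string input)
    {
        return input.GroupBy(ch => ch)
                    .ToDictionary(g => g.Key, g => g.Count());
    }

    // PLINQ: the same query, but the input is partitioned across threads
    // and the partial groups have to be merged afterwards.
    public static Dictionary<char, int> CountParallel(string input)
    {
        return input.AsParallel()
                    .GroupBy(ch => ch)
                    .ToDictionary(g => g.Key, g => g.Count());
    }

    // Runs the action once untimed (JIT, thread pool start-up),
    // then returns the total time for the given number of iterations.
    public static TimeSpan Time(Action action, int iterations)
    {
        action();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Calling CharFrequencyDemo.Time(() => CharFrequencyDemo.CountSequential(inputString), 100) and the same for CountParallel should show whether the parallel version pays off for your data size. Both methods return the same counts; only the cost differs.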
It's really hard to tell what's going on there strictly based on the code you provided.
TPL uses thread pool threads. The thread pool starts up with about 10 running threads. If you need more threads, the thread pool will create new ones about once every second for as long as a new thread is needed. If your loop resulted in more than 10 parallel operations, it would need to spend time spinning up new threads.
Correction: the number of threads a parallel loop needs takes away from the available threads in the thread pool. The thread pool tries to keep a minimum number of available threads in the pool; if it notices that threads are taking too long, it will spin up new ones to compensate, which takes resources. Lots of parts of the framework use the thread pool, so there are all sorts of things that could be stressing it. Starting up a thread is fairly expensive.
The other possibility is that if your number of iterations was more than the number of available CPUs, a lot of context switching resulted. Context switching is expensive and impacts the load on the CPUs as well as how fast the OS can switch between threads.
If you provide more detail, like the input data, I can provide more detail in the answer.
I have just written a multithreading sample based on this link, as below:
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, (i, state) =>
{
    count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
It gives me 15 threads before Parallel.For and only 17 threads after it. So only 2 threads are occupied by Parallel.For.
Then I created another sample based on this link, as below:
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 };
Console.WriteLine("MaxDegreeOfParallelism : {0}", Environment.ProcessorCount * 10);
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options, (i, state) =>
{
    count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
In the code above, I have set MaxDegreeOfParallelism, which comes out as 40, but Parallel.For still uses the same number of threads.
So how can I increase the number of running threads for Parallel.For?
The problem I am facing is that some numbers are skipped inside Parallel.For when I perform some heavy and complex work inside it. So I want to increase the maximum number of threads to overcome the skipping issue.
What you're saying is something like: "My car is shaking when driving too fast. I'm trying to avoid this by driving even faster." That doesn't make any sense. What you need is to fix the car, not change the speed.
How exactly to do that depends on what you are actually doing in the loop. The code you showed is obviously a placeholder, but even that is wrong: count++ on a shared variable is not thread safe. So I think what you should do first is learn about thread safety.
Using a lock is one option, and it's the easiest one to get correct. But it's also hard to make it efficient. What you need is to lock only for a short amount of time each iteration.
There are other ways to achieve thread safety, including Interlocked, the overloads of Parallel.For that use thread-local data, and approaches other than Parallel.For(), like PLINQ or TPL Dataflow.
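As a minimal sketch of two of these options applied to the counting loop from the question (the class and method names here are invented for illustration):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper showing two thread-safe ways to count iterations.
public static class SafeCounting
{
    // Option 1: Interlocked makes each increment atomic.
    public static int CountWithInterlocked(int iterations)
    {
        int count = 0;
        Parallel.For(0, iterations, i =>
        {
            Interlocked.Increment(ref count);
        });
        return count;
    }

    // Option 2: each worker keeps a private subtotal and the shared
    // variable is touched only once per worker, when the subtotal is merged.
    public static int CountWithThreadLocal(int iterations)
    {
        int count = 0;
        Parallel.For(0, iterations,
            () => 0,                                           // initial subtotal for a worker
            (i, state, subtotal) => subtotal + 1,              // no shared state touched here
            subtotal => Interlocked.Add(ref count, subtotal)); // merge once per worker
        return count;
    }
}
```

Both methods return exactly the number of iterations, whereas a plain count++ can lose updates. The thread-local version usually has less contention, because the atomic operation runs once per worker rather than once per iteration.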
Only after you have made sure your code is thread safe is it time to worry about things like the number of threads. And regarding that, I think there are two things to note:
For CPU-bound computations, it doesn't make sense to use more threads than the number of cores your CPU has. Using more threads than that will actually usually lead to slower code, since switching between threads has some overhead.
I don't think you can measure the number of threads used by Parallel.For() like that. Parallel.For() uses the thread pool and it's quite possible that there already are some threads in the pool before the loop begins.
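If you want a rough idea of how many threads the loop body actually ran on, one approach is to record the managed thread ID inside the loop instead of counting process threads. This is a sketch with invented names (LoopThreadCounter, CountDistinctThreads):

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: counts the distinct managed threads that executed the loop body.
public static class LoopThreadCounter
{
    public static int CountDistinctThreads(int iterations, int maxDegreeOfParallelism)
    {
        // ConcurrentDictionary is used as a thread-safe set of thread IDs.
        var seen = new ConcurrentDictionary<int, bool>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };

        Parallel.For(0, iterations, options, i =>
        {
            seen.TryAdd(Thread.CurrentThread.ManagedThreadId, true);
        });

        return seen.Count;
    }
}
```

Note that MaxDegreeOfParallelism is only an upper bound on how many iterations run concurrently. With a trivial loop body the loop may finish on a handful of threads no matter how high the limit is set.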
Parallel loops use hardware CPU cores. If your CPU has 2 cores, that is the maximum degree of parallelism that you can get on your machine.
Taken from MSDN:
What to Expect
By default, the degree of parallelism (that is, how many iterations run at the same time in hardware) depends on the number of available cores. In typical scenarios, the more cores you have, the faster your loop executes, until you reach the point of diminishing returns that Amdahl's Law predicts. How much faster depends on the kind of work your loop does.
Further reading:
Threading vs Parallelism, how do they differ?
Threading vs. Parallel Processing
Parallel loops will give you wrong results for summation without locks, because every iteration updates the single shared variable 'count' and its value inside a parallel loop is not predictable. However, taking a lock in every iteration of a parallel loop defeats the parallelism. So you should test the parallel loop with something other than a summation.
I have an IEnumerable of actions, ordered descending by the time they will take to execute. Now I want all of them to be executed in parallel. Are there any better solutions than this one?
IEnumerable<WorkItem> workItemsOrderedByTime = myFactory.WorkItems.DecendentOrderedBy(t => t.ExecutionTime);
Parallel.ForEach(workItemsOrderedByTime, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, t => t.Execute());
So my idea is to execute the most expensive tasks, in terms of the time they need, first.
EDIT: The question is whether there is a better solution to get everything done in the minimum amount of time.
To solve your XY Problem of
"Because otherwise it can happen that 9 of 10 tasks are finished and the last one is executed on 1 core and all other cores are doing nothing."
What you need to do is tell Parallel.ForEach to only take one item from the source list at a time. That way when you are down to the last items you won't have a bunch of slow work items all in a single core's queue.
This can be done by using Partitioner.Create and passing in EnumerablePartitionerOptions.NoBuffering:
Parallel.ForEach(Partitioner.Create(workItems, EnumerablePartitionerOptions.NoBuffering),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    t => t.Execute());
1. By default there is no execution order guarantee in Parallel.ForEach.
2. That is why your call to DecendentOrderedBy does not do anything good. It might even do something bad: if the default partitioner decides to do a range partition, dividing say 12 WorkItems into 4 groups of 3 items by their order in the IEnumerable, then the first core has much more work to do, creating the very problem you are trying to avoid.
3. An easy fix to (2) is explained in the answer by Scott. If Parallel.ForEach takes just one item at a time then you naturally get some load balancing. In most cases this will work fine.
4. The optimal (in most cases) solution for an ordered IEnumerable (as you have) will be striped partitioning with number of buckets = number of cores. AFAIK you don't get this out of the box in .NET, but you can provide a custom OrderablePartitioner that will partition the data in just this way.
I am sorry to say it but: "No free lunch"
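To make the striping idea concrete without writing a full OrderablePartitioner, here is a simplified sketch that deals the ordered items round-robin into buckets and runs one task per bucket. It is a stand-in for a custom partitioner, not an implementation of one, and the names (StripedRunner, Stripe, Run) are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: striped (round-robin) distribution of ordered work items.
public static class StripedRunner
{
    // Item i goes to bucket i % stripes, so every bucket receives
    // a mix of expensive (early) and cheap (late) items.
    public static List<List<T>> Stripe<T>(IEnumerable<T> orderedItems, int stripes)
    {
        var buckets = Enumerable.Range(0, stripes).Select(_ => new List<T>()).ToList();
        int index = 0;
        foreach (T item in orderedItems)
        {
            buckets[index % stripes].Add(item);
            index++;
        }
        return buckets;
    }

    // Runs each bucket on its own task; items inside a bucket run in order.
    public static void Run<T>(IEnumerable<T> orderedItems, int stripes, Action<T> execute)
    {
        Task[] tasks = Stripe(orderedItems, stripes)
            .Select(bucket => Task.Run(() => bucket.ForEach(execute)))
            .ToArray();
        Task.WaitAll(tasks);
    }
}
```

With the question's data this would be called roughly as StripedRunner.Run(workItemsOrderedByTime, Environment.ProcessorCount, t => t.Execute()). Unlike the NoBuffering approach, the assignment is fixed up front, so it only balances well when the estimated execution times are reasonably accurate.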
I wrote a method that does some simple operations like +, -, *, /.
I need to run this method 1513 times.
Here I run the method only once, to check that it works and to see how much time it needs to finish.
Stopwatch st = new Stopwatch();
st.Start();
DiagramValue dv = new DiagramValue();
double pixel = dv.CalculateYPixel(23.46, diction);
st.Stop();
When the stopwatch stops, it reports a time of 0.06s.
When I run the same method 1513 times in a for loop like this:
Stopwatch st = new Stopwatch();
st.Start();
for (int i = 0; i < 1513; i++)
{
DiagramValue dv = new DiagramValue();
double pixel = dv.CalculateYPixel(23.46, diction);
}
st.Stop();
Then the Stopwatch reports about 0.14s in total, or 0.14s / 1513 = 0.00009s per call.
My question is: why is the method so slow when I run it only once, while running it around a thousand times in a for loop takes almost the same time?
Writing benchmarks is hard.
First, Stopwatch isn't infinitely accurate. When you run the method just once, you're very much limited by the accuracy of the underlying stopwatch. On the other hand, running the method multiple times alleviates this - you can get arbitrary precision by using a big enough loop. Instead of 1 vs 1513, compare e.g. 1500 vs. 3000. You'll get around 100% time increase, as expected.
Second, there's usually some cost with the first call in particular (e.g. JIT compilation) or with the memory pressure at the time of the call. That's why you usually need to do "preheating" - run the method outside of the stopwatch first to isolate these, and measure (multiple invocations) later.
Third, in a garbage collected environment like .NET, the guy who ordered the beer isn't necessarily the guy who pays the bill. Most of the cost of memory allocation in .NET is in the collection, rather than the allocation itself (which is about as cheap as a stack allocation). The collection usually happens outside of the code that caused the allocations in the first place, pointing you in the entirely wrong direction when searching for performance issues. That's why most .NET memory trackers display garbage collection separately - it's important to take account of, but can easily mislead you as to the cause if you're not careful.
There's many more issues, but these should cover your particular scenario well enough.
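A minimal sketch of the "preheat, then measure many invocations" approach could look like the following. The helper name MicroBenchmark.MeanMilliseconds is made up, and forcing a collection before timing is just one way to reduce the chance of paying for earlier allocations inside the measurement.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for rough timing; not a replacement for a real benchmarking library.
public static class MicroBenchmark
{
    // Returns the mean time per call in milliseconds.
    public static double MeanMilliseconds(Action action, int iterations)
    {
        // Warm-up call: pays first-call costs such as JIT compilation outside the measurement.
        action();

        // Start from a clean heap so garbage from earlier code is less likely
        // to be collected while the stopwatch is running.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }
}
```

For the code in the question this would be used as MicroBenchmark.MeanMilliseconds(() => new DiagramValue().CalculateYPixel(23.46, diction), 1513), and the result can be compared against a run with twice as many iterations to check that the total time scales as expected.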
Some possible reasons include:
Timing resolution. You get a more accurate figure when you find the mean over a large number of iterations.
Noise. The percentage of stuff that isn't what you actually want to record will be different.
Jitting. .NET compiles a method the first time it is used. As such, the first call in a program's lifetime takes longer, by a large factor (try running it once and then measuring the second attempt).
Branch prediction. If you keep doing the same thing with the same data, the CPU's branch predictor is going to get better at predicting which branches are taken.
GC stability. Not likely in this case, but possible. Often at the start of a set of operations that requires particular objects to be created and then released the program ends up having to get more memory from the OS. When it's a bit into that set of operations it's more likely to have reached a steady state where it can just get that memory by cleaning out objects it isn't using any more, which is faster.
I have a multi-threaded application, and in a certain section of code I use a Stopwatch to measure the time of an operation:
MatchCollection matches = regex.Matches(text); //lazy evaluation
Int32 matchCount;
//inside this bracket program should not context switch
{
//start timer
MyStopwatch matchDuration = MyStopwatch.StartNew();
//actually evaluate regex
matchCount = matches.Count;
//adds the time regex took to a list
durations.AddDuration(matchDuration.Stop());
}
Now, the problem is if the program switches control to another thread somewhere else while the stopwatch is started, then the timed duration will be wrong. The other thread could have done any amount of work before the context switches back to this section.
Note that I am not asking about locking, these are all local variables so there is no need for that. I just want the timed section to execute continuously.
edit: another solution could be to subtract the context-switched time to get the actual time done doing work in the timed section. Don't know if that's possible.
You can't do that. Otherwise it would be very easy for any application to get complete control over the CPU timeslices assigned to it.
You can, however, give your process a high priority to reduce the probability of a context-switch.
Here is another thought:
Assuming that you don't measure the execution time of a regular expression just once but multiple times, you should not see the average execution time as an absolute value but as a relative value compared to the average execution times of other regular expressions.
With this thinking you can compare the average execution times of different regular expressions without knowing the times lost to context switches. The time lost to context switches would be about the same in every average, assuming the environment is relatively stable with regards to CPU utilization.
I don't think you can do that.
A "best effort", for me, would be to put your method in a separate thread, and use
Thread.CurrentThread.Priority = ThreadPriority.Highest;
to avoid as much as possible context switching.
If I may ask, why do you need such a precise measurement, and why can't you extract the function, and benchmark it in its own program if that's the point ?
Edit : Depending on the use case it may be useful to use
Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(2); // Bit mask: 2 (binary 10) pins the process to the second core; set the bit of whatever core you want to stick to
to avoid switch between cores.
Here is my sample program for the web service server side and client side. I ran into a strange performance problem: even if I increase the number of threads calling the web service, performance does not improve. At the same time, the CPU/memory/network consumption shown in the performance panel of Task Manager is low. I am wondering what the bottleneck is and how to improve it?
(In my tests, doubling the number of threads almost doubles the total response time.)
Client side:
class Program
{
static Service1[] clients = null;
static Thread[] threads = null;
static void ThreadJob (object index)
{
// query 100 times
for (int i = 0; i < 100; i++)
{
clients[(int)index].HelloWorld();
}
}
static void Main(string[] args)
{
Console.WriteLine("Specify number of threads: ");
int number = Int32.Parse(Console.ReadLine());
clients = new Service1[number];
threads = new Thread[number];
for (int i = 0; i < number; i++)
{
clients [i] = new Service1();
ParameterizedThreadStart starter = new ParameterizedThreadStart(ThreadJob);
threads[i] = new Thread(starter);
}
DateTime begin = DateTime.Now;
for (int i = 0; i < number; i++)
{
threads[i].Start(i);
}
for (int i = 0; i < number; i++)
{
threads[i].Join();
}
Console.WriteLine("Total elapsed time (s): " + (DateTime.Now - begin).TotalSeconds);
return;
}
}
Server side:
[WebMethod]
public double HelloWorld()
{
return new Random().NextDouble();
}
thanks in advance,
George
Although you are creating a multithreaded client, bear in mind that .NET has a configurable bottleneck of 2 simultaneous calls to a single host. This is by design.
Note that this is on the client, not the server.
Try adjusting your app.config file in the client:
<system.net>
  <connectionManagement>
    <add address="*" maxconnection="20" />
  </connectionManagement>
</system.net>
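The same limit can also be raised in code through ServicePointManager.DefaultConnectionLimit, which is an alternative to the config file rather than something this answer originally used. A minimal sketch, with a made-up wrapper name (ConnectionSetup.RaiseConnectionLimit), and assuming a .NET Framework client whose generated proxy is based on HttpWebRequest:

```csharp
using System.Net;

// Hypothetical helper; call it once at start-up, before the first request is issued.
public static class ConnectionSetup
{
    public static void RaiseConnectionLimit(int limit)
    {
        // Default limit for connections per host, applied to
        // service points that are created after this assignment.
        ServicePointManager.DefaultConnectionLimit = limit;
    }
}
```

In the client from the question this would be ConnectionSetup.RaiseConnectionLimit(20) as the first line of Main. On newer .NET versions ServicePointManager is a legacy API, so treat this as applying to the .NET Framework scenario discussed here.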
There is some more info on this in this short article :
My experience is generally that locking is the problem: I had a massively parallel server once that spent more time context switching than it did performing work.
So check your memory and process counters in perfmon: if you look at context switches and the figure is high (more than 4000 per second) then you're in trouble.
You can also check the memory stats on the server: if it's spending all its time swapping, or just creating and freeing strings, it will appear to stall as well.
Lastly, check disk I/O, same reason as above.
The resolution is to remove your locks, or hold them for a minimum of time. Our problem was solved by removing the dependence on COM BSTRs and their global lock, you'll find that C# has plenty of similar synchronisation bottlenecks (intended to keep your code working safely). I've seen performance drop when I moved a simple C# app from a single-core to a multi-core box.
If you cannot remove the locks, the best option is not to create as many threads :) Use a thread pool instead to let the CPU finish one job before starting another.
I don't believe that you are running into a bottleneck at all actually.
Did you try what I suggested ?
Your idea is to add more threads to improve performance, because you are expecting that all of your threads will run perfectly in parallel. This is why you are assuming that doubling the number of threads should not double the total test time.
Your service takes a fraction of a second to return and your threads will not all start working at exactly the same instant in time on the client.
So your threads are not actually working completely in parallel as you have assumed, and the results you are seeing are to be expected.
You are not seeing any performance gain because there is none to be had. The one line of code in your service (below) probably executes without a context switch most of the time anyway.
return new Random().NextDouble();
The overhead involved in the web service call is higher than the work you are doing inside of it. If you have some substantial work to do inside the service (database calls, look-ups, file access etc.) you may begin to see some performance increase.
Just parallelizing a task will not automatically make it faster.
-Jason
Of course adding Sleep will not improve performance.
But the point of the test is to test with a variable number of threads.
So, keep the Sleep in your WebMethod.
And try now with 5, 10, 20 threads.
If there are no other problems with your code, then the increase in time should not be linear as before.
You realize that in your test, when you double the amount of threads, you are doubling the amount of work that is being done. So if your threads are not truly executing in parallel, then you will, of course, see a linear increase in total time...
I ran a simple test using your client code (with a sleep on the service).
For 5 threads, I saw a total time of about 53 seconds.
And for 10 threads, 62 seconds.
So, for 2x the number of calls to the web service, it took only 17% more time. That is what you are expecting, no?
Well, in this case you're not really balancing your work across the chosen number of threads. Each thread you create performs the same job. So if you create n threads and you have limited parallel processing capacity, performance naturally decreases. Another thing I noticed is that the job is a relatively fast operation at 100 iterations, and even if you plan on dividing this job across multiple threads, you need to consider that the time spent on context switching and thread creation/deletion will be an important factor in the overall time.
As bruno mentioned, your webmethod is a very quick operation. As an experiment, try ensuring that your HelloWorld method takes a bit longer. Throw in a Thread.Sleep(1000) before you return the random double. This will make it more likely that your service is actually forced to process requests in parallel.
Then try your client with different amounts of threads, and see how the performance differs.
Try to use a processor-consuming task instead of Thread.Sleep. Actually, a combined approach is best.
Sleep will just hand the thread's time slice over to another thread.
The IIS AppPool setting "Maximum Worker Processes" is 1 by default. For some reason, each worker process is limited to processing 10 service calls at a time. My async WCF server-side function only does Sleep(10*1000);.
This is what happens when Maximum Worker Processes = 1
http://s4.postimg.org/4qc26cc65/image.png
alternatively
http://i.imgur.com/C5FPbpQ.png?1
(First post on SO, I need to combine all pictures into one picture.)
The client is making 48 async WCF WS calls in this test (using 16 processes). Ideally this should take ~10 seconds to complete (Sleep(10000)), but it takes 52 seconds. You can see 5 horizontal lines in the perfmon picture (link above), which monitors Web Service Current Connections on the server. Each horizontal line lasts 10 seconds (the duration of Sleep(10000)). There are 5 horizontal lines because the server processes 10 calls at a time and then closes those 10 connections (this happens 5 times to process 48 calls). Completing all the calls took 52 seconds.
After setting Maximum Worker Processes = 2
(in the same picture given above)
This time there are 3 horizontal lines because the server processes 20 calls at a time and then closes those 20 connections (this happens 3 times to process 48 calls). It took 33 seconds.
After setting Maximum Worker Processes = 3
(in the same picture given above)
This time there are 2 horizontal lines because the server processes 30 calls at a time (this happens 2 times to process 48 calls). It took 24 seconds.
After setting Maximum Worker Processes = 8
(in the same picture given above)
This time there is 1 horizontal line because the server can process 80 calls at a time (this happens once to process 48 calls). It took 14 seconds.
If you don't account for this, your parallel (async or threaded) client calls will be queued on the server in batches of 10, so any calls beyond the first 10 won't be processed in parallel.
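The measured times follow from simple batch arithmetic: the server handles (worker processes x 10) calls at a time, and each batch takes one Sleep of 10 seconds. A small sketch of that estimate, with made-up names (BatchEstimate.EstimateSeconds) and ignoring connection and scheduling overhead:

```csharp
// Hypothetical helper that reproduces the batch arithmetic from the measurements above.
public static class BatchEstimate
{
    // Lower bound on the total time when the server handles
    // workers * perWorker calls at a time and every call takes secondsPerCall.
    public static int EstimateSeconds(int calls, int workers, int perWorker, int secondsPerCall)
    {
        int capacity = workers * perWorker;
        int batches = (calls + capacity - 1) / capacity; // ceiling of calls / capacity
        return batches * secondsPerCall;
    }
}
```

For 48 calls of 10 seconds each this gives 50, 30, 20 and 10 seconds for 1, 2, 3 and 8 worker processes, which lines up with the observed 52, 33, 24 and 14 seconds once a few seconds of overhead are added.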
PS: I was using Windows 8 x64 with IIS 8.5. The limit of 10 concurrent requests applies to workstation Windows OSes. Server OSes don't have that limit, according to another post on SO (I can't give the link due to rep < 10).