Parallel execution with StackExchange.Redis? - c#

I have 1M items stored in a List<Person> which I'm serializing in order to insert into Redis (2.8).
I divide the work among 10 Tasks, where each takes its own section of the list (List<> is thread-safe for read-only use: it is safe to perform multiple read operations on a List).
Simplification:
For ITEMS=100 and THREADS=10, each Task captures its own PAGE and deals with the relevant range. For example:
void Main()
{
    var ITEMS = 100;
    var THREADS = 10;
    var PAGE = 4;
    List<int> lst = Enumerable.Range(0, ITEMS).ToList();
    for (int i = 0; i < ITEMS / THREADS; i++)
    {
        lst[PAGE * (ITEMS / THREADS) + i].Dump(); // Dump() is LINQPad's output helper
    }
}
PAGE=0 will deal with : 0,1,2,3,4,5,6,7,8,9
PAGE=4 will deal with : 40,41,42,43,44,45,46,47,48,49
All ok.
Now back to SE.Redis.
I wanted to implement this pattern, and so I did (with ITEMS=1,000,000).
My testing (watching dbsize each second):
As you can see, 1M records were added via 10 threads.
Now, I don't know if that's fast, but when I change ITEMS from 1M to 10M, things get really slow and I get an exception.
The exception is thrown in the for loop:
Unhandled Exception: System.AggregateException: One or more errors occurred. --->
System.TimeoutException: Timeout performing SET urn:user>288257, inst: 1, queue: 11, qu=0, qs=11, qc=0, wr=0/0, in=0/0
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1722
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 79
   ...
Press any key to continue . . .
Questions:
Is my way of dividing the work the right (fastest) way?
How can I make things faster (sample code would be much appreciated)?
How can I resolve this exception?
Related info:
<gcAllowVeryLargeObjects enabled="true" /> is present in App.config (otherwise I get an OutOfMemoryException), and the build targets x64. I have 16GB RAM, an SSD drive, and an i7 CPU.
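For reference, a minimal App.config sketch showing where that element lives (the runtime section is its standard location):

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- allows arrays larger than 2GB on 64-bit builds -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>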

Currently, your code is using the synchronous API (StringSet), and is being loaded by 10 threads concurrently. This will present no appreciable challenge to SE.Redis - it works just fine here. I suspect that it genuinely is a timeout where the server has taken longer than you would like to process some of the data, most likely also related to the server's allocator. One option, then, is to simply increase the timeout a bit. Not a lot... try 5 seconds instead of the default 1 second. Likely, most of the operations are working very fast anyway.
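For example, the synchronous-operation timeout can be raised at connect time; a sketch ("localhost" is a placeholder endpoint, and syncTimeout is given in milliseconds):

var muxer = ConnectionMultiplexer.Connect("localhost,syncTimeout=5000");
// or, equivalently, via ConfigurationOptions:
var config = new ConfigurationOptions { EndPoints = { "localhost" }, SyncTimeout = 5000 };
var muxer2 = ConnectionMultiplexer.Connect(config);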
With regards to speeding it up: one option here is to not wait - i.e. keep pipelining data. If you are content not to check every single message for an error state, then one simple way to do this is to add , flags: CommandFlags.FireAndForget to the end of your StringSet call. In my local testing, this sped up the 1M example by 25% (and I suspect a lot of the rest of the time is actually spent in string serialization).
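A sketch of what that call looks like (db, key and value stand in for your own variables):

db.StringSet(key, value, flags: CommandFlags.FireAndForget);

Because fire-and-forget doesn't wait for replies, you give up per-command error reporting, as noted above.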
The biggest problem I had with the 10M example was simply the overhead of working with the 10M example - especially since this takes huge amounts of memory for both the redis-server and the application, which (to emulate your setup) are on the same machine. This creates competing memory pressure, with GC pauses etc in the managed code. But perhaps more importantly: it simply takes forever to start doing anything. Consequently, I refactored the code to use parallel yield return generators rather than a single list. For example:
static IEnumerable<Person> InventPeople(int seed, int count)
{
    // appRandom is assumed to be a ThreadLocal<Random> defined elsewhere
    for (int i = 0; i < count; i++)
    {
        int f = 1 + seed + i;
        var item = new Person
        {
            Id = f,
            Name = Path.GetRandomFileName().Replace(".", "").Substring(0, appRandom.Value.Next(3, 6))
                   + " "
                   + Path.GetRandomFileName().Replace(".", "").Substring(0, new Random(Guid.NewGuid().GetHashCode()).Next(3, 6)),
            Age = f % 90,
            Friends = ParallelEnumerable.Range(0, 100).Select(n => appRandom.Value.Next(1, f)).ToArray()
        };
        yield return item;
    }
}
static IEnumerable<T> Batchify<T>(this IEnumerable<T> source, int count)
{
    var list = new List<T>(count);
    foreach (var item in source)
    {
        list.Add(item);
        if (list.Count == count)
        {
            foreach (var x in list) yield return x;
            list.Clear();
        }
    }
    foreach (var item in list) yield return item;
}
with:
foreach (var element in InventPeople(PER_THREAD * counter1, PER_THREAD).Batchify(1000))
Here, the purpose of Batchify is to ensure that we aren't helping the server too much by taking appreciable time between each operation - the data is invented in batches of 1000 and each batch is made available very quickly.
I was also concerned about JSON performance, so I switched to JIL:
public static string ToJSON<T>(this T obj)
{
return Jil.JSON.Serialize<T>(obj);
}
and then, just for fun, I moved the JSON work into the batching, so that the actual processing loop becomes:
foreach (var element in InventPeople(PER_THREAD * counter1, PER_THREAD)
    .Select(x => new { x.Id, Json = x.ToJSON() }).Batchify(1000))
This got the times down a bit more, so I can load 10M in 3 minutes 57 seconds, a rate of 42,194 records per second. Most of this time is actually local processing inside the application. If I change it so that each thread loads the same item ITEMS / THREADS times, then this changes to 1 minute 48 seconds, a rate of 92,592 records per second.
I'm not sure if I've answered anything really, but the short version might be simply "try a longer timeout; consider using fire-and-forget".


Why is Queue consuming so much memory?

Basically I was doing a code kata on the Codewars site to kind of 'warm up' before starting to code, and noticed a problem that I don't know if it's because of my code, or just a regular thing.
public static string WhoIsNext(string[] names, long n)
{
    Queue<string> fifo = new Queue<string>(names);
    for (int i = 0; i < n - 1; i++)
    {
        var name = fifo.Dequeue();
        fifo.Enqueue(name);
        fifo.Enqueue(name);
    }
    return fifo.Peek();
}
And it is called like this:
// Test 1
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 1;
var nth = CodeKata.WhoIsNext(names, n); // n = 1 should return Sheldon.

// Test 2
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 52;
var nth = CodeKata.WhoIsNext(names, n); // n = 52 should return Penny.

// Test 3
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 7230702951;
var nth = CodeKata.WhoIsNext(names, n); // n = 7230702951 should return Leonard.
In this code, when I give long n the value 7230702951 (a really high number), it throws an out-of-memory exception. Is the number just that high, or is the queue not optimized for such numbers?
I say this because I tried using a List, and the list's memory usage stayed under 500 MB (the plateau was around 327 MB, by the way) while running for about 2-3 minutes, whereas the queue threw the exception in a matter of seconds and went over 2 GB in just that time alone.
Can someone explain to me why this is happening? I'm just curious.
edit 1
I forgot to add the List code:
public static string WhoIsNext(string[] names, long n)
{
    List<string> test = new List<string>(names);
    for (int i = 0; i < n - 1; i++)
    {
        var name = test[0];
        test.RemoveAt(0);
        test.Add(name);
        test.Add(name);
    }
    return test[0];
}
edit 2
For those saying that the code doubles the names and is inefficient: I already know that. The code isn't meant to be useful; it's just a kata. (I updated the link now!)
My question is why the Queue is so much more inefficient than the List at high count numbers.
Part of the reason is that the queue code is way faster than the List code, because queues are optimised for removal from the front: they are a circular buffer. Lists aren't - the list copies the array contents every time you remove the first element.
Change the input value to 72307000, for example. On my machine, the queue finishes that in less than a second. The list is still chugging away minutes (and at this rate, hours) later; after 4 minutes, i is only at 752408, so it has done about 1% of the work.
Thus, I am not sure the queue is less memory efficient. It is just so fast that you run into the memory issue sooner. The list almost certainly has the same issue (the way that List and Queue do array size doubling is very similar) - it will just likely take days to run into it.
To a certain extent, you could predict this even without running your code. A queue with 7230702951 entries in it (running 64-bit) will take a minimum of 8 bytes per entry. So 57845623608 bytes. Which is larger than 50GB. Clearly your machine is going to struggle to fit that in RAM (plus .NET won't let you have an array that large)...
Additionally, your code has a subtle bug: the loop can never end if n is greater than int.MaxValue. Your loop variable is an int but the parameter is a long, so for large values of n the int will overflow (wrapping from int.MaxValue to int.MinValue on i++) and the loop will never exit, meaning the queue grows forever. You should change the type of i to long.
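A minimal sketch of that fix applied to the queue version:

public static string WhoIsNext(string[] names, long n)
{
    Queue<string> fifo = new Queue<string>(names);
    for (long i = 0; i < n - 1; i++) // long, so i can count all the way up to n without overflowing
    {
        var name = fifo.Dequeue();
        fifo.Enqueue(name);
        fifo.Enqueue(name);
    }
    return fifo.Peek();
}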

Take pages and combine to list of pages

I have a list; let's say it contains 1000 items. I want to end up with 10 lists of 100 items each, using something like:
myList.Select(x => x.y).Take(100) (until the list is empty)
So I want Take(100) to run ten times, since the list contains 1000 items, and end up with a list containing 10 lists of 100 items each.
You need to Skip the number of records you have already taken; you can keep track of this number and use it when you query:
int alreadyTaken = 0;
while (alreadyTaken < 1000) {
    var pagedList = myList.Select(x => x.y).Skip(alreadyTaken).Take(100);
    ...
    alreadyTaken += 100;
}
This can be achieved with a simple paging extension method.
public static List<T> GetPage<T>(this List<T> dataSource, int pageIndex, int pageSize = 100)
{
    return dataSource.Skip(pageIndex * pageSize)
                     .Take(pageSize)
                     .ToList();
}
Of course, you can extend it to accept and/or return any kind of IEnumerable<T>.
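For example (a sketch using myList from the question), fetching the third page of 100 items:

var thirdPage = myList.GetPage(2); // pageIndex is zero-based; pageSize defaults to 100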
As already posted, you can use a loop with Skip and Take; that way you create a new query in every iteration. But a problem arises if you also want to enumerate each of those queries, because this is very inefficient. Let's assume you have just 50 entries and you want to go through your list ten elements at a time. You will have 5 loops doing
.Skip(0).Take(10)
.Skip(10).Take(10)
.Skip(20).Take(10)
.Skip(30).Take(10)
.Skip(40).Take(10)
Two problems arise here.
Skipping elements can still involve computation. In your first query you compute just the 10 needed elements, but in your second loop you compute 20 elements and throw 10 away, and so on. If you sum all 5 loops together, you have already computed 10 + 20 + 30 + 40 + 50 = 150 elements even though you only had 50. This results in O(n^2) performance.
Not every IEnumerable behaves this way. Some sources, such as a database, can optimize a Skip - for example by using an OFFSET clause (MySQL) in the SQL query. But that still doesn't solve the main problem: you create 5 different queries and execute all 5 of them, and those five executions now take most of the time, because even a simple database query is a lot slower than skipping some in-memory elements or doing some computation.
Because of all these problems, it makes sense not to use a loop with multiple .Skip(x).Take(y) calls if you also want to evaluate every query in every iteration. Instead, your algorithm should go through your IEnumerable only once, executing the query once: the first iteration returns the first 10 elements, the next iteration returns the next 10 elements, and so on, until it runs out of elements.
The following Extension Method does exactly this.
public static IEnumerable<IReadOnlyList<T>> Combine<T>(this IEnumerable<T> source, int amount) {
    var combined = new List<T>();
    var counter = 0;
    foreach ( var entry in source ) {
        combined.Add(entry);
        if ( ++counter >= amount ) {
            yield return combined;
            combined = new List<T>();
            counter = 0;
        }
    }
    if ( combined.Count > 0 )
        yield return combined;
}
With this you can just do
someEnumerable.Combine(100)
and you get a new IEnumerable<IReadOnlyList<T>> that goes through your enumeration just once slicing everything into chunks with a maximum of 100 elements.
Just to show how much difference the performance could be:
var numberCount = 100000;
var combineCount = 100;
var nums = Enumerable.Range(1, numberCount);
var count = 0;
// Bechmark with Combine() Extension
var swCombine = Stopwatch.StartNew();
var sumCombine = 0L;
var pages = nums.Combine(combineCount);
foreach ( var page in pages ) {
sumCombine += page.Sum();
count++;
}
swCombine.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Combine: {2}", count, sumCombine, swCombine.Elapsed);
// Doing it with .Skip(x).Take(y)
var swTakes = Stopwatch.StartNew();
count = 0;
var sumTaken = 0L;
var alreadyTaken = 0;
while ( alreadyTaken < numberCount ) {
sumTaken += nums.Skip(alreadyTaken).Take(combineCount).Sum();
alreadyTaken += combineCount;
count++;
}
swTakes.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Takes: {2}", count, sumTaken, swTakes.Elapsed);
Using the Combine() extension method runs in 3 milliseconds on my computer (i5 @ 4 GHz), while the Skip/Take loop needs 178 milliseconds.
If you have many more elements or the slices are smaller, it gets even worse. For example, if combineCount is set to 10 instead of 100, the runtimes change to 4 milliseconds and 1800 milliseconds (1.8 seconds).
Now you could say that you don't have that many elements, or that your slices never get that small. But remember: in this example I just generated a sequence of numbers with nearly zero computation time. The whole overhead, from 4 milliseconds up to 1800 milliseconds, is caused only by the re-evaluation and the skipping of values. If you have some more complex stuff going on behind the scenes, the skipping creates the most overhead; and even when an IEnumerable can implement Skip, as a database can, as explained above, things still get worse, because the dominant cost becomes the execution of the queries themselves.
And the number of queries climbs really fast: with 100,000 elements and a slice size of 100, you will already execute 1,000 queries. The Combine extension provided above, on the other hand, will always execute your query once, and will never suffer from any of the problems described above.
None of this means that Skip and Take should be avoided. They have their place. But if you really plan to go through every element, you should avoid using Skip and Take to do your slicing.
If the only thing you want is to slice everything into pages of 100 elements and fetch, say, just the third page, you should simply calculate how many elements you need to Skip:
var pageCount = 100;
var pageNumberToGet = 3;
var thirdPage = yourEnumerable.Skip(pageCount * (pageNumberToGet - 1)).Take(pageCount);
In this way you get the third page of 100 elements (elements 200 through 299, zero-based) in a single query. An IEnumerable backed by a database can optimize that too, so you end up with a single query. So, if you only want a specific range of elements from your IEnumerable, you should use Skip and Take as above instead of the Combine extension method that I provided.

How to loop a list while using Multi-threading in c#

I have a List to loop over using multiple threads: take the first item of the List, do some processing, then remove the item.
When the List is empty, fetch more data from the database.
In a word: I have a lot of records in my database, and I need to publish them to my server. In the process of publishing, multithreading is required, and the number of threads may be 10 or fewer.
For example:
private List<string> list;

void LoadDataFromDatabase()
{
    list = ...; // load data from database...
}

void DoMethod()
{
    while (list.Count > 0)
    {
        var item = list.FirstOrDefault();
        list.RemoveAt(0);
        DoProcess(); // how to use multi-threading here (with a configurable number of threads)?
        if (list.Count <= 0)
        {
            LoadDataFromDatabase();
        }
    }
}
Please help me; I'm a beginner in C# and I have searched a lot but found no similar solutions. Also, I need to be able to customize the number of threads.
Should your processing of the list be sequential? In other words, can you not process element n + 1 before you have finished processing element n? If that is your case, then multithreading is not the right solution.
Otherwise, if your processing of elements is fully independent, you can use m threads, dividing Elements.Count / m elements to each thread to work on.
Example: printing a list:
List<int> a = new List<int> { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int num_threads = 2;
int thread_elements = a.Count / num_threads;

// start the threads
// (Work takes an object parameter, so Thread picks the ParameterizedThreadStart overload)
Thread[] threads = new Thread[num_threads];
for (int i = 0; i < num_threads; ++i)
{
    threads[i] = new Thread(Work);
    threads[i].Start(i);
}

// this works fine if the total number of elements is divisible by num_threads,
// but if we have 500 elements and 7 threads, then thread_elements = 500 / 7 = 71,
// and 71 * 7 = 497, so there are 3 elements not processed.
// process them here:
int actual = thread_elements * num_threads;
for (int i = actual; i < a.Count; ++i)
    Console.WriteLine(a[i]);

// wait for all threads to finish
for (int i = 0; i < num_threads; ++i)
{
    threads[i].Join();
}

void Work(object arg)
{
    Console.WriteLine("Thread #" + arg + " has begun...");
    // calculate my working range [start, end)
    int id = (int)arg;
    int mystart = id * thread_elements;
    int myend = (id + 1) * thread_elements;
    // work on my range
    for (int i = mystart; i < myend; ++i)
        Console.WriteLine("Thread #" + arg + " Element " + a[i]);
}
ADDENDUM: For your case (uploading to a server), it is the same as the code above. You assign a number of threads, assigning each thread a number of elements (calculated automatically in the variable thread_elements, so you only need to change num_threads). In the Work method, all you need is to replace the line Console.WriteLine("Thread #" + arg + " Element " + a[i]); with your uploading code.
One more thing to keep in mind: multithreading depends on your machine's CPU. If your CPU has 4 cores, for example, then the best performance is obtained with at most 4 threads, one per core. Otherwise, if you have 10 threads, say, they will be slower than 4 threads because they compete for CPU cores (unless the threads are idle, waiting for some event to occur, e.g. an upload; in that case 10 threads can run, because they don't take 100% of CPU usage).
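A small sketch of sizing the thread count to the machine (Environment.ProcessorCount reports the number of logical cores; using it as the default is an assumption that fits CPU-bound work, while I/O-bound work such as uploading can often use more):

int num_threads = Environment.ProcessorCount; // e.g. 4 on a quad-core machine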
WARNING: DO NOT modify the list while any thread is working (add, remove, set an element...), and never assign two threads the same element. Such things will cause you a lot of bugs and exceptions!
This is a simple scenario that can be expanded in multiple ways if you add some details to your requirements:
IEnumerable<Data> LoadDataFromDatabase()
{
    return ...
}

void ProcessInParallel()
{
    while (true)
    {
        var data = LoadDataFromDatabase().ToList();
        if (!data.Any()) break;
        data.AsParallel().ForAll(ProcessSingleData); // PLINQ uses ForAll (AsParallel() has no ForEach)
    }
}

void ProcessSingleData(Data d)
{
    // do something with the data
}
There are many ways to approach this. You can create threads and partition the list yourself, or you can take advantage of the TPL and use Parallel.ForEach. In the example on the link you can see that an Action is called for each member of the list being iterated over. If this is your first taste of threading, I would also attempt to do it the old-fashioned way.
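A minimal sketch of the Parallel.ForEach route, which also covers the "customize the number of threads" requirement via MaxDegreeOfParallelism (PublishItem stands in for your publishing code and is an assumption, not from the original post):

var options = new ParallelOptions { MaxDegreeOfParallelism = 10 }; // at most 10 concurrent workers
Parallel.ForEach(list, options, item =>
{
    PublishItem(item); // hypothetical: publish one record to the server
});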
Here's my opinion ;)
You can avoid using multithreading if your List is not really huge.
Instead of a List, you can use a Queue (FIFO - First In, First Out). Then just use the Dequeue() method to get one element of the Queue, do some work on it, and get another. Something like:
while (queue.Count > 0)
{
    var temp = DoSomeWork(queue.Dequeue());
}
I think that this will be better for your purpose.
I will get the first item of the List and do some processing, then remove the item.
Bad.
First, you want a queue, not a list.
Second, you do not process then remove; you remove, THEN process.
Why?
So that you keep the locks small. Lock the list access (note that you need to synchronize access), remove the item, THEN unlock immediately, and then process. This way you keep the locks short. If you take, process, then remove, you are basically single-threaded, because you have to keep the lock in place while processing so that the next thread does not take the same item again.
And as you need to synchronize access and want multiple threads, this is about the only way.
Read up on the lock statement for a start (you can later move to something like a spinlock). And do NOT use raw threads unless you have to; schedule Tasks instead (using the Task API new in .NET 4.0), which gives you more flexibility. A sketch of the pattern follows below.
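A minimal sketch of that "remove, then process" pattern (the field names and the Process method are illustrative, not from the original post):

private static readonly object _sync = new object();
private static readonly Queue<string> _queue = new Queue<string>();

static void Worker()
{
    while (true)
    {
        string item = null;
        lock (_sync) // hold the lock only while touching the queue
        {
            if (_queue.Count > 0)
                item = _queue.Dequeue();
        }
        if (item == null) break; // queue drained
        Process(item); // process OUTSIDE the lock, so other workers can dequeue meanwhile
    }
}

Start several workers, e.g. Task.Factory.StartNew(Worker) ten times, to control the degree of parallelism.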

How come this algorithm in Ruby runs faster than in Parallel'd C#?

The following ruby code runs in ~15s. It barely uses any CPU/Memory (about 25% of one CPU):
def collatz(num)
  num.even? ? num/2 : 3*num + 1
end

start_time = Time.now
max_chain_count = 0
max_starter_num = 0
(1..1000000).each do |i|
  count = 0
  current = i
  current = collatz(current) and count += 1 until (current == 1)
  max_chain_count = count and max_starter_num = i if (count > max_chain_count)
end
puts "Max starter num: #{max_starter_num} -> chain of #{max_chain_count} elements. Found in: #{Time.now - start_time}s"
And the following TPL C# code puts all 4 of my cores at 100% usage and is orders of magnitude slower than the Ruby version:
static void Euler14Test()
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    int max_chain_count = 0;
    int max_starter_num = 0;
    object locker = new object();
    Parallel.For(1, 1000000, i =>
    {
        int count = 0;
        int current = i;
        while (current != 1)
        {
            current = collatz(current);
            count++;
        }
        if (count > max_chain_count)
        {
            lock (locker)
            {
                max_chain_count = count;
                max_starter_num = i;
            }
        }
        if (i % 1000 == 0)
            Console.WriteLine(i);
    });
    sw.Stop();
    Console.WriteLine("Max starter i: {0} -> chain of {1} elements. Found in: {2}s", max_starter_num, max_chain_count, sw.Elapsed.ToString());
}

static int collatz(int num)
{
    return num % 2 == 0 ? num / 2 : 3 * num + 1;
}
How come ruby runs faster than C#? I've been told that Ruby is slow. Is that not true when it comes to algorithms?
Perf AFTER correction:
Ruby (Non parallel): 14.62s
C# (Non parallel): 2.22s
C# (With TPL): 0.64s
Actually, the bug is quite subtle, and has nothing to do with threading. The reason that your C# version takes so long is that the intermediate values computed by the collatz method eventually start to overflow the int type, resulting in negative numbers which may then take ages to converge.
This first happens when i is 134,379, for which the 129th term (assuming one-based counting) is 2,482,111,348. This exceeds the maximum value of 2,147,483,647 and therefore gets stored as -1,812,855,948.
To get good performance (and correct results) on the C# version, just change:
int current = i;
…to:
long current = i;
…and:
static int collatz(int num)
…to:
static long collatz(long num)
That will bring down your performance to a respectable 1.5 seconds.
Edit: CodesInChaos raises a very valid point about enabling overflow checking when debugging math-oriented applications. Doing so would have allowed the bug to be immediately identified, since the runtime would throw an OverflowException.
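For illustration, a small hypothetical snippet (not from the original post) showing how a checked context surfaces the bug instead of silently wrapping; 134,379 is the first problematic starter identified above:

checked
{
    int current = 134379;
    while (current != 1)
        current = current % 2 == 0 ? current / 2 : 3 * current + 1; // throws OverflowException once a term exceeds int.MaxValue
}

The same effect can be enabled project-wide with the compiler's "check for arithmetic overflow" option.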
Should be:
Parallel.For(1L, 1000000L, i =>
{
Otherwise, you get integer overflow and start checking negative values. The same collatz method should also operate on long values.
I experienced something like that, and I figured out it happens because each of your loop iterations needs to start another thread, which takes some time; in this case that time is comparable to (I think greater than) the time taken by the operations you actually do in the loop body.
There is an alternative: find out how many CPU cores you have, then run a parallel loop with the same number of iterations as you have cores; each iteration evaluates part of the actual loop you want, via an inner for loop that depends on the parallel loop variable.
EDIT: EXAMPLE
int start = 1, end = 1000000;
int N_CORES = Environment.ProcessorCount; // assumed definition; not shown in the original
Parallel.For(0, N_CORES, n =>
{
    int s = start + (end - start) * n / N_CORES;
    int e = n == N_CORES - 1 ? end : start + (end - start) * (n + 1) / N_CORES;
    for (int i = s; i < e; i++)
    {
        // Your code
    }
});
You should try this code; I'm pretty sure it will do the job faster.
EDIT: ELUCIDATION
Well, it's been quite a long time since I answered this question, but I faced the problem again and finally understood what's going on.
I had been using AForge's implementation of the parallel for loop, and it seems to fire up a thread for each iteration of the loop; that's why, if the loop body takes relatively little time to execute, you end up with inefficient parallelism.
So, as some of you pointed out, the System.Threading.Tasks.Parallel methods are based on Tasks, which are a higher-level abstraction over threads:
"Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms that determine and adjust to the number of threads and that provide load balancing to maximize throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism."
So yeah, if you use the default library's implementation, you won't need this kind of workaround.

Should I always use Parallel.Foreach because more threads MUST speed up everything?

Does it make sense to use a Parallel.ForEach loop in place of every normal foreach loop?
When should I start using Parallel.ForEach - only when iterating over 1,000,000 items?
No, it doesn't make sense for every foreach. Some reasons:
Your code may not actually be parallelizable. For example, if you're using the "results so far" for the next iteration and the order is important.
If you're aggregating (e.g. summing values) then there are ways of using Parallel.ForEach for this, but you shouldn't just do it blindly (a sketch follows after this list)
If your work will complete very fast anyway, there's no benefit, and it may well slow things down
Basically nothing in threading should be done blindly. Think about where it actually makes sense to parallelize. Oh, and measure the impact to make sure the benefit is worth the added complexity. (It will be harder for things like debugging.) TPL is great, but it's no free lunch.
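As referenced above, a hedged sketch of parallel aggregation using the localInit / localFinally overload of Parallel.ForEach (one running sum per worker, combined once at the end):

var numbers = Enumerable.Range(1, 1000000);
long total = 0;
Parallel.ForEach(
    numbers,
    () => 0L,                                    // localInit: thread-local running sum
    (n, state, local) => local + n,              // body: accumulate without locking
    local => Interlocked.Add(ref total, local)); // localFinally: combine once per worker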
No, you should definitely not do that. The important point here is not really the number of iterations, but the work to be done. If your work is really simple, executing 1000000 delegates in parallel will add a huge overhead and will most likely be slower than a traditional single threaded solution. You can get around this by partitioning the data, so you execute chunks of work instead.
E.g. consider the situation below:
const int Count = 1000000; // assumed size; the original snippet didn't define Count
var Input = Enumerable.Range(1, Count).ToArray();
var Result = new double[Count];
Parallel.ForEach(Input, (value, loopState, index) => { Result[index] = value * Math.PI; });
The operation here is so simple, that the overhead of doing this in parallel will dwarf the gain of using multiple cores. This code runs significantly slower than a regular foreach loop.
By using a partition we can reduce the overhead and actually observe a gain in performance.
Parallel.ForEach(Partitioner.Create(0, Input.Length), range =>
{
    for (var index = range.Item1; index < range.Item2; index++)
    {
        Result[index] = Input[index] * Math.PI;
    }
});
The moral of the story here is that parallelism is hard, and you should only employ it after looking closely at the situation at hand. Additionally, you should profile the code both before and after adding parallelism.
Remember that regardless of any potential performance gain parallelism always adds complexity to the code, so if the performance is already good enough, there's little reason to add the complexity.
The short answer is no, you should not just use Parallel.ForEach or related constructs on each loop that you can.
Parallel has some overhead, which is not justified in loops with few, fast iterations. Also, break is significantly more complex inside these loops.
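For instance, a hedged sketch of what "break" turns into inside Parallel.ForEach (ParallelLoopState.Break only requests that no later iterations start; earlier iterations still run to completion):

var items = Enumerable.Range(0, 1000);
Parallel.ForEach(items, (item, state) =>
{
    if (item == 500) // some stop condition
    {
        state.Break(); // iterations below 500 still complete; later ones are not started
        return;
    }
    Console.WriteLine(item); // note: iterations run out of order
});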
Parallel.ForEach is a request to schedule the loop as the task scheduler sees fit, based on number of iterations in the loop, number of CPU cores on the hardware and current load on that hardware. Actual parallel execution is not always guaranteed, and is less likely if there are fewer cores, the number of iterations is low and/or the current load is high.
See also Does Parallel.ForEach limits the number of active threads? and Does Parallel.For use one Task per iteration?
The long answer:
We can classify loops by how they fall on two axes:
Few iterations through to many iterations.
Each iteration is fast through to each iteration is slow.
A third factor is if the tasks vary in duration very much – for instance if you are calculating points on the Mandelbrot set, some points are quick to calculate, some take much longer.
When there are few, fast iterations it's probably not worth using parallelisation in any way, most likely it will end up slower due to the overheads. Even if parallelisation does speed up a particular small, fast loop, it's unlikely to be of interest: the gains will be small and it's not a performance bottleneck in your application so optimise for readability not performance.
Where a loop has very few, slow iterations and you want more control, you may consider using Tasks to handle them, along the lines of:
var tasks = new List<Task>(actions.Length);
foreach (var action in actions)
{
    tasks.Add(Task.Factory.StartNew(action));
}
Task.WaitAll(tasks.ToArray());
Where there are many iterations, Parallel.ForEach is in its element.
The Microsoft documentation states that
When a parallel loop runs, the TPL partitions the data source so that the loop can operate on multiple parts concurrently. Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced.
This partitioning and dynamic re-scheduling is going to be harder to do effectively as the number of loop iterations decreases, and is more necessary if the iterations vary in duration and in the presence of other tasks running on the same machine.
I ran some code.
The test results below show a machine with nothing else running on it, and no other threads from the .Net Thread Pool in use. This is not typical (in fact in a web-server scenario it is wildly unrealistic). In practice, you may not see any parallelisation with a small number of iterations.
The test code is:
namespace ParallelTests
{
    class Program
    {
        private static int Fibonacci(int x)
        {
            if (x <= 1)
            {
                return 1;
            }
            return Fibonacci(x - 1) + Fibonacci(x - 2);
        }

        private static void DummyWork()
        {
            var result = Fibonacci(10);
            // inspect the result so it is not optimised away.
            // We know that the exception is never thrown. The compiler does not.
            if (result > 300)
            {
                throw new Exception("failed to do it");
            }
        }

        private const int TotalWorkItems = 2000000;

        private static void SerialWork(int outerWorkItems)
        {
            int innerLoopLimit = TotalWorkItems / outerWorkItems;
            for (int index1 = 0; index1 < outerWorkItems; index1++)
            {
                InnerLoop(innerLoopLimit);
            }
        }

        private static void InnerLoop(int innerLoopLimit)
        {
            for (int index2 = 0; index2 < innerLoopLimit; index2++)
            {
                DummyWork();
            }
        }

        private static void ParallelWork(int outerWorkItems)
        {
            int innerLoopLimit = TotalWorkItems / outerWorkItems;
            var outerRange = Enumerable.Range(0, outerWorkItems);
            Parallel.ForEach(outerRange, index1 =>
            {
                InnerLoop(innerLoopLimit);
            });
        }

        private static void TimeOperation(string desc, Action operation)
        {
            Stopwatch timer = new Stopwatch();
            timer.Start();
            operation();
            timer.Stop();
            string message = string.Format("{0} took {1:mm}:{1:ss}.{1:ff}", desc, timer.Elapsed);
            Console.WriteLine(message);
        }

        static void Main(string[] args)
        {
            TimeOperation("serial work: 1", () => Program.SerialWork(1));
            TimeOperation("serial work: 2", () => Program.SerialWork(2));
            TimeOperation("serial work: 3", () => Program.SerialWork(3));
            TimeOperation("serial work: 4", () => Program.SerialWork(4));
            TimeOperation("serial work: 8", () => Program.SerialWork(8));
            TimeOperation("serial work: 16", () => Program.SerialWork(16));
            TimeOperation("serial work: 32", () => Program.SerialWork(32));
            TimeOperation("serial work: 1k", () => Program.SerialWork(1000));
            TimeOperation("serial work: 10k", () => Program.SerialWork(10000));
            TimeOperation("serial work: 100k", () => Program.SerialWork(100000));
            TimeOperation("parallel work: 1", () => Program.ParallelWork(1));
            TimeOperation("parallel work: 2", () => Program.ParallelWork(2));
            TimeOperation("parallel work: 3", () => Program.ParallelWork(3));
            TimeOperation("parallel work: 4", () => Program.ParallelWork(4));
            TimeOperation("parallel work: 8", () => Program.ParallelWork(8));
            TimeOperation("parallel work: 16", () => Program.ParallelWork(16));
            TimeOperation("parallel work: 32", () => Program.ParallelWork(32));
            TimeOperation("parallel work: 64", () => Program.ParallelWork(64));
            TimeOperation("parallel work: 1k", () => Program.ParallelWork(1000));
            TimeOperation("parallel work: 10k", () => Program.ParallelWork(10000));
            TimeOperation("parallel work: 100k", () => Program.ParallelWork(100000));
            Console.WriteLine("done");
            Console.ReadLine();
        }
    }
}
the results on a 4-core Windows 7 machine are:
serial work: 1 took 00:02.31
serial work: 2 took 00:02.27
serial work: 3 took 00:02.28
serial work: 4 took 00:02.28
serial work: 8 took 00:02.28
serial work: 16 took 00:02.27
serial work: 32 took 00:02.27
serial work: 1k took 00:02.27
serial work: 10k took 00:02.28
serial work: 100k took 00:02.28
parallel work: 1 took 00:02.33
parallel work: 2 took 00:01.14
parallel work: 3 took 00:00.96
parallel work: 4 took 00:00.78
parallel work: 8 took 00:00.84
parallel work: 16 took 00:00.86
parallel work: 32 took 00:00.82
parallel work: 64 took 00:00.80
parallel work: 1k took 00:00.77
parallel work: 10k took 00:00.78
parallel work: 100k took 00:00.77
done
Code compiled against .NET 4 and .NET 4.5 gives much the same results.
The serial work runs are all the same. It doesn't matter how you slice it, it runs in about 2.28 seconds.
The parallel work with 1 iteration is slightly longer than no parallelism at all; 2 items is shorter, so is 3, and with 4 or more iterations it is all about 0.8 seconds.
It is using all cores, but not with 100% efficiency. If the serial work was divided 4 ways with no overhead it would complete in 0.57 seconds (2.28 / 4 = 0.57).
In other scenarios I saw no speed-up at all with 2-3 parallel iterations. You do not have fine-grained control over that with Parallel.ForEach, and the algorithm may decide to "partition" them into just 1 chunk and run it on 1 core if the machine is busy.
There is no lower limit for doing parallel operations. If you have only 2 items to work on but each one will take a while, it might still make sense to use Parallel.ForEach. On the other hand if you have 1000000 items but they don't do very much, the parallel loop might not go any faster than the regular loop.
For example, I wrote a simple program to time nested loops where the outer loop ran both with a for loop and with Parallel.ForEach. I timed it on my 4-CPU (dual-core, hyperthreaded) laptop.
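A minimal sketch of the kind of harness described (the original code was not posted, so the work in the loop body is an assumption):

static long _sink; // accumulate into a field so the work isn't optimised away

static void Compare(int outer, int inner)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < outer; i++)
        for (int j = 0; j < inner; j++)
            _sink += j;
    sw.Stop();
    Console.WriteLine("for loop: " + sw.Elapsed);

    sw.Restart();
    Parallel.For(0, outer, i =>
    {
        long local = 0;
        for (int j = 0; j < inner; j++)
            local += j;
        Interlocked.Add(ref _sink, local); // combine per-worker sums once
    });
    sw.Stop();
    Console.WriteLine("ForEach : " + sw.Elapsed);
}

The two runs below would correspond to Compare(2, 100000000) and Compare(100000000, 2).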
Here's a run with only 2 items to work on, but each takes a while:
2 outer iterations, 100000000 inner iterations:
for loop: 00:00:00.1460441
ForEach : 00:00:00.0842240
Here's a run with millions of items to work on, but they don't do very much:
100000000 outer iterations, 2 inner iterations:
for loop: 00:00:00.0866330
ForEach : 00:00:02.1303315
The only real way to know is to try it.
In general, once you go above a thread per core, each extra thread involved in an operation will make it slower, not faster.
However, if part of each operation will block (the classic example being waiting on disk or network I/O, another being producers and consumers that are out of synch with each other) then more threads than cores can begin to speed things up again, because tasks can be done while other threads are unable to make progress until the I/O operation returns.
For this reason, when single-core machines were the norm, the only real justifications in multi-threading were when either there was blocking of the sort I/O introduces or else to improve responsiveness (slightly slower to perform a task, but much quicker to start responding to user-input again).
Still, these days single-core machines are increasingly rare, so it would appear that you should be able to make everything at least twice as fast with parallel processing.
This will still not be the case if order is important, or if something inherent to the task forces it to have a synchronised bottleneck, or if the number of operations is so small that the speed-up from parallel processing is outweighed by the overhead of setting that parallel processing up. It may or may not be the case if a shared resource requires threads to block on other threads performing the same parallel operation (depending on the degree of lock contention).
Also, if your code is inherently multithreaded to begin with, you can be in a situation where you are essentially competing for resources with yourself (a classic case being ASP.NET code handling simultaneous requests). Here the advantage of parallel operation may mean that a single test operation on a 4-core machine approaches 4 times the performance, but once the number of requests needing the same task performed reaches 4, then, since each of those 4 requests is trying to use every core, it becomes little better than if each had a core to itself (perhaps slightly better, perhaps slightly worse). The benefit of parallel operation hence disappears as the use changes from a single-request test to a real-world multitude of requests.
You shouldn't blindly replace every single foreach loop in your application with a parallel one. More threads don't necessarily mean that your application will work faster. You need to slice the work into smaller tasks that can run in parallel if you want to really benefit from multiple threads. If your algorithm is not parallelizable, you won't get any benefit.
No. You need to understand what the code is doing and whether it is amenable to parallelization. Dependencies between your data items can make it hard to parallelize: if a thread uses the value calculated for the previous element, it has to wait until that value is calculated anyway and can't run in parallel. You also need to understand your target architecture, though you will typically have a multicore CPU on just about anything you buy these days. Even on a single core you can get some benefit from more threads, but only if you have some blocking tasks. You should also keep in mind that there is overhead in creating and organizing the parallel threads; if this overhead is a significant fraction of (or more than) the time your task takes, you could slow it down.
These are my benchmarks showing that pure serial is slowest, along with various levels of partitioning:
class Program
{
    static void Main(string[] args)
    {
        NativeDllCalls(true, 1, 400000000, 0);   // Seconds: 0.67 |)   595,203,995.01 ops
        NativeDllCalls(true, 1, 400000000, 3);   // Seconds: 0.91 |)   439,052,826.95 ops
        NativeDllCalls(true, 1, 400000000, 4);   // Seconds: 0.80 |)   501,224,491.43 ops
        NativeDllCalls(true, 1, 400000000, 8);   // Seconds: 0.63 |)   635,893,653.15 ops
        NativeDllCalls(true, 4, 100000000, 0);   // Seconds: 0.35 |) 1,149,359,562.48 ops
        NativeDllCalls(true, 400, 1000000, 0);   // Seconds: 0.24 |) 1,673,544,236.17 ops
        NativeDllCalls(true, 10000, 40000, 0);   // Seconds: 0.22 |) 1,826,379,772.84 ops
        NativeDllCalls(true, 40000, 10000, 0);   // Seconds: 0.21 |) 1,869,052,325.05 ops
        NativeDllCalls(true, 1000000, 400, 0);   // Seconds: 0.24 |) 1,652,797,628.57 ops
        NativeDllCalls(true, 100000000, 4, 0);   // Seconds: 0.31 |) 1,294,424,654.13 ops
        NativeDllCalls(true, 400000000, 0, 0);   // Seconds: 1.10 |)   364,277,890.12 ops
    }

    static void NativeDllCalls(bool useStatic, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
    {
        if (useStatic)
        {
            Iterate<string, object>(
                (msg, cntxt) => {
                    ServiceContracts.ForNativeCall.SomeStaticCall(msg);
                }
                , "test", null, nonParallelIterations, parallelIterations, maxParallelism);
        }
        else
        {
            var instance = new ServiceContracts.ForNativeCall();
            Iterate(
                (msg, cntxt) => {
                    cntxt.SomeCall(msg);
                }
                , "test", instance, nonParallelIterations, parallelIterations, maxParallelism);
        }
    }

    static void Iterate<T, C>(Action<T, C> action, T testMessage, C context, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
    {
        var start = DateTime.UtcNow;
        if (nonParallelIterations == 0)
            nonParallelIterations = 1; // normalize values
        if (parallelIterations == 0)
            parallelIterations = 1;

        if (parallelIterations > 1)
        {
            ParallelOptions options;
            if (maxParallelism == 0) // default max parallelism
                options = new ParallelOptions();
            else
                options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

            if (nonParallelIterations > 1)
            {
                Parallel.For(0, parallelIterations, options, (j) =>
                {
                    for (int i = 0; i < nonParallelIterations; ++i)
                    {
                        action(testMessage, context);
                    }
                });
            }
            else
            { // no non-parallel iterations
                Parallel.For(0, parallelIterations, options, (j) =>
                {
                    action(testMessage, context);
                });
            }
        }
        else
        {
            for (int i = 0; i < nonParallelIterations; ++i)
            {
                action(testMessage, context);
            }
        }
        var end = DateTime.UtcNow;
        Console.WriteLine("\tSeconds: {0,8:0.00} |) {1,16:0,000.00} ops",
            (end - start).TotalSeconds, (Math.Max(parallelIterations, 1) * nonParallelIterations / (end - start).TotalSeconds));
    }
}
