I'm working on an academic open source project and need to create a fast blocking FIFO queue in C#. My first implementation simply wrapped a synchronized queue (with dynamic expansion) within a reader's semaphore; then I decided to re-implement it in the following (theoretically faster) way:
public class FastFifoQueue<T> where T : class // reference types only: null marks an empty slot
{
private T[] _array;
private int _head, _tail, _count;
private readonly int _capacity;
private readonly Semaphore _readSema, _writeSema;
/// <summary>
/// Initializes FastFifoQueue with the specified capacity
/// </summary>
/// <param name="size">Maximum number of elements to store</param>
public FastFifoQueue(int size)
{
//Check if size is power of 2
//Credit: http://stackoverflow.com/questions/600293/how-to-check-if-a-number-is-a-power-of-2
if ((size & (size - 1)) != 0)
throw new ArgumentOutOfRangeException("size", "Size must be a power of 2 for this queue to work");
_capacity = size;
_array = new T[size];
_count = 0;
_head = int.MinValue; //0 is the same!
_tail = int.MinValue;
_readSema = new Semaphore(0, _capacity);
_writeSema = new Semaphore(_capacity, _capacity);
}
public void Enqueue(T item)
{
_writeSema.WaitOne();
int index = Interlocked.Increment(ref _head);
index %= _capacity;
if (index < 0) index += _capacity;
//_array[index] = item;
Interlocked.Exchange(ref _array[index], item);
Interlocked.Increment(ref _count);
_readSema.Release();
}
public T Dequeue()
{
_readSema.WaitOne();
int index = Interlocked.Increment(ref _tail);
index %= _capacity;
if (index < 0) index += _capacity;
T ret = Interlocked.Exchange(ref _array[index], null);
Interlocked.Decrement(ref _count);
_writeSema.Release();
return ret;
}
public int Count
{
get
{
return _count;
}
}
}
This is the classic FIFO queue implementation over a static array that we find in textbooks. It is designed to atomically increment pointers, and since I can't make a pointer wrap back to zero when it reaches (capacity-1), I compute the modulo separately. In theory, using Interlocked is the same as locking before doing the increment, and since there are semaphores, multiple producers/consumers may enter the queue, but only one at a time is able to modify the queue pointers.
First, because Interlocked.Increment increments and then returns the new value, I understand that I am limited to using the post-increment value and start storing items from position 1 in the array. It's not a problem; the index goes back to 0 when the counter reaches a certain value.
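To make the index arithmetic concrete, here is a small sketch that reproduces the mapping from counter to array slot. The names IndexDemo and SlotIndex are hypothetical and exist only for this illustration; the arithmetic is the same as in Enqueue and Dequeue. Starting from int.MinValue with a capacity of 2048, the first ticket lands on slot 1, and because the capacity is a power of 2 the mapping stays continuous when the counter wraps from int.MaxValue to int.MinValue.

```csharp
using System.Threading;

public static class IndexDemo
{
    // Hypothetical helper: take a ticket from the counter, then map it to a slot.
    public static int SlotIndex(ref int counter, int capacity)
    {
        // Returns the already-incremented value; wraps to int.MinValue on overflow.
        int index = Interlocked.Increment(ref counter);
        // The C# remainder keeps the sign of the dividend, so it can be negative.
        index %= capacity;
        if (index < 0) index += capacity;
        return index;
    }
}
```

Calling it twice on a counter that starts at int.MinValue yields slots 1 and 2; calling it on a counter that holds int.MaxValue yields slot 0.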
What's the problem with it?
You wouldn't believe that, running on heavy loads, sometimes the queue returns a NULL value. I am SURE, repeat, I AM SURE, that no method enqueues null into the queue. This is definitely true because I tried to put a null check in Enqueue to be sure, and no error was thrown. I created a test case for that with Visual Studio (by the way, I use a dual core CPU like maaaaaaaany people)
private int _errors;
[TestMethod()]
public void ConcurrencyTest()
{
const int size = 3; //Perform more tests changing it
_errors = 0;
IFifoQueue<object> queue = new FastFifoQueue<object>(2048);
Thread.CurrentThread.Priority = ThreadPriority.AboveNormal;
Thread[] producers = new Thread[size], consumers = new Thread[size];
for (int i = 0; i < size; i++)
{
producers[i] = new Thread(LoopProducer) { Priority = ThreadPriority.BelowNormal };
consumers[i] = new Thread(LoopConsumer) { Priority = ThreadPriority.BelowNormal };
producers[i].Start(queue);
consumers[i].Start(queue);
}
Thread.Sleep(new TimeSpan(0, 0, 1, 0));
for (int i = 0; i < size; i++)
{
producers[i].Abort();
consumers[i].Abort();
}
Assert.AreEqual(0, _errors);
}
private void LoopProducer(object queue)
{
try
{
IFifoQueue<object> q = (IFifoQueue<object>)queue;
while (true)
{
try
{
q.Enqueue(new object());
}
catch
{ }
}
}
catch (ThreadAbortException)
{ }
}
private void LoopConsumer(object queue)
{
try
{
IFifoQueue<object> q = (IFifoQueue<object>)queue;
while (true)
{
object item = q.Dequeue();
if (item == null) Interlocked.Increment(ref _errors);
}
}
catch (ThreadAbortException)
{ }
}
Whenever a consumer thread gets a null, an error is counted.
When performing the test with 1 producer and 1 consumer, it succeeds. When performing the test with 2 producers and 2 consumers, or more, a disaster happens: as many as 2000 leaks are detected. I found that the problem may be in the Enqueue method. By design contract, a producer can write only into a cell that is empty (null), but after adding some diagnostics to my code I found that sometimes a producer tries to write to a non-empty cell, one that is still occupied by "good" data.
public void Enqueue(T item)
{
_writeSema.WaitOne();
int index = Interlocked.Increment(ref _head);
index %= _capacity;
if (index < 0) index += _capacity;
//_array[index] = item;
T leak = Interlocked.Exchange(ref _array[index], item);
//Diagnostic code
if (leak != null)
{
throw new InvalidOperationException("Too bad...");
}
Interlocked.Increment(ref _count);
_readSema.Release();
}
The "too bad" exception then happens often. But it is strange that a conflict arises from concurrent writes, because increments are atomic and the writer's semaphore allows only as many writers as there are free array cells.
Can somebody help me with that? I would really appreciate if you share your skills and experience with me.
Thank you.
I must say, this struck me as a very clever idea, and I thought about it for a while before I started to realize where (I think) the bug is here. So, on one hand, kudos on coming up with such a clever design! But, at the same time, shame on you for demonstrating "Kernighan's Law":
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
The issue is basically this: you are assuming that the WaitOne and Release calls effectively serialize all of your Enqueue and Dequeue operations; but that isn't quite what is going on here. Remember that the Semaphore class is used to restrict the number of threads accessing a resource, not to ensure a particular order of events. What happens between each WaitOne and Release is not guaranteed to occur in the same "thread-order" as the WaitOne and Release calls themselves.
This is tricky to explain in words, so let me try to provide a visual illustration.
Let's say your queue has a capacity of 8 and looks like this (let 0 represent null and x represent an object):
[ x x x x x x x x ]
So Enqueue has been called 8 times and the queue is full. Therefore your _writeSema semaphore will block on WaitOne, and your _readSema semaphore will return immediately on WaitOne.
Now let's suppose Dequeue is called more or less concurrently on 3 different threads. Let's call these T1, T2, and T3.
Before proceeding let me apply some labels to your Dequeue implementation, for reference:
public T Dequeue()
{
_readSema.WaitOne(); // A
int index = Interlocked.Increment(ref _tail); // B
index %= _capacity;
if (index < 0) index += _capacity;
T ret = Interlocked.Exchange(ref _array[index], null); // C
Interlocked.Decrement(ref _count);
_writeSema.Release(); // D
return ret;
}
OK, so T1, T2, and T3 have all gotten past point A. Then for simplicity let's suppose they each reach line B "in order", so that T1 has an index of 0, T2 has an index of 1, and T3 has an index of 2.
So far so good. But here's the gotcha: there is no guarantee that from here, T1, T2, and T3 are going to get to line D in any specified order. Suppose T3 actually gets ahead of T1 and T2, moving past line C (and thus setting _array[2] to null) and all the way to line D.
After this point, _writeSema will be signaled, meaning you have one slot available in your queue to write to, right? But your queue now looks like this!
[ x x 0 x x x x x ]
So if another thread has come along in the meantime with a call to Enqueue, it will actually get past _writeSema.WaitOne, increment _head, and get an index of 0, even though slot 0 is not empty. The result of this will be that the item in slot 0 could actually be overwritten, before T1 (remember him?) reads it.
To understand where your null values are coming from, you need only to visualize the reverse of the process I just described. That is, suppose your queue looks like this:
[ 0 0 0 0 0 0 0 0 ]
Three threads, T1, T2, and T3, all call Enqueue nearly simultaneously. T3 increments _head last but inserts its item (at _array[2]) and calls _readSema.Release first, resulting in a signaled _readSema but a queue looking like:
[ 0 0 x 0 0 0 0 0 ]
So if another thread has come along in the meantime with a call to Dequeue (before T1 and T2 are finished doing their thing), it will get past _readSema.WaitOne, increment _tail, and get an index of 0, even though slot 0 is empty.
So there's your problem. As for a solution, I don't have any suggestions at the moment. Give me some time to think it over... (I'm posting this answer now because it's fresh in my mind and I feel it might help you.)
(+1 to Dan Tao who I vote has the answer)
The enqueue would be changed to something like this...
while (Interlocked.CompareExchange(ref _array[index], item, null) != null)
;
The dequeue would be changed to something like this...
while( (ret = Interlocked.Exchange(ref _array[index], null)) == null)
;
This builds upon Dan Tao's excellent analysis. Because the indexes are atomically obtained, then (assuming that no threads die or terminate in the enqueue or dequeue methods) a reader is guaranteed to eventually have his cell filled in, or the writer is guaranteed to eventually have his cell freed (null).
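Putting the two loops back into the original class gives something like the sketch below. It is only an illustration under stated assumptions: SpinFifoQueue is a hypothetical name, the counters start at -1 instead of int.MinValue so that the first slot is 0, a bit mask replaces the modulo (valid because the capacity is a power of 2), and SpinWait.SpinOnce is my substitution for the empty loop body. It is meant to show that a reader never returns an empty cell and a writer never overwrites a full one; I have not tried to prove strict FIFO ordering under heavy preemption.

```csharp
using System;
using System.Threading;

// Sketch only: a hypothetical class, not the original FastFifoQueue.
public sealed class SpinFifoQueue<T> where T : class
{
    private readonly T[] _array;
    private readonly int _mask;
    private int _head = -1, _tail = -1; // first Interlocked.Increment yields 0
    private readonly Semaphore _readSema, _writeSema;

    public SpinFifoQueue(int capacity)
    {
        if (capacity <= 0 || (capacity & (capacity - 1)) != 0)
            throw new ArgumentOutOfRangeException(nameof(capacity), "Capacity must be a positive power of 2");
        _array = new T[capacity];
        _mask = capacity - 1;
        _readSema = new Semaphore(0, capacity);
        _writeSema = new Semaphore(capacity, capacity);
    }

    public void Enqueue(T item)
    {
        if (item == null) throw new ArgumentNullException(nameof(item)); // null means "empty slot"
        _writeSema.WaitOne();
        int index = Interlocked.Increment(ref _head) & _mask;
        // Store only once the slot is empty; otherwise wait for the reader that owns it.
        var spinner = new SpinWait();
        while (Interlocked.CompareExchange(ref _array[index], item, null) != null)
            spinner.SpinOnce();
        _readSema.Release();
    }

    public T Dequeue()
    {
        _readSema.WaitOne();
        int index = Interlocked.Increment(ref _tail) & _mask;
        // Take the item only once the writer that owns this slot has stored it.
        T item;
        var spinner = new SpinWait();
        while ((item = Interlocked.Exchange(ref _array[index], null)) == null)
            spinner.SpinOnce();
        _writeSema.Release();
        return item;
    }
}
```

Used from a single thread, new SpinFifoQueue<string>(4) followed by Enqueue("a"), Enqueue("b") returns "a" and then "b" from Dequeue.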
Thank you Dan Tao and Les,
I really appreciate your help. Dan, you opened my mind: it's not important how many producers/consumers are inside the critical section; what matters is that the locks are released in order. Les, you found the solution to the problem.
Now it's time to finally answer my own question with the final code I made thanks to the help of both of you. Well, it's not much but it's a little enhancement from Les's code
Enqueue:
while (Interlocked.CompareExchange(ref _array[index], item, null) != null)
Thread.Sleep(0);
Dequeue:
while ((ret = Interlocked.Exchange(ref _array[index], null)) == null)
Thread.Sleep(0);
Why Thread.Sleep(0)? When we find that an element cannot be retrieved/stored, why check again immediately? I want to give up the rest of the time slice so that other threads get a chance to read/write. Obviously, the next thread to be scheduled could be another thread that is unable to operate, but at least we offer the switch. Source: http://progfeatures.blogspot.com/2009/05/how-to-force-thread-to-perform-context.html
I also tested the code of the previous test case to get proof of my claims:
without sleep(0)
Read 6164150 elements
Wrote 6322541 elements
Read 5885192 elements
Wrote 5785144 elements
Wrote 6439924 elements
Read 6497471 elements
with sleep(0)
Wrote 7135907 elements
Read 6361996 elements
Wrote 6761158 elements
Read 6203202 elements
Wrote 5257581 elements
Read 6587568 elements
I know this is not a "great" discovery and I will win no Turing prize for these numbers. The performance gain is not dramatic, but it is greater than zero. Forcing a context switch allows more read/write operations to be performed (= higher throughput).
To be clear: in my test, I merely evaluate the performance of the queue, not simulate a producer/consumer problem, so don't care if at the end of the test after a minute there are still elements in queue. But I just demonstrated my approach works, thanks to you all.
Code available open source as MS-RL: http://logbus-ng.svn.sourceforge.net/viewvc/logbus-ng/trunk/logbus-core/It.Unina.Dis.Logbus/Utils/FastFifoQueue.cs?revision=461&view=markup
Related
Assume I have Producer-Consumer pattern where the consumer can also produce additional work. Essentially, imagine a list with 1000 integers:
var LL = new List<int> {1, 2, 3, ....., 1000};
I want to multi-thread sum - so I am taking 2 numbers at a time, summing them and adding the result back to LL. I would do this until there is only 1 entry left in LL when the last outstanding thread returns.
My experimental code looks like this:
var LL = Enumerable.Range(1, 1000).ToList();
Func<int, int, int> sum = (a, b) => { return a + b; };
object o = new object();
int outstandingThreads = 0;
while (LL.Count > 1 || outstandingThreads > 0)
{
//Note that I set an upper bound of 8 simultaneous Threads
if (LL.Count > 1 && outstandingThreads < 8)
{
var l1 = LL[0];
LL.RemoveAt(0);
var l2 = LL[0];
LL.RemoveAt(0);
Interlocked.Increment(ref outstandingThreads);
var t = Task.Factory.StartNew(() =>
{
var rr = l1 + l2;
// In practice I would use a ConcurrentBag and not explicitly log
lock (o)
{
LL.Add(rr);
}
Interlocked.Decrement(ref outstandingThreads);
}, CancellationToken.None,
TaskCreationOptions.DenyChildAttach,
TaskScheduler.Default);
}
}
I'm scratching my head as this is not working. I get a different result almost every time. I must be hitting a race condition that I cannot see. Please note, that processing a List is not my actual test case, just a simplification. If there's a better pattern I could be using, I'm also all ears. Multithreading, as you can see, is not my forte.
You've got a lock around Add, but RemoveAt is also a modification of the list.
Why no lock around that?
A race may happen between .Add on a worker thread and .RemoveAt on the main thread, and it could corrupt the .Count property that the List caches (calculating .Count by walking the whole list would be overkill, so the List surely caches it). Both Add and RemoveAt do two things: they modify the list items and update .Count. Even if it doesn't crash, the list may end up in an inconsistent state, so yes, I think that's it.
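One way to close that race is to put every access to the list, reads included, under the same lock. The following is a minimal sketch under my own assumptions, not the asker's code: the names PairwiseSum, Run and Take are hypothetical, values are taken from the end of the list rather than the front, and the main thread sleeps in Monitor.Wait instead of spinning. It expects n to be at least 1.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class PairwiseSum
{
    // Sums 1..n by repeatedly replacing two values with their sum on worker tasks.
    public static int Run(int n, int maxTasks = 8)
    {
        var list = Enumerable.Range(1, n).ToList();
        var gate = new object();   // guards list and outstanding
        int outstanding = 0;
        while (true)
        {
            int a, b;
            lock (gate)
            {
                // Wait while there is nothing to pair up or too many tasks are running.
                while (list.Count < 2 || outstanding >= maxTasks)
                {
                    if (outstanding == 0) return list.Single(); // no task running, one value left
                    Monitor.Wait(gate);                         // releases the lock while sleeping
                }
                a = Take(list);
                b = Take(list);
                outstanding++;
            }
            Task.Run(() =>
            {
                int sum = a + b;            // the "work" happens outside the lock
                lock (gate)
                {
                    list.Add(sum);
                    outstanding--;
                    Monitor.Pulse(gate);    // wake the main loop
                }
            });
        }
    }

    // Removes and returns the last element; the caller must hold the lock.
    private static int Take(List<int> list)
    {
        int last = list[list.Count - 1];
        list.RemoveAt(list.Count - 1);
        return last;
    }
}
```

With this structure PairwiseSum.Run(1000) returns 500500 on every run, because Count, RemoveAt, Add and the task counter are only ever touched while holding the lock.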
I have 3 threads in 3 classes running in parallel. Each of them increases Pos or Neg of Fourthclass by 1. After the 3 threads are done, if Fourthclass.Pos > Fourthclass.Neg, Terminal4 should run.
Q: How can I run Terminal4 only once? Putting Fourthclass.Terminal4(); in each of Terminal1-2-3 will run Terminal4 3 times.
Here is what I have done:
public class Firstclass
{
static int Pos = 1;
static int Neg = 0;
public static void Terminal1()
{
if (Pos > Neg)
{
Fourthclass.Pos += 1;
// Fourthclass.Terminal4();
}
}
}
public class Secondclass
{
static int Pos = 1;
static int Neg = 0;
public static void Terminal2()
{
if (Pos > Neg)
{
Fourthclass.Pos += 1;
// Fourthclass.Terminal4();
}
}
}
public class Thirdclass
{
static int Pos = 1;
static int Neg = 0;
public static void Terminal3()
{
if (Pos > Neg)
{
Fourthclass.Neg += 1;
// Fourthclass.Terminal4();
}
}
}
public static class Fourthclass
{
public static int Pos = 0;
public static int Neg = 0;
public static void Terminal4()
{
if (Pos > Neg)
{
Console.WriteLine("Pos = {0} - Neg = {1}", Pos, Neg);
Console.WriteLine();
}
else { Console.WriteLine("fail"); }
}
}
class Program
{
static void Main(string[] args)
{
Thread obj1 = new Thread(new ThreadStart(Firstclass.Terminal1));
Thread obj2 = new Thread(new ThreadStart(Secondclass.Terminal2));
Thread obj3 = new Thread(new ThreadStart(Thirdclass.Terminal3));
obj1.Start();
obj2.Start();
obj3.Start();
}
}
Original Answer
By the way, these increments are not thread safe: += is a read-modify-write sequence, so concurrent updates can be lost, and that is before considering thread visibility problems.
For that problem, please use Interlocked. Interlocked.Increment and Interlocked.Decrement will take care of it.
Now, for making a code block run only once, keep an int variable that will be 1 if it did run, and 0 if it did not. Then use Interlocked.CompareExchange:
int didrun;
// ...
if (Interlocked.CompareExchange(ref didrun, 1, 0) == 0)
{
// code here will only run once, until you put didrun back to 0
}
There are other ways, of course. This one is just very versatile.
Addendum: Ok, what it does...
Interlocked.CompareExchange will look at the value of the passed variable (didrun) compare it to the second parameter (0) and if it matches, it will change the variable to the value of first parameter (1) [Since it may change the variable, you have to pass it by ref]. The return value is what it found in the variable.
Thus, if it returns 0, you know it found 0, which means that it did update the value to 1. Now, the next time this piece of code is called, the value of the variable is 1, so Interlocked.CompareExchange returns 1 and the thread does not enter the block.
OK, why not use a bool instead? Because of thread visibility. A thread may change the value of the variable, but this update could sit in the CPU cache only and not be visible to other threads... Interlocked gets around that problem. Just use Interlocked. And MSDN is your friend.
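Here is a small self-contained sketch of that run-once flag, exercised by several plain threads. The names OnceFlag, TryEnter and CountWinners are hypothetical and only serve the illustration; the core is the same Interlocked.CompareExchange call as above.

```csharp
using System.Threading;

public sealed class OnceFlag
{
    private int _didRun; // 0 = has not run yet, 1 = already ran

    // Returns true for exactly one caller, no matter how many threads race here.
    public bool TryEnter()
    {
        return Interlocked.CompareExchange(ref _didRun, 1, 0) == 0;
    }
}

public static class OnceDemo
{
    // Starts threadCount threads that all try to enter; returns how many got in.
    public static int CountWinners(int threadCount)
    {
        var flag = new OnceFlag();
        int winners = 0;
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                if (flag.TryEnter())
                    Interlocked.Increment(ref winners); // stands in for the call to Terminal4
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join(); // wait until every thread has finished
        return winners;
    }
}
```

Note that the flag only guarantees "at most once"; it does not wait for the other threads. If Terminal4 has to see the updates of all three threads, joining them first and calling it once afterwards is the simpler route.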
That will work regardless if you are using ThreadPool, Task or async/await or just plain old Threads as you do. I mention that because I would like to suggest using those...
Sneaky link to Threading in C#.
Extended Answer
In comment, you ask about a different behavior:
The Terminal4 has cooldown until the next run
Well, if there is a cool down (which I understand as a period), you need to store not only whether or not the code did run, but also when it last ran.
Now, the conditional cannot be just "run only if it has not run yet" but instead "run if it has not run yet or if the period from the last time it ran to now is greater than the cool down".
We have to check multiple things, that is a problem. Now the check will no longer be atomic (from the Latin atomus which means indivisible, from a- "not" + tomos "a cutting, slice, volume, section").
That is relevant because if the check is not atomic, we are back to a race condition (check-then-act).
I will use this case to explain the race. If we encode the following:
1. Check if the operation has not run (if it has not go to 4)
2. Get the last time it ran
3. Compute the difference from the last run to now (exit if less than cool down)
4. Update the last run time to now
5. Run code
Two threads may do the following:
|
t Thread1: Check if the operation has not run (it has)
i Thread2: Check if the operation has not run (it has)
m Thread2: Get the last time it ran
e Thread1: Get the last time it ran
| Thread1: Compute the difference from the last run to now (more than cool down)
v Thread2: Compute the difference from the last run to now (more than cool down)
Thread2: Update the last run time to now
Thread2: Run code
Thread1: Update the last run time to now
Thread1: Run code
As you see, they both Run code.
What we need is a way to check and update in a single atomic operation, that way the act of checking will alter the result of the other thread. That is what we get with Interlocked.
How Interlocked manages to do that is beyond the scope of the question. Suffice to say that there are some special CPU instructions for that.
The new pattern I suggest is the following (pseudocode):
bool canRun = false;
DateTime lastRunCopy;
DateTime now = DateTime.Now;
if (try to set lastRun to now if lastRun is not set, copy lastRun to lastRunCopy)
{
// We set lastRun
canRun = true;
}
else
{
if ((now - lastRunCopy) >= cooldown)
{
if (try to set lastRun to now if lastRun = lastRunCopy, copy lastRun to lastRunCopy)
{
// we updated it
canRun = true;
}
else
{
// Another thread got in
}
}
}
if (canRun)
{
// code here will only run once per cool down
}
Notice I have expressed the operations in terms of "try to set X to Y if X is Z, copy X to W" which is how Interlocked.CompareExchange works.
There is a reason I left that in pseudo code, and that is that DateTime is not an atomic type.
In order to make the code work we will have to use DateTime.Ticks. For an unset value we will use 0 (00:00:00.0000000 UTC, January 1, 0001), which is something you have to worry about for a cool down greater than a couple of millennia.
In addition, of course, we will use the overload of Interlocked.CompareExchange that takes long because DateTime.Ticks is of that type.
Note: Ah, we will use TimeSpan.Ticks for the cool down.
The code is as follows:
long lastRun = 0;
long cooldown = TimeSpan.FromSeconds(1).Ticks; // Or whatever, I do not know.
// ...
bool canRun = false;
long lastRunCopy = 0;
long now = DateTime.Now.Ticks;
lastRunCopy = Interlocked.CompareExchange(ref lastRun, now, 0);
if (lastRunCopy == 0)
{
// We set lastRun
canRun = true;
}
else
{
if ((now - lastRunCopy) >= cooldown) // run again only once the cool down has elapsed
{
if (Interlocked.CompareExchange(ref lastRun, now, lastRunCopy) == lastRunCopy)
{
// we updated it
canRun = true;
}
else
{
// Another thread got in
}
}
}
if (canRun)
{
// code here will only run once per cooldown
}
Alternatively, if you want to boil it down to a single conditional:
long lastRun = 0;
long cooldown = TimeSpan.FromSeconds(1).Ticks; // Or whatever, I do not know.
// ...
long lastRunCopy;
var now = DateTime.Now.Ticks;
if
(
(lastRunCopy = Interlocked.CompareExchange(ref lastRun, now, 0)) == 0
|| now - lastRunCopy >= cooldown
&& Interlocked.CompareExchange(ref lastRun, now, lastRunCopy) == lastRunCopy
)
{
// code here will only run once per cooldown
}
As I said, Interlocked.CompareExchange is versatile. Although, as you can see, you still need to think around the requirements.
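For reference, the same idea can be wrapped in a small class so that it can be exercised without waiting for real time to pass. This is a sketch under my own assumptions: CooldownGate and TryEnter are hypothetical names, the current time is passed in as ticks (which must be non-zero, since 0 means "never ran"), and the gate opens when at least the cool down has elapsed since the last run.

```csharp
using System;
using System.Threading;

public sealed class CooldownGate
{
    private long _lastRun;            // ticks of the last run, 0 = never ran
    private readonly long _cooldown;  // cool down in ticks

    public CooldownGate(TimeSpan cooldown)
    {
        _cooldown = cooldown.Ticks;
    }

    // Returns true if the caller may run now; at most one caller wins per cool down.
    public bool TryEnter(long nowTicks)
    {
        // First run: claim the slot if nobody has run yet.
        long seen = Interlocked.CompareExchange(ref _lastRun, nowTicks, 0);
        if (seen == 0) return true;
        // Still cooling down.
        if (nowTicks - seen < _cooldown) return false;
        // Cool down elapsed: only the thread that swaps out the value it saw wins.
        return Interlocked.CompareExchange(ref _lastRun, nowTicks, seen) == seen;
    }

    // Convenience overload using the system clock.
    public bool TryEnter()
    {
        return TryEnter(DateTime.UtcNow.Ticks);
    }
}
```

With a one-second cool down, TryEnter(1000) succeeds, TryEnter(1005) is rejected, and TryEnter(1000 + TimeSpan.TicksPerSecond) succeeds again.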
I have a List to loop over using multiple threads. I will get the first item of the List, do some processing, then remove the item.
When the count of the List is no longer greater than 0, fetch more data from the database.
In a word:
I have a lot of records in my database. I need to publish them to my server. In the process of publishing, multithreading is required, and the number of threads may be 10 or less.
For example:
private List<string> list;
void LoadDataFromDatabase(){
list=...;//load data from database...
}
void DoMethod()
{
while(list.Count>0)
{
var item=list.FirstOrDefault();
list.RemoveAt(0);
DoProcess();//how to use multi-thread (custom the count of threads)?
if(list.Count<=0)
{
LoadDataFromDatabase();
}
}
}
Please help me. I'm a beginner in C#; I have searched for a lot of solutions, but found nothing similar.
Also, I need to customize the number of threads.
Does your processing of the list have to be sequential? In other words, can you not process element n + 1 before you have finished processing element n? If that is your case, then multi-threading is not the right solution.
Otherwise, if your elements can be processed fully independently, you can use m threads, assigning Elements.Count / m elements to each thread.
Example: printing a list:
List<int> a = new List<int> { 1, 2, 3, 4,5 , 6, 7, 8, 9 , 10 };
int num_threads = 2;
int thread_elements = a.Count / num_threads;
// start the threads
Thread[] threads = new Thread[num_threads];
for (int i = 0; i < num_threads; ++i)
{
threads[i] = new Thread(new ParameterizedThreadStart(Work)); // Start(i) passes an argument, so ThreadStart would throw
threads[i].Start(i);
}
// this works fine if the total number of elements is divisible by num_threads
// but if we have 500 elements, 7 threads, then thread_elements = 500 / 7 = 71
// but 71 * 7 = 497, so that there are 3 elements not processed
// process them here:
int actual = thread_elements * num_threads;
for (int i = actual; i < a.Count; ++i)
Console.WriteLine(a[i]);
// wait all threads to finish
for (int i = 0; i < num_threads; ++i)
{
threads[i].Join();
}
void Work(object arg)
{
Console.WriteLine("Thread #" + arg + " has begun...");
// calculate my working range [start, end)
int id = (int)arg;
int mystart = id * thread_elements;
int myend = (id + 1) * thread_elements;
// start work on my range !!
for (int i = mystart; i < myend; ++i)
Console.WriteLine("Thread #" + arg + " Element " + a[i]);
}
ADD For your case (uploading to server), it is the same as the code above. You choose a number of threads, and each thread is assigned a number of elements (calculated automatically in the variable thread_elements, so you only need to change num_threads). In the method Work, all you need to do is replace the line Console.WriteLine("Thread #" + arg + " Element " + a[i]); with your uploading code.
One more thing to keep in mind: multi-threading depends on your machine's CPU. If your CPU has 4 cores, for example, then CPU-bound work performs best with at most 4 threads, one per core. If you have 10 threads, for example, they would be slower than 4 threads because they compete for CPU cores (unless the threads are idle, waiting for some event to occur, e.g. uploading; in that case 10 threads can run, because they don't take 100% of CPU usage).
WARNING: DO NOT modify the list while any thread is working (add, remove, set element...), and do not assign the same element to two threads. Such things cause a lot of bugs and exceptions!!!
This is a simple scenario that can be expanded in multiple ways if you add some details to your requirements:
IEnumerable<Data> LoadDataFromDatabase()
{
return ...
}
void ProcessInParallel()
{
while(true)
{
var data = LoadDataFromDatabase().ToList();
if(!data.Any()) break;
data.AsParallel().ForAll(ProcessSingleData); // PLINQ's parallel "for each" is called ForAll
}
}
void ProcessSingleData(Data d)
{
// do something with data
}
There are many ways to approach this. You can create threads and partition the list yourself, or you can take advantage of the TPL and use Parallel.ForEach. In the example on the link you see that an Action is called for each member of the list being iterated over. If this is your first taste of threading, I would also attempt to do it the old-fashioned way.
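Since the question asks for a configurable number of threads, here is a minimal sketch of the Parallel.ForEach route with a cap on the degree of parallelism. Publisher and PublishAll are hypothetical names, and the upload is replaced by a placeholder that just collects the records; only ParallelOptions.MaxDegreeOfParallelism and Parallel.ForEach are the actual library pieces.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Publisher
{
    // Processes every record, using at most maxThreads records at the same time.
    // Returns the number of records that were handled.
    public static int PublishAll(IEnumerable<string> records, int maxThreads)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };
        var published = new ConcurrentBag<string>(); // thread-safe collection for the results

        Parallel.ForEach(records, options, record =>
        {
            // Placeholder for the real upload; each call may run on a different thread.
            published.Add(record.ToUpperInvariant());
        });

        return published.Count;
    }
}
```

A call such as Publisher.PublishAll(batch, 10) would handle one batch loaded from the database; the outer loop that reloads data when the batch is exhausted stays sequential.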
Here is my opinion ;)
You can avoid multithreading if your "List" is not really huge.
Instead of a List, you can use a Queue (FIFO - First In First Out). Then just use the Dequeue() method to get one element from the Queue, do some work, and get the next one. Something like:
while(queue.Count > 0)
{
var temp = DoSomeWork(queue.Dequeue());
}
I think that this will be better for your purpose.
I will get the first item of the List and do some processing,then remove the item.
Bad.
First, you want a queue, not a list.
Second, you do not process then remove, you remove THEN process.
Why?
So that you keep the locks small. Lock list access (note you need to synchronize access), remove, THEN unlock immediately and then process. This way you keep the locks short. If you take, process, then remove, you are basically single-threaded, as you have to keep the lock in place while processing so that the next thread does not take the same item again.
And as you need to synchronize access and want multiple threads, this is about the only way.
Read up on the lock statement for a start (you can later move to something like a spinlock). Do NOT use threads unless you have to; schedule Tasks instead (using the Tasks interface new in .NET 4.0), which gives you more flexibility.
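The "remove under the lock, then process outside it" pattern could look roughly like the sketch below. LockedQueueWorkers, Drain and Process are hypothetical names, and Process is an empty placeholder for the real work; the number of worker tasks is a parameter.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class LockedQueueWorkers
{
    // Drains the queue with workerCount tasks; returns how many items were processed.
    public static int Drain(Queue<string> queue, int workerCount)
    {
        var gate = new object();
        int processed = 0;
        var workers = new Task[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = Task.Run(() =>
            {
                while (true)
                {
                    string item;
                    lock (gate)                 // short lock: only the removal is protected
                    {
                        if (queue.Count == 0) return;
                        item = queue.Dequeue();
                    }
                    Process(item);              // the slow part runs without holding the lock
                    Interlocked.Increment(ref processed);
                }
            });
        }
        Task.WaitAll(workers);
        return processed;
    }

    // Placeholder for the real per-item work.
    private static void Process(string item)
    {
    }
}
```

Because each item is removed before it is processed, no two workers can pick up the same item, and the lock is held only for the duration of a Dequeue call.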
I have a program which color codes a returned results set a certain way depending on what the results are. Due to the length of time it takes to color-code the results (currently being done with Regex and RichTextBox.Select + .SelectionColor), I cut off color-coding at 400 results. At around that number it takes about 20 seconds, which is just about max time of what I'd consider reasonable.
To try and improve performance I rewrote the Regex part to use a Parallel.ForEach loop to iterate through the MatchCollection, but the time was about the same (18-19 seconds vs 20)! Is this just not a job that lends itself to parallel programming very well? Should I try something different? Any advice is welcome. Thanks!
PS: I thought it was a bit strange that my CPU utilization never went above 14%, with or without Parallel.ForEach.
Code
MatchCollection startMatches = Regex.Matches(tempRTB.Text, startPattern);
object locker = new object();
System.Threading.Tasks.Parallel.ForEach(startMatches.Cast<Match>(), m =>
{
int i = 0;
foreach (Group g in m.Groups)
{
if (i > 0 && i < 5 && g.Length > 0)
{
tempRTB.Invoke(new Func<bool>(
delegate
{
lock (locker)
{
tempRTB.Select(g.Index, g.Length);
if ((i & 1) == 0) // Even number
tempRTB.SelectionColor = Namespace.Properties.Settings.Default.ValueColor;
else // Odd number
tempRTB.SelectionColor = Namespace.Properties.Settings.Default.AttributeColor;
return true;
}
}));
}
else if (i == 5 && g.Length > 0)
{
var result = tempRTB.Invoke(new Func<string>(
delegate
{
lock (locker)
{
return tempRTB.Text.Substring(g.Index, g.Length);
}
}));
MatchCollection subMatches = Regex.Matches((string)result, pattern);
foreach (Match subMatch in subMatches)
{
int j = 0;
foreach (Group subGroup in subMatch.Groups)
{
if (j > 0 && subGroup.Length > 0)
{
tempRTB.Invoke(new Func<bool>(
delegate
{
lock (locker)
{
tempRTB.Select(g.Index + subGroup.Index, subGroup.Length);
if ((j & 1) == 0) // Even number
tempRTB.SelectionColor = Namespace.Properties.Settings.Default.ValueColor;
else // Odd number
tempRTB.SelectionColor = Namespace.Properties.Settings.Default.AttributeColor;
return true;
}
}));
}
j++;
}
}
}
i++;
}
});
Virtually no aspect of your program is actually able to run in parallel.
The generation of the matches needs to be done sequentially. It can't find the second match until it has already found the first. Parallel.ForEach will, at best, allow you to process the results of the sequence in parallel, but they are still generated sequentially. This is where the majority of your time consuming work seems to be, and there are no gains there.
On top of that, you aren't really processing the results in parallel either. The majority of code run in the body of your loop is all inside an invoke to the UI thread, which means it's all being run by a single thread.
In short, only a tiny, tiny bit of your program is actually run in parallel, and using parallelization in general adds some overhead; it sounds like you're just barely getting more than that overhead. There isn't really much that you did wrong; the operation just inherently doesn't lend itself to parallelization, unless there is an effective way of breaking up the initial string into several smaller chunks that the regex can parse individually (in parallel).
The most time in your code is most likely spent in the part that actually selects the text in the richtext box and sets the color.
This code is impossible to execute in parallel, because it has to be marshalled to the UI thread - which you do via tempRTB.Invoke.
Furthermore, you explicitly make sure that the highlighting is not executed in parallel but sequentially by using the lock statement. This is unnecessary, because all of that code is run on the single UI thread anyway.
You could try to improve your performance by suspending the layout of your UI while you select and color the text in the RTB:
tempRTB.SuspendLayout();
// your loop
tempRTB.ResumeLayout();
The following ruby code runs in ~15s. It barely uses any CPU/Memory (about 25% of one CPU):
def collatz(num)
num.even? ? num/2 : 3*num + 1
end
start_time = Time.now
max_chain_count = 0
max_starter_num = 0
(1..1000000).each do |i|
count = 0
current = i
current = collatz(current) and count += 1 until (current == 1)
max_chain_count = count and max_starter_num = i if (count > max_chain_count)
end
puts "Max starter num: #{max_starter_num} -> chain of #{max_chain_count} elements. Found in: #{Time.now - start_time}s"
And the following TPL C# puts all my 4 cores to 100% usage and is orders of magnitude slower than the ruby version:
static void Euler14Test()
{
Stopwatch sw = new Stopwatch();
sw.Start();
int max_chain_count = 0;
int max_starter_num = 0;
object locker = new object();
Parallel.For(1, 1000000, i =>
{
int count = 0;
int current = i;
while (current != 1)
{
current = collatz(current);
count++;
}
if (count > max_chain_count)
{
lock (locker)
{
max_chain_count = count;
max_starter_num = i;
}
}
if (i % 1000 == 0)
Console.WriteLine(i);
});
sw.Stop();
Console.WriteLine("Max starter i: {0} -> chain of {1} elements. Found in: {2}s", max_starter_num, max_chain_count, sw.Elapsed.ToString());
}
static int collatz(int num)
{
return num % 2 == 0 ? num / 2 : 3 * num + 1;
}
How come ruby runs faster than C#? I've been told that Ruby is slow. Is that not true when it comes to algorithms?
Perf AFTER correction:
Ruby (Non parallel): 14.62s
C# (Non parallel): 2.22s
C# (With TPL): 0.64s
Actually, the bug is quite subtle, and has nothing to do with threading. The reason that your C# version takes so long is that the intermediate values computed by the collatz method eventually start to overflow the int type, resulting in negative numbers which may then take ages to converge.
This first happens when i is 134,379, for which the 129th term (assuming one-based counting) is 2,482,111,348. This exceeds the maximum value of 2,147,483,647 and therefore gets stored as -1,812,855,948.
To get good performance (and correct results) on the C# version, just change:
int current = i;
…to:
long current = i;
…and:
static int collatz(int num)
…to:
static long collatz(long num)
That will bring your running time down to a respectable 1.5 seconds.
Edit: CodesInChaos raises a very valid point about enabling overflow checking when debugging math-oriented applications. Doing so would have allowed the bug to be immediately identified, since the runtime would throw an OverflowException.
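To illustrate the overflow and the effect of overflow checking, here is a small sketch. The class and method names are hypothetical. The input 827,370,449 is simply the odd term that precedes the value quoted above, since (2,482,111,348 - 1) / 3 = 827,370,449; the checked version throws on it, the unchecked version silently wraps to the negative number mentioned above, and the long version gives the correct result.

```csharp
using System;

public static class CollatzOverflow
{
    // int arithmetic with overflow checking: throws OverflowException instead of wrapping.
    public static int NextChecked(int num)
    {
        checked { return num % 2 == 0 ? num / 2 : 3 * num + 1; }
    }

    // int arithmetic without overflow checking: wraps around silently.
    public static int NextUnchecked(int num)
    {
        unchecked { return num % 2 == 0 ? num / 2 : 3 * num + 1; }
    }

    // long arithmetic: enough headroom for every starting value below one million.
    public static long NextLong(long num)
    {
        return num % 2 == 0 ? num / 2 : 3 * num + 1;
    }

    // Number of steps needed to reach 1, computed with long arithmetic.
    public static int Steps(long start)
    {
        int steps = 0;
        for (long current = start; current != 1; current = NextLong(current))
            steps++;
        return steps;
    }
}
```

The same checking can be switched on for a whole project with the compiler's overflow checking option, which is what makes this kind of bug show up immediately while debugging.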
Should be:
Parallel.For(1L, 1000000L, i =>
{
Otherwise, you get integer overflow and start checking negative values. Likewise, the collatz method should operate on long values.
I experienced something like that, and I figured out that it's because each of your loop iterations needs to start another thread, and this takes some time. In this case that time is comparable to (I think it's more than) the time of the operations you actually do in the loop body.
There is an alternative: you can get the number of CPU cores you have and then use a parallel loop with as many iterations as you have cores. Each iteration evaluates part of the actual loop you want, by means of an inner for loop whose bounds depend on the parallel loop index.
EDIT: EXAMPLE
int start = 1, end = 1000000;
Parallel.For(0, N_CORES, n =>
{
int s = start + (end - start) * n / N_CORES;
int e = n == N_CORES - 1 ? end : start + (end - start) * (n + 1) / N_CORES;
for (int i = s; i < e; i++)
{
// Your code
}
});
You should try this code, I'm pretty sure this will do the job faster.
EDIT: ELUCIDATION
Well, quite a long time since I answered this question, but I faced the problem again and finally understood what's going on.
I had been using the AForge implementation of the Parallel for loop, and it seems that it fires a thread for each iteration of the loop; that's why, if the loop body takes a relatively small amount of time to execute, you end up with inefficient parallelism.
So, as some of you pointed out, System.Threading.Tasks.Parallel methods are based on Tasks, which are kind of a higher level of abstraction of a Thread:
"Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms that determine and adjust to the number of threads and that provide load balancing to maximize throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism."
So yes, if you use the default library's implementation, you won't need this kind of workaround.
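If chunking is still wanted with the built-in library, for example because the loop body is very small, the range can be split with Partitioner.Create instead of by hand. The following is only a sketch: ChunkedCollatz and Longest are hypothetical names, fromInclusive is assumed to be at least 1, and ties are resolved in favor of the smaller starting number so that the result does not depend on scheduling. Each chunk keeps its own local maximum and takes the lock once, which also avoids the unsynchronized "check, then lock" on the shared maximum in the question's code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ChunkedCollatz
{
    // Finds the starting number in [fromInclusive, toExclusive) with the longest chain.
    public static (int Starter, int Steps) Longest(int fromInclusive, int toExclusive)
    {
        int best = 0, bestSteps = -1;
        var gate = new object();

        // Partitioner.Create hands each worker a (from, to) range instead of single numbers.
        Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive), range =>
        {
            int localBest = 0, localSteps = -1;
            for (int i = range.Item1; i < range.Item2; i++)
            {
                int steps = 0;
                // long avoids the overflow discussed in the accepted answer.
                for (long current = i; current != 1; steps++)
                    current = current % 2 == 0 ? current / 2 : 3 * current + 1;
                if (steps > localSteps) { localSteps = steps; localBest = i; }
            }
            lock (gate) // one lock acquisition per chunk, not per number
            {
                if (localSteps > bestSteps || (localSteps == bestSteps && localBest < best))
                {
                    bestSteps = localSteps;
                    best = localBest;
                }
            }
        });

        return (best, bestSteps);
    }
}
```

For the numbers 1 through 9, ChunkedCollatz.Longest(1, 10) reports 9 as the starter with 19 steps; the question's range would be Longest(1, 1000000).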