Multithreading speed issue

I added a multithreading part to my code:
public class ThreadClassSeqGroups
{
public Dictionary<string, string> seqGroup;
public Dictionary<string, List<SearchAlgorithm.CandidateStr>> completeModels;
public Dictionary<string, List<SearchAlgorithm.CandidateStr>> partialModels;
private Thread nativeThread;
public ThreadClassSeqGroups(Dictionary<string, string> seqs)
{
seqGroup = seqs;
completeModels = new Dictionary<string, List<SearchAlgorithm.CandidateStr>>();
partialModels = new Dictionary<string, List<SearchAlgorithm.CandidateStr>>();
}
public void Run(DescrStrDetail dsd, DescrStrDetail.SortUnit primarySeedSu,
List<ushort> secondarySeedOrder, double partialCutoff)
{
nativeThread = new Thread(() => this._run(dsd, primarySeedSu, secondarySeedOrder, partialCutoff));
nativeThread.Priority = ThreadPriority.Highest;
nativeThread.Start();
}
public void _run(DescrStrDetail dsd, DescrStrDetail.SortUnit primarySeedSu,
List<ushort> secondarySeedOrder, double partialCutoff)
{
int groupSize = this.seqGroup.Count;
int seqCount = 0;
foreach (KeyValuePair<string, string> p in seqGroup)
{
Console.WriteLine("ThreadID {0} (priority:{1}):\t#{2}/{3} SeqName: {4}",
nativeThread.ManagedThreadId, nativeThread.Priority.ToString(), ++seqCount, groupSize, p.Key);
List<SearchAlgorithm.CandidateStr> tmpCompleteModels, tmpPartialModels;
SearchAlgorithm.SearchInBothDirections(
p.Value.ToUpper().Replace('T', 'U'), dsd, primarySeedSu, secondarySeedOrder, partialCutoff,
out tmpCompleteModels, out tmpPartialModels);
completeModels.Add(p.Key, tmpCompleteModels);
partialModels.Add(p.Key, tmpPartialModels);
}
}
public void Join()
{
nativeThread.Join();
}
}
class Program
{
public static int _paramSeqGroupSize = 2000;
static void Main(Dictionary<string, string> rawSeqs)
{
// Split the whole rawSeqs (Dict<name, seq>) into several groups
Dictionary<string, string>[] rawSeqGroups = SplitSeqFasta(rawSeqs, _paramSeqGroupSize);
// Create a thread for each seqGroup and run
var threadSeqGroups = new MultiThreading.ThreadClassSeqGroups[rawSeqGroups.Length];
for (int i = 0; i < rawSeqGroups.Length; i++)
{
threadSeqGroups[i] = new MultiThreading.ThreadClassSeqGroups(rawSeqGroups[i]);
//threadSeqGroups[i].SetPriority();
threadSeqGroups[i].Run(dsd, primarySeedSu, secondarySeedOrder, _paramPartialCutoff);
}
// Merge results from threads after the thread finish
var allCompleteModels = new Dictionary<string, List<SearchAlgorithm.CandidateStr>>();
var allPartialModels = new Dictionary<string, List<SearchAlgorithm.CandidateStr>>();
foreach (MultiThreading.ThreadClassSeqGroups t in threadSeqGroups)
{
t.Join();
foreach (string name in t.completeModels.Keys)
{
allCompleteModels.Add(name, t.completeModels[name]);
}
foreach (string name in t.partialModels.Keys)
{
allPartialModels.Add(name, t.partialModels[name]);
}
}
}
}
However, with multiple threads it runs much slower than with a single thread, and the CPU load is generally below 10%.
For example:
The input file contains 2500 strings.
_paramSeqGroupSize = 3000: main thread + 1 calculation thread takes 200 sec.
_paramSeqGroupSize = 400: main thread + 7 calculation threads takes much longer (I killed it after more than 10 minutes).
Is there a problem with my implementation? How can I speed it up?
Thanks.

It seems to me that you are trying to process a file in parallel with multiple threads. This is a bad idea, assuming you have a single mechanical disk.
Basically, the head of the disk needs to seek the next reading location for each read request. This is a costly operation and since multiple threads issue read commands it means the head gets bounced around as each thread gets its turn to run. This will drastically reduce performance compared to the case where a single thread is doing the reading.

What was the code prior to multithreading? It's hard to tell what this code is doing, and much of the "working" code seems to be hidden in your search algorithm. However, some thoughts:
You mention an "input file", but this is not clearly shown in code - if your file access is being threaded, this will not increase performance as the file access will be the bottleneck.
Creating more threads than you have CPU cores will ultimately reduce performance (unless each thread is blocked waiting on different resources). In your case I would suggest that 8 total threads is too many.
It seems that a lot of data (memory) access might go through your class DescrStrDetail, which is passed from the variable dsd in your Main method to every child thread. However, the declaration of this variable is missing, so its usage and implementation are unknown. If this variable has locks that prevent multiple threads from accessing it at the same time, then your threads will potentially be locking each other out of this data, further slowing performance.

When threads run, they are given time on a specific processor. If there are more threads than processors, the system context-switches between threads to give every active thread some time to process. Context switching is expensive. If you have many more threads than processors, a large share of the CPU time can be taken up by context switching, which can make a single-threaded solution look faster than a multithreaded one.
Your example starts an indeterminate number of threads. If SplitSeqFasta returns more entries than you have cores, you will create more threads than cores and introduce a lot of context switching.
I suggest you throttle the number of threads manually, or use something like the Task Parallel Library and its Parallel class to have it throttled automatically for you.
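A minimal sketch of the throttled approach, assuming the per-sequence work can be wrapped in a single function. `Search` is a hypothetical stand-in for SearchAlgorithm.SearchInBothDirections, and capping the loop at Environment.ProcessorCount is an assumption to start from, not a measured optimum.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ThrottledSearch
{
    // Hypothetical stand-in for the real per-sequence search.
    public static int Search(string seq)
    {
        return seq.ToUpper().Replace('T', 'U').Length;
    }

    public static ConcurrentDictionary<string, int> Run(IDictionary<string, string> rawSeqs)
    {
        var results = new ConcurrentDictionary<string, int>();
        var options = new ParallelOptions
        {
            // Assumption: one worker per logical core is a sensible starting cap.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        // The loop hands out sequences to at most MaxDegreeOfParallelism workers,
        // so no manual splitting into groups is needed.
        Parallel.ForEach(rawSeqs, options, pair =>
        {
            results[pair.Key] = Search(pair.Value);
        });
        return results;
    }
}
```

Because every sequence name is written exactly once, the ConcurrentDictionary also replaces the merge step after Join.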

Related

Multithread foreach slows down main thread

Edit: As per the discussion in the comments, I was overestimating how much many threads would help. I have gone back to Parallel.ForEach with a reasonable MaxDegreeOfParallelism, and just have to wait it out.
I have a 2D array data structure, and perform work on slices of the data. There will only ever be around 1000 threads required to work on all the data simultaneously. Basically there are around 1000 "days" worth of data for all ~7000 data points, and I would like to process the data for each day in a new thread in parallel.
My issue is that doing work in the child threads dramatically slows down the main thread while it is starting them. If no work is done in the child threads, the main thread starts them all basically instantly. In my example below, with just a bit of work, it takes ~65ms to start all the threads. In my real use case, the worker threads will take around 5-10 seconds to compute everything they need, but I would like them all to start instantly; otherwise I am basically running the work in sequence. I do not understand why their work is slowing down the main thread from starting them.
How the data is set up shouldn't matter (I hope). The way it's set up might look weird; I was just simulating exactly how I receive the data. What's important is that if you comment out the foreach loop in the DoThreadWork method, the time it takes to start the threads is far lower.
I have the for (var i = 0; i < 4; i++) loop just to run the simulation multiple times to see 4 sets of timing results to make sure that it wasn't just slow the first time.
Here is a code snippet to simulate my real code:
public static void Main(string[] args)
{
var fakeData = Enumerable
.Range(0, 7000)
.Select(_ => Enumerable.Range(0, 400).ToArray())
.ToArray();
const int offset = 100;
var dataIndices = Enumerable
.Range(offset, 290)
.ToArray();
for (var i = 0; i < 4; i++)
{
var s = Stopwatch.StartNew();
var threads = dataIndices
.Select(n =>
{
var thread = new Thread(() =>
{
foreach (var fake in fakeData)
{
var sliced = new ArraySegment<int>(fake, n - offset, n - (n - offset));
DoThreadWork(sliced);
}
});
return thread;
})
.ToList();
foreach (var thread in threads)
{
thread.Start();
}
Console.WriteLine($"Before Join: {s.Elapsed.Milliseconds}");
foreach (var thread in threads)
{
thread.Join();
}
Console.WriteLine($"After Join: {s.Elapsed.Milliseconds}");
}
}
private static void DoThreadWork(ArraySegment<int> fakeData)
{
// Commenting out this foreach loop will dramatically increase the speed
// in which all the threads start
var a = 0;
foreach (var fake in fakeData)
{
// Simulate thread work
a += fake;
}
}

Use the thread/task pool and limit the thread/task count to 2*(CPU cores) at most. Creating more threads doesn't magically get more work done, because you need hardware "threads" to run them (1 per core on CPUs without SMT, 2 per core with Intel Hyper-Threading or AMD's SMT implementation). Running hundreds to thousands of threads that are not passively awaiting asynchronous callbacks (i.e. I/O) makes them far less efficient, because the CPU is thrashed with context switches for no reason.
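If you want to keep plain threads rather than the pool, one way to bound their number is a fixed set of workers that each claim the next item index. This is only a sketch: `BoundedWorkers` and its parameters are names made up for illustration, and the right worker count still has to be measured.

```csharp
using System;
using System.Threading;

public static class BoundedWorkers
{
    // Runs work(i) for every i in [0, itemCount) on a fixed number of threads.
    public static void Run(int itemCount, int workerCount, Action<int> work)
    {
        int next = -1; // index of the most recently claimed item
        var threads = new Thread[workerCount];
        for (int t = 0; t < workerCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                // Each worker claims the next unprocessed index until none remain.
                int i;
                while ((i = Interlocked.Increment(ref next)) < itemCount)
                {
                    work(i);
                }
            });
            threads[t].Start();
        }
        foreach (var thread in threads)
        {
            thread.Join();
        }
    }
}
```

For the question's data this could be called as `BoundedWorkers.Run(dataIndices.Length, Environment.ProcessorCount, k => ProcessDay(dataIndices[k]))`, where ProcessDay is a hypothetical method holding the per-day loop.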

How to optimize CPU load up to 100% with GZip multithread archiver?

I've been working on a multithreaded file archiver for a week now. It uses plain threads only; synchronization is done with monitors and AutoResetEvent.
I set the number of threads to the number of cores like this:
public static int GetCoreCount()
{
int coreCount = 0;
foreach (var item in new System.Management.ManagementObjectSearcher("Select * from Win32_Processor").Get())
{
coreCount += int.Parse(item["NumberOfCores"].ToString());
}
return coreCount;
}
But that loads my CPU to ~65% at most.
And the load is far from uniform; it constantly falls and rises.
Does anyone have any idea how to use 100% of the processor's capacity?
This is my Run() code:
public void Run()
{
var readingThread = new Thread(new ThreadStart(ReadInFile));
var compressingThreads = new List<Thread>();
for (var i = 0; i < CoreManager.GetCoreCount(); i++)
{
var j = i;
ProcessEvents[j] = new AutoResetEvent(false);
compressingThreads.Add(new Thread(() => Process(j)));
}
var writingThread = new Thread(new ThreadStart(WriteOutFile));
readingThread.Start();
foreach (var compressThread in compressingThreads)
{
compressThread.Start();
}
writingThread.Start();
WaitHandle.WaitAll(ProcessEvents);
OutputDictionary.SetCompleted();
writingThread.Join();

It's not possible to tell what is limiting your core usage without profiling, and also knowing how much data you are compressing in your test.
However I can say that in order to get good efficiency, which includes both full core utilization and close to a factor of n speedup for n threads over one thread, in pigz I have to create pools of threads that are always there, either running or waiting for more work. It is a huge impact to create and destroy threads for every chunk of data to be processed. I also have pools of pre-allocated blocks of memory for the same reason.
The pigz source code, written in C, may be of help.
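A rough C# sketch of the "workers that are always there" idea, not a port of pigz: a fixed set of threads blocks on a bounded queue of chunk indices instead of being created per chunk. `ChunkCompressor` is a hypothetical name, each chunk becomes an independent gzip member here, and the pre-allocated memory pools mentioned above are left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading;

public static class ChunkCompressor
{
    public static byte[][] CompressAll(IList<byte[]> chunks, int workerCount)
    {
        var output = new byte[chunks.Count][];
        // Bounded queue of chunk indices; workers wait on it rather than being re-created.
        using (var queue = new BlockingCollection<int>(boundedCapacity: workerCount * 2))
        {
            var workers = new Thread[workerCount];
            for (int w = 0; w < workerCount; w++)
            {
                workers[w] = new Thread(() =>
                {
                    // Ends when CompleteAdding has been called and the queue is drained.
                    foreach (int index in queue.GetConsumingEnumerable())
                    {
                        output[index] = Gzip(chunks[index]);
                    }
                });
                workers[w].Start();
            }
            for (int i = 0; i < chunks.Count; i++)
            {
                queue.Add(i); // blocks while the queue is full
            }
            queue.CompleteAdding();
            foreach (var worker in workers)
            {
                worker.Join();
            }
        }
        return output;
    }

    private static byte[] Gzip(byte[] data)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(data, 0, data.Length);
            }
            return buffer.ToArray();
        }
    }
}
```

Storing results by index keeps the output in input order without extra synchronization, since each slot is written by exactly one worker.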

Batch process all items in ConcurrentBag

I have the following use case. Multiple threads are creating data points which are collected in a ConcurrentBag. Every x ms a single consumer thread looks at the data points that came in since the last time and processes them (e.g. count them + calculate average).
The following code more or less represents the solution that I came up with:
private static ConcurrentBag<long> _bag = new ConcurrentBag<long>();
static void Main()
{
Task.Run(() => Consume());
var producerTasks = Enumerable.Range(0, 8).Select(i => Task.Run(() => Produce()));
Task.WaitAll(producerTasks.ToArray());
}
private static void Produce()
{
for (int i = 0; i < 100000000; i++)
{
_bag.Add(i);
}
}
private static void Consume()
{
while (true)
{
var oldBag = _bag;
_bag = new ConcurrentBag<long>();
var average = oldBag.DefaultIfEmpty().Average();
var count = oldBag.Count;
Console.WriteLine($"Avg = {average}, Count = {count}");
// Wait x ms
}
}
Is a ConcurrentBag the right tool for the job here?
Is switching the bags the right way to achieve clearing the list for new data points and then processing the old ones?
Is it safe to operate on oldBag or could I run into trouble when I iterate over oldBag and a thread is still adding an item?
Should I use Interlocked.Exchange() for switching the variables?
EDIT
I guess the above code was not really a good representation of what I'm trying to achieve. So here is some more code to show the problem:
public class LogCollectorTarget : TargetWithLayout, ILogCollector
{
private readonly List<string> _logMessageBuffer;
public LogCollectorTarget()
{
_logMessageBuffer = new List<string>();
}
protected override void Write(LogEventInfo logEvent)
{
var logMessage = Layout.Render(logEvent);
lock (_logMessageBuffer)
{
_logMessageBuffer.Add(logMessage);
}
}
public string GetBuffer()
{
lock (_logMessageBuffer)
{
var messages = string.Join(Environment.NewLine, _logMessageBuffer);
_logMessageBuffer.Clear();
return messages;
}
}
}
The class's purpose is to collect logs so they can be sent to a server in batches. Every x seconds GetBuffer is called. This should get the current log messages and clear the buffer for new messages. It works with locks, but as they are quite expensive I don't want to lock on every logging operation in my program. That's why I wanted to use a ConcurrentBag as a buffer. But then I still need to switch or clear it when I call GetBuffer, without losing any log messages that arrive during the switch.

Since you have a single consumer, you can work your way with a simple ConcurrentQueue, without swapping collections:
public class LogCollectorTarget : TargetWithLayout, ILogCollector
{
private readonly ConcurrentQueue<string> _logMessageBuffer;
public LogCollectorTarget()
{
_logMessageBuffer = new ConcurrentQueue<string>();
}
protected override void Write(LogEventInfo logEvent)
{
var logMessage = Layout.Render(logEvent);
_logMessageBuffer.Enqueue(logMessage);
}
public string GetBuffer()
{
// How many messages should we dequeue?
var count = _logMessageBuffer.Count;
var messages = new StringBuilder();
while (count > 0 && _logMessageBuffer.TryDequeue(out var message))
{
messages.AppendLine(message);
count--;
}
return messages.ToString();
}
}
If memory allocations become an issue, you can instead dequeue them to a fixed-size array and call string.Join on it. This way, you're guaranteed to do only two allocations (whereas the StringBuilder could do many more if the initial buffer isn't properly sized):
public string GetBuffer()
{
// How many messages should we dequeue?
var count = _logMessageBuffer.Count;
var buffer = new string[count];
for (int i = 0; i < count; i++)
{
_logMessageBuffer.TryDequeue(out var message);
buffer[i] = message;
}
return string.Join(Environment.NewLine, buffer);
}

Is a ConcurrentBag the right tool for the job here?
It's the right tool for a job, but whether it's right for this one depends on what you are trying to do, and why. The example you have given is very simplistic and has no context, so it's hard to tell.
Is switching the bags the right way to achieve clearing the list for new data points and then processing the old ones?
The answer is no, for probably many reasons. What happens if a thread writes to it while you are switching it?
Is it safe to operate on oldBag or could I run into trouble when I iterate over oldBag and a thread is still adding an item?
No. You have only copied the reference, so a producer that read the field just before the swap can still add to the old bag after you have processed it, and that item is lost.
Should I use Interlocked.Exchange() for switching the variables?
Interlocked methods are great things, but they will not help you with your current problem. They make the read or write of a single value (an integer or a reference) atomic; they do not stop a producer from adding to the old bag after the swap. You are really confused and you need to look up more thread-safe examples.
However, let's point you in the right direction. Forget about ConcurrentBag and those fancy classes. My advice is to start simple and use locking so you understand the nature of the problem.
If you want multiple tasks/threads to access a list, you can easily use the lock statement to guard access to the list/array so other nasty threads aren't modifying it.
Obviously the code you have written is a nonsensical example. I mean, you are just adding consecutive numbers to a list and getting another thread to average them. This hardly needs to be producer-consumer at all, and would make more sense as synchronous code.
At this point I would point you to better architectures that would allow you to implement this pattern, e.g. TPL Dataflow, but I fear this is just a learning exercise. Unfortunately you really need to do more reading on multithreading and try more examples before we can truly help you with a problem.

It works with locks, but as they are quite expensive I don't want to lock on every logging operation in my program.
Acquiring an uncontended lock is actually quite cheap. Quoting from Joseph Albahari's book:
You can expect to acquire and release a lock in as little as 20 nanoseconds on a 2010-era computer if the lock is uncontended.
Locking becomes expensive when it is contended. You can minimize the contention by reducing the work inside the critical region to the absolute minimum. In other words don't do anything inside the lock that can be done outside the lock. In your second example the method GetBuffer does a String.Join inside the lock, delaying the release of the lock and increasing the chances of blocking other threads. You can improve it like this:
public string GetBuffer()
{
string[] messages;
lock (_logMessageBuffer)
{
messages = _logMessageBuffer.ToArray();
_logMessageBuffer.Clear();
}
return String.Join(Environment.NewLine, messages);
}
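But it can be optimized even further. You could use the technique of your first example, and instead of clearing the existing List<string>, just swap it with a new list. The field can then no longer be readonly, and because the list reference changes, both Write and GetBuffer must lock on a separate object that never changes (called _locker here, a name chosen for illustration):
public string GetBuffer()
{
List<string> oldList;
lock (_locker) // declared as: private readonly object _locker = new object();
{
oldList = _logMessageBuffer;
_logMessageBuffer = new();
}
return String.Join(Environment.NewLine, oldList);
}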
Starting from .NET Core 3.0, the Monitor class has the property Monitor.LockContentionCount, that returns the number of times there was contention at the entry point of a lock. You could watch the delta of this property every second, and see if the number is concerning. If you get single-digit numbers, there is nothing to worry about.
Touching some of your questions:
Is a ConcurrentBag the right tool for the job here?
No. The ConcurrentBag<T> is a very specialized collection intended for mixed producer-consumer scenarios, where the same thread both adds and takes items, mainly object pools. You don't have such a scenario here. A ConcurrentQueue<T> is preferable to a ConcurrentBag<T> in almost all scenarios.
Should I use Interlocked.Exchange() for switching the variables?
Only if the collection was immutable. If the _logMessageBuffer was an ImmutableQueue<T>, then it would be excellent to swap it with Interlocked.Exchange. With mutable types you have no idea if the old collection is still in use by another thread, and for how long. The operating system can suspend any thread at any time for a duration of 10-30 milliseconds or even more. So it's not safe to use lock-free techniques. You have to lock.
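For comparison, here is a sketch of what the immutable variant could look like. It assumes the System.Collections.Immutable package is available, and `ImmutableLogBuffer` is a hypothetical class that takes already rendered strings instead of deriving from TargetWithLayout.

```csharp
using System;
using System.Collections.Immutable;
using System.Threading;

public class ImmutableLogBuffer
{
    private ImmutableQueue<string> _messages = ImmutableQueue<string>.Empty;

    public void Write(string message)
    {
        // Builds a new queue with the message appended and retries until the swap succeeds.
        ImmutableInterlocked.Enqueue(ref _messages, message);
    }

    public string GetBuffer()
    {
        // Atomically take the current snapshot and leave an empty queue behind.
        ImmutableQueue<string> snapshot =
            Interlocked.Exchange(ref _messages, ImmutableQueue<string>.Empty);
        return string.Join(Environment.NewLine, snapshot);
    }
}
```

Nobody can mutate the snapshot after the exchange, which is exactly the property a swapped List<string> or ConcurrentBag<T> lacks. Whether this is faster than a short uncontended lock is something to measure rather than assume.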

how to restrict a method to n concurrent calls

I have an integration service which runs a calculation-heavy, data-bound process. I want to make sure that there are never more than, say, n = 5 of these processes running at the same time (n will be configurable and changeable at runtime). The idea is to throttle the load on the server to a safe level. The amount of data processed by the method is limited by batching, so I don't need to worry about one process representing a much bigger load than another.
The processing method is called by another process, where requests to run payroll are held on a queue, and I can insert some logic at that point to determine whether to process this request now or leave it on the queue.
So I want a separate method on the same service as the processing method, which can tell me if the server can accept another call to the processing method. It's going to ask, "How many payroll runs are going on? Is that less than n?" What's a neat way of achieving this?
-----------edit------------
I think I need to make it clear: the process that decides whether to take the request off the queue is separated from the service that processes the payroll data by a WCF boundary. Blocking a thread in the payroll processing service isn't going to prevent more requests from coming in.

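You can use a Semaphore to do this.
public class Foo
{
private Semaphore semaphore;
public Foo(int numConcurrentCalls)
{
semaphore = new Semaphore(numConcurrentCalls, numConcurrentCalls);
}
public bool isReady()
{
// WaitOne(0) takes a slot when one is free, so hand it straight back.
// The answer is only a snapshot and may be stale by the time Bar() is called.
if (!semaphore.WaitOne(0))
{
return false;
}
semaphore.Release();
return true;
}
public void Bar()
{
// Blocks while "numConcurrentCalls" threads are already in this method.
// Acquire outside the try so a failed wait never triggers Release.
semaphore.WaitOne();
try
{
//do stuff
}
finally
{
semaphore.Release();
}
}
}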
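Since the caller sits on the other side of a WCF boundary and n should be changeable at runtime, an alternative is to combine "is there room?" and "take a slot" into one non-blocking call. The following is a sketch under those assumptions; `ConcurrencyGate`, `TryEnter` and `Exit` are illustrative names, not part of any framework.

```csharp
using System;

public class ConcurrencyGate
{
    private readonly object _sync = new object();
    private int _running;
    private int _limit;

    public ConcurrencyGate(int limit)
    {
        _limit = limit;
    }

    // The limit can be changed while calls are in flight; running calls are not interrupted.
    public int Limit
    {
        get { lock (_sync) { return _limit; } }
        set { lock (_sync) { _limit = value; } }
    }

    // Claims a slot and returns true, or returns false when the limit is reached.
    public bool TryEnter()
    {
        lock (_sync)
        {
            if (_running >= _limit) return false;
            _running++;
            return true;
        }
    }

    // Must be called exactly once for every successful TryEnter, typically in a finally block.
    public void Exit()
    {
        lock (_sync)
        {
            if (_running == 0) throw new InvalidOperationException("Exit without matching TryEnter");
            _running--;
        }
    }
}
```

The queue reader would call TryEnter through the service; on false it leaves the request on the queue, on true it starts the payroll run and the service calls Exit when the run ends.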

Review the Object Pool pattern. This is what you're describing. While not strictly required by the pattern, you can expose the number of objects currently in the pool, the maximum (configured) number, the high-watermark, etc.

I think that you might want a BlockingCollection, where each item in the collection represents one of the concurrent calls.
Also see IProducerConsumerCollection.
If you were just using threads, I'd suggest you look at the methods for limiting thread concurrency (e.g. the TaskScheduler.MaximumConcurrencyLevel property).
Also see ParallelEnumerable.WithDegreeOfParallelism

void ThreadTest()
{
ConcurrentQueue<int> q = new ConcurrentQueue<int>();
int MaxCount = 5; // maximum number of concurrent workers
Random r = new Random();
for (int i = 0; i <= 10000; i++)
{
q.Enqueue(r.Next(100000, 200000));
}
ThreadStart proc = () =>
{
int read;
// Each worker keeps pulling items until the queue is empty.
while (q.TryDequeue(out read))
{
Console.WriteLine(String.Format("[{1:HH:mm:ss}.{1:fff}] starting: {0}... #Thread {2}", read, DateTime.Now, Thread.CurrentThread.ManagedThreadId));
int delay;
lock (r) { delay = r.Next(100, 1000); } // Random is not thread-safe
Thread.Sleep(delay);
Console.WriteLine(String.Format("[{1:HH:mm:ss}.{1:fff}] {0} ended! #Thread {2}", read, DateTime.Now, Thread.CurrentThread.ManagedThreadId));
}
};
for (int i = 0; i < MaxCount; i++) // start exactly MaxCount threads
{
new Thread(proc).Start();
}
}

Multithreaded code executes by threadnumber-times slower using System.Threading and Visual Studio C# Express Hosting Process

I have a very simple program counting the characters in a string. An integer threadnum sets the number of threads and divides the data by threadnum accordingly into chunks for each thread to process.
Each thread increments the values contained in a shared dictionary, building a character histogram.
private Dictionary<UInt32, int> dict = new Dictionary<UInt32, int>();
In order to wait for all threads to finish and continue with the main process, I invoke Thread.Join
Initially I had a local dictionary for each thread which get merged afterwards, but a shared dictionary worked fine, without locking.
No references are locked in the method BuildDictionary, though locking the dictionary did not significantly impact thread-execution time.
Each thread is timed, and the resulting dictionary compared.
The dictionary content is the same regardless of a single or multiple threads - as it should be.
Each thread takes a fraction determined by threadnum to complete - as it should be.
Problem:
The total time is roughly a multiple of threadnum; that is to say, the execution time increases with the number of threads.
(Unfortunately I cannot run a C# profiler at the moment. Additionally, I would prefer C# 3 code compatibility.)
Others are likely struggling as well. Could it be that the VS 2010 Express edition vshost process stacks the threads and schedules them to run sequentially?
Another MT performance issue was recently posted here as "Visual Studio C# 2010 Express Debug running Faster than Release".
Code:
public int threadnum = 8;
Thread[] threads = new Thread[threadnum];
Stopwatch stpwtch = new Stopwatch();
stpwtch.Start();
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx] = new Thread(BuildDictionary);
threads[threadidx].Start(threadidx);
threads[threadidx].Join(); //Blocks the calling thread, till thread completion
}
WriteLine("Total - time: {0} msec", stpwtch.ElapsedMilliseconds);
Can you help please?
Update:
It appears that the strange behavior of an almost linear slowdown with increasing thread-number is an artifact due to the numerous hooks of the IDE's Debugger.
Running the process outside the developer environment, I actually do get a 30% speed increase on a 2 logical/physical core machine. During debugging I am already at the high end of CPU utilization, and hence I suspect it is wise to have some leeway during development through additional idle cores.
As I did initially, I now let each thread compute on its own local data chunk; the result is written back to a shared list under a lock and aggregated after all threads have finished.
Conclusion:
Be heedful of the environment the process is running in.

We can put the dictionary synchronization issues Tony the Lion mentions in his answer aside for the moment, because in your current implementation you are in fact not running anything in parallel!
Let's take a look at what you are currently doing in your loop:
Start a thread.
Wait for the thread to complete.
Start the next thread.
In other words, you should not be calling Join inside the loop.
Instead, you should start all threads as you are doing, but use a signaling construct to determine when all threads have completed. Note that a single shared AutoResetEvent does not count signals: if two threads call Set before the waiting thread wakes up, one signal is lost and the wait loop never finishes. A Semaphore counts every Release, so it is used here.
See example program:
class Program
{
// Starts at 0; every finished thread adds one to the count.
static Semaphore _finished = new Semaphore(0, int.MaxValue);
static void Main(string[] args)
{
int numThreads = 5;
for (int i = 0; i < numThreads; i++)
{
new Thread(DoWork).Start(i);
}
// Consume one signal per thread.
for (int i = 0; i < numThreads; i++)
{
_finished.WaitOne();
}
Console.WriteLine("All threads finished");
}
static void DoWork(object id)
{
Thread.Sleep(1000);
Console.WriteLine(String.Format("Thread {0} completed", (int)id));
_finished.Release();
}
}
Alternatively you could just as well be calling Join in the second loop if you have references to the threads available.
After you have done this you can and should worry about the dictionary synchronization problems.

From MSDN: "A Dictionary can support multiple readers concurrently, as long as the collection is not modified."
You say:
but a shared dictionary worked fine, without locking.
Each thread increments the values contained in a shared dictionary
Your program is by definition broken. If you alter the data in the dictionary without proper locking, you will end up with bugs. Nothing more needs to be said.

I wouldn't use a shared static Dictionary. If each thread works on a local copy, you can amalgamate the results once all threads have signalled completion.
Waiting with WaitHandle.WaitAll on one handle per thread avoids the deadlock you can get when several threads signal a single AutoResetEvent.
class Program
{
static void Main()
{
char[] text = "Some String".ToCharArray();
int numThreads = 5;
// I leave the implementation of the next line (and of Partition) to the OP.
Partition[] partitions = PartitionWork(text, numThreads);
var completions = new ManualResetEvent[numThreads];
var results = new Dictionary<char, int>[numThreads];
for (int i = 0; i < numThreads; i++)
{
results[i] = new Dictionary<char, int>();
completions[i] = new ManualResetEvent(false);
int index = i; // per-iteration copy, so each lambda sees its own value
new Thread(() => BuildDictionary(
text,
partitions[index].Start,
partitions[index].End,
results[index],
completions[index])).Start();
}
// WaitAll rejects timeouts above Int32.MaxValue milliseconds (about 24.8 days).
if (WaitHandle.WaitAll(completions, TimeSpan.FromDays(24)))
{
Console.WriteLine("All threads finished");
}
else
{
Console.WriteLine("Timed out after 24 days");
}
// Merge the results of threads 1..n-1 into the result of thread 0
Dictionary<char, int> result = results[0];
for (int i = 1; i < numThreads; i++)
{
foreach (KeyValuePair<char, int> item in results[i])
{
if (result.ContainsKey(item.Key))
{
result[item.Key] += item.Value;
}
else
{
result.Add(item.Key, item.Value);
}
}
}
}
static void BuildDictionary(
char[] text,
int start,
int finish,
IDictionary<char, int> result,
EventWaitHandle completed)
{
for (int i = start; i <= finish; i++)
{
if (result.ContainsKey(text[i]))
{
result[text[i]]++;
}
else
{
result.Add(text[i], 1);
}
}
completed.Set();
}
}
With this implementation the only variable that is ever shared is the char[] of the text and that is always read only.
You do have the burden of merging the dictionaries at the end, but that is a small price for avoiding any concurrency issues. In a later version of the framework I would have used TPL and ConcurrentDictionary and possibly Partitioner<TSource>.
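On .NET 4 or later, the same "local dictionary per worker, merge at the end" idea might look roughly like this with the TPL. It is a sketch, not the answer's own code: `Histogram.Count` is a made-up name, and it merges plain dictionaries under a lock rather than using ConcurrentDictionary.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Histogram
{
    public static Dictionary<char, int> Count(char[] text)
    {
        var total = new Dictionary<char, int>();
        if (text.Length == 0) return total; // Partitioner.Create rejects an empty range

        Parallel.ForEach(
            Partitioner.Create(0, text.Length),   // index ranges [Item1, Item2)
            () => new Dictionary<char, int>(),    // one private dictionary per worker
            (range, loopState, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    int n;
                    local.TryGetValue(text[i], out n); // n stays 0 for a new key
                    local[text[i]] = n + 1;
                }
                return local;
            },
            local =>
            {
                lock (total)                      // merge once per worker, not per character
                {
                    foreach (var pair in local)
                    {
                        int n;
                        total.TryGetValue(pair.Key, out n);
                        total[pair.Key] = n + pair.Value;
                    }
                }
            });
        return total;
    }
}
```

The lock is only taken once per worker when its local dictionary is merged, so contention stays low even for long inputs.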

I totally agree with TonyTheLion and others: once you fix the actual problem of joining in the wrong place, there will still be a problem with updating the shared dictionary without locks. I wanted to drop you a quick workaround: just wrap your integer value in an object.
Instead of:
Dictionary<uint, int> dict = new Dictionary<uint, int>();
use:
class Entry { public int value; }
Dictionary<uint, Entry> dict = new Dictionary<uint, Entry>();
and now increment Entry.value instead. That way the dictionary itself sees no changes once all its keys exist, so it needs no lock for the increments.
Note: this only works as-is if each thread is guaranteed to use only its own Entry. I've just noticed this is not true here, as you said 'histogram of characters'. You will have to lock on each Entry during the increment, or some increments may be lost. Still, locking at the Entry level will be significantly faster than locking the whole dictionary.
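A sketch of the Entry idea, using Interlocked.Increment in place of a per-Entry lock (a substitution, not what the answer literally proposes). It assumes every key is created before the threads start, so the dictionary is only read while counting; `SharedHistogram` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class Entry { public int value; }

public static class SharedHistogram
{
    public static Dictionary<uint, Entry> Count(string text, int threadCount)
    {
        // Create every key up front so no thread ever adds to the dictionary.
        var dict = new Dictionary<uint, Entry>();
        foreach (char c in text)
        {
            if (!dict.ContainsKey(c)) dict[c] = new Entry();
        }

        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            // Split the text into threadCount contiguous slices.
            int start = (int)((long)text.Length * t / threadCount);
            int end = (int)((long)text.Length * (t + 1) / threadCount);
            threads[t] = new Thread(() =>
            {
                for (int i = start; i < end; i++)
                {
                    // Read-only dictionary lookup, atomic update of the shared counter.
                    Interlocked.Increment(ref dict[text[i]].value);
                }
            });
            threads[t].Start();
        }
        foreach (var thread in threads)
        {
            thread.Join();
        }
        return dict;
    }
}
```

The single-threaded pass that creates the keys costs time of its own, and many threads hammering the same few counters will still contend, so per-thread dictionaries merged at the end may well be faster.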

Rotem saw it.
Your main thread should Join the X other threads only after having started all of them.
Otherwise it waits for the first thread to finish before it starts the second one and waits for that.
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx] = new Thread(BuildDictionary);
threads[threadidx].Start(threadidx);
}
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx].Join(); //Blocks the calling thread, till thread completion
}

As Rotem points out, by joining in the loop you are waiting for each thread to complete before continuing.
The hint for why this happens can be found in the Thread.Join documentation on MSDN:
Blocks the calling thread until a thread terminates
So your loop will not continue until that one thread has completed its work. To start all the threads and then wait for them to complete, join them outside the loop:
public int threadnum = 8;
Thread[] threads = new Thread[threadnum];
Stopwatch stpwtch = new Stopwatch();
stpwtch.Start();
// Start all the threads doing their work
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx] = new Thread(BuildDictionary);
threads[threadidx].Start(threadidx);
}
// Join to all the threads to wait for them to complete
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx].Join();
}
System.Diagnostics.Debug.WriteLine("Total - time: {0} msec", stpwtch.ElapsedMilliseconds);
You will really need to post your BuildDictionary function. It is very likely that the operation will be no faster with multiple threads and the threading overhead will actually increase execution time.
