I am creating a component that performs sequential processing on inputs. As it will be hosted in several different processes, I need it to be thread safe. At first, I intentionally left thread safety out of the code; now it is time to introduce it.
First, I wanted to provoke an error as a starting point, but I was not able to. Here is a simplified version of the code for the processing engine:
public Document DoOrchestration(Document input)
{
Document output = new Document();
foreach (var orchestrationStep in m_OrchestrationSteps)
{
var processor = GetProcessor(orchestrationStep).Clone();
output = processor.Process(input);
input = output;
}
return output;
}
The processors can be developed by other people in my organisation, and that can include some complex initialization. They may also be thread unsafe, so I use the Prototype pattern to get unique instances of the processors and avoid threading issues in them.
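For reference, a minimal sketch of what such a Prototype setup might look like (the IProcessor interface, the Config type, and its loader are assumptions for illustration, not the real code):

public interface IProcessor
{
    // Returns a private copy so thread-unsafe internal state is never shared.
    IProcessor Clone();
    Document Process(Document input);
}

public class ExpensiveProcessor : IProcessor
{
    private readonly Config m_Config; // built once, treated as read-only afterwards

    public ExpensiveProcessor() { m_Config = Config.LoadSlowly(); } // hypothetical slow init

    private ExpensiveProcessor(Config shared) { m_Config = shared; }

    // Cheap copy: shares the immutable config, duplicates no mutable state.
    public IProcessor Clone() => new ExpensiveProcessor(m_Config);

    public Document Process(Document input)
    {
        // Per-instance mutable state is safe here, since each thread has its own clone.
        return input;
    }
}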
To test this function I used the following code:
for (int i = 0; i < 20000; i++)
{
    int number = i; // copy the loop variable; capturing i directly is itself a race
    Thread t = new Thread(() => TestOrchestration(number));
    t.Start();
}
void TestOrchestration(int number)
{
Document doc = new Document(string.Format("Test {0}", number));
doc = DoOrchestration(doc);
if (doc.ToString().Substring(0,35) != strExpectedResult)
{
System.Console.WriteLine("Error: {0}", doc.ToString();
}
}
I expected that some of the threads would collide with one another and mix up their results, but to my surprise that did not happen.
There is probably an easy and logical explanation for this, but it eludes me. Or is the code simply too short-lived for two threads to fiddle with the input/output variables at the same time?
Check out CHESS.
CHESS is a tool for finding and reproducing Heisenbugs in concurrent
programs. CHESS repeatedly runs a concurrent test ensuring that every
run takes a different interleaving. If an interleaving results in an
error, CHESS can reproduce the interleaving for improved debugging.
CHESS is available for both managed and native programs.
I would assume that, due to the simplicity of your test function, your threads finish before the next ones even get a chance to spawn in quantity. Consider using a barrier so that all threads have spawned before the computational step begins. Also consider increasing the complexity of your test case, for instance by performing several of the same operation in the same loop (starting a thread is expensive and gives other cores time to complete their work before you even reach the resource contention).
Generally, resource contention can be provoked by rapid access to the same resource over a longer period of time; your test case does not seem to allow for this. As an aside, I would strongly recommend that you design with thread safety in mind instead of introducing it later. While writing the code you have a better understanding of resource access patterns than you will have when analyzing it at a later stage.
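To make that concrete, here is a minimal sketch of the kind of test that reliably provokes contention: several threads hammering one shared counter with an unsynchronized read-modify-write (the thread and iteration counts are arbitrary):

int counter = 0;
var threads = Enumerable.Range(0, 10)
    .Select(_ => new Thread(() =>
    {
        for (int i = 0; i < 1000000; i++)
            counter++; // read-modify-write on shared state, not atomic
    }))
    .ToList();
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
// With no lock, the total is almost always well below 10,000,000.
Console.WriteLine(counter);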
You could use a ManualResetEvent to allow any number of waiting threads to continue simultaneously.
I think your test function is almost complete before the next thread even starts. You can make all the threads wait on a ManualResetEventSlim before invoking the orchestration function, and then set the ManualResetEventSlim.
That way all your threads will in fact try to invoke the orchestration at the same time.
Also, you would probably not need 20,000 threads to reproduce this behaviour if all of them make the orchestration call at almost the same time:
ManualResetEventSlim manualEvent = new ManualResetEventSlim(false);
for (int i = 0; i < 20000; i++)
{
    int number = i; // copy the loop variable before capturing it
    Thread t = new Thread(() => TestOrchestration(number));
    t.Start();
}
manualEvent.Set();
void TestOrchestration(int number)
{
manualEvent.Wait();
Document doc = new Document(string.Format("Test {0}", number));
doc = DoOrchestration(doc);
if (doc.ToString().Substring(0,35) != strExpectedResult)
{
System.Console.WriteLine("Error: {0}", doc.ToString();
}
}
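One refinement, sketched here with a CountdownEvent (available since .NET 4): have the main thread wait until every worker has actually reached the gate before calling Set, so the release really is simultaneous rather than trickling out as threads are created:

var ready = new CountdownEvent(20000);       // one signal per spawned thread
var gate = new ManualResetEventSlim(false);  // the shared starting gate

for (int i = 0; i < 20000; i++)
{
    int number = i; // copy the loop variable before capturing it
    new Thread(() =>
    {
        ready.Signal(); // announce: this thread is at the gate
        gate.Wait();    // block until every thread is released together
        TestOrchestration(number);
    }).Start();
}

ready.Wait(); // every thread has been created and is waiting
gate.Set();   // release them all at once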
I'm going to start by describing my use case:
I have built an app which processes LARGE datasets, runs various transformations on them and then spits them out. This process is very time sensitive, so a lot of time has gone into optimising it.
The idea is to read a bunch of records at a time, process each one on different threads and write the results to file. But instead of writing them to one file, the results are written to one of many temp files which get combined into the desired output file at the end. This is so that we avoid memory write protection exceptions or bottlenecks (as much as possible).
To achieve that, we have an array of 10 fileUtils, one of which gets passed to each thread as it is initiated. There is a threadedOutputIterator which increments at each LocalInit and is reset back to zero when that count reaches 10. That value determines which of the fileUtils objects gets passed to the record-processing object for each thread. The idea is that each util class is responsible for collecting and writing to just one of the temp output files.
It's worth noting that each FileUtils object gathers about 100 records in a member outputBuildString variable before writing them out, hence having them exist separately and outside of the threading process, where an object's lifespan is limited.
The idea is to more or less evenly disperse the responsibility for collecting, storing and then writing the output data across multiple fileUtil objects, which means we can write more per second than if we were writing to just one file.
My problem is that this approach results in an array-out-of-bounds exception, as my threadedOutputIterator jumps above the upper limit value despite there being code that is supposed to reset it when this happens:
//by default threadCount = 10
private void ProcessRecords()
{
try
{
Parallel.ForEach(clientInputRecordList, new ParallelOptions { MaxDegreeOfParallelism = threadCount }, LocalInit, ThreadMain, LocalFinally);
}
catch (Exception e)
{
Console.WriteLine("The following error occured: " + e);
}
}
private SplitLineParseObject LocalInit()
{
if (threadedOutputIterator >= threadCount)
{
threadedOutputIterator = 0;
}
//still somehow goes above 10, and this is where the exception hits, since there are only 10 objects in the threadedFileUtils array
SplitLineParseObject splitLineParseUtil = new SplitLineParseObject(parmUtils, ref recCount, ref threadedFileUtils[threadedOutputIterator], ref recordsPassedToFileUtils);
if (threadedOutputIterator<threadCount)
{
threadedOutputIterator++;
}
return splitLineParseUtil;
}
private SplitLineParseObject ThreadMain(ClientInputRecord record, ParallelLoopState state, SplitLineParseObject threadLocalObject)
{
threadLocalObject.clientInputRecord = record;
threadLocalObject.ProcessRecord();
recordsPassedToObject++;
return threadLocalObject;
}
private void LocalFinally(SplitLineParseObject obj)
{
obj = null;
}
As explained in the comment above, it still manages to jump above 10, and this is where the exception hits, since there are only 10 objects in the threadedFileUtils array. I understand that this is because multiple threads can increment that number at the same time, before the code in either of those if statements runs, meaning there's still a chance it will fail in its current state.
How could I better approach this such that I avoid that exception, while still being able to take advantage of the read, store and write efficiency that having multiple fileUtils gives me?
Thanks!
But instead of writing them to one file, the results are written to one of many temp files which get combined into the desired output file at the end
That is probably not a great idea. If you can fit the data in memory it is most likely better to keep it in memory, or do the merging of data concurrently with the production of data.
To achieve that, we have an array of 10 fileUtils, one of which gets passed to each thread as it is initiated. There is a threadedOutputIterator which increments at each LocalInit and is reset back to zero when that count reaches 10
This does not sound safe to me. The parallel loop should guarantee that no more than 10 threads run concurrently (if that is your limit), and that LocalInit runs once for each thread that is used. As far as I know it makes no guarantee that no more than 10 threads will be used in total, so it seems possible that thread #0 and thread #10 could run concurrently.
The correct usage would be to create a new fileUtils object in LocalInit, along the lines of the sketch below.
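A sketch of what that could look like (the FileUtils constructor and the temp-file naming helper are assumptions; LocalFinally would then be the natural place to flush and dispose each thread's writer):

private int fileCounter; // incremented atomically to give each thread a unique file

private SplitLineParseObject LocalInit()
{
    // A fresh FileUtils per thread: no shared index, no shared writer state.
    int fileIndex = Interlocked.Increment(ref fileCounter);
    var fileUtils = new FileUtils(GetTempFileName(fileIndex)); // hypothetical ctor and helper
    return new SplitLineParseObject(parmUtils, ref recCount, ref fileUtils, ref recordsPassedToFileUtils);
}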
This more or less works and ends up being more efficient than if we are writing to just one file
Are you sure? Typically IO does not scale very well with concurrency. While SSDs are absolutely better than HDDs at this, both tend to work best with sequential IO.
How could I better approach this?
My approach would be to use a single writing thread, with a BlockingCollection as a thread-safe buffer between the producers and the writer. This assumes that the order of items is not significant:
public async Task ProcessAndWriteItems(List<int> myItems)
{
// BlockingCollection uses a ConcurrentQueue by default.
// A maximum size can also be set, in case the writer cannot keep up with the producers.
var writeQueue = new BlockingCollection<string>();
var writeTask = Task.Run(() => Writer(writeQueue));
Parallel.ForEach(
myItems,
item =>
{
writeQueue.Add(item.ToString());
});
writeQueue.CompleteAdding(); // signal the writer to stop once all items have been processed
await writeTask;
}
private void Writer(BlockingCollection<string> queue)
{
using var stream = new StreamWriter(myFilePath);
foreach (var line in queue.GetConsumingEnumerable())
{
stream.WriteLine(line);
}
}
There is also TPL Dataflow, which should be suitable for tasks like this, but I have not used it myself, so I cannot provide specific recommendations.
Note that multi-threaded programming is difficult. While proper use of modern programming techniques makes it easier, you still need to know a fair bit about thread safety to understand the problems and the options and tools that exist to solve them. You will not always be lucky enough to get actual exceptions; a more typical result of multithreading bugs is that your program just produces the wrong result. If you are unlucky, this only occurs in production, on a full moon, and only when processing important data.
LocalInit obviously is not thread safe, so when invoked multiple times in parallel it will have all the multithreading problems caused by non-atomic operations. As a quick fix you can lock the whole method:
private object locker = new object();
private SplitLineParseObject LocalInit()
{
lock (locker)
{
if (threadedOutputIterator >= threadCount)
{
threadedOutputIterator = 0;
}
SplitLineParseObject splitLineParseUtil = new SplitLineParseObject(parmUtils, ref recCount,
ref threadedFileUtils[threadedOutputIterator], ref recordsPassedToFileUtils);
if (threadedOutputIterator < threadCount)
{
threadedOutputIterator++;
}
return splitLineParseUtil;
}
}
Or work around it with Interlocked for more fine-grained control and better performance (though getting this fully right is not easy, if even possible).
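For example, a sketch of the Interlocked route (this fixes only the index arithmetic, not the file-overlap problem described next):

// Atomically take the next ticket, then map it into 0..threadCount-1.
// No two threads can observe the same ticket value.
long ticket = Interlocked.Increment(ref ticketCounter) - 1; // ticketCounter: a new long field starting at 0
int index = (int)(ticket % threadCount);
var fileUtil = threadedFileUtils[index];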
Note that even if you implement this in the current code, there is still no guarantee that all previous writes have actually finished: with 10 files, it is possible that the one at index 0 is not yet finished while the next 9 are, so the 10th worker will try writing to the same file the 0th is still writing to. Possibly you should consider another approach. If you still want to write to multiple files (though IO does not usually scale that well, so a blocking write with a queue into one file may be the way to go), you can split your data into chunks and process them in parallel, one "thread" per chunk, with every chunk writing to its own file so there is no need for synchronization.
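A sketch of that chunked approach, assuming the final ordering does not matter and with Process standing in for the real per-record work:

// One contiguous chunk of records per worker; each worker owns its own
// file, so no writer state is ever shared between threads.
int chunkCount = 10;
int chunkSize = (clientInputRecordList.Count + chunkCount - 1) / chunkCount;

Parallel.For(0, chunkCount, chunkIndex =>
{
    int start = chunkIndex * chunkSize;
    int end = Math.Min(start + chunkSize, clientInputRecordList.Count);
    using var writer = new StreamWriter($"output_part{chunkIndex}.tmp");
    for (int i = start; i < end; i++)
        writer.WriteLine(Process(clientInputRecordList[i])); // hypothetical per-record work
});
// Concatenate the temp files afterwards if a single output file is required.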
Some potentially useful reading:
Overview of synchronization primitives
System.Threading.Channels
TPL Dataflow
Threading in C# by Joseph Albahari
I posted another SO question here, and as a follow-up, my colleague did a test, seen below, as some form of "counter" to the argument for async/await/Tasks.
(I am aware that the lock on resultList isn't needed, disregard that)
I am aware that async/await and Tasks is not made to handle CPU-intensive tasks but instead handle I/O operations that are done by the OS. The benchmark below is a CPU-intensive task, so the test is flawed from start.
However, as I understand it, using new Task().Start() schedules the operation on the ThreadPool and executes the test code on different ThreadPool threads. Wouldn't that mean that the first and second tests are more or less the same? (I'm guessing not; please explain why.)
Why then the big difference between them?
some form of "counter" to the argument for async/await/Tasks.
The posted code has absolutely nothing to do with async or await. It's comparing three different kinds of parallelism:
Dynamic Task Parallelism.
Direct threadpool access.
Manual multithreading with manual partitioning.
The first two are somewhat comparable. Of course, direct threadpool access will be faster than Dynamic Task Parallelism. But what these tests don't show is that direct threadpool access is much harder to do correctly. In particular, when you are running real-world code and need to handle exceptions and return values, you have to add in boilerplate code and object instances to the direct threadpool access code that slows it down.
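To illustrate, a sketch of the boilerplate that direct threadpool access tends to need in real-world code just to collect results and surface exceptions (the items list and ProcessItem are placeholders):

var results = new string[items.Count];
var errors = new ConcurrentQueue<Exception>();
using (var done = new CountdownEvent(items.Count))
{
    for (int i = 0; i < items.Count; i++)
    {
        int index = i; // copy for the closure
        ThreadPool.QueueUserWorkItem(_ =>
        {
            try { results[index] = ProcessItem(items[index]); } // hypothetical worker
            catch (Exception ex) { errors.Enqueue(ex); }
            finally { done.Signal(); }
        });
    }
    done.Wait();
}
if (!errors.IsEmpty) throw new AggregateException(errors);
// Parallel, PLINQ, and Tasks do all of this bookkeeping for you.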
The third one is not comparable at all. It just uses 10 manual threads. Again, this example ignores the additional complexity necessary in real-world code; specifically, the need to handle exceptions and return values. It also assumes a partition size, which is problematic; real-world code does not have that luxury. If you're managing your own set of threads, then you have to decide things like how quickly you should increase the number of threads when the queue has many items, and how quickly you should end threads when the queue is empty. These are all difficult questions that add lots of code to the #3 test before you're really comparing the same thing.
And that's not even to say anything about the cost of maintenance. In my experience (i.e., as an application developer), micro-optimizations are just not worth it. Even if you took the "worst" (#1) approach, you're losing about 7 microseconds per item. That is an unimaginably small amount of savings. As a general rule, developer time is far more valuable to your company than user time. If your users have to process a hundred thousand items, the difference would barely be perceptible. If you were to adopt the "best" (#3) approach, the code would be much less maintainable, particularly considering the boilerplate and thread management code necessary in production code and not shown here. Going with #3 would probably cost your company far more in terms of developer time just writing or reading the code than it would ever save in terms of user time.
Oh, and the funniest part of all this is that with all these different kinds of parallelism compared, they didn't even include the one that is most suitable for this test: PLINQ.
static void Main(string[] args)
{
TaskParallelLibrary();
ManualThreads();
Console.ReadKey();
}
static void ManualThreads()
{
var queue = new List<string>();
for (int i = 0; i != 1000000; ++i)
queue.Add("string" + i);
var resultList = new List<string>();
var stopwatch = Stopwatch.StartNew();
var counter = 0;
for (int i = 0; i != 10; ++i)
{
new Thread(() =>
{
while (true)
{
var t = "";
lock (queue)
{
if (counter >= queue.Count)
break;
t = queue[counter];
++counter;
}
t = t.Substring(0, 5);
string t2 = t.Substring(0, 2) + t;
lock (resultList)
resultList.Add(t2);
}
}).Start();
}
while (resultList.Count < queue.Count)
Thread.Sleep(1);
stopwatch.Stop();
Console.WriteLine($"Manual threads: Processed {resultList.Count} in {stopwatch.Elapsed}");
}
static void TaskParallelLibrary()
{
var queue = new List<string>();
for (int i = 0; i != 1000000; ++i)
queue.Add("string" + i);
var stopwatch = Stopwatch.StartNew();
var resultList = queue.AsParallel().Select(t =>
{
t = t.Substring(0, 5);
return t.Substring(0, 2) + t;
}).ToList();
stopwatch.Stop();
Console.WriteLine($"Parallel: Processed {resultList.Count} in {stopwatch.Elapsed}");
}
On my machine, after running this code several times, I find that the PLINQ code outperforms the Manual Threads by about 30%. Sample output on .NET Core 3.0 preview5-27626-15, built for Release, run standalone:
Parallel: Processed 1000000 in 00:00:00.3629408
Manual threads: Processed 1000000 in 00:00:00.5119985
And, of course, the PLINQ code is:
Shorter
More maintainable
More robust (handles exceptions and return types)
Less awkward (no need to poll for completion)
More portable (partitions based on number of processors)
More flexible (automatically adjusts the thread pool as necessary based on amount of work)
I'm trying to understand why Parallel.For is able to outperform a number of threads in the following scenario: consider a batch of jobs that can be processed in parallel. While processing these jobs, new work may be added, which then needs to be processed as well. The Parallel.For solution would look as follows:
var jobs = new List<Job> { firstJob };
int startIdx = 0, endIdx = jobs.Count;
while (startIdx < endIdx) {
Parallel.For(startIdx, endIdx, i => WorkJob(jobs[i]));
startIdx = endIdx; endIdx = jobs.Count;
}
This means that there are multiple points where the Parallel.For needs to synchronize. Consider a breadth-first graph algorithm; the number of synchronizations would be quite large. A waste of time, no?
Trying the same in the old-fashioned threading approach:
var queue = new ConcurrentQueue<Job> { firstJob };
var threads = new List<Thread>();
var waitHandle = new AutoResetEvent(false);
int numBusy = 0;
for (int i = 0; i < maxThreads; i++)
threads.Add(new Thread(new ThreadStart(delegate {
while (!queue.IsEmpty || numBusy > 0) {
if (queue.IsEmpty)
// numbusy > 0 implies more data may arrive
waitHandle.WaitOne();
Job job;
if (queue.TryDequeue(out job)) {
Interlocked.Increment(ref numBusy);
WorkJob(job); // WorkJob does a waitHandle.Set() when more work was found
Interlocked.Decrement(ref numBusy);
}
}
// others are possibly waiting for us to enable more work which won't happen
waitHandle.Set();
})));
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
The Parallel.For code is of course much cleaner, but what I cannot comprehend is that it's even faster as well! Is the task scheduler just that good? The synchronizations were eliminated, there's no busy waiting, and yet the threaded approach is consistently slower (for me). What's going on? Can the threading approach be made faster?
Edit: thanks for all the answers, I wish I could pick multiple ones. I chose to go with the one that also shows an actual possible improvement.
The two code samples are not really the same.
The Parallel.For() will use a limited number of threads and re-use them. The second sample starts way behind by having to create a number of threads, which takes time.
And what is the value of maxThreads? That is very critical; in Parallel.For() it is dynamic.
Is the task scheduler just that good?
It is pretty good. And the TPL uses work stealing and other adaptive technologies. You'll have a hard time doing any better.
Parallel.For doesn't actually break the items up into single units of work. It breaks up all the work (early on) based on the number of threads it plans to use and the number of iterations to be executed. Then it has each thread synchronously process its batch (possibly using work stealing, or saving some extra items for load balancing near the end). With this approach the worker threads virtually never wait on each other, while your threads constantly wait on each other due to the heavy synchronization you use before and after every single iteration.
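You can make that batching explicit yourself with Partitioner.Create, which hands each worker a contiguous [from, to) range instead of single items (a sketch reusing the question's jobs and WorkJob):

Parallel.ForEach(Partitioner.Create(0, jobs.Count), range =>
{
    // Each worker claims a whole range up front and runs through it
    // with no further synchronization, mimicking Parallel.For's batching.
    for (int i = range.Item1; i < range.Item2; i++)
        WorkJob(jobs[i]);
});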
On top of that, since it uses thread pool threads, many of the threads it needs are likely already created, which is another advantage in its favor.
As for synchronization, the entire point of a Parallel.For is that all of the iterations can be done in parallel, so there is almost no synchronization that needs to take place (at least in their code).
Then of course there is the issue of the number of threads. The threadpool has a lot of very good algorithms and heuristics to help it determine how many threads are needed at that instant in time, based on the current hardware, the load from other applications, etc. It's possible that you're using too many threads, or not enough.
Also, since the number of items that you have isn't known before you start, I would suggest using Parallel.ForEach rather than several Parallel.For loops. It is simply designed for the situation that you're in, so its heuristics will apply better. (It also makes for even cleaner code.)
BlockingCollection<Job> queue = new BlockingCollection<Job>();
//add jobs to queue, possibly in another thread
//call queue.CompleteAdding() when there are no more jobs to run
Parallel.ForEach(queue.GetConsumingEnumerable(),
job => job.DoWork());
You're creating a bunch of new threads, while Parallel.For uses the thread pool. You would see better performance if you utilized the thread pool yourself, but there really is no point in doing that.
I would shy away from rolling your own solution; if there is a corner case where you need customization, use the TPL and customize it.
I have a function that processes a list of 6100 items. The code worked when the list was just 300 items, but it instantly crashes with 6100. Is there a way I can loop through these 6100 items, say, 30 at a time, and execute a new thread per item?
for (var i = 0; i < ListProxies.Items.Count; i++)
{
var s = ListProxies.Items[i] as string;
var thread = new ParameterizedThreadStart(ProxyTest.IsAlive);
var doIt = new Thread(thread) { Name = "CheckProxy# " + i };
doIt.Start(s);
}
Any help would be greatly appreciated.
Do you really need to spawn a new thread for each work item? Unless there is a genuine need for this (if so, please tell us why), I would strongly recommend you use the Managed Thread Pool instead. This will give you the concurrency benefits you require, but without the resource requirements (as well as the creation, destruction and massive context-switching costs) of running thousands of threads. If you are on .NET 4.0, you might also want to consider using the Task Parallel Library.
For example:
for (var i = 0; i < ListProxies.Items.Count; i++)
{
var s = ListProxies.Items[i] as string;
ThreadPool.QueueUserWorkItem(ProxyTest.IsAlive, s);
}
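Or, on .NET 4.0, a sketch of the Task Parallel Library equivalent, which also gives you the "30 at a time" cap you asked about (the cap value here is just your number, not a recommendation):

// Processes the whole list with at most 30 concurrent workers.
Parallel.ForEach(
    ListProxies.Items.OfType<string>(),
    new ParallelOptions { MaxDegreeOfParallelism = 30 },
    s => ProxyTest.IsAlive(s));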
On another note, I would seriously consider renaming the IsAlive method (which looks like a boolean property or method) since:
It clearly has a void IsAlive(object) signature.
It has observable side-effects (from your comment that it "increment a progress bar and add a 'working' proxy to a new list").
There is a limit on the number of threads you can spawn, and 6100 threads does seem quite a bit excessive.
I agree with Ani: you should look into a ThreadPool or even a producer/consumer process, depending on what you are trying to accomplish.
There are quite a few approaches for handling multi-threaded applications, but without knowing what you are doing at the start, there really is no way to recommend anything other than a ThreadPool or a producer/consumer process (queues with sync events).
At any rate, you really should try to keep the number of threads to a minimum; otherwise you run the risk of lock contention, spin waits, deadlocks, race conditions, and who knows what else.
If you want good information on threading with C#, check out the book Concurrent Programming on Windows by Joe Duffy; it is really helpful.
I am trying to use ThreadPool.RegisterWaitForSingleObject to add a timer to a set of threads. I create 9 threads and am trying to give each of them an equal chance to run, as at the moment there seems to be a little starvation going on if I just add them to the thread pool. I am also trying to implement a manual reset event, as I want all 9 threads to exit before continuing.
What is the best way to ensure that each thread in the threadpool gets an equal chance at running? The function that I am calling has a loop, and it seems that whichever thread runs first gets stuck in it while the others don't get a chance to run.
resetEvents = new ManualResetEvent[table_seats];
//Spawn 9 threads
for (int i = 0; i < table_seats; i++)
{
resetEvents[i] = new ManualResetEvent(false);
//AutoResetEvent ev = new AutoResetEvent(false);
RegisteredWaitHandle handle = ThreadPool.RegisterWaitForSingleObject(autoEvent, ObserveSeat, (object)i, 100, false);
}
//wait for threads to exit
WaitHandle.WaitAll(resetEvents);
However, it doesn't matter whether I use resetEvents[] or ev; neither seems to work properly. Can I implement this at all, or am I (probably) misunderstanding how they should work?
Thanks, R.
I would not use the RegisterWaitForSingleObject for this purpose. The patterns I am going to describe here require the Reactive Extensions download since you are using .NET v3.5.
First, to wait for all work items from the ThreadPool to complete use the CountdownEvent class. This is a lot more elegant and scalable than using multiple ManualResetEvent instances. Plus, the WaitHandle.WaitAll method is limited to 64 handles.
var finished = new CountdownEvent(1);
for (int i = 0; i < table_seats; i++)
{
finished.AddCount();
    ThreadPool.QueueUserWorkItem(
        (state) =>
        {
            try
            {
                ObserveSeat(state);
            }
            finally
            {
                finished.Signal();
            }
        }, i);
}
finished.Signal();
finished.Wait();
Second, you could try calling Thread.Sleep(0) after several iterations of the loop to force a context switch so that the current thread yields to another. If you want a considerably more complex coordination strategy then use the Barrier class. Add another parameter to your ObserveSeat function which accepts this synchronization mechanism. You could supply it by capturing it in the lambda expression in the code above.
public void ObserveSeat(object state, Barrier barrier)
{
barrier.AddParticipant();
try
{
for (int i = 0; i < NUM_ITERATIONS; i++)
{
if (i % AMOUNT == 0)
{
// Let the other threads know we are done with this phase and wait
// for them to catch up.
barrier.SignalAndWait();
}
// Perform your work here.
}
}
finally
{
barrier.RemoveParticipant();
}
}
Note that although this approach would certainly prevent the starvation issue, it might limit the throughput of the threads. Calling SignalAndWait too often might cause a lot of unnecessary context switching, but calling it too rarely might cause a lot of unnecessary waiting. You would probably have to tune AMOUNT to get the optimal balance of throughput and starvation. I suspect there might be a simple way to do that tuning dynamically.