I am reading http://www.mono-project.com/ThreadsBeginnersGuide.
The first example looks like this:
using System;
using System.Threading;

public class FirstUnsyncThreads
{
    private int i = 0;

    public static void Main(string[] args)
    {
        FirstUnsyncThreads myThreads = new FirstUnsyncThreads();
    }

    public FirstUnsyncThreads()
    {
        // Creating our two threads. The ThreadStart delegate points to
        // the method being run in a new thread.
        Thread firstRunner = new Thread(new ThreadStart(this.firstRun));
        Thread secondRunner = new Thread(new ThreadStart(this.secondRun));

        // Starting our two threads. Thread.Sleep(10) gives the first thread
        // a 10 millisecond head start.
        firstRunner.Start();
        Thread.Sleep(10);
        secondRunner.Start();
    }

    // This method is executed on the first thread.
    public void firstRun()
    {
        while (this.i < 10)
        {
            Console.WriteLine("First runner incrementing i from " + this.i +
                " to " + ++this.i);
            // This avoids the first runner doing all the work before
            // the second one has even started. (Happens on high-performance
            // machines sometimes.)
            Thread.Sleep(100);
        }
    }

    // This method is executed on the second thread.
    public void secondRun()
    {
        while (this.i < 10)
        {
            Console.WriteLine("Second runner incrementing i from " + this.i +
                " to " + ++this.i);
            Thread.Sleep(100);
        }
    }
}
Output:
First runner incrementing i from 0 to 1
Second runner incrementing i from 1 to 2
Second runner incrementing i from 3 to 4
First runner incrementing i from 2 to 3
Second runner incrementing i from 5 to 6
First runner incrementing i from 4 to 5
First runner incrementing i from 6 to 7
Second runner incrementing i from 7 to 8
Second runner incrementing i from 9 to 10
First runner incrementing i from 8 to 9
Wow, what is this? Unfortunately, the explanation in the article is inadequate for me. Can you explain why the increments happened in this jumbled order?
Thanks!
I think the writer of the article has confused things.
VoteyDisciple is correct that ++i is not atomic and a race condition can occur if the target is not locked during the operation, but this will not cause the issue described above.
If a race condition occurs calling ++i then internal operations of the ++ operator will look something like:-
1. 1st thread reads value 0
2. 2nd thread reads value 0
3. 1st thread increments value to 1
4. 2nd thread increments value to 1
5. 1st thread writes value 1
6. 2nd thread writes value 1
The order of operations 3 to 6 is unimportant; the point is that both read operations, 1 and 2, can occur while the variable has value x, so both threads increment to the same value y, rather than each thread performing its increment on a distinct value of x.
This may result in the following output:-
First runner incrementing i from 0 to 1
Second runner incrementing i from 0 to 1
What would be even worse is the following:-
1. 1st thread reads value 0
2. 2nd thread reads value 0
3. 2nd thread increments value to 1
4. 2nd thread writes value 1
5. 2nd thread reads value 1
6. 2nd thread increments value to 2
7. 2nd thread writes value 2
8. 1st thread increments value to 1
9. 1st thread writes value 1
10. 2nd thread reads value 1
11. 2nd thread increments value to 2
12. 2nd thread writes value 2
This may result in the following output:-
First runner incrementing i from 0 to 1
Second runner incrementing i from 0 to 1
Second runner incrementing i from 1 to 2
Second runner incrementing i from 1 to 2
And so on.
Furthermore, there is a possible race condition between reading i and performing ++i since the Console.WriteLine call concatenates i and ++i. This may result in output like:-
First runner incrementing i from 0 to 1
Second runner incrementing i from 1 to 3
First runner incrementing i from 1 to 2
The jumbled console output which the writer has described can only result from the unpredictability of the console output and has nothing to do with a race condition on the i variable. Taking a lock on i whilst performing ++i or whilst concatenating i and ++i will not change this behaviour.
When I run this (on a dual-core machine), my output is:
First runner incrementing i from 0 to 1
Second runner incrementing i from 1 to 2
First runner incrementing i from 2 to 3
Second runner incrementing i from 3 to 4
First runner incrementing i from 4 to 5
Second runner incrementing i from 5 to 6
First runner incrementing i from 6 to 7
Second runner incrementing i from 7 to 8
First runner incrementing i from 8 to 9
Second runner incrementing i from 9 to 10
As I would have expected. You are running two loops, both executing Sleep(100). That is very ill-suited to demonstrating a race condition.
The code does have a race condition (as VoteyDisciple describes) but it is very unlikely to surface.
I can't explain the lack of order in your output (is it a real output?), but the Console class will synchronize output calls.
If you leave out the Sleep() calls and run the loops 1000 times (instead of 10) you might see two runners both incrementing from 554 to 555 or something.
Synchronization is essential when multiple threads are present. In this case both threads read and write this.i, but no attempt is made to synchronize those accesses. Since both concurrently modify the same memory, you observe the jumbled output.
The call to Sleep is dangerous; it is an approach that leads to certain bugs. You cannot assume the threads will always remain offset by the initial 10 ms.
In short: never use Sleep for synchronization :-) but instead adopt a proper thread synchronization technique (e.g. locks, mutexes, semaphores). Always try to use the lightest lock that fulfills your need.
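For instance, here is a minimal sketch of the first runner protected with a lock (the lockObject field and the restructured loop are additions of mine, not from the article):

private readonly object lockObject = new object();

public void firstRun()
{
    while (true)
    {
        lock (lockObject)
        {
            if (this.i >= 10)
                break;
            // The read and the increment now happen atomically with
            // respect to the other runner, so no update can be lost.
            Console.WriteLine("First runner incrementing i from " + this.i +
                " to " + ++this.i);
        }
        Thread.Sleep(100);
    }
}

Note that this only makes the increment safe; as discussed elsewhere on this page, it still does not dictate the order in which the two runners' lines appear on the console.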
A useful resource is the book by Joe Duffy, Concurrent Programming on Windows.
The increments are not happening out of order; Console.WriteLine(...) is writing output from multiple threads to a single console, and the synchronization from many threads down to one is what makes the messages appear out of order.
I assume this example attempted to create a race condition and, in your case, failed. Unfortunately, concurrency issues such as race conditions and deadlocks are hard to predict and reproduce by their nature. You might want to run it a few more times, or alter it to use more threads with each thread incrementing more times (say 100,000). Then you might see that the end result does not equal the sum of all the increments (caused by a race condition).
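As a sketch of that suggestion (the class and iteration counts below are mine, not from the article), something along these lines usually makes the lost updates visible:

using System;
using System.Threading;

class RaceDemo
{
    static int i = 0;

    static void Main()
    {
        Thread first = new Thread(Run);
        Thread second = new Thread(Run);
        first.Start();
        second.Start();
        first.Join();
        second.Join();

        // Two threads * 100000 increments should give 200000;
        // anything less means updates were lost to the race.
        Console.WriteLine(i);
    }

    static void Run()
    {
        // No Sleep and many iterations: both threads hammer the same
        // field, making the non-atomic ++ easy to catch in the act.
        for (int n = 0; n < 100000; n++)
            i++;
    }
}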
Related
I am trying to optimize a data collection process in C#. I would like to understand why a certain method of parallelism I am trying is not working as expected (more details below; see the "Question" section at the very bottom).
BACKGROUND
I have an external .NET Framework DLL, which is an API for an external data source; I can only consume the API, I do not have access to what goes on behind the scenes.
The API provides a function like: GetInfo(string partID, string fieldValue). Using this function, I can get information about a specific part, filtered for a single field/criteria. One single call (for just one part ID and one field value) takes around 20 milliseconds in an optimal case.
A part can have many values for the field criteria. So in order to get all the info for a part, I have to enumerate through all possible field values (13 in this case). And to get all the info for many parts (~260), I have to enumerate through all the part IDs and all the field values.
I already have all the part IDs and possible field values. The problem is performance. Using a serial approach (2 nested for-loops) is too slow (takes ~70 seconds). I would like to get the time down to ~5 seconds.
WHAT I HAVE TRIED
For different mediums of parallelizing work, I have tried:
calling API in parallel via Tasks within a single main application.
calling API in parallel via Parallel.ForEach within a single main application.
Wrapping the API call with a WCF service, and having multiple WCF service instances (just like having multiple Tasks, but this is multiple processes instead); the single main client application will call the API in parallel through these services.
For different logic of parallelizing work, I have tried:
experiment 0 has 2 nested for-loops; this is the base case without any parallel calls (so that's ~260 part IDs * 13 field values = ~3400 API calls in series).
experiment 1 has 13 parallel branches, and each branch deals with smaller 2 nested for-loops; essentially, it is dividing experiment 0 into 13 parallel branches (so rather than iterating over ~260 part IDs * 13 field values, each branch will iterate over ~20 part IDs * all 13 field values = ~260 API calls in series per branch).
experiment 2 has 13 parallel branches, and each branch deals with ALL part IDs but only 1 specific field value for each branch (so each branch will iterate over ~260 part IDs * 1 field value = ~260 API calls in series per branch).
experiment 3 has 1 for-loop which iterates over the part IDs in series, but inside the loop makes 13 parallel calls (for 13 field values); only when all 13 info is retrieved for one part ID will the loop move on to the next part ID.
I have tried experiments 1, 2, and 3 combined with the different mediums (Tasks, Parallel.ForEach, separate processes via WCF Services); so there is a total of 9 combinations. Plus the base case experiment 0 which is just 2 nested for-loops (no parallelizing there).
I also ran each combination 4 times (each time with a different set of ~260 part IDs), to test for repeatability.
In every experiment/medium combination, I am timing only the direct API call using Stopwatch; so the time is not affected by any other parts of the code (like Task creation, etc.).
Here is how I am wrapping the API call in WCF service (also shows how I am timing the API call):
public async Task<Info[]> GetInfosAsync(string[] partIDs, string[] fieldValues)
{
    Info[] infos = new Info[partIDs.Length * fieldValues.Length];
    await Task.Run(() =>
    {
        for (int i = 0; i < partIDs.Length; i++)
        {
            for (int j = 0; j < fieldValues.Length; j++)
            {
                Stopwatch timer = Stopwatch.StartNew();
                infos[i * fieldValues.Length + j] = api.GetInfo(partIDs[i], fieldValues[j]);
                timer.Stop();
                // log timer.ElapsedMilliseconds to file (each parallel branch writes to its own file)
            }
        }
    });
    return infos;
}
And to better illustrate the 3 different experiments, here is how they are structured. These are run from the main application. I am only including how the experiments were done using the inter-process communication (GetInfosAsync defined above), as that gave me the most significant results (as explained under "Results" further below).
// experiment 1
Task<Info[]>[] tasks = new Task<Info[]>[numBranches]; // numBranches = 13
for (int k = 0; k < numBranches; k++)
{
    // each service/branch gets partIDsForBranch[k] (a subset of ~20 part IDs
    // used only by branch k) and all 13 fieldValues
    tasks[k] = services[k].GetInfosAsync(partIDsForBranch[k], fieldValues);
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll completes to get Info[]

// experiment 2
Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
for (int j = 0; j < fieldValues.Length; j++)
{
    // each service/branch gets all ~260 partIDs and only 1 unique fieldValue
    tasks[j] = services[j].GetInfosAsync(partIDs, new string[] { fieldValues[j] });
}
Task.WaitAll(tasks); // loop through each task.Result after WaitAll completes to get Info[]

// experiment 3
for (int i = 0; i < partIDs.Length; i++)
{
    Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
    for (int j = 0; j < fieldValues.Length; j++)
    {
        // each branch/service gets the currently iterated partID and only 1 unique fieldValue
        tasks[j] = services[j].GetInfosAsync(new string[] { partIDs[i] }, new string[] { fieldValues[j] });
    }
    Task.WaitAll(tasks); // loop through each task.Result after WaitAll completes to get Info[]
}
RESULTS
For experiments 1 and 2...
Task (within the same application) and Parallel.ForEach (within the same application) perform about the same as the base case experiment (approximately 70 to 80 seconds).
inter-process communication (i.e. making parallel calls to multiple WCF services separate from the main application) performs significantly better than Task/Parallel.ForEach. This made sense to me (I've read about how multi-process could potentially be faster than multi-thread). Experiment 2 performs better than experiment 1, with the best experiment 2 run being around 8 seconds.
For experiment 3...
Task and Parallel.ForEach (within the same application) perform close to their experiment 1 and 2 counterparts (but take around 10 to 20 seconds longer).
inter-process communication was significantly worse compared to all other experiments, taking around 200 to 300 seconds in total. This is the result I don't understand (see "What I Expected" section further below).
The graphs below give a visual representation of these results. Except for the bar chart summary, I only included the charts for the inter-process communication results, since those were the most significant (both good and bad).
Figure 1 (above). Elapsed times of each individual API call for experiments 1, 2, and 3 for a particular run, for inter-process communication; experiment 0 is also included (top-left).
Figure 2 (above). Summary for each method/experiment for all 4 runs (top-left). And aggregate versions for the experiment graphs above (for experiments 1 and 2, this is the sum of each branch, and the total time would be the max of these sums; for experiment 3, this is the max of each loop, and the total time would be the sum of all these maxes). So in experiment 3, almost every iteration of the outer loop is taking around 1 second, meaning there is one parallel API call in every iteration that is taking 1 second...
WHAT I EXPECTED
The best performance I got was experiment 2 with inter-process communication (the best run was around 8 seconds in total). Since experiment 2 runs were better than experiment 1 runs, perhaps there is some behind-the-scenes optimization on the field value (i.e. experiment 1 could have different branches clash by querying the same field value at the same time, whereas each branch in experiment 2 queries its own unique field value at any point in time).
I understand that the backend will be restricted to a certain number of calls per time period, so the spikes I see in experiments 1 and 2 make sense (and why there are almost no spikes in experiment 0).
That's why I thought, for experiment 3 using inter-process communication, I am only making 13 API calls in parallel at any single point in time (for a single part ID, each branch having its own field value), not proceeding to the next part ID until all 13 are done. That seemed like fewer API calls per time period than experiment 2, which continuously makes calls on each branch. So for experiment 3 I expected few spikes and all 13 calls to complete in the time a single API call takes (~20 ms).
But what actually happened was that experiment 3 took the most time, and the majority of API call times spike significantly (i.e. each iteration has a call taking around 1 second).
I also understand that experiments 1 and 2 only have 13 long-lived parallel branches that last throughout a single run, whereas experiment 3 creates 13 new short-lived parallel branches ~260 times (so around ~3400 short-lived parallel branches are created over the lifetime of the run). If I were timing the task creation, I would understand the increased time due to overhead, but since I am timing the API call directly, how does this affect the API call itself?
QUESTION
Is there a possible explanation to why experiment 3 behaved this way? Or is there a way to profile/investigate this? I understand that this is not much to go off of without knowing what happens behind-the-scenes of the API... But what I am asking is how to go about investigating this, if possible.
This may be because your experiment 3 uses two loops, and the objects created inside the outer loop are allocated anew on every iteration, increasing the workload of each pass.
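If per-iteration allocation is the suspicion, one way to test it is to hoist the allocations out of the outer loop. Here is a sketch against the experiment 3 code above (partIDs, fieldValues, and services are from the question; the hoisted arrays are my own):

// Allocate the task array and argument arrays once, outside the loop.
Task<Info[]>[] tasks = new Task<Info[]>[fieldValues.Length];
string[][] singleFieldValues = new string[fieldValues.Length][];
for (int j = 0; j < fieldValues.Length; j++)
    singleFieldValues[j] = new string[] { fieldValues[j] };

string[] singlePartID = new string[1];
for (int i = 0; i < partIDs.Length; i++)
{
    // Reusing the one-element array is safe here because Task.WaitAll
    // completes before the next iteration mutates it.
    singlePartID[0] = partIDs[i];
    for (int j = 0; j < fieldValues.Length; j++)
        tasks[j] = services[j].GetInfosAsync(singlePartID, singleFieldValues[j]);
    Task.WaitAll(tasks);
}

If the timings do not change, the per-iteration allocations were not the bottleneck, and the cost more likely lies on the API or WCF side.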
I can already see that it isn't atomic from the incorrect increments, but there's just one small piece of the puzzle I can't quite seem to catch.
We have the following code:
internal class StupidObject
{
    public static SemaphoreSlim semaphore = new SemaphoreSlim(0, 100);

    private int counter;

    public bool MethodCall() => counter++ == 0;

    public int GetCounter() => counter;
}
And the following test code to try and see if it's an atomic operation:
var sharedObj = new StupidObject();
var resultTasks = new Task[100];
for (int i = 0; i < 100; i++)
{
    resultTasks[i] = Task.Run(async () =>
    {
        await StupidObject.semaphore.WaitAsync();
        if (sharedObj.MethodCall())
        {
            Console.WriteLine("True");
        }
    });
}

Console.WriteLine("Done");
Console.ReadLine();
StupidObject.semaphore.Release(100);

Console.ReadLine();
Console.WriteLine(sharedObj.GetCounter());
Console.ReadLine();
I expect to see multiple True's written to the console, but I only ever see a single one.
Why is that? By my understanding, a ++ operation reads the value, increments the read value, and then stores that value to the variable.
Those are 3 operations. Suppose we had a race condition where thread A did the following:
Reads the value as 0.
Increments the read value by 1.
And another thread B did the same things, but beat thread A to the third operation:
Writes the read value back to the variable.
When A then finishes writing its incremented value, both threads would have observed 0, so both calls should return true and print True.
Am I missing something at the design aspect of things, or is my test not good enough to make this exact situation come to fruition?
Example without the Task Parallel Library (still yields a single True to the console):
var sharedObj = new StupidObject();
var resultTasks = new Thread[10000];
for (int i = 0; i < 10000; i++)
{
    resultTasks[i] = new Thread(() =>
    {
        StupidObject.semaphore.Wait();
        if (sharedObj.MethodCall())
        {
            Console.WriteLine("True");
        }
    });
    resultTasks[i].IsBackground = false;
    resultTasks[i].Start();
}

Console.WriteLine("Done");
Console.ReadLine();
// Note: for Release(10000) to succeed, the semaphore must be created
// with a maxCount of at least 10000 (the class above uses 100).
StupidObject.semaphore.Release(10000);
What Liam said about Console.WriteLine is possible, but also there's another thing.
Starting Tasks doesn't equal starting threads, and even starting threads doesn't guarantee that all of them will begin immediately. Starting 100 short tasks probably won't even fill .NET's thread pool significantly, because those tasks end quickly and the pool's manager probably won't start more than 3-5 threads. That's not the "immediate" and "parallel" behaviour you'd like to see when you want 100 parallel increments racing with each other, right? Remember that Tasks are queued first, then assigned to threads.
Note that StupidObject's counter starts at zero, and that's the ONLY MOMENT EVER that the value is zero. If ANY thread wins the race and successfully writes an update to that integer, you'll get FALSE in all future tasks, because it's already 1.
And if there are many tasks in the thread pool's queue, something first has to notice that fact. At program start the thread pool lacks threads; they are not started by the dozen right away, but on demand. Most probably you fill up the queue with 100 tasks, one pool thread is created, it picks the first task and bumps the counter to 1, and only then does the pool perhaps start more threads to consume tasks faster.
To get a better picture of what's happening, instead of printing out 'true', collect the values observed by counter++: let each task run to completion and store the observed value in the Task's .Result, wait for all of them to stop, then collect the .Results and write a histogram of those values. Even if you don't see 5 zeros, maybe you will see 3 ones, 7 twos, 2 threes, and so on.
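Here is a minimal sketch of that suggestion (ObservedValue is a hypothetical variant of MethodCall that returns counter++ itself rather than comparing it to zero; the histogram needs using System.Linq):

var sharedObj = new StupidObject();
var resultTasks = new Task<int>[100];
for (int i = 0; i < 100; i++)
{
    resultTasks[i] = Task.Run(async () =>
    {
        await StupidObject.semaphore.WaitAsync();
        return sharedObj.ObservedValue(); // hypothetical: public int ObservedValue() => counter++;
    });
}

StupidObject.semaphore.Release(100);
Task.WaitAll(resultTasks);

// Histogram of observed values: any value seen more than once is a lost update.
foreach (var group in resultTasks.GroupBy(t => t.Result).OrderBy(g => g.Key))
    Console.WriteLine(group.Key + ": " + group.Count() + " time(s)");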
I'm creating new threads for every sql call for a project. There are millions of sql calls so I'm calling a procedure in a new thread to handle the sql calls.
In doing so, I wanted to increment and decrement a counter so that I know when these threads have completed the sql query.
To my amazement the output shows NEGATIVE values in the counter. HOW? When I am starting with 0 and adding 1 at the beginning of the process and subtracting 1 at the end of the process?
This int is not used anywhere else in the program. The following is the code:
public static int counter = 0;

while (!txtstream.EndOfStream)
{
    new Thread(delegate()
    {
        processline();
    }).Start();
    Console.WriteLine(counter);
}

public static void processline()
{
    counter++;
    sql.ExecuteNonQuery();
    counter--;
}
Output looks something like this:
1
21
-2
-2
5
Nothing mysterious about it; you are using threading, right?
The ++ and -- operators aren't thread-safe. Do this:
public static void processline()
{
    Interlocked.Increment(ref counter);
    sql.ExecuteNonQuery();
    Interlocked.Decrement(ref counter);
}
How to overcome
Use Interlocked.Increment and Interlocked.Decrement to safely change the value of the counter.
Why this happens
You have counter as a variable shared across multiple threads. The variable is non-volatile and the updates are not wrapped in any synchronization block, so each thread may work with its own cached copy, and the read-increment-write sequence itself is not atomic. If two threads try to change the value at the same time, the value written by one can be overwritten by the other. Imagine your code running on two threads:
Initially counter equals zero, and both threads have read that value.
Both threads increment their cached copies and then write back to counter. Thread 1 increments its copy to 1 and writes it; thread 2 also increments its copy (still equal to zero) to 1 and writes back, again, 1. After that, the value 1 is what every thread sees.
Both threads invoke the sql query. Due to variability in sql performance, these queries complete at different times.
Thread 1 finishes its sql query and decrements counter from 1 to 0. The new value becomes visible to all threads.
After some time, thread 2 finishes its sql query and decrements counter from the already-visible 0 to -1. And there it is: -1.
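For completeness, here is a sketch of the whole pattern (the Volatile.Read on the monitoring side is an addition of mine; the fix above only changes the writes). The WriteLine shown here would replace the one inside the question's while loop:

public static int counter = 0;

public static void processline()
{
    Interlocked.Increment(ref counter);
    sql.ExecuteNonQuery();
    Interlocked.Decrement(ref counter);
}

// On the monitoring thread, read with Volatile.Read so the latest
// written value is observed rather than a stale cached one:
Console.WriteLine(Volatile.Read(ref counter));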
I'm investigating the Parallelism Break in a For loop.
After reading this and this I still have a question:
I'd expect this code:
Parallel.For(0, 10, (i, state) =>
{
    Console.WriteLine(i);
    if (i == 5) state.Break();
});
to yield at most 6 numbers (0..5).
Not only does it not do that, but the length of the result differs between runs:
02351486
013542
0135642
Very annoying. (Where on earth is the Break() {after 5} here??)
So I looked at msdn
Break may be used to communicate to the loop that no other iterations after the current iteration need be run.
If Break is called from the 100th iteration of a for loop iterating in
parallel from 0 to 1000, all iterations less than 100 should still be
run, but the iterations from 101 through to 1000 are not necessary.
Question #1:
Which iterations? The overall iteration counter, or per thread? I'm pretty sure it is per thread; please confirm.
Question #2:
Let's assume we are using Parallel + range partitioning (since the CPU cost does not change between elements), so it divides the data among the threads. So if we have 4 cores (and perfect division among them):
core #1 got 0..250
core #2 got 251..500
core #3 got 501..750
core #4 got 751..1000
So the thread on core #1 will reach value = 100 at some point and break; that will be its iteration number 100.
But the thread on core #4 got more quanta and is already at 900, way beyond its 100th iteration.
It has no index less than 100 left to be stopped, so it will show them all.
Am I right? Is that the reason why I get more than 5 elements in my example?
Question #3:
How can I truly break when (i == 5)?
P.S.
I mean, come on! When I call Break(), I want the loop to stop, exactly as it does in a regular for loop.
To yield at most 6 numbers (0..5).
The problem is that this won't yield at most 6 numbers.
What happens is, when you hit the iteration with index 5, you send the "break" request. Break() will cause the loop to no longer start any values >5, but still process all values <5.
However, any values greater than 5 which were already started will still get processed. Since the various indices are running in parallel, they're no longer ordered, so you get various runs where some values >5 (such as 8 in your example) are still being executed.
Which iterations? The overall iteration counter, or per thread? I'm pretty sure it is per thread; please confirm.
This is the index being passed into Parallel.For. Break() won't prevent items from being processed, but guarantees that all items up to 100 get processed; items above 100 may or may not get processed.
Am I right? Is that the reason why I get more than 5 elements in my example?
Yes. If you use a partitioner like you've shown, as soon as you call Break(), items beyond the one where you break will no longer get scheduled. However, items already scheduled (which here means the entire partition) will be processed fully. In your example, this means you're likely to always process all 1000 items.
How can I truly break when (i == 5)?
You are - but when you run in Parallel, things change. What is the actual goal here? If you only want to process the first 6 items (0-5), you should restrict the items before you loop through them via a LINQ query or similar. You can then process the 6 items in Parallel.For or Parallel.ForEach without a Break() and without worry.
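For instance, here is a minimal sketch of restricting first and parallelizing afterwards (assuming the goal really is just items 0..5; the Take approach is my own illustration, using System.Linq):

// Restrict the input first, then parallelize; no Break() is needed
// because the input is already bounded to the items you want.
var firstSix = Enumerable.Range(0, 10).Take(6); // 0..5
Parallel.ForEach(firstSix, i =>
{
    Console.WriteLine(i); // order is still unspecified, but only 0..5 appear
});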
I mean, come on! When I call Break(), I want the loop to stop, exactly as it does in a regular for loop.
You should use Stop() instead of Break() if you want things to stop as quickly as possible. This will not halt items that are already running, but no further items will be scheduled (including ones at lower indices or earlier in the enumeration than your current position).
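Here is a minimal sketch of the Stop() variant (the same loop as in the question, with Break() swapped for Stop() and an IsStopped check added):

Parallel.For(0, 10, (i, state) =>
{
    // Iterations already started may still run to completion,
    // but no new iterations are scheduled after Stop().
    if (state.IsStopped) return;
    Console.WriteLine(i);
    if (i == 5) state.Stop();
});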
If Break is called from the 100th iteration of a for loop iterating in parallel from 0 to 1000
The 100th iteration of the loop is not necessarily (in fact probably not) the one with the index 99.
Your threads can and will run in an indeterminate order. When the .Break() instruction is encountered, no further loop iterations will be started. Exactly when that happens depends on the specifics of thread scheduling for a particular run.
I strongly recommend reading
Patterns of Parallel Programming
(free PDF from Microsoft)
to understand the design decisions and design tradeoffs that went into the TPL.
Which iterations? The overall iteration counter, or per thread?
Of all the iterations scheduled (or yet to be scheduled).
Remember the delegate may be run out of order; there is no guarantee that iteration i == 5 will be the sixth to execute. In fact, that is unlikely to be the case except in rare situations.
Q2: Am I right?
No, the scheduling is not so simplistic. Rather, all the tasks are queued up and then the queue is processed. Each thread uses its own queue until it is empty, at which point it steals from other threads' queues. As a result, there is no way to predict which thread will process which delegate.
If the delegates are sufficiently trivial it might all be processed on the original calling thread (no other thread gets a chance to steal work).
Q3: How can I truly break when (i == 5)?
Don't use concurrency if you want linear, in-order processing.
The Break method is there to support speculative execution: try various ways and stop as soon as any one completes.
I have a multi-threaded application with four threads, and I want to know how much processing time is spent in each thread. I've created all of these threads with ThreadPool.
Thread1 doing job1
Thread2 doing job2
..
..
The result would be:
Thread1 was running for 12 milliseconds
Thread2 was running for 20 milliseconds
Specifically, each job downloads a web page, and each job runs on its own thread. I want to know how much time it takes to download a web page, without the context switches of other threads affecting the measured time.
I found this code on codeproject:
http://www.codeproject.com/KB/dotnet/ExecutionStopwatch.aspx
Try it and report back ;)
If you want to get the total time you would get from a stopwatch, there's the Stopwatch class:
Stopwatch sw = Stopwatch.StartNew();
// execute code
sw.Stop();
// read and report on sw.ElapsedMilliseconds
If you want to find out how much time the thread was actually executing code (and not waiting for I/O etc.) you can examine the ProcessThread.TotalProcessorTime property, by enumerating the threads of the Process object for your application.
Note that threads in the thread pool are not destroyed after use, but left in the pool for reuse, which means your total time for a thread includes everything it has done before your current workload.
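Here is a sketch of that approach (with one caveat: ProcessThread.Id is the OS thread ID, not the managed thread ID, so mapping a particular managed thread to its ProcessThread takes extra work, e.g. P/Invoking GetCurrentThreadId):

using System;
using System.Diagnostics;

class ThreadCpuTimes
{
    static void Main()
    {
        Process current = Process.GetCurrentProcess();
        foreach (ProcessThread thread in current.Threads)
        {
            // TotalProcessorTime is CPU time actually consumed
            // (user + kernel), so time spent blocked on I/O is excluded.
            Console.WriteLine("OS thread " + thread.Id + ": " +
                thread.TotalProcessorTime.TotalMilliseconds + " ms CPU");
        }
    }
}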
The WMI class Win32_Thread contains the properties KernelModeTime and UserModeTime which will, if available, give you a count of 100 ns units of actual execution time.
But, from the documentation:
If this information is not available, a value of 0 (zero) should be used.
So this might be OS dependent (it is certainly populated here on Win7).
A query like: select * from win32_thread where ProcessHandle="x" will get the Win32_Thread instances for process id x (ignore the "handle" in the name). E.g., using PowerShell, looking at its own threads:
PS[64bit] > gwmi -Query "select * from win32_thread where ProcessHandle=""7064"""|
ft -AutoSize Handle,KernelModeTime,UserModeTime
Handle KernelModeTime UserModeTime
------ -------------- ------------
5548 218 312
6620 0 0
6112 0 0
7148 0 15
6888 0 0
7380 0 0
3992 0 0
8372 0 0
644 0 0
1328 0 15
(And to confirm this is not elapsed time: the process start time is 16:44:50 2010-09-30.)
It cannot be done. The problem is that unless you block it (hard to do, and it makes little sense), threads can be interrupted. So while it TAKES Thread2 20 ms to complete, you do not know how much of that time it was actually active.
That is the downside of what is called preemptive multitasking.