Trying to understand multi-threading in C# - c#

I'm trying to understand the basics of multi-threading so I built a little program that raised a few question and I'll be thankful for any help :)
Here is the little program:
class Program
{
public static int count;
public static int max;
static void Main(string[] args)
{
int t = 0;
DateTime Result;
Console.WriteLine("Enter Max Number : ");
max = int.Parse(Console.ReadLine());
Console.WriteLine("Enter Thread Number : ");
t = int.Parse(Console.ReadLine());
count = 0;
Result = DateTime.Now;
List<Thread> MyThreads = new List<Thread>();
for (int i = 1; i < 31; i++)
{
Thread Temp = new Thread(print);
Temp.Name = i.ToString();
MyThreads.Add(Temp);
}
foreach (Thread th in MyThreads)
th.Start();
while (count < max)
{
}
Console.WriteLine("Finish , Took : " + (DateTime.Now - Result).ToString() + " With : " + t + " Threads.");
Console.ReadLine();
}
public static void print()
{
while (count < max)
{
Console.WriteLine(Thread.CurrentThread.Name + " - " + count.ToString());
count++;
}
}
}
I checked this with some test runs:
I made the maximum number 100, and it seems to be that the fastest execution time is with 2 threads which is 80% faster than the time with 10 threads.
Questions:
1) Threads 4-10 don't print even one time, how can it be?
2) Shouldn't more threads be faster?
I made the maximum number 10000 and disabled printing.
With this configuration, 5 threads seems to be fastest.
Why there is a change compared to the first check?
And also in this configuration (with printing) all the threads print a few times. Why is that different from the first run where only a few threads printed?
Is there is a way to make all the threads print one by one? In a line or something like that?
Thank you very much for your help :)

Your code is certainly a first step into the world of threading, and you've just experienced the first (of many) headaches!
To start with, static may enable you to share a variable among the threads, but it does not do so in a thread safe manner. This means your count < max expression and count++ are not guaranteed to be up to date or an effective guard between threads. Look at the output of your program when max is only 10 (t set to 4, on my 8 processor workstation):
T0 - 0
T0 - 1
T0 - 2
T0 - 3
T1 - 0 // wait T1 got count = 0 too!
T2 - 1 // and T2 got count = 1 too!
T2 - 6
T2 - 7
T2 - 8
T2 - 9
T0 - 4
T3 - 1 // and T3 got count = 1 too!
T1 - 5
To your question about each thread printing one-by-one, I assume you're trying to coordinate access to count. You can accomplish this with synchronization primitives (such as the lock statement in C#). Here is a naive modification to your code which will ensure only max increments occur:
static object countLock = new object();
public static void printWithLock()
{
// loop forever
while(true)
{
// protect access to count using a static object
// now only 1 thread can use 'count' at a time
lock (countLock)
{
if (count >= max) return;
Console.WriteLine(Thread.CurrentThread.Name + " - " + count.ToString());
count++;
}
}
}
This simple modification makes your program logically correct, but also slow. The sample now exhibits a new problem: lock contention. Every thread is now vying for access to countLock. We've made our program thread safe, but without any benefits of parallelism!
Threading and parallelism is not particularly easy to get right, but thankfully recent versions of .Net come with the Task Parallel Library (TPL) and Parallel LINQ (PLINQ).
The beauty of the library is how easy it would be to convert your current code:
var sw = new Stopwatch();
sw.Start();
Enumerable.Range(0, max)
.AsParallel()
.ForAll(number =>
Console.WriteLine("T{0}: {1}",
Thread.CurrentThread.ManagedThreadId,
number));
Console.WriteLine("{0} ms elapsed", sw.ElapsedMilliseconds);
// Sample output from max = 10
//
// T9: 3
// T9: 4
// T9: 5
// T9: 6
// T9: 7
// T9: 8
// T9: 9
// T8: 1
// T7: 2
// T1: 0
// 30 ms elapsed
The output above is an interesting illustration of why threading produces "unexpected results" for newer users. When threads execute in parallel, they may complete chunks of code at different points in time or one thread may be faster than another. You never really know with threading!

Your print function is far from thread safe, that's why 4-10 doesn't print. All threads share the same max and count variables.
Reason for the why more threads slows you down is likely the state change taking place each time the processor changes focus between each thread.
Also, when you're creating a lot of threads, the system needs to allocate new ones. Most of the time it is now advisable to use Tasks instead, as they are pulled from a system managed thread-pool. And thus doesn't necessarily have to be allocated. The creation of a distinct new thread is rather expensive.
Take a look here anyhow: http://msdn.microsoft.com/en-us/library/aa645740(VS.71).aspx

Look carefuly:
t = int.Parse(Console.ReadLine());
count = 0;
Result = DateTime.Now;
List<Thread> MyThreads = new List<Thread>();
for (int i = 1; i < 31; i++)
{
Thread Temp = new Thread(print);
Temp.Name = i.ToString();
MyThreads.Add(Temp);
}
I think you missed a variable t ( i < 31).
You should read many books on parallel and multithreaded programming before writing code, because programming language is just a tool. Good luck!

Related

Why are 1000 threads faster than a few?

I have a simple program that searches linearly in an array of 2D points. I do 1000 searches into an array of 1 000 000 points.
The curious thing is that if I spawn 1000 threads, the program works as fast as when I span only as much as CPU cores I have, or when I use Parallel.For. This is contrary to everything I know about creating threads. Creating and destroying threads is expensive, but obviously not in this case.
Can someone explain why?
Note: this is a methodological example; the search algorithm is deliberately not meant do to optimal. The focus is on threading.
Note 2: I tested on an 4-core i7 and 3-core AMD, the results follow the same pattern!
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
/// <summary>
/// We search for closest points.
/// For every point in array searchData, we search into inputData for the closest point,
/// and store it at the same position into array resultData;
/// </summary>
class Program
{
class Point
{
public double X { get; set; }
public double Y { get; set; }
public double GetDistanceFrom (Point p)
{
double dx, dy;
dx = p.X - X;
dy = p.Y - Y;
return Math.Sqrt(dx * dx + dy * dy);
}
}
const int inputDataSize = 1_000_000;
static Point[] inputData = new Point[inputDataSize];
const int searchDataSize = 1000;
static Point[] searchData = new Point[searchDataSize];
static Point[] resultData = new Point[searchDataSize];
static void GenerateRandomData (Point[] array)
{
Random rand = new Random();
for (int i = 0; i < array.Length; i++)
{
array[i] = new Point()
{
X = rand.NextDouble() * 100_000,
Y = rand.NextDouble() * 100_000
};
}
}
private static void SearchOne(int i)
{
var searchPoint = searchData[i];
foreach (var p in inputData)
{
if (resultData[i] == null)
{
resultData[i] = p;
}
else
{
double oldDistance = searchPoint.GetDistanceFrom(resultData[i]);
double newDistance = searchPoint.GetDistanceFrom(p);
if (newDistance < oldDistance)
{
resultData[i] = p;
}
}
}
}
static void AllThreadSearch()
{
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
static void FewThreadSearch()
{
int threadCount = Environment.ProcessorCount;
int workSize = searchDataSize / threadCount;
List<Thread> threads = new List<Thread>();
for (int i = 0; i < threadCount; i++)
{
var thread = new Thread(
obj =>
{
int[] range = (int[])obj;
int from = range[0];
int to = range[1];
for (int index = from; index < to; index++)
{
SearchOne(index);
}
}
);
int rangeFrom = workSize * i;
int rangeTo = workSize * (i + 1);
thread.Start(new int[]{ rangeFrom, rangeTo });
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
static void ParallelThreadSearch()
{
System.Threading.Tasks.Parallel.For (0, searchDataSize,
index =>
{
SearchOne(index);
});
}
static void Main(string[] args)
{
Console.Write("Generatic data... ");
GenerateRandomData(inputData);
GenerateRandomData(searchData);
Console.WriteLine("Done.");
Console.WriteLine();
Stopwatch watch = new Stopwatch();
Console.Write("All thread searching... ");
watch.Restart();
AllThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.Write("Few thread searching... ");
watch.Restart();
FewThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.Write("Parallel thread searching... ");
watch.Restart();
ParallelThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.WriteLine();
Console.WriteLine("Press ENTER to quit.");
Console.ReadLine();
}
}
EDIT: Please make sure to run the app outside the debugger. VS Debugger slows down the case of multiple threads.
EDIT 2: Some more tests.
To make it clear, here is updated code that guarantees we do have 1000 running at once:
public static void AllThreadSearch()
{
ManualResetEvent startEvent = new ManualResetEvent(false);
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
startEvent.WaitOne();
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
startEvent.Set();
foreach (var t in threads) t.Join();
}
Testing with a smaller array - 100K elements, the results are:
1000 vs 8 threads
Method | Mean | Error | StdDev | Scaled |
--------------------- |---------:|---------:|----------:|-------:|
AllThreadSearch | 323.0 ms | 7.307 ms | 21.546 ms | 1.00 |
FewThreadSearch | 164.9 ms | 3.311 ms | 5.251 ms | 1.00 |
ParallelThreadSearch | 141.3 ms | 1.503 ms | 1.406 ms | 1.00 |
Now, 1000 threads is much slower, as expected. Parallel.For still bests them all, which is also logical.
However, growing the array to 500K (i.e. the amount of work every thread does), things start to look weird:
1000 vs 8, 500K
Method | Mean | Error | StdDev | Scaled |
--------------------- |---------:|---------:|---------:|-------:|
AllThreadSearch | 890.9 ms | 17.74 ms | 30.61 ms | 1.00 |
FewThreadSearch | 712.0 ms | 13.97 ms | 20.91 ms | 1.00 |
ParallelThreadSearch | 714.5 ms | 13.75 ms | 12.19 ms | 1.00 |
Looks like context-switching has negligible costs. Thread-creation costs are also relatively small. The only significant cost of having too many threads is loss of memory (memory addresses). Which, alone, is bad enough.
Now, are thread-creation costs that little indeed? We've been universally told that creating threads is very bad and context-switches are evil.
You may want to consider how the application is accessing memory. In the maximum threads scenario you are effectively accessing memory sequentially, which is efficient from a caching point of view. The approach using a small number of threads is more random, causing cache misses. Depending on the CPU there are performance counters that allow you to measure L1 and L2 cache hits/misses.
I think the real issue (other than memory use) with too many threads is that the CPU may have a hard time optimizing itself because it is switching tasks all the time. In the OP's original benchmark, the threads are all working on the same task and so you aren't seeing that much of a cost for the extra threads.
To simulate threads working on different tasks, I modified Jodrell's reformulation of the original code (labeled "Normal" in the data below) to first optimize memory access by ensuring all the threads are working in the same block of memory at the same time and such that the block fits in the cache (4mb) using the method from this cache blocking techniques article. Then I "reversed" that to ensure each set of 4 threads work in a different block of memory. The results for my machine (in ms):
Intel Core i7-5500U CPU 2.40GHz (Max: 2.39GHz) (Broadwell), 1 CPU, 4 logical and 2 physical cores)
inputDataSize = 1_000_000; searchDataSize = 1000; blocks used for O/D: 10
Threads 1 2 3 4 6 8 10 18 32 56 100 178 316 562 1000
Normal(N) 5722 3729 3599 2909 3485 3109 3422 3385 3138 3220 3196 3216 3061 3012 3121
Optimized(O) 5480 2903 2978 2791 2826 2806 2820 2796 2778 2775 2775 2805 2833 2866 2988
De-optimized(D) 5455 3373 3730 3849 3409 3350 3297 3335 3365 3406 3455 3553 3692 3719 3900
For O, all the threads worked in the same block of cacheable memory at the same time (where 1 block = 1/10 of inputData). For D, for every set of 4 threads, no thread worked in the same block of memory at the same time. So basically, in the former case access of inputData was able to make use of the cache whereas in the latter case for 4 threads access of inputData was forced to use main memory.
It's easier to see in charts. These charts have the thread-creation cost subtracted out and note the x-axis is logarithmic and y-axis is truncated to better show the shape of the data. Also, the value for 1 thread has been halved to show the theoretical best multi-threaded performance:
A quick glance above shows the optimized data (O) is indeed faster than the others. It is also more consistent (smoother) because compared to N it is not having to deal with cache-misses. As suggested by Jodrell, there appears to be a sweet spot around 100 threads, which is the number on my system which would allow a thread to complete its work within 1 time-slice. After that, the time increases linearly with number of threads (remember, the x-axis has a logarithmic scale on the chart.)
Comparing the normal and optimized data, the former is quite jagged whereas the latter is smooth. This answer suggested more threads would be more efficient from a caching point of view compared to fewer threads where the memory access could be more "random". The chart below seems to confirm this (note 4 threads is optimal for my machine as it has 4 logical cores):
The de-optimized version is most interesting. The worse case is with 4 threads as they have been forced to work in different areas of memory, preventing effective caching. As the number threads increases, the system is able to cache as threads share blocks of memory. But, as the number of threads increases presumably the context-switching makes it harder for the system to cache again and the results tend back to the worst-case:
I think this last chart is what shows the real cost of context-switching. In the original (N) version, the threads are all doing the same task. As a result there is limited competition for resources other than CPU time and the CPU is able to optimize itself for the workload (i.e. cache effectively.) If the threads are all doing different things, then the CPU isn't able to optimize itself and a severe performance hit results. So it's not directly the context switching that causes the problem, but the competition for resources.
In this case, the difference for 4 Threads between O (2909ms) and D (3849ms) is 940ms. This represents a 32% performance hit. Because my machine has a shared L3 cache, this performance hit shows up even with only 4 threads.
I took the liberty of rearranging your code to run using BenchmarkDotNet, it looks like this,
using System;
using System.Collections.Generic;
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
namespace Benchmarks
{
public class Point
{
public double X { get; set; }
public double Y { get; set; }
public double GetDistanceFrom(Point p)
{
double dx, dy;
dx = p.X - X;
dy = p.Y - Y;
return Math.Sqrt(dx * dx + dy * dy);
}
}
[ClrJob(baseline: true)]
public class SomeVsMany
{
[Params(1000)]
public static int inputDataSize = 1000;
[Params(10)]
public static int searchDataSize = 10;
static Point[] inputData = new Point[inputDataSize];
static Point[] searchData = new Point[searchDataSize];
static Point[] resultData = new Point[searchDataSize];
[GlobalSetup]
public static void Setup()
{
GenerateRandomData(inputData);
GenerateRandomData(searchData);
}
[Benchmark]
public static void AllThreadSearch()
{
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
[Benchmark]
public static void FewThreadSearch()
{
int threadCount = Environment.ProcessorCount;
int workSize = searchDataSize / threadCount;
List<Thread> threads = new List<Thread>();
for (int i = 0; i < threadCount; i++)
{
var thread = new Thread(
obj =>
{
int[] range = (int[])obj;
int from = range[0];
int to = range[1];
for (int index = from; index < to; index++)
{
SearchOne(index);
}
}
);
int rangeFrom = workSize * i;
int rangeTo = workSize * (i + 1);
thread.Start(new int[] { rangeFrom, rangeTo });
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
[Benchmark]
public static void ParallelThreadSearch()
{
System.Threading.Tasks.Parallel.For(0, searchDataSize,
index =>
{
SearchOne(index);
});
}
private static void GenerateRandomData(Point[] array)
{
Random rand = new Random();
for (int i = 0; i < array.Length; i++)
{
array[i] = new Point()
{
X = rand.NextDouble() * 100_000,
Y = rand.NextDouble() * 100_000
};
}
}
private static void SearchOne(int i)
{
var searchPoint = searchData[i];
foreach (var p in inputData)
{
if (resultData[i] == null)
{
resultData[i] = p;
}
else
{
double oldDistance = searchPoint.GetDistanceFrom(resultData[i]);
double newDistance = searchPoint.GetDistanceFrom(p);
if (newDistance < oldDistance)
{
resultData[i] = p;
}
}
}
}
}
public class Program
{
static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<SomeVsMany>();
}
}
}
When I run the benchmark I get these results,
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.14393.2485
(1607/AnniversaryUpdate/Redstone1) Intel Core i7-7600U CPU 2.80GHz
(Max: 2.90GHz) (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
Frequency=2835938 Hz, Resolution=352.6170 ns, Timer=TSC [Host] :
.NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3163.0
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit
RyuJIT-v4.7.3163.0 Job=Clr Runtime=Clr
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 1000 10 1,276.53us 51.0605us 142.3364us
FewThreadSearch 1000 10 547.72us 24.8199us 70.0049us
ParallelThreadSearch 1000 10 36.54us 0.6973us 0.8564us
These are the kind of results I'd expect and different to what you are claiming in the question. However, as you correctly identify in the comment, this is because I have reduced the values of inputDataSize and searchDataSize.
If I rerun the test with the original values I get results like this,
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 1000000 1000 2.872s 0.0554s 0.0701s
FewThreadSearch 1000000 1000 2.384s 0.0471s 0.0612s
ParallelThreadSearch 1000000 1000 2.449s 0.0368s 0.0344s
These results support your question.
FWIW I did another test run,
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 20000000 40 1.972s 0.0392s 0.1045s
FewThreadSearch 20000000 40 1.718s 0.0501s 0.1477s
ParallelThreadSearch 20000000 40 1.978s 0.0454s 0.0523s
This may help distinguish the cost of context switching versus thread creation but ultimately, there must be an element of both.
There is a little speculation but, here are a few assertions and, a conclusion, based on our aggregated results.
Creating a Thread incurs some fixed overhead. When the work is large, the overhead becomes insignificant.
The operating system and processor architecture can only run a certain number of CPU threads at a time. Some amount of CPU time will be reserved for the many operations that keep the computer running behind the scenes. A chunk of that CPU time will be consumed by the background processes and services, not related to this test.
Even if we have a 8 core CPU and spawn just 2 threads we cannot expect both threads to progress through the program at exactly the same rate.
Accepting the points above, whether or not the threads are serviced via a .Net ThreadPool, only a finite number can be serviced concurrently. Even if all instantiated threads are progressed to some semaphore, they did not all get there at once and they will not all proceed at once. If we have more threads than available cores, some threads will have to wait before they can progress at all.
Each thread will proceed for a certain time-slice or until it is waiting for a resource.
This is where the speculation comes in but, when inputDataSize is small, the threads will tend to complete their work within one time-slice, requiring less or no context switching.
When inputDataSize becomes sufficiently large, the work cannot be completed within one time-slice, this makes context switching more likely.
So, given a large fixed size for searchDataSize we have three scenarios. The boundaries of these scenarios will depend on the characteristics of the test platform.
inputDataSize is small
Here, the cost of thread creation is significant, AllThreadSearch is massively slower. ParallelThreadSearch tends to win because it minimizes the cost of thread creation.
inputDataSize is medium
The cost of thread creation is insignificant. Crucially, the work can be completed in one time slice. AllThreadSearch makes use of OS level scheduling and avoids the reasonable but significant overhead of both the Parallel.For and the bucket looping in FewThreadSearch. Somewhere in this area is the sweet spot for AllThreadSearch, it may be possible that for some combinations AllThreadSearch is the fastest option.
inputDataSize is large
Crucially, the work cannot be completed in one time slice. Both the OS scheduler and the ThreadPool fail to anticipate the cost of context switching. Without some expensive heuristics how could they? FewThreadSearch wins out because it avoids the context switching, the cost of which outweighs the cost of bucket looping.
As ever, if you care about performance it pays to benchmark, on a representative system, with a representative workload, with a representative configuration.
First you have to understand the difference between Process and Thread to deep dive into the benefits of concurrency to achieve faster results over sequential programming.
Process - We can call it as an instance of a program in execution. Operating System creates different processes while executing an application. An application can have one or more processes. Process creation is some what costly job to the operating system as it needs to provide several resources while creating, such as Memory, Registers, open handles to system objects to access, security context etc.,
Thread - it is the entity within a process that can be scheduled for execution(can be a part of your code). Unlike Process creation, thread creation is not costly/time consuming as threads share virtual address space and system resources of the process where it belongs. It's improving the performance of the OS as it no need to provide resources for each thread it creates.
Below diagram will elaborate more than my words.
As threads are sharing the resources and having the concurrency nature in them they can run parallel and produce improved results. If your application needs to be highly parallel then you can create ThreadPool(collection of worker threads) to achieve efficiently execute asynchronous callbacks.
And to correct your final assumption/question, creating/destroying threads is not costly than creating/destroying process so always having a "properly handled threading code" would benefit the performance of the application.
It is simply because you can't create threads more than the capacity of your cpu ... so actually in both cases you are creating the same number of threads; your CPU max ...

Multithreading usages for CountdownEvent vs Barrier?

Looking at the Barrier class , it allows n threads to rendezvous at a point in time :
static Barrier _barrier = new Barrier(3);
static void Main()
{
new Thread(Speak).Start();
new Thread(Speak).Start();
new Thread(Speak).Start();
}
static void Speak()
{
for (int i = 0; i < 5; i++)
{
Console.Write(i + " ");
_barrier.SignalAndWait();
}
}
//OUTPUT: 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4
But so does the CountdownEvent class :
static CountdownEvent _countdown = new CountdownEvent(3);
static void Main()
{
new Thread(SaySomething).Start("I am thread 1");
new Thread(SaySomething).Start("I am thread 2");
new Thread(SaySomething).Start("I am thread 3");
_countdown.Wait(); // Blocks until Signal has been called 3 times
Console.WriteLine("All threads have finished speaking!");
}
static void SaySomething(object thing)
{
Thread.Sleep(1000);
Console.WriteLine(thing);
_countdown.Signal();
}
// output :
I am thread 3
I am thread 1
I am thread 2
All threads have finished speaking!
So it seems that the Barrier blocks until n threads are meeting
while the CountdownEvent is also blocking until n threads are signaling.
It's kind of confusing( to me ) to learn from it , when should I use which.
Question:
In which (real life scenario) should I need to choose using Barrier as opposed to CountdownEvent (and vice versa) ?
There are some interesting things to note about the two:
A CountdownEvent does not have an explicit post-action associated with it; a Barrier does.
A CountdownEvent and a Barrier with one phase are roughly equivalent so can be used interchangeably.
A Barrier can have multiple phases. As each phase completes, the post-phase action is executed; when that action completes, the next phase begins.
There is a similar question on SO that discusses this behavior, but for the Java equivalents of the C# classes. The answers give several examples, which are valid for the C# equivalents.
That said, consider a real world scenario: checking 3 credit sources for a potential loan to a home buyer. Let's say you don't want to make a decision until you receive all 3 credit scores and evaluate them. You could use either a CountdownEvent (with the code after the Wait() checking the scores) or a single-phased barrier with the score checking code action.
Here's where Barrier would be a better choice: let's say the loan officer wants to also check the home buyer's SO reputation score (because expert users get better credit!) and two other social scores, but only after the credit scores are retrieved (because hey, we don't want check social media if we don't have to).
The neat thing about the Barrier is you can move through the phases in a single method call, which keeps the logic compact and tidy:
var barrier = new Barrier(participantCount: 3, b => LogScoreAndPossiblyEvaluate(b));
var credit = new int[3];
var social = new int[3];
void LogScoreAndPossiblyEvaluate(Barrier b)
{
Log.Info("Got scores for {b.CurrentPhaseNumber == 1 ? "credit" : "social"} phase");
...
if (b.CurrentPhaseNumber == 2 && SomeComplexCalculationWithSixScores() == LoanResult.Approved)
LoanMoney();
}
...
for (int i=0; i<3; i++)
Task.Run(() => {
credit[i] = GetCreditScore(CreditSource(i);
barrier.SignalAndWait();
social[i] = GetSocialScore(SocialSource(i);
barrier.SignalAndWait();
// all phases done at this point
});

C# additional threads decreases efficiency of application

I wanted to stress test my new cpu. I made this in about 2 mins.
When I add more threads to it, the efficiency of it decreases dramatically. These are results: (Please note I set the Priority in the taskmanager to High)
1 Thread: After one minute on 1 thread(s), you got to the number/prime 680263811
2 Threads: After one minute on 2 thread(s), you got to the number/prime 360252913
4 Threads: After one minute on 4 thread(s), you got to the number/prime 216150449
There are problems with the code, I just made it as a test. Please don't bash me that it was written horribly... I've kinda had a bad day
static void Main(string[] args)
{
Console.Write("Stress test, how many OS threads?: ");
int thr = int.Parse(Console.ReadLine());
Thread[] t = new Thread[thr];
Stopwatch s = new Stopwatch();
s.Start();
UInt64 it = 0;
UInt64 prime = 0;
for (int i = 0; i < thr; i++)
{
t[i] = new Thread(delegate()
{
while (s.Elapsed.TotalMinutes < 1)
{
it++;
if (it % 2 != 0)// im 100% sure that x % 2 does not give primes, but it uses up the cpu, so idc
{
prime = it;
}
}
Console.WriteLine("After one minute on " + t.Length + " thread(s), you got to the number/prime " + prime.ToString());//idc if this prints 100 times, this is just a test
});
t[i].Start();
}
Console.ReadLine();
}
Question: Can someone explain these unexpected results?
Your threads are incrementing it without any synchronization, so you're going to get weird results like this. Worse, you're also assigning prime without any synchronization.
thread 1: reads 0 from it, then gets unscheduled by the OS for whatever reason
thread 2: reads 0 from it, then increments to 1
thread 2: does work, assigns 1 to prime
... thread 2 repeats for awhile. thread 2 is now up to 7, and is about to check if (it % 2 != 0)
thread 1: regains the CPU
thread 1: increments it to 1
thread 2: assigns 1 to prime --- wat?
The possibilities get even worse as you get to the point where a bit in the high half of it is changing because 64-bit reads and writes are not atomic either although these numbers are a little larger than in the question, after running for longer, wildly variable would be possible ... consider
After some time it = 0x00000000_FFFFFFFF
thread 1 reads reads it (both words)
thread 2 reads the higher word 0x0000000_????????
thread 1 calculates it + 1 0x00000001_00000000
thread 1 writes it (both halves)
thread 2 reads the lower word (and puts this with the already read half) 0x00000000_00000000
While thread 1 was incrementing to 4294967296, thread 2 managed to read 0.
You should apply keyword volatile to variable it and prime.

How to loop a list while using Multi-threading in c#

I have a List to loop while using multi-thread,I will get the first item of the List and do some processing,then remove the item.
While the count of List is not greater than 0 ,fetch data from data.
In a word:
In have a lot of records in my database.I need to publish them to my server.In the process of publishing, multithreading is required and the number of threads may be 10 or less.
For example:
private List<string> list;
void LoadDataFromDatabase(){
list=...;//load data from database...
}
void DoMethod()
{
While(list.Count>0)
{
var item=list.FirstOrDefault();
list.RemoveAt(0);
DoProcess();//how to use multi-thread (custom the count of theads)?
if(list.Count<=0)
{
LoadDataFromDatabase();
}
}
}
Please help me,I'm a beginner of c#,I have searched a lot of solutions, but no similar.
And more,I need to custom the count of theads.
Should your processing of the list be sequential? In other words, cannot you process element n + 1 while not finished yet processing of element n? If this is your case, then Multi-Threading is not the right solution.
Otherwise, if your processing elements are fully independent, you can use m threads, deviding Elements.Count / m elements for each thread to work on
Example: printing a list:
List<int> a = new List<int> { 1, 2, 3, 4,5 , 6, 7, 8, 9 , 10 };
int num_threads = 2;
int thread_elements = a.Count / num_threads;
// start the threads
Thread[] threads = new Thread[num_threads];
for (int i = 0; i < num_threads; ++i)
{
threads[i] = new Thread(new ThreadStart(Work));
threads[i].Start(i);
}
// this works fine if the total number of elements is divisable by num_threads
// but if we have 500 elements, 7 threads, then thread_elements = 500 / 7 = 71
// but 71 * 7 = 497, so that there are 3 elements not processed
// process them here:
int actual = thread_elements * num_threads;
for (int i = actual; i < a.Count; ++i)
Console.WriteLine(a[i]);
// wait all threads to finish
for (int i = 0; i < num_threads; ++i)
{
threads[i].Join();
}
void Work(object arg)
{
Console.WriteLine("Thread #" + arg + " has begun...");
// calculate my working range [start, end)
int id = (int)arg;
int mystart = id * thread_elements;
int myend = (id + 1) * thread_elements;
// start work on my range !!
for (int i = mystart; i < myend; ++i)
Console.WriteLine("Thread #" + arg + " Element " + a[i]);
}
ADD For your case, (uploading to server), it is the same as the code obove. You assign a number of threads, assigning each thread number of elements (which is auto calculated in the variable thread_elements, so you need only to change num_threads). For method Work, all you need is replacing the line Console.WriteLine("Thread #" + arg + " Element " + a[i]); with you uploading code.
One more thing to keep in mind, that multi-threading is dependent on your machine CPU. If your CPU has 4 cores, for example, then the best performance obtained would be 4 threads at maximum, so that assigning each core a thread. Otherwise, if you have 10 threads, for example, they would be slower than 4 threads because they will compete on CPU cores (Unless the threads are idle, waiting for some event to occur (e.g. uploading). In this case, 10 threads can run, because they don't take %100 of CPU usage)
WARNING: DO NOT modify the list while any thread is working (add, remove, set element...), neither assigning two threads the same element. Such things cause you a lot of bugs and exceptions !!!
This is a simple scenario that can be expanded in multiple ways if you add some details to your requirements:
IEnumerable<Data> LoadDataFromDatabase()
{
return ...
}
void ProcessInParallel()
{
while(true)
{
var data = LoadDataFromDatabase().ToList();
if(!data.Any()) break;
data.AsParallel().ForEach(ProcessSingleData);
}
}
void ProcessSingleData(Data d)
{
// do something with data
}
There are many ways to approach this. You can create threads and partition the list yourself or you can take advantage of the TPL and utilize Parallel.ForEach. In the example on the link you see a Action is called for each member of the list being iterated over. If this is your first taste of threading I would also attempt to do it the old fashioned way.
Here my opinion ;)
You can avoid use multithread if youur "List" is not really huge.
Instead of a List, you can use a Queue (FIFO - First In First Out). Then only use Dequeue() method to get one element of the Queue, DoSomeWork and get the another. Something like:
while(queue.Count > 0)
{
var temp = DoSomeWork(queue.Dequeue());
}
I think that this will be better for your propose.
I will get the first item of the List and do some processing,then remove the item.
Bad.
First, you want a queue, not a list.
Second, you do not process then remove, you remove THEN process.
Why?
So that you keep the locks small. Lock list access (note you need to synchonize access), remove, THEN unlock immediately and then process. THis way you keep the locks short. If you take, process, then remove - you basically are single threaded as you have to keep the lock in place while processing, so the next thread does not take the same item again.
And as you need to synchronize access and want multiple threads this is about the only way.
Read up on the lock statement for a start (you can later move to something like spinlock). Do NOT use threads unless you ahve to put schedule Tasks (using the Tasks interface new in 4.0), which gives you more flexibility.

How come this algorithm in Ruby runs faster than in Parallel'd C#?

The following ruby code runs in ~15s. It barely uses any CPU/Memory (about 25% of one CPU):
def collatz(num)
num.even? ? num/2 : 3*num + 1
end
start_time = Time.now
max_chain_count = 0
max_starter_num = 0
(1..1000000).each do |i|
count = 0
current = i
current = collatz(current) and count += 1 until (current == 1)
max_chain_count = count and max_starter_num = i if (count > max_chain_count)
end
puts "Max starter num: #{max_starter_num} -> chain of #{max_chain_count} elements. Found in: #{Time.now - start_time}s"
And the following TPL C# puts all my 4 cores to 100% usage and is orders of magnitude slower than the ruby version:
static void Euler14Test()
{
Stopwatch sw = new Stopwatch();
sw.Start();
int max_chain_count = 0;
int max_starter_num = 0;
object locker = new object();
Parallel.For(1, 1000000, i =>
{
int count = 0;
int current = i;
while (current != 1)
{
current = collatz(current);
count++;
}
if (count > max_chain_count)
{
lock (locker)
{
max_chain_count = count;
max_starter_num = i;
}
}
if (i % 1000 == 0)
Console.WriteLine(i);
});
sw.Stop();
Console.WriteLine("Max starter i: {0} -> chain of {1} elements. Found in: {2}s", max_starter_num, max_chain_count, sw.Elapsed.ToString());
}
static int collatz(int num)
{
return num % 2 == 0 ? num / 2 : 3 * num + 1;
}
How come ruby runs faster than C#? I've been told that Ruby is slow. Is that not true when it comes to algorithms?
Perf AFTER correction:
Ruby (Non parallel): 14.62s
C# (Non parallel): 2.22s
C# (With TPL): 0.64s
Actually, the bug is quite subtle, and has nothing to do with threading. The reason that your C# version takes so long is that the intermediate values computed by the collatz method eventually start to overflow the int type, resulting in negative numbers which may then take ages to converge.
This first happens when i is 134,379, for which the 129th term (assuming one-based counting) is 2,482,111,348. This exceeds the maximum value of 2,147,483,647 and therefore gets stored as -1,812,855,948.
To get good performance (and correct results) on the C# version, just change:
int current = i;
…to:
long current = i;
…and:
static int collatz(int num)
…to:
static long collatz(long num)
That will bring down your performance to a respectable 1.5 seconds.
Edit: CodesInChaos raises a very valid point about enabling overflow checking when debugging math-oriented applications. Doing so would have allowed the bug to be immediately identified, since the runtime would throw an OverflowException.
Should be:
Parallel.For(1L, 1000000L, i =>
{
Otherwise, you have integer overfill and start checking negative values. The same collatz method should operate with long values.
I experienced something like that. And I figured out that's because each of your loop iterations need to start other thread and this takes some time, and in this case it's comparable (I think it's more time) than the operations you acctualy do in the loop body.
There is an alternative for that: You can get how many CPU cores you have and than use a parallelism loop with the same number of iterations you have cores, each loop will evaluate part of the acctual loop you want, it's done by making an inner for loop that depends on the parallel loop.
EDIT: EXAMPLE
int start = 1, end = 1000000;
Parallel.For(0, N_CORES, n =>
{
int s = start + (end - start) * n / N_CORES;
int e = n == N_CORES - 1 ? end : start + (end - start) * (n + 1) / N_CORES;
for (int i = s; i < e; i++)
{
// Your code
}
});
You should try this code, I'm pretty sure this will do the job faster.
EDIT: ELUCIDATION
Well, quite a long time since I answered this question, but I faced the problem again and finally understood what's going on.
I've been using AForge implementation of Parallel for loop, and it seems like, it fires a thread for each iteration of the loop, so, that's why if the loop takes relatively a small amount of time to execute, you end up with a inefficient parallelism.
So, as some of you pointed out, System.Threading.Tasks.Parallel methods are based on Tasks, which are kind of a higher level of abstraction of a Thread:
"Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms that determine and adjust to the number of threads and that provide load balancing to maximize throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism."
So yeah, if you use the default library's implementation, you won't need to use this kind of "bogus".

Categories