I have the following piece of code. I wish to start the file creation on multiple threads. The objective is that it will take less time to create 10 files when I do it on multiple threads. As I understand I need to introduce the element of asynchronous calls to make that happen.
What changes should I make in this piece of code?
using System;
using System.Text;
using System.Threading;
using System.IO;
using System.Diagnostics;
namespace MultiDemo
{
class MultiDemo
{
public static void Main()
{
var stopWatch = new Stopwatch();
stopWatch.Start();
// Create an instance of the test class.
var ad = new MultiDemo();
//Should create 10 files in a loop.
for (var x = 0; x < 10; x++)
{
var y = x;
int threadId;
var myThread = new Thread(() => TestMethod("outpFile", y, out threadId));
myThread.Start();
myThread.Join();
//TestMethod("outpFile", y, out threadId);
}
stopWatch.Stop();
Console.WriteLine("Milliseconds Taken:\t{0}", stopWatch.Elapsed.TotalMilliseconds);
}
public static void TestMethod(string fileName, int hifi, out int threadId)
{
fileName = fileName + hifi;
var fs = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.ReadWrite);
var sw = new StreamWriter(fs, Encoding.UTF8);
for (int x = 0; x < 10000; x++)
{
sw.WriteLine(DateTime.Now.ToString());
}
sw.Close();
threadId = Thread.CurrentThread.ManagedThreadId;
Console.WriteLine("{0}",threadId);
}
}
}
Right now, if I comment out the thread creation part of the code and just call TestMethod 10 times in a loop, it is faster than the version that creates multiple threads.
The threaded version of your code is doing extra work, so it's not surprising that it's slower.
When you do something like:
var myThread = new Thread(() => TestMethod("outpFile", y, out threadId));
myThread.Start();
myThread.Join();
...you're creating a thread, having it call TestMethod, then waiting for it to finish. The additional overhead of creating and starting a thread will make things slower than just calling TestMethod without any threads.
It's possible that you'll see better performance if you start all of the threads working and then wait for them to finish, e.g.:
var workers = new List<Thread>();
for (int i = 0; i < 10; ++i)
{
var y = i; // copy the loop variable so each lambda captures its own value
int threadId;
var myThread = new Thread(() => TestMethod("outpFile", y, out threadId));
myThread.Start();
workers.Add(myThread);
}
foreach (var worker in workers) worker.Join();
Perhaps this doesn't directly answer your question but here is my thought on the matter. The bottleneck in that code is unlikely to be the processor. I would bet the disk IO would take way more time than the CPU processing. As such, I don't believe that creating new threads will help at all (all the threads will attempt to write to the same disk). I think this is a case of premature optimization. If I were you, I would just do it all on one thread.
The threaded version is slower because each iteration creates a thread, starts it, and then waits for it to complete. The plain loop does the same work without those three extra steps.
Try this out (assuming .NET 4.0 because of TPL). On my machine, it's consistently 100ms faster when done in parallel.
[Test]
public void Y()
{
var sw = Stopwatch.StartNew();
Parallel.For(0, 10, n => TestMethod("parallel", n));
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
sw.Restart();
for (int i = 0; i < 10; i++)
TestMethod("forloop", i);
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
private static void TestMethod(string fileName, int hifi)
{
fileName = fileName + hifi;
var fs = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.ReadWrite);
var sw = new StreamWriter(fs, Encoding.UTF8);
for (int x = 0; x < 10000; x++)
{
sw.WriteLine(DateTime.Now.ToString());
}
sw.Close();
}
The primary thing to observe in your case is Amdahl's Law. Your algorithm makes roughly equal use of each of the following resources:
Processor usage
Memory access
Drive access
Of these, the drive access is by far the slowest item, so to see speedup you'll need to parallelize your algorithm across this resource. In other words, if you parallelize your program by writing the 10 different files to 10 different drives, you'll see a substantial performance improvement compared to just parallelizing the computation of the file contents. In fact if you create the files on 10 different threads, the serialization involved with drive access could actually reduce the overall performance of your program.
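For reference, Amdahl's Law can be written as follows, where p is the fraction of the run time that can be parallelized and n is the number of parallel workers. The numbers in the worked line are purely illustrative assumptions (they are not measurements of the code above): if only 20% of the time is CPU work that can be spread across 10 threads, and the remaining 80% is serialized drive access, the best possible speedup is about 1.22x.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}
\qquad\text{e.g.}\qquad
S(10)\big|_{p = 0.2} = \frac{1}{0.8 + 0.02} \approx 1.22
```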
Although both imply multi-threaded programming, parallelization should NOT be treated the same as asynchronous programming in the case of IO. While I would not recommend parallelizing your use of the file system, it is almost always beneficial to use asynchronous methods for reading/writing to files.
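To illustrate the asynchronous approach, here is a minimal sketch, assuming .NET 4.5 or later for async/await. The names AsyncFileDemo, WriteFileAsync and WriteAllAsync, as well as the file and line counts, are hypothetical and not part of the original question. Whether this is faster than a plain loop still depends on the drive.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class AsyncFileDemo
{
    // Hypothetical helper: writes `lines` timestamp lines to `path`
    // without blocking a thread while the I/O is in flight.
    public static async Task WriteFileAsync(string path, int lines)
    {
        // useAsync: true requests asynchronous I/O from the OS where supported.
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write,
                                       FileShare.None, 4096, useAsync: true))
        using (var sw = new StreamWriter(fs, Encoding.UTF8))
        {
            for (int i = 0; i < lines; i++)
            {
                await sw.WriteLineAsync(DateTime.Now.ToString());
            }
            await sw.FlushAsync();
        }
    }

    // Starts all writes, then waits for every one of them to finish.
    public static async Task WriteAllAsync(string directory, int fileCount, int lines)
    {
        var tasks = Enumerable.Range(0, fileCount)
            .Select(n => WriteFileAsync(Path.Combine(directory, "outpFile" + n), lines))
            .ToArray();
        await Task.WhenAll(tasks);
    }

    public static void Main()
    {
        WriteAllAsync(Path.GetTempPath(), 10, 100).GetAwaiter().GetResult();
    }
}
```

No threads are created explicitly here; the pending writes are tracked as tasks, and Task.WhenAll completes once all of them have completed.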
This is the wrong way to gain speed: multithreading is for doing work in parallel, not for accelerating a single task.
So why did you decide to use multithreading? The price of starting a new thread might be higher than a simple loop. It's not something you can blindly decide about... If you insist on using threads, you can also check the managed ThreadPool / usage of async delegates, which can reduce the cost of creating new threads by reusing existing ones.
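A minimal sketch of the ThreadPool suggestion, assuming .NET 4.0 or later for CountdownEvent. PoolDemo and RunOnPool are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Threading;

public static class PoolDemo
{
    // Runs `work` once per index on pool threads and blocks until all calls return.
    public static void RunOnPool(int count, Action<int> work)
    {
        using (var done = new CountdownEvent(count))
        {
            for (int i = 0; i < count; i++)
            {
                int index = i; // copy so each callback sees its own value
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try { work(index); }
                    finally { done.Signal(); } // always count down, even if work throws
                });
            }
            done.Wait();
        }
    }

    public static void Main()
    {
        RunOnPool(10, n => Console.WriteLine(
            "item {0} on thread {1}", n, Thread.CurrentThread.ManagedThreadId));
    }
}
```

Because pool threads are reused, the printed thread IDs will typically repeat, which is the saving this answer is pointing at.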
You're negating the benefit of multiple threads because you Join each thread and thus wait for it to complete before you create and start the next thread.
Instead, add the threads to a list as you create and start them, and then loop through the list of threads, joining them in sequence until they finish.
using System.Collections.Generic;
List<Thread> threads= new List<Thread>();
//Should create 10 files in a loop.
for (var x = 0; x < 10; x++)
{
var y = x;
int threadId;
var myThread = new Thread(() => TestMethod("outpFile", y, out threadId));
threads.Add(myThread);
myThread.Start();
//myThread.Join();
//TestMethod("outpFile", y, out threadId);
}
foreach (var thread in threads) thread.Join();
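Try something like:
for (int i = 0; i < 10; ++i)
{
var y = i; // copy the loop variable for the closure
// Delegate BeginInvoke works on .NET Framework only; on .NET Core and later it throws PlatformNotSupportedException.
new Action(() => { int threadId; TestMethod("outpFile", y, out threadId); }).BeginInvoke(null, null);
}
Console.ReadLine();
If it isn't quicker than the serial calls, then your I/O is indeed the bottleneck and there is nothing you can do about it.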
If I used this code to create threads, how can I wait for all of these threads to complete before proceeding with the rest of the code? Or is there a different way to do this?
for(int i = 0; i < 25; i ++)
{
Program x = new Program(); // Make temporary
Thread myThread = new Thread(() => x.doSomething(someParameter));
myThread.Start();
}
I want to avoid making a big chunk of code for initializing, creating and joining threads.
Thread myThread1 = new Thread(() => x.doSomething(someParameter));
myThread1.Start();
Thread myThread2 = new Thread(() => x.doSomething(someParameter));
myThread2.Start();
Thread myThread3 = new Thread(() => x.doSomething(someParameter));
myThread3.Start();
myThread1.Join();
myThread2.Join();
myThread3.Join();
This code works but my goal is to avoid doing this 50-100 / n times depending on how many threads I need.
The simplest solution is probably to put the threads into an array, then iterate over those threads to wait for all of them.
int numThreads = 25;
var threads = new Thread[numThreads];
for (int i = 0; i < numThreads; i++)
{
Program x = new Program();
Thread myThread = new Thread(() => x.doSomething(someParameter));
myThread.Start();
threads[i] = myThread;
}
foreach (var thread in threads)
{
thread.Join();
}
That said, I'd recommend looking into whether tasks would work better than threads: threads are a bit more ... brute-force, though there are reasons to use them. I'd also look into the Parallel class, which lets you control the level of parallelism, if your threads can be run there (they probably can, but there may be reasons not to do so). In that case, you'd just do something like
Parallel.For(0, 25, i => { new Program().doSomething(someParameter); });
... note that i is available inside the curly braces as the loop index; you could do something like new Program().doSomething(i); to pass a different parameter to each invocation of doSomething.
An even better way is to use the Task Parallel Library (TPL).
int numTasks = 25;
var tasks = new Task[numTasks];
for (int i = 0; i < numTasks; i++)
{
tasks[i] = Task.Run(() => new Program().doSomething(someParameter));
}
Task.WaitAll(tasks);
Starting a Thread reserves about 1 MB of stack memory by default and is resource intensive. Tasks are managed better by the framework.
And, again, even better, you could do this:
var tasks =
Enumerable
.Range(0, numTasks)
.Select(i => Task.Run(() => new Program().doSomething(someParameter)))
.ToArray();
Task.WaitAll(tasks);
And possibly better again, you can await the whole lot:
async Task Main()
{
var someParameter = new object();
int numTasks = 25;
var tasks =
Enumerable
.Range(0, numTasks)
.Select(i => Task.Run(() => new Program().doSomething(someParameter)))
.ToArray();
await Task.WhenAll(tasks);
}
Edit: As per the discussion in the comments, I was overestimating how much many threads would help, and have gone back to Parallel.ForEach with a reasonable MaxDegreeOfParallelism, and just have to wait it out.
I have a 2D array data structure, and perform work on slices of the data. There will only ever be around 1000 threads required to work on all the data simultaneously. Basically there are around 1000 "days" worth of data for all ~7000 data points, and I would like to process the data for each day in a new thread in parallel.
My issue is that doing work in the child threads dramatically slows the rate at which the main thread starts them. If no work is done in the child threads, the main thread starts them all basically instantly. In my example below, with just a bit of work, it takes ~65ms to start all the threads. In my real use case, the worker threads will take around 5-10 seconds to compute everything they need, but I would like them all to start instantly; otherwise, I am basically running the work in sequence. I do not understand why their work is slowing down the main thread from starting them.
How the data is set up shouldn't matter (I hope). The way it's set up might look weird; I was just simulating exactly how I receive the data. What's important is that if you comment out the foreach loop in the DoThreadWork method, the time it takes to start the threads is way lower.
I have the for (var i = 0; i < 4; i++) loop just to run the simulation multiple times to see 4 sets of timing results to make sure that it wasn't just slow the first time.
Here is a code snippet to simulate my real code:
public static void Main(string[] args)
{
var fakeData = Enumerable
.Range(0, 7000)
.Select(_ => Enumerable.Range(0, 400).ToArray())
.ToArray();
const int offset = 100;
var dataIndices = Enumerable
.Range(offset, 290)
.ToArray();
for (var i = 0; i < 4; i++)
{
var s = Stopwatch.StartNew();
var threads = dataIndices
.Select(n =>
{
var thread = new Thread(() =>
{
foreach (var fake in fakeData)
{
var sliced = new ArraySegment<int>(fake, n - offset, n - (n - offset));
DoThreadWork(sliced);
}
});
return thread;
})
.ToList();
foreach (var thread in threads)
{
thread.Start();
}
Console.WriteLine($"Before Join: {s.ElapsedMilliseconds}");
foreach (var thread in threads)
{
thread.Join();
}
Console.WriteLine($"After Join: {s.ElapsedMilliseconds}");
}
}
private static void DoThreadWork(ArraySegment<int> fakeData)
{
// Commenting out this foreach loop will dramatically increase the speed
// in which all the threads start
var a = 0;
foreach (var fake in fakeData)
{
// Simulate thread work
a += fake;
}
}
Use the thread/task pool and limit thread/task count to 2*(CPU cores) at most. Creating more threads doesn't magically make more work get done, as you need hardware "threads" to run them (1 per CPU core for non-SMT CPUs, 2 per core for Intel HT and AMD's SMT implementation). Executing hundreds to thousands of threads that don't have to passively await asynchronous callbacks (i.e. I/O) makes running the threads far less efficient, because the CPU thrashes on context switches for no reason.
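As a rough sketch of that advice applied to per-day work: the code below caps parallelism at Environment.ProcessorCount (the number of logical processors). BoundedParallelDemo and SumPerDay are hypothetical names, and summing a column merely stands in for the real computation.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class BoundedParallelDemo
{
    // Sums one "day" column across all rows; a stand-in for the real per-day work.
    public static long[] SumPerDay(int[][] data, int dayCount)
    {
        var totals = new long[dayCount];
        var options = new ParallelOptions
        {
            // Never run more iterations at once than there are logical processors.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };
        Parallel.For(0, dayCount, options, day =>
        {
            long sum = 0;
            foreach (var row in data) sum += row[day];
            totals[day] = sum; // each iteration writes its own slot, so no lock is needed
        });
        return totals;
    }

    public static void Main()
    {
        var data = Enumerable.Range(0, 7000)
            .Select(_ => Enumerable.Range(0, 400).ToArray())
            .ToArray();
        var totals = SumPerDay(data, 400);
        Console.WriteLine(totals[399]);
    }
}
```

Parallel.For returns only after every iteration has finished, so no explicit Join is required.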
I've been working on a multithreaded file archiver for a week now. It works exclusively on plain threads; synchronization is achieved with monitors and AutoResetEvent.
I set the number of threads to the number of cores like this:
public static int GetCoreCount()
{
int coreCount = 0;
foreach (var item in new System.Management.ManagementObjectSearcher("Select * from Win32_Processor").Get())
{
coreCount += int.Parse(item["NumberOfCores"].ToString());
}
return coreCount;
}
But that only loads my CPU to ~65% at most.
And this load is far from uniform; it constantly falls and rises.
Does anyone have any idea how to use 100% of the processor's capability?
This is my Run() code :
public void Run()
{
var readingThread = new Thread(new ThreadStart(ReadInFile));
var compressingThreads = new List<Thread>();
for (var i = 0; i < CoreManager.GetCoreCount(); i++)
{
var j = i;
ProcessEvents[j] = new AutoResetEvent(false);
compressingThreads.Add(new Thread(() => Process(j)));
}
var writingThread = new Thread(new ThreadStart(WriteOutFile));
readingThread.Start();
foreach (var compressThread in compressingThreads)
{
compressThread.Start();
}
writingThread.Start();
WaitHandle.WaitAll(ProcessEvents);
OutputDictionary.SetCompleted();
writingThread.Join();
}
It's not possible to tell what is limiting your core usage without profiling, and also knowing how much data you are compressing in your test.
However I can say that in order to get good efficiency, which includes both full core utilization and close to a factor of n speedup for n threads over one thread, in pigz I have to create pools of threads that are always there, either running or waiting for more work. It is a huge impact to create and destroy threads for every chunk of data to be processed. I also have pools of pre-allocated blocks of memory for the same reason.
The source code at the link, in C, may be of help.
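The idea of long-lived workers that wait for more work can be sketched in C# as below. This is not how pigz is implemented; it is only an illustration under assumptions: .NET 4.0 or later for BlockingCollection, hypothetical names WorkerPoolDemo and ProcessChunks, and a byte count standing in for the actual compression step.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class WorkerPoolDemo
{
    // Starts `workerCount` long-lived threads that pull chunks from a bounded queue.
    // Returns the total number of bytes the workers saw.
    public static long ProcessChunks(int workerCount, int chunkCount)
    {
        long processedBytes = 0;
        // The bound keeps the producer from running far ahead of the workers.
        using (var queue = new BlockingCollection<byte[]>(workerCount * 2))
        {
            var workers = new Thread[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = new Thread(() =>
                {
                    // Blocks while the queue is empty; ends once CompleteAdding
                    // has been called and the queue has drained.
                    foreach (var chunk in queue.GetConsumingEnumerable())
                    {
                        // Placeholder for the real per-chunk work (e.g. compression).
                        Interlocked.Add(ref processedBytes, chunk.Length);
                    }
                });
                workers[i].Start();
            }

            for (int c = 0; c < chunkCount; c++) queue.Add(new byte[1024]);
            queue.CompleteAdding();
            foreach (var worker in workers) worker.Join();
        }
        return Interlocked.Read(ref processedBytes);
    }

    public static void Main()
    {
        Console.WriteLine(ProcessChunks(Environment.ProcessorCount, 100));
    }
}
```

The threads are created once and reused for every chunk, so thread start-up cost is paid only workerCount times regardless of how much data flows through.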
I have a very simple program counting the characters in a string. An integer threadnum sets the number of threads and divides the data by threadnum accordingly into chunks for each thread to process.
Each thread increments the values contained in a shared dictionary, building a character histogram.
private Dictionary<UInt32, int> dict = new Dictionary<UInt32, int>();
In order to wait for all threads to finish and continue with the main process, I invoke Thread.Join
Initially I had a local dictionary for each thread which get merged afterwards, but a shared dictionary worked fine, without locking.
No references are locked in the method BuildDictionary, though locking the dictionary did not significantly impact thread-execution time.
Each thread is timed, and the resulting dictionary compared.
The dictionary content is the same regardless of a single or multiple threads - as it should be.
Each thread takes a fraction determined by threadnum to complete - as it should be.
Problem:
The total time grows roughly in proportion to threadnum; that is to say, the execution time increases as threads are added.
(Unfortunately I cannot run a C# Profiler at the moment. Additionally I would prefer C# 3 code compatibility. )
Others are likely struggling as well. It may be that the VS 2010 express edition vshost process stacks and schedules threads to be run sequentially?
Another MT-performance issue was recently posted here as "Visual Studio C# 2010 Express Debug running Faster than Release":
Code:
public int threadnum = 8;
Thread[] threads = new Thread[threadnum];
Stopwatch stpwtch = new Stopwatch();
stpwtch.Start();
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx] = new Thread(BuildDictionary);
threads[threadidx].Start(threadidx);
threads[threadidx].Join(); //Blocks the calling thread, till thread completion
}
WriteLine("Total - time: {0} msec", stpwtch.ElapsedMilliseconds);
Can you help please?
Update:
It appears that the strange behavior of an almost linear slowdown with increasing thread-number is an artifact due to the numerous hooks of the IDE's Debugger.
Running the process outside the developer environment, I actually do get a 30% speed increase on a 2 logical/physical core machine. During debugging I am already at the high end of CPU utilization, and hence I suspect it is wise to have some leeway during development through additional idle cores.
As initially, I let each thread compute on its own local data-chunk, which is locked and written back to a shared list and aggregated after all threads have finished.
Conclusion:
Be heedful of the environment the process is running in.
We can put the dictionary synchronization issues Tony the Lion mentions in his answer aside for the moment, because in your current implementation you are in fact not running anything in parallel!
Let's take a look at what you are currently doing in your loop:
Start a thread.
Wait for the thread to complete.
Start the next thread.
In other words, you should not be calling Join inside the loop.
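Instead, you should start all threads as you are doing, but use a signaling construct to determine when all threads have completed. A counting primitive such as a Semaphore is safer here than an AutoResetEvent, because an AutoResetEvent does not count: if two threads call Set before the main thread has consumed the first signal, one signal is lost and the main thread waits forever.
See example program:
class Program
{
    // Counts one release per finished thread, so no signal can be lost.
    static Semaphore _finished = new Semaphore(0, int.MaxValue);
    static void Main(string[] args)
    {
        int numThreads = 5;
        for (int i = 0; i < numThreads; i++)
        {
            new Thread(DoWork).Start(i);
        }
        // Wait for one release per started thread.
        for (int i = 0; i < numThreads; i++)
        {
            _finished.WaitOne();
        }
        Console.WriteLine("All threads finished");
    }
    static void DoWork(object id)
    {
        Thread.Sleep(1000);
        Console.WriteLine(String.Format("Thread {0} completed", (int)id));
        _finished.Release();
    }
}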
Alternatively you could just as well be calling Join in the second loop if you have references to the threads available.
After you have done this you can and should worry about the dictionary synchronization problems.
From MSDN: "A Dictionary can support multiple readers concurrently, as long as the collection is not modified."
You say:
but a shared dictionary worked fine, without locking.
Each thread increments the values contained in a shared dictionary
Your program is by definition broken: if you alter the data in the dictionary without proper locking, you will end up with bugs. Nothing more needs to be said.
I wouldn't use a shared static Dictionary. If each thread worked on a local copy, you could amalgamate your results once all threads had signalled completion.
WaitHandle.WaitAll avoids any deadlocking on an AutoResetEvent.
using System;
using System.Collections.Generic;
using System.Threading;
class Program
{
    static void Main()
    {
        char[] text = "Some String".ToCharArray();
        int numThreads = 5;
        // I leave the implementation of the next line (and the Partition type) to the OP.
        Partition[] partitions = PartitionWork(text, numThreads);
        ManualResetEvent[] completions = new ManualResetEvent[numThreads];
        IDictionary<char, int>[] results = new IDictionary<char, int>[numThreads];
        for (int i = 0; i < numThreads; i++)
        {
            results[i] = new Dictionary<char, int>();
            completions[i] = new ManualResetEvent(false);
            // Copy the loop variable so each lambda captures its own index.
            int index = i;
            new Thread(() => BuildDictionary(
                text,
                partitions[index].Start,
                partitions[index].End,
                results[index],
                completions[index])).Start();
        }
        // The WaitAll timeout may not exceed int.MaxValue milliseconds (about 24.8 days).
        if (WaitHandle.WaitAll(completions, TimeSpan.FromDays(1)))
        {
            Console.WriteLine("All threads finished");
        }
        else
        {
            Console.WriteLine("Timed out after a day");
        }
        // Merge the results of every thread into the first dictionary.
        IDictionary<char, int> result = results[0];
        for (int i = 1; i < numThreads; i++)
        {
            foreach (KeyValuePair<char, int> item in results[i])
            {
                if (result.ContainsKey(item.Key))
                {
                    result[item.Key] += item.Value;
                }
                else
                {
                    result.Add(item.Key, item.Value);
                }
            }
        }
    }
    // Counts the characters in text[start..finish] (inclusive) into this thread's own dictionary.
    static void BuildDictionary(
        char[] text,
        int start,
        int finish,
        IDictionary<char, int> result,
        EventWaitHandle completed)
    {
        for (int i = start; i <= finish; i++)
        {
            if (result.ContainsKey(text[i]))
            {
                result[text[i]]++;
            }
            else
            {
                result.Add(text[i], 1);
            }
        }
        // Signal that this partition is done.
        completed.Set();
    }
}
With this implementation the only variable that is ever shared is the char[] of the text and that is always read only.
You do have the burden of merging the dictionaries at the end, but that is a small price for avoiding any concurrency issues. In a later version of the framework I would have used TPL and ConcurrentDictionary and possibly Partitioner<TSource>.
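For readers on .NET 4.0 or later, the TPL variant mentioned above might look roughly like this. It is a sketch, not the answer's own code: HistogramDemo and Count are hypothetical names, and it keeps the same design of private per-worker dictionaries that are merged at the end.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class HistogramDemo
{
    public static ConcurrentDictionary<char, int> Count(string text)
    {
        var totals = new ConcurrentDictionary<char, int>();
        if (text.Length == 0) return totals; // Partitioner.Create rejects an empty range

        Parallel.ForEach(
            Partitioner.Create(0, text.Length),   // splits the index range into chunks
            () => new Dictionary<char, int>(),    // one private dictionary per worker
            (range, state, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    int n;
                    local.TryGetValue(text[i], out n); // n stays 0 when the key is absent
                    local[text[i]] = n + 1;
                }
                return local;
            },
            local =>
            {
                // Merge step: AddOrUpdate handles concurrent updates to the same key.
                foreach (var pair in local)
                {
                    totals.AddOrUpdate(pair.Key, pair.Value, (key, old) => old + pair.Value);
                }
            });
        return totals;
    }

    public static void Main()
    {
        foreach (var pair in Count("Some String"))
        {
            Console.WriteLine("'{0}': {1}", pair.Key, pair.Value);
        }
    }
}
```

The shared ConcurrentDictionary is touched only during the merge, once per worker per distinct character, so contention stays low.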
I totally agree with TonyTheLion and others: once you fix the actual problem of joining in the wrong place, there will still be a problem with the missing locks when updating the shared dictionary. I wanted to drop you a quick workaround: just wrap your integer value in an object:
instead of:
Dictionary<uint, int> dict = new Dictionary<uint, int>();
use:
class Entry { public int value; }
Dictionary<uint, Entry> dict = new Dictionary<uint, Entry>();
and now increment Entry.value instead. That way, the Dictionary itself is never modified (as long as every Entry has been added beforehand), so the dictionary does not need to be locked.
Note: this will, however, work only if you are guaranteed that each thread uses only its own Entry. I've just noticed this is not true, as you said 'histogram of characters'. You will have to lock each Entry during the increment, or some increments may be lost. Still, locking at the Entry level will be significantly faster than locking the whole dictionary.
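One way to make the per-Entry increment safe is Interlocked.Increment instead of a lock. The sketch below is an assumption-laden illustration: EntryDemo and Count are hypothetical names, and all entries are created on a single thread first, so that the worker threads only ever read the dictionary. That sequential pre-pass costs time of its own, so in practice it only makes sense when the key set is known up front.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class EntryDemo
{
    public class Entry { public int value; }

    public static Dictionary<uint, Entry> Count(uint[] data, int threadCount)
    {
        // Pre-create every Entry on one thread so workers never change
        // the dictionary's structure, only the counters inside it.
        var dict = new Dictionary<uint, Entry>();
        foreach (var key in data)
        {
            if (!dict.ContainsKey(key)) dict[key] = new Entry();
        }

        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            int offset = t; // copy the loop variable for the closure
            threads[t] = new Thread(() =>
            {
                // Each thread handles every threadCount-th element.
                for (int i = offset; i < data.Length; i += threadCount)
                {
                    // Atomic increment: no update is lost even when two
                    // threads hit the same Entry at the same time.
                    Interlocked.Increment(ref dict[data[i]].value);
                }
            });
            threads[t].Start();
        }
        foreach (var thread in threads) thread.Join();
        return dict;
    }

    public static void Main()
    {
        var counts = Count(new uint[] { 1, 2, 2, 3, 3, 3 }, 4);
        foreach (var pair in counts)
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value.value);
        }
    }
}
```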
Rotem saw it.
Your main thread should Join the other threads only after having started all of them.
Otherwise it waits for the first thread to finish before it starts, and then waits for, the second one.
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx] = new Thread(BuildDictionary);
threads[threadidx].Start(threadidx);
}
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx].Join(); //Blocks the calling thread, till thread completion
}
As Rotem points out, by joining in the loop you are waiting for each thread to complete before continuing.
The hint for why this is can be found on the Thread.Join documentation on MSDN
Blocks the calling thread until a thread terminates
So your loop will not continue until that one thread has completed its work. To start all the threads and then wait for them to complete, join them outside the loop:
public int threadnum = 8;
Thread[] threads = new Thread[threadnum];
Stopwatch stpwtch = new Stopwatch();
stpwtch.Start();
// Start all the threads doing their work
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx] = new Thread(BuildDictionary);
threads[threadidx].Start(threadidx);
}
// Join to all the threads to wait for them to complete
for (var threadidx = 0; threadidx < threadnum; threadidx++)
{
threads[threadidx].Join();
}
System.Diagnostics.Debug.WriteLine("Total - time: {0} msec", stpwtch.ElapsedMilliseconds);
You will really need to post your BuildDictionary function. It is very likely that the operation will be no faster with multiple threads and the threading overhead will actually increase execution time.
I tried a very minimal example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.Collections.Concurrent;
using System.Diagnostics;
namespace TPLExample {
class Program {
static void Main(string[] args) {
int[] dataItems = new int[100];
double[] resultItems = new double[100];
for (int i = 0; i < dataItems.Length; ++i) {
dataItems[i] = i;
}
Stopwatch stopwatch = new Stopwatch();
stopwatch.Reset();
stopwatch.Start();
Parallel.For(0, dataItems.Length, (index) => {
resultItems[index] = Math.Pow(dataItems[index], 2);
});
stopwatch.Stop();
Console.WriteLine("TPL Time elapsed: {0}", stopwatch.Elapsed);
stopwatch.Reset();
stopwatch.Start();
for (int i = 0; i < dataItems.Length; ++i) {
resultItems[i] = Math.Pow(dataItems[i], 2);
}
stopwatch.Stop();
Console.WriteLine("Sequential Time elapsed: {0}", stopwatch.Elapsed);
WaitForEnterKey();
}
public static void WaitForEnterKey() {
Console.WriteLine("Press enter to finish");
Console.ReadLine();
}
public static void PrintMessage() {
Console.WriteLine("Message printed");
}
}
}
The output was:
TPL Time elapsed: 00:00:00.0010670
Sequential Time elapsed: 00:00:00.0000178
Press enter to finish
The sequential loop is way faster than TPL! How is this possible? From my understanding, calculation within the Parallel.For will be executed in parallel, so must it be faster?
Simply put: For only iterating over a hundred items and performing a small mathematical operation, spawning new threads and waiting for them to complete produces more overhead than just running through the loop would.
From my understanding, calculation within the Parallel.For will be executed in parallel, so must it be faster?
As generally happens when people make sweeping statements about computer performance, there are far more variables at play here, and you can't really make that assumption. For example, inside your for loop, you are doing nothing more than Math.Pow, which the processor can perform very quickly. If this were an I/O intensive operation, requiring each thread to wait a long time, or even if it were a series of processor-intensive operations, you would get more out of Parallel processing (assuming you have a multi-threaded processor). But as it is, the overhead of creating and synchronizing these threads is far greater than any advantage that parallelism might give you.
Parallel loop processing is beneficial when the operation performed within the loop is relatively costly. All you're doing in your example is calculating an exponent, which is trivial. The overhead of multithreading is far outweighing the gains that you're getting in this case.
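When the loop body is this cheap, one common way to reduce the per-iteration overhead is to hand each worker a whole range of indices instead of a single item. The sketch below assumes .NET 4.0 or later; ChunkedSquares and Square are hypothetical names. For only 100 items the sequential loop will most likely still win, so treat this as an illustration of the technique rather than a fix.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ChunkedSquares
{
    public static double[] Square(int[] items)
    {
        var result = new double[items.Length];
        if (items.Length == 0) return result; // Partitioner.Create rejects an empty range

        // Each delegate call receives a (fromInclusive, toExclusive) range,
        // so the delegate overhead is paid once per chunk, not once per item.
        Parallel.ForEach(Partitioner.Create(0, items.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                result[i] = Math.Pow(items[i], 2);
            }
        });
        return result;
    }

    public static void Main()
    {
        var squares = Square(new[] { 0, 1, 2, 3 });
        Console.WriteLine(string.Join(", ", squares));
    }
}
```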
This code example is practical proof of the really nice answers above.
I've simulated an intensive processor operation by simply blocking the thread with Thread.Sleep.
The output was:
Sequential Loop - 00:00:09.9995500
Parallel Loop - 00:00:03.0347901
class Program
{
static void Main(string[] args)
{
const int a = 10;
Stopwatch sw = new Stopwatch();
sw.Start();
//for (long i = 0; i < a; i++)
//{
// Thread.Sleep(1000);
//}
Parallel.For(0, a, i =>
{
Thread.Sleep(1000);
});
sw.Stop();
Console.WriteLine(sw.Elapsed);
Console.ReadLine();
}
}
The overhead of parallelization is far greater than simply running Math.Pow 100 times sequentially. The others have said this.
More importantly, though, the memory access is trivial in the sequential version, but with the parallel version, the threads have to share memory (resultItems) and that kind of thing will really kill you even if you have a million items.
See page 44 of this excellent Microsoft whitepaper on parallel programming:
http://www.microsoft.com/en-us/download/details.aspx?id=19222. Here is an MSDN magazine article on the subject: http://msdn.microsoft.com/en-us/magazine/cc872851.aspx