I have a Windows Service that needs to pick up jobs from a database and process them.
Here, each job is a scanning process that takes approximately 10 minutes to complete.
I am very new to the Task Parallel Library. I have implemented the following as sample logic:
Queue queue = new Queue();
for (int i = 0; i < 10000; i++)
{
    queue.Enqueue(i);
}
for (int i = 0; i < 100; i++)
{
    Task.Factory.StartNew((Object data) =>
    {
        var Objdata = (Queue)data;
        Console.WriteLine(Objdata.Dequeue());
        Console.WriteLine(
            "The current thread is " + Thread.CurrentThread.ManagedThreadId);
    }, queue, TaskCreationOptions.LongRunning);
}
Console.ReadLine();
But this creates a lot of threads. Since the loop repeats 100 times, it creates 100 threads.
Is it the right approach to create that many parallel threads?
Is there any way to limit the number of threads to 10 (a concurrency level of 10)?
An important factor to remember when allocating new threads is that the OS has to allocate a number of logical entities for each thread to run:
Thread kernel object: an object describing the thread, including the thread's context, CPU registers, etc.
Thread environment block: for exception handling and thread-local storage
User-mode stack: 1 MB of stack
Kernel-mode stack: for passing arguments from user mode to kernel mode
Other than that, the number of threads that can truly run concurrently depends on the number of cores your machine has, and creating more threads than you have cores causes context switching, which in the long run may slow your work down.
So, after the long intro, on to the good stuff: what we actually want to do is limit the number of threads running and reuse them as much as possible.
For this kind of job I would go with TPL Dataflow, which is based on the producer-consumer pattern. Just a small example of what can be done:
// Requires a reference to System.Threading.Tasks.Dataflow
using System.Threading.Tasks.Dataflow;

// A BufferBlock is the equivalent of a ConcurrentQueue for buffering your objects
var bufferBlock = new BufferBlock<object>();

// An ActionBlock to process each object and do something with it
var actionBlock = new ActionBlock<object>(obj =>
{
    // Do stuff with the objects from the BufferBlock
});

bufferBlock.LinkTo(actionBlock);

// Propagate completion so the ActionBlock finishes once the buffer is drained
bufferBlock.Completion.ContinueWith(t => actionBlock.Complete());
You may pass each block an ExecutionDataflowBlockOptions, which lets you limit the BoundedCapacity (the number of objects inside the block) and set MaxDegreeOfParallelism, which tells the block the maximum concurrency you want.
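For instance, a minimal sketch of how those options could cap this scenario at 10 concurrent scans (ProcessScan and the counts here are placeholders of mine, not part of the original answer):
var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10, // at most 10 scans run concurrently
    BoundedCapacity = 100        // at most 100 jobs buffered inside the block
};

var scanBlock = new ActionBlock<int>(jobId =>
{
    ProcessScan(jobId); // hypothetical stand-in for the ~10-minute scan
}, options);

for (int i = 0; i < 10000; i++)
    scanBlock.SendAsync(i).Wait(); // waits whenever the block is full

scanBlock.Complete();
scanBlock.Completion.Wait();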
There is a good example here to get you started.
Glad you asked, because you're right in the sense that this is not the best approach.
The concept of a Task should not be confused with a Thread. A Thread can be compared to a chef in a kitchen, while a Task is a dish ordered by a customer. You have a bunch of chefs, and they process the dish orders in some order (usually FIFO). A chef finishes a dish, then moves on to the next. The thread pool works the same way: you create a bunch of Tasks to be completed, but you do not need to assign a new thread to each task.
OK, so on to the actual bits to do it. There are a few. The first one is ThreadPool.QueueUserWorkItem (http://msdn.microsoft.com/en-us/library/system.threading.threadpool.queueuserworkitem(v=vs.110).aspx). Using the Parallel library, Parallel.For can also be used; it will automatically schedule the work across threads based on the number of CPU cores available on the system.
Parallel.For(0, 100, i =>
{
    // This method will be called 100 times, with i ranging from 0 to 99
    WaitForGrassToGrow();
    Console.WriteLine(string.Format("The {0}-th task has completed!", i));
});
Note that there is no guarantee that the method called by Parallel.For is called in sequence (0,1,2,3,4,5...). The actual sequence depends on the execution.
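And to directly answer the question about capping the concurrency at 10: Parallel.For has an overload that takes a ParallelOptions. A minimal sketch, reusing the placeholder method from above:
var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };
Parallel.For(0, 100, options, i =>
{
    WaitForGrassToGrow(); // placeholder for the real ~10-minute job
    Console.WriteLine(string.Format("The {0}-th task has completed!", i));
});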
Related
I am trying to understand how Parallel.Invoke creates and reuses threads.
I ran the following example code (from MSDN, https://msdn.microsoft.com/en-us/library/dd642243(v=vs.110).aspx):
using System;
using System.Threading;
using System.Threading.Tasks;

class ThreadLocalDemo
{
    static void Main()
    {
        // Thread-local variable that yields a name for a thread
        ThreadLocal<string> ThreadName = new ThreadLocal<string>(() =>
        {
            return "Thread" + Thread.CurrentThread.ManagedThreadId;
        });

        // Action that prints out ThreadName for the current thread
        Action action = () =>
        {
            // If ThreadName.IsValueCreated is true, it means that we are not the
            // first action to run on this thread.
            bool repeat = ThreadName.IsValueCreated;
            Console.WriteLine("ThreadName = {0} {1}", ThreadName.Value, repeat ? "(repeat)" : "");
        };

        // Launch eight of them. On 4 cores or less, you should see some repeat ThreadNames.
        Parallel.Invoke(action, action, action, action, action, action, action, action);

        // Dispose when you are done
        ThreadName.Dispose();
    }
}
As I understand it, Parallel.Invoke tries to create 8 threads here - one for each action. So it creates the first thread, runs the first action, and by that gives a ThreadName to the thread. Then it creates the next thread (which gets a different ThreadName) and so on.
If it cannot create a new thread, it will reuse one of the threads created before. In this case, the value of repeat will be true and we can see this in the console output.
Is this correct until here?
The second-to-last comment ("Launch eight of them. On 4 cores or less, you should see some repeat ThreadNames") implies that the threads created by Invoke correspond to the available CPU threads of the processor: on 4 cores we have 8 CPU threads; at least one is busy (running the operating system and such), so Invoke can only use 7 different threads, and we must get at least one "repeat".
Is my interpretation of this comment correct?
I ran this code on my PC which has an Intel® Core™ i7-2860QM processor (i.e. 4 cores, 8 cpu threads). I expected to get at least one "repeat", but I didn't. When I changed the Invoke to take 10 instead of 8 actions, I got this output:
ThreadName = Thread6
ThreadName = Thread8
ThreadName = Thread6 (repeat)
ThreadName = Thread5
ThreadName = Thread3
ThreadName = Thread1
ThreadName = Thread10
ThreadName = Thread7
ThreadName = Thread4
ThreadName = Thread9
So I have at least 9 different threads in the console application. This contradicts the fact that my processor only has 8 threads.
So I guess some of my reasoning from above is wrong. Does Parallel.Invoke work differently than what I described above? If yes, how?
If you pass fewer than 10 items to Parallel.Invoke, and you don't specify MaxDegreeOfParallelism in the options (so: your case), it will just run them all in parallel on the thread pool scheduler, using roughly the following code:
var actions = new[] { action, action, action, action, action, action, action, action };
var tasks = new Task[actions.Length];
// All but the first action are queued to the thread pool...
for (int index = 1; index < tasks.Length; ++index)
    tasks[index] = Task.Factory.StartNew(actions[index]);
// ...while the first one runs synchronously on the calling thread.
tasks[0] = new Task(actions[0]);
tasks[0].RunSynchronously();
Task.WaitAll(tasks);
So it is just a regular Task.Factory.StartNew. If you look at the maximum number of threads in the thread pool:
int th, io;
ThreadPool.GetMaxThreads(out th, out io);
Console.WriteLine(th);
you will see some big number, like 32767. So the number of threads on which Parallel.Invoke will be executed (in your case) is not limited to the number of CPU cores at all. Even on a 1-core CPU it might run 8 threads in parallel.
You might then wonder why any threads are reused at all. Because when work on a thread pool thread is done, that thread is returned to the pool and is ready to accept new work. The actions from your example do essentially no work and complete very fast, so sometimes the first thread started via Task.Factory.StartNew has already completed your action and returned to the pool before all the subsequent threads have started. That thread is then reused.
By the way, you can see (repeat) in your example with 8 actions, and even with 7 if you try hard enough, on an 8-core (16 logical cores) processor.
UPDATE, to answer your comment: the thread pool scheduler will not necessarily create new threads immediately. There is a minimum and a maximum number of threads in the thread pool. I already showed above how to see the maximum. To see the minimum number:
int th, io;
ThreadPool.GetMinThreads(out th, out io);
This number will usually be equal to the number of cores (so, for example, 8). Now, when you request a new action to be performed on a thread pool thread and the number of threads in the pool is less than the minimum, a new thread is created immediately. However, if the number of available threads is greater than the minimum, a certain delay is introduced before a new thread is created (unfortunately I don't remember exactly how long, but it is about 500 ms).
I highly doubt the statement you added in your comment takes 2-3 seconds to execute; for me it executes in at most 0.3 seconds. So when the first 8 threads have been created by the thread pool, there is that 500 ms delay before the 9th is created. During that delay some (or all) of the first 8 threads have completed their job and are available for new work, so there is no need to create a new thread and they can be reused.
To verify this, introduce a bigger delay:
// Note: Enumerable.Repeat below requires "using System.Linq;"
static void Main()
{
    // Thread-local variable that yields a name for a thread
    ThreadLocal<string> ThreadName = new ThreadLocal<string>(() =>
    {
        return "Thread" + Thread.CurrentThread.ManagedThreadId;
    });

    // Action that prints out ThreadName for the current thread
    Action action = () =>
    {
        // If ThreadName.IsValueCreated is true, it means that we are not the
        // first action to run on this thread.
        bool repeat = ThreadName.IsValueCreated;
        Console.WriteLine("ThreadName = {0} {1}", ThreadName.Value, repeat ? "(repeat)" : "");
        Thread.Sleep(1000000);
    };

    int th, io;
    ThreadPool.GetMinThreads(out th, out io);
    Console.WriteLine("cpu:" + Environment.ProcessorCount);
    Console.WriteLine(th);

    Parallel.Invoke(Enumerable.Repeat(action, 100).ToArray());

    // Dispose when you are done
    ThreadName.Dispose();
    Console.ReadKey();
}
You will see that the thread pool now has to create new threads every time (many more than there are cores), because it cannot reuse the previous threads while they are busy.
You can also increase the minimum number of threads in the thread pool, like this:
int th, io;
ThreadPool.GetMinThreads(out th, out io);
ThreadPool.SetMinThreads(100, io);
This removes the delay (until 100 threads have been created), and you will notice that in the example above.
Behind the scenes, threads are organized by (and owned by) the task scheduler. The primary purpose of the task scheduler is to keep all CPU cores as busy as possible with useful work.
Under the hood, the scheduler uses the thread pool, and the size of the thread pool is the knob for fine-tuning how usefully the CPU cores are occupied.
Now this requires some analysis. For instance, thread switching costs CPU cycles and is not useful work. On the other hand, while one thread executes one task on a core, all other tasks are stalled and make no progress on that core. I believe that is the core reason why the scheduler usually starts two threads per core, so that at least some movement is visible in case one task takes longer to complete (like several seconds).
There are corollaries to this basic mechanism. When some tasks take a long time to complete, the scheduler starts new threads to compensate. That means the long-running task will now have to compete for the core with short-running tasks. In that way, short tasks are completed one after another, and the long task slowly progresses to its completion as well.
The bottom line is that your observations about threads are generally correct, but not entirely true in specific situations. In a concrete execution of a number of tasks, the scheduler might choose to raise more threads, or to keep going with the default. That is why you will sometimes notice that the number of threads differs.
Remember the goal of the game: utilize the CPU cores with useful work as much as possible, while at the same time keeping all tasks moving, so that the application doesn't look frozen. Historically, people tried to reach these goals with many different techniques. Analysis showed that many of those techniques were applied haphazardly and didn't really increase CPU utilization. That analysis led to the introduction of task schedulers in .NET, so that the fine-tuning can be coded once and done well.
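To make that compensation visible, here is a small experiment of my own (a sketch, not from the original answer): queue a batch of long, blocking work items and sample how many pool threads are busy over time.
using System;
using System.Threading;
using System.Threading.Tasks;

class InjectionDemo
{
    static void Main()
    {
        // Queue far more blocking work than there are cores.
        for (int i = 0; i < 32; i++)
            Task.Run(() => Thread.Sleep(10000));

        // Sample the busy worker-thread count once per second.
        ThreadPool.GetMaxThreads(out int maxWorker, out _);
        for (int s = 0; s < 8; s++)
        {
            Thread.Sleep(1000);
            ThreadPool.GetAvailableThreads(out int worker, out _);
            // The count creeps upward as the scheduler injects new threads.
            Console.WriteLine("busy worker threads: " + (maxWorker - worker));
        }
    }
}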
So I have at least 9 different threads in the console application. This contradicts the fact that my processor only has 8 threads.
A thread is a very much overloaded term. It can mean, at the very least: (1) something you sew with, (2) a bunch of code with associated state, that is represented by an OS handle, and (3) an execution pipeline of a CPU. The Thread.CurrentThread refers to (2), the "processor thread" that you mentioned refers to (3).
The existence of a (2)-thread is not predicated on the existence of (3)-thread, and the number of (2)-threads that exist on any particular system is pretty much limited by available memory and OS design. The existence of (2)-thread doesn't imply execution of (2)-thread at any given time (unless you use APIs that guarantee that).
Furthermore, if a (2)-thread executes at some point - implying a temporary 1:1 binding between (2)-thread and (3)-thread, there is no implication that the thread will continue executing in general, and of course neither is there an implication that the thread will continue executing on the same (3)-thread if it continues executing at all.
So, even if you have "caught" the execution of a (2)-thread on a (3)-thread by some side effect, e.g. console output, as you did, that doesn't necessarily imply anything about any other (2)-threads and (3)-threads at that point.
On to your code:
// If ThreadName.IsValueCreated is true, it means that we are not the
// first action to run on this thread. <-- this refers to (2)-thread, NOT (3)-thread.
Parallel.Invoke is not precluded (in terms of its specification) from creating as many new (2)-threads as there are arguments passed to it. The actual number of (2)-threads created may be anywhere from zero on up, since to call Parallel.Invoke there must already be an existing (2)-thread running the code that calls this API. So no new (2)-threads need to be created at all, for example. Whether the (2)-threads created by Parallel.Invoke execute on any particular number of (3)-threads concurrently is beyond your control as well.
So that explains the behavior you see. You conflated (2)-threads with (3)-threads, and assumed that Parallel.Invoke does something specific that it is in fact not guaranteed to do. Citing the documentation:
No guarantees are made about the order in which the operations execute or whether they execute in parallel.
This implies that Invoke is free to run the actions on dedicated (2)-threads if it so wishes. And that is what you observed.
private static void Main(string[] args)
{
    for (int i = 0; i < 1000; i++)
    {
        Task.Factory.StartNew(() =>
        {
            Thread.Sleep(1000);
            Console.WriteLine("hej");
            Thread.Sleep(10000);
        });
    }
    Console.ReadLine();
}
Why doesn't this code print "hej" 1000 times after one second? Why does Thread.Sleep(10000) have an impact on the code's behavior?
Task.Factory.StartNew effectively delegates the work to the ThreadPool.
The thread pool creates threads immediately to serve requests, as long as the thread count is less than or equal to the processor count. Once it reaches the processor count, it stops creating new threads immediately. That makes sense, because creating more threads than the processor count introduces thread-scheduling overhead and gains nothing.
Instead, it throttles the creation of threads. It waits 500 ms to see whether any work is still pending with no threads free to process it. If there is pending work, it introduces a new thread (only one). This process keeps going as long as you have enough work to do.
When the work queue's traffic clears, the thread pool destroys its threads, and the above process starts over.
Also, there is a maximum limit to the number of threads the thread pool can run simultaneously. If you hit it, the thread pool stops creating more threads and waits for previous work items to complete, so that it can reuse the existing threads.
That's not the end of the story; it's convoluted! These are just a few of the decisions the ThreadPool makes.
I hope it is now clear why you see what you see.
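As an aside, you can make the throttling visible by bypassing it. A sketch (a demonstration, not a recommendation, since it creates a thousand OS threads): TaskCreationOptions.LongRunning hints the default scheduler to use a dedicated thread instead of a pool thread, so all 1000 lines appear after roughly one second.
for (int i = 0; i < 1000; i++)
{
    Task.Factory.StartNew(() =>
    {
        Thread.Sleep(1000);
        Console.WriteLine("hej");
        Thread.Sleep(10000);
    }, TaskCreationOptions.LongRunning); // dedicated thread, no pool throttling
}
Console.ReadLine();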
There are a multitude of factors that would alter the result.
Some being (but not limited to):
The inherent time for the iteration of the loop
The size of the thread pool
Thread management overhead
The way your code behaves is the intended behaviour. You wait 1000 milliseconds to print "hej", and after printing you sleep for another 10000 milliseconds. If you want to print "hej" 1000 times after one second, remove the Thread.Sleep(10000).
I'm trying to understand why Parallel.For is able to outperform a number of hand-rolled threads in the following scenario: consider a batch of jobs that can be processed in parallel. While processing these jobs, new work may be added, which then needs to be processed as well. The Parallel.For solution would look as follows:
var jobs = new List<Job> { firstJob };
int startIdx = 0, endIdx = jobs.Count;
while (startIdx < endIdx)
{
    Parallel.For(startIdx, endIdx, i => WorkJob(jobs[i]));
    startIdx = endIdx;
    endIdx = jobs.Count;
}
This means that Parallel.For needs to synchronize multiple times. Consider a breadth-first graph algorithm: the number of synchronizations would be quite large. A waste of time, no?
Trying the same in the old-fashioned threading approach:
// Note: ConcurrentQueue<T> has no collection initializer; seed it via the constructor.
var queue = new ConcurrentQueue<Job>(new[] { firstJob });
var threads = new List<Thread>();
var waitHandle = new AutoResetEvent(false);
int numBusy = 0;
for (int i = 0; i < maxThreads; i++)
    threads.Add(new Thread(new ThreadStart(delegate
    {
        while (!queue.IsEmpty || numBusy > 0)
        {
            if (queue.IsEmpty)
                // numBusy > 0 implies more data may arrive
                waitHandle.WaitOne();
            Job job;
            if (queue.TryDequeue(out job))
            {
                Interlocked.Increment(ref numBusy);
                WorkJob(job); // WorkJob does a waitHandle.Set() when more work was found
                Interlocked.Decrement(ref numBusy);
            }
        }
        // others are possibly waiting for us to enable more work which won't happen
        waitHandle.Set();
    })));
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
The Parallel.For code is of course much cleaner, but what I cannot comprehend is that it's even faster! Is the task scheduler just that good? The synchronizations were eliminated and there's no busy waiting, yet the threaded approach is consistently slower (for me). What's going on? Can the threading approach be made faster?
Edit: thanks for all the answers, I wish I could pick multiple ones. I chose to go with the one that also shows an actual possible improvement.
The two code samples are not really the same.
Parallel.ForEach() will use a limited number of threads and re-use them. The second sample is already starting way behind by having to create a number of threads first. That takes time.
And what is the value of maxThreads? Very critical; in Parallel.ForEach() it is dynamic.
Is the task scheduler just that good?
It is pretty good. And the TPL uses work stealing and other adaptive technologies. You'll have a hard time doing any better.
Parallel.For doesn't actually break the items into single units of work. It breaks up all the work (early on) based on the number of threads it plans to use and the number of iterations to be executed. It then has each thread synchronously process its batch (possibly using work stealing or saving a few extra items for load balancing near the end). With this approach the worker threads virtually never wait on each other, while your threads constantly wait on each other due to the heavy synchronization you use before and after every single iteration.
On top of that, since it's using thread pool threads, many of the threads it needs are likely already created, which is another advantage in its favor.
As for synchronization, the entire point of Parallel.For is that all of the iterations can be done in parallel, so there is almost no synchronization that needs to take place (at least in its own code).
Then of course there is the issue of the number of threads. The thread pool has a lot of very good algorithms and heuristics to help it determine how many threads are needed at that instant in time, based on the current hardware, the load from other applications, and so on. It's possible that you're using too many threads, or not enough.
Also, since the number of items you have isn't known before you start, I would suggest using Parallel.ForEach rather than several Parallel.For loops. It is simply designed for the situation you're in, so its heuristics will apply better. (It also makes for even cleaner code.)
BlockingCollection<Job> queue = new BlockingCollection<Job>();
// Add jobs to the queue, possibly in another thread.
// Call queue.CompleteAdding() when there are no more jobs to run.
Parallel.ForEach(queue.GetConsumingEnumerable(),
    job => job.DoWork());
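For completeness, a hypothetical producer side for that snippet (FindJobs is a placeholder name of mine):
// Producer: discover jobs on another thread, then signal completion.
Task.Run(() =>
{
    foreach (var job in FindJobs()) // placeholder source of work
        queue.Add(job);
    queue.CompleteAdding();         // lets GetConsumingEnumerable finish
});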
You're creating a bunch of new threads, while Parallel.For uses the thread pool. You'd see better performance if you used the thread pool yourself, but there really is no point in doing that.
I would shy away from rolling your own solution; if there is a corner case where you need customization, use the TPL and customize it.
I've got an I/O intensive operation.
I only want a MAX of 5 threads ever running at one time.
I've got 8000 tasks to queue and complete.
Each task takes approximately 15-20seconds to execute.
I've looked around at ThreadPool, but
ThreadPool.SetMaxThreads(5, 0);
List<task> tasks = GetTasks();
int toProcess = tasks.Count;
ManualResetEvent resetEvent = new ManualResetEvent(false);
for (int i = 0; i < tasks.Count; i++)
{
    ReportGenerator worker = new ReportGenerator(tasks[i].Code, id);
    ThreadPool.QueueUserWorkItem(x =>
    {
        worker.Go();
        if (Interlocked.Decrement(ref toProcess) == 0)
            resetEvent.Set();
    });
}
resetEvent.WaitOne();
I cannot figure out why my code is executing more than 5 threads at one time. I've tried SetMaxThreads and SetMinThreads, but it keeps executing more than 5 threads.
What is happening? What am I missing? Should I be doing this another way?
Thanks
There is a limitation in SetMaxThreads: you can never set it lower than the number of processors on the system. If you have 8 processors, setting it to 5 is the same as not calling the function at all.
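You can detect this case, since SetMaxThreads returns false when the request is rejected; a quick check:
bool accepted = ThreadPool.SetMaxThreads(5, 0);
Console.WriteLine("accepted: {0}, processors: {1}",
                  accepted, Environment.ProcessorCount);
// On an 8-processor machine this would print "accepted: False, processors: 8".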
The Task Parallel Library can help you:
List<task> tasks = GetTasks();
Parallel.ForEach(tasks, new ParallelOptions { MaxDegreeOfParallelism = 5 },
    task =>
    {
        ReportGenerator worker = new ReportGenerator(task.Code, id);
        worker.Go();
    });
What does MaxDegreeOfParallelism do?
I think there's a different and better way to approach this. (Pardon me if I accidentally Java-ize some of the syntax)
The main thread here has a list of things to do in "tasks". Instead of creating a thread for each task, which is really not efficient when you have so many items, create the desired number of threads and then have them request tasks from the list as needed.
The first thing to do is add a variable to the class this code comes from, for use as a pointer into the list. We'll also add one for the maximum desired thread count.
// New fields in your class definition
private int taskStackPointer;
private const int MAX_THREADS = 5;
Create a method that returns the next task in the list and increments the stack pointer. Then create a new interface for this:
// Make sure that only one thread has access at a time
[MethodImpl(MethodImplOptions.Synchronized)]
public task getNextTask()
{
    if (taskStackPointer < tasks.Count)
        return tasks[taskStackPointer++];
    else
        return null;
}
Alternately, you could return tasks[taskStackPointer++].Code, if there's a value you can designate as meaning "end of list". Probably easier to do it this way, however.
The interface:
public interface TaskDispatcher
{
    // The synchronization attribute belongs on the implementation, as above
    task getNextTask();
}
Within the ReportGenerator class, change the constructor to accept the dispatcher object:
public ReportGenerator(TaskDispatcher td, int idCode)
{
    ...
}
You'll also need to alter the ReportGenerator class so that its processing has an outer loop that starts by calling td.getNextTask() to request a new task, and exits the loop when it gets back null. A sketch of that loop follows.
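Here is what that outer loop could look like (the field and method names here are my assumptions, not the asker's actual code):
public void Go()
{
    task t;
    // Keep pulling work from the shared dispatcher until the list runs dry.
    while ((t = dispatcher.getNextTask()) != null)
    {
        GenerateReport(t.Code); // hypothetical per-task work
    }
}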
Finally, alter the thread creation code to something like this (this is just to give you an idea):
taskStackPointer = 0;
for (int i = 0; i < MAX_THREADS; i++)
{
    ReportGenerator worker = new ReportGenerator(this, id);
    worker.Go(); // assumed to start the worker's own thread
}
That way you create the desired number of threads and keep them all working at max capacity.
(I'm not sure I got the usage of "[MethodImpl(MethodImplOptions.Synchronized)]" exactly right... I am more used to Java than C#)
Your tasks list will have 8k items in it because you told the code to put them there:
List<task> tasks = GetTasks();
That said, this number has nothing to do with how many threads are being used; the debugger is simply showing how many items you added to the list.
There are various ways to determine how many threads are in use. Perhaps one of the simplest is to break into the application with the debugger and take a look at the Threads window. Not only will you get a count, but you'll also see what each thread is doing (or not), which leads me to...
There is significant discussion to be had about what your tasks are doing and how you arrived at a number to 'throttle' the thread pool. In most use cases, the thread pool will do the right thing.
Now to answer your specific question...
To explicitly control the number of concurrent tasks, consider a trivial implementation: change your task collection from a List to a BlockingCollection (which internally uses a ConcurrentQueue) and use the following code to 'consume' the work:
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = 5
};

Parallel.ForEach(collection.GetConsumingEnumerable(), options, x =>
{
    // Do work here...
});
Change MaxDegreeOfParallelism to whatever concurrent value you have determined is appropriate for the work you are doing.
The following might be of interest to you:
Parallel.ForEach Method
BlockingCollection
Chris
It works for me. This way you can't use a number of worker threads smaller than minWorkerThreads. The problem is that if you need a maximum of five worker threads and minWorkerThreads is six, it doesn't work.
int minWorkerThreads, minPortThreads;
ThreadPool.GetMinThreads(out minWorkerThreads, out minPortThreads);
ThreadPool.SetMaxThreads(minWorkerThreads, minPortThreads);
MSDN
Remarks
You cannot set the maximum number of worker threads or I/O completion threads to a number smaller than the number of processors on the computer. To determine how many processors are present, retrieve the value of the Environment.ProcessorCount property. In addition, you cannot set the maximum number of worker threads or I/O completion threads to a number smaller than the corresponding minimum number of worker threads or I/O completion threads. To determine the minimum thread pool size, call the GetMinThreads method.
If the common language runtime is hosted, for example by Internet Information Services (IIS) or SQL Server, the host can limit or prevent changes to the thread pool size.
Use caution when changing the maximum number of threads in the thread pool. While your code might benefit, the changes might have an adverse effect on code libraries you use.
Setting the thread pool size too large can cause performance problems. If too many threads are executing at the same time, the task switching overhead becomes a significant factor.
When you spawn multiple tasks, like so:
// Assumes tls is a ThreadLocal<int>, and account/tasks are defined elsewhere.
for (int i = 0; i < 1000000; i++)
{
    // create a new task
    tasks[i] = new Task<int>((stateObject) =>
    {
        tls.Value = (int)stateObject;
        for (int j = 0; j < 1000; j++)
        {
            // update the TLS balance
            tls.Value++;
        }
        return tls.Value;
    }, account.Balance);
    tasks[i].Start();
}
Those tasks are basically operating on a ProcessThread, so therefore we can slice 1 process thread 1,000,000 times for 1,000,000 tasks.
Is it the TPL task scheduler that looks at the OS, determines that we have 8 virtual process threads on a multicore machine, and then allocates the load of 1,000,000 tasks across these 8 virtual process threads?
Now tasks are basically operating on a ProcessThread... so therefore we can slice 1 process thread 1000000 times for 1000000 tasks.
This is not true. A Task != a thread, and especially does not equate to a ProcessThread. Multiple tasks will get scheduled onto a single thread.
Is it the TPL task scheduler that looks at the OS and determines that we have 8 virtual process threads in a multicore machine, and so therefore allocates the load of 1000000 tasks across these 8 virtual process threads?
Effectively, yes. When using the default TaskScheduler (which you're doing above), the tasks are run on ThreadPool threads. The 1000000 tasks will not create 1000000 threads (though the pool will use more than the 8 you mention...).
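A quick way to convince yourself (my own sketch, not part of the original answer): count the distinct pool threads a large batch of tasks actually runs on.
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Demo
{
    static void Main()
    {
        var threadIds = new ConcurrentDictionary<int, bool>();
        var tasks = Enumerable.Range(0, 100000)
            .Select(_ => Task.Run(() =>
                threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true)))
            .ToArray();
        Task.WaitAll(tasks);
        // Prints a small number (tens at most), nowhere near 100000.
        Console.WriteLine("distinct threads used: " + threadIds.Count);
    }
}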
That being said, data parallelism (such as looping in a giant for loop) is typically much better handled via Parallel.For or Parallel.ForEach. The Parallel class will, internally, use a Partitioner<T> to split up the work into fewer tasks, which will give you better overall performance since it will have far less overhead. For more details, see my post on Partitioning in the TPL.
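For illustration, a small sketch (mine, not from that post) of how a range partitioner turns a giant index space into a handful of chunked tasks:
using System.Collections.Concurrent;
using System.Threading.Tasks;

var rangePartitioner = Partitioner.Create(0, 1000000);
Parallel.ForEach(rangePartitioner, range =>
{
    // Each worker receives a contiguous [from, to) chunk instead of one index.
    for (int i = range.Item1; i < range.Item2; i++)
    {
        // ... process item i ...
    }
});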
Roughly speaking, your current code pushes 1000000 tasks onto the ThreadPool. When those tasks take some significant time, you could run into problems.
In a situation like this, always use
Parallel.For(0, 1000000, ...);
and then you not only have the scheduler but, more importantly, also a partitioner helping you to distribute the load.
Not to mention it's much more readable.