C# SpinWait for long-term waiting


This code consumes near zero CPU (i5 family)
public void SpinWait() {
    for (int i = 0; i < 10000; i++)
    {
        Task.Factory.StartNew(() =>
        {
            var sw = new SpinWait();
            while (true)
            {
                sw.SpinOnce();
            }
        });
    }
}
In my code, the performance difference compared to SemaphoreSlim is 3x or more in the case where spinning is really justified (5 Mops). However, I am concerned about using it for long-term waits. The standard advice is to implement a two-phase wait operation. I could check the NextSpinWillYield property and introduce a counter plus reset to increase the default number of spinning iterations without yielding, and then back off to a semaphore.
But what are the downsides of using just SpinWait.SpinOnce for long-term waiting? I have looked through its implementation and it properly yields when needed. It uses Thread.SpinWait, which on modern CPUs uses the PAUSE instruction and is quite efficient according to Intel.
One issue that I have found while monitoring Task Manager is that the number of threads is gradually increasing due to the default ThreadPool algorithm (it adds a thread every second when all tasks are busy). This could be solved by using ThreadPool.SetMaxThreads; the number of threads is then fixed and CPU usage is still near zero.
If the number of long-waiting tasks is bounded, what are the other pitfalls of using SpinWait.SpinOnce for long-term waiting? Does it depend on CPU family, OS, or .NET version?
(Just to clarify: I will still implement two-phase waiting, I am just curious why not use SpinOnce all the time?)

Well, the down-side is exactly the one you see, your code is occupying a thread without accomplishing anything. Preventing other code from running and forcing the threadpool manager to do something about it. Tinkering with ThreadPool.SetMaxThreads() is just a band-aid on what is likely to be a profusely bleeding wound, only ever use it when you need to catch the plane home.
Spinning should only ever be attempted when you have a very good guarantee that doing so is more efficient than a thread context switch. Which means that you have to be sure that the thread can continue within ~10,000 cpu cycles or less. That is only 5 microseconds, give or take, a wholeheckofalot less than what most programmers consider "long-term".
Use a sync object that will trigger a thread context switch instead. Or the lock keyword.
Not only will that yield the processor so other waiting threads can get their job done, thus accomplishing a lot more work, it also provides an excellent cue to the OS thread scheduler. A sync object that is signaled will bump the priority of the thread so it is very likely to get the processor next.
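For reference, the two-phase wait mentioned in the question can be sketched roughly like this. It is only an illustration under assumptions: the TwoPhaseWait class and its Wait method are hypothetical names, and the ready flag stands in for whatever state your code is really waiting on. Note that ManualResetEventSlim already spins briefly on its own before blocking, so in many cases calling its Wait() directly is enough.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original question or of .NET.
public static class TwoPhaseWait
{
    // Phase 1 spins for a short while; phase 2 blocks on the event.
    public static void Wait(Func<bool> isReady, ManualResetEventSlim signal)
    {
        var spinner = new SpinWait();

        // Phase 1: spin only as long as SpinWait would busy-wait rather than yield.
        while (!spinner.NextSpinWillYield)
        {
            if (isReady()) return;
            spinner.SpinOnce();
        }

        // Phase 2: give up the processor until the producer sets the event.
        while (!isReady())
        {
            signal.Wait();
        }
    }

    public static void Main()
    {
        var signal = new ManualResetEventSlim(false);
        int ready = 0;

        var producer = Task.Run(() =>
        {
            Thread.Sleep(100);             // simulate a wait far longer than a spin
            Volatile.Write(ref ready, 1);  // publish the state first
            signal.Set();                  // then wake the waiter
        });

        Wait(() => Volatile.Read(ref ready) == 1, signal);
        producer.Wait();
        Console.WriteLine("resumed");
    }
}
```

The waiting thread burns CPU only during the short first phase; after that it is parked and costs nothing until the event is set.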

Related

Is there a better way to implement multithreading?

The code below causes high CPU consumption! Is there a better way to implement multithreading?
// This application will run 24 hours a day, 7 days a week
// on a server
static void Main(string[] args)
{
    const int NUM = 10;
    Thread[] t = new Thread[NUM];
    for (int i = 0; i < t.Length; i++)
        t[i] = new Thread(new ThreadStart(DoWork));
    foreach (Thread u in t)
        u.Start();
}
static void DoWork()
{
    while (true)
    {
        // Perform scanning work
        // If some conditions are satisfied,
        // it will perform some actions
    }
}

The question is hard to answer without knowing what's inside the while(true) loop. If you start 10 threads and each one spins without any waits (using something like AutoResetEvent.WaitOne(), Thread.Sleep(), etc.), the CPU will be consumed, since you are asking for it.
To improve overall performance, threads shouldn't spin unless it is absolutely necessary, and in most cases it isn't. Threads should do their work and then, if there isn't more work to do, go to sleep. They should be woken up only when they have more work items to process. If your thread runs all the time, checking conditions that are met only from time to time, and you have a mechanism to inform the thread when the conditions are true, then the spinning is a waste of CPU cycles.
Conceptually, a thread should work this way:
while (true)
{
    // Wait until the thread has something meaningful to do,
    // for instance by calling AutoResetEvent.WaitOne().
    // Do the meaningful work here.
}
If you write your thread code this way, a waiting thread doesn't consume CPU, so the system is not overworked and other threads/processes can do their work.
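A minimal sketch of such a loop, assuming a simple in-memory queue of integers as the "work" (the SleepingWorker class, its queue and its stop flag are made up for illustration):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical example; the names are not from the question.
public static class SleepingWorker
{
    static readonly ConcurrentQueue<int> Work = new ConcurrentQueue<int>();
    static readonly AutoResetEvent WorkAvailable = new AutoResetEvent(false);
    static volatile bool _stop;
    public static int Processed;

    static void WorkerLoop()
    {
        while (true)
        {
            // Blocks without consuming CPU until someone calls Set().
            WorkAvailable.WaitOne();

            // Drain everything that was queued while we were asleep.
            while (Work.TryDequeue(out int item))
                Interlocked.Add(ref Processed, item);

            if (_stop) return;
        }
    }

    public static void Main()
    {
        var worker = new Thread(WorkerLoop);
        worker.Start();

        for (int i = 1; i <= 3; i++)
        {
            Work.Enqueue(i);
            WorkAvailable.Set();   // wake the worker
        }

        _stop = true;              // ask the worker to exit...
        WorkAvailable.Set();       // ...and wake it so it notices
        worker.Join();

        Console.WriteLine(Processed); // prints 6
    }
}
```

The inner drain loop matters: an AutoResetEvent does not count signals, so several Set() calls may wake the worker only once.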
The principal question here is whether you have some kind of notification mechanism that allows your thread to wake up when a new work item arrives. For instance, most IO operations like TCP sockets, HTTP communication, reading from files, etc. support asynchronous communication that allows your thread to go to sleep and wake up only when new data arrives.
On the other hand, if you don't have such a mechanism, for instance because you are using a 3rd party library that doesn't notify you when something meaningful happens, you have to do some kind of spinning. But even in this case, the question is how often you need to check whether the conditions are met and there is work to do.
If, let's say, you need to check only once per second whether the conditions are met, add Thread.Sleep(1000) calls to your thread code. This will greatly improve overall performance. Even Thread.Sleep(0) is much better than wait-less spinning.
One important note here. In modern C#, threads should not be used as your primary asynchronous programming mechanism. Task-based asynchronous programming using the Task class is much easier to implement, especially if you are using the async/await C# keywords, and in most cases leads to better performance.
Tasks use threads internally, creating new ones if there are too many work items to process with the current number of threads and releasing threads if there isn't enough work to do. An optimal number of threads reduces overall memory consumption.
These days virtually all standard .NET APIs support task-based asynchronous programming, so it should be your primary tool for achieving parallel execution of your code.
Here is some example of task based asynchronous programming.
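As a rough sketch of what the task-based version of a polling loop can look like (AsyncPolling and PollAsync are hypothetical names, and the condition delegate stands in for the "scanning work"; this is an illustration, not the example originally linked):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example of polling without blocking a thread between checks.
public static class AsyncPolling
{
    // Checks the condition at a fixed interval and returns how many checks were made.
    public static async Task<int> PollAsync(
        Func<bool> condition, TimeSpan interval, CancellationToken token)
    {
        int checks = 0;
        while (!token.IsCancellationRequested)
        {
            checks++;
            if (condition()) break;

            // No thread is held during the delay. If the token is cancelled
            // while waiting, Task.Delay throws TaskCanceledException.
            await Task.Delay(interval, token);
        }
        return checks;
    }

    public static async Task Main()
    {
        int calls = 0;
        int checks = await PollAsync(
            () => ++calls >= 3,                 // becomes true on the third check
            TimeSpan.FromMilliseconds(50),
            CancellationToken.None);

        Console.WriteLine(checks); // prints 3
    }
}
```

Unlike Thread.Sleep(1000), the awaited delay releases the thread back to the pool, so many such loops can share a handful of threads.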

Manage many repetitive, CPU intensive tasks, running parallelly?

I need to constantly perform 20 repetitive, CPU-intensive calculations as fast as possible. So there are 20 tasks, each containing a method looped in:
while(!token.IsCancellationRequested)
to repeat them as fast as possible. All calculations are performed at the same time. Unfortunately this makes the program unresponsive, so I added:
await Task.Delay(15);
At this point the program doesn't hang, but adding a delay is not the correct approach and it unnecessarily slows down the calculations. It is a WPF program without MVVM. What approach would you suggest to keep all 20 tasks working at the same time? Each of them will be repeated as soon as it finishes. I would like to keep CPU utilisation (on all cores) at or near maximum to ensure the best efficiency.
EDIT:
There are 20 controls in which the user adjusts some parameters. Calculations are done in:
private async Task Calculate()
{
    Task task001 = null;
    task001 = Task.Run(async () =>
    {
        while (!CTSFor_task001.IsCancellationRequested)
        {
            await Task.Delay(15);
            await CPUIntensiveMethod();
        }
    }, CTSFor_task001.Token);
}
Each control is independent. The calculations are 100% CPU-bound, with no I/O activity (all values come from variables). During the calculations, the values of some UI items are changed:
this.Dispatcher.BeginInvoke(new Action(() =>
{
    this.lbl_001.Content = "someString";
}));

Let me just write the whole thing as an answer. You're confusing two related, but ultimately separate concepts (thankfully - that's why you can benefit from the distinction). Note that those are my definitions of the concepts - you'll hear tons of different names for the same things and vice versa.
Asynchronicity is about breaking the imposed synchronicity of operations (ie. op 1 waits for op 2, which waits for op 3, which waits for op 4...). For me, this is the more general concept, but nowadays it's more commonly used to mean what I'd call "inherent asynchronicity" - ie. the algorithm itself is asynchronous, and we're only using synchronous programming because we have to (and thanks to await and async, we don't have to anymore, yay!).
The key thought here is waiting. I can't do anything on the CPU, because I'm waiting for the result of an I/O operation. This kind of asynchronous programming is based on the thought that asynchronous operations are almost CPU free - they are I/O bound, not CPU-bound.
Parallelism is a special kind of the general asynchronicity, in which the operations don't primarily wait for one another. In other words, I'm not waiting, I'm working. If I have four CPU cores, I can ideally use four computing threads for this kind of processing - in an ideal world, my algorithm will scale linearly with the number of available cores.
With asynchronicity (waiting), using more threads will improve the apparent speed regardless of the number of the available logical cores. This is because 99% of the time, the code doesn't actually do any work, it's simply waiting.
With parallelism (working), using more threads is directly tied to the number of available work cores.
The lines blur a lot. That's because of things you may not even know are happening. For example, the CPU (and the computer as a whole) is incredibly asynchronous on its own. The apparent synchronicity it shows is only there to allow you to write code synchronously; all the optimizations and asynchronicity are limited by the fact that on output, everything is synchronous again. If the CPU had to wait for data from memory every time you do i++, it wouldn't matter if your CPU was operating at 3 GHz or 100 MHz. Your awesome 3 GHz CPU would sit there idle 99% of the time.
With that said, your calculation tasks are CPU-bound. They should be executed using parallelism, because they are doing work. On the other hand, the UI is I/O bound, and it should be using asynchronous code.
In reality, all your async Calculate method does is that it masks the fact that it's not actually inherently asynchronous. Instead, you want to run it asynchronously to the I/O.
In other words, it's not the Calculate method that's asynchronous. It's the UI that wants this to run asynchronously to itself. Remove all that Task.Run clutter from there, it doesn't belong.
What to do next? That depends on your use case. Basically, there are two scenarios:
You want the tasks to always run, always in the background, from start to end. In that case, simply create a thread for each of them, and don't use Task at all. You might also want to explore some options like a producer-consumer queue etc., to optimize the actual run-time of the different possible calculation tasks. The actual implementation is quite tightly bound to what you're actually processing.
Or, you want to start the task on a UI action, and then work with the resulting values back in the UI method that started them when the results are ready. In that case, await finally comes into play:
private async void btn_Click(object sender, EventArgs e)
{
    // Run the CPU-bound calculation on a thread-pool thread; the UI thread stays free
    var result = await Task.Run(Calculate);
    // Do some (little) work with the result once we get it
    tbxResult.Text = result;
}
The async keyword actually has no place in your calculation code at all; it only belongs on the UI handler that awaits the result.
Hope this is more clear now, feel free to ask more questions.

So what you actually seek is a clarification of a good practice to maximize performance while keeping the UI responsive. As Luaan clarified, the async and await sections in your proposal will not benefit your problem, and Task.Run is not suited for your work; using threads is a better approach.
Define an array of Threads to run one on each logical processor. Distribute your task data between them and control your 20 repetitive calculations via BufferBlock provided in TPL DataFlow library.
To keep UI responsive, I suggest two approaches:
Your calculations demand many frequent UI updates: Put their required update information in a queue and update them in Timer event.
Your calculations demand scarce UI updates: Update UI with an invocation method like Control.BeginInvoke

As @Luaan says, I would strongly recommend reading up on async/await, the key point being that it doesn't introduce any parallelism.
I think what you're trying to do is something like the simple example below, where you kick off CPUIntensiveMethod on the thread pool and await its completion. await returns control from the Calculate method (allowing the UI thread to continue working) until the task completes, at which point it continues with the while loop.
private async Task Calculate()
{
    while (!CTSFor_task001.IsCancellationRequested)
    {
        await Task.Run(CPUIntensiveMethod);
    }
}

Sending messages to Tasks

I am writing a program that shows the Mandelbrot set depending on some conditions provided by the user. As the calculation takes a long time (more than 500 ms), I have decided to use more than one thread. Without any previous experience, I have managed to do it using the classes in System.Threading.Tasks, which works just fine. The only thing that I don't like is that every time the Mandelbrot set is generated, the threads are created and then destroyed.
This is an example of how it works. It creates the threads (Tasks) every time the method is called.
for (int i = 0; i < maxThreads; i++) {
    int a = i;
    tasks[a] = Task.Factory.StartNew(() => generateSector(a));
}
I don't really know how that affects performance, but it looks like creating and destroying threads is expensive, and that it would be more efficient to have the threads ready and waiting for a trigger message, and have them go back to that waiting state when they are done. Maybe the following example code is useful for understanding this idea.
for (int i = 0; i < maxThreads; i++)
    tasks[i].sendMessage("Start"); // Tells the running thread to begin its work
So each thread would execute an infinite loop in which it waits until it is required to do calculations. Then it would go back to waiting. Something like this:
// Inside the method that a thread executes
while (true) {
    Wait();      // Waits for the start signal
    calculate(); // Do some calculations
}                // Go back to waiting
Would that be more efficient? Is there any way to do that?

Leave your code as it is.
1) Tasks use ThreadPool threads, so there is no problem
2) "I don't know really how that affects performance" - this is where you should start. Never optimize before measuring. Do you have performance issues? Is your code running slow? I guess no, so you should not be bothered.

When you use Task.Factory.StartNew(...), you are not necessarily creating and destroying threads. The task library uses a ThreadPool to do this, so you don't need to manage it yourself, like you would if you created new Thread()s yourself.

It sounds like you're trying to use a set of threads and set up a system for scheduling work to run on those threads. This is a great idea, but in fact it's such a great idea that it's built into the .NET Framework and you don't need to build it yourself. This is exactly what Tasks are made for.
Tasks are a relatively lightweight abstraction over the thread pool, which is managed by the .NET runtime. Threads are an operating-system construct that is relatively heavy, and it's somewhat expensive to start, stop, and context-switch between threads. When you create a Task, it is scheduled to execute on the next available thread in the pool, and the .NET runtime will automatically increase and decrease the size of the pool based on whether there's work getting queued up and waiting for a thread. You can customize the minimum and maximum thread counts if you need to, but usually this is not necessary.
So by simply creating short-lived Tasks that exist for the lifetime of the single unit of work, they're already going to have your work be run on a managed collection of actual threads.
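To make that concrete, here is a small sketch of the "one short-lived Task per sector" approach. The GenerateSector body is a placeholder (the real method from the question draws part of the Mandelbrot set), and SectorDemo is a made-up name:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class SectorDemo
{
    // Placeholder for the generateSector method from the question.
    static int GenerateSector(int sector) => sector * sector;

    public static int[] GenerateAll(int sectors)
    {
        var tasks = new Task<int>[sectors];
        for (int i = 0; i < sectors; i++)
        {
            int a = i; // copy the loop variable so each lambda captures its own value
            // Queues the work to the thread pool; no thread is created per call.
            tasks[a] = Task.Run(() => GenerateSector(a));
        }

        Task.WaitAll(tasks); // block until every sector is done
        return tasks.Select(t => t.Result).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", GenerateAll(4))); // prints 0,1,4,9
    }
}
```

Calling GenerateAll repeatedly reuses the pool's threads, so there is no per-call thread creation to optimize away.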

Best way to limit the number of active Tasks running via the Parallel Task Library

Consider a queue holding a lot of jobs that need processing. The limitations of the queue are that you can only get 1 job at a time and there is no way of knowing how many jobs there are. The jobs take 10 s to complete and involve a lot of waiting for responses from web services, so they are not CPU-bound.
If I use something like this
while (true)
{
var job = Queue.PopJob();
if (job == null)
break;
Task.Factory.StartNew(job.Execute);
}
Then it will furiously pop jobs from the queue much faster than it can complete them, run out of memory and fall on its ass. >.<
I can't use (I don't think) ParallelOptions.MaxDegreeOfParallelism because I can't use Parallel.Invoke or Parallel.ForEach.
3 alternatives I've found:
1) Replace Task.Factory.StartNew with
Task task = new Task(job.Execute, TaskCreationOptions.LongRunning);
task.Start();
which seems to somewhat solve the problem, but I am not clear exactly what this is doing and whether it is the best method.
2) Create a custom task scheduler that limits the degree of concurrency.
3) Use something like BlockingCollection to add jobs to a collection when started and remove them when finished, to limit the number that can be running.
With #1 I've got to trust that the right decision is made automatically; with #2/#3 I've got to work out the max number of tasks that can be running myself.
Have I understood this correctly - which is the better way, or is there another way?
EDIT - This is what I've come up with from the answers below, producer-consumer pattern.
As well as overall throughput, the aim was not to dequeue jobs faster than they could be processed and not to have multiple threads polling the queue (not shown here, but that's a non-blocking op and will lead to huge transaction costs if polled at high frequency from multiple places).
// BlockingCollection<Job>(1) blocks on Add when it already holds 1 job (no
// point in being greedy!) and blocks on Take when it is empty.
var jobs = new BlockingCollection<Job>(1);
// Set up a number of consumer threads.
// Determine MAX_CONSUMER_THREADS empirically; with a 4-core CPU and jobs that
// spend 50% of their time blocked waiting on IO, it will likely be 8.
for (int numConsumers = 0; numConsumers < MAX_CONSUMER_THREADS; numConsumers++)
{
    Thread consumer = new Thread(() =>
    {
        // GetConsumingEnumerable() blocks while the collection is empty and ends
        // once CompleteAdding() has been called and the last job has been taken.
        // (A bare Take() can throw InvalidOperationException at that point.)
        foreach (var job in jobs.GetConsumingEnumerable())
        {
            job.Execute();
        }
    });
    consumer.Start();
}
// Producer: take items off the queue and put them in the blocking collection, ready for processing
while (true)
{
    var job = Queue.PopJob();
    if (job != null)
        jobs.Add(job);
    else
    {
        jobs.CompleteAdding();
        // May need to wait for running jobs to finish
        break;
    }
}

I just gave an answer which is very applicable to this question.
Basically, the TPL Task class is made to schedule CPU-bound work. It is not made for blocking work.
You are working with a resource that is not CPU: waiting for service replies. This means the TPL will mismanage your resource, because it assumes CPU-boundedness to a certain degree.
Manage the resources yourself: start a fixed number of threads or LongRunning tasks (which is basically the same). Decide on the number of threads empirically.
You can't put unreliable systems into production. For that reason, I recommend #1 but throttled. Don't create as many threads as there are work items. Create as many threads as are needed to saturate the remote service. Write yourself a helper function which spawns N threads and uses them to process M work items. You get totally predictable and reliable results that way.
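Such a helper could look roughly like the sketch below. FixedWorkers.Process is a hypothetical name and signature, and the sketch assumes the handler does not throw (an unhandled exception on a worker thread would take the process down):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper: N dedicated threads process M work items.
public static class FixedWorkers
{
    public static void Process<T>(IEnumerable<T> items, int threadCount, Action<T> handler)
    {
        // Bounded capacity: the producer blocks instead of reading far ahead of the workers.
        using (var queue = new BlockingCollection<T>(threadCount))
        {
            var workers = new List<Thread>();
            for (int i = 0; i < threadCount; i++)
            {
                var worker = new Thread(() =>
                {
                    // Blocks while empty; ends after CompleteAdding() and the last item.
                    foreach (var item in queue.GetConsumingEnumerable())
                        handler(item);
                });
                worker.Start();
                workers.Add(worker);
            }

            foreach (var item in items)
                queue.Add(item);      // blocks while the queue is full
            queue.CompleteAdding();   // tells the workers no more items are coming

            foreach (var worker in workers)
                worker.Join();        // wait for in-flight items to finish
        }
    }

    public static void Main()
    {
        int sum = 0;
        Process(new[] { 1, 2, 3, 4, 5 }, 2, n => Interlocked.Add(ref sum, n));
        Console.WriteLine(sum); // prints 15
    }
}
```

The thread count is the only tuning knob, which is what makes the behaviour predictable: at most threadCount items are in progress, and at most threadCount more are buffered.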

Potential flow splits and continuations caused by await, later on in your code or in a 3rd party library, won't play nicely with long running tasks (or threads), so don't bother using long running tasks. In the async/await world, they're useless. More details here.
You can call ThreadPool.SetMaxThreads but before you make this call, make sure you set the minimum number of threads with ThreadPool.SetMinThreads, using values below or equal to the max ones. And by the way, the MSDN documentation is wrong. You CAN go below the number of cores on your machine with those method calls, at least in .NET 4.5 and 4.6 where I used this technique to reduce the processing power of a memory limited 32 bit service.
If, however, you don't wish to restrict the whole app but just the processing part of it, a custom task scheduler will do the job. A long time ago, MS released samples with several custom task schedulers, including a LimitedConcurrencyLevelTaskScheduler. Spawn the main processing task manually with Task.Factory.StartNew, providing the custom task scheduler, and every other task spawned by it will use it, including async/await and even Task.Yield, used for achieving asynchrony early on in an async method.
But for your particular case, both solutions won't stop exhausting your queue of jobs before completing them. That might not be desirable, depending on the implementation and purpose of that queue of yours. They are more like "fire a bunch of tasks and let the scheduler find the time to execute them" type of solutions. So perhaps something a bit more appropriate here could be a stricter method of control over the execution of the jobs via semaphores. The code would look like this:
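var semaphore = new SemaphoreSlim(max_concurrent_jobs);
while(...){
    var job = Queue.PopJob();
    semaphore.Wait(); // blocks while max_concurrent_jobs jobs are already running
    ProcessJobAsync(job);
}
async Task ProcessJobAsync(Job job){
    await Task.Yield(); // return to the loop right away; the rest runs as a continuation
    try {
        ... Process the job here...
    }
    finally {
        semaphore.Release(); // free the slot even if the job throws
    }
}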
There's more than one way to skin a cat. Use what you believe is appropriate.

Microsoft has a very cool library called DataFlow which does exactly what you want (and much more). Details here.
You should use the ActionBlock class and set the MaxDegreeOfParallelism of the ExecutionDataflowBlockOptions object. ActionBlock plays nicely with async/await, so even when your external calls are awaited, no new jobs will begin processing.
ExecutionDataflowBlockOptions actionBlockOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10
};
this.sendToAzureActionBlock = new ActionBlock<List<Item>>(async items => await ProcessItems(items),
    actionBlockOptions);
...
this.sendToAzureActionBlock.Post(itemsToProcess);

The problem here doesn't seem to be too many running Tasks, it's too many scheduled Tasks. Your code will try to schedule as many Tasks as it can, no matter how fast they are executed. And if you have too many jobs, this means you will get OOM.
Because of this, none of your proposed solutions will actually solve your problem. If it seems that simply specifying LongRunning solves your problem, then that's most likely because creating a new Thread (which is what LongRunning does) takes some time, which effectively throttles getting new jobs. So, this solution only works by accident, and will most likely lead to other problems later on.
Regarding the solution, I mostly agree with usr: the simplest solution that works reasonably well is to create a fixed number of LongRunning tasks and have one loop that calls Queue.PopJob() (protected by a lock if that method is not thread-safe) and Execute()s the job.
UPDATE: After some more thinking, I realized the following attempt will most likely behave terribly. Use it only if you're really sure it will work well for you.
But the TPL tries to figure out the best degree of parallelism, even for IO-bound Tasks. So, you might try to use that to your advantage. Long Tasks won't work here, because from the point of view of TPL, it seems like no work is done and it will start new Tasks over and over. What you can do instead is to start a new Task at the end of each Task. This way, TPL will know what's going on and its algorithm may work well. Also, to let the TPL decide the degree of parallelism, at the start of a Task that is first in its line, start another line of Tasks.
This algorithm may work well. But it's also possible that the TPL will make a bad decision regarding the degree of parallelism, I haven't actually tried anything like this.
In code, it would look like this:
void ProcessJobs(bool isFirst)
{
    var job = Queue.PopJob(); // assumes PopJob() is thread-safe
    if (job == null)
        return;
    if (isFirst)
        Task.Factory.StartNew(() => ProcessJobs(true)); // start another line of Tasks
    job.Execute();
    Task.Factory.StartNew(() => ProcessJobs(false)); // continue this line
}
And start it with
Task.Factory.StartNew(() => ProcessJobs(true));

TaskCreationOptions.LongRunning is useful for blocking tasks, and using it here is legitimate. It suggests to the scheduler that it should dedicate a thread to the task. The scheduler itself tries to keep the number of threads at the same level as the number of CPU cores to avoid excessive context switching.
This is well described in Threading in C# by Joseph Albahari.

I use a message queue/mailbox mechanism to achieve this. It's akin to the actor model. I have a class that has a MailBox. I call this class my "worker." It can receive messages. Those messages are queued and they, essentially, define tasks that I want the worker to run. The worker will use Task.Wait() for its Task to finish before dequeueing the next message and starting the next task.
By limiting the number of workers I have, I am able to limit the number of concurrent threads/tasks that are being run.
This is outlined, with source code, in my blog post on a distributed compute engine. If you look at the code for IActor and the WorkerNode, I hope it makes sense.
https://long2know.com/2016/08/creating-a-distributed-computing-engine-with-the-actor-model-and-net-core/

Launching multiple tasks from a WCF service

I need to optimize a WCF service... it's quite a complex thing. My problem this time has to do with tasks (Task Parallel Library, .NET 4.0). What happens is that I launch several tasks when the service is invoked (using Task.Factory.StartNew) and then wait for them to finish:
Task.WaitAll(task1, task2, task3, task4, task5, task6);
Ok... what I see, and don't like, is that on the first call (sometimes the first 2-3 calls, if made quickly one after another), the final task starts much later than the others (I am looking at a case where it started 0.5 seconds after the others). I tried calling
ThreadPool.SetMinThreads(12*Environment.ProcessorCount, 20);
at the beginning of my service, but it doesn't seem to help.
The tasks are all database-related: I'm reading from multiple databases and it has to take as little time as possible.
Any idea why the last task is taking so long? Is there something I can do about it?
Alternatively, should I use the thread pool directly? As it happens, in one case I'm looking at, one task had already ended before the last one started; I would have saved 0.2 seconds if I had reused that thread instead of waiting for a new one to be created. However, I cannot be sure that that task will always end so quickly, so I can't put both requests in the same task.
[Edit] The OS is Windows Server 2003, so there should be no connection limit. Also, it is hosted in IIS. I don't know if I should create regular threads or use the thread pool; which is the preferred option?
[Edit] I've also tried using Task.Factory.StartNew(action, TaskCreationOptions.LongRunning); - it doesn't help, the last task still starts much later (around half a second later) than the rest.
[Edit] MSDN says:
"The thread pool has a built-in delay (half a second in the .NET Framework version 2.0) before starting new idle threads. If your application periodically starts many tasks in a short time, a small increase in the number of idle threads can produce a significant increase in throughput. Setting the number of idle threads too high consumes system resources needlessly."
However, as I said, I'm already calling SetMinThreads and it doesn't help.

I have had problems myself with delays in thread startup when using the (.Net 4.0) Task-object. So for time-critical stuff I now use dedicated threads (... again, as that is what I was doing before .Net 4.0.)
The purpose of a thread pool is to avoid the operating system cost of starting and stopping threads. The threads are simply being reused. This is a common model found in, for example, internet servers. The advantage is that they can respond quicker.
I've written many applications where I implement my own thread pool by having dedicated threads pick up tasks from a task queue. Note, however, that this most often requires locking, which can cause delays/bottlenecks. This depends on your design: if the tasks are small, there will be a lot of locking, and it might be faster to trade some CPU for less locking: http://www.boyet.com/Articles/LockfreeStack.html
SmartThreadPool is a replacement/extension of the .Net thread pool. As you can see in this link it has a nice GUI to do some testing: http://www.codeproject.com/KB/threads/smartthreadpool.aspx
In the end it depends on what you need, but for high performance I recommend implementing your own thread pool. If you experience a lot of thread idling then it could be beneficial to increase the number of threads (beyond the recommended cpucount*2). This is actually how HyperThreading works inside the CPU - using "idle" time while doing operations to do other operations.
Note that .Net has a built-in limit of 25 threads per process (ie. for all WCF-calls you receive simultaneously). This limit is independent and overrides the ThreadPool setting. It can be increased, but it requires some magic: http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=201

Following from my prior question (yep, should have been a Q against original message - apologies):
Why do you feel that creating 12 threads for each processor core in your machine will in some way speed-up your server's ability to create worker threads? All you're doing is slowing your server down!
As per the MSDN docs: "You can use the SetMinThreads method to increase the minimum number of threads. However, unnecessarily increasing these values can cause performance problems. If too many tasks start at the same time, all of them might appear to be slow. In most cases, the thread pool will perform better with its own algorithm for allocating threads. Reducing the minimum to less than the number of processors can also hurt performance."
Issues like this are usually caused by bumping into limits or contention on a shared resource.
In your case, I am guessing that your last task(s) is/are blocking while they wait for a connection to the DB server to come available or for the DB to respond. Remember - if your invocation kicks off 5-6 other tasks then your machine is going to have to create and open numerous DB connections and is going to kick the DB with, potentially, a lot of work. If your WCF server and/or your DB server are cold, then your first few invocations are going to be slower until the machine's caches etc., are populated.
Have you tried adding a little tracing/logging using the stopwatch to time how long it takes for your tasks to connect to the DB server and then execute their operations?
You may find that reducing the number of concurrent tasks you kick off actually speeds things up. Try spawning 3 tasks at a time, waiting for them to complete and then spawn the next 3.
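A rough sketch of both suggestions together, timing each task with a Stopwatch and running them three at a time. The BatchedQueries class is hypothetical and the Action delegates stand in for your real database calls:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; the "queries" are placeholders for real DB work.
public static class BatchedQueries
{
    // Runs the queries batchSize at a time and returns each one's duration in ms.
    public static long[] RunInBatches(IList<Action> queries, int batchSize)
    {
        var timings = new long[queries.Count];
        for (int start = 0; start < queries.Count; start += batchSize)
        {
            var batch = new List<Task>();
            int end = Math.Min(start + batchSize, queries.Count);
            for (int i = start; i < end; i++)
            {
                int index = i; // each lambda needs its own copy of the index
                batch.Add(Task.Factory.StartNew(() =>
                {
                    var sw = Stopwatch.StartNew();
                    queries[index]();
                    timings[index] = sw.ElapsedMilliseconds;
                }));
            }
            Task.WaitAll(batch.ToArray()); // finish this batch before starting the next
        }
        return timings;
    }

    public static void Main()
    {
        var queries = new List<Action>();
        for (int i = 0; i < 6; i++)
            queries.Add(() => Thread.Sleep(20)); // simulated database round trip

        long[] timings = RunInBatches(queries, 3);
        for (int i = 0; i < timings.Length; i++)
            Console.WriteLine("task {0}: {1} ms", i + 1, timings[i]);
    }
}
```

If the per-task timings are small but the total is large, the delay is in scheduling or thread startup; if one task's own timing is large, look at the connection or the query instead.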

When you call Task.Factory.StartNew, it uses a TaskScheduler to map those tasks into actual work items.
In your case, it sounds like one of your Tasks is delaying occasionally while the OS spins up a new Thread for the work item. You could, potentially, build a custom TaskScheduler which already contained six threads in a wait state, and explicitly used them for these six tasks. This would allow you to have complete control over how those initial tasks were created and started.
That being said, I suspect there is something else at play here... You mentioned that using TaskCreationOptions.LongRunning demonstrates the same behavior. This suggests that there is some other factor at play causing this half second delay. The reason I suspect this is due to the nature of TaskCreationOptions.LongRunning - when using the default TaskScheduler (LongRunning is a hint used by the TaskScheduler class), starting a task with TaskCreationOptions.LongRunning actually creates an entirely new (non-ThreadPool) thread for that Task. If creating 6 tasks, all with TaskCreationOptions.LongRunning, demonstrates the same behavior, you've pretty much guaranteed that the problem is NOT the default TaskScheduler, since this is going to always spin up 6 threads manually.
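You can check this yourself with a few lines. The sketch below (LongRunningCheck is a made-up name) asks each task whether it is running on a thread-pool thread; with the default scheduler I would expect the LongRunning task to report false:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LongRunningCheck
{
    // Starts a task with the given options on the default scheduler and
    // reports whether it executed on a thread-pool thread.
    public static bool RunsOnPoolThread(TaskCreationOptions options)
    {
        return Task.Factory.StartNew(
            () => Thread.CurrentThread.IsThreadPoolThread,
            CancellationToken.None,
            options,
            TaskScheduler.Default).Result;
    }

    public static void Main()
    {
        Console.WriteLine("None:        " + RunsOnPoolThread(TaskCreationOptions.None));
        Console.WriteLine("LongRunning: " + RunsOnPoolThread(TaskCreationOptions.LongRunning));
    }
}
```

Since a LongRunning task gets its own thread, the thread pool's thread-injection delay should not apply to it, which is why seeing the same half-second delay points elsewhere.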
I'd recommend running your code through a performance profiler, and potentially the Concurrency Visualizer in VS 2010. This should help you determine exactly what is causing the half second delay.

What is the OS? If you are not running a server version of Windows, there is a connection limit. Your many threads are probably being serialized because of the connection limit.
Also, I have not used the task parallel library yet, but my limited experience is that new threads are cheap to make in the context of networking.

These articles might explain the problem you're having:
http://blogs.msdn.com/b/wenlong/archive/2010/02/11/why-are-wcf-responses-slow-and-setminthreads-does-not-work.aspx
http://blogs.msdn.com/b/wenlong/archive/2010/02/11/why-does-wcf-become-slow-after-being-idle-for-15-seconds.aspx
Seeing as you're using .NET 4, the first article probably doesn't apply, but as the second article points out, the ThreadPool terminates idle threads after 15 seconds, which might explain the problem you're having. It also offers a simple (though a little hacky) solution to get around it.
Whether or not you should be using the ThreadPool directly wouldn't make any difference as I suspect the task library is using it for you underneath anyway.
One third-party library we have been using for a while might help you here - Smart Thread Pool. You still get the same benefits of using the task libraries, in that you can have the return values from the threads and get any exception information from them too.
Also, you can instantiate separate thread pools, which helps when multiple places each need their own pool (so that a low-priority process doesn't start eating into the quota of a high-priority one). And you can set the priority of the threads in the pool too, which you can't do with the standard ThreadPool, where all the threads are background threads.
You can find plenty of info on the codeplex page, I've also got a post which highlights some of the key differences:
http://theburningmonk.com/2010/03/threading-introducing-smartthreadpool/
Just on a side note, for tasks like the one you've mentioned, which might take some time to return, you probably shouldn't be using the threadpool anyway. It's recommended that we should avoid using the threadpool for any blocking tasks like that because it hogs up the threadpool which is used by all sorts of things by the framework classes, like handling timer events, etc. etc. (not to mention handling incoming WCF requests!). I feel like I'm spamming here but here's some of the info I've gathered around the use of the threadpool and some useful links at the bottom:
http://theburningmonk.com/2010/03/threading-using-the-threadpool-vs-creating-your-own-threads/
Well, hope this helps!
