Using thousands of Tasks with a timeout efficiently

I am implementing a library L that communicates via sockets with another application A.
The basic workflow is as follows:
L connects to A.
L sends ~50,000 pieces of information I to A and
creates a task T for every I that is sent out.
L listens for incoming results from A and, once results arrive, uses a
TaskCompletionSource to set the results of the tasks T.
L creates a task T2 with a set timeout (Task.WhenAny(T, Task.Delay(xx))).
L uses Task.WhenAll(T2) to wait for the timeout or the result of every piece of information sent.
Managing the underlying data structure is no problem at all. The main problem is that assembling the "main" Task.WhenAll(T2) costs around 5-6 seconds on my computer with ca. 50,000 entries (creating 50,000*2+1 tasks).
I can't think of a more lightweight way that accomplishes the same, however. It should use all available cores, be non-blocking, and support timeouts as well.
Is there a way to accomplish the same using the Parallel or ThreadPool classes that improves performance?
EDIT:
Code showing how the basic setup is:
https://dotnetfiddle.net/gIq2DP

Start a total of n long-running tasks, where n is the number of cores on your machine, so that each task can run on its own core. It would be a waste to create a new task for each of the 50K pieces of I that you want to send. Instead, design the tasks to accept I together with the socket information that says where it is to be sent.
Create a BlockingCollection<Tuple<I, SocketInfo>>. Start one task to populate this blocking collection. The n long-running tasks that you created earlier can keep taking tuples of information and destination address and do the work in a loop that ends when the blocking collection is completed and empty.
Timeouts can be set inside the long-running tasks themselves.
This setup keeps your CPU busy with useful work rather than with the "job" of creating 50K tasks.
Since operations that go beyond main memory (like this network operation) are very slow from the CPU's point of view, feel free to set n not just to the number of cores in your machine but even to three times that value. In my code demonstration I have set it equal to the number of cores only.
With the code at the provided link, this is one way...
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq.Expressions;
using System.Net.NetworkInformation;
using System.Threading.Tasks;
namespace TestConsoleApplication
{
public static class Test
{
public static void Main()
{
TaskRunningTest();
}
private static void TaskRunningTest()
{
var s = new Stopwatch();
const int totalInformationChunks = 50000;
var baseProcessorTaskArray = new Task[Environment.ProcessorCount];
var taskFactory = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);
var tcs = new TaskCompletionSource<int>();
var itemsToProcess = new BlockingCollection<Tuple<Information, Address>>(totalInformationChunks);
s.Start();
//Start a new task to populate the "itemsToProcess"
taskFactory.StartNew(() =>
{
// Add Tuples of Information and Address to which this information is to be sent to.
Console.WriteLine("Done initializing all the jobs...");
// Finally signal that you are done by saying..
itemsToProcess.CompleteAdding();
});
//Initializing the base tasks
for (var index = 0; index < baseProcessorTaskArray.Length; index++)
{
var thisIndex = index;
baseProcessorTaskArray[index] = taskFactory.StartNew(() =>
{
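// GetConsumingEnumerable blocks until an item is available and ends only
// once CompleteAdding() has been called and the collection is empty.
// (The previous loop condition could exit before any item was taken.)
foreach (var item in itemsToProcess.GetConsumingEnumerable())
{
//Process the item
tcs.TrySetResult(thisIndex);
}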
});
}
// Need to provide new timeout logic now
// Depending upon what you are trying to achieve with timeout, you can devise out the way
// Wait for the base tasks to completely empty OR
// timeout and then stop the stopwatch.
Task.WaitAll(baseProcessorTaskArray);
s.Stop();
Console.WriteLine(s.ElapsedMilliseconds);
}
private class Address
{
//This class should have the socket information
}
private class Information
{
//This class will have the Information to send
}
}
}

Profiling shows that most time (90%?) is spent in timer setup, expiration and disposal. This seems plausible to me.
Maybe you can create your own super cheap timeout mechanism. Enqueue timeouts into a priority queue ordered by expiration time. Then, run a single timer every 100ms and make that timer expire everything in the priority queue that is due.
The cost of doing this would be one TaskCompletionSource per timeout and some small further processing.
You can even cancel timeouts by removing them from the queue and just dropping the TaskCompletionSource.
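The shared-timer idea above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name CoarseTimeoutQueue and its members are hypothetical, and it assumes .NET 6 or later for PriorityQueue and Environment.TickCount64. Cancelling a pending timeout is not shown, because PriorityQueue has no removal operation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: one shared timer expires every timeout that is due,
// instead of one Task.Delay (and one timer) per request.
public sealed class CoarseTimeoutQueue : IDisposable
{
    private readonly object _gate = new object();
    private readonly PriorityQueue<TaskCompletionSource<bool>, long> _queue =
        new PriorityQueue<TaskCompletionSource<bool>, long>();
    private readonly Timer _timer;

    // tick is the timer resolution, e.g. 100 ms as suggested above.
    public CoarseTimeoutQueue(TimeSpan tick)
    {
        _timer = new Timer(_ => ExpireDue(), null, tick, tick);
    }

    // Returns a task that completes with true once the timeout has elapsed
    // (rounded up to the next timer tick).
    public Task<bool> Delay(TimeSpan timeout)
    {
        var tcs = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        long due = Environment.TickCount64 + (long)timeout.TotalMilliseconds;
        lock (_gate)
        {
            _queue.Enqueue(tcs, due);
        }
        return tcs.Task;
    }

    private void ExpireDue()
    {
        var expired = new List<TaskCompletionSource<bool>>();
        long now = Environment.TickCount64;
        lock (_gate)
        {
            // The queue is ordered by due time, so stop at the first entry
            // that is not due yet.
            while (_queue.TryPeek(out _, out long due) && due <= now)
            {
                expired.Add(_queue.Dequeue());
            }
        }
        // Complete outside the lock so continuations never run under it.
        foreach (var tcs in expired)
        {
            tcs.TrySetResult(true);
        }
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

A caller would then race each result task against the shared queue, for example `await Task.WhenAny(resultTask, timeouts.Delay(TimeSpan.FromSeconds(5)))`, so that 50,000 pending requests share a single timer.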

Related

Why is Parallel.ForEach much faster than AsParallel().ForAll() even though MSDN suggests otherwise?

I've been doing some investigation to see how we can create a multithreaded application that runs through a tree.
To find how this can be implemented in the best way I've created a test application that runs through my C:\ disk and opens all directories.
class Program
{
static void Main(string[] args)
{
//var startDirectory = @"C:\The folder\RecursiveFolder";
var startDirectory = @"C:\";
var w = Stopwatch.StartNew();
ThisIsARecursiveFunction(startDirectory);
Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);
Console.ReadKey();
}
public static void ThisIsARecursiveFunction(String currentDirectory)
{
var lastBit = Path.GetFileName(currentDirectory);
var depth = currentDirectory.Count(t => t == '\\');
//Console.WriteLine(depth + ": " + currentDirectory);
try
{
var children = Directory.GetDirectories(currentDirectory);
//Edit this mode to switch what way of parallelization it should use
int mode = 3;
switch (mode)
{
case 1:
foreach (var child in children)
{
ThisIsARecursiveFunction(child);
}
break;
case 2:
children.AsParallel().ForAll(t =>
{
ThisIsARecursiveFunction(t);
});
break;
case 3:
Parallel.ForEach(children, t =>
{
ThisIsARecursiveFunction(t);
});
break;
default:
break;
}
}
catch (Exception eee)
{
//Exception might occur for directories that can't be accessed.
}
}
}
What I have encountered, however, is that when running this in mode 3 (Parallel.ForEach) the code completes in around 2.5 seconds (yes, I have an SSD ;) ). Running the code without parallelization, it completes in around 8 seconds. And running the code in mode 2 (AsParallel().ForAll()), it takes a near-infinite amount of time.
When checking in process explorer I also encounter a few strange facts:
Mode1 (No Parallelization):
Cpu: ~25%
Threads: 3
Time to complete: ~8 seconds
Mode2 (AsParallel().ForAll()):
Cpu: ~0%
Threads: Increasing by one per second (I find this strange since it seems to be waiting on the other threads to complete or a second timeout.)
Time to complete: 1 second per node so about 3 days???
Mode3 (Parallel.ForEach()):
Cpu: 100%
Threads: At most 29-30
Time to complete: ~2.5 seconds
What I find especially strange is that Parallel.ForEach seems to ignore any parent threads/tasks that are still running, while AsParallel().ForAll() seems to wait for the previous task to complete (which won't happen soon, since all parent tasks are still waiting on their child tasks to complete).
Also what I read on MSDN was: "Prefer ForAll to ForEach When It Is Possible"
Source: http://msdn.microsoft.com/en-us/library/dd997403(v=vs.110).aspx
Does anyone have a clue why this could be?
Edit 1:
As requested by Matthew Watson I've first loaded the tree in memory before looping through it. Now the loading of the tree is done sequentially.
The results however are the same. Unparallelized and Parallel.ForEach now complete the whole tree in about 0.05 seconds while AsParallel().ForAll still only goes around 1 step per second.
Code:
class Program
{
private static DirWithSubDirs RootDir;
static void Main(string[] args)
{
//var startDirectory = @"C:\The folder\RecursiveFolder";
var startDirectory = @"C:\";
Console.WriteLine("Loading file system into memory...");
RootDir = new DirWithSubDirs(startDirectory);
Console.WriteLine("Done");
var w = Stopwatch.StartNew();
ThisIsARecursiveFunctionInMemory(RootDir);
Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);
Console.ReadKey();
}
public static void ThisIsARecursiveFunctionInMemory(DirWithSubDirs currentDirectory)
{
var depth = currentDirectory.Path.Count(t => t == '\\');
Console.WriteLine(depth + ": " + currentDirectory.Path);
var children = currentDirectory.SubDirs;
//Edit this mode to switch what way of parallelization it should use
int mode = 2;
switch (mode)
{
case 1:
foreach (var child in children)
{
ThisIsARecursiveFunctionInMemory(child);
}
break;
case 2:
children.AsParallel().ForAll(t =>
{
ThisIsARecursiveFunctionInMemory(t);
});
break;
case 3:
Parallel.ForEach(children, t =>
{
ThisIsARecursiveFunctionInMemory(t);
});
break;
default:
break;
}
}
}
class DirWithSubDirs
{
public List<DirWithSubDirs> SubDirs = new List<DirWithSubDirs>();
public String Path { get; private set; }
public DirWithSubDirs(String path)
{
this.Path = path;
try
{
SubDirs = Directory.GetDirectories(path).Select(t => new DirWithSubDirs(t)).ToList();
}
catch (Exception eee)
{
//Ignore directories that can't be accessed
}
}
}
Edit 2:
After reading the update on Matthew's comment I've tried to add the following code to the program:
ThreadPool.SetMinThreads(4000, 16);
ThreadPool.SetMaxThreads(4000, 16);
This however does not change how the AsParallel performs. The first 8 steps are still executed in an instant before it slows down to 1 step per second.
(Extra note, I'm currently ignoring the exceptions that occur when I can't access a Directory by the Try Catch block around the Directory.GetDirectories())
Edit 3:
Also, what I'm mainly interested in is the difference between Parallel.ForEach and AsParallel().ForAll(), because to me it's just strange that for some reason the second one creates one thread for every recursion it does, while the first one handles everything in around 30 threads max. (And also why MSDN suggests using AsParallel even though it creates so many threads with a ~1 second timeout.)
Edit 4:
Another strange thing I found out:
When I try to set the MinThreads on the Thread pool above 1023 it seems to ignore the value and scale back to around 8 or 16:
ThreadPool.SetMinThreads(1023, 16);
Still when I use 1023 it does the first 1023 elements very fast followed by going back to the slow pace I've been experiencing all the time.
Note: Also literally more than 1000 threads are now created (compared to 30 for the whole Parallel.ForEach one).
Does this mean Parallel.ForEach is just way smarter in handling tasks?
Some more info, this code prints twice 8 - 8 when you set the value above 1023: (When you set the values to 1023 or lower it prints the correct value)
int threadsMin;
int completionMin;
ThreadPool.GetMinThreads(out threadsMin, out completionMin);
Console.WriteLine("Cur min threads: " + threadsMin + " and the other thing: " + completionMin);
ThreadPool.SetMinThreads(1023, 16);
ThreadPool.SetMaxThreads(1023, 16);
ThreadPool.GetMinThreads(out threadsMin, out completionMin);
Console.WriteLine("Now min threads: " + threadsMin + " and the other thing: " + completionMin);
Edit 5:
At Dean's request I've created another case to manually create tasks:
case 4:
var taskList = new List<Task>();
foreach (var todo in children)
{
var itemTodo = todo;
taskList.Add(Task.Run(() => ThisIsARecursiveFunctionInMemory(itemTodo)));
}
Task.WaitAll(taskList.ToArray());
break;
This is also as fast as the Parallel.ForEach() loop. So we still don't have the answer to why AsParallel().ForAll() is so much slower.

This problem is pretty debuggable, an uncommon luxury when you have problems with threads. Your basic tool here is the Debug > Windows > Threads debugger window. It shows you the active threads and gives you a peek at their stack traces. You'll easily see that, once it gets slow, you have dozens of active threads that are all stuck. Their stack traces all look the same:
mscorlib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout, bool exitContext) + 0x16 bytes
mscorlib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout) + 0x7 bytes
mscorlib.dll!System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) + 0x182 bytes
mscorlib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) + 0x93 bytes
mscorlib.dll!System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler scheduler, bool waitForCompletion) + 0xba bytes
mscorlib.dll!System.Threading.Tasks.Task.RunSynchronously(System.Threading.Tasks.TaskScheduler scheduler) + 0x13 bytes
System.Core.dll!System.Linq.Parallel.SpoolingTask.SpoolForAll<ConsoleApplication1.DirWithSubDirs,int>(System.Linq.Parallel.QueryTaskGroupState groupState, System.Linq.Parallel.PartitionedStream<ConsoleApplication1.DirWithSubDirs,int> partitions, System.Threading.Tasks.TaskScheduler taskScheduler) Line 172 C#
// etc..
Whenever you see something like this, you should immediately think fire-hose problem. Probably the third-most common bug with threads, after races and deadlocks.
Which you can reason out now that you know the cause: the problem with the code is that every thread that completes adds N more threads, where N is the average number of sub-directories in a directory. In effect, the number of threads grows exponentially, and that's always bad. It will only stay in control if N = 1, which of course never happens on a typical disk.
Do beware that, like almost any threading problem, this misbehavior tends to reproduce poorly. The SSD in your machine tends to hide it. So does the RAM in your machine: the program might well complete quickly and trouble-free the second time you run it, since you'll now read from the file system cache instead of the disk, which is very fast. Tinkering with ThreadPool.SetMinThreads() hides it as well, but it cannot fix it. It never fixes any problem, it only hides them. No matter what happens, the exponential number will always overwhelm the set minimum number of threads. You can only hope that the program finishes iterating the drive before that happens. Idle hope for a user with a big drive.
The difference between ParallelEnumerable.ForAll() and Parallel.ForEach() is now perhaps also easily explained. You can tell from the stack trace that ForAll() does something naughty, the RunSynchronously() method blocks until all the threads are completed. Blocking is something threadpool threads should not do, it gums up the thread pool and won't allow it to schedule the processor for another job. And has the effect you observed, the thread pool is quickly overwhelmed with threads that are waiting on the N other threads to complete. Which isn't happening, they are waiting in the pool and are not getting scheduled because there are already so many of them active.
This is a deadlock scenario, a pretty common one, but the threadpool manager has a workaround for it. It watches the active threadpool threads and steps in when they don't complete in a timely manner. It then allows an extra thread to start, one more than the minimum set by SetMinThreads(). But not more than the maximum set by SetMaxThreads(); having too many active tp threads is risky and likely to trigger OOM. This does solve the deadlock, it gets one of the ForAll() calls to complete. But this happens at a very slow rate, the threadpool only does this twice a second. You'll run out of patience before it catches up.
Parallel.ForEach() doesn't have this problem, it doesn't block so doesn't gum up the pool.
Seems to be the solution, but do keep in mind that your program is still fire-hosing the memory of your machine, adding ever more waiting tp threads to the pool. This can crash your program as well, it just isn't as likely because you have a lot of memory and the threadpool doesn't use a lot of it to keep track of a request. Some programmers however accomplish that as well.
The solution is a very simple one, just don't use threading. It is harmful, there is no concurrency when you have only one disk. And it does not like being commandeered by multiple threads. Especially bad on a spindle drive, head seeks are very, very slow. SSDs do it a lot better, it however still takes an easy 50 microseconds, overhead that you just don't want or need. The ideal number of threads to access a disk that you can't otherwise expect to be cached well is always one.
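As a rough sketch of the single-threaded approach recommended above, the traversal can run on one thread with an explicit stack instead of recursion. The class and method names are hypothetical; the only file-system call used is Directory.GetDirectories, as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: walks a directory tree on the calling thread only.
public static class DirectoryWalker
{
    // Counts every directory below root (root itself is not counted).
    public static int CountDirectories(string root)
    {
        int count = 0;
        var pending = new Stack<string>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            string current = pending.Pop();
            string[] children;
            try
            {
                children = Directory.GetDirectories(current);
            }
            catch (UnauthorizedAccessException)
            {
                continue; // protected folder, skip it
            }
            catch (IOException)
            {
                continue; // vanished or otherwise unreadable, skip it
            }

            count += children.Length;
            foreach (string child in children)
            {
                pending.Push(child);
            }
        }
        return count;
    }
}
```

Because only one thread ever touches the disk, there is no thread-pool growth to reason about, and the explicit stack also avoids deep recursion on very deep trees.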

The first thing to note is that you are trying to parallelise an IO-bound operation, which will distort the timings significantly.
The second thing to note is the nature of the parallelised tasks: You are recursively descending a directory tree. If you create multiple threads to do this, each thread is likely to be accessing a different part of the disk simultaneously - which will cause the disk read head to be jumping all over the place and slowing things down considerably.
Try changing your test to create an in-memory tree, and access that with multiple threads instead. Then you will be able to compare the timings properly without the results being distorted beyond all usefulness.
Additionally, you may be creating a great number of threads, and they will (by default) be threadpool threads. Having a great number of threads will actually slow things down when they exceed the number of processor cores.
Also note that when you exceed the thread pool minimum threads (defined by ThreadPool.GetMinThreads()), a delay is introduced by the thread pool manager between each new threadpool thread creation. (I think this is around 0.5s per new thread).
Also, if the number of threads exceeds the value returned by ThreadPool.GetMaxThreads(), the creating thread will block until one of the other threads has exited. I think this is likely to be happening.
You can test this hypothesis by calling ThreadPool.SetMaxThreads() and ThreadPool.SetMinThreads() to increase these values, and see if it makes any difference.
(Finally, note that if you are really trying to recursively descend from C:\, you will almost certainly get an IO exception when it reaches a protected OS folder.)
NOTE: Set the max/min threadpool threads like this:
ThreadPool.SetMinThreads(4000, 16);
ThreadPool.SetMaxThreads(4000, 16);
Follow Up
I have tried your test code with the threadpool thread counts set as described above, with the following results (not run on the whole of my C:\ drive, but on a smaller subset):
Mode 1 took 06.5 seconds.
Mode 2 took 15.7 seconds.
Mode 3 took 16.4 seconds.
This is in line with my expectations; adding a load of threading to do this actually makes it slower than single-threaded, and the two parallel approaches take roughly the same time.
In case anyone else wants to investigate this, here's some deterministic test code (the OP's code is not reproducible because we don't know his directory structure).
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
namespace Demo
{
internal class Program
{
private static DirWithSubDirs RootDir;
private static void Main()
{
Console.WriteLine("Loading file system into memory...");
RootDir = new DirWithSubDirs("Root", 4, 4);
Console.WriteLine("Done");
//ThreadPool.SetMinThreads(4000, 16);
//ThreadPool.SetMaxThreads(4000, 16);
var w = Stopwatch.StartNew();
ThisIsARecursiveFunctionInMemory(RootDir);
Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);
Console.ReadKey();
}
public static void ThisIsARecursiveFunctionInMemory(DirWithSubDirs currentDirectory)
{
var depth = currentDirectory.Path.Count(t => t == '\\');
Console.WriteLine(depth + ": " + currentDirectory.Path);
var children = currentDirectory.SubDirs;
//Edit this mode to switch what way of parallelization it should use
int mode = 3;
switch (mode)
{
case 1:
foreach (var child in children)
{
ThisIsARecursiveFunctionInMemory(child);
}
break;
case 2:
children.AsParallel().ForAll(t =>
{
ThisIsARecursiveFunctionInMemory(t);
});
break;
case 3:
Parallel.ForEach(children, t =>
{
ThisIsARecursiveFunctionInMemory(t);
});
break;
default:
break;
}
}
}
internal class DirWithSubDirs
{
public List<DirWithSubDirs> SubDirs = new List<DirWithSubDirs>();
public String Path { get; private set; }
public DirWithSubDirs(String path, int width, int depth)
{
this.Path = path;
if (depth > 0)
for (int i = 0; i < width; ++i)
SubDirs.Add(new DirWithSubDirs(path + "\\" + i, width, depth - 1));
}
}
}

The Parallel.For and .ForEach methods are implemented internally as equivalent to running iterations in Tasks, e.g. a loop like:
Parallel.For(0, N, i =>
{
DoWork(i);
});
is equivalent to:
var tasks = new List<Task>(N);
for(int i=0; i<N; i++)
{
tasks.Add(Task.Factory.StartNew(state => DoWork((int)state), i));
}
Task.WaitAll(tasks.ToArray());
And from the perspective of every iteration potentially running in parallel with every other iteration, this is an OK mental model, but it is not what happens in reality. Parallel, in fact, does not necessarily use one Task per iteration, as that is significantly more overhead than is necessary. Parallel.ForEach tries to use the minimum number of tasks necessary to complete the loop as fast as possible. It spins up tasks as threads become available to process those tasks, and each of those tasks participates in a management scheme (I think it's called chunking): a task asks for multiple iterations to be done, gets them, processes that work, and then goes back for more. The chunk sizes vary based on the number of tasks participating, the load on the machine, etc.
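To make the chunking idea concrete, here is a small sketch that hands out explicit index ranges with Partitioner.Create, so each task pulls a whole range rather than a single iteration. It illustrates the concept only and is not a description of what Parallel.ForEach does internally; the ChunkingDemo class and SumSquares method are made up for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical demo of range-based chunking.
public static class ChunkingDemo
{
    // Sums i*i for i in [0, n). n must be greater than 0.
    public static long SumSquares(int n)
    {
        long total = 0;

        // Partitioner.Create(0, n) yields Tuple<int, int> ranges [Item1, Item2).
        Parallel.ForEach(Partitioner.Create(0, n), range =>
        {
            // Each task processes its whole chunk locally...
            long local = 0;
            for (int i = range.Item1; i < range.Item2; i++)
            {
                local += (long)i * i;
            }
            // ...and touches shared state only once per chunk.
            Interlocked.Add(ref total, local);
        });

        return total;
    }
}
```

The per-chunk local accumulator is the design point: synchronization cost is paid once per range instead of once per iteration.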
PLINQ’s .AsParallel() has a different implementation, but it ‘can’ still similarly fetch multiple iterations into a temporary store, do the calculations in a thread (but not as a task), and put the query results into a small buffer. (You get something based on ParallelQuery, and then further .Whatever() functions bind to an alternative set of extension methods that provide parallel implementations).
So now that we have a small idea of how these two mechanisms work, I will try to provide an answer to your original question:
So why is .AsParallel() slower than Parallel.ForEach? The reason stems from the following. Tasks (or their equivalent implementation here) do NOT block on I/O-like calls. They ‘await’ and free up the CPU to do something else. But (quoting C# nutshell book): “PLINQ cannot perform I/O-bound work without blocking threads”. The calls are synchronous. They were written with the intention that you increase the degree of parallelism if (and ONLY if) you are doing such things as downloading web pages per task that do not hog CPU time.
And the reason why your function calls are exactly analogous to I/O bound calls is this: One of your threads (call it T) blocks and does nothing until all of its child threads have finished, which can be a slow process here. T itself is not CPU-intensive while it waits for the children to unblock, it is doing nothing but waiting. Hence it is identical to a typical I/O bound function call.

Based on the accepted answer to How exactly does AsParallel work?
.AsParallel.ForAll() casts back to IEnumerable before calling .ForAll()
so it creates 1 new thread + N recursive calls (each of which generates a new thread).

Speed up reverse DNS lookups for large batch of IPs

For analytics purposes, I'd like to perform reverse DNS lookups on large batches of IPs. "Large" meaning, at least tens of thousands per hour. I'm looking for ways to increase the processing rate, i.e. lower the processing time per batch.
Wrapping the async version of Dns.GetHostEntry into await-able tasks has already helped a lot (compared to sequential requests), leading to a throughput of approx. 100-200 IPs/second:
static async Task DoReverseDnsLookups()
{
// in reality, thousands of IPs
var ips = new[] { "173.194.121.9", "173.252.110.27", "98.138.253.109" };
var hosts = new Dictionary<string, string>();
var tasks =
ips.Select(
ip =>
Task.Factory.FromAsync(Dns.BeginGetHostEntry,
(Func<IAsyncResult, IPHostEntry>) Dns.EndGetHostEntry,
ip, null)
.ContinueWith(t =>
hosts[ip] = ((t.Exception == null) && (t.Result != null))
? t.Result.HostName : null));
var start = DateTime.UtcNow;
await Task.WhenAll(tasks);
var end = DateTime.UtcNow;
Console.WriteLine("Resolved {0} IPs in {1}, that's {2}/sec.",
ips.Count(), end - start,
ips.Count() / (end - start).TotalSeconds);
}
Any ideas how to further improve the processing rate?
For instance, is there any way to send a batch of IPs to the DNS server?
Btw, I'm assuming that under the covers, I/O Completion Ports are used by the async methods - correct me if I'm wrong please.

Hello, here are some tips so you can improve:
Cache the queries locally, since this information doesn't usually change for days or even years. This way you don't have to resolve every time.
Most DNS servers will automatically cache the information, so the next time it will resolve pretty fast. Usually the cache lasts 4 hours; at least that is the default on Windows servers. This means that if you run this process as a batch in a short period, it will perform better than if you resolve the addresses several times during the day, allowing the cache to expire.
It is good that you are using task parallelism, but you are still asking the same DNS servers configured on your machine. I think that having two machines using different DNS servers will improve the process.
I hope this helps.

As always, I would suggest using TPL Dataflow's ActionBlock instead of firing all requests at once and waiting for all to complete. Using an ActionBlock with a high MaxDegreeOfParallelism lets the TPL decide for itself how many calls to fire concurrently, which can lead to a better utilization of resources:
var block = new ActionBlock<string>(
async ip =>
{
try
{
var host = (await Dns.GetHostEntryAsync(ip)).HostName;
if (!string.IsNullOrWhiteSpace(host))
{
hosts[ip] = host;
}
}
catch
{
return;
}
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5000});
I would also suggest adding a cache, and making sure you don't resolve the same ip more than once.
When you use .net's Dns class it includes some fallbacks beside DNS (e.g LLMNR), which makes it very slow. If all you need are DNS queries you might want to use a dedicated library like ARSoft.Tools.Net.
P.S: Some remarks about your code sample:
You should be using GetHostEntryAsync instead of FromAsync
The continuation can potentially run on different threads so you should really be using ConcurrentDictionary.
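The caching and ConcurrentDictionary suggestions can be combined in one small sketch. The ReverseDnsCache class is hypothetical; the optional resolver delegate is an assumption added so the lookup can be swapped out (for example for a dedicated DNS library), and the default simply calls Dns.GetHostEntryAsync.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper: resolves each IP at most once, even when many
// callers ask for the same address concurrently.
public sealed class ReverseDnsCache
{
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> _cache =
        new ConcurrentDictionary<string, Lazy<Task<string>>>();
    private readonly Func<string, Task<string>> _resolve;

    // resolve is optional; when omitted, Dns.GetHostEntryAsync is used.
    public ReverseDnsCache(Func<string, Task<string>> resolve = null)
    {
        _resolve = resolve ?? ResolveWithDnsAsync;
    }

    public Task<string> GetHostNameAsync(string ip)
    {
        // Lazy<T> guarantees the resolver runs only once per key, even if
        // GetOrAdd invokes the value factory more than once under contention.
        return _cache.GetOrAdd(
            ip,
            key => new Lazy<Task<string>>(() => _resolve(key))).Value;
    }

    private static async Task<string> ResolveWithDnsAsync(string ip)
    {
        try
        {
            return (await Dns.GetHostEntryAsync(ip)).HostName;
        }
        catch (SocketException)
        {
            return null; // no reverse record, treated as "unknown"
        }
    }
}
```

Failed lookups are cached as null here, which avoids hammering the server for addresses without a reverse record; whether that is desirable depends on how long the cache lives.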

Is a non-blocking, single-threaded, asynchronous web server (like Node.js) possible in .NET?

I was looking at this question, looking for a way to create a single-threaded, event-based nonblocking asynchronous web server in .NET.
This answer looked promising at first, by claiming that the body of the code runs in a single thread.
However, I tested this in C#:
using System;
using System.IO;
using System.Threading;
class Program
{
static void Main()
{
Console.WriteLine(Thread.CurrentThread.ManagedThreadId);
var sc = new SynchronizationContext();
SynchronizationContext.SetSynchronizationContext(sc);
{
var path = Environment.ExpandEnvironmentVariables(
@"%SystemRoot%\Notepad.exe");
var fs = new FileStream(path, FileMode.Open,
FileAccess.Read, FileShare.ReadWrite, 1024 * 4, true);
var bytes = new byte[1024];
fs.BeginRead(bytes, 0, bytes.Length, ar =>
{
sc.Post(dummy =>
{
var res = fs.EndRead(ar);
// Are we in the same thread?
Console.WriteLine(Thread.CurrentThread.ManagedThreadId);
}, null);
}, null);
}
Thread.Sleep(100);
}
}
And the result was:
1
5
So it seems like, contrary to the answer, the thread initiating the read and the thread ending the read are not the same.
So now my question is, how do you achieve a single-threaded, event-based nonblocking asynchronous web server in .NET?

The whole SetSynchronizationContext is a red herring, this is just a mechanism for marshalling, the work still happens in the IO Thread Pool.
What you are asking for is a way to queue and harvest Asynchronous Procedure Calls for all your IO work from the main thread. Many higher level frameworks wrap this kind of functionality, the most famous one being libevent.
There is a great recap on the various options here: Whats the difference between epoll, poll, threadpool?.
.NET already takes care of scaling for you by having a special "IO Thread Pool" that handles IO access when you call the BeginXYZ methods. This IO Thread Pool must have at least 1 thread per processor on the box. See: ThreadPool.SetMaxThreads.
If single threaded app is a critical requirement (for some crazy reason) you could, of course, interop all of this stuff in using DllImport (see an example here)
However it would be a very complex and risky task:
Why don't we support APCs as a completion mechanism? APCs are really not a good general-purpose completion mechanism for user code. Managing the reentrancy introduced by APCs is nearly impossible; any time you block on a lock, for example, some arbitrary I/O completion might take over your thread. It might try to acquire locks of its own, which may introduce lock ordering problems and thus deadlock. Preventing this requires meticulous design, and the ability to make sure that someone else's code will never run during your alertable wait, and vice-versa. This greatly limits the usefulness of APCs.
So, to recap. If you want a single threaded managed process that does all its work using APC and completion ports, you are going to have to hand code it. Building it would be risky and tricky.
If you simply want high-scale networking, you can keep using BeginXYZ and family and rest assured that it will perform well, since it is built on I/O completion ports. You pay a minor price for marshalling work between threads and for the particulars of the .NET implementation.
From: http://msdn.microsoft.com/en-us/magazine/cc300760.aspx
The next step in scaling up the server is to use asynchronous I/O. Asynchronous I/O alleviates the need to create and manage threads. This leads to much simpler code and also is a more efficient I/O model. Asynchronous I/O utilizes callbacks to handle incoming data and connections, which means there are no lists to set up and scan and there is no need to create new worker threads to deal with the pending I/O.
An interesting side fact is that single-threaded is not the fastest way to do async sockets on Windows using completion ports, see: http://doc.sch130.nsc.ru/www.sysinternals.com/ntw2k/info/comport.shtml
The goal of a server is to incur as few context switches as possible by having its threads avoid unnecessary blocking, while at the same time maximizing parallelism by using multiple threads. The ideal is for there to be a thread actively servicing a client request on every processor and for those threads not to block if there are additional requests waiting when they complete a request. For this to work correctly however, there must be a way for the application to activate another thread when one processing a client request blocks on I/O (like when it reads from a file as part of the processing).

What you need is a "message loop" which takes the next task on a queue and executes it. Additionally, every task needs to be coded so that it completes as much work as possible without blocking, and then enqueues additional tasks to pick up any work that has to wait until later. There is nothing magical about this: never use a blocking call and never spawn additional threads.
For example, when processing an HTTP GET, the server can read as much data as is currently available on the socket. If this is not enough data to handle the request, then enqueue a new task to read from the socket again in the future. In the case of a FileStream, you want to set the ReadTimeout on the instance to a low value and be prepared to read fewer bytes than the entire file.
C# 5 actually makes this pattern much simpler. Many people think that the async functionality implies multithreading, but that is not the case. Using async, you can essentially get the task queue I mentioned earlier without ever explicitly managing it.
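The message loop described above can be sketched by hand as follows. This is only an illustration under stated assumptions: the names MessageLoop, Enqueue, Run and SumInChunks are hypothetical, and the "work" is a plain sum split into chunks so that no single task hogs the loop.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; a sketch of the idea, not a production scheduler.
public static class MessageLoop
{
    private static readonly Queue<Action> Pending = new Queue<Action>();

    public static void Enqueue(Action work)
    {
        Pending.Enqueue(work);
    }

    // Runs queued work items one at a time on the calling thread until none remain.
    public static void Run()
    {
        while (Pending.Count > 0)
        {
            Pending.Dequeue()();
        }
    }

    // Sums next..limit in chunks of 1000; each chunk enqueues the remainder
    // instead of looping to the end, so other queued work can run in between.
    public static void SumInChunks(int next, int limit, long total, Action<long> done)
    {
        int end = Math.Min(next + 1000, limit + 1);
        for (; next < end; next++)
        {
            total += next;
        }
        if (next <= limit)
        {
            Enqueue(() => SumInChunks(next, limit, total, done));
        }
        else
        {
            done(total);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        MessageLoop.Enqueue(() => MessageLoop.SumInChunks(1, 10000, 0, t => Console.WriteLine("sum = " + t)));
        MessageLoop.Enqueue(() => Console.WriteLine("other work interleaves with the sum"));
        MessageLoop.Run();
    }
}
```

Because the first chunk re-enqueues the rest of the sum behind the second work item, the "other work" line is printed before the final sum, all on one thread and without any blocking call.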

Yes, it's called Manos de Mono.
Seriously, the entire idea behind Manos is a single-threaded, asynchronous, event-driven web server.
High performance and scalable. Modeled after tornadoweb, the technology that powers FriendFeed, Manos is capable of thousands of simultaneous connections, ideal for applications that create persistent connections with the server.
The project appears to be low on maintenance and probably isn't production ready, but it makes a good case study as a demonstration that this is possible.

Here's a great article series explaining what I/O Completion Ports are and how they can be accessed via C# (i.e. you need to P/Invoke into Win32 API calls from Kernel32.dll).
Note: libuv, the cross-platform I/O framework behind node.js, uses IOCP on Windows and libev on Unix operating systems.
http://www.theukwebdesigncompany.com/articles/iocp-thread-pooling.php

I am surprised nobody mentioned Kayak. It's basically C#'s answer to Python's Twisted, JavaScript's node.js or Ruby's EventMachine.

I've been fiddling with my own simple implementation of such an architecture and I've put it up on GitHub. I'm doing it more as a learning thing. But it's been a lot of fun and I think I'll flesh it out more.
It's very alpha, so it's liable to change, but the code looks a little like this:
// Start the event loop.
EventLoop.Start(() => {
    // Create a Hello World server on port 1337.
    Server.Create((req, res) => {
        res.Write("<h1>Hello World</h1>");
    }).Listen("http://*:1337");
});
More information about it can be found here.

I developed a server based on HttpListener and an event loop, supporting MVC, WebApi and routing. From what I have seen, the performance is far better than standard IIS+MVC: for the MvcMusicStore I went from 100 requests per second at 100% CPU to 350 at 30% CPU.
If anybody would give it a try, I am eager for feedback!
There is also a template for creating websites based on this structure.
Note that I DON'T USE ASYNC/AWAIT until absolutely necessary. The only tasks I use there are the ones for I/O-bound operations like writing to the socket or reading files.
PS: any suggestion or correction is welcome!
Documentation
MvcMusicStore sample port on Node.Cs
Packages on Nuget

You can use this framework: SignalR
and this blog post about it.

Some kind of support from the operating system is essential here. For example, Mono uses epoll on Linux for asynchronous I/O, so it should scale really well (it still uses a thread pool). If you are looking for performance and scalability, definitely try it.
On the other hand, an example of a C# web server (with native libs) built around the idea you mentioned is Manos de Mono. The project has not been active lately; however, the idea and code are generally available. Read this (especially the "A closer look at Manos" part).
Edit:
If you just want to have callbacks fired on your main thread, you can slightly abuse existing synchronization contexts like the WPF dispatcher. Your code, translated to this approach:
using System;
using System.IO;
using System.Threading;
using System.Windows;
namespace Node
{
    class Program
    {
        public static void Main()
        {
            var app = new Application();
            app.Startup += ServerStart;
            app.Run();
        }
        private static void ServerStart(object sender, StartupEventArgs e)
        {
            var dispatcher = ((Application) sender).Dispatcher;
            Console.WriteLine(Thread.CurrentThread.ManagedThreadId);
            var path = Environment.ExpandEnvironmentVariables(
                @"%SystemRoot%\Notepad.exe");
            // The final 'true' opens the file for asynchronous I/O.
            var fs = new FileStream(path, FileMode.Open,
                FileAccess.Read, FileShare.ReadWrite, 1024 * 4, true);
            var bytes = new byte[1024];
            fs.BeginRead(bytes, 0, bytes.Length, ar =>
            {
                // Marshal the completion back onto the dispatcher (main) thread.
                dispatcher.BeginInvoke(new Action(() =>
                {
                    var res = fs.EndRead(ar);
                    // Are we in the same thread?
                    Console.WriteLine(Thread.CurrentThread.ManagedThreadId);
                }));
            }, null);
        }
    }
}
This prints what you wish. Plus, you can set priorities with the dispatcher. But I agree, this is ugly and hacky, and I do not know why I would do it that way for any reason other than answering your demo request ;)

First, about SynchronizationContext. It's just like Sam wrote: the base class won't give you single-thread functionality. You probably got that idea from WindowsFormsSynchronizationContext, which provides functionality to execute code on the UI thread.
You can read more here
I've written a piece of code that works with ThreadPool parameters (again, something Sam already pointed out).
This code registers 3 asynchronous actions to be executed on free threads. They run in parallel until one of them changes the ThreadPool parameters. After that, each action is executed on the same thread.
It only proves that you can force a .NET app to use one thread.
A real implementation of a web server that would receive and process calls on only one thread is something entirely different :).
Here's the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.IO;
namespace SingleThreadTest
{
    class Program
    {
        class TestState
        {
            internal string ID { get; set; }
            internal int Count { get; set; }
            internal int ChangeCount { get; set; }
        }
        static ManualResetEvent s_event = new ManualResetEvent(false);
        static void Main(string[] args)
        {
            Console.WriteLine(Thread.CurrentThread.ManagedThreadId);
            int nWorkerThreads;
            int nCompletionPortThreads;
            ThreadPool.GetMaxThreads(out nWorkerThreads, out nCompletionPortThreads);
            Console.WriteLine(String.Format("Max Workers: {0} Ports: {1}", nWorkerThreads, nCompletionPortThreads));
            ThreadPool.GetMinThreads(out nWorkerThreads, out nCompletionPortThreads);
            Console.WriteLine(String.Format("Min Workers: {0} Ports: {1}", nWorkerThreads, nCompletionPortThreads));
            ThreadPool.QueueUserWorkItem(new WaitCallback(LetsRunLikeCrazy), new TestState() { ID = "A ", Count = 10, ChangeCount = 0 });
            ThreadPool.QueueUserWorkItem(new WaitCallback(LetsRunLikeCrazy), new TestState() { ID = " B ", Count = 10, ChangeCount = 5 });
            ThreadPool.QueueUserWorkItem(new WaitCallback(LetsRunLikeCrazy), new TestState() { ID = " C", Count = 10, ChangeCount = 0 });
            s_event.WaitOne();
            Console.WriteLine("Press enter...");
            Console.In.ReadLine();
        }
        static void LetsRunLikeCrazy(object o)
        {
            // Stop rescheduling once any action has signalled completion.
            if (s_event.WaitOne(0))
            {
                return;
            }
            TestState oState = o as TestState;
            if (oState != null)
            {
                // Are we in the same thread?
                Console.WriteLine(String.Format("Hello. Start id: {0} in thread: {1}", oState.ID, Thread.CurrentThread.ManagedThreadId));
                Thread.Sleep(1000);
                oState.Count -= 1;
                if (oState.ChangeCount == oState.Count)
                {
                    int nWorkerThreads = 1;
                    int nCompletionPortThreads = 1;
                    // Note: SetMaxThreads returns false and changes nothing if a value
                    // is smaller than the number of processors on the machine.
                    ThreadPool.SetMinThreads(nWorkerThreads, nCompletionPortThreads);
                    ThreadPool.SetMaxThreads(nWorkerThreads, nCompletionPortThreads);
                    ThreadPool.GetMaxThreads(out nWorkerThreads, out nCompletionPortThreads);
                    Console.WriteLine(String.Format("New Max Workers: {0} Ports: {1}", nWorkerThreads, nCompletionPortThreads));
                    ThreadPool.GetMinThreads(out nWorkerThreads, out nCompletionPortThreads);
                    Console.WriteLine(String.Format("New Min Workers: {0} Ports: {1}", nWorkerThreads, nCompletionPortThreads));
                }
                if (oState.Count > 0)
                {
                    Console.WriteLine(String.Format("Hello. End id: {0} in thread: {1}", oState.ID, Thread.CurrentThread.ManagedThreadId));
                    // Re-queue the same action until its counter reaches zero.
                    ThreadPool.QueueUserWorkItem(new WaitCallback(LetsRunLikeCrazy), oState);
                }
                else
                {
                    Console.WriteLine(String.Format("Hello. End id: {0} in thread: {1}", oState.ID, Thread.CurrentThread.ManagedThreadId));
                    s_event.Set();
                }
            }
            else
            {
                Console.WriteLine("Error !!!");
                s_event.Set();
            }
        }
    }
}

LibuvSharp is a wrapper for libuv, which is used in the node.js project for async I/O. But it contains only low-level TCP/UDP/Pipe/Timer functionality. And it will stay like that; writing a web server on top of it is an entirely different story. It doesn't even support DNS resolving, since this is just a protocol on top of UDP.

I believe it's possible, here is an open-source example written in VB.NET and C#:
https://github.com/perrybutler/dotnetsockets/
It uses Event-based Asynchronous Pattern (EAP), IAsyncResult Pattern and thread pool (IOCP). It will serialize/marshal the messages (messages can be any native object such as a class instance) into binary packets, transfer the packets over TCP, and then deserialize/unmarshal the packets at the receiving end so you get your native object to work with. This part is somewhat like Protobuf or RPC.
It was originally developed as a "netcode" for real-time multiplayer gaming, but it can serve many purposes. Unfortunately I never got around to using it. Maybe someone else will.
The source code has a lot of comments so it should be easy to follow. Enjoy!

Here is one more implementation of an event-loop web server, called SingleSand. It executes all custom logic inside a single-threaded event loop, but the web server is hosted in ASP.NET.
To answer the question: it is generally not possible to run a purely single-threaded app because of .NET's multi-threaded nature. Some activities run in separate threads and the developer cannot change their behavior.

C# Downloader: should I use Threads, BackgroundWorker or ThreadPool?

I'm writing a downloader in C# and got stuck on the following problem: what kind of method should I use to parallelize my downloads and update my GUI?
In my first attempt, I used 4 Threads and at the completion of each of them I started another one. The main problem was that my CPU went to 100% at each new thread start.
Googling around, I found out about BackgroundWorker and ThreadPool. Given that I want to update my GUI with the progress of each link that I'm downloading, what is the best solution?
1) Creating 4 different BackgroundWorkers, attaching to each ProgressChanged event a delegate to a function in my GUI to update the progress?
2) Using ThreadPool and setting the max and min number of threads to the same value?
If I choose #2, when there are no more threads in the queue, does it stop the 4 working threads? Does it suspend them? Since I have to download different lists of links (20 links each) and move from one to another when one is completed, does the ThreadPool start and stop threads between each list?
If I want to change the number of working threads on the fly and decide to use ThreadPool, changing from 10 threads to 6, does it throw an exception and stop 4 random threads?
This is the only part that is giving me a headache.
I thank each of you in advance for your answers.

I would suggest using WebClient.DownloadFileAsync for this. You can have multiple downloads going, each raising the DownloadProgressChanged event as it goes along, and DownloadFileCompleted when done.
You can control the concurrency by using a queue with a semaphore or, if you're using .NET 4.0, a BlockingCollection. For example:
// Information used in callbacks.
class DownloadArgs
{
    public readonly string Url;
    public readonly string Filename;
    public readonly WebClient Client;
    public DownloadArgs(string u, string f, WebClient c)
    {
        Url = u;
        Filename = f;
        Client = c;
    }
}

const int MaxClients = 4;
// create a queue that allows the max items
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>(MaxClients);
// queue of urls to be downloaded (unbounded)
Queue<string> UrlQueue = new Queue<string>();

// create four WebClient instances and put them into the queue
for (int i = 0; i < MaxClients; ++i)
{
    var cli = new WebClient();
    cli.DownloadProgressChanged += DownloadProgressChanged;
    cli.DownloadFileCompleted += DownloadFileCompleted;
    ClientQueue.Add(cli);
}

// Fill the UrlQueue here

// Now go until the UrlQueue is empty
while (UrlQueue.Count > 0)
{
    WebClient cli = ClientQueue.Take(); // blocks if there is no client available
    string url = UrlQueue.Dequeue();
    string fname = CreateOutputFilename(url); // or however you get the output file name
    cli.DownloadFileAsync(new Uri(url), fname,
        new DownloadArgs(url, fname, cli));
}

void DownloadProgressChanged(object sender, DownloadProgressChangedEventArgs e)
{
    DownloadArgs args = (DownloadArgs)e.UserState;
    // Do status updates for this download
}

void DownloadFileCompleted(object sender, AsyncCompletedEventArgs e)
{
    DownloadArgs args = (DownloadArgs)e.UserState;
    // do whatever UI updates
    // now put this client back into the queue
    ClientQueue.Add(args.Client);
}
There's no need for explicitly managing threads or going to the TPL.

I think you should look into using the Task Parallel Library, which is new in .NET 4 and is designed for solving these types of problems.
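As a rough idea of what a task-based version could look like, here is a minimal sketch that caps the number of simultaneous downloads with a SemaphoreSlim. It assumes .NET 4.5 or later (async/await), which is newer than the .NET 4 mentioned above. ThrottledDownloads, DownloadAllAsync and FakeDownloadAsync are hypothetical names, and the download itself is simulated with Task.Delay rather than real network I/O.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledDownloads
{
    // Stand-in for a real download (assumption): Task.Delay simulates network latency.
    private static async Task<string> FakeDownloadAsync(string url)
    {
        await Task.Delay(50);
        return "done: " + url;
    }

    // Runs at most maxConcurrent downloads at a time; results come back in input order.
    public static async Task<string[]> DownloadAllAsync(IEnumerable<string> urls, int maxConcurrent)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            List<Task<string>> tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();          // wait for a free slot without blocking a thread
                try
                {
                    return await FakeDownloadAsync(url);
                }
                finally
                {
                    gate.Release();              // free the slot for the next URL
                }
            }).ToList();
            return await Task.WhenAll(tasks);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var urls = Enumerable.Range(1, 20).Select(i => "http://example.invalid/file" + i);
        string[] results = ThrottledDownloads.DownloadAllAsync(urls, 4).GetAwaiter().GetResult();
        foreach (string line in results)
        {
            Console.WriteLine(line);
        }
    }
}
```

In a GUI application you would await DownloadAllAsync from an event handler instead of blocking in Main, and report progress from inside the lambda, for example through IProgress<T>.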

Having 100% CPU load has nothing to do with the download (as your network is practically always the bottleneck). I would say you have to check the logic of how you wait for the download to complete.
Can you post the code of the thread that you start multiple times?

By creating 4 different BackgroundWorkers you will be creating separate threads that will no longer interfere with your GUI. BackgroundWorkers are simple to implement and, from what I understand, will do exactly what you need them to do.
Personally I would do this and simply allow the others to not start until the previous one is finished. (Or maybe just one, and allow it to execute one method at a time in the correct order.)
FYI - Backgroundworker

Can we create 300,000 threads in a C# application and run it on a PC?

I am trying to imitate a scenario where 300,000 consumers are accessing a server. So I am trying to create the pseudo clients, by repeatedly querying the server from the concurrent threads.
But the first hurdle to clear is whether it is possible to run 300,000 threads on a PC. Here is the code I am using to see initially how many threads I can get at most; later I will replace the test function with the actual function:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
namespace CheckThread
{
    class Program
    {
        static int count;
        public static void TestThread(int i)
        {
            while (true)
            {
                Console.Write("\rThread Executing : {0}", i);
                Thread.Sleep(500);
            }
        }
        static void Main(string[] args)
        {
            count = 0;
            int limit = 0;
            if (args.Length != 1)
            {
                Console.WriteLine("Usage CheckThread <number of threads>");
                return;
            }
            else
            {
                limit = Convert.ToInt32(args[0]);
            }
            Console.WriteLine();
            while (count < limit)
            {
                ThreadStart newThread = new ThreadStart(delegate { TestThread(count); });
                Thread mythread = new Thread(newThread);
                mythread.Start();
                Console.WriteLine("Thread # {0}", count++);
            }
            while (true)
            {
                Thread.Sleep(30*1000);
            }
        } // end of main
    } // end of CheckThread class
} // end of namespace
Now, what I am trying might be unrealistic, but still, if there is a way to do it and you know it, then please help me.

Each thread will create its own stack and local storage. You are looking at roughly 512 KB of stack space per thread on a 32-bit OS; I think the stack space doubles on a 64-bit OS. A quick back-of-the-spreadsheet calculation (300,000 × 512 KB) gives us about 146.5 GB of stack space for your 300k clients.
So, no, don't create 300k threads; rather, use the thread pool to simulate 300k requests, although to be honest I think you would be better off with several test clients spamming your server through a network interface.
There are a lot of web load-testing tools available. Good starting point : http://www.webperformance.com/library/reports/TestingAspDotNet/
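To make the thread pool suggestion concrete, here is a small sketch that queues many short work items on the shared pool instead of creating one thread per client, and repeats the stack-space arithmetic from above. PooledLoad and Run are hypothetical names, and the work item only increments a counter; a real test would send a request to the server at that point.

```csharp
using System;
using System.Threading;

public static class PooledLoad
{
    // Queues 'requests' work items on the shared thread pool and waits for all of them.
    // Returns how many work items actually ran.
    public static int Run(int requests)
    {
        int completed = 0;
        using (var done = new CountdownEvent(requests))
        {
            for (int i = 0; i < requests; i++)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // Placeholder: a real load test would issue one request here.
                    Interlocked.Increment(ref completed);
                    done.Signal();
                });
            }
            done.Wait(); // blocks until every queued item has signalled
        }
        return completed;
    }
}

public static class Program
{
    public static void Main()
    {
        // 300,000 threads at 512 KB of stack each, converted from KB to GB (1024 * 1024 KB).
        double stackGb = 300000 * 512.0 / (1024 * 1024);
        Console.WriteLine("Stack for 300k threads: {0:F1} GB", stackGb);

        Console.WriteLine("Completed work items: {0}", PooledLoad.Run(300000));
    }
}
```

The pool reuses a small number of threads for all 300,000 items, so memory use stays close to the size of the queue rather than 300,000 thread stacks.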

You can alter the maximum number of threads by calling the ThreadPool.SetMaxThreads method. 300,000 threads will probably make your PC explode*
*This is probably an exaggeration

Language-agnostic answer:
The better way to probably go about this is using the Reactor pattern, with a maximum of 1 or 2 concurrent threads per core.

As .NET commits the entire stack (1 MB) for each CLR thread, as Ben says, your PC may actually explode. Or possibly run out of memory.

Well, what was the result of your test when you tried to create 300K threads? I'm not going to try it on mine!
You could not connect up 300K clients at once anyway because there are not enough sockets available on a single server, (hence farming).
I have done some server testing and, by tweaking the registry to make more sockets available, I have had 24K sockets connected to a server, all on one box. That was roughly what I was expecting, since the server<>client connection requires one socket object at each end and there are only 64K sockets available. I did not attempt to create 24K threads for my testing; I used a client thread class that opened/closed connections on multiple client socket objects in a list.
Rgds,
Martin
