We are implementing a C# application that needs to make large numbers of socket connections to legacy systems. We will (likely) be using a 3rd party component to do the heavy lifting around terminal emulation and data scraping. We have the core functionality working today, now we need to scale it up.
During peak times this may be thousands of concurrent connections, i.e. threads (and even tens of thousands several times a year), that need to be opened. These connections mainly sit idle (no traffic other than a periodic handshake) for minutes (or hours) until the legacy system 'fires an event' we care about; we then scrape some data from the event, perform some workflow, and wait for the next event. There is no value in pooling (as far as we can tell) since threads will rarely need to be reused.
We are looking for any good patterns or tools that will help us use this many threads efficiently. Running on high-end server hardware is not an issue, but we do need to limit the application to just a few servers, if possible.
In our testing, creating a new thread and initializing the 3rd party control seems to use a lot of CPU initially, but then drops to near zero. Memory use seems to be about 800 MB per 1,000 threads.
Is there anything better / more efficient than just creating and starting the number of threads needed?
PS - Yes, we know it is bad to create this many threads, but since we have no control over the legacy applications, this seems to be our only alternative. There is no option for multiple events to come across a single socket / connection.
Thanks for any help or pointers!
Vans
You say this:
"There is no value in pooling (as far as we can tell) since threads will rarely need to be reused."
But then you say this:
"Is there anything better / more efficient than just creating and starting the number of threads needed?"
Why the discrepancy? Do you care about the number of threads you are creating or not? Thread pooling is the proper way to handle large numbers of mostly-idle connections. A few busy threads can handle many idle connections easily and with fewer resources required.
Use the socket's asynchronous BeginReceive and BeginSend. These dispatch the IO operation to the operating system and return immediately.
You pass a delegate and some state to those methods that will be called when an IO operation completes.
Generally, once you are done processing the IO, you immediately call BeginX again.
Socket sock = GetSocket();
State state = new State
{
    Socket = sock,
    Buffer = new byte[1024],
    ThirdPartyControl = GetControl()
};

// Kick off the first asynchronous receive; the callback runs on an IO
// completion thread when data arrives.
sock.BeginReceive(state.Buffer, 0, state.Buffer.Length, SocketFlags.None, ProcessAsyncReceive, state);

void ProcessAsyncReceive(IAsyncResult iar)
{
    State state = (State)iar.AsyncState;
    int bytesRead = state.Socket.EndReceive(iar);
    if (bytesRead == 0)
        return; // the remote side closed the connection

    // Process the received data in state.Buffer here
    state.ThirdPartyControl.ScrapeScreen(state.Buffer);

    // Queue the next receive so the loop continues
    state.Socket.BeginReceive(state.Buffer, 0, state.Buffer.Length, SocketFlags.None, ProcessAsyncReceive, state);
}

public class State
{
    public Socket Socket { get; set; }
    public byte[] Buffer { get; set; }
    public ThirdPartyControl ThirdPartyControl { get; set; }
}
BeginSend is used in a similar fashion, as well as BeginAccept if you are accepting incoming connections.
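For the accept side, here is a minimal sketch (assuming `listener` is a bound, listening Socket; the setup is hypothetical):

listener.BeginAccept(ProcessAsyncAccept, listener);

void ProcessAsyncAccept(IAsyncResult iar)
{
    Socket listener = (Socket)iar.AsyncState;
    Socket client = listener.EndAccept(iar);            // the newly connected client
    listener.BeginAccept(ProcessAsyncAccept, listener); // immediately post another accept
    // start the BeginReceive loop shown above for `client`
}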
With low-throughput operations, async communication can easily handle thousands of clients simultaneously.
I would really look into MPI.NET (more info: MPI). MPI.NET also has some parallel reduction support, so it will work well for aggregating results.
I would suggest utilizing the Socket.Select() method, and pooling the handling of multiple socket connections within a single thread.
You could, for example, create a thread for every 50 connections to the legacy system. These master threads would just keep calling Socket.Select(), waiting for data to arrive. Each master thread could then hand sockets that have pending data off to a thread pool for actual processing, and once processing is complete the socket would be passed back to the master thread; a rough sketch follows below.
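Here is a rough sketch of one master-thread loop, assuming `connections` holds the ~50 sockets this thread owns and HandleData is a hypothetical handler that scrapes and processes one socket's event:

while (true)
{
    List<Socket> readable;
    lock (connections)
        readable = new List<Socket>(connections); // Select() prunes this list in place

    if (readable.Count == 0) { Thread.Sleep(50); continue; }

    Socket.Select(readable, null, null, 1000 * 1000); // 1-second timeout, in microseconds

    foreach (Socket s in readable)
    {
        Socket captured = s;
        lock (connections)
            connections.Remove(captured); // hand it off; don't select it again while busy

        ThreadPool.QueueUserWorkItem(_ =>
        {
            HandleData(captured); // scrape the event, run the workflow, etc.
            lock (connections)
                connections.Add(captured); // pass the socket back to the master thread
        });
    }
}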
There are a number of patterns using Microsoft's Coordination and Concurrency Runtime (CCR) that make dealing with IO easy and light. It allows us to grab and process well over 6,000 web pages a minute (it could go much higher, but there's no need) in a crawler we are developing. Definitely worth the time investment required to shift your head into the CCR way of doing things. There's a great article here:
http://msdn.microsoft.com/en-us/magazine/cc163556.aspx
Related
So I'm writing a UDP server and client for the first time, for a 1v1 game.
My idea is to have the server handle first connections and create a new thread every time 2 new players connect, to handle all communication between them.
A typical client message would have the threadIndex (I have an array of threads), the playerId (which player it came from), and whatever needs to be done.
Is it possible to receive the packet on all threads and analyze whether it's meant for them? Would this be efficient? How should I approach this?
The suitable approach depends on the nature of the server's tasks, but creating a new thread for every pair of players is probably not the best idea. Basically, let's imagine that your server mostly performs:

I/O-bound tasks. In other words, most of the time it waits for some I/O operation - a network response, a query to a database, or a disk operation. In this case you probably want an asynchronous model, where all your connections are handled on the same thread. It would be efficient because you actually don't have much to do in your own code. I suspect you more likely have I/O-bound tasks: for example, you just need to route messages between players and push/pull some data from the DB. All routed messages will have an id of the game (between two players), so you will never miss any of them, and they won't be missent. Take a look at this video to see the ideas and goals of the asynchronous approach.

CPU-bound tasks. Here the server must compute something, perform heavy algorithms, or process huge amounts of data. In this case you probably need multithreading, but again, a thread per player pair may not be the most suitable approach, because it does not scale well and eats too many resources. If you have some heavy CPU tasks, try to handle them in a queue with a set of background workers, and then push the messages out asynchronously. Take a look at the producer-consumer implementation with BlockingCollection.

You may have a combination of the two cases, and of course you can combine the approaches above. Also see questions 1, 2, 3. Try it and come back with specific questions. Hope it helps.
I am working on a project where I am to extract information continually from multiple servers (fewer than 1000) and write most of the information into a database. I've narrowed down my choices to 2:
Edit: This is a client, so I will be generating the connections and requesting information periodically.
1 - Using the asynchronous approach, create N sockets to poll, decide in the callback whether the information will be written to the database, and put the useful information into a buffer. Then write the information from the buffer using a timer.
2 - Using the multithreading approach, create N threads with one socket per thread. The buffer of the useful information would remain on the main thread and so would the cyclic writing.
Both options in fact use multiple threads; the second one just seems to add the extra difficulty of creating each of the threads manually. Are there any merits to it? And is writing by using a timer wise?
With 1000 connections, async IO is usually a good idea because it does not block threads while the IO is in progress (it does not even use a background thread to wait). That makes (1) the better alternative.
It is not clear from the question what you would need a timer for. Maybe for buffering writes? That would be valid but it seems to have nothing to do with the question.
Polling has no place in a modern async IO application. The system calls your callback (or completes your IO Task) when it is done. The callback is queued to the thread-pool. This allows you to not worry about that. It just happens.
The code that reads data should look like this:
while (true)
{
    var msg = await ReadMessageAsync(socket);
    if (msg == null) break;
    await WriteDataAsync(msg);
}
Very simple. No blocking of threads. No callbacks.
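ReadMessageAsync above is a placeholder for whatever your protocol needs. As one hypothetical example, assuming a length-prefixed wire format (4-byte length, then the payload), it might look like this:

static async Task<byte[]> ReadMessageAsync(Socket socket)
{
    byte[] header = await ReadExactAsync(socket, 4);
    if (header == null) return null;                  // connection closed
    int length = BitConverter.ToInt32(header, 0);
    return await ReadExactAsync(socket, length);
}

static async Task<byte[]> ReadExactAsync(Socket socket, int count)
{
    var buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int read = await socket.ReceiveAsync(
            new ArraySegment<byte>(buffer, offset, count - offset), SocketFlags.None);
        if (read == 0) return null;                   // connection closed mid-message
        offset += read;
    }
    return buffer;
}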
In answer to the "is using a timer wise" question: perhaps it is better to make your buffer autoflush when it reaches either a certain age or a certain size. This is the way the in-memory cache works in the .NET Framework; the cache is set to both a maximum size and a maximum staleness. A rough sketch follows below.
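Here is a rough sketch of such a buffer, assuming a hypothetical DataType record and a caller-supplied flush callback (not production code):

class AutoFlushBuffer
{
    private readonly List<DataType> _items = new List<DataType>();
    private readonly object _lock = new object();
    private readonly int _maxSize;
    private readonly Timer _timer;
    private readonly Action<List<DataType>> _flush;

    public AutoFlushBuffer(int maxSize, TimeSpan maxAge, Action<List<DataType>> flush)
    {
        _maxSize = maxSize;
        _flush = flush;
        _timer = new Timer(_ => FlushNow(), null, maxAge, maxAge); // flush on age
    }

    public void Add(DataType item)
    {
        lock (_lock)
        {
            _items.Add(item);
            if (_items.Count < _maxSize) return;
        }
        FlushNow(); // flush on size
    }

    private void FlushNow()
    {
        List<DataType> batch;
        lock (_lock)
        {
            if (_items.Count == 0) return;
            batch = new List<DataType>(_items);
            _items.Clear();
        }
        _flush(batch); // e.g. write the batch to the database
    }
}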
Resiliency on failure might be a concern, as well as the possibility that peak loads might blow your buffer if it's an in-memory one. You might consider making your buffer local but persistent - for instance using MSMQ or a similar high-speed queue technology. I've seen this done successfully; especially if you make the buffer write async (i.e. "fire and forget"), it has almost no impact on the ability to service the input queue, and it allows the database-population code to pull from the persistent buffer(s) whenever it needs to or whenever prompted to.
Another option is to have a dedicated thread whose only job is to service the buffer and write data to the database as fast as it can. So when you make a connection and get data, that data is placed in the buffer. But you have one thread that's always looking at the buffer and writing data to the database as it comes in from the other connections.
Create the buffer as a BlockingCollection<T>. Use asynchronous requests as suggested in a previous answer. And have a single dedicated thread that reads the data and writes it to the database:
BlockingCollection<DataType> _theQueue = new BlockingCollection<DataType>(MaxBufferSize);

// add data with
_theQueue.Add(dataItem);

// service the queue with a simple loop
foreach (var dataItem in _theQueue.GetConsumingEnumerable())
{
    // write dataItem to the database
}
When you want to shut down (i.e. no more data is being read from the servers), you mark the queue as complete for adding. The consumer thread will then empty the queue, note that it's marked as complete for adding, and the loop will exit.
// mark the queue as complete for adding
_theQueue.CompleteAdding();
You need to make the buffer large enough to handle bursts of information.
If writing one record at a time to the database isn't fast enough, you can modify the consumer loop to fill its own internal buffer with some number of records (10? 100? 1000?) and write them to the database all in one shot, as sketched below. How you do that will of course depend on your server, but you should be able to come up with some form of bulk insert that reduces the number of round trips to the database.
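A hedged sketch of that batching consumer, reusing _theQueue from above and assuming a hypothetical WriteBatchToDatabase helper (e.g. SqlBulkCopy or a multi-row INSERT under the hood):

const int BatchSize = 100; // tune for your server
var batch = new List<DataType>(BatchSize);

foreach (var dataItem in _theQueue.GetConsumingEnumerable())
{
    batch.Add(dataItem);
    if (batch.Count >= BatchSize)
    {
        WriteBatchToDatabase(batch); // one round trip for the whole batch
        batch.Clear();
    }
}

// CompleteAdding() ends the loop above; flush whatever is left
if (batch.Count > 0)
    WriteBatchToDatabase(batch);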
For option (1) you could write qualifying information to a queue and then listen on the queue with your database writer. This will allow your database some breathing space during peak loads and avoid the requests backing up waiting for a timer.
A persistent queue would give you some resilience too.
I want to create a high-performance server in C# which can take about ~10k clients. I started writing a TcpServer in C#, and for each client connection I open a new thread. I also use one thread to accept the connections. So far so good; it works fine.
The server has to deserialize incoming AMF objects, do some logic (like saving the position of a player), and send some objects back (serializing objects). I am not worried about the serializing/deserializing part atm.
My main concern is that I will have a lot of threads with 10k clients, and I've read somewhere that an OS can only hold a few hundred threads.
Are there any sources/articles available on writing a decent async threaded server? Are there other possibilities, or will 10k threads work fine? I've looked on Google, but I couldn't find much info about design patterns or ways which explain it clearly.
You're going to run into a number of problems.
You can't spin up 10,000 threads, for a couple of reasons. It'll thrash the kernel scheduler. And if you're running 32-bit, the default stack address space of 1 MB means that 10k threads will reserve about 10 GB of address space. That'll fail.
You can't use a simple select system either. At its heart, select is O(N) in the number of sockets. With 10k sockets, that's bad.
You can use IO completion ports. This is the scenario they're designed for. To my knowledge there is no stable, managed IO completion port library, so you'll have to write your own using P/Invoke or Managed C++. Have fun.
The way to write an efficient multithreaded server is to use I/O completion ports (using a thread per request is quite inefficient, as #Marcelo mentions).
If you use the asynchronous version of the .NET socket class, you get this for free. See this question which has pointers to documentation.
You want to look into using IO completion ports. You basically have a threadpool and a queue of IO operations.
"I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system. When a process creates an I/O completion port, the system creates an associated queue object for requests whose sole purpose is to service these requests. Processes that handle many concurrent asynchronous I/O requests can do so more quickly and efficiently by using I/O completion ports in conjunction with a pre-allocated thread pool than by creating threads at the time they receive an I/O request."
You definitely don't want a thread per request. Even if you have fewer clients, the overhead of creating and destroying threads will cripple the server, and there's no way you'll get to 10,000 threads; the OS scheduler will die a horrible death long before then.
There are numerous articles online about asynchronous server programming in C# (e.g., here). Just google around a bit.
Everything that I read about sockets in .NET says that the asynchronous pattern gives better performance (especially with the new SocketAsyncEventArgs, which saves on allocations).
I think this makes sense if we're talking about a server with many client connections, where it's not possible to allocate one thread per connection. Then I can see the advantage of using the ThreadPool threads and getting async callbacks on them.
But in my app, I'm the client and I just need to listen to one server sending market tick data over one tcp connection. Right now, I create a single thread, set the priority to Highest, and call Socket.Receive() with it. My thread blocks on this call and wakes up once new data arrives.
If I were to switch this to an async pattern so that I get a callback when there's new data, I see two issues:
1. The threadpool threads will have default priority, so it seems they will be strictly worse than my own thread, which has Highest priority.
2. I'll still have to funnel everything through a single thread at some point. Say I get N callbacks at almost the same time on N different threadpool threads notifying me that there's new data. The N byte arrays they deliver can't be processed on the threadpool threads, because there's no guarantee that they represent N unique market data messages; TCP is stream-based. I'll have to lock, put the bytes into an array anyway, and signal some other thread that can process what's in the array. So I'm not sure what having N threadpool threads is buying me.
Am I thinking about this wrong? Is there a reason to use the async pattern in my specific case of one client connected to one server?
UPDATE:
So I think that I was misunderstanding the async pattern in (2) above. I would get a callback on one worker thread when data was available, then begin another async receive and get another callback, and so on. I wouldn't get N callbacks at the same time.
The question is still the same, though: is there any reason the callbacks would be better in my specific situation, where I'm the client and only connected to one server?
The slowest part of your application will be the network communication. It's highly likely that you will make almost no difference to performance for a one thread, one connection client by tweaking things like this. The network communication itself will dwarf all other contributions to processing or context switching time.
"Say that I get N callbacks at almost the same time on N different threadpool threads notifying me that there's new data."
Why is that going to happen? If you have one socket, you Begin an operation on it to receive data, and you get exactly one callback when it's done. You then decide whether to do another operation. It sounds like you're overcomplicating it, though maybe I'm oversimplifying it with regard to what you're trying to do.
In summary, I'd say: pick the simplest programming model that gets you what you want; considering choices available in your scenario, they would be unlikely to make any noticeable difference to performance whichever one you go with. With the blocking model, you're "wasting" a thread that could be doing some real work, but hey... maybe you don't have any real work for it to do.
The number one rule of performance is only try to improve it when you have to.
I see you mention standards but never mention problems; if you are not having any, then you don't need to worry about what the standards say.
"This class was specifically designed for network server applications that require high performance."
As I understand it, you are a client here, with only a single connection.
Data on this connection arrives in order and is consumed by a single thread.
You will probably lose performance if you instead receive small amounts on separate threads, just so that you can assemble them later in a serialized - and thus effectively single-threaded - manner.
Much Ado about Nothing.
You do not really need to speed this up, you probably cannot.
What you can do, however, is dispatch work units to other threads after you receive them; a sketch follows below.
You do not need SocketAsyncEventArgs for this. It might speed things up.
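A minimal sketch of that idea, assuming hypothetical ParseMessages and ProcessMessage helpers: one dedicated thread blocks on Receive and hands completed work units to pool threads.

var buffer = new byte[8192];
while (true)
{
    int read = socket.Receive(buffer); // blocking receive on the dedicated, high-priority thread
    if (read == 0) break;              // server closed the connection

    // Reassemble the byte stream into whole messages before dispatching,
    // since TCP preserves order but not message boundaries.
    foreach (var msg in ParseMessages(buffer, read))
    {
        var captured = msg;
        ThreadPool.QueueUserWorkItem(_ => ProcessMessage(captured));
    }
}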
As always, measure & measure.
Also, just because you can, it does not mean you should.
If the performance is enough for the foreseeable future, why complicate matters?
I would like to implement a thread pool in Java, which can dynamically resize itself based on the computational and I/O behavior of the tasks submitted to it.
Practically, I want to achieve the same behavior as the new Thread Pool implementation in C# 4.0
Is there an implementation already or can I achieve this behavior by using mostly existing concurrency utilities (e.g. CachedThreadPool)?
The C# version does self-instrumentation to achieve optimal utilization. What self-instrumentation is available in Java, and what performance implications does it present?
Is it feasible to do a cooperative approach, where the task signals its intent (e.g. entering I/O intensive operation, entering CPU intensive operation phase)?
Any suggestions are welcome.
Edit Based on comments:
The target scenarios could be:
Local file crawling and processing
Web crawling
Multi-webservice access and aggregation
The problem with the CachedThreadPool is that it starts new threads when all existing threads are blocked - you need to set explicit bounds on it, but that's it.
For example, I have 100 web services to access in a row. If I hand them to a CachedThreadPool, it will start 100 threads to perform the operations, and the flood of simultaneous I/O requests and data transfers will surely trip over each other. For a static test case I would be able to experiment and find the optimal pool size, but I want it to be determined and applied adaptively.
Consider creating a Map where the key is the bottleneck resource.
Each task submitted to the pool would declare the resource which is its bottleneck, i.e. "CPU", "Network", "C:\", etc.
You could start by allowing only one thread per resource and then slowly ramp up until the work-completion rate stops increasing. Things like CPU could have a floor of the core count.
Let me present an alternative approach. Having a single thread pool is a nice abstraction, but it's not very performant, especially when the jobs are very IO-bound - there's no good way to tune it: it's tempting to blow up the pool size to maximize IO throughput, but then you suffer from too many thread switches, etc.
Instead I'd suggest looking at the architecture of the Apache MINA framework for inspiration. (http://mina.apache.org/) It's a high-performance web framework - they describe it as a server framework, but I think their architecture works well for inverse scenarios as well, like spidering and multi-server clients. (Actually, you might even be able to use it out-of-the-box for your project.)
They use the Java NIO (non-blocking I/O) libraries for all IO operations, and divide up the work into two thread pools: a small and fast set of socket threads, and a larger and slower set of business logic threads. So the layers look as follows:
On the network end, a large set of NIO channels, each with a message buffer
A small pool of socket threads, which go through the channel list round-robin. Their only job is to check the socket and move any data out into the message buffer - and if a message is complete, close it out and transfer it to the job queue. These guys are fast, because they just push bits around and skip any sockets that are blocked on IO.
A single job queue that serializes all messages
A large pool of processing threads, which pull messages off the queue, parse them, and do whatever processing is required.
This makes for very good performance - IO is separated out into its own layer, and you can tune the socket thread pool to maximize IO throughput, and separately tune the processing thread pool to control CPU/resource utilization.
The example given is
Result[] a = new Result[N];
for (int i = 0; i < N; i++) {
    a[i] = compute(i);
}
In Java, the way to parallelize this across every free core, with the workload distributed dynamically so it doesn't matter if one task takes longer than another, is:
// defined earlier
int procs = Runtime.getRuntime().availableProcessors();
ExecutorService service = Executors.newFixedThreadPool(procs);

// main loop: submit one task per element
List<Future<Result>> f = new ArrayList<>(N);
for (int i = 0; i < N; i++) {
    final int i2 = i;
    f.add(service.submit(new Callable<Result>() {
        public Result call() {
            return compute(i2);
        }
    }));
}

// collect the results, blocking until each task completes (the enclosing
// method must handle InterruptedException and ExecutionException)
Result[] a = new Result[N];
for (int i = 0; i < N; i++)
    a[i] = f.get(i).get();
This hasn't changed much in the last 5 years, so it's not as cool as it was when it first became available. What Java really lacks is closures. You can use Groovy instead if that is really a problem.
Additional: if you cared about performance, rather than writing an example, you would not calculate Fibonacci in parallel, because it's a good example of a function that is faster to calculate single-threaded.
One difference is that each thread pool only has one queue, so there is no need to steal work. This potentially means that you have more overhead per task. However, as long as your tasks typically take more than about 10 micro-seconds it won't matter.
I think you should monitor CPU utilization, in a platform-specific manner. Find out how many CPUs/cores you have, and monitor the load. When you find that the load is low, and you still have more work, create new threads - but not more than x times num-cpus (say, x=2).
If you really want to consider IO threads as well, try to find out what state each thread is in when your pool is exhausted, and deduct all waiting threads from the total count. One risk, though, is that you exhaust memory by admitting too many tasks.