Design considerations for an adaptive thread pool in Java

Design considerations for an adaptive thread pool in Java - c#

I would like to implement a thread pool in Java, which can dynamically resize itself based on the computational and I/O behavior of the tasks submitted to it.
Practically, I want to achieve the same behavior as the new Thread Pool implementation in C# 4.0
Is there an implementation already or can I achieve this behavior by using mostly existing concurrency utilities (e.g. CachedThreadPool)?
The C# version does self instrumentation to achieve an optimal utilization. What self instrumentation is available in Java and what performance implications do the present?
Is it feasible to do a cooperative approach, where the task signals its intent (e.g. entering I/O intensive operation, entering CPU intensive operation phase)?
Any suggestions are welcome.
Edit Based on comments:
The target scenarios could be:
Local file crawling and processing
Web crawling
Multi-webservice access and aggregation
The problem of the CachedThreadPool is that it starts new threads when all existing threads are blocked - you need to set explicit bounds on it, but that's it.
For example, I have 100 web services to access in a row. If I create a 100 CTP, it will start 100 threads to perform the operation, and the ton of multiple I/O requests and data transfer will surely stumble upon each others feet. For a static test case I would be able to experiment and find out the optimal pool size, but I want it to be adaptively determined and applied in a way.

Consider creating a Map where the key is the bottleneck resource.
Each thread submitted to the pool will submit a resource which is it's bottleneck, ie "CPU", "Network", "C:\" etc.
You could start by allowing only one thread per resource and then maybe slowly ramp up until work completion rate stops increasing. Things like CPU could have a floor of the core count.

Let me present an alternative approach. Having a single thread pool is a nice abstraction, but it's not very performant, especially when the jobs are very IO-bound - then there's no good way to tune it, it's tempting to blow up the pool size to maximize IO throughput but you suffer from too many thread switches, etc.
Instead I'd suggest looking at the architecture of the Apache MINA framework for inspiration. (http://mina.apache.org/) It's a high-performance web framework - they describe it as a server framework, but I think their architecture works well for inverse scenarios as well, like spidering and multi-server clients. (Actually, you might even be able to use it out-of-the-box for your project.)
They use the Java NIO (non-blocking I/O) libraries for all IO operations, and divide up the work into two thread pools: a small and fast set of socket threads, and a larger and slower set of business logic threads. So the layers look as follows:
On the network end, a large set of NIO channels, each with a message buffer
A small pool of socket threads, which go through the channel list round-robin. Their only job is to check the socket, and move any data out into the message buffer - and if the message is done, close it out and transfer to the job queue. These guys are fast, because they just push bits around, and skip any sockets that are blocked on IO.
A single job queue that serializes all messages
A large pool of processing threads, which pull messages off the queue, parse them, and do whatever processing is required.
This makes for very good performance - IO is separated out into its own layer, and you can tune the socket thread pool to maximize IO throughput, and separately tune the processing thread pool to control CPU/resource utilization.

The example given is
Result[] a = new Result[N];
for(int i=0;i<N;i++) {
a[i] = compute(i);
}
In Java the way to paralellize this to every free core and have the work load distributed dynamically so it doesn't matter if one task takes longer than another.
// defined earlier
int procs = Runtime.getRuntime().availableProcessors();
ExecutorService service = Executors.newFixedThreadPool(proc);
// main loop.
Future<Result>[] f = new Future<Result>[N];
for(int i = 0; i < N; i++) {
final int i2 = i;
a[i] = service.submit(new Callable<Result>() {
public Result call() {
return compute(i2);
}
}
}
Result[] a = new Result[N];
for(int i = 0; i < N; i++)
a[i] = f[i].get();
This hasn't changed much in the last 5 years, so its not as cool as it was when it was first available. What Java really lacks is closures. You can use Groovy instead if that is really a problem.
Additional: If you cared about performance, rather than as an example, you would calculate Fibonacci in parallel because its a good example of a function which is faster if you calculate it single threaded.
One difference is that each thread pool only has one queue, so there is no need to steal work. This potentially means that you have more overhead per task. However, as long as your tasks typically take more than about 10 micro-seconds it won't matter.

I think you should monitor CPU utilization, in a platform-specific manner. Find out how many CPUs/cores you have, and monitor the load. When you find that the load is low, and you still have more work, create new threads - but not more than x times num-cpus (say, x=2).
If you really want to consider IO threads also, try to find out what state each thread is in when your pool is exhausted, and deduct all waiting threads from the total number. One risk is that you exhaust memory by admitting too many tasks, though.

Related

processor usage (force full usage)

First of all, I did not study computer science, and I teached programming my self, said that;
I have a C# program that runs heavy power flow simulations for very large demand profiles.
I use a laptop with an intel i7 processor (4 cores -> 8 threads) under windows 7.
When I run the simulations the processor ussage is arround 32%.
I have read other threads about process prority, and I have more or less clear that when something runs on the OS, it runs at full speed, but the OS keeps the interfaces responsive (is this correct?)
Well I want to "completely flood the processor" with the simulation; get a 100% of usage (if possible) ?
Thanks in advance.
Ref#1: Is there a way of restricting an API's processor resource in c#?
Ref#2: Multiple Processors and PerformanceCounter C#
EDIT: piece of code that calls the simulations after removing the non relevant stuff
while ( current_step < sim_times.Count ) {
bool repeat_all = false;
power_flow( sim_times[current_step] );
current_step++;
}
I know it is super simple, and it is a while becausein the original code I may want to repeat a certain number of steps.
The power_flow() function calls a third party software, so I guess is this third party software the one that should do the multy threading, isn't it?

You can't really force full usage - you need to provide more work for the processor to do. You could do this by increasing the number of threads to process more data in parallel. If you provide your samples of your source code we could provide specific advice on how you could alter your code to achieve this.
If you are using a third party piece of software for data processing, this often makes it difficult to split into multiple threads. One tactic that's often helpful is to split up your input data, then start a new thread for each data set. This requires domain specific knowledge to know what you can split up. For simulations, once you have split up one run as much as possible, an alternative is to process multiple runs in parallel.
The Task Parallel Library can be really useful to break down your code into multiple threads without much refactoring. Particularly the data parallelism section.
One big note of caution - you need to make sure what you're doing is thread-safe. I'll provide some further reading for you. The basic principal is that you need to made sure if you're sharing data between threads then you need to be very careful they don't affect one another - this can cause bizarre problems!
As for your question regarding interfaces - within your process you can allocate thread priority to each thread. An interface thread is just a thread like any other. Usually a UI thread is given the highest priority to remain responsive, whereas a long background process is given a normal/below normal priority as it can be processed during idle time. You can set the priority manually, the default for any new thread is Normal.

You should process these simulations in parallel so that you use as many CPUs as possible. Do this by creating a Task for each simulation run.
using System.Threading.Tasks;
...
List<Task> tasks = new List<Task>();
for(;current_step < sim_times.Count; current_step++)
{
var simTime = sim_times[current_step]; //extract the sim parameter
Task.Factory.StartNew(() => power_flow(simTime)); //create a 'hot' task - one that is immediately scheduled for execution
}
Task.WaitAll(tasks.ToArray()); //wait for the simulations to finish so that you can process results.
Data Parallelism (Task Parallel Library)

Threads eating into CPU performance

I have a console application(c#) where I have to call various third party API's and collect data. This I have to do simultaneously for different users. I am using threads for it. But as the number of users are increasing this service is eating into the CPU performance. It is affecting other processes. Is there a way we can use threads for parallel processing but do not affect the CPU performance in a huge way.

I assume from your question that you're creating threads manually, and so the quick way to answer this is to suggest that you use an API like the Task Parallel Library, because this will take an arbitrary number of tasks and try to use a sensible number of threads to process them - so given 500 API requests, it would limit itself to just a few threads.
However, to answer in more detail: the typical reason that you would see this problem is that code is creating too many threads. Threads are not free resources - they are expensive.
A made up example based on your question might be this:
you have 5 3rd party APIs that you need to call, and each is going to return ~1MB of data per user
you call each API on a separate background thread, for each user
you have 100 users
you therefore have created 500 threads in total, each of which is waiting on data from the network
The problem here is that there are 500 threads the program is trying to manage, and they are all waiting on the slowest piece of the system - the network.
More simply, we are trying to download 500 pieces of data at once (which in this example would mean everything finishes slowly), rather than downloading them one at a time so that individual items will finish earlier. Because each thread will be doing nothing (just waiting for the network), the CPU will switch between idle threads continually. As you increase your number of users, the number of threads increases - which increases the CPU usage just for switch between threads, even though each thread is actually downloading more slowly. This is (approximately) why you'll be seeing slower performance as your user count goes up.
A better example would be to take the same scenario and use just one background thread:
you have 5 3rd party APIs that you need to call, and each is going to return ~1MB of data per user
each API call is put into a queue and the queue is processed by a single thread
you have 100 users
you therefore have 1 thread running in the background which is using the full available bandwidth of the network for each request
In this example, your CPU usage will be pretty consistent - no matter how many users you have, there is only one background thread running, so context switching is minimised. Each individual API call runs at the maximum rate of the network card and so finishes as quickly as possible.
The reality is that one thread is probably not enough: a single request is unlikely to saturate the network, as there will be limiting factors elsewhere. But this is something you can tune later: maybe 2 or 3 threads would be more performant, but 4 threads would be slower again. The general rule when threading is to start small and work up, not to create a thread for each piece of work.

First, run a profiler and checkout some refactoring tools to see if you can perform code optimization to resolve the issue. If your application is still overloading the server then setup or purchase load balancing. In the meantime, if you are running the latest OS's you could try setting a hacky CPU rate limit...however, that may not work for the needs you described.

Launching multiple tasks from a WCF service

I need to optimize a WCF service... it's quite a complex thing. My problem this time has to do with tasks (Task Parallel Library, .NET 4.0). What happens is that I launch several tasks when the service is invoked (using Task.Factory.StartNew) and then wait for them to finish:
Task.WaitAll(task1, task2, task3, task4, task5, task6);
Ok... what I see, and don't like, is that on the first call (sometimes the first 2-3 calls, if made quickly one after another), the final task starts much later than the others (I am looking at a case where it started 0.5 seconds after the others). I tried calling
ThreadPool.SetMinThreads(12*Environment.ProcessorCount, 20);
at the beginning of my service, but it doesn't seem to help.
The tasks are all database-related: I'm reading from multiple databases and it has to take as little time as possible.
Any idea why the last task is taking so long? Is there something I can do about it?
Alternatively, should I use the thread pool directly? As it happens, in one case I'm looking at, one task had already ended before the last one started - I would had saved 0.2 seconds if I had reused that thread instead of waiting for a new one to be created. However, I can not be sure that that task will always end so quickly, so I can't put both requests in the same task.
[Edit] The OS is Windows Server 2003, so there should be no connection limit. Also, it is hosted in IIS - I don't know if I should create regular threads or using the thread pool - which is the preferred version?
[Edit] I've also tried using Task.Factory.StartNew(action, TaskCreationOptions.LongRunning); - it doesn't help, the last task still starts much later (around half a second later) than the rest.
[Edit] MSDN1 says:
The thread pool has a built-in delay
(half a second in the .NET Framework
version 2.0) before starting new idle
threads. If your application
periodically starts many tasks in a
short time, a small increase in the
number of idle threads can produce a
significant increase in throughput.
Setting the number of idle threads too
high consumes system resources
needlessly.
However, as I said, I'm already calling SetMinThreads and it doesn't help.

I have had problems myself with delays in thread startup when using the (.Net 4.0) Task-object. So for time-critical stuff I now use dedicated threads (... again, as that is what I was doing before .Net 4.0.)
The purpose of a thread pool is to avoid the operative system cost of starting and stopping threads. The threads are simply being reused. This is a common model found in for example internet servers. The advantage is that they can respond quicker.
I've written many applications where I implement my own threadpool by having dedicated threads picking up tasks from a task queue. Note however that this most often required locking that can cause delays/bottlenecks. This depends on your design; are the tasks small then there would be a lot of locking and it might be faster to trade some CPU in for less locking: http://www.boyet.com/Articles/LockfreeStack.html
SmartThreadPool is a replacement/extension of the .Net thread pool. As you can see in this link it has a nice GUI to do some testing: http://www.codeproject.com/KB/threads/smartthreadpool.aspx
In the end it depends on what you need, but for high performance I recommend implementing your own thread pool. If you experience a lot of thread idling then it could be beneficial to increase the number of threads (beyond the recommended cpucount*2). This is actually how HyperThreading works inside the CPU - using "idle" time while doing operations to do other operations.
Note that .Net has a built-in limit of 25 threads per process (ie. for all WCF-calls you receive simultaneously). This limit is independent and overrides the ThreadPool setting. It can be increased, but it requires some magic: http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=201

Following from my prior question (yep, should have been a Q against original message - apologies):
Why do you feel that creating 12 threads for each processor core in your machine will in some way speed-up your server's ability to create worker threads? All you're doing is slowing your server down!
As per MSDN do
As per the MSDN docs: "You can use the SetMinThreads method to increase the minimum number of threads. However, unnecessarily increasing these values can cause performance problems. If too many tasks start at the same time, all of them might appear to be slow. In most cases, the thread pool will perform better with its own algorith for allocating threads. Reducing the minimum to less than the number of processors can also hurt performance.".
Issues like this are usually caused by bumping into limits or contention on a shared resource.
In your case, I am guessing that your last task(s) is/are blocking while they wait for a connection to the DB server to come available or for the DB to respond. Remember - if your invocation kicks off 5-6 other tasks then your machine is going to have to create and open numerous DB connections and is going to kick the DB with, potentially, a lot of work. If your WCF server and/or your DB server are cold, then your first few invocations are going to be slower until the machine's caches etc., are populated.
Have you tried adding a little tracing/logging using the stopwatch to time how long it takes for your tasks to connect to the DB server and then execute their operations?
You may find that reducing the number of concurrent tasks you kick off actually speeds things up. Try spawning 3 tasks at a time, waiting for them to complete and then spawn the next 3.

When you call Task.Factory.StartNew, it uses a TaskScheduler to map those tasks into actual work items.
In your case, it sounds like one of your Tasks is delaying occasionally while the OS spins up a new Thread for the work item. You could, potentially, build a custom TaskScheduler which already contained six threads in a wait state, and explicitly used them for these six tasks. This would allow you to have complete control over how those initial tasks were created and started.
That being said, I suspect there is something else at play here... You mentioned that using TaskCreationOptions.LongRunning demonstrates the same behavior. This suggests that there is some other factor at play causing this half second delay. The reason I suspect this is due to the nature of TaskCreationOptions.LongRunning - when using the default TaskScheduler (LongRunning is a hint used by the TaskScheduler class), starting a task with TaskCreationOptions.LongRunning actually creates an entirely new (non-ThreadPool) thread for that Task. If creating 6 tasks, all with TaskCreationOptions.LongRunning, demonstrates the same behavior, you've pretty much guaranteed that the problem is NOT the default TaskScheduler, since this is going to always spin up 6 threads manually.
I'd recommend running your code through a performance profiler, and potentially the Concurrency Visualizer in VS 2010. This should help you determine exactly what is causing the half second delay.

What is the OS? If you are not running the server versions of windows, there is a connection limit. Your many threads are probably being serialized because of the connection limit.
Also, I have not used the task parallel library yet, but my limited experience is that new threads are cheap to make in the context of networking.

These articles might explain the problem you're having:
http://blogs.msdn.com/b/wenlong/archive/2010/02/11/why-are-wcf-responses-slow-and-setminthreads-does-not-work.aspx
http://blogs.msdn.com/b/wenlong/archive/2010/02/11/why-does-wcf-become-slow-after-being-idle-for-15-seconds.aspx
seeing as you're using .Net 4, the first article probably doesn't apply, but as the second article points out the ThreadPool terminates idle threads after 15 seconds which might explain the problem you're having and offers a simple (though a little hacky) solution to get around it.
Whether or not you should be using the ThreadPool directly wouldn't make any difference as I suspect the task library is using it for you underneath anyway.
One third-party library we have been using for a while might help you here - Smart Thread Pool. You still get the same benefits of using the task libraries, in that you can have the return values from the threads and get any exception information from them too.
Also, you can instantiate threadpools so that when you have multiple places each needing a threadpool (so that a low priority process doesn't start eating into the quota of some high priority process) and oh yeah you can set the priority of the threads in the pool too which you can't do with the standard ThreadPool where all the threads are background threads.
You can find plenty of info on the codeplex page, I've also got a post which highlights some of the key differences:
http://theburningmonk.com/2010/03/threading-introducing-smartthreadpool/
Just on a side note, for tasks like the one you've mentioned, which might take some time to return, you probably shouldn't be using the threadpool anyway. It's recommended that we should avoid using the threadpool for any blocking tasks like that because it hogs up the threadpool which is used by all sorts of things by the framework classes, like handling timer events, etc. etc. (not to mention handling incoming WCF requests!). I feel like I'm spamming here but here's some of the info I've gathered around the use of the threadpool and some useful links at the bottom:
http://theburningmonk.com/2010/03/threading-using-the-threadpool-vs-creating-your-own-threads/
well, hope this helps!

What resources do blocked threads take-up

One of the main purposes of writing code in the asynchronous programming model (more specifically - using callbacks instead of blocking the thread) is to minimize the number of blocking threads in the system.
For running threads , this goal is obvious, because of context switches and synchronization costs.
But what about blocked threads? why is it so important to reduce their number?
For example, when waiting for a response from a web server a thread is blocked and doesn't take-up any CPU time and does not participate in any context switch.
So my question is:
other than RAM (about 1MB per thread ?) What other resources do blocked threads take-up?
And another more subjective question:
In what cases will this cost really justify the hassle of writing asynchronous code (the price could be, for example, splitting your nice coherent method to lots of beginXXX and EndXXX methods, and moving parameters and local variables to be class fields).
UPDATE - additional reasons I didn't mention or didn't give enough weight to:
More threads means more locking on communal resources
More threads means more creation and disposing of threads which is expensive
The system can definitely run-out of threads/RAM and then stop servicing clients (in a web server scenario this can actually bring down the service)

So my question is: other than RAM (about 1MB per thread ?) What other resources do blocked threads take-up?
This is one of the largest ones. That being said, there's a reason that the ThreadPool in .NET allows so many threads per core - in 3.5 the default was 250 worker threads per core in the system. (In .NET 4, it depends on system information such as virtual address size, platform, etc. - there isn't a fixed default now.) Threads, especially blocked threads, really aren't that expensive...
However, I would say, from a code management standpoint, it's worth reducing the number of blocked threads. Every blocked thread is an operation that should, at some point, return and become unblocked. Having many of these means you have quite a complicated set of code to manage. Keeping this number reduced will help keep the code base simpler - and more maintainable.
And another more subjective question: In what cases will this cost really justify the hassle of writing asynchronous code (the price could be, for example, splitting your nice coherent method to lots of beginXXX and EndXXX methods, and moving parameters and local variables to be class fields).
Right now, it's often a pain. It depends a lot on the scenario. The Task<T> class in .NET 4 dratically improves this for many scenarios, however. Using the TPL, it's much less painful than it was previously using the APM (BeginXXX/EndXXX) or even the EAP.
This is why the language designers are putting so much effort into improving this situation in the future. Their goals are to make async code much simpler to write, in order to allow it to be used more frequently.

Besides from any resources the blocked thread might hold a lock on, thread pool size is also of consideration. If you have reached the maximum thread pool size (if I recall correctly for .NET 4 is max thread count is 100 per CPU) you simply won't be able to get anything else to run on the thread pool until at least one thread gets freed up.

I would like to point out that the 1MB figure for stack memory (or 256KB, or whatever it's set to) is a reserve; while it does take away from available address space, the actual memory is only committed as it's needed.
On the other hand, having a very large number of threads is bound to bog down the task scheduler somewhat as it has to keep track of them (which have become runnable since the last tick, and so on).

Best pattern for creating large number of C# threads

We are implementing a C# application that needs to make large numbers of socket connections to legacy systems. We will (likely) be using a 3rd party component to do the heavy lifting around terminal emulation and data scraping. We have the core functionality working today, now we need to scale it up.
During peak times this may be thousands of concurrent connections - aka threads (and even tens of thousands several times a year) that need to be opened. These connections mainly sit idle (no traffic other than a periodic handshake) for minutes (or hours) until the legacy system 'fires an event' we care about, we then scrape some data from this event, perform some workflow, and then wait for the next event. There is no value in pooling (as far as we can tell) since threads will rarely need to be reused.
We are looking for any good patterns or tools that will help use this many threads efficiently. Running on high-end server hardware is not an issue, but we do need to limit the application to just a few servers, if possible.
In our testing, creating a new thread, and init'ing the 3rd party control seems to use a lot of CPU initially, but then drops to near zero. Memory use seems to be about 800Megs / 1000 threads
Is there anything better / more efficient than just creating and starting the number of threads needed?
PS - Yes we know it is bad to create this many threads, but since we have not control over the legacy applications, this seems to be our only alternative. There is not option for multiple events to come across a single socket / connection.
Thanks for any help or pointers!
Vans

You say this:
There is no value in pooling (as far
as we can tell) since threads will
rarely need to be reused.
But then you say this:
Is there anything better / more
efficient than just creating and
starting the number of threads needed?
Why the discrepancy? Do you care about the number of threads you are creating or not? Thread pooling is the proper way to handle large numbers of mostly-idle connections. A few busy threads can handle many idle connections easily and with fewer resources required.

Use the socket's asynchronous BeginReceive and BeginSend. These dispatch the IO operation to the operating system and return immediately.
You pass a delegate and some state to those methods that will be called when an IO operation completes.
Generally once you are done processing the IO then you immediately call BeginX again.
Socket sock = GetSocket();
State state = new State() { Socket = sock, Buffer = new byte[1024], ThirdPartyControl = GetControl() };
sock.BeginReceive(state.Buffer, 0, state.Buffer.Length, 0, ProcessAsyncReceive, state);
void ProcessAsyncReceive(IAsyncResult iar)
{
State state = iar.AsyncState as State;
state.Socket.EndReceive(iar);
// Process the received data in state.Buffer here
state.ThirdPartyControl.ScrapeScreen(state.Buffer);
state.Socket.BeginReceive(state.buffer, 0, state.Buffer.Length, 0, ProcessAsyncReceive, iar.AsyncState);
}
public class State
{
public Socket Socket { get; set; }
public byte[] Buffer { get; set; }
public ThirdPartyControl { get; set; }
}
BeginSend is used in a similar fashion, as well as BeginAccept if you are accepting incoming connections.
With low throughput operations Async communications can easily handle thousands of clients simultaneously.

I would really look into MPI.NET. More Info MPI. MPI.NET also has some Parallel Reduction; so this will work well to aggregate results.

I would suggest utilizing the Socket.Select() method, and pooling the handling of multiple socket connections within a single thread.
You could, for example, create a thread for every 50 connections to the legacy system. These master threads would just keep calling Socket.Select() waiting for data to arrive. Each of these master threads could then have a thread pool that sockets that have data are passed to for actual processing. Once the processing is complete, the thread could be passed back to the master thread.

The are a number of patterns using Microsoft's Coordination and Concurrency Runtime that make dealing with IO easy and light. It allows us to grab and process well over 6000 web pages a minute (could go much higher, but there's no need) in a crawler we are developing. Definitely worth a the time investment required to shift your head into the CCR way of doing things. There's a great article here:
http://msdn.microsoft.com/en-us/magazine/cc163556.aspx

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.