The problem I'm tasked to resolve is, from my understanding, a typical producer/consumer problem. We have data incoming 24/7/365. The incoming data (call it raw data) is stored in a table and is unusable for the end user. We then select all raw data that has not been processed and process it one unit at a time. After each unit of data is processed, it's stored in another table and is ready to be consumed by the client application.
The process from loading the raw data till persisting the processed data takes 2 - 5 seconds on average, but it's highly dependent on the third-party web services that we use to process the data. If the web services are slow, we no longer process data as fast as we're getting it in and accumulate a backlog, causing our customers to lose their live feed.
We want to make this process multithreaded. From my research I can see that the process can be divided into three discrete parts:
LOADING - A loader task (producer) that runs indefinitely and loads unprocessed data from DB to BlockingCollection<T> (or some other variation of a concurrent collection). My choice of BlockingCollection is due to the fact that it is designed with Producer/Consumer pattern in mind and offers GetConsumingEnumerable() method.
PROCESSING - Multiple consumers that consume data from the above BlockingCollection<T>. In its current implementation I have a Parallel.ForEach loop over GetConsumingEnumerable() that on each iteration starts a task with two continuations: the first step of the task calls a third-party web service, waits for the result, and outputs the result for the second task to consume. The second task does calculations based on the first task's output and outputs the result for the third task, which simply stores that result in a second BlockingCollection<T> (this one being an output collection). So my consumers are effectively producers too. Ideally each unit of data loaded by step 1 would be queued for processing in parallel.
PERSISTING - A single consumer runs against the second BlockingCollection mentioned above and persists processed data into database.
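A minimal sketch of the three-stage pipeline described above (the type names, queue sizes, and the fake processing step are placeholders for the real DB and web-service code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Pipeline
{
    public static int Run(int itemCount, int workerCount)
    {
        var rawQueue  = new BlockingCollection<int>(boundedCapacity: 100);
        var doneQueue = new BlockingCollection<string>(boundedCapacity: 100);
        int persisted = 0;

        // LOADING: a single producer (fake ids here instead of a DB poll).
        var loader = Task.Run(() =>
        {
            foreach (var id in Enumerable.Range(1, itemCount)) rawQueue.Add(id);
            rawQueue.CompleteAdding();
        });

        // PROCESSING: several consumers that are also producers for the next stage.
        var workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
        {
            foreach (var id in rawQueue.GetConsumingEnumerable())
                doneQueue.Add("processed " + id);   // stand-in for the web-service call
        })).ToArray();

        // PERSISTING: one consumer that would write to the DB.
        var persister = Task.Run(() =>
        {
            foreach (var item in doneQueue.GetConsumingEnumerable())
                Interlocked.Increment(ref persisted);
        });

        loader.Wait();
        Task.WaitAll(workers);
        doneQueue.CompleteAdding();                 // only after every processor is done
        persister.Wait();
        return persisted;
    }

    static void Main() => Console.WriteLine(Pipeline.Run(20, 4));
}
```

Bounding both collections gives back-pressure: if the web services slow down, the loader blocks instead of pulling the whole backlog into memory.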
The problem I'm facing is with step 2 (PROCESSING) above. It does not seem to be fast enough (just using Parallel.ForEach). I tried, inside the Parallel.ForEach, starting a wrapping thread that would in turn start the processing task, instead of starting the task with continuations directly. But this caused an OutOfMemoryException, because the thread count went out of control and quickly reached 1200. I also tried scheduling the work on the ThreadPool, to no avail.
Could you please advise if my approach is good enough for what we need done, or is there a better way of doing it?
If the bottleneck is some 3rd party service that does not handle parallel execution but instead queues your requests, then there is not much you can do about that end.
But first you can try this:
use the ThreadPool or Tasks (those use the ThreadPool too) - don't fire up threads yourself
try to make your requests async instead of tying up a thread for the whole call
run your service/app through a performance profiler and check where you are "wasting" your time
write a spike/check against the 3rd party service and see how it handles parallel requests
think about caching the answers from this service (if possible)
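On the "make your requests async" point: a synchronous call pins a thread for the whole 2 - 5 s round trip, while an awaited call releases it. A sketch of the idea, with Task.Delay standing in for the third-party HTTP call:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class AsyncCalls
{
    // Stand-in for the third-party call: awaiting it does not block a thread.
    static async Task<string> CallServiceAsync(int id)
    {
        await Task.Delay(200);   // real code: await httpClient.GetStringAsync(url)
        return "result " + id;
    }

    public static async Task<long> RunAsync(int requests)
    {
        var sw = Stopwatch.StartNew();
        // All requests are in flight at once instead of one blocked thread each.
        var results = await Task.WhenAll(Enumerable.Range(1, requests).Select(CallServiceAsync));
        Console.WriteLine(results.Length + " results in " + sw.ElapsedMilliseconds + " ms");
        return sw.ElapsedMilliseconds;
    }

    static void Main() => RunAsync(50).Wait();
}
```

Fifty 200 ms calls finish in roughly 200 ms rather than 10 s. In real code you would likely cap concurrency with a SemaphoreSlim so you don't flood the third-party service.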
That's all I can think of without further info right now.
I recently faced a problem very similar to yours. Here's what I did; I hope it helps:
Your 1st and 3rd parts seem rather simple and can be managed on their own threads without any problem. The 2nd part should first be started on a new thread; then use a System.Threading.Timer to make your web-service calls. The method that calls the web service passes the response (result) to the processing method by invoking it asynchronously, letting it process the data at its own pace. This solved my problem; I hope it helps you too. If you have any doubts, ask and I'll explain further.
Related
I am looking to write a Windows Service that will start various "jobs".
Each "job" will:
be distinct in what it accomplishes
run for the lifetime of the Service, so "long running". Typically, a job will get 10 tasks from the database, process them, sleep, and then repeat the cycle again and again.
Share the same "context". The application is loosely coupled and calls an IoC container to get classes. It will also store some data on this context.
I need each job to be able to run in parallel and effectively run as separate programs.
My first thought was to create one thread per job. This works but has drawbacks: a ManualResetEvent stops the thread in its tracks, and Thread.Abort doesn't give the thread much chance to exit gracefully.
I then explored some of the new async framework in .NET 4.5 and boy does it seem to simplify coding.
However, whilst some of the data held on the context may be freely shared between jobs, some cannot: each job requires its own copy of certain data.
I attempted to solve this using ThreadLocal<T> properties. However, whilst this works fine for a specific thread that I've created, it doesn't work for async methods. The thread that starts an async method is often not the thread that finishes it, particularly when the method uses "await".
So, what is the preferred pattern for what I am attempting to accomplish?
FYI: Albahari's posting was a great help.
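For illustration, the behaviour described above can be reproduced directly. ThreadLocal<T> is tied to the physical thread, while AsyncLocal<T> flows with the logical async call chain. (Caveat: AsyncLocal<T> arrived in .NET 4.6, so it is not available on the 4.5 mentioned above; CallContext.LogicalSetData was the 4.5-era equivalent.)

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ContextDemo
{
    static readonly ThreadLocal<string> PerThread = new ThreadLocal<string>();
    static readonly AsyncLocal<string>  PerFlow   = new AsyncLocal<string>();

    public static async Task<string> RunAsync()
    {
        PerThread.Value = "job-1";
        PerFlow.Value   = "job-1";
        // The continuation typically resumes on a different pool thread.
        await Task.Delay(10).ConfigureAwait(false);
        // ThreadLocal belongs to the physical thread; AsyncLocal follows the await.
        return "ThreadLocal=" + (PerThread.Value ?? "null")
             + " AsyncLocal=" + (PerFlow.Value ?? "null");
    }

    static void Main() => Console.WriteLine(RunAsync().Result);
}
```

So the preferred pattern for per-job data that must survive awaits is AsyncLocal<T> (or simply passing a context object explicitly through the async methods, which avoids ambient state altogether).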
I have following scenario:
C# application (.net 4.0/4.5), with 5-6 different threads. Every thread has a different task, which is launched every x seconds (ranging from 5 to 300).
Each task has following steps:
Fetch items from Sql Server
Convert items in Json
Send data to webserver
Wait for reply from server.
Since these tasks can fail at some point (internet problems, timeouts, etc.), what is the best solution in the .NET world?
I thought about following solutions:
Spawn a new thread every x seconds (if no other thread of this type is currently executing)
Spawn one thread per type of task and loop the steps every x seconds (and work out how to manage exceptions within the loop)
Which would be more secure and robust? The application will run on unattended systems, so it should remain running regardless of any possible exception.
Threads are pretty expensive to create, so the first option isn't a great one. If the "do stuff" part of each cycle is brief (between the pauses), you might consider using the ThreadPool or the TPL. If the threads are mostly busy, or the work takes appreciable time, then dedicated workers are more appropriate.
As for exceptions: don't let exceptions escape your workers. You must catch them. If all that means is that you give up and retry in a few seconds, that is probably fine.
You could model the whole thing with a producer/consumer approach: a producer puts new task descriptions in a queue, and multiple consumers (4 or 5 threads) process from the queue. The number of consumers/processing threads can vary depending on the load and the length of the queue.
Each task involves reading from DB, converting the format, sending to web server and then process the response from web server. I assume each task would do all these steps.
In case of exceptions for an item in the queue, you could potentially mark the queue item as failed and schedule it for a retry later.
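A sketch of that requeue-on-failure idea (the flaky TryProcess step and the attempt limit are invented for the example; real code would do the DB fetch, JSON conversion, and web call there):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class RetryDemo
{
    class WorkItem { public int Id; public int Attempts; }

    // Simulated flaky step: even ids fail on their first attempt.
    static bool TryProcess(WorkItem w) => w.Id % 2 != 0 || w.Attempts > 1;

    public static int Run(int itemCount, int consumers, int maxAttempts)
    {
        var queue = new ConcurrentQueue<WorkItem>(
            Enumerable.Range(1, itemCount).Select(i => new WorkItem { Id = i }));
        int pending = itemCount, succeeded = 0;

        var workers = Enumerable.Range(0, consumers).Select(_ => Task.Run(() =>
        {
            WorkItem w;
            while (Volatile.Read(ref pending) > 0)
            {
                if (!queue.TryDequeue(out w)) { Thread.Yield(); continue; }
                w.Attempts++;
                if (TryProcess(w))
                {
                    Interlocked.Increment(ref succeeded);
                    Interlocked.Decrement(ref pending);
                }
                else if (w.Attempts < maxAttempts)
                {
                    queue.Enqueue(w);                   // schedule a retry, don't lose the item
                }
                else
                {
                    Interlocked.Decrement(ref pending); // give up; would go to a failed-items table
                }
            }
        })).ToArray();

        Task.WaitAll(workers);
        return succeeded;
    }

    static void Main() => Console.WriteLine(RetryDemo.Run(10, 3, 3));
}
```

In production you would also delay the retry (e.g. stamp the item with a not-before time) rather than requeueing it immediately.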
What I have now is a real-time API getting a bunch of messages from the network and feeding them into a pubsub manager class. There may be up to 1000 msg/sec or more at times. There are two different threads, each connected to its own pubsub. Subscribers are WPF windows. The manager keeps a list of windows and their DispatcherSynchronizationContext.
A thread calls the manager through interface method.
Manager publishes through Post:
foreach (var sub in Subscribers[subName])
{
sub.Context.Post(sub.WpfWindow.MyDelegate, data);
}
Can this be optimised?
P.S. Please don't ask why I think it is slow. I don't have hard limits; every solution is "too slow" in some sense, and I have to make it as fast as possible. I am asking for help to assess: can it be done faster? Thank you.
EDIT: found this: http://msdn.microsoft.com/en-us/library/aa969767.aspx
The argument for a queue stands. What I do is put stuff into the queue; the queue triggers a task that then invokes into the messaging thread and pulls X items of data (1000, or however many there are). The one thing that killed me was permanent single-item invocation (which is slow), but doing it in batches works nicely. I can keep up with nearly zero CPU load on a very busy ES data feed during crazy times for time and sales.
I have a special set of components for that which I will open-source in the next few weeks, and it includes an ActionQueue (taking a delegate to call when items need processing). This is now a Task (it was a queued thread-pool work item before). I took time to process up to 1000 messages per invocation - but if you do a price grid you may need more.
Note: use WPF hints to enable GPU caching of rendered bitmaps.
In addition:
Run every window on its own thread / message pump
HEAVILY use async queues. The publisher should never block; every window has its own target queue that is async.
You want processing as decoupled as possible. Brutally decoupled.
Here is my suggestion for you:
I would use a ConcurrentQueue (from the System.Collections.Concurrent namespace). The background workers feed their messages into that queue. The UI thread uses a timer to draw a bunch of messages out of the queue (say every 500 ms) and show them to the user. Alternatively, the UI thread could do that only on demand from the user. The ConcurrentQueue is designed to be used from different threads concurrently (as the name says ;-) ).
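A console-only sketch of that design (in the real app the drain loop body would live in a DispatcherTimer callback on the UI thread, and the producers would be the two pubsub threads):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class BatchDrain
{
    public static int Run(int producers, int messagesEach)
    {
        var queue = new ConcurrentQueue<string>();

        // Background feeds: enqueue and return immediately, never blocking.
        var feeds = Enumerable.Range(0, producers).Select(p => Task.Run(() =>
        {
            for (int i = 0; i < messagesEach; i++)
                queue.Enqueue("feed" + p + ":" + i);
        })).ToArray();
        Task.WaitAll(feeds);

        // "UI tick": drain whatever has accumulated, in one batch,
        // instead of one Post per message.
        int shown = 0;
        string msg;
        while (queue.TryDequeue(out msg)) shown++;
        return shown;
    }

    static void Main() => Console.WriteLine(BatchDrain.Run(2, 1000));
}
```

The win over the per-message Post in the question is exactly the batching: one dispatcher hop per tick instead of one per message.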
Since it's a long question, cliff notes come first.
Cliff notes:
One client sends input to several services and they keep on working and sending results until the client tells them to stop or they have reached a pre-set maximum number of results.
Do you know how one should go about implementing this, or do you have a C# example of something like this? Are WCF and streaming the right toolset for this? (Consider that results are custom objects, so it's not exactly the same as streaming a file.)
More Detailed Problem Definition:
Situation:
I have full control over the code of the client and the services (i.e., not dependent on closed 3rd-party stuff)
everything is in C#
We have one client who wants to get one task done and has several equal independent services for that.
(equal = equal service-software, the hardware on which each service runs can vary -> service-speeds can vary)
One task consists of "1000 pieces of work" which are all independent from one another.
Within one task all of the 1000 pieces of work are based upon the same piece of input data.
I mention solutions A+B since I think they help explaining the problem:
Solution (A) - The slow non-parallel way:
1. Client sends input to one service.
2. Service initializes based upon the input.
3. Service processes all 1000 pieces of work
(results get added up (super fast, btw), so the result of 1000 pieces of work has the same size as the result of one)
4. Service sends result to the client.
5. Client receives result and is happy
Solution (B) - Parallel faster way:
Let's say ten services, so we evenly split it up and each should process 100.
The problem is some services may be much faster than others, so giving each the same number (100) is slower than necessary.
Furthermore we can't split the work up according to an a-priori speed test, since the speed of a service can change and some might even go down during processing. These are the reasons why I think the following would be best for my purpose.
Solution (C) - The way I would like to implement it:
The client sends the same request to all services. (The same request still implies the task gets processed in parallel; parallelization is super easy for my problem - the 1000 pieces of work are so independent that doing the "first" piece of work 1000 times means we are done.)
A service keeps working and sending results until it is told to stop or has processed 1000 pieces of work.
One result gets sent for 10 pieces of work done.
This means all services work on the task in parallel, and when the client has received a total of 1000 results across all service replies combined, it sends the stop signal.
That means normally no single service should reach 1000, but with the 1000 cap we have covered the situation where there is only one service, and we have a fail-safe against infinite loops if the stop signal gets lost. (The client neither needs to wait nor to be absolutely sure that the stop signal has reached a service.)
Throwing away additional results beyond our goal of 1000 is fine.
(The alternative - making follow-up requests to services that have responded faster than others - would incur the overhead of wasted time from messages going back and forth and from additional initializations. (The additional initializations could be avoided, but it would be complicated and the other overhead would remain.))
I basically have solutions/would know how to implement A+B but I have no clue how I would go about realizing (C).
How do you implement a client/service architecture in C# where the service keeps sending results rather than just returning one object/value? (Results are custom objects, btw.)
Does someone know of C# example code where something like that is implemented? Would streaming be the right way?
I've found the "writing a custom stream" example, but it seems like a pretty long way from there to what I want. (As a WCF noob I can easily be wrong on that, though.)
Streaming in WCF doesn't work in such a way that you open a stream, return it to the client, and the service keeps generating results into it. If you want to work that way, you must go deeper and use sockets directly. In WCF, the stream must be fully written before it is returned from the operation (I tried writing to a returned stream from another thread, but it didn't work). Streaming in WCF is only for data transport.
I don't like any of your solutions. I would try one of these:
A variant of B, but tasks will not be divided equally upfront. If you have 10 services and 1000 tasks, you send the first 10 tasks (one to each service), and only after a service returns its result does it get another task. If tasks can be completed within a reasonable time, you only need multiple async calls to the services and to wait for responses. If a service cannot complete a task within a defined timeout, you send the task to another service. If tasks complete quickly, you can send small batches instead of single tasks. If task completion takes long, you will need duplex communication.
Use a transactional message queue - MSMQ. Your client generates 1000 messages into a "producer queue", and the services take these messages one by one and process them. They send results as messages to another "consumer queue", where the client takes the results and processes them (each result must carry a correlation to its task). The transactional queue ensures that each task is processed by only a single service, but if a service fails or the transaction times out, the task becomes available for processing by another service. MSMQ also offers additional features like a queue for faulty tasks, etc. This is a somewhat advanced scenario. The main limitation of this scenario can be message size (max. 4 MB per message).
Edit:
OK, given your clarification, it looks like you need to send the same task to multiple services, and the task just triggers a series of the same computation on the same data. You can achieve it this way:
Build a duplex service using Net.tcp binding
Service will implement service contract which will have operations to start computation and to stop computation (you can use IsInitiating and IsTerminating properties of OperationContract)
Service will do computation in separate thread started in start operation
Stop operation will abort computation thread
Client will implement callback contract to receive results from the service
Service will call the client callback when the processing thread has result (or multiple results) to send back
Here is an example of using duplex services with WSDualHttpBinding - but don't use that binding in your scenario, because it is much more complicated when a single client must communicate over duplex HTTP with multiple identical services.
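The contract shapes for the steps above might look like this (all names here are hypothetical, and this is only a sketch of the duplex layout, not a complete service):

```csharp
using System.ServiceModel;

// Callback contract the client implements to receive results as they appear
// (one callback per batch of ~10 pieces of work in the question's scenario).
public interface IComputationCallback
{
    [OperationContract(IsOneWay = true)]
    void OnResult(string result);   // would be the custom result object in real code
}

// Duplex service contract; requires a session so that
// IsInitiating/IsTerminating make sense.
[ServiceContract(SessionMode = SessionMode.Required,
                 CallbackContract = typeof(IComputationCallback))]
public interface IComputationService
{
    [OperationContract(IsInitiating = true)]
    void Start(string input);       // spawns the computation thread server-side

    [OperationContract(IsTerminating = true)]
    void Stop();                    // signals the computation thread to stop
}
```

The service implementation would grab the callback channel via OperationContext.Current.GetCallbackChannel<IComputationCallback>() inside Start and call OnResult from its worker thread; hosted over a Net.tcp binding as suggested above.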
What you describe as Solution (C) sounds like a good fit for asynchronous WCF. Some of these might help:
Synchronous and Asynchronous Operations
Asynchronous Programming Design Patterns
How to: Call WCF Service Operations Asynchronously
I've got a program I'm creating (in C#) and I see two approaches:
1) A job manager that waits for any of the X threads to finish; when one finishes, it gets the next chunk of work, creates a new thread, and gives it that chunk
or
2) We create X threads to start and give each a chunk of work; when a thread finishes a chunk, it asks the job manager for more work. If there isn't any more work, it sleeps and then asks again, with the sleep becoming progressively longer.
This program will be run-and-done, though I could see it turning into a service that continually looks for more jobs.
Each chunk consists of a number of data ids, a call to the database to get some info or perform an operation on each data id, and then writing info about the data id back to the database.
Assuming you are aware of the additional precautions needed for multithreaded database operations, it sounds like you're describing two different scenarios. In the first, several threads run, and only once ALL of them finish does the manager look for new work. In the second, several threads run and their operations are completely independent. Your environment determines the proper approach: if something ties the work in the several threads together such that additional work cannot continue until all of them are finished, go with the former. If they don't have much effect on each other, go with the latter.
The second option isn't really right as stated: making the sleep time progressively longer means those threads sit idle unnecessarily and pick up new work late.
Rather, you should have a pooled set of threads like the second option, but they use WaitHandles to wait for work and use a producer/consumer pattern. Basically, when the producer indicates that there is work, it sends a signal to a consumer (there will be a manager which will determine which thread will get the work, and then signal that thread) which will wake up and start working.
You might want to look into the Task Parallel Library. It's in beta now, but if you can use it and are comfortable with it, I would recommend it, as it will manage a great deal of this for you (and much better, taking into account the number of cores on a machine, the optimal number of threads, etc.).
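A sketch of the signalled worker-pool idea: BlockingCollection<T> does the WaitHandle bookkeeping internally, so workers block until work arrives instead of sleeping and polling with back-off. (The chunk type and counts are placeholders.)

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class WorkerPool
{
    public static int Run(int workers, int chunks)
    {
        var work = new BlockingCollection<int>();
        int done = 0;

        // Workers block inside GetConsumingEnumerable until signalled.
        var pool = Enumerable.Range(0, workers).Select(_ => Task.Run(() =>
        {
            foreach (var chunk in work.GetConsumingEnumerable())
                Interlocked.Increment(ref done);   // stand-in for the per-chunk DB work
        })).ToArray();

        // The "job manager" produces chunks as they become available.
        for (int i = 0; i < chunks; i++) work.Add(i);
        work.CompleteAdding();                     // lets idle workers exit cleanly

        Task.WaitAll(pool);
        return done;
    }

    static void Main() => Console.WriteLine(WorkerPool.Run(4, 100));
}
```

CompleteAdding gives the run-and-done shutdown; a long-lived service would simply never call it.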
The former solution (spawn a thread for each new piece of work), is easier to code, and not too bad, if the units of work are large enough.
The second solution (thread-pool, with a queue of work), is more complicated to code, but supports smaller units of work.
Instead of rolling your own solution, you should look at the ThreadPool class in the .NET Framework. You could use the QueueUserWorkItem method; it should do exactly what you want to accomplish.
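A minimal QueueUserWorkItem sketch (the CountdownEvent is just one way to know when every queued item has run; the per-item body is a placeholder):

```csharp
using System;
using System.Threading;

class QueueDemo
{
    public static int Run(int items)
    {
        int done = 0;
        using (var allDone = new CountdownEvent(items))
        {
            for (int i = 0; i < items; i++)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    Interlocked.Increment(ref done);   // the per-chunk DB work goes here
                    allDone.Signal();
                });
            }
            allDone.Wait();   // block until every queued item has executed
        }
        return done;
    }

    static void Main() => Console.WriteLine(QueueDemo.Run(50));
}
```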