Handle many tasks, many times with Task Parallel - c#

This may not be suitable here, please feel free to move, shout or abuse if so.
We currently have a console application that gets started by another application and is passed the ID of a 'job'; this job will have multiple records that need to be processed. A simple explanation of the flow would be:
Starts 50 threads
Gets records to be processed.
if records > 0 see what threads are not still busy and send it some information.
if records = 0 update something else and exit.
Get more records.
Loop.
Now, I am looking to convert this into a 'polling' service that is continually running and processes new records when they become available. Converting what I have is fairly simple, but the threading code is old and probably outdated.
I was looking to refactor most if not all of it and use the Task Parallel Library (TPL) to process the items. However, I am struggling to find a suitable framework for polling and then processing the items, and was looking for suggestions on how to achieve this.
Pretty vague I know, but hopefully enough to give some kind of input.
Many thanks

From my experience and this msdn quote:
More efficient and more scalable use of system resources.
Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms (like hill-climbing) that determine and adjust to the number of threads that maximizes throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism. To complement this, widely-known work-stealing algorithms are employed to provide load-balancing.
You simply shouldn't care about how many tasks is a good number, or how to create a system where you load balance the threading involved.
Simply use:
Task.Factory.StartNew(() => DoSomeWork());
Do this every time you want to run something asynchronously; the scheduler does all the smart work behind the scenes.
Now since you're likely to create tasks in a loop, please be extra-careful not to introduce a closure bug many people had (including me), which you can look up here.
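To make the closure issue concrete, here is a minimal sketch (the class and method names are made up for illustration). In a for loop there is a single loop variable shared by every lambda, so a lambda written as `() => i * 10` may run after `i` has already moved on. Copying the value into a variable declared inside the loop body gives each task its own copy:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ClosureDemo
{
    // Starts one task per index and returns the value each task computed.
    public static int[] RunWithLocalCopy(int count)
    {
        var tasks = new List<Task<int>>();
        for (int i = 0; i < count; i++)
        {
            // 'copy' is a fresh variable on every iteration, so each lambda
            // captures its own value instead of the shared loop variable 'i'.
            int copy = i;
            tasks.Add(Task.Factory.StartNew(() => copy * 10));
        }

        Task.WaitAll(tasks.ToArray());
        return tasks.Select(t => t.Result).ToArray();
    }
}
```

Without the local copy, the results depend on timing and can contain repeated or out-of-range values.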
I have a windows service that runs from 1 to 500 Tasks, and never had trouble.
Hope this helps,
Bab.

If you are polling for new records in a DB table, a better approach would be to install an INSERT trigger (and possibly also UPDATE and DELETE triggers) on this table and to send a message to your service when a new record is inserted.
See Posting Message to MSMQ from SQL Server on MSDN.

The "polling service" sounds like a nice case for an observable collection. There's Rx, a nice way to handle them (http://rxwiki.wikidot.com/101samples), which I think uses the TPL.

Related

How can I keep my parallel application across multiple servers from grabbing the same mongodb document for work?

So the question is long but pretty self-explanatory. I have an app that runs on multiple servers and uses parallel looping to handle objects coming out of a MongoDB collection. Since MongoDB forces me to allow multi-read access, I cannot stop multiple processes and/or servers from grabbing the same document from the collection and duplicating work.
The program is such that the app waits for information to appear, does some work to figure out what to do with it, then deletes it once it's done. What I hope to achieve is that if I could keep documents from being accessed at the same time, knowing that once one has been read it will eventually be deleted, I can speed up my throughput a bit overall by reducing the number of duplicates and allowing the apps to grab things that aren't being worked.
I don't think pessimistic is quite what I'm looking for but maybe I misunderstood the concept. Also if alternative setups are being used to solve the same problem I would love to hear what might be being used.
Thanks!
What I hope to achieve is that if I could keep documents from being accessed at the same time
The simplest way to achieve this is by introducing a dispatch process architecture. Add a dedicated process that just watches for changes and then delegates or dispatches the tasks out to multiple workers.
The process could utilise MongoDB Change Streams to access real-time data changes on a single collection, a database or an entire deployment. Once it receives a document from the stream, it just sends it to a worker for processing.
This should also reduce the number of workers trying to access the same tasks, and the need for them to have back-off logic.

Using TPL in .NET

I have to refactor a fairly time-consuming process in one of my applications and after doing some research I think it's a perfect match for using TPL. I wanted to clarify my understanding of it and ask if there are any more issues which I should take into account.
In a few words, I have a Windows service which runs overnight and sends out emails with data updates to around 10000 users. At present, the whole process takes around 8 hrs to complete. I would like to reduce it to 2 hrs max.
Application workflow follows steps below:
1. Iterate through all users list
2. Check if this user has to be notified
3. If so, create an email body by calling external service
4. Send an email
Analysis of the code has shown that step 3 is the most time-consuming one and takes around 3.5 sec to complete. This means that, when processing 10000 users, my application waits well over 6 hrs in total for a response from the external service! I think this is a good enough reason to try to introduce some asynchronous and parallel processing.
So, my plan is to use the Parallel class and its ForEach method to iterate through users in step 1. As I understand it, this should distribute the processing of each user onto a separate thread, making them run in parallel? The per-user operations are completely independent of each other and none returns a value. If any exception is thrown, it will be persisted in the logs db. With regard to step 3, I would like to convert the call to the external service into an async call. As I understand it, this would release the thread so it could be reused by the Parallel class to start processing the next user from the list?
I had a read through the MS documentation regarding TPL, especially the Potential Pitfalls in Data and Task Parallelism document, and the only point I'm not sure about is "Avoid Writing to Shared Memory Locations". I am using a local integer to count the total number of emails processed. As for the rest, I'm quite positive they're not applicable to my scenario.
My question, without any implementation as yet, is: is what I'm trying to achieve possible (especially the async/await part for the external service call)? Should I be aware of any other obstacles that might affect my implementation? Is there any better way of improving the workflow?
Just to clarify I'm using .Net v4.0
Yes, you can use the TPL for your problem. If you cannot influence your external source, then this might be the best way.
However, you can make the best gains if you can get your external source to accept batches. Because this source could actually optimize the performance. Right now you have a message overhead of 10000 messages to serialize, send, work on, receive and deserialize. This is stuff that could be done once. In addition, your external source might be able to optimize the work they do if they know they will get multiple records.
So the bottom line is: if you need to optimize locally, the TPL is fine. If you want to optimize your whole process for actual gains, try to find out if your external source can help you, because that is where you can make some real progress.
You didn't show any code, and I'm assuming that step 4 (send an e-mail) is not that fast either.
With the presented case, unless your external service from step 3 (create an email body by calling external service) processes requests in parallel and supports a good load of simultaneous requests, you will not gain much with this refactor.
In other words, test the external service and the e-mail server first for:
Parallel request execution
The way to test this is to send at least 2 simultaneous requests and observe how long it takes to process them.
If it takes about double the time of a single, the requests have some serial processing, either they're queued or some broad lock is being taken.
Load test
Go up to 4, 8, 12, 16, 20, etc, and see where it starts to degrade.
You should set a limit on the amount of simultaneous requests to something that keeps execution time above e.g. 80% of the time it takes to process a single request, assuming you're the sole consumer
Or a few requests before it starts degrading (e.g. divide by the number of consumers) to leave the external service available for other consumers.
Only then can you decide if the refactor is worth it. If you can't change the external service or the e-mail server, you must weigh whether they offer enough parallel capability without degrading.
Even so, be realistic. Don't let your service push the external service and the e-mail server to their limits in production.
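As a rough sketch of how such a limit could be applied on .NET 4.0, the loop below caps the number of simultaneous calls with ParallelOptions.MaxDegreeOfParallelism and counts sent emails with Interlocked.Increment, so the counter is not a plain shared write. The names Mailer, NotifyAll, shouldNotify, buildBody and send are hypothetical stand-ins for steps 2 to 4, not anything from the question's code, and the calls here are blocking rather than async:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class Mailer
{
    // Processes users in parallel with a cap on simultaneous external calls
    // and returns how many emails were sent.
    public static int NotifyAll(
        IEnumerable<int> userIds,
        int maxConcurrency,
        Func<int, bool> shouldNotify,   // step 2 (hypothetical)
        Func<int, string> buildBody,    // step 3: slow external call (hypothetical)
        Action<int, string> send)       // step 4 (hypothetical)
    {
        int sent = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxConcurrency };

        Parallel.ForEach(userIds, options, userId =>
        {
            if (!shouldNotify(userId)) return;

            string body = buildBody(userId);
            send(userId, body);

            // Atomic increment: safe even when many iterations finish at once.
            Interlocked.Increment(ref sent);
        });

        return sent;
    }
}
```

The value passed as maxConcurrency would come from the load test described above, for example the highest request count at which the external service has not yet started to degrade.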

Asynchronous Processing of Data

At the minute I am trying to put together an asynchronous tcp server to receive data which I then want to process, extracting values and inserting to sql server.
The basic concept I thought would be best is that once the data is received and confirmed as the entire message, the message should be passed off to some sort of collection to await processing on a FIFO basis, which will parse the values and insert them into SQL Server. I suppose this is what's known as the producer/consumer pattern.
I have been looking into the best collection / way of doing this and have so far seen BlockingCollection, ConcurrentCollection and BufferBlock using async/await, and I think this may be the way to go, but to be honest I'm not sure.
The best example I have found is on Stephen Cleary's blog, in particular this article:
http://blog.stephencleary.com/2012/11/async-producerconsumer-queue-using.html
My main reservation is that I in no way want to slow down or interrupt the receiving of messages, which to me would suggest using the multiple producer/consumer example at the above link, but what I want to know is:
Am I correct in this assumption, or is there a more suitable way of doing this in my scenario?
And if I'm correct in my assumption, could anyone suggest the best way of implementing this, taking my use case into consideration?
Any and all help is much appreciated.
At the minute I am trying to put together an asynchronous tcp server to receive data which I then want to process, extracting values and inserting to sql server.
There's a common pitfall with this kind of scenario. It is usually wrong to report success back to the client when the work has yet to be done. Most of the time I've seen this design, it's because of an efficiency "requirement" self-imposed by the developer, not by the client or for technical reasons. So first, take a step back and make absolutely sure that you do want to return a "successful completion" message to the client when the operation has not actually completed yet.
If you are sure that's what you want to do, then there's another question you must ask: is it acceptable to lose requests? That is, after you tell the client that the operation successfully completed, will the system still be stable if the operation does not actually ever complete?
The answer to that question is usually "no." At that point, the most common architectural solution is to have an out-of-process reliable queue (such as an Azure queue or MSMQ), with an independent backend (such as an Azure worker role or Win32 service) that processes the queue messages. This definitely complicates the architecture, but it is a necessary complication if the system must return completion messages early and must not lose messages.
On the other hand, if losing messages is acceptable, then you can keep them in-memory. It is only in this case that you can use one of the in-memory producer/consumer types mentioned on my blog. This is a very rare situation, but it does happen from time to time.
In general, I would avoid using BlockingCollection and friends for this sort of work. Doing so encourages you to architect the entire system into a single process, which is the enemy of scalability and reliability.
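For the rare case where losing messages really is acceptable, an in-memory producer/consumer queue might look roughly like the sketch below. It uses BlockingCollection<T> with a bounded capacity; the InMemoryQueue class and its Run method are illustrative names only, the producer loop stands in for the TCP receive loop, and the consumer body stands in for parsing and the SQL insert:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class InMemoryQueue
{
    // Feeds messages through a bounded in-memory queue to a single consumer
    // and returns them in the order the consumer handled them.
    public static List<string> Run(IEnumerable<string> messages, int capacity)
    {
        var handled = new List<string>();

        using (var queue = new BlockingCollection<string>(capacity))
        {
            // Consumer: blocks while the queue is empty and finishes once
            // CompleteAdding has been called and the queue has drained.
            Task consumer = Task.Factory.StartNew(() =>
            {
                foreach (string message in queue.GetConsumingEnumerable())
                {
                    handled.Add(message); // placeholder for parse + insert
                }
            }, TaskCreationOptions.LongRunning);

            // Producer: Add blocks when the queue is full, which applies
            // back-pressure to the receiver.
            foreach (string message in messages)
            {
                queue.Add(message);
            }

            queue.CompleteAdding();
            consumer.Wait();
        }

        return handled;
    }
}
```

Anything still in the queue is lost if the process dies, which is exactly the trade-off described above.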
I second Stephen Cleary's suggestion of using an out-of-process queue to manage the work. I disagree that this necessarily complicates the architecture, though - in fact, I think it can make things quite a bit simpler. Specifically, a major complication of the original requirement ("put together an asynchronous tcp server") disappears. Asynchronous TCP servers are a pain in the butt to write and easy to screw up - why not just skip that part altogether and be free to focus all of your energy on the post-processing code?
When I built a system like this, I used a Redis List as the task queue. Tasks were serialized to JSON, and clients would add their task to the queue with an RPUSH command. Worker processes retrieve the next task from the queue with BLPOP, do their thing, then go back to waiting for the next task.
Advantages:
No locks. All synchronization comes for free from Redis (or whatever task queue you choose).
Everything in the system is single-threaded. Multi-threading is hard.
I'm free to spin up as many worker processes as I want, across as many nodes as I want.

Advice on a TCP/IP based server (C#)

I was looking for some advice on the best approach to a TCP/IP based server. I have done quite a bit of looking on here and other sites and can't help thinking that what I have seen is overkill for the purpose I need it for.
I have previously written one on a thread-per-connection basis, which I now know won't scale well. What I was thinking was that, rather than creating a new thread per connection, I could use a ThreadPool and queue the incoming connections for processing, as time isn't a massive issue (provided they are processed within a minute or two of coming in).
The server itself will be used essentially for obtaining data from devices and will only occasionally have to send a response to the sending device to update settings (again not really time critical, as the devices are set up to stay connected for as long as they can, and if for some reason one becomes disconnected the response can wait until the next time it sends a message).
What I wanted to know is will this scale better than the thread per connection scenario (I assume that it will due to the thread reuse) and roughly what kind of number of devices could this kind of setup support.
Also, if this isn't deemed suitable, could someone possibly provide a link to or explanation of the SocketAsyncEventArgs method? I have done quite a bit of reading on the topic and seen examples but can't quite get my head around the order of events etc. and why certain methods are called at the time they are.
Thanks for any and all help.
I have read the comments but could anybody elaborate on these?
Though to be honest I would prefer the initial approach of rolling my own.

Windows service threading with looping and WCF

First off, I will be talking about some legacy code and we are trying to avoid changing it as much as possible. Also, my experience with windows services and WCF is a bit limited so some of the questions may be a bit newbie. Just to give a bit of context before the question.
We have an existing service that loops. It checks via a database call to see if it has records to process. If it does not find any records, it sleeps for 30 seconds and then wakes back up to try again.
I would like to add an entry point to this service that would allow me to pass a record to this service in addition to it processing the records from the database. So the basic flow would be.
Loop
* Read record from database
* If no record from DB, process any records that were passed in via the entry point.
* No records at all, sleep for 30 seconds.
My concern is this. Is it possible to implement this in one service such that I have the looping process but I also allow for calls to come in at any time and add additional items to a queue that can be processed within the loop. My concern is with concurrency and keeping the loop and the listener from stepping on each other.
I know this question may not be worded quite right but I am on the new side with working with this. Any help would be appreciated.
My concern is with concurrency and keeping the loop and the listener from stepping on each other.
This shouldn't be an issue, provided you synchronize access correctly.
The simplest option might be to use a thread safe collection, such as a ConcurrentQueue<T>, to hold your items to process. The WCF service can just add items to the collection without worry, and your next processing step would handle it. The synchronization in this case is really minimal, as the queue would already be fully thread safe.
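A minimal sketch of that idea follows, with hypothetical names (RecordProcessor, Submit, NextRecord) and string standing in for whatever record type the service uses. The WCF operation would call Submit from any thread, and the existing loop would call NextRecord once per pass:

```csharp
using System;
using System.Collections.Concurrent;

public class RecordProcessor
{
    // Thread-safe queue shared between the WCF entry point and the loop.
    private readonly ConcurrentQueue<string> _pushed = new ConcurrentQueue<string>();

    // Called by the WCF operation; safe to call from any thread.
    public void Submit(string record)
    {
        _pushed.Enqueue(record);
    }

    // One pass of the loop: the database is checked first, then the
    // records passed in via the entry point. Returns null when there is
    // nothing to do and the caller should sleep.
    public string NextRecord(Func<string> readFromDatabase)
    {
        string record = readFromDatabase();
        if (record != null) return record;

        string pushed;
        return _pushed.TryDequeue(out pushed) ? pushed : null;
    }
}
```

The loop itself stays as it is today: process whatever NextRecord returns, and sleep for 30 seconds when it returns null.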
In addition to Reed's excellent answer, you might want to persist the records in a MSMQ queue to prevent your service from losing records on shutdown, restart, or crash of your service.
