I have researched a lot and haven't found anything that meets my needs. I'm hoping someone on SO can shed some light on this.
I have an application where the expected load is thousands of jobs per customer, and I can have hundreds of customers. Currently it is 50 customers with close to 1,000 jobs each. These jobs are time sensitive (scheduled by the customer) and each can run up to 15 minutes.
In order to scale and match the schedules, I'm planning to run this as a multi-threaded application on a single server. So far so good. But the business wants to scale further (as needed) by adding more servers into the mix. Currently, when jobs become ready in the database, a console application picks up the first 500, uses the Task Parallel Library to spawn 10 threads, and waits until they are complete. I can't scale this to another server because that one could pick up the same records. I also can't just mark the db record as being processed, because if the application crashes on one server the job will be left in limbo.
I could use a message queue and have multiple machines pick from it. The problem with this is that the queue has to be transactional to handle crashes. Since a database is involved, MSMQ would require an MS DTC transaction, and I'm not really comfortable with DTC transactions, especially with multiple threads and multiple machines. Too much maintenance and setup, and possibly unknown issues.
Is SQL Service Broker a good approach instead? Has anyone done something like this in a production environment? I also want to keep the transactions short (a job could run for 15-20 minutes, mostly streaming data from a service). The only reason I'm using a transaction is to maintain the message integrity of the queue. I need the job to be picked up again if it crashes (i.e. reappear in the queue).
Any words of wisdom?
Why not have an application receive the jobs and insert them into a table that serves as the job queue? Each worker process can then pick up a set of jobs, set their status to processing, complete the work, and set the status to done. Other info, such as the server name that processed each job and the start and end timestamps, could also be logged. Moreover, instead of using multiple threads, you could use independent worker processes to make your programming easier.
[EDIT]
SQL Server supports record-level locking, and lock escalation can also be prevented. See Is it possible to force row level locking in SQL Server?. Using such a mechanism, you can have your worker processes take exclusive locks on the jobs being processed until they are done or crash (thereby releasing the lock).
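A minimal sketch of that locking approach, assuming the System.Data.SqlClient provider and a hypothetical dbo.Jobs table with Id, Status and ScheduledAt columns; the UPDLOCK/ROWLOCK/READPAST hints let each worker claim a different row, and the open transaction holds the lock until the work commits (or the process dies, at which point the row becomes visible to other workers again):

```
using System;
using System.Data.SqlClient;

class JobWorker
{
    // Claim one ready job, hold the row lock while processing, release it on commit.
    // Table and column names (dbo.Jobs, Id, Status, ScheduledAt) are illustrative assumptions.
    public static void ProcessOneJob(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                var claim = new SqlCommand(
                    @"SELECT TOP (1) Id
                      FROM dbo.Jobs WITH (UPDLOCK, ROWLOCK, READPAST)
                      WHERE Status = 'Ready'
                      ORDER BY ScheduledAt;", connection, transaction);

                object id = claim.ExecuteScalar();
                if (id == null) return;              // nothing ready right now

                DoTheWork((int)id);                  // note: this keeps the connection and
                                                     // transaction open for the life of the job

                var done = new SqlCommand(
                    "UPDATE dbo.Jobs SET Status = 'Done' WHERE Id = @Id;",
                    connection, transaction);
                done.Parameters.AddWithValue("@Id", (int)id);
                done.ExecuteNonQuery();

                transaction.Commit();                // releases the row lock
            }                                        // a crash rolls back and the job reappears
        }
    }

    static void DoTheWork(int jobId) { /* the actual 15-minute job */ }
}
```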
Related
I have a problem that has three components and two bottlenecks:
Downloading data from an api (I/O bound)
Processing the data (CPU bound)
Saving results to a database (CPU bound)
Going through the process of querying the api, processing the data, and saving the results takes me about 8 hours. There are a total of about 830 jobs to process. I would like to speed up my .Net console application in any way by using parallelism. I've read many posts about the Producer Consumer problem, but I don't know how to apply that knowledge in this situation.
Here is what I'm imagining: There are two queues. The first queue stores api responses. I only want a certain number of workers at a time querying the api and putting things in the queue. If the queue is full, they will have to wait before putting their data on the queue.
At the same time, other workers (as many as possible) are pulling responses off the queue and processing them. Then, they are putting their processed results onto the second queue.
Finally, a small number of workers are pulling processed results from the second queue, and uploading them to the database. For context, my production database is SQL Server, but my Dev database is Sqlite3 (which only allows 1 write connection at a time).
How can I implement this in .Net? How can I combine I/O-bound concurrency with CPU-bound concurrency, while having explicit control over the number of workers at each step? And finally, how do I implement these queues and wire everything up? Any help/guidance is much appreciated!
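One way to wire this up is TPL Dataflow, sketched below; DownloadAsync, Process and SaveAsync are placeholders for your own api call, CPU work and database write. BoundedCapacity gives the "wait when the queue is full" behaviour, and MaxDegreeOfParallelism is the explicit worker count per stage (keep the save stage at 1 for Sqlite):

```
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class Pipeline
{
    public static async Task RunAsync(int[] jobIds)
    {
        // Stage 1: I/O bound - a few concurrent api calls, bounded buffer.
        var download = new TransformBlock<int, string>(
            id => DownloadAsync(id),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4, BoundedCapacity = 20 });

        // Stage 2: CPU bound - roughly one worker per core.
        var process = new TransformBlock<string, Result>(
            response => Process(response),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = System.Environment.ProcessorCount,
                BoundedCapacity = 20
            });

        // Stage 3: database writes - keep this small (1 for Sqlite).
        var save = new ActionBlock<Result>(
            result => SaveAsync(result),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 20 });

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(process, linkOptions);
        process.LinkTo(save, linkOptions);

        foreach (var id in jobIds)
            await download.SendAsync(id);   // SendAsync waits when the buffer is full

        download.Complete();
        await save.Completion;
    }

    class Result { }
    static Task<string> DownloadAsync(int id) => Task.FromResult("response");   // placeholder
    static Result Process(string response) => new Result();                     // placeholder
    static Task SaveAsync(Result r) => Task.CompletedTask;                      // placeholder
}
```

The bounded buffers are what keep the api workers from racing ahead of the CPU stage; System.Threading.Channels would give you a very similar shape if you prefer it over Dataflow.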
I'm creating a design to process a large number of jobs using MSMQ to scale out. Each job is processed, and the database is updated for that job Id to say it is complete. If there is an error, the job should go back to the queue. So, I need a transactional MSMQ. Now, I could process the job successfully, yet updating the db record might fail for whatever reason. In this case too, I would need the job back in the queue so it can be re-attempted and saved to the db successfully. This means I will need to enable MS DTC to manage the transaction across the queue and the database server.
I am reading Pro MSMQ book and it mentions "For most applications, transactions are not needed at all. There is a tendency to overuse transactions and affect the performance of the entire system unnecessarily. Before deciding to use transactions, analyze the ACID properties requirement of the entire system and the performance impact.".
I'm failing to understand which cases would not involve a database update. Wouldn't any record that is picked up from a queue need to be processed and updated somewhere? I would think about 90% of systems need this type of functionality. I get it if it were a middle-layer queue that passes messages on to another queue or something like that, but for processing the last record in the pipeline, isn't MS DTC always required?
Thoughts?
Edit:- the full text:
Transactional messaging offers a lot of benefits, such as message integrity and message order, over nontransactional messaging, but the performance price you pay for using transactional messaging is huge. Use internal transactions only if it is absolutely essential to maintain the order of messages in the queue. External transactions offer the benefit of propagating the transaction context across multiple resource managers. Such transactions are useful when there is a large-scale distributed system with multiple databases and message queues, and the transactional integrity between the messages exchanged between these resource managers is critical. The overhead incurred while using external transactions is significantly more than the one incurred by internal Message Queuing transactions. For most applications, transactions are not needed at all. There is a tendency to overuse transactions and affect the performance of the entire system unnecessarily. Before deciding to use transactions, analyze the ACID properties requirement of the entire system and the resulting performance impact.
As the book says, the main problem with enlisting in these transactions is that you are sacrificing performance at high volumes by locking multiple resources. There are alternative ways for you to achieve your goals without losing consistency. There is always a trade-off between consistency, availability and partition tolerance (read about the CAP theorem) and you need to decide which attributes of the system are needed to successfully meet the business requirements.
To tackle your problem without a transaction, instead of popping the message off the queue, you can Peek the message; if your processing succeeds, you pop the message and discard it. If the processing fails, you move the message to an error queue. Messages on the error queue can be automatically retried (it could have been a transitory issue). The error queue(s) should be actively monitored to ensure your system is processing correctly, i.e. you need alerts to fire if the error queue goes over a threshold or keeps growing.
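A rough sketch of that peek-then-pop pattern with System.Messaging, assuming string message bodies and hypothetical .\private$\jobs and .\private$\jobs_error queues; the message is only removed once processing succeeds, and failures are shunted to the error queue:

```
using System;
using System.Messaging;

class PeekWorker
{
    public static void Run()
    {
        using (var queue = new MessageQueue(@".\private$\jobs"))
        using (var errorQueue = new MessageQueue(@".\private$\jobs_error"))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });

            while (true)
            {
                Message message;
                try
                {
                    message = queue.Peek(TimeSpan.FromSeconds(5));   // look without removing
                }
                catch (MessageQueueException e)
                    when (e.MessageQueueErrorCode == MessageQueueErrorCode.IOTimeout)
                {
                    continue;                                        // queue was empty, keep waiting
                }

                try
                {
                    ProcessJob((string)message.Body);
                    queue.ReceiveById(message.Id);                   // success: now actually remove it
                }
                catch (Exception)
                {
                    var failed = queue.ReceiveById(message.Id);      // pull it off the main queue
                    errorQueue.Send(failed);                         // and park it for retry/alerting
                }
            }
        }
    }

    static void ProcessJob(string body) { /* the actual work */ }
}
```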
Note that this approach won't work for a single queue with multiple processors as you have commented. This will work where you partition your data and messages and bind a processor to a queue.
E.g. I'm doing central sales processing for a chain of retailers. I might say that I process US east coast retailers on queue QA and west coast retailers on queue QB. I have a processor PA bound to queue QA (it can be multiple exe's or threads in a single exe) and a processor PB bound to QB. This way messages are processed in order for each distinct entity in the system.
The key is to pick the right data partitioning scheme so that work is spread evenly.
I am developing a data optimization project. It has a client-side web app to receive tasks from users. Let's say the tasks are heavy calculations that cannot easily be done by normal systems. The program takes a long while when a large amount of data has to be calculated. So what I am trying to do is receive calculation orders from my web application and have a Windows service on my server side listening for new tasks to be done.
I would like my service to listen for data being inserted into my Tasks table and run the calculator based on the scheduled dates. I will, of course, have to deal with some multi-threading, and if the program is busy, other tasks may have to wait.
I also don't mind having a small GUI for my application to see which orders are now being processed and whether my service is busy or idle.
I first thought about adding SQL Server jobs to my database to query the table frequently and run the application based on the dates, but that does not look like a nice solution to me. What I want is a nimble, ready-to-serve service that becomes aware when there is new data in the database and decides what to do.
I'm not insisting on a Windows Service in particular here, so any good idea is welcome.
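If SQL Server is the backend, one option that fits the "becomes aware" requirement is SqlDependency (query notifications); a minimal sketch follows, assuming a hypothetical dbo.Tasks table with Id, ScheduledAt and IsProcessed columns and Service Broker enabled on the database:

```
using System;
using System.Data.SqlClient;

class TaskWatcher
{
    readonly string _connectionString;
    public TaskWatcher(string connectionString) { _connectionString = connectionString; }

    public void Start()
    {
        SqlDependency.Start(_connectionString);     // opens the notification infrastructure
        Subscribe();
    }

    public void Stop() => SqlDependency.Stop(_connectionString);

    void Subscribe()
    {
        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand(
            // Query notifications require explicit columns and two-part table names.
            "SELECT Id, ScheduledAt FROM dbo.Tasks WHERE IsProcessed = 0;", connection))
        {
            var dependency = new SqlDependency(command);
            dependency.OnChange += (sender, e) =>
            {
                // Each notification fires only once: re-subscribe, then hand off the work.
                Subscribe();
                ProcessPendingTasks();
            };

            connection.Open();
            command.ExecuteReader().Dispose();      // executing the command registers the subscription
        }
    }

    void ProcessPendingTasks() { /* pick up unprocessed rows and run the calculation */ }
}
```

If Service Broker can't be enabled, a short polling loop inside the service achieves the same effect, just less promptly.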
I've done something similar in Powershell.
What I have:
A "queue" table where requests gather
Requests come in by clients calling a procedure
A "queue history" table - request already "treated" go from "queue" to "queue_history" (delete trigger)
A powershell loop reading queue table
Requests from queue are started independently via start-job command, those are actually separate Powershell instances (my tasks are mostly some external exe calls, but I think could be usefull for stored procedure calls also).
The powershell loop also report his "heartbeat" to a table.
Also watches a table and file existance for STOP flag so I can stop it.
We're using RabbitMQ for storing lightweight messages that we eventually want to store in our SQL Server database. There will be times when the queue is empty and times when there is a spike of traffic - 30,000 messages.
We have a C# console app running in the same server.
Do we have the console app run every minute or so and grab a designated number of items off the queue for insertion into the database? (taking manageable bites)
OR
Do we have the console app always "listen" and hammer items into the database as they come in? (more aggressive approach)
Personally I'd go for the first approach. During those "spike" times, you're going to be hammering the database with potentially 30,000 inserts. Whilst this could potentially complete quite quickly (depending on many variables outside the scope of this question), we could do this a little smarter.
Firstly, by periodically polling, you can grab "x" messages from the queue and bulk insert them in a single go (performance-wise, you might want to tweak the two variables here: the polling time and how many you take from the queue).
One problem with this approach is that you might end up falling behind during busy periods. So you could make your application change its polling time based on how many messages it is receiving, whilst keeping it between some min/max thresholds. E.g. if you suddenly get a spike and grab 500 messages, you might decrease your poll time. If on the next poll you can still get a thousand, do it again and decrease the poll time further. As the number you are able to get drops off, you can then begin increasing your polling time again once it falls under a particular threshold.
This would give you the best of both worlds imho and be reactive to the spike/lull periods.
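A minimal sketch of that adaptive polling idea, with TakeFromQueue and BulkInsert as hypothetical placeholders for the queue read and the database write; the delay halves while batches come back full and doubles again as they thin out, clamped between a min and a max:

```
using System;
using System.Collections.Generic;
using System.Threading;

class AdaptivePoller
{
    static readonly TimeSpan MinDelay = TimeSpan.FromSeconds(1);
    static readonly TimeSpan MaxDelay = TimeSpan.FromMinutes(1);
    const int BatchSize = 500;

    public static void Run()
    {
        var delay = MaxDelay;

        while (true)
        {
            List<string> batch = TakeFromQueue(BatchSize);    // grab up to x messages
            if (batch.Count > 0)
                BulkInsert(batch);                            // one bulk write per poll

            // Busy: full batches -> poll sooner. Quiet: small batches -> back off.
            if (batch.Count >= BatchSize)
                delay = TimeSpan.FromTicks(Math.Max(MinDelay.Ticks, delay.Ticks / 2));
            else if (batch.Count < BatchSize / 10)
                delay = TimeSpan.FromTicks(Math.Min(MaxDelay.Ticks, delay.Ticks * 2));

            Thread.Sleep(delay);
        }
    }

    static List<string> TakeFromQueue(int max) => new List<string>();  // placeholder
    static void BulkInsert(List<string> messages) { }                  // placeholder
}
```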
It depends a bit on your requirements, but I would create a service that uses SqlBulkCopy to bulk insert every couple of minutes. This is by far the fastest approach. Also, if your spike is 30k records, I would not worry too much about falling behind.
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
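A short sketch of that approach, assuming a hypothetical dbo.Messages destination table; the messages gathered since the last run go to SQL Server in a single bulk operation:

```
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class BulkWriter
{
    // Pushes one batch of queued messages into SQL Server in a single bulk operation.
    // dbo.Messages and its columns are illustrative assumptions.
    public static void WriteBatch(string connectionString, IEnumerable<string> messageBodies)
    {
        var table = new DataTable();
        table.Columns.Add("Body", typeof(string));
        table.Columns.Add("ReceivedAt", typeof(DateTime));

        foreach (var body in messageBodies)
            table.Rows.Add(body, DateTime.UtcNow);

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.Messages";
            bulkCopy.ColumnMappings.Add("Body", "Body");
            bulkCopy.ColumnMappings.Add("ReceivedAt", "ReceivedAt");
            bulkCopy.WriteToServer(table);
        }
    }
}
```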
We have a C# console app running in the same server.
Why not a Service?
What I would do is have the console app always listen to RabbitMQ and then, in the console app, build your own queue for inserting into the database; that way you can throttle the database insertion. By doing this you can control the flow in busy times by only allowing so many tasks at once, and in slow times you get a faster reaction than polling every so often. The way I would do this is by raising an event so you know there is something to do in the queue, and you can check the queue length to see how many transactions you want to process.
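A rough sketch of that in-process throttle using a bounded BlockingCollection; the actual RabbitMQ consumer wiring is omitted, OnMessageReceived stands in for whatever handler your consumer's Received event calls, and InsertBatch is a placeholder for the database write. The bounded capacity is what applies back-pressure to the listener when the writer can't keep up:

```
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ThrottledWriter
{
    // At most 5,000 messages buffered in memory; Add blocks beyond that.
    static readonly BlockingCollection<string> Buffer = new BlockingCollection<string>(5000);

    // Wire this up to your RabbitMQ consumer's Received event.
    public static void OnMessageReceived(string messageBody)
    {
        Buffer.Add(messageBody);            // blocks when the buffer is full -> natural back-pressure
    }

    public static Task StartWriterAsync()
    {
        return Task.Run(() =>
        {
            var batch = new List<string>(500);
            foreach (var message in Buffer.GetConsumingEnumerable())
            {
                batch.Add(message);
                // Flush either when the batch is big enough or the buffer has drained.
                if (batch.Count >= 500 || Buffer.Count == 0)
                {
                    InsertBatch(batch);
                    batch.Clear();
                }
            }
        });
    }

    static void InsertBatch(List<string> messages) { /* bulk insert into the database */ }
}
```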
Instead of using a Console Application, you could set up a Windows Service, and set up a timer on the service to poll every n minutes. Take a look at the links below:
http://www.codeproject.com/Questions/189250/how-to-use-a-timer-in-windows-service
http://msdn.microsoft.com/en-us/library/zt39148a.aspx
With a Windows Service, if the server is re-booted, the service can be set up to restart.
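A minimal sketch of that arrangement, assuming a ServiceBase-derived service with a System.Timers.Timer; PollQueue stands in for the actual dequeue-and-insert work, and the interval is whatever "n minutes" you settle on:

```
using System.ServiceProcess;
using System.Timers;

public class QueuePollingService : ServiceBase
{
    private readonly Timer _timer = new Timer();

    public QueuePollingService()
    {
        ServiceName = "QueuePollingService";
    }

    protected override void OnStart(string[] args)
    {
        _timer.Interval = 60 * 1000;                     // poll every minute (tune as needed)
        _timer.Elapsed += (sender, e) => PollQueue();
        _timer.Start();
    }

    protected override void OnStop()
    {
        _timer.Stop();
    }

    private void PollQueue()
    {
        // Pull a batch of messages from the queue and insert them into the database.
    }

    public static void Main()
    {
        ServiceBase.Run(new QueuePollingService());
    }
}
```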
I am quite confused about which approach to take and what is best practice.
Let's say I have a C# application which does the following:
sends emails from a queue. The emails to send and all their content are stored in the DB.
Now, I know how to make my C# application almost scalable but I need to go somewhat further.
I want some way to distribute the tasks across, say, X servers, so it is not just one server doing all the processing but the work is shared amongst the servers.
If one server goes down, then the load is shared between the other servers. I know NLB does this, but I'm not looking for NLB here.
Sure, you could add a column of some kind in the DB table to indicate which server should be assigned to process that record, and each of the applications on the servers would have an ID of some kind that matches the value in the DB so they only pull their own records - but I consider this cheap, bad practice, and unrealistic.
Having a DB table row lock as well, is not something I would do due to potential deadlocks and other possible issues.
I am also NOT suggesting using threading "to the extreme" here, but yes, there will be a thread per item to process, or items batched up per thread for x number of threads.
How should I approach this, and what do you recommend for making a C# application which is scalable and has high availability? The aim is to have X servers, each with the same application, and for each to be able to get records and process them, but with the processing load/items to process shared amongst the servers, so that if one server or service fails, the others can take on that load until another server is put back.
Sorry for my lack of understanding or knowledge, but I have been thinking about this quite a lot and have lost sleep trying to come up with a good, robust solution.
I would be thinking of batching up the work, so each app only pulls back x number of records at a time, marking those retrieved records as taken with a bool field in the table. I'd amend the SELECT statement to pull only records not marked as taken/done. Table locks would be OK in this instance for very short periods to ensure there is no overlap of apps processing the same records.
EDIT: It's not very elegant, but you could have a datestamp and a status for each entry (instead of a bool field as above). Then you could run a periodic Agent job which runs a sproc to reset the status of any records which have a status of In Progress but which have gone beyond a time threshold without being set to complete. They would be ready for reprocessing by another app later on.
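As a sketch of that claim-and-reset pattern (table and column names such as dbo.EmailQueue, Status and ClaimedAt are assumptions), the claim can be done in one atomic UPDATE so two servers never grab the same rows, and the reset statement is what the periodic Agent job's sproc would run:

```
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class BatchClaimer
{
    // Atomically mark a batch as In Progress for this server and return the claimed ids.
    const string ClaimSql = @"
        UPDATE TOP (@BatchSize) dbo.EmailQueue
        SET Status = 'InProgress', ClaimedAt = GETUTCDATE(), ClaimedBy = @Server
        OUTPUT inserted.Id
        WHERE Status = 'Pending';";

    // What the periodic Agent job / sproc would run to free up stuck work.
    const string ResetSql = @"
        UPDATE dbo.EmailQueue
        SET Status = 'Pending', ClaimedAt = NULL, ClaimedBy = NULL
        WHERE Status = 'InProgress'
          AND ClaimedAt < DATEADD(MINUTE, -30, GETUTCDATE());";

    public static void ClaimBatch(string connectionString, int batchSize)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(ClaimSql, connection))
        {
            command.Parameters.AddWithValue("@BatchSize", batchSize);
            command.Parameters.AddWithValue("@Server", Environment.MachineName);

            connection.Open();
            var claimedIds = new List<int>();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    claimedIds.Add(reader.GetInt32(0));
            }

            foreach (var id in claimedIds)
                ProcessRecord(id);       // each claimed Id belongs to this server only
        }
    }

    static void ProcessRecord(int id) { /* send the email, then set Status = 'Done' */ }
}
```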
This may not be enterprise-y enough for your tastes, but I'd bet my hide that there are plenty of apps out there in the enterprise which are just as un-sophisticated and work just fine. The best things work with the least complexity.