I have a problem that has three components and two bottlenecks:
Downloading data from an api (I/O bound)
Processing the data (CPU bound)
Saving results to a database (I/O bound)
Going through the process of querying the api, processing the data, and saving the results takes me about 8 hours. There are a total of about 830 jobs to process. I would like to speed up my .Net console application in any way by using parallelism. I've read many posts about the Producer Consumer problem, but I don't know how to apply that knowledge in this situation.
Here is what I'm imagining: There are two queues. The first queue stores api responses. I only want a certain number of workers at a time querying the api and putting things in the queue. If the queue is full, they will have to wait before putting their data on the queue.
At the same time, other workers (as many as possible) are pulling responses off the queue and processing them. Then, they are putting their processed results onto the second queue.
Finally, a small number of workers are pulling processed results from the second queue, and uploading them to the database. For context, my production database is SQL Server, but my Dev database is Sqlite3 (which only allows 1 write connection at a time).
How can I implement this in .NET? How can I combine I/O-based concurrency with CPU-based concurrency, while having explicit control over the number of workers at each step? And finally, how do I implement these queues and wire everything up? Any help/guidance is much appreciated!
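The two-queue layout described above could be sketched roughly as follows, assuming System.Threading.Channels (part of .NET Core 3.0 and later). The `Pipeline` class and the `download`, `process`, and `save` delegates are hypothetical placeholders for the real API call, CPU work, and database write; this is one possible wiring, not the only one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class Pipeline
{
    // Runs download -> process -> save with an explicit worker count per stage.
    public static async Task RunAsync<TJob, TRaw, TResult>(
        IEnumerable<TJob> jobs,
        Func<TJob, Task<TRaw>> download, int downloaders,
        Func<TRaw, TResult> process, int processors,
        Func<TResult, Task> save, int savers,
        int capacity = 16)
    {
        // All jobs are known up front, so the input queue is filled and closed immediately.
        var jobQueue = Channel.CreateUnbounded<TJob>();
        foreach (var job in jobs) jobQueue.Writer.TryWrite(job);
        jobQueue.Writer.Complete();

        // Bounded channels: WriteAsync waits while the queue is full (backpressure).
        var raw = Channel.CreateBounded<TRaw>(capacity);
        var results = Channel.CreateBounded<TResult>(capacity);

        var downloading = RunWorkers(downloaders, async () =>
        {
            await foreach (var job in jobQueue.Reader.ReadAllAsync())
                await raw.Writer.WriteAsync(await download(job));
        });
        // CPU-bound stage: each worker runs on a thread-pool thread via Task.Run.
        var processing = RunWorkers(processors, () => Task.Run(async () =>
        {
            await foreach (var item in raw.Reader.ReadAllAsync())
                await results.Writer.WriteAsync(process(item));
        }));
        var saving = RunWorkers(savers, async () =>
        {
            await foreach (var result in results.Reader.ReadAllAsync())
                await save(result);
        });

        await Task.WhenAll(
            CloseWhenDone(downloading, raw.Writer),
            CloseWhenDone(processing, results.Writer),
            saving);
    }

    private static Task RunWorkers(int count, Func<Task> worker) =>
        Task.WhenAll(Enumerable.Range(0, count).Select(_ => worker()));

    // Closes the next queue once every worker feeding it has finished (or failed).
    private static async Task CloseWhenDone<T>(Task stage, ChannelWriter<T> next)
    {
        try { await stage; next.Complete(); }
        catch (Exception ex) { next.Complete(ex); throw; }
    }
}
```

With SQLite as the dev database, `savers` would be 1; against SQL Server it could be raised. Error handling here is minimal: a failure in an upstream stage propagates downstream, but if every saver fails, upstream writers can stall on a full queue, so the delegates are assumed to handle their own retries.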
I'm using Azure CosmosDB with Azure Function. In function, I'm trying to save around 27 Million records.
The method AddBulk looks as below:
    public async Task AddBulkAsync(List<TEntity> documents)
    {
        try
        {
            List<Task> concurrentTasks = new();
            foreach (var item in documents)
            {
                concurrentTasks.Add(_container.CreateItemAsync<TEntity>(item, new PartitionKey(Convert.ToString(_PartitionKey?.GetValue(item)))));
            }
            await Task.WhenAll(concurrentTasks);
        }
        catch
        {
            throw;
        }
    }
I've around 27 Million records categorized by a property called batch. That means each batch holds around 82-85K records.
Currently, I'm creating a list of 27 Million records and sending that list to AddBulkAsync. It is causing a problem in terms of Memory and Throughput. Instead, I'm planning to send the data in batch:
Get distinct batchId from all the docs and create a batch list.
Loop batch list and get documents (i.e around 85K records) for the first batch (in a loop).
Call AddBulk method for documents (i.e around 85K records) of the first batch.
Continue the loop for the next batchId and so on.
Here, I want to understand the impact: this calls AddBulk thousands of times, and AddBulk in turn calls await Task.WhenAll(concurrentTasks) that many times internally.
Will it cause a problem? Are there any better approaches available to achieve such a scenario? Please guide.
Some suggestions in scenarios like this.
When doing bulk operations like these, be sure you enable bulk mode in the client options when instantiating the Cosmos client (in the .NET SDK v3 this is AllowBulkExecution = true on CosmosClientOptions). Bulk Mode works by queuing up operations, divided into groups based upon a partition key range for a physical partition within a container. (Logical key values, when hashed, fall within a range of hashes that are then mapped to the physical partition where they reside.) As the queue fills up the Cosmos client will dispatch all the queued-up items for that partition. If it does not fill up, it will dispatch what it has, full or not.
One way you can help with this is to try to batch updates such that the items you send are roughly evenly spread across logical partition key values within your container. This will allow the Cosmos client to more fully saturate the available throughput with batches that are as full as possible and sent in parallel. When data is not evenly distributed across logical partition key values, ingestion will be slower, since some queues fill up slowly and are dispatched only partly full. It is less efficient and will not fully utilize every ounce of throughput provisioned.
One other thing to suggest is, if at all possible, stream your data rather than batch it. Cosmos DB measures provisioned throughput per second (RU/s). You get only the amount of throughput you've provisioned on a per second basis. Anything over that and you are either rate limited (i.e. 429s) or the speed of operations is slower than it otherwise could be. This is because throughput is evenly spread across all partitions. If you provision 60K RU/s you'll end up with ~6 physical partitions, each of which gets 10K RU/s. There is no sharing of throughput between partitions. As a result, bursty bulk operations like this are less efficient than they could be.
One effective way to deal with all of this is to amortize your throughput over a longer period of time. Streaming is the perfect way to accomplish this. Streaming data, in general, requires less overall provisioned throughput because you are utilizing throughput over a longer period of time. This allows you to work with a nominal amount of throughput, rather than having to scale up and down again, and is just overall more efficient. If streaming from the source is not an option, you can accomplish this with queue-based load leveling by first sending data to a queue, then streaming from there into Cosmos DB.
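On the client side, one way to approximate streaming is to feed documents through a lazy sequence and cap the number of writes in flight, instead of materializing one huge list of tasks. The sketch below is a generic illustration under that assumption; `ThrottledWriter` and the `writeAsync` delegate are hypothetical names, with `writeAsync` standing in for a call such as `_container.CreateItemAsync(...)` from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledWriter
{
    // Sends every item through writeAsync while keeping at most
    // maxConcurrency calls in flight at any moment.
    public static async Task WriteAllAsync<T>(
        IEnumerable<T> items, Func<T, Task> writeAsync, int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var inFlight = new List<Task>();

        foreach (var item in items)   // items may be a lazy sequence
        {
            await gate.WaitAsync();   // wait for a free slot
            // Drop finished writes so the list does not grow with the input;
            // faulted tasks are kept so WhenAll can surface their exceptions.
            inFlight.RemoveAll(t => t.IsCompletedSuccessfully);
            inFlight.Add(WriteOneAsync(item));
        }
        await Task.WhenAll(inFlight);

        async Task WriteOneAsync(T item)
        {
            try { await writeAsync(item); }
            finally { gate.Release(); }
        }
    }
}
```

The right value for `maxConcurrency` depends on provisioned RU/s and document size, so it would need to be found by measurement; the sketch does not retry rate-limited (429) writes.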
Either way you go, these docs can help squeeze the most performance out of Azure Functions hosting bulk ingestion workloads into Cosmos DB.
Blog post on Bulk Mode
Bulk support improvements blog post
Advanced Configurations for Azure Function Cosmos DB Triggers
Manage Azure Function Connections (this is important because Direct TCP Mode in the Cosmos client can create LOTS of connections)
I have an Azure WebJob that loops through the pages of a file and processes them. The job also has an ICollector to an output queue:
[Queue("batch-pages-to-process")] ICollector<QueueMessageBatchPage> outputQueueMessage
I need to wait until all of the pages are processed before I send everything to the output queue, so instead of adding each message to the ICollector in my file processing loop, I add the messages to a list of queue messages:
List<QueueMessageBatchPage>
After all of the pages have been dealt with, I then loop through the list and add the messages to the ICollector:
foreach (var m in outputMessages)
{
outputQueueMessage.Add(m);
}
But this last part seems to take a long time. To add 300 queue messages, it takes almost 50 seconds. I don't have much to gauge by, but that seems slow. Is this normal?
There's no objective standard of slow vs. fast to offer you, but a few thoughts:
a) Part of the queuing time will be serialization of each QueueMessageBatchPage instance... the performance of that will be inversely related to the breadth and depth of the object graphs those instances represent. More data obviously takes more time to write to the queue.
b) I know you mentioned that you can't write to the queue until all file lines have been processed, but if at all possible you might reconsider that choice. To the extent you could parallelize both the processing of lines in the file and subsequent writing to the output queue (using either multiple WebJob instances or perhaps TPL Tasks within a single WebJob instance), you could potentially get this work done a lot faster. Again, I realize you stated upfront that you can't do that, so I'm just suggesting you consider the full implications of that choice (if you haven't already).
c) One other possibility to look at... make sure the region where your storage queue lives is the same as where your WebJob lives, to minimize latency.
Best of luck!
I have researched a lot and I haven't found anything that meets my needs. I'm hoping someone from SO can throw some insight into this.
I have an application where the expected load is thousands of jobs per customer and I can have 100s of customers. Currently it is 50 customers and close to 1000 jobs per each. These jobs are time sensitive (scheduled by customer) and can run up to 15 minutes (each job).
In order to scale and match the schedules, I'm planning to run this as multi threaded on a single server. So far so good. But the business wants to scale more (as needed) by adding more servers into the mix. Currently the way I have it is when it becomes ready in the database, a console application picks up first 500 and uses Task Parallel library to spawn 10 threads and waits until they are complete. I can't scale this to another server because that one could pick up the same records. I can't update a status on the db record as being processed because if the application crashes on one server, the job will be in limbo.
I could do a message queue and have multiple machines pick from it. The problem with this is the queue has to be transactional to support handling for any crashes. MSMQ supports only MS DTC transaction since it involves database and I'm not really comfortable with DTC transactions, especially with multi threads and multiple machines. Too much maintenance and set up and possibly unknown issues.
Is SQL Server Service Broker a good approach instead? Has anyone done something like this in a production environment? I also want to keep the transactions short (a job could run for 15 to 20 minutes, mostly streaming data from a service). The only reason I'm doing a transaction is to keep the message integrity of the queue. I need the job to be re-picked if it crashes (re-appear in the queue).
Any words of wisdom?
Why not have an application receive the jobs and insert them in a table that will contain the queue of jobs? Each work process can then pick up a set of jobs and set the status as processing, then complete the work and set the status as done. Other info, such as the name of the server that processed each job and start and end timestamps, could also be logged. Moreover, instead of using multiple threads, you could use independent work processes so as to make your programming easier.
[EDIT]
SQL Server supports record level locking and lock escalation can also be prevented. See Is it possible to force row level locking in SQL Server?. Using such mechanism, you can have your work processes take exclusive locks on jobs to be processed, until they are done or crash (thereby releasing the lock).
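The claim-and-complete flow can be modelled in memory as below. Note that this sketch swaps in a lease timeout, which is not what the answer describes: a job marked as processing becomes claimable again once its lease expires, which is one way to address the "limbo" concern from the question. `JobTable` and its members are hypothetical; in SQL Server the claim would have to be a single atomic statement against the jobs table, and the `lock` here only stands in for the row lock.

```csharp
using System;
using System.Collections.Generic;

public enum JobStatus { Pending, Processing, Done }

public sealed class Job
{
    public int Id { get; set; }
    public JobStatus Status { get; set; }
    public string Worker { get; set; }
    public DateTime? LeaseExpiresUtc { get; set; }
}

// In-memory stand-in for the jobs table.
public sealed class JobTable
{
    private readonly object _sync = new object();
    private readonly List<Job> _jobs = new List<Job>();

    public void Add(int id)
    {
        lock (_sync) _jobs.Add(new Job { Id = id, Status = JobStatus.Pending });
    }

    // Claims the first job that is pending, or whose lease has run out
    // (presumably because its worker crashed). Returns null if none is available.
    public int? TryClaim(string worker, DateTime nowUtc, TimeSpan lease)
    {
        lock (_sync)
        {
            foreach (var job in _jobs)
            {
                bool expired = job.Status == JobStatus.Processing
                               && job.LeaseExpiresUtc <= nowUtc;
                if (job.Status != JobStatus.Pending && !expired) continue;

                job.Status = JobStatus.Processing;
                job.Worker = worker;
                job.LeaseExpiresUtc = nowUtc + lease;
                return job.Id;
            }
            return null;
        }
    }

    // Marks a job done, but only for the worker that currently holds it.
    public bool Complete(int id, string worker)
    {
        lock (_sync)
        {
            var job = _jobs.Find(j => j.Id == id);
            if (job == null || job.Status != JobStatus.Processing || job.Worker != worker)
                return false;
            job.Status = JobStatus.Done;
            return true;
        }
    }
}
```

The lease must be longer than the longest expected job (the question mentions 15 to 20 minutes), otherwise a slow but healthy worker could have its job re-claimed by another server.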
I'm doing a project with some timing constraints right now. Setup is: A web service accepts (tiny) xml files and I have to process these, fast.
First and most naive idea was to handle this processing in the request dispatcher itself, but that didn't scale and was doomed from the start.
So now I'm looking at a varying load of incoming requests that each produce ~ 50 jobs on my side. Technologies available for use are limited due to the customers' rules. If it's not Sql Server or MS MQ it probably won't fly.
I thought about going down the MS MQ route (Web service just submitting messages, multiple consumer processes later on) and small proof of concept modules worked like a charm.
There's one problem though: The priority of these jobs might change a lot, in the queue. The system is fairly time critical, so if we - for whatever reasons - cannot process incoming jobs in a timely fashion, we need to prefer the latest ones.
Basically the use case changes from reliable messaging in general to LIFO under (too) heavy load. Old entries still have to be processed, but have lost all of their priority.
Is there any manageable way to build something like this in MS MQ?
Expanding the business side, as requested:
The processing of the incoming job is bound to some tracks, where physical goods are moved around. If I cannot process the messages in time, the things are "gone".
I still want the results for statistical purpose, but really need to focus on the newer messages now.
Think of me being able to influence mechanical things and reroute things moving on a track - if they didn't move past point X yet..
So, if I understand this, you want to be able to switch between sorting the queue by priority OR by arrival time, depending on the situation. MSMQ can only sort the queue by priority AND by arrival time.
Although I understand what you are trying to do, I don't quite see the business justification for it. Can you expand on this?
I would propose using a service to move messages from the incoming queue to a number of work queues for processing. Under normal load, there would be several queues, each with a monitoring thread.
Under heavy load, new traffic would all go to just one "panic" queue until the load dropped. The threads on the other work queues could be paused if necessary.
Cheers
John Breakwell
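The policy the question asks for (oldest first normally, newest first when overloaded) can be illustrated with a small in-process buffer. This is only a model of the ordering rule, not MSMQ code: `AdaptiveBuffer` and `panicThreshold` are hypothetical names, and in the proposal above the equivalent decision would be made by the service that moves messages between queues.

```csharp
using System;
using System.Collections.Generic;

// Hands out the oldest item under normal load, and the newest item
// once the backlog grows beyond panicThreshold.
public sealed class AdaptiveBuffer<T>
{
    private readonly LinkedList<T> _items = new LinkedList<T>();
    private readonly object _sync = new object();
    private readonly int _panicThreshold;

    public AdaptiveBuffer(int panicThreshold)
    {
        _panicThreshold = panicThreshold;
    }

    public void Add(T item)
    {
        lock (_sync) _items.AddLast(item);
    }

    public bool TryTake(out T item)
    {
        lock (_sync)
        {
            if (_items.Count == 0)
            {
                item = default;
                return false;
            }

            // Overloaded: serve the most recent arrival (LIFO).
            // Otherwise: serve the oldest one (FIFO).
            bool overloaded = _items.Count > _panicThreshold;
            var node = overloaded ? _items.Last : _items.First;
            item = node.Value;
            _items.Remove(node);
            return true;
        }
    }
}
```

Old entries are never dropped, so they are still processed for statistics once the backlog falls back under the threshold.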
I'm developing a service that needs to be scalable in Windows platform.
Initially it will receive approximately 50 connections per second (each connection will send approximately 5 KB of data), but it needs to be scalable to receive more than 500 in the future.
It's impracticable (I guess) to save the received data to a common database like Microsoft SQL Server.
Is there another solution to save the data? Consider that it will receive more than 6 million "records" per day.
There are 5 steps:
Receive the data via http handler (c#);
Save the received data; <- HERE
Request the saved data to be processed;
Process the requested data;
Save the processed data. <- HERE
My pre-solution is:
Receive the data via http handler (c#);
Save the received data to Message Queue;
Request from MSMQ the saved data to be processed using a Windows service;
Process the requested data;
Save the processed data to Microsoft SQL Server (here's the bottleneck);
6 million records per day doesn't sound particularly huge. In particular, that's not 500 per second for 24 hours a day - do you expect traffic to be "bursty"?
I wouldn't personally use message queue - I've been bitten by instability and general difficulties before now. I'd probably just write straight to disk. In memory, use a producer/consumer queue with a single thread writing to disk. Producers will just dump records to be written into the queue.
Have a separate batch task which will insert a bunch of records into the database at a time.
Benchmark the optimal (or at least a "good") number of records to batch upload at a time. You may well want to have one thread reading from disk and a separate one writing to the database (with the file thread blocking if the database thread has a big backlog) so that you don't wait for both file access and the database at the same time.
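A producer/consumer queue with a single writer thread that flushes in batches might look roughly like this. It is a sketch under assumptions: `BatchingWriter` and the `writeBatch` delegate are hypothetical, with `writeBatch` standing in for either the file append or the bulk database insert, and `batchSize` being the number to benchmark.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class BatchingWriter<T> : IDisposable
{
    private readonly BlockingCollection<T> _queue;
    private readonly Task _consumer;

    public BatchingWriter(Action<IReadOnlyList<T>> writeBatch, int batchSize, int capacity)
    {
        // Bounded queue: producers block in Enqueue when the writer falls behind.
        _queue = new BlockingCollection<T>(capacity);

        // One dedicated consumer, so writes never run concurrently.
        _consumer = Task.Factory.StartNew(() =>
        {
            var batch = new List<T>(batchSize);
            foreach (var item in _queue.GetConsumingEnumerable())
            {
                batch.Add(item);
                // Flush when the batch is full or the queue has momentarily drained.
                if (batch.Count >= batchSize || _queue.Count == 0)
                {
                    writeBatch(batch.ToArray());
                    batch.Clear();
                }
            }
            if (batch.Count > 0) writeBatch(batch.ToArray()); // defensive final flush
        }, TaskCreationOptions.LongRunning);
    }

    // Called by any number of producer threads.
    public void Enqueue(T item) => _queue.Add(item);

    // Stops accepting items, waits for the remaining ones to be written.
    public void Dispose()
    {
        _queue.CompleteAdding();
        _consumer.Wait();
        _queue.Dispose();
    }
}
```

If `writeBatch` throws, the consumer stops and producers will eventually block on the full queue, so a real implementation would need retry or failure signalling around the write.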
I suggest that you do some tests nice and early, to see what the database can cope with (and letting you test various different configurations). Work out where the bottlenecks are, and how much they're going to hurt you.
I think that you're prematurely optimizing. If you need to send everything into a database, then see if the database can handle it before assuming that the database is the bottleneck.
If the database can't handle it, then maybe turn to a disk-based queue like Jon Skeet is describing.
Why not do this:
1.) Receive data
2.) Process data
3.) Save original and processed data at once
That would save you the trouble of requesting it again if you already have it. I'd be more worried about your table structure and your database machine than the actual flow, though. I'd make sure that your inserts are as cheap as possible. If that isn't possible, then queuing up the work makes some sense. I wouldn't use message queue myself. Assuming you have a decent SQL Server machine, 6 million records a day should be fine, provided you're not writing a ton of data in each record.