Azure Cosmos DB Bulk Insert with Batching - C#

I'm using Azure Cosmos DB with an Azure Function. In the function, I'm trying to save around 27 million records.
The AddBulk method looks as below:
public async Task AddBulkAsync(List<TEntity> documents)
{
    try
    {
        List<Task> concurrentTasks = new();
        foreach (var item in documents)
        {
            concurrentTasks.Add(_container.CreateItemAsync<TEntity>(
                item,
                new PartitionKey(Convert.ToString(_PartitionKey?.GetValue(item)))));
        }
        await Task.WhenAll(concurrentTasks);
    }
    catch
    {
        throw;
    }
}
I have around 27 million records categorized by a property called batch, so each batch holds around 82-85K records.
Currently, I'm creating a list of all 27 million records and sending that list to AddBulkAsync. It is causing problems in terms of memory and throughput. Instead, I'm planning to send the data in batches:
Get distinct batchId from all the docs and create a batch list.
Loop over the batch list and get the documents (around 85K records) for the first batch.
Call the AddBulk method for those documents (around 85K records) of the first batch.
Continue the loop for the next batchId and so on.
Here, I want to understand the implications: this approach (sketched below) calls AddBulk hundreds of times (27M / ~85K ≈ 320 batches), and each call in turn awaits Task.WhenAll(concurrentTasks) internally.
Will that cause a problem? Are there better approaches available for such a scenario? Please guide.
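For reference, the plan above would look roughly like the sketch below; GetDistinctBatchIdsAsync and GetDocumentsForBatchAsync are placeholders for my actual data access code:

// Sketch of the planned batching loop. Only one batch (~82-85K documents)
// is held in memory at a time instead of the full 27 million.
public async Task ImportAllAsync()
{
    List<string> batchIds = await GetDistinctBatchIdsAsync();   // hypothetical helper

    foreach (var batchId in batchIds)
    {
        List<TEntity> documents = await GetDocumentsForBatchAsync(batchId);   // hypothetical helper
        await AddBulkAsync(documents);
    }
}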

Some suggestions in scenarios like this.
When doing bulk operations like these, be sure you enable bulk support (AllowBulkExecution = true in CosmosClientOptions) when instantiating the Cosmos client. Bulk mode works by queuing up items into groups based upon the partition key range for a physical partition within the container. (Logical partition key values, when hashed, fall within a range of hashes that are then mapped to the physical partition where they reside.) As a queue fills up, the Cosmos client dispatches all the queued-up items for that partition; if the queue does not fill up, it dispatches what it has after a short interval, full or not.
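For example, a minimal sketch of creating the client with bulk support enabled; the connection string and database/container names are placeholders:

using Microsoft.Azure.Cosmos;

// Enable bulk support once, when the client is created.
CosmosClient client = new CosmosClient(
    "<cosmos-connection-string>",
    new CosmosClientOptions { AllowBulkExecution = true });

Container container = client.GetContainer("<database>", "<container>");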
One way you can help with this is to try to batch updates such that the number of items you send is roughly evenly spread across logical partition key values within your container. This will allow the Cosmos client to more fully saturate the available throughput with batches that are as full as possible and sent in parallel. When data is not evenly distributed across logical partition key values, ingestion will be slower, since dispatches are only sent as the queues fill up (or time out). It is less efficient and will not fully utilize every ounce of the provisioned throughput.
One other thing to suggest is, if at all possible, stream your data rather than batch it. Cosmos DB measures provisioned throughput per second (RU/s). You get only the amount of throughput you've provisioned, on a per-second basis. Anything over that and you are either rate limited (i.e. 429s) or the speed of operations is slower than it otherwise could be. This is because throughput is evenly spread across all partitions. If you provision 60K RU/s you'll end up with ~6 physical partitions, and each of them gets 10K RU/s. There is no sharing of throughput. As a result, operations like this are less efficient than they could be.
One effective way to deal with all of this is to amortize your throughput over a longer period of time. Streaming is the perfect way to accomplish this. Streaming data, in general, requires less overall provisioned throughput because you are utilizing throughput over a longer period of time. This allows you to work with a nominal amount of throughput, rather than having to scale up and down again, and is just overall more efficient. If streaming from the source is not an option, you can accomplish this with queue-based load leveling by first sending data to a queue, then streaming from there into Cosmos DB.
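As a rough illustration of the queue-based load-leveling idea, here is a sketch using an Azure Storage queue (Azure.Storage.Queues) in front of the Cosmos writes. The document type, the queue name, and the assumption that the container is partitioned on /BatchId are all placeholders:

using System;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;
using Microsoft.Azure.Cosmos;

public record MyDocument(string id, string BatchId);   // placeholder; Cosmos requires a lowercase "id"

public static class LoadLeveler
{
    // Producer side: push incoming records onto a queue instead of
    // writing them straight to Cosmos DB.
    public static Task EnqueueAsync(QueueClient queue, MyDocument document) =>
        queue.SendMessageAsync(JsonSerializer.Serialize(document));

    // Consumer side: drain the queue at a steady pace so writes are
    // spread over time and stay within the provisioned RU/s.
    public static async Task DrainAsync(QueueClient queue, Container container, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            QueueMessage[] messages = await queue.ReceiveMessagesAsync(maxMessages: 32, cancellationToken: ct);
            if (messages.Length == 0)
            {
                await Task.Delay(TimeSpan.FromSeconds(1), ct);
                continue;
            }

            foreach (QueueMessage message in messages)
            {
                var document = JsonSerializer.Deserialize<MyDocument>(message.MessageText);
                await container.UpsertItemAsync(document, new PartitionKey(document.BatchId));
                await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt, ct);
            }
        }
    }
}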
Either way you go, these docs can help squeeze the most performance out of Azure Functions hosting bulk ingestion workloads into Cosmos DB.
Blog post on Bulk Mode
Bulk support improvements blog post
Advanced Configurations for Azure Function Cosmos DB Triggers
Manage Azure Function Connections - this is important because Direct TCP mode in the Cosmos client can create LOTS of connections.

Related

Asynchronously download data from an API then multiprocess concurrently

I have a problem that has three components and two bottlenecks:
Downloading data from an api (I/O bound)
Processing the data (CPU bound)
Saving results to a database (CPU bound)
Going through the process of querying the API, processing the data, and saving the results takes me about 8 hours. There are a total of about 830 jobs to process. I would like to speed up my .NET console application in any way I can by using parallelism. I've read many posts about the Producer-Consumer problem, but I don't know how to apply that knowledge in this situation.
Here is what I'm imagining: There are two queues. The first queue stores api responses. I only want a certain number of workers at a time querying the api and putting things in the queue. If the queue is full, they will have to wait before putting their data on the queue.
At the same time, other workers (as many as possible) are pulling responses off the queue and processing them. Then, they are putting their processed results onto the second queue.
Finally, a small number of workers are pulling processed results from the second queue, and uploading them to the database. For context, my production database is SQL Server, but my Dev database is Sqlite3 (which only allows 1 write connection at a time).
How can I implement this in .NET? How can I combine I/O-based concurrency with CPU-based concurrency, while having explicit control over the number of workers at each step? And finally, how do I implement these queues and wire everything up? Any help/guidance is much appreciated!
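One possible shape for this, as a sketch rather than a definitive answer, is System.Threading.Channels with bounded channels: the bounds give the back-pressure described above, and the worker count at each stage is explicit. The Job/ApiResponse/ProcessedResult types and the three stage delegates are placeholders for the real API call, the CPU-bound processing, and the database write:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

public record Job(int Id);
public record ApiResponse(int JobId, string Payload);
public record ProcessedResult(int JobId, string Data);

public static class Pipeline
{
    public static async Task RunAsync(
        IReadOnlyList<Job> jobs,
        Func<Job, Task<ApiResponse>> fetchAsync,          // I/O bound
        Func<ApiResponse, ProcessedResult> process,       // CPU bound
        Func<ProcessedResult, Task> saveAsync)            // database write
    {
        // Bounded channels give back-pressure: a writer waits when a queue is full.
        var jobChannel      = Channel.CreateBounded<Job>(100);
        var responseChannel = Channel.CreateBounded<ApiResponse>(100);
        var resultChannel   = Channel.CreateBounded<ProcessedResult>(100);

        // Feed the jobs in.
        var feeder = Task.Run(async () =>
        {
            foreach (var job in jobs)
                await jobChannel.Writer.WriteAsync(job);
            jobChannel.Writer.Complete();
        });

        // Stage 1: a handful of API workers.
        var downloaders = StartWorkers(4, async () =>
        {
            await foreach (var job in jobChannel.Reader.ReadAllAsync())
                await responseChannel.Writer.WriteAsync(await fetchAsync(job));
        });

        // Stage 2: one CPU-bound worker per core.
        var processors = StartWorkers(Environment.ProcessorCount, async () =>
        {
            await foreach (var response in responseChannel.Reader.ReadAllAsync())
                await resultChannel.Writer.WriteAsync(process(response));
        });

        // Stage 3: a single writer (fits the one-writer SQLite constraint).
        var writers = StartWorkers(1, async () =>
        {
            await foreach (var result in resultChannel.Reader.ReadAllAsync())
                await saveAsync(result);
        });

        await feeder;
        await Task.WhenAll(downloaders);
        responseChannel.Writer.Complete();
        await Task.WhenAll(processors);
        resultChannel.Writer.Complete();
        await Task.WhenAll(writers);
    }

    private static Task[] StartWorkers(int count, Func<Task> worker) =>
        Enumerable.Range(0, count).Select(_ => Task.Run(worker)).ToArray();
}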

Retrieve 1+ million records from Azure Table Storage

My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have a property A and do some further processing.
It is expected that there are about 1 - 1.5 million records without property A. I understand there are two approaches.
Query all records then filter results after
Do a table scan
Currently, it uses the approach where we query all records and filter in C#. However, the task runs in an Azure Function App, and the query to retrieve all the results sometimes takes over 10 minutes, which is the execution time limit for Azure Functions.
I'm trying to understand why retrieving 1 million records is taking so long and how to optimise the query. The existing design of the table is that the partition key and row key are identical and are a GUID, which leads me to believe that there is one entity per partition.
Looking at Microsoft docs, here are some key Table Storage limits (https://learn.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets):
Maximum request rate per storage account: 20,000 transactions per second, which assumes a 1-KiB entity size
Target throughput for a single table partition (1 KiB-entities): Up to 2,000 entities per second.
My initial guess is that I should use another partition key to group 2,000 entities per partition to achieve the target throughput of 2,000 per second per partition. Would this mean that 2,000,000 records could in theory be returned in 1 second?
Any thoughts or advice appreciated.
I found this question after blogging on the very topic. I have a project where I am using the Azure Functions Consumption plan and have a big Azure Storage Table (3.5 million records).
Here's my blog post:
https://www.joelverhagen.com/blog/2020/12/distributed-scan-of-azure-tables
I have mentioned a couple of options in this blog post but I think the fastest is distributing the "table scan" work into smaller work items that can be easily completed in the 10-minute limit. I have an implementation linked in the blog post if you want to try it out. It will likely take some adapting to your Azure Function but most of the clever part (finding the partition key ranges) is implemented and tested.
This looks to be essentially what user3603467 is suggesting in his answer.
I see two approaches to retrieve 1+ million records in a batch process where the result must be saved to a single medium, like a file.
First: identify/select all primary ids/keys of the related data. Then spawn parallel jobs with chunks of these ids/keys; each job reads the actual data, processes it, and reports its result to the single medium.
Second: identify/select (for update) the top n of the related data and mark it with a state of being processed. Use concurrency locking here; that should prevent others from picking that data up if this is done in parallel.
I would go for the first solution if possible, since it is the simplest and cleanest. The second solution is best if you can use "select for update"; I don't know whether that is supported on Azure Table Storage.
You'll need to parallelize the task. As you don't know the partition keys up front, run separate queries over partition key ranges, one per leading character: a query where PK >= 'A' && PK < 'B', another for 'B' to 'C', and so on. Then join the results in memory. This is super easy to do in a single function; in JS you would just use Promise.all([]).
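For illustration, a sketch of that range-splitting idea with the current Azure.Data.Tables SDK; the table name is a placeholder, and since the keys in this question are GUIDs (assumed lowercase), splitting on the 16 hex leading characters is one natural choice:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

// Split the key space into ranges, query them in parallel, then merge in memory.
var table = new TableClient("<storage-connection-string>", "mytable");
string[] bounds = "0123456789abcdefg".Select(c => c.ToString()).ToArray();   // 16 ranges: [0,1) ... [f,g)

var queries = Enumerable.Range(0, bounds.Length - 1).Select(async i =>
{
    string filter = $"PartitionKey ge '{bounds[i]}' and PartitionKey lt '{bounds[i + 1]}'";
    var entities = new List<TableEntity>();
    await foreach (TableEntity entity in table.QueryAsync<TableEntity>(filter))
        entities.Add(entity);
    return entities;
});

List<TableEntity>[] partial = await Task.WhenAll(queries);
var all = partial.SelectMany(x => x).ToList();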

BigQuery quota limits from query table append apply or not?

In my C# app I use BigQueryClient.CreateQueryJobAsync to regularly append to a partitioned table, table1. Currently this happens only 50-100 times per day.
await BqClient.CreateQueryJobAsync(
    ...,
    new QueryOptions
    {
        DestinationTable = "table1",
        WriteDisposition = WriteDisposition.WriteAppend
    })
I understand that Big Query has many limits and these are usually well documented. But for this particular scenario I am not sure if the limits apply or not. The quotas page says there are "1,000 updates per table per day" but the documentation also explicitly lists which operations are affected by quota. Assuming there is an explicit list there must also be a list of "everything else" where the quota does not apply. For instance, "classic UI" is under the quota which should imply that the "new UI" is not. Similarly, the page states that jobs.query API is affected by quota but since I am using the official C# driver, it leaves me wondering as to whether this applies to my scenario or not.
Apparently, I could write a script to try to do the append operation 1001 times in 24 hours and see whether I hit the quota but I wish I could simply read this from documentation and understand without any ambiguity.
Does anyone know from first-hand experience how this actually works?
BigQuery C# library internally uses jobs.query API, so the limit/quota applies in your case.
One more thing to be aware of: since you're writing to a partitioned table, the quota below also applies:
Maximum number of partition modifications per day per table — 5,000
You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.
More than one partition may be affected by a single job. For example, a DML statement can update data in multiple partitions (for both ingestion-time and partitioned tables). Query jobs and load jobs can also write to multiple partitions, but only for partitioned tables. BigQuery uses the number of partitions affected by a job when determining how much of the quota the job consumes. Streaming inserts do not affect this quota.
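For reference, a fleshed-out version of the append call from the question might look like the sketch below (the project/dataset IDs and the SQL are placeholders, and DestinationTable takes a table reference rather than a plain string). Each job of this kind counts as one table update, and as one or more partition modifications, against the daily quotas above:

using Google.Cloud.BigQuery.V2;

BigQueryClient client = await BigQueryClient.CreateAsync("<project-id>");

// One query job that appends its result to the partitioned table.
BigQueryJob job = await client.CreateQueryJobAsync(
    "SELECT * FROM `<project-id>.<dataset>.staging`",   // placeholder SQL
    parameters: null,
    options: new QueryOptions
    {
        DestinationTable = client.GetTableReference("<dataset>", "table1"),
        WriteDisposition = WriteDisposition.WriteAppend
    });

await job.PollUntilCompletedAsync();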

Azure Drive vs Block Blob vs Table

I can't decide on the best approach to handle the following scenario with Azure Storage.
~1,500+ CSV files, between ~1MB and ~500MB each, ~20GB of data overall
Each file uses exactly the same model, and each model.ToString() is ~50 characters (~400 bytes)
Every business day, during a 6-hour period, ~8,000+ new rows come in per minute
Based on a property value, each row goes to the correct file
Writing from multiple instances is not necessary, as long as reading from multiple instances is supported; a few seconds' delay for a snapshot period is OK
I would like to use block blobs, but downloading a ~400MB file to the machine just to add a single line and upload it back doesn't make sense, and I couldn't find another way around it
There is a Drive option which uses page blobs; unfortunately it is not supported by SDK v2, which makes me nervous about possible discontinuation of support
And the final one is Tables, which looks OK, other than that reading a few hundred thousand rows continuously may become an issue
Basically, I prefer to write rows to the files as soon as I retrieve the data. But if it is worth giving that up, I can live with a single update at the end of the day, which means ~300-1,000 lines per file.
What would be the best approach to handle this scenario?
Based on the requirements above, Azure Tables are the optimal option. With a single Azure Storage account you get the following:
Storage transactions – up to 20,000 entities/messages/blobs per second
Single table partition – a table partition is all of the entities in a table with the same partition key value, and most tables have many partitions. The throughput target for a single partition is:
Up to 2,000 entities per second
Note, this is for a single partition, not for a single table. Therefore, a table with good partitioning can process up to a few thousand requests per second (up to the storage account target of 20,000).
Tables – use a more finely grained PartitionKey for the table so the service can automatically spread the table partitions across more servers.
About reading a "few hundred thousand rows" continuously: your main obstacle is the storage-account-level limit of 20,000 transactions/sec; however, if you design your partitions granularly enough that they are spread across many servers, you should be able to read "hundreds of thousands" of rows in minutes. (One way the partitioning and batched writes could look is sketched after the sources below.)
Source:
Windows Azure Storage Abstractions and their Scalability Targets
Windows Azure’s Flat Network Storage and 2012 Scalability Targets
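As a concrete illustration of the finer-grained PartitionKey suggestion, here is a sketch using the current Azure.Data.Tables SDK; the table name, the choice of the routing property value as PartitionKey, and the ticks-plus-GUID RowKey are all illustrative. Entities that share a PartitionKey can be written up to 100 at a time in a single batch transaction:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

// One logical "file" becomes one partition; rows land under that PartitionKey.
var table = new TableClient("<storage-connection-string>", "rows");
await table.CreateIfNotExistsAsync();

IEnumerable<TableEntity> ToEntities(string fileKey, IEnumerable<string> lines) =>
    lines.Select(line => new TableEntity(
        partitionKey: fileKey,                                              // the routing property value
        rowKey: DateTime.UtcNow.Ticks.ToString("d19") + Guid.NewGuid())     // unique, roughly in arrival order
    {
        ["Payload"] = line
    });

async Task WriteBatchAsync(string fileKey, IReadOnlyList<string> lines)
{
    // Batch transactions are limited to 100 entities, all in the same partition.
    foreach (var chunk in ToEntities(fileKey, lines).Chunk(100))
    {
        var actions = chunk.Select(e => new TableTransactionAction(TableTransactionActionType.Add, e));
        await table.SubmitTransactionAsync(actions);
    }
}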

Insert mass streaming data in mysql using C#

I use MySQL and I need to insert mass data. The data is streamed to my server as lists of 5k rows. I need to handle more than 3k such requests, which means 3k requests * 5k rows = 15,000,000 rows.
What I did was create threads and insert using those threads, as the data comes in packets of 5k via an async event. The data response is generated on my request.
What is the best possible way to do it, keeping this scenario in mind?
Should I use the thread pool for thread management or a simple multithreaded application, and will threads actually benefit insertion, given that I need to insert into a single table (InnoDB engine)?
You can cache incoming requests on the server: keep some buffered data in memory until you have N rows (you can fine-tune N later). Once you have them, flush the data into MySQL using a bulk insert routine. It is generally much faster to do one big insert than many small ones.
You can use the ConcurrentBag class to keep the data on the server; it is a thread-safe collection.
Additionally, you may need to expire the cache based on time. This covers the case where you get some requests n < N and then the client simply stops sending data. You want to flush that data anyway rather than wait forever for upcoming requests to fill the cache. A sketch of this buffer-and-flush approach follows.
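Here is a minimal sketch of that buffer-and-flush idea, assuming the MySqlConnector (or MySql.Data) ADO.NET client and a simple two-column table; the table name, row shape, and flush threshold are placeholders:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using MySqlConnector;   // or MySql.Data.MySqlClient

public record Row(int Id, string Value);   // placeholder row shape

public class BufferedInserter
{
    private readonly ConcurrentBag<Row> _buffer = new();
    private readonly string _connectionString;
    private const int FlushThreshold = 5000;   // tune N to your workload

    public BufferedInserter(string connectionString) => _connectionString = connectionString;

    // Called from the async event handlers as 5k-row packets arrive.
    public async Task AddAsync(IEnumerable<Row> rows)
    {
        foreach (var row in rows) _buffer.Add(row);
        if (_buffer.Count >= FlushThreshold) await FlushAsync();
    }

    // One big multi-row INSERT is much cheaper than thousands of single-row inserts.
    // Concurrent flushes are safe: TryTake removes each buffered row exactly once.
    public async Task FlushAsync()
    {
        var rows = new List<Row>();
        while (_buffer.TryTake(out var row)) rows.Add(row);
        if (rows.Count == 0) return;

        var sql = new StringBuilder("INSERT INTO my_table (id, value) VALUES ");
        using var cmd = new MySqlCommand();
        for (int i = 0; i < rows.Count; i++)
        {
            sql.Append(i == 0 ? "" : ",").Append($"(@id{i}, @val{i})");
            cmd.Parameters.AddWithValue($"@id{i}", rows[i].Id);
            cmd.Parameters.AddWithValue($"@val{i}", rows[i].Value);
        }

        // Note: very large batches may need max_allowed_packet raised on the server.
        using var connection = new MySqlConnection(_connectionString);
        await connection.OpenAsync();
        cmd.Connection = connection;
        cmd.CommandText = sql.ToString();
        await cmd.ExecuteNonQueryAsync();
    }
}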
