Insert mass streaming data into MySQL using C#

I use MySQL and I need to insert mass data. The data is streamed to my server as lists of 5k rows each. I need to handle more than 3k such requests, which means 3k requests * 5k rows = 15,000,000 rows.
What I did was create threads and insert using those threads, as the data comes in packets of 5k rows via an async event. The data response is generated in reply to my request.
What is the best possible way to do it, keeping this scenario in mind?
Should I use thread pooling for thread management or a simple multithreaded application, and will threads even benefit insertion, given that I need to insert into a single table (InnoDB engine)?

You can cache incoming requests on the server. Keep some buffered data in memory until you have N requests (a number you can fine-tune later). Once you have those, you just flush the data into MySQL using a bulk insert routine. It is generally much faster to do one big insert than many small ones.
You can use the ConcurrentBag<T> class to keep the data on the server; it is a thread-safe collection.
Additionally, you may want to expire the cache based on time. This covers the case where you receive n < N requests and then a client simply stops sending data; you would want to flush anyway rather than wait forever for upcoming requests to fill the cache.
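A minimal sketch of that idea, assuming the MySqlConnector package and a made-up readings(sensor_id, value) table: the buffer is a ConcurrentBag that gets flushed as one multi-row INSERT once N rows have accumulated (a timer calling FlushAsync would handle the time-based expiry).

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using MySqlConnector; // assumption: the MySqlConnector NuGet package (MySql.Data is very similar)

public record Row(int SensorId, double Value); // hypothetical shape of one incoming row

public class BufferedInserter
{
    private const int FlushThreshold = 5000;           // N: tune for your workload
    private readonly ConcurrentBag<Row> _buffer = new();
    private readonly string _connectionString;

    public BufferedInserter(string connectionString) => _connectionString = connectionString;

    // Called from the async receive event for every incoming packet of rows.
    public Task AddAsync(IEnumerable<Row> rows)
    {
        foreach (var row in rows) _buffer.Add(row);
        return _buffer.Count >= FlushThreshold ? FlushAsync() : Task.CompletedTask;
    }

    // Also call this from a timer so a half-full buffer is not kept forever.
    public async Task FlushAsync()
    {
        var batch = new List<Row>();
        while (_buffer.TryTake(out var row)) batch.Add(row);
        if (batch.Count == 0) return;

        // One multi-row parameterized INSERT instead of thousands of single-row statements.
        var sql = new StringBuilder("INSERT INTO readings (sensor_id, value) VALUES "); // hypothetical table
        await using var conn = new MySqlConnection(_connectionString);
        await conn.OpenAsync();
        using var cmd = conn.CreateCommand();
        for (int i = 0; i < batch.Count; i++)
        {
            if (i > 0) sql.Append(',');
            sql.Append($"(@s{i},@v{i})");
            cmd.Parameters.AddWithValue($"@s{i}", batch[i].SensorId);
            cmd.Parameters.AddWithValue($"@v{i}", batch[i].Value);
        }
        cmd.CommandText = sql.ToString();
        await cmd.ExecuteNonQueryAsync();
    }
}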

Related

Azure CosmosDB Bulk Insert with Batching

I'm using Azure Cosmos DB with an Azure Function. In the function, I'm trying to save around 27 million records.
The method AddBulk looks as below:
public async Task AddBulkAsync(List<TEntity> documents)
{
    try
    {
        List<Task> concurrentTasks = new();
        foreach (var item in documents)
        {
            concurrentTasks.Add(_container.CreateItemAsync<TEntity>(
                item,
                new PartitionKey(Convert.ToString(_PartitionKey?.GetValue(item)))));
        }
        await Task.WhenAll(concurrentTasks);
    }
    catch
    {
        throw;
    }
}
I have around 27 million records categorized by a property called batch; each batch holds around 82-85K records.
Currently, I'm creating a list of 27 million records and sending that whole list to AddBulkAsync, which is causing problems in terms of memory and throughput. Instead, I'm planning to send the data in batches (sketched after the steps below):
Get distinct batchId from all the docs and create a batch list.
Loop over the batch list and get the documents (i.e. around 85K records) for the first batch.
Call the AddBulk method for the documents (i.e. around 85K records) of the first batch.
Continue the loop for the next batchId and so on.
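A rough sketch of that plan; GetBatchIdsAsync and GetDocumentsForBatchAsync are hypothetical helpers standing in for however the source data is actually read.

// inside some async method; only one batch (~85K documents) is materialized at a time
IReadOnlyList<string> batchIds = await GetBatchIdsAsync();               // step 1: distinct batch ids
foreach (var batchId in batchIds)                                        // step 2: loop the batch list
{
    List<TEntity> documents = await GetDocumentsForBatchAsync(batchId);  // ~82-85K records
    await AddBulkAsync(documents);                                       // step 3, then step 4: next batchId
}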
Here, I wanted to understand: this approach calls AddBulk once per batch (several hundred times), and in turn AddBulk awaits Task.WhenAll(concurrentTasks) that many times internally.
Will it cause a problem? Are there any better approaches available to achieve such a scenario? Please guide.
Some suggestions for scenarios like this.
When doing bulk operations like these, be sure you enable bulk support (AllowBulkExecution = true in CosmosClientOptions) when instantiating the Cosmos client. Bulk mode works by queuing up operations, divided into groups based upon the partition key range for a physical partition within the container. (Logical partition key values, when hashed, fall within a range of hashes that is mapped to the physical partition where they reside.) As a queue fills up, the Cosmos client dispatches all of the queued-up items for that partition. If it does not fill up, it dispatches what it has, full or not.
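For reference, a minimal sketch of turning this on with the .NET SDK v3 (the endpoint, key, and database/container names are placeholders):

using Microsoft.Azure.Cosmos;

// Create the client once and reuse it; AllowBulkExecution enables the queue-and-dispatch behavior described above.
var client = new CosmosClient(
    "<account-endpoint>",                                   // placeholder
    "<account-key>",                                        // placeholder
    new CosmosClientOptions { AllowBulkExecution = true });

var container = client.GetContainer("<database>", "<container>");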
One way you can help with this is to batch updates such that the items you send are roughly evenly spread across logical partition key values within your container. This allows the Cosmos client to more fully saturate the available throughput with batches that are as full as possible and sent in parallel. When data is not evenly distributed across logical partition key values, ingestion will be slower because dispatches are sent as individual queues fill up; it is less efficient and will not fully utilize every ounce of provisioned throughput.
One other thing to suggest is, if at all possible, stream your data rather than batch it. Cosmos DB measures provisioned throughput per second (RU/s), and you get only the amount of throughput you've provisioned, on a per-second basis. Anything over that and you are either rate limited (i.e. 429s) or the speed of operations is slower than it otherwise could be. This is because throughput is evenly spread across all partitions: if you provision 60K RU/s you'll end up with ~6 physical partitions, each of which gets 10K RU/s, with no sharing of throughput between them. As a result, operations like this are less efficient than they could be.
One effective way to deal with all of this is to amortize your throughput over a longer period of time, and streaming is the perfect way to accomplish that. Streaming data generally requires less overall provisioned throughput because you are utilizing it over a longer period of time. This lets you work with a nominal amount of throughput rather than having to scale up and down again, and is just overall more efficient. If streaming from the source is not an option, you can accomplish the same thing with queue-based load leveling: first send the data to a queue, then stream it from there into Cosmos DB.
Either way you go, these docs can help squeeze the most performance out of Azure Functions hosting bulk ingestion workloads into Cosmos DB.
Blog post on Bulk Mode
Bulk support improvements blog post
Advanced Configurations for Azure Function Cosmos DB Triggers
Manage Azure Function Connections (important because Direct TCP mode in the Cosmos client can create LOTS of connections)

Stream data from database to browser as JSON via ASP.NET Core

I have an API that needs to return large lists of JSON data. I'm trying to stream it directly from the database to the client, to avoid hogging RAM on my web server. Would this be a good way to do it? (Seems to be working)
[HttpGet]
[Route("data")]
public IEnumerable<MappedDataDto> GetTestData()
{
    var connection = new NpgsqlConnection(_connectionString);
    connection.Open();

    // buffered: false makes Dapper stream rows lazily instead of loading the whole result set into memory
    IEnumerable<RawDataDto> rawItems = connection.Query<RawDataDto>("SELECT * FROM sometable", buffered: false);

    foreach (var rawItem in rawItems)
    {
        var mappedItem = Map(rawItem);
        yield return mappedItem;
    }
}
Do I need to dispose the connection or will that automatically be taken care of? I can't wrap it in a using block, since that throws an exception like "can't access a disposed object".
Would it be better to use some kind of stream instead of yield return?
EDIT:
In my case we have a large legacy JavaScript web app that shows graphs and charts of data between two dates. The client downloads the entire dataset and does in-memory calculations on the client to represent that data in different kinds of ways (the users can create their own custom dashboards, so not everyone uses the data in the same way).
Our problem is that when clients request a period with a lot of data, the memory consumption of our ASP.NET Core API increases a lot. So this is what we are trying to prevent.
Unfortunately, making any larger changes to the client would be very time consuming, so we are looking at what we can do on the API side instead.
So this is why I'm trying to figure out whether there is a way to stream the data from the database through the API, so that no changes are needed on the client, and the memory consumption of the API will not be as bad since it won't have to hold everything in memory.
Given the functionality (displaying data in charts/graphs) I'd suggest some changes to both the client and the server application.
Let's assume this case scenario:
A client requests data for a period of 30 days, which corresponds to 1 million rows
This means high memory consumption not only on the server but also in the client application! So I'd suggest rewriting the query to group the data by day, hour or an even shorter time period, whatever suits your needs - this would reduce the amount of data being sent from the server:
grouping by day: 30 records
grouping by hour: 720 records
grouping by 10 minutes range: 4320 records
grouping by minute: 43200 records
Client application would obviously need some changes in order to do the calculations based on the grouped data, not each individual row.
BTW I don't know which RDBMS you're using, but this might be helpful (SQL Server):
How to group time by hour or by 10 minutes
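Since the question's code uses Npgsql, here is a hedged sketch of the grouping idea inside the same controller, using PostgreSQL's date_trunc with Dapper; the table/column names and the chosen aggregates are made up.

using System;
using System.Collections.Generic;
using Dapper;
using Npgsql;

// Hypothetical aggregated DTO: one row per hour instead of one row per raw sample.
public class HourlyPointDto
{
    public DateTime Hour { get; set; }
    public double AvgValue { get; set; }
    public double MaxValue { get; set; }
}

public IEnumerable<HourlyPointDto> GetHourlyData(DateTime from, DateTime to)
{
    const string sql = @"
        SELECT date_trunc('hour', sampled_at) AS Hour,   -- sampled_at / value are made-up columns
               AVG(value)                     AS AvgValue,
               MAX(value)                     AS MaxValue
        FROM sometable
        WHERE sampled_at BETWEEN @from AND @to
        GROUP BY date_trunc('hour', sampled_at)
        ORDER BY Hour";

    using var connection = new NpgsqlConnection(_connectionString);
    return connection.Query<HourlyPointDto>(sql, new { from, to });   // only ~720 rows for a 30-day range
}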

Multi Connections to DB

I have a project where at peak times the site will get 1000 calls per second; these calls need to be saved in the DB (MS SQL Server).
What is the best practice to manage connections at this scale?
I am using .NET / C#.
Currently I'm building this as a site that receives all the calls as POST requests.
Thanks for your answers.
From my experience the fastest way to insert a large amount of data is to use a GUID as the PK. If you use an auto-increment key, SQL Server will write the rows one by one to get the next ID.
But with a GUID, the server can write all your data at the "same" time.
Depending on the resources of your production machine there are several approaches:
1st: Store all the requests in one big list and have your application write them synchronously into your database. Pro: pretty easy to code. Con: can be a real bottleneck.
2nd: Store all the requests in one big list and have your application start a thread for every 1000 requests received; these threads write the data into your database asynchronously. Pro: should be faster. Con: harder to code and maintain.
3rd: Store the requests in the server's memory and write the data into the database when the application is not busy.

Write to db efficiently from a multithread application

I have a server application that receives data from clients that must be stored in a database.
Client/server communication is done with ServiceStack, and for every client call there can be 1 or more records to be written.
The clients don't need to wait for the data to be written, or to know whether it has been written.
At my customer's site the database may sometimes be unavailable for short periods, so I want to retry the writing until the database is available again.
I can't use a service bus or other software; it must be only my server and the database.
I considered two possibilities:
1) fire a thread for every call that writes a record (or group of records with a multi-row insert) and, in case of failure, retries until it succeeds
2) enqueue the data to be written in a global in-memory list, and have a single background thread continuously write it to the db (with a multi-row insert)
What do you consider the most efficient way to do it? Or do you have another proposal?
Option 1 is easier, but I'm worried about having too many threads running at the same time, especially if the db becomes unavailable.
In case I follow the second route, my idea is (a sketch follows the steps):
1) every server thread opened by a client locks the global list, inserts 1 or more records to be written to the db, releases the lock and ends
2) the background thread locks the global list (which has, for example, 50 records), makes a deep copy to a temp list, and unlocks the global list
3) the server threads continue to add data to the global list; in the meantime the background thread tries to write the 50 records, retrying until it succeeds
4) when the background thread manages to write, it locks the global list again (which may now have 80 records), removes the first 50 that have been written, and everything starts again
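A minimal sketch of those four steps; DataRecord and TryWriteToDb are hypothetical stand-ins for the real record type and insert code.

using System;
using System.Collections.Generic;
using System.Threading;

public record DataRecord(string Payload);   // hypothetical record type

public class GlobalListWriter
{
    private readonly List<DataRecord> _global = new();
    private readonly object _gate = new();

    // (1) called by every request thread: very short lock, then the request completes
    public void Enqueue(IEnumerable<DataRecord> records)
    {
        lock (_gate) _global.AddRange(records);
    }

    // (2)-(4) run on a single background thread
    public void WriteLoop()
    {
        while (true)
        {
            List<DataRecord> snapshot;
            lock (_gate) snapshot = new List<DataRecord>(_global);      // (2) copy, then release the lock

            if (snapshot.Count > 0)
            {
                while (!TryWriteToDb(snapshot))                         // (3) retry until the DB is back
                    Thread.Sleep(TimeSpan.FromSeconds(5));

                lock (_gate) _global.RemoveRange(0, snapshot.Count);    // (4) drop only what was written
            }

            Thread.Sleep(TimeSpan.FromMilliseconds(200));
        }
    }

    private bool TryWriteToDb(IReadOnlyList<DataRecord> batch)
    {
        return true;   // hypothetical: one multi-row INSERT, returning false on failure
    }
}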
Is there a better way to do this?
--------- EDIT ----------
My issue is that I don't want the client to have to wait in any way, not even for adding the record-to-be-sent to a locked list (which happens while the writing thread writes, or tries to write, the list to the DB).
That's why in my solution I lock the list only for the time it takes to copy it to a temporary list that will then be written to the db.
I'm just wondering if this is crazy and there is a much simpler solution that I'm not seeing.
My understanding of the problem is as follows:
1. Client sends data to be inserted into the DB
2. Server receives the data and inserts it into the DB
3. Client doesn't want to know whether the data was inserted properly or not
In this case, I would suggest: let the server create a single queue which holds the data to be inserted into the DB, let the receiving thread just receive the data from the client and add it to the in-memory queue, and have this queue emptied by another thread which takes care of writing to the DB for persistence.
You may even use a file-based queue, a priority queue, or just an in-memory queue for storing the records temporarily.
If you use the .NET thread pool you don't need to worry about creating too many threads, as thread lifetime is managed for you.
Task.Factory.StartNew(DbWriteMethodHere)
If you want to be smarter you could add the records you want to commit to a BlockingCollection, and then have a consumer thread take items from it (blocking while it is empty) until it has accumulated a big enough batch, say 50 records, to commit.
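A sketch of that pattern, assuming a hypothetical DataRecord DTO and a WriteBatchToDb method that does the actual multi-row insert:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public record DataRecord(int ClientId, string Payload);   // hypothetical DTO

public static class BatchWriter
{
    // Request-handling threads just call Pending.Add(record); they never touch the DB.
    public static readonly BlockingCollection<DataRecord> Pending = new();

    // Single consumer: block until something arrives, drain up to 50 items, write them, retry on failure.
    public static Task Start() => Task.Run(() =>
    {
        var batch = new List<DataRecord>(50);
        foreach (var record in Pending.GetConsumingEnumerable())
        {
            batch.Add(record);
            while (batch.Count < 50 && Pending.TryTake(out var more))   // take whatever is already queued
                batch.Add(more);

            while (true)                                                // retry until the DB accepts it
            {
                try { WriteBatchToDb(batch); break; }
                catch { Thread.Sleep(TimeSpan.FromSeconds(5)); }
            }
            batch.Clear();
        }
    });

    private static void WriteBatchToDb(IReadOnlyList<DataRecord> batch)
    {
        // hypothetical: one multi-row INSERT through your data access layer
    }
}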

Is there a fast and scalable solution to save data?

I'm developing a service that needs to be scalable on the Windows platform.
Initially it will receive approximately 50 connections per second (each connection sending approximately 5 KB of data), but it needs to scale to more than 500 in the future.
It's impractical (I guess) to save the received data to a common database like Microsoft SQL Server.
Is there another solution to save the data, considering that it will receive more than 6 million "records" per day?
There are 5 steps:
Receive the data via http handler (c#);
Save the received data; <- HERE
Request the saved data to be processed;
Process the requested data;
Save the processed data. <- HERE
My pre-solution is:
Receive the data via http handler (c#);
Save the received data to a message queue;
Request the saved data from the message queue to be processed, using a Windows service;
Process the requested data;
Save the processed data to Microsoft SQL Server (here's the bottleneck);
6 million records per day doesn't sound particularly huge. In particular, that's not 500 per second for 24 hours a day - do you expect traffic to be "bursty"?
I wouldn't personally use message queue - I've been bitten by instability and general difficulties before now. I'd probably just write straight to disk. In memory, use a producer/consumer queue with a single thread writing to disk. Producers will just dump records to be written into the queue.
Have a separate batch task which will insert a bunch of records into the database at a time.
Benchmark the optimal (or at least a "good") number of records to batch-upload at a time. You may well want to have one thread reading from disk and a separate one writing to the database (with the file thread blocking if the database thread has a big backlog) so that you don't wait for both file access and the database at the same time.
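A bare-bones sketch of the in-memory producer/consumer queue with a single disk-writer thread (the staging path and line format are made up); a separate task would then read completed files and batch-insert into the database.

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Producers (the HTTP handlers) just Add serialized records; only this one task touches the file.
var pending = new BlockingCollection<string>();

var diskWriter = Task.Run(() =>
{
    using var file = new StreamWriter(@"C:\staging\incoming.log", append: true);   // made-up path
    foreach (var line in pending.GetConsumingEnumerable())
    {
        file.WriteLine(line);
        if (pending.Count == 0) file.Flush();   // flush once the current burst has been drained
    }
});

// In the HTTP handler, per incoming request (System.Text.Json assumed):
// pending.Add(JsonSerializer.Serialize(record));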
I suggest that you do some tests nice and early, to see what the database can cope with (and letting you test various different configurations). Work out where the bottlenecks are, and how much they're going to hurt you.
I think that you're prematurely optimizing. If you need to send everything into a database, then see if the database can handle it before assuming that the database is the bottleneck.
If the database can't handle it, then maybe turn to a disk-based queue like Jon Skeet is describing.
Why not do this:
1.) Receive data
2.) Process data
3.) Save original and processed data at once
That would save you the trouble of requesting it again if you already have it. I'd be more worried about your table structure and your database machine than the actual flow, though. I'd make sure that your inserts are as cheap as possible. If that isn't possible then queuing up the work makes some sense. I wouldn't use a message queue myself. Assuming you have a decent SQL Server machine, 6 million records a day should be fine, as long as you're not writing a ton of data in each record.
