I'm developing a service that needs to be scalable on the Windows platform.
Initially it will receive approximately 50 connections per second (each connection sending roughly 5 KB of data), but it needs to scale to more than 500 per second in the future.
I'm guessing it's impractical to save the received data directly to a common database like Microsoft SQL Server.
Is there another solution for saving the data, considering that it will receive more than 6 million "records" per day?
There are 5 steps:
Receive the data via an HTTP handler (C#);
Save the received data; <- HERE
Request the saved data to be processed;
Process the requested data;
Save the processed data. <- HERE
My pre-solution is:
Receive the data via an HTTP handler (C#);
Save the received data to a message queue (MSMQ);
Request the saved data from MSMQ to be processed by a Windows service;
Process the requested data;
Save the processed data to Microsoft SQL Server (here's the bottleneck).
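For step 3 of the pre-solution, a minimal sketch of the Windows-service side reading from MSMQ could look like this (the queue path and the ProcessAndSave method are just placeholders for illustration):

    using System.Messaging;

    class QueueReader
    {
        public void Run()
        {
            // Assumed private queue name; the real path depends on your setup.
            using (var queue = new MessageQueue(@".\private$\incoming"))
            {
                queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });

                while (true)
                {
                    Message message = queue.Receive();     // blocks until a message arrives
                    string payload = (string)message.Body; // the ~5 KB chunk posted to the handler
                    ProcessAndSave(payload);               // steps 4 and 5: process, then write to SQL Server
                }
            }
        }

        void ProcessAndSave(string payload) { /* processing + batched insert go here */ }
    }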
6 million records per day doesn't sound particularly huge. In particular, that's not 500 per second for 24 hours a day - do you expect traffic to be "bursty"?
I wouldn't personally use a message queue - I've been bitten by instability and general difficulties before now. I'd probably just write straight to disk. In memory, use a producer/consumer queue with a single thread writing to disk. Producers just dump records to be written into the queue.
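A minimal sketch of that producer/consumer arrangement, assuming a BlockingCollection and a single consumer thread appending to a file (the path and record type are placeholders):

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class DiskWriter
    {
        private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();

        public DiskWriter(string path)
        {
            // Single consumer: drain the queue and append each record to disk.
            Task.Factory.StartNew(() =>
            {
                using (var writer = new StreamWriter(path, append: true) { AutoFlush = true })
                {
                    foreach (var record in _queue.GetConsumingEnumerable())
                    {
                        writer.WriteLine(record);
                    }
                }
            }, TaskCreationOptions.LongRunning);
        }

        // Called from the HTTP handler threads: dump the record and return immediately.
        public void Enqueue(string record)
        {
            _queue.Add(record);
        }

        public void Complete()
        {
            _queue.CompleteAdding();
        }
    }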
Have a separate batch task which will insert a bunch of records into the database at a time.
Benchmark to find the optimal (or at least a "good") number of records to batch-upload at a time. You may well want to have one thread reading from disk and a separate one writing to the database (with the file thread blocking if the database thread has a big backlog) so that you don't wait for both file access and the database at the same time.
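For the batch upload itself, something along the lines of SqlBulkCopy keeps the per-record cost low; this is only a sketch, and the table name, column and batch size are invented:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    static void WriteBatch(string connectionString, IList<string> records)
    {
        // Shape the buffered records into a DataTable matching the target table.
        var table = new DataTable();
        table.Columns.Add("Payload", typeof(string));
        foreach (var record in records)
        {
            table.Rows.Add(record);
        }

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "ReceivedData"; // placeholder table name
            bulk.BatchSize = records.Count;             // tune this by benchmarking
            bulk.WriteToServer(table);                  // one round trip for the whole batch
        }
    }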
I suggest that you do some tests nice and early, to see what the database can cope with (and letting you test various different configurations). Work out where the bottlenecks are, and how much they're going to hurt you.
I think that you're prematurely optimizing. If you need to send everything into a database, then see if the database can handle it before assuming that the database is the bottleneck.
If the database can't handle it, then maybe turn to a disk-based queue like Jon Skeet is describing.
Why not do this:
1.) Receive data
2.) Process data
3.) Save the original and processed data at once
That would save you the trouble of requesting it again when you already have it. I'd be more worried about your table structure and your database machine than the actual flow, though. I'd make sure your inserts are as cheap as possible. If that isn't possible, then queuing up the work makes some sense. I wouldn't use a message queue myself. Assuming you have a decent SQL Server machine, 6 million records a day should be fine, as long as you're not writing a ton of data in each record.
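If you fold the steps together like that, a rough sketch of writing the original and the processed payload in one transaction might look like this (the Incoming/Processed tables and columns are made up):

    using System.Data.SqlClient;

    static void SaveBoth(string connectionString, string original, string processed)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                // Both inserts go in one batch so a single round trip covers them.
                using (var command = new SqlCommand(
                    "INSERT INTO Incoming (Payload) VALUES (@original); " +
                    "INSERT INTO Processed (Payload) VALUES (@processed);",
                    connection, transaction))
                {
                    command.Parameters.AddWithValue("@original", original);
                    command.Parameters.AddWithValue("@processed", processed);
                    command.ExecuteNonQuery();
                }
                transaction.Commit();
            }
        }
    }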
Related
I have a project where, at peak times, the site will get 1000 calls per second, and these calls need to be saved in the DB (MS SQL Server).
What is the best practice for managing connections at this scale?
I am using .NET / C#.
Currently I'm building this as a site that receives all the calls as POST requests.
Thanks for your answers.
In my experience, the fastest way to insert a large amount of data is to use a GUID as the PK. If you use an autoincrement ID, SQL Server will write the rows one by one to get the next ID.
But with a GUID, the server can write all your data at the "same" time.
Depending on the resources of your production machine, there are several approaches:
1st: Store all the requests in one big list and have your application write them synchronously into your database. Pro: pretty easy to code; con: could be a real bottleneck.
2nd: Store all the requests in one big list and have your application start a thread for every 1000 requests received; these threads write the data into your database asynchronously. Pro: should be faster; con: harder to code and maintain.
3rd: Store the requests in the server's memory and write the data into the database when the application is not busy.
I have a server application that receives data from clients that must be stored in a database.
Client/server communication is made with ServiceStack, and for every client call there can be 1 or more records to be written.
The clients don't need to wait for the data to be written, or to know whether it has been written.
At my customer's site the database may sometimes be unavailable for short periods, so I want to retry the writing until the database is available again.
I can't use a service bus or other software; it must be only my server and the database.
I considered two possibilities:
1) Fire a thread for every call to write the record (or group of records, with a multiple insert); in case of failure it retries until it succeeds.
2) Enqueue the data to be written in a global in-memory list, and have a single background thread continuously make single calls to the DB (with a multiple insert).
What do you consider the most efficient way to do it? Or do you have another proposal?
Option 1 is easier, but I'm worried about having too many threads running at the same time, especially if the DB becomes unavailable.
In case I follow the second route, my idea is:
1) Every server thread opened by a client locks the global list, inserts the record(s) to be written to the DB, releases the lock and finishes.
2) The background thread locks the global list (which has, for example, 50 records), makes a deep copy into a temp list, and unlocks the global list.
3) The server threads continue to add data to the global list; in the meantime the background thread tries to write the 50 records, retrying until it succeeds.
4) When the background thread manages to write, it locks the global list again (which maybe now has 80 records), removes the first 50 that have been written, and everything starts again.
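A condensed sketch of that lock / copy / retry loop, with the record type and the actual DB write left as placeholders:

    using System.Collections.Generic;
    using System.Threading;

    class BufferedDbWriter
    {
        private readonly object _gate = new object();
        private readonly List<Record> _pending = new List<Record>();

        // 1) Called by the ServiceStack request threads: add and return immediately.
        public void Add(IEnumerable<Record> records)
        {
            lock (_gate) { _pending.AddRange(records); }
        }

        // Body of the single background thread.
        public void WriteLoop()
        {
            while (true)
            {
                List<Record> batch;
                lock (_gate)
                {
                    batch = new List<Record>(_pending);   // 2) copy under the lock, then release it
                }

                if (batch.Count == 0) { Thread.Sleep(100); continue; }

                while (!TryWriteToDb(batch))              // 3) retry until the DB is reachable again
                {
                    Thread.Sleep(5000);
                }

                lock (_gate)
                {
                    _pending.RemoveRange(0, batch.Count); // 4) drop only what was actually written
                }
            }
        }

        private bool TryWriteToDb(List<Record> batch) { /* multiple insert here */ return true; }
    }

    class Record { }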
Is there a better way to do this?
--------- EDIT ----------
My issue is that I don't want the client to have to wait in any way, not even for adding the record-to-be-sent to a locked list (which happens while the writing thread is writing, or trying to write, the list to the DB).
That's why in my solution I lock the list only for the time it takes to copy it to a temporary list that will be written to the DB.
I'm just wondering whether this is crazy and there is a much simpler solution that I'm missing.
My understanding of the problem is as follows:
1. The client sends data to be inserted into the DB
2. The server receives the data and inserts it into the DB
3. The client doesn't want to know whether the data was inserted properly or not
In this case, I would suggest: let the server create a single queue which holds the data to be inserted into the DB. The receiving thread just takes the data from the client and puts it into the in-memory queue; this queue can then be emptied by another thread which takes care of writing to the DB to persist it.
You may even use a file-based queue, a priority queue, or just an in-memory queue for storing the records temporarily.
If you use the .NET thread pool you don't need to worry about creating too many threads, as thread lifetime is managed for you.
Task.Factory.StartNew(DbWriteMethodHere)
If you want to be smarter you could add the records you want to commit to a BlockingCollection<T>, and then have a worker thread call Take() in a loop (it blocks until an item is available) until it has built up a big enough batch, say 50 records, to commit.
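A hedged sketch of that batching consumer; the batch size of 50 and the CommitBatch call are illustrative only:

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class BatchCommitter
    {
        private readonly BlockingCollection<Record> _records = new BlockingCollection<Record>();

        // Request threads just add and return.
        public void Add(Record record)
        {
            _records.Add(record);
        }

        public void Start()
        {
            Task.Factory.StartNew(() =>
            {
                while (!_records.IsCompleted)
                {
                    var batch = new List<Record>();
                    batch.Add(_records.Take());          // blocks until at least one record exists
                    Record next;
                    while (batch.Count < 50 && _records.TryTake(out next))
                    {
                        batch.Add(next);                 // grab whatever else is already queued
                    }
                    CommitBatch(batch);                  // multiple insert, with retry if needed
                }
            }, TaskCreationOptions.LongRunning);
        }

        private void CommitBatch(List<Record> batch) { /* DB write here */ }
    }

    class Record { }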
I am quite confused about which approach to take and what best practice is.
Let's say I have a C# application which does the following:
it sends emails from a queue; the emails to send and all their content are stored in the DB.
Now, I know how to make my C# application almost scalable but I need to go somewhat further.
I want some way of distributing the tasks across, say, X servers, so it is not just one server doing all the processing but the work is shared amongst the servers.
If one server goes down, the load is shared between the other servers. I know NLB does this, but I'm not looking for NLB here.
Sure, you could add a column of some kind to the DB table to indicate which server is assigned to process each record; each application instance would have an ID that matches the value in the DB and would only pull its own records. But I consider this cheap, bad practice and unrealistic.
Having a DB table row lock is also not something I would do, due to potential deadlocks and other possible issues.
I am also NOT suggesting using threading "to the extreme" here, but yes, there will be threading per item to process, or items batched up per thread for X threads.
How should I approach this, and what do you recommend for making a C# application that is scalable and highly available? The aim is to have X servers, each running the same application and each able to get records and process them, but with the processing load shared amongst the servers so that if one server or service fails, the others can take on that load until another server is brought back.
Sorry for my lack of understanding or knowledge, but I have been thinking about this quite a lot and losing sleep trying to come up with a good, robust solution.
I would be thinking of batching up the work, so each app only pulls back X records at a time, marking those retrieved records as taken with a bool field in the table. I'd amend the SELECT statement to pull only records not marked as taken/done. Table locks would be OK in this instance for very short periods, to ensure there is no overlap of apps processing the same records.
EDIT: It's not very elegant, but you could have a datestamp and a status for each entry (instead of a bool field as above). Then you could run a periodic Agent job which runs a sproc to reset the status of any records which have a status of In Progress but which have gone beyond a time threshold without being set to complete. They would be ready for reprocessing by another app later on.
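One way to make the "claim a batch" step atomic across servers (a sketch only; the table and column names are made up) is an UPDATE with an OUTPUT clause and the READPAST hint, so two apps never grab the same rows:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static List<int> ClaimBatch(string connectionString, string serverId)
    {
        // READPAST skips rows another server has locked; OUTPUT returns the rows we claimed.
        const string sql = @"
            UPDATE TOP (50) EmailQueue WITH (ROWLOCK, READPAST)
            SET    Status    = 'InProgress',
                   ClaimedBy = @server,
                   ClaimedAt = GETUTCDATE()
            OUTPUT inserted.Id
            WHERE  Status = 'Pending';";

        var claimedIds = new List<int>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@server", serverId);
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    claimedIds.Add(reader.GetInt32(0));
                }
            }
        }
        return claimedIds;
    }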
This may not be enterprise-y enough for your tastes, but I'd bet my hide that there are plenty of apps out there in the enterprise which are just as un-sophisticated and work just fine. The best things work with the least complexity.
We will be implementing a solution to pull files off a queue (IBM MQ). The messages will be 10-20 different XML message types that will need to be dequeued, processed, and archived (stored). However, when we store the data contained in the messages in a DB we want to retain the source file, so the FileId generated by the archive process will have to be kept and stored with the metadata.
I am trying to figure out what will give me the most throughput.
Requirements:
Keep an archive of the file.
Store the parsed data (not the XML blob) from the messages.
Retain the source file ID from the archive.
Implement a solution that will scale, as volume could grow significantly... currently it's probably 40,000-50,000 messages an hour.
So basically my current bottleneck is that my archive process and my data processing / DB load are serial (the archive step has to complete successfully before I can start on the XML parsing / loading)... I don't know if there is a better way to accomplish this.
I would assume we could add other app servers listening on the same queue to process the messages in parallel if need be. Try to eliminate the DB as the bottleneck by having it perform as little processing as possible (you could send the XML blob to the DB, but then it would have to do the XML shredding).
Work with DBAs to tune writes to the database.
Test multiple reads and see if you get better throughput.
My experience is that you'll want to have multiple readers, but the number will depend on lots of factors. Test it and see what is best.
Greetings,
I've been working on a C#.NET app that interacts with a data logger. The user can query and obtain logs for a specified time period, and view plots of the data. Typically a new data log is created every minute and stores a measurement for a few parameters. To get meaningful information out of the logger, a reasonable number of logs need to be acquired - data for at least a few days. The hardware interface is a UART to USB module on the device, which restricts transfers to a maximum of about 30 logs/second. This becomes quite slow when reading in the data acquired over a number of days/weeks.
What I would like to do is improve the perceived performance for the user. I realize that with the hardware speed limitation the user will have to wait for the full download cycle at least the first time they acquire a larger set of data. My goal is to cache all data seen by the app, so that it can be obtained faster if ever requested again. The approach I have been considering is to use a light database, like SqlServerCe, that can store the data logs as they are received. I am then hoping to first search the cache prior to querying a device for logs. The cache would be updated with any logs obtained by the request that were not already cached.
Finally my question - would you consider this to be a good approach? Are there any better alternatives you can think of? I've tried to search SO and Google for reinforcement of the idea, but I mostly run into discussions of web request/content caching.
Thanks for any feedback!
Seems like a very reasonable approach. Personally I'd go with SQL CE for storage, make sure you index the column holding the datetime of the record, then use TableDirect on the index for getting and inserting data so it's blazing fast. Since your data is already chronological there's no need to get any slow SQL query processor involved, just seek to the date (or the end) and roll forward with a SqlCeResultSet. You'll end up being speed limited only by I/O. I profiled doing really, really similar stuff on a project and found TableDirect with SQLCE was just as fast as a flat binary file.
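A rough sketch of that TableDirect read, assuming a table called LogEntries with an index IX_LogTime on the datetime column (all names and column ordinals are invented):

    using System;
    using System.Data;
    using System.Data.SqlServerCe;

    static void ReadLogsFrom(SqlCeConnection connection, DateTime start)
    {
        using (SqlCeCommand command = connection.CreateCommand())
        {
            command.CommandType = CommandType.TableDirect;
            command.CommandText = "LogEntries";   // read the table directly, no query processor
            command.IndexName = "IX_LogTime";     // traverse rows in datetime order

            using (SqlCeResultSet rows = command.ExecuteResultSet(ResultSetOptions.Scrollable))
            {
                // Seek to the first record at or after the requested start time,
                // then roll forward in index order.
                if (rows.Seek(DbSeekOptions.AfterEqual, start))
                {
                    while (rows.Read())
                    {
                        DateTime logTime = rows.GetDateTime(0); // ordinal 0: timestamp (assumed)
                        double value = rows.GetDouble(1);       // ordinal 1: measurement (assumed)
                        // hand the record to the cache / plotting code here
                    }
                }
            }
        }
    }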
I think you're on the right track wanting to store it locally in some queryable form.
I'd strongly recommend SQLite. There's a .NET class here.