Large Database management using C#

We are using MySQL to get data from the database, match the data, and send the matched data back to the user. The MySQL database contains 10 tables; 9 of them hold relatively little data, which needs to be matched against the 10th table, which has 25 million records and is still growing. I need to create a C# application to match the data and send it to the user. Every minute, new data is inserted into the other 9 tables and the old data is deleted after it has been compared. I have loaded all 10 tables into C# memory, but the application sometimes runs out of memory. I'm thinking of dividing the C# application into 5-6 parts to handle the data and then do the rest of the logic, but I need some good suggestions on how to start.
Thanks
APS

I think you are approaching your problem incorrectly. From your post, it sounds like you are trying to load massive quantities of highly volatile data into memory. By doing that, you are entirely defeating the point of having a database server like MySQL. Don't preload all of the data into memory; let your users query the data they need from the database via your C# application. That is exactly what database servers are for, and they are going to do a considerably better job of providing optimized, performant access to data than you can do yourself.

You should probably think about your algorithms and decide whether there is any way to split the problem into smaller chunks, for example by working on small partitions of the data at a time (see the sketch after the list below).
32-bit .NET processes have a memory limit of 2 GB. Perhaps you are hitting this limit, hence the out-of-memory errors? If so, two things you could do are:
Have multiple processes running, each dealing with a subset of the data
Move to a 64-bit OS and recompile your code as a 64-bit executable
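As a rough illustration of working on partitions rather than loading the whole 25-million-row table, one option is to stream the big table in primary-key ranges and match each chunk against the (much smaller) copies of the other nine tables. This is only a sketch: it assumes the MySql.Data connector, a numeric primary key named id, illustrative table and column names, and a hypothetical MatchChunk helper standing in for your comparison logic.

```csharp
using System;
using System.Collections.Generic;
using MySql.Data.MySqlClient;

class ChunkedMatcher
{
    const int ChunkSize = 50000;   // tune so each chunk stays well under the process memory limit

    public static void MatchInChunks(string connectionString)
    {
        using (var conn = new MySqlConnection(connectionString))
        {
            conn.Open();
            long lastId = 0;
            while (true)
            {
                // Pull one key range at a time instead of the whole 25M-row table.
                var chunk = new List<(long Id, string Value)>();
                using (var cmd = new MySqlCommand(
                    $"SELECT id, value FROM big_table WHERE id > @lastId ORDER BY id LIMIT {ChunkSize}", conn))
                {
                    cmd.Parameters.AddWithValue("@lastId", lastId);
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                            chunk.Add((reader.GetInt64(0), reader.GetString(1)));
                    }
                }

                if (chunk.Count == 0) break;          // no more rows
                lastId = chunk[chunk.Count - 1].Id;   // resume point for the next range

                MatchChunk(chunk);                    // hypothetical: compare against the 9 small tables
            }
        }
    }

    static void MatchChunk(List<(long Id, string Value)> chunk)
    {
        // placeholder for the actual matching logic
    }
}
```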

Please do not say you have a lot of data. 24 million rows is not exactly a lot by today's standards.
Where does C# come in here? From your explanation, this looks 100% like something that should be done entirely on the server side with SQL.

I don't use MySQL, but I would suggest using a stored procedure to sort through the data first. It depends on how complex or CPU-expensive your computation is and how big the dataset is that you're going to send over your network, but normally I'd try to let the server handle it. That way you don't end up sending all your data over the network, and you avoid trouble when your data model changes: you don't have to recompile and redistribute your C# app. You change one stored procedure and you're done. (A sketch of calling such a procedure from C# follows.)
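If the matching is pushed into a stored procedure, the C# side shrinks to a thin call. A minimal sketch, assuming the MySql.Data connector and a hypothetical procedure named match_batch with a single parameter (both names are invented for the example):

```csharp
using System.Data;
using MySql.Data.MySqlClient;

static DataTable GetMatchedData(string connectionString, long batchId)
{
    using (var conn = new MySqlConnection(connectionString))
    using (var cmd = new MySqlCommand("match_batch", conn))   // hypothetical procedure name
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@p_batch_id", batchId);  // hypothetical parameter

        var result = new DataTable();
        using (var adapter = new MySqlDataAdapter(cmd))
        {
            adapter.Fill(result);   // only the matched rows cross the network
        }
        return result;
    }
}
```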

Related

system architecture for real-time data

The company I work for is running a C# project that crawls data from around 100 websites, saves it to the DB, and runs some procedures and calculations on that data.
Each of those 100 websites has around 10,000 events, and each event is saved to the DB.
After that, the saved data is aggregated and an XML file is generated per event, so each of those 10,000 saved events is now represented as an XML file in the DB.
The design looks like this:
1) Crawl 100 websites to collect the data and save it to the DB.
2) Collect the data that was saved to the DB and generate an XML file for each event.
3) Save the XML files to the DB.
The main issue for this post is the retrieval of the saved XML files.
Each XML file is about 1 MB, and considering that there are around 10,000 events, I am not sure SQL Server 2008 R2 is the right option.
I tried using Redis, and saving works very well (and fast!), but querying to get those XMLs back is very slow (even locally, so network traffic isn't the issue).
I was wondering what your thoughts are. Please take into consideration that it is a real-time system, so caching is not an option here.
Any ideas are welcome.
Thanks.
Instead of using the DB you could try cloud-based blob storage (Azure blobs or Amazon S3); it seems like a good fit. See this post: azure blob storage effectiveness. It is the same situation, except you have XML files instead of images. You can still use a DB for storing the metadata, i.e. the source and event type of the XML and its path in the cloud, but not the data itself.
You may also zip the files. I don't know the exact method, but it can surely be handled on the client side; static data is often sent to the client in zipped form by default. (A sketch combining both ideas follows.)
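A hedged sketch of that combination: gzip the XML on the client and push it to blob storage, while the database keeps only metadata. It assumes the Azure.Storage.Blobs package; the container name, blob naming scheme, and SaveMetadata call are illustrative, not part of the original design.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;
using Azure.Storage.Blobs;

static void StoreEventXml(string blobConnectionString, string source, string eventId, string xml)
{
    // Compress the ~1 MB XML before it leaves the client.
    byte[] compressed;
    using (var buffer = new MemoryStream())
    {
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
        {
            byte[] raw = Encoding.UTF8.GetBytes(xml);
            gzip.Write(raw, 0, raw.Length);
        }
        compressed = buffer.ToArray();
    }

    // Upload to blob storage; the blob name doubles as the lookup key.
    var container = new BlobContainerClient(blobConnectionString, "events");   // illustrative container name
    container.CreateIfNotExists();
    string blobName = $"{source}/{eventId}.xml.gz";
    using (var payload = new MemoryStream(compressed))
    {
        container.GetBlobClient(blobName).Upload(payload, overwrite: true);
    }

    // Keep only metadata (source, event type, blob path) in the relational DB.
    SaveMetadata(source, eventId, blobName);   // hypothetical DB call
}

static void SaveMetadata(string source, string eventId, string blobPath) { /* placeholder */ }
```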
Your question is missing some details, such as how long your data needs to remain in the database.
I'd avoid storing XML in the database if you already have the raw data. Why not have an application that queries the database and generates XML reports on demand? That would save you a lot of space.
10 GB of data per day is something SQL Server 2008 R2 can handle with the right hardware and good structural optimization. You'll need to investigate whether the Standard edition will be enough or whether you'll need Enterprise or Datacenter licenses.
In any case the answer is yes, SQL Server is capable of handling this amount of data, but I'd check other solutions as well to see if it's possible to reduce the costs in any way.
Your basic architecture doesn't seem to be at fault; it's the way you're using Redis. If you design your key => value scheme properly, there is no way retrieval from Redis should be slow.
For example, say I have to store a million objects in Redis, keyed by an id that is nothing but a GUID. The save will be really quick, but when it comes to retrieval, do I know the key? If I know the key, retrieval will be fast; but if I don't know it, or I am trying to retrieve my data not by key but by some value inside my objects, then of course it will be slow.
The point is: when it comes to retrieval you should work against the key and nothing else, so design your key as a pre-calculated value in itself. Then, whenever I need some data from Redis/memcached, I can construct the key and do a single hit to get the data (a minimal sketch follows).
If you can provide more details, we'll be able to help you better.
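A minimal sketch of the key-as-lookup idea, using StackExchange.Redis. The key format event:{source}:{eventId} is only an example; the point is that whoever needs the XML can rebuild the key from values it already knows, so retrieval is a single GET.

```csharp
using StackExchange.Redis;

class EventXmlStore
{
    readonly IDatabase _db;

    public EventXmlStore(string redisHost = "localhost")
    {
        _db = ConnectionMultiplexer.Connect(redisHost).GetDatabase();
    }

    // Key is pre-computed from values the caller already has: no scanning by value.
    static string Key(string source, string eventId) => $"event:{source}:{eventId}";

    public void Save(string source, string eventId, string xml) =>
        _db.StringSet(Key(source, eventId), xml);

    public string Load(string source, string eventId) =>
        _db.StringGet(Key(source, eventId));   // single round trip
}
```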

SQL Database VS. Multiple Flat Files (Thousands of small CSV's)

We are designing an update to a current system (C++/CLI and C#).
The system will gather small (~1 MB) amounts of data from ~10K devices (in the near future). Currently, device data is saved as a CSV file (a table), and all of these files are stored in a wide folder structure.
Data is only inserted (create / append to a file, create a folder), never updated or removed.
Data processing is done by reading many CSVs into an external program (like Matlab), mainly for statistical analysis.
There is an option to start saving this data to an MS SQL database instead.
Processing time (reading the CSVs into the external program) can be up to a few minutes.
How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, at what point does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
I'd appreciate your answers, Pros and Cons are welcome.
Thank you for your time.
Well, if you are using data in one CSV to get data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.
Based on your description it sounds like you have almost created your own DBMS on top of table data and a folder structure. I suspect that if you switched to SQL Server you would find a number of areas where things are faster and easier.
Possible Pros:
Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships
Possible Cons:
You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server; you would have to check whether the Express edition is enough for you
Good luck!
I'd like to try hitting those questions a bit out of order.
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
Does one of the methods take significantly more storage than the other?
Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
How should we choose which method to use?
Great question. Everything in the database world always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.
It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query that 100 GB worth of data in a second.
This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, or you just don't think you can figure out SQL Server, or if you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.
If you have the option to use an MS SQL database, I would do that.
Maintaining data in a wide folder structure is never a good idea. Reading your data involves reading several files, which could be stored anywhere on your disk, so your file I/O time would be quite high. SQL Server, being a production database, already has these problems taken care of.
You are reinventing the wheel here. This is how FoxPro manages data: one file per table. It is usually a good idea to use proven technology unless you are actually building a database server.
I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given your roughly 10k devices, you should consider using a standard database.
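If you do move the CSVs into SQL Server, the import side might look roughly like the following. This is only a sketch: it assumes a simple comma-separated layout with no quoted fields and a hypothetical dbo.DeviceData(DeviceId, Timestamp, Value) table; adjust the parsing and schema to the real files.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

static void ImportCsv(string csvPath, string deviceId, string connectionString)
{
    // Stage the rows in a DataTable that matches the target table's columns.
    var table = new DataTable();
    table.Columns.Add("DeviceId", typeof(string));
    table.Columns.Add("Timestamp", typeof(DateTime));
    table.Columns.Add("Value", typeof(double));

    foreach (string line in File.ReadLines(csvPath))
    {
        string[] parts = line.Split(',');                  // assumes no quoted commas
        table.Rows.Add(deviceId, DateTime.Parse(parts[0]), double.Parse(parts[1]));
    }

    // One bulk operation instead of thousands of single-row inserts.
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.DeviceData";      // hypothetical table
        bulk.WriteToServer(table);
    }
}
```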

C# Design decision for time series access with a database

I'm looking for a "best practise" way to handle incoming time series data.
One data point consists of, for example, time, height, width, etc. for every "tick". Is it a good idea to keep n data points in memory in a collection class and later "flush" the points to a database once the collection reaches its limit?
Or should the data points be written directly to the database in the first place, so that my object can run queries against it?
I know this is not much information about my requirements, so the question is how fast data access to a database is compared to a hybrid in-memory and database solution.
Say there are at most 500 data points per second to handle, and some calculation has to be done on every incoming point. With a pure database solution, one has to run an insert for every incoming point. I guess this is not efficient, but I don't know whether a database is able to "listen" and keep up with that rate.
A nice feature for the database would be to push the points to subscribers. Is this possible with SQL Server?
Thanks, Juergen
Putting the "sending to subscribers" requirement aside, don't get into the trap of premature optimization.
I would try the simplest solution first, which is probably just writing the data into the database as it arrives. Then run stress tests. If the performance isn't up to scratch, find the bottlenecks and optimize them out.
Turning to the "sending to subscribers" requirement, this isn't really something that relational database platforms are typically designed for (they are more about storing data and exposing it for on-demand retrieval). A pub-sub requirement is usually best solved using some kind of message bus; perhaps take a look at something like NServiceBus.
If it is not multi-user, then keeping the data points in memory in a collection class is definitely the winner.
If it is multi-user, then I would go for some sort of shared in-memory data structure on the server side that is persisted to the DB from time to time.
I would say the bigger question is how you plan on storing this in SQL. I would queue the data points in memory for a period of time (one second?) and then write a single row to the database with a blob or nvarchar field containing all the data for that second, as this will let the database scale further. The row could also contain some summary information about what happened in that second, which you could use when querying the data to reduce load on selects. Of course, this wouldn't be feasible if you want to perform direct queries on this data.
It all depends on what you plan to do with the data... (a rough sketch of the buffering idea is below).
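A minimal sketch of that buffer-then-flush idea, assuming ~500 points per second and a one-second flush interval. The Tick shape and dbo.Ticks table are illustrative; the answer above suggests packing each second into a single summary row instead, which would only change the body of Flush.

```csharp
using System;
using System.Collections.Concurrent;
using System.Data;
using System.Data.SqlClient;
using System.Threading;

class Tick
{
    public DateTime Time;
    public double Height;
    public double Width;
}

class TickBuffer : IDisposable
{
    readonly ConcurrentQueue<Tick> _queue = new ConcurrentQueue<Tick>();
    readonly Timer _flushTimer;
    readonly string _connectionString;

    public TickBuffer(string connectionString)
    {
        _connectionString = connectionString;
        // Flush once per second rather than hitting the DB 500 times per second.
        _flushTimer = new Timer(_ => Flush(), null, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(1));
    }

    public void Add(Tick tick) => _queue.Enqueue(tick);

    void Flush()
    {
        var batch = new DataTable();
        batch.Columns.Add("Time", typeof(DateTime));
        batch.Columns.Add("Height", typeof(double));
        batch.Columns.Add("Width", typeof(double));

        while (_queue.TryDequeue(out Tick tick))
            batch.Rows.Add(tick.Time, tick.Height, tick.Width);

        if (batch.Rows.Count == 0) return;

        using (var bulk = new SqlBulkCopy(_connectionString))
        {
            bulk.DestinationTableName = "dbo.Ticks";   // illustrative table
            bulk.WriteToServer(batch);                 // one write per second instead of one per point
        }
    }

    public void Dispose() { _flushTimer.Dispose(); Flush(); }
}
```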

C#: Very fast object search & retrieval using any persistence model

I am developing an application with Fluent NHibernate/NHibernate 3/SQLite. I have run into a very specific problem for which I need help.
I have a product database and a batch database. Products number around 100k, but batches are around the 11-million+ mark as of now. When a product is selected, I need to fill a combobox with its batches. As I do not want to load all the batches at once because of memory constraints, I load them directly from the database when the product is selected. But the problem is that SQLite (or maybe the combination of SQLite and NH) is a little slow at this: it normally takes around 3+ seconds to retrieve the batches for a particular product. Although that might not seem slow, I want to know whether I can improve this time. I need sub-second results to make order entry a smooth experience.
The details:
New products and batches are imported periodically (bi-monthly).
Nothing in the already persisted products or batches ever changes (no updates).
Storing products is not an issue. Batches are the main culprit.
Product IDs are longs.
Batch IDs are strings.
Batches contain 3 fields: rate, mrp (both decimal), and expiry (DateTime).
The requirements:
The data has to be stored in a file based solution. I cannot use a client-server approach.
Storage time is not important. Search & retrieval time is.
I am open to storing the batch database using any other persistence model.
I am open to using anything like Lucene, a NoSQL database (like Redis), or an OODB, provided it is based on a single-storage-file implementation.
Please suggest what I can use for fast object retrieval.
Thanks.
You need to profile or narrow down to find out where those 3+ seconds are.
Is it the database fetching?
Try running the same queries in a SQLite browser. Do the queries take 3+ seconds there too? Then you might need to do something with the database, like adding some good indexes.
Is it the filling of the combobox?
What if you only fill the first value in the combobox and throw away the others? Does that speed up the performance? If so, you might try BeginUpdate and EndUpdate (see the sketch below).
Are the 3+ seconds elsewhere? If so, find out where.
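For the combobox-filling part, the BeginUpdate/EndUpdate suggestion looks roughly like this in WinForms, assuming batchCombo is the ComboBox and batches is the list returned by the query (both names are placeholders):

```csharp
// Suppress repainting while thousands of items are added in one go.
batchCombo.BeginUpdate();
try
{
    batchCombo.Items.Clear();
    foreach (var batch in batches)
        batchCombo.Items.Add(batch.Id);
}
finally
{
    batchCombo.EndUpdate();
}
```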
This may seem like a silly question, but I figured I'd double-check before proceeding to alternatives or other optimizations: is there an index (or, better, a primary key) on the Batch Id column in your Batch table? Without indexes, those kinds of searches will be painfully slow.
For fast object retrieval, a key/value store is definitely a viable alternative. I'm not sure I would recommend Redis in this situation, since your batches database may be a little too large to fit into memory; although Redis also persists to disk, it is generally better suited to a dataset that fits strictly in memory.
My personal favourite would be MongoDB, but overall the best thing to do would be to take your batch data, load it into a couple of different NoSQL DBs, see what kind of read performance you get, and pick the one that suits the data best. Mongo is quite fast and easy to work with, and you could probably ditch the NHibernate layer for such a simple data structure.
There is a daemon that needs to run locally, but depending on the size of the DB it will be a single file (or a few files if it has to allocate more space). Again, make sure there is an index on your batch id column for quick lookups.
3 seconds to load ~100 records from the database? That is slow. You should examine the generated SQL and create an index that will improve the query's performance.
In particular, the ProductId column in the Batches table should be indexed.
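A hedged sketch of adding such an index directly against the SQLite file, using Microsoft.Data.Sqlite. The table and column names (Batch, ProductId) are assumptions based on the question and should be checked against the schema NHibernate actually generated.

```csharp
using Microsoft.Data.Sqlite;

static void EnsureBatchIndex(string dbPath)
{
    using (var conn = new SqliteConnection($"Data Source={dbPath}"))
    {
        conn.Open();
        using (var cmd = conn.CreateCommand())
        {
            // One-time and idempotent: turns "batches for a product" into an index seek instead of a full scan.
            cmd.CommandText = "CREATE INDEX IF NOT EXISTS IX_Batch_ProductId ON Batch (ProductId);";
            cmd.ExecuteNonQuery();
        }
    }
}
```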

Manipulate SQL data in memory?

I have tables and a lot of stored procedures that work with a SQL database.
For demonstration purposes I want to load data from my SQL tables into memory (maybe a DataSet that I can then store in session; the demonstration is limited, so the server memory cap won't be a problem?) and manipulate it with my stored procedures.
Is that possible? Or do I need to rewrite all my stored procedures, or even replace them with code that works with the DataSet?
Stored procedures are stored in the database. There's no way to make one of them, running over on the DB server, operate on some in-memory structure back on the web server, which is a completely different place!
What you're asking to do is a bit like saying you ordered a pizza, and now that it's been delivered to your house you want the pizza parlor to switch the crust for you on the fly, without the pizza leaving your house. Without some seriously advanced technology that can selectively distort the space-time continuum and perhaps even effect time travel, this will never happen.
If you must change the pizza once it is at your house, you either have to order a new one, or modify it yourself with your own tools.
If you wish for SQL Server to manage your table in memory, there is no guaranteed way to do so.
You can create a table variable and fill it with the dataset, and it'll probably be stored in memory.
Check this answer for more information:
Does MS-SQL support in-memory tables?
Additionally, with some server configuration you may be able to put your tempdb on a RAM disk, which would give you more than just table variables and hope: you could store your dataset in a temporary table and be sure it will be in RAM. Check this Microsoft article:
http://support.microsoft.com/?scid=kb;en-us;917047&x=17&y=9
EDIT: I would expect that if the dataset fits in server memory (and the configured per-process limit) it would be stored in the server's memory. But that's just an educated guess, as I'm not familiar with ASP.NET's architecture.
You can in theory load your entire database into memory and even store a copy of the database for each session. Whether that is a sound thing to do is a different discussion. All successful ASP applications I hear of are stateless.
It's absolutely impossible to have SQL Server manipulate your process memory, where the ADO.NET DataSets reside, through Transact-SQL procedures or any other technology. You would have to rewrite all your procedures as CLR methods in your ASP.NET application that operate on the ADO.NET DataSets.
Have you considered an in-memory DB like SQLite? Check out my answer to this SO thread. There are other alternatives too.
As a matter of fact, you can do it. It's silly, but possible.
1) Load all tables from SQL Server into the app.
2) Create an XML document containing the data.
3) Send the XML to a SQL stored procedure (you will have to create new ones) and process it there (a rough sketch follows).
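A rough sketch of step 3, serializing the in-memory DataSet and handing it to a stored procedure as an xml parameter. The procedure name dbo.ProcessDemoData and the @payload parameter are invented for the example; the procedure itself would have to shred the XML (for instance with the nodes() method) and apply your existing logic.

```csharp
using System.Data;
using System.Data.SqlClient;

static void SendDataSetToProcedure(DataSet demoData, string connectionString)
{
    // DataSet can serialize itself to XML; schema is omitted here for brevity.
    string xml = demoData.GetXml();

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.ProcessDemoData", conn))   // hypothetical procedure
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.Add(new SqlParameter("@payload", SqlDbType.Xml) { Value = xml });

        conn.Open();
        cmd.ExecuteNonQuery();
    }
}
```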
Talking about pizzas, this is like ordering every possible ingredient to be sent from Domino's to your house, opening all the boxes, rearranging them, and then sending everything back to Domino's to make the pizza you want.
Good luck :D
