C# Design decision for time series access with a database


I'm looking for a "best practice" way to handle incoming time series data.
One data point consists for example of time, height, width etc. for every "tick". Is it a good idea to save n data points in-memory with a collection class and later "flush" the points to a database after reaching the limits of the collection?
Or should the data points be directly written to the database in the first place, so that my object can run queries against it?
I know this is not much information about my requirements, so the question is really: how fast is data access to a database compared to a hybrid in-memory and database solution?
Say there are at most 500 data points per second to handle, and some calculation has to run on every incoming point. With a pure database solution, one has to run an insert query for every incoming point. I guess this is not efficient, but I don't know whether a database is able to "listen" and do this fast.
A nice feature for the database would be to send the points to subscribers. Is this possible with SQL Server?
Thanks, Juergen

Putting the "sending to subscribers" requirement aside, don't get into the trap of premature optimization.
I would try the simplest solution first, which is probably just writing the data into the database as it arrives. Then run stress tests. If the performance isn't up to scratch, find the bottlenecks and optimize them out.
Turning to the "sending to subscribers" requirement, this isn't really something that relational database platforms are typically designed for (they are more about storing data and exposing it for on-demand retrieval). A pub-sub type requirement is usually best solved using some kind of message bus. Perhaps take a look at something like NServiceBus.
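To make the pub-sub idea concrete, here is a minimal in-process sketch. It is not NServiceBus's API and not a real message bus (no persistence, no cross-process delivery); `PointHub`, `Subscribe`, and `Publish` are hypothetical names chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Minimal in-process publish/subscribe hub (hypothetical; not a message bus API).
public sealed class PointHub<T>
{
    private readonly object _gate = new object();
    private readonly List<Action<T>> _subscribers = new List<Action<T>>();

    // Registers a handler and returns a token that unsubscribes when disposed.
    public IDisposable Subscribe(Action<T> handler)
    {
        if (handler == null) throw new ArgumentNullException(nameof(handler));
        lock (_gate) _subscribers.Add(handler);
        return new Subscription(this, handler);
    }

    // Delivers the point to a snapshot of the current subscribers,
    // so handlers may subscribe or unsubscribe while a publish is running.
    public void Publish(T point)
    {
        Action<T>[] snapshot;
        lock (_gate) snapshot = _subscribers.ToArray();
        foreach (var handler in snapshot) handler(point);
    }

    private sealed class Subscription : IDisposable
    {
        private readonly PointHub<T> _hub;
        private readonly Action<T> _handler;

        public Subscription(PointHub<T> hub, Action<T> handler)
        {
            _hub = hub;
            _handler = handler;
        }

        public void Dispose()
        {
            lock (_hub._gate) _hub._subscribers.Remove(_handler);
        }
    }
}
```

The component that receives each tick would call `Publish` once per point; one subscriber could write to the database while others run calculations. A real bus adds durability and delivery across processes, which this sketch does not attempt.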

If it is not multi-user, then keeping the data points in memory in a collection class is the definite winner.
If it is multi-user, then I would go for some sort of shared in-memory data structure on the server side that is persisted to the database from time to time.

I would say the bigger question is how you plan on storing this in SQL. I would queue the data points in memory for a period of time (1 second?) and then write a single row to the database with a blob or nvarchar field containing all the data for that second, as this will let the database scale further. The row could also contain some summary information about what happened in that second, which you could use when querying the data to reduce load on selects. Of course, this wouldn't be feasible if you want to run direct queries on the individual data points.
It all depends what you plan to do with the data...
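The buffering approach discussed in this thread (collect points in memory, then write them out in one go) could be sketched roughly as follows. This is only a sketch under assumptions: `DataPoint`, `PointBuffer`, and the flush callback are hypothetical names, and the actual database write (one row per batch, bulk insert, etc.) is left to the callback.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical data point; the fields follow the question's example.
public readonly struct DataPoint
{
    public DataPoint(DateTime time, double height, double width)
    {
        Time = time;
        Height = height;
        Width = width;
    }

    public DateTime Time { get; }
    public double Height { get; }
    public double Width { get; }
}

// Collects points in memory and hands them to a flush callback in batches.
public sealed class PointBuffer
{
    private readonly int _capacity;
    private readonly Action<IReadOnlyList<DataPoint>> _flush;
    private readonly object _gate = new object();
    private List<DataPoint> _pending;

    public PointBuffer(int capacity, Action<IReadOnlyList<DataPoint>> flush)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
        _flush = flush ?? throw new ArgumentNullException(nameof(flush));
        _pending = new List<DataPoint>(capacity);
    }

    public void Add(DataPoint point)
    {
        List<DataPoint> full = null;
        lock (_gate)
        {
            _pending.Add(point);
            if (_pending.Count >= _capacity)
            {
                full = _pending;
                _pending = new List<DataPoint>(_capacity);
            }
        }
        // Flush outside the lock so a slow database write does not block producers.
        if (full != null) _flush(full);
    }

    // Call on shutdown (or from a timer) so a partial batch is not lost.
    public void Flush()
    {
        List<DataPoint> rest;
        lock (_gate)
        {
            if (_pending.Count == 0) return;
            rest = _pending;
            _pending = new List<DataPoint>(_capacity);
        }
        _flush(rest);
    }
}
```

With 500 points per second and a capacity of 500, the callback would run about once per second. Points still in the buffer are lost if the process crashes, which is the trade-off against writing each point directly.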

Related

ADO and Microsoft SQL database backup and archival

I am working on re-engineering/upgrading a tool. The database communication is in C++ (unmanaged ADO) and connects to SQL Server 2005.
I have a few questions regarding archiving and backup/restore techniques.
Generally, archiving is different from backup/restore. Can someone provide a link that explains the difference? Presently the solution uses the bcp tool for archival, and I see a lot of dependency on table names in the code. What should I consider in choosing the design, given that backup/archival has to be triggered by a button click and the database is 100 MB at most?
Will moving the entire database communication to .NET be of any help, considering the many ORM tools available? All the business logic and UI are already in C#.
What is the best method to verify the archived data?
PS: The question might be too high-level, but I could not find a proper link to understand this. It would be really helpful if someone could answer. I can provide more details!
Thanks in advance!

At 100 MB, I would say you should probably not spend too much time on archiving, and just use traditional backup strategies. The size of your database is so small that archiving would be quite an elaborate operation with very little gain, as the archiving process would typically only be relevant in the case of huge databases.
Generally speaking, a backup in database terms is a way to provide recoverability in case of a disaster (accidental data deletion, server crash, etc). Archiving mostly means you partition your data.
A possible goal with archiving is to keep specific data available for querying, but without the ability to alter it. When dealing with high volume databases, this is an excellent way to increase performance, as read-only data can be indexed much more densely than "hot" data. It also allows you to move the read-only data to an isolated RAID partition that is optimized for READ operations, and will not have to bother with the typical RDBMS IO. Also, removing the non-active data from the regular database means the size of the data contained in your tables will decrease, which should boost performance of the overall system.
Archiving is typically done for legal reasons. The data in question might not be important for the business anymore, but the IRS or banking rules require it to be available for a certain amount of time.
Using SQL Server, you can archive your data using partitioning strategies. This normally involves figuring out the criteria based on which you will split the data. An example of this could be a date (e.g. data older than 3 years will be moved to the archive part of the database). In case of huge systems, it might also make sense to split data based on geographical criteria (e.g. Americas on one server, Europe on another).
To answer your questions:
1) See the explanation written above
2) It really depends on what the goal of upgrading is. Moving it to .NET will get the code to be managed, but how important is that for the business?
3) If you do decide to partition, verifying it works could include issuing a query on the original database for data that contains both values before and after the threshold you will be using for partitioning, then splitting the data, and re-issuing the query afterwards to verify it still returns the same record-set. If you configure the system to use an automatic sliding window, you could also keep an eye on the system to ensure that data will automatically be moved to the archive partition.
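The before/after comparison described in point 3 could be checked in code along these lines. This is a sketch only: `ArchiveCheck.SameRows` is a hypothetical helper, and it assumes both result sets have already been read into memory as non-null values with value equality (for example tuples of column values).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ArchiveCheck
{
    // True when both result sets contain the same rows, ignoring order
    // but respecting duplicates. Rows must be non-null.
    public static bool SameRows<T>(IEnumerable<T> before, IEnumerable<T> after)
    {
        // Count how often each row occurs in the "before" result set.
        var counts = new Dictionary<T, int>();
        foreach (var row in before)
        {
            counts.TryGetValue(row, out var n);
            counts[row] = n + 1;
        }

        // Consume one occurrence for every row in the "after" result set.
        foreach (var row in after)
        {
            if (!counts.TryGetValue(row, out var n) || n == 0) return false;
            counts[row] = n - 1;
        }

        // Anything left over was present before but is missing afterwards.
        return counts.Values.All(n => n == 0);
    }
}
```

For a database of around 100 MB, holding both result sets in memory for such a check should be unproblematic; for much larger data, comparing row counts and checksums per partition would be the more practical variant.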
Again, if the 100MB is not a typo, I would think your database is too small to really benefit from archiving. If your goal is to speed things up, put the system on a server that is able to load the whole database into RAM, or use SSD drives.
If you need to establish a data archive for legal or administrative reasons, give horizontal table partitioning a look. It's a pretty straight-forward process that is mostly handled by SQL Server automatically.
Hope this helps you out!

SQL Database VS. Multiple Flat Files (Thousands of small CSV's)

We are designing an update to a current system (C++\CLI and C#).
The system will gather small (~1Mb) amounts of data from ~10K devices (in the near future). Currently, the device data is saved as CSV files (one table each), all stored in a wide folder structure.
Data is only inserted (create / append to a file, create folder) never updated / removed.
Data processing is done by reading many CSVs into an external program (like Matlab), mainly for statistical analysis.
There is an option to start saving this data to an MS-SQL database.
Process time (reading the CSV's to external program) could be up to a few minutes.
How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
I'd appreciate your answers, Pros and Cons are welcome.
Thank you for your time.

Well, if you are using data in one CSV to get data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.
Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
Possible Pros:
Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships
Possible Cons:
You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server, you would have to check to see if you can use Express
Good luck!

I'd like to try hitting those questions a bit out of order.
Roughly, when does reading the raw data from a database becomes quicker than reading the CSV's? (10 files, 100 files? ...)
Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
Does one of the methods take significantly more storage than the other?
Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
How should we choose which method to use?
Great question. Everything in the database always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.
It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query that 100 GB worth of data in a second.

This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, or you just don't think you can figure out SQL Server, or if you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.

If you have the option to use a ms-sql database, I would do that.
Maintaining data in a wide folder structure is never a good idea. Reading your data would involve reading several files. These could be stored anywhere on your disk. Your file-io time would be quite high. SQL server being a production database has these problems already taken care of.
You are reinventing the wheel here. This is how foxpro manages data, one file per table. It is usually a good idea to use proven technology unless you are actually making a database server.
I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given your about 10k devices, you should consider using a standard database.

Using LINQ vs SQL for Filtering Collection

I have a very general question regarding the use of LINQ vs SQL to filter a collection. Let's say you are running a fairly complex filter on a database table. It's running, say, 10,000 times and the filters could be different every time. Performance-wise, are you better off loading the entire database table into memory and executing the filters with LINQ, or should you let the database handle the filtering with SQL (since that's what it was built to do)? Any thoughts?
EDIT: I should have been more clear. Let's assume we're talking about a table with 1000 records with 20 columns (containing int/string/date data). Currently in my app I am running one query every half hour to pull all of the data into a collection (saving that collection in the application cache) and filtering that cached collection throughout my app. I'm wondering if that is worse than doing tons of round trips to the database server (it's Oracle, fwiw).

After the update:
It's running, say 10,000 times and
I'm going to assume a table with 1000 records
It seems reasonable to assume the 1k records will fit easily in memory.
And then running 10k filters will be much cheaper in memory (LINQ).
Using SQL would mean loading 10M records, a lot of I/O.
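A rough sketch of the in-memory approach: load the roughly 1000 rows once, then run each filter as a LINQ predicate over that snapshot. `Item`, its properties, and `CachedFilter` are hypothetical; the question does not show its real row type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; the question only says 20 int/string/date columns.
public sealed class Item
{
    public int Id { get; set; }
    public string Country { get; set; }
    public decimal Price { get; set; }
}

public static class CachedFilter
{
    // Applies every filter to the same in-memory snapshot and returns
    // the number of matches per filter; no database round trip is involved.
    public static List<int> CountMatches(IReadOnlyList<Item> snapshot,
                                         IEnumerable<Func<Item, bool>> filters)
    {
        return filters.Select(filter => snapshot.Count(filter)).ToList();
    }
}
```

Each filter is a linear scan, so 10,000 filters over 1000 rows means on the order of 10 million cheap in-memory predicate calls, with no network or disk I/O after the initial load.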

EDIT
It always depends on the amount of data you have. If you have a large amount of data, go for SQL; if less, go for LINQ. It also depends on how frequently you query the data from SQL Server: if very frequently, it is better to load it into memory and then apply LINQ; if not, SQL is better.
First Answer
It is better to filter on the SQL side rather than loading everything into memory and then applying a LINQ filter.
The reason to prefer SQL over LINQ is:
If you go for LINQ: when you fetch 10,000 records, they are all loaded into memory and network traffic increases.
If you go for SQL: the number of records returned decreases, so less memory is used and network traffic also decreases.

Depends on how big your table is and what type of data it stores.
Personally, I'd go with returning all the data if you plan to use all your filters during the same request.
If it's a filter on demand using ajax, you could reload the data from the database every time (which also ensures your data is up to date).

This will probably cause some debate on the role of a database! I had this exact problem a little while back: some relatively complex filtering (things like "is in X country, where price is y and has the keyword z") and it was horrifically slow. Coupled with this, I was not allowed to change the database structure because it was a third-party database.
I swapped out all of the logic, so that the database just returned the results (which I cached every hour) and did the filtering in memory. When I did this I saw massive performance increases.

I would say it is far better to let SQL do the complex filtering and the rest of the processing. Why, you may ask?
The main reason is that SQL Server has the index information you have set up and uses those indexes to access data very fast. If you load the data into memory and filter with LINQ, you do not have those indexes for fast access, so you lose time accessing the data. You also lose time compiling the LINQ query every time.
You can run a simple test to see the difference yourself. What test? Create a simple table with a hundred random strings and index the string field. Then search on the string field, once using LINQ and once by querying SQL directly.
Update
My first thought was that SQL keeps the index and gives very quick access to the data you search for.
Then I thought that LINQ can also translate the filter to SQL and then fetch the data, after which you perform your actions, etc.
Now I think the actual answer depends on what you do. Running the SQL directly is faster, but by how much depends on how you set up your LINQ.
If you load everything into memory and then use LINQ, you lose the speed of the SQL index, you use more memory, and you spend a lot of work moving your data from SQL to memory.
If you get the data using LINQ and no further searching is needed, you still lose on moving all that data into memory, and on the memory itself.

It depends on the amount of data you are filtering on.
You say the filter runs 10K times and can be different every time. In this case, if you don't have much data in the database, you can load it into a server-side variable.
If you have hundreds of thousands of records in the database, you should not do this; instead you can create indexes on the database and use precompiled procedures to fetch data faster.
You can implement a cache facade in between that stores data on the server side on the first request and updates it as required. (You can write the cache so that it fills the variable only if the data is under a record limit.)
You can measure the time needed to get data from the database by running some test queries. At the same time, you can observe the response time from the server when the data is stored in memory, calculate the difference, and decide accordingly.
There can be many other tricks, but the bottom line is:
You have to observe and decide.
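One way to do the measuring suggested above is a small timing helper built on `System.Diagnostics.Stopwatch`. This is a sketch; `Timing.AverageMilliseconds` is a hypothetical name, and the two actions you pass in (the database query and the in-memory filter) are up to you.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action `iterations` times and returns the mean elapsed milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        // Warm-up run, not measured (JIT compilation, connection pools, caches).
        action();

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

Calling it once with the database round trip and once with the cached filter, using realistic filters and data volumes, gives the two numbers to compare.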

How can I use a very large dictionary in C#?

I want to use a lookup map or dictionary in a C# application, but it is expected to store 1-2 GB of data.
Can someone please tell if I will still be able to use dictionary class, or if I need to use some other class?
EDIT: We have an existing application which uses an Oracle database to query or look up object details. It is, however, too slow, since the same objects are queried repeatedly. I feel it might be ideal to use a lookup map for this scenario to improve the response time. However, I am worried that the size will make it a problem.

Short Answer
Yes. If your machine has enough memory for the structure (and the overhead of the rest of the program and system including operating system).
Long Answer
Are you sure you want to? Without knowing more about your application, it's difficult to know what to suggest.
Where is the data coming from? A file? Files? A database? Services?
Is this a caching mechanism? If so, can you expire items out of the cache once they haven't been accessed for a while? This way, you don't have to hold everything in memory all the time.
As others have suggested, if you're just trying to store lots of data, can you just use a database? That way you don't have to have all of the information in memory at once. With indexing, most databases are excellent at performing fast retrieves. You could combine this approach with a cache.
Is the data that will be in memory read only, or will it have to be persisted back to some storage when something changes?
Scalability - do you expect that the amount of data that will be stored in this dictionary will increase as time goes on? If so, you're going to run into a point where it's very expensive to buy machines that can handle this amount of data. You might want to look at a distributed caching system if this is the case (AppFabric comes to mind) so you can scale out horizontally (more machines) instead of vertically (one really big expensive point of failure).
UPDATE
In light of the poster's edit, it sounds like caching would go a long way here. There are many ways to do this:
Simple dictionary caching - just cache stuff as it's requested.
Memcache
Caching Application Block - I'm not a huge fan of this implementation, but others have had success.
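The first option in the list (simple dictionary caching) might look roughly like this. It is a sketch under assumptions: `LookupCache` is a hypothetical class, and the `load` delegate stands in for whatever database query the application currently runs.

```csharp
using System;
using System.Collections.Concurrent;

// Caches values as they are requested; the loader is only called on a miss.
public sealed class LookupCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _items =
        new ConcurrentDictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _load;

    public LookupCache(Func<TKey, TValue> load)
    {
        _load = load ?? throw new ArgumentNullException(nameof(load));
    }

    // Returns the cached value, loading and storing it on first access.
    // Under concurrent access the loader may run more than once for the
    // same key, but only one result is kept.
    public TValue Get(TKey key)
    {
        return _items.GetOrAdd(key, _load);
    }

    // Number of entries currently held in memory.
    public int Count
    {
        get { return _items.Count; }
    }
}
```

Because only objects that are actually requested end up in memory, this may stay well below the 1-2 GB the full data set would need. It has no expiry, so stale entries and unbounded growth would have to be handled separately.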

As long as you're on a 64GB machine, yes you should be able to use that large of a dictionary. However if you have THAT much data, a database may be more appropriate (cassandra is really nothing but a gigantic dictionary, and there's always MySQL).

When you say 1-2GB of data, I assume that you mean the items are complex objects that cumulatively contain 1-2GB.
Unless they're structs (and they shouldn't be), the dictionary doesn't care how big the items are.
As long as you have less than about 2^24 items (I pulled that number out of a hat), you can store as much as you can fit in memory.
However, as everyone else has suggested, you should probably use a database instead.
You may want to use an in-memory database such as SQL CE.

You can, but for a dictionary as large as that you are better off using a database.

Use a database.
Make sure you have a good DB model, put the correct indexes in place, and off you go.

You can use subdictionaries.
Dictionary<KeyA, Dictionary<KeyB ....
Where KeyA is some common part of KeyB.
For example, if you have a String dictionary you can use the First letter as KeyA.
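A minimal sketch of the sub-dictionary idea for string keys, using the first character as the outer key. `PartitionedDictionary` is a hypothetical name. The answer does not say why splitting helps; one plausible motivation is that it keeps each inner dictionary's backing arrays smaller, which matters on runtimes that limit a single object to 2 GB.

```csharp
using System;
using System.Collections.Generic;

// Splits one large string-keyed dictionary into buckets by first character.
public sealed class PartitionedDictionary<TValue>
{
    private readonly Dictionary<char, Dictionary<string, TValue>> _buckets =
        new Dictionary<char, Dictionary<string, TValue>>();

    public void Set(string key, TValue value)
    {
        if (string.IsNullOrEmpty(key))
            throw new ArgumentException("Key must be non-empty.", nameof(key));

        // Create the bucket for this first character on demand.
        if (!_buckets.TryGetValue(key[0], out var bucket))
        {
            bucket = new Dictionary<string, TValue>();
            _buckets[key[0]] = bucket;
        }
        bucket[key] = value;
    }

    public bool TryGet(string key, out TValue value)
    {
        value = default(TValue);
        return !string.IsNullOrEmpty(key)
            && _buckets.TryGetValue(key[0], out var bucket)
            && bucket.TryGetValue(key, out value);
    }
}
```

The first character is a poor partition key if most keys share a prefix; a hash of the key modulo a fixed bucket count would spread entries more evenly.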

C#: Very fast object search & retrieval using any persistence model

I am developing an application with Fluent NHibernate/NHibernate 3/SQLite. I have run into a very specific problem that I need help with.
I have a product database and a batch database. There are around 100k products, but batches number 11 million+ as of now. When provided with a product, I need to fill a ComboBox with its batches. As I do not want to load all the batches at once because of memory constraints, I am loading them directly from the database when the product is provided. But the problem is that SQLite (or maybe the combination of SQLite and NHibernate) is a little slow for this. It normally takes 3+ seconds to retrieve the batches for a particular product. Although it might not seem like a slow scenario, I want to know whether I can improve this time. I need sub-second results to make order entry a smooth experience.
The details:
New products and batches are imported periodically (bi-monthly).
Nothing in the already persisted products or batches ever changes (no updates).
Storing products is not an issue. Batches are the main culprit.
Product Ids are long
Batch Ids are string
Batches contain 3 fields, rate, mrp (both decimal) & expiry (DateTime).
The requirements:
The data has to be stored in a file based solution. I cannot use a client-server approach.
Storage time is not important. Search & retrieval time is.
I am open to storing the batch database using any other persistence model.
I am open to using anything like Lucene, a NoSQL database (like Redis), or an OODB, provided it is based on a single storage file implementation.
Please suggest what I can use for fast object retrieval.
Thanks.

You need to profile or narrow down to find out where those 3+ seconds are.
Is it the database fetching?
Try running the same queries in a SQLite browser. Do the queries take 3+ seconds there too? Then you might need to do something with the database, like adding some good indexes.
Is it the filling of the combobox?
What if you only fill the first value in the combobox and throw away the others? Does that speed up the performance? Then you might try BeginUpdate and EndUpdate.
Are the 3+ seconds elsewhere? If so, find out where.

This may seem like a silly question, but I figured I'd double-check before proceeding to alternatives or other optimizations: is there an index (or hopefully a primary key) on the Batch Id column in your Batch table? Without indexes those kinds of searches will be painfully slow.
For fast object retrieval, a key/value store is definitely a viable alternative. I'm not sure I would necessarily recommend Redis in this situation, since your Batches database may be a little too large to fit into memory, and although it also persists to disk, it's generally better suited to a dataset that fits entirely into memory.
My personal favourite would be mongodb - but overall the best thing to do would be to take your batches data, load it into a couple of different nosql dbs and see what kind of read performance you're getting and pick the one that suits the data best. Mongo's quite fast and easy to work with - and you could probably ditch the nhibernate layer for such a simple data structure.
There is a daemon that needs to run locally, but depending on the size of the db it will be single file (or a few files if it has to allocate more space). Again, ensure there is an index on your batch id column to ensure quick lookups.

3 seconds to load ~100 records from the database? That is slow. You should examine the generated sql and create an index that will improve the query's performance.
In particular, the ProductId column in the Batches table should be indexed.
