SQL Database vs. Multiple Flat Files (Thousands of Small CSVs)

We are designing an update to an existing system (C++/CLI and C#).
The system will gather small amounts of data (~1 MB each) from ~10K devices in the near future. Currently, each device's data is saved as a CSV file (one table per file), and all of these files are stored in a wide folder structure.
Data is only inserted (create a file, append to a file, create a folder); it is never updated or removed.
Data processing is done by reading many CSVs into an external program (such as Matlab), mainly for statistical analysis.
There is an option to start saving this data to an MS SQL Server database.
Processing time (reading the CSVs into the external program) can be up to a few minutes.
How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
I'd appreciate your answers; pros and cons are welcome.
Thank you for your time.

Well, if you are using data in one CSV to look up data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.
Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
Possible Pros:
Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships
Possible Cons:
You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server; check whether the free Express edition would be sufficient
Good luck!

I'd like to try hitting those questions a bit out of order.
"Roughly, when does reading the raw data from a database becomes quicker than reading the CSV's? (10 files, 100 files? ...)"
Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
"Does one of the methods take significantly more storage than the other?"
Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
"How should we choose which method to use?"
Great question. Everything in the database always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.
It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query that 100 GB worth of data in a second.

This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, or you just don't think you can figure out SQL Server, or if you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.

If you have the option to use an MS SQL Server database, I would do that.
Maintaining data in a wide folder structure is rarely a good idea. Reading your data would involve reading several files, which could be stored anywhere on your disk, so your file I/O time would be quite high. SQL Server, being a production database, already takes care of these problems.
You are reinventing the wheel here. This is how FoxPro manages data: one file per table. It is usually a good idea to use proven technology unless you are actually building a database server.
I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given your roughly 10K devices, you should consider using a standard database.
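For the migration itself, a minimal C# sketch of loading the existing CSV tree into SQL Server in bulk might look like the following. The CSV layout (a header line followed by timestamp,value rows), the use of the file name as the device id, and the dbo.DeviceReadings table are all assumptions made for illustration; none of them come from the question. SqlBulkCopy is part of System.Data.SqlClient in the .NET Framework.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;   // built into the .NET Framework; a separate package on newer .NET
using System.Globalization;
using System.IO;

static class CsvImport
{
    // Reads every *.csv under rootFolder into one in-memory table.
    // Assumed (hypothetical) layout: a header line, then "timestamp,value" rows,
    // with the file name (without extension) used as the device id.
    public static DataTable LoadCsvFolder(string rootFolder)
    {
        var table = new DataTable("DeviceReadings");
        table.Columns.Add("DeviceId", typeof(string));
        table.Columns.Add("ReadAt", typeof(DateTime));
        table.Columns.Add("Value", typeof(double));

        foreach (string path in Directory.EnumerateFiles(rootFolder, "*.csv", SearchOption.AllDirectories))
        {
            string deviceId = Path.GetFileNameWithoutExtension(path);
            bool isHeader = true;
            foreach (string line in File.ReadLines(path))
            {
                if (isHeader) { isHeader = false; continue; }   // skip the header line
                if (line.Trim().Length == 0) continue;          // skip blank lines
                string[] fields = line.Split(',');
                table.Rows.Add(
                    deviceId,
                    DateTime.Parse(fields[0], CultureInfo.InvariantCulture),
                    double.Parse(fields[1], CultureInfo.InvariantCulture));
            }
        }
        return table;
    }

    // Appends the rows to an existing table (the name dbo.DeviceReadings is hypothetical).
    public static void BulkInsert(DataTable table, string connectionString)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.DeviceReadings";
            foreach (DataColumn column in table.Columns)
                bulk.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            bulk.WriteToServer(table);
        }
    }
}
```

Calling CsvImport.BulkInsert(CsvImport.LoadCsvFolder(root), connectionString) would then append everything in one bulk operation. With thousands of files it may be better to load and insert one folder at a time so the DataTable stays small.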

Related

Most effective way of storing and managing moderate number of users

In a current project of mine I need to manage and store a moderate number (from 10-100 to 5000+) of users (ID, username, and some other data).
This means I have to be able to find users quickly at runtime, and I have to be able to save and restore the database to continue statistics after a restart of the program. I will also need to register every connect/disconnect/login/logout of a user for the statistics. (And some other data as well, but you get the idea).
In the past, I saved settings and other stuff in encoded textfiles, or serialized the needed objects and wrote them down. But these methods require me to rewrite the whole database on each change, and that's increasingly slowing it down (especially with a growing number of users/entries), isn't it?
Now the question is: What is the best way to do this kind of thing in C#?
Unfortunately, I don't have any experience in SQL or other query languages (except for a bit of LINQ), but that's not posing any problem for me, as I have the time and motivation to learn one (or more if required) for this task.

"Most effective" is highly subjective and depends on who you ask, even when the question is narrowed down to specific needs. If you are storing non-relational data, MongoDB or another NoSQL database such as RavenDB would be effective. If your data has a relational shape, an RDBMS such as MySQL, SQL Server, or Oracle would be effective. Relational databases are ideal if you are going to have heavy reporting requirements, since they let non-developers write simple SQL queries against the data. Also keep in mind the performance benefit of the caching that databases provide: commonly accessed data is kept in memory to save round trips to the disk (with hybrid drives, I suppose accessing some files directly accomplishes something similar, but SSDs are still not as fast as RAM). So you really need to ask yourself some questions to identify the best solution for you: What is the shape of your data (flat, relational, etc.)? Do you have reporting requirements where less technical team members need to query the data repository? What are your performance requirements?

ADO and Microsoft SQL database backup and archival

I am working on re-engineering and upgrading a tool. The database communication is in C++ (unmanaged ADO) and connects to SQL Server 2005.
I have a few questions regarding archiving and backup/restore techniques.
1) Generally, archiving is different from backup/restore. Can someone provide a link that explains the difference? Presently the solution uses the bcp tool for archival, and I see a lot of dependency on table names in the code. What should I consider in choosing the design (given that backup/archival has to be triggered by a button click, and the database size is 100 MB at most)?
2) Will moving the entire communication layer to .NET be of any help, considering the many ORM tools available? All the business logic and UI is already in C#.
3) What is the best method to verify the archived data?
PS: The question might be too high level, but I could not find a good link that explains this. It would be really helpful if someone could answer. I can provide more details!
Thanks in advance!

At 100 MB, I would say you should probably not spend too much time on archiving, and just use traditional backup strategies. The size of your database is so small that archiving would be quite an elaborate operation with very little gain, as the archiving process would typically only be relevant in the case of huge databases.
Generally speaking, a backup in database terms is a way to provide recoverability in case of a disaster (accidental data deletion, server crash, etc). Archiving mostly means you partition your data.
A possible goal with archiving is to keep specific data available for querying, but without the ability to alter it. When dealing with high volume databases, this is an excellent way to increase performance, as read-only data can be indexed much more densely than "hot" data. It also allows you to move the read-only data to an isolated RAID partition that is optimized for READ operations, and will not have to bother with the typical RDBMS IO. Also, removing the non-active data from the regular database means the size of the data contained in your tables will decrease, which should boost performance of the overall system.
Archiving is typically done for legal reasons. The data in question might not be important for the business anymore, but the IRS or banking rules require it to be available for a certain amount of time.
Using SQL Server, you can archive your data using partitioning strategies. This normally involves figuring out the criteria based on which you will split the data. An example of this could be a date (e.g. data older than 3 years will be moved to the archive part of the database). In the case of huge systems, it might also make sense to split data based on geographical criteria (e.g. Americas on one server, Europe on another).
To answer your questions:
1) See the explanation written above
2) It really depends on what the goal of upgrading is. Moving it to .NET will get the code to be managed, but how important is that for the business?
3) If you do decide to partition, verifying it works could include issuing a query on the original database for data that contains both values before and after the threshold you will be using for partitioning, then splitting the data, and re-issuing the query afterwards to verify it still returns the same record-set. If you configure the system to use an automatic sliding window, you could also keep an eye on the system to ensure that data will automatically be moved to the archive partition.
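The before/after comparison described in 3) could be automated along these lines. This is only a sketch: it assumes both result sets have already been loaded into DataTables (for example with SqlDataAdapter.Fill), and the ArchiveCheck class and its method names are made up for illustration. Rows are compared as text, independent of row order.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Globalization;
using System.Linq;

static class ArchiveCheck
{
    // Turns each row into one string and sorts the strings,
    // so the order in which rows come back does not matter.
    static List<string> Normalize(DataTable table)
    {
        return table.Rows.Cast<DataRow>()
            .Select(row => string.Join("\u001f",
                row.ItemArray.Select(v => Convert.ToString(v, CultureInfo.InvariantCulture))))
            .OrderBy(s => s, StringComparer.Ordinal)
            .ToList();
    }

    // True when both tables hold the same rows (same values, same multiplicity).
    public static bool SameRecords(DataTable before, DataTable after)
    {
        return Normalize(before).SequenceEqual(Normalize(after));
    }
}
```

If ArchiveCheck.SameRecords(before, after) returns false after the split, the query result has changed and the partitioning step needs a closer look.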
Again, if the 100MB is not a typo, I would think your database is too small to really benefit from archiving. If your goal is to speed things up, put the system on a server that is able to load the whole database into RAM, or use SSD drives.
If you need to establish a data archive for legal or administrative reasons, give horizontal table partitioning a look. It's a pretty straightforward process that is mostly handled by SQL Server automatically.
Hope this helps you out!

Which is faster in C# XML or SQL?

Which is faster in C#: reading tiny XML files, or reading tiny SQL tables with a small amount of data?
I wonder if it is really necessary to create a table in SQL and then establish a connection just to read 10 or 11 parameters.
What would you recommend?

It really depends on what you need. Nothing stops you from combining the two worlds, as XML can easily be stored in SQL Server.
If you want SQL Server's authentication, or want the data backed up, versioned, or whatever, you can easily design a mixed XML/SQL table solution. If you really just need a property-bag persistence area, files are OK, but they still require care, i.e. access control, handling the case where the file is not present, etc. (reading a file can still throw a lot of exceptions, and it does so with good reason).
Ask yourself questions like:
do I need restricted access?
how will I report changes (if any)?
do I need version history?
do I read all the parameters or only part of them?
what do I need to do if someone changes an entry?
what should I do when there is no entry?
does it need to be extensible (new parameters added/removed)?
should it be encrypted?
does the database layer need to know about it?
Just some thoughts off the top of my head.
Luke

If you just have a handful of 'settings' that you want to read, I would definitely go with a small XML file. I can't say definitively that it would be faster, but given that you would eliminate the overhead of establishing the connection, authenticating, etc., it would definitely be simpler.
And if you can use LINQ to XML, it's really easy to do.
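As a rough illustration of the LINQ to XML route, the sketch below reads a handful of settings into a dictionary. The file layout (a root element containing add elements with key and value attributes) and the Settings class are assumptions for this example, not something the question specifies.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class Settings
{
    // Parses settings from an XML string.
    public static Dictionary<string, string> Parse(string xml)
    {
        return Read(XDocument.Parse(xml));
    }

    // Loads settings from a file on disk.
    public static Dictionary<string, string> Load(string path)
    {
        return Read(XDocument.Load(path));
    }

    // Assumed (hypothetical) layout:
    // <settings><add key="Port" value="8080" /> ... </settings>
    static Dictionary<string, string> Read(XDocument doc)
    {
        return doc.Root.Elements("add").ToDictionary(
            e => (string)e.Attribute("key"),
            e => (string)e.Attribute("value"));
    }
}
```

Reading a value is then a dictionary lookup, for example Settings.Load("settings.xml")["Port"]. A missing file or malformed XML still throws, so the calling code needs to handle that.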

Speed is not the only consideration. You don't have as much admin overhead with XML files as you would with SQL Server.
If the file is local, it will certainly be faster to read it with direct file access than through networked SQL access. There is far less between you and the data, and no impact on your process from other SQL usage.

Reading a lot of files is slow, so if you have tons of XML files I would vote for SQL. You also have to parse the XML files, which is far more complicated and time-consuming than making a connection to a DB, especially if the DB is on localhost :)

SQL based method: pros
Easy to migrate, configure
SQL based method: cons
Connection can be down, connection takes time to establish, DB admin will wonder why there is a tiny table that has no meaning, codebase becomes unnecessarily complex
File based method: pros
Fast, no overhead on DB
File based method: cons
Migration is an issue. Configuration is an issue. Can easily get corrupted.

How to store my data (C#.net)

I'm having a bit of a problem deciding how to store some data. Seen from a simple perspective, it will be a simple table of data, but there will be many tables. There will be about 7 columns in each table, but again, there will be a lot of tables (and they will be created at runtime, whenever the customer wants a clean grid).
The data has to be stored locally in a file (and there will not be multiple instances of the software running).
I'm using C# 4.0 and I have been looking at using XML files (one file per table, or storing multiple tables in a file), SQLite, SQL Server CE, Access, etc. I would be happy to hear any comments or suggestions on what to do and what not to do. Stability and reliability (e.g. no trashed databases because of unstable third-party software) are probably my biggest concerns.

If you are looking to store the data locally in a file, I would recommend the SQLite option, since it seems your data is already shaped like a database table. SQLite is built to handle multiple tables and columns, so it means less mental overhead for you, the developer.
http://web.archive.org/web/20100208133236/http://www.mikeduncan.com/sqlite-on-dotnet-in-3-mins/ is a decent tutorial to give a quick overview on how to set it up and get going.
As for what NOT to do: don't try to make your own scheme to save the data to a file, it's a well understood problem that has been solved many times over, why re-invent the wheel?
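To give a feel for the SQLite route, here is a small sketch of creating a grid table at runtime and adding a row. It assumes the System.Data.SQLite ADO.NET provider is referenced; the GridStore class, the two-column schema, and the naming rule are hypothetical. Table names cannot be passed as SQL parameters, so the sketch validates them before building the statement.

```csharp
using System;
using System.Data.SQLite;               // assumption: the System.Data.SQLite provider is referenced
using System.Text.RegularExpressions;

static class GridStore
{
    // Only letters, digits and underscores are allowed, because the table name
    // is concatenated into the SQL text rather than passed as a parameter.
    public static bool IsSafeTableName(string name)
    {
        return name != null && Regex.IsMatch(name, "^[A-Za-z_][A-Za-z0-9_]{0,62}$");
    }

    // Creates a new grid table if it does not exist yet (the columns are placeholders).
    public static void CreateGrid(SQLiteConnection conn, string tableName)
    {
        if (!IsSafeTableName(tableName))
            throw new ArgumentException("Unsafe table name", "tableName");
        string sql = "CREATE TABLE IF NOT EXISTS [" + tableName + "] " +
                     "(Id INTEGER PRIMARY KEY, Col1 TEXT, Col2 TEXT)";
        using (var cmd = new SQLiteCommand(sql, conn))
            cmd.ExecuteNonQuery();
    }

    // Inserts one row; the values go through parameters, not string concatenation.
    public static void AddRow(SQLiteConnection conn, string tableName, string col1, string col2)
    {
        if (!IsSafeTableName(tableName))
            throw new ArgumentException("Unsafe table name", "tableName");
        string sql = "INSERT INTO [" + tableName + "] (Col1, Col2) VALUES (@c1, @c2)";
        using (var cmd = new SQLiteCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@c1", col1);
            cmd.Parameters.AddWithValue("@c2", col2);
            cmd.ExecuteNonQuery();
        }
    }
}
```

A connection would be opened with something like new SQLiteConnection("Data Source=grids.db;Version=3;") followed by Open(), after which all grids live in a single local file.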

XML won't be a good choice if you are planning to run many queries, since loading text files becomes painful as they grow (I'm talking about files over 1 MB). If you plan to keep the data small, XML would be good for keeping things simple. I still wouldn't use it, but if you already have a background in it, the benefits will outweigh the learning curve.
If you have no expertise in any of them and the data is light, my suggestion is SQLite. I believe it is the best lightweight DB for .NET, and the provider is very good. You can find it easily on Google.
I would tell you that Access is not recommended, but that is a personal opinion. Many people use it, and I assume there is a reason for that, so you should check it out and try it.
Again, my final recommendation is SQLite, unless you know another one very well, in which case you'll have to think about how much your data is going to grow. If you plan to have a DB of around 100 MB, any of them except XML would do; if you think it will grow bigger than that, consider SQLite heavily.

Best way to process a large database?

Background:
I have one Access database (.mdb) file, with half a dozen tables in it. This file is ~300MB large, so not huge, but big enough that I want to be efficient. In it, there is one major table, a client table. The other tables store data like consultations made, a few extra many-to-one to one fields, that sort of thing.
Task:
I have to write a program to convert this Access database to a set of XML files, one per client. This is a database conversion application.
Options (as I see it):
1) Load the entire Access database into memory in the form of Lists of immutable objects, then use LINQ to do lookups in these lists for the associated data I need.
Benefits:
Easily parallelised. Start up a ThreadPool thread for each client. Because all the objects are immutable, they can be freely shared between the threads, which means all threads have access to all data at all times, and it is all loaded exactly once.
(Possible) Cons:
May use extra memory, loading orphaned items, items that aren't needed anymore, etc.
2) Use Jet to run queries on the database to extract data as needed.
Benefits:
Potentially lighter weight. Only loads data that is needed, and as it is needed.
(Possible) Cons:
Potentially heavier! May load items more than once and hence use more memory.
Possibly hard to parallelise, unless Jet/OleDb supports concurrent queries (can someone confirm/deny this?)
3) Some other idea?
What are Stack Overflow's thoughts on the best way to approach this problem?
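For reference, the first option (load everything once into immutable objects, then process clients in parallel) could be sketched as below. The Client and Consultation types and their fields are hypothetical stand-ins, since the real schema is not shown. Parallel.ForEach is used instead of queuing ThreadPool items by hand, and a lookup built once replaces repeated list scans.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Xml.Linq;

// Hypothetical immutable record types standing in for the real tables.
sealed class Client
{
    public readonly int Id;
    public readonly string Name;
    public Client(int id, string name) { Id = id; Name = name; }
}

sealed class Consultation
{
    public readonly int ClientId;
    public readonly string Notes;
    public Consultation(int clientId, string notes) { ClientId = clientId; Notes = notes; }
}

static class Exporter
{
    // Builds one XML element per client, keyed by client id.
    public static IDictionary<int, XElement> BuildAll(
        IList<Client> clients, IList<Consultation> consultations)
    {
        // Built once, then only read, so it can be shared across threads.
        ILookup<int, Consultation> byClient = consultations.ToLookup(c => c.ClientId);
        var result = new ConcurrentDictionary<int, XElement>();

        Parallel.ForEach(clients, client =>
        {
            result[client.Id] = new XElement("Client",
                new XAttribute("id", client.Id),
                new XElement("Name", client.Name),
                // The lookup returns an empty sequence for clients without consultations.
                byClient[client.Id].Select(c => new XElement("Consultation", c.Notes)));
        });
        return result;
    }
}
```

Each element could then be written out with XElement.Save, using a file name derived from the client id. Whether the parallelism pays off depends on how much of the time goes into disk writes rather than building the XML.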

Generate XML parts from SQL. Store each fetched record in the file as you fetch it.
Sample:
SELECT '<NODE><Column1>' + Column1 + '</Column1><Column2>' + Column2 + '</Column2></NODE>' from MyTable
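One caveat with building XML by string concatenation, as in the sample above, is that values containing characters such as & or < produce invalid XML. If the rows are read into C# first, a sketch like the following lets XElement do the escaping. The NodeBuilder class is made up for illustration; the element names mirror the sample.

```csharp
using System.Xml.Linq;

static class NodeBuilder
{
    // Builds the same NODE structure as the SQL sample, with values escaped by XElement.
    public static string BuildNode(string column1, string column2)
    {
        var node = new XElement("NODE",
            new XElement("Column1", column1),
            new XElement("Column2", column2));
        return node.ToString(SaveOptions.DisableFormatting);
    }
}
```

For example, a Column1 value of "Smith & Sons" is written as Smith &amp;amp; Sons inside the element, so the output stays well-formed.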

If your objective is to convert your database to XML files, you can:
connect to your database through an ADO/OLEDB connection
successively open each of your tables as an ADO recordset
save each recordset as an XML file:
myRecordset.Save myXMLFile, adPersistXML
If you are working from within the Access file, use CurrentProject.AccessConnection as your ADO connection

From the sound of this, it would be a one-time operation. I strongly discourage loading the entire database into memory; that just does not seem like an efficient way of doing this at all.
Also, depending on your needs, you might be able to extract directly from Access -> XML, if that is your true end game.
Regardless, with a database that small, processing the clients one at a time with a few specifically written queries would, in my opinion, be easier to manage, faster to write, and less error-prone.

I would lean towards Jet, since you can be more specific about what data you want to pull.
Also, I noticed the large file size; this is a problem I recently came across at work. Is this an Access 95 or 97 database? If so, converting the database to the 2000 or 2003 format and then back to 97 may reduce the size; it seems to be a bug in some cases. The database I was dealing with claimed to be 70 MB, and after I converted it to 2000 and back again it was 8 MB.
