Background:
I have one Access database (.mdb) file with half a dozen tables in it. The file is ~300 MB, so not huge, but big enough that I want to be efficient. The main table is a client table; the other tables store data like consultations made, a few extra many-to-one or one-to-one fields, that sort of thing.
Task:
I have to write a program to convert this Access database to a set of XML files, one per client. This is a database conversion application.
Options:
(As I see it)
Load the entire Access database into memory as Lists of immutable objects, then use LINQ to look up the associated data I need in those lists.
Benefits:
Easily parallelised: start up a ThreadPool thread for each client. Because all the objects are immutable, they can be freely shared between threads, which means every thread has access to all the data at all times, and it is all loaded exactly once.
(Possible) Cons:
May use extra memory, loading orphaned items, items that aren't needed anymore, etc.
Use Jet to run queries on the database to extract data as needed.
Benefits:
Potentially lighter weight. Only loads data that is needed, and as it is needed.
(Possible) Cons:
Potentially heavier! May load items more than once and hence use more memory.
Possibly hard to parallelise, unless Jet/OleDb supports concurrent queries (can someone confirm or deny this?)
Some other idea?
What are Stack Overflow's thoughts on the best way to approach this problem?
Generate XML parts from SQL. Store each fetched record in the file as you fetch it.
Sample:
SELECT '<Node><Column1>' + Column1 + '</Column1><Column2>' + Column2 + '</Column2></Node>' FROM MyTable
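If you go this route from C#, a minimal sketch of running such a query over Jet/OleDb and writing one file per row might look like this; the connection string, table and column names are assumptions, and the values are not XML-escaped:

using System.Data.OleDb;
using System.IO;

class ExportClientsToXml
{
    static void Main()
    {
        var connStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=clients.mdb"; // hypothetical path
        using (var conn = new OleDbConnection(connStr))
        {
            conn.Open();
            var cmd = new OleDbCommand(
                "SELECT ClientId, '<Client><Name>' + ClientName + '</Name></Client>' FROM Clients", conn);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // One XML file per client, named after the client id.
                    File.WriteAllText(reader[0] + ".xml", reader.GetString(1));
                }
            }
        }
    }
}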
If your objective is to convert your database to XML files, you can:
connect to your database through an ADO/OLEDB connection
successively open each of your tables as ADO recordsets
save each recordset as an XML file:
myRecordset.save myXMLFile, adPersistXML
If you are working from within Access itself, use CurrentProject.AccessConnection as your ADO connection.
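If you're doing this from .NET rather than VBA, a roughly equivalent sketch uses OleDbDataAdapter and DataSet.WriteXml to persist each table as XML; the file path and table names below are assumptions:

using System.Data;
using System.Data.OleDb;

class DumpTablesAsXml
{
    static void Main()
    {
        var connStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=clients.mdb"; // hypothetical path
        var tables = new[] { "Clients", "Consultations" };                        // hypothetical table names
        using (var conn = new OleDbConnection(connStr))
        {
            conn.Open();
            foreach (var table in tables)
            {
                var ds = new DataSet(table);
                // Pull the whole table into a DataSet, then let ADO.NET persist it as XML.
                new OleDbDataAdapter("SELECT * FROM [" + table + "]", conn).Fill(ds, table);
                ds.WriteXml(table + ".xml");
            }
        }
    }
}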
From the sounds of it, this would be a one-time operation. I would strongly discourage loading the entire database into memory; that just does not seem like an efficient way of doing this at all.
Also, depending on your needs, you might be able to extract directly from Access -> XML if that is your true end game.
Regardless, with a database that small, processing clients one at a time with a few purpose-written queries would, in my opinion, be easier to manage, faster to write, and less error prone.
I would lean towards Jet, since you can be more specific about which data you want to pull.
Also, I noticed the large file size; this is a problem I recently came across at work. Is this an Access 95 or 97 DB? If so, converting the DB to 2000 or 2003 and then back to 97 will reduce the size; it seems to be a bug in some cases. The DB I was dealing with claimed to be 70 MB; after I converted it to 2000 and back again, it was 8 MB.
I have a C# tool that parses a collection of CSV files to construct a List<MyObject>. The collection can be small, limited to 20 files, or as large as 10,000+ files. MyObject itself has about 20 properties, most of them strings. Each file can sometimes create up to 4 items in the list and sometimes as many as 300.
After the parsing is done, I first save the list to a CSV file so I don't have to reparse the data later. I then summarize the data by one pivot of the dataset, and there are multiple other pivots the user can choose. The data is presented in WPF, and the user acts on it and annotates it with some additional information that then gets added to the MyObject. Finally, the user can save all of this information to another CSV file.
I ran into OOM exceptions when the number of files got large and have optimized some of my code. First, I realized I was storing one parameter, i.e. the path to the CSV file, which was sometimes close to 255 characters; I changed it to save only the filename and things improved slightly. I then found a suggestion to compile as x64, which would give me 4 GB of memory instead of 2 GB.
Even with this, I obviously still hit OOMs as more and more files are added to the data set.
Some of the options I've considered are:
When parsing the files, append to the intermediate .csv file after each file is parsed instead of keeping the list in memory. This would at least stop me hitting an OOM before I even get to save the intermediate .csv file.
The problem with this approach is that I still have to load the intermediate file back into memory once the parsing is done.
Some of the properties on MyObject are the same across a collection of files, so I've considered refactoring the single object into multiple objects to reduce the number of items in the list. Essentially, refactoring to a List<MyTopLevelDetailsObject>, with each MyTopLevelDetailsObject containing a List<MyObject>. The memory footprint should shrink, in theory. I can then output this to CSV with some translation to make it appear like a single object.
Move the data internally to a DB like MongoDB and push the summarization logic down to the DB.
Use DataTables instead.
Options 2 and 3 would be a significant redesign, with 3 also requiring me to learn MongoDB. :)
I'm looking for guidance and helpful tips on how large data sets have been handled.
Regards,
LW
If, after optimizations, the data can't fit in memory, almost by definition you need it to hit the disk.
Rather than reinvent the wheel and create a custom data format, it's generally best to use one of the well-vetted solutions. MongoDB is a good choice here, as are other database solutions. I'm fond of SQLite, which, despite the name, can handle large amounts of data and doesn't require a local server.
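For example, a minimal sketch of spilling parsed records into a SQLite file instead of keeping the List in memory, assuming the System.Data.SQLite provider and a hypothetical Items table holding a couple of MyObject's string properties:

using System.Data.SQLite; // System.Data.SQLite NuGet package

class SpillToSqlite
{
    static void Main()
    {
        // Write one row per parsed MyObject instead of keeping the whole List in memory.
        using (var conn = new SQLiteConnection("Data Source=parsed.db")) // hypothetical file name
        {
            conn.Open();
            new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS Items (FileName TEXT, Prop1 TEXT, Prop2 TEXT)", conn)
                .ExecuteNonQuery();

            var insert = new SQLiteCommand(
                "INSERT INTO Items (FileName, Prop1, Prop2) VALUES (@f, @p1, @p2)", conn);
            insert.Parameters.AddWithValue("@f", "sample.csv");
            insert.Parameters.AddWithValue("@p1", "value 1");
            insert.Parameters.AddWithValue("@p2", "value 2");
            insert.ExecuteNonQuery();
        }
    }
}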
If you ever get to the point where fitting the data on a local disk is a problem, you might consider moving on to large data solutions like Hadoop. That's a bigger topic, though.
Options two and four probably can't help you because (as I see it) they won't reduce the total amount of information in memory.
Also consider loading the data dynamically. The user probably can't see all the data at any one moment, so you can load part of the .csv into memory and show it to the user; if the user makes annotations or edits, you save that chunk of data to a separate file. If the user scrolls through the data, you load it on the fly. When the user wants to save the final .csv, you combine it from the original file and your little saved chunks.
This is a common practice when creating C# desktop applications that access large amounts of data. For example, I adopted loading data in chunks on the fly when I needed to create WinForms software to work with a huge database (tables with more than 10 million rows, which can't fit into the memory of a mediocre office PC).
And yes, it's too much work to do this with .csv files manually. It's easier to use a database to handle the saving and loading of edited parts and the composition of the final output.
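As a rough sketch of that chunked idea, File.ReadLines streams a CSV lazily, so only the page the user is currently looking at needs to be materialized; the file layout, page size and the naive comma split are all assumptions:

using System.Collections.Generic;
using System.IO;
using System.Linq;

class CsvPager
{
    const int PageSize = 500; // rows shown to the user at a time (arbitrary)

    // Returns only the requested page; File.ReadLines streams the file lazily,
    // so earlier pages are scanned but never held in memory.
    public static List<string[]> LoadPage(string path, int pageIndex)
    {
        return File.ReadLines(path)
                   .Skip(1 + pageIndex * PageSize)   // skip header plus earlier pages
                   .Take(PageSize)
                   .Select(line => line.Split(','))  // naive split; real CSVs need a proper parser
                   .ToList();
    }
}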
We are designing an update to a current system (C++/CLI and C#).
The system will gather small (~1 MB) amounts of data from ~10K devices (in the near future). Currently, each device's data is saved to a CSV file (a table), and all of these are stored in a wide folder structure.
Data is only inserted (create/append a file, create a folder), never updated or removed.
Data processing is done by reading many CSVs into an external program (like MATLAB). It is mainly used for statistical analysis.
There is an option to start saving this data to an MS-SQL database.
Processing time (reading the CSVs into the external program) can be up to a few minutes.
How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, at what point does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
I'd appreciate your answers, Pros and Cons are welcome.
Thank you for your time.
Well, if you are using data in one CSV to get data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.
Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
Possible Pros:
Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships
Possible Cons:
You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server; you would have to check whether you can use the Express edition
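If you do migrate, a minimal sketch of pushing one device's CSV into SQL Server with SqlBulkCopy; the connection string, table name and column layout are assumptions:

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class ImportDeviceCsv
{
    static void Main()
    {
        // Hypothetical schema: DeviceReadings(DeviceId, Timestamp, Value)
        var table = new DataTable();
        table.Columns.Add("DeviceId", typeof(string));
        table.Columns.Add("Timestamp", typeof(DateTime));
        table.Columns.Add("Value", typeof(double));

        foreach (var line in File.ReadLines("device_0001.csv")) // hypothetical file
        {
            var parts = line.Split(',');
            table.Rows.Add(parts[0], DateTime.Parse(parts[1]), double.Parse(parts[2]));
        }

        using (var conn = new SqlConnection("Server=.;Database=Devices;Integrated Security=true"))
        {
            conn.Open();
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "DeviceReadings" })
            {
                bulk.WriteToServer(table); // one bulk round trip instead of row-by-row inserts
            }
        }
    }
}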
Good luck!
I'd like to try hitting those questions a bit out of order.
Roughly, at what point does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
Does one of the methods take significantly more storage than the other?
Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
How should we choose which method to use?
Great question. Everything in the database always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.
It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query that 100 GB worth of data in a second.
This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, you just don't think you can figure out SQL Server, or you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.
If you have the option to use an MS-SQL database, I would do that.
Maintaining data in a wide folder structure is never a good idea. Reading your data would involve reading several files that could be stored anywhere on your disk, so your file I/O time would be quite high. SQL Server, being a production database, already has these problems taken care of.
You are reinventing the wheel here. This is how FoxPro manages data: one file per table. It is usually a good idea to use proven technology unless you are actually building a database server.
I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given your roughly 10K devices, you should consider using a standard database.
I am developing an application with Fluent NHibernate / NHibernate 3 / SQLite, and I have run into a very specific problem for which I need help.
I have a product database and a batch database. Products number around 100K, but batches are around the 11 million+ mark as of now. When given a product, I need to fill a ComboBox with its batches. As I do not want to load all the batches at once because of memory constraints, I load them directly from the database when the product is selected. The problem is that SQLite (or maybe the combination of SQLite and NH) is a little slow at this: it normally takes 3+ seconds to retrieve the batches for a particular product. Although that might not seem slow, I want to know whether I can improve this time. I need sub-second results to make order entry a smooth experience.
The details:
New products and batches are imported periodically (bi-monthly).
Nothing in the already persisted products or batches ever changes (no updates).
Storing products is not an issue. Batches are the main culprit.
Product IDs are longs.
Batch IDs are strings.
Batches contain three fields: rate, mrp (both decimal) and expiry (DateTime).
The requirements:
The data has to be stored in a file based solution. I cannot use a client-server approach.
Storage time is not important. Search & retrieval time is.
I am open to storing the batch database using any other persistence model.
I am open to using anything like Lucene, a NoSQL database (like Redis), or an OODB, provided it is based on a single-file storage implementation.
Please suggest what I can use for fast object retrieval.
Thanks.
You need to profile or narrow down to find out where those 3+ seconds are.
Is it the database fetching?
Try running the same queries in a SQLite browser. Do the queries take 3+ seconds there too? Then you might need to do something with the database, like adding some good indexes.
Is it the filling of the combobox?
What if you only fill the first value in the combobox and throw away the others? Does that speed up the performance? Then you might try BeginUpdate and EndUpdate.
Are the 3+ seconds elsewhere? If so, find out where.
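If the fill itself turns out to be the slow part, a minimal sketch of the BeginUpdate/EndUpdate idea for a WinForms ComboBox; batchLabels is a hypothetical list of display strings already fetched for the product:

// Suppress repainting while thousands of items are added, then repaint once at the end.
comboBox1.BeginUpdate();
try
{
    comboBox1.Items.Clear();
    comboBox1.Items.AddRange(batchLabels.ToArray());
}
finally
{
    comboBox1.EndUpdate();
}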
This may seem like a silly question, and I figured I'd double-check before proceeding to alternatives or other optimizations: is there an index (or, hopefully, a primary key) on the Batch Id column in your Batch table? Without indexes, those kinds of searches will be painfully slow.
For fast object retrieval, a key/value store is definitely a viable alternative. I'm not sure I would recommend Redis in this situation, since your batches database may be a little too large to fit into memory; although Redis also persists to disk, it's generally better suited to a dataset that fits strictly in memory.
My personal favourite would be MongoDB, but overall the best thing to do would be to take your batch data, load it into a couple of different NoSQL DBs, see what kind of read performance you get, and pick the one that suits the data best. Mongo is quite fast and easy to work with, and you could probably ditch the NHibernate layer for such a simple data structure.
There is a daemon that needs to run locally, but depending on the size of the DB it will be a single file (or a few files if it has to allocate more space). Again, make sure there is an index on your batch id column to ensure quick lookups.
3 seconds to load ~100 records from the database? That is slow. You should examine the generated SQL and create an index that will improve the query's performance.
In particular, the ProductId column in the Batches table should be indexed.
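A minimal sketch of adding that index directly against the SQLite file, assuming the System.Data.SQLite provider and table/column names taken from the question:

using System.Data.SQLite;

class AddBatchIndex
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=batches.db")) // hypothetical file name
        {
            conn.Open();
            // Index the foreign key used by the ComboBox lookup so the 11M-row scan becomes a seek.
            new SQLiteCommand(
                "CREATE INDEX IF NOT EXISTS IX_Batches_ProductId ON Batches (ProductId)", conn)
                .ExecuteNonQuery();
        }
    }
}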
What is the best way to save a large amount of data for a .NET 4.0 application?
Right now I am using Lists and serializing to a file in the "User Data" folder, and it's working OK, but I want to know if there is a better/faster way of saving and loading large amounts of data.
The data that I am saving contains only lots of words, like documents.
The size of the data is almost 1 MB.
That really depends on the type of your application. I wouldn't use a SQL database of any sort just for load and save operations on data that I do not need to query or transform. The time it would take to map your object graph to a relational model is just not worth it.
Also, I don't believe it will ever be faster than simple serialization, due to the overhead associated with databases (connection management and mapping).
My recent experience was with BinaryFormatter, which gave excellent results (files ~15 MB). Worst comes to worst, you can always write your own formatter.
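A minimal sketch of that BinaryFormatter approach, assuming a hypothetical serializable Document class that holds the words:

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
class Document
{
    public List<string> Words = new List<string>();
}

class BinarySave
{
    static void Save(Document doc, string path)
    {
        using (var stream = File.Create(path))
            new BinaryFormatter().Serialize(stream, doc);
    }

    static Document Load(string path)
    {
        using (var stream = File.OpenRead(path))
            return (Document)new BinaryFormatter().Deserialize(stream);
    }
}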
Kinda depends on your data and how you have it stored in your app.
But all these NoSQL storage systems are a possibility, or just plain binary data in a file.
When you say "large amout [sic] of data", what exactly do you mean by that? A megabyte? a terabyte?
And what exactly is the data?
If it's a set of account records, it might well belong in a database of some sort; if it's a set of images or word processing documents, perhaps not.
If you want fast access, one approach would be to serialize to a hashtable and cache it in between reads and writes.
The problems here, of course, are versioning, namespace changes (after which you won't be able to deserialize easily), deadlocks, concurrency, etc.
It's better to save the file as XML/JSON and, when you read it into memory, store it in a hashtable for fast access.
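A minimal sketch of that idea, assuming a hypothetical Entry type, XmlSerializer for the file on disk, and a Dictionary as the in-memory hashtable:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Serialization;

public class Entry
{
    public string Key;
    public string Value;
}

class XmlCache
{
    static void Save(List<Entry> entries, string path)
    {
        using (var stream = File.Create(path))
            new XmlSerializer(typeof(List<Entry>)).Serialize(stream, entries);
    }

    static Dictionary<string, string> Load(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            var entries = (List<Entry>)new XmlSerializer(typeof(List<Entry>)).Deserialize(stream);
            // The hashtable (Dictionary) gives O(1) lookups once the file is in memory.
            return entries.ToDictionary(e => e.Key, e => e.Value);
        }
    }
}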
I need to analyze tens of thousands of lines of data. The data is imported from a text file. Each line of data has eight variables. Currently, I use a class to define the data structure. As I read through the text file, I store each line object in a generic List.
I am wondering if I should switch to using a relational database (SQL), as I will need to analyze the data in each line of text, trying to relate it to definition terms, which I also currently store in generic lists.
The goal is to translate a large amount of data using definitions. I want the defined data to be filterable, searchable, etc. Using a database makes more sense the more I think about it, but I would like to confirm with more experienced developers before I make the changes, yet again (I was using structs and ArrayLists at first).
The only drawback I can think of is that the data does not need to be retained after it has been translated and viewed by the user. There is no need for permanent storage, so using a database might be a little overkill.
It is not absolutely necessary to go to a database. It depends on the actual size of the data and the processing you need to do. If you are loading the data into a List<T> with a custom class, why not use LINQ to do your querying and filtering? Something like:
// fooList is the List<Foo> you loaded from the text file
var query = from foo in fooList
            where foo.Prop == criteriaVar
            select foo;
The real question is whether the data is so large that it cannot be loaded into memory comfortably. If that is the case, then yes, a database would be much simpler.
This is not a large amount of data. I don't see any reason to involve a database in your analysis.
There IS a query language built into C# -- LINQ. The original poster currently uses a list of objects, so there is really nothing left to do. It seems to me that a database in this situation would add far more heat than light.
It sounds like what you want is a database. SQLite supports in-memory databases (use ":memory:" as the filename). I suspect others may have an in-memory mode as well.
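With the System.Data.SQLite provider, for instance, an in-memory database is just a connection-string choice; a minimal sketch, where the table is hypothetical and the data vanishes when the connection closes:

using System.Data.SQLite;

class InMemoryDb
{
    static void Main()
    {
        // The database lives only as long as this connection stays open.
        using (var conn = new SQLiteConnection("Data Source=:memory:"))
        {
            conn.Open();
            new SQLiteCommand("CREATE TABLE Lines (Field1 TEXT, Field2 TEXT)", conn).ExecuteNonQuery();
            // ...insert the parsed lines here, then query/filter them with ordinary SQL...
        }
    }
}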
I faced the same problem you are facing now while I was working at my previous company. I was looking for a solid solution for a lot of barcode-generated files; the barcode reader generates a text file with thousands of records in a single file, and manipulating and presenting the data was very difficult for me at first. Based on the records, I wrote a class that reads the file, loads the data into a DataTable, and saves it to a database (I used SQL Server 2005). After that I was able to manage the saved data easily and present it however I liked. The main point is to read the data from the file and save it to the database; if you do so, you will have a lot of options to manipulate and present it the way you like.
If you do not mind using Access, here is what you can do:
Attach a blank Access db as a resource
When needed, write the db out to file.
Run a CREATE TABLE statement that handles the columns of your data
Import the data into the new table
Use sql to run your calculations
On close, delete that Access DB.
You can use a program like Resourcer to load the DB into a .resx file.
// Pull the embedded blank database out of the assembly's resources.
ResourceManager res = new ResourceManager( "MyProject.blank_db", this.GetType().Assembly );
byte[] b = (byte[])res.GetObject( "access.blank" );
The code above pulls the resource out of the project. Take the byte array and save it to a temp location with a temp filename.
"MyProject.blank_db" is the location and name of the resource file
"access.blank" is the tab given to the resource to save
If the only thing you need to do is search and replace, you might consider using sed and awk, and you can do searches using grep. On a Unix platform, of course.
From your description, I think Linux command-line tools can handle your data very well. Using a database may unnecessarily complicate your work. If you are using Windows, these tools are also available in various ways; I would recommend Cygwin. The following tools may cover your task: sort, grep, cut, awk, sed, join, paste.
These Unix/Linux command-line tools may look scary to a Windows person, but there are reasons people love them. The following are my reasons for loving them:
They allow your skills to accumulate: your knowledge of even part of a tool can be helpful in different future tasks.
They allow your effort to accumulate: the command line (or script) you used to finish the task can be repeated as many times as needed with different data, without human interaction.
They usually outperform the same tool you could write yourself. If you don't believe it, try to beat sort with your own version on terabyte-sized files.