Currently I am creating a C# application which has to read a lot of data (over 2,000,000 records) from an existing database and compare it with a lot of other data (also about 2,000,000 records) which do not exist in the database. These comparisons will mostly be String comparisons. The amount of data will grow much bigger and therefore I need to know which solution will result in the best performance.
I have already searched the internet and came up with two solutions:
Solution 1
The application will execute a single query (SELECT column_name FROM table_name, for example) and store all the data in a DataTable. The application will then compare all the stored data with the input, and if there is a match it will be written to the database.
Pros:
The query will only be executed once. After that, I can use the stored data multiple times for all incoming records.
Cons:
As the database grows bigger, so will my RAM usage. Currently I have to work with 1GB (I know, tough life) and I'm afraid it won't fit if I'd practically download the whole content of the database in it.
Processing all the data will take lots and lots of time.
Solution 2
The application will execute a specific query for every record, for example
SELECT column_name FROM table_name WHERE value_name = value
and will then check if the DataTable has any records, something like
if (datatable.Rows.Count > 0) { /* etc. */ }
If it has records, I can conclude there are matching records and I can write to the database.
Pros:
Probably a lot less usage of RAM since I will only get specific data.
Processing goes a lot faster.
Cons:
I will have to execute a lot of queries. If you are interested in numbers, it will probably be around 5 queries per record. With 2,000,000 records, that would be 10,000,000 queries.
My question is, what would be the smartest option, given that I have limited RAM?
Any other suggestions are welcome as well, of course.
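If the single-query route of Solution 1 is taken, a full DataTable is an expensive way to hold 2,000,000 strings; a HashSet<string> of just the compared column is much lighter and gives constant-time lookups. A minimal sketch, with placeholder values standing in for what the query would return:

```csharp
using System;
using System.Collections.Generic;

// Sketch: compare incoming records against a pre-loaded key set.
// The sample values are placeholders, not from the question's schema.
class KeySetCompare
{
    static void Main()
    {
        // In the real application this set would be filled once from
        // "SELECT column_name FROM table_name" via a data reader.
        var existingKeys = new HashSet<string>(StringComparer.Ordinal)
        {
            "alpha", "beta", "gamma"
        };

        string[] incoming = { "beta", "delta" };

        foreach (var record in incoming)
        {
            if (existingKeys.Contains(record))
                Console.WriteLine(record + ": match");    // would be written back to the DB
            else
                Console.WriteLine(record + ": no match");
        }
    }
}
```

Each lookup is O(1), so the 2,000,000 input records cost one pass regardless of the table size, and the set holds only the key strings rather than whole rows.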
If you have SQL Server available to you, this seems a job directly suited to SQL Server Integration Services (SSIS). You might consider using that tool instead of building your own. It depends on your exact business needs, but in general data merging like this would be a batch/unattended or tool-based operation.
You might be able to code it to run faster than SSIS, but I'd give it a try just to see if it's acceptable to you, and save yourself the cost of the custom development.
Related
I need to compare particular content of 2 SQL tables located in different servers: Table1 and Table2.
I want to compare each row from Table1 against the whole content of Table2.
Comparison logic is kind of complicated, so I want to apply a logical operator that I will write in C#. So I don't want to do the comparison in the SQL query itself.
My concern is the size of the data I will work on will be around 200 MB.
I was thinking to load the data into a DataTable by using ADO.Net and do the comparison on the memory.
What would you recommend? Is there already a pattern like approach to compare massive data?
200 MB should not be a problem. A .NET application can handle much more than that at once.
But even so, I would probably use a forward-only data reader for Table 1, just because there's no good reason not to, and that should reduce the amount of memory required. You can keep table 2 in memory with whatever structure you are accustomed to.
You can use two SqlDataReaders. They only have one row in memory at a time, are forward only, and extremely efficient. After getting the row back from the reader you can then compare the values. Here is an example.
See MSDN.
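A minimal sketch of that two-reader pattern, assuming both queries sort on a shared key column; the connection strings, table, and column names are placeholders:

```csharp
using System;
using System.Data.SqlClient;

// Sketch: stream two sorted result sets with forward-only readers and
// compare them row by row, holding only one row per table in memory.
class ReaderCompare
{
    static void Main()
    {
        using (var conn1 = new SqlConnection("Server=server1;Database=db1;Integrated Security=true"))
        using (var conn2 = new SqlConnection("Server=server2;Database=db2;Integrated Security=true"))
        {
            conn1.Open();
            conn2.Open();

            var cmd1 = new SqlCommand("SELECT key_col, value_col FROM Table1 ORDER BY key_col", conn1);
            var cmd2 = new SqlCommand("SELECT key_col, value_col FROM Table2 ORDER BY key_col", conn2);

            using (SqlDataReader r1 = cmd1.ExecuteReader())
            using (SqlDataReader r2 = cmd2.ExecuteReader())
            {
                bool has1 = r1.Read(), has2 = r2.Read();
                while (has1 && has2)
                {
                    int cmp = string.CompareOrdinal(r1.GetString(0), r2.GetString(0));
                    if (cmp == 0)
                    {
                        // keys match: run the complicated value comparison here
                        has1 = r1.Read(); has2 = r2.Read();
                    }
                    else if (cmp < 0) has1 = r1.Read();  // row only in Table1
                    else has2 = r2.Read();               // row only in Table2
                }
            }
        }
    }
}
```

Because both readers stream in key order, the comparison is a single forward pass over each table.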
The most scalable solution is to create SQLCLR functions to execute the comparisons you want.
You should probably avoid a row-by-row comparison at all costs. The network latency and delays due to round-tripping will result in extremely slow execution.
A quick&dirty solution is to extract the data to local files then do the comparison as you will pay the network tax only once. Unfortunately, you lose the speedup provided by database indexes and query optimizations.
A similar solution is to load all the data once in memory and then use indexing structures like dictionaries to provide additional speedup. This is probably doable as your data can fit in memory. You still pay the network tax only once but gain from faster execution.
The most scalable solution is to create SQLCLR code to create one or more functions that will perform the comparisons you want. This way you avoid the network tax altogether, avoid creating and optimizing your own structures in memory and can take advantage of indexes and optimizations.
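For illustration, a SQLCLR scalar function might look like the following sketch; the class name, method, and comparison logic are purely hypothetical, and the assembly would still need to be deployed into SQL Server with CREATE ASSEMBLY / CREATE FUNCTION:

```csharp
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

// Sketch of a SQLCLR scalar function. Once deployed, it can be called
// directly inside a SQL query, so the comparison runs on the server
// with no network round trips.
public class Comparisons
{
    [SqlFunction(IsDeterministic = true)]
    public static SqlBoolean RowsMatch(SqlString a, SqlString b)
    {
        if (a.IsNull || b.IsNull) return SqlBoolean.False;

        // The complicated comparison logic goes here; this example
        // just does a trimmed, case-insensitive string comparison.
        return a.Value.Trim().Equals(b.Value.Trim(), StringComparison.OrdinalIgnoreCase);
    }
}
```

A query such as `SELECT * FROM Table1 t1 JOIN Table2 t2 ON dbo.RowsMatch(t1.value_col, t2.value_col) = 1` would then keep all the data on the server.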
These solutions may not be applicable, depending on the actual logic of the comparisons you are doing. Both rely on the data being sorted correctly:
1) Binary search. You can find the matching row in table 2 without scanning all of table 2 by using a binary search; this will significantly reduce the number of comparisons.
2) If you are looking for overlaps/matches/missing rows between the two tables, you can sort both tables in the same order. Then you can loop through the two tables simultaneously, keeping a pointer to the current row of each table. If table 1 is "ahead" of table 2, then you only increment the table 2 pointer until they are either equal or table 2 is ahead. Then once table 2 is ahead, you start incrementing table 1 until it is ahead, and so on. In this way you only have to loop through each record of each table once, and you are guaranteed that there are no matches you missed.
If the current rows of table 1 and table 2 match, that is a match. While table 1 is ahead, every row passed in table 2 is "missing" from table 1, and vice versa.
This solution would also work if you only need to take some action if the rows are in a certain range of each other or something.
3) If you have to actually do some action for every row in table 2 for every row in table 1, then it's just two nested loops, and there is not much you can do to optimize that other than making the comparison/work as efficient as possible. You could possibly multi-thread it, though, depending on what the work is and where your bottleneck is.
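Approach 2 above can be sketched as a single pass over two sorted lists; the sample values here are made up:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the sorted-merge walk: one pointer per table, advancing
// whichever side is "behind" until both lists are exhausted.
class SortedMerge
{
    static void Main()
    {
        var table1 = new List<string> { "a", "c", "d", "f" };
        var table2 = new List<string> { "b", "c", "f", "g" };
        int i = 0, j = 0;

        while (i < table1.Count && j < table2.Count)
        {
            int cmp = string.CompareOrdinal(table1[i], table2[j]);
            if (cmp == 0)      { Console.WriteLine("match: " + table1[i]); i++; j++; }
            else if (cmp < 0)  { Console.WriteLine("only in table1: " + table1[i]); i++; }
            else               { Console.WriteLine("only in table2: " + table2[j]); j++; }
        }
        // Whatever remains on either side has no counterpart.
        while (i < table1.Count) Console.WriteLine("only in table1: " + table1[i++]);
        while (j < table2.Count) Console.WriteLine("only in table2: " + table2[j++]);
    }
}
```

Each row from each table is touched exactly once, so the whole comparison is linear after the initial sort.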
Can you stage the data to the same database using a quick ETL/SSIS job? This would allow you to do set operations, which might be easier to deal with. If not, I would agree with the recommendations for a forward-only data reader with one table in memory.
A couple years ago I wrote a db table comparison tool, which is now an open-source project called Data Comparisons.
You can check out the source code if you want. There is a massive optimization you can make when the two tables you're comparing are on the same physical server, because you can write a SQL query to take care of this. I called this the "Quick compare" method in Data Comparisons, and it's available whenever you're sharing the same connection string for both sides of the comparison.
When they're on two different servers, however, you have no choice but to pull the data into memory and compare the rows there. Using SqlDataReaders would work. However, it's complicated when you must know exactly what's different (what rows are missing from table A or table B, what rows are different, etc). For that reason my method was to use DataTables, which are slower but at least they provide you with the necessary functionality.
Building this tool was a learning process for me. There are probably opportunities for optimization with the in-memory comparison. For example, loading the data into a Dictionary and doing your comparisons off of primary keys with LINQ would probably be faster. You could even try Parallel LINQ and see if that helps. And as Jeffrey L Whitledge mentioned, you might as well use a SqlDataReader for one of the tables while the other is stored in memory.
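For illustration, the dictionary-based comparison might look like this sketch; the row shape and key values are invented, not taken from Data Comparisons:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: diff two tables held as dictionaries keyed on primary key,
// answering "missing from A", "missing from B", and "different".
class DictDiff
{
    static void Main()
    {
        var tableA = new Dictionary<int, string> { { 1, "x" }, { 2, "y" }, { 3, "z" } };
        var tableB = new Dictionary<int, string> { { 2, "y" }, { 3, "w" }, { 4, "v" } };

        var missingFromB = tableA.Keys.Where(k => !tableB.ContainsKey(k)).ToList();
        var missingFromA = tableB.Keys.Where(k => !tableA.ContainsKey(k)).ToList();
        var different    = tableA.Keys.Where(k => tableB.ContainsKey(k) && tableA[k] != tableB[k]).ToList();

        Console.WriteLine("missing from B: " + string.Join(",", missingFromB));
        Console.WriteLine("missing from A: " + string.Join(",", missingFromA));
        Console.WriteLine("different: " + string.Join(",", different));
    }
}
```

Each ContainsKey probe is a hash lookup, so this gives the same classification a DataTable comparison would, without scanning the other table per row.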
We are designing an update to a current system (C++/CLI and C#).
The system will gather small (~1 MB) amounts of data from ~10K devices (in the near future). Currently, device data is saved in CSVs (one table per file), all stored in a wide folder structure.
Data is only inserted (create / append to a file, create folder) never updated / removed.
Data processing is done by reading many CSV's to an external program (like Matlab). Mainly be used for statistical analysis.
There is an option to start saving this data to an MS-SQL database.
Process time (reading the CSV's to external program) could be up to a few minutes.
How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
I'd appreciate your answers, Pros and Cons are welcome.
Thank you for your time.
Well, if you are using data in one CSV to get data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.
Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
Possible Pros:
Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships
Possible Cons:
You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server; you would have to check whether you can use Express
Good luck!
I'd like to try hitting those questions a bit out of order.
Roughly, when does reading the raw data from a database become
quicker than reading the CSVs? (10 files, 100 files? ...)
Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
Does one of the methods take significantly more storage than the
other?
Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
How should we choose which method to use?
Great question. Everything in the database always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.
It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query that 100 GB worth of data in a second.
This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, or you just don't think you can figure out SQL Server, or if you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.
If you have the option to use an MS-SQL database, I would do that.
Maintaining data in a wide folder structure is never a good idea. Reading your data would involve reading several files, which could be stored anywhere on your disk, so your file I/O time would be quite high. SQL Server, being a production database, already has these problems taken care of.
You are reinventing the wheel here. This is how FoxPro manages data: one file per table. It is usually a good idea to use proven technology unless you are actually building a database server.
I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given your roughly 10K devices, you should consider using a standard database.
I have a very general question regarding the use of LINQ vs SQL to filter a collection. Let's say you are running a fairly complex filter on a database table. It's running, say, 10,000 times, and the filters could be different every time. Performance-wise, are you better off loading the entire database table into memory and executing the filters with LINQ, or should you let the database handle the filtering with SQL (since that's what it was built to do)? Any thoughts?
EDIT: I should have been more clear. Let's assume we're talking about a table with 1000 records and 20 columns (containing int/string/date data). Currently in my app I am running one query every half hour to pull all the data into a collection (saving that collection in the application cache) and filtering that cached collection throughout my app. I'm wondering if that is worse than doing tons of round trips to the database server (it's Oracle, fwiw).
After the update:
It's running, say 10,000 times and
I'm going to assume a table with 1000 records
It seems reasonable to assume the 1k records will fit easily in memory.
And then running 10k filters will be much cheaper in memory (LINQ).
Using SQL for every filter would mean 10k round trips, moving up to 10M records in total; a lot of I/O.
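The cached-collection approach from the question might be sketched like this; the row type and filter are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: load the table once into a cache, then run many cheap
// in-memory LINQ filters against it instead of querying per filter.
class CachedFilter
{
    record Row(int Id, string Country, decimal Price);

    static void Main()
    {
        // In the real app this list would be refreshed from Oracle
        // every half hour and kept in the application cache.
        var cache = new List<Row>
        {
            new Row(1, "US", 10m), new Row(2, "DE", 25m), new Row(3, "US", 40m)
        };

        // Each of the ~10k filter calls is just a scan of the 1k cached rows.
        var result = cache.Where(r => r.Country == "US" && r.Price > 15m).ToList();
        Console.WriteLine(result.Count);
    }
}
```

With only 1k rows per scan, each filter costs microseconds in memory, versus a network round trip per filter on the SQL side.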
EDIT
It always depends on the amount of data you have. If you have a large amount of data, go for SQL; if less, go for LINQ. It also depends on how frequently you query the data: if you call SQL Server too frequently, it's better to load the data into memory and then apply LINQ; if not, SQL is better.
First Answer
It's better to do the filtering on the SQL side rather than loading everything into memory and then applying a LINQ filter.
The reason SQL is preferable to LINQ here:
If you go with LINQ,
all 10,000 records are loaded into memory, which also increases network traffic.
If you go with SQL,
the number of records returned decreases, so less memory is used and network traffic also decreases.
Depends on how big your table is and what type of data it stores.
Personally, I'd go with returning all the data if you plan to use all your filters during the same request.
If it's a filter on demand using AJAX, you could reload the data from the database every time (ensuring at the same time that your data is up to date)
This will probably cause some debate on the role of a database! I had this exact problem a little while back: some relatively complex filtering (things like "is in X country, where price is y, and has the keyword z") and it was horrifically slow. Coupled with this, I was not allowed to change the database structure because it was a third-party database.
I swapped out all of the logic so that the database just returned the results (which I cached every hour) and did the filtering in memory; when I did this I saw massive performance increases.
I will say that it is far better to let SQL do the complex filter and the rest of the processing; but why, you may ask?
The main reason is that SQL Server has the index information for the indexes you have set, and it uses those indexes to access data very fast. If you load the data into LINQ, you do not have this index information for fast access, and you lose time accessing the data. You also lose time compiling the LINQ every time.
You can run a simple test to see this difference for yourself. What test? Create a simple table with a hundred random strings and index that string field. Then search on the string field, once using LINQ and once by querying SQL directly.
Update
My first thought was that SQL keeps the index and gives very quick access to the searched data based on your SQL.
Then I considered that LINQ can also translate this filter to SQL and then get the data, after which you perform your actions, etc.
Now I think the actual reason depends on what actions you perform. It is faster to run the SQL directly, but the reason depends on how you actually set up your LINQ.
If you try to load everything into memory and then use LINQ, you lose the speed of the SQL indexes, you lose memory, and you spend a lot of effort moving your data from SQL into memory.
If you get the data using LINQ and no other search needs to be made, then you only lose on moving all that data into memory, and you lose memory.
It depends on the amount of data you are filtering on.
You say the filter runs 10K times and can be different every time; in this case, if you don't have much data in the database, you can load it into a server-side variable.
If you have hundreds of thousands of records in the database, you should not do this; instead, you can create indexes on the database and pre-compiled procedures to fetch data faster.
You can implement a cache facade in between, which stores data on the server side on the first request and updates it as per your requirements. (You can have the cache fill the variable only if the data is under a record-count limit.)
You can measure the time to get data from the database by running some test queries and observing. At the same time, you can observe the response time from the server when the data is stored in memory, calculate the difference, and decide based on that.
There can be many other tricks, but the bottom line is:
you have to observe and decide.
I have a problem which I cannot seem to get around no matter how hard I try.
This company works in market analysis, and has pretty large tables (300K - 1M rows) and MANY columns (think 250-300) which we do some calculations on.
I'll try to get straight to the problem:
The problem is the filtering of the data. All databases I've tried so far are way too slow to select data and return it.
At the moment I am storing the entire table in memory and filtering using dynamic LINQ.
However, while this is quite fast (about 100 ms to filter 250,000 rows), I need better results than this...
Is there any way I can change something in my code (not the data model) which could speed the filtering up?
I have tried using:
DataTable.Select, which is slow.
Dynamic LINQ, which is better, but still too slow.
Normal LINQ (just for testing purposes), which is almost good enough.
Fetching from MySQL and doing the processing later on, which is badass slow.
At the beginning of this project we thought that some high-performance database would be able to handle this, but I tried:
H2 (IKVM)
HSQLDB (compiled ODBC-driver)
CubeSQL
MySQL
SQL
SQLite
...
And they are all very slow to interface with .NET and get results from.
I have also tried splitting the data into chunks and combining them later in runtime to make the total amount of data which needs filtering smaller.
Is there any way in this universe I can make this faster?
Thanks in advance!
UPDATE
I just want to add that I did not create the database in question.
To add some figures: if I do a simple select of 2 fields in the database query window (SQLyog) like this (visit_munic_name is indexed):
SELECT key1, key2 FROM table1 WHERE filter1 = filterValue1
It takes 125 milliseconds on 225639 rows.
Why is it so slow? I have tested 2 different boxes.
Of course they must change something, obviously?
You do not explain what exactly you want to do, or why filtering a lot of rows is important. Why should it matter how fast you can filter 1M rows to get an aggregate if your database can precalculate that aggregate for you? In any case it seems you are using the wrong tools for the job.
On one hand, 1M rows is a small number of rows for most databases. As long as you have the proper indexes, querying shouldn't be a big problem. I suspect that either you do not have indexes on your query columns or you want to perform ad-hoc queries on non-indexed columns.
Furthermore, it doesn't matter which database you use if your data schema is wrong for the job. Analytical applications typically use star schemas to allow much faster queries for a lot more data than you describe.
All databases used for analysis purposes use special data structures which require that you transform your data to a form they like.
For typical relational databases you have to create star schemas that are combined with cubes to precalculate aggregates.
Column databases store data in a columnar format usually combined with compression to achieve fast analytical queries, but they require that you learn to query them in their own language, which may be very different than the SQL language most people are accustomed to.
On the other hand, the way you query (LINQ or DataTable.Select or whatever) has minimal effect on performance. Picking the proper data structure is much more important.
For instance, using a Dictionary<> is much faster than using any of the techniques you mentioned: a dictionary essentially checks for single values in memory. Executing DataTable.Select without indexes, or using LINQ to DataSets or to Objects, is essentially the same as scanning all entries of an array or a List<> for a specific value, because that is what all these methods do: scan an entire list sequentially.
The various LINQ providers do not do the job of a database. They do not optimize your queries. They just execute what you tell them to execute. Even doing a binary search on a sorted list is faster than using the generic LINQ providers.
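A small sketch of the difference: the LINQ query scans every element, while a hash-based structure answers the same membership question from its index. The sizes and values here are arbitrary:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: sequential LINQ scan vs keyed lookup over the same data.
class LookupVsScan
{
    static void Main()
    {
        var rows = Enumerable.Range(0, 250000).Select(i => "row" + i).ToList();

        // O(n) every time: LINQ walks the whole list looking for a match.
        bool foundByScan = rows.Any(r => r == "row249999");

        // O(1) per lookup after a one-time O(n) build of the hash index.
        var index = new HashSet<string>(rows);
        bool foundByKey = index.Contains("row249999");

        Console.WriteLine(foundByScan && foundByKey);
    }
}
```

Repeated the 10,000 times described in the question, the hash lookup wins by orders of magnitude, which is exactly the work a database index does for you.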
There are various things you can try, depending on what you need to do:
If you are looking for a quick way to slice and dice your data, use an existing product like the PowerPivot functionality of Excel 2010. PowerPivot loads and compresses many millions of rows in an in-memory columnar format and allows you to query your data just as you would with a pivot table, and even define joins with other in-memory sources.
If you want a more repeatable process you can either create the appropriate star schemas in a relational database or use a columnar database. In either case you will have to write the scripts to load your data in the proper structures.
If you are creating your own application, you really need to investigate the various algorithms and structures used by other similar in-memory tools.
I am developing an application with Fluent NHibernate/NHibernate 3/SQLite. I have run into a very specific problem with which I need help.
I have a product database and a batch database. Products are around 100k, but batches are around the 11 million+ mark as of now. When provided with a product, I need to fill a ComboBox with its batches. As I do not want to load all the batches at once because of memory constraints, I am loading them directly from the database when the product is provided. But the problem is that SQLite (or maybe the combination of SQLite and NH) is a little slow at this. It normally takes around 3+ seconds to retrieve the batches for a particular product. Although that might not seem like a slow scenario, I want to know whether I can improve this time. I need sub-second results to make order entry a smooth experience.
The details:
New products and batches are imported periodically (bi-monthly).
Nothing in the already persisted products or batches ever changes (no updates).
Storing products is not an issue. Batches are the main culprit.
Product Ids are longs
Batch Ids are strings
Batches contain 3 fields: rate, mrp (both decimal) & expiry (DateTime).
The requirements:
The data has to be stored in a file-based solution. I cannot use a client-server approach.
Storage time is not important. Search & retrieval time is.
I am open to storing the batch database using any other persistence model.
I am open to using anything like Lucene, or a NoSQL database (like Redis), or an OODB, provided it is based on a single-storage-file implementation.
Please suggest what I can use for fast object retrieval.
Thanks.
You need to profile or narrow down to find out where those 3+ seconds are.
Is it the database fetching?
Try running the same queries in a SQLite browser. Do the queries take 3+ seconds there too? Then you might need to do something with the database, such as adding some good indexes.
Is it the filling of the combobox?
What if you fill only the first value in the combobox and throw away the others? Does that speed up the performance? If so, you might try BeginUpdate and EndUpdate.
Are the 3+ seconds elsewhere? If so, find out where.
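For completeness, the BeginUpdate/EndUpdate pattern mentioned above looks like this in WinForms; `comboBox` and `batches` are assumed to exist in the form:

```csharp
// Suspend ComboBox repainting while adding many items, then resume,
// so the control redraws once instead of once per item.
comboBox.BeginUpdate();
try
{
    foreach (var batch in batches)   // batches fetched for the selected product
        comboBox.Items.Add(batch);
}
finally
{
    comboBox.EndUpdate();
}
```

The try/finally guarantees EndUpdate runs even if adding an item throws, so the control is never left frozen.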
This may seem like a silly question, but I figured I'd double-check before proceeding to alternatives or other optimizations: is there an index (or, ideally, a primary key) on the Batch Id column in your Batch table? Without indexes, those kinds of searches will be painfully slow.
For fast object retrieval, a key/value store is definitely a viable alternative. I'm not sure I would necessarily recommend Redis in this situation, since your batches database may be a little too large to fit into memory; although it also persists to disk, it is generally better suited to a dataset that strictly fits into memory.
My personal favourite would be MongoDB; overall, the best thing to do would be to take your batches data, load it into a couple of different NoSQL DBs, see what kind of read performance you're getting, and pick the one that suits the data best. Mongo's quite fast and easy to work with, and you could probably ditch the NHibernate layer for such a simple data structure.
There is a daemon that needs to run locally, but depending on the size of the DB it will be a single file (or a few files if it has to allocate more space). Again, ensure there is an index on your batch id column to ensure quick lookups.
3 seconds to load ~100 records from the database? That is slow. You should examine the generated sql and create an index that will improve the query's performance.
In particular, the ProductId column in the Batches table should be indexed.