C#: Very fast object search & retrieval using any persistence model

I am developing an application with Fluent NHibernate/NHibernate 3/SQLite. I have run into a very specific problem with which I need help.
I have a product database and a batch database. Products number around 100k, but batches are already past the 11 million mark. When given a product, I need to fill a ComboBox with its batches. As I do not want to load all the batches at once because of memory constraints, I load them directly from the database when the product is selected. The problem is that SQLite (or maybe the combination of SQLite & NHibernate) is a little slow here: it normally takes around 3+ seconds to retrieve the batches for a particular product. Although that might not seem slow, I want to know whether I can improve this time. I need sub-second results to make order entry a smooth experience.
The details:
New products and batches are imported periodically (bi-monthly).
Nothing in the already persisted products or batches ever changes (no updates).
Storing products is not an issue. Batches are the main culprit.
Product Ids are long
Batch Ids are string
Batches contain 3 fields: rate, mrp (both decimal) & expiry (DateTime).
The requirements:
The data has to be stored in a file based solution. I cannot use a client-server approach.
Storage time is not important. Search & retrieval time is.
I am open to storing the batch database using any other persistence model.
I am open to using anything like Lucene, a NoSQL database (like Redis), or an OODB, provided it is based on a single-file storage implementation.
Please suggest what I can use for fast object retrieval.
Thanks.

You need to profile or narrow down to find out where those 3+ seconds are.
Is it the database fetching?
Try running the same queries in an SQLite browser. Do the queries take 3+ seconds there too? Then you might need to do something with the database, like adding some good indexes.
Is it the filling of the combobox?
What if you only fill the first value in the combobox and throw away the others? Does that speed up the performance? Then you might try BeginUpdate and EndUpdate.
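If the ComboBox fill turns out to be the bottleneck, a minimal WinForms sketch of that idea (the control name and the Batch.Id property are hypothetical; 'batches' stands for the list already fetched for the selected product):

comboBoxBatches.BeginUpdate();              // suspend repainting while items are added
try
{
    comboBoxBatches.Items.Clear();
    foreach (var batch in batches)
    {
        comboBoxBatches.Items.Add(batch.Id); // hypothetical Batch.Id property
    }
}
finally
{
    comboBoxBatches.EndUpdate();             // resume painting; the control redraws once
}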
Are the 3+ seconds spent elsewhere? If so, find out where.

This may seem like a silly question, but I figured I'd double-check before proceeding to alternatives or other optimizations: is there an index (or, better yet, a primary key) on the Batch Id column in your Batch table? Without indexes, those kinds of searches will be painfully slow.
For fast object retrieval, a key/value store is definitely a viable alternative. I'm not sure I would recommend Redis in this situation, since your batches database may be a little too large to fit into memory; although Redis also persists to disk, it is generally better suited to a dataset that fits entirely in memory.
My personal favourite would be MongoDB, but overall the best thing to do would be to take your batches data, load it into a couple of different NoSQL DBs, see what kind of read performance you're getting, and pick the one that suits the data best. Mongo is quite fast and easy to work with, and you could probably ditch the NHibernate layer for such a simple data structure.
There is a daemon that needs to run locally, but depending on the size of the DB it will be a single file (or a few files if it has to allocate more space). Again, ensure there is an index on your batch id column to ensure quick lookups.
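As a rough sketch of what the lookup could look like with the current MongoDB .NET driver (the database name, collection name and Batch class here are assumptions for illustration, not anything from your code):

using System;
using System.Collections.Generic;
using MongoDB.Driver;

public class Batch
{
    public string Id { get; set; }          // batch id (string), stored as _id
    public long ProductId { get; set; }
    public decimal Rate { get; set; }
    public decimal Mrp { get; set; }
    public DateTime Expiry { get; set; }
}

public static class BatchStore
{
    static readonly IMongoCollection<Batch> Batches =
        new MongoClient("mongodb://localhost:27017")
            .GetDatabase("inventory")                    // hypothetical database name
            .GetCollection<Batch>("batches");

    // Index ProductId once so the per-product lookup stays fast.
    public static void EnsureIndexes() =>
        Batches.Indexes.CreateOne(
            new CreateIndexModel<Batch>(Builders<Batch>.IndexKeys.Ascending(b => b.ProductId)));

    // Fetch only the batches for the selected product.
    public static List<Batch> ForProduct(long productId) =>
        Batches.Find(b => b.ProductId == productId).ToList();
}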

3 seconds to load ~100 records from the database? That is slow. You should examine the generated sql and create an index that will improve the query's performance.
In particular, the ProductId column in the Batches table should be indexed.
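For example, a minimal sketch with System.Data.SQLite (the file name is a placeholder; the Batches table and ProductId column come from the discussion above) that creates the index once after an import:

using System.Data.SQLite;

// Run once after each bi-monthly import; IF NOT EXISTS makes it safe to repeat.
using (var conn = new SQLiteConnection("Data Source=batches.db"))   // hypothetical file name
using (var cmd = conn.CreateCommand())
{
    conn.Open();
    cmd.CommandText = "CREATE INDEX IF NOT EXISTS IX_Batches_ProductId ON Batches (ProductId);";
    cmd.ExecuteNonQuery();
    cmd.CommandText = "ANALYZE;";   // helps SQLite's planner make use of the new index
    cmd.ExecuteNonQuery();
}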

Related

Performance: Lots of queries or lots of processing?

Currently I am creating a C# application which has to read a lot of data (over 2,000,000 records) from an existing database and compare it with a lot of other data (also about 2,000,000 records) which is not in the database. These comparisons will mostly be string comparisons. The amount of data will grow much bigger, so I need to know which solution will give the best performance.
I have already searched the internet and came up with two solutions:
Solution 1
The application will execute a single query (SELECT column_name FROM table_name, for example) and store all the data in a DataTable. The application will then compare all the stored data with the input, and if there is a match it will be written to the database.
Pros:
The query will only be executed once. After that, I can use the stored data multiple times for all incoming records.
Cons:
As the database grows bigger, so will my RAM usage. Currently I have to work with 1GB (I know, tough life) and I'm afraid it won't fit if I practically download the whole content of the database into it.
Processing all the data will take lots and lots of time.
Solution 2
The application will execute a specific query for every record (a reusable, parameterized version of this is sketched after the cons list below), for example
SELECT column_name FROM table_name WHERE value_name = value
and will then check whether the DataTable has any records, something like
if (dataTable.Rows.Count > 0) { /* etc */ }
If it has records, I can conclude there are matching records and I can write to the database.
Pros:
Probably a lot less usage of RAM since I will only get specific data.
Processing goes a lot faster.
Cons:
I will have to execute a lot of queries. If you are interested in numbers, it will probably be around 5 queries per record. Having 2,000,000 records, that would be 10,000,000 queries.
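If you do go the Solution 2 route, a hedged sketch (SQL Server's System.Data.SqlClient assumed; the table/column names are the placeholders from above, connectionString and incomingRecords are assumed to exist, and COUNT(*) is used so no rows have to be pulled into a DataTable) that prepares one parameterized command and reuses it for every lookup:

using System.Data;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT COUNT(*) FROM table_name WHERE value_name = @value", conn))
{
    cmd.Parameters.Add("@value", SqlDbType.NVarChar, 100);   // size is a guess
    conn.Open();
    cmd.Prepare();                                           // parse/plan once, reuse for every record

    foreach (var record in incomingRecords)                  // the ~2,000,000 inputs
    {
        cmd.Parameters["@value"].Value = record.Value;        // hypothetical property
        if ((int)cmd.ExecuteScalar() > 0)
        {
            // a match exists; write it to the database here
        }
    }
}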
My question is, what would be the smartest option, given that I have limited RAM?
Any other suggestions are welcome as well, of course.
If you have SQL Server available to you, this seems like a job directly suited to SQL Server Integration Services. You might consider using that tool instead of building your own. It depends on your exact business needs, but in general, data merging like this would be a batch/unattended or tool-based operation.
You might be able to code it to run faster than SSIS, but I'd give it a try just to see if it's acceptable to you, and save yourself the cost of the custom development.

SQL Database VS. Multiple Flat Files (Thousands of small CSV's)

We are designing an update to a current system (C++/CLI and C#).
The system will gather small (~1MB) amounts of data from ~10K devices (in the near future). Currently, each device saves its data in a CSV (a table), and all of these are stored in a wide folder structure.
Data is only inserted (create / append to a file, create folder) never updated / removed.
Data processing is done by reading many CSVs into an external program (like MATLAB), mainly for statistical analysis.
There is an option to start saving this data to an MS-SQL database.
Processing time (reading the CSVs into the external program) can be up to a few minutes.
How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
I'd appreciate your answers, Pros and Cons are welcome.
Thank you for your time.
Well, if you are using data in one CSV to get data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.
Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
Possible Pros:
Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships
Possible Cons:
You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server; you would have to check whether you can use the Express edition
Good luck!
I'd like to try hitting those questions a bit out of order.
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)
Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
Does one of the methods take significantly more storage than the other?
Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.
How should we choose which method to use?
Great question. Everything in the database always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.
It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query even 100 GB worth of data quickly.
This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, or you just don't think you can figure out SQL Server, or if you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.
If you have the option to use a ms-sql database, I would do that.
Maintaining data in a wide folder structure is never a good idea. Reading your data would involve reading several files that could be stored anywhere on your disk, so your file I/O time would be quite high. SQL Server, being a production database, already has these problems taken care of.
You are reinventing the wheel here. This is how FoxPro manages data: one file per table. It is usually a good idea to use proven technology unless you are actually building a database server.
I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given your roughly 10k devices, you should consider using a standard database.

Using LINQ vs SQL for Filtering Collection

I have a very general question regarding the use of LINQ vs SQL to filter a collection. Let's say you are running a fairly complex filter on a database table, and it runs, say, 10,000 times with potentially different filters every time. Performance-wise, are you better off loading the entire database table into memory and executing the filters with LINQ, or should you let the database handle the filtering with SQL (since that's what it was built to do)? Any thoughts?
EDIT: I should have been more clear. Let's assume we're talking about a table with 1000 records and 20 columns (containing int/string/date data). Currently in my app I run one query every half hour to pull all of the data into a collection (saving that collection in the application cache) and filter that cached collection throughout my app. I'm wondering if that is worse than doing tons of round trips to the database server (it's Oracle, fwiw).
After the update:
It's running, say 10,000 times, and I'm going to assume a table with 1,000 records.
It seems reasonable to assume that 1k records will fit easily in memory, and then running 10k filters in memory (LINQ) will be much cheaper.
Using SQL for every filter would mean reading on the order of 10 million rows in total, which is a lot of I/O.
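Under those assumptions, a minimal sketch of the in-memory path (the Record type and the filter criteria are made up for illustration): load the ~1,000 rows once, then run each of the 10,000 filters against the cached list with LINQ.

using System;
using System.Collections.Generic;
using System.Linq;

public class Record
{
    public int Id { get; set; }
    public string Status { get; set; }
    public DateTime Created { get; set; }
}

public static class RecordFilters
{
    // Loaded once (e.g. every half hour, as in the question) and reused for every filter.
    public static List<Record> Cached { get; set; } = new List<Record>();

    // One of the 10,000 ad-hoc filters; the criteria differ on every call.
    public static List<Record> Filter(string status, DateTime since) =>
        Cached.Where(r => r.Status == status && r.Created >= since).ToList();
}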
EDIT
It always depends on the amount of data you have: if you have a large amount of data, go for SQL; if it's small, go for LINQ. It also depends on how frequently you query the data from SQL Server: if it's very frequent, it's better to load the data into memory and then apply LINQ, but if not, SQL is better.
First Answer
It's better to filter on the SQL side rather than loading everything into memory and then applying a LINQ filter.
The reason it's better to go for SQL rather than LINQ is this:
If you go for LINQ, all 10,000 records are loaded into memory, and network traffic increases.
If you go for SQL, the number of records returned decreases, so less memory is used and network traffic also decreases.
Depends on how big your table is and what type of data it stores.
Personally, I'd go with returning all the data if you plan to use all your filters during the same request.
If it's a filter on demand using Ajax, you could reload the data from the database every time (ensuring, at the same time, that your data is up to date).
This will probably cause some debate on the role of a database! I had this exact problem a little while back: some relatively complex filtering (things like "is in X country, where price is Y and has the keyword Z") and it was horrifically slow. Coupled with this, I was not allowed to change the database structure because it was a third-party database.
I swapped out all of the logic so that the database just returned the results (which I cached every hour) and did the filtering in memory; when I did this I saw massive performance increases.
I will say that it is far better to let SQL do the complex filtering and the rest of the processing, but you may ask why.
The main reason is that SQL Server has the index information you have set up and uses those indexes to access data very fast. If you load the data and filter it with LINQ, you do not have that index information for fast access, so you lose time accessing the data. You also lose time compiling the LINQ every time.
You can run a simple test to see the difference for yourself: create a simple table with a hundred random strings and index that field. Then search on the string field, once using LINQ over the loaded data and once by asking SQL directly.
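A rough sketch of that test (SQL Server and System.Data.SqlClient assumed; the table, column, connection string and search value are all placeholders), timing the in-memory LINQ search against the indexed SQL search with Stopwatch:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Linq;

var connectionString = "Server=.;Database=TestDb;Trusted_Connection=True";  // assumed
var searchValue = "some-random-string";                                     // assumed

// Load the whole column once; this load is the hidden cost of the LINQ approach.
var all = new List<string>();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT Name FROM TestStrings", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            all.Add(reader.GetString(0));
}

var sw = Stopwatch.StartNew();
var linqHits = all.Where(s => s == searchValue).ToList();    // in-memory LINQ scan
sw.Stop();
Console.WriteLine("LINQ over loaded data: " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT Name FROM TestStrings WHERE Name = @name", conn))
{
    cmd.Parameters.AddWithValue("@name", searchValue);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read()) { }                            // the index does the work server-side
}
sw.Stop();
Console.WriteLine("Direct SQL with index: " + sw.ElapsedMilliseconds + " ms");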
Update
My first thought was that SQL keeps the index and gives very quick access to the searched data based on your SQL.
Then again, LINQ can also translate this filter to SQL, get the data, and then you perform your action, etc.
Now I think the actual answer depends on what actions you perform. It is faster to run the SQL directly, but how much faster depends on how you actually set up your LINQ.
If you try to load everything into memory and then use LINQ, you lose the speed of the SQL indexes, you lose memory, and you spend a lot of effort moving your data from SQL to memory.
If you get the data using LINQ and no further searches need to be made, you still pay for moving all that data into memory, and you lose memory.
It depends on the amount of data you are filtering on.
You say the filter runs 10K times and can be different every time; in this case, if you don't have much data in the database, you can load it into a server-side variable.
If you have hundreds of thousands of records in the database, you should not do this; instead, you can create indexes on the database and pre-compiled procedures to fetch data faster.
You can implement a cache facade in between that stores the data server-side on the first request and updates it as per your requirement (you can have the cache fill the variable only if the data is within a record limit); a sketch of this is shown below.
You can measure the time to get data from the database by running some test queries. At the same time, you can observe the response time from the server when the data is stored in memory, calculate the difference, and decide based on that.
There can be many other tricks, but the bottom line is: you have to observe and decide.
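A minimal sketch of such a cache facade, assuming System.Runtime.Caching is available (the cache key, the 30-minute expiry and the loader delegate are all placeholders):

using System;
using System.Collections.Generic;
using System.Runtime.Caching;

public static class CacheFacade
{
    // Returns the cached collection for 'key', loading it from the database via
    // the supplied delegate on the first request (or after the entry expires).
    public static List<T> Get<T>(string key, Func<List<T>> loadFromDatabase)
    {
        if (MemoryCache.Default.Get(key) is List<T> cached)
            return cached;

        var fresh = loadFromDatabase();
        MemoryCache.Default.Set(key, fresh,
            new CacheItemPolicy { AbsoluteExpiration = DateTimeOffset.Now.AddMinutes(30) });
        return fresh;
    }
}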

updating a db record each time i retrieve it from the db - is it a good practice?

I'm using SQL Server and I have a specific table that can contain ~1 million to ~10 million records max.
For each record I retrieve, I do some checks (I run a few simple lines of code), and then I want to mark that the record was checked at DateTime.Now;
so what I do is retrieve a record, check some stuff, run an 'update' query to set the 'last_checked_time' field to DateTime.Now, and then move to the next record.
I can then get all the records ordered by their 'last_checked_time' field (ascending), and iterate over them ordered by their check time.
Is this a good practice? Can it still remain speedy as long as I have no more than 10 million records in that table?
I've read somewhere that every 'update' query is actually a deletion and a creation of a new record.
I'd also like to mention that these records will be frequently retrieved by my ASP.NET website.
I was thinking of writing the 'last_checked_time' to a local text/binary file, but I'm guessing that would mean implementing something the database can already do for me.
If you need that "last checked time" value then the best, most efficient, place to hold it is on the row in the table. It doesn't matter how many rows there are in the table, each update will affect just the row(s) you updated.
How an update is implemented is up to the DBMS, but it is not generally done by deleting and re-inserting the row.
I would recommend retrieving your data (or a portion of it), doing your checks on all of it, and sending the updates back in transactions to let the database operate more effectively; this gives you fewer round trips (see the sketch below).
As to whether this is a good practice, I would say yes, especially since you are using the value in your queries. Definitely do not store the last checked time in a file and try to match it up after you load your database data; the RDBMS is designed to handle this for you efficiently. Don't reinvent the wheel using cubes.
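A hedged sketch of that batching idea (SQL Server's System.Data.SqlClient assumed; the table name, id column and batch handling are placeholders, and only last_checked_time comes from the question):

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Marks a batch of already-checked record ids with one commit for the whole batch.
static void MarkChecked(string connectionString, IEnumerable<long> checkedIds)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var tx = conn.BeginTransaction())
        using (var cmd = new SqlCommand(
            "UPDATE Records SET last_checked_time = @now WHERE id = @id", conn, tx))
        {
            cmd.Parameters.AddWithValue("@now", DateTime.Now);
            var idParam = cmd.Parameters.Add("@id", SqlDbType.BigInt);

            foreach (var id in checkedIds)
            {
                idParam.Value = id;
                cmd.ExecuteNonQuery();   // each update touches only its own row
            }

            tx.Commit();                 // one commit for the whole batch
        }
    }
}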
Personally, I see no issues with it. It seems perfectly reasonable to store the last checked time in the database, especially since it might be used in queries (for example, to find records that haven't been checked in over a week).
Maybe (just maybe) you could create a new table containing two columns: the id of the row in the first table and the checked date.
That way you wouldn't alter the original table, but depending on how the data and the check date are used, you could be forced into a joined query, which may be something you also don't want.
It makes sense to store the 'checked time' as part of the row you're updating, rather than in a separate file or even a separate table in the database. This approach should provide optimal performance and help to maintain consistency. Solutions involving more than one table or external data stores may introduce a requirement for distributed or multi-table transactional updates that involve significant locking, which can negatively impact performance and make it much more difficult to guarantee consistency.
In general, solutions that minimize the scope of transactions and, by extension, locking, are worth striving for. Also, simplicity itself is a useful goal.

Best way to process a large database?

Background:
I have one Access database (.mdb) file with half a dozen tables in it. The file is ~300MB, so not huge, but big enough that I want to be efficient. In it there is one major table, a client table. The other tables store data like consultations made, a few extra many-to-one related fields, that sort of thing.
Task:
I have to write a program to convert this Access database to a set of XML files, one per client. This is a database conversion application.
Options:
(As I see it)
Load the entire Access database into memory in the form of Lists of immutable objects, then use LINQ to do lookups in these lists for the associated data I need.
Benefits:
Easily parallelised: start up a ThreadPool thread for each client. Because all the objects are immutable, they can be freely shared between the threads, which means all threads have access to all data at all times, and it is all loaded exactly once.
(Possible) Cons:
May use extra memory, loading orphaned items, items that aren't needed anymore, etc.
Use Jet to run queries on the database to extract data as needed.
Benefits:
Potentially lighter weight. Only loads data that is needed, and as it is needed.
(Possible) Cons:
Potentially heavier! May load items more than once and hence use more memory.
Possibly hard to parallelise, unless Jet/OleDb supports concurrent queries (can someone confirm/deny this?)
Some other idea?
What are Stack Overflow's thoughts on the best way to approach this problem?
Generate XML parts from SQL. Store each fetched record in the file as you fetch it.
Sample:
SELECT '<Node><Column1>' + Column1 + '</Column1><Column2>' + Column2 + '</Column2></Node>' FROM MyTable
If your objective is to convert your database to XML files, you can:
connect to your database through an ADO/OLEDB connection
successively open each of your tables as ADO recordsets
Save each of your recordsets as an XML file:
myRecordset.Save myXMLFile, adPersistXML
If you are working from within the Access file, use CurrentProject.AccessConnection as your ADO connection
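If you would rather do the same thing from C#, a similar sketch (the Jet provider string is for .mdb files; the paths and table name are placeholders) uses OleDb and DataSet.WriteXml:

using System.Data;
using System.Data.OleDb;

// Dumps one Access table to an XML file; call once per table.
static void ExportTableToXml(string mdbPath, string tableName, string xmlPath)
{
    var connString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + mdbPath;

    using (var conn = new OleDbConnection(connString))
    using (var adapter = new OleDbDataAdapter("SELECT * FROM [" + tableName + "]", conn))
    {
        var ds = new DataSet();
        adapter.Fill(ds, tableName);                    // Fill opens/closes the connection itself
        ds.WriteXml(xmlPath, XmlWriteMode.WriteSchema); // include the schema so types round-trip
    }
}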
From the sounds of this, it would be a one-time operation. I strongly discourage loading the entire database into memory; that just does not seem like an efficient way of doing this at all.
Also, depending on your needs, you might be able to extract directly from Access -> XML if that is your true end game.
Regardless, with a database that small, doing them one at a time with a few specifically written queries would, in my opinion, be easier to manage, faster to write, and less error prone.
I would lean towards Jet, since you can be more specific about what data you want to pull.
Also, I noticed the large file size; this is a problem I recently came across at work. Is this an Access 95 or 97 DB? If so, converting the DB to 2000 or 2003 and then back to 97 will reduce the size; it seems to be a bug in some cases. The DB I was dealing with claimed to be 70 MB; after I converted it to 2000 and back again, it was 8 MB.
