Linq performance for in-memory collection - c#

I have a list: Collection users which has around 100K+ records of users (all user objects fully loaded from the database with fields like Bio, First name, last name etc). This collection is fetched on application start from the database and is kept in memory.
Then I have code like:
User cachedUser = users.FirstOrDefault(x => string.Equals(x.UserName, username,
StringComparison.CurrentCultureIgnoreCase));
I use this to fetch users from the collection, but I noticed that the operation is incredibly slow. Is there a performance issue when using LINQ to query an in-memory collection of large objects? Should I instead call the DB each time I want to get a user?

I think you might need to re-think your architecture based on the information you have given us. Take advantage of the database and let it do the search work for you. Observe, measure, and make changes accordingly after that. You might realize that you prematurely optimized the whole thing.

If you want to optimize your response time, you could create a Dictionary<T,U> and look the user up in it:
Dictionary<string, User> usersDictionary = new Dictionary<string, User>(StringComparer.CurrentCultureIgnoreCase);
// After querying the users from the DB add them to the dictionary
usersDictionary.Add(user.UserName, user);
// Then when you need to retrieve a user
User retrieveUser = null;
usersDictionary.TryGetValue(username, out retrieveUser);
Hope that helps !

Your LINQ query, like any other iteration technique (a loop, a search in an array), will access every single record until the requested record is found. In the worst case that means 100k comparisons. To make this faster, you have the following options:
Use a sorted list or a dictionary: a binary search or a keyed lookup is a lot faster than a linear scan. For a sorted list, sort the data when fetching it from the database by using ORDER BY.
Use a DataSet. It's like an in-memory database which provides faster search.
Leave the data in the database and set appropriate indexes for faster access.
I suggest using the database for the following reasons:
It's a waste of memory to store 100k records, which you probably never use
As soon as you change your data, you will have to refresh your cache, which might be rather complex
web applications are multithreaded (every request runs in its own thread). In case you change your data, you will have to synchronize with locks.
a database can cache frequently called data
you have to write less code
you have a stateless web application which scales better (web farms)
your application probably has other data, you cannot store everything in memory
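To illustrate the sorted-list option above, here is a minimal sketch of a binary search over user names. The User class is a hypothetical stand-in with a single property, SortedUserIndex is a name made up for this example, and StringComparer.OrdinalIgnoreCase is an assumption (the question uses CurrentCultureIgnoreCase). It sorts in memory rather than relying on ORDER BY, because the database collation may not order strings the same way as the .NET comparer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal User type; the real class has more fields (Bio, etc.).
public sealed class User
{
    public string UserName { get; set; } = "";
}

// Illustrative helper (not part of any library): keeps user names in a sorted
// array so that a lookup is a binary search instead of a linear scan.
public sealed class SortedUserIndex
{
    // Assumption: ordinal, case-insensitive matching is acceptable here.
    private static readonly StringComparer Comparer = StringComparer.OrdinalIgnoreCase;

    private readonly string[] _names;
    private readonly User[] _users;

    public SortedUserIndex(IEnumerable<User> users)
    {
        _users = new List<User>(users).ToArray();
        _names = Array.ConvertAll(_users, u => u.UserName);
        // Sorts _names and reorders _users in step with it.
        Array.Sort(_names, _users, Comparer);
    }

    // Returns the matching user, or null when no user has that name.
    public User Find(string userName)
    {
        int i = Array.BinarySearch(_names, userName, Comparer);
        return i >= 0 ? _users[i] : null;
    }
}
```

With 100k users a lookup needs roughly 17 comparisons instead of up to 100k. The same comparer has to be used for sorting and for searching, otherwise the binary search can miss entries.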

The difference in search performance that you notice is because the database uses an index to locate the string, whereas in memory you simply scan all the records until you find the one you want. The database may also keep a hash of the string and search for that hash, which is a lot faster than actually comparing strings.
The Dictionary<> also builds an index, but adding data gets slower as the data grows, because every time an item is added it has to work out where to place it in the index.
The database also caches results, and many databases cache the indexes as well and keep extra statistics that help them quickly locate what you are looking for.
It is better to let the database do the search, unless you can do something faster for special custom cases.

Related

Query DB with the little things, or store some "bigger chunks" of results and filter them in code?

I'm working on an application that imports video files and lets the user browse them and filter them based on various conditions. By importing I mean creating instances of my VideoFile model class and storing them in a DB table. Once hundreds of files are there, the user wants to browse them.
Now, the first choice they have in the UI is to select a DateRecorded, which calls a GetFilesByDate(Date date) method on my data access class. This method will query the SQL database, asking only for files with the given date.
On top of that, I need to filter files by, let's say, FrameRate, Resolution or UserRating. This would place additional criteria on the files already filtered by their date. I'm deciding which road to take:
Only query the DB for a new set of files when the desired DateRecorded changes. Handle all subsequent filtering manually in C# code, by iterating over the stored collection of _filesForSelectedDay and testing them against current additional rules.
Query the DB each time any little filter changes, asking for a smaller and very specific set of files more often.
Which one would you choose, or even better, any thoughts on pros and cons of either of those?
Some additional points:
A query in GetFilesByDate is expected to return tens of items, so it's not very expensive to store the result in a collection always sitting in memory.
Later down the road I might want to select files not just for a specific day, but let's say for the entire month. This may give hundreds or thousands of items. This actually makes me lean towards option two.
The data access layer is not yet implemented. I just have a dummy class implementing the required interface, but storing the data in a in-memory collection instead of working with any kind of DB.
Once I'm there, I'll almost certainly use SQLite and store the database in a local file.
Personally I'd always go to the DB every time until it proves impractical. If it's a small amount of data then the overhead should also be small. When it gets larger then the DB comes into its own. It's unlikely you will be able to write code better than the DB, although the round trip can cost. Using the DB, your data will always be consistent and up to date.
If you find you are hitting the DB too hard then you can try caching your data and working out if you already have some or all of the data being requested to save time. However, then you have aging and consistency problems to deal with. You also then have servers with memory stuffed full of data that could be used for other things!
Basically, until it becomes an issue, just use the DB and use your energy on the actual problems you encounter, not the maybes.
If you've already gotten a bunch of data to begin with, there's no need to query the db again for a subset of that set. Just store it in an object which you can query on refinement of the search query by the user.
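As a rough sketch of that in-memory refinement: the VideoFile class below is a hypothetical stand-in for the asker's model, and VideoFilter.Apply and its parameters are names made up for this example. Each optional criterion is applied to the collection already fetched for the selected date.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the asker's model class.
public sealed class VideoFile
{
    public string Name { get; set; } = "";
    public DateTime DateRecorded { get; set; }
    public double FrameRate { get; set; }
    public int UserRating { get; set; }
}

public static class VideoFilter
{
    // Applies only the criteria that are set (non-null) to the files
    // already loaded for the selected day.
    public static List<VideoFile> Apply(
        IEnumerable<VideoFile> filesForSelectedDay,
        double? minFrameRate,
        int? minUserRating)
    {
        IEnumerable<VideoFile> query = filesForSelectedDay;

        if (minFrameRate.HasValue)
            query = query.Where(f => f.FrameRate >= minFrameRate.Value);

        if (minUserRating.HasValue)
            query = query.Where(f => f.UserRating >= minUserRating.Value);

        return query.ToList();
    }
}
```

For tens of items this runs in microseconds. If the cached set later grows to a whole month, the same criteria could be turned into a WHERE clause instead.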

Using LINQ vs SQL for Filtering Collection

I have a very general question regarding the use of LINQ vs SQL to filter a collection. Let's say you are running a fairly complex filter on a database table. It's running, say, 10,000 times and the filters could be different every time. Performance-wise, are you better off loading the entire database table into memory and executing the filters with LINQ, or should you let the database handle the filtering with SQL (since that's what it was built to do)? Any thoughts?
EDIT: I should have been more clear. Let's assume we're talking about a table with 1000 records with 20 columns (containing int/string/date data). Currently in my app I am running one query every 1/2 hour to pull all of the data into a collection (saving that collection in the application cache) and filtering that cached collection throughout my app. I'm wondering if that is worse than doing tons of round trips to the database server (it's Oracle fwiw).
After the update:
It's running, say 10,000 times and
I'm going to assume a table with 1000 records
It seems reasonable to assume the 1k records will fit easily in memory.
And then running 10k filters will be much cheaper in memory (LINQ).
Using SQL would mean loading 10M records, a lot of I/O.
EDIT
It always depends on the amount of data you have. If you have a large amount of data, go for SQL; if less, go for LINQ. It also depends on how frequently you fetch the data from SQL Server: if it is very frequent, it is better to load the data into memory and then apply LINQ, but if not, SQL is better.
First Answer
It is better to filter on the SQL side rather than loading the data into memory and then applying a LINQ filter.
The reason to go for SQL rather than LINQ is:
If you go for LINQ:
when you fetch 10,000 records, they are all loaded into memory, and network traffic increases as well.
If you go for SQL:
the number of records returned decreases, so less memory is used and network traffic decreases as well.
Depends on how big your table is and what type of data it stores.
Personally, I'd go with returning all the data if you plan to use all your filters during the same request.
If it's a filter on demand using ajax, you could reload the data from the database every time (which also ensures that your data is up to date).
This will probably cause some debate on the role of a database! I had this exact problem a little while back, some relatively complex filtering (things like "is in X country, where price is y and has the keyword z") and it was horrifically slow. Coupled with this, I was not allowed to change the database structure because it was a third party database.
I swapped out all of the logic, so that the database just returned the results (which I cached every hour) and did the filtering in memory. When I did this I saw massive performance increases.
I would say that it is far better to let SQL do the complex filtering and the rest of the processing. Why, you may ask?
The main reason is that SQL Server has the indexes that you have defined and uses them to access data very fast. If you load the data and filter it with LINQ, you do not have those indexes for fast access, and you lose time scanning the data. You also lose time compiling the LINQ query every time.
You can run a simple test to see this difference for yourself. Create a simple table with a hundred random strings and index the string column. Then search on the string column, once using LINQ and once by querying SQL directly.
Update
My first thought was that SQL keeps the index and gives very quick access to the data you search for.
Then I realized that LINQ can also translate the filter to SQL and then fetch the data, after which you do your processing, and so on.
Now I think that the actual reason depends on what you do. It is faster to run the SQL directly, but why that is depends on how you actually set up your LINQ.
If you load everything into memory and then use LINQ, you lose the speed of the SQL index, you use more memory, and you spend a lot of work moving your data from SQL into memory.
If you get the data using LINQ and no further search needs to be made, then you still lose on moving all that data into memory, and on the memory itself.
It depends on the amount of data you are filtering on.
You say the filter runs 10K times and can be different every time. In that case, if you don't have much data in the database, you can load it into a server-side variable.
If you have hundreds of thousands of records in the database, you should not do this; instead you can create indexes on the database and precompiled procedures to fetch data faster.
You can implement a cache facade in between that stores the data on the server side on the first request and updates it as your requirements dictate. (You can write the cache so that it fills the variable only if the data stays under a record limit.)
You can measure the time it takes to get data from the database by running some test queries and observing. At the same time you can observe the response time from the server when the data is stored in memory, calculate the difference, and decide accordingly.
There can be many other tricks, but the bottom line is:
You have to observe and decide.
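One way to do that measuring is a small timing helper built on System.Diagnostics.Stopwatch. This is only a sketch: Timing.AverageMilliseconds is a name made up for this example, and the two actions you pass in (the database query and the in-memory filter) are yours to supply.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action once as a warm-up (so JIT compilation is not measured),
    // then runs it `iterations` times and returns the average time per call
    // in milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        action(); // warm-up call, not timed

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

Call it once with the database query and once with the in-memory filter, using realistic data volumes, and compare the two averages. Numbers from a development machine with a local database may not carry over to production, so treat them as a rough guide.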

C# Data Collection v Database lookup - which is the more efficient?

I was wondering if I could pick the brains of the community...
In a project I am working on, there is a need to look up a value from a key-value list. This list will not change throughout the life of the software. Let's say as an example the list is thus:
ID Name
1 Apple
2 Orange
3 Pear
4 Banana
...and so on.
I am considering two methods of implementing this. The first is to store the list in some sort of (as yet undecided) C# data collection, and then look the required value up at runtime as required. The second method I am considering is storing the list within a database table (SQL Server 2008, since you asked). The application can then access the database at runtime via a stored procedure.
The lookup will occur twice in quick succession following a request made by the user from a web form.
My query is this: Which of these two methods would be the most efficient in terms of processing time?
I realise that there might not be a definitive answer to this question, but I would welcome any comments or thoughts.
What is expensive in terms of performance is usually disk access and network access.
If you have the collection available in memory (RAM) in the web server or application server, this would be faster than a query to the database.
If the data is not likely to change often or at all, you can go for an in-memory data structure. If it changes sometimes, you can query it from the DB and store it in the cache, so subsequent accesses to that object will not require a database query until the cache expires or is reset, depending on your needs.
Use a Dictionary<>, it's orders of magnitude faster than a roundtrip to the database.
As you state -
This list will not change throughout the life of the software.
I'd hard code the list into the program. It's not worth the database overhead for a list that will never change.
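A minimal sketch of what hard-coding could look like, using the example IDs and names from the question. FruitLookup and GetName are names invented for this illustration.

```csharp
using System.Collections.Generic;

public static class FruitLookup
{
    // The fixed key-value list from the question, built once when the
    // class is first used.
    private static readonly Dictionary<int, string> Names = new Dictionary<int, string>
    {
        { 1, "Apple" },
        { 2, "Orange" },
        { 3, "Pear" },
        { 4, "Banana" },
    };

    // Returns the name for the given ID, or null when the ID is unknown.
    public static string GetName(int id)
    {
        string name;
        return Names.TryGetValue(id, out name) ? name : null;
    }
}
```

Because the dictionary is never modified after initialization, concurrent reads from several web requests are safe.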

How can I use a very large dictionary in C#?

I want to use a lookup map or dictionary in a C# application, but it is expected to store 1-2 GB of data.
Can someone please tell if I will still be able to use dictionary class, or if I need to use some other class?
EDIT: We have an existing application which uses an Oracle database to query or look up object details. It is however too slow, since the same objects are getting queried repeatedly. I was feeling that it might be ideal to use a lookup map for this scenario, to improve the response time. However, I am worried that the size will make it a problem.
Short Answer
Yes. If your machine has enough memory for the structure (and the overhead of the rest of the program and system including operating system).
Long Answer
Are you sure you want to? Without knowing more about your application, it's difficult to know what to suggest.
Where is the data coming from? A file? Files? A database? Services?
Is this a caching mechanism? If so, can you expire items out of the cache once they haven't been accessed for a while? This way, you don't have to hold everything in memory all the time.
As others have suggested, if you're just trying to store lots of data, can you just use a database? That way you don't have to have all of the information in memory at once. With indexing, most databases are excellent at performing fast retrieves. You could combine this approach with a cache.
Is the data that will be in memory read only, or will it have to be persisted back to some storage when something changes?
Scalability - do you expect that the amount of data that will be stored in this dictionary will increase as time goes on? If so, you're going to run into a point where it's very expensive to buy machines that can handle this amount of data. You might want to look at a distributed caching system if this is the case (AppFabric comes to mind) so you can scale out horizontally (more machines) instead of vertically (one really big expensive point of failure).
UPDATE
In light of the poster's edit, it sounds like caching would go a long way here. There are many ways to do this:
Simple dictionary caching - just cache stuff as it's requested.
Memcache
Caching Application Block I'm not a huge fan of this implementation, but others have had success.
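The first option, simple dictionary caching, could be sketched roughly as follows. LookupCache is a name invented for this example, and the load delegate stands in for whatever call currently queries the Oracle database. There is no expiry or size limit here, which a 1-2 GB data set would likely need.

```csharp
using System;
using System.Collections.Concurrent;

// Caches values as they are requested: the first request for a key calls
// the loader, later requests for the same key are served from memory.
public sealed class LookupCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _cache =
        new ConcurrentDictionary<TKey, TValue>();

    // Hypothetical loader, e.g. a method that queries the database by key.
    private readonly Func<TKey, TValue> _load;

    public LookupCache(Func<TKey, TValue> load)
    {
        _load = load ?? throw new ArgumentNullException(nameof(load));
    }

    public TValue Get(TKey key)
    {
        // Under concurrent access the loader may run more than once for the
        // same key, but only one result is stored and returned to callers.
        return _cache.GetOrAdd(key, _load);
    }
}
```

Usage would be along the lines of new LookupCache<long, MyObject>(id => LoadFromDatabase(id)), where MyObject and LoadFromDatabase are placeholders for your own type and query.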
As long as you're on a 64GB machine, yes you should be able to use that large of a dictionary. However if you have THAT much data, a database may be more appropriate (cassandra is really nothing but a gigantic dictionary, and there's always MySQL).
When you say 1-2GB of data, I assume that you mean the items are complex objects that cumulatively contain 1-2GB.
Unless they're structs (and they shouldn't be), the dictionary doesn't care how big the items are.
As long as you have less than about 2^24 items (I pulled that number out of a hat), you can store as much as you can fit in memory.
However, as everyone else has suggested, you should probably use a database instead.
You may want to use an in-memory database such as SQL CE.
You can, but for a dictionary as large as that you are better off using a database.
Use a database.
Make sure you've a good DB model, put correct indexes, and off you go.
You can use subdictionaries.
Dictionary<KeyA, Dictionary<KeyB ....
Where KeyA is some common part of KeyB.
For example, if you have a String dictionary you can use the First letter as KeyA.
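A rough sketch of that first-letter partitioning, where the outer key (KeyA) is the first character of the string key (KeyB). PartitionedDictionary is a name invented for this example, and whether splitting actually helps for a given data set is something to measure.

```csharp
using System;
using System.Collections.Generic;

// Splits one large string-keyed dictionary into sub-dictionaries,
// one per first character of the key.
public sealed class PartitionedDictionary<TValue>
{
    private readonly Dictionary<char, Dictionary<string, TValue>> _partitions =
        new Dictionary<char, Dictionary<string, TValue>>();

    public void Add(string key, TValue value)
    {
        char partition = PartitionOf(key);
        Dictionary<string, TValue> inner;
        if (!_partitions.TryGetValue(partition, out inner))
        {
            inner = new Dictionary<string, TValue>();
            _partitions[partition] = inner;
        }
        inner.Add(key, value); // throws on duplicate keys, like Dictionary.Add
    }

    public bool TryGetValue(string key, out TValue value)
    {
        Dictionary<string, TValue> inner;
        if (_partitions.TryGetValue(PartitionOf(key), out inner))
        {
            return inner.TryGetValue(key, out value);
        }
        value = default(TValue);
        return false;
    }

    // KeyA: the first character of the key ('\0' for the empty string).
    private static char PartitionOf(string key)
    {
        if (key == null) throw new ArgumentNullException(nameof(key));
        return key.Length > 0 ? key[0] : '\0';
    }
}
```

Each inner dictionary stays smaller than one combined dictionary would be, at the cost of a second lookup per access.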

C#: Very fast object search & retrieval using any persistence model

I am developing an application with Fluent NHibernate/NHibernate 3/SQLite. I have run into a very specific problem that I need help with.
I have a product database and a batch database. There are around 100k products, but batches are at the 11 million+ mark as of now. When provided with a product, I need to fill a Combobox with batches. As I do not want to load all the batches at once because of memory constraints, I am loading them directly from the database when the product is provided. But the problem is that SQLite (or maybe the combination of SQLite and NH) is a little slow for this. It normally takes around 3+ seconds to retrieve the batches for a particular product. Although it might not seem like a slow scenario, I want to know whether I can improve this time. I need sub-second results to make order entry a smooth experience.
The details:
New products and batches are imported periodically (bi-monthly).
Nothing in the already persisted products or batches ever changes (No Update).
Storing products is not an issue. Batches are the main culprit.
Product Ids are long
Batch Ids are string
Batches contain 3 fields, rate, mrp (both decimal) & expiry (DateTime).
The requirements:
The data has to be stored in a file based solution. I cannot use a client-server approach.
Storage time is not important. Search & retrieval time is.
I am open to storing the batch database using any other persistence model.
I am open to using anything like Lucene, or a nosql database (like redis), or a oodb, provided they are based on single storage file implementation.
Please suggest what I can use for fast object retrieval.
Thanks.
You need to profile or narrow down to find out where those 3+ seconds are.
Is it the database fetching?
Try running the same queries in a SQLite browser. Do the queries take 3+ seconds there too? Then you might need to do something with the database, like adding some good indexes.
Is it the filling of the combobox?
What if you only fill the first value in the combobox and throw away the others? Does that speed up the performance? Then you might try BeginUpdate and EndUpdate.
Are the 3+ seconds elsewhere? If so, find out where.
This may seem like a silly question, but I figured I'd double-check before proceeding to alternatives or other optimizations: is there an index (or hopefully a primary key) on the Batch Id column in your Batch table? Without indexes those kinds of searches will be painfully slow.
For fast object retrieval, a key/value store is definitely a viable alternative. I'm not sure I would necessarily recommend redis in this situation since your Batches database may be a little too large to fit into memory, and although it also stores to a disk it's generally better when suited with a dataset that strictly fits into memory.
My personal favourite would be mongodb - but overall the best thing to do would be to take your batches data, load it into a couple of different nosql dbs and see what kind of read performance you're getting and pick the one that suits the data best. Mongo's quite fast and easy to work with - and you could probably ditch the nhibernate layer for such a simple data structure.
There is a daemon that needs to run locally, but depending on the size of the db it will be single file (or a few files if it has to allocate more space). Again, ensure there is an index on your batch id column to ensure quick lookups.
3 seconds to load ~100 records from the database? That is slow. You should examine the generated sql and create an index that will improve the query's performance.
In particular, the ProductId column in the Batches table should be indexed.
