Problem: I have millions of rows from a database to process.
I need to implement a method that will return a "stream" of database rows.
I don't want to load all of them into memory at once.
I was thinking about returning a lazy IEnumerable<Record> and using yield.
The method would handle loading consecutive records using SqlDataReader.
But what happens when a client calls .Count() on my IEnumerable? Counting all the records would mean fetching them all.
Is there a good, modern way to return a stream of objects without storing all of them in memory, processing them one by one? My method should return a stream of records.
It seems like Reactive Extensions might solve the problem for me but I have never used it.
Any ideas?
Thanks
First, why reinvent the wheel? Entity Framework makes it easier to do stuff like this and adds all of the abstraction for you. DbSet<TEntity> on the DbContext implements IQueryable<TEntity> and IEnumerable<TEntity>, so you can:
Execute Count() (with or without a lambda filter argument) as an extension method when you need the number of records (or some other aggregate function).
Loop through the results as an IEnumerable, which opens a connection and reads one record at a time each time MoveNext is called.
Call the ToList or ToArray extension methods if you do want to load everything into memory at once (I understand from your description that you don't).
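A sketch of what this looks like with EF (the context and entity names here, MyDbContext and Record, are assumptions, not from the question):

```csharp
using (var db = new MyDbContext())
{
    // Count() is translated to SELECT COUNT(*) and runs on the server.
    int total = db.Records.Count();

    // Enumerating the DbSet streams rows from an open data reader,
    // one record per MoveNext; AsNoTracking avoids caching each entity.
    foreach (var record in db.Records.AsNoTracking())
    {
        Process(record); // hypothetical per-record work
    }
}
```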
If you insist on using ADO.NET and doing this manually (I understand with legacy code there is not always a choice to use EF), then opening a data reader from the connection is the best approach. Each call to Read() fetches the next record, and this is the least expensive way to read records from the database.
If you want a count, I suggest you write a separate SQL query, executed on the database server, akin to
SELECT COUNT(field) FROM table
as this is best practice. Do not iterate over all the records from a reader and sum them up in memory as a workaround; that would waste resources, not to mention create complex code with no benefit.
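A minimal sketch of the reader-based approach, assuming a hypothetical Record type and table; the connection stays open only while the returned sequence is being enumerated:

```csharp
// Table and column names (dbo.Records, Id, Name) and the Record type
// are illustrative assumptions.
public IEnumerable<Record> StreamRecords(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("SELECT Id, Name FROM dbo.Records", connection))
    {
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                // One record is materialized per iteration; nothing is buffered.
                yield return new Record(reader.GetInt32(0), reader.GetString(1));
            }
        }
    }
}

// Count is a separate, server-side query -- never an in-memory sum.
public int CountRecords(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("SELECT COUNT(Id) FROM dbo.Records", connection))
    {
        connection.Open();
        return (int)command.ExecuteScalar();
    }
}
```

Because the method is an iterator, disposing the enumerator (e.g. when a foreach completes or breaks) runs the using blocks and closes the connection.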
For the count, query the database and return the result to the caller.
On the other hand, you only need to implement Count for ICollection; IEnumerable does not require it. Just return an IEnumerable for iterating over the records.
Just make sure you handle the database connection correctly.
Related
I got a function like this
public List<Code> GetCodes()
{
return _db.Codes.Select(p => p).ToList();
}
I have 160,000 records in this table which contains 2 columns.
Then I assign the result to a variable:
List<Code> CodeLastDigits = CodeDataProvider.GetCodes();
And then loop through it:
foreach (var codeLastDigit in CodeLastDigits)
I am just trying to understand how many times a call is made to the database to retrieve those records, and want to make sure it only happens once.
LINQ will delay the call to the database until you give it a reason to go.
In your case, you are giving it a reason with the ToList() method.
Once you call ToList() you will have all the items in memory and won't be hitting the database again.
You didn't mention which DB platform you are using, but if it is SQL Server, you can use SQL Server Profiler to watch your queries to the database. This is a good way to see how many calls are made and what SQL is being run by LINQ to SQL. As @acfrancis notes below, LINQPad can also do this.
For SQL Server, here is a good tutorial on creating a trace
When you call ToList(), it's going to hit the database. In this case, it appears like you'll just be hitting the database once to populate your CodeLastDigits. As long as you aren't hitting the database again in your foreach, you should be good.
As long as you have the full version of SQL Server, you can run SQL Server Profiler while stepping through your code to see what's happening on the database.
Probably once. But the more complete answer is that it depends, and you should be familiar with the case where even a simple access pattern like this one can result in many, many round trips to the database.
It's not likely a table named codes contains any complex types, but if it did, you'd want to watch out. Depending on how you access the properties of a code object, you could incur extra hits to the database if you don't use LoadWith properly.
Consider an example where you have a Code object, which contains a CodeType object (also mapped to a table) in a class structure like this:
class Code {
    CodeType type;
}
If you don't load CodeType objects with Code objects, Linq to SQL would contact the database on every iteration of your loop if CodeType is referenced inside the loop, because it would lazy-load the object only when needed. Watch out for this, and do some research on the LoadWith<> method and its use to get comfortable that you're not running into this situation.
foreach (var x in PostCodeDigits) {
    Print(x.type);
}
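With LINQ to SQL, the fix is to configure eager loading via DataLoadOptions before querying (db is a hypothetical DataContext; the member names follow the illustrative classes above):

```csharp
// Tell LINQ to SQL to fetch CodeType rows in the same query as Code rows,
// instead of issuing one extra query per Code inside the loop.
var options = new DataLoadOptions();
options.LoadWith<Code>(c => c.type);
db.LoadOptions = options;   // must be set before the DataContext runs any query

// Now the loop can touch x.type without extra round trips.
foreach (var x in db.Codes)
{
    Console.WriteLine(x.type);
}
```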
If I run this against DLinq, for example:
var custq = data.StoreDBHelper.DataContext.Customers as IEnumerable<Data.SO.Customer>;
I thought it would not be much different from running:
var custq = data.StoreDBHelper.DataContext.Customers as IQueryable<Data.SO.Customer>;
since IQueryable inherits from IEnumerable.
But I discovered the following: if you call
custq.Sum()
with 'as IEnumerable', the program processes it as if you had called .ToList() first;
the program's memory usage rose to the same level as when I tried custq.ToList().Sum(),
but not with 'as IQueryable' (because the sum then runs on the SQL server), which did not affect the memory usage of the program.
My question is simply this: should you avoid 'as IEnumerable' with DLinq and use 'as IQueryable' as a general rule? I know that for standard iteration you get the same result with 'as IEnumerable' and 'as IQueryable'.
But is it only the aggregate functions and delegate expressions in Where clauses where there is a difference, or will you in general get better performance if you use 'as IQueryable'? (for standard iteration and filter functions on DLinq entities)
Thanks !
Well, it depends on what you want to do...
Casting it as IEnumerable returns an object you can enumerate... and nothing more.
So yes, if you call Count on an IEnumerable, you enumerate the list (so you actually execute your SELECT query) and count each iteration.
On the other hand, if you keep an IQueryable, you may enumerate it, but you can also compose database operations like Where, OrderBy, or Count onto it. This delays execution of the query and can modify it before it runs.
Calling OrderBy on an enumerable browses all results and orders them in memory. Calling OrderBy on a queryable simply adds ORDER BY to the end of your SQL and lets the database do the ordering.
In general, it is better to keep it as an IQueryable, yes... Unless you want to count them by actually browsing them (instead of doing a SELECT COUNT(*)...)
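To make the difference concrete (a sketch; db.Customers stands in for any LINQ to SQL or EF entity set):

```csharp
// Server-side: Count() is translated to SELECT COUNT(*); no rows transferred.
IQueryable<Customer> queryable = db.Customers;
int serverCount = queryable.Count();

// Client-side: every row is fetched and enumerated just to count them.
IEnumerable<Customer> enumerable = db.Customers.AsEnumerable();
int clientCount = enumerable.Count();
```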
When using .FromCache() on an IQueryable result set, should I additionally call .ToList(), or can I just return the IEnumerable<> returned by the materialized query with FromCache?
I am assuming you are using a derivative of the code from http://petemontgomery.wordpress.com/2008/08/07/caching-the-results-of-linq-queries/ . If you look at the FromCache implementation, you will see that query.ToList() is already called. This means the evaluated list is what is cached. So,
You do NOT need to call ToList().
That depends entirely on what you want to do with it. If you're just going to foreach over it once then you may as well just leave it as an IEnumerable. There's no need to build up a list just to discard it right away.
If you plan to iterate over it multiple times it's probably best to ToList it, so that you're not accessing the underlying IQueryable multiple times. You should also ToList it if it's possible for the underlying query to change over time and you don't want those changes to be reflected in your query.
If you are likely to not need to iterate all of the items (you may end up stopping after the first item, or half way, or something like that) then it's probably best to leave it as an IEnumerable to potentially avoid even fetching some amount of data in the first place.
If the method has no idea how it's going to be used, and it's just a helper method that will be used by not-yet-written code, then consider returning IEnumerable. The caller can call ToList on it if they have a compelling reason to turn it into a list.
For me, as a general rule, I leave such queries as IEnumerable unless I have some compelling reason to make it a List.
Let's say I have thousands of Customer records that I have to show on a web form. My CustomerEntity has 10 properties, so when I fetch the data using a DataReader and convert it into List<CustomerEntity>, I have to loop through the data twice.
Is the use of generics advisable in such a scenario? If so, what will my application's performance be?
For example:
The CustomerEntity class has CustomerId and CustomerName properties, and I'm getting 100 records from the Customer table.
To prepare the List I wrote the following code:
while (dr.Read())
{
    // create a new CustomerEntity object
    var custEntityObject = new CustomerEntity();

    // set each property of CustomerEntity from the corresponding reader column
    for (var index = 0; index < MyProperties.Count; index++)
    {
        MyProperties[index].SetValue(custEntityObject, dr.GetValue(index));
    }

    // add the object to the List<CustomerEntity>
    customers.Add(custEntityObject);
}
How can I avoid these two loops? Is there any other mechanism?
I'm not really sure how generics tie into data volume; they are unrelated concepts. It also isn't clear to me why this requires you to read everything twice. But yes: generics are fine when used in volume (why wouldn't they be?). Of course, the best way to find a problem is profiling (either server performance or bandwidth; perhaps more the latter in this case).
Of course the better approach is: don't show thousands of records on a web form; what is the user going to do with that? Use paging, searching, filtering, ajax, etc - every trick imaginable - but don't send thousands of records to the client.
Re the updated question: the loop for setting properties isn't necessarily bad; it is an entirely appropriate inner loop. Before doing anything, profile to see whether it is actually a problem. I suspect sheer bandwidth (between server and client, or server and database) is the bigger issue. If you can prove this loop is a problem, there are things you can do to optimise:
switch to using PropertyDescriptor (rather than PropertyInfo), and use HyperDescriptor to make it a lot faster
write code with DynamicMethod to do the job - requires some understanding of IL, but very fast
write a .NET 3.5 / LINQ Expression to do the same and use .Compile() - like the second point, but (IMO) a bit easier
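A sketch of that third option: build a typed setter once per property with an expression tree (using Expression.Call to invoke the setter method, which also works on .NET 3.5) and reuse the compiled delegate for every row. The names here are illustrative:

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

static Action<T, object> BuildSetter<T>(PropertyInfo property)
{
    // (target, value) => target.set_Property((PropertyType)value)
    var target = Expression.Parameter(typeof(T), "target");
    var value = Expression.Parameter(typeof(object), "value");
    var body = Expression.Call(
        target,
        property.GetSetMethod(),
        Expression.Convert(value, property.PropertyType));
    return Expression.Lambda<Action<T, object>>(body, target, value).Compile();
}
```

The delegate is built once per property and then invoked per row, avoiding the per-call overhead of PropertyInfo.SetValue in the inner loop.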
I can add examples for the first and third bullets; I don't really want to write an example for the second, simply because I wouldn't write that code myself that way any more (I'd use the 3rd option where available, else the 1st).
It is very difficult to say what the performance will be, but consider these things:
Generics provide type safety.
If you're going to display 10,000 records in the page, your application will probably be unusable. If records are being paged, consider returning only those records that are actually needed for the page you are on.
You shouldn't need to loop through the data twice. What are you doing with the data?
I'm getting into using the Repository Pattern for data access with the Entity Framework and LINQ as the underpinning of implementation of the non-Test Repository. Most samples I see return AsQueryable() when the call returns N records instead of List<T>. What is the advantage of doing this?
AsQueryable just creates a query: the instructions needed to get a list. You can make further changes to the query later, such as adding new Where clauses that get sent all the way down to the database level.
ToList returns an actual list with all the items in memory. If you add a new Where clause to it, you don't get the fast filtering the database provides; instead you get all the information in the list and then filter out what you don't need in the application.
So basically it comes down to waiting until the last possible moment before committing yourself.
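For example (an illustrative EF-style repository; the names are assumptions):

```csharp
// Nothing has hit the database yet; this is just a query definition.
IQueryable<Customer> query = repository.GetCustomers(); // returns AsQueryable()

// Composing extends the SQL that will eventually run...
query = query.Where(c => c.IsActive);

// ...and only enumeration executes it, with the filter applied server-side.
List<Customer> active = query.ToList();

// By contrast, materializing first pulls every row, then filters in memory:
List<Customer> all = repository.GetCustomers().ToList();
var activeInMemory = all.Where(c => c.IsActive).ToList();
```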
Returning IQueryable<T> has the advantage that execution is deferred until you actually start enumerating the result, and you can compose the query with other queries and still get server-side execution.
The problem is that you cannot control the lifetime of the database context in this method: you need an open context and must ensure that it stays open until the query is executed, and then ensure that the context is disposed. If you return the result as a List<T>, T[], or something similar, you lose deferred execution and server-side execution of composed queries, but you gain control over the lifetime of the database context.
What fits best, of course, depends on the actual requirements. It's another question without a single truth.
AsQueryable is an extension method for IEnumerable<T> that can do two things:
If the IEnumerable<T> already implements IQueryable<T>, it just casts, doing nothing.
Otherwise it creates a 'fake' IQueryable<T> (EnumerableQuery<T>) that implements every method by compiling the lambdas and calling the Enumerable extension methods.
So in most cases using AsQueryable is useless, unless you are forced to pass an IQueryable to a method and you have an IEnumerable instead; it's a hack.
NOTE: AsQueryable is a hack; IQueryable itself, of course, is not!
Returning IQueryable<T> will defer execution of the query until its results are actually used. Until then, you can also perform additional database query operations on the IQueryable<T>; on a List you're restricted to generally less-efficient in-memory operations.
IQueryable: This gives deferred execution (lazy loading): the query is evaluated only when it hits the database, and if you add any additional clause, the database is hit with the required query, filters included.
IEnumerable: The records are loaded into memory first, and the operation is then performed on them client-side.
So it depends on the use case: when paging over a huge number of records, use IQueryable<T>; for a short operation that doesn't create huge memory usage, IEnumerable is fine.
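For the paging case, keeping the query as IQueryable<T> lets Skip/Take be translated into SQL, so only one page of rows crosses the wire (a sketch; db, pageIndex, and pageSize are assumed names):

```csharp
// Translated to server-side paging (e.g. OFFSET ... FETCH on SQL Server).
var page = db.Customers
             .OrderBy(c => c.Id)              // paging needs a stable order
             .Skip(pageIndex * pageSize)
             .Take(pageSize)
             .ToList();                       // only pageSize rows materialized
```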