How to iterate large query

I know how to paginate, but pagination doesn't fit my requirements because the underlying query changes between one page request and the next. So I am looking for a simple way to iterate over the whole result set efficiently, one item at a time. Please take a look at the example below.
var urls = db.Websites.Select(s => s.Website)
.Except(db.OldWebsites.Select(s => s.Website));
foreach (var url in urls)
{
//process items
}
I just want to know whether this really iterates over the whole result set efficiently. I am not sure that it loads rows one at a time rather than loading all of the results into memory.
Can someone verify this or suggest a better solution?

Yes, Entity Framework streams results by default instead of buffering them. Calling the AsStreaming method as below gives the warning: "Queries are now streaming by default unless a retrying ExecutionStrategy is used."
foreach (var item in db.Websites.AsStreaming()) { }
You just need to be careful that the DbContext doesn't hold references to the iterated results. Anonymous types and primitive results are not tracked anyway, so you only need to call AsNoTracking for entity results, like
db.Websites.AsNoTracking()

Related

Is there a difference in entity framework order?

I'm running into some speed issues in my project, and it seems like the primary cause is calls to the database using Entity Framework. Every time I call the database, it is always done as
database.Include(...).Where(...)
and I'm wondering if that is different than
database.Where(...).Include(...)?
My thinking is that the first way includes everything for all the elements in the target table, then filters out the ones I want, while the second one filters out the ones I want, then only includes everything for those. I don't fully understand entity framework, so is my thinking correct?

Entity Framework delays its querying as long as it can, up until the point where your code starts working on the data. To illustrate:
var query = db.People
.Include(p => p.Cars)
.Where(p => p.Employer.Name == "Globodyne")
.Select(p => p.Employer.Founder.Cars);
With all these chained calls, EF has not yet called the database. Instead, it has kept track of what you're trying to fetch, and it knows what query to run if you start working with the data. If you never do anything else with query after this point, then you will never hit the database.
However, if you do any of the following:
var result = query.ToList();
var firstCar = query.FirstOrDefault();
var founderHasCars = query.Any();
Now EF is forced to look at the database, because it cannot answer your question unless it actually fetches the data. Only at this point, and not before, does EF actually hit the database.
For reference, this trigger to fetch the data is often referred to as "enumerating the collection", i.e. turning a query into an actual result set.
By deferring the execution of that query for as long as possible, EF is able to wait and see if you're going to filter/order/paginate/transform/... the result set, which could lead to EF needing to return less data than when it executes every command immediately.
This also means that when you call Include, you're not actually hitting the database, so you're not going to load data for items that will later be filtered out by your Where clause, as long as you haven't enumerated the collection yet.
Take these two examples:
var list1 = db.People
.Include(p => p.Cars)
.ToList() // <= enumeration
.Where(p => p.Name == "Bob");
var list2 = db.People
.Include(p => p.Cars)
.Where(p => p.Name == "Bob")
.ToList(); // <= enumeration
These lists will eventually yield the same result. However, the first list will fetch data before you filter it because you called ToList before Where. This means you're going to be loading all people and their cars in memory, only to then filter that list in memory.
The second list, however, will only enumerate the collection when it already knows about the Where clause, and therefore EF will only load people named Bob and their cars into memory. The filtering will happen on the database before it gets sent back to your runtime.
You did not show enough code for me to verify whether you are prematurely enumerating the collection. I hope this answer helps you in determining whether this is the cause of your performance issues.
database.Include(...).Where(...) and I'm wondering if that is different than database.Where(...).Include(...)?
Assuming this code is verbatim (except for the missing db set) and there is nothing happening in between the Include and the Where, the order does not change the execution and therefore it is not the source of your performance issue.
I generally advise you to put your Include statements before anything else (i.e. right after db.MyTable), as a matter of readability. The order of the other operations depends on the specific query you're trying to construct.
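The deferred behavior described in this answer can be observed without a database. The following is a minimal sketch that uses LINQ to Objects rather than Entity Framework, so it only illustrates when enumeration happens, not the SQL that EF would generate. The Source helper and the produced counter are hypothetical names introduced for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int produced = 0; // counts how many items the source actually handed out

// Hypothetical stand-in for a table: yields items lazily and counts them.
IEnumerable<string> Source()
{
    foreach (var name in new[] { "Alice", "Bob", "Carol", "Bob" })
    {
        produced++;
        yield return name;
    }
}

// Building the query does not enumerate the source.
var query = Source().Where(n => n == "Bob");
int producedAfterBuilding = produced;

// ToList() enumerates the query: only now is the source read.
var bobs = query.ToList();
int producedAfterToList = produced;

Console.WriteLine($"after building: {producedAfterBuilding}, after ToList: {producedAfterToList}, matches: {bobs.Count}");
// prints "after building: 0, after ToList: 4, matches: 2"
```

The counter stays at zero until ToList() is called, which mirrors the point that chaining Include and Where only describes the query.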

Most of the time, the order of the clauses will not make any difference.
An Include statement tells EF to join one table with another in SQL,
while a Where results in, yes, a SQL WHERE.
When you write something like database.Include(...).Where(...), you are building an IQueryable object that is translated to SQL once you try to access it, for example with .ToList() or .FirstOrDefault(), and those queries are already optimized.
So if you still have performance issues, you should use a profiler to look for bottlenecks and maybe consider using stored procedures (these can be integrated with EF).

Does foreach loop work more slowly when used with a not stored list or array?

I am wondering whether a foreach loop runs more slowly when it iterates over a list or array that has not been stored in a variable first.
I mean something like this:
foreach (int number in list.OrderBy(x => x.Value))
{
// DoSomething();
}
Does the loop in this code recalculate the sort on every iteration or not?
The loop using stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (int number in list)
{
// DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?

This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, and so there are plenty of cases where it doesn't hold. Nevertheless, the better first instinct is to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (int number in list)
{
// DoSomething();
}
vs this option:
foreach (int number in list.OrderBy(x => x.Value))
{
// DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. According to its documentation, it returns an IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is done once, when you first start iterating over the object (and that, I believe, is the crux of your question: no, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same ordering problem to solve, and they solve it the same way, meaning they take the same amount of time to reach that point in the code.
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also need to create new memory allocations for the full objects, if you were previously reading from a stream or database reader that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here: that you are working with something that is not already a list. What I mean by this is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
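The claim that the sort is performed once per enumeration, and not once per loop iteration, can be checked by counting how often the key selector runs. This is a small sketch over an in-memory array. The keyCalls counter and the sample values are assumptions made for the illustration, not part of the original question.

```csharp
using System;
using System.Linq;

int keyCalls = 0; // how many times the key selector ran
var values = new[] { 5, 3, 8, 1 };

// Deferred: nothing has been sorted yet.
var ordered = values.OrderBy(v => { keyCalls++; return v; });
int callsBeforeLoop = keyCalls;

int iterations = 0;
foreach (var v in ordered)
{
    iterations++; // the loop body does not trigger any further sorting
}
int callsAfterLoop = keyCalls;

Console.WriteLine($"before loop: {callsBeforeLoop}, iterations: {iterations}, key selector calls: {callsAfterLoop}");
```

The key selector runs once per element for the whole loop. If it ran again on every iteration, the count would grow with the square of the number of elements.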

Yes and no.
Yes, the foreach statement will seem to run slower.
No, your program has the same total amount of work to do, so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList() or .ToArray(). In this case you are only using it once (in the foreach), but it is an easy thing to miss.
Edit: Just to be clear, the as statement in the question will not work as intended, but my answer assumes there is no .ToList() after OrderBy.

This line won't work as intended:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, iterating the stored list (the second option) is faster once the list exists, because the sort has already been performed by ToList() and is not repeated when you enumerate the list.
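The difference between the as cast and ToList() can be reproduced with plain integers. This is a minimal sketch in which an int array stands in for the Tour collection from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tours = new[] { 30, 10, 20 };

// OrderBy returns an IOrderedEnumerable<int>, not a List<int>,
// so the 'as' cast fails and produces null instead of throwing.
var viaCast = tours.OrderBy(x => x) as List<int>;

// ToList() enumerates the sorted sequence and copies it into a new list.
var viaToList = tours.OrderBy(x => x).ToList();

Console.WriteLine(viaCast == null);             // True
Console.WriteLine(string.Join(",", viaToList)); // 10,20,30
```

A foreach over the null result would throw a NullReferenceException, which is why the original line cannot be used to store the sorted values.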

Understanding lazy loading optimization in C#

After reading a bit about how yield, foreach, LINQ deferred execution and iterators work in C#, I decided to try optimizing an attribute-based validation mechanism inside a small project. The result:
private IEnumerable<string> GetPropertyErrors(PropertyInfo property)
{
// where Entity is the current object instance
string propertyValue = property.GetValue(Entity)?.ToString();
foreach (var attribute in property.GetCustomAttributes().OfType<ValidationAttribute>())
{
if (!attribute.IsValid(propertyValue))
{
yield return $"Error: {property.Name} {attribute.ErrorMessage}";
}
}
}
// inside another method
foreach(string error in GetPropertyErrors(property))
{
// Some display/insert log operation
}
I find this slow, but that could also be due to reflection or to a large number of properties to process.
So my question is: is this optimal, or at least a good use of the lazy loading mechanism? Or am I missing something and just wasting tons of resources?
NOTE: The intention of the code itself is not important; my concern is the use of lazy loading in it.

Lazy loading is not something specific to C# or to Entity Framework. It's a common pattern that lets you defer loading some data. Deferring means not loading immediately. Some examples of when you need that:
Loading images in a (Word) document. A document may be big and can contain thousands of images. If you load all of them when the document is opened, it might take a long time. Nobody wants to sit and watch a document load for 30 seconds. The same approach is used in web browsers: resources are not sent with the body of the page, and the browser defers loading them.
Loading graphs of objects. These may be objects from a database, file system objects, etc. Loading the full graph might amount to loading the entire database content into memory. How long will that take? Is it efficient? No. If you are building a file system explorer, will you load info about every file in the system before you start using it? It's much faster to load info about the current directory only (and probably its direct children).
Lazy loading does not always mean deferring loading until you really need the data. Loading might occur in a background thread before you really need that data. E.g. you might never scroll to the bottom of a web page to see the footer image. Lazy loading only means deferring. And C# enumerators can help you with that. Consider getting the list of files in a directory:
string[] files = Directory.GetFiles("D:");
IEnumerable<string> filesEnumerator = Directory.EnumerateFiles("D:");
The first approach returns an array of files. It means the directory has to get all of its files and save their names to an array before you can get even the first file name. It's like loading all the images before you see the document.
The second approach uses an enumerator: it returns files one by one as you ask for the next file name. It means that the enumerator is returned immediately, without getting all the files and saving them to some collection. You can then process files one by one as you need them. Here, getting the list of files is deferred.
But you should be careful. If the underlying operation is not deferred, then returning an enumerator gives you no benefit. E.g.
public IEnumerable<string> EnumerateFiles(string path)
{
foreach(string file in Directory.GetFiles(path))
yield return file;
}
Here you use the GetFiles method, which fills an array of file names before returning them. So yielding the files one by one gives you no speed benefit.
By the way, in your case you have exactly the same problem: the GetCustomAttributes extension internally uses the Attribute.GetCustomAttributes method, which returns an array of attributes. So you will not reduce the time needed to get the first result.
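The point about wrapping an eager operation in an iterator can be made measurable with counters in place of real file system calls. In this sketch GetAllEager, WrappedEager and TrulyLazy are hypothetical stand-ins (roughly playing the roles of GetFiles, the wrapper above, and EnumerateFiles), and the "work" is just an incremented integer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int eagerWork = 0, lazyWork = 0;

// Hypothetical eager source: builds the whole array before returning.
string[] GetAllEager()
{
    var result = new string[1000];
    for (int i = 0; i < result.Length; i++) { eagerWork++; result[i] = $"file{i}"; }
    return result;
}

// Wrapping the eager call in an iterator: the array is still fully built
// on the first MoveNext, so nothing is gained for the first item.
IEnumerable<string> WrappedEager()
{
    foreach (var f in GetAllEager()) yield return f;
}

// Hypothetical truly lazy source: does the work item by item.
IEnumerable<string> TrulyLazy()
{
    for (int i = 0; i < 1000; i++) { lazyWork++; yield return $"file{i}"; }
}

var firstWrapped = WrappedEager().First();
var firstLazy = TrulyLazy().First();

Console.WriteLine($"wrapped eager: {eagerWork} units of work, truly lazy: {lazyWork} unit of work");
// prints "wrapped eager: 1000 units of work, truly lazy: 1 unit of work"
```

Both calls return the same first item, but only the truly lazy source avoids doing the work for the remaining 999 items.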

This isn't quite how the term "lazy loading" is generally used in .NET. "Lazy loading" is most often used for something like:
public SomeType SomeValue
{
get
{
if (_backingField == null)
_backingField = RelativelyLengthyCalculationOrRetrieval();
return _backingField;
}
}
As opposed to just having _backingField set when an instance was constructed. Its advantage is that it costs nothing in the cases when SomeValue is never accessed, at the expense of a slightly greater cost when it is. It's therefore advantageous when the chances of SomeValue not being called are relatively high, and generally disadvantageous otherwise with some exceptions (when we might care about how quickly things are done in between instance creation and the first call to SomeValue).
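The same pattern is available ready-made in the framework as System.Lazy<T>, which is not mentioned in the answer but expresses the idea without a hand-written null check. The sketch below assumes a trivial factory that stands in for the lengthy calculation or retrieval, and the calculations counter exists only to show when the factory runs.

```csharp
using System;

int calculations = 0;

// Lazy<T> runs the factory on the first access to .Value and caches the result.
var lazyValue = new Lazy<string>(() =>
{
    calculations++; // stands in for a lengthy calculation or retrieval
    return "expensive result";
});

int beforeAccess = calculations; // nothing has been computed yet
string first = lazyValue.Value;  // the factory runs here
string second = lazyValue.Value; // the cached value is reused

Console.WriteLine($"before access: {beforeAccess}, after two reads: {calculations}");
// prints "before access: 0, after two reads: 1"
```

Unlike the null check on a backing field, this also works when null is a legitimate computed value, since Lazy<T> tracks whether the value has been created separately from the value itself.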
Here we have deferred execution. It's similar, but not quite the same. When you call GetPropertyErrors(property) rather than receiving a collection of all of the errors you receive an object that can find those errors when asked for them.
It will always save the time taken to get the first such item, because it allows you to act upon it immediately rather than waiting until it has finished processing.
It will always reduce memory use, because it isn't spending memory on a collection.
It will also save time in total, because no time is spent creating a collection.
However, if you need to access it more than once, then while a collection would still hold the same results, the deferred enumerable has to calculate them all again (unlike lazy loading, which loads its results once and stores them for subsequent reuse).
If you're rarely going to want to hit the same set of results, it's generally a win.
If you're almost always going to want to hit the same set of results, it's generally a loss.
If you are sometimes going to want to hit the same set of results though, you can pass the decision on whether to cache or not up to the caller, with a single use calling GetPropertyErrors() and acting on the results directly, but a repeated use calling ToList() on that and then acting repeatedly on that list.
As such, the approach of not sending a list is the more flexible, allowing the calling code to decide which approach is the more efficient for its particular use of it.
You could also combine it with lazy loading:
private IEnumerable<string> LazyLoadedEnumerator()
{
if (_store == null)
return StoringCalculatingEnumerator();
return _store;
}
private IEnumerable<string> StoringCalculatingEnumerator()
{
List<string> store = new List<string>();
foreach(string str in SomethingThatCalculatesTheseStrings())
{
yield return str;
store.Add(str);
}
_store = store;
}
This combination is rarely useful in practice though.
As a rule, start with deferred evaluation as the normal approach and decide further up the call chain whether to store the results or not. An exception though is if you can know the size of the results before you begin (you can't here because you don't know if an element will be added or not until you've examined the property). In this case there is the possibility of a performance improvement in just how you create that list, because you can set its capacity ahead of time. This though is a micro-optimisation that is only applicable if you also know that you'll also always want to work on a list and doesn't save that much in the grand scheme of things.
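The trade-off between acting on the deferred sequence directly and letting the caller cache it with ToList() can be sketched as follows. GetErrors here is a simplified, hypothetical validator over integers (not the reflection-based method from the question), and the checks counter only makes the repeated work visible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int checks = 0; // how many validations were executed

// Hypothetical validator: yields an error for every negative value.
IEnumerable<string> GetErrors(int[] values)
{
    foreach (var v in values)
    {
        checks++;
        if (v < 0) yield return $"Error: {v} is negative";
    }
}

var data = new[] { 1, -2, 3, -4 };

// Deferred sequence used twice: every enumeration repeats the checks.
var deferred = GetErrors(data);
int count1 = deferred.Count();
int count2 = deferred.Count();
int checksDeferred = checks;

// Caller decides to cache: the checks run once, the list is reused.
checks = 0;
var cached = GetErrors(data).ToList();
int count3 = cached.Count;
int count4 = cached.Count;
int checksCached = checks;

Console.WriteLine($"deferred twice: {checksDeferred} checks, cached: {checksCached} checks");
// prints "deferred twice: 8 checks, cached: 4 checks"
```

Because the method itself returns a deferred sequence, a caller that needs the results only once pays nothing for a list, while a caller that needs them repeatedly can opt in to caching.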

Get the "indexOf" of an ICollection in Entity Framework

I have two objects that are connected in such a way that ObjectA contains an ICollection of ObjectB. I would like to be able to determine the indexOf an ObjectB inside the ICollection<ObjectB> that is stored in ObjectA.
One of the solutions was to convert the ICollection to a list and then use the built-in IndexOf. However, when multiple threads access the ICollection, in many cases I get the same indexOf value for multiple ObjectBs.
Question: Is there any way to have a certain field (of type int) that stores the index of ObjectB inside the ICollection? If not, is there any way to ensure that indexOf gives the right index when multiple threads attempt to access it (i.e. regardless of the thread)?
Possible solutions: I've tried using a new context for each lookup, as well as GetDatabaseValues() and Reload(). This has worked better (especially in Debug mode), but when debug mode is turned off, the same indexOf value is given to multiple ObjectBs.
Edit
I tried to add an OrderBy statement, but it seems like none of the approaches work.
// var objectB = new ObjectB();
using(var context = new ContextDb())
{
var objectA = context.ObjectAs.Single(x => x.Id == 1);
objectA.objectBs.Add(objectB);
context.SaveChanges();
context.Entry(objectB).Reload();
context.Entry(objectA).Reload();
var list = objectA.objectBs.Select(x => x.Id).OrderBy(x => x).ToList(); // order by primary key.
sb.AppendLine( string.Join(",", list.ToArray())); // for testing
objectB.LocalId= list.IndexOf(objectB.Id) + 1; // the "local id"
context.SaveChanges();
}
The result is quite strange, although I seem to be able to see a pattern. Note, the code above is in a for loop that runs a certain amount of times. During the first iteration (first line in the string builder) gives the following:
2932 2932,2933 2932,2933,2934 2932,2933,2934,2935,2936 2932,2933,2934,2935,2936,2937,2938 2932,2933,2934,2935,2936,2937,2938,2939,2940 2932,2933,2934,2935,2936,2937,2938,2939,2940,2941,2942
The second line:
2932,2933,2934,2935,2936,2937,2938,2939,2940,2941,2942,2943,2944,2945 2932,2933,2934,2935,2936,2937,2938,2939,2940,2941,2942,2943,2944,2945,2946,2947
The last line:
2932,2933,2934,2935,2936,2937,2938,2939,2940,2941,2942,2943,2944,2945,2946,2947,2948,2949,2950,2951,2952,2953,2954,2955,2956,2957,2958,2959,2960,2961,2962,2963,2964,2965,2966,2967,2968,2969,2970
Does anyone know if there is a built-in way in Entity Framework to avoid these duplicates?

Honestly, I'm still not entirely clear on what you're looking for. However, there are a couple of things I can say based on your expanded question. First, when you pull anything from a database, there's no inherent order. By default, it'll generally be ordered by PK, if possible, or more accurately by "insert order". However, that's risky to rely on if an exact order is necessary. If you need a truly exact and reproducible order, then you need to issue an ORDER BY clause with the order you want.
Especially if you are relying on navigation properties filled by Entity Framework through either eager or lazy loading of a foreign key, you can't rely on the implicit order at all. Again, if order is important, then you need to use OrderBy or OrderByDescending with some property on the entity to make sure that you get a true apples-to-apples order comparison.

Use an IDictionary, in which you can store the index as well as the object.

new objects added during long loop

We currently have a production application that runs as a windows service. Many times this application will end up in a loop that can take several hours to complete. We are using Entity Framework for .net 4.0 for our data access.
I'm looking for confirmation that if we load new data into the system, after this loop is initialized, it will not result in items being added to the loop itself. When the loop is initialized we are looking for data "as of" that moment. Although I'm relatively certain that this will work exactly like using ADO and doing a loop on the data (the loop only cycles through data that was present at the time of initialization), I am looking for confirmation for co-workers.
Thanks in advance for your help.
//update : here's some sample code in c# - question is the same, will the enumeration change if new items are added to the table that EF is querying?
IEnumerable<myobject> myobjects = (from o in db.theobjects where o.id==myID select o);
foreach (myobject obj in myobjects)
{
//perform action on obj here
}

It depends on your precise implementation.
Once a query has been executed against the database then the results of the query will not change (assuming you aren't using lazy loading). To ensure this you can dispose of the context after retrieving query results--this effectively "cuts the cord" between the retrieved data and that database.
Lazy loading can result in a mix of "initial" and "new" data; however once the data has been retrieved it will become a fixed snapshot and not susceptible to updates.
You mention this is a long running process; which implies that there may be a very large amount of data involved. If you aren't able to fully retrieve all data to be processed (due to memory limitations, or other bottlenecks) then you likely can't ensure that you are working against the original data. The results are not fixed until a query is executed, and any updates prior to query execution will appear in results.

I think your best bet is to change the logic of your application so that, when the "loop" logic is determining whether it should do another iteration or exit, you take the opportunity to load the newly added items into the list. See the pseudocode below:
var repo = new Repository();
while (repo.HasMoreItemsToProcess())
{
var entity = repo.GetNextItem();
}
Let me know if this makes sense.

The easiest way to ensure that this happens, if the data itself isn't too big, is to convert the data you retrieve from the database to a List<>, e.g. something like this (pulled at random from my current project):
var sessionIds = room.Sessions.Select(s => s.SessionId).ToList();
And then iterate through the list, not through the IEnumerable<> that would otherwise be returned. Converting it to a list triggers the enumeration, and then throws all the results into memory.
If there's too much data to fit into memory, and you need to stick with an IEnumerable<>, then the answer to your question depends on various database and connection settings.

I'd take a snapshot of ID's to be processed -- quickly and as a transaction -- then work that list in the fashion you're doing today.
In addition to accomplishing the goal of not changing the sample mid-stream, this also gives you the ability to extend your solution to track the status of each item as it's processed. For a long-running process, this can be very helpful for progress reporting, restart/retry capabilities, etc.
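The snapshot idea can be illustrated with an in-memory list standing in for the table. This is only a sketch under that assumption: it shows that a materialized list of IDs is unaffected by later additions to the source, and it says nothing about the transaction or isolation settings a real database would need. The names table, snapshot and processed are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for the table being queried.
var table = new List<int> { 1, 2, 3 };

// Snapshot the IDs up front; ToList() copies them out of the live source.
List<int> snapshot = table.Where(id => id > 0).ToList();

var processed = new List<int>();
foreach (var id in snapshot)
{
    // New rows arrive while the loop is running...
    table.Add(id + 100);
    // ...but the loop only ever sees the snapshot.
    processed.Add(id);
}

Console.WriteLine($"processed: {string.Join(",", processed)}; table now holds {table.Count} rows");
// prints "processed: 1,2,3; table now holds 6 rows"
```

Keeping the snapshot as its own list is also what makes per-item status tracking possible, since the set of work items is fixed once the loop begins.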
