Your preference for 'materializing' IEnumerables? - c#

Sometimes it is necessary to actually 'evaluate' an IEnumerable in the middle of a method, because it is used in multiple queries and the compiler issues a warning ("Possible multiple enumeration of IEnumerable"):
var skippedIds = objects.Where(x => x.State == "skip")
                        .Select(x => new { x.Id, x.FundId, x.Name })
                        .Distinct();
var skippedLookup = skippedIds.ToLookup(x => x.FundId, x => new { x.Id, x.Name });
if (skippedIds.Any()) // compiler warning
{
    ...
    // other iterations over skippedIds, etc.
}
I used to do:
var skippedIds = objects.Where(x => x.State == "skip")
                        .Select(x => new { x.Id, x.FundId, x.Name })
                        .Distinct()
                        .ToList();
...
but would like to know if there are better options. The code above creates a List<T> object on the heap, which I guess is an unnecessary GC burden in the context of a temporary variable that dies within the method.
I am now using ToImmutableArray(), which comes with the System.Collections.Immutable library. Not only does this create a stack-allocated object (not true, thanks commenters), it also attaches 'immutable' semantics to my code, which I guess is good functional-style practice.
But what are the performance implications? What is the preferable way of 'materializing' temporary subquery results that are used in multiple places locally within a method?

The performance implications of materialising it in memory are:
The initial grab of all items from the database - if you're not going to be using all of the items, then you could be taking more than you need.
Depending on the structure you use, you could have insertion costs - ToImmutableArray() will be about as quick as ToArray(), because ImmutableArray just wraps the built-in array type and removes the mutation option.
GC burden is less of a concern if you're throwing the object away quickly: it's very unlikely the object will be promoted from Gen 0 to Gen 1, so it will be collected without much cost. But obviously the more big objects you allocate, the more likely it is that a collection is triggered.
You could use the Seq<A> type from language-ext (disclosure: I'm the author), which is designed to be a 'better enumerable' in that it will only ever consume each item of an IEnumerable<A> once, while remaining lazy like IEnumerable<A>.
So, you could do this:
var skippedIds = objects.Where(x => x.State == "skip")
                        .Select(x => new { x.Id, x.FundId, x.Name })
                        .Distinct()
                        .ToSeq();
Obviously there's nothing for free in this world, and the costs of Seq<A> are:
An allocation per item consumed (it memoises the items you've read so you don't read them again). But they're tiny objects that hold just two references each, and so cause very little GC pressure.
Holding open the connection to the database longer than you possibly need, which could cause other performance issues with your db: deadlocks, etc.
But the benefits are that you only consume what you need and you consume it once. Personally I would look to limit your query and use ToImmutableArray(); taking no more than you need from the db will always be the preferred approach.

In this specific case, the issue is that you have materialised the results (in the form of a Lookup), but then refer to the unmaterialised results.
var skippedIds = objects.Where(x => x.State == "skip")
                        .Select(x => new { x.Id, x.FundId, x.Name })
                        .Distinct();
var skippedLookup = skippedIds.ToLookup(x => x.FundId, x => new { x.Id, x.Name });
if (skippedIds.Any()) // compiler warning
In the above code, skippedIds is not materialised, but skippedLookup is. As such, you may consider changing:
if (skippedIds.Any()) // compiler warning
to:
if (skippedLookup.Any()) // no compiler warning
If we take the more general case, some additional guidance:
Consider the performance cost of multiple enumeration (e.g. hitting the database twice) vs materialising (e.g. RAM usage) - which is best depends on the context.
Consider using ToList or ToImmutableArray for materialising (both appear to perform well).
Consider whether any of the LINQ operations can be removed from the code without an impact on overall functionality. A common mistake is to use Any then foreach - in many cases the Any can be removed since the foreach will automatically do nothing if the enumerable is empty.
If the IEnumerable is using LINQ to Objects and you are performing a Distinct followed by a materialising operation (e.g. ToList), then instead use new HashSet<YourTypeHere>(YourEnumerableHere). It will perform the Distinct and the materialisation in one hit (see the sketch after this list).
When materialising using ToList consider exposing the resulting List as IReadOnlyList to indicate to the consumers that it is not designed to be altered.
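To make the HashSet point above concrete, here is a minimal LINQ to Objects sketch; the input array is a made-up stand-in for the question's data:
using System;
using System.Collections.Generic;
using System.Linq;

class DistinctDemo
{
    static void Main()
    {
        // Hypothetical input standing in for the de-duplicated ids in the question.
        int[] skippedIds = { 1, 2, 2, 3, 3, 3 };

        // Two steps: Distinct() builds its own set internally,
        // then ToList() copies the surviving items into a list.
        List<int> viaLinq = skippedIds.Distinct().ToList();

        // One step: the HashSet de-duplicates as it materialises.
        var viaHashSet = new HashSet<int>(skippedIds);

        Console.WriteLine(viaLinq.Count);    // 3
        Console.WriteLine(viaHashSet.Count); // 3
    }
}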
In practical terms, it rarely matters much which approach you choose. There will be some GC overhead from the List and its underlying array, sure. But in the broader context of the overall GC load (e.g. of the objects that the List contains) it is unlikely to be an issue. If the list gets big enough then the Large Object Heap can be involved, which isn't optimal. But honestly, let the GC do its job. If there is a problem, optimise then, but not before.

Related

Is there a difference in entity framework order?

I'm running into some speed issues in my project and it seems like the primary cause is the calls to the database using Entity Framework. Every time I call the database, it is always done as
database.Include(...).Where(...)
and I'm wondering if that is different than
database.Where(...).Include(...)?
My thinking is that the first way includes everything for all the elements in the target table and then filters down to the ones I want, while the second one filters down to the ones I want first and then only includes everything for those. I don't fully understand Entity Framework, so is my thinking correct?
Entity Framework delays its querying as long as it can, up until the point where your code starts working on the data. To demonstrate:
var query = db.People
              .Include(p => p.Cars)
              .Where(p => p.Employer.Name == "Globodyne")
              .Select(p => p.Employer.Founder.Cars);
With all these chained calls, EF has not yet called the database. Instead, it has kept track of what you're trying to fetch, and it knows what query to run if you start working with the data. If you never do anything else with query after this point, then you will never hit the database.
However, if you do any of the following:
var result = query.ToList();
var firstCar = query.FirstOrDefault();
var founderHasCars = query.Any();
Now EF is forced to look at the database, because it cannot answer your question unless it actually fetches the data. Only at this point, and not before, does EF actually hit the database.
For reference, this trigger to fetch the data is often referred to as "enumerating the collection", i.e. turning a query into an actual result set.
By deferring the execution of that query for as long as possible, EF is able to wait and see if you're going to filter/order/paginate/transform/... the result set, which could lead to EF needing to return less data than when it executes every command immediately.
This also means that when you call Include, you're not actually hitting the database yet, so as long as you haven't enumerated the collection, you're not going to be loading data for items that will later be filtered out by your Where clause.
Take these two examples:
var list1 = db.People
              .Include(p => p.Cars)
              .ToList() // <= enumeration
              .Where(p => p.Name == "Bob");
var list2 = db.People
              .Include(p => p.Cars)
              .Where(p => p.Name == "Bob")
              .ToList(); // <= enumeration
These lists will eventually yield the same result. However, the first list will fetch data before you filter it because you called ToList before Where. This means you're going to be loading all people and their cars in memory, only to then filter that list in memory.
The second list, however, will only enumerate the collection when it already knows about the Where clause, and therefore EF will only load people named Bob and their cars into memory. The filtering will happen on the database before it gets sent back to your runtime.
You did not show enough code for me to verify whether you are prematurely enumerating the collection. I hope this answer helps you in determining whether this is the cause of your performance issues.
database.Include(...).Where(...) and I'm wondering if that is different than database.Where(...).Include(...)?
Assuming this code is verbatim (except the missing db set) and there is nothing happening in between the Include and the Where, the order does not change the execution and therefore it is not the source of your performance issue.
I generally advise you to put your Include statements before anything else (i.e. right after db.MyTable), as a matter of readability. Where the other operations go depends on the specific query you're trying to construct.
Most of the time the order of clauses will not make any difference.
An Include statement tells SQL to join one table with another,
while Where results in... yes, a SQL WHERE.
When you do something like database.Include(...).Where(...) you are building an IQueryable object that will be translated to SQL only once you try to access it, e.g. with .ToList() or .FirstOrDefault(), and those queries are already optimized.
So if you still have performance issues, you should use a profiler to look for bottlenecks and maybe consider using stored procedures (those can be integrated with EF).
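If you want to convince yourself that the two orderings translate to the same SQL, EF Core 5 and later can print the generated query without executing it. This is only a sketch: MyDbContext, the People set and the Cars navigation property are placeholders, not code from the question.
// Sketch only: assumes EF Core 5+ (which provides ToQueryString) and a
// hypothetical MyDbContext with a People DbSet and a Cars navigation property.
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

using var db = new MyDbContext();

var includeFirst = db.People
                     .Include(p => p.Cars)
                     .Where(p => p.Name == "Bob");

var whereFirst = db.People
                   .Where(p => p.Name == "Bob")
                   .Include(p => p.Cars);

// Neither call below executes the query; they only print the SQL EF would run,
// so you can compare the two orderings directly.
Console.WriteLine(includeFirst.ToQueryString());
Console.WriteLine(whereFirst.ToQueryString());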

Does a foreach loop work more slowly when used with an unstored list or array?

I am wondering whether a foreach loop works more slowly when the list or array it iterates over is not stored in a variable first.
I mean like this:
foreach (var tour in tours.OrderBy(x => x.Value))
{
    // DoSomething();
}
Does the loop in this code recalculate the sort on every iteration, or not?
The loop using a stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (var tour in list)
{
    // DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, and so there are plenty of cases where it doesn't hold. Nevertheless, the better instinct is to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (var tour in list)
{
    // DoSomething();
}
vs this option:
foreach (var tour in tours.OrderBy(x => x.Value))
{
    // DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the linked documentation, you'll see it returns a IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is already finished when you first start iterating over the object (and that, I believe, is the crux of your question: No, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same problem to solve for ordering the results, and they accomplish it the same way, meaning they take exactly the same amount of time to reach that point in the code.
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also need to create new memory allocations for the full objects, if you were previously reading from a stream or database reader that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here: that you are working with something that is not already a list. What I mean by this is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
Yes and no.
Yes, the foreach statement will seem to work more slowly.
No, your program has the same total amount of work to do, so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList or .ToArray. In this case you are only using it once (in the foreach), but it is an easy thing to miss.
Edit: just to be clear, the as cast in the question will not work as intended, but my answer assumes no .ToList() after the OrderBy.
This line won't run:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, the second option (storing the results) will enumerate faster, since by the time the foreach runs the sorting has already been done.
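One way to see the "same total amount of work" point from the answers above in actual numbers is to count how often the OrderBy key selector runs. The figures below assume the standard LINQ to Objects implementation, where each enumeration of an un-materialised ordered query re-runs the sort:
using System;
using System.Collections.Generic;
using System.Linq;

class OrderByDemo
{
    static void Main()
    {
        var values = new List<int> { 3, 1, 2 };
        int keySelectorCalls = 0;

        // Not stored: every enumeration re-runs the sort.
        IEnumerable<int> lazyOrdered = values.OrderBy(x => { keySelectorCalls++; return x; });
        foreach (var item in lazyOrdered) { } // sorts here...
        foreach (var item in lazyOrdered) { } // ...and sorts again here
        Console.WriteLine(keySelectorCalls);  // 6: the key selector ran for every item, twice

        // Stored: the sort runs once, inside ToList().
        keySelectorCalls = 0;
        List<int> stored = values.OrderBy(x => { keySelectorCalls++; return x; }).ToList();
        foreach (var item in stored) { }
        foreach (var item in stored) { }
        Console.WriteLine(keySelectorCalls);  // 3: only the single sort during ToList()
    }
}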

C# laziness question

What's the common approach to design applications, which strongly rely on lazy evaluation in C# (LINQ, IEnumerable, IQueryable, ...)?
Right now I usually attempt to make every query as lazy as possible, using yield return and LINQ queries, but at runtime this can lead to "too lazy" behavior, where every query gets rebuilt from the beginning, obviously resulting in severe, visible performance degradation.
What I usually do is put ToList() calls somewhere to cache the data, but I suspect this approach might be incorrect.
What are the appropriate / common ways to design this sort of application from the very beginning?
I find it useful to classify each IEnumerable into one of three categories.
fast ones - e.g. lists and arrays
slow ones - e.g. database queries or heavy calculations
non-deterministic ones - e.g. list.Select(x => new { ... })
For category 1, I tend to keep the concrete type when appropriate: arrays, IList, etc.
For category 3, those are best kept local within a method, to avoid hard-to-find bugs.
Then we have category 2, and as always when optimizing performance, measure first to find the bottlenecks.
A few random thoughts - as the question itself is loosely defined:
Lazy is good only when the result might not be used and hence is loaded only when needed. Most operations, however, need the data to be loaded, so laziness does not help there.
Laziness can cause difficult bugs. We have seen it all with data contexts in ORMs.
Lazy is good when it comes to MEF.
Pretty broad question and unfortunately you're going to hear this a lot: It depends. Lazy-loading is great until it's not.
In general, if you're using the same IEnumerables over and over it might be best to cache them as lists.
But rarely does it make sense for your callers to know this either way. That is, if you're getting IEnumerables from a repository or something, it is best to let the repository do its job. It might cache it as a list internally or it might build it up every time. If your callers try to get too clever they might miss changes in the data, etc.
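As a sketch of that idea, assuming a recent C# version: the caller only sees an IReadOnlyList, and whether the repository caches internally or re-queries every call stays an implementation detail. All names here (IUserRepository, UserDTO, CachingUserRepository) are illustrative rather than taken from the question.
using System.Collections.Generic;

public record UserDTO(string Name);

public interface IUserRepository
{
    IReadOnlyList<UserDTO> GetUsers();
}

// Decorator that materialises once and reuses the list afterwards;
// callers cannot tell the difference.
public class CachingUserRepository : IUserRepository
{
    private readonly IUserRepository _inner;
    private IReadOnlyList<UserDTO>? _cache;

    public CachingUserRepository(IUserRepository inner) => _inner = inner;

    public IReadOnlyList<UserDTO> GetUsers() => _cache ??= _inner.GetUsers();
}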
I would suggest doing a ToList in your DAL before returning the DTOs:
public IList<UserDTO> GetUsers()
{
    using (var db = new DbContext())
    {
        return (from u in db.tblUsers
                select new UserDTO()
                {
                    Name = u.Name
                }).ToList();
    }
}
In the example above you have to do a ToList() before the DbContext scope ends.
If you need a certain sequence of data to be cached, call one of the conversion operators (ToList, ToArray, etc.) on that sequence. Otherwise just use lazy evaluation.
Build your code around your data. What data is volatile and needs to be pulled fresh each time? Use lazy evaluation and don't cache. What data is relatively static and only needs to be pulled once? Cache that data in memory so you don't pull it unnecessarily.
Deferred execution and caching all items with .ToList() are not the only options. The third option is to cache the items while you are iterating by using a lazy List.
The execution is still deferred, but each item is only yielded once. An example of how this works:
public class LazyListTest
{
    private int _count = 0;

    public void Test()
    {
        var numbers = Enumerable.Range(1, 40);
        var numbersQuery = numbers.Select(GetElement).ToLazyList(); // Cache lazily

        var total = numbersQuery.Take(3)
                                .Concat(numbersQuery.Take(10))
                                .Concat(numbersQuery.Take(3))
                                .Sum();

        Console.WriteLine(_count);
    }

    private int GetElement(int value)
    {
        _count++;
        // Some slow stuff here...
        return value * 100;
    }
}
If you run the Test() method, the _count is only 10. Without caching it would be 16 and with .ToList() it would be 40!
An example of the implementation of LazyList can be found here.
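Since that link may not survive, here is a minimal sketch of what such a memoising ToLazyList extension could look like. The actual LazyList implementation referenced above may differ; this version is deliberately simplistic.
using System.Collections;
using System.Collections.Generic;

public static class LazyListExtensions
{
    public static IEnumerable<T> ToLazyList<T>(this IEnumerable<T> source)
        => new LazyList<T>(source);

    // Not thread-safe and never disposes the source enumerator: good enough to
    // show the idea, not production code.
    private sealed class LazyList<T> : IEnumerable<T>
    {
        private readonly List<T> _cache = new List<T>();
        private readonly IEnumerator<T> _source;
        private bool _exhausted;

        public LazyList(IEnumerable<T> source) => _source = source.GetEnumerator();

        public IEnumerator<T> GetEnumerator()
        {
            int index = 0;
            while (true)
            {
                if (index < _cache.Count)
                {
                    // Already produced earlier: serve it from the cache.
                    yield return _cache[index++];
                }
                else if (!_exhausted)
                {
                    // First time anyone asks for this element: pull it once
                    // from the source and remember it.
                    if (_source.MoveNext())
                        _cache.Add(_source.Current);
                    else
                        _exhausted = true;
                }
                else
                {
                    yield break;
                }
            }
        }

        IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    }
}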

C# Unable to clear memory of large generic collection

I am putting 2 very large datasets into memory, performing a join to filter out a subset from the first collection, and then attempting to destroy the second collection, as it uses approximately 600MB of my system's RAM. The problem is that the code below is not working. After the code below runs, a foreach loop runs and takes about 15 minutes, and during this time the memory does NOT drop from 600MB+. Am I doing something wrong?
List<APPLES> tmpApples = dataContext.Apples.ToList();    // 100MB
List<ORANGES> tmpOranges = dataContext.Oranges.ToList(); // 600MB
List<APPLES> filteredApples = tmpApples
    .Join(tmpOranges, apples => apples.Id, oranges => oranges.Id, (apples, oranges) => apples)
    .ToList();
tmpOranges.Clear();
tmpOranges = null;
GC.Collect();
Note: I re-use tmpApples later, so I am not clearing it just now.
A few things to note:
Unless your dataContext can be cleared / garbage collected, that may well be retaining references to a lot of objects
Calling Clear() and then setting the variable to null is pointless, if you're really not doing anything else with the list. The GC can tell when you're not using a variable any more, in almost all cases.
Presumably you're judging how much memory the process has reserved; I don't think the CLR will actually return memory to the operating system, but the memory which has been freed by garbage collection will be available to further uses within the CLR. (EDIT: As per comments below, it's possible that the CLR frees areas of the Large Object Heap, but I don't know for sure.)
Clearing, nullifying and collecting hardly ever has any (positive) effect. The GC will automatically detect when objects are not referenced anymore. Furthermore, as long as the Join operation runs, both the tmpApples and tmpOranges collections are referenced, and with them all their objects; they can therefore not be collected.
A better solution would be to do the filter in the database:
// NOTE that I removed the ToList operations on the source tables
IQueryable<APPLES> tmpApples = dataContext.Apples;
IQueryable<ORANGES> tmpOranges = dataContext.Oranges;
List<APPLES> filteredApples = tmpApples
    .Join(tmpOranges, apples => apples.Id,
          oranges => oranges.Id, (apples, oranges) => apples)
    .ToList();
The reason this data is not collected is that, although you are clearing the collection (so the collection no longer holds references to the items), the DataContext keeps a reference to them, and this causes them to stay in memory.
You have to dispose your DataContext as soon as you are done.
UPDATE
OK, you have probably fallen victim to the large object issue.
Assuming this is a Large Object Heap issue, you could try not retrieving all apples at once but instead getting them in "packets". So instead of calling
List<APPLES> apples = dataContext.Apples.ToList();
try storing the apples in separate lists:
int packetSize = 100;
List<APPLES> applePacket1 = dataContext.Apples.Take(packetSize).ToList();
List<APPLES> applePacket2 = dataContext.Apples.Skip(packetSize).Take(packetSize).ToList();
Does that help?
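A more general form of the same idea, sketched against the snippet above: page through the table in a loop so that no single huge list is ever held. The batch size, the ordering by Id and the ProcessBatch call are all assumptions for illustration, not part of the original answer.
// Generalisation of the packet idea: process apples in fixed-size batches.
// Assumes a stable ordering key so Skip/Take paging is deterministic;
// ProcessBatch is a hypothetical placeholder for the per-batch work.
int batchSize = 1000;
int skipped = 0;
while (true)
{
    List<APPLES> batch = dataContext.Apples
                                    .OrderBy(a => a.Id)
                                    .Skip(skipped)
                                    .Take(batchSize)
                                    .ToList();
    if (batch.Count == 0)
        break;

    ProcessBatch(batch);
    skipped += batchSize;
}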
Use a profiler or SOS.dll to find out where your memory is going. If some operations take TOO much time, it sounds like you are swapping out to the page file.
EDIT: also keep in mind that a Debug build delays the collection of local variables that are no longer referenced, to make investigation easier.
The only thing you're doing wrong is explicitly calling the garbage collector. You don't need to do this (in fact you shouldn't), and as Steven says you don't need to do anything to the collections anyway; they'll just go away - eventually.
If your concern is the performance of the 15-minute foreach loop, perhaps it is that loop which you should post. It is probably not related to the memory usage.

Should I use a simple foreach or Linq when collecting data out of a collection

Take a simple case: class Foo has a member i, I have a collection of foos, say IEnumerable<Foo> foos, and I want to end up with a collection of each foo's member i, say List<TypeOfi> result.
Question: is it preferable to use a foreach (Option 1 below), some form of LINQ (Option 2 below), or some other method? Or, perhaps, is it not even worth concerning myself with (just choose my personal preference)?
Option 1:
foreach (Foo foo in foos)
    result.Add(foo.i);
Option 2:
result.AddRange(foos.Select(foo => foo.i));
To me, Option 2 looks cleaner, but I'm wondering if Linq is too heavy handed for something that can achieved with such a simple foreach loop.
Looking for all opinions and suggestions.
I prefer the second option over the first. However, unless there is a reason to pre-create the List<T> and use AddRange, I would avoid it. Personally, I would use:
List<TypeOfi> results = foos.Select(f => f.i).ToList();
In addition, I would not necessarily even use ToList() unless you actually need a true List<T>, or need to force the execution to be immediate instead of deferred. If you just need the collection of "i" values to iterate, I would simply use:
var results = foos.Select(f => f.i);
I definitely prefer the second. It is far more declarative and easier to understand (to me, at least).
LINQ is here to make our lives more declarative so I would hardly consider it heavy handed even in cases as seemingly "trivial" as this.
As Reed said, though, you could improve the quality by using:
var result = foos.Select(f => f.i).ToList();
As long as there is no data already in the result collection.
LINQ isn't heavy-handed in any way; both the foreach and the LINQ code do roughly the same thing, the foreach in the second case is just hidden away.
It really is just a matter of preference, at least concerning LINQ to Objects. If your source collection is a LINQ to Entities query or something different, it is a completely different case - the second option would push the query into the database, which is much more effective. In this simple case the difference probably won't be that much, but if you throw in a Where operator or others and make the query non-trivial, the LINQ query will most likely have better/faster performance.
I think you could also just do
foos.Select(foo => foo.i).ToList<TypeOfi>();
