C# laziness question

What's the common approach to designing applications that rely heavily on lazy evaluation in C# (LINQ, IEnumerable, IQueryable, ...)?
Right now I usually try to make every query as lazy as possible, using yield return and LINQ queries, but at runtime this can lead to "too lazy" behavior, where every query gets built from its beginning, obviously resulting in severe visible performance degradation.
What I usually do is put ToList() projection operators somewhere to cache the data, but I suspect this approach might be incorrect.
What are the appropriate/common ways to design this sort of application from the very beginning?

I find it useful to classify each IEnumerable into one of three categories:
1. Fast ones - e.g. lists and arrays
2. Slow ones - e.g. database queries or heavy calculations
3. Non-deterministic ones - e.g. list.Select(x => new { ... })
For category 1, I tend to keep the concrete type when appropriate - arrays, IList, etc.
For category 3, those are best kept local within a method, to avoid hard-to-find bugs (see the sketch below).
Then we have category 2, and as always when optimizing performance: measure first to find the bottlenecks.
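A minimal sketch of the category 3 pitfall (assuming the usual System, System.Collections.Generic, and System.Linq usings): every enumeration re-runs the projection and creates fresh anonymous-type instances, so identity is not stable across enumerations.
var source = new List<int> { 1, 2, 3 };
var query = source.Select(x => new { Value = x });

var first = query.First();
Console.WriteLine(ReferenceEquals(first, query.First())); // False - a new object per enumeration
Console.WriteLine(first.Equals(query.First()));           // True - anonymous types compare by value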

A few random thoughts, as the question itself is loosely defined:
Lazy is good only when the result might not be used, and hence is loaded only when needed. Most operations, however, need the data anyway, so laziness buys nothing there.
Laziness can cause difficult bugs. We have all seen it with data contexts in ORMs.
Lazy is good when it comes to MEF.
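For context, a minimal sketch of the Lazy<T> pattern that MEF's deferred imports build on: the factory runs once, on first access to .Value.
using System;

var expensive = new Lazy<string>(() =>
{
    Console.WriteLine("Computing...");
    return "result";
});

Console.WriteLine("Lazy created; nothing computed yet");
Console.WriteLine(expensive.Value); // "Computing..." prints here, exactly once
Console.WriteLine(expensive.Value); // cached; the factory is not invoked again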

Pretty broad question and unfortunately you're going to hear this a lot: It depends. Lazy-loading is great until it's not.
In general, if you're using the same IEnumerables over and over it might be best to cache them as lists.
But rarely does it make sense for your callers to know this either way. That is, if you're getting IEnumerables from a repository or something, it is best to let the repository do its job. It might cache the result as a list internally, or it might build it up every time. If your callers try to get too clever, they might miss changes in the data, etc.
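A sketch of that separation (User and the loader here are illustrative stand-ins, not from the question): callers only see IEnumerable<User>, and whether the repository caches internally is its own business.
using System.Collections.Generic;
using System.Linq;

public class User { public string Name; }

public class UserRepository
{
    private List<User> _cache;

    // Callers cannot tell (and should not care) whether this is cached.
    public IEnumerable<User> GetUsers() =>
        _cache ?? (_cache = LoadFromDatabase().ToList());

    private IEnumerable<User> LoadFromDatabase()
    {
        yield return new User { Name = "alice" }; // stand-in for a real query
    }
}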

I would suggest doing a ToList in your DAL before returning the DTOs:
public IList<UserDTO> GetUsers()
{
    using (var db = new DbContext())
    {
        return (from u in db.tblUsers
                select new UserDTO()
                {
                    Name = u.Name
                }).ToList();
    }
}
In the example above, you have to call ToList() before the DbContext goes out of scope; deferred execution would otherwise run the query later, against an already-disposed context.
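For contrast, a sketch of the failure mode being avoided (same DbContext/UserDTO as above): returning the deferred query lets the using block dispose the context before anything has executed.
public IEnumerable<UserDTO> GetUsersBroken()
{
    using (var db = new DbContext())
    {
        return from u in db.tblUsers
               select new UserDTO() { Name = u.Name };
    } // db is disposed here; enumerating the result later throws ObjectDisposedException
}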

If you need a certain sequence of data to be cached, call one of the conversion operators (ToList, ToArray, etc.) on that sequence. Otherwise just use lazy evaluation.
Build your code around your data. What data is volatile and needs to be pulled fresh each time? Use lazy evaluation and don't cache. What data is relatively static and only needs to be pulled once? Cache that data in memory so you don't pull it unnecessarily.
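A small sketch of that split, with stand-in loader methods (all names here are illustrative, not from the answer above):
using System;
using System.Collections.Generic;
using System.Linq;

static class DataAccess
{
    // Relatively static data: pulled once, cached, reused.
    private static readonly Lazy<IReadOnlyList<string>> _countries =
        new Lazy<IReadOnlyList<string>>(() => LoadCountries().ToList());

    public static IReadOnlyList<string> Countries => _countries.Value;

    // Volatile data: stays lazy, so every enumeration sees fresh rows.
    public static IEnumerable<string> PendingJobs() =>
        LoadJobs().Where(j => j.StartsWith("pending"));

    private static IEnumerable<string> LoadCountries() =>
        new[] { "NO", "SE", "DK" };      // stand-in for a real query

    private static IEnumerable<string> LoadJobs() =>
        new[] { "pending:1", "done:2" }; // stand-in for a real query
}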

Deferred execution and caching all items with .ToList() are not the only options. A third option is to cache the items while you are iterating, by using a lazy list.
The execution is still deferred, but each item is only yielded once. An example of how this works:
public class LazyListTest
{
    private int _count = 0;

    public void Test()
    {
        var numbers = Enumerable.Range(1, 40);
        var numbersQuery = numbers.Select(GetElement).ToLazyList(); // Cache lazy

        var total = numbersQuery.Take(3)
                                .Concat(numbersQuery.Take(10))
                                .Concat(numbersQuery.Take(3))
                                .Sum();
        Console.WriteLine(_count);
    }

    private int GetElement(int value)
    {
        _count++;
        // Some slow stuff here...
        return value * 100;
    }
}
If you run the Test() method, _count is only 10. Without caching it would be 16, and with .ToList() it would be 40!
An example of the implementation of LazyList can be found here.
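For reference, a minimal, non-thread-safe sketch of the caching idea behind such a ToLazyList extension (not necessarily the linked implementation): already-yielded items are replayed from a cache, and the underlying source is only pulled for items not seen yet.
public static class LazyListExtensions
{
    public static IEnumerable<T> ToLazyList<T>(this IEnumerable<T> source)
    {
        var cache = new List<T>();
        var enumerator = source.GetEnumerator();
        return Iterate();

        IEnumerable<T> Iterate()
        {
            int index = 0;
            while (true)
            {
                if (index < cache.Count)
                {
                    yield return cache[index++]; // replay from the cache
                }
                else if (enumerator.MoveNext())
                {
                    cache.Add(enumerator.Current); // pull one new item and remember it
                    yield return cache[index++];
                }
                else
                {
                    yield break; // source exhausted
                }
            }
        }
    }
}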

Related

Your preference for 'materializing' IEnumerables?

Sometimes it is necessary to actually 'evaluate' an IEnumerable in the middle of a method, because it is used in multiple queries and the compiler issues a warning ("Possible multiple enumeration of IEnumerable"):
var skippedIds = objects.Where(x => x.State == "skip")
                        .Select(x => x.Id)
                        .Distinct();
var skippedLookup = skippedIds.ToLookup(x => x.FundId, _ => new { _.Id, _.Name });
if (skippedIds.Any()) // compiler warning
{
    ...
    // other iterations over skippedIds, etc.
}
I used to do:
var skippedIds = objects.Where(x => x.State == "skip")
                        .Select(x => x.Id)
                        .Distinct()
                        .ToList();
...
but would like to know if there are better options. The code above creates a List<T> object on the heap, which is, I guess, unnecessary GC burden in the context of a temporary variable that dies within the method.
I am now using ToImmutableArray(), which comes with the System.Collections.Immutable library. Not only does this create a stack-allocated object (not true, thanks commenters), but it also attaches 'immutable' semantics to my code, which is, I guess, good functional-style practice.
But what are performance implications? What is the preferable way of 'materializing' temporary subquery results that are used in multiple places locally within a method?
The performance implications of materialising it in memory are:
The initial grab of all items from the database - if you're not going to be using all of the items, then you could be taking more than you need.
Depending on the structure you use, you could have insertion costs - ToImmutableArray() will be about as quick as ToArray(), because ImmutableArray just wraps the built-in array type and removes the mutation option.
GC burdens are less of a concern if you're throwing the object away quickly, because it's very unlikely the object will be promoted from Gen 0 to Gen 1, so it will be collected without much cost. But obviously, the more big objects you allocate, the more likely it is that a collection is triggered.
You could use the Seq<A> type from language-ext (disclosure: I'm the author), which is designed to be a 'better enumerable' in that it will only ever consume each item in an IEnumerable<A> once, and is lazy like IEnumerable<A>.
So, you could do this:
var skippedIds = objects.Where(x => x.State == "skip")
                        .Select(x => x.Id)
                        .Distinct()
                        .ToSeq();
Obviously there's nothing for free in this world, and the costs of Seq<A> are:
An allocation per item consumed (as it memoises the items you've read so you don't consume them again). But these are tiny objects with just two references each, and so cause very little GC pressure.
Holding open the connection to the database longer than you possibly need, which could cause other performance issues with your db: deadlocks, etc.
But the benefits are that you only consume what you need, and you consume it once. Personally, I would look to limit your query and use ToImmutableArray(); taking less than you need from the db will always be the preferred approach.
In this specific case, the issue is that you have materialised the results (in the form of a Lookup), but then refer to the unmaterialised results.
var skippedIds = objects.Where(x => x.State == "skip")
                        .Select(x => x.Id)
                        .Distinct();
var skippedLookup = skippedIds.ToLookup(x => x.FundId, _ => new { _.Id, _.Name });
if (skippedIds.Any()) // compiler warning
In the above code, skippedIds is not materialised, but skippedLookup is. As such, you may consider changing:
if (skippedIds.Any()) // compiler warning
to:
if (skippedLookup.Any()) // no compiler warning
If we take the more general case, some additional guidance:
Consider the performance cost of multiple enumeration (e.g. hitting the database twice) vs materialising (e.g. RAM usage) - which is best can be contextual
Consider using ToList or ToImmutableArray for materialising (both appear to perform well).
Consider whether any of the LINQ operations can be removed from the code without an impact on overall functionality. A common mistake is to use Any then foreach - in many cases the Any can be removed since the foreach will automatically do nothing if the enumerable is empty.
If the IEnumerable is using LINQ to Objects and you are performing Distinct followed by a materialising operation (e.g. ToList), then instead use new HashSet<YourTypeHere>(YourEnumerableHere). It will perform the Distinct and the materialising operation in one hit (see the sketch after this list).
When materialising using ToList consider exposing the resulting List as IReadOnlyList to indicate to the consumers that it is not designed to be altered.
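A minimal sketch of the Distinct-plus-materialise point above, for LINQ to Objects: the HashSet constructor de-duplicates and materialises in a single pass.
var ids = new[] { 3, 1, 3, 2, 1 };

var viaLinq = ids.Distinct().ToList(); // two steps: de-duplicate, then copy
var viaSet = new HashSet<int>(ids);    // one step: same distinct values

Console.WriteLine(viaSet.Count); // 3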
In practical terms, it rarely matters much which approach you choose. There will be some GC overhead from the List and its underlying array, sure. But in the broader context of the overall GC load (e.g. of the objects that the List contains), it is unlikely to be an issue. If the list gets big enough, the Large Object Heap can be involved, which isn't optimal. But honestly, let the GC do its job. If there is a problem, optimise then, but not before.

Does a foreach loop work more slowly when used with an unstored list or array?

I am wondering whether a foreach loop works more slowly when the collection it iterates over is not stored in a list or array first.
I mean like this:
foreach (var tour in list.OrderBy(x => x.Value))
{
    // DoSomething();
}
Does the loop in this code recalculate the sort on every iteration?
The loop using a stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (var tour in list)
{
    // DoSomething();
}
And if it does, which version performs better: storing the value first or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Keep in mind that this is a generalization, and there are plenty of cases where it doesn't hold. Nevertheless, the first instinct should be to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (var tour in list)
{
    // DoSomething();
}
vs this option:
foreach (var tour in tours.OrderBy(x => x.Value))
{
    // DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the linked documentation, you'll see it returns an IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is already finished when you first start iterating over the object (and that, I believe, is the crux of your question: no, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same problem to solve for ordering the results, and they accomplish it the same way, meaning they take exactly the same amount of time to reach that point in the code.
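A small sketch to back up that crux, counting key-selector calls in LINQ to Objects (a rough illustration; the exact counts are an implementation detail):
int calls = 0;
var data = new[] { 3, 1, 2 };
var ordered = data.OrderBy(x => { calls++; return x; });

foreach (var n in ordered) { }
Console.WriteLine(calls); // 3 - the sort ran once, up front, not once per loop iteration

foreach (var n in ordered) { }
Console.WriteLine(calls); // 6 - but each new enumeration re-sorts, which is where ToList() helps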
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also potentially need new memory allocations for the full objects, if you were reading from a stream or database reader that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here: that you are working with something that is not already a list. What I mean by this is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
Yes and no.
Yes, the foreach statement will seem to work more slowly.
No, your program has the same total amount of work to do, so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList or .ToArray in between. In this case you are only using it once (in the foreach), but it is an easy thing to miss.
Edit: just to be clear, the as statement in the question will not work as intended, but my answer assumes no .ToList() after the OrderBy.
This line won't run:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, the second option (storing the results) will enumerate much faster, as it skips re-running the sort on every subsequent enumeration.

Fastest Approach to Finding Number of Unique Members in a List

I've been trying to find a good way of getting the number of unique values from a list. There was a very good question here which I tried to draw on to create a solution that looks like this:
gridStats[0] = gridList.SelectMany(x => x.Position.Easting).Distinct().ToList().Count();
gridStats[1] = gridList.SelectMany(x => x.Position.Northing).Distinct().ToList().Count();
However, that seems to produce an error saying that the type arguments cannot be inferred, which didn't make sense to me. Further research seemed to suggest that Distinct, good as it is, would not actually provide what I am looking for in any case without some additional code.
Therefore, I gave up on that approach and tried to go for a loop method, and I have arrived at this:
List<double> eastings = new List<double>();
List<double> northings = new List<double>();

for (int i = 0; i < gridList.Count; i++)
{
    if (!eastings.Contains(gridList[i].Position.Easting))
    {
        eastings.Add(gridList[i].Position.Easting);
    }
    if (!northings.Contains(gridList[i].Position.Northing))
    {
        northings.Add(gridList[i].Position.Northing);
    }
}

gridStats[0] = eastings.Count;
gridStats[1] = northings.Count;
Note here that 'gridList' can have hundreds of millions of entries.
Quite predictably, this loop is not particularly fast in use. Therefore, I was hoping it would be possible to either get assistance in making that loop more efficient or assistance in sorting out the Linq approach.
What do you suggest as the best approach when the only concern is the speed at which this task is performed?
You were so close.
Distinct is indeed the best choice for this scenario - it's similar to a HashSet<T>-based implementation, but internally it uses a special lightweight hash set. In practice I don't think there will be a noticeable difference in performance, but Distinct is more readable and at the same time a bit faster.
What you've missed, though, is that the question in the link is about a list of objects that have a list property, so it needed SelectMany, while in your case each object holds a single value, so a simple Select will do the job, like this:
gridStats[0] = gridList.Select(x => x.Position.Easting).Distinct().Count();
gridStats[1] = gridList.Select(x => x.Position.Northing).Distinct().Count();
Also note that the ToList call was not needed in order to use the Count extension method. Every operation has a cost, so don't include unnecessary methods - they won't make your code more readable, but they will surely make it slower and more space consuming.
You can speed this up by using HashSet instead of List for eastings and northings:
HashSet<double> eastings = new HashSet<double>();
HashSet<double> northings = new HashSet<double>();
The reason this would be faster is that a HashSet uses a hash to give O(1) lookups, versus a List, which is O(n) (it has to search the whole list to see if the item exists).
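With that substitution the Contains guards also become unnecessary, since HashSet<T>.Add simply returns false for duplicates. A sketch of the rewritten loop, using the same gridList/gridStats as above:
var eastings = new HashSet<double>();
var northings = new HashSet<double>();

for (int i = 0; i < gridList.Count; i++)
{
    eastings.Add(gridList[i].Position.Easting);   // O(1) amortised; duplicates ignored
    northings.Add(gridList[i].Position.Northing);
}

gridStats[0] = eastings.Count;
gridStats[1] = northings.Count;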

LINQ optimization

Here is a piece of code:
void MyFunc(List<MyObj> objects)
{
    MyFunc1(objects);
    foreach (MyObj obj in objects.Where(obj1 => obj1.Good))
    {
        // Do Action With Good Object
    }
}

void MyFunc1(List<MyObj> objects)
{
    int iGoodCount = objects.Where(obj1 => obj1.Good).Count();
    BeHappy(iGoodCount);
    // do other stuff with 'objects' collection
}
Here we see that the collection is analyzed twice, and each time the value of the 'Good' property is checked for each member: the 1st time when calculating the count of good objects, the 2nd when iterating through all good objects.
It is desirable to have that optimized, and here is a straightforward solution:
before the call to MyFunc1, create an additional temporary collection of good objects only (goodObjects; it can be IEnumerable);
get the count of these objects and pass it as an additional parameter to MyFunc1;
in the 'MyFunc' method, iterate not over 'objects.Where(...)' but over the 'goodObjects' collection.
Not too bad an approach (as far as I can see), but it requires an additional variable to be created in the 'MyFunc' method and an additional parameter to be passed.
Question: is there any out-of-the-box LINQ functionality that allows caching during the 1st Where().Count(), remembering the processed collection and reusing it in the next iteration?
Any thoughts are welcome.
Thanks.
No, LINQ queries are not optimized in this way (what you describe is similar to the way SQL Server reuses a query execution plan). LINQ does not (and, for practical purposes, cannot) know enough about your objects in order to optimize this way. As far as it knows, your collection has changed (or is entirely different) between the two calls.
You're obviously aware of the ability to persist your query into a new List<T>, but apart from that there's really nothing that I can recommend without knowing more about your class and where else MyFunc is used.
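For concreteness, a sketch of the 'persist the query' route the question itself outlines (an extra variable and an extra parameter, as noted there):
void MyFunc(List<MyObj> objects)
{
    // Filter once, reuse for both the count and the iteration.
    List<MyObj> goodObjects = objects.Where(obj1 => obj1.Good).ToList();

    MyFunc1(objects, goodObjects.Count); // count passed in; no second Where
    foreach (MyObj obj in goodObjects)
    {
        // Do Action With Good Object
    }
}

void MyFunc1(List<MyObj> objects, int iGoodCount)
{
    BeHappy(iGoodCount);
    // do other stuff with 'objects' collection
}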
As long as MyFunc1 doesn't need to modify the list by adding/removing objects, this will work.
void MyFunc(List<MyObj> objects)
{
    ILookup<bool, MyObj> objLookup = objects.ToLookup(obj1 => obj1.Good);
    MyFunc1(objLookup[true]);
    foreach (MyObj obj in objLookup[true])
    {
        //..
    }
}

void MyFunc1(IEnumerable<MyObj> objects)
{
    //..
}

What are real life applications of yield?

I know what yield does, and I've seen a few examples, but I can't think of real life applications, have you used it to solve some specific problem?
(Ideally some problem that cannot be solved some other way)
I realise this is an old question (pre Jon Skeet?) but I have been considering this question myself just lately. Unfortunately the current answers here (in my opinion) don't mention the most obvious advantage of the yield statement.
The biggest benefit of the yield statement is that it allows you to iterate over very large lists with much more efficient memory usage than using, say, a standard list.
For example, let's say you have a database query that returns 1 million rows. You could retrieve all rows using a DataReader and store them in a List, therefore requiring list_size * row_size bytes of memory.
Or you could use the yield statement to create an Iterator and only ever store one row in memory at a time. In effect this gives you the ability to provide a "streaming" capability over large sets of data.
Moreover, in the code that uses the Iterator, you use a simple foreach loop and can decide to break out from the loop as required. If you do break early, you have not forced the retrieval of the entire set of data when you only needed the first 5 rows (for example).
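A hedged sketch of that streaming idea (the table, connection string, and System.Data.SqlClient are assumptions for illustration): the iterator holds one row at a time, and because foreach disposes the enumerator, breaking early still closes the reader and connection via the using blocks.
using System.Collections.Generic;
using System.Data.SqlClient;

static IEnumerable<string> ReadNames(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("SELECT Name FROM Users", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                yield return reader.GetString(0); // one row in memory at a time
        }
    }
}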
Regarding:
Ideally some problem that cannot be solved some other way
The yield statement does not give you anything you could not do using your own custom iterator implementation, but it saves you from writing the often complex code needed. There are very few problems (if any) that can't be solved in more than one way.
Here are a couple of more recent questions and answers that provide more detail:
Yield keyword value added?
Is yield useful outside of LINQ?
Actually, I use it in a non-traditional way on my site IdeaPipe:
public override IEnumerator<T> GetEnumerator()
{
    // goes through the collection and only returns the ones that are visible
    // for the current user - this is done at this level instead of the display
    // level so that ideas do not bleed through on services
    foreach (T idea in InternalCollection)
        if (idea.IsViewingAuthorized)
            yield return idea;
}
So basically it checks whether viewing the idea is currently authorized, and if it is, it returns the idea. If it isn't, the idea is just skipped. This allows me to cache the Ideas but still display them only to the users that are authorized. Otherwise I would have to re-pull them each time based on permissions, when they are only re-ranked every hour.
One interesting use is as a mechanism for asynchronous programming, especially for tasks that take multiple steps and require the same set of data in each step. Two examples of this would be Jeffrey Richter's AsyncEnumerator Part 1 and Part 2. The Concurrency and Coordination Runtime (CCR) also makes use of this technique; see CCR Iterators.
LINQ's operators on the Enumerable class are implemented as iterators that are created with the yield statement. It allows you to chain operations like Select() and Where() without actually enumerating anything until you use the enumerator in a loop, typically via the foreach statement. Also, since only one value is computed per call to IEnumerator.MoveNext(), if you decide to stop mid-collection you'll save the performance hit of calculating all of the remaining results.
Iterators can also be used to implement other kinds of lazy evaluation, where expressions are evaluated only when you need them. You can also use yield for fancier stuff like coroutines.
Another good use for yield is to perform a function on the elements of an IEnumerable and to return a result of a different type, for example:
public delegate TResult SomeDelegate<TArg, TResult>(TArg obj);

public IEnumerable<TResult> DoActionOnList<TArg, TResult>(IEnumerable<TArg> list, SomeDelegate<TArg, TResult> action)
{
    foreach (var i in list)
        yield return action(i);
}
Using yield can prevent downcasting to a concrete type. This is handy to ensure that the consumer of the collection doesn't manipulate it.
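A small sketch of that point: the iterator below is a compiler-generated state machine, so a consumer cannot cast the result back to the underlying List<int> and mutate it.
class Holder
{
    private readonly List<int> _items = new List<int> { 1, 2, 3 };

    public IEnumerable<int> Items()
    {
        foreach (var item in _items)
            yield return item; // hides the concrete List from consumers
    }
}

// var broken = (List<int>)new Holder().Items(); // would throw InvalidCastException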
You can also use yield return to treat a series of function results as a list. For instance, consider a company that pays its employees every two weeks. One could retrieve a subset of payroll dates as a list using this code:
void Main()
{
    var StartDate = DateTime.Parse("01/01/2013");
    var EndDate = DateTime.Parse("06/30/2013");
    foreach (var d in GetPayrollDates(StartDate, EndDate)) {
        Console.WriteLine(d);
    }
}

// Calculate payroll dates in the given range.
// Assumes the first date given is a payroll date.
IEnumerable<DateTime> GetPayrollDates(DateTime startDate, DateTime endDate, int daysInPeriod = 14) {
    var thisDate = startDate;
    while (thisDate < endDate) {
        yield return thisDate;
        thisDate = thisDate.AddDays(daysInPeriod);
    }
}
