I've found this question, but it has no answer yet... What algorithm does LINQ GroupBy use?
Since you have to iterate over the entire source collection to know all the groups, how can its execution be deferred? Will it iterate the source collection only once? Does it use a buffer?
(I'm assuming we're only talking about LINQ to Objects.)
It's still deferred in that until you start asking for results, it won't read the source collection at all. But yes, once you ask for the first result, it will indeed read the whole collection. It only reads the source once, and it only asks each element for its grouping key once. All the results are buffered in memory, as you suspected.
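A quick demonstration of both halves of that behaviour (a hypothetical logging source; LINQ to Objects assumed):

using System;
using System.Collections.Generic;
using System.Linq;

class GroupByDemo
{
    // A source that logs each element as it is handed out.
    static IEnumerable<int> Source()
    {
        foreach (var i in new[] { 1, 2, 3, 4 })
        {
            Console.WriteLine($"reading {i}");
            yield return i;
        }
    }

    static void Main()
    {
        var query = Source().GroupBy(i => i % 2);
        Console.WriteLine("query built, nothing read yet");

        // Asking for the first group forces the whole source to be
        // read and buffered before anything is returned.
        var first = query.First();
        Console.WriteLine($"first group key: {first.Key}"); // prints 1
    }
}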
My Edulinq blog post on GroupBy (Edulinq is basically a reimplementation of LINQ to Objects for the sake of education) shows a sample implementation, although that's in terms of ToLookup.
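A simplified sketch in that spirit (not the exact Edulinq code) shows how a deferred wrapper around an eager ToLookup gives exactly the behaviour described above:

using System;
using System.Collections.Generic;
using System.Linq;

static class GroupBySketchExtensions
{
    // Deferred wrapper: nothing runs until the iterator is first advanced;
    // at that point ToLookup eagerly reads and buffers the entire source.
    public static IEnumerable<IGrouping<TKey, TSource>> GroupBySketch<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        foreach (var group in source.ToLookup(keySelector))
            yield return group;
    }
}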
I have a comma delimited string called ctext which I want to split and put into a List<string>.
Would using LINQ,
List<string> f = ctext.Split(',').ToList();
be slower than not using LINQ?
List<string> f = new List<string>();
f.AddRange(ctext.Split(','));
It seems that LINQ would actually copy something somewhere at some point which would make it slower, whereas AddRange() would just check the size of the list once, expand it, and dump it in.
Or is there an even faster way? (Like using a for loop, but I doubt it.)
Fortunately, we can easily look at what ToList does now that it's open source. (Follow the link for the latest source...)
I haven't seen IListProvider<T> before, but I doubt that an array implements it, which means we've basically got new List<TSource>(source). Looking at the List<T> source shows that both the constructor and AddRange basically end up using CopyTo.
In other words, other than a few levels of indirection, I'd expect them both to do the same thing.
It seems that LINQ would actually copy something somewhere at some point which would make it slower, whereas AddRange() would just check the size of the list once, expand it, and dump it in.
You are correct that both of these things are happening, but incorrect in thinking that each is specific to that one operation. Both ToList and AddRange do both of those things. Both operations copy all of the values from the input sequence into the list, and since both of them are adding multiple items at the same time they're able to see how much to expand the internal capacity of the list all at once, rather than needing to perform multiple expansions.
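A small illustration (my reading of the current sources; exact internals vary between framework versions):

string ctext = "a,b,c";
string[] parts = ctext.Split(','); // string[] implements ICollection<string>

// Both paths see the ICollection<T> Count, allocate the backing array once,
// and bulk-copy the elements:
List<string> viaCtor = new List<string>(parts); // the path ToList takes
List<string> viaAddRange = new List<string>();
viaAddRange.AddRange(parts);                    // same single CopyTo internally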
I was looking into the Enumerable.ToLookup API, which converts an enumerable sequence into a dictionary-like data structure. More details can be found here:
https://msdn.microsoft.com/en-us/library/system.linq.enumerable.tolookup(v=vs.110).aspx
The only difference from the ToDictionary API is that it won't give an error if the key selector produces duplicate keys. I need a comparison of the deferred execution semantics of these two APIs. As far as I know, the ToDictionary API results in immediate execution of the sequence, i.e. it doesn't follow the deferred execution semantics of LINQ queries. Can anyone help me with the deferred execution behavior of the ToLookup API? Is it the same as the ToDictionary API, or is there some difference?
Easy enough to test...
void Main()
{
    var lookup = Inf().ToLookup(i => i / 100);
    Console.WriteLine("if you see this, ToLookup is deferred"); // never happens
}

IEnumerable<int> Inf()
{
    unchecked
    {
        for (var i = 0; ; i++)
        {
            yield return i;
        }
    }
}
To recap, ToLookup greedily consumes the source sequence without deferring.
In contrast, the GroupBy operator is deferred, so you can write the following to no ill-effect:
var groups = Inf().GroupBy(i => i / 100); // fine so far: nothing has been enumerated yet
However, GroupBy is greedy, so when you enumerate, the entire source sequence is consumed.
This means that
groups.SelectMany(g => g).First();
also fails to complete.
When you think about the problem of grouping, it quickly becomes apparent that when separating a sequence into a sequence of groups, there is no way to know whether even a single group is complete without consuming the entire source sequence.
This was sort of covered here, but it was hard to find!
In short -- ToLookup does not defer execution!
ToLookup() -> immediate execution
GroupBy() (and other query methods) -> deferred execution
If you look at the reference implementation source code for both the Enumerable.ToDictionary() and the Enumerable.ToLookup() methods, you will see that both end up executing a foreach loop over the source enumerable. That's one way to confirm that the execution of the source enumerable is not deferred in both cases.
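For a concrete picture, the eager shape shared by both methods looks roughly like this (a simplified sketch, not the real source):

using System;
using System.Collections.Generic;

static class ToDictionarySketchExtensions
{
    // Simplified sketch: the foreach runs as soon as the method is called,
    // so execution is immediate, not deferred. ToLookup has the same shape.
    public static Dictionary<TKey, TSource> ToDictionarySketch<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
        where TKey : notnull
    {
        var result = new Dictionary<TKey, TSource>();
        foreach (var item in source)
            result.Add(keySelector(item), item);
        return result;
    }
}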
But I mean, the answer is pretty self-evident: if you start off with an enumerable and the return value of the function is no longer an enumerable, then clearly it must have been executed (consumed), no?
(That last paragraph is not accurate, as pointed out by @spender in the comments.)
If I run against dlinq, for example
var custq = data.StoreDBHelper.DataContext.Customers as IEnumerable<Data.SO.Customer>;
I thought there was not much of a difference compared to running:
var custq = data.StoreDBHelper.DataContext.Customers as IQueryable<Data.SO.Customer>;
As IQueryable inherits from IEnumerable.
But I discovered the following, if you call:
custq.Sum()
then the program will process it as if you had called .ToList(), if you use 'as IEnumerable',
because the program's memory usage rose to the same level as when I tried custq.ToList().Sum(),
but not with 'as IQueryable' (because the query then runs on SQL Server), which did not affect the memory usage of the program.
My question is simply this: should you avoid 'as IEnumerable' with DLinq and use 'as IQueryable' as a general rule? I know that if you are running standard iterations, you get the same result with 'as IEnumerable' and 'as IQueryable'.
But is it only with aggregate functions and delegate expressions in Where clauses that there is a difference, or will you in general get better performance if you use 'as IQueryable'? (for standard iteration and filter functions on DLinq entities)
Thanks!
Well, it depends on what you want to do...
Casting it as IEnumerable will return an object you can enumerate... and nothing more.
So yes, if you call Count on an IEnumerable, then you enumerate the list (so you actually perform your Select query) and count each iteration.
On the other hand, if you keep an IQueryable, then you may enumerate it, but you can also compose database operations like Where, OrderBy, or Count onto it. These delay execution of the query and modify it before it is finally run.
Calling OrderBy on an enumerable browses all results and orders them in memory. Calling OrderBy on a queryable simply adds ORDER BY to the end of your SQL and lets the database do the ordering.
In general, it is better to keep it as an IQueryable, yes... Unless you want to count them by actually browsing them (instead of doing a SELECT COUNT(*)...)
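A sketch of the difference (assuming a LINQ to SQL DataContext with a Customers table; the Customer type and TotalPurchases column are hypothetical):

// Keeping the static type as IQueryable: the filter and count are
// translated into SQL and executed on the server.
IQueryable<Customer> q = dataContext.Customers;
int big1 = q.Count(c => c.TotalPurchases > 1000);
// -> roughly: SELECT COUNT(*) FROM Customers WHERE TotalPurchases > 1000

// Casting to IEnumerable: every row is pulled into memory first,
// then filtered and counted client-side.
IEnumerable<Customer> e = dataContext.Customers;
int big2 = e.Count(c => c.TotalPurchases > 1000);
// -> roughly: SELECT * FROM Customers, then a foreach in your process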
I am wondering about general performance of LINQ. I admit, that it comes handy but how performant is LINQ? I know that is a broad question. So I want to ask about a particular example:
I have a query that projects each row into a MembershipUser:
var users = reader.Select(user => new MembershipUser(user.Name, user.Age));
And now I want to convert it to a MembershipUserCollection.
So I do it like this:
MembershipUserCollection membershipUsers = new MembershipUserCollection();
users.ToList().ForEach(membershipUsers.Add); //what is the complexity of this line?
What is the complexity of the last line? Is it n^2?
Does the ToList() method iterate over each element of users and add it to the list?
Or does ToList() work differently? Because if not, I find it hard to justify using the last line of the code instead of simply:
foreach (var user in users)
{
    membershipUsers.Add(user);
}
Your example isn't particularly good for your question because ToList() isn't really in the same class of extension methods as the other ones supporting LINQ. The ToList() extension method is a conversion operation, not a query operation. The real values in LINQ are deferred execution of a composite query built by combining several LINQ query operations and improved readability. In LINQ2SQL you also get the advantage of constructing arbitrary queries that get pushed to the DB server for actual execution, taking advantage of optimizations that the DB may have in place to improve performance.
In general, I would expect that the question of performance largely comes down to how well you construct the actual queries and has a lot more to do with how well the programmer knows the tools and data than how well the tool is implemented. In your case, it makes no sense to construct a temporary list just to be able to invoke the convenience ForEach method on it if all you care about is performance. You'd be better off simply iterating over the enumeration you already have (as you suspect). LINQ won't stop a programmer from writing bad code, though it may disguise bad code for the person who doesn't understand how LINQ works.
It's always the case that you can construct an equivalent program not using LINQ for any program using LINQ. It may be that you can actually improve on the performance. I would submit, though, that LINQ makes it much easier to write readable code than non-LINQ solutions. By that, I mean more compact and understandable. It also makes it easier to write composable code, which when executed in a deferred manner performs better than non-LINQ compositions. By breaking the code into composable parts, you simplify it and improve understandability.
I think the trick here is to really understand where LINQ makes sense rather than treat it as a shiny, new tool that you need to now use for every problem you have. The nice part of this shiny, new tool, though, is that it really does come in handy in a lot of situations.
It's O(n) - since .ToList() iterates once through the enumeration and copies the elements into the resulting List<T> (whose insertion is amortized O(1)). Thus the complexity is fine.
The actual issue you might see is that you create a completely new, temporary List<T> just to copy its contents into another list (and afterwards discard it).
I suspect that's just due to the convenience of having a .ForEach() method on List<T>. One could nonetheless code a direct implementation for IEnumerable<T>, which would save this superfluous copy - or just write
foreach (var user in users) membershipUsers.Add(user);
which is basically what you want to express after all ;-)
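Such a direct implementation is tiny; a hypothetical extension method (not part of the framework) could look like this:

using System;
using System.Collections.Generic;

static class EnumerableExtensions
{
    // ForEach for any IEnumerable<T>, avoiding the temporary List<T>
    // that .ToList().ForEach(...) would allocate.
    public static void ForEach<T>(this IEnumerable<T> source, Action<T> action)
    {
        foreach (var item in source)
            action(item);
    }
}

With that in scope, the original line becomes simply users.ForEach(membershipUsers.Add), with no intermediate list.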
Converting to a list will have the same complexity as iterating over the sequence, which may really be anything depending on how the sequence is generated. A normal Select over an in-memory list is O(n).
The performance of using ForEach on a List vs a foreach loop comes down to the overhead of invoking a delegate vs the overhead of creating and using an enumerator; I cannot say which one is quicker, but if both are used on an in-memory list, the complexity is the same.
Regarding LINQ to Objects: if I use a .Where(x => x....) and then straight afterwards use a .SkipWhile(x => x...), does this incur a performance penalty because I am going over the collection twice?
Should I find a way to put everything in the Where clause or in the SkipWhile clause?
There will be a minor inefficiency due to chaining iterators together, but it really will be very minor. (In particular, although each matching item will be seen by both operators, they won't be buffered up or anything like that. LINQ to Objects isn't going to create a new list of all matching items and then run SkipWhile over it.)
If this is so performance-critical, you'd probably get a very slight speed bump by not using LINQ in the first place. In every other case, write the simplest code first and only worry about micro-optimisations like this when you've proved it's a bottleneck.
Using a Where and a SkipWhile doesn't result in "going over the collection twice." LINQ to Objects works on a pull model. When you enumerate the combined query, the SkipWhile will start asking its source for elements. Its source is the Where, so this will cause the Where to start asking its source in turn for elements. So the SkipWhile will see all elements that pass the Where clause, but it's getting them as it goes. The upshot is that LINQ does a foreach over the original collection, returning only elements that pass both the Where and SkipWhile filters -- and this involves only a single pass over the collection.
There may be a trivial loss of efficiency because there are two iterators involved, but it is unlikely to be significant. You should write the code to be clear (as you are doing at the moment), and if you suspect that the clear version is causing a performance issue, measure to make sure, and only then try combining the clauses.
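You can watch the single pass happen by logging inside the predicates (illustrative values):

using System;
using System.Linq;

class StreamingDemo
{
    static void Main()
    {
        var source = new[] { 1, 2, 3, 4, 5 };

        var query = source
            .Where(x => { Console.WriteLine($"Where sees {x}"); return x % 2 == 1; })
            .SkipWhile(x => { Console.WriteLine($"SkipWhile sees {x}"); return x < 3; });

        foreach (var x in query)
            Console.WriteLine($"result: {x}");

        // The interleaved output shows a single pass: each element is pulled
        // through Where and then SkipWhile before the next element is read.
    }
}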
As with most things, the answer is that it depends on what you're doing. If you have multiple Where clauses that operate on the same object, it's probably worth combining them with &&, for example (see the sketch after this answer).
Most of the LINQ operators won't iterate over the entire collection per operator; they merely process one item and pass it on to the next operator. There are exceptions to this, such as Reverse and OrderBy, but generally if you were using Where and SkipWhile, for example, you would have a chain that processes one item at a time. Now, your first Where statement could obviously filter out some items, so SkipWhile wouldn't see an item until it passed through the preceding operator.
My personal preference is to keep operators separate for clarity and combine them only if performance becomes an issue.
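For example, these two produce the same results, but the combined predicate saves one iterator layer (illustrative predicates):

using System;
using System.Linq;

var numbers = Enumerable.Range(0, 100);

var twoWheres = numbers.Where(x => x > 10).Where(x => x % 2 == 0);
var oneWhere  = numbers.Where(x => x > 10 && x % 2 == 0); // one less iterator

Console.WriteLine(twoWheres.SequenceEqual(oneWhere)); // True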
You aren't going over the collection twice when you use Where and SkipWhile.
The Where method will stream its output to the SkipWhile method one item at a time, and likewise the SkipWhile method will stream its output to any subsequent method one item at a time.
(There will be a small overhead because the compiler generates separate iterator objects for each method behind the scenes. But if I was worried about the overhead of compiler-generated iterators then I probably wouldn't be using LINQ in the first place.)
No, there's (essentially) no performance penalty. That's what lazy (deferred) execution is all about.