Does ToLookup force immediate execution of a sequence? - c#

I was looking into Enumerable.ToLookup API which converts an enumerable sequence into a dictionary type data structure. More details can be found here:
https://msdn.microsoft.com/en-us/library/system.linq.enumerable.tolookup(v=vs.110).aspx
The only difference it carries from the ToDictionary API is that it won't throw an error if the key selector produces duplicate keys. I need a comparison of the deferred execution semantics of these two APIs. As far as I know, the ToDictionary API results in immediate execution of the sequence, i.e. it doesn't follow the deferred execution semantics of LINQ queries. Can anyone help me with the deferred execution behavior of the ToLookup API? Is it the same as the ToDictionary API, or is there some difference?

Easy enough to test...
void Main()
{
    var lookup = Inf().ToLookup(i => i / 100);
    Console.WriteLine("if you see this, ToLookup is deferred"); //never happens
}
IEnumerable<int> Inf()
{
    unchecked
    {
        for(var i=0;;i++)
        {
            yield return i;
        }
    }
}
To recap, ToLookup greedily consumes the source sequence without deferring.
In contrast, the GroupBy operator is deferred, so you can write the following to no ill-effect:
var groups = Inf().GroupBy(i => i / 100); //oops
However, GroupBy is greedy, so when you enumerate, the entire source sequence is consumed.
This means that
groups.SelectMany(g=>g).First();
also fails to complete.
When you think about the problem of grouping, it quickly becomes apparent that when separating a sequence into a sequence of groups, it would be impossible to know if even just one of the groups were complete without completely consuming the entire sequence.
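The same behaviour can be observed on a finite sequence, which is easier to experiment with than an infinite one. The following is a minimal sketch, not part of the original answer: the class name, the `Pulled` counter and the `CountingSource` helper are all hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LookupVsGroupBy
{
    // Hypothetical counter: number of items pulled from the source so far.
    public static int Pulled;

    private static IEnumerable<int> CountingSource(int count)
    {
        for (var i = 0; i < count; i++)
        {
            Pulled++;
            yield return i;
        }
    }

    // ToLookup consumes the whole source before it returns.
    public static int PulledAfterToLookup()
    {
        Pulled = 0;
        var lookup = CountingSource(10).ToLookup(i => i / 5);
        return Pulled;
    }

    // GroupBy only builds a query object; nothing is pulled yet.
    public static int PulledAfterGroupBy()
    {
        Pulled = 0;
        var groups = CountingSource(10).GroupBy(i => i / 5);
        return Pulled;
    }

    // Asking for just the first group still consumes the whole source.
    public static int PulledAfterFirstGroup()
    {
        Pulled = 0;
        var first = CountingSource(10).GroupBy(i => i / 5).First();
        return Pulled;
    }
}
```

With these definitions, `PulledAfterToLookup()` returns 10, `PulledAfterGroupBy()` returns 0, and `PulledAfterFirstGroup()` returns 10: GroupBy is deferred, but greedy once enumeration starts.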

This was sort of covered here, but it was hard to find!
In short -- ToLookup does not defer execution!
ToLookup() -> immediate execution
GroupBy() (and other query methods) -> deferred execution

If you look at the reference implementation source code for both the Enumerable.ToDictionary() and the Enumerable.ToLookup() methods, you will see that both end up executing a foreach loop over the source enumerable. That's one way to confirm that the execution of the source enumerable is not deferred in both cases.
But I mean, the answer is pretty self evident in that if you start off with an enumerable, and the return value of the function is no longer an enumerable, then clearly, it must have been executed (consumed), no?
(That last paragraph was not accurate as pointed out by @spender in the comments)
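For the comparison the question asks about, here is a small sketch of the one behavioural difference between the two methods (duplicate keys), using a hypothetical word list. Note that the result of ToLookup is an `ILookup<TKey, TElement>`, which is itself still enumerable (as a sequence of groupings), so the return type alone does not tell you whether execution was deferred.

```csharp
using System;
using System.Linq;

public static class DictionaryVsLookup
{
    // Hypothetical sample data: two words share the key 'a'.
    private static readonly string[] Words = { "apple", "avocado", "banana" };

    // ToDictionary runs immediately and rejects duplicate keys.
    public static bool ToDictionaryThrowsOnDuplicateKeys()
    {
        try
        {
            Words.ToDictionary(w => w[0]);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    // ToLookup also runs immediately, but groups elements that share a key.
    // Indexing with a key that is not present yields an empty sequence.
    public static int CountForKey(char key)
    {
        ILookup<char, string> lookup = Words.ToLookup(w => w[0]);
        return lookup[key].Count();
    }
}
```

Here `ToDictionaryThrowsOnDuplicateKeys()` returns true, while `CountForKey('a')` is 2, `CountForKey('b')` is 1 and `CountForKey('z')` is 0.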

Related

Could locking an enumerable potentially cause multiple enumeration?

I think ReSharper is lying to me.
I have this extension method that (hopefully) returns an xor of two enumerations:
public static IEnumerable<T> Xor<T>(this IEnumerable<T> first, IEnumerable<T> second)
{
    lock (first)
    {
        lock (second)
        {
            var firstAsList = first.ToList();
            var secondAsList = second.ToList();
            return firstAsList.Except(secondAsList).Union(secondAsList.Except(firstAsList));
        }
    }
}
ReSharper thinks I'm performing a multiple enumeration of an IEnumerable, as you can see, on both the arguments. If I remove the locks, then it's satisfied that I'm not.
Is ReSharper right or wrong? I believe it's wrong.
edit: I do realize that I'm enumerating the lists multiple times, but ReSharper is saying I'm enumerating over the original arguments multiple times, which I don't think is true. I'm enumerating both arguments once into a list so I may then perform the actual set manipulation, but as I see it, I'm not actually iterating over the arguments passed multiple times.
For example, if the passed arguments are actually query results, my belief is this method won't cause a storm of queries to be executed by the set manipulation. I do understand what ReSharper means by warning of multiple enumeration: if the enumerables passed are heavy to generate, then if they're enumerated multiple times, then performing multiple enumerations on them will be much slower.
Also, removing the locks definitely makes ReSharper happier:
You are indeed enumerating both of the lists multiple times. You are not enumerating the enumerables passed as parameters multiple times. Both lists are enumerated once for each call to Except. The call to Union is not enumerating either sequence an additional time, but rather is enumerating the results of the two calls to Except.
Of course, iterating a List multiple times in a context like this isn't really a problem; there aren't negative consequences to iterating an unchanging list multiple times.
The lock statements have nothing whatsoever to do with enumeration of the sequences. Locking on an IEnumerable does not iterate it. Of course, locking on two objects like this, specifically two objects that are not limited in scope to this section of code, is very dangerous. It's quite possible to end up deadlocking the program with locks used in this manner if code elsewhere (such as another invocation of this method) ends up taking locks on the same objects in the opposite order.
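As an illustration of dropping the locks altogether, here is one possible lock-free sketch. It swaps in a different technique from the question's Except/Union approach: it copies the first argument into a `HashSet<T>` and uses `SymmetricExceptWith`, so each argument is enumerated exactly once. The class name `SetExtensions` is hypothetical, and whether this fits the asker's threading requirements is an assumption.

```csharp
using System.Collections.Generic;

public static class SetExtensions
{
    // Returns the elements that appear in exactly one of the two sequences.
    public static IEnumerable<T> Xor<T>(this IEnumerable<T> first, IEnumerable<T> second)
    {
        // Enumerates 'first' once, dropping duplicates.
        var result = new HashSet<T>(first);

        // Enumerates 'second' once; keeps elements present in only one side.
        result.SymmetricExceptWith(second);

        return result;
    }
}
```

For example, `new[] { 1, 2, 3 }.Xor(new[] { 3, 4 })` contains 1, 2 and 4. The order of elements in a HashSet is not guaranteed, so sort the result if order matters.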
This is a bit of a funny one.
First things first: as you've correctly identified, R# is raising this inspection not against the multiple usages of the Lists - there is of course nothing to worry about in multiply enumerating a List - but against (what R# sees as multiple usages of) the IEnumerable arguments. I'm presuming you already know why this would be potentially bad, so I'll skip that.
Now to the question of whether R# is right to complain here. To quote the C# spec,
A lock statement of the form
lock (x) ...
where x is an expression of a reference-type, is precisely
equivalent to
System.Threading.Monitor.Enter(x);
try {
    ...
}
finally {
    System.Threading.Monitor.Exit(x);
}
except that x is only evaluated once.
(I've put in this emphasis because I like this wording; it avoids debates (that I'm definitely not qualified to enter) about whether this is "syntactic sugar" or not.)
Taking a minimal example which produces this R# inspection:
private static void Method(IEnumerable<int> enumerable)
{
    lock (enumerable)
    {
        var list = enumerable.ToList();
    }
}
and replacing it by what I think is the precisely equivalent version as mandated by the spec:
private static void Method(IEnumerable<int> enumerable)
{
    var x = enumerable;
    System.Threading.Monitor.Enter(x);
    try
    {
        var list = enumerable.ToList();
    }
    finally
    {
        System.Threading.Monitor.Exit(x);
    }
}
also produces the inspection.
The question then is: is R# right to produce this inspection? And this is where I think we get into a grey area. When I pass the following enumerable to either of these methods:
static IEnumerable<int> MyEnumerable()
{
    Console.WriteLine("Enumerable enumerated");
    yield return 1;
    yield return 2;
}
it is not multiply enumerated, which would suggest that R# is wrong to warn here; however, I can't actually find anything in documentation that guarantees this behaviour of either lock or Monitor.Enter. So for me it's not quite as clear-cut as this R# bug I reported, where use of GetType flagged this inspection; but nonetheless I'd guess you're safe.
If you raise this on the R# bug tracker, you can get JetBrains' finest looking at a) whether this behaviour is indeed guaranteed, and b) whether R# can be adjusted to either not warn, or provide a justification for warning.
That said, of course, using locking here probably isn't actually achieving what you want to achieve, as stated in other answers and comments...
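The experiment described above can be sketched with a counter instead of console output. This is an observation of how current implementations behave rather than a documented guarantee, and the names `LockDoesNotEnumerate`, `Enumerations` and `Counted` are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LockDoesNotEnumerate
{
    // Hypothetical counter: how many times the iterator body was started.
    public static int Enumerations;

    private static IEnumerable<int> Counted()
    {
        Enumerations++; // runs once per enumeration, on the first MoveNext
        yield return 1;
        yield return 2;
    }

    public static int EnumerationsAfterLockAndToList()
    {
        Enumerations = 0;
        IEnumerable<int> sequence = Counted();

        // Taking the monitor on the object does not call GetEnumerator.
        lock (sequence)
        {
            var list = sequence.ToList(); // the only enumeration
        }

        return Enumerations;
    }
}
```

`EnumerationsAfterLockAndToList()` returns 1, which matches the observation that the sequence is enumerated only by ToList and not by the lock.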

Is it better DLinq over IQueryable or DLinq over IEnumerable for better performance?

If I run against dlinq, for example
var custq = data.StoreDBHelper.DataContext.Customers as IEnumerable <Data.SO.Customer>;
I thought there was not much difference compared with running:
var custq = data.StoreDBHelper.DataContext.Customers as IQueryable <Data.SO.Customer>;
since IQueryable inherits from IEnumerable.
But I discovered the following. If you call:
custq.Sum()
then, if you use 'as IEnumerable', the program processes this as if you had called .ToList():
the program's memory usage rose to the same level as when I tried custq.ToList().Sum().
With 'as IQueryable' this did not happen (because the query then runs on SQL Server), and the program's memory usage was not affected.
My question is simply this: should you avoid 'as IEnumerable' on DLinq and use 'as IQueryable' as a general rule? I know that for standard iteration you get the same result with 'as IEnumerable' and 'as IQueryable'.
But is it only for the aggregate functions (such as Sum) and delegate expressions in Where clauses that there will be a difference, or will you in general get better performance if you use 'as IQueryable' (for standard iteration and filter functions on DLinq entities)?
Thanks !
Well, depends on what you want to do...
Casting it as IEnumerable will return an object you can enumerate... and nothing more.
So yes, if you call Count on an IEnumerable, then you enumerate the list (so you actually perform your Select query) and count each iteration.
On the other hand, if you keep an IQueryable, then you may enumerate it, but you can also apply database operations such as Where, OrderBy or Count. Operators such as Where and OrderBy delay execution of the query and modify it before it is eventually run, while Count is sent to the database as part of the query.
Calling OrderBy on an enumerable enumerates all results and orders them in memory. Calling OrderBy on a queryable simply adds ORDER BY at the end of your SQL and lets the database do the ordering.
In general, it is better to keep it as an IQueryable, yes... Unless you want to count them by actually browsing them (instead of doing a SELECT COUNT(*)...)
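The difference comes from which extension method the compiler binds to, and that depends on the static type of the variable. The sketch below uses an in-memory array wrapped with `AsQueryable()` as a stand-in for a LINQ to SQL table (an assumption for illustration, since a real DataContext needs a database); the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class QueryableVsEnumerable
{
    // With a static type of IQueryable<T>, Where binds to Queryable.Where,
    // which records the call in the expression tree for the provider to translate.
    public static string OperatorRecordedInExpressionTree()
    {
        IQueryable<int> source = new[] { 1, 2, 3 }.AsQueryable();
        IQueryable<int> filtered = source.Where(x => x > 1);
        var call = (MethodCallExpression)filtered.Expression;
        return call.Method.DeclaringType.Name + "." + call.Method.Name;
    }

    // With a static type of IEnumerable<T>, Where binds to Enumerable.Where,
    // so the filter runs in memory and the result is no longer a queryable.
    public static bool StaysQueryableAfterCastToEnumerable()
    {
        IEnumerable<int> source = new[] { 1, 2, 3 }.AsQueryable();
        IEnumerable<int> filtered = source.Where(x => x > 1);
        return filtered is IQueryable<int>;
    }
}
```

`OperatorRecordedInExpressionTree()` returns "Queryable.Where" and `StaysQueryableAfterCastToEnumerable()` returns false. With a real database provider, the first form is what allows the filter (or a Sum or Count) to be executed by the server instead of in your process.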

deferred execution or not

Are below comments correct about DEFERRED EXECUTION?
1. var x = dc.myTables.Select(r=>r);//yes
2. var x = dc.myTables.Where(..).Select(r=>new {..});//yes
3. var x = dc.myTables.Where(..).Select(r=>new MyCustomClass {..});//no
In other words, I always thought projecting custom class objects will always cause eager execution. But I couldn't find references supporting/denying it (though I am seeing results contradicting it, hence the post)
Every statement in your question is an example of deferred execution. The contents of the Select and Where statement have no effect on whether or not the resulting value is deferred executed or not. The Select + Where statements themselves dictate that.
As a counter example consider the Sum method. This is always eagerly executed irrespective of what the input is.
var sum = dc.myTables.Sum(...); // Always eager
To prove your point, your test should look like this:
var tracer = string.Empty;
Func<inType, outType> map = r => {
    tracer = "set";
    return new outType(...);
};
var x = dc.myTables.Where(..).Select(map);
// this confirms x was never enumerated; otherwise tracer would be "set".
Assert.AreEqual(string.Empty, tracer);
// confirm that it would have enumerated if it could
CollectionAssert.IsNotEmpty(x);
It has been my observation that the only way to force execution right away is to force iteration of the collection. I do this by calling .ToArray() on my LINQ.
Generally methods that return a sequence use deferred execution:
IEnumerable<X> ---> Select ---> IEnumerable<Y>
and methods that return a single object don't:
IEnumerable<X> ---> First ---> Y
So, methods like Where, Select, Take, Skip, GroupBy and OrderBy use deferred execution because they can, while methods like First, Single, ToList and ToArray don't because they can't.
from here
.Select(...) is always deferred.
When you're working with IQueryable<T>, this and other deferred execution methods build up an expression tree and this isn't ever compiled into an actual executable expression until it's iterated. That is, you need to:
Do a for-each on the projected enumerable.
Call a method that internally enumerates the enumerable (i.e. .Any(...), .Count(...), .ToList(...), ...).
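The claim that projecting into a custom class does not change anything can be checked with a counter. This sketch uses LINQ to Objects over an array rather than the `dc.myTables` source from the question (an assumption, since no database is available here); `DeferredProjection` and `Projections` are hypothetical names, and `MyCustomClass` is given a made-up `Value` property.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class MyCustomClass
{
    public int Value { get; set; }
}

public static class DeferredProjection
{
    // Hypothetical counter: how many times the projection lambda has run.
    public static int Projections;

    private static IEnumerable<MyCustomClass> BuildQuery()
    {
        return new[] { 1, 2, 3 }
            .Where(n => n > 1)
            .Select(n =>
            {
                Projections++;
                return new MyCustomClass { Value = n };
            });
    }

    // Building the query does not run the projection.
    public static int ProjectionsBeforeEnumeration()
    {
        Projections = 0;
        var query = BuildQuery();
        return Projections;
    }

    // ToArray forces iteration, so the projection runs for each match.
    public static int ProjectionsAfterToArray()
    {
        Projections = 0;
        var items = BuildQuery().ToArray();
        return Projections;
    }
}
```

`ProjectionsBeforeEnumeration()` returns 0 and `ProjectionsAfterToArray()` returns 2, regardless of the fact that the projection creates a named class instead of an anonymous type.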

When NOT to use yield (return) [duplicate]

Closed 12 years ago.
This question already has an answer here:
Is there ever a reason to not use 'yield return' when returning an IEnumerable?
There are several useful questions here on SO about the benefits of yield return. For example,
Can someone demystify the yield keyword
Interesting use of the c# yield keyword
What is the yield keyword
I'm looking for thoughts on when NOT to use yield return. For example, if I expect to need to return all items in a collection, it doesn't seem like yield would be useful, right?
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
It's a good idea to think carefully about your use of "yield return" when dealing with recursively defined structures. For example, I often see this:
public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
    if (root == null) yield break;
    yield return root.Value;
    foreach(T item in PreorderTraversal(root.Left))
        yield return item;
    foreach(T item in PreorderTraversal(root.Right))
        yield return item;
}
Perfectly sensible-looking code, but it has performance problems. Suppose the tree is h deep. Then there will at most points be O(h) nested iterators built. Calling "MoveNext" on the outer iterator will then make O(h) nested calls to MoveNext. Since it does this O(n) times for a tree with n items, that makes the algorithm O(hn). And since the height of a binary tree is lg n <= h <= n, that means that the algorithm is at best O(n lg n) and at worst O(n^2) in time, and best case O(lg n) and worst case O(n) in stack space. It is O(h) in heap space because each enumerator is allocated on the heap. (On implementations of C# I'm aware of; a conforming implementation might have other stack or heap space characteristics.)
But iterating a tree can be O(n) in time and O(1) in stack space. You can write this instead like:
public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
    var stack = new Stack<Tree<T>>();
    stack.Push(root);
    while (stack.Count != 0)
    {
        var current = stack.Pop();
        if (current == null) continue;
        yield return current.Value;
        // Push right first so that the left subtree is popped (visited) first,
        // matching the order of the recursive version.
        stack.Push(current.Right);
        stack.Push(current.Left);
    }
}
which still uses yield return, but is much smarter about it. Now we are O(n) in time and O(h) in heap space, and O(1) in stack space.
Further reading: see Wes Dyer's article on the subject:
http://blogs.msdn.com/b/wesdyer/archive/2007/03/23/all-about-iterators.aspx
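To try the explicit-stack traversal, a node type is needed; the answer does not show `Tree<T>`, so the one below is a hypothetical minimal version. In this sketch the right child is pushed before the left one so that the left subtree is visited first, which gives the same order as the recursive traversal.

```csharp
using System.Collections.Generic;

// Hypothetical minimal node type, assumed for illustration only.
public sealed class Tree<T>
{
    public T Value;
    public Tree<T> Left;
    public Tree<T> Right;

    public Tree(T value, Tree<T> left = null, Tree<T> right = null)
    {
        Value = value;
        Left = left;
        Right = right;
    }
}

public static class TreeTraversal
{
    // One iterator and one explicit stack, however deep the tree is.
    public static IEnumerable<T> Preorder<T>(Tree<T> root)
    {
        var stack = new Stack<Tree<T>>();
        stack.Push(root);
        while (stack.Count != 0)
        {
            var current = stack.Pop();
            if (current == null) continue;
            yield return current.Value;
            stack.Push(current.Right); // popped second
            stack.Push(current.Left);  // popped first
        }
    }
}
```

For a tree with root 1, left child 2 (with children 4 and 5) and right child 3, `TreeTraversal.Preorder` yields 1, 2, 4, 5, 3.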
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
I can think of a couple of cases, IE:
Avoid using yield return when you return an existing iterator. Example:
// Don't do this, it creates overhead for no reason
// (a new state machine needs to be generated)
public IEnumerable<string> GetKeys()
{
    foreach(string key in _someDictionary.Keys)
        yield return key;
}
// DO this
public IEnumerable<string> GetKeys()
{
    return _someDictionary.Keys;
}
Avoid using yield return when you don't want to defer execution of the method's code. Example:
// Don't do this, the exception won't get thrown until the iterator is
// iterated, which can be very far away from this method invocation
public IEnumerable<string> Foo(Bar baz)
{
    if (baz == null)
        throw new ArgumentNullException();
    yield ...
}
// DO this
public IEnumerable<string> Foo(Bar baz)
{
    if (baz == null)
        throw new ArgumentNullException();
    return new BazIterator(baz);
}
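A common way to get eager validation while keeping yield is to split the method in two: a plain wrapper that checks arguments, and a private iterator that does the work. The sketch below is one possible shape of that pattern, not the answer's own code; the word-splitting methods and the `ThrowsAtCallSite` helper are hypothetical examples.

```csharp
using System;
using System.Collections.Generic;

public static class EagerValidation
{
    // The check lives inside the iterator body, so it only runs
    // when the caller starts enumerating.
    public static IEnumerable<string> WordsLazy(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        foreach (var word in text.Split(' '))
            yield return word;
    }

    // No yield here, so the check runs as soon as the method is called.
    public static IEnumerable<string> WordsEager(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return WordsIterator(text);
    }

    private static IEnumerable<string> WordsIterator(string text)
    {
        foreach (var word in text.Split(' '))
            yield return word;
    }

    // Helper for experimenting: does calling the method with null throw
    // immediately, without enumerating the result?
    public static bool ThrowsAtCallSite(Func<string, IEnumerable<string>> method)
    {
        try
        {
            method(null);
            return false;
        }
        catch (ArgumentNullException)
        {
            return true;
        }
    }
}
```

`ThrowsAtCallSite(EagerValidation.WordsLazy)` returns false, because the exception is postponed until enumeration, while `ThrowsAtCallSite(EagerValidation.WordsEager)` returns true.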
The key thing to realize is what yield is useful for, then you can decide which cases do not benefit from it.
In other words, when you do not need a sequence to be lazily evaluated you can skip the use of yield. When would that be? It would be when you do not mind immediately having your entire collection in memory. Otherwise, if you have a huge sequence that would negatively impact memory, you would want to use yield to work on it step by step (i.e., lazily). A profiler might come in handy when comparing both approaches.
Notice how most LINQ statements return an IEnumerable<T>. This allows us to continually string different LINQ operations together without negatively impacting performance at each step (aka deferred execution). The alternative picture would be putting a ToList() call in between each LINQ statement. This would cause each preceding LINQ statement to be immediately executed before performing the next (chained) LINQ statement, thereby forgoing any benefit of lazy evaluation and utilizing the IEnumerable<T> till needed.
There are a lot of excellent answers here. I would add this one: Don't use yield return for small or empty collections where you already know the values:
IEnumerable<UserRight> GetSuperUserRights() {
    if(SuperUsersAllowed) {
        yield return UserRight.Add;
        yield return UserRight.Edit;
        yield return UserRight.Remove;
    }
}
In these cases the creation of the Enumerator object is more expensive, and more verbose, than just generating a data structure.
IEnumerable<UserRight> GetSuperUserRights() {
    return SuperUsersAllowed
        ? new[] {UserRight.Add, UserRight.Edit, UserRight.Remove}
        : Enumerable.Empty<UserRight>();
}
Update
Here's the results of my benchmark:
These results show how long it took (in milliseconds) to perform the operation 1,000,000 times. Smaller numbers are better.
In revisiting this, the performance difference isn't significant enough to worry about, so you should go with whatever is the easiest to read and maintain.
Update 2
I'm pretty sure the above results were achieved with compiler optimization disabled. Running in Release mode with a modern compiler, it appears performance is practically indistinguishable between the two. Go with whatever is most readable to you.
Eric Lippert raises a good point (too bad C# doesn't have stream flattening like Cω). I would add that sometimes the enumeration process is expensive for other reasons, and therefore you should use a list if you intend to iterate over the IEnumerable more than once.
For example, LINQ-to-objects is built on "yield return". If you've written a slow LINQ query (e.g. that filters a large list into a small list, or that does sorting and grouping), it may be wise to call ToList() on the result of the query in order to avoid enumerating multiple times (which actually executes the query multiple times).
If you are choosing between "yield return" and List<T> when writing a method, consider: is each single element expensive to compute, and will the caller need to enumerate the results more than once? If you know the answers are yes and yes, you shouldn't use yield return (unless, for example, the List produced is very large and you can't afford the memory it would use. Remember, another benefit of yield is that the result list doesn't have to be entirely in memory at once).
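The cost of enumerating a lazy query more than once can be made visible by counting how often the predicate runs. This is a small sketch with hypothetical names (`RepeatedEnumeration`, `PredicateCalls`, `SlowQuery`); the cheap predicate stands in for an expensive computation.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RepeatedEnumeration
{
    // Hypothetical counter: how many times the predicate has been evaluated.
    public static int PredicateCalls;

    private static IEnumerable<int> SlowQuery()
    {
        return Enumerable.Range(0, 5).Where(n =>
        {
            PredicateCalls++; // stand-in for expensive work
            return n % 2 == 0;
        });
    }

    // Each enumeration of the lazy query re-runs the predicate.
    public static int CallsWhenQueryEnumeratedTwice()
    {
        PredicateCalls = 0;
        var query = SlowQuery();
        query.Sum();
        query.Sum();
        return PredicateCalls;
    }

    // ToList runs the query once; later passes read the stored results.
    public static int CallsWhenMaterialisedOnce()
    {
        PredicateCalls = 0;
        var list = SlowQuery().ToList();
        list.Sum();
        list.Sum();
        return PredicateCalls;
    }
}
```

`CallsWhenQueryEnumeratedTwice()` returns 10 and `CallsWhenMaterialisedOnce()` returns 5.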
Another reason not to use "yield return" is if interleaving operations is dangerous. For example, if your method looks something like this,
IEnumerable<T> GetMyStuff() {
    foreach (var x in MyCollection)
        if (...)
            yield return (...);
}
this is dangerous if there is a chance that MyCollection will change because of something the caller does:
foreach(T x in GetMyStuff()) {
    if (...)
        MyCollection.Add(...);
    // Oops, now GetMyStuff() will throw an exception
    // because MyCollection was modified.
}
yield return can cause trouble whenever the caller changes something that the yielding function assumes does not change.
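A concrete version of this hazard, with the elided conditions filled in by hypothetical ones (an even-number filter over a `List<int>`), might look like the following. It also shows one way around the problem: take a snapshot with ToList before the caller starts modifying the collection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Interleaving
{
    // Streams matching items while the list is still being enumerated.
    private static IEnumerable<int> Evens(List<int> source)
    {
        foreach (var x in source)
            if (x % 2 == 0)
                yield return x;
    }

    // The caller adds to the list between two MoveNext calls of the iterator,
    // so the list's enumerator detects the change and throws.
    public static bool AddingWhileStreamingThrows()
    {
        var items = new List<int> { 1, 2, 3 };
        try
        {
            foreach (var x in Evens(items))
                items.Add(x * 10);
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // ToList finishes enumerating the source before the loop body runs.
    public static int CountAfterAddingOverSnapshot()
    {
        var items = new List<int> { 1, 2, 3 };
        foreach (var x in Evens(items).ToList())
            items.Add(x * 10);
        return items.Count;
    }
}
```

`AddingWhileStreamingThrows()` returns true, and `CountAfterAddingOverSnapshot()` returns 4 (the original three items plus 20).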
I would avoid using yield return if the method has a side effect that you expect on calling the method. This is due to the deferred execution that Pop Catalin mentions.
One side effect could be modifying the system, which could happen in a method like IEnumerable<Foo> SetAllFoosToCompleteAndGetAllFoos(), which breaks the single responsibility principle. That's pretty obvious (now...), but a not so obvious side effect could be setting a cached result or similar as an optimisation.
My rules of thumb (again, now...) are:
Only use yield if the object being returned requires a bit of processing
No side effects in the method if I need to use yield
If I have to have side effects (and limiting that to caching etc), don't use yield and make sure the benefits of expanding the iteration outweigh the costs
Yield would be limiting/unnecessary when you need random access. If you need to access element 0 then element 99, you've pretty much eliminated the usefulness of lazy evaluation.
One that might catch you out is if you are serialising the results of an enumeration and sending them over the wire. Because the execution is deferred until the results are needed, you will serialise an empty enumeration and send that back instead of the results you want.
I have to maintain a pile of code from a guy who was absolutely obsessed with yield return and IEnumerable. The problem is that a lot of third party APIs we use, as well as a lot of our own code, depend on Lists or Arrays. So I end up having to do:
IEnumerable<foo> myFoos = getSomeFoos();
List<foo> fooList = new List<foo>(myFoos);
thirdPartyApi.DoStuffWithArray(fooList.ToArray());
Not necessarily bad, but kind of annoying to deal with, and on a few occasions it's led to creating duplicate Lists in memory to avoid refactoring everything.
When you don't want a code block to return an iterator for sequential access to an underlying collection, you don't need yield return. You simply return the collection then.
If you're defining a Linq-y extension method where you're wrapping actual Linq members, those members will more often than not return an iterator. Yielding through that iterator yourself is unnecessary.
Beyond that, you can't really get into much trouble using yield to define a "streaming" enumerable that is evaluated on a JIT basis.

Combining LINQ statements for efficiency

Regarding LINQ to Objects: if I use a .Where(x => x....) and then straight afterwards use a .SkipWhile(x => x...), does this incur a performance penalty because I am going over the collection twice?
Should I find a way to put everything in the Where clause or in the SkipWhile clause?
There will be a minor inefficiency due to chaining iterators together, but it really will be very minor. (In particular, although each matching item will be seen by both operators, they won't be buffered up or anything like that. LINQ to Objects isn't going to create a new list of all matching items and then run SkipWhile over it.)
If this is so performance-critical, you'd probably get a very slight speed bump by not using LINQ in the first place. In every other case, write the simplest code first and only worry about micro-optimisations like this when you've proved it's a bottleneck.
Using a Where and a SkipWhile doesn't result in "going over the collection twice." LINQ to Objects works on a pull model. When you enumerate the combined query, the SkipWhile will start asking its source for elements. Its source is the Where, so this will cause the Where to start asking its source in turn for elements. So the SkipWhile will see all elements that pass the Where clause, but it's getting them as it goes. The upshot is that LINQ does a foreach over the original collection, returning only elements that pass both the Where and SkipWhile filters -- and this involves only a single pass over the collection.
There may be a trivial loss of efficiency because there are two iterators involved, but it is unlikely to be significant. You should write the code to be clear (as you are doing at the moment), and if you suspect that the clear version is causing a performance issue, measure to make sure, and only then try combining the clauses.
As with most things, the answer is that it depends on what you're doing. If you have multiple Where clauses that operate on the same object, it's probably worth combining them with &&, for example.
Most of the LINQ operators won't iterate over the entire collection per operator, they merely process one item and pass it onto the next operator. There are exceptions to this such as Reverse and OrderBy, but generally if you were using Where and SkipWhile for example you would have a chain that would process one item at a time. Now your first Where statement could obviously filter out some items, so SkipWhile wouldn't see an item until it passed through the preceding operator.
My personal preference is to keep operators separate for clarity and combine them only if performance becomes an issue.
You aren't going over the collection twice when you use Where and SkipWhile.
The Where method will stream its output to the SkipWhile method one item at a time, and likewise the SkipWhile method will stream its output to any subsequent method one item at a time.
(There will be a small overhead because the compiler generates separate iterator objects for each method behind the scenes. But if I was worried about the overhead of compiler-generated iterators then I probably wouldn't be using LINQ in the first place.)
No, there's (essentially) no performance penalty. That's what lazy (deferred) execution is all about.
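The single-pass, pull-based behaviour described in these answers can be checked by counting how many items are pulled from the source. The sketch below uses a hypothetical source of the numbers 1 to 6 and made-up predicates (even numbers, then skip while less than 4); the names `SinglePass`, `Pulled` and `Source` are not from the original.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SinglePass
{
    // Hypothetical counter: number of items pulled from the source.
    public static int Pulled;

    private static IEnumerable<int> Source()
    {
        for (var i = 1; i <= 6; i++)
        {
            Pulled++;
            yield return i;
        }
    }

    private static IEnumerable<int> Query()
    {
        return Source().Where(n => n % 2 == 0).SkipWhile(n => n < 4);
    }

    // Enumerating everything pulls each source item exactly once.
    public static int PulledForFullEnumeration()
    {
        Pulled = 0;
        Query().ToList();
        return Pulled;
    }

    // Asking only for the first result stops pulling as soon as it is found.
    public static int PulledForFirstResult()
    {
        Pulled = 0;
        Query().First();
        return Pulled;
    }
}
```

`PulledForFullEnumeration()` returns 6, not 12, and `PulledForFirstResult()` returns 4, because the first result (4) is produced as soon as the fourth source item has passed through both operators.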
