Is there ever a reason to not use 'yield return' when returning an IEnumerable?
There are several useful questions here on SO about the benefits of yield return. For example:
Can someone demystify the yield keyword
Interesting use of the c# yield keyword
What is the yield keyword
I'm looking for thoughts on when NOT to use yield return. For example, if I expect to need to return all items in a collection, it doesn't seem like yield would be useful, right?
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
It's a good idea to think carefully about your use of "yield return" when dealing with recursively defined structures. For example, I often see this:
public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
    if (root == null) yield break;
    yield return root.Value;
    foreach (T item in PreorderTraversal(root.Left))
        yield return item;
    foreach (T item in PreorderTraversal(root.Right))
        yield return item;
}
Perfectly sensible-looking code, but it has performance problems. Suppose the tree is h deep. Then at some point there will be O(h) nested iterators built. Calling "MoveNext" on the outer iterator then makes O(h) nested calls to MoveNext. Since it does this O(n) times for a tree with n items, that makes the algorithm O(hn). And since the height of a binary tree satisfies lg n <= h <= n, the algorithm is at best O(n lg n) and at worst O(n^2) in time, and best case O(lg n) and worst case O(n) in stack space. It is O(h) in heap space because each enumerator is allocated on the heap. (On implementations of C# I'm aware of; a conforming implementation might have other stack or heap space characteristics.)
But iterating a tree can be O(n) in time and O(1) in stack space. You can instead write it like this:
public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
    var stack = new Stack<Tree<T>>();
    stack.Push(root);
    while (stack.Count != 0)
    {
        var current = stack.Pop();
        if (current == null) continue;
        yield return current.Value;
        stack.Push(current.Left);
        stack.Push(current.Right);
    }
}
which still uses yield return, but is much smarter about it. Now we are O(n) in time and O(h) in heap space, and O(1) in stack space.
Further reading: see Wes Dyer's article on the subject:
http://blogs.msdn.com/b/wesdyer/archive/2007/03/23/all-about-iterators.aspx
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
I can think of a couple of cases, for example:
Avoid using yield return when you return an existing iterator. Example:
// Don't do this; it creates overhead for no reason
// (a new state machine needs to be generated).
public IEnumerable<string> GetKeys()
{
    foreach (string key in _someDictionary.Keys)
        yield return key;
}

// DO this
public IEnumerable<string> GetKeys()
{
    return _someDictionary.Keys;
}
Avoid using yield return when you don't want to defer execution of the method's code. Example:
// Don't do this; the exception won't be thrown until the iterator is
// iterated, which can be very far away from this method invocation.
public IEnumerable<string> Foo(Bar baz)
{
    if (baz == null)
        throw new ArgumentNullException();
    yield ...
}

// DO this
public IEnumerable<string> Foo(Bar baz)
{
    if (baz == null)
        throw new ArgumentNullException();
    return new BazIterator(baz);
}
The key thing is to realize what yield is useful for; then you can decide which cases do not benefit from it.
In other words, when you do not need a sequence to be lazily evaluated you can skip the use of yield. When would that be? It would be when you do not mind immediately having your entire collection in memory. Otherwise, if you have a huge sequence that would negatively impact memory, you would want to use yield to work on it step by step (i.e., lazily). A profiler might come in handy when comparing both approaches.
Notice how most LINQ statements return an IEnumerable<T>. This allows us to continually string different LINQ operations together without negatively impacting performance at each step (aka deferred execution). The alternative picture would be putting a ToList() call in between each LINQ statement. This would cause each preceding LINQ statement to be immediately executed before performing the next (chained) LINQ statement, thereby forgoing the benefit of lazy evaluation, where the IEnumerable<T> is not evaluated until it is needed.
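To make the contrast concrete, here is a minimal sketch (the numbers source is hypothetical) of a fully deferred chain versus one that materializes at every step:

using System;
using System.Collections.Generic;
using System.Linq;

class DeferredVsEager
{
    static void Main()
    {
        IEnumerable<int> numbers = Enumerable.Range(1, 1000000);

        // Deferred: no element is produced until the foreach below asks for one,
        // and Take(5) means only a handful of elements are ever computed.
        IEnumerable<int> deferred = numbers
            .Where(n => n % 2 == 0)
            .Select(n => n * n)
            .Take(5);

        // Eager: each ToList() forces the preceding operator to run to completion
        // and buffers the entire intermediate result in memory.
        List<int> eager = numbers
            .Where(n => n % 2 == 0).ToList()
            .Select(n => n * n).ToList()
            .Take(5).ToList();

        foreach (var n in deferred)
            Console.WriteLine(n);
    }
}

Here the eager version has already walked the full range and buffered half a million intermediate results before the loop even starts; the deferred version computes only the handful of elements the loop asks for.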
There are a lot of excellent answers here. I would add this one: Don't use yield return for small or empty collections where you already know the values:
IEnumerable<UserRight> GetSuperUserRights()
{
    if (SuperUsersAllowed)
    {
        yield return UserRight.Add;
        yield return UserRight.Edit;
        yield return UserRight.Remove;
    }
}
In these cases the creation of the Enumerator object is more expensive, and more verbose, than just generating a data structure.
IEnumerable<UserRight> GetSuperUserRights()
{
    return SuperUsersAllowed
        ? new[] { UserRight.Add, UserRight.Edit, UserRight.Remove }
        : Enumerable.Empty<UserRight>();
}
Update
Here are the results of my benchmark: they show how long it took (in milliseconds) to perform the operation 1,000,000 times. Smaller numbers are better.
In revisiting this, the performance difference isn't significant enough to worry about, so you should go with whatever is the easiest to read and maintain.
Update 2
I'm pretty sure the above results were achieved with compiler optimization disabled. Running in Release mode with a modern compiler, it appears performance is practically indistinguishable between the two. Go with whatever is most readable to you.
Eric Lippert raises a good point (too bad C# doesn't have stream flattening like Cω). I would add that sometimes the enumeration process is expensive for other reasons, and therefore you should use a list if you intend to iterate over the IEnumerable more than once.
For example, LINQ-to-objects is built on "yield return". If you've written a slow LINQ query (e.g. that filters a large list into a small list, or that does sorting and grouping), it may be wise to call ToList() on the result of the query in order to avoid enumerating multiple times (which actually executes the query multiple times).
If you are choosing between "yield return" and List<T> when writing a method, consider: is each single element expensive to compute, and will the caller need to enumerate the results more than once? If you know the answers are yes and yes, you shouldn't use yield return (unless, for example, the List produced is very large and you can't afford the memory it would use. Remember, another benefit of yield is that the result list doesn't have to be entirely in memory at once).
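As a sketch of that advice (the slow predicate below is a hypothetical stand-in for genuinely expensive per-element work), buffering the query once avoids re-executing it on the second pass:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class CacheExpensiveQuery
{
    // Stand-in for real, expensive per-element work.
    static bool ExpensivePredicate(int n)
    {
        Thread.Sleep(1);
        return n % 7 == 0;
    }

    static void Main()
    {
        IEnumerable<int> query = Enumerable.Range(1, 1000).Where(ExpensivePredicate);

        // Without this ToList(), each of the two passes below would
        // re-run the expensive predicate over all 1000 elements.
        List<int> cached = query.ToList();

        Console.WriteLine(cached.Count); // first pass over the buffered results
        Console.WriteLine(cached.Sum()); // second pass, no recomputation
    }
}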
Another reason not to use "yield return" is if interleaving operations is dangerous. For example, if your method looks something like this,
IEnumerable<T> GetMyStuff()
{
    foreach (var x in MyCollection)
        if (...)
            yield return (...);
}
this is dangerous if there is a chance that MyCollection will change because of something the caller does:
foreach (T x in GetMyStuff())
{
    if (...)
        MyCollection.Add(...);
    // Oops, now GetMyStuff() will throw an exception
    // because MyCollection was modified.
}
yield return can cause trouble whenever the caller changes something that the yielding function assumes does not change.
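A common mitigation, sketched below with a hypothetical odd-number filter, is to snapshot the results before mutating the underlying collection:

using System;
using System.Collections.Generic;
using System.Linq;

class SnapshotExample
{
    static readonly List<int> MyCollection = new List<int> { 1, 2, 3 };

    static IEnumerable<int> GetMyStuff()
    {
        foreach (var x in MyCollection)
            if (x % 2 == 1)
                yield return x;
    }

    static void Main()
    {
        // Iterating GetMyStuff() directly while adding to MyCollection would
        // throw InvalidOperationException. ToList() snapshots the results first,
        // so the live iterator over MyCollection finishes before we mutate it.
        foreach (var x in GetMyStuff().ToList())
            MyCollection.Add(x + 10);

        Console.WriteLine(string.Join(", ", MyCollection)); // 1, 2, 3, 11, 13
    }
}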
I would avoid using yield return if the method has a side effect that you expect to occur when the method is called. This is due to the deferred execution that Pop Catalin mentions.
One side effect could be modifying the system, which could happen in a method like IEnumerable<Foo> SetAllFoosToCompleteAndGetAllFoos(), which breaks the single responsibility principle. That's pretty obvious (now...), but a not so obvious side effect could be setting a cached result or similar as an optimisation.
My rules of thumb (again, now...) are:
Only use yield if the object being returned requires a bit of processing
No side effects in the method if I need to use yield
If I have to have side effects (limiting those to caching and the like), I don't use yield, and I make sure the benefits of expanding the iteration outweigh the costs
Yield would be limiting/unnecessary when you need random access. If you need to access element 0 then element 99, you've pretty much eliminated the usefulness of lazy evaluation.
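For illustration, a small sketch: ElementAt on a yield-based sequence walks from the start on every call, so buffering once is the better choice when random access is needed (Squares is a hypothetical generator):

using System;
using System.Collections.Generic;
using System.Linq;

class RandomAccessExample
{
    // A lazily evaluated sequence; there is no index to jump to.
    static IEnumerable<int> Squares(int count)
    {
        for (int n = 0; n < count; n++)
            yield return n * n;
    }

    static void Main()
    {
        var sequence = Squares(100);

        // Each ElementAt call restarts the iterator: 1 element is
        // computed for the first call, 100 for the second.
        Console.WriteLine(sequence.ElementAt(0));
        Console.WriteLine(sequence.ElementAt(99));

        // Buffering once gives genuine O(1) random access afterwards.
        int[] buffered = sequence.ToArray();
        Console.WriteLine(buffered[0]);
        Console.WriteLine(buffered[99]);
    }
}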
One that might catch you out is if you are serialising the results of an enumeration and sending them over the wire. Because the execution is deferred until the results are needed, you will serialise an empty enumeration and send that back instead of the results you want.
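A minimal sketch of the fix (Serialize is a hypothetical stand-in for whatever serializer is in use): force enumeration before handing the data to the wire.

using System;
using System.Collections.Generic;
using System.Linq;

class SerializationExample
{
    static IEnumerable<int> GetResults()
    {
        yield return 1;
        yield return 2;
    }

    // Hypothetical stand-in for a real wire-format serializer.
    static string Serialize<T>(IReadOnlyCollection<T> items)
    {
        return "[" + string.Join(",", items) + "]";
    }

    static void Main()
    {
        // ToList() runs the iterator now, so the serializer sees the actual
        // results rather than an unevaluated iterator object.
        List<int> materialized = GetResults().ToList();
        Console.WriteLine(Serialize(materialized)); // [1,2]
    }
}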
I have to maintain a pile of code from a guy who was absolutely obsessed with yield return and IEnumerable. The problem is that a lot of third party APIs we use, as well as a lot of our own code, depend on Lists or Arrays. So I end up having to do:
IEnumerable<foo> myFoos = getSomeFoos();
List<foo> fooList = new List<foo>(myFoos);
thirdPartyApi.DoStuffWithArray(fooList.ToArray());
Not necessarily bad, but kind of annoying to deal with, and on a few occasions it's led to creating duplicate Lists in memory to avoid refactoring everything.
When you don't want a code block to return an iterator for sequential access to an underlying collection, you don't need yield return. You simply return the collection instead.
If you're defining a Linq-y extension method where you're wrapping actual Linq members, those members will more often than not return an iterator. Yielding through that iterator yourself is unnecessary.
Beyond that, you can't really get into much trouble using yield to define a "streaming" enumerable that is evaluated lazily, on demand.
I'm not sure that this has a huge effect on modern-day computers. However, it is interesting to know.
I've been reading these resources trying to find out whether the overhead of the iterator state machine is larger than that of just returning a list. The emphasis is not on the ability to lazy-load; in my example it makes no difference when it is evaluated.
Resources:
C# yield return performance
When NOT to use yield (return)
Is there ever a reason to not use 'yield return' when returning an IEnumerable?
https://blogs.msdn.microsoft.com/wesdyer/2007/03/23/all-about-iterators/
https://coding.abel.nu/2011/12/return-ienumerable-with-yield-return/
Is 'yield return' slower than "old school" return?
Now the last resource is probably the closest.
It comes down to this: if the length of the list is known and constant, is the footprint of the IEnumerable<T> state machine larger than that of a List<T>?
So:
public IEnumerable<CustomClass> GetCustoms(string input)
{
    foreach (var ch in input.ToCharArray())
        yield return new CustomClass(ch);
}
does the state machine create a larger overhead than:
public IEnumerable<CustomClass> GetCustoms(string input)
{
    var result = new List<CustomClass>(input.Length);
    foreach (var ch in input.ToCharArray())
        result.Add(new CustomClass(ch));
    return result;
}
Obviously this is a contrived example, but it illustrates the point.
I can imagine that if there were a lot of preprocessing happening before the iteration in these methods, the state machine would carry all that preprocessing overhead.
Logic would say that the allocation of List memory is far less than the allocation of the IEnumerable state machine memory.
Obviously, the bigger the enumeration, the more it starts to flip the other way, and the IEnumerable state machine becomes the smaller footprint.
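One way to put rough numbers on this is a Stopwatch micro-benchmark; a minimal sketch follows, with a trivial stand-in CustomClass (which variant wins, and by how much, will vary with runtime, optimization settings, and input size):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class CustomClass
{
    public readonly char Ch;
    public CustomClass(char ch) { Ch = ch; }
}

class FootprintBenchmark
{
    static IEnumerable<CustomClass> GetCustomsYield(string input)
    {
        foreach (var ch in input)
            yield return new CustomClass(ch);
    }

    static IEnumerable<CustomClass> GetCustomsList(string input)
    {
        var result = new List<CustomClass>(input.Length);
        foreach (var ch in input)
            result.Add(new CustomClass(ch));
        return result;
    }

    static void Main()
    {
        const string input = "hello world";
        const int iterations = 1000000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            foreach (var c in GetCustomsYield(input)) { }
        Console.WriteLine("yield: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            foreach (var c in GetCustomsList(input)) { }
        Console.WriteLine("list:  " + sw.ElapsedMilliseconds + " ms");
    }
}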
Are these observations correct?
I think ReSharper is lying to me.
I have this extension method that (hopefully) returns an xor of two enumerations:
public static IEnumerable<T> Xor<T>(this IEnumerable<T> first, IEnumerable<T> second)
{
    lock (first)
    {
        lock (second)
        {
            var firstAsList = first.ToList();
            var secondAsList = second.ToList();
            return firstAsList.Except(secondAsList).Union(secondAsList.Except(firstAsList));
        }
    }
}
ReSharper thinks I'm performing a multiple enumeration of an IEnumerable, as you can see, on both the arguments. If I remove the locks, then it's satisfied that I'm not.
Is ReSharper right or wrong? I believe it's wrong.
edit: I do realize that I'm enumerating the lists multiple times, but ReSharper is saying I'm enumerating over the original arguments multiple times, which I don't think is true. I'm enumerating both arguments once into a list so I may then perform the actual set manipulation, but as I see it, I'm not actually iterating over the arguments passed multiple times.
For example, if the passed arguments are actually query results, my belief is this method won't cause a storm of queries to be executed by the set manipulation. I do understand what ReSharper means by warning of multiple enumeration: if the enumerables passed are heavy to generate, then if they're enumerated multiple times, then performing multiple enumerations on them will be much slower.
Also, removing the locks definitely makes ReSharper happier.
You are indeed enumerating both of the lists multiple times. You are not enumerating the enumerables passed as parameters multiple times. Both lists are enumerated once for each call to Except. The call to Union is not enumerating either sequence an additional time, but rather is enumerating the results of the two calls to Except.
Of course, iterating a List multiple times in a context like this isn't really a problem; there aren't negative consequences to iterating an unchanging list multiple times.
The lock statements have nothing whatsoever to do with enumeration of the sequences. Locking on an IEnumerable does not iterate it. Of course, locking on two objects like this, specifically two objects that are not limited in scope to this section of code, is very dangerous. It's quite possible to end up deadlocking the program with locks used in this manner if code elsewhere (such as another invocation of this method) ends up taking locks on the same objects in the opposite order.
This is a bit of a funny one.
First things first: as you've correctly identified, R# is raising this inspection not against the multiple usages of the Lists - there is of course nothing to worry about in multiply enumerating a List - but against (what R# sees as multiple usages of) the IEnumerable arguments. I'm presuming you already know why this would be potentially bad, so I'll skip that.
Now to the question of whether R# is right to complain here. To quote the C# spec,
A lock statement of the form

lock (x) ...

where x is an expression of a reference-type, is precisely equivalent to

System.Threading.Monitor.Enter(x);
try {
    ...
}
finally {
    System.Threading.Monitor.Exit(x);
}

except that x is only evaluated once.
(I've put in this emphasis because I like this wording; it avoids debates (that I'm definitely not qualified to enter) about whether this is "syntactic sugar" or not.)
Taking a minimal example which produces this R# inspection:
private static void Method(IEnumerable<int> enumerable)
{
    lock (enumerable)
    {
        var list = enumerable.ToList();
    }
}
and replacing it by what I think is the precisely equivalent version as mandated by the spec:
private static void Method(IEnumerable<int> enumerable)
{
    var x = enumerable;
    System.Threading.Monitor.Enter(x);
    try
    {
        var list = enumerable.ToList();
    }
    finally
    {
        System.Threading.Monitor.Exit(x);
    }
}
also produces the inspection.
The question then is: is R# right to produce this inspection? And this is where I think we get into a grey area. When I pass the following enumerable to either of these methods:
static IEnumerable<int> MyEnumerable()
{
    Console.WriteLine("Enumerable enumerated");
    yield return 1;
    yield return 2;
}
it is not multiply enumerated, which would suggest that R# is wrong to warn here; however, I can't actually find anything in documentation that guarantees this behaviour of either lock or Monitor.Enter. So for me it's not quite as clear-cut as this R# bug I reported, where use of GetType flagged this inspection; but nonetheless I'd guess you're safe.
If you raise this on the R# bug tracker, you can get JetBrains' finest looking at a) whether this behaviour is indeed guaranteed, and b) whether R# can be adjusted to either not warn, or provide a justification for warning.
That said, of course, using locking here probably isn't actually achieving what you want to achieve, as stated in other answers and comments...
I came across this implementation in Enumerable.cs using Reflector.
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    // check parameters
    TSource local = default(TSource);
    long num = 0L;
    foreach (TSource local2 in source)
    {
        if (predicate(local2))
        {
            local = local2;
            num += 1L;
            // I think they should do something here like:
            // if (num >= 2L) throw Error.MoreThanOneMatch();
            // no need to continue
        }
    }
    // return different results based on num's value
}
I think they should break out of the loop if two or more items meet the condition; why do they always loop through the whole collection? In case Reflector disassembled the DLL incorrectly, I wrote a simple test:
class DataItem
{
    private int _num;
    public DataItem(int num)
    {
        _num = num;
    }
    public int Num
    {
        get { Console.WriteLine("getting " + _num); return _num; }
    }
}
var source = Enumerable.Range(1,10).Select( x => new DataItem(x));
var result = source.Single(x => x.Num < 5);
For this test case, I expected it to print "getting 1, getting 2" and then throw an exception. But in fact it keeps printing "getting 1" through "getting 10" and then throws an exception.
Is there any algorithmic reason they implement this method like this?
EDIT Some of you thought it's because of side effects of the predicate expression; after deeper thought and some test cases, I have concluded that side effects don't matter in this case. Please provide an example if you disagree with this conclusion.
Yes, I do find it slightly strange especially because the overload that doesn't take a predicate (i.e. works on just the sequence) does seem to have the quick-throw 'optimization'.
In the BCL's defence however, I would say that the InvalidOperation exception that Single throws is a boneheaded exception that shouldn't normally be used for control-flow. It's not necessary for such cases to be optimized by the library.
Code that uses Single where zero or multiple matches is a perfectly valid possibility, such as:
try
{
    var item = source.Single(predicate);
    DoSomething(item);
}
catch (InvalidOperationException)
{
    DoSomethingElseUnexceptional();
}
should be refactored to code that doesn't use the exception for control-flow, such as (only a sample; this can be implemented more efficiently):
var firstTwo = source.Where(predicate).Take(2).ToArray();
if (firstTwo.Length == 1)
{
    // Note that this won't fail. If it does, this code has a bug.
    DoSomething(firstTwo.Single());
}
else
{
    DoSomethingElseUnexceptional();
}
In other words, we should leave the use of Single to cases where we expect the sequence to contain only one match. It should behave identically to First, but with the additional run-time assertion that the sequence doesn't contain multiple matches. Like any other assertion, failure, i.e. cases when Single throws, should be used to represent bugs in the program (either in the method running the query or in the arguments passed to it by the caller).
This leaves us with two cases:
The assertion holds: There is a single match. In this case, we want Single to consume the entire sequence anyway to assert our claim. There's no benefit to the 'optimization'. In fact, one could argue that the sample implementation of the 'optimization' provided by the OP will actually be slower because of the check on every iteration of the loop.
The assertion fails: There are zero or multiple matches. In this case, we do throw later than we could, but this isn't such a big deal since the exception is boneheaded: it is indicative of a bug that must be fixed.
To sum up, if the 'poor implementation' is biting you performance-wise in production, either:
You are using Single incorrectly.
You have a bug in your program. Once the bug is fixed, this particular performance problem will go away.
EDIT: Clarified my point.
EDIT: Here's a valid use of Single, where failure indicates bugs in the calling code (bad argument):
public static User GetUserById(this IEnumerable<User> users, string id)
{
    if (users == null)
        throw new ArgumentNullException("users");

    // Perfectly fine if documented that a failure in the query
    // is treated as an exceptional circumstance. Caller's job
    // to guarantee the pre-condition.
    return users.Single(user => user.Id == id);
}
Update:
I got some very good feedback to my answer, which has made me re-think. Thus I will first provide the answer that states my "new" point of view; you can still find my original answer just below. Make sure to read the comments in-between to understand where my first answer misses the point.
New answer:
Let's assume that Single should throw an exception when its pre-condition is not met; that is, when Single detects that either none, or more than one, item in the collection matches the predicate.
Single can only succeed without throwing an exception by going through the whole collection. It has to make sure that there is exactly one matching item, so it will have to check all items in the collection.
This means that throwing an exception early (as soon as it finds a second matching item) is essentially an optimization that you can only benefit from when Single's pre-condition cannot be met and when it will throw an exception.
As user CodeInChaos says clearly in a comment below, the optimization wouldn't be wrong, but it is meaningless, because one usually introduces optimizations that will benefit correctly-working code, not optimizations that will benefit malfunctioning code.
Thus, it is actually correct that Single could throw an exception early; but it doesn't have to, because there's practically no added benefit.
Old answer:
I cannot give a technical reason why that method is implemented the way it is, since I didn't implement it. But I can state my understanding of the Single operator's purpose, and from there draw my personal conclusion that it is indeed badly implemented:
My understanding of Single:
What is the purpose of Single, and how is it different from e.g. First or Last?
Using the Single operator basically expresses one's assumption that exactly one item must be returned from the collection:
If you don't specify a predicate, it should mean that the collection is expected to contain exactly one item.
If you do specify a predicate, it should mean that exactly one item in the collection is expected to satisfy that condition. (Using a predicate should have the same effect as items.Where(predicate).Single().)
This is what makes Single different from other operators such as First, Last, or Take(1). None of those operators have the requirement that there should be exactly one (matching) item.
When should Single throw an exception?
Basically, when it finds that your assumption was wrong; i.e. when the underlying collection does not yield exactly one (matching) item. That is, when there are zero or more than one items.
When should Single be used?
The use of Single is appropriate when your program's logic can guarantee that the collection will yield exactly one item, and one item only. If an exception gets thrown, that should mean that your program's logic contains a bug.
If you process "unreliable" collections, such as I/O input, you should first validate the input before you pass it to Single. Single, together with an exception catch block, is not appropriate for making sure that the collection has only one matching item. By the time you invoke Single, you should already have made sure that there'll be only one matching item.
Conclusion:
The above states my understanding of the Single LINQ operator. If you follow and agree with this understanding, you should come to the conclusion that Single ought to throw an exception as early as possible. There is no reason to wait until the end of the (possibly very large) collection, because the pre-condition of Single is violated as soon as it detects a second (matching) item in the collection.
When considering this implementation we must remember that this is the BCL: general code that is supposed to work well enough in all sorts of scenarios.
First, take these scenarios:
Iterate over 10 numbers, where the first and second elements are equal
Iterate over 1,000,000 numbers, where the first and third elements are equal
The original algorithm will work well enough for 10 items, but for 1,000,000 items there will be a severe waste of cycles. So in cases where we know early in the sequence that there are two or more matches, the proposed optimization would have a nice effect.
Then, look at these scenarios:
Iterate over 10 numbers, where the first and last elements are equal
Iterate over 1,000,000 numbers, where the first and last elements are equal
In these scenarios the algorithm is still required to inspect every item in the list. There is no shortcut. The original algorithm performs well enough; it fulfills the contract. Changing the algorithm by introducing an if on each iteration would actually decrease performance. For 10 items it would be negligible, but for 1,000,000 items it would be a big hit.
IMO, the original implementation is the correct one, since it is good enough for most scenarios. Knowing the implementation of Single is good, though, because it enables us to make smart decisions based on what we know about the sequences we use it on. If performance measurements in one particular scenario show that Single is causing a bottleneck, well: then we can implement our own variant that works better in that particular scenario.
Update: as CodeInChaos and Eamon have correctly pointed out, the if test introduced in the optimization is indeed not performed on each item, only within the predicate match block. I have in my example completely overlooked the fact that the proposed changes will not affect the overall performance of the implementation.
I agree that introducing the optimization would probably benefit all scenarios. It is good to see though that eventually, the decision to implement the optimization is made on the basis of performance measurements.
I think it's a premature optimization "bug".
Why this is NOT reasonable behavior due to side effects
Some have argued that due to side effects, it should be expected that the entire list is evaluated. After all, in the correct case (the sequence indeed has just 1 element) it is completely enumerated, and for consistency with this normal case it's nicer to enumerate the entire sequence in all cases.
Although that's a reasonable argument, it flies in the face of the general practice throughout the LINQ libraries: they use lazy evaluation everywhere. It's not general practice to fully enumerate sequences except where absolutely necessary; indeed, several methods prefer using IList.Count when available over any iteration at all - even when that iteration may have side effects.
Further, .Single() without predicate does not exhibit this behavior: that terminates as soon as possible. If the argument were that .Single() should respect side-effects of enumeration, you'd expect all overloads to do so equivalently.
Why the case for speed doesn't hold
Peter Lillevold made the interesting observation that it may be faster to do...
foreach (var elem in elems)
    if (pred(elem)) {
        retval = elem;
        count++;
    }
if (count != 1) ...
than
foreach (var elem in elems)
    if (pred(elem)) {
        retval = elem;
        count++;
        if (count > 1) ...
    }
if (count == 0) ...
After all, the second version, which exits the iteration as soon as the first conflict is detected, requires an extra test in the loop - a test which in the "correct" case is purely ballast. Neat theory, right?
Except that's not borne out by the numbers; for example, on my machine (YMMV) Enumerable.Range(0,100000000).Where(x=>x==123).Single() is actually faster than Enumerable.Range(0,100000000).Single(x=>x==123)!
It's possibly a JITter quirk of this precise expression on this machine - I'm not claiming that Where followed by predicateless Single is always faster.
But whatever the case, the fail-fast solution is very unlikely to be significantly slower. After all, even in the normal case, we're dealing with a cheap branch: a branch that is never taken and thus easy on the branch predictor. And of course, the branch is only ever encountered when pred holds - that's once per call in the normal case. That cost is simply negligible compared to the cost of the delegate call pred and its implementation, plus the cost of the interface methods .MoveNext() and .get_Current() and their implementations.
It's simply extremely unlikely that you'll notice the performance degradation caused by one predictable branch in comparison to all that other abstraction penalty - not to mention the fact that most sequences and predicates actually do something themselves.
It seems very clear to me.
Single is intended for the case where the caller knows that the enumeration contains exactly one match, since in any other case an expensive exception is thrown.
For this use case, the overload that takes a predicate must iterate over the whole enumeration. It is slightly faster to do so without an additional condition on every loop.
In my view the current implementation is correct: it is optimized for the expected use case of an enumeration that contains exactly one matching element.
That does appear to be a bad implementation, in my opinion.
Just to illustrate the potential severity of the problem:
var oneMillion = Enumerable.Range(1, 1000000)
    .Select(x => { Console.WriteLine(x); return x; });

int firstEven = oneMillion.Single(x => x % 2 == 0);
The above will output all the integers from 1 to 1000000 before throwing an exception.
It's a head-scratcher for sure.
I only found this question after filing a report at https://connect.microsoft.com/VisualStudio/feedback/details/810457/public-static-tsource-single-tsource-this-ienumerable-tsource-source-func-tsource-bool-predicate-doesnt-throw-immediately-on-second-matching-result#
The side-effect argument doesn't hold water, because:
Having side-effects isn't really functional, and they're called Func for a reason.
If you do want side-effects, it makes no more sense to claim the version that has the side-effects throughout the whole sequence is desirable than it does to claim so for the version that throws immediately.
It does not match the behaviour of First or the other overload of Single.
It does not match at least some other implementations of Single, e.g. Linq2SQL uses TOP 2 to ensure that only the two matching cases needed to test for more than one match are returned.
We can construct cases where we should expect a program to halt, but it does not halt.
We can construct cases where OverflowException is thrown, which is not documented behaviour, and hence clearly a bug.
Most importantly of all, if we're in a condition where we expected the sequence to have only one matching element, and yet it doesn't, then something has clearly gone wrong. Apart from the general principle that the only thing you should do upon detecting an error state is clean up before throwing (and this implementation delays that), the case of a sequence having more than one matching element is going to overlap with the case of a sequence having more elements in total than expected - perhaps because the sequence has a bug that is causing it to loop unexpectedly. So it's precisely in one possible set of bugs that should trigger the exception that the exception is most delayed.
Edit:
Peter Lillevold's mention of a repeated test may be a reason why the author chose to take the approach they did, as an optimisation for the non-exceptional case. If so it was needless though, even aside from Eamon Nerbonne showing it wouldn't improve much. There's no need to have a repeated test in the initial loop, as we can just change what we're testing for upon the first match:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    if (source == null)
        throw new ArgumentNullException("source");
    if (predicate == null)
        throw new ArgumentNullException("predicate");
    using (IEnumerator<TSource> en = source.GetEnumerator())
    {
        while (en.MoveNext())
        {
            TSource cur = en.Current;
            if (predicate(cur))
            {
                while (en.MoveNext())
                    if (predicate(en.Current))
                        throw new InvalidOperationException("Sequence contains more than one matching element");
                return cur;
            }
        }
    }
    throw new InvalidOperationException("Sequence contains no matching element");
}
Basically I am wondering whether it's OK to enumerate the same IEnumerable more than once in subsequent code. Whether you break early or not, would the enumeration always reset in each foreach case below, giving a consistent enumeration from start to end of the effects collection?
var effects = this.EffectsRecursive;
foreach (Effect effect in effects)
{
    ...
}
foreach (Effect effect in effects)
{
    if (effect.Name == "Pixelate")
        break;
}
foreach (Effect effect in effects)
{
    ...
}
EDIT: The implementation of EffectsRecursive is this:
public IEnumerable<Effect> Effects
{
    get
    {
        for (int i = 0; i < this.IEffect.NumberOfChildren; ++i)
        {
            IEffect effect = this.IEffect.GetChildEffect(i);
            if (effect != null)
                yield return new Effect(effect);
        }
    }
}

public IEnumerable<Effect> EffectsRecursive
{
    get
    {
        foreach (Effect effect in this.Effects)
        {
            yield return effect;
            foreach (Effect seffect in effect.ChildrenRecursive)
                yield return seffect;
        }
    }
}
Yes this is legal to do. The IEnumerable<T> pattern is meant to support multiple enumerations on a single source. Collections which can only be enumerated once should expose IEnumerator instead.
The code that consumes the sequence is fine. As spender points out, the code that produces the enumeration might have performance problems if the tree is deep.
Suppose at the deepest point your tree is four deep; think about what happens on the nodes that are four deep. To get that node, you iterate the root, which calls an iterator, which calls an iterator, which calls an iterator, which passes the node back to code that passes the node back to code that passes the node back... Instead of just handing the node to the caller, you've made a little bucket brigade with four guys in it, and they're tramping the data around from object to object before it finally gets to the loop that wanted it.
If the tree is only four deep, no big deal probably. But suppose the tree is ten thousand elements, and has a thousand nodes forming a linked list at the top and the remaining nine thousand nodes on the bottom. Now when you iterate those nine thousand nodes each one has to pass through a thousand iterators, for a total of nine million copies to fetch nine thousand nodes. (Of course, you've probably gotten a stack overflow error and crashed the process as well.)
The way to deal with this problem if you have it is to manage the stack yourself rather than pushing new iterators on the stack.
public IEnumerable<Effect> EffectsNotRecursive()
{
    var stack = new Stack<Effect>();
    stack.Push(this);
    while (stack.Count != 0)
    {
        var current = stack.Pop();
        yield return current;
        foreach (var child in current.Effects)
            stack.Push(child);
    }
}
The original implementation has a time complexity of O(nd) where n is the number of nodes and d is the average depth of the tree; since d can in the worst case be O(n), and in the best case be O(lg n), that means that the algorithm is between O(n lg n) and O(n^2) in time. It is O(d) in heap space (for all the iterators) and O(d) in stack space (for all the recursive calls.)
The new implementation has a time complexity of O(n), and is O(d) in heap space, and O(1) in stack space.
One down side of this is that the order is different; the tree is traversed from top to bottom and right to left in the new algorithm, instead of top to bottom and left to right. If that bothers you then you can just say
foreach(var child in current.Effects.Reverse())
instead.
For more analysis of this problem, see my colleague Wes Dyer's article on the subject:
http://blogs.msdn.com/b/wesdyer/archive/2007/03/23/all-about-iterators.aspx
Legal, yes. Whether it will function as you expect depends on:
The implementation of the IEnumerable returned by EffectsRecursive and whether it always returns the same set;
Whether you want to enumerate the same set both times
If it returns an IEnumerable that requires some intensive work, and it doesn't cache the results internally, then you may need to .ToList() it yourself. If it does cache the results, then ToList() would be slightly redundant but probably no harm.
Also, if GetEnumerator() is implemented in a typical/proper (*) way, then you can safely enumerate any number of times - each foreach will be a new call to GetEnumerator() which returns a new instance of the IEnumerator. But it could be that in some situations it returns the same IEnumerator instance that's already been partially or fully enumerated, so it all really just depends on the specific intended usage of that particular IEnumerable.
*I'm pretty sure returning the same enumerator multiple times is actually a violation of the implied contract for the pattern, but I have seen some implementations that do it anyway.
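To illustrate that contract violation, a contrived sketch of an enumerable that caches a single enumerator; the second foreach sees nothing because the shared enumerator is already exhausted:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Contrived: caches one enumerator instead of creating a fresh one per
// GetEnumerator() call, breaking the expected multiple-enumeration contract.
class OneShotEnumerable : IEnumerable<int>
{
    private readonly IEnumerator<int> _shared = Enumerable.Range(1, 3).GetEnumerator();

    public IEnumerator<int> GetEnumerator() => _shared;
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

class Demo
{
    static void Main()
    {
        var source = new OneShotEnumerable();
        foreach (var n in source) Console.Write(n); // 123
        foreach (var n in source) Console.Write(n); // prints nothing
    }
}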
Most probably, yes. Most implementations of IEnumerable return a fresh IEnumerator which starts at the beginning of the list.
It all depends on the implementation of the type of EffectsRecursive.
I know what yield does, and I've seen a few examples, but I can't think of real life applications, have you used it to solve some specific problem?
(Ideally some problem that cannot be solved some other way)
I realise this is an old question (pre Jon Skeet?) but I have been considering this question myself just lately. Unfortunately the current answers here (in my opinion) don't mention the most obvious advantage of the yield statement.
The biggest benefit of the yield statement is that it allows you to iterate over very large lists with much more efficient memory usage than, say, building a standard list.
For example, let's say you have a database query that returns 1 million rows. You could retrieve all rows using a DataReader and store them in a List, therefore requiring list_size * row_size bytes of memory.
Or you could use the yield statement to create an Iterator and only ever store one row in memory at a time. In effect this gives you the ability to provide a "streaming" capability over large sets of data.
Moreover, in the code that uses the iterator, you use a simple foreach loop and can decide to break out of the loop as required. If you do break early, you have not forced the retrieval of the entire set of data when you only needed the first 5 rows (for example).
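A sketch of that streaming pattern with ADO.NET (the connection string and query are placeholders, and the classic System.Data.SqlClient provider is assumed):

using System.Collections.Generic;
using System.Data.SqlClient;

class RowStreamer
{
    // Streams one row at a time; at no point is the full million-row
    // result set buffered in managed memory.
    public static IEnumerable<string> StreamNames(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT Name FROM Users", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    yield return reader.GetString(0);
            }
        }
    }
}

Because the iterator's finally blocks run when the caller's foreach disposes it, breaking out early still closes the reader and the connection.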
Regarding:
Ideally some problem that cannot be solved some other way
The yield statement does not give you anything you could not do with your own custom iterator implementation, but it saves you having to write the often complex code needed. There are very few problems (if any) that can't be solved in more than one way.
Here are a couple of more recent questions and answers that provide more detail:
Yield keyword value added?
Is yield useful outside of LINQ?
Actually, I use it in a non-traditional way on my site IdeaPipe:
public override IEnumerator<T> GetEnumerator()
{
    // Goes through the collection and only returns the ones that are visible
    // for the current user. This is done at this level instead of the display
    // level so that ideas do not bleed through on services.
    foreach (T idea in InternalCollection)
        if (idea.IsViewingAuthorized)
            yield return idea;
}
So basically it checks whether viewing the idea is currently authorized; if it is, it returns the idea, and otherwise the idea is just skipped. This allows me to cache the Ideas but still show them only to the users that are authorized. Otherwise I would have to re-pull them each time based on permissions, even though they are only re-ranked every hour.
One interesting use is as a mechanism for asynchronous programming, especially for tasks that take multiple steps and require the same set of data in each step. Two examples of this would be Jeffrey Richter's AsyncEnumerator Part 1 and Part 2. The Concurrency and Coordination Runtime (CCR) also makes use of this technique: CCR Iterators.
LINQ's operators on the Enumerable class are implemented as iterators that are created with the yield statement. It allows you to chain operations like Select() and Where() without actually enumerating anything until you actually use the enumerator in a loop, typically by using the foreach statement. Also, since only one value is computed when you call IEnumerator.MoveNext() if you decide to stop mid-collection, you'll save the performance hit of calculating all of the results.
Iterators can also be used to implement other kinds of lazy evaluation where expressions are evaluated only when you need it. You can also use yield for more fancy stuff like coroutines.
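As a taste of the coroutine idea, here is a minimal sketch: each yield is a voluntary suspension point, and a trivial queue round-robins two workers while their local state is preserved for free.

using System;
using System.Collections.Generic;

class CoroutineDemo
{
    // Each yield hands control back to the scheduler; local state
    // (the loop counter) is preserved across suspensions for free.
    static IEnumerable<string> Worker(string name, int steps)
    {
        for (int i = 1; i <= steps; i++)
            yield return name + " step " + i;
    }

    static void Main()
    {
        var tasks = new Queue<IEnumerator<string>>();
        tasks.Enqueue(Worker("A", 2).GetEnumerator());
        tasks.Enqueue(Worker("B", 3).GetEnumerator());

        // Trivial round-robin scheduler: output interleaves A and B.
        while (tasks.Count > 0)
        {
            var task = tasks.Dequeue();
            if (task.MoveNext())
            {
                Console.WriteLine(task.Current);
                tasks.Enqueue(task); // not finished: run again later
            }
            else
            {
                task.Dispose();
            }
        }
    }
}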
Another good use for yield is to perform a function on the elements of an IEnumerable and to return a result of a different type, for example:
public delegate T SomeDelegate<K, T>(K obj);

public IEnumerable<T> DoActionOnList<K, T>(IEnumerable<K> list, SomeDelegate<K, T> action)
{
    foreach (var i in list)
        yield return action(i);
}
Using yield can prevent downcasting to a concrete type. This is handy to ensure that the consumer of the collection doesn't manipulate it.
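A sketch of that protection (the accessor pair is hypothetical): the version that returns the list directly can be cast back and mutated by the caller, while the yield-based version cannot.

using System;
using System.Collections.Generic;

class EncapsulationExample
{
    private static readonly List<int> _items = new List<int> { 1, 2, 3 };

    // Risky: the returned reference IS the private list.
    public static IEnumerable<int> GetItemsDirect() => _items;

    // Safer: callers get a compiler-generated iterator, not the list itself.
    public static IEnumerable<int> GetItemsYielded()
    {
        foreach (var item in _items)
            yield return item;
    }

    static void Main()
    {
        ((List<int>)GetItemsDirect()).Add(99);     // succeeds: internal state mutated!
        // ((List<int>)GetItemsYielded()).Add(99); // would throw InvalidCastException
        Console.WriteLine(_items.Count);           // 4
    }
}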
You can also use yield return to treat a series of function results as a list. For instance, consider a company that pays its employees every two weeks. One could retrieve a subset of payroll dates as a list using this code:
void Main()
{
    var StartDate = DateTime.Parse("01/01/2013");
    var EndDate = DateTime.Parse("06/30/2013");
    foreach (var d in GetPayrollDates(StartDate, EndDate))
    {
        Console.WriteLine(d);
    }
}

// Calculate payroll dates in the given range.
// Assumes the first date given is a payroll date.
IEnumerable<DateTime> GetPayrollDates(DateTime startDate, DateTime endDate, int daysInPeriod = 14)
{
    var thisDate = startDate;
    while (thisDate < endDate)
    {
        yield return thisDate;
        thisDate = thisDate.AddDays(daysInPeriod);
    }
}