Could locking an enumerable potentially cause multiple enumeration? - c#

I think ReSharper is lying to me.
I have this extension method that (hopefully) returns an xor of two enumerations:
public static IEnumerable<T> Xor<T>(this IEnumerable<T> first, IEnumerable<T> second)
{
lock (first)
{
lock (second)
{
var firstAsList = first.ToList();
var secondAsList = second.ToList();
return firstAsList.Except(secondAsList).Union(secondAsList.Except(firstAsList));
}
}
}
ReSharper thinks I'm performing a multiple enumeration of an IEnumerable on both of the arguments. If I remove the locks, then it's satisfied that I'm not.
Is ReSharper right or wrong? I believe it's wrong.
edit: I do realize that I'm enumerating the lists multiple times, but ReSharper is saying I'm enumerating over the original arguments multiple times, which I don't think is true. I'm enumerating both arguments once into a list so I may then perform the actual set manipulation, but as I see it, I'm not actually iterating over the arguments passed multiple times.
For example, if the passed arguments are actually query results, my belief is this method won't cause a storm of queries to be executed by the set manipulation. I do understand what ReSharper means by warning of multiple enumeration: if the enumerables passed are heavy to generate, then enumerating them multiple times will be much slower.
Also, removing the locks definitely makes ReSharper happier: the warning disappears entirely.

You are indeed enumerating both of the lists multiple times. You are not enumerating the enumerables passed as parameters multiple times. Both lists are enumerated once for each call to Except. The call to Union is not enumerating either sequence an additional time, but rather is enumerating the results of the two calls to Except.
Of course, iterating a List multiple times in a context like this isn't really a problem; there aren't negative consequences to iterating an unchanging list multiple times.
The lock statements have nothing whatsoever to do with enumeration of the sequences. Locking on an IEnumerable does not iterate it. Of course, locking on two objects like this, specifically two objects that are not limited in scope to this section of code, is very dangerous. It's quite possible to end up deadlocking the program with locks used in this manner if code elsewhere (such as another invocation of this method) ends up taking locks on the same objects in the opposite order.
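To make the deadlock concrete, here's a hedged sketch (hypothetical driver code, not from the question): two threads calling the method with the arguments swapped can each acquire their first lock and then wait forever for the other's.
var first = new List<int> { 1, 2 };
var second = new List<int> { 2, 3 };
// Thread A locks first, then tries to lock second;
// thread B locks second, then tries to lock first.
// With unlucky timing each holds the lock the other needs: deadlock.
new Thread(() => first.Xor(second).ToList()).Start();
new Thread(() => second.Xor(first).ToList()).Start();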

This is a bit of a funny one.
First things first: as you've correctly identified, R# is raising this inspection not against the multiple usages of the Lists - there is of course nothing to worry about in multiply enumerating a List - but against (what R# sees as multiple usages of) the IEnumerable arguments. I'm presuming you already know why this would be potentially bad, so I'll skip that.
Now to the question of whether R# is right to complain here. To quote the C# spec,
A lock statement of the form
lock (x) ...
where x is an expression of a reference-type, is precisely
equivalent to
System.Threading.Monitor.Enter(x);
try {
...
}
finally {
System.Threading.Monitor.Exit(x);
}
except that x is only evaluated once.
(I've put in this emphasis because I like this wording; it avoids debates (that I'm definitely not qualified to enter) about whether this is "syntactic sugar" or not.)
Taking a minimal example which produces this R# inspection:
private static void Method(IEnumerable<int> enumerable)
{
lock (enumerable)
{
var list = enumerable.ToList();
}
}
and replacing it by what I think is the precisely equivalent version as mandated by the spec:
private static void Method(IEnumerable<int> enumerable)
{
var x = enumerable;
System.Threading.Monitor.Enter(x);
try
{
var list = enumerable.ToList();
}
finally
{
System.Threading.Monitor.Exit(x);
}
}
also produces the inspection.
The question then is: is R# right to produce this inspection? And this is where I think we get into a grey area. When I pass the following enumerable to either of these methods:
static IEnumerable<int> MyEnumerable()
{
Console.WriteLine("Enumerable enumerated");
yield return 1;
yield return 2;
}
it is not multiply enumerated, which would suggest that R# is wrong to warn here; however, I can't actually find anything in the documentation that guarantees this behaviour of either lock or Monitor.Enter. So for me it's not quite as clear-cut as this R# bug I reported, where use of GetType flagged this inspection; but nonetheless I'd guess you're safe.
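For what it's worth, a trivial driver makes the observed (if apparently undocumented) behaviour visible:
// With the lock in place this still prints "Enumerable enumerated"
// exactly once - the lock statement itself never iterates the argument.
Method(MyEnumerable());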
If you raise this on the R# bug tracker, you can get JetBrains' finest looking at a) whether this behaviour is indeed guaranteed, and b) whether R# can be adjusted to either not warn, or provide a justification for warning.
That said, of course, using locking here probably isn't actually achieving what you want to achieve, as stated in other answers and comments...

Related

Why do we need two interfaces to enumerate a collection?

For quite a while now I have been trying to understand the idea behind IEnumerable and IEnumerator. I have read all the questions and answers I could find on the net, and on StackOverflow in particular, but I am not satisfied. I have got to the point where I understand how those interfaces should be used, but not why they are used this way.
I think that the essence of my misunderstanding is that we need two interfaces for one operation. I realized that if both are needed, one was probably not enough. So I took the "hard coded" equivalent of foreach (as I found here):
while (enumerator.MoveNext())
{
object item = enumerator.Current;
// logic
}
and tried to get it to work with one interface, thinking something would go wrong which would make me understand why another interface is needed.
So I created a collection class, and implemented IForeachable:
// The single interface used in this experiment, implied by the usage below:
interface IForeachable
{
int Current { get; }
bool MoveNext();
}
class Collection : IForeachable
{
private int[] array = { 1, 2, 3, 4, 5 };
private int index = -1;
public int Current => array[index];
public bool MoveNext()
{
if (index < array.Length - 1)
{
index++;
return true;
}
index = -1;
return false;
}
}
and used the foreach equivalent to enumerate the collection:
var collection = new Collection();
while (collection.MoveNext())
{
object item = collection.Current;
Console.WriteLine(item);
}
And it works! So what is missing here that makes another interface required?
Thanks.
Edit:
My question is not a duplicate of the questions listed in the comments:
This question is why interfaces are needed for enumerating in the first place.
This question and this question are about what are those interfaces and how should they be used.
My question is why they are designed the way they are, not what are they, how they work, and why do we need them in the first place.
What are the two interfaces and what do they do?
The IEnumerable interface is placed on the collection object and defines the GetEnumerator() method, which returns a (normally new) object that implements the IEnumerator interface. The foreach statement in C# and the For Each statement in VB.NET use IEnumerable to access the enumerator in order to loop over the elements in the collection.
The IEnumerator interface is essentially the contract placed on the object that actually does the iteration. It stores the state of the iteration and updates it as the code moves through the collection.
Why not just have the collection be the enumerator too? Why have two separate interfaces?
There is nothing to stop IEnumerator and IEnumerable being implemented on the same class. However, there is a penalty for doing this: it won't be possible to have two or more loops on the same collection at the same time. If it can be absolutely guaranteed that there won't ever be a need to loop over the collection twice at the same time then that's fine. But in the majority of circumstances that guarantee isn't possible.
When would someone iterate over a collection more than once at a time?
Here are two examples.
The first example is when there are two loops nested inside each other on the same collection. If the collection was also the enumerator then it wouldn’t be possible to support nested loops on the same collection, when the code gets to the inner loop it is going to collide with the outer loop.
The second example is when there are two, or more, threads accessing the same collection. Again, if the collection was also the enumerator then it wouldn’t be possible to support safe multithreaded iteration over the same collection. When the second thread attempts to loop over the elements in the collection the state of the two enumerations will collide.
Also, because the iteration model used in .NET does not permit alterations to a collection during enumeration, these operations are otherwise completely safe.
-- This was from a blog post I wrote many years ago: https://colinmackay.scot/2007/06/24/iteration-in-net-with-ienumerable-and-ienumerator/
Your IForeachable cannot even be iterated from two different threads (you cannot have multiple active iterations at all - even from the same thread), because the current enumeration state is stored in the IForeachable itself. You also have to reset your current position each time you finish enumerating, and if you forget to do that - well, the next caller will think your collection is empty. I can only imagine all the kinds of hard-to-track bugs this might lead to.
On the other hand, because IEnumerable returns a new IEnumerator for each caller, you can have multiple enumerations in progress simultaneously, because each caller has its own enumeration state. I think this reason alone is enough to justify two interfaces. Enumeration is essentially a read operation, and it would be very confusing if you could not read the same thing simultaneously in multiple places.
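To make that concrete, here's a minimal sketch: each foreach below calls GetEnumerator() and receives an independent enumerator, so the two simultaneous iterations don't collide.
var numbers = new List<int> { 1, 2, 3 };
foreach (var a in numbers)          // outer enumerator
    foreach (var b in numbers)      // inner enumerator, independent state
        Console.WriteLine("{0},{1}", a, b); // prints all 9 pairs
With the single-interface Collection above, both loops would share the one index field and trample each other's position (in fact, because MoveNext resets the index to -1 when it runs off the end, the outer loop would never terminate).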

Why Single(IEnumerable<T>,Predicate<T>) is so inefficient [duplicate]

I came across this implementation in Enumerable.cs using Reflector.
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
//check parameters
TSource local = default(TSource);
long num = 0L;
foreach (TSource local2 in source)
{
if (predicate(local2))
{
local = local2;
num += 1L;
//I think they should do something here like:
//if (num >= 2L) throw Error.MoreThanOneMatch();
//no necessary to continue
}
}
//return different results by num's value
}
I think they should break out of the loop once a second matching item is found; why do they always loop through the whole collection? In case Reflector disassembled the DLL incorrectly, I wrote a simple test:
class DataItem
{
private int _num;
public DataItem(int num)
{
_num = num;
}
public int Num
{
get{ Console.WriteLine("getting "+_num); return _num;}
}
}
var source = Enumerable.Range(1,10).Select( x => new DataItem(x));
var result = source.Single(x => x.Num < 5);
For this test case, I think it will print "getting 1, getting 2" and then throw an exception. But the truth is, it keeps going ("getting 1" ... "getting 10") and then throws an exception.
Is there any algorithmic reason they implement this method like this?
EDIT Some of you thought it's because of side effects of the predicate expression; after deep thought and some test cases, I have come to the conclusion that side effects don't matter in this case. Please provide an example if you disagree with this conclusion.
Yes, I do find it slightly strange especially because the overload that doesn't take a predicate (i.e. works on just the sequence) does seem to have the quick-throw 'optimization'.
In the BCL's defence however, I would say that the InvalidOperation exception that Single throws is a boneheaded exception that shouldn't normally be used for control-flow. It's not necessary for such cases to be optimized by the library.
Code that uses Single where zero or multiple matches is a perfectly valid possibility, such as:
try
{
var item = source.Single(predicate);
DoSomething(item);
}
catch(InvalidOperationException)
{
DoSomethingElseUnexceptional();
}
should be refactored to code that doesn't use the exception for control-flow, such as (only a sample; this can be implemented more efficiently):
var firstTwo = source.Where(predicate).Take(2).ToArray();
if(firstTwo.Length == 1)
{
// Note that this won't fail. If it does, this code has a bug.
DoSomething(firstTwo.Single());
}
else
{
DoSomethingElseUnexceptional();
}
In other words, we should leave the use of Single to cases where we expect the sequence to contain only one match. It should behave identically to First, but with the additional run-time assertion that the sequence doesn't contain multiple matches. Like any other assertion, failure, i.e. cases when Single throws, should be used to represent bugs in the program (either in the method running the query or in the arguments passed to it by the caller).
This leaves us with two cases:
The assertion holds: There is a single match. In this case, we want Single to consume the entire sequence anyway to assert our claim. There's no benefit to the 'optimization'. In fact, one could argue that the sample implementation of the 'optimization' provided by the OP will actually be slower because of the check on every iteration of the loop.
The assertion fails: There are zero or multiple matches. In this case, we do throw later than we could, but this isn't such a big deal since the exception is boneheaded: it is indicative of a bug that must be fixed.
To sum up, if the 'poor implementation' is biting you performance-wise in production, either:
You are using Single incorrectly.
You have a bug in your program. Once the bug is fixed, this particular performance problem will go away.
EDIT: Clarified my point.
EDIT: Here's a valid use of Single, where failure indicates bugs in the calling code (bad argument):
public static User GetUserById(this IEnumerable<User> users, string id)
{
if(users == null)
throw new ArgumentNullException("users");
// Perfectly fine if documented that a failure in the query
// is treated as an exceptional circumstance. Caller's job
// to guarantee pre-condition.
return users.Single(user => user.Id == id);
}
Update:
I got some very good feedback to my answer, which has made me re-think. Thus I will first provide the answer that states my "new" point of view; you can still find my original answer just below. Make sure to read the comments in-between to understand where my first answer misses the point.
New answer:
Let's assume that Single should throw an exception when its pre-condition is not met; that is, when Single detects that either none, or more than one, item in the collection matches the predicate.
Single can only succeed without throwing an exception by going through the whole collection. It has to make sure that there is exactly one matching item, so it will have to check all items in the collection.
This means that throwing an exception early (as soon as it finds a second matching item) is essentially an optimization that you can only benefit from when Single's pre-condition cannot be met and when it will throw an exception.
As user CodeInChaos says clearly in a comment below, the optimization wouldn't be wrong, but it is meaningless, because one usually introduces optimizations that will benefit correctly-working code, not optimizations that will benefit malfunctioning code.
Thus, it is actually correct that Single could throw an exception early; but it doesn't have to, because there's practically no added benefit.
Old answer:
I cannot give a technical reason why that method is implemented the way it is, since I didn't implement it. But I can state my understanding of the Single operator's purpose, and from there draw my personal conclusion that it is indeed badly implemented:
My understanding of Single:
What is the purpose of Single, and how is it different from e.g. First or Last?
Using the Single operator basically expresses one's assumption that exactly one item must be returned from the collection:
If you don't specify a predicate, it should mean that the collection is expected to contain exactly one item.
If you do specify a predicate, it should mean that exactly one item in the collection is expected to satisfy that condition. (Using a predicate should have the same effect as items.Where(predicate).Single().)
This is what makes Single different from other operators such as First, Last, or Take(1). None of those operators have the requirement that there should be exactly one (matching) item.
When should Single throw an exception?
Basically, when it finds that your assumption was wrong; i.e. when the underlying collection does not yield exactly one (matching) item. That is, when there are zero or more than one items.
When should Single be used?
The use of Single is appropriate when your program's logic can guarantee that the collection will yield exactly one item, and one item only. If an exception gets thrown, that should mean that your program's logic contains a bug.
If you process "unreliable" collections, such as I/O input, you should first validate the input before you pass it to Single. Single, together with an exception catch block, is not appropriate for making sure that the collection has only one matching item. By the time you invoke Single, you should already have made sure that there'll be only one matching item.
Conclusion:
The above states my understanding of the Single LINQ operator. If you follow and agree with this understanding, you should come to the conclusion that Single ought to throw an exception as early as possible. There is no reason to wait until the end of the (possibly very large) collection, because the pre-condition of Single is violated as soon as it detects a second (matching) item in the collection.
When considering this implementation we must remember that this is the BCL: general code that is supposed to work well enough in all sorts of scenarios.
First, take these scenarios:
Iterate over 10 numbers, where the first and second elements are equal
Iterate over 1,000,000 numbers, where the first and third elements are equal
The original algorithm will work well enough for 10 items, but with 1M items there will be a severe waste of cycles. So in these cases, where we know that there are two or more matches early in the sequence, the proposed optimization would have a nice effect.
Then, look at these scenarios:
Iterate over 10 numbers, where the first and last elements are equal
Iterate over 1,000,000 numbers, where the first and last elements are equal
In these scenarios the algorithm is still required to inspect every item in the list. There is no shortcut. The original algorithm performs well enough; it fulfills the contract. Changing the algorithm by introducing an if on each iteration would actually decrease performance. For 10 items it would be negligible, but for 1M items it would be a big hit.
IMO, the original implementation is the correct one, since it is good enough for most scenarios. Knowing the implementation of Single is good though, because it enables us to make smart decisions based on what we know about the sequences we use it on. If performance measurements in one particular scenario show that Single is causing a bottleneck, well: then we can implement our own variant that works better in that particular scenario.
Update: as CodeInChaos and Eamon have correctly pointed out, the if test introduced in the optimization is indeed not performed on each item, only within the predicate match block. I have in my example completely overlooked the fact that the proposed changes will not affect the overall performance of the implementation.
I agree that introducing the optimization would probably benefit all scenarios. It is good to see though that eventually, the decision to implement the optimization is made on the basis of performance measurements.
I think it's a premature optimization "bug".
Why this is NOT reasonable behavior due to side effects
Some have argued that due to side effects, it should be expected that the entire list is evaluated. After all, in the correct case (the sequence indeed has just 1 element) it is completely enumerated, and for consistency with this normal case it's nicer to enumerate the entire sequence in all cases.
Although that's a reasonable argument, it flies in the face of the general practice throughout the LINQ libraries: they use lazy evaluation everywhere. It's not general practice to fully enumerate sequences except where absolutely necessary; indeed, several methods prefer using IList.Count when available over any iteration at all - even when that iteration may have side effects.
Further, .Single() without a predicate does not exhibit this behavior: it terminates as soon as possible. If the argument were that .Single() should respect side-effects of enumeration, you'd expect all overloads to behave equivalently.
Why the case for speed doesn't hold
Peter Lillevold made the interesting observation that it may be faster to do...
foreach(var elem in elems)
if(pred(elem)) {
retval=elem;
count++;
}
if(count!=1)...
than
foreach(var elem in elems)
if(pred(elem)) {
retval=elem;
count++;
if(count>1) ...
}
if(count==0)...
After all, the second version, which would exit the iteration as soon as the first conflict is detected, would require an extra test in the loop - a test which in the "correct" case is purely ballast. Neat theory, right?
Except that's not borne out by the numbers; for example, on my machine (YMMV) Enumerable.Range(0,100000000).Where(x=>x==123).Single() is actually faster than Enumerable.Range(0,100000000).Single(x=>x==123)!
It's possibly a JITter quirk of this precise expression on this machine - I'm not claiming that Where followed by predicateless Single is always faster.
But whatever the case, the fail-fast solution is very unlikely to be significantly slower. After all, even in the normal case, we're dealing with a cheap branch: a branch that is never taken and thus easy on the branch predictor. And of course, the branch is only ever encountered when pred holds - that's once per call in the normal case. That cost is simply negligible compared to the cost of the delegate call pred and its implementation, plus the cost of the interface methods .MoveNext() and .get_Current() and their implementations.
It's simply extremely unlikely that you'll notice the performance degradation caused by one predictable branch in comparison to all that other abstraction penalty - not to mention the fact that most sequences and predicates actually do something themselves.
It seems very clear to me.
Single is intended for the case where the caller knows that the enumeration contains exactly one match, since in any other case an expensive exception is thrown.
For this use case, the overload that takes a predicate must iterate over the whole enumeration. It is slightly faster to do so without an additional condition on every loop.
In my view the current implementation is correct: it is optimized for the expected use case of an enumeration that contains exactly one matching element.
That does appear to be a bad implementation, in my opinion.
Just to illustrate the potential severity of the problem:
var oneMillion = Enumerable.Range(1, 1000000)
.Select(x => { Console.WriteLine(x); return x; });
int firstEven = oneMillion.Single(x => x % 2 == 0);
The above will output all the integers from 1 to 1000000 before throwing an exception.
It's a head-scratcher for sure.
I only found this question after filing a report at https://connect.microsoft.com/VisualStudio/feedback/details/810457/public-static-tsource-single-tsource-this-ienumerable-tsource-source-func-tsource-bool-predicate-doesnt-throw-immediately-on-second-matching-result#
The side-effect argument doesn't hold water, because:
Having side-effects isn't really functional, and they're called Func for a reason.
If you do want side-effects, it makes no more sense to claim the version that has the side-effects throughout the whole sequence is desirable than it does to claim so for the version that throws immediately.
It does not match the behaviour of First or the other overload of Single.
It does not match at least some other implementations of Single, e.g. Linq2SQL uses TOP 2 to ensure that only the two matching cases needed to test for more than one match are returned.
We can construct cases where we should expect a program to halt, but it does not halt.
We can construct cases where OverflowException is thrown, which is not documented behaviour, and hence clearly a bug.
Most importantly of all, if we're in a condition where we expected the sequence to have only one matching element, and yet it doesn't, then something has clearly gone wrong. Apart from the general principle that the only thing you should do upon detecting an error state is clean up (and this implementation delays that) before throwing, the case of a sequence having more than one matching element is going to overlap with the case of a sequence having more elements in total than expected - perhaps because the sequence has a bug that is causing it to loop unexpectedly. So it's precisely in one possible set of bugs that should trigger the exception that the exception is most delayed.
Edit:
Peter Lillevold's mention of a repeated test may be a reason why the author chose to take the approach they did, as an optimisation for the non-exceptional case. If so, it was needless though, even aside from Eamon Nerbonne showing it wouldn't improve much. There's no need to have a repeated test in the initial loop, as we can just change what we're testing for upon the first match:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
if(source == null)
throw new ArgumentNullException("source");
if(predicate == null)
throw new ArgumentNullException("predicate");
using(IEnumerator<TSource> en = source.GetEnumerator())
{
while(en.MoveNext())
{
TSource cur = en.Current;
if(predicate(cur))
{
while(en.MoveNext())
if(predicate(en.Current))
throw new InvalidOperationException("Sequence contains more than one matching element");
return cur;
}
}
}
throw new InvalidOperationException("Sequence contains no matching element");
}
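A quick usage sketch of the version above (assuming it is the Single in scope rather than the BCL's):
var ok = new[] { 1, 2, 3 }.Single(n => n == 2);
// Returns 2, but only after scanning the whole array to assert uniqueness.
// By contrast, new[] { 1, 2, 2 }.Single(n => n == 2) throws as soon as the
// second 2 is seen, without enumerating the rest of the sequence.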

Is there a way to determine whether IEnumerable<T> is a sequence generator or true collection?

Let me provide an example:
Suppose i have a method:
public void DoStuff<T>(IEnumerable<T> sequence)
{
if (/* is lazy sequence */) throw ...
// Do stuff...
}
And I want to guard against potentially infinite sequences.
Edit:
To elaborate, guarding against infinite collections is only one of the uses. As Jon mentioned, you can easily have an infinite IList. Good point.
Another use might be to detect whether the data is potentially unrepeatable, like a random generator. A true collection has its data already in memory, and iterating it twice will give me the same data.
There's nothing that's guaranteed, no. You could see whether the sequence also implements IList<T> - that would prohibit iterator block implementations, but could still be a "virtual" list which continues forever, and it would also fail for some other finite non-iterator-block collections.
In terms of what you're trying to protect against, would a 2-billion-long sequence be okay for what you're trying to do? Would an extension method which threw an exception if you ended up with more than some "limit" (but do so lazily) work for you? (Something like Take, but which blew up at the end.)
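A lazily-exploding guard of the kind Jon describes might look like this (a sketch; TakeOrThrow is a hypothetical name, not a BCL method):
public static IEnumerable<T> TakeOrThrow<T>(this IEnumerable<T> source, int limit)
{
    int count = 0;
    foreach (T item in source)
    {
        // Blow up lazily once the sequence exceeds the caller's limit.
        if (++count > limit)
            throw new InvalidOperationException(
                "Sequence contains more than " + limit + " elements.");
        yield return item;
    }
}
DoStuff could then iterate sequence.TakeOrThrow(someLimit) and fail if the sequence turns out to be unreasonably long (including the infinite case), instead of hanging.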
This is impossible.
The IEnumerable API cannot possibly tell you whether it is infinite or not. You can try casting it to ICollection to catch the common case where someone has passed you one of those, but if that's what you want then why don't you take an ICollection<...> object in the first place?
Iterators don't support the IEnumerator.Reset() method:
public void DoStuff<T>(IEnumerable<T> sequence)
{
sequence.GetEnumerator().Reset();
// etc..
}
You get a NotSupportedException, which is good, with the Message "Specified method is not supported", which is tolerable.

When NOT to use yield (return) [duplicate]

This question already has an answer here:
Is there ever a reason to not use 'yield return' when returning an IEnumerable?
There are several useful questions here on SO about the benefits of yield return. For example,
Can someone demystify the yield keyword
Interesting use of the c# yield keyword
What is the yield keyword
I'm looking for thoughts on when NOT to use yield return. For example, if I expect to need to return all items in a collection, it doesn't seem like yield would be useful, right?
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
It's a good idea to think carefully about your use of "yield return" when dealing with recursively defined structures. For example, I often see this:
public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
if (root == null) yield break;
yield return root.Value;
foreach(T item in PreorderTraversal(root.Left))
yield return item;
foreach(T item in PreorderTraversal(root.Right))
yield return item;
}
Perfectly sensible-looking code, but it has performance problems. Suppose the tree is h deep. Then at some point there will be O(h) nested iterators built. Calling "MoveNext" on the outer iterator will then make O(h) nested calls to MoveNext. Since it does this O(n) times for a tree with n items, that makes the algorithm O(hn). And since the height of a binary tree is lg n <= h <= n, that means that the algorithm is at best O(n lg n) and at worst O(n^2) in time, and best case O(lg n) and worst case O(n) in stack space. It is O(h) in heap space because each enumerator is allocated on the heap. (On implementations of C# I'm aware of; a conforming implementation might have other stack or heap space characteristics.)
But iterating a tree can be O(n) in time and O(1) in stack space. You can instead write it like this:
public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
var stack = new Stack<Tree<T>>();
stack.Push(root);
while (stack.Count != 0)
{
var current = stack.Pop();
if (current == null) continue;
yield return current.Value;
stack.Push(current.Left);
stack.Push(current.Right);
}
}
which still uses yield return, but is much smarter about it. Now we are O(n) in time and O(h) in heap space, and O(1) in stack space.
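For reference, both snippets assume a Tree<T> along these lines (a minimal hypothetical definition, just enough to make them compile):
public class Tree<T>
{
    public T Value { get; set; }
    public Tree<T> Left { get; set; }   // null when there is no left child
    public Tree<T> Right { get; set; }  // null when there is no right child
}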
Further reading: see Wes Dyer's article on the subject:
http://blogs.msdn.com/b/wesdyer/archive/2007/03/23/all-about-iterators.aspx
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
I can think of a couple of cases, i.e.:
Avoid using yield return when you return an existing iterator. Example:
// Don't do this, it creates overhead for no reason
// (a new state machine needs to be generated)
public IEnumerable<string> GetKeys()
{
foreach(string key in _someDictionary.Keys)
yield return key;
}
// DO this
public IEnumerable<string> GetKeys()
{
return _someDictionary.Keys;
}
Avoid using yield return when you don't want to defer execution code for the method. Example:
// Don't do this, the exception won't get thrown until the iterator is
// iterated, which can be very far away from this method invocation
public IEnumerable<string> Foo(Bar baz)
{
if (baz == null)
throw new ArgumentNullException();
yield ...
}
// DO this
public IEnumerable<string> Foo(Bar baz)
{
if (baz == null)
throw new ArgumentNullException();
return new BazIterator(baz);
}
The key thing to realize is what yield is useful for, then you can decide which cases do not benefit from it.
In other words, when you do not need a sequence to be lazily evaluated you can skip the use of yield. When would that be? It would be when you do not mind immediately having your entire collection in memory. Otherwise, if you have a huge sequence that would negatively impact memory, you would want to use yield to work on it step by step (i.e., lazily). A profiler might come in handy when comparing both approaches.
Notice how most LINQ statements return an IEnumerable<T>. This allows us to continually string different LINQ operations together without negatively impacting performance at each step (aka deferred execution). The alternative picture would be putting a ToList() call in between each LINQ statement. This would cause each preceding LINQ statement to be immediately executed before performing the next (chained) LINQ statement, thereby forgoing any benefit of lazy evaluation and of deferring use of the IEnumerable<T> until it's actually needed.
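A small illustrative sketch of that difference:
var numbers = Enumerable.Range(1, 1000000);
// Deferred: nothing executes until the sequence is iterated, and each
// element flows through both operators without an intermediate list.
var lazy = numbers.Where(n => n % 2 == 0).Select(n => n * n);
// Eager: ToList() runs the Where to completion and materializes an
// intermediate list before Select is even applied.
var eager = numbers.Where(n => n % 2 == 0).ToList().Select(n => n * n);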
There are a lot of excellent answers here. I would add this one: Don't use yield return for small or empty collections where you already know the values:
IEnumerable<UserRight> GetSuperUserRights() {
if(SuperUsersAllowed) {
yield return UserRight.Add;
yield return UserRight.Edit;
yield return UserRight.Remove;
}
}
In these cases the creation of the Enumerator object is more expensive, and more verbose, than just generating a data structure.
IEnumerable<UserRight> GetSuperUserRights() {
return SuperUsersAllowed
? new[] {UserRight.Add, UserRight.Edit, UserRight.Remove}
: Enumerable.Empty<UserRight>();
}
Update
Here are the results of my benchmark, showing how long it took (in milliseconds) to perform the operation 1,000,000 times. Smaller numbers are better.
In revisiting this, the performance difference isn't significant enough to worry about, so you should go with whatever is the easiest to read and maintain.
Update 2
I'm pretty sure the above results were achieved with compiler optimization disabled. Running in Release mode with a modern compiler, it appears performance is practically indistinguishable between the two. Go with whatever is most readable to you.
Eric Lippert raises a good point (too bad C# doesn't have stream flattening like Cω). I would add that sometimes the enumeration process is expensive for other reasons, and therefore you should use a list if you intend to iterate over the IEnumerable more than once.
For example, LINQ-to-objects is built on "yield return". If you've written a slow LINQ query (e.g. that filters a large list into a small list, or that does sorting and grouping), it may be wise to call ToList() on the result of the query in order to avoid enumerating multiple times (which actually executes the query multiple times).
If you are choosing between "yield return" and List<T> when writing a method, consider: is each single element expensive to compute, and will the caller need to enumerate the results more than once? If you know the answers are yes and yes, you shouldn't use yield return (unless, for example, the List produced is very large and you can't afford the memory it would use. Remember, another benefit of yield is that the result list doesn't have to be entirely in memory at once).
Another reason not to use "yield return" is if interleaving operations is dangerous. For example, if your method looks something like this,
IEnumerable<T> GetMyStuff() {
foreach (var x in MyCollection)
if (...)
yield return (...);
}
this is dangerous if there is a chance that MyCollection will change because of something the caller does:
foreach(T x in GetMyStuff()) {
if (...)
MyCollection.Add(...);
// Oops, now GetMyStuff() will throw an exception
// because MyCollection was modified.
}
yield return can cause trouble whenever the caller changes something that the yielding function assumes does not change.
I would avoid using yield return if the method has a side effect that you expect on calling the method. This is due to the deferred execution that Pop Catalin mentions.
One side effect could be modifying the system, which could happen in a method like IEnumerable<Foo> SetAllFoosToCompleteAndGetAllFoos(), which breaks the single responsibility principle. That's pretty obvious (now...), but a not so obvious side effect could be setting a cached result or similar as an optimisation.
My rules of thumb (again, now...) are:
Only use yield if the object being returned requires a bit of processing
No side effects in the method if I need to use yield
If I have to have side effects (and limiting that to caching etc.), don't use yield and make sure the benefits of expanding the iteration outweigh the costs
Yield would be limiting/unnecessary when you need random access. If you need to access element 0 then element 99, you've pretty much eliminated the usefulness of lazy evaluation.
One that might catch you out is if you are serialising the results of an enumeration and sending them over the wire. Because the execution is deferred until the results are needed, you will serialise an empty enumeration and send that back instead of the results you want.
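A hedged sketch of that pitfall (the types, field, and serializer are hypothetical): materialize the sequence first, so that concrete items rather than a not-yet-run iterator are handed to the serializer.
IEnumerable<Order> GetPendingOrders()
{
    foreach (var order in _orders)   // _orders: some hypothetical collection
        if (order.IsPending)
            yield return order;
}
// Force evaluation before serializing and sending over the wire:
var payload = GetPendingOrders().ToList();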
I have to maintain a pile of code from a guy who was absolutely obsessed with yield return and IEnumerable. The problem is that a lot of third party APIs we use, as well as a lot of our own code, depend on Lists or Arrays. So I end up having to do:
IEnumerable<foo> myFoos = getSomeFoos();
List<foo> fooList = new List<foo>(myFoos);
thirdPartyApi.DoStuffWithArray(fooList.ToArray());
Not necessarily bad, but kind of annoying to deal with, and on a few occasions it's led to creating duplicate Lists in memory to avoid refactoring everything.
When you don't want a code block to return an iterator for sequential access to an underlying collection, you don't need yield return. You simply return the collection instead.
If you're defining a Linq-y extension method where you're wrapping actual Linq members, those members will more often than not return an iterator. Yielding through that iterator yourself is unnecessary.
Beyond that, you can't really get into much trouble using yield to define a "streaming" enumerable that is evaluated on a JIT basis.

Bug in the File.ReadLines(..) method of the .net framework 4.0

This code :
IEnumerable<string> lines = File.ReadLines("file path");
foreach (var line in lines)
{
Console.WriteLine(line);
}
foreach (var line in lines)
{
Console.WriteLine(line);
}
throws an ObjectDisposedException : {"Cannot read from a closed TextReader."} if the second foreach is executed.
It seems that the iterator object returned from File.ReadLines(..) can't be enumerated more than once. You have to obtain a new iterator object by calling File.ReadLines(..) and then use it to iterate.
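Until the lines are buffered somewhere, the workaround is simply to ask for a fresh iterator per pass:
// Each call to File.ReadLines creates a new reader, so each pass succeeds.
foreach (var line in File.ReadLines("file path"))
    Console.WriteLine(line);
foreach (var line in File.ReadLines("file path"))
    Console.WriteLine(line);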
If I replace File.ReadLines(..) with my version(parameters are not verified, it's just an example):
public static IEnumerable<string> MyReadLines(string path)
{
using (var stream = new StreamReader(path))
{
string line;
while ((line = stream.ReadLine()) != null)
{
yield return line;
}
}
}
it's possible to iterate over the lines of the file more than once.
An investigation using .NET Reflector showed that the implementation of File.ReadLines(..) calls a private File.InternalReadLines(TextReader reader) that creates the actual iterator. The reader passed as a parameter is used in the MoveNext() method of the iterator to get the lines of the file, and is disposed when we reach the end of the file. This means that once MoveNext() returns false there is no way to iterate a second time, because the reader is closed, and you have to get a new reader by creating a new iterator with the ReadLines(..) method. In my version, a new reader is created in the MoveNext() method each time we start a new iteration.
Is this the expected behavior of the File.ReadLines(..) method?
I find troubling the fact that it's necessary to call the method each time before you enumerate the results. You would also have to call the method each time before you iterate the results of a Linq query that uses the method.
I know this is old, but I actually just ran into this while working on some code on a Windows 7 machine. Contrary to what people were saying here, this actually was a bug. See this link.
So the easy fix is to update your .NET Framework. I thought this was worth adding since this was the top search result.
I don't think it's a bug, and I don't think it's unusual -- in fact that's what I'd expect for something like a text file reader to do. IO is an expensive operation, so in general you want to do everything in one pass.
It isn't a bug. But I believe you can use ReadAllLines() to do what you want instead. ReadAllLines creates a string array and pulls all the lines into the array, instead of providing just a simple enumerator over a stream like ReadLines does.
If you need to access the lines twice you can always buffer them into a List<T>
using System.Linq;
List<string> lines = File.ReadLines("file path").ToList();
foreach (var line in lines)
{
Console.WriteLine(line);
}
foreach (var line in lines)
{
Console.WriteLine(line);
}
I don't know if it can be considered a bug or not if it's by design, but I can certainly say two things...
This should be posted on Connect, not StackOverflow, although they're not going to change it before 4.0 is released. And that usually means they won't ever fix it.
The design of the method certainly appears to be flawed.
You are correct in noting that returning an IEnumerable implies that it should be reusable, and it does not guarantee the same results if iterated twice. If it had returned an IEnumerator instead then it would be a different story.
So anyway, I think it's a good find and I think the API is a lousy one to begin with. ReadAllLines and ReadAllText give you a nice convenient way of getting at the entire file but if the caller cares enough about performance to be using a lazy enumerable, they shouldn't be delegating so much responsibility to a static helper method in the first place.
I believe you are confusing an IQueryable with an IEnumerable. Yes, it's true that IQueryable can be treated as an IEnumerable, but they are not exactly the same thing. An IQueryable queries each time it's used, while an IEnumerable has no such implied reuse.
A Linq Query returns an IQueryable. ReadLines returns an IEnumerable.
There's a subtle distinction here because of the way an Enumerator is created. An IQueryable creates an IEnumerator when you call GetEnumerator() on it (which is done automatically by foreach). ReadLines() creates the IEnumerator when the ReadLines() function is called. As such, when you reuse an IQueryable, it creates a new IEnumerator when you reuse it, but since the ReadLines() creates the IEnumerator (and not an IQueryable), the only way to get a new IEnumerator is to call ReadLines() again.
In other words, you should only expect to be able to reuse an IQueryable, not an IEnumerator.
EDIT:
On further reflection (no pun intended) I think my initial response was a bit too simplistic. If IEnumerable was not reusable, you couldn't do something like this:
List<int> li = new List<int>() {1, 2, 3, 4};
IEnumerable<int> iei = li;
foreach (var i in iei) { Console.WriteLine(i); }
foreach (var i in iei) { Console.WriteLine(i); }
Clearly, one would not expect the second foreach to fail.
The problem, as is so often the case with these kinds of abstractions, is that not everything fits perfectly. For example, Streams are typically one-way, but for network use they had to be adapted to work bi-directionally.
In this case, an IEnumerable was originally envisioned to be a reusable feature, but it has since been adapted to be so generic that reusability is not a guarantee, nor even something that should be expected. Witness the explosion of various libraries that use IEnumerables in non-reusable ways, such as Jeffrey Richter's PowerThreading library.
I simply don't think we can assume IEnumerables are reusable in all cases anymore.
It's not a bug. File.ReadLines() uses lazy evaluation and it is not idempotent. That's why it's not safe to enumerate it twice in a row. Remember, an IEnumerable represents a data source that can be enumerated; it does not state that it is safe to enumerate twice, although this might be unexpected since most people are used to using IEnumerable over idempotent collections.
From the MSDN:
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
Your findings via Reflector are correct and verify this behavior. The implementation you provided avoids this unexpected behavior but still makes use of lazy evaluation.
