I know what yield does, and I've seen a few examples, but I can't think of real-life applications. Have you used it to solve some specific problem?
(Ideally some problem that cannot be solved some other way)
I realise this is an old question (pre Jon Skeet?) but I have been considering this question myself just lately. Unfortunately the current answers here (in my opinion) don't mention the most obvious advantage of the yield statement.
The biggest benefit of the yield statement is that it allows you to iterate over very large lists with much more efficient memory usage than using, say, a standard list.
For example, let's say you have a database query that returns 1 million rows. You could retrieve all rows using a DataReader and store them in a List, therefore requiring list_size * row_size bytes of memory.
Or you could use the yield statement to create an Iterator and only ever store one row in memory at a time. In effect this gives you the ability to provide a "streaming" capability over large sets of data.
Moreover, in the code that uses the Iterator, you use a simple foreach loop and can decide to break out from the loop as required. If you do break early, you have not forced the retrieval of the entire set of data when you only needed the first 5 rows (for example).
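To make the streaming and early-exit point concrete, here is a minimal sketch. ReadRows is a hypothetical stand-in for a DataReader loop (no real database is involved), and the RowsProduced counter exists only so the laziness can be observed.

```csharp
using System;
using System.Collections.Generic;

public static class RowStreamingSketch
{
    // Counts how many rows were actually produced, so laziness is observable.
    public static int RowsProduced;

    // Hypothetical stand-in for a data reader loop: yields one row at a time.
    public static IEnumerable<string> ReadRows(int totalRows)
    {
        for (int i = 0; i < totalRows; i++)
        {
            RowsProduced++;
            yield return "row " + i;
        }
    }

    // Consumes rows until `wanted` have been collected, then breaks out.
    // Assumes wanted >= 1.
    public static List<string> TakeFirst(int totalRows, int wanted)
    {
        RowsProduced = 0;
        var result = new List<string>();
        foreach (string row in ReadRows(totalRows))
        {
            result.Add(row);
            if (result.Count == wanted) break;
        }
        return result;
    }
}
```

Calling RowStreamingSketch.TakeFirst(1000000, 5) collects five rows, and RowsProduced ends at 5 rather than 1,000,000, because the iterator body only runs as far as the consumer pulls it.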
Regarding:
Ideally some problem that cannot be solved some other way
The yield statement does not give you anything you could not do using your own custom iterator implementation, but it saves you from writing the often complex code that requires. There are very few problems (if any) that can't be solved more than one way.
Here are a couple of more recent questions and answers that provide more detail:
Yield keyword value added?
Is yield useful outside of LINQ?
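As a rough illustration of the "custom iterator implementation" that yield saves you from writing, the sketch below produces the same countdown twice: once with a hand-written IEnumerator<int> and once with yield return. All type and member names here are hypothetical and chosen only for this example.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hand-written enumerator: the state machine is spelled out manually.
public sealed class CountdownEnumerator : IEnumerator<int>
{
    private readonly int _start;
    private int _current;
    private bool _started;

    public CountdownEnumerator(int start) { _start = start; }

    public int Current { get { return _current; } }
    object IEnumerator.Current { get { return _current; } }

    public bool MoveNext()
    {
        if (!_started) { _started = true; _current = _start; }
        else { _current--; }
        return _current >= 1;
    }

    public void Reset() { _started = false; }
    public void Dispose() { }
}

public sealed class Countdown : IEnumerable<int>
{
    private readonly int _start;
    public Countdown(int start) { _start = start; }

    public IEnumerator<int> GetEnumerator() { return new CountdownEnumerator(_start); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

public static class CountdownSketch
{
    // The same sequence with yield: the compiler generates the state machine.
    public static IEnumerable<int> WithYield(int start)
    {
        for (int i = start; i >= 1; i--)
            yield return i;
    }
}
```

Both new Countdown(3) and CountdownSketch.WithYield(3) enumerate 3, 2, 1; the difference is purely in how much code you maintain.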
Actually, I use it in a non-traditional way on my site IdeaPipe:
public override IEnumerator<T> GetEnumerator()
{
    // goes through the collection and only returns the ones that are visible for the current user
    // this is done at this level instead of the display level so that ideas do not bleed through
    // on services
    foreach (T idea in InternalCollection)
        if (idea.IsViewingAuthorized)
            yield return idea;
}
so basically it checks if viewing the idea is currently authorized and if it is it returns the idea. If it isn't, it is just skipped. This allows me to cache the Ideas but still display the ideas to the users that are authorized. Else I would have to re pull them each time based on permissions, when they are only re-ranked every 1 hour.
One interesting use is as a mechanism for asynchronous programming, especially for tasks that take multiple steps and require the same set of data in each step. Two examples of this would be Jeffrey Richter's AsyncEnumerator Part 1 and Part 2. The Concurrency and Coordination Runtime (CCR) also makes use of this technique; see CCR Iterators.
LINQ's operators on the Enumerable class are implemented as iterators that are created with the yield statement. It allows you to chain operations like Select() and Where() without actually enumerating anything until you actually use the enumerator in a loop, typically by using the foreach statement. Also, since only one value is computed when you call IEnumerator.MoveNext() if you decide to stop mid-collection, you'll save the performance hit of calculating all of the results.
Iterators can also be used to implement other kinds of lazy evaluation, where expressions are evaluated only when you need them. You can also use yield for fancier things like coroutines.
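A small sketch of the deferred behaviour described above. The counter and the IsEven helper are assumptions made for the demonstration; only Enumerable.Range, Where, Select and First are real LINQ calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionSketch
{
    public static int PredicateCalls;

    private static bool IsEven(int n)
    {
        PredicateCalls++;
        return n % 2 == 0;
    }

    // Returns { calls after building the query, first result, calls after First() }.
    public static int[] Run()
    {
        PredicateCalls = 0;

        // Building the chain runs none of the lambdas.
        IEnumerable<int> query = Enumerable.Range(1, 100)
            .Where(n => IsEven(n))
            .Select(n => n * n);
        int callsAfterBuilding = PredicateCalls;

        // First() pulls just enough items (1, then 2) to produce one result.
        int first = query.First();
        int callsAfterFirst = PredicateCalls;

        return new[] { callsAfterBuilding, first, callsAfterFirst };
    }
}
```

Run() returns { 0, 4, 2 }: nothing is evaluated while the query is built, and stopping after the first match means the remaining 98 values are never tested.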
Another good use for yield is to perform a function on the elements of an IEnumerable and to return a result of a different type, for example:
public delegate T SomeDelegate<K, T>(K obj);
public IEnumerable<T> DoActionOnList<K, T>(IEnumerable<K> list, SomeDelegate<K, T> action)
{
foreach (var i in list)
yield return action(i);
}
Using yield can prevent downcasting to a concrete type. This is handy to ensure that the consumer of the collection doesn't manipulate it.
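The downcasting point can be checked with a short sketch. The class and method names are hypothetical; the idea is simply that returning the list itself lets a caller cast back to List<int>, whereas an iterator method returns a compiler-generated object that is not a list.

```csharp
using System;
using System.Collections.Generic;

public static class EncapsulationSketch
{
    private static readonly List<int> Items = new List<int> { 1, 2, 3 };

    // Returns the list itself typed as IEnumerable<int>: a caller can cast it back.
    public static IEnumerable<int> Exposed()
    {
        return Items;
    }

    // Returns a compiler-generated iterator: there is no List<int> to cast to.
    public static IEnumerable<int> Hidden()
    {
        foreach (int item in Items)
            yield return item;
    }
}
```

Here (EncapsulationSketch.Exposed() as List<int>) is non-null and could be used to call Add or Clear on the private list, while (EncapsulationSketch.Hidden() as List<int>) is null.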
You can also use yield return to treat a series of function results as a list. For instance, consider a company that pays its employees every two weeks. One could retrieve a subset of payroll dates as a list using this code:
void Main()
{
var StartDate = DateTime.Parse("01/01/2013");
var EndDate = DateTime.Parse("06/30/2013");
foreach (var d in GetPayrollDates(StartDate, EndDate)) {
Console.WriteLine(d);
}
}
// Calculate payroll dates in the given range.
// Assumes the first date given is a payroll date.
IEnumerable<DateTime> GetPayrollDates(DateTime startDate, DateTime endDate, int daysInPeriod = 14) {
var thisDate = startDate;
while (thisDate < endDate) {
yield return thisDate;
thisDate = thisDate.AddDays(daysInPeriod);
}
}
foreach (Person criminal in people.Where(person => person.isCriminal))
{
// do something
}
I have this piece of code and want to know how it actually works. Is it equivalent to an if statement nested inside the foreach iteration, or does it first loop through the list of people and then repeat the loop with the selected values? I care to know more about this from the perspective of efficiency.
foreach (Person criminal in people)
{
if (criminal.isCriminal)
{
// do something
}
}
Where uses deferred execution.
This means that the filtering does not occur immediately when you call Where. Instead, each time you call GetEnumerator().MoveNext() on the return value of Where, it checks if the next element in the sequence satisfies the condition. If it does not, it skips over this element and checks the next one. When there is an element that satisfies the condition, it stops advancing and you can get the value using Current.
Basically, it is like having an if statement inside a foreach loop.
To understand what happens, you must know how IEnumerable<T> works (because LINQ to Objects always works on IEnumerable<T>). An IEnumerable<T> returns an IEnumerator<T>, which implements an iterator. This iterator is lazy, i.e. it only ever yields one element of the sequence at a time. No looping is done in advance, unless you have an OrderBy or another operator that requires it.
So if you have ...
foreach (string name in source.Where(x => x.IsChecked).Select(x => x.Name)) {
Console.WriteLine(name);
}
... this will happen: The foreach-statement requires the first item which is requested from the Select, which in turn requires one item from Where, which in turn retrieves one item from the source. The first name is printed to the console.
Then the foreach-statement requires the second item which is requested from the Select, which in turn requires one item from Where, which in turn retrieves one item from the source. The second name is printed to the console.
and so on.
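The item-by-item pull described above can be made visible by logging from inside the lambdas. This is only a sketch: the log list and the integer source are assumptions for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PullOrderSketch
{
    // Records the order in which Where, Select and the loop body run.
    public static List<string> Trace()
    {
        var log = new List<string>();
        var source = new[] { 1, 2, 3, 4 };

        var query = source
            .Where(x => { log.Add("where " + x); return x % 2 == 0; })
            .Select(x => { log.Add("select " + x); return x * 10; });

        foreach (int value in query)
            log.Add("loop " + value);

        return log;
    }
}
```

The log comes out as: where 1, where 2, select 2, loop 20, where 3, where 4, select 4, loop 40. Each element travels through the whole chain before the next one is requested; there is no separate filtering pass.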
This means that both of your code snippets are logically equivalent.
It depends on what people is.
If people is an IEnumerable object (like a collection, or the result of a method using yield) then the two pieces of code in your question are indeed equivalent.
A naïve Where could be implemented as:
public static IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
// Error handling left out for simplicity.
foreach (TSource item in source)
{
if (predicate(item))
{
yield return item;
}
}
}
The actual code in Enumerable is a bit different to make sure that errors from passing a null source or predicate happen immediately rather than on the deferred execution, and to optimise for a few cases (e.g. source.Where(x => x.IsCriminal).Where(x => x.IsOnParole) is turned into the equivalent of source.Where(x => x.IsCriminal && x.IsOnParole) so that there's one fewer step in the chains of iterations), but that's the basic principle.
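One common way to get the "errors happen immediately" behaviour mentioned above is to split the method in two: a plain wrapper that validates, and a private iterator that does the deferred work. The sketch below is not the actual Enumerable source; the Filter and FilterIterator names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class EagerValidationSketch
{
    // Not an iterator: the null checks run as soon as the method is called.
    public static IEnumerable<T> Filter<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (predicate == null) throw new ArgumentNullException(nameof(predicate));
        return FilterIterator(source, predicate);
    }

    // The iterator: nothing in here runs until the result is enumerated.
    private static IEnumerable<T> FilterIterator<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (T item in source)
        {
            if (predicate(item))
                yield return item;
        }
    }
}
```

With this split, EagerValidationSketch.Filter<int>(null, x => true) throws ArgumentNullException at the call site, while the filtering itself stays lazy.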
If however people is an IQueryable then things are different, and depend on the details of the query provider in question.
The simplest possibility is that the query provider can't do anything special with the Where and so it ends up just doing pretty much the above, because that will still work.
But often the query provider can do something else. Let's say people is a DbSet<Person> in Entity Framework associated with a table in a database called people. If you do:
foreach(var person in people)
{
DoSomething(person);
}
Then Entity Framework will run SQL similar to:
SELECT *
FROM people
And then create a Person object for each row returned. We could do the same filtering as above to implement Where, but we can also do better.
If you do:
foreach (Person criminal in people.Where(person => person.isCriminal))
{
DoSomething(criminal);
}
Then Entity Framework will run SQL similar to:
SELECT *
FROM people
WHERE isCriminal = 1
This means that the logic of deciding which elements to return is done in the database before it comes back to .NET. It allows for indices to be used in computing the WHERE, which can be much more efficient. Even in the worst case, where there are no useful indices and the database has to do a full scan, the records we don't care about are never sent back from the database and no object is created for them just to be thrown away again, so the difference in performance can be immense.
I care to know more about this from the perspective of efficiency
You are hopefully satisfied that there's no double pass as you suggested might happen, and happy to learn that, when possible, it's even more efficient than the foreach … if you suggested.
A bare foreach and if will still beat .Where() against an IEnumerable (but not against a database source) as there are a few overheads to Where that foreach and if don't have, but it's to a degree that is only worth caring about in very hot paths. Generally Where can be used with reasonable confidence in its efficiency.
I am wondering whether a foreach loop runs more slowly when it iterates over an unstored list or array (the result of an expression) instead of a stored array or List.
I mean like that:
foreach (int number in list.OrderBy(x => x.Value))
{
// DoSomething();
}
Does the loop in this code calculate the sorting on every iteration or not?
The loop using stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (int number in list)
{
// DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, and so there are plenty of cases where it doesn't hold. Nevertheless, your first instinct should be to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (int number in list)
{
// DoSomething();
}
vs this option:
foreach (int number in tours.OrderBy(x => x.Value))
{
// DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the linked documentation, you'll see it returns an IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is finished as soon as you start iterating over the object (and that, I believe, is the crux of your question: No, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same problem to solve for ordering the results, and they accomplish it the same way, meaning they take essentially the same amount of time to reach that point in the code.
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also potentially need to create new memory allocations for the full objects, if you were previously reading from a stream or database reader that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here, that you are working with something that is not already a list. What I mean by this is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .Net uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
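To check the "it does not re-sort on each iteration" claim, the sketch below counts key-selector calls around a foreach over OrderBy. The counter and sample values are assumptions for the demonstration; the observed numbers reflect the LINQ to Objects implementations I am aware of, which compute all keys once when enumeration starts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderByOnceSketch
{
    public static int KeySelectorCalls;

    // Returns { calls before the loop, calls when the first item arrives,
    //           calls after the loop, number of iterations }.
    public static int[] Run()
    {
        KeySelectorCalls = 0;
        var values = new[] { 5, 3, 1, 4, 2 };

        // Building the ordered sequence does not sort anything yet.
        IEnumerable<int> ordered = values.OrderBy(v => { KeySelectorCalls++; return v; });
        int callsBeforeLoop = KeySelectorCalls;

        int callsAtFirstItem = -1;
        int iterations = 0;
        foreach (int v in ordered)
        {
            if (iterations == 0) callsAtFirstItem = KeySelectorCalls;
            iterations++;
        }

        return new[] { callsBeforeLoop, callsAtFirstItem, KeySelectorCalls, iterations };
    }
}
```

Run() returns { 0, 5, 5, 5 }: the sort happens once, when the loop asks for the first item, and later iterations just read from the already sorted buffer.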
Yes and no.
Yes the foreach statement will seem to work slower.
No your program has the same total amount of work to do so you will not be able to measure a difference from the outside.
What you need to focus on is not enumerating a lazy operation (in this case OrderBy) multiple times without a .ToList() or .ToArray(). In this case you are only enumerating it once (in the foreach), but it is an easy thing to miss.
Edit: Just to be clear. The as statement in the question will not work as intended but my answer assumes no .ToList() after OrderBy .
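The cost of enumerating a lazy query more than once can be seen with a counter. This is a sketch with a hypothetical Expensive helper standing in for real work.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MultipleEnumerationSketch
{
    public static int Work;

    private static int Expensive(int x)
    {
        Work++;
        return x * 2;
    }

    // Returns { work done by two passes over a lazy query,
    //           work done by two passes over a ToList() copy }.
    public static int[] Run()
    {
        Work = 0;
        IEnumerable<int> lazy = Enumerable.Range(1, 10).Select(x => Expensive(x));
        int lazySum1 = lazy.Sum();   // runs Expensive 10 times
        int lazySum2 = lazy.Sum();   // runs Expensive 10 more times
        int lazyWork = Work;

        Work = 0;
        List<int> cached = Enumerable.Range(1, 10).Select(x => Expensive(x)).ToList();
        int cachedSum1 = cached.Sum();   // reads the stored results
        int cachedSum2 = cached.Sum();   // reads them again, no recomputation
        int cachedWork = Work;

        return new[] { lazyWork, cachedWork };
    }
}
```

Run() returns { 20, 10 }: every pass over the lazy query re-executes it, whereas the list pays once up front.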
This line won't work as intended, because OrderBy does not return a List<Tour>, so the as cast yields null:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
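And yes, the loop over the stored results will itself enumerate faster, because the sort has already been performed by ToList() before the loop starts. The sort still runs once either way, so the total work is about the same unless you enumerate the results more than once.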
What's the common approach to designing applications that rely heavily on lazy evaluation in C# (LINQ, IEnumerable, IQueryable, ...)?
Right now I usually attempt to make every query as lazy as possible, using yield return and LINQ queries, but at runtime this can lead to "too lazy" behavior, where every query gets rebuilt from its beginning, obviously resulting in severe visible performance degradation.
What I usually do is put ToList() calls somewhere to cache the data, but I suspect this approach might be incorrect.
What are the appropriate / common ways to design this sort of application from the very beginning?
I find it useful to classify each IEnumerable into one of three categories.
1. fast ones - e.g. lists and arrays
2. slow ones - e.g. database queries or heavy calculations
3. non-deterministic ones - e.g. list.Select(x => new { ... })
For category 1, I tend to keep the concrete type when appropriate: arrays, IList, etc.
For category 3, those are best kept local within a method, to avoid hard-to-find bugs.
Then we have category 2, and as always when optimizing performance, measure first to find the bottlenecks.
A few random thoughts - as the question itself is loosely defined:
Lazy is good only when the result might not be used hence loaded only when needed. Most operations, however, would need the data to be loaded so laziness is not good in that term.
Laziness can cause difficult bugs. We have seen it all with data contexts in ORMs
Lazy is good when it comes to MEF
Pretty broad question and unfortunately you're going to hear this a lot: It depends. Lazy-loading is great until it's not.
In general, if you're using the same IEnumerables over and over it might be best to cache them as lists.
But rarely does it make sense for your callers to know this either way. That is, if you're getting IEnumerables from a repository or something, it is best to let the repository do its job. It might cache it as a list internally or it might build it up every time. If your callers try to get too clever they might miss changes in the data, etc.
I would suggest doing a ToList in your DAL before returning the DTO
public IList<UserDTO> GetUsers()
{
using (var db = new DbContext())
{
return (from u in db.tblUsers
select new UserDTO()
{
Name = u.Name
}).ToList();
}
}
In the example above you have to do a ToList() before the DbContext scope ends.
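Why the ToList() has to happen inside the using block can be shown without a real database. FakeContext below is a hypothetical stand-in for a DbContext: it refuses to produce rows once it has been disposed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a data context; not a real ORM type.
public sealed class FakeContext : IDisposable
{
    private bool _disposed;

    public IEnumerable<string> Users()
    {
        foreach (string name in new[] { "ann", "bob" })
        {
            if (_disposed) throw new ObjectDisposedException(nameof(FakeContext));
            yield return name;
        }
    }

    public void Dispose() { _disposed = true; }
}

public static class DisposalSketch
{
    // Returns a query that has not run yet; the context is disposed before it does.
    public static IEnumerable<string> GetUsersLazy()
    {
        using (var db = new FakeContext())
        {
            return db.Users().Select(n => n.ToUpperInvariant());
        }
    }

    // Forces the query to run while the context is still alive.
    public static IList<string> GetUsersEager()
    {
        using (var db = new FakeContext())
        {
            return db.Users().Select(n => n.ToUpperInvariant()).ToList();
        }
    }
}
```

Enumerating GetUsersLazy() throws ObjectDisposedException at the caller, far from the code that created the context, while GetUsersEager() returns "ANN" and "BOB".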
If you need a certain sequence of data to be cached, call one of the conversion operators (ToList, ToArray, etc.) on that sequence. Otherwise just use lazy evaluation.
Build your code around your data. What data is volatile and needs to be pulled fresh each time? Use lazy evaluation and don't cache. What data is relatively static and only needs to be pulled once? Cache that data in memory so you don't pull it unnecessarily.
Deferred execution and caching all items with .ToList() are not the only options. The third option is to cache the items while you are iterating by using a lazy List.
The execution is still deferred, but each item is only produced once. An example of how this works:
public class LazyListTest
{
private int _count = 0;
public void Test()
{
var numbers = Enumerable.Range(1, 40);
var numbersQuery = numbers.Select(GetElement).ToLazyList(); // Cache lazy
var total = numbersQuery.Take(3)
.Concat(numbersQuery.Take(10))
.Concat(numbersQuery.Take(3))
.Sum();
Console.WriteLine(_count);
}
private int GetElement(int value)
{
_count++;
// Some slow stuff here...
return value * 100;
}
}
If you run the Test() method, the _count is only 10. Without caching it would be 16 and with .ToList() it would be 40!
An example of the implementation of LazyList can be found here.
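For readers without access to the linked implementation, here is a minimal sketch of the idea. It is not that implementation: CachingEnumerable is a hypothetical name, the class is not thread-safe, and it keeps the source enumerator open until the source has been read to the end.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Caches items as they are pulled, so each source item is computed at most once.
public sealed class CachingEnumerable<T> : IEnumerable<T>
{
    private readonly List<T> _cache = new List<T>();
    private IEnumerator<T> _source; // set to null once the source is exhausted

    public CachingEnumerable(IEnumerable<T> source)
    {
        _source = source.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        int index = 0;
        while (true)
        {
            if (index < _cache.Count)
            {
                yield return _cache[index++];   // serve from the cache
            }
            else if (_source == null)
            {
                yield break;                    // cache holds everything
            }
            else if (_source.MoveNext())
            {
                _cache.Add(_source.Current);    // pull one more item
            }
            else
            {
                _source.Dispose();
                _source = null;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

public static class CachingEnumerableSketch
{
    public static int Calls;

    private static int GetElement(int value) { Calls++; return value * 100; }

    // Mirrors the Take(3) / Take(10) / Take(3) test from the answer above.
    public static int Run()
    {
        Calls = 0;
        var cached = new CachingEnumerable<int>(
            Enumerable.Range(1, 40).Select(v => GetElement(v)));
        return cached.Take(3).Concat(cached.Take(10)).Concat(cached.Take(3)).Sum();
    }
}
```

CachingEnumerableSketch.Run() returns 6700 and leaves Calls at 10, matching the count quoted above for the lazy list.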
There are several useful questions here on SO about the benefits of yield return. For example,
Can someone demystify the yield keyword
Interesting use of the c# yield keyword
What is the yield keyword
I'm looking for thoughts on when NOT to use yield return. For example, if I expect to need to return all items in a collection, it doesn't seem like yield would be useful, right?
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
It's a good idea to think carefully about your use of "yield return" when dealing with recursively defined structures. For example, I often see this:
public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
if (root == null) yield break;
yield return root.Value;
foreach(T item in PreorderTraversal(root.Left))
yield return item;
foreach(T item in PreorderTraversal(root.Right))
yield return item;
}
Perfectly sensible-looking code, but it has performance problems. Suppose the tree is h deep. Then there will at most points be O(h) nested iterators built. Calling "MoveNext" on the outer iterator will then make O(h) nested calls to MoveNext. Since it does this O(n) times for a tree with n items, that makes the algorithm O(hn). And since the height of a binary tree is lg n <= h <= n, that means that the algorithm is at best O(n lg n) and at worst O(n^2) in time, and best case O(lg n) and worst case O(n) in stack space. It is O(h) in heap space because each enumerator is allocated on the heap. (On implementations of C# I'm aware of; a conforming implementation might have other stack or heap space characteristics.)
But iterating a tree can be O(n) in time and O(1) in stack space. You can write this instead like:
public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
    var stack = new Stack<Tree<T>>();
    stack.Push(root);
    while (stack.Count != 0)
    {
        var current = stack.Pop();
        if (current == null) continue;
        yield return current.Value;
        // Push Right first so that Left is popped (and visited) first,
        // matching the order of the recursive version.
        stack.Push(current.Right);
        stack.Push(current.Left);
    }
}
which still uses yield return, but is much smarter about it. Now we are O(n) in time and O(h) in heap space, and O(1) in stack space.
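To compare the two traversals side by side, the sketch below supplies a minimal Tree<T> (a hypothetical shape, since the answer never shows the type) together with a recursive and an explicit-stack version. The explicit-stack version pushes the right child before the left one so that both visit nodes in the same order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal tree type, assumed for this example only.
public sealed class Tree<T>
{
    public T Value;
    public Tree<T> Left;
    public Tree<T> Right;

    public Tree(T value, Tree<T> left = null, Tree<T> right = null)
    {
        Value = value;
        Left = left;
        Right = right;
    }
}

public static class TraversalSketch
{
    // Nested iterators: each level of the tree adds one more MoveNext hop.
    public static IEnumerable<T> Recursive<T>(Tree<T> root)
    {
        if (root == null) yield break;
        yield return root.Value;
        foreach (T item in Recursive(root.Left))
            yield return item;
        foreach (T item in Recursive(root.Right))
            yield return item;
    }

    // Single iterator with an explicit stack on the heap.
    public static IEnumerable<T> Iterative<T>(Tree<T> root)
    {
        var stack = new Stack<Tree<T>>();
        stack.Push(root);
        while (stack.Count != 0)
        {
            var current = stack.Pop();
            if (current == null) continue;
            yield return current.Value;
            stack.Push(current.Right); // pushed first, so visited after Left
            stack.Push(current.Left);
        }
    }
}
```

For the tree 1 with children 2 (itself with children 4 and 5) and 3, both methods yield 1, 2, 4, 5, 3.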
Further reading: see Wes Dyer's article on the subject:
http://blogs.msdn.com/b/wesdyer/archive/2007/03/23/all-about-iterators.aspx
What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?
I can think of a couple of cases, IE:
Avoid using yield return when you return an existing iterator. Example:
// Don't do this, it creates overhead for no reason
// (a new state machine needs to be generated)
public IEnumerable<string> GetKeys()
{
foreach(string key in _someDictionary.Keys)
yield return key;
}
// DO this
public IEnumerable<string> GetKeys()
{
return _someDictionary.Keys;
}
Avoid using yield return when you don't want to defer execution code for the method. Example:
// Don't do this, the exception won't get thrown until the iterator is
// iterated, which can be very far away from this method invocation
public IEnumerable<string> Foo(Bar baz)
{
if (baz == null)
throw new ArgumentNullException();
yield ...
}
// DO this
public IEnumerable<string> Foo(Bar baz)
{
if (baz == null)
throw new ArgumentNullException();
return new BazIterator(baz);
}
The key thing to realize is what yield is useful for, then you can decide which cases do not benefit from it.
In other words, when you do not need a sequence to be lazily evaluated you can skip the use of yield. When would that be? It would be when you do not mind immediately having your entire collection in memory. Otherwise, if you have a huge sequence that would negatively impact memory, you would want to use yield to work on it step by step (i.e., lazily). A profiler might come in handy when comparing both approaches.
Notice how most LINQ statements return an IEnumerable<T>. This allows us to continually string different LINQ operations together without negatively impacting performance at each step (aka deferred execution). The alternative picture would be putting a ToList() call in between each LINQ statement. This would cause each preceding LINQ statement to be immediately executed before performing the next (chained) LINQ statement, thereby forgoing any benefit of lazy evaluation and utilizing the IEnumerable<T> till needed.
There are a lot of excellent answers here. I would add this one: Don't use yield return for small or empty collections where you already know the values:
IEnumerable<UserRight> GetSuperUserRights() {
if(SuperUsersAllowed) {
yield return UserRight.Add;
yield return UserRight.Edit;
yield return UserRight.Remove;
}
}
In these cases the creation of the Enumerator object is more expensive, and more verbose, than just generating a data structure.
IEnumerable<UserRight> GetSuperUserRights() {
return SuperUsersAllowed
? new[] {UserRight.Add, UserRight.Edit, UserRight.Remove}
: Enumerable.Empty<UserRight>();
}
Update
Here's the results of my benchmark:
These results show how long it took (in milliseconds) to perform the operation 1,000,000 times. Smaller numbers are better.
In revisiting this, the performance difference isn't significant enough to worry about, so you should go with whatever is the easiest to read and maintain.
Update 2
I'm pretty sure the above results were achieved with compiler optimization disabled. Running in Release mode with a modern compiler, it appears performance is practically indistinguishable between the two. Go with whatever is most readable to you.
Eric Lippert raises a good point (too bad C# doesn't have stream flattening like Cw). I would add that sometimes the enumeration process is expensive for other reasons, and therefore you should use a list if you intend to iterate over the IEnumerable more than once.
For example, LINQ-to-objects is built on "yield return". If you've written a slow LINQ query (e.g. that filters a large list into a small list, or that does sorting and grouping), it may be wise to call ToList() on the result of the query in order to avoid enumerating multiple times (which actually executes the query multiple times).
If you are choosing between "yield return" and List<T> when writing a method, consider: is each single element expensive to compute, and will the caller need to enumerate the results more than once? If you know the answers are yes and yes, you shouldn't use yield return (unless, for example, the List produced is very large and you can't afford the memory it would use. Remember, another benefit of yield is that the result list doesn't have to be entirely in memory at once).
Another reason not to use "yield return" is if interleaving operations is dangerous. For example, if your method looks something like this,
IEnumerable<T> GetMyStuff() {
foreach (var x in MyCollection)
if (...)
yield return (...);
}
this is dangerous if there is a chance that MyCollection will change because of something the caller does:
foreach(T x in GetMyStuff()) {
if (...)
MyCollection.Add(...);
// Oops, now GetMyStuff() will throw an exception
// because MyCollection was modified.
}
yield return can cause trouble whenever the caller changes something that the yielding function assumes does not change.
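The interleaving hazard is easy to reproduce. In this sketch GetMyStuff takes the list as a parameter (an assumption made to keep the example self-contained), and a ToList() snapshot is shown as one possible way around the problem.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InterleavingSketch
{
    // Lazily walks the caller's list; the list must not change while this runs.
    public static IEnumerable<int> GetMyStuff(List<int> collection)
    {
        foreach (int x in collection)
            yield return x;
    }

    // Modifies the list while the iterator is still walking it.
    public static bool ThrowsWhenModifiedDuringLoop()
    {
        var collection = new List<int> { 1, 2, 3 };
        try
        {
            foreach (int x in GetMyStuff(collection))
                collection.Add(x * 10);
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // Takes a snapshot first, so the loop no longer touches the live list.
    public static int CountAfterLoopOverSnapshot()
    {
        var collection = new List<int> { 1, 2, 3 };
        foreach (int x in GetMyStuff(collection).ToList())
            collection.Add(x * 10);
        return collection.Count;
    }
}
```

ThrowsWhenModifiedDuringLoop() returns true because List<T>'s enumerator detects the change on the next MoveNext, while CountAfterLoopOverSnapshot() completes and returns 6.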
I would avoid using yield return if the method has a side effect that you expect on calling the method. This is due to the deferred execution that Pop Catalin mentions.
One side effect could be modifying the system, which could happen in a method like IEnumerable<Foo> SetAllFoosToCompleteAndGetAllFoos(), which breaks the single responsibility principle. That's pretty obvious (now...), but a not so obvious side effect could be setting a cached result or similar as an optimisation.
My rules of thumb (again, now...) are:
Only use yield if the object being returned requires a bit of processing
No side effects in the method if I need to use yield
If I have to have side effects (and limiting that to caching etc), don't use yield and make sure the benefits of expanding the iteration outweigh the costs
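A short sketch of why side effects and yield mix badly: the body of an iterator method does not run when the method is called, only when the result is enumerated. The Completed flag and method name below are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SideEffectSketch
{
    public static bool Completed;

    // The assignment below looks like it happens on the call, but it is deferred.
    public static IEnumerable<int> SetCompleteAndGetAll()
    {
        Completed = true;
        yield return 1;
        yield return 2;
    }

    // Returns { flag right after the call, flag after enumerating }.
    public static bool[] Run()
    {
        Completed = false;
        IEnumerable<int> all = SetCompleteAndGetAll();
        bool afterCall = Completed;

        List<int> items = all.ToList();
        bool afterEnumeration = Completed;

        return new[] { afterCall, afterEnumeration };
    }
}
```

Run() returns { false, true }: a caller who ignores the returned sequence never triggers the side effect at all.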
Yield would be limiting/unnecessary when you need random access. If you need to access element 0 then element 99, you've pretty much eliminated the usefulness of lazy evaluation.
One that might catch you out is if you are serialising the results of an enumeration and sending them over the wire. Because the execution is deferred until the results are needed, you will serialise an empty enumeration and send that back instead of the results you want.
I have to maintain a pile of code from a guy who was absolutely obsessed with yield return and IEnumerable. The problem is that a lot of third party APIs we use, as well as a lot of our own code, depend on Lists or Arrays. So I end up having to do:
IEnumerable<foo> myFoos = getSomeFoos();
List<foo> fooList = new List<foo>(myFoos);
thirdPartyApi.DoStuffWithArray(fooList.ToArray());
Not necessarily bad, but kind of annoying to deal with, and on a few occasions it's led to creating duplicate Lists in memory to avoid refactoring everything.
When you don't want a code block to return an iterator for sequential access to an underlying collection, you don't need yield return. You simply return the collection then.
If you're defining a Linq-y extension method where you're wrapping actual Linq members, those members will more often than not return an iterator. Yielding through that iterator yourself is unnecessary.
Beyond that, you can't really get into much trouble using yield to define a "streaming" enumerable that is evaluated on a JIT basis.
Consider the following simple code pattern:
foreach(Item item in itemList)
{
if(item.Foo)
{
DoStuff(item);
}
}
If I want to parallelize it using Parallel Extensions (PE), I might simply replace the foreach loop construct as follows:
Parallel.ForEach(itemList, delegate(Item item)
{
if(item.Foo)
{
DoStuff(item);
}
});
However, PE will perform unnecessary work assigning work to threads for those items where Foo turned out to be false. Thus I was thinking an intermediate wrapper/filtering IEnumerable might be a reasonable approach here. Do you agree? If so what is the simplest way of achieving this? (BTW I'm currently using C#2, so I'd be grateful for at least one example that doesn't use lambda expressions etc.)
I'm not sure how the partitioning in PE for .NET 2 works, so it's difficult to say there. If each element is being pushed into a separate work item (which would be a fairly poor partitioning strategy), then filtering in advance would make quite a bit of sense.
If, however, item.Foo happened to be at all expensive (I wouldn't expect this, given that it's a property, but it's always possible), allowing it to be parallelized could be advantageous.
In addition, in .NET 4, the partitioning strategy used by the TPL will handle this fairly well. It was specifically designed to handle situations with varying levels of work. It does partitioning in "chunks", so one item does not get sent to one thread, but rather a thread gets assigned a set of items, which it processes in bulk. Depending on the frequency of item.Foo being false, parallelizing (using TPL) would quite possibly be faster than filtering in advance.
That all factors down to this single line:
Parallel.ForEach(itemList.Where(i => i.Foo), DoStuff);
But reading a comment on another post, I now see you're still on .Net 2.0, so some of this may be a bit tricky to sneak past the compiler.
For .Net 2.0, I think you can do it like this (I'm not certain that passing the method names as delegates will still just work, but I think it will):
public IEnumerable<T> Where<T>(IEnumerable<T> source, Predicate<T> predicate)
{
    foreach (T item in source)
        if (predicate(item))
            yield return item;
}
public bool HasFoo(Item item) { return item.Foo; }
Parallel.ForEach(Where<Item>(itemList, HasFoo), DoStuff);
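For comparison, here is a self-contained sketch of the same filter-then-parallelize shape. It assumes a runtime where System.Threading.Tasks.Parallel and ConcurrentBag are available (so not the C# 2 environment from the question), and it uses plain integers with a hypothetical IsEven predicate in place of Item and Foo.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelFilterSketch
{
    // Lazy filter written without lambdas, in the style of the answer above.
    public static IEnumerable<T> Filter<T>(IEnumerable<T> source, Predicate<T> predicate)
    {
        foreach (T item in source)
            if (predicate(item))
                yield return item;
    }

    private static bool IsEven(int n) { return n % 2 == 0; }

    // Hands only the items that pass the filter to Parallel.ForEach.
    public static int[] ProcessEvens(IEnumerable<int> items)
    {
        var processed = new ConcurrentBag<int>();
        Parallel.ForEach(Filter<int>(items, IsEven), item => processed.Add(item));

        // Completion order is not deterministic, so sort before returning.
        int[] result = processed.ToArray();
        Array.Sort(result);
        return result;
    }
}
```

ProcessEvens(new[] { 1, 2, 3, 4, 5, 6 }) returns { 2, 4, 6 }. Whether filtering first actually beats filtering inside the parallel body depends on how expensive the predicate is, as the other answers point out, so this is worth measuring rather than assuming.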
If I was to implement this, I would simply filter the list, before calling the foreach.
var actionableResults = from x in ItemList where x.Foo select x;
This will filter the list to get the items that can be acted upon.
NOTE: this might be a premature optimization, and might not make a major difference in your performance.