I was trying to get some Lists sorted using OrderBy within a foreach loop, but for some reason they weren't maintaining their sort order outside of the loop. Here's some simplified code and comments to highlight what was happening:
public class Parent
{
    // Other properties...
    public IList<Child> Children { get; set; }
}

public IEnumerable<Parent> DoStuff()
{
    var result = DoOtherStuff() // Returns IEnumerable<Parent>
        .OrderByDescending(SomePredicate)
        .ThenBy(AnotherPredicate); // This sorting works as expected in the return value.

    foreach (Parent parent in result)
    {
        parent.Children = parent.Children.OrderBy(YetAnotherPredicate).ToList();
        // When I look at parent.Children here in the debugger, it's sorted properly.
    }

    return result;
    // When I look at the return value, the Children are not sorted.
}
However, when I instead assign result like this:
var result = DoOtherStuff()
    .OrderByDescending(SomePredicate)
    .ThenBy(AnotherPredicate)
    .ToList(); // <-- Added ToList here
then the return value has the Children sorted properly in each of the Parents.
What is the behavior of List<T> vs an IEnumerable<T> in a foreach loop?
There seems to be some sort of difference since turning result into a List fixed the problems with sorting in the foreach loop. It feels like the first code snippet creates an iterator that makes a copy of each element when you do the iteration with foreach (and thus my changes get applied to the copy but not the original object in result), while using ToList() made the enumerator give a pointer instead.
What's going on here?
The difference is that one is an expression that can produce a sequence of Parent objects, and the other is a list of Parent objects.
Each time you use the expression, it takes the original result from DoOtherStuff and sorts it again. In your case that means it creates a new set of Parent objects (which obviously don't retain the children from the previous use).
This means that when you loop through the objects and sort the children, those objects are thrown away. When you use the expression again to return the result, it creates a new set of objects where the children are naturally in the original order.
Sample code of what likely happens to add to Guffa's answer:
class Parent { public List<string> Children; }
Enumerable of "Parent", will create new "Parent" objects every time it is iterated:
var result = Enumerable.Range(0, 10)
    .Select(_ => new Parent { Children = new List<string> { "b", "a" } });
Now, on the first pass with foreach, 10 Parent objects will be created (one per iteration of the loop) and promptly discarded at the end of each iteration:
foreach (Parent parent in result)
{
    // sorts children of just created parent object
    parent.Children = parent.Children.OrderBy(YetAnotherPredicate).ToList();
    // parent is no longer referenced by anything - discarded and eligible for GC
}
When you look at result again it is re-iterated, and a new set of Parent objects is created every time you look at it, hence Children are not sorted.
Note that depending on how DoOtherStuff() (which returns IEnumerable<Parent>) is implemented, the result could be different. For example, DoOtherStuff() could return existing items from some cached collection:
List<Parent> allMyParents = ...;
IEnumerable<Parent> DoOtherStuff()
{
    return allMyParents.Take(7);
}
Now every iteration of result will give you a new sequence, but each item in it will just be an item from the allMyParents list, so modifying the Children property would change the instances in allMyParents and the change would stick.
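A minimal sketch of that cached-collection case (my own illustration, reusing the Parent class defined above): because every enumeration hands back the same Parent instances from the backing list, the sorted Children persist.
var allMyParents = new List<Parent>
{
    new Parent { Children = new List<string> { "b", "a" } }
};

IEnumerable<Parent> result = allMyParents.Take(7); // still deferred, but yields existing instances

foreach (Parent parent in result)
{
    parent.Children = parent.Children.OrderBy(c => c).ToList();
}

// Re-enumerating gives back the same instances, so the sort sticks:
Console.WriteLine(string.Join(", ", result.First().Children)); // prints "a, b"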
Remarks
The ToList(IEnumerable) method forces immediate query evaluation and returns a List that contains the query results. You can append this method to your query in order to obtain a cached copy of the query results.
From: https://msdn.microsoft.com/en-us/library/bb342261(v=vs.110).aspx
If you omit the ToList(), the query is not evaluated until something enumerates it. Your debugger might do that for you, but that is just a guess on my part.
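To make the quoted behaviour concrete, here is a small sketch of my own (not from the original posts) showing that a deferred query re-runs its projection on every enumeration, while ToList() runs it once and caches the results:
var deferred = Enumerable.Range(0, 3)
    .Select(i => { Console.WriteLine("projecting {0}", i); return i; });

deferred.Count(); // prints "projecting 0" .. "projecting 2"
deferred.Count(); // prints all three lines again - the query re-executes

var cached = Enumerable.Range(0, 3)
    .Select(i => { Console.WriteLine("projecting {0}", i); return i; })
    .ToList();    // the three lines are printed once, right here

cached.Count();   // nothing printed - the results are already materialized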
Related
One of my coworkers came to me with a question about this method that results in an infinite loop. The actual code is a bit too involved to post here, but essentially the problem boils down to this:
private IEnumerable<int> GoNuts(IEnumerable<int> items)
{
    items = items.Select(item => items.First(i => i == item));
    return items;
}
This should (you would think) just be a very inefficient way to create a copy of a list. I called it with:
var foo = GoNuts(new[]{1,2,3,4,5,6});
The result is an infinite loop. Strange.
I think that modifying the parameter is, stylistically, a bad thing, so I changed the code slightly:
var foo = items.Select(item => items.First(i => i == item));
return foo;
That worked. That is, the program completed; no exception.
More experiments showed that this works, too:
items = items.Select(item => items.First(i => i == item)).ToList();
return items;
As does a simple
return items.Select(item => .....);
Curious.
It's clear that the problem has to do with reassigning the parameter, but only if evaluation is deferred beyond that statement. If I add the ToList() it works.
I have a general, vague, idea of what's going wrong. It looks like the Select is iterating over its own output. That's a little bit strange in itself, because typically an IEnumerable will throw if the collection it's iterating changes.
What I don't understand, because I'm not intimately familiar with the internals of how this stuff works, is why re-assigning the parameter causes this infinite loop.
Is there somebody with more knowledge of the internals who would be willing to explain why the infinite loop occurs here?
The key to answering this is deferred execution. When you do this
items = items.Select(item => items.First(i => i == item));
you do not iterate the items array passed into the method. Instead, you assign it a new IEnumerable<int>, which references itself back, and starts iterating only when the caller starts enumerating the results.
That is why all your other fixes have dealt with the problem: all you needed to do is to stop feeding IEnumerable<int> back to itself:
Using var foo breaks self-reference by using a different variable,
Using return items.Select... breaks self-reference by not using intermediate variables at all,
Using ToList() breaks self-reference by avoiding deferred execution: by the time items is re-assigned, old items has been iterated over, so you end up with a plain in-memory List<int>.
But if it's feeding on itself, how does it get anything at all?
That's right, it does not get anything! The moment you try iterating items and ask it for the first item, the deferred sequence asks the sequence fed to it for the first item to process, which means that the sequence is asking itself for the first item to process. At this point, it's turtles all the way down, because in order to return the first item to process the sequence must first get the first item to process from itself.
It looks like the Select is iterating over its own output
You are correct. You are returning a query that iterates over itself.
The key is that you reference items within the lambda. The lambda closes over the items variable (not its value), so the reference is not resolved until the query iterates, at which point items references the query itself instead of the source collection. That's where the self-reference occurs.
Picture a deck of cards with a sign in front of it labelled items. Now picture a man standing beside the deck of cards whose assignment is to iterate the collection called items. But then you move the sign from the deck to the man. When you ask the man for the first "item" - he looks for the collection marked "items" - which is now him! So he asks himself for the first item, which is where the circular reference occurs.
When you assign the result to a new variable, you then have a query that iterates over a different collection, and so does not result in an infinite loop.
When you call ToList, you hydrate the query to a new collection and also do not get an infinite loop.
Other things that would break the circular reference:
Hydrating items within the lambda by calling ToList
Assigning items to another variable and referencing that within the lambda.
After studying the two answers given and poking around a bit, I came up with a little program that better illustrates the problem.
private int GetFirst(IEnumerable<int> items, int foo)
{
    Console.WriteLine("GetFirst {0}", foo);
    var rslt = items.First(i => i == foo);
    Console.WriteLine("GetFirst returns {0}", rslt);
    return rslt;
}

private IEnumerable<int> GoNuts(IEnumerable<int> items)
{
    items = items.Select(item =>
    {
        Console.WriteLine("Select item = {0}", item);
        return GetFirst(items, item);
    });
    return items;
}
If you call that with:
var newList = GoNuts(new[]{1, 2, 3, 4, 5, 6});
You'll get this output repeatedly until you finally get StackOverflowException.
Select item = 1
GetFirst 1
Select item = 1
GetFirst 1
Select item = 1
GetFirst 1
...
What this shows is exactly what dasblinkenlight made clear in his updated answer: the query goes into an infinite loop trying to get the first item.
Let's write GoNuts a slightly different way:
private IEnumerable<int> GoNuts(IEnumerable<int> items)
{
    var originalItems = items;
    items = items.Select(item =>
    {
        Console.WriteLine("Select item = {0}", item);
        return GetFirst(originalItems, item);
    });
    return items;
}
If you run that, it succeeds. Why? Because in this case it's clear that the call to GetFirst is passing a reference to the original items that were passed to the method. In the first case, GetFirst is passing a reference to the new items collection, which hasn't yet been realized. In turn, GetFirst says, "Hey, I need to enumerate this collection." And thus begins the first recursive call that eventually leads to StackOverflowException.
Interestingly, I was right and wrong when I said that it was consuming its own output. The Select is consuming the original input, as I would expect. The First is trying to consume the output.
Lots of lessons to be learned here. To me, the most important is "don't modify the value of input parameters."
Thanks to dasblinkenlight, D Stanley, and Lucas Trzesniewski for their help.
I have a treenode collection as IEnumerable<treenode> nodes. Is there any method to create a collection of treenode.guid directly from nodes without iterating all the elements?
E.g.:
guidcollection nodeguids = nodes.somemethod();
The answer is no - you obviously need to iterate through the collection of nodes if you need some info from each of them. However, you can do something like:
IEnumerable<Guid> nodeguids = nodes.Select(n => n.Guid);
This way you at least do not perform the iteration manually, although the implementation of Select still iterates over all the elements of the collection; that happens at the moment you actually try to use the nodeguids collection somewhere.
LINQ uses a lazy iterator. So you can create a collection like this...
IEnumerable<TreeNode> myCollection = GetTreeNodes();
IEnumerable<Guid> guidCollection = myCollection.Select(tn => tn.Guid);
The guidCollection will not be populated until it is iterated, counted, or otherwise acted upon in some way.
You could also just go old school.
IEnumerable<Guid> GetIdsFromCollection(IEnumerable<TreeNode> collection)
{
    foreach (var item in collection)
        yield return item.Guid;
}
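If you actually want a concrete collection of guids rather than a deferred sequence, you can materialize the projection; a small sketch, assuming a TreeNode type with a Guid property and the same GetTreeNodes() source mentioned above:
IEnumerable<TreeNode> nodes = GetTreeNodes();

List<Guid> guidList  = nodes.Select(n => n.Guid).ToList();  // evaluated once, right here
Guid[]     guidArray = nodes.Select(n => n.Guid).ToArray(); // or as an array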
In C#, I have noticed that if I am running a foreach loop on a LINQ generated IEnumerable<T> collection and try to modify the contents of each T element, my modifications are not persistent.
On the other hand, if I apply the ToArray() or ToList() method when creating my collection, modifications of the individual elements in the foreach loop are persistent.
I suspect that this is in some way related to deferred execution, but exactly how is not entirely obvious to me. I would really appreciate an explanation to this difference in behavior.
Here is some example code - I have a class MyClass with a constructor and auto-implemented property:
public class MyClass
{
public MyClass(int val) { Str = val.ToString(); }
public string Str { get; set; }
}
In my example application I use LINQ Select() to create two collections of MyClass objects based on a collection of integers: one IEnumerable<MyClass>, and one List<MyClass> by applying the ToList() method at the end.
var ints = Enumerable.Range(1, 10);
var myClassEnumerable = ints.Select(i => new MyClass(i));
var myClassArray = ints.Select(i => new MyClass(i)).ToList();
Next, I run a foreach loop over each of the collections, and modify the contents of the looped-over MyClass objects:
foreach (var obj in myClassEnumerable) obj.Str = "Something";
foreach (var obj in myClassArray) obj.Str = "Something else";
Finally, I output the Str member of the first element in each collection:
Console.WriteLine(myClassEnumerable.First().Str);
Console.WriteLine(myClassArray.First().Str);
Somewhat counter-intuitively, the output is:
1
Something else
Deferred execution is indeed the key point.
Executing myClassEnumerable.First().Str will re-execute your query ints.Select(i => new MyClass(i)), and so it will give you a fresh sequence of newly created MyClass instances.
You can see this in action using your debugger. Put a breakpoint at the new MyClass(i) part of the Select and you will see that this part gets hit again when you execute it for the Console.WriteLine.
You are right, it is deferred execution. A new MyClass instance is created each time you iterate the IEnumerable. By calling ToList or ToArray you then create a List or Array and populate it with the new MyClass instances created from the iteration of the IEnumerable.
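You can observe the "new instance per iteration" behaviour directly; a quick sketch of my own, reusing the MyClass type from the question:
var ints = Enumerable.Range(1, 10);
var deferred = ints.Select(i => new MyClass(i));
var materialized = deferred.ToList();

// Each enumeration of the deferred query builds fresh objects:
Console.WriteLine(ReferenceEquals(deferred.First(), deferred.First()));         // False
// The list hands back the same stored instance every time:
Console.WriteLine(ReferenceEquals(materialized.First(), materialized.First())); // True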
For example
var query = myDic.Where(x => !blacklist.Contains(x.Key));
foreach (var item in query)
{
    if (condition)
        blacklist.Add(item.Key + 1); // Key is int type
    ret.Add(item);
}
return ret;
Would this code be valid? And how do I improve it?
Updated
I am expecting my blacklist.Add(item.Key + 1) to result in a smaller ret than it would otherwise. The ToList() approach won't achieve my intention in this sense.
Are there any better ideas that are correct and unambiguous?
That is perfectly safe to do and there shouldn't be any problems, as you're not directly modifying the collection that you are iterating over. Though you are making other changes that affect the Where clause, it's not going to blow up on you.
The query (as written) is lazily evaluated, so blacklist is updated as you iterate through the collection, and all subsequent iterations will see the newly added items.
The above code is effectively the same as this:
foreach (var item in myDic)
{
    if (!blacklist.Contains(item.Key))
    {
        if (condition)
            blacklist.Add(item.Key + 1);
    }
}
So what you should get out of this is that as long as you are not directly modifying the collection that you are iterating over (the collection named after in in the foreach loop), what you are doing is safe.
If you're still not convinced, consider this and what would be written out to the console:
var blacklist = new HashSet<int>(Enumerable.Range(3, 100));
var query = Enumerable.Range(2, 98).Where(i => !blacklist.Contains(i));
foreach (var item in query)
{
    Console.WriteLine(item);
    if ((item % 2) == 0)
    {
        var value = 2 * item;
        blacklist.Remove(value);
    }
}
Yes. Changing a collection's internal objects is strictly prohibited while iterating over that collection.
UPDATE
I initially made this a comment, but here is a further bit of information:
I should note that my knowledge comes from experience and articles I've read a long time ago. There is a chance that you can execute the code above because (I believe) the query contains references to the selected object within blacklist. blacklist might be able to change, but not query. If you were strictly iterating over blacklist, you would not be able to add to the blacklist collection.
Your code as presented would not throw an exception. The collection being iterated (myDic) is not the collection being modified (blacklist or ret).
What will happen is that each iteration of the loop will evaluate the current item against the query predicate, which would inspect the blacklist collection to see if it contains the current item's key. This is lazily evaluated, so a change to blacklist in one iteration will potentially impact subsequent iterations, but it will not be an error. (blacklist is fully evaluated upon each iteration, its enumerator is not being held.)
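A self-contained sketch of my own (with made-up data, and with the OP's condition assumed to be always true) showing how adding Key + 1 to blacklist mid-loop shrinks ret:
var blacklist = new HashSet<int>();
var myDic = new Dictionary<int, string>
{
    { 1, "a" }, { 2, "b" }, { 3, "c" }, { 4, "d" }, { 5, "e" }
};
var ret = new List<KeyValuePair<int, string>>();

var query = myDic.Where(x => !blacklist.Contains(x.Key));
foreach (var item in query)
{
    blacklist.Add(item.Key + 1); // "condition" assumed true for every item
    ret.Add(item);
}

// The predicate is evaluated lazily, entry by entry, so keys 2 and 4 are
// already blacklisted by the time the Where clause examines them.
// With the usual insertion-order enumeration of a freshly built Dictionary,
// ret ends up holding only keys 1, 3 and 5.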
Let's say I have a class
public class MyObject
{
    public int SimpleInt { get; set; }
}
And I have a List<MyObject>, and I ToList() it and then change one of the SimpleInt, will my change be propagated back to the original list. In other words, what would be the output of the following method?
public void RunChangeList()
{
    var objs = new List<MyObject>() { new MyObject() { SimpleInt = 0 } };
    var whatInt = ChangeToList(objs);
}

public int ChangeToList(List<MyObject> objects)
{
    var objectList = objects.ToList();
    objectList[0].SimpleInt = 5;
    return objects[0].SimpleInt;
}
Why?
P/S: I'm sorry if it's obvious, but I don't have a compiler with me right now...
Yes, ToList will create a new list, but because in this case MyObject is a reference type then the new list will contain references to the same objects as the original list.
Updating the SimpleInt property of an object referenced in the new list will also affect the equivalent object in the original list.
(If MyObject was declared as a struct rather than a class then the new list would contain copies of the elements in the original list, and updating a property of an element in the new list would not affect the equivalent element in the original list.)
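A short sketch of my own contrasting the two cases (hypothetical RefObject/ValObject types, statements assumed to run inside a method):
public class  RefObject { public int SimpleInt; }
public struct ValObject { public int SimpleInt; }

// Reference type: both lists point at the same instance.
var refList = new List<RefObject> { new RefObject { SimpleInt = 0 } };
var refCopy = refList.ToList();
refCopy[0].SimpleInt = 5;
Console.WriteLine(refList[0].SimpleInt); // 5

// Value type: the new list holds copies, so the original is untouched.
var valList = new List<ValObject> { new ValObject { SimpleInt = 0 } };
var valCopy = valList.ToList();
var element = valCopy[0];                // the indexer returns a copy of the struct
element.SimpleInt = 5;
valCopy[0] = element;
Console.WriteLine(valList[0].SimpleInt); // 0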
From the Reflector'd source:
public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw Error.ArgumentNull("source");
    }
    return new List<TSource>(source);
}
So yes, your original list won't be updated (i.e. additions or removals won't carry over); however, the referenced objects will be.
ToList will always create a new list, which will not reflect any subsequent changes to the collection.
However, it will reflect changes to the objects themselves (Unless they're mutable structs).
In other words, if you replace an object in the original list with a different object, the list returned by ToList will still contain the first object.
However, if you modify one of the objects in the original list, the list returned by ToList will still contain the same (modified) object.
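A quick sketch of my own that mirrors those two statements, using the MyObject class from the question:
var first  = new MyObject { SimpleInt = 1 };
var source = new List<MyObject> { first };
var copy   = source.ToList();

// Replace the element in the original list: the copy is unaffected.
source[0] = new MyObject { SimpleInt = 99 };
Console.WriteLine(copy[0].SimpleInt); // 1 - copy still references the first object

// Modify an object both lists shared at ToList time: the change is visible through either.
first.SimpleInt = 42;
Console.WriteLine(copy[0].SimpleInt); // 42 - same instance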
Yes, it creates a new list. This is by design.
The list will contain the same results as the original enumerable sequence, but materialized into a persistent (in-memory) collection. This allows you to consume the results multiple times without incurring the cost of recomputing the sequence.
The beauty of LINQ sequences is that they are composable. Often, the IEnumerable<T> you get is the result of combining multiple filtering, ordering, and/or projection operations. Extension methods like ToList() and ToArray() allow you to convert the computed sequence into a standard collection.
The accepted answer correctly addresses the OP's question based on his example. However, it only applies when ToList is applied to a concrete collection; it does not hold when the elements of the source sequence have yet to be instantiated (due to deferred execution). In case of the latter, you might get a new set of items each time you call ToList (or enumerate the sequence).
Here is an adaptation of the OP's code to demonstrate this behaviour:
public static void RunChangeList()
{
    var objs = Enumerable.Range(0, 10).Select(_ => new MyObject() { SimpleInt = 0 });
    var whatInt = ChangeToList(objs); // whatInt gets 0
}

public static int ChangeToList(IEnumerable<MyObject> objects)
{
    var objectList = objects.ToList();
    objectList.First().SimpleInt = 5;
    return objects.First().SimpleInt;
}
Whilst the above code may appear contrived, this behaviour can appear as a subtle bug in other scenarios. See my other example for a situation where it causes tasks to get spawned repeatedly.
A new list is created, but the items in it are references to the original items (just like in the original list). Changes to the list itself are independent, but changes to the items will show up in both lists.
Just stumbled upon this old post and thought of adding my two cents. Generally, if I am in doubt, I quickly use the GetHashCode() method on any object to check object identity. So for the above -
public class MyObject
{
    public int SimpleInt { get; set; }
}

class Program
{
    public static void RunChangeList()
    {
        var objs = new List<MyObject>() { new MyObject() { SimpleInt = 0 } };
        Console.WriteLine("objs: {0}", objs.GetHashCode());
        Console.WriteLine("objs[0]: {0}", objs[0].GetHashCode());
        var whatInt = ChangeToList(objs);
        Console.WriteLine("whatInt: {0}", whatInt.GetHashCode());
    }

    public static int ChangeToList(List<MyObject> objects)
    {
        Console.WriteLine("objects: {0}", objects.GetHashCode());
        Console.WriteLine("objects[0]: {0}", objects[0].GetHashCode());
        var objectList = objects.ToList();
        Console.WriteLine("objectList: {0}", objectList.GetHashCode());
        Console.WriteLine("objectList[0]: {0}", objectList[0].GetHashCode());
        objectList[0].SimpleInt = 5;
        return objects[0].SimpleInt;
    }

    private static void Main(string[] args)
    {
        RunChangeList();
        Console.ReadLine();
    }
}
And the output on my machine -
objs: 45653674
objs[0]: 41149443
objects: 45653674
objects[0]: 41149443
objectList: 39785641
objectList[0]: 41149443
whatInt: 5
So essentially the objects that the lists carry remain the same in the above code. Hope the approach helps.
I think that this is equivalent to asking whether ToList does a deep or a shallow copy. As ToList has no way to clone MyObject, it must do a shallow copy; the created list therefore contains the same references as the original one, and the code returns 5.
ToList will always create a brand new list.
If the items in the list are value types, they will be directly updated; if they are reference types, any changes will be reflected back in the referenced objects.
In the case where the source object is a true IEnumerable (i.e. not just a collection packaged as an enumerable), ToList() may NOT return the same object references as in the original IEnumerable. It will return a new List of objects, but those objects may not be the same, or even Equal, to the objects yielded by the IEnumerable when it is enumerated again.
var objectList = objects.ToList();
objectList[0].SimpleInt=5;
This will update the original object as well. The new list will contain references to the same objects as the original list, so you can change the elements through either list and the update will be reflected in the other.
Now if you update a list (adding or deleting an item) that will not be reflected in the other list.
I don't see anywhere in the documentation that ToList() is always guaranteed to return a new list. If an IEnumerable is a List, it may be more efficient to check for this and simply return the same List.
The worry is that sometimes you may want to be absolutely sure that the returned List is != to the original List. Because Microsoft doesn't document that ToList will return a new List, we can't be sure (unless someone found that documentation). It could also change in the future, even if it works now.
The constructor new List<T>(IEnumerable<T> collection) is guaranteed to return a new List. I would use this instead.
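A tiny sketch of that approach (GetObjects() is just a hypothetical source of a non-empty list):
List<MyObject> original = GetObjects();          // hypothetical, assumed non-empty
var copy = new List<MyObject>(original);         // the constructor copies the elements into a new list

Console.WriteLine(ReferenceEquals(copy, original));       // False - always a distinct list
Console.WriteLine(ReferenceEquals(copy[0], original[0])); // True  - the elements are still shared references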