I was reading a piece of code from the "XStreamingReader" library (which seems like a really cool solution for executing LINQ queries over XML documents without loading the whole document into memory, as an XDocument object would)
and was wondering about the following:
public IEnumerable<XElement> Elements()
{
    using (var reader = readerFactory())
    {
        reader.MoveToContent();
        MoveToNextElement(reader);
        while (!reader.EOF)
        {
            yield return XElement.Load(reader.ReadSubtree());
            MoveToNextFollowing(reader);
        }
    }
}
public IEnumerable<XElement> Elements(XName name)
{
    return Elements().Where(x => x.Name == name);
}
Regarding the second method, Elements(XName): it first calls Elements() and then uses Where() to filter its results, but I'm intrigued by the order of execution here, since Elements() contains a yield statement.
From what I understand:
- Executing Elements() returns an IEnumerable collection, this collection physically does not contain any items YET.
- Where() is executed on that collection, behind the scene there's a loop which iterates through every item, new items are "Loaded" on the fly, since yield is being used.
- All items which matched the Where statement are returned as an IEnumerable collection, and are PHYSICALLY IN that collection.
First, am I correct with the above assumption?
Second, in case I'm right: what if I wanted to return a "yielded" collection rather than a collection that is physically filled with all the filtered data?
I'm asking because filling a collection with every match would defeat the entire purpose of NOT reading a whole "matching" block into memory and instead iterating one matching element at a time...
I assume when you say that items are physically in a collection, you mean that there is a structure in memory that contains all the items right now. With Where(), that's not the case, it uses yield too internally (or something that acts the same as yield).
When you try to fetch the first item, Where() iterates the source collection, until it finds the first item that matches. So, the elements are streamed both in Elements() and in Elements(XName) and the whole collection is never in memory, only piece by piece.
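To make the streaming behaviour concrete, here is a minimal sketch. MyWhere is a hypothetical stand-in for Where() written with yield return (it is not the actual framework source), and Numbers is an invented source that logs each item as it is produced. The log shows that nothing runs until enumeration starts, and that producing and filtering interleave item by item.

```csharp
using System;
using System.Collections.Generic;

public static class LazyWhereSketch
{
    // Hypothetical stand-in for Enumerable.Where, written with yield return.
    public static IEnumerable<T> MyWhere<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (T item in source)
        {
            if (predicate(item))
                yield return item; // hand out one match, then pause here
        }
    }

    // Invented source that records when each item is produced.
    public static IEnumerable<int> Numbers(List<string> log)
    {
        for (int i = 1; i <= 4; i++)
        {
            log.Add("produce " + i);
            yield return i;
        }
    }

    public static void Main()
    {
        var log = new List<string>();
        IEnumerable<int> evens = MyWhere(Numbers(log), n => n % 2 == 0);
        Console.WriteLine(log.Count); // 0: building the query produced nothing

        foreach (int n in evens)
            log.Add("consume " + n);

        // produce 1, produce 2, consume 2, produce 3, produce 4, consume 4
        Console.WriteLine(string.Join(", ", log));
    }
}
```

At no point does a collection holding both matches exist; each match is consumed before the next item is even produced.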
Where() is executed on that collection
First, am I correct with the above assumption?
No. Where returns a lazy IEnumerable<XElement>. Later, when that IEnumerable<XElement> is enumerated, then the elements are yielded and filtered.
If the thing which enumerates that lazy IEnumerable happens to collect the elements (such as a call to ToList), then all the elements will be in memory at that point. If the thing which enumerates that lazy IEnumerable happens to process each item one at a time (such as a foreach loop, which does not retain a reference to the XElement), then only one item at a time will be in memory.
All items which matched the Where statement are returned as an IEnumerable collection, and are PHYSICALLY IN that collection. First, am I correct with the above assumption?
No. Where implements an additional enumerator internally, which does what you want it to do. If the IEnumerable is not enumerated, then the reader is never called, and the individual XElement instances never get created, and the filtering code is never run.
See Jon Skeet's article on re-implementing the behavior of the Where clause: http://msmvps.com/blogs/jon_skeet/archive/2010/09/03/reimplementing-linq-to-objects-part-2-quot-where-quot.aspx . He mimics the existing implementation (for explanatory purposes; there is no need to use his re-implementation in real code), and his code uses yield return.
Note that if you call ToList, though, then the entire enumeration will be evaluated and copied to a list, so be careful what you do with the IEnumerable that Where returns.
Also keep in mind that if the reader returned by readerFactory is reading from memory (e.g. StringReader), then the document will exist physically in memory; there just won't be any DOM node instances until you enumerate them. And once you enumerate those elements, your document will exist twice in memory: once as the original text and once in DOM form. You may want to ensure that your streaming is done against a non-memory stream (e.g. directly from a file or network stream).
Does foreach correctly iterate over a flexible list, i.e. one whose length changes inside the loop?
For example:
//will iterate over all items in list?
foreach (var obj in list)
{
//list length changes here
//ex:
list.Add(...);
list.Remove(...);
list.Concat(...);
// and so on
}
and if it does ...how?
You can't modify a collection while enumerating it inside a foreach statement.
You should use another pattern to do what you are trying to do, because foreach does not allow you to change the collection you are looping over.
For Example:
Imagine you run a foreach on a sorted list from the beginning: you process the item with key="A", then you go to "B", then you change "C" to "B". What's going to happen? Your list is re-sorted and you no longer know what you are looping over or where you are.
In general you "could" do it with a for (int i = list.Count - 1; i >= 0; --i) loop or something like that, but this also depends on your context; I would really try to use another approach.
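As a rough illustration of the index-based alternative, the sketch below removes items from a List<int> while walking it backwards. The method name RemoveEvens and the sample data are invented for the example. Going from the end means that removing an element never shifts the positions still to be visited.

```csharp
using System;
using System.Collections.Generic;

public static class ReverseRemoval
{
    // Removes even numbers in place, iterating from the last index to the first.
    public static List<int> RemoveEvens(List<int> list)
    {
        for (int i = list.Count - 1; i >= 0; --i)
        {
            if (list[i] % 2 == 0)
                list.RemoveAt(i); // only shifts elements we have already visited
        }
        return list;
    }

    public static void Main()
    {
        var result = RemoveEvens(new List<int> { 1, 2, 3, 4, 6 });
        Console.WriteLine(string.Join(", ", result)); // 1, 3
    }
}
```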
Internal working: IEnumerator<T> is designed to enable the iterator pattern for iterating over collections of elements, rather than the length-index pattern. Iteration relies on two of its members.
The first is bool MoveNext(). Using this method, we can move from one element within the collection to the next while at the same time detecting when we have enumerated through every item using the Boolean return.
The second member, a read-only property called Current, returns the element currently in process. With these two members on the collection class, it is possible to iterate over the collection simply using a while loop.
MoveNext() returns false when it moves past the end of the collection. This replaces the need to count elements while looping. (The remaining member on IEnumerator<T>, Reset(), resets the enumeration.)
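The while loop described above can be sketched as follows. This is roughly what a foreach statement does on your behalf; the method name SumWithWhileLoop is made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class ManualEnumeration
{
    // Iterates with MoveNext()/Current directly instead of using foreach.
    public static int SumWithWhileLoop(IEnumerable<int> source)
    {
        int sum = 0;
        using (IEnumerator<int> e = source.GetEnumerator())
        {
            while (e.MoveNext())   // false once we move past the last element
                sum += e.Current;  // the element at the current position
        }
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(SumWithWhileLoop(new List<int> { 1, 2, 3 })); // 6
    }
}
```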
Per the documentation, if changes are made inside the loop the behavior is undefined. Undefined means that there are no restrictions on what it can do; there is no "incorrect behavior" when the behavior is undefined... crash, do what you want, send an email to your boss calling him nasty names and quitting, all equally valid. I would hope for a crash in this case, but again, whatever happens, happens and is considered "correct" according to the documentation.
You cannot change a collection inside a foreach loop over that same collection.
If you want, you can use a for loop to change the collection's length.
The collection you use in a foreach loop must be treated as read-only while you iterate over it. As per MSDN:
The foreach statement is used to iterate through the collection to get
the information that you want, but can not be used to add or remove
items from the source collection to avoid unpredictable side effects.
If you need to add or remove items from the source collection, use a
for loop.
But as per this link, it looks like this is now possible from .Net 4.0
When using .FromCache() on an IQueryable result set, should I additionally call .ToList(), or can I just return the IEnumerable<> returned by the materialized query with FromCache?
I am assuming you are using a derivative of the code from http://petemontgomery.wordpress.com/2008/08/07/caching-the-results-of-linq-queries/ . If you look at the FromCache implementation, you will see that query.ToList() is already called. This means that the evaluated list is what is cached. So,
You do NOT need to call ToList()
That depends entirely on what you want to do with it. If you're just going to foreach over it once then you may as well just leave it as an IEnumerable. There's no need to build up a list just to discard it right away.
If you plan to iterate over it multiple times it's probably best to ToList it, so that you're not accessing the underlying IQueryable multiple times. You should also ToList it if it's possible for the underlying query to change over time and you don't want those changes to be reflected in your query.
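A small sketch of the snapshot point, using an in-memory List<int> as a stand-in for the underlying query source (an assumption; a real IQueryable would hit its provider instead). The lazy query sees a later change to the source, while the list produced by ToList() does not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotSketch
{
    // Returns { lazy count, snapshot count } after the source has changed.
    public static int[] CountsAfterSourceChange()
    {
        var source = new List<int> { 1, 2, 3 };
        IEnumerable<int> lazy = source.Where(n => n > 1); // re-evaluated on each enumeration
        List<int> snapshot = lazy.ToList();               // evaluated once, right here

        source.Add(4); // the underlying data changes afterwards

        return new[] { lazy.Count(), snapshot.Count };
    }

    public static void Main()
    {
        int[] counts = CountsAfterSourceChange();
        Console.WriteLine(counts[0] + " vs " + counts[1]); // 3 vs 2
    }
}
```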
If you are likely to not need to iterate all of the items (you may end up stopping after the first item, or half way, or something like that) then it's probably best to leave it as an IEnumerable to potentially avoid even fetching some amount of data in the first place.
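To illustrate the early-exit case, the sketch below counts how many items a lazy source actually generates when the caller only needs the first one. ExpensiveItems is a hypothetical stand-in for a costly query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EarlyExitSketch
{
    public static int Produced; // counts items the source actually generated

    // Hypothetical stand-in for an expensive query with 1000 results.
    public static IEnumerable<int> ExpensiveItems()
    {
        for (int i = 1; i <= 1000; i++)
        {
            Produced++;
            yield return i;
        }
    }

    public static void Main()
    {
        int first = ExpensiveItems().First(); // stops after the first item
        Console.WriteLine(first + " " + Produced); // 1 1
    }
}
```

Had the method returned ExpensiveItems().ToList() instead, all 1000 items would have been generated before the caller saw the first one.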
If the method has no idea how it's going to be used, and it's just a helper method that will be used by not-yet-written code, then consider returning IEnumerable. The caller can call ToList on it if they have a compelling reason to turn it into a list.
For me, as a general rule, I leave such queries as IEnumerable unless I have some compelling reason to make it a List.
This is something that I was exploring to see if I could take what was
List<MdiChild> openMdiChildren = new List<MdiChild>();
foreach (MdiChild child in MdiManager.Pages)
{
    openMdiChildren.Add(child);
}
foreach (MdiChild child in openMdiChildren)
{
    child.Close();
}
and shorten it to not require 2 foreach loops.
Note I've changed what the objects are called to simplify this for this example (these come from 3rd party controls). But for information and understanding
MdiManager.Pages inherits from CollectionBase, which in turn implements IEnumerable
and MdiChild.Close() removes the open child from the MdiManager.Pages collection, thus altering the collection and causing the enumeration to throw an exception because the collection was modified during enumeration, e.g.
foreach (MdiChild child in MdiManager.Pages)
{
    child.Close(); // modifies MdiManager.Pages while it is being enumerated
}
I was able to shorten the working double foreach to
((IEnumerable) MdiManager.Pages).Cast<MdiChild>().ToList()
    .ForEach(new Action<MdiChild>(c => c.Close()));
Why does this not have the same issues dealing with modifying the collection during enumeration? My best guess is that when Enumerating over the List created by the ToList call that it is actually executing the actions on the matching item in the MdiManager.Pages collection and not the generated List.
Edit
I want to make it clear that my question is not how I can simplify this; I just wanted to understand why there weren't issues with modifying a collection when I performed it as I have it written currently.
Your call to ToList() is what saves you here, as it's essentially duplicating what you're doing above. ToList() actually creates a List<T> (a List<MdiChild> in this case) that contains all of the elements in MdiManager.Pages, then your subsequent call to ForEach operates on that list, not on MdiManager.Pages.
In the end, it's a matter of style preference. I'm not personally a fan of the ForEach function: I prefer the query composition functions like Where and ToList() for their simplicity and because they aren't designed to have side effects on the original source, whereas ForEach exists precisely for its side effects.
You could also do:
foreach (var child in MdiManager.Pages.Cast<MdiChild>().ToList())
{
    child.Close();
}
Fundamentally, all three approaches do exactly the same thing: they cache the contents of MdiManager.Pages into a List<MdiChild>, then iterate over that cached list and call Close() on each element.
When you call the ToList() method you're actually enumerating the MdiManager.Pages and creating a List<MdiChild> right there (so that's your foreach loop #1). Then when the ForEach() method executes it will enumerate the List<MdiChild> created previously and execute your action on each item (so that's foreach loop #2).
So essentially it's another way of accomplishing the same thing, just using LINQ.
You could also write it as:
foreach (var page in MdiManager.Pages.Cast<MdiChild>().ToList())
    page.Close();
In any case, when you call the ToList() extension method on an IEnumerable, you are creating a brand new list. Deletions from its source collection (in this case, MdiManager.Pages) will not affect the list output by ToList().
This same technique can be used to delete elements from a source collection without worrying about affecting the source enumerable.
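A self-contained sketch of that technique, using a plain List<string> in place of the third-party MdiManager.Pages collection (an assumption made so the example can run on its own). The loop enumerates the copy made by ToList() while removing from the original.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotRemoval
{
    // Removes every item from openPages by iterating over a snapshot of it.
    public static List<string> CloseAll(List<string> openPages)
    {
        foreach (string page in openPages.ToList()) // enumerates the copy
        {
            openPages.Remove(page);                 // mutates the original
        }
        return openPages;
    }

    public static void Main()
    {
        var pages = new List<string> { "a", "b", "c" };
        Console.WriteLine(CloseAll(pages).Count); // 0
    }
}
```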
You're mostly right.
ToList() creates a copy of the enumeration, and therefore you are enumerating the copy.
You could also do this, which is equivalent, and shows what you are doing:
var copy = new List<MdiChild>(MdiManager.Pages.Cast<MdiChild>());
foreach(var child in copy)
{
child.Close();
}
Since you are enumerating the elements of the copy, you don't have to worry about modifying the Pages collection: each object reference that existed in the Pages collection now also exists in copy, and changes to Pages don't affect it.
All the remaining methods on the call, ForEach() and the casts, are superfluous and can be eliminated.
At first glance, the culprit is ToList(), which is a method returning a copy of the items as a List, thus circumventing the problem.
This code :
IEnumerable<string> lines = File.ReadLines("file path");
foreach (var line in lines)
{
Console.WriteLine(line);
}
foreach (var line in lines)
{
Console.WriteLine(line);
}
throws an ObjectDisposedException : {"Cannot read from a closed TextReader."} if the second foreach is executed.
It seems that the iterator object returned from File.ReadLines(..) can't be enumerated more than once. You have to obtain a new iterator object by calling File.ReadLines(..) and then use it to iterate.
If I replace File.ReadLines(..) with my version(parameters are not verified, it's just an example):
// requires: using System.IO;
public static IEnumerable<string> MyReadLines(string path)
{
    // TextReader is abstract, so a StreamReader is used to open the file
    using (var stream = new StreamReader(path))
    {
        string line;
        while ((line = stream.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
it's possible to iterate over the lines of the file more than once.
An investigation using .NET Reflector showed that the implementation of File.ReadLines(..) calls a private File.InternalReadLines(TextReader reader) that creates the actual iterator. The reader passed as a parameter is used in the MoveNext() method of the iterator to get the lines of the file and is disposed when we reach the end of the file. This means that once MoveNext() returns false there is no way to iterate a second time, because the reader is closed and you have to get a new reader by creating a new iterator with the ReadLines(..) method. In my version a new reader is created in the MoveNext() method each time we start a new iteration.
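The difference can be observed with a small sketch. It reads from a StringReader instead of a file so that it runs without any file on disk, and the ReadersOpened counter is added purely for illustration. Because the reader is created inside the iterator body, each enumeration opens a fresh one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ReEnumerableLines
{
    public static int ReadersOpened; // illustration only: counts reader creations

    // The reader is created inside the iterator body, so it is created
    // again every time the returned sequence is enumerated.
    public static IEnumerable<string> Lines(string text)
    {
        ReadersOpened++;
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

    public static void Main()
    {
        IEnumerable<string> lines = Lines("first\nsecond");
        Console.WriteLine(ReadersOpened);  // 0: nothing has run yet
        Console.WriteLine(lines.Count());  // 2
        Console.WriteLine(lines.Count());  // 2 again, from a second reader
        Console.WriteLine(ReadersOpened);  // 2
    }
}
```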
Is this the expected behavior of the File.ReadLines(..) method?
I find troubling the fact that it's necessary to call the method each time before you enumerate the results. You would also have to call the method each time before you iterate the results of a Linq query that uses the method.
I know this is old, but I actually just ran into this while working on some code on a Windows 7 machine. Contrary to what people were saying here, this actually was a bug. See this link.
So the easy fix is to update your .NET Framework. I thought this was worth updating since this was the top search result.
I don't think it's a bug, and I don't think it's unusual -- in fact that's what I'd expect for something like a text file reader to do. IO is an expensive operation, so in general you want to do everything in one pass.
It isn't a bug. But I believe you can use ReadAllLines() to do what you want instead. ReadAllLines creates a string array and pulls in all the lines into the array, instead of just a simple enumerator over a stream like ReadLines does.
If you need to access the lines twice you can always buffer them into a List<T>
using System.Linq;
List<string> lines = File.ReadLines("file path").ToList();
foreach (var line in lines)
{
Console.WriteLine(line);
}
foreach (var line in lines)
{
Console.WriteLine(line);
}
I don't know if it can be considered a bug or not if it's by design but I can certainly say two things...
This should be posted on Connect, not StackOverflow although they're not going to change it before 4.0 is released. And that usually means they won't ever fix it.
The design of the method certainly appears to be flawed.
You are correct in noting that returning an IEnumerable implies that it should be reusable and it does not guarantee the same results if iterated twice. If it had returned an IEnumerator instead then it would be a different story.
So anyway, I think it's a good find and I think the API is a lousy one to begin with. ReadAllLines and ReadAllText give you a nice convenient way of getting at the entire file but if the caller cares enough about performance to be using a lazy enumerable, they shouldn't be delegating so much responsibility to a static helper method in the first place.
I believe you are confusing an IQueryable with an IEnumerable. Yes, it's true that IQueryable can be treated as an IEnumerable, but they are not exactly the same thing. An IQueryable queries each time it's used, while an IEnumerable has no such implied reuse.
A Linq Query returns an IQueryable. ReadLines returns an IEnumerable.
There's a subtle distinction here because of the way an Enumerator is created. An IQueryable creates an IEnumerator when you call GetEnumerator() on it (which is done automatically by foreach). ReadLines() creates the IEnumerator when the ReadLines() function is called. As such, when you reuse an IQueryable, it creates a new IEnumerator when you reuse it, but since the ReadLines() creates the IEnumerator (and not an IQueryable), the only way to get a new IEnumerator is to call ReadLines() again.
In other words, you should only be able to expect to reuse an IQueryable, not an IEnumerator.
EDIT:
On further reflection (no pun intended) I think my initial response was a bit too simplistic. If IEnumerable was not reusable, you couldn't do something like this:
List<int> li = new List<int>() {1, 2, 3, 4};
IEnumerable<int> iei = li;
foreach (var i in iei) { Console.WriteLine(i); }
foreach (var i in iei) { Console.WriteLine(i); }
Clearly, one would not expect the second foreach to fail.
The problem, as is so often the case with these kinds of abstractions, is that not everything fits perfectly. For example, Streams are typically one-way, but for network use they had to be adapted to work bi-directionally.
In this case, an IEnumerable was originally envisioned to be a reusable feature, but it has since been adapted to be so generic that reusability is not a guarantee, or even to be expected. Witness the explosion of various libraries that use IEnumerables in non-reusable ways, such as Jeffrey Richter's PowerThreading library.
I simply don't think we can assume IEnumerables are reusable in all cases anymore.
It's not a bug. File.ReadLines() uses lazy evaluation and it is not idempotent. That's why it's not safe to enumerate it twice in a row. Remember an IEnumerable represents a data source that can be enumerated, it does not state it is safe to be enumerated twice, although this might be unexpected since most people are used to using IEnumerable over idempotent collections.
From the MSDN:
The ReadLines(String, System) and
ReadAllLines(String, System) methods
differ as follows: When you use
ReadLines, you can start enumerating
the collection of strings before the
whole collection is returned; when you
use ReadAllLines, you must wait for
the whole array of strings be returned
before you can access the
array. Therefore, when you are working
with very large files, ReadLines can
be more efficient.
Your findings via Reflector are correct and verify this behavior. The implementation you provided avoids this unexpected behavior but still makes use of lazy evaluation.
This is the situation:
I'm browsing through some code and I wondered if the following statement takes a reference of the selected collection or a copy with which it replaces the original object when the foreach loop finishes. If the first, will it take the new found pages and join them in the loop?
foreach(Page page in Pages)
{
page.AddRange(RetrieveSubPages(page.Id));
}
Edit: I'm sorry, I made a typo.
It should be this:
foreach(Page page in pages)
{
pages.AddRange(RetrieveSubPages(page.Id));
}
What I tried to ask is: if I add some objects to the collection being enumerated, will the foreach include those objects?
It looks like the code doesn't modify the Pages collection itself, but rather the contents of the individual Page objects in it (the Page type apparently having at least one collection-like method, AddRange).
In general each collection implements iteration in a way suitable for itself, and generally becomes unmodifiable while iterating, but one could implement a collection which iterates by taking a snapshot of itself.
There is no mechanism to detect exit from a loop which would allow action to be taken at that point (consider how this would interact with exceptions, break and return in the body of the loop).
In most cases, foreach works against the live collection (no explicit clone), and if you try to change the collection while enumerating it, then the enumerator breaks with an exception. So if you are adding to Pages, expect problems.
I think the safest way is this:
List<Page> newpages = new List<Page>();
foreach (Page page in pages)
{
    newpages.AddRange(RetrieveSubPages(page.Id));
}
pages.AddRange(newpages);
You'd have to extend this a bit if you wanted to recurse into the subpages.
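One possible way to extend this to recursion is a work queue, sketched below. The Page class and the retrieveSubPages delegate are hypothetical stand-ins for the question's types (only Id is known from the original), and the sketch assumes the page hierarchy contains no cycles.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal Page type; only Id is taken from the question.
public sealed class Page
{
    public int Id { get; }
    public Page(int id) { Id = id; }
}

public static class SubPageCollector
{
    // Collects the root pages and all their descendants, breadth first.
    // retrieveSubPages stands in for the question's RetrieveSubPages(id).
    public static List<Page> CollectAll(
        IEnumerable<Page> roots,
        Func<int, IEnumerable<Page>> retrieveSubPages)
    {
        var result = new List<Page>();
        var pending = new Queue<Page>(roots);
        while (pending.Count > 0)
        {
            Page page = pending.Dequeue();
            result.Add(page);
            // We enumerate the subpages, not the queue, so enqueueing is safe.
            foreach (Page sub in retrieveSubPages(page.Id))
                pending.Enqueue(sub);
        }
        return result;
    }

    public static void Main()
    {
        // Invented hierarchy: 1 -> 2, 3 and 2 -> 4.
        var children = new Dictionary<int, int[]> { { 1, new[] { 2, 3 } }, { 2, new[] { 4 } } };
        Func<int, IEnumerable<Page>> lookup = id =>
            children.ContainsKey(id) ? children[id].Select(c => new Page(c)) : Enumerable.Empty<Page>();

        var all = CollectAll(new[] { new Page(1) }, lookup);
        Console.WriteLine(string.Join(", ", all.Select(p => p.Id))); // 1, 2, 3, 4
    }
}
```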
In response to your question, it does not make a copy.
It creates an enumerator and iterates through the collection. If the collection is changed while this enumeration is happening, in the foreach itself, or asynchronously, you will get an exception:
An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
Additional information: Collection was modified; enumeration operation may not execute.
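The exception is easy to reproduce with a plain List<int>, used here as a stand-in for the pages collection. The sketch catches the exception so that it can report what happened instead of crashing.

```csharp
using System;
using System.Collections.Generic;

public static class ModifyDuringForeach
{
    // Returns true if adding to a List<int> inside foreach throws.
    public static bool AddDuringForeachThrows()
    {
        var pages = new List<int> { 1, 2, 3 };
        try
        {
            foreach (int page in pages)
                pages.Add(page * 10); // invalidates the running enumerator
        }
        catch (InvalidOperationException)
        {
            return true; // "Collection was modified; enumeration operation may not execute."
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(AddDuringForeachThrows()); // True
    }
}
```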
You can use a temporary collection and join the two afterwards, or just not use an enumerator:
for (int i = 0; i < pages.Count; i++)
{
    // pages.Count is re-read on every pass, so newly added subpages are visited as well
    pages.AddRange(RetrieveSubPages(pages[i].Id));
}
foreach uses an enumerator.
The collection over which you loop using foreach, has to implement IEnumerable (or IEnumerable<T>).
Then, foreach calls the GetEnumerator method of that collection, and uses the Enumerator to traverse the collection.
You are not modifying the collection you are enumerating, therefore you won't have any problems with this code.
It is also irrelevant whether a clone of the collection is being enumerated, because the objects contained in both the collection and the clone are still the same (reference equality).
I'm pretty sure you'll get an exception complaining that the underlying collection was modified.