At what point is a LINQ data source determined?

Given the below two samples of LINQ, at what point is a LINQ data source determined?
int[] numbers = new int[7] { 0, 1, 2, 3, 4, 5, 6 };
IEnumerable<int> linqToObjects = numbers.Where(x => true);
XElement root = XElement.Load("PurchaseOrder.xml");
IEnumerable<XElement> linqToXML = root.Elements("Address").Where(x => true);
My understanding is that the underlying code used to query these two different data sources lives within the IEnumerable object produced by the LINQ methods.
My question is, at what point exactly is it determined whether code will be generated to use the Linq To Objects library or the Linq To XML library?
I would assume that the underlying code (the code which actually does the work of querying the data) lives in a separate library for each kind of data source and is called depending on the data source. I have looked at https://referencesource.microsoft.com/ to read the code of the Where clause/extension method, thinking that the call to the desired provider might be in there, but it appears to be generic.
How is the magic which goes into the IEnumerable determined?

The "data source" is determined immediately. For example, in your first example, the return value of Where is an object that implements IEnumerable<int> (the Enumerable.WhereArrayIterator<int> class in particular) that has a dependency on the numbers object (stored as a field). And the return value of Where in the second example is an enumerable object that has a dependency on the XML element object. So even before you start enumerating, the resulting enumerable knows where to get the data from.

My question is, at what point exactly is it determined whether code
will be generated to use the Linq To Objects library or the Linq To
XML library?
I think there is no code generation. LINQ just uses the data source's enumerator.
You have a class that implements IEnumerable:
Exposes the enumerator, which supports a simple iteration over a
collection of a specified type.
So you can use the GetEnumerator method:
Returns an enumerator that iterates through the collection.
And this is all LINQ needs to work: an enumerator.
In your example you use the Where LINQ extension method to apply some filter.
IEnumerable<T> Where(this IEnumerable<T> source, Func<T, bool> predicate)
In the implementation we need to:
- get the enumerator (source.GetEnumerator())
- iterate through the collection and apply the filter (predicate)
In the Enumerable reference source you have the implementation of the method Where. You can see that it uses specific implementations for arrays (TSource[]) and lists (List<TSource>), but it uses WhereEnumerableIterator for all the other classes that implement IEnumerable.
So there is no code generation, the code is there.
The implementation of the class WhereEnumerableIterator is easy to follow once you understand how to implement IEnumerator.
Here you can see the implementation of MoveNext. It calls source.GetEnumerator(), then iterates through the collection (enumerator.MoveNext()) and applies the filter (predicate(item)).
public override bool MoveNext() {
    switch (state) {
        case 1:
            enumerator = source.GetEnumerator();
            state = 2;
            goto case 2;
        case 2:
            while (enumerator.MoveNext()) {
                TSource item = enumerator.Current;
                if (predicate(item)) {
                    current = item;
                    return true;
                }
            }
            Dispose();
            break;
    }
    return false;
}
XContainer.Elements (through its GetElements helper in the reference source) returns an IEnumerable using the yield keyword.
When you use the yield keyword in a statement, you indicate that the
method, operator, or get accessor in which it appears is an iterator.
Using yield to define an iterator removes the need for an explicit
extra class (the class that holds the state for an enumeration, see
IEnumerator for an example) when you implement the IEnumerable and
IEnumerator pattern for a custom collection type.
Thanks to the magic of the yield keyword we can obtain an IEnumerable and enumerate the collection. And this is the only thing LINQ needs.
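To make this concrete, here is a minimal sketch of a Where-like filter written with yield. The names SketchExtensions, MyWhere and CountTo are hypothetical and exist only for this illustration; the real Enumerable.Where uses dedicated iterator classes such as the WhereEnumerableIterator shown above. The point is only that the same filter code runs over any source that can hand out an enumerator.

```csharp
using System;
using System.Collections.Generic;

public static class SketchExtensions
{
    // Hypothetical stand-in for Enumerable.Where: all it needs is an enumerator.
    public static IEnumerable<T> MyWhere<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (T item in source) // foreach calls source.GetEnumerator() for us
        {
            if (predicate(item))
                yield return item;
        }
    }

    // A source that is not a collection at all: values are produced on demand.
    public static IEnumerable<int> CountTo(int max)
    {
        for (int i = 0; i <= max; i++)
            yield return i;
    }
}

public static class Program
{
    public static void Main()
    {
        int[] numbers = { 0, 1, 2, 3, 4, 5, 6 };

        // The same filter over two different kinds of source.
        Console.WriteLine(string.Join(",", numbers.MyWhere(x => x % 2 == 0)));
        Console.WriteLine(string.Join(",", SketchExtensions.CountTo(6).MyWhere(x => x % 2 == 0)));
    }
}
```

Both lines print 0,2,4,6. Nothing in MyWhere knows or cares whether the items come from an array, an iterator method, or an XML tree.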

Related

Does a combination of OfType() and FirstOrDefault() iterate over the entire array?

I have an array of BaseTool and I want to return the first element of type T:
public BaseTool GetTool<T>() where T : BaseTool
{
    foreach (var tool in tools)
    {
        if (tool is T)
        {
            return tool;
        }
    }
    return null;
}
Rider suggested that I use LINQ methods instead:
public BaseTool GetTool<T>() where T : BaseTool
{
    return tools.OfType<T>().FirstOrDefault();
}
I was wondering if these two implementations will perform the same. The basic loop variant returns upon finding the first T instance. I know that OfType uses deferred execution. But I'm not sure if the above combination with FirstOrDefault will cause the evaluation of OfType on the entire array or not.

FirstOrDefault only iterates as far as it needs to (finding the first element) - so no, this won't cause the whole sequence to be evaluated, unless there are no elements of that type (or the only one is the last element).
More details are in my Edulinq blog post, or you can look at the .NET Core implementation - which is slightly more complex than my Edulinq implementation, but still lazy (unless you have an IList<T> implementation with a terribly-implemented Count property, or something like that).
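One way to check this yourself is to count how many elements are pulled from the source. The sketch below is only an illustration: LazyDemo, Tools and Visited are hypothetical names, and the source is a small iterator method rather than the BaseTool array from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Number of elements pulled from the source so far.
    public static int Visited;

    // Hypothetical source that records how far it has been enumerated.
    public static IEnumerable<object> Tools()
    {
        object[] items = { "hammer", 42, "saw", 3.5 };
        foreach (object item in items)
        {
            Visited++;
            yield return item;
        }
    }

    public static int FirstInt()
    {
        Visited = 0;
        // OfType filters lazily; FirstOrDefault stops at the first match.
        return Tools().OfType<int>().FirstOrDefault();
    }

    public static void Main()
    {
        int first = FirstInt();
        Console.WriteLine($"first = {first}, visited = {Visited}");
    }
}
```

This prints first = 42, visited = 2: the enumeration stops at the second element and never touches "saw" or 3.5.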

Does Linq's IEnumerable.Select return a reference to the original IEnumerable?

I was trying to clone a List in my code, because I needed to output that List to some other code, but the original reference was going to be cleared later on. So I had the idea of using the Select extension method to create a new reference to an IEnumerable of the same elements, for example:
List<int> ogList = new List<int> {1, 2, 3};
IEnumerable<int> enumerable = ogList.Select(s => s);
Now after doing ogList.Clear(), I was surprised to see that my new enumerable was also empty.
So I started fiddling around in LINQPad, and saw that even if my Select returned different objects entirely, the behaviour was the same.
List<int> ogList = new List<int> {1, 2, 3};
IEnumerable<int> enumerable = ogList.Select(s => 5); // Doesn't return the original int
enumerable.Count().Dump(); // Count is 3
ogList.Clear();
enumerable.Count().Dump(); // Count is 0!
Note that in LINQPad, the Dump()s are equivalent to Console.WriteLine().
Now probably my need to clone the list in the first place was due to bad design, and even if I didn't want to rethink the design I could easily clone it properly. But this got me thinking about what the Select extension method actually does.
According to the documentation for Select:
This method is implemented by using deferred execution. The immediate return value is an object that stores all the information that is required to perform the action. The query represented by this method is not executed until the object is enumerated either by calling its GetEnumerator method directly or by using foreach in Visual C# or For Each in Visual Basic.
So then I tried adding this code before clearing:
foreach (int i in enumerable)
{
    i.Dump();
}
The result was still the same.
Finally, I tried one last thing to figure out if the reference in my new enumerable was the same as the old one. Instead of clearing the original List, I did:
ogList.Add(4);
Then I printed out the contents of my enumerable (the "cloned" one), expecting to see '4' appended to the end of it. Instead, I got:
5
5
5
5 // Huh?
Now I have no choice but to admit that I have no idea how the Select extension method works behind the scenes. What's going on?

List/List<T> are for all intents and purposes fancy resizable arrays. They own and hold the data for value types such as your ints or references to the data for reference types in memory and they always know how many items they have.
IEnumerable/IEnumerable<T> are different beasts. They provide a different service/contract. An IEnumerable is fictional, it does not exist. It can create data out of thin air, with no physical backing. Their only promise is that they have a public method called GetEnumerator() that returns an IEnumerator/IEnumerator<T>. The promise that an IEnumerator makes is simple:
some item could be available or not at a time when you decide you need it. This is achieved through a simple method that the IEnumerator interface has: bool MoveNext() - which returns false when the enumeration is completed or true if there was in fact a new item that needed to be returned. You can read the data through a property that the IEnumerator interface has, conveniently called Current.
To get back to your observations/question: as far as the IEnumerable in your example is concerned, it does not even think about the data unless your code tells it to fetch some data.
When you are writing:
List<int> ogList = new List<int> {1, 2, 3};
IEnumerable<int> enumerable = ogList.Select(s => s);
You are saying: Listen here IEnumerable, I might come to you asking for some items at some point in the future. I'll tell you when I will need them, for now sit still and do nothing. With Select(s => s) you are conceptually defining an identity projection of int to int.
A very rough, simplified, non-real-life implementation of the Select you've written is something like:
static IEnumerable<T> Select<T>(this IEnumerable<int> source, Func<int, T> transformer)
{
    foreach (var i in source) // creates an enumerator for source and starts enumerating
    {
        yield return transformer(i); // yield here == return an item and wait for orders
    }
}
(This explains why you got a 5 when expecting a 4: your transform was s => 5.)
For value types, such as the ints in your case: if you want a clone, enumerate the whole list (or part of it) and materialize the result into a new List. This way you create a list that is a clone of the original, entirely detached from it:
IEnumerable<int> cloneOfEnumerable = ogList.Select(s => s).ToList();
Later edit: Of course ogList.Select(s => s) is equivalent to ogList. I'm leaving the projection here, as it was in the question.
What you are creating here is: a list from the result of an enumerable, further consumed through the IEnumerable<int> interface. Considering what I've said above about the nature of IList vs IEnumerable, I would prefer to write/read:
IList<int> cloneOfEnumerable = ogList.ToList();
CAUTION: Be careful with reference types. IList/List make no promise of keeping the objects "safe", they can mutate to null for all IList cares. Keyword if you ever need it: deep cloning.
CAUTION: Beware of infinite or non-rewindable IEnumerables

Provided answers explain why you are not obtaining a cloned list (due to deferred execution of some LINQ extension methods).
However, keep in mind that list.Select(e => e).ToList() will get a real clone only when dealing with value types such as int.
If you have a list of reference types you will receive a cloned list of references to the existing objects. In this case you should consider one of the solutions provided here for deep-cloning or my favorite from here (which might be limited by the object's inner structure).

You have to be aware that an object that implements IEnumerable does not have to be a collection itself. It is an object that makes it possible to get an object that implements IEnumerator. Once you have the enumerator you can ask for the first element and for the next element until there are no more next elements.
The result of a LINQ function that returns an IEnumerable is not the sequence itself; it only enables you to ask for the enumerator. If you want a materialized sequence, you'll have to use ToList.
There are several other LINQ functions that do not return an IEnumerable, but for instance a Dictionary, or only one element (FirstOrDefault(), Max(), Single(), Any()). These functions will get the enumerator from the IEnumerable and start enumerating until they have the result. Any only has to check whether you can start enumerating. Max will enumerate over all elements and remember the largest one, etc.
You'll have to be aware: as long as your LINQ statement is an IEnumerable of something, your source sequence is not accessed yet. If you change your source sequence before you start enumerating, the enumeration is over your changed source sequence.
If you don't want this, you'll have to do the enumeration before you change your source. Usually this will be ToList, but this can be any of the non-deferred function: Max(), Any(), FirstOrDefault(), etc.
List<TSource> sourceItems = ...
var myEnumerable = sourceItems
    .Where(sourceItem => ...)
    .GroupBy(sourceItem => ...)
    .Select(group => ...);
// note: myEnumerable is an IEnumerable, it is not a sequence yet.
var list1 = myEnumerable.ToList(); // enumerate over the whole sequence
var first1 = myEnumerable.FirstOrDefault(); // enumerate and stop after the first element
// now change the source, and do the same things again
sourceItems.Clear();
var list2 = myEnumerable.ToList(); // returns an empty list
var first2 = myEnumerable.FirstOrDefault(); // returns the default value (null for reference types): there is no first element
So every LINQ function that does not return IEnumerable, will start enumerating over sourceItems as the sequence is at the moment that you start enumerating. The IEnumerable is not the sequence itself.

This is an enumerable.
var enumerable = ogList.Select(s => s);
If you iterate through this enumerable, LINQ will in turn iterate over the original resultset. Each and every time. If you do anything to the original enumerable, the results will also be reflected in your LINQ calls.
If you need to freeze the data, store it in a list instead:
var enumerable = ogList.Select(s => s).ToList();
Now you've made a copy. Iterating over this list will not touch the original enumerable.
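The difference between the deferred enumerable and the materialized copy can be seen side by side in a small sketch. SnapshotDemo and Run are hypothetical names used only for this illustration; the list contents mirror the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    public static (int Deferred, int Snapshot) Run()
    {
        var ogList = new List<int> { 1, 2, 3 };

        IEnumerable<int> deferred = ogList.Select(s => s); // re-reads ogList on every enumeration
        List<int> snapshot = ogList.ToList();              // copies the elements right now

        ogList.Clear();

        // deferred sees the cleared list; snapshot still holds its own copy.
        return (deferred.Count(), snapshot.Count);
    }

    public static void Main()
    {
        var (deferred, snapshot) = Run();
        Console.WriteLine($"deferred: {deferred}, snapshot: {snapshot}");
    }
}
```

This prints deferred: 0, snapshot: 3.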

How to implement ICollection<T> on an IEnumerable<T>

I would like to know how to program what Microsoft is suggesting from the MSDN guidelines for collections, which state the following:
AVOID using ICollection<T> or ICollection as a parameter just to access the
Count property. Instead, consider using IEnumerable<T> or IEnumerable and
dynamically checking whether the object implements ICollection<T> or ICollection.
In short, how do I implement ICollection on an IEnumerable? Microsoft has links all over that article, but no "And here is how you do this" link.
Here is my scenario. I have an MVC web app with a grid that will paginate and have sorting capability on some of the collections. For instance, on an Employee administration screen I display a list of employees in a grid.
Initially I returned the collection as IEnumerable. That was convenient when I didn't need to paginate. But now I'm faced with paginating and needing to extract the Count of employee records to do that. One workaround was to pass an employeeCount integer by ref to my getEmployeeRecords() method and assign the value within that method, but that's just messy.
Based on what I've seen here on StackOverflow, the general recommendation is to use IEnumerable instead of ICollection, or Collection, or IList, or List. So I'm not trying to open up a conversation about that topic. All I want to know is how to make an IEnumerable implement an ICollection, and extract the record count, so my code is more aligned with Microsoft's recommendation. A code sample or clear article demonstrating this would be helpful.
Thanks for your help!

One thing to note is that if you use LINQ's Count() method, it already does the type checking for you:
public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null) throw Error.ArgumentNull("source");
    ICollection<TSource> collectionoft = source as ICollection<TSource>;
    if (collectionoft != null) return collectionoft.Count;
    ICollection collection = source as ICollection;
    if (collection != null) return collection.Count;
    int count = 0;
    using (IEnumerator<TSource> e = source.GetEnumerator())
    {
        checked
        {
            while (e.MoveNext()) count++;
        }
    }
    return count;
}

Initially I returned the collection as IEnumerable.
Well there's half your problem. Return types should be as explicit as possible. If you have a collection, make the return type that collection. (I forget where, but this is mentioned in the guidelines.)
Based on what I've seen here on StackOverflow, the general recommendation is to use IEnumerable instead of ICollection, or Collection, or IList, or List.
Some developers have an obsession with casting everything as IEnumerable. I have no idea why, as there is no guidance anywhere from Microsoft that says that is a good idea. (I do know that some think it somehow makes the return value immutable, but really anyone can cast it back to the base type and make changes to it. Or just use dynamic and never even notice you gave them an IEnumerable.)
That's the rule for return types and local variables. For parameters you should be as accepting as possible. In practice that means accepting either IEnumerable or IList depending on whether or not you need to access it by index.
AVOID using ICollection<T> or ICollection as a parameter just to access the
Count property.
The reason for this is that if you need the Count, you probably need to access it by index as well. If not today, then tomorrow. So go ahead and use IList just in case.
(I'm not sure I agree, but it does make some sense.)
In short, how do I implement ICollection on an IEnumerable?
Short answer: the .Count() extension method. Make sure you import System.Linq.
Long answer:
int count = 0;
if (x is ICollection)
    count = ((ICollection)x).Count;
else
    foreach (var c in x)
        count++;

IEnumerable is an interface, and so is ICollection. It's the object's type that implements one or the other or both. You can check if an object implements ICollection with obj is ICollection.
Example:
public class MyCollection<T> : IEnumerable<T>, ICollection<T>
{
    // ... Implemented methods
}
// ...
void Foo(IEnumerable<int> elements)
{
    int count;
    if (elements is ICollection<int>) {
        count = ((ICollection<int>)elements).Count;
    }
    else {
        // Use Linq to traverse the whole enumerable; less efficient, but correct
        count = elements.Count();
    }
}
// ...
MyCollection<int> myStuff;
Foo(myStuff);
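With C# 7 pattern matching, the "check dynamically, then fall back" idea from the guideline can be written without the explicit casts. The following is a sketch under assumptions, not a framework API: CountHelper, TryGetCheapCount and Lazy are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class CountHelper
{
    // Hypothetical helper: returns true when the count is available without enumerating.
    public static bool TryGetCheapCount<T>(IEnumerable<T> source, out int count)
    {
        switch (source)
        {
            case ICollection<T> generic:
                count = generic.Count;
                return true;
            case ICollection nonGeneric:
                count = nonGeneric.Count;
                return true;
            default:
                count = 0;
                return false;
        }
    }

    // A lazy source that implements neither ICollection<T> nor ICollection.
    public static IEnumerable<int> Lazy()
    {
        yield return 1;
        yield return 2;
    }

    public static void Main()
    {
        var list = new List<int> { 1, 2, 3 };
        Console.WriteLine(TryGetCheapCount(list, out int n) ? $"cheap: {n}" : "must enumerate");

        IEnumerable<int> lazy = Lazy();
        // Fall back to LINQ's Count(), which walks the whole sequence.
        Console.WriteLine(TryGetCheapCount(lazy, out int m) ? $"cheap: {m}" : $"enumerated: {lazy.Count()}");
    }
}
```

In everyday code, calling the Count() extension method is usually enough, since it performs the same kind of type check internally, as the reference source above shows.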

Doesn't ICollection implement IEnumerable already? If you need a collection then you need a collection.

LINQ - IEnumerable.ToList() and Deferred Execution confusion

I have an IEnumerable variable named "query" which represents an entity framework query. If I do the following, will it enumerate the query more than once? My confusion is in "result" being an IEnumerable and not a List.
IEnumerable<T> result = query.ToList();
bool test = result.Any();
as opposed to this:
// Using List to only enumerate the query once
List<T> result = query.ToList();
bool test = result.Any();
Also, I know this is silly, but would it enumerate twice if I do the following, or would it somehow know that "query" was already enumerated even though the result of the enumeration is not being used?
List<T> result = query.ToList();
bool test = query.Any();
Thanks!

Once you call ToList or ToArray you create an in-memory collection. From then on you're not dealing with the database anymore.
So even if you declare it as IEnumerable<T>, it actually remains a List<T> after query.ToList().
Also, many LINQ extension methods check whether the sequence can be cast to a collection type. You can see that, for example, in Enumerable.Count, where the Count property will be used if possible:
public static int Count<TSource>(this IEnumerable<TSource> source) {
    if (source == null) throw Error.ArgumentNull("source");
    ICollection<TSource> collectionoft = source as ICollection<TSource>;
    if (collectionoft != null) return collectionoft.Count;
    ICollection collection = source as ICollection;
    if (collection != null) return collection.Count;
    int count = 0;
    using (IEnumerator<TSource> e = source.GetEnumerator()) {
        checked {
            while (e.MoveNext()) count++;
        }
    }
    return count;
}
As for your last question, whether it makes a difference if you use the list or the query again in this code snippet:
List<T> result = query.ToList();
bool test = query.Any();
Yes, in this case you are not using the in-memory list but the query, which will then ask the database again, even if .Any is not as expensive as .ToList.

When you call ToList on your query it will be transformed into a list. The query will be evaluated right then, the items will be pulled out, and the list will be populated. From then on that List has no knowledge of the original query. No amount of manipulation of that list can in any way affect or evaluate that query, as it knows nothing about it.
It doesn't matter what you call that List, or what type of variable you stick it in, the list itself simply doesn't know anything about the IQueryable anymore. Iterating the variable holding the list multiple times will simply iterate that list multiple times.
In just the same way that the list doesn't know a thing about the query, the query doesn't know a thing about the list. It doesn't remember that its items were put into a list and continue to return those items. (You can actually write query objects like this, in theory. It's called memoization, for the query to cache its results when iterated and continue to provide objects from that cache when iterated later. EF doesn't memoize its queries by default, nor does it provide a tool for memoization by default, although 3rd party tools provide such extensions.) This means that the 3rd code snippet that you have will actually execute two separate database queries, not just one.
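The same behaviour can be reproduced without a database. In the sketch below, an iterator method stands in for the EF query and a counter records how many times it is executed. ReEnumerationDemo, Query and Executions are hypothetical names, and this is an in-memory analogy, not Entity Framework code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReEnumerationDemo
{
    // How many times the "query" has been executed.
    public static int Executions;

    // Hypothetical stand-in for a database query: each enumeration runs the body again.
    public static IEnumerable<int> Query()
    {
        Executions++;
        yield return 1;
        yield return 2;
    }

    public static int Run()
    {
        Executions = 0;
        IEnumerable<int> query = Query();

        List<int> result = query.ToList(); // first execution
        bool viaList = result.Any();       // reads the list only, no new execution
        bool viaQuery = query.Any();       // second execution of the query

        Console.WriteLine($"viaList = {viaList}, viaQuery = {viaQuery}");
        return Executions;
    }

    public static void Main()
    {
        Console.WriteLine($"executions = {Run()}");
    }
}
```

The counter ends at 2: once for ToList and once for query.Any(), while result.Any() adds nothing.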

No, the enumeration occurs only in ToList().
http://msdn.microsoft.com/it-it/library/bb342261(v=vs.110).aspx

ArrayList Count vs Any

I am looking at some legacy code. The class uses an ArrayList to keep the items. The items are fetched from a database table and there can be up to 6 million of them. The class exposes a property called 'ListCount' to get the count of the items in the ArrayList.
class Settings
{
    ArrayList settingsList;
    public Settings()
    {
        settingsList = GetSettings(); // Get the settings from the DB. Can also return null
    }
    public int ListCount
    {
        get
        {
            if (settingsList == null)
                return 0;
            else
                return settingsList.Count;
        }
    }
}
ListCount is used to check whether there are items in the list. I am considering adding an 'Any' method to the class.
public bool Any(Func<vpSettings, bool> predicate)
{
    return settingsList != null && settingsList.Cast<vpSettings>().Any(predicate);
}
The question is: does the framework do some kind of optimization and maintain a count of the items, or does it iterate over the ArrayList to get the count? Would it be advisable to add the 'Any' method as above?
Marc Gravell, in the following question, advises using Any for IEnumerable:
Which method performs better: .Any() vs .Count() > 0?

The .NET reference source says that ArrayList.Count returns a cached private variable.
For completeness, the source also lists the implementation of the Any() extension method here. Essentially the extension method does a null check and then tries to get the first element via the IEnumerable's enumerator.

ArrayList implements IList, so it exposes a Count property, and reading that property should be faster than calling .Any(). The reason is that Count here is a property, not the Count() extension method. The extension method does a quick type check and then reads that same property.
Which looks similar to:
ICollection<TSource> collection1 = source as ICollection<TSource>;
if (collection1 != null)
    return collection1.Count;
ICollection collection2 = source as ICollection;
if (collection2 != null)
    return collection2.Count;

Marc Gravell advises using Any() over Count() (the extension method), but not necessarily over Count (the property).
The Count property is always going to be faster, because it's just looking up an int that's stored on the heap. Using linq requires a (relatively) expensive object allocation to create the IEnumerator, plus whatever overhead there is in MoveNext (which, if the list is not empty, will needlessly copy the value of the ArrayList's first member to the Current property before returning true).
Now this is all pretty trivial for performance, but the code to do it is more complex, so it should only be used if there is a compelling performance benefit. Since there's actually a trivial performance penalty, we should choose the simpler code. I would therefore implement Any() as return Count > 0;.
However, your example is implementing the parameterized overload of Any. In that case, your solution, delegating to the parameterized Any extension method seems best. There's no relationship between the parameterized Any extension method and the Count property.

ArrayList implements IList, so it does have a Count property. Using that would be faster than Any() if all you care about is whether the container is empty.
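Put together, the two checks discussed in these answers might look like the sketch below. EmptinessDemo, HasItems and HasMatch are hypothetical names, and object is used in place of the vpSettings type from the question so that the example is self-contained.

```csharp
using System;
using System.Collections;
using System.Linq;

public static class EmptinessDemo
{
    // Emptiness check: ArrayList keeps its size in a field, so Count is constant time.
    public static bool HasItems(ArrayList list)
    {
        return list != null && list.Count > 0;
    }

    // Predicate check: this has to enumerate until a match is found.
    // Cast<object>() adapts the non-generic ArrayList to IEnumerable<object>.
    public static bool HasMatch(ArrayList list, Func<object, bool> predicate)
    {
        return list != null && list.Cast<object>().Any(predicate);
    }

    public static void Main()
    {
        var settings = new ArrayList { "a", "b", 3 };
        Console.WriteLine(HasItems(settings));             // True
        Console.WriteLine(HasMatch(settings, o => o is int)); // True
        Console.WriteLine(HasItems(null));                 // False
    }
}
```

The predicate overload cannot use Count at all, so delegating to LINQ's Any there is a reasonable choice, while the plain emptiness check is simplest as Count > 0.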
