I recently learned that the objects created by .NET's LINQ implementation are inefficient for specific enumeration types.
Take a look at this code:
public class DummyCollection : ICollection<int>
{
public IEnumerator<int> GetEnumerator()
{
throw new Exception();
}
public int Count
{
get
{
return 10;
}
}
//some more interface methods
}
Basically, instances of DummyCollection have a size of 10, but throw an exception if they are actually enumerated.
Now here:
var d = new DummyCollection();
Console.WriteLine(d.Count());
A 10 is printed without error, but this piece of code:
var l = d.Select(a=> a);
Console.WriteLine(l.Count());
throws an exception, even though it is trivial to see that l's size is 10 as well (since Select offers a 1-to-1 mapping). What this means is that when checking the length of an IEnumerable, the input might be a Select-wrapped collection, which turns an O(1) operation into an O(n) one (or worse, if the selection function is particularly expensive).
I know that you sacrifice efficiency when you ask for LINQ's generics, but this seems like such a simple problem to fix. I checked online and couldn't find anyone addressing this. Is there a way to bypass this shortcoming? Is anyone looking into this? Is anyone fixing this? Is this just an edge case that isn't that much of a problem? Any insight is appreciated.
You can see how the Count() extension method is implemented here. Basically, it is something like this:
public static int Count<TSource>(this IEnumerable<TSource> source)
{
if (source == null) throw Error.ArgumentNull("source");
ICollection<TSource> collectionoft = source as ICollection<TSource>;
if (collectionoft != null) return collectionoft.Count;
ICollection collection = source as ICollection;
if (collection != null) return collection.Count;
int count = 0;
using (IEnumerator<TSource> e = source.GetEnumerator()) {
checked {
while (e.MoveNext()) count++;
}
}
return count;
}
As you can see, the method first checks whether the source is of type ICollection<TSource> or ICollection. If it is, there is no need to iterate to count the elements; it just returns the Count property.
In your first case, the Count property is read, returning 10, and the GetEnumerator() method is never called.
When you use the Select() method, you're wrapping the collection in another type that isn't an ICollection (in the above link you can also see the Select() implementation), so iteration is necessary.
In your second case, when you call Count(), your GetEnumerator() method is called and the exception is thrown.
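To see the two code paths side by side, here is a minimal sketch. TrackingCollection is a hypothetical class (it is not from the question) that behaves like an ordinary read-only ICollection<int> but records whether GetEnumerator() was ever called. On the implementations discussed here, the direct call should leave the flag unset and the Select-wrapped call should set it.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical collection that remembers whether it was enumerated.
public class TrackingCollection : ICollection<int>
{
    private readonly List<int> _items = new List<int> { 1, 2, 3 };

    public bool WasEnumerated { get; private set; }

    public int Count { get { return _items.Count; } }
    public bool IsReadOnly { get { return true; } }

    public IEnumerator<int> GetEnumerator()
    {
        WasEnumerated = true;
        return _items.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    public bool Contains(int item) { return _items.Contains(item); }
    public void CopyTo(int[] array, int arrayIndex) { _items.CopyTo(array, arrayIndex); }

    // Mutation is not needed for the demonstration.
    public void Add(int item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public bool Remove(int item) { throw new NotSupportedException(); }
}

public static class Program
{
    public static void Main()
    {
        var direct = new TrackingCollection();
        Console.WriteLine(direct.Count());          // 3, read from the Count property
        Console.WriteLine(direct.WasEnumerated);    // False

        var wrapped = new TrackingCollection();
        Console.WriteLine(wrapped.Select(a => a).Count());  // 3, but obtained by iterating
        Console.WriteLine(wrapped.WasEnumerated);
    }
}
```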
IEnumerable<T> doesn't have a concept of Count. This exists in implementations, which (apart from the odd shortcut here and there) have no role in LINQ to Objects. If you project an implementation of IEnumerable<T> (such as ICollection<T>), with Select, the only real guarantee you have is that the output will be IEnumerable<T>... which has no Count.
LINQ should be thought of as dealing with sequences of items, one at a time, only with a concept of current and next item (or the end of sequence). Knowing about the number of items is a (potentially) costly operation that requires iteration of all the items being counted, other than in a few, optimized cases.
Because LINQ relies on iteration in preference to indexes and counts, an IEnumerable that errors when you try to iterate it is going to need some super weird special-casing to fly. To me, it wouldn't be a very useful use case.
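On the question of bypassing the enumeration: assuming .NET 6 or later is available, Enumerable.TryGetNonEnumeratedCount reports a count only when it can be obtained without walking the sequence. The sketch below is illustrative only; Generate() is a hypothetical source, and whether a particular Select-wrapped sequence qualifies depends on the runtime version and the source type, so it is best checked rather than assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NonEnumeratedCountDemo
{
    // Hypothetical lazy source: nothing about its length is known up front.
    public static IEnumerable<int> Generate()
    {
        for (int i = 0; i < 10; i++)
            yield return i;
    }

    public static void Main()
    {
        var list = new List<int> { 1, 2, 3 };
        int count;

        // A List<int> exposes Count, so no enumeration is needed.
        if (list.TryGetNonEnumeratedCount(out count))
            Console.WriteLine("list: " + count);

        // A yield-based iterator has no cheap count; the call returns false
        // instead of walking the sequence.
        if (!Generate().TryGetNonEnumeratedCount(out count))
            Console.WriteLine("iterator: count not available without enumerating");
    }
}
```

When the method returns false, the caller can decide whether paying for a full Count() is acceptable.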
Related
I would like to know how to program what Microsoft is suggesting from the MSDN guidelines for collections, which state the following:
AVOID using ICollection<T> or ICollection as a parameter just to access the
Count property. Instead, consider using IEnumerable<T> or IEnumerable and
dynamically checking whether the object implements ICollection<T> or ICollection.
In short, how do I implement ICollection on an IEnumerable? Microsoft has links all over that article, but no "And here is how you do this" link.
Here is my scenario. I have an MVC web app with a grid that will paginate and have sorting capability on some of the collections. For instance, on an Employee administration screen I display a list of employees in a grid.
Initially I returned the collection as IEnumerable. That was convenient when I didn't need to paginate. But now I'm faced with paginating and needing to extract the Count of employee records to do that. One workaround was to pass an employeeCount integer by ref to my getEmployeeRecords() method and assign the value within that method, but that's just messy.
Based on what I've seen here on StackOverflow, the general recommendation is to use IEnumerable instead of ICollection, or Collection, or IList, or List. So I'm not trying to open up a conversation about that topic. All I want to know is how to make an IEnumerable implement an ICollection, and extract the record count, so my code is more aligned with Microsoft's recommendation. A code sample or clear article demonstrating this would be helpful.
Thanks for your help!
One thing to note is that if you use LINQ's Count() method, it already does the type checking for you:
public static int Count<TSource>(this IEnumerable<TSource> source)
{
if (source == null) throw Error.ArgumentNull("source");
ICollection<TSource> collectionoft = source as ICollection<TSource>;
if (collectionoft != null) return collectionoft.Count;
ICollection collection = source as ICollection;
if (collection != null) return collection.Count;
int count = 0;
using (IEnumerator<TSource> e = source.GetEnumerator())
{
checked
{
while (e.MoveNext()) count++;
}
}
return count;
}
Initially I returned the collection as IEnumerable.
Well there's half your problem. Return types should be as explicit as possible. If you have a collection, make the return type that collection. (I forget where, but this is mentioned in the guidelines.)
Based on what I've seen here on StackOverflow, the general recommendation is to use IEnumerable instead of ICollection, or Collection, or IList, or List.
Some developers have an obsession with casting everything as IEnumerable. I have no idea why, as there is no guidance anywhere from Microsoft that says that is a good idea. (I do know that some think it somehow makes the return value immutable, but really anyone can cast it back to the base type and make changes to it. Or just use dynamic and never even notice you gave them an IEnumerable.)
That's the rule for return types and local variables. For parameters you should be as accepting as possible. In practice that means accepting either IEnumerable or IList depending on whether or not you need to access it by index.
AVOID using ICollection<T> or ICollection as a parameter just to access the
Count property.
The reason for this is that if you need the Count, you probably need to access it by index as well. If not today, then tomorrow. So go ahead and use IList just in case.
(I'm not sure I agree, but it does make some sense.)
In short, how do I implement ICollection on an IEnumerable?
Short answer: the .Count() extension method. Make sure you import System.Linq.
Long answer:
int count = 0;
if (x is ICollection)
count = ((ICollection)x).Count;
else
foreach (var c in x)
count ++;
IEnumerable is an interface, and so is ICollection. It's the object's type that implements one or the other or both. You can check if an object implements ICollection with obj is ICollection.
Example:
public class MyCollection<T> : IEnumerable<T>, ICollection<T>
{
// ... Implemented methods
}
// ...
void Foo(IEnumerable<int> elements)
{
int count;
if (elements is ICollection<int>) {
count = ((ICollection<int>)elements).Count;
}
else {
// Use Linq to traverse the whole enumerable; less efficient, but correct
count = elements.Count();
}
}
// ...
var myStuff = new MyCollection<int>(); // must be assigned before it is passed to Foo
Foo(myStuff);
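For the pagination scenario, the dynamic check can be wrapped in a small helper that tells the caller whether a count is available cheaply. This is only a sketch: CountHelper and TryGetCheapCount are hypothetical names, and including IReadOnlyCollection<T> is an assumption about which interfaces are worth checking.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class CountHelper
{
    // Hypothetical helper: returns true when the count can be read
    // from a property without enumerating the sequence.
    public static bool TryGetCheapCount<T>(IEnumerable<T> source, out int count)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        if (source is ICollection<T> generic) { count = generic.Count; return true; }
        if (source is IReadOnlyCollection<T> readOnly) { count = readOnly.Count; return true; }
        if (source is ICollection nonGeneric) { count = nonGeneric.Count; return true; }

        count = 0;
        return false;
    }
}

public static class CountHelperDemo
{
    public static void Main()
    {
        int n;

        // A List<int> implements ICollection<int>.
        Console.WriteLine(CountHelper.TryGetCheapCount(new List<int> { 1, 2, 3 }, out n)); // True

        // A filtered sequence cannot know its length without iterating.
        Console.WriteLine(CountHelper.TryGetCheapCount(new[] { 1, 2 }.Where(x => x > 1), out n)); // False
    }
}
```

A caller that gets false back can fall back to Count() or materialize the sequence with ToList().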
Doesn't ICollection implement IEnumerable already? If you need a collection then you need a collection.
I am looking at some legacy code. The class uses an ArrayList to keep the items. The items are fetched from a database table and there can be up to 6 million of them. The class exposes a property called 'ListCount' to get the count of the items in the ArrayList.
class Settings
{
ArrayList settingsList ;
public Settings()
{
settingsList = GetSettings();//Get the settings from the DB. Can also return null
}
public int ListCount
{
get
{
if (settingsList == null )
return 0;
else
return settingsList.Count;
}
}
}
The ListCount is used to check whether there are items in the list. I am considering adding an 'Any' method to the class.
public bool Any(Func<vpSettings, bool> predicate)
{
return settingsList !=null && settingsList.Cast<vpSettings>().Any(predicate);
}
The question is: does the framework do some kind of optimization and maintain a count of the items, or does it iterate over the ArrayList to get the count? Would it be advisable to add the 'Any' method as above?
Marc Gravell, in the following question, advises using Any for IEnumerable:
Which method performs better: .Any() vs .Count() > 0?
The .NET reference source says that ArrayList.Count returns a cached private variable.
For completeness, the source also lists the implementation of the Any() extension method here. Essentially the extension method does a null check and then tries to get the first element via the IEnumerable's enumerator.
ArrayList implements IList, so using its Count property should be faster than .Any(). The reason is that it exposes Count as a property rather than relying on the Count() method. The Count() method does a quick type check and then reads that property.
Which looks similar to:
ICollection<TSource> collection1 = source as ICollection<TSource>;
if (collection1 != null)
return collection1.Count;
ICollection collection2 = source as ICollection;
if (collection2 != null)
return collection2.Count;
Marc Gravell advises to use Any() over Count() (the extension method), but not necessarily over Count (the property).
The Count property is always going to be faster, because it's just looking up an int that's stored on the heap. Using LINQ requires a (relatively) expensive object allocation to create the IEnumerator, plus whatever overhead there is in MoveNext (which, if the list is not empty, will needlessly copy the value of the ArrayList's first member to the Current property before returning true).
Now this is all pretty trivial for performance, but the code to do it is more complex, so it should only be used if there is a compelling performance benefit. Since there's actually a trivial performance penalty, we should choose the simpler code. I would therefore implement Any() as return Count > 0;.
However, your example is implementing the parameterized overload of Any. In that case, your solution, delegating to the parameterized Any extension method seems best. There's no relationship between the parameterized Any extension method and the Count property.
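Putting both suggestions together might look like the sketch below. Setting is a hypothetical stand-in for the vpSettings type, and the constructor takes the list as a parameter (instead of loading it from the database) purely so the example is self-contained.

```csharp
using System;
using System.Collections;
using System.Linq;

// Hypothetical item type standing in for vpSettings.
public class Setting
{
    public string Name;
    public bool Enabled;
}

public class Settings
{
    private readonly ArrayList settingsList; // may be null, as in the original code

    public Settings(ArrayList items)
    {
        settingsList = items;
    }

    public int ListCount
    {
        get { return settingsList == null ? 0 : settingsList.Count; }
    }

    // Emptiness check: just read the cached count.
    public bool Any()
    {
        return ListCount > 0;
    }

    // Predicate check: delegate to LINQ, which stops at the first match.
    public bool Any(Func<Setting, bool> predicate)
    {
        return settingsList != null && settingsList.Cast<Setting>().Any(predicate);
    }
}

public static class SettingsDemo
{
    public static void Main()
    {
        var items = new ArrayList
        {
            new Setting { Name = "a", Enabled = false },
            new Setting { Name = "b", Enabled = true }
        };
        var settings = new Settings(items);

        Console.WriteLine(settings.Any());                 // True
        Console.WriteLine(settings.Any(s => s.Enabled));   // True
        Console.WriteLine(new Settings(null).Any());       // False
    }
}
```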
ArrayList implements IList, so it does have a Count property. Using that would be faster than Any() if all you care about is checking whether the container is empty.
I'm aware that .Count() is an extension method in LINQ, and that fundamentally it uses the .Count, so I'm wondering, when should I use Count() and when should I use .Count? Is .Count() predominately better saved for queryable collections that are yet to be executed, and therefore don't have an enumeration yet? Am I safer simply always using .Count() extension method, or vice versa for the property? Or is this solely conditional depending on the collection?
Any advice, or articles, are greatly appreciated.
Update 1
After decompiling the .Count() extension method in LINQ it appears to be using the .Count property if the IEnumerable<T> is an ICollection<T> or ICollection, which is what most answers have suggested. The only real overhead now that I can see is the additional null and type checks, which isn't huge I suppose, but could still make a small amount of difference if performance were of the utmost importance.
Here's the decompiled LINQ .Count() extension method in .NET 4.0.
public static int Count<TSource>(this IEnumerable<TSource> source)
{
if (source == null)
{
throw Error.ArgumentNull("source");
}
ICollection<TSource> collection = source as ICollection<TSource>;
if (collection != null)
{
return collection.Count;
}
ICollection collection2 = source as ICollection;
if (collection2 != null)
{
return collection2.Count;
}
int num = 0;
checked
{
using (IEnumerator<TSource> enumerator = source.GetEnumerator())
{
while (enumerator.MoveNext())
{
num++;
}
}
return num;
}
}
The extension method works on any IEnumerable<T>, but it can be costly because it counts the sequence by iterating it. There is an optimization if the sequence is an ICollection<T>, meaning that the length of the collection is known: then the Count property is used, but that is an implementation detail.
The best advice is to use the Count property if available for performance reasons.
Is .Count() predominately better saved for queryable collections that are yet to be executed, and therefore don't have an enumeration yet?
If your collection is IQueryable<T> and not IEnumerable<T> then the query provider may be able to return the count in some efficient manner. In that case you will not suffer a performance penalty, but it depends on the query provider.
An IQueryable<T> will not have a Count property, so there is no choice between using the extension method and the property. However, if your query provider does not provide an efficient way of computing Count(), you might consider using .ToList() to pull the collection to the client side. It really depends on how you intend to use it.
Count retrieves the property from a List (already calculated). Count() is an aggregation, like Sum(), Average(), etc. What it does is to count the items in the enumerable (I believe it internally uses the Count property if the enumerable is a list).
This is an example of the concrete use of the Count() method, when it doesn't just use the Count property:
var list = new List<int> {1,2,3,4,5,6,7,8,9,10};
var count = list.Where(x => x > 5).Count();
Also, Count() has an overload that will count the items matching the predicate:
var count = list.Count(x => x > 5);
You should use Count() when all you have is an interface that doesn't expose a Count or Length property, such as IEnumerable<T>.
If you're dealing with a collection or collection interface (such as List<T> or ICollection) then you can simply use the Count property, likewise if you have an array use the Length property.
The implementation of the Count() extension method will use the underlying collection's Count property if it is available. Otherwise the collection will be enumerated to calculate the count.
Agreed with comments of .Count if it is available (i.e. an object that implements ICollection<T> under the bonnet).
But they are wrong about .Count() being 'costly'. Enumerable.Count() will check whether the object implements ICollection<T> (and so has a Count property) before it enumerates the elements and counts them.
I.e. something like,
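public static int Count<TSource>(this IEnumerable<TSource> source)
{
// Fast path: a collection already knows its size.
var collection = source as ICollection<TSource>;
if (collection != null)
{
return collection.Count;
}
// Otherwise fall back to enumerating the sequence.
int count = 0;
foreach (var item in source) count++;
return count;
}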
I'm not sure it matters since Count() probably just reads the Count property. The performance difference is truly negligible. Use whichever you like. I use Count() when possible just to be consistent.
As others have said, if you have an ICollection<T>, use the Count property.
I would suggest that the IEnumerable.Count() method is really intended for use when the only thing you want to do with the elements of an enumeration is count them. It is the equivalent of SQL's "SELECT COUNT(...)".
If in addition you want to do something else with the elements of an enumeration, it makes more sense to generate a collection (usually a list using ToList()), then you can use the Count property and do whatever else you want to do.
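The cost difference can be made visible with a small sketch. Numbers() is a hypothetical source that counts how many times it is walked; the Enumerations counter exists only for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MaterializeDemo
{
    // How many times the underlying source has been walked.
    public static int Enumerations;

    // Hypothetical lazy source; the counter is bumped each time iteration starts.
    public static IEnumerable<int> Numbers()
    {
        Enumerations++;
        for (int i = 1; i <= 10; i++)
            yield return i;
    }

    public static void Main()
    {
        // Lazy query: every aggregate walks the source again.
        IEnumerable<int> evens = Numbers().Where(n => n % 2 == 0);
        int count = evens.Count();
        int sum = evens.Sum();
        Console.WriteLine(count + " " + sum + " " + Enumerations);   // 5 30 2

        // Materialized once: Count is a property read, and Sum walks the list only.
        Enumerations = 0;
        List<int> list = Numbers().Where(n => n % 2 == 0).ToList();
        Console.WriteLine(list.Count + " " + list.Sum() + " " + Enumerations);   // 5 30 1
    }
}
```

If the source is expensive (a database query, a file), the second form pays that cost once at the price of holding the results in memory.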
What I need is a way to select the last 100 elements from a list, as a list.
public List<Model.PIP> GetPIPList()
{
if (Repository.PIPRepository.PIPList == null)
Repository.PIPRepository.Load();
return Repository.PIPRepository.PIPList.Take(100);
}
I get an error like this:
Cannot implicitly convert type 'System.Collections.Generic.IEnumerable<Model.PIP>' to 'System.Collections.Generic.List<Model.PIP>'. An explicit conversion exists (are you missing a cast?)
Enumerable.Reverse(somelist).Take(100).Reverse().ToList(); // Enumerable.Reverse, because List<T>.Reverse() reverses in place and returns void
This would be much cheaper than ordering :) Also preserves the original ordering.
If your list is large, you'll get the best performance by rolling your own:
public static class ListExtensions
{
public static IEnumerable<T> LastItems<T>(this IList<T> list, int numberOfItems) //Can also handle arrays
{
for (int index = Math.Max(list.Count - numberOfItems, 0); index < list.Count; index++)
yield return list[index];
}
}
Why is this faster than using Skip()? If you have a list with 50,000 items, Skip() calls MoveNext() on the enumerator 49,900 times before it starts returning items.
Why is it faster than using Reverse()? Because Reverse allocates a new array large enough to hold the list's elements, and copies them into the array. This is especially good to avoid if the array is large enough to go on the large object heap.
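A usage sketch of the extension method above, repeated here so the example is self-contained. The last line uses Enumerable.TakeLast as a built-in alternative; that assumes .NET Core 2.0 or later (or .NET Standard 2.1), since the method is not part of the classic .NET Framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListExtensions
{
    // Index-based tail: never touches the items before the last numberOfItems.
    public static IEnumerable<T> LastItems<T>(this IList<T> list, int numberOfItems)
    {
        for (int index = Math.Max(list.Count - numberOfItems, 0); index < list.Count; index++)
            yield return list[index];
    }
}

public static class LastItemsDemo
{
    public static void Main()
    {
        List<int> numbers = Enumerable.Range(1, 10).ToList();

        Console.WriteLine(string.Join(",", numbers.LastItems(3)));    // 8,9,10

        // Asking for more items than exist simply yields the whole list.
        Console.WriteLine(string.Join(",", numbers.LastItems(50)));   // 1,2,3,4,5,6,7,8,9,10

        // Built-in alternative on newer runtimes (assumption: .NET Core 2.0+).
        Console.WriteLine(string.Join(",", numbers.TakeLast(3)));     // 8,9,10
    }
}
```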
EDIT: I missed that you said you wanted the last 100 items, and weren't able to do that yet.
To get the last 100 items:
return Repository.PIPRepository.PIPList
.OrderByDescending(pip=>pip.??).Take(100)
.OrderBy(pip=>pip.??);
...and then change your method signature to return IEnumerable<Model.PIP>
?? signifies what ever property you would be sorting on.
Joel also gives a great solution, based on counting the number of items in the list, and skipping all but 100 of them. In many cases, that probably works better. I didn't want to post the same solution in my edit! :)
Try:
public List<Model.PIP> GetPIPList()
{
if (Repository.PIPRepository.PIPList == null)
Repository.PIPRepository.Load();
return Repository.PIPRepository.PIPList.Take(100).ToList();
}
The .Take() method returns an IEnumerable<T> rather than a List<T>. This is a good thing, and you should strongly consider altering your method and your work habits to use IEnumerable<T> rather than List<T> as much as is practical.
Aside from that, .Take(100) will also return the first 100 elements, rather than the last 100 elements. You want something like this:
public IEnumerable<Model.PIP> GetPIPs()
{
if (Repository.PIPRepository.PIPList == null)
Repository.PIPRepository.Load();
return Repository.PIPRepository.PIPList.Skip(Math.Max(0,Repository.PIPRepository.PIPList.Count - 100));
}
If you really need a list rather than an enumerable (hint: you probably don't), it's still better to build this method using an IEnumerable and use .ToList() at the place where you call this method.
At some point in the future you'll want to go back and update your Load() code to also use IEnumerable, as well as code later on in the process. The ultimate goal here is to get to the point where you are effectively streaming your objects to the browser, and only ever have one of them loaded into memory on your web server at a time. IEnumerable allows for this. List does not.
Using Reflector I have noticed that the System.Linq.Enumerable.Count method has a condition in it to optimize it for the case when the IEnumerable<T> passed is in fact an ICollection<T>. If the cast succeeds, the Count method does not need to iterate over every element, but can read the Count property of ICollection<T>.
Based on this I was starting to think that IEnumerable<T> can be used like a readonly view of a collection, without having the performance loss that I originally expected based on the API of IEnumerable<T>
I was interested whether the optimization of the Count still holds when the IEnumerable<T> is a result of a Select statement over an ICollection, but based on reflected code this case is not optimized, and requires an iteration through all elements.
Do you draw the same conclusions from Reflector? What could be the reason behind the lack of this optimization? It seems like there is a lot of time wasted in this common operation. Does the spec require that each element is evaluated even if the Count can be determined without doing that?
It doesn't really matter that the result of Select is lazily evaluated. The Count is always equivalent to the count of the original collection so it could have certainly been retrieved directly by returning a specific object from Select that could be used to short-circuit evaluation of the Count method.
The reason it's not possible to optimize out evaluation of the Count() method on the return value of a Select call from something with determined count (like a List<T>) is that it could change the meaning of the program.
The selector function passed to Select method is allowed to have side effects and its side effects are required to happen deterministically, in a predetermined order.
Assume:
new[]{1,2,3}.Select(i => { Console.WriteLine(i); return 0; }).Count();
The documentation requires this code to print
1
2
3
Even though the count is really known from the start and could be optimized, optimization would change the behavior of the program. That's why you can't avoid enumeration of the collection anyway. That's exactly one of the reasons why compiler optimizations are much easier in pure functional languages.
UPDATE: Apparently, it's not clear that it's perfectly possible to implement Select and Count so that Selects on ICollection<T> will still be lazily evaluated but the Count() will be evaluated in O(1) without enumerating the collection. I'm going to do that without changing the interface of any methods. A similar thing is already done for ICollection<T>:
private interface IDirectlyCountable {
int Count {get;}
}
private class SelectICollectionIterator<TSource,TResult> : IEnumerable<TResult>, IDirectlyCountable {
ICollection<TSource> sequence;
Func<TSource,TResult> selector;
public SelectICollectionIterator(ICollection<TSource> source, Func<TSource,TResult> selector) {
this.sequence = source;
this.selector = selector;
}
public int Count { get { return sequence.Count; } }
// ... GetEnumerator ...
}
public static IEnumerable<TResult> Select<TSource,TResult>(this IEnumerable<TSource> source, Func<TSource,TResult> selector) {
// ... error handling omitted for brevity ...
if (source is ICollection<TSource>)
return new SelectICollectionIterator<TSource,TResult>((ICollection<TSource>)source, selector);
// ... rest of the method ...
}
public static int Count<T>(this IEnumerable<T> source) {
// ...
ICollection<T> collection = source as ICollection<T>;
if (collection != null) return collection.Count;
IDirectlyCountable countableSequence = source as IDirectlyCountable;
if (countableSequence != null) return countableSequence.Count;
// ... enumerate and count the sequence ...
}
This will still evaluate the Count lazily. If you change the underlying collection, the count will get changed and the sequence is not cached. The only difference will be not doing the side effects in the selector delegate.
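Since the built-in Select and Count cannot be changed from user code, the same idea can be tried out with separately named methods. In this sketch SelectCounted, FastCount and CountedSelect are hypothetical names chosen for illustration; they are not part of LINQ.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public interface IDirectlyCountable
{
    int Count { get; }
}

// Lazy projection over a collection that still exposes the source's count.
public sealed class CountedSelect<TSource, TResult> : IEnumerable<TResult>, IDirectlyCountable
{
    private readonly ICollection<TSource> source;
    private readonly Func<TSource, TResult> selector;

    public CountedSelect(ICollection<TSource> source, Func<TSource, TResult> selector)
    {
        this.source = source;
        this.selector = selector;
    }

    // Read on demand, so later changes to the source are reflected.
    public int Count => source.Count;

    public IEnumerator<TResult> GetEnumerator()
    {
        foreach (TSource item in source)
            yield return selector(item);
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class CountedExtensions
{
    public static IEnumerable<TResult> SelectCounted<TSource, TResult>(
        this ICollection<TSource> source, Func<TSource, TResult> selector)
    {
        return new CountedSelect<TSource, TResult>(source, selector);
    }

    // Uses the O(1) count when the sequence offers one, otherwise falls back to LINQ.
    public static int FastCount<T>(this IEnumerable<T> source)
    {
        var countable = source as IDirectlyCountable;
        return countable != null ? countable.Count : source.Count();
    }
}

public static class CountedSelectDemo
{
    public static void Main()
    {
        var list = new List<int> { 1, 2, 3 };
        int selectorCalls = 0;
        IEnumerable<int> projected = list.SelectCounted(x => { selectorCalls++; return x * 2; });

        Console.WriteLine(projected.FastCount());   // 3
        Console.WriteLine(selectorCalls);           // 0, the selector was never run

        list.Add(4);
        Console.WriteLine(projected.FastCount());   // 4, the count is not cached
        Console.WriteLine(projected.Sum());         // 20, enumeration still works lazily
    }
}
```

The trade-off is the one described above: counting no longer triggers the selector's side effects.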
Edit 02-Feb-2010:
As I see it, there are at least two ways to interpret this question.
Why does the Select<T, TResult> extension method, when called on an instance of a class that implements ICollection<T>, not return an object that provides a Count property; and why does the Count<T> extension method not check for this property so as to provide O(1) performance when the two methods are chained?
This version of the question makes no false assumptions about how Linq extensions work, and is a valid question since a call to ICollection<T>.Select.Count will, after all, always return the same value as ICollection<T>.Count. This is how Mehrdad interpreted the question, to which he has provided a thorough response.
But I read the question as asking...
If the Count<T> extension method provides O(1) performance for an object of a class implementing ICollection<T>, why does it provide O(n) performance for the return value of the Select<T, TResult> extension method?
In this version of the question, there is a mistaken assumption: that the Linq extension methods work together by assembling little collections one after another (in memory) and exposing them through the IEnumerable<T> interface.
If this were how the Linq extensions worked, the Select method might look something like this:
public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
List<TResult> results = new List<TResult>();
foreach (T input in source)
results.Add(selector(input));
return results;
}
Moreover, if this were the implementation of Select, I think you'd find most code that utilizes this method would behave just the same. But it would be wasteful, and would in fact cause exceptions in certain cases like the one I described in my original answer.
In reality, I believe the implementation of the Select method is much closer to something like this:
public static IEnumerable<TResult> Select<T, TResult>(this IEnumerable<T> source, Func<T, TResult> selector) {
foreach (T input in source)
yield return selector(input);
yield break;
}
This is to provide lazy evaluation, and explains why a Count property is not accessible in O(1) time to the Count method.
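A quick way to convince yourself that Select really is lazy is to run it over a sequence that never ends. Naturals() below is a hypothetical infinite generator; an eager, list-building Select could never return from it, whereas the lazy one only processes the items that Take asks for.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazySelectDemo
{
    // Hypothetical infinite source: 0, 1, 2, 3, ...
    public static IEnumerable<int> Naturals()
    {
        int n = 0;
        while (true)
            yield return n++;
    }

    // Only the first howMany items are ever pulled through the selector.
    public static List<int> FirstSquares(int howMany)
    {
        return Naturals().Select(n => n * n).Take(howMany).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", FirstSquares(5)));   // 0,1,4,9,16
    }
}
```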
So in other words, whereas Mehrdad answered the question of why Select wasn't designed differently so that Select.Count would behave differently, I have offered my best answer to the question of why Select.Count behaves the way it does.
ORIGINAL ANSWER:
Method side effects is not the answer.
According to Mehrdad's answer:
It doesn't really matter that the result of Select is lazily evaluated.
I don't buy this. Let me explain why.
For starters, consider the following two very similar methods:
public static IEnumerable<double> GetRandomsAsEnumerable(int N) {
Random r = new Random();
for (int i = 0; i < N; ++i)
yield return r.NextDouble();
yield break;
}
public static double[] GetRandomsAsArray(int N) {
Random r = new Random();
double[] values = new double[N];
for (int i = 0; i < N; ++i)
values[i] = r.NextDouble();
return values;
}
OK, what do these methods do? Each one returns as many random doubles as the user desires (up to int.MaxValue). Does it matter whether either method is lazily evaluated or not? To answer this question, let's take a look at the following code:
public static double Invert(double value) {
return 1.0 / value;
}
public static void Test() {
int a = GetRandomsAsEnumerable(int.MaxValue).Select(Invert).Count();
int b = GetRandomsAsArray(int.MaxValue).Select(Invert).Count();
}
Can you guess what will happen with these two method calls? Let me spare you the trouble of copying this code and testing it out yourself:
The first variable, a, will (after a potentially significant amount of time) be initialized to int.MaxValue (currently 2147483647). The second one, b, will very likely be interrupted by an OutOfMemoryException.
Because Select and the other Linq extension methods are lazily evaluated, they allow you to do things you simply could not do otherwise. The above is a fairly trivial example. But my main point is to dispute the assertion that lazy evaluation is not important. Mehrdad's statement that a Count property "is really known from the start and could be optimized" actually begs the question. The issue might seem straightforward for the Select method, but Select is not really special; it returns an IEnumerable<T> just like the rest of the Linq extension methods, and for these methods to "know" the Count of their return values would require full collections to be cached and therefore prohibit lazy evaluation.
Lazy evaluation is the answer.
For this reason, I have to agree with one of the original responders (whose answer now seems to have disappeared) that lazy evaluation really is the answer here. The idea that method side effects need to be accounted for is really secondary, as this is already ensured as a byproduct of lazy evaluation anyway.
Postscript: I've made very assertive statements and emphasized my points mainly because I wanted to be clear on what my argument is, not out of any disrespect for any other responses, including Mehrdad's, which I feel is insightful but misses the mark.
An ICollection<T> knows the number of items (Count) it contains. It doesn't have to iterate any items to determine it. Take for example the HashSet<T> class (which implements ICollection<T>).
An IEnumerable<T> doesn't know how many items it contains. You have to enumerate the whole sequence to determine the number of items (Count).
Wrapping the ICollection<T> in a LINQ statement doesn't make it more efficient. No matter how you twist and turn, the ICollection<T> will have to be enumerated.