Is <Collection>.Count Expensive to Use? - c#

I'm writing a cache-eject method that essentially looks like this:
while ( myHashSet.Count > MAX_ALLOWED_CACHE_MEMBERS )
{
    EjectOldestItem( myHashSet );
}
My question is about how Count is determined: is it just a private or protected int, or is it calculated by counting the elements each time it's called?

From http://msdn.microsoft.com/en-us/library/ms132433.aspx:
Retrieving the value of this property is an O(1) operation.
This guarantees that accessing the Count won't iterate over the whole collection.
Edit: as many other posters suggested, IEnumerable<...>.Count() is however not guaranteed to be O(1). Use with care!
IEnumerable<...>.Count() is an extension method defined in System.Linq.Enumerable. The current implementation explicitly tests whether the counted IEnumerable<T> is in fact an instance of ICollection<T>, and uses ICollection<T>.Count if possible. Otherwise it traverses the IEnumerable<T> (possibly forcing lazy evaluation) and counts the items one by one.
I've not however found in the documentation whether it's guaranteed that IEnumerable<...>.Count() uses O(1) if possible, I only checked the implementation in .NET 3.5 with Reflector.
Necessary late addition: many popular containers are not derived from Collection<T>, but nevertheless their Count property is O(1) (that is, won't iterate over the whole collection). Examples are HashSet<T>.Count (this one is most likely what the OP wanted to ask about), Dictionary<K, V>.Count, LinkedList<T>.Count, List<T>.Count, Queue<T>.Count, Stack<T>.Count and so on.
All these collections implement ICollection<T> or just ICollection, so their Count is an implementation of ICollection<T>.Count (or ICollection.Count). An implementation of ICollection<T>.Count is not required to be an O(1) operation, but the ones mentioned above are, according to the documentation.
(Side note: some containers, for instance Queue<T>, implement the non-generic ICollection but not ICollection<T>, so they "inherit" the Count property only from ICollection.)
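To make the difference concrete, here is a minimal sketch (not taken from the answer above). The class and method names are hypothetical; it only relies on HashSet<T>.Count and the LINQ Count() extension. It counts a lazily generated sequence and records how many items had to be generated to get the answer.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CountCostDemo
{
    // A lazy sequence (hypothetical helper). The compiler-generated iterator does not
    // implement ICollection<T>, so Count() has no shortcut and must walk it.
    public static IEnumerable<int> LazyNumbers(int n, int[] produced)
    {
        for (int i = 0; i < n; i++)
        {
            produced[0]++;          // record that one more item was generated
            yield return i;
        }
    }

    // Counts a lazy sequence of n items and reports how many items had to be generated.
    public static int CountLazily(int n, out int itemsGenerated)
    {
        int[] produced = { 0 };
        int count = LazyNumbers(n, produced).Count();   // LINQ extension method
        itemsGenerated = produced[0];
        return count;
    }

    // The Count property of a HashSet<T> just reads a stored value.
    public static int CountSet(HashSet<int> set)
    {
        return set.Count;
    }
}
```

Calling CountLazily(5, out generated) should report that all five items were generated just to produce the count, whereas CountSet never touches the elements.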

Your question does not specify a specific Collection class so...
It depends on the Collection class. ArrayList has an internal variable that tracks the count, as does List. However, it is implementation specific, and depending on the type of the collection, it could theoretically get recalculated on each call.

It is an internal value, and is not calculated. The documentation states that getting the value is an O(1) operation.

As others have noted, Count is maintained when modifying the collection. This is the case for nearly every collection type in the framework. This is considerably different from using the Count() extension method on an IEnumerable, which has to enumerate the sequence on each call unless the source implements ICollection<T>.
Also, with the newer collection classes the Count property is not virtual which means that the jitter can inline the call to the Count accessor which makes it practically the same as accessing a field. In other words, very quick.

In case of a HashSet it's just an internal int field and even SortedSet (a binary tree based set for .net 4) has its count in an internal field.

According to Reflector, it is implemented as
public int Count{ get; }
so it is defined by the derived type

Just a quick note. Beware that there are two ways to count a collection in .NET 3.5 when System.Linq is used. For a normal collection, the first choice should be to use the Count property, for the reasons already described in other answers.
An alternative method, via the LINQ .Count() extension method, is also available. The intriguing thing about .Count() is that it can be called on ANY enumerable, regardless of whether the underlying class implements ICollection or not, or whether it has a Count property. If you ever do call .Count(), however, be aware that it WILL iterate over the sequence to compute the count unless the underlying object implements ICollection<T>. That generally results in O(n) complexity.
The only reason I wanted to note this is, using IntelliSense, it is often easy to accidentally end up using the Count() extension rather than the Count property.

It's an internal int that gets incremented each time a new item is added to the collection.
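For illustration, the eject loop from the question could be sketched like this. Everything here is an assumption made for the example (the BoundedCache name, the use of a Queue<T> to remember insertion order, and "oldest" meaning "first inserted"); the only point is that checking Count on every iteration is cheap.

```csharp
using System.Collections.Generic;

// Hypothetical bounded cache: the names and the eviction policy are illustrative only.
public sealed class BoundedCache<T>
{
    private readonly HashSet<T> _members = new HashSet<T>();
    private readonly Queue<T> _insertionOrder = new Queue<T>();
    private readonly int _maxMembers;

    public BoundedCache(int maxMembers)
    {
        _maxMembers = maxMembers;
    }

    // Reads the set's stored count; no iteration involved.
    public int Count
    {
        get { return _members.Count; }
    }

    public bool Contains(T item)
    {
        return _members.Contains(item);
    }

    public void Add(T item)
    {
        if (!_members.Add(item))
            return;                       // already cached, nothing to do

        _insertionOrder.Enqueue(item);

        // Same shape as the loop in the question: eject until we are within the limit.
        while (_members.Count > _maxMembers)
        {
            _members.Remove(_insertionOrder.Dequeue());
        }
    }
}
```

With a limit of 2, adding 1, 2 and 3 in that order should leave 2 and 3 in the cache.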

Related

How to count the items of an IEnumerable?

Given an instance IEnumerable o how can I get the item Count? (without enumerating through all the items)
For example, if the instance is an ICollection, ICollection<T> or IReadOnlyCollection<T>, each of these interfaces has its own Count property.
Is getting the Count property by reflection the only way?
Instead, can I check and cast o to ICollection<T> for example, so I can then call Count ?
It depends how badly you want to avoid enumerating the items if the count is not available otherwise.
If you can enumerate the items, you can use the LINQ method Enumerable.Count. It will look for a quick way to get the item count by casting into one of the interfaces. If it can't, it will enumerate.
If you want to avoid enumeration at any cost, you will have to perform a type cast. In a real life scenario you often will not have to consider all the interfaces you have named, since you usually use one of them (IReadOnlyCollection is rare and ICollection only used in legacy code). If you have to consider all of the interfaces, try them all in a separate method, which can be an extension:
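using System.Collections;          // non-generic ICollection
using System.Collections.Generic;  // ICollection<T>, IReadOnlyCollection<T>

static class CountExtensions {
    // Returns the count if it is available without enumerating, otherwise null.
    public static int? TryCount<T>(this IEnumerable<T> items) {
        switch (items) {
            case ICollection<T> genCollection:
                return genCollection.Count;
            case ICollection legacyCollection:
                return legacyCollection.Count;
            case IReadOnlyCollection<T> roCollection:
                return roCollection.Count;
            default:
                return null;
        }
    }
}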
Access the extension method with:
int? count = myEnumerable.TryCount();
IEnumerable doesn't promise a count. What if it was a random sequence or a real time data feed from a sensor? It is entirely possible for the collection to be infinitely sized. The only way to count them is to start at zero and increment for each element that the enumerator provides. Which is exactly what LINQ does, so don't reinvent the wheel. LINQ is smart enough to use .Count properties of collections that support this.
The only way to really cover all your possible types for a collection is to use the generic interface and call the Count() method. This also covers other types such as streams or just iterators. Furthermore it will use the Count property where one is available (see "Count property vs Count() method?") to avoid unnecessary overhead.
If however you have a non-generic collection, you'd have to use reflection to reach the correct property. This is cumbersome and may fail if your collection doesn't even have the property (e.g. an endless stream or just an iterator). On the other hand IEnumerable<T>.Count() will handle those types with the optimization mentioned above. Only if necessary will it iterate the entire collection.
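For the non-generic case, a rough sketch of the cast-then-reflect idea might look like this. The NonGenericCount class and its members are hypothetical names for this example; it tries the non-generic ICollection first and only then looks for a generic ICollection<T> or IReadOnlyCollection<T> via reflection. It never enumerates.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class NonGenericCount
{
    // Returns the count if some collection interface exposes it, otherwise null.
    public static int? TryCount(IEnumerable items)
    {
        ICollection legacy = items as ICollection;
        if (legacy != null)
            return legacy.Count;

        // Fall back to reflection: look for ICollection<T> or IReadOnlyCollection<T>.
        foreach (Type iface in items.GetType().GetInterfaces())
        {
            if (!iface.IsGenericType)
                continue;

            Type definition = iface.GetGenericTypeDefinition();
            if (definition == typeof(ICollection<>) || definition == typeof(IReadOnlyCollection<>))
            {
                var property = iface.GetProperty("Count");
                if (property != null)
                {
                    var value = property.GetValue(items, null);
                    if (value is int)
                        return (int)value;
                }
            }
        }

        return null; // e.g. an iterator or an endless stream
    }

    // A lazy sequence used to show the "no count available" case.
    public static IEnumerable LazyThree()
    {
        yield return 1;
        yield return 2;
        yield return 3;
    }
}
```

An ArrayList takes the cast path, a HashSet<int> takes the reflection path, and LazyThree() yields null because an iterator exposes no Count at all.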

Why IReadOnlyCollection has ElementAt but not IndexOf

I am working with an IReadOnlyCollection of objects.
Now I'm a bit surprised, because I can use linq extension method ElementAt(). But I don't have access to IndexOf().
This to me looks a bit illogical: I can get the element at a given position, but I cannot get the position of that very same element.
Is there a specific reason for it?
I've already read -> How to get the index of an element in an IEnumerable? and I'm not totally happy with the response.
IReadOnlyCollection is a collection, not a list, so strictly speaking, it should not even have ElementAt(). This method is defined as a LINQ extension method on IEnumerable<T> as a convenience, and IReadOnlyCollection<T> has it because it extends IEnumerable<T>. If you look at the source code, it checks whether the IEnumerable<T> is in fact an IList<T>, and if so it returns the element at the requested index; otherwise it does a linear traversal of the IEnumerable<T> up to the requested index, which is inefficient.
So, you might ask why IEnumerable has an ElementAt() but not IndexOf(), but I do not find this question very interesting, because it should not have either of these methods. An IEnumerable is not supposed to be indexable.
Now, a very interesting question is why IReadOnlyList has no IndexOf() either.
IReadOnlyList<T> has no IndexOf() for no good reason whatsoever.
If you really want to find a reason to mention, then the reason is historical:
Back when the .NET collection interfaces were laid down (IList<T> arrived with .NET 2.0 in 2005), people had not quite started to realize the benefits of immutability and readonlyness, so the IList<T> interface that they baked into the framework was, unfortunately, mutable.
The right thing would have been to come up with IReadOnlyList<T> as the base interface, and make IList<T> extend it, adding mutation methods only, but that's not what happened.
IReadOnlyList<T> was invented a considerable time after IList<T>, and by that time it was too late to redefine IList<T> and make it extend IReadOnlyList<T>. So, IReadOnlyList<T> was built from scratch.
They could not make IReadOnlyList<T> extend IList<T>, because then it would have inherited the mutation methods, so they based it on IReadOnlyCollection<T> and IEnumerable<T> instead. They added the this[i] indexer, but then they either forgot to add other methods like IndexOf(), or they intentionally omitted them since they can be implemented as extension methods, thus keeping the interface simpler. But they did not provide any such extension methods.
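So, here is an extension method that adds IndexOf() to IReadOnlyList<T>:
using Collections = System.Collections.Generic;

// Extension methods must be declared in a static class; the class name is arbitrary.
public static class ReadOnlyListExtensions
{
    public static int IndexOf<T>( this Collections.IReadOnlyList<T> self, T elementToFind )
    {
        int i = 0;
        foreach( T element in self )
        {
            // Uses object.Equals, i.e. the default notion of equality.
            if( Equals( element, elementToFind ) )
                return i;
            i++;
        }
        return -1;
    }
}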
Be aware of the fact that this extension method is not as powerful as a method built into the interface would be. For example, if you are implementing a collection which expects an IEqualityComparer<T> as a construction (or otherwise separate) parameter, this extension method will be blissfully unaware of it, and this will of course lead to bugs. (Thanks to Grx70 for pointing this out in the comments.)
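One way to soften that limitation is to let the caller pass the comparer explicitly. This is only a sketch under that assumption: the class name and the extra comparer parameter are hypothetical, and the method still cannot discover a comparer that the collection holds privately.

```csharp
using System.Collections.Generic;

public static class ReadOnlyListComparerExtensions
{
    // Returns the index of the first element equal to elementToFind
    // according to the supplied comparer, or -1 if there is none.
    public static int IndexOf<T>(
        this IReadOnlyList<T> self, T elementToFind, IEqualityComparer<T> comparer)
    {
        // Fall back to the default comparer when none is given.
        IEqualityComparer<T> effective = comparer ?? EqualityComparer<T>.Default;

        for (int i = 0; i < self.Count; i++)
        {
            if (effective.Equals(self[i], elementToFind))
                return i;
        }
        return -1;
    }
}
```

For example, searching an IReadOnlyList<string> containing "a" and "B" for "b" with StringComparer.OrdinalIgnoreCase should return 1, while the ordinal comparer returns -1.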
It is because IReadOnlyCollection (which extends IEnumerable) does not necessarily implement indexing, which is often required when you want to numerically order a List. IndexOf is from IList.
Think of a collection without index like Dictionary for example, there is no concept of numeric index in Dictionary. In Dictionary, the order is not guaranteed, only one to one relation between key and value. Thus, collection does not necessarily imply numeric indexing.
Another reason is that IEnumerable is not really a two-way street. Think of it this way: IEnumerable can enumerate the items x times as you specify and find the element at x (that is, ElementAt), but it cannot efficiently know at which index a given element is located (that is, IndexOf).
But yes, it is still pretty weird even if you think of it this way, as you would expect it to have either both ElementAt and IndexOf or neither.
IndexOf is a method defined on List, whereas IReadOnlyCollection inherits just IEnumerable.
This is because IEnumerable is just for iterating entities. However an index doesn't apply to this concept, because the order is arbitrary and is not guaranteed to be identical between calls to IEnumerable. Furthermore the interface simply states that you can iterate a collection, whereas List states you can perform adding and removing also.
The ElementAt method sure does exactly this. However I won't use it as it reiterates the whole enumeration to find one single element. Better use First or just a list-based approach.
Anyway the API design seems odd to me as it allows an (inefficient) approach on getting an element at n-th position but does not allow to get the index of an arbitrary element which would be the same inefficient search leading to up to n iterations. I'd agree with Ian on either both (which I wouldn't recommend) or neither.
IReadOnlyCollection<T> has ElementAt<T>() because it is an extension to IEnumerable<T>, which has that method. ElementAt<T>() iterates over the IEnumerable<T> a specified number of iterations and returns the value at that position.
IReadOnlyCollection<T> lacks IndexOf<T>() because, as an IEnumerable<T>, it does not have any specified order and thus the concept of an index does not apply. Nor does IReadOnlyCollection<T> add any concept of order.
I would recommend IReadOnlyList<T> when you want an indexable version of IReadOnlyCollection<T>. This allows you to correctly represent an unchangeable collection of objects with an index.
This extension method is almost the same as Mike's. The only difference is that it includes a predicate, so you can use it like this: var index = list.IndexOf(obj => obj.Id == id)
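using System;
using System.Collections.Generic;

// Extension methods must be declared in a static class; the class name is arbitrary.
public static class ReadOnlyListPredicateExtensions
{
    public static int IndexOf<T>(this IReadOnlyList<T> self, Func<T, bool> predicate)
    {
        for (int i = 0; i < self.Count; i++)
        {
            if (predicate(self[i]))
                return i;
        }
        return -1;
    }
}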

Sorting ConcurrentDictionary makes any sense?

At first my thought was "this is a hash-based data type, so it is unsorted".
Then since I was about to use it I examined the matter in depth and found out that this class implements IEnumerable and also this post confirmed that it is possible to iterate over this kind of data.
So, my question is: if I use foreach over a ConcurrentDictionary, in which order do I read the elements?
Then, as a second question, I'd like to know if the sorting methods inherited from its interfaces are of any use. If I call a sorting method on a ConcurrentDictionary, will the new order persist (for example for a subsequent foreach)?
Hope I've made myself clear
The current implementation makes no promises whatsoever regarding the order of the elements.
A future implementation can easily change the order by which the elements are enumerated.
As such, your code should not depend on that order.
From the Dictionary<TKey, TValue> msdn docs:
The order in which the items are returned is undefined.
(I couldn't find any reference regarding the ConcurrentDictionary, but the same principle applies.)
When you refer to "the sorting methods inherited by its interfaces", do you mean LINQ extensions? Like OrderBy? If so, these extensions are purely functional and always return a new collection. So, to answer your question "the new order will persist?": no, it won't. You can however use it like this:
foreach(KeyValuePair<T1, T2> kv in dictionary.OrderBy(...))
{
}
if I use foreach over a ConcurrentDictionary, in which order do I read the elements?
You get them in the order of buckets they belong to, and if a bucket contains multiple items, the items are in the order in which they've been added.
But as others have said, this is an implementation detail you shouldn't rely on.
I'd like to know if the sorting methods inherited from its interfaces are of any use. If I call a sorting method on a ConcurrentDictionary, will the new order persist (for example for a subsequent foreach)?
I assume you're referring to the OrderBy() extension method on the IEnumerable<KeyValuePair<TKey, TValue>> interface. No, nothing will persist. This method returns another IEnumerable<KeyValuePair<TKey, TValue>>. The dictionary remains as it is.
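A small sketch of what that means in practice, assuming string keys and int values (the SortedViewDemo class and method name are made up for this example): OrderBy produces a new, ordered sequence each time, and the dictionary itself is left untouched.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public static class SortedViewDemo
{
    // Returns the keys in ordinal order. The dictionary is only read, never reordered.
    public static List<string> KeysInOrder(ConcurrentDictionary<string, int> scores)
    {
        return scores
            .OrderBy(kv => kv.Key, StringComparer.Ordinal)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

If you need the order again later, you have to call the method again (or keep the returned list); a plain foreach over the dictionary still sees whatever order the implementation happens to produce.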
Sounds like you might be asking for trouble if you aren't particularly careful. As dcastro mentioned, the order of elements is not guaranteed. A more troublesome issue is that a ConcurrentDictionary can be changed at any time by other threads. This means that even if the order were guaranteed, there is no reason why new items added while you iterate wouldn't be missed. Unless you know you can prevent other threads from changing the dictionary, it's probably not a good idea to iterate over it.
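If you do need a stable view to work on, one option (my suggestion, not something from the answers above) is to take a snapshot with ConcurrentDictionary's own ToArray() method and iterate over the copy. The SnapshotDemo class and the key/value types are assumptions for this sketch.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class SnapshotDemo
{
    // Copies the current contents into an array. Later changes to the dictionary
    // do not show up in the returned array.
    public static KeyValuePair<string, int>[] Snapshot(ConcurrentDictionary<string, int> source)
    {
        return source.ToArray();
    }
}
```

The trade-off is that the snapshot can be stale by the time you look at it, so this only helps when a consistent but possibly outdated view is acceptable.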

Must IList be finite?

Must .NET's IList be finite? Suppose I write a class FibonacciList implementing IList<BigInteger>
The property Item[n] returns the nth Fibonacci number.
The property IsReadOnly returns true.
The methods IndexOf and Contains we can implement easily enough because the Fibonacci sequence is increasing - to test if the number m is Fibonacci, we need only to compute the finite sequence of Fibonacci numbers up to m.
The method GetEnumerator() doing the right thing
We've now implemented all the members expected of read-only ILists except Count.
Is this cool, or an abuse of IList?
Fibonacci numbers get impractically big quickly (hence IList<BigInteger> above). A bounded infinite sequence might be more sensible; it could implement IList<long> or IList<double>.
Addendum II: Fibonacci sequence may have been a bad example, because computing distant values is expensive - to find the nth value one has to compute all earlier values. Thus as Mošmondor said, one might as well make it an IEnumerable and use .ElementAt. However there exist other sequences where one can compute distant values quickly without computing earlier values. (Surprisingly the digits of pi are such a sequence). These sequences are more 'listy', they truly support random access.
Edit: No-one argues against infinite IEnumerables. How do they handle Count()?
To most developers, IList and ICollection imply that you have a pre-evaluated, in-memory collection to work with. With IList specifically, there is an implicit contract of constant-time Add* and indexing operations. This is why LinkedList<T> does not implement IList<T>. I would consider a FibonacciList to be a violation of this implied contract.
Note the following paragraph from a recent MSDN Magazine article discussing the reasons for adding read-only collection interfaces to .NET 4.5:
IEnumerable<T> is sufficient for most scenarios that deal with collections of types, but sometimes you need more power than it provides:
Materialization: IEnumerable<T> does not allow you to express whether the collection is already available (“materialized”) or whether it’s computed every time you iterate over it (for example, if it represents a LINQ query). When an algorithm requires multiple iterations over the collection, this can result in performance degradation if computing the sequence is expensive; it can also cause subtle bugs because of identity mismatches when objects are being generated again on subsequent passes.
As others have pointed out, there is also the question of what you would return for .Count.
It's perfectly fine to use IEnumerable or IQueryable for such collections of data, because there is an expectation that these types can be lazily evaluated.
Regarding Edit 1: .Count() is not implemented by the IEnumerable<T> interface: it is an extension method. As such, developers need to expect that it can take any amount of time, and they need to avoid calling it in cases where they don't actually need to know the number of items. For example, if you just want to know whether an IEnumerable<T> has any items, it's better to use .Any(). If you know that there's a maximum number of items you want to deal with, you can use .Take(). If a collection has more than int.MaxValue items in it, .Count() will throw an OverflowException. So there are some workarounds that can help to reduce the danger associated with infinite sequences. Obviously if programmers haven't taken these possibilities into account, it can still cause problems, though.
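As a quick illustration of those workarounds, here is a sketch with a deliberately endless sequence (the InfiniteDemo class and its Naturals() generator are invented for this example, and overflow of the long counter is ignored). Any() and Take() only pull as many items as they need, so they return even though the sequence never ends.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class InfiniteDemo
{
    // An endless sequence 0, 1, 2, ... Calling Count() on this would never return.
    public static IEnumerable<long> Naturals()
    {
        long n = 0;
        while (true)
        {
            yield return n++;
        }
    }

    // Any() stops after the first item.
    public static bool HasAny()
    {
        return Naturals().Any();
    }

    // Take(5) stops after five items, so ToList() is safe here.
    public static List<long> FirstFive()
    {
        return Naturals().Take(5).ToList();
    }
}
```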
Regarding Edit 2: If you're planning to implement your sequence in a way that indexing is constant-time, that addresses my main point pretty handily. Sixlettervariables's answer still holds true, though.
*Obviously there's more to this: Add is only expected to work if IList.IsFixedSize returns false. Modification is only possible if IsReadOnly returns false, etc. IList was a poorly-thought-out interface in the first place: a fact which may finally be remedied by the introduction of read-only collection interfaces in .NET 4.5.
Update
Having given this some additional thought, I've come to the personal opinion that IEnumerable<>s should not be infinite either. In addition to materializing methods like .ToList(), LINQ has several non-streaming operations like .OrderBy() which must consume the entire IEnumerable<> before the first result can be returned. Since so many methods assume IEnumerable<>s are safe to traverse in their entirety, it would be a violation of the Liskov Substitution Principle to produce an IEnumerable<> that is inherently unsafe to traverse indefinitely.
If you find that your application often requires segments of the Fibonacci sequence as IEnumerables, I'd suggest creating a method with a signature similar to Enumerable.Range(int, int), which allows the user to define a starting and ending index.
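A minimal sketch of such a method, under my own assumptions: the FibonacciSequence class name is made up, indexing starts at F(0) = 0 and F(1) = 1, and the second parameter is a count, as in Enumerable.Range(int, int), rather than an end index.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class FibonacciSequence
{
    // Returns F(start), F(start + 1), ..., F(start + count - 1) as a finite sequence.
    public static IEnumerable<BigInteger> Range(int start, int count)
    {
        if (start < 0) throw new ArgumentOutOfRangeException(nameof(start));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
        return Iterate(start, count);
    }

    private static IEnumerable<BigInteger> Iterate(int start, int count)
    {
        BigInteger current = BigInteger.Zero;   // F(0)
        BigInteger next = BigInteger.One;       // F(1)
        long end = (long)start + count;         // long avoids int overflow in the bound

        for (long i = 0; i < end; i++)
        {
            if (i >= start)
                yield return current;

            BigInteger sum = current + next;
            current = next;
            next = sum;
        }
    }
}
```

Because the result is always finite, callers can safely use Count(), ToList() or OrderBy() on it. Under the indexing assumed here, Range(5, 3) yields 5, 8, 13.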
If you'd like to embark on a Gee-Whiz project, you could conceivably develop a Fibonacci-based IQueryable<> provider, where users could use a limited subset of LINQ query syntax, like so:
// LINQ to Fibonacci!
var fibQuery = from n in Fibonacci.Numbers // (returns an IQueryable<>)
               where n.Index > 5 && n.Value < 20000
               select n.Value;
var fibCount = fibQuery.Count();
var fibList = fibQuery.ToList();
Since your query provider would have the power to evaluate the where clauses as lambda expressions, you could have enough control to implement Count methods and .GetEnumerator() in a way as to ensure that the query is restrictive enough to produce a real answer, or throw an exception as soon as the method is called.
But this reeks of being clever, and would probably be a really bad idea for any real-life software.
I would imagine that a conforming implementation must be finite, otherwise what would you return for ICollection<T>.Count?
/// <summary>
/// Gets the number of elements contained in the <see cref="ICollection{T}" />.
/// </summary>
int Count { get; }
Another consideration is CopyTo, which under its normal overload would never stop in a Fibonacci case.
What this means is an appropriate implementation of a Fibonacci Sequence would be simply IEnumerable<int> (using a generator pattern). (Ab)use of an IList<T> would just cause problems.
In your case, I would rather 'violate' IEnumerable and have my way with yield return.
:)
An infinite collection would probably best be implemented as an IEnumerable<T>, not an IList<T>. You could also make use of the yield return syntax when implementing, like so (ignore overflow issues, etc.):
public IEnumerable<long> Fib()
{
    yield return 1;
    yield return 1;
    long l1 = 1;
    long l2 = 1;
    while (true)
    {
        long t = l1;
        l1 = l2;
        l2 = t + l1;
        yield return l2;
    }
}
As @CodeInChaos pointed out in the comments, the Item property of IList has signature
T this[ int index ] { get; set; }
We see ILists are indexed by ints, so their length is bounded by Int32.MaxValue. Elements of greater index would be inaccessible. This occurred to me when writing the question, but I left it out, because the problem is fun to think about otherwise.
EDIT
Having had a day to reflect on my answer, and in light of @StriplingWarrior's comment, I fear I have to make a reversal. I started trying this out last night and now I wonder what I would really lose by abandoning IList.
I think it would be wiser to implement just IEnumerable and declare a local Count() method that throws a NotSupportedException, to prevent the enumerator from running until an OutOfMemoryException occurs. I would still add IndexOf and Contains methods and an Item indexer property to expose higher-performance alternatives like Binet's Formula, but I'd be free to change the signatures of these members to use extended datatypes, potentially even System.Numerics.BigInteger.
If I were implementing multiple series I would declare an ISeries interface for these members. Who knows, perhaps something like this will eventually be part of the framework.
I disagree with what appears to be a consensus view. Whilst IList has many members that cannot be implemented for an infinite series it does have an IsReadOnly member. It seems acceptable, certainly in the case of ReadOnlyCollection<>, to implement the majority of members with a NotSupportedException. Following this precedent, I don't see why this should be unacceptable if it is a side effect of some other gain in function.
In this specific Fibonacci series case, there are established algorithms, see here and here, for short-circuiting the normal cumulative enumeration approach, which I think would yield significant performance benefits. Exposing these benefits through IList seems warranted to me.
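I can't tell which algorithms the original links pointed to, but one well-known shortcut is the "fast doubling" method, sketched below. It uses the identities F(2k) = F(k) * (2 * F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2 to reach F(n) in O(log n) steps instead of n additions. The class name is hypothetical and indexing assumes F(0) = 0, F(1) = 1.

```csharp
using System;
using System.Numerics;

public static class FastFibonacci
{
    // Returns F(n) using fast doubling.
    public static BigInteger At(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));
        return Pair(n).Item1;
    }

    // Returns the pair (F(n), F(n + 1)).
    private static Tuple<BigInteger, BigInteger> Pair(int n)
    {
        if (n == 0)
            return Tuple.Create(BigInteger.Zero, BigInteger.One);

        Tuple<BigInteger, BigInteger> half = Pair(n / 2);
        BigInteger a = half.Item1;               // F(k), where k = n / 2
        BigInteger b = half.Item2;               // F(k + 1)

        BigInteger even = a * (2 * b - a);       // F(2k)
        BigInteger odd = a * a + b * b;          // F(2k + 1)

        return n % 2 == 0
            ? Tuple.Create(even, odd)            // n = 2k:     (F(2k), F(2k + 1))
            : Tuple.Create(odd, even + odd);     // n = 2k + 1: (F(2k + 1), F(2k + 2))
    }
}
```

This is the kind of random access that would make an indexer worthwhile: At(10) is 55 and At(50) is 12586269025, each computed without enumerating the earlier values one by one.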
Ideally, .NET would support some other, more appropriate super class or interface, somewhat closer to IEnumerable<>, but until that arrives in some future version, this has got to be a sensible approach.
I'm working on an implementation of IList<BigInteger> to illustrate.
Summarising what I've seen so far:
You can fulfil 5 out of 6, throwing a NotSupportedException on Count()
I would have said this is probably good enough to go for it, however as servy has pointed out, the indexer is incredibly inefficient for any non-calculated and cached number.
In this case, I would say the only contract that fits your continual stream of calculations is IEnumerable.
The other option you have is to create something that looks a lot like an IList but isn't actually.

IEnumerable<T> vs T[]

I just realized that maybe I was mistaken all the time in exposing T[] to my views, instead of IEnumerable<T>.
Usually, for this kind of code:
foreach (var item in items) {}
should items be T[] or IEnumerable<T>?
Then, if I need to get the count of the items, would the array's Length be faster than IEnumerable<T>.Count()?
IEnumerable<T> is generally a better choice here, for the reasons listed elsewhere. However, I want to bring up one point about Count(). Quintin is incorrect when he says that the type itself implements Count(). It's actually implemented in Enumerable.Count() as an extension method, which means other types don't get to override it to provide more efficient implementations.
By default, Count() has to iterate over the whole sequence to count the items. However, it does know about ICollection<T> and ICollection, and is optimised for those cases. (In .NET 3.5 IIRC it's only optimised for ICollection<T>.) Now the array does implement that, so Enumerable.Count() defers to ICollection<T>.Count and avoids iterating over the whole sequence. It's still going to be slightly slower than calling Length directly, because Count() has to discover that it implements ICollection<T> to start with - but at least it's still O(1).
The same kind of thing is true for performance in general: the JITted code may well be somewhat tighter when iterating over an array rather than a general sequence. You'd basically be giving the JIT more information to play with, and even the C# compiler itself treats arrays differently for iteration (using the indexer directly).
However, these performance differences are going to be inconsequential for most applications - I'd definitely go with the more general interface until I had good reason not to.
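To show what "the more general interface" buys you, here is a small sketch with a hypothetical ViewHelpers.Describe method. Because it accepts IEnumerable<T>, the caller can pass an array, a List<T> or a lazy query without any change to the method.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ViewHelpers
{
    // Works for arrays, lists and lazy sequences alike.
    public static string Describe<T>(IEnumerable<T> items)
    {
        // Materialize once so that a lazy source is not enumerated twice.
        List<T> materialized = items.ToList();
        return materialized.Count + " item(s): " + string.Join(", ", materialized);
    }
}
```

Describe(new[] { 1, 2, 3 }) and Describe(new List<int> { 1, 2, 3 }) should both give "3 item(s): 1, 2, 3".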
It's partially inconsequential, but standard theory would dictate "Program against an interface, not an implementation". With the interface model you can change the actual datatype being passed without affecting the caller as long as it conforms to the same interface.
The contrast to that is that you might have a reason for exposing an array specifically and in which case would want to express that.
For your example I think IEnumerable<T> would be desirable. It's also worth noting that, for testing purposes, using an interface can reduce the headache of re-creating particular classes all the time. Collections generally aren't as bad in that respect, but having an interface contract you can mock easily is very nice.
Added for edit:
This is more inconsequential because the underlying datatype is what will implement the Count() method, for an array it should access the known length, I would not worry about any perceived overhead of the method.
See Jon Skeet's answer for an explanation of the Count() implementation.
T[] (one-dimensional, zero-based) also implements ICollection<T> and IList<T> along with IEnumerable<T>.
Therefore if you want lesser coupling in your application IEnumerable<T> is preferable. Unless you want indexed access inside foreach.
Since Array class implements the System.Collections.Generic.IList<T>, System.Collections.Generic.ICollection<T>, and System.Collections.Generic.IEnumerable<T> generic interfaces, I would use IEnumerable, unless you need to use these interfaces.
http://msdn.microsoft.com/en-us/library/system.array.aspx
Your gut feeling is correct, if all the view cares about, or should care about, is having an enumerable, that's all it should demand in its interfaces.
What is it logically (conceptually) from the outside?
If it's an array, then return the array. If the only point is to enumerate, then return IEnumerable. Otherwise IList or ICollection may be the way to go.
If you want to offer lots of functionality but not allow it to be modified, then perhaps use a List internally and return the ReadOnlyCollection<T> returned from its .AsReadOnly() method.
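A sketch of that pattern, with a hypothetical Playlist class: the list stays private, and callers get the read-only wrapper produced by List<T>.AsReadOnly(), exposed here as IReadOnlyList<string> (which assumes .NET 4.5 or later).

```csharp
using System.Collections.Generic;

public sealed class Playlist
{
    private readonly List<string> _tracks = new List<string>();

    // A read-only wrapper around the private list: callers can index and count,
    // but attempts to modify it through the wrapper throw NotSupportedException.
    public IReadOnlyList<string> Tracks
    {
        get { return _tracks.AsReadOnly(); }
    }

    public void Add(string track)
    {
        _tracks.Add(track);
    }
}
```

The wrapper is a live view rather than a copy, so items added later through Add() are visible through a previously obtained Tracks reference.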
Given that changing the code from an array to IEnumerable at a later date is easy, but changing it the other way is not, I would go with an IEnumerable until you know you need the small speed benefit of returning an array.
