Must .NET's IList be finite? Suppose I write a class FibonacciList implementing IList<BigInteger>
The property Item[n] returns the nth Fibonacci number.
The property IsReadOnly returns true.
The methods IndexOf and Contains are easy enough to implement, because the Fibonacci sequence is increasing - to test whether a number m is a Fibonacci number, we need only compute the finite sequence of Fibonacci numbers up to m (a sketch follows below).
The method GetEnumerator() can be made to do the right thing, lazily yielding successive Fibonacci numbers.
We've now implemented all the methods expected of read-only ILists except Count().
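For illustration, a minimal sketch of that Contains idea (a hypothetical helper of my own, not code from the class above); it relies only on the sequence being non-decreasing:
using System.Numerics;

public static class FibonacciChecks
{
    // Hypothetical helper: because the Fibonacci sequence is non-decreasing,
    // membership can be tested by generating terms until we reach or pass m.
    public static bool IsFibonacci(BigInteger m)
    {
        if (m < 0) return false;
        BigInteger a = 0, b = 1;          // F(0), F(1)
        while (a < m)
        {
            BigInteger next = a + b;
            a = b;
            b = next;
        }
        return a == m;                    // did we stop exactly on m?
    }
}
IndexOf is the same loop with a counter.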
Is this cool, or an abuse of IList?
Addendum I: Fibonacci numbers get impractically big quickly (hence IList<BigInteger> above). A bounded infinite sequence might be more sensible; it could implement IList<long> or IList<double>.
Addendum II: The Fibonacci sequence may have been a bad example, because computing distant values is expensive - to find the nth value one has to compute all earlier values. Thus, as Mošmondor said, one might as well make it an IEnumerable and use .ElementAt. However, there exist other sequences where one can compute distant values quickly without computing earlier values. (Surprisingly, the digits of pi are such a sequence.) These sequences are more 'listy'; they truly support random access.
Edit: No one argues against infinite IEnumerables. How do they handle Count()?
To most developers, IList and ICollection imply that you have a pre-evaluated, in-memory collection to work with. With IList specifically, there is an implicit contract of constant-time Add* and indexing operations. This is why LinkedList<T> does not implement IList<T>. I would consider a FibonacciList to be a violation of this implied contract.
Note the following paragraph from a recent MSDN Magazine article discussing the reasons for adding read-only collection interfaces to .NET 4.5:
IEnumerable<T> is sufficient for most scenarios that deal with collections of types, but sometimes you need more power than it provides:
Materialization: IEnumerable<T> does not allow you to express whether the collection is already available (“materialized”) or whether it’s computed every time you iterate over it (for example, if it represents a LINQ query). When an algorithm requires multiple iterations over the collection, this can result in performance degradation if computing the sequence is expensive; it can also cause subtle bugs because of identity mismatches when objects are being generated again on subsequent passes.
As others have pointed out, there is also the question of what you would return for .Count.
It's perfectly fine to use IEnumerable or IQueryable for such collections of data, because there is an expectation that these types can be lazily evaluated.
Regarding Edit 1: .Count() is not implemented by the IEnumerable<T> interface: it is an extension method. As such, developers need to expect that it can take any amount of time, and they need to avoid calling it in cases where they don't actually need to know the number of items. For example, if you just want to know whether an IEnumerable<T> has any items, it's better to use .Any(). If you know that there's a maximum number of items you want to deal with, you can use .Take(). If a collection has more than int.MaxValue items in it, .Count() will throw an OverflowException. So there are some workarounds that can help to reduce the danger associated with infinite sequences. Obviously if programmers haven't taken these possibilities into account, it can still cause problems, though.
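For instance (illustrative only; source stands for some hypothetical, possibly infinite sequence):
using System.Collections.Generic;
using System.Linq;

static class SafeConsumption
{
    static void Demo(IEnumerable<int> source)
    {
        bool hasItems = source.Any();             // stops after the first element
        var firstTen = source.Take(10).ToList();  // bounds the amount of work
        // source.Count() would never return on a truly infinite sequence.
    }
}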
Regarding Edit 2: If you're planning to implement your sequence in a way that indexing is constant-time, that addresses my main point pretty handily. Sixlettervariables's answer still holds true, though.
*Obviously there's more to this: Add is only expected to work if IList.IsFixedSize returns false. Modification is only possible if IsReadOnly returns false, etc. IList was a poorly-thought-out interface in the first place: a fact which may finally be remedied by the introduction of read-only collection interfaces in .NET 4.5.
Update
Having given this some additional thought, I've come to the personal opinion that IEnumerable<>s should not be infinite either. In addition to materializing methods like .ToList(), LINQ has several non-streaming operations like .OrderBy() which must consume the entire IEnumerable<> before the first result can be returned. Since so many methods assume IEnumerable<>s are safe to traverse in their entirety, it would be a violation of the Liskov Substitution Principle to produce an IEnumerable<> that is inherently unsafe to traverse indefinitely.
If you find that your application often requires segments of the Fibonacci sequence as IEnumerables, I'd suggest creating a method with a signature similar to Enumerable.Range(int, int), which allows the user to define a starting and ending index.
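A sketch of what such a method might look like (the names and the F(0) = 0 indexing convention are my own):
using System.Collections.Generic;
using System.Numerics;

public static class FibonacciSequence
{
    // Hypothetical helper, analogous to Enumerable.Range(int, int):
    // yields `count` Fibonacci numbers starting at index `start`.
    public static IEnumerable<BigInteger> Range(int start, int count)
    {
        BigInteger a = 0, b = 1;                  // F(0), F(1)
        for (int i = 0; i < start; i++)
        {
            BigInteger next = a + b;
            a = b;
            b = next;
        }
        for (int i = 0; i < count; i++)
        {
            yield return a;
            BigInteger next = a + b;
            a = b;
            b = next;
        }
    }
}
For example, FibonacciSequence.Range(5, 3) would yield 5, 8, 13.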
If you'd like to embark on a Gee-Whiz project, you could conceivably develop a Fibonacci-based IQueryable<> provider, where users could use a limited subset of LINQ query syntax, like so:
// LINQ to Fibonacci!
var fibQuery = from n in Fibonacci.Numbers // (returns an IQueryable<>)
               where n.Index > 5 && n.Value < 20000
               select n.Value;
var fibCount = fibQuery.Count();
var fibList = fibQuery.ToList();
Since your query provider would have the power to evaluate the where clauses as lambda expressions, you could have enough control to implement Count and .GetEnumerator() in such a way as to ensure that the query is restrictive enough to produce a real answer, or to throw an exception as soon as the method is called.
But this reeks of being clever, and would probably be a really bad idea for any real-life software.
I would imagine that a conforming implementation must be finite, otherwise what would you return for ICollection<T>.Count?
/// <summary>
/// Gets the number of elements contained in the <see cref="ICollection{T}" />.
/// </summary>
int Count { get; }
Another consideration is CopyTo, which under its normal overload would never stop in a Fibonacci case.
What this means is that an appropriate implementation of a Fibonacci sequence would simply be an IEnumerable<int> (using a generator pattern). (Ab)use of IList<T> would just cause problems.
In your case, I would rather 'violate' IEnumerable and have my way with yield return.
:)
An infinite collection would probably best be implemented as an IEnumerable<T>, not an IList<T>. You could also make use of the yield return syntax when implementing, like so (ignore overflow issues, etc.):
public IEnumerable<long> Fib()
{
    yield return 1;
    yield return 1;
    long l1 = 1;
    long l2 = 1;
    while (true)
    {
        long t = l1;
        l1 = l2;
        l2 = t + l1;
        yield return l2;
    }
}
As @CodeInChaos pointed out in the comments, the Item property of IList has the signature
T this[ int index ] { get; set; }
We see that ILists are indexed by ints, so their length is bounded by Int32.MaxValue. Elements at greater indices would be inaccessible. This occurred to me when writing the question, but I left it out because the problem is fun to think about otherwise.
EDIT
Having had a day to reflect on my answer, and in light of @StriplingWarrior's comment, I fear I have to make a reversal. I started trying this out last night, and now I wonder: what would I really lose by abandoning IList?
I think it would be wiser to implement just IEnumerable and declare a local Count() method that throws a NotSupportedException, rather than letting the enumerator run until an OutOfMemoryException occurs. I would still add IndexOf and Contains methods and an Item indexer property to expose higher-performance alternatives like Binet's formula, but I'd be free to change the signatures of these members to use extended datatypes, potentially even System.Numerics.BigInteger.
If I were implementing multiple series, I would declare an ISeries interface for these members. Who knows, perhaps something like this will eventually be part of the framework.
I disagree with what appears to be the consensus view. Whilst IList has many members that cannot be implemented for an infinite series, it does have an IsReadOnly member. It seems acceptable, certainly in the case of ReadOnlyCollection<>, to implement the majority of members by throwing NotSupportedException. Following this precedent, I don't see why this should be unacceptable if it is a side effect of some other gain in function.
In this specific Fibonacci series case, there are established algorithms (see here and here) for short-circuiting the normal cumulative enumeration approach, which I think would yield significant performance benefits. Exposing these benefits through IList seems warranted to me.
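For context, one such short-circuiting technique is the fast-doubling method; a minimal sketch of my own (not the code from the linked answers):
using System.Numerics;

public static class FibonacciFast
{
    // Fast-doubling identities:
    //   F(2k)   = F(k) * (2*F(k+1) - F(k))
    //   F(2k+1) = F(k)^2 + F(k+1)^2
    // Returns F(n) without enumerating all earlier values.
    public static BigInteger At(int n)
    {
        var (f, _) = Pair(n);
        return f;
    }

    // Returns the pair (F(n), F(n+1)).
    private static (BigInteger, BigInteger) Pair(int n)
    {
        if (n == 0) return (BigInteger.Zero, BigInteger.One);
        var (a, b) = Pair(n / 2);          // a = F(k), b = F(k+1), where k = n/2
        var c = a * (2 * b - a);           // F(2k)
        var d = a * a + b * b;             // F(2k+1)
        return n % 2 == 0 ? (c, d) : (d, c + d);
    }
}
This runs in O(log n) multiplications, which is what makes an indexer on the series plausible in the first place.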
Ideally, .NET would support some other, more appropriate parent interface, somewhat closer to IEnumerable<>, but until that arrives in some future version, this has got to be a sensible approach.
I'm working on an implementation of IList<BigInteger> to illustrate this.
Summarising what I've seen so far:
You can fulfil 5 out of 6, throwing a NotSupportedException on Count()
I would have said this is probably good enough to go for it; however, as servy has pointed out, the indexer is incredibly inefficient for any number that hasn't already been calculated and cached.
In this case, I would say the only contract that fits your continual stream of calculations is IEnumerable.
The other option you have is to create something that looks a lot like an IList but isn't actually.
Related
List<Apple> apples = GetApples();
Console.Write(string.Format("{0}: {1}, {2}", apples[0].Name, apples[0].Color, apples[0].Size));
Is the lookup performed every time I get the object by its index, or does that get compiled away?
Is the following more performant?
List<Apple> apples = GetApples();
var first = apples[0];
Console.Write(string.Format("{0}: {1}, {2}", first.Name, first.Color, first.Size));
Obviously in this trivial example, called once, it wouldn't be an issue worth worrying about. I'm curious because the actual code I'm writing will be called many times in a loop.
It depends on how the collection is implemented. List<T> is backed by an array so it has array-like access.
I believe several of the non-deferred LINQ operations have shortcuts for certain collections, so calling .First() would have the same effect as apples[0].
I can't speak to how the C# compiler lowers the call, but I would suspect that it calls the indexer every time, to preserve any (unfortunate) side effects. If there aren't any side effects, I doubt caching the result would improve performance much. CPUs are really effective at branch prediction and caching.
The source code for List<T> is here. You can see where the "indexer" is defined starting where it says public T this[int index] (currently on line 145). You can see that it does a check to see if the specified index is out of range, and then simply returns the value from the underlying array.
So the two pieces of code are not exactly equivalent. That indexer block is executed each time you call apples[0]. In the case of List<T>, it's not a huge performance difference between the two - doing a quick integer comparison (the range check) three times is not terribly expensive.
HOWEVER, other classes besides List may not be so efficient with their indexers, so it's best to know what kind of performance you're dealing with if it matters in your situation.
I just realized that maybe I have been mistaken all along in exposing T[] to my views instead of IEnumerable<T>.
Usually, for this kind of code:
foreach (var item in items) {}
should items be a T[] or an IEnumerable<T>?
Then, if I need to get the count of the items, would the array's Length be faster than IEnumerable<T>.Count()?
IEnumerable<T> is generally a better choice here, for the reasons listed elsewhere. However, I want to bring up one point about Count(). Quintin is incorrect when he says that the type itself implements Count(). It's actually implemented in Enumerable.Count() as an extension method, which means other types don't get to override it to provide more efficient implementations.
By default, Count() has to iterate over the whole sequence to count the items. However, it does know about ICollection<T> and ICollection, and is optimised for those cases. (In .NET 3.5 IIRC it's only optimised for ICollection<T>.) Now the array does implement that, so Enumerable.Count() defers to ICollection<T>.Count and avoids iterating over the whole sequence. It's still going to be slightly slower than calling Length directly, because Count() has to discover that it implements ICollection<T> to start with - but at least it's still O(1).
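A simplified sketch of the shape of that optimization (not the actual framework source; the name CountItems is mine, to avoid clashing with Enumerable.Count):
using System.Collections.Generic;

public static class CountSketch
{
    public static int CountItems<TSource>(this IEnumerable<TSource> source)
    {
        if (source is ICollection<TSource> collection)
            return collection.Count;       // O(1): arrays, List<T>, HashSet<T>, ...

        int count = 0;
        using (var e = source.GetEnumerator())
        {
            checked
            {
                while (e.MoveNext())
                    count++;               // O(n): walk the whole sequence
            }
        }
        return count;
    }
}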
The same kind of thing is true for performance in general: the JITted code may well be somewhat tighter when iterating over an array rather than a general sequence. You'd basically be giving the JIT more information to play with, and even the C# compiler itself treats arrays differently for iteration (using the indexer directly).
However, these performance differences are going to be inconsequential for most applications - I'd definitely go with the more general interface until I had good reason not to.
It's partially inconsequential, but standard theory would dictate "Program against an interface, not an implementation". With the interface model you can change the actual datatype being passed without affecting the caller, as long as it conforms to the same interface.
The contrast to that is that you might have a reason for exposing an array specifically and in which case would want to express that.
For your example I think IEnumerable<T> would be desirable. It's also worth noting that, for testing purposes, using an interface can reduce the amount of headache you would incur if you had particular classes you had to re-create all the time; collections aren't as bad generally, but having an interface contract you can mock easily is very nice.
Added for edit:
This is mostly inconsequential, because the underlying datatype is what will implement the Count() method; for an array it should access the known length, so I would not worry about any perceived overhead of the method.
See Jon Skeet's answer for an explanation of the Count() implementation.
T[] (single-dimensional, zero-based) also implements ICollection<T> and IList<T> as well as IEnumerable<T>.
Therefore, if you want looser coupling in your application, IEnumerable<T> is preferable - unless you need indexed access inside the foreach.
Since the Array class implements the System.Collections.Generic.IList<T>, System.Collections.Generic.ICollection<T>, and System.Collections.Generic.IEnumerable<T> generic interfaces, I would use IEnumerable, unless you need to use those other interfaces.
http://msdn.microsoft.com/en-us/library/system.array.aspx
Your gut feeling is correct, if all the view cares about, or should care about, is having an enumerable, that's all it should demand in its interfaces.
What is it logically (conceptually) from the outside?
If it's an array, then return the array. If the only point is to enumerate, then return IEnumerable. Otherwise IList or ICollection may be the way to go.
If you want to offer lots of functionality but not allow the collection to be modified, then perhaps use a List<T> internally and return the ReadOnlyCollection<T> obtained from its .AsReadOnly() method.
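For example (illustrative sketch; the Basket type and its contents are hypothetical):
using System.Collections.Generic;
using System.Collections.ObjectModel;

public class Basket
{
    // Mutate the list internally, but hand out only a read-only wrapper.
    private readonly List<string> _apples = new List<string> { "Fuji", "Gala" };

    public ReadOnlyCollection<string> Apples
    {
        get { return _apples.AsReadOnly(); }
    }
}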
Given that changing the code from an array to IEnumerable at a later date is easy, but changing it the other way is not, I would go with an IEnumerable until you know you need the small speed benefit of returning an array.
I'm trying to understand the difference between sequences and lists.
In F# there is a clear distinction between the two. However in C# I have seen programmers refer to IEnumerable collections as a sequence. Is what makes IEnumerable a sequence the fact that it returns an object to iterate through the collection?
Perhaps the real distinction is purely found in functional languages?
Not really - you tend to have random access to a list, as well as being able to get its count quickly etc. Admittedly linked lists don't have the random access nature... but then they don't implement IList<T>. There's a grey area between the facilities provided by a particular platform and the general concepts.
Sequences (as represented by IEnumerable<T>) are read-only, forward-only, one item at a time, and potentially infinite. Of course any one implementation of a sequence may also be a list (e.g. List<T>) but when you're treating it as a sequence, you can basically iterate over it (repeatedly) and that's it.
I think that the confusion may arise from the fact that collections like List<T> implement the interface IEnumerable<T>. If you have a subtype relationship in general (e.g. supertype Shape with two subtypes Rectangle and Circle), you can interpret the relation as an "is-a" hierarchy.
This means that it is perfectly fine to say that "Circle is a Shape" and similarly, people would say that "List<T> is an IEnumerable<T>" that is, "list is a sequence". This makes some sense, because a list is a special type of a sequence. In general, sequences can be also lazily generated and infinite (and these types cannot also be lists). An example of a (perfectly valid) sequence that cannot be generated by a list would look like this:
// C# version
IEnumerable<int> Numbers() {
    int i = 0;
    while (true) yield return i++;
}

// F# version
let rec loop n = seq {
    yield n
    yield! loop(n + 1) }
let numbers = loop(0)
This would be also true for F#, because F# list type also implements IEnumerable<T>, but functional programming doesn't put that strong emphasis on object oriented point of view (and implicit conversions that enable the "is a" interpretation are used less frequently in F#).
Sequence content is calculated on demand, so you can, for example, implement an infinite sequence without exhausting your memory.
So in C# you can write a sequence, for example
IEnumerable<int> Null() {
    while (true)
        yield return 0;
}
It will return an infinite sequence of zeros.
You can write
int[] array = Null().Take(10).ToArray();
And it will take only 10*4 bytes of memory for the array, even though the sequence is infinite.
So as you see, C# does have a distinction between a sequence and a collection.
I'm writing a cache-eject method that essentially looks like this:
while (myHashSet.Count > MAX_ALLOWED_CACHE_MEMBERS)
{
    EjectOldestItem(myHashSet);
}
My question is about how Count is determined: is it just a private or protected int, or is it calculated by counting the elements each time it's called?
From http://msdn.microsoft.com/en-us/library/ms132433.aspx:
Retrieving the value of this property is an O(1) operation.
This guarantees that accessing the Count won't iterate over the whole collection.
Edit: as many other posters suggested, IEnumerable<...>.Count() is however not guaranteed to be O(1). Use with care!
IEnumerable<...>.Count() is an extension method defined in System.Linq.Enumerable. The current implementation explicitly tests whether the counted IEnumerable<T> is in fact an instance of ICollection<T>, and uses ICollection<T>.Count if possible. Otherwise it traverses the IEnumerable<T> (possibly forcing lazy evaluation to expand) and counts items one by one.
I have not, however, found in the documentation whether IEnumerable<...>.Count() is guaranteed to use O(1) where possible; I only checked the implementation in .NET 3.5 with Reflector.
Necessary late addition: many popular containers are not derived from Collection<T>, but nevertheless their Count property is O(1) (that is, won't iterate over the whole collection). Examples are HashSet<T>.Count (this one is most likely what the OP wanted to ask about), Dictionary<K, V>.Count, LinkedList<T>.Count, List<T>.Count, Queue<T>.Count, Stack<T>.Count and so on.
All these collections implement ICollection<T> or just ICollection, so their Count is an implementation of ICollection<T>.Count (or ICollection.Count). It's not required for an implementation of ICollection<T>.Count to be an O(1) operation, but the ones mentioned above are doing that way, according to the documentation.
(Note aside: some containers, for instance Queue<T>, implement the non-generic ICollection but not ICollection<T>, so they "inherit" the Count property only from ICollection.)
Your question does not specify a specific Collection class so...
It depends on the Collection class. ArrayList has an internal variable that tracks the count, as does List. However, it is implementation specific, and depending on the type of the collection, it could theoretically get recalculated on each call.
It is an internal value, and is not calculated. The documentation states that getting the value is an O(1) operation.
As others have noted, Count is maintained when modifying the collection. This is nearly always the case with every collection type in the framework. This is considerably different than using the Count extension method on an IEnumerable which will enumerate the collection each time.
Also, with the newer collection classes the Count property is not virtual which means that the jitter can inline the call to the Count accessor which makes it practically the same as accessing a field. In other words, very quick.
In the case of a HashSet it's just an internal int field, and even SortedSet (a binary-tree-based set in .NET 4) keeps its count in an internal field.
According to Reflector, it is implemented as
public int Count { get; }
so it is defined by the derived type
Just a quick note. Beware that there are two ways to count a collection in .NET 3.5 when System.Linq is used. For a normal collection, the first choice should be to use the Count property, for the reasons already described in other answers.
An alternative method, the LINQ .Count() extension method, is also available. The intriguing thing about .Count() is that it can be called on ANY enumerable, regardless of whether the underlying class implements ICollection or has a Count property. If you ever do call .Count(), however, be aware that it will iterate over the collection to dynamically generate a count (unless the sequence implements ICollection, as noted above). That generally results in O(n) complexity.
The only reason I wanted to note this is that, with IntelliSense, it is easy to accidentally end up using the Count() extension method rather than the Count property.
It's an internal int that gets incremented each time a new item is added to the collection.
I have been thinking about the IEnumerator.Reset() method. I read in the MSDN documentation that it is only there for COM interop. As a C++ programmer, it looks to me like an IEnumerator which supports Reset is what I would call a forward iterator, while an IEnumerator which does not support Reset is really an input iterator.
So part one of my question is, is this understanding correct?
The second part of my question is, would it be of any benefit in C# if there was a distinction made between input iterators and forward iterators (or "enumerators" if you prefer)? Would it not help eliminate some confusion among programmers, like the one found in this SO question about cloning iterators?
EDIT: Clarification on forward and input iterators. An input iterator guarantees only that you can enumerate the members of a collection (or of a generator function or an input stream) once. This is exactly how IEnumerator works in C#. Whether or not you can enumerate a second time is determined by whether or not Reset is supported. A forward iterator does not have this restriction: you can enumerate over the members as often as you want.
Some C# programmers don't understand why an IEnumerator cannot be reliably used in a multi-pass algorithm. Consider the following case:
void PrintContents(IEnumerator<int> iter)
{
    while (iter.MoveNext())
        Console.WriteLine(iter.Current);
    iter.Reset();
    while (iter.MoveNext())
        Console.WriteLine(iter.Current);
}
If we call PrintContents in this context, no problem:
List<int> ys = new List<int>() { 1, 2, 3 };
PrintContents(ys.GetEnumerator());
However look at the following:
IEnumerable<int> GenerateInts() {
    System.Random rnd = new System.Random();
    for (int i = 0; i < 10; ++i)
        yield return rnd.Next();
}
PrintContents(GenerateInts().GetEnumerator());
If the IEnumerator supported Reset - in other words, supported multi-pass algorithms - then each time you iterated over the sequence it would be different. That would be undesirable, because it would be surprising behavior. This example is a bit contrived, but it does occur in the real world (e.g. reading from file streams).
Reset was a big mistake. I call shenanigans on Reset. In my opinion, the correct way to reflect the distinction you are making between "forward iterators" and "input iterators" in the .NET type system is with the distinction between IEnumerable<T> and IEnumerator<T>.
See also this answer, where Microsoft's Eric Lippert (in an unofficial capacity, no doubt; my point is only that he's someone with more credentials than I have to claim that this was a design mistake) makes a similar point in the comments. Also see his awesome blog.
Interesting question. My take is that of course C# would benefit. However, it wouldn't be easy to add.
The distinction exists in C++ because of its much more flexible type system. In C#, you don't have a robust generic way to clone objects, which is necessary to represent forward iterators (to support multi-pass iteration). And of course, for this to be really useful, you'd also need to support bidirectional and random-access iterators/enumerators. And to get them all working smoothly, you really need some form of duck-typing, like C++ templates have.
Ultimately, the scopes of the two concepts are different.
In C++, iterators are supposed to represent everything you need to know about a range of values. Given a pair of iterators, I don't need the original container. I can sort, I can search, I can manipulate and copy elements as much as I like. The original container is out of the picture.
In C#, enumerators are not meant to do quite as much. Ultimately, they're just designed to let you run through the sequence in a linear manner.
As for Reset(), it is widely accepted that it was a mistake to add it in the first place. If it had worked, and been implemented correctly, then yes, you could say your enumerator was analogous to forward iterators, but in general, it's best to ignore it as a mistake. And then all enumerators are similar only to input iterators.
Unfortunately.
Coming from the C# perspective:
You almost never use IEnumerator directly. Usually you use a foreach statement, which expects an IEnumerable.
IEnumerable _myCollection;
...
foreach (var item in _myCollection) { /* Do something */ }
You don't pass around IEnumerator either. If you want to pass a collection which needs iteration, you pass IEnumerable. Since IEnumerable has a single function, which returns an IEnumerator, it can be used to iterate the collection multiple times (multiple passes).
There's no need for a Reset() function on IEnumerator because if you want to start over, you just throw away the old one (garbage collected) and get a new one.
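A small illustration of that pattern (hypothetical example):
using System;
using System.Collections.Generic;

class MultiPassDemo
{
    // Passing IEnumerable<T> lets the callee make multiple passes by
    // asking for a fresh enumerator each time, so no Reset() is needed.
    static void PrintTwice(IEnumerable<int> xs)
    {
        foreach (var x in xs) Console.WriteLine(x);   // first pass
        foreach (var x in xs) Console.WriteLine(x);   // second pass, new enumerator
    }

    static void Main()
    {
        PrintTwice(new List<int> { 1, 2, 3 });
    }
}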
The .NET Framework would benefit immensely if there were a means of asking an IEnumerator<T> what abilities it can support and what promises it can make. Such features would also be helpful in IEnumerable<T>, but being able to ask the questions of an enumerator would allow code that receives an enumerator from wrappers like ReadOnlyCollection to use the underlying collection in improved ways without having to involve the wrapper.
Given any enumerator for a collection that is capable of being enumerated in its entirety and isn't too big, one could produce from it an IEnumerable<T> that would always yield the same sequence of items (specifically the set of items remaining in the enumerator): read its entire content into an array, dispose and discard the enumerator, get an enumerator from the array (using that in place of the original, abandoned enumerator), wrap the array in a ReadOnlyCollection<T>, and return that. Although such an approach would work with any kind of enumerable collection meeting the above criteria, it would be horribly inefficient with most of them. Having a means of asking an enumerator to yield its remaining contents as an immutable IEnumerable<T> would allow many kinds of enumerators to perform the indicated action much more efficiently.
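A rough sketch of the approach just described (a hypothetical helper; it only makes sense when the remaining items are finite and reasonably few):
using System.Collections.Generic;
using System.Collections.ObjectModel;

public static class EnumeratorSnapshot
{
    // Drain the remaining items into an array, dispose the enumerator,
    // and hand back an immutable snapshot of what was left.
    public static IEnumerable<T> RemainingAsSnapshot<T>(IEnumerator<T> enumerator)
    {
        var items = new List<T>();
        using (enumerator)
        {
            while (enumerator.MoveNext())
                items.Add(enumerator.Current);
        }
        return new ReadOnlyCollection<T>(items.ToArray());
    }
}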
I don't think so. I would call IEnumerable both a forward iterator and an input iterator. It does not allow you to go backwards, or modify the underlying collection. With the addition of the foreach keyword, iterators are almost a non-thought most of the time.
Opinion:
The difference between input iterators (get each one) and output iterators (do something to each one) is too trivial to justify an addition to the framework. Also, in order to do an output iterator, you would need to pass a delegate to the iterator. The input iterator seems more natural to C# programmers.
There's also IList<T> if the programmer wants random access.