Why is there a List<T>.BinarySearch(...)? - c#

I'm looking at List and I see a BinarySearch method with a few overloads, and I can't help wondering if it makes sense at all to have a method like that in List?
Why would I want to do a binary search unless the list was sorted? And if the list wasn't sorted, calling the method would just be a waste of CPU time. What's the point of having that method on List?

I note in addition to the other correct answers that binary search is surprisingly hard to write correctly. There are lots of corner cases and some tricky integer arithmetic. Since binary search is obviously a common operation on sorted lists, the BCL team did the world a service by writing the binary search algorithm correctly once rather than encouraging customers to all write their own binary search algorithm; a significant number of those customer-authored algorithms would be wrong.

Sorting and searching are two very common operations on lists. It would be unfriendly to limit a developer's options by not offering binary search on a regular list.
Library design requires compromises - the .NET designers chose to offer the binary search function on both arrays and lists in C# because they likely felt (as I do) that these are useful and common operations, and programmers who choose to use them understand their prerequisites (namely that the list is ordered) before calling them.
It's easy enough to sort a List<T> using one of the Sort() overloads. If you feel that you need an invariant that gaurantees sorting, you can always use SortedList<TKey,TValue> or SortedSet<T> instead.

BinarySearch only makes sense on a List<T> that is sorted, just like IList<T>.Add only makes sense for an IList<T> with IsReadOnly = false. It's messy, but it's just something to deal with: sometimes functionality X depends on criterion Y. The fact that Y isn't always true doesn't make X useless.
Now, in my opinion, it's frustrating that .NET doesn't have general Sort and BinarySearch methods for any IList<T> implementation (e.g., as extension methods). If it did, we could easily sort and search for items within any non-read-only collection providing random access.
Then again, you can always write your own (or copy someone else's).

Others have pointed out that BinarySearch is quite useful on a sorted List<T>. It doesn't really belong on List<T>, though, as anyone with C++ STL experience would immediately recognize.
With recent C# language developments, it makes more sense to define the notion of a sorted list (e.g., ISortedList<T> : IList<T>) and define BinarySearch (et. al.) as extension methods of that interface. This is a cleaner, more orthogonal type of design.
I've started doing just that as part of the Nito.Linq library. I expect the first stable release to be in a couple of months.

yes but List has Sort() method as well so you can call it before BinarySearch.

Searching and sorting are algorithmic primitives. It's helpful for the standard library to have fast reliable implementations. Otherwise, developers waste time reinventing the wheel.
However, in the case of the .NET Framework, it's unfortunate that the specific choices of algorithms happens to make them less useful than they might be. In some cases, their behaviour is not defined:
List<T>.BinarySearch If the List contains more than one element with the same value, the method returns only one of the occurrences, and it might return any one of the occurrences, not necessarily the first one.
List<T> This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal.
That's a shame, because there are deterministic algorithms that are just as fast, and these would be much more useful as building blocks. It's noteworthy that the binary search algorithms in Python, Ruby and Go all find the first matching element.

I agree it's completely dumb to Call BinarySearch on an unsorted list, but it's perfect if you know your large list is sorted.
I've used it when checking if items from a stream exist in a (more or less) static list of 100,000 items or more.
Binary Searching the list is ORDERS of magnitude faster than doing a list.Find, which is many orders of magnitude faster than a database look up.
I makes sense, and I'm glad it there (not that it would be rocket science to implement it if it wasn't).

Perhaps another point is that an array could be equally as unsorted. So in theory, having a BinarySearch on an array could be invalid too.
However, as with all features of a higher level language, they need to be applied by someone with reason and understanding of the data, or they will fail. Sure, some cross-checks could be applied, and we could have a flag that said "IsSorted" and it would fail on binary search otherwise, but .....

Some pseudo code:
if List is sorted
use the BinarySearch method
else if List is not sorted and you think sorting it is "waste of CPU time"
use a different algorithm that is more suitable and efficient

Related

Must IList be finite?

Must .NET's IList be finite? Suppose I write a class FibonacciList implementing IList<BigInteger>
The property Item[n] returns the nth Fibonacci number.
The property IsReadOnly returns true.
The methods IndexOf and Contains we can implement easily enough because the Fibonacci sequence is increasing - to test if the number m is Fibonacci, we need only to compute the finite sequence of Fibonacci numbers up to m.
The method GetEnumerator() doing the right thing
We've now implemented all the methods expected of read-only ILists except Count().
Is this cool, or an abuse of IList?
Fibonacci numbers get impractically big quickly (hence IList<BigInteger> above) . A bounded infinite sequence might be more sensible, it could implement IList<long> or IList<double>.
Addendum II: Fibonacci sequence may have been a bad example, because computing distant values is expensive - to find the nth value one has to compute all earlier values. Thus as Mošmondor said, one might as well make it an IEnumerable and use .ElementAt. However there exist other sequences where one can compute distant values quickly without computing earlier values. (Surprisingly the digits of pi are such a sequence). These sequences are more 'listy', they truly support random access.
Edit: No-one argues against infinite IEnumerables. How do they handle Count()?
To most developers, IList and ICollection imply that you have a pre-evaluated, in-memory collection to work with. With IList specifically, there is an implicit contract of constant-time Add* and indexing operations. This is why LinkedList<T> does not implement IList<T>. I would consider a FibonacciList to be a violation of this implied contract.
Note the following paragraph from a recent MSDN Magazine article discussing the reasons for adding read-only collection interfaces to .NET 4.5:
IEnumerable<T> is sufficient for most scenarios that deal with collections of types, but sometimes you need more power than it provides:
Materialization: IEnumerable<T> does not allow you to express whether the collection is already available (“materialized”) or whether it’s computed every time you iterate over it (for example, if it represents a LINQ query). When an algorithm requires multiple iterations over the collection, this can result in performance degradation if computing the sequence is expensive; it can also cause subtle bugs because of identity mismatches when objects are being generated again on subsequent passes.
As others have pointed out, there is also the question of what you would return for .Count.
It's perfectly fine to use IEnumerable or IQueryable in for such collections of data, because there is an expectation that these types can be lazily evaluated.
Regarding Edit 1: .Count() is not implemented by the IEnumerable<T> interface: it is an extension method. As such, developers need to expect that it can take any amount of time, and they need to avoid calling it in cases where they don't actually need to know the number of items. For example, if you just want to know whether an IEnumerable<T> has any items, it's better to use .Any(). If you know that there's a maximum number of items you want to deal with, you can use .Take(). If a collection has more than int.MaxValue items in it, .Count() will encounter an operation overflow. So there are some workarounds that can help to reduce the danger associated with infinite sequences. Obviously if programmers haven't taken these possibilities into account, it can still cause problems, though.
Regarding Edit 2: If you're planning to implement your sequence in a way that indexing is constant-time, that addresses my main point pretty handily. Sixlettervariables's answer still holds true, though.
*Obviously there's more to this: Add is only expected to work if IList.IsFixedSize returns false. Modification is only possible if IsReadOnly returns false, etc. IList was a poorly-thought-out interface in the first place: a fact which may finally be remedied by the introduction of read-only collection interfaces in .NET 4.5.
Update
Having given this some additional thought, I've come to the personal opinion that IEnumerable<>s should not be infinite either. In addition to materializing methods like .ToList(), LINQ has several non-streaming operations like .OrderBy() which must consume the entire IEnumerable<> before the first result can be returned. Since so many methods assume IEnumerable<>s are safe to traverse in their entirety, it would be a violation of the Liskov Substitution Principle to produce an IEnumerable<> that is inherently unsafe to traverse indefinitely.
If you find that your application often requires segments of the Fibonacci sequence as IEnumerables, I'd suggest creating a method with a signature similar to Enumerable.Range(int, int), which allows the user to define a starting and ending index.
If you'd like to embark on a Gee-Whiz project, you could conceivably develop a Fibonacci-based IQueryable<> provider, where users could use a limited subset of LINQ query syntax, like so:
// LINQ to Fibonacci!
var fibQuery = from n in Fibonacci.Numbers // (returns an IQueryable<>)
where n.Index > 5 && n.Value < 20000
select n.Value;
var fibCount = fibQuery.Count();
var fibList = fibQuery.ToList();
Since your query provider would have the power to evaluate the where clauses as lambda expressions, you could have enough control to implement Count methods and .GetEnumerator() in a way as to ensure that the query is restrictive enough to produce a real answer, or throw an exception as soon as the method is called.
But this reeks of being clever, and would probably be a really bad idea for any real-life software.
I would imagine that a conforming implementation must be finite, otherwise what would you return for ICollection<T>.Count?
/// <summary>
/// Gets the number of elements contained in the <see cref="ICollection{T}" />.
/// </summary>
int Count { get; }
Another consideration is CopyTo, which under its normal overload would never stop in a Fibonacci case.
What this means is an appropriate implementation of a Fibonacci Sequence would be simply IEnumerable<int> (using a generator pattern). (Ab)use of an IList<T> would just cause problems.
In your case, I would rather 'violate' IEnumerable and have my way with yield return.
:)
An infinite collection would probably best be implemented as an IEnumerable<T>, not an IList<T>. You could also make use of the yield return syntax when implementing, like so (ignore overflow issues, etc.):
public IEnumerable<long> Fib()
{
yield return 1;
yield return 1;
long l1 = 1;
long l2 = 1;
while (true)
{
long t = l1;
l1 = l2;
l2 = t + l1;
yield return l2;
}
}
As #CodeInChaos pointed out in the comments, the Item property of IList has signature
T this[ int index ] { get; set; }
We see ILists are indexed by ints, so their length is bounded by Int32.MaxValue . Elements of greater index would be inaccessible. This occurred to me when writing the question, but I left it out, because the problem is fun to think about otherwise.
EDIT
Having had a day to reflect on my answer and, in light of #StriplingWarrior's comment. I fear I have to make a reversal. I started trying this out last night and now I wonder what would I really lose by abandoning IList?
I think it would wiser to implement just IEnumerable and, declare a local Count() method that throws a NotSupportedException method to prevent the enumerator running until an OutOfMemoryException occurs. I would still add an IndexOf and Contains method and Item indexer property to expose higher performance alternatives like Binet's Formula but, I'd be free to change the signatures of these members to use extended datatypes potentially, even System.Numerics.BigInteger.
If I were implementing multiple series I would declare an ISeries interface for these members. Who know's, perhaps somthing like this will eventually be part of the framework.
I disagree with what appears to be a consensus view. Whilst IList has many members that cannot be implemented for an infinite series it does have an IsReadOnly member. It seems acceptable, certainly in the case of ReadOnlyCollection<>, to implement the majority of members with a NotSupportedException. Following this precedent, I don't see why this should be unacceptable if it is a side effect of some other gain in function.
In this specific Fibbonaci series case, there are established algortihms, see here and here, for shortcircuiting the normal cumalitive enumeration approach which I think would yield siginifcant performance benefits. Exposing these benefits through IList seems warranted to me.
Ideally, .Net would support some other, more appropriate super class of interface, somewhat closer to IEnumerable<> but, until that arrives in some future version, this has got to be a sensible approach.
I'm working on an implementation of IList<BigInteger> to illustrate
Summarising what I've seen so far:
You can fulfil 5 out of 6, throwing a NotSupportedException on Count()
I would have said this is probably good enough to go for it, however as servy has pointed out, the indexer is incredibly inefficient for any non-calculated and cached number.
In this case, I would say the only contract that fits your continual stream of calculations is IEnumerable.
The other option you have is to create something that looks a lot like an IList but isn't actually.

what is difference between string array and list of string in c#

I hear on MSDN that an array is faster than a collection.
Can you tell me how string[] is faster then List<string>.
Arrays are a lower level abstraction than collections such as lists. The CLR knows about arrays directly, so there's slightly less work involved in iterating, accessing etc.
However, this should almost never dictate which you actually use. The performance difference will be negligible in most real-world applications. I rarely find it appropriate to use arrays rather than the various generic collection classes, and indeed some consider arrays somewhat harmful. One significant downside is that there's no such thing as an immutable array (other than an empty one)... whereas you can expose read-only collections through an API relatively easily.
The article is from 2004, that means it's about .net 1.1 and there was no generics.
Array vs collection performance actually was a problem back then because collection types caused a lot of exta boxing-unboxing operations. But since .net 2.0, where generics was introduced, difference in performance almost gone.
An array is not resizable. This means that when it is created one block of memory is allocated, large enough to hold as many elements as you specify.
A List on the other hand is implicitly resizable. Each time you Add an item, the framework may need to allocate more memory to hold the item you just added. This is an expensive operation, so we end up saying "List is slower than array".
Of course this is a very simplified explanation, but hopefully enough to paint the picture.
An array is the simplest form of collection, so it's faster than other collections. A List (and many other collections) actually uses an array internally to hold its items.
An array is of course also limited by its simplicity. Most notably you can't change the size of an array. If you want a dynamic collection you would use a List.
List<string> is class with a private member that is a string[]. The MSDN documentation states this fact in several places. The List class is basically a wrapper class around an array that gives the array other functionality.
The answer of which is faster all depends on what you are trying to do with the list/array. For accessing and assigning values to elements, the array is probably negligibly faster since the List is an abstraction of the array (as Jon Skeet has said).
If you intend on having a data structure that grows over time (gets more and more elements), performance (ave. speed) wise the List will start to shine. That is because each time you resize an array to add another element it is an O(n) operation. When you add an element to a List (and the list is already at capacity) the list will double itself in size. I won't get into the nitty gritty details, but basically this means that increasing the size of a List is on average a O(log n) operation. Of course this has drawbacks too (you could have almost twice the amount of memory allocated as you really need if you only go a couple items past its last capacity).
Edit: I got a little mixed up in the paragraph above. As Eric has said below, the number of resizes for a List is O(log n), but the actual cost associated with resizing the array is amortized to O(1).

Would C# benefit from distinctions between kinds of enumerators, like C++ iterators?

I have been thinking about the IEnumerator.Reset() method. I read in the MSDN documentation that it only there for COM interop. As a C++ programmer it looks to me like a IEnumerator which supports Reset is what I would call a forward iterator, while an IEnumerator which does not support Reset is really an input iterator.
So part one of my question is, is this understanding correct?
The second part of my question is, would it be of any benefit in C# if there was a distinction made between input iterators and forward iterators (or "enumerators" if you prefer)? Would it not help eliminate some confusion among programmers, like the one found in this SO question about cloning iterators?
EDIT: Clarification on forward and input iterators. An input iterator only guarantees that you can enumerate the members of a collection (or from a generator function or an input stream) only once. This is exactly how IEnumerator works in C#. Whether or not you can enumerate a second time, is determined by whether or not Reset is supported. A forward iterator, does not have this restriction. You can enumerate over the members as often as you want.
Some C# programmers don't underestand why an IEnumerator cannot be reliably used in a multipass algorithm. Consider the following case:
void PrintContents(IEnumerator<int> xs)
{
while (iter.MoveNext())
Console.WriteLine(iter.Current);
iter.Reset();
while (iter.MoveNext())
Console.WriteLine(iter.Current);
}
If we call PrintContents in this context, no problem:
List<int> ys = new List<int>() { 1, 2, 3 }
PrintContents(ys.GetEnumerator());
However look at the following:
IEnumerable<int> GenerateInts() {
System.Random rnd = new System.Random();
for (int i=0; i < 10; ++i)
yield return Rnd.Next();
}
PrintContents(GenerateInts());
If the IEnumerator supported Reset, in other words supported multi-pass algorithms, then each time you iterated over the collection it would be different. This would be undesirable, because it would be surprising behavior. This example is a bit faked, but it does occur in the real world (e.g. reading from file streams).
Reset was a big mistake. I call shenanigans on Reset. In my opinion, the correct way to reflect the distinction you are making between "forward iterators" and "input iterators" in the .NET type system is with the distinction between IEnumerable<T> and IEnumerator<T>.
See also this answer, where Microsoft's Eric Lippert (in an unofficial capactiy, no doubt, my point is only that he's someone with more credentials than I have to make the claim that this was a design mistake) makes a similar point in comments. Also see also his awesome blog.
Interesting question. My take is that of course C# would benefit. However, it wouldn't be easy to add.
The distinction exists in C++ because of its much more flexible type system. In C#, you don't have a robust generic way to clone objects, which is necessary to represent forward iterators (to support multi-pass iteration). And of course, for this to be really useful, you'd also need to support bidirectional and random-access iterators/enumerators. And to get them all working smoothly, you really need some form of duck-typing, like C++ templates have.
Ultimately, the scopes of the two concepts are different.
In C++, iterators are supposed to represent everything you need to know about a range of values. Given a pair of iterators, I don't need the original container. I can sort, I can search, I can manipulate and copy elements as much as I like. The original container is out of the picture.
In C#, enumerators are not meant to do quite as much. Ultimately, they're just designed to let you run through the sequence in a linear manner.
As for Reset(), it is widely accepted that it was a mistake to add it in the first place. If it had worked, and been implemented correctly, then yes, you could say your enumerator was analogous to forward iterators, but in general, it's best to ignore it as a mistake. And then all enumerators are similar only to input iterators.
Unfortunately.
Coming from the C# perspective:
You almost never use IEnumerator directly. Usually you do a foreach statement, which expects a IEnumerable.
IEnumerable _myCollection;
...
foreach (var item in _myCollection) { /* Do something */ }
You don't pass around IEnumerator either. If you want to pass an collection which needs iteration, you pass IEnumerable. Since IEnumerable has a single function, which returns an IEnumerator, it can be used to iterate the collection multiple times (multiple passes).
There's no need for a Reset() function on IEnumerator because if you want to start over, you just throw away the old one (garbage collected) and get a new one.
The .NET framework would benefit immensely if there were a means of asking an IEnumerator<T> about what abilities it could support and what promises it could make. Such features would also be helpful in IEnumerable<T>, but being able to ask the questions of an enumerator would allow code that can receive an enumerator from wrappers like ReadOnlyCollection to use the underlying collection in improve ways without having to involve the wrapper.
Given any enumerator for a collection that is capable of being enumerated in its entirety and isn't too big, one could produce from it an IEnumerable<T> that would always yield the same sequence of items (specifically the set of items remaining in the enumerator) by reading its entire content to an array, disposing and discarding the enumerator, and getting an enumerators from the array (using that in place of the original abandoned enumerator), wrapping the array in a ReadOnlyCollection<T>, and returning that. Although such an approach would work with any kind of enumerable collection meeting the above criteria, it would be horribly inefficient with most of them. Having a means of asking an enumerator to yield its remaining contents in an immutable IEnumerable<T> would allow many kinds of enumerators to perform the indicated action much more efficiently.
I don't think so. I would call IEnumerable a forward iterator, and an input iterator. It does not allow you to go backwards, or modify the underlying collection. With the addition of the foreach keyword, iterators are almost a non-thought most of the time.
Opinion:
The difference between input iterators (get each one) vs. output iterators (do something to each one) is too trivial to justify an addition to the framework. Also, in order to do an output iterator, you would need to pass a delegate to the iterator. The input iterator seems more natural to C# programmers.
There's also IList<T> if the programmer wants random access.

C#: Is it possible to return an IOrderedEnumerable<T>?

Is it possible to return an IOrderedEnumerable<T> from a method without using the OrderBy or OrderByDescending methods on an IEnumerable<T>?
Im guessing perhaps not... but... maybe I am wrong?
Reason: Mostly curiosity. It just kind of hit me when making this answer on returning digits in a number. And my method would return the digits in ascending order, according to their weight in the given number. So I thought it could be nice if they came out ordered in a way that the framework would recognize as ordered as well. Of course we can argue that from a common number point of view they were not ordered. But yeah... if they should or shouldn't wasn't the point here. Just if it was possible or not.
I guess I am also kind of implying a question (or am now at least) on if an IOrderedEnumerable<T> is more than just an IEnumerable<T> with a different name. Does it contain anything more? i know it has the ThenBy and ThenByDescending methods, but do they use anything inside the IOrderedEnumerable<T>, or is it just that they don't make sense to use directly on an IEnumerable<T>?
Well, you can implement IOrderedEnumerable<T> yourself if you want too... it's not hugely hard, depending on what you need to do. The easiest option is just to call OrderBy/OrderByDescending though :)
Why are you interested?

ArrayList versus an array of objects versus Collection of T

I have a class Customer (with typical customer properties) and I need to pass around, and databind, a "chunk" of Customer instances. Currently I'm using an array of Customer, but I've also used Collection of T (and List of T before I knew about Collection of T). I'd like the thinnest way to pass this chunk around using C# and .NET 3.5.
Currently, the array of Customer is working just fine for me. It data binds well and seems to be as lightweight as it gets. I don't need the stuff List of T offers and Collection of T still seems like overkill. The array does require that I know ahead of time how many Customers I'm adding to the chunk, but I always know that in advance (given rows in a page, for example).
Am I missing something fundamental or is the array of Customer OK? Is there a tradeoff I'm missing?
Also, I'm assuming that Collection of T makes the old loosely-typed ArrayList obsolete. Am I right there?
Yes, Collection<T> (or List<T> more commonly) makes ArrayList pretty much obsolete. In particular, I believe ArrayList isn't even supported in Silverlight 2.
Arrays are okay in some cases, but should be considered somewhat harmful - they have various disadvantages. (They're at the heart of the implementation of most collections, of course...) I'd go into more details, but Eric Lippert does it so much better than I ever could in the article referenced by the link. I would summarise it here, but that's quite hard to do. It really is worth just reading the whole post.
No one has mentioned the Framework Guidelines advice: Don't use List<T> in public API's:
We don’t recommend using List in
public APIs for two reasons.
List<T> is not designed to be extended. i.e. you cannot override any
members. This for example means that
an object returning List<T> from a
property won’t be able to get notified
when the collection is modified.
Collection<T> lets you overrides
SetItem protected member to get
“notified” when a new items is added
or an existing item is changed.
List has lots of members that are not relevant in many scenarios. We
say that List<T> is too “busy” for
public object models. Imagine
ListView.Items property returning
List<T> with all its richness. Now,
look at the actual ListView.Items
return type; it’s way simpler and
similar to Collection<T> or
ReadOnlyCollection<T>
Also, if your goal is two-way Databinding, have a look at BindingList<T> (with the caveat that it is not sortable 'out of the box'!)
Generally, you should 'pass around' IEnumerable<T> or ICollection<T> (depending on whether it makes sense for your consumer to add items).
If you have an immutable list of customers, that is... your list of customers will not change, it's relatively small, and you will always iterate over it first to last and you don't need to add to the list or remove from it, then an array is probably just fine.
If you're unsure, however, then your best bet is a collection of some type. What collection you choose depends on the operations you wish to perform on it. Collections are all about inserts, manipulations, lookups, and deletes. If you do frequent frequent searches for a given element, then a dictionary may be best. If you need to sort the data, then perhaps a SortedList will work better.
I wouldn't worry about "lightweight", unless you're talking a massive number of elements, and even then the advantages of O(1) lookups outweigh the costs of resources.
When you "pass around" a collection, you're only passing a reference, which is basically a pointer. So there is no performance difference between passing a collection and an array.
I'm going to put in a dissenting argument to both Jon and Eric Lippert )which means that you should be very weary of my answer, indeed!).
The heart of Eric Lippert's arguments against arrays is that the contents are immutable, while the data structure itself is not. With regards to returning them from methods, the contents of a List are just as mutable. In fact, because you can add or subtract elements from a List, I would argue that this makes the return value more mutable than an array.
The other reason I'm fond of Arrays is because sometime back I had a small section of performance critical code, so I benchmarked the performance characteristics of the two, and arrays blew Lists out of the water. Now, let me caveat this by saying it was a narrow test for how I was going to use them in a specific situation, and it goes against what I understand of both, but the numbers were wildly different.
Anyway, listen to Jon and Eric =), and I agree that List almost always makes more sense.
I agree with Alun, with one addition. If you may want to address the return value by subscript myArray[n], then use an IList.
An Array inherently supports IList (as well as IEnumerable and ICollection, for that matter). So if you pass by interface, you can still use the array as your underlying data structure. In this way, the methods that you are passing the array into don't have to "know" that the underlying datastructure is an array:
public void Test()
{
IList<Item> test = MyMethod();
}
public IList<Item> MyMethod()
{
Item[] items = new Item[] {new Item()};
return items;
}

Categories