Calling Distinct<>() on HashSet<T> - c#

I'm just curious: when I call Distinct<>() (from LINQ) on a HashSet<T>, does .NET know that this IEnumerable always contains a distinct set of values, and optimize the call away?

Judging by looking at the code through Reflector, I would have to say no.
The code ends up constructing an instance of an iterator-method-generated class regardless of what type you give it.
This problem is compounded by the fact that you can specify comparer objects for both the HashSet and the Distinct method, which means the optimization would only be usable in very few cases.
For instance, in the following case it could actually optimize the call away, but it wouldn't be able to know that:
var set = new HashSet<int>(new MyOwnInt32Comparer());
var distinct = set.Distinct(new MyOwnInt32Comparer());
Since I give it two instances of the comparer class, and such classes usually don't implement equality methods, the Distinct method would have no way of knowing that the two comparer implementations are actually identical.
In any case, this is a case where the programmer knows more about the code than the runtime, so take advantage of it. Linq may be very good but it's not omnipotent, so use your knowledge to your advantage.
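If you know the source is a HashSet<T> that uses the default comparer, you can skip the call yourself. The following is only a minimal sketch of that idea: DistinctOrSelf is a hypothetical helper name (not part of LINQ), and it short-circuits only in the simple default-comparer case.

```csharp
using System.Collections.Generic;
using System.Linq;

static class DistinctSketch
{
    // Hypothetical helper, not part of LINQ: returns the set itself when it is
    // already known to be distinct under the default comparer.
    public static IEnumerable<T> DistinctOrSelf<T>(this IEnumerable<T> source)
    {
        if (source is HashSet<T> set &&
            object.Equals(set.Comparer, EqualityComparer<T>.Default))
        {
            return set; // already distinct, no extra iterator or second set
        }

        // Anything else (including sets with custom comparers) takes the normal path.
        return source.Distinct();
    }
}
```

A set built with a custom comparer still goes through Distinct(), for exactly the reason given above: two comparer instances cannot be proven equivalent in general.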

I think not, because the Distinct method on the Enumerable class takes an IEnumerable<T> as input, and nothing in it checks whether that input is a hash set (so it does no special-casing).

No. Looking at the implementation in Reflector, it doesn't check whether the enumeration is a HashSet<T>. The underlying iterator creates a new set and fills it during enumeration, so the overhead shouldn't be that large.

Related

IEnumerable to IReadOnlyCollection

I have an IEnumerable<Object> and need to pass it as a parameter to a method, but this method takes an IReadOnlyCollection<Object>.
Is it possible to convert IEnumerable<Object> to IReadOnlyCollection<Object>?
One way would be to construct a list, and call AsReadOnly() on it:
IReadOnlyCollection<Object> rdOnly = orig.ToList().AsReadOnly();
This produces ReadOnlyCollection<object>, which implements IReadOnlyCollection<Object>.
Note: Since List<T> implements IReadOnlyCollection<T> as well, the call to AsReadOnly() is optional. Although it is possible to call your method with the result of ToList(), I prefer using AsReadOnly(), so that the readers of my code would see that the method that I am calling has no intention to modify my list. Of course they could find out the same thing by looking at the signature of the method that I am calling, but it is nice to be explicit about it.
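To make the difference between the two options concrete, here is a small sketch. The class and method names (ReadOnlySketch, AsPlainList, AsWrapped, CanBeCastToList) are made up for illustration. It shows that the result of ToList() can still be cast back to List<T> by the receiver, while the AsReadOnly() wrapper cannot.

```csharp
using System.Collections.Generic;
using System.Linq;

static class ReadOnlySketch
{
    // Option 1: rely on List<T> implementing IReadOnlyCollection<T>.
    public static IReadOnlyCollection<T> AsPlainList<T>(IEnumerable<T> source)
        => source.ToList();

    // Option 2: wrap the list in a ReadOnlyCollection<T>.
    public static IReadOnlyCollection<T> AsWrapped<T>(IEnumerable<T> source)
        => source.ToList().AsReadOnly();

    // True when a receiver could cast the value back to a mutable List<T>.
    public static bool CanBeCastToList<T>(IReadOnlyCollection<T> items)
        => items is List<T>;
}
```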
Since the other answers seem to steer in the direction of wrapping the collections in a truly read-only type, let me add this.
I have rarely, if ever, seen a situation where the caller is so scared that an IEnumerable<T>-taking method might maliciously try to cast that IEnumerable<T> back to a List or other mutable type, and start mutating it. Cue organ music and evil laughter!
No. If the code you are working with is even remotely reasonable, then if it asks for a type that only has read functionality (IEnumerable<T>, IReadOnlyCollection<T>...), it will only read.
Use ToList() and be done with it.
As a side note, if you are creating the method in question, it is generally best to ask for no more than an IEnumerable<T>, indicating that you "just want a bunch of items to read". Whether or not you need its Count or need to enumerate it multiple times is an implementation detail, and is certainly prone to change. If you need multiple enumeration, simply do this:
items = items as IReadOnlyCollection<T> ?? items.ToList(); // Avoid multiple enumeration
This keeps the responsibility where it belongs (as locally as possible) and the method signature clean.
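As a rough illustration of that pattern inside a complete method, consider the sketch below. CountAboveMean is a hypothetical example method that needs two passes over its input, so it materializes the sequence once, and only if the caller did not already pass a collection.

```csharp
using System.Collections.Generic;
using System.Linq;

static class MultiPassSketch
{
    // Hypothetical example: counts how many values lie above the mean.
    public static int CountAboveMean(IEnumerable<double> items)
    {
        // Reuse the caller's collection if it already is one; otherwise copy once.
        IReadOnlyCollection<double> snapshot =
            items as IReadOnlyCollection<double> ?? items.ToList();

        if (snapshot.Count == 0)
            return 0;

        double mean = snapshot.Sum() / snapshot.Count; // first pass
        return snapshot.Count(x => x > mean);          // second pass
    }
}
```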
When returning a bunch of items, on the other hand, I prefer to return an IReadOnlyCollection<T>. Why? The goal is to give the caller something that fulfills reasonable expectations - no more, no less. Those expectations are usually that the collection is materialized and that the Count is known - precisely what IReadOnlyCollection<T> provides (and a simple IEnumerable<T> does not). By being no more specific than this, our contract matches expectations, and the method is still free to change the underlying collection. (In contrast, if a method returns a List<T>, it makes me wonder what context there is that I should want to index into the list and mutate it... and the answer is usually "none".)
As an alternative to dasblinkenlight's answer, to prevent the caller casting to List<T>, instead of doing orig.ToList().AsReadOnly(), the following might be better:
ReadOnlyCollection<object> rdOnly = Array.AsReadOnly(orig.ToArray());
It's the same number of method calls, but one takes the other as a parameter instead of being called on the return value.

IEnumerable to List

Why doesn't IEnumerable.ToList() work in code like this:
var _listReleases = new List<string>();
_listReleases.Add("C#");
_listReleases.Add("Javascript");
_listReleases.Add("Python");
IEnumerable sortedItems = _listReleases.OrderBy(x => x);
_listReleases.Clear();
_listReleases.AddRange(sortedItems); // won't work
_listReleases.AddRange(sortedItems.ToList()); // won't work
Note: _listReleases will be null
It doesn't work because of this line:
_listReleases.Clear();
First of all, _listReleases is not null at this point. It's merely empty, which is a completely different thing.
But to explain why this doesn't work as you expect: the IEnumerable interface type does not actually allocate or reserve storage for anything. It represents an object that you can use with a foreach loop, and nothing more. It does not actually need to store the items in the collection itself.
Sometimes, an IEnumerable reference does have those items in the same object, but it doesn't have to. That's what's going on here. The OrderBy() extension method only creates an object that knows how to look at the original list and return the items in a specific order. But this object does not have storage for those items. It still depends on its original data source.
The best solution for this situation is to stop using the _listReleases variable at this point, and instead just use the sortedItems variable. As long as the former is not garbage collected, the latter will do what you need. But if you really want the _listReleases variable, you can do it like this:
_listReleases = sortedItems.ToList();
Now back to IEnumerables. There are some nice benefits to this property of not requiring immediate storage of the items themselves, and merely abstracting the ability to iterate over a collection:
Lazy Evaluation - The work required to produce those items is not done until called for (and often, that means it won't need to be done at all, greatly improving performance).
Composition - An IEnumerable object can be modified during a program to incorporate new sets of rules or operations into the final result. This reduces program complexity and improves maintainability by allowing you to break apart a complex set of sorting or filtering requirements into its component parts. This also makes it much easier to build a program where these rules can be easily determined by the user at run time, instead of in advance by the programmer at compile time.
Memory Efficiency - An IEnumerable makes it possible to iterate collections of data from sources such as a database in ways that only need to keep the current record loaded into memory at any given time. This feature can also be used to create unbounded collections: sets of items that may stretch on to infinity. You can build an IEnumerable with the BigInteger type to calculate the next prime on to infinity, if asked for. Moreover, you could use that collection in a useful way without crashing or hanging the program by combining this with the composition feature, so the program will know when to stop.
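The unbounded-collection idea can be sketched as follows. This is an illustrative, deliberately naive trial-division generator (PrimeSketch and Primes are made-up names), not an efficient prime sieve. The sequence never ends by itself; composing it with Take is what tells the program when to stop.

```csharp
using System.Collections.Generic;
using System.Numerics;

static class PrimeSketch
{
    // Yields primes forever; nothing is computed until a consumer asks for the next one.
    public static IEnumerable<BigInteger> Primes()
    {
        for (BigInteger candidate = 2; ; candidate++)
        {
            bool isPrime = true;

            // Naive trial division up to the square root of the candidate.
            for (BigInteger divisor = 2; divisor * divisor <= candidate; divisor++)
            {
                if (candidate % divisor == 0)
                {
                    isPrime = false;
                    break;
                }
            }

            if (isPrime)
                yield return candidate;
        }
    }
}
```

With System.Linq imported, PrimeSketch.Primes().Take(5) enumerates only the first five primes and then stops, whereas calling ToList() directly on Primes() would never return.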
LINQ is lazily evaluated. When you run this line:
IEnumerable sortedItems = _listReleases.OrderBy(x => x);
You aren't actually ordering the items right then and there. Instead you're building an enumerable that will, when enumerated, return the objects that are currently in _listReleases in order. So when you Clear() the list, it no longer has any items to order.
You need to force it to evaluate before you clear _listReleases. An easy way to do this is to add a ToList() call. Also, the non-generic IEnumerable type isn't compatible with AddRange, which won't accept it. You can just use var to implicitly type the variable as List<string>, which works because List<T> : IEnumerable<T> (it implements the interface).
var sortedItems = _listReleases.OrderBy(x => x).ToList();
_listReleases.Clear();
_listReleases.AddRange(sortedItems);
You should also note that methods like ToList() are extension methods for IEnumerable<T>, not IEnumerable, so ((IEnumerable)something).ToList() won't work. Unlike, say, Java, Something<T> and Something are completely distinct types in C#.
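The effect of deferred execution on this exact scenario can be checked with a small sketch. DeferredSketch and CountAfterClear are hypothetical names; the method builds the same three-item list, optionally materializes the ordered query, clears the source, and reports how many items the query still yields.

```csharp
using System.Collections.Generic;
using System.Linq;

static class DeferredSketch
{
    public static int CountAfterClear(bool materializeFirst)
    {
        var releases = new List<string> { "C#", "Javascript", "Python" };

        // Deferred: this only describes the ordering, it does not copy the items.
        IEnumerable<string> sorted = releases.OrderBy(x => x);

        if (materializeFirst)
            sorted = sorted.ToList(); // forces evaluation while the items still exist

        releases.Clear();

        // Without ToList(), the query reads the (now empty) source at this point.
        return sorted.Count();
    }
}
```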

Why refactor argument of List<Term> to IEnumerable<Term>?

I have a method that looks like this:
public void UpdateTermInfo(List<Term> termInfoList)
{
    foreach (Term termInfo in termInfoList)
    {
        UpdateTermInfo(termInfo);
    }
    m_xdoc.Save(FileName.FullName);
}
Resharper advises me to change the method signature to IEnumerable<Term> instead of List<Term>. What is the benefit of doing this?
The other answers point out that by choosing a "larger" type you permit a broader set of callers to call you. Which is a good enough reason in itself to make this change. However, there are other reasons. I would recommend that you make this change because when I see a method that takes a list or an array, the first thing I think is "what if that method tries to change an item in my list/array?"
You want the contents of a bucket, but you are requiring not just the bucket but also the ability to change its contents. Why would you require that if you're not going to use that ability? When you say "this method cannot take any old sequence; it has to take a mutable list that is indexed by integers" I think that you're making that requirement on the caller because you're going to take advantage of that power.
If "I'm planning on messing up your data structure" is not what you intend to communicate to the caller of the method then don't communicate that. A method that takes a sequence communicates "The most I'm going to do is read from this sequence in order".
Simply put, accepting an enumerable allows your function to be compatible with a broader scope of input arguments, such as arrays and LINQ queries.
To expound on accepting LINQ queries, one could do:
UpdateTermInfo(myTermList.Where(x => somefilter));
Additionally, specifying an interface rather than a concrete class allows others to provide their own implementation of that interface. In this way, you are being "subscriptive" rather than "proscriptive." (Yes, I did just make up a word.)
In general (with many exceptions relating to what sort of abilities you want to reserve for potential later modifications), it is a best-practice to implement functions using arguments that are the most general that they can be. This gives maximum flexibility to the consumer of your function.
As a result, if you are dead-set on using a list for this function (perhaps because at some later date you expect you might want to use properties such as Count or the index operator), I would strongly urge you to consider using IList<Term> instead of List<Term> for the reasons mentioned above.
List implements IEnumerable, so using IEnumerable makes things more flexible. If a case came along where you didn't want to use a List and wanted a different collection type, it would convert to IEnumerable with ease.
For instance, IEnumerable allows you to use arrays and many other collections, as opposed to always using a List.
IEnumerable is simply a sequence of items you can iterate over, unlike a List, where you can also add, remove, sort, use Count, etc.
The main idea behind that refactor is that you make the method more general. You don't say what data structure you want, only what you need from it: that you can iterate through its elements.
So later, when you decide that O(n) search is not good enough for you, you only have to change one line and move along.
If you use List, you confine yourself to one concrete implementation, whereas with IEnumerable you can pass in arrays, Lists, and other collections, as they all implement that interface.
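A minimal sketch of what the broader parameter type buys the caller is shown below. It uses strings instead of the question's Term type (whose definition isn't shown), and TotalLength is a hypothetical method that, like UpdateTermInfo, only iterates its input.

```csharp
using System.Collections.Generic;

static class SequenceParameterSketch
{
    // Only needs to iterate, so it asks for IEnumerable<string> and nothing more.
    public static int TotalLength(IEnumerable<string> terms)
    {
        int total = 0;
        foreach (string term in terms)
            total += term.Length;
        return total;
    }
}
```

Callers can then pass a string[], a List<string>, a HashSet<string>, or a LINQ query such as terms.Where(t => t.Length > 2) without converting anything first.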

c# string[] vs IEnumerable<string>

What should I prefer if I know the number of elements before runtime?
ReSharper offers me IEnumerable<string> instead of string[].
ReSharper suggests IEnumerable<string> if you are only using methods defined for IEnumerable. It does so with the idea that, since you clearly do not need the value to be typed as array, you might want to hide the exact type from the consumers of (i.e., the code that uses) the value because you might want to change the type in the future.
In most cases, going with the suggestion is the right thing to do. The difference will not be something that you can observe while your program is running; rather, it's in how easily you will find it to make changes to your program in the future.
From the above you can also infer that the whole suggestion/question is meaningless unless the value we are talking about is passed across method boundaries (I don't remember if R# also offers it for a local variable).
If ReSharper suggests you use IEnumerable<string> it means you are only using features of that interface and no array specific features. Go with the suggestion of ReSharper and change it.
If you are trying to provide this method as an interface to other methods, I would prefer to have the output of your method more generic, hence would go for IEnumerable<string>.
Inside a method, if you are instantiating the value and it is not being passed around to other methods, I would go for string[], unless I need deferred execution. It doesn't really matter which one you use in this case, though.
The actual type should be string[] but depending on the user you may want to expose it as something else. e.g. IEnumerable<string> sequence = new string[5]... In particular if it's something like static readonly, then you should make it a ReadOnlyCollection so the entries can't be modified.
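One way the static readonly case might look is sketched here. ExposeSketch, Names and TryOverwriteFirst are illustrative names. The backing array stays private and only a ReadOnlyCollection<string> wrapper is exposed, so attempts to write through it fail.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

static class ExposeSketch
{
    // The real storage is an array, but nobody outside can reach it.
    private static readonly string[] names = { "alpha", "beta", "gamma" };

    // What consumers see: a read-only wrapper around that array.
    public static readonly ReadOnlyCollection<string> Names = Array.AsReadOnly(names);

    // Shows that writing through the wrapper is rejected.
    public static bool TryOverwriteFirst(string value)
    {
        try
        {
            ((IList<string>)Names)[0] = value;
            return true;
        }
        catch (NotSupportedException)
        {
            return false;
        }
    }
}
```

Had the field been exposed as a plain static readonly string[], the reference could not be reassigned, but any caller could still overwrite individual entries.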
With string[] you can do more: you can access items by index, whereas with IEnumerable you have to loop to reach a specific index.
It's probably suggesting this because it's looking for a better Liskov Substitution at this point in your code. Keep in mind the difference between the declared type and the implementing type. IEnumerable<> isn't an implementation, it's an interface. You can declare the variable as an IEnumerable<string> and build it with a string[] since the string array implements IEnumerable<string>.
What this does for you is allow you to pass around that string array as a more generic, more abstracted type. Anything which expects or returns an IEnumerable<string> (regardless of implementation, be it List<string> or string[] or anything else) can then use your string array, without having to worry about the specific implementation you pass it. As long as it satisfies the interface, it's polymorphic of the correct type.
Keep in mind that this isn't always the way to go. Sometimes you, as the developer, are very concerned with the implementation (perhaps for really fine-grained performance tuning, for example) and don't want to move up to an abstraction. The decision is up to you. ReSharper is merely making a suggestion to use an abstraction rather than an implementation in a variable/method declaration.
ReSharper is likely flagging it for you because you are not returning the least constrained type. If you aren't going to be using access on it by index in the future, I'd go with IEnumerable to have less constraint on the method which returns it.
It depends on your usage later on. If you need to enumerate through these elements, or sort or compare them later on, then I would recommend IEnumerable; otherwise go with an array.
I wrote this response for a similar question regarding array or IEnumerable for return values, which was then closed as duplicate before I could post it. I thought the answer might be interesting to some so I post it here.
The main advantage of IEnumerable over T[] is that IEnumerable (for return values) can be made lazy, i.e. it only computes the next element when needed.
Consider the difference between Directory.GetFiles and Directory.EnumerateFiles. GetFiles returns an array; EnumerateFiles returns an IEnumerable. This means that for a directory with two million files, the array will contain two million strings. EnumerateFiles only instantiates the strings as needed, saving memory and improving response time.
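The same contrast can be sketched without touching the file system. In the illustrative code below (LazySketch, LazyNames, EagerNames and the Produced counter are all made up), the counter records how many elements were actually computed, standing in for the strings that GetFiles or EnumerateFiles would create.

```csharp
using System.Collections.Generic;
using System.Linq;

static class LazySketch
{
    // Counts how many elements have actually been produced so far.
    public static int Produced;

    // Lazy: each name is created only when the consumer asks for it.
    public static IEnumerable<string> LazyNames(int count)
    {
        for (int i = 0; i < count; i++)
        {
            Produced++;
            yield return "file" + i;
        }
    }

    // Eager: every name is created before the method returns.
    public static string[] EagerNames(int count) => LazyNames(count).ToArray();
}
```

Taking the first three items of LazyNames(1000000) produces only three strings, while EagerNames builds the whole array even if the caller looks at just a few entries.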
However, it's not all benefits.
foreach is significantly less efficient on non-arrays (you can see this by disassembling the IL code).
An array promises more, i.e. that its length will not change.
Lazy evaluation is not always better; consider the Directory class. The GetFiles implementation will open a find-file handle, iterate over all files, close the handle, and then return the results. EnumerateFiles will do nothing until the first file is requested; then the find-file handle is opened and the files are iterated, and the handle is closed when the enumerator is disposed. This means that the lifetime of the find-file handle is controlled by the caller, not the callee. This can be seen as weaker encapsulation, and it can cause runtime errors from locked file handles.
In my humble opinion, R# is overzealous in suggesting IEnumerable over arrays, especially for return values (input parameters have fewer potential drawbacks). What I tend to do when I see a function that returns IEnumerable is call .ToArray() in order to avoid potential issues with lazy evaluation, but if the collection is already an array this is inefficient.
I like the principle: promise a lot, require little. That is, don't require that the input parameters be arrays (use IEnumerable), but return an array rather than IEnumerable, as an array is a bigger promise.

Is <Collection>.Count Expensive to Use?

I'm writing a cache-eject method that essentially looks like this:
while ( myHashSet.Count > MAX_ALLOWED_CACHE_MEMBERS )
{
    EjectOldestItem( myHashSet );
}
My question is about how Count is determined: is it just a private or protected int, or is it calculated by counting the elements each time it's called?
From http://msdn.microsoft.com/en-us/library/ms132433.aspx:
Retrieving the value of this property is an O(1) operation.
This guarantees that accessing the Count won't iterate over the whole collection.
Edit: as many other posters suggested, IEnumerable<...>.Count() is however not guaranteed to be O(1). Use with care!
IEnumerable<...>.Count() is an extension method defined in System.Linq.Enumerable. The current implementation makes an explicit test if the counted IEnumerable<T> is indeed an instance of ICollection<T>, and makes use of ICollection<T>.Count if possible. Otherwise it traverses the IEnumerable<T> (possibly forcing lazy evaluation to run) and counts the items one by one.
I've not however found in the documentation whether it's guaranteed that IEnumerable<...>.Count() uses O(1) if possible, I only checked the implementation in .NET 3.5 with Reflector.
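The behaviour described above can be approximated with the sketch below. This is not the framework's source code; CountSketch.CountLike is a hypothetical stand-in that mirrors only the two steps mentioned: use ICollection<T>.Count when available, otherwise walk the sequence.

```csharp
using System;
using System.Collections.Generic;

static class CountSketch
{
    // Hypothetical re-creation of the described logic, not the real Enumerable.Count().
    public static int CountLike<T>(IEnumerable<T> source)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));

        // Fast path: the collection already knows its size.
        if (source is ICollection<T> collection)
            return collection.Count;

        // Slow path: O(n), and it forces any lazy query to run.
        int count = 0;
        foreach (T item in source)
            count++;
        return count;
    }
}
```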
Necessary late addition: many popular containers are not derived from Collection<T>, but nevertheless their Count property is O(1) (that is, won't iterate over the whole collection). Examples are HashSet<T>.Count (this one is most likely what the OP wanted to ask about), Dictionary<K, V>.Count, LinkedList<T>.Count, List<T>.Count, Queue<T>.Count, Stack<T>.Count and so on.
All these collections implement ICollection<T> or just ICollection, so their Count is an implementation of ICollection<T>.Count (or ICollection.Count). It's not required for an implementation of ICollection<T>.Count to be an O(1) operation, but the ones mentioned above are O(1), according to the documentation.
(Side note: some containers, for instance Queue<T>, implement the non-generic ICollection but not ICollection<T>, so they "inherit" the Count property only from ICollection.)
Your question does not specify a specific Collection class so...
It depends on the Collection class. ArrayList has an internal variable that tracks the count, as does List. However, it is implementation specific, and depending on the type of the collection, it could theoretically get recalculated on each call.
It is an internal value, and is not calculated. The documentation states that getting the value is an O(1) operation.
As others have noted, Count is maintained when modifying the collection. This is nearly always the case with every collection type in the framework. This is considerably different from using the Count() extension method on an IEnumerable, which may enumerate the whole sequence each time it is called.
Also, with the newer collection classes the Count property is not virtual which means that the jitter can inline the call to the Count accessor which makes it practically the same as accessing a field. In other words, very quick.
In case of a HashSet it's just an internal int field and even SortedSet (a binary tree based set for .net 4) has its count in an internal field.
According to Reflector, it is implemented as
public int Count { get; }
so it is defined by the derived type
Just a quick note. Beware that there are two ways to count a collection in .NET 3.5 when System.Linq is used. For a normal collection, the first choice should be to use the Count property, for the reasons already described in other answers.
An alternative method, via the LINQ .Count() extension method, is also available. The intriguing thing about .Count() is that it can be called on ANY enumerable, regardless of whether the underlying class implements ICollection or not, or whether it has a Count property. If you ever do call .Count(), however, be aware that unless the source implements ICollection<T>, it WILL iterate over the sequence to produce a count. That generally results in O(n) complexity.
The only reason I wanted to note this is, using IntelliSense, it is often easy to accidentally end up using the Count() extension rather than the Count property.
It's an internal int that gets incremented each time a new item is added to the collection.
