LINQ Ring: Any() vs Contains() for Huge Collections

Given a huge collection of objects, is there a performance difference between the following?
Collection.Contains:
myCollection.Contains(myElement)
Enumerable.Any:
myCollection.Any(currentElement => currentElement == myElement)

Contains() is an instance method, and its performance depends largely on the collection itself. For instance, Contains() on a List is O(n), while Contains() on a HashSet is O(1).
Any() is an extension method, and will simply go through the collection, applying the delegate on every object. It therefore has a complexity of O(n).
Any() is more flexible however since you can pass a delegate. Contains() can only accept an object.
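As a rough illustration of the answer above, here is a minimal sketch comparing the three calls. The sample data and variable names are made up for the example. It shows which implementation is used, not how long each call takes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: the integers 0..999,999.
List<int> list = Enumerable.Range(0, 1_000_000).ToList();
HashSet<int> set = new HashSet<int>(list);
int target = 999_999;

// List<T>.Contains scans the list linearly: O(n).
bool inList = list.Contains(target);

// HashSet<T>.Contains does a hash lookup: O(1) on average.
bool inSet = set.Contains(target);

// Enumerable.Any walks the sequence and applies the delegate: O(n),
// even when the source is a HashSet<T>.
bool viaAny = set.Any(x => x == target);

// Any() accepts an arbitrary predicate, which Contains() cannot express.
bool anyLargeEven = list.Any(x => x > 500_000 && x % 2 == 0);

Console.WriteLine($"{inList} {inSet} {viaAny} {anyLargeEven}");
```

All four calls return true here; only the amount of work needed to get there differs.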

It depends on the collection. If you have an ordered collection, then Contains might do a smart search (binary, hash, b-tree, etc.), while with Any() you are basically stuck with enumerating until you find it (assuming LINQ-to-Objects).
Also note that in your example, Any() uses the == operator, which checks for referential equality on reference types that do not overload it, while Contains uses IEquatable<T> or the Equals() method, which might be overridden.
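To make the equality difference concrete, here is a small sketch. It uses two separately boxed integers as a stand-in for "equal by value, different references"; the variable names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Two separately boxed ints: equal by value, but distinct object references.
object first = 42;
object second = 42;

var items = new List<object> { first };

// List<T>.Contains uses EqualityComparer<T>.Default, which ends up calling Equals().
bool viaContains = items.Contains(second);

// On operands typed as object, == compares references.
bool viaAnyOperator = items.Any(x => x == second);

// Calling Equals inside the predicate restores value semantics.
bool viaAnyEquals = items.Any(x => x.Equals(second));

Console.WriteLine($"{viaContains} {viaAnyOperator} {viaAnyEquals}");
// prints "True False True"
```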

I suppose that depends on the type of myCollection, which dictates how Contains() is implemented. If it is a sorted binary tree, for example, it could search smarter. It may also take the element's hash into account. Any(), on the other hand, will enumerate through the collection until it finds the first element that satisfies the condition. It gets no benefit from any smarter search method the collection might offer.

Contains() is also available as an extension method, and it can be fast when used the right way.
For example:
var result = context.Projects.Where(x => lstBizIds.Contains(x.businessId)).Select(x => x.projectId).ToList();
This will give the query
SELECT Id
FROM Projects
INNER JOIN (VALUES (1), (2), (3), (4), (5)) AS Data(Item) ON Projects.UserId = Data.Item
Any(), on the other hand, always iterates through the collection, which is O(n).

Related

Does `Any()` force LINQ execution?

I have a LINQ to Entities query.
Will Any() force LINQ execution (like ToList() does)?

There is a very good MSDN article, Classification of Standard Query Operators by Manner of Execution, which describes all the standard LINQ operators. As the table there shows, Any is executed immediately (as are all operators that return a single value). You can always refer to this table if you have doubts about how an operator executes.

Yes and no. The Any method will read items from the source right away, but it's not guaranteed to read all items.
The Any method will enumerate items from the source, but only as many as needed to determine the result.
Without a parameter, the Any method will only try to read the first item from the source.
With a parameter, the Any method will only read items from the source until it finds one that satisfies the condition. All items are read only if no item before the last one satisfies the condition.
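The behaviour described above can be observed with a small sketch. The Numbers() iterator and the itemsRead counter are made up for the example; they just record how many items Any() pulls from a lazy source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int itemsRead = 0;

// Hypothetical lazy source that counts how many items are pulled from it.
IEnumerable<int> Numbers()
{
    for (int i = 1; i <= 10; i++)
    {
        itemsRead++;
        yield return i;
    }
}

// No predicate: only the first item is requested.
bool anyAtAll = Numbers().Any();
int readWithoutPredicate = itemsRead;

// Predicate matched by the 4th item: enumeration stops there.
itemsRead = 0;
bool anyOverThree = Numbers().Any(n => n > 3);
int readUntilMatch = itemsRead;

// Predicate never matched: the whole source is read.
itemsRead = 0;
bool anyOverHundred = Numbers().Any(n => n > 100);
int readWithoutMatch = itemsRead;

Console.WriteLine($"{readWithoutPredicate} {readUntilMatch} {readWithoutMatch}");
// prints "1 4 10"
```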

This is easy to discover: Any() returns a simple bool. Since a bool is always a bool, and not an IQueryable or IEnumerable (or any other type) that can have a custom implementation, we must conclude that Any() itself must calculate the boolean value to return.
The exception is of course when Any() is used inside a subquery on an IQueryable, in which case the LINQ provider will typically just detect the call to Any() and convert it to the corresponding SQL (for example).

Short question, short answer: Yes it will.
To find out if any element of the list matches the given condition (or if there is any element at all) the list will have to be enumerated. As MSDN states:
This method does not return any one element of a collection. Instead, it determines whether the collection contains any elements.
The enumeration of source is stopped as soon as the result can be determined.
Deferred execution does not apply here, because this method delivers the result of an enumeration, not another IEnumerable.

Cost of enumerating ILookup results? Better to always use Dictionary<TKey,List<TElement>>?

I have an ILookup<TKey,TElement> lookup from which I fairly often get elements and iterate through them using LINQ or foreach. I look up like this: IEnumerable<TElement> results = lookup[key];.
Thus, results needs to be enumerated at least once every time I use lookup results (and even more often if I iterate multiple times without calling .ToList() first).
Even though it's not as "clean", wouldn't it be better (performance-wise) to use a Dictionary<TKey,List<TElement>>, so that all results from a key are only enumerated on construction of the dictionary? Just how taxing is ToList()?

ToLookup, like all the other ToXXX LINQ methods, uses immediate execution. The resulting object has no reference to the original source. It effectively does build a Dictionary<TKey, List<TElement>> - not those exact types, perhaps, but equivalent to it.
Note that there's a difference though, which may or may not be useful to you - the indexer for a lookup returns an empty sequence if you give it a key which doesn't exist, rather than throwing an exception. That can make life much easier if you want to be able to just index it by any key and iterate over the corresponding values.
Also note that although it's not explicitly documented, the implementation used for the value sequences does implement ICollection<T>, so calling the LINQ Count() method is O(1) - it doesn't need to iterate over all the elements.
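A short sketch of the points made in this answer, using made-up sample data: the lookup is a snapshot taken when ToLookup runs, and indexing it with a missing key gives an empty sequence where a dictionary would throw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var words = new List<string> { "apple", "avocado", "banana" };

// ToLookup executes immediately and keeps its own copy of the grouped elements.
ILookup<char, string> byFirstLetter = words.ToLookup(w => w[0]);

// Later changes to the source are not reflected in the lookup.
words.Add("blueberry");
int bCount = byFirstLetter['b'].Count();

// A missing key yields an empty sequence instead of throwing.
bool zIsEmpty = !byFirstLetter['z'].Any();

// The dictionary equivalent throws for a missing key.
Dictionary<char, List<string>> dict = words
    .GroupBy(w => w[0])
    .ToDictionary(g => g.Key, g => g.ToList());
bool dictThrew = false;
try { _ = dict['z']; }
catch (KeyNotFoundException) { dictThrew = true; }

Console.WriteLine($"{bCount} {zIsEmpty} {dictThrew}");
// prints "1 True True"
```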
See my Edulinq post on ToLookup for more details.

Assuming the implementation is System.Linq.Lookup (does ILookup have any other implementations?), the elements presented in lookup[key] are stored in an array of elements as a field of System.Linq.Lookup.Grouping. Repeatedly looking them up won't cause a re-iteration of source. Of course, rebuilding the Lookup will be more costly, but once built, the source is no longer accessed.

Count property vs Count() method?

Working with a collection I have the two ways of getting the count of objects; Count (the property) and Count() (the method). Does anyone know what the key differences are?
I might be wrong, but I always use the Count property in any conditional statements because I'm assuming the Count() method performs some sort of query against the collection, where as Count must have already been assigned prior to me 'getting.' But that's a guess - I don't know if performance will be affected if I'm wrong.
EDIT: Out of curiosity then, will Count() throw an exception if the collection is null? Because I'm pretty sure the Count property simply returns 0.

Decompiling the source for the Count() extension method reveals that it tests whether the object is an ICollection (generic or otherwise) and if so simply returns the underlying Count property.
So, if your code accesses Count instead of calling Count(), you can bypass the type checking: a theoretical performance benefit, but I doubt it would be a noticeable one! The decompiled source:
// System.Linq.Enumerable
public static int Count<TSource>(this IEnumerable<TSource> source)
{
    checked
    {
        if (source == null)
        {
            throw Error.ArgumentNull("source");
        }
        // Fast path: generic collections already know their size.
        ICollection<TSource> collection = source as ICollection<TSource>;
        if (collection != null)
        {
            return collection.Count;
        }
        // Fast path: non-generic collections.
        ICollection collection2 = source as ICollection;
        if (collection2 != null)
        {
            return collection2.Count;
        }
        // Fallback: enumerate the whole sequence and count the items.
        int num = 0;
        using (IEnumerator<TSource> enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                num++;
            }
        }
        return num;
    }
}
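On the null question from the EDIT: the source above throws for a null source, and a property access on a null reference throws as well, so neither returns 0. A minimal sketch to check this, with illustrative variable names:

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

List<int> populated = new List<int> { 1, 2, 3 };

// For a List<T>, Count() takes the ICollection<T> shortcut and agrees with the property.
bool sameResult = populated.Count == populated.Count();

List<int>? missing = null;
string propertyError = "";
string methodError = "";

// Property access on a null reference fails before any collection code runs.
try { _ = missing!.Count; }
catch (Exception ex) { propertyError = ex.GetType().Name; }

// The extension method receives null as its argument and rejects it explicitly.
try { _ = missing!.Count(); }
catch (Exception ex) { methodError = ex.GetType().Name; }

Console.WriteLine($"{sameResult} {propertyError} {methodError}");
// prints "True NullReferenceException ArgumentNullException"
```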

Performance is only one reason to choose one or the other. Choosing .Count() means that your code will be more generic. I've had occasions where I refactored some code that no longer produced a collection, but instead something more generic like an IEnumerable, but other code broke as a result because it depended on .Count and I had to change it to .Count(). If I made a point to use .Count() everywhere, the code would likely be more reusable and maintainable. Usually opting to utilize the more generic interfaces if you can get away with it is your best bet. By more generic, I mean the simpler interface that is implemented by more types, and thus netting you greater compatibility between code.
I'm not saying .Count() is better, I'm just saying there's other considerations that deal more with the reusability of the code you are writing.

The .Count() method might be smart enough, or know about the type in question, and if so, it might use the underlying .Count property.
Then again, it might not.
I would say it is safe to assume that if the collection has a .Count property itself, that's going to be your best bet when it comes to performance.
If the .Count() method doesn't know about the collection, it will enumerate over it, which will be an O(n) operation.

Short Version: If you have the choice between a Count property and a Count() method always choose the property.
The difference is mainly around the efficiency of the operation. All BCL collections which expose a Count property do so in an O(1) fashion. The Count() method though can, and often will, cost O(N). There are some checks to try and get it to O(1) for some implementations but it's by no means guaranteed.

The Count() method is the LINQ method that works on any IEnumerable<>. You would expect the Count() method to iterate over the whole collection to find the count, but I believe the LINQ code actually has some optimizations in there to detect if a Count property exists and if so use that.
So they should both do almost identical things. The Count property is probably slightly better since there doesn't need to be a type check in there.

The Count() method is an extension method that iterates over each element of an IEnumerable<> and returns how many elements there are. If the IEnumerable instance is actually a List<>, it is optimized to return the Count property instead of iterating over all elements.

Count() is there as an extension method from LINQ - Count is a property on Lists, actual .NET collection objects.
As such, Count() can be slower, since it enumerates the collection or queryable object whenever it cannot take a shortcut. On a list, queue, stack etc., use Count. For an array, use Length.

If there is a Count or Length property, you should always prefer that to the Count() method, which in the general case iterates over the entire collection to count its elements. Exceptions would be when the Count() method is called against a LINQ to SQL or LINQ to Entities source, for example, in which case it would perform a count query against the data source. Even then, if there is a Count property, you would want to prefer that, since it likely has less work to do.

The Count() method has an optimisation for ICollection<T> which results in the Count property being called. In this case there is probably no significant difference in performance.
There are types other than ICollection<T> which have more efficient alternatives to the Count() extension method though. This code analysis performance rule fires on the following types.
CA1829: Use Length/Count property instead of Enumerable.Count method
System.Array
System.Collections.Immutable.ImmutableArray<T>
System.Collections.ICollection
System.Collections.Generic.ICollection<T>
System.Collections.Generic.IReadOnlyCollection<T>
So, we should use the Count and Length properties if they are available and fall back to the Count() extension method otherwise.
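A minimal sketch of that rule of thumb, with made-up data. The Evens() iterator stands in for any sequence that exposes neither Length nor Count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] array = { 1, 2, 3 };
List<int> list = new List<int> { 1, 2, 3, 4 };

// Hypothetical lazy sequence with no Length or Count property.
IEnumerable<int> Evens()
{
    yield return 2;
    yield return 4;
}

int fromLength = array.Length;        // property on arrays
int fromCount = list.Count;           // property on ICollection<T> types
int fromEnumeration = Evens().Count(); // no property available, so enumerate

Console.WriteLine($"{fromLength} {fromCount} {fromEnumeration}");
// prints "3 4 2"
```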

.Count is a property of a collection and gets the number of elements it holds, whereas .Count() is a LINQ extension method that counts the elements.
Generally .Count is faster than .Count() because it avoids the overhead of the extension method call and, for sources that are not collections, of enumerating the sequence.
It's better to use the .Count property unless you need the additional functionality provided by the .Count() method, such as the ability to specify a filtering predicate, e.g.
int count = numbers.Count(n => n.Id == 100);

LINQ performance vs. Dictionary<K,V>

In many situations, for simplicity's sake, I would have preferred to use a List or a HashSet in combination with LINQ instead of using a Dictionary. However, I usually stuck with Dictionary because I thought it would be more performant thanks to its hash table implementation.
For example:
When I do this in LINQ:
bool exists = hashset.Any(item => item.Key == someKey);
Do I lose significant performance compared to the following equivalent with a Dictionary?
bool exists = dictionary.ContainsKey(someKey); // an O(1) operation
Are the LINQ queries optimized in some way that would make them a justifiable choice against a Dictionary? Or is the above Any() a plain O(n) operation no matter on which type of collection it is executed?

In your case you are eliminating the benefit of the hashset, because Any in that case is an extension method defined on IEnumerable. It simply iterates over the hashset as if it were a List and invokes the == operator on each item. In fact, these two code samples are not even strictly equivalent: the LINQ statement uses the == operator, and the dictionary uses hashcode/Equals equality. These are equivalent for value types and Strings, but not for all classes.
What you can do is this:
bool exists = hashset.Contains(item.Key);
That will use the Hashset's optimized lookup, and not require you to keep a dummy value as you would with Dictionary.
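A small sketch of the three options, assuming for simplicity that the set holds the keys themselves (the data and names are invented for the example):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var keys = new HashSet<string> { "alpha", "beta", "gamma" };
var dictionary = new Dictionary<string, int>
{
    ["alpha"] = 1,
    ["beta"] = 2,
    ["gamma"] = 3,
};
string someKey = "beta";

// Enumerable.Any: walks the set element by element, O(n).
bool viaAny = keys.Any(k => k == someKey);

// HashSet<T>.Contains: hash lookup, O(1) on average.
bool viaContains = keys.Contains(someKey);

// Dictionary.ContainsKey: same kind of hash lookup, but a value must be stored per key.
bool viaContainsKey = dictionary.ContainsKey(someKey);

Console.WriteLine($"{viaAny} {viaContains} {viaContainsKey}");
// prints "True True True"
```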

Interview Question: .Any() vs if (.Length > 0) for testing if a collection has elements

In a recent interview I was asked what the difference between .Any() and .Length > 0 was and why I would use either when testing to see if a collection had elements.
This threw me a little as it seems a little obvious but feel I may be missing something.
I suggested that you use .Length when you simply need to know that a collection has elements and .Any() when you wish to filter the results.
Presumably .Any() takes a performance hit too as it has to do a loop / query internally.

Length only exists for some collection types such as Array.
Any is an extension method that can be used with any collection that implements IEnumerable<T>.
If Length is present then you can use it, otherwise use Any.
Regarding "Presumably .Any() takes a performance hit too as it has to do a loop / query internally":
The parameterless Enumerable.Any does not loop. It fetches an enumerator and checks whether MoveNext returns true. Here is the source code from .NET Reflector.
public static bool Any<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw Error.ArgumentNull("source");
    }
    using (IEnumerator<TSource> enumerator = source.GetEnumerator())
    {
        // A single MoveNext call is enough to know whether there is a first element.
        if (enumerator.MoveNext())
        {
            return true;
        }
    }
    return false;
}

I'm guessing the interviewer may have meant to ask about checking Any() versus Count() > 0 (as opposed to Length > 0).
Basically, here's the deal.
Any() will effectively try to determine if a collection has any members by enumerating over a single item. (There is an overload to check for a given criterion using a Func<T, bool>, but I'm guessing the interviewer was referring to the version of Any() that takes no arguments.) This makes it O(1).
Count() will check for a Length or Count property (from a T[] or an ICollection or ICollection<T>) first. This would generally be O(1). If that isn't available, however, it will count the items in a collection by enumerating over the entire thing. This would be O(n).
A Count or Length property, if available, would most likely be O(1) just like Any(), and would probably perform better as it would require no enumerating at all. But the Count() extension method does not ensure this. Therefore it is sometimes O(1), sometimes O(n).
Presumably, if you're dealing with a nondescript IEnumerable<T> and you don't know whether it implements ICollection<T> or not, you are much better off using Any() than Count() > 0 if your intention is simply to ensure the collection is not empty.
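The difference is easy to measure on a lazy sequence. In this sketch the Lazy() iterator and the itemsRead counter are hypothetical helpers that record how many items each check pulls from the source.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int itemsRead = 0;

// Hypothetical lazy sequence of 1000 items that does not implement ICollection<T>.
IEnumerable<int> Lazy()
{
    for (int i = 0; i < 1000; i++)
    {
        itemsRead++;
        yield return i;
    }
}

// Any() stops after the first item.
bool viaAny = Lazy().Any();
int readByAny = itemsRead;

// Count() > 0 has to walk the whole sequence before comparing.
itemsRead = 0;
bool viaCount = Lazy().Count() > 0;
int readByCount = itemsRead;

Console.WriteLine($"{readByAny} {readByCount}");
// prints "1 1000"
```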

Length is a property of array types, while Any() is an extension method of Enumerable. Therefore, you can use Length only when working with arrays. When working with more abstract types (IEnumerable<T>), you can use Any().

.Length belongs to System.Array.
.Any is an extension method on IEnumerable.
I would prefer using Length whenever it is available. A property is in any case lighter-weight than a method call.
That said, the implementation of Any does little more than the code below.
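private static bool Any<T>(this IEnumerable<T> items)
{
    // Simplified: the real Enumerable.Any throws ArgumentNullException for a null source.
    if (items == null) return false;
    // Dispose the enumerator once the single MoveNext call has been made.
    using (IEnumerator<T> enumerator = items.GetEnumerator())
    {
        return enumerator.MoveNext();
    }
}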
Also, a better question might have been the difference between ".Count" and ".Length".

I think this is a more general question of what to choose when we have two ways to express something.
In those situations I would suggest the maxim "Be specific", a quote from Peter Norvig in his book PAIP.
Being specific means using what best describes what you are doing.
Thus what you want to say is something like:
collection.isEmpty()
If you don't have such a construct, I would choose the common idiom that the community uses.
For me .Length > 0 is not the best one, since it requires that the object can be sized.
Suppose you implement an infinite list. .Length would obviously not work.
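The infinite-list case can be sketched with an iterator that never ends. Naturals() is a made-up helper; the point is that emptiness checks based on the first element still work, while anything that needs the full size would never return.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical infinite sequence: 0, 1, 2, ...
IEnumerable<int> Naturals()
{
    int current = 0;
    while (true)
    {
        yield return current++;
    }
}

// Any() needs only the first element, so it returns immediately.
bool hasElements = Naturals().Any();

// First(predicate) also stops as soon as a match is found.
int firstOverTen = Naturals().First(x => x > 10);

// Naturals().Count() > 0 would never complete, because Count() must reach the end.

Console.WriteLine($"{hasElements} {firstOverTen}");
// prints "True 11"
```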

Sounds quite similar to this Stack Overflow question about the difference between .Count and .Any for checking for the existence of a result: Check for Existence of a Result in Linq-to-xml
In that case it is better to use Any than Count, as Count will iterate over all elements of an IEnumerable.

We know that .Length is only used for Arrays and .Any() is used for collections of IEnumerable.
You can swap .Count for .Length and you have the same question for working with collections of IEnumerable.
Both .Any() and .Count perform a null check before beginning an enumerator. So with regards to performance they are the same.
As for the array lets assume we have the following line:
int[] foo = new int[10];
Here foo.Length is 10. While this is correct, it may not be the answer you're looking for, because we haven't assigned anything to the array yet. If foo is null, accessing Length will throw an exception.
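A quick sketch of that array case. Both Length and Any() report on the slots of the array, not on whether anything meaningful has been stored; a predicate is needed for the latter.

```csharp
using System;
using System.Linq;

int[] foo = new int[10];

// The array has ten slots, all holding the default value 0.
int length = foo.Length;

// Any() only asks whether there is at least one element, so it is true as well.
bool any = foo.Any();

// Checking for "something was actually stored" needs a predicate.
bool anyNonDefault = foo.Any(x => x != 0);

Console.WriteLine($"{length} {any} {anyNonDefault}");
// prints "10 True False"
```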

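.Length is a property that returns the element count an array already stores. It does not iterate, so its complexity is O(1). The Count() extension method, by contrast, is O(n) when it has to enumerate the source.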
.Any checks whether the collection has at least one item. Complexity is O(1).
