LINQ performance vs. Dictionary<K,V> - c#

In many situations, for simplicity's sake, I would have preferred to use a List or a HashSet in combination with LINQ instead of a Dictionary. However, I usually stuck with Dictionary because I assumed it would perform better thanks to its hash table implementation.
For example:
When I do this in LINQ:
bool exists = hashset.Any(item => item.Key == someKey);
Do I lose significant performance compared to the following equivalent with a Dictionary?
bool exists = dictionary.ContainsKey(someKey); // an O(1) operation
Are the LINQ queries optimized in some way that would make them a justifiable choice against a Dictionary? Or is the above Any() a plain O(n) operation no matter on which type of collection it is executed?

In your case you are eliminating the benefit of the HashSet, because Any here is an extension method defined on IEnumerable<T>. It simply iterates over the HashSet as if it were a List, invoking the == operator on each item. In fact, the two code samples are not even strictly equivalent: the LINQ statement uses the == operator, while the Dictionary uses GetHashCode/Equals equality. These are equivalent for value types and strings, but not for all classes.
What you can do is this:
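bool exists = hashset.Contains(someKey);
That will use the HashSet's optimized lookup, and not require you to keep a dummy value as you would with Dictionary. Note that this assumes the HashSet stores the keys themselves, since item only exists inside the lambda in your LINQ version.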
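To make the difference concrete, here is a minimal sketch. The class name MembershipDemo, the sample keys, and the choice of string as the key type are all assumptions for illustration, not part of the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MembershipDemo
{
    // Hypothetical sample data; a set of string keys is assumed.
    public static readonly HashSet<string> Keys =
        new HashSet<string> { "alpha", "beta", "gamma" };

    // O(n): Enumerable.Any walks the set element by element.
    public static bool ExistsViaAny(string someKey) =>
        Keys.Any(k => k == someKey);

    // Expected O(1): HashSet<T>.Contains goes through the hash table.
    public static bool ExistsViaContains(string someKey) =>
        Keys.Contains(someKey);

    public static void Main()
    {
        Console.WriteLine(ExistsViaAny("beta"));       // True
        Console.WriteLine(ExistsViaContains("beta"));  // True
        Console.WriteLine(ExistsViaContains("delta")); // False
    }
}
```

Both calls give the same answer here; only the amount of work differs, which starts to matter once the set is large.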

Related

The underlay of C# where() method and C# HashTable

I am new to C#, having got used to Java and its data structures. Recently I was writing some C# that builds a list of results based on users' choices. I used Hashtable, but people say I can call the Where() method on a List instead.
I am wondering what underlies the Where method (an array, or some other data structure?) so that I can figure out the real cost of Where() compared with Hashtable.
Also, I know most of HashTable is created based on array or BST, so how does HashTable work in C#?
In general, I wouldn't use Hashtable in C#, but instead use Dictionary<TKey, TValue>, as this provides type safety.
Both classes have access that approaches O(1) for accessing items in the collection. This is mentioned in the documentation:
Getting or setting the value of this property approaches an O(1) operation.
Note that using a List<T> will effectively use an array internally, which means .Where becomes an O(n) filter. For finding a single element, FirstOrDefault is typically a better choice. If there aren't many items in the collection, this is often fine, but if you need fast access, then a Dictionary<T,U> is a better option.
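As a rough sketch of the trade-off described above (the User class, the sample data, and the method names are hypothetical), the same lookup can be written against a List<T> and against a Dictionary<TKey, TValue>:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type used only for this illustration.
public sealed class User
{
    public int Id { get; }
    public string Name { get; }
    public User(int id, string name) { Id = id; Name = name; }
}

public static class LookupCostDemo
{
    public static readonly List<User> Users = new List<User>
    {
        new User(1, "Ann"), new User(2, "Bob"), new User(3, "Cy")
    };

    // Built once; afterwards each lookup by Id approaches O(1).
    public static readonly Dictionary<int, User> UsersById =
        Users.ToDictionary(u => u.Id);

    // O(n) scan that stops at the first match; null when nothing matches.
    public static User FindInList(int id) =>
        Users.FirstOrDefault(u => u.Id == id);

    // Hash lookup; null when the key is absent.
    public static User FindInDictionary(int id) =>
        UsersById.TryGetValue(id, out var user) ? user : null;

    public static void Main()
    {
        Console.WriteLine(FindInList(2).Name);           // Bob
        Console.WriteLine(FindInDictionary(2).Name);     // Bob
        Console.WriteLine(FindInDictionary(42) == null); // True
    }
}
```

For three items the list scan is perfectly fine; the dictionary only pays off when the collection is large or lookups are frequent.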

Cost of enumerating ILookup results? Better to always use Dictionary<TKey,List<TElement>>?

I have an ILookup<TKey,TElement> lookup from which I fairly often get elements and iterate through them using LINQ or foreach. I look up like this: IEnumerable<TElement> results = lookup[key];.
Thus, results needs to be enumerated at least once every time I use lookup results (and even more if I'm iterating multiple times if I don't use .ToList() first).
Even though it's not as "clean", wouldn't it be better (performance-wise) to use a Dictionary<TKey,List<TElement>>, so that all results from a key are only enumerated on construction of the dictionary? Just how taxing is ToList()?
ToLookup, like all the other ToXXX LINQ methods, uses immediate execution. The resulting object has no reference to the original source. It effectively does build a Dictionary<TKey, List<TElement>> - not those exact types, perhaps, but equivalent to it.
Note that there's a difference though, which may or may not be useful to you - the indexer for a lookup returns an empty sequence if you give it a key which doesn't exist, rather than throwing an exception. That can make life much easier if you want to be able to just index it by any key and iterate over the corresponding values.
Also note that although it's not explicitly documented, the implementation used for the value sequences does implement ICollection<T>, so calling the LINQ Count() method is O(1) - it doesn't need to iterate over all the elements.
See my Edulinq post on ToLookup for more details.
Assuming the implementation is System.Linq.Lookup (does ILookup have any other implementations?), the elements presented in lookup[key] are stored in an array of elements as a field of System.Linq.Lookup.Grouping. Repeatedly looking them up won't cause a re-iteration of source. Of course, rebuilding the Lookup will be more costly, but once built, the source is no longer accessed.
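The missing-key behaviour mentioned above can be seen in a small sketch. The sample words and the names LookupDemo, WordsByLength, and DictionaryThrowsOnMissingKey are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupDemo
{
    // ToLookup executes immediately; the lookup keeps no reference to the array.
    public static readonly ILookup<int, string> WordsByLength =
        new[] { "a", "to", "of", "cat" }.ToLookup(w => w.Length);

    // Returns true if indexing the dictionary with a missing key throws.
    public static bool DictionaryThrowsOnMissingKey()
    {
        var dict = new Dictionary<int, List<string>>();
        try
        {
            var unused = dict[99];
            return false;
        }
        catch (KeyNotFoundException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // Existing key: elements come back in source order.
        Console.WriteLine(string.Join(",", WordsByLength[2])); // to,of

        // Missing key: an empty sequence rather than an exception.
        Console.WriteLine(WordsByLength[99].Count());          // 0

        Console.WriteLine(DictionaryThrowsOnMissingKey());     // True
    }
}
```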

How to join together all the elements in an IEnumerable of IEnumerables?

Just in case you're wondering how this came up, I'm working with some resultsets from Entity Framework.
I have an object that is an IEnumerable<IEnumerable<string>>; basically, a list of lists of strings.
I want to merge all the lists of strings into one big list of strings.
What is the best way to do this in C#.net?
Use the LINQ SelectMany method:
IEnumerable<IEnumerable<string>> myOuterList = // some IEnumerable<IEnumerable<string>>...
IEnumerable<String> allMyStrings = myOuterList.SelectMany(sl => sl);
To be very clear about what's going on here (since I hate the thought of people thinking this is some kind of sorcery, and I feel bad that some other folks deleted the same answer):
SelectMany is an extension method (a static method that through syntactic sugar looks like an instance method on a specific type) on IEnumerable<T>. It takes your original enumeration of enumerations and a function for converting each item of that into an enumeration.
Because the items are already enumerations, the conversion function is simple: just return the input (sl => sl means "take a parameter named sl and return it"). SelectMany then provides an enumeration over each of these in turn, resulting in your "flattened" list.
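Here is a small self-contained version of that idea. The Flatten helper and the sample data are hypothetical names introduced only for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FlattenDemo
{
    // Each inner sequence is returned unchanged; SelectMany concatenates them lazily.
    public static IEnumerable<string> Flatten(IEnumerable<IEnumerable<string>> outer) =>
        outer.SelectMany(inner => inner);

    public static void Main()
    {
        var nested = new[]
        {
            new[] { "a", "b" },
            new string[0],      // empty inner sequences simply contribute nothing
            new[] { "c" }
        };

        Console.WriteLine(string.Join(",", Flatten(nested))); // a,b,c
    }
}
```

Add a ToList() call at the end if you need a materialized List<string> rather than a lazily evaluated sequence.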
Use the Concat method:
firstEnumerable.Concat(secondEnumerable)
Using SelectMany will force an additional evaluation of each element of both enumerations that you don't need.

Count property vs Count() method?

Working with a collection I have the two ways of getting the count of objects; Count (the property) and Count() (the method). Does anyone know what the key differences are?
I might be wrong, but I always use the Count property in any conditional statements because I'm assuming the Count() method performs some sort of query against the collection, whereas Count must have already been assigned prior to me 'getting.' But that's a guess - I don't know if performance will be affected if I'm wrong.
EDIT: Out of curiosity then, will Count() throw an exception if the collection is null? Because I'm pretty sure the Count property simply returns 0.
Decompiling the source for the Count() extension method reveals that it tests whether the object is an ICollection (generic or otherwise) and if so simply returns the underlying Count property:
// System.Linq.Enumerable
public static int Count<TSource>(this IEnumerable<TSource> source)
{
    checked
    {
        if (source == null)
        {
            throw Error.ArgumentNull("source");
        }
        ICollection<TSource> collection = source as ICollection<TSource>;
        if (collection != null)
        {
            return collection.Count;
        }
        ICollection collection2 = source as ICollection;
        if (collection2 != null)
        {
            return collection2.Count;
        }
        int num = 0;
        using (IEnumerator<TSource> enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                num++;
            }
        }
        return num;
    }
}
So, if your code accesses Count instead of calling Count(), you can bypass the type checking - a theoretical performance benefit but I doubt it would be a noticeable one!
Performance is only one reason to choose one or the other. Choosing .Count() means that your code will be more generic. I've had occasions where I refactored some code that no longer produced a collection, but instead something more generic like an IEnumerable, but other code broke as a result because it depended on .Count and I had to change it to .Count(). If I made a point to use .Count() everywhere, the code would likely be more reusable and maintainable. Usually opting to utilize the more generic interfaces if you can get away with it is your best bet. By more generic, I mean the simpler interface that is implemented by more types, and thus netting you greater compatibility between code.
I'm not saying .Count() is better, I'm just saying there are other considerations that deal more with the reusability of the code you are writing.
The .Count() method might be smart enough, or know about the type in question, and if so, it might use the underlying .Count property.
Then again, it might not.
I would say it is safe to assume that if the collection has a .Count property itself, that's going to be your best bet when it comes to performance.
If the .Count() method doesn't know about the collection, it will enumerate over it, which will be an O(n) operation.
Short Version: If you have the choice between a Count property and a Count() method always choose the property.
The difference is mainly around the efficiency of the operation. All BCL collections which expose a Count property do so in an O(1) fashion. The Count() method though can, and often will, cost O(N). There are some checks to try and get it to O(1) for some implementations but it's by no means guaranteed.
The Count() method is the LINQ method that works on any IEnumerable<>. You would expect the Count() method to iterate over the whole collection to find the count, but I believe the LINQ code actually has some optimizations in there to detect if a Count property exists and if so use that.
So they should both do almost identical things. The Count property is probably slightly better since there doesn't need to be a type check in there.
The Count() method is an extension method that iterates over each element of an IEnumerable<> and returns how many elements there are. If the instance of IEnumerable is actually a List<>, it is optimized to return the Count property instead of iterating over all the elements.
Count() is there as an extension method from LINQ - Count is a property on Lists, actual .NET collection objects.
As such, Count() will almost always be slower, since it will enumerate the collection / queryable object. On a list, queue, stack etc, use Count. Or for an array - Length.
If there is a Count or Length property, you should always prefer that to the Count() method, which generally iterates the entire collection to count the number of elements within. Exceptions would be when the Count() method is against a LINQ to SQL or LINQ to Entities source, for example, in which case it would perform a count query against the datasource. Even then, if there is a Count property, you would want to prefer that, since it likely has less work to do.
The Count() method has an optimisation for ICollection<T> which results in the Count property being called. In this case there is probably no significant difference in performance.
There are types other than ICollection<T> which have more efficient alternatives to the Count() extension method though. This code analysis performance rule fires on the following types.
CA1829: Use Length/Count property instead of Enumerable.Count method
System.Array
System.Collections.Immutable.ImmutableArray<T>
System.Collections.ICollection
System.Collections.Generic.ICollection<T>
System.Collections.Generic.IReadOnlyCollection<T>
So, we should use Count and Length properties if they are available and fallback to the Count() extension method otherwise.
.Count is a property of a collection and returns the number of elements it holds, unlike .Count(), which is a LINQ extension method that counts the elements.
Generally .Count is faster than .Count() because it does not require the overhead of creating and enumerating a LINQ query.
It's better to use the .Count property unless you need the additional functionality provided by the .Count() method, such as the ability to specify a filtering predicate, e.g.
int count = numbers.Count(n => n.Id == 100);
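A quick way to see when Count() actually enumerates is to count how many elements a source yields. This is only a sketch: the Numbers iterator and the Enumerated counter are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CountDemo
{
    // Number of elements yielded so far by Numbers().
    public static int Enumerated;

    // An iterator is not an ICollection<T>, so Count() has to walk it.
    public static IEnumerable<int> Numbers(int n)
    {
        for (int i = 0; i < n; i++)
        {
            Enumerated++;
            yield return i;
        }
    }

    public static void Main()
    {
        var list = new List<int> { 1, 2, 3 };

        Console.WriteLine(list.Count);             // 3, the property
        Console.WriteLine(list.Count());           // 3, via the ICollection<T> check
        Console.WriteLine(list.Count(x => x > 1)); // 2, a predicate always enumerates

        Console.WriteLine(Numbers(5).Count());     // 5
        Console.WriteLine(Enumerated);             // 5, every element was visited
    }
}
```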

LINQ Ring: Any() vs Contains() for Huge Collections

Given a huge collection of objects, is there a performance difference between the following?
Collection.Contains:
myCollection.Contains(myElement)
Enumerable.Any:
myCollection.Any(currentElement => currentElement == myElement)
Contains() is an instance method, and its performance depends largely on the collection itself. For instance, Contains() on a List is O(n), while Contains() on a HashSet is O(1).
Any() is an extension method, and will simply go through the collection, applying the delegate on every object. It therefore has a complexity of O(n).
Any() is more flexible however since you can pass a delegate. Contains() can only accept an object.
It depends on the collection. If you have an ordered collection, then Contains might do a smart search (binary, hash, b-tree, etc.), while with Any() you are basically stuck with enumerating until you find it (assuming LINQ-to-Objects).
Also note that in your example, Any() uses the == operator, which checks for referential equality on reference types unless the type overloads ==, while Contains uses IEquatable<T> or the Equals() method, which might be overridden.
I suppose that would depend on the type of myCollection, which dictates how Contains() is implemented. A sorted binary tree, for example, could search smarter, and a hash-based collection can take the element's hash into account. Any(), on the other hand, will enumerate through the collection until it finds the first element that satisfies the condition. It has no optimizations for collections that offer a smarter search.
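The == versus Equals() point is easy to miss, so here is a minimal sketch of it. The Point class is hypothetical: it overrides Equals and GetHashCode but deliberately does not overload ==.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type with value-style Equals but no == overload.
public sealed class Point
{
    public int X { get; }
    public Point(int x) { X = x; }

    public override bool Equals(object obj) => obj is Point p && p.X == X;
    public override int GetHashCode() => X;
}

public static class EqualityDemo
{
    // List<T>.Contains uses the default equality comparer, which calls Equals.
    public static bool ViaContains(List<Point> points, Point target) =>
        points.Contains(target);

    // Without an == overload, this compares references, not values.
    public static bool ViaAny(List<Point> points, Point target) =>
        points.Any(p => p == target);

    public static void Main()
    {
        var points = new List<Point> { new Point(1), new Point(2) };
        var probe = new Point(2); // equal by value, but a different instance

        Console.WriteLine(ViaContains(points, probe)); // True
        Console.WriteLine(ViaAny(points, probe));      // False
    }
}
```

Writing the predicate as p => p.Equals(target) would make the Any() version agree with Contains here.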
Contains() is also available as an extension method, which can work fast if you use it in the correct way.
For example:
var result = context.Projects.Where(x => lstBizIds.Contains(x.businessId)).Select(x => x.projectId).ToList();
This will give the query
SELECT Id
FROM Projects
INNER JOIN (VALUES (1), (2), (3), (4), (5)) AS Data(Item) ON Projects.UserId = Data.Item
while Any(), on the other hand, always iterates through the collection in O(n).
Hope this will work....
