Performance difference between .Where(...).Any() vs .Any(...) [duplicate] - c#

Possible Duplicate:
LINQ extension methods - Any() vs. Where() vs. Exists()
Given a list of objects in memory I ran the following two expressions:
myList.where(x => x.Name == "bla").Any()
vs
myList.Any(x => x.Name == "bla")
The latter was always faster. I believe this is due to the Where enumerating all items, but this also happens when there are no matches.
I'm not sure of the exact WHY, though. Are there any cases where this observed performance difference wouldn't hold, for example when querying NHibernate?
Cheers.

The Any() with the predicate can perform its task without an iterator (yield return). Using Where() creates an iterator, which has a performance impact (albeit a very small one).
Thus, performance-wise (by a bit), you're better off using the form of Any() that takes the predicate (x => x.Name == "bla"). Which, personally, I find more readable as well...
On a side note, Where() does not necessarily enumerate all of the elements; it just creates an iterator that walks over the elements as they are requested, so the call to Any() after the Where() drives the iteration, which stops at the first item that matches the condition.
So the performance difference is not that Where() iterates over all the items (in LINQ to Objects) - it really doesn't need to (unless, of course, it never finds one that satisfies the condition). It's that the Where() clause has to set up an iterator to walk over the elements, whereas Any() with a predicate does not.
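To make that concrete, here is a minimal sketch (not the actual framework source, just simplified stand-ins of my own) of the two shapes involved: a Where-style method built with yield return, which forces the compiler to generate an iterator state machine, and an Any-style method that is a plain loop.
using System;
using System.Collections.Generic;

static class LinqSketch
{
    // Simplified stand-in for Enumerable.Where: yield return makes the compiler
    // generate an iterator object that the subsequent Any() then has to drive.
    public static IEnumerable<T> WhereLike<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
            if (predicate(item))
                yield return item;
    }

    // Simplified stand-in for Enumerable.Any(predicate): a direct loop with no
    // intermediate iterator, returning as soon as a match is found.
    public static bool AnyLike<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
            if (predicate(item))
                return true;
        return false;
    }
}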

Assuming you correct where to Where, I'd expect the "Any with a predicate" version to execute very slightly faster. However, I would expect the situations in which the difference is significant to be few and far between, so you should aim for readability first.
As it happens, I would normally prefer the "Any with a predicate" version in terms of readability too, so you win on both fronts - but you should really go with what you find more readable first. Measure the performance in scenarios you actually care about, and if a section of code isn't performing as you need it to, then consider micro-optimizing it - measuring at every step, of course.

I believe this is due to the Where enumerating all items.
If myList is a collection in memory, it doesn't. The Where method uses deferred execution, so it will only enumerate as many items as needed to determine the result. In that case you would not see any significant difference between .Any(...) and .Where(...).Any().
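If you want to see that deferred execution in action, here is a small self-contained sketch (my own example, not from the question) where the predicate counts how many elements Where actually examines before Any() returns:
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        var myList = new List<string> { "bla", "foo", "bar" };
        int checks = 0;

        // Any() drives the Where iterator and stops at the first match.
        bool found = myList.Where(x => { checks++; return x == "bla"; }).Any();

        Console.WriteLine(found);   // True
        Console.WriteLine(checks);  // 1 - only the first element was examined
    }
}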
Are there any cases where this observed performance difference wouldn't hold, for example when querying NHibernate?
Yes, if myList is a data source that takes the expression generated by the methods and translates it into a query to run elsewhere (e.g. LINQ to SQL), you may see a difference: the code that translates the expression may simply do a better job with one form than with the other.

Related

Cost of enumerating ILookup results? Better to always use Dictionary<TKey,List<TElement>>?

I have an ILookup<TKey,TElement> lookup from which I fairly often get elements and iterate through them using LINQ or foreach. I look them up like this: IEnumerable<TElement> results = lookup[key];.
Thus, results needs to be enumerated at least once every time I use the lookup results (and more than once if I iterate multiple times without calling .ToList() first).
Even though it's not as "clean", wouldn't it be better (performance-wise) to use a Dictionary<TKey,List<TElement>>, so that all results for a key are only enumerated once, when the dictionary is constructed? Just how taxing is ToList()?
ToLookup, like all the other ToXXX LINQ methods, uses immediate execution. The resulting object has no reference to the original source. It effectively does build a Dictionary<TKey, List<TElement>> - not those exact types, perhaps, but equivalent to it.
Note that there's a difference though, which may or may not be useful to you - the indexer for a lookup returns an empty sequence if you give it a key which doesn't exist, rather than throwing an exception. That can make life much easier if you want to be able to just index it by any key and iterate over the corresponding values.
Also note that although it's not explicitly documented, the implementation used for the value sequences does implement ICollection<T>, so calling the LINQ Count() method is O(1) - it doesn't need to iterate over all the elements.
See my Edulinq post on ToLookup for more details.
Assuming the implementation is System.Linq.Lookup (does ILookup have any other implementations?), the elements returned by lookup[key] are stored in an array field of System.Linq.Lookup's Grouping. Repeatedly looking them up won't cause a re-iteration of the source. Of course, rebuilding the Lookup will be more costly, but once it is built, the source is no longer accessed.
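As a quick illustration of the points above (the empty-sequence indexer, the O(1) Count(), and the fact that indexing never re-enumerates the source), here is a small sketch of my own:
using System;
using System.Linq;

class LookupDemo
{
    static void Main()
    {
        var words = new[] { "apple", "avocado", "banana" };

        // Built once up front; indexing it later never re-enumerates the source array.
        var byFirstLetter = words.ToLookup(w => w[0]);

        Console.WriteLine(byFirstLetter['a'].Count()); // 2 - O(1), the grouping implements ICollection<T>
        Console.WriteLine(byFirstLetter['z'].Count()); // 0 - missing key returns an empty sequence, no exception
    }
}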

Performance implications of calling ToArray inside a LINQ selector

If I have the following statement:
whatever.Select(x => collection.ToArray()[index]).ToList();
Is LINQ smart enough to perform the ToArray conversion only once (I'm not really aware of how this closure is transformed and evaluated)?
I understand that this code is bad, just interested.
No, it will be performed once for every item in whatever.
You can have a peek at the code for LINQBridge, especially the Select method (which ends up calling SelectYield).
The essence of SelectYield is a simple foreach loop:
foreach (var item in source)
    yield return selector(item, i++);
Where selector is the lambda expression you pass in, in your case x => collection.ToArray()[index]. From here it is obvious that the whole lambda expression will be evaluated for every element in whatever.
Note that LINQBridge is a stand alone reimplementation of LINQ2Objects and thus not necessarily identical (but to a very large extent at least behaving exactly like LINQ2Objects, including side effects).
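If the conversion really is loop-invariant (as it appears to be here), the usual fix is to hoist it out of the selector so it only runs once. A sketch reusing the names from the question:
// ToArray is evaluated a single time, instead of once per element of 'whatever'.
var snapshot = collection.ToArray();
var results = whatever.Select(x => snapshot[index]).ToList();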

Linq with dot notation - which is better form or what's the difference between these two?

I've been reading Jon Skeet's C# In Depth: Second Edition and I noticed something slightly different in one of his examples from something I do myself.
He has something similar to the following:
var item = someObject.Where(user => user.Id == Id).Single();
Whereas I've been doing the following:
var item = someObject.Single(user => user.Id == Id);
Is there any real difference between the two? I know Jon Skeet is pretty much the C# god, so I tend to think his knowledge in this area is better than mine and that I might be misunderstanding something here. Hope someone can help.
The queries should be equivalent when the expression tree is evaluated; however, depending on the target, the actual execution could differ (e.g. LINQ to SQL optimizations).
I typically think in terms of "filter then take a single value". On the other hand, when it comes to Count I'll often use the version with the predicate. For Any I can go either way pretty easily depending on my mood :)
Ultimately it's unlikely to make any significant difference - use whichever is easier to understand at the time. I certainly wasn't trying to avoid using the versions with predicates etc - and I give some examples using them on P499.
While there are some situations with "definitely more readable" versions, many other cases are pretty much equally readable either way.
The only thing I can notice is that the IL produced using the Where method contains one extra call compared to the second example (tested with LINQPad):
IL_0042: call System.Linq.Enumerable.Where
IL_0047: call System.Linq.Enumerable.Single
instead of a single call to System.Linq.Enumerable.Single
IL_006B: call System.Linq.Enumerable.Single
Specifically for LINQ to Objects, I personally prefer using the Enumerable.Single(this, predicate) version directly, simply because adding Enumerable.Where introduces an additional enumerator that Enumerable.Single then has to pull the data from.
By using Enumerable.Single directly on the primary enumerable, no additional enumerators are created. I have never benchmarked this to evaluate the real performance impact, but why execute more code when you don't need to, right? Especially if readability is not impacted - though that last point is rather subjective, of course.

How can I overcome the overhead of creating a List<T> from an IEnumerable<T>?

I am using some of the LINQ select stuff to create some collections, which return IEnumerable<T>.
In my case I need a List<T>, so I am passing the result to List<T>'s constructor to create one.
I am wondering about the overhead of doing this. The items in my collections are usually in the millions, so I need to consider this.
I assume that if the IEnumerable<T> contains value types, the performance is at its worst.
Am I right? What about reference types? Either way, there is also the cost of calling List<T>.Add a million times, right?
Is there any way to solve this? For example, can I "overload" methods like LINQ's Select using extension methods?
No, there's no particular penalty for the element type being value types, assuming you're using IEnumerable<T> instead of IEnumerable. You won't get any boxing going on.
If you actually know the size of the result beforehand (which the result of Select probably won't) you might want to consider creating the list with that size of buffer, then using AddRange to add the values. Otherwise the list will have to resize its buffer every time it fills it.
For instance, instead of doing:
Foo[] foo = new Foo[100];
IEnumerable<string> query = foo.Select(x => x.Name);
List<string> queryList = new List<string>(query);
you might do:
Foo[] foo = new Foo[100];
IEnumerable<string> query = foo.Select(x => x.Name);
List<string> queryList = new List<string>(foo.Length);
queryList.AddRange(query);
You know that calling Select will produce a sequence of the same length as the original query source, but nothing in the execution environment has that information as far as I'm aware.
It would be best to avoid the need for a list. If you can keep your caller using IEnumerable<T>, you will save yourself some headaches.
LINQ's ToList() will take your enumerable, and just construct a new List<T> directly from it, using the List<T>(IEnumerable<T>) constructor. This will be the same as making the list yourself, performance wise (although LINQ does a null check, as well).
If you're adding the elements yourself, use the AddRange method instead of the Add. ToList() is very similar to AddRange (since it's using the constructor which takes IEnumerable<T>), which typically will be your best bet, performance wise, in this case.
Generally speaking, a method returning IEnumerable doesn't have to evaluate any of the items before an item is actually needed. So, theoretically, when you return an IEnumerable, none of your items need to exist at that time.
So creating a list means that you will really need to evaluate items, get them and place them somewhere in memory (at least their references). There is nothing that can be done about this - if you really need to have a list.
A number of other responders have already provided ideas for how to improve the performance of copying an IEnumerable<T> into a List<T> - I don't think that much can be added on that front.
However, based on what you have described you need to do with the results, and the fact that you get rid of the list when you're done (which I presume means that the intermediate results are not interesting) - you may want to consider whether you really need to materialize a List<T>.
Rather than creating a List<T> and operating on the contents of that list, consider writing a lazy extension method for IEnumerable<T> that performs the same processing logic. I've done this myself in a number of cases, and writing such logic in C# is not so bad when using the yield return syntax supported by the compiler.
This approach works well if all you're trying to do is visit each item in the results and collect some information from it. Often, what you need to do is just visit each element in the collection on demand, do some processing with it, and then move on. This approach is generally more scalable and performant than creating a copy of the collection just to iterate over it.
Now, this advice may not work for you for other reasons, but it's worth considering as an alternative to finding the most efficient way to materialize a very large list.
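For what it's worth, such a lazy extension method is only a few lines. Here is a minimal sketch (the method and type names are hypothetical, not from the original code):
using System;
using System.Collections.Generic;

public static class EnumerableExtensions
{
    public static IEnumerable<TResult> ProcessLazily<TSource, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, TResult> transform)
    {
        // Each result is produced only when the caller asks for it,
        // so no intermediate List<T> is ever materialized.
        foreach (var item in source)
            yield return transform(item);
    }
}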
Don't pass an IEnumerable to the List constructor. IEnumerable has a ToList() method, which can't possibly do worse than that, and has nicer syntax (IMHO).
That said, that only changes the answer to your question to "it depends" - in particular, it depends on what the IEnumerable actually is behind the scenes. If it happens to be a List already, then ToList will effectively be free and will, of course, go much faster than if it were another type. It's still not super-fast.
The best way to solve this, of course, is to try to figure out how to do your processing on an IEnumerable rather than a List. That may not be possible.
Edit: Some people in the comments are debating whether or not ToList() will actually be any faster when called on a List than if not, and whether ToList() will be any faster than the list constructor. At this point, speculating is getting pointless, so here's some code:
using System;
using System.Linq;
using System.Collections.Generic;

public static class ToListTest
{
    public static int Main(string[] args)
    {
        List<int> intlist = new List<int>();
        for (int i = 0; i < 1000000; i++)
            intlist.Add(i);

        IEnumerable<int> intenum = intlist;
        for (int i = 0; i < 1000; i++)
        {
            List<int> foo = intenum.ToList();
        }
        return 0;
    }
}
Running this code with an IEnumerable that's really a List goes about 6-10 times faster than if I replace it with a LinkedList or Stack (on my pokey 2.4 GHz P4, using Mono 1.2.6). Conceivably this could be due to some unfortunate interaction between ToList() and the particular implementations of LinkedList or Stack's enumerations, but at least the point remains: speed will depend on the underlying type of the IEnumerable. That said, even with a List as the source, it still takes 6 seconds for me to make 1000 ToList() calls, so it's far from free.
The next question is whether ToList() is any more intelligent than the List constructor. The answer to that turns out to be no: the List constructor is just as fast as ToList(). In hindsight, Jon Skeet's reasoning makes sense - I was just forgetting that ToList() was an extension method. I still (much) prefer ToList() syntactically, but there's no performance reason to use it.
So the short version is that the best answer is still "don't convert to a List if you can avoid it". Barring that, actual performance will depend drastically on what the IEnumerable actually is, but at best it'll be sluggish, as opposed to glacial. I've amended my original answer to reflect this.
From reading the various comments and the question, I get the following requirement:
for a collection of data, you need to run through that collection, filter out some objects, and then perform some transformation on the remaining objects. If that's the case, you can do something like this:
var result = from item in collection
             where item.Id > 10 // or some more sensible condition
             select Operation(item);
and if you need to perform more filtering and transformation, you can nest your LINQ queries like this:
var result = from filteredItem in (from item in collection
                                   where item.Id > 10 // or some more sensible condition
                                   select Operation(item))
             where filteredItem.SomePropertyAvailableAfterFirstTransformation == "new"
             select SecondTransfomation(filteredItem);
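For comparison, the same nested query in dot notation (a sketch reusing the names from the query above) would be:
var result = collection
    .Where(item => item.Id > 10) // or some more sensible condition
    .Select(item => Operation(item))
    .Where(x => x.SomePropertyAvailableAfterFirstTransformation == "new")
    .Select(x => SecondTransfomation(x));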

Where and When to use LINQ to Objects?

In which situations I should use LINQ to Objects?
Obviously I can do everything without LINQ. So for which operations does LINQ actually help me write shorter and/or more readable code?
This question was triggered by this one.
I find LINQ to Objects useful all over the place. The problem it solves is pretty general:
You have a collection of some data items
You want another collection, formed from the original collection, but after some sort of transformation or filtering. This might be sorting, projection, applying a predicate, grouping, etc.
That's a situation I come across pretty often. There are an awful lot of areas of programming which basically involve transforming one collection (or stream of data) into another. In those cases the code using LINQ is almost always shorter and more readable. I'd like to point out that LINQ shouldn't be regarded as being synonymous with query expressions - if only a single operator is required, the normal "dot notation" (using extension methods) can often be shorter and more readable.
One of the reasons I particularly like LINQ to Objects is that it is so general - whereas LINQ to SQL is likely to only get involved in your data layer (or pretty much become the data layer), LINQ to Objects is applicable in every layer, and in all kinds of applications.
Just as an example, here's a line in my MiniBench benchmarking framework, converting a TestSuite (which is basically a named collection of tests) into a ResultSuite (a named collection of results):
return new ResultSuite(name,
    tests.Select(test => test.Run(input, expectedOutput)));
Then again if a ResultSuite needs to be scaled against some particular "standard" result:
return new ResultSuite(name,
    results.Select(x => x.ScaleToStandard(standard, mode)));
It wouldn't be hard to write this code without LINQ, but LINQ just makes it clearer and lets you concentrate on the real "logic" instead of the details of iterating through loops and adding results to lists etc.
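For comparison, roughly what the first snippet would look like without LINQ (a sketch; the Result element type and the ResultSuite constructor overload are assumptions, not taken from MiniBench):
// Hypothetical non-LINQ equivalent of the first snippet above.
var results = new List<Result>();
foreach (var test in tests)
{
    results.Add(test.Run(input, expectedOutput));
}
return new ResultSuite(name, results);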
Even when LINQ itself isn't applicable, some of the features which were largely included for the sake of LINQ (e.g. implicitly typed local variables, lambda expressions, extension methods) can be very useful.
The answer "practically everywhere" comes to mind. A better question would be when not to use it.
LINQ is great for the "slippery slope". Think of what's involved in many common operations:
Where. Just write a foreach loop and an "if"
Select. Create an empty list of the target type, loop through the originals, convert each one and add it to the results.
OrderBy. Just add it to a list and call .Sort(). Or implement a bubble sort ;)
ThenBy (from order by PropertyA, then by PropertyB). Quite a bit harder. A custom comparer and Sort should do the trick.
GroupBy - create a Dictionary<key, List<value>> and loop through all items. If no key exists, create it, then add items to the appropriate list (see the sketch at the end of this answer).
In each of those cases, the procedural way takes more code than the LINQ way. In the case of "if" it's a couple of lines more; in the case of GroupBy or OrderBy/ThenBy it's a lot more.
Now take an all too common scenario of combining them together. You're suddenly looking at a 10-20 line method which could be solved with 3-4 lines in LINQ. And the LINQ version is guaranteed to be easier to read (once you are familiar with LINQ).
So when do you use LINQ? My answer: whenever you see "foreach" :)
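As a concrete instance of the GroupBy bullet above, here is a sketch of the manual dictionary-building loop next to the LINQ one-liner (the Person type and its City property are hypothetical):
// Manual version: build the Dictionary<string, List<Person>> by hand.
var byCity = new Dictionary<string, List<Person>>();
foreach (var person in people)
{
    List<Person> bucket;
    if (!byCity.TryGetValue(person.City, out bucket))
    {
        bucket = new List<Person>();
        byCity[person.City] = bucket;
    }
    bucket.Add(person);
}

// LINQ version: one call produces the same grouping (lazily).
var byCityLinq = people.GroupBy(p => p.City);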
LINQ is pretty useful in a few scenarios:
You want to use typed "business entities", instead of data tables, to more naturally access your data (and aren't already using something like NHibernate or LLBLGenPro)
You want to query non-relational data using a SQL-like syntax (this is real handy when querying lists and such)
You don't like lots of inline SQL or stored procedures
LINQ comes into play when you start doing complex filtering on complex data types. For example, suppose you're given a list of People objects and you need to gather a list of all the doctors within that list. With LINQ, you can compress the following code into a single LINQ statement:
(pseudo-code)
doctors = []
for person in people:
    if person is doctor:
        doctors.append(person)
(sorry, my C# is rusty, type checking syntax is probably incorrect, but you get the idea)
doctors = from person in people where person.type() == doctor select person;
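For reference, a compilable form of that query would look something like this (a sketch assuming a Doctor type derived from Person, which isn't shown in the original):
var doctors = from person in people
              where person is Doctor
              select person;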
Edit: After I answered I see a change to say "LINQ to Objects". Oh well.
If by LINQ we refer to all the new types in System.Linq, as well as the new compiler features, then it'll have quite a bit of benefit - it effectively adds functional programming to these languages. Here's the progression I've seen a few times (although this is mainly C# - VB is limited in the current version).
The obvious start is that anything related to list processing gets vastly easier. A lot of loops can just go away. What benefit do you get? You'll start programming more declaratively, which will lead to fewer bugs. Things start to "just work" when switching to this style. (The LINQ query syntax I don't find too useful, unless the queries are very complicated with lots of intermediate values. In these cases, the syntax will sort out all the issues you'd otherwise have to pass tuples around for.)
Next, language support (in C#, and in the next version of VB) for anonymous methods allows you to write a lot more constructs in a much shorter way. For instance, handling an async callback can be defined inside the method that initiates it. Using a closure here will result in you not having to bundle up state into an opaque object parameter and casting it out later on.
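A small sketch of that callback point (a hypothetical scenario, not an example from the book): the lambda captures customerId through a closure, so nothing has to be packed into a state object and cast back out in the handler.
using System;
using System.Net;

class CallbackSketch
{
    static void StartLookup(int customerId)
    {
        var client = new WebClient();
        client.DownloadStringCompleted += (sender, e) =>
        {
            // customerId is captured by the closure - no opaque state parameter, no cast.
            Console.WriteLine("Customer " + customerId + ": " + e.Result.Length + " chars");
        };
        client.DownloadStringAsync(new Uri("https://example.com/customers/" + customerId));
    }
}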
Being able to use higher order functions gets you thinking much more generically. So you'll start to see where you could simply pass in a lambda and solve things neater and cleaner. At this point, you'll realise that things only really work if you use generics. Sure, this is a 2.0 feature, but the usage is much more prevalent when you're passing functions around.
And around there, you get into the point of diminishing returns. The cost of declaring and using funcs and declaring all the generic type parameters (in C# and VB) is quite high. The compiler won't work it out for you, so you have to do it all manually. This adds a huge amount of overhead and friction, which limits how far you can go.
So, is this all "LINQ"? Depends on marketing, perhaps. The LINQ push made this style of programming much easier in C#, and all of LINQ is based on FP ideas.
