C# LINQ: Where(expression).FirstOrDefault() vs .FirstOrDefault(expression)

What is the difference between these two LINQ queries:
var result = ResultLists().Where(c => c.code == "abc").FirstOrDefault();
// vs.
var result = ResultLists().FirstOrDefault(c => c.code == "abc");
Are the semantics exactly the same?
If they are semantically equal, does the predicate form of FirstOrDefault offer any theoretical or practical performance benefit over Where() plus plain FirstOrDefault()?

Either is fine.
They both run lazily - if the source list has a million items but the tenth item matches, then both will only iterate 10 items from the source.
Performance should be almost identical and any difference would be totally insignificant.
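To see the laziness concretely, here is a minimal sketch - the Item class, the pulled counter, and the million-element source are made up for illustration - that counts how many elements each form actually pulls:
using System;
using System.Collections.Generic;
using System.Linq;

class Item { public string code; }

class Program
{
    static int pulled; // how many items the source has handed out so far

    // A million-item source whose tenth item matches; hypothetical stand-in for ResultLists().
    static IEnumerable<Item> ResultLists()
    {
        for (int i = 1; i <= 1_000_000; i++)
        {
            pulled++;
            yield return new Item { code = i == 10 ? "abc" : "xyz" };
        }
    }

    static void Main()
    {
        pulled = 0;
        var viaWhere = ResultLists().Where(c => c.code == "abc").FirstOrDefault();
        Console.WriteLine(pulled); // 10 - Where streams lazily; FirstOrDefault stops at the match

        pulled = 0;
        var viaPredicate = ResultLists().FirstOrDefault(c => c.code == "abc");
        Console.WriteLine(pulled); // 10 - exactly the same number of items pulled
    }
}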

The second one. All other things being equal, it states the intent directly and avoids allocating an intermediate Where iterator. (Note, though, that because Where is lazy, the first form does not actually have to find all matches before picking the first; both stop at the first match.)

Nice discussion; all the above answers are correct.
I didn't run any performance tests, but in my experience FirstOrDefault() is sometimes faster and better optimized than Where().FirstOrDefault().
I recently fixed a memory-overflow/performance issue in a neural-network algorithm, and the fix was simply changing Where(x => ...).FirstOrDefault() to FirstOrDefault(x => ...). Until then I had been ignoring the editor's recommendation to make exactly that change.
So I believe the correct answer to the above question is:
The second option is the better approach in all cases.

Where uses deferred execution - the evaluation of an expression is delayed until its realized value is actually required. This can greatly improve performance by avoiding unnecessary work.
Where looks roughly like this - an iterator method that returns a new, lazy IEnumerable<T>:
static IEnumerable<T> Where<T>(this IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (var item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}
FirstOrDefault() returns a single T rather than a sequence, and when there is no result it returns default(T) (null for reference types) instead of throwing an exception.
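A small illustrative example of that default-value behavior:
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        var numbers = new List<int>();
        Console.WriteLine(numbers.FirstOrDefault());          // 0 - default(int), no exception
        var strings = new List<string>();
        Console.WriteLine(strings.FirstOrDefault() == null);  // True - null is the default for reference types
        // numbers.First() would throw InvalidOperationException instead
    }
}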

Related

Using AsSequential in order to preserve order

I am looking at this code
var numbers = Enumerable.Range(0, 20);
var parallelResult = numbers.AsParallel().AsOrdered()
    .Where(i => i % 2 == 0)
    .AsSequential();
foreach (int i in parallelResult.Take(5))
    Console.WriteLine(i);
The AsSequential() is supposed to make the resulting sequence sorted. It is indeed sorted after execution, but if I remove the call to AsSequential(), it is still sorted (since AsOrdered() is called).
What is the difference between the two?
AsSequential is just meant to stop any further parallel execution - hence the name. I'm not sure where you got the idea that it's "supposed to make the resulting array sorted". The documentation is pretty clear:
Converts a ParallelQuery into an IEnumerable to force sequential evaluation of the query.
As you say, AsOrdered ensures ordering (for that particular sequence).
I know this was asked over a year ago, but here are my two cents.
In the example shown, I think AsSequential is used so that the next query operator (in this case Take) executes sequentially.
However, the Take operator prevents a query from being parallelized unless the source elements are in their original indexing position, which is why even when you remove the AsSequential operator the result is still sorted.
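To see what AsOrdered contributes, compare the query with and without it. A rough sketch - the unordered output varies by machine and run, and may even happen to come out sorted for small inputs:
using System;
using System.Linq;

class Demo
{
    static void Main()
    {
        var numbers = Enumerable.Range(0, 20);

        // Without AsOrdered: still the even numbers, but in whatever order the partitions deliver them.
        var unordered = numbers.AsParallel()
                               .Where(i => i % 2 == 0)
                               .AsSequential();
        Console.WriteLine(string.Join(", ", unordered.Take(5))); // order not guaranteed, e.g. 8, 0, 2, 12, 4

        // With AsOrdered: the original ordering is preserved.
        var ordered = numbers.AsParallel().AsOrdered()
                             .Where(i => i % 2 == 0)
                             .AsSequential();
        Console.WriteLine(string.Join(", ", ordered.Take(5)));   // always 0, 2, 4, 6, 8
    }
}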

FirstOrDefault is significantly faster than SingleOrDefault when profiled with ANTS

I have a generic collection with 5000+ items in it. All items are unique, so I used SingleOrDefault to pull an item from the collection. Today I used Red Gate ANTS profiler to look into the code and found that my SingleOrDefault call has 18 million hits for 5000 iterations (~3.5 sec), whereas when I change it to FirstOrDefault it has 9 million hits (~1.5 sec).
I used SingleOrDefault because I know that all items in the collection are unique.
Edit: The question, then, is why FirstOrDefault is faster than SingleOrDefault even though this is the exact scenario where we are supposed to use SingleOrDefault.
SingleOrDefault() raises an exception if there is more than one match. In order to determine that, it must verify there is no more than one.
On the other hand, FirstOrDefault() can stop looking once it finds one. Therefore, I would expect it to be considerably faster in many cases.
SingleOrDefault(predicate) makes sure there is at most one item matching the given predicate, so even if it finds a matching item near the beginning of your collection, it still has to continue to the end of the IEnumerable.
FirstOrDefault(predicate) stops as soon as it finds a matching item in the collection. If your "first matches" are uniformly distributed throughout your IEnumerable, then you will, on average, have to go through half of the IEnumerable.
For a sequence of N items, SingleOrDefault will run your predicate N times, and FirstOrDefault will run your predicate (on average) N/2 times. This explains why you see SingleOrDefault has twice as many "hits" as FirstOrDefault.
If you know you'll only ever have a single matching item because the source of your collection is controlled by you and your system, then you're probably better off using FirstOrDefault. If your collection is coming from a user for example, then it could make sense to use SingleOrDefault as a check on the user's input.
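The shape of the difference is easy to see in simplified sketches of the two methods - illustrative only, not the actual BCL source:
using System;
using System.Collections.Generic;

static class FirstVsSingleSketch
{
    // FirstOrDefault: stops at the first match - O(position of the match).
    public static T FirstOrDefaultSketch<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
            if (predicate(item)) return item; // early exit
        return default(T);
    }

    // SingleOrDefault: must keep scanning after a match to rule out a second one - O(N) always.
    public static T SingleOrDefaultSketch<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        bool found = false;
        T result = default(T);
        foreach (var item in source)
        {
            if (!predicate(item)) continue;
            if (found) throw new InvalidOperationException("Sequence contains more than one matching element");
            found = true;
            result = item; // no early exit: keep going to verify uniqueness
        }
        return result;
    }
}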
I doubt very seriously that the choice between SingleOrDefault and FirstOrDefault will be your bottleneck. I think profiling tools will hopefully highlight much larger fish to fry. Your own metrics reveal that this amounts to an almost indiscernible unit of time for any given iteration.
But I recommend using the one that matches your expectation. Namely, is having more than one item that matches a predicate an error? If it is, use the method that enforces that expectation: SingleOrDefault. (Similarly, if having none is also an error, simply use Single.) If more than one match is not an error, feel free to use the First variants instead.
Now it should become obvious why one could be marginally faster than the other, as other answers discuss. One is enforcing a constraint, which of course is accomplished by executing logic. The other isn't enforcing that particular constraint and is thus not delayed by it.
FirstOrDefault will return on the first hit. SingleOrDefault will not return on the first hit; it will also look at all the other elements to check that the hit is unique. So FirstOrDefault will be faster in most cases. If you don't need the uniqueness check, take FirstOrDefault.
I've run tests using LINQPad which indicate that queries using Single and SingleOrDefault are faster than queries using First or FirstOrDefault. These tests were on rather simple queries of large datasets (no joins involved). I did not expect this result; in fact, I was trying to prove to another developer that we should be using First and FirstOrDefault, but the foundation of my argument collapsed when the proof indicated Single was actually faster. There may be cases where First is faster, but don't assume it is the blanket case.

What is the difference between a for (or foreach) loop and a LINQ query in terms of speed?

I'd like to know the difference between retrieval from a list using a for (or foreach) loop and retrieval from a list using a LINQ query, especially in terms of speed, plus any other differences.
Example:
List<T> A = new List<T>() contains 10,000 rows. I need to filter and copy some rows from list A. Which is better in terms of speed: a for loop or a LINQ query?
You could benchmark yourself and find out. (After all, only you know the particular circumstances in which you'll need to be running these loops and queries.)
My (very crude) rule-of-thumb -- which has so many caveats and exceptions as to be almost useless -- is that a for loop will generally be slightly faster than a foreach which will generally be slightly faster than a sensibly-written LINQ query.
You should use whatever construct makes the most sense for your particular situation. If what you want to do is best expressed with a for loop then do that; if it's best expressed as a foreach then do that; if it's best expressed as a query then use LINQ.
Only if and when you find that performance isn't good enough should you consider re-writing code that's expressive and correct into something faster and less expressive (but hopefully still correct).
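If you do decide to measure, a simple Stopwatch harness along these lines is a reasonable starting point. A rough sketch - the filtering predicate and iteration counts are arbitrary, and results vary by machine and data:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Benchmark
{
    static void Main()
    {
        var source = Enumerable.Range(0, 10_000).ToList();

        // Warm up the JIT so the first measurement isn't skewed.
        Filter_For(source); Filter_Linq(source);

        var sw = Stopwatch.StartNew();
        for (int run = 0; run < 1000; run++) Filter_For(source);
        Console.WriteLine($"for loop: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int run = 0; run < 1000; run++) Filter_Linq(source);
        Console.WriteLine($"LINQ:     {sw.ElapsedMilliseconds} ms");
    }

    static List<int> Filter_For(List<int> source)
    {
        var result = new List<int>();
        for (int i = 0; i < source.Count; i++)
            if (source[i] % 2 == 0) result.Add(source[i]);
        return result;
    }

    static List<int> Filter_Linq(List<int> source)
    {
        // ToList() forces evaluation, so we time the actual work, not just the query definition.
        return source.Where(x => x % 2 == 0).ToList();
    }
}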
If we're talking regular LINQ, then we're focusing on IEnumerable<T> (LINQ-to-Objects) and IQueryable<T> (LINQ-to-most-other-stuff). Since IQueryable<T> : IEnumerable<T>, it is automatic that you can use foreach - but what this means is very query-specific, since LINQ is generally lazily spooling data from an underlying source. Indeed, that source can be infinite:
public IEnumerable<int> Forever() {
    int i = 0;
    while (true) yield return i++;
}
...
foreach (int i in Forever()) {
    Console.WriteLine(i);
    if (Console.ReadLine() == "exit") break;
}
However, a for loop requires the length and an indexer. Which in real terms, typically means calling ToList() or ToArray():
var list = source.ToList();
for (int i = 0; i < list.Count; i++) { /* do something with list[i] */ }
This is interesting in various ways: firstly, it will die for infinite sequences ;p. However, it also moves the spooling earlier. So if we are reading from an external data source, the for/foreach loop over the list will be quicker, but simply because we've moved a lot of work to ToList() (or ToArray(), etc).
Another important feature of performing the ToList() earlier is that you have closed the reader. You might need to operate on data inside the list, and that isn't always possible while a reader is open; iterators break while enumerating, for example - or perhaps more notably, unless you use "MARS" SQL Server only allows one reader per connection. As a counterpoint, that reeks of "n+1", so watch for that too.
Over a local list/array/etc., it is largely immaterial which loop strategy you use.

Why can LINQ operations be faster than a normal loop?

A friend and I were a bit perplexed during a programming discussion today. As an example, we created a fictitious problem: given a List<int> of n random integers (typically 1,000,000), create a function that returns the set of all integers that appear more than once. Pretty straightforward stuff. We created one LINQ statement to solve this problem, and a plain insertion-sort-based algorithm.
Now, as we tested the speed (using System.Diagnostics.Stopwatch), the results were confusing. Not only did the LINQ code outperform the simple sort, but it ran faster than a single foreach/for that did just one pass over the list with no operations inside (which, on a side track, I thought the compiler was supposed to discover and remove altogether).
If we generated a new List<int> of random numbers in the same execution of the program and ran the LINQ code again, the performance would increase by orders of magnitude (typically thousandfold). The performance of the empty loops was of course the same.
So, what is going on here? Is LINQ using parallelism to outperform normal loops? How are these results even possible? LINQ uses quicksort, which runs in O(n log n) - by definition slower than a single O(n) pass.
And what is happening at the performance leap on the second run?
We were both baffled and intrigued at these results and were hoping for some clarifying insights from the community, just to satisfy our own curiosity.
Undoubtedly you haven't actually performed the query; you've merely defined it. LINQ constructs a query (an expression tree for IQueryable sources, a chain of lazy iterators for LINQ-to-Objects) that isn't actually evaluated until you perform an operation that requires the enumeration to be iterated. Try adding a ToList() or Count() call to force the query to be evaluated.
Based on your comment I expect this is similar to what you've done. Note: I haven't spent any time figuring out if the query is as efficient as possible; I just want some query to illustrate how the code may be structured.
var dataset = ...
var watch = Stopwatch.StartNew();
var query = dataset.Where(d => dataset.Count(i => i == d) > 1);
watch.Stop(); // timer stops here - but only the query *definition* has been measured
foreach (var item in query) // the query is actually evaluated here
{
    // ... print out the item ...
}
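Continuing that snippet, moving a forcing operation such as ToList() inside the measured region times the actual work:
var watch = Stopwatch.StartNew();
var duplicates = dataset.Where(d => dataset.Count(i => i == d) > 1).ToList(); // ToList() forces evaluation
watch.Stop(); // elapsed time now covers executing the query, not just defining it
Console.WriteLine(watch.ElapsedMilliseconds);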
I would suggest that LINQ is only faster than a 'normal loop' when your algorithm is less than perfect (or you have some problem in your code). So LINQ will be faster at sorting than you are if you don't write an efficient sorting algorithm, etc.
LINQ is usually 'as fast as' or 'close enough to' the speed of a normal loop, and can be faster (and simpler) to code / debug / read. That's its benefit - not execution speed.
If it's performing faster than an empty loop, you are doing something wrong. Most likely, as suggested in comments, you aren't considering deferred execution and the LINQ statement is not actually executing.
If you did not compile with "Optimize Code" enabled, you would probably see this behaviour. (It would certainly explain why the empty loop was not removed.)
The code underlying LINQ, however, is part of already-compiled code, which will certainly have been optimised (by the JIT, NGen or similar).

Aggregate vs. Any, for scanning objects such as IEnumerable<bool>

Just wondered if any LINQ guru might be able to shed light on how Aggregate and Any work under the hood.
Imagine that I have an IEnumerable which stores the results of testing an array for a given condition. I want to determine whether any element of the array is false. Is there any reason I should prefer one option above the other?
IEnumerable<bool> results = PerformTests();
return results.Any(r => !r); // Option 1: true if any result is false
return results.Aggregate((h, t) => h && t); // Option 2: true only if every result is true (note: the opposite boolean of option 1)
In production code I'd tend towards 1 as it's more obvious, but out of curiosity I wondered whether there's a difference in the way these are evaluated under the hood.
Yes, definitely prefer option 1 - it will stop as soon as it finds any value which is false.
Option 2 will go through the whole array.
Then there's the readability issue as well, of course :)
Jon beat me again, but to add some more text:
Aggregate always needs to consume the whole IEnumerable<T>, because that's exactly what it's supposed to do: generate a single result from your (complete) source.
It's the "Reduce" in the well-known Map/Reduce scenario.
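Simplified sketches of the two operators - illustrative, not the actual BCL implementations - make this concrete:
using System;
using System.Collections.Generic;

static class AnyVsAggregateSketch
{
    // Any: short-circuits on the first element that satisfies the predicate.
    public static bool AnySketch<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
            if (predicate(item)) return true; // early exit
        return false;
    }

    // Aggregate: folds over every element to produce its final value - it cannot stop early.
    public static T AggregateSketch<T>(this IEnumerable<T> source, Func<T, T, T> func)
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext()) throw new InvalidOperationException("Sequence contains no elements");
            T acc = e.Current;
            while (e.MoveNext())
                acc = func(acc, e.Current); // the fold consumes the entire sequence
            return acc;
        }
    }
}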