A friend and I were a bit perplexed during a programming discussion today. As an example, we made up the problem of taking a List<int> of n random integers (typically 1,000,000) and writing a function that returns the set of all integers that appear more than once. Pretty straightforward stuff. We wrote one LINQ statement to solve it, and a plain insertion-sort-based algorithm.
Now, when we timed the code (using System.Diagnostics.Stopwatch), the results were confusing. Not only did the LINQ code outperform the simple sort, it also ran faster than a single foreach/for that merely looped over the list once with no operations inside (which, as a side note, I thought the compiler was supposed to detect and remove altogether).
If we generated a new List<int> of random numbers in the same run of the program and executed the LINQ code again, its performance would increase by orders of magnitude (typically a thousandfold). The performance of the empty loops was of course the same.
So, what is going on here? Is LINQ using parallelism to outperform normal loops? How are these results even possible? LINQ sorts with quicksort, which runs in O(n log n) and is by definition slower than O(n).
And what explains the performance leap on the second run?
We were both baffled and intrigued at these results and were hoping for some clarifying insights from the community, just to satisfy our own curiosity.
Almost certainly you haven't actually executed the query; you've merely defined it. LINQ builds up the query and, thanks to deferred execution, doesn't evaluate it until you perform an operation that requires the enumeration to be iterated. Try adding a ToList() or Count() call to the LINQ query to force it to be evaluated.
Based on your comment I expect this is similar to what you've done. Note: I haven't spent any time figuring out if the query is as efficient as possible; I just want some query to illustrate how the code may be structured.
var dataset = ...
var watch = Stopwatch.StartNew();
var query = dataset.Where( d => dataset.Count( i => i == d ) > 1 );
watch.Stop(); // timer stops here
foreach (var item in query) // query is actually evaluated here
{
    // ... print out the item ...
}
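If you want the Stopwatch to measure the actual work, force the query to run inside the timed region. Here is a minimal, self-contained sketch of that idea (my own example, not the original test: the dataset is kept small because this particular query is quadratic, and the Distinct() is my addition so each duplicate is reported once):
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        var rng = new Random();
        var dataset = Enumerable.Range(0, 5000).Select(_ => rng.Next(2500)).ToList();

        var watch = Stopwatch.StartNew();
        var duplicates = dataset
            .Where(d => dataset.Count(i => i == d) > 1)
            .Distinct()
            .ToList();   // ToList() forces the deferred query to execute now
        watch.Stop();    // the timing now includes the real work

        Console.WriteLine(duplicates.Count + " duplicate values found in " + watch.ElapsedMilliseconds + " ms");
    }
}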
I would suggest that LINQ is only faster than a 'normal loop' when your algorithm is less than perfect (or you have some problem in your code). So LINQ will be faster at sorting than you are if you don't write an efficient sorting algorithm, etc.
LINQ is usually 'as fast as' or 'close enough to' the speed of a normal loop, and can be faster (and simpler) to code / debug / read. That's its benefit - not execution speed.
If it's performing faster than an empty loop, you are doing something wrong. Most likely, as suggested in comments, you aren't considering deferred execution and the LINQ statement is not actually executing.
If you did not compile with "Optimize Code" enabled, you would probably see this behaviour. (It would certainly explain why the empty loop was not removed.)
The code underlying LINQ, however, is part of already-compiled code, which will certainly have been optimised (by the JIT, NGen or similar).
Related
This thread says that LINQ's OrderBy uses quicksort. I'm struggling to see how that makes sense, given that OrderBy returns an IEnumerable.
Let's take the following piece of code for example.
int[] arr = new int[] { 1, -1, 0, 60, -1032, 9, 1 };
var ordered = arr.OrderBy(i => i);
foreach(int i in ordered)
Console.WriteLine(i);
The loop is the equivalent of
var mover = ordered.GetEnumerator();
while(mover.MoveNext())
Console.WriteLine(mover.Current);
MoveNext() returns the next smallest element. The way LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created, so each time you call MoveNext() the IEnumerator finds the next smallest element. That doesn't make sense, because during the execution of quicksort there is no concept of a current smallest and next smallest element.
Where is the flaw in my thinking here?
The way LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created
This statement is false. The flaw in your thinking is that you believe a false statement.
The LINQ to Objects implementation is smart about deferring work when possible at a reasonable cost. As you correctly note, it is not possible in the case of sorting. OrderBy produces as its result an object which, when MoveNext is called, enumerates the entire source sequence, generates the sorted list in memory and then enumerates the sorted list.
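To make that concrete, here is a rough sketch of an operator with that shape. This is an illustration only, not the real OrderedEnumerable (which, among other things, performs a stable sort and runs the key selector once per element):
using System;
using System.Collections.Generic;
using System.Linq;

static class OrderBySketch
{
    // Nothing in the method body runs until the caller's first MoveNext();
    // at that point the whole source is buffered and sorted, and only then
    // are elements handed out one at a time.
    public static IEnumerable<TSource> OrderBySimple<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        var buffer = source.ToList();
        buffer.Sort((a, b) => Comparer<TKey>.Default.Compare(keySelector(a), keySelector(b)));
        foreach (var item in buffer)
            yield return item;
    }
}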
Similarly, joining and grouping also must enumerate the whole sequence before the first element is enumerated. (Logically, a join is just a cross product with a filter, and the work could be spread out over each MoveNext() but that would be inefficient; for practicality, a lookup table is built. It is educational to work out the asymptotic space vs time tradeoff; give it a shot.)
The source code is available; I encourage you to read it if you have questions about the implementation. Or check out Jon's "edulinq" series.
There's a great answer already, but to add a few things:
Enumerating the results of OrderBy() obviously can't yield an element until it has processed all of the elements, because only once it has seen the last input element can it know that that last element isn't the first one it must yield. It also has to work on sources that can't be repeated or that give different results each time. So even if the developers were zealous enough to want to find the nth element anew on each cycle, buffering is a logical requirement.
The quicksort is lazy in two regards, though. One is that rather than sorting the elements to return based on the keys from the delegate passed to the method, it sorts a mapping:
Buffer all the elements.
Get the keys. Note that this means the delegate is run only once per element. Among other things, it means that non-pure key selectors won't cause problems.
Get a map of numbers from 0 to n.
Sort the map.
Enumerate through the map, yielding the associated element each time.
So there is a sort of laziness in the final sorting of elements. This is significant in cases where moving elements is expensive (large value types).
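A hedged sketch of those steps (my own illustration; the real EnumerableSorter is more elaborate and, unlike Array.Sort below, performs a stable sort):
using System;
using System.Collections.Generic;
using System.Linq;

static class MapSortSketch
{
    public static IEnumerable<T> OrderByViaMap<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> keySelector, IComparer<TKey> comparer)
    {
        T[] elements = source.ToArray();          // 1. buffer all the elements
        var keys = new TKey[elements.Length];     // 2. run the key selector exactly once per element
        for (int i = 0; i < elements.Length; i++)
            keys[i] = keySelector(elements[i]);

        var map = new int[elements.Length];       // 3. a map of indices 0..n-1
        for (int i = 0; i < map.Length; i++)
            map[i] = i;

        Array.Sort(map, (x, y) => comparer.Compare(keys[x], keys[y]));  // 4. sort the map, not the elements

        foreach (int index in map)                // 5. yield the associated element for each map entry
            yield return elements[index];
    }
}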
There is of course also laziness in that none of the above is done until after the first attempt to enumerate, so until you call MoveNext() the first time, it won't have happened.
In .NET Core there is further laziness building on that, depending on what you then do with the results of OrderBy. Since OrderBy contains information about how to sort rather than the sorted buffer, the class returned by OrderBy can do something else with that information other than quicksorting:
The most obvious is ThenBy, which all implementations support. When you call ThenBy or ThenByDescending you get a new, similar class with different information about how to sort, and the sort that the original OrderBy result could have done will probably never happen.
First() and Last() don't need to sort at all. Logically source.OrderBy(del).First() is a variant of source.Min() where del contains the information to determine what defines "less than" for that Min(). Therefore if you call First() on the results of an OrderBy() that's exactly what is done. The laziness of OrderBy allows it to do this instead of quicksort. (Which means O(n) time complexity and O(1) space complexity instead of O(n log n) and O(n) respectively).
Skip() and Take() define a subsequence of a sequence, which with OrderBy must conceptually happen after the sort. But since they are lazy too, what can be returned is an object that knows how to sort, how many to skip, and how many to take. As such, a partial quicksort can be used so that the source need only be partially sorted: if a partition falls outside the range that will be returned, there's no point sorting it.
ElementAt() places more of a burden than First() or Last() but again doesn't require a full quicksort. Quickselect can be used to find just one result; if you're looking for the 3rd element and you've partitioned a set of 200 elements around the 90th element then you only need to look further in the first partition and can ignore the second partition from now on. Best-case and average-case time complexity is O(n).
The above can be combined, so e.g. .Skip(10).First() is equivalent to ElementAt(10) and can be treated as such.
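For illustration, these are the kinds of call chains those optimisations target (behaviour as described above on .NET Core; on the classic .NET Framework the same calls are still correct but do the full sort):
var numbers = new List<int> { 5, 3, 8, 1, 9, 2, 7, 4, 6, 0 };

var smallest = numbers.OrderBy(n => n).First();                  // can be done as a Min-style O(n) scan
var third    = numbers.OrderBy(n => n).ElementAt(2);             // quickselect-style partial work
var page     = numbers.OrderBy(n => n).Skip(3).Take(2).ToList(); // partial sort of just the needed range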
All of these exceptions to getting the entire buffer and sorting it all have one thing in common: they were all implemented after identifying a way in which the correct result can be returned after making the computer do less work*. That new[] {1, 2, 3, 4}.Where(i => i % 2 == 0) will yield the 2 before it has seen the 4 (or even the 3 it won't yield) comes from the same general principle. It just comes at it more easily (though there are still specialised variants of Where() results behind the scenes to provide other optimisations).
But note that Enumerable.Range(1, 10000).Where(i => i >= 10000) scans through 9999 elements to yield that first result. Really it's not all that different from OrderBy's buffering; they're both bringing you the next result as quickly as they can†, and what differs is just what that means.
*And also identifying that the effort to detect and make use of the features of a particular case is worth it. E.g. many aggregate calls like Sum() could be optimised on the results of OrderBy by skipping the ordering completely. But this can generally be realised by the caller, who can just leave out the OrderBy, so while adding that optimisation would make the rare case much faster at the price of making most calls to Sum() slightly slower, the case that benefits shouldn't really be happening anyway.
†Well, pretty much as quickly. It would be possible to get the first results back more quickly than OrderBy does (once the leftmost part of the sequence is sorted, start handing out results), but that comes at a cost that would affect the later results, so the trade-off isn't necessarily that doing so would be better.
Suppose I have the following code:
var X = XElement.Parse (@"
<ROOT>
<MUL v='2' />
<MUL v='3' />
</ROOT>
");
Enumerable.Range (1, 100)
.Select (s => X.Elements ()
.Select (t => Int32.Parse (t.Attribute ("v").Value))
.Aggregate (s, (t, u) => t * u)
)
.ToList ()
.ForEach (s => Console.WriteLine (s));
What is the .NET runtime actually doing here? Is it parsing and converting the attributes to integers each of the 100 times, or is it smart enough to figure out that it should cache the parsed values and not repeat the computation for each element in the range?
Moreover, how would I go about figuring out something like this myself?
Thanks in advance for your help.
LINQ and IEnumerable<T> is pull based. This means that the predicates and actions that are part of the LINQ statement in general are not executed until values are pulled. Furthermore the predicates and actions will execute each time values are pulled (e.g. there is no secret caching going on).
Pulling from an IEnumerable<T> is done by the foreach statement which really is syntactic sugar for getting an enumerator by calling IEnumerable<T>.GetEnumerator() and repeatedly calling IEnumerator<T>.MoveNext() to pull the values.
LINQ operators like ToList(), ToArray(), ToDictionary() and ToLookup() wrap a foreach statement, so these methods will do a pull. The same can be said about operators like Aggregate(), Count() and First(). These methods have in common that they produce a single result that has to be created by executing a foreach statement.
Many LINQ operators produce a new IEnumerable<T> sequence. When an element is pulled from the resulting sequence the operator pulls one or more elements from the source sequence. The Select() operator is the most obvious example, but others are SelectMany(), Where(), Concat(), Union(), Distinct(), Skip() and Take(). These operators don't do any caching. When the Nth element is pulled from a Select() it pulls the Nth element from the source sequence, applies the projection using the delegate supplied and returns it. Nothing secret going on here.
Other LINQ operators also produce new IEnumerable<T> sequences but they are implemented by actually pulling the entire source sequence, doing their job and then producing a new sequence. These methods include Reverse(), OrderBy() and GroupBy(). However, the pull done by the operator is only performed when the operator itself is pulled meaning that you still need a foreach loop "at the end" of the LINQ statement before anything is executed. You could argue that these operators use a cache because they immediately pull the entire source sequence. However, this cache is built each time the operator is iterated so it is really an implementation detail and not something that will magically detect that you are applying the same OrderBy() operation multiple times to the same sequence.
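As a small sketch of that lack of caching (my example, not from the question): each time the OrderBy result is iterated, the source is pulled again and a fresh buffer is built and sorted.
using System;
using System.Linq;

class NoCachingDemo
{
    static void Main()
    {
        var rng = new Random();
        // Deferred source: produces a fresh set of random numbers every time it is pulled.
        var source = Enumerable.Range(0, 5).Select(_ => rng.Next(100));
        var sorted = source.OrderBy(n => n);

        Console.WriteLine(string.Join(", ", sorted)); // buffers and sorts one set of numbers
        Console.WriteLine(string.Join(", ", sorted)); // pulls the source again: different numbers, sorted again
    }
}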
In your example the ToList() will do a pull. The action in the outer Select will execute 100 times. Each time this action is executed the Aggregate() will do another pull that will parse the XML attributes. In total your code will call Int32.Parse() 200 times.
You can improve this by pulling the attributes once instead of on each iteration:
var X = XElement.Parse (@"
<ROOT>
<MUL v='2' />
<MUL v='3' />
</ROOT>
")
.Elements ()
.Select (t => Int32.Parse (t.Attribute ("v").Value))
.ToList ();
Enumerable.Range (1, 100)
.Select (s => X.Aggregate (s, (t, u) => t * u))
.ToList ()
.ForEach (s => Console.WriteLine (s));
Now Int32.Parse() is only called 2 times. However, the cost is that a list of attribute values has to be allocated, stored and eventually garbage collected. (Not a big concern when the list contains two elements.)
Note that if you forget the first ToList() that pulls the attributes the code will still run but with the exact same performance characteristics as the original code. No space is used to store the attributes but they are parsed on each iteration.
It has been a while since I dug through this code but, IIRC, the way Select works is to simply cache the Func you supply it and run it on the source collection one at a time. So, for each element in the outer range, it will run the inner Select/Aggregate sequence as if it were the first time. There isn't any built-in caching going on -- you would have to implement that yourself in the expressions.
If you wanted to figure this out yourself, you've got three basic options:
Compile the code and use ildasm to view the IL; it's the most accurate but, especially with lambdas and closures, what you get from IL may look nothing like what you put into the C# compiler.
Use something like dotPeek to decompile System.Linq.dll into C#; again, what you get out of these kinds of tools may only approximately resemble the original source code, but at least it will be C# (and dotPeek in particular does a pretty good job, and is free.)
My personal preference - download the .NET 4.0 Reference Source and look for yourself; this is what it's for :) You have to just trust MS that the reference source matches the actual source used to produce the binaries, but I don't see any good reason to doubt them.
As pointed out by @AllonGuralnek you can set breakpoints on specific lambda expressions within a single line; put your cursor somewhere inside the body of the lambda and press F9 and it will set a breakpoint on just the lambda. (If you do it wrong, it will highlight the entire line in the breakpoint color; if you do it right, it will just highlight the lambda.)
I'd like to know the difference between retrieving items from a list using a for (or foreach) loop and retrieving them using a LINQ query, especially regarding speed and any other differences.
Example:
Say List A = new List() contains 10,000 rows and I need to filter and copy some of them from list A. Which is better in terms of speed: a for loop or a LINQ query?
You could benchmark yourself and find out. (After all, only you know the particular circumstances in which you'll need to be running these loops and queries.)
My (very crude) rule-of-thumb -- which has so many caveats and exceptions as to be almost useless -- is that a for loop will generally be slightly faster than a foreach which will generally be slightly faster than a sensibly-written LINQ query.
You should use whatever construct makes the most sense for your particular situation. If what you want to do is best expressed with a for loop then do that; if it's best expressed as a foreach then do that; if it's best expressed as a query then use LINQ.
Only if and when you find that performance isn't good enough should you consider re-writing code that's expressive and correct into something faster and less expressive (but hopefully still correct).
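If you do decide to measure, a rough harness might look like this sketch (my example; use a Release build with no debugger attached, and expect the numbers to vary with your data and hardware):
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class LoopVsLinqBenchmark
{
    static void Main()
    {
        var data = Enumerable.Range(0, 1000000).Select(i => i % 100).ToList();

        var sw = Stopwatch.StartNew();
        var withFor = new List<int>();
        for (int i = 0; i < data.Count; i++)
            if (data[i] > 50) withFor.Add(data[i]);
        Console.WriteLine("for:     " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        var withForeach = new List<int>();
        foreach (var n in data)
            if (n > 50) withForeach.Add(n);
        Console.WriteLine("foreach: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        var withLinq = data.Where(n => n > 50).ToList(); // ToList() forces execution inside the timed region
        Console.WriteLine("LINQ:    " + sw.ElapsedMilliseconds + " ms");
    }
}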
If we're talking regular LINQ, then we're focusing on IEnumerable<T> (LINQ-to-Objects) and IQueryable<T> (LINQ-to-most-other-stuff). Since IQueryable<T> : IEnumerable<T>, it is automatic that you can use foreach - but what this means is very query-specific, since LINQ is generally lazily spooling data from an underlying source. Indeed, that source can be infinite:
public IEnumerable<int> Forever() {
int i = 0;
while(true) yield return i++;
}
...
foreach(int i in Forever()) {
Console.WriteLine(i);
if(Console.ReadLine() == "exit") break;
}
However, a for loop requires the length and an indexer. Which in real terms, typically means calling ToList() or ToArray():
var list = source.ToList();
for (int i = 0; i < list.Count; i++) { /* do something with list[i] */ }
This is interesting in various ways: firstly, it will die for infinite sequences ;p. However, it also moves the spooling earlier. So if we are reading from an external data source, the for/foreach loop over the list will be quicker, but simply because we've moved a lot of work to ToList() (or ToArray(), etc).
Another important feature of performing the ToList() earlier is that you have closed the reader. You might need to operate on data inside the list, and that isn't always possible while a reader is open; iterators break while enumerating, for example - or perhaps more notably, unless you use "MARS" SQL Server only allows one reader per connection. As a counterpoint, that reeks of "n+1", so watch for that too.
Over a local list/array/etc., it is largely immaterial which loop strategy you use.
Is this:
Box boxToFind = AllBoxes.FirstOrDefault(box => box.BoxNumber == boxToMatchTo.BoxNumber);
Faster or slower than this:
Box boxToFind = null;
foreach (Box box in AllBoxes)
{
if (box.BoxNumber == boxToMatchTo.BoxNumber)
{
boxToFind = box;
}
}
Both give me the result I am looking for (boxToFind). This is going to run on a mobile device that I need to be performance conscientious of.
It should be about the same, except that you need to call First (or, to match your code, Last), not Where.
Calling Where will give you a set of matching items (an IEnumerable<Box>); you only want one matching item.
In general, when using LINQ, you need to be aware of deferred execution. In your particular case, it's irrelevant, since you're getting a single item.
The difference is not important unless you've identified this particular loop as a performance bottleneck through profiling.
If profiling does find it to be a problem, then you'll want to look into alternate storage. Store the data in a dictionary which provides faster lookup than looping through an array.
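For example, something along these lines (a sketch building on the question's code; it assumes BoxNumber values are unique within AllBoxes):
// Build the lookup once (O(n)); each subsequent match is then O(1)
// instead of a linear scan through AllBoxes.
var boxesByNumber = AllBoxes.ToDictionary(b => b.BoxNumber);

Box boxToFind;
boxesByNumber.TryGetValue(boxToMatchTo.BoxNumber, out boxToFind);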
If micro-optimization is your thing, LINQ performs worse. This is just one article; there are a lot of other posts you can find.
Micro optimization will kill you.
First, finish the whole class, then, if you have performance problems, run a profiler and check for the hotspots of the application.
Make sure you're using the best algorithms you can, then turn to micro optimizations like this.
In case you already did :
Slow -> Fast
LINQ < foreach < for < unsafe for (The last option is not recommended).
Abstractions will make your code slower, 95% of the time.
The fastest is a for loop, but the difference is so small that you can ignore it. It will only matter if you are building a real-time application, but for those applications C# is probably not the best choice anyway!
If AllBoxes is an IQueryable, it can be faster than the loop, because the queryable could have an optimized implementation of the Where-operation (for example an indexed access).
LINQ is absolutely 100% slower
Depends on what you are trying to accomplish in your program, but for the most part this is most certainly what I would call LAZY PROGRAMMER CODE...
You are going to essentially "stall-out" if you are performing any complex queries, joins etc... total p.o.s for those types of functions/methods- just don't use it. If you do this the hard/long way you will be much happier in the long run...and performance will be a world apart.
NOTE:
I would definitely not recommend LINQ for any program built for speed/synchronization tasks/computation
(i.e. HFT trading &/or AT trading i-0-i for starters).
TESTED:
It took nearly 10 seconds to complete a join in "LINQ" vs. < 1 millisecond.
LINQ vs Loop – A performance test
LINQ: 00:00:04.1052060, avg. 00:00:00.0041052
Loop: 00:00:00.0790965, avg. 00:00:00.0000790
References:
http://ox.no/posts/linq-vs-loop-a-performance-test
http://www.schnieds.com/2009/03/linq-vs-foreach-vs-for-loop-performance.html
I recently found LINQ and love it. I find lots of occasions where using it is so much more expressive than the longhand version, but a colleague passed a comment about me abusing this technology, which now has me second-guessing myself. It is my perspective that if a technology works efficiently and the code is elegant, then why not use it? Is that wrong? I could spend extra time writing out processes "longhand" and, while the resulting code may be a few ms faster, it's 2-3 times more code and therefore 2-3 times more chance that there may be bugs.
Is my view wrong? Should I be writing my code out longhand rather than using LINQ? Isn't this what LINQ was designed for?
Edit: I was speaking about LINQ to objects, I don't use LINQ to XML so much and I have used LINQ to SQL but I'm not so enamoured with those flavours as LINQ to objects.
I have to agree with your view - if it's more efficient to write and elegant, then what's a few milliseconds? Writing extra code gives more room for bugs to creep in, and it's extra code that needs to be tested and, most of all, extra code to maintain. Think about the guy who's going to come in behind you and maintain your code - they'll thank you for writing elegant, easy-to-read code long before they thank you for writing code that's a few ms faster!
Beware though, this cost of a few ms could be significant when you take the bigger picture into account. If that few milliseconds is part of a loop of thousands of repetitions, then the milliseconds add up fast.
Yes you can love LINQ too much - Single Statement LINQ RayTracer
Where do you draw the line? I'd say use LINQ as much as it makes the code simpler and easier to read.
The moment the LINQ version becomes more difficult to understand than the non-LINQ version, it's time to swap, and vice versa. EDIT: This mainly applies to LINQ-to-Objects, as the other LINQ flavours have their own benefits.
It's not possible to love LINQ to Objects too much; it's a freaking awesome technology!
But seriously, anything that makes your code simple to read, simple to maintain and does the job it was intended for, then you would be silly not to use it as much as you can.
LINQ's supposed to be used to make filtering, sorting, aggregating and manipulating data from various sources as intuitive and expressive as possible. I'd say, use it wherever you feel it's the tidiest, most expressive and most natural syntax for doing what it is you're trying to do, and don't feel guilty about it.
If you start humping the documentation, then it may be time to reconsider your position.
It's cases like these where it's important to remember the golden rules of optimization:
Don't Do It
For Experts: Don't do it yet
You should absolutely not worry about "abusing" LINQ unless you can identify it explicitly as the cause of a performance problem.
Like anything, it can be abused. As long as you stay away from obvious poor decisions such as
var v = List.Where(...);
for(int i = 0; i < v.Count(); i++)
{...}
and understand how deferred execution works, then it is most likely not going to be much slower than the longhand way. According to Anders Hejlsberg (C# architect), the C# compiler is not particularly good at optimizing loops; however, it is getting much better at optimizing and parallelizing expression trees. In time, it may be more effective than a loop. List<T>'s ForEach version is actually as fast as a for loop, although I can't find the link that proves that.
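For reference, one way to avoid the poor decision shown above is to materialise the query once, so Count() doesn't re-run the whole Where on every pass (a small sketch, with a concrete list standing in for the elided query):
var numbers = new List<int> { 1, 2, 3, 4, 5, 6 };

var v = numbers.Where(n => n % 2 == 0).ToList(); // evaluate the query once
for (int i = 0; i < v.Count; i++)                // Count is now a property, not a re-enumeration
{
    Console.WriteLine(v[i]);
}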
P.S. My personal favorite is ForEach<>'s lesser known cousin IndexedForEach (utilizing extension methods)
List.IndexedForEach((p, i) =>
{
    if (i != 3)
        p.DoSomething(i);
});
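IndexedForEach isn't in the BCL; an extension method along these lines (my guess at the shape of such a helper) would provide it:
using System;
using System.Collections.Generic;

public static class ListExtensions
{
    // Hypothetical helper: like List<T>.ForEach, but also hands the element's index to the action.
    public static void IndexedForEach<T>(this List<T> list, Action<T, int> action)
    {
        for (int i = 0; i < list.Count; i++)
            action(list[i], i);
    }
}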
LINQ can be like art. Keep using it to make the code beautiful.
You're answering your own question by talking about writing 2-3 times more code for a few ms of performance. I mean, if your problem domain requires that speedup then yes, if not probably not. However, is it really only a few ms of performance or is it > 5% or > 10%. This is a value judgement based on the individual case.
Where to draw the line?
Well, we already know that it is a bad idea to implement your own quicksort in LINQ, at least compared to just using LINQ's OrderBy.
I've found that using LINQ has sped up my development and made it easier to avoid the stupid mistakes that loops can introduce. I have had instances where the performance of LINQ was poor, but that was when I was using it for things like fetching data for an Excel file from a tree structure that had millions of nodes.
While I see how there is a point of view that LINQ might make a statement harder to read, I think it is far outweighed by the fact that my methods are now strictly related to the problems that they are solving and not spending time either including lookup loops or cluttering classes with dedicated lookup functions.
It took a little while to get used to doing things with LINQ, since looping lookups, and the like, have been the main option for so long. I look at LINQ as just being another type of syntactic sugar that can do the same task in a more elegant way. Right now, I am still avoiding it in processing-heavy mission critical code - but that is just until the performance improves as LINQ evolves.
My only concern about LINQ is with its implementation of joins.
As I determined when trying to answer this question (and it's confirmed here), the code LINQ generates to perform joins is (necessarily, I guess) naive: for each item in the list, the join performs a linear search through the joined list to find matches.
Adding a join to a LINQ query essentially turns a linear-time algorithm into a quadratic-time algorithm. Even if you think premature optimization is the root of all evil, the jump from O(n) to O(n^2) should give you pause. (It's O(n^3) if you join through a joined item to another collection, too.)
It's relatively easy to work around this. For instance, this query:
var list = from pr in parentTable.AsEnumerable()
join cr in childTable.AsEnumerable() on pr.Field<int>("ID") equals cr.Field<int>("ParentID")
where pr.Field<string>("Value") == "foo"
select cr;
is analogous to how you'd join two tables in SQL Server. But it's terribly inefficient in LINQ: for every parent row that the where clause returns, the query scans the entire child table. (Even if you're joining on an unindexed field, SQL Server will build a hashtable to speed up the join if it can. That's a little outside LINQ's pay grade.)
This query, however:
string fk = "FK_ChildTable_ParentTable";
var list = from cr in childTable.AsEnumerable()
where cr.GetParentRow(fk).Field<string>("Value") == "foo"
select cr;
produces the same result, but it scans the child table once only.
If you're using LINQ to objects, the same issues apply: if you want to join two collections of any significant size, you're probably going to need to consider implementing a more efficient method to find the joined object, e.g.:
Dictionary<Foo, Bar> map = buildMap(foos, bars);
var list = from Foo f in foos
where map[f].baz == "bat"
select f;
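buildMap is the placeholder above; a hedged sketch of how such a map might be built (the Foo.BarId and Bar.Id key properties are illustrative assumptions, and the keys are assumed to be unique):
// Hypothetical helper for the snippet above: one pass over each collection,
// so the overall join stays O(n) rather than O(n^2).
static Dictionary<Foo, Bar> buildMap(IEnumerable<Foo> foos, IEnumerable<Bar> bars)
{
    var barsById = bars.ToDictionary(b => b.Id);               // index the "inner" side once
    return foos.ToDictionary(f => f, f => barsById[f.BarId]);  // look each Foo's Bar up in O(1)
}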