Suppose I have the following code:
var X = XElement.Parse (@"
<ROOT>
<MUL v='2' />
<MUL v='3' />
</ROOT>
");
Enumerable.Range (1, 100)
.Select (s => X.Elements ()
.Select (t => Int32.Parse (t.Attribute ("v").Value))
.Aggregate (s, (t, u) => t * u)
)
.ToList ()
.ForEach (s => Console.WriteLine (s));
What is the .NET runtime actually doing here? Is it parsing and converting the attributes to integers each of the 100 times, or is it smart enough to figure out that it should cache the parsed values and not repeat the computation for each element in the range?
Moreover, how would I go about figuring out something like this myself?
Thanks in advance for your help.
LINQ and IEnumerable<T> are pull-based. This means that the predicates and actions that are part of the LINQ statement are in general not executed until values are pulled. Furthermore, the predicates and actions will execute each time values are pulled (e.g. there is no secret caching going on).
Pulling from an IEnumerable<T> is done by the foreach statement, which really is syntactic sugar for getting an enumerator by calling IEnumerable<T>.GetEnumerator() and repeatedly calling IEnumerator<T>.MoveNext() to pull the values.
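For illustration, here is roughly what the compiler expands a foreach statement into (a sketch; the real expansion also handles non-generic enumerators and other corner cases):

using System;
using System.Collections.Generic;

class ForeachDesugared
{
    static void Main()
    {
        IEnumerable<int> sequence = new List<int> { 1, 2, 3 };

        // Equivalent of: foreach (int item in sequence) Console.WriteLine(item);
        using (IEnumerator<int> e = sequence.GetEnumerator())
        {
            while (e.MoveNext())   // each MoveNext() pulls one value
            {
                int item = e.Current;
                Console.WriteLine(item);
            }
        }
    }
}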
LINQ operators like ToList(), ToArray(), ToDictionary() and ToLookup() wrap a foreach statement, so these methods will do a pull. The same can be said about operators like Aggregate(), Count() and First(). These methods have in common that they produce a single result that has to be created by executing a foreach statement.
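A simplified sketch of what an operator like Aggregate looks like inside (hypothetical MyAggregate, ignoring argument validation) makes the pull explicit; every call re-enumerates the source:

using System;
using System.Collections.Generic;

static class AggregateSketch
{
    // Hypothetical, simplified version of Enumerable.Aggregate with a seed:
    // it is just a foreach in disguise, so every call pulls the whole source.
    public static TAccumulate MyAggregate<TSource, TAccumulate>(
        this IEnumerable<TSource> source,
        TAccumulate seed,
        Func<TAccumulate, TSource, TAccumulate> func)
    {
        TAccumulate result = seed;
        foreach (TSource element in source)
            result = func(result, element);
        return result;
    }
}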
Many LINQ operators produce a new IEnumerable<T> sequence. When an element is pulled from the resulting sequence the operator pulls one or more elements from the source sequence. The Select() operator is the most obvious example, but other examples are SelectMany(), Where(), Concat(), Union(), Distinct(), Skip() and Take(). These operators don't do any caching. When the Nth element is pulled from a Select() it pulls the Nth element from the source sequence, applies the projection using the delegate supplied and returns it. Nothing secret going on here.
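A simplified sketch of Select (hypothetical MySelect, not the real implementation) shows why: the projection runs on every pull, and nothing is remembered between enumerations:

using System;
using System.Collections.Generic;

static class SelectSketch
{
    // Hypothetical, simplified version of Enumerable.Select: the iterator
    // applies the projection each time an element is pulled; nothing is cached.
    public static IEnumerable<TResult> MySelect<TSource, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, TResult> selector)
    {
        foreach (TSource element in source)
            yield return selector(element);  // runs again on every enumeration
    }
}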
Other LINQ operators also produce new IEnumerable<T> sequences but they are implemented by actually pulling the entire source sequence, doing their job and then producing a new sequence. These methods include Reverse(), OrderBy() and GroupBy(). However, the pull done by the operator is only performed when the operator itself is pulled meaning that you still need a foreach loop "at the end" of the LINQ statement before anything is executed. You could argue that these operators use a cache because they immediately pull the entire source sequence. However, this cache is built each time the operator is iterated so it is really an implementation detail and not something that will magically detect that you are applying the same OrderBy() operation multiple times to the same sequence.
In your example the ToList() will do a pull. The action in the outer Select will execute 100 times. Each time this action is executed the Aggregate() will do another pull that will parse the XML attributes. In total your code will call Int32.Parse() 200 times.
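You can verify the count yourself by routing the parsing through a counter; this is a diagnostic sketch of the original query (the CountedParse helper is mine, not part of the original code):

using System;
using System.Linq;
using System.Xml.Linq;

class ParseCounter
{
    static int parseCount;

    static int CountedParse(string s)
    {
        parseCount++;              // count every parse
        return Int32.Parse(s);
    }

    static void Main()
    {
        var X = XElement.Parse(@"
        <ROOT>
        <MUL v='2' />
        <MUL v='3' />
        </ROOT>
        ");

        Enumerable.Range(1, 100)
            .Select(s => X.Elements()
                .Select(t => CountedParse(t.Attribute("v").Value))
                .Aggregate(s, (t, u) => t * u))
            .ToList()
            .ForEach(s => Console.WriteLine(s));

        Console.WriteLine("Int32.Parse was called {0} times", parseCount); // 200
    }
}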
You can improve this by pulling the attributes once instead of on each iteration:
var X = XElement.Parse (@"
<ROOT>
<MUL v='2' />
<MUL v='3' />
</ROOT>
")
.Elements ()
.Select (t => Int32.Parse (t.Attribute ("v").Value))
.ToList ();
Enumerable.Range (1, 100)
.Select (s => X.Aggregate (s, (t, u) => t * u))
.ToList ()
.ForEach (s => Console.WriteLine (s));
Now Int32.Parse() is only called 2 times. However, the cost is that a list of attribute values has to be allocated, stored and eventually garbage collected. (Not a big concern when the list contains two elements.)
Note that if you forget the first ToList() that pulls the attributes the code will still run but with the exact same performance characteristics as the original code. No space is used to store the attributes but they are parsed on each iteration.
It has been a while since I dug through this code but, IIRC, the way Select works is to simply cache the Func you supply it and run it on the source collection one element at a time. So, for each element in the outer range, it will run the inner Select/Aggregate sequence as if it were the first time. There isn't any built-in caching going on; you would have to implement that yourself in the expressions.
If you wanted to figure this out yourself, you've got three basic options:
Compile the code and use ildasm to view the IL; it's the most accurate but, especially with lambdas and closures, what you get from IL may look nothing like what you put into the C# compiler.
Use something like dotPeek to decompile System.Linq.dll into C#; again, what you get out of these kinds of tools may only approximately resemble the original source code, but at least it will be C# (and dotPeek in particular does a pretty good job, and is free.)
My personal preference - download the .NET 4.0 Reference Source and look for yourself; this is what it's for :) You have to just trust MS that the reference source matches the actual source used to produce the binaries, but I don't see any good reason to doubt them.
As pointed out by @AllonGuralnek, you can set breakpoints on specific lambda expressions within a single line; put your cursor somewhere inside the body of the lambda and press F9 and it will break on just the lambda. (If you do it wrong, it will highlight the entire line in the breakpoint color; if you do it right, it will just highlight the lambda.)
Related
This thread says that LINQ's OrderBy uses Quicksort. I'm struggling to see how that makes sense given that OrderBy returns an IEnumerable.
Let's take the following piece of code for example.
int[] arr = new int[] { 1, -1, 0, 60, -1032, 9, 1 };
var ordered = arr.OrderBy(i => i);
foreach(int i in ordered)
Console.WriteLine(i);
The loop is the equivalent of
var mover = ordered.GetEnumerator();
while(mover.MoveNext())
Console.WriteLine(mover.Current);
The MoveNext() returns the next smallest element. The way that LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created, so each time you call MoveNext() the IEnumerator finds the next smallest element. That doesn't make sense, because during the execution of Quicksort there is no concept of a current smallest and next smallest element.
Where is the flaw in my thinking here?
the way that LINQ works, unless you "cash out" of the query by using ToList() or similar, there are not supposed to be any intermediate lists created
This statement is false. The flaw in your thinking is that you believe a false statement.
The LINQ to Objects implementation is smart about deferring work when possible at a reasonable cost. As you correctly note, it is not possible in the case of sorting. OrderBy produces as its result an object which, when MoveNext is called, enumerates the entire source sequence, generates the sorted list in memory and then enumerates the sorted list.
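The shape of such an operator can be sketched as follows (hypothetical MyOrderBy, not the real implementation, which among other things computes each key only once):

using System;
using System.Collections.Generic;
using System.Linq;

static class OrderBySketch
{
    // Hypothetical sketch of an OrderBy-like operator: calling it does no
    // work; the first MoveNext() buffers and sorts the entire source, and
    // only then are elements handed out one at a time.
    public static IEnumerable<TSource> MyOrderBy<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        // Iterator method: nothing below runs until the first MoveNext().
        var buffer = source.ToList();                       // pull everything
        buffer.Sort((a, b) => Comparer<TKey>.Default
            .Compare(keySelector(a), keySelector(b)));      // sort in memory
        foreach (var item in buffer)
            yield return item;                              // then yield
    }
}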
Similarly, joining and grouping also must enumerate the whole sequence before the first element is enumerated. (Logically, a join is just a cross product with a filter, and the work could be spread out over each MoveNext() but that would be inefficient; for practicality, a lookup table is built. It is educational to work out the asymptotic space vs time tradeoff; give it a shot.)
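As a sketch of the lookup-table approach (hypothetical MyJoin; the real implementation differs in details such as result ordering): buffering one side costs O(m) space but makes each probe O(1), versus O(n * m) time for the filtered cross product.

using System;
using System.Collections.Generic;
using System.Linq;

static class JoinSketch
{
    // Hypothetical hash-join sketch: the inner sequence is pulled entirely
    // into a lookup on the first MoveNext(); the outer is then streamed.
    public static IEnumerable<TResult> MyJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer, IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKey, Func<TInner, TKey> innerKey,
        Func<TOuter, TInner, TResult> resultSelector)
    {
        var lookup = inner.ToLookup(innerKey);  // O(m) space, built up front
        foreach (var o in outer)                // O(n) probes, O(1) each
            foreach (var i in lookup[outerKey(o)])
                yield return resultSelector(o, i);
    }
}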
The source code is available; I encourage you to read it if you have questions about the implementation. Or check out Jon's "edulinq" series.
There's a great answer already, but to add a few things:
Enumerating the results of OrderBy() obviously can't yield an element until it has processed all elements, because not until it has seen the last input element can it know that that last element isn't the first it must yield. It also must work on sources that can't be repeated or that will give different results each time. As such, even if the developers had wanted to find the nth element anew on each cycle, buffering is a logical requirement.
The quicksort is lazy in two regards though. One is that rather than sort the elements to return based on the keys from the delegate passed to the method, it sorts a mapping:
1. Buffer all the elements.
2. Get the keys. Note that this means the delegate is run only once per element. Among other things it means that non-pure key selectors won't cause problems.
3. Get a map of numbers from 0 to n - 1.
4. Sort the map.
5. Enumerate through the map, yielding the associated element each time.
So there is a sort of laziness in the final sorting of elements. This is significant in cases where moving elements is expensive (large value types).
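Those steps can be sketched like this (hypothetical MapSortedBy, not the actual source; it uses Array.Sort rather than the internal quicksort):

using System;
using System.Collections.Generic;
using System.Linq;

static class MapSortSketch
{
    // Hypothetical sketch of the buffer/keys/map approach described above:
    // elements are never moved; keys are computed once; only an index map
    // is sorted, and elements are yielded through it.
    public static IEnumerable<TSource> MapSortedBy<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        TSource[] buffer = source.ToArray();        // 1. buffer the elements
        var keys = new TKey[buffer.Length];         // 2. one key per element
        for (int i = 0; i < buffer.Length; i++)
            keys[i] = keySelector(buffer[i]);

        var map = new int[buffer.Length];           // 3. map of 0 .. n - 1
        for (int i = 0; i < map.Length; i++)
            map[i] = i;

        Array.Sort(map, (a, b) =>                   // 4. sort the map
            Comparer<TKey>.Default.Compare(keys[a], keys[b]));

        foreach (int index in map)                  // 5. yield via the map
            yield return buffer[index];
    }
}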
There is of course also laziness in that none of the above is done until after the first attempt to enumerate, so until you call MoveNext() the first time, it won't have happened.
In .NET Core there is further laziness building on that, depending on what you then do with the results of OrderBy. Since OrderBy contains information about how to sort rather than the sorted buffer, the class returned by OrderBy can do something else with that information other than quicksorting:
The most obvious is ThenBy, which all implementations support. When you call ThenBy or ThenByDescending you get a new similar class with different information about how to sort, and the sort the OrderBy result could have done will probably never happen.
First() and Last() don't need to sort at all. Logically source.OrderBy(del).First() is a variant of source.Min() where del contains the information to determine what defines "less than" for that Min(). Therefore if you call First() on the results of an OrderBy() that's exactly what is done. The laziness of OrderBy allows it to do this instead of quicksort. (Which means O(n) time complexity and O(1) space complexity instead of O(n log n) and O(n) respectively).
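That single pass can be sketched as follows (hypothetical FirstBy; because OrderBy is stable, ties must keep the earliest element, hence the strict comparison):

using System;
using System.Collections.Generic;

static class FirstBySketch
{
    // Hypothetical sketch of what source.OrderBy(keySelector).First() can be
    // reduced to: one pass tracking the minimum by key.
    // O(n) time and O(1) space, instead of O(n log n) and O(n).
    public static TSource FirstBy<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements");
            TSource best = e.Current;
            TKey bestKey = keySelector(best);
            while (e.MoveNext())
            {
                TKey key = keySelector(e.Current);
                // Strictly less than: on ties the earlier element wins,
                // matching a stable OrderBy followed by First().
                if (Comparer<TKey>.Default.Compare(key, bestKey) < 0)
                {
                    best = e.Current;
                    bestKey = key;
                }
            }
            return best;
        }
    }
}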
Skip() and Take() define a subsequence of a sequence, which with OrderBy must conceptually happen after the sort. But since they are lazy too, what can be returned is an object that knows how to sort, how many to skip, and how many to take. As such a partial quicksort can be used, so that the source need only be partially sorted: if a partition falls entirely outside of the range that will be returned then there's no point sorting it.
ElementAt() places more of a burden than First() or Last() but again doesn't require a full quicksort. Quickselect can be used to find just one result; if you're looking for the 3rd element and you've partitioned a set of 200 elements around the 90th element then you only need to look further in the first partition and can ignore the second partition from now on. Best-case and average-case time complexity is O(n).
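A minimal quickselect sketch (hypothetical SelectKth; the real code also has to preserve the stability bookkeeping of the full sort):

using System;
using System.Collections.Generic;

static class QuickselectSketch
{
    // Hypothetical quickselect: partition, then loop only into the side
    // containing index k. Average O(n) time versus O(n log n) for a full
    // sort. Mutates the array, so callers should pass a private copy.
    public static T SelectKth<T>(T[] items, int k, IComparer<T> cmp)
    {
        int lo = 0, hi = items.Length - 1;
        while (true)
        {
            T pivot = items[(lo + hi) / 2];
            int i = lo, j = hi;
            while (i <= j)                                  // Hoare partition
            {
                while (cmp.Compare(items[i], pivot) < 0) i++;
                while (cmp.Compare(items[j], pivot) > 0) j--;
                if (i <= j)
                {
                    (items[i], items[j]) = (items[j], items[i]);
                    i++; j--;
                }
            }
            if (k <= j) hi = j;          // kth element is in the left part
            else if (k >= i) lo = i;     // kth element is in the right part
            else return items[k];        // k landed on the pivot region: done
        }
    }
}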
The above can be combined, so e.g. .Skip(10).First() is equivalent to ElementAt(10) and can be treated as such.
All of these exceptions to getting the entire buffer and sorting it all have one thing in common: they were all implemented after identifying a way in which the correct result can be returned after making the computer do less work*. That new[] {1, 2, 3, 4}.Where(i => i % 2 == 0) will yield the 2 before it has seen the 4 (or even the 3 it won't yield) comes from the same general principle. It just comes at it more easily (though there are still specialised variants of Where() results behind the scenes to provide other optimisations).
But note that Enumerable.Range(1, 10000).Where(i => i >= 10000) scans through 9999 elements to yield that first. Really it's not all that different to OrderBy's buffering; they're both bringing you the next result as quickly as they can†, and what differs is just what that means.
*And also identifying that the effort to detect and make use of the features of a particular case is worth it. E.g. many aggregate calls like Sum() could be optimised on the results of OrderBy by skipping the ordering completely. But this can generally be realised by the caller, who can just leave out the OrderBy, so adding that optimisation would make most calls to Sum() slightly slower in order to speed up a case that shouldn't really be happening anyway.
†Well, pretty much as quickly. It would be possible to get the first results back more quickly than OrderBy does (once you've got the leftmost part of a sequence sorted, start giving out results) but that comes at a cost that would affect the later results, so it isn't necessarily a better trade-off.
I am reading an advanced-level book about C#. And now I am reading this part:
Behind-the-scenes operation of the Linq query methods that implement delegate-based syntax.
So far, I have read about the Where, Select, Skip, SkipWhile, Take and TakeWhile methods.
And I know about deferred and immediate execution, and about the iterators returned by some of these methods.
Deferred execution is a pattern of the execution model by which the CLR ensures a value will be extracted only when it is required from the IEnumerable-based information source. When any Linq operator uses the deferred execution, the CLR encapsulates the related information, such as the original sequence, predicate, or selector (if any), into an iterator, which will be used when the information is extracted from the original sequence using the ToList method or ForEach method or manually using the underlying GetEnumerator and MoveNext methods in C#.
Now let's take these two examples:
IList<int> series = new List<int>() { 1, 2, 3, 4, 5, 6, 7 };
// First example
series.Where(x => x > 0).TakeWhile(x => x > 0).ToList();
// Second example
series.Where(x => x > 0).Take(4).ToList();
When I am putting breakpoints and debugging these two statements, I can see one difference.
The TakeWhile() lambda executes when an item meets the condition in the Where statement, but this is not the case with the Take method.
Could you explain why?
It's not entirely clear what you mean, but if you're asking why you hit a breakpoint in the lambda expression in TakeWhile, but you don't hit one within Take, it's just that Take doesn't accept a delegate at all - it just accepts a number. There's no user-defined code to execute while it's finding a value to return, so there's no breakpoint to hit.
In your example with TakeWhile, you've got two lambda expressions - one for Where and one for TakeWhile. So you can break into either of those lambda expressions.
It's important to understand that the Where and TakeWhile methods themselves are only called once - but the sequences they return evaluate the delegate passed to them for each value they encounter.
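You can watch this happen by putting side effects in the lambdas; a quick demonstration sketch:

using System;
using System.Collections.Generic;
using System.Linq;

class LambdaTrace
{
    static void Main()
    {
        IList<int> series = new List<int> { 1, 2, 3, 4, 5, 6, 7 };

        // Both lambdas run for every element pulled - each is user code
        // you can break into.
        series.Where(x => { Console.WriteLine("Where saw " + x); return x > 0; })
              .TakeWhile(x => { Console.WriteLine("TakeWhile saw " + x); return x > 0; })
              .ToList();

        // Only the Where lambda runs (four times); Take(4) executes no
        // user code, so there is nothing to break into.
        series.Where(x => { Console.WriteLine("Where saw " + x); return x > 0; })
              .Take(4)
              .ToList();
    }
}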
You might want to look at my Edulinq blog series for more details about the innards of LINQ.
Well, the condition in TakeWhile will need to be evaluated for each item, just like Where, so it will call each of them for each item.
Take(4) does not need to be evaluated per item, only the Where does, so in the second one, only the Where condition will be evaluated each time, (probably four times).
I am looking at this code
var numbers = Enumerable.Range(0, 20);
var parallelResult = numbers.AsParallel().AsOrdered()
.Where(i => i % 2 == 0).AsSequential();
foreach (int i in parallelResult.Take(5))
Console.WriteLine(i);
The AsSequential() is supposed to make the resulting array sorted. Actually it is sorted after its execution, but if I remove the call to AsSequential(), it is still sorted (since AsOrdered() is called).
What is the difference between the two?
AsSequential is just meant to stop any further parallel execution - hence the name. I'm not sure where you got the idea that it's "supposed to make the resulting array sorted". The documentation is pretty clear:
Converts a ParallelQuery into an IEnumerable to force sequential evaluation of the query.
As you say, AsOrdered ensures ordering (for that particular sequence).
I know that this was asked over a year ago, but here are my two cents.
In the example shown, I think AsSequential is used so that the next query operator (in this case the Take operator) is executed sequentially.
However, the Take operator prevents a query from being parallelized unless the source elements are in their original indexing position, which is why even when you remove the AsSequential operator the result is still sorted.
I am trying to build a SeparatedList using a dynamically-generated IEnumerable sequence (which is constructed by an Enumerable.Select() function call). The API function to create a SeparatedList takes two parameters, an IEnumerable<T> and an IEnumerable<SyntaxToken>. I have provided a simple function, Repeat, that is an infinite sequence generator which yields as many commas, in this case, as are requested.
The SeparatedList function appears to consume as many elements of the first sequence (parameter types here) as there are entries in the second sequence, which messes me up. Have I misunderstood how the function is supposed to work, and has anyone else done this? Thanks
Syntax.SeparatedList<ParameterSyntax>(
    functionParameterTypes, Repeat(i => Syntax.Token(SyntaxKind.CommaToken)))
(Edit: I should add that converting the functionParameterTypes to a List<> and passing another List<> with one fewer token than elements in functionParameterTypes does work but I am trying to do this without having to explicitly build the list ahead of time.)
The XML documentation for the separators parameter says:
The number of tokens must be one less than the number of nodes.
You're right that this is not what the method actually requires: the number of tokens must be one less than the number of nodes, or the same as the number of nodes. I wouldn't be surprised if this was intentional; code like f(foo, bar, ) makes sense if you're trying to handle code that's just being written.
I think that calling ToList() on the sequence of parameters is the best choice here. And you don't have to use another List for the separators; you can use Enumerable.Repeat() for that. For example like this (taken from a library I wrote where I faced the same issue):
public static SeparatedSyntaxList<T> ToSeparatedList<T>(
    this IEnumerable<T> nodes, SyntaxKind separator = SyntaxKind.CommaToken)
    where T : SyntaxNode
{
    var nodesList = nodes == null ? new List<T>() : nodes.ToList();
    return Syntax.SeparatedList(
        nodesList,
        Enumerable.Repeat(
            Syntax.Token(separator), Math.Max(nodesList.Count - 1, 0)));
}
I also had the same need to create a SeparatedList using a dynamically generated list of parameters. My solution was to use SelectMany() and Take() to add separators (i.e. "comma") to the parameters but then remove the last trailing comma.
SyntaxFactory.SeparatedList<ParameterSyntax>(
    functionParameterTypes
        .SelectMany(param =>
            new SyntaxNodeOrToken[]
            {
                param,
                SyntaxFactory.Token(SyntaxKind.CommaToken)
            })
        .Take(functionParameterTypes.Count() * 2 - 1)
);
A friend and I were a bit perplexed during a programming discussion today. As an example, we created a fictitious problem of having a List<int> of n random integers (typically 1,000,000) and wanted to create a function that returned the set of all integers that occurred more than once. Pretty straightforward stuff. We created one LINQ statement to solve this problem, and a plain insertion-sort-based algorithm.
Now, as we tested the speed the code ran at (using System.Diagnostics.Stopwatch), the results were confusing. Not only did the LINQ code outperform the simple sort, but it ran faster than a single foreach/for that only did a single loop of the list and had no operations within it (which, on a side note, I thought the compiler was supposed to discover and remove altogether).
If we generated a new List<int> of random numbers in the same execution of the program and ran the LINQ code again, the performance would increase by orders of magnitude (typically thousandfold). The performance of the empty loops was of course the same.
So, what is going on here? Is LINQ using parallelism to outperform normal loops? How are these results even possible? LINQ's sort uses quicksort, which runs in O(n log n), which by definition is slower than O(n).
And what is happening at the performance leap on the second run?
We were both baffled and intrigued at these results and were hoping for some clarifying insights from the community, just to satisfy our own curiosity.
Undoubtedly you haven't actually performed the query; you've merely defined it. LINQ builds up a query that isn't actually evaluated until you perform an operation that requires the enumeration to be iterated. Try adding a ToList() or Count() operation to the LINQ query to force it to be evaluated.
Based on your comment I expect this is similar to what you've done. Note: I haven't spent any time figuring out if the query is as efficient as possible; I just want some query to illustrate how the code may be structured.
var dataset = ...
var watch = Stopwatch.StartNew();
var query = dataset.Where( d => dataset.Count( i => i == d ) > 1 );
watch.Stop(); // timer stops here
foreach (var item in query) // query is actually evaluated here
{
... print out the item...
}
I would suggest that LINQ is only faster than a 'normal loop' when your algorithm is less than perfect (or you have some problem in your code). So LINQ will be faster at sorting than you are if you don't write an efficient sorting algorithm, etc.
LINQ is usually 'as fast as' or 'close enough to' the speed of a normal loop, and can be faster (and simpler) to code / debug / read. That's its benefit - not execution speed.
If it's performing faster than an empty loop, you are doing something wrong. Most likely, as suggested in comments, you aren't considering deferred execution and the LINQ statement is not actually executing.
If you did not compile with "Optimize Code" enabled, you would probably see this behaviour. (It would certainly explain why the empty loop was not removed.)
The code underlying LINQ, however, is part of already-compiled code, which will certainly have been optimised (by the JIT, NGen or similar).