Performance implications of calling ToArray inside a LINQ selector


If I have the following statement:
whatever.Select(x => collection.ToArray()[index]).ToList();
Is LINQ smart enough to perform the ToArray cast only once (I'm not really aware of how this closure is transformed and evaluated)?
I understand that this code is bad, just interested.

No, it will be performed once for every item in whatever.

You can have a peek at the code for LINQBridge, especially the Select method (which ends up calling SelectYield).
The essence of SelectYield is a simple foreach loop:
foreach (var item in source)
    yield return selector(item, i++);
Where selector is the lambda expression you pass in, in your case x => collection.ToArray()[index]. From here it is obvious that the whole lambda expression will be evaluated for every element in whatever.
Note that LINQBridge is a standalone reimplementation of LINQ to Objects and thus not necessarily identical, but to a very large extent it behaves exactly like LINQ to Objects, including side effects.
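To make the per-item evaluation visible, here is a minimal sketch that counts how often the array copy runs. The class name, method names and sample data are made up for illustration; only Select, ToArray and ToList come from LINQ itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SelectorEvaluationDemo
{
    // Copy is made inside the selector: runs once per element of the outer sequence.
    public static int CopiesInsideSelector(int itemCount)
    {
        var collection = new List<int> { 10, 20, 30 };
        int copies = 0;
        Func<int[]> copy = () => { copies++; return collection.ToArray(); };

        Enumerable.Range(0, itemCount).Select(x => copy()[1]).ToList();
        return copies;
    }

    // Copy is hoisted out of the selector: runs exactly once.
    public static int CopiesHoisted(int itemCount)
    {
        var collection = new List<int> { 10, 20, 30 };
        int copies = 0;
        Func<int[]> copy = () => { copies++; return collection.ToArray(); };

        int[] snapshot = copy();
        Enumerable.Range(0, itemCount).Select(x => snapshot[1]).ToList();
        return copies;
    }

    public static void Main()
    {
        Console.WriteLine(CopiesInsideSelector(5)); // prints 5
        Console.WriteLine(CopiesHoisted(5));        // prints 1
    }
}
```

Hoisting the ToArray call into a local variable is the usual way to get the "only once" behaviour the question hoped for.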

Related

In a Linq predicate, will the compiler optimize a "scalar" call to Enumerable.Min() or will it be called for each item?

I was just looking at the question "SubQuery using Lambda Expression" and wondered about compiler optimization of Linq predicates.
Suppose I had a List<string> called names, and I was looking for the items with the shortest string length. So we have the query names.Where(x => x.Length == names.Min(y => y.Length)) (from the question mentioned above). Simple enough.
Now, we know the C# specification does not allow you to modify a collection while enumerating it. So I believe it is technically safe to assume the above call to Min() will always return the same value for every call.
But, my hypothesis is the compiler truly has no way of knowing what the lambda inside the Enumerable.Min extension method returns. Since, for example we could do:
int i = 0;
return names.Where(x => x.Length == names.Min(y => ++i));
Which would mean the query in question is really O(n²) - the result of Min() will be calculated for each iteration. And to get the desired O(n) implementation, you would have to be explicit:
int minLength = names.Min(y => y.Length);
return names.Where(x => x.Length == minLength);
Is my hypothesis correct, or is there something special about Linq or the C# specification that allows the compiler to look inside the lambda and optimize this call to Min()?
@spender is absolutely correct. Consider the following snippet:
List<string> names = new List<string>(new[] { "r", "abcde", "bcdef", "cdefg", "q" });
return names.Where(x =>
{
bool b = (x.Length == names.Min(y => y.Length));
names = new List<string>(new[] { "ab" });
return b;
});
This will return only "r", and not "q", because while the old reference to names is being iterated (foreach x), the call to Min after the first iteration is actually called with the new instance of names. But, a human looking at the query in the top of the question can say for certain nothing gets modified. So my question still stands: is the compiler smart enough to see this?

"wondered about compiler optimization of Linq predicates."
The C# compiler does not know how the BCL types are implemented. It could look at the assemblies that you reference but those can change at any time. The compiler cannot assume that the machine the compiled program will run on will have the same binaries. Therefore, the C# compiler cannot legally perform these optimizations because you could tell the difference.
The JIT is in a position to make such optimizations (it does not at the moment).
"Now, we know the C# specification does not allow you to modify a collection while enumerating it. So I believe it is technically safe to assume the above call to Min() will always return the same value for every call."
The specification of C# knows nothing about libraries. It does not say this at all. Each implementation of IEnumerable can decide whether it wants to allow such behavior or not.
"But, my hypothesis is the compiler truly has no way of knowing what the lambda inside the Enumerable.Min extension method returns."
Yes, it could do anything. At runtime the JIT could deduce such properties but it does not. Note that deducing even basic facts is hard because there are things like reflection, runtime code generation and multi-threading.
"Is my hypothesis correct, or is there something special about Linq or the C# specification that allows the compiler to look inside the lambda and optimize this call to Min()?"
No. LINQ has library-only optimizations. LINQ to objects is executed exactly as you wrote it. Other LINQ providers do this differently.
If you wonder whether the JIT will perform some advanced optimization the answer is usually no as of .NET 4.5.
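The difference between the two forms of the query can be checked by counting how often the inner selector runs. This is only a sketch: the class and method names are invented for the illustration, and the counter lives in the lambda purely to observe the calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MinHoistingDemo
{
    // Min() sits inside the Where predicate: it is recomputed for every element.
    public static int LengthReadsNaive(List<string> names)
    {
        int reads = 0;
        names.Where(x => x.Length == names.Min(y => { reads++; return y.Length; })).ToList();
        return reads;
    }

    // Min() is computed once up front and the predicate only compares against it.
    public static int LengthReadsHoisted(List<string> names)
    {
        int reads = 0;
        int minLength = names.Min(y => { reads++; return y.Length; });
        names.Where(x => x.Length == minLength).ToList();
        return reads;
    }

    public static void Main()
    {
        var names = new List<string> { "r", "abcde", "bcdef", "cdefg", "q" };
        Console.WriteLine(LengthReadsNaive(names));   // prints 25 (5 x 5)
        Console.WriteLine(LengthReadsHoisted(names)); // prints 5
    }
}
```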

The C# compiler works in passes. Each pass takes some complex language feature and converts it into something simpler. Quite often context is lost in this conversion. Lambda expressions are handled in one of those passes: each lambda is converted to a method on a compiler-generated class, that class is instantiated, and the method is passed to the delegate. The compilation pass doesn't even look inside the lambda. So the stage that produces the IL code doesn't know there were any lambdas; it just sees a bunch of classes, and those classes don't give it enough information to infer what you propose.
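As a rough illustration of that lowering, the sketch below puts a capturing lambda next to a hand-written equivalent. CapturedState and IsAbove are hypothetical stand-ins; the names and exact shape of the class the real compiler generates are different and are an implementation detail.

```csharp
using System;
using System.Linq;

public static class ClosureLoweringDemo
{
    // Hand-written stand-in for the compiler-generated class that holds captured variables.
    private sealed class CapturedState
    {
        public int Threshold;

        // The lambda body becomes an ordinary instance method.
        public bool IsAbove(int value)
        {
            return value > Threshold;
        }
    }

    // What you write: a lambda that captures the local 'threshold'.
    public static int CountWithLambda(int[] values, int threshold)
    {
        return values.Count(v => v > threshold);
    }

    // Roughly what it turns into: an object plus a delegate pointing at its method.
    public static int CountWithHandWrittenClosure(int[] values, int threshold)
    {
        var state = new CapturedState { Threshold = threshold };
        Func<int, bool> predicate = state.IsAbove;
        return values.Count(predicate);
    }

    public static void Main()
    {
        int[] values = { 1, 5, 10, 15 };
        Console.WriteLine(CountWithLambda(values, 4));             // prints 3
        Console.WriteLine(CountWithHandWrittenClosure(values, 4)); // prints 3
    }
}
```

By the time IL is produced, Count only receives a delegate to some method, which is why nothing at that level can tell the delegate is side-effect free.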

Performance difference between .where(...).Any() vs ..Any(...) [duplicate]

Possible Duplicate:
LINQ extension methods - Any() vs. Where() vs. Exists()
Given a list of objects in memory I ran the following two expressions:
myList.where(x => x.Name == "bla").Any()
vs
myList.Any(x => x.Name == "bla")
The latter was fastest always, I believe this is due to the Where enumerating all items. But this also happens when there's no matches.
I'm not sure of the exact WHY though. Are there any cases where this viewed performance difference wouldn't be the case, like if it was querying Nhib?
Cheers.

The Any() with the predicate can perform its task without an iterator (yield return). Using a Where() creates an iterator, which has a performance impact (albeit very small).
Thus, performance-wise (by a bit), you're better off using the form of Any() that takes the predicate (x => x.Name == "bla"). Which, personally, I find more readable as well...
On a side note, Where() does not necessarily enumerate over all elements, it just creates an iterator that will travel over the elements as they are requested, thus the call to Any() after the Where() will drive the iteration, which will stop at the first item it finds that matches the condition.
So the performance difference is not that Where() iterates over all the items (in linq-to-objects) because it really doesn't need to (unless, of course, it doesn't find one that satisfies it), it's that the Where() clause has to set up an iterator to walk over the elements, whereas Any() with a predicate does not.
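A small sketch can confirm that both forms stop at the first match. The class and method names and the sample list are assumptions made for the example; the counter only exists to observe how many elements the predicate sees.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnyShortCircuitDemo
{
    // Where(...) builds a lazy iterator; Any() pulls from it until the first hit.
    public static int ChecksWithWhereAny(IEnumerable<string> names, string target)
    {
        int checks = 0;
        names.Where(n => { checks++; return n == target; }).Any();
        return checks;
    }

    // Any(predicate) tests the elements directly and also stops at the first hit.
    public static int ChecksWithAnyPredicate(IEnumerable<string> names, string target)
    {
        int checks = 0;
        names.Any(n => { checks++; return n == target; });
        return checks;
    }

    public static void Main()
    {
        var names = new List<string> { "foo", "bla", "bar", "baz" };
        Console.WriteLine(ChecksWithWhereAny(names, "bla"));      // prints 2
        Console.WriteLine(ChecksWithAnyPredicate(names, "bla"));  // prints 2
        Console.WriteLine(ChecksWithWhereAny(names, "missing"));  // prints 4
    }
}
```

The number of predicate calls is identical, so any measured gap comes from the extra iterator object and its indirection rather than from extra enumeration.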

Assuming you correct where to Where and = to ==, I'd expect the "Any with a predicate" version to execute very slightly faster. However, I would expect the situations in which the difference was significant to be few and far between, so you should aim for readability first.
As it happens, I would normally prefer the "Any with a predicate" version in terms of readability too, so you win on both fronts - but you should really go with what you find more readable first. Measure the performance in scenarios you actually care about, and if a section of code isn't performing as you need it to, then consider micro-optimizing it - measuring at every step, of course.

"I believe this is due to the Where enumerating all items."
If myList is a collection in memory, it doesn't. The Where method uses deferred execution, so it will only enumerate as many items as needed to determine the result. In that case you would not see any significant difference between .Any(...) and .Where(...).Any().
"Are there any cases where this viewed performance difference wouldn't be the case, like if it was querying Nhib?"
Yes, if myList is a data source that will take the expression generated by the methods and translate to a query to run elsewhere (e.g. LINQ To SQL), you may see a difference. The code that translates the expression simply does a better job at translating one of the expressions.

How can I overcome the overhead of creating a List<T> from an IEnumerable<T>?

I am using some of the LINQ select stuff to create some collections, which return IEnumerable<T>.
In my case I need a List<T>, so I am passing the result to List<T>'s constructor to create one.
I am wondering about the overhead of doing this. The items in my collections are usually in the millions, so I need to consider this.
I assume, if the IEnumerable<T> contains ValueTypes, it's the worst performance.
Am I right? What about Ref Types? Either way there is also the cost of calling List<T>.Add a million times, right?
Any way to solve this? Like can I "overload" methods like LINQ Select using extension methods?

No, there's no particular penalty for the element type being value types, assuming you're using IEnumerable<T> instead of IEnumerable. You won't get any boxing going on.
If you actually know the size of the result beforehand (which the result of Select probably won't) you might want to consider creating the list with that size of buffer, then using AddRange to add the values. Otherwise the list will have to resize its buffer every time it fills it.
For instance, instead of doing:
Foo[] foo = new Foo[100];
IEnumerable<string> query = foo.Select(x => x.Name);
List<string> queryList = new List<string>(query);
you might do:
Foo[] foo = new Foo[100];
IEnumerable<string> query = foo.Select(x => x.Name);
List<string> queryList = new List<string>(foo.Length);
queryList.AddRange(query);
You know that calling Select will produce a sequence of the same length as the original query source, but nothing in the execution environment has that information as far as I'm aware.
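The pre-sizing idea can be sketched as a small helper. The class and method names and the upper-casing selector are placeholders chosen for this example; List<T>(int capacity), AddRange and the Capacity property are the actual List<T> members being relied on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PresizedListDemo
{
    // Copies a projected sequence into a list whose buffer is allocated once, up front.
    public static List<string> CopyPresized(string[] source)
    {
        IEnumerable<string> query = source.Select(s => s.ToUpperInvariant());

        // We know Select yields exactly source.Length items, so reserve that much.
        var list = new List<string>(source.Length);
        list.AddRange(query);
        return list;
    }

    public static void Main()
    {
        List<string> list = CopyPresized(new[] { "a", "b", "c" });
        Console.WriteLine(string.Join(",", list)); // prints A,B,C
        Console.WriteLine(list.Capacity);          // prints 3: the buffer never had to grow
    }
}
```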

It would be best to avoid the need for a list. If you can keep your caller using IEnumerable<T>, you will save yourself some headaches.
LINQ's ToList() will take your enumerable, and just construct a new List<T> directly from it, using the List<T>(IEnumerable<T>) constructor. This will be the same as making the list yourself, performance wise (although LINQ does a null check, as well).
If you're adding the elements yourself, use the AddRange method instead of the Add. ToList() is very similar to AddRange (since it's using the constructor which takes IEnumerable<T>), which typically will be your best bet, performance wise, in this case.

Generally speaking, a method returning IEnumerable doesn't have to evaluate any of the items before the item is actually needed. So, theoretically, when you return an IEnumerable none of your items need to exist at that time.
So creating a list means that you will really need to evaluate items, get them and place them somewhere in memory (at least their references). There is nothing that can be done about this - if you really need to have a list.

A number of other responders have already provided ideas for how to improve the performance of copying an IEnumerable<T> into a List<T> - I don't think that much can be added on that front.
However, based on what you have described you need to do with the results, and the fact that you get rid of the list when you're done (which I presume means that the intermediate results are not interesting) - you may want to consider whether you really need to materialize a List<T>.
Rather than creating a List<T> and operating on the contents of that list, consider writing a lazy extension method for IEnumerable<T> that performs the same processing logic. I've done this myself in a number of cases, and writing such logic in C# is not so bad when using the yield return syntax supported by the compiler.
This approach works well if all you're trying to do is visit each item in the results and collect some information from it. Often, what you need to do is just visit each element in the collection on demand, do some processing with it, and then move on. This approach is generally more scalable and performant than creating a copy of the collection just to iterate over it.
Now, this advice may not work for you for other reasons, but it's worth considering as an alternative to finding the most efficient way to materialize a very large list.
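A minimal sketch of such a lazy extension method follows. RunningTotal is a hypothetical example of "processing logic"; the point is only that it consumes the source one element at a time through yield return and never builds a List<T>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyProcessingExtensions
{
    // Hypothetical processing step: yields the running sum after each element.
    // Nothing executes until the caller starts enumerating the result.
    public static IEnumerable<long> RunningTotal(this IEnumerable<int> source)
    {
        long total = 0;
        foreach (int value in source)
        {
            total += value;
            yield return total;
        }
    }
}

public static class LazyProcessingDemo
{
    public static void Main()
    {
        // A million items flow through the pipeline without being stored anywhere.
        IEnumerable<int> millions = Enumerable.Range(1, 1000000);
        long last = millions.RunningTotal().Last();
        Console.WriteLine(last); // prints 500000500000
    }
}
```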

Don't pass an IEnumerable to the List constructor. IEnumerable has a ToList() method, which can't possibly do worse than that, and has nicer syntax (IMHO).
That said, that only changes the answer to your question to "it depends" - in particular, it depends on what the IEnumerable actually is behind the scenes. If it happens to be a List already, then ToList will go much faster than if it were another type. It's still not super-fast.
The best way to solve this, of course, is to try to figure out how to do your processing on an IEnumerable rather than a List. That may not be possible.
Edit: Some people in the comments are debating whether or not ToList() will actually be any faster when called on a List than if not, and whether ToList() will be any faster than the list constructor. At this point, speculating is getting pointless, so here's some code:
using System;
using System.Linq;
using System.Collections.Generic;
public static class ToListTest
{
    public static int Main(string[] args)
    {
        List<int> intlist = new List<int>();
        for (int i = 0; i < 1000000; i++)
            intlist.Add(i);
        IEnumerable<int> intenum = intlist;
        for (int i = 0; i < 1000; i++)
        {
            List<int> foo = intenum.ToList();
        }
        return 0;
    }
}
Running this code with an IEnumerable that's really a List goes about 6-10 times faster than if I replace it with a LinkedList or Stack (on my pokey 2.4 GHz P4, using Mono 1.2.6). Conceivably this could be due to some unfortunate interaction between ToList() and the particular implementations of LinkedList or Stack's enumerations, but at least the point remains: speed will depend on the underlying type of the IEnumerable. That said, even with a List as the source, it still takes 6 seconds for me to make 1000 ToList() calls, so it's far from free.
The next question is whether ToList() is any more intelligent than the List constructor. The answer to that turns out to be no: the List constructor is just as fast as ToList(). In hindsight, Jon Skeet's reasoning makes sense - I was just forgetting that ToList() was an extension method. I still (much) prefer ToList() syntactically, but there's no performance reason to use it.
So the short version is that the best answer is still "don't convert to a List if you can avoid it". Barring that, actual performance will depend drastically on what the IEnumerable actually is, but at best it'll be sluggish, as opposed to glacial. I've amended my original answer to reflect this.

From reading the various comments and the question, I gather the following requirements: for a collection of data, you need to run through that collection, filter out some objects, and then perform some transformation on the remaining objects. If that's the case, you can do something like this:
var result = from item in collection
where item.Id > 10 //or some more sensible condition
select Operation(item);
and if you need to perform more filtering and transformation, you can nest your LINQ queries like
var result = from filteredItem in (from item in collection
where item.Id > 10 //or some more sensible condition
select Operation(item))
where filteredItem.SomePropertyAvailableAfterFirstTransformation == "new"
select SecondTransfomation(filteredItem);

What is the LINQ to objects 'where' clause doing behind the scenes?

I've just replaced this piece of code:
foreach( var source in m_sources )
{
if( !source.IsExhausted )
{
....
}
}
with this one:
foreach( var source in m_sources.Where( src => !src.IsExhausted ) )
{
...
}
Now the code looks better (to me) but I'm wondering what's really happening here. I'm concerned about performance in this case, and it'd be bad news if applying this filter would mean that some kind of compiler magic would take place.
Are the two pieces of code doing basically the 'same' thing? Are temporary containers created to do the filtering then passing them to my foreach?
Any help on the subject will be pretty much appreciated. Thanks.

The yield return keyword and lambdas do involve the creation of hidden classes at compile time and the allocation of extra objects at runtime, and if your background is in C or C++ then it's only natural to be concerned about performance.
Natural, but wrong!
I tried measuring the overhead for lambdas with closure over local variables, and found it to be so incredibly small (a matter of nanoseconds) that it would be of no significance in almost all applications.

It depends on the type of m_sources.
If it is a Data Context from LINQ to SQL or Entity Framework the argument you pass is compiled as an instance of Expression and parsed to create SQL (with the help of the data model). There are some real costs in this process, but likely (in most cases) to be dominated by the round trip to the database.
If it is IEnumerable then Where is pretty much implemented as:
public static IEnumerable<T> Where<T>(this IEnumerable<T> input, Func<T, bool> predicate) {
    foreach (var v in input) {
        if (predicate(v)) {
            yield return v;
        }
    }
}
Which is pretty efficient and performs lazily (so if you break out of your loop early the predicate will not be applied to the whole collection).

Basically, yes, it's the same, O(n).
The where clause will be executed while you loop through your list (i.e. if you break after the first item, the following items will not be tested).
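The lazy behaviour described above can be observed by counting predicate calls when the loop exits early. The class name, the range of numbers and the even-number filter are arbitrary choices for this sketch.

```csharp
using System;
using System.Linq;

public static class WhereLazinessDemo
{
    // Returns how many elements the predicate inspected before the loop broke out.
    public static int PredicateCallsBeforeBreak()
    {
        int calls = 0;
        var sources = Enumerable.Range(1, 100);

        foreach (int n in sources.Where(v => { calls++; return v % 2 == 0; }))
        {
            if (n >= 4)
                break; // stop as soon as the second even number arrives
        }
        return calls;
    }

    public static void Main()
    {
        // Only 1, 2, 3 and 4 were tested; the remaining 96 values were never touched.
        Console.WriteLine(PredicateCallsBeforeBreak()); // prints 4
    }
}
```

No temporary container is built: the filter runs interleaved with the loop body, one element at a time.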

When not to use lambda expressions [closed]

A lot of questions are being answered on Stack Overflow, with members specifying how to solve these real world/time problems using lambda expressions.
Are we overusing it, and are we considering the performance impact of using lambda expressions?
I found a few articles that explore the performance impact of lambda vs anonymous delegates vs for/foreach loops, with different results:
Anonymous Delegates vs Lambda Expressions vs Function Calls Performance
Performance of foreach vs. List.ForEach
.NET/C# Loop Performance Test (FOR, FOREACH, LINQ, & Lambda).
DataTable.Select is faster than LINQ
What should be the evaluation criteria when choosing the appropriate solution? Except for the obvious reason that it's more concise code and readable when using lambda.

Even though I will focus on point one, I begin by giving my 2 cents on the whole issue of performance. Unless differences are big or usage is intensive, I usually don't bother about microseconds that, when added up, don't amount to any visible difference to the user. I emphasize that this only applies to methods that are not called intensively. Where I do have special performance considerations is in the way I design the application itself. I care about caching, about the use of threads, about clever ways to call methods (whether to make several calls or to try to make only one call), whether to pool connections or not, etc. In fact I usually don't focus on raw performance, but on scalability. I don't care if it runs better by a tiny slice of a nanosecond for a single user, but I care a lot about being able to load the system with large numbers of simultaneous users without noticing the impact.
Having said that, here goes my opinion about point 1. I love anonymous methods. They give me great flexibility and code elegance. The other great feature of anonymous methods is that they allow me to directly use local variables from the containing method (from a C# perspective, not from an IL perspective, of course). They often spare me loads of code. When do I use anonymous methods? Every single time the piece of code I need isn't needed elsewhere. If it is used in two different places, I don't like copy-paste as a reuse technique, so I'll use a plain ol' delegate. So, just like shoosh answered, it isn't good to have code duplication. In theory there are no performance differences, as anonymous methods are C# tricks, not IL stuff.
Most of what I think about anonymous methods applies to lambda expressions, as the latter can be used as a compact syntax to represent anonymous methods. Let's assume the following method:
public static void DoSomethingMethod(string[] names, Func<string, bool> myExpression)
{
Console.WriteLine("Lambda used to represent an anonymous method");
foreach (var item in names)
{
if (myExpression(item))
Console.WriteLine("Found {0}", item);
}
}
It receives an array of strings and for each one of them, it will call the method passed to it. If that method returns true, it will say "Found...". You can call this method the following way:
string[] names = {"Alice", "Bob", "Charles"};
DoSomethingMethod(names, delegate(string p) { return p == "Alice"; });
But, you can also call it the following way:
DoSomethingMethod(names, p => p == "Alice");
There is no difference in IL between the two; the one using the lambda expression is simply much more readable. Once again, there is no performance impact as these are all C# compiler tricks (not JIT compiler tricks). Just as I didn't feel we are overusing anonymous methods, I don't feel we are overusing lambda expressions to represent anonymous methods. Of course, the same logic applies to repeated code: don't do lambdas, use regular delegates. There are other restrictions leading you back to anonymous methods or plain delegates, like out or ref argument passing.
The other nice thing about lambda expressions is that the exact same syntax doesn't need to represent an anonymous method. Lambda expressions can also represent... you guessed it, expressions. Take the following example:
public static void DoSomethingExpression(string[] names, System.Linq.Expressions.Expression<Func<string, bool>> myExpression)
{
Console.WriteLine("Lambda used to represent an expression");
BinaryExpression bExpr = myExpression.Body as BinaryExpression;
if (bExpr == null)
return;
Console.WriteLine("It is a binary expression");
Console.WriteLine("The node type is {0}", bExpr.NodeType.ToString());
Console.WriteLine("The left side is {0}", bExpr.Left.NodeType.ToString());
Console.WriteLine("The right side is {0}", bExpr.Right.NodeType.ToString());
if (bExpr.Right.NodeType == ExpressionType.Constant)
{
ConstantExpression right = (ConstantExpression)bExpr.Right;
Console.WriteLine("The value of the right side is {0}", right.Value.ToString());
}
}
Notice the slightly different signature. The second parameter receives an expression and not a delegate. The way to call this method would be:
DoSomethingExpression(names, p => p == "Alice");
Which is exactly the same as the call we made when creating an anonymous method with a lambda. The difference here is that we are not creating an anonymous method, but creating an expression tree. It is due to these expression trees that we can then translate lambda expressions to SQL, which is what Linq 2 SQL does, for instance, instead of executing stuff in the engine for each clause like the Where, the Select, etc. The nice thing is that the calling syntax is the same whether you're creating an anonymous method or sending an expression.
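To see the two interpretations of the same lambda side by side, here is a small sketch. The class and method names are invented for the example; Expression<TDelegate>, BinaryExpression, NodeType and Compile() are the real System.Linq.Expressions members.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionVersusDelegateDemo
{
    // Same lambda text, assigned to a delegate type: it becomes executable code.
    public static bool RunAsDelegate(string name)
    {
        Func<string, bool> isAlice = p => p == "Alice";
        return isAlice(name);
    }

    // Same lambda text, assigned to an expression type: it becomes inspectable data.
    public static string DescribeAsExpression()
    {
        Expression<Func<string, bool>> isAlice = p => p == "Alice";
        var body = (BinaryExpression)isAlice.Body;
        return body.NodeType + " " + body.Left.NodeType + " " + body.Right.NodeType;
    }

    // An expression tree can still be turned into a delegate at runtime.
    public static bool RunCompiledExpression(string name)
    {
        Expression<Func<string, bool>> isAlice = p => p == "Alice";
        Func<string, bool> compiled = isAlice.Compile();
        return compiled(name);
    }

    public static void Main()
    {
        Console.WriteLine(RunAsDelegate("Alice"));       // prints True
        Console.WriteLine(DescribeAsExpression());       // prints Equal Parameter Constant
        Console.WriteLine(RunCompiledExpression("Bob")); // prints False
    }
}
```

Note that Compile() does real work at runtime, so in a hot path it is usually worth caching the compiled delegate.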

My answer will not be popular.
I believe lambdas are the better choice 99% of the time, for three reasons.
First, there is ABSOLUTELY nothing wrong with assuming your developers are smart. Other answers have an underlying premise that every developer but you is stupid. Not so.
Second, lambdas (et al) are a modern syntax - and tomorrow they will be more commonplace than they already are today. Your project's code should flow from current and emerging conventions.
Third, writing code "the old fashioned way" might seem easier to you, but it's not easier to the compiler. This is important, legacy approaches have little opportunity to be improved as the compiler is rev'ed. Lambdas (et al) which rely on the compiler to expand them can benefit as the compiler deals with them better over time.
To sum up:
Developers can handle it
Everyone is doing it
There's future potential
Again, I know this will not be a popular answer. And believe me "Simple is Best" is my mantra, too. Maintenance is an important aspect to any source. I get it. But I think we are overshadowing reality with some cliché rules of thumb.
// Jerry

Code duplication.
If you find yourself writing the same anonymous function more than once, it shouldn't be one.

Well, when we are talking about delegate usage, there shouldn't be any difference between lambda and anonymous methods -- they are the same, just with different syntax. And named methods (used as delegates) are also identical from the runtime's viewpoint. The difference, then, is between using delegates, vs. inline code - i.e.
list.ForEach(s=>s.Foo());
// vs.
foreach(var s in list) { s.Foo(); }
(where I would expect the latter to be quicker)
And equally, if you are talking about anything other than in-memory objects, lambdas are one of your most powerful tools in terms of maintaining type checking (rather than parsing strings all the time).
Certainly, there are cases when a simple foreach with code will be faster than the LINQ version, as there will be fewer invokes to do, and invokes cost a small but measurable time. However, in many cases, the code is simply not the bottleneck, and the simpler code (especially for grouping, etc) is worth a lot more than a few nanoseconds.
Note also that in .NET 4.0 there are additional Expression nodes for things like loops, commas, etc. The language doesn't support them, but the runtime does. I mention this only for completeness: I'm certainly not saying you should use manual Expression construction where foreach would do!

I'd say that the performance differences are usually so small (in the case of loops, obviously so, if you look at the results of the 2nd article; btw, Jon Skeet has a similar article here) that you should almost never choose a solution for performance reasons alone, unless you are writing a piece of software where performance is absolutely the number one non-functional requirement and you really have to do micro-optimizations.
When to choose what? I guess it depends on the situation but also the person. Just as an example, some people prefer List.ForEach over a normal foreach loop. I personally prefer the latter, as it is usually more readable, but who am I to argue against this?

Rules of thumb:
Write your code to be natural and readable.
Avoid code duplications (lambda expressions might require a little extra diligence).
Optimize only when there's a problem, and only with data to back up what that problem actually is.

Any time the lambda simply passes its arguments directly to another function. Don't create a lambda for function application.
Example:
var coll = new ObservableCollection<int>();
myInts.ForEach(x => coll.Add(x));
Is nicer as:
var coll = new ObservableCollection<int>();
myInts.ForEach(coll.Add);
The main exception is where C#'s type inference fails for whatever reason (and there are plenty of times that's true).

If you need recursion, don't use lambdas, or you'll end up getting very distracted!

Lambda expressions are cool. Over the older delegate syntax they have a few advantages: they can be converted to either anonymous functions or expression trees, parameter types are inferred from the declaration, they are cleaner and more concise, etc. I see no real value in not using a lambda expression when you need an anonymous function. One not-so-big advantage the earlier style has is that you can omit the parameter declaration entirely if the parameters are not used. Like
Action<int> a = delegate { }; //takes one argument, but no argument specified
This is useful when you have to declare an empty delegate that does nothing, but it is not a strong reason enough to not use lambdas.
Lambdas let you write quick anonymous methods. That makes lambdas pointless wherever anonymous methods are pointless, i.e. where named methods make more sense. Compared with named methods, anonymous methods can be disadvantageous (this is not about lambda expressions per se, but since lambdas are how anonymous methods are usually written these days, it is relevant):
1. because they tend to lead to logic duplication (reuse is difficult)
2. when it is unnecessary to write one, like:
//this is unnecessary
Func<string, int> f = x => int.Parse(x);
//this is enough
Func<string, int> f = int.Parse;
3. since writing an anonymous iterator block is impossible.
Func<IEnumerable<int>> f = () => { yield return 0; }; //impossible
4. since recursive lambdas require one more line of quirkiness, like
Func<int, int> f = null;
f = x => (x <= 1) ? 1 : x * f(x - 1);
5. well, since reflection is kinda messier, but that is moot isn't it?
Apart from point 3, the rest are not strong reasons not to use lambdas.
Also see this thread about what is disadvantageous about Func/Action delegates, since often they are used along with lambda expressions.
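For the iterator and recursion cases in the list above, one workaround worth knowing is a local function, assuming C# 7 or later is available (this is an addition for illustration, not something the answer itself relies on). The class and method names below are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalFunctionDemo
{
    // Recursive lambda: needs the two-step declare-then-assign trick.
    public static int FactorialWithLambda(int n)
    {
        Func<int, int> f = null;
        f = x => (x <= 1) ? 1 : x * f(x - 1);
        return f(n);
    }

    // Local function: can refer to itself directly.
    public static int FactorialWithLocalFunction(int n)
    {
        int Fact(int x) => (x <= 1) ? 1 : x * Fact(x - 1);
        return Fact(n);
    }

    // Local function used as an iterator block, which a lambda cannot be.
    public static List<int> CountDown(int from)
    {
        IEnumerable<int> Iterate()
        {
            for (int i = from; i > 0; i--)
                yield return i;
        }
        return Iterate().ToList();
    }

    public static void Main()
    {
        Console.WriteLine(FactorialWithLambda(5));            // prints 120
        Console.WriteLine(FactorialWithLocalFunction(5));     // prints 120
        Console.WriteLine(string.Join(",", CountDown(3)));    // prints 3,2,1
    }
}
```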
