I am wondering about the general performance of LINQ. I admit that it comes in handy, but how performant is LINQ? I know that's a broad question, so I want to ask about a particular example:
I have a projection:
var users = reader.Select(user => new MembershipUser(user.Name, user.Age));
And now, I want to convert it to the MembershipUserCollection.
So I do it like this:
MembershipUserCollection membershipUsers = new MembershipUserCollection();
users.ToList().ForEach(membershipUsers.Add); //what is the complexity of this line?
What is the complexity of that last line? Is it O(n^2)?
Does the ToList() method iterate over each element of users and add it to the list?
Or does ToList() work differently? Because if it doesn't, I find it hard to justify using that line instead of simply:
foreach (var user in users)
{
membershipUsers.Add(user);
}
Your example isn't particularly good for your question, because ToList() isn't really in the same class of extension methods as the other ones supporting LINQ. The ToList() extension method is a conversion operation, not a query operation. The real value of LINQ lies in the deferred execution of a composite query built by combining several LINQ query operations, and in improved readability. In LINQ2SQL you also get the advantage of constructing arbitrary queries that get pushed to the DB server for actual execution, taking advantage of optimizations that the DB may have in place to improve performance.
In general, I would expect that the question of performance largely comes down to how well you construct the actual queries and has a lot more to do with how well the programmer knows the tools and data than how well the tool is implemented. In your case, it makes no sense to construct a temporary list just to be able to invoke the convenience ForEach method on it if all you care about is performance. You'd be better off simply iterating over the enumeration you already have (as you suspect). LINQ won't stop a programmer from writing bad code, though it may disguise bad code for the person who doesn't understand how LINQ works.
It's always the case that you can construct an equivalent program not using LINQ for any program using LINQ. It may be that you can actually improve on the performance. I would submit, though, that LINQ makes it much easier to write readable code than non-LINQ solutions. By that, I mean more compact and understandable. It also makes it easier to write composable code, which, when executed in a deferred manner, performs better than non-LINQ compositions. By breaking the code into composable parts, you simplify it and improve understandability.
I think the trick here is to really understand where LINQ makes sense rather than treat it as a shiny, new tool that you need to now use for every problem you have. The nice part of this shiny, new tool, though, is that it really does come in handy in a lot of situations.
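To make the deferred-execution point concrete, here is a minimal sketch (the User type and the data are made up for illustration); nothing runs until the foreach pulls results:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical User type, just for illustration.
record User(string Name, int Age);

class DeferredDemo
{
    static void Main()
    {
        var users = new List<User> { new("Ann", 34), new("Bob", 12) };

        // Composing the query executes nothing yet; both operators are deferred.
        var adultNames = users.Where(u => u.Age >= 18)
                              .Select(u => u.Name);

        // The single pass over the source happens here, at enumeration time.
        foreach (var name in adultNames)
            Console.WriteLine(name);
    }
}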
It's O(n) - since .ToList() iterates once through the enumeration and copies the elements into the resulting List<T> (whose insertion is amortized O(1)). Thus the complexity is fine.
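If it helps to see why, a stripped-down sketch of what ToList() effectively does (not the actual framework source) looks like this:

using System.Collections.Generic;

static class EnumerableSketch
{
    // One pass over the source; each Add is amortized O(1), so the whole call is O(n).
    public static List<T> ToListSketch<T>(this IEnumerable<T> source)
    {
        var list = new List<T>();
        foreach (T item in source)
            list.Add(item);
        return list;
    }
}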
The actual issue you might see is that you create a completely new, temporary List<T> just to copy its contents into another list (and afterwards discard it).
I suspect that's just due to the convenience of having a .ForEach() method on List<T>. One could nonetheless code a direct implementation for IEnumerable<T> (see the sketch below), which would save this one superfluous copy - or just write
foreach (var user in users) membershipUsers.Add(user);
which is basically what you want to express after all ;-)
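For reference, the direct IEnumerable<T> implementation mentioned above could look like this (a sketch, not a framework API; some teams deliberately avoid adding such a method, per the Eric Lippert argument quoted later in this thread):

using System;
using System.Collections.Generic;

static class EnumerableExtensions
{
    // ForEach for any IEnumerable<T>; no temporary List<T> is allocated.
    public static void ForEach<T>(this IEnumerable<T> source, Action<T> action)
    {
        foreach (T item in source)
            action(item);
    }
}

// Usage (hypothetical): users.ForEach(membershipUsers.Add);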
Converting to a list will have the same complexity as iterating over the sequence, which may really be anything, depending on how the sequence is generated. A normal Select over an in-memory list is O(n).
The performance of using ForEach on a List vs a foreach loop comes down to the overhead of invoking a delegate vs the overhead of creating and using an enumerator. I cannot say which one is quicker, but if both are used on an in-memory list, the complexity is the same.
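If you want to measure the delegate-vs-enumerator difference yourself, a rough Stopwatch comparison might look like this (a sketch; numbers will vary by machine and runtime, and a proper benchmark harness would be better):

using System;
using System.Diagnostics;
using System.Linq;

class ForEachTiming
{
    static void Main()
    {
        var list = Enumerable.Range(0, 10_000_000).ToList();
        long sum = 0;

        var sw = Stopwatch.StartNew();
        foreach (int n in list) sum += n;            // enumerator-based loop
        Console.WriteLine($"foreach:        {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        list.ForEach(n => sum += n);                 // one delegate call per element
        Console.WriteLine($"List.ForEach(): {sw.ElapsedMilliseconds} ms");

        Console.WriteLine(sum);                      // keep 'sum' observable
    }
}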
I was playing around with the main differences between four concurrency classes/approaches (async/await, Task, BackgroundWorker, and Thread), so I created some controls which just interact with those four approaches/objects.
I also wanted to dig into LINQ/lambdas at the same time, and I managed to write some successful LINQ statements like this one, for example:
(from t in tabControlOutput.TabPages.Cast<TabPage>()
from c in t.Controls.OfType<WebBrowser>()
select c).ToList().ForEach(c => c.Navigate(Constants.BlankUrl));
Of course I searched before posting this question, and I found some nice information, like this:
First Link
which redirects to a comment by Eric Lippert:
foreach vs ForEach
Besides that, I am aware that LINQ adds some overhead to standard operations that could be issued by plain instructions instead. That overhead seems to be the particular reason why MS suggests sticking to standard, LINQ-less processing, but Eric Lippert has some other arguments, especially regarding ForEach.
There are some statements which confuse me:
... The purpose of an expression is to compute a value, not to cause a side effect. The purpose of a statement is to cause a side effect.
I have never heard or read about that in any programming book, best practice, or elsewhere. In common (and my own) experience, the term "side effect" usually carries a negative connotation of something arbitrary, rather than something achieved deliberately.
Could anyone clarify this quote from Eric Lippert?
Furthermore, Eric Lippert states that some two lines seem to be harder to maintain, which I would not accept (OK, this might be opinion-based).
But, regarding my piece of code, I can see that the only "ugly" thing is the Cast.
So, is there a good reason (not opinion-based, but in purely technical terms: arguments, restrictive statements, principles, readability among them) why my lines should or should not be replaced by a traditional foreach, or even reflection?
EDIT:
I changed my code, commented out my previous lines, made comments, and added the lines below. Please feel free to comment on them:
// ---> NOT RECOMMENDED APPROACH: LINQ is designed to query, but at the end
// there is a modification of the queried objects, which
// is not really LINQ, but a mix.
// It also violates the principle of least astonishment/surprise.
// Changed to:
var qresult = (from t in tabControlOutput.TabPages.Cast<TabPage>()
from c in t.Controls.OfType<WebBrowser>()
select c);
foreach (var tmp in qresult)
{
tmp.Navigate(Constants.BlankUrl);
}
First, let me get a bit nitpicky about what LINQ is.
This is LINQ. Language integrated query. You are querying. Reading.
(from t in tabControlOutput.TabPages.Cast<TabPage>()
from c in t.Controls.OfType<WebBrowser>()
select c).ToList()
The following is not. Technically, it's not LINQ (a set of extension methods) but an ordinary instance method of the List<T> class. And philosophically, this is not querying; you are manipulating data:
.ForEach(c => c.Navigate(Constants.BlankUrl));
Your call to .ToList() is only needed so you can call the .ForEach method of said class. It would be more efficient if you just used a normal foreach on the result of your LINQ.
So your line is not plain LINQ from start to finish; it actually manipulates data, which may come as a surprise to the casual reader who thinks it is plain LINQ, and therefore violates the principle of least surprise. And last but not least, it's less efficient than the foreach alternative.
I am wondering if a foreach loop works slowly when an unstored query is used in place of a stored array or List.
I mean like this:
foreach (var number in list.OrderBy(x => x.Value))
{
// DoSomething();
}
Does the loop in this code recompute the sort on every iteration, or not?
The loop using stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (var tour in list)
{
// DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?
This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Keep in mind that this is a generalization, so there are plenty of cases where it doesn't hold. Nevertheless, the better first instinct is to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (var tour in list)
{
// DoSomething();
}
vs this option:
foreach (var tour in tours.OrderBy(x => x.Value))
{
// DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the linked documentation, you'll see it returns an IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is already finished by the time you receive the first element (and that, I believe, is the crux of your question: no, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same problem to solve for ordering the results, and they accomplish it the same way, meaning they take exactly the same amount of time to reach that point in the code.
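You can verify this yourself with a logging key selector (a quick sketch; what matters is when the log lines appear, not the exact values):

using System;
using System.Linq;

class OrderByTiming
{
    static void Main()
    {
        var numbers = new[] { 3, 1, 2 };

        // The key selector logs each time a sort key is requested.
        var ordered = numbers.OrderBy(n =>
        {
            Console.WriteLine($"key requested for {n}");
            return n;
        });

        Console.WriteLine("starting foreach...");
        // All "key requested" lines appear right after this point, on the
        // first MoveNext: the sort runs once per enumeration, not per element.
        foreach (int n in ordered)
            Console.WriteLine($"got {n}");
    }
}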
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also potentially need to create new memory allocations for the full objects, if you were reading from a stream or database reader before that really only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here: that you are working with something that is not already a list. What I mean by this is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by the simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
Yes and no.
Yes, the foreach statement will seem to work more slowly.
No, your program has the same total amount of work to do, so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList() or .ToArray() in between. In this case you are only using it once (in the foreach), but it is an easy thing to miss; see the sketch below.
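For example, here is the pattern to watch out for, sketched with a made-up Tour type: each foreach over the deferred query re-runs the sort, while materializing once does not.

using System.Collections.Generic;
using System.Linq;

record Tour(int Value);

class RepeatedSortDemo
{
    static void Main()
    {
        var tours = new List<Tour> { new(3), new(1), new(2) };

        // Deferred: each enumeration below re-runs the full O(n log n) sort.
        var ordered = tours.OrderBy(t => t.Value);
        foreach (var t in ordered) { /* first pass: sorts */ }
        foreach (var t in ordered) { /* second pass: sorts again */ }

        // Materializing once avoids the repeated sort.
        List<Tour> orderedList = tours.OrderBy(t => t.Value).ToList();
        foreach (var t in orderedList) { /* no re-sorting here */ }
    }
}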
Edit: Just to be clear, the as cast in the question will not work as intended, but my answer assumes there is no .ToList() after the OrderBy.
This line won't run:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
And yes, the second option (storing the results) will enumerate faster, as the sorting has already been done by the time the loop starts.
For a simple case, where class Foo has a member i, and I have a collection of Foos, say IEnumerable<Foo> foos, and I want to end up with a collection of each Foo's member i, say List<TypeOfi> result.
Question: is it preferable to use a foreach (Option 1 below), some form of LINQ (Option 2 below), or some other method? Or is it perhaps not even worth concerning myself with (i.e., just choose my personal preference)?
Option 1:
foreach (Foo foo in foos)
result.Add(foo.i);
Option 2:
result.AddRange(foos.Select(foo => foo.i));
To me, Option 2 looks cleaner, but I'm wondering if LINQ is too heavy-handed for something that can be achieved with such a simple foreach loop.
Looking for all opinions and suggestions.
I prefer the second option over the first. However, unless there is a reason to pre-create the List<T> and use AddRange, I would avoid it. Personally, I would use:
List<TypeOfi> results = foos.Select(f => f.i).ToList();
In addition, I would not necessarily even use ToList() unless you actually need a true List<T>, or need to force the execution to be immediate instead of deferred. If you just need the collection of "i" values to iterate, I would simply use:
var results = foos.Select(f => f.i);
I definitely prefer the second. It is far more declarative and easier to understand (to me, at least).
LINQ is here to make our lives more declarative so I would hardly consider it heavy handed even in cases as seemingly "trivial" as this.
As Reed said, though, you could improve the quality by using:
var result = foos.Select(f => f.i).ToList();
As long as there is no data already in the result collection.
LINQ isn't heavy-handed in any way; both the foreach and the LINQ code do about the same work, the foreach in the second case is just hidden away.
It really is just a matter of preference, at least concerning LINQ to Objects. If your source collection is a LINQ to Entities query or something similar, it is a completely different case: the second option would push the query into the database, which is much more effective. In this simple case the difference probably won't be that much, but if you throw in a Where operator or others and make the query non-trivial, the LINQ query will most likely have better/faster performance.
I think you could also just do
foos.Select(foo => foo.i).ToList<TypeOfi>();
Regarding LINQ to Objects: if I use a .Where(x => x...) and then straight afterwards a .SkipWhile(x => x...), does this incur a performance penalty because I am going over the collection twice?
Should I find a way to put everything in the Where clause, or in the SkipWhile clause?
There will be a minor inefficiency due to chaining iterators together, but it really will be very minor. (In particular, although each matching item will be seen by both operators, they won't be buffered up or anything like that. LINQ to Objects isn't going to create a new list of all matching items and then run SkipWhile over it.)
If this is so performance-critical, you'd probably get a very slight speed bump by not using LINQ in the first place. In every other case, write the simplest code first and only worry about micro-optimisations like this when you've proved it's a bottleneck.
Using a Where and a SkipWhile doesn't result in "going over the collection twice." LINQ to Objects works on a pull model. When you enumerate the combined query, the SkipWhile will start asking its source for elements. Its source is the Where, so this will cause the Where to start asking its source in turn for elements. So the SkipWhile will see all elements that pass the Where clause, but it's getting them as it goes. The upshot is that LINQ does a foreach over the original collection, returning only elements that pass both the Where and SkipWhile filters -- and this involves only a single pass over the collection.
There may be a trivial loss of efficiency because there are two iterators involved, but it is unlikely to be significant. You should write the code to be clear (as you are doing at the moment), and if you suspect that the clear version is causing a performance issue, measure to make sure, and only then try combining the clauses.
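A quick way to convince yourself of the single pass is to log from a yield-based source (a sketch; the interleaved output shows each element is produced exactly once):

using System;
using System.Collections.Generic;
using System.Linq;

class PullModelDemo
{
    static IEnumerable<int> Source()
    {
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine($"yielding {i}"); // printed once per element
            yield return i;
        }
    }

    static void Main()
    {
        var query = Source().Where(n => n % 2 == 0)   // streams item by item
                            .SkipWhile(n => n < 2);   // also streams

        foreach (int n in query)
            Console.WriteLine($"got {n}");
        // Each "yielding i" line appears exactly once: a single pass over the source.
    }
}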
As with most things, the answer is that it depends on what you're doing. If you have multiple Where clauses that operate on the same object, it's probably worth combining them with &&, for example.
Most of the LINQ operators won't iterate over the entire collection per operator; they merely process one item and pass it on to the next operator. There are exceptions to this, such as Reverse and OrderBy, but generally, if you were using Where and SkipWhile, for example, you would have a chain that processes one item at a time. Now, your first Where statement could obviously filter out some items, so SkipWhile wouldn't see an item until it passed through the preceding operator.
My personal preference is to keep operators separate for clarity and combine them only if performance becomes an issue.
You aren't going over the collection twice when you use Where and SkipWhile.
The Where method will stream its output to the SkipWhile method one item at a time, and likewise the SkipWhile method will stream its output to any subsequent method one item at a time.
(There will be a small overhead because the compiler generates separate iterator objects for each method behind the scenes. But if I was worried about the overhead of compiler-generated iterators then I probably wouldn't be using LINQ in the first place.)
No, there's (essentially) no performance penalty. That's what lazy (deferred) execution is all about.
I feel that using GetEnumerator() and casting IEnumerator.Current is expensive. Any better suggestions?
I'm open to using a different data structure if it offers similar capabilities with better performance.
Afterthought:
Would a generic stack be a better idea so that the cast isn't necessary?
Stack<T> (with foreach) would indeed save the cast, but actually boxing isn't all that bad in the grand scheme of things. If you have performance issues, I doubt this is the area where you can add much value. Use a profiler, and focus on real problems - otherwise this is premature.
Note that if you only want to read the data once (i.e. you are happy to consume the stack), then this may be quicker (avoids the overhead of an enumerator); YMMV.
Stack<T> stack = /* your populated stack */ new Stack<T>();
while (stack.Count > 0)
{
T value = stack.Pop();
// process value
}
Have you done any benchmarks, or are they just gut feelings?
If you think that the majority of the processing time is spent looping through stacks you should benchmark it and make sure that that is the case. If it is, you have a few options.
Redesign the code so that the looping isn't necessary
Find a faster looping construct. (I would recommend generics even though it wouldn't matter that much. Again, do benchmarks).
EDIT:
Examples of looping that might not be necessary are when you try to do lookups in a list, or match two lists, or similar. If the looping takes a long time, see if it makes sense to put the lists into binary trees or hash maps. There could be an initial cost of creating them, but if the code is redesigned you might get that back by having O(1) lookups later on.
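As a sketch of that kind of redesign (with made-up data), replacing a linear Contains scan with a HashSet turns an O(n*m) loop into roughly O(n + m):

using System;
using System.Collections.Generic;
using System.Linq;

class LookupDemo
{
    static void Main()
    {
        var listA = Enumerable.Range(0, 100_000).ToList();
        var listB = Enumerable.Range(50_000, 100_000).ToList();

        // O(n * m): for each item in A, scan all of B.
        // var matches = listA.Where(a => listB.Contains(a)).ToList();

        // O(n + m): build the set once, then O(1) lookups.
        var setB = new HashSet<int>(listB);
        var matches = listA.Where(a => setB.Contains(a)).ToList();

        Console.WriteLine(matches.Count); // 50000
    }
}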
If you need the functionality of a Stack (as opposed to a List, or some other collection type), then yes, use a generic stack. This will speed things up a bit, as the compiler will skip the casting at runtime (because it's guaranteed at compile time).
Stack<MyClass> stacky = new Stack<MyClass>();
foreach (MyClass item in stacky)
{
// this is as fast as you're going to get.
}
Yes, using a generic stack will spare the cast.
Enumerating over a generic IEnumerable<T> or IEnumerator<T> doesn't create a cast if the iterating variable is of type T, so yes using the generic is going to be faster in most cases, but generics have some very subtle issues, especially when used with value types.
Rico Mariani (Microsoft performance architect) has some posts detailing the differences and the underpinnings
Six Questions about Generics and Performance
Performance Quiz #7 -- Generics Improvements and Costs
Performance Quiz #7 -- Generics Improvements and Costs -- Solution
As far as speed is concerned, there are multiple variables; it depends on the context. For example, in an auto-memory-managed codebase like C#, you can get allocation spikes which can affect framerate in something like, say, a game. A nice optimization you can make for this, instead of a foreach, is an enumerator with a while loop:
var enumerator = stack.GetEnumerator();
while (enumerator.MoveNext())
{
    // do stuff with the current value via enumerator.Current
    var value = enumerator.Current;
}
As far as CPU benchmarks go, this probably isn't any faster than a foreach, but a foreach can have unintended allocation spikes, which can ultimately "slow down" the performance of your application.
An alternative to creating an enumerator is to use the ToArray method, and then iterate over the array. The stack iterator causes some slight overhead for checking whether the stack has been modified, whereas iteration over the array would be fast. However, there is of course the overhead of creating the array in the first place. As mats says, you should benchmark the alternatives.
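For completeness, the ToArray variant might look like this (a sketch with made-up data; benchmark before committing to it):

using System.Collections.Generic;
using System.Linq;

class StackIterationDemo
{
    static void Main()
    {
        var stack = new Stack<int>(Enumerable.Range(0, 1000));

        // One O(n) copy up front; the array loop then has no
        // modification check per step, and the stack stays intact.
        int[] items = stack.ToArray();
        for (int i = 0; i < items.Length; i++)
        {
            int value = items[i];
            // process value
        }
    }
}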