Combining LINQ statements for efficiency

Regarding LINQ to Objects: if I use a .Where(x => x....) and then immediately follow it with a .SkipWhile(x => x...), does this incur a performance penalty because I am going over the collection twice?
Should I find a way to put everything in the Where clause or in the SkipWhile clause?

There will be a minor inefficiency due to chaining iterators together, but it really will be very minor. (In particular, although each matching item will be seen by both operators, they won't be buffered up or anything like that. LINQ to Objects isn't going to create a new list of all matching items and then run SkipWhile over it.)
If this is so performance-critical, you'd probably get a very slight speed bump by not using LINQ in the first place. In every other case, write the simplest code first and only worry about micro-optimisations like this when you've proved it's a bottleneck.

Using a Where and a SkipWhile doesn't result in "going over the collection twice." LINQ to Objects works on a pull model. When you enumerate the combined query, the SkipWhile will start asking its source for elements. Its source is the Where, so this will cause the Where to start asking its source in turn for elements. So the SkipWhile will see all elements that pass the Where clause, but it's getting them as it goes. The upshot is that LINQ does a foreach over the original collection, returning only elements that pass both the Where and SkipWhile filters -- and this involves only a single pass over the collection.
There may be a trivial loss of efficiency because there are two iterators involved, but it is unlikely to be significant. You should write the code to be clear (as you are doing at the moment), and if you suspect that the clear version is causing a performance issue, measure to make sure, and only then try combining the clauses.
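As a rough sketch of the pull model described above (the sample data and the CountingSource helper are made up for illustration), you can count how many times the source is actually read when a Where feeds a SkipWhile:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data; any IEnumerable<int> would do.
int[] numbers = { 1, 2, 3, 4, 5, 6 };
int sourceReads = 0;

// Hypothetical helper: yields each element and counts every read of the source.
IEnumerable<int> CountingSource()
{
    foreach (int n in numbers)
    {
        sourceReads++;
        yield return n;
    }
}

// Building the query does not read anything yet.
IEnumerable<int> query = CountingSource()
    .Where(n => n % 2 == 0)   // lets 2, 4, 6 through
    .SkipWhile(n => n < 4);   // drops the leading 2

int readsBeforeEnumeration = sourceReads;
List<int> result = query.ToList();

Console.WriteLine(string.Join(", ", result)); // 4, 6
Console.WriteLine(readsBeforeEnumeration);    // 0
Console.WriteLine(sourceReads);               // 6 (each element read once)
```

Each source element is read exactly once even though two operators are chained, which is the "single pass" the answer refers to.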

As with most things, the answer is that it depends on what you're doing. If you have multiple Where clauses that operate on the same object, it's probably worth combining them with &&, for example.
Most of the LINQ operators won't iterate over the entire collection per operator; they merely process one item and pass it on to the next operator. There are exceptions to this, such as Reverse and OrderBy, but generally, if you were using Where and SkipWhile for example, you would have a chain that processes one item at a time. Now your first Where statement could obviously filter out some items, so SkipWhile wouldn't see an item until it had passed through the preceding operator.
My personal preference is to keep operators separate for clarity and combine them only if performance becomes an issue.

You aren't going over the collection twice when you use Where and SkipWhile.
The Where method will stream its output to the SkipWhile method one item at a time, and likewise the SkipWhile method will stream its output to any subsequent method one item at a time.
(There will be a small overhead because the compiler generates separate iterator objects for each method behind the scenes. But if I was worried about the overhead of compiler-generated iterators then I probably wouldn't be using LINQ in the first place.)

No, there's (essentially) no performance penalty. That's what lazy (deferred) execution is all about.

Related

How Linq's GroupBy method has a deferred execution?

I've found this question, but it has no answer yet... What algorithm does Linq GroupBy use?
Since you have to iterate over the entire source collection to know all the groups, how can its execution be deferred? Will it iterate the source collection only once? Does it use a buffer?

(I'm assuming we're only talking about LINQ to Objects.)
It's still deferred in that until you start asking for results, it won't read the source collection at all. But yes, once you ask for the first result, it will indeed read the whole collection. It only reads the source once, and it only asks each element for its grouping key once. All the results are buffered in memory, as you suspected.
My Edulinq blog post on GroupBy (Edulinq is basically a reimplementation of LINQ to Objects for the sake of education) shows a sample implementation, although that's in terms of ToLookup.
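To make the "deferred, but buffers everything on the first result" behaviour concrete, here is a minimal sketch for LINQ to Objects. The word list and the CountingSource helper are invented for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data.
string[] words = { "apple", "avocado", "banana", "blueberry", "cherry" };
int sourceReads = 0;

// Hypothetical helper that counts reads of the source.
IEnumerable<string> CountingSource()
{
    foreach (string w in words)
    {
        sourceReads++;
        yield return w;
    }
}

// Deferred: building the query reads nothing.
var groups = CountingSource().GroupBy(w => w[0]);
int readsAfterBuildingQuery = sourceReads;

int readsAtFirstGroup = -1;
var keys = new List<char>();
foreach (var g in groups)
{
    // By the time the first group is handed out, the whole source has been read.
    if (readsAtFirstGroup < 0)
        readsAtFirstGroup = sourceReads;
    keys.Add(g.Key);
}

Console.WriteLine(readsAfterBuildingQuery);  // 0
Console.WriteLine(readsAtFirstGroup);        // 5
Console.WriteLine(string.Join(", ", keys));  // a, b, c
Console.WriteLine(sourceReads);              // 5 (source read only once)
```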

Would AddRange() be faster than ToList() in this case?

I have a comma delimited string called ctext which I want to split and put into a List<string>.
Would using LINQ,
List<string> f = ctext.Split(',').ToList();
be slower than not using LINQ?
List<string> f = new List<string>();
f.AddRange(ctext.Split(','));
It seems that LINQ would actually copy something somewhere at some point which would make it slower, whereas AddRange() would just check the size of the list once, expand it, and dump it in.
Or is there an even faster way? (Like using a for loop, but I doubt it.)

Fortunately, we can easily look at what ToList does now that it's open source. (Follow the link for the latest source...)
I haven't seen IListProvider<T> before, but I doubt that an array implements it, which means we've basically got new List<TSource>(source). Looking at the List<T> source shows that both the constructor and AddRange basically end up using CopyTo.
In other words, other than a few levels of indirection, I'd expect them both to do the same thing.

"It seems that LINQ would actually copy something somewhere at some point which would make it slower, whereas AddRange() would just check the size of the list once, expand it, and dump it in."
You are correct that both of these things are happening, but incorrect in thinking that each is specific to one of the operations. Both ToList and AddRange do both of those things. Both operations copy all of the values from the input sequence into the list, and since both of them are adding multiple items at the same time, they're able to see how much to expand the internal capacity of the list all at once, rather than needing to perform multiple expansions.
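For reference, a small sketch of the variants being compared, using a made-up input string. It only shows that they produce the same list; it does not measure speed, so benchmark your own data if the difference matters:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input; the question's ctext would be used the same way.
string ctext = "a,b,c";
string[] parts = ctext.Split(',');

// LINQ: copies the array into a new list.
List<string> viaToList = parts.ToList();

// AddRange on an initialized, empty list: also copies the array.
var viaAddRange = new List<string>();
viaAddRange.AddRange(parts);

// The List<T> constructor that takes a collection: same idea again.
var viaConstructor = new List<string>(parts);

Console.WriteLine(string.Join("|", viaToList));      // a|b|c
Console.WriteLine(string.Join("|", viaAddRange));    // a|b|c
Console.WriteLine(string.Join("|", viaConstructor)); // a|b|c
```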

Does foreach loop work more slowly when used with a not stored list or array?

I am wondering whether a foreach loop runs more slowly when it iterates over a list or array that has not been stored in a variable first.
I mean something like this:
foreach (int number in list.OrderBy(x => x.Value))
{
// DoSomething();
}
Does the loop in this code recompute the sort on every iteration or not?
The loop using stored value:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>;
foreach (int number in list)
{
// DoSomething();
}
And if it does, which code shows the better performance, storing the value or not?

This is often counter-intuitive, but generally speaking, the option that is best for performance is to wait as long as possible to materialize results into a concrete structure like a list or array. Please keep in mind that this is a generalization, and so there are plenty of cases where it doesn't hold. Nevertheless, a good first instinct is to avoid creating the list for as long as possible.
To demonstrate with your sample, we have these two options:
var list = tours.OrderBy(x => x.Value).ToList();
foreach (int number in list)
{
// DoSomething();
}
vs this option:
foreach (int number in tours.OrderBy(x => x.Value))
{
// DoSomething();
}
To understand what is going on here, you need to look at the .OrderBy() extension method. Reading the linked documentation, you'll see it returns an IOrderedEnumerable<TSource> object. With an IOrderedEnumerable, all of the sorting needed for the foreach loop is done once, when the loop asks for its first element (and that, I believe, is the crux of your question: No, it does not re-sort on each iteration). Also note that both samples use the same OrderBy() call. Therefore, both samples have the same problem to solve for ordering the results, and they accomplish it the same way, meaning they take essentially the same amount of time to reach that point in the code.
The difference in the code samples, then, is entirely in using the foreach loop directly vs first calling .ToList(), because in both cases we start from an IOrderedEnumerable. Let's look closely at those differences.
When you call .ToList(), what do you think happens? This method is not magic. There is still code here which must execute in order to produce the list. This code still effectively uses its own foreach loop that you can't see. Additionally, where once you only needed to worry about enough RAM to handle one object at a time, you are now forcing your program to allocate a new block of RAM large enough to hold references for the entire collection. Moving beyond references, you may also potentially need to create new memory allocations for the full objects, if you were reading from a stream or database reader that previously only needed one object in RAM at a time. This is an especially big deal on systems where memory is the primary constraint, which is often the case with web servers, where you may be serving and maintaining session RAM for many, many sessions, but each session only occasionally uses any CPU time to request a new page.
Now I am making one assumption here: that you are working with something that is not already a list. What I mean by this is that the previous paragraphs talked about needing to convert an IOrderedEnumerable into a List, but not about converting a List into some form of IEnumerable. I need to admit that there is some small overhead in creating and operating the state machine that .NET uses to implement those objects. However, I think this is a good assumption. It turns out to be true far more often than we realize. Even in the samples for this question, we're paying this cost regardless, by simple virtue of calling the OrderBy() function.
In summary, there can be some additional overhead in using a raw IEnumerable vs converting to a List, but there probably isn't. Additionally, you are almost certainly saving yourself some RAM by avoiding the conversions to List whenever possible... potentially a lot of RAM.
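If you want to convince yourself that the sort is not repeated per iteration, one rough check is to count key selector calls as a proxy for sorting work. This is only a sketch for in-memory LINQ to Objects, with invented sample values:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data.
int[] values = { 3, 1, 2 };
int keySelectorCalls = 0;

// The key selector counts its own invocations.
IOrderedEnumerable<int> ordered = values.OrderBy(v => { keySelectorCalls++; return v; });
int callsBeforeLoop = keySelectorCalls; // nothing has been sorted yet

var seen = new List<int>();
var callsSeenInLoop = new List<int>();
foreach (int v in ordered)
{
    seen.Add(v);
    // Record the running total on every iteration; it should not grow.
    callsSeenInLoop.Add(keySelectorCalls);
}

Console.WriteLine(callsBeforeLoop);                    // 0
Console.WriteLine(string.Join(", ", seen));            // 1, 2, 3
Console.WriteLine(string.Join(", ", callsSeenInLoop)); // 3, 3, 3
```

The counter is already at its final value on the first iteration and stays there, so the keys are computed once up front rather than once per loop step.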

Yes and no.
Yes, the foreach statement will seem to work slower.
No, your program has the same total amount of work to do, so you will not be able to measure a difference from the outside.
What you need to focus on is not using a lazy operation (in this case OrderBy) multiple times without a .ToList() or .ToArray(). In this case you are only using it once (in the foreach), but it is an easy thing to miss.
Edit: Just to be clear, the as cast in the question will not work as intended, but my answer assumes no .ToList() after OrderBy.
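A quick sketch of the "easy thing to miss" case, again counting key selector calls as a stand-in for sorting work (sample values are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data.
int[] values = { 3, 1, 2 };
int keySelectorCalls = 0;
Func<int, int> countingKey = v => { keySelectorCalls++; return v; };

// Lazy query enumerated twice: the sort work is repeated.
IEnumerable<int> lazy = values.OrderBy(countingKey);
int lazySum = 0;
foreach (int v in lazy) lazySum += v;
foreach (int v in lazy) lazySum += v;
int callsForTwoLazyPasses = keySelectorCalls;

// Stored once with ToList(), then enumerated twice: sorted only once.
keySelectorCalls = 0;
List<int> stored = values.OrderBy(countingKey).ToList();
int storedSum = 0;
foreach (int v in stored) storedSum += v;
foreach (int v in stored) storedSum += v;
int callsForTwoStoredPasses = keySelectorCalls;

Console.WriteLine(callsForTwoLazyPasses);   // 6
Console.WriteLine(callsForTwoStoredPasses); // 3
```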

This line will compile, but it won't do what you want:
List<Tour> list = tours.OrderBy(x => x.Value) as List<Tour>; // Returns null.
Instead, you want to store the results this way:
List<Tour> list = tours.OrderBy(x => x.Value).ToList();
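And yes, the loop over the second option (storing the results) will enumerate faster, because the sorting has already been done by ToList() before the loop starts. The sort itself still has to happen once either way.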
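A minimal sketch of why the cast yields null, using ints instead of the question's Tour type (which isn't shown):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for the question's tours.
int[] values = { 3, 1, 2 };

// OrderBy returns an IOrderedEnumerable<int>, which is not a List<int>,
// so the 'as' cast quietly produces null instead of throwing.
var viaAs = values.OrderBy(v => v) as List<int>;

// ToList() actually builds a list from the ordered sequence.
List<int> viaToList = values.OrderBy(v => v).ToList();

Console.WriteLine(viaAs == null);                // True
Console.WriteLine(string.Join(", ", viaToList)); // 1, 2, 3
```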

LINQ .FromCache().ToList()?

When using .FromCache() on an IQueryable result set, should I additionally call .ToList(), or can I just return the IEnumerable<> returned by the materialized query with FromCache?

I am assuming you are using a derivative of the code from http://petemontgomery.wordpress.com/2008/08/07/caching-the-results-of-linq-queries/ . If you look at the FromCache implementation, you will see that query.ToList() is already called. This means that the evaluated list is what is cached. So:
You do NOT need to call ToList().

That depends entirely on what you want to do with it. If you're just going to foreach over it once then you may as well just leave it as an IEnumerable. There's no need to build up a list just to discard it right away.
If you plan to iterate over it multiple times it's probably best to ToList it, so that you're not accessing the underlying IQueryable multiple times. You should also ToList it if it's possible for the underlying query to change over time and you don't want those changes to be reflected in your query.
If you are likely to not need to iterate all of the items (you may end up stopping after the first item, or half way, or something like that) then it's probably best to leave it as an IEnumerable to potentially avoid even fetching some amount of data in the first place.
If the method has no idea how it's going to be used, and it's just a helper method that will be used by not-yet-written code, then consider returning IEnumerable. The caller can call ToList on it if they have a compelling reason to turn it into a list.
For me, as a general rule, I leave such queries as IEnumerable unless I have some compelling reason to make it a List.
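The "you may stop after the first item" case can be sketched with plain LINQ to Objects. Note that this does not use FromCache or a real IQueryable; ExpensiveQuery is a hypothetical stand-in for whatever produces the data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int itemsProduced = 0;

// Hypothetical stand-in for an expensive underlying query.
IEnumerable<int> ExpensiveQuery()
{
    for (int i = 1; i <= 1000; i++)
    {
        itemsProduced++;
        yield return i;
    }
}

// The caller only needs the first item: the lazy sequence stops early.
int firstLazy = ExpensiveQuery().First();
int producedForLazyFirst = itemsProduced;

// Materializing first forces every item to be produced.
itemsProduced = 0;
int firstFromList = ExpensiveQuery().ToList().First();
int producedForListFirst = itemsProduced;

Console.WriteLine(producedForLazyFirst); // 1
Console.WriteLine(producedForListFirst); // 1000
```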

What is the complexity of this LINQ example?

I am wondering about the general performance of LINQ. I admit that it comes in handy, but how performant is LINQ? I know that is a broad question, so I want to ask about a particular example:
I have an anonymous type:
var users = reader.Select(user => new MembershipUser(reader.Name, reader.Age));
And now, I want to convert it to the MembershipUserCollection.
So I do it like this:
MembershipUserCollection membershipUsers = new MembershipUserCollection();
users.ToList().ForEach(membershipUsers.Add); //what is the complexity of this line?
What is the complexity of the last line? Is it n^2 ?
Does the ToList() method iterate over each element of users and add it to the list?
Or does ToList() work differently? Because if it does not, I find it hard to justify using the last line of the code instead of simply:
foreach (var user in users)
{
membershipUsers.Add(user);
}

Your example isn't particularly good for your question because ToList() isn't really in the same class of extension methods as the other ones supporting LINQ. The ToList() extension method is a conversion operation, not a query operation. The real values in LINQ are deferred execution of a composite query built by combining several LINQ query operations and improved readability. In LINQ2SQL you also get the advantage of constructing arbitrary queries that get pushed to the DB server for actual execution, taking advantage of optimizations that the DB may have in place to improve performance.
In general, I would expect that the question of performance largely comes down to how well you construct the actual queries and has a lot more to do with how well the programmer knows the tools and data than how well the tool is implemented. In your case, it makes no sense to construct a temporary list just to be able to invoke the convenience ForEach method on it if all you care about is performance. You'd be better off simply iterating over the enumeration you already have (as you suspect). LINQ won't stop a programmer from writing bad code, though it may disguise bad code for the person who doesn't understand how LINQ works.
It's always the case that you can construct an equivalent program not using LINQ for any program using LINQ. It may be that you can actually improve on the performance. I would submit, though, that LINQ makes it much easier to write readable code than non-LINQ solutions. By that, I mean more compact and understandable. It also makes it easier to write composable code, which, when executed in a deferred manner, can perform better than non-LINQ compositions. By breaking the code into composable parts, you simplify it and improve understandability.
I think the trick here is to really understand where LINQ makes sense rather than treat it as a shiny, new tool that you need to now use for every problem you have. The nice part of this shiny, new tool, though, is that it really does come in handy in a lot of situations.

It's O(n), since .ToList() iterates once through the enumeration and copies the elements into the resulting List<T> (whose insertion is amortized O(1)). Thus the complexity is fine.
The actual issue you might see is that you create a completely new, temporary List<T> just to copy its contents into another list (and afterwards discard it).
I suspect that's just due to the convenience of having a .ForEach()-method on List<T>s. One could nonetheless code a direct implementation for IEnumerable<T>s, which would save this one superfluous copying - or just write
foreach (var user in users) membershipUsers.Add(user);
which is basically what you want to express after all ;-)
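Here is a small sketch comparing the two forms. MembershipUserCollection is replaced by a plain List<string> as a stand-in, and the Users helper and names are invented; the point is only that both read the source once per element, while the first builds an extra temporary list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for the users sequence.
string[] names = { "ann", "bob", "cy" };
int sourceReads = 0;

// Hypothetical helper that counts reads of the source.
IEnumerable<string> Users()
{
    foreach (string name in names)
    {
        sourceReads++;
        yield return name;
    }
}

// ToList().ForEach: one pass to build a temporary list, then one pass over that list.
var viaForEachMethod = new List<string>();
Users().ToList().ForEach(viaForEachMethod.Add);
int readsForToListForEach = sourceReads;

// Plain foreach: one pass, no temporary list.
sourceReads = 0;
var viaLoop = new List<string>();
foreach (string user in Users()) viaLoop.Add(user);
int readsForLoop = sourceReads;

Console.WriteLine(readsForToListForEach);                   // 3
Console.WriteLine(readsForLoop);                            // 3
Console.WriteLine(viaForEachMethod.SequenceEqual(viaLoop)); // True
```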

Converting to a list will have the same complexity as iterating over the sequence, which may really be anything depending on how the sequence is generated. A normal Select over an in-memory list is O(n).
The performance of using ForEach on a List vs a foreach loop comes down to the overhead of invoking a delegate vs the overhead of creating and using an enumerator. I cannot say which one is quicker, but if both are used on an in-memory list, the complexity is the same.
