Why does IEnumerable<thing> Select run every time? - C#

I took a second to look at why my app had terrible performance. All I did was pause the debugger twice and I found it.
Is there a practical reason why it runs my code every time? The only way I know to prevent this is to add ToArray() at the end. I guess I need to revise all my code and make sure it returns arrays?
Online demo http://ideone.com/EUfJN
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
class Test
{
    static void Main()
    {
        string[] test = new string[] { "a", "sdj", "bb", "d444" };
        var expensivePrint = false;
        IEnumerable<int> ls = test.Select(s =>
        {
            if (expensivePrint) { Console.WriteLine("Doing expensive math"); }
            return s.Length;
        });
        expensivePrint = true;
        foreach (var v in ls)
        {
            Console.WriteLine(v);
        }
        Console.WriteLine("If you dont think it does it everytime, lets try it again");
        foreach (var v in ls)
        {
            Console.WriteLine(v);
        }
    }
}
Output
Doing expensive math
1
Doing expensive math
3
Doing expensive math
2
Doing expensive math
4
If you dont think it does it everytime, lets try it again
Doing expensive math
1
Doing expensive math
3
Doing expensive math
2
Doing expensive math
4

Enumerables evaluate lazily (only when required). Add a .ToList() after the select and it will force evaluation.
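For example, applied to the query from the question (a minimal sketch, reusing the test array and the lambda from above):
List<int> ls = test.Select(s =>
{
    if (expensivePrint) { Console.WriteLine("Doing expensive math"); }
    return s.Length;
}).ToList();   // the lambda runs exactly four times, right here
// The two foreach loops then just read the cached List<int>; nothing is re-evaluated.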

LINQ has lazy (deferred) evaluation methods, and Select is one of them.
Since you enumerate ls with foreach twice, the projection runs and prints the values twice.

The Select causes the iterator to be... iterated.
If it is expensive to build the result, you can .ToList() the result once, then use that list going forward.
List<int> resultAsList = ls.ToList();
// Use resultAsList in each of the foreach statements

When you are building the query
IEnumerable<int> ls = test.Select(s => { if (expensivePrint) { Console.WriteLine("Doing expensive math"); } return s.Length; });
it does not actually EXECUTE and cache the result as you are apparently expecting. This is called "deferred execution".
It just builds the query. The query is actually executed when the foreach statement runs over it.
If you call ToList(), ToArray(), Sum(), Average() or any operator of that kind on your query, it will however execute IMMEDIATELY.
If you want to keep the result of the query, the best thing to do is to cache it in an array or list by calling ToArray() or ToList(), and to enumerate that list or array rather than the constructed query.
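As a small illustration (my own sketch, not part of the original answer), using the test array from the question:
IEnumerable<int> deferred = test.Select(s => { Console.WriteLine("running"); return s.Length; });

Console.WriteLine(deferred.Sum());   // "running" printed 4 times
Console.WriteLine(deferred.Max());   // "running" printed 4 more times

int[] cached = deferred.ToArray();   // "running" printed 4 times, once and for all
Console.WriteLine(cached.Sum());     // no more "running"
Console.WriteLine(cached.Max());     // no more "running"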

Please refer to the documentation of Enumerable.Select
This method is implemented by using deferred execution. The immediate return value is an object that stores all the information that is required to perform the action. The query represented by this method is not executed until the object is enumerated either by calling its GetEnumerator method directly or by using foreach in Visual C# or For Each in Visual Basic.
By iterating the result of the Select method, the query is executed. foreach is one way to iterate that result. ToArray is another.
Is there a practical reason why it runs my code every time?
Yes. If the result were not deferred, more iteration would be performed than necessary. Consider:
IEnumerable<string> query = Enumerable.Range(0, 100000)
    .Select(x => x.ToString())
    .Where(s => s.Length == 5)
    .Take(5);
Thanks to deferred execution, only enough elements are pulled through the pipeline to satisfy Take(5); an eager Select would have converted all 100,000 numbers up front.
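To make the saving concrete, here is a rough sketch (not from the original answer) that counts how many times the projection actually runs:
int projections = 0;
IEnumerable<string> query = Enumerable.Range(0, 100000)
    .Select(x => { projections++; return x.ToString(); })
    .Where(s => s.Length == 5)    // first matches are 10000, 10001, ...
    .Take(5);

Console.WriteLine(string.Join(", ", query));  // 10000, 10001, 10002, 10003, 10004
Console.WriteLine(projections);               // 10005 - nowhere near 100,000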

This is LINQ's deferred execution. If you need a concise yet complete explanation, read this:
http://weblogs.asp.net/dixin/archive/2010/03/16/understanding-linq-to-objects-6-deferred-execution.aspx

I would suggest you use .ToArray(), which returns an int[] and can give slightly better performance.
The reason for int[] is that the array is allocated in one go, whereas a List<T> grows its internal storage as items are added at runtime.
int[] array = test.Select(s =>
{
    if (expensivePrint)
    {
        Console.WriteLine("Doing expensive math");
    }
    return s.Length;
}).ToArray();

Related

Using async/await inside a Select LINQ query [duplicate]

This question already has answers here:
Async await in linq select
After reading this post:
Nesting await in Parallel.ForEach
I tried to do the following:
private static async void Solution3UsingLinq()
{
    var ids = new List<string>() { "1", "2", "3", "4", "5", "6", "7", "8", "9", "10" };
    var customerTasks = ids.Select(async i =>
    {
        ICustomerRepo repo = new CustomerRepo();
        var id = await repo.getCustomer(i);
        Console.WriteLine(id);
    });
}
For some reason, this doesn't work. I don't understand why; I think there is a deadlock, but I'm not sure.
So at the end of your method, customerTasks contains an IEnumerable<Task> that has not been enumerated. None of the code within the Select even runs.
When creating tasks like this, it's probably safer to materialize your sequence immediately to mitigate the risk of double enumeration (and creating a second batch of tasks by accident). You can do this by calling ToList on your sequence.
So:
var customerTasks = ids.Select(async i =>
{
    ICustomerRepo repo = new CustomerRepo();
    var id = await repo.getCustomer(i); //consider changing to GetCustomerAsync
    Console.WriteLine(id);
}).ToList();
Now... what to do with your list of tasks? You need to wait for them all to complete...
You can do this with Task.WhenAll:
await Task.WhenAll(customerTasks);
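Putting those two pieces together, a minimal sketch of the corrected method might look like this (assuming the ICustomerRepo/CustomerRepo types from the question, and using async Task instead of async void so the caller can await it; the method name is my own):
private static async Task Solution3UsingLinqAsync()
{
    var ids = new List<string> { "1", "2", "3", "4", "5", "6", "7", "8", "9", "10" };

    // Materialize with ToList so the tasks are created (and started) exactly once.
    List<Task> customerTasks = ids.Select(async i =>
    {
        ICustomerRepo repo = new CustomerRepo();
        var id = await repo.getCustomer(i);
        Console.WriteLine(id);
    }).ToList();

    await Task.WhenAll(customerTasks); // all lookups have completed here
}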
You could take this a step further by actually returning a value from your async delegate in the Select statement, so you end up with an IEnumerable<Task<Customer>>.
Then you can use a different overload of Task.WhenAll:
IEnumerable<Task<Customer>> customerTasks = ids.Select(async i =>
{
    ICustomerRepo repo = new CustomerRepo();
    var c = await repo.getCustomer(i); //consider changing to GetCustomerAsync
    return c;
}).ToList();
Customer[] customers = await Task.WhenAll(customerTasks); //look... all the customers
Of course, there are probably more efficient means of getting several customers in one go, but that would be for a different question.
If instead, you'd like to perform your async tasks in sequence then:
var customerTasks = ids.Select(async i =>
{
    ICustomerRepo repo = new CustomerRepo();
    var id = await repo.getCustomer(i); //consider changing to GetCustomerAsync
    Console.WriteLine(id);
});
foreach (var task in customerTasks) //items in the sequence will be materialized one by one
{
    await task;
}
Addition:
There seems to be some confusion about when the LINQ statements are actually executed, especially the Where statement.
I created a small program to show when the source data is actually accessed; the results are at the end of this answer.
end of Addition
You have to be aware of the laziness of most LINQ functions.
Lazy LINQ functions only change the enumerator that IEnumerable.GetEnumerator() will return when you start enumerating. Hence, as long as you only call lazy LINQ functions, the query isn't executed.
Only when you start enumerating is the query executed. Enumerating starts when you call foreach, or non-lazy LINQ functions like ToList(), Any(), FirstOrDefault(), Max(), etc.
The documentation of every LINQ function describes whether it is deferred or not. You can also tell by inspecting the return value: if it returns an IEnumerable<...> (or IQueryable), the query has not been enumerated yet.
The nice thing about this laziness is that as long as you use only lazy functions, changing the LINQ expression is not time consuming. Only when you use non-lazy functions do you have to be aware of the impact.
For instance, if fetching the first element of a sequence takes a long time to compute, because of ordering, grouping, database queries, etc., make sure you don't start enumerating more than once (= don't use non-lazy functions on the same sequence more than once).
Don't do this at home:
Suppose you have the following query
var query = toDoLists
    .Where(todo => todo.Person == me)
    .GroupBy(todo => todo.Priority)
    .Select(todoGroup => new
    {
        Priority = todoGroup.Key,
        Hours = todoGroup.Select(todo => todo.ExpectedWorkTime).Sum(),
    })
    .OrderByDescending(work => work.Priority)
    .ThenBy(work => work.Hours);
This query contains only lazy LINQ functions. After all these statements, toDoLists has not been accessed yet.
But as soon as you ask for the first element of the resulting sequence, all elements have to be accessed (probably more than once) to group them by priority, calculate the total number of working hours involved, and sort them by descending priority.
This is the case for Any(), and again for First():
if (query.Any()) // do grouping, summing, ordering
{
    var highestOnTodoList = query.First(); // do all the work again
    Process(highestOnTodoList);
}
else
{
    // nothing to do
    GoFishing();
}
In such cases it is better to use the appropriate function once:
var highestOnToDoList = query.FirstOrDefault(); // do the grouping / summing / ordering once
if (highestOnToDoList != null)
{
    Process(highestOnToDoList);
}
else
{
    GoFishing();
}
Back to your question
The Enumerable.Select statement only created an IEnumerable object for you; you forgot to enumerate over it.
Besides, you constructed a new CustomerRepo for every id. Was that intended?
ICustomerRepo repo = new CustomerRepo();
IEnumerable<Task<Customer>> query = ids.Select(id => repo.getCustomer(id));
foreach (var task in query)
{
    var customer = await task;
    Console.WriteLine(customer);
}
Addition: when are the LINQ statements executed?
I created a small program to test when a LINQ statement is executed, especially when a Where is executed.
A function that returns an IEnumerable:
IEnumerable<int> GetNumbers()
{
    for (int i = 0; i < 10; ++i)
    {
        yield return i;
    }
}
A program that consumes this enumerable using an old-fashioned enumerator:
public static void Main()
{
    IEnumerable<int> numbers = GetNumbers();
    IEnumerable<int> smallNumbers = numbers.Where(number => number < 3);
    IEnumerator<int> smallEnumerator = smallNumbers.GetEnumerator();
    bool smallNumberAvailable = smallEnumerator.MoveNext();
    while (smallNumberAvailable)
    {
        int smallNumber = smallEnumerator.Current;
        Console.WriteLine(smallNumber);
        smallNumberAvailable = smallEnumerator.MoveNext();
    }
}
During debugging I can see that GetNumbers is executed for the first time when MoveNext() is first called; it runs until the first yield return statement.
Every subsequent time MoveNext() is called, the statements after the yield return run until the next yield return is reached.
Changing the code so that the enumerable is consumed using foreach, Any(), FirstOrDefault(), ToDictionary(), etc. shows that these calls are the moment the originating source is actually accessed.
if (smallNumbers.Any())
{
    int x = smallNumbers.First();
    Console.WriteLine(x);
}
Debugging shows that the originating source starts enumerating from the beginning twice. So indeed, it is not wise to do this, especially if you need to do a lot to calculate the first element (GroupBy, OrderBy, Database access, etc)
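A variation of the same experiment (my own sketch) makes the double enumeration visible without a debugger, by logging inside the Where predicate:
IEnumerable<int> smallNumbers = GetNumbers().Where(n =>
{
    Console.WriteLine($"testing {n}");
    return n < 3;
});

if (smallNumbers.Any())           // enumerates the source: prints "testing 0"
{
    int x = smallNumbers.First(); // enumerates the source again: prints "testing 0" a second time
    Console.WriteLine(x);         // 0
}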

C# IEnumerable being reset in child method

I have the below method:
private static List<List<job>> SplitJobsByMonth(IEnumerable<job> inactiveJobs)
{
    List<List<job>> jobsByMonth = new List<List<job>>();
    DateTime cutOff = DateTime.Now.Date.AddMonths(-1).Date;
    cutOff = cutOff.AddDays(-cutOff.Day + 1);
    List<job> temp;
    while (inactiveJobs.Count() > 0)
    {
        temp = inactiveJobs.Where(j => j.completeddt >= cutOff).ToList();
        jobsByMonth.Add(temp);
        inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a));
        cutOff = cutOff.AddMonths(-1);
    }
    return jobsByMonth;
}
It aims to split the jobs by month. 'job' is a class, not a struct. In the while loop, the passed-in IEnumerable is reassigned on each iteration to remove the jobs that have already been processed:
inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a));
Typically this reduces the contents of the collection by quite a lot. However, on the next iteration the line:
temp = inactiveJobs.Where(j => j.completeddt >= cutOff).ToList();
restores the inactiveJobs object to the state it was in when it was passed into the method - so the collection is full again.
I have solved this problem by refactoring the method slightly, but I am curious as to why this happens, as I can't explain it. Can anyone explain why?
Why not just use a group by?
private static List<List<job>> SplitJobsByMonth(IEnumerable<job> inactiveJobs)
{
    var jobsByMonth = (from job in inactiveJobs
                       group job by new DateTime(job.completeddt.Year, job.completeddt.Month, 1)
                       into g
                       select g.ToList()).ToList();
    return jobsByMonth;
}
This happens because of the deferred execution of LINQ's Where.
When you do this
inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a));
no evaluation actually happens until you start iterating the IEnumerable. If you add ToList after the Where, the iteration happens right away, so the contents of inactiveJobs are reduced immediately:
inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a)).ToList();
In LINQ, queries have two different execution behaviors: immediate and deferred.
The query is actually executed when the query variable is iterated over, not when the query variable is created. This is called deferred execution.
You can also force a query to execute immediately, which is useful for caching query results.
To do this, add .ToList() at the end of your line:
inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a)).ToList();
This executes the created query immediately and writes the result to your variable. You can read more about this in the LINQ documentation.
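One more detail worth noting (my own addition, not from the answers above): every deferred Where lambda in the loop captures the same temp variable, so by the time an earlier Where finally runs, temp already refers to a later list and the earlier exclusions are lost. A tiny standalone sketch, independent of the job class, shows the effect:
var source = Enumerable.Range(1, 6);                  // 1..6
IEnumerable<int> remaining = source;
List<int> temp;

temp = remaining.Where(n => n <= 2).ToList();          // 1, 2
remaining = remaining.Where(n => !temp.Contains(n));   // deferred; captures the variable temp

temp = remaining.Where(n => n <= 4).ToList();          // 3, 4 (exclusion still correct here)
remaining = remaining.Where(n => !temp.Contains(n));   // deferred; captures temp again

// temp is now { 3, 4 }, so the first deferred Where no longer excludes 1 and 2:
Console.WriteLine(string.Join(", ", remaining));       // 1, 2, 5, 6
Adding ToList() to the reassignment snapshots the result while temp still holds the right list, which is why the fix above works.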

Understanding Deferred Execution: Is a LINQ query re-executed every time its collection of anonymous objects is referred to?

I'm currently trying to write some code that will run a query on two separate databases and return the results to an anonymous object. Once I have the two collections of anonymous objects, I need to perform a comparison on the two collections: I need to retrieve all of the records that are in webOrders but not in foamOrders. Currently, I'm making the comparison with LINQ. My major problem is that both of the original queries return about 30,000 records, and as my code stands it takes way too long to complete. I'm new to LINQ, so I'm trying to understand whether using LINQ to compare the two collections of anonymous objects will actually cause the database queries to run over and over again - due to deferred execution. This may have an obvious answer, but I don't yet have a very firm understanding of how LINQ and anonymous objects work with deferred execution. I'm hoping someone may be able to enlighten me. Below is the code that I have...
private DataTable GetData()
{
using (var foam = Databases.Foam(false))
{
using (MySqlConnection web = new MySqlConnection(Databases.ConnectionStrings.Web(true)))
{
var foamOrders = foam.DataTableEnumerable(#"
SELECT order_id
FROM Orders
WHERE order_id NOT LIKE 'R35%'
AND originpartner_code = 'VN000011'
AND orderDate > Getdate() - 7 ")
.Select(o => new
{
order = o[0].ToString().Trim()
}).ToList();
var webOrders = web.DataTableEnumerable(#"
SELECT ORDER_NUMBER FROM TRANSACTIONS AS T WHERE
(Str_to_date(T.ORDER_DATE, '%Y%m%d %k:%i:%s') >= DATE_SUB(Now(), INTERVAL 7 DAY))
AND (STR_TO_DATE(T.ORDER_DATE, '%Y%m%d %k:%i:%s') <= DATE_SUB(NOW(), INTERVAL 1 HOUR))")
.Select(o => new
{
order = o[0].ToString().Trim()
}).ToList();
return (from w in webOrders
where !(from f in foamOrders
select f.order).Contains(w.order)
select w
).ToDataTable();
}
}
}
Your LINQ ceases to be deferred when you do
ToDataTable();
At that point it is snapshotted - done and dusted forever.
The same is true of foamOrders and webOrders when you convert them with
ToList();
You could do it as one query. I don't have MySQL available to check it out on.
Regarding deferred execution:
The .ToList() method iterates over the IEnumerable, retrieves all values, and fills a new List<T> object with those values. So there is definitely no deferred execution at that point.
It's most likely the same with .ToDataTable().
P.S.
That said, I'd recommend that you:
Use custom types rather than anonymous types.
Do not use LINQ to compare the objects, because it's not very efficient (LINQ does extra work).
You could create a custom MyComparer class (which might implement the IComparer interface) with a method like Compare<T1, T2> that compares two entities. Then you could create another method to compare two sets of entities, for example T1[] CompareRange<T1, T2>(T1[] entities1, T2[] entities2), that reuses your Compare method in a loop and returns the result of the operation.
P.S.
Some other resource-intensive operations that may lead to significant performance losses (when you need to perform thousands of operations):
Use of an enumerator object (a foreach loop or some LINQ methods). Possible solution: use a for loop where possible.
Extensive use of anonymous methods (lambda expressions take extra time to compile). Possible solution: store lambdas in delegates (like Func<T1, T2>).
In case it helps anyone in the future, my new code is pasted below. It runs much faster now. Thanks to everyone's help, I've learned that even though the deferred execution of my database queries was cut off and the results became static once I used .ToList(), using LINQ to compare the resulting collections was very inefficient. I went with a plain loop instead.
private DataTable GetData()
{
//Needed to have both connections open in order to preserve the scope of var foamOrders and var webOrders, which are both needed in order to perform the comparison.
using (var foam = Databases.Foam(isDebug))
{
using (MySqlConnection web = new MySqlConnection(Databases.ConnectionStrings.Web(isDebug)))
{
var foamOrders = foam.DataTableEnumerable(#"
SELECT foreignID
FROM Orders
WHERE order_id NOT LIKE 'R35%'
AND originpartner_code = 'VN000011'
AND orderDate > Getdate() - 7 ")
.Select(o => new
{
order = o[0].ToString()
.Trim()
}).ToList();
var webOrders = web.DataTableEnumerable(#"
SELECT ORDER_NUMBER FROM transactions AS T WHERE
(Str_to_date(T.ORDER_DATE, '%Y%m%d %k:%i:%s') >= DATE_SUB(Now(), INTERVAL 7 DAY))
AND (STR_TO_DATE(T.ORDER_DATE, '%Y%m%d %k:%i:%s') <= DATE_SUB(NOW(), INTERVAL 1 HOUR))
", 300)
.Select(o => new
{
order = o[0].ToString()
.Trim()
}).ToList();
List<OrderNumber> on = new List<OrderNumber>();
foreach (var w in webOrders)
{
if (!foamOrders.Contains(w))
{
OrderNumber o = new OrderNumber();
o.orderNumber = w.order;
on.Add(o);
}
}
return on.ToDataTable();
}
}
}
public class OrderNumber
{
public string orderNumber { get; set; }
}
}

Why is this output variable in my LINQ expression NOT problematic?

Given the following code:
var strings = Enumerable.Range(0, 100).Select(i => i.ToString());
int outValue = 0;
var someEnumerable = strings.Where(s => int.TryParse(s, out outValue))
.Select(s => outValue);
outValue = 3;
//enumerating over someEnumerable here shows ints from 0 to 99
I am able to see a "snapshot" of the out parameter for each iteration. Why does this work correctly instead of me seeing 100 3's (deferred execution) or 100 99's (access to modified closure)?
First you define a query, strings, that knows how to generate a sequence of strings when queried. Each time a value is asked for, it will generate a new number and convert it to a string.
Then you declare a variable, outValue, and assign 0 to it.
Then you define a new query, someEnumerable, that knows how to, when asked for a value, get the next value from the query strings, try to parse the value and, if the value can be parsed, yields the value of outValue. Once again, we have defined a query that can do this, we have not actually done any of this.
You then set outValue to 3.
Then you ask someEnumerable for its first value; you are asking the implementation of Select for its value. To compute that value it will ask the Where for its first value. The Where will ask strings. (We'll skip a few steps now.) The Where will get a 0. It will call the predicate on 0, specifically calling int.TryParse. A side effect of this is that outValue will be set to 0. TryParse returns true, so the item is yielded. Select then maps that value (the string "0") into a new value using its selector. The selector ignores the value and yields the value of outValue at that point in time, which is 0. Our foreach loop now does whatever with 0.
Now we ask someEnumerable for its second value, on the next iteration of the loop. It asks Select for a value, Select asks Where, Where asks strings, strings yields "1", Where calls the predicate, setting outValue to 1 as a side effect, and Select yields the current value of outValue, which is 1. The foreach loop now does whatever with 1.
So the key point here is that due to the way in which Where and Select defer execution, performing their work only immediately when the values are needed, the side effect of the Where predicate ends up happening immediately before each projection in the Select. If you didn't defer execution, and instead performed all of the TryParse calls before any of the projections in Select, then you would see 99 (the final value of outValue) for every element. We can actually simulate this easily enough. We can materialize the results of the Where into a collection, and then see the result of the Select be 99 repeated over and over:
var someEnumerable = strings.Where(s => int.TryParse(s, out outValue))
.ToList()//eagerly evaluate the query up to this point
.Select(s => outValue);
Having said all of that, the query that you have is not particularly good design. Whenever possible you should avoid queries that have side effects (such as your Where). The fact that the query both causes side effects, and observes the side effects that it creates, makes following all of this rather hard. The preferable design would be to rely on purely functional methods that aren't causing side effects. In this context the simplest way to do that is to create a method that tries to parse a string and returns an int?:
public static int? TryParse(string rawValue)
{
    int output;
    if (int.TryParse(rawValue, out output))
        return output;
    else
        return null;
}
This allows us to write:
var someEnumerable = from s in strings
let n = TryParse(s)
where n != null
select n.Value;
Here there are no observable side effects in the query, nor is the query observing any external side effects. It makes the whole query far easier to reason about.
Because when you enumerate, the values come through one at a time and the variable is updated on the fly. Due to the nature of LINQ, the Select for the first iteration is executed before the Where for the second iteration. Basically the variable turns into a kind of foreach loop variable.
This is what deferred execution buys us. Previous methods do not have to execute fully before the next method in the chain starts. One value moves through all the methods before the second goes in. This is very useful with methods like First or Take which stop the iteration early. Exceptions to the rule are methods that need to aggregate or sort like OrderBy (they need to look at all elements before finding out which is first). If you add an OrderBy before the Select the behavior will probably break.
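A quick sketch of that OrderBy caveat (my own example, not from the original answer): OrderBy has to consume the whole Where before yielding anything, so every projection then sees the final value of outValue.
var strings = Enumerable.Range(0, 100).Select(i => i.ToString());
int outValue = 0;
var reordered = strings.Where(s => int.TryParse(s, out outValue))
                       .OrderBy(s => s.Length)   // buffers the entire filtered sequence first
                       .Select(s => outValue);

foreach (var v in reordered)
    Console.WriteLine(v);                        // prints 99, one hundred times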
Of course I wouldn't depend on this behavior in production code.
I don't see what is odd here.
If you write a loop over this enumerable like this
foreach (var i in someEnumerable)
{
    Console.WriteLine(outValue);
}
you see the values 0 to 99, because LINQ enumerates each Where and Select lazily, yielding one value at a time. If you add ToArray
var someEnumerable = strings.Where(s => int.TryParse(s, out outValue))
                            .Select(s => outValue).ToArray();
then in the loop you will see 99s.
Edit
The code below will print 99s:
var strings = Enumerable.Range(0, 100).Select(i => i.ToString());
int outValue = 0;
var someEnumerable = strings.Where(s => int.TryParse(s, out outValue))
                            .Select(s => outValue).ToArray();
//outValue = 3;
foreach (var i in someEnumerable)
{
    Console.WriteLine(outValue);
}

What is the easiest way to make LINQ process the whole collection at once?

I often have similar constructions:
var t = from i in Enumerable.Range(0, 5)
        select num(i);
Console.WriteLine(t.Count());
foreach (var item in t)
    Console.WriteLine(item);
In this case LINQ will evaluate the num() function twice for each element (once for Count() and once for the output). So after such LINQ calls I have to declare a new variable: var t2 = t.ToList();
Is there a better way to do this?
You can call ToList without making a separate variable:
var t = Enumerable.Range(0,5).Select(num).ToList();
EDIT: Or,
var t = Enumerable.Range(0,5).Select(x => num(x)).ToList();
Or even
var t = (from i in Enumerable.Range(0, 5)
         select num(i)).ToList();
I usually call the function and count manually, so it could look like this:
var t = Enumerable.Range(0, 5).Select(x => num(x)).ToList();
...
var count = 0;
foreach (var item in t)
{
    Console.WriteLine(item);
    count++;
}
num is now evaluated once per item. If you wanted to use the predicate overload (Count(i=>i.PropA == "abc")) then just wrap the increment in an if. Don't use extension methods for everything; like you realized, just because you can't see the implementation doesn't mean it isn't still costing you.
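For example, a sketch of the "wrap the increment in an if" idea, assuming the items in t have the (hypothetical) PropA property used in the predicate above:
var count = 0;
foreach (var item in t)
{
    Console.WriteLine(item);
    if (item.PropA == "abc")   // same condition as Count(i => i.PropA == "abc")
        count++;
}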
If you expect to use the concrete values a lot, then ToList() is a great answer.
