C# IEnumerable being reset in child method - c#

I have the below method:
private static List<List<job>> SplitJobsByMonth(IEnumerable<job> inactiveJobs)
{
    List<List<job>> jobsByMonth = new List<List<job>>();
    DateTime cutOff = DateTime.Now.Date.AddMonths(-1).Date;
    cutOff = cutOff.AddDays(-cutOff.Day + 1);
    List<job> temp;
    while (inactiveJobs.Count() > 0)
    {
        temp = inactiveJobs.Where(j => j.completeddt >= cutOff).ToList();
        jobsByMonth.Add(temp);
        inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a));
        cutOff = cutOff.AddMonths(-1);
    }
    return jobsByMonth;
}
It aims to split the jobs by month. 'job' is a class, not a struct. In the while loop, the passed-in IEnumerable is reassigned on each iteration to remove the jobs that have already been processed:
inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a));
Typically this reduces the content of this collection by quite a lot. However, on the next iteration the line:
temp = inactiveJobs.Where(j => j.completeddt >= cutOff).ToList();
restores the inactiveJobs object to the state it was in when it was passed into the method, so the collection is full again.
I have solved this problem by refactoring this method slightly, but I am curious as to why this issue occurs as I can't explain it. Can anyone explain why this is happening?

Why not just use a group by?
private static List<List<job>> SplitJobsByMonth(IEnumerable<job> inactiveJobs)
{
    var jobsByMonth = (from job in inactiveJobs
                       group job by new DateTime(job.completeddt.Year, job.completeddt.Month, 1)
                       into g
                       select g.ToList()).ToList();
    return jobsByMonth;
}

This happens because of deferred execution of LINQ's Where.
When you do this
inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a));
no evaluation actually happens until you start iterating the IEnumerable. If you add ToList after the Where, the iteration happens right away, so the content of inactiveJobs is actually reduced:
inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a)).ToList();
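To see the difference, here is a minimal self-contained sketch (using a plain list of ints rather than the job class from the question) showing that a deferred Where is re-evaluated against its source every time it is enumerated, while ToList takes a snapshot:
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var numbers = new List<int> { 1, 2, 3, 4 };

        IEnumerable<int> deferred = numbers.Where(n => n > 2);   // query definition only
        List<int> snapshot = numbers.Where(n => n > 2).ToList(); // executed right here

        numbers.Add(5); // change the underlying source after both queries were written

        Console.WriteLine(string.Join(", ", deferred)); // 3, 4, 5 - re-reads the current source
        Console.WriteLine(string.Join(", ", snapshot)); // 3, 4    - fixed at the time of ToList
    }
}
Because the query in the question is never materialized, each later enumeration re-evaluates the stacked Where filters against the original source (and those filters capture the single temp variable by reference, so they always test against the latest temp list); appending ToList() as shown above avoids both effects.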

In LINQ, queries have two different behaviors of execution: immediate and deferred.
The query is actually executed when the query variable is iterated over, not when the query variable is created. This is called deferred execution.
You can also force a query to execute immediately, which is useful for caching query results.
To force the query to execute immediately here, add .ToList() at the end of your line:
inactiveJobs = inactiveJobs.Where(a => !temp.Contains(a)).ToList();
This executes the query immediately and writes the result to your variable.
You can read more about deferred execution here.


Using async/await inside a Select linq query [duplicate]

After reading this post:
Nesting await in Parallel.ForEach
I tried to do the following:
private static async void Solution3UsingLinq()
{
    var ids = new List<string>() { "1", "2", "3", "4", "5", "6", "7", "8", "9", "10" };
    var customerTasks = ids.Select(async i =>
    {
        ICustomerRepo repo = new CustomerRepo();
        var id = await repo.getCustomer(i);
        Console.WriteLine(id);
    });
}
For some reason, this doesn't work. I don't understand why; I think there is a deadlock, but I'm not sure.
So at the end of your method, customerTasks contains an IEnumerable<Task> that has not been enumerated. None of the code within the Select even runs.
When creating tasks like this, it's probably safer to materialize your sequence immediately to mitigate the risk of double enumeration (and creating a second batch of tasks by accident). You can do this by calling ToList on your sequence.
So:
var customerTasks = ids.Select(async i =>
{
    ICustomerRepo repo = new CustomerRepo();
    var id = await repo.getCustomer(i); // consider changing to GetCustomerAsync
    Console.WriteLine(id);
}).ToList();
Now... what to do with your list of tasks? You need to wait for them all to complete...
You can do this with Task.WhenAll:
await Task.WhenAll(customerTasks);
You could take this a step further by actually returning a value from your async delegate in the Select statement, so you end up with an IEnumerable<Task<Customer>>.
Then you can use a different overload of Task.WhenAll:
IEnumerable<Task<Customer>> customerTasks = ids.Select(async i =>
{
    ICustomerRepo repo = new CustomerRepo();
    var c = await repo.getCustomer(i); // consider changing to GetCustomerAsync
    return c;
}).ToList();
Customer[] customers = await Task.WhenAll(customerTasks); // look... all the customers
Of course, there are probably more efficient means of getting several customers in one go, but that would be for a different question.
If instead, you'd like to perform your async tasks in sequence then:
var customerTasks = ids.Select(async i =>
{
    ICustomerRepo repo = new CustomerRepo();
    var id = await repo.getCustomer(i); // consider changing to GetCustomerAsync
    Console.WriteLine(id);
});
foreach (var task in customerTasks) // items in the sequence will be materialized one-by-one
{
    await task;
}
Addition:
There seems to be some confusion about when the LINQ statements are actually executed, especially the Where statement. I created a small program to show when the source data is actually accessed. Results are at the end of this answer.
End of addition.
You have to be aware of the laziness of most LINQ functions.
Lazy LINQ functions only change the enumerator that IEnumerable.GetEnumerator() will return when you start enumerating. Hence, as long as you only call lazy LINQ functions, the query isn't executed.
Only when you start enumerating is the query executed. Enumerating starts when you use foreach, or call non-lazy LINQ functions like ToList(), Any(), FirstOrDefault(), Max(), etc.
The documentation of every LINQ function describes whether the function is deferred or not. You can also tell by inspecting the return type: if it returns an IEnumerable<...> (or IQueryable<...>), the query has not been enumerated yet.
The nice thing about this laziness is that as long as you use only lazy functions, composing the LINQ expression is not time consuming. Only when you use non-lazy functions do you have to be aware of the impact.
For instance, if fetching the first element of a sequence takes a long time to calculate because of ordering, grouping, database queries, etc., make sure you don't start enumerating more than once (that is, don't use non-lazy functions on the same sequence more than once).
Don't do this at home:
Suppose you have the following query
var query = toDoLists
    .Where(todo => todo.Person == me)
    .GroupBy(todo => todo.Priority)
    .Select(todoGroup => new
    {
        Priority = todoGroup.Key,
        Hours = todoGroup.Select(todo => todo.ExpectedWorkTime).Sum(),
    })
    .OrderByDescending(work => work.Priority)
    .ThenBy(work => work.Hours);
This query contains only lazy LINQ functions. After all these statements, the toDoLists have not been accessed yet.
But as soon as you get the first element of the resulting sequence, all elements have to be accessed (probably more than once) to group them by priority, calculate the total working hours involved, and sort them by descending priority.
This is the case for Any(), and again for First():
if (query.Any()) // do grouping, summing, ordering
{
    var highestOnTodoList = query.First(); // do all the work again
    Process(highestOnTodoList);
}
else
{
    // nothing to do
    GoFishing();
}
In such cases it is better to use the correct function:
var highestOnToDoList = query.FirstOrDefault(); // grouping / summing / ordering runs only once
if (highestOnToDoList != null)
    Process(highestOnToDoList);
else
    GoFishing();
back to your question
The Enumerable.Select statement only created an IEnumerable object for you. You forgot to enumerate over it.
Besides, you constructed your CustomerRepo several times. Was that intended?
ICustomerRepo repo = new CustomerRepo();
IEnumerable<Task<Customer>> query = ids.Select(id => repo.getCustomer(id));
foreach (var task in query)
{
    var customer = await task;
    Console.WriteLine(customer);
}
Addition: when are the LINQ statements executed?
I created a small program to test when a LINQ statement is executed, especially when a Where is executed.
A function that returns an IEnumerable:
IEnumerable<int> GetNumbers()
{
    for (int i = 0; i < 10; ++i)
    {
        yield return i;
    }
}
A program that consumes this enumeration using an old-fashioned enumerator:
public static void Main()
{
    IEnumerable<int> numbers = GetNumbers();
    IEnumerable<int> smallNumbers = numbers.Where(number => number < 3);
    IEnumerator<int> smallEnumerator = smallNumbers.GetEnumerator();
    bool smallNumberAvailable = smallEnumerator.MoveNext();
    while (smallNumberAvailable)
    {
        int smallNumber = smallEnumerator.Current;
        Console.WriteLine(smallNumber);
        smallNumberAvailable = smallEnumerator.MoveNext();
    }
}
During debugging I can see that GetNumbers is first executed the first time MoveNext() is called. GetNumbers() then runs until the first yield return statement.
Every time MoveNext() is called, the statements after the yield return run until the next yield return is reached.
Changing the code so that the enumerable is consumed using foreach, Any(), FirstOrDefault(), ToDictionary(), etc., shows that the calls to these functions are the moment the originating source is actually accessed.
if (smallNumbers.Any())
{
    int x = smallNumbers.First();
    Console.WriteLine(x);
}
Debugging shows that the originating source is enumerated from the beginning twice. So indeed, it is not wise to do this, especially if calculating the first element takes a lot of work (GroupBy, OrderBy, database access, etc.).

Understanding Deferred Execution: Is a Linq Query Re-executed Every Time its collection of anonymous objects is referred to?

I'm currently trying to write some code that will run a query on two separate databases and return the results to an anonymous object. Once I have the two collections of anonymous objects, I need to perform a comparison on them: I need to retrieve all of the records that are in webOrders but not in foamOrders. Currently, I'm making the comparison with LINQ.

My major problem is that both of the original queries return about 30,000 records, and as my code is now, it takes way too long to complete. I'm new to using LINQ, so I'm trying to understand whether using LINQ to compare the two collections of anonymous objects will actually cause the database queries to run over and over again due to deferred execution. This may be an obvious answer, but I don't yet have a very firm understanding of how LINQ and anonymous objects work with deferred execution. I'm hoping someone may be able to enlighten me. Below is the code that I have:
private DataTable GetData()
{
    using (var foam = Databases.Foam(false))
    {
        using (MySqlConnection web = new MySqlConnection(Databases.ConnectionStrings.Web(true)))
        {
            var foamOrders = foam.DataTableEnumerable(@"
                SELECT order_id
                FROM Orders
                WHERE order_id NOT LIKE 'R35%'
                AND originpartner_code = 'VN000011'
                AND orderDate > Getdate() - 7 ")
                .Select(o => new
                {
                    order = o[0].ToString().Trim()
                }).ToList();

            var webOrders = web.DataTableEnumerable(@"
                SELECT ORDER_NUMBER FROM TRANSACTIONS AS T WHERE
                (Str_to_date(T.ORDER_DATE, '%Y%m%d %k:%i:%s') >= DATE_SUB(Now(), INTERVAL 7 DAY))
                AND (STR_TO_DATE(T.ORDER_DATE, '%Y%m%d %k:%i:%s') <= DATE_SUB(NOW(), INTERVAL 1 HOUR))")
                .Select(o => new
                {
                    order = o[0].ToString().Trim()
                }).ToList();

            return (from w in webOrders
                    where !(from f in foamOrders
                            select f.order).Contains(w.order)
                    select w
                   ).ToDataTable();
        }
    }
}
Your LINQ ceases to be deferred when you do
ToDataTable();
At that point it is snapshotted as done and dusted forever.
The same is true for foamOrders and webOrders when you convert them with
ToList();
You could do it as one query. I don't have MySQL to check it out on.
Regarding deferred execution:
The .ToList() method iterates over the IEnumerable, retrieves all the values, and fills a new List<T> with those values. So it's definitely not deferred execution at that point.
It's most likely the same with .ToDataTable().
P.S.
But I'd recommend the following:
Use custom types rather than anonymous types.
Do not use LINQ to compare the objects, because it's not really efficient (LINQ does extra work).
You can create a custom MyComparer class (which might implement the IComparer interface) with a method like Compare<T1, T2> that compares two entities. Then you can create another method that compares two sets of entities, for example T1[] CompareRange<T1, T2>(T1[] entities1, T2[] entities2), which reuses your Compare method in a loop and returns the result of the operation. A rough sketch of this idea follows below.
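As a rough illustration of that idea (this is a sketch, not the original poster's code; the CompareRange name is kept from the suggestion above, but the key selectors and the HashSet lookup are choices made for the example), a comparer helper specialized to the order-number scenario could look something like this:
using System;
using System.Collections.Generic;

static class MyComparer
{
    // Returns the items from 'left' whose key does not appear among the keys of 'right'.
    public static T1[] CompareRange<T1, T2, TKey>(
        IEnumerable<T1> left,
        IEnumerable<T2> right,
        Func<T1, TKey> leftKey,
        Func<T2, TKey> rightKey)
    {
        // Collect the right-hand keys once, so each membership test is O(1)
        // instead of rescanning the whole second collection per item.
        var rightKeys = new HashSet<TKey>();
        foreach (var r in right)
            rightKeys.Add(rightKey(r));

        var result = new List<T1>();
        foreach (var item in left)
        {
            if (!rightKeys.Contains(leftKey(item)))
                result.Add(item);
        }
        return result.ToArray();
    }
}

// Hypothetical usage, once the anonymous types are replaced by custom types as suggested above:
// var missing = MyComparer.CompareRange(webOrders, foamOrders, w => w.order, f => f.order);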
P.S.
Some other resource-intensive operations that may potentially lead to significant performance losses (in case you need to perform thousands of operations):
Use of an enumerator object (a foreach loop or some of the LINQ methods).
Possible solution: use a for loop where possible.
Extensive use of anonymous methods (the compiler needs extra time to compile lambda expressions).
Possible solution: store lambdas in delegates (such as Func<T1, T2>).
In case it helps anyone in the future, my new code is pasted below. It runs much faster now. Thanks to everyone's help, I've learned that even though the deferred execution of my database queries was cut off and the results became static once I used .ToList(), using LINQ to compare the resulting collections was very inefficient. I went with a loop instead.
private DataTable GetData()
{
    // Needed to have both connections open in order to preserve the scope of foamOrders
    // and webOrders, which are both needed in order to perform the comparison.
    using (var foam = Databases.Foam(isDebug))
    {
        using (MySqlConnection web = new MySqlConnection(Databases.ConnectionStrings.Web(isDebug)))
        {
            var foamOrders = foam.DataTableEnumerable(@"
                SELECT foreignID
                FROM Orders
                WHERE order_id NOT LIKE 'R35%'
                AND originpartner_code = 'VN000011'
                AND orderDate > Getdate() - 7 ")
                .Select(o => new
                {
                    order = o[0].ToString().Trim()
                }).ToList();

            var webOrders = web.DataTableEnumerable(@"
                SELECT ORDER_NUMBER FROM transactions AS T WHERE
                (Str_to_date(T.ORDER_DATE, '%Y%m%d %k:%i:%s') >= DATE_SUB(Now(), INTERVAL 7 DAY))
                AND (STR_TO_DATE(T.ORDER_DATE, '%Y%m%d %k:%i:%s') <= DATE_SUB(NOW(), INTERVAL 1 HOUR))
                ", 300)
                .Select(o => new
                {
                    order = o[0].ToString().Trim()
                }).ToList();

            List<OrderNumber> on = new List<OrderNumber>();
            foreach (var w in webOrders)
            {
                if (!foamOrders.Contains(w))
                {
                    OrderNumber o = new OrderNumber();
                    o.orderNumber = w.order;
                    on.Add(o);
                }
            }
            return on.ToDataTable();
        }
    }
}

public class OrderNumber
{
    public string orderNumber { get; set; }
}

Why is this output variable in my LINQ expression NOT problematic?

Given the following code:
var strings = Enumerable.Range(0, 100).Select(i => i.ToString());
int outValue = 0;
var someEnumerable = strings.Where(s => int.TryParse(s, out outValue))
.Select(s => outValue);
outValue = 3;
//enumerating over someEnumerable here shows ints from 0 to 99
I am able to see a "snapshot" of the out parameter for each iteration. Why does this work correctly instead of me seeing 100 3's (deferred execution) or 100 99's (access to modified closure)?
First you define a query, strings, that knows how to generate a sequence of strings when queried. Each time a value is asked for, it will generate a new number and convert it to a string.
Then you declare a variable, outValue, and assign 0 to it.
Then you define a new query, someEnumerable, that knows how to, when asked for a value, get the next value from the query strings, try to parse it, and, if it can be parsed, yield the value of outValue. Once again, we have defined a query that can do this; we have not actually done any of it.
You then set outValue to 3.
Then you ask someEnumerable for its first value; you are asking the implementation of Select for its value. To compute that value it will ask the Where for its first value. The Where will ask strings. (We'll skip a few steps now.) The Where will get a 0. It will call the predicate on 0, specifically calling int.TryParse. A side effect of this is that outValue will be set to 0. TryParse returns true, so the item is yielded. Select then maps that value (the string 0) into a new value using its selector. The selector ignores the value and yields the value of outValue at that point in time, which is 0. Our foreach loop now does whatever with 0.
Now we ask someEnumerable for its second value, on the next iteration of the loop. It asks Select for a value, Select asks Where, Where asks strings, strings yields "1", Where calls the predicate, setting outValue to 1 as a side effect, and Select yields the current value of outValue, which is 1. The foreach loop now does whatever with 1.
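To make that interleaving visible, here is a small self-contained sketch (not from the question; the range is shortened to three elements) that logs when each stage runs:
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        int outValue = 0;

        var query = Enumerable.Range(0, 3)
            .Select(i => i.ToString())
            .Where(s =>
            {
                bool ok = int.TryParse(s, out outValue);
                Console.WriteLine($"Where  ran for \"{s}\", outValue is now {outValue}");
                return ok;
            })
            .Select(s =>
            {
                Console.WriteLine($"Select ran for \"{s}\", yielding {outValue}");
                return outValue;
            });

        foreach (var n in query)
            Console.WriteLine($"foreach received {n}");

        // The output alternates Where / Select / foreach per element:
        //   Where  ran for "0", outValue is now 0
        //   Select ran for "0", yielding 0
        //   foreach received 0
        //   ... and likewise for "1" and "2"
    }
}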
So the key point here is that due to the way in which Where and Select defer execution, performing their work only when the values are needed, the side effect of the Where predicate ends up being called immediately before each projection in the Select. If you didn't defer execution, and instead performed all of the TryParse calls before any of the projections in Select, then you would see 99 for each value. We can actually simulate this easily enough: we can materialize the results of the Where into a collection, and then see the results of the Select be 99 repeated over and over:
var someEnumerable = strings.Where(s => int.TryParse(s, out outValue))
    .ToList() // eagerly evaluate the query up to this point
    .Select(s => outValue);
Having said all of that, the query that you have is not particularly good design. Whenever possible you should avoid queries that have side effects (such as your Where). The fact that the query both causes side effects, and observes the side effects that it creates, makes following all of this rather hard. The preferable design would be to rely on purely functional methods that aren't causing side effects. In this context the simplest way to do that is to create a method that tries to parse a string and returns an int?:
public static int? TryParse(string rawValue)
{
    int output;
    if (int.TryParse(rawValue, out output))
        return output;
    else
        return null;
}
This allows us to write:
var someEnumerable = from s in strings
                     let n = TryParse(s)
                     where n != null
                     select n.Value;
Here there are no observable side effects in the query, nor is the query observing any external side effects. It makes the whole query far easier to reason about.
Because when you enumerate, the values come through one at a time and change the value of the variable on the fly. Due to the nature of LINQ, the Select for the first iteration is executed before the Where for the second iteration. Basically this variable turns into a kind of foreach loop variable.
This is what deferred execution buys us. Previous methods do not have to execute fully before the next method in the chain starts. One value moves through all the methods before the second goes in. This is very useful with methods like First or Take, which stop the iteration early. Exceptions to the rule are methods that need to aggregate or sort, like OrderBy (they need to look at all elements before finding out which is first). If you add an OrderBy before the Select, the behavior will probably break, as sketched below.
Of course I wouldn't depend on this behavior in production code.
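For example, here is a small sketch (not from the question) of what an inserted OrderBy would likely do: OrderBy has to consume the entire filtered sequence before it can yield its first element, so all the TryParse side effects run first and every projected value ends up being the last one written to outValue.
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var strings = Enumerable.Range(0, 100).Select(i => i.ToString());
        int outValue = 0;

        var ordered = strings
            .Where(s => int.TryParse(s, out outValue))
            .OrderBy(s => s.Length)   // buffers the whole sequence before yielding anything
            .Select(s => outValue);

        foreach (var n in ordered.Take(3))
            Console.WriteLine(n);     // prints 99 three times, not 0, 1, 2
    }
}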
I don't understand what seems odd to you.
If you write a loop over this enumerable like this
foreach (var i in someEnumerable)
{
    Console.WriteLine(outValue);
}
you see 0 through 99, because LINQ evaluates each Where and Select lazily and yields each value one at a time. If you add ToArray,
var someEnumerable = strings.Where(s => int.TryParse(s, out outValue))
    .Select(s => outValue).ToArray();
then in the loop you will see 99s.
Edit
The code below will print 99s:
var strings = Enumerable.Range(0, 100).Select(i => i.ToString());
int outValue = 0;
var someEnumerable = strings.Where(s => int.TryParse(s, out outValue))
    .Select(s => outValue).ToArray();
//outValue = 3;
foreach (var i in someEnumerable)
{
    Console.WriteLine(outValue);
}

When are collections enumerated (IEnumerable)

Recently, I ran into a strange problem where I had a method generate an IEnumerable collection of objects. The method contained four yield return statements that returned four objects. I assigned the result to a variable, result, using the var keyword.
var result = GenerateCollection().ToList();
This effectively meant: List<MyType> result = GenerateCollection().ToList();
I made a simple for loop over the elements of this collection. What surprised me is that the collection was reenumerated for each call to the list (for each result[i]). Later I used the result collection in a LINQ query which had some bad results performance-wise due to the continual reenumeration of the collection.
I solved the problem by converting to an array instead of a list.
What this makes me wonder now is when are the collections enumerated? Which method calls make collections reenumerate?
EDIT: The GenerateCollection() method looked similar to this:
public static IEnumerable<MyType> GenerateCollection()
{
    var array = data.AsParallel(); // data is a simple collection of sublists of strings
    yield return new MyType("a", array.Where(x => x.Sublist.Count(y => y == 'a') == 0));
    yield return new MyType("b", array.Where(x => x.Sublist.Count(y => y == 'b') == 0));
    yield return new MyType("c", array.Where(x => x.Sublist.Count(y => y == 'c') == 0));
    yield return new MyType("d", array.Where(x => x.Sublist.Count(y => y == 'd') == 0));
}
You are yielding objects which have queries inside. It's not a sequence of array values; they are iterator objects, which are not executed when you pass them to the constructor of MyType. When you create the list of MyType objects
var result = GenerateCollection().ToList();
all MyType instances are yielded and saved into the list, but if you haven't executed the iterators in the MyType constructor, the queries are not executed. What's more, they will be executed again each time you call some operator which executes the query, e.g.
result[i].ArrayIterator.Count(); // first execution
foreach(var item in result[i].ArrayIterator) // second execution
// ...
You can fix it by passing the result of the query execution to the MyType constructor:
yield return new MyType("a", array.Where(x => !x.Sublist.Contains('a')).ToList())
Now you are passing a list of items instead of an iterator (you can also use ToArray()). The query is executed when you yield the MyType instance, and it will not be executed again.
array.Where(x => x.Sublist.Count(y => y == 'a') == 0)
This piece of code will be enumerated every time you access it in MyType. Use ToList or ToArray to ensure it is enumerated only once, in the place where the code is written.
Collections based on deferred execution (for example IEnumerable and IQueryable) get enumerated as soon as you use them, whereas collections based on immediate execution (for example List) get populated as soon as they are created.

Does putting a LINQ query inside a method affect deferred execution?

A LINQ query is not executed until the sequence returned by the query is actually iterated.
I have a query that is used repeatedly, so I am going to encapsulate it inside a method. I'd like to know whether that interferes with deferred execution.
If I encapsulate a LINQ query in a method like the one below, the query gets executed at line 2, not at line 1 where the method is called. Is this correct?
public IEnumerable<Person> GetOldPeopleQuery()
{
    return personList.Where(p => p.Age > 60);
}

public void SomeOtherMethod()
{
    var getWomenQuery = GetOldPeopleQuery().Where(p => p.Gender == "F"); // line 1
    int numberOfOldWomen = getWomenQuery.Count(); // line 2
}
P.S. I am using Linq-To-EF, if it makes any difference.
The query is lazily evaluated when you enumerate the result for the first time; that is not affected by putting it inside a method.
There is, however, another thing in your code that will be very inefficient. Once you've returned an IEnumerable, the next LINQ statement applied to the collection will be a LINQ-to-Objects query. That means that in your case you will load all old people from the database and then filter for the women in memory. The same goes for the count; it will be done in memory.
If you instead return an IQueryable<Person>, those two queries will be evaluated using LINQ to Entities, and the filtering and counting can be done in the database.
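For illustration, a sketch of that IQueryable variant (assuming an Entity Framework context with a People DbSet; the dbContext name is made up for the example). Because the return type stays IQueryable<Person>, the later Where and Count compose into the database query instead of running in memory:
public IQueryable<Person> GetOldPeopleQuery()
{
    return dbContext.People.Where(p => p.Age > 60);
}

public void SomeOtherMethod()
{
    var getWomenQuery = GetOldPeopleQuery().Where(p => p.Gender == "F"); // still composing, nothing executed yet
    int numberOfOldWomen = getWomenQuery.Count(); // translated to a single COUNT query in SQL
}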
