Linq running total 1st value added to itself - c#

I have the below which calculates the running total for a customer account status, however the first value is always added to itself and I'm not sure why - though I suspect I've missed something obvious:
decimal? runningTotal = 0;
IEnumerable<StatementModel> statement = sage.Repository<FDSSLTransactionHistory>()
    .Queryable()
    .Where(x => x.CustomerAccountNumber == sageAccount)
    .OrderBy(x => x.UniqueReferenceNumber)
    .AsEnumerable()
    .Select(x => new StatementModel()
    {
        SLAccountId = x.CustomerAccountNumber,
        TransactionReference = x.TransactionReference,
        SecondReference = x.SecondReference,
        Currency = x.CurrencyCode,
        Value = x.GoodsValueInAccountCurrency,
        TransactionDate = x.TransactionDate,
        TransactionType = x.TransactionType,
        TransactionDescription = x.TransactionTypeName,
        Status = x.Status,
        RunningTotal = (runningTotal += x.GoodsValueInAccountCurrency)
    });
Which outputs:
29/02/2012 00:00:00 154.80 309.60
30/04/2012 00:00:00 242.40 552.00
30/04/2012 00:00:00 242.40 794.40
30/04/2012 00:00:00 117.60 912.00
Where the 309.60 in the first row should simply be 154.80.
What have I done wrong?
EDIT:
As per ahruss's comment below, I was calling Any() on the result in my View, causing the first element to be evaluated twice. To resolve it, I appended ToList() to my query.
Thanks all for your suggestions

Add a ToList() to the end of the call to avoid duplicate invocations of the selector.
This is a stateful LINQ query with side-effects, which is by nature unpredictable. Somewhere else in the code, you called something that caused the first element to be evaluated, like First() or Any(). In general, it is dangerous to have side-effects in LINQ queries, and when you find yourself needing them, it's time to think about whether or not it should just be a foreach instead.
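For instance, the foreach version of this running total is only a few lines (a minimal sketch, reusing the model and field names from the question; transactions stands in for the ordered query up to AsEnumerable()):
decimal? runningTotal = 0;
var statement = new List<StatementModel>();
foreach (var x in transactions) // the Where/OrderBy query from the question
{
    runningTotal += x.GoodsValueInAccountCurrency;
    statement.Add(new StatementModel
    {
        SLAccountId = x.CustomerAccountNumber,
        Value = x.GoodsValueInAccountCurrency,
        RunningTotal = runningTotal
        // ...remaining fields as in the question
    });
}
// statement is a materialized list, so enumerating it repeatedly
// cannot re-run the accumulation.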
Edit, or Why is this happening?
This is a result of how LINQ queries are evaluated: until you actually use the results of a query, nothing really happens to the collection. It doesn't evaluate any of the elements. Instead, it stores expression trees (or, for plain IEnumerable, just the delegates) it needs to evaluate the query. It then evaluates those only when the results are needed, and unless you explicitly store the results, they're thrown away afterwards and re-evaluated the next time.
So the question becomes: why does it produce different results each time? The answer is that runningTotal is only initialized the first time around. After that, its value is whatever it was after the last execution of the query, which can lead to strange results.
This means the question could just as easily have been "Why is the total always twice what it should be?" if the asker had been doing something like this:
Console.WriteLine(statement.Count()); // this enumerates all the elements!
foreach (var item in statement) { Console.WriteLine(item.Total); }
Because the only way to get the number of elements in the sequence is to actually evaluate all of them.
Similarly, what actually happened in this question was that somewhere there was code like this:
if (statement.Any()) // this actually involves getting the first result
{
    // do something with the statement
}
// ...
foreach (var item in statement) { Console.WriteLine(item.Total); }
It seems innocuous, but if you know how LINQ and IEnumerable work, you know that .Any() is basically the same as .GetEnumerator().MoveNext(), which makes it more obvious that it requires getting the first element.
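To make that concrete, the parameterless Any() is roughly this (a simplified sketch, not the actual framework source, which also null-checks its argument):
public static bool Any<TSource>(this IEnumerable<TSource> source)
{
    using (IEnumerator<TSource> e = source.GetEnumerator())
    {
        // MoveNext() evaluates the first element of a deferred query,
        // which is what ran the stateful selector an "extra" time.
        return e.MoveNext();
    }
}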
It all boils down to the fact that LINQ is based on deferred execution, which is why the solution is to use ToList, which circumvents that and forces immediate execution.

If you don't want to freeze the results with ToList, a solution to the outer scope variable problem is using an iterator function, like this:
IEnumerable<StatementModel> GetStatement(IEnumerable<DataObject> source)
{
    decimal runningTotal = 0;
    foreach (var x in source)
    {
        yield return new StatementModel()
        {
            ...
            RunningTotal = (runningTotal += x.GoodsValueInAccountCurrency)
        };
    }
}
Then pass to this function the source query (not including the Select):
var statement = GetStatement(sage.Repository...AsEnumerable());
Now it is safe to enumerate statement multiple times. Basically, this creates an enumerable that re-executes this entire block on each enumeration, as opposed to executing a selector (which equates to only the foreach part) -- so runningTotal will be reset.
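As a quick sanity check (a sketch using the iterator above), even the Any-then-foreach pattern from the original question now behaves:
var statement = GetStatement(source);
if (statement.Any()) // runs the iterator once; runningTotal starts at 0
{
    foreach (var item in statement) // runs it again, from a fresh runningTotal
    {
        Console.WriteLine(item.RunningTotal); // correct totals every time
    }
}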

Related

Resharper, linq within foreach loop

ReSharper is suggesting the top example over the bottom example. However, I am under the impression that a new list of items will be created first, and thus all of the _executeFuncs will be run before RunStoredProcedure is called.
This would normally not be an issue, but exceptions are prone to occur, and if my hypothesis is correct then my database will not be updated despite the functions having been run.
foreach (var result in rows.Select(row => _executeFunc(row)))
{
    RunStoredProcedure(result);
}
Or
foreach (var row in rows)
{
    var result = _executeFunc(row);
    RunStoredProcedure(result);
}
The statements are, in this case, semantically the same, because Select (and LINQ in general) uses deferred execution of delegates. It won't run any declared queries until the result is being materialised, and it will do so in the proper sequence as the result is consumed.
A very simple example to show that:
var list = new List<string> { "hello", "world", "example" };
Func<string, string> func = (s) =>
{
    Console.WriteLine(s);
    return s.ToUpper();
};
foreach (var item in list.Select(i => func(i)))
{
    Console.WriteLine(item);
}
results in
hello
HELLO
world
WORLD
example
EXAMPLE
In your first example, _executeFunc(row) will NOT be called for every item in rows before your foreach loop begins. LINQ will defer execution. See this answer for more details.
The order of events will be:
Evaluate the first item in rows
Call _executeFunc(row) on that item
Call RunStoredProcedure(result)
Repeat with the next item in rows
Now, if your code were something like this:
foreach (var result in rows.Select(row => _executeFunc(row)).ToList())
{
    RunStoredProcedure(result);
}
Then it WOULD run the LINQ .Select first for every item in rows because the .ToList() causes the collection to be enumerated.
In the top example, using Select will project the rows by yielding them one by one.
So
foreach (var result in rows.Select(row => _executeFunc(row)))
is basically the same as
foreach(var row in rows)
Thus Select is doing something like this
for each row in source
    result = _executeFunc(row)
    yield result
That yield is passing each row back one by one (it's a bit more complicated than that, but this explanation should suffice for now).
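In actual C#, that pseudocode corresponds to an iterator roughly like this (a simplified sketch of Select, not the real framework source, which also validates its arguments):
public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
    foreach (TSource item in source)
    {
        // The selector runs here, one element at a time, only when the
        // consumer's foreach asks for the next result.
        yield return selector(item);
    }
}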
If you did this instead
foreach (var result in rows.Select(row => _executeFunc(row)).ToList())
Calling ToList() will return a List of rows immediately, and that means _executeFunc() will indeed be called for every row, before you've had a chance to call RunStoredProcedure().
Thus what ReSharper is suggesting is valid. To be fair, I'm sure the JetBrains devs know what they are doing :)
Select uses deferred execution. This means that it will, in order:
take an item from rows
call _executeFunc on it
call RunStoredProcedure on the result of _executeFunc
And then it will do the same for the next item, until all the list has been processed.
The execution will be deferred, meaning both versions will have the same execution order.

How do I make a streaming LINQ expression that delivers the filtered out items as well as the filtered items?

I am transforming an Excel spreadsheet into a list of "Elements" (this is a domain term). During this transformation, I need to skip the header rows and throw out malformed rows that cannot be transformed.
Now comes the fun part. I need to capture those malformed records so that I can report on them. I constructed a crazy LINQ statement (below). These are extension methods hiding the messy LINQ operations on the types from the OpenXml library.
var elements = sheet
.Rows() <-- BEGIN sheet data transform
.SkipColumnHeaders()
.ToRowLookup()
.ToCellLookup()
.SkipEmptyRows() <-- END sheet data transform
.ToElements(strings) <-- BEGIN domain transform
.RemoveBadRecords(out discard)
.OrderByCompositeKey();
The interesting part starts at ToElements, where I transform the row lookup to my domain object list (details: it's called an ElementRow, which is later transformed into an Element). Bad records are created with just a key (the Excel row index) and are uniquely identifiable vs. a real element.
public static IEnumerable<ElementRow> ToElements(this IEnumerable<KeyValuePair<UInt32Value, Cell[]>> map)
{
    return map.Select(pair =>
    {
        try
        {
            return ElementRow.FromCells(pair.Key, pair.Value);
        }
        catch (Exception)
        {
            return ElementRow.BadRecord(pair.Key);
        }
    });
}
Then, I want to remove those bad records (it's easier to collect all of them before filtering). That method is RemoveBadRecords, which started like this...
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements)
{
    return elements.Where(el => el.FormatId != 0);
}
However, I need to report the discarded elements! And I don't want to muddy my transform extension method with reporting. So, I went to the out parameter (taking into account the difficulties of using an out param in an anonymous block)
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements, out List<ElementRow> discard)
{
    var temp = new List<ElementRow>();
    var filtered = elements.Where(el =>
    {
        if (el.FormatId == 0) temp.Add(el);
        return el.FormatId != 0;
    });
    discard = temp;
    return filtered;
}
And, lo! I thought I was hardcore and would have this working in one shot...
var discard = new List<ElementRow>();
var elements = data
    /* snipped long LINQ statement */
    .RemoveBadRecords(out discard)
    /* snipped long LINQ statement */ ;
discard.ForEach(el => failures.Add(el));
foreach (var el in elements)
{
    /* do more work, maybe add more failures */
}
return new Result(elements, failures);
But, nothing was in my discard list at the time I looped through it! I stepped through the code and realized that I successfully created a fully-streaming LINQ statement.
The temp list was created
The Where filter was assigned (but not yet run)
And the discard list was assigned
Then the streaming thing was returned
When discard was iterated, it contained no elements, because the elements weren't iterated over yet.
Is there a way to fix this problem using the thing I constructed? Do I have to force an iteration of the data before or during the bad record filter? Is there another construction that I've missed?
Some Commentary
Jon mentioned that the assignment /was/ happening. I simply wasn't waiting for it. If I check the contents of discard after the iteration of elements, it is, in fact, full! So, I don't actually have an assignment problem. Unless I take Jon's advice on what's good/bad to have in a LINQ statement.
When the statement was actually iterated, the Where clause ran and temp filled up, but discard was never assigned again!
It doesn't need to be assigned again - the existing list which will have been assigned to discard in the calling code will be populated.
However, I'd strongly recommend against this approach. Using an out parameter here is really against the spirit of LINQ. (If you iterate over your results twice, you'll end up with a list which contains all the bad elements twice. Ick!)
I'd suggest materializing the query before removing the bad records - and then you can run separate queries:
var allElements = sheet
    .Rows()
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()
    .ToElements(strings)
    .ToList();
var goodElements = allElements.Where(el => el.FormatId != 0)
.OrderByCompositeKey();
var badElements = allElements.Where(el => el.FormatId == 0);
By materializing the query in a List<>, you only process each row once in terms of ToRowLookup, ToCellLookup etc. It does mean you need to have enough memory to keep all the elements at a time, of course. There are alternative approaches (such as taking an action on each bad element while filtering it) but they're still likely to end up being fairly fragile.
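For illustration, the "action on each bad element" alternative could be a hypothetical overload like this (not part of the original code, and it inherits the double-enumeration fragility described above):
public static IEnumerable<ElementRow> RemoveBadRecords(
    this IEnumerable<ElementRow> elements, Action<ElementRow> onDiscard)
{
    foreach (var el in elements)
    {
        if (el.FormatId == 0)
            onDiscard(el); // report the bad record as it streams past
        else
            yield return el;
    }
}
You would call it as .RemoveBadRecords(el => failures.Add(el)), but only enumerate the result once, or failures will collect duplicates.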
EDIT: Another option as mentioned by Servy is to use ToLookup, which will materialize and group in one go:
var lookup = sheet
    .Rows()
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()
    .ToElements(strings)
    .OrderByCompositeKey()
    .ToLookup(el => el.FormatId == 0);
Then you can use:
foreach (var goodElement in lookup[false])
{
...
}
and
foreach (var badElement in lookup[true])
{
...
}
Note that this performs the ordering on all elements, good and bad. An alternative is to remove the ordering from the original query and use:
foreach (var goodElement in lookup[false].OrderByCompositeKey())
{
...
}
I'm not personally wild about grouping by true/false - it feels like a bit of an abuse of what's normally meant to be a key-based lookup - but it would certainly work.
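If the true/false lookup feels wrong, a plain one-pass partition into two lists (a sketch under the same assumptions as above) sidesteps it:
var goodElements = new List<ElementRow>();
var badElements = new List<ElementRow>();
foreach (var el in elements) // the streamed ToElements(...) sequence
{
    (el.FormatId == 0 ? badElements : goodElements).Add(el);
}
var orderedGood = goodElements.OrderByCompositeKey(); // order only the good rows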

IEnumerable<T>.Union(IEnumerable<T>) overwrites contents instead of unioning

I've got a collection of items (ADO.NET Entity Framework), and need to return a subset as search results based on a couple different criteria. Unfortunately, the criteria overlap in such a way that I can't just take the collection Where the criteria are met (or drop Where the criteria are not met), since this would leave out or duplicate valid items that should be returned.
I decided I would do each check individually, and combine the results. I considered using AddRange, but that would result in duplicates in the results list (and my understanding is it would enumerate the collection every time - am I correct/mistaken here?). I realized Union does not insert duplicates, and defers enumeration until necessary (again, is this understanding correct?).
The search is written as follows:
IEnumerable<MyClass> Results = Enumerable.Empty<MyClass>();
IEnumerable<MyClass> Potential = db.MyClasses.Where(x => x.Y); //Precondition
int parsed_id;
//For each searchable value
foreach (var selected in SelectedValues1)
{
    IEnumerable<MyClass> matched = Potential.Where(x => x.Value1 == selected);
    Results = Results.Union(matched); //This is where the problem is
}
//Ellipsed....
foreach (var selected in SelectedValuesN) //Happens to be integer
{
    if (!int.TryParse(selected, out parsed_id))
        continue;
    IEnumerable<MyClass> matched = Potential.Where(x => x.ValueN == parsed_id);
    Results = Results.Union(matched); //This is where the problem is
}
It seems, however, that Results = Results.Union(matched) is working more like Results = matched. I've stepped through with some test data and a test search. The search asks for results where the first field is -1, 0, 1, or 3. This should return 4 results (two 0s, a 1 and a 3). The first iteration of the loops works as expected, with Results still being empty. The second iteration also works as expected, with Results containing two items. After the third iteration, however, Results contains only one item.
Have I just misunderstood how .Union works, or is there something else going on here?
Because of deferred execution, by the time you eventually consume Results, it is the union of many Where queries all of which are based on the last value of selected.
So you have
Results = Potential.Where(selected)
    .Union(Potential.Where(selected))
    .Union(Potential.Where(selected))...
and all the selected values are the same.
You need to create a var currentSelected = selected inside your loop and pass that to the query. That way each value of selected will be captured individually and you won't have this problem.
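Here is a minimal, self-contained illustration of the capture problem and the fix (note: this shows C# 4 semantics; C# 5 changed foreach so the loop variable is scoped per iteration, though for loops still behave this way):
var data = new[] { 1, 2, 3 };
var queries = new List<IEnumerable<int>>();
foreach (var selected in new[] { 1, 2, 3 })
{
    var currentSelected = selected; // the fix: a fresh variable per iteration
    queries.Add(data.Where(x => x == currentSelected));
    // Capturing 'selected' directly would make every deferred query see
    // the last value (3) by the time it is finally enumerated.
}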
You can do this much more simply:
Results = SelectedValues.SelectMany(s => Potential.Where(x => x.Value == s));
(this may return duplicates)
Or
Results = Potential.Where(x => SelectedValues.Contains(x.Value));
As pointed out by others, your lambda is a closure. This means the variable selected is captured by the LINQ expression in each iteration of your foreach loop. The same variable is used in each iteration, so by the time the deferred queries run it will hold whatever the last value was. (This applies to C# 4 and earlier; C# 5 changed foreach so that the loop variable is scoped per iteration, though the trap remains for for loops.) To get around this, declare a local variable within the foreach loop, like so:
//For each searchable value
foreach (var selected in SelectedValues1)
{
    var localSelected = selected;
    Results = Results.Union(Potential.Where(x => x.Value1 == localSelected));
}
It is much shorter to just use .Contains():
Results = Results.Union(Potential.Where(x => SelectedValues1.Contains(x.Value1)));
Since you need to query multiple SelectedValues collections, you could put them all inside their own collection and iterate over that as well, although you'd need some way of matching the correct field/property on your objects.
You could possibly do this by storing your lists of selected values in a Dictionary with the name of the field/property as the key. You would use Reflection to look up the correct member and perform your check. You could then shorten the code to the following:
// Store each of your searchable lists here, keyed by member name
Dictionary<string, IEnumerable<object>> DictionaryOfSelectedValues = ...;
Type t = typeof(MyClass);
// For each list of searchable values
foreach (var selectedValues in DictionaryOfSelectedValues) // Returns KeyValuePair<TKey, TValue>
{
    // Try to get a property for this key
    PropertyInfo prop = t.GetProperty(selectedValues.Key);
    IEnumerable<object> localSelected = selectedValues.Value;
    if (prop != null)
    {
        Results = Results.Union(Potential.Where(x =>
            localSelected.Contains(prop.GetValue(x, null))));
    }
    else // If it's not a property, check if the entry is for a field
    {
        FieldInfo field = t.GetField(selectedValues.Key);
        if (field != null)
        {
            Results = Results.Union(Potential.Where(x =>
                localSelected.Contains(field.GetValue(x))));
        }
    }
}
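For illustration, the dictionary might be populated like this (a hypothetical setup; boxing to object is needed so Contains compares against GetValue's return type):
var DictionaryOfSelectedValues = new Dictionary<string, IEnumerable<object>>
{
    { "Value1", SelectedValues1.Cast<object>().ToList() },
    // ValueN arrives as strings, so parse before boxing:
    { "ValueN", SelectedValuesN.Where(s => int.TryParse(s, out _))
                               .Select(s => (object)int.Parse(s))
                               .ToList() },
};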
No, your use of Union is absolutely correct.
The only thing to keep in mind is that it excludes duplicates based on the equality operator. Do you have sample data?
Okay, I think you are having a problem because Union uses deferred execution.
What happens if you do,
var unionResults = Results.Union(matched).ToList();
Results = unionResults;

Why does fluent nhibernate not cache results if using an IEnumerable instead of a List

Edit: Changed my test around as there was a flaw with the way the test was being run.
I was fighting some performance issues with Fluent NHibernate recently and I came across something I thought was very odd. When I changed an IEnumerable to a List, performance increased dramatically, and I was trying to figure out why. It didn't seem like it should matter, and Google didn't turn anything up.
Here's the basic test I ran:
//Class has various built-in type fields, but no references to anything
public class Something
{
    public int ID;
    public decimal Value;
}

var someRepository = new Repository(uow);

//RUN 1
var start = DateTime.Now;
// Returns an IEnumerable from a session.Linq<SomeAgg> based on the passed-in parameters, nothing fancy. Has about 1300 rows that get returned.
var somethings = someRepository.GetABunchOfSomething(various, parameters);
var returnValue = SumAllFunction(somethings);
var timeSpent = DateTime.Now - start; //Takes {00:00:00.3580358} on my box

//RUN 2
var start2 = DateTime.Now;
var returnValue2 = SumAllFunction(somethings);
var timeSpent2 = DateTime.Now - start2; //Takes {00:00:00.0560000} on my box

public decimal SumAllFunction(IEnumerable<Something> somethings)
{
    return somethings.Sum(x => x.Value); //Value is a decimal that's part of the Something class
}
Now if I take the same code and just change the someRepository.GetABunchOfSomething line to append .ToList():
//RUN 1
var start = DateTime.Now;
var somethings = someRepository.GetABunchOfSomething(various, parameters).ToList();
var returnValue = SumAllFunction(somethings);
var timeSpent = DateTime.Now - start; //Takes {00:00:00.3580358} on my box

//RUN 2
var start2 = DateTime.Now;
var returnValue2 = SumAllFunction(somethings);
var timeSpent2 = DateTime.Now - start2; //Takes {00:00:00.0010000} on my box
Nothing else changed. These results are very repeatable. So it's not just a one off timing issue.
The TLDR version is this:
When running the same IEnumerable through a loop twice, the second run takes anywhere from 10-20 times longer than if I change the IEnumerable to a List using .ToList() before running it through the two loops.
I checked the SQL, and when it's a List the SQL only gets run once; the results appear to be cached and used again rather than having to go back to the database.
If it's an IEnumerable, then every time it goes to access the children of the IEnumerable it makes a trip to the database to rehydrate them.
I understand that you can't add to/delete from an IEnumerable, but my understanding was that the IEnumerable would have been initially filled with the proxy objects and then the proxy objects would have been hydrated later on when needed. After they were hydrated you wouldn't have to go back to the DB again, but it does not appear to be that way. I obviously have a work around for this, but I thought it was odd and I was curious why it behaves the way it does.
When you call ToList() on your GetABunchOfSomething result, the query is performed at that moment, and the results are placed in a list. When you don't call ToList(), then it's not until someFunction runs that the query is performed, and your timer doesn't take that into account.
I think you'll find that the time difference between the two is due to that.
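A sketch of a fairer measurement, separating the one-time query cost from the in-memory iteration cost (using Stopwatch; the repository call is the one from the question):
var sw = System.Diagnostics.Stopwatch.StartNew();
var somethings = someRepository.GetABunchOfSomething(various, parameters).ToList();
sw.Stop(); // the SQL round-trip and materialization land here, once
Console.WriteLine("Load:  " + sw.Elapsed);

sw.Restart();
var total1 = SumAllFunction(somethings); // pure in-memory iteration
sw.Stop();
Console.WriteLine("Sum 1: " + sw.Elapsed);

sw.Restart();
var total2 = SumAllFunction(somethings); // comparable to the first sum
sw.Stop();
Console.WriteLine("Sum 2: " + sw.Elapsed);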
Update
The results, though maybe counter-intuitive to you, make sense. The reason why the query isn't run until you iterate, and the reason why the results aren't cached, is provided as a feature. Say you wanted to call your repository method in two places in your code: one time sorted by Foo, another time filtered by Bar. If the repository method returns an IQueryable<YourClass>, any additional modifications made to that object will actually affect the SQL that gets emitted rather than causing the collection to be modified in-memory. For example, if you ran this:
someRepository
    .GetABunchOfSomething(various, parameters)
    .Where(s => s.Bar == "SomeValue");
The generated SQL might look something like this once you iterate:
select *
from someTable
where Bar = 'SomeValue'
However, if you did this instead:
someRepository
    .GetABunchOfSomething(various, parameters)
    .ToList()
    .Where(s => s.Bar == "SomeValue");
Then you'll be retrieving all rows from the table instead, and your application would filter the results in memory.

in foreach loop, should I expect an error since query collection has changed?

For example
var query = myDic.Where(x => !blacklist.Contains(x.Key));
foreach (var item in query)
{
    if (condition)
        blacklist.Add(item.Key + 1); //Key is int type
    ret.Add(item);
}
return ret;
Would this code be valid? And how do I improve it?
Updated
I am expecting my blacklist.Add(item.Key + 1) to result in a smaller ret than it would otherwise; the ToList() approach won't achieve my intention in this sense.
Are there any better ideas that are correct and unambiguous?
That is perfectly safe to do and there shouldn't be any problems, as you're not directly modifying the collection that you are iterating over. Though you are making other changes that affect the Where clause, it's not going to blow up on you.
The query (as written) is lazily evaluated, so blacklist is updated as you iterate through the collection, and all following iterations will see any newly added items in the list.
The above code is effectively the same as this:
foreach (var item in myDic)
{
    if (!blacklist.Contains(item.Key))
    {
        if (condition)
            blacklist.Add(item.Key + 1);
        ret.Add(item);
    }
}
So what you should get out of this is that as long as you are not directly modifying the collection that you are iterating over (the collection named after in in the foreach loop), what you are doing is safe.
If you're still not convinced, consider this and what would be written out to the console:
var blacklist = new HashSet<int>(Enumerable.Range(3, 100));
var query = Enumerable.Range(2, 98).Where(i => !blacklist.Contains(i));
foreach (var item in query)
{
    Console.WriteLine(item);
    if ((item % 2) == 0)
    {
        var value = 2 * item;
        blacklist.Remove(value); // later iterations of the deferred query see this change
    }
}
// Prints 2, 4, 8, 16, 32, 64: each removal un-blacklists the next power of two.
Yes. Changing a collection's contents is strictly prohibited while iterating over that collection.
UPDATE
I initially made this a comment, but here is a further bit of information:
I should note that my knowledge comes from experience and articles I've read a long time ago. There is a chance that you can execute the code above because (I believe) the query contains references to the selected object within blacklist. blacklist might be able to change, but not query. If you were strictly iterating over blacklist, you would not be able to add to the blacklist collection.
Your code as presented would not throw an exception. The collection being iterated (myDic) is not the collection being modified (blacklist or ret).
What will happen is that each iteration of the loop will evaluate the current item against the query predicate, which would inspect the blacklist collection to see if it contains the current item's key. This is lazily evaluated, so a change to blacklist in one iteration will potentially impact subsequent iterations, but it will not be an error. (blacklist is fully evaluated upon each iteration, its enumerator is not being held.)
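A compact illustration of the distinction (a sketch; the second loop throws):
var source = new List<int> { 1, 2, 3 };
var blacklist = new HashSet<int>();

// Safe: blacklist is modified, but source is the collection being enumerated.
foreach (var item in source.Where(i => !blacklist.Contains(i)))
{
    blacklist.Add(item + 1);
}

// NOT safe: this mutates the list being enumerated and throws
// InvalidOperationException ("Collection was modified ...").
foreach (var item in source)
{
    source.Add(item);
}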
