LINQ optimization of query and foreach - C#

I return a List from a LINQ query, and afterwards I have to fill in its values with a for loop.
The problem is that it is too slow.
var formentries = (from f in db.bNetFormEntries
join s in db.bNetFormStatus on f.StatusID.Value equals s.StatusID into entryStatus
join s2 in db.bNetFormStatus on f.ExternalStatusID.Value equals s2.StatusID into entryStatus2
where f.FormID == formID
orderby f.FormEntryID descending
select new FormEntry
{
FormEntryID = f.FormEntryID,
FormID = f.FormID,
IPAddress = f.IpAddress,
UserAgent = f.UserAgent,
CreatedBy = f.CreatedBy,
CreatedDate = f.CreatedDate,
UpdatedBy = f.UpdatedBy,
UpdatedDate = f.UpdatedDate,
StatusID = f.StatusID,
StatusText = entryStatus.FirstOrDefault().Status,
ExternalStatusID = f.ExternalStatusID,
ExternalStatusText = entryStatus2.FirstOrDefault().Status
}).ToList();
Then I use the for loop like this:
for(var x=0; x<formentries.Count(); x++)
{
var values = (from e in entryvalues
where e.FormEntryID.Equals(formentries.ElementAt(x).FormEntryID)
select e).ToList<FormEntryValue>();
formentries.ElementAt(x).Values = values;
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
But it is definitely too slow.
Is there a way to make it faster?

it is definitely too slow. Is there a way to make it faster?
Maybe. Maybe not. But that's not the right question to ask. The right question is:
Why is it so slow?
It is a lot easier to figure out the answer to the first question if you have an answer to the second question! If the answer to the second question is "because the database is in Tokyo and I'm in Rome, and the fact that the packets move no faster than speed of light is the cause of my unacceptable slowdown", then the way you make it faster is you move to Japan; no amount of fixing the query is going to change the speed of light.
To figure out why it is so slow, get a profiler. Run the code through the profiler and use that to identify where you are spending most of your time. Then see if you can speed up that part.

From what I can see, you are iterating through formentries two more times for no reason - once when you populate the values, and once when you convert to a dictionary.
If entryvalues is database-driven - i.e. you get it from the database - then move the value-field population into the first query.
If it's not, then you do not need to invoke ToList() on the first query; just do the loop and build the dictionary directly.
var formentries = from f in db.bNetFormEntries
join s in db.bNetFormStatus on f.StatusID.Value equals s.StatusID into entryStatus
join s2 in db.bNetFormStatus on f.ExternalStatusID.Value equals s2.StatusID into entryStatus2
where f.FormID == formID
orderby f.FormEntryID descending
select new FormEntry
{
FormEntryID = f.FormEntryID,
FormID = f.FormID,
IPAddress = f.IpAddress,
UserAgent = f.UserAgent,
CreatedBy = f.CreatedBy,
CreatedDate = f.CreatedDate,
UpdatedBy = f.UpdatedBy,
UpdatedDate = f.UpdatedDate,
StatusID = f.StatusID,
StatusText = entryStatus.FirstOrDefault().Status,
ExternalStatusID = f.ExternalStatusID,
ExternalStatusText = entryStatus2.FirstOrDefault().Status
};
var formEntryDictionary = new Dictionary<int, FormEntry>();
foreach (var formEntry in formentries)
{
formEntry.Values = GetValuesForFormEntry(formEntry, entryvalues);
formEntryDictionary.Add(formEntry.FormEntryID, formEntry);
}
return formEntryDictionary;
And the values preparation:
private IList<FormEntryValue> GetValuesForFormEntry(FormEntry formEntry, IEnumerable<FormEntryValue> entryValues)
{
return (from e in entryValues
where e.FormEntryID.Equals(formEntry.FormEntryID)
select e).ToList<FormEntryValue>();
}
You can change the private method to accept only the entryId instead of the whole formEntry if you wish.
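For illustration, a by-ID version might look like this. FormEntryValue here is a hypothetical minimal stand-in for the poster's real type, not their actual class:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal stand-in for the poster's FormEntryValue type.
public class FormEntryValue
{
    public int FormEntryID { get; set; }
}

public static class FormEntryQueries
{
    // Same filtering as GetValuesForFormEntry, but keyed by the ID alone.
    public static IList<FormEntryValue> GetValuesForFormEntryId(
        int formEntryId, IEnumerable<FormEntryValue> entryValues)
    {
        return entryValues
            .Where(e => e.FormEntryID == formEntryId)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var values = new List<FormEntryValue>
        {
            new FormEntryValue { FormEntryID = 1 },
            new FormEntryValue { FormEntryID = 1 },
            new FormEntryValue { FormEntryID = 2 },
        };
        Console.WriteLine(FormEntryQueries.GetValuesForFormEntryId(1, values).Count); // 2
    }
}
```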

It's slow because your loop is O(N*M), where N is formentries.Count and M is entryvalues.Count. Even in a simple test I was getting more than 20 times slower with only 1000 elements, and my type only had an int id field; with 10000 elements in the list it was over 1600 times slower than the code below!
Assuming your entryvalues is a local list and not hitting a database (just .ToList() it into a new variable somewhere first if that's the case), and assuming your FormEntryID is unique in formentries (which it seems to be from the .ToDictionary call), then try this instead:
var entryvaluesLookup = entryvalues.ToLookup(entry => entry.FormEntryID);
for(var x=0; x<formentries.Count; x++)
{
formentries[x].Values = entryvaluesLookup[formentries[x].FormEntryID].ToList();
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
It should go a long way to making it at least scale better.
Changes: .Count instead of .Count(), just because it's better not to call an extension method when you don't need to. Using a lookup to find the values, rather than running a Where for every x in the for loop, effectively removes the M from the big-O.
If this isn't entirely correct, I'm sure you can change whatever is missing to suit your use case. But as an aside, you should really consider using camelCase for your variable names: formentries versus formEntries - one is just that little bit easier to read.

There are some reasons why this might be slow regarding the way you use formentries.
The formentries List<T> from above has a Count property, but you are calling the enumerable Count() extension method instead. That extension may or may not detect that you're operating on a collection type with a Count property it can defer to, instead of walking the enumeration to compute the count.
Similarly, the formentries.ElementAt(x) expression is used twice; if ElementAt has not been optimized to recognize a collection like a list that can jump to an item by its index, then LINQ has to redundantly walk the list to get to the xth item.
The above evaluation may miss the real problem, which you'll only really know if you profile. However, you can avoid the above while making your code significantly easier to read if you switch how you iterate the collection of formentries as follows:
foreach(var fe in formentries)
{
fe.Values = entryvalues
.Where(e => e.FormEntryID.Equals(fe.FormEntryID))
.ToList<FormEntryValue>();
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
You may have resorted to the for(var x=...) ...ElementAt(x) approach because you thought you could not modify properties on the object referenced by the foreach loop variable fe.
That said, another point that could be an issue is if formentries has multiple items with the same FormEntryID. That would result in the same work being done multiple times inside the loop. And while the top query appears to run against a database, you can still do joins with data in LINQ-to-Objects land. Happy optimizing/profiling/coding - let us know what works for you.
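To sketch that last point: a LINQ-to-Objects group join hashes the values by key internally, so each entry finds its values without a per-entry scan. FormEntry and FormEntryValue below are hypothetical minimal stand-ins, not the poster's real classes:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class FormEntry
{
    public int FormEntryID { get; set; }
    public List<FormEntryValue> Values { get; set; }
}

public class FormEntryValue
{
    public int FormEntryID { get; set; }
}

public static class FormEntryJoin
{
    // Populates each entry's Values via a group join rather than a
    // per-entry Where scan over all values.
    public static void PopulateValues(
        List<FormEntry> formentries, List<FormEntryValue> entryvalues)
    {
        var joined = from fe in formentries
                     join v in entryvalues
                         on fe.FormEntryID equals v.FormEntryID into g
                     select new { Entry = fe, Values = g.ToList() };

        foreach (var item in joined)
            item.Entry.Values = item.Values;
    }
}

public static class Program
{
    public static void Main()
    {
        var entries = new List<FormEntry>
        {
            new FormEntry { FormEntryID = 1 },
            new FormEntry { FormEntryID = 2 },
        };
        var values = new List<FormEntryValue>
        {
            new FormEntryValue { FormEntryID = 1 },
            new FormEntryValue { FormEntryID = 1 },
        };
        FormEntryJoin.PopulateValues(entries, values);
        Console.WriteLine(entries[0].Values.Count); // 2
    }
}
```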

Related

Understanding Deferred Execution: Is a Linq Query Re-executed Everytime its collection of anonymous objects is referred to?

I'm currently trying to write some code that will run a query on two separate databases and return the results to an anonymous object. Once I have the two collections of anonymous objects, I need to perform a comparison on them: I need to retrieve all of the records that are in webOrders but not in foamOrders. Currently I'm making the comparison with LINQ.
My major problem is that both of the original queries return about 30,000 records, and as my code stands now it takes way too long to complete. I'm new to LINQ, so I'm trying to understand whether using LINQ to compare the two collections of anonymous objects will actually cause the database queries to run over and over again, due to deferred execution. This may be an obvious answer, but I don't yet have a very firm understanding of how LINQ and anonymous objects work with deferred execution, and I'm hoping someone may be able to enlighten me. Below is the code that I have...
private DataTable GetData()
{
using (var foam = Databases.Foam(false))
{
using (MySqlConnection web = new MySqlConnection(Databases.ConnectionStrings.Web(true)))
{
var foamOrders = foam.DataTableEnumerable(@"
SELECT order_id
FROM Orders
WHERE order_id NOT LIKE 'R35%'
AND originpartner_code = 'VN000011'
AND orderDate > Getdate() - 7 ")
.Select(o => new
{
order = o[0].ToString().Trim()
}).ToList();
var webOrders = web.DataTableEnumerable(@"
SELECT ORDER_NUMBER FROM TRANSACTIONS AS T WHERE
(Str_to_date(T.ORDER_DATE, '%Y%m%d %k:%i:%s') >= DATE_SUB(Now(), INTERVAL 7 DAY))
AND (STR_TO_DATE(T.ORDER_DATE, '%Y%m%d %k:%i:%s') <= DATE_SUB(NOW(), INTERVAL 1 HOUR))")
.Select(o => new
{
order = o[0].ToString().Trim()
}).ToList();
return (from w in webOrders
where !(from f in foamOrders
select f.order).Contains(w.order)
select w
).ToDataTable();
}
}
}
Your LINQ ceases to be deferred when you do
ToDataTable();
At that point it is snapshotted - done and dusted forever.
The same is true of foamOrders and webOrders when you convert them with
ToList();
You could do it as one query. I don't have MySQL to check it out on.
Regarding deferred execution:
The .ToList() method iterates over the IEnumerable, retrieves all the values, and fills a new List<T> object with them. So there is definitely no deferred execution at this point.
It's most likely the same with .ToDataTable();
P.S.
But I'd recommend that you:
Use custom types rather than anonymous types.
Do not use LINQ to compare the objects, because it's not really efficient (LINQ does extra work).
You can create a custom MyComparer class (which might implement the IComparer interface) with a method like Compare<T1, T2> that compares two entities. Then you can create another method that compares two sets of entities - for example T1[] CompareRange<T1,T2>(T1[] entities1, T2[] entities2) - reusing your Compare method in a loop and returning the result of the operation.
P.S.
Some other resource-intensive operations that may potentially lead to significant performance losses (in case you need to perform thousands of operations):
Use of an enumerator object (a foreach loop or some of the LINQ methods).
Possible solution: try to use a for loop where possible.
Extensive use of anonymous methods (the compiler needs significant time to compile a lambda expression / operator).
Possible solution: store lambdas in delegates (like Func<T1, T2>).
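Not the comparer class described above, but for the concrete "in webOrders but not in foamOrders" case, a HashSet is one simple sketch of an efficient comparison - membership tests are O(1) amortized instead of an O(M) list scan, so the whole difference is O(N + M):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderComparison
{
    // Returns the order numbers present in webOrders but missing from foamOrders.
    public static List<string> MissingFromFoam(
        IEnumerable<string> webOrders, IEnumerable<string> foamOrders)
    {
        var foamSet = new HashSet<string>(foamOrders);                // O(M) build
        return webOrders.Where(o => !foamSet.Contains(o)).ToList();   // O(N) scan
    }
}

public static class Program
{
    public static void Main()
    {
        var foam = new List<string> { "A1", "A2", "A3" };
        var web = new List<string> { "A2", "A3", "A4", "A5" };
        Console.WriteLine(string.Join(",", OrderComparison.MissingFromFoam(web, foam))); // A4,A5
    }
}
```

Enumerable.Except would also work here and uses a set internally as well.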
In case it helps anyone in the future, my new code is pasted below. It runs much faster now. Thanks to everyone's help, I've learned that even though the deferred execution of my database queries was cut off and the results became static once I used .ToList(), using LINQ to compare the resulting collections was very inefficient. I went with a plain loop instead.
private DataTable GetData()
{
//Needed to have both connections open in order to preserve the scope of var foamOrders and var webOrders, which are both needed in order to perform the comparison.
using (var foam = Databases.Foam(isDebug))
{
using (MySqlConnection web = new MySqlConnection(Databases.ConnectionStrings.Web(isDebug)))
{
var foamOrders = foam.DataTableEnumerable(@"
SELECT foreignID
FROM Orders
WHERE order_id NOT LIKE 'R35%'
AND originpartner_code = 'VN000011'
AND orderDate > Getdate() - 7 ")
.Select(o => new
{
order = o[0].ToString()
.Trim()
}).ToList();
var webOrders = web.DataTableEnumerable(@"
SELECT ORDER_NUMBER FROM transactions AS T WHERE
(Str_to_date(T.ORDER_DATE, '%Y%m%d %k:%i:%s') >= DATE_SUB(Now(), INTERVAL 7 DAY))
AND (STR_TO_DATE(T.ORDER_DATE, '%Y%m%d %k:%i:%s') <= DATE_SUB(NOW(), INTERVAL 1 HOUR))
", 300)
.Select(o => new
{
order = o[0].ToString()
.Trim()
}).ToList();
List<OrderNumber> on = new List<OrderNumber>();
foreach (var w in webOrders)
{
if (!foamOrders.Contains(w))
{
OrderNumber o = new OrderNumber();
o.orderNumber = w.order;
on.Add(o);
}
}
return on.ToDataTable();
}
}
}
public class OrderNumber
{
public string orderNumber { get; set; }
}
}

Using LINQ to get min and max of a particular field in one query

Let's say you have a class like:
public class Section {
public DateTime StartDate;
public DateTime? EndDate;
}
I have a list of these objects, and I would like to get the minimum start date and the maximum end date - but I would like to use one LINQ query so that I know I'm only iterating over the list once.
For instance, if I were doing this without LINQ, my code would look a bit like this (not checking for nulls):
DateTime? minStartDate;
DateTime? maxEndDate;
foreach(var s in sections) {
if(s.StartDate < minStartDate) minStartDate = s.StartDate;
if(s.EndDate > maxEndDate) maxEndDate = s.EndDate;
}
I could use two LINQ queries to get the min and max, but I know that under the covers that would require iterating over all the values twice.
I've seen min and max queries like this before, but with grouping. How would you do this without grouping, and in a single LINQ query?
How would you do this without grouping, and in a single linq query?
If I had to do that, then I'd do:
var minMax = (from s0 in sections
from s1 in sections
orderby s0.StartDate, s1.EndDate descending
select new {s0.StartDate, s1.EndDate}).FirstOrDefault();
But I'd also consider the performance impact depending on the provider in question.
On a database I'd expect this to become something like:
SELECT s0.StartDate, s1.EndDate
FROM Sections AS s0
CROSS JOIN Sections AS s1
ORDER BY s0.StartDate ASC, s1.EndDate DESC
LIMIT 1
OR
SELECT TOP 1 s0.StartDate, s1.EndDate
FROM Sections AS s0, Sections AS s1
ORDER BY s0.StartDate ASC, s1.EndDate DESC
Depending on the database type. How that in turn would be executed could well be two table scans, but if I were going to care about these dates I'd have indices on those columns, so it should be two index scans toward the end of each index - so I'd expect it to be pretty fast.
I have a list of these objects
Then if I cared a lot about performance, I wouldn't use Linq.
but I would like to use one linq query so I know that I'm only iterating over the list once
That's why I wouldn't use LINQ. Since there's nothing in LINQ designed to deal with this particular case, it would hit the worst combination. Indeed, it would be worse than 2 iterations: it would be N+1 iterations, where N is the number of elements in sections. LINQ providers are good, but they aren't magic.
If I really wanted to be able to do this in Linq, as for example I was sometimes doing this against lists in memory and sometimes against databases and so on, I'd add my own methods to do each the best way possible:
public static Tuple<DateTime, DateTime?> MinStartMaxEnd(this IQueryable<Section> source)
{
if(source == null)
return null;
var minMax = (from s0 in source
from s1 in source
orderby s0.StartDate, s1.EndDate descending
select new {s0.StartDate, s1.EndDate}).FirstOrDefault();
return minMax == null ? null : Tuple.Create(minMax.StartDate, minMax.EndDate);
}
public static Tuple<DateTime, DateTime?> MinStartMaxEnd(this IEnumerable<Section> source)
{
if(source != null)
using(var en = source.GetEnumerator())
if(en.MoveNext())
{
var cur = en.Current;
var start = cur.StartDate;
var end = cur.EndDate;
while(en.MoveNext())
{
cur = en.Current;
if(cur.StartDate < start)
start = cur.StartDate;
if(cur.EndDate.HasValue && (!end.HasValue || cur.EndDate > end))
end = cur.EndDate;
}
return Tuple.Create(start, end);
}
return null;
}
but I would like to use one linq query so I know that I'm only iterating over the list once
To come back to this: LINQ does not promise to iterate over a list once. It can sometimes do so (or not iterate at all). It can translate into database queries that turn what is conceptually several iterations into one or two (common with CTEs). It can produce code that is very efficient for a variety of similar-but-not-quite-the-same queries, where the alternative in hand-written code would be either to suffer a lot of waste or to write reams of similar-but-not-quite-the-same methods.
But it can also hide N+1 or N*N behaviour in what looks like a lot less, if you assume LINQ gives you a single pass. If you need particular single-pass behaviour, add it to LINQ; it's extensible.
You can use Min and Max:
List<Section> test = new List<Section>();
minStartDate = test.Min(o => o.StartDate);
maxEndDate = test.Max(o => o.EndDate);
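Note that Min and Max each enumerate the list once, which is the two-pass approach the question wanted to avoid. For completeness, a single-pass in-memory alternative is Aggregate, folding a running (min, max) pair through the list - a sketch using the Section class from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Section
{
    public DateTime StartDate;
    public DateTime? EndDate;
}

public static class SectionStats
{
    // One enumeration: carry the running minimum start and maximum end.
    public static (DateTime Min, DateTime? Max) MinStartMaxEnd(IEnumerable<Section> sections)
    {
        return sections.Aggregate(
            (Min: DateTime.MaxValue, Max: (DateTime?)null),
            (acc, s) => (
                s.StartDate < acc.Min ? s.StartDate : acc.Min,
                s.EndDate.HasValue && (!acc.Max.HasValue || s.EndDate > acc.Max)
                    ? s.EndDate
                    : acc.Max));
    }
}

public static class Program
{
    public static void Main()
    {
        var sections = new List<Section>
        {
            new Section { StartDate = new DateTime(2020, 3, 1), EndDate = new DateTime(2020, 6, 1) },
            new Section { StartDate = new DateTime(2020, 1, 1), EndDate = null },
            new Section { StartDate = new DateTime(2020, 2, 1), EndDate = new DateTime(2020, 9, 1) },
        };
        var (min, max) = SectionStats.MinStartMaxEnd(sections);
        Console.WriteLine(min.ToString("yyyy-MM-dd"));       // 2020-01-01
        Console.WriteLine(max.Value.ToString("yyyy-MM-dd")); // 2020-09-01
    }
}
```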

Assigning values inside a LINQ Select?

I have the following query:
drivers.Select(d => { d.id = 0; d.updated = DateTime.Now; return d; }).ToList();
drivers is a List which comes in with different id and updated values, so I am changing the values in the Select - but is that the proper way to do it? I already know that I am not reassigning drivers to drivers, because ReSharper complains about it, so I guess it would be better if it was:
drivers = drivers.Select(d => { d.id = 0; d.updated = DateTime.Now; return d; }).ToList();
but is this still the way someone should assign new values to each element in the drivers List?
Although this looks innocent, especially in combination with a ToList call that executes the code immediately, I would definitely stay away from modifying anything as part of a query: the trick is so unusual that it would trip up readers of your program, even experienced ones, especially if they never saw this before.
There's nothing wrong with foreach loops - the fact that you can do it with LINQ does not mean that you should be doing it.
NEVER DO THIS. A query should be a query; it should be non-destructively asking questions of a data source. If you want to cause a side effect then use a foreach loop; that's what it's for. Use the right tool for the job.
Ok I will make an answer myself.
Xaisoft, LINQ queries, be they lambda expressions or query expressions, shouldn't be used to mutate a list. Hence your Select
drivers = drivers.Select(d => { d.id = 0; d.updated = DateTime.Now; return d; }).ToList();
is bad style. It is confusing and unreadable, non-standard, and against the LINQ philosophy. Another poor style of achieving the end result is:
drivers.Any(d => { d.id = 0; d.updated = DateTime.Now; return false; });
But that's not to say ForEach on List<T> is inappropriate. It finds uses in cases like yours - just do not mix mutation into a LINQ query, that's all. I prefer to write something like:
drivers.ForEach(d => d.updated = DateTime.Now);
It's elegant and understandable. Since it doesn't deal with LINQ, it's not confusing either. I don't like that syntax for multiple statements (as in your case) inside the lambda - it's a little less readable and harder to debug when things get complex. In your case I prefer a straight foreach loop.
foreach (var d in drivers)
{
d.id = 0;
d.updated = DateTime.Now;
}
Personally, I like ForEach on IEnumerable<T> as a terminating call to a LINQ expression (i.e., when the statement is not meant to be a query but an execution).
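Note that the BCL only defines ForEach on List<T>, not on IEnumerable<T>, so that terminating call would have to be your own extension - a minimal sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    // Terminating call: walks the (possibly deferred) sequence once,
    // applying the side effect to each element.
    public static void ForEach<T>(this IEnumerable<T> source, Action<T> action)
    {
        foreach (var item in source)
            action(item);
    }
}

public static class Program
{
    public static void Main()
    {
        var total = 0;
        Enumerable.Range(1, 4).ForEach(n => total += n);
        Console.WriteLine(total); // 10
    }
}
```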

Lambda for summing shorthand

I have the following LINQ query:
var q = from bal in data.balanceDetails
where bal.userName == userName && bal.AccountID == accountNumber
select new
{
date = bal.month + "/" + bal.year,
commission = bal.commission,
rebate = bal.rebateBeforeService,
income = bal.commission - bal.rebateBeforeService
};
I remember seeing a lambda shorthand for summing the commission field across the rows of q.
What would be the best way of summing it, aside from manually looping through the results?
It's easy - no need to loop within your code:
var totalCommission = q.Sum(result => result.commission);
Note that if you're going to use the results of q for various different calculations (which seems a reasonable assumption, as if you only wanted the total commission I doubt that you'd be selecting the other bits) you may want to materialize the query once so that it doesn't need to do all the filtering and projecting multiple times. One way of doing this would be to use:
var results = q.ToList();
That will create a List<T> for your anonymous type - you can still use the Sum code above on results here.
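For example (with made-up numbers standing in for the query's rows), materializing once lets you run several aggregates without re-running the filter and projection:

```csharp
using System;
using System.Linq;

public static class Program
{
    // Hypothetical totals over stand-in rows; the real q would come
    // from the balanceDetails query above.
    public static (decimal Commission, decimal Income) Totals()
    {
        var q = new[]
        {
            new { commission = 100m, rebate = 20m, income = 80m },
            new { commission = 50m, rebate = 10m, income = 40m },
        };

        var results = q.ToList(); // materialize once

        // Both sums reuse the same in-memory list.
        return (results.Sum(r => r.commission), results.Sum(r => r.income));
    }

    public static void Main()
    {
        var (commission, income) = Totals();
        Console.WriteLine(commission); // 150
        Console.WriteLine(income);     // 120
    }
}
```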

How to force my lambda expressions to evaluate early? Fix lambda expression weirdness?

I have written the following C# code:
_locationsByRegion = new Dictionary<string, IEnumerable<string>>();
foreach (string regionId in regionIds)
{
IEnumerable<string> locationIds = Locations
.Where(location => location.regionId.ToUpper() == regionId.ToUpper())
.Select(location => location.LocationId); //If I cast to an array here, it works.
_locationsByRegion.Add(regionId, locationIds);
}
This code is meant to create a dictionary with my "region ids" as keys and lists of "location ids" as values.
However, what actually happens is that I get a dictionary with the "region ids" as keys, but the value for each key is identical: it is the list of locations for the last region id in regionIds!
It looks like this is a product of how lambda expressions are evaluated. I can get the correct result by casting the list of location ids to an array, but this feels like a kludge.
What is a good practice for handling this situation?
You're using LINQ, so you need to perform an eager operation to make it execute the Select. ToList() is a good operator for that, and since List<T> implements IEnumerable<T> the result can be assigned to the IEnumerable<string> directly.
LINQ does lazy evaluation by default; ToList and other eager operators force the Select to occur. Until you use one of those operators the work is not performed. It is a bit like executing SQL in ADO.NET: the statement "Select * from users" doesn't actually run the query until you do something more with it. ToList makes the select execute.
You're closing over the variable, not the value.
Make a local copy of the variable so you capture the current value from the foreach loop instead:
_locationsByRegion = new Dictionary<string, IEnumerable<string>>();
foreach (string regionId in regionIds)
{
var regionToUpper = regionId.ToUpper();
IEnumerable<string> locationIds = Locations
.Where(location => location.regionId.ToUpper() == regionToUpper)
.Select(location => location.LocationId); //If I cast to an array here, it works.
_locationsByRegion.Add(regionId, locationIds);
}
Then read this:
http://msdn.microsoft.com/en-us/vcsharp/hh264182
edit - Forcing an eager evaluation would also work, as others have suggested, but most of the time eager evaluations end up being much slower.
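A stripped-down sketch of the capture issue itself (no database, just delegates), showing how the per-iteration copy keeps each deferred delegate independent:

```csharp
using System;
using System.Collections.Generic;

public static class ClosureDemo
{
    // Each delegate captures its own per-iteration copy of the value.
    public static List<Func<int>> MakeClosures(int[] inputs)
    {
        var closures = new List<Func<int>>();
        foreach (var n in inputs)
        {
            var local = n;                  // copy per iteration
            closures.Add(() => local * 10); // deferred: runs when invoked
        }
        return closures;
    }
}

public static class Program
{
    public static void Main()
    {
        var closures = ClosureDemo.MakeClosures(new[] { 1, 2, 3 });
        Console.WriteLine(closures[0]()); // 10
        Console.WriteLine(closures[2]()); // 30
    }
}
```

Since C# 5, the foreach iteration variable is itself fresh on each pass, so the explicit copy matters mainly for `for` loops and older compilers - but the point about capturing variables rather than values stands either way.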
Call ToList() or ToArray() after the Select(...). That way the entire collection will be evaluated right there.
Actually the question is about lookup creation, which could be achieved more simply with a standard LINQ group join:
var query = from regionId in regionIds
join location in Locations
on regionId.ToLower() equals location.regionId.ToLower() into g
select new { RegionID = regionId,
Locations = g.Select(location => location.LocationId) };
In this case all the locations are downloaded at once and grouped in memory. Also, this query will not be executed until you try to access the results, or until you convert it to a dictionary:
var locationsByRegion = query.ToDictionary(x => x.RegionID, x => x.Locations);
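If the locations are already in memory, ToLookup builds the same region-to-locations map eagerly in one call, so no deferred query survives to capture a loop variable at all. Location here is a hypothetical minimal stand-in for the poster's type:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal stand-in for the poster's location type.
public class Location
{
    public string regionId { get; set; }
    public string LocationId { get; set; }
}

public static class LocationGrouping
{
    // Built eagerly: the grouping happens here, not at enumeration time.
    public static ILookup<string, string> ByRegion(IEnumerable<Location> locations)
    {
        return locations.ToLookup(
            l => l.regionId.ToUpper(),
            l => l.LocationId);
    }
}

public static class Program
{
    public static void Main()
    {
        var locations = new List<Location>
        {
            new Location { regionId = "na", LocationId = "L1" },
            new Location { regionId = "NA", LocationId = "L2" },
            new Location { regionId = "eu", LocationId = "L3" },
        };
        Console.WriteLine(LocationGrouping.ByRegion(locations)["NA"].Count()); // 2
    }
}
```

A lookup also returns an empty sequence (rather than throwing) for a missing region key, which is often what you want here.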
