linq statement very slow - c#

I have a LINQ to SQL data context set up. I expect a certain number of logs, each identified by a mappingID, and I'm using that to write a web client that shows the status of these downloads. Right now I have a situation where a LINQ statement is taking forever, even though the number of rows seems relatively low.
The statement taking forever is:
var dlCatToUnion = (from cl in currentLogs
                    where ndcat.All(x => x != cl.CategoryCountryCategoryTypeMapping.CategoryID)
                    group cl by cl.CategoryCountryCategoryTypeMapping.Category.CategoryID into t1
                    select new CategoryStruct
                    {
                        CategoryName = t1.Max(x => x.CategoryCountryCategoryTypeMapping.Category.Name),
                        Status = t1.Any(x => x.Response != (int)ErrorCodes.staticCodes.success)
                            ? (int)ErrorCodes.staticCodes.genericFailure : (int)ErrorCodes.staticCodes.success,
                        AverageResponseTime = 0,
                        categoryId = t1.Key
                    });
Specifically, look at the second line, where ndcat.All(x => x != cl.CategoryCountryCategoryTypeMapping.CategoryID). If I take this part out, it's instant.
To show what that line is working with, here is ndcat:
var ndcat = (from ndid in notDownloadedIds
             where ndid.Category.StorefrontID == StorefrontID
             group ndid by ndid.CategoryID into finalCat
             select finalCat.Key);
And then notDownloadedIds
notDownloadedIds = cDataContext.CategoryCountryCategoryTypeMappings.Where(mapping =>
    !currentLogs.Select(dll => dll.CategoryCountryCategoryTypeMappingID)
                .Any(id => id == mapping.CategoryCountryCategoryTypeMappingID));
To give some estimates of row counts: currentLogs is around 25k rows, CategoryCountryCategoryTypeMappings is about 53k rows, and ndcat ends up being 47 rows (it also enumerates just about instantly).
Also, note that I've changed the suspect line to a !...Any(...) statement and it's just as slow.
Is there anywhere I'm being inefficient?

Have you tried changing:
where ndcat.All(x => x != cl.CategoryCountryCategoryTypeMapping.CategoryID)
to:
where !ndcat.Any(x => x == cl.CategoryCountryCategoryTypeMapping.CategoryID)
??

I ended up changing things a bit, but long story short, I did the ndcat check after grouping instead of before, which made it quite a bit faster.
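A rough sketch of that reordering, reusing the names from above (the exact final query isn't shown here, so treat this as an assumption about its shape): grouping first means the ndcat check runs once per group key instead of once per log row.
var dlCatToUnion = (from cl in currentLogs
                    group cl by cl.CategoryCountryCategoryTypeMapping.Category.CategoryID into t1
                    where !ndcat.Contains(t1.Key)   // 47 keys checked per group, not per 25k rows
                    select new CategoryStruct
                    {
                        CategoryName = t1.Max(x => x.CategoryCountryCategoryTypeMapping.Category.Name),
                        Status = t1.Any(x => x.Response != (int)ErrorCodes.staticCodes.success)
                            ? (int)ErrorCodes.staticCodes.genericFailure
                            : (int)ErrorCodes.staticCodes.success,
                        AverageResponseTime = 0,
                        categoryId = t1.Key
                    });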

Related

FirstOrDefault() adding days of time to iteration

Got a kind of edge case issue here. I've been tasked with pulling all data from one database to another, where the destination database has a different schema.
I've chosen to write a WinForms utility to do the data mapping and transfer with Entity Framework/ADO.NET when necessary.
This has worked great so far, except for one specific table that has 2.5 million records. The transfer takes about 10 minutes total when I disregard all foreign keys. However, when I start mapping foreign keys with FirstOrDefault() calls against in-memory lists of data that have already been moved to the destination database, quite literally 4 days are added to the overall time it takes.
I'm going to need to run this tool a lot over the coming days so this isn't really acceptable for me.
Here's my current approach (not my first approach; this is the result of much trial and error for efficiency's sake):
private OldModelContext _oldModelContext { get; } // instantiated in controller

using (var newModelContext = new NewModelContext())
{
    // Takes no time at all to load these into memory, collections are small, 3 - 20 records each
    var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToList();
    var alreadyMigratedTable2 = newModelContext.alreadyMigratedTable2.ToList();
    var alreadyMigratedTable3 = newModelContext.alreadyMigratedTable3.ToList();
    var alreadyMigratedTable4 = newModelContext.alreadyMigratedTable4.ToList();
    var alreadyMigratedTable5 = newModelContext.alreadyMigratedTable5.ToList();

    var oldDatasetInMemory = _oldModelContext.MasterData.AsNoTracking().ToList(); // 2.5 million records, takes about 6 minutes

    var table = new DataTable("MasterData");
    table.Columns.Add("Column1");
    table.Columns.Add("Column2");
    table.Columns.Add("Column3");
    table.Columns.Add("ForeignKeyColumn1");
    table.Columns.Add("ForeignKeyColumn2");
    table.Columns.Add("ForeignKeyColumn3");
    table.Columns.Add("ForeignKeyColumn4");
    table.Columns.Add("ForeignKeyColumn5");

    foreach (var masterData in oldDatasetInMemory)
    {
        DataRow row = table.NewRow();

        // With just these properties mapped, this takes about 2 minutes for all 2.5 million
        row["Column1"] = masterData.Property1;
        row["Column2"] = masterData.Property2;
        row["Column3"] = masterData.Property3;

        // With this mapping, we add about 4 days to the overall process.
        row["ForeignKeyColumn1"] = alreadyMigratedTable1.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
        row["ForeignKeyColumn2"] = alreadyMigratedTable2.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
        row["ForeignKeyColumn3"] = alreadyMigratedTable3.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
        row["ForeignKeyColumn4"] = alreadyMigratedTable4.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
        row["ForeignKeyColumn5"] = alreadyMigratedTable5.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);

        table.Rows.Add(row);
    }

    // Save table with SQLBulkCopy is very fast, takes about a minute and a half.
}
Note: uniquePropertyOn(New/Old)Dataset is most often a unique description string shared among the datasets; I can't match IDs as they won't be the same across databases.
I have tried:
Replacing the foreach with a LINQ Select projection; not much improvement was had.
Using .Where(predicate).FirstOrDefault(); didn't see any considerable improvement.
Running FirstOrDefault() against an IQueryable instead of lists of migrated data; didn't see any improvement.
Mapping to a List instead of a DataTable, but that makes no difference in the mapping speed and also makes bulk saves slower.
I've been messing around with the idea of turning the foreach into a parallel foreach loop and locking the calls to the DataTable, but I keep running into Entity Framework "connection closed" issues when querying the in-memory lists while using the parallel foreach... not really sure what that's about, but initially the speed results were promising.
I'd be happy to post that code/errors if anyone thinks it's the right road to go down, but I'm not sure anymore.
The first thing I'd try is a dictionary, and pre-fetching the columns:
var fk1 = table.Columns["ForeignKeyColumn1"];
// ...
var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToDictionary(
    x => x.uniquePropertyOnNewDataset);
// ...
if (alreadyMigratedTable1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var val))
    row[fk1] = val;
However, in reality: I'd also try to avoid the entire DataTable piece unless it is really, really necessary.
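If you do keep the DataTable, a rough sketch of the hot loop with those changes might look like the following; only the first lookup table is spelled out, the rest follow the same pattern, and the names are the ones from the question:
// Sketch only: dictionaries keyed by the matching property give O(1) lookups
// instead of an O(n) FirstOrDefault scan per row.
var lookup1 = newModelContext.alreadyMigratedTable1
    .ToDictionary(x => x.uniquePropertyOnNewDataset);
// ... same for tables 2-5 ...

var fk1 = table.Columns["ForeignKeyColumn1"];
// ... same for the other foreign key columns ...

foreach (var masterData in oldDatasetInMemory)
{
    DataRow row = table.NewRow();
    row["Column1"] = masterData.Property1;
    row["Column2"] = masterData.Property2;
    row["Column3"] = masterData.Property3;

    // TryGetValue avoids a second lookup and handles missing matches (the cell stays DBNull).
    if (lookup1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var match1))
        row[fk1] = match1;
    // ... repeat for the remaining foreign keys ...

    table.Rows.Add(row);
}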
If there is really no other way to migrate this data than to load everything into memory, you can make it more efficient by avoiding this nested loop and by linking the lists via Join.
Read: Why is LINQ JOIN so much faster than linking with WHERE?
var newData =
    from master in oldDatasetInMemory
    join t1 in alreadyMigratedTable1
        on master.uniquePropertyOnOldDataset equals t1.uniquePropertyOnNewDataset into t1Group
    from join1 in t1Group.Take(1).DefaultIfEmpty()
    join t2 in alreadyMigratedTable2
        on master.uniquePropertyOnOldDataset equals t2.uniquePropertyOnNewDataset into t2Group
    from join2 in t2Group.Take(1).DefaultIfEmpty()
    join t3 in alreadyMigratedTable3
        on master.uniquePropertyOnOldDataset equals t3.uniquePropertyOnNewDataset into t3Group
    from join3 in t3Group.Take(1).DefaultIfEmpty()
    join t4 in alreadyMigratedTable4
        on master.uniquePropertyOnOldDataset equals t4.uniquePropertyOnNewDataset into t4Group
    from join4 in t4Group.Take(1).DefaultIfEmpty()
    join t5 in alreadyMigratedTable5
        on master.uniquePropertyOnOldDataset equals t5.uniquePropertyOnNewDataset into t5Group
    from join5 in t5Group.Take(1).DefaultIfEmpty()
    select new { master, join1, join2, join3, join4, join5 };
foreach (var x in newData)
{
    DataRow row = table.Rows.Add();
    row["Column1"] = x.master.Property1;
    row["Column2"] = x.master.Property2;
    row["Column3"] = x.master.Property3;
    row["ForeignKeyColumn1"] = x.join1;
    row["ForeignKeyColumn2"] = x.join2;
    row["ForeignKeyColumn3"] = x.join3;
    row["ForeignKeyColumn4"] = x.join4;
    row["ForeignKeyColumn5"] = x.join5;
}
This is a LINQ Left-Outer-Join which takes only one row from the right side.

"Execution Timeout" on converting LINQ query results using ToList()

As the title states, I'm getting a "Wait operation timed out" message (inner exception message: "Timeout expired") on a module I'm maintaining. Every time the app tries to convert the query results using ToList(), it times out regardless of the number of results.
Reason this needs to be converted to list: Results needed to be exported to Excel for download.
Below is the code:
public Tuple<IEnumerable<ProductPriceSearchResultDto>, int> GetProductPriceSearchResults(ProductPriceFilterDto filter, int? pageNo = null)
{
    //// Predicate builder
    var predicate = GetProductPriceSearchFilter(filter);

    //// This runs for approx. 1 minute before throwing a "Wait operation timed out" message...
    var query = this.GetProductPriceSearchQuery()
        .Where(predicate)
        .Distinct()
        .OrderBy(x => x.DosageFormName)
        .ToList();

    return Tuple.Create<IEnumerable<ProductPriceSearchResultDto>, int>(query, 0);
}
My query:
var query = (from price in this.context.ProductPrice.AsExpandable()
             join product in this.context.vwDistributorProducts.AsExpandable()
                 on price.DosageFormCode equals product.DosageFormCode
             join customer in this.context.vwCustomerBranch.AsExpandable()
                 on price.CustCd equals customer.CustomerCode
             where price.CountryId == CurrentUserService.Identity.CountryId && !product.IsInactive
             select new
             {
                 price.PriceKey, price.EffectivityDateFrom, price.ContractPrice, price.ListPrice,
                 product.DosageFormName, product.MpgCode, product.DosageFormCode,
                 customer.CustomerName
             })
            .GroupBy(x => x.DosageFormCode)
            .Select(x => x.OrderByDescending(y => y.EffectivityDateFrom).FirstOrDefault())
            .Select(x => new ProductPriceSearchResultDto
            {
                PriceKey = x.PriceKey,
                DosageFormCode = x.DosageFormCode,
                DosageFormName = x.DosageFormName,
                EffectiveFrom = x.EffectivityDateFrom,
                Price = x.ListPrice,
                MpgCode = x.MpgCode,
                ContractPrice = x.ContractPrice,
                CustomerName = x.CustomerName
            });

return query;
Notes:
ProductPrice is a table and has a non-clustered index pointing at columns CountryId and DosageFormCode.
vwDistributorProducts and vwCustomerBranch are views copied from the client's database.
I'm already at my wit's end. How do I get rid of this error? Is there something in the code that I need to change?
Edit: As much as possible, I don't want to resort to setting a command timeout, because 1) the app is doing okay without it so far, except for this function, and 2) this is already a huge application and I don't want to put the other modules' performance at risk.
Any help would be greatly appreciated. Thank you.
I'd try to log the SQL this translates into.
The actual SQL can then be used to get the query plan, which may lead you closer to the root cause.
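For example, assuming this.context is an Entity Framework 6 DbContext (the AsExpandable() calls suggest EF with LinqKit; adjust accordingly for other versions), the generated SQL can be captured with EF6's built-in Database.Log hook:
// Write every SQL command EF executes to the debug output.
this.context.Database.Log = sql => System.Diagnostics.Debug.WriteLine(sql);

// Run the slow query once, copy the logged SQL into SSMS,
// and inspect the actual execution plan there.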

how to take 100 records from linq query based on a condition

I have a query which returns a result set. Based on a condition, I want to take only 100 records: I have a variable x, and if the value of x is 100 then I have to do .Take(100), else I need to get the complete set of records.
var abc = (from st in Context.STopics
           where st.IsActive == true && st.StudentID == 123
           select new result()
           {
               name = st.name
           }).ToList().Take(100);
Because LINQ returns an IQueryable which has deferred execution, you can create your query, then restrict it to the first 100 records if your condition is true and then get the results. That way, if your condition is false, you will get all results.
var abc = (from st in Context.STopics
           where st.IsActive && st.StudentID == 123
           select new result
           {
               name = st.name
           });

if (x == 100)
    abc = abc.Take(100);

var results = abc.ToList();
Note that it is important to do the Take before the ToList, otherwise, it would retrieve all the records, and then only keep the first 100 - it is much more efficient to get only the records you need, especially if it is a query on a database table that could contain hundreds of thousands of rows.
One of the most important concepts with the SQL TOP command is ORDER BY: you should not use TOP without ORDER BY, because it may return different results in different situations.
The same concept applies to LINQ too.
var results = Context.STopics.Where(st => st.IsActive && st.StudentID == 123)
                             .Select(st => new result() { name = st.name })
                             .OrderBy(r => r.name)
                             .Take(100)
                             .ToList();
Take and Skip operations are well defined only against ordered sets. More info
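To illustrate why the ordering matters, here is a typical paging sketch (the page and pageSize variables are made up for the example): the OrderBy comes before Skip/Take so that each page is stable between calls.
// Hypothetical paging example; without the OrderBy, the rows returned
// for a given page are not guaranteed to be the same on every call.
int page = 2, pageSize = 100;
var pageOfTopics = Context.STopics
    .Where(st => st.IsActive && st.StudentID == 123)
    .OrderBy(st => st.name)
    .Skip((page - 1) * pageSize)
    .Take(pageSize)
    .ToList();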
Although the other users are correct in giving you the results you want...
This is NOT how you should be using Entity Framework.
This is the better way to use EF.
var query = from student in Context.Students
            where student.Id == 123
            from topic in student.Topics
            orderby topic.Name
            select topic;
Notice how the structure more closely follows the logic of the business requirements.
You can almost read the code in English.

Entity Framework - Get row count of a (group by Subquery)

I have this simple expression to get each Order's amount:
public IQueryable<Orders> GetAccountSummery()
{
    return context.Orders.GroupBy(a => new { orderNo = a.orderNo })
                         .Select(b => new
                         {
                             orderNo = b.Key.orderNo,
                             amount = b.Sum(r => r.amount)
                         });
}
I needed to Get The total number of records returned by the previous expression:
SQL
select COUNT(1) from
(
    SELECT orderNo, SUM(amount) Amount
    FROM Orders
    GROUP BY orderNo
) tbl -- I get a 125,000 row count here
EF
public int GetOrdersCount()
{
    return GetAccountSummery().Count(); // This guy here gives 198,000 rows, which counts all rows from the Orders table

    // The following line gives the correct row count:
    return GetAccountSummery().AsEnumerable().Count(); // 125,000 rows
}
The problem with GetAccountSummery().AsEnumerable().Count() is that it runs the query on the server side first and then calculates the correct row count on the client side (consider the table size here).
Is there any way to get only the correct count without executing the select statement?
EDIT
If that is not possible with a GroupBy subquery, why is it possible for Where subqueries?
The way you currently have it structured? No. You'll always execute the .Select() statement, because .Count() employs immediate execution (see here for a list). However, .Count() also has an overload which takes in a predicate, which you should be able to use to grab the count without having to perform the select first, like so:
context.Orders.GroupBy(a => new { orderNo = a.orderNo })
              .Count(a => a.Key.orderNo != string.Empty);
EDIT: Whoops, forgot that it was a bool. Edited accordingly.
EDIT 2: Per the comments, I don't think it is possible, since you'll always need to have that select in there at any given time, and .Count() will always call it. Sorry.

Linq optimization of query and foreach

I return a List from a LINQ query, and after that I have to fill in its values with a for loop.
The problem is that it is too slow.
var formentries = (from f in db.bNetFormEntries
                   join s in db.bNetFormStatus on f.StatusID.Value equals s.StatusID into entryStatus
                   join s2 in db.bNetFormStatus on f.ExternalStatusID.Value equals s2.StatusID into entryStatus2
                   where f.FormID == formID
                   orderby f.FormEntryID descending
                   select new FormEntry
                   {
                       FormEntryID = f.FormEntryID,
                       FormID = f.FormID,
                       IPAddress = f.IpAddress,
                       UserAgent = f.UserAgent,
                       CreatedBy = f.CreatedBy,
                       CreatedDate = f.CreatedDate,
                       UpdatedBy = f.UpdatedBy,
                       UpdatedDate = f.UpdatedDate,
                       StatusID = f.StatusID,
                       StatusText = entryStatus.FirstOrDefault().Status,
                       ExternalStatusID = f.ExternalStatusID,
                       ExternalStatusText = entryStatus2.FirstOrDefault().Status
                   }).ToList();
and then I use the for in this way:
for (var x = 0; x < formentries.Count(); x++)
{
    var values = (from e in entryvalues
                  where e.FormEntryID.Equals(formentries.ElementAt(x).FormEntryID)
                  select e).ToList<FormEntryValue>();
    formentries.ElementAt(x).Values = values;
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
But it is definitely too slow.
Is there a way to make it faster?
it is definitely too slow. Is there a way to make it faster?
Maybe. Maybe not. But that's not the right question to ask. The right question is:
Why is it so slow?
It is a lot easier to figure out the answer to the first question if you have an answer to the second question! If the answer to the second question is "because the database is in Tokyo and I'm in Rome, and the fact that the packets move no faster than speed of light is the cause of my unacceptable slowdown", then the way you make it faster is you move to Japan; no amount of fixing the query is going to change the speed of light.
To figure out why it is so slow, get a profiler. Run the code through the profiler and use that to identify where you are spending most of your time. Then see if you can speed up that part.
From what I can see, you are iterating through formentries two more times without reason: once when you populate the values, and once when you convert to a dictionary.
If entryvalues is database driven, i.e. you get them from the database, then put the value-field population into the first query.
If it's not, then you do not need to invoke ToList() on the first query; do the loop and then the dictionary creation:
var formentries = from f in db.bNetFormEntries
                  join s in db.bNetFormStatus on f.StatusID.Value equals s.StatusID into entryStatus
                  join s2 in db.bNetFormStatus on f.ExternalStatusID.Value equals s2.StatusID into entryStatus2
                  where f.FormID == formID
                  orderby f.FormEntryID descending
                  select new FormEntry
                  {
                      FormEntryID = f.FormEntryID,
                      FormID = f.FormID,
                      IPAddress = f.IpAddress,
                      UserAgent = f.UserAgent,
                      CreatedBy = f.CreatedBy,
                      CreatedDate = f.CreatedDate,
                      UpdatedBy = f.UpdatedBy,
                      UpdatedDate = f.UpdatedDate,
                      StatusID = f.StatusID,
                      StatusText = entryStatus.FirstOrDefault().Status,
                      ExternalStatusID = f.ExternalStatusID,
                      ExternalStatusText = entryStatus2.FirstOrDefault().Status
                  };

var formEntryDictionary = new Dictionary<int, FormEntry>();
foreach (var formEntry in formentries)
{
    formEntry.Values = GetValuesForFormEntry(formEntry, entryvalues);
    formEntryDictionary.Add(formEntry.FormEntryID, formEntry);
}
return formEntryDictionary;
And the values preparation:
private IList<FormEntryValue> GetValuesForFormEntry(FormEntry formEntry, IEnumerable<FormEntryValue> entryValues)
{
    return (from e in entryValues
            where e.FormEntryID.Equals(formEntry.FormEntryID)
            select e).ToList<FormEntryValue>();
}
You can change the private method to accept only the entryId instead of the whole formEntry if you wish.
It's slow because you're O(N*M), where N is formentries.Count and M is entryvalues.Count. Even in a simple test I was getting more than 20 times slower with only 1,000 elements (and my type only had an int id field); with 10,000 elements in the list it was over 1,600 times slower than the code below!
Assuming your entryvalues is a local list and not hitting a database (just .ToList() it to a new variable somewhere if that's the case), and assuming your FormEntryID is unique (which it seems to be from the .ToDictionary call), then try this instead:
var entryvaluesDictionary = entryvalues
    .GroupBy(entry => entry.FormEntryID)
    .ToDictionary(g => g.Key, g => g.ToList());

for (var x = 0; x < formentries.Count; x++)
{
    if (entryvaluesDictionary.TryGetValue(formentries[x].FormEntryID, out var values))
        formentries[x].Values = values;
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
It should go a long way toward making it at least scale better.
Changes: .Count instead of .Count(), simply because it's better not to call an extension method when you don't need to; and using a dictionary to find the values, rather than doing a Where for every x value in the for loop, which effectively removes the M from the big-O.
If this isn't entirely correct, I'm sure you can change whatever is missing to suit your use case. As an aside, you should really consider using consistent casing for your variable names: formentries versus formEntries, one is just that little bit easier to read.
There are some reasons why this might be slow regarding the way you use formentries.
The formentries List<T> from above has a Count property, but you are calling the enumerable Count() extension method instead. This extension may or may not have an optimization that detects that you're operating on a collection type that has a Count property that it can defer to instead of walking the enumeration to compute the count.
Similarly the formEntries.ElementAt(x) expression is used twice; if they have not optimized ElementAt to determine that they are working with a collection like a list that can jump to an item by its index then LINQ will have to redundantly walk the list to get to the xth item.
The above evaluation may miss the real problem, which you'll only really know if you profile. However, you can avoid the above while making your code significantly easier to read if you switch how you iterate the collection of formentries as follows:
foreach (var fe in formentries)
{
    fe.Values = entryvalues
        .Where(e => e.FormEntryID.Equals(fe.FormEntryID))
        .ToList<FormEntryValue>();
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
You may have resorted to the for(var x=...) ...ElementAt(x) approach because you thought you could not modify properties on the object referenced by the foreach loop variable fe.
That said, another point that could be an issue is if formentries has multiple items with the same FormEntryID. This would result in the same work being done multiple times inside the loop. While the top query appears to be against a database, you can still do joins with data in linq-to-object land. Happy optimizing/profiling/coding - let us know what works for you.
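For instance, a minimal sketch of that linq-to-objects join, assuming entryvalues has already been pulled into memory, groups the values once up front so repeated FormEntryIDs don't trigger repeated scans:
// Build the lookup once: O(M). Each group access afterwards is O(1).
var valuesByEntryId = entryvalues.ToLookup(e => e.FormEntryID);

foreach (var fe in formentries)
{
    // A lookup returns an empty sequence for missing keys, so no null checks are needed.
    fe.Values = valuesByEntryId[fe.FormEntryID].ToList<FormEntryValue>();
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);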
