Got a kind of edge case issue here. I've been tasked with pulling all data from one database to another, where the destination database has a different schema.
I've chosen to write a WinForms utility to do the data mapping and transfer with Entity Framework/ADO.NET when necessary.
This has worked great so far, except for this one specific table that has 2.5 million records. The transfer takes about 10 minutes total when I disregard all foreign keys; however, when I start mapping foreign keys with FirstOrDefault() calls against in-memory lists of data that has already been moved to the destination database, it quite literally adds 4 days to the total run time.
I'm going to need to run this tool a lot over the coming days, so this isn't really acceptable for me.
Here's my current approach (not my first approach; this is the result of much trial and error for efficiency's sake):
private OldModelContext _oldModelContext { get; } //instantiated in controller
using (var newModelContext = new NewModelContext())
{
//Takes no time at all to load these into memory, collections are small, 3 - 20 records each
var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToList();
var alreadyMigratedTable2 = newModelContext.alreadyMigratedTable2.ToList();
var alreadyMigratedTable3 = newModelContext.alreadyMigratedTable3.ToList();
var alreadyMigratedTable4 = newModelContext.alreadyMigratedTable4.ToList();
var alreadyMigratedTable5 = newModelContext.alreadyMigratedTable5.ToList();
var oldDatasetInMemory = _oldModelContext.MasterData.AsNoTracking().ToList();//2.5 Million records, takes about 6 minutes
var table = new DataTable("MasterData");
table.Columns.Add("Column1");
table.Columns.Add("Column2");
table.Columns.Add("Column3");
table.Columns.Add("ForeignKeyColumn1");
table.Columns.Add("ForeignKeyColumn2");
table.Columns.Add("ForeignKeyColumn3");
table.Columns.Add("ForeignKeyColumn4");
table.Columns.Add("ForeignKeyColumn5");
foreach(var masterData in oldDatasetInMemory){
DataRow row = table.NewRow();
//With just these properties mapped, this takes about 2 minutes for all 2.5 Million
row["Column1"] = masterData.Property1;
row["Column2"] = masterData.Property2;
row["Column3"] = masterData.Property3;
//With this mapping, we add about 4 days to the overall process.
row["ForeignKeyColumn1"] = alreadyMigratedTable1.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
row["ForeignKeyColumn2"] = alreadyMigratedTable2.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
row["ForeignKeyColumn3"] = alreadyMigratedTable3.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
row["ForeignKeyColumn4"] = alreadyMigratedTable4.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
row["ForeignKeyColumn5"] = alreadyMigratedTable5.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
table.Rows.Add(row);
}
//Saving the table with SqlBulkCopy is very fast; it takes about a minute and a half.
}
}
Note: uniquePropertyOn(New/Old)Dataset is most often a unique description string shared between the datasets; I can't match on Ids because they won't be the same across the two databases.
I have tried:
Replacing the foreach with a LINQ Select projection; not much improvement was had.
Using .Where(predicate).FirstOrDefault(); I didn't see any considerable improvement.
Running FirstOrDefault() against an IQueryable instead of the in-memory lists of migrated data; I didn't see any improvement.
Mapping to a List instead of a DataTable, but that makes no difference in mapping speed and also makes bulk saves slower.
I've been toying with the idea of turning the foreach into a Parallel.ForEach loop and locking the calls to the DataTable, but I keep running into
Entity Framework connection closed issues
when querying the in-memory lists inside the parallel loop. I'm not really sure what that's about, but initially the speed results were promising.
I'd be happy to post that code/errors if anyone thinks it's the right road to go down, but I'm not sure anymore.
The first thing I'd try is a dictionary, and pre-fetching the columns:
var fk1 = table.Columns["ForeignKeyColumn1"];
// ...
var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToDictionary(
x => x.uniquePropertyOnNewDataset);
// ...
if (alreadyMigratedTable1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var val))
row[fk1] = val;
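Wired into the loop from the question, that looks roughly like this (a sketch only; the property, table and column names are taken from the question, and what you actually store in the FK column is an assumption):
// Build the lookups once, before the 2.5-million-row loop.
var lookup1 = newModelContext.alreadyMigratedTable1.ToDictionary(x => x.uniquePropertyOnNewDataset);
// ... same for alreadyMigratedTable2 through alreadyMigratedTable5 ...
var fk1 = table.Columns["ForeignKeyColumn1"];
foreach (var masterData in oldDatasetInMemory)
{
    DataRow row = table.NewRow();
    row["Column1"] = masterData.Property1;
    // O(1) dictionary lookup instead of scanning a list for every one of the 2.5M rows.
    if (lookup1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var match1))
        row[fk1] = match1; // or whatever key/value the destination column actually expects
    table.Rows.Add(row);
}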
However, in reality: I'd also try to avoid the entire DataTable piece unless it is really, really necessary.
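One option along those lines (an assumption-laden sketch, not the poster's code): SqlBulkCopy can consume an IDataReader directly, so with a helper such as the FastMember package's ObjectReader you could stream the projected rows straight to the destination without building a DataTable at all. The destinationConnectionString and the projected shape below are made up for illustration:
using FastMember; // assumes the FastMember NuGet package is available
var rows = oldDatasetInMemory.Select(m => new
{
    Column1 = m.Property1,
    Column2 = m.Property2,
    Column3 = m.Property3
    // FK columns resolved via the dictionaries shown above
});
using (var bulk = new SqlBulkCopy(destinationConnectionString) { DestinationTableName = "MasterData" })
using (var reader = ObjectReader.Create(rows, "Column1", "Column2", "Column3"))
{
    bulk.WriteToServer(reader); // streams rows; no intermediate DataTable in memory
}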
If there is really no other way to migrate this data than to load everything into memory, you can make it more efficient by avoiding this nested loop and by linking the lists via Join.
Read: Why is LINQ JOIN so much faster than linking with WHERE?
var newData =
from master in oldDatasetInMemory
join t1 in alreadyMigratedTable1
on master.uniquePropertyOnOldDataset equals t1.uniquePropertyOnNewDataset into t1Group
from join1 in t1Group.Take(1).DefaultIfEmpty()
join t2 in alreadyMigratedTable2
on master.uniquePropertyOnOldDataset equals t2.uniquePropertyOnNewDataset into t2Group
from join2 in t2Group.Take(1).DefaultIfEmpty()
join t3 in alreadyMigratedTable3
on master.uniquePropertyOnOldDataset equals t3.uniquePropertyOnNewDataset into t3Group
from join3 in t3Group.Take(1).DefaultIfEmpty()
join t4 in alreadyMigratedTable4
on master.uniquePropertyOnOldDataset equals t4.uniquePropertyOnNewDataset into t4Group
from join4 in t4Group.Take(1).DefaultIfEmpty()
join t5 in alreadyMigratedTable5
on master.uniquePropertyOnOldDataset equals t5.uniquePropertyOnNewDataset into t5Group
from join5 in t5Group.Take(1).DefaultIfEmpty()
select new { master, join1, join2, join3, join4, join5};
foreach (var x in newData)
{
DataRow row = table.Rows.Add();
row["Column1"] = x.master.Property1;
row["Column2"] = x.master.Property2;
row["Column3"] = x.master.Property3;
row["ForeignKeyColumn1"] = x.join1;
row["ForeignKeyColumn2"] = x.join2;
row["ForeignKeyColumn3"] = x.join3;
row["ForeignKeyColumn4"] = x.join4;
row["ForeignKeyColumn5"] = x.join5;
}
This is a LINQ Left-Outer-Join which takes only one row from the right side.
Related
I seem to have written a very slow piece of code that gets even slower when EF Core is involved.
Basically I have a list of items that store their attributes as a JSON string in the database, since I am storing many different items with different attributes.
I then have another table that contains the display order for each attribute, so when I send the items to the client I order them based on that order.
It is kind of slow: about 18-30 seconds for 700 records (measured from where I start my timer, not the whole block of code).
var inventoryItemDtos = new List<InventoryItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId);
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
Stopwatch a = new Stopwatch();
a.Start();
foreach (var item in inventoryItems)
{
var specs = JObject.Parse(item.Attributes);
var specDtos = new List<SpecDto>();
foreach (var inventorySpecification in inventorySpecifications.OrderBy(x => x.DisplayOrder))
{
if (specs.ContainsKey(inventorySpecification.JsonKey))
{
var value = specs.GetValue(inventorySpecification.JsonKey);
var newSpecDto = new SpecDto()
{
Key = inventorySpecification.JsonKey,
Value = value.ToString()
};
specDtos.Add(newSpecDto);
}
}
var dto = new InventoryItemDto()
{
// create dto
};
inventoryItemDtos.Add(dto);
}
Now it gets crazy slow when I add some more EF columns that I need info from.
In the // create dto area I access some information from other tables:
var dto = new InventoryItemDto()
{
// access brand columns
// access company columns
// access branch columns
// access country columns
// access state columns
};
Accessing these columns in the loop makes it take 6 minutes to process 700 rows.
I don't understand why it is so slow; it's the only change I really made, and I made sure to eager load everything in.
It almost makes me think eager loading is not working, but I don't know how to verify whether it is or not.
var inventoryItems = dbContext.InventoryItems.Include(x => x.Branch).ThenInclude(x => x.Company)
.Include(x => x.Branch).ThenInclude(x => x.Country)
.Include(x => x.Branch).ThenInclude(x => x.State)
.Include(x => x.Brand)
.Where(x => x.InventoryCategoryId == categoryId).ToList();
so I thought that, because of this, the speed would not be much different than the original 18-30 seconds.
I would like to speed up the original code too, but I am not really sure how to get rid of the dual foreach loops that are probably slowing it down.
First, loops inside loops are a very bad thing; you should refactor that out and make it a single loop. This should not be a problem, because inventorySpecifications is declared outside the loop.
Second, the line
var inventorySpecifications = dbContext.InventoryCategorySpecifications.Where(x => x.InventoryCategoryId == categoryId).Select(x => x.InventorySpecification);
should end with ToList(), because its enumeration is happening within the inner foreach, which means the query is re-run for each of the inventoryItems.
That should save you a good amount of time.
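In other words, materialize that query once up front (a sketch of the same query from the question):
var inventorySpecifications = dbContext.InventoryCategorySpecifications
    .Where(x => x.InventoryCategoryId == categoryId)
    .Select(x => x.InventorySpecification)
    .ToList(); // one database query now, instead of one per inventory item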
I'm no expert but this part of your second foreach raises a red flag: inventorySpecifications.OrderBy(x => x.DisplayOrder). Because this is getting called inside another foreach it's doing the .OrderBy call every time you iterate over inventoryItems.
Before your first foreach loop, try this: var orderedInventorySpecs = inventorySpecifications.OrderBy(x => x.DisplayOrder); and then use foreach (var inventorySpec in orderedInventorySpecs) and see if it makes a difference.
To help you better understand what EF is running behind the scenes add some logging in to expose the SQL being run which might help you see how/where your queries are going wrong. This can be extremely helpful to help determine if your queries are hitting the DB too often. As a very general rule you want to hit the DB as few times as possible and retrieve only the information you need via the use of .Select() to reduce what is being returned. The docs for the logging are: http://learn.microsoft.com/en-us/ef/core/miscellaneous/logging
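As a rough example of turning that logging on (this assumes EF Core 5 or later, where LogTo is available; earlier versions wire up a LoggerFactory instead, and LogLevel comes from Microsoft.Extensions.Logging):
// In your DbContext:
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder
        .UseSqlServer(connectionString)                  // provider is an assumption
        .LogTo(Console.WriteLine, LogLevel.Information)  // writes the generated SQL to the console
        .EnableSensitiveDataLogging();                   // shows parameter values; dev only
}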
I obviously cannot test this and I am a little unsure where your specDto's go once you have them but I assume they become part of the InventoryItemDto?
var itemDtos = new List<ItemDto>();
var inventoryItems = dbContext.InventoryItems.Where(x => x.InventoryCategoryId == categoryId).Select(x => new InventoryItemDto() {
Attributes = x.Attributes,
//.....
// access brand columns
// access company columns
// access branch columns
// access country columns
// access state columns
}).ToList();
var inventorySpecifications = dbContext.InventoryCategorySpecifications
.Where(x => x.InventoryCategoryId == categoryId)
.OrderBy(x => x.DisplayOrder)
.Select(x => x.InventorySpecification).ToList();
foreach (var item in inventoryItems)
{
var specs = JObject.Parse(item.Attributes);
// Assuming the specs become part of an inventory item?
item.specs = inventorySpecifications.Where(x => specs.ContainsKey(x.JsonKey)).Select(x => new SpecDto() { Key = x.JsonKey, Value = specs.GetValue(x.JsonKey).ToString() }).ToList();
}
The first call to the DB for inventoryItems should produce one SQL query that will pull all the information you need at once to construct your InventoryItemDto and thus only hits the DB once. Then it pulls the specs out and uses OrderBy() before materialising which means the OrderBy will be run as part of the SQL query rather than in memory. Both those results are materialised via .ToList() which will cause EF to pull the results into memory in one go.
Finally the loop goes over your constructed inventoryItems, parses the Json and then filters the specs based on that. I am unsure of where you were using the specDtos so I made an assumption that they are part of the model. I would recommend checking the performance of the Json work you are doing as that could be contributing to your slowdown.
A more integrated approach to using Json as part of your EF models can be seen at this answer: https://stackoverflow.com/a/51613611/621524 however you will still be unable to use those properties to offload execution to SQL as accessing properties that are defined within code will cause queries to fragment and run in several parts.
I am working on a method that takes in two datatables and a list of primary key column names and gives back the matches. I do not have any other info about the tables.
I have searched the site for a solution to this problem and have found some answers, but none have given me a fast enough solution.
Based on results from stackoverflow I now have this:
var matches =
(from rowA in tableA.AsEnumerable()
from rowB in tableB.AsEnumerable()
where primaryKeyColumnNames.All(column => rowA[column].ToString() == rowB[column].ToString())
select new { rowA, rowB });
The problem is this is REALLY slow. It takes 4 minutes for two tables of 8,000 rows each. Before I came to Stack Overflow I was actually iterating through the columns and rows, and that took 2 minutes (so this is actually slower than what I had). 2-4 minutes doesn't seem so bad until I hit the table with 350,000 rows; then it takes days. I need to find a better solution.
Can anyone think of a way for this to be faster?
Edit: Per a suggestion from tinstaafl this is now my code.
var matches = tableA.Rows.Cast<DataRow>().Select(rowA => new
{
rowA,
rowB = tableB.Rows.Find(rowA.ItemArray.Where((x, y) =>
primaryKeyColumnNames.Contains(tableA.Columns[y].ColumnName,
StringComparer.InvariantCultureIgnoreCase)).ToArray())
})
.Where(x => x.rowB != null);
Using the PrimaryKey property of the DataTable, which will accept an array of columns, should help. Perhaps something like this:
tableA.PrimaryKey = primaryKeyColumnNames.Select(x => tableA.Columns[x]).ToArray();
tableB.PrimaryKey = primaryKeyColumnNames.Select(x => tableB.Columns[x]).ToArray();
var matches = (from System.Data.DataRow RowA in tableA.Rows
where tableB.Rows.Contains(RowA.ItemArray.Where((x,y) => primaryKeyColumnNames.Contains(tableA.Columns[y].ColumnName)).ToArray())
select RowA).ToList();
In a test with 2 tables with 9900 rows and returning 9800 as common, this took about 1/3 of a second.
I return a List from a LINQ query, and afterwards I have to fill in its values with a for loop.
The problem is that it is too slow.
var formentries = (from f in db.bNetFormEntries
join s in db.bNetFormStatus on f.StatusID.Value equals s.StatusID into entryStatus
join s2 in db.bNetFormStatus on f.ExternalStatusID.Value equals s2.StatusID into entryStatus2
where f.FormID == formID
orderby f.FormEntryID descending
select new FormEntry
{
FormEntryID = f.FormEntryID,
FormID = f.FormID,
IPAddress = f.IpAddress,
UserAgent = f.UserAgent,
CreatedBy = f.CreatedBy,
CreatedDate = f.CreatedDate,
UpdatedBy = f.UpdatedBy,
UpdatedDate = f.UpdatedDate,
StatusID = f.StatusID,
StatusText = entryStatus.FirstOrDefault().Status,
ExternalStatusID = f.ExternalStatusID,
ExternalStatusText = entryStatus2.FirstOrDefault().Status
}).ToList();
and then I use the for in this way:
for(var x=0; x<formentries.Count(); x++)
{
var values = (from e in entryvalues
where e.FormEntryID.Equals(formentries.ElementAt(x).FormEntryID)
select e).ToList<FormEntryValue>();
formentries.ElementAt(x).Values = values;
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
But it is definitely too slow.
Is there a way to make it faster?
it is definitely too slow. Is there a way to make it faster?
Maybe. Maybe not. But that's not the right question to ask. The right question is:
Why is it so slow?
It is a lot easier to figure out the answer to the first question if you have an answer to the second question! If the answer to the second question is "because the database is in Tokyo and I'm in Rome, and the fact that the packets move no faster than speed of light is the cause of my unacceptable slowdown", then the way you make it faster is you move to Japan; no amount of fixing the query is going to change the speed of light.
To figure out why it is so slow, get a profiler. Run the code through the profiler and use that to identify where you are spending most of your time. Then see if you can speed up that part.
From what I see, you are iterating through formentries two more times without reason: once when you populate the values, and once when you convert to a dictionary.
If entryvalues is database driven, i.e. you get them from the database, then put the value-field population in the first query.
If it's not, then you do not need to invoke ToList() on the first query; just do the loop and then the dictionary creation.
var formentries = from f in db.bNetFormEntries
join s in db.bNetFormStatus on f.StatusID.Value equals s.StatusID into entryStatus
join s2 in db.bNetFormStatus on f.ExternalStatusID.Value equals s2.StatusID into entryStatus2
where f.FormID == formID
orderby f.FormEntryID descending
select new FormEntry
{
FormEntryID = f.FormEntryID,
FormID = f.FormID,
IPAddress = f.IpAddress,
UserAgent = f.UserAgent,
CreatedBy = f.CreatedBy,
CreatedDate = f.CreatedDate,
UpdatedBy = f.UpdatedBy,
UpdatedDate = f.UpdatedDate,
StatusID = f.StatusID,
StatusText = entryStatus.FirstOrDefault().Status,
ExternalStatusID = f.ExternalStatusID,
ExternalStatusText = entryStatus2.FirstOrDefault().Status
};
var formEntryDictionary = new Dictionary<int, FormEntry>();
foreach (var formEntry in formentries)
{
formEntry.Values = GetValuesForFormEntry(formEntry, entryvalues);
formEntryDictionary.Add(formEntry.FormEntryID, formEntry);
}
return formEntryDictionary;
And the values preparation:
private IList<FormEntryValue> GetValuesForFormEntry(FormEntry formEntry, IEnumerable<FormEntryValue> entryValues)
{
return (from e in entryValues
where e.FormEntryID.Equals(formEntry.FormEntryID)
select e).ToList<FormEntryValue>();
}
You can change the private method to accept only the entryId instead of the whole formEntry if you wish.
It's slow because you're O(N*M), where N is formentries.Count and M is entryvalues.Count. Even in a simple test with only 1,000 elements (and my type only had an int id field) the original was more than 20 times slower; with 10,000 elements in the list it was over 1,600 times slower than the code below!
Assuming your entryvalues is a local list and not hitting a database (just .ToList() it to a new variable somewhere if that's the case), and assuming your FormEntryId is unique (which it seems to be from the .ToDictionary call), then try this instead:
var entryvaluesDictionary = entryvalues.GroupBy(entry => entry.FormEntryID)
                                       .ToDictionary(group => group.Key, group => group.ToList());
for (var x = 0; x < formentries.Count; x++)
{
    List<FormEntryValue> values;
    formentries[x].Values = entryvaluesDictionary.TryGetValue(formentries[x].FormEntryID, out values)
        ? values
        : new List<FormEntryValue>();
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
It should go a long way to making it at least scale better.
Changes: .Count instead of .Count(), just because it's better not to call an extension method when you don't need to. Using a dictionary to find the values, rather than doing a Where for every x value in the for loop, effectively removes the M from the big O.
If this isn't entirely correct I'm sure you can change whatever is missing to suit your use case. As an aside, you should really consider being consistent with the casing of your variable names: formEntries is just that little bit easier to read than formentries.
There are some reasons why this might be slow regarding the way you use formentries.
The formentries List<T> from above has a Count property, but you are calling the enumerable Count() extension method instead. This extension may or may not have an optimization that detects that you're operating on a collection type that has a Count property that it can defer to instead of walking the enumeration to compute the count.
Similarly, the formentries.ElementAt(x) expression is used twice; if ElementAt has not been optimized to detect that it is working with a collection like a list that can jump to an item by its index, then LINQ will have to redundantly walk the list to get to the xth item.
The above evaluation may miss the real problem, which you'll only really know if you profile. However, you can avoid the above while making your code significantly easier to read if you switch how you iterate the collection of formentries as follows:
foreach(var fe in formentries)
{
fe.Values = entryvalues
.Where(e => e.FormEntryID.Equals(fe.FormEntryID))
.ToList<FormEntryValue>();
}
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);
You may have resorted to the for(var x=...) ...ElementAt(x) approach because you thought you could not modify properties on the object referenced by the foreach loop variable fe.
That said, another point that could be an issue is if formentries has multiple items with the same FormEntryID. This would result in the same work being done multiple times inside the loop. While the top query appears to be against a database, you can still do joins with data in linq-to-object land. Happy optimizing/profiling/coding - let us know what works for you.
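For completeness, that in-memory join could look something like this (a sketch assuming both formentries and entryvalues are already materialized lists):
var withValues =
    (from fe in formentries
     join ev in entryvalues on fe.FormEntryID equals ev.FormEntryID into feValues
     select new { Entry = fe, Values = feValues.ToList() }).ToList();
foreach (var x in withValues)
    x.Entry.Values = x.Values; // each entry gets all of its values in a single pass
return formentries.ToDictionary(entry => entry.FormEntryID, entry => entry);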
I have two entities, Class and Student, linked in a many-to-many relationship.
When data is imported from an external application, unfortunately some classes are created in duplicate. The 'duplicate' classes have different names, but the same subject and the same students.
For example:
{ Id = 341, Title = '10rs/PE1a', SubjectId = 60, Students = { Jack, Bill, Sarah } }
{ Id = 429, Title = '10rs/PE1b', SubjectId = 60, Students = { Jack, Bill, Sarah } }
There is no general rule for matching the names of these duplicate classes, so the only way to identify that two classes are duplicates is that they have the same SubjectId and Students.
I'd like to use LINQ to detect all duplicates (and ultimately merge them). So far I have tried:
var sb = new StringBuilder();
using (var ctx = new Ctx()) {
ctx.CommandTimeout = 10000; // Because the next line takes so long!
var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id);
foreach (var c in allClasses) {
var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students));
foreach (var d in duplicates)
sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />");
}
}
lblResult.Text = sb.ToString();
This is no good because I get the error:
NotSupportedException: Unable to create a constant value of type 'TeachEDM.Student'. Only primitive types ('such as Int32, String, and Guid') are supported in this context.
Evidently it doesn't like me trying to match o.SubjectId == c.SubjectId in LINQ.
Also, this seems a horrible method in general and is very slow. The call to the database takes more than 5 minutes.
I'd really appreciate some advice.
The comparison of the SubjectId is not the problem because c.SubjectId is a value of a primitive type (int, I guess). The exception complains about Equals(c.Students). c.Students is a constant (with respect to the query duplicates) but not a primitive type.
I would also try to do the comparison in memory and not in the database. You are loading all the data into memory anyway when you start your first foreach loop: it executes the query allClasses. Then, inside the loop, you extend the IQueryable allClasses to the IQueryable duplicates, which then gets executed in the inner foreach loop. That is one database query per element of your outer loop! This could explain the poor performance of the code.
So I would try to perform the content of the first foreach in memory. For the comparison of the Students list it is necessary to compare element by element, not the references to the Students collections because they are for sure different.
var sb = new StringBuilder();
using (var ctx = new Ctx())
{
ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id)
.ToList(); // executes query, allClasses is now a List, not an IQueryable
// everything from here runs in memory
foreach (var c in allClasses)
{
var duplicates = allClasses.Where(
o => o.SubjectId == c.SubjectId &&
o.Id != c.Id &&
o.Students.OrderBy(s => s.Name).Select(s => s.Name)
.SequenceEqual(c.Students.OrderBy(s => s.Name).Select(s => s.Name)));
// duplicates is an IEnumerable, not an IQueryable
foreach (var d in duplicates)
sb.Append(d.LongName)
.Append(" is a duplicate of ")
.Append(c.LongName)
.Append("<br />");
}
}
lblResult.Text = sb.ToString();
Ordering the sequences by name is necessary because, I believe, SequenceEqual compares length of the sequence and then element 0 with element 0, then element 1 with element 1 and so on.
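If the pairwise scan itself ever becomes a bottleneck, another in-memory option (plainly a GroupBy on a composite key, not what the original code does) is to group the classes by SubjectId plus an ordered student-name key; this sketch assumes the student names are a reliable identity:
var duplicateGroups = allClasses
    .GroupBy(c => new
    {
        c.SubjectId,
        StudentKey = string.Join("|", c.Students.Select(s => s.Name).OrderBy(n => n))
    })
    .Where(g => g.Count() > 1);
foreach (var g in duplicateGroups)
    sb.Append(string.Join(", ", g.Select(c => c.LongName))).Append(" are duplicates<br />");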
Edit: To your comment that the first query is still slow.
If you have 1300 classes with 30 students each, the performance of eager loading (Include) can suffer from the multiplication of data that is transferred between database and client. This is explained here: How many Include I can use on ObjectSet in EntityFramework to retain performance? The query is complex because it needs a JOIN between classes and students, and object materialization is complex as well because EF must filter out the duplicated data when the objects are created.
An alternative approach is to load only the classes without the students in the first query, and then load the students explicitly, one by one, inside a loop. It would look like this:
var sb = new StringBuilder();
using (var ctx = new Ctx())
{
ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
var allClasses = ctx.Classes.OrderBy(o => o.Id).ToList(); // <- No Include!
foreach (var c in allClasses)
{
// "Explicite loading": This is a new roundtrip to the DB
ctx.LoadProperty(c, "Students");
}
foreach (var c in allClasses)
{
// ... same code as above
}
}
lblResult.Text = sb.ToString();
You would have 1 + 1300 database queries in this example instead of only one, but you won't have the data multiplication which occurs with eager loading and the queries are simpler (no JOIN between classes and students).
Explicit loading is explained here:
http://msdn.microsoft.com/en-us/library/bb896272.aspx
For POCOs (works also for EntityObject derived entities): http://msdn.microsoft.com/en-us/library/dd456855.aspx
For EntityObject derived entities you can also use the Load method of EntityCollection: http://msdn.microsoft.com/en-us/library/bb896370.aspx
If you work with lazy loading, the first foreach with LoadProperty would not be necessary, as each Students collection will be loaded the first time you access it. It should result in the same 1300 additional queries as explicit loading.
I'm searching for a bunch of int32's in a SQL (Compact edition) database using LINQ2SQL.
My main problem is that I have a large list (thousands) of int32 values, and I want all records in the DB where the id field matches any of my int32s. Currently I'm selecting one row at a time, effectively searching the index thousands of times.
How can I optimize this? Temp table?
This sounds like you could use a Contains query:
int[] intArray = ...;
var matches = from item in context.SomeTable
where intArray.Contains(item.id)
select item;
For searching for thousands of values, your options are:
Send an XML block to a stored procedure (complex, but doable)
Create a temp table, bulk upload the data, then join onto it (can cause problems with concurrency)
Execute multiple queries (i.e. break your group of IDs into chunks of a thousand or so and use BrokenGlass's solution; a rough sketch follows below)
I'm not sure which you can do with Compact Edition.
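The third option, for instance, could look roughly like this (ids stands in for your full list of int32 values, SomeTableRow is a placeholder for your row type, and context.SomeTable/id come from the answer above):
var results = new List<SomeTableRow>();
for (int i = 0; i < ids.Count; i += 1000)
{
    var chunk = ids.Skip(i).Take(1000).ToList(); // keep each generated IN list to ~1000 values
    results.AddRange(context.SomeTable.Where(row => chunk.Contains(row.id)));
}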
Insert your ints into a SQL table, then do:
var items = from row in table
join intRow in intTable on row.TheIntColumn equals intRow.IntColumn
select row;
Edit 1 & 2: Changed the answer so he joins 2 tables, no collections.
My preference would be to write a stored procedure for the search. If you have an index on the field you are searching, it will make life a lot easier in the future when the number of rows to process increases.
The complexity you will come across is writing a SELECT statement that can do an IN clause from an input parameter. What you need is a table-valued function that converts the string of Ids into a column, and then to use that column in the IN clause.
like:
Select *
From SomeTable So
Where So.ID In (Select Column1 From dbo.StringToTable(InputIds))
I came up with this LINQ solution after getting tired of writing manual batching code.
It's not perfect (the batch sizes are not exact), but it solves the problem.
It's very useful when you are not allowed to write stored procs or SQL functions, and it works with almost every LINQ expression.
Enjoy:
public static IQueryable<TResultElement> RunQueryWithBatching<TBatchElement, TResultElement>(this IList<TBatchElement> listToBatch, int batchSize, Func<List<TBatchElement>, IQueryable<TResultElement>> initialQuery)
{
return RunQueryWithBatching(listToBatch, initialQuery, batchSize);
}
public static IQueryable<TResultElement> RunQueryWithBatching<TBatchElement, TResultElement>(this IList<TBatchElement> listToBatch, Func<List<TBatchElement>, IQueryable<TResultElement>> initialQuery)
{
return RunQueryWithBatching(listToBatch, initialQuery, 0);
}
public static IQueryable<TResultElement> RunQueryWithBatching<TBatchElement, TResultElement>(this IList<TBatchElement> listToBatch, Func<List<TBatchElement>, IQueryable<TResultElement>> initialQuery, int batchSize)
{
if (listToBatch == null)
throw new ArgumentNullException("listToBatch");
if (initialQuery == null)
throw new ArgumentNullException("initialQuery");
if (batchSize <= 0)
batchSize = 1000;
int batchCount = (listToBatch.Count / batchSize) + 1;
var batchGroup = listToBatch.AsQueryable().Select((elem, index) => new { GroupKey = index % batchCount, BatchElement = elem }); // Enumerable.Range(0, listToBatch.Count).Zip(listToBatch, (first, second) => new { GroupKey = first, BatchElement = second });
var keysBatchGroup = from obj in batchGroup
group obj by obj.GroupKey into grouped
select grouped;
var groupedBatches = keysBatchGroup.Select(key => key.Select((group) => group.BatchElement));
var map = from employeekeysBatchGroup in groupedBatches
let batchResult = initialQuery(employeekeysBatchGroup.ToList()).ToList() // force to memory because of stupid translation error in linq2sql
from br in batchResult
select br;
return map;
}
usage:
using (var context = new SourceDataContext())
{
// some code
var myBatchResult = intArray.RunQueryWithBatching(batch => from v1 in context.Table where batch.Contains(v1.IntProperty) select v1, 2000);
// some other code that makes use of myBatchResult
}
Then either use the result as-is or expand it to a list, whatever you need. Just make sure you don't lose the DataContext reference.