LINQ2SQL select rows based on large where - c#

I'm searching for a bunch of Int32s in a SQL (Compact Edition) database using LINQ2SQL.
My main problem is that I have a large list (thousands) of Int32s, and I want all records in the DB whose id field matches any of them. Currently I'm selecting one row at a time, effectively searching the index thousands of times.
How can I optimize this? Temp table?

This sounds like you could use a Contains query:
int[] intArray = ...;
var matches = from item in context.SomeTable
where intArray.Contains(item.id)
select item;

For searching for thousands of values, your options are:
Send an XML block to a stored procedure (complex, but doable)
Create a temp table, bulk upload the data, then join onto it (can cause problems with concurrency)
Execute multiple queries (i.e. break your group of IDs into chunks of a thousand or so and use BrokenGlass's solution); a sketch of this approach follows below
I'm not sure which you can do with Compact Edition.
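A minimal sketch of that third option, chunking the ids and issuing one Contains query per chunk (intArray, context.SomeTable and item.id follow the snippet above; the chunk size of 1000 is arbitrary and SQL Compact may impose its own limits):
const int chunkSize = 1000;
var results = new List<SomeTable>();

for (int i = 0; i < intArray.Length; i += chunkSize)
{
    // One IN (...) query per chunk instead of one query per id
    int[] chunk = intArray.Skip(i).Take(chunkSize).ToArray();
    results.AddRange(context.SomeTable.Where(item => chunk.Contains(item.id)));
}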

Insert your ints into a SQL table, then do:
var items = from row in table
            join intRow in intTable on row.TheIntColumn equals intRow.IntColumn
            select row;
Edit 1 & 2: Changed the answer so he joins 2 tables, no collections.
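For illustration, getting the ids into that table with LINQ to SQL might look roughly like this (IntRow and its IntColumn are hypothetical mapped names; a bulk copy would be faster for very large lists):
// Hypothetical staging entity mapped in the DataContext: IntRow with a single IntColumn
context.GetTable<IntRow>().InsertAllOnSubmit(
    intArray.Select(i => new IntRow { IntColumn = i }));
context.SubmitChanges();
// intTable in the join above would then be context.GetTable<IntRow>()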

My preference would be to write a stored procedure for the search. If you have an index on the field that you are searching, it will make life a lot easier for you in the future when the amount of rows to process increases.
The complexity you will come across is writing a select statement that can do an IN clause from an input parameter. What you need is a table-valued function to convert the string (of ids) into a column, and then use that column in the IN clause, like:
Select *
From SomeTable So
Where So.ID In (Select Column1 From dbo.StringToTable(InputIds))
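From LINQ to SQL, calling that query could look roughly like this (a sketch; dbo.StringToTable and its Column1 output are the hypothetical function above, and note that SQL Server Compact does not support user-defined functions or stored procedures, so this only applies to a full SQL Server instance):
string inputIds = string.Join(",", intArray); // e.g. "1,2,3"
var matches = context.ExecuteQuery<SomeTable>(
    @"SELECT * FROM SomeTable
      WHERE ID IN (SELECT Column1 FROM dbo.StringToTable({0}))",
    inputIds).ToList();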

I've come up with this linq solution after being tired of writing manual batching code.
It's not perfect (the batches are not all exactly the same size), but it solves the problem.
Very useful when you are not allowed to write stored procs or sql functions. Works with almost every linq expression.
Enjoy:
public static IQueryable<TResultElement> RunQueryWithBatching<TBatchElement, TResultElement>(this IList<TBatchElement> listToBatch, int batchSize, Func<List<TBatchElement>, IQueryable<TResultElement>> initialQuery)
{
    return RunQueryWithBatching(listToBatch, initialQuery, batchSize);
}

public static IQueryable<TResultElement> RunQueryWithBatching<TBatchElement, TResultElement>(this IList<TBatchElement> listToBatch, Func<List<TBatchElement>, IQueryable<TResultElement>> initialQuery)
{
    return RunQueryWithBatching(listToBatch, initialQuery, 0);
}

public static IQueryable<TResultElement> RunQueryWithBatching<TBatchElement, TResultElement>(this IList<TBatchElement> listToBatch, Func<List<TBatchElement>, IQueryable<TResultElement>> initialQuery, int batchSize)
{
    if (listToBatch == null)
        throw new ArgumentNullException("listToBatch");
    if (initialQuery == null)
        throw new ArgumentNullException("initialQuery");
    if (batchSize <= 0)
        batchSize = 1000;

    int batchCount = (listToBatch.Count / batchSize) + 1;

    // Spread the elements across batchCount groups (batches are roughly, not exactly, batchSize)
    // Alternative: Enumerable.Range(0, listToBatch.Count).Zip(listToBatch, (first, second) => new { GroupKey = first, BatchElement = second });
    var batchGroup = listToBatch.AsQueryable().Select((elem, index) => new { GroupKey = index % batchCount, BatchElement = elem });

    var keysBatchGroup = from obj in batchGroup
                         group obj by obj.GroupKey into grouped
                         select grouped;

    var groupedBatches = keysBatchGroup.Select(key => key.Select(group => group.BatchElement));

    var map = from employeekeysBatchGroup in groupedBatches
              let batchResult = initialQuery(employeekeysBatchGroup.ToList()).ToList() // force to memory because of stupid translation error in linq2sql
              from br in batchResult
              select br;

    return map;
}
usage:
using (var context = new SourceDataContext())
{
    // some code
    var myBatchResult = intArray.RunQueryWithBatching(batch => from v1 in context.Table where batch.Contains(v1.IntProperty) select v1, 2000);
    // some other code that makes use of myBatchResult
}
Then either use the result directly, expand it to a list, or whatever you need. Just make sure you don't lose the DataContext reference.

Related

FirstOrDefault() adding days of time to iteration

Got a kind of edge case issue here. I've been tasked with pulling all data from one database to another, where the destination database has a different schema.
I've chosen to write a WinForms utility to do the data mapping and transfer with Entity Framework/ADO.NET when necessary.
This has worked great so far, except for this one specific table that has 2.5 million records. The transfer takes about 10 minutes total when I disregard all foreign keys; however, when I start mapping foreign keys with FirstOrDefault() calls against in-memory lists of data that have already been moved to the destination database, quite literally 4 days are added to the amount of time that it takes.
I'm going to need to run this tool a lot over the coming days so this isn't really acceptable for me.
Here's my current approach (not my first approach; this is the result of much trial and error for efficiency's sake):
private OldModelContext _oldModelContext { get; } // instantiated in controller

using (var newModelContext = new NewModelContext())
{
    // Takes no time at all to load these into memory, collections are small, 3 - 20 records each
    var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToList();
    var alreadyMigratedTable2 = newModelContext.alreadyMigratedTable2.ToList();
    var alreadyMigratedTable3 = newModelContext.alreadyMigratedTable3.ToList();
    var alreadyMigratedTable4 = newModelContext.alreadyMigratedTable4.ToList();
    var alreadyMigratedTable5 = newModelContext.alreadyMigratedTable5.ToList();

    var oldDatasetInMemory = _oldModelContext.MasterData.AsNoTracking().ToList(); // 2.5 million records, takes about 6 minutes

    var table = new DataTable("MasterData");
    table.Columns.Add("Column1");
    table.Columns.Add("Column2");
    table.Columns.Add("Column3");
    table.Columns.Add("ForeignKeyColumn1");
    table.Columns.Add("ForeignKeyColumn2");
    table.Columns.Add("ForeignKeyColumn3");
    table.Columns.Add("ForeignKeyColumn4");
    table.Columns.Add("ForeignKeyColumn5");

    foreach (var masterData in oldDatasetInMemory)
    {
        DataRow row = table.NewRow();

        // With just these properties mapped, this takes about 2 minutes for all 2.5 million
        row["Column1"] = masterData.Property1;
        row["Column2"] = masterData.Property2;
        row["Column3"] = masterData.Property3;

        // With this mapping, we add about 4 days to the overall process.
        row["ForeignKeyColumn1"] = alreadyMigratedTable1.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
        row["ForeignKeyColumn2"] = alreadyMigratedTable2.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
        row["ForeignKeyColumn3"] = alreadyMigratedTable3.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
        row["ForeignKeyColumn4"] = alreadyMigratedTable4.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);
        row["ForeignKeyColumn5"] = alreadyMigratedTable5.FirstOrDefault(s => s.uniquePropertyOnNewDataset == masterData.uniquePropertyOnOldDataset);

        table.Rows.Add(row);
    }

    // Save table with SqlBulkCopy is very fast, takes about a minute and a half.
}
Note: uniquePropertyOn(New/Old)Dataset is most often a unique description string shared among the datasets, can't match Ids as they won't be the same across databases.
I have tried:
Instead of using a foreach, mapping with a LINQ Select statement; not much improvement was had.
Using .Where(predicate).FirstOrDefault(); didn't see any considerable improvement.
Running FirstOrDefault() against an IQueryable instead of lists of migrated data; didn't see any improvement.
Mapping to a List instead of a DataTable, but that makes no difference in the mapping speed, and also makes bulk saves slower.
I've been messing around with the idea of turning the foreach into a parallel foreach loop and locking the calls to the DataTable, but I keep running into Entity Framework "connection closed" issues when querying the in-memory lists while using the parallel foreach... not really sure what that's about, but initially the speed results were promising.
I'd be happy to post that code/errors if anyone thinks it's the right road to go down, but I'm not sure anymore.
The first thing I'd try is a dictionary, and pre-fetching the columns:
var fk1 = table.Columns["ForeignKeyColumn1"];
// ...
var alreadyMigratedTable1 = newModelContext.alreadyMigratedTable1.ToDictionary(
    x => x.uniquePropertyOnNewDataset);
// ...
if (alreadyMigratedTable1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var val))
    row[fk1] = val;
However, in reality: I'd also try to avoid the entire DataTable piece unless it is really, really necessary.
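Put together, a sketch of the row-filling loop with one lookup converted to a dictionary (property and column names follow the question's code):
var fk1 = table.Columns["ForeignKeyColumn1"];
var lookup1 = newModelContext.alreadyMigratedTable1
    .ToDictionary(x => x.uniquePropertyOnNewDataset);

foreach (var masterData in oldDatasetInMemory)
{
    DataRow row = table.NewRow();
    row["Column1"] = masterData.Property1;

    // O(1) hash lookup per row instead of an O(n) FirstOrDefault scan
    if (lookup1.TryGetValue(masterData.uniquePropertyOnOldDataset, out var match1))
        row[fk1] = match1;

    table.Rows.Add(row);
}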
If there is really no other way to migrate this data than to load everything into memory, you can make it more efficient by avoiding this nested loop and by linking the lists via Join.
Read: Why is LINQ JOIN so much faster than linking with WHERE?
var newData =
    from master in oldDatasetInMemory
    join t1 in alreadyMigratedTable1
        on master.uniquePropertyOnOldDataset equals t1.uniquePropertyOnNewDataset into t1Group
    from join1 in t1Group.Take(1).DefaultIfEmpty()
    join t2 in alreadyMigratedTable2
        on master.uniquePropertyOnOldDataset equals t2.uniquePropertyOnNewDataset into t2Group
    from join2 in t2Group.Take(1).DefaultIfEmpty()
    join t3 in alreadyMigratedTable3
        on master.uniquePropertyOnOldDataset equals t3.uniquePropertyOnNewDataset into t3Group
    from join3 in t3Group.Take(1).DefaultIfEmpty()
    join t4 in alreadyMigratedTable4
        on master.uniquePropertyOnOldDataset equals t4.uniquePropertyOnNewDataset into t4Group
    from join4 in t4Group.Take(1).DefaultIfEmpty()
    join t5 in alreadyMigratedTable5
        on master.uniquePropertyOnOldDataset equals t5.uniquePropertyOnNewDataset into t5Group
    from join5 in t5Group.Take(1).DefaultIfEmpty()
    select new { master, join1, join2, join3, join4, join5 };
foreach (var x in newData)
{
    DataRow row = table.Rows.Add();
    row["Column1"] = x.master.Property1;
    row["Column2"] = x.master.Property2;
    row["Column3"] = x.master.Property3;
    row["ForeignKeyColumn1"] = x.join1;
    row["ForeignKeyColumn2"] = x.join2;
    row["ForeignKeyColumn3"] = x.join3;
    row["ForeignKeyColumn4"] = x.join4;
    row["ForeignKeyColumn5"] = x.join5;
}
This is a LINQ Left-Outer-Join which takes only one row from the right side.

Where IN for linq

I have seen questions with this subject but mine is different.
I have a stored procedure (EmpsByManager) imported in EF. It returns data with the following fields:
EmpId, EmpName, PrimaryMobile
I have a claimTable in the db with the following fields:
EmpId, ClaimId, ClaimDetails...
I want to return all claims from claimTable whose EmpId is in the employees returned by EmpsByManager(ManagerId).
I could manage to do this with a loop:
public dynamic getActiveClaims(int ManagerId)
{
    db.Configuration.ProxyCreationEnabled = false;
    var myEmps = db.getEmpDetofManager(ManagerId).ToList();
    List<List<claimJSON>> claimsList = new List<List<claimJSON>>();
    foreach (var Emp in myEmps)
    {
        claimsList.Add(db.claimJSONs.Where(e => e.EmpId == Emp.EmpId && e.claimstatus != 0 && e.claimstatus != 8).ToList());
    }
    return claimsList;
}
This gives correct results, but I'm not convinced by the complexity and the number of database hits needed to get the required result.
Anyone? Thank you.
Currently you are hitting the db every time inside your loop. You can replace the Where clause inside your foreach loop with one that uses the Contains() method.
var myEmps = db.getEmpDetofManager(ManagerId).ToList();
// Get all EmpIds from the result and store them in a List<int>
List<int> empIds = myEmps.Select(f => f.EmpId).ToList();
// Use the list of EmpIds in your LINQ query.
var claimsList = db.claimJSONs.Where(e => empIds.Contains(e.EmpId)
                                          && e.claimstatus != 0 && e.claimstatus != 8).ToList();
Also, note that the result in the claimsList variable will be a List<claimJSON>, not a List<List<claimJSON>>.
This will result in 2 hits to the db. One for the stored proc and another for getting data from the claimJSON table for the list of EmpIds we got from the stored proc result.
Well, there is not a lot you CAN optimize. The main problem is the stored procedure.
Since you cannot join onto the output of a stored procedure - even without the limitations of EF - there is no way around it without rewriting the SP. It should not be a stored procedure anyway: this functionality is much better suited to a function, which can then be used in a more complex query. Someone forced it into an SP, and now you have to live with the limitations.
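For contrast, if the manager relationship were available as ordinary mapped data rather than only through the stored procedure, the whole thing could collapse into one query. A sketch (db.Employees and its ManagerId column are hypothetical):
var claims = db.claimJSONs
    .Where(c => db.Employees
                  .Where(e => e.ManagerId == ManagerId)   // hypothetical mapped table/column
                  .Select(e => e.EmpId)
                  .Contains(c.EmpId)
             && c.claimstatus != 0 && c.claimstatus != 8)
    .ToList();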

Using linq to get min and max of a particular field in one query

Let's say you have a class like:
public class Section {
    public DateTime StartDate;
    public DateTime? EndDate;
}
I have a list of these objects, and I would like to get the minimum start date and the maximum end date, but I would like to use one linq query so I know that I'm only iterating over the list once.
For instance, if I was doing this without linq, my code would look a bit like this (not checking for nulls):
DateTime? minStartDate = null;
DateTime? maxEndDate = null;
foreach (var s in sections) {
    if (s.StartDate < minStartDate) minStartDate = s.StartDate;
    if (s.EndDate > maxEndDate) maxEndDate = s.EndDate;
}
I could have two linq queries to get the min and max, but I know that under the covers, it would require iterating over all values twice.
I've seen min and max queries like this before, but with grouping. How would you do this without grouping, and in a single linq query?
How would you do this without grouping, and in a single linq query?
If I had to do that, then I'd do:
var minMax = (from s0 in sections
              from s1 in sections
              orderby s0.StartDate, s1.EndDate descending
              select new { s0.StartDate, s1.EndDate }).FirstOrDefault();
But I'd also consider the performance impact depending on the provider in question.
On a database I'd expect this to become something like:
SELECT s0.StartDate, s1.EndDate
FROM Sections AS s0
CROSS JOIN Sections AS s1
ORDER BY s0.StartDate ASC, s1.EndDate DESC
LIMIT 1
OR
SELECT TOP 1 s0.StartDate, s1.EndDate
FROM Sections AS s0, Sections AS s1
ORDER BY s0.StartDate ASC, s1.EndDate DESC
Depending on database type. How that in turn would be executed could well be two table scans, but if I was going to care about these dates I'd have indices on those columns so it should be two index look-scans toward the end of each index, so I'd expect it to be pretty fast.
I have a list of these objects
Then if I cared a lot about performance, I wouldn't use Linq.
but I would like to use one linq query so I know that I'm only iterating over the list once
That's why I wouldn't use linq. Since there's nothing in linq designed to deal with this particular case, it would hit the worst combination. Indeed it would be worse than 2 iterations; it would be N + 1 iterations, where N is the number of elements in Sections. Linq providers are good, but they aren't magic.
If I really wanted to be able to do this in Linq, as for example I was sometimes doing this against lists in memory and sometimes against databases and so on, I'd add my own methods to do each the best way possible:
public static Tuple<DateTime, DateTime?> MinStartMaxEnd(this IQueryable<Section> source)
{
    if (source == null)
        return null;
    var minMax = (from s0 in source
                  from s1 in source
                  orderby s0.StartDate, s1.EndDate descending
                  select new { s0.StartDate, s1.EndDate }).FirstOrDefault();
    return minMax == null ? null : Tuple.Create(minMax.StartDate, minMax.EndDate);
}

public static Tuple<DateTime, DateTime?> MinStartMaxEnd(this IEnumerable<Section> source)
{
    if (source != null)
        using (var en = source.GetEnumerator())
            if (en.MoveNext())
            {
                var cur = en.Current;
                var start = cur.StartDate;
                var end = cur.EndDate;
                while (en.MoveNext())
                {
                    cur = en.Current;
                    if (cur.StartDate < start)
                        start = cur.StartDate;
                    if (cur.EndDate.HasValue && (!end.HasValue || cur.EndDate > end))
                        end = cur.EndDate;
                }
                return Tuple.Create(start, end);
            }
    return null;
}
but I would like to use one linq query so I know that I'm only iterating over the list once
To come back to this. Linq does not promise to iterate over a list once. It can sometimes do so (or not iterate at all). It can call into database queries that in turn turn what is conceptually several iterations into one or two (common with CTEs). It can produce code that is very efficient for a variety of similar-but-not-quite-the-same queries where the alternative in hand-coding would be to either suffer a lot of waste or else to write reams of similar-but-not-quite-the-same methods.
But it can also hide some N+1 or N*N behaviour in what looks like a lot less if you assume Linq gives you a single pass. If you need particular single-pass behaviour, add to Linq; it's extensible.
You can use Min and Max:
List<Section> test = new List<Section>();
minStartDate = test.Min(o => o.StartDate);
maxEndDate = test.Max(o => o.EndDate);
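For an in-memory list there is also a genuinely single-pass option using Aggregate (a sketch; an empty list simply returns the seed values):
var minMax = sections.Aggregate(
    new { Min = DateTime.MaxValue, Max = (DateTime?)null }, // seed with neutral extremes
    (acc, s) => new
    {
        Min = s.StartDate < acc.Min ? s.StartDate : acc.Min,
        Max = s.EndDate.HasValue && (!acc.Max.HasValue || s.EndDate > acc.Max) ? s.EndDate : acc.Max
    });
// minMax.Min is the earliest StartDate, minMax.Max the latest non-null EndDate (or null)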

How to get row index by entity key in a dynamically built query using Entity Framework

In a grid, I need to page to a record by its ID. That is why I need to find its index in the user-filtered and user-sorted set.
I'm working with LINQ to Entities. The query is built dynamically, based on user input.
The table contains too many (more than 10^5) records for the following Stack Overflow suggestion to be any good:
Recs = Recs.Where( /* Filters */ );
Recs = Recs.OrderBy( /* Sort criteria */ );
Recs.AsEnumerable()
    .Select((x, index) => new { RowNumber = index, Record = x })
    .Where(x => x.Record.ID == 35);
Because LINQ to Entities doesn't support Select((entity, index) => ...), it would require downloading 250,000 records from the SQL server just so I could decide to show page 25,000.
Currently, my most promising idea is to transform each sort criterion into a filter. So finding the index of a person sorted by ascending height would become counting the shorter persons (sort criterion 'height ascending' => filter 'height less than' + count).
How should I approach this? Is this problem already solved? Is there any library for .NET that takes me even half way there?
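For a single ascending sort criterion, that idea might look like this (a sketch; Height is a hypothetical sort column, ID is assumed to be the tie-breaker, and pageSize is the grid's page length):
// Row index of the target record = number of records that sort before it
var target = Recs.Single(r => r.ID == 35);
int index = Recs.Count(r => r.Height < target.Height
                         || (r.Height == target.Height && r.ID < target.ID));
int pageNumber = index / pageSize;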
Here is a recursive function you can call to figure out the row number. If your database records are changing frequently it probably won't work, since it calls the database multiple times, narrowing the search by half each time.
public static int FindRowNumber<T>(IQueryable<T> query, Expression<Func<T, bool>> search, int skip, int take)
{
    if (take < 1) return -1;
    if (take == 1) return query.Skip(skip).Take(take).Any(search) ? skip : -1;

    int bottomSkip = skip;
    int bottomTake = take / 2;
    int topSkip = bottomTake + bottomSkip;
    int topTake = take - bottomTake;

    // Binary search: find which half of the current window contains the record, then recurse into it
    if (query.Skip(bottomSkip).Take(bottomTake).Any(search))
    {
        return FindRowNumber(query, search, bottomSkip, bottomTake);
    }
    if (query.Skip(topSkip).Take(topTake).Any(search))
    {
        return FindRowNumber(query, search, topSkip, topTake);
    }
    return -1;
}
You call it like so:
var query = ... // your query with ordering and filtering
int rownumber = FindRowNumber(query, x => x.ID == 35, 0, query.Count());

LINQ: Paging technique, using take and skip but need total records also - how to implement this?

I have implemented a paging routine using skip and take. It works great, but I need the total number of records in the table prior to calling Take and Skip.
I know I can submit 2 separate queries.
Get Count
Skip and Take
But I would prefer not to issue 2 calls to LINQ.
How can I return it in the same query (e.g. using a nested select statement)?
Previously, I used a paging technique in a stored procedure. I returned the items by using a temporary table, and I passed the count to an output parameter.
I'm sorry, but you can't. At least, not in a pretty way.
You can do it in an unpretty way, but I don't think you'll like it:
var query = from e in db.Entities where etc etc etc;

var pagedQuery =
    from e in query.Skip(pageSize * pageNumber).Take(pageSize)
    select new
    {
        Count = query.Count(),
        Entity = e
    };
You see? Not pretty at all.
There is no reason to do two separate queries or even a stored procedure. Use a let binding to hold a sub-query; when you are done you have an anonymous type that contains both your selected items and the total count. A single query to the database, one LINQ expression, and you're done. To get the values it would be jobQuery.Select(x => x.item) or jobQuery.FirstOrDefault().Count.
Let expressions are an amazing thing.
var jobQuery = (
    from job in jc.Jobs
    let jobCount = (
        from j in jc.Jobs
        where j.CustomerNumber.Equals(CustomerNumber)
        select j
    ).Count()
    where job.CustomerNumber.Equals(CustomerNumber)
    orderby job.FieldName
    select new
    {
        item = job,
        Count = jobCount
    }
).Skip(0).Take(100);
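Consuming the result could then look like this (a sketch; the page is materialized once, so the count comes along in the same round trip):
var page = jobQuery.ToList();                    // single database hit
var items = page.Select(x => x.item).ToList();   // the page of jobs
int totalCount = page.Count > 0 ? page[0].Count : 0;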
