LinqToSql query loading unwanted data into memory - c#

I am simulating a join on a linked server through LINQ to SQL. It appears that LINQ to SQL is bringing all the rows from y.XStatuses into memory and then doing the join. If that is true, how do I keep the work on SQL Server (and still do a cross-DataContext join)? If it is not true, what is going on that is eating all my RAM?
var x = new fooDataContext();
var y = new barDataContext();

var allXNotDeleted = (from w in x.CoolTable
                      where w.IsDeleted != false
                      select w).ToList(); // for our demo this returns 218 records

var allXWithCompleteStatus = (from notDeleted in allXNotDeleted
                              join s in y.XStatuses on notDeleted.StatusID equals s.StatusID
                              where s.StatusID == 1
                              select notDeleted).ToList(); // insert massive memory gobbler here

return allXWithCompleteStatus;
EDIT:
Trying to implement Kevinbabcock's idea
using (var x = new fooDataContext())
using (var y = new barDataContext())
{
    var n = (from notDeleted in x.GetTable<CoolTable>()
             join z in y.GetTable<XStatus>() on notDeleted.StatusID equals z.StatusID
             where z.StatusID == 1 && notDeleted.IsDeleted != false
             select notDeleted).ToList();
}
This still throws a cross-context query exception.

It is not possible to perform a cross-data-context query directly against the database.
Materializing one of the record sets in memory (ToList()) just forces the other side of the join to be processed in memory as well.
If you want everything to run on SQL Server, every entity involved has to belong to the same DataContext.
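To illustrate that last point, here is a minimal sketch of mapping both tables into a single DataContext so the join can be translated into one SQL statement. The class, table, and column names are assumptions based on the question, and the cross-database names in the Table attributes only work if SQL Server can resolve them from the one connection (e.g. same server, or a synonym/view over the linked server):
using System.Data.Linq;
using System.Data.Linq.Mapping;

[Table(Name = "foo.dbo.CoolTable")]   // assumed physical name
public class CoolTable
{
    [Column(IsPrimaryKey = true)] public int ID;
    [Column] public int StatusID;
    [Column] public bool IsDeleted;
}

[Table(Name = "bar.dbo.XStatuses")]   // assumed physical name
public class XStatus
{
    [Column(IsPrimaryKey = true)] public int StatusID;
}

public class CombinedDataContext : DataContext
{
    public CombinedDataContext(string connectionString) : base(connectionString) { }

    public Table<CoolTable> CoolTable { get { return GetTable<CoolTable>(); } }
    public Table<XStatus> XStatuses { get { return GetTable<XStatus>(); } }
}
With both tables in one context, the join from the question can be written against CombinedDataContext and executed as a single server-side query.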

I'd recommend not calling ToList on allXNotDeleted for a start. That pulls those records into memory, which probably means you can't avoid pulling all the other data into memory when you perform your second query.
EDIT:
As an additional note, if your tables are particularly big and you only need data from a few columns, you can set Delay Loaded to True in your database object model for the columns you don't need.
EDIT2:
I have just noticed that the two queries come from different contexts. In that case I suggest you create a stored procedure and call it from one of the contexts. The sproc should be responsible for spanning the databases.
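As a rough sketch of that suggestion (the stored procedure name dbo.GetCompleteNotDeleted is hypothetical; it would perform the cross-database join on the server and return rows shaped like CoolTable):
using (var x = new fooDataContext())
{
    // DataContext.ExecuteQuery<T> maps the sproc's result set onto the entity type
    var allXWithCompleteStatus = x
        .ExecuteQuery<CoolTable>("EXEC dbo.GetCompleteNotDeleted @statusId = {0}", 1)
        .ToList();
}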

Do not call ToList() on allXNotDeleted. This materializes those records in memory, which will cause the entire XStatuses table to also be materialized in memory to perform the join.
Try this:
using (var context = new DataContext(connectionString))
{
    var allXNotDeleted =
        from w in context.GetTable<CoolTable>()
        where w.IsDeleted != false
        select w;

    var allXWithCompleteStatus = (
        from notDeleted in allXNotDeleted
        join s in context.GetTable<XStatus>()
            on notDeleted.StatusID equals s.StatusID
        where s.StatusID == 1
        select notDeleted)
        .ToList();

    return allXWithCompleteStatus;
}
This will only send a single query to SQL Server and will only materialize the "notDeleted" values returned from the query. Don't forget to wrap your DataContext instances in using statements so that Dispose() is properly called when they go out of scope.
Also, did you mean to filter CoolTable with IsDeleted != false? This is equivalent to IsDeleted == true, which to me indicates that you want to join all deleted records (which the name of your variable, allXNotDeleted, seems to contradict).
EDIT: updated code to work with a single DataContext instance, which should eliminate the "query contains a reference to another DataContext" error. You will need to pass in the ConnectionString to the DataContext constructor if you're not using a derived DataContext class.

Related

Entity Framework LINQ to Entities Join Query Timeout

I am executing the following LINQ to Entities query, but it hangs and does not return a response before timing out. I executed the equivalent query on SQL Server and it returned 92,000 rows in 3 seconds.
var query = (from r in WinCtx.PartsRoutings
             join s in WinCtx.Tab_Processes on r.ProcessName equals s.ProcessName
             join p in WinCtx.Tab_Parts on r.CustPartNum equals p.CustPartNum
             select new { r }).ToList();
SQL Generated:
SELECT [ I omitted columns]
FROM [dbo].[PartsRouting] AS [Extent1]
INNER JOIN [dbo].[Tab_Processes] AS [Extent2] ON ([Extent1].[ProcessName] = [Extent2].[ProcessName]) OR (([Extent1].[ProcessName] IS NULL) AND ([Extent2].[ProcessName] IS NULL))
INNER JOIN [dbo].[Tab_Parts] AS [Extent3] ON ([Extent1].[CustPartNum] = [Extent3].[CustPartNum]) OR (([Extent1].[CustPartNum] IS NULL) AND ([Extent3].[CustPartNum] IS NULL))
The PartsRouting table has 100,000+ records, Parts has 15,000+, and Processes has 200.
I have tried many suggestions found online, but nothing has given me the same performance as running the SQL directly.
Based on the comments, it looks like the issue is caused by the additional OR ... IS NULL conditions in the joins generated by the EF SQL translator. They are added by EF to emulate the C# == operator semantics, which differ from SQL's = for NULL values.
You can start by turning that EF behavior off through the UseDatabaseNullSemantics property (it is false by default):
WinCtx.Configuration.UseDatabaseNullSemantics = true;
Unfortunately that's not enough, because it fixes the normal comparison operators, but the same was not done for join conditions.
If you are using the joins just for filtering (as it seems), you can replace them with LINQ Any conditions, which translate to SQL EXISTS; modern database query optimizers treat that the same way as an inner join:
var query = (from r in WinCtx.PartsRoutings
             where WinCtx.Tab_Processes.Any(s => r.ProcessName == s.ProcessName)
             where WinCtx.Tab_Parts.Any(p => r.CustPartNum == p.CustPartNum)
             select new { r }).ToList();
You might also consider using just select r, since creating an anonymous type with a single property only introduces additional memory overhead with no advantage.
Update: Looking at the latest comment, you do need fields from the joined tables (that's why it's important not to omit relevant parts of the query in the question). In that case, you could try the alternative join syntax with where clauses:
WinCtx.Configuration.UseDatabaseNullSemantics = true;

var query = (from r in WinCtx.PartsRoutings
             from s in WinCtx.Tab_Processes where r.ProcessName == s.ProcessName
             from p in WinCtx.Tab_Parts where r.CustPartNum == p.CustPartNum
             select new { r, s.Foo, p.Bar }).ToList();

Does Entity Framework query the database multiple times if I use different fields of the same Linq query at different times?

I searched the Internet and SO but couldn't find a helpful resource. Perhaps I am not using the right wording to search; if there are previous questions I have missed for this reason, please let me know and I will take this question down.
I am dealing with a busy database, so I need to send as few queries to it as possible.
If I access different columns of the same LINQ query from different levels of the code, is Entity Framework smart enough to foresee the required columns and bring them all in one round trip, or does it call the DB twice?
eg.
var query = from t1 in table_1
            join t2 in table_2 on t1.col1 equals t2.col1
            where t1.EmployeeId == EmployeeId
            group new { t1, t2 } by t1.col2 into grouped
            orderby grouped.Count() descending
            select new { Column1 = grouped.Key, Column2 = grouped.Sum(g => g.t2.col4) };
var records = query.Take(10);
// point x
var x = records.Select(a => a.Column1).ToArray();
var y = records.Select(a => a.Column2).ToArray();
Does EF query the database twice to produce x and y (send one query to get Column1 and then another to get Column2), or is it smart enough to know it needs both columns to be materialised and bring them both at point x?
Added to clarify the intention of the question:
I understand I can simply tack a greedy method onto the end of query.Take(10) and be done with it, but I am trying to understand whether the approach I am trying (which is, in my opinion, more elegant) works and, if not, what makes EF issue two queries.
Yes, currently your code will generate two queries that are executed against the database. The reason is that two different SQL statements are generated:
The first is the top query, taking only 10 records and then only Column1.
The second is the top query, taking only 10 records and then only Column2.
These are two queries because you call ToArray over two different Select statements, each producing different SQL. Most LINQ queries are deferred-executed and only run when you use something like ToArray()/ToList()/FirstOrDefault() — the operators that actually give you concrete data. In your original code you have two different ToArray calls over data that has not yet been retrieved, which means two queries (one for the first field and then one for the second).
The following code will result in a single query to the database
var records = (from t1 in table_1
               join t2 in table_2 on t1.col1 equals t2.col1
               where t1.EmployeeId == EmployeeId
               group new { t1, t2 } by t1.col2 into grouped
               orderby grouped.Count() descending
               select new { Column1 = grouped.Key, Column2 = grouped.Sum(g => g.t2.col4) })
              .Take(10).ToList();
var x = records.Select(a => a.Column1).ToArray();
var y = records.Select(a => a.Column2).ToArray();
In my solution above I added a ToList() after filtering down to only the data you need (Take(10)); at that point the query executes against the database. Then you have all the data in memory and can run any other LINQ operation over it without going back to the database.
Add ToString() calls to your code so you can inspect the generated SQL at different points. Then you will understand what is executed and when:
var query = from t1 in table_1
            join t2 in table_2 on t1.col1 equals t2.col1
            where t1.EmployeeId == EmployeeId
            group new { t1, t2 } by t1.col2 into grouped
            orderby grouped.Count() descending
            select new { Column1 = grouped.Key, Column2 = grouped.Sum(g => g.t2.col4) };
var generatedSql = query.ToString(); // Here you will see a query that brings all records
var records = query.Take(10);
generatedSql = query.ToString(); // Here you will see it taking only 10 records
// point x
var xQuery = records.Select(a => a.Column1);
generatedSql = xQuery.ToString(); // Here you will see only 1 column in query
// Still nothing has been executed to DB at this point
var x = xQuery.ToArray(); // And that is what will be executed here
// Now you are before second execution
var yQuery = records.Select(a => a.Column2);
generatedSql = yQuery.ToString(); // Here you will see only the second column in query
// Finally, second execution, now with the other column
var y = yQuery.ToArray();
When you run a LINQ statement against an entity set in EF, it only prepares the SELECT statement (that's why the type is IQueryable); the data is loaded lazily. Only when you try to use values from that query is the result evaluated via an enumerator.
So when you explicitly turn it into a collection (.ToList() etc.), it has to fetch the data to populate the list, and that is when the SQL command is fired.
It is designed this way to improve performance: if only particular properties of an entity are projected, EF does not have to fetch every column from the table.
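As a small illustration of that last point (the entity set and property names here are invented for the example), projecting into an anonymous type keeps the generated SELECT limited to the projected columns:
// Hypothetical entity set; only EmployeeId and Name appear in the generated SELECT list
var activeNames = context.Employees
    .Where(e => e.IsActive)
    .Select(e => new { e.EmployeeId, e.Name })
    .ToList(); // the query executes here, fetching just those two columns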

How to improve AsEnumerable performance in EF

I am working in VS 2012 with Entity Framework.
I have a 1-to-many mapping table structure in my EDMX.
var query = (
    from bm in this.Context.BilBillMasters.AsEnumerable()
    join g in
    (
        from c in this.Context.BilBillDetails.AsEnumerable()
        group c by new { c.BillID }
    )
    on bm.BillID equals (g == null ? 0 : g.Key.BillID) into bDG
    from billDetailGroup in bDG.DefaultIfEmpty()
    where bm.IsDeleted == false
        && (companyID == 0 || bm.CompanyID == companyID)
        && (userID == 0 || bm.CustomerID == userID)
    select new
    {
        bm.BillID,
        BillNo = bm.CustomCode,
        bm.BillDate,
        BillMonth = bm.MonthFrom,
        TransactionTypeID = bm.TransactionTypeID ?? 0,
        CustomerID = bm.CustomerID,
        Total = billDetailGroup.Sum(p => p.Amount), // group result
        bm.ReferenceID,
        bm.ReferenceTypeID
    }
);
This method takes close to 30 seconds to return a result on the first run.
I am not sure what is wrong.
I also tried getting a List of results and calling ElementAt(0); that is just as slow.
As soon as you use AsEnumerable, your query stops being an IQueryable. That means you are downloading the whole BilBillMasters and BilBillDetails tables and then doing the processing in your application rather than on the SQL server. This is bound to be slow.
The solution is equally obvious: don't use AsEnumerable. It moves processing from the SQL server (which has all the data and the indexes) to your application server (which has neither, and first has to pull all of the data from the DB server).
At the very least, you want to limit the amount of data downloaded as much as possible, e.g. filter the tables by CompanyID and CustomerID before calling AsEnumerable. Overall, though, I see no reason why this query couldn't be executed entirely on the SQL server, which is usually the preferred solution for many reasons.
It sounds as if you're using AsEnumerable as a fix for another problem, but it's almost certainly a bad solution, at least without further filtering of the data first.
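A minimal sketch of what that advice looks like for the query in the question, keeping everything in LINQ to Entities so it is translated into a single SQL statement (the entity and column names are taken from the question; the nullable cast on Amount assumes it is a decimal, so that SUM over an empty detail set translates cleanly):
var query =
    from bm in this.Context.BilBillMasters
    where bm.IsDeleted == false
        && (companyID == 0 || bm.CompanyID == companyID)
        && (userID == 0 || bm.CustomerID == userID)
    select new
    {
        bm.BillID,
        BillNo = bm.CustomCode,
        bm.BillDate,
        BillMonth = bm.MonthFrom,
        TransactionTypeID = bm.TransactionTypeID ?? 0,
        bm.CustomerID,
        // correlated subquery instead of AsEnumerable + group join;
        // EF translates this into a server-side SUM over the matching detail rows
        Total = this.Context.BilBillDetails
                    .Where(d => d.BillID == bm.BillID)
                    .Sum(d => (decimal?)d.Amount) ?? 0,
        bm.ReferenceID,
        bm.ReferenceTypeID
    };

var result = query.ToList(); // executes once, on SQL Server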

Why is this Cross Join so Slow in Linq?

I wrote this piece of LINQ to perform a CROSS JOIN, just like a database would, between multiple lists.
But for some reason it's extremely slow as soon as any of the lists grows past about 3,000 items; I'd wait 30 seconds or more, and these lists could grow much larger.
This query is looped for each relationship, with the other list's data coming from ColumnDataIndex.
Any advice?
UPDATE: The data is loaded into plain lists that are built beforehand from the configured sources. Everything is in memory at the moment.
RunningResult[parameter.Uid] = (from source_row in RunningResult[parameter.Uid]
                                from target_row in ColumnDataIndex[dest_key]
                                where GetColumnFromUID(source_row, rel.SourceColumn) == GetColumnFromUID(target_row, rel.TargetColumn)
                                select new Row()
                                {
                                    Columns = MergeColumns(source_row.Columns, target_row.Columns)
                                }).ToList();
The 2 extra functions:
MergeColumns: Takes the Columns from the 2 items and merges them into a single array.
public static Column[] MergeColumns(Column[] source_columns, Column[] target_columns)
{
    var new_column = new Column[source_columns.Length + target_columns.Length];
    source_columns.CopyTo(new_column, 0);
    target_columns.CopyTo(new_column, source_columns.Length);
    return new_column;
}
GetColumnFromUID: Returns the Value of the Column in the Item matching the column uid given.
private static String GetColumnFromUID(Row row, String column_uid)
{
    if (row != null)
    {
        var dest_col = row.Columns.FirstOrDefault(col => col.ColumnUid == column_uid);
        return dest_col == null ? "" + row.RowId : dest_col.Value.ToString().ToLower();
    }
    else return String.Empty;
}
Update:
I ended up moving the data and the query to a database, which reduced the time to a few milliseconds. I could have written an optimized looped function, but this was the fastest way out for me.
You don't actually need to perform a cross join. Cross joins are inherently expensive operations, and you shouldn't use one unless you really need it. In your case what you really need is an inner join. You're performing a cross join that produces lots of rows you don't need at all, and then filtering out a huge percentage of them to leave the few that you do need. If you do an inner join from the start, only the rows you need are ever computed, which saves you from creating a whole lot of rows just to throw them away.
LINQ has its own inner join operation, Join, so you don't even need to write your own:
RunningResult[parameter.Uid] = (from source_row in RunningResult[parameter.Uid]
                                join target_row in ColumnDataIndex[dest_key]
                                    on GetColumnFromUID(source_row, rel.SourceColumn) equals
                                       GetColumnFromUID(target_row, rel.TargetColumn)
                                select new Row()
                                {
                                    Columns = MergeColumns(source_row.Columns, target_row.Columns)
                                }).ToList();
You're not doing a cross join but an inner join with an ON clause; it's just that in your case the ON clause lives in the where predicate.
An inner join is typically implemented with hash sets/tables, so you can quickly find the rows in set X that match a value from row Y.
So weston's answer is fine, but to make it really fast you need to use dictionaries/hash tables yourself. Be aware that there may be more than one row per key; you can use a multi-value dictionary like this one for that:
https://github.com/SolutionsDesign/Algorithmia/blob/master/SD.Tools.Algorithmia/GeneralDataStructures/MultiValueDictionary.cs
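For illustration, here is a minimal hand-rolled hash-join sketch using the question's own names, with ToLookup standing in for the MultiValueDictionary linked above (LINQ's Join builds a similar lookup internally):
// Build a lookup (hash table) keyed on the target join column once,
// so each source row probes only its matching bucket instead of scanning the whole list.
var targetByKey = ColumnDataIndex[dest_key]
    .ToLookup(target_row => GetColumnFromUID(target_row, rel.TargetColumn));

RunningResult[parameter.Uid] = RunningResult[parameter.Uid]
    .SelectMany(source_row =>
        targetByKey[GetColumnFromUID(source_row, rel.SourceColumn)] // empty when there is no match
            .Select(target_row => new Row()
            {
                Columns = MergeColumns(source_row.Columns, target_row.Columns)
            }))
    .ToList();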

Is this LINQ Query "correct"?

I have the following LINQ query. It returns the results I expect, but it does not "feel" right.
Basically it is a left join: I need ALL records from the UserProfile table.
LastWinnerDate is a single value from the Winner table (which may hold multiple records per user) indicating the DateTime of the last record entered in that table for the user.
WinnerCount is the number of records for the user in the Winner table (possibly multiple records).
Video1 is basically a bool indicating whether there is a record for the user in the Winner table matching a third table, Objective (should be 1 or 0 rows).
Quiz1 is the same as Video1, matching another record from the Objective table (should be 1 or 0 rows).
Video and Quiz are repeated 12 times because this is for a report displayed to a user, listing all user records and indicating whether they have met the objectives.
var objectiveIds = new List<int>();
objectiveIds.AddRange(GetObjectiveIds(objectiveName, false));

var q =
    from up in MetaData.UserProfile
    select new RankingDTO
    {
        UserId = up.UserID,
        FirstName = up.FirstName,
        LastName = up.LastName,
        LastWinnerDate = (
            from winner in MetaData.Winner
            where objectiveIds.Contains(winner.ObjectiveID)
            where winner.Active
            where winner.UserID == up.UserID
            orderby winner.CreatedOn descending
            select winner.CreatedOn).First(),
        WinnerCount = (
            from winner in MetaData.Winner
            where objectiveIds.Contains(winner.ObjectiveID)
            where winner.Active
            where winner.UserID == up.UserID
            orderby winner.CreatedOn descending
            select winner).Count(),
        Video1 = (
            from winner in MetaData.Winner
            join o in MetaData.Objective on winner.ObjectiveID equals o.ObjectiveID
            where o.ObjectiveNm == Constants.Promotions.SecVideo1
            where winner.Active
            where winner.UserID == up.UserID
            select winner).Count(),
        Quiz1 = (
            from winner2 in MetaData.Winner
            join o2 in MetaData.Objective on winner2.ObjectiveID equals o2.ObjectiveID
            where o2.ObjectiveNm == Constants.Promotions.SecQuiz1
            where winner2.Active
            where winner2.UserID == up.UserID
            select winner2).Count(),
    };
You're repeating the "join the Winner table" part several times. To avoid that, break the query into consecutive Selects: instead of one huge select, make two smaller ones. In your example I would first select the per-user winners collection before projecting the other result properties:
var q1 =
    from up in MetaData.UserProfile
    select new
    {
        up,
        winners = from winner in MetaData.Winner
                  where winner.Active
                  where winner.UserID == up.UserID
                  select winner
    };

var q = from upWinnerPair in q1
        select new RankingDTO
        {
            UserId = upWinnerPair.up.UserID,
            FirstName = upWinnerPair.up.FirstName,
            LastName = upWinnerPair.up.LastName,
            LastWinnerDate = /* Here you will have simpler, less repetitive code
                                using the winners collection from upWinnerPair.winners */
        };
The query itself is pretty simple: just a main outer query and a series of subselects to retrieve actual column data. While it's not the most efficient means of querying the data you're after (joins and using windowing functions will likely get you better performance), it's the only real way to represent that query using either the query or expression syntax (windowing functions in SQL have no mapping in LINQ or the LINQ-supporting extension methods).
Note that you aren't doing any actual outer joins (left or right) in your code; you're creating subqueries to retrieve the column data. It might be worth looking at the actual SQL being generated by your query. You don't specify which ORM you're using (which would determine how to examine it client-side) or which database you're using (which would determine how to examine it server-side).
If you're using the ADO.NET Entity Framework, you can cast your query to an ObjectQuery and call ToTraceString().
If you're using SQL Server, you can use SQL Server Profiler (assuming you have access to it) to view the SQL being executed, or you can run a trace manually to do the same thing.
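For the Entity Framework case, a minimal sketch of dumping the generated SQL (assuming an ObjectContext-based model, where LINQ to Entities queries are ObjectQuery instances under the covers):
var objectQuery = q as System.Data.Objects.ObjectQuery;
if (objectQuery != null)
{
    // The T-SQL that EF will send for this query
    Console.WriteLine(objectQuery.ToTraceString());
}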
To perform an outer join in LINQ query syntax, do this:
Assuming we have two sources alpha and beta, each having a common Id property, you can select from alpha and perform a left join on beta in this way:
from a in alpha
join btemp in beta on a.Id equals btemp.Id into bleft
from b in bleft.DefaultIfEmpty()
select new { IdA = a.Id, IdB = b.Id }
Admittedly, the syntax is a little oblique. Nonetheless, it works and will be translated into something like this in SQL:
select
a.Id as IdA,
b.Id as IdB
from alpha a
left join beta b on a.Id = b.Id
It looks fine to me, though I can see why the multiple sub-queries might trigger efficiency worries in the eyes of a coder.
Take a look at the SQL that is produced, though, before you start worrying about that (I'm guessing you're running this against a database source, from your saying "table" above). The query providers can be pretty good at producing efficient SQL that in turn yields a good underlying database query, and if that's happening, then happy days (it will also give you another way to be sure of correctness).
