Why is this Cross Join so Slow in Linq? - c#

I wrote this piece of Linq to handle doing a CROSS Join just like a database would between multiple lists.
But for some reason it's extremely slow when any of the lists grows past 3,000 items; I end up waiting around 30 seconds. These lists could grow to very large sizes.
This query is looped for each relationship with the other list's data coming from ColumnDataIndex.
Any advice?
UPDATE: The data is inserted into normal lists that are built beforehand from the configured sources. This is all in memory at the moment.
RunningResult[parameter.Uid] = (from source_row in RunningResult[parameter.Uid]
from target_row in ColumnDataIndex[dest_key]
where GetColumnFromUID(source_row, rel.SourceColumn) == GetColumnFromUID(target_row, rel.TargetColumn)
select new Row()
{
Columns = MergeColumns(source_row.Columns, target_row.Columns)
}).ToList();
The 2 extra functions:
MergeColumns: Takes the Columns from the 2 items and merges them into a single array.
public static Column[] MergeColumns(Column[] source_columns, Column[] target_columns)
{
Provider.Data.BucketColumn[] new_column = new Provider.Data.BucketColumn[source_columns.Length + target_columns.Length];
source_columns.CopyTo(new_column, 0);
target_columns.CopyTo(new_column, source_columns.Length);
return new_column;
}
GetColumnFromUID: Returns the Value of the Column in the Item matching the column uid given.
private static String GetColumnFromUID(Row row, String column_uid)
{
if (row != null)
{
var dest_col = row.Columns.FirstOrDefault(col => col.ColumnUid == column_uid);
return dest_col == null ? "" + row.RowId : dest_col.Value.ToString().ToLower();
}
else return String.Empty;
}
Update:
I ended up moving the data and the query to a database. This reduced the run time to a few milliseconds. I could have written an optimized looping function, but this was the fastest way out for me.

You don't actually need to be performing a cross join. Cross joins are inherently expensive operations. You shouldn't be doing that unless you really need it. In your case what you really need is just an inner join. You're performing a cross join which is resulting in lots of values that you don't need at all, and then you're filtering out a huge percentage of those values to leave you with the few that you need. If you just did an inner join from the start you would only compute the values that you need. That will save you from needing to create a whole lot of rows you don't need just to have them be thrown away.
LINQ has its own inner join operation, Join, so you don't even need to write your own:
RunningResult[parameter.Uid] = (from source_row in RunningResult[parameter.Uid]
join target_row in ColumnDataIndex[dest_key]
on GetColumnFromUID(source_row, rel.SourceColumn) equals
GetColumnFromUID(target_row, rel.TargetColumn)
select new Row()
{
Columns = MergeColumns(source_row.Columns, target_row.Columns)
}).ToList();

You're not doing a cross join, but an inner join with an ON clause; in your case the ON clause just happens to be written as the where predicate.
An inner join is typically done with hash sets/tables, so you can quickly find the row in set X based on the value in row Y.
So 'weston's answer is OK, yet you need to use dictionaries/hashtables to make it really fast. Be aware that there may be more than one row per key. You can use a multi-value hashtable/dictionary like this one for that:
https://github.com/SolutionsDesign/Algorithmia/blob/master/SD.Tools.Algorithmia/GeneralDataStructures/MultiValueDictionary.cs
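The hash-based join described above can be sketched with the built-in ToLookup, which maps each key to every row that shares it. This is only an illustration: the tuple rows, the sample values, and the case-insensitive comparer (standing in for the question's ToLower() call) are assumptions, and it is written as C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the question's rows: (join key, payload).
var sourceRows = new List<(string Key, string Data)>
{
    ("a", "s1"), ("b", "s2"), ("c", "s3"),
};
var targetRows = new List<(string Key, string Data)>
{
    ("A", "t1"), ("a", "t2"), ("c", "t3"), ("d", "t4"),
};

// Build the hash table once: each key maps to all target rows that share it.
ILookup<string, (string Key, string Data)> targetsByKey =
    targetRows.ToLookup(t => t.Key, StringComparer.OrdinalIgnoreCase);

// Probe the lookup once per source row; a missing key yields an empty sequence,
// so the total work is roughly proportional to the rows read plus the rows produced.
List<string> joined = sourceRows
    .SelectMany(s => targetsByKey[s.Key], (s, t) => s.Data + "+" + t.Data)
    .ToList();

Console.WriteLine(string.Join(", ", joined));
```

Enumerable.Join builds a similar lookup internally, so this mainly helps when the same target list is probed many times and the lookup can be reused.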

Related

LINQ to Entities performance issue with Where and Contains

Table A has - ID(PK), PartNumber, Code1, Code2
Table B has - InventoryID(PK) PartNumber, Part, and a bunch of other columns.
I need to get everything from Table B where Table B's PartNumber is NOT in Table A.
Example: Table B has PartNumber 123. There is no PartNumber in Table A for 123. Get that row.
What I currently have:
using (SomeEntity context = new SomeEntity())
{
var partmasterids = context.PartsMasters.Select(x => x.PartNumber).Distinct().ToList();
var test = context.Parts.Where(x => !partmasterids.Contains(x.PartNumber)).ToList();
}
I currently first get and select all the distinct part numbers from Table A.
Then I compare the part numbers of Table A and Table B and get each part from Table B whose part number is not in Table A.
There are about 11,000 records in table B and 200,000 records in table A.
I should be getting about 9000 parts which are not in table A.
I am running into huge performance issues with that second LINQ statement. If I do a .Take(100), that will even take around 20-30 seconds. Anything above 1000 will take way too long.
Is there a better way to write this LINQ statement?
From what I understand of your question, the equivalent in SQL would be something like
SELECT DISTINCT B.PartNumber AS MissingParts
FROM TableB as B
LEFT OUTER JOIN TableA as A ON B.PartNumber = A.PartNumber
WHERE A.PartNumber IS NULL
Run that SQL and measure the time it takes. Without indexes, that's as fast as it's going to get.
Now, if you really have to do it in EF, you'll need to do an equivalent statement, complete with the left join. Based on this question, it would look something like this
var query = from b in TableB
join a in TableA on b.PartNumber equals a.PartNumber into joind
from existsInA in joind.DefaultIfEmpty()
where existsInA == null
select b.PartNumber;
var missingParts = query.Distinct().ToList();
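To see what the left join plus null check returns, here is a minimal sketch of the same pattern run against in-memory lists instead of an EF context. The two lists and their values are made up for illustration; it is written as C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the part numbers in Table A and Table B.
var tableA = new List<string> { "100", "200", "200" };
var tableB = new List<string> { "100", "123", "123", "300" };

// Left join B to A, then keep only the B rows that found no partner in A.
var query = from b in tableB
            join a in tableA on b equals a into matches
            from match in matches.DefaultIfEmpty()
            where match == null
            select b;

List<string> missingParts = query.Distinct().ToList();
Console.WriteLine(string.Join(", ", missingParts));
```

Against a database the provider is expected to turn the same shape into a LEFT OUTER JOIN with an IS NULL filter, but the generated SQL is worth checking before relying on it.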

Writing a subquery using LINQ in C#

I would like to query a DataTable that produces a DataTable that requires a subquery. I am having trouble finding an appropriate example.
This is the subquery in SQL that I would like to create:
SELECT *
FROM SectionDataTable
WHERE SectionDataTable.CourseID = (SELECT SectionDataTable.CourseID
FROM SectionDataTable
WHERE SectionDataTable.SectionID = iSectionID)
I have the SectionID, iSectionID, and I would like to return all of the records in the Section table that have the same CourseID as the section identified by iSectionID.
I can do this using 2 separate queries as shown below, but I think a subquery would be better.
string tstrFilter = createEqualFilterExpression("SectionID", strCriteria);
tdtFiltered = TableInfo.Select(tstrFilter).CopyToDataTable();
iSelectedCourseID = tdtFiltered.AsEnumerable().Select(id => id.Field<int>("CourseID")).FirstOrDefault();
tdtFiltered.Clear();
tstrFilter = createEqualFilterExpression("CourseID", iSelectedCourseID.ToString());
tdtFiltered = TableInfo.Select(tstrFilter).CopyToDataTable();
Although it doesn't answer your question directly, what you are trying to do is much better suited for an inner join:
SELECT *
FROM SectionDataTable S1
INNER JOIN SectionDataTable S2 ON S1.CourseID = S2.CourseID
WHERE S2.SectionID = iSectionID
This could then be modeled very similarly using LINQ:
var query = from s1 in SectionDataTable
join s2 in SectionDataTable
on s1.CourseID equals s2.CourseID
where s2.SectionID == iSectionID
select s1;
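Because the question works with a DataTable rather than an entity set, the join needs AsEnumerable() and Field<T>() from LINQ to DataSet. The sketch below assumes both columns are of type int and uses invented sample rows; on .NET Framework it also assumes a reference to System.Data.DataSetExtensions. It is written as C# 9 top-level statements.

```csharp
using System;
using System.Data;
using System.Linq;

// Hypothetical table with the two columns the question mentions.
var sections = new DataTable("SectionDataTable");
sections.Columns.Add("SectionID", typeof(int));
sections.Columns.Add("CourseID", typeof(int));
sections.Rows.Add(1, 10);
sections.Rows.Add(2, 10);
sections.Rows.Add(3, 20);

int iSectionID = 1;

// Self-join on CourseID, filtered to the course of the given section.
var rows = from s1 in sections.AsEnumerable()
           join s2 in sections.AsEnumerable()
               on s1.Field<int>("CourseID") equals s2.Field<int>("CourseID")
           where s2.Field<int>("SectionID") == iSectionID
           select s1;

// CopyToDataTable throws InvalidOperationException for an empty sequence,
// so fall back to an empty table with the same schema.
DataTable result = rows.Any() ? rows.CopyToDataTable() : sections.Clone();
Console.WriteLine(result.Rows.Count);
```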
When working in LINQ you have to think about things a bit differently. You can follow Miky's suggestion, but personally I would prefer to use navigation properties.
For example, from your description I understand that you have at least 2 tables,
Course Master
Section Master
One Section must contain a Course reference
Which means
One Course can be in multiple Sections
Now if I see these tables as entities in my model I would see navigational properties as,
Course.Sections //<- Sections is actually a collection
Section.Course //<- Course is an object
So the same query can be written as,
var lstSections = context.Sections.Where(s => s.Course.Sections.Any(c => c.SectionID == iSectionID)).ToList();
I think your main goal is to extract all the Sections whose Course is the same as the given Section's Course.

Linq to Entity Paging With Large dataset too slow

I'm analyzing player data over millions of matches from an online game. I'm trying to page data into memory in chunks to reduce load times but using OrderBy with skip/take takes way too long (20+ minutes even for smaller queries).
This is my query:
var playerMatches = (from p in context.PlayerMatchEntities
join m in context.MatchEntities
on p.MatchId equals m.MatchId
where m.GameMode == (byte) gameMode
&& m.LobbyType == (byte) lobbyType
select p)
.OrderBy(p => p.MatchId)
.Skip((page - 1) * pageSize)
.Take(pageSize)
.ToList();
MatchId is indexed.
Each match has 10 players, and I currently have 3.3 million matches w/ 33 million rows in the PlayerMatch table, but data is being collected constantly.
Is there a way to get around the large performance drop caused by OrderBy?
This post is similar but didn't seem to be resolved.
Edit:
This is the SQL query generated:
SELECT
`Project1`.`AccountId`,
`Project1`.`MatchId`,
`Project1`.`PlayerSlot`,
`Project1`.`HeroId`,
`Project1`.`Item_0`,
`Project1`.`Item_1`,
`Project1`.`Item_2`,
`Project1`.`Item_3`,
`Project1`.`Item_4`,
`Project1`.`Item_5`,
`Project1`.`Kills`,
`Project1`.`Deaths`,
`Project1`.`Assists`,
`Project1`.`LeaverStatus`,
`Project1`.`Gold`,
`Project1`.`GoldSpent`,
`Project1`.`LastHits`,
`Project1`.`Denies`,
`Project1`.`GoldPerMin`,
`Project1`.`XpPerMin`,
`Project1`.`Level`,
`Project1`.`HeroDamage`,
`Project1`.`TowerDamage`,
`Project1`.`HeroHealing`
FROM (SELECT
`Extent2`.`AccountId`,
`Extent2`.`MatchId`,
`Extent2`.`PlayerSlot`,
`Extent2`.`HeroId`,
`Extent2`.`Item_0`,
`Extent2`.`Item_1`,
`Extent2`.`Item_2`,
`Extent2`.`Item_3`,
`Extent2`.`Item_4`,
`Extent2`.`Item_5`,
`Extent2`.`Kills`,
`Extent2`.`Deaths`,
`Extent2`.`Assists`,
`Extent2`.`LeaverStatus`,
`Extent2`.`Gold`,
`Extent2`.`GoldSpent`,
`Extent2`.`LastHits`,
`Extent2`.`Denies`,
`Extent2`.`GoldPerMin`,
`Extent2`.`XpPerMin`,
`Extent2`.`Level`,
`Extent2`.`HeroDamage`,
`Extent2`.`TowerDamage`,
`Extent2`.`HeroHealing`
FROM `match` AS `Extent1` INNER JOIN `playermatch` AS `Extent2` ON `Extent1`.`MatchId` = `Extent2`.`MatchId`
WHERE ((`Extent1`.`GameMode`) = 2) AND ((`Extent1`.`LobbyType`) = 7)) AS `Project1`
ORDER BY
`Project1`.`MatchId` ASC LIMIT 0,1000
Another approach could be to have a VIEW that does the join and indexes the appropriate columns, and then create a Table-Valued Function that uses the VIEW and returns a TABLE with only the page data.
You'll have to write the SQL query for the paging manually, but I think it would be faster.
I haven't tried something like that, so I can't be sure there will be a big speed boost.
You didn't include enough information for a definitive answer, so here is a suggestion.
One way to avoid the ORDER BY is to read rows from a table that is already stored in that order. I assume 'MatchId' is the primary key and clustered index of MatchEntities, which means MatchEntities.MatchId is stored physically sorted. If you switch the join so that the sorted stream is pulled first and the joined stream second, you may avoid the expensive sort.
Like this:
var playerMatches = (from m in context.MatchEntities // note the switch: MatchEntities goes first
join p in context.PlayerMatchEntities
on m.MatchId equals p.MatchId // the outer range variable (m) must be on the left of equals
where m.GameMode == (byte) gameMode
&& m.LobbyType == (byte) lobbyType
select p)
// .OrderBy(p => p.MatchId) // no need for this any more
.Skip((page - 1) * pageSize) // parentheses needed: page - 1 * pageSize means page - pageSize
.Take(pageSize)
.ToList();
Also see a query plan to find out how the query is executed by the database, what type of join is being used, etc. Maybe your original query does not exploit sorting at all.
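One detail worth checking in any Skip/Take paging code is the offset arithmetic: multiplication binds tighter than subtraction, so page - 1 * pageSize is page - pageSize, not (page - 1) * pageSize. A minimal in-memory sketch, where GetPage and the sample ids are illustrative names and data rather than anything from the question (C# 9 top-level statements):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical data: 25 match ids.
List<int> matchIds = Enumerable.Range(1, 25).ToList();

// Illustrative helper: returns the 1-based page of the ordered sequence.
List<int> GetPage(IEnumerable<int> source, int page, int pageSize) =>
    source.OrderBy(id => id)
          .Skip((page - 1) * pageSize) // page 1 starts at offset 0
          .Take(pageSize)
          .ToList();

List<int> thirdPage = GetPage(matchIds, page: 3, pageSize: 10);
Console.WriteLine(string.Join(", ", thirdPage));
```

With page 3 and a page size of 10, the offset is 20, so the last (partial) page holds ids 21 through 25.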

LinqToSql query loading unwanted data into memory

I am simulating a join on a linked server through linq to sql. My question is, it appears that linqtosql is bringing all the rows for y.Xstatuses into memory and then doing the join. If this is true how do i keep all the memory on sql server(and still do a cross datacontext join operation), if this not true what is going on that is eating all my ram?
var x = new fooDataContext();
var y = new barDataContext();
var allXNotDeleted = (from w in x.CoolTable
where w.IsDeleted != false
select w).ToList();//for our demo this returns 218 records
var allXWithCompleteStatus = (from notDeleted in allXNotDeleted
join s in y.XStatuses on notDeleted.StatusID equals s.StatusID
where s.StatusID == 1
select notDeleted).ToList();// insert massive memory gobbler here
return allXWithCompleteStatus;
EDIT:
Trying to implement Kevinbabcock's idea
using (x = new fooDataContext())
using (var y = new barDataContext())
{
var n = (from notDeleted in x.GetTable<CoolTable>()
join z in y.GetTable<Xstatus>() on notDeleted.StatusID equals z.StatusID
where z.StatusID == 1 && notDeleted.IsDeleted != false
select notDeleted).ToList();
}
This still throws a cross-context query exception.
It is not possible to perform cross data context query directly on the database.
Fetching one of the record sets into memory (with ToList()) forces the other side of the join to be processed in memory as well.
If you want to perform everything on sql server you have to have every entity in the same DataContext.
I'd recommend not calling ToList on allXNotDeleted for a start. That will pull those records into memory, which will probably mean that you can't avoid pulling all the other data into memory when you perform your second query.
EDIT:
As an additional note if your tables are particularly big, and you only need data from a few columns, you could set Delay Loaded to True in your database object models for the columns you don't need.
EDIT2:
I have just noticed both queries come from different contexts. In that case I suggest you create a stored procedure and call that from one of the contexts. The sproc should be responsible for spanning the contexts.
Do not call ToList() on allXNotDeleted. This materializes those records in memory, which will cause the entire XStatuses table to also be materialized in memory to perform the join.
Try this:
using(var context = new DataContext(connectionString))
{
// No ToList() here: the query stays deferred so it can be composed into the join below.
var allXNotDeleted =
from w in context.GetTable<CoolTable>()
where w.IsDeleted != false
select w;
var allXWithCompleteStatus = (
from notDeleted in allXNotDeleted
join s in context.GetTable<XStatuses>()
on notDeleted.StatusID equals s.StatusID
where s.StatusID == 1
select notDeleted)
.ToList();
return allXWithCompleteStatus;
}
This will only send a single query to SQL Server, and will only materialize the "notDeleted" values returned from the query. Don't forget to wrap your DataContext instances in using statements so that Dispose() is properly called when they go out of scope.
Also, did you mean to filter CoolTable with IsDeleted != false? This is equivalent to IsDeleted == true, which to me indicates that you want to join all deleted records (which the name of your variable, allXNotDeleted, seems to contradict).
EDIT: updated code to work with a single DataContext instance, which should eliminate the "query contains a reference to another DataContext" error. You will need to pass in the ConnectionString to the DataContext constructor if you're not using a derived DataContext class.

Is this LINQ Query "correct"?

I have the following LINQ query, that is returning the results that I expect, but it does not "feel" right.
Basically it is a left join. I need ALL records from the UserProfile table.
Then the LastWinnerDate is a single record from the winner table (possible multiple records) indicating the DateTime the last record was entered in that table for the user.
WinnerCount is the number of records for the user in the winner table (possible multiple records).
Video1 is basically a bool indicating there is, or is not a record for the user in the winner table matching on a third table Objective (should be 1 or 0 rows).
Quiz1 is same as Video 1 matching another record from Objective Table (should be 1 or 0 rows).
Video and Quiz is repeated 12 times because it is for a report to be displayed to a user listing all user records and indicate if they have met the objectives.
var objectiveIds = new List<int>();
objectiveIds.AddRange(GetObjectiveIds(objectiveName, false));
var q =
from up in MetaData.UserProfile
select new RankingDTO
{
UserId = up.UserID,
FirstName = up.FirstName,
LastName = up.LastName,
LastWinnerDate = (
from winner in MetaData.Winner
where objectiveIds.Contains(winner.ObjectiveID)
where winner.Active
where winner.UserID == up.UserID
orderby winner.CreatedOn descending
select winner.CreatedOn).First(),
WinnerCount = (
from winner in MetaData.Winner
where objectiveIds.Contains(winner.ObjectiveID)
where winner.Active
where winner.UserID == up.UserID
orderby winner.CreatedOn descending
select winner).Count(),
Video1 = (
from winner in MetaData.Winner
join o in MetaData.Objective on winner.ObjectiveID equals o.ObjectiveID
where o.ObjectiveNm == Constants.Promotions.SecVideo1
where winner.Active
where winner.UserID == up.UserID
select winner).Count(),
Quiz1 = (
from winner2 in MetaData.Winner
join o2 in MetaData.Objective on winner2.ObjectiveID equals o2.ObjectiveID
where o2.ObjectiveNm == Constants.Promotions.SecQuiz1
where winner2.Active
where winner2.UserID == up.UserID
select winner2).Count(),
};
You're repeating the part that joins the winners table several times. To avoid that, you can break the query into several consecutive Selects: instead of one huge select, you write two selects with less code. In your example I would first select the winners for each user before selecting the other result properties:
var q1 =
from up in MetaData.UserProfile
select new {up,
winners = from winner in MetaData.Winner
where winner.Active
where winner.UserID == up.UserID
select winner};
var q = from upWinnerPair in q1
select new RankingDTO
{
UserId = upWinnerPair.up.UserID,
FirstName = upWinnerPair.up.FirstName,
LastName = upWinnerPair.up.LastName,
LastWinnerDate = /* Here you will have more simple and less repeatable code
using winners collection from "upWinnerPair.winners"*/
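As a rough idea of how the shared winners collection can be reused, here is a sketch against in-memory data that uses let instead of two separate queries. The tuple shapes and sample values are assumptions; it covers only LastWinnerDate and WinnerCount, and it uses a nullable Max() where the question uses First(), so that users without winners yield null instead of throwing. It is written as C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for UserProfile and Winner.
var users = new List<(int UserID, string FirstName)> { (1, "Ann"), (2, "Bob") };
var winners = new List<(int UserID, int ObjectiveID, bool Active, DateTime CreatedOn)>
{
    (1, 5, true, new DateTime(2020, 1, 1)),
    (1, 6, true, new DateTime(2020, 3, 1)),
    (1, 7, false, new DateTime(2020, 5, 1)),
};
var objectiveIds = new List<int> { 5, 6 };

var q = from up in users
        // Shared filter, written once and reused by every column below.
        let userWinners = winners.Where(w => w.Active && w.UserID == up.UserID)
        let ranked = userWinners.Where(w => objectiveIds.Contains(w.ObjectiveID))
        select new
        {
            up.UserID,
            up.FirstName,
            LastWinnerDate = ranked.Select(w => (DateTime?)w.CreatedOn).Max(),
            WinnerCount = ranked.Count(),
        };

var report = q.ToList();
foreach (var row in report)
    Console.WriteLine($"{row.FirstName}: {row.WinnerCount} winner(s), last on {row.LastWinnerDate}");
```

Whether an ORM translates the let form into better SQL than the repeated subqueries depends on the provider, so the generated SQL is still worth inspecting.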
The query itself is pretty simple: just a main outer query and a series of subselects to retrieve actual column data. While it's not the most efficient means of querying the data you're after (joins and using windowing functions will likely get you better performance), it's the only real way to represent that query using either the query or expression syntax (windowing functions in SQL have no mapping in LINQ or the LINQ-supporting extension methods).
Note that you aren't doing any actual outer joins (left or right) in your code; you're creating subqueries to retrieve the column data. It might be worth looking at the actual SQL being generated by your query. You don't specify which ORM you're using (which would determine how to examine it client-side) or which database you're using (which would determine how to examine it server-side).
If you're using the ADO.NET Entity Framework, you can cast your query to an ObjectQuery and call ToTraceString().
If you're using SQL Server, you can use SQL Server Profiler (assuming you have access to it) to view the SQL being executed, or you can run a trace manually to do the same thing.
To perform an outer join in LINQ query syntax, do this:
Assuming we have two sources alpha and beta, each having a common Id property, you can select from alpha and perform a left join on beta in this way:
from a in alpha
join btemp in beta on a.Id equals btemp.Id into bleft
from b in bleft.DefaultIfEmpty()
select new { IdA = a.Id, IdB = b.Id }
Admittedly, the syntax is a little oblique. Nonetheless, it works and will be translated into something like this in SQL:
select
a.Id as IdA,
b.Id as Idb
from alpha a
left join beta b on a.Id = b.Id
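One caveat when the same pattern runs over in-memory collections (LINQ to Objects) rather than a database: b is null for unmatched rows, so b.Id would throw a NullReferenceException. A minimal sketch with made-up anonymous-type data, using the null-conditional operator to get a nullable IdB (C# 9 top-level statements):

```csharp
using System;
using System.Linq;

// Hypothetical sources; anonymous types are reference types, so DefaultIfEmpty() yields null.
var alpha = new[] { new { Id = 1 }, new { Id = 2 } };
var beta = new[] { new { Id = 1 } };

var rows = (from a in alpha
            join btemp in beta on a.Id equals btemp.Id into bleft
            from b in bleft.DefaultIfEmpty()
            // b?.Id is an int? that stays null when alpha has no match in beta.
            select new { IdA = a.Id, IdB = b?.Id }).ToList();

foreach (var row in rows)
    Console.WriteLine($"{row.IdA} -> {(row.IdB.HasValue ? row.IdB.ToString() : "null")}");
```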
It looks fine to me, though I could see why the multiple sub-queries could trigger inefficiency worries in the eyes of a coder.
Take a look at what SQL is produced though (I'm guessing you're running this against a database source from your saying "table" above), before you start worrying about that. The query providers can be pretty good at producing nice efficient SQL that in turn produces a good underlying database query, and if that's happening, then happy days (it will also give you another view on being sure of the correctness).
