I'm analyzing player data over millions of matches from an online game. I'm trying to page data into memory in chunks to reduce load times but using OrderBy with skip/take takes way too long (20+ minutes even for smaller queries).
This is my query:
var playerMatches = (from p in context.PlayerMatchEntities
join m in context.MatchEntities
on p.MatchId equals m.MatchId
where m.GameMode == (byte) gameMode
&& m.LobbyType == (byte) lobbyType
select p)
.OrderBy(p => p.MatchId)
.Skip(page - 1 * pageSize)
.Take(pageSize)
.ToList();
MatchId is indexed.
Each match has 10 players, and I currently have 3.3 million matches w/ 33 million rows in the PlayerMatch table, but data is being collected constantly.
Is there a way to get around the large performance drop caused by OrderBy?
This post is similar but didn't seem to be resolved.
Edit:
This is the SQL query generated:
SELECT
`Project1`.`AccountId`,
`Project1`.`MatchId`,
`Project1`.`PlayerSlot`,
`Project1`.`HeroId`,
`Project1`.`Item_0`,
`Project1`.`Item_1`,
`Project1`.`Item_2`,
`Project1`.`Item_3`,
`Project1`.`Item_4`,
`Project1`.`Item_5`,
`Project1`.`Kills`,
`Project1`.`Deaths`,
`Project1`.`Assists`,
`Project1`.`LeaverStatus`,
`Project1`.`Gold`,
`Project1`.`GoldSpent`,
`Project1`.`LastHits`,
`Project1`.`Denies`,
`Project1`.`GoldPerMin`,
`Project1`.`XpPerMin`,
`Project1`.`Level`,
`Project1`.`HeroDamage`,
`Project1`.`TowerDamage`,
`Project1`.`HeroHealing`
FROM (SELECT
`Extent2`.`AccountId`,
`Extent2`.`MatchId`,
`Extent2`.`PlayerSlot`,
`Extent2`.`HeroId`,
`Extent2`.`Item_0`,
`Extent2`.`Item_1`,
`Extent2`.`Item_2`,
`Extent2`.`Item_3`,
`Extent2`.`Item_4`,
`Extent2`.`Item_5`,
`Extent2`.`Kills`,
`Extent2`.`Deaths`,
`Extent2`.`Assists`,
`Extent2`.`LeaverStatus`,
`Extent2`.`Gold`,
`Extent2`.`GoldSpent`,
`Extent2`.`LastHits`,
`Extent2`.`Denies`,
`Extent2`.`GoldPerMin`,
`Extent2`.`XpPerMin`,
`Extent2`.`Level`,
`Extent2`.`HeroDamage`,
`Extent2`.`TowerDamage`,
`Extent2`.`HeroHealing`
FROM `match` AS `Extent1` INNER JOIN `playermatch` AS `Extent2` ON `Extent1`.`MatchId` = `Extent2`.`MatchId`
WHERE ((`Extent1`.`GameMode`) = 2) AND ((`Extent1`.`LobbyType`) = 7)) AS `Project1`
ORDER BY
`Project1`.`MatchId` ASC LIMIT 0,1000
Another approach could be to have a VIEW that does the join and indexes the appropriate columns and then create a Table-Valued Function that uses the VIEW and returns a TABLE with only the page data.
You'll have to manually write the SQL query for the paging, but i think it would be faster.
I haven't tried something like that so i can't be sure there is gonna be a big speed boost.
You didn't include enough information to help you so I'll suggest.
One way to avoid order by is to store rows in a table already in the order. I suggest 'MatchId' is a primary key and a clustered index of MatchEntities. That means MatchEntities.MatchId is stored physically sorted. If you switch join streams to pull the sorted stream first and additive stream second you avoid expensive sorting.
Like this:
var playerMatches = (from m in context.MatchEntities // note the switch: MatchEntities goes first
join p in context.PlayerMatchEntities
on p.MatchId equals m.MatchId
where m.GameMode == (byte) gameMode
&& m.LobbyType == (byte) lobbyType
select p)
// .OrderBy(p => p.MatchId) // no need for this any more
.Skip(page - 1 * pageSize)
.Take(pageSize)
.ToList();
Also see a query plan to find out how the query is executed by the database, what type of join is being used, etc. Maybe your original query does not exploit sorting at all.
Related
Predicament
I have a large dataset that I need to perform a complex calculation upon a part of. For my calculation, I need to take a chunk of ordered data from a large set based on input parameters.
My method signature looks like this:
double Process(Entity e, DateTimeOffset? start, DateTimeOffset? end)
Potential Solutions
The two following methods spring to mind:
Method 1 - WHERE Clause
double result = 0d;
IEnumerable<Quote> items = from item in e.Items
where (!start.HasValue || item.Date >= start.Value)
&& (!end.HasValue || item.Date <= end.Value)
orderby item.Date ascending
select item;
...
return result;
Method 2 - Skip & Take
double result = 0d;
IEnumerable<Item> items = e.Items.OrderBy(i => i.Date);
if (start.HasValue)
items = items.SkipWhile(i => i.Date < start.Value);
if (end.HasValue)
items = items.TakeWhile(i => i.Date <= end.Value);
...
return result;
Question
If I were just throwing this together, I'd probably just go with Method 1, but the size of my dataset and the the size of the set of datasets are both too large to ignore slight efficiency losses, and it is of vital importance that the resulting enumerable is ordered.
Which approach will generate the more efficient query? And is there a more efficient approach that I am yet to consider?
Any solutions presented can make the safe assumption that the table is well indexed.
SkipWhile is not supported for translation to SQL. You need to throw that option away.
The best way to go about this is to create an index on the field you use to range select and then issue a query that is SARGable. where date >= start && date < end is SARGable and can make use of an index.
!start.HasValue || is not a good idea because that destroys SARGability. Build the query so that this is not needed. For example:
if(start != null) query = query.Where(...);
Make the index covering and you have optimal performance. There is no single extra row that needs to be processed.
According to link you can't use SkipWhile without materializing a query, so in 2. case you materialize all entities, then calculate the result.
In 1. scenario you can let sql handle this query and materialize only necessary records, so it's better option.
EDIT:
I wrote sample data, queries made to database:
SELECT
[Project1].[Id] AS [Id],
[Project1].[AddedDate] AS [AddedDate],
[Project1].[SendDate] AS [SendDate]
FROM ( SELECT
[Extent1].[Id] AS [Id],
[Extent1].[AddedDate] AS [AddedDate],
[Extent1].[SendDate] AS [SendDate]
FROM [dbo].[Alerts] AS [Extent1]
WHERE ([Extent1].[AddedDate] >= #p__linq__0) AND ([Extent1].[AddedDate] <= #p__linq__1)
) AS [Project1]
ORDER BY [Project1].[AddedDate] ASC
SELECT
[Extent1].[Id] AS [Id],
[Extent1].[AddedDate] AS [AddedDate],
[Extent1].[SendDate] AS [SendDate]
FROM [dbo].[Alerts] AS [Extent1]
ORDER BY [Extent1].[AddedDate] ASC
I inserted 1 000 000 records, and wrote query with expected 1 row in result. In 1 case query time was 291 ms and instant materialize. In second case query time was 1065 ms and I had to wait about 10 second to materialize result;
I am running this query :
List<RerocdByGeo> reports = (from p in db.SOMETABLE.ToList()
where (p.colID == prog && p.status != "some_string" && p.col_date < enddate && p.col_date > startdate)
group p by new {
country = (
(p.some_integer_that_represents_an_id <= -0) ? "unknown" : (from f in db.A_LAGE_TABLE where (f.ID == p.some_integer_that_represents_and_id) select f.COUNTRIES_TABLE.COU_Name).FirstOrDefault()),
p.status }
into g
select new TransRerocdByGeo
{
ColA = g.Sum(x => x.ColA),
ColB = g.Sum(x => x.ColB),
Percentage = (g.Sum(x => x.ColA) != null && g.Sum(x => x.ColA) != 0) ? (g.Sum(x => x.ColB) / g.Sum(x => x.ColA)) * 100 : 0,
Status = g.Key.status,
Country = g.Key.country
}).ToList();
a similar query in sql for the same database would run for a few seconds while this one takes about 30-60 seconds in the good case...
the table SOMETABLE contains abount 10-60 K rows
and the table called here A_LARGE_TABLE contains about 10-20 mill rows
the coulmn some_inteher_that_reoresents_an_id is the id on the large table but can also be 0 or -1 and than needs to get the "unknown" value, so i cannot make a relationship (or can i ? if so please explain)
the COUNTRIES_TABLE contains 100-200 rows.
the coulID and ID are identity columns ...
any suggestions ?
You're calling ToList on "SOMETABLE" right at the start. This is pulling the entire database table, with all rows and all columns, into memory and then performing all of the subsequent operations via Linq-to-objects on that in-memory data structure.
Not only are you suffering the penalty of transferring way more information across the network than you need to (which is slow), but C# isn't able to perform the operations nearly as efficiently as a database. That's partly because it looses access to any indexes, any database caching, any cached compiled queries, it isn't as efficient at dealing with data sets that large to begin with, and any higher level optimizations of the query itself (databases tend to do a lot of that).
Next, you have a query inside of your GroupBy clause from f in db.A_LAGE_TABLE where [...] that is performed for every row in the sequence. If the entire query is evaluated at the database level that would potentially be optimized, and even if it's not you're not going across the network to pass information (which is quite slow) for each record.
from p in db.SOMETABLE.ToList()
This basically says "get every record from SOMETABLE and put it in a List", without filtering at all. This is probably your problem.
I have the following code, which is misbehaving:
TPM_USER user = UserManager.GetUser(context, UserId);
var tasks = (from t in user.TPM_TASK
where t.STAGEID > 0 && t.STAGEID != 3 && t.TPM_PROJECTVERSION.STAGEID <= 10
orderby t.DUEDATE, t.PROJECTID
select t);
The first line, UserManager.GetUser just does a simple lookup in the database to get the correct TPM_USER record. However, the second line causes all sorts of SQL chaos.
First off, it's executing two SQL statements here. The first one grabs every single row in TPM_TASK which is linked to that user, which is sometimes tens of thousands of rows:
SELECT
-- Columns
FROM TPMDBO.TPM_USERTASKS "Extent1"
INNER JOIN TPMDBO.TPM_TASK "Extent2" ON "Extent1".TASKID = "Extent2".TASKID
WHERE "Extent1".USERID = :EntityKeyValue1
This query takes about 18 seconds on users with lots of tasks. I would expect the WHERE clause to contain the STAGEID filters too, which would remove the majority of the rows.
Next, it seems to execute a new query for each TPM_PROJECTVERSION pair in the list above:
SELECT
-- Columns
FROM TPMDBO.TPM_PROJECTVERSION "Extent1"
WHERE ("Extent1".PROJECTID = :EntityKeyValue1) AND ("Extent1".VERSIONID = :EntityKeyValue2)
Even though this query is fast, it's executed several hundred times if the user has tasks in a whole bunch of projects.
The query I would like to generate would look something like:
SELECT
-- Columns
FROM TPMDBO.TPM_USERTASKS "Extent1"
INNER JOIN TPMDBO.TPM_TASK "Extent2" ON "Extent1".TASKID = "Extent2".TASKID
INNER JOIN TPMDBO.TPM_PROJECTVERSION "Extent3" ON "Extent2".PROJECTID = "Extent3".PROJECTID AND "Extent2".VERSIONID = "Extent3".VERSIONID
WHERE "Extent1".USERID = 5 and "Extent2".STAGEID > 0 and "Extent2".STAGEID <> 3 and "Extent3".STAGEID <= 10
The query above would run in about 1 second. Normally, I could specify that JOIN using the Include method. However, this doesn't seem to work on properties. In other words, I can't do:
from t in user.TPM_TASK.Include("TPM_PROJECTVERSION")
Is there any way to optimize this LINQ statement? I'm using .NET4 and Oracle as the backend DB.
Solution:
This solution is based on Kirk's suggestions below, and works since context.TPM_USERTASK cannot be queried directly:
var tasks = (from t in context.TPM_TASK.Include("TPM_PROJECTVERSION")
where t.TPM_USER.Any(y => y.USERID == UserId) &&
t.STAGEID > 0 && t.STAGEID != 3 && t.TPM_PROJECTVERSION.STAGEID <= 10
orderby t.DUEDATE, t.PROJECTID
select t);
It does result in a nested SELECT rather than querying TPM_USERTASK directly, but it seems fairly efficient none-the-less.
Yes, you are pulling down a specific user, and then referencing the relationship TPM_TASK. That it is pulling down every task attached to that user is exactly what it's supposed to be doing. There's no ORM SQL translation when you're doing it this way. You're getting a user, then getting all his tasks into memory, and then performing some client-side filtering. This is all done using lazy-loading, so the SQL is going to be exceptionally inefficient as it can't batch anything up.
Instead, rewrite your query to go directly against TPM_TASK and filter against the user:
var tasks = (from t in context.TPM_TASK
where t.USERID == user.UserId && t.STAGEID > 0 && t.STAGEID != 3 && t.TPM_PROJECTVERSION.STAGEID <= 10
orderby t.DUEDATE, t.PROJECTID
select t);
Note how we're checking t.USERID == user.UserId. This produces the same effect as user.TPM_TASK but now all the heavy lifting is done by the database rather than in memory.
I have the following LINQ query, that is returning the results that I expect, but it does not "feel" right.
Basically it is a left join. I need ALL records from the UserProfile table.
Then the LastWinnerDate is a single record from the winner table (possible multiple records) indicating the DateTime the last record was entered in that table for the user.
WinnerCount is the number of records for the user in the winner table (possible multiple records).
Video1 is basically a bool indicating there is, or is not a record for the user in the winner table matching on a third table Objective (should be 1 or 0 rows).
Quiz1 is same as Video 1 matching another record from Objective Table (should be 1 or 0 rows).
Video and Quiz is repeated 12 times because it is for a report to be displayed to a user listing all user records and indicate if they have met the objectives.
var objectiveIds = new List<int>();
objectiveIds.AddRange(GetObjectiveIds(objectiveName, false));
var q =
from up in MetaData.UserProfile
select new RankingDTO
{
UserId = up.UserID,
FirstName = up.FirstName,
LastName = up.LastName,
LastWinnerDate = (
from winner in MetaData.Winner
where objectiveIds.Contains(winner.ObjectiveID)
where winner.Active
where winner.UserID == up.UserID
orderby winner.CreatedOn descending
select winner.CreatedOn).First(),
WinnerCount = (
from winner in MetaData.Winner
where objectiveIds.Contains(winner.ObjectiveID)
where winner.Active
where winner.UserID == up.UserID
orderby winner.CreatedOn descending
select winner).Count(),
Video1 = (
from winner in MetaData.Winner
join o in MetaData.Objective on winner.ObjectiveID equals o.ObjectiveID
where o.ObjectiveNm == Constants.Promotions.SecVideo1
where winner.Active
where winner.UserID == up.UserID
select winner).Count(),
Quiz1 = (
from winner2 in MetaData.Winner
join o2 in MetaData.Objective on winner2.ObjectiveID equals o2.ObjectiveID
where o2.ObjectiveNm == Constants.Promotions.SecQuiz1
where winner2.Active
where winner2.UserID == up.UserID
select winner2).Count(),
};
You're repeating join winners table part several times. In order to avoid it you can break it into several consequent Selects. So instead of having one huge select, you can make two selects with lesser code. In your example I would first of all select winner2 variable before selecting other result properties:
var q1 =
from up in MetaData.UserProfile
select new {up,
winners = from winner in MetaData.Winner
where winner.Active
where winner.UserID == up.UserID
select winner};
var q = from upWinnerPair in q1
select new RankingDTO
{
UserId = upWinnerPair.up.UserID,
FirstName = upWinnerPair.up.FirstName,
LastName = upWinnerPair.up.LastName,
LastWinnerDate = /* Here you will have more simple and less repeatable code
using winners collection from "upWinnerPair.winners"*/
The query itself is pretty simple: just a main outer query and a series of subselects to retrieve actual column data. While it's not the most efficient means of querying the data you're after (joins and using windowing functions will likely get you better performance), it's the only real way to represent that query using either the query or expression syntax (windowing functions in SQL have no mapping in LINQ or the LINQ-supporting extension methods).
Note that you aren't doing any actual outer joins (left or right) in your code; you're creating subqueries to retrieve the column data. It might be worth looking at the actual SQL being generated by your query. You don't specify which ORM you're using (which would determine how to examine it client-side) or which database you're using (which would determine how to examine it server-side).
If you're using the ADO.NET Entity Framework, you can cast your query to an ObjectQuery and call ToTraceString().
If you're using SQL Server, you can use SQL Server Profiler (assuming you have access to it) to view the SQL being executed, or you can run a trace manually to do the same thing.
To perform an outer join in LINQ query syntax, do this:
Assuming we have two sources alpha and beta, each having a common Id property, you can select from alpha and perform a left join on beta in this way:
from a in alpha
join btemp in beta on a.Id equals btemp.Id into bleft
from b in bleft.DefaultIfEmpty()
select new { IdA = a.Id, IdB = b.Id }
Admittedly, the syntax is a little oblique. Nonetheless, it works and will be translated into something like this in SQL:
select
a.Id as IdA,
b.Id as Idb
from alpha a
left join beta b on a.Id = b.Id
It looks fine to me, though I could see why the multiple sub-queries could trigger inefficiency worries in the eyes of a coder.
Take a look at what SQL is produced though (I'm guessing you're running this against a database source from your saying "table" above), before you start worrying about that. The query providers can be pretty good at producing nice efficient SQL that in turn produces a good underlying database query, and if that's happening, then happy days (it will also give you another view on being sure of the correctness).
Anybody know how to write a LINQ to SQL statement to return every nth row from a table? I'm needing to get the title of the item at the top of each page in a paged data grid back for fast user scanning. So if i wanted the first record, then every 3rd one after that, from the following names:
Amy, Eric, Jason, Joe, John, Josh, Maribel, Paul, Steve, Tom
I'd get Amy, Joe, Maribel, and Tom.
I suspect this can be done... LINQ to SQL statements already invoke the ROW_NUMBER() SQL function in conjunction with sorting and paging. I just don't know how to get back every nth item. The SQL Statement would be something like WHERE ROW_NUMBER MOD 3 = 0, but I don't know the LINQ statement to use to get the right SQL.
Sometimes, TSQL is the way to go. I would use ExecuteQuery<T> here:
var data = db.ExecuteQuery<SomeObjectType>(#"
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS [__row]
FROM [YourTable]) x WHERE (x.__row % 25) = 1");
You could also swap out the n:
var data = db.ExecuteQuery<SomeObjectType>(#"
DECLARE #n int = 2
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS [__row]
FROM [YourTable]) x WHERE (x.__row % #n) = 1", n);
Once upon a time, there was no such thing as Row_Number, and yet such queries were possible. Behold!
var query =
from c in db.Customers
let i = (
from c2 in db.Customers
where c2.ID < c.ID
select c2).Count()
where i%3 == 0
select c;
This generates the following Sql
SELECT [t2].[ID], [t2]. --(more fields)
FROM (
SELECT [t0].[ID], [t0]. --(more fields)
(
SELECT COUNT(*)
FROM [dbo].[Customer] AS [t1]
WHERE [t1].[ID] < [t0].[ID]
) AS [value]
FROM [dbo].[Customer] AS [t0]
) AS [t2]
WHERE ([t2].[value] % #p0) = #p1
Here's an option that works, but it might be worth checking that it doesn't have any performance issues in practice:
var nth = 3;
var ids = Table
.Select(x => x.Id)
.ToArray()
.Where((x, n) => n % nth == 0)
.ToArray();
var nthRecords = Table
.Where(x => ids.Contains(x.Id));
Just googling around a bit I haven't found (or experienced) an option for Linq to SQL to directly support this.
The only option I can offer is that you write a stored procedure with the appropriate SQL query written out and then calling the sproc via Linq to SQL. Not the best solution, especially if you have any kind of complex filtering going on.
There really doesn't seem to be an easy way to do this:
How do I add ROW_NUMBER to a LINQ query or Entity?
How to find the ROW_NUMBER() of a row with Linq to SQL
But there's always:
peopleToFilter.AsEnumerable().Where((x,i) => i % AmountToSkipBy == 0)
NOTE: This still doesn't execute on the database side of things!
This will do the trick, but it isn't the most efficient query in the world:
var count = query.Count();
var pageSize = 10;
var pageTops = query.Take(1);
for(int i = pageSize; i < count; i += pageSize)
{
pageTops = pageTops.Concat(query.Skip(i - (i % pageSize)).Take(1));
}
return pageTops;
It dynamically constructs a query to pull the (nth, 2*nth, 3*nth, etc) value from the given query. If you use this technique, you'll probably want to create a limit of maybe ten or twenty names, similar to how Google results page (1-10, and Next), in order to avoid getting an expression so large the database refuses to attempt to parse it.
If you need better performance, you'll probably have to use a stored procedure or a view to represent your query, and include the row number as part of the stored proc results or the view's fields.