I have a question about Entity Framework query execution performance.
Schema:
I have a table structure like this:
CREATE TABLE [dbo].[DataLogger]
(
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[ProjectID] [bigint] NULL,
CONSTRAINT [PrimaryKey1] PRIMARY KEY CLUSTERED ( [ID] ASC )
)
CREATE TABLE [dbo].[DCDistributionBox]
(
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[DataLoggerID] [bigint] NOT NULL,
CONSTRAINT [PrimaryKey2] PRIMARY KEY CLUSTERED ( [ID] ASC )
)
ALTER TABLE [dbo].[DCDistributionBox]
ADD CONSTRAINT [FK_DCDistributionBox_DataLogger]
FOREIGN KEY([DataLoggerID]) REFERENCES [dbo].[DataLogger] ([ID])
CREATE TABLE [dbo].[DCString]
(
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[DCDistributionBoxID] [bigint] NOT NULL,
[CurrentMPP] [decimal](18, 2) NULL,
CONSTRAINT [PrimaryKey3] PRIMARY KEY CLUSTERED ( [ID] ASC )
)
ALTER TABLE [dbo].[DCString]
ADD CONSTRAINT [FK_DCString_DCDistributionBox]
FOREIGN KEY([DCDistributionBoxID]) REFERENCES [dbo].[DCDistributionBox] ([ID])
CREATE TABLE [dbo].[StringData]
(
[DCStringID] [bigint] NOT NULL,
[TimeStamp] [datetime] NOT NULL,
[DCCurrent] [decimal](18, 2) NULL,
CONSTRAINT [PrimaryKey4] PRIMARY KEY CLUSTERED ( [TimeStamp] DESC, [DCStringID] ASC)
)
CREATE NONCLUSTERED INDEX [TimeStamp_DCCurrent-NonClusteredIndex]
ON [dbo].[StringData] ([DCStringID] ASC, [TimeStamp] ASC)
INCLUDE ([DCCurrent])
Standard indexes on the foreign keys also exist (I don't want to list them all for space reasons).
The [StringData] table as has following storage stats:
Data space: 26,901.86 MB
Row count: 131,827,749
Partitioned: true
Partition count: 62
Usage:
I now want to group the data in the [StringData] table and do some aggregation.
I created an Entity Framework query (detailed infos to the query can be found here):
var compareData = model.StringDatas
.AsNoTracking()
.Where(p => p.DCString.DCDistributionBox.DataLogger.ProjectID == projectID && p.TimeStamp >= fromDate && p.TimeStamp < tillDate)
.Select(d => new
{
TimeStamp = d.TimeStamp,
DCCurrentMpp = d.DCCurrent / d.DCString.CurrentMPP
})
.GroupBy(d => DbFunctions.AddMinutes(DateTime.MinValue, DbFunctions.DiffMinutes(DateTime.MinValue, d.TimeStamp) / minuteInterval * minuteInterval))
.Select(d => new
{
TimeStamp = d.Key,
DCCurrentMppMin = d.Min(v => v.DCCurrentMpp),
DCCurrentMppMax = d.Max(v => v.DCCurrentMpp),
DCCurrentMppAvg = d.Average(v => v.DCCurrentMpp),
DCCurrentMppStDev = DbFunctions.StandardDeviationP(d.Select(v => v.DCCurrentMpp))
})
.ToList();
The excecution timespan is exceptional long!?
Execution result: 92rows
Execution time: ~16000ms
Attempts:
I now took a look into the Entity Framework generated SQL query and looks like this:
DECLARE #p__linq__4 DATETIME = 0;
DECLARE #p__linq__3 DATETIME = 0;
DECLARE #p__linq__5 INT = 15;
DECLARE #p__linq__6 INT = 15;
DECLARE #p__linq__0 BIGINT = 20827;
DECLARE #p__linq__1 DATETIME = '06.02.2016 00:00:00';
DECLARE #p__linq__2 DATETIME = '07.02.2016 00:00:00';
SELECT
1 AS [C1],
[GroupBy1].[K1] AS [C2],
[GroupBy1].[A1] AS [C3],
[GroupBy1].[A2] AS [C4],
[GroupBy1].[A3] AS [C5],
[GroupBy1].[A4] AS [C6]
FROM ( SELECT
[Project1].[K1] AS [K1],
MIN([Project1].[A1]) AS [A1],
MAX([Project1].[A2]) AS [A2],
AVG([Project1].[A3]) AS [A3],
STDEVP([Project1].[A4]) AS [A4]
FROM ( SELECT
DATEADD (minute, ((DATEDIFF (minute, #p__linq__4, [Project1].[TimeStamp])) / #p__linq__5) * #p__linq__6, #p__linq__3) AS [K1],
[Project1].[C1] AS [A1],
[Project1].[C1] AS [A2],
[Project1].[C1] AS [A3],
[Project1].[C1] AS [A4]
FROM ( SELECT
[Extent1].[TimeStamp] AS [TimeStamp],
[Extent1].[DCCurrent] / [Extent2].[CurrentMPP] AS [C1]
FROM [dbo].[StringData] AS [Extent1]
INNER JOIN [dbo].[DCString] AS [Extent2] ON [Extent1].[DCStringID] = [Extent2].[ID]
INNER JOIN [dbo].[DCDistributionBox] AS [Extent3] ON [Extent2].[DCDistributionBoxID] = [Extent3].[ID]
INNER JOIN [dbo].[DataLogger] AS [Extent4] ON [Extent3].[DataLoggerID] = [Extent4].[ID]
WHERE (([Extent4].[ProjectID] = #p__linq__0) OR (([Extent4].[ProjectID] IS NULL) AND (#p__linq__0 IS NULL))) AND ([Extent1].[TimeStamp] >= #p__linq__1) AND ([Extent1].[TimeStamp] < #p__linq__2)
) AS [Project1]
) AS [Project1]
GROUP BY [K1]
) AS [GroupBy1]
I copied this SQL query into SSMS on the same machine, connected with same connection string as the Entity Framework.
The result is a very much improved performance:
Execution result: 92rows
Execution time: 517ms
I also do some loop runing test and the result is strange. The test looks like this
for (int i = 0; i < 50; i++)
{
DateTime begin = DateTime.UtcNow;
[...query...]
TimeSpan excecutionTimeSpan = DateTime.UtcNow - begin;
Debug.WriteLine("{0}th run: {1}", i, excecutionTimeSpan.ToString());
}
The result is very different and looks random(?):
0th run: 00:00:11.0618580
1th run: 00:00:11.3339467
2th run: 00:00:10.0000676
3th run: 00:00:10.1508140
4th run: 00:00:09.2041939
5th run: 00:00:07.6710321
6th run: 00:00:10.3386312
7th run: 00:00:17.3422765
8th run: 00:00:13.8620557
9th run: 00:00:14.9041528
10th run: 00:00:12.7772906
11th run: 00:00:17.0170235
12th run: 00:00:14.7773750
Question:
Why is Entity Framework query execution so slow? The resulting row count is really low and the raw SQL query shows a very fast performance.
Update 1:
I take care that its not a MetaContext or Model creation delay. Some other queries are executed on the same Model instance right before with good performance.
Update 2 (related to the answer of #x0007me):
Thanks for the hint but this can be eliminated by changing the model settings like this:
modelContext.Configuration.UseDatabaseNullSemantics = true;
The EF generated SQL is now:
SELECT
1 AS [C1],
[GroupBy1].[K1] AS [C2],
[GroupBy1].[A1] AS [C3],
[GroupBy1].[A2] AS [C4],
[GroupBy1].[A3] AS [C5],
[GroupBy1].[A4] AS [C6]
FROM ( SELECT
[Project1].[K1] AS [K1],
MIN([Project1].[A1]) AS [A1],
MAX([Project1].[A2]) AS [A2],
AVG([Project1].[A3]) AS [A3],
STDEVP([Project1].[A4]) AS [A4]
FROM ( SELECT
DATEADD (minute, ((DATEDIFF (minute, #p__linq__4, [Project1].[TimeStamp])) / #p__linq__5) * #p__linq__6, #p__linq__3) AS [K1],
[Project1].[C1] AS [A1],
[Project1].[C1] AS [A2],
[Project1].[C1] AS [A3],
[Project1].[C1] AS [A4]
FROM ( SELECT
[Extent1].[TimeStamp] AS [TimeStamp],
[Extent1].[DCCurrent] / [Extent2].[CurrentMPP] AS [C1]
FROM [dbo].[StringData] AS [Extent1]
INNER JOIN [dbo].[DCString] AS [Extent2] ON [Extent1].[DCStringID] = [Extent2].[ID]
INNER JOIN [dbo].[DCDistributionBox] AS [Extent3] ON [Extent2].[DCDistributionBoxID] = [Extent3].[ID]
INNER JOIN [dbo].[DataLogger] AS [Extent4] ON [Extent3].[DataLoggerID] = [Extent4].[ID]
WHERE ([Extent4].[ProjectID] = #p__linq__0) AND ([Extent1].[TimeStamp] >= #p__linq__1) AND ([Extent1].[TimeStamp] < #p__linq__2)
) AS [Project1]
) AS [Project1]
GROUP BY [K1]
) AS [GroupBy1]
So you can see the problem you described is now solved, but the execution time does not change.
Also, as you can see in the schema and the raw execution time, I used optimized structure with high optimized indexer.
Update 3 (related to the answer of #Vladimir Baranov):
I don't see why this can be related to query plan caching. Because in the MSDN is clearly descripted that the EF6 make use of query plan caching.
A simple test proof that the huge excecution time differenz is not related to the query plan caching (phseudo code):
using(var modelContext = new ModelContext())
{
modelContext.Query(); //1th run activates caching
modelContext.Query(); //2th used cached plan
}
As the result, both queries run with the same excecution time.
Update 4 (related to the answer of #bubi):
I tried to run the query that is generated by the EF as you descripted it:
int result = model.Database.ExecuteSqlCommand(#"SELECT
1 AS [C1],
[GroupBy1].[K1] AS [C2],
[GroupBy1].[A1] AS [C3],
[GroupBy1].[A2] AS [C4],
[GroupBy1].[A3] AS [C5],
[GroupBy1].[A4] AS [C6]
FROM ( SELECT
[Project1].[K1] AS [K1],
MIN([Project1].[A1]) AS [A1],
MAX([Project1].[A2]) AS [A2],
AVG([Project1].[A3]) AS [A3],
STDEVP([Project1].[A4]) AS [A4]
FROM ( SELECT
DATEADD (minute, ((DATEDIFF (minute, 0, [Project1].[TimeStamp])) / #p__linq__5) * #p__linq__6, 0) AS [K1],
[Project1].[C1] AS [A1],
[Project1].[C1] AS [A2],
[Project1].[C1] AS [A3],
[Project1].[C1] AS [A4]
FROM ( SELECT
[Extent1].[TimeStamp] AS [TimeStamp],
[Extent1].[DCCurrent] / [Extent2].[CurrentMPP] AS [C1]
FROM [dbo].[StringData] AS [Extent1]
INNER JOIN [dbo].[DCString] AS [Extent2] ON [Extent1].[DCStringID] = [Extent2].[ID]
INNER JOIN [dbo].[DCDistributionBox] AS [Extent3] ON [Extent2].[DCDistributionBoxID] = [Extent3].[ID]
INNER JOIN [dbo].[DataLogger] AS [Extent4] ON [Extent3].[DataLoggerID] = [Extent4].[ID]
WHERE ([Extent4].[ProjectID] = #p__linq__0) AND ([Extent1].[TimeStamp] >= #p__linq__1) AND ([Extent1].[TimeStamp] < #p__linq__2)
) AS [Project1]
) AS [Project1]
GROUP BY [K1]
) AS [GroupBy1]",
new SqlParameter("p__linq__0", 20827),
new SqlParameter("p__linq__1", fromDate),
new SqlParameter("p__linq__2", tillDate),
new SqlParameter("p__linq__5", 15),
new SqlParameter("p__linq__6", 15));
Execution result: 92
Execution time: ~16000ms
It took exact as long as the normal EF query!?
Update 5 (related to the answer of #vittore):
I create a traced call tree, maybe it helps:
Update 6 (related to the answer of #usr):
I created two showplan XML via SQL Server Profiler.
Fast run (SSMS).SQLPlan
Slow run (EF).SQLPlan
Update 7 (related to the comments of #VladimirBaranov):
I now run some more test case related to your comments.
First I eleminate time taking order operations by using a new computed column and a matching INDEXER. This reduce the perfomance lag related to DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [TimeStamp] ) / 15* 15, 0). Detail for how and why you can find here.
The Result look s like this:
Pure EntityFramework query:
for (int i = 0; i < 3; i++)
{
DateTime begin = DateTime.UtcNow;
var result = model.StringDatas
.AsNoTracking()
.Where(p => p.DCString.DCDistributionBox.DataLogger.ProjectID == projectID && p.TimeStamp15Minutes >= fromDate && p.TimeStamp15Minutes < tillDate)
.Select(d => new
{
TimeStamp = d.TimeStamp15Minutes,
DCCurrentMpp = d.DCCurrent / d.DCString.CurrentMPP
})
.GroupBy(d => d.TimeStamp)
.Select(d => new
{
TimeStamp = d.Key,
DCCurrentMppMin = d.Min(v => v.DCCurrentMpp),
DCCurrentMppMax = d.Max(v => v.DCCurrentMpp),
DCCurrentMppAvg = d.Average(v => v.DCCurrentMpp),
DCCurrentMppStDev = DbFunctions.StandardDeviationP(d.Select(v => v.DCCurrentMpp))
})
.ToList();
TimeSpan excecutionTimeSpan = DateTime.UtcNow - begin;
Debug.WriteLine("{0}th run pure EF: {1}", i, excecutionTimeSpan.ToString());
}
0th run pure EF: 00:00:12.6460624
1th run pure EF: 00:00:11.0258393
2th run pure EF: 00:00:08.4171044
I now used the EF generated SQL as a SQL query:
for (int i = 0; i < 3; i++)
{
DateTime begin = DateTime.UtcNow;
int result = model.Database.ExecuteSqlCommand(#"SELECT
1 AS [C1],
[GroupBy1].[K1] AS [TimeStamp15Minutes],
[GroupBy1].[A1] AS [C2],
[GroupBy1].[A2] AS [C3],
[GroupBy1].[A3] AS [C4],
[GroupBy1].[A4] AS [C5]
FROM ( SELECT
[Project1].[TimeStamp15Minutes] AS [K1],
MIN([Project1].[C1]) AS [A1],
MAX([Project1].[C1]) AS [A2],
AVG([Project1].[C1]) AS [A3],
STDEVP([Project1].[C1]) AS [A4]
FROM ( SELECT
[Extent1].[TimeStamp15Minutes] AS [TimeStamp15Minutes],
[Extent1].[DCCurrent] / [Extent2].[CurrentMPP] AS [C1]
FROM [dbo].[StringData] AS [Extent1]
INNER JOIN [dbo].[DCString] AS [Extent2] ON [Extent1].[DCStringID] = [Extent2].[ID]
INNER JOIN [dbo].[DCDistributionBox] AS [Extent3] ON [Extent2].[DCDistributionBoxID] = [Extent3].[ID]
INNER JOIN [dbo].[DataLogger] AS [Extent4] ON [Extent3].[DataLoggerID] = [Extent4].[ID]
WHERE ([Extent4].[ProjectID] = #p__linq__0) AND ([Extent1].[TimeStamp15Minutes] >= #p__linq__1) AND ([Extent1].[TimeStamp15Minutes] < #p__linq__2)
) AS [Project1]
GROUP BY [Project1].[TimeStamp15Minutes]
) AS [GroupBy1];",
new SqlParameter("p__linq__0", 20827),
new SqlParameter("p__linq__1", fromDate),
new SqlParameter("p__linq__2", tillDate));
TimeSpan excecutionTimeSpan = DateTime.UtcNow - begin;
Debug.WriteLine("{0}th run: {1}", i, excecutionTimeSpan.ToString());
}
0th run: 00:00:00.8381200
1th run: 00:00:00.6920736
2th run: 00:00:00.7081006
and with OPTION(RECOMPILE):
for (int i = 0; i < 3; i++)
{
DateTime begin = DateTime.UtcNow;
int result = model.Database.ExecuteSqlCommand(#"SELECT
1 AS [C1],
[GroupBy1].[K1] AS [TimeStamp15Minutes],
[GroupBy1].[A1] AS [C2],
[GroupBy1].[A2] AS [C3],
[GroupBy1].[A3] AS [C4],
[GroupBy1].[A4] AS [C5]
FROM ( SELECT
[Project1].[TimeStamp15Minutes] AS [K1],
MIN([Project1].[C1]) AS [A1],
MAX([Project1].[C1]) AS [A2],
AVG([Project1].[C1]) AS [A3],
STDEVP([Project1].[C1]) AS [A4]
FROM ( SELECT
[Extent1].[TimeStamp15Minutes] AS [TimeStamp15Minutes],
[Extent1].[DCCurrent] / [Extent2].[CurrentMPP] AS [C1]
FROM [dbo].[StringData] AS [Extent1]
INNER JOIN [dbo].[DCString] AS [Extent2] ON [Extent1].[DCStringID] = [Extent2].[ID]
INNER JOIN [dbo].[DCDistributionBox] AS [Extent3] ON [Extent2].[DCDistributionBoxID] = [Extent3].[ID]
INNER JOIN [dbo].[DataLogger] AS [Extent4] ON [Extent3].[DataLoggerID] = [Extent4].[ID]
WHERE ([Extent4].[ProjectID] = #p__linq__0) AND ([Extent1].[TimeStamp15Minutes] >= #p__linq__1) AND ([Extent1].[TimeStamp15Minutes] < #p__linq__2)
) AS [Project1]
GROUP BY [Project1].[TimeStamp15Minutes]
) AS [GroupBy1]
OPTION(RECOMPILE);",
new SqlParameter("p__linq__0", 20827),
new SqlParameter("p__linq__1", fromDate),
new SqlParameter("p__linq__2", tillDate));
TimeSpan excecutionTimeSpan = DateTime.UtcNow - begin;
Debug.WriteLine("{0}th run: {1}", i, excecutionTimeSpan.ToString());
}
0th run with RECOMPILE: 00:00:00.8260932
1th run with RECOMPILE: 00:00:00.9139730
2th run with RECOMPILE: 00:00:01.0680665
Same SQL query excecuted in SSMS (without RECOMPILE):
00:00:01.105
Same SQL query excecuted in SSMS (with RECOMPILE):
00:00:00.902
I hope this are all values you needed.
In this answer I'm focusing on the original observation: the query generated by EF is slow, but when the same query is run in SSMS it is fast.
One possible explanation of this behaviour is Parameter sniffing.
SQL Server uses a process called parameter sniffing when it executes
stored procedures that have parameters. When the
procedure is compiled or recompiled, the value passed into the
parameter is evaluated and used to create an execution plan. That
value is then stored with the execution plan in the plan cache. On
subsequent executions, that same value – and same plan – is used.
So, EF generates a query that has few parameters. The first time you run this query the server creates an execution plan for this query using values of parameters that were in effect in the first run. That plan is usually pretty good. But, later on you run the same EF query using other values for parameters. It is possible that for new values of parameters the previously generated plan is not optimal and the query becomes slow. The server keeps using the previous plan, because it is still the same query, just values of parameters are different.
If at this moment you take the query text and try to run it directly in SSMS the server will create a new execution plan, because technically it is not the same query that is issued by EF application. Even one character difference is enough, any change in the session settings is also enough for the server to treat the query as a new one. As a result the server has two plans for the seemingly same query in its cache. The first "slow" plan is slow for the new values of parameters, because it was originally built for different parameter values. The second "fast" plan is built for the current parameter values, so it is fast.
The article Slow in the Application, Fast in SSMS by Erland Sommarskog explains this and other related areas in much more details.
There are several ways to discard cached plans and force the server to regenerate them. Changing the table or changing the table indexes should do it - it should discard all plans that are related to this table, both "slow" and "fast". Then you run the query in EF application with new values of parameters and get a new "fast" plan. You run the query in SSMS and get a second "fast" plan with new values of parameters. The server still generates two plans, but both plans are fast now.
Another variant is adding OPTION(RECOMPILE) to the query. With this option the server would not store the generated plan in its cache. So, every time the query runs the server would use actual parameter values to generate the plan that (it thinks) would be optimal for the given parameter values. The downside is an added overhead of the plan generation.
Mind you, the server still could choose a "bad" plan with this option due to outdated statistics, for example. But, at least, parameter sniffing would not be a problem.
Those who wonder how to add OPTION (RECOMPILE) hint to the query that is generated by EF have a look at this answer:
https://stackoverflow.com/a/26762756/4116017
I know I'm a bit late here, but since I've participated in the building of the query in question, I feel obliged to take some action.
The general problem I see with Linq to Entities queries is that the typical way we build them introduces unnecessary parameters, which may affect the cached database query plan (so called Sql Server parameter sniffing problem).
Let take a look at your query group by expression
d => DbFunctions.AddMinutes(DateTime.MinValue, DbFunctions.DiffMinutes(DateTime.MinValue, d.TimeStamp) / minuteInterval * minuteInterval)
Since minuteInterval is a variable (i.e. non constant), it introduces a parameter. Same for DateTime.MinValue (note that the primitive types expose similar things as constants, but for DateTime, decimal etc. they are static readonly fields which makes a big diference how they are treated inside the expressions).
But regardless of how it's represented in the CLR system, DateTime.MinValue logically is a constant. What about minuteInterval, it depends on your usage.
My attempt to solve the issue would be to eliminate all the parameters related to that expression. Since we cannot do that with compiler generated expression, we need to build it manually using System.Linq.Expressions. The later is not intuitive, but fortunately we can use a hybrid approach.
First, we need a helper method which allows us to replace expression parameters:
public static class ExpressionUtils
{
public static Expression ReplaceParemeter(this Expression expression, ParameterExpression source, Expression target)
{
return new ParameterReplacer { Source = source, Target = target }.Visit(expression);
}
class ParameterReplacer : ExpressionVisitor
{
public ParameterExpression Source;
public Expression Target;
protected override Expression VisitParameter(ParameterExpression node)
{
return node == Source ? Target : base.VisitParameter(node);
}
}
}
Now we have everything needed. Let encapsulate the logic inside a custom method:
public static class QueryableUtils
{
public static IQueryable<IGrouping<DateTime, T>> GroupBy<T>(this IQueryable<T> source, Expression<Func<T, DateTime>> dateSelector, int minuteInterval)
{
Expression<Func<DateTime, DateTime, int, DateTime>> expr = (date, baseDate, interval) =>
DbFunctions.AddMinutes(baseDate, DbFunctions.DiffMinutes(baseDate, date) / interval).Value;
var selector = Expression.Lambda<Func<T, DateTime>>(
expr.Body
.ReplaceParemeter(expr.Parameters[0], dateSelector.Body)
.ReplaceParemeter(expr.Parameters[1], Expression.Constant(DateTime.MinValue))
.ReplaceParemeter(expr.Parameters[2], Expression.Constant(minuteInterval))
, dateSelector.Parameters[0]
);
return source.GroupBy(selector);
}
}
Finally, replace
.GroupBy(d => DbFunctions.AddMinutes(DateTime.MinValue, DbFunctions.DiffMinutes(DateTime.MinValue, d.TimeStamp) / minuteInterval * minuteInterval))
with
.GroupBy(d => d.TimeStamp, minuteInterval * minuteInterval)
and the generated SQL query would be like this (for minuteInterval = 15):
SELECT
1 AS [C1],
[GroupBy1].[K1] AS [C2],
[GroupBy1].[A1] AS [C3],
[GroupBy1].[A2] AS [C4],
[GroupBy1].[A3] AS [C5],
[GroupBy1].[A4] AS [C6]
FROM ( SELECT
[Project1].[K1] AS [K1],
MIN([Project1].[A1]) AS [A1],
MAX([Project1].[A2]) AS [A2],
AVG([Project1].[A3]) AS [A3],
STDEVP([Project1].[A4]) AS [A4]
FROM ( SELECT
DATEADD (minute, (DATEDIFF (minute, convert(datetime2, '0001-01-01 00:00:00.0000000', 121), [Project1].[TimeStamp])) / 225, convert(datetime2, '0001-01-01 00:00:00.0000000', 121)) AS [K1],
[Project1].[C1] AS [A1],
[Project1].[C1] AS [A2],
[Project1].[C1] AS [A3],
[Project1].[C1] AS [A4]
FROM ( SELECT
[Extent1].[TimeStamp] AS [TimeStamp],
[Extent1].[DCCurrent] / [Extent2].[CurrentMPP] AS [C1]
FROM [dbo].[StringDatas] AS [Extent1]
INNER JOIN [dbo].[DCStrings] AS [Extent2] ON [Extent1].[DCStringID] = [Extent2].[ID]
INNER JOIN [dbo].[DCDistributionBoxes] AS [Extent3] ON [Extent2].[DCDistributionBoxID] = [Extent3].[ID]
INNER JOIN [dbo].[DataLoggers] AS [Extent4] ON [Extent3].[DataLoggerID] = [Extent4].[ID]
WHERE ([Extent4].[ProjectID] = #p__linq__0) AND ([Extent1].[TimeStamp] >= #p__linq__1) AND ([Extent1].[TimeStamp] < #p__linq__2)
) AS [Project1]
) AS [Project1]
GROUP BY [K1]
) AS [GroupBy1]
As you may see, we successfully eliminated some of the query parameters. Will that help? Well, as with any database query tuning, it might or might not. You need to try and see.
The DB engine determines the plan for each query based on how it is called. In case of your EF Linq query, the plan is prepared in such a way that each input parameter is treated as an unknown(since you have no idea what's coming in). In your actual query, you have all you parameters as part of the query so it will run under a different plan than that for a parameterized one. One of the affected piece that I see immediately is
...(#p__linq__0 IS NULL)..
This is FALSE since p_linq_0 = 20827 and is NOT NULL, so your first half of the WHERE is FALSE to begin with and does not need to be looked at any more. In case of LINQ queries, the DB has no idea what's coming in so evaluates everything anyway.
You'll need to see if you can use indices or other techniques to make this run faster.
When EF runs the query, it wraps it and runs it with sp_executesql, which means the execution plan will be cached in the stored procedure execution plan cache. Due to differences (parameter sniffing etc) in how the raw sql statement vs the SP version have their execution plans built, the two can differ.
When running the EF (sp wrapped) version, SQL server is most likely using a more generic execution plan that covers a wider range of timestamps than the values you are actually passing in.
That said, to reduce the chance of SQL server trying something "funny" with hash joins etc, the first things I would do are:
1) Index the columns used in the where clause, and in joins
create index ix_DataLogger_ProjectID on DataLogger (ProjectID);
create index ix_DCDistributionBox_DataLoggerID on DCDistributionBox (DataLoggerID);
create index ix_DCString_DCDistributionBoxID on DCString (DCDistributionBoxID);
2) Do explicit joins in the Linq query to eliminate the or ProductID is null part
I've been fooling around with some LINQ over Entities and I'm getting strange results and I would like to get an explanation...
Given the following LINQ query,
// Sample # 1
IEnumerable<GroupInformation> groupingInfo;
groupingInfo = from a in context.AccountingTransaction
group a by a.Type into grp
select new GroupInformation()
{
GroupName = grp.Key,
GroupCount = grp.Count()
};
I get the following SQL query (taken from SQL Profiler):
SELECT
1 AS [C1],
[GroupBy1].[K1] AS [Type],
[GroupBy1].[A1] AS [C2]
FROM ( SELECT
[Extent1].[Type] AS [K1],
COUNT(1) AS [A1]
FROM [dbo].[AccountingTransaction] AS [Extent1]
GROUP BY [Extent1].[Type]
) AS [GroupBy1]
So far so good.
If I change my LINQ query to:
// Sample # 2
groupingInfo = context.AccountingTransaction.
GroupBy(a => a.Type).
Select(grp => new GroupInformation()
{
GroupName = grp.Key,
GroupCount = grp.Count()
});
it yields to the exact same SQL query. Makes sense to me.
Here comes the interesting part... If I change my LINQ query to:
// Sample # 3
IEnumerable<AccountingTransaction> accounts;
IEnumerable<IGrouping<object, AccountingTransaction>> groups;
IEnumerable<GroupInformation> groupingInfo;
accounts = context.AccountingTransaction;
groups = accounts.GroupBy(a => a.Type);
groupingInfo = groups.Select(grp => new GroupInformation()
{
GroupName = grp.Key,
GroupCount = grp.Count()
});
the following SQL is executed (I stripped a few of the fields from the actual query, but all the fields from the table (~ 15 fields) were included in the query, twice):
SELECT
[Project2].[C1] AS [C1],
[Project2].[Type] AS [Type],
[Project2].[C2] AS [C2],
[Project2].[Id] AS [Id],
[Project2].[TimeStamp] AS [TimeStamp],
-- <snip>
FROM ( SELECT
[Distinct1].[Type] AS [Type],
1 AS [C1],
[Extent2].[Id] AS [Id],
[Extent2].[TimeStamp] AS [TimeStamp],
-- <snip>
CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C2]
FROM (SELECT DISTINCT
[Extent1].[Type] AS [Type]
FROM [dbo].[AccountingTransaction] AS [Extent1] ) AS [Distinct1]
LEFT OUTER JOIN [dbo].[AccountingTransaction] AS [Extent2] ON [Distinct1].[Type] = [Extent2].[Type]
) AS [Project2]
ORDER BY [Project2].[Type] ASC, [Project2].[C2] ASC
Why are the SQLs generated are so different? After all, the exact same code is executed, it's just that sample # 3 is using intermediate variables to get the same job done!
Also, if I do:
Console.WriteLine(groupingInfo.ToString());
for sample # 1 and sample # 2, I get the exact same query that was captured by SQL Profiler, but for sample # 3, I get:
System.Linq.Enumerable+WhereSelectEnumerableIterator`2[System.Linq.IGrouping`2[System.Object,TestLinq.AccountingTransaction],TestLinq.GroupInformation]
What is the difference? Why can't I get the SQL Query generated by LINQ if I split the LINQ query in multiple instructions?
The ulitmate goal is to be able to add operators to the query (Where, OrderBy, etc.) at run-time.
BTW, I've seen this behavior in EF 4.0 and EF 6.0.
Thank you for your help.
The reason is because in your third attempt you're referring to accounts as IEnumerable<AccountingTransaction> which will cause the query to be invoked using Linq-To-Objects (Enumerable.GroupBy and Enumerable.Select)
On the other hand, in your first and second attempts the reference to AccountingTransaction is preserved as IQueryable<AccountingTransaction> and the query will be executed using Linq-To-Entities which will then transform it to the appropriate SQL statement.
I have been experimenting trying to get the following Linq working without joy. I'm convinced that it's right, but that might just be my bad Linq. I originally added this as a answer to a similar question here:
Linq-to-entities - Include() method not loading
But as it's a very old question, and mine is more specific, I figured it would do better as an explicit question.
In the linked question, Alex James gives two interesting solutions, however if you try them and check the SQL, it's horrible.
The example I was working on is:
var theRelease = from release in context.Releases
where release.Name == "Hello World"
select release;
var allProductionVersions = from prodVer in context.ProductionVersions
where prodVer.Status == 1
select prodVer;
var combined = (from release in theRelease
join p in allProductionVersions on release.Id equals p.ReleaseID
select release).Include(release => release.ProductionVersions);
var allProductionsForChosenRelease = combined.ToList();
This follows the simpler of the two examples. Without the include it produces the perfectly respectable sql:
SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name]
FROM [dbo].[Releases] AS [Extent1]
INNER JOIN [dbo].[ProductionVersions] AS [Extent2] ON [Extent1].[Id] = [Extent2].[ReleaseID]
WHERE ('Hello World' = [Extent1].[Name]) AND (1 = [Extent2].[Status])
But with, OMG:
SELECT
[Project1].[Id1] AS [Id],
[Project1].[Id] AS [Id1],
[Project1].[Name] AS [Name],
[Project1].[C1] AS [C1],
[Project1].[Id2] AS [Id2],
[Project1].[Status] AS [Status],
[Project1].[ReleaseID] AS [ReleaseID]
FROM ( SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent2].[Id] AS [Id1],
[Extent3].[Id] AS [Id2],
[Extent3].[Status] AS [Status],
[Extent3].[ReleaseID] AS [ReleaseID],
CASE WHEN ([Extent3].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
FROM [dbo].[Releases] AS [Extent1]
INNER JOIN [dbo].[ProductionVersions] AS [Extent2] ON [Extent1].[Id] = [Extent2].[ReleaseID]
LEFT OUTER JOIN [dbo].[ProductionVersions] AS [Extent3] ON [Extent1].[Id] = [Extent3].[ReleaseID]
WHERE ('Hello World' = [Extent1].[Name]) AND (1 = [Extent2].[Status])
) AS [Project1]
ORDER BY [Project1].[Id1] ASC, [Project1].[Id] ASC, [Project1].[C1] ASC
Total garbage. The key point to note here is the fact that it returns the outer joined version of the table which has not been limited by status=1.
This results in the WRONG data being returned:
Id Id1 Name C1 Id2 Status ReleaseID
2 1 Hello World 1 1 2 1
2 1 Hello World 1 2 1 1
Note that the status of 2 is being returned there, despite our restriction. It simply does not work.
If I have gone wrong somewhere, I would be delighted to find out, as this is making a mockery of Linq. I love the idea, but the execution doesn't seem to be usable at the moment.
Out of curiosity, I tried the LinqToSQL dbml rather than the LinqToEntities edmx that produced the mess above:
SELECT [t0].[Id], [t0].[Name], [t2].[Id] AS [Id2], [t2].[Status], [t2].[ReleaseID], (
SELECT COUNT(*)
FROM [dbo].[ProductionVersions] AS [t3]
WHERE [t3].[ReleaseID] = [t0].[Id]
) AS [value]
FROM [dbo].[Releases] AS [t0]
INNER JOIN [dbo].[ProductionVersions] AS [t1] ON [t0].[Id] = [t1].[ReleaseID]
LEFT OUTER JOIN [dbo].[ProductionVersions] AS [t2] ON [t2].[ReleaseID] = [t0].[Id]
WHERE ([t0].[Name] = #p0) AND ([t1].[Status] = #p1)
ORDER BY [t0].[Id], [t1].[Id], [t2].[Id]
Slightly more compact - weird count clause, but overall same total FAIL.
Please tell me I've missed something obvious, as I really want to like Linq!
Okay, after another evening of head scratching I cracked it.
In LinqToSQL:
using (var context = new TestSQLModelDataContext())
{
context.DeferredLoadingEnabled = false;
DataLoadOptions ds = new DataLoadOptions();
ds.LoadWith<ProductionVersion>(prod => prod.Release);
context.LoadOptions = ds;
var combined = from release in context.Releases
where release.Name == "Hello World"
select from prodVer in release.ProductionVersions
where prodVer.Status == 1
select prodVer;
var allProductionsForChosenRelease = combined.ToList();
}
This produces the much more reasonable SQL:
SELECT [t2].[Id], [t2].[Status], [t2].[ReleaseID], [t0].[Id] AS [Id2], [t0].[Name], (
SELECT COUNT(*)
FROM [dbo].[ProductionVersions] AS [t3]
WHERE ([t3].[Status] = 1) AND ([t3].[ReleaseID] = [t0].[Id])
) AS [value]
FROM [dbo].[Releases] AS [t0]
OUTER APPLY (
SELECT [t1].[Id], [t1].[Status], [t1].[ReleaseID]
FROM [dbo].[ProductionVersions] AS [t1]
WHERE ([t1].[Status] =1) AND ([t1].[ReleaseID] = [t0].[Id])
) AS [t2]
WHERE [t0].[Name] = 'Hello World'
ORDER BY [t0].[Id], [t2].[Id]
Which produces the correct results:
Id Status ReleaseID Id2 Name value
2 1 1 1 Hello World 1
And in LinqToEntities (I couldn't get the Include syntax to work, so I use the quirk where including the desired table in the results links it up correctly):
using (var context = new TestEntities1())
{
var combined = (from release in context.Releases
where release.Name == "Hello World"
select from prodVer in release.ProductionVersions
where prodVer.Status == 1
select new { prodVer, Release =prodVer.Release });
var allProductionsForChosenRelease = combined.ToList();
}
And this produces the SQL:
SELECT
[Project1].[Id] AS [Id],
[Project1].[C1] AS [C1],
[Project1].[Id1] AS [Id1],
[Project1].[Status] AS [Status],
[Project1].[ReleaseID] AS [ReleaseID],
[Project1].[Id2] AS [Id2],
[Project1].[Name] AS [Name]
FROM ( SELECT
[Extent1].[Id] AS [Id],
[Join1].[Id1] AS [Id1],
[Join1].[Status] AS [Status],
[Join1].[ReleaseID] AS [ReleaseID],
[Join1].[Id2] AS [Id2],
[Join1].[Name] AS [Name],
CASE WHEN ([Join1].[Id1] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
FROM [dbo].[Releases] AS [Extent1]
LEFT OUTER JOIN (SELECT [Extent2].[Id] AS [Id1], [Extent2].[Status] AS [Status], [Extent2].[ReleaseID] AS [ReleaseID], [Extent3].[Id] AS [Id2], [Extent3].[Name] AS [Name]
FROM [dbo].[ProductionVersions] AS [Extent2]
INNER JOIN [dbo].[Releases] AS [Extent3] ON [Extent2].[ReleaseID] = [Extent3].[Id] ) AS [Join1] ON ([Extent1].[Id] = [Join1].[ReleaseID]) AND (1 = [Join1].[Status])
WHERE 'Hello World' = [Extent1].[Name]
) AS [Project1]
ORDER BY [Project1].[Id] ASC, [Project1].[C1] ASC
Which is fairly mental, but it does work.
Id C1 Id1 Status ReleaseID Id2 Name
1 1 2 1 1 1 Hello World
All of which leads me to the conclusion that Linq is far from finished. It can be used, but with extreme caution. Use it as a strongly typed and compile time checked, but laborious/error prone, way of writing bad SQL. It's a trade-off. You get more security at the C# end, but man it's a lot harder than writing SQL!
Taking a second look, I now understand the elusive effect of the Include.
Just as in plain SQL, a join in LINQ will repeat results when the right side of the join is the "n" end of a 1-n association.
Let's assume you have one Release with two ProductionVersions. Without the Include, the join will give you two identical Releases, because after all the statement selects releases. Now when you add the Include, EF will not only return two releases, but will also fully populate their ProductionVersions collections.
Looking a bit deeper, in the context's cache, it appears that EF really only materialized just 1 Release and 2ProductionVersions. It's just that the releases are returned twice in the final result set.
In a way, you got what you asked for: give me releases, multiplied by their number of versions. But that's not what you intended to ask.
What you (probably) intended reveals a weak spot in EF's toolbox: we can't Include partial collections. I think you tried to get releases populated with ProductionVersions of Status = 1 only. If possible, you'd rather have done this:
context.Releases.Include(r => r.ProductionVersions.Where(v => v.Status == 1))
.Where(r => r.Name == "Hello World")
But that throws an exception:
The Include path expression must refer to a navigation property defined on the type. Use dotted paths for reference navigation properties and the Select operator for collection navigation properties.
Parameter name: path
This "filtered include" problem has been noted before and until the EF team (or a contributor) decides to grab this issue we have to do with elaborate work-arounds. I described a common one here.
I have the following line:
WorkPlaces.FirstOrDefault()
.WorkSteps.Where(x=>x.Failcodes_Id != null)
.OrderByDescending(x=>x.Timestamp)
.FirstOrDefault()
There are about 10-20 workplaces and each workplace have thousands of worksteps. I would like to get the last workstep for each of the workplaces.
The code above is an example from linqpad because I couldn't believe that the generated sql looks like this:
SELECT TOP (1)
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent1].[Description] AS [Description],
[Extent1].[Active] AS [Active],
[Extent1].[ProductionLine_Id] AS [ProductionLine_Id],
[Extent1].[DefaultTechnology_Id] AS [DefaultTechnology_Id],
[Extent1].[PrinterName] AS [PrinterName],
[Extent1].[Deleted] AS [Deleted],
[Extent2].[Id] AS [Id1],
[Extent1].[LoggedInUser_UserId] AS [LoggedInUser_UserId]
FROM [dbo].[WorkPlaces] AS [Extent1]
LEFT OUTER JOIN [dbo].[WorkplaceParameterSet] AS [Extent2] ON [Extent1].[Id] = [Extent2].[WorkPlace_Id]
GO
-- Region Parameters
DECLARE #EntityKeyValue1 Int = 1
-- EndRegion
SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Timestamp] AS [Timestamp],
[Extent1].[Description] AS [Description],
[Extent1].[WorkPlace_Id] AS [WorkPlace_Id],
[Extent1].[WorkItemState_Id] AS [WorkItemState_Id],
[Extent1].[UserId] AS [UserId],
[Extent1].[WorkItem_Id] AS [WorkItem_Id],
[Extent1].[Technology_Id] AS [Technology_Id],
[Extent1].[Failcodes_Id] AS [Failcodes_Id],
[Extent1].[DrawingNo] AS [DrawingNo],
[Extent1].[ManualData] AS [ManualData],
[Extent1].[Deleted] AS [Deleted],
[Extent1].[WorkItemState_Arrival_Id] AS [WorkItemState_Arrival_Id]
FROM [dbo].[WorkSteps] AS [Extent1]
WHERE [Extent1].[WorkPlace_Id] = #EntityKeyValue1
Is there a way to get one line from worksteps without downloading 9000 records to pick one from the top of the list?
Rather than getting each workplace individually and then getting the workstep for that work place in a query you can use Select to project each workplace into the workstep that you want in one query:
var query = WorkPlaces.Select(workplace => workplace.WorkSteps
.Where(x => x.Failcodes_Id != null)
.OrderByDescending(x => x.Timestamp)
.FirstOrDefault());
You should use IQueryable interface instead of IEnumerable. Also check this.