I am using Entity Framework 4 to access a SQL Server 2008 database.
One of the SQL queries that EF generates exhibits behavior that I cannot explain.
The query is like this:
SELECT tableA.field1, tableA.field2, ...
FROM tableA join tableB on tableA.field1 = tableB.field1
WHERE
tableA.field2 > '20110825'
and tableA.field3 in ('a', 'b', 'c')
and tableB.field4 = 'xxx'
Here, tableA.field2 is datetime not null, and the other fields are varchars.
tableA contains circa 1.5 million records, tableB contains circa 2 million records, and the query returns 1877 rows.
The problem is, it returns them in 86 seconds, and that time changes dramatically when I change the '20110825' literal to older values.
For instance if I put '20110725' the query returns 3483 rows in 35 milliseconds.
I found out in the execution plan that the difference between the two lies in the indexes SQL Server chooses to use depending on the date used to compare.
When it is taking time, the execution plan shows:
50%: index seek on tableA.field2 (it's a clustered index on this field alone)
50%: index seek on tableB.field1 (non-unique, non-clustered index on this field alone)
0%: join
When it is almost instantaneous, the execution plan shows:
98%: index seek on tableA.field1 (non-unique, non-clustered index on this field alone)
2%: index seek on tableB.field1 (non-unique, non-clustered index on this field alone)
0%: join
So it seems to me that the decision of the optimizer to use the clustered index on tableA.field2 is not optimal.
Is there a flaw in the database design? In the SQL query?
Can I force the database in any way to use the correct execution plan?
Given that you are using literal values and are only encountering the issue with recent date strings, I suspect you are hitting the issue described here and need to schedule a job to update your statistics.
Presumably, when the statistics were last updated, there were few or no rows meeting the '20110825' criterion, and SQL Server chose a join strategy predicated on that assumption.
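For reference, a minimal sketch of such a statistics refresh on the table from the question, run as a scheduled (e.g. nightly) job:

-- Rebuild all statistics on the filtered table; FULLSCAN reads every row
-- and gives the most accurate histogram for the date column.
UPDATE STATISTICS dbo.tableA WITH FULLSCAN;

-- Or refresh statistics for every table in the database using default sampling.
EXEC sp_updatestats;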
I am working on a console app (C#, ASP.NET Core 2.1, Entity Framework Core) connected to a local SQL Server database, the default (localdb)\MSSQLLocalDB (SQL Server 2016 v13.0) provided with Visual Studio.
The problem I am facing is that it takes quite a long time to insert data into a table. The table has 400,000 rows and 6 columns, and I insert them 200 at a time.
Right now, the request takes 20 seconds to execute, and this execution time keeps increasing. Considering that I still have 20,000 x 200 rows to insert, it's worth figuring out where this problem comes from!
A couple of facts:
There is no index on the table
My computer is not new, but I have quite good hardware (i7, 16 GB RAM) and I don't hit 100% CPU while inserting
So, my questions are:
Is 400k rows considered a 'large' database? I've never worked with a table that big before, but I thought datasets like this were common.
How can I investigate where the insertion time comes from? I only have Visual Studio installed so far (but I am open to other options)
Here is the SQL definition of the table in question:
CREATE TABLE [dbo].[KfStatDatas]
(
[Id] INT IDENTITY (1, 1) NOT NULL,
[DistrictId] INT NOT NULL,
[StatId] INT NOT NULL,
[DataSourceId] INT NOT NULL,
[Value] NVARCHAR(300) NULL,
[SnapshotDate] DATETIME2(7) NOT NULL
);
EDIT
I ran SQL Server Management Studio and found the request that is slowing down the whole process: it is the insert request.
But looking at the SQL request created by Entity Framework, it looks like it's doing an inner join and going through the whole table, which would explain why the processing time increases with the table size.
I may be missing a point, but why would you need to enumerate the whole table to add rows?
Raw request being executed:
SELECT [t].[Id]
FROM [KfStatDatas] t
INNER JOIN #inserted0 i ON ([t].[Id] = [i].[Id])
ORDER BY [i].[_Position]
EDIT and SOLUTION
I eventually found the issue, and it was a stupid mistake: my Id field was not declared as a primary key! So the system had to go through the whole table for every inserted row. I added the PK and it now takes...100 ms for 200 rows, and this duration is stable.
Thanks for your time!
I think you may simply be missing a primary key. You've declared to EF that Id is the entity key, but you don't have a unique index on the table to enforce that.
And when EF wants to fetch the inserted IDs without an index, it's expensive. So this query
SELECT t.id from KfStatDatas t
inner join #inserted0 i
on t.id = i.id
order by i._Position
performs 38K logical reads and takes 16 seconds on average.
So try:
ALTER TABLE [dbo].[KfStatDatas]
ADD CONSTRAINT PK_KfStatDatas
PRIMARY KEY (id)
By the way, are you sure this is EF6? This looks more like an EF Core batch insert.
No, 400K rows is not large.
The most efficient way to insert a large number of rows from .NET is with SqlBulkCopy. This should take seconds rather than minutes for 400K rows.
When batching individual inserts, execute the entire batch in a single transaction to improve throughput. Otherwise, each insert is committed individually, requiring a synchronous flush of the log buffer to disk for each insert to harden the transaction.
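To illustrate the difference, a minimal T-SQL sketch using the table from the question (the values are made up for the example): with one explicit transaction around the batch, the log buffer is flushed to disk once at COMMIT rather than once per INSERT.

BEGIN TRANSACTION;

-- Each INSERT below only writes to the log buffer; nothing has to be
-- hardened to disk until the single COMMIT at the end of the batch.
INSERT INTO dbo.KfStatDatas (DistrictId, StatId, DataSourceId, Value, SnapshotDate)
VALUES (1, 10, 2, N'42', SYSUTCDATETIME());

INSERT INTO dbo.KfStatDatas (DistrictId, StatId, DataSourceId, Value, SnapshotDate)
VALUES (1, 11, 2, N'17', SYSUTCDATETIME());

-- ...remaining rows of the 200-row batch...

COMMIT TRANSACTION; -- one synchronous log flush for the whole batch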
EDIT:
I see from your comment that you are using Entity Framework. This answer may help you use SqlBulkCopy with EF.
I have an Entity Framework 6.1 project that is querying a SQL Server 2012 database table and getting back incorrect results.
To illustrate what is happening, I created 2 queries that should have exactly the same results. The table ProjectTable has 23 columns and roughly 20,500 rows:
var test1 = db.ProjectTable
.GroupBy(t => t.ProjectOwner)
.Select(g => g.Key)
.ToArray();
var test2 = db.ProjectTable
.ToArray()
.GroupBy(t => t.ProjectOwner)
.Select(g => g.Key)
.ToArray();
The queries are designed to get a list of all of the distinct project owners in the table. The first query does the heavy lifting on the SQL Server, whereas the second query downloads the entire table into memory and then processes it on the client side.
The first variable test1 has a length of about 300 items. The second variable test2 has a length of 5.
Here are the raw SQL queries that EF is generating:
-- test1
SELECT [Distinct1].[ProjectOwner] AS [ProjectOwner]
FROM ( SELECT DISTINCT
[Extent1].[ProjectOwner] AS [ProjectOwner]
FROM [dbo].[ProjectTable] as [Extent1]
) AS [Distinct1]
-- test2
SELECT Col1, Col2 ... ProjectOwner, ... Col23
FROM [dbo].[ProjectTable]
When I run the following query and analyze the returned entities, I notice that the full 20,500-ish rows are returned, but the ProjectOwner column gets overridden with one of only 5 different users!
var test = db.ProjectTable.ToArray();
I thought that maybe it was the SQL Server, so I did a packet trace and filtered on TDS. Randomly looking through the raw streams I see many names that aren't in the list of 5, so I know that data is being sent across the wire correctly.
How do I see the raw data that EF is getting? Is there something that might be messing with the cache and pulling incorrect results?
If I run the queries in either SSMS or Visual Studio, the returned list is correct. It is only EF that has this issue.
EDIT
OK, I added another test as a sanity check.
I took the test2 raw SQL query and did the following:
var test3 = db.Database
    .SqlQuery<ProjectTable>(@"SELECT Col1..Col23")
    .ToArray()
    .Select(t => t.ProjectOwner)
    .Distinct()
    .ToArray();
and I get the correct 300-ish names back!
So, in short:
Having EF send a projected DISTINCT query to SQL Server returns the correct results
Having EF select the entire table and then using LINQ to project and DISTINCT the data returns incorrect results
Giving EF THE EXACT SAME QUERY that bullet #2 generates, but as a raw SQL query, returns the correct results
After downloading the Entity Framework source and stepping through many an Enumerator, I found the issue.
In the Shaper.HandleEntityAppendOnly method (found here), on line 187, the Context.ObjectStateManager.FindEntityEntry method is called. To my surprise, a non-null value was returned! Wait a minute, there shouldn't be any cached results, since I'm returning all rows?!
That's when I discovered that my Table has no Primary Key!
In my defence, the table is actually a cache of a view that I'm working with; I just did a SELECT * INTO CACHETABLE FROM USERVIEW.
I then looked at which column Entity Framework thought was my Primary Key (they call it a singleton key) and it just so happens that the column they picked had only... drum roll please... 5 unique values!
When I looked at the model that EF generated, sure enough! That column was specified as a primary key. I changed the key to the appropriate column and now everything is working as it should!
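As an aside, SELECT ... INTO copies only the data and none of the keys or constraints, so one way to keep the generated model honest is to put the real key back on the cache table before regenerating it. A hedged sketch, assuming the view's actual key is a non-nullable column named Id (a placeholder name):

-- SELECT * INTO created CACHETABLE without any primary key;
-- restore the real key so EF infers the correct entity key.
ALTER TABLE dbo.CACHETABLE
ADD CONSTRAINT PK_CACHETABLE PRIMARY KEY (Id);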
I have a database table with over 3,000,000 rows; each row has an id and an xml field stored as varchar(6000).
If I do SELECT id FROM bigtable, it takes about 2 minutes to complete. Is there any way to get this down to 30 seconds?
Build a clustered index on the id column.
See http://msdn.microsoft.com/en-us/library/ms186342.aspx
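A minimal sketch of that suggestion (the index name is arbitrary; if id is also meant to be the primary key, a clustered primary key constraint has the same effect):

-- Cluster the table on the id column, as suggested above.
CREATE CLUSTERED INDEX IX_bigtable_id ON dbo.bigtable (id);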
You could apply indexes to your tables; in your case, a clustered index.
Clustered indexes:
http://msdn.microsoft.com/en-gb/library/aa933131(v=sql.80).aspx
I would also suggest filtering your query so it doesn't return all 3 million rows each time; this can be done by using TOP or WHERE.
TOP:
SELECT TOP 1000 ID
FROM bigtable
WHERE:
SELECT ID
FROM bigtable
WHERE id IN (1,2,3,4,5)
First of all, 3 million records don't make a table 'huge'.
To optimize your query, you should do the following:
Filter your query: why do you need to get ALL your IDs?
Create a clustered index on the ID column, so lookups can seek through a small index structure instead of scanning the whole table.
Helpful threads: here and here.
Okay, why are you returning all the Ids to the client?
Even if your table has no clustered index (which I doubt), the vast majority of your processing time will be client-side, transferring the Id values over the network and displaying them on the screen.
Querying for all values rather defeats the point of having a query engine.
The only reason I can think of (perhaps I lack imagination) for getting all the Ids is some sort of misguided caching.
If you want to know how many you have, do
SELECT count(*) FROM [bigtable]
If you want to know whether an Id exists, do
SELECT count([Id]) FROM [bigtable] WHERE [Id] = 1 /* or some other Id */
This will return 1 row with a 1 or 0 indicating existence of the specified Id.
Both these queries will benefit massively from a clustered index on Id and will return minimal data with maximal information.
Both of these queries will return in less than 30 seconds, and in less than 30 milliseconds if you have a clustered index on Id
Selecting all the Ids will provide no more useful information than these queries, and all it will achieve is a workout for your network and client.
You could index your table for better performance.
There are also additional options you could use to improve performance, such as the partitioning feature.
I'm looking for an efficient way of inserting records into SQL server for my C#/MVC application. Anyone know what the best method would be?
Normally I've just done a while loop with an insert statement inside, but then again I've not had quite so many records to deal with. I need to insert around half a million, and at 300 rows a minute with the while loop, I'll be here all day!
What I'm doing is looping through a large holding table and using its rows to create records in a different table. I've set up some functions to look up data needed for the new table, and this is no doubt adding to the drain.
So here is the query I have. Extremely inefficient for large amounts of data!
Declare @HoldingID int
Set @HoldingID = (Select min(HoldingID) From Holding)
While @HoldingID IS NOT NULL
Begin
Insert Into Journeys (DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
Where HoldingID = @HoldingID
Set @HoldingID = (Select MIN(HoldingID) From Holding Where HoldingID > @HoldingID)
End
I've heard about set-based approaches - is there anything that might work for the above problem?
If you want to insert a lot of data into SQL Server then you should use bulk inserts - there is a command-line tool for this called the bcp utility, and also a C# wrapper for performing bulk copy operations, but under the covers they all use the same bulk-load mechanism.
Depending on your application, you may want to insert your data into a staging table first, and then either MERGE or INSERT INTO ... SELECT to transfer those rows from the staging table(s) to the target table(s) - if you have a lot of data then this will take some time, but it will be a lot quicker than performing the inserts individually.
If you want to speed this up then there are various things you can do, such as changing the recovery model or tweaking/removing triggers and indexes (depending on whether or not this is a live database). If it's still really slow then you should look into doing this process in batches (e.g. 1000 rows at a time).
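For the first step, a hedged sketch of bulk-loading a flat file into a staging table (the file path, terminators, and staging table name are placeholders for this example):

-- Load the source file into a staging table in large batches; the set-based
-- transfer to the target table can then run as a single INSERT ... SELECT or MERGE.
BULK INSERT dbo.HoldingStaging
FROM 'C:\data\holding.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 10000,  -- commit every 10,000 rows
    TABLOCK             -- allows a minimally logged load where the recovery model permits
);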
This should do exactly what you are doing now, but as a single set-based statement:
Insert Into Journeys(DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
Select
dbo.GetHubIDFromName(StartHubName),
dbo.GetHubIDFromName(EndHubName),
dbo.GetBusIDFromName(CompanyName),
JourneyNo, 1
From Holding
ORDER BY HoldingID ASC
You are (probably) able to do it in one statement of the form
INSERT INTO JOURNEYS
SELECT * FROM HOLDING;
Without more information about your schema it is difficult to be absolutely sure.
SQL Server 2008 introduced table-valued parameters. These allow you to insert multiple rows in a single round trip to the database (sent as one large blob), without using a temporary table. This article describes how it works (step four in the article):
http://www.altdevblogaday.com/2012/05/16/sql-server-high-performance-inserts/
It differs from bulk inserts in that you do not need special utilities and that all constraints and foreign keys are checked.
I quadrupled my throughput by using this and parallelizing the inserts. I'm now at a sustained 15,000 inserts/second into the same table, a regular table with indexes and over a billion rows.
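For reference, a hedged sketch of the server-side pieces for a table-valued parameter, reusing the Journeys table from the question (the type name, procedure name, and column data types are assumptions for illustration); from .NET the batch is passed as a single parameter of SqlDbType.Structured.

-- A table type describing one batch of rows to insert (column types assumed).
CREATE TYPE dbo.JourneyRow AS TABLE
(
    DepartureID INT NOT NULL,
    ArrivalID INT NOT NULL,
    ProviderID INT NOT NULL,
    JourneyNumber NVARCHAR(50) NOT NULL,
    Active BIT NOT NULL
);
GO

-- The client sends the whole batch in one round trip; constraints and
-- foreign keys on Journeys are still checked, which plain bulk loads skip by default.
CREATE PROCEDURE dbo.InsertJourneys
    @Rows dbo.JourneyRow READONLY
AS
BEGIN
    INSERT INTO dbo.Journeys (DepartureID, ArrivalID, ProviderID, JourneyNumber, Active)
    SELECT DepartureID, ArrivalID, ProviderID, JourneyNumber, Active
    FROM @Rows;
END;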
Does anyone know what the __syncTransactions table is used for? Here's my scenario:
We have had a few clients running SQL Server Compact databases syncing with a SQL Server 2008 Express server database using Sync Framework v3.5 for about a year now. It seems that for tables with a large number of records (i.e. > 20,000), it takes roughly more than a minute to sync and the CPU reaches 100% as well.
After enabling sync trace logging, I managed to narrow down the query that was behind the slow syncing. Following is the query that checks for inserts in the client database for one of the tables that contains ~70,000 records:
select
ut.*
from
(
select
ut0.*
from
[tblPermissionGroupResourceRole] as ut0
where
( ut0._sysTrackingContext <> 'a4e40127-4083-4b27-88d0-ef3aed4ae343'
OR
ut0._sysTrackingContext IS NULL
)
AND (ut0._sysChangeTxBsn >= 9486853)
AND (ut0._sysInsertTxBsn NOT IN (SELECT SyncBsn FROM __syncTransactions))
) as ut
LEFT OUTER JOIN
(
select
txcs0.*
from
_sysTxCommitSequence as txcs0
) as txcs ON (ut._sysInsertTxBsn = txcs._sysTxBsn)
WHERE
COALESCE(txcs._sysTxCsn, ut._sysInsertTxBsn) > 9486853
AND
COALESCE(txcs._sysTxCsn, ut._sysInsertTxBsn) <= 9487480
The slow part is the line with the NOT IN (SELECT SyncBsn FROM __syncTransactions) subquery, which takes roughly 1 minute to execute. The reason is that the __syncTransactions table contains around 1,200 records, and due to the cross-reference with the 70,000-odd records in my tblPermissionGroupResourceRole table, the query is quite slow.
Therefore, I need to understand how the __syncTransactions table is used, so I can try to clear records from it, or find out whether there's any other way to sort out my problem.
Any help is much appreciated.
Kind regards,
Sasanka.