With EF Core: is there a way to tell it that the new value for the single field I'm updating is the same for all entities, so that it updates them with one (or a few) SQL statements?
I want to set a timestamp on a lot of records at the same time.
I currently have this code:
var downloadTimeStamp = DateTime.UtcNow;
foreach (var e in entities)
{
e.DownloadDateTime = downloadTimeStamp;
}
await db.SaveChangesAsync();
It generates SQL that executes one UPDATE statement for each entity. E.g.:
UPDATE [TableName] SET [DownloadDateTime] = @p1988
WHERE [Id] = @p1989;
SELECT @@ROWCOUNT;
Is that really the optimal way to do this? Is it possible to make EF Core do it in fewer SQL statements?
In my current dataset there are 100,000 records that need to be updated. It is not the full contents of the table, only a subset. It takes about 90 seconds to run the updates above from Visual Studio against a local SQL Server Express instance. When running in a small Azure web app against a small Azure SQL instance, it takes several minutes. That's why I'm looking to optimize it, if possible.
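For illustration, the kind of single set-based statement I'm hoping EF Core could emit, sketched here via raw SQL (the BatchId filter is a hypothetical stand-in for whatever identifies the subset; newer EF Core versions also expose a typed ExecuteUpdateAsync for the same purpose):
var downloadTimeStamp = DateTime.UtcNow;
// One round trip, one UPDATE; ExecuteSqlInterpolatedAsync (EF Core 3+) sends the
// interpolated values as parameters. BatchId/batchId are hypothetical placeholders.
await db.Database.ExecuteSqlInterpolatedAsync(
    $@"UPDATE [TableName]
       SET [DownloadDateTime] = {downloadTimeStamp}
       WHERE [BatchId] = {batchId}");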
Related
I am working on a C# .NET desktop application in Visual Studio 2019. I have two tables in SQLite, each with around 1 million records. I have to update a few columns of table A from table B based on some matching criteria. I have also loaded the records into DataTables for local processing.
Search and update field criteria vary from time to time, so I have to build dynamic queries.
The records exist both in the database tables and in DataTables, so any approach is welcome, whether it updates via direct table queries or via local processing of the DataTables.
I have tried the approaches below.
Approach 1, using a query: a Google search suggested that in SQL you can easily update records in one table from another with a single query, but that this is not possible in SQLite. I tried the query approach and it failed; I later found that SQLite does not support it.
I also tried updating records one by one in a loop. Insert queries in a loop are very fast (just 4-5 seconds for 200K records), but update queries in a loop are very slow, at around 9 records per second, which would take hours to update all records.
Approach 2, using LINQ:
The schemas of the two tables are not the same, but they are linked through a few columns.
string field1 = "A";
string field2 = "B";
var updateQuery = from r1 in table1Data.AsEnumerable()
join r2 in table2Data.AsEnumerable()
on r1.Field<string>(field1) equals r2.Field<string>(field2)
select new { r1, r2 };
This works fine for a single field, but I need a dynamic approach that supports multiple columns too (the column count will be decided at run time based on user selection).
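Roughly, the direction I have in mind is to build a composite key from the user-selected columns at run time and join on that; a rough sketch (the column names and the copied field below are placeholders, not my real schema):
// Requires System.Data, System.Linq and the System.Data.DataSetExtensions reference for AsEnumerable().
string[] keys1 = { "ColA", "ColB" };   // columns of table1 chosen at run time
string[] keys2 = { "ColX", "ColY" };   // matching columns of table2

// Concatenate the selected column values into a single comparable key per row.
Func<DataRow, string[], string> makeKey = (row, cols) =>
    string.Join("\u0001", cols.Select(c => row[c]?.ToString() ?? ""));

var matches = from r1 in table1Data.AsEnumerable()
              join r2 in table2Data.AsEnumerable()
                  on makeKey(r1, keys1) equals makeKey(r2, keys2)
              select new { r1, r2 };

foreach (var m in matches)
{
    // Copy whichever columns the user selected for the update (placeholder names).
    m.r1["TargetColumn"] = m.r2["SourceColumn"];
}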
I am working on a console app (C#, ASP.NET Core 2.1, Entity Framework Core) which is connected to a local SQL Server database, the default (localdb)\MSSQLLocalDB (SQL Server 2016 v13.0) provided with Visual Studio.
The problem I am facing is that it takes quite a long time to insert data into a table. The table has 400,000 rows and 6 columns, and I insert rows 200 at a time.
Right now each request takes 20 seconds to execute, and this execution time keeps increasing. Considering that I still have 20,000 x 200 rows to insert, it's worth figuring out where this problem comes from!
A couple of facts:
There is no Index on the table
My computer is not new, but I have quite good hardware (i7, 16 GB RAM) and I don't hit 100% CPU while inserting
So, my questions are :
Is 400K rows considered to be a 'large' database? I've never worked with a table that big before, but I thought it was common to have a dataset like this.
How can I investigate where the insert time comes from? I only have Visual Studio installed so far (but I am open to other options)
Here is the SQL code of the table in question:
CREATE TABLE [dbo].[KfStatDatas]
(
[Id] INT IDENTITY (1, 1) NOT NULL,
[DistrictId] INT NOT NULL,
[StatId] INT NOT NULL,
[DataSourceId] INT NOT NULL,
[Value] NVARCHAR(300) NULL,
[SnapshotDate] DATETIME2(7) NOT NULL
);
EDIT
I ran SQL Server Management Studio and found the request that is slowing down the whole process: it is the insert request.
But, looking at the SQL request created by Entity Framework, it looks like it's doing an inner join and going through the whole table, which would explain why the processing time increases with the table size.
I may be missing a point, but why would you need to scan the whole table just to add rows?
Raw request being executed:
SELECT [t].[Id]
FROM [KfStatDatas] t
INNER JOIN #inserted0 i ON ([t].[Id] = [i].[Id])
ORDER BY [i].[_Position]
EDIT and SOLUTION
I eventually found the issue, and it was a stupid mistake: my Id field was not declared as a primary key! So the system had to scan the whole table for every inserted row. I added the PK and it now takes... 100 ms for 200 rows, and this duration is stable.
Thanks for your time!
I think you may simply be missing a primary key. You've declared to EF that Id is the entity key, but you don't have a unique index on the table to enforce it.
And when EF wants to fetch the inserted IDs, without an index, it's expensive. So this query
SELECT t.id from KfStatDatas t
inner join #inserted0 i
on t.id = i.id
order by i._Position
performs 38K logical reads and takes 16 seconds on average.
So try:
ALTER TABLE [dbo].[KfStatDatas]
ADD CONSTRAINT PK_KfStatDatas
PRIMARY KEY (id)
BTW are you sure this is EF6? This looks more like EF Core batch insert.
No, 400K rows is not large.
The most efficient way to insert a large number of rows from .NET is with SqlBulkCopy. This should take seconds rather than minutes for 400K rows.
When batching individual inserts, execute the entire batch in a single transaction to improve throughput. Otherwise, each insert is committed individually, requiring a synchronous flush of the log buffer to disk for each insert to harden the transaction.
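To illustrate the SqlBulkCopy route, a minimal sketch, assuming the rows are already in a DataTable whose column names match the KfStatDatas table from the question:
// Requires System.Data and System.Data.SqlClient (or Microsoft.Data.SqlClient).
void BulkInsert(string connectionString, DataTable rows)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.KfStatDatas";
        bulk.BatchSize = 10000;       // rows sent to the server per batch
        bulk.WriteToServer(rows);     // one bulk load instead of row-by-row INSERTs
    }
}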
EDIT:
I see from your comment that you are using Entity Framework. This answer may help you use SqlBulkCopy with EF.
I have a strange performance issue with executing a simple merge SQL command on Entity Framework 6.
First my Entity Framework code:
var command = @"MERGE [StringData] AS TARGET
USING (VALUES (@DCStringID_Value, @TimeStamp_Value)) AS SOURCE ([DCStringID], [TimeStamp])
ON TARGET.[DCStringID] = SOURCE.[DCStringID] AND TARGET.[TimeStamp] = SOURCE.[TimeStamp]
WHEN MATCHED THEN
UPDATE
SET [DCVoltage] = @DCVoltage_Value,
[DCCurrent] = @DCCurrent_Value
WHEN NOT MATCHED THEN
INSERT ([DCStringID], [TimeStamp], [DCVoltage], [DCCurrent])
VALUES (@DCStringID_Value, @TimeStamp_Value, @DCVoltage_Value, @DCCurrent_Value);";
using (EntityModel context = new EntityModel())
{
for (int i = 0; i < 100; i++)
{
var entity = _buffer.Dequeue();
context.ContextAdapter.ObjectContext.ExecuteStoreCommand(command, new object[]
{
new SqlParameter("#DCStringID_Value", entity.DCStringID),
new SqlParameter("#TimeStamp_Value", entity.TimeStamp),
new SqlParameter("#DCVoltage_Value", entity.DCVoltage),
new SqlParameter("#DCCurrent_Value", entity.DCCurrent),
});
}
}
Execution time ~20 seconds.
This seemed a little slow, so I ran the same command directly in Management Studio (also 100 times in a row).
SQL Server Management Studio:
Execution time <1 second.
Ok that is strange!?
Some tests:
First I compared both execution plans (Entity Framework and SSMS), but they are absolutely identical.
Second, I tried using a transaction inside my code.
using (PowerdooModel context = PowerdooModel.CreateModel())
{
using (var dbContextTransaction = context.Database.BeginTransaction())
{
try
{
for (int i = 0; i < 100; i++)
{
context.ContextAdapter.ObjectContext.ExecuteStoreCommand(command, new object[]
{
new SqlParameter("#DCStringID_Value", entity.DCStringID),
new SqlParameter("#TimeStamp_Value", entity.TimeStamp),
new SqlParameter("#DCVoltage_Value", entity.DCVoltage),
new SqlParameter("#DCCurrent_Value", entity.DCCurrent),
});
}
dbContextTransaction.Commit();
}
catch (Exception)
{
dbContextTransaction.Rollback();
}
}
}
Third, I added OPTION (RECOMPILE) to avoid parameter sniffing.
Execution time is still ~10 seconds, which is still very poor performance.
Question: what am I doing wrong? Please give me a hint.
-------- Some more tests - edited 18.11.2016 ---------
If I execute the commands inside a transaction (like above), the following times come up:
Duration complete: 00:00:06.5936006
Average Command: 00:00:00.0653457
Commit: 00:00:00.0590299
Isn't it strange that the commit takes nearly no time, yet the average command still takes nearly the same time?
In SSMS you are running one long batch that contains 100 separate MERGE statements. In C# you are running 100 separate batches. Obviously that takes longer.
Running 100 separate batches in 100 separate transactions obviously takes longer than running 100 batches in 1 transaction. Your measurements confirm that and show you by how much.
To make it efficient, use a single MERGE statement that processes all 100 rows from a table-valued parameter in one go. See also Table-Valued Parameters for the .NET Framework.
Often a table-valued parameter is a parameter of a stored procedure, but you don't have to use a stored procedure. It could be a single statement, but instead of multiple simple scalar parameters you'd pass the whole table at once.
I have never used Entity Framework, so I can't show you a C# example of how to call it. I'm sure that if you search for "how to pass a table-valued parameter in entity framework" you'll find an example. I use the DataTable class to pass a table as a parameter.
I can show you an example of a T-SQL stored procedure.
First you define a table type that closely follows the definition of your StringData table:
CREATE TYPE dbo.StringDataTableType AS TABLE(
DCStringID int NOT NULL,
TimeStamp datetime2(0) NOT NULL,
DCVoltage float NOT NULL,
DCCurrent float NOT NULL
)
Then you use it as a type for the parameter:
CREATE PROCEDURE dbo.MergeStringData
@ParamRows dbo.StringDataTableType READONLY
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
BEGIN TRANSACTION;
BEGIN TRY
MERGE INTO dbo.StringData WITH (HOLDLOCK) as Dst
USING
(
SELECT
TT.DCStringID
,TT.TimeStamp
,TT.DCVoltage
,TT.DCCurrent
FROM
@ParamRows AS TT
) AS Src
ON
Dst.DCStringID = Src.DCStringID AND
Dst.TimeStamp = Src.TimeStamp
WHEN MATCHED THEN
UPDATE SET
Dst.DCVoltage = Src.DCVoltage
,Dst.DCCurrent = Src.DCCurrent
WHEN NOT MATCHED BY TARGET THEN
INSERT
(DCStringID
,TimeStamp
,DCVoltage
,DCCurrent)
VALUES
(Src.DCStringID
,Src.TimeStamp
,Src.DCVoltage
,Src.DCCurrent)
;
COMMIT TRANSACTION;
END TRY
BEGIN CATCH
-- TODO: handle the error
ROLLBACK TRANSACTION;
END CATCH;
END
Again, it doesn't have to be a stored procedure, it can be just a MERGE statement with one table-valued parameter.
I'm pretty sure that it would be much faster than your loop with 100 separate queries.
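For completeness, here is a rough sketch of how I pass the DataTable through plain ADO.NET (not Entity Framework); the column layout matches the table type defined above:
// Requires System.Data and System.Data.SqlClient.
var table = new DataTable();
table.Columns.Add("DCStringID", typeof(int));
table.Columns.Add("TimeStamp", typeof(DateTime));
table.Columns.Add("DCVoltage", typeof(double));
table.Columns.Add("DCCurrent", typeof(double));
// ... fill the table with the buffered rows ...

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.MergeStringData", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    var p = command.Parameters.AddWithValue("@ParamRows", table);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.StringDataTableType";   // the table type created above
    connection.Open();
    command.ExecuteNonQuery();                // one round trip merges all rows
}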
Details on why there should be a HOLDLOCK hint with MERGE: “UPSERT” Race Condition With MERGE
A side note:
It is not strange that Commit in your last test is very fast. It doesn't do much, because everything is already written to the database. If you tried to do Rollback, that would take some time.
It looks like the difference is that you are using OPTION (RECOMPILE) in your EF code only, but when you run the code in SSMS there is no recompile.
Recompiling 100 times would certainly add some time to execution.
Following up on the answer from @Vladimir Baranov, I took a look at whether Entity Framework supports bulk operations (in my case, bulk merge).
Short answer... no! Bulk delete, update and merge operations are not supported.
Then I found this sweet little library!
zzzproject Entity Framework Extensions
It provides a ready-made extension that works the way @Vladimir Baranov described:
Create a temporary table in SQL Server.
Bulk Insert data with .NET SqlBulkCopy into the temporary table.
Perform a SQL statement between the temporary table and the destination table.
Drop the temporary table from the SQL Server.
You can find a short interview by Jonathan Allen about the functions and how it can dramatically improve Entity Framework performance with bulk operations.
For me it really kicks up the performance from 10 seconds to under 1 second. Nice!
To be clear, I don't work for them and am not affiliated with them in any way; this is not an advertisement. Yes, the library is commercial, but the basic version is free.
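If you prefer not to take a dependency on the library, here is a rough hand-rolled sketch of the same four-step pattern (temp table + SqlBulkCopy + MERGE), reusing the StringData layout from this thread; the staging table name and the rowsToMerge DataTable are assumptions:
// Requires System.Data and System.Data.SqlClient.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    // 1. Create a temporary staging table (visible only to this connection).
    using (var create = new SqlCommand(
        @"CREATE TABLE #StringDataStaging
          (DCStringID int, TimeStamp datetime2(0), DCVoltage float, DCCurrent float);", connection))
    {
        create.ExecuteNonQuery();
    }

    // 2. Bulk insert the buffered rows into the staging table.
    using (var bulk = new SqlBulkCopy(connection))
    {
        bulk.DestinationTableName = "#StringDataStaging";
        bulk.WriteToServer(rowsToMerge);   // a DataTable with matching columns
    }

    // 3. Merge the staging rows into the destination table, then 4. drop the staging table.
    using (var merge = new SqlCommand(
        @"MERGE dbo.StringData WITH (HOLDLOCK) AS Dst
          USING #StringDataStaging AS Src
             ON Dst.DCStringID = Src.DCStringID AND Dst.TimeStamp = Src.TimeStamp
          WHEN MATCHED THEN UPDATE SET Dst.DCVoltage = Src.DCVoltage, Dst.DCCurrent = Src.DCCurrent
          WHEN NOT MATCHED THEN INSERT (DCStringID, TimeStamp, DCVoltage, DCCurrent)
               VALUES (Src.DCStringID, Src.TimeStamp, Src.DCVoltage, Src.DCCurrent);
          DROP TABLE #StringDataStaging;", connection))
    {
        merge.ExecuteNonQuery();
    }
}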
I have an MSSQL database table with a few million entries.
Every new entry gets an ID one higher than the last entry, which means lower ID numbers are older entries. Now I want to delete old entries in the database based on their ID.
I delete every Entry whose ID is lower than maxID.
while (true)
{
    // Load the next batch of at most 1000 old entries.
    List<Entry> entries = entity.Entry.Where(z => z.id < maxID).Take(1000).ToList();
    foreach (var entry in entries)
    {
        entity.Entry.DeleteObject(entry);
    }
    // Push this batch of deletes to the database before loading the next batch.
    entity.SaveChanges();
    if (entries.Count < 1000)
    {
        break;
    }
}
I can't take all entries with one query because that would raise a System.OutOfMemoryException, so I only take 1000 entries at a time and repeat the delete until every matching entry is deleted.
My question is: what would be the best number of entries to .Take() for performance?
It's faster to drop and recreate the tables in the database.
You can execute commands directly against the database by calling your stored procedure using the ExecuteStoreQuery method.
Any commands automatically generated by the Entity Framework may be more complex than similar commands written explicitly by a database developer. If you need explicit control over the commands executed against your data source, consider defining a mapping to a table-valued function or stored procedure. -MSDN
As far as I can see from your code (please correct me if I am wrong, or improve the answer), your code is actually loading the entities into memory, which is an overhead cost when all you need is a delete operation, and EF will generate a separate DELETE query for each entity marked by DeleteObject. So in terms of performance it will be better to call a stored procedure or execute your query directly against the database.
ExecuteStoreCommand Method
Directly Execute commands
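For example, the whole cleanup can be done on the server without materializing any entities; a sketch, assuming entity derives from ObjectContext and the underlying table/column are named Entry and id as in the question (DELETE TOP batches keep each transaction and the log usage small):
// Requires System.Data.SqlClient for SqlParameter.
int affected;
do
{
    affected = entity.ExecuteStoreCommand(
        "DELETE TOP (10000) FROM [Entry] WHERE [id] < @maxID",
        new SqlParameter("@maxID", maxID));
} while (affected > 0);   // repeat until no old rows remain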
Try this...
entity.Entry.Where(z => z.id < maxID).ToList().ForEach(entity.Entry.DeleteObject);
entity.SaveChanges();
I am building a batch processing system. Batches of Units come in quantities from 20-1000. Each Unit is essentially a hierarchy of models (one main model and many child models). My task involves saving each model hierarchy to a database as a single transaction (either each hierarchy commits or it rolls back). Unfortunately EF was unable to handle two portions of the model hierarchy due to their potential to contain thousands of records.
What I've done to resolve this is set up SqlBulkCopy to handle these two potentially high count models and let EF handle the rest of the inserts (and referential integrity).
Batch Loop:
foreach (var unitDetails in BatchUnits)
{
var unitOfWork = new Unit(unitDetails);
Task.Factory.StartNew(() =>
{
unitOfWork.ProcessX(); // data preparation
unitOfWork.ProcessY(); // data preparation
unitOfWork.PersistCase();
});
}
Unit:
class Unit
{
public void PersistCase()
{
using (var dbContext = new CustomDbContext())
{
// Need an explicit transaction so that
// EF + SqlBulkCopy act as a single block
using (var scope = new TransactionScope(TransactionScopeOption.Required,
new TransactionOptions() {
IsolationLevel = System.Transactions.IsolationLevel.ReadCommitted
}))
{
// Let EF Insert most of the records
// Note Insert is all it is doing, no update or delete
dbContext.Units.Add(thisUnit);
dbContext.SaveChanges(); // deadlocks, DbConcurrencyExceptions here
// Copy Auto Inc Generated Id (set by EF) to DataTables
// for referential integrity of SqlBulkCopy inserts
CopyGeneratedId(thisUnit.AutoIncrementedId, dataTables);
// Execute SqlBulkCopy for potentially numerous model #1
SqlBulkCopy bulkCopy1 = new SqlBulkCopy(...);
...
bulkCopy1.WriteToServer(dataTables["#1"]);
// Execute SqlBulkCopy for potentially numerous model #2
SqlBulkCopy bulkCopy2 = new SqlBulkCopy(...);
...
bulkCopy2.WriteToServer(dataTables["#2"]);
// Commit transaction
scope.Complete();
}
}
}
}
Right now I'm essentially stuck between a rock and a hard place. If I leave the IsolationLevel set to ReadCommitted, I get deadlocks between EF INSERT statements in different Tasks.
If I set the IsolationLevel to ReadUncommitted (which I thought would be fine since I'm not doing any SELECTs) I get DbConcurrencyExceptions.
I've been unable to find any good information about DbConcurrencyExceptions and Entity Framework but I'm guessing that ReadUncommitted is essentially causing EF to receive invalid "rows inserted" information.
UPDATE
Here is some background information on what is actually causing my deadlocking issues while doing INSERTS:
http://connect.microsoft.com/VisualStudio/feedback/details/562148/how-to-avoid-using-scope-identity-based-insert-commands-on-sql-server-2005
Apparently this same issue was present a few years ago when Linq To SQL came out and Microsoft fixed it by changing how scope_identity() gets selected. Not sure why their position has changed to this being a SQL Server problem when the same issue came up with Entity Framework.
This issue is explained fairly well here: http://connect.microsoft.com/VisualStudio/feedback/details/562148/how-to-avoid-using-scope-identity-based-insert-commands-on-sql-server-2005
Essentially it's an internal EF issue. I migrated my code to use LINQ to SQL and it now works fine (it no longer does the unnecessary SELECT for the identity value).
Relevant quote from the exact same issue in LINQ to SQL, which was fixed:
When a table has an identity column, Linq to SQL generates extremely
inefficient SQL for insertion into such a table. Assume the table is
Order and the identity column is Id. The SQL generated is:
exec sp_executesql N'INSERT INTO [dbo].[Order]([Column1], [Column2])
VALUES (@p0, @p1)
SELECT [t0].[Id] FROM [dbo].[Order] AS [t0] WHERE [t0].[Id] =
(SCOPE_IDENTITY()) ',N'@p0 int,@p1 int',@p0=124,@p1=432
As one can see instead of returning SCOPE_IDENTITY() directly by using
'SELECT SCOPE_IDENTITY()', the generated SQL performs a SELECT on the
Id column using the value returned by SCOPE_IDENTITY(). When the
number of the records in the table is large, this significantly slows
down the insertion. When the table is partitioned, the problem gets
even worse.