Entity Framework bulk insert with updating rows from another table - c#

I'm using Microsoft SQL Server and Entity Framework. I have N (for example 10,000) items to insert. Before inserting each item I need to insert or update an existing group. This doesn't perform well because I'm generating too many queries: on each iteration of the loop I look up the group by querying the Groups table on three (already indexed) parameters.
I was thinking about first querying all the groups with a WHERE IN query (Groups.Where(g => owners.Contains(g.OwnerId) && ...)), but as I remember such queries are limited by the number of parameters.
Maybe I should write a stored procedure?
Here is my example code. I'm using the IUnitOfWork pattern to wrap the EF DbContext:
public async Task InsertAsync(IItem item)
{
    var existingGroup = await this.unitOfWork.Groups.GetByAsync(item.OwnerId, item.Type, item.TypeId);
    if (existingGroup == null)
    {
        existingGroup = this.unitOfWork.Groups.CreateNew();
        existingGroup.State = GroupState.New;
        existingGroup.Type = item.Code;
        existingGroup.TypeId = item.TypeId;
        existingGroup.OwnerId = item.OwnerId;
        existingGroup.UpdatedAt = item.CreatedAt;
        this.unitOfWork.Groups.Insert(existingGroup);
    }
    else
    {
        existingGroup.UpdatedAt = item.CreatedAt;
        existingGroup.State = GroupState.New;
        this.unitOfWork.Groups.Update(existingGroup);
    }
    this.unitOfWork.Items.Insert(item);
}

foreach (var item in items)
{
    await InsertAsync(item);
}
await this.unitOfWork.SaveChangesAsync();

There are three key elements to improve performance when bulk inserting:
Set AutoDetectChangesEnabled and ValidateOnSaveEnabled to false:
_db.Configuration.AutoDetectChangesEnabled = false;
_db.Configuration.ValidateOnSaveEnabled = false;
Break up your inserts into segments which use the same DbContext, then recreate the context. How large each segment should be varies from use case to use case; I got the best performance at around 100 elements before recreating the context. This is because the DbContext tracks every element attached to it (a sketch follows at the end of this answer).
Also make sure not to recreate the context for every insert.
(See Slauma's answer here Fastest Way of Inserting in Entity Framework)
When checking other tables, make sure to use IQueryable where possible and call ToList() or FirstOrDefault() only where necessary, since ToList() and FirstOrDefault() materialize the objects. (See Richard Szalay's answer here: What's the difference between IQueryable and IEnumerable)
These tricks helped me the most when bulk inserting in a scenario like the one you described. There are also other possibilities, for example stored procedures or a BulkInsert function.
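A minimal sketch of the segmented-insert idea, using a raw DbContext for brevity (YourDbContext, Items, itemsToInsert and the batch size of 100 are placeholders/assumptions, not taken from the question):

const int batchSize = 100;

for (var i = 0; i < itemsToInsert.Count; i += batchSize)
{
    // A fresh context per segment keeps the change tracker small.
    using (var db = new YourDbContext())
    {
        db.Configuration.AutoDetectChangesEnabled = false;
        db.Configuration.ValidateOnSaveEnabled = false;

        foreach (var item in itemsToInsert.Skip(i).Take(batchSize))
        {
            db.Items.Add(item);
        }

        db.SaveChanges();
    }
}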

Related

Linq ToList() and Count() performance problem

I have 200k rows in my table and I need to filter the table and then show the result in a datatable. When I run the SQL directly it is fast, but when I want to get the row count or call ToList(), it takes a long time. The filtered result is only 15 rows, so it is not a huge amount of data.
public static List<Books> GetBooks()
{
    List<Books> bookList = new List<Books>();
    var v = from a in ctx.Books select a;
    int allBooksCount = v.Count(); // I need the count of all books before filtering, but it is so slow (my first problem)
    if (isFilter)
    {
        v = v.Where(a => a.startdate <= DateTime.Now && a.enddate >= DateTime.Now);
    }
    .
    .
    bookList = v.ToList(); // ToList() is also very slow (my second problem)
}
There's nothing wrong with the code you've shown. So either you have some trouble in the database itself, or you're ruining the query by using IEnumerable instead of IQueryable.
My guess is that either ctx.Books is IEnumerable<Books> (instead of IQueryable<Books>), or that the Count (and Where etc.) method you're calling is the Enumerable version, rather than the Queryable version.
Which version of Count are you actually calling?
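To illustrate the difference, a minimal sketch (assuming ctx.Books is a DbSet<Books>):

// Queryable.Count(): translated to a single SELECT COUNT(*) on the server.
IQueryable<Books> queryable = ctx.Books;
int serverSideCount = queryable.Count();

// Enumerable.Count(): pulls every row to the client and counts in memory.
IEnumerable<Books> enumerable = ctx.Books.AsEnumerable();
int clientSideCount = enumerable.Count();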
First, to get help you need to provide quantitative values for "fast" vs. "too long". Loading entities from EF will take longer than running a raw SQL statement in a client tool like TOAD etc. Are you seeing differences of 15ms vs. 15 seconds, or 15ms vs. 150ms?
To help identify and eliminate possible culprits:
Eliminate the possibility of a long-running DbContext instance tracking too many entities bogging down performance. The longer a DbContext is used and the more entities it tracks, the slower it gets. Temporarily change the code to:
List<Books> bookList = new List<Books>();
using (var context = new YourDbContext())
{
    var v = from a in context.Books select a;
    int allBooksCount = v.Count(); // I need the count of all books before filtering, but it is so slow (my first problem)
    if (isFilter)
    {
        v = v.Where(a => a.startdate <= DateTime.Now && a.enddate >= DateTime.Now);
    }
    .
    .
    bookList = v.ToList();
}
Using a fresh DbContext ensures queries are not sifting through in-memory entities after running a query to find tracked instances to return. This also ensures we are running against IQueryable off the Books DbSet within the context. We can only guess what "ctx" in your code actually represents.
Next: look at a profiler for MySQL, or have your database log the SQL statements, so you capture exactly what EF is requesting. Check that the Count and ToList each trigger just one query against the database, and then run those exact statements against the database yourself. If more than these two queries are being run, something odd is happening behind the scenes that you need to investigate, such as your example not really representing what your real code does: you could be tripping client-side evaluation (if using EF Core 2) or lazy loading. The next thing I would look at, if possible, is the execution plan for these queries, for hints such as missing indexes. (My DB experience is primarily SQL Server, so I cannot advise on tools for MySQL.)
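If this is EF6 (an assumption; the question doesn't say which version), one simple way to capture the generated SQL is the Database.Log hook:

using (var context = new YourDbContext())
{
    // EF6: writes every SQL statement EF sends to the server to the console.
    context.Database.Log = Console.WriteLine;

    var count = context.Books.Count(); // should log exactly one SELECT COUNT(*)
    var filtered = context.Books
        .Where(a => a.startdate <= DateTime.Now && a.enddate >= DateTime.Now)
        .ToList(); // should log exactly one SELECT
}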
I would log the actual SQL queries here. You can then use DESCRIBE to see how many rows each one hits. There are various tools that can analyse the queries further if DESCRIBE isn't sufficient. This way you can see whether it's the queries or the (lack of) indices that is the problem. The next step has to be guided by that.

How do I update multiple Entity models in one SQL statement?

I had the following:
List<Message> unreadMessages = this.context.Messages
    .Where(x =>
        x.AncestorMessage.MessageID == ancestorMessageID &&
        x.Read == false &&
        x.SentTo.Id == userID).ToList();

foreach (var unreadMessage in unreadMessages)
{
    unreadMessage.Read = true;
}

this.context.SaveChanges();
But there must be a way of doing this without having to do 2 SQL queries, one for selecting the items, and one for updating the list.
How do I do this?
Current idiomatic support in EF
As far as I know, there is no direct support for "bulk updates" yet in Entity Framework (there has been an ongoing discussion for bulk operation support for a while though, and it is likely it will be included at some point).
(Why) Do you want to do this?
It is clear that this is an operation that, in native SQL, can be achieved in a single statement, and provides some significant advantages over the approach followed in your question. Using the single SQL statement, only a very small amount of I/O is required between client and DB server, and the statement itself can be completely executed and optimized by the DB server. No need to transfer to and iterate through a potentially large result set client side, just to update one or two fields and send this back the other way.
How
So although not directly supported by EF, it is still possible to do this, using one of two approaches.
Option A. Handcode your SQL update statement
This is a very simple approach, that does not require any other tools/packages and can be performed Async as well:
var sql = "UPDATE TABLE x SET FIELDA = #fieldA WHERE FIELDB = #fieldb";
var parameters = new SqlParameter[] { ..., ... };
int result = db.Database.ExecuteSqlCommand(sql, parameters);
or
int result = await db.Database.ExecuteSqlCommandAsync(sql, parameters);
The obvious downside is, well, breaking the nice LINQy paradigm and having to handcode your SQL (possibly for more than one target SQL dialect).
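For the messages scenario in this question, such a handcoded statement might look like the following sketch (the table and column names are assumptions):

// Mark all unread messages for one recipient as read in a single UPDATE.
var sql = "UPDATE Messages SET [Read] = 1 WHERE [Read] = 0 AND SentToId = @userId";
int affected = await db.Database.ExecuteSqlCommandAsync(
    sql,
    new SqlParameter("@userId", userID));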
Option B. Use one of the EF extension/utility packages
For a while now, a number of open-source NuGet packages have been available that offer specific extensions to EF. Several of them provide a nice "LINQy" way to issue a single UPDATE SQL statement to the server. Two examples are:
Entity Framework Extended Library that allows performing a bulk update using a statement like:
context.Messages.Update(
    x => x.Read == false && x.SentTo.Id == userID,
    x => new Message { Read = true });
It is also available on github
EntityFramework.Utilities that allows performing a bulk update using a statement like:
EFBatchOperation
    .For(context, context.Messages)
    .Where(x => x.Read == false && x.SentTo.Id == userID)
    .Update(x => x.Read, x => x.Read = true);
It is also available on github
And there are definitely other packages and libraries out there that provide similar support.
Even SQL has to do this in two steps in a sense: an UPDATE query with a WHERE clause first runs the equivalent of a SELECT behind the scenes, filtering via the WHERE clause, and then applies the update. So really, I don't think you need to worry about improving this.
Further, the reason it's broken into two steps like this in LINQ is precisely for performance reasons. You want that "select" to be as minimal as possible, i.e. you don't want to load any more objects from the database into in-memory objects than you have to. Only then do you alter the objects (in the foreach).
If you really want to run a native UPDATE on the SQL side, you could use a System.Data.SqlClient.SqlCommand to issue the update, instead of having LINQ give you back objects that you then update. That will be faster, but then you conceptually move some of your logic out of your C# code object model space into the database model space (you are doing things in the database, not in your object space), even if the SqlCommand is being issued from your code.
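A sketch of that SqlCommand approach (the connection string, table and column names are assumptions):

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "UPDATE Messages SET [Read] = 1 WHERE [Read] = 0 AND SentToId = @userId",
    connection))
{
    command.Parameters.AddWithValue("@userId", userID);
    connection.Open();
    int affected = command.ExecuteNonQuery(); // one round trip, no entities loaded
}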

How to boost Entity Framework Unit Of Work Performance

Please consider the following situation:
I have a CSV file from which I import a couple of fields (not all of them) into SQL Server, using Entity Framework with the Unit of Work and Repository design patterns:
var newGenericArticle = new GenericArticle
{
    GlnCode = data[2],
    Description = data[5],
    VendorId = data[4],
    ItemNumber = data[1],
    ItemUOM = data[3],
    VendorName = data[12]
};

var unitOfWork = new UnitOfWork(new AppServerContext());
unitOfWork.GenericArticlesRepository.Insert(newGenericArticle);
unitOfWork.Commit();
Now, the only way to uniquely identify a record is to check four fields: GlnCode, Description, VendorId and ItemNumber.
So before I can insert a record, I need to check whether or not it exists:
var unitOfWork = new UnitOfWork(new AppServerContext());

// If the article already exists, update the vendor name.
if (unitOfWork.GenericArticlesRepository.GetAllByFilter(
    x => x.GlnCode.Equals(newGenericArticle.GlnCode) &&
         x.Description.Equals(newGenericArticle.Description) &&
         x.VendorId.Equals(newGenericArticle.VendorId) &&
         x.ItemNumber.Equals(newGenericArticle.ItemNumber)).Any())
{
    var foundArticle = unitOfWork.GenericArticlesRepository.GetByFilter(
        x => x.GlnCode.Equals(newGenericArticle.GlnCode) &&
             x.Description.Equals(newGenericArticle.Description) &&
             x.VendorId.Equals(newGenericArticle.VendorId) &&
             x.ItemNumber.Equals(newGenericArticle.ItemNumber));

    foundArticle.VendorName = newGenericArticle.VendorName;
    unitOfWork.GenericArticlesRepository.Update(foundArticle);
}
If it already exists, I need to update it, as you can see in the code above.
Now, you need to know that I'm importing around 1,500,000 records, so quite a lot.
And it's the filter that causes the CPU to reach almost 100%.
The GetAllByFilter method is quite simple and does the following:
return !Entities.Any() ? null : !Entities.Where(predicate).Any() ? null : Entities.Where(predicate).AsQueryable();
where predicate is an Expression<Func<TEntity, bool>>.
Is there anything that I can do to make sure that the server's CPU doesn't reach 100%?
Note: I'm using SQL Server 2012
Kind regards
Wrong tool for the task. You should never process a million+ records one at a time. Insert the records into a staging table using bulk insert, clean them (if need be), and then use a stored proc to do the processing in a set-based way, or use the tool designed for this: SSIS.
I've found another solution which wasn't proposed here, so I'll answer my own question.
I will have a temp table into which I import all the data, and after the import I'll execute a stored procedure that runs a MERGE command to populate the destination table. I believe this is the most performant approach.
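A rough sketch of that approach (the staging table name, destination columns, connection handling and the stagingDataTable built from the CSV are all assumptions for illustration):

// 1. Bulk copy the raw CSV rows into a staging table.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "Staging_GenericArticles";
        bulkCopy.WriteToServer(stagingDataTable); // a DataTable built from the CSV
    }
}

// 2. Merge the staging table into the destination table in one set-based statement.
context.Database.ExecuteSqlCommand(@"
    MERGE GenericArticles AS target
    USING Staging_GenericArticles AS source
        ON  target.GlnCode     = source.GlnCode
        AND target.Description = source.Description
        AND target.VendorId    = source.VendorId
        AND target.ItemNumber  = source.ItemNumber
    WHEN MATCHED THEN
        UPDATE SET target.VendorName = source.VendorName
    WHEN NOT MATCHED THEN
        INSERT (GlnCode, Description, VendorId, ItemNumber, ItemUOM, VendorName)
        VALUES (source.GlnCode, source.Description, source.VendorId,
                source.ItemNumber, source.ItemUOM, source.VendorName);");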
Have you indexed on those four fields in your database? That is the first thing that I would do.
Ok, I would recommend trying the following:
Improving bulk insert performance in Entity framework
To summarize,
Do not call SaveChanges() after every insert or update. Instead, call it every 1-2k records so that the inserts/updates are sent to the database in batches (a sketch follows below).
Also, optionally change the following parameters on your context:
yourContext.Configuration.AutoDetectChangesEnabled = false;
yourContext.Configuration.ValidateOnSaveEnabled = false;
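A sketch of the batched SaveChanges idea (the DbSet name, the source collection and the batch size of 1,000 are placeholders):

yourContext.Configuration.AutoDetectChangesEnabled = false;
yourContext.Configuration.ValidateOnSaveEnabled = false;

int pending = 0;
foreach (var article in articlesToImport)
{
    yourContext.GenericArticles.Add(article);

    if (++pending % 1000 == 0)
    {
        yourContext.SaveChanges(); // flush a batch of ~1,000 inserts
    }
}
yourContext.SaveChanges(); // flush the remainder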

EF Codefirst Bulk Insert

I need to insert around 2500 rows using EF Code First.
My original code looked something like this:
foreach (var item in listOfItemsToBeAdded)
{
    //biz logic
    context.MyStuff.Add(item);
}
This took a very long time. It was around 2.2 seconds for each DBSet.Add() call, which equates to around 90 minutes.
I refactored the code to this:
var tempItemList = new List<MyStuff>();
foreach (var item in listOfItemsToBeAdded)
{
    //biz logic
    tempItemList.Add(item);
}
context.MyStuff.ToList().AddRange(tempItemList);
This only takes around 4 seconds to run. However, the .ToList() queries all the items currently in the table, which is completely unnecessary and could be dangerous or even more time consuming in the long run. One workaround would be to do something like context.MyStuff.Where(x => x.ID == *empty guid*).AddRange(tempItemList), because then I know nothing will ever be returned.
But I'm curious if anyone else knows of an efficient way to do a bulk insert using EF Code First?
Validation is normally a very expensive part of EF; I saw great performance improvements by disabling it with:
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.ValidateOnSaveEnabled = false;
I believe I found that in a similar SO question--perhaps it was this answer
Another answer on that question rightly points out that if you really need bulk insert performance you should look at using System.Data.SqlClient.SqlBulkCopy. The choice between EF and ADO.NET for this issue really revolves around your priorities.
I have a crazy idea but I think it will help you: after adding every 100 items, call SaveChanges. I have a feeling change tracking in EF has very bad performance with huge data sets.
I would recommend this article on how to do bulk inserts using EF.
Entity Framework and slow bulk INSERTs
He explores these areas and compares performance:
Default EF (57 minutes to complete adding 30,000 records)
Replacing with ADO.NET Code (25 seconds for those same 30,000)
Context Bloat - Keep the active Context Graph small by using a new context for each Unit of Work (same 30,000 inserts take 33 seconds)
Large Lists - Turn off AutoDetectChangesEnabled (brings the time down to about 20 seconds)
Batching (down to 16 seconds)
DbTable.AddRange() - (performance is in the 12-second range)
As STW pointed out, the DetectChanges method called every time you call the Add method is VERY expensive.
Common solutions are:
Use AddRange over Add
SET AutoDetectChanges to false
SPLIT SaveChanges into multiple batches
See: Improve Entity Framework Add Performance
It's important to note that using AddRange doesn't perform a BulkInsert; it simply invokes the DetectChanges method once (after all entities are added), which greatly improves performance.
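For example, a sketch reusing the names from the question:

// AddRange attaches all entities and runs DetectChanges once, instead of once per Add.
context.Configuration.AutoDetectChangesEnabled = false;
context.MyStuff.AddRange(tempItemList);
context.SaveChanges();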
But I'm curious if anyone else knows of an efficient way to do a bulk insert using EF Code First
There are third-party libraries available that support bulk insert:
See: Entity Framework Bulk Insert library
Disclaimer: I'm the owner of Entity Framework Extensions
This library allows you to perform all bulk operations you need for your scenarios:
Bulk SaveChanges
Bulk Insert
Bulk Delete
Bulk Update
Bulk Merge
Example
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(bulk => bulk.BatchSize = 100);
// Perform Bulk Operations
context.BulkDelete(customers);
context.BulkInsert(customers);
context.BulkUpdate(customers);
// Customize Primary Key
context.BulkMerge(customers, operation =>
{
    operation.ColumnPrimaryKeyExpression = customer => customer.Code;
});
EF is not really usable for batch/bulk operations (I think ORMs in general are not).
The particular reason why this is running so slowly is the change tracker in EF. Virtually every call to the EF API results in an internal call to DetectChanges(), including DbSet.Add(). When you add 2,500 entities, this function gets called 2,500 times, and each call gets slower the more data you have added. So disabling change detection in EF should help a lot:
dataContext.Configuration.AutoDetectChangesEnabled = false;
A better solution would be to split your big bulk operation into 2500 smaller transactions, each running with their own data context. You could use msmq, or some other mechanism for reliable messaging, for initiating each of these smaller transactions.
But if your system is built around a lot of bulk operations, I would suggest finding a different solution than EF for your data access layer.
While this is a bit late and the answers and comments posted above are very useful, I will just leave this here and hope it proves useful for people who had the same problem as I did and come to this post for answers. This post still ranks high on Google (at the time of posting this answer) if you search for a way to bulk-insert records using Entity Framework.
I had a similar problem using Entity Framework and Code First in an MVC 5 application. I had a user submit a form that caused tens of thousands of records to be inserted into a table. The user had to wait for more than two and a half minutes while 60,000 records were being inserted.
After much googling, I stumbled upon BulkInsert-EF6, which is also available as a NuGet package. Reworking the OP's code:
var tempItemList = new List<MyStuff>();
foreach (var item in listOfItemsToBeAdded)
{
    //biz logic
    tempItemList.Add(item);
}

using (var transaction = context.Database.BeginTransaction())
{
    try
    {
        context.BulkInsert(tempItemList);
        transaction.Commit();
    }
    catch (Exception ex)
    {
        // Handle exception
        transaction.Rollback();
    }
}
My code went from taking >2 minutes to <1 second for 60,000 records.
Although this is a late reply, I'm posting the answer because I suffered the same pain.
I've created a new GitHub project just for that; as of now, it supports bulk insert/update/delete for SQL Server transparently using SqlBulkCopy.
https://github.com/MHanafy/EntityExtensions
There are other goodies as well, and hopefully it will be extended to do more down the track.
Using it is as simple as:
var insertsAndupdates = new List<object>();
var deletes = new List<object>();
context.BulkUpdate(insertsAndupdates, deletes);
Hope it helps!
EF6 beta 1 has an AddRange function that may suit your purpose:
INSERTing many rows with Entity Framework 6 beta 1
EF6 will be released "this year" (2013)
public static void BulkInsert(IList list, string tableName)
{
    var conn = (SqlConnection)Db.Connection;
    if (conn.State != ConnectionState.Open) conn.Open();

    using (var bulkCopy = new SqlBulkCopy(conn))
    {
        bulkCopy.BatchSize = list.Count;
        bulkCopy.DestinationTableName = tableName;
        var table = ListToDataTable(list);
        bulkCopy.WriteToServer(table);
    }
}

public static DataTable ListToDataTable(IList list)
{
    var dt = new DataTable();
    if (list.Count <= 0) return dt;

    var properties = list[0].GetType().GetProperties();
    foreach (var pi in properties)
    {
        dt.Columns.Add(pi.Name, Nullable.GetUnderlyingType(pi.PropertyType) ?? pi.PropertyType);
    }

    foreach (var item in list)
    {
        DataRow row = dt.NewRow();
        properties.ToList().ForEach(p => row[p.Name] = p.GetValue(item, null) ?? DBNull.Value);
        dt.Rows.Add(row);
    }

    return dt;
}

IQueryable<>.ToString() too slow

I'm using BatchDelete found in the answer to this question: EF Code First Delete Batch From IQueryable<T>?
The method seems to be wasting too much time building the delete clause from the IQueryable. Specifically, deleting 20,000 elements using the IQueryable below takes almost two minutes.
context.DeleteBatch(context.SomeTable.Where(x => idList.Contains(x.Id)));
All the time is spent on this line:
var sql = clause.ToString();
The line is part of this method, available on the original question linked above but pasted here for convenience:
private static string GetClause<T>(DbContext context, IQueryable<T> clause) where T : class
{
    const string Snippet = "FROM [dbo].[";

    var sql = clause.ToString();
    var sqlFirstPart = sql.Substring(sql.IndexOf(Snippet, System.StringComparison.OrdinalIgnoreCase));

    sqlFirstPart = sqlFirstPart.Replace("AS [Extent1]", string.Empty);
    sqlFirstPart = sqlFirstPart.Replace("[Extent1].", string.Empty);

    return sqlFirstPart;
}
I imagine making context.SomeTable.Where(x => idList.Contains(x.Id)) into a compiled query could help, but AFAIK you can't compile queries while using DbContext in EF 5. In theory they should be cached, but I see no sign of improvement on a second execution of the same BatchDelete.
Is there a way to make this faster? I would like to avoid manually building the SQL delete statement.
The IQueryable isn't cached, and each time you evaluate it you go out to SQL. Running ToList() or ToArray() on it will evaluate it once, and then you can work with the list as the cached version.
If you want to preserve your interfaces, use ToList().AsQueryable() and this will pass in a cached version.
Related post.
How do I cache an IQueryable object?
It seems there is no way to cache the IQueryable in this case, because the query contains a list of ids to check against and the list changes in every call.
The only way I found to avoid the two minute delay in building the query every time I had to mass-delete objects was to use ExecuteSqlCommand as below:
var list = string.Join("','", ids.Select(x => x.ToString()));
var qry = string.Format("DELETE FROM SomeTable WHERE Id IN ('{0}')", list);
context.Database.ExecuteSqlCommand(qry);
I'll mark this as the answer for now. If any other technique is suggested that doesn't rely on ExecuteSqlCommand, I'll gladly change the answer.
There is an EF pattern that works OK.
It uses projection to return ONLY the keys from the DB (projections are not added to the context, so this is pretty quick).
Then you build key-only stub POCOs, attach them to the context, and light the fuse... basically:
// Project ONLY the keys; projections aren't tracked, so this is fast.
var deleteMagazine = context.Set<DeadMeat>()
    .Where(t => t.IhateYou == true)
    .Select(t => t.TheKey)
    .ToList();

// Now instantiate a dummy POCO with the KEY only for each entry in the list.
foreach (var key in deleteMagazine)
{
    var bullet = new DeadMeat { TheKey = key };
    context.Set<DeadMeat>().Attach(bullet);
    context.Set<DeadMeat>().Remove(bullet);

    // consider saving changes every 1000 records ... trial different values for performance
    if (magazineIsEmpty) // your counter logic here :-)
        context.SaveChanges();
}

// shoot anyone still moving
context.SaveChanges();
Check SQL Server Profiler....
