I have to import about 30k rows from a CSV file to my SQL database, this sadly takes 20 minutes.
Troubleshooting with a profiler shows me that DbSet.Add is taking the most time, but why?
I have these Entity Framework Code-First classes:
public class Article
{
// About 20 properties, each property doesn't store excessive amounts of data
}
public class Database : DbContext
{
public DbSet<Article> Articles { get; set; }
}
For each item in my for loop I do:
db.Articles.Add(article);
Outside the for loop I do:
db.SaveChanges();
It's connected to my local SQL Server Express instance, but I guess nothing is written until SaveChanges is called, so the server shouldn't be the problem...
As per Kevin Ramen's comment (Mar 29)
I can confirm that setting db.Configuration.AutoDetectChangesEnabled = false makes a huge difference in speed
Running Add() on 2324 items with the defaults took 3 min 15 sec on my machine; disabling the auto-detection brought the operation down to 0.5 sec.
http://blog.larud.net/archive/2011/07/12/bulk-load-items-to-a-ef-4-1-code-first-aspx
I'm going to add to Kevin Ramen's comment by saying that if you are only doing inserts (no updates or deletes) then you can, in general, safely set the following properties before doing any inserts on the context:
DbContext.Configuration.AutoDetectChangesEnabled = false;
DbContext.Configuration.ValidateOnSaveEnabled = false;
I was having a problem with a once-off bulk import at my work. Without setting the above properties, adding about 7500 complicated objects to the context was taking over 30 minutes. Setting the above properties (so disabling EF checks and change tracking) reduced the import down to seconds.
But, again, I stress: only use this if you are doing inserts. If you need to mix inserts with updates/deletes, you can split your code into two paths, disable the EF checks for the insert part, and then re-enable the checks for the update/delete path. I have used this approach successfully to get around the slow DbSet.Add() behaviour.
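For reference, here's a minimal sketch of that insert-only fast path, assuming an EF 5/6-style DbContext like the Database class from the first question; the try/finally simply ensures the checks come back on if the same context is reused afterwards:
using (var db = new Database())
{
    db.Configuration.AutoDetectChangesEnabled = false;
    db.Configuration.ValidateOnSaveEnabled = false;
    try
    {
        foreach (var article in articles)   // articles: the parsed CSV rows
        {
            db.Articles.Add(article);
        }
        db.SaveChanges();
    }
    finally
    {
        // Re-enable the checks before doing any update/delete work on this context.
        db.Configuration.AutoDetectChangesEnabled = true;
        db.Configuration.ValidateOnSaveEnabled = true;
    }
}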
Each item in a unit-of-work has overhead, as it must check (and update) the identity manager, add to various collections, etc.
The first thing I would try is batching into, say, groups of 500 (change that number to suit), starting with a fresh (new) object context each time, as otherwise you can reasonably expect performance to degrade progressively as the context tracks more and more entities. Breaking it into batches also prevents a megalithic transaction bringing everything to a stop.
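A rough sketch of that batching idea, reusing the Database context from the question; the batch size of 500 is just a starting point to tune:
const int batchSize = 500;
Database db = null;
try
{
    int count = 0;
    foreach (var article in articles)       // articles: the parsed CSV rows
    {
        if (db == null)
        {
            db = new Database();            // fresh context for each batch
        }
        db.Articles.Add(article);

        if (++count % batchSize == 0)
        {
            db.SaveChanges();               // one smaller transaction per batch
            db.Dispose();
            db = null;
        }
    }
    if (db != null)
    {
        db.SaveChanges();                   // flush the final partial batch
    }
}
finally
{
    if (db != null) db.Dispose();
}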
Beyond that: SqlBulkCopy. It is designed for large imports with minimal overhead. It isn't EF, though.
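If you go the SqlBulkCopy route, a hedged sketch looks roughly like this; the destination table and column names are assumptions based on the Article class, and it needs System.Data plus System.Data.SqlClient:
// Build a DataTable whose columns mirror the destination table.
var table = new DataTable();
table.Columns.Add("Title", typeof(string));     // hypothetical column
table.Columns.Add("Author", typeof(string));    // hypothetical column
// ... one column per Article property ...

foreach (var article in articles)
{
    table.Rows.Add(article.Title, article.Author /*, ... */);
}

using (var bulkCopy = new SqlBulkCopy(connectionString))
{
    bulkCopy.DestinationTableName = "dbo.Articles";
    bulkCopy.BatchSize = 5000;
    bulkCopy.WriteToServer(table);
}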
There is an extremely easy to use and very fast extension here:
https://efbulkinsert.codeplex.com/
It's called "Entity Framework Bulk Insert".
The extension method itself lives in the EntityFramework.BulkInsert.Extensions namespace, so to bring it into scope add the using directive
using EntityFramework.BulkInsert.Extensions;
And then you can do this
context.BulkInsert(entities);
BTW - if you do not wish to use this extension for some reason, you could also try, instead of calling db.Articles.Add(article) for each article, building up a list of several articles at a time and then adding them to the DbContext together with AddRange (new in EF 6, along with RemoveRange).
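A quick sketch of what that might look like, reusing db and Article from the original question (MapToArticle stands in for whatever builds an Article from a parsed CSV row):
var articles = new List<Article>();
foreach (var row in csvRows)                    // csvRows: the parsed CSV lines
{
    articles.Add(MapToArticle(row));            // MapToArticle: hypothetical mapping helper
}

db.Articles.AddRange(articles);                 // one call instead of ~30k Add() calls
db.SaveChanges();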
I haven't really tried this, but my idea would be to use the ODBC text driver to load the CSV file into a DataTable and then pass that table to a SQL stored procedure as a table-valued parameter.
For the first part, try:
http://www.c-sharpcorner.com/UploadFile/mahesh/AccessTextDb12052005071306AM/AccessTextDb.aspx
For the second part, try this for the SQL procedure:
http://www.builderau.com.au/program/sqlserver/soa/Passing-table-valued-parameters-in-SQL-Server-2008/0,339028455,339282577,00.htm
Then create a SqlCommand object in C# and add to its Parameters collection a SqlParameter of type SqlDbType.Structured.
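A hedged sketch of that last step; the procedure name, the table type name, and the articlesTable DataTable are assumptions, and it needs System.Data plus System.Data.SqlClient:
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.ImportArticles", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;

    // articlesTable is the DataTable loaded from the CSV in the first step.
    SqlParameter tvp = cmd.Parameters.AddWithValue("@Articles", articlesTable);
    tvp.SqlDbType = SqlDbType.Structured;
    tvp.TypeName = "dbo.ArticleTableType";      // the user-defined table type on the SQL side

    conn.Open();
    cmd.ExecuteNonQuery();
}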
Well, I hope it helps.
Related
I got a function like this
public List<Code> GetCodes()
{
return _db.Codes.Select(p => p).ToList();
}
I have 160,000 records in this table which contains 2 columns.
Then I retrieve it as follows:
List<Code> CodeLastDigits = CodeDataProvider.GetCodes();
And then loop through it:
foreach (var codeLastDigit in CodeLastDigits)
I am just trying to understand how many times a call is made to the database to retrieve those records, and want to make sure it only happens once.
Linq will delay the call to the database until you give it a reason to go.
In your case, you are giving a reason in the ToList() method.
Once you call ToList() you will have all the items in memory and won't be hitting the database again.
You didn't mention which DB platform you are using, but if it is SQL Server, you can use SQL Server Profiler to watch the queries sent to the database. This is a good way to see how many calls are made and what SQL is being run by LINQ to SQL. As @acfrancis notes below, LINQPad can also do this.
For SQL Server, here is a good tutorial on creating a trace
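To make the deferred-execution point concrete, here is a small sketch using the _db context from the question:
IQueryable<Code> query = _db.Codes;     // no SQL has been sent yet
List<Code> codes = query.ToList();      // the single database query runs here

foreach (var code in codes)
{
    // iterating the in-memory list does not hit the database again
}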
When you call ToList(), it's going to hit the database. In this case, it appears like you'll just be hitting the database once to populate your CodeLastDigits. As long as you aren't hitting the database again in your foreach, you should be good.
As long as you have the full version of SQL Server, you can run SQL Server Profiler while stepping through your code to see what's happening on the database.
Probably once. But the more complete answer is that it depends, and you should be familiar with the case where even a simple access pattern like this one can result in many, many round trips to the database.
It's not likely a table named codes contains any complex types, but if it did, you'd want to watch out. Depending on how you access the properties of a code object, you could incur extra hits to the database if you don't use LoadWith properly.
Consider an example where you have a Code object, which contains a CodeType object (also mapped to a table) in a class structure like this:
class Code
{
    public CodeType CodeType { get; set; }
}
If you don't load CodeType objects with Code objects, Linq to SQL would contact the database on every iteration of your loop if CodeType is referenced inside the loop, because it would lazy-load the object only when needed. Watch out for this, and do some research on the LoadWith<> method and its use to get comfortable that you're not running into this situation.
foreach (var x in CodeLastDigits)
{
    Print(x.CodeType);   // touching the CodeType association here is what triggers the lazy load
}
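For reference, a minimal LINQ to SQL sketch of the LoadWith<> approach, assuming a MyDataContext (hypothetical name) and that Code exposes CodeType as a public property as in the class above; it needs System.Data.Linq:
using (var db = new MyDataContext())
{
    var options = new DataLoadOptions();
    options.LoadWith<Code>(c => c.CodeType);    // fetch CodeType together with each Code
    db.LoadOptions = options;

    var codes = db.Codes.ToList();              // one query; the loop below causes no extra round trips
    foreach (var code in codes)
    {
        Console.WriteLine(code.CodeType);
    }
}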
Lately, in apps I've been developing, I have been checking the number of rows affected by an insert, update, or delete to the database and logging an error if the number is unexpected. For example, on a simple insert, update, or delete of one row, if any number of rows other than one is returned from an ExecuteNonQuery() call, I consider that an error and log it. Also, I realize now as I type this that I do not even try to roll back the transaction if that happens, which is not best practice and should definitely be addressed. Anyway, here's code to illustrate what I mean:
I'll have a data layer function that makes the call to the db:
public static int DLInsert(Person person)
{
Database db = DatabaseFactory.CreateDatabase("dbConnString");
using (DbCommand dbCommand = db.GetStoredProcCommand("dbo.Insert_Person"))
{
db.AddInParameter(dbCommand, "#FirstName", DbType.Byte, person.FirstName);
db.AddInParameter(dbCommand, "#LastName", DbType.String, person.LastName);
db.AddInParameter(dbCommand, "#Address", DbType.Boolean, person.Address);
return db.ExecuteNonQuery(dbCommand);
}
}
Then a business layer call to the data layer function:
public static bool BLInsert(Person person)
{
if (DLInsert(person) != 1)
{
// log exception
return false;
}
return true;
}
And in the code-behind or view (I do both webforms and mvc projects):
if (BLInsert(person))
{
// carry on as normal with whatever other code after successful insert
}
else
{
// throw an exception that directs the user to one of my custom error pages
}
The more I use this type of code, the more I feel like it is overkill. Especially in the code-behind/view. Is there any legitimate reason to think a simple insert, update, or delete wouldn't actually modify the correct number of rows in the database? Is it more plausible to only worry about catching an actual SqlException and then handling that, instead of doing the monotonous check for rows affected every time?
Thanks. Hope you all can help me out.
UPDATE
Thanks everyone for taking the time to answer. I still haven't 100% decided what setup I will use going forward, but here's what I have taken away from all of your responses.
Trust the DB and .Net libraries to handle a query and do their job as they were designed to do.
Use transactions in my stored procedures to roll back the query on any errors, and potentially use RAISERROR to throw those errors back to the .NET code as a SqlException, which I could handle with a try/catch. This approach would replace the problematic return-code checking.
Would there be any issue with the second bullet point that I am missing?
I guess the question becomes, "Why are you checking this?" If it's just because you don't trust the database to perform the query, then it's probably overkill. However, there could exist a logical reason to perform this check.
For example, I worked at a company once where this method was employed to check for concurrency errors. When a record was fetched from the database to be edited in the application, it would come with a LastModified timestamp. The standard CRUD operations in the data access layer would then include a WHERE LastModified = @LastModified clause when doing an UPDATE and check the record-modified count. If no record was updated, it would assume a concurrency error had occurred.
I felt it was kind of sloppy for concurrency checking (especially the part about assuming the nature of the error), but it got the job done for the business.
What concerns me more in your example is the structure of how this is being accomplished. The 1 or 0 being returned from the data access code is a "magic number." That should be avoided. It's leaking an implementation detail from the data access code into the business logic code. If you do want to keep using this check, I'd recommend moving the check into the data access code and throwing an exception if it fails. In general, return codes should be avoided.
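As a rough sketch of moving that check into the data access code and throwing instead of returning a count (reusing the Enterprise Library helpers from the question), the magic number then stays hidden inside the data layer:
public static void DLInsert(Person person)
{
    Database db = DatabaseFactory.CreateDatabase("dbConnString");
    using (DbCommand dbCommand = db.GetStoredProcCommand("dbo.Insert_Person"))
    {
        db.AddInParameter(dbCommand, "@FirstName", DbType.String, person.FirstName);
        db.AddInParameter(dbCommand, "@LastName", DbType.String, person.LastName);
        db.AddInParameter(dbCommand, "@Address", DbType.String, person.Address);

        int rowsAffected = db.ExecuteNonQuery(dbCommand);
        if (rowsAffected != 1)
        {
            // Surface the problem here instead of leaking a return code upwards.
            throw new DataException(string.Format(
                "dbo.Insert_Person affected {0} rows; expected exactly 1.", rowsAffected));
        }
    }
}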
Edit: I just noticed a potentially harmful bug in your code as well, related to my last point above. What if more than one record is changed? It probably won't happen on an INSERT, but could easily happen on an UPDATE. Other parts of the code might assume that != 1 means no record was changed. That could make debugging very problematic :)
On the one hand, most of the time everything should behave the way you expect, and on those times the additional checks don't add anything to your application. On the other hand, if something does go wrong, not knowing about it means that the problem may become quite large before you notice it. In my opinion, the little bit of additional protection is worth the little bit of extra effort, especially if you implement a rollback on failure. It's kinda like an airbag in your car... it doesn't really serve a purpose if you never crash, but if you do it could save your life.
I've always preferred to RAISERROR in my sproc and handle exceptions rather than counting. This way, if you update a sproc to do something else, like logging/auditing, you don't have to worry about keeping the row counts in check.
Though if you like the second check in your code, or would prefer not to deal with exceptions/RAISERROR, I've seen teams return 0 on successful sproc executions for every sproc in the db, and return another number otherwise.
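On the .NET side, the RAISERROR approach ends up as a plain exception handler rather than a row-count check; a minimal sketch, given that a RAISERROR with severity 16 surfaces as a SqlException in ADO.NET (dbCommand is the command from the question's data layer):
try
{
    db.ExecuteNonQuery(dbCommand);      // the proc validates and RAISERRORs on failure
    return true;
}
catch (SqlException ex)
{
    // log ex.Message / ex.Number here, and roll back if you opened a transaction
    return false;
}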
It is absolutely overkill. You should trust that your core platform (.NET libraries, SQL Server) works correctly; you shouldn't be worrying about that.
Now, there are some related instances where you might want to test, like if transactions are correctly rolled back, etc.
If there is a need for that check, why not do it within the database itself? You save yourself a round trip, and it's done at a more 'centralized' stage: if you check in the database, you can be assured it's being applied consistently by any application that hits that database. Whereas if you put the logic in the UI, you need to make sure that every UI application hitting that particular database applies the correct logic, and applies it consistently.
I'm trying to update a field on a phone call entity and then close it. Currently, as far as I can tell, doing so takes two calls. But this is painfully slow: it has taken 30 minutes to process 60 phone calls, and I have around 200,000 to do. Is there a way to combine both into one call?
Here's my current code -
foreach (phonecall phonepointer in _businessEntityCollection.BusinessEntities.Cast<phonecall>()
.Where(phonepointer => phonepointer.statecode.Value == PhoneCallState.Open))
{
//Update fiserv_contactstatus value
phonepointer.fiserv_contactstatus = Picklist;
crmService.Update(phonepointer);
//Cancel activity
setStatePhoneCallRequest.PhoneCallState = PhoneCallState.Canceled;
setStatePhoneCallRequest.PhoneCallStatus = 200011;
setStatePhoneCallRequest.EntityId = phonepointer.activityid.Value;
crmService.Execute(setStatePhoneCallRequest);
}
Unfortunately, there's little you can do.
You COULD try and use the new SDK and the XRM context (strongly typed classes) to batch-update the phone call entities (this should be faster), but you'll still need to use the old-fashioned CrmService to actually change the state of each entity, one by one.
EDIT:
You could also directly change the state of the entities in the database, but this should be your last resort, as manual changes to the CRM DB are unsupported and dangerous.
Seriously, last resort! No, I'm NOT joking!
We currently have a production application that runs as a windows service. Many times this application will end up in a loop that can take several hours to complete. We are using Entity Framework for .net 4.0 for our data access.
I'm looking for confirmation that if we load new data into the system, after this loop is initialized, it will not result in items being added to the loop itself. When the loop is initialized we are looking for data "as of" that moment. Although I'm relatively certain that this will work exactly like using ADO and doing a loop on the data (the loop only cycles through data that was present at the time of initialization), I am looking for confirmation for co-workers.
Thanks in advance for your help.
// Update: here's some sample code in C#. The question is the same: will the enumeration change if new items are added to the table that EF is querying?
IEnumerable<myobject> myobjects = (from o in db.theobjects where o.id == myID select o);
foreach (myobject obj in myobjects)
{
//perform action on obj here
}
It depends on your precise implementation.
Once a query has been executed against the database, the results of the query will not change (assuming you aren't using lazy loading). To ensure this you can dispose of the context after retrieving the query results; this effectively "cuts the cord" between the retrieved data and the database.
Lazy loading can result in a mix of "initial" and "new" data; however once the data has been retrieved it will become a fixed snapshot and not susceptible to updates.
You mention this is a long-running process, which implies that there may be a very large amount of data involved. If you aren't able to fully retrieve all the data to be processed (due to memory limitations or other bottlenecks), then you likely can't ensure that you are working against the original data. The results are not fixed until a query is executed, and any updates prior to query execution will appear in the results.
I think your best bet is to change the logic of your application so that when the "loop" logic is deciding whether it should do another iteration or exit, you take the opportunity to load any newly added items into the list. See the pseudo code below:
var repo = new Repository();
while (repo.HasMoreItemsToProcess())
{
var entity = repo.GetNextItem();
}
Let me know if this makes sense.
The easiest way to assure that this happens - if the data itself isn't too big - is to convert the data you retrieve from the database to a List<>, e.g., something like this (pulled at random from my current project):
var sessionIds = room.Sessions.Select(s => s.SessionId).ToList();
And then iterate through the list, not through the IEnumerable<> that would otherwise be returned. Converting it to a list triggers the enumeration, and then throws all the results into memory.
If there's too much data to fit into memory, and you need to stick with an IEnumerable<>, then the answer to your question depends on various database and connection settings.
I'd take a snapshot of the IDs to be processed, quickly and as a single transaction, then work that list in the fashion you're doing today.
In addition to accomplishing the goal of not changing the sample mid-stream, this also gives you the ability to extend your solution to track status on each item as it's processed. For a long-running process, this can be very helpful for progress reporting, restart/retry capabilities, etc.
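A rough sketch of that idea; MyContext stands in for your actual DbContext type and ShouldBeProcessed is a hypothetical filter for whatever "as of now" means in your case:
// 1. Snapshot the keys up front, in one quick query.
List<int> idsToProcess;
using (var db = new MyContext())
{
    idsToProcess = db.theobjects
                     .Where(o => o.ShouldBeProcessed)   // hypothetical "as of now" filter
                     .Select(o => o.id)
                     .ToList();
}

// 2. Work the fixed list; rows added later are simply not in it.
foreach (var id in idsToProcess)
{
    using (var db = new MyContext())
    {
        var obj = db.theobjects.Single(o => o.id == id);
        // perform action on obj here; optionally record per-item status for restart/retry
    }
}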
Basically, I insert 35000 objects within one transaction:
using(var uow = new MyContext()){
for(int i = 1; i < 35000; i++) {
var o = new MyObject()...;
uow.MySet.Add(o);
}
uow.SaveChanges();
}
This takes forever!
If I use the underlying ObjectContext (via IObjectContextAdapter), it's still slow but takes around 20 seconds. It looks like DbSet<> is doing some linear searches, which take a quadratic amount of time overall...
Anyone else seeing this problem?
As already indicated by Ladislav in the comment, you need to disable automatic change detection to improve performance:
context.Configuration.AutoDetectChangesEnabled = false;
This change detection is enabled by default in the DbContext API.
The reason why the DbContext API behaves so differently from the ObjectContext API is that, when automatic change detection is enabled, many more functions of the DbContext API call DetectChanges internally than functions of the ObjectContext API do.
Here you can find a list of those functions which call DetectChanges by default. They are:
The Add, Attach, Find, Local, or Remove members on DbSet
The GetValidationErrors, Entry, or SaveChanges members on DbContext
The Entries method on DbChangeTracker
Add in particular calls DetectChanges, and that is responsible for the poor performance you experienced.
In contrast to this, the ObjectContext API calls DetectChanges automatically only in SaveChanges, not in AddObject and the other corresponding methods mentioned above. That's the reason why the default performance of ObjectContext is faster.
Why did they introduce this default automatic change detection in DbContext in so many functions? I am not sure, but it seems that disabling it and calling DetectChanges manually at the proper points is considered advanced and can easily introduce subtle bugs into your application, so use it with care.
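A small sketch of that manual pattern against the code from the question: detection is switched off for the loop and DetectChanges is run once, explicitly, before saving (with automatic detection disabled, SaveChanges does not call it for you):
using (var uow = new MyContext())
{
    uow.Configuration.AutoDetectChangesEnabled = false;
    try
    {
        for (int i = 1; i < 35000; i++)
        {
            uow.MySet.Add(new MyObject());   // initialize the object as in the question
        }
        uow.ChangeTracker.DetectChanges();   // one explicit scan instead of one per Add
        uow.SaveChanges();
    }
    finally
    {
        uow.Configuration.AutoDetectChangesEnabled = true;
    }
}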
A little empirical test with EF 4.3 Code First:
Removed 1000 objects with AutoDetectChanges = true: 23 sec
Removed 1000 objects with AutoDetectChanges = false: 11 sec
Inserted 1000 objects with AutoDetectChanges = true: 21 sec
Inserted 1000 objects with AutoDetectChanges = false: 13 sec
In EF Core (e.g. on .NET Core 2.0) this setting moved to:
context.ChangeTracker.AutoDetectChangesEnabled = false;
Besides the answers you have found here, it is important to know that at the database level it is more work to insert than it is to update. The database has to extend/allocate new space. Then it has to update at least the primary key index. Although indexes may also be updated when updating a row, it is a lot less common. If there are any foreign keys, it has to read those indexes as well to make sure referential integrity is maintained. Triggers can also play a role, although those can affect updates the same way.
All that database work makes sense for the day-to-day insert activity originated by user entries. But if you are just uploading an existing database, or have a process that generates a lot of inserts, you may want to look at ways of speeding that up by postponing the work to the end. Disabling indexes while inserting is a common approach. There are very complex optimizations that can be done depending on the case; they can be a bit overwhelming.
Just know that, in general, inserts will take longer than updates.