I'm using Entity Framework to build a database. There are two models, Workers and Skills. Each Worker has zero or more Skills. I initially read this data into memory from a CSV file somewhere and store it in a dictionary called allWorkers. Next, I write the data to the database like so:
// Populate database
using (var db = new SolverDbContext())
{
    // Add all distinct skills to database
    db.Skills.AddRange(allSkills
        .Distinct(StringComparer.InvariantCultureIgnoreCase)
        .Select(s => new Skill
        {
            Reference = s
        }));

    db.SaveChanges(); // Very quick

    var dbSkills = db.Skills.ToDictionary(k => k.Reference, v => v);

    // Add all workers to database
    var workforce = allWorkers.Values
        .Select(i => new Worker
        {
            Reference = i.EMPLOYEE_REF,
            Skills = i.GetSkills().Select(s => dbSkills[s]).ToArray(),
            DefaultRegion = "wa",
            DefaultEfficiency = i.TECH_EFFICIENCY
        });

    db.Workers.AddRange(workforce);
    db.SaveChanges(); // This call takes 00:05:00.0482197
}
The last db.SaveChanges(); takes over five minutes to execute, which I feel is far too long. I ran SQL Server Profiler while the call was executing, and what I found was thousands of calls to:
INSERT [dbo].[SkillWorkers]([Skill_SkillId], [Worker_WorkerId])
VALUES (@0, @1)
There are 16,027 rows being added to SkillWorkers, which is a fair amount of data but not huge by any means. Is there any way to optimize this code so it doesn't take 5min to run?
Update: I've looked at other possible duplicates, such as this one, but I don't think they apply. First, I'm not bulk adding anything in a loop. I'm doing a single call to db.SaveChanges(); after every row has been added to db.Workers. This should be the fastest way to bulk insert. Second, I've set db.Configuration.AutoDetectChangesEnabled to false. The SaveChanges() call now takes 00:05:11.2273888 (In other words, about the same). I don't think this really matters since every row is new, thus there are no changes to detect.
I think what I'm looking for is a way to issue a single INSERT statement containing all 16,000 skill rows.
One easy method is to use the EntityFramework.BulkInsert extension.
You can then do:
// Add all workers to database
var workforce = allWorkers.Values
    .Select(i => new Worker
    {
        Reference = i.EMPLOYEE_REF,
        Skills = i.GetSkills().Select(s => dbSkills[s]).ToArray(),
        DefaultRegion = "wa",
        DefaultEfficiency = i.TECH_EFFICIENCY
    });

db.BulkInsert(workforce);
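If you would rather not take a dependency, the slow part here, the SkillWorkers join rows, can also be pushed in with plain ADO.NET SqlBulkCopy once the Worker and Skill rows are saved and workforce has been materialized (for example with ToList()) so the generated IDs are populated on the entities. This is a swapped-in alternative rather than part of the answer above, and only a sketch: the SkillId/WorkerId property names and the connectionString are assumptions inferred from the join-table columns shown in the profiler output.

using System.Data;
using System.Data.SqlClient;

// Build an in-memory table shaped like the join table from the profiler output.
var joinRows = new DataTable();
joinRows.Columns.Add("Skill_SkillId", typeof(int));
joinRows.Columns.Add("Worker_WorkerId", typeof(int));

foreach (var worker in workforce)                           // workforce from the snippet above
    foreach (var skill in worker.Skills)
        joinRows.Rows.Add(skill.SkillId, worker.WorkerId);  // assumed key property names

// One bulk copy instead of 16,000 individual INSERT statements.
using (var bulkCopy = new SqlBulkCopy(connectionString))    // connectionString is an assumption
{
    bulkCopy.DestinationTableName = "dbo.SkillWorkers";
    bulkCopy.WriteToServer(joinRows);
}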
Related
Updating a bunch of records with what I think are standard Entity Framework techniques is much slower than batching the same queries it would generate myself. For 250 records, Entity Framework is about 10 times as slow; for 1,000 records it is about 20 times slower.
When I log the database activity for Entity Framework, I see it is generating the same basic queries I would generate myself, but it seems to be running them one at a time instead of all at once, even though I only call SaveChanges once. Is there any way to ask it to run the queries all at once?
I can't do a simple mass SQL update because in my real use case each row needs to be processed separately to determine what to set the fields to.
Sample timing code is below:
var stopwatchEntity = new System.Diagnostics.Stopwatch();
var stopwatchUpdate = new System.Diagnostics.Stopwatch();

using (var dbo = new ProjDb.dbo("Server=server;Database=database;Trusted_Connection=True;"))
{
    var resourceIds = dbo.Resources.Select(r => r.ResourceId).Take(250).ToList();
    //dbo.Database.Log += (s) => System.Diagnostics.Debug.WriteLine(s);

    stopwatchEntity.Start();
    foreach (var resourceId in resourceIds)
    {
        var resource = new ProjDb.Models.dbo.Resource { ResourceId = resourceId };
        dbo.Resources.Attach(resource);
        resource.IsBlank = false;
    }
    dbo.SaveChanges();
    stopwatchEntity.Stop();

    stopwatchUpdate.Start();
    var updateStr = "";
    foreach (var resourceId in resourceIds)
        updateStr += "UPDATE Resources SET IsBlank = 0 WHERE ResourceId = " + resourceId + ";";
    dbo.Database.ExecuteSqlCommand(updateStr);
    stopwatchUpdate.Stop();

    MessageBox.Show(stopwatchEntity.Elapsed.TotalSeconds.ToString("f") + ", " + stopwatchUpdate.Elapsed.TotalSeconds.ToString("f"));
}
As @EricEJ and @Kirchner reported, EF6 doesn't support batch update. However, some third-party libraries do.
Disclaimer: I'm the owner of the project Entity Framework Plus
EF+ Batch Update allows updating multiple rows with the same value/formula.
For example:
context.Resources
    .Where(x => resourceIds.Contains(x.ResourceId))
    .Update(x => new Resource() { IsBlank = false });
Since entities are not loaded in the context, you should get the best performance available.
Read more: http://entityframework-plus.net/batch-update
Disclaimer: I'm the owner of the project Entity Framework Extensions
If the value must differ from one row to another, this library offers a BulkUpdate feature. It is a paid library, but it supports pretty much everything you need for performance:
Bulk SaveChanges
Bulk Insert
Bulk Delete
Bulk Update
Bulk Merge
For example:
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(bulk => bulk.BatchSize = 100);
// Perform Bulk Operations
context.BulkDelete(customers);
context.BulkInsert(customers);
context.BulkUpdate(customers);
context.BulkMerge(customers);
Entity Framework 6 does not support batching; EF Core does.
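For reference, a minimal sketch of the same stub-entity update under EF Core, which batches the generated UPDATE statements into far fewer round trips inside a single SaveChanges. ProjContext and the Resource shape are assumptions that mirror the EF6 model in the question.

using (var db = new ProjContext()) // assumed EF Core equivalent of ProjDb.dbo
{
    foreach (var resourceId in resourceIds)
    {
        var resource = new Resource { ResourceId = resourceId };
        db.Resources.Attach(resource);                                  // tracked as Unchanged
        resource.IsBlank = false;
        db.Entry(resource).Property(r => r.IsBlank).IsModified = true;  // force the column into the UPDATE
    }
    db.SaveChanges(); // UPDATEs are sent in batches rather than one round trip per row
}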
I'd like to do a bulk upsert in Mongo. Basically I'm getting a list of objects from a vendor, but I don't know which ones I've gotten before (and need to be updated) vs which ones are new. One by one I could do an upsert, but UpdateMany doesn't work with upsert options.
So I've resorted to selecting the documents, updating in C#, and doing a bulk insert.
public async Task BulkUpsertData(List<MyObject> newUpsertDatas)
{
    var usernames = newUpsertDatas.Select(p => p.Username);
    var filter = Builders<MyObject>.Filter.In(p => p.Username, usernames);

    //Find all records that are in the list of newUpsertDatas (these need to be updated)
    var collection = Db.GetCollection<MyObject>("MyCollection");
    var existingDatas = await collection.Find(filter).ToListAsync();

    //loop through all of the new data,
    foreach (var newUpsertData in newUpsertDatas)
    {
        //and find the matching existing data
        var existingData = existingDatas.FirstOrDefault(p => p.Id == newUpsertData.Id);

        //If there is existing data, preserve the date created (there are other fields I preserve)
        if (existingData == null)
        {
            newUpsertData.DateCreated = DateTime.Now;
        }
        else
        {
            newUpsertData.Id = existingData.Id;
            newUpsertData.DateCreated = existingData.DateCreated;
        }
    }

    await collection.DeleteManyAsync(filter);
    await collection.InsertManyAsync(newUpsertDatas);
}
Is there a more efficient way to do this?
EDIT:
I did some speed tests.
In preparation I inserted 100,000 records of a pretty simple object. Then I upserted 200,000 records into the collection.
Method 1 is as outlined in the question. SelectMany, update in code, DeleteMany, InsertMany. This took approximately 5 seconds.
Method 2 was making a list of UpdateOneModel with Upsert = true and then doing one BulkWriteAsync. This was super slow. I could see the count in the mongo collection increasing so I know it was working. But after about 5 minutes it had only climbed to 107,000 so I canceled it.
I'm still interested to hear if anyone else has a potential solution.
Given that you've said you could do a one-by-one upsert, you can achieve what you want with BulkWriteAsync. This allows you to create one or more instances of the abstract WriteModel, which in your case would be instances of UpdateOneModel.
In order to achieve this, you could do something like the following:
var listOfUpdateModels = new List<UpdateOneModel<T>>();

// ...

var updateOneModel = new UpdateOneModel<T>(
    Builders<T>.Filter. /* etc. */,
    Builders<T>.Update. /* etc. */)
{
    IsUpsert = true
};

listOfUpdateModels.Add(updateOneModel);

// ...

await mongoCollection.BulkWriteAsync(listOfUpdateModels);
The key to all of this is the IsUpsert property on UpdateOneModel.
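As a concrete sketch using the names from the question (MyObject, Username and DateCreated come from there; SomeField is a hypothetical stand-in for whatever payload fields you actually overwrite), SetOnInsert keeps the "preserve DateCreated" behaviour, because it is only applied when the upsert inserts a new document:

var models = new List<WriteModel<MyObject>>();
foreach (var data in newUpsertDatas)
{
    var filter = Builders<MyObject>.Filter.Eq(p => p.Username, data.Username);
    var update = Builders<MyObject>.Update
        .Set(p => p.SomeField, data.SomeField)           // hypothetical payload field
        .SetOnInsert(p => p.DateCreated, DateTime.Now);  // written only when the upsert inserts
    models.Add(new UpdateOneModel<MyObject>(filter, update) { IsUpsert = true });
}

await collection.BulkWriteAsync(models, new BulkWriteOptions { IsOrdered = false });

Setting IsOrdered = false lets the server apply the writes without stopping at the first failure and can improve throughput, although, as the timings in the question show, a delete-and-reinsert can still beat per-document upserts for large batches.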
I'm trying to insert a large set of objects into the table, but I don't have an efficient way to check whether some records are already there. Every time I use this:
using Z.EntityFramework.Extensions.Core;
...
await ac.BulkInsertAsync(query, (o) => { o.?? });
it just stops the insert each time it finds a duplicate. Is there a way to either run all the queries at once without stopping at the first error, or to outright apply IGNORE?
You should check the InsertIfNotExists option. Only records that don't already exist will be inserted.
using Z.EntityFramework.Extensions.Core;
...
await ac.BulkInsertAsync(query, o => { o.InsertIfNotExists = true; });
Answer to sub-question:
I have a UNIQUE key in my table on one of the fields. How do I set it for bulk operations?
You can customize the key with the ColumnPrimaryKeyExpression option.
ctx.BulkInsert(list, options =>
{
    options.ColumnPrimaryKeyExpression = x => new { x.ColumnKey1, x.ColumnKey2 };
    options.InsertIfNotExists = true;
});
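The same two options also fit the async form used in the question (ac and query are from there); x.YourUniqueColumn is just a placeholder for whichever column your UNIQUE key covers:

await ac.BulkInsertAsync(query, options =>
{
    options.InsertIfNotExists = true;
    options.ColumnPrimaryKeyExpression = x => x.YourUniqueColumn; // placeholder column name
});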
I am using GraphDiff, along with the latest version of the Entity Framework, following the code-first approach.
I am trying to update a Food entity this way:
public void Update(Food food)
{
    using (var db = new DatabaseMappingContext())
    {
        food = db.UpdateGraph(food, map => map.OwnedEntity(f => f.FoodRecipe, withRecipe => withRecipe.
            OwnedCollection(r => r.RecipeSteps, withRecipeStep => withRecipeStep.
                OwnedCollection(rs => rs.StartObjectSlots, withStartObjectSlots => withStartObjectSlots.
                    AssociatedEntity(sos => sos.BelongingRecipeStepAsStart)
                ).
                OwnedCollection(rs => rs.EndObjectSlots, withEndObjectSlots => withEndObjectSlots.
                    AssociatedEntity(eos => eos.BelongingRecipeStepAsEnd)
                ).
                AssociatedEntity(rs => rs.ActionOfUser)
            ).
            AssociatedCollection(r => r.InteractiveObjects)
        ).
        AssociatedCollection(f => f.FoodVarieties));

        //....

        db.SaveChanges();
    }
}
StartObjectSlots and EndObjectSlots are 2 lists containing some other, irrelevant data.
The InteractiveObjects collection contains objects of the InteractiveObject type, which is the base type for a number of object types that can be put there. One of those derived types (let's say IntObjDerived) has a one-to-many property.
Now, I am trying to update the following entity this way:
ServerAdapter sa = new ServerAdapter();
//Loading a food from DB.
Food food = sa.LoadAllFoods().First();
RecipeStep rs = new RecipeStep();
rs.Name = "This is a test recipe step";
//Adding a User Action from the database.
rs.ActionOfUser = sa.LoadAllUserActions().First();
//....
//Add the step in the recipe
food.FoodRecipe.RecipeSteps.Add(rs);
//Update the food.
sa.Update(food);
Now, when the code is executed, a new empty ActionOfUser entity is inserted into the database. Additionally, a new empty entity is inserted for each of the one-to-many navigation properties of the entities mentioned above: three new recipes end up in the database, one with empty data, one half filled, and the one that was actually supposed to be saved. Both outcomes are unwanted, and I am trying to find the solution. I have experimented with some changes, but I am stuck. Any suggestions?
(I know this seems to be two questions, but I put it as one since both may stem from the same underlying problem in the database.)
EDIT: I downloaded and compiled GraphDiff in order to inspect what is going on, and I noticed the creation of some objects that are empty except for their entity ID value.
I guess these side effects are caused because, in effect, I add a new node to the object graph (a new RecipeStep), and I am not sure whether GraphDiff fully supports this.
UPDATE (tl;dr version): I tried to apply an UpdateGraph call, using Entity Framework's GraphDiff, on an object with graph depth greater than 2.
From what I have tried, it seems that GraphDiff applies double insertions in graphs of depth greater than 2, and it takes a lot of time, especially if a new node is added with sub-nodes loaded from the database. Should I follow a different approach, for example splitting the UpdateGraph call into multiple calls?
Thank you in advance!
What I finally applied as a workaround was to split the update operation into multiple UpdateGraph calls with graph depth less than or equal to 2, and to apply any sub-node additions to the graph manually:
//Update food in total graph depth <= 2.
db.UpdateGraph(food, map => map.AssociatedCollection(f => f.FoodVarieties));

//.... (Other UpdateGraph calls with graph depth <= 2)

//Update recipe steps of recipe in total graph depth <= 2.
foreach (RecipeStep recipeStep in food.FoodRecipe.RecipeSteps)
{
    recipeStep.ActionOfUser = db.UserActions.FirstOrDefault(ua => ua.EntityID == recipeStep.ActionOfUser.EntityID);

    //If you have to do an inner node adding operation in the graph, do it manually.
    if (recipeStep.EntityID == 0)
    {
        recipeStep.BelongingRecipe = db.Recipes.FirstOrDefault(r => r.EntityID == food.FoodRecipe.EntityID);
        db.RecipeSteps.Add(recipeStep);
    }
    else
    {
        //Map slots & recipeSteps applied manually here.
        recipeStep.StartObjectSlots.ForEach(sos => sos.BelongingRecipeStepAsStart = recipeStep);
        recipeStep.EndObjectSlots.ForEach(eos => eos.BelongingRecipeStepAsEnd = recipeStep);

        db.UpdateGraph(recipeStep, map => map.OwnedCollection(rs => rs.InteractiveObjectInstancesLists, withIOILists => withIOILists.
                OwnedCollection(ioil => ioil.InteractiveObjectsInstances)
            ).
            OwnedCollection(rs => rs.StartObjectSlots, withStartObjectSlots => withStartObjectSlots.
                AssociatedEntity(sos => sos.BelongingRecipeStepAsStart)
            ).
            OwnedCollection(rs => rs.EndObjectSlots, withEndObjectSlots => withEndObjectSlots.
                AssociatedEntity(eos => eos.BelongingRecipeStepAsEnd)
            ).
            AssociatedEntity(rs => rs.ActionOfUser)
        );
    }
}
Also, I noticed that the object graph update completed much faster than before. This might indicate something going wrong in GraphDiff's handling of complex graphs (depth greater than 2), or at least that I was doing something terribly wrong.
I am in the process of improving a console app, and at the moment I can't get it to update rows; it just creates a new row with the newer information in it.
class Program
{
    List<DriveInfo> driveList = DriveInfo.GetDrives().Where(x => x.IsReady).ToList<DriveInfo>(); //Get all the drive info
    Server server = new Server(); //Create the server object
    ServerDrive serverDrives = new ServerDrive();

    public static void Main()
    {
        Program c = new Program();
        c.RealDriveInfo();
        c.WriteInToDB();
    }

    public void RealDriveInfo()
    {
        //Insert information of one server
        server.ServerID = 0; //(PK) ID Auto-assigned by SQL
        server.ServerName = string.Concat(System.Environment.MachineName);

        //Inserts ServerDrives information.
        for (int i = 0; i < driveList.Count; i++)
        {
            //All Information used in dbo.ServerDrives
            serverDrives.DriveLetter = driveList[i].Name;
            serverDrives.TotalSpace = driveList[i].TotalSize;
            serverDrives.DriveLabel = driveList[i].VolumeLabel;
            serverDrives.FreeSpace = driveList[i].TotalFreeSpace;
            serverDrives.DriveType = driveList[i].DriveFormat;
            server.ServerDrives.Add(serverDrives);
        }
    }

    public void WriteInToDB()
    {
        //Add the information to an SQL Database using Linq.
        DataClasses1DataContext db = new DataClasses1DataContext(@"sqlserver");
        db.Servers.InsertOnSubmit(server);
        db.SubmitChanges();
    }
}
What I would like is for the RealDriveInfo() method to be used to update the information: run the method, and instead of creating new entries, update the currently stored rows, adding a new row only when needed rather than inserting new entries every time there is newer information.
At the moment it runs the method, gathers the relevant data, and then enters it as a new row in both tables.
Any help would be appreciated :)
It's creating a new db entry each time because you are making a new server object each time, then calling InsertOnSubmit() - which inserts (creates) a new record.
I'm not entirely sure what you are trying to do, but a db update would involve selecting an existing record, modifying it, then attaching it back to the data context and calling SubmitChanges().
This article on Updating Entities (LINQ to SQL) might help.
The problem is that you are trying to achieve update functionality with a tool that is designed to provide object-oriented querying. LINQ allows for updating existing records, but you have to use it in the proper way to achieve this.
The proper way is to fetch the data you want to update from the DB, perform the modifications, and then flush it back to the DB. So, assuming there is a table named Servers in your data context, here's an abstract example:
DataClasses1DataContext db = new DataClasses1DataContext(@"sqlserver");

var servers = db.Servers.Where(srv => srv.ID > 1000); // extract all servers with ID > 1000 using a lambda expression

foreach (var server in servers)
{
    server.Memory *= 2; // let's feed them up with memory
}

db.SubmitChanges();
Another way to achieve this is to create an entity, then attach it to the DataContext using the Table.Attach method, but it's quite a slippery slope, so I wouldn't recommend taking it until you have improved your LINQ skills.
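For completeness, a hedged sketch of that Attach approach, reusing the names from the question (existingServerId is assumed to be the key of the row you want to update). Note that attaching as modified requires either a timestamp/version column on the table or UpdateCheck.Never on the mapped members:

using (var db = new DataClasses1DataContext(@"sqlserver"))
{
    var server = new Server
    {
        ServerID = existingServerId,               // key of the existing row (assumed known)
        ServerName = System.Environment.MachineName
    };
    db.Servers.Attach(server, true);               // attach as modified; no SELECT round trip
    db.SubmitChanges();
}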
For a detailed description, see
SubmitChanges
Lambda Expressions
I understand what is being asked, and I do not have an easy answer.
For example, you have a form of values; several of the values are changed, maybe some calculated. Or the form can contain a new record.
You create a record of the values
var myrecord = new MyRecord();
Then fill in myrecord, doing whatever validation/calculations you want before you even touch the database itself.
//GetID either returns an existing ID or it returns zero if this is a new record.
myrecord.id = GetIDForRecordOrZeroIfANewRecord(uniqueName);
myrecord.value1 = txtValue1.Text;
myrecord.value2 = (DateTime)dtDate.Value;
and so on through the fields.
You now have a record; if id is zero, you can add it as a new record. But if id refers to an existing record, you seem to have no choice with LINQ except to write each value from myrecord by hand, so you end up with a function that contains something like this:
var thisRecord = (from n in mydatacontext.MyTable
                  where n.id == myrecord.id
                  select n).Single();

thisRecord.value1 = myrecord.value1;
thisRecord.value2 = myrecord.value2;
and so on through all fields.
I do it, but it seems long-winded when I already have all of the information ready in myrecord. A simple call like
mydatacontext.MyTable.Update(myrecord);
would be ideal. It is similar, in fact, to what I do with stored SQL functions in other databases; it simplifies the transfer of a record that is an update rather than a new one.
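One workaround, not a LINQ to SQL feature but a hedged sketch, is a small reflection helper that copies the public property values from the detached record onto the fetched one, so the per-field assignments collapse into a single call:

// Copies readable/writable public properties from source to target.
// Caution: this also copies the key and any association members; filter those out if that matters.
static void CopyProperties<T>(T source, T target)
{
    foreach (var prop in typeof(T).GetProperties())
    {
        if (prop.CanRead && prop.CanWrite)
            prop.SetValue(target, prop.GetValue(source, null), null);
    }
}

// Usage: fetch the tracked record, overwrite it, then submit.
var tracked = mydatacontext.MyTable.Single(n => n.id == myrecord.id);
CopyProperties(myrecord, tracked);
mydatacontext.SubmitChanges();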