I have a large collection of data entries (12,000, for example) and want to insert them via EF6 into a SQLite database. Most of the time is consumed by instantiating the data models:
At the moment I loop 12,000 times, calling new myItem():
downloaded12000Items.ForEach(result =>
{
    var myItem = new myItem
    {
        Id = result.Id,
        Description = result.Description,
        Property1 = result.Property1
    };
    resultList.Add(myItem);
});
unitOfWork.ItemRepository.InsertRange(resultList);
How can I speed up the instantiation of the models, or is there maybe another way to insert the data into the SQLite database faster?
EDIT: I have to explain my problem better. The bottleneck is NOT the Insert() into the database. To use EF6's Insert(someModel) you have to create an instance of a model class for your entity. I have to do this 12,000 times, and the instantiation of all 12,000 model class instances takes too much time.
My question was: is there a way to speed up the instantiation of the model classes, maybe by cloning or something else?
Or is there a way to insert the data into the SQLite db without using Insert(someModel), maybe by using a direct SQL command or something else? Obviously skipping the model instantiation could be helpful...
The bottleneck is probably adding the entities to the context.
unitOfWork.ItemRepository.Insert(myItem);
At first it doesn't take much time, but after hundreds or thousands of records, it does.
See also this answer for other optimizations you might be able to add (read the comments of the linked answer!).
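For illustration, here is a rough sketch of what batching plus a periodic context reset might look like with EF6. MyDbContext, Items and itemsToInsert are placeholder names, not taken from the question:
// A sketch only: saving and replacing the context every batch keeps the change
// tracker small, so Add() stays fast instead of slowing down as thousands of
// entities pile up in one context.
const int batchSize = 1000;
var context = new MyDbContext();

for (int i = 0; i < itemsToInsert.Count; i++)
{
    context.Items.Add(itemsToInsert[i]);

    if ((i + 1) % batchSize == 0)
    {
        context.SaveChanges();
        context.Dispose();
        context = new MyDbContext();   // fresh, empty change tracker
    }
}

context.SaveChanges();                 // flush the final partial batch
context.Dispose();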
How can I speed up the instantiation of the models, or is there maybe another way to insert the data into the SQLite database faster?
Use the equivalent of await Context.SaveChangesAsync() in your repo after you have finished looping and inserting the "12000 data entries".
Note it is no longer necessary to perform the following in order to improve performance:
context.Configuration.AutoDetectChangesEnabled = false; // out of date
context.Configuration.ValidateOnSaveEnabled = false; // out of date
...such code has its own drawbacks, but more importantly it is based on an out-of-date philosophy and does not take advantage of await in EF.
Here's a snippet of production code that I use to save a requirement realisation matrix:
// create your objects
var matrix = ... // in my prod code I create in excess of 32,600+ matrix cells
foreach (var cell in cellsToAdd)
{
    matrix.Cells.Add(cell);
}

using (var context = new MyDbContext())
{
    context.Matrices.Add(matrix);
    await context.SaveChangesAsync();
}
I find this works perfectly well when I insert 32,646 matrix cells in my production environment. Simply using await and SaveChangesAsync() improved performance 12 times. Other strategies, like batching, were not as effective, and disabling options such as AutoDetectChangesEnabled, though somewhat useful, arguably defeats the purpose of using an ORM.
Related
I'm testing Entity Framework with an Azure SQL db.
When inserting 1 record, the action takes 400 ms. When adding 20, it takes 2500 ms.
400 ms for inserting 1 record via EF seems like a lot.
What is the normal performance rate for EF?
Am I doing something wrong?
I'm aware that bulk insertion can be improved, but I thought that a single insert could be done a lot faster!?
var start = DateTime.Now;
testdbEntities testdbEntities = new testdbEntities();
for (int i = 0; i < 20; i++)
testdbEntities.Users.Add(new User{Name = "New user"});
testdbEntities.SaveChanges();
var end = DateTime.Now;
var timeElapsed = (end - start).TotalMilliseconds;
All common tricks like:
AutoDetectChangesEnabled = false
Use AddRange over Add
Etc.
will not work, as you have already noticed, since the performance problem is not within Entity Framework but with SQL Azure.
SQL Azure may look pretty cool at first, but it's slow as hell unless you pay for a very good Premium Database Tier.
As Evk recommended, you should try to execute a simple SQL command like "SELECT 1"; you will probably notice it takes more than 100 ms, which is ridiculously slow.
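One way to check that round-trip cost is to time a trivial command yourself; a sketch (the connection string variable is assumed, and the code needs System.Data.SqlClient and System.Diagnostics):
// Sketch: measure the raw round-trip latency to the Azure database.
using (var connection = new SqlConnection(azureConnectionString))
{
    connection.Open();

    var stopwatch = Stopwatch.StartNew();
    using (var command = new SqlCommand("SELECT 1", connection))
    {
        command.ExecuteScalar();
    }
    stopwatch.Stop();

    Console.WriteLine("Round trip: {0} ms", stopwatch.ElapsedMilliseconds);
}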
Solution:
Move to a better SQL Azure Tier
Move away from SQL Azure
Disclaimer: I'm the owner of the project Entity Framework Extensions
Another solution is using this library, which batches multiple queries/bulk operations. However, even though this library is very fast, you will still need a better SQL Azure Tier, since it looks like every database round-trip takes more than 200 ms in your case.
Each insert results in a commit and causes a log harden (a flush to disk). When writing in batches, this may not result in one flush per insert (until the log buffers are full). So try to batch the inserts somehow, for example using table-valued parameters (TVPs).
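A minimal sketch of that kind of batching with a table-valued parameter. The user-defined table type dbo.UserNameList, the stored procedure dbo.InsertUsers, and the connection and source collection are assumptions for illustration, not part of this answer:
// Sketch: one round trip and one commit for the whole batch instead of one per row.
var rows = new DataTable();
rows.Columns.Add("Name", typeof(string));
foreach (var name in namesToInsert)          // namesToInsert is a placeholder source
{
    rows.Rows.Add(name);
}

using (var command = new SqlCommand("dbo.InsertUsers", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    var parameter = command.Parameters.AddWithValue("@Users", rows);
    parameter.SqlDbType = SqlDbType.Structured;
    parameter.TypeName = "dbo.UserNameList";  // assumed user-defined table type
    command.ExecuteNonQuery();
}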
You can disable automatic change detection during your inserts. It can really improve performance. https://msdn.microsoft.com/en-us/data/jj556205.aspx
I hope it helps :)
Most EF applications make use of persistence-ignorant POCO entities and snapshot change tracking. This means that there is no code in the entities themselves to keep track of changes or notify the context of changes.
When using most POCO entities the determination of how an entity has changed (and therefore which updates need to be sent to the database) is handled by the Detect Changes algorithm. Detect Changes works by detecting the differences between the current property values of the entity and the original property values that are stored in a snapshot when the entity was queried or attached.
Snapshot change detection takes a copy of every entity in the system when it is added to the Entity Framework tracking graph. Then, as entities change, each entity is compared to its snapshot to see any changes. This occurs by calling the DetectChanges method. What's important to know about DetectChanges is that it has to go through all of your tracked entities each time it's called, so the more stuff you have in your context the longer it takes to traverse.
What Auto Detect Changes does is plug into events that happen on the context and call DetectChanges as they occur.
Whenever you add a new User object, EF internally tracks it and keeps the current state of the newly added object in its snapshot.
For bulk insert operations, EF will first insert all records into the DB and then call the DetectChanges function. So the execution time required for a bulk insert is (time required to insert all records + time required to update the EF context).
You can make your DB insertion faster by disabling AutoDetectChanges. Your code would then look like this:
using (var context = new YourContext())
{
    try
    {
        context.Configuration.AutoDetectChangesEnabled = false;
        // do your DB operations
    }
    finally
    {
        context.Configuration.AutoDetectChangesEnabled = true;
    }
}
I'm working on a project where we're receiving data from multiple sources that needs to be saved into various tables in our database.
Fast.
I've played with various methods, and the fastest I've found so far is using a collection of table-valued parameters, filling them up and periodically sending them to the database via a corresponding collection of stored procedures.
The results are quite satisfying. However, looking at disk usage (% Idle Time in Perfmon), I can see that the disk is getting periodically 'thrashed' (a 'spike' down to 0% every 13-18 seconds), whilst in between the %Idle time is around 90%. I've tried varying the 'batch' size, but it doesn't have an enormous influence.
Should I be able to get better throughput by (somehow) avoiding the spikes while decreasing the overall idle time?
What are some things I should be looking out to work out where the spiking is happening? (The database is in Simple recovery mode, and pre-sized to 'big', so it's not the log file growing)
Bonus: I've seen other questions referring to 'streaming' data into the database, but this seems to involve having a Stream from another database (last section here). Is there any way I could shoe-horn 'pushed' data into that?
A very easy way of inserting loads of data into SQL Server is, as mentioned, the 'bulk insert' method. ADO.NET offers a very easy way of doing this without the need for external files. Here's the code:
var bulkCopy = new SqlBulkCopy(myConnection);
bulkCopy.DestinationTableName = "MyTable";
bulkCopy.WriteToServer(myDataTable);
That's easy.
But: myDataTable needs to have exactly the same structure as MyTable, i.e. the names, field types and order of fields must be exactly the same. If not, well, there's a solution to that: column mapping. And this is even easier to do:
bulkCopy.ColumnMappings.Add("ColumnNameOfDataSet", "ColumnNameOfTable");
That's still easy.
But: myDataTable needs to fit into memory. If not, things become a bit more tricky, as we then need an IDataReader derivative which allows us to instantiate it from an IEnumerable.
You might get all the information you need in this article.
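If you'd rather not hand-roll such a reader, one option (my suggestion, not something from the article above) is the FastMember NuGet package, whose ObjectReader wraps an IEnumerable<T> as a data reader. A sketch, reusing the Person/connection names from the sample below and assuming columns Name and DateOfBirth:
// Sketch, assuming the FastMember package (using FastMember; using System.Data.SqlClient;):
// stream an IEnumerable<T> into SqlBulkCopy without buffering a DataTable first.
using (var bulkCopy = new SqlBulkCopy(connection))
using (var reader = ObjectReader.Create(people, "Name", "DateOfBirth"))
{
    bulkCopy.DestinationTableName = "Person";
    bulkCopy.WriteToServer(reader);
}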
Building on the code referred to in alzaimar's answer, I've got a proof of concept working with IObservable (just to see if I can). It seems to work ok. I just need to put together some tidier code to see if this is actually any faster than what I already have.
(The following code only really makes sense in the context of the test program in code download in the aforementioned article.)
Warning: NSFW, copy/paste at your peril!
private static void InsertDataUsingObservableBulkCopy(IEnumerable<Person> people,
                                                      SqlConnection connection)
{
    var sub = new Subject<Person>();

    var bulkCopy = new SqlBulkCopy(connection);
    bulkCopy.DestinationTableName = "Person";
    bulkCopy.ColumnMappings.Add("Name", "Name");
    bulkCopy.ColumnMappings.Add("DateOfBirth", "DateOfBirth");

    using (var dataReader = new ObjectDataReader<Person>(people))
    {
        var task = Task.Factory.StartNew(() =>
        {
            bulkCopy.WriteToServer(dataReader);
        });

        var stopwatch = Stopwatch.StartNew();
        foreach (var person in people) sub.OnNext(person);
        sub.OnCompleted();
        task.Wait();

        Console.WriteLine("Observable Bulk copy: {0}ms",
                          stopwatch.ElapsedMilliseconds);
    }
}
It's difficult to comment without knowing the specifics, but one of the fastest ways to get data into SQL Server is Bulk Insert from a file.
You could write the incoming data to a temp file and periodically bulk insert it.
Streaming data into a SQL Server table-valued parameter also looks like a good solution for fast inserts, as the data is held in memory. In answer to your question, yes, you could use this; you just need to turn your data into an IDataReader. There are various ways to do this, for example from a DataTable (see here).
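For completeness, a short sketch of the DataTable route (the buffered table, its helper and the destination table name are placeholders):
// Sketch: a DataTable can expose itself as an IDataReader via CreateDataReader(),
// which is enough to feed SqlBulkCopy (or a stored procedure expecting a TVP).
DataTable buffered = BuildBufferedTable();          // hypothetical helper
using (var bulkCopy = new SqlBulkCopy(connection))
using (DataTableReader reader = buffered.CreateDataReader())
{
    bulkCopy.DestinationTableName = "IncomingData"; // placeholder table name
    bulkCopy.WriteToServer(reader);
}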
If your disk is the bottleneck you could always optimise your infrastructure. Put the database on a RAM disk or SSD, for example.
I am working with a very large data set, roughly 2 million records. I have the code below but get an out-of-memory exception after it has processed around three batches, about 600,000 records. I understand that as it loops through each batch, Entity Framework lazily loads, which then tries to build up the full 2 million records in memory. Is there any way to unload a batch once I've processed it?
ModelContext dbContext = new ModelContext();
IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.OrderBy(t => t.TownID).Batch(200000);
foreach (var batch in towns)
{
    SearchClient.Instance.IndexMany(batch, SearchClient.Instance.Settings.DefaultIndex,
        "Town", new SimpleBulkParameters() { Refresh = false });
}
Note: The Batch method comes from this project: https://code.google.com/p/morelinq/
The search client is this: https://github.com/Mpdreamz/NEST
The issue is that when you get data from EF there are actually two copies of the data created: one which is returned to the user, and a second which EF holds onto and uses for change detection (so that it can persist changes to the database). EF holds this second set for the lifetime of the context, and it's this set that's running you out of memory.
You have two options to deal with this:
Renew your context each batch (see the sketch at the end of this answer)
Use .AsNoTracking() in your query, e.g.:
IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.AsNoTracking().OrderBy(t => t.TownID).Batch(200000);
This tells EF not to keep a copy for change detection. You can read a little more about what AsNoTracking does and its performance impact on my blog: http://blog.staticvoid.co.nz/2012/4/2/entity_framework_and_asnotracking
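And a sketch of the first option, reusing the setup from the question but paging with Skip/Take so each batch gets its own context (the paging mechanics here are mine, not part of the original code):
// Sketch of option 1: a fresh context per batch, so entities tracked for an
// earlier batch can be garbage collected once that context is disposed.
const int batchSize = 200000;
int processed = 0;

while (true)
{
    using (var dbContext = new ModelContext())
    {
        var batch = dbContext.Towns
                             .OrderBy(t => t.TownID)
                             .Skip(processed)
                             .Take(batchSize)
                             .ToList();
        if (batch.Count == 0)
            break;

        SearchClient.Instance.IndexMany(batch,
            SearchClient.Instance.Settings.DefaultIndex, "Town",
            new SimpleBulkParameters() { Refresh = false });

        processed += batch.Count;
    }
}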
I wrote a migration routine that reads from one DB and writes (with minor changes in layout) into another DB (of a different type), and in this case renewing the connection for each batch and using AsNoTracking() did not cut it for me.
Note that this problem occurs using a '97 version of JET. It may work flawlessly with other DBs.
However, the following algorithm did solve the Out-of-memory issue:
Use one connection for reading and one for writing/updating
Read with AsNoTracking()
Every 50 rows or so written/updated, check the memory usage, recover memory and reset the output DB context (and connected tables) as needed:
var before = System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
if (before > 800000000)
{
    dbcontextOut.SaveChanges();
    dbcontextOut.Dispose();
    GC.Collect();
    GC.WaitForPendingFinalizers();
    dbcontextOut = dbcontextOutFunc();
    tableOut = Dynamic.InvokeGet(dbcontextOut, outputTableName);
}
History
I have a list of "records" (3,500) which I save to XML and compress on exit of the program. Since:
the number of the records increases
only around 50 records need to be updated on exit
saving takes about 3 seconds
I needed another solution: an embedded database. I chose SQL CE because it works with VS without any problems and the license is OK for me (I compared it to Firebird, SQLite, EffiProz, db4o and BerkeleyDB).
The data
The record structure: 11 fields, 2 of which make up the primary key (nvarchar + byte). The other fields are bytes, datetimes, doubles and ints.
I don't use any relations, joins, indices (except for the primary key), triggers, views, and so on. It is a flat Dictionary actually: pairs of Key+Value. I modify some of them, and then I have to update them in the database. From time to time I add some new "records" and I need to store (insert) them. That's all.
LINQ approach
I have a blank database (file), so I make 3,500 inserts in a loop (one by one). I don't even check if the record already exists, because the db is blank.
Execution time? 4 minutes, 52 seconds. I fainted (mind you: XML + compress = 3 seconds).
SQL CE raw approach
I googled a bit, and despite such claims as here:
LINQ to SQL (CE) speed versus SqlCe
stating that SQL CE itself is at fault, I gave it a try.
The same loop but this time inserts are made with SqlCeResultSet (DirectTable mode, see: Bulk Insert In SQL Server CE) and SqlCeUpdatableRecord.
The outcome? Are you sitting comfortably? Well... 0.3 seconds (yes, a fraction of a second!).
The problem
LINQ is very readable, and raw operations are quite the contrary. I could write a mapper which translates all column indexes to meaningful names, but that seems like reinventing the wheel: after all, it is already done in... LINQ.
So maybe there is a way to tell LINQ to speed things up? QUESTION: how do I do it?
The code
LINQ
foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    PrimLibrary.Database.Progress record = null;

    record = new PrimLibrary.Database.Progress();
    record.Text = entry.Text;
    record.Direction = (byte)entry.dir;
    db.Progress.InsertOnSubmit(record);
    record.Status = (byte)entry.LastLearningInfo.status.Value;
    // ... and so on

    db.SubmitChanges();
}
Raw operations
SqlCeCommand cmd = conn.CreateCommand();
cmd.CommandText = "Progress";
cmd.CommandType = System.Data.CommandType.TableDirect;
SqlCeResultSet rs = cmd.ExecuteResultSet(ResultSetOptions.Updatable);
foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    SqlCeUpdatableRecord record = null;

    record = rs.CreateRecord();
    int col = 0;
    record.SetString(col++, entry.Text);
    record.SetByte(col++, (byte)entry.dir);
    record.SetByte(col++, (byte)entry.LastLearningInfo.status.Value);
    // ... and so on

    rs.Insert(record);
}
Do more work per transaction.
Commits are generally very expensive operations for a typical relational database, as the database must wait for disk flushes to ensure data is not lost (ACID guarantees and all that). Conventional HDD disk IO without specialty controllers is very slow at this sort of operation: the data must be flushed to the physical disk, so perhaps only 30-60 commits can occur per second with an IO sync in between!
See the SQLite FAQ: "INSERT is really slow - I can only do a few dozen INSERTs per second". Ignoring the different database engine, this is the exact same issue.
Normally, LINQ2SQL creates a new implicit transaction inside SubmitChanges. To avoid this implicit transaction/commit (commits are expensive operations) either:
Call SubmitChanges less (say, once outside the loop) or;
Setup an explicit transaction scope (see TransactionScope).
One example of using a larger transaction context is:
using (var ts = new TransactionScope())
{
    // LINQ2SQL will automatically enlist in the transaction scope.
    // SubmitChanges now will NOT create a new transaction/commit each time.
    DoImportStuffThatRunsWithinASingleTransaction();

    // Important: Make sure to COMMIT the transaction.
    // (The transaction used for SubmitChanges is committed to the DB.)
    // This is when the disk sync actually has to happen,
    // but it only happens once, not 3500 times!
    ts.Complete();
}
However, the semantics of an approach using a single transaction or a single call to SubmitChanges are different than that of the code above calling SubmitChanges 3500 times and creating 3500 different implicit transactions. In particular, the size of the atomic operations (with respect to the database) is different and may not be suitable for all tasks.
For LINQ2SQL updates, changing the optimistic concurrency model (disabling it or just using a timestamp field, for instance) may result in small performance improvements. The biggest improvement, however, will come from reducing the number of commits that must be performed.
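As a rough illustration, turning off the per-column check in the LINQ to SQL mapping looks something like this (a sketch only; the Text property is just an example taken from the entity above, and designer-generated code would carry this attribute on its own column mappings):
// Sketch: System.Data.Linq.Mapping attribute with the concurrency check disabled.
// With UpdateCheck.Never (or a single timestamp column), updates no longer compare
// every original column value in the WHERE clause.
[Column(UpdateCheck = UpdateCheck.Never)]
public string Text { get; set; }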
Happy coding.
I'm not positive on this, but it seems like the db.SubmitChanges() call should be made outside of the loop. Maybe that would speed things up?
I've got some text data that I'm loading into a SQL Server 2005 database using Linq-to-SQL with this method (pseudo-code):
Create a DataContext
While (new data exists)
{
    Read a record from the text file
    Create a new Record
    Populate the record
    dataContext.InsertOnSubmit(record);
}
dataContext.SubmitChanges();
The code is a little C# console application. This works fine so far, but I'm about to do an import of the real data (rather than a test subset) and this contains about 2 million rows instead of the 1000 I've tested. Am I going to have to do some clever batching or something similar to avoid the code falling over or performing woefully, or should Linq-to-SQL handle this gracefully?
It looks like this would work; however, the changes (and thus memory) that are kept by the DataContext are going to grow with each InsertOnSubmit. Maybe it's advisable to perform a SubmitChanges every 100 records?
I would also take a look at SqlBulkCopy to see if it doesn't fit your use case better.
If you need to do bulk inserts, you should check out SqlBulkCopy.
Linq-to-SQL is not really suited for doing large-scale bulk inserts.
You would want to call SubmitChanges() every 1000 records or so to flush the changes so far; otherwise you'll run out of memory.
If you want performance, you might want to bypass Linq-To-SQL and go for System.Data.SqlClient.SqlBulkCopy instead.
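A rough sketch of that route for this import (column names, the parser and the destination table are placeholders, not from the question):
// Sketch: parse the text file into a DataTable in chunks and push each chunk
// with SqlBulkCopy, so neither L2S change tracking nor the full 2 million rows
// ever sit in memory at once.
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Value", typeof(string));

using (var bulkCopy = new SqlBulkCopy(connectionString))
{
    bulkCopy.DestinationTableName = "dbo.ImportedRecords";   // placeholder table

    foreach (string line in File.ReadLines(inputPath))
    {
        table.Rows.Add(ParseLine(line));      // ParseLine is a hypothetical parser
        if (table.Rows.Count == 50000)
        {
            bulkCopy.WriteToServer(table);    // flush a chunk
            table.Clear();
        }
    }

    if (table.Rows.Count > 0)
        bulkCopy.WriteToServer(table);        // flush the remainder
}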
Just for the record, I did as marc_s and Peter suggested and chunked the data. It's not especially fast (it took about an hour and a half in the Debug configuration, with the debugger attached and quite a lot of console progress output), but it's perfectly adequate for our needs:
Create a DataContext
numRows = 0;
While (new data exists)
{
    Read a record from the text file
    Create a new Record
    Populate the record
    dataContext.InsertOnSubmit(record)

    // Submit the changes in thousand row batches
    if (numRows % 1000 == 999)
        dataContext.SubmitChanges()

    numRows++
}
dataContext.SubmitChanges()