Yet another EF performance question... I have an XLSX file which I'm processing and whose records I'm inserting into the database. The problem is that the XLSX data is un-normalized and I have to normalize it within the DB, which means making a lot of DB calls while processing. So I have a main for loop that goes through the XLSX and then checks within the DB whether records already exist. It goes something like this:
List<MainEntity> mainEntities = new List<MainEntity>();
for (int xlsIterator = 2; xlsIterator <= 300; xlsIterator++)
{
    MainEntity mainEntity = new MainEntity();
    mainEntity.data1 = XLSData1;
    mainEntity.data2 = XLSData2;

    // One database round-trip per spreadsheet row
    RelatedEntity relatedEntity = repository.FindSingleBy(x => x.MyId.Equals(data));
    mainEntity.RelatedEntity = relatedEntity;

    mainEntities.Add(mainEntity);
}
mainEntities.ForEach(s => context.MainEntities.Add(s));
context.SaveChanges();
Saving the changes is fine - the problem is with that call from the repository to the database within the for loop:
RelatedEntity relatedEntity = repository.FindSingleBy(x => x.MyId.Equals(data));
For each record from the XLSX file, I'm querying the database, and it kills the performance. 300 records take about a minute to process, and I have other related entities that I need to retrieve and attach to the main entity... Is there any way to optimize this approach, or should I just give up on Entity Framework and move this functionality to a stored procedure on the database side? I really would like to keep all logic in the business layer, but I will be processing thousands of records, and waiting more than 10 minutes for a page to finish is ridiculous. I'm thinking of just bulk loading the entire Excel file into a table and then writing a stored procedure to process it...
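For illustration, here's the kind of batched lookup I'm considering (xlsRows, RelatedKey and context.RelatedEntities are placeholder names, not my actual code):
// Collect the lookup keys from the spreadsheet up front (one pass, no DB calls).
var keys = xlsRows.Select(r => r.RelatedKey).Distinct().ToList();

// One round-trip: load every RelatedEntity needed for this batch into a dictionary.
var relatedByKey = context.RelatedEntities
    .Where(r => keys.Contains(r.MyId))
    .ToDictionary(r => r.MyId);

foreach (var row in xlsRows)
{
    var mainEntity = new MainEntity { data1 = row.Data1, data2 = row.Data2 };

    RelatedEntity related;
    if (relatedByKey.TryGetValue(row.RelatedKey, out related))
        mainEntity.RelatedEntity = related;

    mainEntities.Add(mainEntity);
}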
Any help/insights are highly appreciated! Thanks!
Related
I am trying to load two huge result sets (source and target) coming from different RDBMSs, but the problem I am struggling with is getting those two huge result sets into memory.
Below are the queries that pull the data from source and target:
Sql Server -
select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn
Oracle -
select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn
Records in Source : 12377200
Records in Target : 12266800
Following are the approaches I have tried, with some statistics:
1) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Time taken by Job1 = 01:47:25
Time taken by Job2 = 01:47:25
Time taken by Job3 = 01:48:32
There is no index on Id Column.
Major time is spent here:
var dr = command.ExecuteReader();
Problems:
There are also timeout issues, for which I had to set CommandTimeout to 0 (infinite), and that is bad.
2) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 02:02:48
There is no index on Id Column.
3) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 00:39:40
Index is present on Id column.
4) open data reader approach for reading source and target data:
Total jobs = 1
Index : Yes
Time: 00:01:43
5) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Index : Yes
Time: 00:25:12
I observed that having an index on LinkedColumn does improve performance, but the problem is that we are dealing with a 3rd-party RDBMS table which might not have an index.
We would like to keep the database server as free as possible, so the data reader approach doesn't seem like a good idea: there will be lots of jobs running in parallel, which will put a lot of pressure on the database server, and we don't want that.
Hence we want to fetch the records from source and target into our application's memory and do a one-to-one record comparison there, keeping the database server free.
Note: I want to do this in my c# application and don't want to use SSIS or Linked Server.
Update:
Source SQL query execution time in SQL Server Management Studio: 00:01:41
Target SQL query execution time in SQL Server Management Studio: 00:01:40
What is the best way to read a huge result set into memory?
Code:
static void Main(string[] args)
{
    // Running 3 jobs in parallel
    //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare())
    //};
    Compare(); // Run single job
    Console.ReadKey();
}

public static string Compare()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    var srcConnection = new SqlConnection("Source Connection String");
    srcConnection.Open();
    var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);

    var tgtConnection = new SqlConnection("Target Connection String");
    tgtConnection.Open();
    var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);

    var drA = GetReader(command1);
    var drB = GetReader(command2);

    stopwatch.Stop();
    string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
    Console.WriteLine(a);
    return a;
}

private static IDataReader GetReader(SqlCommand command)
{
    command.CommandTimeout = 0;
    return command.ExecuteReader(); // Culprit: this is where most of the time is spent
}
There is nothing (that I know of) faster than a DataReader for fetching DB records.
Working with large databases comes with its challenges; reading 10+ million records in under two minutes is pretty good.
If you want it faster you can:
jdwend's suggestion:
Use sqlcmd.exe and the Process class to run the query and put the results into a CSV file, then read the CSV into C#. sqlcmd.exe is designed to archive large databases and runs 100x faster than the C# interface. Using LINQ methods is also faster than the SqlClient classes.
Parallelize your queries and fetch concurrently, merging the results: https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/
The easiest (and IMO the best for a full SELECT *) is to throw hardware at it:
https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/
Also make sure you're testing on the PROD hardware, in Release mode, as that could otherwise skew your benchmarks.
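For illustration only, here is a minimal sketch of the second suggestion (fetching source and target concurrently and then merging/comparing); the connection strings are the same placeholders used in the question, and loading everything into lists is an assumption, not a recommendation:
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

static class ParallelFetchSketch
{
    public static void Run()
    {
        // Kick off both fetches at the same time.
        var sourceTask = Task.Run(() => Fetch("Source Connection String",
            "select Id as LinkedColumn, CompareColumn from Source order by LinkedColumn"));
        var targetTask = Task.Run(() => Fetch("Target Connection String",
            "select Id as LinkedColumn, CompareColumn from Target order by LinkedColumn"));

        Task.WaitAll(sourceTask, targetTask);
        Console.WriteLine($"Source rows: {sourceTask.Result.Count}, Target rows: {targetTask.Result.Count}");
        // ...merge/compare the two sorted lists here...
    }

    private static List<(string LinkedColumn, string CompareColumn)> Fetch(string connectionString, string sql)
    {
        var rows = new List<(string, string)>();
        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, cn) { CommandTimeout = 0 })
        {
            cn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    rows.Add((Convert.ToString(reader[0]), Convert.ToString(reader[1])));
            }
        }
        return rows;
    }
}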
This is a pattern that I use. It gets the data for a particular record set into a System.Data.DataTable instance and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle client SDK implements the same pattern.
// don't forget these using directives
using System.Data;
using System.Data.SqlClient;

// here's the code.
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");
using (var cn = new SqlConnection(connectionstring))
{
    cn.Open();
    using (var cmd = cn.CreateCommand())
    {
        cmd.CommandText = "Select [Fields] From [Table] etc etc"; // your SQL statement here.
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        } // dispose adapter
    } // dispose cmd
    cn.Close();
} // dispose cn

foreach (DataRow row in table.Rows)
{
    // do something with the data set.
}
I think I would deal with this problem in a different way.
But first, let's make some assumptions:
According to your question description, you will get data from SQL Server and Oracle
Each query will return a bunch of data
You do not specify why you need all that data in memory, nor what it will be used for.
I assume that the data you will process is going to be used multiple times and that you will not repeat both queries multiple times.
And whatever you do with the data, it probably won't all be displayed to the user at the same time.
With these foundations in place, I would proceed as follows:
Think of this problem as a data-processing task.
Have a third database, or some other place with auxiliary database tables, where you can store the full results of the two queries.
To avoid timeouts, obtain the data using paging (thousands of rows at a time) and save them into these auxiliary tables, NOT into RAM.
As soon as your logic completes the data loading (the import/migration), you can start processing it.
Data processing is what database engines are built for; they are efficient and have evolved over many years, so don't spend time reinventing the wheel. Use a stored procedure to crunch/process/merge the two auxiliary tables into one.
Now that you have all the merged data in a third auxiliary table, you can use it for display or whatever else you need.
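A hedged sketch of the staging step for the SQL Server side, assuming SQL Server 2012+ (for OFFSET/FETCH), an auxiliary table named Stage_Source, and placeholder connection strings; none of these names come from the original post:
using System.Data;
using System.Data.SqlClient;

// Illustrative sketch: page through the source and bulk-copy each page into an aux staging table.
const int pageSize = 100000;
long offset = 0;

while (true)
{
    var page = new DataTable();
    using (var src = new SqlConnection("Source Connection String"))
    using (var cmd = new SqlCommand(
        "select Id as LinkedColumn, CompareColumn from Source " +
        "order by Id offset @o rows fetch next @n rows only", src))
    {
        cmd.Parameters.AddWithValue("@o", offset);
        cmd.Parameters.AddWithValue("@n", pageSize);
        src.Open();
        using (var adapter = new SqlDataAdapter(cmd))
            adapter.Fill(page);
    }

    if (page.Rows.Count == 0)
        break;

    // One bulk copy per page into the auxiliary database keeps RAM usage bounded.
    using (var bulk = new SqlBulkCopy("Aux Connection String") { DestinationTableName = "Stage_Source" })
        bulk.WriteToServer(page);

    offset += page.Rows.Count;
}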
If you want to read it faster, you should use the raw API to get the data. Avoid frameworks like LINQ and rely on the DataReader instead. Also check whether you can tolerate dirty reads (WITH (NOLOCK) in SQL Server).
If your data is very large, try to implement partial reads: something like indexing your data, or adding a date from/to condition and reading range by range until everything is selected.
After that, consider using threading in your system to parallelize the flow: one thread to read job 1, another thread to read job 2. This will cut a lot of time.
Technicalities aside, I think there is a more fundamental problem here.
select [...] order by LinkedColumn
I observed that having an index on LinkedColumn does improve performance, but the problem is we are dealing with 3rd-party RDBMS tables which might or might not have an index.
We would like to keep the database server as free as possible
If you cannot ensure that the DB has a tree-based index on that column, it means the DB will be quite busy sorting your millions of rows. It's slow and resource hungry. Get rid of the ORDER BY in the SQL statement and perform the sort on the application side to get results faster and reduce the load on the DB... or ensure the DB has such an index!
...depending on whether this fetch is a common or a rare operation, you'll want either to enforce a proper index in the DB, or just fetch it all and sort it client-side.
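Concretely, the change is just dropping the ORDER BY from the SQL and sorting the fetched list in the application; a tiny sketch, assuming the rows were loaded into the List<(string LinkedColumn, string CompareColumn)> from the parallel-fetch sketch earlier:
// SQL becomes: select Id as LinkedColumn, CompareColumn from Source   (no ORDER BY)
// ...and the application sorts instead, so the DB never sorts millions of unindexed rows:
rows.Sort((a, b) => string.CompareOrdinal(a.LinkedColumn, b.LinkedColumn));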
I had a similar situation many years ago. Before I looked at the problem it took 5 days running continuously to move data between 2 systems using SQL.
I took a different approach.
We extracted the data from the source system into just a small number of files representing a flattened out data model and arranged the data in each file so it all naturally flowed in the proper sequence as we read from the files.
I then wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract had less than a dozen data files from the source system which turned into 30 to 40 or so load files for the target database.
That process would run in just a few minutes and I incorporated full auditing and error reporting and we could quickly spot problems and discrepancies in the source data, get them fixed, and run the processor again.
The final piece of the puzzle was a multi-threaded utility I wrote that performed a parallel bulk load on each load file into the target Oracle database. This utility created a Java process for each table and used Oracle's bulk table load program to quickly push the data into the Oracle DB.
When all was said and done that 5 day SQL-SQL transfer of millions of records turned into just 30 minutes using a combination of Java and Oracle's bulk load capabilities. And there were no errors and we accounted for every penny of every account that was transferred between systems.
So, maybe think outside the SQL box and use Java, the file system, and Oracle's bulk loader. And make sure you're doing your file I/O on solid-state drives.
If you need to process large database result sets from Java, you can opt for JDBC to give you the low-level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, and automatic fetching when navigating the domain model, and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.
A simplified example: let's assume we have a table (mapped to the class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumeric data of about ~2 KB. The JVM is run with -Xmx250m; let's assume that 250 MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some unspecified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified.
I have a large collection of 12000 data entries, for example, and want to insert them via EF6 into a SQLite database. Most of the time is consumed by the instantiation of the data models:
At the moment I call 'new myItem()' 12000 times in a loop:
downloaded12000Items.ForEach(result =>
{
    var myItem = new myItem
    {
        Id = result.Id,
        Description = result.Description,
        Property1 = result.Property1
    };
    resultList.Add(myItem);
});
unitOfWork.ItemRepository.InsertRange(resultList);
How can I speed up the instantiation of the models or is there maybe another way to insert the data faster into the sqlite database?
EDIT: I have to explain my problem better. The bottleneck is NOT the insert into the database. To use EF6's .Insert(someModel) you have to create an instance of a model class for your entity. I have to do this 12000 times, and the instantiation of all 12000 model class instances takes too much time.
My question was: is there a way to speed up the instantiation of the model classes, maybe by cloning or something else?
Or is there a way to insert the data into the SQLite DB without using .Insert(someModel), maybe by using a direct SQL command or something else? Obviously, skipping the model instantiation could help...
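For reference, a hedged sketch of the direct-SQL route (bypassing EF entirely), assuming the System.Data.SQLite provider and an Items table whose columns mirror the properties above; the connection string, table and column names are illustrative:
using System.Data.SQLite;

// Sketch: insert the downloaded items with one parameterized command inside a single
// transaction, skipping EF model instantiation entirely.
using (var cn = new SQLiteConnection("Data Source=mydb.sqlite"))
{
    cn.Open();
    using (var tx = cn.BeginTransaction())
    using (var cmd = cn.CreateCommand())
    {
        cmd.Transaction = tx;
        cmd.CommandText =
            "INSERT INTO Items (Id, Description, Property1) VALUES (@id, @desc, @prop1)";

        var pId = new SQLiteParameter("@id");
        var pDesc = new SQLiteParameter("@desc");
        var pProp = new SQLiteParameter("@prop1");
        cmd.Parameters.Add(pId);
        cmd.Parameters.Add(pDesc);
        cmd.Parameters.Add(pProp);

        foreach (var result in downloaded12000Items)
        {
            pId.Value = result.Id;
            pDesc.Value = result.Description;
            pProp.Value = result.Property1;
            cmd.ExecuteNonQuery();
        }

        tx.Commit(); // one transaction, so SQLite doesn't flush to disk on every row
    }
}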
The bottleneck is probably the adding of the entities to the context.
unitOfWork.ItemRepository.Insert(myItem);
At first it doesn't take much time, but after hundreds or thousands of records, it does.
See also this answer for other optimizations you might be able to add (read the comments of the linked answer!).
How can I speed up the instantiation of the models or is there maybe another way to insert the data faster into the sqlite database?
Use the equivalent of await Context.SaveChangesAsync() in your repo after you have finished looping and inserting the "12000 data entries". Tell me more
Note it is no longer necessary to perform the following in order to improve performance:
context.Configuration.AutoDetectChangesEnabled = false; // out of date
context.Configuration.ValidateOnSaveEnabled = false; // out of date
...such code has its own drawbacks, but more importantly it is based on an out-of-date philosophy and does not take advantage of await in EF.
Here's a snippet of production code that I use to save a requirement realisation matrix:
// create your objects
var matrix = // in my prod code I create in excess of 32,600+ matrix cells
foreach (var cell in cellsToAdd)
{
    matrix.Cells.Add(cell);
}

using (var context = new MyDbContext())
{
    context.Matrices.Add(matrix);
    await context.SaveChangesAsync();
}
I find this works perfectly well when I insert 32,646 matrix cells in my production environment. Simply using await and SaveChangesAsync() improved performance 12-fold. Other strategies, like batching, were not as effective, and disabling options such as AutoDetectChangesEnabled, though somewhat useful, arguably defeats the purpose of using an ORM.
I'm testing Entity Framework with an Azure SQL DB.
When inserting 1 record, the action takes 400 ms. When adding 20, it takes 2500 ms.
400ms for inserting 1 record via EF seems like a lot.
What is the normal performance rate for EF?
Am I doing something wrong?
I'm aware that bulk insertion can be improved, but I thought a single insert could be done a lot faster!?
var start = DateTime.Now;

testdbEntities testdbEntities = new testdbEntities();
for (int i = 0; i < 20; i++)
    testdbEntities.Users.Add(new User { Name = "New user" });
testdbEntities.SaveChanges();

var end = DateTime.Now;
var timeElapsed = (end - start).TotalMilliseconds;
All common tricks like:
AutoDetectChangesEnabled = false
Use AddRange over Add
Etc.
Will not work, as you have already noticed, since the performance problem is not within Entity Framework but with SQL Azure.
SQL Azure may look pretty cool at first, but it's slow as hell unless you pay for a very good Premium database tier.
As Evk recommended, you should try to execute a simple SQL command like "SELECT 1", and you will notice it probably takes more than 100 ms, which is ridiculously slow.
Solution:
Move to a better SQL Azure Tier
Move away from SQL Azure
Disclaimer: I'm the owner of the project Entity Framework Extensions
Another solution is using this library, which will batch multiple queries/bulk operations. However, again, even though this library is very fast, you will need a better SQL Azure tier, since it looks like every database round-trip takes more than 200 ms in your case.
Each insert results in a commit and causes a log harden (flush to disk). When writing in batches, this may not result in one flush per insert (until the log buffer is full). So try to batch the inserts somehow, for example using table-valued parameters (TVPs).
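A hedged sketch of the TVP route, assuming a user-defined table type dbo.UserTableType and a Users table already exist; the type name, table name and connection string are illustrative, not from the original post:
using System.Data;
using System.Data.SqlClient;

// Illustrative sketch: send all 20 rows in one round-trip via a table-valued parameter.
// Assumes: CREATE TYPE dbo.UserTableType AS TABLE (Name nvarchar(100));
var rows = new DataTable();
rows.Columns.Add("Name", typeof(string));
for (int i = 0; i < 20; i++)
    rows.Rows.Add("New user");

using (var cn = new SqlConnection("Azure SQL connection string"))
using (var cmd = new SqlCommand("INSERT INTO Users (Name) SELECT Name FROM @users", cn))
{
    var p = cmd.Parameters.AddWithValue("@users", rows);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.UserTableType";

    cn.Open();
    cmd.ExecuteNonQuery(); // one commit and one log flush for the whole batch
}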
You can disable the auto detect changes during your insert. It can really improve performance. https://msdn.microsoft.com/en-us/data/jj556205.aspx
I hope it helps :)
Most EF applications make use of persistent ignorant POCO entities and snapshot change tracking. This means that there is no code in the entities themselves to keep track of changes or notify the context of changes.
When using most POCO entities the determination of how an entity has changed (and therefore which updates need to be sent to the database) is handled by the Detect Changes algorithm. Detect Changes works by detecting the differences between the current property values of the entity and the original property values that are stored in a snapshot when the entity was queried or attached.
Snapshot change detection takes a copy of every entity in the system when it is added to the Entity Framework tracking graph. Then, as entities change, each entity is compared to its snapshot to find any changes. This happens when the DetectChanges method is called. What's important to know about DetectChanges is that it has to go through all of your tracked entities each time it's called, so the more entities you have in your context, the longer it takes.
What Auto Detect Changes does is plug into events which happen on the context and call DetectChanges as they occur.
Whenever you add a new User object, EF internally tracks it and keeps the current state of the newly added object in its snapshot.
For bulk insert operations, EF will first insert all records into the DB and then call the DetectChanges function. So the execution time required for a bulk insert is (time required to insert all records + time required to update the EF context).
You can make your DB insertion noticeably faster by disabling AutoDetectChanges. Your code will then look like:
using (var context = new YourContext())
{
    try
    {
        context.Configuration.AutoDetectChangesEnabled = false;
        // do your DB operations
    }
    finally
    {
        context.Configuration.AutoDetectChangesEnabled = true;
    }
}
I am working with a very large data set, roughly 2 million records. I have the code below, but I get an out-of-memory exception after it has processed around three batches, about 600,000 records. I understand that as it loops through each batch, Entity Framework lazy-loads and then ends up trying to build the full 2 million records in memory. Is there any way to unload a batch once I've processed it?
ModelContext dbContext = new ModelContext();
IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.OrderBy(t => t.TownID).Batch(200000);
foreach (var batch in towns)
{
    SearchClient.Instance.IndexMany(batch, SearchClient.Instance.Settings.DefaultIndex, "Town", new SimpleBulkParameters() { Refresh = false });
}
Note: The Batch method comes from this project: https://code.google.com/p/morelinq/
The search client is this: https://github.com/Mpdreamz/NEST
The issue is that when you get data from EF, there are actually two copies of the data created: one which is returned to the user, and a second which EF holds onto and uses for change detection (so that it can persist changes to the database). EF holds this second set for the lifetime of the context, and it's this set that's running you out of memory.
You have two options to deal with this:
Renew your context each batch
Use .AsNoTracking() in your query, e.g.:
IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.AsNoTracking().OrderBy(t => t.TownID).Batch(200000);
This tells EF not to keep a copy for change detection. You can read a little more about what AsNoTracking does and its performance impact on my blog: http://blog.staticvoid.co.nz/2012/4/2/entity_framework_and_asnotracking
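For completeness, a minimal sketch of the first option (a fresh context per batch), reusing the Towns/NEST setup from the question; the Skip/Take paging and the batch size are assumptions:
// Sketch: page through Towns with a new context per batch so the change tracker
// (and its snapshots) are released between batches.
const int batchSize = 200000;
for (int page = 0; ; page++)
{
    List<Town> batch;
    using (var dbContext = new ModelContext())
    {
        batch = dbContext.Towns
            .AsNoTracking()                 // belt and braces: no snapshots either
            .OrderBy(t => t.TownID)
            .Skip(page * batchSize)
            .Take(batchSize)
            .ToList();
    }

    if (batch.Count == 0)
        break;

    SearchClient.Instance.IndexMany(batch, SearchClient.Instance.Settings.DefaultIndex, "Town", new SimpleBulkParameters() { Refresh = false });
}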
I wrote a migration routine that reads from one DB and writes (with minor changes in layout) into another DB (of a different type) and in this case, renewing the connection for each batch and using AsNoTracking() did not cut it for me.
Note that this problem occurs using a '97 version of JET. It may work flawlessly with other DBs.
However, the following algorithm did solve the out-of-memory issue:
Use one connection for reading and one for writing/updating
Read with AsNoTracking()
Every 50 rows or so written/updated, check the memory usage, recover memory, and reset the output DB context (and connected tables) as needed:
var before = System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
if (before > 800000000)
{
    dbcontextOut.SaveChanges();
    dbcontextOut.Dispose();
    GC.Collect();
    GC.WaitForPendingFinalizers();
    dbcontextOut = dbcontextOutFunc();
    tableOut = Dynamic.InvokeGet(dbcontextOut, outputTableName);
}
I've got some text data that I'm loading into a SQL Server 2005 database using Linq-to-SQL, using this method (pseudo-code):
Create a DataContext
While (new data exists)
{
    Read a record from the text file
    Create a new Record
    Populate the record
    dataContext.InsertOnSubmit(record);
}
dataContext.SubmitChanges();
The code is a little C# console application. This works fine so far, but I'm about to do an import of the real data (rather than a test subset) and this contains about 2 million rows instead of the 1000 I've tested. Am I going to have to do some clever batching or something similar to avoid the code falling over or performing woefully, or should Linq-to-SQL handle this gracefully?
It looks like this would work; however, the changes (and thus memory) kept by the DataContext are going to grow with each InsertOnSubmit. Maybe it's advisable to perform a SubmitChanges every 100 records?
I would also take a look at SqlBulkCopy to see if it doesn't fit your use case better.
If you need to do bulk inserts, you should check out SqlBulkCopy.
Linq-to-SQL is not really suited for doing large-scale bulk inserts.
You would want to call SubmitChanges() every 1000 records or so to flush the changes so far, otherwise you'll run out of memory.
If you want performance, you might want to bypass Linq-To-SQL and go for System.Data.SqlClient.SqlBulkCopy instead.
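A hedged sketch of the SqlBulkCopy route, assuming the text-file rows can be shaped into a DataTable whose columns match the destination table; the file name, table name and columns are illustrative:
using System.Data;
using System.Data.SqlClient;

// Illustrative sketch: stream the parsed text rows into SQL Server with SqlBulkCopy,
// bypassing Linq-to-SQL change tracking entirely.
var table = new DataTable();
table.Columns.Add("Field1", typeof(string));
table.Columns.Add("Field2", typeof(string));

foreach (var line in System.IO.File.ReadLines("data.txt"))
{
    var parts = line.Split('\t');
    table.Rows.Add(parts[0], parts[1]);
}

using (var bulk = new SqlBulkCopy("YOUR_CONN_STRING"))
{
    bulk.DestinationTableName = "dbo.Records";
    bulk.BatchSize = 10000; // flush to the server in batches
    bulk.WriteToServer(table);
}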
Just for the record, I did as marc_s and Peter suggested and chunked the data. It's not especially fast (it took about an hour and a half in the Debug configuration, with the debugger attached and quite a lot of console progress output), but it's perfectly adequate for our needs:
Create a DataContext
numRows = 0
While (new data exists)
{
    Read a record from the text file
    Create a new Record
    Populate the record
    dataContext.InsertOnSubmit(record)

    // Submit the changes in thousand-row batches
    if (numRows % 1000 == 999)
        dataContext.SubmitChanges()

    numRows++
}
dataContext.SubmitChanges()