LINQ-to-SQL performance issue for mass inserts - c#

I have identified a bottleneck in my application: one sub-routine prepares a large amount of data that is later inserted into my local database via a LINQ-to-SQL data context. However, even a relatively modest amount of new data (around 100,000 rows) takes a tremendous amount of time to be saved into the database when SubmitChanges() is called, and more typically the application has to save around 200,000 to 300,000 rows.
According to SQL Server's profiler, all generated queries look like the one below, and there's one for each item the application inserts.
exec sp_executesql N'INSERT INTO [dbo].[AdjectivesExpanded]([Adjective], [Genus], [Casus], [SingularOrPlural], [Kind], [Form])
VALUES (@p0, @p1, @p2, @p3, @p4, @p5)
SELECT CONVERT(BigInt,SCOPE_IDENTITY()) AS [value]',N'@p0 bigint,@p1 char(1),@p2 tinyint,@p3 bit,@p4 tinyint,@p5 nvarchar(4000)',@p0=2777,@p1='n',@p2=4,@p3=0,@p4=3,@p5=N'neugeborener'
Does anyone have an idea how to increase the performance of mass inserts with LINQ-to-SQL data contexts, ideally without giving up the strongly-typed DataContext and falling back to hand-written queries? There's also little opportunity or room to tune the underlying database; if anything, I could disable integrity constraints, if that helps.

Are you doing something like this:
foreach (var adjective in adjectives) {
    dataContext.AdjectivesExpanded.InsertOnSubmit(adjective);
    dataContext.SubmitChanges();
}
Or:
foreach (var adjective in adjectives) {
    dataContext.AdjectivesExpanded.InsertOnSubmit(adjective);
}
dataContext.SubmitChanges();
If it is similar to the first, I would recommend changing it to something like the second. Each call to SubmitChanges walks all of the tracked objects to see what has changed.
Either way, I'm not convinced that inserting that volume of items is a good fit for Linq-to-Sql, because it has to generate and run a separate SQL statement for each row.
Could you script a stored procedure and add it as a DataContext method via the designer?

Have a look at the following page for a simple walk-through of how to change your code to use a Bulk Insert.
You just need to add the (provided) BulkInsert class to your code, make a couple of changes, and you'll see a huge improvement in performance.
Mikes Knowledge Base - BulkInserts with LINQ

An ORM is usually not a good fit for mass operations. I'd recommend an old-fashioned bulk insert to get the best performance.
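A minimal sketch of what that could look like with SqlBulkCopy (the connectionString variable and the batch size are assumptions to adapt; the column list is taken from the profiler output in the question):

// Requires System.Data and System.Data.SqlClient.
// Build a DataTable shaped like the columns in the INSERT shown above.
var table = new DataTable();
table.Columns.Add("Adjective", typeof(long));
table.Columns.Add("Genus", typeof(string));
table.Columns.Add("Casus", typeof(byte));
table.Columns.Add("SingularOrPlural", typeof(bool));
table.Columns.Add("Kind", typeof(byte));
table.Columns.Add("Form", typeof(string));

// Populate the DataTable from the prepared data instead of calling InsertOnSubmit.
table.Rows.Add(2777L, "n", (byte)4, false, (byte)3, "neugeborener");

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.AdjectivesExpanded";
    bulk.BatchSize = 5000; // stream rows to the server in chunks

    // Map columns by name so the identity column on the destination is left alone.
    foreach (DataColumn col in table.Columns)
        bulk.ColumnMappings.Add(col.ColumnName, col.ColumnName);

    bulk.WriteToServer(table); // one bulk operation instead of one INSERT per row
}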

Enumeration of EF stored procedure results

I'm calling a simple stored procedure that returns around 650 rows. There are several joins and the procedure takes about 5-6 seconds. No problem.
Enumerating the results, however, is taking about a minute to complete.
using (var context = new DBContext())
{
    var results = context.GetResults(param); // 5-6 seconds
    var resultList = results.ToList();       // 1 minute+
}
I don't use Entity Framework much, but this seems abnormal. Am I doing something wrong? Is there something I can look at to speed this up? The table is huge, but the way I read it, this code should only be enumerating the 650 results... which should take no time at all.
Note: Not sure if this is related, but the time it takes to select all rows from said table is about the same (around a minute)
The solution to my problem was to disable parameter sniffing by creating a copy of the input parameter.
alter procedure dbo.procedure
    @param int
as
begin
    set nocount on;

    declare @paramCopy int
    set @paramCopy = @param
    ...
Based on your recent edit, I have an idea of what's happening. I think that the .GetResults() call is simply getting the query ready to be run, utilizing deferred execution. Only when you are calling .ToList() in the next line is it actually going out and trying to build the entities themselves (hence the time difference).
So why is it taking so long to load? That could be for a number of reasons, including:
You might have lazy loading disabled. This will cause all of the records to be fully loaded, with all of their respective navigational properties as well, and have all of that be tracked by the DbContext. That makes for a lot of memory consumption. You might want to consider turning it on (but not everyone likes having lazy loading enabled).
You are allowing the tracker to track all of the records, which takes up memory. Instead, if the data you're grabbing is going to be read-only anyway, you might want to consider the use of AsNoTracking, like in this blog post (see the sketch below this list). That should reduce the load time.
You could be grabbing a lot of columns. I don't know what your procedure returns, but if it's a lot of rows, with lots of different columns, all of that data being shoved into memory will take a loooong time to process. Instead, you might want to consider only selecting as few columns as needed (by using a .Select() before the call to .ToList()) to only grab what you need.
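To make the last two points concrete, here is a hedged sketch, assuming the same data can also be reached through a DbSet on the context (the Results set, the Param filter and the projected columns are all hypothetical names):

// Requires using System.Data.Entity; and using System.Linq;
using (var context = new DBContext())
{
    var resultList = context.Results            // hypothetical DbSet exposing the same data as the sproc
        .AsNoTracking()                         // don't register every entity with the change tracker
        .Where(r => r.Param == param)           // hypothetical filter matching the procedure's logic
        .Select(r => new { r.Id, r.Name })      // project only the columns you actually need
        .ToList();
}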

Where to do pagination/filtering? In the database or in the code?

I have to write the code for the following method:
public IEnumerable<Product> GetProducts(int pageNumber, int pageSize, string sortKey, string sortDirection, string locale, string filterKey, string filterValue)
The method will be used by a web UI and must support pagination, sorting and filtering. The database (SQL Server 2008) has ~250,000 products. My question is the following: where do I implement the pagination, sorting and filtering logic? Should I do it in a T-SQL stored procedure or in the C# code?
I think that it is better if I do it in T-SQL but I will end up with a very complex query. On the other hand, doing that in C# implies that I have to load the entire list of products, which is also bad...
Any idea what is the best option here? Am I missing an option?
You would definitely want to have the DB do this for you. Moving ~250K records up from the database for each request will be a huge overhead. If you are using LINQ-to-SQL, the Skip and Take methods will do this (here is an example), but I don't know exactly how efficient they are.
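As a rough illustration only (MyDataContext, the Locale/Name columns and the simplified parameter list are assumptions), the paging part of your method could look like this with LINQ-to-SQL:

// Sketch: paging pushed to the database via Skip/Take (assumes a 1-based pageNumber).
public IEnumerable<Product> GetProducts(int pageNumber, int pageSize, string locale)
{
    using (var ctx = new MyDataContext())
    {
        return ctx.Products
            .Where(p => p.Locale == locale)     // filtering runs in SQL, not in memory
            .OrderBy(p => p.Name)               // a deterministic ORDER BY is required for paging
            .Skip((pageNumber - 1) * pageSize)  // translated into a ROW_NUMBER() query on SQL Server 2008
            .Take(pageSize)
            .ToList();                          // materialize before the context is disposed
    }
}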
I think another (and potentially the best) option is to use some higher-level framework that shields you from the complexity of writing the query. Entity Framework, NHibernate and LINQ (to SQL) help you a lot here. That said, the database is typically the best place to do it in your case.
I implemented pagination for my website just today. I did it with a stored procedure, even though I am using Entity Framework. I found that executing a complex query is better than fetching all records and doing the pagination in code, so do it with a stored procedure.
And the method signature you attached is essentially the way I implemented it as well.
I would definitely do it in a stored procedure, something along the lines of:
SELECT * FROM (
    SELECT
        ROW_NUMBER() OVER (ORDER BY Quantity) AS row, *
    FROM Products
) AS a WHERE row BETWEEN 11 AND 20
If you are using linq then the Take and Skip methods will take care of this for you.
Definitely in the DB for preference, if at all possible.
Sometimes you can mix things up a bit. For example, if the results come from a database function (not a stored procedure; functions can be parts of larger queries in ways that stored procedures cannot), then you can have another function do the ordering and pagination, or perhaps have Linq2SQL or similar request a page of results from that function, producing the correct SQL as needed.
If you can at least get the ordering done in the database, and users will usually only want the first few pages (which quite often happens in real use), then you can at least have reasonable performance for those cases, as only enough rows to skip to, and then take, the wanted page need to be loaded from the db. You of course still need to test that performance is reasonable in those rare cases where someone really does look for page 1,2312!
Still, that's only a compromise for cases where paging is very difficult indeed, as a rule always page in the DB unless it's either extremely difficult for some reason, or the total number of rows is guaranteed to be low.

Can I do a very large insert with Linq-to-SQL?

I've got some text data that I'm loading into a SQL Server 2005 database via Linq-to-SQL, using this method (pseudo-code):
Create a DataContext
While (new data exists)
{
Read a record from the text file
Create a new Record
Populate the record
dataContext.InsertOnSubmit(record);
}
dataContext.SubmitChanges();
The code is a little C# console application. This works fine so far, but I'm about to do an import of the real data (rather than a test subset) and this contains about 2 million rows instead of the 1000 I've tested. Am I going to have to do some clever batching or something similar to avoid the code falling over or performing woefully, or should Linq-to-SQL handle this gracefully?
It looks like this would work; however, the changes (and thus memory) that are kept by the DataContext are going to grow with each InsertOnSubmit. Maybe it's advisable to perform a SubmitChanges every 100 records?
I would also take a look at SqlBulkCopy to see if it doesn't fit your usecase better.
If you need to do bulk inserts, you should check out SqlBulkCopy.
Linq-to-SQL is not really suited for doing large-scale bulk inserts.
You would want to call SubmitChanges() every 1000 records or so to flush the changes so far otherwise you'll run out of memory.
If you want performance, you might want to bypass Linq-To-SQL and go for System.Data.SqlClient.SqlBulkCopy instead.
Just for the record, I did as marc_s and Peter suggested and chunked the data. It's not especially fast (it took about an hour and a half in the Debug configuration, with the debugger attached and quite a lot of console progress output), but it's perfectly adequate for our needs:
Create a DataContext
numRows = 0;
While (new data exists)
{
Read a record from the text file
Create a new Record
Populate the record
dataContext.InsertOnSubmit(record)
// Submit the changes in thousand row batches
if (numRows % 1000 == 999)
dataContext.SubmitChanges()
numRows++
}
dataContext.SubmitChanges()
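For reference, a runnable shape of that pseudo-code might look like the following (the context type, the Records table and the ParseRecord helper are placeholders for whatever your project actually uses):

// Requires System.IO and the generated LINQ-to-SQL context.
using (var dataContext = new MyDataContext())
{
    int numRows = 0;
    foreach (var line in File.ReadLines("data.txt"))   // streams the file one line at a time
    {
        var record = ParseRecord(line);                 // hypothetical helper that builds a Record
        dataContext.Records.InsertOnSubmit(record);

        // Submit the changes in thousand-row batches so the change tracker stays small.
        if (++numRows % 1000 == 0)
            dataContext.SubmitChanges();
    }
    dataContext.SubmitChanges();                        // flush the final partial batch
}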

What's a good alternative to firing a stored procedure 368 times to update the database?

I'm working on a .NET component that gets a set of data from the database, performs some business logic on that set of data, and then updates single records in the database via a stored procedure that looks something like spUpdateOrderDetailDiscountedItem.
For small sets of data, this isn't a problem, but when I had a very large set of data that required an iteration of 368 stored proc calls to update the records in the database, I realized I had a problem. A senior dev looked at my stored proc code and said it looked fine, but now I'd like to explore a better method for sending "batch" data to the database.
What options do I have for updating the database in batch? Is this possible with stored procs? What other options do I have?
I won't have the option of installing a full-fledged ORM, but any advice is appreciated.
Additional Background Info:
Our current data access model was built 5 years ago and all calls to the db currently get executed via modular/static functions with names like ExecQuery and GetDataTable. I'm not certain that I'm required to stay within that model, but I'd have to provide a very good justification for going outside of our current DAL to get to the DB.
Also worth noting, I'm fairly new when it comes to CRUD operations and the database. I much prefer to play/work in the .NET side of code, but the data has to be stored somewhere, right?
Stored Proc contents:
ALTER PROCEDURE [dbo].[spUpdateOrderDetailDiscountedItem]
    -- Add the parameters for the stored procedure here
    @OrderDetailID decimal = 0,
    @Discount money = 0,
    @ExtPrice money = 0,
    @LineDiscountTypeID int = 0,
    @OrdersID decimal = 0,
    @QuantityDiscounted money = 0,
    @UpdateOrderHeader int = 0,
    @PromoCode varchar(6) = '',
    @TotalDiscount money = 0
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    -- Insert statements for procedure here
    Update OrderDetail
    Set Discount = @Discount, ExtPrice = @ExtPrice, LineDiscountTypeID = @LineDiscountTypeID, LineDiscountPercent = @QuantityDiscounted
    From OrderDetail with (nolock)
    Where OrderDetailID = @OrderDetailID

    if @UpdateOrderHeader = -1
    Begin
        -- This code should get called the last time this query is executed, but only then.
        exec spUpdateOrdersHeaderForSkuGroupSourceCode @OrdersID, 7, 0, @PromoCode, @TotalDiscount
    End
If you are using SQL 2008, then you can use a table-valued parameter to push all of the updates in one sproc call.
Update:
Incidentally, we are using this in combination with the merge statement. That way sql server takes care of figuring out if we are inserting new records or updating existing ones. This mechanism is used at several major locations in our web app and handles hundreds of changes at a time. During regular load we will see this proc get called around 50 times a second and it is MUCH faster than any other way we've found... and certainly a LOT cheaper than buying bigger DB servers.
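For example (a sketch under assumptions: the dbo.OrderDetailDiscountType table type, the batch proc name and the connectionString variable are made up, and only a few columns are shown), pushing all of the rows in one call looks roughly like this from ADO.NET:

// Requires System.Data and System.Data.SqlClient.
// Build one DataTable matching a CREATE TYPE dbo.OrderDetailDiscountType AS TABLE (...) on the server.
var rows = new DataTable();
rows.Columns.Add("OrderDetailID", typeof(decimal));
rows.Columns.Add("Discount", typeof(decimal));
rows.Columns.Add("ExtPrice", typeof(decimal));
// ... one column per field the proc needs, one row per record to update

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.spUpdateOrderDetailDiscountedItems", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;

    var p = cmd.Parameters.AddWithValue("@Items", rows);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.OrderDetailDiscountType"; // must match the table type defined on the server

    conn.Open();
    cmd.ExecuteNonQuery(); // inside the proc, a single MERGE/UPDATE handles every row at once
}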
An easy alternative I've seen in use is to build a single SQL string consisting of successive EXECs of the sproc with the parameters inlined in the string. Not sure if this is advised or not, but from the .NET perspective you are only populating one SqlCommand and calling ExecuteNonQuery once...
Note if you choose this then please, please use the StringBuilder! :-)
Update: I much prefer Chris Lively's answer, didn't know about table-valued parameters until now... unfortunately the OP is using 2005.
You can send the full set of data as XML input to the stored procedure. Then you can perform set operations to modify the database. Set-based will beat RBAR (row-by-agonizing-row) on performance almost every single time.
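Since the OP is on SQL 2005, a hedged sketch of the client side of that approach could look like this (the proc name, the element layout, the updates collection and connectionString are assumptions; inside the proc you would shred the XML with nodes()/value() and join it to OrderDetail):

// Requires System.Linq, System.Xml.Linq, System.Data and System.Data.SqlClient.
// Serialize the whole batch as one XML document.
var xml = new XElement("Items",
    updates.Select(u => new XElement("Item",
        new XAttribute("OrderDetailID", u.OrderDetailID),
        new XAttribute("Discount", u.Discount),
        new XAttribute("ExtPrice", u.ExtPrice))));

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.spUpdateOrderDetailDiscountedItemsXml", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@ItemsXml", SqlDbType.Xml).Value = xml.ToString();

    conn.Open();
    cmd.ExecuteNonQuery(); // one round trip; the proc does a set-based UPDATE against the shredded XML
}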
If you are using a version of SQL Server prior to 2008, you can move your code entirely into the stored procedure itself.
There are good and "bad" things about this.
Good
No need to pull the data across a network wire.
Faster if your logic is set based
Scales up
Bad
If you have rules against any logic in the database, this would break your design.
If the logic cannot be set based then you might end up with a different set of performance problems
If you have outside dependencies, this might increase difficulty.
Without details on exactly what operations you are performing on the data it's hard to give a solid recommendation.
UPDATE
Ben asked what I meant in one of my comments about the CLR and SQL Server. Read Using CLR Integration in SQL Server 2005. The basic idea is that you can write .Net code to do your data manipulation and have that code live inside the SQL server itself. This saves you from having to read all of the data across the network and send updates back that way.
The code is callable by your existing procs and gives you the entire power of .NET, so you don't have to do things like cursors. The SQL will stay set-based while the .NET code can perform operations on individual records.
Incidentally, this is how things like hierarchyid were implemented in SQL 2008.
The only real downside is that some DBA's don't like to introduce developer code like this into the database server. So depending on your environment, this may not be an option. However, if it is, then it is a very powerful way to take care of your problem while leaving the data and processing within your database server.
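For what it's worth, the skeleton of such a CLR procedure is quite small (a sketch only; the class, method and parameter names are illustrative):

using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class DiscountProcedures
{
    [SqlProcedure]
    public static void ApplyOrderDiscounts(SqlInt32 ordersId)
    {
        // "context connection=true" reuses the caller's connection inside the server process,
        // so the data never crosses the network.
        using (var conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            // Read the rows, run the business logic in .NET, then issue set-based updates here.
        }
    }
}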
Could you create a batched statement with 368 calls to your proc? Then at least you would not have 368 round trips. In pseudo-code:
var lotsOfCommands = "spUpdateOrderDetailDiscountedItem 1; spUpdateOrderDetailDiscountedItem 2; ... spUpdateOrderDetailDiscountedItem 368";
var command = new SqlCommand(lotsOfCommands, connection);
command.CommandType = CommandType.Text;
// execute command
I had issues when trying to do the same thing (via inserts, updates, whatever). While using an OleDbCommand with parameters, it took a lot of time to constantly re-create the object and parameters each time I called it. So, I made the command a property on my object and also added the appropriate parameters to it up front. Then, when I needed to actually call/execute it, I would loop through each parameter in the object, set it to whatever I needed it to be, and then execute it. This created a SIGNIFICANT performance improvement. Pseudo-code of my operation:
protected OleDbCommand oSQLInsert = new OleDbCommand();
// the "?" are place-holders for parameters... can be named parameters,
// just for visual purposes
oSQLInsert.CommandText = "insert into MyTable ( fld1, fld2, fld3 ) values ( ?, ?, ? )";
// Now, add the parameters
OleDbParameter NewParm = new OleDbParameter("parmFld1", 0);
oSQLInsert.Parameters.Add( NewParm );
NewParm = new OleDbParameter("parmFld2", "something" );
oSQLInsert.Parameters.Add( NewParm );
NewParm = new OleDbParameter("parmFld3", 0);
oSQLInsert.Parameters.Add( NewParm );
Now the SQL command and the place-holders for the call are all ready to go. Then, when I'm ready to actually call it, I would do something like:
oSQLInsert.Parameters[0].Value = 123;
oSQLInsert.Parameters[1].Value = "New Value";
oSQLInsert.Parameters[2].Value = 3;
Then, just execute it. The repetition of hundreds of calls would otherwise be killed, time-wise, by re-creating your commands over and over...
Good luck.
Is this a one-time action (like "just import those 368 new customers once") or do you regularly have to do 368 sproc calls?
If it's a one-time action, just go with the 368 calls.
(if the sproc does much more than just updates and is likely to drag down the performance, run it in the evening or at night or whenever no one's working).
IMO, premature optimization of database calls for one-time actions is not worth the time you spend with it.
Bulk CSV Import
(1) Build the data output as CSV via a StringBuilder, then do a bulk CSV import:
http://msdn.microsoft.com/en-us/library/ms188365.aspx
Table-valued parameters would be best, but since you're on SQL 05, you can use the SqlBulkCopy class to insert batches of records. In my experience, this is very fast.

Programming pattern using typed datasets in VS 2008

I'm currently doing the following to use typed datasets in vs2008:
Right click on "app_code" add new dataset, name it tableDS.
Open tableDS, right click, add "table adapter"
In the wizard, choose a predefined connection string and "Use SQL statements".
Enter select * from tablename, then Next + Next to finish. (I generate one table adapter for each table in my DB.)
In my code I do the following to get a row of data when I only need one:
cpcDS.tbl_cpcRow tr = (cpcDS.tbl_cpcRow)(new cpcDSTableAdapters.tbl_cpcTableAdapter()).GetData().Select("cpcID = " + cpcID)[0];
I believe this will get the entire table from the database and do the filtering in .NET (i.e. not optimal). Is there any way I can get the table adapter to filter the result set on the database instead (i.e. what I want is to send select * from tbl_cpc where cpcID = 1 to the database)?
And as a side note, I think this is a fairly OK design pattern for getting data from a database in VS 2008. It's fairly easy to code with, read and maintain. But I would like to know if there are any better design patterns out there. I use the datasets for read/update/insert and delete.
A bit of a shift, but you ask about different patterns - how about LINQ? Since you are using VS2008, it is possible (although not guaranteed) that you might also be able to use .NET 3.5.
A LINQ-to-SQL data-context provides much more managed access to data (filtered, etc). Is this an option? I'm not sure I'd go "Entity Framework" at the moment, though (see here).
Edit per request:
to get a row from the data-context, you simply need to specify the "predicate" - in this case, a primary key match:
int id = ... // the primary key we want to look for
using (var ctx = new MydataContext()) {
    SomeType record = ctx.SomeTable.Single(x => x.SomeColumn == id);
    // ... etc
    // ctx.SubmitChanges(); // to commit any updates
}
The use of Single above is deliberate - this particular usage [Single(predicate)] allows the data-context to make full use of local in-memory data - i.e. if the predicate is just on the primary key columns, it might not have to touch the database at all if the data-context has already seen that record.
However, LINQ is very flexible; you can also use "query syntax" - for example, a slightly different (list) query:
var myOrders = from row in ctx.Orders
               where row.CustomerID == id && row.IsActive
               orderby row.OrderDate
               select row;
etc
There are two potential problems with using typed datasets.
One is testability: it's fairly hard work to set up the objects you want to use in a unit test when using typed datasets.
The other is maintainability. Using typed datasets is typically a symptom of a deeper problem: I'm guessing that all your business rules live outside the datasets, and a fair few of them take datasets as input and output some aggregated values based on them. This leads to business logic leaking all over the place, and though it will all be hunky-dory for the first 6 months, it will start to bite you after a while. Such a use of DataSets is fundamentally non-object-oriented.
That being said, it's perfectly possible to have a sensible architecture using datasets, but it doesn't come naturally. An ORM will be harder to set up initially, but will lend itself nicely to writing maintainable and testable code, so you don't have to look back on the mess you made 6 months from now.
You can add a query with a where clause to the tableadapter for the table you're interested in.
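For example, if you add a query built from select * from tbl_cpc where cpcID = @cpcID and let the wizard name it something like GetDataByCpcID (the method name is whatever you choose in the wizard), the call from the question becomes:

// Only the matching row is fetched; the WHERE clause runs on the database.
cpcDS.tbl_cpcRow tr = new cpcDSTableAdapters.tbl_cpcTableAdapter().GetDataByCpcID(cpcID)[0];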
LINQ is nice, but it's really just shortcut syntax for what the OP is already doing.
Typed Datasets make perfect sense unless your data model is very complex. Then writing your own ORM would be the best choice. I'm a little confused as to why Andreas thinks typed datasets are hard to maintain. The only annoying thing about them is that the insert, update, and delete commands are removed whenever the select command is changed.
Also, the speed advantage of creating a typed dataset versus your own ORM lets you focus on the app itself and not the data access code.
