Better ways to merge data from two sources using LINQ - C#

I have some C#/LINQ code used to merge data from an Excel file into a DB, and it needs better performance.
There are:
1. A list read from the Excel file: List<Score> newScoreList
2. A DB table named Scores, with composite primary key (peopleId, testDate)
I need to merge the data from the list into the table, and if there is any duplicate data, update it.
My current solution is:
1) Find the duplicate data with this LINQ expression:
var dupliData =
    from newScore in newScoreList
    from oldScore in db.Scores
    where newScore.peopleId == oldScore.peopleId && newScore.testDate == oldScore.testDate
    select oldScore;
2) Delete the duplicate data.
db.Scores.DeleteAllOnSubmit(dupliData);
3) Insert the new data from list.
db.Scores.InsertAllOnSubmit(newScoreList);
Could anybody give me a better solution?

I really hate stored procedures in general, but this is probably a perfect case for using one. My T-SQL is rusty, but this should give you the idea.
CREATE PROCEDURE dbo.InsertOrUpdateScore
(
    @id int,
    @date datetime,
    @result varchar(20)
)
AS
IF NOT EXISTS (SELECT id FROM Scores WHERE id = @id AND date = @date)
BEGIN
    INSERT INTO Scores (id, date, result) VALUES (@id, @date, @result)
END
ELSE
BEGIN
    UPDATE Scores
    SET result = @result
    WHERE id = @id AND date = @date
END
GO
Now in the LINQ to SQL designer, select the Score entity and change its Insert and Update behavior to use the stored procedure you just created. Make sure the user accessing the database has EXECUTE permission on the sproc.
This should perform quite a bit quicker than your version. You're trading the IN clause for N SELECTs against an index, which may or may not be quicker; however, the result set of the IN clause is not transported back to the client over the network, which could save quite a bit of time.
Profile exactly how long your method is taking before implementing this, so you can gauge if this is truly quicker.
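For a quick measurement, something like this sketch is usually enough (System.Diagnostics.Stopwatch; MergeScores is a hypothetical wrapper around your current delete-then-insert code):
var sw = System.Diagnostics.Stopwatch.StartNew();
MergeScores(newScoreList); // hypothetical wrapper around the current merge logic
sw.Stop();
Console.WriteLine("Merge took {0} ms", sw.ElapsedMilliseconds);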
I'm not sure if this is the only way to create a Score in your application, but you might want to consider the case where you're INSERTing a record that doesn't yet have an ID. You'd need to modify the sproc to allow @id to be null, and handle the INSERT appropriately.
Then it should just be:
db.Scores.InsertAllOnSubmit(newScoreList);
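If you'd rather wire the sproc up in code than in the designer, a sketch along these lines should work. The context name ScoresDataContext and the Score property names are assumptions based on the question; LINQ to SQL's generated contexts declare Insert<Entity>/Update<Entity> partial methods that override the default generated SQL when you implement them:
public partial class ScoresDataContext
{
    // LINQ to SQL calls this instead of generating its own INSERT statement.
    partial void InsertScore(Score instance)
    {
        ExecuteCommand(
            "EXEC dbo.InsertOrUpdateScore @id = {0}, @date = {1}, @result = {2}",
            instance.peopleId, instance.testDate, instance.result);
    }

    // The sproc already handles the "row exists" case, so reuse it for updates.
    partial void UpdateScore(Score instance)
    {
        ExecuteCommand(
            "EXEC dbo.InsertOrUpdateScore @id = {0}, @date = {1}, @result = {2}",
            instance.peopleId, instance.testDate, instance.result);
    }
}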

If you are using SQL Server 2008 you can use the MERGE command:
http://www.builderau.com.au/program/sqlserver/soa/Using-SQL-Server-2008-s-MERGE-statement/0,339028455,339283059,00.htm
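A minimal sketch of how that could look from C#, sending the new rows in one round trip as a table-valued parameter (TVPs also require SQL Server 2008; the type name dbo.ScoreTableType and the Score property names are assumptions):
// using System.Data; using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
using (var cmd = conn.CreateCommand())
{
    cmd.CommandText = @"
        MERGE INTO dbo.Scores AS target
        USING @newScores AS source
        ON target.peopleId = source.peopleId AND target.testDate = source.testDate
        WHEN MATCHED THEN UPDATE SET target.result = source.result
        WHEN NOT MATCHED THEN
            INSERT (peopleId, testDate, result)
            VALUES (source.peopleId, source.testDate, source.result);";

    // Assumes CREATE TYPE dbo.ScoreTableType AS TABLE
    // (peopleId int, testDate datetime, result varchar(20)) already exists.
    var table = new DataTable();
    table.Columns.Add("peopleId", typeof(int));
    table.Columns.Add("testDate", typeof(DateTime));
    table.Columns.Add("result", typeof(string));
    foreach (var s in newScoreList)
        table.Rows.Add(s.peopleId, s.testDate, s.result);

    var p = cmd.Parameters.AddWithValue("@newScores", table);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.ScoreTableType";

    conn.Open();
    cmd.ExecuteNonQuery();
}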


Explain Code First CRUD auto-generated SQL for Identity column

Code First auto-generates an insert procedure as below for a table that has ProductID as the primary key (identity column).
CREATE PROCEDURE [dbo].[InsertProducts]
    @ProductName [nvarchar](max),
    @Date [datetime]
AS
BEGIN
    INSERT dbo.ProductsTable([ProductName], [Date])
    VALUES (@ProductName, @Date)

    -- identity stuff starts here
    DECLARE @ProductID int
    SELECT @ProductID = [ProductID]
    FROM dbo.ProductsTable
    WHERE @@ROWCOUNT > 0 AND [ProductID] = scope_identity()

    SELECT t0.[ProductID]
    FROM dbo.ProductsTable AS t0
    WHERE @@ROWCOUNT > 0 AND t0.[ProductID] = @ProductID
END
GO
Could you please explain the code that handles the identity column? Also, if an insert procedure is to be manually written from scratch, would it be handled differently?
If, for example, I remove this auto-generated code, I encounter one of the following errors:
Procedure ... expects parameter '@ProductID', which was not supplied
Store update, insert, or delete statement affected an unexpected number of rows (0). Entities may have been modified or deleted since entities were loaded. See http://go.microsoft.com/fwlink/?LinkId=472540 for information on understanding and handling optimistic concurrency exceptions.
In the app, this is how I call the procedure, which works fine until I try to mess with the Code First auto-generated SQL:
using (var db = new AppContext())
{
var record = new ProductObj()
{
ProductName = this.ProductName,
Date = DateTime.UtcNow
};
db.ProductDbSet.Add(record);
db.SaveChanges();
}
I guess there are two things to be explained here.
Why a SELECT statement when I insert stuff?
Let's first see what a regular insert by Entity Framework looks like. By "regular" I mean an insert without mapping CUD actions to stored procedures. The normal pattern is:
INSERT [dbo].[Product]([Name], ...)
VALUES (@0, ...)
SELECT [Id]
FROM [dbo].[Product]
WHERE @@ROWCOUNT > 0 AND [Id] = scope_identity()
So the INSERT is followed by a SELECT. This is because EF needs to know the identity value that the database assigns to the new Product, both to set the entity object's Product.ProductId property and to track the entity. If for some reason you decided to do an update immediately after the insert, EF would be able to generate an update statement like UPDATE ... WHERE Id = @0.
When the insert is handled by a stored procedure, the sproc should return the new Id value in a shape that mimics the regular insert: EF expects a one-column result set whose column is named after the identity column, containing one row with the new identity value.
So that's why there is a SELECT statement in there, and why EF complains if you remove it. But, you might ask, does EF really need 7 lines of code to get an assigned identity value?
Why so much code?
Honestly, I have to speculate a bit here, because it isn't documented as far as I can find. But let's look at a minimal working version:
INSERT [dbo].[Products]([Name])
VALUES (@Name)
SELECT scope_identity() AS ProductId;
This does the job. It's even the standard example of many tutorials, including official ones, on mapping CUD actions to stored procedures.
But a database can be stuffed with triggers, constraints, defaults, etc. It's hard to predict their influence on the returned scope_identity() under the wide range of circumstances EF may encounter. So EF wants to guarantee that the returned value really belongs to the newly inserted record, and that a record has actually been inserted in the first place. That's why it adds the SELECT from the Product table, including the @@ROWCOUNT check.
To implement these safeguards, a minimal version would be:
INSERT [dbo].[Products]([Name])
VALUES (@Name)
SELECT t0.[ProductId]
FROM [dbo].[Products] AS t0
WHERE @@ROWCOUNT > 0 AND t0.[ProductId] = scope_identity()
Same as in the regular insert.
That's as far as I can follow EF. It puzzles me a bit that this single SELECT apparently is enough for a regular INSERT but not for a stored procedure. I can't explain why there are two SELECTs in the generated code.

Insert multiple sql rows via stored proc

I have looked at some related topics but my question isn't quite answered:
C# - Inserting multiple rows using a stored procedure
Insert Update stored proc on SQL Server
Efficient Multiple SQL insertion
I have the following kind of setup for running my stored procedure in the code-behind of my web application. I am now faced with the possibility of inserting multiple products, and I would like to do it all in one ExecuteNonQuery rather than run a foreach loop n times.
I am not sure how to do this, or if it can be, with my current setup.
The code should be somewhat self explanatory but if clarification is needed let me know. Thanks.
SqlDatabase database = new SqlDatabase(transMangr.ConnectionString);
DbCommand commandWrapper = StoredProcedureProvider.GetCommandWrapper(database, "proc_name", useStoredProc);
database.AddInParameter(commandWrapper, "@ProductID", DbType.Int32, entity._productID);
database.AddInParameter(commandWrapper, "@ProductDesc", DbType.String, entity._desc);
...more parameters...
Utility.ExecuteNonQuery(transMangr, commandWrapper);
Proc
ALTER PROCEDURE [dbo].[Products_Insert]
    -- Add the parameters for the stored procedure here
    @ProductID int,
    @Link varchar(max),
    @ProductDesc varchar(max),
    @Date datetime
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO [dbo].[Products]
    (
        [ProductID],
        [Link],
        [Desc],
        [Date]
    )
    VALUES
    (
        @ProductID,
        @Link,
        @ProductDesc,
        @Date
    )
END
You should be fine running your stored procedure in a loop. Just make sure that you commit rarely, not after every insert.
For alternatives, you have already found the discussion about loading data.
Personally, I like SQL bulk insert of the form insert into myTable (select *, literalValue from someOtherTable);
But that will probably not do in your case.
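A sketch of what "commit rarely" can look like with plain ADO.NET: the transaction wraps the whole batch and commits once at the end (connectionString and the entity field names are assumptions based on the question):
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tran = conn.BeginTransaction())
    using (var cmd = new SqlCommand("dbo.Products_Insert", conn, tran))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.Add("@ProductID", SqlDbType.Int);
        cmd.Parameters.Add("@Link", SqlDbType.VarChar, -1);       // -1 = varchar(max)
        cmd.Parameters.Add("@ProductDesc", SqlDbType.VarChar, -1);
        cmd.Parameters.Add("@Date", SqlDbType.DateTime);

        foreach (var entity in entities)
        {
            cmd.Parameters["@ProductID"].Value = entity._productID;
            cmd.Parameters["@Link"].Value = entity._link;          // assumed field
            cmd.Parameters["@ProductDesc"].Value = entity._desc;
            cmd.Parameters["@Date"].Value = DateTime.UtcNow;
            cmd.ExecuteNonQuery();
        }

        tran.Commit(); // one commit for the whole batch
    }
}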
You could pass all your data as a table-valued parameter - MSDN has a pretty good write-up about it here
Something along the lines of the following should work
CREATE TABLE dbo.tSegments
(
SegmentID BIGINT NOT NULL CONSTRAINT pkSegment PRIMARY KEY CLUSTERED,
SegCount BIGINT NOT NULL
);
CREATE TYPE dbo.SegmentTableType AS TABLE
(
SegmentID BIGINT NOT NULL
);
CREATE PROCEDURE dbo.sp_addSegments
    @Segments dbo.SegmentTableType READONLY
AS
BEGIN
    MERGE INTO dbo.tSegments AS tSeg
    USING @Segments AS S
    ON tSeg.SegmentID = S.SegmentID
    WHEN MATCHED THEN UPDATE SET tSeg.SegCount = tSeg.SegCount + 1
    WHEN NOT MATCHED THEN INSERT VALUES (S.SegmentID, 1);
END
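The C# side of the table-valued parameter approach would look something like this sketch (System.Data / System.Data.SqlClient; it assumes the dbo.SegmentTableType and dbo.sp_addSegments objects defined above, and segmentIds as your source values):
var segments = new DataTable();
segments.Columns.Add("SegmentID", typeof(long));
foreach (long id in segmentIds)
    segments.Rows.Add(id);

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.sp_addSegments", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var p = cmd.Parameters.AddWithValue("@Segments", segments);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.SegmentTableType";
    conn.Open();
    cmd.ExecuteNonQuery(); // all rows travel in a single round trip
}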
Define the commandWrapper and parameters for the command outside of the loop; then, within the loop, just assign the parameter values and execute the proc.
SqlDatabase database = new SqlDatabase(transMangr.ConnectionString);
DbCommand commandWrapper = StoredProcedureProvider.GetCommandWrapper(database, "proc_name", useStoredProc);
database.AddInParameter(commandWrapper, "@ProductID", DbType.Int32);
database.AddInParameter(commandWrapper, "@ProductDesc", DbType.String);
...more parameters...
foreach (var entity in entities)
{
    database.SetParameterValue(commandWrapper, "@ProductID", entity._productID);
    database.SetParameterValue(commandWrapper, "@ProductDesc", entity._desc);
    //..more parameters...
    Utility.ExecuteNonQuery(transMangr, commandWrapper);
}
Not ideal from a purist point of view, but sometimes you are limited by frameworks and libraries: you may be forced to call stored procedures in a certain way, bind parameters in a certain way, and have connections managed by pools as part of your framework.
In such circumstances, a method we have found to work is to simply write your stored procedure with a lot of parameters, usually a name followed by a number, e.g. @ProductId1, @ProductDesc1, @ProductId2, @ProductDesc2, up to a number you decide on, say 32.
You can use some form of scripting language to produce the lines for this.
You can get the stored procedure to insert all the values first into a table parameter that allows nulls, then do bulk inserts / merges on this data in a way similar to Johnv2020's answer. You might remove the null rows first.
It will usually be more efficient than doing it one at a time (partly because of the database operations itself, and partly because of your framework's overheads in getting the connection to call the procedure etc.)
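As a sketch of the numbered-parameter pattern (the sproc name Products_InsertBatch and the padding with DBNull are illustrative assumptions; conn is an open SqlConnection):
const int BatchSize = 32; // must match the number of parameter sets the sproc declares
using (var cmd = new SqlCommand("dbo.Products_InsertBatch", conn)) // hypothetical sproc
{
    cmd.CommandType = CommandType.StoredProcedure;
    for (int i = 1; i <= BatchSize; i++)
    {
        var entity = i <= batch.Count ? batch[i - 1] : null;
        // Unused slots arrive as NULL; the sproc skips the NULL rows.
        cmd.Parameters.AddWithValue("@ProductId" + i,
            entity == null ? (object)DBNull.Value : entity._productID);
        cmd.Parameters.AddWithValue("@ProductDesc" + i,
            entity == null ? (object)DBNull.Value : entity._desc);
    }
    cmd.ExecuteNonQuery();
}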

How do I structure this transaction?

We have an ASP.NET/MSSQL based web app which generates orders with sequential order numbers.
When a user saves a form, a new order is created as follows:
SELECT MAX(order_number) FROM order_table, call this max_order_number
set new_order_number = max_order_number + 1
INSERT a new order record, with this new_order_number (it's just a field in the order record, not a database key)
If I enclose the above 3 steps in a single transaction, will it avoid duplicate order numbers from being created if two customers save a new order at the same time? (And let's say the system is eventually on a web farm with multiple IIS servers and one MSSQL server.)
I want to avoid two customers selecting the same MAX(order_number) due to concurrency somewhere in the system.
What isolation level should be used? Thank you.
Why not just use an Identity as the order number?
Edit:
As far as I know, you can make the current order_number column an Identity (you may have to reset the seed, it's been a while since I've done this). You might want to do some tests.
Here's a good read about what actually goes on when you change a column to an Identity in SSMS. The author mentions how this may take a while if the table already has millions of rows.
Using an identity is by far the best idea. I create all my tables like this:
CREATE TABLE mytable (
mytable_id int identity(1, 1) not null primary key,
name varchar(50)
)
The "identity" flag means, "Let SQL Server assign this number for me". The (1, 1) means that identity numbers should start at 1 and be incremented by 1 each time someone inserts a record into the table. Not Null means that nobody should be allowed to insert a null into this column, and "primary key" means that we should create a clustered index on this column. With this kind of a table, you can then insert your record like this:
-- We don't need to insert into mytable_id column; SQL Server does it for us!
INSERT INTO mytable (name) VALUES ('Bob Roberts')
But to answer your literal question, I can give a lesson about how transactions work. It's certainly possible, although not optimal, to do this:
-- Begin a transaction - this means everything within this region will be
-- executed atomically, meaning that nothing else can interfere.
BEGIN TRANSACTION
DECLARE @id bigint

-- Retrieve the maximum order number from the table. The UPDLOCK/HOLDLOCK hints
-- keep a concurrent transaction from reading the same MAX before we insert.
SELECT @id = MAX(order_number) FROM order_table WITH (UPDLOCK, HOLDLOCK)

-- No other transaction can now grab the same order number, so this
-- insert statement is safe from duplicates
INSERT INTO order_table (order_number) VALUES (@id + 1)

-- Committing the transaction releases your locks and allows other programs
-- to work on the order table
COMMIT TRANSACTION
Just keep in mind that declaring your table with an identity primary key column does this all for you automatically.
The risk is two processes selecting the same MAX(order_number) before one of them inserts the new order. A safer way is to do it in one statement (a subquery isn't allowed in a VALUES clause, so use INSERT ... SELECT):
INSERT INTO order_table
(order_number, /* other fields */)
SELECT MAX(order_number) + 1, /* other values */
FROM order_table
I agree with G_M; use an Identity field. When you add your record, just
INSERT INTO order_table (/* other fields */)
VALUES (/* other fields */) ; SELECT SCOPE_IDENTITY()
The return value from Scope Identity will be your order number.
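In ADO.NET terms, that maps onto ExecuteScalar; a sketch (the customer_name column is just an illustrative stand-in for your other fields):
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "INSERT INTO order_table (customer_name) VALUES (@name); " +
    "SELECT CAST(SCOPE_IDENTITY() AS int);", conn))
{
    cmd.Parameters.AddWithValue("@name", customerName);
    conn.Open();
    int orderNumber = (int)cmd.ExecuteScalar(); // the new order number
}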

MySql Batching Stored Procedure Calls with .Net / Connector?

Is there a way to batch stored procedure calls in MySql with the .Net / Connector to increase performance?
Here's the scenario... I'm using a stored procedure that accepts a few parameters as input. This procedure basically checks to see whether an existing record should be updated or a new one inserted (I'm not using INSERT INTO .. ON DUPLICATE KEY UPDATE because the check involves date ranges, so I can't really make a primary key out of the criteria).
I want to call this procedure a lot of times (let's say batches of 1000 or so). I can, of course, use one MySqlConnection and one MySqlCommand instance, keep changing the parameter values, and call .ExecuteNonQuery().
I'm wondering if there's a better way to batch these calls?
The only thought that comes to mind is to manually construct a string like 'call sp_myprocedure(@parama_1, @paramb_1); call sp_myprocedure(@parama_2, @paramb_2); ...', and then create all the appropriate parameters. I'm not convinced this will be any better than calling .ExecuteNonQuery() a bunch of times.
Any advice? Thanks!
EDIT: More info
I'm actually trying to store data from an external data source, on a regular basis. Basically I'm taking rss feeds of Domain auctions (from various sources like godaddy, pool, etc.), and updating a table with the auction info using this stored procedure (let's call it sp_storeSale). Now, in this table that the sale info gets stored, I want to keep historical records for sales for a given domain, so I have a domain table, and a sale table. The sale table has a many to one relationship with the domain table.
Here's the stored procedure:
-- --------------------------------------------------------------------------------
-- Routine DDL
-- Note: comments before and after the routine body will not be stored by the server
-- --------------------------------------------------------------------------------
DELIMITER $$
CREATE PROCEDURE `DomainFace`.`sp_storeSale`
(
    p_middle VARCHAR(63),
    p_extension VARCHAR(10),
    p_brokerId INT,
    p_endDate DATETIME,
    p_url VARCHAR(500),
    p_category INT,
    p_saleType INT,
    p_priceOrBid DECIMAL(10, 2),
    p_currency VARCHAR(3)
)
BEGIN
    -- The p_/v_ prefixes keep parameters and locals from colliding with column names
    DECLARE v_existingId BIGINT DEFAULT NULL;
    DECLARE v_domainId BIGINT DEFAULT 0;

    SET v_domainId = fn_getDomainId(p_middle, p_extension);
    SET v_existingId = (
        SELECT id FROM sale
        WHERE
            domainId = v_domainId
            AND brokerId = p_brokerId
            AND UTC_TIMESTAMP() BETWEEN startDate AND endDate
    );

    IF v_existingId IS NOT NULL THEN
        UPDATE sale SET
            endDate = p_endDate,
            url = p_url,
            category = p_category,
            saleType = p_saleType,
            priceOrBid = p_priceOrBid,
            currency = p_currency
        WHERE
            id = v_existingId;
    ELSE
        INSERT INTO sale (domainId, brokerId, startDate, endDate, url,
            category, saleType, priceOrBid, currency)
        VALUES (v_domainId, p_brokerId, UTC_TIMESTAMP(), p_endDate, p_url,
            p_category, p_saleType, p_priceOrBid, p_currency);
    END IF;
END $$
DELIMITER ;
As you can see, I'm basically looking for an existing record that is not 'expired', but has the same domain, and broker, in which case I assume the auction is not over yet, and the data is an update to the existing auction. Otherwise, I assume the auction is over, it is a historical record, and the data I've got is for a new auction, so I create a new record.
Hope that clears up what I'm trying to achieve :)
I'm not entirely sure what you're trying to do but it sounds kinda house-keeping or maintenance related so I won't be too ashamed at posting the following suggestion.
Why don't you move all of your logic into the database and process it all server side?
The following example uses a cursor (shock/horror) but it's perfectly acceptable to use them in such circumstances.
If you can avoid using cursors at all - great, but the main point of my suggestion is about moving the logic from your application tier back into the data tier to save on the round trips. You'd call the following sproc once and it would process the entire range of data in single call.
call house_keeping(curdate() - interval 1 month, curdate());
Also, if you can provide just a bit more information about what you're trying to do we might be able to suggest other approaches.
Example stored procedure
drop procedure if exists house_keeping;
delimiter #
create procedure house_keeping
(
in p_start_date date,
in p_end_date date
)
begin
declare v_done tinyint default 0;
declare v_id int unsigned;
declare v_expired_date date;
declare v_cur cursor for
select id, expired_date from foo where
expired_date between p_start_date and p_end_date;
declare continue handler for not found set v_done = 1;
open v_cur;
repeat
fetch v_cur into v_id, v_expired_date;
/*
if <some condition> then
insert ...
else
update ...
end if;
*/
until v_done end repeat;
close v_cur;
end #
delimiter ;
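Calling it from C# with Connector/Net is then a single round trip; a sketch (MySql.Data.MySqlClient, with connectionString assumed, and parameter naming per the connector's usual convention):
using (var conn = new MySqlConnection(connectionString))
using (var cmd = new MySqlCommand("house_keeping", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@p_start_date", DateTime.UtcNow.AddMonths(-1));
    cmd.Parameters.AddWithValue("@p_end_date", DateTime.UtcNow);
    conn.Open();
    cmd.ExecuteNonQuery(); // processes the whole range server side
}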
Just in case you think I'm completely mad for suggesting cursors, you might want to read this:
Optimal MySQL settings for queries that deliver large amounts of data?
Hope this helps :)

C# code and SQL Server performance

I have a SQL Server database designed like this:
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name1 (string)
Name2 (string, can be null)
Name3 (string, can be null)
Name4 (string, can be null)
TableValue
Iteration (int)
IdTableParameter (int, FOREIGN KEY)
Type (string)
Value (decimal)
So, as you've just understood, TableValue is linked to TableParameter.
TableParameter is like a multidimensional dictionary.
TableParameter is supposed to have a lot of rows (more than 300,000 rows)
From my C# client program, I have to fill this database after each Compute() call:
for (int iteration = 0; iteration < 5000; iteration++)
{
Compute();
FillResultsInDatabase();
}
In the FillResultsInDatabase() method, I have to:
1. Check if the label of my parameter already exists in TableParameter. If it doesn't exist, I have to insert a new one.
2. Insert the value in TableValue.
Step 1 takes a long time! I load the whole TableParameter table into an IEnumerable property and then, for each parameter, I make a
.FirstOrDefault( x => x.Name1 == item.Name1 &&
x.Name2 == item.Name2 &&
x.Name3 == item.Name3 &&
x.Name4 == item.Name4 );
in order to detect if it already exists (and, if so, to get the id).
Performance is very bad like this!
I've tried selecting with a WHERE clause in order to avoid loading every row of TableParameter, but performance is worse!
How can I improve the performance of step 1?
For step 2, performance is still bad with a classic INSERT. I am going to try SqlBulkCopy.
How can I improve the performance of step 2?
EDITED
I've tried with stored procedures:
CREATE PROCEDURE GetIdParameter
    @Id int OUTPUT,
    @Name1 nvarchar(50) = null,
    @Name2 nvarchar(50) = null,
    @Name3 nvarchar(50) = null
AS
SELECT TOP 1 @Id = Id FROM TableParameter
WHERE
    TableParameter.Name1 = @Name1
    AND (@Name2 IS NULL OR TableParameter.Name2 = @Name2)
    AND (@Name3 IS NULL OR TableParameter.Name3 = @Name3)
GO
CREATE PROCEDURE CreateValue
    @Iteration int,
    @Type nvarchar(50),
    @Value decimal(32, 18),
    @Name1 nvarchar(50) = null,
    @Name2 nvarchar(50) = null,
    @Name3 nvarchar(50) = null
AS
DECLARE @IdParameter int
EXEC GetIdParameter @IdParameter OUTPUT,
    @Name1, @Name2, @Name3
IF @IdParameter IS NULL
BEGIN
    INSERT TableParameter (Name1, Name2, Name3)
    VALUES
    (@Name1, @Name2, @Name3)
    SELECT @IdParameter = SCOPE_IDENTITY()
END
INSERT TableValue (Iteration, IdParameter, Type, Value)
VALUES
(@Iteration, @IdParameter, @Type, @Value)
GO
I still have the same performance... :-( (not acceptable)
If I understand what's happening, you're querying the database to see if the data is there in step 1. I'd use a single call to a stored procedure that inserts the data if it is not there. So just compute the results and pass them to the sp.
Can you compute the results first, and then insert in batches?
Does the Compute() function take data from the database? If so, can you turn the operation into a set-based one and perform it on the server itself? Or maybe part of it?
Remember that SQL Server is designed for large dataset operations.
Edit: reflecting comments
Since the code is slow on the data inserts, and you suspect that it's because the insert has to search back before it can be done, I'd suggest that you may need to place indexes on the columns that you search on in order to improve searching speed.
However I have another idea.
Why don't you just insert the data without the check and then later when you read the data remove the duplicates in that query?
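If you go that route, SqlBulkCopy is the natural fit for getting the raw rows in quickly; a sketch against the TableValue schema from the question (connectionString assumed):
var table = new DataTable();
table.Columns.Add("Iteration", typeof(int));
table.Columns.Add("IdTableParameter", typeof(int));
table.Columns.Add("Type", typeof(string));
table.Columns.Add("Value", typeof(decimal));
// ... fill table with the Compute() results ...

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.TableValue";
    bulk.WriteToServer(table); // one streamed bulk load instead of row-by-row INSERTs
}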
Given the fact that Name2 to Name4 can be null, would it be possible to restructure the parameter table:
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name (string)
Dimension int
Now you can index it and simplify the query. (WHERE Name = 'TheNameIWant' AND Dimension = 2)
(And speaking of indexes, you do have an index on the name columns in the parameter table?)
Where do you commit on the inserts? If you do one-statement commits, group multiple inserts into one transaction.
If you are the only one inserting values, and if speed is really of the essence, load all values from the database into memory and check there.
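A sketch of that in-memory check, keyed on the four name columns (db.TableParameters is an assumed entity set name):
// Load the parameter table once; Tuple handles nulls in Equals/GetHashCode.
var cache = db.TableParameters.ToDictionary(
    p => Tuple.Create(p.Name1, p.Name2, p.Name3, p.Name4),
    p => p.Id);

int id;
var key = Tuple.Create(item.Name1, item.Name2, item.Name3, item.Name4);
if (!cache.TryGetValue(key, out id))
{
    // Not seen before: insert a new TableParameter row and add its id to the cache.
}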
just some ideas
hth
Mario
I must admit that I'm struggling to grasp the business process that you are trying to achieve here.
On initial review, it appears as if you are performing a data comparison within your application tier. I would advise against this and suggest that you let the database engine do what it is designed to do: manage and implement your data access.
As another poster has mentioned, I concur that you should look to create a Stored Procedure to handle your record insertion logic. The procedure can perform a simple check to see if your records already exist.
You should also consider:
Enforcing the insertion logic/rule by creating a Unique Constraint across the four name columns.
Creating a covering non-clustered index incorporating the four name columns.
With regard to performance of your inserts, perhaps you can provide some metrics to qualify what it is that you are seeing and how you are measuring it?
To give you a yardstick, the current ETL insertion record for SQL Server is approx 16 million rows per second. What sort of numbers are you expecting and wanting to see?
The fastest way I know of so far is a bulk insert, but not just lines of INSERTs. Try INSERT + SELECT + UNION ALL; it works pretty fast:
insert into myTable
select a1, b1, c1, ...
union all select a2, b2, c2, ...
union all select a3, b3, c3, ...
