SQLBulkCopy Don't Copy Primary Keys - c#

What is the best way to deal with the primary key violation errors when using SQLBulkCopy
Violation of PRIMARY KEY constraint 'email_k__'. Cannot insert duplicate key in object 'lntmuser.email'.
(i.e. if the row already exists in the destination table) ?
Is there a way to skip inserting duplicate rows or would this have to be checked and dealt with before hand ?
Here is the code I am currently using:
var conPro = tx_ProConStr.Text;
var conArc = tx_ArcConStr.Text;
var con = new SqlConnection {ConnectionString = conPro};
var cmd = new SqlCommand("SELECT * FROM dbo.email", con);
con.Open();
var rdr = cmd.ExecuteReader();
var sbc = new SqlBulkCopy(conArc) {DestinationTableName = "dbo.email"};
sbc.WriteToServer(rdr);
sbc.Close();
rdr.Close();
con.Close();

I usually end up performing a Bulk Copy operation to a temporary table, and then copy data from it to the target table using regular SQL. This allows me to perform 'bulk updates', as well as take care of special situations like this (although I haven't encountered this specific need).
There's a performance hit compared to straight bulk copy, but it's still a lot faster than performing INSERTs.
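A rough sketch of that staging approach applied to the code above (the staging table and the EmailAddress column are placeholders for the real key column; a #temp table works the same way as long as every command runs on one open connection):
// Bulk copy everything into a staging table instead of the real one.
var sbc = new SqlBulkCopy(conArc) { DestinationTableName = "dbo.email_staging" };
sbc.WriteToServer(rdr);
sbc.Close();

// Then copy across only the rows that are not already in the destination.
const string copyNewRows = @"
    INSERT INTO dbo.email (EmailAddress /*, other columns */)
    SELECT s.EmailAddress /*, other columns */
    FROM dbo.email_staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.email AS d WHERE d.EmailAddress = s.EmailAddress);
    TRUNCATE TABLE dbo.email_staging;";

using (var arcCon = new SqlConnection(conArc))
using (var cmd2 = new SqlCommand(copyNewRows, arcCon))
{
    arcCon.Open();
    cmd2.ExecuteNonQuery();
}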

You could adjust the source query to exclude duplicates. For example:
select distinct * from dbo.email
Or to keep only one row per pkcol (the first by col1):
select *
from (
    select *, row_number() over (partition by pkcol order by col1) as rn
    from dbo.email
) as SubQueryAlias
where rn = 1

Related

How to avoid duplicates when inserting data using OleDB and Entity Framework?

I need to export Excel report data into the company DB, but my code just reads and inserts without checking for duplicates. I tried AddOrUpdate() but couldn't make it work.
Any ideas on how to go through the datareader results and filter already existing IDs so they are not inserted again?
DataView ImportarDatosSites(string filename)
{
    string conexion = string.Format("Provider = Microsoft.ACE.OLEDB.12.0; Data Source={0}; Extended Properties= 'Excel 8.0;HDR=YES'", filename);
    using (OleDbConnection connection = new OleDbConnection(conexion))
    {
        connection.Open();
        OleDbCommand command = new OleDbCommand("SELECT * FROM [BaseSitiosTelemetria$]", connection);
        OleDbDataAdapter adaptador = new OleDbDataAdapter { SelectCommand = command };
        DataSet ds = new DataSet();
        adaptador.Fill(ds);
        DataTable dt = ds.Tables[0];
        using (OleDbDataReader dr = command.ExecuteReader())
        {
            while (dr.Read())
            {
                var SiteID = dr[1];
                var ID_AA_FB = dr[2];
                var Address = dr[3];
                var CreateDate = dr[5];
                var Tipo = dr[7];
                var Measures = dr[9];
                var Latitud = dr[10];
                var Longitud = dr[11];
                SitesMtto s = new SitesMtto();
                s.siteIDDatagate = SiteID.ToString();
                s.idFieldBeat = ID_AA_FB.ToString();
                s.addressDatagate = Address.ToString();
                s.createDateDatagate = Convert.ToDateTime(CreateDate);
                s.typeDevice = Tipo.ToString();
                s.MeasuresDevice = Measures.ToString();
                if (Latitud.ToString() != "" && Longitud.ToString() != "")
                {
                    s.latitudeSite = Convert.ToDouble(Latitud);
                    s.longitudeSite = Convert.ToDouble(Longitud);
                }
                db.SitesMtto.Attach(s);
                db.SitesMtto.Add(s);
                db.SaveChanges();
            }
            connection.Close();
            return ds.Tables[0].DefaultView;
        }
    }
}
One way is to set up a try/catch block and set your primary key index using T-SQL. When a constraint error occurs it will throw a database error which you can catch.
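A rough sketch of that idea applied to the EF loop in the question (assuming EF6; SQL Server reports primary key violations as error 2627 and unique-index violations as 2601):
try
{
    db.SitesMtto.Add(s);
    db.SaveChanges();
}
catch (DbUpdateException ex) when (
    ex.GetBaseException() is SqlException sqlEx &&
    (sqlEx.Number == 2627 || sqlEx.Number == 2601))
{
    // Duplicate key: detach the entity so the next SaveChanges doesn't retry it.
    db.Entry(s).State = EntityState.Detached;
}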
When it comes to an import process from an external source, I recommend using a staging table approach. Dump the raw data from the Excel/file into a clean staging table (executing a TRUNCATE TABLE script against the staging table first). From there you can perform a query with a join against the real data table to detect and ignore/update possible duplicates, inserting real rows for any staged row that doesn't already have a corresponding value.
Depending on the number of rows I would recommend batching up the read and insert. You also don't need to call both Attach() and Add(), simply adding the item to the DbSet is sufficient:
Step 1: Flush the staging table with db.Database.ExecuteSqlCommand("TRUNCATE TABLE stagingSitesMtto");
Step 2: Open the data reader and bulk-insert the rows into the stagingSitesMtto table. This assumes that the Excel/file source does not include duplicate rows within it.
Step 3: Query your stagingSitesMtto joining your SitesMtto table on the PK/unique key. This is arguably a bit complex as Join is normally used to perform an INNER JOIN but we want an OUTER JOIN since we will be interested in StagingSites that have no corresponding site.
var query = db.StagingSitesMtto
    .GroupJoin(db.SitesMtto,
        staging => staging.SiteID,
        site => site.siteIDDatagate,
        (staging, site) => new
        {
            Staging = staging,
            Site = site
        })
    .SelectMany(group => group.Site.DefaultIfEmpty(),
        (group, site) => new
        {
            Staging = group.Staging,
            IsNew = site == null
        })
    .Where(x => x.IsNew)
    .Select(x => x.Staging)
    .ToList(); // Or run in a loop with Skip and Take
This will select all staging rows that do not have a corresponding real row. From there you can create new SitesMtto entities, copy the data across from the staging rows, add them to db.SitesMtto, and save. If you want to update rows as well as insert, then you can return the Staging and Site along with the IsNew flag and update the .Site using the values from .Staging. With change tracking enabled, the existing Site will be updated on SaveChanges if values were altered.
Disclaimer: The above code wasn't tested, just written from memory as a reference for the outer join approach. See: How to make LEFT JOIN in Lambda LINQ expressions
Hopefully that gives you something to consider for handling imports.
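For Step 2, a rough sketch of the bulk insert into the staging table, reusing the OleDbCommand from the question and assuming db is an EF6 DbContext and that stagingSitesMtto's columns mirror the spreadsheet (names here are assumptions):
using (OleDbDataReader dr = command.ExecuteReader())
using (var bulk = new SqlBulkCopy(db.Database.Connection.ConnectionString))
{
    bulk.DestinationTableName = "dbo.stagingSitesMtto";
    // Map spreadsheet columns to staging columns by name if they differ, e.g.:
    // bulk.ColumnMappings.Add("SiteID", "SiteID");
    bulk.WriteToServer(dr);
}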

Why does my SQL update for 20.000 records take over 5 minutes?

I have a piece of C# code which updates two specific columns for ~1000x20 records in a database on the localhost. As far as I know (though I am really far from being a database expert), it should not take long, but it takes more than 5 minutes.
I tried SQL transactions, with no luck. SqlBulkCopy seems a bit overkill, since it's a large table with dozens of columns, and I only have to update one or two columns for a set of records, so I would like to keep it simple. Is there a better approach to improve efficiency?
The code itself:
public static bool UpdatePlayers(List<Match> matches)
{
    using (var connection = new SqlConnection(Database.myConnectionString))
    {
        connection.Open();
        SqlCommand cmd = connection.CreateCommand();
        foreach (Match m in matches)
        {
            cmd.CommandText = "";
            foreach (Player p in m.Players)
            {
                // Some player specific calculation, which takes almost no time.
                p.Morale = SomeSpecificCalculationWhichMilisecond();
                p.Condition = SomeSpecificCalculationWhichMilisecond();
                cmd.CommandText += "UPDATE [Players] SET [Morale] = @morale, [Condition] = @condition WHERE [ID] = @id;";
                cmd.Parameters.AddWithValue("@morale", p.Morale);
                cmd.Parameters.AddWithValue("@condition", p.Condition);
                cmd.Parameters.AddWithValue("@id", p.ID);
            }
            cmd.ExecuteNonQuery();
        }
    }
    return true;
}
Updating 20,000 records one at a time is a slow process, so taking over 5 minutes is to be expected.
From your query, I would suggest putting the data into a temp table, then joining the temp table to the update. This way it only has to scan the table to update once, and update all values.
Note: it could still take a while to do the update if you have indexes on the fields you are updating and/or there is a large amount of data in the table.
Example update query:
UPDATE P
SET [Morale] = TT.[Morale], [Condition] = TT.[Condition]
FROM [Players] AS P
INNER JOIN #TempTable AS TT ON TT.[ID] = P.[ID];
Populating the temp table
How to get the data into the temp table is up to you. I suspect you could use SqlBulkCopy but you might have to put it into an actual table, then delete the table once you are done.
If possible, I recommend putting a Primary Key on the ID column in the temp table. This may speed up the update process by making it faster to find the related ID in the temp table.
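A sketch of how that could fit together (assuming Morale/Condition are ints and that a DataTable named dt already holds the calculated ID/Morale/Condition values; the temp table only survives while the same connection stays open):
using (var connection = new SqlConnection(Database.myConnectionString))
{
    connection.Open();

    // 1. Create the temp table with a primary key on ID.
    using (var create = new SqlCommand(
        "CREATE TABLE #TempTable ([ID] INT PRIMARY KEY, [Morale] INT, [Condition] INT);", connection))
    {
        create.ExecuteNonQuery();
    }

    // 2. Bulk copy the calculated values into it.
    using (var bulk = new SqlBulkCopy(connection))
    {
        bulk.DestinationTableName = "#TempTable";
        bulk.WriteToServer(dt);
    }

    // 3. Apply them in one set-based update.
    using (var update = new SqlCommand(
        @"UPDATE P
          SET [Morale] = TT.[Morale], [Condition] = TT.[Condition]
          FROM [Players] AS P
          INNER JOIN #TempTable AS TT ON TT.[ID] = P.[ID];", connection))
    {
        update.ExecuteNonQuery();
    }
}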
Minor improvements:
use a string builder for the command text
ensure your parameter names are actually unique
clear your parameters for the next use
depending on how many players are in each match, batch N commands together rather than one per match (a sketch of these follows below)
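A sketch of those minor improvements applied to the inner loop (the generated parameter names such as @morale0 are just illustrative):
var sb = new StringBuilder();
cmd.Parameters.Clear();
int i = 0;
foreach (Player p in m.Players)
{
    p.Morale = SomeSpecificCalculationWhichMilisecond();
    p.Condition = SomeSpecificCalculationWhichMilisecond();
    sb.AppendFormat("UPDATE [Players] SET [Morale] = @morale{0}, [Condition] = @condition{0} WHERE [ID] = @id{0};", i);
    cmd.Parameters.AddWithValue("@morale" + i, p.Morale);
    cmd.Parameters.AddWithValue("@condition" + i, p.Condition);
    cmd.Parameters.AddWithValue("@id" + i, p.ID);
    i++;
}
cmd.CommandText = sb.ToString();
cmd.ExecuteNonQuery(); // keep an eye on the ~2100-parameter-per-command limit when batching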
Bigger improvement:
use a table-valued parameter and a MERGE SQL statement, which should look something like this (untested):
CREATE TYPE [MoraleUpdate] AS TABLE (
[Id] ...,
[Condition] ...,
[Morale] ...
)
GO
MERGE [dbo].[Players] AS [Target]
USING @Updates AS [Source]
ON [Target].[Id] = [Source].[Id]
WHEN MATCHED THEN
UPDATE SET [Morale] = [Source].[Morale],
           [Condition] = [Source].[Condition];
DataTable dt = new DataTable();
// Note: the column order must match the MoraleUpdate table type (Id, Condition, Morale).
dt.Columns.Add("Id", typeof(...));
dt.Columns.Add("Condition", typeof(...));
dt.Columns.Add("Morale", typeof(...));
foreach (...)
{
    dt.Rows.Add(p.Id, p.Condition, p.Morale);
}
SqlParameter sqlParam = cmd.Parameters.AddWithValue("@Updates", dt);
sqlParam.SqlDbType = SqlDbType.Structured;
sqlParam.TypeName = "dbo.[MoraleUpdate]";
cmd.ExecuteNonQuery();
You could also implement a DbDataReader to stream the values to the server while you are calculating them.
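One common way to stream the values instead of materialising a DataTable is to pass an IEnumerable<SqlDataRecord> as the table-valued parameter; a sketch (column types are assumptions and must match the MoraleUpdate type):
// requires: using System.Data; using Microsoft.SqlServer.Server;
static IEnumerable<SqlDataRecord> GetUpdateRecords(IEnumerable<Player> players)
{
    var meta = new[]
    {
        new SqlMetaData("Id", SqlDbType.Int),
        new SqlMetaData("Condition", SqlDbType.Int),
        new SqlMetaData("Morale", SqlDbType.Int)
    };
    foreach (var p in players)
    {
        var record = new SqlDataRecord(meta);
        record.SetInt32(0, p.ID);
        record.SetInt32(1, p.Condition);
        record.SetInt32(2, p.Morale);
        yield return record; // each row is sent as it is produced
    }
}

// The parameter is configured the same way, just with the enumerable as its value:
// var sqlParam = cmd.Parameters.AddWithValue("@Updates", GetUpdateRecords(players));
// sqlParam.SqlDbType = SqlDbType.Structured;
// sqlParam.TypeName = "dbo.[MoraleUpdate]";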

Build Where Clause Dynamically in Ado.net C#

I will be taking in around 1000 records at a given time and I have to determine if they are existing records or new records.
If they are existing I have to update the records, if new then just insert them. I will not know if any of them will be existing or if they all will be existing.
I thought that it might be best to do one query to the database to find which of them already exist, store those in memory, and check against that collection.
Originally I was told 1 field would be enough to determine uniqueness, so I thought I could just do one big IN clause against 1 field in the database, but now I found out that is not the case and I need to use 3 fields to determine if the record is existing or not.
This is basically an and clause
select * from where columnA = "1" and ColumnB = "2" and ColumnC = "3"
How can I properly write this in C# ADO.NET?
I am guessing I am going to need some super where clause?
select * from where (columnA = "1" and ColumnB = "2" and ColumnC = "3") or (columnA = "4" and ColumnB = "5" and ColumnC = "6") or [....998 more conditional clauses)
I am open to better ideas if possible. I still think doing it in 1 shot is better than doing 1000 separate queries.
I can only help you write the query for your request:
var recordCount = 1000;
var query = "SELECT * FROM TableName WHERE";
for (var i = 1; i < recordCount - 2; i += 3)
{
query += " (columnA = " + i + " and ColumnB = " + (i + 1) + " and ColumnC = " + (i + 2) + ") or ";
}
I feel kinda silly writing this answer, because I think you should be able to piece together a complete answer from other posts - but this is not an exact duplicate of either of the questions I have in mind.
There already are questions and answers on Stack Overflow dealing with this issue - however, in my search I only found answers that are not thread safe, and most of them are using merge.
There are different questions and answers I can refer you to, such as my answer to Adding multiple parameterized variables to a database in c#, where you can see how to work with table valued parameters in C#, and Aaron Bertrand's answer to Using a if condition in an insert SQL Server, where you can see how to create a safe upsert - however I didn't find any answer that covers this completely - so here you go:
First you need to create a user defined table type in your database:
CREATE TYPE MyTableType AS TABLE
(
Column1 int NOT NULL,
Column2 int NOT NULL,
Column3 int NOT NULL,
-- rest of the columns in your table goes here
PRIMARY KEY (Column1, Column2, Column3)
)
Then, you create the stored procedure:
CREATE PROCEDURE stp_UpsertMyTable
(
@MyTableType dbo.MyTableType READONLY -- table valued parameters must be readonly
)
AS
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
UPDATE t
SET t.column4 = tvp.column4,
t.column5 = tvp.column5 -- and so on for all columns that are not part of the key
FROM dbo.MyTable AS t
INNER JOIN @MyTableType AS tvp
ON t.Column1 = tvp.Column1
AND t.Column2 = tvp.Column2
AND t.Column3 = tvp.Column3;
-- Note: <ColumnsList> should be replaced with the actual columns in the table
INSERT dbo.MyTable(<ColumnsList>)
SELECT <ColumnsList>
FROM @MyTableType AS tvp
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.MyTable t
WHERE t.Column1 = tvp.Column1
AND t.Column2 = tvp.Column2
AND t.Column3 = tvp.Column3
);
COMMIT TRANSACTION;
GO
Then, the c# part is simple:
DataTable dt = new DataTable();
dt.Columns.Add("Column1", typeof(int));
dt.Columns.Add("Column2", typeof(int));
dt.Columns.Add("Column3", typeof(int));
dt.Columns.Add("Column4", typeof(string));
dt.Columns.Add("Column5", typeof(string));
// Fill your data table here
using (var con = new SqlConnection("ConnectionString"))
{
    using (var cmd = new SqlCommand("stp_UpsertMyTable", con))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.Add("@MyTableType", SqlDbType.Structured).Value = dt;
        con.Open();
        cmd.ExecuteNonQuery();
    }
}
Now you have a complete and safe upsert using a table valued parameter with only one round trip between c# and sql server.

What is the fastest way to update a huge amount of data through .Net C# application

I am working on an application which gets data from two different databases (i.e. Database1.Table1 and Database2.Table2), then compares these two tables (comparison done only on the primary key, i.e. ID) and inserts rows from Database1.Table1 into Database2.Table2 if they do not exist in Database2.Table2.
The problem is that there is a huge amount of data (about 0.8 million rows in both tables) and it takes a lot of time in the comparison. Is there any way to do this faster?
NOTE: I am using DataTable in C# to compare these tables. The code is given below:
DataTable Database1_Table1; // = method to get all data from Database1.Table1
DataTable Database2_Table2; // = method to get all data from Database2.Table2
foreach (DataRow row in Database1_Table1.Rows) //(var GoodClass in Staging_distinct2)
{
    if (Database2_Table2.Select("ID=" + row["ID"]).Count() < 1)
    {
        sqlComm = new SqlCommand("Delete from Database1.Table1 where Id=" + row["ID"], conn);
        sqlComm.ExecuteNonQuery();
        sqlComm = new SqlCommand("INSERT INTO Database2.Table2 Values (@ID,@EmpName,@Email,@UserName)", conn);
        sqlComm.Parameters.Add("@ID", SqlDbType.Int).Value = row["ID"];
        sqlComm.Parameters.Add("@EmpName", SqlDbType.VarChar).Value = row["EmpName"];
        sqlComm.Parameters.Add("@Email", SqlDbType.VarChar).Value = row["Email"];
        sqlComm.Parameters.Add("@UserName", SqlDbType.VarChar).Value = row["UserName"];
        sqlComm.ExecuteNonQuery();
        totalCount++;
        added++;
    }
    else
    {
        deleted++;
        totalCount++;
    }
}
Submit this SQL from your application to the database:
INSERT INTO Database1..Table1 (Key, Column1, Column2)
SELECT Key, Column1, Column2
FROM Database2..Table2
WHERE NOT EXISTS (
    SELECT * FROM Database1..Table1
    WHERE Database1..Table1.Key = Database2..Table2.Key
)
It will copy all rows that don't match on column Key from Database2..Table2 to Database1..Table1.
It will do it on the database server. No needless round trip of data. No RBAR (Row By Agonising Row). The only downside is you can't get a progress bar - do it asynchronously.
Bulk update/insert is the fastest way (SqlBulkCopy).
http://www.jarloo.com/c-bulk-upsert-to-sql-server-tutorial/
Best way to handle this is to bulk insert into a temp table then issue a merge statement from that temp table into your production table. I do this with millions of rows a day without issue. I have an example of the technique on my blog C# Sql Server Bulk Upsert
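The temp-table-plus-MERGE approach could look roughly like this for the tables in the question, assuming #TempTable has the same four columns and was populated with SqlBulkCopy on the same open connection (conn):
const string upsert = @"
    MERGE Database2..Table2 AS target
    USING #TempTable AS source
        ON target.ID = source.ID
    WHEN MATCHED THEN
        UPDATE SET target.EmpName = source.EmpName,
                   target.Email = source.Email,
                   target.UserName = source.UserName
    WHEN NOT MATCHED THEN
        INSERT (ID, EmpName, Email, UserName)
        VALUES (source.ID, source.EmpName, source.Email, source.UserName);";

using (var merge = new SqlCommand(upsert, conn))
{
    merge.ExecuteNonQuery();
}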

Dynamically print number of rows updated

The company that I work for has large databases, millions of records in a single table. I have written a C# program that migrates tables between remote servers.
I first create all the tables using SMO without copying data and then the data insertion is done after all the tables have been created.
During the record insertion, since there are so many records, the console window remains blank until all the rows have been inserted. Due to the sheer volume of data this takes a long time.
What I want now is a way to print "n rows updated", like in the MSSQL import/export data wizard.
The insert part is just a simple insert into select * query.
It sounds like you might be using SqlCommands; if so, here is a sample:
using (SqlConnection connection = new SqlConnection(Connection.ConnectionString))
{
    using (SqlCommand command = new SqlCommand("insert into OldCustomers select * from customers", connection))
    {
        connection.Open();
        var numRows = command.ExecuteNonQuery();
        Console.WriteLine("Affected Rows: {0}", numRows);
    }
}
You definitely need to look at the OUTPUT clause. There are useful examples on MSDN.
using (SqlConnection conn = new SqlConnection(connectionStr))
{
    var sqlCmd = @"
        CREATE TABLE #tmp (
            InsertedId BIGINT
        );
        INSERT INTO TestTable
        OUTPUT Inserted.Id INTO #tmp
        VALUES ....
        SELECT COUNT(*) FROM #tmp;";
    using (SqlCommand cmd = new SqlCommand(sqlCmd, conn))
    {
        conn.Open();
        // ExecuteScalar returns the first column of the first row,
        // i.e. the COUNT(*) from the #tmp table.
        var numRows = cmd.ExecuteScalar();
        Console.WriteLine("Affected Rows: {0}", numRows);
    }
}
I also suggest using a stored procedure for such purposes.
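A different option from the answers above: if the copy itself is done with SqlBulkCopy rather than a single INSERT ... SELECT, periodic progress reporting is built in via the NotifyAfter property and the SqlRowsCopied event. A minimal sketch (connection strings and table names are placeholders):
using (var source = new SqlConnection(sourceConnectionString))
using (var destination = new SqlConnection(destinationConnectionString))
{
    source.Open();
    destination.Open();

    using (var reader = new SqlCommand("SELECT * FROM dbo.Customers", source).ExecuteReader())
    using (var bulk = new SqlBulkCopy(destination) { DestinationTableName = "dbo.OldCustomers" })
    {
        bulk.NotifyAfter = 10000; // raise the event every 10,000 rows
        bulk.SqlRowsCopied += (s, e) => Console.WriteLine("{0} rows copied", e.RowsCopied);
        bulk.WriteToServer(reader);
    }
}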
