I have a Windows service that watches a folder for CSV files. Each record in a CSV file is inserted into a SQL table. If the same CSV file is dropped into that folder again, it can lead to duplicate records in the table. How can I avoid duplicate insertions into the SQL table?
Try INSERT WHERE NOT EXISTS, where a, b, and c are the relevant columns and @a, @b, and @c are the corresponding values.
INSERT INTO table
(
a,
b,
c
)
VALUES
(
@a,
@b,
@c
)
WHERE NOT EXISTS
(
SELECT 0 FROM table WHERE a = @a AND b = @b AND c = @c
)
The accepted answer has a syntax error and is not compatible with relational databases like MySQL.
Specifically, the following is not compatible with most databases:
values(...) where not exists
While the following is generic SQL, and is compatible with all databases:
select ... where not exists
Given that, if you want to insert a single record into a table after checking if it already exists, you can do a simple select with a where not exists clause as part of your insert statement, like this:
INSERT
INTO table_name (
primary_col,
col_1,
col_2
)
SELECT 1234,
'val_1',
'val_2'
WHERE NOT EXISTS (
SELECT 1
FROM table_name
WHERE primary_col=1234
);
Simply pass all values with the select keyword, and put the primary or unique key condition in the where clause.
Problems with the answers using WHERE NOT EXISTS are:
performance -- row-by-row processing potentially requires a very large number of scans against the target table
NULL handling -- for every column that might contain NULLs you have to write the matching condition in a more complicated way, like
(a = @a OR (a IS NULL AND @a IS NULL)).
Repeat that for 10 columns and voilà, you hate SQL :)
A better answer would take into account the great set-processing capabilities that relational databases provide (in short: never use row-by-row processing in SQL if you can avoid it; if you can't, think again and avoid it anyway).
So for the answer:
load (all) data into a temporary table (or a staging table that can be safely truncated before load)
run the insert in a set-based way:
INSERT INTO table (<columns>)
select <columns> from #temptab
EXCEPT
select <columns> from table
Keep in mind that EXCEPT handles NULLs safely for every kind of column ;) and that the query optimizer will pick a high-performance join type for the matching (hash, loop, or merge join) depending on the available indexes and table statistics.
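For example, a concrete sketch of the two steps, with made-up names (target_table and the columns a, b, c are placeholders; the service would bulk load the CSV into the staging table first):
-- Hypothetical staging table; assume the CSV rows have been bulk loaded into it
CREATE TABLE #temptab (a int NULL, b varchar(50) NULL, c varchar(50) NULL);
-- Set-based insert: only rows not already present in the target are inserted,
-- and EXCEPT also collapses exact duplicates within the staging data itself
INSERT INTO target_table (a, b, c)
SELECT a, b, c FROM #temptab
EXCEPT
SELECT a, b, c FROM target_table;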
Related
I know that SELECT + INSERT is a bad practice that can lead to a race condition (learned from the insert/update anti-pattern that people hit when trying to implement upsert behaviour).
But I have no idea how to avoid that when I just want to insert data into a table without duplicates for a given column (or tuple of columns):
IF NOT EXISTS ( SELECT TOP(1) 1 FROM [tbl_myTable] WHERE [MyNotKeyField] = 'myValue' )
BEGIN
INSERT INTO [tbl_myTable] (...) VALUES (...)
END
Should I create a unique index and just try to insert the record anyway?
I am afraid that in this case the overhead of a failed insert may be more costly.
PS: I am sending that command from a client application (a C# application connected to SQL Server), so I suppose a temporary table and the use of MERGE are out of the question.
Combine the EXISTS check with the INSERT, e.g.
INSERT INTO [tbl_myTable] (...)
SELECT @val1, @val2 ...
WHERE NOT EXISTS
(
SELECT 1 FROM [tbl_myTable] WITH (UPDLOCK, SERIALIZABLE)
WHERE [MyNotKeyField] = 'myValue'
);
Aaron Bertrand has a great post on UPSERT anti-patterns.
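As for the "create a unique index and just try the insert anyway" part: that works too, and the overhead of the occasional failed insert is usually modest. A rough sketch, assuming SQL Server 2012+ for THROW (index and column names are examples):
CREATE UNIQUE INDEX UX_tbl_myTable_MyNotKeyField ON [tbl_myTable] ([MyNotKeyField]);
BEGIN TRY
    INSERT INTO [tbl_myTable] ([MyNotKeyField] /*, ... */) VALUES ('myValue' /*, ... */);
END TRY
BEGIN CATCH
    -- 2601/2627 are the duplicate-key errors; swallow them, re-raise anything else
    IF ERROR_NUMBER() NOT IN (2601, 2627)
        THROW;
END CATCH;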
We have an ASP.NET/MSSQL based web app which generates orders with sequential order numbers.
When a user saves a form, a new order is created as follows:
SELECT MAX(order_number) FROM order_table, call this max_order_number
set new_order_number = max_order_number + 1
INSERT a new order record, with this new_order_number (it's just a field in the order record, not a database key)
If I enclose the above 3 steps in a single transaction, will it prevent duplicate order numbers from being created if two customers save a new order at the same time? (And let's say the system is eventually on a web farm with multiple IIS servers and one MSSQL server.)
I want to avoid two customers selecting the same MAX(order_number) due to concurrency somewhere in the system.
What isolation level should be used? Thank you.
Why not just use an Identity as the order number?
Edit:
As far as I know, you can make the current order_number column an Identity (you may have to reset the seed, it's been a while since I've done this). You might want to do some tests.
Here's a good read about what actually goes on when you change a column to an Identity in SSMS. The author mentions how this may take a while if the table already has millions of rows.
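If you do end up reseeding, a minimal sketch (assuming the column has already been converted to an identity; table and column names as in the question) would be:
-- Set the identity seed so the next inserted row gets MAX(order_number) + 1
DECLARE @max int = (SELECT MAX(order_number) FROM order_table);
DBCC CHECKIDENT ('order_table', RESEED, @max);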
Using an identity is by far the best idea. I create all my tables like this:
CREATE TABLE mytable (
mytable_id int identity(1, 1) not null primary key,
name varchar(50)
)
The "identity" flag means, "Let SQL Server assign this number for me". The (1, 1) means that identity numbers should start at 1 and be incremented by 1 each time someone inserts a record into the table. Not Null means that nobody should be allowed to insert a null into this column, and "primary key" means that we should create a clustered index on this column. With this kind of a table, you can then insert your record like this:
-- We don't need to insert into mytable_id column; SQL Server does it for us!
INSERT INTO mytable (name) VALUES ('Bob Roberts')
But to answer your literal question, I can give a lesson about how transactions work. It's certainly possible, although not optimal, to do this:
-- Begin a transaction - everything inside it will commit or roll back as a unit
BEGIN TRANSACTION
DECLARE @id bigint
-- Retrieve the maximum order number from the table. The UPDLOCK/HOLDLOCK hints
-- keep other sessions from reading the same MAX until we commit; a transaction
-- alone, at the default isolation level, would not prevent that.
SELECT @id = MAX(order_number) FROM order_table WITH (UPDLOCK, HOLDLOCK)
-- Because of the locks taken above, no other session can grab the same number,
-- so this insert is safe
INSERT INTO order_table (order_number) VALUES (@id + 1)
-- Committing the transaction releases your lock and allows other programs
-- to work on the order table
COMMIT TRANSACTION
Just keep in mind that declaring your table with an identity primary key column does this all for you automatically.
The risk is two processes selecting the MAX(order_number) before one of them inserts the new order. A safer way is to do it in one step:
INSERT INTO order_table (order_number /*, other fields */)
SELECT MAX(order_number) + 1 /*, other values */
FROM order_table
I agree with G_M; use an Identity field. When you add your record, just
INSERT INTO order_table (/* other fields */)
VALUES (/* other values */);
SELECT SCOPE_IDENTITY()
The return value from Scope Identity will be your order number.
The database for my application contains tables (not editable by the user) that are necessary for my application to run. For instance, there is a Report table containing a list of my SSRS reports.
Except for the Auto-Increment and GUID fields, the data in my Report Table should match across all databases.
To keep existing client databases in sync with the ones created from scratch, there is a database updater app that runs scripts against the existing client databases.
There are Unit Tests to ensure Reports run correctly on both types of databases. However, other than a developer's eye, there is no systematic check that the rows, and the values in those rows, match between the tables. This is prone to human error.
To fix this, I plan to add a small report to the Unit Test report that will inform development of the following:
Records missing from the "Made From Scratch" database that exist in the "Updated" Database
Records missing from the "Updated" database that exist in the "Made From Scratch" Database
Fields that do not match between the tables
So far, I have a query to report the above information for all tables involved.
A sample query would look something like this:
--Take the fields I want to compare from TableToCompare in MadeFromScratch and put them in @First_Table_Var
--NOTE: MyFirstField should match in both tables in order to compare the values between rows
DECLARE @First_Table_Var table(
MyFirstField Varchar(255),
MySecondField VarChar(255),
MyThirdField Varchar(255)
);
INSERT INTO @First_Table_Var
SELECT
r.MyFirstField,
r.MySecondField,
l.MyThirdField
FROM
MadeFromScratch.dbo.TableToCompare r
INNER JOIN MadeFromScratch.dbo.LookUpTable l ON r.ForeignKeyID = l.PrimaryKeyID
--Take the fields I want to compare from TableToCompare in UpdatdDatabase and put them in @Second_Table_Var
DECLARE @Second_Table_Var table(
MyFirstField Varchar(255),
MySecondField VarChar(255),
MyThirdField Varchar(255)
);
INSERT INTO @Second_Table_Var
SELECT
r.MyFirstField,
r.MySecondField,
l.MyThirdField
FROM
UpdatdDatabase.dbo.TableToCompare r
INNER JOIN UpdatdDatabase.dbo.LookUpTable l ON r.ForeignKeyID = l.PrimaryKeyID
--**********************
-- CREATE OUTPUT
--**********************
--List Rows that exist in @Second_Table_Var but not @First_Table_Var
--(e.g. these rows need to be added to the table in MadeFromScratch)
SELECT
Problem = '1 MISSING ROW IN A MADE-FROM-SCRATCH DATABASE',
hur.MyFirstField,
hur.MySecondField,
hur.MyThirdField
FROM
@Second_Table_Var hur
WHERE
NOT EXISTS
(SELECT
*
FROM
@First_Table_Var hu
WHERE
hu.MyFirstField = hur.MyFirstField
)
UNION
--List Rows that exist in @First_Table_Var but not @Second_Table_Var
--(e.g. these rows need to be added to the table in UpdatdDatabase)
SELECT
Problem = '2 MISSING IN UPDATE DATABASE',
hur.MyFirstField,
hur.MySecondField,
hur.MyThirdField
FROM
@First_Table_Var hur
WHERE
NOT EXISTS
(SELECT
*
FROM
@Second_Table_Var hu
WHERE
hu.MyFirstField = hur.MyFirstField
)
UNION
--Compare fields among the tables where MyFirstField matches but the other fields do not
SELECT
Problem = '3 MISMATCHED FIELD',
h.MyFirstField,
MySecondField = CASE WHEN h.MySecondField = hu.MySecondField THEN '' ELSE 'Created Value: ' + h.MySecondField + ' Updated Value: ' + hu.MySecondField END,
MyThirdField = CASE WHEN h.MyThirdField = hu.MyThirdField THEN '' ELSE 'Created Value: ' + CAST(h.MyThirdField AS VARCHAR(4)) + ' Updated Value: ' + CAST(hu.MyThirdField AS VARCHAR(4)) END
FROM
@First_Table_Var h
INNER JOIN @Second_Table_Var hu on h.MyFirstField = hu.MyFirstField
WHERE
NOT EXISTS
(SELECT
*
FROM
@Second_Table_Var hu
WHERE
hu.MyFirstField = h.MyFirstField and
hu.MySecondField = h.MySecondField and
hu.MyThirdField = h.MyThirdField
)
ORDER BY Problem
I won't have any problem writing code to parse through the results, but this methodology feels antiquated for the following reasons:
Several queries (which essentially do the same thing) will need to be written
Maintenance for this process can get cumbersome
I would like to be able to write something where the list of tables and fields to compare is maintained by some kind of file (XML?). So, whether fields are added or changed, all the user has to do is update this file.
Is there a way to use LINQ and/or Reflection (or any feature in .NET 4.0 for that matter) where I could compare tables between two databases and maintain them like I've listed above?
Ideas are welcome. Ideas with an example would be great! :D
You said, "Except for the Auto-Increment and GUID fields, the data in my Report Table should match across all databases."
I assume these are ID fields. Ideally, replication of the database should replicate the ID fields too; ensuring that lets you check for new inserts by ID, and for updates you can add a timestamp field for comparison.
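A minimal sketch of that comparison, assuming both databases sit on the same server and that the Report table has a hypothetical LastModified column (all names here are illustrative):
-- Rows present in the updated database but missing from the made-from-scratch one
SELECT u.ReportID
FROM UpdatedDb.dbo.Report u
WHERE NOT EXISTS (SELECT 1 FROM ScratchDb.dbo.Report s WHERE s.ReportID = u.ReportID);
-- Rows whose content has drifted, detected via the hypothetical LastModified column
SELECT u.ReportID, u.LastModified AS UpdatedModified, s.LastModified AS ScratchModified
FROM UpdatedDb.dbo.Report u
INNER JOIN ScratchDb.dbo.Report s ON s.ReportID = u.ReportID
WHERE u.LastModified <> s.LastModified;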
I receive a daily XML file that contains thousands of records, each being a business transaction that I need to store in an internal database for use in reporting and billing.
I was under the impression that each day's file contained only unique records, but have discovered that my definition of unique is not exactly the same as the provider's.
The current application that imports this data is a C# .NET 3.5 console application; it does so using SqlBulkCopy into a MS SQL Server 2008 database table whose columns exactly match the structure of the XML records. Each record has just over 100 fields, and there is no natural key in the data; rather, the fields that would make sense as a composite key also have to allow nulls. Currently the table has several indexes, but no primary key.
Basically the entire row needs to be unique; if one field is different, it is valid enough to be inserted. I looked at creating an MD5 hash of the entire row, inserting that into the database, and using a constraint to prevent SqlBulkCopy from inserting the row, but I don't see how to get the MD5 hash into the bulk copy operation, and I'm not sure whether the whole operation would fail and roll back if any one record failed, or whether it would continue.
The file contains a very large number of records, so going row by row through the XML, querying the database for a record that matches all fields, and then deciding whether to insert is really the only way I can see to do this. I was just hoping not to have to rewrite the application entirely, and the bulk copy operation is so much faster.
Does anyone know of a way to use SqlBulkCopy while preventing duplicate rows, without a primary key? Or any suggestion for a different way to do this?
I'd upload the data into a staging table then deal with duplicates afterwards on copy to the final table.
For example, you can create a (non-unique) index on the staging table to deal with the "key"
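A rough sketch of that flow, with table and column names invented for illustration (the real table has 100+ columns):
-- Bulk copy the raw XML rows into the staging table first (no constraints)
CREATE TABLE dbo.TransactionStaging (col1 int NULL, col2 varchar(50) NULL /*, ... */);
-- A non-unique index on the comparison columns speeds up the duplicate check
CREATE INDEX IX_TransactionStaging ON dbo.TransactionStaging (col1, col2);
-- Move only rows that are not already in the destination; EXCEPT treats NULLs
-- as equal and also collapses exact duplicates within the staging load itself
INSERT INTO dbo.Transactions (col1, col2 /*, ... */)
SELECT col1, col2 /*, ... */ FROM dbo.TransactionStaging
EXCEPT
SELECT col1, col2 /*, ... */ FROM dbo.Transactions;
TRUNCATE TABLE dbo.TransactionStaging;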
Given that you're using SQL 2008, you have two options to solve the problem easily without needing to change your application much (if at all).
The first possible solution is to create a second table like the first one, but with a surrogate identity key and a uniqueness constraint added using the ignore_dup_key option, which will do all the heavy lifting of eliminating the duplicates for you.
Here's an example you can run in SSMS to see what's happening:
if object_id( 'tempdb..#test1' ) is not null drop table #test1;
if object_id( 'tempdb..#test2' ) is not null drop table #test2;
go
-- example heap table with duplicate record
create table #test1
(
col1 int
,col2 varchar(50)
,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
( 250, 'Joe''s IT Consulting and Bait Shop', null )
,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
,( 250, 'Joe''s IT Consulting and Bait Shop', null ) -- dup record
,( 666, 'The Honest Politician', 'LIE' )
,( 100, 'My Invisible Friend', 'WHO' )
;
go
-- secondary table for removing duplicates
create table #test2
(
sk int not null identity primary key
,col1 int
,col2 varchar(50)
,col3 char(3)
-- add a uniqueness constraint to filter dups
,constraint UQ_test2 unique ( col1, col2, col3 ) with ( ignore_dup_key = on )
);
go
-- insert all records from original table
-- this should generate a warning if duplicate records were ignored
insert #test2( col1, col2, col3 )
select col1, col2, col3
from #test1;
go
Alternatively, you can also remove the duplicates in-place without a second table, but the performance may be too slow for your needs. Here's the code for that example, also runnable in SSMS:
if object_id( 'tempdb..#test1' ) is not null drop table #test1;
go
-- example heap table with duplicate record
create table #test1
(
col1 int
,col2 varchar(50)
,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
( 250, 'Joe''s IT Consulting and Bait Shop', null )
,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
,( 250, 'Joe''s IT Consulting and Bait Shop', null ) -- dup record
,( 666, 'The Honest Politician', 'LIE' )
,( 100, 'My Invisible Friend', 'WHO' )
;
go
-- add temporary PK and index
alter table #test1 add sk int not null identity constraint PK_test1 primary key clustered;
create index IX_test1 on #test1( col1, col2, col3 );
go
-- note: rebuilding the indexes may or may not provide a performance benefit
alter index PK_test1 on #test1 rebuild;
alter index IX_test1 on #test1 rebuild;
go
-- remove duplicates
with ranks as
(
select
sk
,ordinal = row_number() over
(
-- put all the columns composing uniqueness into the partition
partition by col1, col2, col3
order by sk
)
from #test1
)
delete
from ranks
where ordinal > 1;
go
-- remove added columns
drop index IX_test1 on #test1;
alter table #test1 drop constraint PK_test1;
alter table #test1 drop column sk;
go
Why not, instead of a Primary Key, simply create an Index and set
Ignore Duplicate Keys: YES
This will prevent a duplicate key from firing an error; the duplicate row simply will not be inserted (as it already exists).
I use this method to insert around 120,000 rows per day and it works flawlessly.
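The equivalent T-SQL, if you prefer a script over the SSMS dialog (index, table, and column names are just examples):
-- Duplicate keys are skipped with a warning instead of raising an error
CREATE UNIQUE INDEX UX_Transactions_NaturalKey
    ON dbo.Transactions (col1, col2, col3)
    WITH (IGNORE_DUP_KEY = ON);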
I would bulk copy into a temporary table and then push the data from that into the actual destination table. In this way, you can use SQL to check for and handle duplicates.
What is the data volume? You have 2 options that I can see:
1: filter it at source, by implementing your own IDataReader and using some hash over the data, and simply skipping any duplicates so that they never get passed into the TDS.
2: filter it in the DB; at the simplest level, I guess you could have multiple stages of import - the raw, unsanitised data - and then copy the DISTINCT data into your actual tables, perhaps using an intermediate table if you want to. You might want to use CHECKSUM for some of this, but it depends.
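For the CHECKSUM idea, a sketch along these lines might help (names are placeholders; BINARY_CHECKSUM can collide, so it only narrows the candidates and a full column-by-column comparison still makes the final call):
-- Find staged rows that share a checksum; only these groups need the
-- expensive all-columns comparison
SELECT BINARY_CHECKSUM(col1, col2, col3) AS row_hash, COUNT(*) AS cnt
FROM dbo.ImportStaging
GROUP BY BINARY_CHECKSUM(col1, col2, col3)
HAVING COUNT(*) > 1;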
And fix that table. No table should ever be without a unique index, preferably as a PK. Even if you add a surrogate key because there is no natural key, you need to be able to specifically identify a particular record. Otherwise, how will you get rid of the duplicates you already have?
I think this is a lot cleaner.
// Project the columns of interest and keep only distinct rows
var dtcolumns = new string[] { "Col1", "Col2", "Col3" };
var dtDistinct = dt.DefaultView.ToTable(true, dtcolumns);

using (SqlConnection cn = new SqlConnection(connectionString)) // your connection string
using (SqlBulkCopy copy = new SqlBulkCopy(cn))
{
    cn.Open();
    copy.ColumnMappings.Add(0, 0);
    copy.ColumnMappings.Add(1, 1);
    copy.ColumnMappings.Add(2, 2);
    copy.DestinationTableName = "TableNameToMapTo";
    copy.WriteToServer(dtDistinct);
}
This way you only need one database table and can keep the business logic in code.
I've taken over an ASP.NET application that needs to be re-written. The core functionality of this application that I need to replicate modifies a SQL Server database that is accessed via ODBC from third party software.
The third-party application creates files that represent printer labels, generated by a user. These label files directly reference an ODBC source's fields. Each row of the table represents a product that populates the label's fields. (So, within these files are direct references to the column names of the table.)
The ASP.NET application allows the user to create/update the data for these fields that are referenced by the labels, by adding or editing a particular row representing a product.
It also allows the occasional addition of new fields... where it actually creates a new column in the core table that is referenced by the labels.
My concern: I've never programmatically altered an existing table's columns before. The existing application seems to handle this functionality fine, but before I blindly do the same thing in my new application, I'd like to know what sort of pitfalls exist in doing this, if any... and if there are any obvious alternatives.
It can become a problem when too many columns are added to a table, and you have to be careful if performance is a consideration (covering indexes may no longer apply, so expensive bookmark lookups might be performed).
The other alternative is a Key-Value Pair structure: Key Value Pairs in Database design, but that too has its pitfalls, and you are better off creating new columns, as you are suggesting. (KVPs are good for settings.)
One option, I think, is to use a KVP table for storing dynamic "columns" (as first mentioned by Mitch), join the products table with the KVP table on the product ID, and then pivot the results so that all the dynamic columns appear in the result set.
EDIT: something along these lines:
Prepare:
create table Product(ProductID nvarchar(50))
insert Product values('Product1')
insert Product values('Product2')
insert Product values('Product3')
create table ProductKVP(ProductID nvarchar(50), [Key] nvarchar(50), [Value] nvarchar(255))
insert ProductKVP values('Product1', 'Key2', 'Value12')
insert ProductKVP values('Product2', 'Key1', 'Value21')
insert ProductKVP values('Product2', 'Key2', 'Value22')
insert ProductKVP values('Product2', 'Key3', 'Value23')
insert ProductKVP values('Product3', 'Key4', 'Value34')
Retrieve:
declare @forClause nvarchar(max),
@sql nvarchar(max)
select @forClause = isnull(@forClause + ',', '') + '[' + [Key] + ']' from (
select distinct [Key] from ProductKVP /* WHERE CLAUSE */
) t
set @forClause = 'for [Key] in (' + @forClause + ')'
set @sql = '
select * from (
select
ProductID, [Key], [Value]
from (
select k.* from
Product p
inner join ProductKVP k on (p.ProductID = k.ProductID)
/* WHERE CLAUSE */
) sq
) t pivot (
max([Value])' +
@forClause + '
) pvt'
exec(@sql)
Results:
ProductID Key1 Key2 Key3 Key4
----------- --------- --------- --------- -------
Product1 NULL Value12 NULL NULL
Product2 Value21 Value22 Value23 NULL
Product3 NULL NULL NULL Value34
It very much depends on the queries you want to run against those tables. The main disadvantage of KVP is that more complex queries can become very inefficient.
A "hybrid" approach of both might be interesting.
Store the values you want to query in dedicated columns and leave the rest in an XML blob (MS SQL has nice features to query even inside the XML), or alternatively in a KVP bag. Personally I really don't like KVPs in databases, because you can no longer build application-logic-specific indexes.
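A minimal sketch of that hybrid shape, with made-up names, using SQL Server's xml type and the .value() method:
CREATE TABLE dbo.ProductLabel (
    ProductID   nvarchar(50) PRIMARY KEY,
    Name        nvarchar(100) NULL,   -- frequently queried: a real column
    ExtraFields xml NULL              -- everything else lives in the blob
);
INSERT dbo.ProductLabel VALUES
    ('Product1', 'Widget', '<fields><color>Red</color><size>XL</size></fields>');
-- Querying inside the XML blob
SELECT ProductID,
       ExtraFields.value('(/fields/color)[1]', 'nvarchar(50)') AS Color
FROM dbo.ProductLabel;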
Just another approach would be not to model the specific columns at all. You create generic "custom attribute" columns such as Attribute1, Attribute2, Attribute3, Attribute4 (with whatever data types are required, etc.). You then add metadata to your database that describes what AttributeX means for a specific type of printer label.
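A sketch of what that might look like (every name here is invented for illustration):
-- Generic attribute slots on the data table
CREATE TABLE dbo.LabelData (
    ProductID  nvarchar(50) PRIMARY KEY,
    Attribute1 nvarchar(255) NULL,
    Attribute2 nvarchar(255) NULL,
    Attribute3 nvarchar(255) NULL,
    Attribute4 nvarchar(255) NULL
);
-- Metadata describing what each slot means for a given label type
CREATE TABLE dbo.LabelAttributeMeta (
    LabelType     nvarchar(50)  NOT NULL,
    AttributeSlot tinyint       NOT NULL,   -- which AttributeN column holds the value
    DisplayName   nvarchar(100) NOT NULL,
    PRIMARY KEY (LabelType, AttributeSlot)
);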
Again, it really depends on how you want to use that data in the end.
One risk is the table getting too wide. I used to maintain a horrible app that added 3 columns "automagically" when new values were added to some XML (for some reason it thought everything would be a string, a date, or a number - hence the creation of 3 columns).
There are other techniques like serializing a BLOB or designing the tables differently that may help.