I have a windows service which basically watches a folder for any CSV file. Each record in the CSV file is inserted into a SQL table. If the same CSV file is put in that folder, it can lead to duplicate record entries in the table. How can I avoid duplicate insertions into the SQL table?
Try INSERT WHERE NOT EXISTS, where a, b and c are the relevant columns and @a, @b and @c are the relevant values.
INSERT INTO table
(
a,
b,
c
)
VALUES
(
@a,
@b,
@c
)
WHERE NOT EXISTS
(
SELECT 0 FROM table WHERE a = @a AND b = @b AND c = @c
)
The accepted answer has a syntax error and is not compatible with relational databases like MySQL.
Specifically, the following is not compatible with most databases:
values(...) where not exists
While the following is generic SQL, and is compatible with all databases:
select ... where not exists
Given that, if you want to insert a single record into a table after checking if it already exists, you can do a simple select with a where not exists clause as part of your insert statement, like this:
INSERT
INTO table_name (
primary_col,
col_1,
col_2
)
SELECT 1234,
'val_1',
'val_2'
WHERE NOT EXISTS (
SELECT 1
FROM table_name
WHERE primary_col=1234
);
Simply pass all values with the select keyword, and put the primary or unique key condition in the where clause.
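As a rough, runnable sketch of the same pattern, the snippet below uses Python's built-in sqlite3 module as a stand-in for the real database (an assumption; the question is about SQL Server). The helper name insert_if_absent is hypothetical, and the table mirrors the example above.

```python
import sqlite3


def insert_if_absent(conn, key, col_1, col_2):
    """Insert one row unless a row with the same key already exists."""
    cur = conn.execute(
        "INSERT INTO table_name (primary_col, col_1, col_2) "
        "SELECT ?, ?, ? "
        "WHERE NOT EXISTS (SELECT 1 FROM table_name WHERE primary_col = ?)",
        (key, col_1, col_2, key),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when it was skipped
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (primary_col INTEGER, col_1 TEXT, col_2 TEXT)"
)

first = insert_if_absent(conn, 1234, "val_1", "val_2")
second = insert_if_absent(conn, 1234, "val_1", "val_2")  # same key again
row_total = conn.execute("SELECT COUNT(*) FROM table_name").fetchone()[0]
```

The check and the insert happen in one statement, so the application never issues a separate SELECT first.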
Problems with the answers using WHERE NOT EXISTS are:
performance -- row-by-row processing requires, potentially, a very large number of table scans against table
NULL handling -- for every column that might contain NULLs, you have to write the matching condition in a more complicated way, like
(a = @a OR (a IS NULL AND @a IS NULL)).
Repeat that for 10 columns and voilà - you hate SQL :)
A better answer would take into account the great SET processing capabilities that relational databases provide (in short -- never use row-by-row processing in SQL if you can avoid it. If you can't -- think again and avoid it anyway).
So for the answer:
load (all) data into a temporary table (or a staging table that can be safely truncated before load)
run the insert in a "set"-way:
INSERT INTO table (<columns>)
select <columns> from #temptab
EXCEPT
select <columns> from table
Keep in mind that the EXCEPT is safely dealing with NULLs for every kind of column ;) as well as choosing a high-performance join type for matching (hash, loop, merge join) depending on the available indexes and table statistics.
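The staging-plus-EXCEPT approach can be sketched end to end with Python's sqlite3 module. SQLite is only a stand-in here (an assumption), and the table names target and staging, the columns a, b, c, and the helper load_batch are all hypothetical. SQLite's EXCEPT also treats NULLs as equal when comparing rows, so it illustrates the NULL-safety point.

```python
import sqlite3


def load_batch(conn, rows):
    """Stage a batch, then insert only the rows not already in target."""
    conn.execute("DELETE FROM staging")  # staging is safe to empty before a load
    conn.executemany("INSERT INTO staging (a, b, c) VALUES (?, ?, ?)", rows)
    conn.execute(
        "INSERT INTO target (a, b, c) "
        "SELECT a, b, c FROM staging "
        "EXCEPT "
        "SELECT a, b, c FROM target"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (a INTEGER, b TEXT, c TEXT)")
conn.execute("CREATE TABLE staging (a INTEGER, b TEXT, c TEXT)")

batch = [(1, "x", None), (2, "y", "z")]
load_batch(conn, batch)
load_batch(conn, batch)                      # the same file dropped in again
load_batch(conn, batch + [(3, "w", None)])   # one genuinely new record
total = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

Note that EXCEPT also collapses duplicates inside the staged batch itself, which is usually what is wanted for a re-delivered file.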
I have to update a number of rows in a table; if a row being updated does not exist in the table, I need to insert it. I cannot use a unique key, so ON DUPLICATE KEY UPDATE is of no use.
I have to achieve something like this
DECLARE count DOUBLE;
SELECT count(uid)
INTO count
FROM Table
WHERE column1 ='xxxxx'
AND column2='xxxxx';
IF (count=0)
THEN
-- perform insert
ELSE
-- perform update
END IF;
This is for a high-performance application. Any ideas, at the code level or the query level?
FYI: the database is MySQL.
You could work with a temporary table.
Put your data into a temporary table
Do an update of the "other" table via a JOIN
Delete the matching data from the temp table
Insert the remaining stuff from the temp table into the main table.
This will be faster than doing it record by record if you have loads of data.
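The four steps above can be sketched as follows. This is a minimal illustration using Python's sqlite3 module rather than MySQL (an assumption), with a hypothetical main_table(column1, column2, payload) and a hypothetical temporary table named incoming. It uses correlated subqueries instead of an UPDATE with JOIN so that it runs on older SQLite builds, and it assumes each key pair appears at most once per batch.

```python
import sqlite3

MATCH = "i.column1 = main_table.column1 AND i.column2 = main_table.column2"


def upsert_batch(conn, rows):
    # 1. Put the data into a temporary table
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS incoming "
        "(column1 TEXT, column2 TEXT, payload TEXT)"
    )
    conn.execute("DELETE FROM incoming")
    conn.executemany("INSERT INTO incoming VALUES (?, ?, ?)", rows)
    # 2. Update the rows of the main table that have a match
    conn.execute(
        "UPDATE main_table "
        f"SET payload = (SELECT i.payload FROM incoming AS i WHERE {MATCH}) "
        f"WHERE EXISTS (SELECT 1 FROM incoming AS i WHERE {MATCH})"
    )
    # 3. Delete the matching data from the temporary table
    conn.execute(
        "DELETE FROM incoming WHERE EXISTS ("
        "SELECT 1 FROM main_table AS m "
        "WHERE m.column1 = incoming.column1 AND m.column2 = incoming.column2)"
    )
    # 4. Insert whatever is left into the main table
    conn.execute(
        "INSERT INTO main_table (column1, column2, payload) "
        "SELECT column1, column2, payload FROM incoming"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_table (column1 TEXT, column2 TEXT, payload TEXT)")
conn.execute("INSERT INTO main_table VALUES ('a', '1', 'old')")
upsert_batch(conn, [("a", "1", "new"), ("b", "2", "fresh")])
result = conn.execute(
    "SELECT column1, column2, payload FROM main_table ORDER BY column1"
).fetchall()
```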
That's the stored procedure we use; it could possibly work for you as well.
if not exists (select 1 from Table where column1 ='xxxxx' AND column2='xxxxx')
insert into Table ( column1,column2)
values ( #xxxx,xxxxx)
else
update Table
You can use EXISTS, or check whether the count of a subselect is > 0, to know if the row already exists.
BEGIN TRAN
IF EXISTS ( SELECT *
FROM Table WITH ( UPDLOCK, SERIALIZABLE )
WHERE CONDITION)
BEGIN
UPDATE Table SET SOMETHING WHERE CONDITION
END
ELSE
BEGIN
INSERT INTO Table(Field1,....) VALUES (Value1,..... )
END
COMMIT TRAN
NOTE: Transactions are very good, but using IF EXISTS is not a good fit for inserts/updates that involve multiple queries.
You may find the REPLACE statement useful. Its syntax is described here.
We have an ASP.NET/MSSQL based web app which generates orders with sequential order numbers.
When a user saves a form, a new order is created as follows:
SELECT MAX(order_number) FROM order_table, call this max_order_number
set new_order_number = max_order_number + 1
INSERT a new order record, with this new_order_number (it's just a field in the order record, not a database key)
If I enclose the above 3 steps in single transaction, will it avoid duplicate order numbers from being created, if two customers save a new order at the same time? (And let's say the system is eventually on a web farm with multiple IIS servers and one MSSQL server).
I want to avoid two customers selecting the same MAX(order_number) due to concurrency somewhere in the system.
What isolation level should be used? Thank you.
Why not just use an Identity as the order number?
Edit:
As far as I know, you can make the current order_number column an Identity (you may have to reset the seed, it's been a while since I've done this). You might want to do some tests.
Here's a good read about what actually goes on when you change a column to an Identity in SSMS. The author mentions how this may take a while if the table already has millions of rows.
Using an identity is by far the best idea. I create all my tables like this:
CREATE TABLE mytable (
mytable_id int identity(1, 1) not null primary key,
name varchar(50)
)
The "identity" flag means, "Let SQL Server assign this number for me". The (1, 1) means that identity numbers should start at 1 and be incremented by 1 each time someone inserts a record into the table. Not Null means that nobody should be allowed to insert a null into this column, and "primary key" means that we should create a clustered index on this column. With this kind of a table, you can then insert your record like this:
-- We don't need to insert into mytable_id column; SQL Server does it for us!
INSERT INTO mytable (name) VALUES ('Bob Roberts')
But to answer your literal question, I can give a lesson about how transactions work. It's certainly possible, although not optimal, to do this:
-- Begin a transaction - everything within this region is committed or
-- rolled back as a single atomic unit.
BEGIN TRANSACTION
DECLARE @id bigint
-- Retrieves the maximum order number from the table. The TABLOCKX and HOLDLOCK
-- hints take an exclusive table lock and hold it until the transaction ends.
SELECT @id = MAX(order_number) FROM order_table WITH (TABLOCKX, HOLDLOCK)
-- While you hold this lock, no other queries can change the order table,
-- so this insert cannot collide with another session's order number
INSERT INTO order_table (order_number) VALUES (@id + 1)
-- Committing the transaction releases your lock and allows other programs
-- to work on the order table
COMMIT TRANSACTION
Just keep in mind that declaring your table with an identity primary key column does this all for you automatically.
The risk is two processes selecting the MAX(order_number) before one of them inserts the new order. A safer way is to do it in one step:
INSERT INTO order_table
(order_number /* , other fields */)
SELECT MAX(order_number) + 1 /* , other values */
FROM order_table
I agree with G_M; use an Identity field. When you add your record, just
INSERT INTO order_table (/* other fields */)
VALUES (/* other fields */) ; SELECT SCOPE_IDENTITY()
The return value from Scope Identity will be your order number.
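For a runnable feel of the identity approach, here is a small sketch using Python's sqlite3 module, where INTEGER PRIMARY KEY AUTOINCREMENT and cursor.lastrowid play roughly the roles of IDENTITY and SCOPE_IDENTITY(). SQLite is a stand-in (an assumption), and the customer column and create_order helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The database assigns order_number; the application never computes MAX() + 1
conn.execute(
    "CREATE TABLE order_table ("
    "order_number INTEGER PRIMARY KEY AUTOINCREMENT, "
    "customer TEXT)"
)


def create_order(conn, customer):
    """Insert an order and return the number the database generated for it."""
    cur = conn.execute(
        "INSERT INTO order_table (customer) VALUES (?)", (customer,)
    )
    conn.commit()
    return cur.lastrowid  # generated key for the row this cursor inserted


first = create_order(conn, "alice")
second = create_order(conn, "bob")
```

Because the number is generated as part of the insert itself, two concurrent saves cannot both read the same maximum.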
I receive a daily XML file that contains thousands of records, each being a business transaction that I need to store in an internal database for use in reporting and billing.
I was under the impression that each day's file contained only unique records, but have discovered that my definition of unique is not exactly the same as the provider's.
The current application that imports this data is a C# .NET 3.5 console application; it does so using SqlBulkCopy into a MS SQL Server 2008 database table where the columns exactly match the structure of the XML records. Each record has just over 100 fields, and there is no natural key in the data, or rather the fields I can come up with making sense as a composite key end up also having to allow nulls. Currently the table has several indexes, but no primary key.
Basically the entire row needs to be unique. If one field is different, it is valid enough to be inserted. I looked at creating an MD5 hash of the entire row, inserting that into the database and using a constraint to prevent SqlBulkCopy from inserting the row, but I don't see how to get the MD5 hash into the BulkCopy operation, and I'm not sure if the whole operation would fail and roll back if any one record failed, or if it would continue.
The file contains a very large number of records, going row by row in the XML, querying the database for a record that matches all fields, and then deciding to insert is really the only way I can see being able to do this. I was just hoping not to have to rewrite the application entirely, and the bulk copy operation is so much faster.
Does anyone know of a way to use SqlBulkCopy while preventing duplicate rows, without a primary key? Or any suggestion for a different way to do this?
I'd upload the data into a staging table then deal with duplicates afterwards on copy to the final table.
For example, you can create a (non-unique) index on the staging table to deal with the "key"
Given that you're using SQL 2008, you have two options to solve the problem easily without needing to change your application much (if at all).
The first possible solution is to create a second table like the first one, but with a surrogate identity key and a uniqueness constraint added using the ignore_dup_key option, which will do all the heavy lifting of eliminating the duplicates for you.
Here's an example you can run in SSMS to see what's happening:
if object_id( 'tempdb..#test1' ) is not null drop table #test1;
if object_id( 'tempdb..#test2' ) is not null drop table #test2;
go
-- example heap table with duplicate record
create table #test1
(
col1 int
,col2 varchar(50)
,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
( 250, 'Joe''s IT Consulting and Bait Shop', null )
,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
,( 250, 'Joe''s IT Consulting and Bait Shop', null ) -- dup record
,( 666, 'The Honest Politician', 'LIE' )
,( 100, 'My Invisible Friend', 'WHO' )
;
go
-- secondary table for removing duplicates
create table #test2
(
sk int not null identity primary key
,col1 int
,col2 varchar(50)
,col3 char(3)
-- add a uniqueness constraint to filter dups
,constraint UQ_test2 unique ( col1, col2, col3 ) with ( ignore_dup_key = on )
);
go
-- insert all records from original table
-- this should generate a warning if duplicate records were ignored
insert #test2( col1, col2, col3 )
select col1, col2, col3
from #test1;
go
Alternatively, you can also remove the duplicates in-place without a second table, but the performance may be too slow for your needs. Here's the code for that example, also runnable in SSMS:
if object_id( 'tempdb..#test1' ) is not null drop table #test1;
go
-- example heap table with duplicate record
create table #test1
(
col1 int
,col2 varchar(50)
,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
( 250, 'Joe''s IT Consulting and Bait Shop', null )
,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
,( 250, 'Joe''s IT Consulting and Bait Shop', null ) -- dup record
,( 666, 'The Honest Politician', 'LIE' )
,( 100, 'My Invisible Friend', 'WHO' )
;
go
-- add temporary PK and index
alter table #test1 add sk int not null identity constraint PK_test1 primary key clustered;
create index IX_test1 on #test1( col1, col2, col3 );
go
-- note: rebuilding the indexes may or may not provide a performance benefit
alter index PK_test1 on #test1 rebuild;
alter index IX_test1 on #test1 rebuild;
go
-- remove duplicates
with ranks as
(
select
sk
,ordinal = row_number() over
(
-- put all the columns composing uniqueness into the partition
partition by col1, col2, col3
order by sk
)
from #test1
)
delete
from ranks
where ordinal > 1;
go
-- remove added columns
drop index IX_test1 on #test1;
alter table #test1 drop constraint PK_test1;
alter table #test1 drop column sk;
go
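A compact way to experiment with in-place duplicate removal outside SSMS is sketched below with Python's sqlite3 module (a stand-in, not SQL Server). It swaps the ROW_NUMBER() approach for a simpler one: keep the lowest rowid in each group of identical rows and delete the rest. SQLite's implicit rowid plays the part of the temporary sk column, and GROUP BY places NULLs in the same group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (col1 INTEGER, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO test1 (col1, col2, col3) VALUES (?, ?, ?)",
    [
        (250, "Joe's IT Consulting and Bait Shop", None),
        (120, "Mary's Dry Cleaning and Taxidermy", "ACK"),
        (250, "Joe's IT Consulting and Bait Shop", None),  # dup record
        (666, "The Honest Politician", "LIE"),
        (100, "My Invisible Friend", "WHO"),
    ],
)

# Keep the first physical row of every (col1, col2, col3) group, delete the rest
conn.execute(
    "DELETE FROM test1 WHERE rowid NOT IN ("
    "SELECT MIN(rowid) FROM test1 GROUP BY col1, col2, col3)"
)
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM test1").fetchone()[0]
joes = conn.execute("SELECT COUNT(*) FROM test1 WHERE col1 = 250").fetchone()[0]
```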
Instead of a primary key, why not simply create a unique index and set
Ignore Duplicate Keys: YES
This prevents a duplicate key from raising an error; the duplicate row is simply not created (as it already exists).
I use this method to insert around 120,000 rows per day and it works flawlessly.
I would bulk copy into a temporary table and then push the data from that into the actual destination table. In this way, you can use SQL to check for and handle duplicates.
What is the data volume? You have 2 options that I can see:
1: filter it at source, by implementing your own IDataReader and using some hash over the data, and simply skipping any duplicates so that they never get passed into the TDS.
2: filter it in the DB; at the simplest level, I guess you could have multiple stages of import - the raw, unsanitised data - and then copy the DISTINCT data into your actual tables, perhaps using an intermediate table if you want to. You might want to use CHECKSUM for some of this, but it depends.
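Option 1 (filtering at the source with a hash over the data) might look roughly like this. It is a language-neutral illustration in Python rather than an IDataReader implementation, the helper names row_digest and unique_rows are hypothetical, and it uses SHA-256 instead of MD5. It assumes every field has a stable string form, and it only removes duplicates within the rows it has seen, not against rows already in the database.

```python
import hashlib


def row_digest(row):
    """Digest of a whole row; None is encoded differently from ''."""
    h = hashlib.sha256()
    for field in row:
        if field is None:
            h.update(b"\x00N")
        else:
            data = str(field).encode("utf-8")
            # Length prefix keeps ('ab', 'c') distinct from ('a', 'bc')
            h.update(b"\x01" + len(data).to_bytes(4, "big") + data)
    return h.digest()


def unique_rows(rows, seen=None):
    """Yield each distinct row once, skipping rows whose digest was seen."""
    seen = set() if seen is None else seen
    for row in rows:
        digest = row_digest(row)
        if digest not in seen:
            seen.add(digest)
            yield row


rows = [
    (250, "Joe's IT Consulting and Bait Shop", None),
    (120, "Mary's Dry Cleaning and Taxidermy", "ACK"),
    (250, "Joe's IT Consulting and Bait Shop", None),  # dup record
    (666, "The Honest Politician", "LIE"),
]
kept = list(unique_rows(rows))
```

Passing the same seen set across several files would extend the filter over a whole day's imports, at the cost of keeping the digests in memory.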
And fix that table. No table ever should be without a unique index, preferably as a PK. Even if you add a surrogate key because there is no natural key, you need to be able to specifically identify a particular record. Otherwise how will you get rid of the duplicates you already have?
I think this is a lot cleaner.
var dtColumns = new string[] { "Col1", "Col2", "Col3" };
// Keep only the rows that are distinct across the listed columns
var dtDistinct = dt.DefaultView.ToTable(true, dtColumns);
// connectionString is assumed to be defined elsewhere
using (var cn = new SqlConnection(connectionString))
using (var copy = new SqlBulkCopy(cn))
{
cn.Open();
copy.ColumnMappings.Add(0, 0);
copy.ColumnMappings.Add(1, 1);
copy.ColumnMappings.Add(2, 2);
copy.DestinationTableName = "TableNameToMapTo";
copy.WriteToServer(dtDistinct);
}
This way you only need one database table and can keep the business logic in code.
I am building a hit counter. I have an article directory and am tracking unique visitors. When a visitor comes, I insert the article id and their IP address into the database. First I check whether the IP exists for the article id; if it does not, I make the insert. This is two queries. Is there a way to make this one query?
Also, I am not using stored procedures; I am using regular inline SQL.
Here are some options:
INSERT IGNORE INTO `yourTable`
SET `yourField` = 'yourValue',
`yourOtherField` = 'yourOtherValue';
(From the MySQL reference manual: "If you use the IGNORE keyword, errors that occur while executing the INSERT statement are treated as warnings instead. For example, without IGNORE, a row that duplicates an existing UNIQUE index or PRIMARY KEY value in the table causes a duplicate-key error and the statement is aborted.") If the record doesn't yet exist, it will be created.
Another option would be:
INSERT INTO yourTable (yourfield,yourOtherField) VALUES ('yourValue','yourOtherValue')
ON DUPLICATE KEY UPDATE yourField = yourField;
Doesn't throw error or warning.
Yes, you create a UNIQUE constraint on the columns article_id and ip_address. When you attempt to INSERT a duplicate the INSERT will be refused with an error. Just answered the same question here for SQLite.
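Since SQLite is mentioned, here is a small sketch of the constraint-based approach with Python's sqlite3 module. The table name article_hits and the helper record_visit are hypothetical. SQLite's INSERT OR IGNORE is used as the counterpart of MySQL's INSERT IGNORE, so the duplicate is skipped in a single statement instead of raising an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_hits ("
    "article_id INTEGER NOT NULL, "
    "ip_address TEXT NOT NULL, "
    "UNIQUE (article_id, ip_address))"
)


def record_visit(conn, article_id, ip_address):
    """Record a visit; return True only if it is a new unique visitor."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO article_hits (article_id, ip_address) "
        "VALUES (?, ?)",
        (article_id, ip_address),
    )
    conn.commit()
    return cur.rowcount == 1


new_visitor = record_visit(conn, 7, "203.0.113.5")
repeat_visitor = record_visit(conn, 7, "203.0.113.5")   # same article, same IP
other_article = record_visit(conn, 8, "203.0.113.5")    # same IP, new article
hits = conn.execute("SELECT COUNT(*) FROM article_hits").fetchone()[0]
```

A plain INSERT against the same table would instead raise sqlite3.IntegrityError for the repeat visit, which is the "refused with an error" behaviour described above.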
IF NOT EXISTS (SELECT * FROM MyTable where IPAddress...)
INSERT...
Not with SQL Server. With T-SQL you have to check for the existence of a row, then use either INSERT or UPDATE as appropriate.
Another option is to try UPDATE first, and then examine the row count to see if there was a record updated. If not, then INSERT. Given a 50/50 chance of a row being there, you have executed a single query 50% of the time.
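The update-first idea can be sketched like this, again with Python's sqlite3 module standing in for the real database. The hits table, its hit_count column, and the record_hit helper are hypothetical additions for illustration. Without a unique constraint or suitable locking, two concurrent sessions could both see zero updated rows and both insert, so this is a single-writer sketch.

```python
import sqlite3


def record_hit(conn, article_id, ip_address):
    """Try UPDATE first; fall back to INSERT when no row was touched."""
    cur = conn.execute(
        "UPDATE hits SET hit_count = hit_count + 1 "
        "WHERE article_id = ? AND ip_address = ?",
        (article_id, ip_address),
    )
    if cur.rowcount == 0:  # nothing matched, so the row does not exist yet
        conn.execute(
            "INSERT INTO hits (article_id, ip_address, hit_count) "
            "VALUES (?, ?, 1)",
            (article_id, ip_address),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (article_id INTEGER, ip_address TEXT, hit_count INTEGER)"
)
record_hit(conn, 7, "203.0.113.5")
record_hit(conn, 7, "203.0.113.5")
record_hit(conn, 7, "198.51.100.9")
rows = conn.execute(
    "SELECT ip_address, hit_count FROM hits ORDER BY ip_address"
).fetchall()
```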
MySQL has an extension called REPLACE that has the capability that you seek.
The only way I can think of is execute dynamic SQL using the SqlCommand object.
IF NOT EXISTS(SELECT 1 FROM IPTable where IpAddr=<ipaddr>)
--Insert Statement
I agree with Larry about using uniqueness, but I would implement it like this:
IP_ADDRESS, pk
ARTICLE_ID, pk, fk
This ensures that each record is a unique hit. Attempts to insert duplicates would get an error from the database.
I would really use procedures! :)
But either way, this will probably work:
Create a UNIQUE index covering both the IP and article ID columns; the insert query will fail if that combination already exists, so technically it'll work! (tested on MySQL)
Try this (it's a real kludge, but it should work...):
Insert TableName ([column list])
Select Distinct @PK, @valueA, @ValueB, etc. -- list all values to be inserted
From TableName
Where Not Exists
(Select * From TableName
Where PK = @PK)