Best way to avoid adding duplicates in database - c#

I have a SQL Server table with three columns:
Table1
col1 int
col2 int
col3 string
I have a unique constraint defined for all three columns (col1, col2, col3)
Now, I have a .csv file from which I want to add records to this table, and the *.csv file can contain duplicate records.
I have looked at various options for avoiding duplicates in this scenario. Below are the three options that are working well for me. Please take a look and share your thoughts on the pros/cons of each method so I can choose the best one.
Option #1:
Avoiding duplicates in the first place, i.e. while adding objects to the list from the csv file. I have used HashSet<T> for this and overridden the methods below for type T:
public override int GetHashCode()
{
    return col1.GetHashCode() + col2.GetHashCode() + col3.GetHashCode();
}

public override bool Equals(object obj)
{
    var other = obj as T;
    if (other == null)
    {
        return false;
    }
    return col1 == other.col1
        && col2 == other.col2
        && col3 == other.col3;
}
Option #2:
Having a List<T> instead of a HashSet<T>, and removing duplicates after all the objects are added to the List<T>:
List<T> distinctObjects = allObjects
    .GroupBy(x => new { x.col1, x.col2, x.col3 })
    .Select(g => g.First())
    .ToList();
Option #3:
Removing duplicates after all the objects are added to a DataTable:
public static DataTable RemoveDuplicateRows(DataTable dataTable)
{
    IEnumerable<DataRow> uniqueRows = dataTable.AsEnumerable().Distinct(DataRowComparer.Default);
    return uniqueRows.CopyToDataTable();
}
Although I have not compared their running times, I prefer option #1 because it removes duplicates as a first step, so I move ahead with only what is required.
Please share your views so I can choose the best one.
Thanks a lot!

I like option 1: the HashSet<T> provides a fast way of avoiding duplicates before ever sending them to the DB. You should implement a better GetHashCode, e.g. using Jon Skeet's implementation from What is the best algorithm for an overridden System.Object.GetHashCode?
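For reference, a minimal sketch of that multiply/add pattern applied to the question's fields (assuming col3 is the string column; simply summing the hash codes, as in option #1, makes e.g. (1, 2, "x") and (2, 1, "x") collide):

public override int GetHashCode()
{
    unchecked // overflow simply wraps, which is fine for hashing
    {
        int hash = 17;
        hash = hash * 23 + col1.GetHashCode();
        hash = hash * 23 + col2.GetHashCode();
        hash = hash * 23 + (col3 == null ? 0 : col3.GetHashCode());
        return hash;
    }
}

Because each field is folded in with a multiply before the add, the hash is order-sensitive, so swapping col1 and col2 no longer produces the same value.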
But there's a problem: what if the table already contains data that could duplicate rows in your CSV? You'd have to copy the whole table down first for a simple HashSet to really work. You could do just that, but to solve this I might pair option 1 with a temporary holding table and an insert statement like the one from Skip-over/ignore duplicate rows on insert:
INSERT dbo.Table1 (col1, col2, col3)
SELECT col1, col2, col3
FROM dbo.tmp_holding_Table1 AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.Table1 AS d
                  WHERE d.col1 = t.col1
                    AND d.col2 = t.col2
                    AND d.col3 = t.col3);
With this combination, the volume of data transferred to/from your DB is minimized.
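A minimal C# sketch of that flow, assuming the holding table dbo.tmp_holding_Table1 already exists with the same three columns, and that connectionString and the de-duplicated dataTable come from your own code:

// using System; using System.Data; using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // Bulk-load the already de-duplicated rows into the holding table
    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.tmp_holding_Table1" })
    {
        bulk.WriteToServer(dataTable);
    }

    // Move only the rows that are not already in the target table
    using (var cmd = new SqlCommand(@"
        INSERT dbo.Table1 (col1, col2, col3)
        SELECT col1, col2, col3
        FROM dbo.tmp_holding_Table1 AS t
        WHERE NOT EXISTS (SELECT 1 FROM dbo.Table1 AS d
                          WHERE d.col1 = t.col1
                            AND d.col2 = t.col2
                            AND d.col3 = t.col3);", conn))
    {
        cmd.ExecuteNonQuery();
    }
}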

Another solution is the IGNORE_DUP_KEY = { ON | OFF } option when creating or rebuilding an index. With it, inserting duplicate rows no longer raises an error; instead, SQL Server generates a warning: Duplicate key was ignored.
CREATE TABLE dbo.MyTable (Col1 INT, Col2 INT, Col3 INT);
GO
CREATE UNIQUE INDEX IUN_MyTable_Col1_Col2_Col3
ON dbo.MyTable (Col1,Col2,Col3)
WITH (IGNORE_DUP_KEY = ON);
GO
INSERT dbo.MyTable (Col1,Col2,Col3)
VALUES (1,11,111);
INSERT dbo.MyTable (Col1,Col2,Col3)
SELECT 1,11,111 UNION ALL
SELECT 2,22,222 UNION ALL
SELECT 3,33,333;
INSERT dbo.MyTable (Col1,Col2,Col3)
SELECT 2,22,222 UNION ALL
SELECT 3,33,333;
GO
/*
(1 row(s) affected)
(2 row(s) affected)
Duplicate key was ignored.
*/
SELECT * FROM dbo.MyTable;
/*
Col1        Col2        Col3
----------- ----------- -----------
1           11          111
2           22          222
3           33          333
*/
Note: because you have a UNIQUE constraint, if you try to change the index options with ALTER INDEX
ALTER INDEX IUN_MyTable_Col1_Col2_Col3
ON dbo.MyTable
REBUILD WITH (IGNORE_DUP_KEY = ON)
you will get the following error:
Msg 1979, Level 16, State 1, Line 1
Cannot use index option ignore_dup_key to alter index 'IUN_MyTable_Col1_Col2_Col3' as it enforces a primary or unique constraint.
So, if you choose this solution, the options are:
1) Create another UNIQUE index (with IGNORE_DUP_KEY = ON) and then drop the UNIQUE constraint (this requires more storage space while both exist, but a UNIQUE index/constraint stays active the whole time), or
2) Drop the UNIQUE constraint and then create a UNIQUE index with the WITH (IGNORE_DUP_KEY = ON) option (I wouldn't recommend this last option).
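Note that from ADO.NET the Duplicate key was ignored. message arrives as a warning rather than an exception; if you want to see it on the C# side you can hook SqlConnection.InfoMessage (a minimal sketch; connectionString is assumed):

// using System; using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
{
    // Fires for informational messages/warnings, e.g. "Duplicate key was ignored."
    conn.InfoMessage += (sender, e) => Console.WriteLine("SQL warning: " + e.Message);
    conn.Open();

    using (var cmd = new SqlCommand(
        "INSERT dbo.MyTable (Col1, Col2, Col3) VALUES (1, 11, 111);", conn))
    {
        cmd.ExecuteNonQuery(); // the duplicate is silently skipped; only the warning fires
    }
}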

Related

Return the row when a unique key violation happens

Is it possible to return the record that causes the unique key violation in MSSQL when inserting data?
Try this pattern:
select * from
(
    -- query used for your insert
) f1
where exists
(
    select * from tablewhereyouwantinsert f2
    where f1.key1 = f2.key1 and f1.key2 = f2.key2 -- keys used in your unique key violation
)
You can use MERGE to conditionally insert or retrieve a row from the database using a single statement.
Unfortunately, to get the retrieval action we do have to touch the existing row. I'm assuming that's acceptable and that you'll be able to construct a low-impact "no-op" UPDATE as below:
create table T (ID int not null primary key, Col1 varchar(3) not null);
insert into T (ID, Col1) values (1, 'abc');

merge T t
using (values (1, 'def')) s (ID, Col1)
on t.ID = s.ID
when matched then update set Col1 = t.Col1
when not matched then insert (ID, Col1) values (s.ID, s.Col1)
output inserted.*, $action;
This produces:
ID          Col1 $action
----------- ---- ----------
1           abc  UPDATE
Including the $action column helps you know that this was an existing row rather than the insert of (1,def) succeeding.
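From C# you can read the OUTPUT row back with ExecuteReader. A sketch against the same table T (connectionString is assumed; the column order follows the OUTPUT clause):

// using System; using System.Data.SqlClient;
const string mergeSql = @"
    merge T t
    using (values (@ID, @Col1)) s (ID, Col1)
    on t.ID = s.ID
    when matched then update set Col1 = t.Col1
    when not matched then insert (ID, Col1) values (s.ID, s.Col1)
    output inserted.ID, inserted.Col1, $action;";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(mergeSql, conn))
{
    cmd.Parameters.AddWithValue("@ID", 1);
    cmd.Parameters.AddWithValue("@Col1", "def");
    conn.Open();

    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // $action is 'UPDATE' for a pre-existing row, 'INSERT' for a new one
            Console.WriteLine("{0} {1} {2}",
                reader.GetInt32(0), reader.GetString(1), reader.GetString(2));
        }
    }
}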

Compare two DataTables with several keys and select the rows that are not present in second table

I have two DataTables and I want to select the rows from the first one that are not present in the second one. Both tables have three keys: custnum, shiptonum, connum.
For example:
Table Contacts
custnum  shiptonum  connum  column
1        1          1       data1
2        2          2       data2
3        3          3       data3
4        4          4       data4

Table Invitations
custnum  shiptonum  connum  column
1        1          1       data11
3        3          3       data33

I'd like the result to be:

Table Result
custnum  shiptonum  connum  column
2        2          2       data2
4        4          4       data4
I already tried using
var differences = table1.AsEnumerable().Except(table2.AsEnumerable(), DataRowComparer.Default);
but it didn't work. For example, in my testing the Contacts table has 14,389 records and the Invitations table has two records that also exist in the Contacts table; the count after using the above solution was 14,389 instead of 14,387 (i.e. the two records from the Invitations table were not removed).
You wrote:
I want to select the rows from the first one which are not present in second one
From your example, I see that you don't want to select rows from the first table that are not rows in the second table, but that you only want to take the values of the keys into account:
I want to select all rows from tableA which have keys with values that are not keys from tableB
You didn't define your tables. They might be IQueryable or IEnumerable; for your LINQ statements there is not a big difference. However, try to avoid AsEnumerable, especially if your data source lives in a different process, like a database management system: the other process is usually much more efficient at executing your query than your own process, and AsEnumerable transports all the data from that process to yours, which is relatively slow. So, as a rule: only use AsEnumerable if you really need to.
The second formulation states more clearly what you want: apparently from tableB you only need the keys:
var keysTableB = tableB.Select(row => new
{
    CustNum = row.custNum,
    ShipToNum = row.shiptonum,
    ConNum = row.connum,
});
In words: from every row in tableB make one new object of anonymous type with three properties: CustNum, ShipToNum and ConNum
Select uses deferred execution: no query is executed yet, only the query's Expression is built up.
Now you want to keep only the rows from tableA whose key is not a member of the sequence keysTableB. If you want to keep a subset of a sequence, use Where.
var result = tableA.Where(row => !keysTableB.Contains(new
{
    CustNum = row.custNum,
    ShipToNum = row.shiptonum,
    ConNum = row.connum,
}));
In words: from tableA keep only those rows whose key is not also in keysTableB, comparing the keys by value equality.
TODO: consider combining these two LINQ statements into one. I doubt it would improve performance, and it would certainly hurt the readability, and thus the changeability / maintainability / testability, of your code.
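If Contacts and Invitations are plain DataTables (as in the question) rather than typed rows, the same idea can be written with Field<T>. A minimal sketch, assuming the three key columns are int; the HashSet makes each key lookup O(1) instead of scanning the key list for every row:

// using System; using System.Data; using System.Linq;
// (requires a reference to System.Data.DataSetExtensions)
var keysTableB = new HashSet<Tuple<int, int, int>>(
    tableB.AsEnumerable().Select(row => Tuple.Create(
        row.Field<int>("custnum"),
        row.Field<int>("shiptonum"),
        row.Field<int>("connum"))));

// Keep only the tableA rows whose key is absent from tableB
var result = tableA.AsEnumerable()
    .Where(row => !keysTableB.Contains(Tuple.Create(
        row.Field<int>("custnum"),
        row.Field<int>("shiptonum"),
        row.Field<int>("connum"))));

// CopyToDataTable throws on an empty sequence, so guard it
DataTable resultTable = result.Any() ? result.CopyToDataTable() : tableA.Clone();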
for (int i = 0; i < table1.Rows.Count; i++)
{
    // substitute the column's actual type for int
    var rowExists = from dr in table2.AsEnumerable()
                    where dr.Field<int>("column_name") == table1.Rows[i].Field<int>("column_name")
                    select dr;
    if (!rowExists.Any())
    {
        // here you import row table1.Rows[i] into the new table
    }
}

Translating a Set of Integers to String Values in SQL Server (Using T-SQL for certain and maybe .NET)

I have multiple tables with int values in them that represent a specific string (text), and I want to convert the integers to the string values. The goal is to make a duplicate copy of the table and then translate the integers to strings for easy analysis.
For example, I have the animalstable, where the AnimalType field consists of int values:
0 = "Cat", 1 = dog, 2= "bird", 3 = "turtle", 99 = "I Don't Know"
Can someone help me out with some starting code for this translation to animalsTable2 showing the string values?
Any help would be so very much appreciated! I want to thank you in advance for your help!
The best solution would be to create a related table that defines the integer values.
CREATE TABLE [Pets](
    [ID] [int] NOT NULL,
    [Pet] [varchar](50) NULL,
    CONSTRAINT [PK_Pets] PRIMARY KEY CLUSTERED ([ID] ASC)
) ON [PRIMARY]
Then you can insert your pet descriptions, but you can leave out the "I Don't Know" item; it can be handled by left joining the Pets table to your main table.
--0 = "Cat", 1 = dog, 2= "bird", 3 = "turtle", 99 = "I Don't Know"`
INSERT INTO [Pets] ([ID],[Pet]) VALUES(0, 'cat');
INSERT INTO [Pets] ([ID],[Pet]) VALUES(1, 'dog');
INSERT INTO [Pets] ([ID],[Pet]) VALUES(2, 'bird');
INSERT INTO [Pets] ([ID],[Pet]) VALUES(3, 'turtle');
Now you can include the [Pets].[Pet] field in the output of your query like so:
SELECT [MainTableField1]
      ,[MainTableFieldx]
      ,ISNULL([Pet], 'I don''t know') AS Pet
FROM [dbo].[MainTable] a
LEFT JOIN [dbo].[Pets] b
    ON a.[MainTable_PetID] = b.[ID]
Alternatively, you can just define the strings in a CASE expression inside your query. This, however, is not advised if you could be using the strings in more than one query.
SELECT CASE SomeField
           WHEN 0 THEN 'cat'
           WHEN 1 THEN 'dog'
           WHEN 2 THEN 'bird'
           WHEN 3 THEN 'turtle'
           ELSE 'i don''t know'
       END AS IntToString
FROM SomeTable
The benefit of the related table is that you have only one place to maintain your string definitions, and any edits propagate to all queries, views or procedures that use it.
You can create a table variable to store the mappings, then insert from a join between that table and the original table like so:
-- Create a table variable to hold the mapping
DECLARE @animalMapping TABLE(
    animalType int NOT NULL,
    animalName varchar(30) NOT NULL
);

-- Insert the mapping values
INSERT INTO @animalMapping (animalType, animalName)
VALUES (0, 'Cat'),
       (1, 'Dog'),
       (2, 'Bird'),
       (3, 'Turtle'),
       (99, 'I don''t know');

-- Insert into the new table
INSERT INTO animalsTable2
SELECT a.id, <other fields from animalstable>,
       m.animalName
FROM animalstable a
JOIN @animalMapping m
    ON a.AnimalType = m.animalType;

TransactionScope fails check constraint

I am having an issue using TransactionScope and a check constraint in SQL Server.
I want to insert into the table as such:
Col A | Col B
------+------
Dave  | 0
Fred  | 1
The table has a check constraint requiring that there always be an entry in Col B with '0'. The first row inserts fine, but the second row fails the constraint.
command.CommandText = @"INSERT INTO MyTable (ColA, ColB) VALUES (@ColA, @ColB)";
foreach (var row in model.Rows)
{
    command.Parameters["@ColA"].Value = row.ColA;
    command.Parameters["@ColB"].Value = row.ColB;
    command.ExecuteNonQuery();
}
The check constraint calls the following function:
IF EXISTS (SELECT * FROM mytable WHERE ColB = 0) RETURN 1
RETURN 0
Could this be because the constraint is only looking at committed data, and if so, how can it be told to look at uncommitted data as well?
I don't think check constraints are suitable for a scenario like yours. You should use an INSTEAD OF INSERT/UPDATE trigger to check that there's at least one qualifying row (in the table and/or in the inserted values).
Inside a trigger you have an inserted pseudo-table that contains all the rows about to be inserted, so you can write something like this:
IF NOT EXISTS (SELECT 1 FROM (SELECT ColB FROM mytable UNION ALL SELECT ColB FROM inserted) AS x WHERE x.ColB = 0)
    RAISERROR('At least one row with ColB = 0 should exist', 16, 1);

How to optimize SQL query?

I have 2 tables ('keys' consists of about 6 fields, 'stats' consists of about 65 fields).
I want to insert rows into both tables without duplication of the phrase text. I use something like this:
UPDATE Keys SET CommandType = 'ADDED', CommandCode = @CommandCode
WHERE KeyText = @KeyText AND Tab_ID = @TabID AND CommandType = 'DELETED';

INSERT INTO Keys (IsChecked, KeyText, AddDateTime, Tab_ID, KeySource_ID, CommandCode, CommandType)
SELECT 0, @KeyText, datetime(), @TabID, @KeySourceID, @CommandCode, 'ADDED'
WHERE NOT EXISTS (SELECT 1 FROM Keys WHERE Tab_ID = @TabID AND KeyText = @KeyText);

INSERT INTO Statistics (Key_ID)
SELECT ID FROM Keys
WHERE KeyText = @KeyText AND Tab_ID = @TabID
  AND (CommandType IS NULL OR CommandType <> 'DELETED')
  AND NOT EXISTS (SELECT 1 FROM Statistics
                  WHERE Key_ID = (SELECT ID FROM Keys
                                  WHERE KeyText = @KeyText AND Tab_ID = @TabID
                                    AND (CommandType IS NULL OR CommandType <> 'DELETED')
                                  LIMIT 1));
How can I optimize this? I have created indexes for all the fields used in this query. Can you recommend a solution?
Thanks for the help, and sorry for my bad English.
Creating indices slows down insert and update queries, because each index must be updated along with the data. To optimize your particular insert statements, get rid of any indices you don't need for your typical select statements. Then work on simplifying those NOT EXISTS clauses; they are the only place any performance gains are going to come from. Once you've simplified them, try creating indices to speed those subqueries up.
You can combine the insert and update into a single statement with MERGE, and if you want to copy modifications of Keys into Statistics, you can use an OUTPUT clause.
You'd have to add your indexes to the question for anyone to comment on their effectiveness, but basically you want a single index on each table that contains all of the columns in your WHERE clause, with include columns for anything in your SELECT that is not in the WHERE clause.
The best way to optimize is to get an estimated/actual query plan and see which parts of the query are slow. In SQL Server this is done from the Query menu. Basically, look out for anything that says "scan": that means you're missing an index. A "seek" is good.
However, a query plan is mostly helpful for fine-tuning. In this case, using a different algorithm (like merge/output) will make a more drastic difference.
In SQL Server, the results would look somewhat like this:
INSERT INTO [Statistics] (ID)
SELECT ID FROM
(
    MERGE [Keys] AS target
    USING (
        SELECT @KeyText AS KeyText, @TabID AS TabId, @CommandCode AS CommandCode,
               @KeySourceID AS KeySourceID, 'ADDED' AS CommandType
    ) AS source
    ON (target.KeyText = source.KeyText AND target.Tab_ID = @TabID)
    WHEN MATCHED AND target.CommandType = 'DELETED' THEN
        UPDATE SET target.CommandType = source.CommandType, target.CommandCode = source.CommandCode
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (IsChecked, KeyText, AddDateTime, Tab_ID, KeySource_ID, CommandCode, CommandType)
        VALUES (0, source.KeyText, getdate(), source.TabId, source.KeySourceID, source.CommandCode, source.CommandType)
    OUTPUT $action, inserted.ID
) AS Changes (Action, ID)
WHERE Changes.Action = 'INSERT'
  AND NOT EXISTS (SELECT 1 FROM [Statistics] b WHERE b.ID = Changes.ID);
The problem was bad indexes on my tables. I rebuilt them and replaced some query parameters with static content, and now it works great!
