How to delete data faster from a large table in SQL Server? - c#

I have a huge table (log) which keeps some history data. It has more than 10 columns:
Id, Year, Month, Day, data1, data2, data3, ......
Because the table is huge, it has lots of indexes on it.
The system keeps inserting lots of new data into this table. However, because of the way the system works, duplicated data is sometimes inserted (only the id is different). The duplicates' ids (the ids only) are also inserted into another table (log_existing). We have another service that deletes the duplicates from both tables. Here is what we are doing now:
SET @TotalRows = 0;
SET @Rows = 0;
WHILE 1=1
BEGIN
    DECLARE @Ids TABLE (id BIGINT);
    INSERT INTO @Ids
    SELECT TOP (@BatchSize) Id
    FROM Log
    DELETE FROM Log WHERE Id IN (SELECT id FROM @Ids)
    DELETE FROM Log_Existing WHERE Id IN (SELECT id FROM @Ids)
    SET @Rows = @@ROWCOUNT
    IF(@Rows < @BatchSize)
    BEGIN
        BREAK;
    END
    SET @TotalRows = @TotalRows + @Rows
    IF(@TotalRows >= @DeleteSize)
    BEGIN
        BREAK;
    END
    SET @Rows = 0;
END
Basically, the service runs every 2 minutes (or 5 minutes; it's configurable) to run this batch delete, with @BatchSize = 2000 and @DeleteSize = 1000000. A run usually takes longer than the 2/5-minute interval.
It worked OK for some time, but now we realize that there are too many duplicates and this process cannot delete them fast enough. So the database grows larger and larger, and the process gets slower and slower.
Is there a way to make it faster, or some kind of guideline?
Thanks

I would try to avoid inserting duplicates into the Log table in the first place. From your description it sounds like this should be possible, since some combination of columns (besides the Id) makes an entry unique.
One option is using the IGNORE_DUP_KEY option on a unique index. When such an index exists and an INSERT statement tries to insert a row that violates the index's unique constraint, the INSERT of that row is ignored. See the Microsoft SQL Server documentation:
CREATE TABLE #Test (C1 nvarchar(10), C2 nvarchar(50), C3 datetime);
GO
CREATE UNIQUE INDEX AK_Index ON #Test (C2)
WITH (IGNORE_DUP_KEY = ON);
GO
INSERT INTO #Test VALUES (N'OC', N'Ounces', GETDATE());
INSERT INTO #Test SELECT * FROM Production.UnitMeasure;
GO
SELECT COUNT(*) AS [Number of rows] FROM #Test;
GO
DROP TABLE #Test;
GO

I think if you use a DELETE statement with a JOIN clause, something like this, it should do better. Note that in SQL Server a single DELETE can only target one table, so you need one statement per table:
DELETE L FROM Log AS L INNER JOIN Log_Existing AS E ON L.LOGID = E.LOGID
DELETE E FROM Log_Existing AS E WHERE NOT EXISTS (SELECT 1 FROM Log WHERE Log.LOGID = E.LOGID)

Related

How to insert huge dummy data into SQL Server

The development team has finished their application, and as a tester I need to insert 1,000,000 records into the 20 tables for performance testing.
I have gone through the tables, and there are actually relationships between all of them.
To insert that much dummy data into the tables, I would need to understand the application completely in a very short span of time, and I don't have the dummy data yet either.
Is there any way in SQL Server to insert this much data?
Please share possible approaches.
Currently I am planning to create the dummy data in Excel, but I am not sure about the relationships between the tables.
I found on Google that SQL Profiler will provide the order of execution, but I am waiting for access so I can analyze this.
One more thing I found on Google is that a Red Gate tool can be used.
Is there any script or other solution to perform this task in a simple way?
I am very sorry if this is a common question; this is my first time working on a real-world SQL scenario, but I do have knowledge of SQL.
Why don't you generate those records in SQL Server? Here is a script that generates a table with 1,000,000 rows:
DECLARE @values TABLE (DataValue int, RandValue INT)
;WITH mycte AS
(
    SELECT 1 DataValue
    UNION ALL
    SELECT DataValue + 1
    FROM mycte
    WHERE DataValue + 1 <= 1000000
)
INSERT INTO @values(DataValue, RandValue)
SELECT
    DataValue,
    convert(int, convert(varbinary(4), NEWID(), 1)) AS RandValue
FROM mycte m
OPTION (MAXRECURSION 0)

SELECT
    v.DataValue,
    v.RandValue,
    (SELECT TOP 1 [User_ID] FROM tblUsers ORDER BY NEWID())
FROM @values v
In the table variable @values you will have a random int value (column RandValue) which can be used to generate values for other columns. You also have an example of getting a random foreign key.
Below is a simple procedure I wrote to insert millions of dummy records into the table. I know it's not the most efficient approach, but it serves the purpose; for a million records it takes around 5 minutes. You need to pass the number of records to generate when executing the procedure.
IF EXISTS (SELECT 1 FROM dbo.sysobjects WHERE id = OBJECT_ID(N'[dbo].[DUMMY_INSERT]') AND type in (N'P', N'PC'))
BEGIN
    DROP PROCEDURE DUMMY_INSERT
END
GO
CREATE PROCEDURE DUMMY_INSERT (
    @noOfRecords INT
)
AS
BEGIN
    DECLARE @count int
    SET @count = 1;
    WHILE (@count < @noOfRecords)
    BEGIN
        INSERT INTO [dbo].[LogTable] ([UserId],[UserName],[Priority],[CmdName],[Message],[Success],[StartTime],[EndTime],[RemoteAddress],[TId])
        VALUES(1,'user_'+CAST(@count AS VARCHAR(256)),1,'dummy command','dummy message.',0,convert(varchar(50),dateadd(D,Round(RAND() * 1000,1),getdate()),121),convert(varchar(50),dateadd(D,Round(RAND() * 1000,1),getdate()),121),'160.200.45.1',1);
        SET @count = @count + 1;
    END
END
You can use a cursor to repeat data.
For example, this simple code:
Declare @SYMBOL nchar(255), --sample V
        @SY_ID int          --sample V

Declare R2 Cursor
    For SELECT [ColumnsName]
        FROM [TableName]
    For Read Only;

Open R2
Fetch Next From R2 INTO @SYMBOL, @SY_ID

While (@@FETCH_STATUS <> -1)
Begin
    Insert INTO [TableName] ([ColumnsName])
    Values (@SYMBOL, @SY_ID)

    Fetch Next From R2 INTO @SYMBOL, @SY_ID
End

Close R2
Deallocate R2

/*wait a ... moment*/

SELECT COUNT(*) --check result
FROM [TableName]

SQL Server 2008: re-increment table after deletion

Using SQL Server 2008 and MS Visual Studio 2012, C# .NET 4.5.
I asked a similar question last week that was solved with the following query:
DECLARE @from int = 9, @to int = 3

UPDATE MainPareto
SET pareto = m.new_pareto
FROM (
    SELECT pKey, -- this is your primary key for the table
           new_pareto = row_number()
               over(ORDER BY CASE WHEN pareto = @from THEN @to ELSE pareto END,
                    CASE WHEN pareto = @from THEN 0 ELSE 1 END)
    FROM MainPareto
    -- put in any conditions that you want to restrict the scores by.
    WHERE PG = @pg AND pareto IS NOT NULL
    -- end conditions
) as m
INNER JOIN MainPareto ON MainPareto.pKey = m.pKey
WHERE MainPareto.pareto <> m.new_pareto
As you can see, this works great; it increments the "league" when changes are made.
Now a user has requested some additional functionality: deletion and recovery.
On my WinForm, the user can right-click the grid and delete the "part" number.
The user can also recover it if needed.
However, I need a stored procedure that will re-sort and update the grid the way this method does, after a deletion has been made by another stored procedure. My WinForm will sort its part out, but I do need a procedure that can do what my current one does, for a deletion.
Hope you guys understand; if not, ask me and I'll try to clarify as best I can.
I am not totally sure if this is what you are looking for, but this is how you can reseed your Primary Key column (if your primary key is also an identity). Notice how my insert after the truncate does not include Column 1 (the primary key column).
select *
into #temp
from MainPareto
truncate table MainPareto
insert into MainPareto (col2, col3, col4) --...
select col2, col3, col4 --...
from #temp

Procedure returning a list of identity values [duplicate]

I am inserting records through a query similar to this one:
insert into tbl_xyz select field1 from tbl_abc
Now I would like to retrieve the newly generated IDENTITY values of the inserted records. How do I do this with a minimum amount of locking and maximum reliability?
You can get this information using the OUTPUT clause.
You can output your information to a temp target table or view.
Here's an example:
DECLARE @InsertedIDs TABLE (ID bigint)
INSERT INTO DestTable (col1, col2, col3, col4)
OUTPUT INSERTED.ID INTO @InsertedIDs
SELECT col1, col2, col3, col4 FROM SourceTable
You can then query the table variable @InsertedIDs for your inserted IDs.
@@IDENTITY will return you the last inserted IDENTITY value, so you have two possible problems:
Beware of triggers executed when inserting into tbl_xyz, as these may change the value of @@IDENTITY.
Does tbl_abc have more than one row? If so, then @@IDENTITY will only return the identity value of the last row.
Issue 1 can be resolved by using SCOPE_IDENTITY() instead of @@IDENTITY.
Issue 2 is harder to resolve. Does field1 in tbl_abc define a unique record within tbl_xyz? If so, you could re-select the data from tbl_xyz along with the identity column. There are other solutions using cursors, but these will be slow.
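For the single-row case, a minimal sketch of the SCOPE_IDENTITY() approach (the variable name and column list are illustrative, assuming tbl_xyz only needs field1 besides its identity column):
-- Single-row insert; @NewId and the column list are illustrative.
DECLARE @NewId int;

INSERT INTO tbl_xyz (field1)
SELECT TOP 1 field1 FROM tbl_abc;

-- SCOPE_IDENTITY() is limited to the current scope, so identity values
-- generated inside triggers do not affect it.
SET @NewId = SCOPE_IDENTITY();
SELECT @NewId AS NewId;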
SELECT @@IDENTITY
This is how I've done it before. Not sure if this will meet the latter half of your post though.
EDIT
Found this link too, but not sure if it is the same...
How to insert multiple records and get the identity value?
As far as I know, you can't really do this with straight SQL in the same script. But you could create an INSERT trigger. Now, I hate triggers, but it's one way of doing it.
Depending on what you are trying to do, you might want to insert the rows into a temp table or table variable first, and deal with the result set that way. Hopefully, there is a unique column that you can link to.
You could also lock the table, get the max key, insert your rows, and then get your max key again and do a range.
Trigger:
--Use the Inserted table. This contains all of the inserted rows.
SELECT * FROM Inserted
Temp Table:
select field1, unique_col into #temp from tbl_abc
insert into tbl_xyz (field1, unique_col) select field1, unique_col from tbl_abc
--This could be an update, or a cursor, or whatever you want to do
SELECT * FROM tbl_xyz WHERE EXISTS (SELECT top 1 unique_col FROM #temp WHERE unique_col = tbl_xyz.unique_col)
Key Range:
Declare @minkey as int, @maxkey as int
BEGIN TRAN --You have to lock the table for this to work
--key is the name of your identity column
SELECT @minkey = MAX(key) FROM tbl_xyz
insert into tbl_xyz select field1 from tbl_abc
SELECT @maxkey = MAX(key) FROM tbl_xyz
COMMIT TRAN
SELECT * FROM tbl_xyz WHERE key BETWEEN @minkey and @maxkey

Concurrent access to database - preventing two users from obtaining the same value

I have a table with sequential numbers (think invoice numbers or student IDs).
At some point, the user needs to request the previous number (in order to calculate the next number). Once the user knows the current number, they need to generate the next number and add it to the table.
My worry is that two users will be able to erroneously generate two identical numbers due to concurrent access.
I've heard of stored procedures, and I know that might be one solution. Is there a best practice here to avoid concurrency issues?
Edit: Here's what I have so far:
USE [master]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[sp_GetNextOrderNumber]
AS
BEGIN
BEGIN TRAN
    DECLARE @recentYear INT
    DECLARE @recentMonth INT
    DECLARE @recentSequenceNum INT
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    -- get the most recent numbers
    SELECT @recentYear = Year, @recentMonth = Month, @recentSequenceNum = OrderSequenceNumber
    FROM dbo.OrderNumbers
    WITH (XLOCK)
    WHERE Id = (SELECT MAX(Id) FROM dbo.OrderNumbers)

    -- increment the numbers
    IF (YEAR(getDate()) > IsNull(@recentYear, 0))
    BEGIN
        SET @recentYear = YEAR(getDate());
        SET @recentMonth = MONTH(getDate());
        SET @recentSequenceNum = 0;
    END
    ELSE
    BEGIN
        IF (MONTH(getDate()) > IsNull(@recentMonth, 0))
        BEGIN
            SET @recentMonth = MONTH(getDate());
            SET @recentSequenceNum = 0;
        END
        ELSE
            SET @recentSequenceNum = @recentSequenceNum + 1;
    END

    -- insert the new numbers as a new record
    INSERT INTO dbo.OrderNumbers(Year, Month, OrderSequenceNumber)
    VALUES (@recentYear, @recentMonth, @recentSequenceNum)
COMMIT TRAN
END
This seems to work, and gives me the values I want. So far, I have not yet added any locking to prevent concurrent access.
Edit 2: Added WITH(XLOCK) to lock the table until the transaction completes. I'm not going for performance here. As long as I don't get duplicate entries added, and deadlocks don't happen, this should work.
You know that SQL Server does that for you, right? You can use an identity column if you need a sequential number, or a computed column if you need to calculate the new value based on another one.
But if that doesn't solve your problem, or if you need a complicated calculation to generate your new number that can't be done in a simple insert, I suggest writing a stored procedure that locks the table, gets the last value, generates the new one, inserts it and then unlocks the table.
Read this link to learn about transaction isolation levels.
Just make sure to keep the "locking" period as small as possible.
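A minimal sketch of that stored-procedure pattern, assuming a hypothetical NumberSource table with a single LastValue column (the table, column, and procedure names are illustrative, not from the question):
-- Hypothetical table: dbo.NumberSource(LastValue int), seeded with one row.
CREATE PROCEDURE dbo.GetNextNumber
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @next int;

    BEGIN TRAN;
        -- UPDLOCK + HOLDLOCK keep the row locked until COMMIT,
        -- so two concurrent callers cannot read the same value.
        SELECT @next = LastValue + 1
        FROM dbo.NumberSource WITH (UPDLOCK, HOLDLOCK);

        UPDATE dbo.NumberSource SET LastValue = @next;
    COMMIT TRAN;

    SELECT @next AS NextNumber;
END
Keeping all the work inside one short transaction is what keeps the locking period small.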
Here is a sample counter implementation. The basic idea is to use an insert trigger to update the numbers of, let's say, invoices. The first step is to create a table to hold the value of the last assigned number:
create table [Counter]
(
LastNumber int
)
and initialize it with single row:
insert into [Counter] values(0)
Sample invoice table:
create table invoices
(
InvoiceID int identity primary key,
Number varchar(8),
InvoiceDate datetime
)
The stored procedure LastNumber first updates the Counter row and then retrieves the value. As the value is an int, it is simply returned as the procedure's return value; otherwise an output parameter would be required. The procedure takes the number of next numbers to fetch as a parameter; the output is the last number.
create proc LastNumber (@NumberOfNextNumbers int = 1)
as
begin
    declare @LastNumber int
    update [Counter]
    set LastNumber = LastNumber + @NumberOfNextNumbers -- Holds update lock
    select @LastNumber = LastNumber
    from [Counter]
    return @LastNumber
end
The trigger on the Invoices table gets the number of simultaneously inserted invoices, asks the stored procedure for that many next numbers, and updates the invoices with those numbers.
create trigger InvoiceNumberTrigger on Invoices
after insert
as
set NoCount ON
declare @InvoiceID int
declare @LastNumber int
declare @RowsAffected int

select @RowsAffected = count(*)
from Inserted

exec @LastNumber = dbo.LastNumber @RowsAffected

update Invoices
-- Year/month parts of number are missing
set Number = right ('000' + ltrim(str(@LastNumber - rowNumber)), 3)
from Invoices
inner join
    ( select InvoiceID,
             row_number () over (order by InvoiceID desc) - 1 rowNumber
      from Inserted
    ) insertedRows
on Invoices.InvoiceID = InsertedRows.InvoiceID
In case of a rollback there will be no gaps left. The Counter table could easily be expanded with keys for different sequences; in that case a valid-until date might be nice, because you could prepare the table beforehand and let LastNumber worry about selecting the counter for the current year/month.
Example of usage:
insert into invoices (invoiceDate) values(GETDATE())
As the Number column's value is auto-generated, one should re-read it after the insert. I believe that EF has provisions for that.
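In plain T-SQL, a minimal way to re-read the trigger-generated value might look like this (it assumes the Invoices schema from the example above):
-- Insert an invoice, then re-read the Number the trigger assigned.
DECLARE @id int;

INSERT INTO Invoices (InvoiceDate) VALUES (GETDATE());
SET @id = SCOPE_IDENTITY();

SELECT Number
FROM Invoices
WHERE InvoiceID = @id;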
The way that we handle this in SQL Server is by using the UPDLOCK table hint within a single transaction.
For example:
INSERT
INTO    MyTable (
        MyNumber,
        MyField1 )
SELECT  IsNull(MAX(MyNumber), 0) + 1,
        'Test'
FROM    MyTable WITH (UPDLOCK)
It's not pretty, but since we were provided the database design and cannot change it due to legacy applications accessing the database, this was the best solution that we could come up with.

How to prevent duplicate records being inserted with SqlBulkCopy when there is no primary key

I receive a daily XML file that contains thousands of records, each being a business transaction that I need to store in an internal database for use in reporting and billing.
I was under the impression that each day's file contained only unique records, but have discovered that my definition of unique is not exactly the same as the provider's.
The current application that imports this data is a C#.Net 3.5 console application, it does so using SqlBulkCopy into a MS SQL Server 2008 database table where the columns exactly match the structure of the XML records. Each record has just over 100 fields, and there is no natural key in the data, or rather the fields I can come up with making sense as a composite key end up also having to allow nulls. Currently the table has several indexes, but no primary key.
Basically the entire row needs to be unique. If one field is different, it is valid enough to be inserted. I looked at creating an MD5 hash of the entire row, inserting that into the database, and using a constraint to prevent SqlBulkCopy from inserting the row, but I don't see how to get the MD5 hash into the BulkCopy operation, and I'm not sure whether the whole operation would fail and roll back if any one record failed, or if it would continue.
The file contains a very large number of records, so going row by row through the XML, querying the database for a record that matches all fields, and then deciding to insert is really the only way I can see of doing this. I was just hoping not to have to rewrite the application entirely, and the bulk copy operation is so much faster.
Does anyone know of a way to use SqlBulkCopy while preventing duplicate rows, without a primary key? Or any suggestion for a different way to do this?
I'd upload the data into a staging table, then deal with duplicates afterwards when copying to the final table.
For example, you can create a (non-unique) index on the staging table to help deal with the "key".
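A rough sketch of that flow, assuming hypothetical Staging and Target tables with matching columns (the names and the three-column list are illustrative stand-ins for the real ~100 columns):
-- SqlBulkCopy lands the raw rows in dbo.Staging first.
-- Then copy only rows not already in dbo.Target, collapsing duplicates;
-- the INTERSECT trick makes the comparison NULL-safe.
INSERT INTO dbo.Target (col1, col2, col3)
SELECT DISTINCT s.col1, s.col2, s.col3
FROM dbo.Staging AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.Target AS t
    WHERE EXISTS (SELECT s.col1, s.col2, s.col3
                  INTERSECT
                  SELECT t.col1, t.col2, t.col3)
);

TRUNCATE TABLE dbo.Staging;  -- ready for the next daily file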
Given that you're using SQL 2008, you have two options to solve the problem easily without needing to change your application much (if at all).
The first possible solution is to create a second table like the first one, but with a surrogate identity key and a uniqueness constraint added using the ignore_dup_key option, which will do all the heavy lifting of eliminating the duplicates for you.
Here's an example you can run in SSMS to see what's happening:
if object_id( 'tempdb..#test1' ) is not null drop table #test1;
if object_id( 'tempdb..#test2' ) is not null drop table #test2;
go
-- example heap table with duplicate record
create table #test1
(
col1 int
,col2 varchar(50)
,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
( 250, 'Joe''s IT Consulting and Bait Shop', null )
,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
,( 250, 'Joe''s IT Consulting and Bait Shop', null ) -- dup record
,( 666, 'The Honest Politician', 'LIE' )
,( 100, 'My Invisible Friend', 'WHO' )
;
go
-- secondary table for removing duplicates
create table #test2
(
sk int not null identity primary key
,col1 int
,col2 varchar(50)
,col3 char(3)
-- add a uniqueness constraint to filter dups
,constraint UQ_test2 unique ( col1, col2, col3 ) with ( ignore_dup_key = on )
);
go
-- insert all records from original table
-- this should generate a warning if duplicate records were ignored
insert #test2( col1, col2, col3 )
select col1, col2, col3
from #test1;
go
Alternatively, you can also remove the duplicates in-place without a second table, but the performance may be too slow for your needs. Here's the code for that example, also runnable in SSMS:
if object_id( 'tempdb..#test1' ) is not null drop table #test1;
go
-- example heap table with duplicate record
create table #test1
(
col1 int
,col2 varchar(50)
,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
( 250, 'Joe''s IT Consulting and Bait Shop', null )
,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
,( 250, 'Joe''s IT Consulting and Bait Shop', null ) -- dup record
,( 666, 'The Honest Politician', 'LIE' )
,( 100, 'My Invisible Friend', 'WHO' )
;
go
-- add temporary PK and index
alter table #test1 add sk int not null identity constraint PK_test1 primary key clustered;
create index IX_test1 on #test1( col1, col2, col3 );
go
-- note: rebuilding the indexes may or may not provide a performance benefit
alter index PK_test1 on #test1 rebuild;
alter index IX_test1 on #test1 rebuild;
go
-- remove duplicates
with ranks as
(
select
sk
,ordinal = row_number() over
(
-- put all the columns composing uniqueness into the partition
partition by col1, col2, col3
order by sk
)
from #test1
)
delete
from ranks
where ordinal > 1;
go
-- remove added columns
drop index IX_test1 on #test1;
alter table #test1 drop constraint PK_test1;
alter table #test1 drop column sk;
go
Why not, instead of a primary key, simply create an index and set
Ignore Duplicate Keys: YES
This will prevent any duplicate key from firing an error; the duplicate row simply will not be inserted (as it exists already).
I use this method to insert around 120,000 rows per day and it works flawlessly.
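In T-SQL that setting corresponds to something like the following (the table name and three-column key are illustrative; the real key would be whatever combination of columns you treat as unique):
-- Duplicate keys are silently skipped instead of raising an error,
-- so SqlBulkCopy can keep loading into this table unchanged.
CREATE UNIQUE NONCLUSTERED INDEX IX_Import_Dedup
    ON dbo.ImportTable (col1, col2, col3)
    WITH (IGNORE_DUP_KEY = ON);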
I would bulk copy into a temporary table and then push the data from that into the actual destination table. In this way, you can use SQL to check for and handle duplicates.
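One way to express that de-duplicating push, assuming the temporary and destination tables share the same column list (names are illustrative):
-- EXCEPT removes rows already present in the destination and also
-- collapses duplicates within the temp table itself, NULL-safely.
INSERT INTO dbo.Destination (col1, col2, col3)
SELECT col1, col2, col3 FROM #ImportTemp
EXCEPT
SELECT col1, col2, col3 FROM dbo.Destination;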
What is the data volume? You have 2 options that I can see:
1: filter it at source, by implementing your own IDataReader and using some hash over the data, and simply skipping any duplicates so that they never get passed into the TDS.
2: filter it in the DB; at the simplest level, I guess you could have multiple stages of import - the raw, unsanitised data - and then copy the DISTINCT data into your actual tables, perhaps using an intermediate table if you want to. You might want to use CHECKSUM for some of this, but it depends.
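If CHECKSUM fits your data, one hedged way to use it is as a persisted computed column on the raw import table, so the duplicate check only compares full rows when the hashes match (CHECKSUM can collide, so it narrows the search rather than proving equality; names are illustrative):
-- Hash of the row, persisted and indexed for cheap duplicate lookups;
-- rows with matching RowHash still need a full column comparison.
ALTER TABLE dbo.RawImport
    ADD RowHash AS CHECKSUM(col1, col2, col3) PERSISTED;

CREATE INDEX IX_RawImport_RowHash ON dbo.RawImport (RowHash);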
And fix that table. No table should ever be without a unique index, preferably as a PK. Even if you add a surrogate key because there is no natural key, you need to be able to specifically identify a particular record. Otherwise, how will you get rid of the duplicates you already have?
I think this is a lot cleaner.
var dtcolumns = new string[] { "Col1", "Col2", "Col3" };

var dtDistinct = dt.DefaultView.ToTable(true, dtcolumns);

// connectionString is assumed to be your connection string.
using (SqlConnection cn = new SqlConnection(connectionString))
using (SqlBulkCopy copy = new SqlBulkCopy(cn))
{
    cn.Open();
    copy.ColumnMappings.Add(0, 0);
    copy.ColumnMappings.Add(1, 1);
    copy.ColumnMappings.Add(2, 2);
    copy.DestinationTableName = "TableNameToMapTo";
    copy.WriteToServer(dtDistinct);
}
This way you only need one database table and can keep the business logic in code.
