I have a cleanup process that needs to delete around 8 million rows from a table each day (sometimes more). The process is written in C#, and uses SMO to query the schema for the table's indexes, disabling them before executing a sproc that deletes in batches of 500K rows.
My problem is that the entire operation lives inside a transaction. The sproc is executed inside a TransactionScope configured with TransactionScopeOption.Suppress (it runs alongside other operations that each start their own TransactionScope), which I thought would prevent an ambient transaction, and there are explicit commit points inside the sproc.
The C# part of the process can be summarized as this:
try
{
    DisableIndexes(table);
    CleanTable(table);
}
finally
{
    RebuildIndexes(table);
}
And the sproc has a loop inside that's basically:
DECLARE @rowCount bigint = 1

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

WHILE @rowCount <> 0
BEGIN
    DELETE TOP (@rowsPerBatch) Table
    WHERE ID <= @maxID

    SET @rowCount = @@ROWCOUNT
END
Last night this process timed out half an hour after it started, took half an hour to roll back, and then needed another half hour of index rebuilding... too much downtime for zero work. =(
Update: I've run the process on a small sample database (and with a small timeout), and it's not what I thought. Evidently the process is correctly removing rows and making progress as I want. Still, the log is getting consumed. As I'm using the SIMPLE recovery model, shouldn't the log not grow in this case? Or is the deletion sproc so 'fast' that I'm not giving the process that actually frees the log the time it needs to keep it clean?
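In the SIMPLE recovery model the log is truncated only at checkpoints, and log records belonging to a transaction that is still open cannot be truncated at all, so one outer transaction around all the batches keeps every batch's log reserved until the very end. A minimal sketch of the loop with an explicit commit per batch (same variables as the sproc above; the CHECKPOINT is added here for illustration and is not part of the original code):
DECLARE @rowCount bigint = 1

WHILE @rowCount <> 0
BEGIN
    BEGIN TRAN

    DELETE TOP (@rowsPerBatch) Table
    WHERE ID <= @maxID

    SET @rowCount = @@ROWCOUNT

    COMMIT TRAN   -- this batch's log space becomes reusable...
    CHECKPOINT    -- ...once a checkpoint runs (forced here for illustration)
END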
Depending on your circumstances you may be able to use partitioned views instead of partitioned tables. Something like:
create table A
(
X int,
Y int,
Z varchar(300),
primary key (X,Y),
check (X = 1)
)
create table B
(
X int,
Y int,
Z varchar(300),
primary key (X,Y),
check (X = 2)
)
create table C
(
X int,
Y int,
Z varchar(300),
primary key (X,Y),
check (X = 3)
)
go
create view ABC
as
select * from A
union all
select * from B
union all
select * from C
go
insert abc (x,y,z)
values (1,4,'test')
insert abc (x,y,z)
values (2,99,'test'), (3,123,'test')
insert abc (x,y,z)
values (3,15125,'test')
select * from abc
truncate table c
select * from abc
If I understand your problem correctly, what you want is an automatic sliding window and partitioned tables.
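For completeness, here is a rough sketch of what the sliding window looks like with real partitioned tables (an Enterprise feature; every object name below is made up for illustration):
-- Assume dbo.BigTable is partitioned on a date column by partition function PF_Daily
-- on scheme PS_Daily, and dbo.BigTable_Staging has an identical structure and indexes.
-- Switching a partition out is a metadata-only operation, so the "delete" is
-- nearly instantaneous and barely touches the transaction log.
ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Staging;
TRUNCATE TABLE dbo.BigTable_Staging;
-- Slide the window: merge away the now-empty boundary and split in a new one.
ALTER PARTITION FUNCTION PF_Daily() MERGE RANGE ('2014-01-01');
ALTER PARTITION SCHEME PS_Daily NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PF_Daily() SPLIT RANGE ('2014-02-01');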
So let's say I have a table in SQL server that serves as a queue for items that need processing. Something like this:
Id (bigint)
BatchGuid (guid)
BatchProcessed (bit)
...
...along with some other columns describing the item that needs to be processed, etc. So there are many running consumers that add records to this table as needed to indicate that an item needs to be processed.
So now let's say I have a job that is in charge of getting a batch of items from this table and processing them. Say we want to let it process 10 at a time. Now also assume that this job can have many instances running at once, so it is concurrently accessing the table (along with any other consumers who may be adding new records to the queue).
I was planning to do something like this:
using (var tx = new TransactionScope(TransactionScopeOption.Required,
    new TransactionOptions { IsolationLevel = IsolationLevel.Serializable }))
{
    var batchGuid = Guid.NewGuid();
    executeSql("update top (10) [QueueTable] set [BatchGuid] = batchGuid where [BatchGuid] is null");
    var itemsToProcess = executeSql("select * from [QueueTable] where [BatchGuid] = batchGuid");
    tx.Complete();
}
So basically what I'd be doing is starting a transaction as serializable, marking 10 items with a specific GUID, then getting those 10 items, then committing.
Is this a feasible strategy? I believe the isolation level of serializable will basically lock the whole table to prevent read/write until the transaction is complete - is this correct? Basically the transaction will block all other read/write operations on the table? I believe this is what I want in this case as I don't want to read dirty data and I don't want concurrent running jobs to stomp on each other when marking a batch of 10 to process.
Any insights as to whether I'm on the right track with this would be much appreciated. If there are better ways to accomplish this I'd welcome alternatives as well.
Serializable isolation mode does not necessarily lock the whole table. If you have an index on BatchGuid you will probably do ok, but if not then SQL will probably escalate to a table lock.
A few things you may want to look at:
Using the OUTPUT clause you can combine your UPDATE and SELECT into one query.
You may need the UPDLOCK hint if you have multiple processes running this query (a sketch combining both points follows).
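A minimal sketch of how those two points fit together, assuming the QueueTable and BatchGuid names from the question:
-- Claim 10 unclaimed rows and return them in a single statement.
-- UPDLOCK keeps two concurrent callers from selecting the same candidate rows;
-- ROWLOCK merely discourages lock escalation.
DECLARE @batchGuid uniqueidentifier = NEWID();

UPDATE TOP (10) q
SET    BatchGuid = @batchGuid
OUTPUT inserted.*
FROM   dbo.QueueTable AS q WITH (UPDLOCK, ROWLOCK)
WHERE  q.BatchGuid IS NULL;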
You can do this in a single statement if you use the OUTPUT clause:
UPDATE TOP (10) [QueueTable]
SET [BatchGuid] = @batchGuid
OUTPUT inserted.*
WHERE [BatchGuid] IS NULL;
Or more specifically:
var itemsToProcess = executeSql("update top (10) [QueueTable] set [BatchGuid] = @batchGuid output inserted.* where [BatchGuid] is null");
It is personal preference I suppose, but I have never been a fan of the UPDATE TOP (n) syntax, because you can't specify an ORDER BY, and in most cases when specifying TOP you want an ORDER BY. I much prefer using something like:
UPDATE q
SET [BatchGuid] = @batchGuid
OUTPUT inserted.*
FROM ( SELECT TOP (10) *
       FROM dbo.QueueTable
       WHERE BatchGuid IS NULL
       ORDER BY ID
     ) AS q;
ADDENDUM
In response to the comment, I don't believe there is any chance of a race condition, but I was not 100% certain. The reason I don't believe so is that although the query reads as a SELECT and an UPDATE, it is syntactic sugar: it is just an update, and it uses exactly the same plan and locks as the TOP query. However, since I didn't know for sure, I decided to test.
First I set up a sample table in tempdb, and a logging table to record the updated IDs:
USE TempDB;
GO
CREATE TABLE dbo.T (ID BIGINT NOT NULL IDENTITY PRIMARY KEY, Col UNIQUEIDENTIFIER NULL);
INSERT dbo.T (Col)
SELECT TOP 1000000 NULL
FROM sys.all_objects a, sys.all_objects b;
CREATE TABLE dbo.T2 (ID BIGINT NOT NULL PRIMARY KEY);
Then in 10 different SSMS windows I ran this:
WHILE 1 = 1
BEGIN
    DECLARE @ID UNIQUEIDENTIFIER = NEWID();

    UPDATE T
    SET Col = @ID
    OUTPUT inserted.ID INTO dbo.T2 (ID)
    FROM ( SELECT TOP 10 *
           FROM dbo.T
           WHERE Col IS NULL
           ORDER BY ID
         ) t;

    IF @@ROWCOUNT = 0
        RETURN;
END
The whole process ran for 20 minutes, updating ~500,000 rows, before I stopped all 10 threads. Since updating the same row twice would throw an error when inserting into T2 (a primary key violation), and all 10 threads kept running until I stopped them, this shows that there was no race condition. To confirm it, I ran the following:
SELECT Col, COUNT(*)
FROM dbo.T
WHERE Col IS NOT NULL
GROUP BY Col
HAVING COUNT(*) <> 10;
Which, as expected, returned no rows.
I am happy to be proved wrong and to concede I was lucky in that none of these 100,000 iterations clashed, but I don't believe it was luck. I really believe there is a single lock, therefore it doesn't matter if you have a transaction or not, you just need the correct isolation level.
The development team has finished their application, and as a tester I need to insert 1,000,000 records into its 20 tables for performance testing.
I have gone through the tables, and there are in fact relationships between all of them.
To insert that much dummy data I would need to understand the application completely in a very short span of time, and I don't have the dummy data yet either.
Is there any way in SQL Server to do an insert of this size?
Please share your approaches.
Currently I am planning to create the dummy data in Excel, but I am not sure about the relationships between the tables.
I found on Google that SQL Profiler will show the order of execution, but I am still waiting for access so I can analyze it.
One more thing I found on Google is that a Red Gate tool can be used.
Is there any script or other solution to perform this task in a simple way?
I am very sorry if this is a common question; this is my first time working on a real SQL scenario, but I do have knowledge of SQL.
Why don't you generate those records in SQL Server? Here is a script that generates a table variable with 1,000,000 rows:
DECLARE @values TABLE (DataValue int, RandValue int)

;WITH mycte AS
(
    SELECT 1 DataValue
    UNION ALL
    SELECT DataValue + 1
    FROM mycte
    WHERE DataValue + 1 <= 1000000
)
INSERT INTO @values (DataValue, RandValue)
SELECT
    DataValue,
    CONVERT(int, CONVERT(varbinary(4), NEWID(), 1)) AS RandValue
FROM mycte m
OPTION (MAXRECURSION 0)

SELECT
    v.DataValue,
    v.RandValue,
    (SELECT TOP 1 [User_ID] FROM tblUsers ORDER BY NEWID()) AS RandomUserId
FROM @values v
In the table variable @values you will have a random int value (column RandValue) which can be used to generate values for other columns. You also have an example of getting a random foreign key.
Below is a simple procedure I wrote to insert millions of dummy records into a table. I know it's not the most efficient approach, but it serves the purpose; for a million records it takes around 5 minutes. You need to pass the number of records to generate when executing the procedure.
IF EXISTS (SELECT 1 FROM dbo.sysobjects WHERE id = OBJECT_ID(N'[dbo].[DUMMY_INSERT]') AND type IN (N'P', N'PC'))
BEGIN
    DROP PROCEDURE DUMMY_INSERT
END
GO

CREATE PROCEDURE DUMMY_INSERT (
    @noOfRecords INT
)
AS
BEGIN
    DECLARE @count int
    SET @count = 1;

    WHILE (@count < @noOfRecords)
    BEGIN
        INSERT INTO [dbo].[LogTable] ([UserId],[UserName],[Priority],[CmdName],[Message],[Success],[StartTime],[EndTime],[RemoteAddress],[TId])
        VALUES (1, 'user_' + CAST(@count AS VARCHAR(256)), 1, 'dummy command', 'dummy message.', 0,
                CONVERT(varchar(50), DATEADD(D, ROUND(RAND() * 1000, 1), GETDATE()), 121),
                CONVERT(varchar(50), DATEADD(D, ROUND(RAND() * 1000, 1), GETDATE()), 121),
                '160.200.45.1', 1);

        SET @count = @count + 1;
    END
END
You can use a cursor to duplicate existing data; for example, this simple code (the table and column names are placeholders):
Declare @SYMBOL nchar(255), -- sample value
        @SY_ID  int         -- sample value

Declare R2 Cursor
    For SELECT [Symbol], [SY_ID]
        FROM [TableName]
    For Read Only;

Open R2
Fetch Next From R2 INTO @SYMBOL, @SY_ID

While (@@FETCH_STATUS <> -1)
Begin
    Insert INTO [TableName] ([Symbol], [SY_ID])
    Values (@SYMBOL, @SY_ID)

    Fetch Next From R2 INTO @SYMBOL, @SY_ID
End

Close R2
Deallocate R2
/*wait a ... moment*/
SELECT COUNT(*) --check result
FROM [TableName]
We are working on a solution which fires many search requests towards three different public databases located in three different countries. For example, a search fetches data from one DB and passes it as a parameter to another DB. The parameter is a list, and each item in it needs to be connected with a logical OR operator, so we end up with a SQL SELECT statement containing up to 1000 OR operators in the WHERE clause.
Now my question is: do 1000, 500, or even 5000 logical AND or OR operators inside a SELECT statement make the DB slower, and would I be better off fetching all the data to my PC and doing the matching there?
The amount of data is between 5000 and 10000 records, and since we are talking about a public DB the amount keeps growing.
For example, a SQL statement like this:
select * from some_table
where .. and .. or .. or.. or..
or.. or.. or.. or.. or.. or.. (1000 times)
If I fetch all the data to my PC, I could do the filtering with a LINQ statement.
What do you suggest I do? Does anyone have experience with this?
Sorry if this is a duplicate just let me know in comments and I'll delete this question.
EDIT:
It should be considered that many users may access the databases at the same time.
I always learned that running a query with hundreds of OR conditions is bad for performance. However, even when running a sample here on Oracle 12c, querying a table with OR or IN against a primary key index doesn't seem to change the execution plan.
Therefore I say: it doesn't matter. The only things you could consider are readability, query length, etc.
Still, I personally prefer WHERE ... IN.
See this other useful question with sample data.
Process this all in the database with a single query. Batching similar operations is usually the best thing you can do for database performance.
The most expensive part of the query is reading the data from disk. Once the data is in memory, filtering out a few thousand conditions is a small amount of work. Your local processor is probably faster than the database server, but it doesn't matter, because your machine would spend too much time on unnecessary I/O if you returned all the records.
Also, 5000 conditions in a SQL query is only a problem if you run that query a hundred times a second.
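A hedged sketch of keeping the filtering in the database without a 1000-term OR list (SQL Server syntax, illustrative names; the same idea works with a table-valued parameter or a global temporary table):
-- Stage the key list once, then join instead of enumerating OR terms.
CREATE TABLE #keys (id int PRIMARY KEY);

INSERT INTO #keys (id)
VALUES (101), (102), (103);   -- in practice, bulk-load the real list here

SELECT t.*
FROM   some_table AS t
       JOIN #keys AS k ON k.id = t.id;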
I think you should just try.
Create an example that is as simple as possible, yet complex enough to be realistic, and then run it with some form of benchmarking.
Whatever works best for you is what you should choose to do.
Edit:
That said, such a large number of ANDs and ORs in a single SQL statement does sound complicated and messy. Unless there is a real benefit from doing it this way(?), I would probably try to find a cleaner way to do it, for instance by splitting the operation into several steps and applying LINQ or something similar, as you suggest, even if it is just to make the solution more manageable.
The answer is: it depends.
How big is the data in the public DB? If you are querying something Google-sized, then fetching all the data is not an option.
It would be reasonable to assume that those public DBs have much stronger hardware and DB tuning than your home PC.
Is there a chance you will get blacklisted by those public DBs?
Does order matter? Is querying DB 1 and then DB 2 faster than querying DB 2 and then DB 1?
Mostly it's trial & error, and whatever works best for you and is possible.
SQL Queries on ORACLE Using Multiple Boolean Operators
Comments: I have worked many years with CRYSTAL REPORTS, a database report designer. It was one of the first drag-and-drop, GUI-based tools which made it easier for developers without much database background to construct queries with multiple tables and filter conditions. The trade-off was that the tool was writing SQL under the hood; many times it was a serious performance hog, because the workstation running the report file had to suck down the entire contents of the database tables being queried, only to run the filtering process locally on the client system. That was more than a decade ago, but I see other next-gen tools that also auto-generate really awful SQL code.
No amount of software can compensate for lousy database design. You won't get everything right the first time (as others have noticed), but a little planning can give some breathing room when the product reveals under real-world use the demands of PERFORMANCE and SCALABILITY.
Demonstration Schema and Test Data
The following solution was designed on an ORACLE 11g Release 2 RDBMS system. The first table can be represented by a database VIEW, INLINE QUERY, SUB QUERY, MATERIALIZED VIEW or even a CURSOR output, so the "attributes" discussed in this example could be coming from multiple table sources and joining criteria.
CREATE TABLE "ZZ_DATA_ATTRIBUTES"
( "DATA_ID" NUMBER(10,0) NOT NULL ENABLE,
"NAME" VARCHAR2(50),
"AGE" NUMBER(5,0),
"HH_SIZE" NUMBER(5,0),
"SURVEY_SCORE" NUMBER(5,0),
"DMA_REGION" VARCHAR2(100),
"LAST_CONTACT" DATE,
CONSTRAINT "ZZ_DATA_ATTRIBUTES_PK" PRIMARY KEY ("DATA_ID") ENABLE
)
/
CREATE SEQUENCE "ZZ_DATA_ATTRIBUTES_SEQ" MINVALUE 1 MAXVALUE
9999999999999999999999999999 INCREMENT BY 1 START WITH 41 CACHE 20 NOORDER NOCYCLE
/
CREATE OR REPLACE TRIGGER "BI_ZZ_DATA_ATTRIBUTES"
before insert on "ZZ_DATA_ATTRIBUTES"
for each row
begin
if :NEW."DATA_ID" is null then
select "ZZ_DATA_ATTRIBUTES_SEQ".nextval into :NEW."DATA_ID" from sys.dual;
end if;
end;
/
ALTER TRIGGER "BI_ZZ_DATA_ATTRIBUTES" ENABLE
/
The SEQUENCE and TRIGGER objects are just for unique, auto-incremented values for the primary key on each table.
CREATE TABLE "ZZ_CONDITION_RESULTS"
( "RESULT_ID" NUMBER(10,0) NOT NULL ENABLE,
"DATA_ID" NUMBER(10,0) NOT NULL ENABLE,
"COND_ONE" NUMBER(10,0),
"COND_TWO" NUMBER(10,0),
"COND_THREE" NUMBER(10,0),
"COND_FOUR" NUMBER(10,0),
"COND_FIVE" NUMBER(10,0),
CONSTRAINT "ZZ_CONDITION_RESULTS_PK" PRIMARY KEY ("RESULT_ID") ENABLE
)
/
ALTER TABLE "ZZ_CONDITION_RESULTS" ADD CONSTRAINT "ZZ_CONDITION_RESULTS_FK"
FOREIGN KEY ("DATA_ID") REFERENCES "ZZ_DATA_ATTRIBUTES" ("DATA_ID") ENABLE
/
CREATE SEQUENCE "ZZ_CONDITION_RESULTS_SEQ" MINVALUE 1 MAXVALUE
9999999999999999999999999999 INCREMENT BY 1 START WITH 1 CACHE 20 NOORDER NOCYCLE
/
CREATE OR REPLACE TRIGGER "BI_ZZ_CONDITION_RESULTS"
before insert on "ZZ_CONDITION_RESULTS"
for each row
begin
if :NEW."RESULT_ID" is null then
select "ZZ_CONDITION_RESULTS_SEQ".nextval into :NEW."RESULT_ID" from sys.dual;
end if;
end;
/
ALTER TRIGGER "BI_ZZ_CONDITION_RESULTS" ENABLE
/
The table ZZ_CONDITION_RESULTS should be a TABLE type. It will contain the results of each individual boolean OR criteria. While 1000's of columns may not be practically feasible, the initial approach will show how you can line up lots of boolean outputs and be able to quickly identify and isolate the combinations and patterns of interest.
Sample Data
You can pick your own data values, but these were created to make the examples work. I chose the theme of MARKETING, where the data pulled together are different attributes our fictional company has gathered about their customers: customer name, age, hh_size (Household Size), The scoring results of some bench marked survey, DMA (Demographic Marketing Area) Region and the date the customer was last contacted.
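A couple of illustrative rows (the values below are invented purely to exercise the five conditions defined next; DATA_ID is filled in by the trigger):
INSERT INTO ZZ_DATA_ATTRIBUTES (NAME, AGE, HH_SIZE, SURVEY_SCORE, DMA_REGION, LAST_CONTACT)
VALUES ('ALICE', 28, 2, 1350, 'CA - SAN FRANCISCO', DATE '2013-11-02');
INSERT INTO ZZ_DATA_ATTRIBUTES (NAME, AGE, HH_SIZE, SURVEY_SCORE, DMA_REGION, LAST_CONTACT)
VALUES ('BOB', 52, 4, 980, 'NY - NEW YORK', DATE '2010-06-17');
COMMIT;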
Defined Boolean Arguments Using an Oracle Package Structure
The initial design is to calculate the business logic through an Oracle PL/SQL Package Object. For example, in the OP:
select * from some_table
where .. and .. or .. or.. or..
or.. or.. or.. or.. or.. or.. (1000 times)
Each blank is a separate Oracle function call from within the package(s). The result is represented as a column value for each record of attributes that are evaluated.
create or replace package ZZ_PKG_MARKETING_DEMO as
c_result_true constant pls_integer:= 1;
c_result_false constant pls_integer:= 0;
cursor attrib_cur is
select data_id, name, age, hh_size, survey_score, dma_region,
last_contact
from zz_data_attributes;
TYPE attrib_record_type IS RECORD (
data_id zz_data_attributes.data_id%TYPE,
name zz_data_attributes.name%TYPE,
age zz_data_attributes.age%TYPE,
hh_size zz_data_attributes.hh_size%TYPE,
survey_score zz_data_attributes.survey_score%TYPE,
dma_region zz_data_attributes.dma_region%TYPE,
last_contact zz_data_attributes.last_contact%TYPE
);
function evaluate_cond_one (
p_attrib_rec attrib_record_type) return pls_integer;
function evaluate_cond_two (
p_attrib_rec attrib_record_type) return pls_integer;
function evaluate_cond_three (
p_attrib_rec attrib_record_type) return pls_integer;
function evaluate_cond_four (
p_attrib_rec attrib_record_type) return pls_integer;
function evaluate_cond_five (
p_attrib_rec attrib_record_type) return pls_integer;
procedure main_driver;
end;
create or replace package body "ZZ_PKG_MARKETING_DEMO" is
function evaluate_cond_one (
p_attrib_rec attrib_record_type) return pls_integer
as
begin
-- Checks if person is from a DMA Region in California.
IF p_attrib_rec.dma_region like 'CA%'
THEN return c_result_true;
ELSE return c_result_false;
END IF;
end EVALUATE_COND_ONE;
function evaluate_cond_two (
p_attrib_rec attrib_record_type) return pls_integer
as
c_begin_age_range constant zz_data_attributes.age%TYPE:= 20;
c_end_age_range constant zz_data_attributes.age%TYPE:= 35;
begin
-- Part 1 of 2 Checks if person belongs to the 20 to 35 years age bracket
IF p_attrib_rec.age between c_begin_age_range and c_end_age_range
THEN return c_result_true;
ELSE return c_result_false;
END IF;
end EVALUATE_COND_TWO;
function evaluate_cond_three (
p_attrib_rec attrib_record_type) return pls_integer
as
c_lowest_age constant zz_data_attributes.age%TYPE:= 45;
begin
-- Part 2 of 2 Checks if person is from age 45 and up demographic.
IF p_attrib_rec.age >= c_lowest_age
THEN return c_result_true;
ELSE return c_result_false;
END IF;
end EVALUATE_COND_THREE;
function evaluate_cond_four (
p_attrib_rec attrib_record_type) return pls_integer
as
c_cutoff_score CONSTANT zz_data_attributes.survey_score%TYPE:= 1200;
begin
-- Checks if person's survey score is higher than c_cutoff_score
IF p_attrib_rec.survey_score >= c_cutoff_score
THEN return c_result_true;
ELSE return c_result_false;
END IF;
end EVALUATE_COND_FOUR;
function evaluate_cond_five (
p_attrib_rec attrib_record_type) return pls_integer
as
c_last_contact_period CONSTANT pls_integer:= -750;
-- Note current date is anchored to a static value so the data output
-- in this example will still work regardless of how old this post
-- may get.
c_current_date CONSTANT zz_data_attributes.last_contact%TYPE:=
to_date('03/25/2014','MM/DD/YYYY');
begin
-- Checks if person's last contact date has been in the last 750
-- days.
IF p_attrib_rec.last_contact >=
(c_current_date + c_last_contact_period)
THEN return c_result_true;
ELSE return c_result_false;
END IF;
end EVALUATE_COND_FIVE;
procedure MAIN_DRIVER
as
v_rec_attr attrib_record_type;
v_rec_cond zz_condition_results%ROWTYPE;
begin
for i in attrib_cur
loop
-- Set the input record variable with the attribute values queried by the
-- current cursor.
v_rec_attr.data_id := i.data_id;
v_rec_attr.name := i.name;
v_rec_attr.age := i.age;
v_rec_attr.hh_size := i.hh_size;
v_rec_attr.survey_score := i.survey_score;
v_rec_attr.dma_region := i.dma_region;
v_rec_attr.last_contact := i.last_contact;
-- Set each condition column value equal to their matching package function.
v_rec_cond.cond_one := evaluate_cond_one(p_attrib_rec => v_rec_attr);
v_rec_cond.cond_two := evaluate_cond_two(p_attrib_rec => v_rec_attr);
v_rec_cond.cond_three:= evaluate_cond_three(p_attrib_rec => v_rec_attr);
v_rec_cond.cond_four := evaluate_cond_four(p_attrib_rec => v_rec_attr);
v_rec_cond.cond_five := evaluate_cond_five(p_attrib_rec => v_rec_attr);
INSERT INTO zz_condition_results (data_id, cond_one, cond_two,
cond_three, cond_four, cond_five)
VALUES
( v_rec_attr.data_id,
v_rec_cond.cond_one,
v_rec_cond.cond_two,
v_rec_cond.cond_three,
v_rec_cond.cond_four,
v_rec_cond.cond_five );
end loop;
COMMIT;
end MAIN_DRIVER;
end "ZZ_PKG_MARKETING_DEMO";
PL/SQL Notes: Some may not be familiar with custom data types such as the RECORD variable type defined within the package in procedure MAIN_DRIVER. They make the data being processed easier to handle and reference.
Boolean Arithmetic in Plain English (well, sort of)
The CURSOR Named ATTRIB_CUR can be modified to operate on a single record or a smaller input data set. For now, invoke the MAIN_DRIVER procedure to process all the records in the attributes data source (again, this doesn't have to be a single table).
BEGIN
ZZ_PKG_MARKETING_DEMO.MAIN_DRIVER;
END;
Now that each example condition has been evaluated for all the sample records, there are several simpler pathways to evaluating the boolean values, currently captured as values of "1" (for TRUE) and "0" (for FALSE).
If only one of this series of conditions need to be met (as in a long chain of OR operators), then the WHERE clause should look something like this:
WHERE COND_ONE = 1 OR COND_TWO = 1 OR COND_THREE = 1 OR COND_FOUR = 1 OR COND_FIVE = 1
A shorthand approach could be:
WHERE (COND_ONE + COND_TWO + COND_THREE + COND_FOUR + COND_FIVE) > 0
What does this buy? There are performance gains from processing an otherwise static evaluation (the custom conditions) at the time the data record is populated. One good reason is that each subsequent query that asks about these criteria will not need to crunch through the business logic again. We also leverage an advantage through a decision value with a very, very, very low cardinality (TWO!).
The second "shorthand" example of the WHERE filter criteria is a clue about how the final approach will manage "thousands" of Boolean evaluations.
Scalability: How to Do This Several Thousand More Times in a Row
It would be impractical to assume this approach could scale up to the magnitude presented in the OP. The final question: How can this solution apply for an N thousand chain of boolean values?
Hint: PIVOT your results.
Expandable Table Design for Lots of Boolean Conditions
The pivoted design stores one row per DATA_ID and condition, rather than one column per condition, so the sample data above fits into it as plain rows and condition number 1001 is just another row.
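A sketch of what such a pivot table could look like (the column names are assumed from the aggregation query below, not taken from the original mock-up):
CREATE TABLE "ZZ_CONDITION_PIVOT"
( "DATA_ID" NUMBER(10,0) NOT NULL,
  "COND_NAME" VARCHAR2(50) NOT NULL,
  "RESULT" NUMBER(1,0) NOT NULL,    -- 1 = TRUE, 0 = FALSE
  CONSTRAINT "ZZ_CONDITION_PIVOT_PK" PRIMARY KEY ("DATA_ID", "COND_NAME")
)
/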
The SQL needed to fetch a multiple OR relation between the five sample conditions can be accomplished through an aggregation query:
-- For multiple OR relations:
SELECT DATA_ID
FROM ZZ_CONDITION_PIVOT
GROUP BY DATA_ID
HAVING SUM(RESULT) > 0
Veterans will probably note this syntax can be further simplified with the use of database supported ANALYTICAL FUNCTIONS.
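For instance, a hedged analytic variant over the same assumed ZZ_CONDITION_PIVOT columns:
-- Analytic SUM exposes the per-DATA_ID count of TRUE results while the
-- detail rows remain available to the outer query.
SELECT DISTINCT DATA_ID
FROM (
       SELECT DATA_ID,
              SUM(RESULT) OVER (PARTITION BY DATA_ID) AS TRUE_COUNT
       FROM   ZZ_CONDITION_PIVOT
     )
WHERE TRUE_COUNT > 0;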
This design should be low maintenance with any number of boolean conditions introduced during or after the implementation. The table designs should remain the same throughout.
Let me know your thoughts, it looks like the discussion has moved on to other issues and contributors so this is probably long enough to get you started. Onward!
I have a table with sequential numbers (think invoice numbers or student IDs).
At some point, the user needs to request the previous number (in order to calculate the next number). Once the user knows the current number, they need to generate the next number and add it to the table.
My worry is that two users will be able to erroneously generate two identical numbers due to concurrent access.
I've heard of stored procedures, and I know that that might be one solution. Is there a best-practice here, to avoid concurrency issues?
Edit: Here's what I have so far:
USE [master]
GO

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER PROCEDURE [dbo].[sp_GetNextOrderNumber]
AS
BEGIN
    BEGIN TRAN

    DECLARE @recentYear INT
    DECLARE @recentMonth INT
    DECLARE @recentSequenceNum INT

    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    -- get the most recent numbers
    SELECT @recentYear = Year, @recentMonth = Month, @recentSequenceNum = OrderSequenceNumber
    FROM dbo.OrderNumbers WITH (XLOCK)
    WHERE Id = (SELECT MAX(Id) FROM dbo.OrderNumbers)

    -- increment the numbers
    IF (YEAR(GETDATE()) > ISNULL(@recentYear, 0))
    BEGIN
        SET @recentYear = YEAR(GETDATE());
        SET @recentMonth = MONTH(GETDATE());
        SET @recentSequenceNum = 0;
    END
    ELSE
    BEGIN
        IF (MONTH(GETDATE()) > ISNULL(@recentMonth, 0))
        BEGIN
            SET @recentMonth = MONTH(GETDATE());
            SET @recentSequenceNum = 0;
        END
        ELSE
            SET @recentSequenceNum = @recentSequenceNum + 1;
    END

    -- insert the new numbers as a new record
    INSERT INTO dbo.OrderNumbers (Year, Month, OrderSequenceNumber)
    VALUES (@recentYear, @recentMonth, @recentSequenceNum)

    COMMIT TRAN
END
This seems to work, and gives me the values I want. So far, I have not yet added any locking to prevent concurrent access.
Edit 2: Added WITH(XLOCK) to lock the table until the transaction completes. I'm not going for performance here. As long as I don't get duplicate entries added, and deadlocks don't happen, this should work.
You know that SQL Server does that for you, right? You can use an IDENTITY column if you need a sequential number, or a computed column if you need to calculate the new value based on another one.
But if that doesn't solve your problem, or if you need a complicated calculation to generate your new number that can't be done in a simple insert, I suggest writing a stored procedure that locks the table, gets the last value, generates the new one, inserts it, and then unlocks the table.
Read this link to learn about transaction isolation levels.
Just make sure to keep the "locking" period as small as possible.
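A minimal sketch of that lock-then-generate pattern (the table and column names here are illustrative, not taken from the question):
CREATE PROCEDURE dbo.GetNextInvoiceNumber
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRAN;

    DECLARE @next int;

    -- TABLOCKX + HOLDLOCK keeps the table exclusively locked until COMMIT,
    -- so two concurrent callers cannot read the same "last" value.
    SELECT @next = ISNULL(MAX(Number), 0) + 1
    FROM   dbo.InvoiceNumbers WITH (TABLOCKX, HOLDLOCK);

    INSERT INTO dbo.InvoiceNumbers (Number) VALUES (@next);

    COMMIT TRAN;
    RETURN @next;
END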
Here is a sample Counter implementation. The basic idea is to use an insert trigger to update the numbers of, let's say, invoices. The first step is to create a table to hold the value of the last assigned number:
create table [Counter]
(
LastNumber int
)
and initialize it with single row:
insert into [Counter] values(0)
Sample invoice table:
create table invoices
(
InvoiceID int identity primary key,
Number varchar(8),
InvoiceDate datetime
)
The stored procedure LastNumber first updates the Counter row and then retrieves the value. As the value is an int, it is simply returned as the procedure's return value; otherwise an output parameter would be required. The procedure takes as a parameter the number of next numbers to fetch; the output is the last number.
create proc LastNumber (@NumberOfNextNumbers int = 1)
as
begin
    declare @LastNumber int

    update [Counter]
    set LastNumber = LastNumber + @NumberOfNextNumbers -- holds update lock

    select @LastNumber = LastNumber
    from [Counter]

    return @LastNumber
end
The trigger on the Invoices table gets the number of simultaneously inserted invoices, asks the stored procedure for the next n numbers, and updates the invoices with those numbers.
create trigger InvoiceNumberTrigger on Invoices
after insert
as
    set nocount on

    declare @InvoiceID int
    declare @LastNumber int
    declare @RowsAffected int

    select @RowsAffected = count(*)
    from Inserted

    exec @LastNumber = dbo.LastNumber @RowsAffected

    update Invoices
    -- Year/month parts of the number are missing
    set Number = right('000' + ltrim(str(@LastNumber - rowNumber)), 3)
    from Invoices
    inner join
    ( select InvoiceID,
             row_number() over (order by InvoiceID desc) - 1 rowNumber
      from Inserted
    ) insertedRows
    on Invoices.InvoiceID = InsertedRows.InvoiceID
In case of a rollback there will be no gaps left. The Counter table could easily be expanded with keys for different sequences; in that case, a valid-until date might be nice, because you could prepare the table beforehand and let LastNumber worry about selecting the counter for the current year/month.
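A sketch of that expanded Counter table (the key and valid-until columns are assumptions, not part of the implementation above):
create table [Counter]
(
    SequenceKey varchar(20) not null primary key, -- e.g. one row per sequence or per month
    ValidUntil  datetime    null,                 -- optional, as mentioned above
    LastNumber  int         not null
)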
Example of usage:
insert into invoices (invoiceDate) values(GETDATE())
As the Number column's value is autogenerated, one should re-read it after the insert. I believe that EF has provisions for that.
The way that we handle this in SQL Server is by using the UPDLOCK table hint within a single transaction.
For example:
INSERT
INTO MyTable (
       MyNumber,
       MyField1 )
SELECT ISNULL(MAX(MyNumber), 0) + 1,
       'Test'
FROM   MyTable WITH (UPDLOCK)
It's not pretty, but since we were provided the database design and cannot change it due to legacy applications accessing the database, this was the best solution that we could come up with.
I have a SQL Server database designed like this :
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name1 (string)
Name2 (string, can be null)
Name3 (string, can be null)
Name4 (string, can be null)
TableValue
Iteration (int)
IdTableParameter (int, FOREIGN KEY)
Type (string)
Value (decimal)
So, as you've just understood, TableValue is linked to TableParameter.
TableParameter is like a multidimensional dictionary.
TableParameter is supposed to have a lot of rows (more than 300,000 rows)
From my C# client program, I have to fill this database after each Compute() call:
for (int iteration = 0; iteration < 5000; iteration++)
{
Compute();
FillResultsInDatabase();
}
In the FillResultsInDatabase() method, I have to:
Check whether the label of my parameter already exists in TableParameter; if it doesn't exist, insert a new one.
Insert the value into TableValue.
Step 1 takes a long time! I load the whole TableParameter table into an IEnumerable property and then, for each parameter, I do a
.FirstOrDefault( x => x.Name1 == item.Name1 &&
                      x.Name2 == item.Name2 &&
                      x.Name3 == item.Name3 &&
                      x.Name4 == item.Name4 );
in order to detect whether it already exists (and, if so, to get its id).
Performance is very bad like this!
I've tried making the selection with a WHERE clause in order to avoid loading every row of TableParameter, but performance is even worse!
How can I improve the performance of step 1?
For step 2, performance is still bad with classic INSERTs. I am going to try SqlBulkCopy.
How can I improve the performance of step 2?
EDITED
I've tried with stored procedures:
CREATE PROCEDURE GetIdParameter
    @Id int OUTPUT,
    @Name1 nvarchar(50) = null,
    @Name2 nvarchar(50) = null,
    @Name3 nvarchar(50) = null
AS
SELECT TOP 1 @Id = Id FROM TableParameter
WHERE
    TableParameter.Name1 = @Name1
    AND (@Name2 IS NULL OR TableParameter.Name2 = @Name2)
    AND (@Name3 IS NULL OR TableParameter.Name3 = @Name3)
GO

CREATE PROCEDURE CreateValue
    @Iteration int,
    @Type nvarchar(50),
    @Value decimal(32, 18),
    @Name1 nvarchar(50) = null,
    @Name2 nvarchar(50) = null,
    @Name3 nvarchar(50) = null
AS
DECLARE @IdParameter int

EXEC GetIdParameter @IdParameter OUTPUT, @Name1, @Name2, @Name3

IF @IdParameter IS NULL
BEGIN
    INSERT TableParameter (Name1, Name2, Name3)
    VALUES (@Name1, @Name2, @Name3)

    SELECT @IdParameter = SCOPE_IDENTITY()
END

INSERT TableValue (Iteration, IdTableParameter, Type, Value)
VALUES (@Iteration, @IdParameter, @Type, @Value)
GO
I still have the same performance... :-( (not acceptable)
If I understand what's happening, in step 1 you're querying the database to see if the data is already there. I'd use a DB call to a stored procedure that inserts the data if it's not there. So just compute the results and pass them to the SP.
Can you compute the results first, and then insert in batches?
Does the compute function take data from the database? If so, can you turn the operation into a set-based one and perform it on the server itself? Or maybe part of it?
Remember that SQL Server is designed for large dataset operations.
Edit: reflecting comments
Since the code is slow on the data inserts, and you suspect that it's because the insert has to search back before it can be done, I'd suggest that you may need to place SQL indexes on the columns that you search on in order to improve search speed.
However, I have another idea.
Why don't you just insert the data without the check, and then remove the duplicates later, in the query where you read the data back?
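A sketch of that insert-first, de-duplicate-on-read idea, assuming the column names from the question and keeping the lowest Id as the survivor:
;WITH ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Name1, Name2, Name3, Name4
                              ORDER BY Id) AS rn
    FROM dbo.TableParameter
)
SELECT *
FROM   ranked
WHERE  rn = 1;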
Given the fact that name2 - name3 can be null, would it be possible to restructure the parameter table:
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name (string)
Dimension int
Now you can index it and simplify the query: (WHERE Name = 'TheNameIWant' AND Dimension = 2)
(And speaking of indexes, you do have an index on the name columns in the parameter table, right?)
Where do you do your commits on the insert? If you do single-statement commits, group multiple inserts into one.
If you are the only one inserting values and speed is really of the essence, load all the values from the database into memory and check there.
Just some ideas.
HTH,
Mario
I must admit that I'm struggling to grasp the business process that you are trying to achieve here.
On initial review, it appears as if you are performing a data comparison within your application tier. I would advise against this and suggest that you let the database engine do what it is designed to do: manage and implement your data access.
As another poster has mentioned, I concur that you should look to create a Stored Procedure to handle your record insertion logic. The procedure can perform a simple check to see if your records already exist.
You should also consider the following (both sketched below):
Enforcing the insertion logic/rule by creating a unique constraint across the four name columns.
Creating a covering non-clustered index incorporating the four name columns.
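A minimal sketch of those two suggestions, using the column names from the question (the constraint and index names are made up):
-- Enforce uniqueness of a parameter label across the four name columns.
ALTER TABLE dbo.TableParameter
    ADD CONSTRAINT UQ_TableParameter_Names UNIQUE (Name1, Name2, Name3, Name4);

-- Covering non-clustered index for the lookup-by-name path; in practice the
-- unique constraint's own index usually covers this already.
CREATE NONCLUSTERED INDEX IX_TableParameter_Names
    ON dbo.TableParameter (Name1, Name2, Name3, Name4);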
With regard to performance of your inserts, perhaps you can provide some metrics to qualify what it is that you are seeing and how you are measuring it?
To give you a yardstick, the current ETL insertion record for SQL Server is approximately 16 million rows per second. What sort of numbers are you expecting and wanting to see?
The fastest way (that I know of so far) is a bulk insert, but not just lines of INSERTs; try INSERT + SELECT + UNION. It works pretty fast.
insert into myTable
select a1, b1, c1, ...
union select a2, b2, c2, ...
union select a3, b3, c3, ...