So let's say I have a table in SQL Server that serves as a queue for items that need processing. Something like this:
Id (bigint)
BatchGuid (guid)
BatchProcessed (bit)
...
...along with some other columns describing the item that needs to be processed, etc. So there are many running consumers that add records to this table as needed to indicate that an item needs to be processed.
So now let's say I have a job that is in charge of getting a batch of items from this table and processing them. Say we want to let it process 10 at a time. Now also assume that this job can have many instances running at once, so it is concurrently accessing the table (along with any other consumers who may be adding new records to the queue).
I was planning to do something like this:
using (var tx = new Transaction(Isolation.Serializable))
{
    var batchGuid = Guid.NewGuid();
    executeSql("update top(10) [QueueTable] set [BatchGuid] = @batchGuid where [BatchGuid] is null");
    var itemsToProcess = executeSql("select * from [QueueTable] where [BatchGuid] = @batchGuid");
    tx.Commit();
}
So basically what I'd be doing is starting a transaction as serializable, marking 10 items with a specific GUID, then getting those 10 items, then committing.
Is this a feasible strategy? I believe the serializable isolation level will essentially lock the whole table, preventing reads and writes until the transaction completes - is that correct? I believe that's what I want here: I don't want to read dirty data, and I don't want concurrently running jobs to stomp on each other when marking a batch of 10 to process.
Any insights as to whether I'm on the right track with this would be much appreciated. If there are better ways to accomplish this I'd welcome alternatives as well.
Serializable isolation does not necessarily lock the whole table. If you have an index on BatchGuid you will probably be OK; if not, SQL Server is likely to escalate to a table lock.
A few things you may want to look at:
Using the OUTPUT clause you can combine your UPDATE and SELECT into one query
You may need to use UPDLOCK if you have multiple processes running this query (see the sketch below)
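For illustration, here is a minimal sketch combining both points, using the table and column names from the question; the READPAST hint (which lets concurrent workers skip rows another worker has already locked, rather than block on them) is my assumption, not something prescribed above:

DECLARE @batchGuid UNIQUEIDENTIFIER = NEWID();

UPDATE q
SET [BatchGuid] = @batchGuid
OUTPUT inserted.*
FROM ( SELECT TOP (10) *
       FROM dbo.QueueTable WITH (UPDLOCK, READPAST, ROWLOCK)
       WHERE [BatchGuid] IS NULL
       ORDER BY Id
     ) AS q;

With UPDLOCK the claimed rows stay locked until the transaction ends; with READPAST, other instances of the job skip straight past them to the next unclaimed rows.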
You can do this in a single statement if you use the OUTPUT clause:
UPDATE TOP (10) [QueueTable]
SET [BatchGuid] = @batchGuid
OUTPUT inserted.*
WHERE [BatchGuid] IS NULL;
Or more specifically:
var itemsToProcess = executeSql("update top(10) [QUeueTable] output inserted.* set [BatchGuid] = batchGuid where [BatchGuid] is null");
It is personal preference, I suppose, but I have never been a fan of the UPDATE TOP(n) syntax, because you can't specify an ORDER BY, and in most cases when specifying TOP you want to specify an ORDER BY. I much prefer using something like:
UPDATE q
SET [BatchGuid] = @batchGuid
OUTPUT inserted.*
FROM ( SELECT TOP (10) *
       FROM dbo.QueueTable
       WHERE BatchGuid IS NULL
       ORDER BY ID
     ) AS q;
ADDENDUM
In response to the comment, I don't believe there is any chance of a race condition, but I was not 100% certain. The reason I don't believe it is that although the query reads as a SELECT plus an UPDATE, that is syntactic sugar: it is just an update, and it uses exactly the same plan and locks as the TOP query. However, since I didn't know for sure, I decided to test:
First I set up a sample table in tempdb, along with a logging table to record the updated IDs:
USE TempDB;
GO
CREATE TABLE dbo.T (ID BIGINT NOT NULL IDENTITY PRIMARY KEY, Col UNIQUEIDENTIFIER NULL);
INSERT dbo.T (Col)
SELECT TOP 1000000 NULL
FROM sys.all_objects a, sys.all_objects b;
CREATE TABLE dbo.T2 (ID BIGINT NOT NULL PRIMARY KEY);
Then in 10 different SSMS windows I ran this:
WHILE 1 = 1
BEGIN
    DECLARE @ID UNIQUEIDENTIFIER = NEWID();
    UPDATE T
    SET Col = @ID
    OUTPUT inserted.ID INTO dbo.T2 (ID)
    FROM ( SELECT TOP 10 *
           FROM dbo.T
           WHERE Col IS NULL
           ORDER BY ID
         ) t;
    IF @@ROWCOUNT = 0
        RETURN;
END
The whole process ran for 20 minutes, updating ~500,000 rows, before I stopped all 10 threads. Since updating the same row twice would throw a primary key violation when inserting into T2, and since all 10 threads had to be stopped manually, this shows there was no race condition. To confirm, I ran the following:
SELECT Col, COUNT(*)
FROM dbo.T
WHERE Col IS NOT NULL
GROUP BY Col
HAVING COUNT(*) <> 10;
Which, as expected, returned no rows.
I am happy to be proved wrong and to concede I was lucky in that none of these 100,000 iterations clashed, but I don't believe it was luck. I really believe there is a single lock, so it doesn't matter whether you have a transaction or not; you just need the correct isolation level.
I have a C# application that uses an Oracle database, and I need a query that fetches an unlocked row from a table. How can I select all unlocked rows?
Is there any 'translator' out there that can translate this T-SQL (MS SQL Server) query to the Oracle dialect?
SELECT TOP 1 * FROM TableXY WITH(UPDLOCK, READPAST);
I'm a little bit disappointed that Oracle seems to lack such a feature. Do they want to make me use AQ, or what?
Oracle does have this feature, specifically the SKIP LOCKED portion of the SELECT statement. To quote:
SKIP LOCKED is an alternative way to handle a contending transaction
that is locking some rows of interest. Specify SKIP LOCKED to instruct
the database to attempt to lock the rows specified by the WHERE clause
and to skip any rows that are found to be already locked by another
transaction.
The documentation goes on to say it is designed for use in multi-consumer queues, but that does not mean you have to use it in that environment. There is a large caveat, though: you can't ask for the next N unlocked rows - only for the next N rows, of which the unlocked ones will be returned.
SELECT *
FROM TableXY
WHERE ROWNUM = 1
FOR UPDATE SKIP LOCKED
Note that if the table you're selecting from is locked in exclusive mode - i.e. a session has instructed the database not to let any other session lock the table - you will not get any rows returned until the exclusive lock is released.
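To get "the first unlocked row" rather than "the first row, if it happens to be unlocked", one common workaround is a cursor opened with FOR UPDATE SKIP LOCKED, from which you fetch until you obtain a row. A minimal sketch, assuming the same TableXY:

DECLARE
  CURSOR c IS
    SELECT * FROM TableXY
    FOR UPDATE SKIP LOCKED;  -- rows locked by other sessions are skipped
  r TableXY%ROWTYPE;
BEGIN
  OPEN c;
  FETCH c INTO r;            -- r is now the first unlocked row, locked by us
  IF c%NOTFOUND THEN
    NULL;                    -- every row was locked, or the table is empty
  END IF;
  CLOSE c;
END;
/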
I faced the same problem recently, and after solving it I wrote this blog entry:
http://nhisawesome.blogspot.com/2013/01/how-to-lock-first-unlocked-row-in-table.html
Feel free to leave a comment! Your comments are appreciated.
Short summary: instead of selecting the first unlocked row and locking it, I select a bunch of records, then loop through the bunch and try to acquire a lock on each one using the SKIP LOCKED hint. If the current row is not lockable, I move on to the next one, until a lock is acquired or none remain.
select for update nowait will error out if you select a row that is locked. Is that what you want? I am curious what problem you are trying to solve. Unless you have long-running transactions, the lock on a row would be transient from one moment to the next.
Example:
CREATE TABLE TEST
(
COL1 NUMBER(10) NOT NULL,
COL2 VARCHAR2(20 BYTE) NOT NULL
);
CREATE UNIQUE INDEX TEST_PK ON TEST
(COL1);
ALTER TABLE TEST ADD (
CONSTRAINT TEST_PK
PRIMARY KEY
(COL1)
USING INDEX TEST_PK
);
SQL Session #1:
SQL> insert into test values(1,'1111');
1 row created.
SQL> insert into test values(2,'2222');
1 row created.
SQL> commit;
Commit complete.
SQL> update test set col2='AAAA' where col1=1;
1 row updated.
SQL Session #2: Attempt to read the locked row, get an error:
SQL> select * from test where col1=1 for update nowait;
select * from test where col1=1 for update nowait
*
ERROR at line 1:
ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired
We have an ASP.NET/MSSQL based web app which generates orders with sequential order numbers.
When a user saves a form, a new order is created as follows:
SELECT MAX(order_number) FROM order_table, call this max_order_number
set new_order_number = max_order_number + 1
INSERT a new order record, with this new_order_number (it's just a field in the order record, not a database key)
If I enclose the above 3 steps in single transaction, will it avoid duplicate order numbers from being created, if two customers save a new order at the same time? (And let's say the system is eventually on a web farm with multiple IIS servers and one MSSQL server).
I want to avoid two customers selecting the same MAX(order_number) due to concurrency somewhere in the system.
What isolation level should be used? Thank you.
Why not just use an Identity as the order number?
Edit:
As far as I know, you can make the current order_number column an Identity (you may have to reset the seed, it's been a while since I've done this). You might want to do some tests.
Here's a good read about what actually goes on when you change a column to an Identity in SSMS. The author mentions how this may take a while if the table already has millions of rows.
Using an identity is by far the best idea. I create all my tables like this:
CREATE TABLE mytable (
mytable_id int identity(1, 1) not null primary key,
name varchar(50)
)
The "identity" flag means, "Let SQL Server assign this number for me". The (1, 1) means that identity numbers should start at 1 and be incremented by 1 each time someone inserts a record into the table. Not Null means that nobody should be allowed to insert a null into this column, and "primary key" means that we should create a clustered index on this column. With this kind of a table, you can then insert your record like this:
-- We don't need to insert into mytable_id column; SQL Server does it for us!
INSERT INTO mytable (name) VALUES ('Bob Roberts')
But to answer your literal question, I can give a lesson about how transactions work. It's certainly possible, although not optimal, to do this:
-- Begin a transaction - this means everything within this region will be
-- executed atomically, meaning that nothing else can interfere.
BEGIN TRANSACTION
DECLARE @id bigint

-- Retrieves the maximum order number from the table
SELECT @id = MAX(order_number) FROM order_table

-- While you are in this transaction, no other queries can change the order table
-- (given a sufficiently strict isolation level), so this insert is guaranteed to succeed
INSERT INTO order_table (order_number) VALUES (@id + 1)
-- Committing the transaction releases your lock and allows other programs
-- to work on the order table
COMMIT TRANSACTION
Just keep in mind that declaring your table with an identity primary key column does this all for you automatically.
The risk is two processes selecting the MAX(order_number) before one of them inserts the new order. A safer way is to do it in one step:
INSERT INTO order_table
    (order_number, /* other fields */)
VALUES
    ( (SELECT MAX(order_number) + 1 FROM order_table),
      /* other values */
    )
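Note that even this single statement can race under the default READ COMMITTED isolation level: two sessions can each read the same MAX before either insert commits. One way to serialize the read is with lock hints; this is a sketch of my own, not part of the answer above:

INSERT INTO order_table (order_number /* , other fields */)
SELECT COALESCE(MAX(order_number), 0) + 1 /* , other values */
FROM order_table WITH (UPDLOCK, HOLDLOCK);

HOLDLOCK keeps the (range) lock until the transaction ends, so a second session's read of the MAX blocks until the first insert commits.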
I agree with G_M; use an Identity field. When you add your record, just
INSERT INTO order_table (/* other fields */)
VALUES (/* other fields */) ; SELECT SCOPE_IDENTITY()
The return value from Scope Identity will be your order number.
I have a clean-up process that needs to delete around 8 million rows from a table each day (sometimes more). The process is written in C#, and uses SMO to query the schema for table indexes, disabling them before executing a sproc that deletes in batches of 500K rows.
My problem is that the entire operation is living inside a transaction. The sproc is executed inside a TransactionScope configured with TransactionScopeOption.Suppress (it runs alongside other things that each start a new TransactionScope), which I thought would prevent it from enlisting in a transaction, and there are explicit commit points inside the sproc.
The C# part of the process could be summarized as this:
try {
DisableIndexes(table);
CleanTable(table);
}
finally {
RebuildIndexes(table);
}
And the sproc has a loop inside that's basically:
DECLARE @rowCount bigint = 1

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

WHILE @rowCount <> 0 BEGIN
    DELETE TOP (@rowsPerBatch) FROM [Table]
    WHERE ID <= @maxID
    SET @rowCount = @@ROWCOUNT
END
Last night this process timed out half an hour after it started, took half an hour to rollback, and another half hour of index rebuilding...too much down time for zero work...=(
Update: I've run the process on a small sample database (and with a small timeout), and it's not as I thought. Evidently the process is correctly removing rows and making progress as I want. Still, the log is getting consumed. Since I'm using the SIMPLE recovery model, shouldn't the log not grow in this case? Or is the deletion sproc so 'fast' that I'm not giving the process that actually clears the log the time it needs to keep the log clean?
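For what it's worth, even under the SIMPLE recovery model the log can only be truncated past the oldest active transaction, so if the whole loop ends up enlisted in one outer transaction, the log has to retain every batch until the end. A sketch (mine, reusing the variable names above) of committing per batch so each checkpoint can reclaim the space:

DECLARE @rowCount bigint = 1

WHILE @rowCount <> 0 BEGIN
    BEGIN TRANSACTION
    DELETE TOP (@rowsPerBatch) FROM [Table]
    WHERE ID <= @maxID
    SET @rowCount = @@ROWCOUNT
    COMMIT TRANSACTION -- log records for this batch can now be truncated
END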
Depending on your circumstances you may be able to use partitioned views instead of partitioned tables. Something like:
create table A
(
X int,
Y int,
Z varchar(300),
primary key (X,Y),
check (X = 1)
)
create table B
(
X int,
Y int,
Z varchar(300),
primary key (X,Y),
check (X = 2)
)
create table C
(
X int,
Y int,
Z varchar(300),
primary key (X,Y),
check (X = 3)
)
go
create view ABC
as
select * from A
union all
select * from B
union all
select * from C
go
insert abc (x,y,z)
values (1,4,'test')
insert abc (x,y,z)
values (2,99,'test'), (3,123,'test')
insert abc (x,y,z)
values (3,15125,'test')
select * from abc
truncate table c
select * from abc
If I understand your problem correctly, what you want is an automatic sliding window and partitioned tables.
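With that approach, the sliding window amounts to switching the oldest partition out to an empty staging table and truncating that - a metadata operation instead of a logged multi-million-row delete. A rough sketch with assumed names (the staging table must match the partitioned table's schema and live on the same filegroup):

-- Hypothetical: age out the oldest partition of dbo.BigTable.
ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Staging;
TRUNCATE TABLE dbo.BigTable_Staging;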
I have a SQL Server database designed like this :
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name1 (string)
Name2 (string, can be null)
Name3 (string, can be null)
Name4 (string, can be null)
TableValue
Iteration (int)
IdTableParameter (int, FOREIGN KEY)
Type (string)
Value (decimal)
So, as you've just understood, TableValue is linked to TableParameter.
TableParameter is like a multidimensional dictionary.
TableParameter is supposed to have a lot of rows (more than 300,000 rows)
From my C# client program, I have to fill this database after each call to the Compute() function:
for (int iteration = 0; iteration < 5000; iteration++)
{
Compute();
FillResultsInDatabase();
}
In the FillResultsInDatabase() method, I have to:
Check whether the label of my parameter already exists in TableParameter. If it doesn't exist, I have to insert a new one.
Insert the value into TableValue.
Step 1 takes a long time! I load the whole TableParameter table into an IEnumerable property and then, for each parameter, I do a
.FirstOrDefault( x => x.Name1 == item.Name1 &&
                      x.Name2 == item.Name2 &&
                      x.Name3 == item.Name3 &&
                      x.Name4 == item.Name4 );
in order to detect whether it already exists (and, if it does, to get the id).
Performance is very bad this way!
I've tried making the selection with a WHERE clause to avoid loading every row of TableParameter, but performance is even worse!
How can I improve the performance of step 1 ?
For step 2, performance is still bad with classic INSERTs. I am going to try SqlBulkCopy.
How can I improve the performance of step 2 ?
EDITED
I've tried with stored procedures:
CREATE PROCEDURE GetIdParameter
    @Id int OUTPUT,
    @Name1 nvarchar(50) = null,
    @Name2 nvarchar(50) = null,
    @Name3 nvarchar(50) = null
AS
SELECT TOP 1 @Id = Id FROM TableParameter
WHERE
    TableParameter.Name1 = @Name1
    AND (@Name2 IS NULL OR TableParameter.Name2 = @Name2)
    AND (@Name3 IS NULL OR TableParameter.Name3 = @Name3)
GO

CREATE PROCEDURE CreateValue
    @Iteration int,
    @Type nvarchar(50),
    @Value decimal(32, 18),
    @Name1 nvarchar(50) = null,
    @Name2 nvarchar(50) = null,
    @Name3 nvarchar(50) = null
AS
DECLARE @IdParameter int

EXEC GetIdParameter @IdParameter OUTPUT,
    @Name1, @Name2, @Name3

IF @IdParameter IS NULL
BEGIN
    INSERT TableParameter (Name1, Name2, Name3)
    VALUES (@Name1, @Name2, @Name3)

    SELECT @IdParameter = SCOPE_IDENTITY()
END

INSERT TableValue (Iteration, IdParameter, Type, Value)
VALUES (@Iteration, @IdParameter, @Type, @Value)
GO
I still have the same performance... :-( (not acceptable)
If I understand what's happening, you're querying the database to see if the data is there in step 1. I'd make a single call to a stored procedure that inserts the data if it's not there. So just compute the results and pass them to the sp.
Can you compute the results first and then insert in batches?
Does the compute function take data from the database? If so, can you turn the operation into a set-based one and perform it on the server itself, or at least part of it?
Remember that SQL Server is designed for large dataset operations.
Edit: reflecting comments
Since the code is slow on the data inserts, and you suspect that it's because each insert has to search for existing rows before it can proceed, I'd suggest adding indexes on the columns you search on, to improve the search speed.
However, I have another idea.
Why don't you just insert the data without the check, and then remove the duplicates later, in the query that reads the data? See the sketch below.
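A minimal sketch of that read-time dedupe, assuming the question's schema and keeping the earliest row per name combination:

SELECT Id, Name1, Name2, Name3, Name4
FROM ( SELECT *,
              ROW_NUMBER() OVER (PARTITION BY Name1, Name2, Name3, Name4
                                 ORDER BY Id) AS rn
       FROM TableParameter
     ) p
WHERE rn = 1;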
Given that Name2 - Name4 can be null, would it be possible to restructure the parameter table:
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name (string)
Dimension int
Now you can index it and simplify the query (WHERE Name = 'TheNameIWant' AND Dimension = 2).
(And speaking of indexes, you do have indexes on the name columns in the parameter table, right?)
Where do you do your commits on the insert? If you commit per statement, group multiple inserts into one commit.
If you are the only one inserting values, and speed is really of the essence, load all the values from the database into memory and do the checks there.
just some ideas
hth
Mario
I must admit that I'm struggling to grasp the business process that you are trying to achieve here.
On initial review, it appears as if you are performing a data comparison within your application tier. I would advise against this and suggest that you let the database engine do what it is designed to do: manage and implement your data access.
As another poster has mentioned, I concur that you should look to create a Stored Procedure to handle your record insertion logic. The procedure can perform a simple check to see if your records already exist.
You should also consider:
Enforcing the insertion logic/rule by creating a unique constraint across the four name columns (see the sketch after this list).
Creating a covering non-clustered index incorporating the four name columns.
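For illustration, a sketch of both suggestions against the question's schema (the constraint and index names are mine):

-- Reject duplicate name combinations at the engine level.
ALTER TABLE TableParameter
    ADD CONSTRAINT UQ_TableParameter_Names UNIQUE (Name1, Name2, Name3, Name4);

-- Let the existence check be answered from an index seek.
CREATE NONCLUSTERED INDEX IX_TableParameter_Names
    ON TableParameter (Name1, Name2, Name3, Name4) INCLUDE (Id);

Note that the unique constraint is itself enforced by an index on the four name columns, so if Id is the clustered primary key the separate covering index may be redundant.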
With regard to the performance of your inserts, perhaps you can provide some metrics to qualify what you are seeing and how you are measuring it?
To give you a yardstick, the current ETL insertion record for SQL Server is approx 16 million rows per second. What sort of numbers are you expecting and wanting to see?
The fastest way I know of so far is a bulk insert - but not just lines of INSERTs. Try INSERT + SELECT + UNION ALL; it works pretty fast. (Plain UNION would also deduplicate the rows, which is usually not what you want here.)
insert into myTable
select a1, b1, c1, ...
union all select a2, b2, c2, ...
union all select a3, b3, c3, ...
I'm building an ASP.NET MVC 2 site that uses LINQ to SQL. In one of the places where my site accesses the DB, I think a race condition is possible.
DB Architecture
Here are some of the columns of the relevant DB table, named Revisions:
RevisionID - bigint, IDENTITY, PK
PostID - bigint, FK to PK of Posts table
EditNumber - int
RevisionText - nvarchar(max)
On my site, users can submit a Post and edit a Post later on. Users other than the original poster are able to edit a Post - so there is scope for multiple edits on a single Post simultaneously.
When submitting a Post, a record in the Posts table is created, as well as a record in the Revisions table with PostID set to the ID of the Posts record, RevisionText set to the Post text, and EditNumber set to 1.
When editing a Post, only a Revisions record is created, with EditNumber being set to 1 higher than the latest edit number.
Thus, the EditNumber column refers to how many times a Post has been edited.
Incrementing EditNumber
The challenge that I see in implementing those functions is incrementing the EditNumber column. As that column can't be an IDENTITY, I have to manipulate its value manually.
Here's my LINQ query for determining what EditNumber a new Revision should have:
using(var db = new DBDataContext())
{
var rev = new Revision();
rev.EditNumber = db.Revisions.Where(r => r.PostID == postID).Max(r => r.EditNumber) + 1;
// ... (fill other properties)
db.Revisions.InsertOnSubmit(rev);
db.SubmitChanges();
}
Calculating a maximum and incrementing it can lead to a race condition.
Is there a better way to implement that function?
Update directly in the database and return the new revision number:
update Revisions
set EditNumber += 1
output INSERTED.EditNumber
where PostID = @postId;
Unfortunately, this is not possible in LINQ. In fact, it is not possible on the client at all, no matter the technology used, short of doing pessimistic locking, which has too many drawbacks to be worth considering.
Updated:
Here is how I would insert a new revision (including the first revision):
create procedure usp_insertPostRevision
    @postId int,
    @text nvarchar(max),
    @revisionId bigint output
as
begin
    set nocount on;
    declare @nextEditNumber table (EditNumber int not null);
    declare @rc int = 0;

    begin transaction;
    begin try
        update Posts
        set LastRevision += 1
        output INSERTED.LastRevision
            into @nextEditNumber (EditNumber)
        where PostId = @postId;

        set @rc = @@rowcount;
        if (@rc <> 1)
            raiserror (N'Expected exactly one post with Id:%i. Found:%i',
                16, 1, @postId, @rc);

        insert into Revisions
            (PostId, Text, EditNumber)
        select @postId, @text, EditNumber
        from @nextEditNumber;

        set @revisionId = scope_identity();
        commit;
    end try
    begin catch
        -- Error handling omitted
    end catch
end
I omitted the error handling; see Exception handling and nested transactions for a template procedure that handles errors and nested transactions properly.
You'll notice the Posts table has a LastRevision field that is used as the increment source for the post's revisions. This is much better than computing the MAX each time you add a revision, as it avoids a (range) scan of Revisions. It also acts as concurrency protection: only one transaction at a time can update it, and only that transaction will proceed with inserting a new revision. Concurrent transactions block and wait until the first one commits; the next transaction unblocked will then correctly update the revision number to +1.
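For completeness, a hypothetical call of the procedure above (the post id 42 is made up):

declare @revisionId bigint;

exec usp_insertPostRevision @postId = 42, @text = N'Revised text',
    @revisionId = @revisionId output;

select @revisionId as NewRevisionId;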
Can multiple users edit the same post at the same time? If not, then you do not have a race condition, unless somehow a single user can submit multiple edits simultaneously.
If revisions are only permitted by the user who submitted the comment, then you're OK with the above; if multiple users can be revising a single comment, then there's scope for problems.
Since there is only one record in the Posts table per Post, use a lock.
Read the record in the Posts table using a table hint [WITH (ROWLOCK, XLOCK)] to get an exclusive lock, and set the lock timeout to wait a few milliseconds.
If the process gets the lock, it can add the revision record. If it cannot get the lock, it should try again. After a few retries, if the process still cannot get a lock, return an error.
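A minimal sketch of that approach, with assumed names and an illustrative 50 ms timeout:

DECLARE @postId bigint = 42; -- hypothetical post being revised

SET LOCK_TIMEOUT 50; -- fail after 50 ms instead of blocking indefinitely

BEGIN TRANSACTION;

-- Take an exclusive row lock on the post; a concurrent writer gets
-- error 1222 (lock request time out) and can retry.
SELECT PostID
FROM Posts WITH (ROWLOCK, XLOCK)
WHERE PostID = @postId;

-- Compute the next EditNumber and insert the Revisions record here.

COMMIT TRANSACTION;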
Since EditNumber is a property determined by membership in a collection, have the collection provide it.
Make EditNumber a computed column - the COUNT of records for the same post with a lesser RevisionID.
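SQL Server does not allow a subquery directly in a computed column definition, so one way to realize this idea (my adaptation, using the question's column names) is to derive EditNumber at read time instead of storing it:

-- Each revision's EditNumber is its position within its post's revisions.
CREATE VIEW dbo.RevisionsNumbered AS
SELECT RevisionID, PostID, RevisionText,
       ROW_NUMBER() OVER (PARTITION BY PostID ORDER BY RevisionID) AS EditNumber
FROM dbo.Revisions;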