How to properly reserve identity values for usage in a database?

How to properly reserve identity values for usage in a database? - c#

We have some code in which we need to maintain our own identity (PK) column in SQL. We have a table in which we bulk insert data, but we add data to related tables before the bulk insert is done, thus we can not use an IDENTITY column and find out the value up front.
The current code is selecting the MAX value of the field and incrementing it by 1. Although there is a highly unlikely chance that two instances of our application will be running at the same time, it is still not thread-safe (not to mention that it goes to the database everytime).
I am using the ADO.net entity model. How would I go about 'reserving' a range of id's to use, and when that range runs out, grab a new block to use, and guarantee that the same range will not be used.

use more universal unique identifier data type like UNIQUEIDENTIFIER (UUID) instead of INTEGER. In this case you can basically create it on the client side, pass it to the SQL and do not have to worry about it. The disadvantage is that, of course, the size of this field.
create a simple table in the database CREATE TABLE ID_GEN (ID INTEGER IDENTITY), and use this as a factory to give you the identifiers. Ideally you would create a stored procedure (or function), to which you would pass the number of identifiers you need. The stored procedure will then insert this number of rows (empty) into this ID_GEN table and will return you all new ID's, which you can use in your code. Obviously, your original tables will not have the IDENTITY anymore.
create your own variation of the ID_Factory above.
I would choose simplicity (UUID) if you are not constrained otherwise.

If it's viable to change the structure of the table, then perhaps use a uniqueidentifier for the PK instead along with newid() [SQL] or Guid.NewGuid() [C#] in your row generation code.
From Guid.NewGuid() doco:
There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.

Why are you using ADO.net Entity Framework to do what sounds like ETL work? (See critique of ADO.NET Entity Framework and ORM in general below. It is rant free).
Why use ints at all? Using a uniqueidentifier would solve the "multiple instances of the application running" issue.
Using a uniqueidentifier as a column default will be slower than using an int IDENTITY... it takes more time to generate a guid than an int. A guid will also be larger (16 byte) than an int (4 bytes). Try this first and if it results in acceptable performance, run with it.
If the delay introduced by generating a guid on each row insert it unacceptable, create guids in bulk (or on another server) and cache them in a table.
Sample TSQL code:
CREATE TABLE testinsert
(
date_generated datetime NOT NULL DEFAULT GETDATE(),
guid uniqueidentifier NOT NULL,
TheValue nvarchar(255) NULL
)
GO
CREATE TABLE guids
(
guid uniqueidentifier NOT NULL DEFAULT newid(),
used bit NOT NULL DEFAULT 0,
date_generated datetime NOT NULL DEFAULT GETDATE(),
date_used datetime NULL
)
GO
CREATE PROCEDURE GetGuid
#guid uniqueidentifier OUTPUT
AS
BEGIN
SET NOCOUNT ON
DECLARE #return int = 0
BEGIN TRY
BEGIN TRANSACTION
SELECT TOP 1 #guid = guid FROM guids WHERE used = 0
IF #guid IS NOT NULL
UPDATE guids
SET
used = 1,
date_used = GETDATE()
WHERE guid = #guid
ELSE
BEGIN
SET #return = -1
PRINT 'GetGuid Error: No Unused guids are available'
END
COMMIT TRANSACTION
END TRY
BEGIN CATCH
SET #return = ERROR_NUMBER() -- some error occurred
SET #guid = NULL
PRINT 'GetGuid Error: ' + CAST(ERROR_NUMBER() as varchar) + CHAR(13) + CHAR(10) + ERROR_MESSAGE()
ROLLBACK
END CATCH
RETURN #return
END
GO
CREATE PROCEDURE InsertIntoTestInsert
#TheValue nvarchar(255)
AS
BEGIN
SET NOCOUNT ON
DECLARE #return int = 0
DECLARE #guid uniqueidentifier
DECLARE #getguid_return int
EXEC #getguid_return = GetGuid #guid OUTPUT
IF #getguid_return = 0
BEGIN
INSERT INTO testinsert(guid, TheValue) VALUES (#guid, #TheValue)
END
ELSE
SET #return = -1
RETURN #return
END
GO
-- generate the guids
INSERT INTO guids(used) VALUES (0)
INSERT INTO guids(used) VALUES (0)
--Insert data through the stored proc
EXEC InsertIntoTestInsert N'Foo 1'
EXEC InsertIntoTestInsert N'Foo 2'
EXEC InsertIntoTestInsert N'Foo 3' -- will fail, only two guids were created
-- look at the inserted data
SELECT * FROM testinsert
-- look at the guids table
SELECT * FROM guids
The fun question is... how do you map this to ADO.Net's Entity Framework?
This is a classic problem that started in the early days of ORM (Object Relational Mapping).
If you use relational-database best practices (never allow direct access to base tables, only allow data manipulation through views and stored procedures), then you add headcount (someone capable and willing to write not only the database schema, but also all the views and stored procedures that form the API) and introduce delay (the time to actually write this stuff) to the project.
So everyone cuts this and people write queries directly against a normalized database, which they don't understand... thus the need for ORM, in this case, the ADO.NET Entity Framework.
ORM scares the heck out of me. I've seen ORM tools generate horribly inefficient queries which bring otherwise performant database servers to their knees. What was gained in programmer productivity was lost in end-user waiting and DBA frustration.

The Hi/Lo algorithm may be of interest to you:
What's the Hi/Lo algorithm?

Two clients could reserve the same block of id's.
There is no solution short of serializing your inserts by locking.
See Locking Hints in MSDN.

I fyou have a lot of child tables you might not want to change the PK. PLus the integer filedsa relikely to perform better in joins. But you could still add a GUID field and populate it in the bulk insert with pre-generated values. Then you could leave the identity insert alone (almost alawys a bad idea to turn it off) and use the GUID values you pre-generated to get back the Identity values you just inserted for the insert into child tables.
If you use a regular set-based insert (one with the select clause instead of the values clause) instead of a bulk insert, you could use the output clause to get the identities back for the rows if you are using SQL Server 2008.

The most general solution is generate client identifiers that never across with database identifiers - usually it is negative values, then update identifiers with identifier generated by database on inserting.
This way is safe to use in application with many users inserts the data simultaneously. Any other ways except GUIDs are not multiuser-safe.
But if you have that rare case when entity's primary key is required to be known before entity is saved to database, and it is impossible to use GUID, you may use identifier generation algorithm which are prevent identifier overlapping.
The most simple is assigning a unique identifier prefix for each connected client, and prepend it to each identifier generated by this client.
If you are using ADO.NET Entity Framework, you probably should not worry about identifier generation: EF generates identifiers by itself, just mark primary key of the entity as IsDbGenerated=true.
Strictly saying, entity framework as other ORM does not require identifier for objects are not saved to database yet, it is enought object reference for correctly operating with new entities. Actual primary key value is required only on updating/deleting entity, and on updating/deleting/inserting entity that references new entity, e.i. in cases when actual primary key value is about to be written in database. If entity is new, it is impossible to save other entites that are referenced new entity until new entity is not saved to database, and ORMs maintains specific order of entities saving which take references map into account.

Related

Explain Code First CRUD auto-generated SQL for Identity column

Code-first auto generates an insert procedure code as below for a table that has ProductID as primary key (identity column).
CREATE PROCEDURE [dbo].[InsertProducts]
#ProductName [nvarchar](max),
#Date [datetime],
AS
BEGIN
INSERT dbo.ProductsTable([ProductName], [Date])
VALUES (#ProductName, #Date)
-- identity stuff starts here
DECLARE #ProductID int
SELECT #ProductID = [ProductID]
FROM dbo.FIT_StorageLocations
WHERE ##ROWCOUNT > 0 AND [ProductID] = scope_identity()
SELECT t0.[ProductID]
FROM dbo.ProductsTable AS t0
WHERE ##ROWCOUNT > 0 AND t0.[ProductID] = #ProductID
END
GO
Could you please explain the code that handles the identity column? Also, if an insert procedure is to be manually written from scratch, would it be handled differently?
If for example I would remove this auto generated code, I would encounter one of the following errors:
Procedure ....expects parameter '#ProductID', which was not supplied
Store update, insert, or delete statement affected an unexpected number of rows (0). Entities may have been modified or deleted since entities were loaded. See http://go.microsoft.com/fwlink/?LinkId=472540 for information on understanding and handling optimistic concurrency exceptions.
In the app, this is how I call the procedure which works fine until I try to mess with the code first auto generated SQL:
using (var db = new AppContext())
{
var record = new ProductObj()
{
ProductName= this.ProductName,
Date = DateTime.UtcNow
};
db.ProductDbSet.Add(record);
db.SaveChanges();
}

I guess there are two things to be explained here.
Why a SELECT statement when I insert stuff?
Let's first see what a regular insert by Entity Framework looks like. By "regular" I mean an insert without mapping CUD actions to stored procedures. The normal pattern is:
INSERT [dbo].[Product]([Name], ...)
VALUES (#0, ...)
SELECT [Id]
FROM [dbo].[Product]
WHERE ##ROWCOUNT > 0 AND [Id] = scope_identity()
So the INSERT is followed by a SELECT. This is because EF needs to know the identity value that the database assigns to the new Product to assign it to the entity object's Product.ProductId property and to track the entity. If for some reason you'd decide to do an update immediately after the insert, EF will be able to generate an update statement like UPDATE ... WHERE Id = #0.
When the insert is handled by a stored procedure, the sproc should return the new Id value in a way that looks like the regular insert. It expects to receive a one-column result set of which the column is named after the identity column. It should contain one row, the new identity value.
So that's why there is a SELECT statement in there, and why EF complains if you remove it. But, you might ask, does EF really need 7 lines of code to get an assigned identity value?
Why so much code?
Honestly, I have to speculate a bit here, because it isn't documented as far as I can find. But let's look at a minimal working version:
INSERT [dbo].[Products]([Name])
VALUES (#Name)
SELECT scope_identity() AS ProductId;
This does the job. It's even the standard example of many tutorials, including official ones, on mapping CUD actions to stored procedures.
But a database can be stuffed with triggers, constraints, defaults, etc. It's hard to predict their influence on the returned scope_identity() under the wide range of circumstances EF may encounter. So EF wants to guarantee that the returned value really belongs to the newly inserted record. And that a record has actually been inserted in the first place. That's why it adds the SELECT from the Product table, including the ##ROWCOUNT.
To implement these safeguards, a minimal version would be:
INSERT [dbo].[Products]([Name])
VALUES (#Name)
SELECT t0.[ProductId]
FROM [dbo].[Products] AS t0
WHERE ##ROWCOUNT > 0 AND t0.[ProductId] = scope_identity()
Same as in the regular insert.
That's as far as I can follow EF. It puzzles me a bit that this single SELECT apparently is enough for a regular INSERT but not for a stored procedure. I can't explain why there are two SELECTs in the generated code.

Using String values for primary key

Can I store SOF/2015/01 as my ID, and can I auto increment 01 like usual primary key?

Can I store SOF/2015/01 as my ID
Answer : yes you can
can I auto increment 01 like usual primary key.
Answer : no you can't
Auto increment can increment only numbers.
You have to that manually.
You can use function in trigger to generate your desire auto incremented number like this
create function NextCustomerNumber()
returns char(5)
as
begin
declare #lastval char(5)
set #lastval = (select max(customerNumber) from Customers)
if #lastval is null set #lastval = 'C0001'
declare #i int
set #i = right(#lastval,4) + 1
return 'C' + right('000' + convert(varchar(10),#i),4)
end
This can cause some issues, however:
What if two processes attempt to add a row to the table at the exact
same time? Can you ensure that the same value is not generated for
both processes?
There can be overhead querying the existing data each time you'd like
to insert new data
Unless this is implemented as a trigger, this means that all inserts
to your data must always go through the same stored procedure that
calculates these sequences. This means that bulk imports, or moving
data from production to testing and so on, might not be possible or
might be very inefficient.
If it is implemented as a trigger, will it work for a set-based
multi-row INSERT statement? If so, how efficient will it be? This
function wouldn't work if called for each row in a single set-based
INSERT -- each NextCustomerNumber() returned would be the same value.
You can learn more from this

Create a two column unique primary key with the string 'SOF/2015' as part one and an auto increment-ing integer as the second column. You can combine the two columns using a function that returns a string to give you the combined key. For syntactic sugar, you can create a view on the table using your function to combine the keys into one view column.

You can certainly use NCHAR or NVARCHAR types as primary keys, on the presumption that any variable-sized columns don't use MAX, and data doesn't exceed the maximum allowable size of your index.
As for using it as an auto-incremented column, that won't work. SQL is very smart, but not quite in that way.
I would suggest pulling that string into two or three separate columns, so that you can store the "01" portion as a separate, IDENTITY'd column. But certainly this is a design question that you'd have to work out on your own.
The other solution would be a trigger, but I'd probably hesitate in general with using something like this as a primary key. Using numeric types is just a lot nicer in many ways, particularly when you have to reference the tables elsewhere. You could always apply a UNIQUE index on the string representation.

How do I structure this transaction?

We have an ASP.NET/MSSQL based web app which generates orders with sequential order numbers.
When a user saves a form, a new order is created as follows:
SELECT MAX(order_number) FROM order_table, call this max_order_number
set new_order_number = max_order_number + 1
INSERT a new order record, with this new_order_number (it's just a field in the order record, not a database key)
If I enclose the above 3 steps in single transaction, will it avoid duplicate order numbers from being created, if two customers save a new order at the same time? (And let's say the system is eventually on a web farm with multiple IIS servers and one MSSQL server).
I want to avoid two customers selecting the same MAX(order_number) due to concurrency somewhere in the system.
What isolation level should be used? Thank you.

Why not just use an Identity as the order number?
Edit:
As far as I know, you can make the current order_number column an Identity (you may have to reset the seed, it's been a while since I've done this). You might want to do some tests.
Here's a good read about what actually goes on when you change a column to an Identity in SSMS. The author mentions how this may take a while if the table already has millions of rows.

Using an identity is by far the best idea. I create all my tables like this:
CREATE TABLE mytable (
mytable_id int identity(1, 1) not null primary key,
name varchar(50)
)
The "identity" flag means, "Let SQL Server assign this number for me". The (1, 1) means that identity numbers should start at 1 and be incremented by 1 each time someone inserts a record into the table. Not Null means that nobody should be allowed to insert a null into this column, and "primary key" means that we should create a clustered index on this column. With this kind of a table, you can then insert your record like this:
-- We don't need to insert into mytable_id column; SQL Server does it for us!
INSERT INTO mytable (name) VALUES ('Bob Roberts')
But to answer your literal question, I can give a lesson about how transactions work. It's certainly possible, although not optimal, to do this:
-- Begin a transaction - this means everything within this region will be
-- executed atomically, meaning that nothing else can interfere.
BEGIN TRANSACTION
DECLARE #id bigint
-- Retrieves the maximum order number from the table
SELECT #id = MAX(order_number) FROM order_table
-- While you are in this transaction, no other queries can change the order table,
-- so this insert statement is guaranteed to succeed
INSERT INTO order_table (order_number) VALUES (#id + 1)
-- Committing the transaction releases your lock and allows other programs
-- to work on the order table
COMMIT TRANSACTION
Just keep in mind that declaring your table with an identity primary key column does this all for you automatically.

The risk is two processes selecting the MAX(order_number) before one of them inserts the new order. A safer way is to do it in one step:
INSERT INTO order_table
(order_number, /* other fields */)
VALUES
( (SELECT MAX(order_number)+1 FROM order_table ) order_number,
/* other values */
)

I agree with G_M; use an Identity field. When you add your record, just
INSERT INTO order_table (/* other fields */)
VALUES (/* other fields */) ; SELECT SCOPE_IDENTITY()
The return value from Scope Identity will be your order number.

How can I Insert/Update into two related tables in one command?

A database exists with two tables
Data_t : DataID Primary Key that is
Identity 1,1. Also has another field
'LEFT' TINYINT
Data_Link_t : DataID PK and FK where
DataID MUST exist in Data_t. Also has another field 'RIGHT' SMALLINT
Coming from a microsoft access environment into C# and sql server I'm looking for a good method of importing a record into this relationship.
The record contains information that belongs on both sides of this join (Possibly inserting/updating upwards 5000 records at once). Bonus to process the entire batch in some kind of LINQ list type command but even if this is done record by record the key goal is that BOTH sides of this record should be processed in the same step.
There are countless approaches and I'm looking at too many to determine which way I should go so I thought faster to ask the general public. Is LINQ an option for inserting/updating a big list like this with LINQ to SQL? Should I go record by record? What approach should I use to add a record to normalized tables that when joined create the full record?

Sounds like a case where I'd write a small stored proc and call that from C# - e.g. as a function on my Linq-to-SQL data context object.
Something like:
CREATE PROCEDURE dbo.InsertData(#Left TINYINT, #Right SMALLINT)
AS BEGIN
DECLARE #DataID INT
INSERT INTO dbo.Data_t(Left) VALUES(#Left)
SELECT #DataID = SCOPE_IDENTITY();
INSERT INTO dbo.Data_Link_T(DataID, Right) VALUES(#DataID, #Right)
END
If you import that into your data context, you could call this something like:
using(YourDataContext ctx = new YourDataContext)
{
foreach(YourObjectType obj in YourListOfObjects)
{
ctx.InsertData(obj.Left, obj.Right)
}
}
and let the stored proc handle all the rest (all the details, like determining and using the IDENTITY from the first table in the second one) for you.

I have never tried it myself, but you might be able to do exactly what you are asking for by creating an updateable view and then inserting records into the view.
UPDATE
I just tried it, and it doesn't look like it will work.
Msg 4405, Level 16, State 1, Line 1
View or function 'Data_t_and_Data_Link_t' is not updatable because the modification affects multiple base tables.
I guess this is just one more thing for all the Relational Database Theory purists to hate about SQL Server.
ANOTHER UPDATE
Further research has found a way to do it. It can be done with a view and an "instead of" trigger.
create table Data_t
(
DataID int not null identity primary key,
[LEFT] tinyint,
)
GO
create table Data_Link_t
(
DataID int not null primary key foreign key references Data_T (DataID),
[RIGHT] smallint,
)
GO
create view Data_t_and_Data_Link_t
as
select
d.DataID,
d.[LEFT],
dl.[RIGHT]
from
Data_t d
inner join Data_Link_t dl on dl.DataID = d.DataID
GO
create trigger trgInsData_t_and_Data_Link_t on Data_t_and_Data_Link_T
instead of insert
as
insert into Data_t ([LEFT]) select [LEFT] from inserted
insert into Data_Link_t (DataID, [RIGHT]) select ##IDENTITY, [RIGHT] from inserted
go
insert into Data_t_and_Data_Link_t ([LEFT],[RIGHT]) values (1, 2)

How to get the primary key from a table without making a second trip?

How would I get the primary key ID number from a Table without making a second trip to the database in LINQ To SQL?
Right now, I submit the data to a table, and make another trip to figure out what id was assigned to the new field (in an auto increment id field). I want to do this in LINQ To SQL and not in Raw SQL (I no longer use Raw SQL).
Also, second part of my question is: I am always careful to know the ID of a user that's online because I'd rather call their information in various tables using their ID as opposed to using a GUID or a username, which are all long strings. I do this because I think that SQL Server doing a numeric compare is much (?) more efficient than doing a username (string) or even a guid (very long string) compare. My questions is, am I more concerned than I should be? Is the difference worth always keeping the userid (int32) in say, session state?
#RedFilter provided some interesting/promising leads for the first question, because I am at this stage unable to try them, if anyone knows or can confirm these changes that he recommended in the comments section of his answer?

If you have a reference to the object, you can just use that reference and call the primary key after you call db.SubmitChanges(). The LINQ object will automatically update its (Identifier) primary key field to reflect the new one assigned to it via SQL Server.
Example (vb.net):
Dim db As New NorthwindDataContext
Dim prod As New Product
prod.ProductName = "cheese!"
db.Products.InsertOnSubmit(prod)
db.SubmitChanges()
MessageBox.Show(prod.ProductID)
You could probably include the above code in a function and return the ProductID (or equivalent primary key) and use it somewhere else.
EDIT: If you are not doing atomic updates, you could add each new product to a separate Collection and iterate through it after you call SubmitChanges. I wish LINQ provided a 'database sneak peek' like a dataset would.

Unless you are doing something out of the ordinary, you should not need to do anything extra to retrieve the primary key that is generated.
When you call SubmitChanges on your Linq-to-SQL datacontext, it automatically updates the primary key values for your objects.
Regarding your second question - there may be a small performance improvement by doing a scan on a numeric field as opposed to something like varchar() but you will see much better performance either way by ensuring that you have the correct columns in your database indexed. And, with SQL Server if you create a primary key using an identity column, it will by default have a clustered index over it.

Linq to SQL automatically sets the identity value of your class with the ID generated when you insert a new record. Just access the property. I don't know if it uses a separate query for this or not, having never used it, but it is not unusual for ORMs to require another query to get back the last inserted ID.
Two ways you can do this independent of Linq To SQL (that may work with it):
1) If you are using SQL Server 2005 or higher, you can use the OUTPUT clause:
Returns information from, or
expressions based on, each row
affected by an INSERT, UPDATE, or
DELETE statement. These results can be
returned to the processing application
for use in such things as confirmation
messages, archiving, and other such
application requirements.
Alternatively, results can be inserted
into a table or table variable.
2) Alternately, you can construct a batch INSERT statement like this:
insert into MyTable
(field1)
values
('xxx');
select scope_identity();
which works at least as far back as SQL Server 2000.

In T-SQL, you could use the OUTPUT clause, saying:
INSERT table (columns...)
OUTPUT inserted.ID
SELECT columns...
So if you can configure LINQ to use that construct for doing inserts, then you can probably get it back easily. But whether LINQ can get a value back from an insert, I'll let someone else answer that.

Calling a stored procedure from LINQ that returns the ID as an output parameter is probably the easiest approach.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.