Approach for primary key generation - c#

What is the best approach when generating a primary key for a table?
That is, when the data received by the database is not injective and can't be used as a primary key.
In the code, what is the best way to manage a primary key for the table rows?
Thanks.

First recommendation: stay away from uniqueidentifier for any primary key. Although it is easy to generate client side, it makes it almost impossible to have any useful indexes on the primary key. If I could go back in time and ban uniqueidentifiers from 99% of the places where they have been used, it would have saved more than 3 man-years of DBA/development time in the last 2 years.
Here is what I would recommend, using the INT IDENTITY as a primary key.
create table YourTableName(
    pkID int not null identity primary key
    -- , ... the rest of the columns are declared next
)
where pkID is the name of your primary key column.
This should do what you are looking for.

AUTO_INCREMENT in MySQL, IDENTITY in SQL Server.

IDENTITY in SQL Server.
If you need to know what your new ID was while INSERT-ing data, use the OUTPUT clause of the INSERT statement, which returns a copy of the inserted rows (for example into a table variable).
If for some reason generating a unique ID in SQL is not suitable for you, generate GUIDs in your app. A GUID has a very high level of uniqueness (although it is not strictly guaranteed), and SQL Server has a dedicated GUID column type called uniqueidentifier.
http://msdn.microsoft.com/en-us/library/ms187942.aspx
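As a rough illustration of the two options in the answers above (a database-generated key versus an application-generated GUID), here is a minimal Python sketch. It uses the standard-library sqlite3 module purely as a stand-in for SQL Server or MySQL: AUTOINCREMENT plays the role of IDENTITY / AUTO_INCREMENT, and cursor.lastrowid plays the role of reading the new ID back. The payload column and the insert_row helper are made up for the example.

```python
import sqlite3
import uuid

# In-memory SQLite database, used only as a stand-in for SQL Server / MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTableName ("
    " pkID INTEGER PRIMARY KEY AUTOINCREMENT,"  # SQLite counterpart of INT IDENTITY
    " payload TEXT NOT NULL)"                   # hypothetical data column
)


def insert_row(payload):
    """Insert one row and return the key the database generated for it."""
    cur = conn.execute(
        "INSERT INTO YourTableName (payload) VALUES (?)", (payload,)
    )
    return cur.lastrowid


# Two rows with identical data still receive distinct keys.
first_id = insert_row("same data")
second_id = insert_row("same data")

# Alternative mentioned in the answer: generate a GUID in the application.
client_side_key = str(uuid.uuid4())
```

In SQL Server the same read-back would normally be done with the OUTPUT clause or SCOPE_IDENTITY() rather than lastrowid.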

Related

Bulk insert related sets of data with unknown auto-incremented IDs

We are converting database primary keys from GUIDs to auto-incremented INTs. We have data that we parse from text files and put into two C# DataTables Claim and ClaimCharge that we have been using to bulk insert into identically named tables in the database. In the database, ClaimCharge.ClaimID is a foreign key to Claim.ID and several claim charges exist for one claim.
With GUIDs we generated the Claim and ClaimCharge IDs in C#, so bulk inserting was no problem. But with INTs, I don't know what the Claim.ID will be, so I can't assign ClaimCharge.ClaimID. I need some ideas on how this could be accomplished with INTs.
For instance, if the Claim table could be manually locked against inserts, I could:
1. Bulk insert into alternate tables named ClaimBulkData and ClaimChargeBulkData. These tables would still use GUIDs for convenience in keeping the relationship maintained between C# and SQL.
2. Manually lock the Claim table against inserts (I don't know if this is possible) and get the MAX(ID).
3. Increment all of the data in ClaimBulkData using MAX(ID).
4. Associate ClaimChargeBulkData to ClaimBulkData using the newly updated INT.
5. Insert data into the real Claim table as a set using IDENTITY_INSERT ON, using some kind of exception to the imaginary lock created in step 2.
6. Release the manually created lock against inserts on the Claim table (again, I don't know if this is possible).
7. Insert data into the real ClaimCharge table.
I want to avoid inserting the data one row at a time in either C# or T-SQL.
Why not just add the new auto-increment column to the master tables? You will then have both the GUID and the auto-ID column, so you can fix up the foreign key relationships (one master table at a time).
For example:
Assume you have Master1, Detail1 and Detail2.
alter table Master1 add ID int identity(1,1) not null
GO
alter table Detail1 add Master1ID int null
GO
alter table Detail2 add Master1ID int null
GO
Then update Detail1 and Detail2 by joining to Master1 on the old GUID key, setting the corresponding value of Master1ID in each table.
You can then add the foreign keys based on Master1ID to Detail1 and Detail2.
At this point you should have a complete set of data based on both sets of keys, and you can test, update views, etc. to make sure they work with the new integer IDs.
Finally, once everything works, drop the unneeded GUID foreign keys and the GUID columns themselves.
You can always compact (shrink) the database once everything is clean and converted, if your intent was to reduce overall disk usage via this restructuring. The point is that much of the work in a process like this is fixing up foreign keys.
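The fix-up step described above (set the new integer foreign key by joining on the old GUID) can be sketched as follows. This is a Python/sqlite3 stand-in rather than T-SQL, so the tables are created with both key columns already in place instead of using ALTER TABLE ... IDENTITY; the column names OldGuid, Master1Guid and DetailID are assumptions made for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Master1 already carries the new integer key next to the old GUID.
    CREATE TABLE Master1 (
        ID      INTEGER PRIMARY KEY AUTOINCREMENT,
        OldGuid TEXT NOT NULL UNIQUE
    );
    -- Detail1 still points at its master by GUID; Master1ID starts out NULL.
    CREATE TABLE Detail1 (
        DetailID    INTEGER PRIMARY KEY,
        Master1Guid TEXT NOT NULL,
        Master1ID   INTEGER NULL
    );
""")

# Three master rows, each with two detail rows that reference it by GUID.
guids = [str(uuid.uuid4()) for _ in range(3)]
conn.executemany("INSERT INTO Master1 (OldGuid) VALUES (?)", [(g,) for g in guids])
conn.executemany(
    "INSERT INTO Detail1 (Master1Guid) VALUES (?)",
    [(g,) for g in guids for _ in range(2)],
)

# The fix-up: copy the master's integer ID into each detail row via the GUID.
conn.execute("""
    UPDATE Detail1
    SET Master1ID = (
        SELECT m.ID FROM Master1 AS m WHERE m.OldGuid = Detail1.Master1Guid
    )
""")

# Sanity checks before dropping the GUID columns.
detail_count = conn.execute("SELECT COUNT(*) FROM Detail1").fetchone()[0]
unmatched = conn.execute(
    "SELECT COUNT(*) FROM Detail1 WHERE Master1ID IS NULL"
).fetchone()[0]
mismatched = conn.execute("""
    SELECT COUNT(*)
    FROM Detail1 AS d
    JOIN Master1 AS m ON m.ID = d.Master1ID
    WHERE m.OldGuid <> d.Master1Guid
""").fetchone()[0]
```

Both check counts should be zero before the old GUID columns are dropped.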

SQL Primary Key Generation

Used SQL Server = MySQL
Programming language = irrelevant, but I stick to Java and C#
I have a theoretical question regarding the best way to go about primary key generation for SQL databases which are then used by another program that I write, (let's assume it is not web-based.)
I know that a primary key must be unique, and I prefer primary keys where I can also immediately tell where they are coming from when I see them, either in Eclipse or in the Windows console when I use a database, as well as in relationship tables. For that reason, I generally create my own primary key as an alphanumeric string unless a specific unique value is available, such as an ISBN or SS number. For a table Actors, a primary key could then look like a1, and in a table Movies m1020 (assuming titles are not unique, such as different versions of the movie 'Return to Witch Mountain').
So my question then is, how is a primary key best generated (in my program or in the DB itself as a procedure)? For such a scheme, is it best to use two columns, one with a constant string such as 'a' for actors and one with a running count? (In that case I need to research how to reference a table whose PK spans multiple columns.) What is the most professional way of handling such a task?
Thank you for your time.
A best practice is to let your DB engine generate the primary key as an auto-increment NUMBER. Alphanumeric strings are not a good choice, even if a plain number seems too "abstract" to you. You then don't have to worry about the primary key in your program (Java, C#, anything else): for each row inserted into your database, a unique primary key is generated automatically.
By the way, with your solution, I'm not sure you handle the case where two rows are inserted simultaneously. Are you sure that your primary key can never be duplicated?
Your first line says:-
SQL Server = MySQL
That's not true. They are different products.
how is a primary key best generated (in my program or in the db itself
as a procedure)?
MySQL generates the primary key values for you when the primary key column is declared with AUTO_INCREMENT. The values are then assigned and incremented automatically on every insert.
If you want an alphanumeric primary key (which I personally would not recommend), you could try something like this:
CREATE TABLE A(
    id INT NOT NULL AUTO_INCREMENT,
    prefix CHAR(30) NOT NULL,
    PRIMARY KEY (id, prefix)
);
I would recommend an integer primary key, as that makes your selects simpler and faster. For MyISAM tables you can create a multi-column index and put the AUTO_INCREMENT field on the secondary column.
For MySQL the best way is to set the AUTO_INCREMENT property on your table's primary key field.
You can get the generated ID afterwards with the LAST_INSERT_ID() function or its Java or C# equivalent.
I don't know why you would use "alphanumeric" values - why not just a plain number?
Anyway, use whatever auto-increment functionality is available in whichever DB-system you are using, and stick with that. Do not create primary keys outside of the DB - you can't know when / how two systems might access the DB at the same time, which could cause problems if the two create the same PK value, and attempt to insert it.
Also, in my view, a PK should just be an ID (in a single column) for a specific row, and nothing more - if you need a field indicating that a record concerns data of type "actor" for instance, then that should be a separate field, and have nothing to do with the primary key (why would it?)
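One way to reconcile the answers above with the wish for readable keys such as a1 or m1020 is to keep the stored key a plain auto-increment integer and derive the prefixed label only for display. The sketch below is a Python/sqlite3 stand-in for MySQL (AUTOINCREMENT in place of AUTO_INCREMENT, lastrowid in place of LAST_INSERT_ID()); the PREFIXES mapping, the display_label helper and the column names are assumptions, not something from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Actors (id INTEGER PRIMARY KEY AUTOINCREMENT, name  TEXT NOT NULL);
    CREATE TABLE Movies (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL);
""")

# Hypothetical mapping from table name to the prefix used for display only.
PREFIXES = {"Actors": "a", "Movies": "m"}


def display_label(table, row_id):
    """Build a human-readable label such as 'a1'; it is never stored as the key."""
    return f"{PREFIXES[table]}{row_id}"


actor_id = conn.execute(
    "INSERT INTO Actors (name) VALUES (?)", ("Some Actor",)
).lastrowid

# Two movies with the same title still get distinct database-generated keys.
movie_ids = [
    conn.execute("INSERT INTO Movies (title) VALUES (?)", (title,)).lastrowid
    for title in ("Return to Witch Mountain", "Return to Witch Mountain")
]

actor_label = display_label("Actors", actor_id)
movie_labels = [display_label("Movies", movie_id) for movie_id in movie_ids]
```

Because the database hands out the numbers, two programs inserting at the same time cannot produce the same key, which is the concern raised in the answers.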

Slow Insert Time With Composite Primary Key in Cassandra

I have been working with Cassandra and I have hit a bit of a stumbling block. For how I need to search on data I found that a Composite primary key works great for what I need but the insert times for the record in this Column Family go to the dogs with it and I am not entirely sure why.
Table Definition:
CREATE TABLE exampletable (
clientid int,
filledday int,
filledtime bigint,
id uuid,
...etc...
PRIMARY KEY (clientid, filledday, filledtime, id)
);
clientid = The internal id of the client. filledday = The number of days since 1/1/1900. filledtime = The number of ticks of the day at which the record was received. id = A Guid.
The day and time structure exists because I need to be able to filter by day easily and quickly.
I know Cassandra stores Column Families with composite primary keys quite differently. From what I understand, it will store everything as new columns off of a base row keyed by the main component of the primary key. Is that the reason the inserts would be slow? When I say slow, I mean that if I just have a primary key on id, the insert takes ~200 milliseconds, and with the composite primary key (or any subset of it; I tried just clientid and id to the same effect) it takes upwards of 32 seconds for 1000 records. The select times are faster with the composite key table, since with the standard key table I have to apply secondary indexes and use 'ALLOW FILTERING' in order to get the proper records back (I know I could do this in code, but I am dealing with some massive data sets and that will not always be practical or possible).
Am I declaring the Column Family or the Primary Key wrong for what I am trying to do? With all the unlisted, non-primary key columns the table is 37 columns wide; would that be the problem? I am quite stumped at this point. I have not been able to find anything about others having similar problems.
Well, your partition key is the client id, so all writes per client go to one node. If you are writing lots of data per client, you could end up with a hotspot, thus decreasing your overall throughput.
Also, could you give an example of the queries that you run? In Cassandra, the data model always needs to resemble the queries you want to run. If you need "ALLOW FILTERING", then it seems that something is not quite right with your data model. For instance, I don't really see the point of "filledtime" in your PK. If you want to query by time period, just replace your three clustering columns with a TimeUUID column "ts". This would create a wide row, with one column per entry with a unique timestamp, partitioned by client id and clustered by ts.
This allows queries like:
select * from exampletable where clientid = 123 and ts > minTimeuuid('2013-06-18 16:23:00') and ts < minTimeuuid('2013-06-18 16:24:00');
Again, this would depend on the queries you actually need to run.
And lastly, for overall guidance on data modelling, take a look at this eBay tech blog. Reading it helped clear up some things for me.
Hope that helps!

Check for duplicate column values (not a key) in SQL

Is there a way for SQL to enforce unique column values, that are not a primary key to another table?
For instance, say I have TblDog which has the fields:
DogId - Primary Key
DogTag - Integer
DogNumber - varchar
The DogTag and DogNumber fields must be unique, but are not linked to any sort of table.
The only way I can think of involves pulling any records that match the DogTag and pulling any records that match the DogNumber before creating or editing (excluding the current record being updated.) This is two calls to the database before even creating/editing the record.
My question is: is there a way to set SQL to enforce these values to be unique, without setting them as a key, or in Entity Framework (without excessive calls to the DB)?
I understand that I could group the two calls in one, but I need to be able to inform the user exactly which field has been duplicated (or both).
Edit: The database is SQL Server 2008 R2.
As MilkywayJoe suggests, use unique key constraints in the SQL database. These are checked during inserts and updates.
ALTER TABLE TblDog ADD CONSTRAINT U_DogTag UNIQUE(DogTag)
and
ALTER TABLE TblDog ADD CONSTRAINT U_DogNumber UNIQUE(DogNumber)
I'd suggest setting unique constraints/indexes to prevent duplicate entries.
ALTER TABLE TblDog ADD CONSTRAINT U_DogTag UNIQUE(DogTag)
CREATE UNIQUE INDEX idxUniqueDog
ON TblDog (DogTag, DogNumber)
It doesn't appear as though Entity Framework supports it (yet), but it was on the cards. It looks like you are going to need to do this directly in the database using unique constraints, as mentioned in the comments.
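To illustrate how unique constraints can also tell the caller which field was duplicated, here is a small Python sketch using sqlite3 as a stand-in for SQL Server 2008 R2. It relies on SQLite's error text, which names the offending column; in SQL Server the violation message names the constraint (for example U_DogTag) instead, so the parsing would differ. The add_dog helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TblDog ("
    " DogId INTEGER PRIMARY KEY,"
    " DogTag INTEGER NOT NULL UNIQUE,"   # each column is unique on its own
    " DogNumber TEXT NOT NULL UNIQUE)"
)


def add_dog(dog_tag, dog_number):
    """Insert a dog; return None on success or the name of a duplicated column."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO TblDog (DogTag, DogNumber) VALUES (?, ?)",
                (dog_tag, dog_number),
            )
        return None
    except sqlite3.IntegrityError as exc:
        # SQLite reports e.g. "UNIQUE constraint failed: TblDog.DogTag";
        # keep only the column name after the last dot.
        return str(exc).rsplit(".", 1)[-1]


first = add_dog(100, "A-1")        # succeeds
dup_tag = add_dog(100, "B-2")      # DogTag 100 already exists
dup_number = add_dog(200, "A-1")   # DogNumber "A-1" already exists
```

The database reports only one violation per failed statement, so when both fields clash, finding the second duplicate still needs a follow-up check.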

Using a hash as a primary key?

I have a requirement to store the list of services for multiple computers. I thought I would create one table to hold a list of all possible services, a table for all possible computers, and then a table to link a service to a computer.
To keep the full services list unique, I was thinking I could use a hash of the executable as the primary key for the service, but I'm not sure if there would be any downsides to this (note that the hashing is only for identification, not for any security purpose). Rather than using a binary field as the primary/foreign key, I was thinking of storing the value as a base64-encoded SHA-512 in an nvarchar(88). Something similar to this:
CREATE TABLE Services
(
ServiceHash nvarchar(88) NOT NULL,
ServiceName nvarchar(256) NOT NULL,
ServiceDescription nvarchar(256),
PRIMARY KEY (ServiceHash)
)
Are there any inherent problems with this solution? (I will be using a SQL 2008 database and generally accessing it via C#.Net.)
The problem is that a hash is by definition NOT UNIQUE. A collision is unlikely, but it IS possible. As a result, you cannot rely on the hash alone, which makes the hash-as-ID approach a dead end.
Use a normal ID field, use a unique constraint with index on the ServiceName.
From a performance point of view, having a non-incremental primary key would cause your clustered index to get fragmented rather quickly.
I recommend either:
Use an INT or BIGINT surrogate PK, with auto-increment.
Use a sequential GUID as a PK. Not as fast for indexing as an INT but incremental, therefore low fragmentation in time.
You can then play with non-clustered indexes on your other columns, including the one storing the hashes. Being VARCHAR you can also full-text index it and then do an exact matching when looking for a specific hash.
But, if possible, use a numerical hash instead and make a non-clustered index on it.
And of course, consider what @TomTom mentioned.
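Putting the two answers together, a minimal sketch might keep an auto-increment surrogate key and store the hash in a separate column with a unique index. The example below uses Python with sqlite3 as a stand-in for SQL Server 2008; the ServiceId column and the service_hash and get_or_add_service helpers are assumptions made for illustration. It also shows that a base64-encoded SHA-512 digest is 88 characters long, matching the nvarchar(88) in the question.

```python
import base64
import hashlib
import sqlite3


def service_hash(executable_bytes):
    """Base64-encoded SHA-512 of the executable's bytes (64-byte digest)."""
    digest = hashlib.sha512(executable_bytes).digest()
    return base64.b64encode(digest).decode("ascii")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Services ("
    " ServiceId INTEGER PRIMARY KEY AUTOINCREMENT,"  # surrogate key used by foreign keys
    " ServiceHash TEXT NOT NULL UNIQUE,"             # hash kept only for lookups
    " ServiceName TEXT NOT NULL)"
)


def get_or_add_service(name, executable_bytes):
    """Return the integer key for this executable, inserting a row if needed."""
    h = service_hash(executable_bytes)
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO Services (ServiceHash, ServiceName) VALUES (?, ?)",
            (h, name),
        )
    row = conn.execute(
        "SELECT ServiceId FROM Services WHERE ServiceHash = ?", (h,)
    ).fetchone()
    return row[0]


hash_length = len(service_hash(b"fake executable A"))
id_a = get_or_add_service("svc-a", b"fake executable A")
id_a_again = get_or_add_service("svc-a", b"fake executable A")
id_b = get_or_add_service("svc-b", b"fake executable B")
```

With this layout the link table between computers and services references the small integer ServiceId, and the hash is only an indexed lookup column.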
