Using a hash as a primary key?


I have a requirement to store the list of services for multiple computers. I thought I would create one table to hold a list of all possible services, a table for all possible computers, and then a table to link a service to a computer.
To keep the full services list unique, I was thinking I could use a hash of the executable as the primary key for the service, but I'm not sure if there would be any downsides to this (note that the hashing is only for identification, not for any type of security purpose). Rather than using a binary field as the primary/foreign key, I was thinking I would store the value as a Base64-encoded SHA-512 hash in an nvarchar(88). Something similar to this:
CREATE TABLE Services
(
ServiceHash nvarchar(88) NOT NULL,
ServiceName nvarchar(256) NOT NULL,
ServiceDescription nvarchar(256),
PRIMARY KEY (ServiceHash)
)
Are there any inherent problems with this solution? (I will be using a SQL Server 2008 database and generally accessing it via C#/.NET.)
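For reference, the nvarchar(88) width can be checked with a short sketch. This is an illustration in Python rather than the asker's C#, and the service_key helper and the placeholder bytes are made up for the example: a SHA-512 digest is 64 raw bytes, and Base64 turns those into 88 ASCII characters, including two padding characters.

```python
import base64
import hashlib


def service_key(executable_bytes: bytes) -> str:
    """Return the Base64 text of the SHA-512 digest of the given bytes.

    Hypothetical helper for illustration; in practice the input would be
    the contents of the service executable.
    """
    digest = hashlib.sha512(executable_bytes).digest()  # 64 raw bytes
    return base64.b64encode(digest).decode("ascii")


raw_digest_length = len(hashlib.sha512(b"placeholder executable contents").digest())
key = service_key(b"placeholder executable contents")
key_length = len(key)  # length of the Base64 text form
```

Since Base64 output is plain ASCII, a non-Unicode column of the same length, or a binary(64) column holding the raw digest, would store the same information in fewer bytes than nvarchar(88), which uses two bytes per character.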

The problem is that a hash is by definition NOT UNIQUE. A collision is unlikely, but it IS possible. As a result, you cannot rely on the hash alone to identify a row, which makes the hash as an ID a dead end.
Use a normal ID field, and put a unique constraint (with its index) on ServiceName.
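A minimal sketch of that layout, using Python's built-in sqlite3 module as a stand-in for SQL Server (the ServiceId column name and the sample values are assumptions for illustration): the surrogate key identifies the row, the unique constraint rejects a duplicate service name, and the hash is just an ordinary column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Services (
        ServiceId   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        ServiceName TEXT NOT NULL UNIQUE,               -- uniqueness enforced here
        ServiceHash TEXT NOT NULL                       -- plain attribute, not a key
    )
    """
)


def try_insert(name: str, service_hash: str) -> bool:
    """Insert a row and report whether the database accepted it."""
    try:
        conn.execute(
            "INSERT INTO Services (ServiceName, ServiceHash) VALUES (?, ?)",
            (name, service_hash),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first_accepted = try_insert("Spooler", "hash-a")
duplicate_name_accepted = try_insert("Spooler", "hash-b")  # same name: rejected
same_hash_accepted = try_insert("Themes", "hash-a")        # same hash: still allowed
row_count = conn.execute("SELECT COUNT(*) FROM Services").fetchone()[0]
```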

From a performance point of view, having a non-incremental primary key would cause your clustered index to get fragmented rather quickly.
I recommend either:
Use an INT or BIGINT surrogate PK, with auto-increment.
Use a sequential GUID as a PK. It is not as fast to index as an INT, but it is incremental and therefore fragments little over time.
You can then play with non-clustered indexes on your other columns, including the one storing the hashes. Being VARCHAR you can also full-text index it and then do an exact matching when looking for a specific hash.
But, if possible, use a numerical hash instead and make a non-clustered index on it.
And of course, consider what @TomTom mentioned.

Related

SQL Primary Key Generation

Used SQL Server = MySQL
Programming language = irrelevant, but I stick to Java and C#
I have a theoretical question regarding the best way to go about primary key generation for SQL databases which are then used by another program that I write, (let's assume it is not web-based.)
I know that a primary key must be unique, and I prefer primary keys where I can immediately tell where they come from when I see them, whether in Eclipse or a Windows console when I use a database, or in relationship tables. For that reason, I generally create my own primary key as an alphanumeric string unless a specific unique value is available, such as an ISBN or Social Security number. For a table Actors, a primary key could then look like a1, and in a table Movies like m1020 (assuming titles are not unique, such as different versions of the movie 'Return to Witch Mountain').
So my question is: how is a primary key best generated (in my program or in the db itself as a procedure)? For such a scheme, is it best to use two columns, one with a constant string such as 'a' for actors and one with a running count? (In that case I need to research how to reference a table whose PK spans multiple columns.) What is the most professional way of handling such a task?
Thank you for your time.

A best practice is to let your DB engine generate the primary key as an auto-increment NUMBER. Alphanumeric strings are not a good choice, even if a plain number seems too "abstract" to you. You then don't have to worry about the primary key in your program (Java, C#, or anything else); for each row inserted into your database, a unique primary key is generated automatically.
By the way, with your solution, I'm not sure you handle the case where two rows are inserted simultaneously... Are you sure that your primary key can never be duplicated?

Your first line says:-
SQL Server = MySQL
That's not true. They are different products.
how is a primary key best generated (in my program or in the db itself as a procedure)?
MySQL generates primary key values for you when you declare the column with the AUTO_INCREMENT attribute and make it (part of) the primary key. The values are then generated and incremented automatically.
If you want your primary key to be alphanumeric (which I personally would not recommend), then you may try something like this:
CREATE TABLE A(
id INT NOT NULL AUTO_INCREMENT,
prefix CHAR(30) NOT NULL,
PRIMARY KEY (id, prefix)
);
I would recommend using an integer primary key, as that makes your selects a bit simpler and faster. For MyISAM tables you can create a multi-column index and put the AUTO_INCREMENT field on the secondary column.

For MySQL there is one best way: set the AUTO_INCREMENT property on your table's primary key field.
You can get the generated id later with the LAST_INSERT_ID() function or its Java or C# analog.
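The same pattern can be tried out with Python's built-in sqlite3 module, used here only as a stand-in for MySQL: the database assigns the key and the driver reports it back through the cursor's lastrowid attribute, which plays the role of LAST_INSERT_ID(). The Actors table and the sample names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Actors ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # SQLite's counterpart of AUTO_INCREMENT
    " name TEXT NOT NULL)"
)

# The program never supplies an id; it only reads back what the database generated.
first_id = conn.execute(
    "INSERT INTO Actors (name) VALUES (?)", ("First Actor",)
).lastrowid
second_id = conn.execute(
    "INSERT INTO Actors (name) VALUES (?)", ("Second Actor",)
).lastrowid

stored_ids = [row[0] for row in conn.execute("SELECT id FROM Actors ORDER BY id")]
```

If a display code such as a1 is still wanted, it can be built from the numeric id when the row is shown, without being part of the key.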

I don't know why you would use "alphanumeric" values - why not just a plain number?
Anyway, use whatever auto-increment functionality is available in whichever DB-system you are using, and stick with that. Do not create primary keys outside of the DB - you can't know when / how two systems might access the DB at the same time, which could cause problems if the two create the same PK value, and attempt to insert it.
Also, in my view, a PK should just be an ID (in a single column) for a specific row, and nothing more - if you need a field indicating that a record concerns data of type "actor" for instance, then that should be a separate field, and have nothing to do with the primary key (why would it?)

Is it okay to have an alphanumeric field as a primary key?

I am rewriting a new timesheet application including redesigning database and it will require data migration from Oracle to Oracle.
In the old system the field ‘EmployeeCod’ is the primary key, and it is alphanumeric, e.g. ‘UK001’, ‘UK002’, ‘FR001’, ‘FR002’, ‘US001’. The Employee table is also linked to the timesheet and other tables, where EmpCode is referenced as an FK.
To make the JOINs perform faster in the new system I was thinking about adding a new INT column in the Employee table and set it to PK. (Don't know if it will make any big difference)
-Employee table has about 600 rows.
-Data type of EmpCode is Varchar2(20) in the old DB, which I can reduce to Varchar2(6) in the new system and alter later as the company expands.
I am wondering whether it is better to keep EmpCode as the primary key, which would make migrating the data easier, or whether I should add an INT column.
Someone gave me the following advice in one of my previous threads:
“if you need to create a composite code of AANNN then I'd split this into two: a simple 'Prefix' field of CHAR(2) and an identity field of INT, then turn EmpCode into a computed field that concats the two and stick an index on there that (@Chris)”
I am not sure if this option would work as employee table is linked to other tables as well. (EmpCode is being used as FK in other tables)

If you do add this PK, and also keep the former PK, you (or perhaps your customers) will have some data management issues to deal with. Getting rid of the old PK may not be feasible if there are existing users who will be upgrading to the new database.
If EmployeeCode, the former PK, is used by the users of the data to identify employees, then you will have to add a constraint to make sure that this field is unique. Carrying both codes will wipe out any performance gains you were hoping for.
If it were me, I'd leave well enough alone. The performance gains, if any, will be trivial.

The performance difference will be negligible if the index you're creating on the alphanumeric field is the clustered index for the table. Which, based off of your question is going to be the case, but I wanted to note that for completeness. I say this for two reasons:
A clustered index is the physical order of the table and so when seeking against that index, looking for more data presumably off of the data page in a query, a binary search can be performed against it because it's also physically stored in that order.
A binary search is just about as efficient as you can get, with the exception of a statistical index. I call this out because integer primary keys build statistical indexes, which are as fast a seek as you can get because, mathematically speaking, we know that 2 comes after 1, for example.
So, just keep that in mind when building alphanumeric, or even compound, keys and indexes and trying to compare the difference between them and an integer key. Personally, I prefer to stick with integer primary keys because I have found them to perform better over time during extreme growth.
I hope this helps.

I use alphanumeric primary keys regularly and see absolutely no issues with it. There is no performance issue, you have a wider addressable space, and you can be more expressive/human readable. Integer keys are just a convention.
Add to that the risk you're adding to your project by making a major architectural change over and above the porting issues, and I'd say stick with the existing schema as much as possible.

There will be no performance improvement - in fact, unless you know and can prove/measure that you have a performance problem, changing things "to make them faster" usually leads to pain.
However, there is a concern that your primary key appears to carry meaning - it's a country code, concatenated with a number. What if an employee moves from the US to the UK? What if the UK hires its 1000th employee?
For that reason, I'd refactor the application to use a meaningless primary key; whether it's an INT or a VARCHAR is not hugely relevant.

You do occasionally come across alphanumeric primary keys. Personally, I find they just make life more difficult. If you are able to change it and you want to change it, I would say go ahead; it will make things easier for you later. As for it being an FK, you would need to be careful to write a script that properly updates all the data. One way you can do this is:
Step 1: Create a new int column for the PK and set Identity Insert to true
Step 2: Add a new int column in your child table and then:
Step 3: write an update script like this:
UPDATE childTable C
INNER JOIN parentTable P ON C.oldEmpID = P.oldEmpID
SET C.myNewEmpIDColumn = P.myNewEmpIDColumn
Step 4: Repeat steps 2 & 3 for all child tables
Step 5: Delete all old FK columns
Something like that, and don't forget to back up your current DB first ;)
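The steps above can be rehearsed on a throwaway database before touching the real one. The sketch below uses Python's built-in sqlite3 module as a stand-in for Oracle, with made-up Employee and Timesheet rows and an assumed new column name EmpId. Step 3 is written as a correlated subquery, a portable alternative to the UPDATE ... INNER JOIN form, which as far as I know is MySQL syntax that Oracle does not accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Old schema: alphanumeric code as the parent key and as the child FK.
    CREATE TABLE Employee (EmpCode TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE Timesheet (TimesheetId INTEGER PRIMARY KEY, EmpCode TEXT, Hours REAL);
    INSERT INTO Employee VALUES ('UK001', 'Employee A'), ('FR001', 'Employee B');
    INSERT INTO Timesheet (EmpCode, Hours)
        VALUES ('UK001', 8), ('FR001', 7.5), ('UK001', 6);

    -- Step 1: new integer key on the parent (rowid stands in for an identity value).
    ALTER TABLE Employee ADD COLUMN EmpId INTEGER;
    UPDATE Employee SET EmpId = rowid;

    -- Step 2: new integer column on the child.
    ALTER TABLE Timesheet ADD COLUMN EmpId INTEGER;

    -- Step 3: copy the new key across by matching on the old code.
    UPDATE Timesheet
       SET EmpId = (SELECT e.EmpId FROM Employee e
                     WHERE e.EmpCode = Timesheet.EmpCode);
    """
)

# Sanity checks to run before dropping the old FK columns (step 5).
unmatched = conn.execute(
    "SELECT COUNT(*) FROM Timesheet WHERE EmpId IS NULL"
).fetchone()[0]
mismatched = conn.execute(
    "SELECT COUNT(*) FROM Timesheet t JOIN Employee e ON e.EmpId = t.EmpId"
    " WHERE e.EmpCode <> t.EmpCode"
).fetchone()[0]
```

Both counts should be zero before any old column is removed; a non-zero count means some child rows did not find their parent.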

Approach for primary key generation

What is the best approach when generating a primary key for a table?
That is, when the data received by the database is not injective and can't be used as a primary key.
In the code, what is the best way to manage a primary key for the table rows?
Thanks.

First recommendation: stay away from uniqueidentifier for any primary key. Although there are some interesting, easy ways to generate it client side, it makes it almost impossible to have any useful indexes on the primary key. If I could go back in time and ban uniqueidentifiers from 99% of the places where they have been used, it would have saved more than 3 man-years of DBA/development time in the last 2 years.
Here is what I would recommend: use an INT IDENTITY as the primary key.
create table YourTableName(
pkID int not null identity primary key,
-- ... the rest of the columns declared next
)
where pkID is the name of your primary key column.
This should do what you are looking for.

AUTO_INCREMENT in mysql, IDENTITY in SQL Server..

IDENTITY in SQL Server.
If you need to know what your new ID was while INSERT-ing data, use the OUTPUT clause of the INSERT statement, which puts a copy of the new rows into a table variable.
If for some reason generating unique IDs in SQL is not suitable for you, generate GUIDs in your app. A GUID has a very high level of uniqueness (though it is not strictly guaranteed). SQL Server has a dedicated GUID column type called uniqueidentifier.
http://msdn.microsoft.com/en-us/library/ms187942.aspx

long vs Guid for the Id (Entity), what are the pros and cons

I am building a web application on ASP.NET MVC and I'm choosing between the long and Guid data types for my entities, but I don't know which one is better. Some say that long is much faster. Guid might also have some advantages. Does anybody know?

When GUIDs can be Inappropriate
GUIDs are almost always going to be slower because they are larger. That makes your indexes larger. That makes your tables larger. That means that if you have to scan your tables, either wholly or partially, it will take longer and you will see less performance. This is a huge concern in reporting based systems. For example, one would never use a GUID as a foreign key in a fact table because its length would usually be significant, as fact tables are often partially scanned to generate aggregates.
Also consider whether or not it is appropriate to use a "long". That's an enormously large number. You only need it if you think you might have over 2 BILLION entries in your table at some point. It's rare that I use them.
GUIDs can also be tough to use and debug. Saying, "there's a problem with Customer record 10034, Frank, go check it out" is a lot easier than saying "there's a problem with {2f1e4fc0-81fd-11da-9156-00036a0f876a}..." Ints and longs are also easier to type into queries when you need to.
Oh, and it's not the case that you never get the same GUID twice. It has been known to happen on very large, disconnected systems, so that's something to consider, although I wouldn't design for it in most apps.
When GUIDs can be Appropriate
GUIDs are appropriate when you're working with disconnected systems where entities are created and then synchronized. For example, if someone makes a record in your database on a mobile device and syncs it, or you have entities being created at different branch offices and synced to a central store at night. That's the kind of flexibility they give you.
GUIDs also allow you the ability to associate entities without persisting them to the database, in certain ORM scenarios. Linq to SQL (and I believe the EF) don't have this problem, though there are times you might be forced to submit your changes to the database to get a key.
If you create your GUIDs on the client, they are not sequential, so insert performance could suffer because of page splits on the DB.
My Advice
A lot of stuff to consider here. My vote is to not use them unless you have a compelling use case for them. If performance really is your goal, keep your tables small. Keep your fields small. Keep your DB indexes small and selective.

SIZE:
Long is 8 bytes
Guid is 16 bytes
A GUID has a very high probability of being unique and is well suited to identifying individual records across one or more databases.
A long (identity in the DB) represents a unique record within one table, but records in different tables can share the same ID value, as follows:
TableA: PersonID int, name varchar(50)
TableB: ProductID int, name varchar(50)
SELECT PersonID from TableA where name =''
SELECT ProductID from TableB where name =''
Both can return the same value, but in the case of GUIDs:
TableA: PersonID uniqueidentifier, name varchar(50)
TableB: ProductID uniqueidentifier, name varchar(50)
SELECT PersonID from TableA where name =''
SELECT ProductID from TableB where name =''
the same ID value will rarely, if ever, be returned from two tables.
Have a look here
SQL Server - Guid VS. Long
GUIDs as PRIMARY KEYs and/or the clustering key

Guids make it much easier to create a 'fresh' entity in your API because you simply assign it the value of Guid.NewGuid(). There's no reliance on auto-incremented keys from a database, so this better decouples the Domain Model from the underlying persistence mechanism.
On the downside, if you use a Guid as the clustered index in SQL Server, inserts become expensive because new rows are very rarely added to the end of the table, which causes page splits and leaves the index fragmented, so it needs to be rebuilt often.
Another issue is that if you perform selects from such a database without specifying an explicit ordering, you get out results in an essentially random order.

SQL Server - Guid VS. Long

Up until now I've been using the C# "Guid = Guid.NewGuid();" method to generate a unique ID that can be stored as the ID field in some of my SQL Server database tables using Linq to SQL.
I've been informed that, for indexing reasons, using a GUID is a bad idea and that I should use an auto-incrementing long instead. Will using a long speed up my database transactions? If so, how do I go about generating unique IDs that are of type long?
Regards,

Both have pros and cons, it depends entirely on how you use them that matters.
Right off the bat, if you need identifiers that can work across several databases, you need GUIDs. There are some tricks with Long (manually assigning each database a different seed/increment), but these don't scale well.
As far as indexing goes, Long will give much better insert performance if the index is clustered (by default primary keys are clustered, but this can be modified for your table), since the table does not need to be reorganized after every insert.
As far as concurrent inserts are concerned, however, Long (identity) columns will be slower than GUIDs: identity column generation requires a series of exclusive locks to ensure that only one row gets the next sequential number. In an environment with many users inserting many rows all the time, this can be a performance hit. GUID generation in this situation is faster.
Storage-wise, a GUID takes up twice the space of a Long (16 bytes vs 8). However, whether the extra 8 bytes make a noticeable difference depends on the overall size of your row, which determines how many records fit in one leaf and thus the number of leaves pulled from disk during an average request.

The "Queen of Indexing" - Kim Tripp - basically says it all in her indexing blog posts:
GUIDs as PRIMARY KEYs and/or the clustering key
The clustered index debate continues...
Ever increasing clustering key - the Clustered Index Debate......again!
Basically, her best practices are: an optimal clustering key should be:
unique
small
stable (never changing)
ever-increasing
GUIDs violate the "small" and "ever-increasing" criteria and are thus not optimal.
PLUS: all your clustering keys will be added to each and every single entry in each and every single non-clustered index (as the lookup to actually find the record in the database), thus you want to make them as small as possible (INT = 4 byte vs. GUID = 16 byte). If you have hundreds of millions of rows and several non-clustered indices, choosing an INT or BIGINT over a GUID can make a major difference - even just space-wise.
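To put a rough number on that, here is a back-of-the-envelope calculation. The 200 million rows and 4 non-clustered indexes are made-up figures, and the estimate counts only the copy of the clustering key kept in each non-clustered index entry, ignoring page overhead and the other index columns.

```python
def clustering_key_bytes_in_indexes(rows: int, nonclustered_indexes: int, key_bytes: int) -> int:
    """Bytes spent repeating the clustering key in every non-clustered index entry."""
    return rows * nonclustered_indexes * key_bytes


ROWS = 200_000_000          # hypothetical table size
NONCLUSTERED_INDEXES = 4    # hypothetical index count

int_cost = clustering_key_bytes_in_indexes(ROWS, NONCLUSTERED_INDEXES, 4)    # INT key
guid_cost = clustering_key_bytes_in_indexes(ROWS, NONCLUSTERED_INDEXES, 16)  # GUID key
extra_bytes = guid_cost - int_cost
extra_gigabytes = extra_bytes / 1_000_000_000
```

With these assumed figures the GUID key costs about 9.6 GB more than the INT key in index entries alone, before counting the clustered index itself.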
Marc

have a look at this
Is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?

A long (bigint in SQL Server) is 8 bytes and a Guid is 16 bytes, so you are halving the number of bytes SQL Server has to compare when doing a lookup.
For generating a long, use IDENTITY(1,1) when you create the field in the database.
so either using create table or alter table:
Field_NAME BIGINT NOT NULL PRIMARY KEY IDENTITY(1,1)
See comments for posting Linq to sql

Use GUIDs when you need to consider import/export to multiple databases. GUIDs are often easier to use than columns with the IDENTITY attribute when working with a dataset of multiple child relationships. This is because you can randomly generate GUIDs in code, disconnected from the database, and then submit all changes at once. When GUIDs are generated properly, they are insanely hard to duplicate by chance. With identity columns, you often have to do an initial insert of a parent row and query for its new identity before adding child data. You then have to update all child records with the new parent identity before committing them to the database. The same goes for grandchildren and so on down the hierarchy. It builds up to a lot of work that seems unnecessary and mundane. You can do something similar to GUIDs by coming up with random integers without the IDENTITY specification, but the chance of collision increases greatly as you insert more records over time. (Guid.NewGuid() is similar to a random Int128 - which doesn't exist yet).
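As a small illustration of the disconnected scenario, the sketch below builds a parent with its children entirely in memory, with every key assigned up front. It is written in Python with uuid.uuid4() standing in for Guid.NewGuid(), and the Order and OrderLine classes are hypothetical names chosen for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class OrderLine:
    order_id: uuid.UUID                 # foreign key, known before any insert
    description: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class Order:
    customer: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    lines: list = field(default_factory=list)

    def add_line(self, description: str) -> OrderLine:
        """Attach a child that already points at this order's key."""
        line = OrderLine(order_id=self.id, description=description)
        self.lines.append(line)
        return line


order = Order(customer="Example Customer")
order.add_line("first item")
order.add_line("second item")

# Everything needed for the INSERT statements exists without a database round trip.
all_ids = [order.id] + [line.id for line in order.lines]
```

With an identity column, the children could not be given their parent key until the parent row had been inserted and its generated id read back.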
I use Byte (TinyInt), Int16 (SmallInt), Int32/UInt16 (Int), Int64/UInt32 (BigInt) for small lookup lists that do not change or data that does not replicate between multiple databases. (Permissions, Application Configuration, Color Names, etc.)
I imagine the indexing takes just as long to query against regardless if you are using a guid or a long. There are usually other fields in tables that are indexed that are larger than 128 bits anyway (user names in a user table for example). The difference between Guids and Integers is the size of the index in memory, as well as time populating and rebuilding indexes. The majority of database transactions is often reading. Writing is minimal. Concentrate on optimizing reading from the database first, as they are usually made of joined tables that were not optimized properly, improper paging, or missing indexes.
As with anything, the best thing to do is to prove your point. Create a test database with two tables: one with a primary key of integers/longs, and the other with a GUID. Populate each with N million rows. Monitor the performance of each during the CRUD operations (create, read, update, delete). You may find that there is a performance hit, but an insignificant one.
Servers often run on boxes without debugging environments and other applications taking up CPU, Memory, and I/O of hard drive (especially with RAID). A development environment only gives you an idea of performance.

Consider creating sequential GUID from .NET application:
http://dotnet-snippets.de/dns/sequential-guid-SID998.aspx
What are the performance improvement of Sequential Guid over standard Guid?
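For a rough idea of what a sequential GUID generated in the application can look like, here is a sketch of the so-called COMB approach in Python. It is not the code from the linked snippet and not what SQL Server's NEWSEQUENTIALID() does: it simply fills the first 10 bytes with random data and the last 6 bytes with a millisecond timestamp. It relies on the assumption that SQL Server gives the last 6-byte group the most weight when ordering uniqueidentifier values, which is worth verifying on your own server. The comb_guid name and the timestamp_ms parameter are made up for the example.

```python
import os
import time
import uuid
from typing import Optional


def comb_guid(timestamp_ms: Optional[int] = None) -> uuid.UUID:
    """Build a GUID whose last 6 bytes grow with time (COMB-style sketch).

    The value is not a standards-compliant version 4 UUID, because the
    version and variant bits are left random.
    """
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    random_part = os.urandom(10)                                    # bytes 0-9
    time_part = (timestamp_ms & 0xFFFFFFFFFFFF).to_bytes(6, "big")  # bytes 10-15
    return uuid.UUID(bytes=random_part + time_part)


earlier = comb_guid(1_700_000_000_000)
later = comb_guid(1_700_000_000_001)
generated_now = comb_guid()
```

Two values created in the same millisecond still differ in their random part, but their relative order is then arbitrary.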

You can debate GUID or identity all day. I prefer the database to generate the unique value with an identity. If you merge data from multiple databases, add another column (to identify the source database, possibly a tinyint or smallint) and form a composite primary key.
If you do go with an identity, be sure to pick the right datatype, based on number of expected keys you will generate:
bigint - 8 Bytes - max positive value: 9,223,372,036,854,775,807
int - 4 Bytes - max positive value: 2,147,483,647
Note that the "number of expected keys" is different from the number of rows. If you mainly add and keep rows, you may find that an INT is enough, with over 2 billion unique keys. I'll bet your table won't get that big. However, if you have a high-volume table where you keep adding and removing rows, your row count may be low, but you'll go through keys fast. You should do some calculations to see how long it would take to go through the INT's 2 billion keys. If you won't use them up any time soon, go with INT; otherwise double the key size and go with BIGINT.
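That calculation is short enough to sketch. The rate of 100 new keys per second is an assumed figure for illustration, and the function name is made up.

```python
INT_MAX = 2_147_483_647
BIGINT_MAX = 9_223_372_036_854_775_807
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_until_exhausted(max_key: int, keys_per_second: float) -> float:
    """Years until an identity starting at 1 reaches max_key at a steady rate."""
    return max_key / keys_per_second / SECONDS_PER_YEAR


ASSUMED_KEYS_PER_SECOND = 100  # hypothetical sustained insert rate

int_years = years_until_exhausted(INT_MAX, ASSUMED_KEYS_PER_SECOND)
bigint_years = years_until_exhausted(BIGINT_MAX, ASSUMED_KEYS_PER_SECOND)
```

At that assumed rate an INT identity lasts roughly eight months, while a BIGINT lasts billions of years, so the choice comes down to how fast keys are consumed, not how many rows remain in the table.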
