I am rewriting a new timesheet application including redesigning database and it will require data migration from Oracle to Oracle.
In the old system field ‘EmployeeCod’ is a Primary Key and it is in Alphanumeric form i.e. ‘UK001’, ‘UK002’,‘FR001’,’FR002’, ‘US001’ . Employee table is also linked to timesheet and other tables where the EmpCode is being referred as a FK.
To make the JOINs perform faster in the new system I was thinking about adding a new INT column in the Employee table and set it to PK. (Don't know if it will make any big difference)
-Employee table has about 600 rows.
-Data type of EmpCode is Varchar2(20) in old DB which I can reduce to Varchar2(6) in the new system and alter it later as company expends.
I am wondering if it is better to keep the EmpCode as a Primary Key which will make things easier in migrating data or should I add a INT column?
Someone has given me following advise in one of my previous thread:
“if you need to create a composite code of AANNN then I'd split this into two: a simple 'Prefix' field of CHAR(2) and an identity field of INT, then turn EmpCode into a computed field that concats the two and stick an index on there that (#Chris)”
I am not sure if this option would work as employee table is linked to other tables as well. (EmpCode is being used as FK in other tables)
n
If you do add this PK, and also keep the former PK, you will have some data management issues to deal with. Or perhaps your customers. Getting rid of the old PK may not be feasable if there are existing users who will be upgrading to the new database.
If EmployeeCode, the former PK is used by the users of the data to identify Employees, then you will have to add a constraint to make sure that this field is unique. Carrying both codes will wipe out any performance gains you were hoping for.
If it were me, I'd leave well enough alone. The performance gains, if any, will be trivial.
The performance difference will be negligible if the index you're creating on the alphanumeric field is the clustered index for the table. Which, based off of your question is going to be the case, but I wanted to note that for completeness. I say this for two reasons:
A clustered index is the physical order of the table and so when seeking against that index, looking for more data presumably off of the data page in a query, a binary search can be performed against it because it's also physically stored in that order.
A binary search is just about as efficient as you can get, lest we forget though a statistical index. I call this out because integer primary keys build statistical indexes which are as fast a seek as you can get because mathmatically speaking we know 2 comes after 1 for example.
So, just keep that in mind when building alphanumeric, or even compound, keys and indexes and trying to compare the difference between them and an integer key. Personally, I prefer to stick with integer primary keys because I have found them to perform better over time during extreme growth.
I hope this helps.
I use alphanumeric primary keys regularly and see absolutely no issues with it. There is no performance issue, you have a wider addressable space, and you can be more expressive/human readable. Integer keys are just a convention.
Add to that the risk you're adding to you project by adding a major architectural change over and above the porting issues, I'd say stick with the existing schema as much as possible.
There will be no performance improvement - in fact, unless you know and can prove/measure that you have a performance problem, changing things "to make them faster" usually leads to pain.
However, there is a concern that your primary key appears to carry meaning - it's a country code, concatenated with a number. What if an employee moves from the US to the UK? What if the UK hires its 1000th employee?
For that reason, I'd refactor the application to use a meaningless primary key; whether it's an INT or a VARCHAR is not hugely relevant.
You do occassionally come across alphanumeric primary keys.. personally I find it just makes life more difficult.. if you are able to change it and you want to change it, I would say go ahead.. it will make things easier for you later. As for it being an FK, you would need to be careful to write a script to properly update all the data. One way you can do this is:
Step 1: Create a new int column for the PK and set Identity Insert to true
Step 2: Add a new int column in your child table and then:
Step 3: write an update script like this:
UPDATE childTable C
INNER JOIN parentTable P ON C.oldEmpID = P.oldEmpID
SET C.myNewEmpIDColumn = P.myNewEmpIDColumn
Step 4: Repeat steps 2 & 3 for all child tables
Step 5: Delete all old FK columns
Something like that and don't forget to backup your current DB first ;)
Related
I have been working with Cassandra and I have hit a bit of a stumbling block. For how I need to search on data I found that a Composite primary key works great for what I need but the insert times for the record in this Column Family go to the dogs with it and I am not entirely sure why.
Table Definition:
CREATE TABLE exampletable (
clientid int,
filledday int,
filledtime bigint,
id uuid,
...etc...
PRIMARY KEY (clientid, filledday, filledtime, id)
);
clientid = The internal id of the client. filledday = The number of days since 1/1/1900. filledtime = The number of ticks of the day at which the record was recived. id = A Guid.
The day and time structure exists because I need to be able to filter by day easily and quickly.
I know Cassandra stores Column Families with composite primary keys quite differently. From what I understand it will store the everything as new columns off of a base row of the main component of the primary key. Is that the reason the inserts would be slow? When I say slow I mean that if I just have a primary key on id the insert will take ~200 milliseconds and with the composite primary key (or any subset of it, I tried just clientid and id to the same effect) it will take upwards of 32 seconds for 1000 records. The Select times are faster out of the composite key table since I have to apply secondary indexes and use 'ALLOW FILTERING' in order to get the proper records back with the standard key table (I know I could do this in code but the concern is that I am dealing with some massive data sets and that will not always be practical or possible).
Am I declaring the Column Family or the Primary Key wrong for what I am trying to do? With all the unlisted, non-primary key columns the table is 37 columns wide, would that be the problem? I am quite stumped at this point. I have not be able to really find anything about others having similar problems.
Well, your partition key is the client id, so all writes per client go to one node. If you are writing lots of data per client, you could end up with a hotspot, thus decreasing your overall throughput.
Also, could you give an example of the queries that you run? In Cassandra, the data model always need to resemble the queries you want to run. If you need to "allow filtering", then it seems that something is not quite right with your data model. For instance, I don't really see the point of "filledtime" in your PK. If you want to query by time period, just replace your three column keys with a TimeUUID column "ts". This would create a wide row, with one column per entry with a unique timestam, clustered/partitioned per client id.
This allows queries like:
select * from exampletable where clientid = 123 and ts > minTimeuuid('2013-06-18 16:23:00') and ts < minTimeuuid('2013-06-18 16:24:00');
Again, this would depend on the queries you actually need to run.
And lastly, for overall guidance on data modelling, take a look into this ebay tech blog. Reading it helped me cleared up some things for me.
Hope that helps!
I have an application running that has entities that might be: CustomerType1, CustomerType2, and CustomerType3.
All three CustomerType entities might have completely different information, but they all have a CustomerID field which is an integer.
I am trying to figure out how to set things up so that no matter which type is created, the CustomerID will always be unique across all three types, and remain an integer.
For example, creating the following would result in the following CustomerID
CustomerType1 - 1
CustomerType1 - 2
CustomerType1 - 3
CustomerType2 - 4
CustomerType1 - 5
CustomerType3 - 6
CustomerType1 - 7
What is the best way to approach this?
2 possible approaches:
Use a single table for all of your customer types with a "discriminator" field to track each customer type, and include the CustomerID as its identity.
Use an external table to manage creating the CustomerID as an identity field.
The first approach has the advantage of having direct support in the Entity Framework for as outlined in the following tutorial:
http://www.asp.net/mvc/tutorials/getting-started-with-ef-using-mvc/implementing-inheritance-with-the-entity-framework-in-an-asp-net-mvc-application
Note that although your customer types may contain different data, this approach could end up being cheaper in the long run in terms of scalability despite the wasted database space. In your business layer, the customer types could simply ignore the fields that don't pertain to them.
The second approach would probably be best suited to adding onto existing applications that are too difficult to change. In the long run, there is more work involved with keeping track of the IDs this way. For one, your business layer will need to fetch an ID from one table in order to insert into another table, which can be expensive for large datasets. Depending on the requirements of the business layer, there may also be scenarios where you have to discard an unused CustomerID and it would simply not exist in the system (you would skip from CustomerID 58 to CustomerID 60 for example).
Your approach is similar to having the same entity type in network databases, with unique IDs.
Usually, many developers use a UID or related type. Its a type that has from 64, 128, or more digits, and its generated randomly and automatically. If you use the same database in different machines, its almost imposible to get the same value.
Some databases store that value as a string, instead of an integer.
If you only had a single table with integer keys, how do you generate the primary key value ? Automatic ? Do you generate the value in code, and later assigned to the primary key field ?
Solution 1
If your database supports U.I.D. or O.I.D. or Unique identifiers, that are generated automatically, as an integers, use them.
Solution 2
If your database supports U.I.D. or O.I.D. as varchar / string, or the database engine, or program that you use, have a function that generates U.I.D., you may use that function, cast the result value from string to integer, (stripping separators like "-"), and stored in an integer primary key field.
Summary
Many developers prefer to let the database engine generate the primary key automatically, when inserting a new record. In cases, likes this, its better, to generate the primary key in code, and assign it directly. Since you are using "Entity Framework", I ignore how does that library handles primary keys.
Cheers.
I am doing a web-application on asp.net mvc and I'm choosing between the long and Guid data type for my entities, but I don't know which one is better. Some say that long is much faster. Guid also might have some advantages. Anybody knows ?
When GUIDs can be Inappropriate
GUIDs are almost always going to be slower because they are larger. That makes your indexes larger. That makes your tables larger. That means that if you have to scan your tables, either wholly or partially, it will take longer and you will see less performance. This is a huge concern in reporting based systems. For example, one would never use a GUID as a foreign key in a fact table because its length would usually be significant, as fact tables are often partially scanned to generate aggregates.
Also consider whether or not it is appropriate to use a "long". That's an enormously large number. You only need it if you think you might have over 2 BILLION entries in your table at some point. It's rare that I use them.
GUIDs can also be tough to use and debug. Saying, "there's a problem with Customer record 10034, Frank, go check it out" is a lot easier than saying "there's a problem with {2f1e4fc0-81fd-11da-9156-00036a0f876a}..." Ints and longs are also easier to type into queries when you need to.
Oh, and it's not the case that you never get the same GUID twice. It has been known to happen on very large, disconnected systems, so that's something to consider, although I wouldn't design for it in most apps.
When GUIDs can be Appropriate
GUIDs are the appropriate when you're working with disconnected systems where entities are created and then synchronized. For example, if someone makes a record in your database on a mobile device and syncs it, or you have entities being created at different branch offices and synced to a central store at night. That's the kind of flexibility they give you.
GUIDs also allow you the ability to associate entities without persisting them to the database, in certain ORM scenarios. Linq to SQL (and I believe the EF) don't have this problem, though there are times you might be forced to submit your changes to the database to get a key.
If you create your GUIDs on the client, it's possible that since the GUIDs you create are not sequential, that insert performance could suffer because of page splits on the DB.
My Advice
A lot of stuff to consider here. My vote is to not use them unless you have a compelling use case for them. If performance really is your goal, keep your tables small. Keep your fields small. Keep your DB indexes small and selective.
SIZE:
Long is 8 bytes
Guid is 16 bytes
GUID has definitely high probability for going to be unique and is best to use for identification of individual records in a data base(s).
long (Identity in DB), might represent a unique record in a table but you might have records represented by same ID (Identity), in one or more different table like as follows:
TableA: PersonID int, name varchar(50)
TableB: ProductID int, name varchar(50)
SELECT PersonID from TableA where name =''
SELECT ProductID from TableB where name =''
both can return same value, but in case of GUID :
TableA: PersonID uniqueidentifier, name varchar(50)
TableB: ProductID uniqueidentifier, name varchar(50)
SELECT PersonID from TableA where name =''
SELECT ProductID from TableB where name ='
you can rarely have same value as id returned from two tables
Have a look here
SQL Server - Guid VS. Long
GUIDs as PRIMARY KEYs and/or the clustering key
Guids make it much easier to create a 'fresh' entity in your API because you simply assign it the value of Guid.NewGuid(). There's no reliance on auto-incremented keys from a database, so this better decouples the Domain Model from the underlying persistence mechanism.
On the downside, if you use a Guid as the Clustered Index in SQL Server, inserts become expensive because new rows are very rarely added to the end of the table, so the index needs to be rebuilt very often.
Another issue is that if you perform selects from such a database without specifying an explicit ordering, you get out results in an essentially random order.
I have several tables within my database that contains nothing but "metadata".
For example we have different grouptypes, contentItemTypes, languages, ect.
the problem is, if you use automatic numbering then it is possible that you create gaps.
The id's are used within our code so, the number is very important.
Now I wonder if it isn't better not to use autonumbering within these tables?
Now we have create the row in the database first, before we can write our code. And in my opinion this should not be the case.
What do you guys think?
I would use an identity column as you suggest to be your primary key(surrogate key) and then assign your you candidate key (identifier from your system) to be a standard column but apply a unique constraint to it. This way you can ensure you do not insert duplicate records.
Make sense?
if these are FK tables used just to expand codes into a description or contain other attributes, then I would NOT use an IDENTITY. Identity are good for ever inserting user data, metadata tables are usually static. When you deploy a update to your code, you don't want to be suprised and have an IDENTITY value different than you expect.
For example, you add a new value to the "Languages" table, you expect the ID will be 6, but for some reason (development is out of sync, another person has not implemented their next language type, etc) the next identity you get is different say 7. You then insert or convert a bunch of rows having using Language ID=6 which all fail becuase it does not exist (it is 7 iin the metadata table). Worse yet, they all actuall insert or update because the value 6 you thought was yours was already in the medadata table and you now have a mix of two items sharing the same 6 value, and your new 7 value is left unused.
I would pick the proper data type based on how many codes you need, how often you will need to look at it (CHARs are nice to look at for a few values, helps with memory).
for example, if you only have a few groups, and you'll often look at the raw data, then a char(1) may be good:
GroupTypes table
-----------------
GroupType char(1) --'M'=manufacturing, 'P'=purchasing, 'S'=sales
GroupTypeDescription varchar(100)
however, if there are many different values, then some form of an int (tinyint, smallint, int, bigint) may do it:
EmailTypes table
----------------
EmailType smallint --2 bytes, up to 32k different positive values
EmailTypeDescription varchar(100)
If the numbers are hardcoded in your code, don't use identity fields. Hardcode them in the database as well as they'll be less prone to changing because someone scripted a database badly.
I would use an identity column as the primary key also just for simplicity sake of inserting the records into the database, but then use a column for type of metadata, I call mine LookUpType(int), as well as columns for LookUpId (int value in code) or value in select lists, LookUpName(string), and if those values require additional settings so to speak use extra columns. I personally use two extras, LookUpKey for hierarchical relations, and LookUpValue for abbreviations or alternate values of LookUpName.
Well, if those numbers are important to you because they'll be in code, I would probably not use an IDENTITY.
Instead, just make sure you use a INT column and make it the primary key - in that case, you will have to provide the ID's yourself, and they'll have to be unique.
Up until now i've been using the C# "Guid = Guid.NewGuid();" method to generate a unique ID that can be stored as the ID field in some of my SQL Server database tables using Linq to SQL.
I've been informed that for indexing reasons, using a GUID is a bad idea and that I should use an auto-incrementing Long instead. Will using a long speed up my database transactions? If so, how do I go about generating unique ID's that are of type Long?
Regards,
Both have pros and cons, it depends entirely on how you use them that matters.
Right off the bat, if you need identifiers that can work across several databases, you need GUIDs. There are some tricks with Long (manually assigning each database a different seed/increment), but these don't scale well.
As far as indexing goes, Long will give much better insert performance if the index is clustered (by default primary keys are clustered, but this can be modified for your table), since the table does not need to be reorganized after every insert.
As far as concurrent inserts are concerned however, Long (identity) columns will be slower then GUID - identity column generation requires a series of exclusive locks to ensure that only one row gets the next sequential number. In an environment with many users inserting many rows all the time, this can be a performance hit. GUID generation in this situation is faster.
Storage wise, a GUID takes up twice the space of a Long (8 bytes vs 16). However it depends on the overall size of your row if 8 bytes is going to make a noticable difference in how many records fit in one leaf, and thus the number of leaves pulled from disk during an average request.
The "Queen of Indexing" - Kim Tripp - basically says it all in her indexing blog posts:
GUIDs as PRIMARY KEYs and/or the clustering key
The clustered index debate continues...
Ever increasing clustering key - the Clustered Index Debate......again!
Basically, her best practices are: an optimal clustering key should be:
unique
small
stable (never changing)
ever-increasing
GUID's violate the "small" and "ever-increasing" and are thus not optimal.
PLUS: all your clustering keys will be added to each and every single entry in each and every single non-clustered index (as the lookup to actually find the record in the database), thus you want to make them as small as possible (INT = 4 byte vs. GUID = 16 byte). If you have hundreds of millions of rows and several non-clustered indices, choosing an INT or BIGINT over a GUID can make a major difference - even just space-wise.
Marc
have a look at this
Is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?
A long (big int in sql server) is 8 bytes and a Guid is 16 bytes, so you are halving the number of the bytes sql server has to compare when doing a look up.
For generating a long, use IDENTITY(1,1) when you create the field in the database.
so either using create table or alter table:
Field_NAME BIGINT NOT NULL PRIMARY KEY IDENTITY(1,1)
See comments for posting Linq to sql
Use guids when you need to consider import/export to multiple databases. Guids are often easier to use than columns specifying the IDENTITY attribute when working with a dataset of multiple child relationships. this is because you can randomly generate guids in the code in a disconnected state from the database, and then submit all changes at once. When guids are generated properly, they are insainely hard to duplicate by chance. With identity columns, you often have to do an intial insert of a parent row and query for it's new identity before adding child data. You then have to update all child records with the new parent identity before committing them to the database. The same goes for grandchildren and so on down the heirarchy. It builds up to a lot of work that seems unnecessary and mundane. You can do something similar to Guids by comming up with random integers without the IDENTITY specification, but the chance of collision is greatly increased as you insert more records over time. (Guid.NewGuid() is similar to a random Int128 - which doesn't exist yet).
I use Byte (TinyInt), Int16 (SmallInt), Int32/UInt16 (Int), Int64/UInt32 (BigInt) for small lookup lists that do not change or data that does not replicate between multiple databases. (Permissions, Application Configuration, Color Names, etc.)
I imagine the indexing takes just as long to query against regardless if you are using a guid or a long. There are usually other fields in tables that are indexed that are larger than 128 bits anyway (user names in a user table for example). The difference between Guids and Integers is the size of the index in memory, as well as time populating and rebuilding indexes. The majority of database transactions is often reading. Writing is minimal. Concentrate on optimizing reading from the database first, as they are usually made of joined tables that were not optimized properly, improper paging, or missing indexes.
As with anything, the best thing to do is to prove your point. create a test database with two tables. One with a primary key of integers/longs, and the other with a guid. Populate each with N-Million rows. Moniter the performance of each during the CRUD operations (create, read, update, delete). You may find out that it does have a performance hit, but insignificant.
Servers often run on boxes without debugging environments and other applications taking up CPU, Memory, and I/O of hard drive (especially with RAID). A development environment only gives you an idea of performance.
Consider creating sequential GUID from .NET application:
http://dotnet-snippets.de/dns/sequential-guid-SID998.aspx
What are the performance improvement of Sequential Guid over standard Guid?
You can debate GUID or identity all day. I prefer the database to generate the unique value with an identity. If you merge data from multiple databases, add another column (to identify the source database, possibly a tinyint or smallint) and form a composite primary key.
If you do go with an identity, be sure to pick the right datatype, based on number of expected keys you will generate:
bigint - 8 Bytes - max positive value: 9,223,372,036,854,775,807
int - 4 Bytes - max positive value: 2,147,483,647
Note "number of expected keys " is different than the number of rows. If you mainly add and keep rows, you may find that an INT is enough with over 2 billion unique keys. I'll bet your table won't get that big. However, if you have a high volume table where you keep adding and removing rows, you row count may be low, but you'll go through keys fast. You should do some calculations to see how log it would take to go through the INTs 2 billion keys. If it won't use them up any time soon go with INT, otherwise double the key size and go with BIGINT.