Slow Insert Time With Composite Primary Key in Cassandra - c#

I have been working with Cassandra and I have hit a bit of a stumbling block. For the way I need to search the data, a composite primary key works great, but insert times for records in this column family go to the dogs with it and I am not entirely sure why.
Table Definition:
CREATE TABLE exampletable (
    clientid int,
    filledday int,
    filledtime bigint,
    id uuid,
    ...etc...
    PRIMARY KEY (clientid, filledday, filledtime, id)
);
clientid = The internal id of the client.
filledday = The number of days since 1/1/1900.
filledtime = The number of ticks of the day at which the record was received.
id = A Guid.
The day and time structure exists because I need to be able to filter by day easily and quickly.
I know Cassandra stores column families with composite primary keys quite differently. From what I understand, it stores everything as new columns off of a base row keyed by the first component of the primary key. Is that the reason the inserts would be slow? When I say slow I mean that if I just have a primary key on id the insert will take ~200 milliseconds, and with the composite primary key (or any subset of it; I tried just clientid and id to the same effect) it will take upwards of 32 seconds for 1000 records. The select times are faster out of the composite-key table, since with the standard-key table I have to apply secondary indexes and use ALLOW FILTERING in order to get the proper records back (I know I could do the filtering in code, but the concern is that I am dealing with some massive data sets and that will not always be practical or possible).
Am I declaring the column family or the primary key wrong for what I am trying to do? With all the unlisted, non-primary-key columns the table is 37 columns wide; could that be the problem? I am quite stumped at this point. I have not been able to find anything about others having similar problems.

Well, your partition key is the client id, so all writes per client go to one node. If you are writing lots of data per client, you could end up with a hotspot, thus decreasing your overall throughput.
Also, could you give an example of the queries that you run? In Cassandra, the data model always needs to resemble the queries you want to run. If you need ALLOW FILTERING, then it seems that something is not quite right with your data model. For instance, I don't really see the point of "filledtime" in your PK. If you want to query by time period, just replace your three clustering columns with a timeuuid column "ts". This would create a wide row, with one column per entry, each with a unique timestamp, partitioned per client id.
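As a minimal sketch of what that could look like (keeping your other columns as they are):
CREATE TABLE exampletable (
    clientid int,
    ts timeuuid,
    id uuid,
    ...etc...
    PRIMARY KEY (clientid, ts)
);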
This allows queries like:
select * from exampletable where clientid = 123 and ts > minTimeuuid('2013-06-18 16:23:00') and ts < minTimeuuid('2013-06-18 16:24:00');
Again, this would depend on the queries you actually need to run.
And lastly, for overall guidance on data modelling, take a look at this eBay tech blog. Reading it helped clear up some things for me.
Hope that helps!

Related

Auto-increment Using Dates

I'm quite a beginner in general but I have a theory, idea, etc...
I want to create a task database, with a unique TaskID column [primary key or not] using the date. I need the entry to be auto-generated. In order to avoid collisions, I want to attach a number to the end, so this should achieve the goal of having all entries unique. So a series of entries would look like this:
201309281 [2013-09-28]
201309282
201309291
My thought is that I could use an auto-increment that would reset at midnight EST and start again for the new date, or something like that.
The advantage, to me, of having it work like this, is that you could see all tasks created on a given day, but then the particular task may not be completed or invoiced until, say, a week later. This way you could search by creation date, completion date, or invoice date.
I realize that there are many ways to achieve the end goal of task database. I was just curious if this was possible, or if anyone had any thoughts on how to implement it as the primary key column, or any other column for that matter.
I also want to apologize if this question is unclear. I will try to sum up here.
Can you have an auto-increment column based on the date the row is created, so it automatically generates the date as a number [20130929] with an extra digit on the end in the following format, AND have that extra digit number on the end reset to "1" every day at midnight EST or UTC?
Any thoughts on how to accomplish this?
eg:
201309291
EDIT: BTW, I would like to use an MVC4 web app to give users CRUD functionality. Using C#. I thought this fact may expand the options.
EDIT: I found this q/a on stack, and it seems similar, but doesn't quite answer my question. My thought is posting the link here might help find an answer. Resetting auto-increment column back to 0 daily
I take it you're new to db design, Nick, but this sort of design would make any seasoned DBA cringe. You should avoid putting any information in primary keys. The results you're trying to achieve can be attained using something like the code below. Remember, PKs should always be dumb IDs: no intelligent keys!
Disclaimer: I'm a very strong proponent of surrogate key designs and I'm biased in that direction. I've been stung many times by architectures that didn't fully consider the trade-offs or the downstream implications of a natural key design. I humbly respect and understand the opinions of natural key advocates, but in my experience developing relational business apps, surrogate designs are the better choice 99% of the time.
(BTW, you don't really even need the createdt field in the RANK clause; you could use the auto-increment PK instead in the ORDER BY clause of the PARTITION.)
CREATE TABLE tbl(
    id int IDENTITY(1,1) NOT NULL,
    dt date NOT NULL,
    createdt datetime NOT NULL,
    CONSTRAINT PK_tbl PRIMARY KEY CLUSTERED (id ASC)
)
go
-- I usually have this done for me by the database
-- rather than pass it from the middle tier:
-- ALTER TABLE tbl ADD CONSTRAINT DF_tbl_createdt
--     DEFAULT (getdate()) FOR createdt
insert into tbl(dt,createdt) values
('1/1/13','1/1/13 1:00am'),('1/1/13','1/1/13 2:00am'),('1/1/13','1/1/13 3:00am'),
('1/2/13','1/2/13 1:00am'),('1/2/13','1/2/13 2:00am'),('1/2/13','1/2/13 3:00am')
go
SELECT id,dt,rank=RANK() OVER (PARTITION BY dt ORDER BY createdt ASC)
from tbl
I would say that this is a very bad design idea. Primary keys should ideally be surrogate in nature and thus automatically created by SQL Server.
The logic you have drafted might be implemented well, but with that much manual engineering it could lead to a lot of complexity, maintenance overhead, and performance issues.
For creating PKs you should restrict yourself to either the IDENTITY property, SEQUENCEs (new in SQL Server 2012), or GUIDs (NEWID()).
Even if you want to go with your design, you can have a combination of a date-type column and an IDENTITY int/bigint column, and add an extra computed column to concatenate them. Resetting the IDENTITY column every midnight would not be a good idea.
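A minimal sketch of that combination (table and column names are illustrative). Note the trailing number is the global identity, not a per-day counter, so it never resets; a per-day sequence can still be derived at query time with RANK(), as in the answer above:
CREATE TABLE Tasks(
    TaskSeq   int IDENTITY(1,1) NOT NULL,
    CreatedDt date NOT NULL CONSTRAINT DF_Tasks_CreatedDt DEFAULT (GETDATE()),
    -- computed: 'yyyymmdd' + sequence, e.g. '201309291234'
    TaskCode  AS (CONVERT(varchar(8), CreatedDt, 112) + CAST(TaskSeq AS varchar(10))),
    CONSTRAINT PK_Tasks PRIMARY KEY CLUSTERED (TaskSeq ASC)
)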
Ok, I found an answer. There may be problems with this method that I don't know about, so comments would be welcome. But this method does work.
CREATE TABLE [dbo].[MainOne](
    [DocketDate] NVARCHAR(8),
    [DocketNumber] NVARCHAR(10),
    [CorpCode] NVARCHAR(5),
    CONSTRAINT pk_Docket PRIMARY KEY (DocketDate,DocketNumber)
)
GO
INSERT INTO [dbo].[MainOne] VALUES('20131003','1','CRH')
GO
CREATE TRIGGER AutoIncrement_Trigger ON [dbo].[MainOne]
INSTEAD OF INSERT AS
BEGIN
    DECLARE @number INT
    -- count today's rows (DocketDate is stored as 'yyyymmdd', so compare against the same format)
    SELECT @number = COUNT(*) FROM [dbo].[MainOne]
    WHERE [DocketDate] = CONVERT(NVARCHAR(8), GETDATE(), 112)
    -- note: every row of a multi-row INSERT gets the same number,
    -- and concurrent inserts can still collide
    INSERT INTO [dbo].[MainOne] (DocketDate, DocketNumber, CorpCode)
    SELECT CONVERT(NVARCHAR(8), GETDATE(), 112), (@number + 1), inserted.CorpCode
    FROM inserted
END
Any thoughts? I will wait three days before I mark as answer.
The only reason I'm not marking 'sisdog' is because it doesn't appear that his answer would make this an automatic function when an insert query is run.

Is it okay to have an alphanumeric field as a primary key?

I am rewriting a new timesheet application, including redesigning the database, and it will require data migration from Oracle to Oracle.
In the old system the field 'EmployeeCode' is a primary key and it is in alphanumeric form, i.e. 'UK001', 'UK002', 'FR001', 'FR002', 'US001'. The Employee table is also linked to the timesheet and other tables, where EmpCode is referenced as a FK.
To make the JOINs perform faster in the new system I was thinking about adding a new INT column to the Employee table and setting it as the PK. (I don't know if it will make any big difference.)
-Employee table has about 600 rows.
-The data type of EmpCode is Varchar2(20) in the old DB, which I can reduce to Varchar2(6) in the new system and alter later as the company expands.
I am wondering if it is better to keep EmpCode as the primary key, which will make migrating the data easier, or should I add an INT column?
Someone gave me the following advice in one of my previous threads:
"if you need to create a composite code of AANNN then I'd split this into two: a simple 'Prefix' field of CHAR(2) and an identity field of INT, then turn EmpCode into a computed field that concats the two and stick an index on there" (@Chris)
I am not sure if this option would work, as the employee table is linked to other tables as well (EmpCode is being used as a FK in other tables).
If you do add this PK and also keep the former PK, you will have some data management issues to deal with, or perhaps your customers will. Getting rid of the old PK may not be feasible if there are existing users who will be upgrading to the new database.
If EmployeeCode, the former PK, is used by the users of the data to identify Employees, then you will have to add a constraint to make sure that this field stays unique. Carrying both codes will wipe out any performance gains you were hoping for.
If it were me, I'd leave well enough alone. The performance gains, if any, will be trivial.
The performance difference will be negligible if the index you're creating on the alphanumeric field is the clustered index for the table, which, based on your question, is going to be the case; I wanted to note that for completeness. I say this for two reasons:
A clustered index is the physical order of the table, so when seeking against that index (presumably to fetch more data off the data page in a query), a binary search can be performed against it, because the rows are physically stored in that order.
A binary search is just about as efficient as you can get, not forgetting the statistics kept on an index. I call this out because integer primary keys build statistical indexes that are about as fast to seek as you can get, because mathematically speaking we know 2 comes after 1, for example.
So, just keep that in mind when building alphanumeric, or even compound, keys and indexes and comparing the difference between them and an integer key. Personally, I prefer to stick with integer primary keys because I have found them to perform better over time during extreme growth.
I hope this helps.
I use alphanumeric primary keys regularly and see absolutely no issues with it. There is no performance issue, you have a wider addressable space, and you can be more expressive/human readable. Integer keys are just a convention.
Add to that the risk you're adding to your project by making a major architectural change on top of the porting issues, and I'd say stick with the existing schema as much as possible.
There will be no performance improvement - in fact, unless you know and can prove/measure that you have a performance problem, changing things "to make them faster" usually leads to pain.
However, there is a concern that your primary key appears to carry meaning - it's a country code, concatenated with a number. What if an employee moves from the US to the UK? What if the UK hires its 1000th employee?
For that reason, I'd refactor the application to use a meaningless primary key; whether it's an INT or a VARCHAR is not hugely relevant.
You do occasionally come across alphanumeric primary keys; personally, I find they just make life more difficult. If you are able to change it and you want to change it, I would say go ahead; it will make things easier for you later. As for it being a FK, you would need to be careful to write a script to properly update all the data. One way you can do this is:
Step 1: Create a new INT IDENTITY column for the PK
Step 2: Add a new int column in your child table and then:
Step 3: write an update script like this:
UPDATE childTable C
SET C.myNewEmpIDColumn = (SELECT P.myNewEmpIDColumn
                          FROM parentTable P
                          WHERE P.oldEmpID = C.oldEmpID)
Step 4: Repeat steps 2 & 3 for all child tables
Step 5: Delete all old FK columns
Something like that, and don't forget to back up your current DB first ;)

long vs Guid for the Id (Entity), what are the pros and cons

I am doing a web application on ASP.NET MVC and I'm choosing between the long and Guid data types for my entities, but I don't know which one is better. Some say that long is much faster. Guid might also have some advantages. Anybody know?
When GUIDs can be Inappropriate
GUIDs are almost always going to be slower because they are larger. That makes your indexes larger. That makes your tables larger. That means that if you have to scan your tables, either wholly or partially, it will take longer and you will see less performance. This is a huge concern in reporting based systems. For example, one would never use a GUID as a foreign key in a fact table because its length would usually be significant, as fact tables are often partially scanned to generate aggregates.
Also consider whether or not it is appropriate to use a "long". That's an enormously large number. You only need it if you think you might have over 2 BILLION entries in your table at some point. It's rare that I use them.
GUIDs can also be tough to use and debug. Saying, "there's a problem with Customer record 10034, Frank, go check it out" is a lot easier than saying "there's a problem with {2f1e4fc0-81fd-11da-9156-00036a0f876a}..." Ints and longs are also easier to type into queries when you need to.
Oh, and it's not the case that you never get the same GUID twice. It has been known to happen on very large, disconnected systems, so that's something to consider, although I wouldn't design for it in most apps.
When GUIDs can be Appropriate
GUIDs are appropriate when you're working with disconnected systems where entities are created and then synchronized. For example, if someone makes a record in your database on a mobile device and syncs it, or you have entities being created at different branch offices and synced to a central store at night. That's the kind of flexibility they give you.
GUIDs also allow you the ability to associate entities without persisting them to the database, in certain ORM scenarios. Linq to SQL (and I believe the EF) don't have this problem, though there are times you might be forced to submit your changes to the database to get a key.
If you create your GUIDs on the client, it's possible that, because the GUIDs you create are not sequential, insert performance could suffer due to page splits on the DB.
My Advice
A lot of stuff to consider here. My vote is to not use them unless you have a compelling use case for them. If performance really is your goal, keep your tables small. Keep your fields small. Keep your DB indexes small and selective.
SIZE:
Long is 8 bytes
Guid is 16 bytes
A GUID has a very high probability of being unique and is best for identifying individual records in one or more databases.
A long (IDENTITY in the DB) can represent a unique record in a table, but you might have records represented by the same ID (identity) in one or more different tables, as follows:
TableA: PersonID int, name varchar(50)
TableB: ProductID int, name varchar(50)
SELECT PersonID from TableA where name =''
SELECT ProductID from TableB where name =''
both can return the same value, but in the case of GUIDs:
TableA: PersonID uniqueidentifier, name varchar(50)
TableB: ProductID uniqueidentifier, name varchar(50)
SELECT PersonID from TableA where name =''
SELECT ProductID from TableB where name =''
you will only very rarely get the same value returned as the id from the two tables
Have a look here
SQL Server - Guid VS. Long
GUIDs as PRIMARY KEYs and/or the clustering key
Guids make it much easier to create a 'fresh' entity in your API because you simply assign it the value of Guid.NewGuid(). There's no reliance on auto-incremented keys from a database, so this better decouples the Domain Model from the underlying persistence mechanism.
On the downside, if you use a Guid as the clustered index in SQL Server, inserts become expensive because new rows are very rarely added to the end of the table, so pages split frequently and the index fragments quickly.
Another issue is that if you perform selects from such a database without specifying an explicit ordering, you get the results back in an essentially random order.
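The fix for that is simply to ask for an order explicitly; for example (table and column names illustrative):
SELECT id, createdate
FROM sometable
ORDER BY createdate DESC;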

Is the usage of identity insert good with metadata tables

I have several tables within my database that contain nothing but "metadata".
For example we have different grouptypes, contentItemTypes, languages, etc.
The problem is, if you use automatic numbering then it is possible to create gaps.
The ids are used within our code, so the numbers are very important.
Now I wonder if it isn't better not to use autonumbering within these tables?
Right now we have to create the row in the database first, before we can write our code, and in my opinion this should not be the case.
What do you guys think?
I would use an identity column, as you suggest, to be your primary key (surrogate key), and then make your candidate key (the identifier from your system) a standard column with a unique constraint applied to it. This way you can ensure you do not insert duplicate records.
Make sense?
If these are FK tables used just to expand codes into a description or to contain other attributes, then I would NOT use an IDENTITY. Identities are good for ever-growing user data; metadata tables are usually static. When you deploy an update to your code, you don't want to be surprised by an IDENTITY value different from what you expect.
For example, you add a new value to the "Languages" table and expect the ID will be 6, but for some reason (development is out of sync, another person has not implemented their next language type, etc.) the next identity you get is different, say 7. You then insert or convert a bunch of rows using Language ID=6, which should all fail because it does not exist (it is 7 in the metadata table). Worse yet, they all actually insert or update, because the value 6 you thought was yours was already in the metadata table, and you now have a mix of two items sharing the same 6 value, while your new 7 value is left unused.
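One way to guard against that drift, if you do keep the IDENTITY, is to pin the value explicitly when deploying metadata rows. A minimal sketch, assuming a Languages table with an identity LanguageID column (names illustrative):
-- insert the metadata row with the explicit, agreed-upon id
SET IDENTITY_INSERT dbo.Languages ON;
INSERT INTO dbo.Languages (LanguageID, LanguageName) VALUES (6, 'Portuguese');
SET IDENTITY_INSERT dbo.Languages OFF;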
I would pick the proper data type based on how many codes you need and how often you will need to look at the raw data (CHARs are nice to look at for a few values, and help with remembering).
For example, if you only have a few groups, and you'll often look at the raw data, then a char(1) may be good:
GroupTypes table
-----------------
GroupType char(1) --'M'=manufacturing, 'P'=purchasing, 'S'=sales
GroupTypeDescription varchar(100)
However, if there are many different values, then some form of int (tinyint, smallint, int, bigint) may do it:
EmailTypes table
----------------
EmailType smallint --2 bytes, up to 32k different positive values
EmailTypeDescription varchar(100)
If the numbers are hardcoded in your code, don't use identity fields. Hardcode them in the database as well, as they'll be less prone to changing because someone scripted the database badly.
I would also use an identity column as the primary key, just for the simplicity of inserting records into the database, but then use a column for the type of metadata (I call mine LookUpType, an int), as well as columns for LookUpId (the int value used in code or in select lists) and LookUpName (a string). If those values require additional settings, use extra columns; I personally use two extras, LookUpKey for hierarchical relations and LookUpValue for abbreviations or alternate values of LookUpName.
Well, if those numbers are important to you because they'll be in code, I would probably not use an IDENTITY.
Instead, just make sure you use an INT column and make it the primary key; in that case, you will have to provide the IDs yourself, and they'll have to be unique.
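A minimal sketch of that approach, reusing the GroupTypes example from above (values illustrative):
-- ids are supplied explicitly and match the constants in the application code
CREATE TABLE GroupTypes (
    GroupTypeID int NOT NULL PRIMARY KEY,
    GroupTypeDescription varchar(100) NOT NULL
);
INSERT INTO GroupTypes (GroupTypeID, GroupTypeDescription) VALUES
    (1, 'Manufacturing'),
    (2, 'Purchasing'),
    (3, 'Sales');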

SQL Server - Guid VS. Long

Up until now I've been using the C# "Guid = Guid.NewGuid();" method to generate a unique ID that can be stored as the ID field in some of my SQL Server database tables using Linq to SQL.
I've been informed that for indexing reasons, using a GUID is a bad idea and that I should use an auto-incrementing long instead. Will using a long speed up my database transactions? If so, how do I go about generating unique IDs that are of type long?
Regards,
Both have pros and cons; it depends entirely on how you use them.
Right off the bat, if you need identifiers that can work across several databases, you need GUIDs. There are some tricks with Long (manually assigning each database a different seed/increment), but these don't scale well.
As far as indexing goes, Long will give much better insert performance if the index is clustered (by default primary keys are clustered, but this can be modified for your table), since the table does not need to be reorganized after every insert.
As far as concurrent inserts are concerned, however, Long (identity) columns will be slower than GUID: identity column generation requires a series of exclusive locks to ensure that only one row gets the next sequential number. In an environment with many users inserting many rows all the time, this can be a performance hit. GUID generation in this situation is faster.
Storage-wise, a GUID takes up twice the space of a Long (16 bytes vs. 8). However, it depends on the overall size of your row whether 8 extra bytes makes a noticeable difference in how many records fit in one leaf, and thus the number of leaves pulled from disk during an average request.
The "Queen of Indexing" - Kim Tripp - basically says it all in her indexing blog posts:
GUIDs as PRIMARY KEYs and/or the clustering key
The clustered index debate continues...
Ever increasing clustering key - the Clustered Index Debate......again!
Basically, her best practice is that an optimal clustering key should be:
unique
small
stable (never changing)
ever-increasing
GUIDs violate "small" and "ever-increasing" and are thus not optimal.
PLUS: all your clustering keys will be added to each and every entry in each and every non-clustered index (as the lookup to actually find the record in the database), so you want to make them as small as possible (INT = 4 bytes vs. GUID = 16 bytes). If you have hundreds of millions of rows and several non-clustered indices, choosing an INT or BIGINT over a GUID can make a major difference, even just space-wise.
Marc
have a look at this
Is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?
A long (bigint in SQL Server) is 8 bytes and a Guid is 16 bytes, so you are halving the number of bytes SQL Server has to compare when doing a lookup.
For generating a long, use IDENTITY(1,1) when you create the field in the database.
So, using either CREATE TABLE or ALTER TABLE:
Field_NAME BIGINT NOT NULL PRIMARY KEY IDENTITY(1,1)
See the comments for how to do this from Linq to SQL.
Use guids when you need to consider import/export to multiple databases. Guids are often easier to use than IDENTITY columns when working with a dataset of multiple child relationships, because you can randomly generate guids in code while disconnected from the database and then submit all changes at once. When guids are generated properly, they are insanely hard to duplicate by chance. With identity columns, you often have to do an initial insert of a parent row and query for its new identity before adding child data. You then have to update all child records with the new parent identity before committing them to the database. The same goes for grandchildren and so on down the hierarchy. It builds up to a lot of work that seems unnecessary and mundane. You can do something similar to guids by coming up with random integers without the IDENTITY specification, but the chance of collision increases greatly as you insert more records over time. (Guid.NewGuid() is similar to a random Int128, which doesn't exist yet.)
I use Byte (TinyInt), Int16 (SmallInt), Int32/UInt16 (Int), Int64/UInt32 (BigInt) for small lookup lists that do not change or data that does not replicate between multiple databases. (Permissions, Application Configuration, Color Names, etc.)
I imagine the indexing takes just as long to query against regardless of whether you are using a guid or a long. There are usually other indexed fields in tables that are larger than 128 bits anyway (user names in a user table, for example). The difference between guids and integers is the size of the index in memory, as well as the time spent populating and rebuilding indexes. The majority of database transactions are reads; writes are minimal. Concentrate on optimizing reads from the database first, as they usually involve joined tables that were not optimized properly, improper paging, or missing indexes.
As with anything, the best thing to do is to prove your point. Create a test database with two tables, one with a primary key of integers/longs and the other with a guid, and populate each with N million rows. Monitor the performance of each during CRUD operations (create, read, update, delete). You may find that there is a performance hit, but an insignificant one.
Servers often run on boxes without debugging environments, with other applications taking up CPU, memory, and hard drive I/O (especially with RAID). A development environment only gives you an idea of performance.
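A minimal sketch of the insert half of the test described above (table and column names illustrative; reads, updates, and deletes would be timed the same way):
CREATE TABLE TestLong (id bigint IDENTITY(1,1) PRIMARY KEY, payload char(100) NOT NULL);
CREATE TABLE TestGuid (id uniqueidentifier DEFAULT NEWID() PRIMARY KEY, payload char(100) NOT NULL);
GO
DECLARE @start datetime2 = SYSDATETIME(), @i int = 0;
WHILE @i < 100000
BEGIN
    INSERT INTO TestLong (payload) VALUES ('x');  -- swap in TestGuid for the second run
    SET @i += 1;
END
SELECT DATEDIFF(ms, @start, SYSDATETIME()) AS insert_ms;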
Consider creating sequential GUID from .NET application:
http://dotnet-snippets.de/dns/sequential-guid-SID998.aspx
What are the performance improvement of Sequential Guid over standard Guid?
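On the database side you can get a similar effect with SQL Server's NEWSEQUENTIALID(), which is only valid as a column default; a minimal sketch (table illustrative):
CREATE TABLE Orders (
    id uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
    placed datetime NOT NULL
);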
You can debate GUID or identity all day. I prefer the database to generate the unique value with an identity. If you merge data from multiple databases, add another column to identify the source database (possibly a tinyint or smallint) and form a composite primary key.
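A minimal sketch of that composite key (names illustrative):
CREATE TABLE MergedOrders (
    SourceDb tinyint NOT NULL,  -- identifies the originating database
    OrderId  bigint  NOT NULL,  -- the IDENTITY value from the source database
    CONSTRAINT PK_MergedOrders PRIMARY KEY (SourceDb, OrderId)
);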
If you do go with an identity, be sure to pick the right data type based on the number of keys you expect to generate:
bigint - 8 Bytes - max positive value: 9,223,372,036,854,775,807
int - 4 Bytes - max positive value: 2,147,483,647
Note "number of expected keys " is different than the number of rows. If you mainly add and keep rows, you may find that an INT is enough with over 2 billion unique keys. I'll bet your table won't get that big. However, if you have a high volume table where you keep adding and removing rows, you row count may be low, but you'll go through keys fast. You should do some calculations to see how log it would take to go through the INTs 2 billion keys. If it won't use them up any time soon go with INT, otherwise double the key size and go with BIGINT.
