This question is about possible ways CQL 3 could be used to create composite columns in Cassandra 1.1. These are just ideas; nothing is official, and the Datastax documentation doesn't cover this (it only covers composite keys).
As I understand it, composite columns are a number of columns that together have only one value.
How do you create them with CQL?
EDIT
I will be using C# to interface with Cassandra. CQL looks straightforward to use, which is why I want to use it.
You've got a couple concepts confused, I think. Quite possibly this is the fault of the Datastax documentation; if you have any good suggestions for making it clearer after you have a better picture, I'll be glad to send them on.
The "composite keys" stuff in the Datastax docs is actually talking about composite Cassandra columns. The reason for the confusion is that rows in CQL 3 do not map directly to storage engine rows (what you work with when you use the thrift interface). "Composite key" in the context of a CQL table just means a primary key which consists of multiple columns, which is implemented by composite columns at the storage layer.
This article is one of the better explanations as to how the mapping happens and why the CQL model is generally easier to think about.
With this sort of use, the first CQL primary key column becomes the storage engine partition key.
As of Cassandra 1.2 (in development), it's also possible to create composite storage engine keys using CQL, by putting extra parentheses in the PRIMARY KEY definition around the CQL columns that will be stored in the partition key (see CASSANDRA-4179), but that's probably going to be the exception, not the rule.
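A hedged sketch of that 1.2-style syntax (the table and column names are made up):

CREATE TABLE events (
    series text,
    shard int,
    occurred_at timestamp,
    payload blob,
    -- the extra parentheses make (series, shard) together the partition key
    PRIMARY KEY ((series, shard), occurred_at)
);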
With Cassandra, you store data in rows. Each row has a row key and some number of columns. Each column has a name and a value. Usually the column name and value (and row key, for that matter) are single values (int, long, UTF8, etc), but you can use composite values in row keys, column names and column values. A composite value is just some number of values that have been serialized together in some way.
Over time a number of language-specific APIs have been developed. These APIs start with the understanding I describe above and provide access to a column family accordingly. Hector, the Java client API, is the one I'm most familiar with, but there are others.
CQL was introduced as a means to use Cassandra tables in an SQL/JDBC fashion. Not all Cassandra capabilities were supported through CQL at first, although CQL is getting more and more functional as time goes on.
I don't doubt your need for composite column names and values (I believe that's what you're asking for). The problem is that CQL has yet to evolve (as I understand it) to that level of native support. Whether or not it ever will is not known to me.
I suggest that you complete the definition of your desired column family schemas, complete with composite values if necessary. Once you've done that, look at the various APIs available to access Cassandra column families and choose the one that best supports your desired schema.
You haven't said what language you're using. If you were coding in Java, then I'd recommend Hector and not CQL.
Are you sure you want to create them with CQL? What is your use case?
Related
I'm setting up a data warehouse (in SQL Server) together with our engineers, and we have almost everything up and running. Our main application also uses SQL Server as a backend, and aims to be code first while using Entity Framework. In most tables we added a column like updatedAt to allow for incremental loading into our data warehouse, but there is a many-to-many association table created by Entity Framework which we cannot modify. The table consists of two GUID columns with a composite key, so they are not iterable like an incrementing integer or a date. We are now trying to figure out our options for enabling incremental load on this table, but there is little information to be found.
After searching for a while I mostly came across posts which explained how it's not possible to manually add columns (such as updatedAt) to the association table, such as here Create code first, many to many, with additional fields in association table. Suggestions are to split out the table into two one-to-many tables. We would like to prevent this if possible.
Another potential option would be to turn on change data capture on the server, but that would potentially defeat the purpose of code first in the application.
Another thought was to add a column in the database itself, not in code, with a default value of the current datetime. But that might also be impossible or incompatible with Entity Framework, as well as defeating the code-first principle.
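If that route turned out to be viable, the change itself could be small; a hedged T-SQL sketch (the association table and constraint names here are assumptions, not our real schema):

ALTER TABLE dbo.PlaceTag  -- hypothetical EF-generated association table
ADD updatedAt datetime2 NOT NULL
    CONSTRAINT DF_PlaceTag_updatedAt DEFAULT (SYSUTCDATETIME());

Since EF typically rewrites many-to-many links as deletes and inserts, an insert-only default might be enough for incremental loads, but that would need verifying against our workload.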
Are we missing anything? Are there other solutions for this? The ideal solution would be a code first solution, or a solution in the ETL process without affecting the base application, without changing too much. Any suggestions are appreciated.
I checked the ranking on DB-Engines for 'Wide Column Store' databases, and Cassandra seems to be the most popular choice at present.
If I understood correctly, 'Wide Column' means the columns for one row are dynamic (both the number and the names of the columns), so no schema definition is needed.
But in most articles and documentation online, I found there is always a 'CREATE TABLE (...)' CQL query executed first, and then the data is inserted according to that schema. From my understanding, that is the 'static columns' case in Cassandra, where a fixed schema is defined. So how do you insert data without creating the schema first?
Also, I came across another term, 'Wide Row'. What exactly does it mean, and does it have any relation to 'Wide Column'?
Thanks a lot; these concepts puzzle me.
There are two interfaces for accessing the data in Cassandra: Thrift and CQL.
Thrift is kinda low level and gives you access to "internal" rows (aka wide rows); it also allows you to use schemaless (dynamic) tables/column families.
CQL tables are built on top of the internal rows and can only be accessed via CQL. CQL tables allow you to use all the modern features like collections, user-defined types, etc.
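For instance, a CQL table can carry per-row dynamic data in a collection without being schemaless (a minimal sketch; the names are illustrative):

CREATE TABLE users (
    username text PRIMARY KEY,
    emails set<text>,            -- collection column
    preferences map<text, text>  -- dynamic key/value pairs per row
);

INSERT INTO users (username, emails, preferences)
VALUES ('alice', {'a@example.com'}, {'theme': 'dark'});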
You can find more information there: http://www.datastax.com/dev/blog/thrift-to-cql3
In my application code I am generating a GUID using System.Guid.NewGuid() and saving it to a SQL Server DB.
I have a few questions regarding the GUID generation:
When I ran the program I did not find any problem with it in terms of performance, but I still want to know whether there is any better way to generate a GUID.
Is System.Guid.NewGuid() the only way to create a GUID in .NET code?
The GUIDs generated by Guid.NewGuid are not sequential according to SQL Server's sort order. This means you are inserting randomly into your indexes, which is a disaster for performance. It might not matter if the write volume is small enough.
You can use SQL Server's NEWSEQUENTIALID() function to create sequential ones, or just use an int.
One alternative way to generate guids (I presume as your PK) is to set the column in the table up like this:
create table MyTable(
    MyTableID uniqueidentifier not null default (newid()),
    ...
);
Implementing it like this means that you have the choice of setting the value in .NET or letting SQL Server do it.
I wouldn't say either is going to be "better" or "quicker" though.
To answer the question:
Is there any better option for GUID creation than System.Guid.NewGuid() in .NET?
I would venture to say that System.Guid.NewGuid() is the preferred choice.
But for the follow up question:
...saving this to SQL server DB.
The answer is less clear. This has been discussed on the web for a long time. Just Google "guid as primary key" and you'll have hours of reading to do.
Usually when you use a Guid in SQL Server, it is as a primary key in a table. This has some nice advantages:
It's easy to generate new values without accessing the database
You can be reasonably sure that your locally generated Guid will NOT cause a primary key collision
But there are significant drawbacks as well:
If the primary key is also the clustered index, inserting large amounts of new rows will cause a lot of IO (disc operations) and index updates.
The Guid is quite large compared to the other popular alternative for a surrogate key, the int. Since all other indexes on the table contain the clustered index key, they will grow much faster with a Guid than with an int, which in turn causes more IO because those indexes require more memory.
To mitigate the IO issue, SQL Server 2005 introduced the NEWSEQUENTIALID() function, which can be used to generate sequential Guids when inserting new rows. But if you are going to use that, then you have to be in contact with the database to generate one, so you lose the ability to generate one while offline. In that situation you could still generate a normal Guid and use that.
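Note that NEWSEQUENTIALID() can only be used in a column default, so the table definition would look roughly like this (a sketch; the table and constraint names are made up):

CREATE TABLE Orders (
    OrderID uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_OrderID DEFAULT NEWSEQUENTIALID(),
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);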
There are also many articles on the web about how to roll your own sequential Guids. One sample:
http://www.codeproject.com/Articles/388157/GUIDs-as-fast-primary-keys-under-multiple-database
I have not tested any of them so I can't vouch for how good they are. I chose that specific sample because it contains some information that might be interesting. Specifically:
It gets even more complicated, because one eccentricity of Microsoft SQL Server is that it orders GUID values according to the least significant six bytes (i.e. the last six bytes of the Data4 block). So, if we want to create a sequential GUID for use with SQL Server, we have to put the sequential portion at the end. Most other database systems will want it at the beginning.
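To make that concrete, here is a minimal COMB-style sketch in C# (my own illustration, not code from the linked article) that writes a timestamp into the last six bytes, which is where SQL Server's ordering weighs most:

using System;

static class SequentialGuid
{
    // Returns a Guid whose last six bytes carry a timestamp, so SQL Server
    // sorts successive values in roughly ascending order.
    public static Guid NewSqlServerOrderedGuid()
    {
        byte[] guidBytes = Guid.NewGuid().ToByteArray();

        // Milliseconds since a fixed epoch; six bytes is enough for millennia.
        long timestamp = (long)(DateTime.UtcNow - new DateTime(1900, 1, 1)).TotalMilliseconds;
        byte[] timestampBytes = BitConverter.GetBytes(timestamp);
        if (BitConverter.IsLittleEndian)
            Array.Reverse(timestampBytes); // big-endian so later values sort higher

        // Overwrite bytes 10..15 (the tail of Data4) with the low six timestamp bytes.
        Array.Copy(timestampBytes, 2, guidBytes, 10, 6);
        return new Guid(guidBytes);
    }
}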
EDIT: Since the issue seems to be about inserting large amounts of data using bulk copy, a sequential Guid will probably be needed. If it's not necessary to know the Guid value before inserting then the answer by Jon Egerton would be one good way to solve the issue. If you need to know the Guid value beforehand you will either have to generate sequential Guids to use when inserting or use a workaround.
One possible workaround could be to change the table to use a seeded INT as primary key (and clustered index), and have the Guid value as a separate column with a unique index. When inserting, the Guid will be supplied by you while the seeded int will be the clustered index. The rows will then be inserted sequentially, and your generated Guid can still be used as an alternative key for fetching records later. I have no idea if this is a feasible solution for you, but it's at least one possible workaround.
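A sketch of that workaround in T-SQL (the names are illustrative):

CREATE TABLE Records (
    RecordID int IDENTITY(1,1) NOT NULL,    -- seeded int, clustered primary key
    RecordGuid uniqueidentifier NOT NULL,   -- generated and supplied by the application
    CONSTRAINT PK_Records PRIMARY KEY CLUSTERED (RecordID),
    CONSTRAINT UQ_Records_RecordGuid UNIQUE NONCLUSTERED (RecordGuid)
);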
NewGuid would be the generally recommended way - unless you need sequential values, in which case you can P/Invoke to the rpcrt4 function UuidCreateSequential:
Private Declare Function UuidCreateSequential Lib "rpcrt4.dll" (ByRef id As Guid) As Integer
(Sorry, nicked from VB, sure you can convert to C# or other .NET languages as required).
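A rough C# equivalent of that declaration (an untested sketch):

using System;
using System.Runtime.InteropServices;

static class NativeMethods
{
    // Returns an RPC status code; 0 (RPC_S_OK) means success.
    [DllImport("rpcrt4.dll")]
    internal static extern int UuidCreateSequential(out Guid guid);
}

// Usage: int status = NativeMethods.UuidCreateSequential(out Guid g);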
I'm wondering if the following DB schema would have repercussions later. Let's say I'm writing a place entity. I'm not certain what properties of place will be stored in the DB. I'm thinking of making two tables: one to hold the required (or common) info, and one to hold additional info.
Table 1 - Place
PK PlaceId
Name
Lat
Lng
etc... (all the common fields)
Table 2 - PlaceData
PK DataId
PK FieldName
PK FK PlaceId
FieldData
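In T-SQL the sketch above might look like this (the column types are assumptions on my part):

CREATE TABLE Place (
    PlaceId int NOT NULL PRIMARY KEY,
    Name nvarchar(200) NOT NULL,
    Lat decimal(9,6) NULL,
    Lng decimal(9,6) NULL
    -- ... all the common fields
);

CREATE TABLE PlaceData (
    DataId int NOT NULL,
    FieldName nvarchar(100) NOT NULL,
    PlaceId int NOT NULL REFERENCES Place(PlaceId),
    FieldData nvarchar(max) NULL,
    PRIMARY KEY (DataId, FieldName, PlaceId)
);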
Usage Scenario
I want certain visitors to have the capability of entering custom fields about a place. For example, a restaurant is a place that may have the following fields: HasParking, HasDriveThru, RequiresReservation, etc... but a car dealer is also a place, and those fields wouldn't make sense for a car dealer.
I want to support any type of place, from a single table (well, 2nd table has custom fields), because I don't know the number of types of places that will eventually be added to my site.
Overall goal
On my ASP.NET MVC (C#/Razor) site, where I display a place, it will show the attributes as an unordered list populated by: SELECT * FROM PlaceData WHERE PlaceId = #0.
This way, I wouldn't need to show empty field names on the view (or do a string.IsNullOrWhitespace() check for each and every field), which I would be forced to do if every attribute were a column on the table.
I'm assuming this scenario is quite common, but are there better ways to do it? Particularly from a performance perspective? What are the major drawbacks of this schema?
Your idea is referred to as an Entity-Attribute-Value (EAV) table and is generally bad news in an RDBMS. RDBMSes are geared toward highly structured data.
The overall options are:
Model the db further in an RDBMS, which is most likely if someone is holding back specs from you.
Stick with the RDBMS, using XML columns for the data whose structure is variable. This makes the most sense if a relatively small portion of your data storage schema is semi- or un-structured. Speaking from a MS SQL Server perspective, this data can be indexed and you can perform checks that your data complies with an XML schema definition.
Move to a non-relational DB such as MongoDB, Cassandra, CouchDB, etc. This is what a lot of social sites and I suspect blog sites run with. Also, it is within reason to use a combination of RDBMS and non-relational stores if that's what your needs call for.
EAV gets to be a mess because you're creating a database within a database; you lose all of the benefits an RDBMS can provide (foreign keys, data type enforcement, etc.), and the SQL code needed to reconstruct your objects goes from lasagna to fettuccine to spaghetti in the blink of an eye.
Given the information that's been added to the question, it would seem a good fit to create a PlaceDetails column of type XML in the Place table. You could also split that column into another table with a 1:1 relationship if performance requirements dictate it.
The upside to doing it that way is that you can retrieve the data using very simple SQL code, even using the xml data type's methods for searching the data. But that approach also allows you to do the more complex presentation-oriented data parsing in C#, which is better suited to that purpose than T-SQL is.
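For illustration only (the table and column names are assumed from the question), storing and querying such an XML column could look like this:

ALTER TABLE Place ADD PlaceDetails xml NULL;

UPDATE Place
SET PlaceDetails = '<details><HasParking>1</HasParking><RequiresReservation>0</RequiresReservation></details>'
WHERE PlaceId = 42;

-- Pull a single attribute out with the xml type's value() method:
SELECT PlaceDetails.value('(/details/HasParking)[1]', 'bit') AS HasParking
FROM Place
WHERE PlaceId = 42;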
If you want your application to be able to create its own custom fields, this is a fine model. The Mantis Bugtracker uses this as well to allow Admins to add custom fields to their tickets.
If, on the other hand, it's going to be the programmer who creates the fields, I must agree with pst that this is more of a premature optimization.
At any given time you can add new columns to the database (always keeping third normal form in mind), so you should go with what you want and only create a second table if needed, or if such columns break any of the normal forms.
What would be the best database/technique to use if I'd like to create a database that can "add", "remove" and "edit" tables and columns?
I'd like it to be scaleable and fast.
Should I use a single table with columns such as (Id, Table, Column, Type, Value)? Are there any good articles about this? Or are there other solutions?
Maybe three tables: One that holds the tables, one that holds the columns and one for the values?
Maybe someone already has created a db for this purpose?
My requirement is that I'm using .NET (I guess the database doesn't have to be on Windows, but I would prefer that).
Since (in comments on the question) you are aware of the pitfalls of the "inner platform effect", it is also true that this is a very common requirement, in particular for storing custom user-defined columns; indeed, most teams have needed this. Having tried various approaches, the one I have found most successful is to keep the extra data in-line with the record. In particular, this makes it simple to obtain the data without extra steps like a second complex query on an external table, and it means that all the values share things like timestamp/rowversion for concurrency.
In particular, I've found a CustomValues column (for example text or binary; typically json / xml, but could be more exotic) a very effective way to work, acting as a property-bag for the additional data. And you don't have to parse it (or indeed, SELECT it) until you know you need the extra data.
All you then need is a way to tie named keys to expected types, but you need that metadata anyway.
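As a rough illustration of the property-bag idea (all the names here are my own, and the choice of System.Text.Json is an assumption, not a recommendation from the answer):

using System.Collections.Generic;
using System.Text.Json;

class Record
{
    public int Id { get; set; }

    // Persisted as a single text column alongside the regular columns.
    public string CustomValues { get; set; } = "{}";

    // Parse only when the extra data is actually needed.
    public Dictionary<string, string> ReadCustomValues() =>
        JsonSerializer.Deserialize<Dictionary<string, string>>(CustomValues)
            ?? new Dictionary<string, string>();

    public void WriteCustomValue(string key, string value)
    {
        var bag = ReadCustomValues();
        bag[key] = value;
        CustomValues = JsonSerializer.Serialize(bag);
    }
}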
I will, however, stress the importance of making the data portable; don't (for example) store any specific platform-bespoke serialization (for example, BinaryFormatter for .NET) - things like xml / json are fine.
Finally, your RDBMS may also work with this column; for example, SQL Server has the xml data type that allows you to run specific queries and other operations on xml data. You must make your own decision whether that is a help or a hindrance ;p
If you also need to add tables, I wonder if you are truly using the RDBMS as an RDBMS; at that point I would consider switching from an RDBMS to a document-database such as CouchDB or Raven DB