The limit of Int32 for Identity Column - c#

This is just a consideration for a site I am creating, and for other big sites out there.
I am using an identity column for the ID of some of my tables, and I have classes whose Id properties are declared as Int32 to hold the value retrieved from the database.
My worry is that as the site grows, some tables that grow exponentially (e.g. QuestionComments) might exceed the Int32 limit in the future, so I changed my class to use long.
public class Question
{
public long QuestionID { get; set; }
...
}
//Converting database value to .Net type
Question q = new Question();
q.QuestionID = Convert.ToInt64(myDataRow["QuestionID"]);
How true is my assumption? Would using a UniqueIdentifier be better? Are there other ways to address this?

Can you imagine billions of records being stored for any one entity? If so, you could switch to BigInt, which is otherwise known as Int64. Of course, once you start seeing many millions of records you need to start thinking about data partitioning and archiving to avoid serious performance issues. If you have an infrastructure team, you may want to let them know that you expect heavy usage and will need a serious maintenance plan. If you are the infrastructure team, then you had better hit the books!

If you really think it's reasonable that you'll have over 2 billion comments, then use BIGINT in SQL Server (Int64 in .NET). This requires 8 bytes of storage instead of 4 bytes for INT; however, you can offset this for the first 2+ billion values if you use data compression.
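As a quick illustration of why the conversion in the mapping code has to change along with the property type, here is a small sketch (the sample value is made up):
// Int32.MaxValue is 2,147,483,647; Int64.MaxValue is 9,223,372,036,854,775,807.
long bigId = 3000000000;              // an ID past the Int32 limit
long ok = Convert.ToInt64(bigId);     // fine
int boom = Convert.ToInt32(bigId);    // throws OverflowException at runtime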


Azure table storage querying partitionkey

I am using Azure table storage and retrieving data through a timestamp filter. I see the execution is very slow, as the timestamp is not a partition key or row key. I researched on Stack Overflow and found that the timestamp should be converted to ticks and stored in the partition key. I did the same, and while inserting data I took the string below and inserted the tick string into the partition key.
string currentDateTimeTick = ConvertDateTimeToTicks(DateTime.Now.ToUniversalTime()).ToString();
public static long ConvertDateTimeToTicks(DateTime dtInput)
{
long ticks = 0;
ticks = dtInput.Ticks;
return ticks;
}
This is fine up to here. But when I try to retrieve the last 5 days of data, I am unable to query the ticks against the partition key. What is my mistake in the code below?
int days = 5;
TableQuery<MyEntity> query = new TableQuery<MyEntity>()
.Where(TableQuery.GenerateFilterConditionForDate("PartitionKey", QueryComparisons.GreaterThanOrEqual, "0"+DateTimeOffset.Now.AddDays(days).Date.Ticks));
Are you sure you want to use ticks as a partition key? This means that every measurable 100 ns instant becomes its own partition. With time-based data you can use the partition key to specify an interval like every hour, minute or even second, and then a row key with the actual timestamp.
That problem aside let me show you how to do the query. First let me comment on how you generate the partition key. I suggest you do it like this:
var partitionKey = DateTime.UtcNow.Ticks.ToString("D18");
Don't use DateTime.Now.ToUniversalTime() to get the current UTC time. It will internally use DateTime.UtcNow, then convert it to the local time zone and ToUniversalTime() will convert back to UTC which is just wasteful (and more time consuming than you may think).
And your ConvertDateTimeToTicks() method serves no other purpose than to get the Ticks property so it is just making your code more complex without adding any value.
Here is how to perform the query:
var days = 5;
var partitionKey = DateTime.UtcNow.AddDays(-days).Ticks.ToString("D18");
var query = new TableQuery<MyEntity>().Where(
TableQuery.GenerateFilterCondition(
"PartitionKey",
QueryComparisons.GreaterThanOrEqual,
partitionKey
)
);
The partition key is formatted as an 18 characters string allowing you to use a straightforward comparison.
I suggest that you move the code to generate the partition key (and row key) into a function to make sure that the keys are generated the same way throughout your code.
The reason 18 characters are used is that the Ticks value of a DateTime is currently 18 decimal digits long and stays that way until roughly the year 3169, so lexicographic ordering of the padded strings matches chronological ordering for any realistic lifetime of the data. If you decide to base your partition key on hours, minutes or seconds instead of 100 ns ticks then you can shorten the length of the partition key accordingly.
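Following the earlier suggestion to centralize key generation, a minimal sketch of such helpers could look like this (the hourly bucketing for the partition key is an assumption, not something from the question):
public static class TableKeys
{
    // Partition key: the tick count of the start of the hour, zero-padded to 18 digits
    // so that string comparison in the filter matches chronological order.
    public static string GetPartitionKey(DateTime utcTime)
    {
        var hour = new DateTime(utcTime.Year, utcTime.Month, utcTime.Day, utcTime.Hour, 0, 0, DateTimeKind.Utc);
        return hour.Ticks.ToString("D18");
    }

    // Row key: the exact timestamp's ticks, also zero-padded, keeping rows ordered within a partition.
    public static string GetRowKey(DateTime utcTime)
    {
        return utcTime.Ticks.ToString("D18");
    }
}
With helpers like these, both the insert path and the query path build keys the same way, e.g. var pk = TableKeys.GetPartitionKey(DateTime.UtcNow);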
As Martin suggests, using a timestamp as your partition key is almost certainly not what you want to do.
Partitions are the unit of scale in Azure Table Storage and more or less represent physical segmentation of your data. They're a scalability optimization that allows you to "throw hardware" at the problem of storing more and more data, while maintaining acceptable response times (something which is traditionally hard in data storage). You define the partitions in your data by assigning partition keys to each row. It's almost never desirable that each row lives in its own partition.
In ATS, the row key becomes your unique key within a given partition. So the combination of partition key + row key is the true unique key across the entire ATS table.
There's lots of advice out there for choosing a valid partition key and row key... none of which is generalized. It depends on the nature of your data, your anticipated query patterns, etc.
Choose a partition key that will aggregate your data into a reasonably well-distributed set of "buckets". All things being equal, if you anticipate having 1 million rows in your table, it's often useful to have, say, 10 buckets with 100,000 rows each... or maybe 100 buckets with 10,000 rows each. At query time you'll need to pick the partition(s) you're querying, so the number of buckets may matter to you. "Buckets" often correspond to a natural segmentation concept in your domain... a bucket to represent each US state, or a bucket to represent each department in your company, etc. Note that it's not necessary (or often possible) to have perfectly distributed buckets... get as close as you can, with reasonable effort.
One example of where you might intentionally have an uneven distribution is if you intend to vary query patterns by bucket... bucket A will receive lots of cheap, fast queries, bucket B fewer, more expensive queries, etc. Or perhaps bucket A data will remain static while bucket B data changes frequently. This can be accomplished with multiple tables, too... so there's no "one size fits all" answer.
Given the limited knowledge we have of your problem, I like Martin's advice of using a time span as your partition key. Small spans will result in many partitions, and (among other things) make queries that utilize multiple time spans relatively expensive. Larger spans will result in fewer aggregation costs across spans, but will result in bigger partitions and thus more expensive queries within a partition (it will also make identifying a suitable row key potentially more challenging).
Ultimately you'll likely need to experiment with a few options to find the most suitable one for your data and intended queries.
One other piece of advice... don't be afraid to consider duplicating data in multiple data stores to suit widely varying query types. Not every query will work effectively against a single schema or storage configuration. The effort needed to synchronize data across stores may be less than that needed to bend query technology X to your will.
Best of luck!
One thing that was not mentioned in the answers above is that Azure will detect if you are using sequential, always-increasing or always-decreasing values for your partition key and create "range partitions". Range partitions group entities that have sequential unique PartitionKey values to improve the performance of range queries. Without range partitions, as mentioned above, a range query will need to cross partition boundaries or server boundaries, which can decrease the query performance. Range partitions happen under-the-hood and are decided by Azure, not you.
Now, if you want to do bulk inserts, let's say once a minute, you will still need to flatten out your timestamp partition keys to, say, ticks rounded up to the nearest minute. You can only do bulk inserts with the same partition key.
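To make the batching point concrete, here is a rough sketch using the legacy Microsoft.WindowsAzure.Storage SDK that the question's TableQuery code implies (the entity type, its RecordedAtUtc property, the in-memory buffer and the CloudTable variable are all assumptions):
// Every entity in a single batch must share the same partition key, so the
// timestamps are bucketed to the minute before building the batch.
DateTime utcNow = DateTime.UtcNow;
DateTime minute = new DateTime(utcNow.Year, utcNow.Month, utcNow.Day, utcNow.Hour, utcNow.Minute, 0, DateTimeKind.Utc);
string partitionKey = minute.Ticks.ToString("D18");

var batch = new TableBatchOperation();
foreach (MyEntity entity in entitiesCollectedThisMinute) // hypothetical buffer filled during the last minute
{
    entity.PartitionKey = partitionKey;
    // Row key must be unique within the partition; exact ticks plus a GUID suffix is one option.
    entity.RowKey = entity.RecordedAtUtc.Ticks.ToString("D18") + "-" + Guid.NewGuid().ToString("N");
    batch.Insert(entity);
}
table.ExecuteBatch(batch); // 'table' is a CloudTable obtained elsewhere; note a batch holds at most 100 operations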

Generating a unique random number

I am entering the student id into the DB as a random number:
int num = r.Next(1000);
Session["number"] = "SN" + (" ") + num.ToString();
But is there any chance of getting a duplicate number? How can I avoid this?
EDIT: I have an identity column, and the student id is separate from that ID; I am going to enter a random student id into the DB from the UI.
It is a very common task to have a column in a DB that is merely an integer unique ID. So much so that every database I've ever worked with has a specific column type, function, etc. for dealing with it. It will vary based on whatever specific database you use, but you should figure out what that is and use it.
You need a value that is unique, not random. The two are different. Random numbers repeat; they aren't unique. Unique numbers also aren't random. For example, if you just increment numbers up from 0 they will be unique, but that's not in any way random.
You could use a GUID, which would be unique, but it would be 128 bits. That's pretty big. Most databases will just have a counter that they increment every time you add an item, so 32 bits is usually enough. This will save you a lot of space. Incrementing a counter is also quicker than calculating a GUID's new value. For DB operations that tend to involve adding lots of items, that could matter.
As Jodrell mentions in the comments, you should also consider the size of the index if you use a GUID or other large field. Storing and maintaining that index will be much more expensive (in both time and space) with column that needs that many more bits.
If you try to do something yourself there's a good chance you'll do it wrong. Either your algorithm won't be entirely unique, it will have race conditions due to improper synchronization, it will be less performant because of excessive synchronization, it will be significantly larger because that's what it took to reduce the risk of collisions, etc. At the end of the day the database will have access to tools that you don't; let it take care of it so you don't need to worry about what you could mess up.
Sure, there is a very likely chance that you will get a duplicate number. Next(1000) just gives you a number between 0 and 999, but there is no guarantee that the number will not be some number that Next has returned in the past.
If you are trying to work with unique values, look into possibly using Guids instead of integers, or have a constantly increasing integer value instead of any random number. Here is the reference page on Guid:
http://msdn.microsoft.com/en-us/library/system.guid.aspx
You can use Guids instead of a random int; they are always going to be unique.
There is no way to guarantee an int is unique unless you check every one that already exists, and even then - as the comments say - you are guaranteed duplicates once you pass 1000 ids.
EDIT:
I mention that I think Guids are best here because of the question. First, indexing the table is not going to take long at all - it is assumed that there are going to be fewer than 1000 students because of the range used for the random number, and 128 bits is fine in a table with fewer than 1000 rows.
Guids are a good thing to learn - even though they are not always the most efficient option.
Creating a unique Guid in C# has the benefit that you can keep using and displaying that id - as in the question - without another trip to the DB to figure out which unique id was assigned to the student.
Yes, you will get duplicates. If you want a truly unique item, you will need to use a Guid. If you still want to use numbers, then you will need to keep track of the numbers you have already used, similar to an identity column in a database.
Yes, you will certainly get duplicates. You could use a GUID instead:
Guid g = Guid.NewGuid();
GUIDs are theoretically "Globally Unique".
You can try to generate id using Guid:
Session["number"] = "SN" + (" ") + Guid.NewGuid().ToString();
It will drastically decrease the chance of getting a duplicate id.
If you are using random numbers then no, there is no way of avoiding it; there will always be a chance of a collision.
I think what you are probably looking for is an Identity column, or whatever the equivalent is for your database server.
In LINQ to SQL it is possible to define the column like this:
[Column ( IsPrimaryKey = true, IsDbGenerated = true )]
public int ID { get; set; }
I don't know if it helps you in ASP.NET, but maybe it is a good hint...
Yes there is a chance of course.
Quick solution:
Check if it is a duplicate number first and try again until it is no longer a duplicate number.
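A rough sketch of that check-and-retry idea, assuming a hypothetical StudentIdExists(string id) helper that queries the database:
// Keep drawing random numbers until one is not already taken. With only 1000
// possible values this loop slows down as the table fills up, and a unique
// constraint on the column is still needed to be safe under concurrency.
Random r = new Random();
string studentId;
do
{
    studentId = "SN " + r.Next(1000);
}
while (StudentIdExists(studentId)); // hypothetical DB lookup

Session["number"] = studentId;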

how to check if value is present in a very big data record or a big list efficiently

Hi guys, I have this doubt...
If I have a record of username and password details for logging in to a website, I'll most probably get the username and password from the form and use them to check whether the given username is present in the database (a contains()-style Boolean operation), and if it is, check that the password is the same as the one saved in the database.
But for websites like Gmail and Facebook there are millions of records, and authentication is very quick.
How do they do it? What method do they follow?
How do they check whether a value is present in such a large data set that quickly?
Does the process just involve adding more servers for processing speed?
Thanks for the answers.
EDIT: Sorry, I posted this question without knowing about indexes. (I just learned that by creating indexes on one or more columns, the full table scan is avoided and the index path is used instead, which is a cheaper and more efficient operation.)
You just need one SQL query:
select 1 from user u
where u.login = :theEnteredLogin
and u.hashed_password = :theHashedEnteredPassword
(where :xxx are parameters of the query).
If you have an index on the login column or, even better, on (login, hashed_password), the query should not take more than a few milliseconds to execute.
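From C#, that query could be issued with ADO.NET parameters along these lines - a sketch only, using SQL Server-style @ parameters; the table and column names come from the query above, while the connection string and the HashPassword helper are assumptions:
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "select 1 from [user] u where u.login = @login and u.hashed_password = @hash", connection))
{
    command.Parameters.AddWithValue("@login", enteredLogin);
    command.Parameters.AddWithValue("@hash", HashPassword(enteredPassword)); // hypothetical hashing helper
    connection.Open();
    // ExecuteScalar returns null when no row matches; with the index in place this is a quick lookup.
    bool credentialsValid = command.ExecuteScalar() != null;
}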
Well, they have lots of servers and high-performance databases. At a low level, the table for the hash is probably indexed by the hash for fast lookup - binary search style.
For medium to large data sets indexing, combined with proper sizing of disk, memory and cpus, is the most adopted approach.
For very large data sets, the database can be distributed and data partitioned.
For very, very large data sets, aside from the above scenarios, the technologies used usually involve a map-reduce model.

dynamic data model

I have a project that requires user-defined attributes for a particular object at runtime (let's say a Person object in this example). The project will have many different users (1000+), each defining their own unique attributes for their own sets of Person objects.
(E.g. user #1 will have a set of defined attributes, which will apply to all Person objects 'owned' by this user. Multiply this by 1000 users, and that's the bottom-line minimum number of users the app will work with.) These attributes will be used to query the Person objects and return results.
I think these are the possible approaches I can use. I will be using C# (and any version of .NET 3.5 or 4), and have free rein regarding what to use for a datastore. (I have MySQL and MSSQL available, although I have the freedom to use any software, as long as it fits the bill.)
Have I missed anything, or made any incorrect assumptions in my assessment?
Out of these choices - what solution would you go for?
Hybrid EAV object model. (Define the database using normal relational model, and have a 'property bag' table for the Person table).
Downsides: many joins per query; poor performance; can hit a limit on the number of joins/tables used in a query.
I've knocked up a quick sample that has a Subsonic 2.x-esque interface:
Select().From().Where ... etc
which generates the correct joins, then filters and pivots the returned data in C# to return a DataTable configured with the correctly typed data set.
I have yet to load test this solution. It's based on the EA advice in this Microsoft whitepaper:
SQL Server 2008 RTM Documents Best Practices for Semantic Data Modeling for Performance and Scalability
Allow the user to dynamically create/alter the object's table at run-time. This solution is what I believe NHibernate does in the background when using dynamic properties, as discussed here:
http://bartreyserhove.blogspot.com/2008/02/dynamic-domain-mode-using-nhibernate.html
Downsides:
As the system grows, the number of columns defined will get very large, and may hit the max number of columns. If there are 1000 users, each with 10 distinct attributes for their 'Person' objects, then we'd need a table holding 10k columns. Not scalable in this scenario.
I guess I could allow a person attribute table per user, but if there are 1000 users to start, that's 1000 tables plus the other 10 odd in the app.
I'm unsure if this would be scalable - but it doesn't seem so. Someone please correct me if I am incorrect!
Use a NoSQL datastore, such as CouchDb / MongoDb
From what I have read, these aren't yet proven in large-scale apps, are string-based, and are very early in their development phase. If I am incorrect in this assessment, can someone let me know?
http://www.eflorenzano.com/blog/post/why-couchdb-sucks/
Using XML column in the people table to store attributes
Drawbacks - no indexing for querying, so every row would need to be retrieved and inspected to return a resultset, resulting in poor query performance.
Serializing an object graph to the database.
Drawbacks - no indexing for querying, so every row would need to be retrieved and inspected to return a resultset, resulting in poor query performance.
C# bindings for BerkeleyDB
From what I read here: http://www.dinosaurtech.com/2009/berkeley-db-c-bindings/
Berkeley DB has definitely proven to be useful, but as Robert pointed out - there is no easy interface. Your entire OO wrapper has to be hand coded, and all of your indices are hand maintained. It is much more difficult than SQL / LINQ-to-SQL, but that's the price you pay for ridiculous speed.
Seems a large overhead - however if anyone can provide a link to a tutorial on how to maintain the indices in C# - it could be a goer.
SQL / RDF hybrid.
Odd that I didn't think of this before. Similar to option 1, but instead of a 'property bag' table, just cross-reference to an RDF store?
Querying would then involve 2 steps: query the RDF store for people matching the required attributes to return the person object IDs, then use those IDs in the SQL query to return the relational data. Extra overhead, but it could be a goer.
The ESENT database engine on Windows is used heavily for this kind of semi-structured data. One example is Microsoft Exchange which, like your application, has thousands of users where each user can define their own set of properties (MAPI named properties). Exchange uses a slightly modified version of ESENT.
ESENT has a lot of features that enable applications with large meta-data requirements: each ESENT table can have about ~32K columns defined; tables, indexes and columns can be added at runtime; sparse columns don't take up any record space when not set; and template tables can reduce the space used by the meta-data itself. It is common for large applications to have thousands of tables/indexes.
In this case you can have one table per user and create the per-user columns in the table, creating indexes on any columns that you want to query. That would be similar to the way that some versions of Exchange store their data. The downside of this approach is that ESENT doesn't have a query engine so you will have to hand-craft your queries as MakeKey/Seek/MoveNext calls.
A managed wrapper for ESENT is here:
http://managedesent.codeplex.com/
In an EAV model you don't have to have many joins, as you can have just the joins you need for the query filtering. For the resultset, return property entries as a separate rowset.
That is what we are doing in our EAV implementation.
For example, a query might return persons with extended property 'Age' > 18:
Properties table:
1 Age
2 NickName
First resultset:
PersonID Name
1 John
2 Mary
Second resultset:
PersonID PropertyID Value
1 1 24
1 2 'Neo'
2 1 32
2 2 'Pocahontas'
For the first resultset, you need an inner join for the 'age' extended property
to query the basic Person object entity part:
select p.ID, p.Name from Persons p
join PersonExtendedProperties pp
on p.ID = pp.PersonID
where pp.PropertyName = 'Age'
and pp.PropertyValue > 18 -- probably need to convert to integer here
For the second resultset, we make an outer join of the first resultset with the PersonExtendedProperties table to get the rest of the extended properties. It's a 'narrow' resultset; we do not pivot the properties in SQL, so we don't need multiple joins here.
Actually, we use separate tables for different value types to avoid data type conversion and to keep the extended properties indexed and easily queryable.
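To show how the narrow second resultset can be pivoted on the client, here is a minimal C# sketch (the SqlDataReader variable and the use of object values are assumptions about the surrounding code):
// Pivot the narrow (PersonID, PropertyID, Value) rows into one property bag per person.
var propertiesByPerson = new Dictionary<int, Dictionary<int, object>>();
while (reader.Read()) // 'reader' is positioned on the second resultset
{
    int personId = reader.GetInt32(0);
    int propertyId = reader.GetInt32(1);
    object value = reader.GetValue(2);

    Dictionary<int, object> bag;
    if (!propertiesByPerson.TryGetValue(personId, out bag))
    {
        bag = new Dictionary<int, object>();
        propertiesByPerson[personId] = bag;
    }
    bag[propertyId] = value;
}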
My recommendation:
Allow properties to be marked as indexable. Have a smallish hard limit on number of indexable properties, and on columns per object. Have a large hard limit on total column types in all objects.
Implement indexes as separate tables (one per index) joined with main table of data (main table has large unique key for object). (Index tables can then be created/dropped as required).
Serialize the data, including the index columns, and also put the index properties in first-class relational columns in their dedicated index tables. Use JSON instead of XML to save space in the table. Enforce a short-column-name policy (or a long display name plus short stored name policy) to save space and increase performance.
Use quarks for field identifiers (but only in the main engine to save RAM and speed some read operations -- don't rely on quark pointer comparison in all cases).
My thought on your options:
1 is possible. Performance will clearly be lower than if the fields were stored as real columns.
2 is a no in general; DB engines are not all happy about dynamic schema changes. But a possible yes if your DB engine handles this well.
3 Possible.
4 Yes, though I'd use JSON.
5 Seems like 4, only less optimized?
6 Sounds good; I would go with it if I were happy to try something new and comfortable with its reliability and performance, but usually I would want to go with more mainstream technology. I'd also like to reduce the number of engines involved in coordinating a transaction to fewer than would be needed here.
Edit: Of course, though I've recommended something, there can be no general right answer here -- profile various data models and approaches with your data to see what runs best for your application.
Edit: Changed last edit wording.
Assuming you can place a limit, N, on how many custom attributes each user can define, just add N extra columns to the Person table. Then have a separate table where you store per-user metadata describing how to interpret the contents of those columns for each user. Similar to #1 once you've read in the data, but no joins are needed to pull in the custom attributes.
For a problem similar to yours, we have used the "XML Column" approach (the fourth one in your survey of methods). But you should note that many databases (DBMSs) support indexes on XML values.
I recommend using one table for Person which contains one XML column along with the other common columns. In other words, design the Person table with columns that are common to all person records and add a single XML column for the dynamic and differing attributes.
We are using Oracle; it supports indexes on its XML type. Two kinds of indexes are supported: 1) XMLIndex, for indexing elements and attributes within an XML value; 2) Oracle Text Index, for enabling full-text search in the text fields of the XML.
For example, in Oracle you can create an index such as:
CREATE INDEX index1 ON table_name (XMLCast(XMLQuery ('$p/PurchaseOrder/Reference'
PASSING XML_Column AS "p" RETURNING CONTENT) AS VARCHAR2(128)));
and xml-query is supported in select queries:
SELECT count(*) FROM purchaseorder
WHERE XMLCast(XMLQuery('$p/PurchaseOrder/Reference'
PASSING OBJECT_VALUE AS "p" RETURNING CONTENT)
AS INTEGER) = 25;
As far as I know, other databases such as PostgreSQL and MS SQL Server (but not MySQL) support similar indexing models for XML values.
see also:
http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb_indexing.htm#CHDEADIH

C#: Is it possible to store a Decimal Array in an SQL database?

I'm working on an application for a lab project and I'm making it in C#. It's supposed to import results from a text file that is exported from the application we use to run the tests, and so far I've hit a roadblock.
I've gotten the program to save around 250 decimal values as a single-dimension array, but now I'm trying to get the array itself saved into an SQL database so that I can later retrieve the array and use the decimal values to construct a plot of the points.
I need the entire array to be imported into the database as one single value though because the lab project has several specimens each with their own set of 250 or so Decimal points (which will be stored as arrays, too)
Thanks for your help.
EDIT: Thanks for the quick replies, guys, but the problem is that it's not just results from a specimen with only 1 test run. Each specimen has the same test performed on it, at different decibel levels, over 15 times. Each test has its own set of 250 results, and we have many specimens.
Also, the specimens already have a unique ID assigned to them, and it's stored as a String, not an Int. What I'm planning on doing is having a separate table in the DB for each specimen, and having each row include info on the decibel level of the test and store the array serialized...
I think this would work because we will NOT need to access individual points in the data straight from the database; I'm just using the database to store the data out of memory since there's so much of it. I'm going to query the database for the array and other info and then use zedgraph to plot the points in the array and compare multiple specimens simultaneously.
The short answer is absolutely not. These are two completely different data structures. There are workarounds, like putting it in a blob or comma-separating a text column, but I really hate those. They don't allow you to do math at the SQL Server level.
IMO, the best option includes having more than one column in your table. Add an identifier so you know which array the data point belongs to.
For example:
AutoId Specimen Measurement
1 A 42
2 A 45.001
3 B 47.92
Then, to get your results:
select measurement
from mytable
where specimen = 'A'
order by autoid asc
Edit: You're planning on doing a separate 250 row table for each specimen? That's absolutely overkill. Just use one table, have the specimen identifier as a column (as shown), and index that column. SQL Server can handle millions upon millions of rows markedly well. Databases are really good at that. Why not play to their strengths instead of trying to recreate C# data structures?
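A sketch of pulling one specimen's points back out of that single table into a decimal array (the connection string is assumed; the table and column names are the ones from the example above):
var points = new List<decimal>();
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "select measurement from mytable where specimen = @specimen order by autoid asc", connection))
{
    command.Parameters.AddWithValue("@specimen", "A");
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            points.Add(reader.GetDecimal(0)); // measurement column assumed to be a decimal type
        }
    }
}
decimal[] measurements = points.ToArray(); // ready to hand to ZedGraph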
I need the entire array to be imported into the database as one single value though because the lab project has several specimens each with their own set of 250 or so Decimal points (which will be stored as arrays, too)
So you're trying to pound a nail, should you use an old shoe or a glass bottle?
The answer here isn't "serialize the array into XML and store it in a record". You really want to strive for correct database design, and in your case the simplest design is:
Specimens
---------
specimenID (pk int not null)
SpecimenData
------------
dataID (pk int not null)
specimenID (fk int not null, points to Specimens table)
awesomeValue (decimal not null)
Querying for data is very straightforward:
SELECT * FROM SpecimenData WHERE specimenID = @specimenID
As long as you don't need to access the individual values in your queries, you can serialize the array and store it as a blob in the database.
Presumably you could serialize the decimal array in C# to a byte array, and save that in a binary field on a table. Your table would have two fields: SpecimenID, DecimalArrayBytes
Alternatively, you could have a many-to-many style table and not store the array in one piece, with fields SpecimenID and DecimalValue, and use SQL like:
SELECT DecimalValue FROM Table WHERE SpecimenID = X
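If you go the binary-field route, a minimal sketch of the round trip from decimal[] to byte[] and back might look like this (the method names are made up for illustration; BinaryWriter/BinaryReader have decimal overloads):
public static byte[] ToBytes(decimal[] values)
{
    using (var stream = new MemoryStream())
    using (var writer = new BinaryWriter(stream))
    {
        writer.Write(values.Length);              // store the count first
        foreach (decimal value in values)
        {
            writer.Write(value);
        }
        return stream.ToArray();                  // save this in a varbinary(max) column
    }
}

public static decimal[] FromBytes(byte[] data)
{
    using (var stream = new MemoryStream(data))
    using (var reader = new BinaryReader(stream))
    {
        var values = new decimal[reader.ReadInt32()];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = reader.ReadDecimal();
        }
        return values;
    }
}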
You can serialize the array and store it as a single chunk of xml/binary/json. Here is an example of serializing it as xml.
// Requires System.Runtime.Serialization, System.Text and System.Xml.
public static string Serialize<T>(T obj)
{
    StringBuilder sb = new StringBuilder();
    DataContractSerializer ser = new DataContractSerializer(typeof(T));
    // The writer must be closed/flushed before the StringBuilder contains the XML.
    using (XmlWriter writer = XmlWriter.Create(sb))
    {
        ser.WriteObject(writer, obj);
    }
    return sb.ToString();
}
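A matching deserializer, sketched along the same lines (not part of the original answer; StringReader needs System.IO):
public static T Deserialize<T>(string xml)
{
    DataContractSerializer ser = new DataContractSerializer(typeof(T));
    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
        return (T)ser.ReadObject(reader);
    }
}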
You want two tables. One to store an index, the other to store the decimal values. Something like this:
create table arrayKey (
arrayId int identity(1,1) not null
)
create table arrayValue (
arrayID int not null,
sequence int identity(1,1) not null,
storedDecimal decimal(12,2) not null
)
Insert into arrayKey to get an ID to use. All of the decimal values would get stored into arrayValue using the ID and the decimal value to store. Insert them one at a time.
When you retrieve them, you can group them by arrayID so that they all come out together. If you need to retrieve them in the same order you stored them, sort by sequence.
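A rough sketch of the insert side in C# for that two-table design (the connection string and the name of the decimal array are assumptions; the DEFAULT VALUES insert works because arrayKey's only column is the identity):
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    // Create a new array key and capture the generated identity value.
    int arrayId;
    using (var keyCommand = new SqlCommand(
        "insert into arrayKey default values; select cast(scope_identity() as int);", connection))
    {
        arrayId = (int)keyCommand.ExecuteScalar();
    }

    // Insert the values one at a time; the identity on 'sequence' preserves insertion order.
    foreach (decimal value in decimalValues) // hypothetical decimal[] parsed from the text file
    {
        using (var valueCommand = new SqlCommand(
            "insert into arrayValue (arrayID, storedDecimal) values (@arrayId, @value)", connection))
        {
            valueCommand.Parameters.AddWithValue("@arrayId", arrayId);
            valueCommand.Parameters.AddWithValue("@value", value);
            valueCommand.ExecuteNonQuery();
        }
    }
}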
Although any given example might be impractical, via programming you can engineer any shape of peg into any shape of hole.
You could serialize your data for storage in a varbinary, XML-ize it for storage into a SQL Server XML type, etc.
Pick a route to analyze and carefully consider. You can create custom CLR libraries for SQL as well so the virtual sky is the limit.
