The following code can be found in the NHibernate.Id.GuidCombGenerator class. The algorithm creates sequential (comb) guids based on combining a "random" guid with a DateTime. I have a couple of questions related to the lines that I have marked with *1) and *2) below:
private Guid GenerateComb()
{
byte[] guidArray = Guid.NewGuid().ToByteArray();
// *1)
DateTime baseDate = new DateTime(1900, 1, 1);
DateTime now = DateTime.Now;
// Get the days and milliseconds which will be used to build the byte string
TimeSpan days = new TimeSpan(now.Ticks - baseDate.Ticks);
TimeSpan msecs = now.TimeOfDay;
// *2)
// Convert to a byte array
// Note that SQL Server is accurate to 1/300th of a millisecond so we divide by 3.333333
byte[] daysArray = BitConverter.GetBytes(days.Days);
byte[] msecsArray = BitConverter.GetBytes((long) (msecs.TotalMilliseconds / 3.333333));
// Reverse the bytes to match SQL Servers ordering
Array.Reverse(daysArray);
Array.Reverse(msecsArray);
// Copy the bytes into the guid
Array.Copy(daysArray, daysArray.Length - 2, guidArray, guidArray.Length - 6, 2);
Array.Copy(msecsArray, msecsArray.Length - 4, guidArray, guidArray.Length - 4, 4);
return new Guid(guidArray);
}
First of all, for *1), wouldn't it be better to have a more recent date as the baseDate, e.g. 2000-01-01, so as to make room for more values in the future?
Regarding *2), why would we care about the accuracy for DateTimes in SQL Server, when we only are interested in the bytes of the datetime anyway, and never intend to store the value in an SQL Server datetime field? Wouldn't it be better to use all the accuracy available from DateTime.Now?
Re 1: there is no relevance to the actual day value, the two bytes used from the value simply roll over 65536 days after 1/1/1900. The only thing that matters is that the values are roughly sequential. The dbase is going to be a bit inefficient in the summer of 2079, nobody will notice.
Re 2: yes, makes no sense. But same story, the actual value doesn't matter.
The algorithm is questionable, messing with the guaranteed uniqueness of Guids is a tricky proposition. You'll have to rely on somebody in the nHibernate team having insider knowledge that this works without problems. If you change it, you're liable to break it.
COMB was created specifically to efficiently use GUIDs as a clustered index in SQL Server. That's why it's written around SQL Server specific behavior.
I'm very late to this party but I thought I'd share the original intent of COMB.
I started using nHibernate in '04 and we wanted to use GUIDs for our IDs. After some research I found that SQL Server was not efficient at using completely random GUIDs as a primary key / clustered index because it wanted to store them in order and would have to do insert into the middle of the table (page splits). I got the algorithm for COMB from this article: http://www.informit.com/articles/article.aspx?p=25862 by Jimmy Nilsson which is a very comprehensive description of why COMBs are the way they are (and a good read). I started using a custom generator to generate COMBs and them NHibernate picked it up as a built-in generator.
COMB may not produce in-order IDs in other servers. I've never researched it.
Related
I have a SQL table that stores unique nvarchar data set to 60 characters max.
I now need to output each value to a file on a daily basis. This file is then fed into a 3rd party system.
However, this 3rd party system requires the value to be limited to 10 characters. The values does not have to be what is in the table. They just need to be unique and 10 characters max. They must also be consistent in that the same unique id is used each day for the table value.
I cannot truncate the string as it could then lose its uniqueness.
Looking at my options, I could:
Use GetHashCode()
Convert to Hexadecimal
With GetHashCode, this looks a simple straightforward option and I get the same value each time it it run. However, Microsoft documentation recommends against using it for my purpose...
https://learn.microsoft.com/en-us/dotnet/api/system.string.gethashcode?redirectedfrom=MSDN&view=netframework-4.8#System_String_GetHashCode
As a result, hash codes should never be used outside of the application domain in which they were created, they should never be used as key fields in a collection, and they should never be persisted.
With Hexadecimal conversion, it may also lose uniqueness when trimmed to 10 characters.
I have also looked at this example but again I'm not sure how reliable it is with uniqueness: A fast hash function for string in C#
static UInt64 CalculateHash(string read)
{
UInt64 hashedValue = 3074457345618258791ul;
for(int i=0; i<read.Length; i++)
{
hashedValue += read[i];
hashedValue *= 3074457345618258799ul;
}
return hashedValue;
}
Are there any other options available to me?
Add an unique Identity key to your table and let SQL Server manage the incrementation for you. This can be seeded with a large number if needed.
I'm writing a program which basically processes data and outputs many files. There is no way it will be producing more than 10-20 files each use. I just wanted to know if using this method to generate unique filenames is a good idea? is it possible that rand will choose, lets say x and then within 10 instances, choose x again. Is using random(); a good idea? Any inputs will be appreciated!
Random rand = new Random ();
int randNo = rand.Next(100000,999999)l
using (var write = new StreamWriter("C:\\test" + randNo + ".txt")
{
// Stuff
}
I just wanted to know if using this method to generate unique filenames is a good idea?
No. Uniqueness isn't a property of randomness. Random means that the resulting value is not in any way dependent upon previous state. Which means repeats are possible. You could get the same number many times in a row (though it's unlikely).
If you want values which are unique, use a GUID:
Guid.NewGuid();
As pointed out in the comments below, this solution isn't perfect. But I contend that it's good enough for the problem at hand. The idea is that Random is designed to be random, and Guid is designed to be unique. Mathematically, "random" and "unique" are non-trivial problems to solve.
Neither of these implementations is 100% perfect at what it does. But the point is to simply use the correct one of the two for the intended functionality.
Or, to use an analogy... If you want to hammer a nail into a piece of wood, is it 100% guaranteed that the hammer will succeed in doing that? No. There exists a non-zero chance that the hammer will shatter upon contacting the nail. But I'd still reach for the hammer rather than jury rigging something with the screwdriver.
No, this is not correct method to create temporary file names in .Net.
The right way is to use either Path.GetTempFileName (creates file immediatedly) or Path.GetRandomFileName (creates high quality random name).
Note that there is not much wrong with Random, Guid.NewGuid(), DateTime.Now to generate small number of file names as covered in other answers, but using functions that are expected to be used for particular purpose leads to code that is easier to read/prove correctness.
If you want to generate a unique value, there's a tool specifically designed for generating unqiue identifying values, a Globally Unique IDentifier (GUID).
var guid = Guid.NewGuid();
Leave the problem of figuring out the best way of creating such a unique value to others.
There is what is called the Birthday Paradox... If you generate some random numbers (any number > 1), the possibility of encountering a "collision" increases... If you generate sqrt(numberofpossiblevalues) values, the possibility of a collision is around 50%... so you have 799998 possible values... sqrt(799998) is 894... It is quite low... With 45-90 calls to your program you have a 50% chance of a collision.
Note that random being random, if you generate two random numbers, there is a non-zero possibility of a collision, and if you generate numberofpossiblevalues + 1 random numbers, the possibility of a collision is 1.
Now... Someone will tell you that Guid.NewGuid will generate always unique values. They are sellers of very good snake oil. As written in the MSDN, in the Guid.NewGuid page...
The chance that the value of the new Guid will be all zeros or equal to any other Guid is very low.
The chance isn't 0, it is very (very very I'll add) low! Here the Birthday Paradox activates... Now... Microsoft Guid have 122 bits of "random" part and 6 bits of "fixed" part, the 50% chance of a collision happens around 2.3x10^18 . It is a big number! The 1% chance of collision is after 3.27x10^17... still a big number!
Note that Microsoft generates these 122 bits with a strong random number generator: https://msdn.microsoft.com/en-us/library/bb417a2c-7a58-404f-84dd-6b494ecf0d13#id9
Windows uses the cryptographic PRNG from the Cryptographic API (CAPI) and the Cryptographic API Next Generation (CNG) for generation of Version 4 GUIDs.
So while the whole Guid generated by Guid.NewGuid isn't totally random (because 6 bits are fixed), it is still quite random.
I would think it would be a good idea to add in the date & time the file was created in the file name in order to make sure it is not duplicated. You could also add random numbers to this if you want to make it even more unique (in the case your 10 files are saved at the exact same time).
So the files name might be file06182015112300.txt (showing the month, day, year, hour, minute & seconds)
If you want to use files of that format, and you know you won't run out of unused numbers, it's safer to check that the random number you generate isn't already used as follows:
Random rand = new Random();
string filename = "";
do
{
int randNo = rand.Next(100000, 999999);
filename = "C:\\test" + randNo + ".txt";
} while (File.Exists(filename));
using (var write = new StreamWriter(filename))
{
//Stuff
}
My colleague and I are debating which of these methods to use for auto generating user ID's and post ID's for identification in the database:
One option uses a single instance of Random, and takes some useful parameters so it can be reused for all sorts of string-gen cases (i.e. from 4 digit numeric pins to 20 digit alphanumeric ids). Here's the code:
// This is created once for the lifetime of the server instance
class RandomStringGenerator
{
public const string ALPHANUMERIC_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
public const string ALPHA_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
public const string NUMERIC = "1234567890";
Random rand = new Random();
public string GetRandomString(int length, params char[] chars)
{
string s = "";
for (int i = 0; i < length; i++)
s += chars[rand.Next() % chars.Length];
return s;
}
}
and the other option is simply to use:
Guid.NewGuid();
see Guid.NewGuid on MSDN
We're both aware that Guid.NewGuid() would work for our needs, but I would rather use the custom method. It does the same thing but with more control.
My colleague thinks that because the custom method has been cooked up ourselves, it's more likely to generate collisions. I'll admit I'm not fully aware of the implementation of Random, but I presume it is just as random as Guid.NewGuid(). A typical usage of the custom method might be:
RandomStringGenerator stringGen = new RandomStringGenerator();
string id = stringGen.GetRandomString(20, RandomStringGenerator.ALPHANUMERIC_CAPS.ToCharArray());
Edit 1:
We are using Azure Tables which doesn't have an auto increment (or similar) feature for generating keys.
Some answers here just tell me to use NewGuid() "because that's what it's made for". I'm looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
Edit 2:
We were also using the cooked up method to generate post ID's which, unlike session tokens, need to look pretty for display in the url of our website (like http://mywebsite.com/14983336), so guids are not an option here, however collisions are still to be avoided.
I am looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
First, as others have noted, Random is not thread-safe; using it from multiple threads can cause it to corrupt its internal data structures so that it always produces the same sequence.
Second, Random is seeded based on the current time. Two instances of Random created within the same millisecond (recall that a millisecond is several million processor cycles on modern hardware) will have the same seed, and therefore will produce the same sequence.
Third, I lied. Random is not seeded based on the current time; it is seeded based on the amount of time the machine has been active. The seed is a 32 bit number, and since the granularity is in milliseconds, that's only a few weeks until it wraps around. But that's not the problem; the problem is: the time period in which you create that instance of Random is highly likely to be within a few minutes of the machine booting up. Every time you power-cycle a machine, or bring a new machine online in a cluster, there is a small window in which instances of Random are created, and the more that happens, the greater the odds are that you'll get a seed that you had before.
(UPDATE: Newer versions of the .NET framework have mitigated some of these problems; in those versions you no longer have every Random created within the same millisecond have the same seed. However there are still many problems with Random; always remember that it is only pseudo-random, not crypto-strength random. Random is actually very predictable, so if you are relying on unpredictability, it is not suitable.)
As other have said: if you want a primary key for your database then have the database generate you a primary key; let the database do its job. If you want a globally unique identifier then use a guid; that's what they're for.
And finally, if you are interested in learning more about the uses and abuses of guids then you might want to read my "guid guide" series; part one is here:
https://ericlippert.com/2012/04/24/guid-guide-part-one/
As written in other answers, my implementation had a few severe problems:
Thread safety: Random is not thread safe.
Predictability: the method couldn't be used for security critical identifiers like session tokens due to the nature of the Random class.
Collisions: Even though the method created 20 'random' numbers, the probability of a collision is not (number of possible chars)^20 due to the seed value only being 31 bits, and coming from a bad source. Given the same seed, any length of sequence will be the same.
Guid.NewGuid() would be fine, except we don't want to use ugly GUIDs in urls and .NETs NewGuid() algorithm is not known to be cryptographically secure for use in session tokens - it might give predictable results if a little information is known.
Here is the code we're using now, it is secure, flexible and as far as I know it's very unlikely to create collisions if given enough length and character choice:
class RandomStringGenerator
{
RNGCryptoServiceProvider rand = new RNGCryptoServiceProvider();
public string GetRandomString(int length, params char[] chars)
{
string s = "";
for (int i = 0; i < length; i++)
{
byte[] intBytes = new byte[4];
rand.GetBytes(intBytes);
uint randomInt = BitConverter.ToUInt32(intBytes, 0);
s += chars[randomInt % chars.Length];
}
return s;
}
}
"Auto generating user ids and post ids for identification in the database"...why not use a database sequence or identity to generate keys?
To me your question is really, "What is the best way to generate a primary key in my database?" If that is the case, you should use the conventional tool of the database which will either be a sequence or identity. These have benefits over generated strings.
Sequences/identity index better. There are numerous articles and blog posts that explain why GUIDs and so forth make poor indexes.
They are guaranteed to be unique within the table
They can be safely generated by concurrent inserts without collision
They are simple to implement
I guess my next question is, what reasons are you considering GUID's or generated strings? Will you be integrating across distributed databases? If not, you should ask yourself if you are solving a problem that doesn't exist.
Your custom method has two problems:
It uses a global instance of Random, but doesn't use locking. => Multi threaded access can corrupt its state. After which the output will suck even more than it already does.
It uses a predictable 31 bit seed. This has two consequences:
You can't use it for anything security related where unguessability is important
The small seed (31 bits) can reduce the quality of your numbers. For example if you create multiple instances of Random at the same time(since system startup) they'll probably create the same sequence of random numbers.
This means you cannot rely on the output of Random being unique, no matter how long it is.
I recommend using a CSPRNG (RNGCryptoServiceProvider) even if you don't need security. Its performance is still acceptable for most uses, and I'd trust the quality of its random numbers over Random. If you you want uniqueness, I recommend getting numbers with around 128 bits.
To generate random strings using RNGCryptoServiceProvider you can take a look at my answer to How can I generate random 8 character, alphanumeric strings in C#?.
Nowadays GUIDs returned by Guid.NewGuid() are version 4 GUIDs. They are generated from a PRNG, so they have pretty similar properties to generating a random 122 bit number (the remaining 6 bits are fixed). Its entropy source has much higher quality than what Random uses, but it's not guaranteed to be cryptographically secure.
But the generation algorithm can change at any time, so you can't rely on that. For example in the past the Windows GUID generation algorithm changed from v1 (based on MAC + timestamp) to v4 (random).
Use System.Guid as it:
...can be used across all computers and networks wherever a unique identifier is required.
Note that Random is a pseudo-random number generator. It is not truly random, nor unique. It has only 32-bits of value to work with, compared to the 128-bit GUID.
However, even GUIDs can have collisions (although the chances are really slim), so you should use the database's own features to give you a unique identifier (e.g. the autoincrement ID column). Also, you cannot easily turn a GUID into a 4 or 20 (alpha)numeric number.
Contrary to what some people have said in the comment, a GUID generated by Guid.NewGuid() is NOT dependent on any machine-specific identifier (only type 1 GUIDs are, Guid.NewGuid() returns a type 4 GUID, which is mostly random).
As long as you don't need cryptographic security, the Random class should be good enough, but if you want to be extra safe, use System.Security.Cryptography.RandomNumberGenerator. For the Guid approach, note that not all digits in a GUID are random. Quote from wikipedia:
In the canonical representation, xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx, the most significant bits of N indicates the variant (depending on the variant; one, two or three bits are used). The variant covered by the UUID specification is indicated by the two most significant bits of N being 1 0 (i.e. the hexadecimal N will always be 8, 9, A, or B).
In the variant covered by the UUID specification, there are five versions. For this variant, the four bits of M indicates the UUID version (i.e. the hexadecimal M will either be 1, 2, 3, 4, or 5).
Regarding your edit, here is one reason to prefer a GUID over a generated string:
The native storage for a GUID (uniqueidentifier) in SQL Server is 16 bytes. To store a equivalent-length varchar (string), where each "digit" in the id is stored as a character, would require somewhere between 32 and 38 bytes, depending on formatting.
Because of its storage, SQL Server is also able to index a uniqueidentifier column more efficiently than a varchar column as well.
I wanted to have your opinion on what is the best way to manage time series in c# according to you. I need to have a 2 dimensions matrix-like with Datetime object as an index of rows (ordered and without duplicate) and each columns would represent the stock value for the relevant Datetime. I would like to know if any of those objects would be able to handle missing data for a date: adding a column or a time serie would add the missing date in the row index and would add "null" or "N/a" for missing values for existing dates.
A lot of stuff are already available in c# compared to c++ and I don't want to miss something obvious.
TeaFiles.Net is a library for time series storage in flat files. As I understand you only want to have the data in memory, in which case you would use a MemoryStream and pass it to the ctor.
// the time series item type
struct Tick
{
public DateTime Time;
public double Price;
public int Volume;
}
// create file and write some values
var ms = new MemoryStream();
using (var tf = TeaFile<Tick>.Create(ms))
{
tf.Write(new Tick { Price = 5, Time = DateTime.Now, Volume = 700 });
tf.Write(new Tick { Price = 15, Time = DateTime.Now.AddHours(1), Volume = 1700 });
// ...
}
ms.Position = 0; // reset the stream
// read typed
using (var tf = TeaFile<Tick>.OpenRead(ms))
{
Tick value = tf.Read();
Console.WriteLine(value);
}
https://github.com/discretelogics/TeaFiles.Net
You can install the library via NuGet packages Manager "TeaFiles.Net"
A vsix sample Project is also available in the VS Gallery.
You could use a mapping between the date and the stock value, such as Dictionary<DateTime, decimal>. This way the dates can be sparse.
If you need the prices of multiple stocks at each date, and not every stock appears for every date, then you could choose between Dictionary<DateTime, Dictionary<Stock, decimal>> and Dictionary<Stock, Dictionary<DateTime, decimal>>, depending on how you want to access the values afterwards (or even both if you don't mind storing the values twice).
The DateTime object in C# is a value Type which means it initializes with its default value and that is Day=1 Month=1 Year=1 Hour=1 Minute=1 Second=1. (or was it hour=12, i am not quite sure).
If I understood you right you need a datastructure that holds DateTime objects that are ordered in some way and when you insert a new object the adjacent dateTime objects will change to retain your order.
In this case I would focus mor on the datastructure than on the dateTime object.
Write a simple class that inherits from Lits<> for example and include the functionality you want on an insert oder delete operation.
Something like:
public class DateTimeList : List<DateTime> {
public void InsertDateTime (int position, DateTime dateTime) {
// insert the new object
this.InsertAt(position, dateTime)
// then take the adjacent objects (take care of integrity checks i.e.
// exists the index/object? in not null ? etc.
DateTime previous = this.ElementAt<DateTime>(position - 1);
// modify the previous DateTime obejct according to your needs.
DateTime next = this.ElementAt<DateTime>(position + 1);
// modify the next DateTime obejct according to your needs.
}
}
As you mentioned in your comment to Marc's answer, I believe the SortedList is a more appropriate structure to hold your time series data.
UPDATE
As zmbq mentioned in his comment to Marc's question, the SortedList is implemented as an array, so if faster insertion/removal times are needed then the SortedDictionary would be a better choice.
See Jon Skeet's answer to this question for an overview of the performance differences.
The is a time series library called TimeFlow, which allows smart creation and handling of time series.
The central TimeSeries class knows its timezone and is internally based on a sorted list of DatimeTimeOffset/Decimal pairs with specific frequency (Minute, Hour, Day, Month or even custom periods). The frequency can be changed during resample operations (e.g. hours -> days). It is also possible to combine time series unsing the standard operators (+,-,*,/) or advanced join operations using cusom methods.
Further more, the TimeFrame class combines multiple time series of same timezone and frequency (similar to pythons DataFrame but restricted to time series) for easier access.
Additional there is the great TimeFlow.Reporting library that provides advanced reporting / visualization (currently Excel and WPF) of time frames.
Disclaimer: I am the creator of these libraries.
I have to store many gigabytes of data across multiple machines. The files are uniquely identified by Guid and one file can be hosted on one machine only. I was wondering if I could use the Guid as a partition key to determine which machine should I use to store the data. If so, what would be my partition function?
Otherwise, how could I partition my data in such way that all the machine get a very similar load?
Thanks!
P.S. I am not using Sql Server, Oracle or any other DB. This is all in-house code.
P.S.S. The Guid are generated using the .NET function Guid.NewGuid().
As James said in his comment, you need something that has a good, uniform distribution. Guids do not have this property. I would recommend a hash, even one as simple as a hash of the Guid itself.
A SHA-1 hash has a good distribution. I wouldn't recommend even/odd hashing unless you plan on only distributing between 2 machines.
Because GUIDs are random you could distribute them by storing the odd GUIDs on one machine and the even GUIDs on the other...
static void Main(string[] args)
{
var tests = new List<Guid>();
for (int i = 0; i < 100000; i++)
{
tests.Add(Guid.NewGuid());
}
Console.WriteLine("Even: " + tests.Where(g => g.ToByteArray().Last() % 2 == 0).Count());
Console.WriteLine("Odd : " + tests.Where(g => g.ToByteArray().Last() % 2 == 1).Count());
Console.ReadKey(true);
}
Gives a near equal distribution.
EDIT
Indeed this will not work when splitting across more than 2 machines although you could then split again on an other byte being odd or even.
If you want to round robin your distribution I would be looking at the possibility of a synchronized counter which you % the number of machines you have in a classical round robin manner.
The synchronized counter could be a field in a database, it could be a single web service, or a file on the network etc. Anything which could be incremented every time a file gets placed.