c# string.GetHashCode. Unique number generation [duplicate] - c#

This question already has answers here:
What is hashCode used for? Is it unique?
(5 answers)
Closed 2 years ago.
I have below c# code which return
public static void Main()
{
//var bytes = "7A5EC53415E34F288269EBB05B73AFD4".ToByteArray();
for(int i=0;i<1000;i++)
{
var a = Guid.NewGuid().ToString("N").ToUpper();
Console.WriteLine($"{a}->{a.GetHashCode()}");
}
}
Which results
C5D9DB1ACBFB43E2B0DC2812F22C3899-> -1816084758
42B7C0BF6C0341DB96DE5084BA403193-> 1197814565
B4062E3C129E478BA8E69552DDC700F2-> 1349563395
863C9FBF0369496E90A1B0246F855D6E-> -772372816
B42562EDB97346F48DE37ADE3FB6620E-> 2019192158
My Question here is, will the return values be a unique value so that I can use it for generating some random unique number?
Sometime GetHashCode()return negative value and applying the Absolute will be considered a bad idea since i need without sign.

Sorry, no. It's harder than that, especially since you mentioned in the comments that this runs across multiple processes.
Ideas that won't work
GUIDs are somewhat unique across machines (in C# they use the MAC address and other strategies), but GetHashCode() is a big problem since that is in no way guaranteed to be unique.
You also can't rely on Random() for multiple reasons:
It uses Environment.Tick for its seed, and could indeed result in the same seed twice if called within a very narrow interval (low tens of milliseconds - 16ms?) from different processes.
Even if the seed was unique, some of the resulting values won't be. At least not at scale. We are looking for uniqueness, not just randomness.
One idea that will definitely work
If you want to be absolutely certain of uniqueness in a mechanical way across distributed processes running on different machines (or, worse, the same machine...), you could delegate the creation of unique values to a service. That service could save all unique values to a database table, forever. Or for some period of time if your requirements allow... That would be much easier.
Any time the service generates a new value, it would check the table first. If that value was ever used in the past, try again until a unique is found. Then save it to the database and return the value.
Doing this efficiently at scale would probably require a bit of engineering. There may be a more elegant purely mathematical solution, but this would be reliable and easy to understand (it does not rely on magic).

Related

Safe to use random numbers to make filenames unique?

I'm writing a program which basically processes data and outputs many files. There is no way it will be producing more than 10-20 files each use. I just wanted to know if using this method to generate unique filenames is a good idea? is it possible that rand will choose, lets say x and then within 10 instances, choose x again. Is using random(); a good idea? Any inputs will be appreciated!
Random rand = new Random ();
int randNo = rand.Next(100000,999999)l
using (var write = new StreamWriter("C:\\test" + randNo + ".txt")
{
// Stuff
}
I just wanted to know if using this method to generate unique filenames is a good idea?
No. Uniqueness isn't a property of randomness. Random means that the resulting value is not in any way dependent upon previous state. Which means repeats are possible. You could get the same number many times in a row (though it's unlikely).
If you want values which are unique, use a GUID:
Guid.NewGuid();
As pointed out in the comments below, this solution isn't perfect. But I contend that it's good enough for the problem at hand. The idea is that Random is designed to be random, and Guid is designed to be unique. Mathematically, "random" and "unique" are non-trivial problems to solve.
Neither of these implementations is 100% perfect at what it does. But the point is to simply use the correct one of the two for the intended functionality.
Or, to use an analogy... If you want to hammer a nail into a piece of wood, is it 100% guaranteed that the hammer will succeed in doing that? No. There exists a non-zero chance that the hammer will shatter upon contacting the nail. But I'd still reach for the hammer rather than jury rigging something with the screwdriver.
No, this is not correct method to create temporary file names in .Net.
The right way is to use either Path.GetTempFileName (creates file immediatedly) or Path.GetRandomFileName (creates high quality random name).
Note that there is not much wrong with Random, Guid.NewGuid(), DateTime.Now to generate small number of file names as covered in other answers, but using functions that are expected to be used for particular purpose leads to code that is easier to read/prove correctness.
If you want to generate a unique value, there's a tool specifically designed for generating unqiue identifying values, a Globally Unique IDentifier (GUID).
var guid = Guid.NewGuid();
Leave the problem of figuring out the best way of creating such a unique value to others.
There is what is called the Birthday Paradox... If you generate some random numbers (any number > 1), the possibility of encountering a "collision" increases... If you generate sqrt(numberofpossiblevalues) values, the possibility of a collision is around 50%... so you have 799998 possible values... sqrt(799998) is 894... It is quite low... With 45-90 calls to your program you have a 50% chance of a collision.
Note that random being random, if you generate two random numbers, there is a non-zero possibility of a collision, and if you generate numberofpossiblevalues + 1 random numbers, the possibility of a collision is 1.
Now... Someone will tell you that Guid.NewGuid will generate always unique values. They are sellers of very good snake oil. As written in the MSDN, in the Guid.NewGuid page...
The chance that the value of the new Guid will be all zeros or equal to any other Guid is very low.
The chance isn't 0, it is very (very very I'll add) low! Here the Birthday Paradox activates... Now... Microsoft Guid have 122 bits of "random" part and 6 bits of "fixed" part, the 50% chance of a collision happens around 2.3x10^18 . It is a big number! The 1% chance of collision is after 3.27x10^17... still a big number!
Note that Microsoft generates these 122 bits with a strong random number generator: https://msdn.microsoft.com/en-us/library/bb417a2c-7a58-404f-84dd-6b494ecf0d13#id9
Windows uses the cryptographic PRNG from the Cryptographic API (CAPI) and the Cryptographic API Next Generation (CNG) for generation of Version 4 GUIDs.
So while the whole Guid generated by Guid.NewGuid isn't totally random (because 6 bits are fixed), it is still quite random.
I would think it would be a good idea to add in the date & time the file was created in the file name in order to make sure it is not duplicated. You could also add random numbers to this if you want to make it even more unique (in the case your 10 files are saved at the exact same time).
So the files name might be file06182015112300.txt (showing the month, day, year, hour, minute & seconds)
If you want to use files of that format, and you know you won't run out of unused numbers, it's safer to check that the random number you generate isn't already used as follows:
Random rand = new Random();
string filename = "";
do
{
int randNo = rand.Next(100000, 999999);
filename = "C:\\test" + randNo + ".txt";
} while (File.Exists(filename));
using (var write = new StreamWriter(filename))
{
//Stuff
}

Guid.NewGuid() VS a random string generator from Random.Next()

My colleague and I are debating which of these methods to use for auto generating user ID's and post ID's for identification in the database:
One option uses a single instance of Random, and takes some useful parameters so it can be reused for all sorts of string-gen cases (i.e. from 4 digit numeric pins to 20 digit alphanumeric ids). Here's the code:
// This is created once for the lifetime of the server instance
class RandomStringGenerator
{
public const string ALPHANUMERIC_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
public const string ALPHA_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
public const string NUMERIC = "1234567890";
Random rand = new Random();
public string GetRandomString(int length, params char[] chars)
{
string s = "";
for (int i = 0; i < length; i++)
s += chars[rand.Next() % chars.Length];
return s;
}
}
and the other option is simply to use:
Guid.NewGuid();
see Guid.NewGuid on MSDN
We're both aware that Guid.NewGuid() would work for our needs, but I would rather use the custom method. It does the same thing but with more control.
My colleague thinks that because the custom method has been cooked up ourselves, it's more likely to generate collisions. I'll admit I'm not fully aware of the implementation of Random, but I presume it is just as random as Guid.NewGuid(). A typical usage of the custom method might be:
RandomStringGenerator stringGen = new RandomStringGenerator();
string id = stringGen.GetRandomString(20, RandomStringGenerator.ALPHANUMERIC_CAPS.ToCharArray());
Edit 1:
We are using Azure Tables which doesn't have an auto increment (or similar) feature for generating keys.
Some answers here just tell me to use NewGuid() "because that's what it's made for". I'm looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
Edit 2:
We were also using the cooked up method to generate post ID's which, unlike session tokens, need to look pretty for display in the url of our website (like http://mywebsite.com/14983336), so guids are not an option here, however collisions are still to be avoided.
I am looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
First, as others have noted, Random is not thread-safe; using it from multiple threads can cause it to corrupt its internal data structures so that it always produces the same sequence.
Second, Random is seeded based on the current time. Two instances of Random created within the same millisecond (recall that a millisecond is several million processor cycles on modern hardware) will have the same seed, and therefore will produce the same sequence.
Third, I lied. Random is not seeded based on the current time; it is seeded based on the amount of time the machine has been active. The seed is a 32 bit number, and since the granularity is in milliseconds, that's only a few weeks until it wraps around. But that's not the problem; the problem is: the time period in which you create that instance of Random is highly likely to be within a few minutes of the machine booting up. Every time you power-cycle a machine, or bring a new machine online in a cluster, there is a small window in which instances of Random are created, and the more that happens, the greater the odds are that you'll get a seed that you had before.
(UPDATE: Newer versions of the .NET framework have mitigated some of these problems; in those versions you no longer have every Random created within the same millisecond have the same seed. However there are still many problems with Random; always remember that it is only pseudo-random, not crypto-strength random. Random is actually very predictable, so if you are relying on unpredictability, it is not suitable.)
As other have said: if you want a primary key for your database then have the database generate you a primary key; let the database do its job. If you want a globally unique identifier then use a guid; that's what they're for.
And finally, if you are interested in learning more about the uses and abuses of guids then you might want to read my "guid guide" series; part one is here:
https://ericlippert.com/2012/04/24/guid-guide-part-one/
As written in other answers, my implementation had a few severe problems:
Thread safety: Random is not thread safe.
Predictability: the method couldn't be used for security critical identifiers like session tokens due to the nature of the Random class.
Collisions: Even though the method created 20 'random' numbers, the probability of a collision is not (number of possible chars)^20 due to the seed value only being 31 bits, and coming from a bad source. Given the same seed, any length of sequence will be the same.
Guid.NewGuid() would be fine, except we don't want to use ugly GUIDs in urls and .NETs NewGuid() algorithm is not known to be cryptographically secure for use in session tokens - it might give predictable results if a little information is known.
Here is the code we're using now, it is secure, flexible and as far as I know it's very unlikely to create collisions if given enough length and character choice:
class RandomStringGenerator
{
RNGCryptoServiceProvider rand = new RNGCryptoServiceProvider();
public string GetRandomString(int length, params char[] chars)
{
string s = "";
for (int i = 0; i < length; i++)
{
byte[] intBytes = new byte[4];
rand.GetBytes(intBytes);
uint randomInt = BitConverter.ToUInt32(intBytes, 0);
s += chars[randomInt % chars.Length];
}
return s;
}
}
"Auto generating user ids and post ids for identification in the database"...why not use a database sequence or identity to generate keys?
To me your question is really, "What is the best way to generate a primary key in my database?" If that is the case, you should use the conventional tool of the database which will either be a sequence or identity. These have benefits over generated strings.
Sequences/identity index better. There are numerous articles and blog posts that explain why GUIDs and so forth make poor indexes.
They are guaranteed to be unique within the table
They can be safely generated by concurrent inserts without collision
They are simple to implement
I guess my next question is, what reasons are you considering GUID's or generated strings? Will you be integrating across distributed databases? If not, you should ask yourself if you are solving a problem that doesn't exist.
Your custom method has two problems:
It uses a global instance of Random, but doesn't use locking. => Multi threaded access can corrupt its state. After which the output will suck even more than it already does.
It uses a predictable 31 bit seed. This has two consequences:
You can't use it for anything security related where unguessability is important
The small seed (31 bits) can reduce the quality of your numbers. For example if you create multiple instances of Random at the same time(since system startup) they'll probably create the same sequence of random numbers.
This means you cannot rely on the output of Random being unique, no matter how long it is.
I recommend using a CSPRNG (RNGCryptoServiceProvider) even if you don't need security. Its performance is still acceptable for most uses, and I'd trust the quality of its random numbers over Random. If you you want uniqueness, I recommend getting numbers with around 128 bits.
To generate random strings using RNGCryptoServiceProvider you can take a look at my answer to How can I generate random 8 character, alphanumeric strings in C#?.
Nowadays GUIDs returned by Guid.NewGuid() are version 4 GUIDs. They are generated from a PRNG, so they have pretty similar properties to generating a random 122 bit number (the remaining 6 bits are fixed). Its entropy source has much higher quality than what Random uses, but it's not guaranteed to be cryptographically secure.
But the generation algorithm can change at any time, so you can't rely on that. For example in the past the Windows GUID generation algorithm changed from v1 (based on MAC + timestamp) to v4 (random).
Use System.Guid as it:
...can be used across all computers and networks wherever a unique identifier is required.
Note that Random is a pseudo-random number generator. It is not truly random, nor unique. It has only 32-bits of value to work with, compared to the 128-bit GUID.
However, even GUIDs can have collisions (although the chances are really slim), so you should use the database's own features to give you a unique identifier (e.g. the autoincrement ID column). Also, you cannot easily turn a GUID into a 4 or 20 (alpha)numeric number.
Contrary to what some people have said in the comment, a GUID generated by Guid.NewGuid() is NOT dependent on any machine-specific identifier (only type 1 GUIDs are, Guid.NewGuid() returns a type 4 GUID, which is mostly random).
As long as you don't need cryptographic security, the Random class should be good enough, but if you want to be extra safe, use System.Security.Cryptography.RandomNumberGenerator. For the Guid approach, note that not all digits in a GUID are random. Quote from wikipedia:
In the canonical representation, xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx, the most significant bits of N indicates the variant (depending on the variant; one, two or three bits are used). The variant covered by the UUID specification is indicated by the two most significant bits of N being 1 0 (i.e. the hexadecimal N will always be 8, 9, A, or B).
In the variant covered by the UUID specification, there are five versions. For this variant, the four bits of M indicates the UUID version (i.e. the hexadecimal M will either be 1, 2, 3, 4, or 5).
Regarding your edit, here is one reason to prefer a GUID over a generated string:
The native storage for a GUID (uniqueidentifier) in SQL Server is 16 bytes. To store a equivalent-length varchar (string), where each "digit" in the id is stored as a character, would require somewhere between 32 and 38 bytes, depending on formatting.
Because of its storage, SQL Server is also able to index a uniqueidentifier column more efficiently than a varchar column as well.

Datastructure choices for highspeed and memory efficient detection of duplicate of strings

I have a interesting problem that could be solved in a number of ways:
I have a function that takes in a string.
If this function has never seen this string before, it needs to perform some processing.
If the function has seen the string before, it needs to skip processing.
After a specified amount of time, the function should accept duplicate strings.
This function may be called thousands of time per second, and the string data may be very large.
This is a highly abstracted explanation of the real application, just trying to get down to the core concept for the purpose of the question.
The function will need to store state in order to detect duplicates. It also will need to store an associated timestamp in order to expire duplicates.
It does NOT need to store the strings, a unique hash of the string would be fine, providing there is no false positives due to collisions (Use a perfect hash?), and the hash function was performant enough.
The naive implementation would be simply (in C#):
Dictionary<String,DateTime>
though in the interest of lowering memory footprint and potentially increasing performance I'm evaluating a custom data structures to handle this instead of a basic hashtable.
So, given these constraints, what would you use?
EDIT, some additional information that might change proposed implementations:
99% of the strings will not be duplicates.
Almost all of the duplicates will arrive back to back, or nearly sequentially.
In the real world, the function will be called from multiple worker threads, so state management will need to be synchronized.
I don't belive it is possible to construct "perfect hash" without knowing complete set of values first (especially in case of C# int with limited number of values). So any kind of hashing requires ability to compare original values too.
I think dictionary is the best you can get with out of box data structures. Since you can store objects with custom comparisons defined you can easily avoid keeping strings in memeory and simply save location where whole string can be obtained. I.e. object with following values:
stringLocation.fileName="file13.txt";
stringLocation.fromOffset=100;
stringLocation.toOffset=345;
expiration= "2012-09-09T1100";
hashCode = 123456;
Where cutomom comparer will return saved hashCode or retrive string from file if needed and perform comparison.
a unique hash of the string would be fine, providing there is no false
positives due to collisions
That's not possible, if you want the hash code to be shorter than the strings.
Using hash codes implies that there are false positives, only that they are rare enough not to be a performance problem.
I would even consider to create the hash code from only part of the string, to make it faster. Even if that means that you get more false positives, it could increase the overall performance.
Provided the memory footprint is tolerable, I would suggest a Hashset<string> for the strings, and a queue to store a Tuple<DateTime, String>. Something like:
Hashset<string> Strings = new HashSet<string>();
Queue<Tuple<DateTime, String>> Expirations = new Queue<Tuple<DateTime, String>>();
Now, when a string comes in:
if (Strings.Add(s))
{
// string is new. process it.
// and add it to the expiration queue
Expirations.Enqueue(new Tuple<DateTime, String>(DateTime.Now + ExpireTime, s));
}
And, somewhere you'll have to check for the expirations. Perhaps every time you get a new string, you do this:
while (Expirations.Count > 0 && Expirations.Peek().Item1 < DateTime.Now)
{
var e = Expirations.Dequeue();
Strings.Remove(e.Item2);
}
It'd be hard to beat the performance of Hashset here. Granted, you're storing the strings, but that's going to be the only way to guarantee no false positives.
You might also consider using a time stamp other than DateTime.Now. What I typically do is start a Stopwatch when the program starts, and then use the ElapsedMilliseconds value. That avoids potential problems that occur during Daylight Saving Time changes, when the system automatically updates the clock (using NTP), or when the user changes the date/time.
Whether the above solution works for you is going to depend on whether you can stand the memory hit of storing the strings.
Added after "Additional information" was posted:
If this will be accessed by multiple threads, I'd suggest using ConcurrentDictionary rather than Hashset, and BlockingCollection rather than Queue. Or, you could use lock to synchronize access to the non-concurrent data structures.
If it's true that 99% of the strings will not be duplicate, then you'll almost certainly need an expiration queue that can remove things from the dictionary.
If memory footprint of storing whole strings is not acceptable, you have only two choices:
1) Store only hashes of strings, which implies possibility of hash collisions (when hash is shorter than strings). Good hash function (MD5, SHA1, etc.) makes this collision nearly impossible to happen, so it only depends whether it is fast enough for your purpose.
2) Use some kind of lossless compression. Strings have usually good compression ratio (about 10%) and some algorithms such as ZIP let you choose between fast (and less efficient) and slow (with high compression ratio) compression. Another way to compress strings is convert them to UTF8, which is fast and easy to do and has nearly 50% compression ratio for non-unicode strings.
Whatever way you choose, it's always tradeoff between memory footprint and hashing/compression speed. You will probably need to make some benchmarking to choose best solution.

Do I need to verify the uniqueness of a GUID?

I saw this function in a source written by my coworker
private String GetNewAvailableId()
{
String newId = Guid.NewGuid().ToString();
while (clientsById.ContainsKey(newId))
{
newId = Guid.NewGuid().ToString();
}
return newId;
}
I wonder if there is a scenario in which the guid might not be unique?
The code is used in a multithread scenario and clientsById is a dictionary of GUID and an object
This should be completely unneccessary - the whole point of GUIDs is to eliminate the need for these sorts of checks :-)
You may be interesting in reading this interesting post on GUID generation algorithms:
GUIDs are globally unique, but substrings of GUIDs aren't (The Old New Thing)
The goal of this algorithm is to use the combination of time and location ("space-time coordinates" for the relativity geeks out there) as the uniqueness key. However, timekeeping is not perfect, so there's a possibility that, for example, two GUIDs are generated in rapid succession from the same machine, so close to each other in time that the timestamp would be the same. That's where the uniquifier comes in. When time appears to have stood still (if two requests for a GUID are made in rapid succession) or gone backward (if the system clock is set to a new time earlier than what it was), the uniquifier is incremented so that GUIDs generated from the "second time it was five o'clock" don't collide with those generated "the first time it was five o'clock".
The only real way that you might ever have a collision is if someone was generating thousands of GUIDs on the same machine while also repeatedly setting the timestamp back to the same exact point in time.
By definition, GUID's are unique (Globally unique identifier). It's unnecessary to check for uniqueness as uniqueness is the purpose of GUID's.
The total number of unique keys is 2128 or 3.4×1038. This number is so
large that the probability of the same number being generated randomly
twice is negligible.
Quote taken from Wikipedia
This check is not needed at all - a GUID is guaranteed to be as unique as it can be, period, and has a really low chance of ever being duplicated, ever.
From MSDN:
A GUID is a 128-bit integer (16 bytes) that can be used across all computers and networks wherever a unique identifier is required. Such an identifier has a very low probability of being duplicated.
And, again from MSDN:
The chance that the value of the new Guid will be all zeros or equal to any other Guid is very low.
To be sure, you'd be the most unlucky developer in the universe if you were to get a single conflicting GUID out of a collection of one thousand within your whole lifetime.
Number of unique GUID's. If you really want to you can put in that check but I don't really see why with these odds.
Number of GUIDs 340,282,366,920,938,463,463,374,607,431,770,000,000 *

'Proper' collection to use to obtain items in O(1) time in C# .NET?

Something I do often if I'm storing a bunch of string values and I want to be able to find them in O(1) time later is:
foreach (String value in someStringCollection)
{
someDictionary.Add(value, String.Empty);
}
This way, I can comfortably perform constant-time lookups on these string values later on, such as:
if (someDictionary.containsKey(someKey))
{
// etc
}
However, I feel like I'm cheating by making the value String.Empty. Is there a more appropriate .NET Collection I should be using?
If you're using .Net 3.5, try HashSet. If you're not using .Net 3.5, try C5. Otherwise your current method is ok (bool as #leppie suggests is better, or not as #JonSkeet suggests, dun dun dun!).
HashSet<string> stringSet = new HashSet<string>(someStringCollection);
if (stringSet.Contains(someString))
{
...
}
You can use HashSet<T> in .NET 3.5, else I would just stick to you current method (actually I would prefer Dictionary<string,bool> but one does not always have that luxury).
something you might want to add is an initial size to your hash. I'm not sure if C# is implemented differently than Java, but it usually has some default size, and if you add more than that, it extends the set. However a properly sized hash is important for achieving as close to O(1) as possible. The goal is to get exactly 1 entry in each bucket, without making it really huge. If you do some searching, I know there is a suggested ratio for sizing the hash table, assuming you know beforehand how many elements you will be adding. For example, something like "the hash should be sized at 1.8x the number of elements to be added" (not the real ratio, just an example).
From Wikipedia:
With a good hash function, a hash
table can typically contain about
70%–80% as many elements as it does
table slots and still perform well.
Depending on the collision resolution
mechanism, performance can begin to
suffer either gradually or
dramatically as more elements are
added. To deal with this, when the
load factor exceeds some threshold, it
is necessary to allocate a new, larger
table, and add all the contents of the
original table to this new table. In
Java's HashMap class, for example, the
default load factor threshold is 0.75.
I should probably make this a question, because I see the problem so often. What makes you think that dictionaries are O(1)? Technically, the only thing likely to be something like O(1) is access into a standard integer-indexed fixed-bound array using an integer index value (there being no look-up in arrays implemented that way).
The presumption that if it looks like an array reference it is O(1) when the "index" is a value that must be looked up somehow, however behind the scenes, means that it is not likely an O(1) scheme unless you are lucky to obtain a hash function with data that has no collisions (and probably a lot of wasted cells).
I see these questions and I even see answers that claim O(1) [not on this particular question, but I do seem them around], with no justification or explanation of what is required to make sure O(1) is actually achieved.
Hmm, I guess this is a decent question. I will do that after I post this remark here.

Categories