GUIDs get used a lot for creating session keys in web applications, and I've always wondered how safe this practice is. Since a GUID is generated from information about the machine and the time, along with a few other factors, how hard is it to guess GUIDs that are likely to come up in the future? Let's say you started 1,000 or 10,000 new sessions to get a good dataset of the GUIDs being generated. Would this make it any easier to generate a GUID that might be used for another session? You wouldn't even have to guess a specific GUID; you could just keep trying GUIDs that might be generated during a specific period of time.
Here is some stuff from Wikipedia (original source):
V1 GUIDs which contain a MAC address
and time can be identified by the
digit "1" in the first position of the
third group of digits, for example
{2f1e4fc0-81fd-11da-9156-00036a0f876a}.
In my understanding, they don't really hide it.
V4 GUIDs use the later algorithm,
which is a pseudo-random number. These
have a "4" in the same position, for
example
{38a52be4-9352-453e-af97-5c3b448652f0}.
More specifically, the 'data3' bit
pattern would be 0001xxxxxxxxxxxx in
the first case, and 0100xxxxxxxxxxxx
in the second. Cryptanalysis of the
WinAPI GUID generator shows that,
since the sequence of V4 GUIDs is
pseudo-random, given the initial state
one can predict up to next 250 000
GUIDs returned by the function
UuidCreate. This is why GUIDs
should not be used in cryptography,
e.g., as random keys.
GUIDs are guaranteed to be unique and that's about it. They are not guaranteed to be random or difficult to guess.
To answer your question: at least for the V1 GUID generation algorithm, if you know the algorithm, the MAC address and the time of creation, you could probably generate a set of GUIDs, one of which would be the one that was actually generated. And if it's a V1 GUID, the MAC address can be determined from sample GUIDs from the same machine.
Additional tidbit from Wikipedia:
The OSF-specified algorithm for
generating new GUIDs has been widely
criticized. In these (V1) GUIDs, the
user's network card MAC address is
used as a base for the last group of
GUID digits, which means, for example,
that a document can be tracked back to
the computer that created it. This
privacy hole was used when locating
the creator of the Melissa worm. Most
of the other digits are based on the
time while generating the GUID.
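To make the version digit and the MAC-derived last group concrete, here is a minimal C# sketch that pulls both out of a GUID's text form. The names GuidFields and Describe are made up for illustration; the code only slices the standard 8-4-4-4-12 layout and does not try to decode the timestamp.

```csharp
using System;

public static class GuidFields
{
    // Returns the version digit and the last group (the "node") of a GUID given as text.
    public static (int Version, string Node) Describe(string text)
    {
        // "D" format is the lowercase 8-4-4-4-12 form without braces.
        string d = Guid.Parse(text).ToString("D");

        // First hex digit of the third group is the version.
        int version = Convert.ToInt32(d.Substring(14, 1), 16);

        // Last group: the MAC address for V1, random bits for V4.
        string node = d.Substring(24);
        return (version, node);
    }
}
```

For the V1 example quoted above, Describe("{2f1e4fc0-81fd-11da-9156-00036a0f876a}") gives version 1 and node 00036a0f876a. For the V4 example it gives version 4, and the last group carries no machine information.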
.NET web applications call Guid.NewGuid() to create a GUID, which in turn ends up calling the CoCreateGuid() COM function a couple of frames deeper in the stack.
From the MSDN Library:
The CoCreateGuid function calls the
RPC function UuidCreate, which creates
a GUID, a globally unique 128-bit
integer. Use the CoCreateGuid function
when you need an absolutely unique
number that you will use as a
persistent identifier in a distributed
environment. To a very high degree of
certainty, this function returns a
unique value – no other invocation, on
the same or any other system
(networked or not), should return the
same value.
And if you check the page on UuidCreate:
The UuidCreate function generates a
UUID that cannot be traced to the
ethernet/token ring address of the
computer on which it was generated. It
also cannot be associated with other
UUIDs created on the same computer.
The last sentence contains the answer to your question. So I would say it is pretty hard to guess, unless there is a bug in Microsoft's implementation.
If someone kept hitting a server with a continuous stream of GUIDs it would be more of a denial of service attack than anything else.
The possibility of someone guessing a GUID is next to nil.
Depends. It is hard if the GUIDs are set up sensibly, e.g. using salted secure hashes and you have plenty of bits. It is weak if the GUIDs are short and obvious.
You may well want to take steps to stop someone creating 10000 new sessions anyway, due to the server load this might create.
"GUIDs are guaranteed to be unique and that's about it". GUIDs are not guaranteed to be unique, at least not the ones generated by CoCreateGuid: "To a very high degree of certainty, this function returns a unique value – no other invocation, on the same or any other system (networked or not), should return the same value."
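If the worry is guessable session keys, one option is to not depend on GUID generation at all and draw the token from the operating system's cryptographic random number generator. A rough sketch, not a description of what ASP.NET itself does; SessionToken and NewToken are hypothetical names and the 32 byte default is an arbitrary choice:

```csharp
using System;
using System.Security.Cryptography;

public static class SessionToken
{
    // Returns a hex string built from cryptographically strong random bytes.
    public static string NewToken(int byteCount = 32)
    {
        byte[] bytes = new byte[byteCount];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        // Two hex characters per byte, lowercase, no separators.
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }
}
```

Collecting thousands of tokens produced this way should not help an attacker predict the next one, which is exactly the property a GUID does not promise.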
We have an application that:
1) Generates a hash code on a string
2) Saves that hash code into a DB along with associated data
3) Later, queries the DB using the string hash code to retrieve the data
This is obviously a bug, because the value returned from string.GetHashCode() varies across .NET versions and architectures (32/64 bit). To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead. What we'd like to do is come up with a quick and dirty fix for now, and refactor the code later to do it the right way.
The quick and dirty fix seems to be creating a static GetInvariantHashCode(string s) helper method that is consistent across architectures.
Can anyone suggest an algorithm for generating a hash code on a string that gives the same result on 32 bit and 64 bit architectures?
A few more notes:
I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key.
I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar
Then just let the database index the strings for you!
Look, I have no idea how large your domain is, but you're going to get collisions very rapidly with very high likelihood if it's of any decent size at all. It's the birthday problem with a lot of people relative to the number of birthdays. You're going to have collisions, and lose any gain in speed you might think you're gaining by not just indexing the strings in the first place.
Anyway, you don't need us if you're stuck a few days away from release and you really need a hash code that is invariant across platforms. There are really dumb, really fast implementations of hash code out there that you can use. Hell, you could come up with one yourself in the blink of an eye:
string s = "Hello, world!";
int hash = 17;
foreach (char c in s) {
    // use the character's numeric value; char.GetHashCode() is itself an implementation detail
    unchecked { hash = hash * 23 + c; }
}
Or you could use the old Bernstein hash. And on and on. Are they going to give you the performance gain you're looking for? I don't know, they weren't meant to be used for this purpose. They were meant to be used for balancing hash tables. You're not balancing a hash table. You're using the wrong concept.
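For reference, the Bernstein hash mentioned above (often called djb2) looks roughly like this in C#. It works on the UTF-16 code units of the string, so the result depends only on the characters and not on the runtime version or bitness. InvariantHash is a made-up class name; GetInvariantHashCode is the helper name from the question.

```csharp
public static class InvariantHash
{
    // Bernstein (djb2) hash: start at 5381, multiply by 33 and add each character.
    public static int GetInvariantHashCode(string s)
    {
        unchecked
        {
            uint hash = 5381;
            foreach (char c in s)
            {
                hash = (hash * 33u) + (uint)c; // overflow wraps around on purpose
            }
            return (int)hash;
        }
    }
}
```

It will still collide, of course, so the exact-match post-processing described in the question has to stay.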
Edit (the below was written before the question was edited with new salient information):
You can't do this, at all, theoretically, without some kind of restriction on your input space. Your problem is far more severe than String.GetHashCode differing from platform to platform.
There are a lot of instances of string. In fact, way more instances than there are instances of Int32. So, because of the pigeonhole principle, you will have collisions. You can't avoid this: your strings are pigeons and your Int32 hash codes are pigeonholes and there are too many pigeons to go in the pigeonholes without some pigeonhole getting more than one pigeon. Because of collision problems, you can't use hash codes as unique keys for strings. It doesn't work. Period.
The only way you can make your current proposed design work (using Int32 as an identifier for instances of string) is if you restrict your input space of strings to something that has a size less than or equal to the number of Int32s. Even then, you'll have difficulty coming up with an algorithm that maps your input space of strings to Int32 in a unique way.
Even if you try to increase the number of pigeonholes by using SHA-512 or whatever, you still have the possibility of collisions. I doubt you considered that possibility previously in your design; this design path is DOA. And that's not what SHA-512 is for anyway, it's not to be used for unique identification of messages. It's just to reduce the likelihood of message forgery.
To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead.
Well, then you have a tremendous amount of work ahead of you. I'm sorry you discovered this so late in the game.
I note the documentation for String.GetHashCode:
The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.
And from Object.GetHashCode:
The GetHashCode method is suitable for use in hashing algorithms and data structures such as a hash table.
Hash codes are for balancing hash tables. They are not for identifying objects. You could have caught this sooner if you had used the concept for what it is meant to be used for.
You should just use SHA512.
Note that hashes are not (and cannot be) unique.
If you need it to be unique, just use the identity function as your hash.
You can use one of the managed cryptography classes (such as SHA512Managed) to compute a platform-independent hash via ComputeHash. This requires converting the string to a byte array first (e.g. using Encoding.GetBytes or some other method). It will be slow, but it will be consistent.
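A minimal sketch of that idea, assuming you want something that still fits in a 64 bit database column: hash the UTF-8 bytes with SHA-512 and keep the first 8 bytes, assembled in a fixed byte order. StableHash and ToInt64Key are hypothetical names. Truncating makes collisions more likely than with the full digest, so checking for an exact match afterwards is still needed.

```csharp
using System.Security.Cryptography;
using System.Text;

public static class StableHash
{
    // Derives a 64-bit key from the SHA-512 digest of the string's UTF-8 bytes.
    public static long ToInt64Key(string s)
    {
        using (SHA512 sha = SHA512.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(s));

            // Assemble the first 8 bytes big-endian by hand so the result
            // does not depend on the machine's byte order.
            long key = 0;
            for (int i = 0; i < 8; i++)
            {
                key = (key << 8) | digest[i];
            }
            return key;
        }
    }
}
```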
That being said, a hash is not guaranteed unique, and is really not a proper mechanism for uniqueness in a database. Using a hash to store data is likely to cause data to get lost, as the first hash collision will overwrite old data (or throw away new data).
I'm indexing some URLs based on their hash code and use this hash to retrieve them. I have 2 questions in this matter:
Do you think this is a good approach? I mean sometimes two different URLs can produce the same hash but I don't seem to have any other choice since URLs can be very long and I need to produce a file name for them.
[More important] Sometimes two different URLs actually refer to the same page (e.g. http://www.stackoverflow.com and http://stackoverflow.com, and sometimes URLs with % characters), but I need to produce the same hash code for these URLs. What do you suggest?
Thanks.
Definitely don't use the .NET String hash code - there's no guarantee it'll do the same thing between versions (and it did actually change between .NET 1.1 and .NET 2.0). It's also quite likely to have collisions, and is very short (32 bits).
If you really have to use a hash, use a cryptographic hash as that's much less likely to result in collisions - you could use SHA-256, for example. Note that crypto hashes generally work in terms of binary data, so you'll need to convert the URL to a byte array first, e.g. with Encoding.UTF8.GetBytes(text). This isn't foolproof, but it's at least "very unlikely" to produce collisions. Of course, as the hash is rather bigger, your output filename will be bigger too. (You'll need to convert from a byte[] to a string as well, I assume - I suggest you use Convert.ToBase64String).
Does your filename really have to be derived from the URL though? Couldn't you just generate random filenames (or increment a counter) and then store the mapping between URL and filename somewhere? That's a much more sensible approach IMO - and it's reversible (so you can tell which URL generated a particular file).
As for your second question - basically you'll need to find some way of deriving a canonical URL from any given URL, so that all "equivalent" URLs are converted to the same canonical one... and that's what you hash or store.
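A sketch of both suggestions together, under stated assumptions: System.Uri already lowercases the scheme and host when it parses a URL, and the code additionally drops the fragment. What else counts as "equivalent" (for example treating www.example.com and example.com as the same site) is a policy decision this sketch does not make. The names UrlFileNames, Canonicalize and FileNameFor are made up, and hex is used instead of Base64 because Base64 output can contain '/', which is awkward in file names.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlFileNames
{
    // Reduces a URL to scheme + host + path + query, dropping any #fragment.
    public static string Canonicalize(string url)
    {
        Uri uri = new Uri(url, UriKind.Absolute);
        return uri.GetLeftPart(UriPartial.Query);
    }

    // SHA-256 of the canonical URL, rendered as 64 lowercase hex characters.
    public static string FileNameFor(string url)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(Canonicalize(url)));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

With this, URLs that differ only in the case of the host or in the fragment end up with the same file name, while everything else is left alone.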
Indexing based on hash codes is a path to bugs. Hash codes are not unique and do have collisions. If you index on a hash code it will lead to a situation where two non-equal values end up retrieving the same mapped value from your data table.
After lots of discussion and thinking, since there is no answer that completely answers my questions, I'm going to answer my own question. The one important thing is that the comment posted by Morten Mertner is the closest thing to my answer, but I cannot select it as an answer.
There is no other way for me except using a hash algorithm. But to reduce the risk of duplicates, I should use better algorithms like the SHA-2 ones.
As Morten Mertner said, in some cases the mentioned URLs are NOT actually the same and I cannot assume that the website is configured correctly. The only thing I can do is remove the bookmarks and consistently use either the encoded or the decoded version of the URL (the versions with or without % characters).
Thanks for all of the help guys.
Let's say I want to set a GUID to be my application's assembly GUID. From what I found on the internet, we can use Guid.NewGuid() to get a new unique value.
I cannot figure out how my GUID is guaranteed to be unique against others. Please explain if you know how.
The only guarantee you have is that probability is on your side. 2^128 possible GUIDs and some cleverness in the creation process makes it very unlikely you will ever see a duplicate.
It seems V4 is the standard GUID on Windows now. If that one is purely based on a pseudo-random number generator, as Wikipedia seems to indicate, it's affected by the Birthday problem.
I've seen several examples using 128 bits to show that a duplicate is almost impossible. Those often miss two things: the birthday problem, and that a V4 GUID actually has only 122 random bits (4 bits are fixed for the version and 2 for the variant).
You need 1/2+sqrt(1/4-2*2^122*ln(0.5)) ≈ 2.7*10^18 GUIDs to reach a 50% chance of a duplicate. That is still a lot, but 50% may not be the deal you are looking for. Say you want the chance of a duplicate to be one in a million; then you can have sqrt(2*2^122*ln(1/(1-0.000001))) ≈ 3.3*10^15 GUIDs. If you create a thousand GUIDs per second, you could keep doing that for a little over 100,000 years before reaching a one in a million risk of a duplicate: 3.26*10^15/(3600*24*365.25*1000) ≈ 103,000.
The chance of me getting all of those calculations correct →0.
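If you want to re-check figures like these yourself, the usual birthday approximation n ≈ sqrt(2 * N * ln(1/(1-p))) with N = 2^bits is easy to evaluate in code. A small sketch; Birthday and GuidsForRisk are made-up names, and you plug in however many random bits you believe the generator really provides:

```csharp
using System;

public static class Birthday
{
    // Approximate number of random values you can draw from a space of
    // 2^randomBits before the chance of at least one duplicate reaches p.
    public static double GuidsForRisk(int randomBits, double p)
    {
        double spaceSize = Math.Pow(2.0, randomBits);
        return Math.Sqrt(2.0 * spaceSize * Math.Log(1.0 / (1.0 - p)));
    }
}
```

Dividing the result by the number of GUIDs you create per year then gives the time until that risk level is reached.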
It isn't, but the way it is generated and the way it is represented make the probability of generating two identical GUIDs in this millennium almost zero.
See: Simple proof that GUID is not unique
From http://en.wikipedia.org/wiki/Globally_unique_identifier:
Algorithm
In the OSF-specified algorithm for generating new (V1) GUIDs, the user's network card MAC address is used as a base for the last group of GUID digits, which means, for example, that a document can be tracked back to the computer that created it. This privacy hole was used when locating the creator of the Melissa worm[2]. Most of the other digits are based on the time while generating the GUID.
V1 GUIDs which contain a MAC address and time can be identified by the digit "1" in the first position of the third group of digits, for example {2f1e4fc0-81fd-11da-9156-00036a0f876a}.
V4 GUIDs use the later algorithm, which is a pseudo-random number. These have a "4" in the same position, for example {38a52be4-9352-453e-af97-5c3b448652f0}. More specifically, the 'data3' bit pattern would be 0001xxxxxxxxxxxx in the first case, and 0100xxxxxxxxxxxx in the second. Cryptanalysis of the WinAPI GUID generator shows that, since the sequence of V4 GUIDs is pseudo-random, given full knowledge of the internal state it is possible to predict previous and subsequent values.[3]
I need to generate and validate product keys and have been thinking about using a public/private key system.
I generate our product keys based on
a client name (which could be a variable length string)
a 6 digit serial number.
It would be good if the product key would be of a manageable length (16 characters or so)
I need to encrypt them at the base and then distribute the decryption/validation system. As our system is written in managed code (.NET), we don't want to distribute the encryption system, only the decryption. A public/private key pair seems a good way to do this: encrypt with the one key that I keep, and distribute the other key needed for decryption/verification.
What is an appropriate mechanism to do this with the above requirements?
NOTE: It's not to stop piracy; it's to reduce the likelihood of novice users installing components they don't need or are not authorised to use.
.NET supports public key encryption in various ways, such as http://msdn.microsoft.com/en-us/library/ms867080.aspx. Having said this, all you'd gain is some confidence that someone with full access to the released code would not have the ability to issue their own product keys. None of this stops them from patching the client to accept anything as a valid key. That's where obfuscation fits in.
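What the question describes (create with a key you keep, check with a key you ship) is a digital signature rather than encryption. A bare-bones sketch with the RSA class from System.Security.Cryptography; LicenseSigner, Sign and Verify are hypothetical names, and the payload format is left to the caller. Note that a signature made with a 2048 bit RSA key is 256 bytes long, so it will not fit into a 16 character product key on its own.

```csharp
using System.Security.Cryptography;
using System.Text;

public static class LicenseSigner
{
    // Vendor side: needs an RSA instance that holds the private key.
    public static byte[] Sign(RSA privateKey, string payload)
    {
        return privateKey.SignData(Encoding.UTF8.GetBytes(payload),
            HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
    }

    // Client side: an RSA instance holding only the public key is enough.
    public static bool Verify(RSA publicKey, string payload, byte[] signature)
    {
        return publicKey.VerifyData(Encoding.UTF8.GetBytes(payload), signature,
            HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
    }
}
```

The payload here would be something like the client name plus the serial number. As the answer above says, this only stops people from issuing their own keys; it does nothing against a patched client.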
Don't even try to get fancy with anti-piracy. It's not worth it. I've cracked countless applications (hush) and .NET ones are by FAR the easiest. But in reality, they're all relatively easy with enough experience. If you don't believe me, check out isohunt some time.
tl;dr: It's a losing battle. Don't fight it. If you really want to win, sue infringers - but even that makes you lose.
I did something very similar. But in my case it was a simple telephone authorisation code. User would phone a number, give their company name and the operation they were performing, get a code, type it into the application and then be able to proceed.
What I did was serialise a piece of data into binary. The data included the hashed company name, operation code/expiration date, and had space to spare for future requirements. I then scattered the bits around the array to confuse it. Then I mapped each 5 bits of the binary array onto a 32 character auth-code alphabet (0-9,a-z,excluding I/O/Q/S for readability over telephone).
This resulted in a nice auth-code which was 16 characters, displayed as 4x4 blocks (####-####-####-####). It could be easily read out over the telephone, as the user only had to listen to four characters at a time, or even sent via SMS.
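The 5-bits-per-character mapping is simple to sketch. This is not the poster's actual code: the alphabet below is just one possible 32 character set matching the description (digits plus lowercase letters without i, o, q and s), the bit scattering step is left out, and AuthCode and Encode are made-up names. 80 bits (10 bytes) give exactly 16 characters.

```csharp
using System;
using System.Text;

public static class AuthCode
{
    // 10 digits + 22 letters = 32 symbols, so each symbol carries 5 bits.
    private const string Alphabet = "0123456789abcdefghjklmnprtuvwxyz";

    // Encodes exactly 10 bytes as ####-####-####-####.
    public static string Encode(byte[] data)
    {
        if (data == null || data.Length != 10)
            throw new ArgumentException("Expected exactly 10 bytes (80 bits).", nameof(data));

        var sb = new StringBuilder(19);
        for (int i = 0; i < 16; i++)
        {
            // Collect the next 5 bits, most significant bit first.
            int value = 0;
            for (int bit = 0; bit < 5; bit++)
            {
                int index = i * 5 + bit;
                int b = (data[index / 8] >> (7 - index % 8)) & 1;
                value = (value << 1) | b;
            }
            if (i > 0 && i % 4 == 0) sb.Append('-'); // group into blocks of four
            sb.Append(Alphabet[value]);
        }
        return sb.ToString();
    }
}
```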
As with your problem, it wasn't intended to stop the code crackers at Bletchley Park, but was enough to stop the average office worker from doing something without following company procedure. And, given that scope, has been very effective.
We use GUIDs extensively in our database design; Business Object properties provide Guid.Empty GUIDs for DB null values and null is always saved to the database if the value is Guid.Empty.
Apart from Guid.Empty (00000000-0000-0000-0000-000000000000), how likely is it that a GUID will be generated with all the same characters, e.g. 11111111-1111-1111-1111-111111111111?
Just considering using GUIDs like these for specific values.
In short: For GUID generated according to the published standards and specifications it simply can't happen. A GUID has a structure and some of the fields actually have a meaning. What's more, .NET generates GUIDs of version 4, where it absolutely can't happen. They are defined in a way that there won't be such a GUID. For details, see below ;-)
There are five to seven bits which are the main pitfall here. Those are the version identifier (the first four bits of part three) and the variant field specifying what variant of GUID this is.
The version can be anything between 1 and 5 currently. So the only valid hex digits for which we could get such a GUID at this point are – obviously – 1 through 5.
Let's dissect the versions a little:
1) MAC address and timestamp. Both are probably hard to coax into all-1 digits.
2) MAC address and timestamp as well as user IDs. Same as for v1.
3) MD5 hash. Could possibly even work.
4) PRNG. Can't ever work since the first digit of the fourth part is always either 8, 9, A or B. This contradicts the 4 for the version number.
5) SHA-1 hash. Could possibly even work.
So far we have ruled out version 4 as impossible and the others as highly unlikely. Let's take a look at the variant field.
The variant field specifies some bit patterns for backwards compatibility (x is a don't care), namely:
0 x x Reserved. NCS backward compatibility.
1 0 x The only pattern that currently can appear
1 1 0 Reserved, Microsoft Corporation backward compatibility
1 1 1 Reserved for future definition.
Since this pattern is at the very start of the fourth part, the most significant bit of the first hex digit of the fourth part is always set. This means that this digit can never be 1, 2, 3 or 5 (not counting already generated GUIDs, of course). But those with the MSB set to 0 happen to be either v1 or v2, and the timestamp part of those means that they would have to be generated some millennia in the future for that to work out.
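Those two checks (version digit and variant bits) can be written down directly. A sketch with made-up names, GuidShape and LooksLikeRfc4122; it only inspects the two digits discussed above and says nothing about whether a given GUID was ever actually generated.

```csharp
using System;

public static class GuidShape
{
    // True if the version digit is 1-5 and the variant bits are 1 0 x.
    public static bool LooksLikeRfc4122(Guid g)
    {
        string d = g.ToString("D");  // lowercase 8-4-4-4-12 form
        char version = d[14];        // first digit of the third part
        char variant = d[19];        // first digit of the fourth part

        bool versionOk = version >= '1' && version <= '5';
        bool variantOk = "89ab".IndexOf(variant) >= 0; // binary 10xx
        return versionOk && variantOk;
    }
}
```

11111111-1111-1111-1111-111111111111 passes the version check but fails the variant check, and the same holds for the all-2, all-3, all-4 and all-5 GUIDs, which is why none of them looks like something a conforming generator produces today.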
There are exactly 5,316,911,983,139,663,491,615,228,241,121,400,000 possible combinations, so even if it wasn't designed to always be unique, the chances would be pretty remote anyway.
Source: http://msdn.microsoft.com/en-us/library/aa446557.aspx
About as likely as any other randomly generated guids will collide. So, highly unlikely.
Though, you may want to rethink using guids to "store" data like that. They are really used to uniquely identify objects and components.
Your question has already been answered, but I thought I'd be pragmatic here.
1) You will only give yourself 8 "hard-coded" options using this convention.
2) You could just create a real GUID for each of these "special" cases instead of hand-cranking them. That way, it is guaranteed to be unique and you'll be able to have more than 8.
That's not a direct answer, I know, but it is probably a sensible suggestion given your intentions.
GUIDs are usually generated using an algorithm, rather than being a genuinely random string of hex characters. If you can be sure which algorithm is being used to generate them, you can then be sure whether the GUIDs you want to use as "magic numbers" are going to collide with generated ones.
The Wikipedia page on GUIDs has a decent amount of information regarding the algorithms that are used, so that may be able to give you a definitive answer. Alternatively, you could run Reflector over the Guid.NewGuid() method in the .NET Framework, though based on the reference source this method calls out to CoCreateGuid in OLE32.
Why use a specially designed GUID? Besides the recognizable aspect, why not just use a properly generated GUID? (You know it will be unique; that is the point.)
For my last company, we used guids as primary keys for tables for all of our databases. In all we instantiated more than 1,000,000,000 objects and never had any issues.
Very, very low. The GUID format includes a few bits identifying the scheme. If you "generate" them yourself, the GUID scheme would most likely be wrong.
I've been sitting here at https://www.guidgenerator.com/online-guid-generator.aspx
hitting the generate GUID button every second since January 28, 2010. It's now February 10, 2021, and I haven't gotten a duplicate.