Uniqueness for a shortened guid

I have to append a unique code as a query string to every URL I generate.
So the option I chose is to shorten a GUID (found here on SO):
public static string CreateGuid()
{
    Guid guid = Guid.NewGuid();
    return Convert.ToBase64String(guid.ToByteArray());
}
Will this be as unique as a GUID? I have several URLs to generate, and this value will be saved in the DB.

Yup, the default string representation of a GUID is base16 (hex). By reformatting the same 128-bit value as base64 you lose no information, so it is exactly as unique; you just get a shorter (but possibly uglier) string.

You should watch out if you are using this in a URL. While the string will be shorter, base64 output can contain characters that are not safe in URLs (+, / and the = padding), so you may need to run it through
HttpUtility.UrlEncode()
to be safe. Of course, once you do that, it will get a little longer again.
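One way to avoid the extra encoding step is to swap the problematic characters yourself. The following is a minimal sketch, not part of the original answer; the class name ShortGuid and its method names are made up for illustration. It trims the "==" padding and replaces + and / with - and _, giving a 22-character string that can be converted back to the same GUID.

```csharp
using System;

static class ShortGuid
{
    // 16 bytes always encode to 24 base64 characters ending in "==".
    public static string Encode(Guid guid)
    {
        return Convert.ToBase64String(guid.ToByteArray())
            .TrimEnd('=')        // drop the padding (always two characters here)
            .Replace('+', '-')   // '+' would be read as a space in a query string
            .Replace('/', '_');  // '/' is a path separator
    }

    public static Guid Decode(string text)
    {
        // Undo the substitutions and restore the padding before decoding.
        string base64 = text.Replace('-', '+').Replace('_', '/') + "==";
        return new Guid(Convert.FromBase64String(base64));
    }
}
```

Because the transformation is reversible, the shortened form carries exactly the same 128 bits as the original GUID.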
Edit:
Your comment makes it seem like you want some sort of math, so here goes:
Let's assume that you have 24 alphanumeric characters all the time, and casing does not matter. That means each character can be 0-9 or a-z, i.e. 36 possibilities. That makes 36^24 different possible strings. Refer to this website then:
http://davidjohnstone.net/pages/hash-collision-probability
It lets you plug in the number of possible values and the number of times you will need to run your code. 36^24 is roughly 2^124, so entering 100 in the "number of bits in your hash" field is a conservative choice. With that setting, if you run your code 1,000,000 times, the probability of a collision (the same value coming up twice) is still only about 3.9443×10^-19. That's minuscule, but you may run into issues if you are writing something that will be used many, many more times than that.
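The figure from the calculator can be reproduced with the usual birthday approximation, p ≈ n² / 2^(bits+1), which holds while p is small. This is a rough sketch rather than the calculator's actual code; the class and method names are invented for illustration.

```csharp
using System;

static class CollisionOdds
{
    // Birthday approximation: probability that at least two of n random
    // values drawn from 2^bits possibilities coincide (valid for small p).
    public static double Approximate(double n, int bits)
    {
        return n * n / Math.Pow(2, bits + 1);
    }
}
```

For example, CollisionOdds.Approximate(1e6, 100) gives roughly 3.9e-19, and using more bits makes the result smaller still.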

Related

How do I encode a number so that small changes result in very different encodings?

I'm working in C#. I have an unsigned 32-bit integer i that is incremented gradually in response to an outside user controlled event. The number is displayed in hexadecimal as a unique ID for the user to be able to enter and look up later. I need i to display a very different 8 character string if it is incremented or two integers are otherwise close together in value (say, distance < 256). So for example, if i = 5 and j = 6 then:
string a = Encoded(i); // = "AF293E5B"
string b = Encoded(j); // = "CD2429A4"
The limitations on this are:
I don't want an obvious pattern in how the string changes in each increment.
The process needs to be reversible, so if given the string I can generate the original number.
Each generated string needs to be unique across the entire range of 32-bit unsigned integers, so that two numbers never produce the same string.
The algorithm to produce the string should be fairly easy to implement and maintain for both encoding and decoding (maybe 30 lines each or less).
However:
The algorithm does not need to be cryptographically secure. The goal is obfuscation more than encryption. The number itself is not secret, it just needs to not obviously be an incrementing number.
It is alright if looking at a large list of incremented numbers a human can discern a pattern in how the strings are changing. I just don't want it to be obvious if they are "close".
I recognize that a Minimal Perfect Hash Function meets these requirements, but I haven't been able to find one that will do what I need or learn how to derive one that will.
I have seen this question, and while it is along similar lines, I believe my question is more specific and precise in its requirements. The answer given for that question (as of this writing) references 3 links for possible implementations, but not being familiar with Ruby I'm not sure how to get at the code for the "obfuscate_id" (first link), Skipjack feels like overkill for what I need (2nd link), and Base64 does not use the character set I'm interested in (hex).

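y = p * x mod q is reversible if p and q are coprime: you recover x by multiplying y by the modular inverse of p. In particular, mod 2^32 is easy (unsigned 32-bit multiplication already wraps around), and any odd number is coprime to 2^32. Now with p = 17 the outputs 17, 34, 51, ... are a bit too easy to spot, but the pattern is less obvious for 2^31 < p < 2^32 - 2^30 (0x80000001-0xBFFFFFFF).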
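A minimal sketch of this idea in C# follows. The multiplier 0x9E3779B1 is an arbitrary odd value inside the suggested range, chosen here only for illustration, and the class name is made up; Encoded matches the name used in the question. The inverse of p modulo 2^32 is computed with a few Newton iterations, each of which doubles the number of correct low bits.

```csharp
using System;
using System.Globalization;

static class MultiplicativeObfuscator
{
    // Assumption: any odd multiplier works; this one is just an example.
    const uint P = 0x9E3779B1;

    static readonly uint PInverse = ModInverse(P);

    // Inverse of an odd p modulo 2^32. p * p == 1 (mod 8), so p itself is
    // correct to 3 bits; each iteration doubles that (3, 6, 12, 24, 48).
    static uint ModInverse(uint p)
    {
        uint inv = p;
        for (int i = 0; i < 5; i++)
            inv = unchecked(inv * (2u - p * inv));
        return inv;
    }

    // Multiply modulo 2^32 and print as 8 hex digits.
    public static string Encoded(uint x)
    {
        return unchecked(P * x).ToString("X8");
    }

    // Parse the hex digits and multiply by the inverse to undo the encoding.
    public static uint Decoded(string s)
    {
        uint y = uint.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return unchecked(PInverse * y);
    }
}
```

Consecutive inputs differ by P in the output, so anyone who sees a long run of codes can still work out the pattern; that is acceptable under the stated requirements but worth keeping in mind.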

Searching for partial substring within string in C#

Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.
For example:
{
    System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
gets changed to:
{
    System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be somewhere within the entire hex of the program. How could I take my base signature and look for partial matches, so that, say, a 90% match gets flagged?
I would do a wildcard but that wouldn't work for slightly more complex things where it might be coded slightly different but the majority would be the same. So is there a way I can do a percent match for a substring? I was looking into the Levenshtein Distance but I don't see how I'd apply it into this given scenario.
Thanks in advance for any input

Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
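Since the signature has to be found somewhere inside a much larger file, a plain string-to-string edit distance is not quite enough. One common variation of the Levenshtein dynamic program lets the match start anywhere in the text for free and reports the smallest distance between the pattern and any substring. The sketch below assumes that approach; the class name, method names, and the similarity parameter are illustrative and not from the answer.

```csharp
using System;

static class FuzzySearch
{
    // Smallest edit distance between pattern and ANY substring of text.
    // Runs in O(pattern.Length * text.Length) time.
    public static int BestSubstringDistance(string pattern, string text)
    {
        int m = pattern.Length;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int i = 0; i <= m; i++) prev[i] = i; // matching against empty text
        int best = prev[m];

        for (int j = 1; j <= text.Length; j++)
        {
            curr[0] = 0; // a match may start at any position in the text
            for (int i = 1; i <= m; i++)
            {
                int cost = pattern[i - 1] == text[j - 1] ? 0 : 1;
                curr[i] = Math.Min(
                    Math.Min(prev[i] + 1, curr[i - 1] + 1), // extra or missing character
                    prev[i - 1] + cost);                    // match or substitution
            }
            best = Math.Min(best, curr[m]);
            (prev, curr) = (curr, prev);
        }
        return best;
    }

    // True if some substring matches with at least minSimilarity (e.g. 0.9m).
    public static bool IsMatch(string pattern, string text, decimal minSimilarity)
    {
        int allowedEdits = (int)(pattern.Length * (1m - minSimilarity));
        return BestSubstringDistance(pattern, text) <= allowedEdits;
    }
}
```

On large binaries this quadratic scan gets slow, so in practice you would probably narrow the candidates first (for example with an exact match on a short prefix) before running it.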

First, your program bits should match EXACTLY, or else the file has been modified or is corrupt. Generally, you will store an MD5 hash of the original binary and check the MD5 of new versions against it to see whether they are the same (a matching hash can't guarantee a 100% identical file, because collisions are possible).
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time the virus runs, it modifies itself, so the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).

I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/replace operations are needed to turn one string into another; the latter additionally counts swapping two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.

Coupon code generation

I would like to generate coupon codes , e.g. AYB4ZZ2. However, I would also like to be able to mark the used coupons and limit their global number, let's say N. The naive approach would be something like "generate N unique alphanumeric codes, put them into database and perform a db search on every coupon operation."
However, as far as I realize, we can also attempt to find a function MakeCoupon(n), which converts the given number into a coupon-like string with predefined length.
As far as I understand, MakeCoupon should fulfill the following requirements:
1. Be bijective. Its inverse MakeNumber(coupon) should be efficiently computable.
2. Output for MakeCoupon(n) should be alphanumeric and should have a small, constant length, so that it could be called human-readable. E.g. a SHA1 digest wouldn't pass this requirement.
3. Practical uniqueness. Results of MakeCoupon(n) for every natural n <= N should be totally unique, or unique in the same terms as, for example, MD5 is unique (with the same extremely small collision probability).
4. (This one is tricky to define.) It shouldn't be obvious how to enumerate all remaining coupons from a single coupon code; let's say MakeCoupon(n) and MakeCoupon(n + 1) should visually differ.
E.g. a MakeCoupon(n) that simply outputs n padded with zeroes would fail this requirement, because 000001 and 000002 don't actually differ "visually".
Q:
Does any function or function generator that fulfills the above requirements exist? My search attempts only led me to [CPAN] CouponCode, but it does not fulfill the requirement that the corresponding function be bijective.

Basically you can split your operation into two parts:
1. Somehow "encrypt" your initial number n, so that two consecutive numbers yield (very) different results.
2. Construct your "human-readable" code from the result of step 1.
For step 1 I'd suggest using a simple block cipher (e.g. a Feistel cipher with a round function of your choice). See also this question.
Feistel ciphers work in several rounds. During each round, some round function is applied to one half of the input, the result is XORed with the other half, and the two halves are swapped. The nice thing about Feistel ciphers is that the round function does not have to be invertible (the input to the round function is retained unmodified after each round, so the result of the round function can be reconstructed during decryption). Therefore you can choose whatever crazy operation(s) you like :). Also, Feistel ciphers are symmetric (decryption reuses the same structure), which fulfills your first requirement.
A short example in C#:
const int BITCOUNT = 30;
const int BITMASK = (1 << BITCOUNT/2) - 1;
static uint roundFunction(uint number) {
    return (((number ^ 47894) + 25) << 1) & BITMASK;
}
static uint crypt(uint number) {
    uint left = number >> (BITCOUNT/2);
    uint right = number & BITMASK;
    for (int round = 0; round < 10; ++round) {
        left = left ^ roundFunction(right);
        uint temp = left; left = right; right = temp;
    }
    return left | (right << (BITCOUNT/2));
}
(Note that after the last round there is no swapping, in the code the swapping is simply undone in the construction of the result)
Apart from fulfilling your requirements 3 and 4 (the function is a total bijection, so different inputs give different outputs, and the input is "totally scrambled" according to your informal definition), it is also its own inverse (thus implicitly fulfilling requirement 1), i.e. crypt(crypt(x))==x for each x in the input domain (0..2^30-1 in this implementation). It is also cheap in terms of performance.
For step 2 just encode the result to some base of your choice. For instance, to encode a 30-bit number, you could use 6 "digits" of an alphabet of 32 characters (so you can encode 6*5=30 bits).
An example for this step in C#:
const string ALPHABET = "AG8FOLE2WVTCPY5ZH3NIUDBXSMQK7946";
static string couponCode(uint number) {
    StringBuilder b = new StringBuilder();
    for (int i = 0; i < 6; ++i) {
        b.Append(ALPHABET[(int)number & ((1 << 5) - 1)]);
        number = number >> 5;
    }
    return b.ToString();
}
static uint codeFromCoupon(string coupon) {
    uint n = 0;
    for (int i = 0; i < 6; ++i)
        n = n | (((uint)ALPHABET.IndexOf(coupon[i])) << (5 * i));
    return n;
}
For inputs 0 - 9 this yields the following coupon codes
0 => 5VZNKB
1 => HL766Z
2 => TMGSEY
3 => P28L4W
4 => EM5EWD
5 => WIACCZ
6 => 8DEPDA
7 => OQE33A
8 => 4SEQ5A
9 => AVAXS5
Note that this approach has two different internal "secrets": first, the round function together with the number of rounds used, and second, the alphabet you use for encoding the encrypted result. But also note that the shown implementation is in no way secure in a cryptographic sense!
Also note that the shown function is a total bijective function, in the sense that every possible 6-character code (with characters out of your alphabet) will yield a unique number. To prevent anyone from entering just some random code, you should define some kind of restriction on the input number, e.g. only issue coupons for the first 10,000 numbers. Then the probability of a random coupon code being valid would be 10000/2^30 ≈ 0.00001 (on average it would take about 100,000 attempts to find a correct coupon code). If you need more "security", you can just increase the bit size/coupon code length (see below).
EDIT: Change Coupon code length
Changing the length of the resulting coupon code requires some math: The first (encrypting) step only works on a bit string with even bit count (this is required for the Feistel cipher to work).
In the second step, the number of bits that can be encoded using a given alphabet depends on the "size" of the chosen alphabet and the length of the coupon code. This "entropy", given in bits, is in general not an integer, let alone an even integer. For example:
A 5-digit code using a 30-character alphabet results in 30^5 possible codes, which means ld(30^5) = 24.53 bits per coupon code (ld is the base-2 logarithm).
For a four-digit code, there is a simple solution: given a 32-character alphabet you can encode ld(32^4) = 5*4 = 20 bits. So you can just set BITCOUNT to 20 and change the for loops in the second part of the code to run until 4 (instead of 6).
Generating a five-digit code is a bit trickier and somewhat "weakens" the algorithm: you can set BITCOUNT to 24 and just generate a 5-digit code from an alphabet of 30 characters (remove two characters from the ALPHABET string and let the for loops run until 5).
But this will not generate all possible 5-digit codes: with 24 bits you can only get 16,777,216 possible values from the encryption stage, while the 5-digit codes could encode 24,300,000 possible numbers, so some possible codes will never be generated. More specifically, the last position of the code will never contain some characters of the alphabet. This can be seen as a drawback, because it narrows down the set of valid codes in an obvious way.
When decoding a coupon code, you'll first have to run the codeFromCoupon function and then check whether bit 25 of the result is set. This would mark an invalid code that you can immediately reject. Note that, in practice, this might even be an advantage, since it allows a quick check (e.g. on the client side) of the validity of a code without giving away all internals of the algorithm.
If bit 25 is not set you'll call the crypt function and get the original number.
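Putting the two steps together with the "only issue the first 10,000 numbers" restriction might look like the sketch below. It reuses the round function, round count, and alphabet from the example above; the class name CouponCodec, the method names, and the IssuedCount constant are illustrative assumptions. As stated above, none of this is cryptographically secure.

```csharp
using System;
using System.Text;

static class CouponCodec
{
    const int BitCount = 30;
    const uint HalfMask = (1u << (BitCount / 2)) - 1;
    const string Alphabet = "AG8FOLE2WVTCPY5ZH3NIUDBXSMQK7946";
    const int CodeLength = 6;        // 6 characters * 5 bits = 30 bits
    const uint IssuedCount = 10000;  // assumption: only numbers below this are valid

    static uint RoundFunction(uint n)
    {
        return (((n ^ 47894u) + 25u) << 1) & HalfMask;
    }

    // Feistel network; applying it twice returns the original 30-bit value.
    static uint Crypt(uint number)
    {
        uint left = number >> (BitCount / 2);
        uint right = number & HalfMask;
        for (int round = 0; round < 10; ++round)
        {
            left ^= RoundFunction(right);
            (left, right) = (right, left);
        }
        return left | (right << (BitCount / 2));
    }

    public static string MakeCoupon(uint n)
    {
        if (n >= IssuedCount) throw new ArgumentOutOfRangeException(nameof(n));
        uint scrambled = Crypt(n);
        var sb = new StringBuilder(CodeLength);
        for (int i = 0; i < CodeLength; ++i)
        {
            sb.Append(Alphabet[(int)(scrambled & 31)]); // lowest 5 bits first
            scrambled >>= 5;
        }
        return sb.ToString();
    }

    // Returns false for malformed codes and for codes outside the issued range.
    public static bool TryRedeem(string coupon, out uint n)
    {
        n = 0;
        if (coupon == null || coupon.Length != CodeLength) return false;
        uint scrambled = 0;
        for (int i = 0; i < CodeLength; ++i)
        {
            int digit = Alphabet.IndexOf(coupon[i]);
            if (digit < 0) return false; // character not in the alphabet
            scrambled |= (uint)digit << (5 * i);
        }
        n = Crypt(scrambled);
        return n < IssuedCount;
    }
}
```

Tracking which coupons have already been used still needs storage somewhere; this only answers whether a code could have been issued at all.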

Though I may get docked for this answer I feel like I need to respond - I really hope that you hear what I'm saying as it comes from a lot of painful experience.
While this task is very academically challenging, and software engineers tend to prefer challenging their intellect over solving problems, I need to provide you with some direction on this, if I may. There is no retail store in the world with any kind of success that doesn't keep very good track of each and every entity that is generated, from each piece of inventory to every single coupon or gift card it sends out the door. You're just not being a good steward if you don't, because it's not a question of if people are going to cheat you, it's when, and if you have every possible item in your arsenal you'll be ready.
Now, let's talk about the process by which the coupon is used in your scenario.
When the customer redeems the coupon, there is going to be some kind of POS system in front, right? That may be an online business where customers just enter their coupon code, or a register where the cashier scans a barcode (I'm assuming that's what we're dealing with here). And so now, as the vendor, you're saying that if you have a valid coupon code I'm going to give you some kind of discount, and because our goal was to generate coupon codes that are reversible, we don't need a database to verify that code; we can just reverse it, right? I mean, it's just math, right? Well, yes and no.
Yes, you're right, it's just math. In fact, that's also the problem because so is cracking SSL. But, I'm going to assume that we all realize the math used in SSL is just a bit more complex than anything used here and the key is substantially larger.
It does not behoove you, nor is it wise for you to try and come up with some kind of scheme that you're just sure nobody cares enough to break, especially when it comes to money. You are making your life very difficult trying to solve a problem you really shouldn't be trying to solve because you need to be protecting yourself from those using the coupon codes.
Therefore, this problem is unnecessarily complicated and could be solved like this.
// insert a record into the database for the coupon,
// thus generating an auto-incrementing key
var id = [some code to insert into database and get back the key]
// base64 encode the bytes of the resulting key value
// (Convert.ToBase64String expects a byte[], not an integer)
var couponCode = Convert.ToBase64String(BitConverter.GetBytes(id));
// truncate the coupon code if you like
// update the database with the coupon code
Create a coupon table that has an auto-incrementing key.
Insert into that table and get the auto-incrementing key back.
Base64 encode that id into a coupon code.
Truncate that string if you want.
Store that string back in the database with the coupon just inserted.

What you want is called Format-preserving encryption.
Without loss of generality, by encoding in base 36 we can assume that we are talking about integers in 0..M-1 rather than strings of symbols. M should probably be a power of 2.
After choosing a secret key and specifying M, FPE gives you a pseudo-random permutation of 0..M-1 encrypt along with its inverse decrypt.
string GenerateCoupon(int n) {
    Debug.Assert(0 <= n && n < N);
    return Base36.Encode(encrypt(n));
}
bool IsCoupon(string code) {
    return decrypt(Base36.Decode(code)) < N;
}
If your FPE is secure, this scheme is secure: no attacker can generate other coupon codes with probability higher than O(N/M) given knowledge of arbitrarily many coupons, even if he manages to guess the number associated with each coupon that he knows.
This is still a relatively new field, so there are few implementations of such encryption schemes. This crypto.SE question only mentions Botan, a C++ library with Perl/Python bindings, but not C#.
Word of caution: in addition to the fact that there are no well-accepted standards for FPE yet, you must consider the possibility of a bug in the implementation. If there is a lot of money on the line, you need to weigh that risk against the relatively small benefit of avoiding a database.

You can use a base-36 number system. Assume that you want 6 characters in the coupon output.
Pseudocode for MakeCoupon:
MakeCoupon(n)
{
    Have a byte array of fixed size, say 6. Initialize all the values to 0.
    Convert the number to base 36 and store the 'digits' in the array
    (using integer division and mod operations).
    Now, for each 'digit', find the corresponding ASCII code, assuming the
    digits run 0..9, A..Z.
    With this convention, output the six digits as a string.
}
Calculating the number back is the reverse of this operation.
With 6 characters this works for numbers up to 36^6 - 1 (36^6 = 2,176,782,336 distinct codes).
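The pseudocode above could be written in C# roughly as follows. This is a sketch; the class name Base36Coupon and the choice to throw on overflow are assumptions made for illustration.

```csharp
using System;

static class Base36Coupon
{
    const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Converts n to a fixed-length, zero-padded base-36 string.
    public static string MakeCoupon(long n, int length = 6)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));
        char[] buffer = new char[length];
        for (int i = length - 1; i >= 0; --i)
        {
            buffer[i] = Digits[(int)(n % 36)]; // least significant digit goes last
            n /= 36;
        }
        if (n != 0) throw new ArgumentOutOfRangeException(nameof(n), "Value does not fit in the requested length.");
        return new string(buffer);
    }

    // Reverses MakeCoupon; accepts lower-case input as well.
    public static long MakeNumber(string coupon)
    {
        long n = 0;
        foreach (char c in coupon.ToUpperInvariant())
        {
            int digit = Digits.IndexOf(c);
            if (digit < 0) throw new FormatException("Invalid character in coupon code.");
            n = n * 36 + digit;
        }
        return n;
    }
}
```

Note that consecutive numbers give visibly consecutive codes (000001, 000002, ...), so on its own this does not meet the "visually differ" requirement from the question.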

Choose a cryptographic function c. There are a few requirements on c, but for now let us take SHA1.
choose a secret key k.
Your coupon code generating function could be, for number n:
concatenate n and k as "n"+"k" (this is known as salting in password management)
compute c("n"+"k")
the result of SHA1 is 160bits, encode them (for instance with base64) as an ASCII string
if the result is too long (as you said it is the case for SHA1), truncate it to keep only the first 10 letters and name this string s
your coupon code is printf "%09d%s" n s, i.e. the concatenation of zero-padded n and the truncated hash s.
Yes, it is trivial to guess n the number of the coupon code (but see below). But it is hard to generate another valid code.
Your requirements are satisfied:
To compute the reverse function, just read the first 9 digits of the code
The length is always 19 (9 digits of n, plus 10 letters of hash)
It is unique, since the first 9 digits are unique. The last 10 chars are too, with high probability.
It is not obvious how to generate the hash, even if one guesses that you used SHA1.
Some comments:
If you're worried that reading n is too obvious, you can obfuscate it lightly, like base64 encoding, and alternating in the code the characters of n and s.
I am assuming that you won't need more than a billion codes, thus the printing of n on 9 digits, but you can of course adjust the parameters 9 and 10 to your desired coupon code length.
SHA1 is just an option, you could use another cryptographic function like private key encryption, but you need to check that this function remains strong when truncated and when the clear text is provided.
This is not optimal in code length, but has the advantage of simplicity and widely available libraries.
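The steps above could be sketched like this in C#. The secret key value, the class name, and the method names are placeholders invented for illustration. The sketch follows the answer literally (SHA1 over the number followed by the key); an HMAC would be the more conventional keyed construction. Standard base64 output may contain + and /, so you may want a different text encoding if the code is printed or typed by hand.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

static class HashedCoupon
{
    // Assumption: a placeholder secret; keep the real one out of source control.
    const string SecretKey = "replace-with-your-secret";

    // First 10 base64 characters of SHA1(n + k).
    static string Tag(int n)
    {
        string input = n.ToString(CultureInfo.InvariantCulture) + SecretKey;
        using (SHA1 sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(input));
            return Convert.ToBase64String(hash).Substring(0, 10);
        }
    }

    // 9 zero-padded digits of n followed by the 10-character tag.
    public static string MakeCoupon(int n)
    {
        if (n < 0 || n > 999999999) throw new ArgumentOutOfRangeException(nameof(n));
        return n.ToString("D9", CultureInfo.InvariantCulture) + Tag(n);
    }

    // Reads n back from the first 9 digits and verifies the tag.
    public static bool TryParse(string code, out int n)
    {
        n = 0;
        if (code == null || code.Length != 19) return false;
        if (!int.TryParse(code.Substring(0, 9), NumberStyles.None, CultureInfo.InvariantCulture, out n)) return false;
        return code.Substring(9) == Tag(n);
    }
}
```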

Generate serial number using letters and digits

I'm developing an application for taking orders in C# and DevExpress, and I need a function that generates a unique order number. The order number must contain letters and digits and have a length of 20.
I've seen things like Guid.NewGuid(), but I don't want it to be totally random, nor just an auto-increment number.
Can anyone help? even if it's a script in a different language, I need ideas desperately :)

You can create a format of your own.
Let's say yyyyMMddWWW-YYY-XXXXXXX, where WWW is the store number, YYY is the cashier ID and XXXXXXX is a hexadecimal number (maybe an actual auto-increment number that you convert to hex). This is just an idea. I'm afraid you will have to decide what it looks like based on the elements of your system.
Edited: applying a check digit algorithm to it will also help in avoiding mistakes.
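A format like this can be produced with a single format string. The sketch below is only an illustration; the class, method, and parameter names are made up, and where the sequence number comes from (for example an auto-increment column) is left open. Note that the result is 23 characters including the hyphens, so the field widths would need trimming to hit exactly 20.

```csharp
using System;
using System.Globalization;

static class OrderNumber
{
    // Builds yyyyMMddWWW-YYY-XXXXXXX from its parts.
    public static string Format(DateTime date, int store, int cashier, int sequence)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "{0:yyyyMMdd}{1:D3}-{2:D3}-{3:X7}", // X7 = 7 upper-case hex digits
            date, store, cashier, sequence);
    }
}
```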

Two different methods:
Create MD5 or SHA1 hash of current time
Hash of increment number

One thought comes to mind.
Take the DateTime.Now.Ticks convert it to hexadecimal string.
Voila, String.Format("{0:X}", value);
If not long enough , you said you need 20 digits, you can always pad with zeros.

Get the mother board ID
Get the hdd ID
Merge it by any way
Add your secret code
Apply MD5
Apply Base64
Result: the serial code which is linked to the current client PC :)

My two cents.
If you need ideas then take a look at the Luhn and Luhn mod N algorithms.
While these algorithms are not unique code generators, they may give you some ideas on how to generate codes that can be validated (such that you could validate the code for correctness before sending it off to the database).
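For reference, the plain (decimal) Luhn check can be sketched as below; Luhn mod N generalizes the same idea to larger alphabets. The class and method names are illustrative.

```csharp
using System;

static class Luhn
{
    // Computes the check digit to append to a string of decimal digits.
    public static int ComputeCheckDigit(string digits)
    {
        int sum = 0;
        bool doubleIt = true; // the rightmost payload digit is doubled first
        for (int i = digits.Length - 1; i >= 0; --i)
        {
            int d = digits[i] - '0';
            if (d < 0 || d > 9) throw new FormatException("Only decimal digits are allowed.");
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9; // same as adding the two digits of the product
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return (10 - sum % 10) % 10;
    }

    // Validates a number whose last character is its Luhn check digit.
    public static bool IsValid(string numberWithCheckDigit)
    {
        if (string.IsNullOrEmpty(numberWithCheckDigit) || numberWithCheckDigit.Length < 2) return false;
        string payload = numberWithCheckDigit.Substring(0, numberWithCheckDigit.Length - 1);
        int expected = numberWithCheckDigit[numberWithCheckDigit.Length - 1] - '0';
        return expected == ComputeCheckDigit(payload);
    }
}
```

A check like this catches most single-digit typos and many adjacent swaps before the code ever reaches the database, but it does nothing for uniqueness.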

Like Oded suggested, a Guid is not random (well, not if you have a network card). It's based on time and location coordinates. See Raymond Chen's blog post for a detailed explanation.
You are best off using an auto-incremented int for order IDs. I don't understand why you wouldn't want to use it or, failing that, a Guid.
I can't think of any way other than an auto ID to maintain uniqueness and represent the sequence of the different orders in your system.

Do I need to verify the uniqueness of a GUID?

I saw this function in a source written by my coworker
private String GetNewAvailableId()
{
    String newId = Guid.NewGuid().ToString();
    while (clientsById.ContainsKey(newId))
    {
        newId = Guid.NewGuid().ToString();
    }
    return newId;
}
I wonder if there is a scenario in which the guid might not be unique?
The code is used in a multithread scenario and clientsById is a dictionary of GUID and an object

This should be completely unnecessary - the whole point of GUIDs is to eliminate the need for these sorts of checks :-)
You may be interested in reading this post on GUID generation algorithms:
GUIDs are globally unique, but substrings of GUIDs aren't (The Old New Thing)
The goal of this algorithm is to use the combination of time and location ("space-time coordinates" for the relativity geeks out there) as the uniqueness key. However, timekeeping is not perfect, so there's a possibility that, for example, two GUIDs are generated in rapid succession from the same machine, so close to each other in time that the timestamp would be the same. That's where the uniquifier comes in. When time appears to have stood still (if two requests for a GUID are made in rapid succession) or gone backward (if the system clock is set to a new time earlier than what it was), the uniquifier is incremented so that GUIDs generated from the "second time it was five o'clock" don't collide with those generated "the first time it was five o'clock".
The only real way that you might ever have a collision is if someone was generating thousands of GUIDs on the same machine while also repeatedly setting the timestamp back to the same exact point in time.
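Since the question mentions a multithreaded scenario, it is worth noting that a ContainsKey check followed by a separate add is not atomic anyway. If the dictionary is shared between threads, a sketch like the following drops the loop and lets the collection enforce uniqueness in one step. The ClientRegistry type and its members are hypothetical; only clientsById comes from the question.

```csharp
using System;
using System.Collections.Concurrent;

sealed class ClientRegistry<TClient>
{
    private readonly ConcurrentDictionary<string, TClient> clientsById =
        new ConcurrentDictionary<string, TClient>();

    // Generates an ID and registers the client atomically.
    public string Register(TClient client)
    {
        string id = Guid.NewGuid().ToString();
        // TryAdd fails only if the key already exists, which would indicate
        // a GUID collision; treat that as a bug rather than retrying forever.
        if (!clientsById.TryAdd(id, client))
            throw new InvalidOperationException("Duplicate GUID generated.");
        return id;
    }

    public bool TryGet(string id, out TClient client)
    {
        return clientsById.TryGetValue(id, out client);
    }
}
```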

By definition, GUIDs are unique (globally unique identifier). It's unnecessary to check for uniqueness, as uniqueness is the purpose of GUIDs.
The total number of unique keys is 2^128 or 3.4×10^38. This number is so
large that the probability of the same number being generated randomly
twice is negligible.
Quote taken from Wikipedia

This check is not needed at all - a GUID is guaranteed to be as unique as it can be, period, and has a really low chance of ever being duplicated, ever.
From MSDN:
A GUID is a 128-bit integer (16 bytes) that can be used across all computers and networks wherever a unique identifier is required. Such an identifier has a very low probability of being duplicated.
And, again from MSDN:
The chance that the value of the new Guid will be all zeros or equal to any other Guid is very low.
To be sure, you'd be the most unlucky developer in the universe if you were to get a single conflicting GUID out of a collection of one thousand within your whole lifetime.

Consider the number of unique GUIDs. If you really want to, you can put in that check, but I don't really see why with these odds.
Number of GUIDs: 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456
