I'm trying to create a URL shortener system in C# and ASP.NET MVC. I know about hash tables and I know how to create a redirect system, etc. The problem is indexing long URLs in the database. Some URLs may be up to 4000 characters long, and it seems like a bad idea to index strings of that size. The question is: how can I create a unique short string for each URL? For example, can MD5 help me? Is MD5 really unique for each string?
NOTE: I see that Gravatar uses MD5 for emails, so if each email address is unique, then its MD5 hash is unique. Is that right? Can I use the same solution for URLs?
You can use MD5 or SHA-1 for purposes such as you described.
Hashes aren't completely unique. For example, if you have a 4000-byte array, you potentially have 256^4000 combinations, while a 16-byte MD5 hash has only 256^16 combinations. So there is a possibility of collisions. However, for all practical purposes (except cryptography), you never need to worry about collisions.
If you are interested in reading about the collision vulnerability of MD5 (related to cryptographic use), you can do it here
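To illustrate the idea (Python's hashlib standing in for .NET's MD5 class; the URL below is just a made-up example), hashing yields a fixed-length 16-byte digest no matter how long the input URL is:

```python
import hashlib

url = "https://example.com/some/very/long/path?with=many&query=parameters"
digest = hashlib.md5(url.encode("utf-8")).hexdigest()

print(digest)
print(len(digest))  # always 32 hex characters (16 bytes), regardless of URL length
```

The 32-character hex digest is short enough to index efficiently, which is the whole point of hashing the long URL column.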
A perfect hash function is one that guarantees no collisions. Since your application cannot accommodate hash chains, a perfect hash is the way to go.
The hashing approaches already mentioned will work fine for creating unique short strings that will probably uniquely identify your URLs. However, I'd like to propose an alternate approach.
Create a database table with two columns, ID (an integer) and URL (a string). Create a row in the table for each URL you wish to track. Then, refer to each URL by its ID. Make the ID auto-incrementing; this will ensure uniqueness.
This addresses the problem of how to translate from the shortened version to the longer version: simply join on the table in the database. With hashing, this would become a problem because hashing is one-way. The resulting page identifiers will also be shorter than MD5 hashes, and will only contain digits so they will be easy to include in URL query strings, etc.
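A minimal sketch of this table-based approach (using Python's built-in sqlite3 purely for illustration; in ASP.NET you would use your own database and data access layer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT)")

def shorten(url):
    # the auto-incrementing primary key guarantees a unique, ever-increasing ID
    cur = conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
    return cur.lastrowid

def expand(short_id):
    # translating back is a simple lookup -- no one-way hashing involved
    row = conn.execute("SELECT url FROM urls WHERE id = ?", (short_id,)).fetchone()
    return row[0] if row else None

sid = shorten("https://example.com/a/very/long/url")
print(sid, expand(sid))
```

Note that the short identifier is just an integer, so it is trivially URL-safe and far shorter than any hash digest.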
I think you could turn the URL string into a byte array (each char can be a byte) and then apply an encoding (Base64, for example, or one you create yourself if you want to go that far). To decode, you just apply Base64 decoding and turn the bytes back into chars. I am not sure whether the result will be a long string or not, but I am pretty sure it will be unique.
(PS: you should of course apply some logic first, like always removing http:// and adding it again later when decoding.)
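For what it's worth, here is what that encode/decode round trip looks like (a Python sketch with the standard base64 module; the URL is a made-up example). Note that the encoded form comes out longer than the input, not shorter, which is the weakness of this approach:

```python
import base64

url = "http://example.com/some/path"

# drop the scheme before encoding, re-add it when decoding
stripped = url[len("http://"):] if url.startswith("http://") else url

encoded = base64.urlsafe_b64encode(stripped.encode("utf-8")).decode("ascii")
decoded = "http://" + base64.urlsafe_b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == url)  # the round trip is lossless, but encoded is ~33% longer
```

So Base64 gives you uniqueness and reversibility, but it does not solve the original problem of indexing very long strings.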
Related
I have a list of 10 to at most 300 string codes (40 word characters, capitalized) that need to be stored inside an OAuth2 access token (claims-based authorization).
I have to keep the token as small as I can (header size problem), so I'm searching for a way to create a small unique identifier representing the original string inside the token.
I would then create a lookup table where I will put the uid and the original string.
When the token is sent by the client, I will get the original string back through the uid and the lookup table.
I've read that it is possible to truncate the first bytes of a hash (MD5, SHA1) and I would like to know if I can follow this path safely.
Is it possible to safely (collision wise) create a list of hashes (unique) of these strings where each hash would be 4/5 bytes max?
Edit:
I can't pre-generate a random string as an index (or just a list index, for example) because this list could change and grow in size (for example, when the server application is deployed and new codes are added to the list), so I have to be sure that when I get the token back from the client, the uid will be bound to the correct code.
Yes, any of those hash algorithms gives a uniform hash code where no bit is supposed to carry more information than any other. You can just take any 4-5 bytes of it (as long as you take the same bytes from each code) and use them as a smaller hash code.
Naturally the collision risk gets higher the shorter the hash code is, but you will still get the lowest possible collision risk for that hash code length.
Edit:
As the question changed: no, you can't create unique identifiers using a hash code. With a long enough hash code you can make collisions rare enough that the hash code can be used as a unique identifier for almost any practical application, but a 32-bit hash code doesn't do that; a 128-bit hash code would.
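To illustrate the truncation being asked about (a Python sketch using hashlib; the code string is a made-up example): taking the first 4 bytes of a SHA-1 digest gives a 32-bit hash code, which by the birthday bound is likely to collide after only about 2^16 ≈ 65,000 distinct values, so it cannot be treated as unique:

```python
import hashlib

def short_hash(code, n_bytes=4):
    # any fixed slice of the digest works, since its bits are uniformly distributed;
    # the shorter the slice, the higher the collision risk
    return hashlib.sha1(code.encode("utf-8")).digest()[:n_bytes].hex()

print(short_hash("SOMECAPITALIZEDWORDCODE"))  # 8 hex chars = 32 bits
```

For a few hundred codes the practical collision risk is small, but it is a risk, not a guarantee of uniqueness.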
We are trying to convert the text "HELLOWORLDTHISISALARGESTRINGCONTENT" into a smaller text. Using an MD5 hash we get 16 bytes, but since it is a one-way function we are not able to decrypt it. Is there any other way to convert this large string to a smaller one and revert back to the same data? If so, please let us know how to do it.
Thanks in advance.
Most compression algorithms won't be able to do much with a sequence that short (or may actually make it bigger) - so no: there isn't much you can do to magically shrink it. Your best bet would probably be to just generate a GUID, store the full value keyed against the GUID (in a database or whatever), and then use the short value as a one-time usage key to look up the long value (and then erase the record).
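A sketch of that GUID-keyed lookup (Python's uuid module standing in for .NET's Guid, and a plain dict standing in for the database):

```python
import uuid

store = {}

def save(value):
    key = str(uuid.uuid4())  # effectively collision-free random 128-bit key
    store[key] = value
    return key

def redeem(key):
    # one-time usage: look up the long value and erase the record
    return store.pop(key, None)

k = save("HELLOWORLDTHISISALARGESTRINGCONTENT")
print(redeem(k))  # returns the original string
print(redeem(k))  # None -- the record was already erased
```

The short value handed out is the 36-character GUID, and reversibility comes from the stored mapping rather than from the key itself.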
It heavily depends on the input data. In general - in the worst case - you can't reduce the size of a string through compression if the input data is not long enough and has high entropy.
Hashing is the wrong approach, as a hashing function tries to map large input data to a short string, but it does not guarantee (by itself) that you can't find a second set of data that maps to the same string.
What you can try to do is implement a compression algorithm or a lookup table.
Compression can be done with ziplib or any other compression library (just google for it). The lookup approach requires a second place to store the lookup information. For example, when you get the first input string, you map it to the number 1 and save the information "1 maps to {input data}" somewhere else. For every subsequent data set you add another mapping entry. If the set of input data is finite, this approach may save you space.
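The lookup approach described above can be sketched like this (plain Python dicts playing the role of the "somewhere else" store):

```python
mapping = {}  # data -> number
reverse = {}  # number -> data

def store(data):
    # map each distinct input to the next small integer
    if data not in mapping:
        n = len(mapping) + 1
        mapping[data] = n
        reverse[n] = data
    return mapping[data]

n = store("HELLOWORLDTHISISALARGESTRINGCONTENT")
print(n, reverse[n])  # 1 and the original string back
```

The identifier stays tiny no matter how long the inputs are, at the cost of having to persist the two mappings somewhere.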
I have the following three pieces of information. A group name, a group type, and group ranking.
As a quick example
"Mom's cats", "Cats", "Top10"
The example is way off from what I'm doing with this, but you get the basic idea.
The group name is a large selection of possible values (like around 20k) and the group type and group ranking are smaller amounts (like 10 each)
Trying to find a better way to come up with a short unique identifier for these groups of things rather than having to use a SHA-1 with a huge ugly URL.
Any better ideas?
Open to all language solutions, so just pinning a lot of programmers here since I can't think of a better tag to assign to this.
Thanks.
EDIT: One solution that I found elsewhere a while back suggested taking the last few characters of the SHA-1 and converting them to a decimal value. Not sure how reliable this idea is or what the chance of collision is.
EDIT2: Using MongoDB and storing this SHA-1 value in the DB along with the members to make querying easy at the moment. Trying to find an alternative to creating an auto-increment field in a separate table/collection, which would mean a lot more queries when running update scripts.
For Python mappings you could use (grouptype, groupranking, groupname) as a dictionary key, or you could reduce the size of the dictionaries by splitting into nested dictionaries, e.g. grouptype -> groupranking -> groupname.
For generating a unique URL, what is wrong with grouptype.rank.name, or the same with / as a separator? You could use the URL-quoting functions to replace invalid chars in each with the %nn format.
You could use urllib.quote('/'.join([baseurl, grouptype, groupranking, groupname])) to generate such a path, or even baseurl + urllib.urlencode({'grouptype':grouptype,'groupranking':groupranking,'groupname':groupname}) - the latter will result in the typical query format of baseurl?grouptype=Whatever&....
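The answer above uses Python 2's urllib; in Python 3 the same calls live in urllib.parse. A sketch of both variants, with the example values from the question:

```python
from urllib.parse import quote, urlencode

baseurl = "http://example.com"  # made-up base URL for illustration
grouptype, groupranking, groupname = "Cats", "Top10", "Mom's cats"

# path style: invalid characters become %nn escapes ('/' stays safe by default)
path = quote("/".join([grouptype, groupranking, groupname]))
print(baseurl + "/" + path)

# query-string style: baseurl?grouptype=...&groupranking=...&groupname=...
query = urlencode({"grouptype": grouptype, "groupranking": groupranking,
                   "groupname": groupname})
print(baseurl + "?" + query)
```

Both forms are reversible (unquote / parse_qs), so no lookup table is needed at all, though neither is shorter than the inputs.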
I got a string of an arbitrary length (lets say 5 to 2000 characters) which I would like to calculate a checksum for.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach which is reasonably unique? i.e. the likelihood of a collision is very small.
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update2
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext in linux. i.e. the user just writes (in a razor view)
#T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identity that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient and I'm therefore looking for a way to generate a key out of it.
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense: either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
    hash = BitConverter.ToString(
        md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
    ).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
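The same computation in Python (hashlib) produces the identical digest, which is a quick way to sanity-check the C# code above:

```python
import hashlib

hash_str = hashlib.md5("asdf".encode("utf-8")).hexdigest().upper()
print(hash_str)  # 912EC803B2CE49E4A541068D495AB570
```

Any two correct MD5 implementations agree byte-for-byte on the same UTF-8 input, so the checksum is portable across languages and data sources.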
You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash= sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to look up against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed the checksum, and then filtered on rows where the checksums were equal and the text was equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
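That indexing strategy can be sketched in Python, with zlib.crc32 playing the role of SQL Server's CHECKSUM and a dict of buckets playing the role of the (checksum, id) index. Lookups compare full text only within one small bucket:

```python
import zlib
from collections import defaultdict

buckets = defaultdict(list)  # checksum -> list of texts (the "collision bucket")

def insert(text):
    buckets[zlib.crc32(text.encode("utf-8"))].append(text)

def lookup(text):
    # narrow the search by checksum first, then confirm with a full comparison
    return text in buckets[zlib.crc32(text.encode("utf-8"))]

insert("some nasty 8000-character lookup value")
print(lookup("some nasty 8000-character lookup value"))  # True
print(lookup("something else"))                          # False
```

Collisions are harmless here because the final text comparison disambiguates them; the checksum only exists to keep the index small.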
Good luck!
To guarantee uniqueness for strings of almost unbounded size, treat the variable-length string as a set of concatenated substrings, each x characters in length. Your hash function then only needs to determine uniqueness for a maximum substring length, generating a series of checksum values. Think of it as the equivalent of a network IP address with a set of checksum numbers.
The issue with collisions is the assumption that a collision forces a slower search method to resolve it. If there is an insignificant number of possible collisions compared to the number of hashed objects, then as a whole the extra overhead becomes nil. A collision is due to sizing a table smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes" and each slot within the table may keep a reference count of the objects at that collision. Only if this count is greater than 1 does a collision occur, or do multiple instances of the same substring exist.
I have a structure that I am converting to a byte array of length 37, then to a string from that.
I am writing a very basic activation type library, and this string will be passed between people. So I want to shorten it from length 37 to something more manageable to type.
Right now:
Convert the structure to a byte array,
Convert the byte array to a base 64 string (which is still too long).
What is a good way to shorten this string, yet still maintain the data stored in it?
Thanks.
In the general case, going from an arbitrary byte[] to a string requires more data, since we assume we want to avoid non-printable characters. The only way to reduce it is to compress before the base-whatever (you can get a little higher than base-64, but not much - and it certainly isn't any more "friendly") - but compression won't really kick in for such a short size. Basically, you can't do that. You are trying to fit a quart in a pint pot, and that doesn't work.
You may have to rethink your requirements. Perhaps save the BLOB internally, and issue a shorter token (maybe 10 chars, maybe a guid) that is a key to the actual BLOB.
Data compression may be a possibility to check out, but you can't just compress a 40-byte message to 6 bytes (for example).
If the space of possible strings/types is limited, map them to a list (information coding).
I don't know of anything better than base-64 if you actually have to pass the value around and if users have to type it in.
If you have a central data store they can all access, you could just give them the ID of the row where you saved it. This of course depends on how "secret" this data needs to be.
But I suspect that if you're trying to use this for activation, you need them to have an actual value.
How will the string be passed? Can you expect users to perhaps just copy/paste? Maybe some time spent on clearing up superfluous line breaks that come from an email reader or even your "Copy from here" and "Copy to here" lines might bear more fruit!
Can the characters in your string have non-printable chars? If so, you don't need to Base64-encode the bytes; you can simply create the string from them (saving 33%). Note that Cast&lt;char&gt;() would throw at runtime; you need a per-element conversion:
string str = new string(byteArray.Select(b => (char)b).ToArray());
Also, are the values in the byte array restricted somehow? If they fall into a certain range (i.e. not all of the 256 possible values are used), you can consider stuffing two of them into each character of the string.
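A sketch of that packing idea: since a .NET char is 16 bits, two full bytes fit in one character even without any range restriction. The round trip below is a Python illustration of the principle, not the C# code itself:

```python
def pack(data: bytes) -> str:
    if len(data) % 2:
        data += b"\x00"  # pad to an even length
    # two bytes per 16-bit character: high byte shifted, low byte OR-ed in
    return "".join(chr((data[i] << 8) | data[i + 1]) for i in range(0, len(data), 2))

def unpack(s: str) -> bytes:
    return b"".join(bytes([ord(c) >> 8, ord(c) & 0xFF]) for c in s)

raw = bytes(range(36))  # 36 bytes -> 18 characters
packed = pack(raw)
print(len(packed), unpack(packed) == raw)  # 18 True
```

This halves the character count, though the resulting characters are generally unprintable, so it only helps when the string is carried by machines rather than typed by people.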
If you really have 37 bytes of non-redundant information, then you are out of luck. Compression may help in some cases, but if this is an activation key, I would recommend having keys of same length (and compression will not enforce this).
If this code is going to be passed over e-mail, then I see no problem in having an even larger key. Another option might be to insert hyphens every 5-or-so characters, to break it into smaller chunks (e.g. XXXXX-XXXXX-XXXXX-XXXXX-XXXXX).
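Grouping with hyphens is a one-liner; a Python sketch of the chunking:

```python
def group(key, size=5):
    # break the key into fixed-size chunks joined by hyphens
    return "-".join(key[i:i + size] for i in range(0, len(key), size))

print(group("ABCDEFGHIJKLMNOPQRSTUVWXY"))  # ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
```

The hyphens add a few characters but make the key far less error-prone to read aloud or type.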
Use a 160-bit hash and hope for no collisions? It would be much shorter. If you can use a lookup table, just use a 128-bit or even 64-bit incremental value. Much, much shorter than your 37 chars.