Convert a list of strings into one unique SHA512 hash - C#

I want to know if there is a fast way to convert a whole list of strings into one unique SHA512 hash string.
For now I use the method below to get a unique SHA512 hash, but it becomes slower and slower as the list contains more and more strings.
string hashDataList = string.Empty;
for (int i = 0; i < ListOfElement.Count; i++)
{
    hashDataList += ListOfElement[i];
}
hashDataList = MakeHash(hashDataList);
Console.WriteLine("Hash: " + hashDataList);
Edit:
Method used to make the hash:
public static string MakeHash(string str)
{
    using (var hash = SHA512.Create())
    {
        var bytes = Encoding.UTF8.GetBytes(str);
        var hashedInputBytes = hash.ComputeHash(bytes);
        var hashedInputStringBuilder = new StringBuilder(128);
        foreach (var b in hashedInputBytes)
            hashedInputStringBuilder.Append(b.ToString("X2"));
        return hashedInputStringBuilder.ToString();
    }
}

Try this, using built-in SHA512:
StringBuilder sb = new StringBuilder();
foreach (string s in ListOfElement)
{
    sb.Append(s);
}

using (var sha512 = new System.Security.Cryptography.SHA512CryptoServiceProvider())
{
    hashDataList = BitConverter.ToString(sha512.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString())))
        .Replace("-", string.Empty);
}
Console.WriteLine("Hash: " + hashDataList);
Performance depends a lot on MakeHash() implementation as well.
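If the concatenated string itself gets very large, another option is to feed each string into the hash incrementally so the full concatenation is never materialized. A minimal sketch, assuming ListOfElement is the List<string> from the question:

// Stream each string into the hash instead of building one huge string first.
using (var sha512 = System.Security.Cryptography.SHA512.Create())
{
    foreach (string s in ListOfElement)
    {
        byte[] block = Encoding.UTF8.GetBytes(s);
        sha512.TransformBlock(block, 0, block.Length, null, 0);
    }
    sha512.TransformFinalBlock(new byte[0], 0, 0);
    string hashDataList = BitConverter.ToString(sha512.Hash).Replace("-", string.Empty);
    Console.WriteLine("Hash: " + hashDataList);
}

The hashing work is still proportional to the total amount of data (it has to be), but this avoids repeatedly reallocating an ever-growing string.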

I think the problem might be a bit misstated here. First from a performance standpoint:
Any method of hashing a list of strings will take longer as the number (and length) of the strings increases. The only way to avoid this would be to ignore some of the data in (at least some of) the strings, and then you lose the assurances that a hash should give you.
So you can try to make the whole thing faster, so that you can process more (and/or longer) strings in an acceptable time frame. Without knowing the performance characteristics of the hashing function, we can't say if that's possible; but as farbiondriven's answer suggests, about the only plausible strategy is to assemble a single string and hash that once.
The potential objection to this, I suppose, would be: does it affect the uniqueness of the hash? There are two factors to consider:
First, if you just concatenate all the strings together, then you would get the same output hash for
["element one and ", "element two"]
as for
["element one ", "and element two"]
because the concatenated data is the same. One way to correct this is to insert each string's length before the string (with a delimiter to show the end of the length). For example you could build
"16:element one and 11:element two"
for the first array above, and
"12:element one 15:and element two"
for the second.
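A rough sketch of that length-prefixed concatenation, again assuming the ListOfElement list from the question:

// Prefix each string with its length and a delimiter so that different
// splits of the same text produce different input to the hash.
var sb = new StringBuilder();
foreach (string s in ListOfElement)
{
    sb.Append(s.Length).Append(':').Append(s);
}
string toHash = sb.ToString(); // then hash this once, e.g. with MakeHash()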
The other possible concern (though it isn't really valid) could arise if the individual strings are never longer than a single SHA512 hash, but the total amount of data in the array is. In that case, your method (hashing each string and concatenating them) might seem safer, because whenever you hash data that's longer than the actual hash, it's mathematically possible for a hash collision to occur. But as I say, this concern is not valid for at least one, and possibly two, reasons.
The biggest reason is: hash collisions in a 512-bit hash are ridiculously unlikely. Even though the math says it could happen, it is beyond safe to assume that it literally never will. If you're going to worry about a hash collision at that level, you might as well also worry about your data being spontaneously corrupted due to RAM errors that occur in just such a pattern as to avoid detection. At that level of improbability, you simply can't program around a vast number of catastrophic things that "could" (but won't) happen, and you really might as well count hash collisions among them.
The second reason is: if you're paranoid enough not to buy the first reason, then how can you be sure that hashing shorter strings guarantees uniqueness?
What concatenating a hash per string does do, if the individual strings are shorter than 512 bits, is make the combined hash longer than the source data - which defeats the typical purpose of a hash. If that's acceptable, then you probably want an encryption algorithm instead of a hash.

Related

Generate integer based on any given string (without GetHashCode)

I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.
I tried using .GetHashCode(); however, this is very unreliable once I move the project to another machine, as GetHashCode() returns different values for the same string.
It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.
The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.
The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully; however, it currently relies on .GetHashCode() and doesn't work when multiple machines get involved.
Any insight at all would be greatly appreciated.
MD5 hashing returns a byte array which could be converted to an integer:
var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
var ivalue = BitConverter.ToInt32(hashed, 0);
Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.
If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.
If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 77,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.
As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32-bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is somewhat larger--closer to 77,000--but the rule of thumb gets you in the ballpark.
Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.
That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.
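If you want to sanity-check those rule-of-thumb numbers, the standard birthday approximation is easy to evaluate; a quick sketch (not part of the original answer):

// p ≈ 1 - e^(-n(n-1) / (2 * 2^32)): probability of at least one duplicate
// among n uniformly distributed 32-bit hash codes.
static double CollisionProbability(long n) =>
    1.0 - Math.Exp(-(double)n * (n - 1) / (2.0 * 4294967296.0));

// CollisionProbability(77000)  ≈ 0.50
// CollisionProbability(262144) ≈ 0.9997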
This code maps any string to an int between 0 and 99:
int x = "ali".ToCharArray().Sum(c => c) % 100;
using (MD5 md5 = MD5.Create())
{
bigInteger = new BigInteger(md5.ComputeHash(Encoding.Default.GetBytes(myString)));
}
BigInteger here can come from System.Numerics; the original answer referenced the one in Org.BouncyCastle.Math.

Data structure choices for high-speed and memory-efficient detection of duplicate strings

I have an interesting problem that could be solved in a number of ways:
I have a function that takes in a string.
If this function has never seen this string before, it needs to perform some processing.
If the function has seen the string before, it needs to skip processing.
After a specified amount of time, the function should accept duplicate strings.
This function may be called thousands of times per second, and the string data may be very large.
This is a highly abstracted explanation of the real application, just trying to get down to the core concept for the purpose of the question.
The function will need to store state in order to detect duplicates. It also will need to store an associated timestamp in order to expire duplicates.
It does NOT need to store the strings; a unique hash of the string would be fine, provided there are no false positives due to collisions (use a perfect hash?) and the hash function is performant enough.
The naive implementation would be simply (in C#):
Dictionary<String,DateTime>
though in the interest of lowering the memory footprint and potentially increasing performance, I'm evaluating custom data structures to handle this instead of a basic hashtable.
So, given these constraints, what would you use?
EDIT, some additional information that might change proposed implementations:
99% of the strings will not be duplicates.
Almost all of the duplicates will arrive back to back, or nearly sequentially.
In the real world, the function will be called from multiple worker threads, so state management will need to be synchronized.
I don't believe it is possible to construct a "perfect hash" without knowing the complete set of values first (especially in the case of a C# int with a limited number of values). So any kind of hashing requires the ability to compare the original values too.
I think a dictionary is the best you can get with out-of-the-box data structures. Since you can store objects with custom comparisons defined, you can easily avoid keeping strings in memory and simply save the location where the whole string can be obtained, i.e. an object with the following values:
stringLocation.fileName="file13.txt";
stringLocation.fromOffset=100;
stringLocation.toOffset=345;
expiration= "2012-09-09T1100";
hashCode = 123456;
where a custom comparer will return the saved hashCode, or retrieve the string from the file if needed and perform the comparison.
a unique hash of the string would be fine, provided there are no false
positives due to collisions
That's not possible if you want the hash code to be shorter than the strings.
Using hash codes implies that there will be false positives; the goal is only that they are rare enough not to be a performance problem.
I would even consider creating the hash code from only part of the string, to make it faster. Even if that means you get more false positives, it could increase the overall performance.
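As an illustration of hashing only part of the string (the 64-character cutoff and the constants are arbitrary choices, not anything prescribed):

// Cheap pre-filter: hash only a prefix of the string. A full comparison is
// still required on a hash match, since false positives are expected.
static int PrefixHash(string s)
{
    int len = Math.Min(s.Length, 64);
    int h = 17;
    for (int i = 0; i < len; i++)
        h = unchecked(h * 31 + s[i]);
    return h;
}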
Provided the memory footprint is tolerable, I would suggest a HashSet<string> for the strings, and a queue to store a Tuple<DateTime, String>. Something like:
HashSet<string> Strings = new HashSet<string>();
Queue<Tuple<DateTime, String>> Expirations = new Queue<Tuple<DateTime, String>>();
Now, when a string comes in:
if (Strings.Add(s))
{
// string is new. process it.
// and add it to the expiration queue
Expirations.Enqueue(new Tuple<DateTime, String>(DateTime.Now + ExpireTime, s));
}
And, somewhere you'll have to check for the expirations. Perhaps every time you get a new string, you do this:
while (Expirations.Count > 0 && Expirations.Peek().Item1 < DateTime.Now)
{
var e = Expirations.Dequeue();
Strings.Remove(e.Item2);
}
It'd be hard to beat the performance of HashSet here. Granted, you're storing the strings, but that's going to be the only way to guarantee no false positives.
You might also consider using a time stamp other than DateTime.Now. What I typically do is start a Stopwatch when the program starts, and then use the ElapsedMilliseconds value. That avoids potential problems that occur during Daylight Saving Time changes, when the system automatically updates the clock (using NTP), or when the user changes the date/time.
Whether the above solution works for you is going to depend on whether you can stand the memory hit of storing the strings.
Added after "Additional information" was posted:
If this will be accessed by multiple threads, I'd suggest using ConcurrentDictionary rather than HashSet, and BlockingCollection rather than Queue. Or, you could use lock to synchronize access to the non-concurrent data structures.
If it's true that 99% of the strings will not be duplicates, then you'll almost certainly need an expiration queue that can remove things from the dictionary.
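A minimal sketch of the ConcurrentDictionary variant (the five-minute window is an assumed value; the expiration queue and stale-entry pruning are left out for brevity):

// using System.Collections.Concurrent;
// ConcurrentDictionary keyed by the string; the value is the expiration time.
static readonly ConcurrentDictionary<string, DateTime> Seen =
    new ConcurrentDictionary<string, DateTime>();
static readonly TimeSpan ExpireAfter = TimeSpan.FromMinutes(5); // assumed window

static bool ShouldProcess(string s)
{
    DateTime now = DateTime.UtcNow;

    // TryAdd succeeds only the first time the string is seen.
    if (Seen.TryAdd(s, now + ExpireAfter))
        return true;

    // If the existing entry has expired, refresh it atomically and process again.
    DateTime expiry;
    if (Seen.TryGetValue(s, out expiry) && expiry < now)
        return Seen.TryUpdate(s, now + ExpireAfter, expiry);

    return false;
}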
If memory footprint of storing whole strings is not acceptable, you have only two choices:
1) Store only hashes of the strings, which implies the possibility of hash collisions (when the hash is shorter than the strings). A good hash function (MD5, SHA1, etc.) makes such a collision nearly impossible, so it only depends on whether it is fast enough for your purpose.
2) Use some kind of lossless compression. Strings usually have a good compression ratio (about 10%), and some algorithms such as ZIP let you choose between fast (less efficient) and slow (higher compression ratio) compression. Another way to shrink strings is to convert them to UTF-8, which is fast and easy to do and roughly halves the size for ASCII-heavy text, since .NET strings are UTF-16 in memory.
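A sketch of both ideas combined (UTF-8 bytes, optionally gzipped on top, using System.IO.Compression):

// UTF-8 roughly halves the in-memory size of ASCII-heavy text;
// GZip can shrink it further at a CPU cost.
static byte[] CompressString(string s)
{
    byte[] utf8 = Encoding.UTF8.GetBytes(s);
    using (var ms = new MemoryStream())
    {
        using (var gz = new GZipStream(ms, CompressionMode.Compress))
            gz.Write(utf8, 0, utf8.Length);
        return ms.ToArray();
    }
}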
Whichever way you choose, it's always a tradeoff between memory footprint and hashing/compression speed. You will probably need to do some benchmarking to choose the best solution.

Calculate a checksum for a string

I've got a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach which is reasonably unique, i.e. where the likelihood of a collision is very small?
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update2
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext in linux. i.e. the user just writes (in a razor view)
#T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identity that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient and I'm therefore looking for a way to generate a key out of it.
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense, either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a resonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
hash = BitConverter.ToString(
md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash= sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
I realize this post is practically ancient, but I stumbled upon it, having run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to look up against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed the checksum, and then filtered on rows where the checksums were equal and the text was equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
Good luck!
To guarantee uniqueness for strings of almost unlimited size, treat the variable-length string as a set of concatenated substrings, each of some fixed length x. Your hash function then only needs to guarantee uniqueness for the maximum substring length and can generate a series of checksum values, one per substring. Think of it as the equivalent of a network IP address: a sequence of checksum numbers.
Your issue with collisions is the assumption that a collision forces a slower search method to resolve it. If there is an insignificant number of possible collisions compared to the number of hashed objects, then the extra overhead as a whole becomes nil. A collision is due to a table being sized smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes" and each entry in the table may carry a reference count of the objects at that slot; only if this count is greater than 1 does a collision occur, or do multiple instances of the same substring exist.

How to hash a URL quickly

I have a unique situation where I need to produce hashes on the fly. Here is my situation. This question is related to here. I need to store many URLs in the database which need to be indexed. A URL can be over 2000 characters long. The database complains that a string over 900 bytes cannot be indexed. My solution is to hash the URL using MD5 or SHA256. I am not sure which hashing algorithm to use. Here are my requirements:
Shortest character length with minimal collision
Needs to be very fast. I will be hashing the referurl on every page request
Collisions need to be minimized since I may have millions of urls in the database
I am not worried about security. I am worried about character length, speed, and collisions. Anyone know of a good algorithm for this?
In your case, I wouldn't use any of the cryptographic hash functions (i.e. MD5, SHA), since they were designed with security in mind: they mainly want to make it as hard as possible to find two different strings with the same hash. I think this wouldn't be a problem in your case. (The possibility of random collisions is inherent to hashing, of course.)
I'd strongly suggest not using String.GetHashCode(), since the implementation is not documented and MSDN says that it might vary between different versions of the framework. Even the results between the x86 and x64 versions may be different. So you'll get into trouble when trying to access the same database using a newer (or different) version of the .NET Framework.
I found the algorithm for the Java implementation of hashCode on Wikipedia (here); it seems quite easy to implement. Even a straightforward implementation would be faster than an implementation of MD5 or SHA, IMO. You could also use long values, which reduces the probability of collisions.
There is also a short analysis of the .NET GetHashCode implementation here (not the algorithm itself but some implementation details), you could also use this one I guess. (or try to implement the Java version in a similar way ...)
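For reference, the Java-style String.hashCode() is just a polynomial hash; a sketch of a 64-bit variant (using long to reduce collisions, as the answer suggests):

// h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], computed over 64 bits.
static long JavaStyleHash(string s)
{
    long h = 0;
    foreach (char c in s)
        h = unchecked(31 * h + c);
    return h;
}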
A quick one:
URLString.GetHashCode().ToString("x")
While both MD5 and SHA1 have been proven weak where collision resistance is essential, I suspect that for your application either would be sufficient. I don't know for sure, but I suspect that MD5 would be the simpler and quicker of the two algorithms.
Use the System.Security.Cryptography.SHA1Cng class, I would suggest. It's 160 bits or 20 bytes long, so that should definitely be small enough. If you need it to be a string, it will only require 40 characters, so that should suit your needs well. It should also be fast enough, and as far as I know, no collisions have yet been found.
I'd personally use String.GetHashCode(). This is the basic hash function. I honestly have no idea how it performs compared to other implementations but it should be fine.
Either of the two hashing functions that you name should be quick enough that you won't notice much difference between them. Unless this site requires ultra-high performance, I would not worry too much about them. I'd personally probably go for MD5. This can be formatted as a 32-character hexadecimal string or as a 24-character base-64 string.
The reason I'd go for MD5 is because you are very unlikely to run into collisions, and even if you do you can structure your queries with "where urlhash = @hash and url = @url". The database engine should work out that one is indexed and the other isn't and use that information to do a sensible search.
If there are collisions, the indexed scan on urlhash will return a handful of results, which will be easy to do text comparisons on to get the right one. This is unlikely to be relevant very often, though; you've got a pretty low chance of getting collisions this way.
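A sketch of that two-part lookup from C# (the table and column names, and the connection, urlHash, and url variables, are made up for illustration):

// Look up by the indexed hash first, then confirm the full URL to
// eliminate any collision.
using (var cmd = new SqlCommand(
    "SELECT Id FROM Urls WHERE UrlHash = @hash AND Url = @url", connection))
{
    cmd.Parameters.AddWithValue("@hash", urlHash);
    cmd.Parameters.AddWithValue("@url", url);
    object id = cmd.ExecuteScalar();
}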
Reflected source code of the GetHashCode function in .NET 4.0:
public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
There are O(n) simple operations (+, <<, ^) and one multiplication, so this is very fast.
I've tested this function on a database of 3 million strings with lengths up to 256 characters, and about 97% of the strings had no collision (at most 5 strings shared the same hash).
You may want to look at the following project:
CMPH - C Minimal Perfect Hashing Library
And check out the following hot topics listing for perfect hashes:
Hottest 'perfect-hash' Answers - Stack Overflow
You could also consider using a full text index in SQL rather than hashing:
CREATE FULLTEXT INDEX (Transact-SQL)

Generate a short code based on a unique string in C#

I'm just about to launch the beta of a new online service. Beta subscribers will be sent a unique "access code" that allows them to register for the service.
Rather than storing a list of access codes, I thought I would just generate a code based on their email, since this itself is unique.
My initial thought was to combine the email with a unique string and then Base64 encode it. However, I was looking for codes that are a bit shorter, say 5 digits long.
If the access code itself needs to be unique, it will be difficult to ensure against collisions. If you can tolerate a case where two users might, by coincidence, share the same access code, it becomes significantly easier.
Taking the base-64 encoding of the e-mail address concatenated with a known string, as proposed, could introduce a security vulnerability. If you used the base64 output of the e-mail address concatenated with a known word, the user could just decode the access code and derive the algorithm used to generate the code.
One option is to take the SHA-1-HMAC hash (System.Security.Cryptography.HMACSHA1) of the e-mail address with a known secret key. The output of the hash is a 20-byte sequence. You could then truncate the hash deterministically. For instance, in the following, GetCodeForEmail("test@example.org") gives a code of 'PE2WEG':
// define characters allowed in passcode. set length so divisible into 256
static char[] ValidChars = {'2','3','4','5','6','7','8','9',
                            'A','B','C','D','E','F','G','H',
                            'J','K','L','M','N','P','Q',
                            'R','S','T','U','V','W','X','Y','Z'}; // len=32
const string hashkey = "password"; // key for HMAC function -- change!
const int codelength = 6; // length of passcode

string GetCodeForEmail(string address)
{
    byte[] hash;
    using (HMACSHA1 sha1 = new HMACSHA1(ASCIIEncoding.ASCII.GetBytes(hashkey)))
        hash = sha1.ComputeHash(UTF8Encoding.UTF8.GetBytes(address));

    int startpos = hash[hash.Length - 1] % (hash.Length - codelength);
    StringBuilder passbuilder = new StringBuilder();
    for (int i = startpos; i < startpos + codelength; i++)
        passbuilder.Append(ValidChars[hash[i] % ValidChars.Length]);
    return passbuilder.ToString();
}
You may create a special hash from their email which is less than 6 chars, but it wouldn't really be "unique"; there will always be collisions in such a small space. I'd rather go with a longer key, or store pre-generated codes in a table anyway.
So, it sounds like what you want to do here is to create a hash function specifically for emails, as @can poyragzoglu pointed out. A very simple one might look something like this:
(pseudo code)
foreach char c in email:
running total += [large prime] * [unicode value]
then do running total % large 5 digit number
As he pointed out though, this will not be unique unless you have an excellent hash function; you're likely to have collisions. Not sure if that matters.
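For completeness, a literal C# rendering of that pseudocode might look like this (31 and 100000 are illustrative choices for the prime and the 5-digit modulus):

static int ShortCodeFromEmail(string email)
{
    long runningTotal = 0;
    foreach (char c in email)
        runningTotal += 31L * c; // prime * unicode value of the character
    return (int)(runningTotal % 100000);
}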
What seems easier to me: if you already know the valid emails, just check the user's email against your list of valid ones upon registration. Why bother with a code at all?
If you really want a unique identifier though, the easiest way to do this is probably to just use what's called a GUID. C# natively supports this, and you could store it in your Users table. Though it would be far too long for a user to ever remember or type out, it would almost certainly be unique for each one, if that's what you're trying to do.
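For example (the "N" format just strips the dashes; any Guid format would do):

string accessCode = Guid.NewGuid().ToString("N"); // 32 hex characters per user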
