I'm writing a C# API which stores SWIFT message types. I need to write a class that takes the entire string message, creates a hash of it, and stores that hash in the database, so that when a new message is processed, it creates another hash and checks it against the hashes already in the database.
I have the following
public static byte[] GetHash(string inputString)
{
    // MD5 for speed; SHA1.Create() (or stronger) would also work here
    using (HashAlgorithm algorithm = MD5.Create())
    {
        return algorithm.ComputeHash(Encoding.UTF8.GetBytes(inputString));
    }
}
and I need to know: will this do?
Global comment:
I receive the files over a secure network, so we have full control over their validity. What I need to control is duplicate payments being made. I could split the record down into its respective tag elements (SWIFT terminology) and then check them individually, but each element would then need to be compared against records in the database, and that cost isn't acceptable.
I need to check whether the entire message is a duplicate of a message already processed, which is why I used this approach.
It depends on what you want to do. If you are expecting messages to never be intentionally tampered with, even CRC64 will do just fine.
If you want a .NET provided solution that is fast and provides no cryptographic security, MD5 is just fine and will work for what you need.
If you need to determine if a message is different from another, and you expect someone to tamper with the data in transit and it may potentially be modified with bit twiddling techniques to force a hash collision, you should use SHA-256 or SHA-512.
Collisions shouldn't be a problem unless you are hashing billions of messages or someone is tampering with the data in transit. If someone is tampering with the data in transit, you have bigger problems.
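For the tamper-resistant case, here is a minimal sketch of the question's helper rewritten around SHA-256; the method name and UTF-8 encoding are carried over from the question's code, and SHA512.Create() can be swapped in for the larger digest:
using System.Security.Cryptography;
using System.Text;

public static class MessageHasher
{
    // Same shape as the question's helper, but with a cryptographically
    // strong digest and proper disposal of the algorithm instance.
    public static byte[] GetHash(string inputString)
    {
        using (HashAlgorithm algorithm = SHA256.Create())
        {
            return algorithm.ComputeHash(Encoding.UTF8.GetBytes(inputString));
        }
    }
}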
You could implement it the way Dictionary implements it: with a bucket system.
Have a hash column in the database, and store the raw data alongside it:
----------------
| Hash | Value |
----------------
By searching the hashes first, the query will be faster; if there are multiple hits (as there will be at some point with MD5), you can iterate through them and compare the stored values more closely to see whether they really are the same.
But as Michael J. Gray says, the probability of a collision is very small on smaller datasets.
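A rough sketch of that check, where FindByHash is a hypothetical data-access call returning the Value column of all rows whose Hash column matches:
// Query by hash first, then compare the full message only for the
// (rare) rows that share the hash, to rule out collisions.
// FindByHash is a hypothetical query against the Hash/Value table above.
public bool IsDuplicate(string message, byte[] hash)
{
    foreach (string storedValue in FindByHash(hash))
    {
        if (storedValue == message)
            return true; // a genuine duplicate, not just a hash collision
    }
    return false;
}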
Related
I put in place a system to make two systems communicate with each other.
System A puts a message M in the DB; system B processes the message and puts a message M' in the DB.
System A is "subscribed" to M'.
To "sign" this message (a big xml file) I use the hashcode, so I can understand that message M' = M.
This is working for most messages but for other is not working properly (I opened in notepad the two files M and M' and they are the same).
Probably system B format (without of course changing the content) the message, causing a different hashcode in the way back.
Does it sound reasonable?
How to sign the message in a more robust way?
So far I'm using C# on .NET 3.5 to do this, and I cannot change technology.
I'm reading M (and generating its hash code) from the file system in this way:
_currentHashCode = File.ReadAllText(file.FullName).GetHashCode();
After all the processing in B, I'm notified by B, which sends me M' in an object:
object messageObj;
....
int hash = messageObj.ToString().GetHashCode();
Thanks
GetHashCode() does not return a cryptographic hash. You need to use one of the mechanisms in the System.Security.Cryptography namespace to create a hash of the kind you want to use.
From MSDN:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
Do not serialize hash code values or store them in databases.
Do not use the hash code as the key to retrieve an object from a keyed collection.
Do not send hash codes across application domains or processes. In some cases, hash codes may be computed on a per-process or per-application-domain basis.
Do not use the hash code instead of a value returned by a cryptographic hashing function if you need a cryptographically strong hash. For cryptographic hashes, use a class derived from the System.Security.Cryptography.HashAlgorithm or System.Security.Cryptography.KeyedHashAlgorithm class.
Do not test for equality of hash codes to determine whether two objects are equal. (Unequal objects can have identical hash codes.) To test for equality, call the ReferenceEquals or Equals method.
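For example, a minimal sketch that replaces the GetHashCode() calls above with a SHA-256 hash of the raw file bytes, which is stable across processes and machines:
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHasher
{
    // Hash the raw file bytes and return a hex string; unlike
    // GetHashCode(), this value is stable and safe to store or compare.
    public static string HashFile(string path)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(File.ReadAllBytes(path));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}
Note that even a cryptographic hash will differ if system B reformats the XML on the way back; in that case you would have to canonicalize the XML (or hash only the content you care about) before hashing.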
I have a list of 10 to at most 300 string codes (40 word characters, capitalized) that need to be stored inside an OAuth2 access token (claims-based authorization).
I have to keep the token as small as I can (header size problem), so I'm searching for a way to create a small unique identifier representing the original string inside the token.
I would then create a lookup table where I will put the uid and the original string.
When the token is sent by the client, I will get the original string back through the uid and the lookup table.
I've read that it is possible to truncate the first bytes of a hash (MD5, SHA1) and I would like to know if I can follow this path safely.
Is it possible to safely (collision wise) create a list of hashes (unique) of these strings where each hash would be 4/5 bytes max?
Edit:
I can't pre-generate a random string as an index (or just a list index, for example) because this list could change and increase in size (when the server application is deployed and new codes are added to the list, for example), so I have to be sure that when I get the token back from the client, the uid is bound to the correct code.
Yes, any of those hash algorithms gives a uniform hash code in which no bit carries more information than any other. You can just take any 4-5 bytes of it (as long as you take the same bytes from each code) and use them as a smaller hash code.
Naturally the collision risk gets higher the shorter the hash code is, but you will still get the lowest possible collision risk for that hash code length.
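A minimal sketch of that truncation, assuming SHA-1 over the UTF-8 bytes of each code and keeping the first 4 bytes:
using System;
using System.Security.Cryptography;
using System.Text;

public static class ShortId
{
    // Keep the first 4 bytes of the digest as the short identifier.
    // Any 4 bytes would do, as long as the same bytes are taken from every code.
    public static string Truncated(string code)
    {
        using (SHA1 sha = SHA1.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(code));
            byte[] prefix = new byte[4];
            Array.Copy(digest, prefix, 4);
            return Convert.ToBase64String(prefix); // compact form for the token
        }
    }
}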
Edit:
As the question changed: no, you can't create truly unique identifiers using a hash code. With a long enough hash code you can make collisions rare enough that the hash code can be used as a unique identifier for almost any practical application, but a 32-bit hash code doesn't do that; a 128-bit hash code would.
We are trying to convert the text "HELLOWORLDTHISISALARGESTRINGCONTENT" into a smaller text. Doing it with an MD5 hash we get 16 bytes, but since hashing is one-way we are not able to get the original back. Is there any other way to convert this large string to a smaller one and then recover the same data? If so, please let us know how.
Thanks in advance.
Most compression algorithms won't be able to do much with a sequence that short (or may actually make it bigger), so no: there isn't much you can do to magically shrink it. Your best bet would probably be to just generate a GUID, store the full value keyed against the GUID (in a database or whatever), and then use the short value as a one-time usage key to look up the long value (and then erase the record).
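A minimal in-memory sketch of that idea (a real system would back the dictionary with a database table):
using System;
using System.Collections.Generic;

public class OneTimeKeyStore
{
    private readonly Dictionary<Guid, string> _store = new Dictionary<Guid, string>();

    // Store the full value and hand back a short key for it.
    public Guid Put(string value)
    {
        Guid key = Guid.NewGuid();
        _store[key] = value;
        return key;
    }

    // Look up the value and erase the record (one-time usage).
    public string Take(Guid key)
    {
        string value = _store[key];
        _store.Remove(key);
        return value;
    }
}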
It heavily depends on the input data. In general (the worst case), you can't reduce the size of a string through compression if the input data is short and has high entropy.
Hashing is the wrong approach, as a hash function maps large input data to a short value but does not guarantee (by itself) that you can't find a second set of data that maps to the same string.
What you can try is to implement a compression algorithm or a lookup table.
Compression can be done with ziplib or any other compression library (just google for it). The lookup approach requires a second place to store the mapping information. For example, when you get the first input string, you map it to the number 1 and save the information "1 maps to {input data}" somewhere else. For every subsequent data set you add another mapping entry. If the set of input data is finite, this approach may save you space.
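A minimal sketch of that lookup approach, mapping each distinct input to a small incrementing number:
using System.Collections.Generic;

public class LookupTable
{
    private readonly Dictionary<string, int> _ids = new Dictionary<string, int>();
    private readonly Dictionary<int, string> _values = new Dictionary<int, string>();
    private int _next = 1;

    // Repeated inputs reuse their number; new inputs get the next one.
    public int GetId(string input)
    {
        int id;
        if (!_ids.TryGetValue(input, out id))
        {
            id = _next++;
            _ids[input] = id;
            _values[id] = input;
        }
        return id;
    }

    // Reverse mapping: recover the original data from its number.
    public string GetValue(int id)
    {
        return _values[id];
    }
}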
I have a bunch of long strings which I have to manipulate. They can occur again and again and I want to ignore them if they appear twice. I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
Requirements:
Be able to add items (hashes) to my collection
Be able to (quickly) check whether a particular hash is already in the collection.
Not too memory intensive. I might end up with ~100,000 of these hashes.
I don't need to go backwards (key -> value) if that makes any difference.
Any suggestions on which .NET data type would be most efficient?
I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
No, don't do that. Two reasons:
Hashes only tell you if two values might be the same; they don't tell you if they are the same.
You'd be doing a lot of work which has already been done for you.
Basically, you should just keep a HashSet<string>. That should be fine, gives quick lookups, and you don't need to implement it yourself.
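In .NET terms, a minimal sketch:
using System.Collections.Generic;

public class Deduplicator
{
    private readonly HashSet<string> _seen = new HashSet<string>();

    // Returns true the first time a string is seen, false for repeats;
    // HashSet<T>.Add does the lookup and the insert in one O(1) call.
    public bool IsNew(string s)
    {
        return _seen.Add(s);
    }
}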
The downside is that you will end up keeping all the strings in memory. If that's a problem then you'll need to work out an alternative strategy... which may indeed end up keeping just the hashes in memory. The exact details will probably depend on where the strings come from, and what sort of problem it would cause if you got a false positive. For example, you could keep an MD5 hash of each string, as a "better than just hashCode" hash - but that would still allow an attacker to present you with another string with the same hash. Is that a problem? If so, a more secure hash algorithm (e.g. SHA-256) might help. It still won't guarantee that you end up with different hashes for different strings though.
If you really want to be sure, you'd need to keep the hashes in memory but persist the actual string data (to disk or a database) - then when you've got a possible match (because you've seen the same hash before) you'd need to compare the stored string with the fresh one.
If you're storing the hashes in memory, the best approach will depend on the size of hash you're using. For example, for just a 64-bit hash you could use a Long per hash and keep it in a HashSet<Long>. For longer hashes, you'd need an object which can easily be compared etc. At that point, I suggest you look at Guava and its HashCode class, along with the factory methods in HashCodes (Deprecated since Guava v16).
Use a set.
ISet<T> interface is implemented by e.g. HashSet<T>
Add and Contains are expected O(1), unless you have a really poor hashing function, then the worst case is O(n).
I'm trying to create a URL shortener system in C# and ASP.NET MVC. I know about hashtables and I know how to create a redirect system, etc. The problem is indexing long URLs in the database. Some URLs may be up to 4000 characters long, and it seems a bad idea to index strings of that length. The question is: how can I create a unique short string for each URL? For example, can MD5 help me? Is MD5 really unique for each string?
NOTE: I see that Gravatar uses MD5 for emails, so if each email address is unique, then its MD5 hash is unique. Is that right? Can I use the same solution for URLs?
You can use MD5 or SHA-1 for purposes such as you described.
Hashes aren't completely unique. For example, if you have a 4000-byte array, you potentially have 256^4000 combinations, while an MD5 hash has only 256^16 combinations, so there is a possibility of collisions. However, for all practical purposes (except cryptography), you never need to worry about collisions.
If you are interested in reading about the collision vulnerability of MD5 (related to cryptographic use), you can do so here.
A perfect hash function is one that guarantees no collisions. Since your application cannot accommodate hash chains, a perfect hash is the way to go.
The hashing approaches already mentioned will work fine for creating unique short strings that will probably uniquely identify your URL's. However, I'd like to propose an alternate approach.
Create a database table with two columns, ID (an integer) and URL (a string). Create a row in the table for each URL you wish to track, then refer to each URL by its ID. Make the ID auto-incrementing; this will ensure uniqueness.
This addresses the problem of how to translate from the shortened version to the longer version: simply join on the table in the database. With hashing, this would become a problem because hashing is one-way. The resulting page identifiers will also be shorter than MD5 hashes, and will only contain digits so they will be easy to include in URL query strings, etc.
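A sketch of the idea, assuming a hypothetical Urls table with an auto-increment ID; as an optional extra beyond the answer above (which uses plain digits), the numeric ID can be base-62 encoded to make the short key even more compact:
using System.Text;

public static class ShortKey
{
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Encode the auto-increment ID as base-62, e.g. 125 -> "21".
    public static string Encode(long id)
    {
        if (id == 0) return "0";
        var sb = new StringBuilder();
        while (id > 0)
        {
            sb.Insert(0, Alphabet[(int)(id % 62)]);
            id /= 62;
        }
        return sb.ToString();
    }

    // Decode reverses it, giving back the ID to join on in the database.
    public static long Decode(string key)
    {
        long id = 0;
        foreach (char c in key)
            id = id * 62 + Alphabet.IndexOf(c);
        return id;
    }
}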
I think you could turn the URL string into a byte array (each char can be a byte) and then apply an encoding (Base64, for example, or you can create one yourself if you want to go that far). To decode, you just apply Base64 decoding and turn the bytes in the array back into chars. I am not sure whether the result will be a long string or not, but I am pretty sure it will be unique.
(PS: you should of course apply some logic first, like always removing http:// and adding it back later when decoding.)
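A minimal sketch of that round trip using the framework's built-in Base64 helpers (note that Base64 output is about a third longer than the input, so this shortens nothing, but it is fully reversible):
using System;
using System.Text;

public static class UrlCodec
{
    // URL -> UTF-8 bytes -> Base64 text
    public static string Encode(string url)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(url));
    }

    // Base64 text -> bytes -> original URL
    public static string Decode(string encoded)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
    }
}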