How to "sign" a big string to be identified later? - c#

I put in place a system to make two systems communicate with each other.
System A puts a message M in the DB, system B processes the message and puts a message M' in the DB.
System A is "subscribed" to M'.
To "sign" this message (a big XML file) I use the hash code, so I can tell that message M' = M.
This works for most messages, but for others it doesn't work properly (I opened the two files M and M' in Notepad and they are the same).
Probably system B reformats the message (without, of course, changing the content), causing a different hash code on the way back.
Does it sound reasonable?
How to sign the message in a more robust way?
So far, I'm using C# and .NET 3.5 to do this, and I cannot change technology.
I'm reading M (and generating the hash code) from the file system in this way:
_currentHashCode = File.ReadAllText(file.FullName).GetHashCode();
After all the processing in B, I'm notified by B, which sends me M' in an object:
object messageObj;
....
int hash = messageObj.ToString().GetHashCode();
Thanks

GetHashCode() does not return a cryptographic hash. You need to use one of the mechanisms in the System.Security.Cryptography namespace to create a hash suitable for the way you want to use it.
From MSDN:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:
Do not serialize hash code values or store them in databases.
Do not use the hash code as the key to retrieve an object from a keyed collection.
Do not send hash codes across application domains or processes. In some cases, hash codes may be computed on a per-process or per-application domain basis.
Do not use the hash code instead of a value returned by a cryptographic hashing function if you need a cryptographically strong hash. For cryptographic hashes, use a class derived from the System.Security.Cryptography.HashAlgorithm or System.Security.Cryptography.KeyedHashAlgorithm class.
Do not test for equality of hash codes to determine whether two objects are equal. (Unequal objects can have identical hash codes.) To test for equality, call the ReferenceEquals or Equals method.
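As a sketch of what that could look like using SHA-256 (SHA256Managed is available in .NET 3.5). Note that if system B reformats the XML, even a cryptographic hash of the raw text will still differ, so you would also need to normalize/canonicalize the XML on both sides before hashing:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class ContentHash
{
    // Computes a SHA-256 hash of the given text and returns it as a hex string.
    public static string Compute(string text)
    {
        using (SHA256 sha = new SHA256Managed())
        {
            byte[] hashBytes = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
            StringBuilder sb = new StringBuilder(hashBytes.Length * 2);
            foreach (byte b in hashBytes)
                sb.Append(b.ToString("x2"));
            return sb.ToString();
        }
    }
}

// Usage with the question's variables (file and messageObj come from the question):
// string hashOfM      = ContentHash.Compute(File.ReadAllText(file.FullName));
// string hashOfMPrime = ContentHash.Compute(messageObj.ToString());
// bool same = hashOfM == hashOfMPrime; // still false if B reformats the XML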

Related

When signing and verifying data, why do we need the original data for verification?

I was reading this article; there are two byte arrays, one for the signed data and one for the original data.
byte[] originalData = ByteConverter.GetBytes(dataString);
byte[] signedData;
We sign the data, and this part is OK, but I cannot understand why, for verification, we should use the original data.
// Hash and sign the data.
signedData = HashAndSignBytes(originalData, Key);
// Verify the data and display the result to the
// console.
VerifySignedHash(originalData, signedData, Key);
As an example, we sign some data on the server and send it to the client. The client wants to determine whether I sent that data or not, so why should I have to send the original data just so the client can verify it?
There are some posts that did it the same way:
Signing and verifying signatures with RSA C#
C# Signing and verifying signatures with RSA. Encoding issue
When passing only the signedData, the other party cannot tell what the originalData is just from that.
To verify, you need the signedData together with the originalData and the public key.
The VerifySignedHash function in the code mentioned above calls RSACryptoServiceProvider.VerifyData.
From the docs:
Verifies that a digital signature is valid by determining the hash value in the signature using the provided public key and comparing it to the hash value of the provided data.
A cryptographic hash function hash(x) has certain desirable properties:
One-way: hash(x) gives you y. Given x, it is easy to compute y. But the reverse, given y, finding x may be very difficult or impossible.
Collisions: Since the size (in bits) of the input is much larger than the hash size (eg: we can compute SHA-256 of Gigabytes of data), multiple inputs can theoretically produce the same hash. Although this is theoretically the case, hash algorithms are designed to keep collisions to a minimum in practical settings.
Unpredictability: A small change in the input causes completely different hashes to be generated. This helps in detecting data tampering (eg: changing a payment from $100.00 to $10000)
These are some of the properties that make hashes suitable for cryptographic signatures and verification of those signatures.
why should we use original data for verification (paraphrased)
Sending the original data allows the recipient to recompute the hash independently and compare it with the hash carried in the signature, ensuring that the data received is the same as what the sender sent.
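A minimal sketch of that round trip, loosely following the article's HashAndSignBytes/VerifySignedHash shape (SHA-1 here only because that is what the MSDN sample uses; key handling is simplified):

using System;
using System.Security.Cryptography;
using System.Text;

static class SigningDemo
{
    // Sign: hash the data with SHA-1 and sign that hash with the private key.
    static byte[] HashAndSignBytes(byte[] dataToSign, RSAParameters privateKey)
    {
        using (var rsa = new RSACryptoServiceProvider())
        {
            rsa.ImportParameters(privateKey);
            return rsa.SignData(dataToSign, new SHA1CryptoServiceProvider());
        }
    }

    // Verify: re-hash the received data and check it against the signature,
    // using only the public key -- this is exactly why the original data is needed.
    static bool VerifySignedHash(byte[] dataToVerify, byte[] signedData, RSAParameters publicKey)
    {
        using (var rsa = new RSACryptoServiceProvider())
        {
            rsa.ImportParameters(publicKey);
            return rsa.VerifyData(dataToVerify, new SHA1CryptoServiceProvider(), signedData);
        }
    }

    static void Main()
    {
        byte[] originalData = Encoding.ASCII.GetBytes("Data to sign");
        using (var keyPair = new RSACryptoServiceProvider(2048))
        {
            byte[] signedData = HashAndSignBytes(originalData, keyPair.ExportParameters(true));
            // The verifier needs the original data, the signature, and the public key.
            Console.WriteLine(VerifySignedHash(originalData, signedData, keyPair.ExportParameters(false)));
        }
    }
}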

Hash a string for duplicate detection

I'm writing a C# API which stores SWIFT message types. I need to write a class that takes the entire string message, creates a hash of it, and stores this hash in the database, so that when a new message is processed it creates another hash and checks this hash against the ones in the database.
I have the following
public static byte[] GetHash(string inputString)
{
HashAlgorithm algorithm = MD5.Create(); // or SHA1.Create()
return algorithm.ComputeHash(Encoding.UTF8.GetBytes(inputString));
}
and I need to know if this will do.
Global comment:
So, I receive the files on a secure network, so we have full control over their validity. What I need to control is duplicate payments being made. I could split the record down into its respective tag elements (SWIFT terminology) and then check them individually, but that would then have to be compared against records in the database, and that cost isn't acceptable.
I need to check whether the entire message is a duplicate of a message already processed, which is why I used this approach.
It depends on what you want to do. If you are expecting messages to never be intentionally tampered with, even CRC64 will do just fine.
If you want a .NET provided solution that is fast and provides no cryptographic security, MD5 is just fine and will work for what you need.
If you need to determine if a message is different from another, and you expect someone to tamper with the data in transit and it may potentially be modified with bit twiddling techniques to force a hash collision, you should use SHA-256 or SHA-512.
Collisions shouldn't be a problem unless you are hashing billions of messages or someone is tampering with the data in transit. If someone is tampering with the data in transit, you have bigger problems.
You could implement it the way Dictionary implements it: the bucket system.
Have a Hash value in the database, and store the raw data.
----------------
| Hash | Value |
----------------
By searching through the hashes first, the query will be faster, and if there are multiple hits, as at some point there will be with MD5, you can just iterate through them and compare them more closely to see if they really are the same.
But as Michael J. Gray says, the probability of a collision is very small, on smaller datasets.
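A minimal in-memory sketch of that bucket idea (in the real system the hash column would be indexed in the database and the full message fetched and compared only on a hash hit; the class and method names here are made up):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class DuplicateDetector
{
    // hash (hex string) -> all full messages previously seen with that hash
    private readonly Dictionary<string, List<string>> buckets =
        new Dictionary<string, List<string>>();

    // Returns true if an identical message was already processed.
    public bool IsDuplicate(string message)
    {
        string key = Hash(message);
        List<string> bucket;
        if (buckets.TryGetValue(key, out bucket))
        {
            // Hash hit: confirm with a full comparison to rule out collisions.
            foreach (string stored in bucket)
                if (stored == message)
                    return true;
        }
        else
        {
            bucket = new List<string>();
            buckets.Add(key, bucket);
        }
        bucket.Add(message);
        return false;
    }

    private static string Hash(string input)
    {
        using (MD5 md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(input)));
    }
}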

Which collection type should I use to store a bunch of hashes?

I have a bunch of long strings which I have to manipulate. They can occur again and again and I want to ignore them if they appear twice. I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
Requirements:
Be able to add items (hashes) to my collection
Be able to (quickly) check whether a particular hash is already in the collection.
Not too memory intensive. I might end up with ~100,000 of these hashes.
I don't need to go backwards (key -> value) if that makes any difference.
Any suggestions on which .NET data type would be most efficient?
I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
No, don't do that. Two reasons:
Hashes only tell you if two values might be the same; they don't tell you if they are the same.
You'd be doing a lot of work which has already been done for you.
Basically, you should just keep a HashSet<String>. That should be fine, gives quick lookups, and you don't need to implement it yourself.
The downside is that you will end up keeping all the strings in memory. If that's a problem then you'll need to work out an alternative strategy... which may indeed end up keeping just the hashes in memory. The exact details will probably depend on where the strings come from, and what sort of problem it would cause if you got a false positive. For example, you could keep an MD5 hash of each string, as a "better than just hashCode" hash - but that would still allow an attacker to present you with another string with the same hash. Is that a problem? If so, a more secure hash algorithm (e.g. SHA-256) might help. It still won't guarantee that you end up with different hashes for different strings though.
If you really want to be sure, you'd need to keep the hashes in memory but persist the actual string data (to disk or a database) - then when you've got a possible match (because you've seen the same hash before) you'd need to compare the stored string with the fresh one.
If you're storing the hashes in memory, the best approach will depend on the size of hash you're using. For example, for just a 64-bit hash you could use a Long per hash and keep it in a HashSet<Long>. For longer hashes, you'd need an object which can easily be compared etc. At that point, I suggest you look at Guava and its HashCode class, along with the factory methods in HashCodes (Deprecated since Guava v16).
Use a set.
The ISet<T> interface is implemented by e.g. HashSet<T>.
Add and Contains are expected O(1); only with a really poor hash function does the worst case degrade to O(n).
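A minimal sketch of the set-based approach suggested above, in C# (this keeps the full strings in memory, trading memory for guaranteed-correct duplicate detection; incomingStrings and Process are placeholders):

using System.Collections.Generic;

class Deduplicator
{
    private readonly HashSet<string> seen = new HashSet<string>();

    // Returns true the first time a string is seen, false on repeats.
    // HashSet<T>.Add already does the "check and insert" in one expected O(1) call.
    public bool IsNew(string value)
    {
        return seen.Add(value);
    }
}

// Usage:
// var dedup = new Deduplicator();
// foreach (string s in incomingStrings)
//     if (dedup.IsNew(s))
//         Process(s);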

Chained Hash Table and understanding Deflate

I am currently trying to create a custom Deflate implementation in C#.
I am trying to implement the "pattern search" part, where I have (up to) 32k of data and am trying to find the longest possible match for my input.
RFC 1951, which defines Deflate, says the following about that process:
The compressor uses a chained hash table to find duplicated strings, using a hash function that operates on 3-byte sequences. At any given point during compression, let XYZ be the next 3 input bytes to be examined (not necessarily all different, of course). First, the compressor examines the hash chain for XYZ. If the chain is empty, the compressor simply writes out X as a literal byte and advances one byte in the input. If the hash chain is not empty, indicating that the sequence XYZ (or, if we are unlucky, some other 3 bytes with the same hash function value) has occurred recently, the compressor compares all strings on the XYZ hash chain with the actual input data sequence starting at the current point, and selects the longest match.
I do know what a hash function is, and I know what a hash table is as well. But what is a "chained hash table", and how could such a structure be designed to be efficient (in C#) when handling a large amount of data? Unfortunately I didn't understand how the structure described in the RFC works.
What kind of hash function could I choose (what would make sense)?
Thank you in advance!
A chained hash table is a hash table that stores every item you put in it, even if the key for 2 items hashes to the same value, or even if 2 items have exactly the same key.
A DEFLATE implementation needs to store a bunch of (key, data) items in no particular order, and rapidly look-up a list of all the items with that key.
In this case, the key is 3 consecutive bytes of uncompressed plaintext, and the data is some sort of pointer or offset to where that 3-byte substring occurs in the plaintext.
Many hashtable/dictionary implementations store both the key and the data for every item.
It's not necessary to store the key in the table for DEFLATE, but it doesn't hurt anything other than using slightly more memory during compression.
Some hashtable/dictionary implementations such as the C++ STL unordered_map insist that every (key, data) item they store must have a unique key. When you try to store another (key, data) item with the same key as some older item already in the table, these implementations delete the old item and replace it with the new item.
That does hurt -- if you accidentally use the C++ STL unordered_map or similar implementation, your compressed file will be larger than if you had used a more appropriate library such as the C++ STL hash_multimap.
Such an error may be difficult to detect, since the resulting (unnecessarily large) compressed files can be correctly decompressed by any standard DEFLATE compressor to a file bit-for-bit identical to the original file.
A few implementations of DEFLATE and other compression algorithms deliberately use such an implementation, deliberately sacrificing compressed file size in order to gain compression speed.
As Nick Johnson said, the default hash function used in your standard "hashtable" or "dictionary" implementation is probably more than adequate.
http://en.wikipedia.org/wiki/Hashtable#Separate_chaining
In this case, they're describing a hashtable where each element contains a list of strings - in this case, all the strings starting with the three character prefix specified. You should simply be able to use standard .net hashtable or dictionary primitives - there's no need to replicate their exact implementation details.
32k is not a lot of data, so you don't have to worry about scaling your hashtable - and even if you did, the built-in primitives are likely to be more efficient than anything you could write yourself.
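A minimal sketch of that idea using the built-in Dictionary, mapping the hash of each 3-byte sequence to the list of positions where it starts (the 3-byte hash below just packs the bytes into an int; any reasonable mixing function would do):

using System.Collections.Generic;

class MatchFinder
{
    // hash of a 3-byte sequence -> every position in the window where it starts
    private readonly Dictionary<int, List<int>> chains = new Dictionary<int, List<int>>();

    private static int Hash3(byte[] data, int pos)
    {
        return (data[pos] << 16) | (data[pos + 1] << 8) | data[pos + 2];
    }

    // Record the 3-byte sequence starting at pos.
    public void Insert(byte[] data, int pos)
    {
        int key = Hash3(data, pos);
        List<int> chain;
        if (!chains.TryGetValue(key, out chain))
            chains[key] = chain = new List<int>();
        chain.Add(pos);
    }

    // Earlier positions whose 3-byte sequence matches the one at pos; the compressor
    // then compares the actual bytes at each candidate to pick the longest match.
    public IList<int> Candidates(byte[] data, int pos)
    {
        List<int> chain;
        return chains.TryGetValue(Hash3(data, pos), out chain) ? chain : (IList<int>)new int[0];
    }
}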

C# getting unique hash from all objects

I want to be able to get a unique hash from all objects. What's more,
in the case of
Dictionary<string, MyObject> foo
I want the unique keys for:
string
MyObject
Properties in MyObject
foo[someKey]
foo
etc..
object.GetHashCode() does not guarantee unique return values for different objects.
That's what I need.
Any idea? Thank you
"Unique hash" is generally a contradiction in terms, even in general terms (and it's more obviously impossible if you're trying to use an Int32 as the hash value). From the wikipedia entry:
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Note the "small datum" bit - in other words, there will be more possible objects than there are possible hash values, so you can't possibly have uniqueness.
Now, it sounds like you actually want the hash to be a string... which means it won't be of a fixed size (but will have to be under 2GB or whatever the limit is). The simplest way of producing this "unique hash" would be to serialize the object and convert the result into a string, e.g. using Base64 if it's a binary serialization format, or just the text if it's a text-based one such as JSON. However, that's not what anyone else would really recognise as "hashing".
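A minimal sketch of that serialize-and-encode idea, assuming the object graph is marked [Serializable] (BinaryFormatter fits the .NET Framework era of this question; the result is a reversible, variable-length representation rather than a hash in the usual sense, and two value-equal objects only produce the same string if they happen to serialize to identical bytes):

using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class UniqueKey
{
    // Serializes the object graph and Base64-encodes the resulting bytes.
    public static string FromObject(object obj)
    {
        using (var stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, obj);
            return Convert.ToBase64String(stream.ToArray());
        }
    }
}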
Simply put, this is not possible. The GetHashCode function returns a signed integer, which has 2^32 possible values. On a 64-bit platform you can have many more than 2^32 different objects running around, and hence they cannot all have unique hash codes.
The only way to approach this is to create a different hashing function which returns a type with the capacity greater than or equal to the number of values that could be created in the running system.
A unique hash code is impossible without constraints on your input space. This is because Object.GetHashCode returns an int. If you have more than Int32.MaxValue objects then at least two of them must map to the same hash code (by the pigeonhole principle).
Define a custom type with constrained input (i.e., the number of possible different objects, up to equality, is less than Int32.MaxValue), and then, and only then, is it possible to produce a unique hash code. That's not saying it will be easy, just possible.
Alternatively, don't use the Object.GetHashCode mechanism but instead some other way of representing hashes and you might be able to do what you want. We need clear details on what you want and are using it for to be able to help you here.
As others have stated, hash code will never be unique, that's not the point.
The point is to help your Dictionary<string, MyObject> foo to find the exact instance faster. It will use the hash code to narrow the search to a smaller set of objects, and then check them for equality.
You can use the Guid class to get unique strings, if what you need is a unique key. But that is not a hash code.
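For completeness, a tiny sketch of the Guid suggestion (the UniqueKey property is illustrative; unlike a hash code it is not computed from the object's contents, it is simply generated to be practically unique and stored alongside the object):

using System;

class MyObject
{
    // Assigned once at construction and kept with the object.
    public string UniqueKey { get; private set; }

    public MyObject()
    {
        UniqueKey = Guid.NewGuid().ToString("N");
    }
}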
