How can I determine, as fast as possible, whether two bitmaps are equal by value rather than by reference? Is there any fast way of doing it?
What if the comparison doesn't need to be very precise?
You can check the dimensions first and abort the comparison if they differ.
For the comparison itself you can use a variety of approaches:
CRC32
Very fast but possibly wrong; it can be used as a first check. If the CRC32 values differ, the images are different; otherwise further checking is needed.
MD5 / SHA1 / SHA512
Not as fast, but precise.
XOR
XOR the image contents and abort when the first difference comes up (a sketch of this early-abort comparison follows below).
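A minimal sketch of that early-abort idea, assuming System.Drawing bitmaps and that both images can be read in 32bpp ARGB format; the method name and the format choice are mine, not from the answer above:

    // Early-abort, byte-for-byte comparison of two bitmaps.
    // Assumes both bitmaps can be locked in the same 32bpp ARGB format.
    using System.Drawing;
    using System.Drawing.Imaging;
    using System.Runtime.InteropServices;

    static bool BitmapsAreEqual(Bitmap a, Bitmap b)
    {
        // Cheapest check first: different dimensions means different images.
        if (a.Width != b.Width || a.Height != b.Height)
            return false;

        var rect = new Rectangle(0, 0, a.Width, a.Height);
        BitmapData dataA = a.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
        BitmapData dataB = b.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
        try
        {
            int byteCount = dataA.Stride * dataA.Height;
            var bytesA = new byte[byteCount];
            var bytesB = new byte[byteCount];
            Marshal.Copy(dataA.Scan0, bytesA, 0, byteCount);
            Marshal.Copy(dataB.Scan0, bytesB, 0, byteCount);

            // Abort as soon as the first difference shows up.
            for (int i = 0; i < byteCount; i++)
                if (bytesA[i] != bytesB[i])
                    return false;
            return true;
        }
        finally
        {
            a.UnlockBits(dataA);
            b.UnlockBits(dataB);
        }
    }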
You can just use a simple hash like MD5 to determine if their contents hash to the same value.
You will need a very precise definition of "not very precise".
All the Checksum or Hash methods already posted work for an exact (pixel and bit) match only.
If you want an answer that corresponds to "they look (somewhat) alike" you will need something more complicated.
Some preprocessing based on their aspect ratio: can a 600x400 picture look like a 300x300 one?
Use a graphics algorithm to scale them both down to, say, 100x100.
Also reduce the colors.
Then compare the results pixel by pixel, with an error threshold (see the sketch below).
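Here is a rough, hedged sketch of that downscale-and-compare idea; the 100x100 size, the per-channel tolerance and the mismatch ratio are arbitrary knobs, and GetPixel is slow but keeps the example short:

    // Fuzzy comparison: scale both images down, then count pixels that differ
    // by more than a tolerance. All thresholds here are arbitrary choices.
    using System;
    using System.Drawing;

    static bool LookAlike(Bitmap a, Bitmap b, int tolerance = 16, double maxMismatchRatio = 0.05)
    {
        const int size = 100;
        using (var smallA = new Bitmap(a, new Size(size, size)))
        using (var smallB = new Bitmap(b, new Size(size, size)))
        {
            int mismatches = 0;
            for (int y = 0; y < size; y++)
            {
                for (int x = 0; x < size; x++)
                {
                    Color ca = smallA.GetPixel(x, y);
                    Color cb = smallB.GetPixel(x, y);
                    // Count a pixel as "different" if any channel differs by more than the tolerance.
                    if (Math.Abs(ca.R - cb.R) > tolerance ||
                        Math.Abs(ca.G - cb.G) > tolerance ||
                        Math.Abs(ca.B - cb.B) > tolerance)
                    {
                        mismatches++;
                    }
                }
            }
            // Accept the images as "alike" if only a small fraction of pixels differ.
            return mismatches <= size * size * maxMismatchRatio;
        }
    }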
Try comparing the hashes of the two files:
using System;
using System.IO;
using System.Security.Cryptography;

class FileComparer
{
    static void Compare()
    {
        // Create the hashing object.
        using (HashAlgorithm hashAlg = HashAlgorithm.Create())
        {
            using (FileStream fsA = new FileStream("c:\\test.txt", FileMode.Open),
                              fsB = new FileStream("c:\\test1.txt", FileMode.Open))
            {
                // Calculate the hash for the files.
                byte[] hashBytesA = hashAlg.ComputeHash(fsA);
                byte[] hashBytesB = hashAlg.ComputeHash(fsB);

                // Compare the hashes.
                if (BitConverter.ToString(hashBytesA) == BitConverter.ToString(hashBytesB))
                {
                    Console.WriteLine("Files match.");
                }
                else
                {
                    Console.WriteLine("No match.");
                }
            }
        }
    }
}
Related
I have a bunch of IDs in string form, like "enemy1", "enemy2".
I want to save progress depending on how many of each enemy I killed. For that I use a dictionary like { { "enemy1", 0 }, { "enemy2", 1 } }.
Then I want to share the player's save between the machines he can play on (like PC and laptop) over the network (serializing it to a JSON file first). To decrease size and increase performance, I use hashes instead of the full strings, using the algorithm below (because MSDN says the default hash algorithm can differ between machines):
int hash_ = 0;

public override int GetHashCode()
{
    if (hash_ == 0)
    {
        hash_ = 5381;
        foreach (var ch in id_)
            hash_ = ((hash_ << 5) + hash_) ^ ch;
    }
    return hash_;
}
So the question is: will this algorithm in C# return the same results on any machine the player uses?
UPD: in the comments I noted that the main part of the question was unclear.
So: if I can guarantee that all files will be in the same encoding before deserialization, will the char representation be the same on every machine the player can use, and will the operation ^ ch give the same result? I mean Win x64/Win x86/Mac/Linux/... machines.
Yes, that code will give the same result on every platform, for the same input. A char is a UTF-16 code unit, regardless of platform, and any given char will convert to the same int value on every platform. As normal with hash codes computed like this, you shouldn't assume that equal hash codes implies equal original values. (It's unclear how you're intending to use the hash, to be honest.)
I would point out that your code isn't thread-safe though - if two threads call GetHashCode at basically the same time, one may see a value of 0 (and therefore start hashing) whereas the second may see an interim result (as computed by the first thread) and assume that's the final hash. If you really believe caching is important here (and I'd test that first) you should compute the complete hash using a local variable, then copy it to the field only when you're done.
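A minimal sketch of that fix, assuming id_ is a string field that is set once and never changed afterwards:

    private int hash_;

    public override int GetHashCode()
    {
        int h = hash_;
        if (h == 0)
        {
            h = 5381;
            foreach (var ch in id_)
                h = ((h << 5) + h) ^ ch;
            hash_ = h; // publish the finished value only once it is complete
        }
        return h;
    }

The field is written only after the local variable holds the finished value, so another thread can observe either 0 (and recompute the same result) or the complete hash, but never a partial one.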
Previously I asked a question about combining SHA1+MD5, but after that I understood that calculating SHA1 and then MD5 of a large file is not much faster than SHA256.
In my case a 4.6 GB file takes about 10 minutes with the default SHA256 implementation (C# Mono) on a Linux system.
public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
Then I read this topic and changed my code according to what it said, to:
public static string GetChecksumBuffered(Stream stream)
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
But it doesn't have much of an effect and takes about 9 minutes.
Then I tested the same file with the sha256sum command in Linux; it takes about 28 seconds, and both the above code and the Linux command give the same result!
Someone advised me to read about differences between Hash Code and Checksum and I reach to this topic that explains the differences.
My Questions are :
What causes such a difference in time between the above code and Linux's sha256sum?
What does the above code actually do? (I mean, is it a hash code calculation or a checksum calculation? Searching for how to get a hash code of a file and how to get a checksum of a file in C# both lead to the above code.)
Is there any motivated attack against sha256sum, even though SHA256 is collision resistant?
How can I make my implementation as fast as sha256sum in C#?
public string SHA256CheckSum(string filePath)
{
    using (SHA256 SHA256 = SHA256Managed.Create())
    {
        using (FileStream fileStream = File.OpenRead(filePath))
            return Convert.ToBase64String(SHA256.ComputeHash(fileStream));
    }
}
My best guess is that there's some additional buffering in the Mono implementation of the File.Read operation. Having recently looked into checksums on a large file, on a decent spec Windows machine you should expect roughly 6 seconds per Gb if all is running smoothly.
Oddly it has been reported in more than one benchmark test that SHA-512 is noticeably quicker than SHA-256 (see 3 below). One other possibility is that the problem is not in allocating the data, but in disposing of the bytes once read. You may be able to use TransformBlock (and TransformFinalBlock) on a single array rather than reading the stream in one big gulp—I have no idea if this will work, but it bears investigating.
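If you want to experiment with that, here is a hedged sketch using TransformBlock/TransformFinalBlock with a large read buffer; the 1 MB buffer size is an arbitrary choice and the method name is mine:

    // Feed the file to the hash in large chunks instead of letting
    // ComputeHash(Stream) read it with its own small buffer.
    using System;
    using System.IO;
    using System.Security.Cryptography;

    public static string GetChecksumChunked(string file)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(file))
        {
            var buffer = new byte[1024 * 1024];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                sha.TransformBlock(buffer, 0, read, null, 0);
            }
            sha.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(sha.Hash).Replace("-", string.Empty);
        }
    }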
The difference between hashcode and checksum is (nearly) semantics. They both calculate a shorter 'magic' number that is fairly unique to the data in the input, though if you have 4.6GB of input and 64B of output, 'fairly' is somewhat limited.
A checksum is not secure, and with a bit of work you can figure out the input from enough outputs, work backwards from output to input and do all sorts of insecure things.
A Cryptographic hash takes longer to calculate, but changing just one bit in the input will radically change the output and for a good hash (e.g. SHA-512) there's no known way of getting from output back to input.
MD5 is breakable: you can fabricate an input to produce any given output, if needed, on a PC. SHA-256 is (probably) still secure, but won't be in a few years time—if your project has a lifespan measured in decades, then assume you'll need to change it. SHA-512 has no known attacks and probably won't for quite a while, and since it's quicker than SHA-256 I'd recommend it anyway. Benchmarks show it takes about 3 times longer to calculate SHA-512 than MD5, so if your speed issue can be dealt with, it's the way to go.
No idea, beyond those mentioned above. You're doing it right.
For a bit of light reading, see Crypto.SE: SHA512 is faster than SHA256?
Edit in response to question in comment
The purpose of a checksum is to allow you to check if a file has changed between the time you originally wrote it, and the time you come to use it. It does this by producing a small value (512 bits in the case of SHA512) where every bit of the original file contributes at least something to the output value. The purpose of a hashcode is the same, with the addition that it is really, really difficult for anyone else to get the same output value by making carefully managed changes to the file.
The premise is that if the checksums are the same at the start and when you check it, then the files are the same, and if they're different the file has certainly changed. What you are doing above is feeding the file, in its entirety, through an algorithm that rolls, folds and spindles the bits it reads to produce the small value.
As an example: in the application I'm currently writing, I need to know if parts of a file of any size have changed. I split the file into 16K blocks, take the SHA-512 hash of each block, and store it in a separate database on another drive. When I come to see if the file has changed, I reproduce the hash for each block and compare it to the original. Since I'm using SHA-512, the chances of a changed file having the same hash are unimaginably small, so I can be confident of detecting changes in 100s of GB of data whilst only storing a few MB of hashes in my database. I'm copying the file at the same time as taking the hash, and the process is entirely disk-bound; it takes about 5 minutes to transfer a file to a USB drive, of which 10 seconds is probably related to hashing.
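A minimal sketch of that per-block scheme; the 16 KB block size matches the description above, while storing the hashes in a database and comparing them later is left out:

    // Split a file into 16 KB blocks and record a SHA-512 hash for each block.
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;

    static List<byte[]> HashBlocks(string path, int blockSize = 16 * 1024)
    {
        var blockHashes = new List<byte[]>();
        using (var sha512 = SHA512.Create())
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[blockSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // ComputeHash(byte[], int, int) hashes just this block.
                blockHashes.Add(sha512.ComputeHash(buffer, 0, read));
            }
        }
        return blockHashes;
    }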
Lack of disk space to store hashes is a problem I can't solve in a post—buy a USB stick?
Way late to the party but seeing as none of the answers mentioned it, I wanted to point out:
SHA256Managed is an implementation of the System.Security.Cryptography.HashAlgorithm class, and all of the functionality related to the read operations are handled in the inherited code.
HashAlgorithm.ComputeHash(Stream) uses a fixed 4096 byte buffer to read data from a stream. As a result, you're not really going to see much difference using a BufferedStream for this call.
HashAlgorithm.ComputeHash(byte[]) operates on the entire byte array, but it resets the internal state after every call, so it can't be used to incrementally hash a buffered stream.
Your best bet would be to use a third party implementation that's optimized for your use case.
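If a third-party library isn't an option, one built-in alternative worth trying is IncrementalHash (available in newer .NET versions), which lets you pick your own read buffer size instead of the fixed 4096-byte one. A rough sketch, with an arbitrary 1 MB buffer:

    // Hash a file incrementally with a caller-chosen buffer size.
    using System;
    using System.IO;
    using System.Security.Cryptography;

    public static string GetChecksumIncremental(string file, int bufferSize = 1024 * 1024)
    {
        using (var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
        using (var stream = File.OpenRead(file))
        {
            var buffer = new byte[bufferSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                hasher.AppendData(buffer, 0, read);
            }
            return BitConverter.ToString(hasher.GetHashAndReset()).Replace("-", string.Empty);
        }
    }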
using (SHA256 SHA256 = SHA256Managed.Create())
{
    using (FileStream fileStream = System.IO.File.OpenRead(filePath))
    {
        string result = "";
        foreach (var hash in SHA256.ComputeHash(fileStream))
        {
            result += hash.ToString("x2");
        }
        return result;
    }
}
For Reference: https://www.c-sharpcorner.com/article/how-to-convert-a-byte-array-to-a-string/
I'm building a website which will store millions of images, so I need a unique ID for each image. Which hashing algorithm is best suited to identifying images? Right now my code looks like this, using SHA1.
Is there a standard hash used besides SHA1, and is it possible that two images could have the same hash code?
Image img = Image.FromFile("image.jpg");
ImageConverter converter = new ImageConverter();
byte[] byteArray = (byte[])converter.ConvertTo(img, typeof(byte[]));

string hash;
using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
{
    hash = Convert.ToBase64String(sha1.ComputeHash(byteArray));
}
If I understand correctly you want to assign an SHA1 value as a filename so you can detect whether you have that image in your collection already. I don't think this is the best approach (if you're not running a database then maybe it is) but still, if you're planning to have millions of images then (for practical reasons) just think that it's impossible for collisions to occur.
For this purpose I would not recommend SHA256 since the main two advantages (collision resistance + immunity to some theoretical attacks) are not really worth it because it's something around 10 times slower than SHA1 (and you'll be hashing a lot of fairly big files).
You shouldn't be scared of its bit length: to have a 50% chance of finding a collision in a 160-bit hash like SHA1, you would need on the order of 2^80 (roughly 1.2 x 10^24) images in your collection.
Oh, and I don't want to sound conceited or anything, but hashing and encryption are two different things. In fact, I'd say that hashing is closer to code signing/digital signatures than to encryption.
You can use both mechanisms.
Use a GUID as a unique file identifier (file system, database, etc.)
Calculate and store an SHA1 or MD5 hash on your image and use that to check for duplicates.
So when an image is uploaded, you can use the hash to check for a possible duplicate. If one is found, you can then do a more deterministic check (i.e. compare the bytes of the files). Realistically you will probably never get a hash match without the files being the same, but this second check determines it for sure.
Then, once uniqueness is determined, use the GUID for the file identifier or reuse the existing file.
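A rough sketch of that upload flow; the in-memory dictionary, the file layout and the method names are placeholders standing in for whatever database and storage you actually use:

    // Hash lookup narrows the search; a byte-for-byte compare confirms duplicates;
    // a GUID serves as the unique file identifier.
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    class ImageStore
    {
        // hash (hex) -> stored file paths with that hash
        private readonly Dictionary<string, List<string>> _byHash = new Dictionary<string, List<string>>();

        public Guid Store(string uploadedFile)
        {
            string hash = HashFile(uploadedFile);
            if (_byHash.TryGetValue(hash, out var candidates))
            {
                // Hash matched: confirm byte-for-byte before treating it as a duplicate.
                foreach (var existing in candidates)
                    if (File.ReadAllBytes(existing).SequenceEqual(File.ReadAllBytes(uploadedFile)))
                        return GetExistingId(existing); // reuse the existing file
            }

            var id = Guid.NewGuid(); // GUID is the unique file identifier
            string storedPath = SaveAs(uploadedFile, id);
            if (!_byHash.ContainsKey(hash))
                _byHash[hash] = new List<string>();
            _byHash[hash].Add(storedPath);
            return id;
        }

        private static string HashFile(string path)
        {
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(path))
                return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
        }

        // Placeholder storage details: files are stored as "<guid><extension>".
        private Guid GetExistingId(string storedPath) => Guid.Parse(Path.GetFileNameWithoutExtension(storedPath));

        private string SaveAs(string sourceFile, Guid id)
        {
            string dest = Path.Combine("images", id + Path.GetExtension(sourceFile));
            File.Copy(sourceFile, dest);
            return dest;
        }
    }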
Can two different images have the same hash code? Unlikely. On the other hand, can two copies of the same image have different hashes? Absolutely.
Take a lossless png, open it, and resave it as uncompressed. The pixels of both images will be identical, but the file hashes will be different.
Aside from the pixels, your images will also contain metadata fields such as geolocation, date/time, camera maker, camera model, ISO speed, focal length, etc.
So your hash will be affected by the type of compression and metadata when using the image file in its entirety.
The main question here is: What makes a picture "unique" to you?
For example, if an image is already uploaded, then I download it and wipe out the camera model or comments and re-upload it, would it be a different image to you, or is still the same as the original? How about the location field?
What if I download a lossless png and save it as a lossless tiff which will have the same pixel data?
Based on your requirements and which fields are important, you'll need to create a hash of the combination of the relevant metadata fields (if any) + the actual uncompressed pixel data of the image instead of making a hash using an image file in its entirety.
Of the standard hash algorithms provided in System.Security.Cryptography you'll probably find MD5 to be best suited to this application. But by all means play around with the different ones and see which one works best for you.
Here's a code sample that gets you a hash for the combination of metadata fields and image pixels:
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using System.Runtime.Serialization.Formatters.Binary;
using System.Security.Cryptography;

public class ImageHash
{
    public string GetHash(string filePath)
    {
        using (var image = (Bitmap) Image.FromFile(filePath))
            return GetHash(image);
    }

    public string GetHash(Bitmap bitmap)
    {
        var formatter = new BinaryFormatter();
        using (var memoryStream = new MemoryStream())
        {
            var metafields = GetMetaFields(bitmap).ToArray();
            if (metafields.Any())
                formatter.Serialize(memoryStream, metafields);

            var pixelBytes = GetPixelBytes(bitmap);
            memoryStream.Write(pixelBytes, 0, pixelBytes.Length);

            using (var hashAlgorithm = GetHashAlgorithm())
            {
                memoryStream.Seek(0, SeekOrigin.Begin);
                var hash = hashAlgorithm.ComputeHash(memoryStream);
                return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            }
        }
    }

    private static HashAlgorithm GetHashAlgorithm() => MD5.Create();

    private static byte[] GetPixelBytes(Bitmap bitmap, PixelFormat pixelFormat = PixelFormat.Format32bppRgb)
    {
        var lockedBits = bitmap.LockBits(new Rectangle(0, 0, bitmap.Width, bitmap.Height), ImageLockMode.ReadOnly, pixelFormat);
        var bufferSize = lockedBits.Height * lockedBits.Stride;
        var buffer = new byte[bufferSize];
        Marshal.Copy(lockedBits.Scan0, buffer, 0, bufferSize);
        bitmap.UnlockBits(lockedBits);
        return buffer;
    }

    private static IEnumerable<KeyValuePair<string, string>> GetMetaFields(Image image)
    {
        string manufacturer = System.Text.Encoding.ASCII.GetString(image.PropertyItems[1].Value);
        yield return new KeyValuePair<string, string>("manufacturer", manufacturer);
        // return any other fields you may be interested in
    }
}
And obviously, you'd use this as:
var hash = new ImageHash().GetHash(@"some file path");
Whilst a decent start, this method has areas that can be improved on, such as:
How about the same image after being resized? If that doesn't make it a different picture (as in, if you need tolerance to image resize), then you'll want to resize the input images first to a pre-determined size before hashing.
How about changes in ambient light? Would that make it a different picture? If the answer is no, then you'll need to take that into account too and make the algorithm robust against brightness changes, etc., so it still produces the same hash regardless of whether the image's brightness has changed.
How about geometric transformations? e.g., if I rotate or mirror an image before re-uploading it, is it still the same image as the original? If so, the algorithm would need to be intelligent enough to produce the same hash after those types of transformations.
How would you like to handle cases where a border is added to an image? There are many such scenarios in the realm of image processing. Some of which have fairly standard solutions, while many others are still being actively worked on.
Performance: this code may prove time- and resource-consuming depending on the number and size of images and how much time you can afford to spend on hashing each image. If you need it to run faster and/or use less memory, you may want to downsize your images to a pre-determined size before getting their hash (a sketch of this follows below).
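As a sketch of that last point, here is one hedged way to normalize images to a fixed size and format before hashing; the 256x256 target and the BMP re-encoding are arbitrary choices, and resampling differences mean this is only a first step toward resize tolerance, not a guarantee:

    // Normalize an image to a fixed size and fixed encoding, then hash the result
    // so that metadata and the original file format no longer affect the hash.
    using System;
    using System.Drawing;
    using System.Drawing.Imaging;
    using System.IO;
    using System.Security.Cryptography;

    public static string GetResizeTolerantHash(string filePath)
    {
        using (var original = (Bitmap)Image.FromFile(filePath))
        using (var normalized = new Bitmap(original, new Size(256, 256)))
        using (var ms = new MemoryStream())
        using (var md5 = MD5.Create())
        {
            // Re-encode the normalized pixels in a fixed format so only pixel content matters.
            normalized.Save(ms, ImageFormat.Bmp);
            ms.Seek(0, SeekOrigin.Begin);
            return BitConverter.ToString(md5.ComputeHash(ms)).Replace("-", "");
        }
    }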
I've got a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach which is reasonably unique? i.e. one where the likelihood of a collision is very small.
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update2
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext in Linux, i.e. the user just writes (in a Razor view):
@T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identity that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient and I'm therefore looking for a way to generate a key out of it.
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense, either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
{
    hash = BitConverter.ToString(
        md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
    ).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
You can use cryptographic Hash functions for this. Most of them are available in .Net
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash= sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to look up against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed its checksum, and then selected the rows where both the checksums and the text were equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
Good luck!
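For what it's worth, here is a hedged application-level sketch of the same idea, with an in-memory dictionary standing in for the database table: key on a small hash, then confirm with a full string comparison to resolve collisions. The class and method names are mine.

    // Narrow the search with a small hash key, then compare the full text exactly.
    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    class TranslationLookup
    {
        private readonly Dictionary<string, List<(string Text, string Translation)>> _byKey
            = new Dictionary<string, List<(string Text, string Translation)>>();

        private static string KeyOf(string text)
        {
            using (var md5 = MD5.Create())
                return BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(text))).Replace("-", "");
        }

        public void Add(string text, string translation)
        {
            string key = KeyOf(text);
            if (!_byKey.TryGetValue(key, out var bucket))
                _byKey[key] = bucket = new List<(string Text, string Translation)>();
            bucket.Add((text, translation));
        }

        public string Find(string text)
        {
            // The hash narrows the search to a small bucket; the string comparison
            // then resolves any collisions exactly.
            if (_byKey.TryGetValue(KeyOf(text), out var bucket))
                foreach (var entry in bucket)
                    if (entry.Text == text)
                        return entry.Translation;
            return null;
        }
    }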
To guarantee uniqueness for almost arbitrarily long strings, treat the variable-length string as a set of concatenated substrings, each x characters in length. Your hash function then only needs to guarantee uniqueness for a maximum substring length, and you generate a series of checksum values from those substrings. Think of it as the equivalent of a network IP address built from a set of checksum numbers.
Your issue with collisions assumes that a collision forces a slower search method to resolve it. If there is an insignificant number of possible collisions compared to the number of hashed objects, the extra overhead is negligible overall. A collision is due to sizing a table smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes" and each slot may carry a reference count of the objects colliding there. Only if this count is greater than 1 does a collision occur, or there are multiple instances of the same substring.
At the moment I do this:
public static class Crypto
{
    public static string Encode(string original)
    {
        var md5 = new MD5CryptoServiceProvider();
        var originalBytes = Encoding.Default.GetBytes(original);
        var encodedBytes = md5.ComputeHash(originalBytes);
        return BitConverter.ToString(encodedBytes);
    }
}
I hear that I should use some key to encode stuff. Should I? Is it needed here? How to do this?
I ended up doing this http://encrypto.codeplex.com/ (sha1managed + random salt)
What you're talking about is called a "salt": a sequence of random data which you append to the original plain text before hashing. This is commonly used for passwords, preventing rainbow table / dictionary attacks.
Read up on this on: http://en.wikipedia.org/wiki/Salt_%28cryptography%29
For C# there's a good article: http://www.aspheute.com/english/20040105.asp
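A minimal sketch of salting, assuming the salt is stored alongside the hash (here both are packed into one "salt:hash" string); note that a single salted SHA-256 is still fast to brute-force, so for passwords a deliberately slow scheme (see the bcrypt discussion further down) is preferable:

    // Generate a random salt, hash salt + password, and keep the salt so the
    // same computation can be repeated at verification time.
    using System;
    using System.Security.Cryptography;
    using System.Text;

    public static string HashWithSalt(string password)
    {
        var salt = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
            rng.GetBytes(salt);

        using (var sha256 = SHA256.Create())
        {
            byte[] passwordBytes = Encoding.UTF8.GetBytes(password);
            byte[] salted = new byte[salt.Length + passwordBytes.Length];
            Buffer.BlockCopy(salt, 0, salted, 0, salt.Length);
            Buffer.BlockCopy(passwordBytes, 0, salted, salt.Length, passwordBytes.Length);

            byte[] hash = sha256.ComputeHash(salted);
            return Convert.ToBase64String(salt) + ":" + Convert.ToBase64String(hash);
        }
    }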
You should use Base64 encoding for representation. I.e.:
StringBuilder hash = new StringBuilder();
for (int i = 0; i < encodedBytes.Length; i++)
{
    hash.Append(encodedBytes[i].ToString("X2"));
}
This represents a string, rather than using a bit converter, which is the string representation of bytes directly (and cannot be reversed back to bits easily).
A couple of notes (please read this):
MD5 is a non-reversible hashing function (and not a great one at that)
If you are actually wanting to encrypt passwords using key-based encryption, such as AES, don't. Use the hashing method, but use a stronger one. Have a look at this answer here for more information on strengthening passwords.
Another note: in your implementation you can make use of the IDisposable interface, i.e.:
public static string Encode(string original)
{
    byte[] encodedBytes;
    using (var md5 = new MD5CryptoServiceProvider())
    {
        var originalBytes = Encoding.Default.GetBytes(original);
        encodedBytes = md5.ComputeHash(originalBytes);
    }
    return Convert.ToBase64String(encodedBytes);
}
Since SHA is considered more secure than MD5, I would recommend using it.
// byte arrays cannot be concatenated with '+', so copy salt and data into one buffer first
byte[] saltedData = new byte[SALT_SIZE + DATA_SIZE];
Buffer.BlockCopy(salt, 0, saltedData, 0, SALT_SIZE);
Buffer.BlockCopy(data, 0, saltedData, SALT_SIZE, DATA_SIZE);
byte[] result;
SHA256 shaM = new SHA256Managed();
result = shaM.ComputeHash(saltedData);
MD5 isn't an encryption algorithm but a hashing algorithm, and as such doesn't require a key. It also means that you can't reverse (de-hash) the process. Hashing only works one way and is useful when you have the original (unhashed) data to compare the hash against.
Edit: if you really do want to encrypt your data in a reversible way, look into AES encryption, for example; a rough sketch is below.
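If reversible encryption really is what you need, here is a hedged sketch using AES via Aes.Create(); the tuple return type and method names are mine, and in practice you would manage the key (and IV) yourself rather than generate them per call:

    // Reversible encryption: whoever holds the key (and IV) can decrypt,
    // which is exactly what a hash cannot offer.
    using System;
    using System.Security.Cryptography;
    using System.Text;

    static (byte[] Cipher, byte[] Key, byte[] IV) Encrypt(string plaintext)
    {
        using (var aes = Aes.Create())
        using (var encryptor = aes.CreateEncryptor())
        {
            byte[] plainBytes = Encoding.UTF8.GetBytes(plaintext);
            byte[] cipher = encryptor.TransformFinalBlock(plainBytes, 0, plainBytes.Length);
            return (cipher, aes.Key, aes.IV);
        }
    }

    static string Decrypt(byte[] cipher, byte[] key, byte[] iv)
    {
        using (var aes = Aes.Create())
        using (var decryptor = aes.CreateDecryptor(key, iv))
        {
            byte[] plainBytes = decryptor.TransformFinalBlock(cipher, 0, cipher.Length);
            return Encoding.UTF8.GetString(plainBytes);
        }
    }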
I posted this link in a comment to sled's answer, but worth its own post.
http://codahale.com/how-to-safely-store-a-password/
It sounds like you've been given advice to salt your password hashes (I think you're calling the salt a "key"). This is better than just a hash, since it makes rainbow tables useless. A rainbow table takes a wide range of possible passwords (eg, a range of passwords, like a rainbow has a range of colours) and calculates their md5 hashes up-front. Then, to reverse an md5, simply look the md5 up in the table.
However, the advice is rapidly getting out of date. Hardware is now fast enough that rainbow tables are unnecessary: you can compute hashes really quickly, it's fast enough to just brute force the password from scratch every time, especially if you know the salt. So, the solution is to use a more computationally expensive hash, which will make a brute force much much slower.
The gold-standard tool to do this is bcrypt.
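bcrypt itself isn't in the .NET base class library, so as a stand-in here is a hedged sketch using PBKDF2 via Rfc2898DeriveBytes, which is built in and also deliberately slow; the iteration count is a tunable cost factor and the storage format is my own choice:

    // Derive a key from the password with many iterations so brute forcing is slow.
    using System;
    using System.Security.Cryptography;

    public static string HashPassword(string password, int iterations = 100000)
    {
        var salt = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
            rng.GetBytes(salt);

        using (var kdf = new Rfc2898DeriveBytes(password, salt, iterations))
        {
            byte[] key = kdf.GetBytes(32);
            // Store the iteration count, salt and derived key together so the
            // same derivation can be repeated at verification time.
            return iterations + ":" + Convert.ToBase64String(salt) + ":" + Convert.ToBase64String(key);
        }
    }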
Here are a couple of points that expand on the ones in Kyle Rozendo's answer.
You should avoid using the default encoding Encoding.Default. Unless you have a good reason to do otherwise, always use UTF-8 en/de-coding. This is easy to do with the .NET System.Text namespace.
MD5 output is unconstrained binary data and cannot be reliably converted directly to a string. You must use a special encoding designed to convert from binary output to valid strings and back. Kyle Rozendo shows one method; you can also use the methods in the System.Convert class.
Class example:
using System.Security.Cryptography;
using System.Text;

private static string MD5(string Metin)
{
    MD5CryptoServiceProvider MD5Code = new MD5CryptoServiceProvider();
    byte[] byteDizisi = Encoding.UTF8.GetBytes(Metin);
    byteDizisi = MD5Code.ComputeHash(byteDizisi);

    StringBuilder sb = new StringBuilder();
    foreach (byte ba in byteDizisi)
    {
        sb.Append(ba.ToString("x2").ToLower());
    }
    return sb.ToString();
}
MessageBox.Show(MD5("isa")); // 165a1761634db1e9bd304ea6f3ffcf2b