I'm building a website that will store millions of images, so I need a unique ID for each image. Which cryptographic hash is best for identifying images? Right now this is what my code looks like; I'm using SHA1.
Is there a standard hash used besides SHA1, and is it possible that two images could have the same hash code?
Image img = Image.FromFile("image.jpg");
ImageConverter converter = new ImageConverter();
byte[] byteArray = (byte[])converter.ConvertTo(img, typeof(byte[]));
string hash;
using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
{
    hash = Convert.ToBase64String(sha1.ComputeHash(byteArray));
}
If I understand correctly, you want to use an SHA1 value as a filename so you can detect whether you already have that image in your collection. I don't think this is the best approach (though if you're not running a database, maybe it is), but still, if you're planning to have millions of images, then for practical purposes you can treat collisions as impossible.
For this purpose I would not recommend SHA256, since its two main advantages (collision resistance and immunity to some theoretical attacks) aren't really worth it here: it's roughly ten times slower than SHA1, and you'll be hashing a lot of fairly big files.
You shouldn't be scared by a 128-bit hash length either: to have a 50% chance of finding a collision among 128-bit hashes, you would need about 18,446,744,073,709,551,616 images in your collection (2^64, the square root of 2^128).
Oh, and I don't want to sound conceited or anything, but hashing and cryptography are two different things. In fact, I'd say hashing is closer to code signing/digital signatures than to cryptography.
You can use both mechanisms.
Use a GUID as a unique file identifier (file system, database, etc.)
Calculate and store an SHA1 or MD5 hash of your image and use that to check for duplicates.
So when an image is uploaded, you can use the hash to check for a possible duplicate. However, if one is found, you can then do a more deterministic check (i.e. compare the bytes of the files). Realistically speaking, you will probably never get a hash match without the files being the same, but this second check will determine it for sure.
Then, once uniqueness is determined, use the GUID as the file identifier, or reuse the existing file.
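A minimal sketch of that flow, assuming a hypothetical storage layer (FindByHash and Save are placeholders for your own database or file-system code):

using System;
using System.Linq;
using System.Security.Cryptography;

public class ImageStore
{
    // Hypothetical storage record: GUID identifier, content hash, raw bytes.
    public class Entry
    {
        public Guid Id;
        public string Hash;
        public byte[] Bytes;
    }

    // Placeholders for your own persistence layer.
    private Entry FindByHash(string hash) { /* query your store */ return null; }
    private void Save(Entry entry) { /* persist to your store */ }

    public Guid StoreImage(byte[] imageBytes)
    {
        string hash;
        using (var sha1 = SHA1.Create())
            hash = Convert.ToBase64String(sha1.ComputeHash(imageBytes));

        // A hash match is only a candidate duplicate; confirm byte-for-byte.
        var existing = FindByHash(hash);
        if (existing != null && existing.Bytes.SequenceEqual(imageBytes))
            return existing.Id; // duplicate: reuse the existing file

        var entry = new Entry { Id = Guid.NewGuid(), Hash = hash, Bytes = imageBytes };
        Save(entry);
        return entry.Id; // the GUID is the unique file identifier
    }
}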
Can two different images have the same hash code? Unlikely. On the other hand, can two copies of the same image have different hashes? Absolutely.
Take a losslessly compressed PNG, open it, and re-save it uncompressed. The pixels of both images will be identical, but the file hashes will differ.
Aside from the pixels, your images will also contain metadata fields such as geolocation, date/time, camera maker, camera model, ISO speed, focal length, etc.
So if you hash the image file in its entirety, the hash will be affected by the compression type and the metadata.
The main question here is: What makes a picture "unique" to you?
For example, if an image is already uploaded, and I download it, wipe out the camera model or comments, and re-upload it, is it a different image to you, or still the same as the original? How about the location field?
What if I download a lossless PNG and save it as a lossless TIFF, which will have the same pixel data?
Based on your requirements and which fields are important, you'll need to hash the combination of the relevant metadata fields (if any) plus the actual uncompressed pixel data of the image, instead of hashing the image file in its entirety.
Of the standard hash algorithms provided in System.Security.Cryptography, you'll probably find MD5 best suited to this application. But by all means play around with the different ones and see which works best for you.
Here's a code sample that gets you a hash for the combination of metadata fields and image pixels:
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using System.Runtime.Serialization.Formatters.Binary;
using System.Security.Cryptography;

public class ImageHash
{
    public string GetHash(string filePath)
    {
        using (var image = (Bitmap)Image.FromFile(filePath))
            return GetHash(image);
    }

    public string GetHash(Bitmap bitmap)
    {
        var formatter = new BinaryFormatter();
        using (var memoryStream = new MemoryStream())
        {
            // Serialize the relevant metadata fields first...
            var metafields = GetMetaFields(bitmap).ToArray();
            if (metafields.Any())
                formatter.Serialize(memoryStream, metafields);

            // ...then append the uncompressed pixel data.
            var pixelBytes = GetPixelBytes(bitmap);
            memoryStream.Write(pixelBytes, 0, pixelBytes.Length);

            using (var hashAlgorithm = GetHashAlgorithm())
            {
                memoryStream.Seek(0, SeekOrigin.Begin);
                var hash = hashAlgorithm.ComputeHash(memoryStream);
                return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            }
        }
    }

    private static HashAlgorithm GetHashAlgorithm() => MD5.Create();

    private static byte[] GetPixelBytes(Bitmap bitmap, PixelFormat pixelFormat = PixelFormat.Format32bppRgb)
    {
        var lockedBits = bitmap.LockBits(new Rectangle(0, 0, bitmap.Width, bitmap.Height), ImageLockMode.ReadOnly, pixelFormat);
        var bufferSize = lockedBits.Height * lockedBits.Stride;
        var buffer = new byte[bufferSize];
        Marshal.Copy(lockedBits.Scan0, buffer, 0, bufferSize);
        bitmap.UnlockBits(lockedBits);
        return buffer;
    }

    private static IEnumerable<KeyValuePair<string, string>> GetMetaFields(Image image)
    {
        // Look the camera maker up by its EXIF tag id (0x010F) rather than by
        // position, which is not guaranteed to be stable across files.
        var makerProperty = image.PropertyItems.FirstOrDefault(p => p.Id == 0x010F);
        if (makerProperty != null)
        {
            string manufacturer = System.Text.Encoding.ASCII.GetString(makerProperty.Value);
            yield return new KeyValuePair<string, string>("manufacturer", manufacturer);
        }
        // yield any other fields you may be interested in
    }
}
And obviously, you'd use this as:
var hash = new ImageHash().GetHash(@"some file path");
Whilst a decent start, this method has areas that can be improved on, such as:
How about the same image after being resized? If that doesn't make it a different picture (that is, if you need tolerance to image resizing), then you'll want to resize input images to a pre-determined size before hashing (see the sketch after this list).
How about changes in ambient light? Would that make it a different picture? If the answer is no, you'll need to take that into account too and make the algorithm robust against brightness changes, so it still produces the same hash after the image brightness has changed.
How about geometric transformations? For example, if I rotate or mirror an image before re-uploading it, is it still the same image as the original? If so, the algorithm would need to be intelligent enough to produce the same hash after those types of transformations.
How would you like to handle cases where a border is added to an image? There are many such scenarios in the realm of image processing. Some have fairly standard solutions, while many others are still being actively worked on.
Performance: the current code may prove time- and resource-consuming depending on the number and size of images and how much time you can afford to spend hashing each image. If you need it to run faster and/or use less memory, downsize your images to a pre-determined size before hashing them.
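For the resize case in particular, a minimal sketch of the normalization step (the 256x256 target size is an arbitrary assumption; pick whatever suits your data):

using System.Drawing;

public static class HashNormalization
{
    // Scale every input to a fixed size so resized copies of the same
    // picture produce the same pixel data, and therefore the same hash.
    public static Bitmap NormalizeForHashing(Image source)
    {
        var normalized = new Bitmap(256, 256);
        using (var g = Graphics.FromImage(normalized))
            g.DrawImage(source, 0, 0, 256, 256);
        return normalized;
    }
}

You would then pass the normalized bitmap to GetHash(Bitmap) instead of hashing the original file.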
I am trying to validate an image submitted to the backend as a Base64 string by parsing it into an Image object, serializing that Image back out, and finally comparing the input byte array with the output byte array, on the assumption that the two should be identical or else there was something wrong with the input image. Here is the code:
private void UpdatePhoto(string photoBase64)
{
    var imageDataInBytes = Convert.FromBase64String(photoBase64);
    ValidateImageContent(imageDataInBytes);
}

private void ValidateImageContent(byte[] imageDataInBytes)
{
    using (var inputMem = new MemoryStream(imageDataInBytes))
    {
        var img = Image.FromStream(inputMem, false, true);
        using (MemoryStream outputMemStream = new MemoryStream())
        {
            img.Save(outputMemStream, img.RawFormat);
            var outputSerialized = outputMemStream.ToArray();
            if (!outputSerialized.SequenceEqual(imageDataInBytes))
                throw new Exception("Invalid image. Identified extra data in the input. Please upload another photo.");
        }
    }
}
and it fails on an image that I know is valid.
Is my assumption wrong that the output of Image.Save must be the same as what Image.FromStream was fed? Is there a way to correct this logic so that this kind of validation works?
If you compare the original image with the created image, you will notice a few differences in the metadata: For my sample image, I could observe that some metadata was stripped (XMP data was completely removed). In addition, while the EXIF data was preserved, the endianness it is written in was reversed from little endian to big endian. This alone explains why the data won’t match.
In my example, the actual image data was identical but you won’t be able to tell easily from just looking at the bytes.
If you wanted to produce a result identical to the source, you would have to write the metadata in exactly the same way the source did. You won't be able to do that without looking closely at the metadata of the original photo, though. .NET's Image simply isn't able to retain all the metadata a file can contain. And even if you could extract all the metadata and store it in the right format again, there are lots of fine nuances between metadata serializers that make it very difficult to produce the exact same result.
So if you want to compare the images, you should probably strip the metadata and compare just the image data. But then, given that you save the image in its raw format, you would just get the exact same blob of data back, so I wouldn't expect differences there.
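If comparing the decoded picture is what you're after, here is a minimal sketch of that idea, assuming System.Drawing: decode both byte arrays and compare the decoded pixels instead of the serialized files.

using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;

public static class PixelComparer
{
    // Compares decoded pixel data, ignoring metadata and serialization details.
    public static bool PixelsEqual(byte[] a, byte[] b)
    {
        using (var streamA = new MemoryStream(a))
        using (var streamB = new MemoryStream(b))
        using (var imgA = new Bitmap(streamA))
        using (var imgB = new Bitmap(streamB))
        {
            if (imgA.Size != imgB.Size)
                return false;
            return GetPixelBytes(imgA).SequenceEqual(GetPixelBytes(imgB));
        }
    }

    private static byte[] GetPixelBytes(Bitmap bitmap)
    {
        var rect = new Rectangle(0, 0, bitmap.Width, bitmap.Height);
        var bits = bitmap.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
        try
        {
            var buffer = new byte[bits.Height * bits.Stride];
            Marshal.Copy(bits.Scan0, buffer, 0, buffer.Length);
            return buffer;
        }
        finally
        {
            bitmap.UnlockBits(bits);
        }
    }
}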
I have a C# program which compares 2 .jpg files.
I was using this function I found on the internet to do that; it works well but it's very slow (takes more than a second per comparison):
public static bool ImageCompareString(Bitmap firstImage, Bitmap secondImage)
{
    using (MemoryStream ms = new MemoryStream())
    {
        firstImage.Save(ms, System.Drawing.Imaging.ImageFormat.Png);
        string firstBitmap = Convert.ToBase64String(ms.ToArray());

        // Clear the stream before re-using it; resetting only the position
        // would leave stale bytes behind if the second image is smaller.
        ms.SetLength(0);

        secondImage.Save(ms, System.Drawing.Imaging.ImageFormat.Png);
        string secondBitmap = Convert.ToBase64String(ms.ToArray());

        return firstBitmap.Equals(secondBitmap);
    }
}
Now I was wondering: why not use a checksum, which is faster, to do the comparison? Are the results of a byte-to-byte comparison more accurate?
The reason I need to compare jpg files: on my PC I have thousands of jpg files taken with my camera and smartphone, but many of them are duplicates (identical pictures with the same name exist in different subfolders, and some have the same name but are not the same picture).
I want to move all the unique pictures to a new folder and delete the duplicates, so when two pictures have the same name I need to compare them.
With such a prototype, the function does not compare JPEG image files but decompressed bitmaps, wherever they come from. A pixel-by-pixel comparison will be efficient (using LockBits, as advised by Franck). Computing and comparing checksums can be faster, as that reduces to a branchless operation, but be sure to use a fast formula.
If your goal is to compare files, not images, avoid loading the files as bitmaps: that involves costly decompression and increases the memory footprint. Note that a file-level comparison will also flag differences in tags/file organization even when the images themselves are the same. A sketch of such a comparison follows.
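A minimal sketch of a file-level comparison: check the lengths first (cheap), then stream both files and abort at the first differing chunk. No image decoding is involved.

using System.IO;

public static class FileByteComparer
{
    public static bool FilesEqual(string pathA, string pathB)
    {
        var infoA = new FileInfo(pathA);
        var infoB = new FileInfo(pathB);
        if (infoA.Length != infoB.Length)
            return false; // different sizes can never be identical files

        const int BufferSize = 64 * 1024;
        var bufA = new byte[BufferSize];
        var bufB = new byte[BufferSize];
        using (var fsA = infoA.OpenRead())
        using (var fsB = infoB.OpenRead())
        {
            int readA;
            while ((readA = fsA.Read(bufA, 0, BufferSize)) > 0)
            {
                // Fill the second buffer with the same number of bytes.
                int filled = 0;
                while (filled < readA)
                {
                    int readB = fsB.Read(bufB, filled, readA - filled);
                    if (readB == 0) return false; // unexpected end of file
                    filled += readB;
                }
                for (int i = 0; i < readA; i++)
                    if (bufA[i] != bufB[i])
                        return false; // abort at the first difference
            }
        }
        return true;
    }
}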
Last but not least, comparing images for similarity is yet another, completely different story.
I have an image variable which contains a .png picture.
To calculate how big it would be on disk, I'm currently saving it to a memory stream and then using the length of that "memory file" to check whether it is within the file size I want.
As that seems pretty inefficient to me (a "real" calculation is probably faster and less memory-intensive), I'm wondering if there is a way to do this differently.
Example:
private bool IsImageTooLarge(Image img, long maxSize)
{
    using (MemoryStream ms = new MemoryStream())
    {
        img.Save(ms, System.Drawing.Imaging.ImageFormat.Jpeg);
        // ms.Length avoids copying the whole buffer just to measure it.
        if (ms.Length > maxSize)
        {
            return true;
        }
    }
    return false;
}
Additional info:
The source code is part of what will be a .dll, so web-specific things won't work; I need to do this with C# itself.
You can save on memory by implementing your own Stream, say, PositionNullStream, which would be similar to the NullStream class behind the Stream.Null object, but with a position counter. Your implementation would provide a write-only stream to the Save method of the image, and you would collect the current position from it once Save has finished.
private bool IsImageTooLarge(Image img, long maxSize)
{
    using (var ps = new PositionNullStream())
    {
        img.Save(ps, System.Drawing.Imaging.ImageFormat.Png);
        return ps.Position > maxSize;
    }
}
You can find a sample implementation of NullStream on lines 1445..1454 here. Change the implementation to track the current position when the write and re-positioning methods are called (NullStream is hardcoded to return zero).
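A minimal sketch of what such a PositionNullStream could look like (the class name comes from the answer above; this particular implementation is an assumption):

using System;
using System.IO;

// Write-only stream that discards all data but counts how much was
// written, letting Image.Save "measure" the encoded size without buffering.
public class PositionNullStream : Stream
{
    private long _position;
    private long _length;

    public override bool CanRead => false;
    public override bool CanSeek => true;
    public override bool CanWrite => true;
    public override long Length => _length;

    public override long Position
    {
        get { return _position; }
        set { _position = value; }
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        _position += count; // discard the data, advance the counter
        if (_position > _length)
            _length = _position;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        switch (origin)
        {
            case SeekOrigin.Begin: _position = offset; break;
            case SeekOrigin.Current: _position += offset; break;
            case SeekOrigin.End: _position = _length + offset; break;
        }
        return _position;
    }

    public override void Flush() { }
    public override void SetLength(long value) { _length = value; }
    public override int Read(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException("This stream is write-only.");
    }
}

Note that if an encoder seeks backwards before finishing (some formats patch their headers at the end), ps.Length is a safer measure than ps.Position.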
No, there is no way, because you do not need the size of the picture but the size of the file. The only way to get that, with anything involving compression, is to compress the image and see what comes out.
Without that, the best you can do is X * Y * bytes per pixel, but that is the uncompressed bitmap size, not the size of anything involving compression.
Is it possible to store (encode) images/pictures in a PDF417 barcode? If so, is there any tutorial or sample code?
The barcode cannot just hold a reference to an image in a database; the customer also expects to be able to store any image he wants.
Thank you.
As ssasa mentioned, you could store the image as a byte array:
public static byte[] GetBytes(Image image)
{
    using (MemoryStream stream = new MemoryStream())
    {
        // you may want to choose another image format than PNG
        image.Save(stream, System.Drawing.Imaging.ImageFormat.Png);
        return stream.ToArray();
    }
}
... or, if it MUST be a string, you could base64 encode it:
public static string GetBase64(Image image)
{
    // using the function from the first example
    var imageBytes = GetBytes(image);
    return Convert.ToBase64String(imageBytes);
}
Remember, though: a PDF417 barcode can store up to 2,710 characters only for purely numeric data; for text it's around 1,850 characters, and for raw binary data roughly 1.1 kB. While this is more than enough for most structures you'd ever want to encode, it's rather limiting for an image. It may be enough for small monochrome bitmaps and/or highly compressed JPEGs, but don't expect to be able to do much more than that, especially if you want to store other data alongside.
If your customers expect to be able to store, as you say, any picture they want, you'd better lower their expectations as soon as possible, before writing any code.
If it's an option, you may want to consider using QR codes instead. Not that you'll work miracles with those either, but you may like the added storage capacity.
Yes; Department of Defense Common Access Cards (CAC) store a JPEG image of the cardholder.
How can I determine, as fast as possible, whether two bitmaps are the same by value rather than by reference? Is there any fast way of doing it?
What if the comparison doesn't need to be very precise?
You can check the dimensions first and abort the comparison if they differ.
For the comparison itself you can use a variety of approaches:
CRC32: very fast but weak. It can be used as a first check; if the checksums differ, the images are different, otherwise further checking is needed.
MD5 / SHA1 / SHA512: not as fast, but rather precise.
XOR: XOR the image contents and abort as soon as the first difference comes up (see the sketch below).
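A minimal sketch of the dimensions-then-content check with an early abort, assuming System.Drawing (LockBits avoids the slow GetPixel path):

using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

public static class BitmapComparer
{
    public static bool BitmapsEqual(Bitmap a, Bitmap b)
    {
        // Cheap first check: different dimensions, different images.
        if (a.Size != b.Size)
            return false;

        var rect = new Rectangle(0, 0, a.Width, a.Height);
        var bitsA = a.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
        var bitsB = b.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
        try
        {
            int byteCount = bitsA.Stride * bitsA.Height;
            var bufA = new byte[byteCount];
            var bufB = new byte[byteCount];
            Marshal.Copy(bitsA.Scan0, bufA, 0, byteCount);
            Marshal.Copy(bitsB.Scan0, bufB, 0, byteCount);
            for (int i = 0; i < byteCount; i++)
                if (bufA[i] != bufB[i])
                    return false; // abort at the first difference
            return true;
        }
        finally
        {
            a.UnlockBits(bitsA);
            b.UnlockBits(bitsB);
        }
    }
}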
You can just use a simple hash like MD5 to determine if their contents hash to the same value.
You will need a very precise definition of "not very precise".
All the checksum and hash methods already posted work for an exact (pixel-and-bit) match only.
If you want an answer that corresponds to "they look (somewhat) alike", you will need something more complicated:
Do some preprocessing based on their aspect ratio. Can a 600x400 picture be "like" a 300x300 one?
Use a graphics algorithm to scale both down to, say, 100x100.
Also reduce the colors.
Then compare the results pixel by pixel and set an error threshold (a sketch follows).
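A minimal sketch of that recipe (the 100x100 size, the color reduction to the high 4 bits per channel, and the 5% mismatch threshold are all arbitrary assumptions to tune for your data):

using System.Drawing;

public static class FuzzyComparer
{
    public static bool LookAlike(Image a, Image b, double maxMismatchRatio = 0.05)
    {
        // Scale both down to a common size, then count mismatching pixels.
        using (var smallA = new Bitmap(a, new Size(100, 100)))
        using (var smallB = new Bitmap(b, new Size(100, 100)))
        {
            int mismatches = 0;
            for (int y = 0; y < 100; y++)
            {
                for (int x = 0; x < 100; x++)
                {
                    // Reduce colors by keeping only the high bits per channel.
                    int pa = smallA.GetPixel(x, y).ToArgb() & 0xF0F0F0;
                    int pb = smallB.GetPixel(x, y).ToArgb() & 0xF0F0F0;
                    if (pa != pb)
                        mismatches++;
                }
            }
            // Error threshold: tolerate up to maxMismatchRatio differing pixels.
            return mismatches <= maxMismatchRatio * 100 * 100;
        }
    }
}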
Try comparing the hashes of the two files:
using System;
using System.IO;
using System.Security.Cryptography;

class FileComparer
{
    static void Compare()
    {
        // Create the hashing object. SHA1 is used here; the parameterless
        // HashAlgorithm.Create() is obsolete on modern .NET.
        using (HashAlgorithm hashAlg = SHA1.Create())
        {
            using (FileStream fsA = new FileStream("c:\\test.txt", FileMode.Open),
                              fsB = new FileStream("c:\\test1.txt", FileMode.Open))
            {
                // Calculate the hash for the files.
                byte[] hashBytesA = hashAlg.ComputeHash(fsA);
                byte[] hashBytesB = hashAlg.ComputeHash(fsB);

                // Compare the hashes.
                if (BitConverter.ToString(hashBytesA) == BitConverter.ToString(hashBytesB))
                {
                    Console.WriteLine("Files match.");
                }
                else
                {
                    Console.WriteLine("No match.");
                }
            }
        }
    }
}