I'm using iTextSharp to read the text from a PDF file. However, sometimes I cannot extract any text because the PDF contains only images. I download the same PDF files every day, and I want to see whether a PDF has been modified. If the text and modification date cannot be obtained, is an MD5 checksum the most reliable way to tell if the file has changed?
If it is, some code samples would be appreciated, because I don't have much experience with cryptography.
It's very simple using System.Security.Cryptography.MD5:
using (var md5 = MD5.Create())
{
using (var stream = File.OpenRead(filename))
{
return md5.ComputeHash(stream);
}
}
(I believe that actually the MD5 implementation used doesn't need to be disposed, but I'd probably still do so anyway.)
How you compare the results afterwards is up to you; you can convert the byte array to base64 for example, or compare the bytes directly. (Just be aware that arrays don't override Equals. Using base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.)
If you need to represent the hash as a string, you could convert it to hex using BitConverter:
static string CalculateMD5(string filename)
{
using (var md5 = MD5.Create())
{
using (var stream = File.OpenRead(filename))
{
var hash = md5.ComputeHash(stream);
return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
}
}
}
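For the original goal (detecting whether a daily download changed), the new hash only needs to be compared with the one stored from the previous run. Below is a minimal sketch of that idea, written in Python rather than C# purely for illustration. `file_md5` and `has_changed` are hypothetical names, and the chunk size is an arbitrary choice. MD5 is fine for spotting accidental changes, but it is not collision-resistant, so SHA-256 would be a safer choice if deliberate tampering is a concern.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 of a file as lowercase hex, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        # iter() with a sentinel stops at the empty bytes object returned at EOF.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, previous_hex):
    """True when the file's current MD5 differs from the stored hex digest."""
    return file_md5(path) != previous_hex.lower()
```

Reading in chunks keeps memory use flat even for very large files, which is the same reason the C# versions above pass a stream to ComputeHash.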
This is how I do it:
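using System;
using System.IO;
using System.Security.Cryptography;
public string checkMD5(string filename)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(filename))
        {
            // Hash bytes are not text: decoding them with Encoding.Default is lossy,
            // so encode them as base64 instead.
            return Convert.ToBase64String(md5.ComputeHash(stream));
        }
    }
}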
I know this question was already answered, but this is what I use:
using (FileStream fStream = File.OpenRead(filename)) {
    return GetHash<MD5>(fStream);
}
Where GetHash:
public static String GetHash<T>(Stream stream) where T : HashAlgorithm {
StringBuilder sb = new StringBuilder();
MethodInfo create = typeof(T).GetMethod("Create", new Type[] {});
using (T crypt = (T) create.Invoke(null, null)) {
byte[] hashBytes = crypt.ComputeHash(stream);
foreach (byte bt in hashBytes) {
sb.Append(bt.ToString("x2"));
}
}
return sb.ToString();
}
Probably not the best way, but it can be handy.
Here is a slightly simpler version that I found. It reads the entire file in one go and only requires a single using directive.
byte[] ComputeHash(string filePath)
{
using (var md5 = MD5.Create())
{
return md5.ComputeHash(File.ReadAllBytes(filePath));
}
}
I know I'm late to the party, but I ran a test before actually implementing the solution.
I tested the built-in MD5 class against md5sum.exe. In my case the built-in class took 13 seconds, whereas md5sum.exe took around 16-18 seconds on every run.
DateTime current = DateTime.Now;
string file = @"C:\text.iso"; // It's a 2.5 GB file
string output;
using (var md5 = MD5.Create())
{
    using (var stream = File.OpenRead(file))
    {
        byte[] checksum = md5.ComputeHash(stream);
        output = BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
        Console.WriteLine("Total seconds : " + (DateTime.Now - current).TotalSeconds.ToString() + " " + output);
    }
}
For dynamically generated PDFs, the creation and modification dates will always be different.
You have to remove them or set them to a constant value, and then generate the MD5 hash to compare.
You can use PdfStamper to remove or update the dates.
In addition to the methods in the answers above: if you're comparing PDFs, you need to normalize the creation and modified dates, or the hashes won't match.
For PDFs generated with QuestPDF, you'll need to override CreationDate and ModifiedDate in the document metadata.
public class PdfDocument : IDocument
{
...
DocumentMetadata GetMetadata()
{
return new()
{
CreationDate = DateTime.MinValue,
ModifiedDate = DateTime.MinValue,
};
}
...
}
https://www.questpdf.com/concepts/document-metadata.html
I'm using a cloud service to upload files to an Azure Storage service, and I want to check each file's integrity using an MD5 checksum. First I get the checksum from a function:
public static string GetMD5HashFromFile(Stream stream)
{
using (var md5 = MD5.Create())
{
return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", string.Empty);
}
}
For the test file I'm using, I get 1dffc245282f4e0a45a9584fe90f12f2, and I get the same result from an online tool.
Then I upload the file to Azure and read it back from my code like this (to keep the example short, assume the file and directories exist):
public bool CompareCheckSum(string fileName, string checksum)
{
this.storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("MyConnectionString"));
this.fileClient = this.storageAccount.CreateCloudFileClient();
this.shareReference = this.fileClient.GetShareReference(CloudStorageFileShareSettings.StorageFileShareName);
this.rootDir = this.shareReference.GetRootDirectoryReference();
this.directoryReference = this.rootDir.GetDirectoryReference("MyDirectory");
this.fileReference = this.directoryReference.GetFileReference(fileName);
Stream stream = new MemoryStream();
this.fileReference.DownloadToStream(stream);
string azureFileCheckSum = GetMD5HashFromFile(stream);
return azureFileCheckSum.ToLower() == checksum.ToLower();
}
I also tried to get the checksum using a different process like this:
public bool CompareCheckSum(string fileName, string checksum)
{
this.storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("MyConnectionString"));
this.fileClient = this.storageAccount.CreateCloudFileClient();
this.shareReference = this.fileClient.GetShareReference(CloudStorageFileShareSettings.StorageFileShareName);
this.rootDir = this.shareReference.GetRootDirectoryReference();
this.directoryReference =
this.rootDir.GetDirectoryReference("MyDirectory");
this.fileReference = this.directoryReference.GetFileReference(fileName);
this.fileReference.FetchAttributes();
string azureFileCheckSum = this.fileReference.Metadata["md5B64"];
return azureFileCheckSum.ToLower() == checksum.ToLower();
}
Finally, for azureFileCheckSum I get d41d8cd98f00b204e9800998ecf8427e. I'm not sure whether I'm doing something wrong or whether something changes when I upload the file to the ftp...
Before you call md5.ComputeHash(stream), you need to reset the stream's position to the beginning.
stream.Position = 0;
Of course, this will fail with a NotSupportedException if the stream type doesn't support seeking, but in your case it should work.
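The value d41d8cd98f00b204e9800998ecf8427e is the MD5 of zero bytes, which is the telltale sign that the hash was computed from a stream already positioned at its end. A small sketch of the effect, using Python's in-memory `io.BytesIO` as a stand-in for the C# MemoryStream (the payload is made up for illustration):

```python
import hashlib
import io

payload = b"example file contents"

stream = io.BytesIO()
stream.write(payload)  # like DownloadToStream: the position is now at the end

# Reading from the end yields nothing, so this is the hash of empty input.
at_end = hashlib.md5(stream.read()).hexdigest()

stream.seek(0)  # the equivalent of stream.Position = 0
rewound = hashlib.md5(stream.read()).hexdigest()

print(at_end)
print(rewound)
```

Only the second digest reflects the payload.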
I have a byte array that I get from an API.
byte[] sticker = db.call_API_print_sticker(Id);
I have to call this method a number of times and then convert the results to PDF. I want to store them in an array of arrays and convert them once I have them all.
How do I store them and then combine the byte-array PDFs into one?
Using PDFsharp from NuGet, I wrote the following C# method, which works purely with byte arrays:
public byte[] CombinePDFs(List<byte[]> srcPDFs)
{
using (var ms = new MemoryStream())
{
using (var resultPDF = new PdfDocument(ms))
{
foreach (var pdf in srcPDFs)
{
using (var src = new MemoryStream(pdf))
{
using (var srcPDF = PdfReader.Open(src, PdfDocumentOpenMode.Import))
{
for (var i = 0; i < srcPDF.PageCount; i++)
{
resultPDF.AddPage(srcPDF.Pages[i]);
}
}
}
}
resultPDF.Save(ms);
return ms.ToArray();
}
}
}
So the above method takes a list of source PDFs, combines them, and returns a single byte array for the resulting PDF.
The byte[] is probably just one PDF. I would think that you could just do
System.IO.File.WriteAllBytes(@"sticker.pdf", sticker);
If that is not the case, the easiest way would be to use a NuGet package such as PdfSharp to combine multiple PDFs into one.
An example of combining pdfs
The gist (which assumes each sticker contains 1 page):
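IEnumerable<byte[]> stickers; // assumed to be filled with the API results
using (var combinedPdf = new PdfDocument(@"stickers.pdf"))
    foreach (var pdf in stickers)
        using (MemoryStream ms = new MemoryStream(pdf))
        {
            // Pages can only be copied from a document opened in Import mode.
            var someSticker = PdfReader.Open(ms, PdfDocumentOpenMode.Import);
            combinedPdf.AddPage(someSticker.Pages[0]);
        }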
I'm working on an encryptor application based on the RSA asymmetric algorithm.
It generates a key pair, and the user has to keep it.
As key pairs are long random strings, I want to create a function that lets me compress the generated strings based on a pattern.
(For example, the function gets a string of 100 characters and returns a string of 30 characters.)
Then, when the user enters the compressed string, I can regenerate the key pair based on the pattern I compressed with.
But someone told me that it is impossible to compress random things because they are random!
What do you think?
Is there any way to do this?
Thanks
It's impossible to compress (nearly any) random data. Learning a bit about information theory, entropy, how compression works, and the pigeonhole principle will make this abundantly clear.
One exception to this rule is if by "random string", you mean, "random data represented in a compressible form, like hexadecimal". In this sort of scenario, you could compress the string or (the better option) simply encode the bytes as base 64 instead to make it shorter. E.g.
// base 16, 50 random bytes (length 100)
be01a140ac0e6f560b1f0e4a9e5ab00ef73397a1fe25c7ea0026b47c213c863f88256a0c2b545463116276583401598a0c36
// base 64, same 50 random bytes (length 68)
vgGhQKwOb1YLHw5KnlqwDvczl6H+JcfqACa0fCE8hj+IJWoMK1RUYxFidlg0AVmKDDY=
You might instead give the user a shorter hash or fingerprint of the value (e.g. the last x bytes). Then by storing the full key and hash somewhere, you could give them the key when they give you the hash. You'd have to have this hash be long enough that security is not compromised. Depending on your application, this might defeat the purpose because the hash would have to be as long as the key, or it might not be a problem.
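To make the two points above concrete, here is a short Python sketch that reuses the 50 sample bytes from this answer. It shows that switching from hex to base64 shortens the string without losing anything, while running a general-purpose compressor (zlib here, chosen only for illustration) over the raw random bytes does not help.

```python
import base64
import zlib

# The 50 random bytes from the example above, written as 100 hex characters.
hex_key = (
    "be01a140ac0e6f560b1f0e4a9e5ab00ef73397a1fe25c7ea00"
    "26b47c213c863f88256a0c2b545463116276583401598a0c36"
)

raw = bytes.fromhex(hex_key)
b64_key = base64.b64encode(raw).decode("ascii")

# Random bytes have no redundancy to exploit, so the compressed form
# ends up larger than the input once the format overhead is added.
compressed = zlib.compress(raw, 9)

print(len(hex_key), len(b64_key))  # 100 68
print(len(raw), len(compressed))
```

The saving comes entirely from a denser text encoding of the same bytes, not from compression.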
public static string ZipStr(String str)
{
using (MemoryStream output = new MemoryStream())
{
using (DeflateStream gzip =
new DeflateStream(output, CompressionMode.Compress))
{
using (StreamWriter writer =
new StreamWriter(gzip, System.Text.Encoding.UTF8))
{
writer.Write(str);
}
}
return Convert.ToBase64String(output.ToArray());
}
}
public static string UnZipStr(string base64)
{
byte[] input = Convert.FromBase64String(base64);
using (MemoryStream inputStream = new MemoryStream(input))
{
using (DeflateStream gzip =
new DeflateStream(inputStream, CompressionMode.Decompress))
{
using (StreamReader reader =
new StreamReader(gzip, System.Text.Encoding.UTF8))
{
return reader.ReadToEnd();
}
}
}
}
Take into account that the result is not necessarily shorter; it depends on the contents of the string.
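How much the length depends on the contents is easy to see with a rough Python counterpart of the two methods above. `zip_str` and `unzip_str` are hypothetical names, and the output is not guaranteed to be byte-for-byte identical to what the C# version produces; it only illustrates the size behaviour of DEFLATE followed by base64.

```python
import base64
import zlib


def zip_str(text):
    """Raw-DEFLATE the UTF-8 bytes of text and return them as base64."""
    compressor = zlib.compressobj(wbits=-15)  # negative wbits: no zlib header
    data = compressor.compress(text.encode("utf-8")) + compressor.flush()
    return base64.b64encode(data).decode("ascii")


def unzip_str(encoded):
    """Reverse zip_str: base64-decode, inflate, and decode as UTF-8."""
    data = zlib.decompress(base64.b64decode(encoded), -15)
    return data.decode("utf-8")


# A short string grows, a long repetitive one shrinks considerably.
for sample in ("hello", "abc" * 500):
    print(len(sample), "->", len(zip_str(sample)))
```

Short or high-entropy inputs come out longer because of the DEFLATE framing plus the roughly 33% base64 overhead.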
Try to use gzip compression and see if it helps you
I am trying to compress a large string on a client program in C# (.net 4) and send it to a server (django, python 2.7) using a PUT request.
Ideally I want to use the standard library at both ends, so I am trying to use gzip.
My C# code is:
public static string Compress(string s) {
var bytes = Encoding.Unicode.GetBytes(s);
using (var msi = new MemoryStream(bytes))
using (var mso = new MemoryStream()) {
using (var gs = new GZipStream(mso, CompressionMode.Compress)) {
msi.CopyTo(gs);
}
return Convert.ToBase64String(mso.ToArray());
}
}
The python code is:
s = base64.standard_b64decode(request)
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
decompressed_data = gz.read()
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}, i.e. every other letter is something weird.
If I take out every other character by doing decompressed_data[::2], then it works, but it's a bit of a hack, and clearly there is something else wrong.
I'm wondering if I need to base64 encode it at all for a PUT request? Is this only necessary for POST?
I think the main problem might be C# uses UTF-16 encoded strings. This may yield a problem similar to yours. As any other encoding problem, we might need a little luck here but I guess you can solve this by doing:
decompressed_data = gz.read().decode('utf-16')
There, decompressed_data should be Unicode and you can treat it as such for further work.
UPDATE: This worked for me:
C Sharp
static void Main(string[] args)
{
FileStream f = new FileStream("test", FileMode.CreateNew);
using (StreamWriter w = new StreamWriter(f))
{
w.Write(Compress("hello"));
}
}
public static string Compress(string s)
{
var bytes = Encoding.Unicode.GetBytes(s);
using (var msi = new MemoryStream(bytes))
using (var mso = new MemoryStream())
{
using (var gs = new GZipStream(mso, CompressionMode.Compress))
{
msi.CopyTo(gs);
}
return Convert.ToBase64String(mso.ToArray());
}
}
Python
import base64
import cStringIO
import gzip
f = open('test','rb')
s = base64.standard_b64decode(f.read())
buff = cStringIO.StringIO(s)
with gzip.GzipFile(fileobj=buff) as gz:
decompressed_data = gz.read()
print decompressed_data.decode('utf-16')
Without decode('utf-16') it printed in the console:
>>>h e l l o
With it, it printed correctly:
>>>hello
Good luck, hope this helps!
It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}
That's because you're using Encoding.Unicode to convert the string to bytes to start with.
If you can tell Python which encoding to use, you could do that - otherwise you need to use an encoding on the C# side which matches what Python expects.
If you can specify it on both sides, I'd suggest using UTF-8 rather than UTF-16. Even though you're compressing, it wouldn't hurt to make the data half the size (in many cases) to start with :)
I'm also somewhat suspicious of this line:
buff = cStringIO.StringIO(s)
s really isn't text data - it's compressed binary data, and should be treated as such. It may be okay - it's just worth checking whether there's a better way.
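A compact way to see both points (the encoding mismatch, and treating the decompressed payload as bytes) is the following sketch. It assumes Python 3 rather than the 2.7 used in the question, and it simulates the C# side in Python: Encoding.Unicode is UTF-16 little-endian and GetBytes emits no byte-order mark, so `utf-16-le` is the matching codec.

```python
import base64
import gzip

text = '{"changed"}'

# What the C# client effectively sends: gzip of UTF-16LE bytes, then base64.
wire = base64.b64encode(gzip.compress(text.encode("utf-16-le")))

raw = gzip.decompress(base64.b64decode(wire))  # still bytes, not text

wrong = raw.decode("latin-1")    # one character per byte: NULs between letters
right = raw.decode("utf-16-le")  # the encoding the sender actually used

# For ASCII-only text, UTF-8 needs half the bytes of UTF-16.
utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-le"))

print(repr(wrong))
print(right)
print(utf8_size, utf16_size)
```

Naming the byte order explicitly avoids relying on the platform default that a bare 'utf-16' falls back to when no byte-order mark is present.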
We have an embedded resource and need to get the MD5 hash of the file before extracting it, in order to know whether it differs from an already existing file (because if we have to extract it to compare them, it would be better to just replace the file directly).
Any suggestions are appreciated.
What sort of embedded resource is it? If it's one you get hold of using Assembly.GetManifestResourceStream(), then the simplest approach is:
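// GetManifestResourceStream is an instance method; use whichever assembly holds the resource.
using (Stream stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(...))
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(stream);
    }
}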
If that doesn't help, please give more information as to how you normally access/extract your resource.
You can use MemoryStream
using (MemoryStream ms = new MemoryStream(Properties.Resources.MyZipFile))
{
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
{
byte[] hash = md5.ComputeHash(ms);
string str = Convert.ToBase64String(hash);
// result for example: WgWKWcyl2YwlF/C8yLU9XQ==
}
}
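If the hash of the existing file is stored as hex while the resource hash is produced as base64 (as above), the two strings cannot be compared directly. A small Python sketch of the conversion, with `b64_to_hex` as a hypothetical helper name and the MD5 of empty input used as sample data:

```python
import base64
import hashlib

digest = hashlib.md5(b"").digest()  # 16 raw bytes

as_hex = digest.hex()                              # 32 hex characters
as_b64 = base64.b64encode(digest).decode("ascii")  # 24 base64 characters


def b64_to_hex(value):
    """Convert a base64-encoded digest to lowercase hex for comparison."""
    return base64.b64decode(value).hex()


print(as_hex)
print(as_b64)
print(b64_to_hex(as_b64) == as_hex)
```

Note that base64 is case-sensitive, so lowercasing a base64 digest before comparing (which is harmless for hex) would corrupt it.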