Let's say I have a folder with five hundred pictures in it, and I want to check for repeats and delete them.
Here's the code I have right now:
using (var md5 = MD5.Create())
{
using (var stream = File.OpenRead(filename))
{
return md5.ComputeHash(stream);
}
}
Would this be viable to spot repeated MD5s in a specific folder, provided I loop it accordingly?
Creating hashes to identify identical files works fine, in any programming language and on any OS. It is slow, though, because you read every file in full even when that is not necessary.
I would recommend several passes for finding duplicates (a sketch follows the list):
get the size of all files
for all files of equal size: get the hash of the first, say, 1k bytes
for all files of equal size and equal hash of first 1k: get the hash of the entire file
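A minimal sketch of those passes, assuming LINQ for the grouping and System.Security.Cryptography.MD5 for the hashing (class and method names here are made up for illustration):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Pass 1: group by file size; pass 2: group by hash of the first 1 KB;
    // pass 3: group by hash of the whole file. Each returned group holds
    // paths whose full contents hash identically.
    public static IEnumerable<IGrouping<string, string>> FindDuplicates(string directory)
    {
        var sameSize = Directory.EnumerateFiles(directory)
                                .GroupBy(path => new FileInfo(path).Length)
                                .Where(g => g.Count() > 1)
                                .SelectMany(g => g);

        var samePrefix = sameSize.GroupBy(path => HashPrefix(path, 1024))
                                 .Where(g => g.Count() > 1)
                                 .SelectMany(g => g);

        return samePrefix.GroupBy(HashFull)
                         .Where(g => g.Count() > 1);
    }

    static string HashPrefix(string path, int count)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[count];
            int read = stream.Read(buffer, 0, count);   // a robust version would loop on Read
            return BitConverter.ToString(md5.ComputeHash(buffer, 0, read));
        }
    }

    static string HashFull(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }
}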
There is a risk of hash collisions. You cannot avoid it with hash algorithms. As MD5 uses 128 bits, the risk for two random files is 1 in 2^128, roughly 3·10^-39. Your chances of hitting the jackpot in your national lottery four times in a row, using only one lottery ticket each week, are much better than getting a hash collision on a random pair of files.
The probability of a collision does rise somewhat when you compare the hashes of many files: with 500 files there are about 500·499/2 ≈ 125,000 pairs, which still gives a collision probability of only about 4·10^-34. The mathematically interested, and people implementing hash containers, should look up the "birthday problem". Mere mortals can trust MD5 hashes as long as they are not implementing cryptographic algorithms.
using System;
using System.IO;
using System.Collections.Generic;
internal static class FileComparer
{
    public static void Compare(string directoryPath)
    {
        if (!Directory.Exists(directoryPath))
        {
            return;
        }
        Compare(new DirectoryInfo(directoryPath));
    }

    private static void Compare(DirectoryInfo info)
    {
        List<FileInfo> files = new List<FileInfo>(info.EnumerateFiles());
        // Paths already removed as duplicates, so we never read or delete them twice.
        HashSet<string> deleted = new HashSet<string>();
        foreach (FileInfo file in files)
        {
            if (deleted.Contains(file.FullName))
            {
                continue;
            }
            byte[] array = File.ReadAllBytes(file.FullName);
            foreach (FileInfo file2 in files)
            {
                // Never compare a file with itself, and skip files we already deleted.
                if (file2.FullName == file.FullName || deleted.Contains(file2.FullName))
                {
                    continue;
                }
                byte[] array2 = File.ReadAllBytes(file2.FullName);
                if (array2.Length != array.Length)
                {
                    continue;
                }
                bool identical = true;
                for (int i = 0; i < array.Length; i++)
                {
                    if (array[i] != array2[i])
                    {
                        identical = false;
                        break;
                    }
                }
                if (identical)
                {
                    file2.Delete();
                    deleted.Add(file2.FullName);
                }
            }
        }
    }
}
This is my first post and I am very sorry if I made errors with the format.
I am trying to write a program to encrypt all kinds of files via XOR in a secure way. I know that XOR isn't the most secure encryption method, but I wanted to give it a try.
So please have a look at my method and tell me if it is complete bullshit or not :)
The password is a String, chosen by the user.
In the beginning I only XORed the file with the password, leading to an easy decryption if parts of the password were guessed correctly.
Here is my procedure:
TmpFile = File XOR (hash of password combined with pw.length.toString) // to make sure that the password elements are in the right order
TmpFile = TmpFile XOR (XOR byte composed from each byte of the password) // ensures that the password used to decode has exactly the right chars
TmpFile = TmpFile XOR initial_password
Could the encrypted text be decrypted with the self-XOR-shifting technique?
Thanks for your advice! :)
edit: here is the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Security;
using System.IO;
using System.Windows;
namespace EncodeEverything
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("FileEncrypter v01 \n \n");
//get password
Console.WriteLine("Enter your Password (encryption key)");
string password = getPassword();
Console.WriteLine("");
while (true)
{
Console.WriteLine("");
Console.WriteLine("-----------------------");
Console.WriteLine("");
//get file to encrypt
Console.WriteLine("File to encrypt/decrypt:");
Console.Write(" ");
string path = Console.ReadLine();
//-------------------------------
//load, encrypt & save file
//-------------------------------
try {
Byte[] tmpBArr = encrypt(File.ReadAllBytes(path), getCustomHash(password));
File.WriteAllBytes(path, encrypt(tmpBArr, password));
Console.WriteLine(" done.");
}
catch(System.Exception e)
{
Console.WriteLine("!! Error while processing. Path correct? !!");
}
}
}
private static string getCustomHash(string word)
{
string output = "";
output += word.Length.ToString();
output += word.GetHashCode();
return output;
}
// encrypt or decrypt a Byte[] (XOR is its own inverse)
public static Byte[] encrypt(byte[] s, string key)
{
List<Byte> output = new List<byte>();
Byte[] codeword = Encoding.UTF8.GetBytes(key);
Byte keybyte =(Byte)( codeword[0]^ codeword[0]);
foreach(Byte b in codeword)
{
keybyte = (Byte)(b ^ keybyte);
}
for (int i = 0; i < s.Length; i++)
{
output.Add((Byte)(s[i] ^ codeword[i % codeword.Length] ^ keybyte));
}
return output.ToArray();
}
public static string getPassword()
{
Console.Write(" ");
string pwd = "";
while (true)
{
ConsoleKeyInfo i = Console.ReadKey(true);
if (i.Key == ConsoleKey.Enter)
{
break;
}
else if (i.Key == ConsoleKey.Backspace)
{
if (pwd.Length > 0)
{
pwd= pwd.Remove(pwd.Length - 1);
Console.Write("\b \b");
}
}
else
{
pwd+=(i.KeyChar);
Console.Write("*");
}
}
return pwd;
}
}
}
string.GetHashCode doesn't have a well-defined return value, so you might not even be able to decrypt the file after you restart the process.
Your key consists of a 32-bit value plus the length of the password. That can be brute-forced in seconds on a single computer.
Once the file is longer than the hashed key, the key starts repeating and you get a many-time pad. So even if we ignored the brute-force attack, it would still be easy to break; it's essentially an XOR-based Vigenère variant.
Ignoring the XOR-ed parity byte, which is the same for each byte in the message, the key-stream bytes are ASCII digits, so each key byte has at best 3.3 bits of entropy. Compare that with the approximately 1.5 bits of entropy per letter in English text, and you can see it's quite weak even without key-stream repetitions.
=> it's buggy and insecure
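A quick way to see the GetHashCode problem for yourself; the exact behaviour depends on the runtime (on .NET Core / .NET 5+, string hash codes are randomized per process by default, while on .NET Framework they are merely not guaranteed to be stable across versions or platforms):

using System;

class HashCodeDemo
{
    static void Main()
    {
        // Run this program twice: on a runtime with randomized string hashing
        // the two printed values will differ, so a key derived from
        // GetHashCode() cannot reliably decrypt a file written by an earlier run.
        Console.WriteLine("password".GetHashCode());
    }
}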
You can ignore this answer if you're just trying to encrypt files as a learning exercise in cryptography, but if you're looking for a real-world solution to securing your file data, read on.
I'd really recommend using the file encryption built into the .NET Framework for this sort of thing: File.Encrypt and File.Decrypt, which use the Windows NTFS Encrypting File System (EFS) under the hood, so the file is encrypted transparently for the current user account.
From Microsoft: https://msdn.microsoft.com/en-us/library/system.io.file.encrypt(v=vs.110).aspx
using System;
using System.IO;
using System.Security.AccessControl;
namespace FileSystemExample
{
class FileExample
{
public static void Main()
{
try
{
string FileName = "test.xml";
Console.WriteLine("Encrypt " + FileName);
// Encrypt the file.
AddEncryption(FileName);
Console.WriteLine("Decrypt " + FileName);
// Decrypt the file.
RemoveEncryption(FileName);
Console.WriteLine("Done");
}
catch (Exception e)
{
Console.WriteLine(e);
}
Console.ReadLine();
}
// Encrypt a file.
public static void AddEncryption(string FileName)
{
File.Encrypt(FileName);
}
// Decrypt a file.
public static void RemoveEncryption(string FileName)
{
File.Decrypt(FileName);
}
}
}
It is hard to say for sure that this is what you need, because other things may need to be taken into consideration such as whether you need to pass the file between different clients/servers etc, as well as how much data you're encrypting in each file.
Again, if you're looking for real-world cryptography using C#, I can't stress enough that you should be looking at the built-in .NET encryption rather than trying to roll your own, especially if you don't have any formal training in the subject matter. I recommend you pore through the Microsoft documentation on .NET Framework encryption if you're interested in securing data in production:
https://msdn.microsoft.com/en-us/library/0ss79b2x(v=vs.110).aspx
Here is a nice walkthrough for creating an example file-encrypting Windows Forms application:
https://msdn.microsoft.com/en-us/library/bb397867(v=vs.110).aspx
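If you do want password-based encryption in code rather than EFS, here is a rough sketch using the built-in cryptography classes (AES with a key and IV derived from the password via Rfc2898DeriveBytes; the class and method names are made up for illustration, and details such as authenticating the ciphertext are omitted). Decryption would mirror this by reading the salt back first:

using System.IO;
using System.Security.Cryptography;

static class AesFileEncryption
{
    // Encrypts a file with AES, deriving key and IV from a password.
    // The random salt is written at the start of the output file.
    public static void EncryptFile(string inputPath, string outputPath, string password)
    {
        byte[] salt = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(salt);
        }

        using (var keyDerivation = new Rfc2898DeriveBytes(password, salt, 100000))
        using (var aes = Aes.Create())
        {
            aes.Key = keyDerivation.GetBytes(32);
            aes.IV = keyDerivation.GetBytes(16);

            using (var output = File.Create(outputPath))
            {
                output.Write(salt, 0, salt.Length);
                using (var cryptoStream = new CryptoStream(output, aes.CreateEncryptor(), CryptoStreamMode.Write))
                using (var input = File.OpenRead(inputPath))
                {
                    input.CopyTo(cryptoStream);
                }
            }
        }
    }
}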
I am writing code that calculates the MD5/SHA256 of a program, and later I want to be able to change that hash.
I wrote the code for calculating the MD5/SHA256, which is:
using (var md5 = MD5.Create())
{
using (var stream = File.OpenRead(textBox1.Text))
{
MessageBox.Show(BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", ""));
}
}
using (var sha256 = SHA256.Create())
{
using (var stream = File.OpenRead(textBox1.Text))
{
MessageBox.Show(BitConverter.ToString(sha256.ComputeHash(stream)).Replace("-", ""));
}
}
Next I want to be able to change the values of MD5/SHA256 for the specified file. I have searched and all I found was this class:
class FileUtils
{
#region VARIABLES
private const int OFFSET_CHECKSUM = 0x12;
#endregion
#region METHODS
public static ushort GetCheckSum(string fileName)
{
if (!File.Exists(fileName))
throw new FileNotFoundException("Invalid fileName");
return GetCheckSum(File.ReadAllBytes(fileName));
}
public static ushort GetCheckSum(byte[] fileData)
{
if (fileData.Length < OFFSET_CHECKSUM + 1)
throw new ArgumentException("Invalid fileData");
return BitConverter.ToUInt16(fileData, OFFSET_CHECKSUM);
}
public static void WriteCheckSum(string sourceFile, string destFile, ushort checkSum)
{
if (!File.Exists(sourceFile))
throw new FileNotFoundException("Invalid fileName");
WriteCheckSum(File.ReadAllBytes(sourceFile), destFile, checkSum);
}
public static void WriteCheckSum(byte[] data, string destFile, ushort checkSum)
{
byte[] checkSumData = BitConverter.GetBytes(checkSum);
checkSumData.CopyTo(data, OFFSET_CHECKSUM);
File.WriteAllBytes(destFile, data);
}
#endregion
}
I don't really understand how it works, and it only changes the MD5. Is there an easier way to do this for not-so-advanced users? If this class does what I need, could someone explain how I can use it?
Edit: I am aware that the MD5 of the file can't be changed directly. My goal is to add some content to the file, which would change the MD5, while the file remains functionally unchanged.
As far as I understand, you have or want two copies of the same PE executable file. Now you want to change either or both of these files, so that when you calculate a hash of the file's contents, they are different.
If you change the checksum, chances are the executable won't run anymore. If you're OK with that, you can easily use the class that you showed. It seems to assume a checksum consists of two bytes and is offset at byte 0x12 in the executable. I can't verify right now that it is correct, but at a glance it doesn't seem to be.
Anyway, you can create a unique checksum per file and set it:
FileUtils.WriteCheckSum(sourceFile, destFile1, 1);
FileUtils.WriteCheckSum(sourceFile, destFile2, 2);
Now the two files will bear different contents, so the hash of their contents will be different.
You can't just decide that you want your file to have a different hash because the hash is a direct result of the data stored in that file. Two identical files, in terms of what they contain, will always produce the same hash, regardless of what their names are.
Any changes to the content the file itself will result in an entirely different hash value.
MD5 is computed from the bytes of the input (a file, for example) and is typically displayed in hexadecimal. You don't change the MD5 of a file; the resulting MD5 changes when the file's contents change.
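A minimal illustration of that point (the file name is just a placeholder): appending even a single byte produces a completely different digest, and whether the file still works afterwards depends entirely on the file format.

using System;
using System.IO;
using System.Security.Cryptography;

class HashChangeDemo
{
    static string Md5Hex(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "");
        }
    }

    static void Main()
    {
        string path = "sample.bin";                      // hypothetical file
        File.WriteAllBytes(path, new byte[] { 1, 2, 3 });
        Console.WriteLine(Md5Hex(path));                 // original hash

        // Append one byte: the content changes, so the hash changes completely.
        using (var stream = new FileStream(path, FileMode.Append))
        {
            stream.WriteByte(0);
        }
        Console.WriteLine(Md5Hex(path));                 // entirely different hash
    }
}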
I need to combine multiple files (video, music and text) into a single file with a custom file type (for example: *.abcd) and a custom data structure. The file should only be readable by my program, and my program should be able to separate the parts of this file again. How do I do this in .NET and C#?
Like M463 rightly pointed out, you could use System.IO.Compression to compress those files together and encrypt them...although encryption is a completely different art and another headache.
A better option would be to put some metadata in, say, the first few bytes of the file and then store the files' contents as raw bytes. This would prevent anyone from figuring out the contents just by looking at the file in a text editor. Again, if you want to really protect your data, encryption is unavoidable. But a simple algorithm to begin with would be this:
using System;
using System.IO;
using System.Collections.Generic;
namespace Compression
{
public class ClassName
{
public static void Compress(string[] fileNames, string resultantFileName)
{
List<byte> bytesToWrite = new List<byte>();
//add metadata about the number of files
int filesLength = fileNames.Length;
bytesToWrite.AddRange(BitConverter.GetBytes(filesLength));
List<byte[]> files = new List<byte[]>();
foreach(string fileName in fileNames)
{
byte[] bytes = File.ReadAllBytes(fileName);
//add metadata about the size of each file
bytesToWrite.AddRange(BitConverter.GetBytes(bytes.Length));
files.Add(bytes);
}
foreach(byte[] bytes in files)
{
//write the actual files itself
bytesToWrite.AddRange(bytes);
}
File.WriteAllBytes(resultantFileName, bytesToWrite.ToArray());
}
public static void Decompress(string fileName)
{
List<byte> bytes = new List<byte>(File.ReadAllBytes(fileName));
//this int represents the number of files in the byte array
int filesLength = BitConverter.ToInt32(bytes.ToArray(), 0);
List<int> sizes = new List<int>();
//get the size of each file
for(int i = 0; i < filesLength; i++)
{
//the first 4 bytes hold the number of files,
//then each succeeding 4-byte int is the size of one file
int size = BitConverter.ToInt32(bytes.ToArray(), 4 + i * 4);
sizes.Add(size);
}
//now read all the files
for(int i = 0; i < filesLength; i++)
{
int lastByteTillNow = 0;
for(int j = 0; j < i; j++)
lastByteTillNow += sizes[j];
File.WriteAllBytes("file " + i, bytes.GetRange(2 + 2 * filesLength + lastByteTillNow, sizes[i]).ToArray());
}
}
}
}
Now obviously this is not the best algorithm you've come across, nor is it the most optimized snippet of code. After all, it is just what I could come up with in 10-15 minutes, so I haven't even tested it thoroughly. The point is, it gives you the idea, doesn't it? I have limited the size of each file to the maximum value of an Int32 (changing it to Int64, a.k.a. long, wouldn't be much trouble). You can even modify the snippet to load and write to and from RAM via MemoryStreams (System.IO.MemoryStream). But whatever, this should give you a start!
How about an encrypted .zip container holding your files? Handling of .zip files is already available in the .NET Framework; take a look at the System.IO.Compression namespace. Or you could use some third-party library.
You could even force a different file extension by just renaming the file, if you really want to...
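A minimal sketch of that idea using the built-in ZipFile class (requires a reference to System.IO.Compression.FileSystem; the paths and the .abcd extension are just placeholders, and the built-in classes don't add password protection, so encryption would still need a third-party library):

using System.IO.Compression;

class Bundler
{
    static void Main()
    {
        // Pack all files from a folder into one container; the extension is arbitrary.
        ZipFile.CreateFromDirectory(@"C:\input", @"C:\output\bundle.abcd");

        // Read it back by extracting the parts again.
        ZipFile.ExtractToDirectory(@"C:\output\bundle.abcd", @"C:\extracted");
    }
}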
I find this quite odd. I have a utility I'm working on which is pointed at a folder and indexes it with relative path / filename / filesize / MD5 hash / some other things. If the MD5 hashes don't match, it updates the hash in the database, backs the file up again, and continues on its way with the rest of the files. This is primarily for backup purposes, but also for my own learning.
The first time I ran the program aimed at some of my web projects, it used disk IO and grabbed file handles, both visible in Process Hacker. However, the second time I ran it (as in, I shut it down and restarted it), it didn't appear to use any disk IO and only periodically grabbed a handle. Yet hashes were still appearing.
The code which iterates over the files:
foreach (string path in paths)
{
try
{
string relativePath = path.Replace(@"Z:\99_Projects\web\de.com\", "");
BackupFile backupFile = BackupFile.GetFile(relativePath, connection);
string md5hash = "";
long filesize = (new FileInfo(path)).Length;
using (var file = File.OpenRead(path))
{
md5hash = Hasher.ComputeMD5Hash(file);
//Console.WriteLine(md5hash);
if (backupFile == null)
{
BackupFile.NewBackupFile(relativePath, Path.GetFileName(path), md5hash, filesize, connection);
}
else
{
if (backupFile.md5 != md5hash)
backupFile.flags = CoreLib.Utils.Backup.Enums.BackupFileFlags.CHANGED;
else
backupFile.flags = CoreLib.Utils.Backup.Enums.BackupFileFlags.UNCHANGED;
backupFile.filesize = filesize;
backupFile.md5 = md5hash;
backupFile.Save(connection);
}
file.Close();
}
}
catch (IOException e)
{
Console.WriteLine("Access: " + Path.GetFileName(path));
}
catch (SQLiteException e)
{
Console.WriteLine("|E|");
}
catch (Exception e)
{
Console.WriteLine("|EG|");
throw e;
}
}
Here is the Hasher class as used; it is really just a small wrapper around the framework's MD5 hash calculator so I can reuse it (and other hash methods I stick in it) elsewhere in other code.
public class Hasher
{
public static string ComputeMD5Hash(Stream stream)
{
string hash = "";
using (var md5 = System.Security.Cryptography.MD5.Create())
{
hash = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
}
return hash;
}
}
I've tried several things, including debugging the application, and verified that it's actually opening file streams and computing hashes. I've also had it print the hash to the console, as shown by the commented-out line under where it's computed, but even while it's printing hashes it shows no disk IO at all.
Well, it looks as though it's simply the file system cache, combined with Process Hacker's polling rate being too slow to pick up on the reads once the files are cached. I changed a file, and the program detected the change but still showed no measurable disk IO (which Process Hacker reports in bytes), so I'm assuming at this point that it just didn't register because of how fast the cached read was.
I'm using iTextSharp to read the text from a PDF file. However, there are times I cannot extract text, because the PDF file only contains images. I download the same PDF files every day, and I want to see if a PDF has been modified. If the text and modification date cannot be obtained, is an MD5 checksum the most reliable way to tell if the file has changed?
If it is, some code samples would be appreciated, because I don't have much experience with cryptography.
It's very simple using System.Security.Cryptography.MD5:
using (var md5 = MD5.Create())
{
using (var stream = File.OpenRead(filename))
{
return md5.ComputeHash(stream);
}
}
(I believe that actually the MD5 implementation used doesn't need to be disposed, but I'd probably still do so anyway.)
How you compare the results afterwards is up to you; you can convert the byte array to base64 for example, or compare the bytes directly. (Just be aware that arrays don't override Equals. Using base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.)
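For instance, a minimal comparison sketch (CalculateMD5Bytes is a hypothetical helper that just returns md5.ComputeHash(stream) as above, and the file names are placeholders):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

class HashCompare
{
    static byte[] CalculateMD5Bytes(string filename)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(filename))
        {
            return md5.ComputeHash(stream);
        }
    }

    static void Main()
    {
        byte[] hash1 = CalculateMD5Bytes("yesterday.pdf");
        byte[] hash2 = CalculateMD5Bytes("today.pdf");

        // Arrays don't override Equals, so compare element by element...
        bool sameByBytes = hash1.SequenceEqual(hash2);
        // ...or compare a textual form such as Base64.
        bool sameByBase64 = Convert.ToBase64String(hash1) == Convert.ToBase64String(hash2);

        Console.WriteLine(sameByBytes && sameByBase64 ? "unchanged" : "changed");
    }
}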
If you need to represent the hash as a string, you could convert it to hex using BitConverter:
static string CalculateMD5(string filename)
{
using (var md5 = MD5.Create())
{
using (var stream = File.OpenRead(filename))
{
var hash = md5.ComputeHash(stream);
return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
}
}
}
This is how I do it:
using System.IO;
using System.Security.Cryptography;
public string checkMD5(string filename)
{
using (var md5 = MD5.Create())
{
using (var stream = File.OpenRead(filename))
{
return Convert.ToBase64String(md5.ComputeHash(stream));
}
}
}
I know this question was already answered, but this is what I use:
using (FileStream fStream = File.OpenRead(filename)) {
return GetHash<MD5>(fStream);
}
Where GetHash:
// Requires: using System.Reflection; using System.Security.Cryptography; using System.Text;
public static String GetHash<T>(Stream stream) where T : HashAlgorithm {
StringBuilder sb = new StringBuilder();
MethodInfo create = typeof(T).GetMethod("Create", new Type[] {});
using (T crypt = (T) create.Invoke(null, null)) {
byte[] hashBytes = crypt.ComputeHash(stream);
foreach (byte bt in hashBytes) {
sb.Append(bt.ToString("x2"));
}
}
return sb.ToString();
}
Probably not the best way, but it can be handy.
Here is a slightly simpler version that I found. It reads the entire file in one go and only requires a single using statement.
byte[] ComputeHash(string filePath)
{
using (var md5 = MD5.Create())
{
return md5.ComputeHash(File.ReadAllBytes(filePath));
}
}
I know I am late to the party, but I ran a test before actually implementing the solution.
I tested the built-in MD5 class against md5sum.exe. In my case the built-in class took about 13 seconds, whereas md5sum.exe took around 16-18 seconds on every run.
DateTime current = DateTime.Now;
string file = @"C:\text.iso"; // it's a 2.5 GB file
string output;
using (var md5 = MD5.Create())
{
using (var stream = File.OpenRead(file))
{
byte[] checksum = md5.ComputeHash(stream);
output = BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
Console.WriteLine("Total seconds : " + (DateTime.Now - current).TotalSeconds.ToString() + " " + output);
}
}
For dynamically generated PDFs, the creation and modification dates will always be different.
You have to remove them or set them to a constant value.
Then generate the MD5 hashes to compare.
You can use PdfStamper (from iTextSharp) to remove or update the dates.
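A rough sketch of what that might look like with iTextSharp 5 (PdfReader, PdfStamper and the MoreInfo property are the assumed API; verify the details against your iTextSharp version, since the stamper may write its own ModDate unless configured otherwise):

using System.IO;
using iTextSharp.text.pdf;

static class PdfDateScrubber
{
    // Copies src to dest while blanking the date entries in the PDF info
    // dictionary, so two otherwise identical PDFs produce the same MD5.
    public static void RemoveDates(string src, string dest)
    {
        PdfReader reader = new PdfReader(src);
        PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create));

        var info = reader.Info;            // document info dictionary as a string dictionary
        info["CreationDate"] = null;       // assumption: a null value drops the entry
        info["ModDate"] = null;
        stamper.MoreInfo = info;

        stamper.Close();
        reader.Close();
    }
}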
In addition to the methods answered above, if you're comparing PDFs you need to amend the creation and modification dates, or the hashes won't match.
For PDFs generated with QuestPDF, you'll need to override the CreationDate and ModifiedDate in the document metadata.
public class PdfDocument : IDocument
{
...
DocumentMetadata GetMetadata()
{
return new()
{
CreationDate = DateTime.MinValue,
ModifiedDate = DateTime.MinValue,
};
}
...
}
https://www.questpdf.com/concepts/document-metadata.html