I have the following situation in C#:
ZipFile zip1 = ZipFile.Read("f1.zip");
ZipFile zip2 = ZipFile.Read("f2.zip");
MemoryStream ms1 = new MemoryStream();
MemoryStream ms2 = new MemoryStream();
ZipEntry zipentry1 = zip1["f1.dll"];
ZipEntry zipentry2 = zip2["f2.dll"];
zipentry1.Extract(ms1);
zipentry2.Extract(ms2);
byte[] b1 = new byte[ms1.Length];
byte[] b2 = new byte[ms2.Length];
ms1.Seek(0, SeekOrigin.Begin);
ms2.Seek(0, SeekOrigin.Begin);
What I have done here is open two zip files, f1.zip and f2.zip. Then I extract one file from each (f1.dll from f1.zip and f2.dll from f2.zip) into the MemoryStream objects. I now want to compare the files and find out whether they are the same or not. I had 2 ways in mind:
1) Read the memory streams byte by byte and compare them.
For this I would use
ms1.BeginRead(b1, 0, (int) ms1.Length, null, null);
ms2.BeginRead(b2, 0, (int) ms2.Length, null, null);
and then run a for loop and compare each byte in b1 and b2.
2) Get the string values for both the memory streams and then do a string compare. For this I would use
string str1 = Encoding.UTF8.GetString(ms1.GetBuffer(), 0, (int)ms1.Length);
string str2 = Encoding.UTF8.GetString(ms2.GetBuffer(), 0, (int)ms2.Length);
and then do a simple string compare.
Now, I know comparing byte by byte will always give me a correct result. But the problem with it is that it will take a lot of time, as I have to do this for thousands of files. That is why I am thinking about the string compare method, which looks like it would find out whether the files are equal very quickly. But I am not sure the string compare will give me the correct result, as the files are dlls, media files, etc. and will certainly contain special characters.
Can anyone tell me if the string compare method will work correctly or not?
Thanks in advance.
P.S.: I am using the DotNetZip library.
The baseline for this question is the native way to compare arrays: Enumerable.SequenceEqual. You should use that unless you have good reason to do otherwise.
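For example, once the two byte arrays b1 and b2 from the question have been filled:
using System.Linq;
// SequenceEqual returns true only if both arrays have the same length
// and every corresponding pair of bytes matches.
bool identical = b1.SequenceEqual(b2);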
If you care about speed, you could attempt to p/invoke to memcmp in msvcrt.dll and compare the byte arrays that way. I find it hard to imagine that could be beaten. Obviously you'd do a comparison of the lengths first and only call memcmp if the two byte arrays had the same length.
The p/invoke looks like this:
[DllImport("msvcrt.dll", CallingConvention=CallingConvention.Cdecl)]
static extern int memcmp(byte[] lhs, byte[] rhs, UIntPtr count);
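A minimal sketch of the length-check-then-memcmp comparison described above, assuming the p/invoke declaration shown here:
static bool ByteArraysEqual(byte[] b1, byte[] b2)
{
    // Different lengths can never be equal; only call memcmp when they match.
    if (b1.Length != b2.Length)
        return false;
    return memcmp(b1, b2, new UIntPtr((uint)b1.Length)) == 0;
}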
But you should only contemplate this if you really do care about speed, and the pure managed alternatives are too slow for you. So, do some timings to make sure you are not optimising prematurely. Well, even to make sure that you are optimising at all.
I'd be very surprised if converting to string was fast. I'd expect it to be slow. And in fact I'd expect your code to fail because there's no reason for your byte arrays to be valid UTF-8. Just forget you ever had that idea!
Compare the ZipEntry.Crc and ZipEntry.UncompressedSize of the two entries; only if both match, uncompress and do the byte comparison. If the two files are the same, their CRC and size will be the same too. This strategy will save you a ton of CPU cycles.
ZipEntry zipentry1 = zip1["f1.dll"];
ZipEntry zipentry2 = zip2["f2.dll"];
if (zipentry1.Crc == zipentry2.Crc && zipentry1.UncompressedSize == zipentry2.UncompressedSize)
{
// uncompress
zipentry1.Extract(ms1);
zipentry2.Extract(ms2);
byte[] b1 = new byte[ms1.Length];
byte[] b2 = new byte[ms2.Length];
ms1.Seek(0, SeekOrigin.Begin);
ms2.Seek(0, SeekOrigin.Begin);
ms1.Read(b1, 0, (int) ms1.Length); // synchronous Read; BeginRead would require a matching EndRead
ms2.Read(b2, 0, (int) ms2.Length);
// perform a byte comparison
if (Enumerable.SequenceEqual(b1, b2)) // or a simple for loop
{
// files are the same
}
else
{
// files are different
}
}
else
{
// files are different
}
What is the most efficient way to combine multiple variables of different Types into a single byte-array?
Take the following example data:
short a = 500;
byte b = 10;
byte[] c = new byte[4];
How could I combine these three variables into one byte array without wasting too much time and memory?
Think of it like this (Pseudocode):
var combinedArray = new byte[] { a, b, c };
I thought of different ways, including unsafe code, converting them to byte[] using BitConverter, and using LINQ's Concat.
I need an array in the end, not just an IEnumerable, because I need to send this data via UDP.
Are there any methods I did not think of?
Use the BinaryWriter combined with a MemoryStream.
var buffer = new MemoryStream();
var writer = new BinaryWriter(buffer);
writer.Write(a);
writer.Write(b);
writer.Write(c);
writer.Close();
byte[] bytes = buffer.ToArray();
But do note that there is no padding or alignment. The array c will start at an odd offset.
You will also have to verify the Big Endian / Little Endian contract with your client.
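If the client does expect network byte order (big-endian), a minimal sketch on a little-endian sender is to swap the multi-byte values before writing them:
using System.Net;

// HostToNetworkOrder converts the short to big-endian on a little-endian
// machine; single bytes and raw byte arrays need no conversion.
writer.Write(IPAddress.HostToNetworkOrder(a));
writer.Write(b);
writer.Write(c);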
I'm working on a video game cheat engine with utilizes simple memory manipulation to achieve its goal. I have successfully been able to write a piece of code that dumps a process' memory into a byte[] and iterates over these arrays in search of the desired string. The piece of code that searches is thus:
public bool FindString(byte[] bytes, string pName, long offset)
{
string s = System.Text.Encoding.UTF8.GetString(bytes);
var match = Regex.Match(s, "test");
if (match.Success)
return true;
return false;
}
I then open up a 32-bit version of notepad (since that is what my dumping method is conditioned for) and type the word "test" in it and run my program in debug mode to see if the condition is ever hit. It does not.
Upon further inspection I checked the contents of the 's' string on one of the iterations; it looks like this:
\0\0\0\0\0\0\0\0���\f\0\u0001����\u0001\0\0\0 \u0001�\0\0\0\0\0 \u0001�\0\0\0\0\0\0\0�\0\0\0\0\0\0\0�\0\0\0\0\0\u0010\0\0\0\0\0\0\0 \a�\0\0\0\0\0\0\0�\0\0\0\0\0\u000f\0\0\0\u0001\0\0\0\0\0\0\0\0\0\0\0�\u000f�\0\0\0\0\0�\u000f�\0\0\0\0\0\0�\0\0\0\0\0\0\0\0\0\0\0\0\u0010\0\0\0\0\0\0\0\0\0����\f\0\0\0\0\0\0\0�\0\0����\0\0\0\0\0\0\u0010\0\0\0\0\0\0 \0\0\0\0\0\0\0\u0001\0\0\0\0\0\0\0\u0010\0\0\0\0\0\0�\0\0\0\0\0\0\0�����\u007f\0\0\u0002\0�\u0002\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0�\u000f�\0\0\0\0\0�\u000f�\0\0\0\0\0\u001f\0\0\0\0\0\0\0��������\u0010\u0001�\0\0\0\0\0\u0010\u0001�\0\0\0\0\0\u0018\0�\0\0\0\0\0\u0018\0�\0\0\0\0\0\0\0\0\0\0\0\0\0�\u0002�\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\00\a�\0\0\0\0\00\a�\0\0\0\0\0�\u0002�\0\0\0\0\0�M�^\u000e\u000e_\u007f\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\u0001\0\0\0\0\0\0\u0010\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\u0001\0\0\0\u0001\0\0\0\0\0\0\0\0\0\0\0\b\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\u0001\0\0\0\b\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0`\a\0\0\0\0\0\0`\a\0\0\0\0\0\0\u0004\0\0\0\0\0\0\0\0�\u001f\0\0\0\0\0�\u001d\u0014)�\u007f\0\0����\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0�\a\0\u0002\0\0\0\0\0\0\0\0\0\0\0\0�\0\0\0\0\0\0\0\u0001\0\0\0\u0001\0\0\0\0\0\0\0\0\0\0\0P\u0001�\0\0\0\0\0\0\u0003�\0\0\0\0\0\u0010\u0003�\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0�
I continued to check the 's' variable on each pass through this method and never saw any readable strings in this format.
My question is simple. What am I doing wrong that I cannot find this string? The dumping is succeeding, but something to do with my method of parsing is causing me trouble.
UPDATE (code for dumping memory)
void ScanProcess(Process process)
{
// getting minimum & maximum address
var sys_info = new SYSTEM_INFO();
GetSystemInfo(out sys_info);
var proc_min_address = sys_info.minimumApplicationAddress;
var proc_max_address = sys_info.maximumApplicationAddress;
var proc_min_address_l = (long)proc_min_address;
var proc_max_address_l = (long)proc_max_address;
//Opening the process with desired access level
var processHandle = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_WM_READ, false, process.Id);
var mem_basic_info = new MEMORY_BASIC_INFORMATION();
var bytesRead = 0; // number of bytes read with ReadProcessMemory
while (proc_min_address_l < proc_max_address_l)
{
VirtualQueryEx(processHandle, proc_min_address, out mem_basic_info, 28); //28 = sizeof(MEMORY_BASIC_INFORMATION)
//If this memory chunk is accessible
if (mem_basic_info.Protect == PAGE_READWRITE && mem_basic_info.State == MEM_COMMIT)
{
//Read everything into a buffer
byte[] buffer = new byte[mem_basic_info.RegionSize];
ReadProcessMemory((int)processHandle, mem_basic_info.BaseAddress, buffer, mem_basic_info.RegionSize, ref bytesRead);
var memScanner = new MemScan();
memScanner.FindString(buffer, process.ProcessName, proc_max_address_l);
}
// move to the next memory chunk
proc_min_address_l += mem_basic_info.RegionSize;
proc_min_address = new IntPtr(proc_min_address_l);
if (mem_basic_info.RegionSize == 0)
{
// avoid an infinite loop if the region size comes back as zero
break;
}
}
}
For starters, you can't use Notepad (or any other tool that isn't binary-capable) to look at your bytes.
You need to use the BitConverter APIs:
https://msdn.microsoft.com/en-us/library/system.bitconverter(v=vs.110).aspx
...to walk the data and compose/search it to find what you're looking for (keeping in mind whatever encoding you dumped the data in).
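A minimal sketch of what "walking the data" could look like; the buffer contents below are only an illustration, not taken from the question:
// "test" as UTF-16 little-endian bytes, the way a 32-bit Notepad process
// typically stores its edit buffer.
byte[] buffer = { 0x74, 0x00, 0x65, 0x00, 0x73, 0x00, 0x74, 0x00 };
// Interpret two bytes at offset 0 as a 16-bit value instead of decoding
// the whole dump as UTF-8 text.
short firstChar = BitConverter.ToInt16(buffer, 0); // 0x0074 == 't'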
BTW - Here's a useful HexEditor: http://www.hexworkshop.com/
I don't know what MemScan.FindString() does, but I guess the problem is that you are searching for a string within a string, rather than for a byte array within a byte array.
By transforming the memory contents with System.Text.Encoding.UTF8.GetString(bytes), you assume that everything stored in memory can be interpreted as valid UTF-8.
Your FindString() must accept parameters as byte[] rather than string, and you need to figure out how the process name is stored in memory (most likely UTF-16).
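A rough sketch of that idea, searching for the UTF-16 bytes of the target text inside the dumped buffer; the method name and signature here are my own, not from the question:
public static bool FindBytes(byte[] haystack, string target)
{
    // Assumes the target text is stored as UTF-16, as Windows typically does;
    // use Encoding.ASCII or Encoding.UTF8 instead to cover other layouts.
    byte[] needle = System.Text.Encoding.Unicode.GetBytes(target);
    for (int i = 0; i <= haystack.Length - needle.Length; i++)
    {
        int j = 0;
        while (j < needle.Length && haystack[i + j] == needle[j])
            j++;
        if (j == needle.Length)
            return true;
    }
    return false;
}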
For certain reasons, I have to create a 1024 kb .txt file.
Below is my current code:
int size = 1024000; // 1024 kb
byte[] bytearray = new byte[size];
foreach (byte bit in bytearray)
{
bit = 0;
}
string tobewritten = string.Empty;
foreach (byte bit in bytearray)
{
tobewritten += bit.ToString();
}
//newPath is local directory, where I store the created file
using (System.IO.StreamWriter sw = File.CreateText(newPath))
{
sw.WriteLine(tobewritten);
}
I have to wait at least 30 minutes to execute this piece of code, which I consider too long.
Now, I would like to ask for advice on how to actually achieve my mentioned objective effectively. Are there any alternatives to do this task? Am I writing bad code? Any help is appreciated.
There are several misunderstandings in the code you provided:
byte[] bytearray = new byte[size];
foreach (byte bit in bytearray)
{
bit = 0;
}
You seem to think that you are initializing each byte in your array bytearray with zero. Instead you would just set the loop variable bit (an unfortunate name) to zero, size times over. Actually this code wouldn't even compile, since you cannot assign to the foreach iteration variable.
Also, you don't need the initialization here in the first place: byte array elements are automatically initialized to 0.
string tobewritten = string.Empty;
foreach (byte bit in bytearray)
{
tobewritten += bit.ToString();
}
You want to append the string representation of each byte in your array to the string variable tobewritten. Since strings are immutable, each concatenation creates a new string that has to be garbage collected along with the string you created for bit. This is relatively expensive, especially when you create 2,048,000 of them - use a StringBuilder instead.
Lastly, none of that is needed anyway - it seems you just want to write a bunch of "0" characters to a text file. If you are not worried about creating a single large string of zeros (whether this makes sense depends on the value of size), you can create the string directly and write it in one go - or alternatively write a smaller string to the stream several times.
using (var file = File.CreateText(newPath))
{
file.WriteLine(new string('0', size));
}
Replace the string with a pre-sized StringBuilder to avoid unnecessary allocations.
Or, better yet, write each piece directly to the StreamWriter instead of pointlessly building a multi-megabyte in-memory string first.
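A rough sketch of that chunked approach (the chunk size is an arbitrary choice; newPath is the path variable from the question):
const int size = 1024000;   // 1024 kb worth of '0' characters
const int chunkSize = 8192; // assumed chunk size
using (var file = File.CreateText(newPath))
{
    var chunk = new string('0', chunkSize);
    int remaining = size;
    while (remaining > 0)
    {
        int n = Math.Min(chunkSize, remaining);
        file.Write(n == chunkSize ? chunk : new string('0', n));
        remaining -= n;
    }
}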
I want to compare two files in C# and see if they are different. They have the same file names and they are exactly the same size even when their contents differ. I was just wondering if there is a fast way to do this without having to manually go in and read the file.
Thanks
Depending on how far you're looking to take it, you can take a look at Diff.NET
Here's a simple file comparison function:
// This method accepts two strings that represent the paths of the two
// files to compare. A return value of true indicates that the contents
// of the files are the same. A return value of false indicates that the
// files are not the same.
private bool FileCompare(string file1, string file2)
{
int file1byte;
int file2byte;
FileStream fs1;
FileStream fs2;
// Determine if the same file was referenced two times.
if (file1 == file2)
{
// Return true to indicate that the files are the same.
return true;
}
// Open the two files.
fs1 = new FileStream(file1, FileMode.Open, FileAccess.Read);
fs2 = new FileStream(file2, FileMode.Open, FileAccess.Read);
// Check the file sizes. If they are not the same, the files
// are not the same.
if (fs1.Length != fs2.Length)
{
// Close the file
fs1.Close();
fs2.Close();
// Return false to indicate files are different
return false;
}
// Read and compare a byte from each file until either a
// non-matching set of bytes is found or until the end of
// file1 is reached.
do
{
// Read one byte from each file.
file1byte = fs1.ReadByte();
file2byte = fs2.ReadByte();
}
while ((file1byte == file2byte) && (file1byte != -1));
// Close the files.
fs1.Close();
fs2.Close();
// Return the success of the comparison. "file1byte" is
// equal to "file2byte" at this point only if the files are
// the same.
return ((file1byte - file2byte) == 0);
}
I was just wondering if there is a fast way to do this without having to manually go in and read the file.
Not really.
If the files came with hashes, you could compare the hashes, and if they are different you can conclude the files are different (matching hashes, however, do not prove the files are the same, so you would still have to do a byte-by-byte comparison).
However, hashes use all the bytes in the file, so no matter what, you at some point have to read the files byte for byte. And in fact, just a straight byte by byte comparison will be faster than computing a hash. This is because a hash reads all the bytes just like comparing byte-by-byte does, but hashes do some other computations that add time. Additionally, a byte-by-byte comparison can terminate early on the first pair of non-equal bytes.
Finally, you cannot avoid the need for a byte-by-byte read: even if the hashes are equal, that does not prove the files are equal, so in that case you still have to compare them byte by byte.
Well, I'm not sure whether you can rely on the files' write timestamps. If not, your only alternative is comparing the content of the files.
A simple approach is comparing the files byte by byte, but if you're going to compare a file against others several times, you can calculate a hash of each file and compare the hashes.
The following code snippet shows how you can do it:
public static string CalcHashCode(string filename)
{
FileStream stream = new FileStream(
filename,
System.IO.FileMode.Open,
System.IO.FileAccess.Read,
System.IO.FileShare.ReadWrite);
try
{
return CalcHashCode(stream);
}
finally
{
stream.Close();
}
}
public static string CalcHashCode(FileStream file)
{
MD5CryptoServiceProvider md5Provider = new MD5CryptoServiceProvider();
Byte[] hash = md5Provider.ComputeHash(file);
return Convert.ToBase64String(hash);
}
If you're going to compare a file with others more than once, you can save the file's hash and compare against it. For a single comparison, the byte-by-byte comparison is better. You also need to recompute the hash whenever the file changes, but if you're going to do massive numbers of comparisons, I recommend the hash approach.
If the filenames are the same, and the file sizes are the same, then, no, there is no way to know if they have different content without examining the content.
Read the file into a stream, then hash the stream. That should give you a reliable result for comparing.
byte[] fileHash1, fileHash2;
using (SHA256Managed sha = new SHA256Managed())
{
fileHash1 = sha.ComputeHash(streamforfile1);
fileHash2 = sha.ComputeHash(streamforfile2);
}
for (int i = 0; (i < fileHash1.Length) && (i < fileHash2.Length); i++)
{
if (fileHash1[i] != fileHash2[i])
{
//files are not the same
break;
}
}
If they are not compiled files then use a diff tool like KDiff3 or WinMerge. It will highlight where they are different.
http://kdiff3.sourceforge.net/
http://winmerge.org/
Pass each file stream through an MD5 hasher and compare the hashes.
I have a question about the safety of a cast from long to int. I fear that the method I wrote might fail at this cast. Can you please take a look at the code below and tell me if it is possible to write something that would avoid a possible fail?
Thank you in advance.
public static string ReadDecrypted(string fileFullPath)
{
string result = string.Empty;
using (FileStream fs = new FileStream(fileFullPath, FileMode.Open, FileAccess.Read))
{
int fsLength = (int)fs.Length;
byte[] decrypted;
byte[] read = new byte[fsLength];
if (fs.CanRead)
{
fs.Read(read, 0, fsLength);
decrypted = ProtectedData.Unprotect(read, CreateEntropy(), DataProtectionScope.CurrentUser);
result = Utils.AppDefaultEncoding.GetString(decrypted, 0, decrypted.Length);
}
}
return result;
}
The short answer is: yes, this way you will have problems with any file whose length is >= 2 GB!
If you don't expect any files that big, then you can insert this check directly at the start of the using block:
if (((int)fs.Length) != fs.Length) throw new Exception ("too big");
Otherwise you should NOT cast to int. Instead, change byte[] read = new byte[fsLength]; to byte[] read = new byte[fs.Length]; and use a loop to read the file content in "chunks" of at most 2 GB each.
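A rough sketch of such a chunked read loop (the buffer size is an arbitrary assumption; note that ProtectedData.Unprotect itself still needs the complete byte sequence, so the chunks would have to be accumulated or processed incrementally):
using (FileStream fs = new FileStream(fileFullPath, FileMode.Open, FileAccess.Read))
{
    byte[] buffer = new byte[81920]; // arbitrary chunk size
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        // process 'bytesRead' bytes from 'buffer' here
    }
}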
Another alternative (available in .NET 4) is to use MemoryMappedFile (see http://msdn.microsoft.com/en-us/library/dd997372.aspx) - this way you don't need to call Read at all :-)
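A minimal sketch of the memory-mapped approach (everything beyond the MemoryMappedFile API itself is an assumption):
using System.IO.MemoryMappedFiles;

using (var mmf = MemoryMappedFile.CreateFromFile(fileFullPath, FileMode.Open))
using (var view = mmf.CreateViewStream())
{
    // 'view' can be read like any other stream, without allocating a
    // single byte[] for the entire file up front.
}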
Well, int is 32-bit and long is 64-bit, so there's always the possibility of losing some data with the cast if you're opening up 2GB files; on the other hand, that allocation of a byte array of fsLength would seem to indicate you're not expecting files that big. Put a check in to make sure that fs.Length isn't greater than 2,147,483,647, and you should be fine.