How to compare files using Byte Array and Hash - C#

Background
I am converting media files to a new format and need a way of knowing whether I've already converted a given file during the current run.
My solution
To hash each file and store the hash in an array. Each time I go to convert a file I hash it and check the hash against the hashes stored in the array.
Problem
My logic doesn't seem able to detect when I've already seen a file and I end up converting the same file multiple times.
Code
//Byte array of already processed files
private static readonly List<byte[]> Bytelist = new List<byte[]>();
public static bool DoCheck(string file)
{
FileInfo info = new FileInfo(file);
while (FrmMain.IsFileLocked(info)) //Make sure file is finished being copied/moved
{
Thread.Sleep(500);
}
//Get byte sig of file and if seen before dont process
byte[] myFileData = File.ReadAllBytes(file);
byte[] myHash = MD5.Create().ComputeHash(myFileData);
if (Bytelist.Count != 0)
{
foreach (var item in Bytelist)
{
//If seen before ignore
if (myHash == item)
{
return true;
}
}
}
Bytelist.Add(myHash);
return false;
}
Question
Is there a more efficient way of trying to achieve my end goal? What am I doing wrong?

There are multiple questions; I'm going to answer the first one:
Is there a more efficient way of trying to achieve my end goal?
TL;DR yes.
You're computing and comparing a hash for every file, which is a relatively expensive operation. You can do cheaper checks before calculating the hash:
Is the file size the same? If it differs, the files are different and you can stop; otherwise move on to the next check.
Are the first bunch of bytes the same? If they differ, you can stop; otherwise move on to the next check.
Only at this point do you have to compare the hashes (MD5).
Of course you will have to store size/first X bytes/hash for each processed file.
In addition, an identical MD5 doesn't strictly guarantee that the files are the same, so you could take an extra step and compare the bytes directly. That may be overkill, though; it depends on how heavy the cost of reprocessing a file is, and it might matter more to avoid calculating expensive hashes. A rough sketch of the layered check follows.
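A minimal sketch of that idea, assuming a made-up FileSignature type and ReadPrefix helper (you'd also need using directives for System.Linq and System.Security.Cryptography on top of what the original code already uses):
class FileSignature
{
    public long Size;
    public byte[] Prefix;   // first few bytes of the file
    public byte[] Hash;     // MD5 of the full contents
}

private static readonly List<FileSignature> Seen = new List<FileSignature>();

public static bool SeenBefore(string file)
{
    long size = new FileInfo(file).Length;
    byte[] prefix = ReadPrefix(file, 64);
    byte[] hash = null; // computed only if the cheap checks pass

    foreach (var sig in Seen)
    {
        if (sig.Size != size) continue;                   // check 1: size
        if (!sig.Prefix.SequenceEqual(prefix)) continue;  // check 2: first bytes
        if (hash == null)
            hash = MD5.Create().ComputeHash(File.ReadAllBytes(file));
        if (sig.Hash.SequenceEqual(hash)) return true;    // check 3: full hash
    }

    if (hash == null)
        hash = MD5.Create().ComputeHash(File.ReadAllBytes(file));
    Seen.Add(new FileSignature { Size = size, Prefix = prefix, Hash = hash });
    return false;
}

private static byte[] ReadPrefix(string file, int count)
{
    using (var fs = File.OpenRead(file))
    {
        var buffer = new byte[count];
        int read = fs.Read(buffer, 0, count);
        Array.Resize(ref buffer, read); // shorter files yield a shorter prefix
        return buffer;
    }
}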
EDIT: As for the second question: your check fails because you are comparing the references of two byte arrays, which will never be equal since you create a new array every time. You need a sequence-equality comparison between the byte[] values (or convert the hash to a string and compare strings instead):
var exists = Bytelist.Any(hash => hash.SequenceEqual(myHash));

Are you sure this new file format doesn't add extra metadata into the content, like a last-modified timestamp or other attributes that change?
Also, if you are converting to a known format, there should be a way to tell from a file signature whether a file is already in that format; if it is your own format, add some extra signature bytes so you can identify it.
Don't forget that if your app gets closed and opened again, your approach will reprocess all the files.
One last point regarding the code: I prefer not to store byte arrays at all, but if you must, it's better to use a HashSet instead of a List; it has a lookup time of O(1).
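On the restart point, here is a minimal sketch of persisting the seen hashes between runs; the cache file name "processed-hashes.txt" is just an assumption:
private static readonly string CachePath = "processed-hashes.txt"; // assumed app-local cache file
private static readonly HashSet<string> Hashes =
    File.Exists(CachePath)
        ? new HashSet<string>(File.ReadAllLines(CachePath))
        : new HashSet<string>();

public static bool AlreadyProcessed(string hash)
{
    return Hashes.Contains(hash);
}

public static void MarkProcessed(string hash)
{
    if (Hashes.Add(hash))                               // O(1) lookup and insert
        File.AppendAllLines(CachePath, new[] { hash }); // survives an app restart
}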

There's a lot of room for improvement with regard to efficiency, effectiveness and style, but this isn't CodeReview.SE, so I'll try to stick to the problem at hand:
You're checking whether two byte arrays are equivalent by using the == operator. But that will only perform reference equality testing - i.e. test whether the two variables point to the same instance, the very same array. That, of course, won't work here.
There are many ways to do it, starting with a simple foreach loop over the arrays (probably with an optimization that checks the lengths first) or using Enumerable.SequenceEqual as you can find in this answer here.
Better yet, convert your hash's byte[] to a string (any string - Convert.ToBase64String would be a good choice) and store that in your Bytelist cache (which should be a HashSet, not a List). Strings are optimized for this sort of comparison, and you won't run into the "reference equality" problem here.
So a sample solution would be this:
private static readonly HashSet<string> _computedHashes = new HashSet<string>();
public static bool DoCheck(string file)
{
// ... file-lock wait and other setup from the original code ...
//Get byte sig of file and if seen before dont process
byte[] myFileData = File.ReadAllBytes(file);
byte[] myHash = MD5.Create().ComputeHash(myFileData);
string hashString = Convert.ToBase64String(myHash);
return _computedHashes.Contains(hashString);
}
Presumably, you'll add the hash to the _computedHashes set after you've done the conversion.
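For instance, the call site could look roughly like this, where ConvertFile is a hypothetical stand-in for whatever does the actual conversion:
string hashString = Convert.ToBase64String(MD5.Create().ComputeHash(File.ReadAllBytes(file)));
if (!_computedHashes.Contains(hashString))
{
    ConvertFile(file);               // hypothetical conversion routine
    _computedHashes.Add(hashString); // only remember files that actually got converted
}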

You have to compare the byte arrays item by item:
foreach (var item in Bytelist)
{
//If seen before ignore
if (myHash.Length == item.Length)
{
bool isequal = true;
for (int i = 0; i < myHash.Length; i++)
{
if (myHash[i] != item[i])
{
isequal = false;
}
}
if (isequal)
{
return true;
}
}
}

Related

How to check if a byte array ends with carriage return

I want to know whether my byte array ends with a carriage return and, if not, I want to add one.
That's what I have tried:
byte[] fileContent = File.ReadAllBytes(openFileDialog.FileName);
byte[] endCharacter = fileContent.Skip(fileContent.Length - 2).Take(2).ToArray();
if (!(endCharacter.Equals(Encoding.ASCII.GetBytes(Environment.NewLine))))
{
fileContent = fileContent.Concat(Encoding.ASCII.GetBytes(Environment.NewLine)).ToArray();
}
But I don't get it... Is this the right approach? If so, what's wrong with Equals? Even if my byte array ends with {10,13}, the if statement never detects it.
In this case, Equals checks for reference equality; while endCharacter and Encoding.ASCII.GetBytes(Environment.NewLine) may have the same contents, they are not the same array, so Equals returns false.
You're interested in value equality, so you should instead individually compare the values at each position in the arrays:
byte[] newLine = Encoding.ASCII.GetBytes(Environment.NewLine);
if (endCharacter[0] != newLine[0] || endCharacter[1] != newLine[1])
{
// ...
}
In general, if you want to compare arrays for value equality, you could use something like this method, provided by Marc Gravell.
However, a much more efficient solution to your problem would be to convert the last two bytes of your file into ASCII and do a string comparison (since System.String already overloads == to check for value equality):
string endCharacter = Encoding.ASCII.GetString(fileContent, fileContent.Length - 2, 2);
if (endCharacter != Environment.NewLine)
{
// the file doesn't end with a newline, so append one here
}
You may also need to be careful about reading the entire file into memory if it's likely to be large. If you don't need the full contents of the file, you could do this more efficiently by just reading in the final two bytes, inspecting them, and appending directly to the file as necessary. This can be achieved by opening a System.IO.FileStream for the file (through System.IO.File.Open).
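A rough sketch of that approach, where path stands in for openFileDialog.FileName and SequenceEqual needs a using System.Linq; directive:
byte[] newline = Encoding.ASCII.GetBytes(Environment.NewLine);
byte[] tail = new byte[newline.Length];
using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.ReadWrite))
{
    if (fs.Length >= tail.Length)
    {
        fs.Seek(-tail.Length, SeekOrigin.End); // read only the last two bytes
        fs.Read(tail, 0, tail.Length);
    }
    if (!tail.SequenceEqual(newline))
    {
        fs.Seek(0, SeekOrigin.End);            // append the missing newline in place
        fs.Write(newline, 0, newline.Length);
    }
}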
I found the solution: I must use SequenceEqual (http://www.dotnetperls.com/sequenceequal) in place of Equals. Thanks to everyone!
byte[] fileContent = File.ReadAllBytes(openFileDialog.FileName);
byte[] endCharacter = fileContent.Skip(fileContent.Length - 2).Take(2).ToArray();
if (!(endCharacter.SequenceEqual(Encoding.ASCII.GetBytes(Environment.NewLine))))
{
fileContent = fileContent.Concat(Encoding.ASCII.GetBytes(Environment.NewLine)).ToArray();
File.AppendAllText(openFileDialog.FileName, Environment.NewLine);
}

Search for string in multiple text files of size 150 MB each C#

I have multiple .txt files of 150MB size each. Using C# I need to retrieve all the lines containing the string pattern from each file and then write those lines to a newly created file.
I already looked into similar questions but none of their suggested answers could give me the fastest way of fetching results. I tried regular expressions, linq query, contains method, searching with byte arrays but all of them are taking more than 30 minutes to read and compare the file content.
My test files don't have any specific format; it's raw data which we can't split on a delimiter and filter with DataViews. Below is the sample format of each line in those files.
Sample.txt
LTYY;;0,0,;123456789;;;;;;;20121002 02:00;;
ptgh;;0,0,;123456789;;;;;;;20121002 02:00;;
HYTF;;0,0,;846234863;;;;;;;20121002 02:00;;
Multiple records......
My Code
using (StreamWriter SW = new StreamWriter(newFile))
{
using(StreamReader sr = new StreamReader(sourceFilePath))
{
while (sr.Peek() >= 0)
{
if (sr.ReadLine().Contains(stringToSearch))
SW.WriteLine(sr.ReadLine().ToString());
}
}
}
I want a sample code which would take less than a minute to search for 123456789 from the Sample.txt. Let me know if my requirement is not clear. Thanks in advance!
Edit
I found the root cause: the files reside on a remote server, and that is what makes reading them slow. When I copied the files onto my local machine, all the comparison methods completed quickly, so this isn't an issue with how we read or compare the content; the methods all took roughly the same time.
But now how do I address this issue? I can't copy all those files to my machine for the comparison, and I get OutOfMemory exceptions when I try.
The fastest way to search is the Boyer–Moore string search algorithm, as it does not need to examine every byte of the file, but it does require random access to the bytes. You could also try the Rabin–Karp algorithm.
Or you can try doing something like the following code, from this answer:
public static int FindInFile(string fileName, string value)
{ // returns complement of number of characters in file if not found
// else returns index where value found
int index = 0;
using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName))
{
if (String.IsNullOrEmpty(value))
return 0;
StringSearch valueSearch = new StringSearch(value);
int readChar;
while ((readChar = reader.Read()) >= 0)
{
++index;
if (valueSearch.Found(readChar))
return index - value.Length;
}
}
return ~index;
}
public class StringSearch
{ // Call Found one character at a time until string found
private readonly string value;
private readonly List<int> indexList = new List<int>();
public StringSearch(string value)
{
this.value = value;
}
public bool Found(int nextChar)
{
for (int index = 0; index < indexList.Count; )
{
int valueIndex = indexList[index];
if (value[valueIndex] == nextChar)
{
++valueIndex;
if (valueIndex == value.Length)
{
indexList[index] = indexList[indexList.Count - 1];
indexList.RemoveAt(indexList.Count - 1);
return true;
}
else
{
indexList[index] = valueIndex;
++index;
}
}
else
{ // next char does not match
indexList[index] = indexList[indexList.Count - 1];
indexList.RemoveAt(indexList.Count - 1);
}
}
if (value[0] == nextChar)
{
if (value.Length == 1)
return true;
indexList.Add(1);
}
return false;
}
public void Reset()
{
indexList.Clear();
}
}
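For example, a quick way to call it might look like this (the path is just a placeholder):
int index = FindInFile(@"C:\data\Sample.txt", "123456789");
if (index >= 0)
    Console.WriteLine("Found at character offset " + index);
else
    Console.WriteLine("Not found; the file has " + (~index) + " characters");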
I don't know how long this will take to run, but here are some improvements:
using (StreamWriter SW = new StreamWriter(newFile))
{
using (StreamReader sr = new StreamReader(sourceFilePath))
{
while (!sr.EndOfStream)
{
var line = sr.ReadLine();
if (line.Contains(stringToSearch))
SW.WriteLine(line);
}
}
}
Note that you don't need Peek, EndOfStream will give you what you want. You were calling ReadLine twice (probably not what you had intended). And there's no need to call ToString() on a string.
As I said already, you should have a database, but whatever.
The fastest, shortest and nicest way to do it (even one-lined) is this:
File.AppendAllLines("b.txt", File.ReadLines("a.txt")
.Where(x => x.Contains("123456789")));
But fast? 150MB is 150MB. It's gonna take a while.
You can replace the Contains method with your own, for faster comparison, but that's a whole different question.
Other possible solution...
var sb = new StringBuilder();
foreach (var x in File.ReadLines("a.txt").Where(x => x.Contains("123456789")))
{
sb.AppendLine(x);
}
File.WriteAllText("b.txt", sb.ToString()); // That is one heavy operation there...
Testing it with a file size 150MB, and it found all results within 3 seconds. The thing that takes time is writing the results into the 2nd file (in case there are many results).
150MB is 150MB. If you have one thread going through the entire 150MB, line by line (a "line" being terminated by a newline character/group or by an EOF), your process must read in and spin through all 150MB of the data (not all at once, and it doesn't have to hold all of it at the same time). A linear search through 157,286,400 characters is, very simply, going to take time, and you say you have many such files.
First thing; you're reading the line out of the stream twice. This will, in most cases, actually cause you to read two lines whenever there's a match; what's written to the new file will be the line AFTER the one containing the search string. This is probably not what you want (then again, it may be). If you want to write the line actually containing the search string, read it into a variable before performing the Contains check.
Second, String.Contains() will, by necessity, perform a linear search. In your case, the behavior will actually approach N^2, because when searching for a string within a string, the first character must be found, and where it is, each character is then matched one by one to subsequent characters until all characters in the search string have matched or a non-matching character is found; when a non-match occurs, the algorithm must go back to the character after the initial match to avoid skipping a possible match, meaning it can test the same character many times when checking for a long string against a longer one with many partial matches. This strategy is therefore technically a "brute force" solution. Unfortunately, when you don't know where to look (such as in unsorted data files), there is no more efficient solution.
The only possible speedup I could suggest, other than being able to sort the files' data and then perform an indexed search, is to multithread the solution; if you're only running this method on one thread that looks through every file, not only is only one thread doing the job, but that thread is constantly waiting for the hard drive to serve up the data it needs. Having 5 or 10 threads each working through one file at a time will not only leverage the true power of modern multi-core CPUs more efficiently, but while one thread is waiting on the hard drive, another thread whose data has been loaded can execute, further increasing the efficiency of this approach. Remember, the further away the data is from the CPU, the longer it takes for the CPU to get it, and when your CPU can do between 2 and 4 billion things per second, having to wait even a few milliseconds for the hard drive means you're losing out on millions of potential instructions per second.
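As a sketch of that idea, something along these lines runs the per-file searches concurrently; the source folder, the thread cap of 5 and the per-file output names are all assumptions:
var files = Directory.EnumerateFiles(@"\\server\share\logs", "*.txt"); // assumed remote source folder
Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 5 }, file =>
{
    // each worker reads one file and writes its own result file,
    // so the workers never contend for a single writer
    var matches = File.ReadLines(file).Where(line => line.Contains("123456789"));
    File.WriteAllLines(file + ".matches.txt", matches);
});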
I'm not giving you sample code, but have you tried sorting the content of your files?
Trying to search for a string through 150MB worth of files is going to take some time any way you slice it, and if regex takes too long for you, then I'd suggest sorting the content of your files so that you know roughly where "123456789" will occur before you actually search; that way you won't have to search the unimportant parts.
Do not read and write at the same time. Search first, save the matching lines in a list, and write them to the file at the end.
using System;
using System.Collections.Generic;
using System.IO;
...
List<string> list = new List<string>();
using (StreamReader reader = new StreamReader("input.txt")) {
string line;
while ((line = reader.ReadLine()) != null) {
if (line.Contains(stringToSearch)) {
list.Add(line); // Add to list.
}
}
}
using (StreamWriter writer = new StreamWriter("output.txt")) {
foreach (string line in list) {
writer.WriteLine(line);
}
}
You're going to run into performance problems with any approach that blocks on input from these files while doing string comparisons.
But Windows has a pretty high-performance grep-like tool for string searches of text files called FINDSTR that might be fast enough. You could simply call it as a shell command or redirect the results of the command to your output file.
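For instance, a rough sketch of shelling out to FINDSTR and capturing its output; the paths and the search string are placeholders, and /C: makes FINDSTR treat the pattern as a literal string rather than a regex:
var psi = new ProcessStartInfo("findstr", @"/C:""123456789"" C:\data\*.txt")
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
    CreateNoWindow = true
};
using (var proc = Process.Start(psi))
{
    // read the output before waiting, to avoid a full-pipe deadlock;
    // each match comes back prefixed with the file it was found in
    string matches = proc.StandardOutput.ReadToEnd();
    proc.WaitForExit();
    File.WriteAllText(@"C:\data\results.txt", matches);
}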
Either preprocessing (sort) or loading your large files into a database will be faster, but I'm assuming that you already have existing files you need to search.

How to create big sized .txt file?

For certain reasons, I have to create a 1024 kb .txt file.
Below is my current code:
int size = 1024000; //1024 kb..
byte[] bytearray = new byte[size];
foreach (byte bit in bytearray)
{
bit = 0;
}
string tobewritten = string.Empty;
foreach (byte bit in bytearray)
{
tobewritten += bit.ToString();
}
//newPath is local directory, where I store the created file
using (System.IO.StreamWriter sw = File.CreateText(newPath))
{
sw.WriteLine(tobewritten);
}
I have to wait at least 30 minutes to execute this piece of code, which I consider too long.
Now, I would like to ask for advice on how to actually achieve my mentioned objective effectively. Are there any alternatives to do this task? Am I writing bad code? Any help is appreciated.
There are several misunderstandings in the code you provided:
byte[] bytearray = new byte[size];
foreach (byte bit in bytearray)
{
bit = 0;
}
You seem to think that you are initializing each byte in your array bytearray to zero. Instead, you would just be setting the loop variable bit (an unfortunate name) to zero, size times. Actually this code wouldn't even compile, since you cannot assign to the foreach iteration variable.
Also you didn't need initialization here in the first place: byte array elements are automatically initialized to 0.
string tobewritten = string.Empty;
foreach (byte bit in bytearray)
{
tobewritten += bit.ToString();
}
You want to append the string representation of each byte in your array to the string variable tobewritten. Since strings are immutable, you create a new string for each element that has to be garbage collected, along with the string you created for bit. This is relatively expensive, especially when you create 2,048,000 of them - use a StringBuilder instead.
Lastly, none of that is needed anyway - it seems you just want to write a bunch of "0" characters to a text file. If you are not worried about creating a single large string of zeros (whether this makes sense depends on the value of size), you can just create the string directly and do it in one go - or alternatively write a smaller string to the stream a number of times.
using (var file = File.CreateText(newpath))
{
file.WriteLine(new string('0', size));
}
Replace the string with a pre-sized StringBuilder to avoid unnecessary allocations.
Or, better yet, write each piece directly to the StreamWriter instead of needlessly building the whole file contents as one big in-memory string first.
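A minimal sketch of that variant, reusing the newPath variable from the question:
const int size = 1024000;            // target character count from the question
char[] chunk = new char[4096];
for (int i = 0; i < chunk.Length; i++)
    chunk[i] = '0';
using (StreamWriter sw = File.CreateText(newPath))
{
    int remaining = size;
    while (remaining > 0)
    {
        int count = Math.Min(chunk.Length, remaining);
        sw.Write(chunk, 0, count);   // no large intermediate string is ever built
        remaining -= count;
    }
    sw.WriteLine();                  // match the trailing newline from the original WriteLine call
}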

Comparing two files in C# [duplicate]

This question already has answers here:
How to compare 2 files fast using .NET?
(20 answers)
Closed 7 years ago.
I want to compare two files in C# and see if they are different. They have the same file names, and they are exactly the same size even when their contents differ. I was just wondering if there is a fast way to do this without having to manually go in and read the file.
Thanks
Depending on how far you're looking to take it, you can take a look at Diff.NET
Here's a simple file comparison function:
// This method accepts two strings that represent two files to
// compare. A return value of true indicates that the contents of the files
// are the same. A return value of false indicates that the
// files are not the same.
private bool FileCompare(string file1, string file2)
{
int file1byte;
int file2byte;
FileStream fs1;
FileStream fs2;
// Determine if the same file was referenced two times.
if (file1 == file2)
{
// Return true to indicate that the files are the same.
return true;
}
// Open the two files.
fs1 = new FileStream(file1, FileMode.Open, FileAccess.Read);
fs2 = new FileStream(file2, FileMode.Open, FileAccess.Read);
// Check the file sizes. If they are not the same, the files
// are not the same.
if (fs1.Length != fs2.Length)
{
// Close the file
fs1.Close();
fs2.Close();
// Return false to indicate files are different
return false;
}
// Read and compare a byte from each file until either a
// non-matching set of bytes is found or until the end of
// file1 is reached.
do
{
// Read one byte from each file.
file1byte = fs1.ReadByte();
file2byte = fs2.ReadByte();
}
while ((file1byte == file2byte) && (file1byte != -1));
// Close the files.
fs1.Close();
fs2.Close();
// Return the success of the comparison. "file1byte" is
// equal to "file2byte" at this point only if the files are
// the same.
return ((file1byte - file2byte) == 0);
}
I was just wondering if there is a fast way to do this without having to manually go in and read the file.
Not really.
If the files came with hashes, you could compare the hashes, and if they are different you can conclude the files are different (same hashes, however, does not mean the files are the same and so you will still have to do a byte by byte comparison).
However, hashes use all the bytes in the file, so no matter what, you at some point have to read the files byte for byte. And in fact, just a straight byte by byte comparison will be faster than computing a hash. This is because a hash reads all the bytes just like comparing byte-by-byte does, but hashes do some other computations that add time. Additionally, a byte-by-byte comparison can terminate early on the first pair of non-equal bytes.
Finally, you can not avoid the need for a byte-by-byte read. If the hashes are equal, that doesn't mean the files are equal. In this case you still have to compare byte-by-byte.
Well, I'm not sure whether you can rely on the files' write timestamps. If not, your only alternative is comparing the contents of the files.
A simple approach is comparing the files byte by byte, but if you're going to compare a file several times against others, you can calculate a hash of each file and compare those.
The following code snippet shows how you can do it:
public static string CalcHashCode(string filename)
{
FileStream stream = new FileStream(
filename,
System.IO.FileMode.Open,
System.IO.FileAccess.Read,
System.IO.FileShare.ReadWrite);
try
{
return CalcHashCode(stream);
}
finally
{
stream.Close();
}
}
public static string CalcHashCode(FileStream file)
{
MD5CryptoServiceProvider md5Provider = new MD5CryptoServiceProvider();
Byte[] hash = md5Provider.ComputeHash(file);
return Convert.ToBase64String(hash);
}
If you're going to compare a file with others more than once, you can save the file's hash and compare that. For a single comparison, the byte-by-byte comparison is better. You also need to recompute the hash whenever the file changes, but if you're going to do massive numbers of comparisons, I recommend using the hash approach.
If the filenames are the same, and the file sizes are the same, then, no, there is no way to know if they have different content without examining the content.
Read the file into a stream, then hash the stream. That should give you a reliable result for comparing.
byte[] fileHash1, fileHash2;
using (SHA256Managed sha = new SHA256Managed())
{
fileHash1 = sha.ComputeHash(streamforfile1);
fileHash2 = sha.ComputeHash(streamforfile2);
}
for (int i = 0; (i < fileHash1.Length) && (i < fileHash2.Length); i++)
{
if (fileHash1[i] != fileHash2[i])
{
//files are not the same
break;
}
}
If they are not compiled files then use a diff tool like KDiff3 or WinMerge. It will highlight where they are different.
http://kdiff3.sourceforge.net/
http://winmerge.org/
Pass each file stream through an MD5 hasher and compare the hashes.
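A minimal sketch of that, assuming MD5 is acceptable for your use case; equal hashes are overwhelmingly likely, but not strictly guaranteed, to mean equal files:
static bool FilesProbablyEqual(string file1, string file2)
{
    using (var md5 = MD5.Create())
    {
        byte[] hash1, hash2;
        using (var stream1 = File.OpenRead(file1)) hash1 = md5.ComputeHash(stream1);
        using (var stream2 = File.OpenRead(file2)) hash2 = md5.ComputeHash(stream2);
        return hash1.SequenceEqual(hash2); // SequenceEqual needs System.Linq
    }
}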

Alternate UniqueId generation technique

In the application, when special types of objects are created, I need to generate a unique id for each of them. The objects are created through a factory and are quite likely to be created in a 'bulk' operation. I realize that Random from the framework is not so 'random' after all, so I tried appending a timestamp as follows:
private string GenerateUniqueId()
{
Random randomValue = new Random();
return DateTime.Now.Ticks.ToString() + randomValue.Next().ToString();
}
Unfortunately, even this does not work. For objects that are created in rapid succession, I generate the same Unique Id :-(
Currently, I am implementing it in a crude way as follows:
private string GenerateUniqueId()
{
Random randomValue = new Random();
int value = randomValue.Next();
Debug.WriteLine(value.ToString());
Thread.Sleep(100);
return DateTime.Now.Ticks.ToString() + value.ToString();
}
Since this is not a very large application, I think a simple and quick technique would suffice instead of implementing an elaborate algorithm.
Please suggest.
A GUID is probably what you're looking for:
private string GenerateUniqueId()
{
return Guid.NewGuid().ToString("N");
}
If you want a smaller, more manageable ID then you could use something like this:
private string GenerateUniqueId()
{
using (var rng = new RNGCryptoServiceProvider())
{
// change the size of the array depending on your requirements
var rndBytes = new byte[8];
rng.GetBytes(rndBytes);
return BitConverter.ToString(rndBytes).Replace("-", "");
}
}
Note: This will only give you a 64-bit number in comparison to the GUID's 128 bits, so there'll be more chance of a collision. Probably not an issue in the real world though. If it is an issue then you could increase the size of the byte array to generate a larger id.
Assuming you do not want a GUID, the first option would be a static field and Interlocked:
private static long lastId = 0;
private static long GetNextId() {
return Interlocked.Increment(ref lastId);
}
If you want something based on time ticks, remember the last value and if the same manually increment and save; otherwise just save:
private static long lastTick = 0;
private static object idGenLock = new Object();
private static long GetNextId() {
lock (idGenLock) {
long tick = DateTime.UtcNow.Ticks;
if (tick <= lastTick) { // clock hasn't advanced (or went backwards): bump past the last value
tick = lastTick + 1;
}
lastTick = tick;
return tick;
}
}
(Neither of these approaches will be good with multiple processes.)
In your comments, Codex, you say you use the unique ID as a file name. There is a specific function for generating cryptographically secure random file names, Path.GetRandomFileName().
As the names are cryptographically strong random values, they are effectively unique even in batch operations. The format is a little ugly though, as they're optimised for file names, but it may work for other references as well.
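For example (sketch; the Replace just strips the dot if you want a plain token rather than an actual file name):
string id = Path.GetRandomFileName().Replace(".", ""); // e.g. something like "w1kx3yqpdk3"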
Why can't your factory (which is presumably single-threaded) generate sequential unique integers? If you expected Random() to work, why not Guid() (or whatever is equivalent)?
If you're going to resort to coding your own UUID-generator, make sure you salt the generator.
I suggest you check out the open source package ossp-uuid, which is an ISO-C API and CLI for generating Universally Unique Identifiers.
