Comparing two files in C# [duplicate] - c#

This question already has answers here:
How to compare 2 files fast using .NET?
(20 answers)
Closed 7 years ago.
I want to compare two files in C# and see if they are different. They have the same file names, and they are exactly the same size even when they differ. I was just wondering if there is a fast way to do this without having to manually go in and read the file.
Thanks

Depending on how far you're looking to take it, you can take a look at Diff.NET
Here's a simple file comparison function:
// This method accepts two strings that represent the paths of two files
// to compare. It returns true if the contents of the files are the same,
// and false otherwise.
private bool FileCompare(string file1, string file2)
{
    int file1byte;
    int file2byte;
    FileStream fs1;
    FileStream fs2;

    // Determine if the same file was referenced two times.
    if (file1 == file2)
    {
        // Return true to indicate that the files are the same.
        return true;
    }

    // Open the two files.
    fs1 = new FileStream(file1, FileMode.Open, FileAccess.Read);
    fs2 = new FileStream(file2, FileMode.Open, FileAccess.Read);

    // Check the file sizes. If they are not the same, the files
    // are not the same.
    if (fs1.Length != fs2.Length)
    {
        // Close the files.
        fs1.Close();
        fs2.Close();

        // Return false to indicate that the files are different.
        return false;
    }

    // Read and compare a byte from each file until either a
    // non-matching pair of bytes is found or the end of file1 is reached.
    do
    {
        // Read one byte from each file.
        file1byte = fs1.ReadByte();
        file2byte = fs2.ReadByte();
    }
    while ((file1byte == file2byte) && (file1byte != -1));

    // Close the files.
    fs1.Close();
    fs2.Close();

    // Return the result of the comparison. "file1byte" is equal to
    // "file2byte" at this point only if the files are the same.
    return ((file1byte - file2byte) == 0);
}

I was just wondering if there is a fast way to do this without having to manually go in and read the file.
Not really.
If the files came with precomputed hashes, you could compare those: if the hashes differ, you can conclude the files are different. Equal hashes, however, do not prove the files are equal, so to be certain you would still have to do a byte-by-byte comparison.
However, a hash is computed from every byte in the file, so no matter what, at some point you have to read the files byte for byte. In fact, a straight byte-by-byte comparison will be faster than computing a hash: it reads the same bytes but skips the extra hash computation, and it can terminate early on the first pair of non-equal bytes.
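If you do end up reading the files yourself, buffering the reads instead of calling ReadByte per byte makes a big practical difference. Here is a minimal sketch (not from the answers above; the helper name and buffer size are arbitrary choices), assuming local files:
// Minimal sketch: compare two files in fixed-size chunks and stop at the first difference.
private static bool FilesHaveSameContent(string path1, string path2)
{
    const int bufferSize = 4096;
    using (FileStream fs1 = File.OpenRead(path1))
    using (FileStream fs2 = File.OpenRead(path2))
    {
        if (fs1.Length != fs2.Length)
            return false;

        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];
        int read1;
        while ((read1 = fs1.Read(buffer1, 0, bufferSize)) > 0)
        {
            // For local files of equal length both reads return the same count;
            // a fully robust version would loop until read2 == read1.
            int read2 = fs2.Read(buffer2, 0, bufferSize);
            if (read1 != read2)
                return false;
            for (int i = 0; i < read1; i++)
            {
                if (buffer1[i] != buffer2[i])
                    return false; // stop at the first mismatch
            }
        }
        return true;
    }
}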

Well, I'm not sure whether you can rely on the files' last-write timestamps. If not, your only alternative is comparing the content of the files.
A simple approach is comparing the files byte by byte, but if you're going to compare a file several times against others, you can calculate a hash of each file and compare those.
The following code snippet shows how you can do it:
public static string CalcHashCode(string filename)
{
    FileStream stream = new FileStream(
        filename,
        System.IO.FileMode.Open,
        System.IO.FileAccess.Read,
        System.IO.FileShare.ReadWrite);
    try
    {
        return CalcHashCode(stream);
    }
    finally
    {
        stream.Close();
    }
}

public static string CalcHashCode(FileStream file)
{
    MD5CryptoServiceProvider md5Provider = new MD5CryptoServiceProvider();
    Byte[] hash = md5Provider.ComputeHash(file);
    return Convert.ToBase64String(hash);
}
If you're going to compare a file against others more than once, you can save the file's hash and compare that. For a single comparison, the byte-by-byte comparison is better. You also need to recompute the hash when the file changes, but if you're going to do massive comparisons (more than once), I recommend using the hash approach.
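To illustrate the "massive comparisons" case, here is a minimal sketch that caches hashes per path so repeated comparisons don't re-hash unchanged files. The cache and method names are my own; it reuses the CalcHashCode method above and assumes you invalidate entries when a file changes:
// Hypothetical cache mapping a file path to its last computed hash.
private static readonly Dictionary<string, string> hashCache = new Dictionary<string, string>();

public static bool ProbablySameContent(string file1, string file2)
{
    string hash1, hash2;
    if (!hashCache.TryGetValue(file1, out hash1))
        hashCache[file1] = hash1 = CalcHashCode(file1);
    if (!hashCache.TryGetValue(file2, out hash2))
        hashCache[file2] = hash2 = CalcHashCode(file2);

    // Equal MD5 hashes almost certainly mean equal content, but collisions are
    // theoretically possible; do a byte comparison if you need certainty.
    return hash1 == hash2;
}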

If the filenames are the same, and the file sizes are the same, then, no, there is no way to know if they have different content without examining the content.

Read the file into a stream, then hash the stream. That should give you a reliable result for comparing.
byte[] fileHash1, fileHash2;

using (SHA256Managed sha = new SHA256Managed())
{
    fileHash1 = sha.ComputeHash(streamforfile1);
    fileHash2 = sha.ComputeHash(streamforfile2);
}

for (int i = 0; (i < fileHash1.Length) && (i < fileHash2.Length); i++)
{
    if (fileHash1[i] != fileHash2[i])
    {
        // files are not the same
        break;
    }
}
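Incidentally, if all you need is a boolean result, Enumerable.SequenceEqual (from System.Linq) expresses the same hash comparison in one line:
// Equivalent check using LINQ instead of the manual loop:
bool filesAreIdentical = fileHash1.SequenceEqual(fileHash2);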

If they are not compiled files, then use a diff tool like KDiff3 or WinMerge. It will highlight where they are different.
http://kdiff3.sourceforge.net/
http://winmerge.org/

Pass each file stream through an MD5 hasher and compare the hashes.

Related

What is the best way to read and then update record in a binary file with c#

I'm trying to edit some records in a binary file, but I just can't seem to get the hang of it.
I can read the file, but then I can't find the position of the record that I want to edit, so I can't replace it.
This is my code so far:
public MyModel Put(MyModel exMyModel)
{
    List<MyModel> list = new List<MyModel>();
    try
    {
        IFormatter formatter = new BinaryFormatter();
        using (Stream stream = new FileStream(_exMyModel, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            while (stream.Position < stream.Length)
            {
                var obj = (MyModel)formatter.Deserialize(stream);
                list.Add(obj);
            }
            MyModel mymodel = list.FirstOrDefault(i => i.ID == exMyModel.ID);
            mymodel.FirstName = exMyModel.FirstName;
            mymodel.PhoneNumber = exMyModel.PhoneNumber;
            // Now I want to update the current record with this new object
            // ... code to update
        }
        return phoneBookEntry;
    }
    catch (Exception ex)
    {
        Console.WriteLine("The error is " + ex.Message);
        return null;
    }
}
I'm really stuck here guys. Any help would be appreciated.
I already checked these answers:
answer 1
answer 2
Thank you in advance :)
I would recommend just writing all the objects back to the stream. You could perhaps write just the changed object and each one after it, but I would not bother.
Start by resetting the stream: stream.Position = 0. You can then write a loop and serialize each object using formatter.Serialize(stream, object).
If this is a coding task I guess you have no choice in the matter. But you should know that BinaryFormatter has various problems. It more or less saves the objects the same way they are stored in memory. This is inefficient, insecure, and changes to the classes may prevent you from deserializing stored objects. The most common serialization format today is JSON, but there are also binary alternatives like protobuf-net.
How you update the file is going to rely pretty heavily on whether or not your records serialize as fixed length.
Variable-Length Records
Since you're using strings in the record, any change in string length (as serialized bytes), or any other change that affects the length of the serialized object, will make it impossible to do an in-place update of the record.
With that in mind you're going to have to do some extra work.
First, test the objects inside the read loop. Capture current position before you deserialize each object, test the object for equivalence, save the offset when you find the record you're looking for then deserialize the rest of the objects in the stream... or copy the rest of the stream to a MemoryStream instance for later.
Next, set stream.Position to the start position of the record you're updating and truncate the file there (stream.SetLength). Serialize the new copy of the record into the stream, then copy the MemoryStream that holds the rest of the records back into the stream... or capture and serialize the rest of the objects.
In other words (untested but showing the general structure):
public MyModel Put(MyModel exMyModel)
{
    try
    {
        IFormatter formatter = new BinaryFormatter();
        using (Stream stream = File.Open(_exMyModel, FileMode.Open, FileAccess.ReadWrite))
        using (var buffer = new MemoryStream())
        {
            long location = -1;
            while (stream.Position < stream.Length)
            {
                var position = stream.Position;
                var obj = (MyModel)formatter.Deserialize(stream);
                if (obj.ID == exMyModel.ID)
                {
                    location = position;
                    // Stash the rest of the file, then truncate it at the start of this record.
                    stream.CopyTo(buffer);
                    buffer.Position = 0;
                    stream.SetLength(position);
                    stream.Position = position;
                }
            }
            // Write the updated record where the old one started...
            formatter.Serialize(stream, exMyModel);
            // ...then put the remaining records back after it.
            if (location >= 0 && buffer.Length > 0)
            {
                buffer.CopyTo(stream);
            }
        }
        return exMyModel;
    }
    catch (Exception ex)
    {
        Console.WriteLine("The error is " + ex.Message);
        return null;
    }
}
Note that in general a MemoryStream holding the serialized data will be faster and take less memory than deserializing the records and then serializing them again.
Static-Length Records
This is unlikely, but in the case that your record type is annotated in such a way that it always serializes to the same number of bytes then you can skip everything to do with the MemoryStream and truncating the binary file. In this case just read records until you find the right one, rewind the stream to that position (after the read) and write a new copy of the record.
You'll have to examine the classes yourself to see what sort of serialization modifier attributes are on the string properties, and I'd suggest testing this extensively with different string values to ensure that you're actually getting the same data length for all of them. Adding or removing a single byte will screw up the remainder of the records in the file.
Edge Case - Same Length Strings
Since replacing a record with data that's the same length only requires an overwrite, not a rewrite of the file, you might get some use out of testing the record length before grabbing the rest of the file. If you get lucky and the modified record is the same length then just seek back to the right position and write the data in-place. That way if you have a file with a ton of records in it you'll get a much faster update whenever the length is the same.
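A rough sketch of that shortcut, assuming you serialize the replacement record into a temporary buffer first; location and endOfOldRecord here are hypothetical positions you would have captured while scanning the file, not variables from the code above:
// Sketch: only overwrite in place when the new record is exactly as long as the old one.
using (var temp = new MemoryStream())
{
    formatter.Serialize(temp, exMyModel);
    long oldLength = endOfOldRecord - location;   // hypothetical positions captured during the scan
    if (temp.Length == oldLength)
    {
        stream.Position = location;
        temp.Position = 0;
        temp.CopyTo(stream);   // in-place overwrite; the rest of the file is untouched
    }
    else
    {
        // Fall back to the truncate-and-rewrite approach shown earlier.
    }
}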
Changing Format...
You said that this is a coding task so you probably can't take this option, but if you can alter the storage format... let's just say that BinaryFormatter is definitely not your friend. There are much better ways to do it if you have the option. SQLite is my binary format of choice :)
Actually, since this appears to be a coding test you might want to make a point of that. Write the code they asked for, then if you have time write a better format that doesn't rely on BinaryFormatter, or throw SQLite at the problem. Using an ORM like LinqToDB makes SQLite trivial. Explain to them that the file format they're using is inherently unstable and should be replaced with something that is both stable, supported and efficient.

How to compare files using Byte Array and Hash

Background
I am converting media files to a new format and need a way of knowing whether I've already converted a file during the current run.
My solution
To hash each file and store the hash in an array. Each time I go to convert a file I hash it and check the hash against the hashes stored in the array.
Problem
My logic doesn't seem able to detect when I've already seen a file and I end up converting the same file multiple times.
Code
// Byte array of already processed files
private static readonly List<byte[]> Bytelist = new List<byte[]>();

public static bool DoCheck(string file)
{
    FileInfo info = new FileInfo(file);
    while (FrmMain.IsFileLocked(info)) // Make sure file is finished being copied/moved
    {
        Thread.Sleep(500);
    }

    // Get byte sig of file and if seen before dont process
    byte[] myFileData = File.ReadAllBytes(file);
    byte[] myHash = MD5.Create().ComputeHash(myFileData);
    if (Bytelist.Count != 0)
    {
        foreach (var item in Bytelist)
        {
            // If seen before ignore
            if (myHash == item)
            {
                return true;
            }
        }
    }
    Bytelist.Add(myHash);
    return false;
}
Question
Is there a more efficient way of trying to achieve my end goal? What am I doing wrong?
There are multiple questions here; I'm going to answer the first one:
Is there a more efficient way of trying to achieve my end goal?
TL;DR yes.
You're storing and comparing only the hashes of the files, and computing a hash is a really expensive operation. You can do cheaper checks before calculating the hash:
Is the file size the same? If not, go to the next check.
Are the first bunch of bytes the same? If not, go to the next check.
At this point you have to check the hashes (MD5).
Of course you will have to store size/first X bytes/hash for each processed file.
In addition, identical MD5 hashes don't guarantee that the files are the same, so you might want to take an extra step to check whether they really are. That might be overkill, though; it depends on how heavy the cost of reprocessing a file is, and it may be more important not to calculate expensive hashes.
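A rough sketch of that staged check, assuming you keep a small record per processed file (the ProcessedFile class and helper below are my own names, not part of the original code):
// Hypothetical bookkeeping per processed file: size, first bytes, and full MD5 hash.
class ProcessedFile
{
    public long Size;
    public byte[] FirstBytes;   // e.g. the first 64 bytes
    public byte[] Md5;
}

static byte[] ReadFirstBytes(string path, int count)
{
    using (var fs = File.OpenRead(path))
    {
        var buffer = new byte[count];
        int read = fs.Read(buffer, 0, count);
        Array.Resize(ref buffer, read);
        return buffer;
    }
}

static bool SeenBefore(string path, List<ProcessedFile> seen)
{
    var info = new FileInfo(path);
    foreach (var p in seen)
    {
        if (p.Size != info.Length)
            continue;                                   // cheapest check first
        if (!ReadFirstBytes(path, p.FirstBytes.Length).SequenceEqual(p.FirstBytes))
            continue;                                   // still cheap
        using (var md5 = MD5.Create())
        using (var fs = File.OpenRead(path))
        {
            if (md5.ComputeHash(fs).SequenceEqual(p.Md5))   // expensive step, done last
                return true;
        }
    }
    return false;
}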
EDIT: As for the second question: your check is likely to fail because you are comparing the references of two byte arrays, which will never be equal since you create a new array every time. You need a sequence-equality comparison between the byte[] values (or convert the hash to a string and compare strings instead):
var exists = Bytelist.Any(hash => hash.SequenceEqual(myHash));
Are you sure this new file format doesn't add extra metadata into the content, such as a last-modified date or attributes that change?
Also, if you are converting to a known format, there should be a way to use a file signature to tell whether it's already in that format; if it is your own format, add some extra signature bytes to identify it.
Don't forget that if your app gets closed and opened again, your approach will reprocess all files.
One last point regarding the code: I prefer not to store byte arrays, but if you must, it's better to use a HashSet instead of a List; it has O(1) lookup time.
There's a lot of room for improvement with regard to efficiency, effectiveness and style, but this isn't CodeReview.SE, so I'll try to stick to the problem at hand:
You're checking whether two byte arrays are equivalent by using the == operator. But that will only perform reference-equality testing - i.e. test whether the two variables point to the same instance, the very same array. That, of course, won't work here.
There are many ways to do it, starting with a simple foreach loop over the arrays (with an optimization that checks the length first, probably) or using Enumerable.SequenceEqual as you can find in this answer here.
Better yet, convert your hash's byte[] to a string (any string - Convert.ToBase64String would be a good choice) and store that in your Bytelist cache (which should be a HashSet, not a List). Strings are optimized for this sort of comparison, and you won't run into the "reference equality" problem here.
So a sample solution would be this:
private static readonly HashSet<string> _computedHashes = new HashSet<string>();

public static bool DoCheck(string file)
{
    // ... stuff

    // Get byte sig of file and if seen before dont process
    byte[] myFileData = File.ReadAllBytes(file);
    byte[] myHash = MD5.Create().ComputeHash(myFileData);
    string hashString = Convert.ToBase64String(myHash);
    return _computedHashes.Contains(hashString);
}
Presumably, you'll add the hash to the _computedHashes set after you've done the conversion.
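One way to wire that up, shown here only as a possible variation rather than the answer's own code, is to let DoCheck record unseen hashes itself, since HashSet<T>.Add already reports whether the value was new:
public static bool DoCheck(string file)
{
    byte[] myFileData = File.ReadAllBytes(file);
    string hashString = Convert.ToBase64String(MD5.Create().ComputeHash(myFileData));

    // HashSet<T>.Add returns false when the value is already present,
    // so "seen before" is simply the negation of Add.
    return !_computedHashes.Add(hashString);
}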
You have to compare the byte arrays item by item:
foreach (var item in Bytelist)
{
    // If seen before ignore
    if (myHash.Length == item.Length)
    {
        bool isequal = true;
        for (int i = 0; i < myHash.Length; i++)
        {
            if (myHash[i] != item[i])
            {
                isequal = false;
            }
        }
        if (isequal)
        {
            return true;
        }
    }
}

How to check if a byte array ends with carriage return

I want to know whether my byte array ends with a carriage return, and if not, I want to add one.
This is what I have tried:
byte[] fileContent = File.ReadAllBytes(openFileDialog.FileName);
byte[] endCharacter = fileContent.Skip(fileContent.Length - 2).Take(2).ToArray();
if (!(endCharacter.Equals(Encoding.ASCII.GetBytes(Environment.NewLine))))
{
    fileContent = fileContent.Concat(Encoding.ASCII.GetBytes(Environment.NewLine)).ToArray();
}
But I don't get it... Is this the right approach? If so, what's wrong with Equals? Even if my byte array ends with {10,13}, the if statement never detects it.
In this case, Equals checks for reference equality; while endCharacter and Encoding.ASCII.GetBytes(Environment.NewLine) may have the same contents, they are not the same array, so Equals returns false.
You're interested in value equality, so you should instead individually compare the values at each position in the arrays:
byte[] newLine = Encoding.ASCII.GetBytes(Environment.NewLine);
if (endCharacter[0] != newLine[0] || endCharacter[1] != newLine[1])
{
    // ...
}
In general, if you want to compare arrays for value equality, you could use something like this method, provided by Marc Gravell.
However, a much more efficient solution to your problem would be to convert the last two bytes of your file into ASCII and do a string comparison (since System.String already overloads == to check for value equality):
string endCharacter = Encoding.ASCII.GetString(fileContent, fileContent.Length - 2, 2);
if (endCharacter == Environment.NewLine)
{
    // ...
}
You may also need to be careful about reading the entire file into memory if it's likely to be large. If you don't need the full contents of the file, you could do this more efficiently by just reading in the final two bytes, inspecting them, and appending directly to the file as necessary. This can be achieved by opening a System.IO.FileStream for the file (through System.IO.File.Open).
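A minimal sketch of that idea (untested; it assumes the file already has at least two bytes and a Windows-style two-byte newline):
byte[] newLine = Encoding.ASCII.GetBytes(Environment.NewLine);
using (FileStream fs = File.Open(openFileDialog.FileName, FileMode.Open, FileAccess.ReadWrite))
{
    byte[] tail = new byte[2];
    fs.Seek(-2, SeekOrigin.End);            // read only the last two bytes
    fs.Read(tail, 0, 2);
    if (!tail.SequenceEqual(newLine))       // SequenceEqual comes from System.Linq
    {
        fs.Seek(0, SeekOrigin.End);
        fs.Write(newLine, 0, newLine.Length);   // append the newline directly to the file
    }
}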
I found the solution: I must use SequenceEqual (http://www.dotnetperls.com/sequenceequal) in place of Equals. Thanks to everyone!
byte[] fileContent = File.ReadAllBytes(openFileDialog.FileName);
byte[] endCharacter = fileContent.Skip(fileContent.Length - 2).Take(2).ToArray();
if (!(endCharacter.SequenceEqual(Encoding.ASCII.GetBytes(Environment.NewLine))))
{
    fileContent = fileContent.Concat(Encoding.ASCII.GetBytes(Environment.NewLine)).ToArray();
    File.AppendAllText(openFileDialog.FileName, Environment.NewLine);
}

Which is a better way to compare 2 files?

I have the following situation in C#:
ZipFile zip1 = ZipFile.Read("f1.zip");
ZipFile zip2 = ZipFile.Read("f2.zip");
MemoryStream ms1 = new MemoryStream();
MemoryStream ms2 = new MemoryStream();
ZipEntry zipentry1 = zip1["f1.dll"];
ZipEntry zipentry2 = zip2["f2.dll"];
zipentry1.Extract(ms1);
zipentry2.Extract(ms2);
byte[] b1 = new byte[ms1.Length];
byte[] b2 = new byte[ms2.Length];
ms1.Seek(0, SeekOrigin.Begin);
ms2.Seek(0, SeekOrigin.Begin);
What I have done here is open 2 zip files, f1.zip and f2.zip. Then I extract 2 files inside them (f1.txt and f2.txt inside f1.zip and f2.zip respectively) onto the MemoryStream objects. I now want to compare the files and find out if they are the same or not. I had 2 ways in mind:
1) Read the memory streams byte by byte and compare them.
For this I would use
ms1.BeginRead(b1, 0, (int) ms1.Length, null, null);
ms2.BeginRead(b2, 0, (int) ms2.Length, null, null);
and then run a for loop and compare each byte in b1 and b2.
2) Get the string values for both the memory streams and then do a string compare. For this I would use
string str1 = Encoding.UTF8.GetString(ms1.GetBuffer(), 0, (int)ms1.Length);
string str2 = Encoding.UTF8.GetString(ms2.GetBuffer(), 0, (int)ms2.Length);
and then do a simple string compare.
Now, I know comparing byte by byte will always give me a correct result. But the problem with it is that it will take a lot of time, as I have to do this for thousands of files. That is why I am thinking about the string-compare method, which looks like it can find out very quickly whether the files are equal or not. But I am not sure if string comparison will give me the correct result, as the files are dlls or media files etc. and will certainly contain special characters.
Can anyone tell me if the string compare method will work correctly or not ?
Thanks in advance.
P.S. : I am using DotNetLibrary.
The baseline for this question is the native way to compare arrays: Enumerable.SequenceEqual. You should use that unless you have good reason to do otherwise.
If you care about speed, you could attempt to p/invoke to memcmp in msvcrt.dll and compare the byte arrays that way. I find it hard to imagine that could be beaten. Obviously you'd do a comparison of the lengths first and only call memcmp if the two byte arrays had the same length.
The p/invoke looks like this:
[DllImport("msvcrt.dll", CallingConvention=CallingConvention.Cdecl)]
static extern int memcmp(byte[] lhs, byte[] rhs, UIntPtr count);
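A typical wrapper (a sketch, with the length check the answer mentions done first so memcmp only ever runs on arrays of equal size) might look like this:
static bool ByteArraysEqual(byte[] b1, byte[] b2)
{
    // Compare lengths first, then let memcmp compare the contents.
    return b1.Length == b2.Length
        && memcmp(b1, b2, new UIntPtr((uint)b1.Length)) == 0;
}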
But you should only contemplate this if you really do care about speed, and the pure managed alternatives are too slow for you. So, do some timings to make sure you are not optimising prematurely. Well, even to make sure that you are optimising at all.
I'd be very surprised if converting to string was fast. I'd expect it to be slow. And in fact I'd expect your code to fail because there's no reason for your byte arrays to be valid UTF-8. Just forget you ever had that idea!
Compare the ZipEntry.Crc and ZipEntry.UncompressedSize of the two files; only if they match, uncompress and do the byte comparison. If the two files are the same, their CRC and size will be the same too. This strategy will save you a ton of CPU cycles.
ZipEntry zipentry1 = zip1["f1.dll"];
ZipEntry zipentry2 = zip2["f2.dll"];
if (zipentry1.Crc == zipentry2.Crc && zipentry1.UncompressedSize == zipentry2.UncompressedSize)
{
    // uncompress
    zipentry1.Extract(ms1);
    zipentry2.Extract(ms2);
    byte[] b1 = new byte[ms1.Length];
    byte[] b2 = new byte[ms2.Length];
    ms1.Seek(0, SeekOrigin.Begin);
    ms2.Seek(0, SeekOrigin.Begin);
    ms1.BeginRead(b1, 0, (int) ms1.Length, null, null);
    ms2.BeginRead(b2, 0, (int) ms2.Length, null, null);
    // perform a byte comparison
    if (Enumerable.SequenceEqual(b1, b2)) // or a simple for loop
    {
        // files are the same
    }
    else
    {
        // files are different
    }
}
else
{
    // files are different
}

Search for string in multiple text files of size 150 MB each C#

I have multiple .txt files of 150MB size each. Using C# I need to retrieve all the lines containing the string pattern from each file and then write those lines to a newly created file.
I already looked into similar questions but none of their suggested answers could give me the fastest way of fetching results. I tried regular expressions, linq query, contains method, searching with byte arrays but all of them are taking more than 30 minutes to read and compare the file content.
My test files don't have any specific format; it's raw data that we can't split on a delimiter and filter using DataViews. Below is the sample format of each line in such a file.
Sample.txt
LTYY;;0,0,;123456789;;;;;;;20121002 02:00;;
ptgh;;0,0,;123456789;;;;;;;20121002 02:00;;
HYTF;;0,0,;846234863;;;;;;;20121002 02:00;;
Multiple records......
My Code
using (StreamWriter SW = new StreamWriter(newFile))
{
    using (StreamReader sr = new StreamReader(sourceFilePath))
    {
        while (sr.Peek() >= 0)
        {
            if (sr.ReadLine().Contains(stringToSearch))
                SW.WriteLine(sr.ReadLine().ToString());
        }
    }
}
I want a sample code which would take less than a minute to search for 123456789 from the Sample.txt. Let me know if my requirement is not clear. Thanks in advance!
Edit
I found the root cause: the files reside on a remote server, and that is what consumes most of the time when reading them. When I copied the files to my local machine, all comparison methods completed very quickly, so this isn't an issue with how we read or compare the content; the methods all took more or less the same time.
But now how do I address this? I can't copy all those files to my machine for comparison, and I get OutOfMemory exceptions.
The fastest method to search is the Boyer–Moore string search algorithm, as it does not require reading all bytes from the files (it does require random access to the bytes). You can also try the Rabin–Karp algorithm.
Or you can try doing something like the following code, from this answer:
public static int FindInFile(string fileName, string value)
{
    // returns complement of number of characters in file if not found
    // else returns index where value found
    int index = 0;
    using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName))
    {
        if (String.IsNullOrEmpty(value))
            return 0;
        StringSearch valueSearch = new StringSearch(value);
        int readChar;
        while ((readChar = reader.Read()) >= 0)
        {
            ++index;
            if (valueSearch.Found(readChar))
                return index - value.Length;
        }
    }
    return ~index;
}

public class StringSearch
{
    // Call Found one character at a time until string found
    private readonly string value;
    private readonly List<int> indexList = new List<int>();

    public StringSearch(string value)
    {
        this.value = value;
    }

    public bool Found(int nextChar)
    {
        for (int index = 0; index < indexList.Count; )
        {
            int valueIndex = indexList[index];
            if (value[valueIndex] == nextChar)
            {
                ++valueIndex;
                if (valueIndex == value.Length)
                {
                    indexList[index] = indexList[indexList.Count - 1];
                    indexList.RemoveAt(indexList.Count - 1);
                    return true;
                }
                else
                {
                    indexList[index] = valueIndex;
                    ++index;
                }
            }
            else
            {
                // next char does not match
                indexList[index] = indexList[indexList.Count - 1];
                indexList.RemoveAt(indexList.Count - 1);
            }
        }
        if (value[0] == nextChar)
        {
            if (value.Length == 1)
                return true;
            indexList.Add(1);
        }
        return false;
    }

    public void Reset()
    {
        indexList.Clear();
    }
}
I don't know how long this will take to run, but here are some improvements:
using (StreamWriter SW = new StreamWriter(newFile))
{
using (StreamReader sr = new StreamReader(sourceFilePath))
{
while (!sr.EndOfStream)
{
var line = sr.ReadLine();
if (line.Contains(stringToSearch))
SW.WriteLine(line);
}
}
}
Note that you don't need Peek, EndOfStream will give you what you want. You were calling ReadLine twice (probably not what you had intended). And there's no need to call ToString() on a string.
As I said already, you should have a database, but whatever.
The fastest, shortest and nicest way to do it (even one-lined) is this:
File.AppendAllLines("b.txt", File.ReadLines("a.txt")
.Where(x => x.Contains("123456789")));
But fast? 150MB is 150MB. It's gonna take a while.
You can replace the Contains method with your own, for faster comparison, but that's a whole different question.
Other possible solution...
var sb = new StringBuilder();
foreach (var x in File.ReadLines("a.txt").Where(x => x.Contains("123456789")))
{
sb.AppendLine(x);
}
File.WriteAllText("b.txt", sb.ToString()); // That is one heavy operation there...
I tested it with a 150MB file, and it found all results within 3 seconds. The thing that takes time is writing the results into the second file (in case there are many results).
150MB is 150MB. If you have one thread going through the entire 150MB, line by line (a "line" being terminated by a newline character/group or by an EOF), your process must read in and spin through all 150MB of the data (not all at once, and it doesn't have to hold all of it at the same time). A linear search through 157,286,400 characters is, very simply, going to take time, and you say you have many such files.
First thing: you're reading the line out of the stream twice. This will, in most cases, actually cause you to read two lines whenever there's a match; what's written to the new file will be the line AFTER the one containing the search string. This is probably not what you want (then again, it may be). If you want to write the line actually containing the search string, read it into a variable before performing the Contains check.
Second, String.Contains() will, by necessity, perform a linear search. In your case, the behavior will actually approach N^2, because when searching for a string within a string, the first character must be found, and where it is, each character is then matched one by one to subsequent characters until all characters in the search string have matched or a non-matching character is found; when a non-match occurs, the algorithm must go back to the character after the initial match to avoid skipping a possible match, meaning it can test the same character many times when checking for a long string against a longer one with many partial matches. This strategy is therefore technically a "brute force" solution. Unfortunately, when you don't know where to look (such as in unsorted data files), there is no more efficient solution.
The only possible speedup I could suggest, other than being able to sort the files' data and then perform an indexed search, is to multithread the solution; if you're only running this method on one thread that looks through every file, not only is only one thread doing the job, but that thread is constantly waiting for the hard drive to serve up the data it needs. Having 5 or 10 threads each working through one file at a time will not only leverage the true power of modern multi-core CPUs more efficiently, but while one thread is waiting on the hard drive, another thread whose data has been loaded can execute, further increasing the efficiency of this approach. Remember, the further away the data is from the CPU, the longer it takes for the CPU to get it, and when your CPU can do between 2 and 4 billion things per second, having to wait even a few milliseconds for the hard drive means you're losing out on millions of potential instructions per second.
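As an illustration of the multithreading suggestion (a sketch only; sourceFolder is a placeholder, and the output order is not preserved), Parallel.ForEach lets several files be scanned at once:
// Requires System.Collections.Concurrent, System.IO and System.Threading.Tasks.
var matches = new ConcurrentBag<string>();
Parallel.ForEach(Directory.EnumerateFiles(sourceFolder, "*.txt"), file =>
{
    foreach (var line in File.ReadLines(file))
    {
        if (line.Contains(stringToSearch))
            matches.Add(line);      // thread-safe collection; order is not preserved
    }
});
File.WriteAllLines(newFile, matches);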
I'm not giving you sample code, but have you tried sorting the content of your files?
Trying to search for a string in 150MB worth of files is going to take some time any way you slice it, and if regex takes too long for you, then I'd suggest sorting the content of your files so that you know roughly where "123456789" will occur before you actually search; that way you won't have to search the unimportant parts.
Do not read and write at the same time. Search first, save a list of matching lines, and write it to the file at the end.
using System;
using System.Collections.Generic;
using System.IO;
...
List<string> list = new List<string>();

using (StreamReader reader = new StreamReader("input.txt")) {
    string line;
    while ((line = reader.ReadLine()) != null) {
        if (line.Contains(stringToSearch)) {
            list.Add(line); // Add to list.
        }
    }
}

using (StreamWriter writer = new StreamWriter("output.txt")) {
    foreach (string line in list) {
        writer.WriteLine(line);
    }
}
You're going to experience performance problems with any approach that blocks on input from these files while doing string comparisons.
But Windows has a pretty high-performance grep-like tool for doing string searches of text files, called FINDSTR, which might be fast enough. You could simply call it as a shell command and redirect the results of the command to your output file.
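For example, shelling out to FINDSTR might look roughly like this (the paths and search string are placeholders; /C: searches for a literal string, and FINDSTR prefixes each match with the file name when several files are searched):
// Requires System.Diagnostics and System.IO.
var psi = new ProcessStartInfo("findstr", "/C:\"123456789\" \"C:\\logs\\*.txt\"")
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
    CreateNoWindow = true
};
using (var process = Process.Start(psi))
using (var output = new StreamWriter(@"C:\logs\matches.txt"))
{
    output.Write(process.StandardOutput.ReadToEnd());
    process.WaitForExit();
}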
Either preprocessing (sort) or loading your large files into a database will be faster, but I'm assuming that you already have existing files you need to search.
