C# - Compare Two Text Files

Background
I'm developing a simple Windows service which monitors certain directories for file-creation events and logs them - long story short, to ascertain whether a file was copied from directory A to directory B. If a file is not in directory B after X time, an alert will be raised.
The issue with this is that the file itself is all I have to go on when working out whether it has made its way to directory B. I'd assume two files with the same name are the same, but as there are over 60 directory A's and a single directory B - and the files in any directory A may accidentally match those in another (by date or sequence) - this is not a safe assumption...
Example
Let's say, for example, I store a log that file "E17999_XXX_2111.txt" was created in directory C:\Test. I would store the filename, file path, file creation date, file length and the BOM for this file.
30 seconds later, I detect that the file "E17999_XXX_2111.txt" was created in directory C:\FinalDestination... now I have the task of determining whether:
a) the file is the same one created in C:\Test, therefore I can update the first log as complete and stop worrying about it.
b) the file is not the same and I somehow missed the previous steps - therefore I can ignore this file because it has found its way to the destination dir.
Research
So, in order to determine if the file created in the destination is exactly the same as the one created in the first instance, I've done a bit of research and found the following options:
a) filename compare
b) length compare
c) a creation-date compare
d) byte-for-byte compare
e) hash compare
Problems
a) As I said above, going by Filename alone is too presumptuous.
b) Again, just because the length of the contents of a file is the same, it doesn't necessarily mean the files are actually the same.
c) The problem with this is that a copied file is technically a new file, therefore the creation date changes. I would want to set the first log as complete regardless of the time elapsed between the file appearing in directory A and directory B.
d) Aside from the fact that this method is extremely slow, there appears to be an issue if the second file has somehow changed encoding - for example between ANSI and ASCII - which would cause a byte mismatch for things like ASCII quotes.
I would prefer not to assume that just because an ASCII ' has changed to an ANSI ', the file is now different, as it is near enough the same.
e) This seems to have the same downfalls as a byte-for-byte compare.
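For what it's worth, a minimal sketch of option (e) - comparing SHA-256 hashes instead of raw bytes. It is cheaper to compare (one pass per file, fixed-size digests) but shares the same downfall: any byte-level difference, including an encoding change, produces a different hash.
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static bool SameHash(string path1, string path2)
{
    using (var sha = SHA256.Create())
    {
        byte[] h1, h2;
        using (var s1 = File.OpenRead(path1)) { h1 = sha.ComputeHash(s1); }
        using (var s2 = File.OpenRead(path2)) { h2 = sha.ComputeHash(s2); }
        // 32-byte digests; equal hashes almost certainly mean equal bytes.
        return h1.SequenceEqual(h2);
    }
}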
EDIT
It appears the actual issue I'm experiencing comes down to the reason for the difference in encoding between directories. I'm not currently able to access the code which deals with this part, so I can't tell why it happens, but I am looking to implement a solution which can compare files regardless of encoding, to determine "real" differences (i.e. not those whereby a byte has changed due to encoding).
SOLUTION
I've managed to resolve this now: when the initial comparison suggested by @Magnus fails to find a match because of the encoding issue, I re-encode the files to remove any bad data and use the SequenceEqual comparison below. Code below:
// Re-encode both files from Windows-1252 to ASCII so that encoding-only
// differences (e.g. ANSI vs ASCII quotes) don't cause a mismatch.
byte[] bytes1 = Encoding.Convert(Encoding.GetEncoding(1252), Encoding.ASCII, Encoding.GetEncoding(1252).GetBytes(File.ReadAllText(FilePath1)));
byte[] bytes2 = Encoding.Convert(Encoding.GetEncoding(1252), Encoding.ASCII, Encoding.GetEncoding(1252).GetBytes(File.ReadAllText(FilePath2)));
if (Encoding.ASCII.GetChars(bytes1).SequenceEqual(Encoding.ASCII.GetChars(bytes2)))
{
    //matched!
}
Thanks for the help!

You would then have to compare the string content of the files. The StreamReader (which ReadLines uses) should detect the encoding.
var areEquals = System.IO.File.ReadLines("c:\\file1.txt").SequenceEqual(
    System.IO.File.ReadLines("c:\\file2.txt"));
Note that ReadLines will not read the complete file into memory.

Related

OpenFileDialog & SaveFileDialog Pop-up search with filter in C#

I have an openFileDialog and a saveFileDialog with a filter (only the .dvbcfg extension):
SaveFileDialog saveFileDialog = new SaveFileDialog();
saveFileDialog.Filter = "DVB Configuration File (*.dvbcfg)|*.dvbcfg";
saveFileDialog.DefaultExt = "dvbcfg";
saveFileDialog.AddExtension = true;
It works properly, but when I try to type a filename manually it shows files with any extension, without filtering, and opens/saves them (first - open file, second - save file; see screenshot).
How do I show only files that match saveFileDialog.Filter?
P.S. I have an overwrite function in saveFileDialog.
UPD: I have another option - throw an exception when the user selects the wrong file type - but I have no idea how to get only the file extension from the saveFileDialog.FileName string.
At a certain point, you have to "trust" your users. You can steer them towards good ways of working with your program, but at some point you have to recognise that while you've put enough simple barriers in their way to prevent accidental misuse[1], you're unlikely to be able to create enough barriers (in these dialogs) to prevent malicious misuse.
The problem is that using the wrong file may cause damage to expensive equipment (a DVB-3030 Digital Modulator in this case), even though I'm using try/catch when reading variables from the files (they need to be integers; in the try block I have Convert.ToInt32) and variable range checks in if/else (for example, the frequency range should be 10 MHz - 90 MHz with a 100 Hz step). Since the program will be used by students, they may purposely try to break it.
And nothing in your current question (or sought answer) would prevent someone from renaming any arbitrary file to have a .dvbcfg extension.
At this point, you "trust" that the user has given you the filename they wish to use. What you need to do next is to validate the contents of the file. If it has a .dvbcfg extension but isn't actually a valid DVB config file, you need to reject it. If it doesn't have a .dvbcfg extension (hey, maybe they're working with an old file system that only allows 8.3 file names :-)) but turns out to have valid content, why be churlish and reject that file?
I would recommend more than just wrapping Convert.ToInt32 calls in try/catch. Go through the file. Ensure it contains exactly what it should and nothing else. Read each parameter value and preferably use TryParse on those, because your code now "expects" to receive invalid inputs. Then validate ranges, etc.
[1] Which I'd say you've already got.
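To illustrate, a rough sketch of that kind of content validation; the key=value layout and the single Frequency parameter are assumptions for the example, not the real .dvbcfg format:
using System;
using System.IO;

// Hypothetical validator: accept the file only if every line parses
// and the (assumed) Frequency parameter is in range.
public static bool TryLoadFrequency(string path, out int frequencyHz)
{
    frequencyHz = 0;
    foreach (string line in File.ReadLines(path))
    {
        if (string.IsNullOrWhiteSpace(line)) continue;
        string[] parts = line.Split('=');
        if (parts.Length != 2) return false;          // reject unexpected content
        if (parts[0].Trim() != "Frequency") continue; // other params validated likewise

        // Expect invalid input: TryParse instead of Convert.ToInt32.
        if (!int.TryParse(parts[1].Trim(), out frequencyHz)) return false;

        // Range check: 10 MHz - 90 MHz in 100 Hz steps.
        return frequencyHz >= 10000000 && frequencyHz <= 90000000
            && frequencyHz % 100 == 0;
    }
    return false; // required parameter missing
}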

Is overwriting a file multiple times enough to erase its data?

In Shredding files in .NET it is recommended to use Eraser or this code here on CodeProject to securely erase a file in .NET.
I was trying to make my own method of doing so, as the code from CodeProject had some problems for me. Here's what I came up with:
private static readonly Random rnd = new Random();

public static void secureDelete(string file, bool deleteFile = true)
{
    // Rename the file to a random name before overwriting it.
    string nfName = "deleted" + rnd.Next(1000000000, 2147483647) + ".del";
    string newPath = Path.Combine(Path.GetDirectoryName(file), nfName);
    File.Move(file, newPath);
    file = newPath;

    // Overwrite the contents with 1 MB of random data, 7 times over.
    int overWritten = 0;
    while (overWritten < 7)
    {
        byte[] data = new byte[1 * 1024 * 1024];
        rnd.NextBytes(data);
        File.WriteAllBytes(file, data);
        overWritten += 1;
    }
    if (deleteFile) { File.Delete(file); }
}
It seems to work fine. It renames the file randomly and then overwrites it with 1 MB of random data 7 times. However, I was wondering how safe it actually is, and whether there is any way I could make it safer?
A file system, especially when accessed through a higher-level API such as the ones found in System.IO is so many levels of abstraction above the actual storage implementation that this approach makes little sense for modern drives.
To be clear: the CodeProject article, which promotes overwriting a file by name multiple times, is absolute nonsense - for SSDs at least. There is no guarantee whatsoever that writing to a file at some path multiple times writes to the same physical location on disk every time.
Of course, opening a file with read-write access and overwriting it from the beginning, conceptually writes to the same "location". But that location is pretty abstract.
See it like this: hard disks, but especially solid state drives, might take a write, such as "set byte N of cluster M to O", and actually write an entire new cluster to an entirely different location on the drive, to prolong the drive's lifetime (as repeated writes to the same memory cells may damage the drive).
From Coding for SSDs – Part 3: Pages, Blocks, and the Flash Translation Layer | Code Capsule:
Pages cannot be overwritten
A NAND-flash page can be written to only if it is in the “free” state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a “free” page, an operation called “read-modify-write”. The data is not updated in-place, as the “free” page is a different page than the page that originally contained the data. Once the data is persisted to the drive, the original page is marked as being “stale”, and will remain as such until it is erased.
This means that somewhere on the drive, the original data is still readable, namely in the cluster M to which a write was requested. That is, until it is overwritten. The cluster is now marked as "free", but you'll need very low-level access to the disk to access that cluster in order to overwrite it, and I'm not sure that's possible with SSDs.
Even if you would overwrite the entire SSD or hard drive multiple times, chances are that some of your very private data is hidden in a now defunct sector or page on the disk or SSD, because at the moment of overwriting or clearing it the drive determined that location to be defective. A forensics team will be able to read this data (albeit damaged). So, if you have data on a hard drive that can be used against you: toss the drive into a fire.
See also Get file offset on disk/cluster number for some more (links to) information about lower-level file system APIs.
But all of this is to be taken with quite a grain of salt, as all of this is hearsay and I have no actual experience with this level of disk access.

decrypt an encrypted value?

I have an old Paradox database (I can convert it to Access 2007) which contains more than 200,000 records. This database has two columns: the first is named "Word" and the second is named "Mean". It is a dictionary database, and my client wants to convert this old database to ASP.NET and SQL.
However, we don't know what key or method was used to encrypt or encode the "Mean" column, which is in Unicode format. The software itself was written in Delphi 7 and we don't have the source code. My client only knows the credentials for logging in to the database. The problem is decoding the Mean column.
What I do have is the compiled Windows application and the Paradox database. The software can decode the "Mean" column for each "Word", so the method and/or key is in its own compiled code (.exe) or one of the files in its directory.
For example, we know that in the following row the word "Zymurgy" exactly means "مبحث عمل تخمیر در شیمی علمی, تخمیر شناسی", since the application translates it like that. Here is what the record looks like when I open the database in Access:
Word Mean
Zymurgy 5OBnGguKPdDAd7L2lnvd9Lnf1mdd2zDBQRxngsCuirK5h91sVmy0kpRcue/+ql9ORmP99Mn/QZ4=
Therefore we're trying to discover how the value in the Mean column is converted to "مبحث عمل تخمیر در شیمی علمی, تخمیر شناسی". I think the "Mean" value in the above row is encoded as a Base64 string, but decoding the Base64 does not yet result in the expected text.
The extensions for files in the win app directory are dll, CCC, DAT, exe (other than the main app file), SYS, FAM, MB, PX, TV, VAL.
Any kind of help is appreciated.
Here are two more examples; remember, the double quotes at the start and end are not part of the strings:
word: "abdominal"
coded value: "vwtj0bmj7jdF9SS8sbrIalBoKMDvTbpraFgG4gP/G9GLx5iU/E98rQ=="
translation in Farsi: "شکمی, بطنی, وریدهای شکمی, ماهیان بطنی"
word: "cart"
coded value: "KHoCkDsIndb6OKjxVxsh+Ti+iA/ZqP9sz28e4/cQzMyLI+ToPbiLOaECWQ8XKXTz"
translation in Farsi: "ارابه, گاری, دوچرخه, چرخ, با گاری بردن"
Here are the results in different encodings:
1- in unicode the result is: "ᩧ訋퀽矀箖�柖�섰᱁艧껀늊螹泝汖銴岔꫾也捆￉鹁"
2- in utf32 the result is: "��������������"
3- in utf7 the result is: "äàg\v=ÐÀw²ö{Ýô¹ßÖg]Û0ÁAgÀ®²¹ÝlVl´\\¹ïþª_NFcýôÉÿA"
4- in utf8 the result is: "��g\v�=��w���{����g]�0�Ag��������lVl���\\����_NFc����A�"
5- in 1256 the result is: "نàg\vٹ=ذہw²ِ–{فô¹كضg]غ0ءAg‚ہ®ٹ²¹‡فlVl´’”\\¹ï‏ھ_NFc‎ôةےA"
I have since discovered that the Paradox database system is very complex when it comes to key management, and most of the time the keys are "compound keys" - that's why it's problematic, and that's why it's abandoned!
UPDATE: I'm trying to do the automation using AutoIt v3, because the decryption process, as I understand it, can't be done in one or two days. Now I have another problem, which is related to text/fonts. When I copy the translated text to Notepad, it changes to some unrecognizable text unless I change Notepad's font to the font of the translation software. If I type something in Farsi in Notepad, it shows correctly regardless of which font I've chosen. More interestingly, when I copy the text into any other program, like MS Office Word, it is shown correctly no matter what font I choose.
So how can I get around this?
In this situation, I would think about writing a script/program to simply pull all the data out through the existing program.
You could write an application to send keypresses to the app which would select and copy each value in turn.
It would take a while to run, but you could just leave it overnight (how big is your database?) and it only has to run once.
Not sure how easy this would be, since I haven't seen this app of course - might this work?
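For what it's worth, a sketch of that keypress-driven extraction in C# (in the same spirit as the AutoIt approach mentioned in the question's update). The window title, keystrokes, and timings are all hypothetical and would need adjusting to the real app:
using System;
using System.Runtime.InteropServices;
using System.Threading;
using System.Windows.Forms; // add a reference to System.Windows.Forms

class DictionaryScraper
{
    [DllImport("user32.dll")]
    static extern IntPtr FindWindow(string lpClassName, string lpWindowName);

    [DllImport("user32.dll")]
    static extern bool SetForegroundWindow(IntPtr hWnd);

    [STAThread] // Clipboard access requires an STA thread
    static void Main()
    {
        IntPtr hWnd = FindWindow(null, "Dictionary"); // hypothetical window title
        if (hWnd == IntPtr.Zero) return;
        SetForegroundWindow(hWnd);

        foreach (string word in new[] { "abdominal", "cart" }) // loop over all 200,000 words
        {
            SendKeys.SendWait(word + "{ENTER}"); // type the word, trigger the lookup
            Thread.Sleep(500);                   // give the app time to decode
            SendKeys.SendWait("^a^c");           // select the translation, copy it
            Thread.Sleep(200);
            Console.WriteLine(word + "\t" + Clipboard.GetText());
        }
    }
}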
Take a debugger like OllyDbg/SoftICE. Find the place where the Mean is decoded/encoded and then step through the instructions one by one, checking all registers to find out what is done. I have done so numerous times. That should help you get started, since you have the application which is able to decode this stuff. You also have a reference word. That's all you need.
Also take into consideration: Unicode can be little or big endian, so you might try swapping the bytes. UTF-8 can be a pain, since some characters are stored as one byte and some as two bytes.
You can also try to take words which are almost identical in Farsi and try to compare the outputs. That could lead to a reconstruction of a custom code page, if there is one.
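A quick way to test the byte-order idea on the "abdominal" sample above: decode the Base64 and dump it under both UTF-16 byte orders. If the data really is encrypted, all of these will still print garbage, but it cheaply rules out a simple endianness mix-up:
using System;
using System.Text;

byte[] raw = Convert.FromBase64String(
    "vwtj0bmj7jdF9SS8sbrIalBoKMDvTbpraFgG4gP/G9GLx5iU/E98rQ==");

Console.WriteLine("UTF-16 LE: " + Encoding.Unicode.GetString(raw));
Console.WriteLine("UTF-16 BE: " + Encoding.BigEndianUnicode.GetString(raw));
Console.WriteLine("UTF-8:     " + Encoding.UTF8.GetString(raw));

// Swap each byte pair manually, then try little-endian again.
byte[] swapped = (byte[])raw.Clone();
for (int i = 0; i + 1 < swapped.Length; i += 2)
{
    byte t = swapped[i];
    swapped[i] = swapped[i + 1];
    swapped[i + 1] = t;
}
Console.WriteLine("Swapped LE: " + Encoding.Unicode.GetString(swapped));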

C# code to perform Binary search in a very big text file

Is there a library that I can use to perform binary search in a very big text file (it can be 10 GB)?
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
In summary - I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
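A rough sketch of that seek-and-scan search, assuming '\n' line endings and a caller-supplied comparison (returning a negative value when a line sorts before the record being sought). It returns the byte offset of the first line that does not sort before the target; because both bounds shrink on every iteration, it cannot get stuck between records:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LogSearch
{
    // Returns the byte offset of the first line for which compare(line) >= 0.
    public static long LowerBound(string path, Func<string, int> compare)
    {
        using (var fs = File.OpenRead(path))
        {
            long lo = 0, hi = fs.Length; // invariant: lo is always a line start
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                long start = StartOfLine(fs, mid);
                long next;
                string line = ReadLineAt(fs, start, out next);
                if (compare(line) < 0)
                    lo = next;   // target lies after this line
                else
                    hi = start;  // target is this line or an earlier one
            }
            return lo;
        }
    }

    // Scan backwards from pos to the character just after the previous '\n'.
    static long StartOfLine(FileStream fs, long pos)
    {
        while (pos > 0)
        {
            fs.Seek(pos - 1, SeekOrigin.Begin);
            if (fs.ReadByte() == '\n') break;
            pos--;
        }
        return pos;
    }

    // Read the line starting at 'start'; 'next' receives the next line's offset.
    static string ReadLineAt(FileStream fs, long start, out long next)
    {
        fs.Seek(start, SeekOrigin.Begin);
        var bytes = new List<byte>();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n')
            bytes.Add((byte)b);
        next = fs.Position; // just past the '\n' (or end of file)
        return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
    }
}
The compare delegate would parse the line's leading timestamp and compare it to the one you are looking for.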
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long the average line of text is; given 1000 bytes per line and 8 bytes per Int64, you'd be looking at around (10,000,000,000 / 1000 * 8) = 80 MB. Very big, but possible.
So try this (a sketch follows the list):
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that scans to the file offset and reads the data.
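A rough sketch of those two steps, under a couple of assumptions: '\n' line endings, and lines beginning with a sortable ASCII timestamp (the target prefix is supplied by the caller). The comparer leans on the fact that List<T>.BinarySearch always calls Compare(list[index], item), so the item argument can be ignored:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class IndexedLogSearch : IComparer<long>, IDisposable
{
    readonly FileStream fs;
    readonly string target; // the timestamp prefix being sought

    public IndexedLogSearch(string path, string target)
    {
        fs = File.OpenRead(path);
        this.target = target;
    }

    // Step 1: one sequential pass, storing the offset of each line start.
    public List<long> BuildIndex()
    {
        var offsets = new List<long> { 0 };
        fs.Seek(0, SeekOrigin.Begin);
        int b;
        while ((b = fs.ReadByte()) != -1)
            if (b == '\n' && fs.Position < fs.Length)
                offsets.Add(fs.Position); // the next line starts here
        return offsets;
    }

    // Step 2: BinarySearch calls Compare(offsets[i], item); the item is
    // ignored - instead, read the line at the offset and compare its
    // timestamp prefix against the target.
    public int Compare(long lineOffset, long ignored)
    {
        fs.Seek(lineOffset, SeekOrigin.Begin);
        var sb = new StringBuilder();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n' && sb.Length < target.Length)
            sb.Append((char)b); // fine for an ASCII timestamp prefix
        return string.Compare(sb.ToString(), target, StringComparison.Ordinal);
    }

    public void Dispose() { fs.Dispose(); }
}
Usage: build the index once, then call offsets.BinarySearch(0L, searcher); a non-negative result is the index of a matching line's offset, and a negative result is the bitwise complement of the insertion point.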
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating "index" file:
Scan the initial file and take the datetime parts plus their positions in the original file (this is why it has to be pretty static), and encode them somehow, for example: Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have a file with consistent "lines".
Perform a binary search on that file (you may need to be a bit creative in order to achieve range search) and get the relevant location(s) in the original file.
Read directly from the original file, starting from the given location / read the given range.
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time on creating the index file than you save on querying.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only, and sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file - this way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply add (append) their EOL positions to the index file.
The List<T> object has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx

Find appended text from txt file

I want to write code so that, if there is a text file placed at a specified path and one of the users edits it, enters new text, and saves it, I can get the text which was appended last.
Here I have the file size from both before and after the text was appended.
My text file's size is 1204 KB; from that, I need to take only the last 200 KB of text. Is that possible?
This can only be done if you're monitoring the file size in real-time, since files do not maintain their own histories.
If watching the files as they are modified is a possibility, you could perhaps use a FileSystemWatcher and calculate the increase in file size upon any modification. You could then read the bytes appended since the file last changed, which would be very straightforward.
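A minimal sketch of that idea, assuming you stored the file's previous length (lastSize) the last time you saw it; it seeks past the old content and reads only the appended bytes (UTF-8 is an assumption):
using System.IO;
using System.Text;

public static string ReadAppended(string path, long lastSize)
{
    // Share the file so this works even while the writer still has it open.
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        if (fs.Length <= lastSize)
            return string.Empty; // nothing appended (or the file was truncated)

        fs.Seek(lastSize, SeekOrigin.Begin);
        var buffer = new byte[fs.Length - lastSize];
        int read = fs.Read(buffer, 0, buffer.Length);
        return Encoding.UTF8.GetString(buffer, 0, read);
    }
}
For the 1204 KB file in the question, lastSize would be the old size, and this would hand back just the final ~200 KB.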
Do you know how big the file was before the user appended the text? If not, there's no way of telling... files don't maintain a revision history (in most file systems, anyway).
You can keep track of the file pointer. E.g. if you are using the C language, you can go to the end of the file using fseek(fp, 0, SEEK_END) and then use ftell(fp), which gives you the current position of the file pointer. After the user edits and saves the file, when you rerun the code you can compare the new position with the original position. If the new position is greater than the original position, offset the file pointer by that number of bytes.
As @Jon Skeet alludes to in his answer, the only way to tell specifically what text was "appended" is by knowing how large the file was before it was changed. The rest of the characters are thus what was "appended".
Note that I quote "appended" above since I get two conflicting meanings from your question: edited and appended.
If the user only appends text, which is taken to mean "add more text only at the end", then the previous-size approach should in theory work.
However, if the user freely edits the text, by adding text in random spots, and perhaps even removing or changing existing text, then you need a whole 'nother approach to this.
If it's the latter, I might have something you could use, a binary patching implementation that can also be used to figure out from an older copy of the same file what was changed in a newer copy. It isn't easy to use, and might not give you exactly what you want, but as I said, it's hard to tell exactly what your question is.
If your program is running the entire time, you could grab a copy of the file in memory. Then in a separate thread periodically read the new file and compare the two.
If you want your program to be notified when a file is changed, use FileSystemWatcher. However, it will only notify you when the file is changed while your program is running, and it will not provide you with the appended text. You will only get information about which file was changed.
FileSystemWatcher watcher = new FileSystemWatcher(Environment.CurrentDirectory, "test.txt");
while (true)
{
    var changedResult = watcher.WaitForChanged(WatcherChangeTypes.Changed);
    Console.WriteLine(changedResult.Name);
}
Or:
FileSystemWatcher watcher = new FileSystemWatcher(Environment.CurrentDirectory, "test.txt");
watcher.Changed += watcher_Changed;
watcher.EnableRaisingEvents = true; // events are not raised until this is set

static void watcher_Changed(object sender, FileSystemEventArgs e)
{
    Console.WriteLine(e.FullPath);
    Console.WriteLine(e.ChangeType);
}
The best solution, IMO, is to write a small app which must be used to change the file in question. This application can then insert additional info into the file, allowing you to keep the entire revision history.
