Computing MD5SUM of large files in C#

I am using the following code to compute the MD5SUM of a file:
byte[] b = System.IO.File.ReadAllBytes(file);
string sum = BitConverter.ToString(new MD5CryptoServiceProvider().ComputeHash(b));
This works fine normally, but if I encounter a large file (~1GB) - e.g. an iso image or a DVD VOB file - I get an Out of Memory exception.
However, I am able to compute the MD5SUM of the same file in Cygwin in about 10 seconds.
Please suggest how I can get this to work for big files in my program.
Thanks

I suggest using the alternate method:
MD5CryptoServiceProvider.ComputeHash(Stream)
and just pass in an input stream opened on your file. This method will almost certainly not read the whole file into memory in one go.
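For illustration, a minimal sketch of that approach:
using System;
using System.IO;
using System.Security.Cryptography;
class HashDemo
{
    static string Md5OfFile(string file)
    {
        // Hash through a stream so the file is read in buffered chunks
        // instead of being loaded into memory all at once.
        using (var md5 = new MD5CryptoServiceProvider())
        using (var stream = File.OpenRead(file))
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }
}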
I would also note that in most implementations of MD5 it's possible to add byte[] data into the digest function a chunk at a time, and then ask for the hash at the end.
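In .NET that incremental style is available via TransformBlock/TransformFinalBlock. A sketch, with an arbitrary 64 KB buffer:
using System;
using System.IO;
using System.Security.Cryptography;
class ChunkedHashDemo
{
    static string Md5Chunked(string file)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(file))
        {
            var buffer = new byte[65536];
            int read;
            // Feed the digest one chunk at a time...
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                md5.TransformBlock(buffer, 0, read, null, 0);
            // ...then finalize and ask for the hash.
            md5.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(md5.Hash);
        }
    }
}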

Related

Creating a file that can only be used by my program. How do I distinguish it from other programs' files?

I create my file using File.WriteAllBytes(). The byte[] that is passed to File.WriteAllBytes() is encrypted by an algorithm of my own. You need the password that was used when the file was encrypted (the user of the program knows the password) to decrypt it. But when some file is opened by my program using File.ReadAllBytes() there are 3 situations:
File that is being opened is my program's file and user knows the password to open it.
File that is being opened is my program's file but user doesn't know the password to open it.
File that is being opened is not my program's file.
The first one is easy to handle. The 2nd and 3rd are the same for my program, because it doesn't know the difference between an encrypted byte[] and the byte[] of some random file.
How do I distinguish these situations? I was thinking of adding some sequence of bytes to the end or beginning of the byte[] before passing it to File.WriteAllBytes(). Is that safe? How do modern programs distinguish their files from other files?
You can give your file some structure before encryption, and check that the structure is there after decryption. If the structure is not there, it's not your file.
For example, you could compute a check sum, and store it in the first few bytes prior to the "payload" block of data. Encrypt the check sum along with the rest of the file.
When you decrypt, take the payload content, and compute its check sum again. Compare the stored result to the computed result to see if the two match. If they don't match, it's not your file. If they do match, very good chances are that it is your file.
This is not the only approach - the structure could be anything you wish, from placing a special sequence of bytes at a specific place to using a specific strict format (e.g. an XML) for your content, and then validating this format after the decryption.
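For illustration, a minimal sketch of the checksum variant; the Encrypt/Decrypt helpers are hypothetical stand-ins for your own algorithm:
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
class FileCheck
{
    // Hypothetical placeholders for your own encryption routine.
    static byte[] Encrypt(byte[] data, string password) { /* your algorithm */ return data; }
    static byte[] Decrypt(byte[] data, string password) { /* your algorithm */ return data; }

    static void Save(string path, byte[] payload, string password)
    {
        byte[] sum;
        using (var md5 = MD5.Create())
            sum = md5.ComputeHash(payload);               // 16-byte checksum
        byte[] plain = sum.Concat(payload).ToArray();     // checksum before the payload
        File.WriteAllBytes(path, Encrypt(plain, password));
    }

    static bool IsMyFile(string path, string password)
    {
        byte[] plain = Decrypt(File.ReadAllBytes(path), password);
        if (plain.Length < 16) return false;
        byte[] stored = plain.Take(16).ToArray();
        byte[] payload = plain.Skip(16).ToArray();
        using (var md5 = MD5.Create())
            return md5.ComputeHash(payload).SequenceEqual(stored);
    }
}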
[the file is] encrypted by algorithm of my own.
Be very careful with security through obscurity: coming up with an algorithm that is cryptographically secure is an extremely hard task.
Many file formats use "magic numbers" at the front of the file to identify their type. Use the first, say, 4 bytes: write a custom sequence into them, then read it back when you load the file.
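A sketch of that idea; the 4-byte sequence here is invented for the example:
using System;
using System.IO;
using System.Linq;
class MagicDemo
{
    // An arbitrary 4-byte signature for illustration ("MYF1").
    static readonly byte[] Magic = { 0x4D, 0x59, 0x46, 0x31 };

    static void Save(string path, byte[] payload)
    {
        // Prepend the magic bytes before writing.
        File.WriteAllBytes(path, Magic.Concat(payload).ToArray());
    }

    static bool HasMagic(string path)
    {
        byte[] head = new byte[4];
        using (var fs = File.OpenRead(path))
            if (fs.Read(head, 0, 4) != 4) return false;
        return head.SequenceEqual(Magic);
    }
}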

Is it possible to decompress a zip file while maintaining hierarchy using just .NET or some other built-in Windows API?

I have a zip file that contains folder hierarchies and files.
\images\
\images\1.jpg
\images\2.jpg
\something\something\a.exe
\something\something\b.exe
1.txt
I need to decompress the contents of this zip file to a location. I also need to preserve the structure of the zip file.
I've read about .NET's GZipStream and DeflateStream, but I am of the opinion that they are too "complicated" for my purpose.
I've also used DotNetZip and SharpZipLib in the past for personal projects but since this is work related and I'm working at a huge company, I would have a hard time convincing legal to use these libraries.
Question:
Is it possible to decompress a zip file while maintaining hierarchy using just .NET or some other built-in Windows API?
PS: I've also read this but I think it's hacky because you'll need to produce another executable just to hide the progress dialog.
Thanks!
Check out whether Ionic Zip helps.
DotNetZip would do what you want, but I understand your concerns about legal approval.
On a side note, it might be good for you to navigate the legal jungle associated with getting an open-source library approved for use in the company, just to understand what's involved. But I'll leave that up to you.
Getting back to rolling your own...
DotNetZip is pretty full featured, and it handles a number of scenarios you probably don't care about: Unicode filenames and comments, setting Windows timestamps and permissions of extracted files, getting timestamps of zip files created on old Unix systems, split archives, encrypted archives, files over 2 GB, self-extracting archives, and so on. Many zip files use none of those things.
Also DotNetZip does eventing and zip updates and zip creation - all the code associated with these things is probably not of interest to you, if you confine yourself just to the requirements you described in your question.
You could, though, grab the DotNetZip code and use it to help you roll your own solution. If you constrain yourself to JUST reading zip files and not dealing with all the possible special cases, the zip format is not difficult to parse.
Here's how to do it:
1. Open the zip file using new FileStream() or File.Open(). You want a FileStream object.
2. Read 4 bytes. Verify that they are the zip-entry-header signature (0x04034b50). In the file, the order you will find these bytes is 50 4b 03 04. If you find a match, you're in business.
3. Read the entry's header fields. At offset 14 is a 4-byte CRC; get it (same byte ordering as above). At offset 18, the 4-byte length of the compressed blob; get it (N). At offset 22, the 4-byte length of the UNcompressed blob; get it (U). At offset 26, the 2-byte length of the filename; get it (L). At offset 28, the 2-byte length of the "extra field"; get it (E). Beyond the extra-field length, at offset 30, is the actual filename: read L bytes and call System.Text.Encoding.ASCII.GetString(). The result will include a directory path, with backslashes replaced by slashes (unix style); String.Replace() the slashes. After the filename comes the extra field: seek E bytes to get beyond it. You can mostly ignore it. This is where the compressed data starts.
4. Open a System.IO.Compression.DeflateStream() on the zip FileStream, using CompressionMode.Decompress and the current offset of the FileStream as input. Open a new FileStream for output, with the file path you read in step 3. In a loop, call inflater.Read() and output.Write() to write the decompressed output of the DeflateStream to a filesystem file with the correct name. You will need to stop reading from the DeflateStream when you have read exactly U (uncompressed) bytes.
5. Check the uncompressed size (U) against the number of bytes you actually wrote out from the DeflateStream (after decompression). They should match.
6. If you are fancy, you can check the CRC of the output against what was in the header.
7. Go to step 2 to look for the next entry in the file.
The most complicated part is step 3. Working code for that is easily found in this source module; look for the ReadHeader method.
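For illustration, here is a condensed sketch of those steps. It assumes deflate-compressed entries (compression method 8) and ignores data descriptors, encryption, and zip64, so it is a starting point rather than a complete reader:
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
class MiniUnzip
{
    static void ExtractAll(string zipPath, string destDir)
    {
        using (var fs = File.OpenRead(zipPath))
        using (var br = new BinaryReader(fs))
        {
            // Step 2: loop while we keep finding local-file-header signatures.
            while (fs.Position + 4 <= fs.Length && br.ReadUInt32() == 0x04034b50)
            {
                // Step 3: read the header fields.
                br.ReadUInt16();                      // version needed
                br.ReadUInt16();                      // general-purpose flags
                ushort method = br.ReadUInt16();      // 8 = deflate, 0 = stored
                br.ReadUInt32();                      // DOS date/time
                uint crc = br.ReadUInt32();           // offset 14
                uint compSize = br.ReadUInt32();      // offset 18 (N)
                uint uncompSize = br.ReadUInt32();    // offset 22 (U)
                ushort nameLen = br.ReadUInt16();     // offset 26 (L)
                ushort extraLen = br.ReadUInt16();    // offset 28 (E)
                string name = Encoding.ASCII.GetString(br.ReadBytes(nameLen));
                fs.Seek(extraLen, SeekOrigin.Current); // skip the extra field
                long dataStart = fs.Position;

                string outPath = Path.Combine(destDir, name.Replace('/', Path.DirectorySeparatorChar));
                string dir = Path.GetDirectoryName(outPath);
                if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);

                // Step 4: inflate exactly U bytes into the output file.
                if (!name.EndsWith("/") && method == 8)
                {
                    using (var inflater = new DeflateStream(fs, CompressionMode.Decompress, true))
                    using (var output = File.Create(outPath))
                    {
                        var buffer = new byte[8192];
                        long remaining = uncompSize;
                        int read;
                        while (remaining > 0 &&
                               (read = inflater.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining))) > 0)
                        {
                            output.Write(buffer, 0, read);
                            remaining -= read;
                        }
                    }
                }
                // Step 7: jump past the compressed blob to the next header.
                fs.Position = dataStart + compSize;
            }
        }
    }
}
The position reset at the end matters because DeflateStream buffers reads ahead of what it decompresses, so the FileStream may be past the end of the entry's data.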
Maybe the full feature set of GZipStream is a bit complicated, but note that the sample on the MSDN page is exactly what you need. I mean this MSDN page (the 4.0 version), not the one you linked in the question:
http://msdn.microsoft.com/en-us/library/system.io.compression.gzipstream.aspx#Y2750

Compression issue with large archive of files in DotNetZip

Greetings....
I am writing a backup program in C# 3.5, using the latest DotNetZip. The basic idea of the program is to be given a location on a server and the max size of a spanned zip file, and go. From there it should traverse all the folders/files from the given location and add them to the archive, keeping the exact structure. It should also compress everything down to a reasonable amount. A given uncompressed collection of folders/files could easily be 10-25 GB, with the created spanned files being limited to about 1 GB each.
I have everything working (using DotNetZip). My only challenge is that there is little to no compression actually happening. I chose to use the "AddDirectory" method for simplicity of code and just generally how well it seemed to fit my project. After reading around, I am second-guessing that decision.
Given the below code and the large number of files in an archive, should I compress each file as it is added to the zip? Or should the AddDirectory method provide about the same compression?
I have tried every level of compression offered by Ionic.Zlib.CompressionLevel and none seem to help. Should I think about using an outside compression algorithm and stream it into my DotNetZip file?
using (ZipFile zip = new ZipFile())
{
    zip.AddDirectory(root.FullName);
    if (zipPassword.Length > 0)
        zip.Password = zipPassword;
    float size = zipGbSize * 1024 * 1024 * 1024;
    zip.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression;
    zip.AddProgress += new EventHandler<AddProgressEventArgs>(Zip_AddProgress);
    zip.ZipError += new EventHandler<ZipErrorEventArgs>(Zip_ZipError);
    zip.Comment = "This zip was created at " + System.DateTime.Now.ToString("G");
    zip.MaxOutputSegmentSize = (int)size; // in gig
    zip.Name = archiveDir.FullName + @"\Task_" + taskId.ToString() + ".zip";
    zip.Save();
}
Thank you for any help!
Given the below code and the large amount of files in an archive, should I compress each file as it is added to the zip?
The way DotNetZip works is to compress each file as it is added to the archive. Your app does not need to do compression. DotNetZip does this for you.
or should the AddDirectory method provide about the same compression?
Entries added to a zip file via the AddDirectory() method go through the same code path when the zip archive is written, as entries added via AddFile(). The file data is compressed, then optionally encrypted, then written to the zip file.
an unsolicited tip: you don't need to do:
zip.AddProgress += new EventHandler<AddProgressEventArgs>(Zip_AddProgress);
you can just do:
zip.AddProgress += Zip_AddProgress;
how are you determining that no compression is occurring?
If you are curious about the compression on each entry, you can register a SaveProgress event handler. The SaveProgress event is fired at various times during the writing of an archive: when saving begins, when DotNetZip begins writing the data for one entry, at various intervals during the writing of one entry, after finishing writing the data for each entry, and after finishing writing all data. These stages are described in the ZipProgressEventType enumeration. When the EventType is Saving_AfterWriteEntry, you can calculate the compression ratio for THAT particular entry.
To verify that compression is not occurring, I'd suggest that you register such a SaveProgress event and look at that compression ratio.
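For illustration, a sketch of such a handler; the event and property names follow the DotNetZip API, while the ratio arithmetic is mine:
// Register before calling zip.Save():
//   zip.SaveProgress += Zip_SaveProgress;
static void Zip_SaveProgress(object sender, SaveProgressEventArgs e)
{
    if (e.EventType == ZipProgressEventType.Saving_AfterWriteEntry)
    {
        long before = e.CurrentEntry.UncompressedSize;
        long after = e.CurrentEntry.CompressedSize;
        double ratio = before == 0 ? 0 : 100.0 * (before - after) / before;
        Console.WriteLine("{0}: {1:N1}% saved", e.CurrentEntry.FileName, ratio);
    }
}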
Also, as described above, some file types cannot be compressed. JPG, MPG, MP3, ZIP files, and others are not very compressible.
Finally, doing a backup may be a lot easier if you just use the DotNetZip command-line tool. If all you want to do is back up a particular directory, you could use the command-line tool (zipit.exe) and avoid writing a program. With the zipit.exe tool, if you use the -v option, the tool prints progress reports and will display the compression for each entry, via the mechanism I described above. Even if you prefer to write your own program, you might consider using zipit.exe to verify that compression is, or is not, occurring when you use DotNetZip.
I'm not sure I have understood your question, but the maximum size for a classic zip file (without zip64 extensions) is 4 GB. Maybe you have to create a new ZipFile every time you reach that limit.
Sorry if that doesn't help you.
What sort of data are you compressing? Some sorts of data just don't compress very well, for example JPEGs, or ZIP files which are already compressed.

How to determine file type?

I need to know if my file is an audio file: mp3, wav, etc...
How to do this?
Well, the most robust way would be to write a parser for the file types you want to detect and then just try it – if there are no errors, the file is obviously of the type you tried. This is an expensive approach, but it ensures that you can successfully load the file as well, since it also checks the rest of the file for semantic soundness.
A much less expensive variant would be to look for "magic" bytes – signatures at the start or at known offsets of the file. For example, if a file starts with an ID3 tag you can be reasonably sure it's an MP3 file. If a file starts with RIFF, followed by a four-byte size and then WAVEfmt, it's a WAV file. However, such detection cannot guarantee that the file is really of that type – it could be just the signature, followed by garbage.
While you can use the extension to make a reasonable guess as to what the file is, it's not guaranteed to work 100% of the time. If you are targeting Windows, then it will work 99.9% of the time, as that's how Windows keeps track of which file is which type.
If you are getting your files from non-Windows sources the only sure way is to open the file and look for a specific string or set of bytes which will unambiguously identify it. For example, you could look for the ID3 tags in an mp3 file:
The ID3v1 tag occupies 128 bytes, beginning with the string TAG.
or
ID3v2 tags are of variable size, and usually occur at the start of the file, to aid streaming media.
How far you go depends on how robust you want your solution to be, and does rely on there being a header or pattern that's always present.
Doing it this way can help guard against malicious content where someone posts a piece of malware as a mp3 file (say) and hopes that it will just be run by a program prone to some exploit (a buffer overrun perhaps).
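A sketch of that check, looking for an ID3v2 header at the start or an ID3v1 tag 128 bytes from the end. Note that an MP3 with no tags at all would pass neither test, so this proves presence, not absence:
using System;
using System.IO;
class Mp3Sniff
{
    static bool HasId3Tag(string path)
    {
        using (var fs = File.OpenRead(path))
        {
            var head = new byte[3];
            if (fs.Read(head, 0, 3) == 3 &&
                head[0] == 'I' && head[1] == 'D' && head[2] == '3')
                return true; // ID3v2 tag at the start of the file

            if (fs.Length >= 128)
            {
                // ID3v1 tag: the last 128 bytes, beginning with "TAG".
                fs.Seek(-128, SeekOrigin.End);
                var tail = new byte[3];
                fs.Read(tail, 0, 3);
                return tail[0] == 'T' && tail[1] == 'A' && tail[2] == 'G';
            }
        }
        return false;
    }
}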
You can use the file extension to figure it out:
using System;
using System.IO;

class Program
{
    static void Main()
    {
        string filepath = @"C:\Users\Sam\Documents\Test.txt";
        string extension = Path.GetExtension(filepath);
        if (extension == ".mp3")
        {
            Console.WriteLine(extension);
        }
    }
}
The file extension is the first port of call for the OS to figure out what file type it's dealing with. If you really want to know the file type for certain, the only way is to read into the file. But this comes with a catch: image files are easy, as they include headers in a pretty easy-to-read format, but it can get a little more complex with a completely variable file type.
You could check out this old post for a bit of help. Here is a post about finding just media file types.
Ultimately it depends on why you're trying to do this.
Path.GetExtension(PathToFile)
See this post. You end up passing the first (up to) 256 bytes of data from the file to FindMimeFromData (part of the Urlmon.dll).
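For illustration, a hedged P/Invoke sketch of that approach; the marshalling shown here is one common variant, not the only one:
using System;
using System.IO;
using System.Runtime.InteropServices;
class MimeSniffer
{
    [DllImport("urlmon.dll", CharSet = CharSet.Unicode, ExactSpelling = true)]
    static extern int FindMimeFromData(IntPtr pBC, string pwzUrl, byte[] pBuffer,
        int cbSize, string pwzMimeProposed, int dwMimeFlags,
        out IntPtr ppwzMimeOut, int dwReserved);

    static string GetMimeType(string path)
    {
        // Pass the first (up to) 256 bytes of the file to urlmon's sniffer.
        byte[] buffer = new byte[256];
        int read;
        using (var fs = File.OpenRead(path))
            read = fs.Read(buffer, 0, buffer.Length);

        IntPtr mimePtr;
        FindMimeFromData(IntPtr.Zero, null, buffer, read, null, 0, out mimePtr, 0);
        string mime = Marshal.PtrToStringUni(mimePtr);
        Marshal.FreeCoTaskMem(mimePtr); // the string is CoTaskMem-allocated
        return mime;
    }
}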

Need help manipulating WAV (RIFF) Files at a byte level

I'm writing an application in C# that will record audio files (*.wav) and automatically tag and name them. Wave files are RIFF files (like AVI) which can contain metadata chunks in addition to the waveform data chunks. So now I'm trying to figure out how to read and write the RIFF metadata to and from recorded wave files.
I'm using NAudio for recording the files, and I asked on their forums as well as on SO for a way to read and write RIFF tags. While I received a number of good answers, none of the solutions allowed for reading and writing RIFF chunks as easily as I would like.
But more importantly, I have very little experience dealing with files at a byte level, and I think this could be a good opportunity to learn. So now I want to try writing my own class(es) that can read in a RIFF file and allow metadata to be read from, and written to, the file.
I've used streams in C#, but always with the entire stream at once. So now I'm a little lost now that I have to consider a file byte by byte. Specifically, how would I go about removing or inserting bytes to and from the middle of a file? I've tried reading a file through a FileStream into a byte array (byte[]) as shown in the code below.
System.IO.FileStream waveFileStream = System.IO.File.OpenRead(@"C:\sound.wav");
byte[] waveBytes = new byte[waveFileStream.Length];
waveFileStream.Read(waveBytes, 0, waveBytes.Length);
And I could see through the Visual Studio debugger that the first four bytes are the RIFF header of the file.
But arrays are a pain to deal with when performing actions that change their size, like inserting or removing values. So I was thinking I could then convert the byte[] into a List like this.
List<byte> list = waveBytes.ToList<byte>();
Which would make any manipulation of the file byte by byte a whole lot easier, but I'm worried I might be missing something like a class in the System.IO namespace that would make all this even easier. Am I on the right track, or is there a better way to do this? I should also mention that I'm not hugely concerned with performance, and would prefer not to deal with pointers or unsafe code blocks like this guy.
If it helps at all here is a good article on the RIFF/WAV file format.
I don't write in C#, but I can point out some places that look problematic from my point of view:
1) Do not read whole WAV files into memory unless the files are your own and known to be small.
2) There is no need to insert data in memory. You can simply do roughly the following: analyze the source file, store the offsets of the chunks, and read the metadata into memory; present the metadata for editing in a dialog; when saving, write the RIFF-WAV header and the fmt chunk, transfer the audio data from the source file (by reading and writing blocks), add the metadata, and finally update the RIFF-WAV header. (See the sketch after this list.)
3) Try saving the metadata at the tail of the file. Then altering only the tag will not require rewriting the whole file.
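Here is a minimal sketch of the first part of point 2: walking the chunks of a RIFF file and recording their IDs, offsets, and sizes without loading the audio data:
using System;
using System.IO;
class RiffWalker
{
    static void ListChunks(string path)
    {
        using (var fs = File.OpenRead(path))
        using (var br = new BinaryReader(fs))
        {
            string riff = new string(br.ReadChars(4));  // should be "RIFF"
            uint riffSize = br.ReadUInt32();            // file size minus 8
            string wave = new string(br.ReadChars(4));  // should be "WAVE"

            while (fs.Position + 8 <= fs.Length)
            {
                string chunkId = new string(br.ReadChars(4)); // "fmt ", "data", "LIST", ...
                uint chunkSize = br.ReadUInt32();
                Console.WriteLine("{0} at offset {1}, {2} bytes", chunkId, fs.Position, chunkSize);
                // Chunks are word-aligned: skip a pad byte if the size is odd.
                fs.Seek(chunkSize + (chunkSize & 1), SeekOrigin.Current);
            }
        }
    }
}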
It seems some sources regarding working with RIFF files in C# are present here.
