I am writing a console application which iterates through a binary tree and searches for new or changed files based on their MD5 checksums.
The whole process is acceptably fast (14 s for ~70,000 files), but generating the checksums takes about 5 minutes, which is far too slow...
Any suggestions for improving this process? My hash function is the following:
private string getMD5(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    {
        if (File.Exists(filename))
        {
            try
            {
                var buffer = md5.ComputeHash(File.ReadAllBytes(filename));
                var sb = new StringBuilder();
                for (var i = 0; i < buffer.Length; i++)
                {
                    sb.Append(buffer[i].ToString("x2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                Program.logger.log("Error while creating checksum!", Program.logger.LOG_ERROR);
                return "";
            }
        }
        else
        {
            return "";
        }
    }
}
Well, the accepted answer is not quite right, because there are, of course, ways to improve your code's performance. It is valid for some of its other points, however.
The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:
Do not read the entire file into memory for the calculation; it is slow, and it produces a lot of memory pressure via LOH objects. Instead, open the file as a stream and calculate the hash in chunks.
The reason you see a slowdown when using the ComputeHash stream override is that internally it uses a very small buffer (4 KB), so choose an appropriate buffer size (256 KB or more; the optimal value is to be found by experimenting).
Use the TransformBlock and TransformFinalBlock functions to calculate the hash value; you can pass null for the outputBuffer parameter (see the sketch after this list).
Reuse that buffer for the following files' hash calculations, so there is no need for additional allocations.
Additionally, you can reuse the MD5CryptoServiceProvider instance, but the benefits are questionable.
You can also apply an asynchronous pattern for reading chunks from the stream, so the OS reads the next chunk from disk while you calculate the partial hash of the previous chunk. Such code is harder to write, and you'll need at least two buffers (reuse them as well), but it can have a great impact on speed.
As a minor improvement, do not check for file existence. I believe your function is called from some enumeration, and there is very little chance the file is deleted in the meantime.
All of the above is valid for medium to large files. If, instead, you have a lot of very small files, you can speed up the calculation by processing files in parallel. Actually, parallelization can also help with large files, but that is something to measure.
And finally, if collisions don't bother you too much, you can choose a less expensive hash algorithm, CRC for example.
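A minimal sketch of the chunked TransformBlock approach described above (the 1 MB buffer size is an arbitrary starting point to be tuned by measurement, and the static fields are only there to illustrate reuse; they are not thread-safe, so give each worker its own instance and buffer if you parallelize):

// requires: using System.IO; using System.Security.Cryptography; using System.Text;
private static readonly MD5 md5 = MD5.Create();                      // reused across files
private static readonly byte[] chunkBuffer = new byte[1024 * 1024];  // reused read buffer

private static string GetMD5Chunked(string filename)
{
    using (var stream = new FileStream(filename, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, FileOptions.SequentialScan))
    {
        int bytesRead;
        while ((bytesRead = stream.Read(chunkBuffer, 0, chunkBuffer.Length)) > 0)
        {
            // feed one chunk into the hash; outputBuffer can be null
            md5.TransformBlock(chunkBuffer, 0, bytesRead, null, 0);
        }
        md5.TransformFinalBlock(chunkBuffer, 0, 0); // finish the hash

        var sb = new StringBuilder(32);
        foreach (var b in md5.Hash)
            sb.Append(b.ToString("x2"));

        md5.Initialize(); // reset the instance so it can be reused for the next file
        return sb.ToString();
    }
}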
In order to create the hash, you have to read every last byte of the file. So this operation is disk-limited, not CPU-limited, and scales proportionally to the size of the files. Multithreading will not help.
Unless the FS can somehow calculate and store the hash for you, there is just no way to speed this up. You are dependent on what the FS does for you to track changes.
Generally, programs that check for "changed files" (like backup routines) do not calculate the hash value, for exactly that reason. They may still calculate and store it for validation purposes, but that is it.
Unless the user does some serious (NTFS-driver-loading level) sabotage, the "last changed" date together with the file size is enough to detect changes. Maybe also check the archive bit, but that one is rarely used nowadays.
A minor improvement for this kind of scenario (list files and process them) is using EnumerateFiles rather than listing files. But at 14 seconds of listing versus 5 minutes of processing, that will just not have any relevant effect.
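To illustrate the metadata check and EnumerateFiles mentioned above, here is a rough sketch; FileRecord and the snapshot dictionary are hypothetical stand-ins for whatever the application's tree actually stores:

// requires: using System; using System.Collections.Generic; using System.IO;
class FileRecord   // hypothetical snapshot entry kept from the previous run
{
    public long Length;
    public DateTime LastWriteTimeUtc;
}

static IEnumerable<string> FindChangedFiles(string root, IDictionary<string, FileRecord> snapshot)
{
    // EnumerateFiles streams results instead of building the whole list up front
    foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
    {
        var info = new FileInfo(path);
        FileRecord old;
        if (!snapshot.TryGetValue(path, out old) ||
            old.Length != info.Length ||
            old.LastWriteTimeUtc != info.LastWriteTimeUtc)
        {
            yield return path; // new or changed; hash only these if a checksum is still required
        }
    }
}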
Previously I asked a question about combining SHA1+MD5, but after that I understood that calculating SHA1 and then MD5 of a large file is not that much faster than SHA256.
In my case, a 4.6 GB file takes about 10 minutes with the default SHA256 implementation (C# Mono) on a Linux system.
public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
Then I read this topic and changed my code according to what they said, to:
public static string GetChecksumBuffered(Stream stream)
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}
But it doesn't make much of a difference and still takes about 9 minutes.
Then I tested the same file with the sha256sum command in Linux, and it takes about 28 seconds; both the above code and the Linux command give the same result!
Someone advised me to read about the differences between a hash code and a checksum, and I reached this topic that explains the differences.
My questions are:
1. What causes such a difference in time between the above code and the Linux sha256sum?
2. What does the above code do? (I mean, is it a hash-code calculation or a checksum calculation? If you search for how to get the hash code of a file or the checksum of a file in C#, both lead to the above code.)
3. Is there any motivated attack against sha256sum, even though SHA256 is collision resistant?
4. How can I make my implementation as fast as sha256sum in C#?
public string SHA256CheckSum(string filePath)
{
    using (SHA256 SHA256 = SHA256Managed.Create())
    {
        using (FileStream fileStream = File.OpenRead(filePath))
            return Convert.ToBase64String(SHA256.ComputeHash(fileStream));
    }
}
My best guess is that there's some additional buffering in the Mono implementation of the File.Read operation. Having recently looked into checksums on a large file, on a decent-spec Windows machine you should expect roughly 6 seconds per GB if all is running smoothly.
Oddly it has been reported in more than one benchmark test that SHA-512 is noticeably quicker than SHA-256 (see 3 below). One other possibility is that the problem is not in allocating the data, but in disposing of the bytes once read. You may be able to use TransformBlock (and TransformFinalBlock) on a single array rather than reading the stream in one big gulp—I have no idea if this will work, but it bears investigating.
The difference between hashcode and checksum is (nearly) semantics. They both calculate a shorter 'magic' number that is fairly unique to the data in the input, though if you have 4.6GB of input and 64B of output, 'fairly' is somewhat limited.
A checksum is not secure, and with a bit of work you can figure out the input from enough outputs, work backwards from output to input and do all sorts of insecure things.
A Cryptographic hash takes longer to calculate, but changing just one bit in the input will radically change the output and for a good hash (e.g. SHA-512) there's no known way of getting from output back to input.
MD5 is broken: collisions can be fabricated on a PC, so an attacker can craft two different inputs with the same hash. SHA-256 is (probably) still secure, but won't be in a few years' time; if your project has a lifespan measured in decades, then assume you'll need to change it. SHA-512 has no known attacks and probably won't for quite a while, and since it's quicker than SHA-256 I'd recommend it anyway. Benchmarks show it takes about 3 times longer to calculate SHA-512 than MD5, so if your speed issue can be dealt with, it's the way to go.
No idea, beyond those mentioned above. You're doing it right.
For a bit of light reading, see Crypto.SE: SHA512 is faster than SHA256?
Edit in response to question in comment
The purpose of a checksum is to allow you to check if a file has changed between the time you originally wrote it, and the time you come to use it. It does this by producing a small value (512 bits in the case of SHA512) where every bit of the original file contributes at least something to the output value. The purpose of a hashcode is the same, with the addition that it is really, really difficult for anyone else to get the same output value by making carefully managed changes to the file.
The premise is that if the checksums are the same at the start and when you check it, then the files are the same, and if they're different the file has certainly changed. What you are doing above is feeding the file, in its entirety, through an algorithm that rolls, folds and spindles the bits it reads to produce the small value.
As an example: in the application I'm currently writing, I need to know if parts of a file of any size have changed. I split the file into 16K blocks, take the SHA-512 hash of each block, and store it in a separate database on another drive. When I come to see if the file has changed, I reproduce the hash for each block and compare it to the original. Since I'm using SHA-512, the chances of a changed file having the same hash are unimaginably small, so I can be confident of detecting changes in 100s of GB of data whilst only storing a few MB of hashes in my database. I'm copying the file at the same time as taking the hash, and the process is entirely disk-bound; it takes about 5 minutes to transfer a file to a USB drive, of which 10 seconds is probably related to hashing.
Lack of disk space to store hashes is a problem I can't solve in a post—buy a USB stick?
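As a rough illustration of the block-hashing scheme described above, a minimal sketch (assumptions: a 16 KB block size as in the description, and the per-block hashes are simply returned; the real application would persist them to its database instead):

// requires: using System.Collections.Generic; using System.IO; using System.Security.Cryptography;
static List<byte[]> HashBlocks(string path, int blockSize = 16 * 1024)
{
    var blockHashes = new List<byte[]>();
    using (var sha = SHA512.Create())
    using (var stream = File.OpenRead(path))
    {
        var buffer = new byte[blockSize];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // hash each block independently so a change can be localized to one block
            blockHashes.Add(sha.ComputeHash(buffer, 0, read));
        }
    }
    return blockHashes;
}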
Way late to the party but seeing as none of the answers mentioned it, I wanted to point out:
SHA256Managed is an implementation of the System.Security.Cryptography.HashAlgorithm class, and all of the functionality related to the read operations are handled in the inherited code.
HashAlgorithm.ComputeHash(Stream) uses a fixed 4096 byte buffer to read data from a stream. As a result, you're not really going to see much difference using a BufferedStream for this call.
HashAlgorithm.ComputeHash(byte[]) operates on the entire byte array, but it resets the internal state after every call, so it can't be used to incrementally hash a buffered stream.
Your best bet would be to use a third party implementation that's optimized for your use case.
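If a third-party library isn't an option, one built-in alternative (not mentioned in the answer above, and only available on newer .NET versions) is IncrementalHash, which lets you feed chunks of whatever size you choose instead of relying on ComputeHash's 4096-byte internal buffer; a minimal sketch:

// requires: using System; using System.IO; using System.Security.Cryptography;
public static string GetChecksumIncremental(string file, int bufferSize = 1024 * 1024)
{
    using (var hash = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
    using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize, FileOptions.SequentialScan))
    {
        var buffer = new byte[bufferSize]; // much larger than the default 4096-byte reads
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            hash.AppendData(buffer, 0, read);
        return BitConverter.ToString(hash.GetHashAndReset()).Replace("-", string.Empty);
    }
}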
using (SHA256 SHA256 = SHA256Managed.Create())
{
    using (FileStream fileStream = System.IO.File.OpenRead(filePath))
    {
        string result = "";
        foreach (var hash in SHA256.ComputeHash(fileStream))
        {
            result += hash.ToString("x2");
        }
        return result;
    }
}
For Reference: https://www.c-sharpcorner.com/article/how-to-convert-a-byte-array-to-a-string/
I want a fast way in C# to remove blocks of bytes at different places from a binary file of size between 500 MB and 1 GB. The start offsets and the lengths of the bytes to be removed are stored in arrays:
int[] rdiDataOffset= {511,15423,21047};
int[] rdiDataSize={102400,7168,512};
EDIT:
This is a piece of my code, and it will not work correctly unless I set the buffer size to 1:
while (true)
{
    if (rdiDataOffset.Contains((int)fsr.Position))
    {
        int idxval = Array.IndexOf(rdiDataOffset, (int)fsr.Position, 0, rdiDataOffset.Length);
        int oldRFSRPosition = (int)fsr.Position;
        size = rdiDataSize[idxval];
        fsr.Seek(size, SeekOrigin.Current);
    }
    int bufferSize = size == 0 ? 2048 : size;
    if ((size > 0) && (bufferSize > size)) bufferSize = size;
    if (bufferSize > (fsr.Length - fsr.Position)) bufferSize = (int)(fsr.Length - fsr.Position);
    byte[] buffer = new byte[bufferSize];
    int nofbytes = fsr.Read(buffer, 0, buffer.Length);
    fsr.Flush();
    if (nofbytes < 1)
    {
        break;
    }
}
No common file system provides an efficient way to remove chunks from the middle of an existing file (only truncate from the end). You'll have to copy all the data after the removal back to the appropriate new location.
Here is a simple algorithm for doing this using a temp file (it could be done in place as well, but that is riskier if things go wrong).
1. Create a new file and call SetLength to set the stream size (if this is too slow you can Interop to SetFileValidData). This ensures that you have room for your temp file while you are doing the copy.
2. Sort your removal list in ascending order.
3. Read from the current location (starting at 0) to the first removal point. The source file should be opened without granting Write share permissions (you don't want someone mucking with it while you are editing it).
4. Write that content to the new file (you will likely need to do this in chunks).
5. Skip over the data not being copied.
6. Repeat from step 3 until done (a sketch of this loop follows below).
You now have two files - the old one and the new one... replace as necessary. If this is really critical data you might want to look at a transactional approach (either one you implement or using something like NTFS transactions).
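A minimal sketch of the copy-and-skip loop described in the steps above, assuming the offset/size arrays from the question and a 64 KB copy buffer:

// requires: using System; using System.IO; using System.Linq;
static void RemoveBlocks(string sourcePath, string destPath, int[] offsets, int[] sizes)
{
    // step 2: pair up and sort the removal ranges in ascending order
    var ranges = offsets.Zip(sizes, (o, s) => new { Offset = (long)o, Size = (long)s })
                        .OrderBy(r => r.Offset)
                        .ToList();

    using (var src = new FileStream(sourcePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var dst = new FileStream(destPath, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        dst.SetLength(src.Length - ranges.Sum(r => r.Size)); // step 1: pre-size the target

        var buffer = new byte[64 * 1024];
        foreach (var range in ranges)
        {
            CopyBytes(src, dst, range.Offset - src.Position, buffer); // steps 3-4
            src.Seek(range.Size, SeekOrigin.Current);                 // step 5: skip the removed block
        }
        CopyBytes(src, dst, src.Length - src.Position, buffer);       // copy the remaining tail
    }
}

static void CopyBytes(Stream src, Stream dst, long count, byte[] buffer)
{
    while (count > 0)
    {
        int read = src.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
        if (read <= 0) break;
        dst.Write(buffer, 0, read);
        count -= read;
    }
}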
Consider a new design. If this is something you need to do frequently then it might make more sense to have an index in the file (or near the file) which contains a list of inactive blocks - then when necessary you can compress the file by actually removing blocks ... or maybe this IS that process.
If you're on the NTFS file system (most Windows deployments are) and you don't mind doing p/invoke methods, then there is a way, way faster way of deleting chunks from a file. You can make the file sparse. With sparse files, you can eliminate a large chunk of the file with a single call.
When you do this, the file is not rewritten. Instead, NTFS updates metadata about the extents of zeroed-out data. The beauty of sparse files is that consumers of your file don't have to be aware of the file's sparseness. That is, when you read from a FileStream over a sparse file, zeroed-out extents are transparently skipped.
NTFS uses such files for its own bookkeeping. The USN journal, for example, is a very large sparse memory-mapped file.
The way you make a file sparse and zero out sections of that file is to use the DeviceIoControl Windows API. It is arcane and requires p/invoke, but if you go this route you'll surely hide the ugliness behind nice, pretty function calls.
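A hedged sketch of the p/invoke plumbing (the control codes are the documented FSCTL values; the stream must be opened with write access, and error handling and return-value checks are omitted here, so treat this as a starting point rather than production code):

// requires: using System; using System.IO; using System.Runtime.InteropServices; using Microsoft.Win32.SafeHandles;
[StructLayout(LayoutKind.Sequential)]
struct FILE_ZERO_DATA_INFORMATION
{
    public long FileOffset;       // start of the range to zero out
    public long BeyondFinalZero;  // first byte after the range
}

const uint FSCTL_SET_SPARSE = 0x000900C4;
const uint FSCTL_SET_ZERO_DATA = 0x000980C8;

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool DeviceIoControl(SafeFileHandle hDevice, uint dwIoControlCode,
    IntPtr inBuffer, int inBufferSize, IntPtr outBuffer, int outBufferSize,
    out int bytesReturned, IntPtr overlapped);

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool DeviceIoControl(SafeFileHandle hDevice, uint dwIoControlCode,
    ref FILE_ZERO_DATA_INFORMATION inBuffer, int inBufferSize, IntPtr outBuffer, int outBufferSize,
    out int bytesReturned, IntPtr overlapped);

static void ZeroRange(FileStream fs, long offset, long length)
{
    int returned;

    // mark the file sparse once; zeroed ranges then become holes instead of written zeros
    DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_SPARSE,
        IntPtr.Zero, 0, IntPtr.Zero, 0, out returned, IntPtr.Zero);

    var range = new FILE_ZERO_DATA_INFORMATION
    {
        FileOffset = offset,
        BeyondFinalZero = offset + length
    };
    DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_ZERO_DATA,
        ref range, Marshal.SizeOf(typeof(FILE_ZERO_DATA_INFORMATION)),
        IntPtr.Zero, 0, out returned, IntPtr.Zero);
}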
There are some issues to be aware of. For example, if the file is moved to a non-NTFS volume and then back, the sparseness of the file can disappear - so you should program defensively.
Also, a sparse file can appear to be larger than it really is - complicating tasks involving disk provisioning. A 5 GB sparse file that has been completely zeroed out still counts 5 GB towards a user's disk quota.
If a sparse file accumulates a lot of holes, you might want to occasionally rewrite the file in a maintenance window. I haven't seen any real performance troubles occur, but I can at least imagine that the metadata for a swiss-cheesy sparse file might accrue some performance degradation.
Here's a link to some doc if you're into the idea.
I have a list of files: List<string> Files in my C#-based WPF application.
Files contains ~1,000,000 unique file paths.
I ran a profiler on my application. When I try to do parallel operations, it's REALLY laggy because it's IO bound. It even lags my UI threads, despite not having dispatchers going to them (note the two lines I've marked below):
Files.AsParallel().ForAll(x =>
{
    char[] buffer = new char[0x100000];
    using (FileStream stream = new FileStream(x, FileMode.Open, FileAccess.Read)) // EXTREMELY SLOW
    using (StreamReader reader = new StreamReader(stream, true))
    {
        while (true)
        {
            int bytesRead = reader.Read(buffer, 0, buffer.Length); // EXTREMELY SLOW
            if (bytesRead <= 0)
            {
                break;
            }
        }
    }
});
These two lines of code take up ~70% of my entire profile test runs. I want to achieve maximum parallelization for IO, while keeping performance such that it doesn't cripple my app's UI entirely. There is nothing else affecting my performance. Proof: Using Files.ForEach doesn't cripple my UI, and WithDegreeOfParallelism helps out too (but, I am writing an application that is supposed to be used on any PC, so I cannot assume a specific degree of parallelism for this computation); also, the PC I am on has a solid-state hard disk. I have searched on StackOverflow, and have found links that talk about using asynchronous IO read methods. I'm not sure how they apply in this case, though. Perhaps someone can shed some light? Also; how can you tune down the constructor time of a new FileStream; is that even possible?
Edit: Well, here's something strange that I've noticed... the UI doesn't get crushed so badly when I swap Read for ReadAsync while still using AsParallel. Simply awaiting the task created by ReadAsync causes my UI thread to maintain some degree of usability. I think this method does some sort of asynchronous scheduling to maintain optimal disk usage while not crushing existing threads. And on that note, is there ever a chance that the operating system contends for existing threads to do IO, such as my application's UI thread? I seriously don't understand why it's slowing my UI thread. Is the OS scheduling work from IO on my thread or something? Did they do something to the CLR to eat threads that haven't been explicitly affinitized using Thread.BeginThreadAffinity or something? Memory is not an issue; I am looking at Task Manager and there is plenty.
I don't agree with your assertion that you can't use WithDegreeOfParallelism because the application will be used on any PC. You can base it on the number of CPUs. By not using WithDegreeOfParallelism you are going to get crushed on some PCs.
You optimized for a solid-state disk where heads don't have to move. I don't think this unrestricted parallel design will hold up on a regular disk (any PC).
I would try a BlockingCollection with 3 queues: FileStream, StreamReader, and ObservableCollection. Limit the FileStream queue to something like 4 - it just has to stay ahead of the StreamReader. And no parallelism.
A single head is a single head. It cannot read from 5 or 5000 files faster than it can read from 1. On solid state there is no penalty for switching from file to file - on a regular disk there is a significant penalty. If your files are fragmented there is a significant penalty (on a regular disk).
You don't show what you do with the data, but the next step would be to put the writing into yet another queue, with another BlockingCollection.
E.g. put sb.Append(text); in a separate queue.
But that may be more overhead than it is worth.
Keeping that head as close to 100% busy as possible on a single contiguous file is the best you are going to do.
private async Task<string> ReadTextAsync(string filePath)
{
    using (FileStream sourceStream = new FileStream(filePath,
        FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 4096, useAsync: true))
    {
        StringBuilder sb = new StringBuilder();
        byte[] buffer = new byte[0x1000];
        int numRead;
        while ((numRead = await sourceStream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            string text = Encoding.Unicode.GetString(buffer, 0, numRead);
            sb.Append(text);
        }
        return sb.ToString();
    }
}
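A minimal sketch of the bounded-queue idea from the answer above, simplified to a single bounded BlockingCollection (the capacity of 4, File.ReadAllText, and the onFileRead callback are placeholders; the point is that only a few files are in flight at once, read sequentially, instead of everything being read in parallel):

// requires: using System; using System.Collections.Concurrent; using System.Collections.Generic;
//           using System.IO; using System.Threading.Tasks;
static void ProcessFiles(IEnumerable<string> files, Action<string, string> onFileRead)
{
    // at most 4 file contents queued ahead of the consumer, as suggested above
    using (var contents = new BlockingCollection<KeyValuePair<string, string>>(4))
    {
        var reader = Task.Run(() =>
        {
            foreach (var path in files)
            {
                // a single sequential reader keeps the disk head on one file at a time
                contents.Add(new KeyValuePair<string, string>(path, File.ReadAllText(path)));
            }
            contents.CompleteAdding();
        });

        foreach (var item in contents.GetConsumingEnumerable())
        {
            onFileRead(item.Key, item.Value); // e.g. parse here, then dispatch results to the UI
        }

        reader.Wait();
    }
}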
File access is inherently not parallel. You can only benefit from parallelism if you process some files while reading others. It makes no sense to wait for the disk in parallel.
Instead of waiting 100,000 times for 1 ms of disk access, your program ends up waiting once for 100,000 ms = 100 s.
Unfortunately, it's a vague question without a reproducible code example. So it's impossible to offer specific advice. But my two recommendations are:
Pass a ParallelOptions instance where you've set the MaxDegreeOfParallelism property to something reasonably low. Something like the number of cores in your system, or even that number minus one.
Make sure you aren't expecting too much from the disk. You should start with the known speed of the disk and controller, and compare that with the data throughput you're getting. Adjust the degree of parallelism even lower if it looks like you're already at or near the maximum theoretical throughput.
Performance optimization is all about setting realistic goals based on known limitations of the hardware, measuring your actual performance, and then looking at how you can improve the costliest elements of your algorithm. If you haven't done the first two steps yet, you really should start there. :)
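As a concrete illustration of the first recommendation, a small sketch with Parallel.ForEach (the PLINQ equivalent would be WithDegreeOfParallelism on the AsParallel query); ProcessFile is a placeholder for whatever is done with each file:

// requires: using System; using System.Threading.Tasks;
var options = new ParallelOptions
{
    // leave one core for the UI thread; tune even lower if the disk is the real bottleneck
    MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
};

Parallel.ForEach(Files, options, path =>
{
    ProcessFile(path); // placeholder for the per-file work
});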
I got it working; the problem was me trying to use an ExtendedObservableCollection with AddRange instead of calling Add multiple times in every UI dispatch... for some reason, the performance of the methods people list in "ObservableCollection doesn't support AddRange method, so I get notified for each item added, besides what about INotifyCollectionChanging?" is actually slower in my situation.
I think because it forces you to call change notifications with .Reset (reload) instead of .Add (a diff), there is some sort of logic in place that causes bottlenecks.
I apologize for not posting the rest of the code; I was really thrown off by this, and I'll explain why in a moment. Also, a note for others who come across the same issue, this might help. The main problem with profiling tools in this scenario is that they don't help much here. Most of your app's time will be spent reading files regardless. So you have to unit test all dispatchers separately.
Motivated by this answer, I was wondering what's going on under the hood if one uses lots of FileStream.Seek(-1) calls.
For clarity I'll repost the answer:
using (var fs = File.OpenRead(filePath))
{
    fs.Seek(0, SeekOrigin.End);
    int newLines = 0;
    while (newLines < 3)
    {
        fs.Seek(-1, SeekOrigin.Current);
        newLines += fs.ReadByte() == 13 ? 1 : 0; // look for \r
        fs.Seek(-1, SeekOrigin.Current);
    }
    byte[] data = new byte[fs.Length - fs.Position];
    fs.Read(data, 0, data.Length);
}
Personally I would have read like 2048 bytes into a buffer and searched that buffer for the char.
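A rough sketch of that buffered approach (assumptions: the last three '\r' characters fall within the final 2048 bytes, '\r' (13) is the marker as in the original snippet, and partial reads are ignored for brevity):

// requires: using System; using System.IO;
using (var fs = File.OpenRead(filePath))
{
    int tailSize = (int)Math.Min(2048, fs.Length);
    var tail = new byte[tailSize];
    fs.Seek(-tailSize, SeekOrigin.End);
    fs.Read(tail, 0, tailSize);          // one read instead of one call per byte

    int newLines = 0;
    int start = tailSize;                // index of the first byte to keep
    for (int i = tailSize - 1; i >= 0 && newLines < 3; i--)
    {
        if (tail[i] == 13) newLines++;   // look for \r, as in the original
        start = i;
    }

    byte[] data = new byte[tailSize - start];
    Array.Copy(tail, start, data, 0, data.Length);
}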
Using Reflector I found out that internally the method is using SetFilePointer.
Is there any documentation about windows caching and reading a file backwards? Does Windows buffer "backwards" and consult the buffer when using consecutive Seek(-1) or will it read ahead starting from the current position?
It's interesting that on the one hand most people agree with Windows doing good caching, but on the other hand every answer to "reading file backwards" involves reading chunks of bytes and operating on that chunk.
Going forward vs backward doesn't usually make much difference. The file data is read into the file system cache after the first read, you get a memory-to-memory copy on ReadByte(). That copy isn't sensitive to the file pointer value as long as the data is in the cache. The caching algorithm does however work from the assumption that you'd normally read sequentially. It tries to read ahead, as long as the file sectors are still on the same track. They usually are, unless the disk is heavily fragmented.
But yes, it is inefficient. You'll get hit with two pinvoke and API calls for each individual byte. There's a fair amount of overhead in that, those same two calls could also read, say, 65 kilobytes with the same amount of overhead. As usual, fix this only when you find it to be a perf bottleneck.
Here is a pointer on File Caching in Windows
The behavior may also depend on where the file physically resides (hard disk, network, etc.) as well as on local configuration/optimization.
Another important source of information is the CreateFile API documentation: CreateFile Function.
There is a good section named "Caching Behavior" that tells us how you can influence file caching, at least in the unmanaged world.
I am trying to export a StringDictionary to a text file; it has over one million records, and it takes over 3 minutes to export to a text file if I use a loop.
Is there a way to do that faster?
Regards
Well, it depends on what format you're using for the export, but in general, the biggest overhead for exporting large amounts of data is going to be I/O. You can reduce this by using a more compact data format, and by doing less manipulation of the data in memory (to avoid memory copies) if possible.
The first thing to check is to look at your disk I/O speed and do some profiling of the code that does the writing.
If you're maxing out your disk I/O (e.g., writing at a good percentage of disk speed, which would be many tens of megabytes per second on a modern system), you could consider compressing the data before you write it. This uses more CPU, but you write less to the disk when you do this. This will also likely increase the speed of reading the file, if you have the same bottleneck on the reading side.
If you're maxing out your CPU, you need to do less processing work on the data before writing it. If you're using a serialization library, for example, avoiding that and switching to a simpler, more specialized data format might help. Consider the simplest format you need: probably just a word for the length of the string, followed by the string data itself, repeated for every key and value.
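For the compression route, a minimal sketch assuming a simple tab-separated format and GZipStream (whether this actually wins depends on the data and on whether disk I/O really is the bottleneck):

// requires: using System.Collections; using System.Collections.Specialized;
//           using System.IO; using System.IO.Compression;
using (var file = File.Create("export.txt.gz"))
using (var gzip = new GZipStream(file, CompressionMode.Compress))
using (var writer = new StreamWriter(gzip))
{
    foreach (DictionaryEntry pair in data)   // data is the StringDictionary being exported
    {
        writer.Write((string)pair.Key);
        writer.Write('\t');
        writer.WriteLine((string)pair.Value);
    }
}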
Note that most dictionary constructs don't preserve the insertion order - this often makes them poor choices if you want repeatable file contents, but (depending on the size) we may be able to improve on the time. This (below) takes about 3.5 s (for the export) to write just under 30 MB:
StringDictionary data = new StringDictionary();
Random rand = new Random(123456);
for (int i = 0; i < 1000000; i++)
{
    data.Add("Key " + i, "Value = " + rand.Next());
}

Stopwatch watch = Stopwatch.StartNew();
using (TextWriter output = File.CreateText("foo.txt"))
{
    foreach (DictionaryEntry pair in data)
    {
        output.Write((string)pair.Key);
        output.Write('\t');
        output.WriteLine((string)pair.Value);
    }
    output.Close();
}
watch.Stop();
Obviously the performance will depend on the size of the actual data getting written.