How can I compute two hashes without reading the same file twice?

How can I compute two hashes without reading the same file twice? - c#

I have a program which is going to be used on very large files (current test data is 250GB). I need to be able to calculate both MD5 and SHA1 hashes for these files. Currently my code drops the stream into MD5.Create().ComputeHash(Stream stream), and then the same for SHA1. These, as far as I can tell, read the file in 4096-byte blocks to a buffer internal to the hashing function, until the end of the stream.
The problem is, doing this one after the other takes a VERY long time! Is there any way I can take data into a buffer and provide the buffer to BOTH algorithms before reading a new block into the buffer?
Please explain thoroughly as I'm not an experienced coder.

Sure. You can call TransformBlock repeatedly, and then TransformFinalBlock at the end and then use Hash to get the final hash. So something like:
using (var md5 = MD5.Create()) // Or MD5Cng.Create
using (var sha1 = SHA1.Create()) // Or SHA1Cng.Create
using (var input = File.OpenRead("file.data"))
{
byte[] buffer = new byte[8192];
int bytesRead;
while ((bytesRead = input.Read(buffer, 0, buffer.Length()) > 0)
{
md5.TransformBlock(buffer, 0, bytesRead, buffer, 0);
sha1.TransformBlock(buffer, 0, bytesRead, buffer, 0);
}
// We have to call TransformFinalBlock, but we don't have any
// more data - just provide 0 bytes.
md5.TransformFinalBlock(buffer, 0, 0, buffer, 0);
sha1.TransformFinalBlock(buffer, 0, 0, buffer, 0);
byte[] md5Hash = md5.Hash;
byte[] sha1Hash = sha1.Hash;
}
The MD5Cng.Create and SHA1Cng.Create calls will create wrappers around native implementations which are likely to be faster than the implementations returned by MD5.Create and SHA1.Create, but which will be a bit less portable (e.g. for PCLs).

Related

Handling big file stream (read+write bytes)

The following code do :
Read all bytes from an input file
Keep only part of the file in outbytes
Write the extracted bytes in outputfile
byte[] outbytes = File.ReadAllBytes(sourcefile).Skip(offset).Take(size).ToArray();
File.WriteAllBytes(outfile, outbytes);
But there is a limitation of ~2GB data for each step.
Edit: The extracted bytes size can also be greater than 2GB.
How could I handle big file ? What is the best way to proceed with good performances, regardless of size ?
Thx !

Example to FileStream to take the middle 3 Gb out of a 5 Gb file:
byte[] buffer = new byte{1024*1024];
using(var readFS = File.Open(pathToBigFile))
using(var writeFS = File.OpenWrite(pathToNewFile))
{
readFS.Seek(1024*1024*1024); //seek to 1gb in
for(int i=0; i < 3000; i++){ //3000 times of one megabyte = 3gb
int bytesRead = readFS.Read(buffer, 0, buffer.Length);
writeFS.Write(buffer, 0, bytesRead);
}
}
It's not a production grade code; Read might not read a full megabyte so you'd end up with less than 3Gb - it's more to demonstrate the concept of using two filestreams and reading repeatedly from one and writing repeatedly to the other. I'm sure you can modify it so that it copies an exact number of bytes by keeping track of the total of all the bytesRead in the loop and stopping reading when you have read enough

It is better to stream the data from one file to the other, only loading small parts of it into memory:
public static void CopyFileSection(string inFile, string outFile, long startPosition, long size)
{
// Open the files as streams
using (var inStream = File.OpenRead(inFile))
using (var outStream = File.OpenWrite(outFile))
{
// seek to the start position
inStream.Seek(startPosition, SeekOrigin.Begin);
// Create a variable to track how much more to copy
// and a buffer to temporarily store a section of the file
long remaining = size;
byte[] buffer = new byte[81920];
do
{
// Read the smaller of 81920 or remaining and break out of the loop if we've already reached the end of the file
int bytesRead = inStream.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
if (bytesRead == 0) { break; }
// Write the buffered bytes to the output file
outStream.Write(buffer, 0, bytesRead);
remaining -= bytesRead;
}
while (remaining > 0);
}
}
Usage:
CopyFileSection(sourcefile, outfile, offset, size);
This should have equivalent functionality to your current method without the overhead of reading the entire file, regardless of its size, into memory.
Note: If you're doing this in code that uses async/await, you should change CopyFileSection to be public static async Task CopyFileSection and change inStream.Read and outStream.Write to await inStream.ReadAsync and await outStream.WriteAsync respectively.

Writing and reading bytes to/from filestream

I am working on simulating a filesystem. I am having a difficult time reading and writing bytes to/from the filestream. I am aiming to toggle the first bit to be a '1' indicating that it does in fact have data in it. I have set up a test scenario to represent what I am trying to achieve.
The problem is that it appears to turn the bit on and write it to _FileStream, however, when i go to read it out - I do not see my change.
_Filestream = new FileStream(volumeName, FileMode.Open);
_Filestream.Seek(0, SeekOrigin.Begin);
//Test lines
byte[] testAsBytes = new byte[_DirectoryUnitSize];
testAsBytes[0] = 1;
byte[] newDirectoryByteArray = new byte[_DirectoryUnitSize];
_Filestream.Write(testAsBytes, 0, newDirectoryByteArray.Length);
_Filestream.Flush();
int bytesRead;
byte[] buffer = new byte[64];
char[] charBuffer = new char[64];
List<byte> data = new List<byte>();
while ((bytesRead = _Filestream.Read(buffer, 0, buffer.Length)) > 0) {
if (!string.IsNullOrEmpty(Encoding.ASCII.GetString(buffer, 0, bytesRead))) {
data = Encoding.ASCII.GetBytes(charBuffer, 0, 1).ToList();
}
}

You have serval issues here
1.why you create another byte[] array - newDirectoryByteArray. Seems unnecessary, you can simply write
_Filestream.Write(testAsBytes, 0, testAsBytes.Length);
2. When you write to file the cursor moves along with it, if you want to use the same FileStream to read from a file, you must seek your desired location. Meaning before _Filestream.Read(...) you must call, for example. Filestream.Seek(0, SeekOrigin.Begin);
From MSDN
If the write operation is successful, the current position of the
stream is advanced by the number of bytes written. If an exception
occurs, the current position of the stream is unchanged.
the line data = Encoding.ASCII.GetBytes(charBuffer, 0, 1).ToList(); will always reutns 0 becuse no data is written to charBuffer . Your data is at buffer which will contaion 1 at index 0 and 0 at all other indices
you can see it with Console.WriteLine(buffer[0]); will output 1 and Console.WriteLine(data[0]); will output 0

Decompress file using zlib - stuck at 256kb limit

I hope someone here will be able to help me out with this.
What I'm trying to do is decompress a zlib compressed file in C# using ZlibNet. (I've also tried DotNetZip and SharpZipLib)
The problem that I'm having is that it'll decompress only the first 256kb, or rather the first 262144 bytes.
Here's my Decompress method, taken from here:
public static byte[] Decompress(byte[] gzip)
{
using (var stream = new Ionic.Zlib.ZlibStream(new MemoryStream(gzip), Ionic.Zlib.CompressionMode.Decompress))
{
var outStream = new MemoryStream();
const int size = 999999; //Playing around with various sizes didn't help
byte[] buffer = new byte[size];
int read;
while ((read = stream.Read(buffer, 0, size)) > 0)
{
outStream.Write(buffer, 0, read);
read = 0;
}
return outStream.ToArray();
}
}
Basically, the int (read) gets set to 262144 on the first time the while loop executes, it writes, and then the next pass of the while loop, read gets said to 0, thus making the loop exit and the function return the outStream as an array. (Even though there are still bytes left to be read!)
Thanks in advance to anyone who could help with this!

Upon further inspection of the originally packed data, it turns out that the script responsible for (de)compressing the data in the original application would split the zlib stream of a file into chunks of 262144 bytes each.
This is why the various libraries I tested always stopped at 262144 bytes-- it was the end of the zlib stream, but not the end of the file it was supposed to extract. (Each zlib stream was also seperated by a 32-bit unsigned int that indicated the amount of bytes the next zlib stream would contain)
My only guess is that they did this so that if they had a very large file, they wouldn't need to load all of it into memory for decompression. (But that's just a guess.)

How to perform the FFT to a wave-file using NAudio

I'm working with the NAudio-library and would like to perform the fast fourier transformation to a WaveStream. I saw that NAudio has already built-in the FFT but how do I use it?
I heard i have to use the SampleAggregator class.

You need to read this entire blog article to best understand the following code sample I lifted to ensure the sample is preserved even if the article isn't:
using (WaveFileReader reader = new WaveFileReader(fileToProcess))
{
IWaveProvider stream32 = new Wave16toFloatProvider(reader);
IWaveProvider streamEffect = new AutoTuneWaveProvider(stream32, autotuneSettings);
IWaveProvider stream16 = new WaveFloatTo16Provider(streamEffect);
using (WaveFileWriter converted = new WaveFileWriter(tempFile, stream16.WaveFormat))
{
// buffer length needs to be a power of 2 for FFT to work nicely
// however, make the buffer too long and pitches aren't detected fast enough
// successful buffer sizes: 8192, 4096, 2048, 1024
// (some pitch detection algorithms need at least 2048)
byte[] buffer = new byte[8192];
int bytesRead;
do
{
bytesRead = stream16.Read(buffer, 0, buffer.Length);
converted.WriteData(buffer, 0, bytesRead);
} while (bytesRead != 0 && converted.Length < reader.Length);
}
}
but in short, if you get the WAV file created you can use that sample to convert it to FFT.

How do I copy the contents of one stream to another?

What is the best way to copy the contents of one stream to another? Is there a standard utility method for this?

From .NET 4.5 on, there is the Stream.CopyToAsync method
input.CopyToAsync(output);
This will return a Task that can be continued on when completed, like so:
await input.CopyToAsync(output)
// Code from here on will be run in a continuation.
Note that depending on where the call to CopyToAsync is made, the code that follows may or may not continue on the same thread that called it.
The SynchronizationContext that was captured when calling await will determine what thread the continuation will be executed on.
Additionally, this call (and this is an implementation detail subject to change) still sequences reads and writes (it just doesn't waste a threads blocking on I/O completion).
From .NET 4.0 on, there's is the Stream.CopyTo method
input.CopyTo(output);
For .NET 3.5 and before
There isn't anything baked into the framework to assist with this; you have to copy the content manually, like so:
public static void CopyStream(Stream input, Stream output)
{
byte[] buffer = new byte[32768];
int read;
while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
{
output.Write (buffer, 0, read);
}
}
Note 1: This method will allow you to report on progress (x bytes read so far ...)
Note 2: Why use a fixed buffer size and not input.Length? Because that Length may not be available! From the docs:
If a class derived from Stream does not support seeking, calls to Length, SetLength, Position, and Seek throw a NotSupportedException.

MemoryStream has .WriteTo(outstream);
and .NET 4.0 has .CopyTo on normal stream object.
.NET 4.0:
instream.CopyTo(outstream);

I use the following extension methods. They have optimized overloads for when one stream is a MemoryStream.
public static void CopyTo(this Stream src, Stream dest)
{
int size = (src.CanSeek) ? Math.Min((int)(src.Length - src.Position), 0x2000) : 0x2000;
byte[] buffer = new byte[size];
int n;
do
{
n = src.Read(buffer, 0, buffer.Length);
dest.Write(buffer, 0, n);
} while (n != 0);
}
public static void CopyTo(this MemoryStream src, Stream dest)
{
dest.Write(src.GetBuffer(), (int)src.Position, (int)(src.Length - src.Position));
}
public static void CopyTo(this Stream src, MemoryStream dest)
{
if (src.CanSeek)
{
int pos = (int)dest.Position;
int length = (int)(src.Length - src.Position) + pos;
dest.SetLength(length);
while(pos < length)
pos += src.Read(dest.GetBuffer(), pos, length - pos);
}
else
src.CopyTo((Stream)dest);
}

.NET Framework 4 introduce new "CopyTo" method of Stream Class of System.IO namespace. Using this method we can copy one stream to another stream of different stream class.
Here is example for this.
FileStream objFileStream = File.Open(Server.MapPath("TextFile.txt"), FileMode.Open);
Response.Write(string.Format("FileStream Content length: {0}", objFileStream.Length.ToString()));
MemoryStream objMemoryStream = new MemoryStream();
// Copy File Stream to Memory Stream using CopyTo method
objFileStream.CopyTo(objMemoryStream);
Response.Write("<br/><br/>");
Response.Write(string.Format("MemoryStream Content length: {0}", objMemoryStream.Length.ToString()));
Response.Write("<br/><br/>");

There is actually, a less heavy-handed way of doing a stream copy. Take note however, that this implies that you can store the entire file in memory. Don't try and use this if you are working with files that go into the hundreds of megabytes or more, without caution.
public static void CopySmallTextStream(Stream input, Stream output)
{
using (StreamReader reader = new StreamReader(input))
using (StreamWriter writer = new StreamWriter(output))
{
writer.Write(reader.ReadToEnd());
}
}
NOTE: There may also be some issues concerning binary data and character encodings.

The basic questions that differentiate implementations of "CopyStream" are:
size of the reading buffer
size of the writes
Can we use more than one thread (writing while we are reading).
The answers to these questions result in vastly different implementations of CopyStream and are dependent on what kind of streams you have and what you are trying to optimize. The "best" implementation would even need to know what specific hardware the streams were reading and writing to.

Unfortunately, there is no really simple solution. You can try something like that:
Stream s1, s2;
byte[] buffer = new byte[4096];
int bytesRead = 0;
while (bytesRead = s1.Read(buffer, 0, buffer.Length) > 0) s2.Write(buffer, 0, bytesRead);
s1.Close(); s2.Close();
But the problem with that that different implementation of the Stream class might behave differently if there is nothing to read. A stream reading a file from a local harddrive will probably block until the read operaition has read enough data from the disk to fill the buffer and only return less data if it reaches the end of file. On the other hand, a stream reading from the network might return less data even though there are more data left to be received.
Always check the documentation of the specific stream class you are using before using a generic solution.

There may be a way to do this more efficiently, depending on what kind of stream you're working with. If you can convert one or both of your streams to a MemoryStream, you can use the GetBuffer method to work directly with a byte array representing your data. This lets you use methods like Array.CopyTo, which abstract away all the issues raised by fryguybob. You can just trust .NET to know the optimal way to copy the data.

if you want a procdure to copy a stream to other the one that nick posted is fine but it is missing the position reset, it should be
public static void CopyStream(Stream input, Stream output)
{
byte[] buffer = new byte[32768];
long TempPos = input.Position;
while (true)
{
int read = input.Read (buffer, 0, buffer.Length);
if (read <= 0)
return;
output.Write (buffer, 0, read);
}
input.Position = TempPos;// or you make Position = 0 to set it at the start
}
but if it is in runtime not using a procedure you shpuld use memory stream
Stream output = new MemoryStream();
byte[] buffer = new byte[32768]; // or you specify the size you want of your buffer
long TempPos = input.Position;
while (true)
{
int read = input.Read (buffer, 0, buffer.Length);
if (read <= 0)
return;
output.Write (buffer, 0, read);
}
input.Position = TempPos;// or you make Position = 0 to set it at the start

Since none of the answers have covered an asynchronous way of copying from one stream to another, here is a pattern that I've successfully used in a port forwarding application to copy data from one network stream to another. It lacks exception handling to emphasize the pattern.
const int BUFFER_SIZE = 4096;
static byte[] bufferForRead = new byte[BUFFER_SIZE];
static byte[] bufferForWrite = new byte[BUFFER_SIZE];
static Stream sourceStream = new MemoryStream();
static Stream destinationStream = new MemoryStream();
static void Main(string[] args)
{
// Initial read from source stream
sourceStream.BeginRead(bufferForRead, 0, BUFFER_SIZE, BeginReadCallback, null);
}
private static void BeginReadCallback(IAsyncResult asyncRes)
{
// Finish reading from source stream
int bytesRead = sourceStream.EndRead(asyncRes);
// Make a copy of the buffer as we'll start another read immediately
Array.Copy(bufferForRead, 0, bufferForWrite, 0, bytesRead);
// Write copied buffer to destination stream
destinationStream.BeginWrite(bufferForWrite, 0, bytesRead, BeginWriteCallback, null);
// Start the next read (looks like async recursion I guess)
sourceStream.BeginRead(bufferForRead, 0, BUFFER_SIZE, BeginReadCallback, null);
}
private static void BeginWriteCallback(IAsyncResult asyncRes)
{
// Finish writing to destination stream
destinationStream.EndWrite(asyncRes);
}

For .NET 3.5 and before try :
MemoryStream1.WriteTo(MemoryStream2);

Easy and safe - make new stream from original source:
MemoryStream source = new MemoryStream(byteArray);
MemoryStream copy = new MemoryStream(byteArray);

The following code to solve the issue copy the Stream to MemoryStream using CopyTo
Stream stream = new MemoryStream();
//any function require input the stream. In mycase to save the PDF file as stream
document.Save(stream);
MemoryStream newMs = (MemoryStream)stream;
byte[] getByte = newMs.ToArray();
//Note - please dispose the stream in the finally block instead of inside using block as it will throw an error 'Access denied as the stream is closed'

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How can I compute two hashes without reading the same file twice? - c#

Related

Handling big file stream (read+write bytes)

Writing and reading bytes to/from filestream

Decompress file using zlib - stuck at 256kb limit

How to perform the FFT to a wave-file using NAudio

How do I copy the contents of one stream to another?

Categories

Resources