Processing Huge Files In C#

I have a 4 GB file that I want to perform a byte-based find and replace on. I have written a simple program to do it, but it takes far too long (90+ minutes) to do just one find and replace. A few hex editors I have tried can perform the task in under 3 minutes without loading the entire target file into memory. Does anyone know a method I could use to accomplish the same thing? Here is my current code:
public int ReplaceBytes(string File, byte[] Find, byte[] Replace)
{
    var Stream = new FileStream(File, FileMode.Open, FileAccess.ReadWrite);
    int FindPoint = 0;
    int Results = 0;
    for (long i = 0; i < Stream.Length; i++)
    {
        if (Find[FindPoint] == Stream.ReadByte())
        {
            FindPoint++;
            if (FindPoint > Find.Length - 1)
            {
                Results++;
                FindPoint = 0;
                Stream.Seek(-Find.Length, SeekOrigin.Current);
                Stream.Write(Replace, 0, Replace.Length);
            }
        }
        else
        {
            FindPoint = 0;
        }
    }
    Stream.Close();
    return Results;
}
Find and Replace are relatively small compared with the 4 GB "File", by the way. I can easily see why my algorithm is slow, but I am not sure how I could do it better.

Part of the problem may be that you're reading the stream one byte at a time. Try reading larger chunks and doing the replace on those. I'd start with about 8 KB and then test with some larger or smaller chunks to see what gives you the best performance.
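One detail chunked reading has to handle is a match that straddles a chunk boundary. A rough sketch of one way to deal with it (my own illustration, not code from the answer; the helper name and chunk size are arbitrary): overlap consecutive chunks by Find.Length - 1 bytes and only search the valid portion of the buffer. This only counts matches; writing the replacement back would still use Seek/Write as in the question.
static long CountMatches(string path, byte[] find)
{
    const int chunkSize = 8 * 1024;
    var buffer = new byte[chunkSize + find.Length - 1];
    long matches = 0;

    using (var stream = File.OpenRead(path))
    {
        int carried = 0;                                   // bytes kept from the previous chunk
        int read;
        while ((read = stream.Read(buffer, carried, buffer.Length - carried)) > 0)
        {
            int valid = carried + read;                    // number of usable bytes in the buffer
            for (int i = 0; i + find.Length <= valid; i++)
            {
                int j = 0;
                while (j < find.Length && buffer[i + j] == find[j]) j++;
                if (j == find.Length) matches++;
            }
            // carry the tail forward so a match spanning the boundary isn't missed
            carried = Math.Min(find.Length - 1, valid);
            Array.Copy(buffer, valid - carried, buffer, 0, carried);
        }
    }
    return matches;
}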

There are lots of better algorithms for finding a substring in a string (which is basically what you are doing).
Start here:
http://en.wikipedia.org/wiki/String_searching_algorithm
The gist of them is that by analyzing your substring you can skip a lot of bytes. Here's a simple example:
The 4 GB file starts with: A B C D E F G H I J K L M N O P
Your substring is: N O P
You skip the length of the substring minus 1 and check against the last byte, so compare C to P.
It doesn't match, so the substring is not in the first 3 bytes.
Also, C isn't in the substring at all, so you can skip 3 more bytes (the length of the substring).
Compare F to P: it doesn't match, F isn't in the substring, skip 3.
Compare I to P, etc., etc.
If you match, go backwards. If the character doesn't match but is in the substring, then you have to do some more comparing at that point (read the link for details).
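A rough sketch of that bad-character skip, essentially Boyer-Moore-Horspool over a byte array (my own illustration; for a 4 GB file you would run it over buffered chunks rather than the whole file):
static int IndexOf(byte[] text, byte[] pattern)
{
    // For each byte value: how far we may shift when that byte is the last
    // byte of the current window and the window does not fully match.
    var skip = new int[256];
    for (int i = 0; i < skip.Length; i++) skip[i] = pattern.Length;
    for (int i = 0; i < pattern.Length - 1; i++) skip[pattern[i]] = pattern.Length - 1 - i;

    int pos = 0;
    while (pos + pattern.Length <= text.Length)
    {
        int j = pattern.Length - 1;
        while (j >= 0 && text[pos + j] == pattern[j]) j--;   // compare right to left
        if (j < 0) return pos;                               // full match at pos
        pos += skip[text[pos + pattern.Length - 1]];         // shift using the skip table
    }
    return -1;                                               // no match
}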

Instead of reading the file byte by byte, read it into a buffer:
const int bufferSize = 8 * 1024;   // tune this
byte[] buffer = new byte[bufferSize];
long currentPos = 0;
long length = Stream.Length;
int count;
while ((count = Stream.Read(buffer, 0, bufferSize)) > 0)
{
    currentPos += count;   // total bytes consumed so far
    // ... search/replace within the 'count' valid bytes in 'buffer'
}

Another, easier way of reading more than one byte at a time:
var Stream = new BufferedStream(new FileStream(File, FileMode.Open, FileAccess.ReadWrite));
Combining this with Saeed Amiri's example of reading into a buffer and one of the better binary find/replace algorithms should give you better results.

You should try using memory-mapped files. The .NET Framework supports them starting with version 4.0 (System.IO.MemoryMappedFiles).
A memory-mapped file contains the contents of a file in virtual memory.
Persisted files are memory-mapped files that are associated with a source file on a disk. When the last process has finished working with the file, the data is saved to the source file on the disk. These memory-mapped files are suitable for working with extremely large source files.
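A minimal sketch of that approach (my own illustration, not the original code, using the System.IO.MemoryMappedFiles API): it only works for an in-place replace, i.e. when Find and Replace have the same length so the file size never changes, and the byte-at-a-time ReadByte calls would still benefit from a skip-based search.
static long ReplaceViaMemoryMap(string path, byte[] find, byte[] replace)
{
    long length = new FileInfo(path).Length;
    long hits = 0;
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
    using (var accessor = mmf.CreateViewAccessor(0, length))
    {
        for (long i = 0; i + find.Length <= length; i++)
        {
            int j = 0;
            while (j < find.Length && accessor.ReadByte(i + j) == find[j]) j++;
            if (j == find.Length)
            {
                accessor.WriteArray(i, replace, 0, replace.Length);  // write back in place
                hits++;
                i += find.Length - 1;                                // resume after the match
            }
        }
    }
    return hits;
}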

Related

Does converting between byte[] and MemoryStream cause overhead?

I want to know if there's overhead when converting between byte arrays and streams (specifically MemoryStreams, using MemoryStream.ToArray() and the MemoryStream(byte[]) constructor). I assume it temporarily doubles memory usage.
For example, I read as a stream, convert to bytes, and then convert to a stream again.
But getting rid of that byte conversion will require a bit of a rewrite, and I don't want to waste time rewriting it if it doesn't make a difference.
So, yes: you are correct in assuming that ToArray duplicates the memory in the stream.
If you do not want to do this (for efficiency reasons), you could modify the bytes directly in the stream. Take a look at this:
// create some bytes: 0,1,2,3,4,5,6,7...
var originalBytes = Enumerable.Range(0, 256).Select(Convert.ToByte).ToArray();

// ms references the bytes array, it does not duplicate it.
// The extra constructor arguments make the buffer publicly visible;
// otherwise GetBuffer()/TryGetBuffer() below would throw/fail.
using (var ms = new MemoryStream(originalBytes, 0, originalBytes.Length, writable: true, publiclyVisible: true))
{
    // var duplicatedBytes = ms.ToArray(); // copy of originalBytes array

    // If you don't want to duplicate the bytes but want to
    // modify the buffer directly, you could do this:
    var bufRef = ms.GetBuffer();
    for (var i = 0; i < bufRef.Length; ++i)
    {
        bufRef[i] = Convert.ToByte(bufRef[i] ^ 0x55);
    }

    // or this:
    /*
    ms.TryGetBuffer(out var buf);
    for (var i = 0; i < buf.Count; ++i)
    {
        buf[i] = Convert.ToByte(buf[i] ^ 0x55);
    }
    */

    // or this:
    /*
    for (var i = 0; i < ms.Length; ++i)
    {
        ms.Position = i;
        var b = ms.ReadByte();
        ms.Position = i;
        ms.WriteByte(Convert.ToByte(b ^ 0x55));
    }
    */
}
// originalBytes will now be 85,84,87,86...
ETA: Edited to add in Blindy's examples below. Thanks! I had totally forgotten about GetBuffer and had no idea about TryGetBuffer.
Does MemoryStream(byte[]) cause a memory copy?
No, it's a non-resizable stream, and as such no copy is necessary.
Does MemoryStream.ToArray() cause a memory copy?
Yes, by design it creates a copy of the active buffer. This is to cover the resizable case, where the buffer used by the stream is not the same buffer that was initially provided due to reallocations to increase/decrease its size.
Alternatives to MemoryStream.ToArray() that don't cause memory copy?
Sure, you have MemoryStream.TryGetBuffer(out ArraySegment<byte> buffer), which returns a segment pointing to the internal buffer, whether or not it's resizable (provided the stream was created with a publicly visible buffer). If it's non-resizable, it's a segment into your original array.
You also have MemoryStream.GetBuffer, which returns the entire internal buffer. Note that in the resizable case this will be a lot larger than the actual used stream space, and you'll have to adjust for that in code.
And lastly, you don't always actually need a byte array; sometimes you just need to write it to another stream (a file, a socket, a compression stream, an HTTP response, etc.). For this you have MemoryStream.CopyTo[Async], which also doesn't perform any copies.
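A small sketch of those no-copy options (my own example; the file name is illustrative, and the usual System/System.IO usings are assumed):
var ms = new MemoryStream();
ms.Write(new byte[] { 1, 2, 3, 4 }, 0, 4);

// TryGetBuffer: a view over the internal buffer, no copy.
// (It returns false if the stream wraps a byte[] that was not created
// with publiclyVisible: true.)
if (ms.TryGetBuffer(out ArraySegment<byte> segment))
{
    Console.WriteLine($"{segment.Count} bytes starting at offset {segment.Offset}");
}

// CopyTo: stream the contents elsewhere without materializing a byte[].
ms.Position = 0;
using (var file = File.Create("copy.bin"))
{
    ms.CopyTo(file);
}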

How to read file bytes from byte offset?

If I am given a .cmp file and a byte offset 0x598, how can I read the file from this offset?
I can of course read the file bytes like this:
byte[] fileBytes = File.ReadAllBytes("upgradefile.cmp");
But how can I read it starting from byte offset 0x598?
To explain a bit more: the actual data I have to read starts at this offset, and everything before it is just header data, so basically I have to read the file from that offset to the end.
Try code like this:
using (BinaryReader reader = new BinaryReader(File.Open("upgradefile.cmp", FileMode.Open)))
{
    long offset = 0x598;
    if (reader.BaseStream.Length > offset)
    {
        reader.BaseStream.Seek(offset, SeekOrigin.Begin);
        byte[] fileBytes = reader.ReadBytes((int)(reader.BaseStream.Length - offset));
    }
}
If you are not familiar with streams, LINQ, or whatever, I have the simplest solution for you:
Read the entire file into memory (I hope you are dealing with small files):
byte[] fileBytes = File.ReadAllBytes("upgradefile.cmp");
Calculate how many bytes are present in the array after the given offset:
int startOffset = 0x598; // this is just a hexadecimal representation for humans, it can be decimal or whatever
int howManyBytesToRead = fileBytes.Length - startOffset;
Then just copy the data to a new array:
byte[] newArray = new byte[howManyBytesToRead];
int pos = 0;
for (int i = startOffset; i < fileBytes.Length; i++)
{
    newArray[pos] = fileBytes[i];
    pos = pos + 1;
}
If you understand how this works, you can look at the Array.Copy method in the Microsoft documentation.
By not using ReadAllBytes.
Get a stream, move to the position, and read the rest of the file.
You are basically complaining that a convenience method, made to allow a one-line read of a whole file, is not what you want, ignoring that it is just that: a convenience method. The normal way to deal with files is opening them and using a Stream.
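A minimal sketch of that stream-based approach (my own example, reusing the file name from the question). Note that Read can return fewer bytes than requested, so the loop keeps reading until the buffer is full:
const long offset = 0x598;
byte[] data;
using (FileStream fs = File.OpenRead("upgradefile.cmp"))
{
    fs.Seek(offset, SeekOrigin.Begin);                  // skip the header
    data = new byte[fs.Length - offset];
    int read = 0;
    while (read < data.Length)
    {
        int n = fs.Read(data, read, data.Length - read);
        if (n == 0) break;                              // unexpected end of file
        read += n;
    }
}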

Unable to read beyond the end of the stream

I wrote a quick method to write a file from a stream, but it's not done yet. I receive this exception and I can't find out why:
Unable to read beyond the end of the stream
Is there anyone who could help me debug it?
public static bool WriteFileFromStream(Stream stream, string toFile)
{
    FileStream fileToSave = new FileStream(toFile, FileMode.Create);
    BinaryWriter binaryWriter = new BinaryWriter(fileToSave);
    using (BinaryReader binaryReader = new BinaryReader(stream))
    {
        int pos = 0;
        int length = (int)stream.Length;
        while (pos < length)
        {
            int readInteger = binaryReader.ReadInt32();
            binaryWriter.Write(readInteger);
            pos += sizeof(int);
        }
    }
    return true;
}
Thanks a lot!
Not really an answer to your question, but this method could be so much simpler, like this:
public static void WriteFileFromStream(Stream stream, string toFile)
{
    // don't forget the using for releasing the file handle after the copy
    using (FileStream fileToSave = new FileStream(toFile, FileMode.Create))
    {
        stream.CopyTo(fileToSave);
    }
}
Note that I also removed the return value, since it is pretty much useless: in your code there is only one return statement.
Apart from that, you perform a Length check on the stream, but many streams don't support checking Length.
As for your problem: you first check whether the stream is at its end. If not, you read 4 bytes. Here is the problem. Let's say you have an input stream of 6 bytes. First you check whether the stream is at its end. The answer is no, since there are 6 bytes left. You read 4 bytes and check again. Of course the answer is still no, since there are 2 bytes left. Now you read another 4 bytes, but that of course fails, since there are only 2 bytes left (ReadInt32 reads the next 4 bytes).
I presume that the input stream has ints only (Int32). You could test with the PeekChar() method:
while (binaryReader.PeekChar() != -1)
{
    int readInteger = binaryReader.ReadInt32();
    binaryWriter.Write(readInteger);
}
You are doing while (pos < length), and length is the actual length of the stream in bytes. So you are effectively counting the bytes in the stream and then trying to read that many ints, which is incorrect. You could take length to be stream.Length / 4 instead, since an Int32 is 4 bytes.
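A minimal sketch of that fix, reusing the variables from the question's method and assuming the stream really does contain only whole Int32 values:
int intCount = (int)(stream.Length / sizeof(int));   // number of ints, not bytes
for (int i = 0; i < intCount; i++)
{
    binaryWriter.Write(binaryReader.ReadInt32());
}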
Try:
int length = (int)binaryReader.BaseStream.Length;
After the stream has been read by the binary reader, its position is at the end; you have to set the position back to zero first: stream.Position = 0;

Better/faster way to fill a big array in C#

I have 3 *.dat files (346 KB, 725 KB, and 1762 KB) that are filled with a JSON string of "big" int arrays.
Each time my object is created (several times) I take those three files and use JsonConvert.DeserializeObject to deserialize the arrays into the object.
I thought about using binary files instead of a JSON string, or could I even save these arrays directly? I don't need to use these files; it's just the location where the data is currently saved. I would gladly switch to anything faster.
What are the different ways to speed up the initialization of these objects?
The fastest way is to manually serialize the data.
An easy way to do this is by creating a FileStream, and then wrapping it in a BinaryWriter/BinaryReader.
You have access to functions to write the basic data structures (numbers, string, char, byte[] and char[]).
An easy way to write an int[] (unnecessary if it's a fixed size) is by prepending the length of the array as either an int or a long (depending on the size; unsigned doesn't really give any advantage, since arrays use signed data types for their length storage), and then writing all the ints.
Two ways to write all the ints would be:
1. Simply loop over the entire array.
2. Convert it into a byte[] and write it using BinaryWriter.Write(byte[])
Here is how you can implement them both:
// Writing
BinaryWriter writer = new BinaryWriter(new FileStream(...));
int[] intArr = new int[1000];
writer.Write(intArr.Length);
for (int i = 0; i < intArr.Length; i++)
    writer.Write(intArr[i]);

// Reading
BinaryReader reader = new BinaryReader(new FileStream(...));
int[] intArr = new int[reader.ReadInt32()];
for (int i = 0; i < intArr.Length; i++)
    intArr[i] = reader.ReadInt32();

// Writing, method 2
BinaryWriter writer = new BinaryWriter(new FileStream(...));
int[] intArr = new int[1000];
byte[] byteArr = new byte[intArr.Length * sizeof(int)];
Buffer.BlockCopy(intArr, 0, byteArr, 0, intArr.Length * sizeof(int));
writer.Write(intArr.Length);
writer.Write(byteArr);

// Reading, method 2
BinaryReader reader = new BinaryReader(new FileStream(...));
int[] intArr = new int[reader.ReadInt32()];
byte[] byteArr = reader.ReadBytes(intArr.Length * sizeof(int));
Buffer.BlockCopy(byteArr, 0, intArr, 0, byteArr.Length);
I decided to put this all to the test with an array of 10000 integers, and ran the test 10000 times.
Method 1 consumed on average 888200 ns on my system (about 0.89 ms).
Method 2 consumed on average only 568600 ns on my system (about 0.57 ms).
Both times include the work the garbage collector has to do.
Obviously method 2 is faster than method 1, though possibly less readable.
Another reason why method 1 can be better than method 2 is that method 2 requires double the amount of free RAM compared to the data you're going to write (the original int[] and the byte[] converted from it), which matters when dealing with limited RAM or extremely large files (talking about 512 MB+). If that is the case, you can always make a hybrid solution by, for example, writing 128 MB at a time.
Note that method 1 also requires this extra space, but because it's split into one operation per item of the int[], the memory can be released a lot earlier.
Something like this will write 128 MB of an int[] at a time:
const int WRITECOUNT = 32 * 1024 * 1024;    // ints per chunk (32M ints = 128 MB of bytes)
int[] intArr = new int[140 * 1024 * 1024];  // 140M ints = 560 MB
for (int i = 0; i < intArr.Length; i++)
    intArr[i] = i;
byte[] byteArr = new byte[WRITECOUNT * sizeof(int)]; // 128 MB
int dataDone = 0;
using (Stream fileStream = new FileStream("data.dat", FileMode.Create))
using (BinaryWriter writer = new BinaryWriter(fileStream))
{
    while (dataDone < intArr.Length)
    {
        int dataToWrite = intArr.Length - dataDone;
        if (dataToWrite > WRITECOUNT) dataToWrite = WRITECOUNT;
        // BlockCopy offsets and counts are in bytes, not array elements
        Buffer.BlockCopy(intArr, dataDone * sizeof(int), byteArr, 0, dataToWrite * sizeof(int));
        writer.Write(byteArr, 0, dataToWrite * sizeof(int));
        dataDone += dataToWrite;
    }
}
Note that this is just for writing; reading works differently too :P.
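For completeness, a rough sketch of the reading counterpart (my own example, matching the "data.dat" layout written above, which has no length prefix):
const int READCOUNT = 32 * 1024 * 1024;                 // ints per chunk
int[] intArr;
using (Stream fileStream = new FileStream("data.dat", FileMode.Open))
using (BinaryReader reader = new BinaryReader(fileStream))
{
    intArr = new int[fileStream.Length / sizeof(int)];  // no length prefix in this layout
    int dataDone = 0;
    while (dataDone < intArr.Length)
    {
        int dataToRead = Math.Min(intArr.Length - dataDone, READCOUNT);
        byte[] chunk = reader.ReadBytes(dataToRead * sizeof(int)); // reads exactly this many bytes unless EOF
        if (chunk.Length < sizeof(int)) break;          // truncated / unexpected end of file
        Buffer.BlockCopy(chunk, 0, intArr, dataDone * sizeof(int), chunk.Length);
        dataDone += chunk.Length / sizeof(int);
    }
}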
I hope this gives you some more insight in dealing with very large data files :).
If you've just got a bunch of integers, then using JSON will indeed be pretty inefficient in terms of parsing. You can use BinaryReader and BinaryWriter to write binary files efficiently... but it's not clear to me why you need to read the file every time you create an object anyway. Why can't each new object keep a reference to the original array, which has been read once? Or if they need to mutate the data, you could keep one "canonical source" and just copy that array in memory each time you create an object.
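A rough sketch of that "canonical source" idea (the class and member names here are purely illustrative, not from the question):
public class LookupData
{
    private static int[] _canonical;                     // loaded once per process (not thread-safe; wrap in Lazy<T> if needed)

    public int[] Values { get; }

    public LookupData(string path)
    {
        _canonical ??= LoadCanonical(path);
        Values = (int[])_canonical.Clone();              // private, mutable copy per object
    }

    private static int[] LoadCanonical(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        int[] ints = new int[bytes.Length / sizeof(int)];
        Buffer.BlockCopy(bytes, 0, ints, 0, ints.Length * sizeof(int));
        return ints;
    }
}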
The fastest way to create a byte array from an array of integers is to use Buffer.BlockCopy
byte[] result = new byte[a.Length * sizeof(int)];
Buffer.BlockCopy(a, 0, result, 0, result.Length);
// write result to a FileStream or wherever
// (write a.Length first, e.g. with a BinaryWriter, so the reader below
//  knows how many ints to expect)
If you store the size of the array in the first four bytes, you can use it to deserialize again. Make sure everything fits into memory, but looking at your file sizes it should.
var buffer = File.ReadAllBytes(@"...");
int size = BitConverter.ToInt32(buffer, 0);
var result = new int[size];
Buffer.BlockCopy(buffer, sizeof(int), result, 0, size * sizeof(int));
Binary is not human readable, but definitely faster than JSON.

Problem with comparing 2 files - false when should be true?

I tried using byte-by-byte comparison and also comparing calculated hashes of the files (code samples below). I have a file, I copy it, I compare both: the result is TRUE. But the problem starts when I open one of the files:
With MS Word files, after opening and closing one of the files the result is still TRUE, but when, for example, I delete the last symbol in the file and then write it back again and compare again, the result is FALSE. The files are basically the same, but theoretically it seems that byte-by-byte they are not the same anymore.
With Excel files, even opening a file causes the function to return FALSE. Should it really be like that? The only thing I can think of that has changed is the last access time. But is that taken into consideration when comparing byte-by-byte?
So I wanted to ask: should this comparison really work like this, and is there anything I could do to avoid this? In my program I will mostly compare .pdf files, where editing won't be much of an option, but I still wanted to know why it is acting like this.
Byte-by-Byte with buffer:
static bool FilesAreEqualFaster(string f1, string f2)
{
    // get file length and make sure lengths are identical
    long length = new FileInfo(f1).Length;
    if (length != new FileInfo(f2).Length)
        return false;

    byte[] buf1 = new byte[4096];
    byte[] buf2 = new byte[4096];

    // open both for reading
    using (FileStream stream1 = File.OpenRead(f1))
    using (FileStream stream2 = File.OpenRead(f2))
    {
        // compare content for equality
        int b1, b2;
        while (length > 0)
        {
            // figure out how much to read
            int toRead = buf1.Length;
            if (toRead > length)
                toRead = (int)length;
            length -= toRead;

            // read a chunk from each and compare
            b1 = stream1.Read(buf1, 0, toRead);
            b2 = stream2.Read(buf2, 0, toRead);
            for (int i = 0; i < toRead; ++i)
                if (buf1[i] != buf2[i])
                    return false;
        }
    }
    return true;
}
Hash:
private static bool CompareFileHashes(string fileName1, string fileName2)
{
    // Compare file sizes before continuing.
    // If sizes are equal then compare bytes.
    if (CompareFileSizes(fileName1, fileName2))
    {
        // Create an instance of System.Security.Cryptography.HashAlgorithm
        HashAlgorithm hash = HashAlgorithm.Create();

        // Declare byte arrays to store our file hashes
        byte[] fileHash1;
        byte[] fileHash2;

        // Open a System.IO.FileStream for each file.
        // Note: With the 'using' keyword the streams
        // are closed automatically.
        using (FileStream fileStream1 = new FileStream(fileName1, FileMode.Open),
                          fileStream2 = new FileStream(fileName2, FileMode.Open))
        {
            // Compute file hashes
            fileHash1 = hash.ComputeHash(fileStream1);
            fileHash2 = hash.ComputeHash(fileStream2);
        }
        return BitConverter.ToString(fileHash1) == BitConverter.ToString(fileHash2);
    }
    else
    {
        return false;
    }
}
Aside from anything else, this code is wrong:
b1 = stream1.Read(buf1, 0, toRead);
b2 = stream2.Read(buf2, 0, toRead);
for (int i = 0; i < toRead; ++i)
    if (buf1[i] != buf2[i])
        return false;
You're ignoring the possibility of b1 and b2 being unequal to each other and to toRead. What if you only read 10 bytes from the first stream and 20 from the second, when you asked for 30? You may not have reached the end of the files, but it can still potentially return you less data than you ask for. Never ignore the return value of Stream.Read. (You're saving it in a variable but then ignoring the variable.)
Basically you'll need to have independent buffers, which are replenished when necessary - keep track of where you are within each buffer, and how much useful data is there. Read more data into each buffer when you need to.
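A minimal sketch of such a helper (my own example): it keeps reading until the buffer is full or the stream ends, so both buffers hold the same number of valid bytes before you compare them.
static int FillBuffer(Stream stream, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = stream.Read(buffer, total, buffer.Length - total);
        if (read == 0) break;          // end of stream
        total += read;
    }
    return total;                      // only smaller than buffer.Length at end of file
}
Because the two files have already been checked to have equal lengths, FillBuffer returns the same count for both streams at every step, and the element-wise comparison over that count is then safe.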
Then there's the other problem of files actually changing just by opening them, as Henk mentioned.
