I have a method Value().
The output of Value() is used as an input to other classes and eventually in Main.
In Main, some logic is performed and output is produced for the first 512 bits. I want my program to return to Value() to start on the next 512 bits of file.txt. How can I do that?
public static byte[] Value()
{
byte[] numbers = new byte[9999];
using (FileStream fs = File.Open(@"C:\Users\file.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
string line;
while ((line = sr.ReadLine()) != null)
{
for (int i = 0; i < 512; i++)
{
numbers[i] = Byte.Parse(line[i].ToString());
}
}
}
return numbers;
}
What can be done is to pass Value() an offset and a length parameter.
But there is a problem with your method: you are overwriting the same first bytes of the array for each line in the file, which may not be what you want to do. So I corrected this to make sure you return only `length` bytes.
Using System.Linq's Skip and Take methods, you may find things easier as well:
public static byte[] Value(int startOffset, int length)
{
byte[] allBytes = File.ReadAllBytes(@"C:\Users\file.txt");
return allBytes.Skip(startOffset).Take(length).ToArray();
}
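For example, a minimal sketch of a driver loop in Main that advances through the file 64 bytes (512 bits) at a time might look like this (the 64-byte chunk size and the call pattern are assumptions based on your 512-bit requirement):
long fileLength = new FileInfo(@"C:\Users\file.txt").Length;
for (int offset = 0; offset < fileLength; offset += 64)
{
    byte[] chunk = Value(offset, 64); // Take(64) simply returns fewer bytes at the end of the file
    // ... perform your Main logic on this 512-bit chunk ...
}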
It seems like what you are trying to do is use a recursive call on Value(). This is based on your comment, but it is not entirely clear, so I am going to make that assumption.
There is one problem I see: in your scenario you're returning a byte[], so I modified your code a little bit to keep it as close as possible to yours.
/// <summary>
/// This method will call your `Value` method and return the bytes; it is the entry point for the logic.
/// </summary>
/// <returns></returns>
public static byte[] ByteValueCaller()
{
byte[] numbers = new byte[9999];
Value(0, numbers);
return numbers;
}
public static void Value(int startingByte, byte[] numbers)
{
using (FileStream fs = File.Open(@"C:\Users\file.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BinaryReader br = new BinaryReader(fs))
{
// 64 bytes == 512 bits
// Determine whether the last position to use is inside the stream, or whether it is past the end of the stream.
int bytesToRead = startingByte + 64 > br.BaseStream.Length ? (int)br.BaseStream.Length - startingByte : 64;
// Move the stream to the given position.
br.BaseStream.Seek(startingByte, SeekOrigin.Begin);
// Populate dataBuffer with the given bytes.
byte[] dataBuffer = br.ReadBytes(bytesToRead);
// Migrate from our temporary dataBuffer into the numbers array.
TransformBufferArrayToNumbers(startingByte, dataBuffer, numbers);
// Recursive call to the same method for the next chunk.
if (startingByte + bytesToRead < fs.Length)
Value(startingByte + bytesToRead, numbers);
}
static void TransformBufferArrayToNumbers(int startingByte, byte[] dataBuffer, byte[] numbers)
{
for (var i = 0; i < dataBuffer.Length; i++)
{
numbers[startingByte + i] = dataBuffer[i];
}
}
}
Also, be careful with the byte[9999], as you are limiting how many bytes you can store. If that's a hardcoded limit, I would also add that information to the if that guards the recursive call.
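For example, a sketch of that extra guard (assuming the byte[9999] stays as the hard limit) could look like:
// Stop recursing both at end of file and when the fixed-size array is full.
if (startingByte + bytesToRead < fs.Length &&
    startingByte + bytesToRead < numbers.Length)
    Value(startingByte + bytesToRead, numbers);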
@TiGreX
public static List<byte> ByteValueCaller()
{
List<byte> numbers = new List<byte>();
GetValue(0, numbers);
return numbers;
}
public static void GetValue(int startingByte, List<byte> numbers)
{
using (FileStream fs = File.Open(@"C:\Users\file1.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BinaryReader br = new BinaryReader(fs))
{
// 64 bytes == 512 bits
// Determine whether the last position to use is inside the stream, or whether it is past the end of the stream.
int bytesToRead = startingByte + 64 > br.BaseStream.Length ? (int)br.BaseStream.Length - startingByte : 64;
// Move the stream to the given position.
br.BaseStream.Seek(startingByte, SeekOrigin.Begin);
// Populate dataBuffer with the given bytes.
byte[] dataBuffer = br.ReadBytes(bytesToRead);
numbers.AddRange(dataBuffer);
// Recursive call to the same method for the next chunk.
if (startingByte + bytesToRead < fs.Length)
GetValue(startingByte + bytesToRead, numbers);
}
}
Is there an elegant way to emulate the StreamReader.ReadToEnd method with BinaryReader? Perhaps by putting all the bytes into a byte array?
I do this:
read1.ReadBytes((int)read1.BaseStream.Length);
...but there must be a better way.
Original Answer (Read Update Below!)
Simply do:
byte[] allData = read1.ReadBytes(int.MaxValue);
The documentation says that it will read all bytes until the end of the stream is reached.
Update
Although this seems elegant, and the documentation seems to indicate that this would work, the actual implementation (checked in .NET 2, 3.5, and 4) allocates a full-size byte array for the data, which will probably cause an OutOfMemoryException on a 32-bit system.
Therefore, I would say that actually there isn't an elegant way.
Instead, I would recommend the following variation of @iano's answer. This variant doesn't rely on .NET 4:
Create an extension method for BinaryReader (or Stream, the code is the same for either).
public static byte[] ReadAllBytes(this BinaryReader reader)
{
const int bufferSize = 4096;
using (var ms = new MemoryStream())
{
byte[] buffer = new byte[bufferSize];
int count;
while ((count = reader.Read(buffer, 0, buffer.Length)) != 0)
ms.Write(buffer, 0, count);
return ms.ToArray();
}
}
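Usage is then a one-liner; for example (the file path here is just illustrative):
using (var reader = new BinaryReader(File.OpenRead(@"C:\temp\data.bin")))
{
    byte[] allData = reader.ReadAllBytes();
}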
There is not an easy way to do this with BinaryReader. If you don't know the count you need to read ahead of time, a better bet is to use MemoryStream:
public byte[] ReadAllBytes(Stream stream)
{
using (var ms = new MemoryStream())
{
stream.CopyTo(ms);
return ms.ToArray();
}
}
To avoid the additional copy when calling ToArray(), you could instead return the Position and buffer, via GetBuffer().
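As a sketch of that variant (note that GetBuffer returns the MemoryStream's internal array, which is usually longer than the actual data, so the caller must honor the returned length):
public static (byte[] Buffer, int Length) ReadAllBytesNoCopy(Stream stream)
{
    var ms = new MemoryStream();
    stream.CopyTo(ms);
    // GetBuffer avoids the copy that ToArray makes, but only the first
    // Length bytes of the returned array contain valid data.
    return (ms.GetBuffer(), (int)ms.Length);
}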
To copy the content of one stream to another, I solved it by reading "some" bytes until the end of the file is reached:
private const int READ_BUFFER_SIZE = 1024;
using (BinaryReader reader = new BinaryReader(responseStream))
{
using (BinaryWriter writer = new BinaryWriter(File.Open(localPath, FileMode.Create)))
{
int byteRead = 0;
int byteTransfered = 0; // running total of bytes copied
do
{
byte[] buffer = reader.ReadBytes(READ_BUFFER_SIZE);
byteRead = buffer.Length;
writer.Write(buffer);
byteTransfered += byteRead;
} while (byteRead == READ_BUFFER_SIZE);
}
}
Had the same problem.
First, get the file's size using FileInfo.Length.
Next, create a byte array and set its value to BinaryReader.ReadBytes(FileInfo.Length).
e.g.
var size = new FileInfo(yourImagePath).Length;
byte[] allBytes = yourReader.ReadBytes(System.Convert.ToInt32(size));
Another approach to this problem is to use C# extension methods:
public static class StreamHelpers
{
    // .NET 4.0 or newer: Stream.CopyTo does the buffered copy for us.
    public static byte[] ReadAllBytes(this BinaryReader reader)
    {
        using (var ms = new MemoryStream())
        {
            reader.BaseStream.CopyTo(ms);
            return ms.ToArray();
        }
    }

    // Pre-.NET 4.0 equivalent: manual buffered read loop.
    public static byte[] ReadAllBytesPre40(this BinaryReader reader)
    {
        const int bufferSize = 4096;
        using (var ms = new MemoryStream())
        {
            byte[] buffer = new byte[bufferSize];
            int count;
            while ((count = reader.Read(buffer, 0, buffer.Length)) != 0)
                ms.Write(buffer, 0, count);
            return ms.ToArray();
        }
    }
}
Using this approach allows for reusable as well as readable code.
I use this, which utilizes the underlying BaseStream property to give you the length info you need. It keeps things nice and simple.
Below are three extension methods on BinaryReader:
The first reads from wherever the stream's current position is to the end
The second reads the entire stream in one go
The third utilizes the Range type to specify the subset of data you are interested in.
public static class BinaryReaderExtensions {
public static byte[] ReadBytesToEnd(this BinaryReader binaryReader) {
var length = binaryReader.BaseStream.Length - binaryReader.BaseStream.Position;
return binaryReader.ReadBytes((int)length);
}
public static byte[] ReadAllBytes(this BinaryReader binaryReader) {
binaryReader.BaseStream.Position = 0;
return binaryReader.ReadBytes((int)binaryReader.BaseStream.Length);
}
public static byte[] ReadBytes(this BinaryReader binaryReader, Range range) {
var (offset, length) = range.GetOffsetAndLength((int)binaryReader.BaseStream.Length);
binaryReader.BaseStream.Position = offset;
return binaryReader.ReadBytes(length);
}
}
Using them is then trivial and clear...
// 1 - Reads everything in as a byte array
var rawBytes = myBinaryReader.ReadAllBytes();
// 2 - Reads a string, then reads the remaining data as a byte array
var someString = myBinaryReader.ReadString();
var rawBytes = myBinaryReader.ReadBytesToEnd();
// 3 - Uses a range to read the last 44 bytes
var rawBytes = myBinaryReader.ReadBytes(^44..);
Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.
Would a checksum comparison such as CRC be faster?
Are there any .NET libraries that can generate a checksum for a file?
The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you would use an array of bytes sized to Int64, and then compare the resulting numbers.
Here's what I came up with:
const int BYTES_TO_READ = sizeof(Int64);
static bool FilesAreEqual(FileInfo first, FileInfo second)
{
if (first.Length != second.Length)
return false;
if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
return true;
int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);
using (FileStream fs1 = first.OpenRead())
using (FileStream fs2 = second.OpenRead())
{
byte[] one = new byte[BYTES_TO_READ];
byte[] two = new byte[BYTES_TO_READ];
for (int i = 0; i < iterations; i++)
{
fs1.Read(one, 0, BYTES_TO_READ);
fs2.Read(two, 0, BYTES_TO_READ);
if (BitConverter.ToInt64(one,0) != BitConverter.ToInt64(two,0))
return false;
}
}
return true;
}
In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.
Here are the ReadByte and hashing methods I used, for comparison purposes:
static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
{
if (first.Length != second.Length)
return false;
if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
return true;
using (FileStream fs1 = first.OpenRead())
using (FileStream fs2 = second.OpenRead())
{
for (int i = 0; i < first.Length; i++)
{
if (fs1.ReadByte() != fs2.ReadByte())
return false;
}
}
return true;
}
static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
{
byte[] firstHash = MD5.Create().ComputeHash(first.OpenRead());
byte[] secondHash = MD5.Create().ComputeHash(second.OpenRead());
for (int i=0; i<firstHash.Length; i++)
{
if (firstHash[i] != secondHash[i])
return false;
}
return true;
}
A checksum comparison will most likely be slower than a byte-by-byte comparison.
In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.
As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean only needing to do the DiskIO one time, on the new file. This would likely be faster than a byte-by-byte comparison.
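A minimal sketch of that pre-computation idea (the cache structure and key choice here are assumptions for illustration):
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Checksums of "known" files, keyed by path, so each base file's disk I/O is paid once.
static readonly Dictionary<string, byte[]> hashCache = new Dictionary<string, byte[]>();

static bool MatchesKnownFile(string knownPath, string newPath)
{
    if (!hashCache.TryGetValue(knownPath, out byte[] knownHash))
    {
        using (var md5 = MD5.Create())
        using (var fs = File.OpenRead(knownPath))
            hashCache[knownPath] = knownHash = md5.ComputeHash(fs);
    }
    using (var md5 = MD5.Create())
    using (var fs = File.OpenRead(newPath))
        return md5.ComputeHash(fs).SequenceEqual(knownHash); // only the new file is read
}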
If you *do* decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:
• for `System.String` path names:
public static bool AreFileContentsEqual(String path1, String path2) =>
File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
• for `System.IO.FileInfo` instances:
public static bool AreFileContentsEqual(FileInfo fi1, FileInfo fi2) =>
fi1.Length == fi2.Length &&
(fi1.Length == 0L || File.ReadAllBytes(fi1.FullName).SequenceEqual(
File.ReadAllBytes(fi2.FullName)));
Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc., but as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line-ending, character encoding, media metadata, whitespace, padding, source code comments, etc.; see the note below) will always be considered not-equal.
This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (because it's fundamentally optimized to keep small, short-lived allocations extremely cheap), and in fact could even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) implies maximally delegating file performance issues to the CLR, BCL, and JIT to benefit from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.
Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk *at all* for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though SequenceEqual does in fact give us the "optimization" of abandoning on first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary for any true-positive cases.
Note: an obscure exception is that NTFS alternate data streams are not examined by any of the answers discussed on this page, so such streams may be different for files otherwise reported as the "same."
In addition to Reed Copsey's answer:
The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.
If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.
For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
It gets even faster if you don't read in small 8-byte chunks but instead put a loop around it, reading a larger chunk. I reduced the average comparison time to 1/4.
public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
bool result;
if (fileInfo1.Length != fileInfo2.Length)
{
result = false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
result = StreamsContentsAreEqual(file1, file2);
}
}
}
return result;
}
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = stream1.Read(buffer1, 0, bufferSize);
int count2 = stream2.Read(buffer2, 0, bufferSize);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
Edit: This method would not work for comparing binary files!
In .NET 4.0, the File class has the following two new methods:
public static IEnumerable<string> ReadLines(string path)
public static IEnumerable<string> ReadLines(string path, Encoding encoding)
Which means you could use:
bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));
The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may however very well be eaten up by the added time of calculating the hash.
Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.
You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.
If the hash code for example is 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.
My answer is a derivative of @lars's but fixes the bug in the call to Stream.Read. I also added some fast-path checks that other answers had, plus input validation. In short, this should be the answer:
using System;
using System.IO;
namespace ConsoleApp4
{
class Program
{
static void Main(string[] args)
{
var fi1 = new FileInfo(args[0]);
var fi2 = new FileInfo(args[1]);
Console.WriteLine(FilesContentsAreEqual(fi1, fi2));
}
public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
if (fileInfo1 == null)
{
throw new ArgumentNullException(nameof(fileInfo1));
}
if (fileInfo2 == null)
{
throw new ArgumentNullException(nameof(fileInfo2));
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
return StreamsContentsAreEqual(file1, file2);
}
}
}
}
private static int ReadFullBuffer(Stream stream, byte[] buffer)
{
int bytesRead = 0;
while (bytesRead < buffer.Length)
{
int read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
if (read == 0)
{
// Reached end of stream.
return bytesRead;
}
bytesRead += read;
}
return bytesRead;
}
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = ReadFullBuffer(stream1, buffer1);
int count2 = ReadFullBuffer(stream2, buffer2);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
}
}
Or if you want to be super-awesome, you can use the async variant:
using System;
using System.IO;
using System.Threading.Tasks;
namespace ConsoleApp4
{
class Program
{
static void Main(string[] args)
{
var fi1 = new FileInfo(args[0]);
var fi2 = new FileInfo(args[1]);
Console.WriteLine(FilesContentsAreEqualAsync(fi1, fi2).GetAwaiter().GetResult());
}
public static async Task<bool> FilesContentsAreEqualAsync(FileInfo fileInfo1, FileInfo fileInfo2)
{
if (fileInfo1 == null)
{
throw new ArgumentNullException(nameof(fileInfo1));
}
if (fileInfo2 == null)
{
throw new ArgumentNullException(nameof(fileInfo2));
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
return await StreamsContentsAreEqualAsync(file1, file2).ConfigureAwait(false);
}
}
}
}
private static async Task<int> ReadFullBufferAsync(Stream stream, byte[] buffer)
{
int bytesRead = 0;
while (bytesRead < buffer.Length)
{
int read = await stream.ReadAsync(buffer, bytesRead, buffer.Length - bytesRead).ConfigureAwait(false);
if (read == 0)
{
// Reached end of stream.
return bytesRead;
}
bytesRead += read;
}
return bytesRead;
}
private static async Task<bool> StreamsContentsAreEqualAsync(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = await ReadFullBufferAsync(stream1, buffer1).ConfigureAwait(false);
int count2 = await ReadFullBufferAsync(stream2, buffer2).ConfigureAwait(false);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
}
}
Honestly, I think you need to prune your search tree down as much as possible.
Things to check before going byte-by-byte:
Are sizes the same?
Is the last byte in file A different than in file B?
Also, reading large blocks at a time will be more efficient since drives read sequential bytes more quickly. Going byte-by-byte causes not only far more system calls, but it causes the read head of a traditional hard drive to seek back and forth more often if both files are on the same drive.
Read chunk A and chunk B into a byte buffer, and compare them (do NOT use Array.Equals, see comments). Tune the size of the blocks until you hit what you feel is a good trade off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.
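A sketch of those two pre-checks before committing to the full scan (this assumes you then fall through to your own chunked comparison):
// Returns false if the files are provably different without scanning them.
static bool QuickPreChecksPass(FileInfo a, FileInfo b)
{
    if (a.Length != b.Length) return false; // different sizes: definitely different
    if (a.Length == 0) return true;         // both empty: equal
    using (var fsA = a.OpenRead())
    using (var fsB = b.OpenRead())
    {
        // Compare the last byte first; appended-to files differ here immediately.
        fsA.Seek(-1, SeekOrigin.End);
        fsB.Seek(-1, SeekOrigin.End);
        return fsA.ReadByte() == fsB.ReadByte();
    }
}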
Inspired by https://dev.to/emrahsungu/how-to-compare-two-files-using-net-really-really-fast-2pd9, here is a proposal to do it with AVX2 SIMD instructions:
using System.Buffers;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
namespace FileCompare;
public static class FastFileCompare
{
public static bool AreFilesEqual(FileInfo fileInfo1, FileInfo fileInfo2, int bufferSize = 4096 * 32)
{
if (fileInfo1.Exists == false)
{
throw new FileNotFoundException(nameof(fileInfo1), fileInfo1.FullName);
}
if (fileInfo2.Exists == false)
{
throw new FileNotFoundException(nameof(fileInfo2), fileInfo2.FullName);
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
using FileStream fileStream01 = fileInfo1.OpenRead();
using FileStream fileStream02 = fileInfo2.OpenRead();
ArrayPool<byte> sharedArrayPool = ArrayPool<byte>.Shared;
byte[] buffer1 = sharedArrayPool.Rent(bufferSize);
byte[] buffer2 = sharedArrayPool.Rent(bufferSize);
Array.Fill<byte>(buffer1, 0);
Array.Fill<byte>(buffer2, 0);
try
{
while (true)
{
int len1 = 0;
for (int read;
len1 < buffer1.Length &&
(read = fileStream01.Read(buffer1, len1, buffer1.Length - len1)) != 0;
len1 += read)
{
}
int len2 = 0;
for (int read;
len2 < buffer2.Length &&
(read = fileStream02.Read(buffer2, len2, buffer2.Length - len2)) != 0;
len2 += read)
{
}
if (len1 != len2)
{
return false;
}
if (len1 == 0)
{
return true;
}
unsafe
{
fixed (byte* pb1 = buffer1)
{
fixed (byte* pb2 = buffer2)
{
int vectorSize = Vector256<byte>.Count;
for (int processed = 0; processed < len1; processed += vectorSize)
{
Vector256<byte> result = Avx2.CompareEqual(Avx.LoadVector256(pb1 + processed), Avx.LoadVector256(pb2 + processed));
if (Avx2.MoveMask(result) != -1)
{
return false;
}
}
}
}
}
}
}
finally
{
sharedArrayPool.Return(buffer1);
sharedArrayPool.Return(buffer2);
}
}
}
If the files are not too big, you can use:
public static byte[] ComputeFileHash(string fileName)
{
using (var stream = File.OpenRead(fileName))
return System.Security.Cryptography.MD5.Create().ComputeHash(stream);
}
Comparing hashes is only worthwhile if the hashes are useful to store (for example, when one file will be compared against many).
(Edited the code to something much cleaner.)
My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference against comparing bytes in a byte array.
So it is possible to replace that "Math.Ceiling and iterations" loop in the comment above with the simplest one:
for (int i = 0; i < count1; i++)
{
if (buffer1[i] != buffer2[i])
return false;
}
I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check arguments and then perform the bit shifting) before you compare, and that ends up being the same amount of work as comparing 8 bytes in two arrays.
Another improvement for large files with identical lengths might be not to read the files sequentially, but rather to compare more or less random blocks.
You can use multiple threads, starting on different positions in the file and comparing either forward or backwards.
This way you can detect changes at the middle/end of the file, faster than you would get there using a sequential approach.
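A simple sequential-but-reordered variant of that idea, as a sketch (checking one block at the end before scanning from the start; the block size is an arbitrary assumption, and equal lengths are assumed to have been checked already):
// Requires using System; using System.IO; using System.Linq;
static bool TailBlocksMatch(FileInfo a, FileInfo b, int blockSize = 4096)
{
    using (var fsA = a.OpenRead())
    using (var fsB = b.OpenRead())
    {
        long offset = Math.Max(0, a.Length - blockSize);
        fsA.Seek(offset, SeekOrigin.Begin);
        fsB.Seek(offset, SeekOrigin.Begin);
        var bufA = new byte[blockSize];
        var bufB = new byte[blockSize];
        int readA = fsA.Read(bufA, 0, blockSize);
        int readB = fsB.Read(bufB, 0, blockSize);
        // If the tails already differ, skip the full sequential comparison.
        return readA == readB && bufA.Take(readA).SequenceEqual(bufB.Take(readB));
    }
}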
If you only need to compare two files, I guess the fastest way would be (in C; I don't know if it's applicable to .NET):
open both files f1, f2
get the respective file length l1, l2
if l1 != l2 the files are different; stop
mmap() both files
use memcmp() on the mmap()ed files
OTOH, if you need to find whether there are duplicate files in a set of N files, then the fastest way is undoubtedly using a hash to avoid N-way bit-by-bit comparisons.
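.NET's closest analogue to mmap() is System.IO.MemoryMappedFiles; there is no direct memcmp(), so a hedged sketch would still compare through read-only views:
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Linq;

static bool FilesEqualViaMemoryMap(string path1, string path2)
{
    long length = new FileInfo(path1).Length;
    if (length != new FileInfo(path2).Length) return false; // lengths differ: not equal
    if (length == 0) return true;
    using (var map1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var map2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var view1 = map1.CreateViewStream(0, length, MemoryMappedFileAccess.Read))
    using (var view2 = map2.CreateViewStream(0, length, MemoryMappedFileAccess.Read))
    {
        var buf1 = new byte[64 * 1024];
        var buf2 = new byte[64 * 1024];
        int read1;
        // View streams over equal-length maps fill the buffer fully except at the end.
        while ((read1 = view1.Read(buf1, 0, buf1.Length)) > 0)
        {
            int read2 = view2.Read(buf2, 0, read1);
            if (read2 != read1 || !buf1.Take(read1).SequenceEqual(buf2.Take(read2)))
                return false;
        }
        return true;
    }
}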
Something (hopefully) reasonably efficient:
public class FileCompare
{
public static bool FilesEqual(string fileName1, string fileName2)
{
return FilesEqual(new FileInfo(fileName1), new FileInfo(fileName2));
}
/// <summary>
///
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <param name="bufferSize">8kb seemed like a good default</param>
/// <returns></returns>
public static bool FilesEqual(FileInfo file1, FileInfo file2, int bufferSize = 8192)
{
if (!file1.Exists || !file2.Exists || file1.Length != file2.Length) return false;
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
using (var stream1 = file1.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
{
using (var stream2 = file2.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
{
while (true)
{
var bytesRead1 = stream1.Read(buffer1, 0, bufferSize);
var bytesRead2 = stream2.Read(buffer2, 0, bufferSize);
if (bytesRead1 != bytesRead2) return false;
if (bytesRead1 == 0) return true;
if (!ArraysEqual(buffer1, buffer2, bytesRead1)) return false;
}
}
}
}
/// <summary>
///
/// </summary>
/// <param name="array1"></param>
/// <param name="array2"></param>
/// <param name="bytesToCompare"> 0 means compare entire arrays</param>
/// <returns></returns>
public static bool ArraysEqual(byte[] array1, byte[] array2, int bytesToCompare = 0)
{
if (array1.Length != array2.Length) return false;
var length = (bytesToCompare == 0) ? array1.Length : bytesToCompare;
var tailIdx = length - length % sizeof(Int64);
//check in 8 byte chunks
for (var i = 0; i < tailIdx; i += sizeof(Int64))
{
if (BitConverter.ToInt64(array1, i) != BitConverter.ToInt64(array2, i)) return false;
}
//check the remainder of the array, always shorter than 8 bytes
for (var i = tailIdx; i < length; i++)
{
if (array1[i] != array2[i]) return false;
}
return true;
}
}
Here are some utility functions that allow you to determine if two files (or two streams) contain identical data.
I have provided a "fast" version that is multi-threaded, as it compares byte arrays (each buffer filled from what's been read in each file) in different threads using Tasks.
As expected, it's much faster (around 3x), but it consumes more CPU (because it's multi-threaded) and more memory (because it needs two byte-array buffers per comparison thread).
public static bool AreFilesIdenticalFast(string path1, string path2)
{
return AreFilesIdentical(path1, path2, AreStreamsIdenticalFast);
}
public static bool AreFilesIdentical(string path1, string path2)
{
return AreFilesIdentical(path1, path2, AreStreamsIdentical);
}
public static bool AreFilesIdentical(string path1, string path2, Func<Stream, Stream, bool> areStreamsIdentical)
{
if (path1 == null)
throw new ArgumentNullException(nameof(path1));
if (path2 == null)
throw new ArgumentNullException(nameof(path2));
if (areStreamsIdentical == null)
throw new ArgumentNullException(nameof(areStreamsIdentical));
if (!File.Exists(path1) || !File.Exists(path2))
return false;
using (var thisFile = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (var valueFile = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
if (valueFile.Length != thisFile.Length)
return false;
if (!areStreamsIdentical(thisFile, valueFile))
return false;
}
}
return true;
}
public static bool AreStreamsIdenticalFast(Stream stream1, Stream stream2)
{
if (stream1 == null)
throw new ArgumentNullException(nameof(stream1));
if (stream2 == null)
throw new ArgumentNullException(nameof(stream2));
const int bufsize = 80000; // 80000 is below LOH (85000)
var tasks = new List<Task<bool>>();
do
{
// consumes more memory (two buffers for each tasks)
var buffer1 = new byte[bufsize];
var buffer2 = new byte[bufsize];
int read1 = stream1.Read(buffer1, 0, buffer1.Length);
if (read1 == 0)
{
int read3 = stream2.Read(buffer2, 0, 1);
if (read3 != 0) // not eof
return false;
break;
}
// both stream read could return different counts
int read2 = 0;
do
{
int read3 = stream2.Read(buffer2, read2, read1 - read2);
if (read3 == 0)
return false;
read2 += read3;
}
while (read2 < read1);
// consumes more cpu
var task = Task.Run(() =>
{
return IsSame(buffer1, buffer2);
});
tasks.Add(task);
}
while (true);
Task.WaitAll(tasks.ToArray());
return !tasks.Any(t => !t.Result);
}
public static bool AreStreamsIdentical(Stream stream1, Stream stream2)
{
if (stream1 == null)
throw new ArgumentNullException(nameof(stream1));
if (stream2 == null)
throw new ArgumentNullException(nameof(stream2));
const int bufsize = 80000; // 80000 is below LOH (85000)
var buffer1 = new byte[bufsize];
var buffer2 = new byte[bufsize];
var tasks = new List<Task<bool>>();
do
{
int read1 = stream1.Read(buffer1, 0, buffer1.Length);
if (read1 == 0)
return stream2.Read(buffer2, 0, 1) == 0; // check not eof
// both stream read could return different counts
int read2 = 0;
do
{
int read3 = stream2.Read(buffer2, read2, read1 - read2);
if (read3 == 0)
return false;
read2 += read3;
}
while (read2 < read1);
if (!IsSame(buffer1, buffer2))
return false;
}
while (true);
}
public static bool IsSame(byte[] bytes1, byte[] bytes2)
{
if (bytes1 == null)
throw new ArgumentNullException(nameof(bytes1));
if (bytes2 == null)
throw new ArgumentNullException(nameof(bytes2));
if (bytes1.Length != bytes2.Length)
return false;
for (int i = 0; i < bytes1.Length; i++)
{
if (bytes1[i] != bytes2[i])
return false;
}
return true;
}
I think there are applications where "hash" is faster than comparing byte by byte, for example if you need to compare a file against many others, or have a thumbnail of a photo that can change.
It depends on where and how it is used.
private bool CompareFilesByte(string file1, string file2)
{
using (var fs1 = new FileStream(file1, FileMode.Open))
using (var fs2 = new FileStream(file2, FileMode.Open))
{
if (fs1.Length != fs2.Length) return false;
int b1, b2;
do
{
    b1 = fs1.ReadByte();
    b2 = fs2.ReadByte();
    // Lengths are equal (checked above), so both streams hit EOF (-1) together.
    if (b1 != b2) return false;
}
while (b1 >= 0);
}
return true;
}
private string HashFile(string file)
{
using (var fs = new FileStream(file, FileMode.Open))
using (var reader = new BinaryReader(fs))
{
var hash = new SHA512CryptoServiceProvider();
hash.ComputeHash(reader.ReadBytes((int)fs.Length)); // fs.Length (the stream's length), not file.Length (the path string's length)
return Convert.ToBase64String(hash.Hash);
}
}
private bool CompareFilesWithHash(string file1, string file2)
{
var str1 = HashFile(file1);
var str2 = HashFile(file2);
return str1 == str2;
}
Here you can measure which is the fastest:
var sw = new Stopwatch();
sw.Start();
var compare1 = CompareFilesWithHash(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare using Hash {0}", sw.ElapsedTicks));
sw.Reset();
sw.Start();
var compare2 = CompareFilesByte(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare byte-byte {0}", sw.ElapsedTicks));
Optionally, we can save the hash in a database.
Hope this helps.
I have found that this works well: first compare the lengths without reading any data, then compare the byte sequences.
private static bool IsFileIdentical(string a, string b)
{
if (new FileInfo(a).Length != new FileInfo(b).Length) return false;
return (File.ReadAllBytes(a).SequenceEqual(File.ReadAllBytes(b)));
}
Yet another answer, derived from @chsh's: MD5 with usings, and shortcuts for same file, non-existing files, and differing lengths:
/// <summary>
/// Performs an md5 on the content of both files and returns true if
/// they match
/// </summary>
/// <param name="file1">first file</param>
/// <param name="file2">second file</param>
/// <returns>true if the contents of the two files is the same, false otherwise</returns>
public static bool IsSameContent(string file1, string file2)
{
if (file1 == file2)
return true;
FileInfo file1Info = new FileInfo(file1);
FileInfo file2Info = new FileInfo(file2);
if (!file1Info.Exists && !file2Info.Exists)
return true;
if (!file1Info.Exists && file2Info.Exists)
return false;
if (file1Info.Exists && !file2Info.Exists)
return false;
if (file1Info.Length != file2Info.Length)
return false;
using (FileStream file1Stream = file1Info.OpenRead())
using (FileStream file2Stream = file2Info.OpenRead())
{
byte[] firstHash = MD5.Create().ComputeHash(file1Stream);
byte[] secondHash = MD5.Create().ComputeHash(file2Stream);
for (int i = 0; i < firstHash.Length; i++)
{
if (i >= secondHash.Length || firstHash[i] != secondHash[i])
return false;
}
return true;
}
}
Not really an answer, but kinda funny.
This is what GitHub's Copilot (AI) suggested :-)
public static void CompareFiles(FileInfo actualFile, FileInfo expectedFile) {
if (actualFile.Length != expectedFile.Length) {
throw new Exception($"File {actualFile.Name} has different length in actual and expected directories.");
}
// compare the files on a byte level
using var actualStream = actualFile.OpenRead();
using var expectedStream = expectedFile.OpenRead();
var actualBuffer = new byte[1024];
var expectedBuffer = new byte[1024];
int actualBytesRead;
int expectedBytesRead;
do {
actualBytesRead = actualStream.Read(actualBuffer, 0, actualBuffer.Length);
expectedBytesRead = expectedStream.Read(expectedBuffer, 0, expectedBuffer.Length);
if (actualBytesRead != expectedBytesRead) {
throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
}
if (!actualBuffer.SequenceEqual(expectedBuffer)) {
throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
}
} while (actualBytesRead > 0);
}
I find the usage of SequenceEqual particularly interesting.
Is there a simple way or method to convert a Stream into a byte[] in C#?
The shortest solution I know:
using(var memoryStream = new MemoryStream())
{
sourceStream.CopyTo(memoryStream);
return memoryStream.ToArray();
}
Call the following function like this:
byte[] m_Bytes = StreamHelper.ReadToEnd(mystream);
The function:
public static byte[] ReadToEnd(System.IO.Stream stream)
{
long originalPosition = 0;
if(stream.CanSeek)
{
originalPosition = stream.Position;
stream.Position = 0;
}
try
{
byte[] readBuffer = new byte[4096];
int totalBytesRead = 0;
int bytesRead;
while ((bytesRead = stream.Read(readBuffer, totalBytesRead, readBuffer.Length - totalBytesRead)) > 0)
{
totalBytesRead += bytesRead;
if (totalBytesRead == readBuffer.Length)
{
int nextByte = stream.ReadByte();
if (nextByte != -1)
{
byte[] temp = new byte[readBuffer.Length * 2];
Buffer.BlockCopy(readBuffer, 0, temp, 0, readBuffer.Length);
Buffer.SetByte(temp, totalBytesRead, (byte)nextByte);
readBuffer = temp;
totalBytesRead++;
}
}
}
byte[] buffer = readBuffer;
if (readBuffer.Length != totalBytesRead)
{
buffer = new byte[totalBytesRead];
Buffer.BlockCopy(readBuffer, 0, buffer, 0, totalBytesRead);
}
return buffer;
}
finally
{
if(stream.CanSeek)
{
stream.Position = originalPosition;
}
}
}
I use this extension class:
public static class StreamExtensions
{
public static byte[] ReadAllBytes(this Stream instream)
{
if (instream is MemoryStream)
return ((MemoryStream) instream).ToArray();
using (var memoryStream = new MemoryStream())
{
instream.CopyTo(memoryStream);
return memoryStream.ToArray();
}
}
}
Just copy the class to your solution and you can use it on every stream:
byte[] bytes = myStream.ReadAllBytes()
Works great for all my streams and saves a lot of code!
Of course you can modify this method to use some of the other approaches here to improve performance if needed, but I like to keep it simple.
In .NET Framework 4 and later, the Stream class has a built-in CopyTo method that you can use.
For earlier versions of the framework, the handy helper function to have is:
public static void CopyStream(Stream input, Stream output)
{
byte[] b = new byte[32768];
int r;
while ((r = input.Read(b, 0, b.Length)) > 0)
output.Write(b, 0, r);
}
Then use one of the above methods to copy to a MemoryStream and call GetBuffer on it:
var file = new FileStream("c:\\foo.txt", FileMode.Open);
var mem = new MemoryStream();
// If using .NET 4 or later:
file.CopyTo(mem);
// Otherwise:
CopyStream(file, mem);
// getting the internal buffer (no additional copying)
byte[] buffer = mem.GetBuffer();
long length = mem.Length; // the actual length of the data
// (the array may be longer)
// if you need the array to be exactly as long as the data
byte[] truncated = mem.ToArray(); // makes another copy
Edit: originally I suggested using Jason's answer for a Stream that supports the Length property. But it had a flaw because it assumed that the Stream would return all its contents in a single Read, which is not necessarily true (not for a Socket, for example.) I don't know if there is an example of a Stream implementation in the BCL that does support Length but might return the data in shorter chunks than you request, but as anyone can inherit Stream this could easily be the case.
It's probably simpler for most cases to use the above general solution, but supposing you did want to read directly into an array that is bigEnough:
byte[] b = new byte[bigEnough];
int r, offset;
while ((r = input.Read(b, offset, b.Length - offset)) > 0)
offset += r;
That is, repeatedly call Read and move the position you will be storing the data at.
Byte[] Content = new BinaryReader(file.InputStream).ReadBytes(file.ContentLength);
byte[] buf; // byte array
Stream stream = Page.Request.InputStream; // the request's input stream
buf = new byte[stream.Length]; // size the array to the stream length
stream.Read(buf, 0, buf.Length); // read from the stream into the byte array
Ok, maybe I'm missing something here, but this is the way I do it:
public static Byte[] ToByteArray(this Stream stream) {
Int32 length = stream.Length > Int32.MaxValue ? Int32.MaxValue : Convert.ToInt32(stream.Length);
Byte[] buffer = new Byte[length];
stream.Read(buffer, 0, length);
return buffer;
}
If you post a file from a mobile device or elsewhere:
byte[] fileData = null;
using (var binaryReader = new BinaryReader(Request.Files[0].InputStream))
{
fileData = binaryReader.ReadBytes(Request.Files[0].ContentLength);
}
Stream s;
int len = (int)s.Length;
byte[] b = new byte[len];
int pos = 0, r;
while((r = s.Read(b, pos, len - pos)) > 0) {
pos += r;
}
A slightly more complicated solution is necessary if s.Length exceeds Int32.MaxValue. But if you need to read a stream that large into memory, you might want to think about a different approach to your problem.
Edit: If your stream does not support the Length property, modify using Earwicker's workaround.
public static class StreamExtensions {
// Credit to Earwicker
public static void CopyStream(this Stream input, Stream output) {
byte[] b = new byte[32768];
int r;
while ((r = input.Read(b, 0, b.Length)) > 0) {
output.Write(b, 0, r);
}
}
}
[...]
Stream s;
MemoryStream ms = new MemoryStream();
s.CopyStream(ms);
byte[] b = ms.GetBuffer();
"bigEnough" array is a bit of a stretch. Sure, buffer needs to be "big ebough" but proper design of an application should include transactions and delimiters. In this configuration each transaction would have a preset length thus your array would anticipate certain number of bytes and insert it into correctly sized buffer. Delimiters would ensure transaction integrity and would be supplied within each transaction. To make your application even better, you could use 2 channels (2 sockets). One would communicate fixed length control message transactions that would include information about size and sequence number of data transaction to be transferred using data channel. Receiver would acknowledge buffer creation and only then data would be sent.
If you have no control over stream sender than you need multidimensional array as a buffer. Component arrays would be small enough to be manageable and big enough to be practical based on your estimate of expected data. Process logic would seek known start delimiters and then ending delimiter in subsequent element arrays. Once ending delimiter is found, new buffer would be created to store relevant data between delimiters and initial buffer would have to be restructured to allow data disposal.
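For instance, a minimal sketch of such a fixed-header protocol (a 4-byte length prefix per transaction; the framing itself is an assumption for illustration):
// Reads one length-prefixed "transaction" into a correctly sized buffer.
static byte[] ReadTransaction(Stream stream)
{
    byte[] header = ReadExactly(stream, 4);
    int bodyLength = BitConverter.ToInt32(header, 0); // sender-declared size
    return ReadExactly(stream, bodyLength);
}

static byte[] ReadExactly(Stream stream, int count)
{
    byte[] buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int read = stream.Read(buffer, offset, count - offset);
        if (read == 0) throw new EndOfStreamException("Stream ended mid-transaction.");
        offset += read;
    }
    return buffer;
}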
As for the code to convert a stream into a byte array, it is below.
Stream s = yourStream;
int streamEnd = Convert.ToInt32(s.Length);
byte[] buffer = new byte[streamEnd];
s.Read(buffer, 0, streamEnd);
Quick and dirty technique:
static byte[] StreamToByteArray(Stream inputStream)
{
if (!inputStream.CanRead)
{
throw new ArgumentException();
}
// This is optional
if (inputStream.CanSeek)
{
inputStream.Seek(0, SeekOrigin.Begin);
}
byte[] output = new byte[inputStream.Length];
int bytesRead = inputStream.Read(output, 0, output.Length);
Debug.Assert(bytesRead == output.Length, "Bytes read from stream matches stream length");
return output;
}
Test:
static void Main(string[] args)
{
byte[] data;
string path = @"C:\Windows\System32\notepad.exe";
using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read))
{
data = StreamToByteArray(fs);
}
Debug.Assert(data.Length > 0);
Debug.Assert(new FileInfo(path).Length == data.Length);
}
I would ask why you want to read a stream into a byte[]. If you are wishing to copy the contents of a stream, may I suggest using MemoryStream and writing your input stream into a memory stream.
You could also try just reading in parts at a time and expanding the byte array being returned:
public byte[] StreamToByteArray(string fileName)
{
byte[] total_stream = new byte[0];
using (Stream input = File.Open(fileName, FileMode.Open, FileAccess.Read))
{
byte[] stream_array = new byte[0];
// Setup whatever read size you want (small here for testing)
byte[] buffer = new byte[32];// * 1024];
int read = 0;
while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
{
stream_array = new byte[total_stream.Length + read];
total_stream.CopyTo(stream_array, 0);
Array.Copy(buffer, 0, stream_array, total_stream.Length, read);
total_stream = stream_array;
}
}
return total_stream;
}