Read a fixed number of bytes when reading file using StreamReader - c#

I'm using this port of the Mozilla character set detector to determine a file's encoding and then using that to construct a StreamReader. So far, so good.
However, the file format I am reading is an odd one and from time to time it is necessary to skip a number of bytes. That is, a file that is otherwise text, in one or other encoding, will have some raw bytes embedded in it.
I would like to read the stream as text, up to the point that I hit some text that indicates a byte stream follows, then I would like to read the byte stream, then resume reading as text. What is the best way of doing this (balance of simplicity and performance)?
I can't rely on seeking against the FileStream underlying the StreamReader (and then discarding the buffered data in the latter) because I don't know how many bytes were used in reading the characters up to that point. I might abandon using StreamReader and switch to a bespoke class that uses parallel arrays of bytes and chars, populates the latter from the former using a decoder, and tracks the position in the byte array every time a character is read by using the encoding to calculate the number of bytes used for the character. Yuk.
To further clarify, the file has this format:
[encoded chars][embedded bytes indicator + len][len bytes][encoded chars]...
Where there may be zero, one, or many blocks of embedded bytes, and the blocks of encoded chars may be any length.
So, for example:
ABC:123:DEF:456:$0099[0x00,0x01,0x02,... x 99]GHI:789:JKL:...
There are no line delimiters. I may have any number of fields (ABC, 123, ...) delimited by some character (in this case a colon). These fields may be in various codepages, including UTF-8 (not guaranteed to be single byte). When I hit a $ I know that the next 4 bytes contain a length (call it n), the next n bytes are to be read raw, and byte n + 1 will be another text field (GHI).

Proof of concept. This class works with UTF-16 string data, and ':' delimiters per OP. It expects binary length as a 4-byte, little-endian binary integer. It should be easy to adjust to more specific details of your (odd) file format. For example, any Decoder class should drop in to ReadString() and "just work".
To use it, construct it with a Stream class. For each individual data element, call ReportNextData(), which will tell you what kind of data is next, and then call the appropriate Read*() method. For binary data, call ReadBinaryLength() and then ReadBinaryData().
Note that ReadBinaryData() follows the stream contract; it is not guaranteed to return as many bytes as you asked for, so you may need to call it several times. However, if you ask for too many bytes, it will throw EndOfStreamException.
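For example, a drain loop that honors that contract might look like this (a sketch, assuming reader is an OddFileReader that ReportNextData() says is positioned at a binary block):
int remaining = reader.ReadBinaryLength();
byte[] block = new byte[remaining];
int offset = 0;
while (offset < remaining)
{
    // ReadBinaryData may return fewer bytes than requested, so loop until full
    offset += reader.ReadBinaryData(block, offset, remaining - offset);
}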
I tested it with this data (hex format):
410042004300240A0000000102030405060708090024050000000504030201580059005A003A310032003300
Which is:
ABC$[10][1234567890]$[5][54321]XYZ:123
Scan the data like so:
OddFileReader.NextData nextData;
while ((nextData = reader.ReportNextData()) != OddFileReader.NextData.Eof)
{
// Call appropriate Read*() here.
}
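For instance, the dispatch inside that loop might look like this (a sketch; the buffer bookkeeping names are illustrative, and the binary handling follows the contract described above):
OddFileReader.NextData nextData;
byte[] binary = null;
int filled = 0;
while ((nextData = reader.ReportNextData()) != OddFileReader.NextData.Eof)
{
    switch (nextData)
    {
        case OddFileReader.NextData.String:
            string field = reader.ReadString();      // one ':'-delimited text field
            break;
        case OddFileReader.NextData.BinaryLength:
            int length = reader.ReadBinaryLength();  // consumes '$' and the 4-byte length
            binary = new byte[length];
            filled = 0;
            break;
        case OddFileReader.NextData.BinaryData:
            // may take several calls; ReportNextData stays BinaryData until the block is drained
            filled += reader.ReadBinaryData(binary, filled, binary.Length - filled);
            break;
    }
}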
using System;
using System.IO;
using System.Text;
public class OddFileReader : IDisposable
{
public enum NextData
{
Unknown,
Eof,
String,
BinaryLength,
BinaryData
}
private Stream source;
private byte[] byteBuffer;
private int bufferOffset;
private int bufferEnd;
private NextData nextData;
private int binaryOffset;
private int binaryEnd;
private char[] characterBuffer;
public OddFileReader(Stream source)
{
this.source = source;
}
public NextData ReportNextData()
{
if (nextData != NextData.Unknown)
{
return nextData;
}
if (!PopulateBufferIfNeeded(1))
{
return (nextData = NextData.Eof);
}
if (byteBuffer[bufferOffset] == '$')
{
return (nextData = NextData.BinaryLength);
}
else
{
return (nextData = NextData.String);
}
}
public string ReadString()
{
ReportNextData();
if (nextData == NextData.Eof)
{
throw new EndOfStreamException();
}
else if (nextData != NextData.String)
{
throw new InvalidOperationException("Attempt to read non-string data as string");
}
if (characterBuffer == null)
{
characterBuffer = new char[1];
}
StringBuilder stringBuilder = new StringBuilder();
Decoder decoder = Encoding.Unicode.GetDecoder();
while (nextData == NextData.String)
{
byte b = byteBuffer[bufferOffset];
if (b == '$')
{
nextData = NextData.BinaryLength;
break;
}
else if (b == ':')
{
nextData = NextData.Unknown;
bufferOffset++;
break;
}
else
{
if (decoder.GetChars(byteBuffer, bufferOffset++, 1, characterBuffer, 0) == 1)
{
stringBuilder.Append(characterBuffer[0]);
}
if (bufferOffset == bufferEnd && !PopulateBufferIfNeeded(1))
{
nextData = NextData.Eof;
break;
}
}
}
return stringBuilder.ToString();
}
public int ReadBinaryLength()
{
ReportNextData();
if (nextData == NextData.Eof)
{
throw new EndOfStreamException();
}
else if (nextData != NextData.BinaryLength)
{
throw new InvalidOperationException("Attempt to read non-binary-length data as binary length");
}
bufferOffset++;
if (!PopulateBufferIfNeeded(sizeof(Int32)))
{
nextData = NextData.Eof;
throw new EndOfStreamException();
}
binaryEnd = BitConverter.ToInt32(byteBuffer, bufferOffset);
binaryOffset = 0;
bufferOffset += sizeof(Int32);
nextData = NextData.BinaryData;
return binaryEnd;
}
public int ReadBinaryData(byte[] buffer, int offset, int count)
{
ReportNextData();
if (nextData == NextData.Eof)
{
throw new EndOfStreamException();
}
else if (nextData != NextData.BinaryData)
{
throw new InvalidOperationException("Attempt to read non-binary data as binary data");
}
if (count > binaryEnd - binaryOffset)
{
throw new EndOfStreamException();
}
int bytesRead;
if (bufferOffset < bufferEnd)
{
bytesRead = Math.Min(count, bufferEnd - bufferOffset);
Array.Copy(byteBuffer, bufferOffset, buffer, offset, bytesRead);
bufferOffset += bytesRead;
}
else if (count < byteBuffer.Length)
{
if (!PopulateBufferIfNeeded(1))
{
throw new EndOfStreamException();
}
bytesRead = Math.Min(count, bufferEnd - bufferOffset);
Array.Copy(byteBuffer, bufferOffset, buffer, offset, bytesRead);
bufferOffset += bytesRead;
}
else
{
bytesRead = source.Read(buffer, offset, count);
}
binaryOffset += bytesRead;
if (binaryOffset == binaryEnd)
{
nextData = NextData.Unknown;
}
return bytesRead;
}
private bool PopulateBufferIfNeeded(int minimumBytes)
{
if (byteBuffer == null)
{
byteBuffer = new byte[8192];
}
if (bufferEnd - bufferOffset < minimumBytes)
{
int shiftCount = bufferEnd - bufferOffset;
if (shiftCount > 0)
{
Array.Copy(byteBuffer, bufferOffset, byteBuffer, 0, shiftCount);
}
bufferOffset = 0;
bufferEnd = shiftCount;
while (bufferEnd - bufferOffset < minimumBytes)
{
int bytesRead = source.Read(byteBuffer, bufferEnd, byteBuffer.Length - bufferEnd);
if (bytesRead == 0)
{
return false;
}
bufferEnd += bytesRead;
}
}
return true;
}
public void Dispose()
{
Stream source = this.source;
this.source = null;
if (source != null)
{
source.Dispose();
}
}
}

Related

How can I get the data of 8bit wav file

I'm making a demo about sound in Windows Forms. I created 3 classes for reading the data of a wave file. Code is below:
RiffBlock
public class RiffBlock
{
private byte[] riffID;
private uint riffSize;
private byte[] riffFormat;
public byte[] RiffID
{
get { return riffID; }
}
public uint RiffSize
{
get { return (riffSize); }
}
public byte[] RiffFormat
{
get { return riffFormat; }
}
public RiffBlock()
{
riffID = new byte[4];
riffFormat = new byte[4];
}
public void ReadRiff(FileStream inFS)
{
inFS.Read(riffID, 0, 4);
BinaryReader binRead = new BinaryReader(inFS);
riffSize = binRead.ReadUInt32();
inFS.Read(riffFormat, 0, 4);
}
}
FormatBlock
public class FormatBlock
{
private byte[] fmtID;
private uint fmtSize;
private ushort fmtTag;
private ushort fmtChannels;
private uint fmtSamplesPerSec;
private uint fmtAverageBytesPerSec;
private ushort fmtBlockAlign;
private ushort fmtBitsPerSample;
public byte[] FmtID
{
get { return fmtID; }
}
public uint FmtSize
{
get { return fmtSize; }
}
public ushort FmtTag
{
get { return fmtTag; }
}
public ushort Channels
{
get { return fmtChannels; }
}
public uint SamplesPerSec
{
get { return fmtSamplesPerSec; }
}
public uint AverageBytesPerSec
{
get { return fmtAverageBytesPerSec; }
}
public ushort BlockAlign
{
get { return fmtBlockAlign; }
}
public ushort BitsPerSample
{
get { return fmtBitsPerSample; }
}
public FormatBlock()
{
fmtID = new byte[4];
}
public void ReadFmt(FileStream inFS)
{
inFS.Read(fmtID, 0, 4);
BinaryReader binRead = new BinaryReader(inFS);
fmtSize = binRead.ReadUInt32();
fmtTag = binRead.ReadUInt16();
fmtChannels = binRead.ReadUInt16();
fmtSamplesPerSec = binRead.ReadUInt32();
fmtAverageBytesPerSec = binRead.ReadUInt32();
fmtBlockAlign = binRead.ReadUInt16();
fmtBitsPerSample = binRead.ReadUInt16();
// This accounts for the variable format header size
// 12 bytes of Riff Header, 4 bytes for FormatId, 4 bytes for FormatSize & the Actual size of the Format Header
inFS.Seek(fmtSize + 20, System.IO.SeekOrigin.Begin);
}
}
DataBlock
public class DataBlock
{
private byte[] dataID;
private uint dataSize;
private Int16[] data;
private int dataNumSamples;
public byte[] DataID
{
get { return dataID; }
}
public uint DataSize
{
get { return dataSize; }
}
public Int16 this[int pos]
{
get { return data[pos]; }
}
public int NumSamples
{
get { return dataNumSamples; }
}
public DataBlock()
{
dataID = new byte[4];
}
public void ReadData(FileStream inFS)
{
inFS.Read(dataID, 0, 4);
BinaryReader binRead = new BinaryReader(inFS);
dataSize = binRead.ReadUInt32();
data = new Int16[dataSize];
inFS.Seek(40, SeekOrigin.Begin);
dataNumSamples = (int)(dataSize / 2);
for (int i = 0; i < dataNumSamples; i++)
{
data[i] = binRead.ReadInt16();
}
}
}
It works OK with only 16-bit wave files, but when I choose an 8-bit wav file or another, the result of this command dataSize = binRead.ReadUInt32(); is only 4 although the file size is big.
How can I get the data of 8-bit, 24-bit... wav files?
Any solutions are appreciated, thank you very much.
Your reading methodology is flawed. The length is correct but for an 8 bits per sample file you should be reading bytes not words; as it stands the data will be incorrect and the value returned by the NumSamples property will be wrong.
In my case, the sub chunk size is 1160, the number of channels is 1 (mono) and the bits per sample is 8 (byte). You will need to decode the bits per sample and adjust your reading accordingly. For the WAV file I used, your program allocated a data array of the correct length (but 16-bit), divided the data length by 2 and called this the number of samples (wrong: it should be 1160), and then proceeded to read 580 word values from the stream.
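A minimal sketch of that adjustment (illustrative only, reusing the question's FormatBlock and DataBlock member names; the 8-bit branch reads bytes, since 8-bit WAV samples are single unsigned bytes centred on 128):
public void ReadData(FileStream inFS, FormatBlock fmt)
{
    inFS.Read(dataID, 0, 4);
    BinaryReader binRead = new BinaryReader(inFS);
    dataSize = binRead.ReadUInt32();
    if (fmt.BitsPerSample == 8)
    {
        // one byte per sample: read bytes, not 16-bit words
        byte[] raw = binRead.ReadBytes((int)dataSize);
        dataNumSamples = raw.Length;
        data = new Int16[dataNumSamples];
        for (int i = 0; i < dataNumSamples; i++)
            data[i] = (Int16)(raw[i] - 128); // convert unsigned 8-bit to signed, centred at 0
    }
    else // 16-bit: two bytes per sample
    {
        dataNumSamples = (int)(dataSize / 2);
        data = new Int16[dataNumSamples];
        for (int i = 0; i < dataNumSamples; i++)
            data[i] = binRead.ReadInt16();
    }
}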
Edit: My ancient code will not cut it in the modern age (I seem to recall having to modify it some years ago to cope with at least one additional chunk type but the details escape me).
This is what you get; anything more and your question should read "Could someone write me a program to load WAV files", as it is, we are way beyond the original question and it is time for you to knuckle down and make it work how you need it to :-)
References:
http://www.neurophys.wisc.edu/auditory/riff-format.txt
http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
http://www.shikadi.net/moddingwiki/Resource_Interchange_File_Format_(RIFF)
http://soundfile.sapp.org/doc/WaveFormat/
All are very useful.
///<summary>
///Reads up to 16 bit WAV files
///</summary>
///<remarks> Things have really changed in the last 15 years</remarks>
public class RiffLoader
{
public enum RiffStatus { Unknown = 0, OK, FileError, FormatError, UnsupportedFormat };
LinkedList<Chunk> chunks = new LinkedList<Chunk>();
RiffStatus status = RiffStatus.Unknown;
List<String> errorMessages = new List<string>();
String source;
public String SourceName { get { return source; } }
public RiffStatus Status { get { return status; } }
public String[] Messages { get { return errorMessages.ToArray(); } }
enum chunkType { Unknown = 0, NoMore, Riff, Fmt, Fact, Data, Error = -1 };
static Int32 scan32bits(byte[] source, int offset = 0)
{
return source[offset] | source[offset + 1] << 8 | source[offset + 2] << 16 | source[offset + 3] << 24;
}
static Int32 scan16bits(byte[] source, int offset = 0)
{
return source[offset] | source[offset + 1] << 8;
}
static Int32 scan8bits(byte[] source, int offset = 0)
{
return source[offset];
}
abstract class Chunk
{
public chunkType Ident = chunkType.Unknown;
public int ByteCount = 0;
}
class RiffChunk : Chunk
{
public RiffChunk(int count)
{
this.Ident = chunkType.Riff;
this.ByteCount = count;
}
}
class FmtChunk : Chunk
{
int formatCode = 0;
int channels = 0;
int samplesPerSec = 0;
int avgBytesPerSec = 0;
int blockAlign = 0;
int bitsPerSample = 0;
int significantBits = 0;
public int Format { get { return formatCode; } }
public int Channels { get { return channels; } }
public int BlockAlign { get { return blockAlign; } }
public int BytesPerSample { get { return bitsPerSample / 8 + ((bitsPerSample % 8) > 0 ? 1 : 0); } }
public int BitsPerSample
{
get
{
if (significantBits > 0)
return significantBits;
return bitsPerSample;
}
}
public FmtChunk(byte[] buffer) : base()
{
int size = buffer.Length;
// if the length is 18 then buffer 16,17 should be 00 00 (I don't bother checking)
if (size != 16 && size != 18 && size != 40)
return;
formatCode = scan16bits(buffer, 0);
channels = scan16bits(buffer, 2);
samplesPerSec = scan32bits(buffer, 4);
avgBytesPerSec = scan32bits(buffer, 8);
blockAlign = scan16bits(buffer, 12);
bitsPerSample = scan16bits(buffer, 14);
if (formatCode == 0xfffe) // EXTENSIBLE
{
if (size != 40)
return;
significantBits = scan16bits(buffer, 18);
// skipping speaker map
formatCode = scan16bits(buffer, 24); // first two bytes of the GUID
// the rest of the GUID is fixed, decode it and check it if you wish
}
this.Ident = chunkType.Fmt;
this.ByteCount = size;
}
}
class DataChunk : Chunk
{
byte[] samples = null;
///<summary>
///Create a data chunk
///<para>
///The supplied buffer must be correctly sized with zero offset and must be purely for this class
///</para>
///<summary>
///<param name="buffer">source array</param>
public DataChunk(byte[] buffer)
{
this.Ident = chunkType.Data;
this.ByteCount = buffer.Length;
samples = buffer;
}
public enum SampleStatus { OK, Duff }
public class Samples
{
public SampleStatus Status = SampleStatus.Duff;
public List<int[]> Channels = new List<int[]>();
#if false // debugger helper method
/*
** Change #if false to #if true to include this
** Break at end of GetSamples on "return retval"
** open immediate window and type retval.DumpLast(16)
** look in output window for dump of last 16 entries
*/
public int DumpLast(int count)
{
for (int i = Channels[0].Length - count; i < Channels[0].Length; i++)
Console.WriteLine(String.Format("{0:X4} {1:X4},{2:X4}", i, Channels[0][i], Channels[1][i]));
return 0;
}
#endif
}
/*
** Return the decoded samples
*/
public Samples GetSamples(FmtChunk format)
{
Samples retval = new Samples();
int samplesPerChannel = this.ByteCount / (format.BytesPerSample * format.Channels);
int mask = 0, sign=0;
int [][] samples = new int [format.Channels][];
for (int c = 0; c < format.Channels; c++)
samples[c] = new int[samplesPerChannel];
if (format.BitsPerSample >= 8 && format.BitsPerSample <= 16) // 24+ is left as an exercise
{
sign = (int)Math.Floor(Math.Pow(2, format.BitsPerSample - 1));
mask = (int)Math.Floor(Math.Pow(2, format.BitsPerSample)) - 1;
int offset = 0, index = 0;
int s = 0;
while (index < samplesPerChannel)
{
for (int c = 0; c < format.Channels; c++)
{
switch (format.BytesPerSample)
{
case 1:
s = scan8bits(this.samples, offset) & mask;
break;
case 2:
s = scan16bits(this.samples, offset) & mask;
break;
}
// sign extend the data to Int32
samples[c][index] = s | ((s & sign) != 0 ? ~mask : 0);
offset += format.BytesPerSample;
}
++index;
}
retval.Channels = new List<int[]>(samples);
retval.Status = SampleStatus.OK;
}
return retval;
}
}
class FactChunk : Chunk
{
int samplesPerChannel;
public int SamplesPerChannel { get { return samplesPerChannel; } }
public FactChunk(byte[] buffer)
{
this.Ident = chunkType.Fact;
this.ByteCount = buffer.Length;
if (buffer.Length >= 4)
samplesPerChannel = scan32bits(buffer);
}
}
class DummyChunk : Chunk
{
public DummyChunk(int size, chunkType type = chunkType.Unknown)
{
this.Ident = type;
this.ByteCount = size;
}
}
public RiffLoader(String fileName)
{
source = fileName;
try
{
using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
using (BinaryReader reader = new BinaryReader(fs))
{
Chunk c = getChunk(fs, reader);
if (c.Ident != chunkType.Riff)
{
status = RiffStatus.FileError;
errorMessages.Add(String.Format("Error loading \"{0}\": No valid header"));
}
chunks.AddLast(c);
c = getChunk(fs, reader);
if (c.Ident != chunkType.Fmt)
{
status = RiffStatus.FileError;
errorMessages.Add(String.Format("Error loading \"{0}\": No format chunk"));
}
chunks.AddLast(c);
/*
** From now on we just keep scanning to the end of the file
*/
while (fs.Position < fs.Length)
{
c = getChunk(fs, reader);
switch (c.Ident)
{
case chunkType.Fact:
case chunkType.Data:
chunks.AddLast(c);
break;
case chunkType.Unknown:
break; // skip it - don't care what it is
}
}
FmtChunk format = null;
foreach (var chunk in chunks)
{
switch(chunk.Ident)
{
case chunkType.Fmt:
format = chunk as FmtChunk;
break;
case chunkType.Data:
if (format != null)
{
DataChunk dc = chunk as DataChunk;
var x = dc.GetSamples(format);
}
break;
}
}
}
}
catch (Exception e)
{
status = RiffStatus.FileError;
errorMessages.Add(String.Format("Error loading \"{0}\": {1}", e.Message));
}
}
/*
** Get a chunk of data from the file - knows nothing of the internal format of the chunk.
*/
Chunk getChunk(FileStream stream, BinaryReader reader)
{
byte[] buffer;
int size;
buffer = reader.ReadBytes(8);
if (buffer.Length == 8)
{
String prefix = new String(Encoding.ASCII.GetChars(buffer, 0, 4));
size = scan32bits(buffer, 4);
if (size + stream.Position <= stream.Length) // skip if there isn't enough data
{
if (String.Compare(prefix, "RIFF") == 0)
{
/*
** only "WAVE" type is acceptable
**
** Don't read size bytes or the entire file will end up in the RIFF chunk
*/
if (size >= 4)
{
buffer = reader.ReadBytes(4);
String ident = new String(Encoding.ASCII.GetChars(buffer, 0, 4));
if (String.CompareOrdinal(ident, "WAVE") == 0)
return new RiffChunk(size - 4);
}
}
else if (String.Compare(prefix, "fmt ") == 0)
{
if (size >= 16)
{
buffer = reader.ReadBytes(size);
if (buffer.Length == size)
return new FmtChunk(buffer);
}
}
else if (String.Compare(prefix, "fact") == 0)
{
if (size >= 4)
{
buffer = reader.ReadBytes(size); // read the whole chunk; FactChunk only uses the first 4 bytes
if (buffer.Length == size)
return new FactChunk(buffer);
}
}
else if (String.Compare(prefix, "data") == 0)
{
// assume that there has to be data
if (size > 0)
{
buffer = reader.ReadBytes(size);
if ((size & 1) != 0) // odd length?
{
if (stream.Position < stream.Length)
reader.ReadByte();
else
size = -1; // force an error - there should be a pad byte
}
if (buffer.Length == size)
return new DataChunk(buffer);
}
}
else
{
/*
** there are a number of weird and wonderful block types - assume there has to be data
*/
if (size > 0)
{
buffer = reader.ReadBytes(size);
if ((size & 1) != 0) // odd length?
{
if (stream.Position < stream.Length)
reader.ReadByte();
else
size = -1; // force an error - there should be a pad byte
}
if (buffer.Length == size)
{
DummyChunk skip = new DummyChunk(size);
return skip;
}
}
}
}
}
return new DummyChunk(0, chunkType.Error);
}
}
You will need to add properties as required and code to navigate the returned linked list. Particular properties you may need are the sample rates in the format block, assuming you are going to process the data in some way.
Adding 24 to 32 bits is simple; if you need to go beyond 32 bits you will have to switch to Int64s.
I have tested it with some good samples from http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/Samples.html, as well as your 8 bit file, and it decodes OK and seems to have the correct data.
You should be more than capable of making it work now, good luck.
Edit (again, like the proverbial dog...):
Of course, I should have sign extended the value, so DataChunk.GetSamples() now does that. Looking at the Codeproject files (it isn't the greatest code by the way, but the guy does say that he is only just learning C#, so fair play for tackling a graphical user control) it is obvious that the data is signed. It's a shame that he didn't standardise his source to be an array of Int32's, then it wouldn't matter how many bits the WAV was encoded in.
You could use the NAudio library:
https://naudio.codeplex.com/ (take a look at their website, there are quite a lot of tutorials).
Hope this helps :).

FileStream delimiter?

I have a large file in a mixed text/binary format.
file format: (0 represent a byte)
00000FileName0000000Hello
World
world1
...
0000000000000000000000
Currently I'm using FileStream and I want to read the Hello.
I know where Hello starts, and it ends with a 0x0D 0x0A.
I also need to go back if the word is not equal to Hello.
How can I read until a carriage return?
Is there any PEEK-like function in FileStream so I can move back the read pointer?
Is FileStream even a good choice in this case?
You can use the method FileStream.Seek to change the read/write position.
You can use BinaryReader for reading binary content; however, it uses an inner buffer, so you cannot rely on the underlying Stream.Position anymore, because it can read more bytes in the background than you want. But you can re-implement its needed methods:
private byte[] ReadBytes(Stream s, int count)
{
byte[] buffer = new byte[count];
if (count == 0)
{
return buffer;
}
// reading one byte
if (count == 1)
{
int value = s.ReadByte();
if (value == -1)
throw new IOException("Out of stream");
buffer[0] = (byte)value;
return buffer;
}
// reading multiple bytes
int offset = 0;
do
{
int readBytes = s.Read(buffer, offset, count - offset);
if (readBytes == 0)
throw new IOException("Out of stream");
offset += readBytes;
}
while (offset < count);
return buffer;
}
public int ReadInt32(Stream s)
{
byte[] buffer = ReadBytes(s, 4);
return BitConverter.ToInt32(buffer, 0);
}
// similarly, write ReadInt16/64, etc, whatever you need
Assuming that you are on the start position, you can write a ReadString, too:
private string ReadString(Stream s, char delimiter)
{
var result = new List<char>();
int c;
while ((c = s.ReadByte()) != -1 && (char)c != delimiter)
{
result.Add((char)c);
}
return new string(result.ToArray());
}
Usage:
FileStream fs = GetMyFile(); // todo
if (!fs.CanSeek)
throw new NotSupportedException("sorry");
long posCurrent = fs.Position; // save current position
int posHello = ReadInt32(fs); // read position of "hello"
fs.Seek(posHello, SeekOrigin.Begin); // seeking to hello
string hello = ReadString(fs, '\n'); // reading hello
fs.Seek(posCurrent, SeekOrigin.Begin); // seeking back
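As for a PEEK-like function: FileStream has none built in, but on a seekable stream you can emulate it (a minimal sketch):
private static int PeekByte(FileStream fs)
{
    int value = fs.ReadByte();           // advances the position by one...
    if (value != -1)
        fs.Seek(-1, SeekOrigin.Current); // ...so step back to undo the read
    return value;                        // -1 signals end of stream
}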

How to implement a lazy stream chunk enumerator?

I'm trying to split a byte stream into chunks of increasing size.
The source stream contains an unknown number of bytes and is expensive to read. The output of the enumerator should be byte arrays of increasing size, starting at 8KB up to 1MB.
This is very simple to do by simply reading the whole stream, storing it in an array and taking the relevant pieces out. However, since the stream may be very large, reading it at once is unfeasible. Also, while performance is not the main concern, it is important to keep system load very low.
While implementing this I noticed that it's relatively difficult to keep the code short and maintainable. There are a few stream related issues to keep in mind, too (for instance, Stream.Read might not fill the buffer even though it succeeded).
I did not find any existing classes that help for my case, nor could I find something close on the net. How would you implement such a class?
public IEnumerable<BufferWrapper> getBytes(Stream stream)
{
List<int> bufferSizes = new List<int>() { 8192, 65536, 220160, 1048576 };
int count = 0;
int bufferSizePosition = 0;
byte[] buffer = new byte[bufferSizes[0]];
bool done = false;
while (!done)
{
BufferWrapper nextResult = new BufferWrapper();
nextResult.bytesRead = stream.Read(buffer, 0, buffer.Length);
nextResult.buffer = buffer;
done = nextResult.bytesRead == 0;
if (!done)
{
yield return nextResult;
count++;
if (count > 10 && bufferSizePosition < bufferSizes.Count - 1)
{
count = 0;
bufferSizePosition++;
buffer = new byte[bufferSizes[bufferSizePosition]];
}
}
}
}
public class BufferWrapper
{
public byte[] buffer { get; set; }
public int bytesRead { get; set; }
}
Obviously the logic for when to move up in buffer size, and how to choose what that size is could be altered.
Someone could also probably find a better way of handling the last buffer to be sent, as this isn't the most efficient way.
For reference, the implementation I currently use, already with improvements as per the answer by @Servy
private const int InitialBlockSize = 8 * 1024;
private const int MaximumBlockSize = 1024 * 1024;
private Stream _Stream;
private int _Size = InitialBlockSize;
public byte[] Current
{
get;
private set;
}
public bool MoveNext ()
{
if (_Size < 0) {
return false;
}
var buf = new byte[_Size];
int count = 0;
while (count < _Size) {
int read = _Stream.Read (buf, count, _Size - count);
if (read == 0) {
break;
}
count += read;
}
if (count == _Size) {
Current = buf;
if (_Size <= MaximumBlockSize / 2) {
_Size *= 2;
}
}
else {
Current = new byte[count];
Array.Copy (buf, Current, count);
_Size = -1;
}
return true;
}
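For illustration, a hypothetical driver for those members (assuming they live in a class ChunkEnumerator whose constructor assigns _Stream; Process is a placeholder):
var enumerator = new ChunkEnumerator(sourceStream);
while (enumerator.MoveNext())
{
    // Current starts at 8 KB and doubles up to 1 MB; the final chunk may be
    // shorter (or empty) when the stream ends mid-block.
    Process(enumerator.Current);
}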

Sockets and Streams - Mixing StreamReader and BinaryReader

I'm working with a socket connection. To simplify things I get the socket's NetworkStream and wrap it in a StreamReader, since the content my socket receives from the server is largely textual.
However there are times when the server sends binary information, like so:
TEXT
MORETEXT
500 BYTES OF BINARY DATA FOLLOWS THIS LINE
{500 bytes of binary data}
I'm reading the text content with the StreamReader fine, but because the StreamReader has its own buffer it means the StreamReader grabs the binary data before I can switch to the BinaryReader to read the 500 bytes of binary data.
Is there a way around this? I'd like the ability to read the textual data whilst still being able to read binary data.
I should have done my research better; it turns out that the BinaryReader class already contains string and character processing methods (though it lacks a few, like ReadLine, which can easily be added by subclassing it).
It's strange, then, that BinaryReader doesn't subclass TextReader, as it is more than capable of doing so.
Here's an extension of BinaryReader that you can use to perform ReadLine and the usual BinaryReader stuff.
using System.Diagnostics;
using System.IO;
using System.Text;
public class LineReader : BinaryReader
{
private Encoding _encoding;
private Decoder _decoder;
const int bufferSize = 1024;
private char[] _LineBuffer = new char[bufferSize];
public LineReader(Stream stream, int bufferSize, Encoding encoding)
: base(stream, encoding)
{
this._encoding = encoding;
this._decoder = encoding.GetDecoder();
}
public string ReadLine()
{
int pos = 0;
char[] buf = new char[2];
StringBuilder stringBuffer = null;
bool lineEndFound = false;
while(base.Read(buf, 0, 2) > 0)
{
if (buf[1] == '\r')
{
// grab buf[0]
this._LineBuffer[pos++] = buf[0];
// get the '\n'
char ch = base.ReadChar();
Debug.Assert(ch == '\n');
lineEndFound = true;
}
else if (buf[0] == '\r')
{
lineEndFound = true;
}
else
{
this._LineBuffer[pos] = buf[0];
this._LineBuffer[pos+1] = buf[1];
pos += 2;
if (pos >= bufferSize)
{
stringBuffer = new StringBuilder(bufferSize + 80);
stringBuffer.Append(this._LineBuffer, 0, bufferSize);
pos = 0;
}
}
if (lineEndFound)
{
if (stringBuffer == null)
{
if (pos > 0)
return new string(this._LineBuffer, 0, pos);
else
return string.Empty;
}
else
{
if (pos > 0)
stringBuffer.Append(this._LineBuffer, 0, pos);
return stringBuffer.ToString();
}
}
}
if (stringBuffer != null)
{
if (pos > 0)
stringBuffer.Append(this._LineBuffer, 0, pos);
return stringBuffer.ToString();
}
else
{
if (pos > 0)
return new string(this._LineBuffer, 0, pos);
else
return null;
}
}
}
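A usage sketch for the socket scenario above (the marker string and the 500-byte length come from the question; the TcpClient variable and the encoding are assumptions):
var reader = new LineReader(client.GetStream(), 1024, Encoding.ASCII);
string line;
while ((line = reader.ReadLine()) != null)
{
    if (line.EndsWith("BYTES OF BINARY DATA FOLLOWS THIS LINE"))
    {
        // ReadBytes is plain BinaryReader: it loops internally until count bytes or EOF
        byte[] payload = reader.ReadBytes(500);
        // ... process payload, then fall through and keep reading lines
    }
}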

How to compare 2 files fast using .NET?

Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.
Would a checksum comparison such as CRC be faster?
Are there any .NET libraries that can generate a checksum for a file?
The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you would use an array of bytes sized to Int64, and then compare the resulting numbers.
Here's what I came up with:
const int BYTES_TO_READ = sizeof(Int64);
static bool FilesAreEqual(FileInfo first, FileInfo second)
{
if (first.Length != second.Length)
return false;
if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
return true;
int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);
using (FileStream fs1 = first.OpenRead())
using (FileStream fs2 = second.OpenRead())
{
byte[] one = new byte[BYTES_TO_READ];
byte[] two = new byte[BYTES_TO_READ];
for (int i = 0; i < iterations; i++)
{
fs1.Read(one, 0, BYTES_TO_READ);
fs2.Read(two, 0, BYTES_TO_READ);
if (BitConverter.ToInt64(one,0) != BitConverter.ToInt64(two,0))
return false;
}
}
return true;
}
In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.
Here are the ReadByte and hashing methods I used, for comparison purposes:
static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
{
if (first.Length != second.Length)
return false;
if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
return true;
using (FileStream fs1 = first.OpenRead())
using (FileStream fs2 = second.OpenRead())
{
for (int i = 0; i < first.Length; i++)
{
if (fs1.ReadByte() != fs2.ReadByte())
return false;
}
}
return true;
}
static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
{
byte[] firstHash = MD5.Create().ComputeHash(first.OpenRead());
byte[] secondHash = MD5.Create().ComputeHash(second.OpenRead());
for (int i=0; i<firstHash.Length; i++)
{
if (firstHash[i] != secondHash[i])
return false;
}
return true;
}
A checksum comparison will most likely be slower than a byte-by-byte comparison.
In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.
As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean only needing to do the DiskIO one time, on the new file. This would likely be faster than a byte-by-byte comparison.
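A sketch of that pre-computation idea (basePath/newPath are placeholders; MD5 is from System.Security.Cryptography, and SequenceEqual needs System.Linq):
// Done once, e.g. when the base file is created; persist baseHash wherever convenient.
byte[] baseHash;
using (var md5 = MD5.Create())
using (var stream = File.OpenRead(basePath))
    baseHash = md5.ComputeHash(stream);

// Later: only the new file has to be read from disk.
byte[] newHash;
using (var md5 = MD5.Create())
using (var stream = File.OpenRead(newPath))
    newHash = md5.ComputeHash(stream);
bool same = newHash.SequenceEqual(baseHash);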
If you do decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:
• for `System.String` path names:
public static bool AreFileContentsEqual(String path1, String path2) =>
File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
• for `System.IO.FileInfo` instances:
public static bool AreFileContentsEqual(FileInfo fi1, FileInfo fi2) =>
fi1.Length == fi2.Length &&
(fi1.Length == 0L || File.ReadAllBytes(fi1.FullName).SequenceEqual(
File.ReadAllBytes(fi2.FullName)));
Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc., but as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line-ending, character encoding, media metadata, whitespace, padding, source code comments, etc.; see note 1) will always be considered not-equal.
This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (because it's fundamentally optimized to keep small, short-lived allocations extremely cheap), and in fact could even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) implies maximally delegating file performance issues to the CLR, BCL, and JIT to benefit from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.
Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk at all for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though SequenceEqual does in fact give us the "optimization" of abandoning on first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary for any true-positive cases.
Note 1: An obscure exception: NTFS alternate data streams are not examined by any of the answers discussed on this page, so such streams may be different for files otherwise reported as the "same."
In addition to Reed Copsey's answer:
The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.
If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.
For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
It gets even faster if you don't read in small 8-byte chunks but put a loop around it, reading a larger chunk. I reduced the average comparison time to 1/4.
public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
bool result;
if (fileInfo1.Length != fileInfo2.Length)
{
result = false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
result = StreamsContentsAreEqual(file1, file2);
}
}
}
return result;
}
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = stream1.Read(buffer1, 0, bufferSize);
int count2 = stream2.Read(buffer2, 0, bufferSize);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
Edit: This method would not work for comparing binary files!
In .NET 4.0, the File class has the following two new methods:
public static IEnumerable<string> ReadLines(string path)
public static IEnumerable<string> ReadLines(string path, Encoding encoding)
Which means you could use:
bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));
The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may however very well be eaten up by the added time of calculating the hash.
Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.
You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.
If the hash code for example is 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.
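(For reference, that percentage is just the complement of the collision probability: 1/2^32 ≈ 2.3 × 10^-10, and 1 − 2.3 × 10^-10 ≈ 99.99999998%.)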
My answer is a derivative of @lars but fixes the bug in the call to Stream.Read. I also add some fast-path checking that other answers had, and input validation. In short, this should be the answer:
using System;
using System.IO;
namespace ConsoleApp4
{
class Program
{
static void Main(string[] args)
{
var fi1 = new FileInfo(args[0]);
var fi2 = new FileInfo(args[1]);
Console.WriteLine(FilesContentsAreEqual(fi1, fi2));
}
public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
if (fileInfo1 == null)
{
throw new ArgumentNullException(nameof(fileInfo1));
}
if (fileInfo2 == null)
{
throw new ArgumentNullException(nameof(fileInfo2));
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
return StreamsContentsAreEqual(file1, file2);
}
}
}
}
private static int ReadFullBuffer(Stream stream, byte[] buffer)
{
int bytesRead = 0;
while (bytesRead < buffer.Length)
{
int read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
if (read == 0)
{
// Reached end of stream.
return bytesRead;
}
bytesRead += read;
}
return bytesRead;
}
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = ReadFullBuffer(stream1, buffer1);
int count2 = ReadFullBuffer(stream2, buffer2);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
}
}
Or if you want to be super-awesome, you can use the async variant:
using System;
using System.IO;
using System.Threading.Tasks;
namespace ConsoleApp4
{
class Program
{
static void Main(string[] args)
{
var fi1 = new FileInfo(args[0]);
var fi2 = new FileInfo(args[1]);
Console.WriteLine(FilesContentsAreEqualAsync(fi1, fi2).GetAwaiter().GetResult());
}
public static async Task<bool> FilesContentsAreEqualAsync(FileInfo fileInfo1, FileInfo fileInfo2)
{
if (fileInfo1 == null)
{
throw new ArgumentNullException(nameof(fileInfo1));
}
if (fileInfo2 == null)
{
throw new ArgumentNullException(nameof(fileInfo2));
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
return await StreamsContentsAreEqualAsync(file1, file2).ConfigureAwait(false);
}
}
}
}
private static async Task<int> ReadFullBufferAsync(Stream stream, byte[] buffer)
{
int bytesRead = 0;
while (bytesRead < buffer.Length)
{
int read = await stream.ReadAsync(buffer, bytesRead, buffer.Length - bytesRead).ConfigureAwait(false);
if (read == 0)
{
// Reached end of stream.
return bytesRead;
}
bytesRead += read;
}
return bytesRead;
}
private static async Task<bool> StreamsContentsAreEqualAsync(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = await ReadFullBufferAsync(stream1, buffer1).ConfigureAwait(false);
int count2 = await ReadFullBufferAsync(stream2, buffer2).ConfigureAwait(false);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
}
}
Honestly, I think you need to prune your search tree down as much as possible.
Things to check before going byte-by-byte:
Are sizes the same?
Is the last byte in file A different from the last byte in file B?
Also, reading large blocks at a time will be more efficient since drives read sequential bytes more quickly. Going byte-by-byte causes not only far more system calls, but it causes the read head of a traditional hard drive to seek back and forth more often if both files are on the same drive.
Read chunk A and chunk B into a byte buffer, and compare them (do NOT use Array.Equals, see comments). Tune the size of the blocks until you hit what you feel is a good trade off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.
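The warning about Array.Equals is because, for arrays, Equals is inherited from Object and therefore compares references, not contents:
byte[] a = { 1, 2, 3 };
byte[] b = { 1, 2, 3 };
Console.WriteLine(a.Equals(b));        // False: different array instances
Console.WriteLine(a.SequenceEqual(b)); // True: element-wise comparison (System.Linq)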
Inspired by https://dev.to/emrahsungu/how-to-compare-two-files-using-net-really-really-fast-2pd9
Here is a proposal to do it with AVX2 SIMD instructions:
using System.Buffers;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
namespace FileCompare;
public static class FastFileCompare
{
public static bool AreFilesEqual(FileInfo fileInfo1, FileInfo fileInfo2, int bufferSize = 4096 * 32)
{
if (fileInfo1.Exists == false)
{
throw new FileNotFoundException(nameof(fileInfo1), fileInfo1.FullName);
}
if (fileInfo2.Exists == false)
{
throw new FileNotFoundException(nameof(fileInfo2), fileInfo2.FullName);
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
using FileStream fileStream01 = fileInfo1.OpenRead();
using FileStream fileStream02 = fileInfo2.OpenRead();
ArrayPool<byte> sharedArrayPool = ArrayPool<byte>.Shared;
byte[] buffer1 = sharedArrayPool.Rent(bufferSize);
byte[] buffer2 = sharedArrayPool.Rent(bufferSize);
Array.Fill<byte>(buffer1, 0);
Array.Fill<byte>(buffer2, 0);
try
{
while (true)
{
int len1 = 0;
for (int read;
len1 < buffer1.Length &&
(read = fileStream01.Read(buffer1, len1, buffer1.Length - len1)) != 0;
len1 += read)
{
}
int len2 = 0;
for (int read;
len2 < buffer2.Length &&
(read = fileStream02.Read(buffer2, len2, buffer2.Length - len2)) != 0;
len2 += read)
{
}
if (len1 != len2)
{
return false;
}
if (len1 == 0)
{
return true;
}
unsafe
{
fixed (byte* pb1 = buffer1)
{
fixed (byte* pb2 = buffer2)
{
int vectorSize = Vector256<byte>.Count;
for (int processed = 0; processed < len1; processed += vectorSize)
{
Vector256<byte> result = Avx2.CompareEqual(Avx.LoadVector256(pb1 + processed), Avx.LoadVector256(pb2 + processed));
if (Avx2.MoveMask(result) != -1)
{
return false;
}
}
}
}
}
}
}
finally
{
sharedArrayPool.Return(buffer1);
sharedArrayPool.Return(buffer2);
}
}
}
If the files are not too big, you can use:
public static byte[] ComputeFileHash(string fileName)
{
using (var stream = File.OpenRead(fileName))
return System.Security.Cryptography.MD5.Create().ComputeHash(stream);
}
Comparing hashes is only really worthwhile if you can store them for reuse.
(Edited the code to something much cleaner.)
My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference against comparing bytes in a byte array.
So it is possible to replace the "Math.Ceiling and iterations" loop from the answers above with the simplest one:
for (int i = 0; i < count1; i++)
{
if (buffer1[i] != buffer2[i])
return false;
}
I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check arguments and then perform the bit shifting) before you compare and that ends up being the same amount of work as compare 8 bytes in two arrays.
Another improvement on large files with identical length, might be to not read the files sequentially, but rather compare more or less random blocks.
You can use multiple threads, starting on different positions in the file and comparing either forward or backwards.
This way you can detect changes at the middle/end of the file, faster than you would get there using a sequential approach.
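A sketch of one such probe (a hypothetical helper; assumes the usual length check has already passed, and needs System.Linq):
static bool BlockEqualsAt(string path1, string path2, long offset, int size)
{
    var buf1 = new byte[size];
    var buf2 = new byte[size];
    using (var f1 = File.OpenRead(path1))
    using (var f2 = File.OpenRead(path2))
    {
        f1.Seek(offset, SeekOrigin.Begin);
        f2.Seek(offset, SeekOrigin.Begin);
        // Same-length files read at the same offset normally yield the same count;
        // a full drain loop is omitted to keep the sketch short.
        int n1 = f1.Read(buf1, 0, size);
        int n2 = f2.Read(buf2, 0, size);
        return n1 == n2 && buf1.Take(n1).SequenceEqual(buf2.Take(n2));
    }
}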
If you only need to compare two files, I guess the fastest way would be (in C, I don't know if it's applicable to .NET)
open both files f1, f2
get the respective file length l1, l2
if l1 != l2 the files are different; stop
mmap() both files
use memcmp() on the mmap()ed files
OTOH, if you need to find if there are duplicate files in a set of N files, then the fastest way is undoubtedly using a hash to avoid N-way bit-by-bit comparisons.
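The closest .NET analogue of mmap() is System.IO.MemoryMappedFiles; there is no memcmp in the BCL, so this sketch compares through a view accessor (byte-wise for clarity, so it illustrates the mapping rather than memcmp's speed):
using System.IO;
using System.IO.MemoryMappedFiles;

static bool MappedFilesEqual(string path1, string path2)
{
    long length = new FileInfo(path1).Length;
    if (length != new FileInfo(path2).Length)
        return false;                  // different lengths: different files
    if (length == 0)
        return true;                   // an empty file cannot be mapped
    using (var map1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var map2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var view1 = map1.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
    using (var view2 = map2.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
    {
        for (long i = 0; i < length; i++)
            if (view1.ReadByte(i) != view2.ReadByte(i))
                return false;
    }
    return true;
}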
Something (hopefully) reasonably efficient:
public class FileCompare
{
public static bool FilesEqual(string fileName1, string fileName2)
{
return FilesEqual(new FileInfo(fileName1), new FileInfo(fileName2));
}
/// <summary>
///
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <param name="bufferSize">8kb seemed like a good default</param>
/// <returns></returns>
public static bool FilesEqual(FileInfo file1, FileInfo file2, int bufferSize = 8192)
{
if (!file1.Exists || !file2.Exists || file1.Length != file2.Length) return false;
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
using (var stream1 = file1.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
{
using (var stream2 = file2.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
{
while (true)
{
var bytesRead1 = stream1.Read(buffer1, 0, bufferSize);
var bytesRead2 = stream2.Read(buffer2, 0, bufferSize);
if (bytesRead1 != bytesRead2) return false;
if (bytesRead1 == 0) return true;
if (!ArraysEqual(buffer1, buffer2, bytesRead1)) return false;
}
}
}
}
/// <summary>
///
/// </summary>
/// <param name="array1"></param>
/// <param name="array2"></param>
/// <param name="bytesToCompare"> 0 means compare entire arrays</param>
/// <returns></returns>
public static bool ArraysEqual(byte[] array1, byte[] array2, int bytesToCompare = 0)
{
if (array1.Length != array2.Length) return false;
var length = (bytesToCompare == 0) ? array1.Length : bytesToCompare;
var tailIdx = length - length % sizeof(Int64);
//check in 8 byte chunks
for (var i = 0; i < tailIdx; i += sizeof(Int64))
{
if (BitConverter.ToInt64(array1, i) != BitConverter.ToInt64(array2, i)) return false;
}
//check the remainder of the array, always shorter than 8 bytes
for (var i = tailIdx; i < length; i++)
{
if (array1[i] != array2[i]) return false;
}
return true;
}
}
Here are some utility functions that allow you to determine if two files (or two streams) contain identical data.
I have provided a "fast" version that is multi-threaded as it compares byte arrays (each buffer filled from what's been read in each file) in different threads using Tasks.
As expected, it's much faster (around 3x faster) but it consumes more CPU (because it's multi threaded) and more memory (because it needs two byte array buffers per comparison thread).
public static bool AreFilesIdenticalFast(string path1, string path2)
{
return AreFilesIdentical(path1, path2, AreStreamsIdenticalFast);
}
public static bool AreFilesIdentical(string path1, string path2)
{
return AreFilesIdentical(path1, path2, AreStreamsIdentical);
}
public static bool AreFilesIdentical(string path1, string path2, Func<Stream, Stream, bool> areStreamsIdentical)
{
if (path1 == null)
throw new ArgumentNullException(nameof(path1));
if (path2 == null)
throw new ArgumentNullException(nameof(path2));
if (areStreamsIdentical == null)
throw new ArgumentNullException(nameof(areStreamsIdentical));
if (!File.Exists(path1) || !File.Exists(path2))
return false;
using (var thisFile = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (var valueFile = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
if (valueFile.Length != thisFile.Length)
return false;
if (!areStreamsIdentical(thisFile, valueFile))
return false;
}
}
return true;
}
public static bool AreStreamsIdenticalFast(Stream stream1, Stream stream2)
{
if (stream1 == null)
throw new ArgumentNullException(nameof(stream1));
if (stream2 == null)
throw new ArgumentNullException(nameof(stream2));
const int bufsize = 80000; // 80000 is below LOH (85000)
var tasks = new List<Task<bool>>();
do
{
// consumes more memory (two buffers for each task)
var buffer1 = new byte[bufsize];
var buffer2 = new byte[bufsize];
int read1 = stream1.Read(buffer1, 0, buffer1.Length);
if (read1 == 0)
{
int read3 = stream2.Read(buffer2, 0, 1);
if (read3 != 0) // not eof
return false;
break;
}
// both stream read could return different counts
int read2 = 0;
do
{
int read3 = stream2.Read(buffer2, read2, read1 - read2);
if (read3 == 0)
return false;
read2 += read3;
}
while (read2 < read1);
// consumes more cpu
var task = Task.Run(() =>
{
return IsSame(buffer1, buffer2);
});
tasks.Add(task);
}
while (true);
Task.WaitAll(tasks.ToArray());
return !tasks.Any(t => !t.Result);
}
public static bool AreStreamsIdentical(Stream stream1, Stream stream2)
{
if (stream1 == null)
throw new ArgumentNullException(nameof(stream1));
if (stream2 == null)
throw new ArgumentNullException(nameof(stream2));
const int bufsize = 80000; // 80000 is below LOH (85000)
var buffer1 = new byte[bufsize];
var buffer2 = new byte[bufsize];
var tasks = new List<Task<bool>>();
do
{
int read1 = stream1.Read(buffer1, 0, buffer1.Length);
if (read1 == 0)
return stream2.Read(buffer2, 0, 1) == 0; // check not eof
// both stream read could return different counts
int read2 = 0;
do
{
int read3 = stream2.Read(buffer2, read2, read1 - read2);
if (read3 == 0)
return false;
read2 += read3;
}
while (read2 < read1);
if (!IsSame(buffer1, buffer2))
return false;
}
while (true);
}
public static bool IsSame(byte[] bytes1, byte[] bytes2)
{
if (bytes1 == null)
throw new ArgumentNullException(nameof(bytes1));
if (bytes2 == null)
throw new ArgumentNullException(nameof(bytes2));
if (bytes1.Length != bytes2.Length)
return false;
for (int i = 0; i < bytes1.Length; i++)
{
if (bytes1[i] != bytes2[i])
return false;
}
return true;
}
I think there are applications where a hash is faster than comparing byte by byte.
For example, if you need to compare one file against many others, or have a thumbnail of a photo that can change.
It depends on where and how it is used.
private bool CompareFilesByte(string file1, string file2)
{
using (var fs1 = new FileStream(file1, FileMode.Open))
using (var fs2 = new FileStream(file2, FileMode.Open))
{
if (fs1.Length != fs2.Length) return false;
int b1, b2;
do
{
b1 = fs1.ReadByte();
b2 = fs2.ReadByte();
if (b1 != b2 || b1 < 0) return false;
}
while (b1 >= 0);
}
return true;
}
private string HashFile(string file)
{
using (var fs = new FileStream(file, FileMode.Open))
using (var reader = new BinaryReader(fs))
{
var hash = new SHA512CryptoServiceProvider();
hash.ComputeHash(reader.ReadBytes((int)fs.Length));
return Convert.ToBase64String(hash.Hash);
}
}
private bool CompareFilesWithHash(string file1, string file2)
{
var str1 = HashFile(file1);
var str2 = HashFile(file2);
return str1 == str2;
}
Here is how you can measure which is fastest.
var sw = new Stopwatch();
sw.Start();
var compare1 = CompareFilesWithHash(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare using Hash {0}", sw.ElapsedTicks));
sw.Reset();
sw.Start();
var compare2 = CompareFilesByte(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare byte-byte {0}", sw.ElapsedTicks));
Optionally, we can save the hash in a database.
Hope this can help
I have found this works well: first compare the lengths without reading any data, then compare the byte sequences.
private static bool IsFileIdentical(string a, string b)
{
if (new FileInfo(a).Length != new FileInfo(b).Length) return false;
return (File.ReadAllBytes(a).SequenceEqual(File.ReadAllBytes(b)));
}
Yet another answer, derived from @chsh. MD5 with usings and shortcuts for same file, file not exists, and differing lengths:
/// <summary>
/// Performs an md5 on the content of both files and returns true if
/// they match
/// </summary>
/// <param name="file1">first file</param>
/// <param name="file2">second file</param>
/// <returns>true if the contents of the two files is the same, false otherwise</returns>
public static bool IsSameContent(string file1, string file2)
{
if (file1 == file2)
return true;
FileInfo file1Info = new FileInfo(file1);
FileInfo file2Info = new FileInfo(file2);
if (!file1Info.Exists && !file2Info.Exists)
return true;
if (!file1Info.Exists && file2Info.Exists)
return false;
if (file1Info.Exists && !file2Info.Exists)
return false;
if (file1Info.Length != file2Info.Length)
return false;
using (FileStream file1Stream = file1Info.OpenRead())
using (FileStream file2Stream = file2Info.OpenRead())
{
byte[] firstHash = MD5.Create().ComputeHash(file1Stream);
byte[] secondHash = MD5.Create().ComputeHash(file2Stream);
for (int i = 0; i < firstHash.Length; i++)
{
if (i >= secondHash.Length || firstHash[i] != secondHash[i])
return false;
}
return true;
}
}
Not really an answer, but kinda funny.
This is what GitHub's Copilot (AI) suggested :-)
public static void CompareFiles(FileInfo actualFile, FileInfo expectedFile) {
if (actualFile.Length != expectedFile.Length) {
throw new Exception($"File {actualFile.Name} has different length in actual and expected directories.");
}
// compare the files on a byte level
using var actualStream = actualFile.OpenRead();
using var expectedStream = expectedFile.OpenRead();
var actualBuffer = new byte[1024];
var expectedBuffer = new byte[1024];
int actualBytesRead;
int expectedBytesRead;
do {
actualBytesRead = actualStream.Read(actualBuffer, 0, actualBuffer.Length);
expectedBytesRead = expectedStream.Read(expectedBuffer, 0, expectedBuffer.Length);
if (actualBytesRead != expectedBytesRead) {
throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
}
if (!actualBuffer.SequenceEqual(expectedBuffer)) {
throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
}
} while (actualBytesRead > 0);
}
I find the usage of SequenceEqual particularly interesting.
