How to read a text file in reverse with an iterator in C#

I need to process a large file, around 400K lines and 200 MB, but sometimes I have to process it from the bottom up. How can I use an iterator (yield return) here? Basically I don't want to load everything into memory. I know that using an iterator in .NET is more efficient.

Reading text files backwards is really tricky unless you're using a fixed-size encoding (e.g. ASCII). When you've got variable-size encoding (such as UTF-8) you will keep having to check whether you're in the middle of a character or not when you fetch data.
There's nothing built into the framework, and I suspect you'd have to do separate hard coding for each variable-width encoding.
EDIT: This has been somewhat tested - but that's not to say it doesn't still have some subtle bugs around. It uses StreamUtil from MiscUtil, but I've included just the necessary (new) method from there at the bottom. Oh, and it needs refactoring - there's one pretty hefty method, as you'll see:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;
namespace MiscUtil.IO
{
/// <summary>
/// Takes an encoding (defaulting to UTF-8) and a function which produces a seekable stream
/// (or a filename for convenience) and yields lines from the end of the stream backwards.
/// Only single byte encodings, and UTF-8 and Unicode, are supported. The stream
/// returned by the function must be seekable.
/// </summary>
public sealed class ReverseLineReader : IEnumerable<string>
{
/// <summary>
/// Buffer size to use by default. Classes with internal access can specify
/// a different buffer size - this is useful for testing.
/// </summary>
private const int DefaultBufferSize = 4096;
/// <summary>
/// Means of creating a Stream to read from.
/// </summary>
private readonly Func<Stream> streamSource;
/// <summary>
/// Encoding to use when converting bytes to text
/// </summary>
private readonly Encoding encoding;
/// <summary>
/// Size of buffer (in bytes) to read each time we read from the
/// stream. This must be at least as big as the maximum number of
/// bytes for a single character.
/// </summary>
private readonly int bufferSize;
/// <summary>
/// Function which, when given a position within a file and a byte, states whether
/// or not the byte represents the start of a character.
/// </summary>
private Func<long,byte,bool> characterStartDetector;
/// <summary>
/// Creates a LineReader from a stream source. The delegate is only
/// called when the enumerator is fetched. UTF-8 is used to decode
/// the stream into text.
/// </summary>
/// <param name="streamSource">Data source</param>
public ReverseLineReader(Func<Stream> streamSource)
: this(streamSource, Encoding.UTF8)
{
}
/// <summary>
/// Creates a LineReader from a filename. The file is only opened
/// (or even checked for existence) when the enumerator is fetched.
/// UTF8 is used to decode the file into text.
/// </summary>
/// <param name="filename">File to read from</param>
public ReverseLineReader(string filename)
: this(filename, Encoding.UTF8)
{
}
/// <summary>
/// Creates a LineReader from a filename. The file is only opened
/// (or even checked for existence) when the enumerator is fetched.
/// </summary>
/// <param name="filename">File to read from</param>
/// <param name="encoding">Encoding to use to decode the file into text</param>
public ReverseLineReader(string filename, Encoding encoding)
: this(() => File.OpenRead(filename), encoding)
{
}
/// <summary>
/// Creates a LineReader from a stream source. The delegate is only
/// called when the enumerator is fetched.
/// </summary>
/// <param name="streamSource">Data source</param>
/// <param name="encoding">Encoding to use to decode the stream into text</param>
public ReverseLineReader(Func<Stream> streamSource, Encoding encoding)
: this(streamSource, encoding, DefaultBufferSize)
{
}
internal ReverseLineReader(Func<Stream> streamSource, Encoding encoding, int bufferSize)
{
this.streamSource = streamSource;
this.encoding = encoding;
this.bufferSize = bufferSize;
if (encoding.IsSingleByte)
{
// For a single byte encoding, every byte is the start (and end) of a character
characterStartDetector = (pos, data) => true;
}
else if (encoding is UnicodeEncoding)
{
// For UTF-16, even-numbered positions are the start of a character.
// TODO: This assumes no surrogate pairs. More work required
// to handle that.
characterStartDetector = (pos, data) => (pos & 1) == 0;
}
else if (encoding is UTF8Encoding)
{
// For UTF-8, bytes with the top bit clear or the second bit set are the start of a character
// See http://www.cl.cam.ac.uk/~mgk25/unicode.html
characterStartDetector = (pos, data) => (data & 0x80) == 0 || (data & 0x40) != 0;
}
else
{
throw new ArgumentException("Only single byte, UTF-8 and Unicode encodings are permitted");
}
}
/// <summary>
/// Returns the enumerator reading strings backwards. If this method discovers that
/// the returned stream is either unreadable or unseekable, a NotSupportedException is thrown.
/// </summary>
public IEnumerator<string> GetEnumerator()
{
Stream stream = streamSource();
if (!stream.CanSeek)
{
stream.Dispose();
throw new NotSupportedException("Unable to seek within stream");
}
if (!stream.CanRead)
{
stream.Dispose();
throw new NotSupportedException("Unable to read within stream");
}
return GetEnumeratorImpl(stream);
}
private IEnumerator<string> GetEnumeratorImpl(Stream stream)
{
try
{
long position = stream.Length;
if (encoding is UnicodeEncoding && (position & 1) != 0)
{
throw new InvalidDataException("UTF-16 encoding provided, but stream has odd length.");
}
// Allow up to two bytes for data from the start of the previous
// read which didn't quite make it as full characters
byte[] buffer = new byte[bufferSize + 2];
char[] charBuffer = new char[encoding.GetMaxCharCount(buffer.Length)];
int leftOverData = 0;
String previousEnd = null;
// TextReader doesn't return an empty string if there's line break at the end
// of the data. Therefore we don't return an empty string if it's our *first*
// return.
bool firstYield = true;
// A line-feed at the start of the previous buffer means we need to swallow
// the carriage-return at the end of this buffer - hence this needs declaring
// way up here!
bool swallowCarriageReturn = false;
while (position > 0)
{
int bytesToRead = Math.Min(position > int.MaxValue ? bufferSize : (int)position, bufferSize);
position -= bytesToRead;
stream.Position = position;
StreamUtil.ReadExactly(stream, buffer, bytesToRead);
// If we haven't read a full buffer, but we had bytes left
// over from before, copy them to the end of the buffer
if (leftOverData > 0 && bytesToRead != bufferSize)
{
// Buffer.BlockCopy doesn't document its behaviour with respect
// to overlapping data: we *might* just have read 7 bytes instead of
// 8, and have two bytes to copy...
Array.Copy(buffer, bufferSize, buffer, bytesToRead, leftOverData);
}
// We've now *effectively* read this much data.
bytesToRead += leftOverData;
int firstCharPosition = 0;
while (!characterStartDetector(position + firstCharPosition, buffer[firstCharPosition]))
{
firstCharPosition++;
// Bad UTF-8 sequences could trigger this. For UTF-8 we should always
// see a valid character start in every 3 bytes, and if this is the start of the file
// so we've done a short read, we should have the character start
// somewhere in the usable buffer.
if (firstCharPosition == 3 || firstCharPosition == bytesToRead)
{
throw new InvalidDataException("Invalid UTF-8 data");
}
}
leftOverData = firstCharPosition;
int charsRead = encoding.GetChars(buffer, firstCharPosition, bytesToRead - firstCharPosition, charBuffer, 0);
int endExclusive = charsRead;
for (int i = charsRead - 1; i >= 0; i--)
{
char lookingAt = charBuffer[i];
if (swallowCarriageReturn)
{
swallowCarriageReturn = false;
if (lookingAt == '\r')
{
endExclusive--;
continue;
}
}
// Anything non-line-breaking, just keep looking backwards
if (lookingAt != '\n' && lookingAt != '\r')
{
continue;
}
// End of CRLF? Swallow the preceding CR
if (lookingAt == '\n')
{
swallowCarriageReturn = true;
}
int start = i + 1;
string bufferContents = new string(charBuffer, start, endExclusive - start);
endExclusive = i;
string stringToYield = previousEnd == null ? bufferContents : bufferContents + previousEnd;
if (!firstYield || stringToYield.Length != 0)
{
yield return stringToYield;
}
firstYield = false;
previousEnd = null;
}
previousEnd = endExclusive == 0 ? null : (new string(charBuffer, 0, endExclusive) + previousEnd);
// If we didn't decode the start of the array, put it at the end for next time
if (leftOverData != 0)
{
Buffer.BlockCopy(buffer, 0, buffer, bufferSize, leftOverData);
}
}
if (leftOverData != 0)
{
// At the start of the final buffer, we had the end of another character.
throw new InvalidDataException("Invalid UTF-8 data at start of stream");
}
if (firstYield && string.IsNullOrEmpty(previousEnd))
{
yield break;
}
yield return previousEnd ?? "";
}
finally
{
stream.Dispose();
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
}
// StreamUtil.cs:
public static class StreamUtil
{
public static void ReadExactly(Stream input, byte[] buffer, int bytesToRead)
{
int index = 0;
while (index < bytesToRead)
{
int read = input.Read(buffer, index, bytesToRead - index);
if (read == 0)
{
throw new EndOfStreamException
(String.Format("End of stream reached with {0} byte{1} left to read.",
bytesToRead - index,
bytesToRead - index == 1 ? "" : "s"));
}
index += read;
}
}
}
Feedback very welcome. This was fun :)

Attention: this approach doesn't work (explained in EDIT)
You could use File.ReadLines to get lines iterator
foreach (var line in File.ReadLines(@"C:\temp\ReverseRead.txt").Reverse())
{
if (noNeedToReadFurther)
break;
// process line here
Console.WriteLine(line);
}
EDIT:
After reading applejacks01's comment, I ran some tests and it does look like .Reverse() actually loads the whole file.
I used File.ReadLines() to print the first line of a 40 MB file - the console app's memory usage was 5 MB. Then I used File.ReadLines().Reverse() to print the last line of the same file - memory usage was 95 MB.
Conclusion
Whatever Reverse() is doing, it is not a good choice for reading the bottom of a big file.
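A minimal sketch of the kind of test described above (the path is the one from the example; the working-set readout is just an illustrative way to observe the difference, not the exact measurement used):
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class ReverseMemoryTest
{
    static void Main()
    {
        // Forward read streams the file, so memory stays low.
        Console.WriteLine(File.ReadLines(@"C:\temp\ReverseRead.txt").First());
        Console.WriteLine("Working set: " + Process.GetCurrentProcess().WorkingSet64 / (1024 * 1024) + " MB");
        // Enumerable.Reverse() must buffer the entire sequence before it can yield
        // the last line, so memory grows with the file size.
        Console.WriteLine(File.ReadLines(@"C:\temp\ReverseRead.txt").Reverse().First());
        Console.WriteLine("Working set: " + Process.GetCurrentProcess().WorkingSet64 / (1024 * 1024) + " MB");
    }
}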

Very fast solution for huge files: From C#, use PowerShell's Get-Content with the Tail parameter.
using System.Linq;
using System.Management.Automation;
using (PowerShell powerShell = PowerShell.Create())
{
string lastLine = powerShell.AddCommand("Get-Content")
.AddParameter("Path", #"c:\a.txt")
.AddParameter("Tail", 1)
.Invoke().FirstOrDefault()?.ToString();
}
Required reference: 'System.Management.Automation.dll' - may be somewhere like 'C:\Program Files (x86)\Reference Assemblies\Microsoft\WindowsPowerShell\3.0'
Using PowerShell incurs a small overhead but is worth it for huge files.

To create a file iterator you can do this:
EDIT:
This is my fixed version of a fixed-width reverse file reader:
public static IEnumerable<string> readFile()
{
using (FileStream reader = new FileStream(@"c:\test.txt", FileMode.Open, FileAccess.Read))
{
int i=0;
StringBuilder lineBuffer = new StringBuilder();
int byteRead;
while (-i < reader.Length)
{
reader.Seek(--i, SeekOrigin.End);
byteRead = reader.ReadByte();
if (byteRead == 10 && lineBuffer.Length > 0)
{
yield return Reverse(lineBuffer.ToString());
lineBuffer.Remove(0, lineBuffer.Length);
}
lineBuffer.Append((char)byteRead);
}
yield return Reverse(lineBuffer.ToString());
reader.Close();
}
}
public static string Reverse(string str)
{
char[] arr = new char[str.Length];
for (int i = 0; i < str.Length; i++)
arr[i] = str[str.Length - 1 - i];
return new string(arr);
}

I'll also add my solution. After reading some answers, nothing really fit my case.
I'm reading byte by byte from behind until I find a LineFeed, then I return the collected bytes as a string, without using buffering.
Usage:
var reader = new ReverseTextReader(path);
while (!reader.EndOfStream)
{
Console.WriteLine(reader.ReadLine());
}
Implementation:
public class ReverseTextReader
{
private const int LineFeedLf = 10;
private const int LineFeedCr = 13;
private readonly Stream _stream;
private readonly Encoding _encoding;
public bool EndOfStream => _stream.Position == 0;
// Convenience constructor matching the usage above.
public ReverseTextReader(string path)
: this(File.OpenRead(path), Encoding.Default)
{
}
public ReverseTextReader(Stream stream, Encoding encoding)
{
_stream = stream;
_encoding = encoding;
_stream.Position = _stream.Length;
}
public string ReadLine()
{
if (_stream.Position == 0) return null;
var line = new List<byte>();
var endOfLine = false;
while (!endOfLine)
{
var b = _stream.ReadByteFromBehind();
if (b == -1 || b == LineFeedLf)
{
endOfLine = true;
}
else
{
// Don't add the line feed (or the -1 end-of-stream marker) to the line itself.
line.Add((byte)b);
}
}
line.Reverse();
return _encoding.GetString(line.ToArray());
}
}
public static class StreamExtensions
{
public static int ReadByteFromBehind(this Stream stream)
{
if (stream.Position == 0) return -1;
stream.Position = stream.Position - 1;
var value = stream.ReadByte();
stream.Position = stream.Position - 1;
return value;
}
}

I put the file into a list line by line, then used List.Reverse();
StreamReader objReader = new StreamReader(filename);
string sLine = "";
ArrayList arrText = new ArrayList();
while (sLine != null)
{
sLine = objReader.ReadLine();
if (sLine != null)
arrText.Add(sLine);
}
objReader.Close();
arrText.Reverse();
foreach (string sOutput in arrText)
{
...

You can read the file backwards one character at a time and cache the characters until you reach a carriage return and/or line feed.
You then reverse the collected string and yield it as a line.
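A minimal sketch of that approach (single-byte encodings only, since each byte is cast straight to a char; the names here are made up for the example):
using System;
using System.Collections.Generic;
using System.IO;

static class ReverseLines
{
    // Walks the file backwards one byte at a time, collects characters until a
    // line feed is hit, then reverses the collected characters and yields the line.
    // Note: a file ending with a newline yields one empty line first.
    public static IEnumerable<string> Read(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            var chars = new List<char>();
            for (long pos = stream.Length - 1; pos >= 0; pos--)
            {
                stream.Position = pos;
                int b = stream.ReadByte();
                if (b == '\n')
                {
                    chars.Reverse();
                    yield return new string(chars.ToArray()).TrimEnd('\r');
                    chars.Clear();
                }
                else
                {
                    chars.Add((char)b);
                }
            }
            chars.Reverse();
            yield return new string(chars.ToArray()).TrimEnd('\r');
        }
    }
}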

There are good answers here already, and here's another LINQ-compatible class you can use which focuses on performance and support for large files. It assumes a "\r\n" line terminator.
Usage:
var reader = new ReverseTextReader(@"C:\Temp\ReverseTest.txt");
while (!reader.EndOfStream)
Console.WriteLine(reader.ReadLine());
ReverseTextReader Class:
/// <summary>
/// Reads a text file backwards, line-by-line.
/// </summary>
/// <remarks>This class uses file seeking to read a text file of any size in reverse order. This
/// is useful for needs such as reading a log file newest-entries first.</remarks>
public sealed class ReverseTextReader : IEnumerable<string>
{
private const int BufferSize = 16384; // The number of bytes read from the underlying stream.
private readonly Stream _stream; // Stores the stream feeding data into this reader
private readonly Encoding _encoding; // Stores the encoding used to process the file
private byte[] _leftoverBuffer; // Stores the leftover partial line after processing a buffer
private readonly Queue<string> _lines; // Stores the lines parsed from the buffer
#region Constructors
/// <summary>
/// Creates a reader for the specified file.
/// </summary>
/// <param name="filePath"></param>
public ReverseTextReader(string filePath)
: this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), Encoding.Default)
{ }
/// <summary>
/// Creates a reader using the specified stream.
/// </summary>
/// <param name="stream"></param>
public ReverseTextReader(Stream stream)
: this(stream, Encoding.Default)
{ }
/// <summary>
/// Creates a reader using the specified path and encoding.
/// </summary>
/// <param name="filePath"></param>
/// <param name="encoding"></param>
public ReverseTextReader(string filePath, Encoding encoding)
: this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), encoding)
{ }
/// <summary>
/// Creates a reader using the specified stream and encoding.
/// </summary>
/// <param name="stream"></param>
/// <param name="encoding"></param>
public ReverseTextReader(Stream stream, Encoding encoding)
{
_stream = stream;
_encoding = encoding;
_lines = new Queue<string>(128);
// The stream needs to support seeking for this to work
if(!_stream.CanSeek)
throw new InvalidOperationException("The specified stream needs to support seeking to be read backwards.");
if (!_stream.CanRead)
throw new InvalidOperationException("The specified stream needs to support reading to be read backwards.");
// Set the current position to the end of the file
_stream.Position = _stream.Length;
_leftoverBuffer = new byte[0];
}
#endregion
#region Overrides
/// <summary>
/// Reads the next previous line from the underlying stream.
/// </summary>
/// <returns></returns>
public string ReadLine()
{
// Are there lines left to read? If so, return the next one
if (_lines.Count != 0) return _lines.Dequeue();
// Are we at the beginning of the stream? If so, we're done
if (_stream.Position == 0) return null;
#region Read and Process the Next Chunk
// Remember the current position
var currentPosition = _stream.Position;
var newPosition = currentPosition - BufferSize;
// Are we before the beginning of the stream?
if (newPosition < 0) newPosition = 0;
// Calculate the buffer size to read
var count = (int)(currentPosition - newPosition);
// Set the new position
_stream.Position = newPosition;
// Make a new buffer but append the previous leftovers
var buffer = new byte[count + _leftoverBuffer.Length];
// Read the next buffer
_stream.Read(buffer, 0, count);
// Move the position of the stream back
_stream.Position = newPosition;
// And copy in the leftovers from the last buffer
if (_leftoverBuffer.Length != 0)
Array.Copy(_leftoverBuffer, 0, buffer, count, _leftoverBuffer.Length);
// Look for CrLf delimiters
var end = buffer.Length - 1;
var start = buffer.Length - 2;
// Search backwards for a line feed
while (start >= 0)
{
// Is it a line feed?
if (buffer[start] == 10)
{
// Yes. Extract a line and queue it (but exclude the \r\n)
_lines.Enqueue(_encoding.GetString(buffer, start + 1, end - start - 2));
// And reset the end
end = start;
}
// Move to the previous character
start--;
}
// What's left over is a portion of a line. Save it for later.
_leftoverBuffer = new byte[end + 1];
Array.Copy(buffer, 0, _leftoverBuffer, 0, end + 1);
// Are we at the beginning of the stream?
if (_stream.Position == 0)
// Yes. Add the last line.
_lines.Enqueue(_encoding.GetString(_leftoverBuffer, 0, end - 1));
#endregion
// If we have something in the queue, return it
return _lines.Count == 0 ? null : _lines.Dequeue();
}
#endregion
#region IEnumerator<string> Interface
public IEnumerator<string> GetEnumerator()
{
string line;
// So long as the next line isn't null...
while ((line = ReadLine()) != null)
// Read and return it.
yield return line;
}
IEnumerator IEnumerable.GetEnumerator()
{
throw new NotImplementedException();
}
#endregion
}

I know this post is very old, but as I couldn't find how to use the most voted solution, I finally found this:
here is the best answer I found, with a low memory cost, in VB and C#
http://www.blakepell.com/2010-11-29-backward-file-reader-vb-csharp-source
I hope this helps others, because it took me hours to finally find this post!
[Edit]
Here is the C# code:
//*********************************************************************************************************************************
//
// Class: BackwardReader
// Initial Date: 11/29/2010
// Last Modified: 11/29/2010
// Programmer(s): Original C# Source - the_real_herminator
// http://social.msdn.microsoft.com/forums/en-US/csharpgeneral/thread/9acdde1a-03cd-4018-9f87-6e201d8f5d09
// VB Conversion - Blake Pell
//
//*********************************************************************************************************************************
using System.Text;
using System.IO;
public class BackwardReader
{
private string path;
private FileStream fs = null;
public BackwardReader(string path)
{
this.path = path;
fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
fs.Seek(0, SeekOrigin.End);
}
public string Readline()
{
byte[] line;
byte[] text = new byte[1];
long position = 0;
int count;
fs.Seek(0, SeekOrigin.Current);
position = fs.Position;
//do we have a trailing \r\n?
if (fs.Length > 1)
{
byte[] vagnretur = new byte[2];
fs.Seek(-2, SeekOrigin.Current);
fs.Read(vagnretur, 0, 2);
if (ASCIIEncoding.ASCII.GetString(vagnretur).Equals("\r\n"))
{
//move it back
fs.Seek(-2, SeekOrigin.Current);
position = fs.Position;
}
}
while (fs.Position > 0)
{
text.Initialize();
//read one char
fs.Read(text, 0, 1);
string asciiText = ASCIIEncoding.ASCII.GetString(text);
//move back to the character before
fs.Seek(-2, SeekOrigin.Current);
if (asciiText.Equals("\n"))
{
fs.Read(text, 0, 1);
asciiText = ASCIIEncoding.ASCII.GetString(text);
if (asciiText.Equals("\r"))
{
fs.Seek(1, SeekOrigin.Current);
break;
}
}
}
count = (int)(position - fs.Position);
line = new byte[count];
fs.Read(line, 0, count);
fs.Seek(-count, SeekOrigin.Current);
return ASCIIEncoding.ASCII.GetString(line);
}
public bool SOF
{
get
{
return fs.Position == 0;
}
}
public void Close()
{
fs.Close();
}
}

I wanted to do a similar thing.
Here is my code. This class creates temporary files containing chunks of the big file, which avoids bloating memory. The user can specify whether they want the file reversed, and it will return the content in reverse order accordingly.
This class can also be used to write big data to a single file without bloating memory.
Please provide feedback.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace BigFileService
{
public class BigFileDumper
{
/// <summary>
/// Buffer that will store the lines until it is full.
/// Then it will dump it to temp files.
/// </summary>
public int CHUNK_SIZE = 1000;
public bool ReverseIt { get; set; }
public long TotalLineCount { get { return totalLineCount; } }
private long totalLineCount;
private int BufferCount = 0;
private StreamWriter Writer;
/// <summary>
/// List of files that would store the chunks.
/// </summary>
private List<string> LstTempFiles;
private string ParentDirectory;
private char[] trimchars = { '/', '\\'};
public BigFileDumper(string FolderPathToWrite)
{
this.LstTempFiles = new List<string>();
this.ParentDirectory = FolderPathToWrite.TrimEnd(trimchars) + "\\" + "BIG_FILE_DUMP";
this.totalLineCount = 0;
this.BufferCount = 0;
this.Initialize();
}
private void Initialize()
{
// Delete existing directory.
if (Directory.Exists(this.ParentDirectory))
{
Directory.Delete(this.ParentDirectory, true);
}
// Create a new directory.
Directory.CreateDirectory(this.ParentDirectory);
}
public void WriteLine(string line)
{
if (this.BufferCount == 0)
{
string newFile = "DumpFile_" + LstTempFiles.Count();
LstTempFiles.Add(newFile);
Writer = new StreamWriter(this.ParentDirectory + "\\" + newFile);
}
// Keep on adding in the buffer as long as size is okay.
if (this.BufferCount < this.CHUNK_SIZE)
{
this.totalLineCount++; // main count
this.BufferCount++; // Chunk count.
Writer.WriteLine(line);
}
else
{
// Buffer is full, time to create a new file.
// Close the existing file first.
Writer.Close();
// Make buffer count 0 again.
this.BufferCount = 0;
this.WriteLine(line);
}
}
public void Close()
{
if (Writer != null)
Writer.Close();
}
public string GetFullFile()
{
if (LstTempFiles.Count <= 0)
{
Debug.Assert(false, "There are no files created.");
return "";
}
string returnFilename = this.ParentDirectory + "\\" + "FullFile";
if (File.Exists(returnFilename) == false)
{
// Create a consolidated file from the existing small dump files.
// Now this is interesting. We will open the small dump files one by one.
// Depending on whether the user requires an inverted file, we will read them in descending order and reversed,
// or in ascending order in the normal way.
if (this.ReverseIt)
this.LstTempFiles.Reverse();
foreach (var fileName in LstTempFiles)
{
string fullFileName = this.ParentDirectory + "\\" + fileName;
// FileLines will use small memory depending on size of CHUNK. User has control.
var fileLines = File.ReadAllLines(fullFileName);
// Time to write in the writer.
if (this.ReverseIt)
fileLines = fileLines.Reverse().ToArray();
// Write the lines
File.AppendAllLines(returnFilename, fileLines);
}
}
return returnFilename;
}
}
}
This service can be used as follows -
void TestBigFileDump_File(string BIG_FILE, string FOLDER_PATH_FOR_CHUNK_FILES)
{
// Start processing the input Big file.
StreamReader reader = new StreamReader(BIG_FILE);
// Create a dump file class object to handle efficient memory management.
var bigFileDumper = new BigFileDumper(FOLDER_PATH_FOR_CHUNK_FILES);
// Set to reverse the output file.
bigFileDumper.ReverseIt = true;
bigFileDumper.CHUNK_SIZE = 100; // How much at a time to keep in RAM before dumping to local file.
while (reader.EndOfStream == false)
{
string line = reader.ReadLine();
bigFileDumper.WriteLine(line);
}
bigFileDumper.Close();
reader.Close();
// Get back full reversed file.
var reversedFilename = bigFileDumper.GetFullFile();
Console.WriteLine("Check output file - " + reversedFilename);
}

In case anyone else comes across this, I solved it with the following PowerShell script which can easily be modified into a C# script with a small amount of effort.
[System.IO.FileStream]$fileStream = [System.IO.File]::Open("C:\Name_of_very_large_file.log", [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
[System.IO.BufferedStream]$bs = New-Object System.IO.BufferedStream $fileStream;
[System.IO.StreamReader]$sr = New-Object System.IO.StreamReader $bs;
$buff = New-Object char[] 20;
$seek = $bs.Seek($fileStream.Length - 10000, [System.IO.SeekOrigin]::Begin);
while(($line = $sr.ReadLine()) -ne $null)
{
$line;
}
This basically starts reading from roughly the last 10,000 bytes of the file, outputting each line.
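A rough C# equivalent of that script (a sketch; the path and the 10,000-byte offset are taken from the example above):
using System;
using System.IO;

class TailDump
{
    static void Main()
    {
        using (var fileStream = new FileStream(@"C:\Name_of_very_large_file.log",
            FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(fileStream))
        {
            // Jump to roughly 10,000 bytes before the end; the first line read
            // after seeking will usually be a partial one.
            fileStream.Seek(Math.Max(0, fileStream.Length - 10000), SeekOrigin.Begin);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Console.WriteLine(line);
            }
        }
    }
}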

Related

Read binary objects from a file in C# written out by a C++ program

I am trying to read objects from very large files containing padded structs that were written to them by a C++ process. I was using an example to memory-map the large file and tried to deserialize the data into an object, but I can now see that it won't work this way.
How can I extract all the objects from the files to use in C#? I'm probably way off, but I've provided the code. The objects have an 8-byte milliseconds member followed by 21 16-bit integers, which needs 6 bytes of padding to align to an 8-byte boundary.
[Serializable]
unsafe public struct DataStruct
{
public UInt64 milliseconds;
[MarshalAs(UnmanagedType.ByValArray, SizeConst = 21)]
public fixed Int16 data[21];
[MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)]
public fixed Int16 padding[3];
};
[Serializable]
public class DataArray
{
public DataStruct[] samples;
}
public static class Helper
{
public static Int16[] GetData(this DataStruct data)
{
unsafe
{
Int16[] output = new Int16[21];
for (int index = 0; index < 21; ++index)
output[index] = data.data[index];
return output;
}
}
}
class FileThreadSupport
{
struct DataFileInfo
{
public string path;
public UInt64 start;
public UInt64 stop;
public UInt64 elements;
};
// Create our epoch timestamp
private static readonly DateTime epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
// Output TCP client
private Support.AsyncTcpClient output;
// Directory which contains our data
private string replay_directory;
// Files to be read from
private DataFileInfo[] file_infos;
// Current timestamp of when the process was started
UInt64 process_start = 0;
// Object from current file
DataArray current_file_data;
// Offset into current files
UInt64 current_file_index = 0;
// Offset into current files
UInt64 current_file_offset = 0;
// Run flag
bool run = true;
public FileThreadSupport(ref Support.AsyncTcpClient output, ref Engine.A.Information info, ref Support.Configuration configuration)
{
// Set our output directory
replay_directory = configuration.getString("replay_directory");
if (replay_directory.Length == 0)
{
Console.WriteLine("Configuration does not provide a replay directory");
return;
}
// Check the directory for playable files
if(!loadDataDirectory(replay_directory))
{
Console.WriteLine("Replay directory {} did not have any valid files", replay_directory);
}
// Set the output TCP client
this.output = output;
}
private bool loadDataDirectory(string directory)
{
string[] files = Directory.GetFiles(directory, "*.*", SearchOption.TopDirectoryOnly);
file_infos = new DataFileInfo[files.Length];
int index = 0;
foreach (string file in files)
{
string[] parts = file.Split('\\');
string name = parts.Last();
parts = name.Split('.');
if (parts.Length != 2)
continue;
UInt64 start, stop = 0;
if (!UInt64.TryParse(parts[0], out start) || !UInt64.TryParse(parts[1], out stop))
continue;
long size = new System.IO.FileInfo(file).Length;
// Add to our file info array
file_infos[index] = new DataFileInfo
{
path = file,
start = start,
stop = stop,
elements = (ulong)(new System.IO.FileInfo(file).Length / 56
/*System.Runtime.InteropServices.Marshal.SizeOf(typeof(DataStruct))*/)
};
++index;
}
// Sort the array
Array.Sort(file_infos, delegate (DataFileInfo x, DataFileInfo y) { return x.start.CompareTo(y.start); });
// Return whether or not there were files found
return (files.Length > 0);
}
public void start()
{
process_start = (ulong)DateTime.Now.ToUniversalTime().Subtract(epoch).TotalMilliseconds;
UInt64 num_samples = 0;
while(run)
{
// Get our samples and add it to the sample
DataStruct[] result = getData(100);
Engine.A.A message = new Engine.A.A();
for (int i = 0; i < result.Length; ++i)
{
Engine.A.Data sample = new Engine.A.Data();
sample.Time = process_start + num_samples * 4;
Int16[] signal_data = Helper.GetData(result[i]);
for(int e = 0; e < signal_data.Length; ++e)
sample.Value[e] = signal_data[e];
message.Signal.Add(sample);
++num_samples;
}
// Send out the websocket
this.output.SendAsync(message.ToByteArray());
// Sleep 100 milliseconds
Thread.Sleep(100);
}
}
public void stop()
{
run = false;
}
private DataStruct[] getData(UInt64 milliseconds)
{
if (file_infos.Length == 0)
return new DataStruct[0];
if (current_file_data == null)
{
current_file_data = ReadObjectFromMMF(file_infos[current_file_index].path) as DataArray;
if(current_file_data.samples.Length == 0)
return new DataStruct[0];
}
UInt64 elements_to_read = (UInt64) milliseconds / 4;
DataStruct[] result = new DataStruct[elements_to_read];
Array.Copy(current_file_data.samples, (int)current_file_offset, result, 0, (int) Math.Min(elements_to_read, file_infos[current_file_index].elements - current_file_offset));
while((UInt64) result.Length != elements_to_read)
{
current_file_index = (current_file_index + 1) % (ulong) file_infos.Length;
current_file_data = ReadObjectFromMMF(file_infos[current_file_index].path) as DataArray;
if (current_file_data.samples.Length == 0)
return new DataStruct[0];
current_file_offset = 0;
Array.Copy(current_file_data.samples, (int)current_file_offset, result, result.Length, (int)Math.Min(elements_to_read, file_infos[current_file_index].elements - current_file_offset));
}
return result;
}
private object ByteArrayToObject(byte[] buffer)
{
BinaryFormatter binaryFormatter = new BinaryFormatter(); // Create new BinaryFormatter
MemoryStream memoryStream = new MemoryStream(buffer); // Convert buffer to memorystream
return binaryFormatter.Deserialize(memoryStream); // Deserialize stream to an object
}
private object ReadObjectFromMMF(string file)
{
// Get a handle to an existing memory mapped file
using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(file, FileMode.Open))
{
// Create a view accessor from which to read the data
using (MemoryMappedViewAccessor mmfReader = mmf.CreateViewAccessor())
{
// Create a data buffer and read entire MMF view into buffer
byte[] buffer = new byte[mmfReader.Capacity];
mmfReader.ReadArray<byte>(0, buffer, 0, buffer.Length);
// Convert the buffer to a .NET object
return ByteArrayToObject(buffer);
}
}
}
Well for one thing you're not using that memory mapped file well at all, you're just sequentially reading it all in a buffer, which is both needlessly inefficient and much slower than if you simply opened the file to read normally. The selling point of memory mapped files is repeated random access and random updates backed by the OS's virtual memory paging.
And you definitely don't need to read the entire file in memory, since your data is so strongly structured. You know exactly how many bytes to read for a record: Marshal.SizeOf<DataStruct>().
Then you need to get rid of all that serialization noise. Again your data is strongly typed, just read it. Get rid of those fixed arrays and use regular arrays, you're already instructing the marshaller how to read them with MarshalAs attributes (good). That also gets rid of that helper function that just copies an array for some unknown reason.
Your reading loop is very simple: read the correct number of bytes for one entry, use Marshal.PtrToStructure to convert it to a readable structure and add it to a list to return at the end. Bonus points if you can use .Net Core and Unsafe.As or Unsafe.Cast.
Edit: and don't use object returns, you know exactly what you're returning, write it down.
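To illustrate that reading loop, here is a minimal sketch (not the asker's code; ReadRecords and its path parameter are invented for the example) that reads one record's worth of bytes at a time and marshals it into the DataStruct from the question:
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

static class RecordReader
{
    public static List<DataStruct> ReadRecords(string path)
    {
        // 8 bytes of milliseconds + 21 Int16 samples + 3 Int16 of padding = 56 bytes.
        int recordSize = Marshal.SizeOf<DataStruct>();
        var records = new List<DataStruct>();
        byte[] buffer = new byte[recordSize];
        using (var stream = File.OpenRead(path))
        {
            while (stream.Read(buffer, 0, recordSize) == recordSize)
            {
                // Pin the buffer and reinterpret those bytes as one DataStruct.
                GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
                try
                {
                    records.Add(Marshal.PtrToStructure<DataStruct>(handle.AddrOfPinnedObject()));
                }
                finally
                {
                    handle.Free();
                }
            }
        }
        return records;
    }
}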

Copy Multiple Files and report progress C#

After looking at multiple questions/answers I couldn't find a solution for my problem. I remember I got this code from some question here at StackOverflow and it works perfectly but just for one file. What I want is multiple files.
This is the original CopyTo Function:
public static void CopyTo(this FileInfo file, FileInfo destination, Action<int> progressCallback)
{
const int bufferSize = 1024 * 1024; //1MB
byte[] buffer = new byte[bufferSize], buffer2 = new byte[bufferSize];
bool swap = false;
int progress = 0, reportedProgress = 0, read = 0;
long len = file.Length;
float flen = len;
Task writer = null;
using (var source = file.OpenRead())
using (var dest = destination.OpenWrite())
{
//dest.SetLength(source.Length);
for (long size = 0; size < len; size += read)
{
if ((progress = ((int)((size / flen) * 100))) != reportedProgress)
progressCallback(reportedProgress = progress);
read = source.Read(swap ? buffer : buffer2, 0, bufferSize);
writer?.Wait(); // if < .NET4 // if (writer != null) writer.Wait();
writer = dest.WriteAsync(swap ? buffer : buffer2, 0, read);
swap = !swap;
}
writer?.Wait(); //Fixed - Thanks @sam-hocevar
}
}
So here is how I start the file copy process:
var ficheiro = ficheirosCopia.ElementAt(x);
var _source = new FileInfo(ficheiro.Key);
var _destination = new FileInfo(ficheiro.Value);
if (_destination.Exists)
{
_destination.Delete();
}
Task.Run(() =>
{
_source.CopyTo(_destination, perc => Dispatcher.Invoke(() => progressBar.SetProgress(perc)));
}).GetAwaiter().OnCompleted(() => MessageBox.Show("File Copied!"));
This works very well when I copy only one file but I need to copy multiple files. So I've started to change things a bit:
public static void CopyTo(Dictionary<string, string> files, Action<int> progressCallback)
{
int globalProgress = 0, globalReportedProgress = 0, globalRead = 0;
for (var x = 0; x < files.Count; x++)
{
var item = files.ElementAt(x);
var file = new FileInfo(item.Key);
var destination = new FileInfo(item.Value);
const int bufferSize = 1024 * 1024; //1MB
byte[] buffer = new byte[bufferSize], buffer2 = new byte[bufferSize];
bool swap = false;
int progress = 0, reportedProgress = 0, read = 0;
long len = file.Length;
float flen = len;
Task writer = null;
using (var source = file.OpenRead())
using (var dest = destination.OpenWrite())
{
for (long size = 0; size < len; size += read)
{
if ((progress = ((int)((size / flen) * 100))) != reportedProgress)
progressCallback(reportedProgress = progress);
read = source.Read(swap ? buffer : buffer2, 0, bufferSize);
writer?.Wait(); // if < .NET4 // if (writer != null) writer.Wait();
writer = dest.WriteAsync(swap ? buffer : buffer2, 0, read);
swap = !swap;
}
writer?.Wait(); //Fixed - Thanks @sam-hocevar
}
}
}
Of course this code has a lot of errors but I can't understand how this should be done.
The main goal would be to start a single task for multiple files, with a progress callback for the global copy, receiving a Dictionary (already created in another part of the code) as a parameter.
I came up with two approaches to this, one reporting progress after each file, and the other reporting progress every n bytes.
namespace StackOverflow41750117CopyProgress
{
using System;
using System.Collections.Generic;
using System.IO;
public class Batch
{
private bool _overwrite;
/// <summary>
/// Initializes a new instance of the <see cref="Batch"/> class.
/// </summary>
/// <param name="overwrite">
/// True to overwrite the destination file if it already exists (default),
/// false to throw an exception if the destination file already exists.
/// </param>
public Batch(bool overwrite = true)
{
this._overwrite = overwrite;
}
/// <summary>
/// Copies the files, reporting progress once per file.
/// </summary>
/// <param name="filesToCopy">
/// A dictionary with the paths of the source files as its keys, and the path to the destination file as its values.
/// </param>
/// <param name="progressCallback">
/// A callback which accepts two Int64 parameters - the number of bytes copied so far, and the total number of bytes to copy.
/// </param>
public void CopyReportingPerFile(Dictionary<string, string> filesToCopy, Action<long, long> progressCallback)
{
var bytesToCopy = this.GetTotalFileSize(filesToCopy);
long totalBytesCopied = 0;
foreach (var copy in filesToCopy)
{
File.Copy(copy.Key, copy.Value, this._overwrite);
totalBytesCopied += new FileInfo(copy.Key).Length;
progressCallback(totalBytesCopied, bytesToCopy);
}
}
/// <summary>
/// Copies the files, reporting progress once per read/write operation.
/// </summary>
/// <param name="filesToCopy">
/// A dictionary with the paths of the source files as its keys, and the path to the destination file as its values.
/// </param>
/// <param name="progressCallback">
/// A callback which accepts two Int64 parameters - the number of bytes copied so far, and the total number of bytes to copy.
/// </param>
public void CopyReportingPerBuffer(Dictionary<string, string> filesToCopy, Action<long, long> progressCallback)
{
var bytesToCopy = this.GetTotalFileSize(filesToCopy);
var bufferSize = 1024 * 1024 * 50;
var buffer = new byte[bufferSize];
var span = new Span<byte>(buffer);
long totalBytesCopied = 0;
foreach (var copy in filesToCopy)
{
using (var source = File.OpenRead(copy.Key))
using (var destination = File.OpenWrite(copy.Value))
{
int bytesRead = 0;
do
{
// The Read method returns 0 once we've reached the end of the file
bytesRead = source.Read(span);
destination.Write(span.Slice(0, bytesRead)); // only write the bytes actually read
totalBytesCopied += bytesRead;
progressCallback(totalBytesCopied, bytesToCopy);
} while (bytesRead > 0);
source.Close();
destination.Close();
}
}
}
private long GetTotalFileSize(Dictionary<string, string> filesToCopy)
{
long bytesToCopy = 0;
foreach (var filename in filesToCopy.Keys)
{
var fileInfo = new FileInfo(filename);
bytesToCopy += fileInfo.Length;
}
return bytesToCopy;
}
}
}
Usage:
namespace StackOverflow41750117CopyProgress
{
using System;
using System.Collections.Generic;
using System.IO;
public class Program
{
public static void Main(string[] args)
{
var filesToCopy = new Dictionary<string, string>();
filesToCopy.Add(@"C:\temp\1.mp4", @"C:\temp\1copy.mp4");
filesToCopy.Add(@"C:\temp\2.mp4", @"C:\temp\2copy.mp4");
filesToCopy.Add(@"C:\temp\3.mp4", @"C:\temp\3copy.mp4");
filesToCopy.Add(@"C:\temp\4.mp4", @"C:\temp\4copy.mp4");
filesToCopy.Add(@"C:\temp\5.mp4", @"C:\temp\5copy.mp4");
filesToCopy.Add(@"C:\temp\6.mp4", @"C:\temp\6copy.mp4");
filesToCopy.Add(@"C:\temp\7.mp4", @"C:\temp\7copy.mp4");
// Make sure the destination files don't already exist
foreach (var copy in filesToCopy)
{
File.Delete(copy.Value);
}
var batch = new Batch();
Console.WriteLine($"Started {DateTime.Now}");
batch.CopyReportingPerFile(filesToCopy, (bytesCopied, bytesToCopy) => Console.WriteLine($"Copied {bytesCopied} bytes of {bytesToCopy}"));
//batch.CopyReportingPerBuffer(filesToCopy, (bytesCopied, bytesToCopy) => Console.WriteLine($"Copied {bytesCopied} bytes of {bytesToCopy}"));
Console.WriteLine($"Finished {DateTime.Now}");
}
}
}
A few observations...
Reporting progress once per file was easier to implement but doesn't meet the requirements of the question, and isn't very responsive if you're copying a small number of large files.
Using File.Copy preserves the original file's modified date, reading the files into memory and then writing them does not.
Increasing the buffer size from 1MB to 10MB and then 50MB increased the memory usage and also improved the performance, although most of that performance improvement seems to be a result of calling Console.WriteLine in my progressCallback less often, rather than increasing the speed of the disk I/O.
The optimum balance between performance and frequency of progress reports will depend on your circumstances and the spec of the machine running the process, but I found a 50MB buffer results in a progress report roughly once per second.
Note the use of Span<byte> rather than byte[] for the buffer that the data is read into and written from - this removes the need for my code to keep track of the current position in the file (and that's something new I learned today).
I know I'm rather late to this question, but hopefully someone will find this useful.

TcpClient.GetStream().DataAvailable returns false, but stream has more data

So, it would seem that a blocking Read() can return before it is done receiving all of the data being sent to it. In turn we wrap the Read() with a loop that is controlled by the DataAvailable value from the stream in question. The problem is that you can receive more data while in this while loop, but there is no behind the scenes processing going on to let the system know this. Most of the solutions I have found to this on the net have not been applicable in one way or another to me.
What I have ended up doing is, as the last step in my loop, a simple Thread.Sleep(1) after reading each block from the stream. This appears to give the system time to update, and I am now getting accurate results, but it seems a bit hacky and quite 'circumstantial' for a solution.
Here is a list of the circumstances I am dealing with: Single TCP Connection between an IIS Application and a standalone application, both written in C# for send/receive communication. It sends a request and then waits for a response. This request is initiated by an HTTP request, but I am not having this issue reading data from the HTTP Request, it is after the fact.
Here is the basic code for handling an incoming connection
protected void OnClientCommunication(TcpClient oClient)
{
NetworkStream stream = oClient.GetStream();
MemoryStream msIn = new MemoryStream();
byte[] aMessage = new byte[4096];
int iBytesRead = 0;
while ( stream.DataAvailable )
{
int iRead = stream.Read(aMessage, 0, aMessage.Length);
iBytesRead += iRead;
msIn.Write(aMessage, 0, iRead);
Thread.Sleep(1);
}
MemoryStream msOut = new MemoryStream();
// .. Do some processing adding data to the msOut stream
msOut.WriteTo(stream);
stream.Flush();
oClient.Close();
}
All feedback welcome for a better solution or just a thumbs up on needing to give that Sleep(1) a go to allow things to update properly before we check the DataAvailable value.
Guess I am hoping after 2 years that the answer to this question isn't how things still are :)
You have to know how much data you need to read; you cannot simply loop reading data until there is no more data, because you can never be sure that no more is going to come.
This is why HTTP GET results have a byte count in the HTTP headers: so the client side will know when it has received all the data.
Here are two solutions for you depending on whether you have control over what the other side is sending:
Use "framing" characters: (SB)data(EB), where SB and EB are start-block and end-block characters (of your choosing) but which CANNOT occur inside the data. When you "see" EB, you know you are done.
Implement a length field in front of each message to indicate how much data follows: (len)data. Read (len), then read (len) bytes; repeat as necessary. (See the sketch below.)
This isn't like reading from a file where a zero-length read means end-of-data (that DOES mean the other side has disconnected, but that's another story).
A third (not recommended) solution is that you can implement a timer. Once you start getting data, set the timer. If the receive loop is idle for some period of time (say a few seconds, if data doesn't come often), you can probably assume no more data is coming. This last method is a last resort... it's not very reliable, hard to tune, and it's fragile.
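As an illustration of the second option, a minimal length-prefixed read (the 4-byte little-endian prefix is an assumption of this sketch, and ReadAll is a local helper, not a framework method):
using System;
using System.IO;
using System.Net.Sockets;

static class Framing
{
    // Reads one message framed as: 4-byte length prefix, then that many payload bytes.
    public static byte[] ReadMessage(NetworkStream stream)
    {
        byte[] lengthBytes = ReadAll(stream, 4);
        int length = BitConverter.ToInt32(lengthBytes, 0);
        return ReadAll(stream, length);
    }

    // Loops until exactly 'count' bytes have been read, because Read may return fewer.
    private static byte[] ReadAll(Stream stream, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0)
                throw new EndOfStreamException("Connection closed before the full message arrived.");
            offset += read;
        }
        return buffer;
    }
}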
I'm seeing a problem with this.
You're expecting that the communication will be faster than the while() loop, which is very unlikely.
The while() loop will finish as soon as there is no more data, which may not be the case a few milliseconds just after it exits.
Are you expecting a certain amount of bytes?
How often is OnClientCommunication() fired? Who triggers it?
What do you do with the data after the while() loop? Do you keep appending to previous data?
DataAvailable WILL return false because you're reading faster than the communication, so that's fine only if you keep coming back to this code block to process more data coming in.
I was trying to check DataAvailable before reading data from a network stream and it would return false, although after reading a single byte it would return true. So I checked the MSDN documentation and they also read before checking. I would re-arrange the while loop to a do while loop to follow this pattern.
http://msdn.microsoft.com/en-us/library/system.net.sockets.networkstream.dataavailable.aspx
// Check to see if this NetworkStream is readable.
if(myNetworkStream.CanRead){
byte[] myReadBuffer = new byte[1024];
StringBuilder myCompleteMessage = new StringBuilder();
int numberOfBytesRead = 0;
// Incoming message may be larger than the buffer size.
do{
numberOfBytesRead = myNetworkStream.Read(myReadBuffer, 0, myReadBuffer.Length);
myCompleteMessage.AppendFormat("{0}", Encoding.ASCII.GetString(myReadBuffer, 0, numberOfBytesRead));
}
while(myNetworkStream.DataAvailable);
// Print out the received message to the console.
Console.WriteLine("You received the following message : " +
myCompleteMessage);
}
else{
Console.WriteLine("Sorry. You cannot read from this NetworkStream.");
}
When I have this code:
var readBuffer = new byte[1024];
using (var memoryStream = new MemoryStream())
{
do
{
int numberOfBytesRead = networkStream.Read(readBuffer, 0, readBuffer.Length);
memoryStream.Write(readBuffer, 0, numberOfBytesRead);
}
while (networkStream.DataAvailable);
}
From what I can observe:
When the sender sends 1000 bytes and the reader wants to read them, I suspect that the NetworkStream somehow "knows" that it should receive 1000 bytes.
When I call .Read before any data has arrived from the NetworkStream, .Read should block until it gets more than 0 bytes (or more if .NoDelay is false on the networkStream).
Then, when I read the first batch of data, I suspect that .Read uses its result to update the NetworkStream's counter of those 1000 bytes; before this happens, .DataAvailable is set to false, and after the counter is updated .DataAvailable is set to the correct value if the counted data is less than 1000 bytes. It makes sense when you think about it, because otherwise the loop would go to the next cycle before checking that the 1000 bytes had arrived, and the .Read method would block indefinitely, since the reader could have already read the 1000 bytes and no more data would arrive.
This, I think, is the point of failure here, as James already said:
Yes, this is just the way these libraries work. They need to be given time to run to fully validate the data incoming. – James Apr 20 '16 at 5:24
I suspect that the update of the internal counter between the end of .Read and the access to .DataAvailable is not an atomic operation (transaction), so the TcpClient needs more time to properly set DataAvailable.
When I have this code:
var readBuffer = new byte[1024];
using (var memoryStream = new MemoryStream())
{
do
{
int numberOfBytesRead = networkStream.Read(readBuffer, 0, readBuffer.Length);
memoryStream.Write(readBuffer, 0, numberOfBytesRead);
if (!networkStream.DataAvailable)
System.Threading.Thread.Sleep(1); //Or 50 for non-believers ;)
}
while (networkStream.DataAvailable);
}
Then the NetworkStream has enough time to properly set .DataAvailable, and this method should function correctly.
Fun fact... this seems to be somewhat OS-version dependent: the first version without the sleep worked for me on Win XP and Win 10, but failed to receive the whole 1000 bytes on Win 7. Don't ask me why, but I tested it quite thoroughly and it was easily reproducible.
Using TcpClient.Available will allow this code to read exactly what is available each time. TcpClient.Available is automatically set to TcpClient.ReceiveBufferSize when the amount of data remaining to be read is greater than or equal to TcpClient.ReceiveBufferSize. Otherwise it is set to the size of the remaining data.
Hence, you can indicate the maximum amount of data that is available for each read by setting TcpClient.ReceiveBufferSize (e.g., oClient.ReceiveBufferSize = 4096;).
protected void OnClientCommunication(TcpClient oClient)
{
NetworkStream stream = oClient.GetStream();
MemoryStream msIn = new MemoryStream();
byte[] aMessage;
oClient.ReceiveBufferSize = 4096;
int iBytesRead = 0;
while (stream.DataAvailable)
{
int myBufferSize = (oClient.Available < 1) ? 1 : oClient.Available;
aMessage = new byte[myBufferSize]; // use the clamped size so we never allocate a zero-length buffer
int iRead = stream.Read(aMessage, 0, aMessage.Length);
iBytesRead += iRead;
msIn.Write(aMessage, 0, iRead);
}
MemoryStream msOut = new MemoryStream();
// .. Do some processing adding data to the msOut stream
msOut.WriteTo(stream);
stream.Flush();
oClient.Close();
}
public class NetworkStream
{
private readonly Socket m_Socket;
public NetworkStream(Socket socket)
{
m_Socket = socket ?? throw new ArgumentNullException(nameof(socket));
}
public void Send(string message)
{
if (message is null)
{
throw new ArgumentNullException(nameof(message));
}
byte[] data = Encoding.UTF8.GetBytes(message);
SendInternal(data);
}
public string Receive()
{
byte[] buffer = ReceiveInternal();
string message = Encoding.UTF8.GetString(buffer);
return message;
}
private void SendInternal(byte[] message)
{
int size = message.Length;
if (size == 0)
{
m_Socket.Send(BitConverter.GetBytes(size), 0, sizeof(int), SocketFlags.None);
}
else
{
m_Socket.Send(BitConverter.GetBytes(size), 0, sizeof(int), SocketFlags.None);
m_Socket.Send(message, 0, size, SocketFlags.None);
}
}
private byte[] ReceiveInternal()
{
byte[] sizeData = CommonReceiveMessage(sizeof(int));
int size = BitConverter.ToInt32(sizeData);
if (size == 0)
{
return Array.Empty<byte>();
}
return CommonReceiveMessage(size);
}
private byte[] CommonReceiveMessage(int messageLength)
{
if (messageLength < 0)
{
throw new ArgumentOutOfRangeException(nameof(messageLength), messageLength, "Message size cannot be less than zero.");
}
if (messageLength == 0)
{
return Array.Empty<byte>();
}
byte[] buffer = new byte[m_Socket.ReceiveBufferSize];
int currentLength = 0;
int receivedDataLength;
using (MemoryStream memoryStream = new())
{
do
{
receivedDataLength = m_Socket.Receive(buffer, 0, Math.Min(m_Socket.ReceiveBufferSize, messageLength - currentLength), SocketFlags.None); // don't read past this message into the next one
currentLength += receivedDataLength;
memoryStream.Write(buffer, 0, receivedDataLength);
}
while (currentLength < messageLength);
return memoryStream.ToArray();
}
}
}
This example presents an algorithm for sending and receiving data, namely text messages. You can also send files.
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;
namespace Network
{
/// <summary>
/// Represents a network stream for transferring data.
/// </summary>
public class NetworkStream
{
#region Fields
private static readonly byte[] EmptyArray = Array.Empty<byte>();
private readonly Socket m_Socket;
#endregion
#region Constructors
/// <summary>
/// Initializes a new instance of the class <seealso cref="NetworkStream"/>.
/// </summary>
/// <param name="socket">
/// Berkeley socket interface.
/// </param>
public NetworkStream(Socket socket)
{
m_Socket = socket ?? throw new ArgumentNullException(nameof(socket));
}
#endregion
#region Properties
#endregion
#region Methods
/// <summary>
/// Sends a message.
/// </summary>
/// <param name="message">
/// Message text.
/// </param>
/// <exception cref="ArgumentNullException"/>
public void Send(string message)
{
if (message is null)
{
throw new ArgumentNullException(nameof(message));
}
byte[] data = Encoding.UTF8.GetBytes(message);
Write(data);
}
/// <summary>
/// Receives the sent message.
/// </summary>
/// <returns>
/// Sent message.
/// </returns>
public string Receive()
{
byte[] data = Read();
return Encoding.UTF8.GetString(data);
}
/// <summary>
/// Receives the specified number of bytes from a bound <seealso cref="Socket"/>.
/// </summary>
/// <param name="socket">
/// <seealso cref="Socket"/> for receiving data.
/// </param>
/// <param name="size">
/// The size of the received data.
/// </param>
/// <returns>
/// Returns an array of received data.
/// </returns>
private byte[] Read(int size)
{
if (size < 0)
{
// You can throw an exception.
return null;
}
if (size == 0)
{
// Don't throw an exception here, just return an empty data array.
return EmptyArray;
}
// There are many examples on the Internet where the
// Socket.Available property is used, this is WRONG!
// Important! The Socket.Available property is not working as expected.
// Data packages may be in transit, but the Socket.Available property may indicate otherwise.
// Therefore, we use a counter that will allow us to receive all data packets, no more and no less.
// The cycle will continue until we receive all the data packets or the timeout is triggered.
// Note. This algorithm is not designed to work with big data.
SimpleCounter counter = new(size, m_Socket.ReceiveBufferSize);
byte[] buffer = new byte[counter.BufferSize];
int received;
using MemoryStream storage = new();
// The cycle will run until we get all the data.
while (counter.IsExpected)
{
received = m_Socket.Receive(buffer, 0, counter.Available, SocketFlags.None);
// Pass the size of the received data to the counter.
counter.Count(received);
// Write data to memory.
storage.Write(buffer, 0, received);
}
return storage.ToArray();
}
/// <summary>
/// Receives the specified number of bytes from a bound <seealso cref="Socket"/>.
/// </summary>
/// <returns>
/// Returns an array of received data.
/// </returns>
private byte[] Read()
{
byte[] sizeData;
// First, we get the size of the master data.
sizeData = Read(sizeof(int));
// We convert the received data into a number.
int size = BitConverter.ToInt32(sizeData);
// If the data size is less than 0 then throws an exception.
// We inform the recipient that an error occurred while reading the data.
if (size < 0)
{
// Or return the value null.
throw new SocketException();
}
// If the data size is 0, then we will return an empty array.
// Do not allow an exception here.
if (size == 0)
{
return EmptyArray;
}
// Here we read the master data.
byte[] data = Read(size);
return data;
}
/// <summary>
/// Writes data to the stream.
/// </summary>
/// <param name="data"></param>
private void Write(byte[] data)
{
if (data is null)
{
// Throw an exception.
// Or send a negative number that will represent the value null.
throw new ArgumentNullException(nameof(data));
}
byte[] sizeData = BitConverter.GetBytes(data.Length);
// In any case, we inform the recipient about the size of the data.
m_Socket.Send(sizeData, 0, sizeof(int), SocketFlags.None);
if (data.Length != 0)
{
// We send data whose size is greater than zero.
m_Socket.Send(data, 0, data.Length, SocketFlags.None);
}
}
#endregion
#region Classes
/// <summary>
/// Represents a simple counter of received data over the network.
/// </summary>
private class SimpleCounter
{
#region Fields
private int m_Received;
private int m_Available;
private bool m_IsExpected;
#endregion
#region Constructors
/// <summary>
/// Initializes a new instance of the class <seealso cref="SimpleCounter"/>.
/// </summary>
/// <param name="dataSize">
/// Data size.
/// </param>
/// <param name="bufferSize">
/// Buffer size.
/// </param>
/// <exception cref="ArgumentOutOfRangeException"/>
public SimpleCounter(int dataSize, int bufferSize)
{
if (dataSize < 0)
{
throw new ArgumentOutOfRangeException(nameof(dataSize), dataSize, "Data size cannot be less than 0");
}
if (bufferSize < 0)
{
throw new ArgumentOutOfRangeException(nameof(bufferSize), bufferSize, "Buffer size cannot be less than 0");
}
DataSize = dataSize;
BufferSize = bufferSize;
// Update the counter data.
UpdateCounter();
}
#endregion
#region Properties
/// <summary>
/// Returns the size of the expected data.
/// </summary>
/// <value>
/// Size of expected data.
/// </value>
public int DataSize { get; }
/// <summary>
/// Returns the size of the buffer.
/// </summary>
/// <value>
/// Buffer size.
/// </value>
public int BufferSize { get; }
/// <summary>
/// Returns the available buffer size for receiving data.
/// </summary>
/// <value>
/// Available buffer size.
/// </value>
public int Available
{
get
{
return m_Available;
}
}
/// <summary>
/// Returns a value indicating whether more data is still expected.
/// </summary>
/// <value>
/// <see langword="true"/> if more data is expected; otherwise, <see langword="false"/>.
/// </value>
public bool IsExpected
{
get
{
return m_IsExpected;
}
}
#endregion
#region Methods
// Updates the counter.
private void UpdateCounter()
{
int unreadDataSize = DataSize - m_Received;
m_Available = unreadDataSize < BufferSize ? unreadDataSize : BufferSize;
m_IsExpected = m_Available > 0;
}
/// <summary>
/// Specifies the size of the received data.
/// </summary>
/// <param name="bytes">
/// The size of the received data.
/// </param>
public void Count(int bytes)
{
// NOTE: Counter cannot decrease.
if (bytes > 0)
{
int received = m_Received += bytes;
// NOTE: The value of the received data cannot exceed the size of the expected data.
m_Received = (received < DataSize) ? received : DataSize;
// Update the counter data.
UpdateCounter();
}
}
/// <summary>
/// Resets counter data.
/// </summary>
public void Reset()
{
m_Received = 0;
UpdateCounter();
}
#endregion
}
#endregion
}
}
Use a do-while loop. This makes sure at least one read happens before DataAvailable is checked: the first Read or ReadAsync advances the stream, and from then on the DataAvailable property will continue to return true until we hit the end of the stream.
An example from the Microsoft docs:
// Check to see if this NetworkStream is readable.
if(myNetworkStream.CanRead){
byte[] myReadBuffer = new byte[1024];
StringBuilder myCompleteMessage = new StringBuilder();
int numberOfBytesRead = 0;
// Incoming message may be larger than the buffer size.
do{
numberOfBytesRead = myNetworkStream.Read(myReadBuffer, 0, myReadBuffer.Length);
myCompleteMessage.AppendFormat("{0}", Encoding.ASCII.GetString(myReadBuffer, 0, numberOfBytesRead));
}
while(myNetworkStream.DataAvailable);
// Print out the received message to the console.
Console.WriteLine("You received the following message : " +
myCompleteMessage);
}
else{
Console.WriteLine("Sorry. You cannot read from this NetworkStream.");
}
Original Microsoft doc

How to compare 2 files fast using .NET?

Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.
Would a checksum comparison such as CRC be faster?
Are there any .NET libraries that can generate a checksum for a file?
The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you would use an array of bytes sized to Int64, and then compare the resulting numbers.
Here's what I came up with:
const int BYTES_TO_READ = sizeof(Int64);
static bool FilesAreEqual(FileInfo first, FileInfo second)
{
if (first.Length != second.Length)
return false;
if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
return true;
int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);
using (FileStream fs1 = first.OpenRead())
using (FileStream fs2 = second.OpenRead())
{
byte[] one = new byte[BYTES_TO_READ];
byte[] two = new byte[BYTES_TO_READ];
for (int i = 0; i < iterations; i++)
{
fs1.Read(one, 0, BYTES_TO_READ);
fs2.Read(two, 0, BYTES_TO_READ);
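// NOTE (see later answers): Read is not guaranteed to fill the buffer,
// so a robust version must loop until BYTES_TO_READ bytes are read or the end of the stream is reached.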
if (BitConverter.ToInt64(one,0) != BitConverter.ToInt64(two,0))
return false;
}
}
return true;
}
In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.
Here are the ReadByte and hashing methods I used, for comparison purposes:
static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
{
if (first.Length != second.Length)
return false;
if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
return true;
using (FileStream fs1 = first.OpenRead())
using (FileStream fs2 = second.OpenRead())
{
for (int i = 0; i < first.Length; i++)
{
if (fs1.ReadByte() != fs2.ReadByte())
return false;
}
}
return true;
}
static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
{
using var firstStream = first.OpenRead();
using var secondStream = second.OpenRead();
byte[] firstHash = MD5.Create().ComputeHash(firstStream);
byte[] secondHash = MD5.Create().ComputeHash(secondStream);
for (int i=0; i<firstHash.Length; i++)
{
if (firstHash[i] != secondHash[i])
return false;
}
return true;
}
A checksum comparison will most likely be slower than a byte-by-byte comparison.
In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.
As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
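That example isn't reproduced in this page, but a minimal sketch of generating an MD5 checksum for a file (the method name is mine) might look like this:
using System.IO;
using System.Security.Cryptography;
static byte[] ComputeMd5(string path)
{
    // Both the MD5 instance and the stream are disposable.
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        return md5.ComputeHash(stream);
    }
}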
However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean you only need to do the disk I/O once, on the new file. This would likely be faster than a byte-by-byte comparison.
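To illustrate that pre-computed case (a sketch with invented names, not code from the answer): the stored file's hash is computed once and cached, so each later check only reads the new file from disk.
using System.Linq;
// existingFileHash was computed earlier (e.g. with ComputeMd5 above) and kept in a database or cache.
static bool MatchesExistingFile(string newFilePath, byte[] existingFileHash)
{
    return ComputeMd5(newFilePath).SequenceEqual(existingFileHash);
}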
If you do decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:
• for `System.String` path names:
public static bool AreFileContentsEqual(String path1, String path2) =>
File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
• for `System.IO.FileInfo` instances:
public static bool AreFileContentsEqual(FileInfo fi1, FileInfo fi2) =>
fi1.Length == fi2.Length &&
(fi1.Length == 0L || File.ReadAllBytes(fi1.FullName).SequenceEqual(
File.ReadAllBytes(fi2.FullName)));
Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc., but as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line-ending, character encoding, media metadata, whitespace, padding, source code comments, etc.; see note 1) will always be considered not-equal.
This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (because it's fundamentally optimized to keep small, short-lived allocations extremely cheap), and in fact could even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) implies maximally delegating file performance issues to the CLR, BCL, and JIT to benefit from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.
Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk at all for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though SequenceEqual does in fact give us the "optimization" of abandoning on first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary for any true-positive case.
Note 1: An obscure exception: NTFS alternate data streams are not examined by any of the answers discussed on this page, so such streams may be different for files otherwise reported as the "same."
In addition to Reed Copsey's answer:
The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.
If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.
For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
It gets even faster if you don't read in small 8-byte chunks but read a larger chunk and loop over it. This reduced my average comparison time to about a quarter.
public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
bool result;
if (fileInfo1.Length != fileInfo2.Length)
{
result = false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
result = StreamsContentsAreEqual(file1, file2);
}
}
}
return result;
}
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = stream1.Read(buffer1, 0, bufferSize);
int count2 = stream2.Read(buffer2, 0, bufferSize);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
Edit: This method would not work for comparing binary files!
In .NET 4.0, the File class has the following two new methods:
public static IEnumerable<string> ReadLines(string path)
public static IEnumerable<string> ReadLines(string path, Encoding encoding)
Which means you could use:
bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));
The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may however very well be eaten up by the added time of calculating the hash.
Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.
You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.
If the hash code for example is 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.
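To put a number on that: with a well-distributed 32-bit hash, the chance that two different files happen to produce the same value is roughly 1 / 2^32 ≈ 2.3 × 10^-10, so the confidence is about 1 − 2.3 × 10^-10 ≈ 99.99999998%, which is the figure quoted above.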
My answer is a derivative of #lars but fixes the bug in the call to Stream.Read. I also add some fast path checking that other answers had, and input validation. In short, this should be the answer:
using System;
using System.IO;
namespace ConsoleApp4
{
class Program
{
static void Main(string[] args)
{
var fi1 = new FileInfo(args[0]);
var fi2 = new FileInfo(args[1]);
Console.WriteLine(FilesContentsAreEqual(fi1, fi2));
}
public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
if (fileInfo1 == null)
{
throw new ArgumentNullException(nameof(fileInfo1));
}
if (fileInfo2 == null)
{
throw new ArgumentNullException(nameof(fileInfo2));
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
return StreamsContentsAreEqual(file1, file2);
}
}
}
}
private static int ReadFullBuffer(Stream stream, byte[] buffer)
{
int bytesRead = 0;
while (bytesRead < buffer.Length)
{
int read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
if (read == 0)
{
// Reached end of stream.
return bytesRead;
}
bytesRead += read;
}
return bytesRead;
}
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = ReadFullBuffer(stream1, buffer1);
int count2 = ReadFullBuffer(stream2, buffer2);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
}
}
Or if you want to be super-awesome, you can use the async variant:
using System;
using System.IO;
using System.Threading.Tasks;
namespace ConsoleApp4
{
class Program
{
static void Main(string[] args)
{
var fi1 = new FileInfo(args[0]);
var fi2 = new FileInfo(args[1]);
Console.WriteLine(FilesContentsAreEqualAsync(fi1, fi2).GetAwaiter().GetResult());
}
public static async Task<bool> FilesContentsAreEqualAsync(FileInfo fileInfo1, FileInfo fileInfo2)
{
if (fileInfo1 == null)
{
throw new ArgumentNullException(nameof(fileInfo1));
}
if (fileInfo2 == null)
{
throw new ArgumentNullException(nameof(fileInfo2));
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
else
{
using (var file1 = fileInfo1.OpenRead())
{
using (var file2 = fileInfo2.OpenRead())
{
return await StreamsContentsAreEqualAsync(file1, file2).ConfigureAwait(false);
}
}
}
}
private static async Task<int> ReadFullBufferAsync(Stream stream, byte[] buffer)
{
int bytesRead = 0;
while (bytesRead < buffer.Length)
{
int read = await stream.ReadAsync(buffer, bytesRead, buffer.Length - bytesRead).ConfigureAwait(false);
if (read == 0)
{
// Reached end of stream.
return bytesRead;
}
bytesRead += read;
}
return bytesRead;
}
private static async Task<bool> StreamsContentsAreEqualAsync(Stream stream1, Stream stream2)
{
const int bufferSize = 1024 * sizeof(Int64);
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = await ReadFullBufferAsync(stream1, buffer1).ConfigureAwait(false);
int count2 = await ReadFullBufferAsync(stream2, buffer2).ConfigureAwait(false);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
}
}
Honestly, I think you need to prune your search tree down as much as possible.
Things to check before going byte-by-byte:
Are sizes the same?
Is the last byte in file A different than file B
Also, reading large blocks at a time will be more efficient since drives read sequential bytes more quickly. Going byte-by-byte causes not only far more system calls, but it causes the read head of a traditional hard drive to seek back and forth more often if both files are on the same drive.
Read chunk A and chunk B into a byte buffer, and compare them (do NOT use Array.Equals, see comments). Tune the size of the blocks until you hit what you feel is a good trade off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.
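Not code from this answer, but a rough sketch of those cheap pre-checks (size first, then the last byte) before committing to a full chunked comparison; a true result only means the files could still be equal, so the full comparison is still required:
using System.IO;
static bool QuickPreChecksPass(FileInfo a, FileInfo b)
{
    if (a.Length != b.Length) return false;   // different sizes: definitely not equal
    if (a.Length == 0) return true;           // both empty: nothing more to check
    using (var fa = a.OpenRead())
    using (var fb = b.OpenRead())
    {
        // Jump to the last byte of each file and compare just that one byte.
        fa.Seek(-1, SeekOrigin.End);
        fb.Seek(-1, SeekOrigin.End);
        return fa.ReadByte() == fb.ReadByte();
    }
}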
Inspired by https://dev.to/emrahsungu/how-to-compare-two-files-using-net-really-really-fast-2pd9
Here is a proposal to do it with AVX2 SIMD instructions:
using System.Buffers;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
namespace FileCompare;
public static class FastFileCompare
{
public static bool AreFilesEqual(FileInfo fileInfo1, FileInfo fileInfo2, int bufferSize = 4096 * 32)
{
if (fileInfo1.Exists == false)
{
throw new FileNotFoundException(nameof(fileInfo1), fileInfo1.FullName);
}
if (fileInfo2.Exists == false)
{
throw new FileNotFoundException(nameof(fileInfo2), fileInfo2.FullName);
}
if (fileInfo1.Length != fileInfo2.Length)
{
return false;
}
if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
{
return true;
}
using FileStream fileStream01 = fileInfo1.OpenRead();
using FileStream fileStream02 = fileInfo2.OpenRead();
ArrayPool<byte> sharedArrayPool = ArrayPool<byte>.Shared;
byte[] buffer1 = sharedArrayPool.Rent(bufferSize);
byte[] buffer2 = sharedArrayPool.Rent(bufferSize);
Array.Fill<byte>(buffer1, 0);
Array.Fill<byte>(buffer2, 0);
try
{
while (true)
{
int len1 = 0;
for (int read;
len1 < buffer1.Length &&
(read = fileStream01.Read(buffer1, len1, buffer1.Length - len1)) != 0;
len1 += read)
{
}
int len2 = 0;
for (int read;
len2 < buffer2.Length &&
(read = fileStream02.Read(buffer2, len2, buffer2.Length - len2)) != 0;
len2 += read)
{
}
if (len1 != len2)
{
return false;
}
if (len1 == 0)
{
return true;
}
unsafe
{
fixed (byte* pb1 = buffer1)
{
fixed (byte* pb2 = buffer2)
{
int vectorSize = Vector256<byte>.Count;
for (int processed = 0; processed < len1; processed += vectorSize)
{
Vector256<byte> result = Avx2.CompareEqual(Avx.LoadVector256(pb1 + processed), Avx.LoadVector256(pb2 + processed));
if (Avx2.MoveMask(result) != -1)
{
return false;
}
}
}
}
}
}
}
finally
{
sharedArrayPool.Return(buffer1);
sharedArrayPool.Return(buffer2);
}
}
}
If the files are not too big, you can use:
public static byte[] ComputeFileHash(string fileName)
{
using (var stream = File.OpenRead(fileName))
return System.Security.Cryptography.MD5.Create().ComputeHash(stream);
}
Comparing hashes is only worthwhile if you can store the hashes somewhere and reuse them.
(Edited the code to something much cleaner.)
My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference compared to comparing the bytes in a byte array directly.
So it is possible to replace the "Math.Ceiling and iterations" loop from the answer above with the simplest one:
for (int i = 0; i < count1; i++)
{
if (buffer1[i] != buffer2[i])
return false;
}
I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check arguments and then perform the bit shifting) before you compare, and that ends up being the same amount of work as comparing 8 bytes in two arrays.
Another improvement on large files with identical length, might be to not read the files sequentially, but rather compare more or less random blocks.
You can use multiple threads, starting on different positions in the file and comparing either forward or backwards.
This way you can detect changes at the middle/end of the file, faster than you would get there using a sequential approach.
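Not the poster's code, but a simplified single-threaded sketch of that idea: sample a block at the start and a block at the end before (or instead of starting with) a full sequential pass. Matching blocks only mean the files might be equal; a full comparison is still needed to be certain.
using System;
using System.IO;
static bool HeadAndTailBlocksMatch(string path1, string path2, int blockSize = 64 * 1024)
{
    using (var s1 = File.OpenRead(path1))
    using (var s2 = File.OpenRead(path2))
    {
        if (s1.Length != s2.Length) return false;
        if (s1.Length == 0) return true;
        int size = (int)Math.Min(blockSize, s1.Length);
        var b1 = new byte[size];
        var b2 = new byte[size];
        // Fills the buffer completely; both files are at least 'size' bytes long here.
        void Fill(Stream s, byte[] buffer)
        {
            int total = 0;
            while (total < buffer.Length)
            {
                int read = s.Read(buffer, total, buffer.Length - total);
                if (read == 0) break;
                total += read;
            }
        }
        // Block at the start of both files.
        Fill(s1, b1);
        Fill(s2, b2);
        if (!b1.AsSpan().SequenceEqual(b2)) return false;
        // Block at the end of both files.
        s1.Seek(-size, SeekOrigin.End);
        s2.Seek(-size, SeekOrigin.End);
        Fill(s1, b1);
        Fill(s2, b2);
        return b1.AsSpan().SequenceEqual(b2);
    }
}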
If you only need to compare two files, I guess the fastest way would be (in C, I don't know if it's applicable to .NET)
open both files f1, f2
get the respective file length l1, l2
if l1 != l2 the files are different; stop
mmap() both files
use memcmp() on the mmap()ed files
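.NET has no direct mmap()/memcmp(), but the closest equivalent uses memory-mapped files. A rough sketch of the same idea (my own code, assuming unsafe blocks are allowed in the project) could look like this:
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
static unsafe bool FilesEqualMapped(string path1, string path2)
{
    var info1 = new FileInfo(path1);
    var info2 = new FileInfo(path2);
    if (info1.Length != info2.Length) return false;
    if (info1.Length == 0) return true; // CreateFromFile cannot map empty files
    using var mmf1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
    using var mmf2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
    using var view1 = mmf1.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);
    using var view2 = mmf2.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);
    byte* p1 = null, p2 = null;
    try
    {
        view1.SafeMemoryMappedViewHandle.AcquirePointer(ref p1);
        view2.SafeMemoryMappedViewHandle.AcquirePointer(ref p2);
        long length = info1.Length;
        const int chunk = 1 << 20; // compare 1 MB spans so each stays within int range
        for (long offset = 0; offset < length; offset += chunk)
        {
            int size = (int)Math.Min(chunk, length - offset);
            if (!new ReadOnlySpan<byte>(p1 + offset, size).SequenceEqual(new ReadOnlySpan<byte>(p2 + offset, size)))
                return false;
        }
        return true;
    }
    finally
    {
        if (p1 != null) view1.SafeMemoryMappedViewHandle.ReleasePointer();
        if (p2 != null) view2.SafeMemoryMappedViewHandle.ReleasePointer();
    }
}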
OTOH, if you need to find if there are duplicate files in a set of N files, then the fastest way is undoubtedly using a hash to avoid N-way bit-by-bit comparisons.
Something (hopefully) reasonably efficient:
public class FileCompare
{
public static bool FilesEqual(string fileName1, string fileName2)
{
return FilesEqual(new FileInfo(fileName1), new FileInfo(fileName2));
}
/// <summary>
///
/// </summary>
/// <param name="file1"></param>
/// <param name="file2"></param>
/// <param name="bufferSize">8kb seemed like a good default</param>
/// <returns></returns>
public static bool FilesEqual(FileInfo file1, FileInfo file2, int bufferSize = 8192)
{
if (!file1.Exists || !file2.Exists || file1.Length != file2.Length) return false;
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
using (var stream1 = file1.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
{
using (var stream2 = file2.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
{
while (true)
{
var bytesRead1 = stream1.Read(buffer1, 0, bufferSize);
var bytesRead2 = stream2.Read(buffer2, 0, bufferSize);
if (bytesRead1 != bytesRead2) return false;
if (bytesRead1 == 0) return true;
if (!ArraysEqual(buffer1, buffer2, bytesRead1)) return false;
}
}
}
}
/// <summary>
///
/// </summary>
/// <param name="array1"></param>
/// <param name="array2"></param>
/// <param name="bytesToCompare"> 0 means compare entire arrays</param>
/// <returns></returns>
public static bool ArraysEqual(byte[] array1, byte[] array2, int bytesToCompare = 0)
{
if (array1.Length != array2.Length) return false;
var length = (bytesToCompare == 0) ? array1.Length : bytesToCompare;
var tailIdx = length - length % sizeof(Int64);
//check in 8 byte chunks
for (var i = 0; i < tailIdx; i += sizeof(Int64))
{
if (BitConverter.ToInt64(array1, i) != BitConverter.ToInt64(array2, i)) return false;
}
//check the remainder of the array, always shorter than 8 bytes
for (var i = tailIdx; i < length; i++)
{
if (array1[i] != array2[i]) return false;
}
return true;
}
}
Here are some utility functions that allow you to determine if two files (or two streams) contain identical data.
I have provided a "fast" version that is multi-threaded as it compares byte arrays (each buffer filled from what's been read in each file) in different threads using Tasks.
As expected, it's much faster (around 3x faster) but it consumes more CPU (because it's multi threaded) and more memory (because it needs two byte array buffers per comparison thread).
public static bool AreFilesIdenticalFast(string path1, string path2)
{
return AreFilesIdentical(path1, path2, AreStreamsIdenticalFast);
}
public static bool AreFilesIdentical(string path1, string path2)
{
return AreFilesIdentical(path1, path2, AreStreamsIdentical);
}
public static bool AreFilesIdentical(string path1, string path2, Func<Stream, Stream, bool> areStreamsIdentical)
{
if (path1 == null)
throw new ArgumentNullException(nameof(path1));
if (path2 == null)
throw new ArgumentNullException(nameof(path2));
if (areStreamsIdentical == null)
throw new ArgumentNullException(nameof(areStreamsIdentical));
if (!File.Exists(path1) || !File.Exists(path2))
return false;
using (var thisFile = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
using (var valueFile = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
if (valueFile.Length != thisFile.Length)
return false;
if (!areStreamsIdentical(thisFile, valueFile))
return false;
}
}
return true;
}
public static bool AreStreamsIdenticalFast(Stream stream1, Stream stream2)
{
if (stream1 == null)
throw new ArgumentNullException(nameof(stream1));
if (stream2 == null)
throw new ArgumentNullException(nameof(stream2));
const int bufsize = 80000; // 80000 is below LOH (85000)
var tasks = new List<Task<bool>>();
do
{
// consumes more memory (two buffers for each tasks)
var buffer1 = new byte[bufsize];
var buffer2 = new byte[bufsize];
int read1 = stream1.Read(buffer1, 0, buffer1.Length);
if (read1 == 0)
{
int read3 = stream2.Read(buffer2, 0, 1);
if (read3 != 0) // not eof
return false;
break;
}
// both stream read could return different counts
int read2 = 0;
do
{
int read3 = stream2.Read(buffer2, read2, read1 - read2);
if (read3 == 0)
return false;
read2 += read3;
}
while (read2 < read1);
// consumes more cpu
var task = Task.Run(() =>
{
return IsSame(buffer1, buffer2);
});
tasks.Add(task);
}
while (true);
Task.WaitAll(tasks.ToArray());
return !tasks.Any(t => !t.Result);
}
public static bool AreStreamsIdentical(Stream stream1, Stream stream2)
{
if (stream1 == null)
throw new ArgumentNullException(nameof(stream1));
if (stream2 == null)
throw new ArgumentNullException(nameof(stream2));
const int bufsize = 80000; // 80000 is below LOH (85000)
var buffer1 = new byte[bufsize];
var buffer2 = new byte[bufsize];
var tasks = new List<Task<bool>>();
do
{
int read1 = stream1.Read(buffer1, 0, buffer1.Length);
if (read1 == 0)
return stream2.Read(buffer2, 0, 1) == 0; // check not eof
// both stream read could return different counts
int read2 = 0;
do
{
int read3 = stream2.Read(buffer2, read2, read1 - read2);
if (read3 == 0)
return false;
read2 += read3;
}
while (read2 < read1);
if (!IsSame(buffer1, buffer2))
return false;
}
while (true);
}
public static bool IsSame(byte[] bytes1, byte[] bytes2)
{
if (bytes1 == null)
throw new ArgumentNullException(nameof(bytes1));
if (bytes2 == null)
throw new ArgumentNullException(nameof(bytes2));
if (bytes1.Length != bytes2.Length)
return false;
for (int i = 0; i < bytes1.Length; i++)
{
if (bytes1[i] != bytes2[i])
return false;
}
return true;
}
I think there are applications where "hash" is faster than comparing byte by byte.
If you need to compare a file with others or have a thumbnail of a photo that can change.
It depends on where and how it is used.
private bool CompareFilesByte(string file1, string file2)
{
using (var fs1 = new FileStream(file1, FileMode.Open))
using (var fs2 = new FileStream(file2, FileMode.Open))
{
if (fs1.Length != fs2.Length) return false;
int b1, b2;
do
{
b1 = fs1.ReadByte();
b2 = fs2.ReadByte();
if (b1 != b2) return false;
}
while (b1 >= 0);
}
return true;
}
private string HashFile(string file)
{
using (var fs = new FileStream(file, FileMode.Open))
using (var reader = new BinaryReader(fs))
{
var hash = new SHA512CryptoServiceProvider();
hash.ComputeHash(reader.ReadBytes((int)fs.Length));
return Convert.ToBase64String(hash.Hash);
}
}
private bool CompareFilesWithHash(string file1, string file2)
{
var str1 = HashFile(file1);
var str2 = HashFile(file2);
return str1 == str2;
}
Here, you can get what is the fastest.
var sw = new Stopwatch();
sw.Start();
var compare1 = CompareFilesWithHash(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare using Hash {0}", sw.ElapsedTicks));
sw.Reset();
sw.Start();
var compare2 = CompareFilesByte(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare byte-byte {0}", sw.ElapsedTicks));
Optionally, we can save the hash in a database.
Hope this can help
I have found that this works well: compare the lengths first, without reading any data, and only then compare the byte sequences.
private static bool IsFileIdentical(string a, string b)
{
if (new FileInfo(a).Length != new FileInfo(b).Length) return false;
return (File.ReadAllBytes(a).SequenceEqual(File.ReadAllBytes(b)));
}
Yet another answer, derived from #chsh. MD5 with usings and shortcuts for file same, file not exists and differing lengths:
/// <summary>
/// Performs an md5 on the content of both files and returns true if
/// they match
/// </summary>
/// <param name="file1">first file</param>
/// <param name="file2">second file</param>
/// <returns>true if the contents of the two files is the same, false otherwise</returns>
public static bool IsSameContent(string file1, string file2)
{
if (file1 == file2)
return true;
FileInfo file1Info = new FileInfo(file1);
FileInfo file2Info = new FileInfo(file2);
if (!file1Info.Exists && !file2Info.Exists)
return true;
if (!file1Info.Exists && file2Info.Exists)
return false;
if (file1Info.Exists && !file2Info.Exists)
return false;
if (file1Info.Length != file2Info.Length)
return false;
using (FileStream file1Stream = file1Info.OpenRead())
using (FileStream file2Stream = file2Info.OpenRead())
{
byte[] firstHash = MD5.Create().ComputeHash(file1Stream);
byte[] secondHash = MD5.Create().ComputeHash(file2Stream);
for (int i = 0; i < firstHash.Length; i++)
{
if (i>=secondHash.Length||firstHash[i] != secondHash[i])
return false;
}
return true;
}
}
Not really an answer, but kinda funny.
This is what GitHub's Copilot (AI) suggested :-)
public static void CompareFiles(FileInfo actualFile, FileInfo expectedFile) {
if (actualFile.Length != expectedFile.Length) {
throw new Exception($"File {actualFile.Name} has different length in actual and expected directories.");
}
// compare the files on a byte level
using var actualStream = actualFile.OpenRead();
using var expectedStream = expectedFile.OpenRead();
var actualBuffer = new byte[1024];
var expectedBuffer = new byte[1024];
int actualBytesRead;
int expectedBytesRead;
do {
actualBytesRead = actualStream.Read(actualBuffer, 0, actualBuffer.Length);
expectedBytesRead = expectedStream.Read(expectedBuffer, 0, expectedBuffer.Length);
if (actualBytesRead != expectedBytesRead) {
throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
}
if (!actualBuffer.SequenceEqual(expectedBuffer)) {
throw new Exception($"File {actualFile.Name} has different content in actual and expected directories.");
}
} while (actualBytesRead > 0);
}
I find the usage of SequenceEqual particularly interesting.

Compare binary files in C#

I want to compare two binary files. One of them is already stored on the server with a pre-calculated CRC32 in the database from when I stored it originally.
I know that if the CRC is different, then the files are definitely different. However, if the CRC is the same, I don't know that the files are. So, I'm looking for a nice efficient way of comparing the two streams: one from the posted file and one from the file system.
I'm not an expert on streams, but I'm well aware that I could easily shoot myself in the foot here as far as memory usage is concerned.
static bool FileEquals(string fileName1, string fileName2)
{
// Check the file size and CRC equality here.. if they are equal...
using (var file1 = new FileStream(fileName1, FileMode.Open))
using (var file2 = new FileStream(fileName2, FileMode.Open))
return FileStreamEquals(file1, file2);
}
static bool FileStreamEquals(Stream stream1, Stream stream2)
{
const int bufferSize = 2048;
byte[] buffer1 = new byte[bufferSize]; //buffer size
byte[] buffer2 = new byte[bufferSize];
while (true) {
int count1 = stream1.Read(buffer1, 0, bufferSize);
int count2 = stream2.Read(buffer2, 0, bufferSize);
if (count1 != count2)
return false;
if (count1 == 0)
return true;
// You might replace the following with an efficient "memcmp"
if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
return false;
}
}
I sped up the "memcmp" by using an Int64 compare in a loop over the read stream chunks. This reduced the time to about 1/4.
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
const int bufferSize = 2048 * 2;
var buffer1 = new byte[bufferSize];
var buffer2 = new byte[bufferSize];
while (true)
{
int count1 = stream1.Read(buffer1, 0, bufferSize);
int count2 = stream2.Read(buffer2, 0, bufferSize);
if (count1 != count2)
{
return false;
}
if (count1 == 0)
{
return true;
}
int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
for (int i = 0; i < iterations; i++)
{
if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
{
return false;
}
}
}
}
This is how I would do it if you didn't want to rely on crc:
/// <summary>
/// Binary comparison of two files
/// </summary>
/// <param name="fileName1">the file to compare</param>
/// <param name="fileName2">the other file to compare</param>
/// <returns>a value indicating whether the files are identical</returns>
public static bool CompareFiles(string fileName1, string fileName2)
{
FileInfo info1 = new FileInfo(fileName1);
FileInfo info2 = new FileInfo(fileName2);
bool same = info1.Length == info2.Length;
if (same)
{
using (FileStream fs1 = info1.OpenRead())
using (FileStream fs2 = info2.OpenRead())
using (BufferedStream bs1 = new BufferedStream(fs1))
using (BufferedStream bs2 = new BufferedStream(fs2))
{
for (long i = 0; i < info1.Length; i++)
{
if (bs1.ReadByte() != bs2.ReadByte())
{
same = false;
break;
}
}
}
}
return same;
}
The accepted answer had an error that was pointed out, but never corrected: stream read calls are not guaranteed to return all bytes requested.
BinaryReader ReadBytes calls are guaranteed to return as many bytes as requested unless the end of the stream is reached first.
The following code takes advantage of BinaryReader to do the comparison:
static private bool FileEquals(string file1, string file2)
{
using (FileStream s1 = new FileStream(file1, FileMode.Open, FileAccess.Read, FileShare.Read))
using (FileStream s2 = new FileStream(file2, FileMode.Open, FileAccess.Read, FileShare.Read))
using (BinaryReader b1 = new BinaryReader(s1))
using (BinaryReader b2 = new BinaryReader(s2))
{
while (true)
{
byte[] data1 = b1.ReadBytes(64 * 1024);
byte[] data2 = b2.ReadBytes(64 * 1024);
if (data1.Length != data2.Length)
return false;
if (data1.Length == 0)
return true;
if (!data1.SequenceEqual(data2))
return false;
}
}
}
If you change that CRC to a SHA1 signature, the chances of the files being different but having the same signature are astronomically small.
You can check the length and dates of the two files even before checking the CRC to possibly avoid the CRC check.
But if you have to compare the entire file contents, one neat trick I've seen is reading the bytes in strides equal to the bitness of the CPU. For example, on a 32-bit PC, read 4 bytes at a time and compare them as Int32s. On a 64-bit PC you can read 8 bytes at a time. This is roughly 4 or 8 times as fast as doing it byte by byte. You would also probably want to use an unsafe code block so that you could use pointers instead of doing a bunch of bit shifting and OR'ing to get the bytes into the native int sizes.
You can use IntPtr.Size to determine the ideal size for the current processor architecture.
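As a rough sketch of that idea (mine, not from the answer): compare two already-filled buffers in strides of IntPtr.Size using pointers, with a byte-by-byte tail. Note that on modern .NET, Span<byte>.SequenceEqual already performs a vectorized comparison, so this mainly matters on older frameworks.
using System;
static unsafe bool BuffersEqualWordwise(byte[] a, byte[] b, int count)
{
    fixed (byte* pa = a)
    fixed (byte* pb = b)
    {
        int wordSize = IntPtr.Size; // 4 bytes in a 32-bit process, 8 in a 64-bit process
        int words = count / wordSize;
        if (wordSize == 8)
        {
            long* qa = (long*)pa, qb = (long*)pb;
            for (int i = 0; i < words; i++)
                if (qa[i] != qb[i]) return false;
        }
        else
        {
            int* qa = (int*)pa, qb = (int*)pb;
            for (int i = 0; i < words; i++)
                if (qa[i] != qb[i]) return false;
        }
        // Remaining tail (fewer than wordSize bytes) compared byte by byte.
        for (int i = words * wordSize; i < count; i++)
            if (pa[i] != pb[i]) return false;
        return true;
    }
}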
