Seek through FileStream then using StreamReader to read from there - c#

So I want to be able to seek to a point in a fileStream, then read forward using a StreamReader. Then seek forward again, and use the StreamReader to read another chunk of data.
const int BufferSize = 4096;
var buffer = new char[BufferSize];
var endpoints = new List<long>();
using (var fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
{
var fileLength = fileStream.Length;
var seekPositionCount = fileLength / concurrentReads;
long currentOffset = 0;
for (var i = 0; i < concurrentReads; i++)
{
var seekPosition = seekPositionCount + currentOffset;
// seek the file forward
fileStream.Seek(seekPosition, SeekOrigin.Current);
// setting true at the end is very important, keeps the underlying fileStream open.
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize, true))
{
// this also seeks the file forward the amount in the buffer...
int bytesRead;
var totalBytesRead = 0;
while ((bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
totalBytesRead += bytesRead;
var found = false;
var gotR = false;
for (var j = 0; j < buffer.Length; j++)
{
if (buffer[j] == '\r')
{
gotR = true;
continue;
}
if (buffer[j] == '\n' && gotR)
{
// so we add the total bytes read, minus the current buffer amount read, then add how far into the buffer we actually read.
seekPosition += totalBytesRead - BufferSize + j;
endpoints.Add(seekPosition);
found = true;
break;
}
}
if (found) break;
}
}
// we need to seek to the position we got to in the StreamReader (but not going by how much was read).
fileStream.Seek(seekPosition, SeekOrigin.Current);
currentOffset += seekPosition;
}
}
return endpoints;
However, the loop adds only two entries to endpoints and then exits.
(bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0
I thought the arguments passed to ReadAsync relate solely to the buffer, so the index argument says where in the buffer to start filling.
I can't make out from Reference Source how this value is used.
I assumed (and can't find evidence to back this up) that when you open a StreamReader it uses the underlying Stream as its guide, so when you ask it to read some bytes it starts from the position the underlying Stream is at.
But my results don't show that. They seem to show that the StreamReader starts at the beginning of the Stream each time, although I can't find evidence that this is how it works either.
Seeking
Is my understanding of seeking correct? Suppose I call Seek like this:
fileStream.Seek(seekPosition, SeekOrigin.Current);
If the file is at position 300 and I want to seek to 600, should the seekPosition variable above be 600?
Reference Source would suggest otherwise:
else if (origin == SeekOrigin.Current) {
// Don't call FlushRead here, which would have caused an infinite
// loop. Simply adjust the seek origin. This isn't necessary
// if we're seeking relative to the beginning or end of the stream.
offset -= (_readLen - _readPos);
}

So thanks to Hans Passant, I have got the answer:
var buffer = new char[BufferSize];
var endpoints = new List<long>();
using (var fileStream = this.CreateMultipleReadAccessFileStream(fileName))
{
var fileLength = fileStream.Length;
var seekPositionCount = fileLength / concurrentReads;
long currentOffset = 0;
for (var i = 0; i < concurrentReads; i++)
{
var seekPosition = seekPositionCount + currentOffset;
// seek the file forward
// fileStream.Seek(seekPosition, SeekOrigin.Current);
// setting true at the end is very important, keeps the underlying fileStream open.
using (var streamReader = this.CreateTemporaryStreamReader(fileStream))
{
// this is poor on performance, hence why you split the file here and read in new threads.
streamReader.DiscardBufferedData();
// you have to advance the fileStream here, because of the previous line
streamReader.BaseStream.Seek(seekPosition, SeekOrigin.Begin);
// this also seeks the file forward the amount in the buffer...
int bytesRead;
var totalBytesRead = 0;
while ((bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
totalBytesRead += bytesRead;
var found = false;
var gotR = false;
for (var j = 0; j < bytesRead; j++) // scan only the chars filled by this read, not stale data from an earlier one
{
if (buffer[j] == '\r')
{
gotR = true;
continue;
}
if (buffer[j] == '\n' && gotR)
{
// add the chars consumed before this buffer, plus how far into this buffer the line break was found.
// note: these are char counts, which equal byte counts only for single-byte (ASCII) content.
seekPosition += totalBytesRead - bytesRead + j;
endpoints.Add(seekPosition);
found = true;
break;
}
// if we have found new line then move the position to
}
if (found) break;
}
}
currentOffset = seekPosition;
}
}
return endpoints;
Note the new part. Rather than calling this twice:
fileStream.Seek(seekPosition, SeekOrigin.Current);
I now use SeekOrigin.Begin and seek on the StreamReader's underlying base stream:
// this is poor on performance, hence why you split the file here and read in new threads.
streamReader.DiscardBufferedData();
// you have to advance the fileStream here, because of the previous line
streamReader.BaseStream.Seek(seekPosition, SeekOrigin.Begin);
Calling DiscardBufferedData makes the StreamReader drop whatever it had buffered, so the next read always starts from the underlying stream's current position.
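The seek-then-discard pattern can be isolated into a small sketch. The class and method names here (SeekDemo, ReadLinesAt) are made up for illustration and are not part of the code above. The sketch assumes each offset falls on a character boundary and that the text is UTF-8:

```csharp
using System;
using System.IO;
using System.Text;

public static class SeekDemo
{
    // Hypothetical helper: for each byte offset, reads from that offset
    // up to the next line break, reusing a single StreamReader.
    public static string[] ReadLinesAt(Stream stream, params long[] byteOffsets)
    {
        var results = new string[byteOffsets.Length];
        // leaveOpen: true keeps the underlying stream usable after the reader is disposed.
        using (var reader = new StreamReader(stream, Encoding.UTF8, false, 4096, true))
        {
            for (var i = 0; i < byteOffsets.Length; i++)
            {
                // Absolute seek, so the result does not depend on how far
                // the reader's internal buffer has already advanced the stream.
                stream.Seek(byteOffsets[i], SeekOrigin.Begin);
                // Drop data buffered for the old position.
                reader.DiscardBufferedData();
                results[i] = reader.ReadLine();
            }
        }
        return results;
    }
}
```

Without the DiscardBufferedData call, the second and later reads would be served from the reader's stale buffer rather than from the new stream position.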

Related

C# FileStream.Read doesn't read last block

I read a binary file and convert it to hex block by block.
The output differs depending on whether I use FileStream.Read or File.ReadAllBytes.
FileStream.Read
int limit = 0;
if (openFileDlg.FileName.Length > 0)
{
fileName = openFileDlg.FileName;
FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read);
fsLen = (int)fs.Length;
int count = 0;
limit = 100;
byte[] read_buff = new byte[limit];
StringBuilder sb = new StringBuilder();
while ( (count = fs.Read(read_buff, 0, limit)) > 0)
{
foreach (byte b in read_buff)
{
sb.Append(Convert.ToString(b, 16).PadLeft(2, '0'));
}
}
rtxb_bin.AppendText(sb.ToString() + "\n");
}
File.ReadAllBytes
if (openFileDlg.FileName.Length > 0)
{
fileName = openFileDlg.FileName;
byte[] fileBytes = File.ReadAllBytes(fileName);
StringBuilder sb2 = new StringBuilder();
foreach (byte b2 in fileBytes)
{
sb2.Append(Convert.ToString(b2, 16).PadLeft(2, '0'));
}
rtxb_allbin.AppendText(sb2.ToString());
}
In case 1, the result is ...
........04c0020f00452a00421346108129844f2138448500208020250405250043188510812e0
and case 2 is
.......04c0020f00452a00421346108129844f2138448500208020250405250043188510812e044f212cc48120c24125404f2069c2c0008bff35f8f401efbd17047
FileStream.Read doesn't read anything after '12e0'.
'44f212cc48120c24125404f2069c2c0008bff35f8f401efbd17047' is missing.
How can I read all bytes using FileStream.Read?
Why doesn't FileStream.Read read the last block?
Most likely it only appears that the last block is not read. Suppose your file is 102 bytes long. The first iteration of your loop reads the first 100 bytes, and all is fine. But what happens on the second (last) iteration? You read two bytes into read_buff, which has length 100. The buffer now contains 2 bytes of the last block followed by 98 bytes of the previous (first) block, because Read doesn't clear the buffer. Then you proceed with:
foreach (byte b in read_buff)
{
sb.Append(Convert.ToString(b, 16).PadLeft(2, '0'));
}
As a result, sb has the 100 bytes of the first block, the 2 bytes of the last block, and then 98 bytes of the first block again. If you don't look too closely, it might appear that the last block was skipped, while in reality part of the previous one was duplicated.
To fix this, use count (the number of bytes actually read into the buffer) to work only with the valid part of read_buff:
for (int i = 0; i < count; i++) {
sb.Append(Convert.ToString(read_buff[i], 16).PadLeft(2, '0'));
}
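Put together, the corrected loop might look like the sketch below. HexDump and ToHex are illustrative names, not taken from the question, and the block size is a parameter only so that short final reads are easy to try out:

```csharp
using System.IO;
using System.Text;

public static class HexDump
{
    // Hypothetical helper: converts everything readable from input to lowercase hex.
    public static string ToHex(Stream input, int blockSize = 100)
    {
        var buffer = new byte[blockSize];
        var sb = new StringBuilder();
        int count;
        while ((count = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Only the first 'count' bytes are valid; the rest of the buffer
            // may still hold data from a previous iteration.
            for (int i = 0; i < count; i++)
            {
                sb.Append(buffer[i].ToString("x2"));
            }
        }
        return sb.ToString();
    }
}
```

It could be called as HexDump.ToHex(fs) on the FileStream from the question.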
You need to update offset and count.
Syntax
public override int Read(
byte[] array,
int offset,
int count
)
Example
public static byte[] ReadFile(string filePath)
{
byte[] buffer;
FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
try
{
int length = (int)fileStream.Length; // get file length
buffer = new byte[length]; // create buffer
int count; // actual number of bytes read
int sum = 0; // total number of bytes read
// read until Read method returns 0 (end of the stream has been reached)
while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
sum += count; // sum is a buffer offset for next reading
}
finally
{
fileStream.Close();
}
return buffer;
}
Reference
private const int megabyte = 1024 * 1024; // chunk size; not defined in the original snippet, assumed here
public static void ReadAndProcessLargeFile(string theFilename, long whereToStartReading = 0)
{
using (FileStream fileStream = new FileStream(theFilename, FileMode.Open, FileAccess.Read))
{
byte[] buffer = new byte[megabyte];
fileStream.Seek(whereToStartReading, SeekOrigin.Begin);
int bytesRead;
while ((bytesRead = fileStream.Read(buffer, 0, megabyte)) > 0)
{
// Passing bytesRead means only the valid part of the buffer is processed,
// so the buffer can be reused safely, including for the short final read.
ProcessChunk(buffer, bytesRead);
}
}
}
private static void ProcessChunk(byte[] buffer, int bytesRead)
{
// Do the processing here
string utfString = Encoding.UTF8.GetString(buffer, 0, bytesRead);
Console.Write(utfString);
}

FileStream Seek fails on large files at second call

I'm working with large files, starting at 10 GB. I load parts of the file into memory for processing. The following code works fine for smaller files (700 MB):
byte[] byteArr = new byte[layerPixelCount];
using (FileStream fs = File.OpenRead(recFileName))
{
using (BinaryReader br = new BinaryReader(fs))
{
fs.Seek(offset, SeekOrigin.Begin);
for (int i = 0; i < byteArr.Length; i++)
{
byteArr[i] = (byte)(br.ReadUInt16() / 256);
}
}
}
After opening a 10Gb file, the first run of this function is OK. But the second Seek() throws an IO exception:
An attempt was made to move the file pointer before the beginning of the file.
The numbers are:
fs.Length = 11998628352
offset = 4252580352
byteArr.Length = 7746048
I assumed that GC didn't collect the closed fs reference before the second call and tried
GC.Collect();
GC.WaitForPendingFinalizers();
but no luck.
Any help is appreciated.
I'm guessing it's because either your signed integer indexer or your offset is rolling over to a negative value. Try declaring offset and i as long.
// offset is now long
long offset = 4252580352;
byte[] byteArr = new byte[layerPixelCount];
using (FileStream fs = File.OpenRead(recFileName))
{
using (BinaryReader br = new BinaryReader(fs))
{
fs.Seek(offset, SeekOrigin.Begin);
for (long i = 0; i < byteArr.Length; i++)
{
byteArr[i] = (byte)(br.ReadUInt16() / 256);
}
}
}
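To see why an int offset would produce the "before the beginning of the file" error, here is a small sketch of what happens when the offset from the question is squeezed into 32 bits. OffsetOverflow and TruncateToInt are names invented for this illustration, and whether the original code truncated the value in exactly this way is an assumption:

```csharp
using System;

public static class OffsetOverflow
{
    // Hypothetical helper: keeps only the low 32 bits of the offset,
    // which is what an unchecked conversion from long to int does.
    public static int TruncateToInt(long offset)
    {
        return unchecked((int)offset);
    }
}
```

TruncateToInt(4252580352) yields -42386944, because 4252580352 is larger than int.MaxValue (2147483647) and wraps around. Passing a negative value to Seek with SeekOrigin.Begin asks for a position before the start of the file.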
The code below works with large files beyond 4 GB. The key point is that Seek takes a long offset: an int cannot represent offsets beyond 2^31 - 1 bytes (about 2 GB), while a long can. In this example the file is first processed in whole chunks of about 1 GB (1,024,000,000 bytes); once those are done, the leftover bytes (less than one chunk) are processed. I use this code to calculate the CRC of files larger than 4 GB (using https://crc32c.machinezoo.com/ for the CRC-32C calculation in this example).
private uint Crc32CAlgorithmBigCrc(string fileName)
{
uint hash = 0;
byte[] buffer = null;
FileInfo fileInfo = new FileInfo(fileName);
long fileLength = fileInfo.Length;
int blockSize = 1024000000;
decimal div = fileLength / blockSize;
int blocks = (int)Math.Floor(div);
int restBytes = (int)(fileLength - ((long)blocks * blockSize)); // multiply as long; int * int overflows past about 2 GB
long offsetFile = 0;
uint interHash = 0;
Crc32CAlgorithm Crc32CAlgorithm = new Crc32CAlgorithm();
bool firstBlock = true;
using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
buffer = new byte[blockSize];
using (BinaryReader br = new BinaryReader(fs))
{
while (blocks > 0)
{
blocks -= 1;
fs.Seek(offsetFile, SeekOrigin.Begin);
buffer = br.ReadBytes(blockSize);
if (firstBlock)
{
firstBlock = false;
interHash = Crc32CAlgorithm.Compute(buffer);
hash = interHash;
}
else
{
// continue from the running hash, not from the first block's hash
hash = Crc32CAlgorithm.Append(hash, buffer);
}
offsetFile += blockSize;
}
if (restBytes > 0)
{
Array.Resize(ref buffer, restBytes);
fs.Seek(offsetFile, SeekOrigin.Begin);
buffer = br.ReadBytes(restBytes);
// continue from the running hash here as well
hash = Crc32CAlgorithm.Append(hash, buffer);
}
buffer = null;
}
}
//MessageBox.Show(hash.ToString());
//MessageBox.Show(hash.ToString("X"));
return hash;
}

Remove last x lines from a streamreader

I need to read in all but the last x lines from a file to a streamreader in C#. What is the best way to do this?
Many Thanks!
If it's a large file, is it possible to just seek to the end of the file, and examine the bytes in reverse for the '\n' character? I am aware that \n and \r\n exists. I whipped up the following code and tested on a fairly trivial file. Can you try testing this on the files that you have? I know my solution looks long, but I think you'll find that it's faster than reading from the beginning and rewriting the whole file.
public static void Truncate(string file, int linesToTruncate)
{
using (FileStream fs = File.Open(file, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None))
{
fs.Position = fs.Length;
// \n \r\n (both uses \n for lines)
const int BUFFER_SIZE = 2048;
// Start at the end until # lines have been encountered, record the position, then truncate the file
long currentPosition = fs.Position;
int linesProcessed = 0;
byte[] buffer = new byte[BUFFER_SIZE];
while (linesProcessed < linesToTruncate && currentPosition > 0)
{
int bytesRead = FillBuffer(buffer, fs);
// We now have a buffer containing the later contents of the file
for (int i = bytesRead - 1; i >= 0; i--)
{
currentPosition--;
if (buffer[i] == '\n')
{
linesProcessed++;
if (linesProcessed == linesToTruncate)
break;
}
}
}
// Truncate the file
fs.SetLength(currentPosition);
}
}
private static int FillBuffer(byte[] buffer, FileStream fs)
{
if (fs.Position == 0)
return 0;
int bytesRead = 0;
int currentByteOffset = 0;
// Calculate how many bytes of the buffer can be filled (remember that we're going in reverse)
long expectedBytesToRead = (fs.Position < buffer.Length) ? fs.Position : buffer.Length;
fs.Position -= expectedBytesToRead;
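while (bytesRead < expectedBytesToRead)
{
// bytesRead doubles as the buffer offset; never request more than the expected region
int n = fs.Read(buffer, bytesRead, (int)expectedBytesToRead - bytesRead);
if (n == 0) break; // unexpected end of file
bytesRead += n;
}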
// We have to reset the position again because we moved the reader forward;
fs.Position -= bytesRead;
return bytesRead;
}
Since you are only planning on deleting the end of the file, it seems wasteful to rewrite everything, especially if it's a large file and small N. Of course, one can make the argument that if someone wanted to eliminate all lines, then going from the beginning to the end is more efficient.
Since you are referring to lines in a file, I'm assuming it's a text file. If you just want to get the lines you can read them into an array of strings like so:
string[] lines = File.ReadAllLines(@"C:\test.txt");
Or if you really need to work with StreamReaders:
using (StreamReader reader = new StreamReader(@"C:\test.txt"))
{
while (!reader.EndOfStream)
{
Console.WriteLine(reader.ReadLine());
}
}
You don't really read INTO a StreamReader. In fact, for the pattern you're asking for you don't need the StreamReader at all. System.IO.File has the useful static method 'ReadLines' that you can leverage instead:
IEnumerable<string> allBut = File.ReadLines(path).Reverse().Skip(5).Reverse();
The previous (flawed) version, restored in response to the comment thread:
List<string> allLines = File.ReadLines(path).ToList();
IEnumerable<string> allBut = allLines.Take(allLines.Count - 5);
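As an alternative that avoids both the double Reverse and loading every line into a list, a small queue can hold back the last few lines while streaming. This is only a sketch; LineHelpers and AllButLast are names made up for this example:

```csharp
using System.Collections.Generic;

public static class LineHelpers
{
    // Hypothetical helper: lazily yields every item except the final 'count' items.
    public static IEnumerable<string> AllButLast(IEnumerable<string> lines, int count)
    {
        var pending = new Queue<string>();
        foreach (var line in lines)
        {
            pending.Enqueue(line);
            // Once more than 'count' lines are held back, the oldest one
            // is known not to be among the last 'count'.
            if (pending.Count > count)
            {
                yield return pending.Dequeue();
            }
        }
        // Whatever remains in the queue is the tail being skipped.
    }
}
```

Usage would be along the lines of LineHelpers.AllButLast(File.ReadLines(path), 5), which keeps at most six lines in memory at a time.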

Setting the offset in a stream

The documentation at msdn.microsoft.com/en-us/library/system.io.stream.read.aspx says that Stream.Read and Stream.Write both advance the position in the stream automatically. So why do the examples at http://msdn.microsoft.com/en-us/library/system.io.stream.read.aspx and http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx change the offset manually?
Do you only set the offset in a loop when you know the size of the stream, and set it to 0 when you don't know the size and are using a buffer?
// Now read s into a byte buffer.
byte[] bytes = new byte[s.Length];
int numBytesToRead = (int) s.Length;
int numBytesRead = 0;
while (numBytesToRead > 0)
{
// Read may return anything from 0 to 10.
int n = s.Read(bytes, numBytesRead, 10);
// The end of the file is reached.
if (n == 0)
{
break;
}
numBytesRead += n;
numBytesToRead -= n;
}
and
using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
{
const int size = 4096;
byte[] buffer = new byte[size];
using (MemoryStream memory = new MemoryStream())
{
int count = 0;
do
{
count = stream.Read(buffer, 0, size);
if (count > 0)
{
memory.Write(buffer, 0, count);
}
}
while (count > 0);
return memory.ToArray();
}
}
The offset is actually the offset of the buffer, not the stream. Streams are advanced automatically as they are read.
Edit (to the edited question):
I don't see a stream offset being set in any of the code snippets you pasted into the question.
I think you are mistaking the calculation of bytes to read vs. bytes received. This protocol may seem funny (why would you receive fewer bytes than requested?) but it makes sense when you consider that you might be reading from a high-latency packet oriented source (think: network sockets).
You might be receiving 6 characters in one burst (from a TCP packet) and only receive the remaining 4 characters in your next read (when the next packet has arrived).
Edit In response to your linked example from the comment:
using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
{
// ... snip
count = stream.Read(buffer, 0, size);
if (count > 0)
{
memory.Write(buffer, 0, count);
}
It appears that the coders use prior knowledge about the underlying stream implementation, that stream.Read will always return 0 OR the size requested. That seems like a risky bet, to me. But if the docs for GZipStream do state that, it could be alright. However, since the MSDN samples use a generic Stream variable, it is (way) more correct to check the exact number of bytes read.
The first linked example uses a MemoryStream in both Write and Read fashion. The position is reset in between, so the data that was written first will be read:
Stream s = new MemoryStream();
for (int i = 0; i < 100; i++)
{
s.WriteByte((byte)i);
}
s.Position = 0;
The second linked example does not set the stream position. You would typically see a call to Seek if it did. You may be confusing the offsets into the data buffer with the stream position.
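The distinction between the buffer offset and the stream position can be sketched as a small helper. StreamHelpers and ReadFully are illustrative names, not part of any of the linked samples:

```csharp
using System.IO;

public static class StreamHelpers
{
    // Hypothetical helper: keeps reading until 'count' bytes are in the buffer
    // or the stream ends. Returns the number of bytes actually read.
    public static int ReadFully(Stream stream, byte[] buffer, int count)
    {
        int total = 0;
        while (total < count)
        {
            // 'total' is an offset into the buffer, not into the stream.
            // The stream advances its own position on every Read.
            int n = stream.Read(buffer, total, count - total);
            if (n == 0)
            {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }
}
```

Each Read may return fewer bytes than requested, so the buffer offset moves forward by whatever was received, while no Seek call is needed on the stream itself.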

Problem in splitting a file

int bufferlength = 12488;
int pointer = 1;
int offset = 0;
int length = 0;
FileStream fstwrite = new FileStream("D:\\Movie.wmv", FileMode.Create);
while (pointer != 0)
{
byte[] buff = new byte[bufferlength];
FileStream fst = new FileStream("E:\\Movie.wmv", FileMode.Open);
pointer = fst.Read(buff, 0, bufferlength);
fst.Close();
fstwrite.Write(buff, offset , pointer);
offset += pointer;
}
I used the above code to split a file and place it on another drive. I'm not able to set the correct offset and length for this routine. Can anyone help me fix this?
By splitting I mean that I split the file into chunks of "x" KB, pass them somewhere, and rebuild the same file in another location.
I found it at last. Thanks to everyone who gave their valuable responses.
Currently you're always reading from the start of the file... and even if you weren't you'd just be copying the whole file.
Here's some code which will actually split a single file into multiple files:
public static void SplitFile(string inputFile,
string outputPrefix,
int chunkSize)
{
byte[] buffer = new byte[chunkSize];
using (Stream input = File.OpenRead(inputFile))
{
int index = 0;
while (input.Position < input.Length)
{
using (Stream output = File.Create(outputPrefix + index))
{
int chunkBytesRead = 0;
while (chunkBytesRead < chunkSize)
{
int bytesRead = input.Read(buffer,
chunkBytesRead,
chunkSize - chunkBytesRead);
// End of input
if (bytesRead == 0)
{
break;
}
chunkBytesRead += bytesRead;
}
output.Write(buffer, 0, chunkBytesRead);
}
index++;
}
}
}
You're reading bufferlength bytes at a time. Shouldn't you set the offset like this, then?
offset += bufferlength;
Don't open your source file inside the loop, or you'll always read the first chunk.
Open it before the loop, then make sure your offset is applied to the read.
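A minimal sketch of that advice: both streams are opened once by the caller and the copy proceeds chunk by chunk. ChunkCopy and Copy are names invented for this example, and it works on Stream so that it applies equally to FileStream:

```csharp
using System.IO;

public static class ChunkCopy
{
    // Hypothetical helper: copies input to output in chunks of chunkSize bytes
    // and returns the total number of bytes copied.
    public static long Copy(Stream input, Stream output, int chunkSize)
    {
        var buffer = new byte[chunkSize];
        long total = 0;
        int read;
        // The input stream remembers its position, so each Read continues
        // where the previous one stopped.
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            // The offset argument refers to the buffer, so it stays 0;
            // only 'read' bytes are valid.
            output.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

With the paths from the question this could be called as ChunkCopy.Copy(File.OpenRead("E:\\Movie.wmv"), File.Create("D:\\Movie.wmv"), 12488), ideally with both streams wrapped in using blocks.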
