loading/streaming a file into a buffer/buffers - c#

I have been trying for a couple of days now to load a file in chunks, so that the user can work with very large (multi-GB) files without slowing the program down. Currently I have the following code:
using (FileStream filereader = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
using (StreamReader reader = new StreamReader(filereader))
{
while (toRead > 0 && (bytesread = reader.Read(buffer, offset, toRead)) > 0)
{
toRead -= bytesread;
offset += bytesread;
}
if (toRead > 0) throw new EndOfStreamException();
foreach (var item in buffer)
{
temporary += item.ToString();
}
// Replace returns a new string, so the result has to be assigned back
temporary = temporary.Replace("\n", "\n" + System.Environment.NewLine);
}
}
Below are the declarations to avoid any confusion (hopefully):
const int Max_Buffer = 5000;
char[] buffer = new char[Max_Buffer];
int bytesread;
int toRead = 5000;
int offset = 0;
At the moment the program reads 5000 characters of the text file into the buffer, then converts them into a string, which I pass to a StringReader so I can extract the information I want.
My problem is that the buffer can end halfway through a line, so when I parse the data with the StringReader I get index/length errors.
What I need to know is how to seek back in the array to find the set of characters that marks the start of a line, and then return only the data before that point for conversion to a string.
Once the seek-back problem is solved, the next issue is how to keep the data I didn't process and bring in more data to fill the rest of the buffer.
I hope this is explained well; I know I can sometimes be confusing. I hope someone can help.

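I would suggest using reader.ReadLine() instead of reader.Read() in your loop, so that you never end up with a partial line:
string line = reader.ReadLine(); // returns null at the end of the file
bytesread = line.Length * 2; // each char takes 2 bytes in memory (UTF-16); the size on disk depends on the file's encoding
You can then check whether (toRead - bytesread) < 0 to decide when the current chunk is full.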
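If you would rather keep the fixed-size buffer from the question, the "seek back and keep the rest" idea could look roughly like the sketch below. It is only one way to do it: ReadLineAlignedChunks is a made-up name, and it assumes lines end in '\n' and that no single line is longer than the buffer. Each chunk is cut at the last newline, and the unfinished tail is moved to the front of the buffer so the next read completes it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Small demonstration on an in-memory reader; a StreamReader works the same way.
foreach (string chunk in ReadLineAlignedChunks(new StringReader("one\ntwo\nthree"), 8))
    Console.Write(chunk);

// Hypothetical helper: yields strings that always end on a line boundary
// (except possibly the very last one, if the input has no trailing newline).
static IEnumerable<string> ReadLineAlignedChunks(TextReader reader, int bufferSize = 5000)
{
    char[] buffer = new char[bufferSize];
    int carried = 0; // chars of an unfinished line kept from the previous read
    int read;
    while ((read = reader.Read(buffer, carried, buffer.Length - carried)) > 0)
    {
        int valid = carried + read;
        // Search backwards through the valid part of the buffer for the last newline
        int lastNewline = Array.LastIndexOf(buffer, '\n', valid - 1, valid);
        if (lastNewline < 0)
        {
            if (valid == buffer.Length)
                throw new InvalidDataException("A line is longer than the buffer.");
            carried = valid; // no complete line yet, keep everything and read more
            continue;
        }
        yield return new string(buffer, 0, lastNewline + 1);
        // Move the unfinished tail to the front so the next read appends to it
        carried = valid - (lastNewline + 1);
        Array.Copy(buffer, lastNewline + 1, buffer, 0, carried);
    }
    if (carried > 0)
        yield return new string(buffer, 0, carried); // final line without a newline
}
```

Each returned string can then go into a StringReader without ever ending in the middle of a line.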

Related

Read a large binary file(5GB) into a byte array in C#?

I have a recording file (a binary file) of more than 5 GB. I have to read that file and filter out the data that needs to be sent to a server.
The problem is that a byte[] array only supports up to 2 GB of data, so I would appreciate help from anyone who has already dealt with this situation.
using (FileStream str = File.OpenRead(textBox2.Text))
{
int itemSectionStart = 0x00000000;
BinaryReader breader = new BinaryReader(str);
breader.BaseStream.Position = itemSectionStart;
int length = (int)breader.BaseStream.Length;
byte[] itemSection = breader.ReadBytes(length ); //first frame data
}
Issues:
1: The length exceeds the range of an int.
2: I tried using long and uint, but byte[] only accepts an int length.
Edit:
Another approach I want to try is to read the data one frame buffer at a time. Suppose my frame buffer size is 24000: the byte array stores that much frame data, I process it, then I flush the byte array and read the next 24000, and so on until the end of the binary file.
You cannot read a file that big in one go, so you have to either split the file into smaller portions and process each of them,
OR
read the file through a buffer and, once you are done with the data in the buffer, reuse the buffer for the next chunk.
I faced the same issue, so I tried the buffer-based approach and it worked for me.
FileStream inputTempFile = new FileStream(Path, FileMode.Open, FileAccess.Read);
int Buffer_value = 1024;
int bytesRead;
byte[] Array_buffer = new byte[Buffer_value];
while ((bytesRead = inputTempFile.Read(Array_buffer, 0, Buffer_value)) > 0)
{
// Only the first bytesRead bytes of the buffer are valid for this chunk
for (int z = 0; z + 4 <= bytesRead; z = z + 4)
{
string temp_id = BitConverter.ToString(Array_buffer, z, 4);
string[] temp_strArrayID = temp_id.Split(new char[] { '-' });
string temp_ArraydataID = temp_strArrayID[0] + temp_strArrayID[1] + temp_strArrayID[2] + temp_strArrayID[3];
}
}
inputTempFile.Close();
This way you can process your data chunk by chunk.
In my case I was trying to store the data read from the buffer in a List. That works fine up to about 2 GB of data; after that it throws an out-of-memory exception.
The approach I followed instead was to read the data from the buffer, apply the needed filters, write the filtered data to a text file, and then process that file.
//text file approach
FileStream inputTempFile = new FileStream(Path, FileMode.Open, FileAccess.Read);
int Buffer_value = 1024;
int bytesRead;
// OutputPath is a placeholder for the results file; it must differ from Path, which is still open for reading
StreamWriter writer = new StreamWriter(OutputPath, true);
byte[] Array_buffer = new byte[Buffer_value];
while ((bytesRead = inputTempFile.Read(Array_buffer, 0, Buffer_value)) > 0)
{
// Only the first bytesRead bytes of the buffer are valid for this chunk
for (int z = 0; z + 4 <= bytesRead; z = z + 4)
{
string temp_id = BitConverter.ToString(Array_buffer, z, 4);
string[] temp_strArrayID = temp_id.Split(new char[] { '-' });
string temp_ArraydataID = temp_strArrayID[0] + temp_strArrayID[1] + temp_strArrayID[2] + temp_strArrayID[3];
if(temp_ArraydataID =="XYZ Condition")
{
writer.WriteLine(temp_ArraydataID);
}
}
}
writer.Close();
inputTempFile.Close();
As said in comments, I think you have to read your file with a stream. Here is how you can do this:
int nbRead;
const int step = 10000;
byte[] buffer = new byte[step];
// Process each chunk as it arrives instead of collecting chunks in a list:
// the buffer is reused, so keeping references to it would not preserve earlier data.
while ((nbRead = breader.Read(buffer, 0, step)) > 0)
{
for (int i = 0; i < nbRead; i++)
{
byte oneByte = buffer[i];
// Here you can examine this chunk byte by byte
}
}
If I understand your needs correctly, you are looking for a specific pattern in your file?
I think you can do it by looking for the start of your pattern byte by byte. Once you find it, you can start reading the important bytes. If the whole important data is greater than 2GB, as said in the comments, you will have to send it to your server in several parts.
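One thing to watch when searching chunk by chunk is a pattern that straddles two chunks. A minimal sketch of one way to handle that is below; IndexOfPattern and its parameters are made up for illustration, and the comparison is a plain byte-by-byte scan rather than anything clever. It keeps the last (pattern length - 1) bytes of every chunk in front of the next one and reports the position as a long, so files over 2 GB are not a problem.

```csharp
using System;
using System.IO;

// Small demonstration on an in-memory stream; a FileStream works the same way.
var demo = new MemoryStream(new byte[] { 10, 20, 30, 40, 50, 60 });
Console.WriteLine(IndexOfPattern(demo, new byte[] { 40, 50 }, 4));

// Hypothetical helper: returns the offset of the first occurrence of pattern,
// relative to where the stream was positioned when called, or -1 if not found.
static long IndexOfPattern(Stream stream, byte[] pattern, int chunkSize = 64 * 1024)
{
    if (pattern.Length == 0) throw new ArgumentException("Pattern must not be empty.");
    byte[] buffer = new byte[chunkSize + pattern.Length - 1];
    int carried = 0;      // bytes kept from the end of the previous chunk
    long bufferStart = 0; // stream offset that corresponds to buffer[0]
    int read;
    while ((read = stream.Read(buffer, carried, chunkSize)) > 0)
    {
        int valid = carried + read;
        for (int i = 0; i + pattern.Length <= valid; i++)
        {
            int j = 0;
            while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return bufferStart + i;
        }
        // Keep the tail that could still be the start of a match
        int keep = Math.Min(pattern.Length - 1, valid);
        Array.Copy(buffer, valid - keep, buffer, 0, keep);
        bufferStart += valid - keep;
        carried = keep;
    }
    return -1;
}
```

Once the offset is known, you can set the stream position to it and read the important bytes from there in chunks.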

Get Estimate of Line Count in a text file

I would like to get an estimate of the number of lines in a csv/text file so that I can use that number for a progress bar. The file could be extremely large so getting the exact number of lines will take too long for this purpose.
What I have come up with is below (read in a portion of the file and count the number of lines and use the file size to estimate the total number of lines):
public static int GetLineCountEstimate(string file)
{
double count = 0;
using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
long byteCount = fs.Length;
int maxByteCount = 524288;
if (byteCount > maxByteCount)
{
var buf = new byte[maxByteCount];
fs.Read(buf, 0, maxByteCount);
string s = System.Text.Encoding.UTF8.GetString(buf, 0, buf.Length);
count = s.Split('\n').Length * byteCount / maxByteCount;
}
else
{
var buf = new byte[byteCount];
fs.Read(buf, 0, (int)byteCount);
string s = System.Text.Encoding.UTF8.GetString(buf, 0, buf.Length);
count = s.Split('\n').Length;
}
}
return Convert.ToInt32(count);
}
This seems to work ok, but I have some concerns:
1) I would like to have my parameter simply as Stream (as opposed to a filename) since I may also be reading from the clipboard (MemoryStream). However Stream doesn't seem to be able to read n bytes at once into a buffer or get the total length of the Stream in bytes, like FileStream can. Stream is the parent class to both MemoryStream and FileStream.
2) I don't want to assume an encoding such as UTF8
3) I don't want to assume an end of line character (it should work for CR, CRLF, and LF)
I would appreciate any help to make this function more robust.
Here is what I came up with as a more robust solution for estimating line count.
public static int EstimateLineCount(string file)
{
using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
return EstimateLineCount(fs);
}
}
public static int EstimateLineCount(Stream s)
{
//if file is larger than 10MB estimate the line count, otherwise get the exact line count
const int maxBytes = 10485760; //10MB = 1024*1024*10 bytes
s.Position = 0;
using (var sr = new StreamReader(s, Encoding.UTF8))
{
int lineCount = 0;
if (s.Length > maxBytes)
{
while (s.Position < maxBytes && sr.ReadLine() != null)
lineCount++;
return Convert.ToInt32((double)lineCount * s.Length / s.Position);
}
while (sr.ReadLine() != null)
lineCount++;
return lineCount;
}
}
var lineCount = File.ReadLines(@"C:\file.txt").Count();
Another way:
var lineCount = 0;
using (var reader = File.OpenText(@"C:\file.txt"))
{
while (reader.ReadLine() != null)
{
lineCount++;
}
}
You're cheating! You're asking more than one question... I'll try to help you anyway :P
1) No, you can't use Stream, but you can use StreamReader. This should provide the flexibility you need.
2) Test for the encoding, since I gather you'll be working with several. Keep in mind, however, that it's usually hard to cater for ALL scenarios, so pick a few important ones first and extend your program later.
3) Don't - let me show you how:
First, consider your source. Whether it's a file or a memory stream, you should have an idea of its size. I've done the file part because I'm lazy and it's easy, so you'll have to figure out the memory stream part yourself. What I've done is much simpler but less accurate: read the first line of the file and express it as a percentage of the size of the file. Note that I multiplied the length of the string by 2, as that is the delta, in other words the number of extra bytes used per extra character in a string. (That holds for a 2-byte-per-character encoding such as UTF-16; for ASCII or mostly-ASCII UTF-8 text the factor is closer to 1.) Obviously this isn't very accurate, so you can extend it to x lines; just keep in mind that you'll have to change the formula as well.
static void Main(string[] args)
{
FileInfo fileInfo = new FileInfo(@"C:\Muckabout\StringCounter\test.txt");
using (var stream = new StreamReader(fileInfo.FullName))
{
var firstLine = stream.ReadLine(); // Read the first line.
Console.WriteLine("First line read. This is roughly " + (firstLine.Length * 2.0) / fileInfo.Length * 100 + " per cent of the file.");
}
Console.ReadKey();
}
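For points 2) and 3) of the question, one option is to skip decoding entirely and count line breaks on the raw bytes of a sample, then scale by the stream length. The sketch below is only that, a sketch: EstimateLineCountFromBytes is a made-up name, it needs a seekable stream (FileStream and MemoryStream both are), and it assumes an ASCII-compatible encoding such as ASCII, UTF-8 or Latin-1. UTF-16 or UTF-32 text would need different handling. CR, LF and CRLF each count as one line break.

```csharp
using System;
using System.IO;

// Small demonstration on an in-memory stream with mixed line endings.
var sample = new MemoryStream(new byte[] { 65, 13, 10, 66, 10, 67 }); // "A\r\nB\nC"
Console.WriteLine(EstimateLineCountFromBytes(sample));

// Hypothetical helper: exact for streams no larger than sampleBytes, an estimate otherwise.
static long EstimateLineCountFromBytes(Stream stream, int sampleBytes = 1 << 20)
{
    if (!stream.CanSeek) throw new NotSupportedException("A seekable stream is needed to know the total length.");
    long total = stream.Length;
    if (total == 0) return 0;
    stream.Position = 0;
    byte[] buffer = new byte[(int)Math.Min(sampleBytes, total)];
    int filled = 0, read;
    // Read may return fewer bytes than requested, so loop until the sample is full
    while (filled < buffer.Length && (read = stream.Read(buffer, filled, buffer.Length - filled)) > 0)
        filled += read;
    if (filled == 0) return 0;

    long breaks = 0;
    for (int i = 0; i < filled; i++)
    {
        if (buffer[i] == (byte)'\n')
            breaks++; // LF on its own, or the LF half of a CRLF
        else if (buffer[i] == (byte)'\r' && (i + 1 >= filled || buffer[i + 1] != (byte)'\n'))
            breaks++; // CR that is not followed by LF
    }

    if (filled == total)
    {
        // Whole stream sampled: add the last line if it has no terminator
        byte last = buffer[filled - 1];
        bool endsWithBreak = last == (byte)'\n' || last == (byte)'\r';
        return endsWithBreak ? breaks : breaks + 1;
    }
    // Otherwise scale the sampled count up to the full length
    return (long)Math.Round((double)breaks * total / filled);
}
```

The estimate is only as good as the sample: if line lengths at the start of the file differ a lot from the rest, the progress bar will drift accordingly.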

Writing and reading streams with offset to reduce disc seeks

I am reading a file and writing its contents as a stream to another file. What I want to do is write multiple files into a single file and read them back by their offsets.
While writing the files, I understand that I need to know each file's offset and length in order to read it back.
var file = @"d:\foo.pdf";
var stream = File.ReadAllBytes(file);
// here i have the length of the
Console.WriteLine(stream.LongLength);
using (var br = new BinaryWriter(File.Open(@"d:\foo.bin", FileMode.OpenOrCreate)))
{
br.Write(stream);
}
I need to find the offset while writing multiple files.
Also while reading back the files, how do I start from an offset and read forwards as long as the length?
Finally, does this method reduce the number of disk seeks?
To read back various fragments, you will need to store the individual file lengths. For example:
using(var dest = File.Open(@"d:\foo.bin", FileMode.OpenOrCreate))
{
AppendFile(dest, file);
AppendFile(dest, anotherFile);
}
...
static void AppendFile(Stream dest, string path)
{
using(var source = File.OpenRead(path))
{
// 4-byte length header, matching ReadLength below; checked() throws for files of 2 GB or more
var lenHeader = BitConverter.GetBytes(checked((int)source.Length));
dest.Write(lenHeader, 0, 4);
source.CopyTo(dest);
}
}
Then to read back you can do things like:
using(var source = File.OpenRead(...))
{
int len = ReadLength(source);
source.Seek(len, SeekOrigin.Current); // skip the first file
len = ReadLength(source);
// TODO: now read len-many bytes from the second file
}
static int ReadLength(Stream stream)
{
byte[] buffer = new byte[4];
int count = 4, offset = 0, read;
while(count != 0 && (read = stream.Read(buffer, offset, count)) > 0)
{
count -= read;
offset += read;
}
if (count != 0) throw new EndOfStreamException();
return BitConverter.ToInt32(buffer, 0);
}
As for reading len-many bytes; you can either just keep track of it and decrement it while reading, or you can create a length-limited Stream wrapper. Either works.
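The "keep track of it and decrement it while reading" option could be sketched like this. CopyExactly is a made-up helper name, and the 80 KB buffer size is an arbitrary choice. It copies exactly the requested number of bytes from the current position and leaves the source positioned right after them, ready for the next length header.

```csharp
using System;
using System.IO;

// Small demonstration: copy the first three bytes of an in-memory stream.
var container = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
var firstPart = new MemoryStream();
CopyExactly(container, firstPart, 3);
Console.WriteLine(firstPart.Length);

// Hypothetical helper: copies exactly length bytes from source to dest.
static void CopyExactly(Stream source, Stream dest, long length)
{
    byte[] buffer = new byte[81920];
    long remaining = length;
    while (remaining > 0)
    {
        // Never ask for more than what is left of this fragment
        int want = (int)Math.Min(buffer.Length, remaining);
        int read = source.Read(buffer, 0, want);
        if (read == 0) throw new EndOfStreamException();
        dest.Write(buffer, 0, read);
        remaining -= read;
    }
}
```

In the read-back code above, this would take the place of the TODO: after the second ReadLength(source), call CopyExactly(source, someOutputStream, len).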

Remove last x lines from a streamreader

I need to read in all but the last x lines from a file to a streamreader in C#. What is the best way to do this?
Many Thanks!
If it's a large file, is it possible to just seek to the end of the file and examine the bytes in reverse for the '\n' character? I am aware that both \n and \r\n exist. I whipped up the following code and tested it on a fairly trivial file. Can you try testing this on the files that you have? I know my solution looks long, but I think you'll find that it's faster than reading from the beginning and rewriting the whole file.
public static void Truncate(string file, int linesToTruncate)
{
using (FileStream fs = File.Open(file, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None))
{
fs.Position = fs.Length;
// \n \r\n (both use \n to end a line)
const int BUFFER_SIZE = 2048;
// Start at the end until # lines have been encountered, record the position, then truncate the file
long currentPosition = fs.Position;
int linesProcessed = 0;
byte[] buffer = new byte[BUFFER_SIZE];
while (linesProcessed < linesToTruncate && currentPosition > 0)
{
int bytesRead = FillBuffer(buffer, fs);
// We now have a buffer containing the later contents of the file
for (int i = bytesRead - 1; i >= 0; i--)
{
currentPosition--;
if (buffer[i] == '\n')
{
linesProcessed++;
if (linesProcessed == linesToTruncate)
break;
}
}
}
// Truncate the file
fs.SetLength(currentPosition);
}
}
private static int FillBuffer(byte[] buffer, FileStream fs)
{
if (fs.Position == 0)
return 0;
int bytesRead = 0;
int currentByteOffset = 0;
// Calculate how many bytes of the buffer can be filled (remember that we're going in reverse)
long expectedBytesToRead = (fs.Position < buffer.Length) ? fs.Position : buffer.Length;
fs.Position -= expectedBytesToRead;
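while (bytesRead < expectedBytesToRead)
{
// Read only the bytes still missing from this chunk, appending after those already read
int n = fs.Read(buffer, bytesRead, (int)expectedBytesToRead - bytesRead);
if (n == 0)
break; // unexpected end of file
bytesRead += n;
}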
// We have to reset the position again because we moved the reader forward;
fs.Position -= bytesRead;
return bytesRead;
}
Since you are only planning on deleting the end of the file, it seems wasteful to rewrite everything, especially if it's a large file and small N. Of course, one can make the argument that if someone wanted to eliminate all lines, then going from the beginning to the end is more efficient.
Since you are referring to lines in a file, I'm assuming it's a text file. If you just want to get the lines you can read them into an array of strings like so:
string[] lines = File.ReadAllLines(@"C:\test.txt");
Or if you really need to work with StreamReaders:
using (StreamReader reader = new StreamReader(@"C:\test.txt"))
{
while (!reader.EndOfStream)
{
Console.WriteLine(reader.ReadLine());
}
}
You don't really read INTO a StreamReader. In fact, for the pattern you're asking for you don't need the StreamReader at all. System.IO.File has the useful static method 'ReadLines' that you can leverage instead:
IEnumerable<string> allBut = File.ReadLines(path).Reverse().Skip(5).Reverse();
The previous, flawed version, restored in response to the comment thread:
List<string> allLines = File.ReadLines(path).ToList();
IEnumerable<string> allBut = allLines.Take(allLines.Count - 5);
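Both versions above end up holding the whole file in memory. If that matters, a small queue that always stays x lines behind the reader avoids it. This is just a sketch, and AllButLast is a made-up name: a line is only handed out once x newer lines have been seen after it, so the last x lines never leave the queue.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Small demonstration on an array; File.ReadLines(path) can be passed in the same way.
foreach (string line in AllButLast(new[] { "first", "second", "third" }, 1))
    Console.WriteLine(line);

// Hypothetical helper: lazily yields every line except the last count lines.
static IEnumerable<string> AllButLast(IEnumerable<string> lines, int count)
{
    var pending = new Queue<string>();
    foreach (string line in lines)
    {
        pending.Enqueue(line);
        // Once more than count lines are waiting, the oldest one cannot be among the last count
        if (pending.Count > count)
            yield return pending.Dequeue();
    }
}
```

Memory use is then bounded by x lines rather than by the size of the file.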

Setting the offset in a stream

It says here msdn.microsoft.com/en-us/library/system.io.stream.read.aspx that the Stream.Read and Stream.Write methods both advance the position/offset in the stream automatically, so why are the examples here http://msdn.microsoft.com/en-us/library/system.io.stream.read.aspx and http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx manually changing the offset?
Do you only set the offset in a loop if you know the size of the stream, and set it to 0 if you don't know the size and are using a buffer?
// Now read s into a byte buffer.
byte[] bytes = new byte[s.Length];
int numBytesToRead = (int) s.Length;
int numBytesRead = 0;
while (numBytesToRead > 0)
{
// Read may return anything from 0 to 10.
int n = s.Read(bytes, numBytesRead, 10);
// The end of the file is reached.
if (n == 0)
{
break;
}
numBytesRead += n;
numBytesToRead -= n;
}
and
using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
{
const int size = 4096;
byte[] buffer = new byte[size];
using (MemoryStream memory = new MemoryStream())
{
int count = 0;
do
{
count = stream.Read(buffer, 0, size);
if (count > 0)
{
memory.Write(buffer, 0, count);
}
}
while (count > 0);
return memory.ToArray();
}
}
The offset is actually the offset of the buffer, not the stream. Streams are advanced automatically as they are read.
Edit (to the edited question):
In none of the code snippets you pasted into the question do I see any stream offset being set.
I think you are mistaking the calculation of bytes to read vs. bytes received. This protocol may seem funny (why would you receive fewer bytes than requested?) but it makes sense when you consider that you might be reading from a high-latency packet oriented source (think: network sockets).
You might be receiving 6 characters in one burst (from a TCP packet) and only receive the remaining 4 characters in your next read (when the next packet has arrived).
Edit, in response to your linked example from the comment:
using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
{
// ... snip
count = stream.Read(buffer, 0, size);
if (count > 0)
{
memory.Write(buffer, 0, count);
}
It appears that the coders use prior knowledge about the underlying stream implementation, that stream.Read will always return 0 OR the size requested. That seems like a risky bet, to me. But if the docs for GZipStream do state that, it could be alright. However, since the MSDN samples use a generic Stream variable, it is (way) more correct to check the exact number of bytes read.
The first linked example uses a MemoryStream in both Write and Read fashion. The position is reset in between, so the data that was written first will be read:
Stream s = new MemoryStream();
for (int i = 0; i < 100; i++)
{
s.WriteByte((byte)i);
}
s.Position = 0;
The second linked example does not set the stream position. You'd typically have seen a call to Seek if it did. You may be confusing the offsets into the data buffer with the stream position?
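To make the difference concrete, here is a small sketch (ReadFully is a made-up helper name). The second argument of Read is an index into the buffer and has to be advanced by hand when one buffer is filled over several calls. The stream's own Position moves forward by itself with every read.

```csharp
using System;
using System.IO;

var s = new MemoryStream(new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 });
var first = new byte[4];
var second = new byte[4];
ReadFully(s, first);
Console.WriteLine(s.Position); // the stream advanced on its own, no Seek needed
ReadFully(s, second);          // continues where the previous call stopped

// Hypothetical helper: fills the buffer as far as the stream allows and
// returns how many bytes were actually stored.
static int ReadFully(Stream stream, byte[] buffer)
{
    int filled = 0; // offset INTO THE BUFFER, not into the stream
    while (filled < buffer.Length)
    {
        // Read may return fewer bytes than requested, so store the next part after what we already have
        int n = stream.Read(buffer, filled, buffer.Length - filled);
        if (n == 0) break; // end of stream
        filled += n;
    }
    return filled;
}
```

When every call starts writing at the beginning of the buffer and the data is consumed before the next call, as in the GZipStream snippet, the buffer offset simply stays 0.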
