I need to read in all but the last x lines from a file to a streamreader in C#. What is the best way to do this?
Many Thanks!
If it's a large file, is it possible to just seek to the end of the file and examine the bytes in reverse for the '\n' character? I am aware that both \n and \r\n exist. I whipped up the following code and tested it on a fairly trivial file. Can you try testing it on the files that you have? I know my solution looks long, but I think you'll find it's faster than reading from the beginning and rewriting the whole file.
public static void Truncate(string file, int linesToTruncate)
{
    using (FileStream fs = File.Open(file, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None))
    {
        fs.Position = fs.Length;
        // Both \n and \r\n line endings end in \n, so scanning for \n alone is enough
        const int BUFFER_SIZE = 2048;
        // Start at the end, scan backwards until the requested number of lines
        // has been encountered, record the position, then truncate the file
        long currentPosition = fs.Position;
        int linesProcessed = 0;
        byte[] buffer = new byte[BUFFER_SIZE];
        while (linesProcessed < linesToTruncate && currentPosition > 0)
        {
            int bytesRead = FillBuffer(buffer, fs);
            // The buffer now holds the chunk of the file that ends at currentPosition
            for (int i = bytesRead - 1; i >= 0; i--)
            {
                currentPosition--;
                if (buffer[i] == '\n')
                {
                    linesProcessed++;
                    if (linesProcessed == linesToTruncate)
                        break;
                }
            }
        }
        // Truncate the file
        fs.SetLength(currentPosition);
    }
}
private static int FillBuffer(byte[] buffer, FileStream fs)
{
    if (fs.Position == 0)
        return 0;
    // Calculate how many bytes of the buffer can be filled (remember that we're going in reverse)
    int expectedBytesToRead = (int)Math.Min(fs.Position, buffer.Length);
    fs.Position -= expectedBytesToRead;
    int bytesRead = 0;
    while (bytesRead < expectedBytesToRead)
    {
        int read = fs.Read(buffer, bytesRead, expectedBytesToRead - bytesRead);
        if (read == 0)
            break; // defensive: we should never hit end of stream inside the file
        bytesRead += read;
    }
    // Reset the position again because the reads moved it forward
    fs.Position -= bytesRead;
    return bytesRead;
}
Since you are only planning on deleting the end of the file, it seems wasteful to rewrite everything, especially if it's a large file and N is small. Of course, one can argue that if someone wanted to eliminate all the lines, then going from the beginning to the end would be more efficient.
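For example, to drop the last 10 lines of a log file in place using the Truncate method above (the path here is hypothetical):
Truncate(@"C:\logs\app.log", 10);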
Since you are referring to lines in a file, I'm assuming it's a text file. If you just want to get the lines, you can read them into an array of strings like so:
string[] lines = File.ReadAllLines(@"C:\test.txt");
Or if you really need to work with StreamReaders:
using (StreamReader reader = new StreamReader(@"C:\test.txt"))
{
while (!reader.EndOfStream)
{
Console.WriteLine(reader.ReadLine());
}
}
You don't really read INTO a StreamReader. In fact, for the pattern you're asking for you don't need the StreamReader at all. System.IO.File has the useful static method 'ReadLines' that you can leverage instead:
IEnumerable<string> allBut = File.ReadLines(path).Reverse().Skip(5).Reverse();
The previous, flawed version, kept for the sake of the comment thread:
List<string> allLines = File.ReadLines(path).ToList();
IEnumerable<string> allBut = allLines.Take(allLines.Count - 5);
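If holding every line in memory is a concern, a sliding window of N + 1 lines yields the same result while only ever buffering N lines plus one. This is my own sketch, not part of the original answer (AllButLastN is a hypothetical helper name):
// Streams all but the last n lines without loading the whole file.
private static IEnumerable<string> AllButLastN(string path, int n)
{
    var window = new Queue<string>(n + 1);
    foreach (string line in File.ReadLines(path))
    {
        window.Enqueue(line);
        // Once more than n lines are buffered, the oldest one
        // cannot be among the last n, so it is safe to yield.
        if (window.Count > n)
            yield return window.Dequeue();
    }
}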
Related
I have a large text file into which a new line should be inserted after every 2000 characters. What I have done so far:
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
StreamReader reader = new StreamReader(FilePath);
string firstLine = reader.ReadLine();
if (firstLine.Length > 2000)
{
string text = File.ReadAllText(FilePath);
text = Regex.Replace(text, @"(.{2000})", "$1\r\n", RegexOptions.Multiline);
reader.Close();
File.WriteAllText(FilePath, text);
}
It is giving an out of memory exception. Can anyone offer some advice, please?
In the case of a very large (multi-gigabyte) file which doesn't fit in memory, you can try storing the processed data in a temporary file. Avoid ReadAllText; instead, read and write with the help of a buffer (which in this context is convenient to make 2000 chars):
// Initial and target file
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
// Temporary file
string tempFile = Path.ChangeExtension(FilePath, ".~temp");
char[] buffer = new char[2000];
using (StreamReader reader = new StreamReader(FilePath)) {
bool first = true;
using (StreamWriter writer = new StreamWriter(tempFile)) {
while (true) {
int size = reader.ReadBlock(buffer, 0, buffer.Length);
if (size > 0) { // Do we have anything to write?
if (!first) // Are we in the middle and have to add a new line?
writer.WriteLine();
for (int i = 0; i < size; ++i)
writer.Write(buffer[i]);
}
// The last (incomplete) chunk
if (size < buffer.Length)
break;
first = false;
}
}
}
File.Delete(FilePath);
// Move the temporary file into place as the target file
File.Move(tempFile, FilePath);
Edit: Even if your file is not that large (300 MB, see comments), avoid string processing with Replace; several copies of the initial string can easily lead to an out-of-memory condition.
Something like this
private static IEnumerable<string> ToChunks(string text, int size) {
int n = text.Length / size + (text.Length % size == 0 ? 0 : 1);
for (int i = 0; i < n; ++i)
if (i == n - 1)
yield return text.Substring(i * size); // Last chunk
else
yield return text.Substring(i * size, size); // Inner chunk
}
...
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
// Read once; do not Replace or otherwise process the whole string ...
string text = File.ReadAllText(FilePath);
// ... but extracting 2000 char chunks
File.WriteAllLines(FilePath, ToChunks(text, 2000));
You can't simply insert newlines into an existing file - you need to rewrite the entire thing, basically. The easiest way to do that is to use two files - a source and a destination - and then perhaps delete and rename at the end (so the temporary destination file takes the name of the original). This means you can now loop over the source file without reading it all into memory first; essentially, as pseudo-code:
using(...open source for read...)
using(...create dest for write...)
{
char[] buffer = new char[2000];
int charCount;
while(TryBuffer(source, buffer, out charCount)) {
// if true, we filled the buffer; don't need to worry
// about charCount
Write(destination, buffer, buffer.Length);
Write(destination, CRLF);
}
if(charCount != 0) // final chunk when returned false
{
// write any remaining charCount chars as a final chunk
Write(destination, buffer, charCount);
}
}
So that leaves the implementation of TryBuffer and Write. In this case, TextReader and TextWriter are probably your friends, since you are dealing in characters rather than bytes.
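To make the sketch concrete, TryBuffer might look something like this (my own sketch, under the assumption that "filled the buffer" means a completely full buffer; Write is then just TextWriter.Write with the right count):
// Returns true if the buffer was filled completely; charCount reports
// how many characters were actually read (0 at end of input).
static bool TryBuffer(TextReader source, char[] buffer, out int charCount)
{
    charCount = 0;
    while (charCount < buffer.Length)
    {
        int read = source.ReadBlock(buffer, charCount, buffer.Length - charCount);
        if (read == 0)
            break; // end of input
        charCount += read;
    }
    return charCount == buffer.Length;
}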
So I want to be able to seek to a point in a fileStream, then read forward using a StreamReader. Then seek forward again, and use the StreamReader to read another chunk of data.
const int BufferSize = 4096;
var buffer = new char[BufferSize];
var endpoints = new List<long>();
using (var fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
{
var fileLength = fileStream.Length;
var seekPositionCount = fileLength / concurrentReads;
long currentOffset = 0;
for (var i = 0; i < concurrentReads; i++)
{
var seekPosition = seekPositionCount + currentOffset;
// seek the file forward
fileStream.Seek(seekPosition, SeekOrigin.Current);
// setting true at the end is very important, keeps the underlying fileStream open.
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize, true))
{
// this also seeks the file forward the amount in the buffer...
int bytesRead;
var totalBytesRead = 0;
while ((bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
totalBytesRead += bytesRead;
var found = false;
var gotR = false;
for (var j = 0; j < buffer.Length; j++)
{
if (buffer[j] == '\r')
{
gotR = true;
continue;
}
if (buffer[j] == '\n' && gotR)
{
// so we add the total bytes read, minus the current buffer amount read, then add how far into the buffer we actually read.
seekPosition += totalBytesRead - BufferSize + j;
endpoints.Add(seekPosition);
found = true;
break;
}
}
if (found) break;
}
}
// we need to seek to the position we got to in the StreamReader (but not going by how much was read).
fileStream.Seek(seekPosition, SeekOrigin.Current);
currentOffset += seekPosition;
}
}
return endpoints;
However, I only get two entries in endpoints and then it exits.
(bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0
I thought the arguments you pass to ReadAsync were solely to do with the buffer, i.e. that the index argument says where in the buffer to start filling.
I can't make out from Reference Source how this value is used.
I assumed (and can't find the evidence to back it up) that when you open a StreamReader it uses the underlying Stream as its guide, so when you ask to read some bytes, it starts from the position the underlying Stream is at...
But the results of what I'm doing aren't showing that; they seem to show that the StreamReader starts at the beginning of the Stream each time - however, I can't find the evidence to support that this is how it works either...
Seeking
Is my understanding of seeking correct, in the sense that if I call seek
fileStream.Seek(seekPosition, SeekOrigin.Current);
If the file is at 300 and I want to seek to 600, should the seekPosition variable above be 600?
ReferenceSource would say otherwise:
else if (origin == SeekOrigin.Current) {
// Don't call FlushRead here, which would have caused an infinite
// loop. Simply adjust the seek origin. This isn't necessary
// if we're seeking relative to the beginning or end of the stream.
offset -= (_readLen - _readPos);
}
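In other words, with SeekOrigin.Current the offset is relative to where the stream already is: starting at 300 and wanting to land at 600, you pass 300, not 600. A quick illustration:
fileStream.Position = 300;
fileStream.Seek(300, SeekOrigin.Current); // now at 600: the offset is relative
fileStream.Seek(600, SeekOrigin.Begin);   // also lands at 600, but absolute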
So thanks to Hans Passant, I have got the answer:
var buffer = new char[BufferSize];
var endpoints = new List<long>();
using (var fileStream = this.CreateMultipleReadAccessFileStream(fileName))
{
var fileLength = fileStream.Length;
var seekPositionCount = fileLength / concurrentReads;
long currentOffset = 0;
for (var i = 0; i < concurrentReads; i++)
{
var seekPosition = seekPositionCount + currentOffset;
// seek the file forward
// fileStream.Seek(seekPosition, SeekOrigin.Current);
// setting true at the end is very important, keeps the underlying fileStream open.
using (var streamReader = this.CreateTemporaryStreamReader(fileStream))
{
// this is poor on performance, hence why you split the file here and read in new threads.
streamReader.DiscardBufferedData();
// you have to advance the fileStream here, because of the previous line
streamReader.BaseStream.Seek(seekPosition, SeekOrigin.Begin);
// this also seeks the file forward the amount in the buffer...
int bytesRead;
var totalBytesRead = 0;
while ((bytesRead = await streamReader.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
totalBytesRead += bytesRead;
var found = false;
var gotR = false;
for (var j = 0; j < buffer.Length; j++)
{
if (buffer[j] == '\r')
{
gotR = true;
continue;
}
if (buffer[j] == '\n' && gotR)
{
// so we add the total bytes read, minus the current buffer amount read, then add how far into the buffer we actually read.
seekPosition += totalBytesRead - BufferSize + j;
endpoints.Add(seekPosition);
found = true;
break;
}
}
if (found) break;
}
}
currentOffset = seekPosition;
}
}
return endpoints;
Note the new part, rather than doing this twice:
fileStream.Seek(seekPosition, SeekOrigin.Current);
I now use SeekOrigin.Begin and use the StreamReader to progress the underlying base stream:
// this is poor on performance, hence why you split the file here and read in new threads.
streamReader.DiscardBufferedData();
// you have to advance the fileStream here, because of the previous line
streamReader.BaseStream.Seek(seekPosition, SeekOrigin.Begin);
DiscardBufferedData means that I'm always working from the underlying stream's position.
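A minimal illustration of the read-ahead effect this guards against (a sketch, assuming a file of at least a few KB):
using (var fs = File.OpenRead(fileName))
using (var reader = new StreamReader(fs, Encoding.UTF8, true, 4096, true))
{
    reader.Read();                  // consumes ONE char...
    Console.WriteLine(fs.Position); // ...but the stream is typically at 4096
    reader.DiscardBufferedData();   // drop the read-ahead
    fs.Seek(0, SeekOrigin.Begin);   // now it is safe to reposition
}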
I would like to get an estimate of the number of lines in a csv/text file so that I can use that number for a progress bar. The file could be extremely large so getting the exact number of lines will take too long for this purpose.
What I have come up with is below (read in a portion of the file and count the number of lines and use the file size to estimate the total number of lines):
public static int GetLineCountEstimate(string file)
{
double count = 0;
using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
long byteCount = fs.Length;
int maxByteCount = 524288;
if (byteCount > maxByteCount)
{
var buf = new byte[maxByteCount];
fs.Read(buf, 0, maxByteCount);
string s = System.Text.Encoding.UTF8.GetString(buf, 0, buf.Length);
count = s.Split('\n').Length * byteCount / maxByteCount;
}
else
{
var buf = new byte[byteCount];
fs.Read(buf, 0, (int)byteCount);
string s = System.Text.Encoding.UTF8.GetString(buf, 0, buf.Length);
count = s.Split('\n').Length;
}
}
return Convert.ToInt32(count);
}
This seems to work ok, but I have some concerns:
1) I would like to have my parameter simply be a Stream (as opposed to a filename), since I may also be reading from the clipboard (MemoryStream). However, Stream doesn't seem to be able to read n bytes at once into a buffer or report its total length in bytes the way FileStream can. Stream is the parent class of both MemoryStream and FileStream.
2) I don't want to assume an encoding such as UTF8.
3) I don't want to assume an end-of-line character (it should work for CR, CRLF, and LF).
I would appreciate any help to make this function more robust.
Here is what I came up with as a more robust solution for estimating line count.
public static int EstimateLineCount(string file)
{
using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
return EstimateLineCount(fs);
}
}
public static int EstimateLineCount(Stream s)
{
//if file is larger than 10MB estimate the line count, otherwise get the exact line count
const int maxBytes = 10485760; //10MB = 1024*1024*10 bytes
s.Position = 0;
using (var sr = new StreamReader(s, Encoding.UTF8))
{
int lineCount = 0;
if (s.Length > maxBytes)
{
while (s.Position < maxBytes && sr.ReadLine() != null)
lineCount++;
return Convert.ToInt32((double)lineCount * s.Length / s.Position);
}
while (sr.ReadLine() != null)
lineCount++;
return lineCount;
}
}
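Because the overload takes a Stream, it also covers the clipboard case from the question, for example (hypothetical usage; Clipboard.GetText requires an STA thread):
string clipboardText = Clipboard.GetText(); // System.Windows.Forms
using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(clipboardText)))
{
    int estimate = EstimateLineCount(ms);
}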
var lineCount = File.ReadLines(@"C:\file.txt").Count();
Another way:
var lineCount = 0;
using (var reader = File.OpenText(@"C:\file.txt"))
{
while (reader.ReadLine() != null)
{
lineCount++;
}
}
You're cheating! You're asking more than one question... I'll try to help you anyway :P
No, you can't use Stream, but you can use StreamReader. This should provide the flexibility you need.
Test for encoding, since I deduce you'll be working with various encodings. Keep in mind, however, that it's usually hard to cater for ALL scenarios, so pick a few important ones first and extend your program later.
Don't - let me show you how:
First, consider your source. Whether it's a file or a memory stream, you should have an idea of its size. I've done the file bit because I'm lazy and it's easy, so you'll have to figure out the memory stream bit yourself.
What I've done is much simpler but less accurate: read the first line of the file and use its length as a percentage of the size of the file. Note that I multiplied the length of the string by 2, as that is the delta - in other words, the number of extra bytes used per extra character in a string. Obviously this isn't very accurate, so you can extend it to x number of lines; just keep in mind that you'll have to change the formula as well.
static void Main(string[] args)
{
FileInfo fileInfo = new FileInfo(@"C:\Muckabout\StringCounter\test.txt");
using (var stream = new StreamReader(fileInfo.FullName))
{
var firstLine = stream.ReadLine(); // Read the first line.
Console.WriteLine("First line read. This is roughly " + (firstLine.Length * 2.0) / fileInfo.Length * 100 + " per cent of the file.");
}
Console.ReadKey();
}
I have been trying for a couple of days now to load a file in chunks, to allow the user to work with very large (GB) files while keeping the program fast. Currently I have the following code:
using (FileStream filereader = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
using (StreamReader reader = new StreamReader(filereader))
{
while (toRead > 0 && (bytesread = reader.Read(buffer, offset, toRead)) > 0)
{
toRead -= bytesread;
offset += bytesread;
}
if (toRead > 0) throw new EndOfStreamException();
foreach (var item in buffer)
{
temporary += item.ToString();
}
temporary = temporary.Replace("\n", "\n" + System.Environment.NewLine);
Below are the declarations to avoid any confusion (hopefully):
const int Max_Buffer = 5000;
char[] buffer = new char[Max_Buffer];
int bytesread;
int toRead = 5000;
int offset = 0;
At the moment the program reads in 5000 bytes of the text file, then processes the bytes into a string, which I then pass into a StringReader so I can take the information I want.
My problem at the moment is that the buffer can stop halfway through a line, so when I take my data in the StringReader class it throws index/length errors.
What I need to know is how to seek back in the array to find a certain set of characters that signify the start of a line, and then only return the data before that point for processing into a string.
Another issue, after sorting out the seeking back, is how I would keep the data I didn't want to process and bring in more data to fill the buffer.
I hope this is explained well; I know I can sometimes be confusing. I hope someone can help.
I would suggest using reader.ReadLine() instead of reader.Read() in your loop:
string line = reader.ReadLine();
bytesread = line.Length * 2; // each character is Unicode (UTF-16), i.e. 2 bytes
You can then check whether (toRead - bytesread) < 0.
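If you would rather keep the fixed-size block reads, another pattern (a sketch; ProcessLines is a hypothetical stand-in for your StringReader processing) is to split each chunk at its last line break and carry the partial trailing line into the next read:
string carry = "";
char[] buffer = new char[5000];
int read;
while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
{
    string chunk = carry + new string(buffer, 0, read);
    int lastBreak = chunk.LastIndexOf('\n');
    if (lastBreak >= 0)
    {
        ProcessLines(chunk.Substring(0, lastBreak + 1)); // complete lines only
        carry = chunk.Substring(lastBreak + 1);          // partial line, kept for next pass
    }
    else
    {
        carry = chunk; // no line break in this chunk at all
    }
}
if (carry.Length > 0)
    ProcessLines(carry); // final, possibly unterminated, line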
int bufferlength = 12488;
int pointer = 1;
int offset = 0;
int length = 0;
FileStream fstwrite = new FileStream("D:\\Movie.wmv", FileMode.Create);
while (pointer != 0)
{
byte[] buff = new byte[bufferlength];
FileStream fst = new FileStream("E:\\Movie.wmv", FileMode.Open);
pointer = fst.Read(buff, 0, bufferlength);
fst.Close();
fstwrite.Write(buff, offset , pointer);
offset += pointer;
}
I used the above code for splitting a file and place it in other drive.Im not able to set the correct offset and length for this routine can anyone help me to fix this
splitting in the sense ,i split it in "x" kbs and pass it somewhere make the same file in some other location
I find it atlast ,thanks to evry one who gave their valueble responses.
Currently you're always reading from the start of the file... and even if you weren't, you'd just be copying the whole file.
Here's some code which will actually split a single file into multiple files:
public static void SplitFile(string inputFile,
string outputPrefix,
int chunkSize)
{
byte[] buffer = new byte[chunkSize];
using (Stream input = File.OpenRead(inputFile))
{
int index = 0;
while (input.Position < input.Length)
{
using (Stream output = File.Create(outputPrefix + index))
{
int chunkBytesRead = 0;
while (chunkBytesRead < chunkSize)
{
int bytesRead = input.Read(buffer,
chunkBytesRead,
chunkSize - chunkBytesRead);
// End of input
if (bytesRead == 0)
{
break;
}
chunkBytesRead += bytesRead;
}
output.Write(buffer, 0, chunkBytesRead);
}
index++;
}
}
}
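Hypothetical usage matching the drives in the question - this would produce D:\Movie.wmv.part0, D:\Movie.wmv.part1, and so on:
SplitFile(@"E:\Movie.wmv", @"D:\Movie.wmv.part", 12488);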
You're reading bufferlength bytes. Shouldn't you then set the offset like this?
offset += bufferlength;
Don't open your source file inside the loop, or you'll always read the first chunk.
Open it before the loop, then make sure your offset is applied to the read.
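Concretely, that minimal fix might look like this (a sketch keeping the question's variable names; note the write offset into the buffer is always 0, because each stream tracks its own position):
using (FileStream fst = new FileStream("E:\\Movie.wmv", FileMode.Open))
using (FileStream fstwrite = new FileStream("D:\\Movie.wmv", FileMode.Create))
{
    byte[] buff = new byte[bufferlength];
    int pointer;
    while ((pointer = fst.Read(buff, 0, bufferlength)) > 0)
    {
        fstwrite.Write(buff, 0, pointer);
    }
}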