Why does the StreamReader.EndOfStream property change the BaseStream.Position value? - c#

I wrote this small program, which reads every 5th character from Random.txt.
In Random.txt I have one line of text: ABCDEFGHIJKLMNOPRST. I got the expected result:
Position of A is 0
Position of F is 5
Position of K is 10
Position of P is 15
Here is the code:
static void Main(string[] args)
{
StreamReader fp;
int n;
fp = new StreamReader("d:\\RANDOM.txt");
long previousBSposition = fp.BaseStream.Position;
//In this point BaseStream.Position is 0, as expected
n = 0;
while (!fp.EndOfStream)
{
//After !fp.EndOfStream were executed, BaseStream.Position is changed to 19,
//so I have to reset it to a previous position :S
fp.BaseStream.Seek(previousBSposition, SeekOrigin.Begin);
Console.WriteLine("Position of " + Convert.ToChar(fp.Read()) + " is " + fp.BaseStream.Position);
n = n + 5;
fp.DiscardBufferedData();
fp.BaseStream.Seek(n, SeekOrigin.Begin);
previousBSposition = fp.BaseStream.Position;
}
}
My question is: why is BaseStream.Position changed to 19, i.e. the end of the BaseStream, after the line while (!fp.EndOfStream)? I expected, obviously wrongly, that BaseStream.Position would stay the same when I check EndOfStream.
Thanks.

The only certain way to find out whether a Stream is at its end is to actually read something from it and check whether the return value is 0. (StreamReader has another way – checking its internal buffer, but you correctly don't let it do that by calling DiscardBufferedData.)
So, EndOfStream has to read at least one byte from the base stream. And since reading byte by byte is inefficient, it reads more. That's the reason why the call to EndOfStream changes the position to the end (it wouldn't be the end of the file for bigger files).
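The read-ahead is easy to observe without a file. The following is a minimal sketch, not part of the original answer: it wraps the same 19 characters in a MemoryStream (an assumption made so the example is self-contained; the class and method names are made up for illustration) and compares BaseStream.Position before and after the EndOfStream check.

```csharp
using System;
using System.IO;
using System.Text;

public static class EndOfStreamDemo
{
    // Returns the base stream position before and after EndOfStream is queried.
    public static long[] PositionsAroundEndOfStream(string text)
    {
        // MemoryStream stands in for the file (assumption, for a self-contained demo).
        using (var stream = new MemoryStream(Encoding.ASCII.GetBytes(text)))
        using (var reader = new StreamReader(stream))
        {
            long before = reader.BaseStream.Position;
            bool atEnd = reader.EndOfStream; // an empty buffer forces a read from the base stream
            long after = reader.BaseStream.Position;
            Console.WriteLine("EndOfStream = {0}, before = {1}, after = {2}", atEnd, before, after);
            return new[] { before, after };
        }
    }

    public static void Main()
    {
        PositionsAroundEndOfStream("ABCDEFGHIJKLMNOPRST");
    }
}
```

With this 19-byte input the position should go from 0 to 19, because the whole content fits into the reader's buffer. For input larger than the buffer, the second value would be the number of bytes buffered rather than the length of the stream.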
It seems you don't actually need to use StreamReader, so you should use Stream (or specifically FileStream) directly:
using (Stream fp = new FileStream(@"d:\RANDOM.txt", FileMode.Open))
{
int n = 0;
while (true)
{
int read = fp.ReadByte();
if (read == -1)
break;
char c = (char)read;
Console.WriteLine("Position of {0} is {1}.", c, n); // n is where c was read from; fp.Position has already advanced by one
n += 5;
fp.Position = n;
}
}
(I'm not sure what setting the position beyond the end of the file does in this situation; you may need to add a check for that.)
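On the open question of seeking past the end: for seekable streams such as FileStream and MemoryStream, moving Position beyond the end is permitted, and a subsequent ReadByte simply returns -1. The sketch below illustrates this with a MemoryStream in place of the file (an assumption to keep it self-contained; EveryFifthByte and Sample are made-up names).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class EveryFifthByte
{
    // Reads one byte at offsets 0, step, 2*step, ... until a read hits the end.
    public static List<string> Sample(Stream stream, int step)
    {
        var results = new List<string>();
        long n = 0;
        while (true)
        {
            stream.Position = n;          // may point past the end; that is allowed
            int read = stream.ReadByte(); // -1 once there is nothing left to read
            if (read == -1)
                break;
            results.Add(string.Format("Position of {0} is {1}", (char)read, n));
            n += step;
        }
        return results;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("ABCDEFGHIJKLMNOPRST");
        using (var stream = new MemoryStream(data))
        {
            foreach (string s in Sample(stream, 5))
                Console.WriteLine(s);
        }
    }
}
```

For the 19-character sample line this reports A, F, K and P at offsets 0, 5, 10 and 15; the read at offset 20 returns -1 and ends the loop, so no extra bounds check is needed.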

The base stream's Position property refers to the position of the last read byte in the buffer, not the actual position of the StreamReader's cursor.

You are right, and I could reproduce your issue as well. However, according to MSDN (Read Text from a File), the proper way to read a text file with a StreamReader is the following, which also always closes and disposes the stream thanks to the using block:
try
{
// Create an instance of StreamReader to read from a file.
// The using statement also closes the StreamReader.
using (StreamReader sr = new StreamReader("TestFile.txt"))
{
String line;
// Read and display lines from the file until the end of
// the file is reached.
while ((line = sr.ReadLine()) != null)
{
Console.WriteLine(line);
}
}
}
catch (Exception e)
{
// Let the user know what went wrong.
Console.WriteLine("The file could not be read:");
Console.WriteLine(e.Message);
}

Related

c# - splitting a large list into smaller sublists

Fairly new to C# - Sitting here practicing. I have a file with 10 million passwords listed in a single file that I downloaded to practice with.
I want to break the file down to lists of 99. Stop at 99 then do something. Then start where it left off and repeat the do something with the next 99 until it reaches the last item in the file.
I can do the count part well, it is the stop at 99 and continue where I left off is where I am having trouble. Anything I find online is not close to what I am trying to do and anything I add to this code on my own does not work.
I am more than happy to share more information if I am not clear. Just ask and will respond however, I might not be able to respond until tomorrow depending on what time it is.
Here is the code I have started:
using System;
using System.IO;
namespace lists01
{
class Program
{
static void Main(string[] args)
{
int count = 0;
var f1 = @"c:\tmp\10-million-password-list-top-1000000.txt";
{
var content = File.ReadAllLines(f1);
foreach (var v2 in content)
{
count++;
Console.WriteLine(v2 + "\t" + count);
}
}
}
}
}
My end goal is to do this with any list of items from files I have. I am only using this password list because it was sizable and thought it would be good for this exercise.
Thank you
Keith
Here are a couple of different ways to approach this. Normally, I would suggest the ReadAllLines function that you have in your code. The trade-off is that you are loading the entire file into memory at once, then you operate on it.
Using read all lines in concert with Linq's Skip() and Take() methods, you can chop the lines up into groups like this:
var lines = File.ReadAllLines(fileName);
int linesAtATime = 99;
for (int i = 0; i < lines.Length; i = i + linesAtATime)
{
List<string> currentLinesGroup = lines.Skip(i).Take(linesAtATime).ToList();
DoSomethingWithLines(currentLinesGroup);
}
But, if you are working with a really large file, it might not be practical to load the entire file into memory. Plus, you might not want to leave the file open while you are working on the lines. This option gives you more control over how you move through the file. It just loads the part it needs into memory, and closes the file while you are working on the current set of lines.
List<string> lines = new List<string>();
int maxLines = 99;
long seekPosition = 0;
bool fileLoaded = false;
string line;
while (!fileLoaded)
{
using (Stream stream = File.Open(fileName, FileMode.Open))
{
//Jump back to the previous position
stream.Seek(seekPosition, SeekOrigin.Begin);
using (StreamReader reader = new StreamReader(stream))
{
while (!reader.EndOfStream && lines.Count < maxLines)
{
line = reader.ReadLine();
seekPosition += (line.Length + 2); //Tracks how much data has been read. Assumes one byte per character and \r\n line endings.
lines.Add(line);
}
fileLoaded = reader.EndOfStream;
}
}
DoSomethingWithLines(lines);
lines.Clear();
}
In this case, I used Stream because it has the ability to seek to a specific position in the file. But then I used StreamReader because it has the ReadLine() method.
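A third variation, offered here only as a sketch and not taken from the answer above, avoids both loading the whole file and tracking byte offsets: File.ReadLines enumerates the file lazily, and a small helper groups the lines. LineBatcher, Batch and the default file name are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineBatcher
{
    // Groups a lazy sequence of lines into lists of at most `size` items.
    public static IEnumerable<List<string>> Batch(IEnumerable<string> lines, int size)
    {
        var batch = new List<string>(size);
        foreach (string line in lines)
        {
            batch.Add(line);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<string>(size); // start a fresh group
            }
        }
        if (batch.Count > 0)
            yield return batch; // final, possibly shorter, group
    }

    public static void Main(string[] args)
    {
        // Hypothetical default path; pass a real file as the first argument.
        string fileName = args.Length > 0 ? args[0] : "passwords.txt";
        if (!File.Exists(fileName))
        {
            Console.WriteLine("File not found: " + fileName);
            return;
        }
        int group = 0;
        foreach (List<string> lines in Batch(File.ReadLines(fileName), 99))
        {
            group++;
            Console.WriteLine("Group {0}: {1} lines", group, lines.Count);
        }
    }
}
```

The trade-off compared with the seek-based version is that the file stays open while each group is processed.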

Read the large text files into chunks line by line

Suppose I have to read the following lines from a text file:
INFO 2014-03-31 00:26:57,829 332024549ms Service1 startmethod - FillPropertyColor end
INFO 2014-03-31 00:26:57,829 332024549ms Service1 getReports_Dataset - getReports_Dataset started
INFO 2014-03-31 00:26:57,829 332024549ms Service1 cheduledGeneration - SwitchScheduledGeneration start
INFO 2014-03-31 00:26:57,829 332024549ms Service1 cheduledGeneration - SwitchScheduledGeneration limitId, subscriptionId, limitPeriod, dtNextScheduledDate,shoplimittype0, 0, , 3/31/2014 12:26:57 AM,0
I use FileStream to read the text file because it is over 1 GB in size. I have to read the file in chunks: on the first run the program should read two lines, i.e. up to "getReports_Dataset started" at the end of the second line, and on the next run it should start reading from the third line. I wrote the code below, but I can't get the desired output. The first problem is that my code doesn't record the exact position from which the next run has to start reading. The second problem is that the lines it reads are incomplete, i.e. some parts of the lines are missing. Here is the code:
readPosition = getLastReadPosition();
using (FileStream fStream = new FileStream(logFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (System.IO.StreamReader rdr = new System.IO.StreamReader(fStream))
{
rdr.BaseStream.Seek(readPosition, SeekOrigin.Begin);
while (numCharCount > 0)
{
int numChars = rdr.ReadBlock(block, 0, block.Length);
string blockString = new string(block);
lines = blockString.Split(Convert.ToChar('\r'));
lines[0] = fragment + lines[0];
fragment = lines[lines.Length - 1];
foreach (string line in lines)
{
lstTextLog.Add(line);
if (lstTextLog.Contains(fragment))
{
lstTextLog.Remove(fragment);
}
numProcessedChar++;
}
numCharCount--;
}
SetLastPosition(numProcessedChar, logFilePath);
}
If you want to read a file line-by-line, do this:
foreach (string line in File.ReadLines("filename"))
{
// process line here
}
If you really must read a line and save the position, you need to save the last line number read, rather than the stream position. For example:
int lastLineRead = getLastLineRead();
string nextLine = File.ReadLines("filename").Skip(lastLineRead).FirstOrDefault();
if (nextLine != null)
{
lastLineRead++;
SetLastPosition(lastLineRead, logFilePath);
}
The reason you can't do it by saving the base stream position is because StreamReader reads a large buffer full of data from the base stream, which moves the file pointer forward by the buffer size. StreamReader then satisfies read requests from that buffer until it has to read the next buffer full. For example, say you open a StreamReader and ask for a single character. Assuming that it has a buffer size of 4 kilobytes, StreamReader does essentially this:
if (buffer is empty)
{
read buffer (4,096 bytes) from base stream
buffer_position = 0;
}
char c = buffer[buffer_position];
buffer_position++; // increment position for next read
return c;
Now, if you ask for the base stream's position, it's going to report that the position is at 4096, even though you've only read one character from the StreamReader.

In C#, How can I copy a file with arbitrary encoding, reading line by line, without adding or deleting a newline

I need to be able to take a text file with unknown encoding (e.g., UTF-8, UTF-16, ...) and copy it line by line, making specific changes as I go. In this example, I am changing the encoding, however there are other uses for this kind of processing.
What I can't figure out is how to determine if the last line has a newline! Some programs care about the difference between a file with these records:
Rec1<newline>
Rec2<newline>
And a file with these:
Rec1<newline>
Rec2
How can I tell the difference in my code so that I can take appropriate action?
using (StreamReader reader = new StreamReader(sourcePath))
using (StreamWriter writer = new StreamWriter(destinationPath, false, outputEncoding))
{
bool isFirstLine = true;
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (isFirstLine)
{
writer.Write(line);
isFirstLine = false;
}
else
{
writer.Write("\r\n" + line);
}
}
//if (LastLineHasNewline)
//{
// writer.Write("\n");
//}
writer.Flush();
}
The commented-out code is what I want to be able to do, but I can't figure out how to set the condition LastLineHasNewline! Remember, I have no a priori knowledge of the input file encoding.
Remember, I have no a priori knowledge of the input file encoding.
That's the fundamental problem to solve.
If the file could be using any encoding, then there is no concept of reading "line by line" as you can't possibly tell what the line ending is.
I suggest you first address this part, and the rest will be easy. Now, without knowing the context it's hard to say whether that means you should be asking the user for the encoding, or detecting it heuristically, or something else - but I wouldn't start trying to use the data before you can fully understand it.
As often happens, the moment you go to ask for help, the answer comes to the surface. The commented out code becomes:
if (LastLineHasNewline(reader))
{
writer.Write("\n");
}
And the function looks like this:
private static bool LastLineHasNewline(StreamReader reader)
{
byte[] newlineBytes = reader.CurrentEncoding.GetBytes("\n");
int newlineByteCount = newlineBytes.Length;
reader.BaseStream.Seek(-newlineByteCount, SeekOrigin.End);
byte[] inputBytes = new byte[newlineByteCount];
reader.BaseStream.Read(inputBytes, 0, newlineByteCount);
for (int i = 0; i < newlineByteCount; i++)
{
if (newlineBytes[i] != inputBytes[i])
return false;
}
return true;
}

How can I split a big text file into smaller file?

I have a big file with some text, and I want to split it into smaller files.
In this example, this is what I do:
1. I open a text file with, let's say, 10 000 lines in it.
2. I set package=300, which is the small-file limit: once a small file has 300 lines in it, I close it and open a new file for writing (for example, package2).
3. Same as step 2.
4. You already know.
Here is the code from my function that should do that. The idea (which I don't know how to implement) is to close the current file and open a new one once it has reached the limit (300 in our case).
Let me show you what I'm talking about:
int nr = 1;
package=textBox1.Text;//how many lines/file (small file)
string packnr = nr.ToString();
string filer=package+"Pack-"+packnr+"+_"+date2+".txt";//name of small file/s
int packtester = 0;
int package= 300;
StreamReader freader = new StreamReader("bigfile.txt");
StreamWriter pak = new StreamWriter(filer);
while ((line = freader.ReadLine()) != null)
{
if (packtester < package)
{
pak.WriteLine(line);//writing line to small file
packtester++;//increasing the lines of small file
}
else if (packtester == package)//in this example, checking if the lines
//written, got to 300
{
packtester = 0;
pak.Close();//closing the file
nr++;//nr++ -> just for file name to be Pack-2;
packnr = nr.ToString();
StreamWriter pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
}
}
I get these errors:
Cannot use local variable 'pak' before it is declared
A local variable named 'pak' cannot be declared in this scope because it would give a different meaning to 'pak', which is already used in a 'parent or current' scope to denote something else
Try this:
public void SplitFile()
{
int nr = 1;
int package = 300;
DateTime date2 = DateTime.Now;
int packtester = 0;
using (var freader = new StreamReader("bigfile.txt"))
{
StreamWriter pak = null;
try
{
pak = new StreamWriter(GetPackFilename(package, nr, date2), false);
string line;
while ((line = freader.ReadLine()) != null)
{
if (packtester < package)
{
pak.WriteLine(line); //writing line to small file
packtester++; //increasing the lines of small file
}
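else
{
pak.Flush();
pak.Close(); //closing the file
nr++; //nr++ -> just for file name to be Pack-2;
pak = new StreamWriter(GetPackFilename(package, nr, date2), false);
pak.WriteLine(line); //the line just read belongs to the new file, so it must not be dropped
packtester = 1; //the new file already holds one line
}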
}
}
finally
{
if(pak != null)
{
pak.Dispose();
}
}
}
}
private string GetPackFilename(int package, int nr, DateTime date2)
{
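// An explicit date format is used because the default DateTime format usually contains characters such as ':' that are not valid in Windows file names.
return string.Format("{0}Pack-{1}+_{2:yyyyMMdd}.txt", package, nr, date2);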
}
Logrotate can do this automatically for you. Years have been put into it and it's what people trust to handle their sometimes very large webserver logs.
Note that the code, as written, will not compile because you define the variable pak more than once. It should otherwise function, though it has some room for improvement.
When working with files, my suggestion and the general norm is to wrap your code in a using block, which is basically syntactic sugar built on top of a finally clause:
using (var stream = File.Open(@"C:\hi.txt", FileMode.Open))
{
//write your code here. When this block is exited, stream will be disposed.
}
Is equivalent to:
var stream = File.Open(@"C:\hi.txt", FileMode.Open);
try
{
//write your code here.
}
finally
{
if (stream != null)
stream.Dispose();
}
In addition, when working with files, always prefer opening file streams using very specific permissions and modes as opposed to using the more sparse constructors that assume some default options. For example:
var stream = new StreamWriter(File.Open(@"c:\hi.txt", FileMode.CreateNew, FileAccess.ReadWrite, FileShare.Read));
FileMode.CreateNew guarantees, for example, that an existing file is never overwritten: we assume that the file we want to open doesn't exist yet, and the call throws an IOException if it does.
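As a rough illustration of that behaviour (a sketch only; CreateNewDemo and TryCreate are made-up names, and the temporary file path is generated just for the demo), the second attempt to create the same file fails instead of overwriting it:

```csharp
using System;
using System.IO;

public static class CreateNewDemo
{
    // Returns true if the file was created, false if opening it failed with an
    // IOException (which is what FileMode.CreateNew throws for an existing file).
    public static bool TryCreate(string path)
    {
        try
        {
            using (var stream = File.Open(path, FileMode.CreateNew, FileAccess.ReadWrite, FileShare.Read))
            using (var writer = new StreamWriter(stream))
            {
                writer.WriteLine("hello");
            }
            return true;
        }
        catch (IOException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Random name in the temp directory, so the first attempt finds no file.
        string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Console.WriteLine(TryCreate(path)); // first attempt creates the file
        Console.WriteLine(TryCreate(path)); // second attempt finds it already there
        File.Delete(path);
    }
}
```

Catching IOException this broadly also hides other I/O failures, so real code would probably want to distinguish them.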
Oh, and instead of using the check you perform, I suggest using the EndOfStream property of the StreamReader object.
This code looks like it closes the stream and re-opens a new stream when you hit 300 lines. What exactly doesn't work in this code?
One thing you'll want to add is a final close (probably with a check so it doesn't try to close an already closed stream) in case you don't have an even multiple of 300 lines.
EDIT:
Due to your edit I see your problem. You don't need to redeclare pak in the last line of code, simply reinitialize it to another streamwriter.
(I don't remember if that is disposable but if it is you probably should do that before making a new one).
StreamWriter pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
becomes
pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");

Bytes consumed by StreamReader

Is there a way to know how many bytes of a stream have been used by StreamReader?
I have a project where we need to read a file that has a text header followed by the start of the binary data. My initial attempt to read this file was something like this:
private long _dataOffset; // long, because Stream.Position is a long
void ReadHeader(string path)
{
using (FileStream stream = File.OpenRead(path))
{
StreamReader textReader = new StreamReader(stream);
string line; // declared outside the loop so the while condition can see it
do
{
line = textReader.ReadLine();
handleHeaderLine(line);
} while (line != "DATA"); // Yes, they used "DATA" to mark the end of the header
_dataOffset = stream.Position;
}
}
private byte[] ReadDataFrame(string path, int frameNum)
{
using (FileStream stream = File.OpenRead(path))
{
stream.Seek(_dataOffset + frameNum * cbFrame, SeekOrigin.Begin);
byte[] data = new byte[cbFrame];
stream.Read(data, 0, cbFrame);
return data;
}
return null;
}
The problem is that when I set _dataOffset to stream.Position, I get the position that the StreamReader has read to, not the end of the header. As soon as I thought about it this made sense, but I still need to be able to know where the end of the header is and I'm not sure if there's a way to do it and still take advantage of StreamReader.
You can find out how many bytes the StreamReader has actually returned (as opposed to read from the stream) in a number of ways, none of them too straightforward I'm afraid.
1. Get the result of textReader.CurrentEncoding.GetByteCount(allTextReadSoFar), where allTextReadSoFar is the text read so far including its line terminators, and then seek to this position in the stream.
2. Use some reflection hackery to retrieve the value of the private variable of the StreamReader object that corresponds to the current byte position within the internal buffer (different from that within the stream - usually behind, but no more than equal to of course). Judging by .NET Reflector, this variable seems to be named bytePos.
3. Don't bother using a StreamReader at all but instead implement your custom ReadLine function built on top of the Stream or BinaryReader even (BinaryReader is guaranteed never to read further ahead than what you request). This custom function must read from the stream char by char, so you'd actually have to use the low-level Decoder object (unless the encoding is ASCII/ANSI, in which case things are a bit simpler due to single-byte encoding).
Option 1 is going to be the least efficient I would imagine (since you're effectively re-encoding text you just decoded), and option 3 the hardest to implement, though perhaps the most elegant. I'd probably recommend against using the ugly reflection hack (option 2), even though it looks tempting, being the most direct solution and only taking a couple of lines. (To be quite honest, the StreamReader class really ought to expose this variable via a public property, but alas it does not.) So in the end, it's up to you, but either option 1 or 3 should do the job nicely enough...
Hope that helps.
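The first approach (re-encoding the text that was read) could look roughly like the sketch below. It is not the answerer's code and rests on loud assumptions: the file has no byte order mark, every header line ends with the same known terminator, and the caller passes the right encoding. HeaderOffset and FindDataOffset are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class HeaderOffset
{
    // Assumptions of this sketch: no byte order mark, every header line ends
    // with `newline`, and `encoding` matches the file.
    public static long FindDataOffset(Stream stream, Encoding encoding, string newline)
    {
        long offset = 0;
        // The reader is deliberately not disposed, so the caller's stream stays open.
        var reader = new StreamReader(stream, encoding, false);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Count the bytes of the line plus the terminator that ReadLine stripped.
            offset += encoding.GetByteCount(line) + encoding.GetByteCount(newline);
            if (line == "DATA")
                return offset;
        }
        throw new InvalidDataException("No DATA marker found.");
    }

    public static void Main()
    {
        byte[] header = Encoding.UTF8.GetBytes("name=demo\r\nDATA\r\n");
        byte[] file = new byte[header.Length + 4]; // 4 bytes of fake binary payload
        header.CopyTo(file, 0);
        using (var stream = new MemoryStream(file))
        {
            Console.WriteLine(FindDataOffset(stream, Encoding.UTF8, "\r\n"));
        }
    }
}
```

For the in-memory sample the header is 17 bytes long, so the binary payload starts at offset 17. If the real files mix \n and \r\n terminators or start with a byte order mark, the count will be off and reading the header byte by byte is the safer route.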
So the data is utf8 (the default encoding for StreamReader). This is a multibyte encoding, so IndexOf would be inadvisable. You could:
Encoding.UTF8.GetByteCount(string)
on your data so far, adding 1 or 2 bytes for the missing line ending.
If you need to count bytes, I'd go with the BinaryReader. You can take the results and cast them about as needed, but I find its idea of its current position to be more reliable (since it reads in binary, it's immune to character-set problems).
So your last line contains 'DATA' + an unknown amount of data bytes. You could extract the position by using IndexOf() with your last read line. Then readjust the stream.Position.
But I am not sure if you should use ReadLine() at all in this case. Maybe it would be better to read byte by byte until you reach the 'DATA' mark.
The line breaks are easily identifiable without needing to decode the stream first (except for some encodings rarely used for text files like EBCDIC, UTF-16, UTF-32), so you can just read each line as bytes and then decode the entire line:
using (FileStream stream = File.OpenRead(path)) {
List<byte> buffer = new List<byte>();
bool hasCr = false;
bool done = false;
while (!done) {
int b = stream.ReadByte();
if (b == -1) throw new IOException("End of file reached in header.");
if (b == 13) {
hasCr = true;
} else if (b == 10 && hasCr) {
string line = Encoding.UTF8.GetString(buffer.ToArray(), 0, buffer.Count);
if (line == "DATA") {
done = true;
} else {
HandleHeaderLine(line);
}
buffer.Clear();
hasCr = false;
} else {
if (hasCr) buffer.Add(13);
hasCr = false;
buffer.Add((byte)b);
}
}
_dataOffset = stream.Position;
}
Instead of closing the stream and opening it again, you could of course just keep on reading the data.
