How does StreamReader read all chars, including 0x0D 0x0A chars? - c#

I have an old .txt file I am trying to convert. Many lines (but not all) end with "0x0D 0x0D 0x0A".
This code reads all of the lines.
StreamReader srFile = new StreamReader(gstPathFileName);
while (!srFile.EndOfStream) {
string stFileContents = srFile.ReadLine();
...
}
This results in extra "" strings between each .txt line. As there are some blank lines between the paragraphs, removing all "" strings removes those blank lines.
Is there a way to have StreamReader read all of the chars including the "0x0D 0x0D 0x0A"?
Edited two hours later ... the file is huge, 1.6MB.
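To see where the extra "" strings come from, here is a minimal sketch that runs ReadLine over an in-memory string instead of the real file. ReadLineDemo and ReadAll are hypothetical names used only for illustration; StringReader.ReadLine uses the same line-ending rules as StreamReader.ReadLine.

```csharp
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Collects every string that ReadLine returns for the given text.
    public static List<string> ReadAll(string text)
    {
        var result = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            // ReadLine returns null once the end of the text is reached
            while ((line = reader.ReadLine()) != null)
            {
                result.Add(line);
            }
        }
        return result;
    }
}
```

Calling ReadLineDemo.ReadAll("first\r\r\nsecond") gives three strings: "first", "" and "second". The first 0x0D ends "first", and the remaining 0x0D 0x0A pair then terminates an empty line.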

Here is a very simple reimplementation of ReadLine. I wrote a version that returns an IEnumerable<string> because it's easier. It is an extension method, which is why it lives in a static class. The code is heavily commented, so it should be easy to read.
public static class StreamEx
{
public static string[] ReadAllLines(this TextReader tr, string separator)
{
return tr.ReadLines(separator).ToArray();
}
// StreamReader is based on TextReader
public static IEnumerable<string> ReadLines(this TextReader tr, string separator)
{
// Handling of empty file: old remains null
string old = null;
// Read buffer
var buffer = new char[128];
while (true)
{
// If we already read something
if (old != null)
{
// Look for the separator
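// Ordinal comparison: the separator is matched char by char, not by culture rules
int ix = old.IndexOf(separator, StringComparison.Ordinal);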
// If found
if (ix != -1)
{
// Return the piece of line before the separator
yield return old.Remove(ix);
// Then remove the piece of line before the separator plus the separator
old = old.Substring(ix + separator.Length);
// And continue
continue;
}
}
// old doesn't contain any separator, let's read some more chars
int read = tr.ReadBlock(buffer, 0, buffer.Length);
// If there is no more chars to read, break the cycle
if (read == 0)
{
break;
}
// Add the just read chars to the old chars
// note that null + "somestring" == "somestring"
old += new string(buffer, 0, read);
// A new "round" of the while cycle will search for the separator
}
// Now we have to handle chars after the last separator
// If we read something
if (old != null)
{
// Return all the remaining characters
yield return old;
}
}
}
Note that, as written, it won't directly handle your problem :-) But it lets you select the separator you want to use. So you use "\r\n" and then you trim the excess '\r'.
Use it like this:
using (var sr = new StreamReader("somefile"))
{
// Little LINQ to strip excess \r and to make an array
// (note that by making an array you'll put all the file
// in memory)
string[] lines = sr.ReadLines("\r\n").Select(x => x.TrimEnd('\r')).ToArray();
}
or
using (var sr = new StreamReader("somefile"))
{
// Little LINQ to strip excess \r
// (note that the file will be read line by line, so only
// a line at a time is in memory (plus some remaining characters
// of the next line in the old buffer)
IEnumerable<string> lines = sr.ReadLines("\r\n").Select(x => x.TrimEnd('\r'));
foreach (string line in lines)
{
// Do something
}
}

You could always use a BinaryReader and manually read in lines a byte at a time. Keep hold of the bytes, then when you come across 0x0d 0x0d 0x0a, make a new string of the bytes for the current line.
Note:
I'm assuming that your encoding is Encoding.UTF8, but your case might be different. Because this reads raw bytes, the encoding is not detected for you; you have to choose the one that matches the file.
If your file has extra information, e.g. a byte order mark, that will be returned too.
Here it is:
public static IEnumerable<string> ReadLinesFromStream(string fileName)
{
using ( var fileStream = File.OpenRead(fileName) )
using ( BinaryReader binaryReader = new BinaryReader(fileStream) )
{
var bytes = new List<byte>();
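// Compare stream position with length; PeekChar would try to decode a char and can fail on raw bytes
while ( fileStream.Position < fileStream.Length )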
{
bytes.Add(binaryReader.ReadByte());
bool newLine = bytes.Count > 2
&& bytes[bytes.Count - 3] == 0x0d
&& bytes[bytes.Count - 2] == 0x0d
&& bytes[bytes.Count - 1] == 0x0a;
if ( newLine )
{
yield return Encoding.UTF8.GetString(bytes.Take(bytes.Count - 3).ToArray());
bytes.Clear();
}
}
if ( bytes.Count > 0 )
yield return Encoding.UTF8.GetString(bytes.ToArray());
}
}
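If you would rather not deal with bytes and encodings yourself, a character-based variant is possible. The following is only a sketch under the assumption that every line in the file ends with a line feed, optionally preceded by one or more carriage returns (so both "\r\n" and "\r\r\n" count as a single line break). LenientLineReader is a hypothetical name, not an existing API.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LenientLineReader
{
    // Yields one string per line; a line ends at '\n', and any run of
    // '\r' characters directly before that '\n' is dropped.
    public static IEnumerable<string> ReadLines(TextReader reader)
    {
        var sb = new StringBuilder();
        int c;
        // Read returns -1 at the end of the input
        while ((c = reader.Read()) != -1)
        {
            if (c == '\n')
            {
                int end = sb.Length;
                while (end > 0 && sb[end - 1] == '\r')
                {
                    end--;
                }
                yield return sb.ToString(0, end);
                sb.Clear();
            }
            else
            {
                sb.Append((char)c);
            }
        }
        // Text after the last '\n', if any
        if (sb.Length > 0)
        {
            yield return sb.ToString();
        }
    }
}
```

Passing a StreamReader opened with the right encoding lets the reader handle decoding, and blank lines between paragraphs are preserved as empty strings.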

A very easy solution (not optimized for memory consumption) could be:
var allLines = File.ReadAllText(gstPathFileName)
.Split('\n');
Then, if you need to remove trailing carriage return characters, do:
for(var i = 0; i < allLines.Length; ++i)
allLines[i] = allLines[i].TrimEnd('\r');
You can put the relevant processing into that for loop if you want. Or, if you do not want to keep the array, use this instead of the for:
foreach(var line in allLines.Select(x => x.TrimEnd('\r')))
{
// use 'line' here ...
}

This code works well ... reads every char.
char[] acBuf = null;
int iReadLength = 100;
while (srFile.Peek() >= 0) {
acBuf = new char[iReadLength];
// Read may return fewer chars than requested, so keep the actual count
int iRead = srFile.Read(acBuf, 0, iReadLength);
string s = new string(acBuf, 0, iRead);
}

Related

Counting total characters of a file

Hi I'm pretty new to C# and trying to do some exercises to get up to speed with it. I'm trying to count the total number of characters in a file but it's stopping after the first word, would someone be able to tell me where I am going wrong? Thanks in advance
public void TotalCharacterCount()
{
string str;
int count, i, l;
count = i = 0;
StreamReader reader = File.OpenText("C:\\Users\\Lewis\\file.txt");
str = reader.ReadLine();
l = str.Length;
while (str != null && i < l)
{
count++;
i++;
str = reader.ReadLine();
}
reader.Close();
Console.Write("Number of characters in the file is : {0}\n", count);
}
If you want to know the size of a file:
long length = new System.IO.FileInfo("C:\\Users\\Lewis\\file.txt").Length;
Console.Write($"Size of the file in bytes : {length}");
If you want to count characters to play around with C#, then here is some sample code that might help you
int totalCharacters = 0;
// Using will do the reader.Close for you.
using (StreamReader reader = File.OpenText("C:\\Users\\Lewis\\file.txt"))
{
string str = reader.ReadLine();
while (str != null)
{
totalCharacters += str.Length;
str = reader.ReadLine();
}
}
// If you add the $ in front of the string, then you can interpolate expressions
Console.Write($"Number of characters in the file is : {totalCharacters}");
it's stopping after the first word
That happens because the loop condition includes && i < l. You increment i on every iteration but never update l after the first line, so the loop ends after as many iterations as the first line has characters. (By the way, l is not a good variable name; I was sure it was 1, not l.)
Then, if you need the total number of characters in the file, you could read the whole file into a string variable and take its Length:
var count = File.ReadAllText(path).Length;
The Length property of FileInfo gives the size of the file in bytes, which is not necessarily equal to the number of characters (depending on the encoding, a character may take more than one byte).
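A small sketch of the difference between bytes and characters, assuming the text is stored as UTF-8. LengthDemo and its two methods are hypothetical names used for illustration.

```csharp
using System.Text;

public static class LengthDemo
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CharCount(string s)
    {
        return s.Length;
    }

    // Number of bytes the same text occupies when encoded as UTF-8.
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }
}
```

For the string "h\u00e9llo" (the word "héllo"), CharCount gives 5 while Utf8ByteCount gives 6, because é takes two bytes in UTF-8.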
And regarding the way you read - it also depends whether you want to count new line symbols and others or not.
Consider the following sample
static void Main(string[] args)
{
var sampleWithEndLine = "a\r\n";
var length1 = "a".Length;
var length2 = sampleWithEndLine.Length;
var length3 = @"a
".Length;
Console.WriteLine($"First sample: {length1}");
Console.WriteLine($"Second sample: {length2}");
Console.WriteLine($"Third sample: {length3}");
var totalCharacters = 0;
File.WriteAllText("sample.txt", sampleWithEndLine);
using(var reader = File.OpenText("sample.txt"))
{
string str = reader.ReadLine();
while (str != null)
{
totalCharacters += str.Length;
str = reader.ReadLine();
}
}
Console.WriteLine($"Second sample read with stream reader: {totalCharacters}");
Console.ReadKey();
}
For the second sample, Length returns 3 because the string really contains three characters, while the StreamReader loop gives 1. As the ReadLine documentation says: "The string that is returned does not contain the terminating carriage return or line feed. The returned value is null if the end of the input stream is reached."
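If the line terminators should be counted as well, one option is to skip ReadLine and read blocks of characters instead. This is a minimal sketch; CharCounter is a hypothetical name, and the buffer size of 4096 is an arbitrary choice.

```csharp
using System.IO;

public static class CharCounter
{
    // Counts every character the reader delivers, including '\r' and '\n'.
    public static long CountChars(TextReader reader)
    {
        var buffer = new char[4096];
        long total = 0;
        int read;
        // Read returns the number of chars placed in the buffer, 0 at the end
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            total += read;
        }
        return total;
    }
}
```

With a StringReader over "a\r\n" this returns 3, matching the Length of the second sample above. For a file, pass the reader returned by File.OpenText.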

Read lines with specific NewLine char sequence with StreamReader.ReadLine

Sometimes we need to read lines from a stream while treating only one specific character sequence as the newline (CRLF, but not a lone CR or LF).
StreamReader.ReadLine, as documented, treats CRLF, CR, and LF all as newline sequences. That may be unacceptable if a line can contain a single CR ("\r") or a single LF ("\n") as meaningful data.
We need the ability to read a stream line by line, delimited by a specific character sequence.
Here is a method that reads one line from the stream and returns it as a string:
public static string ReadLineWithFixedNewlineDelimeter(StreamReader reader, string delim)
{
if (reader.EndOfStream)
return null;
if (string.IsNullOrEmpty(delim))
{
return reader.ReadToEnd();
}
var sb = new StringBuilder();
var delimCandidatePosition = 0;
while (!reader.EndOfStream && delimCandidatePosition < delim.Length)
{
var c = (char)reader.Read();
if (c == delim[delimCandidatePosition])
{
delimCandidatePosition ++;
}
else
{
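// Restart the match, but the current char may itself begin the delimiter
// (needed for input such as "\r\r\n" with delim "\r\n"; delimiters whose
// prefix repeats internally would need a full prefix-table match)
delimCandidatePosition = c == delim[0] ? 1 : 0;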
}
sb.Append(c);
}
return sb.ToString(0, sb.Length - (delimCandidatePosition == delim.Length ? delim.Length : 0));
}
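The same idea can be written as an iterator that checks, after each character, whether the text read so far ends with the delimiter. This is a sketch rather than a tuned implementation; DelimitedReader and ReadDelimited are hypothetical names. Comparing the tail of the buffer keeps the matching correct even when the delimiter's first character repeats in the input.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DelimitedReader
{
    // Yields the pieces of the input separated by the exact sequence 'delim'.
    public static IEnumerable<string> ReadDelimited(TextReader reader, string delim)
    {
        if (string.IsNullOrEmpty(delim))
        {
            throw new ArgumentException("Delimiter must not be empty.", nameof(delim));
        }
        return Iterate(reader, delim);
    }

    private static IEnumerable<string> Iterate(TextReader reader, string delim)
    {
        var sb = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            sb.Append((char)c);
            if (EndsWith(sb, delim))
            {
                // Return the text without the delimiter and start a new piece
                yield return sb.ToString(0, sb.Length - delim.Length);
                sb.Clear();
            }
        }
        // Text after the last delimiter, if any
        if (sb.Length > 0)
        {
            yield return sb.ToString();
        }
    }

    // True when the last delim.Length chars of sb equal delim (ordinal compare).
    private static bool EndsWith(StringBuilder sb, string delim)
    {
        if (sb.Length < delim.Length)
        {
            return false;
        }
        int offset = sb.Length - delim.Length;
        for (int i = 0; i < delim.Length; i++)
        {
            if (sb[offset + i] != delim[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

With the delimiter "\r\n", the input "a\rb\nc\r\r\nd" comes back as two pieces, "a\rb\nc\r" and "d": the lone CR and LF stay inside the first piece as data.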

Stream read line

I read a stream line by line with sr.ReadLine(). My code treats both \r\n and \n as line endings.
StreamReader sr = new System.IO.StreamReader(sPath, enc);
while (!sr.EndOfStream)
{
// reading 1 line of datafile
string sLine = sr.ReadLine();
...
How can I tell the code (instead of using the universal sr.ReadLine()) to treat only a full \r\n as a new line, and not a lone \n?
It is not possible to do this using StreamReader.ReadLine.
As per msdn:
A line is defined as a sequence of characters followed by a line feed
("\n"), a carriage return ("\r"), or a carriage return immediately
followed by a line feed ("\r\n"). The string that is returned does not
contain the terminating carriage return or line feed. The returned
value is null if the end of the input stream is reached.
So you have to read the stream character by character and return a line only when you have captured \r\n.
EDIT
Here is some code sample
private static IEnumerable<string> ReadLines(StreamReader stream)
{
StringBuilder sb = new StringBuilder();
int symbol;
// Read returns -1 at end of stream; test it before using the value as a char
while ((symbol = stream.Read()) != -1)
{
if (symbol == 13 && stream.Peek() == 10)
{
stream.Read();
string line = sb.ToString();
sb.Clear();
yield return line;
}
else
sb.Append((char)symbol);
}
yield return sb.ToString();
}
You can use it like
foreach (string line in ReadLines(stream))
{
//do something
}
You cannot do it with ReadLine, but you can do this instead:
stream.ReadToEnd().Split(new[] {"\r\n"}, StringSplitOptions.None)
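A short sketch of what that Split call does to a lone \n, using an in-memory string. CrLfSplitDemo and SplitOnCrLf are hypothetical names.

```csharp
using System;

public static class CrLfSplitDemo
{
    // Splits only on the two-character sequence "\r\n".
    public static string[] SplitOnCrLf(string text)
    {
        return text.Split(new[] { "\r\n" }, StringSplitOptions.None);
    }
}
```

SplitOnCrLf("a\nb\r\nc") returns two elements, "a\nb" and "c": the lone \n stays inside the first element. Keep in mind that ReadToEnd loads the whole stream into memory first.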
For simplification, let's work over a byte array:
static int NumberOfNewLines(byte[] data)
{
int count = 0;
for (int i = 0; i < data.Length - 1; i++)
{
if (data[i] == '\r' && data[i + 1] == '\n')
count++;
}
return count;
}
If you care about efficiency, optimize away, but this should work.
You can get the bytes of a file by using System.IO.File.ReadAllBytes(string path).

String array throws OutOfMemoryException for large multi-line entries

In a Windows Forms C# app, I have a textbox where users paste log data, and it sorts it. I need to check each line individually, so I split the input by new line, but if there are a lot of lines, more than 100,000 or so, it throws an OutOfMemoryException.
My code looks like this:
StringSplitOptions splitOptions = new StringSplitOptions();
if(removeEmptyLines_CB.Checked)
splitOptions = StringSplitOptions.RemoveEmptyEntries;
else
splitOptions = StringSplitOptions.None;
List<string> outputLines = new List<string>();
foreach(string line in input_TB.Text.Split(new string[] { "\r\n", "\n" }, splitOptions))
{
if(line.Contains(inputCompare_TB.Text))
outputLines.Add(line);
}
output_TB.Text = string.Join(Environment.NewLine, outputLines);
The problem comes from when I split the textbox text by line, here input_TB.Text.Split(new string[] { "\r\n", "\n" }
Is there a better way to do this? I've thought about taking the first X amount of text, truncating at a new line and repeat until everything has been read, but this seems tedious. Or is there a way to allocate more memory for it?
Thanks,
Garrett
Update
Thanks to Attila, I came up with this and it seems to work. Thanks
StringReader reader = new StringReader(input_TB.Text);
string line;
while((line = reader.ReadLine()) != null)
{
if(line.Contains(inputCompare_TB.Text))
outputLines.Add(line);
}
output_TB.Text = string.Join(Environment.NewLine, outputLines);
The better way to do this would be to extract and process one line at a time, and use a StringBuilder to create the result:
StringBuilder outputTxt = new StringBuilder();
string txt = input_TB.Text;
int txtIndex = 0;
while (txtIndex < txt.Length) {
int startLineIndex = txtIndex;
GetMore:
while (txtIndex < txt.Length && txt[txtIndex] != '\r' && txt[txtIndex] != '\n') {
txtIndex++;
}
if (txtIndex < txt.Length && txt[txtIndex] == '\r' && (txtIndex == txt.Length-1 || txt[txtIndex+1] != '\n')) {
txtIndex++;
goto GetMore;
}
string line = txt.Substring(startLineIndex, txtIndex-startLineIndex);
if (line.Contains(inputCompare_TB.Text)) {
if (outputTxt.Length > 0)
outputTxt.Append(Environment.NewLine);
outputTxt.Append(line);
}
txtIndex++;
}
output_TB.Text = outputTxt.ToString();
Pre-emptive comment: someone will object to the goto - but it is what's needed here, the alternatives are much more complex (reg exp for example), or fake the goto with another loop and continue or break
Using a StringReader to split the lines is a much cleaner solution. Note that ReadLine treats a lone \r as a line break too, in addition to \r\n and \n:
StringReader reader = new StringReader(input_TB.Text);
StringBuilder outputTxt = new StringBuilder();
string compareTxt = inputCompare_TB.Text;
string line;
while((line = reader.ReadLine()) != null) {
if (line.Contains(compareTxt)) {
if (outputTxt.Length > 0)
outputTxt.Append(Environment.NewLine);
outputTxt.Append(line);
}
}
output_TB.Text = outputTxt.ToString();
Split has to duplicate the memory of the original text, plus the overhead of a string object for each line. If this causes memory issues, a reliable way of processing the input is to parse one line at a time.
I guess the only way to do this on large text files is to open the file manually and use a StreamReader. Here is an example how to do this.
You can avoid creating strings for all lines and the array by creating the string for each line one at a time:
var eol = new[] { '\r', '\n' };
var pos = 0;
while (pos < input.Length)
{
var i = input.IndexOfAny(eol, pos);
if (i < 0)
{
i = input.Length;
}
if (i != pos)
{
var line = input.Substring(pos, i - pos);
// process line
}
pos = i + 1;
}
On the other hand, this article argues that the point is that the Split method is implemented poorly. Read it and draw your own conclusions.
Like Attila said, you have to parse line by line.
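As a sketch of line-by-line parsing that keeps the question's separators ("\r\n" and "\n") and also keeps empty lines, the text can be scanned lazily with IndexOf. LineScanner and EnumerateLines are hypothetical names; unlike Split, this version does not produce a trailing empty entry when the text ends with a line break.

```csharp
using System.Collections.Generic;

public static class LineScanner
{
    // Lazily yields lines separated by "\n" or "\r\n"; a lone '\r' is kept as data.
    public static IEnumerable<string> EnumerateLines(string input)
    {
        int pos = 0;
        while (pos <= input.Length)
        {
            int i = input.IndexOf('\n', pos);
            if (i < 0)
            {
                // No more line feeds: return what is left, if anything
                if (pos < input.Length)
                {
                    yield return input.Substring(pos);
                }
                yield break;
            }
            // Drop a '\r' that sits directly before the '\n'
            int end = (i > pos && input[i - 1] == '\r') ? i - 1 : i;
            yield return input.Substring(pos, end - pos);
            pos = i + 1;
        }
    }
}
```

Only one line string is alive at a time, so a foreach over EnumerateLines(input_TB.Text) can filter lines into a StringBuilder without building the full array that Split would create.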

C# - Read External CSV File Character by Character

What is the easiest way to read a file character by character in C#?
Currently, I am reading line by line by calling System.io.file.ReadLine(). I see that there is a Read() function but it doesn't return a character...
I would also like to know how to detect the end of a line using such an approach...The input file in question is a CSV file....
Open a TextReader (e.g. by File.OpenText - note that File is a static class, so you can't create an instance of it) and repeatedly call Read. That returns int rather than char so it can also indicate end of file:
int readResult = reader.Read();
if (readResult != -1)
{
char nextChar = (char) readResult;
// ...
}
Or to loop:
int readResult;
while ((readResult = reader.Read()) != -1)
{
char nextChar = (char) readResult;
// ...
}
Or for more funky goodness:
public static IEnumerable<char> ReadCharacters(string filename)
{
using (var reader = File.OpenText(filename))
{
int readResult;
while ((readResult = reader.Read()) != -1)
{
yield return (char) readResult;
}
}
}
...
foreach (char c in ReadCharacters("foo.txt"))
{
...
}
Note that by default, File.OpenText will use an encoding of UTF-8. Specify an encoding explicitly if that isn't what you want.
EDIT: To find the end of a line, you'd check whether the character is \n... you'd potentially want to handle \r specially too, if this is a Windows text file.
But if you want each line, why not just call ReadLine? You can always iterate over the characters in the line afterwards...
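For the end-of-line check while reading character by character, a minimal sketch could look like this. It assumes \r\n, a lone \r, and a lone \n each mark one line end, and it ignores CSV quoting (a quoted field may legally contain a line break). LineEndCounter is a hypothetical name.

```csharp
using System.IO;

public static class LineEndCounter
{
    // Counts line ends, treating "\r\n" as a single one.
    public static int CountLineEnds(TextReader reader)
    {
        int count = 0;
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c == '\r')
            {
                // Consume the '\n' of a "\r\n" pair so it is not counted twice
                if (reader.Peek() == '\n')
                {
                    reader.Read();
                }
                count++;
            }
            else if (c == '\n')
            {
                count++;
            }
        }
        return count;
    }
}
```

For the text "a,b\r\nc,d\ne" this returns 2. The place where count is incremented is where per-line processing would go in a real parser.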
Here is a snippet from msdn
using (StreamReader sr = new StreamReader(path))
{
char[] c = null;
while (sr.Peek() >= 0)
{
c = new char[1];
sr.Read(c, 0, c.Length);
// do something with c[0]
}
}
