At the moment, I am using C#'s built-in StreamReader to read lines from a file.
As is well known, if the file ends with a newline, the reader does not report a final blank line: a line must contain text, and the newline character at the end is optional for the last line.
The effect on some of my files is that I lose whitespace at the end of the file (which matters, for reasons I don't want to get into) each time my program consumes and re-writes them.
Is there an implementation of TextReader available, either as part of the framework or as a NuGet package, that provides the ReadLine functionality but retains the newline characters (whichever they may be) as part of the line, so that I can reproduce the output exactly? I would prefer not to roll my own method for consuming line-based input.
Edit: it should be noted that I cannot read the whole file into memory.
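For illustration, a minimal sketch of the behaviour in question, using an in-memory stream rather than a file:
using (var sr = new StreamReader(new MemoryStream(Encoding.UTF8.GetBytes("a\nb\n"))))
{
    string line;
    while ((line = sr.ReadLine()) != null)
        Console.WriteLine("'" + line + "'");
}
// Prints 'a' then 'b'; the trailing "\n" does not produce a third, empty line,
// and the terminators themselves are gone from the returned strings.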
You can combine ReadToEnd() with Split to get the content of your file into an array, including the empty lines.
I don't recommend using ReadToEnd() if your file is big.
For example:
string[] lines;
using (StreamReader sr = new StreamReader(path))
{
    var wholeFile = sr.ReadToEnd();
    lines = wholeFile.Split('\n');
}
// Reads lines one at a time, keeping each line's terminator ("\n" or "\r\n")
// so the file can be reproduced exactly. Note: Read() returns -1 at end of
// stream; comparing the int to -1 is safer than casting to char and
// comparing against '\uffff', which is a legal character value.
private IEnumerable<string> EnumerateLines(string path)
{
    using (var sr = new StreamReader(path))
    {
        var sb = new StringBuilder();
        int c;
        while ((c = sr.Read()) != -1)
        {
            sb.Append((char)c);
            if (c == '\n')
            {
                // '\n' ends a line whether or not it follows '\r',
                // so both LF and CRLF endings are preserved as-is.
                yield return sb.ToString();
                sb.Clear();
            }
        }
        // The last line may have no terminator at all.
        if (sb.Length > 0)
            yield return sb.ToString();
    }
}
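A usage sketch (file names are hypothetical): because each yielded line carries its own terminator, writing the lines back with Write rather than WriteLine reproduces the original content, provided the same encoding is used.
using (var writer = new StreamWriter("copy.txt"))
{
    foreach (string line in EnumerateLines("input.txt"))
        writer.Write(line); // terminators are already part of each line
}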
Related
I'm dealing with a large text file (6.5 GB) and I'm reading it using a StreamReader.
The file has portions that are separated using CRLF, however it sometimes uses just a LF.
Is there an easy way to determine whether the line read in ended with just a line feed '\n' or a carriage return plus line feed '\r\n'? StreamReader.ReadLine() seems to treat them both the same, stripping the terminator either way.
You can just normalize both CRLF and lone LF to a single ending as you copy the file, for example to CRLF:
using (var reader = new StreamReader(@"yourfile"))
using (var writer = new StreamWriter(@"outfile")) // or any other TextWriter
{
    string currentLine;
    while ((currentLine = reader.ReadLine()) != null)
    {
        // ReadLine() strips the terminator ('\n' or "\r\n"), so write the
        // line back with the ending you want. Note this also appends a
        // terminator to the last line even if the source file had none.
        writer.Write(currentLine + "\r\n");
    }
}
If I asked the question "how to read a file into a string", the answer would be obvious. However, here is the catch: the CR/LF characters must be preserved.
The problem is that File.ReadAllText strips those characters. StreamReader.ReadToEnd just converted LF into CR for me, which led to a long investigation of where I had a bug in pretty obvious code ;-)
So, in short, if I have a file containing foo\n\r\nbar, I would like to get foo\n\r\nbar back (i.e. exactly the same content), not foo bar, foobar, or foo\n\n\nbar. Is there some ready-to-use way in the .NET space?
The outcome should always be a single string containing the entire file.
Are you sure that those methods are the culprits that are stripping out your characters?
I tried to write up a quick test; StreamReader.ReadToEnd preserves all newline characters.
string str = "foo\n\r\nbar";
using (Stream ms = new MemoryStream(Encoding.UTF8.GetBytes(str)))
using (StreamReader sr = new StreamReader(ms, Encoding.UTF8))
{
    string str2 = sr.ReadToEnd();
    Console.WriteLine(string.Join(",", str2.Select(c => (int)c)));
}
// Output: 102,111,111,10,13,10,98,97,114
//         f   o   o   \n \r \n b  a  r
An identical result is achieved when writing to and reading from a temporary file:
string str = "foo\n\r\nbar";
string temp = Path.GetTempFileName();
File.WriteAllText(temp, str);
string str2 = File.ReadAllText(temp);
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
It appears that your newlines are getting lost elsewhere.
This piece of code will preserve LF and CR:
string r = File.ReadAllText(#".\TestData\TR120119.TRX", Encoding.ASCII);
The outcome should always be a single string containing the entire file.
It takes two hops. First one is File.ReadAllBytes() to get all the bytes in the file. Which doesn't try to translate anything, you get the raw data in the file so the weirdo line-endings are preserved as-is.
But that's bytes, you asked for a string. So second hop is to apply Encoding.GetString() to convert the bytes to a string. The one thing you have to do is pick the right Encoding class, the one that matches the encoding used by the program that wrote the file. Given that the file is pretty messed up if it contains \n\r\n sequences, and you didn't document anything else about the file, your best bet is to use Encoding.Default. Tweak as necessary.
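A minimal sketch of those two hops (the path variable and the Encoding.Default choice are assumptions to tweak):
byte[] bytes = File.ReadAllBytes(path);          // hop 1: raw bytes, nothing translated
string text = Encoding.Default.GetString(bytes); // hop 2: decode with the encoding that wrote the file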
You can read the contents of a file using File.ReadAllLines, which will return an array of the lines. Then use String.Join to merge the lines together using a separator.
string[] lines = File.ReadAllLines(#"C:\Users\User\file.txt");
string allLines = String.Join("\r\n", lines);
Note that this will lose the precision of the actual line terminator characters. For example, if the lines end in only \n or \r, the resulting string allLines will have replaced them with \r\n line terminators.
There are of course other ways of achieving this without losing the true EOL terminator; however, ReadAllLines is handy in that it can detect many types of text encoding by itself, and it also takes up very few lines of code.
ReadAllText doesn't return carriage returns.
This method opens a file, reads each line of the file, and then adds each line as an element of a string. It then closes the file. A line is defined as a sequence of characters followed by a carriage return ('\r'), a line feed ('\n'), or a carriage return immediately followed by a line feed. The resulting string does not contain the terminating carriage return and/or line feed.
From MSDN - https://msdn.microsoft.com/en-us/library/ms143368(v=vs.110).aspx
This is similar to the accepted answer, but I wanted to be more to the point. sr.ReadToEnd() will read the bytes as desired:
string myFilePath = #"C:\temp\somefile.txt";
string myEvents = String.Empty;
FileStream fs = new FileStream(myFilePath, FileMode.Open);
StreamReader sr = new StreamReader(fs);
myEvents = sr.ReadToEnd();
sr.Close();
fs.Close();
You could also do those in cascaded using statements. But I wanted to describe how the way you write to that file in the first place determines how to read the content back from the myEvents string, and that might really be where the problem lies. I wrote to my file like this:
using System.Reflection;
using System.IO;

private static void RecordEvents(string someEvent)
{
    string folderLoc = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
    if (!folderLoc.EndsWith(@"\")) folderLoc += @"\";
    folderLoc = folderLoc.Replace(@"\\", @"\"); // replace double-slashes with single slashes
    string myFilePath = folderLoc + "myEventFile.txt";
    if (!File.Exists(myFilePath))
        File.Create(myFilePath).Close(); // must .Close() since it will conflict with opening the FileStream, below
    FileStream fs = new FileStream(myFilePath, FileMode.Append);
    StreamWriter sr = new StreamWriter(fs);
    sr.Write(someEvent + Environment.NewLine);
    sr.Close();
    fs.Close();
}
Then I could use the code farther above to get the string of the contents. Because I was going further and looking for the individual strings, I put this code after THAT code, up there:
if (myEvents != String.Empty) // we have something
{
    // '\u2660' is ♠ -- I could have chosen any delimiter I did not
    // expect to find in my text. (The spade character is U+2660, so it
    // needs the hex escape '\u2660', not the decimal cast (char)2660.)
    myEvents = myEvents.Replace(Environment.NewLine, '\u2660'.ToString());
    string[] eventArray = myEvents.Split('\u2660');
    foreach (string s in eventArray)
    {
        if (!String.IsNullOrEmpty(s))
        {
            // do whatever with the individual strings from your file
        }
    }
}
And this worked fine. So I know that myEvents had to have the Environment.NewLine characters preserved, because I was able to replace them with '\u2660' and do a .Split() on that string using that character to divide it into the individual segments.
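As a side note, the delimiter round-trip can be skipped entirely, since String.Split has an overload that takes string separators directly; a sketch of the shorter form:
string[] eventArray = myEvents.Split(new[] { Environment.NewLine },
                                     StringSplitOptions.RemoveEmptyEntries);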
I'm working in C# and I have a large text file (75 MB).
I want to save the lines that match a regular expression.
I tried reading the file with a StreamReader and ReadToEnd, but it takes 400 MB of RAM and, when used again, throws an out-of-memory exception.
I then tried using File.ReadAllLines():
string[] lines = File.ReadAllLines("file");
StringBuilder specialLines = new StringBuilder();
foreach (string line in lines)
    if (regex.IsMatch(line)) // regex: whatever pattern is being matched
        specialLines.AppendLine(line);
This is all great, but when my function ends the memory taken doesn't clear, and I'm left with 300 MB of used memory. Only when calling the function again and executing the line
string[] lines = File.ReadAllLines("file");
do I see the memory clear down to 50 MB, give or take, and then reallocate back to 200 MB.
How can I clear this memory, or get the lines I need in a different way?
using (var file = File.OpenRead("myfile.txt"))
using (var reader = new StreamReader(file))
{
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();
        // evaluate the line here.
    }
}
You need to stream the text instead of loading the whole file in memory. Here's a way to do it, using an extension method and LINQ:
static class ExtensionMethods
{
    public static IEnumerable<string> EnumerateLines(this TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
...
var regex = new Regex(..., RegexOptions.Compiled);
using (var reader = new StreamReader(fileName))
{
    var specialLines =
        reader.EnumerateLines()
              .Where(line => regex.IsMatch(line))
              .Aggregate(new StringBuilder(),
                         (sb, line) => sb.AppendLine(line));
}
You can use StreamReader.ReadLine to read the file line by line and save only those lines that you need.
You should use the enumerator pattern to keep your memory footprint low, in case your file is huge.
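A minimal sketch of that pattern (assuming regex holds the compiled pattern): File.ReadLines streams lines lazily, so only one line needs to be in memory at a time.
var specialLines = new StringBuilder();
foreach (string line in File.ReadLines("file"))
{
    if (regex.IsMatch(line))
        specialLines.AppendLine(line);
}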
I am trying to read some text files, where each line needs to be processed. At the moment I am just using a StreamReader, and then reading each line individually.
I am wondering whether there is a more efficient way (in terms of LoC and readability) to do this using LINQ without compromising operational efficiency. The examples I have seen involve loading the whole file into memory, and then processing it. In this case however I don't believe that would be very efficient. In the first example the files can get up to about 50k, and in the second example, not all lines of the file need to be read (sizes are typically < 10k).
You could argue that nowadays it doesn't really matter for these small files; however, I believe that sort of approach leads to inefficient code.
First example:
// Open file
using(var file = System.IO.File.OpenText(_LstFilename))
{
// Read file
while (!file.EndOfStream)
{
String line = file.ReadLine();
// Ignore empty lines
if (line.Length > 0)
{
// Create addon
T addon = new T();
addon.Load(line, _BaseDir);
// Add to collection
collection.Add(addon);
}
}
}
Second example:
// Open file
using (var file = System.IO.File.OpenText(datFile))
{
// Compile regexs
Regex nameRegex = new Regex("IDENTIFY (.*)");
while (!file.EndOfStream)
{
String line = file.ReadLine();
// Check name
Match m = nameRegex.Match(line);
if (m.Success)
{
_Name = m.Groups[1].Value;
// Remove me when other values are read
break;
}
}
}
You can write a LINQ-based line reader pretty easily using an iterator block:
static IEnumerable<SomeType> ReadFrom(string file) {
    string line;
    using (var reader = File.OpenText(file)) {
        while ((line = reader.ReadLine()) != null) {
            SomeType newRecord = ParseLine(line); // parse the line into a record
            yield return newRecord;
        }
    }
}
or to make Jon happy:
static IEnumerable<string> ReadFrom(string file) {
    string line;
    using (var reader = File.OpenText(file)) {
        while ((line = reader.ReadLine()) != null) {
            yield return line;
        }
    }
}
...
var typedSequence = from line in ReadFrom(path)
                    let record = ParseLine(line)
                    where record.Active // for example
                    select record.Key;
then you have ReadFrom(...) as a lazily evaluated sequence without buffering, perfect for Where etc.
Note that if you use OrderBy or the standard GroupBy, it will have to buffer the data in memory; if you need grouping and aggregation, "PushLINQ" has some fancy code to allow you to perform aggregations on the data while discarding it (no buffering). Jon's explanation is here.
It's simpler to read a line and check whether or not it's null than to check for EndOfStream all the time.
However, I also have a LineReader class in MiscUtil which makes all of this a lot simpler - basically it exposes a file (or a Func<TextReader>) as an IEnumerable<string>, which lets you do LINQ stuff over it. So you can do things like:
var query = from file in Directory.GetFiles(".", "*.log")
            from line in new LineReader(file)
            where line.Length > 0
            select new AddOn(line); // or whatever
The heart of LineReader is this implementation of IEnumerable<string>.GetEnumerator:
public IEnumerator<string> GetEnumerator()
{
    using (TextReader reader = dataSource())
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
Almost all the rest of the source is just giving flexible ways of setting up dataSource (which is a Func<TextReader>).
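For instance, a hypothetical sketch of the Func<TextReader> form described above, which works for any text source, not just files:
var lines = new LineReader(() => new StringReader("a\nb\nc"));
foreach (string line in lines)
{
    Console.WriteLine(line);
}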
Since .NET 4.0, the File.ReadLines() method is available.
int count = File.ReadLines(filepath).Count(line => line.StartsWith(">"));
NOTE: You need to watch out for the IEnumerable<T> solution, as it will result in the file being open for the duration of processing.
For example, with Marc Gravell's response:
foreach (var record in ReadFrom("myfile.csv")) {
    DoLongProcessOn(record);
}
the file will remain open for the whole of the processing.
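If that is a concern, one mitigation (a sketch, traded against holding everything in memory) is to materialize the sequence first, so the file is closed before the slow processing starts:
var records = ReadFrom("myfile.csv").ToList(); // file is closed once this returns
foreach (var record in records) {
    DoLongProcessOn(record);
}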
Thanks all for your answers! I decided to go with a mixture, mainly focusing on Marc's though, as I will only need to read lines from a file. I guess you could argue separation is needed everywhere, but heh, life is too short!
Regarding the keeping the file open, that isn't going to be an issue in this case, as the code is part of a desktop application.
Lastly I noticed you all used lowercase string. I know in Java there is a difference between capitalised and non capitalised string, but I thought in C# lowercase string was just a reference to capitalised String?
public void Load(AddonCollection<T> collection)
{
    // read from file
    var query =
        from line in LineReader(_LstFilename)
        where line.Length > 0
        select CreateAddon(line);

    // add results to collection
    collection.AddRange(query);
}

protected T CreateAddon(String line)
{
    // create addon
    T addon = new T();
    addon.Load(line, _BaseDir);
    return addon;
}

protected static IEnumerable<String> LineReader(String fileName)
{
    String line;
    using (var file = System.IO.File.OpenText(fileName))
    {
        // read each line, ensuring not null (EOF)
        while ((line = file.ReadLine()) != null)
        {
            // return trimmed line
            yield return line.Trim();
        }
    }
}
I'd hate to reinvent something that has already been written, so I'm wondering if there is a ReadWord() function somewhere in the .NET Framework that extracts words from text delimited by whitespace and line breaks.
If not, do you have an implementation that you'd like to share?
string data = "Four score and seven years ago";
List<string> words = new List<string>();
WordReader reader = new WordReader(data);

while (true)
{
    string word = reader.ReadWord();
    if (string.IsNullOrEmpty(word)) break;

    // additional parsing logic goes here
    words.Add(word);
}
Not that I'm aware of directly. If you don't mind getting them all in one go, you could use a regular expression:
Regex wordSplitter = new Regex(@"\W+");
string[] words = wordSplitter.Split(data);
If you have leading/trailing whitespace you'll get an empty string at the beginning or end, but you could always call Trim first.
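A quick sketch of that effect and the remedy:
string[] raw   = wordSplitter.Split("  four score  ");        // ["", "four", "score", ""]
string[] clean = wordSplitter.Split("  four score  ".Trim()); // ["four", "score"]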
A different option is to write a method which reads a word based on a TextReader. It could even be an extension method if you're using .NET 3.5. Sample implementation:
using System;
using System.IO;
using System.Text;

public static class Extensions
{
    public static string ReadWord(this TextReader reader)
    {
        StringBuilder builder = new StringBuilder();

        int c;
        // Ignore any trailing whitespace from previous reads
        while ((c = reader.Read()) != -1)
        {
            if (!char.IsWhiteSpace((char) c))
            {
                break;
            }
        }

        // Finished?
        if (c == -1)
        {
            return null;
        }
        builder.Append((char) c);

        while ((c = reader.Read()) != -1)
        {
            if (char.IsWhiteSpace((char) c))
            {
                break;
            }
            builder.Append((char) c);
        }
        return builder.ToString();
    }
}
public class Test
{
    static void Main()
    {
        // Give it a few challenges :)
        string data = @"Four score and
              seven years ago ";
        using (TextReader reader = new StringReader(data))
        {
            string word;
            while ((word = reader.ReadWord()) != null)
            {
                Console.WriteLine("'{0}'", word);
            }
        }
    }
}
Output:
'Four'
'score'
'and'
'seven'
'years'
'ago'
Not as such; however, you could use String.Split to split the string into an array of strings based on a delimiting character or string. You can also specify multiple characters or strings for the split.
If you'd prefer to do it without loading everything into memory, you could write your own stream class that splits words as it reads from a stream, but the above is a quick fix for word-splitting small amounts of data.
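A quick sketch of the Split approach for a small input:
string data = "Four score and seven\r\nyears ago";
string[] words = data.Split(new[] { ' ', '\t', '\r', '\n' },
                            StringSplitOptions.RemoveEmptyEntries);
// words: ["Four", "score", "and", "seven", "years", "ago"]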