How to obtain certain lines from a text file in c#? - c#

I'm working in C# and i got a large text file (75MB)
I want to save lines that match a regular expression
I tried reading the file with a streamreader and ReadToEnd, but it takes 400MB of ram
and when used again creates an out of memory exception.
I then tried using File.ReadAllLines():
string[] lines = File.ReadAllLines("file");
StringBuilder specialLines = new StringBuilder();
foreach (string line in lines)
if (match reg exp)
specialLines.append(line);
this is all great but when my function ends the memory taken doesnt clear and I'm
left with 300MB of used memory, only when recalling the function and executing the line:
string[] lines = File.ReadAllLines("file");
I see the memory clearing down to 50MB give or take and then reallocating back to 200MB
How can I clear this memory or get the lines I need in a different way ?

var file = File.OpenRead("myfile.txt");
var reader = new StreamReader(file);
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
//evaluate the line here.
}
reader.Dispose();
file.Dispose();

You need to stream the text instead of loading the whole file in memory. Here's a way to do it, using an extension method and Linq:
static class ExtensionMethods
{
public static IEnumerable<string> EnumerateLines(this TextReader reader)
{
string line;
while((line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
...
var regex = new Regex(..., RegexOptions.Compiled);
using (var reader = new StreamReader(fileName))
{
var specialLines =
reader.EnumerateLines()
.Where(line => regex.IsMatch(line))
.Aggregate(new StringBuilder(),
(sb, line) => sb.AppendLine(line));
}

You can use StreamReader#ReadLine to read file line-by-line and to save those lines that you need.

You should use the Enumerator pattern to keep your memory footprint low in case your file can be huge.

Related

C#: How to make Stream Reader to Int

I'm reading from a file with numbers and then when I try to convert it to an Int I get this error, System.FormatException: 'Input string was not in a correct format.' Reading the file works and I've tested all of that, it just seems to get stuck on this no matter what I try. This is what I've done so far:
StreamReader share_1 = new StreamReader("Share_1_256.txt");
string data_1 = share_1.ReadToEnd();
int intData1 = Int16.Parse(data_1);
And then if parse is in it doesn't print anything.
As we can see in your post, your input file contains not one number but several. So what you will need is to iterate through all lines of your file, then try the parsing for each lines of your string.
EDIT: The old code was using a external library. For raw C#, try:
using (StringReader reader = new StringReader(input))
{
string line;
while ((line = reader.ReadLine()) != null)
{
// Do something with the line
}
}
In addition, I encourage you to always parse string to number using the TryParse method, not the Parse one.
You can find some details and different implementations for that common problem in C#: C#: Looping through lines of multiline string
parser every single line
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
int intData1 = Int16.Parse(line);
}
You can simplify the code and get rid of StreamReader with a help of File class and Linq:
// Turn text file into IEnumerable<int>:
var data = File
.ReadLines("Share_1_256.txt")
.Select(line => int.Parse(line));
//TODO: add .OrderBy(item => item); if you want to sort items
// Loop over all numbers within file: 15, 1, 48, ..., 32
foreach (int item in data) {
//TODO: Put relevant code here, e.g. Console.WriteLine(item);
}

C# Parsing a text file misses the first line of every text file

I am making a number to music note value converter from text files and I am having trouble parsing text files in my C# code. The stream reader or parser seems to ignore every first line of text from different text files, I am learning from parsing CSV and assumed that I can parse text files the same way that I can parse CSV files. Here is my code:
static List<NoteColumns> ReadNoteValues(string fileName)
{
var allNotes = new List<NoteColumns>();
using (var reader = new StreamReader(fileName))
{
string line = "";
reader.ReadLine();
while ((line = reader.ReadLine()) != null)
{
string[] cols = line.Split('|');
var noteColumn = new NoteColumns();
if (line.StartsWith("-"))
{
continue;
}
double col;
if (double.TryParse(cols[0], out col))
{
noteColumn.QCol = col;
}
if (double.TryParse(cols[1], out col))
{
noteColumn.WCol = col;
}
}
}
return allNotes;
}
Here are the first 4 lines of one of my text file:
0.1|0.0|
0.0|0.1|
0.3|0.0|
0.1|0.0|
So every time that I have an output it will always skip the first line and move onto the next line. After it misses the first line it converts all the other values perfectly.
I have tried using a foreach loop but I end up getting an 'Index Out Of Range Exception'. Am I missing something/being stupid? Thanks for your time
You have a call to reader.ReadLine(); before your loop, which you don't assign to anything
It's because you have already added ReadLine() before the while loop which skips the first line
In order to avoid such errors (erroneous reader.ReadLine() in the beginning) get rid of Readers at all and query file with Linq (let .Net open and close the file, read it line after line and do all low level work for you):
using System.IO;
using System.Linq;
...
static List<NoteColumns> ReadNoteValues(string fileName) {
return File
.ReadLines(fileName)
.Where(line => !line.StartsWith('-'))
.Select(line => line.Split('|'))
.Select(cols => {
var noteColumn = new NoteColumns();
if (double.TryParse(cols[0], out var col))
noteColumn.QCol = col;
if (double.TryParse(cols[1], out var col))
noteColumn.WCol = col;
return noteColumn;
})
.ToList();
}

C# Insert text from one txt file to another txt file

I would like to insert text from one text file to another.
So for example I have a text file at C:\Users\Public\Test1.txt
first
second
third
forth
And i have a second text file at C:\Users\Public\Test2.txt
1
2
3
4
I want to insert Test2.txt into Test1.txt
The end result should be:
first
second
1
2
3
4
third
forth
It should be inserted at the third line.
So far I have this:
string strTextFileName = #"C:\Users\Public\test1.txt";
int iInsertAtLineNumber = 2;
string strTextToInsert = #"C:\Users\Public\test2.txt";
ArrayList lines = new ArrayList();
StreamReader rdr = new StreamReader(
strTextFileName);
string line;
while ((line = rdr.ReadLine()) != null)
lines.Add(line);
rdr.Close();
if (lines.Count > iInsertAtLineNumber)
lines.Insert(iInsertAtLineNumber,
strTextToInsert);
else
lines.Add(strTextToInsert);
StreamWriter wrtr = new StreamWriter(
strTextFileName);
foreach (string strNewLine in lines)
wrtr.WriteLine(strNewLine);
wrtr.Close();
However I get this when i run it:
first
second
C:\Users\Public\test2.txt
third
forth
Thanks in advance!
Instead of using StreamReaders/Writers, you can use methods from the File helper class.
const string textFileName = #"C:\Users\Public\test1.txt";
const string textToInsertFileName = #"C:\Users\Public\test2.txt";
const int insertAtLineNumber = 2;
List<string> fileContent = File.ReadAllLines(textFileName).ToList();
fileContent.InsertRange(insertAtLineNumber , File.ReadAllLines(textToInsertFileName));
File.WriteAllLines(textFileName, fileContent);
A List<string> is way more convenient than an ArrayList. I also renamed a couple of your variables (most notably textToInsertFileName, and removed the prefix cluttering your declarations, any modern IDE will tell you the datatype if you hover for half a second) and declared your constants with const.
Your original problem was related to the fact that you're never reading from strTextToInsert, looks like you thought it was already the text to insert where it's actually the filename.
Without changing your structure or types around too much you could create a method to read the lines
public ArrayList GetFileLines(string fileName)
{
var lines = new ArrayList();
using (var rdr = new StreamReader(fileName))
{
string line;
while ((line = rdr.ReadLine()) != null)
lines.Add(line);
}
return lines;
}
In the intiial question you were not reading the second file, in the following example it is a little easier to determine when you are reading the files and that each one is read:
string strTextFileName = #"C:\Users\Public\test1.txt";
int iInsertAtLineNumber = 2;
string strTextToInsert = #"C:\Users\Public\test2.txt";
ArrayList lines = new ArrayList();
lines.AddRange(GetFileLines(strTextFileName));
lines.InsertRange(iInsertAtLineNumber, GetFileLines(strTextToInsert));
using (var wrtr = new StreamWriter(strTextFileName))
{
foreach (string strNewLine in lines)
wrtr.WriteLine(strNewLine);
}
NOTE: if you wrap a reader or write in a using statement it will automatically close
I haven't tested this and it could be done better, but hopefully this gets you pointed in the right direction. This solution would completely rewrite the first file

What's the best way to get all the content in between two tagged lines of a file so that you can deserialize it?

I've been noticing that the following segment of code does not scale well for large files (I think that appending to the paneContent string is slow):
string paneContent = String.Empty;
bool lineFound = false;
foreach (string line in File.ReadAllLines(path))
{
if (line.Contains(tag))
{
lineFound = !lineFound;
}
else
{
if (lineFound)
{
paneContent += line;
}
}
}
using (TextReader reader = new StringReader(paneContent))
{
data = (PaneData)(serializer.Deserialize(reader));
}
What's the best way to speed this all up? I have a file that looks like this (so I wanna get all the content in between the two different tags and then deserialize all that content):
A line with some tag
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with some tag
Note: These tags are not XML tags.
You could use a StringBuilder as opposed to a string, that is what the StringBuilder is for. Some example code is below:
var paneContent = new StringBuilder();
bool lineFound = false;
foreach (string line in File.ReadLines(path))
{
if (line.Contains(tag))
{
lineFound = !lineFound;
}
else
{
if (lineFound)
{
paneContent.Append(line);
}
}
}
using (TextReader reader = new StringReader(paneContent.ToString()))
{
data = (PaneData)(serializer.Deserialize(reader));
}
As mentioned in this answer, a StringBuilder is preferred to a string when you are concatenating in a loop, which is the case here.
Here is an example of how to use groups with regexes and retrieve their contents afterwards.
What you want is a regex that will match your tags, label this as a group then retrieve the data of the group as in the example
Use a StringBuilder to build your data string (paneContent). It's much faster because concatenating strings results in new memory allocations. StringBuilder pre-allocates memory (if you expect large data strings, you can customize the initial allocation).
It's a good idea to read your input file line-by-line so you can avoid loading the whole file into memory if you expect files with many lines of text.

Reading a file line by line in C#

I am trying to read some text files, where each line needs to be processed. At the moment I am just using a StreamReader, and then reading each line individually.
I am wondering whether there is a more efficient way (in terms of LoC and readability) to do this using LINQ without compromising operational efficiency. The examples I have seen involve loading the whole file into memory, and then processing it. In this case however I don't believe that would be very efficient. In the first example the files can get up to about 50k, and in the second example, not all lines of the file need to be read (sizes are typically < 10k).
You could argue that nowadays it doesn't really matter for these small files, however I believe that sort of the approach leads to inefficient code.
First example:
// Open file
using(var file = System.IO.File.OpenText(_LstFilename))
{
// Read file
while (!file.EndOfStream)
{
String line = file.ReadLine();
// Ignore empty lines
if (line.Length > 0)
{
// Create addon
T addon = new T();
addon.Load(line, _BaseDir);
// Add to collection
collection.Add(addon);
}
}
}
Second example:
// Open file
using (var file = System.IO.File.OpenText(datFile))
{
// Compile regexs
Regex nameRegex = new Regex("IDENTIFY (.*)");
while (!file.EndOfStream)
{
String line = file.ReadLine();
// Check name
Match m = nameRegex.Match(line);
if (m.Success)
{
_Name = m.Groups[1].Value;
// Remove me when other values are read
break;
}
}
}
You can write a LINQ-based line reader pretty easily using an iterator block:
static IEnumerable<SomeType> ReadFrom(string file) {
string line;
using(var reader = File.OpenText(file)) {
while((line = reader.ReadLine()) != null) {
SomeType newRecord = /* parse line */
yield return newRecord;
}
}
}
or to make Jon happy:
static IEnumerable<string> ReadFrom(string file) {
string line;
using(var reader = File.OpenText(file)) {
while((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
...
var typedSequence = from line in ReadFrom(path)
let record = ParseLine(line)
where record.Active // for example
select record.Key;
then you have ReadFrom(...) as a lazily evaluated sequence without buffering, perfect for Where etc.
Note that if you use OrderBy or the standard GroupBy, it will have to buffer the data in memory; ifyou need grouping and aggregation, "PushLINQ" has some fancy code to allow you to perform aggregations on the data but discard it (no buffering). Jon's explanation is here.
It's simpler to read a line and check whether or not it's null than to check for EndOfStream all the time.
However, I also have a LineReader class in MiscUtil which makes all of this a lot simpler - basically it exposes a file (or a Func<TextReader> as an IEnumerable<string> which lets you do LINQ stuff over it. So you can do things like:
var query = from file in Directory.GetFiles("*.log")
from line in new LineReader(file)
where line.Length > 0
select new AddOn(line); // or whatever
The heart of LineReader is this implementation of IEnumerable<string>.GetEnumerator:
public IEnumerator<string> GetEnumerator()
{
using (TextReader reader = dataSource())
{
string line;
while ((line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
Almost all the rest of the source is just giving flexible ways of setting up dataSource (which is a Func<TextReader>).
Since .NET 4.0, the File.ReadLines() method is available.
int count = File.ReadLines(filepath).Count(line => line.StartsWith(">"));
NOTE: You need to watch out for the IEnumerable<T> solution, as it will result in the file being open for the duration of processing.
For example, with Marc Gravell's response:
foreach(var record in ReadFrom("myfile.csv")) {
DoLongProcessOn(record);
}
the file will remain open for the whole of the processing.
Thanks all for your answers! I decided to go with a mixture, mainly focusing on Marc's though as I will only need to read lines from a file. I guess you could argue seperation is needed everywhere, but heh, life is too short!
Regarding the keeping the file open, that isn't going to be an issue in this case, as the code is part of a desktop application.
Lastly I noticed you all used lowercase string. I know in Java there is a difference between capitalised and non capitalised string, but I thought in C# lowercase string was just a reference to capitalised String?
public void Load(AddonCollection<T> collection)
{
// read from file
var query =
from line in LineReader(_LstFilename)
where line.Length > 0
select CreateAddon(line);
// add results to collection
collection.AddRange(query);
}
protected T CreateAddon(String line)
{
// create addon
T addon = new T();
addon.Load(line, _BaseDir);
return addon;
}
protected static IEnumerable<String> LineReader(String fileName)
{
String line;
using (var file = System.IO.File.OpenText(fileName))
{
// read each line, ensuring not null (EOF)
while ((line = file.ReadLine()) != null)
{
// return trimmed line
yield return line.Trim();
}
}
}

Categories