I am trying to read some text files, where each line needs to be processed. At the moment I am just using a StreamReader, and then reading each line individually.
I am wondering whether there is a more efficient way (in terms of LoC and readability) to do this using LINQ without compromising operational efficiency. The examples I have seen involve loading the whole file into memory, and then processing it. In this case however I don't believe that would be very efficient. In the first example the files can get up to about 50k, and in the second example, not all lines of the file need to be read (sizes are typically < 10k).
You could argue that nowadays it doesn't really matter for these small files, however I believe that sort of approach leads to inefficient code.
First example:
// Open file
using (var file = System.IO.File.OpenText(_LstFilename))
{
    // Read file
    while (!file.EndOfStream)
    {
        String line = file.ReadLine();
        // Ignore empty lines
        if (line.Length > 0)
        {
            // Create addon
            T addon = new T();
            addon.Load(line, _BaseDir);
            // Add to collection
            collection.Add(addon);
        }
    }
}
Second example:
// Open file
using (var file = System.IO.File.OpenText(datFile))
{
    // Compile regexs
    Regex nameRegex = new Regex("IDENTIFY (.*)");
    while (!file.EndOfStream)
    {
        String line = file.ReadLine();
        // Check name
        Match m = nameRegex.Match(line);
        if (m.Success)
        {
            _Name = m.Groups[1].Value;
            // Remove me when other values are read
            break;
        }
    }
}
You can write a LINQ-based line reader pretty easily using an iterator block:
static IEnumerable<SomeType> ReadFrom(string file) {
    string line;
    using (var reader = File.OpenText(file)) {
        while ((line = reader.ReadLine()) != null) {
            SomeType newRecord = /* parse line */
            yield return newRecord;
        }
    }
}
or to make Jon happy:
static IEnumerable<string> ReadFrom(string file) {
    string line;
    using (var reader = File.OpenText(file)) {
        while ((line = reader.ReadLine()) != null) {
            yield return line;
        }
    }
}
...
var typedSequence = from line in ReadFrom(path)
                    let record = ParseLine(line)
                    where record.Active // for example
                    select record.Key;
then you have ReadFrom(...) as a lazily evaluated sequence without buffering, perfect for Where etc.
Note that if you use OrderBy or the standard GroupBy, it will have to buffer the data in memory; if you need grouping and aggregation, "PushLINQ" has some fancy code to allow you to perform aggregations on the data while discarding it (no buffering). Jon's explanation is here.
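As a quick illustration of that difference, here is a minimal sketch (using the string-returning ReadFrom above; nothing new is assumed beyond System.Linq):
// Where streams lazily: only one line is held in memory at a time.
var nonEmpty = ReadFrom(path).Where(line => line.Length > 0);

// OrderBy must buffer: it reads the whole sequence before yielding its first result.
var sorted = ReadFrom(path).OrderBy(line => line.Length);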
It's simpler to read a line and check whether or not it's null than to check for EndOfStream all the time.
However, I also have a LineReader class in MiscUtil which makes all of this a lot simpler - basically it exposes a file (or a Func<TextReader>) as an IEnumerable<string>, which lets you do LINQ stuff over it. So you can do things like:
var query = from file in Directory.GetFiles("*.log")
            from line in new LineReader(file)
            where line.Length > 0
            select new AddOn(line); // or whatever
The heart of LineReader is this implementation of IEnumerable<string>.GetEnumerator:
public IEnumerator<string> GetEnumerator()
{
    using (TextReader reader = dataSource())
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
Almost all the rest of the source is just giving flexible ways of setting up dataSource (which is a Func<TextReader>).
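For example, a file-based setup might look roughly like the hedged sketch below; this is not the actual MiscUtil source, just an illustration of how a filename can be wrapped in a Func<TextReader>:
public sealed class LineReader
{
    private readonly Func<TextReader> dataSource;

    // Convenience constructor: wrap a filename in a Func<TextReader>.
    public LineReader(string filename)
        : this(() => File.OpenText(filename))
    {
    }

    public LineReader(Func<TextReader> dataSource)
    {
        this.dataSource = dataSource;
    }

    // GetEnumerator() is the iterator block shown above.
}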
Since .NET 4.0, the File.ReadLines() method is available.
int count = File.ReadLines(filepath).Count(line => line.StartsWith(">"));
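Because File.ReadLines is lazy, the first example from the question could be written as the sketch below (an assumption on my part that the CreateAddon helper and _LstFilename field from the question author's final code further down are in scope):
// Lazily enumerate the file, skip empty lines, and build the addons.
var addons = File.ReadLines(_LstFilename)
                 .Where(line => line.Length > 0)
                 .Select(CreateAddon);
collection.AddRange(addons);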
NOTE: You need to watch out for the IEnumerable<T> solution, as it will result in the file being open for the duration of processing.
For example, with Marc Gravell's response:
foreach (var record in ReadFrom("myfile.csv")) {
    DoLongProcessOn(record);
}
the file will remain open for the whole of the processing.
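If that is a concern, one hedged workaround is to materialize the sequence first, at the cost of buffering all the records in memory, so the file is closed before the slow processing begins:
// Read and close the file up front, then process the buffered records.
var records = ReadFrom("myfile.csv").ToList();
foreach (var record in records) {
    DoLongProcessOn(record);
}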
Thanks all for your answers! I decided to go with a mixture, mainly focusing on Marc's though, as I will only need to read lines from a file. I guess you could argue separation is needed everywhere, but heh, life is too short!
Regarding keeping the file open, that isn't going to be an issue in this case, as the code is part of a desktop application.
Lastly, I noticed you all used lowercase string. I know in Java there is a difference between capitalised and non-capitalised string, but I thought in C# lowercase string was just an alias for capitalised String?
public void Load(AddonCollection<T> collection)
{
    // read from file
    var query =
        from line in LineReader(_LstFilename)
        where line.Length > 0
        select CreateAddon(line);

    // add results to collection
    collection.AddRange(query);
}

protected T CreateAddon(String line)
{
    // create addon
    T addon = new T();
    addon.Load(line, _BaseDir);
    return addon;
}

protected static IEnumerable<String> LineReader(String fileName)
{
    String line;
    using (var file = System.IO.File.OpenText(fileName))
    {
        // read each line, ensuring not null (EOF)
        while ((line = file.ReadLine()) != null)
        {
            // return trimmed line
            yield return line.Trim();
        }
    }
}
Related
I have the following program: a database (if you can call it that) built on text files.
When writing a new record to the text file, I need to increase the record id by one.
I can't figure out which method I should use to implement a loop that increments the id; can anyone tell me?
I have a method that writes the values of WPF text boxes to the text file:
using (StreamReader sr = new StreamReader("D:\\123.txt", true))
{
    while (sr.ReadLine() != null)
    {
        id++;
        using (StreamWriter txt = new StreamWriter("D:\\123.txt", true))
        {
            txt.WriteLine(string.Format("{0} {1} {2} {3} {4} {5} {6}\n", id, TBName, TBLName, TBMName, TBInfo, TBMat, TBFiz));
        }
        MessageBox.Show("Данные успешно сохранены!"); // "Data saved successfully!"
    }
}
How can the id increase by 1 with each new entry in the text file?
The code that outputs the data to the DataGrid looks like this:
private void Work()
{
    try
    {
        List<Student> list = new List<Student>();
        using (StreamReader sr = new StreamReader(fileName, true))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                var parsed = line.Split(' ');
                list.Add(new Student
                (
                    Convert.ToInt32(parsed[0]),
                    parsed[1],
                    parsed[2],
                    parsed[3],
                    Convert.ToInt32(parsed[4]),
                    Convert.ToInt32(parsed[5]),
                    Convert.ToInt32(parsed[6])
                ));
            }
        }
        DGridStudents.ItemsSource = list;
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
The code shown has some problems. First, you cannot write to a file that you have opened for reading with the StreamReader/StreamWriter classes. But even if you could, look closely at how it works: you open the file, then start a loop that reads a line, writes a new line to the same file, and then reads the next line (which could well be the line you have just written).
In the best case, your file will simply keep growing until it fills your disk.
To increment the value used as the id, you could take this approach:
// First read the whole file and get the last line from it
int id = 0;
string lastLine = File.ReadLines("D:\\123.txt").LastOrDefault();
if (!string.IsNullOrEmpty(lastLine))
{
    // Now split the line and convert its first part to an integer
    var parts = lastLine.Split();
    if (parts.Length > 0)
    {
        Int32.TryParse(parts[0], out id);
    }
}
// You can now increment and write the new line
id++;
using (StreamWriter txt = new StreamWriter("D:\\123.txt", true))
{
    txt.WriteLine($"{id} {TBName} {TBLName} {TBMName} {TBInfo} {TBMat} {TBFiz}");
}
This approach will force you to read the whole file to find the last line. However, you could add a second file (an index file) alongside your txt, with the same name but an idx extension. This file will contain only the last number written:
int id = 0;
string firstLine = File.ReadLines("D:\\123.idx").FirstOrDefault();
if (!string.IsNullOrEmpty(firstLine))
{
    // Now split the line and convert its first part to an integer
    var parts = firstLine.Split();
    if (parts.Length > 0)
    {
        Int32.TryParse(parts[0], out id);
    }
}
id++;
using (StreamWriter txt = new StreamWriter("D:\\123.txt", true))
{
    txt.WriteLine($"{id} {TBName} {TBLName} {TBMName} {TBInfo} {TBMat} {TBFiz}");
}
File.WriteAllText("D:\\123.idx", id.ToString());
This second approach is probably better if the txt file is big, because it doesn't require reading the whole txt file, but there are more points of possible failure: you have two files to handle, which doubles the chances of IO errors, and we are not even considering the multiuser scenario.
A database, even a file-based one like SQLite or Access, is better suited for these tasks.
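To illustrate why, here is a minimal sketch using the Microsoft.Data.Sqlite package (my choice for the example; any file-based database would do). An INTEGER PRIMARY KEY column lets the database assign the next id itself, so the manual counting disappears entirely:
// Assumes the Microsoft.Data.Sqlite NuGet package and the TBName/TBLName values from the question.
using (var connection = new SqliteConnection("Data Source=students.db"))
{
    connection.Open();

    var create = connection.CreateCommand();
    create.CommandText =
        "CREATE TABLE IF NOT EXISTS Students (" +
        "Id INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT, LastName TEXT)";
    create.ExecuteNonQuery();

    var insert = connection.CreateCommand();
    insert.CommandText = "INSERT INTO Students (Name, LastName) VALUES ($name, $lastName)";
    insert.Parameters.AddWithValue("$name", TBName);
    insert.Parameters.AddWithValue("$lastName", TBLName);
    insert.ExecuteNonQuery(); // Id is assigned automatically by the database
}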
I have been trying to search for string patterns in a large text file. I am reading it line by line and checking each line, which is taking a lot of time. I did try a HashSet with ReadAllLines:
HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\Doc\Tst.txt"));
Now when I try to search for the string it doesn't match, because it looks for a match against the entire row. I just want to check whether the string appears anywhere in the row.
I also tried using this:
using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
{
    while ((CurrentLine = file.ReadLine()) != null)
    {
        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
        if (vals == true)
            break;
    }
}

bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
{
    if (LineText.Contains(date_to_chk))
    {
        if (LineText.Contains(publisher))
        {
            tvals = true;
        }
        else
            tvals = false;
    }
    else
        tvals = false;

    return tvals;
}
But this is consuming too much time. Any help on this would be good.
Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines) since you aren't testing for membership of the set.
Taking a really naive approach you could just do this.
var isItThere = File.ReadAllLines(@"d:\docs\st.txt")
    .Any(x => x.Contains(date_to_chk) && x.Contains(publisher));
65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel since it sounds like it would be superfast to do anyway.
You could replace Any with First to find the first result, or with Where to get an IEnumerable<string> containing all results.
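For example (same hypothetical predicate as above):
// First throws if nothing matches; FirstOrDefault would return null instead.
var firstMatch = File.ReadAllLines(@"d:\docs\st.txt")
    .First(x => x.Contains(date_to_chk) && x.Contains(publisher));

// Where gives an IEnumerable<string> of every matching line.
var allMatches = File.ReadAllLines(@"d:\docs\st.txt")
    .Where(x => x.Contains(date_to_chk) && x.Contains(publisher));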
You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This typically gives better performance.
var regex = new Regex($"{date}|{publisher}", RegexOptions.Compiled);
foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
{
    if (regex.IsMatch(line)) break;
}
This also shows a convenient standard library function for reading a file line by line.
Or, depending on what you want to do...
var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(regex.IsMatch);
I have a CSV file and I am reading the data byte by byte using a BufferedStream. I want to skip a line if its last column is "True". How do I achieve that?
So far I have got:
BufferedStream stream = new BufferedStream(csvFile, 1000);
int byteIn = stream.ReadByte();
while (byteIn != -1 && (char)byteIn != '\n' && (char)byteIn != '\r')
    byteIn = stream.ReadByte();
I want to ignore reading the line if the last column of the line is "True"
Firstly, I wouldn't approach any file IO byte by byte without an absolute need for it. Secondly, reading lines from a text file in .NET is a really cheap operation.
Here is some naive starter code, which ignores the complication of quoted CSV values:
List<string> matchingLines = new List<string>();
using (var reader = new StreamReader("data.csv"))
{
    string rawline;
    while (null != (rawline = reader.ReadLine()))
    {
        if (rawline.TrimEnd().Split(',').Last() == "True") continue;
        matchingLines.Add(rawline);
    }
}
In reality, it would be advisable to parse each CSV line into a strongly typed object and then filter that collection using LINQ. However, that can be a separate answer for a separate question.
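Still, as a rough sketch of that idea (CsvRecord is a hypothetical type invented here for illustration; a real CSV parser should also handle quoting and escaping):
class CsvRecord
{
    public string[] Fields { get; private set; }

    public bool LastColumnIsTrue
    {
        get { return Fields[Fields.Length - 1].Trim() == "True"; }
    }

    public static CsvRecord Parse(string line)
    {
        return new CsvRecord { Fields = line.Split(',') };
    }
}

// Keep only the records whose last column is not "True".
var keepers = File.ReadLines("data.csv")
                  .Select(CsvRecord.Parse)
                  .Where(r => !r.LastColumnIsTrue)
                  .ToList();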
I would read/import the entire CSV file into a DataTable object and then do a Select on the DataTable to include only the rows where the last column is not equal to "True".
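A rough sketch of that approach (the column names are my assumptions, it needs System.Data and System.IO, and the hand-rolled Split ignores quoted values):
// Build the DataTable schema by hand, assuming three unquoted columns.
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Value", typeof(string));
table.Columns.Add("Flag", typeof(string));

// Import every CSV line as a row.
foreach (var line in File.ReadLines("data.csv"))
{
    table.Rows.Add(line.Split(','));
}

// Select only the rows whose last column is not "True".
DataRow[] rows = table.Select("Flag <> 'True'");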
Here is a solution using a StreamReader, rather than a BufferedStream:
public string RemoveTrueRows(string csvFile)
{
    var sr = new StreamReader(csvFile);
    var line = string.Empty;
    var contentsWithoutTrueRows = string.Empty;

    while ((line = sr.ReadLine()) != null)
    {
        var columns = line.Split(',');

        // Keep the line only if its last column is not "True"
        if (columns[columns.Length - 1] != "True")
        {
            contentsWithoutTrueRows += line + Environment.NewLine;
        }
    }

    sr.Close();

    return contentsWithoutTrueRows;
}
In addition to jkirkwood's answer, you could also read each line and conditionally add a class or struct to a list of objects.
Some quick, semi-pseudocode:
struct MyObject
{
    public int Property1;
    public string Property2;
    public bool Property3;
}

List<MyObject> ObjectList = new List<MyObject>();
string buffer;

// "reader" is assumed to be a StreamReader opened over the CSV file.
while ((buffer = reader.ReadLine()) != null)
{
    string[] LineData = buffer.Split(',');

    // Skip the row if its last field is "true".
    if (LineData[LineData.Length - 1] == "true") continue;

    MyObject CurrentObject = new MyObject();
    CurrentObject.Property1 = Convert.ToInt32(LineData[1]);
    CurrentObject.Property2 = LineData[2];
    CurrentObject.Property3 = Convert.ToBoolean(LineData[LineData.Length - 1]);

    ObjectList.Add(CurrentObject);
}
It really kind of depends on what you want to do with the data once you've read it.
Hopefully this example is a bit helpful.
EDIT
As noted in comments, please be aware this is just a quick example. Your CSV file may have qualifiers and other things which make the string split completely useless. The take-away concept is to read line data into some sort of temporary variable, evaluate it for the desired condition, then output it or add it to your collection as needed.
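For what it's worth, one way to handle quoted fields without hand-rolling a parser is the TextFieldParser class in the Microsoft.VisualBasic.FileIO namespace (assuming a reference to the Microsoft.VisualBasic assembly is acceptable in your project):
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser("data.csv"))
{
    parser.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true;

    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();

        // Skip rows whose last field is "True".
        if (fields[fields.Length - 1] == "True") continue;

        // Process the remaining fields here.
    }
}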
EDIT 2
If the line lengths vary, you'll need to grab the last field instead of the nth field, so I changed the boolean field grabber to show how you would always get the last field instead of, say, the 42nd one.
I'm working in C# and I have a large text file (75MB).
I want to save the lines that match a regular expression.
I tried reading the file with a StreamReader and ReadToEnd, but it takes 400MB of RAM,
and when it is used again it throws an OutOfMemoryException.
I then tried using File.ReadAllLines():
string[] lines = File.ReadAllLines("file");
StringBuilder specialLines = new StringBuilder();
foreach (string line in lines)
    if (regex.IsMatch(line))    // regex: the pattern I am searching for
        specialLines.Append(line);
This is all great, but when my function ends the memory isn't released and I'm left with about 300MB of used memory. Only when the function is called again and the following line executes:
string[] lines = File.ReadAllLines("file");
do I see the memory drop to roughly 50MB and then get reallocated back up to 200MB.
How can I clear this memory, or get the lines I need in a different way?
var file = File.OpenRead("myfile.txt");
var reader = new StreamReader(file);
while (!reader.EndOfStream)
{
    string line = reader.ReadLine();
    // evaluate the line here
}
reader.Dispose();
file.Dispose();
You need to stream the text instead of loading the whole file into memory. Here's a way to do it, using an extension method and LINQ:
static class ExtensionMethods
{
    public static IEnumerable<string> EnumerateLines(this TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
...
var regex = new Regex(..., RegexOptions.Compiled);
using (var reader = new StreamReader(fileName))
{
    var specialLines =
        reader.EnumerateLines()
              .Where(line => regex.IsMatch(line))
              .Aggregate(new StringBuilder(),
                         (sb, line) => sb.AppendLine(line));
}
You can use StreamReader.ReadLine to read the file line by line and save only those lines that you need.
You should use the enumerator pattern to keep your memory footprint low, since your file can be huge.
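A minimal sketch of that suggestion, assuming the regular expression has been compiled once up front, so only the matching lines are ever kept in memory:
var specialLines = new StringBuilder();
using (var reader = new StreamReader("file"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (regex.IsMatch(line))    // the compiled Regex mentioned above
            specialLines.AppendLine(line);
    }
}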
For a project that I am doing, one of the things that I must do is delete the first X lines of a plaintext file. I'm saying X because I will need to do this routine multiple times and each time, the lines to delete will be different, but they will always start from the beginning, delete the first X and then output the results to the same file.
I am thinking about doing something like this, which I pieced together from other tutorials and examples that I read:
String line = null;
String tempFile = Path.GetTempFileName();
String originalFile = openFileDialog.FileName;
int line_number = 0;
int lines_to_delete = 25;

using (StreamReader reader = new StreamReader(originalFile)) {
    using (StreamWriter writer = new StreamWriter(tempFile)) {
        while ((line = reader.ReadLine()) != null) {
            line_number++;
            if (line_number <= lines_to_delete)
                continue;
            writer.WriteLine(line);
        }
    }
}

if (File.Exists(tempFile)) {
    File.Delete(originalFile);
    File.Move(tempFile, originalFile);
}
But I don't know if this would work because of small stuff like line numbers starting at line 0 or whatnot... also, I don't know if it is good code in terms of efficiency and form.
Thanks a bunch.
I like it short...
File.WriteAllLines(
    fileName,
    File.ReadAllLines(fileName).Skip(numberLinesToSkip).ToArray());
It's OK, and it doesn't look like it would have the off-by-one problems you fear. However, a leaner approach would be afforded by two separate loops: one that just reads and counts the first X lines from the input file (and does nothing else), and a separate one that just copies the remaining lines from input to output. I.e., instead of your single while loop, have...:
while ((line = reader.ReadLine()) != null) {
    line_number++;
    if (line_number >= lines_to_delete)
        break;
}
while ((line = reader.ReadLine()) != null) {
    writer.WriteLine(line);
}
I like your approach, I see nothing wrong with it. If you know for certain they are small files then the other suggestions may be a little less code if that matters to you.
A slightly less verbose version of what you already have:
using (StreamReader reader = new StreamReader(originalFile))
using (StreamWriter writer = new StreamWriter(tempFile))
{
    while (lines_to_delete-- > 0)
        reader.ReadLine();

    while ((line = reader.ReadLine()) != null)
        writer.WriteLine(line);
}
You could read the file into an array of lines, ignore the first few elements, and write the rest back.
The downside to this approach is that it will consume the size of the file in memory. Your approach (although pretty unreadable, no offence) doesn't have this memory problem. Although if the files are not too large, there shouldn't be a reason to worry about memory usage.
Example:
string[] lines = System.IO.File.ReadAllLines("YourFile.txt").Skip(10).ToArray();
System.IO.File.WriteAllLines("OutFile.txt", lines);