In a Windows Forms C# app, I have a textbox where users paste log data, and the app sorts it. I need to check each line individually, so I split the input on newlines, but if there are a lot of lines (more than 100,000 or so), it throws an OutOfMemoryException.
My code looks like this:
StringSplitOptions splitOptions = new StringSplitOptions();
if(removeEmptyLines_CB.Checked)
splitOptions = StringSplitOptions.RemoveEmptyEntries;
else
splitOptions = StringSplitOptions.None;
List<string> outputLines = new List<string>();
foreach(string line in input_TB.Text.Split(new string[] { "\r\n", "\n" }, splitOptions))
{
if(line.Contains(inputCompare_TB.Text))
outputLines.Add(line);
}
output_TB.Text = string.Join(Environment.NewLine, outputLines);
The problem occurs where I split the textbox text by line, here: input_TB.Text.Split(new string[] { "\r\n", "\n" }, splitOptions)
Is there a better way to do this? I've thought about taking the first X amount of text, truncating at a new line and repeat until everything has been read, but this seems tedious. Or is there a way to allocate more memory for it?
Thanks,
Garrett
Update
Thanks to Attila, I came up with this and it seems to work. Thanks
StringReader reader = new StringReader(input_TB.Text);
string line;
while((line = reader.ReadLine()) != null)
{
if(line.Contains(inputCompare_TB.Text))
outputLines.Add(line);
}
output_TB.Text = string.Join(Environment.NewLine, outputLines);
The better way to do this would be to extract and process one line at a time, and use a StringBuilder to create the result:
StringBuilder outputTxt = new StringBuilder();
string txt = input_TB.Text;
int txtIndex = 0;
while (txtIndex < txt.Length) {
    int startLineIndex = txtIndex;
GetMore:
    // Advance to the next '\r' or '\n' (or to the end of the text).
    while (txtIndex < txt.Length && txt[txtIndex] != '\r' && txt[txtIndex] != '\n') {
        txtIndex++;
    }
    // A lone '\r' (not followed by '\n') is kept as part of the line.
    if (txtIndex < txt.Length && txt[txtIndex] == '\r' && (txtIndex == txt.Length-1 || txt[txtIndex+1] != '\n')) {
        txtIndex++;
        goto GetMore;
    }
    string line = txt.Substring(startLineIndex, txtIndex-startLineIndex);
    if (line.Contains(inputCompare_TB.Text)) {
        if (outputTxt.Length > 0)
            outputTxt.Append(Environment.NewLine);
        outputTxt.Append(line);
    }
    // Skip the terminator: two characters for "\r\n", one for "\n".
    if (txtIndex < txt.Length && txt[txtIndex] == '\r')
        txtIndex++;
    txtIndex++;
}
output_TB.Text = outputTxt.ToString();
Pre-emptive comment: someone will object to the goto, but it is what's needed here; the alternatives are either much more complex (a regular expression, for example) or fake the goto with another loop and continue or break.
Using a StringReader to split the lines is a much cleaner solution. Its ReadLine method treats \r\n, \n, and also a lone \r as a line terminator (unlike the manual loop above, which keeps a lone \r inside the line):
StringReader reader = new StringReader(input_TB.Text);
StringBuilder outputTxt = new StringBuilder();
string compareTxt = inputCompare_TB.Text;
string line;
while((line = reader.ReadLine()) != null) {
if (line.Contains(compareTxt)) {
if (outputTxt.Length > 0)
outputTxt.Append(Environment.NewLine);
outputTxt.Append(line);
}
}
output_TB.Text = outputTxt.ToString();
Split has to duplicate the memory used by the original text, plus the overhead of a string object for each line and of the array that holds them. If this causes memory issues, a reliable way of processing the input is to parse one line at a time.
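A minimal sketch of the one-line-at-a-time idea, written as C# 9 top-level statements. FilterLines is a hypothetical helper name, not something from the question; it wraps a StringReader in an iterator so that only the current line exists as a separate string. The control names in the usage comment are the ones from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: lazily yields the lines of `input` that contain `needle`.
// No array holding every line is ever created.
static IEnumerable<string> FilterLines(string input, string needle, bool skipEmpty)
{
    using (var reader = new StringReader(input))
    {
        string line;
        // ReadLine accepts "\n", "\r\n" and a lone "\r" as terminators.
        while ((line = reader.ReadLine()) != null)
        {
            if (skipEmpty && line.Length == 0)
                continue;
            if (line.Contains(needle))
                yield return line;
        }
    }
}

// Possible usage with the controls from the question:
// output_TB.Text = string.Join(Environment.NewLine,
//     FilterLines(input_TB.Text, inputCompare_TB.Text, removeEmptyLines_CB.Checked));
```

The input string itself still has to fit in memory, so this only removes the extra copy that Split makes; for truly huge inputs, reading from a file with a StreamReader avoids holding the text at all.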
I guess the only way to do this on large text files is to open the file manually and use a StreamReader. Here is an example how to do this.
You can avoid creating strings for all lines and the array by creating the string for each line one at a time:
var eol = new[] { '\r', '\n' };
var pos = 0;
while (pos < input.Length)
{
var i = input.IndexOfAny(eol, pos);
if (i < 0)
{
i = input.Length;
}
if (i != pos)
{
var line = input.Substring(pos, i - pos);
// process line
}
pos = i + 1;
}
On the other hand, this article argues that the Split method itself is implemented poorly. Read it and draw your own conclusions.
Like Attila said, you have to parse line by line.
Related
I am reading from a file and I am trying to skip the first two lines and start reading from the third one. I've checked other answered questions, but none of the answers worked in Unity for some reason. I get several errors, although it should work.
StreamReader reader = new StreamReader(path);
string line = "";
while ((line = reader.ReadLine()) != null)
{
string[] words = line.Split(' ');
string type = words[0];
float x = float.Parse(words[1]);
....
}
If I understand correctly, you can use File.ReadAllLines, which returns every line of the file as a string array, and then start reading from the third line (arrays start at index 0, so the third line is contents[2]).
var contents = File.ReadAllLines(path);
for (int i = 2; i < contents.Length; i++)
{
string[] words = contents[i].Split(' ');
string type = words[0];
float x = float.Parse(words[1]);
}
If you know the encoding of the file, you can pass it as the second parameter of File.ReadAllLines.
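For instance, assuming the file is UTF-8 (an assumption; the question does not say), the call could look like the sketch below, written as C# 9 top-level statements. The temp-file path and the sample rows are made up so that the snippet runs on its own.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

// Hypothetical sample file, created only so the sketch is runnable.
string path = Path.Combine(Path.GetTempPath(), "skip_two_lines_sample.txt");
File.WriteAllLines(path, new[] { "header one", "header two", "v 1.5", "v 2.25" }, Encoding.UTF8);

// Second parameter: the encoding the file is known to use (UTF-8 assumed here).
string[] contents = File.ReadAllLines(path, Encoding.UTF8);

float sum = 0f;
for (int i = 2; i < contents.Length; i++)   // index 2 is the third line
{
    string[] words = contents[i].Split(' ');
    // InvariantCulture keeps "1.5" parsing the same way regardless of the OS locale.
    sum += float.Parse(words[1], CultureInfo.InvariantCulture);
}
Console.WriteLine(sum);
```

Passing CultureInfo.InvariantCulture is an addition of mine: float.Parse without it uses the current culture, which can reject or misread a decimal point on some systems.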
Similar to D-Shih's solution, is one using File.ReadLines, which returns an IEnumerable<string>:
var lines = File.ReadLines(path);
foreach (string line in lines.Skip(2))
{
string[] words = line.Split(' ');
string type = words[0];
float x = float.Parse(words[1]);
// etc.
}
The benefit of this approach over D-Shih's is that you don't have to read the entire file into memory at once to process it, so this solution is analogous to your existing solution's use of StreamReader.
As a solution for directly fixing your problem, you just need to call ReadLine twice before getting into the loop (to skip the two lines), though I'd argue the solution above is more legible:
using (StreamReader reader = new StreamReader(path))
{
string line = "";
// skip 2 lines
for (int i = 0; i < 2; ++i)
{
reader.ReadLine();
}
// read file normally
while ((line = reader.ReadLine()) != null)
{
string[] words = line.Split(' ');
string type = words[0];
float x = float.Parse(words[1]);
....
}
}
Notice that I've also wrapped the reader in a using, so that the file handle will be closed & disposed of once the loop completes, or in case of an exception being thrown.
Hi I'm pretty new to C# and trying to do some exercises to get up to speed with it. I'm trying to count the total number of characters in a file but it's stopping after the first word, would someone be able to tell me where I am going wrong? Thanks in advance
public void TotalCharacterCount()
{
string str;
int count, i, l;
count = i = 0;
StreamReader reader = File.OpenText("C:\\Users\\Lewis\\file.txt");
str = reader.ReadLine();
l = str.Length;
while (str != null && i < l)
{
count++;
i++;
str = reader.ReadLine();
}
reader.Close();
Console.Write("Number of characters in the file is : {0}\n", count);
}
If you want to know the size of a file in bytes:
long length = new System.IO.FileInfo("C:\\Users\\Lewis\\file.txt").Length;
Console.Write($"Size of the file in bytes is : {length}");
If you want to count characters to play around with C#, then here is some sample code that might help you
int totalCharacters = 0;
// Using will do the reader.Close for you.
using (StreamReader reader = File.OpenText("C:\\Users\\Lewis\\file.txt"))
{
string str = reader.ReadLine();
while (str != null)
{
totalCharacters += str.Length;
str = reader.ReadLine();
}
}
// If you add the $ in front of the string, then you can interpolate expressions
Console.Write($"Number of characters in the file is : {totalCharacters}");
it's stopping after the first word
That is because the loop condition includes && i < l. You increment i on every iteration but never update l after the first line, so the loop stops as soon as i reaches the length of the first line. (By the way, l is not a very good name; I was sure it was 1, not l.)
Then, if you need the total count of characters in the file, you could read the whole file into a string variable and take its Length:
var count = File.ReadAllText(path).Length;
The Length property of FileInfo gives the size of the file in bytes, which is not necessarily equal to the character count (depending on the encoding, a character may take more than one byte).
And regarding the way you read - it also depends whether you want to count new line symbols and others or not.
Consider the following sample
static void Main(string[] args)
{
var sampleWithEndLine = "a\r\n";
var length1 = "a".Length;
var length2 = sampleWithEndLine.Length;
var length3 = @"a
".Length;
Console.WriteLine($"First sample: {length1}");
Console.WriteLine($"Second sample: {length2}");
Console.WriteLine($"Third sample: {length3}");
var totalCharacters = 0;
File.WriteAllText("sample.txt", sampleWithEndLine);
using(var reader = File.OpenText("sample.txt"))
{
string str = reader.ReadLine();
while (str != null)
{
totalCharacters += str.Length;
str = reader.ReadLine();
}
}
Console.WriteLine($"Second sample read with stream reader: {totalCharacters}");
Console.ReadKey();
}
For the second sample, Length returns 3 because the string actually contains three characters, while with the stream reader you get 1 because, as the ReadLine documentation puts it: "The string that is returned does not contain the terminating carriage return or line feed. The returned value is null if the end of the input stream is reached."
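To make the bytes-versus-characters point concrete, here is a small sketch written as C# 9 top-level statements. The sample string is my own, and the two encodings are only examples of what a file might use.

```csharp
using System;
using System.Text;

// "naïve" followed by CR LF: one non-ASCII character plus a two-character terminator.
string sample = "na\u00EFve\r\n";

int charCount = sample.Length;                            // UTF-16 code units, terminator included
int utf8Bytes = Encoding.UTF8.GetByteCount(sample);       // size of the text when stored as UTF-8 (no BOM)
int utf16Bytes = Encoding.Unicode.GetByteCount(sample);   // size of the same text when stored as UTF-16

Console.WriteLine($"chars: {charCount}, UTF-8 bytes: {utf8Bytes}, UTF-16 bytes: {utf16Bytes}");
// prints "chars: 7, UTF-8 bytes: 8, UTF-16 bytes: 14"
```

So the same seven characters occupy 8 or 14 bytes on disk depending on the encoding, and a ReadLine-based count would report 5 because the terminator is dropped.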
My old method (other than being wrong in general) takes too long to get multiple lines from a file and then store the parameters into a dictionary.
Essentially it opens the file, grabs every second line one at a time, modifies the line, stores the data (the line position and the first element of the line, minus the ">"), closes the file, and then repeats.
for (int i = 0; i < linecount - 1; i += 2)
{
string currentline = File.ReadLines
(datafile).Skip(i).Take(1).First();
string[] splitline = currentline.Split(' ');
string filenumber = splitline[0].Trim('>');
}
You need to read the next line inside the while loop; otherwise the loop body will always analyse the first line (that's why you get the Dictionary error) and the loop will never exit:
while (line != null)
{
// your current code here
line = sr.ReadLine();
}
The issue is that you only ever read the first line of the file. To solve this you need to ensure you call sr.ReadLine() on every iteration through the loop. This would look like:
using (StreamReader sr = File.OpenText(datafile))
{
string line = sr.ReadLine();
while (line != null)
{
count++; // "count = count++" would leave count unchanged
if (count % 2 == 0)
{
string[] splitline = line.Split(' ');
string datanumber = splitline[0].Trim('>');
index.Add(datanumber, count);
}
line = sr.ReadLine();
}
}
This means on each iteration, the value of line will be a new value (from the next line of the file).
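The same single-pass idea can also be sketched with File.ReadLines, written here as C# 9 top-level statements. BuildIndex is a hypothetical helper; it assumes the wanted lines sit at 0-based positions 0, 2, 4, ... (as in the question's for loop) and that each starts with ">" followed by a number and a space.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: one pass over the file, indexing every second line.
static Dictionary<string, int> BuildIndex(string datafile)
{
    var index = new Dictionary<string, int>();
    int position = 0;
    foreach (string line in File.ReadLines(datafile))   // streams; the file is opened once
    {
        if (position % 2 == 0)
        {
            string[] splitline = line.Split(' ');
            string filenumber = splitline[0].Trim('>');
            index[filenumber] = position;   // the indexer overwrites duplicates instead of throwing
        }
        position++;
    }
    return index;
}
```

Compared with calling File.ReadLines(datafile).Skip(i).Take(1) inside a loop, this reads each line exactly once instead of re-reading the file from the start for every entry.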
I have a text file that is divided up into many sections, each about 10 or so lines long. I'm reading in the file using File.ReadAllLines into an array, one line per element of the array, and I'm then trying to parse each section of the file to bring back just some of the data. I'm storing the results in a list, and hoping to export the list to CSV ultimately.
My for loop is giving me trouble, as it loops through the right amount of times, but only pulls the data from the first section of the text file each time rather than pulling the data from the first section and then moving on and pulling the data from the next section. I'm sure I'm doing something wrong either in my for loop or for each loop. Any clues to help me solve this would be much appreciated! Thanks
David
My code so far:
namespace ParseAndExport
{
class Program
{
static readonly string sourcefile = @"Path";
static void Main(string[] args)
{
string[] readInLines = File.ReadAllLines(sourcefile);
int counter = 0;
int holderCPStart = counter + 3;//Changed Paths will be an different number of lines each time, but will always start 3 lines after the startDiv
/*Need to find the start of the section and the end of the section and parse the bit in between.
* Also need to identify the blank line that occurs in each section as it is essentially a divider too.*/
int startDiv = Array.FindIndex(readInLines, counter, hyphens72);
int blankLine = Array.FindIndex(readInLines, startDiv, emptyElement);
int endDiv = Array.FindIndex(readInLines, counter + 1, hyphens72);
List<string> results = new List<string>();
//Test to see if FindIndexes work. Results should be 0, 7, 9 for 1st section of sourcefile
/*Console.WriteLine(startDiv);
Console.WriteLine(blankLine);
Console.WriteLine(endDiv);*/
//Check how long the file is so that for testing we know how long the while loop should run for
//Console.WriteLine(readInLines.Length);
//sourcefile has 5255 lines (elements) in the array
for (int i = 0; i <= readInLines.Length; i++)
{
if (i == startDiv)
{
results = (readInLines[i + 1].Split('|').Select(p => p.Trim()).ToList());
string holderCP = string.Join(Environment.NewLine, readInLines, holderCPStart, (blankLine - holderCPStart - 1)).Trim();
results.Add(holderCP);
string comment = string.Join(" ", readInLines, blankLine + 1, (endDiv - (blankLine + 1)));//in case the comment is more than one line long
results.Add(comment);
i = i + 1;
}
else
{
i = i + 1;
}
foreach (string result in results)
{
Console.WriteLine(result);
}
//csvcontent.AppendLine("Revision Number, Author, Date, Time, Count of Lines, Changed Paths, Comments");
/* foreach (string result in results)
{
for (int x = 0; x <= results.Count(); x++)
{
StringBuilder csvcontent = new StringBuilder();
csvcontent.AppendLine(results[x] + "," + results[x + 1] + "," + results[x + 2] + "," + results[x + 3] + "," + results[x + 4] + "," + results[x + 5]);
x = x + 6;
string csvpath = @"addressforcsvfile";
File.AppendAllText(csvpath, csvcontent.ToString());
}
}*/
}
Console.ReadKey();
}
private static bool hyphens72(String h)
{
if (h == "------------------------------------------------------------------------")
{
return true;
}
else
{
return false;
}
}
private static bool emptyElement(String ee)
{
if (ee == "")
{
return true;
}
else
{
return false;
}
}
}
}
It looks like you are trying to grab all of the lines in a file that are not "------" and put them into a list of strings.
You can try this:
var lineswithoutdashes = readInLines.Where(x => !hyphens72(x)).ToList();
Now you can take this list and do the split with a '|' to extract the fields you wanted
The logic seems wrong, and there are also issues with the code itself. I am unsure what precisely you're trying to do. Anyway, a few hints that I hope will help:
1. The if (i == startDiv) checks whether i equals startDiv. I assume the logic that runs when this condition is met is what you refer to as "pulls the data from the first section". That is expected, given that this code only runs when i equals startDiv.
2. You increase the counter i inside the body of the for loop, while the for statement itself also increases i, so i advances by two on every iteration.
3. Even if the issue in 2. didn't exist, I'd suggest not repeating the same operation "i = i + 1" in both the true and false branches of the if (i == startDiv).
4. Given that I assume this file might actually be massive, it's probably a good idea not to store it all in memory, but to read and process it line by line. There's currently no obvious reason to consume this much memory, other than the convenience of the File.ReadAllLines(sourcefile) API. I wouldn't be too scared to read the file like this:
using (StreamReader sr = new StreamReader(sourcefile)) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        // process the line.
    }
}
You can skip the lines until you've passed where the line equals hyphens72.
Then for each line, you process the line with the code you provided in the true case of (i == startDiv), or at least, from what you described, this is what I assume you are trying to do.
int startDiv will return the line number that contains hyphens72.
So your current for loop will only copy to results for the single line that matches the calculated line number.
I guess you want to search for the position of startDiv in the current line?
string hyphens72 = new string('-', 72); // the 72-hyphen section divider
// loop over lines
for (var lineNumber = 0; lineNumber < readInLines.Length; lineNumber++) {
string currentLine = readInLines[lineNumber];
int startDiv = currentLine.IndexOf(hyphens72);
// loop over characters in line
for (var charIndex = 0; charIndex < currentLine.Length; charIndex++) {
if (charIndex == startDiv) {
var currentCharacter = currentLine[charIndex];
// write to result ...
}
else {
continue; // skip this character
}
}
}
There are several things which could be improved.
I would use File.ReadLines over File.ReadAllLines, because ReadAllLines reads all the lines at once, while ReadLines streams them.
With the line results = (readInLines[i + 1].Split('|').Select(p => p.Trim()).ToList()); you're overwriting the previous results list. You'd better use results.AddRange() to add new results.
for (int i = 0; i <= readInLines.Length; i++) means that when the length is 10 it will do 11 iterations (one too many). Remove the =.
Array.FindIndex(readInLines, counter, hyphens72); scans the array. On large files it will take ages to read them completely and search them. Try to touch each line only once.
I cannot test what you are doing, but here's a hint:
IEnumerable<string> readInLines = File.ReadLines(sourcefile);
bool started = false;
List<string> results = new List<string>();
foreach(var line in readInLines)
{
// skip empty lines
if(emptyElement(line))
continue;
// when dashes are found, flip a boolean to activate the reading mode.
if(hyphens72(line))
{
// flip state.. (start/end)
started = !started;
}
if(started)
{
// I don't know what you are doing here precisely, do what you gotta do. ;-)
results.AddRange((line.Split('|').Select(p => p.Trim()).ToList()));
string holderCP = string.Join(Environment.NewLine, readInLines, holderCPStart, (blankLine - holderCPStart - 1)).Trim();
results.Add(holderCP);
string comment = string.Join(" ", readInLines, blankLine + 1, (endDiv - (blankLine + 1)));//in case the comment is more than one line long
results.Add(comment);
}
}
foreach (string result in results)
{
Console.WriteLine(result);
}
You might want to start with a class like this. I don't know whether each section begins with a row of hyphens, or if it's just in between. This should handle either scenario.
What this is going to do is take your giant list of strings (the lines in the file) and break it into chunks - each chunk is a set of lines (10 or so lines, according to your OP.)
The reason is that it's unnecessarily complicated to try to read the file, looking for the hyphens, and process the contents of the file at the same time. Instead, one class takes the input and breaks it into chunks. That's all it does.
Another class might read the file and pass its contents to this class to break them up. Then the output is the individual chunks of text.
Another class can then process those individual sections of 10 or so lines without having to worry about hyphens or what separates one chunk from another.
Now that each of these classes is doing its own thing, it's easier to write unit tests for each of them separately. You can test that your "processing" class receives an array of 10 or so lines and does whatever it's supposed to do with them.
public class TextSectionsParser
{
private readonly string _delimiter;
public TextSectionsParser(string delimiter)
{
_delimiter = delimiter;
}
public IEnumerable<IEnumerable<string>> ParseSections(IEnumerable<string> lines)
{
var result = new List<List<string>>();
var currentList = new List<string>();
foreach (var line in lines)
{
if (line == _delimiter)
{
if(currentList.Any())
result.Add(currentList);
currentList = new List<string>();
}
else
{
currentList.Add(line);
}
}
if (currentList.Any() && !result.Contains(currentList))
{
result.Add(currentList);
}
return result;
}
}
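As a variation on the class above (not a replacement for it), here is a hedged sketch of a lazy version written as C# 9 top-level statements. SplitSections is a hypothetical name; it yields each section as soon as its closing delimiter is seen, so it could be fed from File.ReadLines without holding the whole file. The sample rows are invented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lazy variant: yields a section once its closing delimiter
// (or the end of the input) is reached. Empty sections are skipped.
static IEnumerable<List<string>> SplitSections(IEnumerable<string> lines, string delimiter)
{
    var current = new List<string>();
    foreach (var line in lines)
    {
        if (line == delimiter)
        {
            if (current.Count > 0)
                yield return current;
            current = new List<string>();
        }
        else
        {
            current.Add(line);
        }
    }
    if (current.Count > 0)
        yield return current;   // trailing section without a closing delimiter
}

// Invented sample input: 72 hyphens as the divider, as in the question.
string divider = new string('-', 72);
var sample = new[] { divider, "r1 | alice", "Changed paths:", "", "first comment", divider, "r2 | bob", divider };

foreach (var section in SplitSections(sample, divider))
    Console.WriteLine($"{section.Count} line(s), header: {section[0]}");
```

Each yielded list can then be handed to whatever class parses a single section, which keeps the "find the dividers" and "interpret a section" concerns separate, as described above.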
I have a C# console application which parses a .txt file. The txt file has 4 values on each line. So here are a couple of samples:
c:\ecpg\myfolder\no_space.cfm 20160803 01:09:54 1574
c:\ecpg\myfolder\file with space.cfm 20160803 01:09:54 1574
c:\myfolder\.project 20170221 07:54:10 265
I am using the following to split based on white spaces in each row:
while ((line = file.ReadLine()) != null)
{
string[] parts = line.Split(new char[0], StringSplitOptions.RemoveEmptyEntries);
}
Problem is that, in case of row 2, there is a space in the file name and so that's failing the parsing because now I have 5 values instead of 4. How can I prevent this? Maybe some way to detect if there is a . (dot) soon after the space?
Thank you!
You can use Regex to split your string, it will give you better output. Please check my code:
while ((line = file.ReadLine()) != null)
{
string[] parts = Regex.Split(line, @"(\s+\s+)");
}
I've also put it on DotNetFiddle, where you can check it.
EDIT: I've edited the code so that it covers all of your scenarios. New Solution Fiddle
while ((line = file.ReadLine()) != null)
{
string partOne = Regex.Match(line, @"[a-z](.*)[a-z]").Value;
//string[] parts = Regex.Split(line.Replace(partOne, ""), @"(\s+)");
string[] parts;
if (!string.IsNullOrEmpty(partOne))
{
parts = Regex.Split(line.Replace(partOne, ""), @"(\s+)");
}
else
{
parts = Regex.Split(line, @"(\s+)");
}
}
}
Final Code:
List<string> parts = new List<string>();
while ((line = file.ReadLine()) != null)
{
parts = new List<string>();
//string partOne = Regex.Match(line, @"[A-Za-z](.*)[A-Za-z]").Value;
// Updated regex to handle numeric values in part one.
string partOne = Regex.Match(line, @"[A-Za-z](.*)([A-Za-z]|([A-Za-z]{1}[0-9]))(.*?)\s").Value.Trim();
parts.Add(partOne);
string[] finalParts;
if (!string.IsNullOrEmpty(partOne))
{
finalParts = Regex.Split(line.Replace(partOne, ""), @"(\s+)");
}
else
{
finalParts = Regex.Split(line, @"(\s+)");
}
foreach (string part in finalParts)
{
if (!string.IsNullOrEmpty(part.Trim()))
{
parts.Add(part);
}
}
Console.WriteLine(parts[0] + " " + parts[1] + " " + parts[2] + " " + parts[3]);
}
This method is manual but works. It supports filenames with any number of spaces.
It works by locating spaces from the end of the string, retrieving three fields in the loop and finally the filename. There's plenty of room for optimization here if you're parsing large files.
while ((line = file.ReadLine()) != null)
{
string[] parts = new string[4];
int n = -1;
for (int idx = 0; idx < 3; idx++)
{
n = line.LastIndexOf(' ');
parts[3-idx] = line.Substring(n + 1);
line = line.Substring(0, n).TrimEnd();
}
parts[0] = line; // filename
}
If one or more of the fields are missing you can do simple pattern checks. In your file the first parameter is the filename, the second an 8-digit date, the third the time of day and the fourth (probably) the file size. In this case this code should be more robust (I didn't try compiling it so it might contain typos):
while ((line = file.ReadLine()) != null)
{
string[] parts = new string[4];
int n = -1;
for (int idx = 0; idx < 3; idx++)
{
n = line.LastIndexOf(' ');
if (n == -1 || n == 0) break;
string part = line.Substring(n + 1);
if (part.IndexOf(':') > 0) parts[2] = part;
else if (part.Length == 8) parts[1] = part;
else parts[3] = part; // assuming you don't have 8-digit filesizes
line = line.Substring(0, n).TrimEnd();
}
parts[0] = line.TrimEnd(); // filename
}
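The pattern checks described above can also be expressed as a single regular expression anchored at the end of the line. This is only a sketch, written as C# 9 top-level statements: ParseRow is a hypothetical helper, and the pattern assumes exactly the layout seen in the question's samples (path, 8-digit date, hh:mm:ss time, numeric size).

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed layout: <path, may contain spaces> <8-digit date> <hh:mm:ss> <size>.
// The lazy (.+?) grows until the three trailing fields match up to the end of the line.
var rowPattern = new Regex(@"^(.+?)\s+(\d{8})\s+(\d{2}:\d{2}:\d{2})\s+(\d+)\s*$");

// Hypothetical helper: returns the four fields, or null when the row does not match.
string[] ParseRow(string line)
{
    Match m = rowPattern.Match(line);
    if (!m.Success)
        return null;
    return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value, m.Groups[4].Value };
}

string[] parts = ParseRow(@"c:\ecpg\myfolder\file with space.cfm 20160803 01:09:54 1574");
Console.WriteLine(parts[0]);   // prints c:\ecpg\myfolder\file with space.cfm
```

Rows with a missing field return null instead of being silently misassigned, which may or may not be what you want; the loop-based version above tolerates missing fields instead.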
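Split on the period instead. That will give you two separate strings: the file path up to the extension, and the rest. Split only the second string on spaces. The very first element of that second split is your file extension. Note that this only works when the path contains exactly one period and the extension contains no spaces: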
while ((line = file.ReadLine()) != null)
{
string[] parts = line.Split('.');
string[] secondSplit = parts[1].Split(' ');
// put together the file path
string filePath = parts[0] + "." + secondSplit[0];
// Do something here with the rest of the second split: secondSplit
}