Removing consecutive blank rows from StringBuilder - c#

I have a StringBuilder object that has been populated from a text file.
How can I check the StringBuilder object for consecutive "blank" lines and remove them?
For example:
Line 1: This is my text
Line 2:
Line 3: Another line after the 1st blank one
Line 4:
Line 5:
Line 6: Next line after 2 blank lines
(Line numbers given as reference only)
The blank line on Line 2 is fine, but I would like to remove the duplicate blank line on Line 5, and so on.
If, for argument's sake, Line 6 had also been a blank line and Line 7 had a value, I would like blank Lines 5 and 6 removed, so that there would be only one blank line between Line 3 and Line 7.
Thanks in advance.

Do you have to already have the file contents in a StringBuilder?
It would be nicer to be able to read line-by-line. Something like:
private IEnumerable<string> GetLinesFromFile(string fileName)
{
using (var streamReader = new StreamReader(fileName))
{
string line = null;
bool previousLineWasBlank = false;
while ((line = streamReader.ReadLine()) != null)
{
// Skip a line only when it is blank and the previous line was blank too.
if (!(previousLineWasBlank && string.IsNullOrEmpty(line)))
{
yield return line;
}
previousLineWasBlank = string.IsNullOrEmpty(line);
}
}
}
Now you can read in your text (which has had dupe blank lines removed) like this:
foreach (var line in GetLinesFromFile("myFile.txt"))
{
Console.WriteLine(line);
}
Note: I'm only illustrating a technique here. There are other considerations: e.g. my iterator method holds the file open while the consumers are processing the foreach. This is nice and memory efficient (more so than reading into a string for example) as you are only dealing with one line at a time, but not ideal for files that take a long time to process.

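Probably not very efficient, but it's easy. A single blank line shows up as two consecutive newlines, so the loop collapses runs of three newlines into two, which keeps one blank line.
string twoNewLines = Environment.NewLine + Environment.NewLine;
string threeNewLines = twoNewLines + Environment.NewLine;
while (sb.ToString().Contains(threeNewLines))
{
sb.Replace(threeNewLines, twoNewLines); // Replace modifies sb in place
}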

StringBuilder is a lot less flexible when it comes to searching for and removing text. It exists mainly to speed up repeated concatenation, because building a large string with "string" + "another string" in a loop is costly.
I would suggest calling .ToString() and then using Regex.Replace with a compiled regular expression, with flags set to allow multiline matching.
You'll probably want a search pattern of:
(\n[\w-\n]*\n)
And you replace it with the empty string.
Check out Expresso for a great .NET Regular expression tool.
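As a rough sketch of the ToString-then-Regex.Replace idea: the pattern below is my own assumption rather than the one suggested above, and it treats a line as "blank" if it contains only spaces or tabs. The class and method names are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class BlankLineCollapser
{
    // One line break followed by two or more whitespace-only lines.
    private static readonly Regex ExtraBlankLines =
        new Regex(@"(\r?\n)(?:[ \t]*\r?\n){2,}", RegexOptions.Compiled);

    // Returns a new StringBuilder in which every run of blank lines
    // between two text lines is reduced to a single blank line.
    public static StringBuilder Collapse(StringBuilder sb)
    {
        // "$1$1" re-emits the captured line break twice: one to end the
        // text line and one for the single blank line that is kept.
        string collapsed = ExtraBlankLines.Replace(sb.ToString(), "$1$1");
        return new StringBuilder(collapsed);
    }
}
```

Blank lines at the very start of the text are not preceded by a text line, so this sketch can leave two of them there; trim the start separately if that matters.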

Related

How to search string in large text file?

I want to get the line that contains a certain unique value, such as a profile ID, without looping over the file and reading each line separately. If the value I am looking for is on the last line of the text file, the search takes a long time, and I expect it to take even longer if I need to search for several values and extract the line that contains each one.
Example line from the text file:
name,id,image,age,place,link
string word = "13215646";
string output = string.Empty;
using (var fileStream = File.OpenRead(FileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
String line;
while ((line = streamReader.ReadLine()) != null)
{
string[] strList = line.Split(',');
if (word == strList[1]) // check if word = id
{
output = line;
break;
}
}
}
You can use this to search the file:
var output = File.ReadLines(FileName).
Where(line => line.Split(',')[1] == word).
FirstOrDefault();
But it won't solve this:
if the word I am looking for is in the last line of the text file, this will take a lot of time to get it, and if the search process is for more than one word and extract the line that contains it, I think it will take a lot of time.
There's not a practical way to avoid this for a basic file.
The only ways to avoid actually reading through the file are to maintain an index, which requires absolute control over everything that might write to the file, or to guarantee that the file is already sorted by the columns that matter, in which case you can do something like a binary search.
But neither is likely for a random csv file. This is one of the reasons people use databases.
However, we also need to stop and check whether this is really a problem for you. I'd expect the code above to handle files up to a couple hundred MB in around 1 to 2 seconds on modern hardware, even if you need to look through the whole file.
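If you need to look up many ids in the same run, one option is to pay for a single pass and keep the result in memory. This is only a sketch of that idea, not the persistent on-disk index described above: it assumes the id is always the second column, that no field contains a comma, and that the file fits comfortably in memory. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class IdIndex
{
    // Reads the lines once and maps the second column (the id) to its full line.
    public static Dictionary<string, string> Build(IEnumerable<string> lines)
    {
        var index = new Dictionary<string, string>();
        foreach (var line in lines)
        {
            var parts = line.Split(',');
            if (parts.Length > 1)
                index[parts[1]] = line; // the last occurrence wins if an id repeats
        }
        return index;
    }
}

// Usage (FileName is the path from the question):
// var index = IdIndex.Build(File.ReadLines(FileName));
// if (index.TryGetValue("13215646", out var line)) { /* use line */ }
```

Each lookup after the initial pass is a dictionary access, so searching for a second or third id does not re-read the file.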
You can optimise the code. Here are a few ideas:
var ids = new HashSet<string> { "13215646", "113" }; // needs using System.Collections.Generic;
foreach (var line in File.ReadLines(FileName))
{
var id = line.Split(',', count: 3)[1]; // Optimization 1: `count: 3` stops splitting after the id column
if (ids.Contains(id)) // Optimization 2: search for multiple ids in one pass
{
// Do what you need with the line
}
}

Reading a specific line from a csv file and letting the line vary

I need to read one line from a CSV file, and the line number needs to be variable. I can declare an int, double, string, etc. and change its value from an outer program easily, but I don't know how to make a file reader script take one of those as the input for the line number.
string GetLine(string lineresults, int LineNumber)
{
using (var sr = new StreamReader(lineresults)) {
for (int i = 1; i < line; i++)
sr.ReadLine();
return sr.ReadLine();
}
}
And I get errors on the GetLine part saying that semicolons and closing parentheses are expected.
If you want random access you can read all lines and store them in an array with File.ReadAllLines (remember that array indexes start at 0, so a LineNumber of 0 is the first line):
string[] allLines = File.ReadAllLines(pathToFile);
string line = allLines[LineNumber]; // throws if there are fewer lines; check allLines.Length first
Another more efficient approach is to use File.ReadLines which lazy loads the lines, then use Enumerable.ElementAt or ElementAtOrDefault to access the line number:
var lines = File.ReadLines(pathToFile);
string line = lines.ElementAtOrDefault(LineNumber); // null if there are fewer lines
Note that this still reads the file from the start until the requested line or the end of the file is reached.
MSDN:
The ReadLines and ReadAllLines methods differ as follows: When you use
ReadLines, you can start enumerating the collection of strings before
the whole collection is returned; when you use ReadAllLines, you must
wait for the whole array of strings be returned before you can access
the array. Therefore, when you are working with very large files,
ReadLines can be more efficient
As @Alex K commented, simply read all the lines into an array and then get the line you are after.
var lines = System.IO.File.ReadAllLines( filename);
var line = lines[ lineIndex ];
This NuGet package is super-duper helpful for working with CSVs. You can grab individual lines from it, and individual columns by either name or index. Check out info here:
Josh Close - CsvHelper

Searching for a phrase in text file then displaying that line

I'm trying to write code that searches a text file for a certain phrase and then populates a textbox with each line that contains the phrase. There are no errors with this code, but it doesn't work at all. Does anyone know what is wrong? I'm not sure whether what I'm doing is even remotely correct.
{
tuitDisplayTextBox.Text = "";
string[] tuitFilePath = File.ReadAllLines(Server.MapPath("~") +"/App_Data/tuitterMessages.txt");
for (int i = 0; i < tuitFilePath.Length; i++)
{
if (tuitFilePath[i].Contains(searchTextBox.Text))
{
tuitDisplayTextBox.Text += tuitFilePath[i];
}
}
Your solution should work... for the last line that matches, and only that one.
LINQ can help you here, though. Here's a solution that should work.
tuitDisplayTextBox.Text =
File.ReadLines(Server.MapPath("~") + "/App_Data/tuitterMessages.txt")
.Where(n => n.Contains(searchTextBox.Text))
.DefaultIfEmpty(string.Empty) // Aggregate throws on an empty sequence
.Aggregate((a, b) => a + Environment.NewLine + b);
What this does is read the lines of the file into an IEnumerable<string>, which I then filter with the Where method; that basically means "if the condition is true for this element, add it to the list of things to return, otherwise don't". Aggregate is a bit more complicated. It takes the first two items from the collection and passes them to a lambda that returns a value. It then calls the lambda again with that result and the third element, then with that result and the fourth element, and so on.
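To make the Aggregate step concrete, here is a small sketch that joins a few strings both ways. The class and method names are hypothetical. Two things I am relying on: Aggregate without a seed throws InvalidOperationException on an empty sequence, which is why DefaultIfEmpty is there, and string.Join gives the same joined text with less ceremony.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class JoinDemo
{
    // Folds the items pairwise: ((a + NL + b) + NL + c) and so on.
    public static string JoinWithAggregate(IEnumerable<string> items)
    {
        return items.DefaultIfEmpty(string.Empty) // avoids the exception on an empty sequence
                    .Aggregate((a, b) => a + Environment.NewLine + b);
    }

    // Same result, using the built-in helper.
    public static string JoinWithStringJoin(IEnumerable<string> items)
    {
        return string.Join(Environment.NewLine, items);
    }
}
```

For three items "a", "b", "c", Aggregate first combines "a" and "b", then combines that result with "c".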
Here's some code more similar to yours that will also work:
tuitDisplayTextBox.Text = "";
IEnumerable<string> lines =
File.ReadAllLines(Server.MapPath("~") +"/App_Data/tuitterMessages.txt");
StringBuilder sb = new StringBuilder();
foreach (string line in lines)
{
if (line.Contains(searchTextBox.Text))
{
sb.AppendLine(line);
}
}
tuitDisplayTextBox.Text = sb.ToString();
Here it's a bit different. First it reads all the lines into an IEnumerable<string> called lines. Then it makes a StringBuilder object (basically a mutable string). After that, it foreaches the lines in the IEnumerable<string> (I thought it was more appropriate here) and then if the line contains the text you want, it adds that line and a newline to the StringBuilder object. After that, it sets your textbox's text to the result of all of that, by getting the string representation of the StringBuilder instance.
And if you really want a for loop, here's the code modified to use a for loop:
tuitDisplayTextBox.Text = "";
string[] lines =
File.ReadAllLines(Server.MapPath("~") +"/App_Data/tuitterMessages.txt");
StringBuilder sb = new StringBuilder();
for (int i = 0; i < lines.Length; i++)
{
if (lines[i].Contains(searchTextBox.Text))
{
sb.AppendLine(lines[i]);
}
}
tuitDisplayTextBox.Text = sb.ToString();
Please note that File.ReadAllLines breaks the text into lines at '\r' or '\n'.
So if you search for "hello world" and that text is split across two lines in the file (e.g. "... hello \n world"), your code will fail to find it.
In that case, use ReadAllText() instead, which returns a single string containing all of the file's text.
You might still run into problems with file encoding, but that is a separate issue.
If you do find the text you are searching for, you can then use ReadAllLines to work out where it is located.
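A minimal sketch of the ReadAllText idea, assuming that any run of whitespace (including a line break) should count as a single space when matching. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhraseSearch
{
    // Returns true if the phrase occurs in the text when every run of
    // whitespace, line breaks included, is treated as one space.
    public static bool ContainsAcrossLines(string text, string phrase)
    {
        string normalizedText = Regex.Replace(text, @"\s+", " ");
        string normalizedPhrase = Regex.Replace(phrase, @"\s+", " ");
        return normalizedText.Contains(normalizedPhrase);
    }
}

// Usage (path is a hypothetical variable holding the file path):
// bool found = PhraseSearch.ContainsAcrossLines(File.ReadAllText(path), searchTextBox.Text);
```

This only tells you whether the phrase occurs; it does not tell you which line it starts on.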

Splitting strings using Environment.NewLine leaves \n in most array items?

I used MyString.Split(Environment.NewLine.ToCharArray()[0]) to split my string from a file into different pieces. But after I did that, every item in the array except the first one starts with \n. I know the way that I'm splitting by newlines is kind of "cheaty", for lack of a better word, so if there is a better way of doing this, please tell me...
Here is the file...
If you want to keep using .Split() instead of reading the file in a line at a time, you can do:
var splitResult = MyString.Split( new string[]{ System.Environment.NewLine },
System.StringSplitOptions.RemoveEmptyEntries );
/* or System.StringSplitOptions.None if you want empty results as well */
EDIT:
The problem you were having is that in a non-unix environment the new-line "character" is actually two characters. So when you grabbed the zero index you were actually splitting on a carriage return...not the new-line character (\n).
Windows = "\r\n"
Unix = "\n"
Per http://msdn.microsoft.com/en-us/library/system.environment.newline.aspx
A newline in Windows is two characters (\r and \n). The Environment.NewLine.ToCharArray()[0] expression specifies only one of those characters: \r. Therefore, the other character (\n) remains as a portion of the split string.
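If the text might come from either Windows or Unix, a small sketch like the following splits on both kinds of line break by passing whole strings as separators rather than single characters. The class and method names are hypothetical.

```csharp
using System;

static class LineSplitter
{
    // Splits on "\r\n" and "\n" as whole separators, so no stray '\r'
    // or '\n' is left at the start or end of an item.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
    }
}
```

A lone '\r' (old Mac style) is not handled by this sketch and would stay inside the item.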
May I suggest you read your file using something like this:
public IEnumerable<string> ReadFile(string filePath)
{
using (StreamReader reader = new StreamReader(filePath))
{
string line;
while ( (line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
You might need more error handling, or to specify different file open option, or to pass a stream to method rather than the path, but the idea of using an iterator over the ReadLine() method is sound. The result is you can just use code like this:
foreach (string line in ReadFile(" ... my file path ... "))
{
}

How do I handle line breaks in a CSV file using C#?

I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:
"John","23","555-5555"
"Peter","24","555-5
555"
"Mary,"21","555-5555"
When I read the CSV file, if a record does not start with a double quote ("), then a line break has been inserted by mistake and I have to remove it. I have some CSV reader classes from the internet, but I am concerned that they will fail on the line breaks.
How should I handle these line breaks?
Thanks everybody very much for your help.
Here's is what I've done so far. My records have fixed format and all start with
JTW;...;....;...;
JTW;...;...;....
JTW;....;...;..
..;...;... (wrong record, line break inserted)
JTW;...;...
So I checked for the ; at index [3] of each line. If it is there, I write the line as a new record; if not, I append the line to the previous one (removing the line break).
I'm having problems now because I'm saving the file as a txt.
By the way, I am converting the Excel spreadsheet to csv by saving as csv in Excel. But I'm not sure if the client is doing that.
So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to csv, and I would really like to do it in the program. Does anybody know how?
Here is my code:
namespace EditorCSV
{
class Program
{
static void Main(string[] args)
{
ReadFromFile("c:\\source.csv");
}
static void ReadFromFile(string filename)
{
StreamReader SR;
StreamWriter SW;
SW = File.CreateText("c:\\target.csv");
string S;
char C='a';
int i=0;
SR=File.OpenText(filename);
S=SR.ReadLine();
SW.Write(S);
S = SR.ReadLine();
while(S!=null)
{
try { C = S[3]; }
catch (IndexOutOfRangeException exception){
bool t = false;
while (t == false)
{
t = true;
S = SR.ReadLine();
try { C = S[3]; }
catch (IndexOutOfRangeException ex) { S = SR.ReadLine(); t = false; }
}
}
if( C.Equals(';'))
{
SW.Write("\r\n" + S);
i = i + 1;
}
else
{
SW.Write(S);
}
S=SR.ReadLine();
}
SR.Close();
SW.Close();
Console.WriteLine("Records Processed: " + i.ToString() + " .");
Console.WriteLine("File Created Successfully");
Console.ReadKey();
}
}
}
CSV has predefined ways of handling that. This site provides an easy to read explanation of the standard way to handle all the caveats of CSV.
Nevertheless, there is really no reason to not use a solid, open source library for reading and writing CSV files to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.
Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.
Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.
I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
There is a built-in method for reading CSV files in .NET (requires Microsoft.VisualBasic assembly reference added):
public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
{
var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader);
parser.SetDelimiters(separators);
while (!parser.EndOfData)
yield return parser.ReadFields();
}
If you're dealing with really large files this CSV reader claims to be the fastest one you'll find: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):
private void Parse(TextReader reader)
{
var row = new List<string>();
var isStringBlock = false;
var sb = new StringBuilder();
long charIndex = 0;
int currentLineCount = 0;
while (reader.Peek() != -1)
{
charIndex++;
char c = (char)reader.Read();
if (c == '"')
isStringBlock = !isStringBlock;
if (c == separator && !isStringBlock) //end of word
{
row.Add(sb.ToString().Trim()); //add word
sb.Length = 0;
}
else if (c == '\n' && !isStringBlock) //end of line
{
row.Add(sb.ToString().Trim()); //add last word in line
sb.Length = 0;
//DO SOMETHING WITH row HERE!
currentLineCount++;
row = new List<string>();
}
else
{
if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
}
}
row.Add(sb.ToString().Trim()); //add last word
//DO SOMETHING WITH LAST row HERE!
}
Try CsvHelper (a library I maintain). It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.
Heed the advice from the experts and Don't roll your own CSV parser.
Your first thought is, "How do I handle new line breaks?"
Your next thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free CsvHelper library.
Maybe you could count the (") characters during the ReadLine(). If the count is odd, that raises the flag. You could either ignore those lines, or get the next two and eliminate the first "\n" occurrence in the merged lines.
What I usually do is read the text in character by character, as opposed to line by line, because of this very problem.
As you're reading each character, you should be able to figure out where each cell starts and stops, but also the difference between a line break that ends a row and one inside a cell: if I remember correctly, for Excel-generated files anyway, rows end with \r\n, and newlines in cells are only \n.
There is an example parser in C# that seems to handle your case correctly. Then you can read your data in and purge the line breaks out of it post-read.
Part 2 is the parser, and there is a Part 1 that covers the writer portion.
Read the line.
Split into columns(fields).
If you have enough columns expected for each line, then process.
If not, read the next line, and capture the remaining columns until you get what you need.
Repeat.
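The steps above can be sketched roughly as follows. This assumes the separator never appears inside a field value and that the stray line break should simply be dropped; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class RecordMerger
{
    // Joins physical lines until each record has at least expectedColumns fields.
    public static IEnumerable<string> MergeByColumnCount(
        IEnumerable<string> lines, char separator, int expectedColumns)
    {
        string pending = null;
        foreach (var line in lines)
        {
            // Append the new line to whatever is left over from a short record.
            pending = pending == null ? line : pending + line;
            if (pending.Split(separator).Length >= expectedColumns)
            {
                yield return pending;
                pending = null;
            }
        }
        if (pending != null)
            yield return pending; // incomplete trailing record, returned as-is
    }
}
```

One limitation of counting columns: a line break inside the last field goes undetected, because the first half of the record already has enough columns.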
A somewhat simple regular expression could be used on each line. When it matches, you process each field from the match. When it doesn't find a match, you skip that line.
The regular expression could look something like this.
// needs using System.Text.RegularExpressions;
Match match = Regex.Match(line, @"^(?:,?(?:(?<q>['""])(?<field>.*?)\k<q>|(?<field>[^,]*)))+$");
if (match.Success)
{
foreach (Capture capture in match.Groups["field"].Captures)
{
string fieldValue = capture.Value;
// Use the value.
}
}
Have a look at FileHelpers Library
It supports reading\writing CSV with line breaks as well as reading\writing to excel
The LINQy solution:
string csvText = File.ReadAllText("C:\\Test.txt");
var query = csvText
.Replace(Environment.NewLine, string.Empty)
.Replace("\"\"", "\",\"").Split(',')
.Select((i, n) => new { i, n }).GroupBy(a => a.n / 3);
You might also check out my CSV parser SoftCircuits.CsvParser on NuGet. It will not only parse a CSV file but--if wanted--can also automatically map column values to your class properties. And it runs nearly four times faster than CsvHelper.
For a line break to exist inside a CSV record, there must be an opening double quote that has not been closed.
Assuming that every CSV cell opens and closes with a double quote, just check whether there is an odd number of quotation marks
my_string.Count(c => c == '"') % 2 == 1
and if that's the case, continue reading until you have an even number.
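A minimal sketch of that quote-counting check, assuming every field is wrapped in double quotes and that the stray line break should simply be dropped when the pieces are rejoined. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class QuotedRecordReader
{
    // Keeps appending physical lines until the quote count is even,
    // then yields the merged text as one record.
    public static IEnumerable<string> MergeBrokenRecords(IEnumerable<string> lines)
    {
        string pending = null;
        foreach (var line in lines)
        {
            pending = pending == null ? line : pending + line;
            // An odd number of quotes means a quoted field is still open.
            if (pending.Count(c => c == '"') % 2 == 0)
            {
                yield return pending;
                pending = null;
            }
        }
        if (pending != null)
            yield return pending; // unbalanced trailing text, returned as-is
    }
}

// Usage (path is a hypothetical variable holding the file path):
// foreach (var record in QuotedRecordReader.MergeBrokenRecords(File.ReadLines(path))) { ... }
```

Escaped quotes written as "" contribute two characters, so they do not upset the odd/even test.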
