Regex to remove trailing whitespace and multiple blank lines - c#

I'd like regex expressions to use in a Visual Studio 2013 extension written in C#.
I'm trying to remove trailing whitespace from each line while preserving empty lines. I'd also like to remove multiple empty lines. The existing line endings should be preserved (generally carriage return + line feed).
So the following text (spaces shown as underscores):
hello_world__
___hello_world_
__
__
hello_world
Would become:
hello_world
___hello_world
hello_world
I've tried a number of different patterns to remove the trailing spaces but I either end up not matching the trailing spaces or losing the carriage returns. I haven't yet tried to remove the multiple empty lines.
Here's a couple of the patterns I've tried so far:
\s+$
(?<=\S)\s+$

Thanks for the answers so far. None of them are quite right for what I need, but they've helped me come up with what I needed. I think the issue is that there are some oddities with regex in VS2013 (see Using Regular Expressions in Visual Studio). These two operations work for me:
Replace \ +(?=(\n|\r?$)) with nothing.
Replace ^\r?$(\n|\r\n){2,} with \r\n.

To remove multiple blank lines together with any whitespace on them, use
(?:\r\n[\s-[\rn]]*){3,}
and replace with \r\n\r\n.
And to remove the remaining trailing whitespace, you can use
(?m)[\s-[\r]]+\r?$

\ +(?=(\n|$))
This matches one or more spaces, with a lookahead checking that either a newline or the end of the line follows (the end of the line covers the last characters in your string/text). Multiline and global mode need to be enabled, of course.

Just as a punt, without the use of Regex, you could always split the document by its end of line marker and then rebuild it using TrimEnd (as highlighted by Anton Semenov)...
(Assuming a text document read into a string...)
// Ascertain the linefeed...
string str = "This is a test \r\nto see if I can force \ra string to be broken \non multiple lines \r\n into an array.";
// Note: RemoveEmptyEntries drops blank lines entirely; use StringSplitOptions.None to keep them.
string[] t = str.Split(new string[] { "\r\n", "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries);
thediv.InnerHtml = str + "<br /><br />";
foreach (string s in t)
{
    thediv.InnerHtml += s.TrimEnd() + "<br />";
}
I haven't timed this at all, but if you prefer to avoid the complications of Regex (which I do if I can - see below*), you should find this fast enough to do what you want.
* I avoid Regex if I can. That doesn't mean that I don't use it. Regex has its place, but I believe it to be a last resort tool for involved jobs, for instance complex flexible strings that adhere to a format - something where the alternative will generate large amounts of code. Keeping Regex to an absolute minimum aids the readability of your code.
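As a minimal sketch of the same split-and-TrimEnd idea outside a web page: the class and method names below are made up for illustration. It keeps blank lines by using StringSplitOptions.None and rejoins with \r\n, which assumes CRLF output is wanted, so mixed line endings are normalized rather than preserved.

```csharp
using System;
using System.Linq;

public static class LineTrimmer
{
    // Splits on CRLF, CR or LF, trims the end of every line, and rejoins.
    // Assumption: the result should use CRLF for every line break.
    public static string TrimLineEnds(string text)
    {
        string[] lines = text.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
        return string.Join("\r\n", lines.Select(line => line.TrimEnd()));
    }

    public static void Main()
    {
        string input = "a  \r\nb\t\n\nc ";
        string output = TrimLineEnds(input);
        // Show the line breaks as visible escapes.
        Console.WriteLine(output.Replace("\r", "\\r").Replace("\n", "\\n"));
    }
}
```

Listing "\r\n" before "\r" and "\n" in the separator array matters: when several separators match at the same position, Split takes the first one in the array, so a CRLF pair is treated as one break rather than two.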

As separate operations -
Remove trailing whitespace (any line): (?m)[^\S\r\n]+$
Remove trailing whitespace (only lines with text): (?m)(?<=\S)[^\S\r\n]+$
Remove duplicate blank lines (along with whitespace trim):
# Find: (?>\A(?:[^\S\r\n]*\r\n)+)|(?>\r\n(?:[^\S\r\n]*(\r\n)){2,})
# Replace: $1\r\n
(?>
\A
(?: [^\S\r\n]* \r \n )+
)
|
(?>
\r \n
(?:
[^\S\r\n]*
( \r \n ) # (1)
){2,}
)
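The two separate operations can be sketched in C# roughly as follows. This is not the exact pattern set above but a variant: in .NET, $ in multiline mode matches only before \n (or at the end of the string), so the trailing-whitespace pattern here uses a lookahead (?=\r?$) to cope with CRLF. The class and method names are hypothetical, and the sketch assumes that runs of blank lines should collapse to a single blank line and that a lone \r is not used as a line ending.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCleaner
{
    // Horizontal whitespace followed by an optional \r and the end of a line.
    private static readonly Regex TrailingWhitespace =
        new Regex(@"[^\S\r\n]+(?=\r?$)", RegexOptions.Multiline);

    // Three or more consecutive line breaks, i.e. two or more blank lines.
    private static readonly Regex ExtraBlankLines =
        new Regex(@"(\r?\n)(?:\r?\n){2,}");

    public static string Clean(string text)
    {
        // Step 1: strip trailing spaces/tabs, which also empties whitespace-only lines.
        string trimmed = TrailingWhitespace.Replace(text, "");
        // Step 2: $1$1 reuses the first captured line break, so CRLF stays CRLF.
        return ExtraBlankLines.Replace(trimmed, "$1$1");
    }

    public static void Main()
    {
        string input = "hello world  \r\n   hello world \r\n  \r\n  \r\nhello world";
        string output = Clean(input);
        // Show the line breaks as visible escapes.
        Console.WriteLine(output.Replace("\r", "\\r").Replace("\n", "\\n"));
    }
}
```

Running the trim first keeps the second pattern simple, because by then blank lines contain nothing but their line break. To drop blank lines entirely instead of keeping one, the replacement could be "$1" and the quantifier {1,}.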

\s also matches the line feed, so I would search for just multiple spaces instead. I do not know the specifics of VS, but this should hopefully do it:
[" "]*?$

Related

Matching multiple different start and end of line in one multiline regex [duplicate]

I need to remove lines that match a particular pattern from some text. One way to do this is to use a regular expression with the begin/end anchors, like so:
var re = new Regex("^pattern$", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works fine except that it leaves an empty line instead of removing the entire line (including the line break).
To solve this, I added an optional capturing group for the line break, but I want to be sure it includes all of the different flavors of line breaks, so I did it like so:
var re = new Regex(@"^pattern$(\r\n|\r|\n)?", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works, but it seems like there should be a more straightforward way to do this. Is there a simpler way to reliably remove the entire line including the ending line break (if any)?
To match any single line break sequence you may use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029]) pattern. So, instead of (\r\n|\r|\n)?, you can use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?.
Details:
000A - a newline, \n
000B - a line tabulation char
000C - a form feed char
000D - a carriage return, \r
0085 - a next line char, NEL
2028 - a line separator char
2029 - a paragraph separator char
If you want to remove any 0+ non-horizontal (i.e. vertical) whitespace chars after a matched line, you may use [\s-[\p{Zs}\t]]*: any whitespace (\s) except (-[...]) horizontal whitespace (matched with [\p{Zs}\t]). Note that the \p{Zs} Unicode category class does not match tab chars, because the tab is a control character (category Cc), hence the explicit \t.
One more aspect must be dealt with here since you are using the RegexOptions.Multiline option: it makes $ match before a newline (\n) or end of string. That is why if your line endings are CRLF the pattern may fail to match. Hence, add an optional \r? before $ in your pattern.
So, either use
@"^pattern\r?$(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?"
or
@"^pattern\r?$[\s-[\p{Zs}\t]]*"
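A small sketch of the first variant in use, under the assumption that the line body is supplied as a regex fragment; the class, the method and the linePattern parameter are hypothetical names, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineRemover
{
    // Any single line break sequence, as listed in the details above.
    private const string LineBreak = @"(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])";

    // linePattern is a regex fragment describing the whole line to delete.
    public static string RemoveLines(string text, string linePattern)
    {
        // \r? before $ lets the match succeed on CRLF text, where $ sits before \n.
        var re = new Regex("^(?:" + linePattern + @")\r?$" + LineBreak + "?",
                           RegexOptions.Multiline);
        return re.Replace(text, "");
    }

    public static void Main()
    {
        string initial = "keep\r\ndrop\r\nkeep2\ndrop\nend";
        string final = RemoveLines(initial, "drop");
        // Show the line breaks as visible escapes.
        Console.WriteLine(final.Replace("\r", "\\r").Replace("\n", "\\n"));
    }
}
```

Wrapping the fragment in (?:...) keeps an alternation inside linePattern from escaping the ^ and $ anchors.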

Regex match pattern

I am trying to count the number of times \n\n\n (3 line breaks) appears after some text. It counts almost as I'd like; the problem is that when newlines are spammed, those runs get counted too, which I don't want.
Edit: Seems like the regex from regexr does not support .net so I have to come up with a pattern that works with .net.
Example for the text that the regex will check on:
Description for something
text \n \n \n // this will make DescriptionAmount++
Description for something
text\n \n \n // this will make DescriptionAmount++
\n \n \n // this shouldn't add on DescriptionAmount
Here's the code I've done so far.
int DescriptionAmount = Regex.Matches(somestring, "[^\w$](\r\n){2}[^\w$]").Count;
Try using a quantifier {x,y} to select how many tokens you want to match.
The '*' will match the preceding character 0 or many times, meaning it will match any \n after the 3rd token.
\n{3} says \n must be matched 3 times no more no less.
I find this tool http://regexr.com/ very useful for building and debugging regex statements.
To ensure you capture the 3 linebreaks after some text would take something like:
\w+\s*\n{3}
Since this is .NET, you either need to put an @ in front:
@"\w+\s*\n{3}"
or escape the backslashes like:
"\\w+\\s*\\n{3}"
You mentioned that you are searching for three \n, but your search has \r\n. If you are looking for \r\n instead, just add \r in front of the \n and surround with () for (\\r\\n) or (\r\n) in the expressions above.
One other thing, depending on the text in somestring, you may want to apply the multiline option (it only changes how ^ and $ match, so it matters once you add those anchors) like:
Regex.Matches(somestring, "\\w+\\s*\\n{3}", RegexOptions.Multiline)
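A minimal sketch of the counting idea, assuming the line breaks directly follow each other (no spaces in between) and may be either \n or \r\n. The class name and the sample text are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DescriptionCounter
{
    // Counts runs of text that are followed by three line breaks.
    // \w+ requires some text first, so a run made only of newlines is not counted on its own.
    public static int Count(string text)
    {
        return Regex.Matches(text, @"\w+\s*(?:\r?\n){3}").Count;
    }

    public static void Main()
    {
        string sample =
            "Description one\ntext\n\n\n" +
            "Description two\ntext\n\n\n" +
            "\n\n\n"; // extra newlines that should not add to the count
        Console.WriteLine(Count(sample));
    }
}
```

Because \s* is greedy and also matches line breaks, the extra newlines after the second description are absorbed into that same match instead of producing a third one.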

Regex filtering Regex, extra additional final \r

I want to filter a regex with a ... regex ...
My target is in a file which content is
...
information 1...
Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$
information 2...
...
The file is transferred to mystring with the method ReadAllText(path); where path is the path to the text file.
I use the code
//Retrieve regex like ^\|1[\s\t]+[\S]+[\s\t]+(.*)$ in Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$
//\d for any digit followed by =
// . for any character found 1 or + times, ended with space character
m = Regex.Match(mystring, @"Entity\d=(.+)\s");
string regex = m.Groups[1].Value;
which works almost fine
What I get is (seen from inside the debugger)
^\|1[\s\t]+[\S]+[\s\t]+(.*)$\r
There is an additional \r at the end of the result. It causes an unwanted extra newline in other parts of the code.
Trying @"Entity\d=(.+)" (i.e. removing the final \s) does not help.
Any idea how to avoid the additional \r gracefully? (If possible, I do not want to track down the final \r and remove it.)
Online regex testers like regex101 did not let me foresee this problem before moving to C# code.
Use a negated character class to make sure \r is not matched:
m = Regex.Match(mystring, @"Entity\d=([^\r\n]+)");
The [^\r\n] class means match any character other than a carriage return and a line feed.
It is true that regex101 does not keep carriage returns. You can see the \r matching at regexhero.net.
Check if this works:
@"Entity\d=(.+)(?=(\r|\n))";
(?=(\r|\n)) is a positive lookahead and means that the \r or \n won't be included in the result.
Edit: the greedy .+ still swallows the \r (the lookahead is then satisfied by the \n), so make it lazy:
@"Entity\d=(.+?)(?=\r|\n)";
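To make the cause visible, here is a small sketch comparing the two captures on CRLF text. In .NET the dot matches every character except \n, so it happily captures \r. The class and method names and the sample text are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityRegexReader
{
    // The dot matches \r, so the capture ends with a carriage return on CRLF text.
    public static string WithDot(string text)
    {
        return Regex.Match(text, @"Entity\d=(.+)").Groups[1].Value;
    }

    // The negated class stops before either line break character.
    public static string WithNegatedClass(string text)
    {
        return Regex.Match(text, @"Entity\d=([^\r\n]+)").Groups[1].Value;
    }

    public static void Main()
    {
        string mystring = "information 1...\r\n"
                        + @"Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$"
                        + "\r\ninformation 2...";
        Console.WriteLine(WithDot(mystring).EndsWith("\r", StringComparison.Ordinal));
        Console.WriteLine(WithNegatedClass(mystring).EndsWith("\r", StringComparison.Ordinal));
    }
}
```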

regex can't recognize "\n"?

So in the end (after a few days of debugging) I found the problem. It isn't in the regex at all :/ . It seems that I was trimming extra white space with
input = Regex.Replace(input, "\\s+", " ");
so all new lines were replaced with " ". Stupid! Moderator, please remove this if unnecessary!
I have regexp for tokenizing some text and it looks like this :
"(?<html>Ç)|
(?<number>\\d+(?:[.]\\d+)?(?=[][ \f\n\r\t\v!?.,():;\"'„Ç]|$))|
(?<other>(?:[^][Ç \f\n\r\t\v!?.,():;\"'„A-Za-zčćšđžČĆŠĐŽäöÖü][^ Ç\f\n\r\t\vA-Za-zčćšđžČĆŠĐŽäöÖü]*)?[^][ Ç\f\n\r\t\v!?.,():;\"'„A-Za-zčćšđžČĆŠĐŽäöÖü](?=[][!?.,():;\"'„]*(?:$|[ Ç\f\n\r\t\v])))|
(?<word>(?:[^][ Ç\f\n\r\t\v!?.,():;\"'„][^ Ç\f\n\r\t\v]*)?[^][ Ç\f\n\r\t\v!?.,():;\"'„])|
(?<punctuation>[][ \f\n\r\t\v!?.,():;\"'„])"
The problem is in this part: (?<punctuation>[][ \f\n\r\t\v!?.,():;\"'„]). So when I'm parsing text with the input "\n\n", it is grouped into punctuation matches: " "," " - in other words, space and space... and I don't know why?
I could be wrong, but you need to hand the String as String to the RegEx...means you need to escape the backslashes.
... (?=[][ \\f\\n\\r\\t\\v!?.,():;\\" ...
Or otherwise C# will replace \n with a linebreak within the RegEx-Statement.
Edit: It's also possible to use verbatim string literals, but they need to be marked with a leading @ (see Martin's answer).
If you put an @ in front of the string you can use single backslashes and line breaks will be recognized.
@"(?<html>Ç)|
Set RegexOptions.IgnorePatternWhitespace
Update:
Are you sure [^] is correct? Unless it's some kind of character group (that I have never used), that will be the same as '.'. Same goes for []. Perhaps I just have not used all of RE before :p
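The root cause the asker found (collapsing whitespace with \s+ before tokenizing) can be reproduced with a short sketch. The second method is one possible alternative, not something from the thread: it collapses only horizontal whitespace so that line breaks survive for the tokenizer. All names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceCollapser
{
    // \s also matches \r and \n, so every line break turns into a single space.
    public static string CollapseAll(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }

    // [^\S\r\n] is "whitespace except CR and LF", so line breaks are left alone.
    public static string CollapseHorizontal(string input)
    {
        return Regex.Replace(input, @"[^\S\r\n]+", " ");
    }

    public static void Main()
    {
        string input = "a  b\n\nc\t d";
        Console.WriteLine(CollapseAll(input).Replace("\n", "\\n"));
        Console.WriteLine(CollapseHorizontal(input).Replace("\n", "\\n"));
    }
}
```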
