C# remove line using regular expression, including line break

I need to remove lines that match a particular pattern from some text. One way to do this is to use a regular expression with the begin/end anchors, like so:
var re = new Regex("^pattern$", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works fine except that it leaves an empty line instead of removing the entire line (including the line break).
To solve this, I added an optional capturing group for the line break, but I want to be sure it includes all of the different flavors of line breaks, so I did it like so:
var re = new Regex(@"^pattern$(\r\n|\r|\n)?", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works, but it seems like there should be a more straightforward way to do this. Is there a simpler way to reliably remove the entire line including the ending line break (if any)?

To match any single line break sequence, you may use the pattern (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029]). So, instead of (\r\n|\r|\n)?, you can use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?.
Details:
000A - a newline, \n
000B - a line tabulation char
000C - a form feed char
000D - a carriage return, \r
0085 - a next line char, NEL
2028 - a line separator char
2029 - a paragraph separator char
If you want to remove any 0+ vertical (that is, non-horizontal) whitespace chars after a matched line, you may use [\s-[\p{Zs}\t]]*: any whitespace (\s) except (-[...]) horizontal whitespace (matched with [\p{Zs}\t]). Note that the \p{Zs} Unicode category (space separators) does not include the tab char, which belongs to the control category (Cc), so \t has to be listed explicitly.
One more aspect must be dealt with here since you are using the RegexOptions.Multiline option: it makes $ match before a newline (\n) or at the end of the string, but not before a carriage return. That is why the pattern may fail to match if your line endings are CRLF: the \r sits between the line text and the \n. Hence, add an optional \r? before $ in your pattern.
So, either use
@"^pattern\r?$(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?"
or
@"^pattern\r?$[\s-[\p{Zs}\t]]*"
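As a minimal sketch of the first variant, the snippet below wraps the pattern in a helper and runs it on text with mixed CRLF and LF endings. The class name LineRemover, the method RemoveLines and the sample text are illustrative assumptions, not part of the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineRemover
{
    // Any single line break sequence: CRLF first, then the individual characters.
    private const string LineBreak = @"(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])";

    // Removes every line that matches linePattern as a whole,
    // together with the line break that follows it (if any).
    // linePattern is assumed to be a valid regex fragment.
    public static string RemoveLines(string input, string linePattern)
    {
        // \r?$ lets the Multiline $ (which only looks for \n) accept CRLF endings.
        var re = new Regex("^(?:" + linePattern + @")\r?$" + LineBreak + "?",
                           RegexOptions.Multiline);
        return re.Replace(input, "");
    }

    public static void Main()
    {
        string initial = "keep 1\r\npattern\r\nkeep 2\npattern\nkeep 3";
        string final = RemoveLines(initial, "pattern");
        // Make the remaining line breaks visible in the console output.
        Console.WriteLine(final.Replace("\r", "\\r").Replace("\n", "\\n"));
        // prints: keep 1\r\nkeep 2\nkeep 3
    }
}
```

Since the Multiline $ only matches before \n or at the end of the string, the optional group here ends up consuming the \n that follows the anchor. If the matched line is the last one and has no trailing line break, nothing extra is removed.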

Related

.Net regex matching $ with the end of the string and not of line, even with multiline enabled

I'm trying to highlight markdown code, but am running into this weird behavior of the .NET regex multiline option.
The following expression, ^(#+).+$, works fine in any online regex testing tool, but it refuses to work in .NET.
It doesn't seem to take the $ anchor into account, and just highlights everything until the end of the string, no matter what. This is my C#:
RegExpression = new Regex(@"^(#+).+$", RegexOptions.Multiline);
What am I missing?

It is clear your text contains a linebreak other than LF. In .NET regex, a dot matches any char but LF (a newline char, \n).
See Multiline Mode MSDN regex reference
By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.
So, use
@"^(#+).+?\r?$"
The .+?\r?$ will match lazily any one or more chars other than LF up to the first CR (that is optional) right before a newline.
Or just use a negated character class:
@"^(#+)[^\r\n]+"
The [^\r\n]+ will match one or more chars other than CR/LF.
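The negated character class variant can be tried out with a short sketch like the one below. The class HeadingMatcher, the method FindHeadings and the sample Markdown text are assumptions made for illustration only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeadingMatcher
{
    // Returns the Markdown heading lines found in text, without trailing CR/LF.
    public static string[] FindHeadings(string text)
    {
        // [^\r\n]+ stops at the first CR or LF, so CRLF input needs no special casing.
        var re = new Regex(@"^(#+)[^\r\n]+", RegexOptions.Multiline);
        return re.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string text = "# Title\r\nbody text\r\n## Section\r\nmore text";
        foreach (string heading in FindHeadings(text))
        {
            Console.WriteLine(heading);
        }
        // prints:
        // # Title
        // ## Section
    }
}
```

With this form no $ anchor is needed at all, which sidesteps the CRLF issue instead of working around it.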

What you have is good. The only thing you're missing is that . doesn't match newline characters, even with the multiline option. You can get around this in two different ways.
The easiest is to use the RegexOptions.Singleline flag, which causes newlines to be treated like any other character. That way, ^ still matches the start of the string, $ matches the end of the string and . matches everything including newlines.
The other way to fix this (although I wouldn't recommend it for your use case) is to modify your regex to explicitly allow newlines. To do this you can just replace any . with (?:.|\n), which means either any character or a newline. For your example, you would end up with ^(#+)(?:.|\n)+$. If you want to ensure that there's a non-linebreak character first, add an extra dot: ^(#+).(?:.|\n)+$

Regex to match comma separated string with no comma at the end of the line

I am trying to write a regex that will allow input of all characters on the keyboard (even space) but will reject a comma at the end of the line. I have tried the following, which includes all the possible characters, but it still does not give me the correct output:
[RegularExpression("^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+$", ErrorMessage = "Comma is not allowed at the end of {0} ")]

^.*[^,]$
.* matches any characters, so the long character class is not needed.

^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+(?<!,)$
Just add a negative lookbehind, (?<!,), at the end.

a regex that will allow input of all characters on the keyboard(even space) but will restrict the input of comma at the end of the line.
Mind that you can type much more than what you typed using a keyboard. Basically, you want to allow any character but a comma at the end of the line.
So,
(?!,).(?=\r\n|\z)
This regex checks each line (because of the (?=\r\n|\z) look-ahead), and the (?!,) look-ahead makes sure the last character (the one matched with .) is not a comma. \z is an unambiguous end-of-string anchor.
This will work even on a client side.
To also get the full line match, you can just add .* at the beginning of the pattern (as we are not using singleline flag, . does not match newline symbols):
.*(?!,).(?=\r\n|\z)
Or, to make it faster, use an atomic group or an inline multiline option with the ^ start-of-line anchor (these will not work on the client side):
(?>.*)(?!,).(?=\r\n|\z)
(?m)^.*?(?!,).(?=\r\n|\z) // The fastest of the last three
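For illustration, here is a small sketch that applies the (?m) variant to a multi-line string and collects the lines that do not end with a comma. It assumes CRLF line endings, since the look-ahead only recognizes \r\n or the end of the string. The names TrailingCommaCheck and LinesWithoutTrailingComma are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TrailingCommaCheck
{
    // Returns the lines of a CRLF-separated text that do not end with a comma.
    // Assumption: lines are separated by \r\n; LF-only input is not handled here.
    public static string[] LinesWithoutTrailingComma(string text)
    {
        var re = new Regex(@"(?m)^.*?(?!,).(?=\r\n|\z)");
        return re.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string text = "a,b\r\nc,d,\r\ne";
        foreach (string line in LinesWithoutTrailingComma(text))
        {
            Console.WriteLine(line);
        }
        // prints:
        // a,b
        // e
    }
}
```

The second line, c,d, is skipped because its last character before the line break is a comma.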

What is the proper RegEx to grab all blocks of SAS code that start with 'proc sql;' and end with 'quit;'?

I have this:
var blockRegEx = new Regex("(proc sql;)(.*?)(quit;)", RegexOptions.IgnoreCase |
RegexOptions.Multiline);
but it only works if the string is on a single line.
For example:
proc sql;
create table xtr as
select
midsu_client_id,
prodt_cd,
confmt_ind,
maj_diag_categ,
mbr_num,
pay_amt format=comma16.2
from cr_data.rptng
where &acctnum
and gl_postg between "&date_1" and "&date_2"
;
quit;

RegexOptions.Multiline changes the behavior of the '^' and '$' characters:
Multiline mode. Changes the meaning of ^ and $ so they match at the
beginning and end, respectively, of any line, and not just the
beginning and end of the entire string.
Multiline is useful if you're passing multiple lines at once into your regex search and you want to treat them as multiple lines (i.e. they all start with '^' and end with '$').
I think you want to try using RegexOptions.Singleline instead:
Specifies single-line mode. Changes the meaning of the dot (.) so it
matches every character (instead of every character except \n).
Singleline is useful if you're passing multiple lines at once into your regex search and you want to treat them as though they were actually all a single line.
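A minimal sketch of that suggestion follows: it uses the regex from the question with RegexOptions.Singleline in place of RegexOptions.Multiline. The class SasBlockFinder, the method FindSqlBlocks and the shortened sample SAS text are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SasBlockFinder
{
    // Returns every block that starts with "proc sql;" and ends with the next "quit;".
    public static string[] FindSqlBlocks(string source)
    {
        // Singleline makes . match \n as well, so the lazy .*? can span several lines.
        var blockRegEx = new Regex("(proc sql;)(.*?)(quit;)",
                                   RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return blockRegEx.Matches(source).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string source =
            "data a; run;\n" +
            "proc sql;\ncreate table xtr as\nselect 1\n;\nquit;\n" +
            "PROC SQL; select 2; QUIT;";
        string[] blocks = FindSqlBlocks(source);
        Console.WriteLine(blocks.Length); // prints: 2
    }
}
```

The lazy quantifier .*? matters here: a greedy .* in Singleline mode would run from the first proc sql; to the last quit; in the whole input.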

Regex to match full lines of text excluding crlf

How would a regex pattern to match each line of a given text be?
I'm trying ^(.+)$ but it includes crlf...

Just use RegexOptions.Multiline.
Multiline mode. Changes the meaning of
^ and $ so they match at the beginning
and end, respectively, of any line,
and not just the beginning and end of
the entire string.
Example:
var lineMatches = Regex.Matches("Multi\r\nlines", "^(.+)$", RegexOptions.Multiline);

I'm not sure what you mean by "match each line of a given text", but you can use a character class to exclude the CR and LF characters:
[^\r\n]+
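The difference between the two approaches can be seen in a short sketch. In .NET the dot matches \r, so ^(.+)$ with Multiline keeps the carriage return of a CRLF line, while [^\r\n]+ does not. The class LineSplitter and its method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineSplitter
{
    // Lines as captured by ^(.+)$ in Multiline mode (group 1 of each match).
    public static string[] WithMultilineDot(string text)
    {
        return Regex.Matches(text, "^(.+)$", RegexOptions.Multiline)
                    .Cast<Match>().Select(m => m.Groups[1].Value).ToArray();
    }

    // Non-empty lines as matched by the negated character class.
    public static string[] WithNegatedClass(string text)
    {
        return Regex.Matches(text, @"[^\r\n]+")
                    .Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string text = "Multi\r\nlines";
        // The first line keeps its \r: length is 6, not 5.
        Console.WriteLine(WithMultilineDot(text)[0].Length); // prints: 6
        Console.WriteLine(WithNegatedClass(text)[0].Length); // prints: 5
    }
}
```

Note that [^\r\n]+ skips empty lines entirely, because it needs at least one character to match.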

The wording of your question seems a little unclear, but it sounds like you want RegexOptions.Multiline (in the System.Text.RegularExpressions namespace). It's an option you have to set on your Regex object. That should make ^ and $ match the beginning and end of each line rather than of the entire string.
For example:
Regex re = new Regex("^(.+)$", RegexOptions.Compiled | RegexOptions.Multiline);

Have you tried:
^(.+)\r?\n$
That way the match group includes everything except the CRLF, and requires that a new line be present (Unix default), but accepts the carriage return in front (Windows default).

I assume you're using the Multiline option? In that case you'll want to match the newline explicitly with "\n". (substitute "\r\n" as appropriate.)
