Replace a string in multiline regex with end of line token - c#

I have the following regex:
var fixedString = Regex.Replace(subject, @"(:[\w]+ [\d]+)$", "", RegexOptions.Multiline);
which doesn't work. It works if I use \r\n, but I would like to support all types of line breaks. As another answer states I have to use RegexOptions.Multiline to be able to use $ as end of line token (instead of end of string). But it doesn't seem to help.
What am I doing wrong?

If I understand correctly, you also want to remove the newline character at the end of the row.
The problem is that $ is a zero-width assertion. It does not match the newline character; it matches the position before \n. With \r\n line endings, the \r sits between your token and that position, so the pattern fails.
There are a few options:
If it is OK to match all following line breaks, which also removes any following empty rows, you could do this:
var fixedString = Regex.Replace(subject, @"(:[\w]+ [\d]+)[\r\n]+", "");
If you only want to match the line break directly after the row and keep the following empty rows, you have to spell out the possible combinations, e.g.:
var fixedString = Regex.Replace(subject, @"(:[\w]+ [\d]+)\r?\n", "");
This matches both \n and \r\n.
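To see why the original pattern fails on Windows line endings and what the \r?\n variant does, here is a minimal sketch. The class name, method names and sample inputs are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class TrailingTokenDemo
{
    // True when ":word digits" sits directly before a position where $ matches.
    // In .NET, $ with RegexOptions.Multiline matches only before \n or at the
    // end of the string, so a \r in front of the \n blocks the match.
    public static bool TokenIsAtLineEnd(string subject)
    {
        return Regex.IsMatch(subject, @":\w+ \d+$", RegexOptions.Multiline);
    }

    // Removes the token together with the one line break (\n or \r\n) after it.
    public static string StripTokenAndBreak(string subject)
    {
        return Regex.Replace(subject, @":\w+ \d+\r?\n", "");
    }
}
```

For example, TokenIsAtLineEnd("a :abc 12\nb") is true but TokenIsAtLineEnd("a :abc 12\r\nb") is false, and StripTokenAndBreak("first :abc 12\r\nsecond :def 34\nthird") returns "first second third".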

Related

Matching multiple different start and end of line in one multiline regex [duplicate]

I need to remove lines that match a particular pattern from some text. One way to do this is to use a regular expression with the begin/end anchors, like so:
var re = new Regex("^pattern$", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works fine except that it leaves an empty line instead of removing the entire line (including the line break).
To solve this, I added an optional capturing group for the line break, but I want to be sure it includes all of the different flavors of line breaks, so I did it like so:
var re = new Regex(@"^pattern$(\r\n|\r|\n)?", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works, but it seems like there should be a more straightforward way to do this. Is there a simpler way to reliably remove the entire line including the ending line break (if any)?
To match any single line break sequence you may use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029]) pattern. So, instead of (\r\n|\r|\n)?, you can use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?.
Details:
000A - a newline, \n
000B - a line tabulation char
000C - a form feed char
000D - a carriage return, \r
0085 - a next line char, NEL
2028 - a line separator char
2029 - a paragraph separator char
If you want to remove any 0+ non-horizontal (i.e. vertical) whitespace chars after a matched line, you may use [\s-[\p{Zs}\t]]*: any whitespace (\s) except (-[...]) horizontal whitespace (matched with [\p{Zs}\t]). Note that the \p{Zs} Unicode category does not include the tab character (a tab is a control character, category Cc), hence the explicit \t.
One more aspect must be dealt with here since you are using the RegexOptions.Multiline option: it makes $ match before a newline (\n) or at the end of the string. That is why the pattern may fail to match if your line endings are CRLF. Hence, add an optional \r? before $ in your pattern.
So, either use
@"^pattern\r?$(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?"
or
@"^pattern\r?$[\s-[\p{Zs}\t]]*"
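As a rough sketch of the first variant wrapped in a helper: the LineRemover class, the RemoveLines method and its linePattern parameter are hypothetical names, and the caller is assumed to pass a regex fragment without anchors.

```csharp
using System.Text.RegularExpressions;

public static class LineRemover
{
    // One line break sequence: CRLF as a unit, otherwise a single break character.
    private const string LineBreak = @"(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])";

    // linePattern is a regex fragment for the line's content (no anchors).
    public static string RemoveLines(string input, string linePattern)
    {
        // \r? lets $ (which matches only before \n or at the end of the string)
        // succeed on CRLF input; the optional group then consumes the line break.
        var re = new Regex(
            "^(?:" + linePattern + @")\r?$" + LineBreak + "?",
            RegexOptions.Multiline);
        return re.Replace(input, "");
    }
}
```

With this sketch, RemoveLines("keep\r\ndrop\r\nkeep2\ndrop\nend", "drop") returns "keep\r\nkeep2\nend". Because ^ and $ in .NET are tied to \n, lines terminated by a lone \r are not recognized by the anchors.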

Regex filtering Regex, extra additional final \r

I want to filter a regex with a ... regex ...
My target is in a file whose content is
...
information 1...
Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$
information 2...
...
The file is transferred to mystring with the method ReadAllText(path); where path is the path to the text file.
I use the code
// Retrieve a regex like ^\|1[\s\t]+[\S]+[\s\t]+(.*)$ from Entity1=^\|1[\s\t]+[\S]+[\s\t]+(.*)$
// \d for any digit, followed by =
// .+ for any character, one or more times, ended by a whitespace character
m = Regex.Match(mystring, @"Entity\d=(.+)\s");
string regex = m.Groups[1].Value;
which works almost fine.
What I get (as seen in the debugger) is
^\|1[\s\t]+[\S]+[\s\t]+(.*)$\r
There is an additional \r at the end of the result. It causes an unwanted extra newline in other parts of the code.
Trying @"Entity\d=(.+)" (i.e. removing the final \s) does not help.
Any idea how to avoid the additional \r gracefully? (If possible, I would rather not track down the final \r and remove it afterwards.)
Online regex testers like regex101 did not let me foresee this problem before going to C# code.
Use a negated character class to make sure \r is not matched:
m = Regex.Match(mystring, @"Entity\d=([^\r\n]+)");
The [^\r\n] class matches any character other than a carriage return or a line feed.
It is true that regex101 does not keep carriage returns. You can see the \r being matched at regexhero.net.
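A small sketch of the negated-class approach, assuming the file content has already been read into a string. EntityPatternReader and ReadEntityPattern are hypothetical names, not part of the original code.

```csharp
using System.Text.RegularExpressions;

public static class EntityPatternReader
{
    // Returns the text after "Entity<digit>=" up to, but not including,
    // the first \r or \n. Returns an empty string when nothing matches.
    public static string ReadEntityPattern(string fileContent)
    {
        Match m = Regex.Match(fileContent, @"Entity\d=([^\r\n]+)");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

On CRLF input the captured value stops before the \r, whereas the (.+) version keeps it, because . in .NET excludes only \n.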
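Check if this works:
@"Entity\d=(.+)(?=(\r|\n))";
(?=(\r|\n)) is a positive lookahead, which means that the \r or \n won't be included in the result.
Edit: with a greedy .+ the capture still runs up to the \n and so includes the \r; a lazy quantifier stops at the first line-break character:
@"Entity\d=(.+?)(?=\r|\n)";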

Regex.Replace removes '\r' character in "\r\n"

Here is a simple example
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, @"(?<=parameter\s*=).*", newValue.ToString());
text will be "parameter=250\n" after the replacement: the Replace() method removes the '\r'. Does it use Unix-style line feeds by default? Adding \b to my regex, (?<=parameter\s*=).*\b, solves the problem, but I suppose there should be a better way to parse lines with Windows-style line endings.
In short, the period (.) matches every character except \n in .NET, as in many other regex implementations. This has nothing to do with Replace in particular: you told it to replace any number of ., and that slurps up the \r as well.
Can't test now, but you might be able to rewrite it as (?<=parameter\s*=)[^\r\n]* to explicitly state which characters you want disallowed.
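The suggested rewrite can be sketched as follows. ParameterEditor and SetParameter are hypothetical names; the parameter name is hard-coded as in the question.

```csharp
using System.Text.RegularExpressions;

public static class ParameterEditor
{
    // Replaces the value after "parameter=" while leaving the line break
    // (\r\n or \n) untouched, because [^\r\n]* stops before either character.
    public static string SetParameter(string text, int newValue)
    {
        return Regex.Replace(text, @"(?<=parameter\s*=)[^\r\n]*", newValue.ToString());
    }
}
```

Here SetParameter("parameter=120\r\n", 250) returns "parameter=250\r\n", keeping the Windows-style line ending intact.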
By default, . doesn't match \n. If you want it to match, you have to use single-line mode:
(?s)(?<=parameter\s*=).*
The leading (?s) turns on single-line mode.
Try this:
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, @"(parameter\s*=).*\r\n", "${1}" + newValue.ToString() + "\n");
Final value of text:
parameter=250\n
Match carriage return and newline explicitly. Will only match lines ending in \r\n.

Regex to match full lines of text excluding crlf

What would a regex pattern that matches each line of a given text look like?
I'm trying ^(.+)$ but the match includes the CRLF...
Just use RegexOptions.Multiline.
Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
Example:
var lineMatches = Regex.Matches("Multi\r\nlines", "^(.+)$", RegexOptions.Multiline);
I'm not sure what you mean by "match each line of a given text", but you can use a character class to exclude the CR and LF characters:
[^\r\n]+
The wording of your question seems a little unclear, but it sounds like you want RegexOptions.Multiline (in the System.Text.RegularExpressions namespace). It's an option you have to set on your Regex object. That should make ^ and $ match the beginning and end of a line rather than of the entire string.
For example:
Regex re = new Regex("^(.+)$", RegexOptions.Compiled | RegexOptions.Multiline);
Have you tried:
^(.+?)\r?\n
That way the match group includes everything except the CRLF. It requires that a newline be present (Unix default), but accepts a carriage return in front of it (Windows default). The quantifier is lazy so that the group does not swallow the \r.
I assume you're using the Multiline option? In that case you'll want to match the newline explicitly with "\n". (substitute "\r\n" as appropriate.)
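To compare the suggestions above, here is a small sketch using the character-class approach. LineSplitter and GetLines are hypothetical names. Note that with \r\n input, ^(.+)$ under RegexOptions.Multiline still captures the \r, since $ matches before \n and . matches \r.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineSplitter
{
    // Collects every run of characters that are neither \r nor \n.
    // Empty lines produce no match and are therefore skipped.
    public static List<string> GetLines(string text)
    {
        var lines = new List<string>();
        foreach (Match m in Regex.Matches(text, @"[^\r\n]+"))
        {
            lines.Add(m.Value);
        }
        return lines;
    }
}
```

For "Multi\r\nlines" this yields the two entries "Multi" and "lines", while the first match of ^(.+)$ with RegexOptions.Multiline on the same input captures "Multi\r".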

Regex that matches a newline (\n) in C#

OK, this one is driving me nuts....
I have a string that is formed thus:
var newContent = string.Format("({0})\n{1}", stripped_content, reply);
newContent will display like:
(old text)
new text
I need a regular expression that strips away the text between parentheses, including the parentheses themselves AND the newline character.
The best I can come up with is:
const string regex = @"^(\(.*\)\s)?(?<capture>.*)";
var match = Regex.Match(original_content, regex);
var stripped_content = match.Groups["capture"].Value;
This works, but I specifically want to match the newline (\n), not any whitespace (\s).
Replacing \s with \n, \\n or \\\n does NOT work.
Please help me hold on to my sanity!
EDIT: an example:
public string Reply(string old, string neww)
{
    const string regex = @"^(\(.*\)\s)?(?<capture>.*)";
    var match = Regex.Match(old, regex);
    var stripped_content = match.Groups["capture"].Value;
    var result = string.Format("({0})\n{1}", stripped_content, neww);
    return result;
}
Reply("(messageOne)\nmessageTwo", "messageThree") returns:
(messageTwo)
messageThree
If you specify RegexOptions.Multiline then you can use ^ and $ to match the start and end of a line, respectively.
If you don't wish to use this option, remember that a new line may be any one of the following: \n, \r, \r\n. So instead of looking only for \n, you should perhaps use something like [\n\r]+ or, more exactly, (\r\n|\n|\r), with \r\n listed first so that a CRLF pair is matched as one unit.
Actually, it works, but with the opposite option, i.e.
RegexOptions.Singleline
You are probably going to have a \r before your \n. Try replacing the \s with (\r\n).
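A minimal sketch of matching the line break explicitly, with an optional \r in front of the \n. ReplyHelper and StripQuotedPrefix are hypothetical names, and the input is assumed to start with the parenthesized text.

```csharp
using System.Text.RegularExpressions;

public static class ReplyHelper
{
    // In a verbatim string, @"\n" reaches the regex engine as backslash + n,
    // which the engine itself interprets as a line feed.
    // Removes a leading "(...)" and the single line break that follows it.
    public static string StripQuotedPrefix(string content)
    {
        return Regex.Replace(content, @"^\(.*\)\r?\n", "");
    }
}
```

StripQuotedPrefix("(messageOne)\nmessageTwo") returns "messageTwo", and the same holds when the separator is \r\n. Input without a leading parenthesized line is returned unchanged.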
Think I may be a bit late to the party, but still hope this helps.
I needed to get multiple tokens between two hash signs.
Example input:
## token1 ##
## token2 ##
## token3_a
token3_b
token3_c ##
This seemed to work in my case:
var matches = Regex.Matches (mytext, "##(.*?)##", RegexOptions.Singleline);
Of course, you may want to replace the double hash signs at both ends with your own chars.
HTH.
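The token example above can be sketched as a small helper. TokenExtractor and ExtractTokens are hypothetical names, and trimming the surrounding whitespace is an added assumption, not part of the original snippet.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenExtractor
{
    // RegexOptions.Singleline lets . match \n, so a token may span lines.
    // The lazy .*? stops at the nearest closing ##.
    public static List<string> ExtractTokens(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, "##(.*?)##", RegexOptions.Singleline))
        {
            tokens.Add(m.Groups[1].Value.Trim());
        }
        return tokens;
    }
}
```

On the sample input this finds three tokens, the last of which contains the two embedded line breaks. Without RegexOptions.Singleline only the first two would be found.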
Counter-intuitive as it is, you can use both the Multiline and Singleline options together.
Regex.Match(input, @"(.+)^(.*)", RegexOptions.Multiline | RegexOptions.Singleline)
The first capturing group will contain the first line (including \r and \n) and the second group will contain the second line.
Why:
First of all, RegexOptions is a flags enum, so its values can be combined with bitwise operators. Then:
Multiline:
^ and $ match the beginning and end of each line (instead of the beginning and end of the input string).
Singleline:
The period (.) matches every character (instead of every character except \n).
See the RegexOptions documentation.
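The two options affect different parts of the pattern, which the following sketch tries to make visible. OptionsDemo and its two methods are hypothetical helpers written for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // What a greedy ".+" matches first: Singleline decides whether . crosses \n.
    public static string FirstDotRun(string input, RegexOptions options)
    {
        return Regex.Match(input, ".+", options).Value;
    }

    // How many positions "^" matches: Multiline decides whether ^ also
    // matches after a \n inside the string.
    public static int CountLineStarts(string input, RegexOptions options)
    {
        return Regex.Matches(input, "^", options).Count;
    }
}
```

For the input "a\nb", FirstDotRun gives "a" with RegexOptions.None and "a\nb" with RegexOptions.Singleline, while CountLineStarts gives 1 with RegexOptions.None and 2 with RegexOptions.Multiline.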
