Regex to match full lines of text excluding crlf - c#

How would a regex pattern to match each line of a given text be?
I'm trying ^(.+)$ but it includes crlf...

Just use RegexOptions.Multiline.
Multiline mode. Changes the meaning of
^ and $ so they match at the beginning
and end, respectively, of any line,
and not just the beginning and end of
the entire string.
Example:
var lineMatches = Regex.Matches("Multi\r\nlines", "^(.+)$", RegexOptions.Multiline);

I'm not sure what you mean by "match each line of a given text" means, but you can use a character class to exclude the CR and LF characters:
[^\r\n]+

The wording of your question seems a little unclear, but it sounds like you want RegexOptions.Multiline (in the System.Text.RegularExpressions namespace). It's an option you have to set on your RegEx object. That should make ^ and $ match the beginning and end of a line rather than the entire string.
For example:
Regex re = new Regex("^(.+)$", RegexOptions.Compiled | RegexOptions.Multiline);

Have you tried:
^(.+)\r?\n$
That way the match group includes everything except the CRLF, and requires that a new line be present (Unix default), but accepts the carriage return in front (Windows default).

I assume you're using the Multiline option? In that case you'll want to match the newline explicitly with "\n". (substitute "\r\n" as appropriate.)

Related

.Net regex matching $ with the end of the string and not of line, even with multiline enabled

I'm trying to highlight markdown code, but am running into this weird behavior of the .NET regex multiline option.
The following expression: ^(#+).+$ works fine on any online regex testing tool:
But it refuses to work with .net:
It doesn't seem to take into account the $ tag, and just highlights everything until the end of the string, no matter what. This is my C#
RegExpression = new Regex(#"^(#+).+$", RegexOptions.Multiline)
What am I missing?
It is clear your text contains a linebreak other than LF. In .NET regex, a dot matches any char but LF (a newline char, \n).
See Multiline Mode MSDN regex reference
By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.
So, use
#"^(#+).+?\r?$"
The .+?\r?$ will match lazily any one or more chars other than LF up to the first CR (that is optional) right before a newline.
Or just use a negated character class:
#"^(#+)[^\r\n]+"
The [^\r\n]+ will match one or more chars other than CR/LF.
What you have is good. The only thing you're missing is that . doesn't match newline characters, even with the multiline option. You can get around this in two different ways.
The easiest is to use the RegexOptions.Singleline flag which cause newlines to be treated as characters. That way, ^ still matches the start of the string, $ matches the end of the string and . matches everything including newlines.
The other way to fix this (although I wouldn't recomend it for your use case) is to modify your regex to explicitly allow newlines. To do this you can just replace any . with (?:.|\n) which means either anycharacter or a newline. For your example, you would end up with ^(#+)(?:.|\n)+$. If you want to ensure that there's a non-linebreak character first, add an extra dot: ^(#+).(?:.|\n)+$

What is the proper RegEx to grab all blocks of SAS code that start with 'proc sql;' and end with 'quit;'?

I have this:
var blockRegEx = new Regex("(proc sql;)(.*?)(quit;)", RegexOptions.IgnoreCase |
RegexOptions.Multiline);
but it only works if the string is on a single line.
For example:
proc sql;
create table xtr as
select
midsu_client_id,
prodt_cd,
confmt_ind,
maj_diag_categ,
mbr_num,
pay_amt format=comma16.2
from cr_data.rptng
where &acctnum
and gl_postg between "&date_1" and "&date_2"
;
quit;
RegexOptions.MultiLine changes the behavior of the '^' and '$' characters:
Multiline mode. Changes the meaning of ^ and $ so they match at the
beginning and end, respectively, of any line, and not just the
beginning and end of the entire string.
Multiline is useful if you're passing multiple lines at once into your regex search and you want to treat them as multiple lines (i.e. they all start with '^' and end with '$').
I think you want to try using RegexOptions.SingleLine instead:
Specifies single-line mode. Changes the meaning of the dot (.) so it
matches every character (instead of every character except \n).
SingleLine is useful if you're passing multiple lines at once into your regex search and you want to treat them as thought they were actually all a single line.

Why is my C# Regular Expression not matcing between lines?

I have the following Regex in C#:
Regex h1Separator = new Regex(#"<h1>(?'name'[\w\d\s]+?)(<br\s?/?>)?</h1>", RegexOptions.Singleline);
Trying to match a string that looks like this:
<h1>test content<br>
</h1>
right now it matches strings that look like the following:
<h1>test content<br></h1>
<h1>test content</h1>
What am I doing wrong? Should I be matching for a newline character? If so, what is it in C#? I can't find one.
You don't check for whitespace between the end of the br tag and the start of the next tag, so it expects to see the hr tag immediately after. Add a \s* in between to allow that.
You have it defined as a single line regex, see the RegexOptions.Singleline flag :) use RegexOptions.Multiline
The newline character in C# is: \n. However, I am not skilled in regex and couldn't tell you what would happen if there was a newline in a regex expression.
you can either add a dot . to your string before the ending </h1> and keep the RegexOptions.Singleline option, or change it to RegexOptions.Multiline and add a $ to the regex before the </h1>. details here
Use the Multiline flag. (Edit to address my mispeaking about the .Net platform).
Singleline mode treats the entire string you are passing in as one entry. Therefore ^ and $ represent the entire string and not the beginning and ending of a line within the string. Example <h1>(?'name'[\w\d\s]+?)(<br\s?/?>)?</h1> will match this:
<h1>test content<br></h1>
Multiline mode changes the meaning of ^ and $ to the beginning and ending of each line within the string (i.e. they will look at every line break).
Regex h1Separator = new Regex(#"<h1>(?'name'[\w\d\s]+?)$(<br\s?/?>)?</h1>", RegexOptions.Multiline);
will match the desired pattern:
<h1>test content<br>
</h1>
In short, you need to tell the regex parser you expect to work with multiple lines. It helps to have a regex designer that speaks your dialect of regex. There are many.

What is wrong with my regex (simple)?

I am trying to make a regex that matches all occurrences of words that are at the start of a line and begin with #.
For example in:
#region #like
#hey
It would match #region and #hey.
This is what I have right now:
^#\w*
I apologize for posting this question. I'm sure it has a very simple answer, but I have been unable to find it. I admit that I am a regex noob.
What you've got should work, depending on what flags you pass for RegexOptions. You need to make sure you pass RegexOptions.Multiline:
var matches = Regex.Matches(input, #"^#\w*", RegexOptions.Multiline);
See the documentation I linked to above:
Multiline Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
The regex looks fine, make sure you're using a verbatim string literal (# prefix) to define your regex, i.e. #"^#\w*" otherwise the backslash will be treated as an escape sequence.
Use this regex
^#.+?\b
.+ will ensure at least one character after # and \b indicates word boundry. ? adds non-greediness to the + operator so as to avoid matching whole string #region #like

Regex that matches a newline (\n) in C#

OK, this one is driving me nuts....
I have a string that is formed thus:
var newContent = string.Format("({0})\n{1}", stripped_content, reply)
newContent will display like:
(old text)
new text
I need a regular expression that strips away the text between parentheses with the parenthesis included AND the newline character.
The best I can come up with is:
const string regex = #"^(\(.*\)\s)?(?<capture>.*)";
var match= Regex.Match(original_content, regex);
var stripped_content = match.Groups["capture"].Value;
This works, but I want specifically to match the newline (\n), not any whitespace (\s)
Replacing \s with \n \\n or \\\n does NOT work.
Please help me hold on to my sanity!
EDIT: an example:
public string Reply(string old,string neww)
{
const string regex = #"^(\(.*\)\s)?(?<capture>.*)";
var match= Regex.Match(old, regex);
var stripped_content = match.Groups["capture"].Value;
var result= string.Format("({0})\n{1}", stripped_content, neww);
return result;
}
Reply("(messageOne)\nmessageTwo","messageThree") returns :
(messageTwo)
messageThree
If you specify RegexOptions.Multiline then you can use ^ and $ to match the start and end of a line, respectively.
If you don't wish to use this option, remember that a new line may be any one of the following: \n, \r, \r\n, so instead of looking only for \n, you should perhaps use something like: [\n\r]+, or more exactly: (\n|\r|\r\n).
Actually it works but with opposite option i.e.
RegexOptions.Singleline
You are probably going to have a \r before your \n. Try replacing the \s with (\r\n).
Think I may be a bit late to the party, but still hope this helps.
I needed to get multiple tokens between two hash signs.
Example i/p:
## token1 ##
## token2 ##
## token3_a
token3_b
token3_c ##
This seemed to work in my case:
var matches = Regex.Matches (mytext, "##(.*?)##", RegexOptions.Singleline);
Of course, you may want to replace the double hash signs at both ends with your own chars.
HTH.
Counter-intuitive as it is, you can use both Multiline and Singleline option.
Regex.Match(input, #"(.+)^(.*)", RegexOptions.Multiline | RegexOptions.Singleline)
First capturing group will contain first line (including \r and \n) and second group will have second line.
Why:
First of all RegexOptions enum is flag so it can be combined with bitwise operators, then
Multiline:
^ and $ match the beginning and end of each line (instead of the beginning and end of the input string).
Singleline:
The period (.) matches every character (instead of every character except \n)
see docs

Categories