.NET Regular Expression: Get paragraphs - c#

I'm trying to get paragraphs from a string in C# with Regular Expressions.
By paragraphs; I mean string blocks ending with double or more \r\n. (NOT HTML paragraphs <p>)...
Here is a sample text:
For example this is a paragraph with a carriage return here
and a new line here. At this point, second paragraph starts. A paragraph ends if double or more \r\n is matched or if reached at the end of the string ($).
I tried the pattern:
Regex regex = new Regex(#"(.*)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Multiline);
but this does not work. It matches every line ending with a single \r\n. What I need is to get all characters including single carriage returns and newline chars till reached a double \r\n.

.* is being greedy and consuming as much as it can. Your second set of () has a $ so the expression that is being used is (.*)(?). In order to make the .* not be greedy, follow it with a ?.
When you specify RegexOptions.Multiline, .NET will split the input on line breaks. Use RegexOptions.Singleline to make it treat the entire input as one.
Regex regex = new Regex(#"(.*?)(?:(\r\n){2,}|\r{2,}|\n{2,}|$)", RegexOptions.Singleline);

An opposite approach will be to match the separators instead of the paragraphs, making the problem almost trivial. Consider:
string[] paragraphs = Regex.Split(text, #"^\s*$", RegexOptions.Multiline);
By splitting the input string by empty lines you can easily get all paragraphs. If you only want blank lines with no spaces you can simplify that even further, and use the parretn ^$. In that case you can also use the non-regex String.Split, with an array of separators:
string[] separators = {"\n\n", "\r\r", "\r\n\r\n"};
string[] paragraphs = text.Split(separators,
StringSplitOptions.RemoveEmptyEntries);

Do you have to use a regular expression? Tools like COCO/R could make this job pretty easy as well. In addition it might just prove to be faster than generating code at runtime using a regex.
COMPILER YourParaProcessor
// your code goes here
TOKENS
newLine= '\r'|'\n'.
paraLetter = ANY - '\n' - '\r' .
YourParaProcessor
=
{Paragraph}
.
Paragraph =
{paraLetter} '\r\n' .

Related

Matching multiple different start and end of line in one multiline regex [duplicate]

I need to remove lines that match a particular pattern from some text. One way to do this is to use a regular expression with the begin/end anchors, like so:
var re = new Regex("^pattern$", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works fine except that it leaves an empty line instead of removing the entire line (including the line break).
To solve this, I added an optional capturing group for the line break, but I want to be sure it includes all of the different flavors of line breaks, so I did it like so:
var re = new Regex(#"^pattern$(\r\n|\r|\n)?", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works, but it seems like there should be a more straightforward way to do this. Is there a simpler way to reliably remove the entire line including the ending line break (if any)?
To match any single line break sequence you may use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029]) pattern. So, instead of (\r\n|\r|\n)?, you can use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?.
Details:
‎000A - a newline, \n
‎000B - a line tabulation char
‎000C - a form feed char
‎000D - a carriage return, \r
‎0085 - a next line char, NEL
‎2028 - a line separator char
‎- 2029 - a paragraph separator char.
If you want to remove any 0+ non-horizontal (or vertical) whitespace chars after a matched line, you may use [\s-[\p{Zs}\t]]*: any whitespace (\s) but (-[...]) a horizontal whitespace (matched with [\p{Zs}\t]). Note that for some reason, \p{Zs} Unicode category class does not match tab chars.
One more aspect must be dealt with here since you are using the RegexOptions.Multiline option: it makes $ match before a newline (\n) or end of string. That is why if your line endings are CRLF the pattern may fail to match. Hence, add an optional \r? before $ in your pattern.
So, either use
#"^pattern\r?$(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?"
or
#"^pattern\r?$[\s-[\p{Zs}\t]]*"

C# remove line using regular expression, including line break

I need to remove lines that match a particular pattern from some text. One way to do this is to use a regular expression with the begin/end anchors, like so:
var re = new Regex("^pattern$", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works fine except that it leaves an empty line instead of removing the entire line (including the line break).
To solve this, I added an optional capturing group for the line break, but I want to be sure it includes all of the different flavors of line breaks, so I did it like so:
var re = new Regex(#"^pattern$(\r\n|\r|\n)?", RegexOptions.Multiline);
string final = re.Replace(initial, "");
This works, but it seems like there should be a more straightforward way to do this. Is there a simpler way to reliably remove the entire line including the ending line break (if any)?
To match any single line break sequence you may use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029]) pattern. So, instead of (\r\n|\r|\n)?, you can use (?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?.
Details:
‎000A - a newline, \n
‎000B - a line tabulation char
‎000C - a form feed char
‎000D - a carriage return, \r
‎0085 - a next line char, NEL
‎2028 - a line separator char
‎- 2029 - a paragraph separator char.
If you want to remove any 0+ non-horizontal (or vertical) whitespace chars after a matched line, you may use [\s-[\p{Zs}\t]]*: any whitespace (\s) but (-[...]) a horizontal whitespace (matched with [\p{Zs}\t]). Note that for some reason, \p{Zs} Unicode category class does not match tab chars.
One more aspect must be dealt with here since you are using the RegexOptions.Multiline option: it makes $ match before a newline (\n) or end of string. That is why if your line endings are CRLF the pattern may fail to match. Hence, add an optional \r? before $ in your pattern.
So, either use
#"^pattern\r?$(?:\r\n|[\r\n\u000B\u000C\u0085\u2028\u2029])?"
or
#"^pattern\r?$[\s-[\p{Zs}\t]]*"

How to escape a delimiter by doubling the delimiter in a regex

I need to split a string on a delimiter, but not where the delimiter is doubled.
For instance "\m55.\m207|DEFAULT||DEFAULT|55||207" once split should result in
\m55.\m207
DEFAULT||DEFAULT
55||207
I'm trying to do this with a regex. If it makes a difference, I'm using C# System.Text.RegularExpression.Regex.
So far I have "[^|]\|[^|]" but that doesn't handle where an escaped delimiter is next to the delimter. IE |||
I'm sure there is a solution on the net, but I've tried searching with multiple different terms and couldn't find the right combination of terms to find it.
How do I escape the delimiter by doubling it in a regex? Or if there is a simpler solution what is it?
EDIT
Here is a more complicated example:
Input: "\m55.\m207|DEFAULT||DEFAULT|||55||207"
Expected output:
"\m55.\m207"
"DEFAULT||DEFAULT||"
"55||207"
Because your demo is so simple,and you just want to split with single |,so I can use \b here
string txt = #"\m55.\m207|DEFAULT||DEFAULT|55||207";
string patten = #"\b\|\b";
foreach (var str in Regex.Split(txt, patten))
{
Console.WriteLine(str);
}
(?<=[^|](?:\|{2})+)\|(?!\|)|(?<!\|)\|(?!\|)
You need to use lookarounds to make sure split happens on only one |.
See Demo

Regex.Replace removes '\r' character in "\r\n"

Here is a simple example
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, #"(?<=parameter\s*=).*", newValue.ToString());
text will be "parameter=250\n" after replacement. Replace() method removes '\r'. Does it uses unix-style for line feed by default? Adding \b to my regex (?<=parameter\s*=).*\b solves the problem, but I suppose there should be a better way to parse lines with windows-style line feeds.
Take a look at this answer. In short, the period (.) matches every character except \n in pretty much all regex implementations. Nothing to do with Replace in particular - you told it to remove any number of ., and that will slurp up \r as well.
Can't test now, but you might be able to rewrite it as (?<=parameter\s*=)[^\r\n]* to explicitly state which characters you want disallowed.
. by default doesn't match \n..If you want it to match you have to use single line mode..
(?s)(?<=parameter\s*=).*
^
(?s) would toggle the single line mode
Try this:
string text = "parameter=120\r\n";
int newValue = 250;
text = Regex.Replace(text, #"(parameter\s*=).*\r\n", "${1}" + newValue.ToString() + "\n");
Final value of text:
parameter=250\n
Match carriage return and newline explicitly. Will only match lines ending in \r\n.

Regex to replace every newline

I need to replace every newline character in a string with another character (say 'X'). I do not wish to collapse multiple newlines with a single newline!
PS: this regex replaces all consecutive newlines with a single newline but its not what I need.
Regex regex_newline = new Regex("(\r\n|\r|\n)+");
That will replace one or more newlines with something, not necessarily with a single newline -- that's determined by the regex_newline.Replace(...) call, which you don't show.
So basically, the answer is
Regex regex_newline = new Regex("(\r\n|\r|\n)"); // remove the '+'
// :
regex_newline.replace(somestring, "X");
Just use the String.Replace method and replace the Environment.NewLine in your string. No need for Regex.
http://msdn.microsoft.com/en-us/library/system.environment.newline.aspx

Categories