Optimizing a regex expression


Running a match against my pattern takes a very long time. I'm trying to match content that looks like the following:
One or more content paragraph of any length
Here is an optional paragraph
A single line or list item
A single line or list item
Here is my pattern. While it works for short expressions, it fails for longer ones.
^((.+[\r\n]?)+)\r\n\r\n([* -]*(.+)[\r\n]?)+$
My goal really is to separate out the first piece of content into a paragraph, and collect the last items into a list object using the matching pattern. I'm assuming two line breaks separate the paragraph(s) and a set of single-line items (only one line break).
Hope this isn't confusing. How can I optimize this regex? Thanks.

Time-consuming, inefficient backtracking can often be avoided by adding the ? modifier to the * and + quantifiers to make them match lazily or reluctantly, i.e. as few times as possible.
This can be particularly important when the quantifiers follow the . wildcard meta-character.
Try
(.+?)\r\n\r\n(?:[* -]*(.+?)(?:\r\n|$))+
with RegexOptions.Singleline so . matches any character including newlines.
(Alternatively use [\s\S] in place of the first .).
The first capture group will capture all that comes before the consecutive newlines, and then the next capture groups will capture each single line that follows. As in your regex, any leading *, - or space characters in the single lines will not be captured.
The paragraph(s) will be match.Groups[1].Value, the first captured single line will be match.Groups[2].Captures[0].Value, the second match.Groups[2].Captures[1].Value, etc.
If the line-endings may be simply \n, change \r\n to \r?\n.
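A minimal sketch of how the pattern above could be used from C#. The class and method names (ParagraphListParser, Parse) are hypothetical, and the input is assumed to use \r\n line endings with the list as the last block.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphListParser
{
    // Pattern from the answer above; Singleline lets '.' match line breaks too.
    private static readonly Regex Pattern = new Regex(
        @"(.+?)\r\n\r\n(?:[* -]*(.+?)(?:\r\n|$))+",
        RegexOptions.Singleline);

    // Returns the paragraph text (or null when nothing matches)
    // and appends each list line to 'items'.
    public static string Parse(string input, List<string> items)
    {
        Match match = Pattern.Match(input);
        if (!match.Success)
        {
            return null;
        }

        // Group 2 sits inside a repeated group, so .NET keeps one
        // Capture per iteration in Groups[2].Captures.
        foreach (Capture capture in match.Groups[2].Captures)
        {
            items.Add(capture.Value);
        }

        return match.Groups[1].Value;
    }
}
```

Calling Parse with "First paragraph.\r\nStill first.\r\n\r\n* one\r\n- two" should return the two-line paragraph and add "one" and "two" to the list.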

I'm not that good at regex, but yours seems quite optimized to me. To make it faster, use Split instead to separate the paragraph from the list:
string[] res = yourstring.Split(new[] { "\r\n\r\n" }, StringSplitOptions.None);
string paragraph = res[0];
string list = res[1];
Then you can use a regex, or Split again, to separate the list items from each other.
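The Split approach could be sketched as follows. SplitParser and its methods are hypothetical names, and the sketch assumes \r\n line endings and exactly one blank line between the paragraph and the list.

```csharp
using System;
using System.Linq;

public static class SplitParser
{
    // Splits the text at blank lines; index 0 is the paragraph, index 1 the list.
    public static string[] SplitSections(string input)
    {
        return input.Split(new[] { "\r\n\r\n" }, StringSplitOptions.None);
    }

    // Splits the list block into lines and drops leading '*', '-' and spaces.
    public static string[] SplitItems(string list)
    {
        return list
            .Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.TrimStart('*', '-', ' '))
            .ToArray();
    }
}
```

No backtracking is involved here, so the running time grows linearly with the length of the input.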

Related

Remove all Trailing <br> Using Regex, Substitute Group not Returning Full Match

Here is the problem. I have a block of pasted HTML text. I need to remove trailing line breaks and white space from the text, even ones followed by closing tags. The text below is simply an example, but it closely represents the real text I'm dealing with.
EG:
This:
<span>Here is some<br></span><br>
<span><span>Here is some text</span><br><span><br> </span></span><br><br>
Becomes this:
<span>Here is some<br></span><br>
<span><span>Here is some text<span></span></span>
My first pass: I use Regex.Replace(htmlString, @"(?:\<br\s*?\>)*$", "") to get rid of the trailing line breaks. Now all I have left is the line breaks and white space stuck behind closing tags.
I'm attempting to use this:
while (Regex.IsMatch(htmlString, @"(<br>|\s| )*(<[^>]*>)*$"))
{
htmlString = Regex.Replace(htmlString, @"(<br>|\s| )*(<[^>]*>)*$", "$2");
}
The regex pattern is actually working great, the problem is that the substitute by matched group 2 is only giving back a single closing span. So that I end up with the below:
<span>Here is some<br></span><br>
<span><span>Here is some text</span></span>

The regular expression is @"(<br>|\s| )*(<[^>]*>)*$". The second group is followed by a *, meaning the group is repeated, and so $2 only yields the last repetition of the group.
Putting the repetition inside a group will capture the whole repetition. Change the regular expression to @"(<br>|\s| )*((<[^>]*>)*)$".
Note that repeating the first group with a * may make the code spin on some input strings, as there is no guarantee that the Replace will change the text to a different string. As the first group is optional (i.e. zero or more repeats), the Replace might replace one string with exactly the same string. So I suggest changing the regular expression to @"(<br>|\s| )+((<[^>]*>)*)$", meaning that one or more occurrences of the first group are required.
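A rough sketch of the loop with the suggested fix applied. TrailingBreakCleaner is a hypothetical name, and the first group is reduced to `<br>` and `\s` here as a simplifying assumption.

```csharp
using System.Text.RegularExpressions;

public static class TrailingBreakCleaner
{
    // '+' on the first group guarantees every replacement removes something,
    // so the loop cannot spin on an unchanged string. Group 2 wraps the whole
    // run of tags, so "$2" puts all of them back, not just the last one.
    private static readonly Regex Trailing = new Regex(@"(<br>|\s)+((<[^>]*>)*)$");

    public static string Clean(string html)
    {
        while (Trailing.IsMatch(html))
        {
            html = Trailing.Replace(html, "$2");
        }
        return html;
    }
}
```

Applied to the second example line from the question, this should leave `<span><span>Here is some text</span><span></span></span>`, with both closing span tags preserved.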

I guess you can use:
resultString = Regex.Replace(subjectString, @"<br>| |\n", "");

Regex.Split returning whitespaces

I want to export a View as a HTML-Document to the User on my ASP.NET page. I want to give the option to only get a part of the view.
Because of that I want to split the output with Regex.Split(). I wrote a Regex that matches the part I want to cut out. After splitting I put the 2 output parts together again.
The problem is that I get a list of 3 parts, of which the second contains " ". How can I change the code so that the output contains only 2 strings?
My Code:
textParts = Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
text = textParts[0] + textParts[1];
text contains HTML, CSS and jQuery Code. I wrote comments like <!--Graphic2--> around the blocks I want to cut out.
EDIT
I got it working now by using the Regex.Replace() Method. But I still don't know why Split isn't working how I expected.

You should consider parsing HTML with the proper tools, like HtmlAgilityPack.
The current question is about why Regex.Split returned 3 values. That is due to the presence of a capturing group in your pattern. Regex.Split returns the chunks between start/end of string and the matched chunks, and all captured substrings:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
So, Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->") matches the <!--Graphic2--> substring, then matches and captures into Group 1 any 0+ occurrences of any char, as many as possible, and then matches <!--EndDiscarded-->. These matches are removed and the substrings that are not matched are returned, but the last char captured into the repeated capturing group is also returned.
So, if you plan to use regex for this task, you should consider re-writing it to @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->" or @"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->" that will be much more efficient, or even @"<!--Graphic2-->[^<]*(?:<(?!!--(?:EndDiscarded|Graphic2))[^<]*)*<!--EndDiscarded-->" that will ensure no nested Graphic2 comments are matched.
As you can see, the complexity of the regexps rises when you want your patterns to work more efficiently and more safely. However, even these longer versions do not guarantee 100% safety.
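The difference between the capturing and non-capturing patterns can be seen in a small sketch. SplitDemo and its method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Original pattern: the repeated capturing group (.|\n) makes Regex.Split
    // return the last captured character as an extra array element.
    public static string[] SplitWithCapture(string text)
    {
        return Regex.Split(text, @"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
    }

    // Rewritten pattern: no capturing group, so only the text around the
    // removed block is returned. (?s) lets '.' match newlines.
    public static string[] SplitWithoutCapture(string text)
    {
        return Regex.Split(text, @"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->");
    }
}
```

For the input "A<!--Graphic2-->x y <!--EndDiscarded-->B", the first method should return three elements ("A", " ", "B") and the second only two ("A", "B").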

Regex to parse formatter string

I am writing a string.Format-like method. In order to do this, I am adopting Regex to determine commands and parameters: e.g. Format(@"\m{0,1,2}", byteArr0, byteArr1, byteArr2)
For the first Regex, return 2 groups:
'\m'
'{0,1,2}'
Another Regex takes the value of '{0,1,2}' and has 3 matches:
0
1
2
These values are the indexes corresponding to the byteArr params.
This command structure is likely to grow, so I'm really trying to figure this out and learn enough to be able to modify the Regex for future requirements. I would think that a single Regex would do all of the above, but there is value in having 2 separate Regex(es/ices ???) expressions.
Any way, to get the first group '\m' the Regex is:
"(\\)(\w{1,1})" // I want the '{0,1,2}' group also
To get the integer matches '{0,1,2}' I was trying:
"(?<=\{)([^}]*)(?=\})"
I am having difficulty in achieving: (1) 2 groups on the first expression and (2) 3 matches on the integers within the braces delimited by a comma in the second expression.

Your first regex (\\)(\w{1,1}) can be greatly simplified.
You don't want to capture the \ separately from the m, so there is no need to wrap them in their own sets of parentheses.
\w{1,1} is the same as just \w.
So we have \\\w to match the first part \m.
Now to deal with the second part, really we can ignore everything other than the 0,1,2 in the example since there are no numbers elsewhere so you'd just use: \d+ and iterate through the matches.
But let's assume the example could actually be \9{1,2,3}.
Now \d+ would match the 9 so to avoid this we could use [{,](\d+)[,}]. This says capture a number that has either a , or { on the left of it and a , or } on the right.
You're right in saying that we can match the whole string with a single regex, something like this would do it:
(\\\w){((\d+),?)+}
However the problem with this is when you examine the contents of the capture groups afterwards, the last number caught by the (\d+) overwrites all the other values that were caught in there. So you'd be left with group 1: \m and group 2: 2 for your example.
With that in mind I recommend using 2 regexs:
For the 1st part: \\\w
For the numbers: I'd forget about the [{,](\d+)[,}] (and the many other ways you could do it), the cleanest way might just be to grab whatever is inside the {...} and then match with a simple \d+.
So to do this first use (\\\w)\{([^/}]+)\} to grab the \m into group 1 and the 1,2,3 into group 2, then just use \d+ on that.
FYI, your (?<=\{)([^}]*)(?=\}) works fine, but you can't put anything before the lookbehind, i.e. the \\\w. In the vast majority of cases where a lookbehind can be used, you can do what you want by just using capture groups and ignoring everything else.
My regex \{([^/}]+)\} is pretty much the same as your (?<=\{)([^}]*)(?=\}), except that rather than looking ahead and looking behind for the { and }, I just leave them outside the capture groups that are going to be used.
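The two-regex approach recommended above might look like this in C#. FormatCommandParser and Parse are hypothetical names, and the sketch handles only the first command found in the format string.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormatCommandParser
{
    // Group 1: the command such as \m. Group 2: whatever sits inside { }.
    private static readonly Regex Command = new Regex(@"(\\\w)\{([^/}]+)\}");

    // Applied only to the text captured from between the braces.
    private static readonly Regex Number = new Regex(@"\d+");

    // Returns the command (or null when none is found) and
    // appends each index from the braces to 'indexes'.
    public static string Parse(string format, List<int> indexes)
    {
        Match command = Command.Match(format);
        if (!command.Success)
        {
            return null;
        }

        foreach (Match number in Number.Matches(command.Groups[2].Value))
        {
            indexes.Add(int.Parse(number.Value));
        }

        return command.Groups[1].Value;
    }
}
```

Because the second regex only ever sees the brace contents, a digit used as the command (as in \9{1,2,3}) is not picked up as an index.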

Consider the following Regexes...
(^.*?)(?={.*})
\d+
Good Luck!

Regex - Lookahead for pattern through newlines

Still learning Regex, and am having trouble getting my head wrapped around the lookahead concept. Similar data to my question here - Matching multiple lines up until a sepertor line? , say I have the following lines handed to me by the user:
0000AA.The horizontal coordinates are valid at the epoch date displayed above.
0000AA.The epoch date for horizontal control is a decimal equivalence
0000AA.of Year/Month/Day.
0000AA
[..]
So a really simple Regex is @"^[0-9]{4}[A-Z]{2}\.(?<noteline>.*)", which would give me every line. Fantastic. :) However, I'd like a lookahead (or a condition?) that would look at the next line and tell me if that line has the code WITHOUT a '.' (i.e. if the NEXT line would match @"^[0-9]{4}[A-Z]{2}[^\.]").
Trying the lookahead, I get hits on the first two lines (because the following line has '.' after the code) but not on the last.
Edit: Using the regex above, or the one offered below gives me all lines, but I'd like to know IF a blank line (line with AA0000 code, but no '.' afterwards) follows. For example, when I get to the match on the line of Year/Month/Day, I'd like to know IF that line is followed by a blank line (or not). (Like with a grouping name that's not spaces or empty, for high-level example.)
Edit 2: I may be misusing the term 'lookahead'. Going back over .NET's regex documentation, I see something referred to as an Alternation Construct, but I'm not sure if that could be used here.
Thanks! Mike.

Apply the option RegexOptions.Multiline. It changes the meaning of ^ and $, making them match the beginning and the end of every line instead of the beginning and end of the entire string.
var matches = Regex.Matches(input,
@"^[0-9]{4}[A-Z]{2}\..*?$(?!^[0-9]{4}[A-Z]{2}[^.])",
RegexOptions.Multiline);
The negative lookahead is
find(?!suffix)
It matches find only at a position that is not followed by suffix. Don't escape the dot within the brackets [ ]. Brackets disable the special meaning of most characters anyway.
I also added .*?$ so that the pattern matches up to the end of the current line. The ? after the * makes it lazy. Otherwise it is greedy, meaning that it tries to match as many characters as possible; if . is also allowed to match newlines (RegexOptions.Singleline), it can match several lines at a time.
If you need only the number part, you can capture it in a group by enclosing it within parentheses.
(^[0-9]{4}[A-Z]{2})\..*?$(?!^[0-9]{4}[A-Z]{2}[^.])
You can then get the group like this
string number = match.Groups[1].Value;
Note: Group #0 represents the entire match.
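A sketch of the negative-lookahead idea in a slightly different form. It assumes plain \n line endings, and the lookahead spells out the line break itself, because a lookahead placed right after $ starts before the newline rather than at the beginning of the next line. NoteLineFinder is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NoteLineFinder
{
    // Matches a coded line with a '.', unless the next line is a code
    // that is NOT followed by a '.' (the "blank" line from the question).
    private static readonly Regex NoteLine = new Regex(
        @"^[0-9]{4}[A-Z]{2}\.(?<noteline>.*)$(?!\n[0-9]{4}[A-Z]{2}(?!\.))",
        RegexOptions.Multiline);

    // Returns the text of every note line that passes the lookahead.
    public static List<string> Find(string input)
    {
        var result = new List<string>();
        foreach (Match match in NoteLine.Matches(input))
        {
            result.Add(match.Groups["noteline"].Value);
        }
        return result;
    }
}
```

With this variant, a note line directly followed by a blank-coded line is skipped, which mirrors the behaviour described in the question (hits on the first lines of a note but not on the last).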

After doing a lot of research, and many hits and misses, I'm now fairly certain that it can't be done. Or rather, it CAN be done, but it would be prohibitively difficult, and it is easier to do it in code.
To recap, I was looking at a multiline string (document) where every line is preceded by a 6-character code. Some lines, the ones I'm interested in, have a '.' after the code, followed by open text. I was hoping there would be a way to get each line in a group, along with a flag letting me know if the next line has no free-text entry (no '.' after the code). I.e. a two-line data entry would give me two matches on the document. The first match would have the line's text in the group called 'notetext', and the group 'lastline' would be empty. The second match would have the second part of the entered note in 'notetext', and the group 'lastline' would have something (anything, the content wouldn't matter).
From what I understand, lookaheads are zero-width assertions, so even if one matches, the returnable value is still empty. Without using a lookahead, the match for 'lastline' would consume the next line's code, making 'notetext' skip that line (giving me every other line of text). So I would need some back-reference to revert to.
At this point, it'd be easier (code-wise) to simply get all the lines and add up text until I get to the end of each note. That means looping over the entire document, which can't be more than 200 lines, as opposed to looping through the regex-matched lines, but the ease of reading the code for future modifications would outweigh any slight speed advantage the regex could get me.
Thanks guys -
-Mike.
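The in-code approach described above could be sketched like this. NoteCollector is a hypothetical name, and joining the lines of a note with a single space is an assumption about the desired output.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NoteCollector
{
    // A line with code, '.', then free text.
    private static readonly Regex NoteLine =
        new Regex(@"^[0-9]{4}[A-Z]{2}\.(?<notetext>.*)$");

    // A line holding only the code: treated as the end of a note.
    private static readonly Regex BlankLine =
        new Regex(@"^[0-9]{4}[A-Z]{2}$");

    // Returns one string per note; lines that match neither pattern are ignored.
    public static List<string> Collect(string document)
    {
        var notes = new List<string>();
        var current = new List<string>();

        foreach (string rawLine in document.Split('\n'))
        {
            string line = rawLine.TrimEnd('\r');
            Match match = NoteLine.Match(line);

            if (match.Success)
            {
                current.Add(match.Groups["notetext"].Value);
            }
            else if (BlankLine.IsMatch(line) && current.Count > 0)
            {
                notes.Add(string.Join(" ", current));
                current.Clear();
            }
        }

        // Flush a note that is not terminated by a blank-coded line.
        if (current.Count > 0)
        {
            notes.Add(string.Join(" ", current));
        }

        return notes;
    }
}
```

Each regex only ever sees a single line, so no lookahead across line breaks is needed.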

How to prevent regex from stopping at the first match of alternatives?

If I have the string hello world, how can I modify the regex world|wo|w so that it will match all of "world", "wo" and "w" rather than just the single first match of "world" that it comes to?
If this is not possible directly, is there a good workaround? I'm using C# if it makes a difference:
Regex testRegex = new Regex("world|wo|w");
MatchCollection theMatches = testRegex.Matches("hello world");
foreach (Match thisMatch in theMatches)
{
...
}

I think you're going to need to use three separate regexs and match on each of them. When you specify alternatives it considers each one a successful match and stops looking after matching one of them. The only way I can see to do it is to repeat the search with each of your alternatives in a separate regex. You can create an array or list of Match items and have each search add to the list if you want to be able to iterate through them later.
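A small sketch of running each alternative as its own regex and collecting the results. AlternativeMatcher and MatchEach are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternativeMatcher
{
    // Runs every pattern separately over the same input and gathers all matches.
    public static List<Match> MatchEach(string input, params string[] patterns)
    {
        var all = new List<Match>();
        foreach (string pattern in patterns)
        {
            foreach (Match match in Regex.Matches(input, pattern))
            {
                all.Add(match);
            }
        }
        return all;
    }
}
```

Calling MatchEach("hello world", "world", "wo", "w") should yield three matches, all starting at index 6.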

If you're trying to match (the beginning of) the word world three times, you'll need to use three separate Regex objects; a single Regex cannot match the same character twice.

As SLaks wrote, a regex can't match the same text more than once.
You could "fake it" like this:
\b(w)((?<=w)o)?((?<=wo)rld)?
will match the w, the o only if preceded by w*, and rld only if preceded by wo.
Of course, only parts of the word will actually be matched, but you'll see whether only the first one, the first two or all the parts did match by looking at the captured groups.
So in the word want, the w will match (the rest is optional, so the regex reports overall success).
In work, the wo will match; \1 will contain w, and \2 will contain o. The rld will fail, but since it's optional, the regex still reports success.
I have added a word boundary anchor \b to the start of the regex to avoid matches in the middle of words like reword; if you don't want to exclude those matches, drop the \b.
* The (?<=w) is not actually needed here, but I kept it in for consistency.
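One way to read the captured groups back in C#, as a rough sketch. PrefixMatcher and MatchedAlternatives are hypothetical names, and only the first match in the input is inspected.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PrefixMatcher
{
    private static readonly Regex Prefixes =
        new Regex(@"\b(w)((?<=w)o)?((?<=wo)rld)?");

    // Reports which of "w", "wo" and "world" are present, based on
    // which optional groups took part in the match.
    public static List<string> MatchedAlternatives(string input)
    {
        var found = new List<string>();
        Match match = Prefixes.Match(input);
        if (!match.Success)
        {
            return found;
        }

        if (match.Groups[1].Success) found.Add("w");
        if (match.Groups[2].Success) found.Add("wo");
        if (match.Groups[3].Success) found.Add("world");
        return found;
    }
}
```

Group.Success is false for an optional group that did not participate, which is what distinguishes want, work and world here.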
