Do backreferences need to come after the group they reference? - c#

While running some tests for this answer, I noticed the following unexpected behavior. This will remove all occurrences of <tag> after the first:
var input = "<text><text>extra<words><text><words><something>";
Regex.Replace(input, #"(<[^>]+>)(?<=\1.*\1)", "");
// <text>extra<words><something>
But this will not:
Regex.Replace(input, #"(?<=\1.*)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>
Similarly, this will remove all occurences of <tag> before the last:
Regex.Replace(input, #"(<[^>]+>)(?=.*\1)", "");
// extra<text><words><something>
But this will not:
Regex.Replace(input, #"(?=\1.*\1)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>
So this got me thinking…
In the .NET regular expression engine, does a backreference need to appear after the group it's referencing? Or is there something else going on with these patterns that's causing them not to work?

Your question got me thinking too, so I ran a few tests with RegexBuddy and to my surprise the second regex (?<=\1.*)(<[^>]+>) which you said didn't work actually worked and the others worked exactly like you said. I then tried the same expression - the second one - in C# code but it didn't work like what happened with you.
This got me confused, then I noticed that my RegexBuddy version dates back to 2008 so there must have been some change in how the .NET engine works, but this shed the light on a fact I though is rational, it seems that before 2008 lookbehinds were evaluated after the rest of the expression matched. I felt this behavior is a bit acceptable with lookbehinds since you need to match something before you look behind to match something before it.
Nevertheless, the engines these days seem to evaluate lookarounds when it encounters them and I was able to find this out by using the following expression which is like the reverse situation of your case:
(?<=(\w))\1
As you can see I captured a word character inside the regex and referenced it outside it. I tested this on the string hello and it matched at the second l character as expected and this proves that the lookbehind was executed before attempting to match the rest of the expression.
Conclusion: Yes, a back reference need to appear after the group it references or it will have no match semantics.

Related

Reusable Non-Capture Groups [duplicate]

I can't seem to find an answer to this problem, and I'm wondering if one exists. Simplified example:
Consider a string "nnnn", where I want to find all matches of "nn" - but also those that overlap with each other. So the regex would provide the following 3 matches:
nnnn
nnnn
nnnn
I realize this is not exactly what regexes are meant for, but walking the string and parsing this manually seems like an awful lot of code, considering that in reality the matches would have to be done using a pattern, not a literal string.
Update 2016:
To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).
(?=(nn))
Original answer (2008)
A possible solution could be to use a positive look behind:
(?<=n)n
It would give you the end position of:
nnnn
 
nnnn
 
nnnn
As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)
I would prefer to his proposition (?=nn)n the simpler form:
(n)(?=(n))
That would reference the first position of the strings you want and would capture the second n in group(2).
That is so because:
Any valid regular expression can be used inside the lookahead.
If it contains capturing parentheses, the backreferences will be saved.
So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
Using a lookahead with a capturing group works, at the expense of making your regex slower and more complicated. An alternative solution is to tell the Regex.Match() method where the next match attempt should begin. Try this:
Regex regexObj = new Regex("nn");
Match matchObj = regexObj.Match(subjectString);
while (matchObj.Success) {
matchObj = regexObj.Match(subjectString, matchObj.Index + 1);
}
AFAIK, there is no pure regex way to do that at once (ie. returning the three captures you request without loop).
Now, you can find a pattern once, and loop on the search starting with offset (found position + 1). Should combine regex use with simple code.
[EDIT] Great, I am downvoted when I basically said what Jan shown...
[EDIT 2] To be clear: Jan's answer is better. Not more precise, but certainly more detailed, it deserves to be chosen. I just don't understand why mine is downvoted, since I still see nothing incorrect in it. Not a big deal, just annoying.

Why is this regex lookbehind not following the left priority in an alternation?

Say input is String1OptionalString2WhatWeWant
Another kind of input is String1WhatWeWant
So I want to match WhatWeWant part, and first part should go to prefix.
However I cant seem to get this result.
Following regex doesn't produce desired effect
(?<=string1optionalstring2|string1)\w+
It still matches optionalstring2 while I don't what that.
I assumed that it would prefer left full match ..
I assume String1 is always present? Then:
(?:String1)(?:OptionalString2)?\w+
What happened
To understand why the lookbehind behave in a seemingly incoherent way, remember that the regex engine goes from left to right and returns the first match it finds.
Let's look at the steps it takes to match (?<=ab|a)\w+ on abc:
the engine starts at a. There isn't anything before, so the lookbehind fails
transmission kicks in, the engine is now considering a match starting from b
the lookbehind tries the first item of the alternation (ab) which fails
... but the second item (a) matches
\w+ matches the rest of the string
The overall match is therefore bc, and the regex engine hasn't broken any of its rule in the process.
How to fix it
If C# supported the \K escape sequence, you could just use the greediness of ? to do the work for you (demo here):
string1(?:optionalstring2)?\K\w+
However, this (sadly) isn't the case. It therefore seems that you are stuck with using a capturing group:
string1(?:optionalstring2)?(\w+)

How to use regex in order to catch a power statement

How can I use regex to catch a power statement, here are some examples:
24
(2*5)x
y(y+1)
or more complex ones such as x4+(x*2)(x+1) in which case it has 2 matches ("x4" and "(x*2)(x+1)")
I managed to get it working without the parenthesis using the expression:
Regex rPower = new Regex(#"\w\^\w");
But to deal with the possible existence of parenthesis I was thinking of something along these lines, but it still isn't working...
Regex rPower = new Regex(#"(?(?=\()(.*?(?=\)))|(\w))\^(?(?=\()(.*?(?=\)))|(\w))");
Any help/explanation that includes the thought process behind it would be deeply appreciated since I don't know much about regex and and I'm just now starting to learn it.
Thanks in advance
EDIT: For clarity what I intend to do is:
If in the string there is a substring which may start with an "(" in which case it should read everything from that "(" until it find and ")" otherwise assume it's an "\w", separated by a "^" which in turn follows another pattern just like the one it started with.
Basically it will match the expression "(random_Expression)(random_Expression)", but it may not actually be a complex expression, if it does not contain any parenthesis I will assume it's a simple "\w".
I hope I made myself clear :S
You may use this:
(\([^)]*\)|\w)\^(\([^)]*\)|\w)
Sample matches:
2^2 matches 2^2
a+b^c matches b^c
(a+b)^(c+d) matches (a+b)^(c+d)
2^(a+b) matches 2^(a+b)
(a+b)^2 matches (a+b)^2
(a+b)^2+5^2-(3+2)^(2+3) matches (a+b)^2, 5^2, (3+2)^(2+3)
Obviously, you may find bugs on the expression if stuff like nested operations is used. If you are going to work with complex expressions, I guess you will have to parse them carefully with a more elaborated method.
Could you please edit or reply with an explanation even if brief of
how the expression is working?
It is similar to your original expression \w\^\w, but it changes each \w with (\([^)]*\)|\w). If you look closely, that matches either "something inside parentheses" (given by\([^)]*\), which doesn't work for nested brackets) or "a simple word" (\w).
Hope that helps a bit :)

Regex to validate domain name with port

I am new developer and don't have much exposure on Regular Expression. Today I assigned to fix a bug using regex but after lots of effort I am unable to find the error.
Here is my requirement.
My code is:
string regex = "^([A-Za-z0-9\\-]+|[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9] {1,3}\\.[A-Za-z0-9]{1,3}):([0-9]{1,5}|\\*)$";
Regex _hostEndPointRegex = new Regex(regex);
bool isTrue = _hostEndPointRegex.IsMatch(textBox1.Text);
It's throwing an error for the domain name like "nikhil-dev.in.abc.ni:8080".
I am not sure where the problem is.
Your regex is a bit redundant in that you or in some stuff that is already included in the other or block.
I just simplified what you had to
(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}
and it works just fine...
I'm not sure why you had \ in the allowed characters as I am pretty sure \ is not allowed in a host name.
Your problem is that your or | breaks things up like this...
[A-Za-z0-9\\-]+
or
[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}
or
\*
Which as the commentor said was not including "-" in the 2nd block.
So perhaps you intended
^((?:[A-Za-z0-9\\-]+|[A-Za-z0-9]{1,3})\.[A-Za-z0-9]{1,3}\.[A-Za-z0-9]{1,3}\.[A-Za-z0-9]{1,3}):([0-9]{1,5}|\*)$
However the first to two or'ed items would be redundant as + includes {1-3}.
ie. [A-Za-z0-9\-]+ would also match anything that this matches [A-Za-z0-9]{1,3}
You can use this tool to help test your Regex:
http://regexpal.com/
Personally I think every developer should have regexbuddy
The regex above although it works will allow non-valid host names.
it should be modified to not allow punctuation in the first character.
So it should be modified to look like this.
(?:[A-Za-z0-9][A-Za-z0-9-]+\.)(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}
Also in theory the host isn't allowed to end in a hyphen.
it is all so complicated I would use the regex only to capture the parts and then use Uri.CheckHostName to actually check the Uri is valid.
Or you can just use the regex suggested by CodeCaster

Help writing a regular expression

I asked a very similar question to this one almost a month ago here.
I am trying very hard to understand regular expressions, but not a bit of it makes any sense. SLak's solution in that question worked well, but when I try to use the Regex Helper at http://gskinner.com/RegExr/ it only matches the first comma of -2.2,1.1-6.9,2.3-12.8,2.3 when given the regex ,|(?<!^|,)(?=-)
In other words I can't find a single regex tool that will even help me understand it. Well, enough whining. I'm now trying to re-write this regex so that I can do a Regex.Split() to split up the string 2.2 1.1-6.9,2.3-12.8 2.3 into -2.2, 1.1, -6.9, 2.3, -12.8, and 2.3.
The difference the aforementioned question is that there can now be leading and/or trailing whitespace, and that whitespace can act as a delimiter as can a comma.
I tried using \s|,|(?<!^|,)(?=-) but this doesn't work. I tried using this to split 293.46701,72.238185, but C# just tells me "the input string was not in a correct format". Please note that there is leading and trailing whitespace that SO does not display correctly.
EDIT: Here is the code which is executed, and the variables and values after execution of the code.
If it doesn't have to be Regex, and if it doesn't have to be slow :-) this should do it for you:
var components = "2.2 1.1-6.9,2.3-12.8 2.3".Replace("-", ",-").
Split(new[]{' ', ','},StringSplitOptions.RemoveEmptyEntries);
Components would then contain:[2.2 1.1 -6.9 2.3 -12.8 2.3]
Does it need to be split? You could do Regex.Matches(text, #"\-?[\d]+(\.[\d]+)?").
If you need split, Regex.Split(text, #"[^\d.-]+|(?=-)") should work also.
P.S. I used Regex Hero to test on the fly http://regexhero.net
Unless I'm missing the point entirely (it's Sunday night and I'm tired ;) ) I think you need to concentrate more on matching the things you do want and not the things you don't want.
Regex argsep = new Regex(#"\-?[0-9]+\.?[0-9]*");
string text_to_split = "-2.2 1.1-6.9,2.3-12.8 2.3 293.46701,72.238185";
var tmp3 = argsep.Matches(text_to_split);
This gives you a MatchCollection of each of the values you wanted.
To break that down and try and give you an understanding of what it's saying, split it up into parts:
\-? Matches a literal minus sign (\ denotes literal characters) zero or one time (?)
[0-9]+ Matches any character from 0 to 9, one or more times (+)
\.? Matches a literal full stop, zero or one time (?)
[0-9]* Matches any character from 0 to 9 again, but this time it's zero or more times (*)
You don't need to worry about things like \s (spaces) for this regex, as the things you're actually trying to match are the positive/negative numbers.
Consider using the string split function. String operations are way faster than regular expressions and much simpler to use/understand.
If the "Matches" approach doesnt work you could perhaps hack something in two steps?
Regex RE = new Regex(#"(-?[\d.]+)|,|\s+");
RE.Split(" -2.2,1.1-6.9,2.3-12.8,2.3 ")
.Where(s=>!string.IsNullOrEmpty(s))
Outputs:
-2.2
1.1
-6.9
2.3
-12.8
2.3

Categories