Reusable Non-Capture Groups [duplicate] - c#

I can't seem to find an answer to this problem, and I'm wondering if one exists. Simplified example:
Consider a string "nnnn", where I want to find all matches of "nn" - but also those that overlap with each other. So the regex would provide the following 3 matches:
nnnn
nnnn
nnnn
I realize this is not exactly what regexes are meant for, but walking the string and parsing this manually seems like an awful lot of code, considering that in reality the matches would have to be done using a pattern, not a literal string.

Update 2016:
To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).
(?=(nn))
Original answer (2008)
A possible solution could be to use a positive look behind:
(?<=n)n
It would give you the end position of:
nnnn
 
nnnn
 
nnnn
As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)
I would prefer to his proposition (?=nn)n the simpler form:
(n)(?=(n))
That would reference the first position of the strings you want and would capture the second n in group(2).
That is so because:
Any valid regular expression can be used inside the lookahead.
If it contains capturing parentheses, the backreferences will be saved.
So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).

Using a lookahead with a capturing group works, at the expense of making your regex slower and more complicated. An alternative solution is to tell the Regex.Match() method where the next match attempt should begin. Try this:
Regex regexObj = new Regex("nn");
Match matchObj = regexObj.Match(subjectString);
while (matchObj.Success) {
matchObj = regexObj.Match(subjectString, matchObj.Index + 1);
}

AFAIK, there is no pure regex way to do that at once (ie. returning the three captures you request without loop).
Now, you can find a pattern once, and loop on the search starting with offset (found position + 1). Should combine regex use with simple code.
[EDIT] Great, I am downvoted when I basically said what Jan shown...
[EDIT 2] To be clear: Jan's answer is better. Not more precise, but certainly more detailed, it deserves to be chosen. I just don't understand why mine is downvoted, since I still see nothing incorrect in it. Not a big deal, just annoying.

Related

How to use regex in order to catch a power statement

How can I use regex to catch a power statement, here are some examples:
24
(2*5)x
y(y+1)
or more complex ones such as x4+(x*2)(x+1) in which case it has 2 matches ("x4" and "(x*2)(x+1)")
I managed to get it working without the parenthesis using the expression:
Regex rPower = new Regex(#"\w\^\w");
But to deal with the possible existence of parenthesis I was thinking of something along these lines, but it still isn't working...
Regex rPower = new Regex(#"(?(?=\()(.*?(?=\)))|(\w))\^(?(?=\()(.*?(?=\)))|(\w))");
Any help/explanation that includes the thought process behind it would be deeply appreciated since I don't know much about regex and and I'm just now starting to learn it.
Thanks in advance
EDIT: For clarity what I intend to do is:
If in the string there is a substring which may start with an "(" in which case it should read everything from that "(" until it find and ")" otherwise assume it's an "\w", separated by a "^" which in turn follows another pattern just like the one it started with.
Basically it will match the expression "(random_Expression)(random_Expression)", but it may not actually be a complex expression, if it does not contain any parenthesis I will assume it's a simple "\w".
I hope I made myself clear :S
You may use this:
(\([^)]*\)|\w)\^(\([^)]*\)|\w)
Sample matches:
2^2 matches 2^2
a+b^c matches b^c
(a+b)^(c+d) matches (a+b)^(c+d)
2^(a+b) matches 2^(a+b)
(a+b)^2 matches (a+b)^2
(a+b)^2+5^2-(3+2)^(2+3) matches (a+b)^2, 5^2, (3+2)^(2+3)
Obviously, you may find bugs on the expression if stuff like nested operations is used. If you are going to work with complex expressions, I guess you will have to parse them carefully with a more elaborated method.
Could you please edit or reply with an explanation even if brief of
how the expression is working?
It is similar to your original expression \w\^\w, but it changes each \w with (\([^)]*\)|\w). If you look closely, that matches either "something inside parentheses" (given by\([^)]*\), which doesn't work for nested brackets) or "a simple word" (\w).
Hope that helps a bit :)

How does non-backtracking subexpression work "(?>exp)"

I am trying to become better at regular expressions. I am having a hard time trying to understand what does (?> expression ) means. Where can I find more info on non-backtacking subexpressoins? The description of THIS link says:
Greedy subexpression, also known as a non-backtracking subexpression.
This is matched only once and then does not participate in
backtracking.
this other link: http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx has also a definition of non-backtracking subexpression but I still am having a hard time understanding what it means plus I cannot think of an example where I will use (?>exp)
As always, regular-expressions.info is a good place to start.
Use an atomic group if you want to make sure that whatever has once been matched will stay part of the match.
For example, to match a number of "words" that may or may not be separated by spaces, then followed by a colon, a user tried the regex:
(?:[A-Za-z0-9_.&,-]+\s*)+:
When there was a match, everything was fine. But when there wasn't, his PC would become non-responsive with 100% CPU load because of catastrophic backtracking because the regex engine would vainly try to find a matching combination of words that would allow the following colon to match. Which was of course impossible.
By using an atomic group, this could have been prevented:
(?>[A-Za-z0-9_.&,-]+\s*)+:
Now whatever has been matched stays matched - no backtracking and therefore fast failing times.
Another good example I recently came across:
If you want to match all numbers that are not followed by an ASCII letter, you might want to use the regex \d+(?![A-Za-z]). However, this will fail with inputs like 123a because the regex engine will happily return the match 12 by backtracking until the following character is no longer a letter. If you use (?>\d+)(?![A-Za-z]), this won't happen. (Of course, \d+(?![\dA-Za-z]) would also work)
The Regex Tutorial has a page on it here: http://www.regular-expressions.info/atomic.html
Basically what it does is discards backtracking information, meaning that a(?>bc|b)c matches abcc but not abc.
The reason it doesn't match the second string is because it finds a match with bc, and discards backtracking information about the bc|b alternation. It essentially forgets the |b part of it. Therefore, there is no c after the bc, and the match fails.
The most useful method of using atomic groups, as they are called, is to optimize slow regexes. You can find more information on the aforementioned page.
Read up on possessive quantifiers [a-z]*+ make the backtracking engine remember only the previous step that matched not all of the previous steps that matched.
This is useful when a lot of acceptable steps are probable and they will eat up memory if each step is stored for any possible backtracking regression.
Possessive quantifiers are a shorthand for atomic groups.

How to Find All Matches in Regular Expressions when one Overlaps OR Contains the Other?

The question of how to find every match when they might overlap was asked in Overlapping matches in Regex. However, as far as I can see, the answers there does not cover a more general case.
How can we find all substrings that begin with "a" and end with "z"? For example, given "akzzaz", it should find "akz", "akzz", "az" and "akzzaz".
Since there may be more than one match starting at the same position, ("akz" and "akzz") and also there may be more than one match ending at the same position ("az" and "akzzaz") I cannot see how using a lookahead or lookbehind helps as in the mentioned link. (Also, please bear in mind that in the general case "a" and "z" might be more complex regular expressions)
I use C#, so, in case it matters, having any feature specific to .Net Regular Expressions is OK.
Regular expressions are designed to find one match at a time. Even a global match operation is simply repeated applications of the same regex, each starting at the end of the previous match in the target string. So no, regexes are not able to find all matches in this way.
I will stick my neck out and say that I don't believe you can even find "all strings beginning with 'a' in 'akzzaz'" with a regex. /(a.*)/g will find the entire string, while /(a.*?)/g will find just 'a' twice.
The way I would code this would be to locate all 'a's, and search each of the substrings from there to the end of the string for all 'z's. So search 'akzzaz` and 'az' for 'z', giving 'akz', 'akzz', 'akzzaz', and 'az'. That is a fairly simple thing to do, but not a job for a regex unless the actual 'a' and 'z' tokens are complex.
For your current problem, string.startwith and string.endwith would do be a better job. Regular Expression is not necessarily faster in all cases.
Try this regular expression
a[akz]+z - in case a, k and z are the only characters
a[a-z]+z - in case of any alphabet
I think it's worth noting that there is actually a way for a regex to return more than one match at the same time. Although this doesn't answer your question, I think this would be a good place to mention this for others who may run into a similar situation.
The regex below for example would return all the right substrings of a string with a single match and has them in different capturing groups:
(?=(\w+)).
This regex uses capturing groups inside a zero-width assertion and for each match at position i(each character) the capturing group is a substring of length n-i.
Doing anything that would require the regex engine to stay in the same place after a match is probably overkill for a regular expression approach.

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default Readonly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
Get
Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
If (Not validAddresses.IsMatch(cellsAddresses)) then _
Throw New FormatException("cellsAddresses")
// Proceed with getting the cells from the Interop here...
End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. What exception is more likely to be the more meaningful between a FormatException and an InvalidExpressionException? I hesitate here, since it is related to the format under which the property expect the cells to be input, aside, I'm using an (regular) expression to match against.
Thank you kindly for your help and support! =)
I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>*)?)*$
which more clearly shows that it is a cell or a range followed by one or more "cell or range" entries with commas in between.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!\w*:): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.
i think your regex is incorrect, try (([A-Za-z0-9]*)[:,]?)*
Edit : to correct the bug pointed out by Baud : (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
and finally - best version : (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// ah ok this wont work probably... but to answer 1. - no i dont think your regex is correct
( ) form a group
[ ] form a charclass (you can use A-Z a-d 0-9 etc or just single characters)
? means 1 or 0
* means 0 or any
id suggest reading http://www.regular-expressions.info/reference.html .
thats where i learned regexes some time ago ;)
and for building expressions i use Rad Software Regular Expression Designer
Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.

In C# regular expression why does the initial match show up in the groups?

So if I write a regex it's matches I can get the match or I can access its groups. This seems counter intuitive since the groups are defined in the expression with braces "(" and ")". It seems like it is not only wrong but redundant. Any one know why?
Regex quickCheck = new Regex(#"(\D+)\d+");
string source = "abc123";
m.Value //Equals source
m.Groups.Count //Equals 2
m.Groups[0]) //Equals source
m.Groups[1]) //Equals "abc"
I agree - it is a little strange, however I think there are good reasons for it.
A Regex Match is itself a Group, which in turn is a Capture.
But the Match.Value (or Capture.Value as it actually is) is only valid when one match is present in the string - if you're matching multiple instances of a pattern, then by definition it can't return everything. In effect - the Value property on the Match is a convenience for when there is only match.
But to clarify where this behaviour of passing the whole match into Groups[0] makes sense - consider this (contrived) example of a naive code unminifier:
[TestMethod]
public void UnMinifyExample()
{
string toUnMinify = "{int somevalue = 0; /*init the value*/} /* end */";
string result = Regex.Replace(toUnMinify, #"(;|})\s*(/\*[^*]*?\*/)?\s*", "$0\n");
Assert.AreEqual("{int somevalue = 0; /*init the value*/\n} /* end */\n", result);
}
The regex match will preserve /* */ comments at the end of a statement, placing a newline afterwards - but works for either ; or } line-endings.
Okay - you might wonder why you'd bother doing this with a regex - but humour me :)
If Groups[0] generated by the matches for this regex was not the whole capture - then a single-call replace would not be possible - and your question would probably be asking why doesn't the whole match get put into Groups[0] instead of the other way round!
The documentation for Match says that the first group is always the entire match so it's not an implementation detail.
It's historical is all. In Perl 5, the contents of capture groups are stored in the special variables $1, $2, etc., but C#, Java, and others instead store them in an array (or array-like structure). To preserve compatibility with Perl's naming convention (which has been copied by several other languages), the first group is stored in element number one, the second in element two, etc. That leaves element zero free, so why not store the full match there?
FYI, Perl 6 has adopted a new convention, in which the first capturing group is numbered zero instead of one. I'm sure it wasn't done just to piss us off. ;)
Most likely so that you can use "$0" to represent the match in a substitution expression, and "$1" for the first group match, etc.
I don't think there's really an answer other than the person who wrote this chose that as an implementation detail. As long as you remember that the first group will always equal the source string you should be ok :-)
Not sure why either, but if you use named groups you can then set the option RegExOptions.ExplicitCapture and it should not include the source as first group.
It might be redundant, however it has some nice properties.
For example, it means the capture groups work the same way as other regex engines - the first capture group corresponds to "1", and so on.
Backreferences are one-based, e.g., \1 or $1 is the first parenthesized subexpression, and so on. As laid out, one maps to the other without any thought.
Also of note: m.Groups["0"] gives you the entire matched substring, so be sure to skip "0" if you're iterating over regex.GetGroupNames().

Categories