C# Regex Captures Extra Parameters - c#

I'm using .NET's regex as part of my university assignment (writing a compiler). I found an interesting caveat that's driving me nuts.
I have this regex pattern: \A(?:(func)[^\w\d]*|(func)\z)
When I try to match string "func sum(a, b)\n..., the resulting Match object has one item in CaptureCollection containing the string "func ".
Why am I getting the whitespace along with my keyword?

You're talking about item #0. The item at index 0 is always the whole match. The following items are the captured groups.
You got a match from the (func)[^\w\d]* part, and [^\w\d]* captured the whitespace you're seeing in the result.

Because [^\w\d]* part match a space character, without it it gives only func. Compere it to THIS

You're trying to negate a character group of either a word or digit to come immediately after "func" with [^\w\d]*, a whitespace qualifies.
You also specify any number of non-words and non-digits with the *, explaining the several whitespaces captured alongside "func".
I hope that answers your question as to why you're capturing the whitespaces.
I am uncertain what your exact goal is, so here are some examples:
This statement matches only "func" with any word immediately after it: \A(?:(func)[\w\d]*|(func)\z)
This statement matches "func" at the beginning of EACH line and the end of the ENTIRE string: ^func|func\z
This statement matches "func" at the beginning of the entire string and the end of the ENTIRE string: \Afunc|func\z
You can find a quick reference page here: Regular Expression Language - Quick Reference

Related

Conditional match without false force a match?

I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The options are ignoring pattern whitespaces.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplyfied regex and simplyfied input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I try to understand why the conditional group doesn't match as stated in original regex.
I'm firm in regex and know that the regex can be simplyfied to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I tested a little arround it. I found the following regex that should match on every input, even an empty one - but it doesn't:
(?(a)a).*
DEMO
I'm of the opinion that this is a bug in .net regex and reported it to microsoft: See here for more information
There is no error in the regex parser, but in one's usage of the . wildcard specifier. The . specifier will consume all characters, wait for it, except the linefeed character \n. (See Character Classes in Regular Expressions "the any character" .])
If you want your regex to work you need to consume all characters including the linefeed and that can be done by specify the option SingleLine. Which to paraphrase what is said
Singline tells the parser to handle the . to match all characters including the \n.
Why does it still fail when not in singleline mode for the other lines are consumed? That is because the final match actually places the current position at the \n and the only option (as specified is use) is the [.*]; which as we mentioned cannot consume it, hence stops the parser. Also the $ will lock in the operations at this point.
Let me demonstrate what is happening by a tool I have created which illustrates the issue. In the tool the upper left corner is what we see of the example text. Below that is what the parser sees with \r\n characters represented by ↵¶ respectively. Included in that pane is what happens to be matched at the time in yellow boxes enclosing the match. The middle box is the actual pattern and the final right side box shows the match results in detail by listening out the return structures and also showing the white space as mentioned.
Notice the second match (as index 1) has world in group capture id and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singline and see what happens.
Now everything is consumed, but there is a different problem. :-)

Regex - find matches not contained within pattern

I would like to use a regular expression to match all occurrences of a phrase where it's not contained within some delimiting characters. I tried putting one together but had some difficulty with the negative lookaheads.
My search phrase is "my phrase". The start delimiter tag is [[ and the end delimiter tag is ]]. The string I'd like to search is:
Here is a sentence with my phrase, here's another part which I don't want to match on [[my phrase]]. I would like to find this occurrence of my phrase.
From this string I would expect to find all occurrences of "my phrase" except the one contained within [[ ]].
I hope that makes sense, thanks in advance for any guidance.
[^#]my phrase[^#]
I have knocked up a RegEx that will do what you ask, this can be seen here.
Literally just escaping out # as a character and allowing any other character to be returned. You can return the index of these results but remember to strip off the first and last character of the string.
Note: This will not pick up any "my phrase" that end the sentence without a character following it
Edit - Seeing as you changed the scope while I was writing this answer,
here is the RegEx for the other delimiter:
[^[[]my phrase[^\]\]]
(?<=[^\[])my phrase(?=[^\]]*)
This will also elliminate the trailing punctuation marks.

Character 'e' is not recognized by simple regular expression - why?

I wrote a very simple regular expression that need to match the next pattern:
word.otherWord
- Word must have at least 2 characters and must not start with digit.
I wrote the next expression:
[a-zA-Z][a-zA-Z](.[a-zA-Z0-9])+
I tested it using Regex tester and it seems to be working at most of the cases but when I try some inputs that ends with 'e' it's not working.
for example:
Hardware.Make does not work but Hardware.Makee is works fine, why? How can I fix it?
That's because your regex looks for inputs which length is even.
You have two characters matched by [a-zA-Z][a-zA-Z] and then another two characters matched by (.[a-zA-Z0-9]) as a group which is repeated one or more times (because of +).
You can see it here: http://regex101.com/r/fW2bC1
I think you need that:
[a-zA-Z]+(\.[a-zA-Z0-9]+)+
Actually, the dot is a regex metacharacter, which stands for "any character". You'll need to escape the dot.
For your situation, I'd do this:
[a-zA-Z]{2,}\.[a-zA-Z0-9]+
The {2,} means, at least 2 characters from the previous range.
In regex, the dot period is one of the most commonly used metacharacters and unfortunately also commonly misused metacharacter. The dot matches a single character without caring what that character is...
So u would also re-write it like
[a-zA-Z]+(\.[a-zA-Z0-9]+)+

Does regex + symbol apply to previous element only?

In order to match all strings beginning with 04 and only containing digits, will the following work?
Regex.IsMatch(str, "^04[0-9]+$")
Or is another set of brackets necessary?
Regex.IsMatch(str, "^04([0-9])+$")
In Regex:
[character_group]
Matches any single character in character_group.
\d
Matches any decimal digit.
+
Matches the previous element one or more times.
(subexpression)
Captures the matched subexpression and assigns it a ordinal number.
^
The match must start at the beginning of the string or line.
$
The match must occur at the end of the string or before \n at the end of the line or string.
so that this code could be helpful:
Regex.IsMatch(str, "^04\d+$")
and all of your code works correctly.
Your first regex is correct, but the second one isn't. It matches the same things as the first regex, but it does a lot of unnecessary work in the process. Check it out:
Regex.IsMatch("04123", #"^04([0-9])+$")
In this example, the 1 is captured in group #1, only to be overwritten by 2 and again by 3. It's almost never a good idea to add a quantifier to a capturing group. For a detailed explanation, read this.
But maybe it's precedence rules you're asking about. Quantifiers have higher precedence than concatenation, so there's no need to isolate the character class with parentheses (if that's what you're doing).

Using regex to match any character until a substring is reached?

I'd like to be able to match a specific sequence of characters, starting with a particular substring and ending with a particular substring. My positive lookahead regex works if there is only one instance to match on a line, but not if there should be multiple matches on a line. I understand this is because (.+) captures up everything until the last positive lookahead expression is found. It'd be nice if it would capture everything until the first expression is found.
Here is my regex attempt:
##FOO\[(.*)(?=~~)~~(.*)(?=\]##)\]##
Sample input:
##FOO[abc~~hi]## ##FOO[def~~hey]##
Desired output: 2 matches, with 2 matching groups each (abc, hi) and (def, hey).
Actual output: 1 match with 2 groups (abc~~hi]## ##FOO[def, hey)
Is there a way to get the desired output?
Thanks in advance!
Use the question mark, it will match as few times as possible.
##FOO\[(.*?)(?=~~)~~(.*?)(?=\]##)\]##
This one also works but is not as strict although easier to read
##FOO\[(.*?)~~(.*?)\]##
The * operator is greedy by default, meaning it eats up as much of the string as possible while still leaving enough to match the remaining regex. You can make it not greedy by appending a ? to it. Make sure to read about the differences at the link.
You could use the String.IndexOf() method instead to find the first occurrence of your substring.

Categories