Conditional match without false force a match? - c#

I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The regex is run with the IgnorePatternWhitespace option.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplified regex and simplified input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I'm trying to understand why the conditional group doesn't match as written in the original regex.
I'm comfortable with regex and know that it can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I experimented a little more and found the following regex, which should match every input, even an empty one, but it doesn't:
(?(a)a).*
DEMO
I believe this is a bug in the .NET regex engine and have reported it to Microsoft: See here for more information

There is no error in the regex parser; the problem lies in how the . wildcard specifier is used. The . specifier consumes every character except (wait for it) the linefeed character \n. (See "Any character" in Character Classes in Regular Expressions.)
If you want your regex to work, you need to consume all characters including the linefeed, and that can be done by specifying the option Singleline. To paraphrase the documentation:
Singleline tells the parser to let . match every character, including \n.
Why does it still fail outside Singleline mode, even though the other lines are consumed? Because the final match actually leaves the current position at the \n, and the only construct left to try (as specified) is the .*, which, as mentioned, cannot consume it, so the parser stops. The $ also locks in the operation at this point.
Let me demonstrate what is happening with a tool I have created that illustrates the issue. In the tool, the upper left corner shows the example text as we see it. Below that is what the parser sees, with the \r and \n characters represented by ↵ and ¶ respectively. That pane also highlights each match in a yellow box. The middle box is the actual pattern, and the box on the right shows the match results in detail by listing the returned structures, again with the whitespace made visible.
Notice that the second match (at index 1) has world in the capture group id and ↵ in value.
I surmise your token processor isn't getting what you want in the proper groups, and because one doesn't actually see that value successfully matched the \r, it is overlooked.
Let us turn on Singleline and see what happens.
Now everything is consumed, but there is a different problem. :-)
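The effect of . and $ on input with \r\n line endings can be checked in isolation. The sketch below is not the tool described above; it assumes a simplified pattern, ^(?<value>.*)$, and a made-up two-line input (both are illustrative, not the asker's data) and prints what ends up in the value group with and without RegexOptions.Singleline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotAndLineEndings
{
    // Simplified stand-in for the asker's pattern: capture one whole line.
    private const string Pattern = @"^(?<value>.*)$";

    // Returns the first "value" capture with \r and \n made visible.
    public static string FirstValue(string input, RegexOptions options)
    {
        Match m = Regex.Match(input, Pattern, options);
        return m.Groups["value"].Value.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    public static void Main()
    {
        string input = "hello\r\n#world"; // hypothetical input with Windows line endings

        // Multiline only: '.' stops at \n and '$' matches just before \n,
        // so the \r is swept into the capture.
        Console.WriteLine(FirstValue(input, RegexOptions.Multiline));
        // prints: hello\r

        // Multiline + Singleline: '.' also matches \n, so the greedy .*
        // runs to the end of the input.
        Console.WriteLine(FirstValue(input, RegexOptions.Multiline | RegexOptions.Singleline));
        // prints: hello\r\n#world
    }
}
```

If the trailing \r is unwanted, one option is to capture [^\r\n]* instead of .* and follow it with \r?$.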

Related

ASP.NET Core RegularExpression attribute - multiple conditions

I have two regex that should be matched:
"^[a-z0-9\\!#\\$\\^&\\-\\+%\\=_\\(\\)\\{\\}\\<\\>'\";\\:/\\.,~`\\|\\\\]+$"
and
".*(g[o0]+gle).*"
The first one accepts any alphanumeric character (plus a few extras), like helloworld123. The second one should reject any string that contains the word "google" (in different forms, like gooo0gle).
Allowed:
hello
helloworld
helloworld123
Disallowed:
hellogoogle
google
...
I want to use the RegularExpression attribute to validate this string. I thought about something like:
[RegularExpression("^[a-z0-9\\!#\\$\\^&\\-\\+%\\=_\\(\\)\\{\\}\\<\\>'\";\\:/\\.,~`\\|\\\\]+$|.*(g[o0]+gle).*")]
But it's not working since the second part (.*(g[o0]+gle).*) should be NOT.
How to do it right?
Thanks.
You can place your second regex in a negative lookahead, use the first regex as the character set, and combine both into the following regex:
^(?!.*g[o0]+gle)[-a-z0-9!#$^&+%=_(){}<>'";:\/.,~`|]+$
Here, the negative lookahead (?!.*g[o0]+gle) rejects any string that contains google or any variation supported by your regex, and the character set [-a-z0-9!#$^&+%=_(){}<>'";:\/.,~`|]+ matches one or more of the allowed characters.
Also, you don't need to escape most special characters inside a character set, so I have unescaped all of them except /. Always place the hyphen - either first or last in the character set; otherwise, depending on the regex dialect, you may see weird behavior.
Regex Demo
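As a quick check of the combined pattern, here is a minimal C# sketch. The class and method names (NameFilter, IsAllowed) are made up for illustration, and the pattern is the one from this answer written as a verbatim string, where the double quote has to be doubled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameFilter
{
    // The lookahead rejects "google" variants; the character class lists the allowed characters.
    // Inside a verbatim string the double quote is written as "".
    private static readonly Regex Allowed =
        new Regex(@"^(?!.*g[o0]+gle)[-a-z0-9!#$^&+%=_(){}<>'"";:\/.,~`|]+$");

    public static bool IsAllowed(string s) => Allowed.IsMatch(s);

    public static void Main()
    {
        foreach (string s in new[] { "hello", "helloworld123", "hellogoogle", "gooo0gle" })
            Console.WriteLine($"{s}: {IsAllowed(s)}");
    }
}
```

The same pattern string should be usable as the argument of the RegularExpression attribute, since it is a single expression that already covers both conditions.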

Regex to allow periods unless it's alone

I'm trying to create a name verification regex that allows users to use names like St. Germain but I don't want names that are only a period like . which it currently accepts.
My current regex is /^[A-Za-z\ -\.\']+$/
Taken from @Mong Zhu's example, but allowing a first word without dots as well:
\w+\.?\s?\w+
Brief
Your current regex contains a likely unintended range, \ -\., which matches any character from space to dot (space and !"#$%&'()*+,-.). I'm not sure whether this is the intended behaviour; if it is, you can use the second regex below.
Code
Version 1
See regex in use here
^(?!\.+$)[a-zA-Z .'-]+$
Version 2
^(?!\.+$)[a-zA-Z -.']+$
Results
Input
username
.
Something.a
...
.Some
some. some
some.
Output
Note: Only matches are shown below
username
Something.a
.Some
some. some
some.
Explanation
^ Assert position at the start of the line
(?!\.+$) Negative lookahead ensuring what follows is not the dot character \. literally, one or more times, asserting the ending position at the end of the line
[a-zA-Z .'-]+ Any character in the set a-zA-Z .'- between one and unlimited times
$ Assert position at the end of the line
Additionally
You may want to use \p{L} instead of a-zA-Z to accept letters from other alphabets
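The results listed above can be reproduced with a short C# sketch of Version 1. The class and method names are illustrative only, and St. Germain is added from the question as an extra input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameCheck
{
    // Version 1: reject strings made only of dots, then allow letters, space, dot, quote, hyphen.
    private static readonly Regex Name = new Regex(@"^(?!\.+$)[a-zA-Z .'-]+$");

    public static bool IsValid(string s) => Name.IsMatch(s);

    public static void Main()
    {
        string[] inputs = { "username", ".", "Something.a", "...", ".Some", "some. some", "some.", "St. Germain" };
        foreach (string s in inputs)
            Console.WriteLine($"\"{s}\": {IsValid(s)}");
    }
}
```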

Why is this regex lookbehind not following the left priority in an alternation?

Say input is String1OptionalString2WhatWeWant
Another kind of input is String1WhatWeWant
So I want to match WhatWeWant part, and first part should go to prefix.
However, I can't seem to get this result.
Following regex doesn't produce desired effect
(?<=string1optionalstring2|string1)\w+
It still matches optionalstring2, which I don't want.
I assumed that it would prefer the leftmost full alternative.
I assume String1 is always present? Then:
(?:String1)(?:OptionalString2)?\w+
What happened
To understand why the lookbehind behaves in a seemingly incoherent way, remember that the regex engine goes from left to right and returns the first match it finds.
Let's look at the steps it takes to match (?<=ab|a)\w+ on abc:
the engine starts at a. There isn't anything before, so the lookbehind fails
transmission kicks in, the engine is now considering a match starting from b
the lookbehind tries the first item of the alternation (ab) which fails
... but the second item (a) matches
\w+ matches the rest of the string
The overall match is therefore bc, and the regex engine hasn't broken any of its rules in the process.
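The walkthrough above can be checked with a few lines of C#. This is only a sketch of the abc example; the class name is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindTrace
{
    // Same pattern as in the walkthrough: lookbehind with a two-way alternation.
    public static Match Find(string input) => Regex.Match(input, @"(?<=ab|a)\w+");

    public static void Main()
    {
        Match m = Find("abc");
        // The first position where the lookbehind can succeed is index 1 (after "a"),
        // so the match starts there, even though "ab" is listed first.
        Console.WriteLine($"index {m.Index}: {m.Value}");
        // prints: index 1: bc
    }
}
```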
How to fix it
If C# supported the \K escape sequence, you could just use the greediness of ? to do the work for you (demo here):
string1(?:optionalstring2)?\K\w+
However, this (sadly) isn't the case. It therefore seems that you are stuck with using a capturing group:
string1(?:optionalstring2)?(\w+)
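A minimal sketch of the capturing-group approach follows. It assumes RegexOptions.IgnoreCase, because the question's pattern is lowercase while its sample input is capitalized, and the helper name Tail is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixStrip
{
    // Consume the prefix (with its optional part) and capture only what follows.
    private static readonly Regex Pattern =
        new Regex(@"string1(?:optionalstring2)?(\w+)", RegexOptions.IgnoreCase);

    // Returns the captured tail, or an empty string when there is no match.
    public static string Tail(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(Tail("String1OptionalString2WhatWeWant")); // prints: WhatWeWant
        Console.WriteLine(Tail("String1WhatWeWant"));                // prints: WhatWeWant
    }
}
```

Because the optional group is greedy, it takes optionalstring2 whenever it is present, so the capture starts after the longest available prefix.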

Regex - Lookahead for pattern through newlines

Still learning Regex, and am having trouble getting my head wrapped around the lookahead concept. Similar data to my question here - Matching multiple lines up until a sepertor line? , say I have the following lines handed to me by the user:
0000AA.The horizontal coordinates are valid at the epoch date displayed above.
0000AA.The epoch date for horizontal control is a decimal equivalence
0000AA.of Year/Month/Day.
0000AA
[..]
So a really simple regex is ^[0-9]{4}[A-Z]{2}\.(?<noteline>.*), which would give me every line. Fantastic. :) However, I'd like a lookahead (or a condition?) that would look at the next line and tell me if that line has the code WITHOUT a '.' (i.e. if the NEXT line would match ^[0-9]{4}[A-Z]{2}[^\.]).
Trying the lookahead, I get hits on the first two lines (because the following line has '.' after the code) but not on the last.
Edit: Using the regex above, or the one offered below, gives me all lines, but I'd like to know IF a blank line (a line with the 0000AA code, but no '.' afterwards) follows. For example, when I get to the match on the line of Year/Month/Day, I'd like to know IF that line is followed by a blank line (or not). (As a high-level example, with a named group that is non-empty only in that case.)
Edit 2: I may be misusing the term 'lookahead'. Going back over .NET's regex documentation, I see something referred to as an Alternation Construct, but I'm not sure if that could be used here.
Thanks! Mike.
Apply the option RegexOptions.Multiline. It changes the meaning of ^ and $, making them match at the beginning and end of every line instead of only at the beginning and end of the entire string.
var matches = Regex.Matches(input,
@"^[0-9]{4}[A-Z]{2}\..*$(?!\n[0-9]{4}[A-Z]{2}[^.])",
RegexOptions.Multiline);
The negative lookahead has the form
find(?!suffix)
It matches find only where it is not followed by suffix. Don't escape the dot within the brackets [ ]. The brackets disable the special meaning of most characters anyway.
I also added .*$ to make the pattern match up to the end of the current line. The * is greedy, but . never matches \n unless RegexOptions.Singleline is set, so it cannot run past the current line. In Multiline mode $ stops just before the \n, which is why the lookahead starts with \n: it has to step over the line break before it can inspect the next line.
If you need only the number part, you can capture it in a group by enclosing it within parentheses.
(^[0-9]{4}[A-Z]{2})\..*$(?!\n[0-9]{4}[A-Z]{2}[^.])
You can then get the group like this
string number = match.Groups[1].Value;
Note: Group #0 represents the entire match.
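If a per-line flag is wanted rather than a filter, one possible variation (not part of the answer above) is to put a capture inside a positive lookahead and make that lookahead optional through an empty alternative. The sketch below assumes the group names noteline and lastline from the question; the class name, the Parse helper, and the sample input are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NoteLines
{
    // noteline: the text after "code." on the current line.
    // lastline: set only when the NEXT line is a bare code with no '.'.
    // The lookahead consumes nothing, so the next line stays available to later matches.
    private static readonly Regex Line = new Regex(
        @"^[0-9]{4}[A-Z]{2}\.(?<noteline>[^\r\n]*)" +
        @"(?:(?=\r?\n(?<lastline>[0-9]{4}[A-Z]{2})(?:\r?\n|\z))|)",
        RegexOptions.Multiline);

    public static List<(string Text, bool IsLast)> Parse(string input)
    {
        var result = new List<(string Text, bool IsLast)>();
        foreach (Match m in Line.Matches(input))
            result.Add((m.Groups["noteline"].Value, m.Groups["lastline"].Success));
        return result;
    }

    public static void Main()
    {
        // Hypothetical input: two note lines, a bare code line, then another note.
        string input = "0000AA.first\n0000AA.second\n0000AA\n0000BB.third";
        foreach (var (text, isLast) in Parse(input))
            Console.WriteLine($"{text} (followed by bare code: {isLast})");
    }
}
```

The empty alternative after | keeps the match alive when the lookahead fails, so every note line is returned and only the lastline group differs.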
After doing a lot of research, with plenty of hits and misses, I'm now fairly certain that it can't be done. Or rather, it CAN be done, but it would be prohibitively difficult, and it is easier to do in code.
To recap, I was looking at a multiline string (document) where every line is preceded by a 6-character code. Some lines, the ones I'm interested in, have a '.' after the 6-character code, followed by open text. I was hoping there would be a way to get each line in a group, along with a flag letting me know if the next line has no free-text entry (no '.' after the 6-character code). I.e. a two-line data entry would give me two matches on the document. The first match would have the line's text in the group called 'notetext', and the group 'lastline' would be empty. The second match would have the second part of the entered note in 'notetext', and the group 'lastline' would have something (anything, the content wouldn't matter).
From what I understand, lookaheads are zero-width assertions, so if one matches, the returnable value is still empty. Without using a lookahead, the match for 'lastline' would consume the next line's code, making 'notetext' skip that line (giving me every other line of text). So I would need some back-reference to revert to.
At this point it is easier (code-wise) to simply get all the lines and add up text until I reach the end of a note. (I loop over the entire document, which can't be more than 200 lines, instead of looping through the regex-matched lines; the ease of reading the code for future modifications outweighs any slight speed advantage the regex could get me.)
Thanks guys -
-Mike.
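The line-by-line approach described in this answer might look roughly like the following. This is a sketch under stated assumptions, not the author's actual code: every line is assumed to start with a 6-character code, a '.' right after the code marks a note line, and a line with a bare code ends the current note. The names NoteCollector and CollectNotes are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class NoteCollector
{
    private const int CodeLength = 6; // assumption: 4 digits + 2 letters, as in the sample

    public static List<string> CollectNotes(string document)
    {
        var notes = new List<string>();
        var current = new StringBuilder();

        foreach (string raw in document.Split('\n'))
        {
            string line = raw.TrimEnd('\r');
            if (line.Length < CodeLength) continue; // too short to carry a code

            if (line.Length > CodeLength && line[CodeLength] == '.')
            {
                // Note line: append the free text after "code."
                if (current.Length > 0) current.Append(' ');
                current.Append(line.Substring(CodeLength + 1));
            }
            else if (current.Length > 0)
            {
                // Bare code line: the note collected so far is complete.
                notes.Add(current.ToString());
                current.Clear();
            }
        }

        if (current.Length > 0) notes.Add(current.ToString());
        return notes;
    }

    public static void Main()
    {
        string doc = "0000AA.The epoch date\n0000AA.of Year/Month/Day.\n0000AA\n0000AB.Next note";
        foreach (string note in CollectNotes(doc))
            Console.WriteLine(note);
    }
}
```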

What Regex to capture Multiline Text Between Two Phrases?

I need to capture form data text from an email form by capturing what exists between elements.
The text I get in the body of the email is multiline with a lot of whitespace between keywords. I don't care about the whitespace; I'll trim it out, but I have to be able to capture what occurs between two form field descriptors.
The key phrases are really clear and unique, but I can't get the Regex to work:
Sample data:
Loan Number:
123456789
Address:
101 Main Street
My City, WA
99101
Servicemember Name:
Joe Smith
Servicemember Phone Number:
423-283-5000
Complaint Description:
He has a complaint
Associate Information
Associate Name:
Some Dude
Phone Login:
654312
Complaint Date:
1/10/2012
Regex (to capture the loan number, for example):
^Loan Number:(.*?)Address:.$
What am I missing?
EDIT: Also, in addition to capturing data between the various form labels, I need to capture the data between the last label and the end of the file. After reading the responses here, I've been able to capture the data between form labels, but not the last piece of data, the Complaint Date.
What am I missing?
You'll need to drop the anchors (^ and $) and enable dotall mode, which allows the . to match newlines. In C# that is RegexOptions.Singleline (inline modifier s); the m modifier is Multiline, which only changes how ^ and $ behave. Check the docs.
Why is this so difficult?
Regular Expressions are a very powerful tool. With great power comes great responsibility. That is, no one said it would be easy...
UPDATE
After reviewing the question more closely: you have solid anchor points and a very specific capture (i.e. the loan number digits). The following regular expression should work, and without the modifier mentioned above.
Loan Number\s+(\d+)\s+Escalation Required
This one works for me:
Loan Number(?<Number>(.*\n)+)Escalation Required
The named group Number holds the result.
Your main problem is that you aren't specifying Multiline mode. Without that, ^ only matches the very beginning of the text and $ only matches the very end. Also, the (.*?) needs to match the line separators before and after the loan number in addition to the number itself, and it can't do that unless you specify Singleline mode.
There are two ways you can specify these matching modes. One is by passing the appropriate RegexOptions argument when you create the Regex:
Regex r = new Regex(@"^Loan Number(.*?)Escalation Required.$",
RegexOptions.Multiline | RegexOptions.Singleline);
The other is by adding "inline" modifiers to the regex itself:
Regex r = new Regex(@"(?ms)^Loan Number(.*?)Escalation Required.$");
But I recommend you do this instead:
Regex r = new Regex(@"(?m)^Loan Number\s*(\d+)\s*Escalation Required(?=\z|\r\n|[\r\n])");
About \s*(\d+)\s*:
In Singleline mode (known as DOTALL mode in some flavors), there's nothing to stop .*? from matching all the way to the end of the document, however long it happens to be. It will try to consume as little as possible thanks to the non-greedy modifier (?), but in cases where no match is possible, the regex engine will have to do a lot of pointless work before it admits defeat. I practically never use Singleline mode for that reason.
Singleline mode or not, don't use .* or .*? without at least considering something more specific. In this case, \s*(\d+)\s* has the advantage that it allows you to capture the loan number only. You don't have to trim whitespace or perform any other operations to extract the part that interests you.
About (?=\z|\r\n|[\r\n]):
According to the Unicode standard, $ in multiline mode should match before a carriage-return (\r) or before a linefeed (\n) if it's not preceded by \r--it should never match between \r and \n. There are several other single-character line separators as well, but the .NET regex flavor doesn't recognize anything but \n. Your source text (an email message) uses \r\n to separate lines, which is why you had to add that dot before the anchor: .$.
But what if you don't know which kind of line separators to expect? Realistically, \n or \r\n are by far the most common choices, but even if you disregard the others, .$ is going to fail half the time. (?=\z|\r\n|[\r\n]) is still a hack, but it's a much more portable hack. ;) It even handles \r (carriage-return only), the line separator associated with pre-OSX Macintosh systems.
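To see the difference between $ and the lookahead on real line endings, here is a small sketch. The sample body is invented for illustration (the labels follow the regex in this answer), and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoanNumber
{
    // Recommended pattern from above: digits between the two labels,
    // with a lookahead that accepts end of input, \r\n, \r or \n.
    private static readonly Regex Pattern = new Regex(
        @"(?m)^Loan Number\s*(\d+)\s*Escalation Required(?=\z|\r\n|[\r\n])");

    // Returns the loan number, or an empty string when nothing matches.
    public static string Extract(string body)
    {
        Match m = Pattern.Match(body);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string crlf = "Loan Number\r\n  123456789\r\n\r\nEscalation Required\r\nNo\r\n";
        string lf = crlf.Replace("\r\n", "\n");

        Console.WriteLine(Extract(crlf)); // prints: 123456789
        Console.WriteLine(Extract(lf));   // prints: 123456789

        // A bare $ only matches before \n, so it fails when a \r sits in between.
        Console.WriteLine(Regex.IsMatch(crlf, @"(?m)Escalation Required$")); // prints: False
        Console.WriteLine(Regex.IsMatch(lf, @"(?m)Escalation Required$"));   // prints: True
    }
}
```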
