Matching multiple heading styles using regex

I'm trying to use regex to capture section headings. The regex below captures "4.1 General", but if I add a newline to the end of it, \n([\d\.]+ ?\w+)\n, it no longer captures that line. Why? Is the line not followed by a newline, or am I missing something?
Here's my example for reference
\n([\d\.]+ ?\w+)
Input
3.6.10
POLLUTION DEGREE 4
continuous conductivity occurs due to conductive dust, rain or other wet conditions
3.6.11
CLEARANCE
shortest distance in air between two conductive parts
3.6.12
CREEPAGE DISTANCE
shortest distance along the surface of a solid insulating material between two conductive
parts
4 Tests
4.1 General
Tests in this standard are TYPE TESTS to be carried out on samples of equipment or parts.
\n([\d\.]+ ?\w+)\n? doesn't seem to work either.

It is a classical case of overlapping matches. The previous match contains \n4 Tests\n and that last \n is already consumed, thus preventing the next match.
I see you want to match texts that are whole lines of the text, so, it makes more sense to use ^ and $ anchors with the RegexOptions.Multiline option:
@"(?m)^([\d.]+ ?\w+)\r?$"
See the .NET regex online demo
Note that $ in a .NET regex (with the Multiline option) matches only before \n or at the very end of the string, never before \r. Since Windows line endings are CRLF, an optional CR, \r?, is required before $.
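As a rough illustration of the two points above, here is a minimal C# sketch. The sample text is a shortened, hypothetical version of the input, and the class and method names are made up for the example:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeadingDemo
{
    // Collects capture group 1 from every match of the pattern.
    public static string[] Capture(string text, string pattern) =>
        Regex.Matches(text, pattern)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToArray();

    public static void Main()
    {
        // Shortened, hypothetical sample with LF line endings, plus a CRLF copy.
        string lf = "parts\n4 Tests\n4.1 General\nTests in this standard\n";
        string crlf = lf.Replace("\n", "\r\n");

        // The trailing \n is consumed by the first match, so the next
        // heading has no leading \n left to start on.
        Console.WriteLine(string.Join(" | ", Capture(lf, @"\n([\d\.]+ ?\w+)\n")));
        // prints: 4 Tests

        // Anchors assert positions and consume no line break.
        string anchored = @"(?m)^([\d.]+ ?\w+)\r?$";
        Console.WriteLine(string.Join(" | ", Capture(lf, anchored)));
        // prints: 4 Tests | 4.1 General
        Console.WriteLine(string.Join(" | ", Capture(crlf, anchored)));
        // prints: 4 Tests | 4.1 General
    }
}
```

Because \r? sits outside the capturing group, the captured heading never includes the carriage return.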

Have you considered that the new line may not be a single character?
\n([0-9\.]+ ?\w+)(\n|\r)
Using Expresso the above regex has 4 matches from your sample, the last one is
[LF]4.1 General[CR]
where [LF] is \n and [CR] is \r.
Keep in mind [CR], [LF] and [CRLF] are all possible designations for end of line.

Related

Using RegEx, what's the best way to capture groups of digits, ignoring any whitespace in them

Given the following string...
ABC DEF GHI: 319 022 6543 QRS : 531 450
I'm trying to extract all ranges that start/end with a digit, and which may contain whitespace, but I want that whitespace itself removed.
For instance, the above should yield two results (since there are two 'ranges' that match what I am looking for)...
3190226543
531450
My first thought was this, but this matches the spaces between the letters...
([\d\s])
Then I tried this, but it didn't seem to have any effect...
([\d+\s*])
This one comes close, but it's grabbing the trailing spaces too. Also, it grabs the whitespace but doesn't remove it.
(\d[\d\s]+)
If it's impossible to remove the spaces in a single statement, I can always post-process the groups if I can properly extract them. That most recent statement comes close, but how do I say it doesn't end with whitespace, but only a digit?
So what's the missing expression? Also, since sometimes people just post an answer, it would be helpful to explain out the RegEx too to help others figure out how to do this. I for one would love not just the solution, but an explanation. :)
Note: I know there can be some variations between RegEx on different platforms so that's fine if those differences are left up to the reader. I'm more interested in understanding the basic mechanics of the regex itself more so than the syntax. That said, if it helps, I'm using both Swift and C#.

You cannot get rid of whitespace from inside the match value within a single match operation. You will need to remove spaces as a post-processing step.
To match a string that starts with a digit, optionally followed by any number of digits or whitespace characters and then a final digit, you can use
\d(?:[\d\s]*\d)?
Details:
\d - a digit
(?:[\d\s]*\d)? - an optional non-capturing group matching
  [\d\s]* - zero or more whitespaces / digits
  \d - a digit.
See the regex demo.
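A minimal C# sketch of the two-step approach described above: match the digit runs first, then strip the whitespace from each match. The class and method names are illustrative only:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DigitRuns
{
    // Step 1: match runs that start and end with a digit.
    // Step 2: remove the inner whitespace from each match.
    public static string[] Extract(string input) =>
        Regex.Matches(input, @"\d(?:[\d\s]*\d)?")
             .Cast<Match>()
             .Select(m => Regex.Replace(m.Value, @"\s+", ""))
             .ToArray();

    public static void Main()
    {
        string input = "ABC DEF GHI: 319 022 6543 QRS : 531 450";
        foreach (string run in Extract(input))
        {
            Console.WriteLine(run);
        }
        // prints:
        // 3190226543
        // 531450
    }
}
```

The trailing \d inside the optional group is what forces the match to backtrack off any trailing whitespace.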

Conditional match without false force a match?

I'm using the following regex in C# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The only option set is IgnorePatternWhitespace.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplified regex and simplified input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I'm trying to understand why the conditional group doesn't match as written in the original regex.
I'm comfortable with regex and know that it can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I experimented a little more. I found the following regex, which should match every input, even an empty one, but it doesn't:
(?(a)a).*
DEMO
I believe this is a bug in the .NET regex engine and have reported it to Microsoft (see here for more information).

There is no error in the regex parser; the problem lies in the use of the . wildcard specifier. The . specifier will consume all characters, wait for it, except the linefeed character \n. (See the "any character" section of Character Classes in Regular Expressions.)
If you want your regex to work, you need to consume all characters including the linefeed, which can be done by specifying the Singleline option. To paraphrase the documentation:
Singleline tells the parser to let . match all characters, including \n.
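A small C# sketch of what the option changes, using a made-up two-line string with a Windows line ending (the class and method names are illustrative only):

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Returns whatever .* consumes from the start of the text.
    public static string Consume(string text, RegexOptions options) =>
        Regex.Match(text, ".*", options).Value;

    public static void Main()
    {
        string text = "first\r\nsecond";

        // Default: . stops at \n but still consumes the \r before it.
        string plain = Consume(text, RegexOptions.None);
        Console.WriteLine(plain.Length);   // prints: 6 ("first" plus \r)

        // Singleline: . also matches \n, so everything is consumed.
        string single = Consume(text, RegexOptions.Singleline);
        Console.WriteLine(single.Length);  // prints: 13
    }
}
```

Note that in the default mode the carriage return ends up inside the match, which is easy to overlook because it is invisible when printed.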
Why does it still fail outside Singleline mode, when the other lines are consumed? Because the final match leaves the current position at the \n, and the only construct left to apply (as the pattern is written) is .*, which, as mentioned, cannot consume it, so the parser stops. The $ also locks in the operation at this point.
Let me demonstrate what is happening with a tool I have created which illustrates the issue. In the tool, the upper left corner is what we see of the example text. Below that is what the parser sees, with \r\n characters represented by ↵¶ respectively. Included in that pane is what happens to be matched at the time, in yellow boxes enclosing the match. The middle box is the actual pattern, and the final right side box shows the match results in detail by listing out the return structures and also showing the white space as mentioned.
Notice the second match (as index 1) has world in group capture id and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singleline and see what happens.
Now everything is consumed, but there is a different problem. :-)

Regex to allow periods unless it's alone

I'm trying to create a name verification regex that allows users to use names like St. Germain but I don't want names that are only a period like . which it currently accepts.
My current regex is /^[A-Za-z\ -\.\']+$/

Taken from @Mong Zhu's example but allowing first word without dots as well:
\w+\.?\s?\w+

Brief
Your current regex has a potentially unwanted bug: \ -\. is a range, which will match any character from space to dot (for example !, #, $ and +). I'm not sure if this is the intended behaviour; if so, you can use the second regex below.
Code
Version 1
See regex in use here
^(?!\.+$)[a-zA-Z .'-]+$
Version 2
^(?!\.+$)[a-zA-Z -.']+$
Results
Input
username
.
Something.a
...
.Some
some. some
some.
Output
Note: Only matches are shown below
username
Something.a
.Some
some. some
some.
Explanation
^ Assert position at the start of the line
(?!\.+$) Negative lookahead ensuring what follows is not the dot character \. literally, one or more times, asserting the ending position at the end of the line
[a-zA-Z .'-]+ Any character in the set a-zA-Z .'- between one and unlimited times
$ Assert position at the end of the line
Additionally
You may want to use \p{L} instead of a-zA-Z to accept foreign characters
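For reference, Version 1 could be wrapped in a small C# check like the sketch below. The class and method names are hypothetical; the pattern is the one given above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameCheck
{
    // Version 1: letters, space, dot, apostrophe and hyphen,
    // rejecting input that consists only of dots.
    private static readonly Regex NameRegex =
        new Regex(@"^(?!\.+$)[a-zA-Z .'-]+$");

    public static bool IsValid(string name) => NameRegex.IsMatch(name);

    public static void Main()
    {
        string[] samples = { "St. Germain", "username", ".", "...", ".Some", "some." };
        foreach (string s in samples)
        {
            Console.WriteLine($"{s} -> {IsValid(s)}");
        }
        // "." and "..." print False; the other samples print True.
    }
}
```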

Regex to match `xyz` in `abc|qw|xzy mno`

It's driving me nuts.
The input strings are:
abc|qw|xzy mno
abc||xzy mno
abc|qw|xzy
abc|qw|
I need to extract the first word (if any) after the 2nd vertical bar: xzy in the cases above, but in general words in multiple (natural) languages.
Also, all lines must be considered as a block, so single-line mode does not apply; in other words, the EOL is the break to account for.
Thank you, guys.

You can use the following regexp with the RegexOptions.Multiline option.
(?<=^(?:[^|]*\|){2})\w+
(?<= begins a positive lookbehind, so this matches a word that must be preceded by the beginning of the line followed by two pipe-delimited sequences.
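A minimal C# sketch of this pattern applied to the four sample lines, joined with \n (the class and method names are illustrative only). It relies on .NET allowing variable-length lookbehind:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipeWord
{
    // Matches the first word directly after the second | on each line.
    public static string[] Find(string text) =>
        Regex.Matches(text, @"(?<=^(?:[^|]*\|){2})\w+", RegexOptions.Multiline)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        string text = "abc|qw|xzy mno\nabc||xzy mno\nabc|qw|xzy\nabc|qw|";
        Console.WriteLine(string.Join(",", Find(text)));
        // prints: xzy,xzy,xzy (the last line has no word after the second bar)
    }
}
```

One caveat: [^|]* can also match line breaks, so on input where some lines contain fewer than two bars the lookbehind may reach back across lines. Using [^|\r\n]* instead would keep it within a single line.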

What Regex to capture Multiline Text Between Two Phrases?

I need to capture form data text from an email form by capturing what exists between elements.
The text I get in the body of the email is multiline with a lot of whitespace between keywords. I don't care about the whitespace; I'll trim it out, but I have to be able to capture what occurs between two form field descriptors.
The key phrases are really clear and unique, but I can't get the Regex to work:
Sample data:
Loan Number:
123456789
Address:
101 Main Street
My City, WA
99101
Servicemember Name:
Joe Smith
Servicemember Phone Number:
423-283-5000
Complaint Description:
He has a complaint
Associate Information
Associate Name:
Some Dude
Phone Login:
654312
Complaint Date:
1/10/2012
Regex (to capture the loan number, for example):
^Loan Number:(.*?)Address:.$
What am I missing?
EDIT: Also, in addition to capturing data between the various form labels, I need to capture the data between the last label and the end of the file. After reading the responses here, I've been able to capture the data between form labels, but not the last piece of data, the Complaint Date.

What am I missing?
You'll need to drop the anchors (^ and $) and enable dotall mode, which allows the . to match newlines. In C# that is RegexOptions.Singleline (the inline s modifier), not the m modifier, which only changes how ^ and $ behave. Check the docs.
Why is this so difficult?
Regular Expressions are a very powerful tool. With great power comes great responsibility. That is, no one said it would be easy...
UPDATE
After reviewing the question more closely, you have solid anchor points and a very specific capture (i.e. the loan number digits). The following regular expression should work without the modifier mentioned above.
Loan Number\s+(\d+)\s+Escalation Required

This one works for me:
Loan Number(?<Number>(.*\n)+)Escalation Required
Where Number named group is the result.

Your main problem is that you aren't specifying Multiline mode. Without that, ^ only matches the very beginning of the text and $ only matches the very end. Also, the (.*?) needs to match the line separators before and after the loan number in addition to the number itself, and it can't do that unless you specify Singleline mode.
There are two ways you can specify these matching modes. One is by passing the appropriate RegexOptions argument when you create the Regex:
Regex r = new Regex(@"^Loan Number(.*?)Escalation Required.$",
    RegexOptions.Multiline | RegexOptions.Singleline);
The other is by adding "inline" modifiers to the regex itself:
Regex r = new Regex(@"(?ms)^Loan Number(.*?)Escalation Required.$");
But I recommend you do this instead:
Regex r = new Regex(@"(?m)^Loan Number\s*(\d+)\s*Escalation Required(?=\z|\r\n|[\r\n])");
About \s*(\d+)\s*:
In Singleline mode (known as DOTALL mode in some flavors), there's nothing to stop .*? from matching all the way to the end of the document, however long it happens to be. It will try to consume as little as possible thanks to the non-greedy modifier (?), but in cases where no match is possible, the regex engine will have to do a lot of pointless work before it admits defeat. I practically never use Singleline mode for that reason.
Singleline mode or not, don't use .* or .*? without at least considering something more specific. In this case, \s*(\d+)\s* has the advantage that it allows you to capture the loan number only. You don't have to trim whitespace or perform any other operations to extract the part that interests you.
About (?=\z|\r\n|[\r\n]):
According to the Unicode standard, $ in multiline mode should match before a carriage-return (\r) or before a linefeed (\n) if it's not preceded by \r--it should never match between \r and \n. There are several other single-character line separators as well, but the .NET regex flavor doesn't recognize anything but \n. Your source text (an email message) uses \r\n to separate lines, which is why you had to add that dot before the anchor: .$.
But what if you don't know which kind of line separators to expect? Realistically, \n or \r\n are by far the most common choices, but even if you disregard the others, .$ is going to fail half the time. (?=\z|\r\n|[\r\n]) is still a hack, but it's a much more portable hack. ;) It even handles \r (carriage-return only), the line separator associated with pre-OSX Macintosh systems.
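To see the recommended pattern cope with different line endings, here is a minimal C# sketch. The input strings are hypothetical stand-ins for the email body, and the class and method names are made up for the example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoanNumberDemo
{
    private static readonly Regex LoanRegex = new Regex(
        @"(?m)^Loan Number\s*(\d+)\s*Escalation Required(?=\z|\r\n|[\r\n])");

    // Returns the loan number digits, or null when nothing matches.
    public static string Extract(string body)
    {
        Match m = LoanRegex.Match(body);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical bodies: CRLF, LF only, and no trailing line break.
        string crlf = "Header\r\nLoan Number\r\n   123456789\r\nEscalation Required\r\nNo\r\n";
        string lf = crlf.Replace("\r\n", "\n");
        string atEnd = "Loan Number 42 Escalation Required";

        Console.WriteLine(Extract(crlf));   // prints: 123456789
        Console.WriteLine(Extract(lf));     // prints: 123456789
        Console.WriteLine(Extract(atEnd));  // prints: 42
    }
}
```

Since the group captures only \d+, the result needs no trimming.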
