What Regex to capture Multiline Text Between Two Phrases?

I need to capture form data text from an email form by capturing what exists between elements.
The text I get in the body of the email is multiline with a lot of whitespace between keywords. I don't care about the whitespace; I'll trim it out, but I have to be able to capture what occurs between two form field descriptors.
The key phrases are really clear and unique, but I can't get the Regex to work:
Sample data:
Loan Number:
123456789
Address:
101 Main Street
My City, WA
99101
Servicemember Name:
Joe Smith
Servicemember Phone Number:
423-283-5000
Complaint Description:
He has a complaint
Associate Information
Associate Name:
Some Dude
Phone Login:
654312
Complaint Date:
1/10/2012
Regex (to capture the loan number, for example):
^Loan Number:(.*?)Address:.$
What am I missing?
EDIT: Also, in addition to capturing data between the various form labels, I need to capture the data between the last label and the end of the file. After reading the responses here, I've been able to capture the data between form labels, but not the last piece of data, the Complaint Date.

What am I missing?
You'll need to drop the anchors (^ and $) and enable dotall mode, which allows . to match newlines. In C# that is RegexOptions.Singleline, or the inline (?s) modifier; the m modifier (Multiline) only changes where ^ and $ match. Check the docs.
Why is this so difficult?
Regular Expressions are a very powerful tool. With great power comes great responsibility. That is, no one said it would be easy...
UPDATE
After reviewing the question more closely, you have solid anchor points and a very specific capture (i.e., the loan number digits). The following regular expression should work without the modifier mentioned above.
Loan Number\s+(\d+)\s+Escalation Required

This one works for me:
Loan Number(?<Number>(.*\n)+)Escalation Required
The named group Number holds the result.

Your main problem is that you aren't specifying Multiline mode. Without that, ^ only matches the very beginning of the text and $ only matches the very end. Also, the (.*?) needs to match the line separators before and after the loan number in addition to the number itself, and it can't do that unless you specify Singleline mode.
There are two ways you can specify these matching modes. One is by passing the appropriate RegexOptions argument when you create the Regex:
Regex r = new Regex(@"^Loan Number(.*?)Escalation Required.$",
    RegexOptions.Multiline | RegexOptions.Singleline);
The other is by adding "inline" modifiers to the regex itself:
Regex r = new Regex(@"(?ms)^Loan Number(.*?)Escalation Required.$");
But I recommend you do this instead:
Regex r = new Regex(@"(?m)^Loan Number\s*(\d+)\s*Escalation Required(?=\z|\r\n|[\r\n])");
About \s*(\d+)\s*:
In Singleline mode (known as DOTALL mode in some flavors), there's nothing to stop .*? from matching all the way to the end of the document, however long it happens to be. It will try to consume as little as possible thanks to the non-greedy modifier (?), but in cases where no match is possible, the regex engine will have to do a lot of pointless work before it admits defeat. I practically never use Singleline mode for that reason.
Singleline mode or not, don't use .* or .*? without at least considering something more specific. In this case, \s*(\d+)\s* has the advantage that it allows you to capture the loan number only. You don't have to trim whitespace or perform any other operations to extract the part that interests you.
About (?=\z|\r\n|[\r\n]):
According to the Unicode standard, $ in multiline mode should match before a carriage-return (\r) or before a linefeed (\n) if it's not preceded by \r--it should never match between \r and \n. There are several other single-character line separators as well, but the .NET regex flavor doesn't recognize anything but \n. Your source text (an email message) uses \r\n to separate lines, which is why you had to add that dot before the anchor: .$.
But what if you don't know which kind of line separators to expect? Realistically, \n or \r\n are by far the most common choices, but even if you disregard the others, .$ is going to fail half the time. (?=\z|\r\n|[\r\n]) is still a hack, but it's a much more portable hack. ;) It even handles \r (carriage return only), the line separator associated with pre-OS X Macintosh systems.
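The advice in these answers, together with the question's EDIT about the last field, can be tried out with a small helper. This is only a sketch under assumptions: the label strings come from the question's sample data, the helper name Extract and its "empty end label means end of input" convention are made up for illustration, and it relies on Singleline mode with a lazy (.*?) rather than on the digit-only pattern recommended above.

```csharp
// C# 9+ top-level program. Extract is a hypothetical helper, not part of any answer above.
using System;
using System.Text.RegularExpressions;

// Sample text shaped like the question's email body (CRLF line endings assumed).
string body = "Loan Number:\r\n123456789\r\nAddress:\r\n101 Main Street\r\nMy City, WA\r\n99101\r\n"
            + "Servicemember Name:\r\nJoe Smith\r\nComplaint Date:\r\n  1/10/2012  \r\n";

Console.WriteLine(Extract(body, "Loan Number:", "Address:"));
Console.WriteLine(Extract(body, "Complaint Date:", ""));

// Returns the text between startLabel and endLabel without the surrounding whitespace.
// An empty endLabel means "capture up to the end of the input" (\z).
// Returns an empty string when nothing matches.
static string Extract(string text, string startLabel, string endLabel)
{
    string end = endLabel.Length == 0 ? @"\z" : Regex.Escape(endLabel);
    string pattern = Regex.Escape(startLabel) + @"\s*(.*?)\s*" + end;
    // Singleline lets . cross line breaks; no ^ or $ anchors are involved.
    Match m = Regex.Match(text, pattern, RegexOptions.Singleline);
    return m.Success ? m.Groups[1].Value : "";
}
```

Because \s* sits on both sides of the lazy group, the captured value needs no trimming, and multi-line values such as the address are kept with their inner line breaks intact.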

Related

Conditional match without false force a match?

I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The regex is run with the IgnorePatternWhitespace option.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplified regex and simplified input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I'm trying to understand why the conditional group doesn't match as stated in the original regex.
I'm comfortable with regex and know that the regex can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I tested around this a little. I found the following regex, which should match every input, even an empty one, but it doesn't:
(?(a)a).*
DEMO
I'm of the opinion that this is a bug in the .NET regex engine, and I reported it to Microsoft: See here for more information

There is no error in the regex parser; the issue is in how the . wildcard is being used. The . specifier consumes every character except, wait for it, the linefeed character \n. (See the "any character" section of Character Classes in Regular Expressions.)
If you want your regex to work, you need to consume all characters including the linefeed, and that can be done by specifying the Singleline option. To paraphrase the documentation:
Singleline tells the parser to let . match all characters, including \n.
Why does it still fail when not in Singleline mode, given that the other lines are consumed? Because the final match leaves the current position at the \n, and the only construct available there (as the pattern is written) is .*, which, as mentioned, cannot consume it, so the parser stops. The $ also locks in the operation at this point.
Let me demonstrate what is happening with a tool I have created that illustrates the issue. In the tool, the upper left corner shows the example text as we see it. Below that is what the parser sees, with the \r\n characters represented by ↵¶ respectively. That pane also shows what is matched at the time, in yellow boxes enclosing each match. The middle box is the actual pattern, and the box on the right shows the match results in detail by listing out the returned structures, again with the white space made visible.
Notice the second match (as index 1) has world in group capture id and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups, and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singleline and see what happens.
Now everything is consumed, but there is a different problem. :-)
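The behaviour of . that this answer relies on can be checked directly. The snippet below is a minimal sketch with made-up input strings; it only demonstrates how the dot treats \n and \r in .NET and does not attempt to reproduce the conditional-group pattern from the question.

```csharp
// C# 9+ top-level program; the inputs are illustrative only.
using System;
using System.Text.RegularExpressions;

// By default . matches any character except \n (it does match \r).
bool dotMatchesLf = Regex.IsMatch("a\nb", "a.b");
bool dotMatchesCr = Regex.IsMatch("a\rb", "a.b");

// With Singleline (inline form: (?s)) the dot matches \n as well.
bool dotMatchesLfSingleline = Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline);

// On CRLF input, a (.*) capture therefore stops in front of \n but keeps the \r.
string captured = Regex.Match("#world\r\n[xxx]", "#(.*)").Groups[1].Value;
bool capturedEndsWithCr = captured.EndsWith("\r");

Console.WriteLine(dotMatchesLf);
Console.WriteLine(dotMatchesCr);
Console.WriteLine(dotMatchesLfSingleline);
Console.WriteLine(capturedEndsWithCr);
```

The stray \r at the end of a capture is easy to miss when results are printed, which is why showing whitespace explicitly, as the tool described above does, helps when debugging CRLF input.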

Matching multiple heading styles using regex

I'm trying to use regex to capture section headings, but why is it that I am able to capture "4.1 General" with this, however if I add a newline to the end of the regex \n([\d\.]+ ?\w+)\n it no longer captures that line? Is it not followed by a newline or am I missing something?
Here's my example for reference
\n([\d\.]+ ?\w+)
Input
3.6.10
POLLUTION DEGREE 4
continuous conductivity occurs due to conductive dust, rain or other wet conditions
3.6.11
CLEARANCE
shortest distance in air between two conductive parts
3.6.12
CREEPAGE DISTANCE
shortest distance along the surface of a solid insulating material between two conductive
parts
4 Tests
4.1 General
Tests in this standard are TYPE TESTS to be carried out on samples of equipment or parts.
\n([\d\.]+ ?\w+)\n? doesn't seem to work either.

It is a classical case of overlapping matches. The previous match contains \n4 Tests\n and that last \n is already consumed, thus preventing the next match.
I see you want to match texts that are whole lines of the text, so, it makes more sense to use ^ and $ anchors with the RegexOptions.Multiline option:
@"(?m)^([\d.]+ ?\w+)\r?$"
See the .NET regex online demo
Note that $ in a .NET regex matches only before \n and since Windows line endings are CRLF, it is required to use an optional CR before $, \r?.
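To see the difference between consuming the newlines and anchoring on them, here is a minimal sketch. The input is a shortened version of the question's text, and the helper name Headings is hypothetical.

```csharp
// C# 9+ top-level program; Headings is a hypothetical helper for illustration.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string lf = "parts\n4 Tests\n4.1 General\nTests in this standard are TYPE TESTS.\n";
string crlf = lf.Replace("\n", "\r\n");

// Consuming \n on both sides: the \n after "4 Tests" is used up by the first match,
// so the next heading no longer has a leading \n to match against.
List<string> consuming = Headings(lf, @"\n([\d.]+ ?\w+)\n");

// Anchors consume nothing, so adjacent heading lines both match;
// the optional \r absorbs the CR of a CRLF line ending.
List<string> anchoredLf = Headings(lf, @"(?m)^([\d.]+ ?\w+)\r?$");
List<string> anchoredCrlf = Headings(crlf, @"(?m)^([\d.]+ ?\w+)\r?$");

Console.WriteLine(string.Join(" | ", consuming));
Console.WriteLine(string.Join(" | ", anchoredLf));
Console.WriteLine(string.Join(" | ", anchoredCrlf));

// Collects the first capture group of every match.
static List<string> Headings(string text, string pattern)
{
    var result = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
        result.Add(m.Groups[1].Value);
    return result;
}
```

Since \r sits outside the capture group, the captured headings come out the same for LF and CRLF input.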

Have you considered that the new line may not be a single character?
\n([0-9\.]+ ?\w+)(\n|\r)
Using Expresso the above regex has 4 matches from your sample, the last one is
[LF]4.1 General[CR]
where [LF] is \n and [CR] is \r.
Keep in mind [CR], [LF] and [CRLF] are all possible designations for end of line.

Regex to allow periods unless it's alone

I'm trying to create a name verification regex that allows users to use names like St. Germain but I don't want names that are only a period like . which it currently accepts.
my current regex is /^[A-Za-z\ -\.\']+$/

Taken from @Mong Zhu's example but allowing first word without dots as well:
\w+\.?\s?\w+

Brief
Your current regex contains a probably unintended range, \ -\. (space to dot), which matches any character from space through dot, including ! # $ % & ( ) * + and the comma. I'm not sure whether this is the intended behaviour; if it is, you can use the second regex below.
Code
Version 1
See regex in use here
^(?!\.+$)[a-zA-Z .'-]+$
Version 2
^(?!\.+$)[a-zA-Z -.']+$
Results
Input
username
.
Something.a
...
.Some
some. some
some.
Output
Note: Only matches are shown below
username
Something.a
.Some
some. some
some.
Explanation
^ Assert position at the start of the line
(?!\.+$) Negative lookahead ensuring what follows is not the dot character \. literally, one or more times, asserting the ending position at the end of the line
[a-zA-Z .'-]+ Any character in the set a-zA-Z .'- between one and unlimited times
$ Assert position at the end of the line
Additionally
You may want to use \p{L} instead of a-zA-Z to accept letters from other alphabets.
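Since the question is tagged C#, Version 1 can be exercised against the inputs listed above with a short sketch. The helper name IsValidName is hypothetical, and "St. Germain" is added from the question as an extra sample.

```csharp
// C# 9+ top-level program; IsValidName is a hypothetical helper wrapping Version 1.
using System;
using System.Text.RegularExpressions;

string[] samples = { "username", ".", "Something.a", "...", ".Some", "some. some", "some.", "St. Germain" };
foreach (string s in samples)
    Console.WriteLine(s + " -> " + IsValidName(s));

// The negative lookahead rejects input made up only of dots;
// the character class then restricts the remaining input to letters, space, dot, apostrophe and hyphen.
static bool IsValidName(string name) =>
    Regex.IsMatch(name, @"^(?!\.+$)[a-zA-Z .'-]+$");
```

Putting the hyphen last in the character class keeps it literal, which avoids the accidental range discussed in the Brief.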

Noncapturing along with capturing match

I am trying to capture the subdomain from huge lists of domain names. For example, I want to capture "funstuff" from "funstuff.mysite.com". I do not want to capture ".mysite.com" in the match. These occurrences are in a sea of text, so I cannot depend on them being at the start of a line. I know the subdomain will not include any special characters or numbers. So what I have is:
[a-z]{2,10}(?=\.mysite\.com)
The problem is this will work only if the subdomain is NOT preceded by a number or special character. For example, "asdfbasdasdfdfunstuff.mysite.com" will return "fdfunstuff" but "asdfasf23/funstuff.mysite.com" won't make a match.
I can not depend on there being a special character before the subdomain, like a "/" as in "http://funstuff.mysite.com" so that can not be used as part of the condition.
It is OK if the capture gets erroneous text before the subdomain, although 99% of the time it will be preceded by something other than a lowercase letter. I have tried,
(?<=[^a-z])[a-z]{2,10}(?=\.mysite\.com)
but for some reason this does not capture text in a situation like:
afb"asdfunstuff.mysite.com
Where the quotation mark prevents a match for [a-z]{2,10}. Basically, what I would want to do in that case is capture asdfunstuff.mysite.com. How can this be accomplished?

So you've got two problems to solve: first, you want to match ".mysite.com" but not capture it; second, you want to grab up to 10 alphabetic characters in the "subdomain" position.
First problem can be solved by using a capturing group. The regex
([a-z]{2,10})\.mysite\.com
will capture somewhere between 2 and 10 characters, and the returned match object will expose the captured text in one of its properties (which one depends on the language). In C#, Regex.Match returns a Match object whose Groups[1].Value holds the captured subdomain.
The second problem can be solved by using the word-boundary assertion \b. In .NET, this matches where a word character (\w) is next to a non-word character (\W) or to the start or end of the input. Other languages (e.g. ECMAScript / JavaScript) work similarly.
So, I suggest the following regex to solve your problem:
\b([a-z]{2,10})\.mysite\.com
Note that numbers are legal in subdomain names, too, so the following might be generally correct (though perhaps not in your specific case):
\b(\w{2,10})\.mysite\.com
where the "word character" \w is equivalent to [a-zA-Z_0-9] in .NET's ECMAScript-compliant mode. (Further reading.)
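The suggested pattern can be tried in C# as follows. This is a minimal sketch: the helper name Subdomain is hypothetical, and the inputs are taken from or modelled on the question.

```csharp
// C# 9+ top-level program; Subdomain is a hypothetical helper for illustration.
using System;
using System.Text.RegularExpressions;

Console.WriteLine(Subdomain("see http://funstuff.mysite.com for details"));
Console.WriteLine(Subdomain("asdfasf23/funstuff.mysite.com"));

// Returns the captured subdomain, or an empty string when nothing matches.
// ".mysite.com" is part of the match but outside the capture group.
static string Subdomain(string text)
{
    Match m = Regex.Match(text, @"\b([a-z]{2,10})\.mysite\.com");
    return m.Success ? m.Groups[1].Value : "";
}
```

Note that the question's last example, afb"asdfunstuff.mysite.com, still produces no match with this pattern: asdfunstuff is 11 letters long, one more than the {2,10} quantifier allows, so the upper bound would have to be raised for that case.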

Regular expression need to identify where sentences don't have a space between them

I need a regular expression to identify all instances where a sentence begins without a space following the previous period.
For example, this is a bad sentence:
I'm sentence one.This is sentence two.
this needs to be fixed as follows:
I'm sentence one. This is sentence two.
It's not simply a case of doing a string replace of '.' with '. ', because there are also a lot of instances where the rest of the sentences in the paragraph have the correct spacing, and this would give those an extra space.

\.(?!\s) will match dots not followed by a space. You probably want exclamation marks and question marks as well though: [\.\!\?](?!\s)
Edit:
If C# supports it, try this: [\.\!\?](?!\s|$). It won't match the punctuation at the end of the string.
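C# does support lookaheads, so the pattern can be used with Regex.Replace. The following is a sketch only: the helper name FixSpacing is hypothetical, and the replacement string "$0 " (the whole match followed by a space) is one possible way to apply the fix, not something stated in the answer.

```csharp
// C# 9+ top-level program; FixSpacing is a hypothetical helper for illustration.
using System;
using System.Text.RegularExpressions;

Console.WriteLine(FixSpacing("I'm sentence one.This is sentence two. Already fine."));

// Appends a space to ., ! or ? when the next character is not whitespace
// and the punctuation is not at the end of the string.
static string FixSpacing(string text) =>
    Regex.Replace(text, @"[\.\!\?](?!\s|$)", "$0 ");
```

The pattern cannot tell sentence-ending punctuation from abbreviations, so input such as "i.e., ok" comes out as "i. e. , ok".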

You could search for \w\s{1}\.[A-Z] to find a word character, followed by a single space character, followed by a period, followed by a capital letter, to identify these. For a find/replace: find: (\w\s{1}\.)([A-Z]) and replace with $1 $2.

I doubt that you can create a regular expression that will work in the general case.
Any regex solution you come up with is going to have some interesting edge cases that you'll have to look at carefully. For example, the abbreviation "i.e." would become "i. e." (i.e., it will have an extra space and, if this parenthetical comment were run through the regex, it would become "i. e. ,").
Also, the proper way to quote text is to include the punctuation inside the quotes, as in "He said it was okay." If you had ["He said it was okay."This is a new sentence.], your regex solution might put a space before the final quote, or might ignore the error altogether.
Those are just two cases that come to mind immediately. There are plenty of others.
Whereas a regular expression will work in a limited set of simple sentences, real written language will quickly show that regular expressions are insufficient to provide a general solution to this problem.

if a sentence ends with e.g. ... you probably don't want to change this to . . .
I think the previous answers don't consider this case.
Try inserting a space wherever a word ending in punctuation is immediately followed by a new word starting with an uppercase letter:
find (\w+[\.!?])([A-Z]'?\w+) replace $1 $2
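In C#, this find/replace pair maps onto Regex.Replace. A minimal sketch, with the helper name SplitSentences being hypothetical:

```csharp
// C# 9+ top-level program; SplitSentences is a hypothetical helper for illustration.
using System;
using System.Text.RegularExpressions;

Console.WriteLine(SplitSentences("I'm sentence one.This is sentence two."));
Console.WriteLine(SplitSentences("Lowercase abbreviations, i.e., these, stay as they are."));

// Group 1: a word plus its closing punctuation. Group 2: the next word, which must
// start with an uppercase letter. A space is inserted between the two groups.
static string SplitSentences(string text) =>
    Regex.Replace(text, @"(\w+[\.!?])([A-Z]'?\w+)", "$1 $2");
```

Because group 2 requires at least one word character after the capital letter (and optional apostrophe), a sentence that starts with a one-letter word such as "A" followed by a space is not picked up by this pattern.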

Best website ever: http://www.regular-expressions.info/reference.html
