Overlapping rules in regex with named groups

I'm experiencing problems with a regex that parses custom phone numbers:
A value matching "wtvCode" group is optional;
A value matching "countryCode" group is optional;
The countryCode rule overlaps with areaCityCode rule for some values. In such cases, when countryCode is missing, its expression captures the areaCityCode value instead.
Code example is below.
Regex regex = new Regex(string.Concat(
"^(",
"(?<wtvCode>[A-Z]{3}|)",
"([-|/|#| |]|)",
"(?<countryCode>[2-9+]{2,5}|)",
"([-|/|#| |]|)",
"(?<areaCityCode>[0-9]{2,3}|)",
"([-|/|#| |]|))",
"(?<phoneNumber>(([0-9]{8,18})|([0-9]{3,4}([-|/|#| |]|)[0-9]{4})|([0-9]{4}([-|/|#| |]|)[0-9]{4})|([0-9]{4}([-|/|#| |]|)[0-9]{4}([-|/|#| |]|)[0-9]{1,5})))",
"([-|/|#| |]|)",
"(?<foo>((A)|(B)))",
"([-|/|#| |]|)",
"(?<bar>(([1-9]{1,2})|)",
")$"
));
string[] validNumbers = new[] {
"11-1234-5678-27-A-2", // missing wtvCode and countryCode
"48-1234-5678-27-A-2", // missing wtvCode and countryCode
"55-48-1234-5678-27-A-2" // missing wtvCode
};
foreach (string number in validNumbers) {
Console.WriteLine("countryCode: {0}", regex.Match(number).Groups["countryCode"].Value);
Console.WriteLine("areaCityCode: {0}", regex.Match(number).Groups["areaCityCode"].Value);
Console.WriteLine("phoneNumber: {0}", regex.Match(number).Groups["phoneNumber"].Value);
}
The output for that is:
// First number
// countryCode: <- correct
// areaCityCode: 11 <- correct, but that's because "11" is never a countryCode
// phoneNumber: 1234-5678-27 <- correct
// Second number
// countryCode: 48 <- wrong, should be ""
// areaCityCode: <- wrong, should be "48"
// phoneNumber: 1234-5678-27 <- correct
// Third number
// countryCode: 55 <- correct
// areaCityCode: 48 <- correct
// phoneNumber: 1234-5678-27 <- correct
So far I have failed to fix this regular expression so that it covers all my constraints and doesn't mix up countryCode and areaCityCode when a value matches both rules. Any ideas?
Thanks in advance.
Update
The correct regex pattern for phone country codes can be found here: https://stackoverflow.com/a/6967885/136381

First I recommend using the ? quantifier to make things optional instead of the empty alternatives you're using now. And in the case of the country code, add another ? to make it non-greedy. That way it will try initially to capture the first bunch of digits in the areaCityCode group. Only if the overall match fails will it go back and use the countryCode group instead.
Regex regex = new Regex(
@"^
( (?<wtvCode>[A-Z]{3}) [-/# ] )?
( (?<countryCode>[2-9+]{2,5}) [-/# ] )??
( (?<areaCityCode>[0-9]{2,3}) [-/# ] )?
(?<phoneNumber> [0-9]{8,18} | [0-9]{3,4}[-/# ][0-9]{4}([-/# ][0-9]{1,5})? )
( [-/# ] (?<foo>A|B) )
( [-/# ] (?<bar>[1-9]{1,2}) )?
$",
RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
As you can see, I've made a few other changes to your code, the most important being the switch from ([-|/|#| |]|) to [-/# ]. The pipes inside the brackets would just match a literal |, which I'm pretty sure you don't want. And the last pipe made the separator optional; I hope they don't really have to be optional, because that would make this job a lot more difficult.
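As a quick check of the lazy-optional approach, here is a minimal, self-contained sketch that runs the pattern above against the three sample numbers from the question. The names LazyOptionalDemo and Describe are made up for illustration; the pattern itself is copied from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyOptionalDemo
{
    // Same pattern as in the answer above. The ?? after the countryCode
    // group makes the engine try to skip that group first and only come
    // back to it if the rest of the pattern cannot match otherwise.
    private static readonly Regex PhoneRegex = new Regex(
        @"^
          ( (?<wtvCode>[A-Z]{3}) [-/# ] )?
          ( (?<countryCode>[2-9+]{2,5}) [-/# ] )??
          ( (?<areaCityCode>[0-9]{2,3}) [-/# ] )?
          (?<phoneNumber> [0-9]{8,18} | [0-9]{3,4}[-/# ][0-9]{4}([-/# ][0-9]{1,5})? )
          ( [-/# ] (?<foo>A|B) )
          ( [-/# ] (?<bar>[1-9]{1,2}) )?
          $",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    // Hypothetical helper: formats the three groups the question prints.
    public static string Describe(string number)
    {
        Match m = PhoneRegex.Match(number);
        if (!m.Success) return "no match";
        return string.Format("countryCode='{0}' areaCityCode='{1}' phoneNumber='{2}'",
            m.Groups["countryCode"].Value,
            m.Groups["areaCityCode"].Value,
            m.Groups["phoneNumber"].Value);
    }

    public static void Main()
    {
        string[] numbers = {
            "11-1234-5678-27-A-2",
            "48-1234-5678-27-A-2",
            "55-48-1234-5678-27-A-2"
        };
        foreach (string number in numbers)
            Console.WriteLine("{0}: {1}", number, Describe(number));
        // 11-...: countryCode='' areaCityCode='11' phoneNumber='1234-5678-27'
        // 48-...: countryCode='' areaCityCode='48' phoneNumber='1234-5678-27'
        // 55-...: countryCode='55' areaCityCode='48' phoneNumber='1234-5678-27'
    }
}
```

Note that whitespace and # inside a character class such as [-/# ] stay literal even with IgnorePatternWhitespace, which is why the separators still match.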

There are two things overlooked by yourself and the other responders.
The first is that it makes more sense to work in reverse, in other words right to left, because there are more required fields at the end of the text than at the beginning. By removing the doubt about the WTV and the country code, it becomes much easier for the regex parser to work (though intellectually harder for the person writing the pattern).
The second is the use of the if conditional in regex, (?(condition) yes | no). That allows us to test for a scenario and apply one match pattern or another. I describe the if conditional on my blog, in a post entitled Regular Expressions and the If Conditional. The pattern below tests whether there is a WTV and a country code; if so it matches them, and if not it checks for an optional country code.
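For readers who have not met the conditional construct before, here is a small stand-alone sketch of it, separate from the phone pattern. The pattern is a simple two-format example chosen for illustration, and the names ConditionalDemo and FirstMatch are assumptions, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalDemo
{
    // (?(test)yes|no): the test \d{2}- is checked as a zero-width assertion
    // at the current position. If it succeeds, only the 'yes' branch
    // (\d{2}-\d{7}) is tried; otherwise only the 'no' branch
    // (\d{3}-\d{2}-\d{4}) is tried.
    private const string Pattern = @"\b(?(\d{2}-)\d{2}-\d{7}|\d{3}-\d{2}-\d{4})\b";

    // Hypothetical helper: returns the first match, or "" if there is none.
    public static string FirstMatch(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(FirstMatch("01-9999999"));   // yes branch: 01-9999999
        Console.WriteLine(FirstMatch("777-88-9999"));  // no branch: 777-88-9999
    }
}
```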
Also, instead of concatenating the pattern, why not use IgnorePatternWhitespace, which allows commenting the pattern, as I show below:
string pattern = @"
^
(?([A-Z][^\d]?\d{2,5}(?:[^\d])) # If WTV & Country Code (CC)
(?<wtvCode>[A-Z]{3}) # Get WTV & CC
(?:[^\d]?)
(?<countryCode>\d{2,5})
(?:[^\d]) # Required Break
| # else maybe a CC
(?<countryCode>\d{2,5})? # Optional CC
(?:[^\d]?) # Optional Break
)
(?<areaCityCode>\d\d\d?) # Required area city
(?:[^\d]?) # Optional break (OB)
(?<PhoneStart>\d{4}) # Default Phone # begins
(?:[^\d]?) # OB
(?<PhoneMiddle>\d{4}) # Middle
(?:[^\d]?) # OB
(?<PhoneEnd>\d\d) # End
(?:[^\d]?) # OB
(?<foo>[AB]) # Foo?
(?:[^AB]+)
(?<bar>\d)
$
";
var validNumbers = new List<string>() {
"11-1234-5678-27-A-2", // missing wtvCode and countryCode
"48-1234-5678-27-A-2", // missing wtvCode and countryCode
"55-48-1234-5678-27-A-2", // missing wtvCode
"ABC-501-48-1234-5678-27-A-2" // Calling Belize (501)
};
validNumbers.ForEach( nm =>
{
// IgnorePatternWhitespace only allows us to comment the pattern; does not affect processing
var result = Regex.Match(nm, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.RightToLeft).Groups;
Console.WriteLine (Environment.NewLine + nm);
Console.WriteLine("\tWTV code : {0}", result["wtvCode"].Value);
Console.WriteLine("\tcountryCode : {0}", result["countryCode"].Value);
Console.WriteLine("\tareaCityCode: {0}", result["areaCityCode"].Value);
Console.WriteLine("\tphoneNumber : {0}{1}{2}", result["PhoneStart"].Value, result["PhoneMiddle"].Value, result["PhoneEnd"].Value);
}
);
Results:
11-1234-5678-27-A-2
WTV code :
countryCode :
areaCityCode: 11
phoneNumber : 1234567827
48-1234-5678-27-A-2
WTV code :
countryCode :
areaCityCode: 48
phoneNumber : 1234567827
55-48-1234-5678-27-A-2
WTV code :
countryCode : 55
areaCityCode: 48
phoneNumber : 1234567827
ABC-501-48-1234-5678-27-A-2
WTV code : ABC
countryCode : 501
areaCityCode: 48
phoneNumber : 1234567827
Notes:
If there is no divider between the country code and the city code,
there is no way a parser can determine what is city and what is
country.
Your original country code pattern, [2-9], fails for any country
code with a 0 in it. Hence I changed it to [2-90].

Related

Extract phone numbers and exclude extraneous characters

I'm trying to create a regex which will extract a complete phone number from a string (which is the only thing in the string) but leaving out any cruft like decorative brackets, etc.
The pattern I have mostly appears to work, but returns a list of matches - whereas I want it to return the phone number with the characters removed. Unfortunately, it completely fails if I add the start and end of line matchers...
^(?!\(\d+\)\s*){1}(?:[\+\d\s]*)$
Without the ^ and $ this matches the following numbers:
12345-678-901 returns three groups: 12345 678 901
+44-123-4567-8901 returns four groups: +44 123 4567 8901
(+48) 123 456 7890 returns four groups: +48 123 456 7890
How can I get the groups to be returned as a single, joined up whole?
Other than that, the only change I would like to include is to return nothing if there are any non-numeric, non-bracket, non-+ characters anywhere. So, this should fail:
(+48) 123 burger 7890

I'd keep it simple, makes it more readable and maintainable:
public string CleanPhoneNumber(string messynumber){
if(Regex.IsMatch(messynumber, "[a-z]"))
return "";
else
return Regex.Replace(messynumber, "[^0-9+]", "");
}
If any alphabetic characters are present (extend this range if you wish), return blank; otherwise replace every character that is not 0-9 or + with nothing. This produces output like 0123456789 and +481234567, with all the brackets, spaces, hyphens and so on removed too. If you want to keep those in the output, add them to the regex.
Side note: It's not immediately clear to me what you think is "cruft" that should be stripped (non a-z?) and what you think is "cruft" that should cause a blank result (a-z?). I struggled with this because you said (paraphrasing) "non-digit, non-bracket, non-plus should cause blank", but your earlier examples permitted numbers that had hyphens and spaces. Read strictly, that spec would make hyphens and spaces "cruft that causes the whole thing to return blank" too.
I've assumed from the "burger" example that it's lowercase characters, but as noted you can extend the range in the if part should you need other characters to return blank.
If you have a lot of them to do, maybe precompile a regex as a class-level variable and use it in the method:
private Regex _strip = new Regex( "[^0-9+]", RegexOptions.Compiled);
public string CleanPhoneNumber(string messynumber){
if(Regex.IsMatch(messynumber, "[a-z]"))
return "";
else
return _strip.Replace(messynumber, "");
}
...
for(int x = 0; x < millionStrArray.Length; x++)
millionStrArray[x] = CleanPhoneNumber(millionStrArray[x]);
I don't think you'll gain much from compiling the IsMatch one but you could try it in a similar pattern
Other options exist if you're avoiding regex: you could even do it using LINQ, or by looping over char arrays, StringBuilders, etc. Regex is probably the easiest in terms of short, maintainable code.
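For comparison, the LINQ route mentioned above might look roughly like the sketch below. It is not the answerer's code: the class name PhoneCleaner is made up, and it assumes that any letter at all (not only a-z) should produce a blank result.

```csharp
using System;
using System.Linq;

public static class PhoneCleaner
{
    public static string CleanPhoneNumber(string messyNumber)
    {
        // Assumption: any letter, in any case or alphabet, invalidates the input.
        if (messyNumber.Any(char.IsLetter))
            return "";

        // Keep only ASCII digits and '+'; brackets, spaces and hyphens are dropped.
        return new string(messyNumber
            .Where(c => (c >= '0' && c <= '9') || c == '+')
            .ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(CleanPhoneNumber("(+48) 123 456 7890"));    // +481234567890
        Console.WriteLine(CleanPhoneNumber("(+48) 123 burger 7890")); // prints an empty line
    }
}
```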

The strategy here is to use a look ahead and kick out (fail) a match if word characters are found.
Then when there are no characters, it then captures the + and all numbers into a match group named "Phone". We then extract that from the match's "Phone" capture group and combine as such:
string pattern = @"
^
(?=[\W\d+\s]+\Z) # Only allows Non Words, decimals and spaces; stop match if letters found
(?<Phone>\+?) # If a plus found at the beginning; allow it
( # Group begin
(?:\W*) # Match but don't *capture* any non numbers
(?<Phone>[\d]+) # Put the numbers in.
)+ # 1 to many numbers.
";
var number = "+44-123-33-8901";
var phoneNumber =
string.Join(string.Empty,
Regex.Match(number,
pattern,
RegexOptions.IgnorePatternWhitespace // Allows us to comment the pattern
).Groups["Phone"]
.Captures
.OfType<Capture>()
.Select(cp => cp.Value));
// phoneNumber is `+44123338901`
If one looks at the match structure, the data it houses is this:
Match #0
[0]: +44-123-33-8901
["1"] → [1]: -8901
→1 Captures: 44, -123, -33, -8901
["Phone"] → [2]: 8901
→2 Captures: +, 44, 123, 33, 8901
As you can see match[0] contains the whole match, but we only need the captures under the "Phone" group. With those captures { +, 44, 123, 33, 8901 } we now can bring them all back together by the string.Join.

Find multiply groups matching in specific substring

I would like to capture the bold values in the line below that starts with the word "need", while values in the other lines, which start with "skip" and "ignore", must be ignored. I tried the pattern
need.+?(:"(?'index'\w+)"[,}])
but it found only the first (emphasised) value. How can I get the needed result using regex only?
"skip" : {"A":"ABCD123","B":"ABCD1234","C":"ABCD1235"}
"need" : {"A":"ZABCD123","B":"ZABCD1234","C":"ZABCD1235"}
"ignore" : {"A":"SABCD123","B":"SABCD1234","C":"SABCD1235"}

We are going to find need and group what we find into named match groups and their captures. There will be two groups: one named Index, which holds the A | B | C, and one named Data.
The match holds that data in the captures of each group, and from there we join them into a dictionary.
Here is the code to do that:
string data =
@"""skip"" : {""A"":""ABCD123"",""B"":""ABCD1234"",""C"":""ABCD1235""}
""need"" : {""A"":""ZABCD123"",""B"":""ZABCD1234"",""C"":""ZABCD1235""}
""ignore"" : {""A"":""SABCD123"",""B"":""SABCD1234"",""C"":""SABCD1235""}";
string pattern = @"
\x22need\x22\s*:\s*{ # Find need
( # Beginning of Captures
\x22 # Quote is \x22
(?<Index>[^\x22]+) # A into index.
\x22\:\x22 # ':'
(?<Data>[^\x22]+) # 'Z...' Data
\x22,? # ',(maybe)
)+ # End of 1 to many Captures";
var mt = Regex.Match(data,
pattern,
RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
// Get the data capture into a List<string>.
var captureData = mt.Groups["Data"].Captures.OfType<Capture>()
.Select(c => c.Value).ToList();
// Join the index capture data and project it into a dictionary.
var asDictionary = mt.Groups["Index"]
.Captures.OfType<Capture>()
.Select((cp, iIndex) => new KeyValuePair<string,string>
(cp.Value, captureData[iIndex]) )
.ToDictionary(kvp => kvp.Key, kvp => kvp.Value );

If number of fields is fixed - you can code it like:
^"need"\s*:\s*{"A":"(\w+)","B":"(\w+)","C":"(\w+)"}
If tags would be after values - like that:
{"A":"ABCD123","B":"ABCD1234","C":"ABCD1235"} : "skip"
{"A":"ZABCD123","B":"ZABCD1234","C":"ZABCD1235"} : "need"
{"A":"SABCD123","B":"SABCD1234","C":"SABCD1235"} : "ignore"
Then you could employ infinite positive look ahead with
"\w+?":"(\w+?)"(?=.*"need")
But unbounded positive lookbehinds are prohibited in PCRE (the * and + quantifiers cannot be used inside a lookbehind), so this is not very useful in your situation if you are limited to PCRE.
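Since the question is about C#, it may be worth noting that the .NET regex engine does accept quantifiers such as * inside a lookbehind, so the same idea can be turned around for tags that come before the values. The sketch below is an illustration under that assumption, not part of the answer above; the names LookbehindDemo and ExtractNeed are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    public static Dictionary<string, string> ExtractNeed(string data)
    {
        // The lookbehind requires "need" : { earlier in the input, followed
        // by .* up to the current position. Because '.' does not match '\n',
        // the '{' has to be on the same line as the key/value pair.
        const string pattern =
            @"(?<=""need""\s*:\s*\{.*)""(?<Index>\w+)"":""(?<Data>\w+)""";

        var result = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(data, pattern))
            result[m.Groups["Index"].Value] = m.Groups["Data"].Value;
        return result;
    }

    public static void Main()
    {
        string data =
            "\"skip\" : {\"A\":\"ABCD123\",\"B\":\"ABCD1234\",\"C\":\"ABCD1235\"}\n" +
            "\"need\" : {\"A\":\"ZABCD123\",\"B\":\"ZABCD1234\",\"C\":\"ZABCD1235\"}\n" +
            "\"ignore\" : {\"A\":\"SABCD123\",\"B\":\"SABCD1234\",\"C\":\"SABCD1235\"}";

        foreach (KeyValuePair<string, string> kvp in ExtractNeed(data))
            Console.WriteLine("{0} = {1}", kvp.Key, kvp.Value);
    }
}
```

Each key/value pair is returned as its own match here, so no work with the Captures collection is needed.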

You can't capture a dynamically set number of groups, so I'd run something like this regex
"need".*{.*,?".*?":(".+?").*}
with a 'match_all' function, or use Agnius' suggestion

C# regular expression for finding a certain pattern in a text

I'm trying to write a program that can replace bible verses within a document with any desired translation. This is useful for older books that contain a lot of KJV referenced verses. The most difficult part of the process is coming up with a way to extract the verses within a document.
I find that most books that place bible verses within the text use a structure like "N"(BookName chapter#:verse#s), where N is the verse text, the quotations are literal and the parens are also literal. I've been having problems coming up with a regular expression to match these in a text.
The latest regular expression I'm trying to use is this: \"(.+)\"\s*\(([\w. ]+[0-9\s]+[:][\s0-9\-]+.*)\). I'm having trouble where it won't find all the matches.
Here is the regex101 of it with a sample. https://regex101.com/r/eS5oT8/1
Is there anyway to solve this using a regular expression? Any help or suggestions would be greatly appreciated.

It's worth mentioning that the site you were using to test this relies on JavaScript regular expressions, which need the g modifier set explicitly to return every match. C# has no such modifier: Regex.Match() returns the first match and Regex.Matches() returns all of them.
You can adjust your expression slightly and ensure that you escape your double quotes properly:
// Updated expression with escaped double-quotes and other minor changes
var regex = new Regex(@"\""([^""]+)\""\s*\(([\w. ]+[\d\s]+[:][\s\d\-]+[^)]*)\)");
And then use the Regex.Matches() method to find all of the matches in your string:
// Find each of the matches and output them
foreach(Match m in regex.Matches(input))
{
// Output each match here (using Console Example)
Console.WriteLine(m.Value);
}

Use the "g" modifier.
g modifier: global. All matches (don't return on first match)

You can try the example given on MSDN. Here is the link:
https://msdn.microsoft.com/en-us/library/0z2heewz(v=vs.110).aspx
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "ablaze beagle choral dozen elementary fanatic " +
"glaze hunger inept jazz kitchen lemon minus " +
"night optical pizza quiz restoration stamina " +
"train unrest vertical whiz xray yellow zealous";
string pattern = @"\b\w*z+\w*\b";
Match m = Regex.Match(input, pattern);
while (m.Success) {
Console.WriteLine("'{0}' found at position {1}", m.Value, m.Index);
m = m.NextMatch();
}
}
}
// The example displays the following output:
// 'ablaze' found at position 0
// 'dozen' found at position 21
// 'glaze' found at position 46
// 'jazz' found at position 65
// 'pizza' found at position 104
// 'quiz' found at position 110
// 'whiz' found at position 157
// 'zealous' found at position 174

How about starting with this as a guide:
(?<quote>"".+"") # a series of any characters in quotes
\s+ # followed by spaces
\( # followed by a parenthetical expression
(?<book>\d*[a-z.\s]*) # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
(?<chapter>\d+) # chapter e.g. the '1' in 1:2
: # colon
(?<verse>\d+) # verse e.g. the '2' in 1:2
\)
Using the options:
RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase
The expression above will give you named captures of every element in the match for easy parsing (e.g., you'll be able to pick out quote, book, chapter and verse) by looking at, e.g., match.Groups["verse"].
Full code:
var input = @"Jesus said, ""'Love your neighbor as yourself.'
There is no commandment greater than these"" (Mark 12:31).";
var bibleQuotesRegex =
@"(?<quote>"".+"") # a series of any characters in quotes
\s+ # followed by spaces
\( # followed by a parenthetical expression
(?<book>\d*[a-z.\s]*) # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
(?<chapter>\d+) # chapter e.g. the '1' in 1:2
: # colon
(?<verse>\d+) # verse e.g. the '2' in 1:2
\)";
foreach(Match match in Regex.Matches(input, bibleQuotesRegex, RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase))
{
var bibleQuote = new
{
Quote = match.Groups["quote"].Value,
Book = match.Groups["book"].Value,
Chapter = int.Parse(match.Groups["chapter"].Value),
Verse = int.Parse(match.Groups["verse"].Value)
};
//do something with it.
}

After you've added "g", also be careful if there are multiple verses without any '\n' character in between, because "(.*)" will treat them as one long match instead of multiple verses. You will want something like "([^"]*)" to prevent that.
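The difference is easy to see on a single line that holds two quoted verses. The sketch below is a simplified illustration: the sample text, the shortened patterns and the name GreedyQuoteDemo are all made up for the purpose and are not the asker's exact expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyQuoteDemo
{
    // Greedy: .* runs to the last quote on the line that is followed by "(...)".
    public const string Greedy = @"""(.*)""\s*\(([^)]*)\)";

    // Bounded: [^"]* cannot run past the closing quote of the current verse.
    public const string Bounded = @"""([^""]*)""\s*\(([^)]*)\)";

    // Two quoted verses with no newline between them.
    public const string Sample = "\"A\" (John 1:1) and \"B\" (Mark 2:2)";

    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountMatches(Sample, Greedy));   // 1
        Console.WriteLine(CountMatches(Sample, Bounded));  // 2
    }
}
```

With the greedy version, the single match captures A" (John 1:1) and "B as the verse text, which is the "one long match" described above.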

How to parse marked up text in C#

I am trying to make a simple text formatter using MigraDoc for actually typesetting the text.
I'd like to specify formatting by marking up the text. For example, the input might look something like this:
"The \i{quick} brown fox jumps over the lazy dog^{note}"
which would denote "quick" being italicized and "note" being superscript.
To make the splits I have made a dictionary in my TextFormatter:
// A static constructor cannot carry an access modifier in C#.
static TextFormatter()
{
    FormatDictionary = new Dictionary<string, TextFormats>()
    {
        {@"^", TextFormats.supersript},
        {@"_", TextFormats.subscript},
        {@"\i", TextFormats.italic}
    };
}
I'm then hoping to split using some regexes that looks for the modifier strings and matches what is enclosed in braces.
But as multiple formats can exist in a string, I need to also keep track of which regex was matched. E.g. getting a List<string, TextFormats>, (where string is the enclosed string, TextFormats is the TextFormats value corresponding to the appropriate special sequence and the items are sorted in order of appearance), which I could then iterate over applying formatting based on the TextFormats.
Thank you for any suggestions.

Consider the following Code...
string inputMessage = @"The \i{quick} brown fox jumps over the lazy dog^{note}";
MatchCollection matches = Regex.Matches(inputMessage, @"(?<=(\\i|_|\^)\{)\w*(?=\})");
foreach (Match match in matches)
{
string textformat = match.Groups[1].Value;
string enclosedstring = match.Value;
// Add to Dictionary<string, TextFormats>
}
Good Luck!

I'm not sure if callbacks are available in .NET, but
if you have strings like "The \i{quick} brown fox jumps over the lazy dog^{note}" and
you just want to do the substitution as you find them,
you could use a regex replace with a callback:
# @"(\\i|_|\^){([^}]*)}"
( \\i | _ | \^ ) # (1)
{
( [^}]* ) # (2)
}
then in callback examine capture buffer 1 for format, replace with {fmtCodeStart}\2{fmtCodeEnd}
or you could use
# @"(?:(\\i)|(_)|(\^)){([^}]*)}"
(?:
( \\i ) # (1)
| ( _ ) # (2)
| ( \^ ) # (3)
)
{
( [^}]* ) # (4)
}
then in callback
if (match.Groups[1].Success)
// return "{fmtCode1Start}\4{fmtCode1End}"
else if (match.Groups[2].Success)
// return "{fmtCode2Start}\4{fmtCode2End}"
else if (match.Groups[3].Success)
// return "{fmtCode3Start}\4{fmtCode3End}"
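Callbacks are in fact available in .NET: Regex.Replace has overloads that take a MatchEvaluator delegate. A minimal sketch of the first variant is below. The HTML-style tags stand in for the fmtCode placeholders and are purely illustrative, as are the names MarkupDemo and ApplyFormats.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkupDemo
{
    public static string ApplyFormats(string input)
    {
        // Group 1 is the format marker (\i, _ or ^), group 2 the enclosed text.
        // The lambda is converted to a MatchEvaluator and called once per match.
        return Regex.Replace(input, @"(\\i|_|\^)\{([^}]*)\}", m =>
        {
            string body = m.Groups[2].Value;
            switch (m.Groups[1].Value)
            {
                case @"\i": return "<i>" + body + "</i>";      // italic
                case "_":   return "<sub>" + body + "</sub>";  // subscript
                default:    return "<sup>" + body + "</sup>";  // "^" superscript
            }
        });
    }

    public static void Main()
    {
        Console.WriteLine(ApplyFormats(@"The \i{quick} brown fox jumps over the lazy dog^{note}"));
        // The <i>quick</i> brown fox jumps over the lazy dog<sup>note</sup>
    }
}
```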

Regular Expression Pattern C#

I have the following string that would require me to parse it via Regex in C#.
Format: rec_mnd.rate.current_rate.sum.QWD.RET : 214345
I would like to extract the bold characters as group objects in a GroupCollection.
QWD = 1 group
RET = 1 group
214345 = 1 group
What would the pattern look like?

It would be something like this:
string s = "Format: rec_mnd.rate.current_rate.sum.QWD.RET : 214345";
Match m = Regex.Match(s, @"^Format: rec_mnd\.rate\.current_rate\.sum\.(.+?)\.(.+?) : (\d+)$");
if( m.Success )
{
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(m.Groups[2].Value);
Console.WriteLine(m.Groups[3].Value);
}
The question mark in the first two groups makes the quantifier lazy: it captures as few characters as possible. In other words, it captures up to the first . that lets the rest of the pattern match. Alternatively, you could use ([^.]+) in those groups, which explicitly captures everything except a period.
The last group explicitly only captures decimal digits. If your expression can have other values on the right side of the : you'd have to change that to .+ as well.
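The two variants agree on the sample string but differ when a field contains an extra period. The sketch below illustrates this; the second input is an invented example and the names LazyVsNegatedDemo and Describe are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsNegatedDemo
{
    private const string Prefix = @"^Format: rec_mnd\.rate\.current_rate\.sum\.";

    // Lazy groups: may still grow across a period if the match needs it.
    public const string Lazy = Prefix + @"(.+?)\.(.+?) : (\d+)$";

    // Negated class: a group can never contain a period.
    public const string Negated = Prefix + @"([^.]+)\.([^.]+) : (\d+)$";

    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success
            ? m.Groups[1].Value + "|" + m.Groups[2].Value + "|" + m.Groups[3].Value
            : "no match";
    }

    public static void Main()
    {
        string sample = "Format: rec_mnd.rate.current_rate.sum.QWD.RET : 214345";
        string extra = "Format: rec_mnd.rate.current_rate.sum.QWD.X.RET : 5"; // invented input

        Console.WriteLine(Describe(sample, Lazy));     // QWD|RET|214345
        Console.WriteLine(Describe(sample, Negated));  // QWD|RET|214345
        Console.WriteLine(Describe(extra, Lazy));      // QWD|X.RET|5
        Console.WriteLine(Describe(extra, Negated));   // no match
    }
}
```

Which behaviour is preferable depends on whether an unexpected extra segment should be absorbed or rejected.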

Please, make it a lot easier on yourself and label your groups to make it easier to understand what is going on in code.
Regex myRegex = new Regex(@"rec_mnd\.rate\.current_rate\.sum\.(?<code>[A-Z]{3})\.(?<subCode>[A-Z]{3})\s*:\s*(?<number>\d+)");
var matches = myRegex.Matches(sourceString);
foreach(Match match in matches)
{
//do stuff
Console.WriteLine("Match");
Console.WriteLine("Code: " + match.Groups["code"].Value);
Console.WriteLine("SubCode: " + match.Groups["subCode"].Value);
Console.WriteLine("Number: " + match.Groups["number"].Value);
}

This should give you what you want regardless of what's between the .'s.
@"(?:.+\.){4}(.\w+)\.(\w+)\s?:\s?(\d+)"
