Regex split by same character within brackets - c#

I have a long string, like so:
(A) name1, name2, name3, name3 (B) name4, name5, name7 (via name7) ..... (AA) name47, name47 (via name 46) (BB) name48, name49
Currently I split on " (", but that also picks up each "(via ...)" part as a new line:
string[] lines = routesRaw.Split(new[] { " (" }, StringSplitOptions.RemoveEmptyEntries);
How can I split on the first kind of brackets only? There is no AB, AC, AD, etc.; the characters within the brackets are always the same.
Thanks.

You may use a matching approach here: the pattern needs a capturing group in order to match the same char repeated zero or more times, and Regex.Split would output all captured substrings together with the non-matches.
I suggest
(?s)(.*?)(?:\(([A-Z])\2*\)|\z)
Grab all non-empty Group 1 values. See the regex demo.
Details
(?s) - a dotall, RegexOptions.Singleline option that makes . match newlines, too
(.*?) - Group 1: any 0 or more chars, but as few as possible
(?:\(([A-Z])\2*\)|\z) - a non-capturing group that matches:
\(([A-Z])\2*\) - (, then Group 2 capturing any uppercase ASCII letter, then any 0 or more repetitions of this captured letter and then )
| - or
\z - the very end of the string.
In C#, use
var results = Regex.Matches(text, @"(?s)(.*?)(?:\(([A-Z])\2*\)|\z)")
.Cast<Match>()
.Select(x => x.Groups[1].Value)
.Where(z => !string.IsNullOrEmpty(z))
.ToList();
See the C# demo online.
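As a rough end-to-end sketch of the approach above, the snippet below runs the same pattern against a shortened sample in the shape of the question's input. The sample string, the variable names, and the extra Trim() call are assumptions added for illustration; they are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened, made-up sample in the shape of the question's input.
string text = "(A) name1, name2 (B) name4, name7 (via name7) (AA) name47 (via name 46) (BB) name48, name49";

// Group 1 holds the text before each marker; a marker is "(" + one uppercase
// letter repeated any number of times + ")". "(via ...)" starts with a
// lowercase letter, so it is never treated as a marker.
var sections = Regex.Matches(text, @"(?s)(.*?)(?:\(([A-Z])\2*\)|\z)")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value.Trim())   // Trim() is an addition for tidier output
    .Where(s => s.Length > 0)
    .ToList();

foreach (var s in sections)
    Console.WriteLine(s);
// name1, name2
// name4, name7 (via name7)
// name47 (via name 46)
// name48, name49
```

Without the Trim() call, each item keeps the spaces that surround it in the input.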

Related

Regex match all words enclosed by parentheses and separated by a pipe

I think an image is better than words sometimes.
My problem, as you can see, is that it only matches the words two at a time. How can I match all of the words?
My current regex (PCRE): ([^\|\(\)\|]+)\|([^\|\(\)\|]+)
The goal: retrieve all the words, each of them in a separate group.
You can use an infinite length lookbehind in C# (with a lookahead):
(?<=\([^()]*)\w+(?=[^()]*\))
To match any kind of strings inside parentheses, that do not consist of (, ) and |, you will need to replace \w+ with [^()|]+:
(?<=\([^()]*)[^()|]+(?=[^()]*\))
// ^^^^^^
See the regex demo (and regex demo #2). Details:
(?<=\([^()]*) - a positive lookbehind that matches a location that is immediately preceded with ( and then zero or more chars other than ( and )
\w+ - one or more word chars
(?=[^()]*\)) - a positive lookahead that matches a location that is immediately followed with zero or more chars other than ( and ) and then a ) char.
Another way to capture these words is by using
(?:\G(?!^)\||\()(\w+)(?=[^()]*\)) // words as units consisting of letters/digits/diacritics/connector punctuation
(?:\G(?!^)\||\()([^()|]+)(?=[^()]*\)) // "words" that consist of any chars other than (, ) and |
See this regex demo. The words you need are now in Group 1. Details:
(?:\G(?!^)\||\() - a position after the previous match (\G(?!^)) and a | char (\|), or (|) a ( char (\()
(\w+) - Group 1: one or more word chars
(?=[^()]*\)) - a positive lookahead that makes sure there is a ) char after any zero or more chars other than ( and ) to the right of the current position.
Extracting the matches in C# can be done with
var matches = Regex.Matches(text, @"(?<=\([^()]*)\w+(?=[^()]*\))")
.Cast<Match>()
.Select(x => x.Value);
// Or
var matches = Regex.Matches(text, @"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
.Cast<Match>()
.Select(x => x.Groups[1].Value);
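To compare the two patterns side by side, here is a small self-contained sketch. The input line is the one used later in this thread; the variable names viaLookaround and viaAnchor are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string line = "I love (chocolate|fish|honey|more)";

// Variant 1: lookbehind + lookahead, the word is the whole match value.
var viaLookaround = Regex.Matches(line, @"(?<=\([^()]*)\w+(?=[^()]*\))")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

// Variant 2: \G-anchored chain starting at "(", the word is in Group 1.
var viaAnchor = Regex.Matches(line, @"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();

Console.WriteLine(string.Join(", ", viaLookaround)); // chocolate, fish, honey, more
Console.WriteLine(string.Join(", ", viaAnchor));     // chocolate, fish, honey, more
```

The first variant relies on variable-length lookbehind, which .NET supports but many other regex engines do not; the second variant should port more easily to engines that support \G.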
In C# you can also repeat a named capture group and read all of its captures.
The matches are in named group word
\((?<word>\w+)(?:\|(?<word>\w+))*\)
\( Match (
(?<word>\w+) Match 1+ word chars in group word
(?: Non capture group
\| Match |
(?<word>\w+) Match 1+ word chars
)* Close the non capture group and optionally repeat to get all occurrences
\) Match the closing parenthesis
Code example provided by Wiktor Stribiżew in the comments:
var line = "I love (chocolate|fish|honey|more)";
var output = Regex.Matches(line, @"\((?<word>\w+)(?:\|(?<word>\w+))*\)")
.Cast<Match>()
.SelectMany(x => x.Groups["word"].Captures);
foreach (var s in output)
Console.WriteLine(s);
Output
chocolate
fish
honey
more
Regex demo

How can I write a Regex with matching groups for a comma separated string

I've got a random input string to validate and tokenize.
My aim is to check if my string has the following pattern
[a-zA-Z]{2}\d{2} (one or unlimited times) comma separated
So:
aa12,af43,ad46 -> is valid
,aa12,aa44 -> is NOT valid (initial comma)
aa12, -> is NOT valid ( trailing comma)
That's the first part, validation
Then, with the same regex I've got to create a group for each occurrence of the pattern (match collection)
So:
aa12,af34,tg53
is valid and must create the following groups
Group 1 -> aa12
Group 2 -> af34
Group 3 -> tg53
Is it possible to have it done with only one regex that validates and creates the groups?
I've written this
^([a-zA-Z]{2}\d{2})(?:(?:[,])([a-zA-Z]{2}\d{2})(?:[,])([a-zA-Z]{2}\d{2}))*(?:[,])([a-zA-Z]{2}\d{2})*|$
but even if it creates the groups more or less correctly, it lacks in the validation process, getting also strings that have a wrong pattern.
Any hints would be very very welcome
You can use
var text = "aa12,af43,ad46";
var pattern = @"^(?:([a-zA-Z]{2}\d{2})(?:,\b|$))+$";
var result = Regex.Matches(text, pattern)
.Cast<Match>()
.Select(x => x.Groups[1].Captures.Cast<Capture>().Select(m => m.Value))
.ToList();
foreach (var list in result)
Console.WriteLine(string.Join("; ", list));
// => aa12; af43; ad46
See the C# demo online and the regex demo.
Regex details
^ - start of string
(?:([a-zA-Z]{2}\d{2})(?:,\b|$))+ - one or more occurrences of
([a-zA-Z]{2}\d{2}) - Group 1: two ASCII letters and then two digits
(?:,\b|$) - either , followed with a word char or end of string
$ - end of string. Note that $ also matches before a final trailing newline (LF); use \z if you want to prevent that.
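To show the validation side as well, here is a minimal sketch that wraps the pattern in a helper and tries the invalid inputs from the question. The helper name Tokenize and the convention of returning an empty array for invalid input are assumptions made for this example; it also uses \z in place of $, as suggested in the note above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the tokens when the whole input is valid,
// or an empty array when it is not (a valid input always has at least one token).
string[] Tokenize(string input)
{
    // \z instead of $ so that a trailing newline is rejected too.
    var m = Regex.Match(input, @"^(?:([a-zA-Z]{2}\d{2})(?:,\b|\z))+\z");
    return m.Success
        ? m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray()
        : Array.Empty<string>();
}

Console.WriteLine(string.Join("; ", Tokenize("aa12,af43,ad46"))); // aa12; af43; ad46
Console.WriteLine(Tokenize(",aa12,aa44").Length);                 // 0 (leading comma)
Console.WriteLine(Tokenize("aa12,").Length);                      // 0 (trailing comma)
```

The trailing comma is rejected because ,\b needs a word character right after the comma, and \z cannot match before the comma.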

Extract phone numbers and exclude extraneous characters

I'm trying to create a regex which will extract a complete phone number from a string (which is the only thing in the string) but leaving out any cruft like decorative brackets, etc.
The pattern I have mostly appears to work, but returns a list of matches - whereas I want it to return the phone number with the characters removed. Unfortunately, it completely fails if I add the start and end of line matchers...
^(?!\(\d+\)\s*){1}(?:[\+\d\s]*)$
Without the ^ and $ this matches the following numbers:
12345-678-901 returns three groups: 12345 678 901
+44-123-4567-8901 returns four groups: +44 123 4567 8901
(+48) 123 456 7890 returns four groups: +48 123 456 7890
How can I get the groups to be returned as a single, joined up whole?
Other than that, the only change I would like to include is to return nothing if there are any non-numeric, non-bracket, non-+ characters anywhere. So, this should fail:
(+48) 123 burger 7890
I'd keep it simple, which makes it more readable and maintainable:
public string CleanPhoneNumber(string messynumber){
if(Regex.IsMatch(messynumber, "[a-z]"))
return "";
else
return Regex.Replace(messynumber, "[^0-9+]", "");
}
If any alphabetic characters are present (extend this range if you wish), return blank; otherwise replace every char that is not 0-9 or + with nothing. This produces output like 0123456789 and +481234567, with all the brackets, spaces, hyphens etc. removed too. If you want to keep those in the output, add them to the Regex.
Side note: It's not immediately clear to me what you think is "cruft" that should be stripped (non a-z?) and what you think is "cruft" that should cause blank (a-z?). I struggled with this because you said (paraphrasing) "non digit, non bracket, non plus should cause blank", but your earlier examples permitted numbers that had hyphens and also spaces. Read strictly, the spec would make hyphens/spaces "cruft that causes the whole thing to return blank" too.
I've assumed that it's lowercase chars, from the "burger" example, but as noted you can extend the range in the IF part should you need to include other chars that return blank.
If you have a lot of them to do, maybe precompile a regex as a class-level variable and use it in the method:
private Regex _strip = new Regex( "[^0-9+]", RegexOptions.Compiled);
public string CleanPhoneNumber(string messynumber){
if(Regex.IsMatch(messynumber, "[a-z]"))
return "";
else
return _strip.Replace(messynumber, "");
}
...
for(int x = 0; x < millionStrArray.Length; x++)
millionStrArray[x] = CleanPhoneNumber(millionStrArray[x]);
I don't think you'll gain much from compiling the IsMatch one but you could try it in a similar pattern
Other options exist if you're avoiding regex: you could even do it using LINQ, or by looping over char arrays, StringBuilders etc. Regex is probably the easiest in terms of short, maintainable code.
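For comparison, the LINQ route could look roughly like the sketch below. The name CleanPhoneNumberLinq is made up, and it assumes that any letter at all (not just a-z) should produce a blank result, which is slightly broader than the [a-z] check above.

```csharp
using System;
using System.Linq;

// Hypothetical LINQ counterpart of CleanPhoneNumber.
// Assumption: any letter (not only a-z) makes the whole input invalid.
string CleanPhoneNumberLinq(string messyNumber)
{
    if (messyNumber.Any(c => char.IsLetter(c)))
        return "";

    // Keep only ASCII digits and '+'; brackets, spaces and hyphens are dropped.
    return new string(messyNumber.Where(c => (c >= '0' && c <= '9') || c == '+').ToArray());
}

Console.WriteLine(CleanPhoneNumberLinq("(+48) 123 456 7890"));    // +481234567890
Console.WriteLine(CleanPhoneNumberLinq("12345-678-901"));         // 12345678901
Console.WriteLine(CleanPhoneNumberLinq("(+48) 123 burger 7890")); // prints an empty line
```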
The strategy here is to use a look ahead and kick out (fail) a match if word characters are found.
Then when there are no characters, it then captures the + and all numbers into a match group named "Phone". We then extract that from the match's "Phone" capture group and combine as such:
string pattern = @"
^
(?=[\W\d+\s]+\Z) # Only allows Non Words, decimals and spaces; stop match if letters found
(?<Phone>\+?) # If a plus found at the beginning; allow it
( # Group begin
(?:\W*) # Match but don't *capture* any non numbers
(?<Phone>[\d]+) # Put the numbers in.
)+ # 1 to many numbers.
";
var number = "+44-123-33-8901";
var phoneNumber =
string.Join(string.Empty,
Regex.Match(number,
pattern,
RegexOptions.IgnorePatternWhitespace // Allows us to comment the pattern
).Groups["Phone"]
.Captures
.OfType<Capture>()
.Select(cp => cp.Value));
// phoneNumber is `+44123338901`
If one looks at the match structure, the data it houses is this:
Match #0
[0]: +44-123-33-8901
["1"] → [1]: -8901
→1 Captures: 44, -123, -33, -8901
["Phone"] → [2]: 8901
→2 Captures: +, 44, 123, 33, 8901
As you can see match[0] contains the whole match, but we only need the captures under the "Phone" group. With those captures { +, 44, 123, 33, 8901 } we now can bring them all back together by the string.Join.

Regex match word followed by decimal from text

I want to be able to match the following examples and return array of matches
given text:
some word
another 50.00
some-more 10.10 text
another word
Matches should be (word, followed by space then decimal number (Optionally followed by another word):
another 50.00
some-more 10.10 text
I have the following so far:
string pat = @"\r\n[A-Za-z ]+\d+\.\d{1,2}([A-Za-z])?";
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
Match m = r.Match(input);
but it only matches first item: another 50.00
You do not account for - with [A-Za-z ] and only match some text after a newline.
You can use the following regex:
[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?
See the regex demo
The [\p{L}-]+ matches 1 or more letters and hyphens, \p{Zs}* matches 0 or more horizontal whitespace symbols, \d*\.?\d{1,2} matches a float number with 1 to 2 digits in the decimal part, and (?:\p{Zs}*[\p{L}-]+)? matches an optional word after the number.
Here is a C# snippet matching all occurrences based on Regex.Matches method:
var res = Regex.Matches(str, @"[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?")
.Cast<Match>()
.Select(p => p.Value)
.ToList();
Just FYI: if you need to match whole words, you can also use word boundaries \b:
\b[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?\b
And just another note: if you need to match diacritics, too, you may add \p{M} to the character class containing \p{L}:
[\p{L}\p{M}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}\p{M}-]+)?\b
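As a quick check against the text from the question, the following sketch runs the first pattern over the four sample lines. The variable names and the use of \n as the line separator are assumptions made for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The four sample lines from the question, joined with \n.
string input = "some word\nanother 50.00\nsome-more 10.10 text\nanother word";

var found = Regex.Matches(input, @"[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

foreach (var f in found)
    Console.WriteLine(f);
// another 50.00
// some-more 10.10 text
```

Because \p{Zs} does not include line breaks, a match never runs from one line into the next.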

Basic regex for 16 digit numbers

I currently have a regex that pulls up a 16 digit number from a file e.g.:
Regex:
Regex.Match(l, @"\d{16}")
This would work well for a number as follows:
1234567891234567
Although how could I also include numbers in the regex such as:
1234 5678 9123 4567
and
1234-5678-9123-4567
If all groups are always 4 digit long:
\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b
to be sure the delimiter is the same between groups:
\b\d{4}(| |-)\d{4}\1\d{4}\1\d{4}\b
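A small sketch of how the backreference version behaves, assuming the number is searched for anywhere in the input; the variable name sameDelimiter is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Group 1 captures the first delimiter (nothing, a space, or a dash);
// each \1 then requires exactly the same delimiter between later groups.
var sameDelimiter = new Regex(@"\b\d{4}(| |-)\d{4}\1\d{4}\1\d{4}\b");

Console.WriteLine(sameDelimiter.IsMatch("1234567891234567"));    // True
Console.WriteLine(sameDelimiter.IsMatch("1234 5678 9123 4567")); // True
Console.WriteLine(sameDelimiter.IsMatch("1234-5678-9123-4567")); // True
Console.WriteLine(sameDelimiter.IsMatch("1234-5678 9123 4567")); // False (mixed delimiters)
```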
If it's always all together or groups of fours, then one way to do this with a single regex is something like:
Regex.Match(l, @"\d{16}|\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}")
You could try something like:
^([0-9]{4}[\s-]?){3}([0-9]{4})$
That should do the trick.
Please note:
This also allows
1234-5678 9123 4567
It's not strict on only dashes or only spaces.
Another option is to just use the regex you currently have, and strip all offending characters out of the string before you run the regex:
var input = fileValue.Replace("-",string.Empty).Replace(" ",string.Empty);
Regex.Match(input, @"\d{16}");
Here is a pattern which will get all the numbers and strip out the dashes or spaces. Note that it also validates that there are only 16 digits. The IgnorePatternWhitespace option is there only so the pattern can be commented; it doesn't affect the match processing.
string value = "1234-5678-9123-4567";
string pattern = @"
^ # Beginning of line
( # Place into capture groups for 1 match
(?<Number>\d{4}) # Place into named group capture
(?:[\s-]?) # Allow for a space or dash optional
){4} # Get 4 groups
(?!\d) # 17th number, do not match! abort
$ # End constraint to keep int in 16 digits
";
var result = Regex.Match(value, pattern, RegexOptions.IgnorePatternWhitespace)
.Groups["Number"].Captures
.OfType<Capture>()
.Aggregate (string.Empty, (seed, current) => seed + current);
Console.WriteLine ( result ); // 1234567891234567
// Shows False due to 17 numbers!
Console.WriteLine ( Regex.IsMatch("1234-5678-9123-45678", pattern, RegexOptions.IgnorePatternWhitespace));
