Regex nested parentheses - c#

I have the following string:
a,b,c,d.e(f,g,h,i(j,k)),l,m,n
Could someone tell me how to build a regex that returns only the "first level" of parentheses, something like this:
[0] = a,b,c,
[1] = d.e(f,g,h,i(j,k))
[2] = l,m,n
The goal is to keep each parenthesized section, together with anything nested inside it, as a single item that I can process later.
Thank you.
EDIT
Trying to improve the example...
Imagine I have this string
username,TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)),password
My goal is to turn a string into a dynamic query.
Fields that do not begin with "TB_" belong to the main table; otherwise, the fields listed inside the parentheses belong to a related table.
But I am having difficulty retrieving all the "first level" fields. Once I can separate them from the related tables, I could recurse to retrieve the remaining fields.
In the end, I would have something like:
[0] = username,password
[1] = TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2))
I hope I have explained it a little better, sorry.

You can use this:
(?>\w+\.)?\w+\((?>\((?<DEPTH>)|\)(?<-DEPTH>)|[^()]+)*\)(?(DEPTH)(?!))|\w+
With your example you obtain:
0 => username
1 => TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2))
2 => password
Explanation:
(?>\w+\.)? \w+ \(    # the opening parenthesis (with the function name)
(?>                  # open an atomic group
    \( (?<DEPTH>)    # when an opening parenthesis is encountered,
                     # then increment the stack named DEPTH
  |                  # OR
    \) (?<-DEPTH>)   # when a closing parenthesis is encountered,
                     # then decrement the stack named DEPTH
  |                  # OR
    [^()]+           # content that is not parenthesis
)*                   # close the atomic group, repeat zero or more times
\)                   # the closing parenthesis
(?(DEPTH)(?!))       # conditional: if the stack named DEPTH is not empty
                     # then fail (ie: parenthesis are not balanced)
You can try it with this code:
string input = "username,TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)),password";
string pattern = @"(?>\w+\.)?\w+\((?>\((?<DEPTH>)|\)(?<-DEPTH>)|[^()]+)*\)(?(DEPTH)(?!))|\w+";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
    Console.WriteLine(match.Groups[0].Value);
}

I suggest a new strategy, R2 - do it algorithmically. While you can build a Regex that will eventually come close to what you're asking, it'll be grossly unmaintainable, and hard to extend when you find new edge cases. I don't speak C#, but this pseudo code should get you on the right track:
function parenthetical_depth(some_string):
    open = count '(' in some_string
    close = count ')' in some_string
    return open - close

function smart_split(some_string):
    bits = split some_string on ','
    new_bits = empty list
    bit = empty string
    while bits has next:
        bit = fetch next from bits
        while parenthetical_depth(bit) != 0:
            bit = bit + ',' + fetch next from bits
        place bit into new_bits
    return new_bits
This is the easiest version to understand. The algorithm is currently O(n^2); there's an optimization for the inner loop that makes it O(n) (with the exception of string copying, which is kind of the worst part of this):
depth = parenthetical_depth(bit)
while depth != 0:
    nbit = fetch next from bits
    depth = depth + parenthetical_depth(nbit)
    bit = bit + ',' + nbit
The string copying can be made more efficient with clever use of buffers and buffer size, at the cost of space efficiency, but I don't think C# gives you that level of control natively.
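For reference, here is a minimal C# sketch of the pseudocode above, using the running-depth optimization. The names TopLevelSplitter and SmartSplit are made up for illustration, and it assumes parentheses are the only nesting characters and that commas never appear inside quoted text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopLevelSplitter
{
    // Net count of unclosed '(' in a fragment: opens minus closes.
    private static int ParentheticalDepth(string fragment) =>
        fragment.Count(c => c == '(') - fragment.Count(c => c == ')');

    // Splits on commas, then re-joins fragments until the parentheses balance.
    public static List<string> SmartSplit(string input)
    {
        var result = new List<string>();
        var pending = new List<string>(); // fragments of the item being rebuilt
        int depth = 0;                    // running depth, so nothing is re-counted

        foreach (string bit in input.Split(','))
        {
            pending.Add(bit);
            depth += ParentheticalDepth(bit);
            if (depth == 0)
            {
                result.Add(string.Join(",", pending));
                pending.Clear();
            }
        }

        // Unbalanced input: keep whatever is left instead of dropping it.
        if (pending.Count > 0)
            result.Add(string.Join(",", pending));

        return result;
    }
}
```

Calling TopLevelSplitter.SmartSplit on the string from the question should give three items: username, the whole TB_PEOPLE.fields(...) block, and password. Each item that contains parentheses can then be split again recursively.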

If I understood your example correctly, you are looking for something like this:
(?<head>[a-zA-Z._]+\,)*(?<body>[a-zA-Z._]+[(].*[)])(?<tail>.*)
For given string:
username,TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)),password
This expression will match
username, for group head
TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)) for group body
,password for group tail


How to match any repeated chunks of characters?

I've seen many questions similar to this but none quite like it.
I have strings like this:
HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02
I can't figure this out and the other questions I've seen don't apply because 1) the repeated chunk is more than one character, 2) There are no spaces between the repetition.
Or is regex not the best way to do this?
EDIT:
Rules
Alpha chunks are never repeated more than one time.
Some chunks can be alphanumeric but also never repeated more than one time.
The part that can be repeated would be from the start of the string and any additional chunks by hyphen.
So you would never have something like HF-HF-01-01. But in this case using the above rules, it would become HF-01-01 since HF is the only part repeated from the beginning of the string.
Perhaps something like this would work:
Scan string to first hyphen, see if that matches anywhere else after first hyphen, if so scan to second hyphen, see if that matches anywhere else, if not, take the first scan and remove one instance of it from the string, if so, scan to third, etc.
But I don't know how to do that in regex.
I'm not sure if RegExp is the right tool here.
Using MoreLinq's RunLengthEncode method (which implements run-length encoding), you can achieve it like this:
string RemoveDuplicate(string input)
{
    var chunks = input.Split('-')              // cut at -
                      .RunLengthEncode()       // group and count adjacent equal chunks
                      .Select(kvp => kvp.Key); // just take the chunk value
    return string.Join("-", chunks);           // reglue with -
}
Edit
Doesn't work for:
OZYA-03A-OZYA-03A-03
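One way to cover that case without a regex is to compare whole runs of chunks instead of single chunks. This is only a sketch: the names RepeatedPrefix and RemoveRepeatedPrefix are made up, and it assumes the repeated part always starts the string and is immediately followed by its copy.

```csharp
using System;
using System.Linq;

public static class RepeatedPrefix
{
    // Drops one copy of the leading run of chunks when that run is
    // immediately repeated, e.g. "OZYA-03A-OZYA-03A-03" -> "OZYA-03A-03".
    public static string RemoveRepeatedPrefix(string input)
    {
        string[] chunks = input.Split('-');

        // Try the longest candidate prefix first.
        for (int len = chunks.Length / 2; len >= 1; len--)
        {
            bool repeated = chunks.Take(len)
                                  .SequenceEqual(chunks.Skip(len).Take(len));
            if (repeated)
                return string.Join("-", chunks.Skip(len));
        }

        return input; // nothing repeated at the start
    }
}
```

With the four sample strings from the question this should return HF-01-01, FBC-04, OZYA-03A-03 and QC-02, and HF-HF-01-01 becomes HF-01-01.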
I guess,
([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)
or with start/end anchors,
^([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)$
might work to some extent and the desired output is in the last capturing group:
(\1.*)
Test
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)";
        string input = @"HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02";
        RegexOptions options = RegexOptions.Multiline;

        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
I'm not sure if regex is the right tool here, but at least it can be somewhat done with this short pattern:
^([A-Z0-9]+)-.*(\1.*)$
Explanation:
^            start of string
(            group 1 start
[A-Z0-9]+    one or more capital letters or digits
)            end group 1
-            literal
.*           any number of any chars
(            group 2 start
\1           anything that was matched in group 1
.*           any number of any chars
)            end group 2 (this group will be used as the result)
$            end of string
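A short C# sketch of how that pattern could be applied. The class and method names are made up, and falling back to the original string when nothing matches is my assumption, since the pattern only matches strings that actually contain a repetition.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedChunk
{
    private static readonly Regex Pattern =
        new Regex(@"^([A-Z0-9]+)-.*(\1.*)$");

    // Returns group 2 when the first chunk reappears later in the string;
    // otherwise returns the input unchanged.
    public static string Collapse(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups[2].Value : input;
    }
}
```

Because .* is greedy, group 2 starts at the last occurrence of the first chunk, so Collapse("OZYA-03A-OZYA-03A-03") should give OZYA-03A-03.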

C# Regex to obtain string up until a pattern

I've always been really bad when it comes to using regular expressions but it is something I want to seriously understand because as we all know, it is quite useful.
This is for a personal project, to keep my folders organized and neat.
I have a bunch of folders with the following naming pattern XXXXXXXX.XXXXXXX.XXXXXX.SYY.EYY.SOMETHINGELSE
There can be any amount of X repeating separated by ".", but the SYY.EYY is always there. So what I want is a regular expression to retrieve all the text represented by XXX without the "." if possible up until the SYY.EYY pattern.
I managed to detect the pattern because YY are always numbers, so doing something like \d{2} will detect it, but I'm wondering if it's possible to also add the rest of the pattern to that \d{2}.
Any help is appreciated :)
If the YY is, as you stated, 2 digits and you want to get the text except the . up until, for example, S11.E22, you could make use of the \G anchor and a capturing group to get the text without a dot.
The value is in the Match.Groups property.
\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.
In parts
\G                     Assert position at the end of the previous match (start at the beginning)
(?!                    Negative lookahead, assert what is directly to the right is not
S[0-9]{2}\.E[0-9]{2}   Match S, 2 digits, . E and 2 digits
)                      Close lookahead
(                      Capture group 1
[^.]+                  Match 1+ times any char except a dot
)                      Close group 1
\.                     Match dot literal
For example
string pattern = @"\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.";
string input = @"XXXXXXXX.XXXXXXX.XXXXXX.S11.E22.SOMETHINGELSE";
foreach (Match m in Regex.Matches(input, pattern))
{
    Console.WriteLine(m.Groups[1].Value);
}
Output
XXXXXXXX
XXXXXXX
XXXXXX
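If you want the parts as one string rather than separate lines, something along these lines might do. It is a sketch only: FolderName and TitleOf are made-up names and the space separator is just a guess at what you want.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FolderName
{
    private static readonly Regex Part =
        new Regex(@"\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.");

    // Joins every dot-separated part that precedes the SYY.EYY marker.
    public static string TitleOf(string folder, string separator = " ")
    {
        var parts = Part.Matches(folder)
                        .Cast<Match>()
                        .Select(m => m.Groups[1].Value);
        return string.Join(separator, parts);
    }
}
```

For the sample input above, FolderName.TitleOf(input) should return "XXXXXXXX XXXXXXX XXXXXX".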
You can "replace/cut" the "." with C#.
The regex to get up until the SYY.EYY can be like this:
.SYY.EYY$
Line ends with word -> Regex: ExampleWord$
I would do something like:
// S\d{2}\.E\d{2} stands for the SYY.EYY marker
var leftPart = Regex.Match(x, @"^.*?(?=S\d{2}\.E\d{2})").Value;
// this now has XXXXXXXX.XXXXXXX.XXXXXX.
// And we can:
var left = leftPart.Replace(".", " "); // or any other char

Extract Multiple Occurances of Variable Length Text Without Multiple Patterns [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
From the following data .xxx[val1, val2, val3] the values of val1, val2 and val3 need to be extracted.
If one uses this pattern @"\[(.*?), (.*?), (.*?)\]" the data can be extracted, but when the data string varies it fails to get all data.
Take these variable examples
.xxx[val1]
or .xxx[val1, val2, val3, val4, val5]
or finally .xxx[{1-N},].
What single regular expression pattern can achieve results on all sets of data provided as examples?
What would be the correct pattern for this?
The best practice is not to match the unknown, but to design your pattern around the knowns. In the same spirit, do not blindly match with .* (zero or more of anything), because the backtracking can be horrendously slow; why add complexity when it is not needed?
Frankly, one should favor + (one or more) over * (zero or more), which should really be used only when specific items may not appear.
the string can vary.
It appears by your example that if we were to think like a compiler, the tokens are separated by either a , or an ending ]. So let us develop a pattern with that knowledge (the knowns).
The best way to capture is to consume until a known is found. The negated set [^ ] is best for this; it matches any character not in the set. Then add the + quantifier, which says one or more. This effectively replaces the .* in your old pattern, but in reverse.
var data = ".xxx[val1, val2, val3, val4, val5]";
var pattern = @"
[^[]+                    # Consume anything that is *not* a bracket
                         # but don't capture it (.xxx is the first anchor)
\[                       # Starting bracket consumed
(                        # Start of match captures
  (?<Token>[^\s,\]]+)    # Named match group called `Token` where one or more
                         # of anything not a space, comma or end bracket is captured.
  [\s,\]]+               # Consume the token's `,` or space or the final bracket.
)+                       # End match captures, one or more";

// IgnorePatternWhitespace allows us to comment the pattern;
// it does not affect parser processing.
var tokens = Regex.Match(data, pattern, RegexOptions.IgnorePatternWhitespace)
                  .Groups["Token"]
                  .Captures
                  .OfType<Capture>()
                  .Select(cp => cp.Value);
You could capture @"\[(.*?)\]" in a first step and then split on the , which would certainly be a lot faster than using a regexp to do the same.
An easier way to do this is to just match everything inside [] and then split the match.
text.match(/\[(.*)\]/)[1].split(", "); //And now you have an array with var1,var2..etc
Here's a javascript example, I don't do c#, so don't want to mess it up :)
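For what it's worth, a C# version of the same capture-then-split idea could look roughly like this. The names BracketList and Values are made up, and trimming spaces and dropping empty entries (for inputs with a trailing comma) are my assumptions about the desired output.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketList
{
    // Step 1: capture everything between '[' and the next ']'.
    // Step 2: split on commas, trim spaces, and drop empty entries.
    public static string[] Values(string text)
    {
        Match m = Regex.Match(text, @"\[([^\]]*)\]");
        if (!m.Success)
            return new string[0];

        return m.Groups[1].Value
                .Split(',')
                .Select(v => v.Trim())
                .Where(v => v.Length > 0)
                .ToArray();
    }
}
```

BracketList.Values(".xxx[val1, val2, val3]") should give val1, val2 and val3, and the same call works for one value or for five.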
Although a Group overwrites its value each time it is repeated, it stores the whole stack of captures as a CaptureCollection, which each group returns through its Group.Captures property.
Group.Captures Property
The real utility of the Captures property occurs when a quantifier is applied to a capturing group so that the group captures multiple substrings in a single regular expression. In this case, the Group object contains information about the last captured substring, whereas the Captures property contains information about all the substrings captured by the group.
Then, you can simply use this pattern:
\[(?:([^,\]]+),?\s*)+\]
Code:
string pattern = @"\[(?:([^,\]]+),?\s*)+\]";
var re = new Regex(pattern);
var text = @".xxx[val1, val2, val3]";
MatchCollection matches = re.Matches(text);
for (int mnum = 0; mnum < matches.Count; mnum++)
{   // loop matches
    Match match = matches[mnum];
    Console.WriteLine("Match #{0} - Value: {1}", mnum + 1, match.Value);
    int captureCtr = 0;
    foreach (Capture capture in match.Groups[1].Captures)
    {   // loop captures for the 1st Group
        Console.WriteLine(" Capture {0}: {1}",
            captureCtr, capture.Value);
        captureCtr += 1;
    }
}
Output:
Match #1 - Value: [val1, val2, val3]
Capture 0: val1
Capture 1: val2
Capture 2: val3

getting the correct regex to print out in c#

Below is a regex statement I have been working on for quite some time:
Match parsedRequestData = Regex.Match(requestData, @"^.*\[(.*)\]$");
What this is supposed to be doing is taking the email out of the line below:
2.3|[0246303@up.com]
For clarification, this email comes from a table in SQL Server. There are many emails formatted like this in there, and the regex is supposed to be getting everything inside the brackets. However, it is matching the entirety of this line instead of what's inside of it. So my question is, is there something wrong with my regex statement or do I have something in my code I need to add?
Your regex is storing the email address in capture group 1. Try referencing group 1 like this:
parsedRequestData.Groups[1];
Code Sample:
string requestData = "2.3|[0246303@up.com]";
Match parsedRequestData = Regex.Match(requestData, @"^.*\[(.*)\]$");
if (parsedRequestData.Success)
{
    Console.WriteLine(parsedRequestData.Groups[1]);
}
Results:
0246303@up.com
Your regex is OK. All you need is to use Groups[1]:
var email = Regex.Match("2.3|[0246303@up.com]", @"^.*\[(.*)\]$").Groups[1].Value;
However, it is matching the entirety of this line instead of whats inside of it.
Unless one uses named match captures, the match capture groups are indexed.
Match.Groups[0].Value is the whole match; it contains the captured text together with everything else the pattern matched.
Match.Groups[1] through Match.Groups[N] hold the captures for each ( ) set in the pattern, numbered in the order the opening parentheses appear. If there is only one ( ), there will be two indexed groups: 0, as mentioned above, and 1, which holds the text captured by that set.
You only have one ( ) set, so the data you want is found in match capture group 1. Group 0 has the non-captured text along with the captured data.
If one names the match capture, such as (?<MyNameHere> ), one can also access the match via Match.Groups["MyNameHere"].Value.
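To make the indexing concrete, here is a small sketch that reads the same capture by index and by name. The group name Email and the class GroupAccess are made up for illustration; the pattern is otherwise the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupAccess
{
    // "Email" is an illustrative group name, not taken from the question.
    private static readonly Regex Pattern =
        new Regex(@"^.*\[(?<Email>.*)\]$");

    // Returns { whole match, group 1 by index, the same group by name },
    // or an empty array when the input does not match.
    public static string[] Read(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
            return new string[0];

        return new[]
        {
            m.Groups[0].Value,
            m.Groups[1].Value,
            m.Groups["Email"].Value
        };
    }
}
```

For "2.3|[0246303@up.com]" the first element should be the whole line and the other two should both be 0246303@up.com, since a pattern with a single named group numbers that group as 1.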
Suggestion on your pattern away from the answer
Usage of * (zero or more) in patterns can be problematic in that it can significantly increase the time the parser takes, due to backtracking through false scenarios.
If one knows there is text to be found, don't tell the parser zero items may happen when that is impossible; change it to + (one or more). That slight change can greatly affect the parsing operations, both in time and operations.
Change ^.*\[(.*)\]$ to ^.+\[(.+)\]$.
But to increase the efficiency of the pattern even further, focus on the known characters [ and ] as anchors.
Pattern Restructure To Use Anchors
^[^[]+\[([^\]]+)[\s\]]+$
Why is this pattern better? Because we will look for "[" and "]" as anchors.
Let us break it down
^ - Beginning of the pattern (a hard anchor)
[^ ]+ This is set notation where the ^ says NOT.
[^[]+ So we want to match all text, + (one or more), that is NOT a [. This tells the pattern to match up to our anchor [ in the text. Note that we don't have to escape it, for the regex parser treats a [ inside a set [ ] as a literal, so [^[] is valid. (To be clear, this text is matched but not captured, so we will not find it in any group above index 0; only in 0.)
\[ Our literal anchor, the "[" character.
([^\]]+) This is our match capture, which says match this set where any character is valid except a "]". Here we have to escape the ] because otherwise it would signify the end of our set.
[\s\]]+ We know that the end of our text may contain spaces and the "]" character, so let us match (but not capture) any combination of spaces and a ] before the end.
$ Our final anchor, the end of the file/buffer indicator (or line if the right parser rule is set).
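A quick sketch of the restructured pattern in use. BracketedValue and Extract are made-up names, and returning null for rows that do not fit is just one possible choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketedValue
{
    private static readonly Regex Pattern =
        new Regex(@"^[^[]+\[([^\]]+)[\s\]]+$");

    // Returns the text between the brackets, or null when the row does not fit.
    public static string Extract(string row)
    {
        Match m = Pattern.Match(row);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

BracketedValue.Extract("2.3|[0246303@up.com]") should return 0246303@up.com, also when the row has trailing spaces after the closing bracket.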

How can I normalize/canonize a regular expression pattern?

I have a complex regular expression I've built with code. I want to normalize it to the simplest (canonical) form that will be an equivalent regular expression but without the extra brackets and so on.
I want it to be normalized so I can understand if it's correct and find bugs in it.
Here is an example for a regular expression I want to normalize:
^(?:(?:(?:\r\n(?:[ \t]+))*)(<transfer-coding>(?:chunked|(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)(?:(?:;(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)=(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)|(?:"(?:(?:(?:|[^\x00-\x31\x127\"])|(?:\\[\x00-\x127]))*)))))*))))(?:(?:(?:\r\n(?:[ \t]+))*),(?:(?:\r\n(?:[ \t]+))*)(<transfer-coding>(?:chunked|(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)(?:(?:;(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)=(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)|(?:"(?:(?:(?:|[^\x00-\x31\x127\"])|(?:\\[\x00-\x127]))*)))))*))))*))$
I'm with the other answers and comments so far. Even if you could define a reduced form, it's unlikely that the reduced form is going to be any more understandable than this thing, which resembles line noise on a 1200 baud modem.
If you did want to find a canonical form for regular expressions, I'd start by defining precisely what you mean by "canonical form". For example, suppose you have the regular expression [ABCDEF-I]. Is the canonical form (1) [ABCDEF-I], (2) [ABCDEFGHI] or (3) [A-I] ?
That is, for purposes of canonicalization, do you want to (1) ignore this subset of regular expressions for the purposes of canonicalization, (2) eliminate all "-" operators, thereby simplifying the expression, or (3) make it shorter?
The simplest way would be to go through every part of the regular expression specification and work out which subexpressions are logically equivalent to another form, and decide which of the two is "more canonical". Then write a recursive regular expression analyzer that goes through a regular expression and replaces each subexpression with its canonical form. Keep doing that in a loop until you find the "fixed point", the regular expression that doesn't change when you put it in canonical form.
That, however, will not necessarily do what you want. If what you want is to reorganize the regular expression to minimize the complexity of grouping or some such thing then what you might want to do is to canonicalize the regular expression so that it is in a form such that it only has grouping, union and Kleene star operators. Once it is in that form you can easily translate it into a deterministic finite automaton, and once it is in DFA form then you can run a graph simplification algorithm on the DFA to form an equivalent simpler DFA. Then you can turn the resulting simplified DFA back into a regular expression.
Though that would be fascinating, like I said, I don't think it would actually solve your problem. Your problem, as I understand it, is a practical one. You have this mess, and you want to understand that it is right.
I would approach that problem by a completely different tack. If the problem is that the literal string is hard to read, then don't write it as a literal string. I'd start "simplifying" your regular expression by making it read like a programming language instead of reading like line noise:
Func<string, string> group = s=>"(?:"+s+")";
Func<string, string> capture = s=>"("+s+")";
Func<string, string> anynumberof = s=>s+"*";
Func<string, string> oneormoreof = s=>s+"+";
var beginning = "^";
var end = "$";
var newline = @"\r\n";
var tab = @"\t";
var space = " ";
var semi = ";";
var comma = ",";
var equal = "=";
var chunked = "chunked";
var transfer = "<transfer-coding>";
var backslash = @"\\";
var escape = group(backslash + @"[\x00-\x7f]");
var or = "|";
var whitespace =
  group(
    anynumberof(
      group(
        newline +
        group(
          oneormoreof(@"[ \t]")))));
var legalchars =
  group(
    oneormoreof(@"[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]"));
var re =
  beginning +
  group(
    whitespace +
    capture(
      transfer +
      group(
        chunked +
        or +
        group(
          legalchars +
          group(
            group(
              semi +
              anynumberof(
                group(
                  legalchars +
                  equal +
                  ...
Once it looks like that it'll be a lot easier to understand and optimize.
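To show the idea end to end, here is the same helper style applied to a deliberately tiny grammar, a comma-separated list of alphanumeric tokens. This is a sketch and not the transfer-coding expression; the class name ReadablePattern and the token definition are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReadablePattern
{
    // Builds ^(?:[A-Za-z0-9]+)(?:,(?:[A-Za-z0-9]+))*$ from named pieces.
    public static string Build()
    {
        Func<string, string> group = s => "(?:" + s + ")";
        Func<string, string> anynumberof = s => s + "*";
        Func<string, string> oneormoreof = s => s + "+";

        var beginning = "^";
        var end = "$";
        var comma = ",";
        var token = group(oneormoreof("[A-Za-z0-9]"));

        return beginning + token + anynumberof(group(comma + token)) + end;
    }
}
```

Regex.IsMatch("chunked,gzip", ReadablePattern.Build()) should be true and "chunked," should be rejected. Each named piece can also be tested on its own, which is where most of the benefit comes from.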
I think you're getting ahead of yourself; the problems with that regex are not just cosmetic. Many of the parentheses can simply be dropped, as in (?:[ \t]+), but I suspect some of them are changing the meaning of the regex in ways you didn't intend.
For example, what's (?:|[^\x00-\x31\x127\"]) supposed to mean? With that pipe at the beginning, it's equivalent to [^\x00-\x31\x127\"]??--zero or one, reluctantly, of whatever the character class matches. Is that really what you intended?
The character class itself is highly suspect as well. It's obviously meant to match anything other than an ASCII control character or a quotation mark, but the numbers are decimal where they should be hexadecimal: [^\x00-\x1F\x7F\"]
I am not aware of any tool that can do this. I even strongly doubt there is something like a canonical form for regular expressions - they are complex enough that there are usually several and vastly different solutions.
If this expression is the output of a generator, it seems much more promising to me to (unit) test the code generator.
I'd just write it in an expanded form:
^
(?:
  (?: (?: \r\n (?:[ \t]+) )* )
  (<transfer-coding>
    (?: chunked
      | (?:
          (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
          (?:
            (?:
              ;
              (?:
                (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                =
                (?: (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                  | (?:
                      "
                      (?:
                        (?:
                          (?:
                            | [^\x00-\x31\x127\"]
                          )
                          | (?:\\[\x00-\x127])
                        )*
                      )
                    )
                )
              )
            )*
          )
        )
    )
  )
  (?:
    (?: (?: \r\n (?:[ \t]+) )* )
    ,
    (?: (?: \r\n (?:[ \t]+) )* )
    (<transfer-coding>
      (?: chunked
        | (?:
            (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
            (?:
              (?:
                ;
                (?:
                  (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                  =
                  (?: (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                    | (?:
                        "
                        (?:
                          (?:
                            (?:
                              | [^\x00-\x31\x127\"]
                            )
                            | (?:\\[\x00-\x127])
                          )*
                        )
                      )
                  )
                )
              )*
            )
          )
      )
    )
  )
)
$
You can quickly locate unnecessary grouping, and locate some errors. Some errors I saw:
Missing ? for the named groups. It should be (?<name> ).
No closing double quote (").
You can even use the regex in this form. If you supply the flag RegexOptions.IgnorePatternWhitespace when constructing the Regex object, any whitespace or comments (#) in the pattern will be ignored.
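As a small illustration of that flag, here is a sketch with a much shorter pattern than the one above (a comma-separated token list). The class name ExpandedPattern and the pattern itself are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExpandedPattern
{
    // Whitespace and # comments inside the pattern are ignored because of
    // RegexOptions.IgnorePatternWhitespace.
    private static readonly Regex TokenList = new Regex(@"
        ^
        (?: [A-Za-z0-9]+ )        # first token
        (?:
            ,                     # separator
            (?: [A-Za-z0-9]+ )    # next token
        )*
        $",
        RegexOptions.IgnorePatternWhitespace);

    public static bool IsTokenList(string input) => TokenList.IsMatch(input);
}
```

ExpandedPattern.IsTokenList("chunked,gzip") should be true, while "chunked, gzip" should be false: the spaces in the pattern are layout only, so a space in the input is still not allowed.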
Proving correctness is not a good motivation for normalization, because the normal form can be very obscure and totally unrecognizable.
To get correctness, you either 1) run a lot of tests on it 2) obtain the state machine and prove correctness by induction.
