Regex: Matching all words EXCEPT those inside of parenthesis (C#) - c#

So given:
COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5
How would I go about getting my matches as:
COLUMN_1
COLUMN_2
COLUMN_3
COLUMN_4
COLUMN_5
I've tried:
(?<!(\(.*?\)))(\w+)(,\s*\w+)*?
But I feel like I'm way off base :( I'm using regexstorm.net for testing.
Appreciate any help :)

You need a regex that keeps track of opening and closing parentheses and makes sure that a word is only matched if a balanced set of parentheses (or no parentheses at all) follows:
Regex regexObj = new Regex(
@"\w+ # Match a word
(?= # only if it's possible to match the following:
(?> # Atomic group (used to avoid catastrophic backtracking):
[^()]+ # Match any characters except parens
| # or
\( (?<DEPTH>) # a (, increasing the depth counter
| # or
\) (?<-DEPTH>) # a ), decreasing the depth counter
)* # any number of times.
(?(DEPTH)(?!)) # Then make sure the depth counter is zero again
$ # at the end of the string.
) # (End of lookahead assertion)",
RegexOptions.IgnorePatternWhitespace);
I tried to provide a test link to regexstorm.net, but it was too long for StackOverflow. Apparently, SO also doesn't like URL shorteners, so I can't link this directly, but you should be able to recreate the link easily: http://bit[dot]ly/2cNZS0O
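Since the test link could not be posted, here is a minimal sketch of how the pattern above might be exercised from C#. The class name OutsideParensDemo and the helper Words are made up for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OutsideParensDemo
{
    // Pattern from the answer above, as a verbatim string (@"...").
    private static readonly Regex WordOutsideParens = new Regex(
        @"\w+                 # Match a word
          (?=                 # only if the rest of the input is balanced:
            (?>               # atomic group
              [^()]+          # any characters except parens
            |
              \( (?<DEPTH>)   # an opening paren, pushed onto the DEPTH stack
            |
              \) (?<-DEPTH>)  # a closing paren, popped from the DEPTH stack
            )*
            (?(DEPTH)(?!))    # fail if the stack is not empty
            $                 # at the end of the string
          )",
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns every word that is not inside parentheses.
    public static string[] Words(string input)
    {
        var result = new List<string>();
        foreach (Match m in WordOutsideParens.Matches(input))
        {
            result.Add(m.Value);
        }
        return result.ToArray();
    }

    public static void Main()
    {
        string sql = "COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5";
        foreach (string word in Words(sql))
        {
            Console.WriteLine(word);
        }
    }
}
```

One caveat: the AS that follows the closing parenthesis is also outside any parentheses, so expect it among the matches. If only column names are wanted, it would have to be filtered out afterwards (for example by skipping known keywords).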

This should work:
(?<!\()COLUMN_[\d](?!\))
Try it: https://regex101.com/r/bC4D7n/1
Update:
Ok, then try to use this regular expression:
[\(]+[\w\s\W]+[\)]+
Demo here: https://regex101.com/r/bC4D7n/2

Matching all words except some set of them is one of the harder exercises you can do with regular expressions. In theory the recipe is simple: construct the finite automaton that accepts your original, non-negated predicate about the strings, swap its accepting and non-accepting states, and finally construct a regular expression equivalent to the resulting automaton. In practice this is hard to do by hand, so the easiest way to deal with it is to construct the regexp for the predicate you want to negate and pass your string through the regexp matcher: if it matches, just reject it.
The main problem is that, while all of this is easy for a computer, constructing a regular expression from an automaton description is tedious and normally does not give you the result you want (it is usually huge). Let me illustrate with an example:
You have asked for matching words, but from these words you want only the ones that don't appear in a given set. Let's suppose we want the automaton that matches precisely that set of words, and suppose we have matched the first n-1 letters of one of them. This string should be accepted, but only if the final letter does not come next. So the proper regexp should match all the letters of the word but the last, followed by something other than the last letter. We can also accept if the input matches all the letters but the last two and then diverges, and so on, back to the first letter (obviously, if the input doesn't begin with the first letter of the word, it is not that word anyway). Let's suppose the first word is BEGIN. A good regexp matching things that are not equal to BEGIN is something like this:
[^B]|B[^E]|BE[^G]|BEG[^I]|BEGI[^N]
A different scenario (one that complicates things further) is to find a regexp that matches a string only if the word BEGIN is not contained in it. Let's start from the opposite predicate, a string that contains the word BEGIN:
^.*BEGIN.*$
and let's construct its finite automaton:
(0)---B--->(1)---E--->(2)---G--->(3)---I--->(4)---N--->((5))
^ \ | | | | ^ \
| | | | | | | |
`-+<-------+<---------+<---------+<---------' `-+
where the double parentheses indicate an accepting state. If you just swap the accepting and non-accepting states, you'll get an automaton that accepts exactly the strings the first one didn't, and vice versa.
((0))--B-->((1))--E-->((2))--G-->((3))--I-->((4))--N-->(5)
^ \ | | | | ^ \
| | | | | | | |
`-+<--------+<---------+<---------+<---------' `-+
But converting this into a simple regular expression is far from easy (you can try, if you don't believe me).
And this is with only one word, so think about what it takes to match any of several words: construct the automaton, and then switch the accepting status of each state.
In your case there is something more to deal with, because your predicate is not equivalent to the one I have formulated. Mine decides whether a whole string contains a word (which is what regexps were conceived for), but yours is about matching groups inside the string. If you try my example, you will find that even the empty string "" is accepted by the second automaton, as the starting state ((0)) is an accepting state (the empty string doesn't contain the word BEGIN). But you want your regexp to match words (and "" isn't a word), so we first need to define what a word is for you and construct the regular expression that matches one:
[a-zA-Z][a-zA-Z]*
should be a good candidate. As an automaton it looks like this:
(0)---[a-zA-Z]--->((1))---[a-zA-Z]--.
^ \ | ^ |
| * * | |
`--+<-------------' `-------------'
and you want an automaton that accepts only when both conditions hold (1: it must be a word, and 2: it is not in the set of words). Not being in the set of words is the same as not being the first word, and not being the second, and not being the third, and so on; you can build it by first constructing an automaton that matches the first word, or the second, or the third, ..., and then negating it. Construct the first automaton, then the second, and then an automaton that accepts only what both accept. Again, this is easy for computers but not for people.
As I said, constructing an automaton from a regexp is easy and direct for a computer, but not for a person. Constructing a regexp from an automaton is also mechanical, but it produces huge regular expressions, and because of this most implementations have added extended operators that match when some regexp does not, and vice versa.
CONCLUSION
Use the negation operators that let you express the opposite predicate about the set of strings your regexp must accept, or simply construct a regexp that does the simple part and use Boolean logic in your code to do the rest.
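As a rough illustration of the second option in the conclusion (a simple regexp plus Boolean logic in code), here is a minimal C# sketch. The class WordFilterDemo, the method WordsExcept and the sample exclusion set are assumptions made for this example; only the word pattern [a-zA-Z][a-zA-Z]* comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFilterDemo
{
    // Step 1: a simple regexp finds every word.
    // Step 2: plain code rejects the words in the excluded set.
    public static List<string> WordsExcept(string input, ISet<string> excluded)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, "[a-zA-Z][a-zA-Z]*"))
        {
            if (!excluded.Contains(m.Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        var excluded = new HashSet<string> { "BEGIN", "END" };
        foreach (string word in WordsExcept("BEGIN the BEGINNING of work END", excluded))
        {
            Console.WriteLine(word);
        }
    }
}
```

The negation lives in an ordinary set lookup, so the regexp stays trivial and no automaton has to be complemented by hand.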

Since you have nested parentheses, things get trickier. Although the .NET regex engine provides balancing group constructs, which use a stack, I went with a more general approach called recursive matching. Note that (?R) is PCRE-style syntax and is not supported by the .NET engine itself.
Regex:
\((?(?!\(|\)).|(?R))*\)|(\w+)
Everything you need is in the first capturing group.
Explanation of left side of alternation:
\( # Match an opening bracket
(?(?!\(|\)) # If next character is not `(` or `)`
. # Then match it
| # Otherwise
(?R) # Recurse into the whole pattern
)* # As much as possible
\) # Up to corresponding closing bracket
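The same "skip a balanced parenthesized block, otherwise capture a word" idea can be sketched for .NET by swapping the recursion for a balancing group, since to my knowledge (?R) is not available there. This is an adaptation, not the pattern from the answer; the class SkipParensDemo, the method CapturedWords and the group name d are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SkipParensDemo
{
    // Left side: consume a balanced (...) block, using the stack "d" for nesting.
    // Right side: capture a word into group 1.
    private static readonly Regex SkipOrWord = new Regex(
        @"\((?>[^()]+|\((?<d>)|\)(?<-d>))*(?(d)(?!))\)|(\w+)");

    public static List<string> CapturedWords(string input)
    {
        var result = new List<string>();
        foreach (Match m in SkipOrWord.Matches(input))
        {
            // Group 1 only participates when the right side of the alternation matched.
            if (m.Groups[1].Success)
            {
                result.Add(m.Groups[1].Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        string sql = "COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5";
        Console.WriteLine(string.Join(Environment.NewLine, CapturedWords(sql)));
    }
}
```

As with any approach that only looks at parentheses, the AS outside the parenthesized block is captured too and would need separate filtering.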

Related

C# Regular expression to match on a character not following pairs of the same character

Objective: Regex Matching
For this example I'm interested in matching a "|" pipe character.
I need to match it if it's alone: "aaa|aaa"
I need to match it (the last pipe) only if it's preceded by pairs of pipe: (2,4,6,8...any even number)
Another way: I want to ignore ALL pipe pairs "||" (right to left)
or I want to select bachelor bars only (the odd man out)
string twomatches = "aaaaaaaaa|||||aaaaaa|||aaaaaa"; // the last pipe of each run should match
string onematch = "aaaaaaaaa|||aaaaaaa||aaaaaaaa"; // only the last pipe of the first run should match
string noMatch1 = "||";
string noMatch2 = "||||";
I'm trying to select the last "|" only when preceded by an even sequence of "|" pairs or in a string when a single bar exists by itself.
Regardless of the number of "|"
You may use the following regex to select just the odd pipe out:
(?<=(?<!\|)(?:\|{2})*)\|(?!\|)
See regex demo.
The regex breakdown:
(?<=(?<!\|)(?:\|{2})*) - if a pipe is preceded with an even number of pipes ((?:\|{2})* - 0 or more sequences of exactly 2 pipes) from a position that has no preceding pipe ((?<!\|))
\| - match an odd pipe on the right
(?!\|) - if it is not followed by another pipe.
Please note that this regex uses a variable-width look-behind and is very resource-consuming. I'd rather use a capturing group mechanism here, but it all depends on the actual purpose of matching that odd pipe.
Here is a modified version of the regex for removing the odd one out:
var s = "1|2||3|||4||||5|||||6||||||7|||||||";
var data = Regex.Replace(s, @"(?<!\|)(?<even_pipes>(?:\|{2})*)\|(?!\|)", "${even_pipes}");
Console.WriteLine(data);
See IDEONE demo. Here, the quantified part is moved from lookbehind to an even_pipes named capturing group, so that it could be restored with the backreference in the replaced string. Regexhero.net shows 129,046 iterations per second for the version with a capturing group and 69,206 with the original version with variable-width lookbehind.
Only use variable-width look-behind if it is absolutely necessary!
Oh, it's reopened! If you need better performance, also try this negative improved version.
\|(?!\|)(?<!(?:[^|]|^)(?:\|\|)*)
The idea here is to first match the last literal | at right side of a sequence or single | and execute a negated version of the lookbehind just after the match. This should perform considerably better.
\|(?!\|) matches literal | IF NOT followed by another pipe character (right most if sequence).
(?<!(?:[^|]|^)(?:\|\|)*) IF the position right after the matched | IS NOT preceded by (?:\|\|)* any amount of literal || back to a non-| character or the ^ start. In other words: if this position is not preceded by an even number of pipe characters.
Btw, there is no performance gain in using \|{2} over \|\|, though it might be more readable.
See demo at regexstorm
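To make the two main patterns above easier to try out, here is a small C# sketch. The class OddPipeDemo and its methods CountOddPipes and RemoveOddPipes are hypothetical names; the regexes are the lookbehind version and the even_pipes capturing-group version quoted earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OddPipeDemo
{
    // Variable-width lookbehind version: matches only the odd pipe out.
    private static readonly Regex OddPipe =
        new Regex(@"(?<=(?<!\|)(?:\|{2})*)\|(?!\|)");

    // Capturing-group version: the even run is captured so it can be restored.
    private static readonly Regex OddPipeWithGroup =
        new Regex(@"(?<!\|)(?<even_pipes>(?:\|{2})*)\|(?!\|)");

    public static int CountOddPipes(string s)
    {
        return OddPipe.Matches(s).Count;
    }

    public static string RemoveOddPipes(string s)
    {
        return OddPipeWithGroup.Replace(s, "${even_pipes}");
    }

    public static void Main()
    {
        Console.WriteLine(CountOddPipes("aaa|aaa"));
        Console.WriteLine(CountOddPipes("aaaaaaaaa|||||aaaaaa|||aaaaaa"));
        Console.WriteLine(CountOddPipes("||||"));
        Console.WriteLine(RemoveOddPipes("1|2||3|||4||||5|||||6||||||7|||||||"));
    }
}
```

Runs with an even number of pipes are left alone, and runs with an odd number lose exactly one pipe.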

Regular expression performance issue

I've got a long string which contains about 100 parameters (string parameterName) matching the following pattern:
parameterName + "(Whitespace | CarriageReturn | Tabulation | Period | Underline | Digit | Letter | QuotationMark | Slash)* DATA Whitespace Hexadecimal"
I've tried to use this regular expression, but it takes far too long:
parameterName + "[\\s\\S]*DATA\\s0x[0-9A-F]{4,8}"
This messy one works a little better:
parameterName + "(\\s|\r|\n|\t|\\.|[_0-9A-z]|\"|/)*DATA\\s0x[0-9A-F]{4,8}"
I'd use ".*", however, it doesn't match "\n".
I've tried "(.|\n)", but it works even slower than "[\s\S]".
Is there any way to improve this regular expression?
You can use something like
(?>(?>[^D]+|D(?!ATA))*)DATA\\s0x[0-9A-F]{4,8}
(?> # atomic grouping (no backtracking)
(?> # atomic grouping (no backtracking)
[^D]+ # anything but a D
| # or
D(?!ATA) # a D not followed by ATA
)* # zero or more times
)
The idea
The idea is to get to the DATA without asking ourselves any question, and to not go any further and then backtrack to it.
If you use .*DATA on a string like DATA321, see what the regex engine does:
.* eats up all the string
There's no DATA to be found, so step by step the engine will backtrack and try these combinations: .* will eat only DATA32, then DATA3, then DATA... then nothing and that's when we find our match.
Same thing happens if you use .*?DATA on 123DATA: .*? will try to match nothing, then 1, then 12...
On each try we have to check there is no DATA after the place where .* stopped, and this is time consuming. With the [^D]+|D(?!ATA) we ensure we stop exactly when we need to - not before, not after.
Beware of backtracking
So why not use (?:[^D]|D(?!ATA)) instead of these weird atomic grouping?
This is all good and works fine when there is a match to be found. But what happens when there isn't? Before declaring failure, the regex engine has to try ALL possible combinations. And when you have something like (.*)*, at each character the regex engine can use either the inner * or the outer one.
Which means the number of combinations very rapidly becomes huge. We do not want to try all of these: we know that we stopped at the right place, so if we didn't find a match right away we never will. Hence the atomic grouping (apparently .NET doesn't support possessive quantifiers).
You can see what I mean over here: 80'000 steps to check a 15 character long string that will never match.
This is discussed more in depth (and better put than what I could ever do) in this great article by Friedl, regex guru, over here
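Put into the C# context of the question, the atomic-group pattern could be wired up roughly as follows. This is a sketch under assumptions: the class ParameterSearchDemo, the method HasData and the sample text are invented for illustration, and Regex.Escape is added in case a parameter name contains regex metacharacters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParameterSearchDemo
{
    // Builds parameterName + the atomic-group pattern from the answer above.
    public static bool HasData(string text, string parameterName)
    {
        string pattern = Regex.Escape(parameterName)
            + @"(?>(?>[^D]+|D(?!ATA))*)DATA\s0x[0-9A-F]{4,8}";
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        // Hypothetical input: one parameter with a DATA entry, one without.
        string text = "PARAM_A\r\n  \"Desc/1.0\"\r\n  DATA 0x1A2B\r\nPARAM_B no data here";
        Console.WriteLine(HasData(text, "PARAM_A"));
        Console.WriteLine(HasData(text, "PARAM_B"));
    }
}
```

Because [^D] is a negated character class, it also matches line breaks, so no [\s\S] workaround is needed. Like the original [\s\S]* attempt, this pattern does not stop at the next parameter name, so it may run on to a later DATA entry.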

is the regular expression "|" supposed to match everything or only the empty string?

In .NET it matches everything. In Java only empty string.
That's just the way their regex engines are implemented.
But how can I force the .NET regex engine to match | against only the empty string or the Java regex engine to match it against everything?
| is OR for regular expressions in java:
X|Y --> Either X or Y
In Java, the character | in a regular expression is an operator that matches the expressions on either side of the pipe.
Depending on what you mean by match everything, in Java you can use .*, which matches any character zero or more times.
| stands for regular OR Operation.
More information:
Source http://www.regular-expressions.info/alternation.html
If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol | : cat|dog.
If you want more options, simply expand the list: cat|dog|mouse|fish.
Hope it helps! :)
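A likely explanation for the difference described in the question is the method being called rather than the engine: Java's String.matches and Matcher.matches require the whole input to match, while .NET's Regex.IsMatch succeeds if a match is found anywhere. Under that assumption, anchoring the pattern in .NET should give the "only the empty string" behaviour. The class and method names below are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyAlternationDemo
{
    // "|" is an alternation of two empty alternatives, so it matches
    // the empty string at any position in any input.
    public static bool MatchesAnywhere(string input)
    {
        return Regex.IsMatch(input, "|");
    }

    // Anchored with \A and \z, the same alternation must span the whole input,
    // which only the empty string allows.
    public static bool MatchesWholeInput(string input)
    {
        return Regex.IsMatch(input, @"\A(?:|)\z");
    }

    public static void Main()
    {
        Console.WriteLine(MatchesAnywhere("abc"));
        Console.WriteLine(MatchesWholeInput("abc"));
        Console.WriteLine(MatchesWholeInput(""));
    }
}
```

Going the other way, a Java caller who wants "match anywhere" semantics would use Matcher.find() instead of matches().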

Regex IsMatch taking too long to execute

I have one strange issue on my .NET project with RegEx. Please, see C# code below:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$";
const string VALUE = "Ingebrigtsen Myre (Øvre)";
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(PATTERN);
if (!regex.IsMatch(VALUE)) // <--- Infinite loop here
return string.Empty;
// Some other code
I use this pattern to validate all types of names (first names, last names, middle names, etc.). Value is a parameter, but I provided it as a constant above, because the issue is not reproduced often - only with special symbols: *, (, ), etc. (sorry, but I don't have the full list of these symbols).
Can you help me to fix this infinite loop? Thanks for any help.
Added: this code is placed on the very base level of project and I don't want to do any refactoring there - I just want to have quick fix for this issue.
Added 2: I do know that it technically is not a loop - I meant that "regex.IsMatch(VALUE)" never ends. I waited for about an hour and it was still executing.
Your non-trivial regex: ^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$, is better written with comments in free-spacing mode like so:
Regex re_orig = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
( # $1: Zero or more additional parts.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
( # $2: optional quote.
' # Allow quote but only
(?!') # if not followed by quote.
)? # End $2: optional quote.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
)* # End $1: Zero or more additional parts.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
In English, this regex essentially says: "Match a string that begins with an alpha letter [a-zA-Z] followed by zero or more alpha letters, whitespaces, periods, hyphens or single quotes, but each single quote may not be immediately followed by another single quote."
Note that your above regex allows oddball names such as: "ABC---...'... -.-.XYZ " which may or may not be what you need. It also allows multi-line input and strings that end with whitespace.
The "infinite loop" problem with the above regex is that catastrophic backtracking occurs when this regex is applied to a long invalid input which contains two single quotes in a row. Here is an equivalent pattern which matches (and fails to match) the exact same strings, but does not experience catastrophic backtracking:
Regex re_fixed = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
[-\s.a-zA-Z]* # Zero or more valid name chars.
(?: # Zero or more isolated single quotes.
' # Allow single quote but only
(?!') # if not followed by single quote.
[-\s.a-zA-Z]* # Zero or more valid name chars.
)* # Zero or more isolated single quotes.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
And here it is in short form in your code context:
const string PATTERN = @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$";
Look at this part of your regex:
( [-\s\.a-zA-Z]* ('(?!'))? [-\s\.a-zA-Z]* )*$
^ ^ ^ ^ ^
| | | | |
| | | | This group repeats any number of times
| | | charclass repeats any number of times
| | This group is optional
| This character class also repeats any number of times
Outer group (repeated, as seen above)
That means that as soon as your input string contains a character that's not in the character class (like the brackets and non-ASCII letter in your example), the preceding characters will be tried in a lot of permutations whose number increases exponentially with the length of the string.
To avoid that (and to allow a faster failure of the regex), use atomic groups:
const string PATTERN = @"^[a-zA-Z](?>(?>[-\s\.a-zA-Z]*)(?>'(?!'))?(?>[-\s\.a-zA-Z])*)*$";
You've got an "any number of any number" here:
...[-\s\.a-zA-Z]*)*
and because your input doesn't match, the engine backtracks to try all permutations of dividing the input up, and the number of attempts grows exponentially with the length of the input.
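You can fix it simply by adding a "+" to make a possessive quantifier, which once consumed will not backtrack to find other combinations (note that this syntax works in engines such as Ruby's or Java's; .NET does not support possessive quantifiers, so an atomic group (?>...) is the equivalent there):
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*+)*$";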
^-- added + here
You can see a live demo (on rubular) demonstrating that adding the plus fixed the loop problem, and still matches input that doesn't have the odd characters.
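For a quick fix in .NET, the short non-backtracking pattern from the first answer can be dropped in as below. This is only a sketch: the class NameValidatorDemo and the method IsValidName are hypothetical, and the one-second match timeout is an extra safety net that none of the answers mention (the Regex constructor overload taking a TimeSpan exists since .NET 4.5).

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameValidatorDemo
{
    // Each quote-delimited segment is matched exactly once, so a failing
    // input cannot be split up in exponentially many ways.
    private static readonly Regex NamePattern = new Regex(
        @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$",
        RegexOptions.None,
        TimeSpan.FromSeconds(1)); // assumption: treat a timeout as "invalid"

    public static bool IsValidName(string value)
    {
        try
        {
            return NamePattern.IsMatch(value);
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValidName("Ingebrigtsen Myre (\u00D8vre)"));
        Console.WriteLine(IsValidName("O'Brien"));
        Console.WriteLine(IsValidName("O''Brien"));
    }
}
```

The value from the question is rejected right away, because the parenthesis is simply not in the character class and there is nothing left for the engine to re-partition.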

What is this regex supposed to match - h*

I have a piece of code that is supposed to look through a list of strings to match a regular expression whose pattern is an input from the user. Inputs such as
h*
q*
y*
seem to match anything and everything. My questions -
Is any of the above a valid regex pattern at all?
If yes, what exactly are they supposed to match?
I've gone through http://regexhero.net/reference/ but couldn't find anything that specifies such expression.
I've used http://regexhero.net/tester/ to check what my regex matches with q* as the Regular Expression and Whatever as the Target String. It gives me 9 matches!
h* means zero or more h characters
The same for the others
These patterns match any number of the specified character, including zero. Without any anchors, there are 9 places where there are zero q in whatever (between the characters and at the ends).
Out of your reference:
Ordinary characters — Characters other than . $ ^ { [ ( | ) * + ? \ match themselves.
* — Repeat 0 or more times matching as many times as possible.
