Regex matching excluding a specific context - c#

I'm trying to search a string for words within single quotes, but only if those single quotes are not within parentheses.
Example string:
something, 'foo', something ('bar')
So for the given example I'd like to match foo, but not bar.
After searching for regex examples I'm able to match within single quotes (see below code snippet), but am not sure how to exclude matches in the context previously described.
string line = "something, 'foo', something ('bar')";
Match name = Regex.Match(line, @"'([^']*)");
if (name.Success)
{
string matchedName = name.Groups[1].Value;
Console.WriteLine(matchedName);
}

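I would recommend using lookarounds instead (a negative lookbehind plus a negative lookahead). Note that this only rejects a quoted word whose quotes directly touch the parentheses, as in ('bar'):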
(?<!\()'([^']*)'(?!\))
Or with C#:
string line = "something, 'foo', something ('bar')";
Match name = Regex.Match(line, @"(?<!\()'([^']*)'(?!\))");
if (name.Success)
{
Console.WriteLine(name.Groups[1].Value);
}

The easiest way to get what you need is to use an alternation group and match and capture what you need and only match what you do not need:
\([^()]*\)|'([^']*)'
See the regex demo
Details:
\( - a (
[^()]* - 0+ chars other than ( and )
\) - a )
| - or
' - a '
([^']*) - Group 1 capturing 0+ chars other than '
' - a single quote.
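In C#, use .Groups[1].Value to get the values you need, skipping the matches where Group 1 did not participate (the parenthesized parts). See the online demo:
var str = "something, 'foo', something ('bar')";
var result = Regex.Matches(str, @"\([^()]*\)|'([^']*)'")
    .Cast<Match>()
    .Where(m => m.Groups[1].Success) // drop matches of the \([^()]*\) branch
    .Select(m => m.Groups[1].Value)
    .ToList();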
Another alternative is the one mentioned by Thomas, but since it is .NET, you may use infinite-width lookbehind:
(?<!\([^()]*)'([^']*)'(?![^()]*\))
See this regex demo.
Details:
(?<!\([^()]*) - a negative lookbehind failing the match if there is ( followed with 0+ chars other than ( and ) up to
'([^']*)' - a quote, 0+ chars other than single quote captured into Group 1, and another single quote
(?![^()]*\)) - a negative lookahead that fails the match if there are 0+ chars other than ( and ) followed with ) right after the ' from the preceding subpattern.
Since you'd want to exclude ', the same code as above applies.
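As a minimal sketch of the lookbehind variant above, the pattern can be wrapped in a small helper. The class name `QuotedOutsideParens` and the method `Extract` are made up for illustration, and the second sample string (with extra text inside the parentheses) is an assumption added to show the difference from the adjacent-only lookaround.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedOutsideParens
{
    // Hypothetical helper: returns the quoted words that are not inside (...).
    // Relies on .NET's support for unbounded-width lookbehind.
    public static List<string> Extract(string input)
    {
        return Regex.Matches(input, @"(?<!\([^()]*)'([^']*)'(?![^()]*\))")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value) // Group 1 holds the text without quotes
            .ToList();
    }

    public static void Main()
    {
        // The quotes do not touch the parentheses here, yet 'bar' is still skipped.
        foreach (var word in Extract("something, 'foo', something ( x 'bar' y )"))
            Console.WriteLine(word); // prints "foo"
    }
}
```

Like the patterns above, this assumes the parentheses are not nested.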

Related

How to extract text that lies between parentheses

I have a string like (CAT,A)(DOG,C)(MOUSE,D)
I want to get the value C for DOG using a regular expression.
I tried the following:
Match match = Regex.Match(rspData, @"\(DOG,*?\)");
if (match.Success)
    Console.WriteLine(match.Value);
But it is not working. Could anyone help me solve this issue?
You can use
(?<=\(DOG,)\w+(?=\))
(?<=\(DOG,)[^()]*(?=\))
See the regex demo.
Details:
(?<=\(DOG,) - a positive lookbehind that matches a location that is immediately preceded with (DOG, string
\w+ - one or more letters, digits, connector punctuation
[^()]* - zero or more chars other than ( and )
(?=\)) - a positive lookahead that matches a location that is immediately followed with ).
As an alternative you can also use a capture group:
\(DOG,([^()]*)\)
Explanation
\(DOG, Match (DOG,
([^()]*) Capture group 1, match 0+ chars other than ( or )
\) Match )
Regex demo | C# demo
String rspData = "(CAT,A)(DOG,C)(MOUSE,D)";
Match match = Regex.Match(rspData, @"\(DOG,([^()]*)\)");
if (match.Success)
{
    Console.WriteLine(match.Groups[1].Value);
}
Output
C
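If the key is not always DOG, the capture-group pattern can be built per key. This is only a sketch: the class `PairLookup` and the method `ValueFor` are hypothetical names, and `Regex.Escape` is used so that a key containing regex metacharacters is treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PairLookup
{
    // Hypothetical helper: returns the value paired with `key` in a string
    // shaped like "(CAT,A)(DOG,C)(MOUSE,D)", or null when the key is absent.
    public static string ValueFor(string data, string key)
    {
        var m = Regex.Match(data, @"\(" + Regex.Escape(key) + @",([^()]*)\)");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string rspData = "(CAT,A)(DOG,C)(MOUSE,D)";
        Console.WriteLine(ValueFor(rspData, "DOG"));   // prints "C"
        Console.WriteLine(ValueFor(rspData, "MOUSE")); // prints "D"
    }
}
```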

Regex match all words enclosed by parentheses and separated by a pipe

I think an image is sometimes better than words.
My problem, as you can see, is that it only matches the words two by two. How can I match all of the words?
My current regex (PCRE): ([^\|\(\)\|]+)\|([^\|\(\)\|]+)
The goal: retrieve all the words, each in a separate group.
You can use an infinite length lookbehind in C# (with a lookahead):
(?<=\([^()]*)\w+(?=[^()]*\))
To match any kind of strings inside parentheses, that do not consist of (, ) and |, you will need to replace \w+ with [^()|]+:
(?<=\([^()]*)[^()|]+(?=[^()]*\))
// ^^^^^^
See the regex demo (and regex demo #2). Details:
(?<=\([^()]*) - a positive lookbehind that matches a location that is immediately preceded with ( and then zero or more chars other than ( and )
\w+ - one or more word chars
(?=[^()]*\)) - a positive lookahead that matches a location that is immediately followed with zero or more chars other than ( and ) and then a ) char.
Another way to capture these words is by using
(?:\G(?!^)\||\()(\w+)(?=[^()]*\)) // words as units consisting of letters/digits/diacritics/connector punctuation
(?:\G(?!^)\||\()([^()|]+)(?=[^()]*\)) // "words" that consist of any chars other than (, ) and |
See this regex demo. The words you need are now in Group 1. Details:
(?:\G(?!^)\||\() - a position after the previous match (\G(?!^)) and a | char (\|), or (|) a ( char (\()
(\w+) - Group 1: one or more word chars
(?=[^()]*\)) - a positive lookahead that makes sure there is a ) char after any zero or more chars other than ( and ) to the right of the current position.
Extracting the matches in C# can be done with
var matches = Regex.Matches(text, @"(?<=\([^()]*)\w+(?=[^()]*\))")
    .Cast<Match>()
    .Select(x => x.Value);
// Or
var matches = Regex.Matches(text, @"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
    .Cast<Match>()
    .Select(x => x.Groups[1].Value);
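For reference, here is a self-contained sketch that runs both patterns above on one input and shows that they yield the same words. The class name `PipeWords`, its two methods, and the sample sentence are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipeWords
{
    // Lookbehind variant: the whole match is the word.
    public static List<string> WithLookbehind(string text)
    {
        return Regex.Matches(text, @"(?<=\([^()]*)\w+(?=[^()]*\))")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    // \G variant: the word is in Group 1; the match itself also
    // includes the leading "(" or "|".
    public static List<string> WithContinuation(string text)
    {
        return Regex.Matches(text, @"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    public static void Main()
    {
        string text = "I love (chocolate|fish|honey|more) and tea";
        Console.WriteLine(string.Join(",", WithLookbehind(text)));   // chocolate,fish,honey,more
        Console.WriteLine(string.Join(",", WithContinuation(text))); // chocolate,fish,honey,more
    }
}
```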
In c# you can also make use of the group captures using a capture group.
The matches are in named group word
\((?<word>\w+)(?:\|(?<word>\w+))*\)
\( Match (
(?<word>\w+) Match 1+ word chars in group word
(?: Non capture group
\| Match |
(?<word>\w+) Match 1+ word chars
)* Close the non capture group and optionally repeat to get all occurrences
\) Match the closing parenthesis
Code example provided by Wiktor Stribiżew in the comments:
var line = "I love (chocolate|fish|honey|more)";
var output = Regex.Matches(line, @"\((?<word>\w+)(?:\|(?<word>\w+))*\)")
.Cast<Match>()
.SelectMany(x => x.Groups["word"].Captures);
foreach (var s in output)
Console.WriteLine(s);
Output
chocolate
fish
honey
more
Regex demo

Regex pattern for splitting a delimited string in curly braces

I have the following string
{token1;token2;token3#somewhere.com;...;tokenn}
I need a Regex pattern, that would give a result in array of strings such as
token1
token2
token3#somewhere.com
...
...
...
tokenn
Would also appreciate a suggestion if can use the same pattern to confirm the format of the string, means string should start and end in curly braces and at least 2 values exist within the anchors.
You may use an anchored regex with named repeated capturing groups:
\A{(?<val>[^;]*)(?:;(?<val>[^;]*))+}\z
See the regex demo
\A - start of string
{ - a {
(?<val>[^;]*) - Group "val" capturing 0+ (due to * quantifier, if the value cannot be empty, use +) chars other than ;
(?:;(?<val>[^;]*))+ - 1 or more occurrences (thus, requiring at least 2 values inside {...}) of the sequence:
; - a semi-colon
(?<val>[^;]*) - Group "val" capturing 0+ chars other than ;
} - a literal }
\z - end of string.
.NET regex keeps each capture in a CaptureCollection stack, which is why all the values captured into the "val" group can be accessed after a match is found.
C# demo:
var s = "{token1;token2;token3;...;tokenn}";
var pat = @"\A{(?<val>[^;]*)(?:;(?<val>[^;]*))+}\z";
var caps = new List<string>();
var result = Regex.Match(s, pat);
if (result.Success)
{
caps = result.Groups["val"].Captures.Cast<Capture>().Select(t=>t.Value).ToList();
}
Read this (it is similar to your problem): How to keep the delimiters of Regex.Split?
For your regex testing, use this: http://www.regexlib.com/RETester.aspx?AspxAutoDetectCookieSupport=1.
However, a regex is usually heavier than a plain string operation for a simple delimiter split.
In your case it would be better to use the Split method of the string class, for example: "token1;token2;token3;...;tokenn".Split(';');. It returns the array of strings that you want to obtain.
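A rough sketch of the Split-based route that also checks the format the question asks for (the string starts and ends with curly braces and holds at least two values). The class `BraceList` and the method `Parse` are hypothetical names, and returning null for an invalid string is just one possible convention.

```csharp
using System;

public static class BraceList
{
    // Hypothetical helper: returns the values between { and }, split on ';',
    // or null when the braces are missing or fewer than 2 values are present.
    public static string[] Parse(string s)
    {
        if (s == null || s.Length < 2 || s[0] != '{' || s[s.Length - 1] != '}')
            return null;

        // Drop the surrounding braces, then split on the delimiter.
        string[] parts = s.Substring(1, s.Length - 2).Split(';');
        return parts.Length >= 2 ? parts : null;
    }

    public static void Main()
    {
        string[] tokens = Parse("{token1;token2;token3}");
        if (tokens != null)
        {
            foreach (string t in tokens)
                Console.WriteLine(t);
        }
    }
}
```

As with the [^;]* groups in the regex answer, empty values are accepted here; reject them explicitly if they should not be allowed.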

Split string on whitespace ignoring parenthesis

I have a string such as this
(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))
I need to split it into two such as this
ed
Karlsruhe Univ. (TH) (Germany, F.R.)
Basically, ignoring whitespace and parenthesis within a parenthesis
Is it possible to use a regex to achieve this?
If you can have more parentheses, it's better to use balancing groups:
string text = "(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))";
var charSetOccurences = new Regex(@"\(((?:[^()]|(?<o>\()|(?<-o>\)))+(?(o)(?!)))\)");
var charSetMatches = charSetOccurences.Matches(text);
foreach (Match match in charSetMatches)
{
Console.WriteLine(match.Groups[1].Value);
}
ideone demo
Breakdown:
\(( # First '(' and begin capture
(?:
[^()] # Match all non-parens
|
(?<o> \( ) # Match '(', and capture into 'o'
|
(?<-o> \) ) # Match ')', and delete the 'o' capture
)+
(?(o)(?!)) # Fails if 'o' stack isn't empty
)\) # Close capture and match the final ')'
\((.*?)\)\s*\((.*)\)
you will get the two values in two match groups \1 and \2
demo here : http://regex101.com/r/rP5kG2
and this is what you get if you search and replace with the pattern \1\n\2 which also seems to be what you need exactly
string str = "(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))";
Regex re = new Regex(@"\((.*?)\)\s*\((.*)\)");
Match match = re.Match(str);
if (match.Success)
    Console.WriteLine(match.Groups[1].Value + "\n" + match.Groups[2].Value);
In general, no.
You can't describe recursive (arbitrarily nested) patterns with a classical regular expression, since a finite automaton cannot recognize them.
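When a regex is not an option, nesting can be handled by counting depth by hand. The following is a sketch under that assumption, not something taken from the answers above; the class `TopLevelGroups` and the method `Split` are illustrative names.

```csharp
using System;
using System.Collections.Generic;

public static class TopLevelGroups
{
    // Hypothetical helper: returns the contents of each top-level (...) group,
    // keeping any nested parentheses inside the returned text.
    public static List<string> Split(string text)
    {
        var result = new List<string>();
        int depth = 0;   // current nesting level
        int start = -1;  // index just after the opening '(' of the current top-level group

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '(')
            {
                if (depth == 0) start = i + 1;
                depth++;
            }
            else if (c == ')' && depth > 0)
            {
                depth--;
                if (depth == 0) result.Add(text.Substring(start, i - start));
            }
        }
        return result; // an unclosed group is silently ignored
    }

    public static void Main()
    {
        foreach (var part in Split("(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))"))
            Console.WriteLine(part);
    }
}
```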

Regex problems with equal sign?

In C# I'm trying to validate a string that looks like:
I#paramname='test'
or
O#paramname=2827
Here is my code:
string t1 = "I#parameter='test'";
string r = @"^([Ii]|[Oo])#\w=\w";
var re = new Regex(r);
If I take the "=\w" off the end of variable r I get True. If I add "=\w" after the \w it's False. I want the characters between # and = to be any alphanumeric value. Anything after the = sign can contain alphanumerics and ' (single quotes). What am I doing wrong here? I have very rarely used regular expressions and can normally find an example, but this is a custom format, and even with cheat sheets I'm having issues.
^([Ii]|[Oo])#\w+=(?<q>'?)[\w\d]+\k<q>$
Regular expression:
^ start of line
([Ii]|[Oo]) either (I or i) or (O or o)
\w+ 1 or more word characters
= equals sign
(?<q>'?) capture 0 or 1 quotes in named group q
[\w\d]+ 1 or more word or digit characters
\k<q> repeat of what was captured in named group q
$ end of line
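A short sketch of how the pattern above could be used for validation, keeping the # separator exactly as it appears in the examples. The class `ParamFormat` and the method `IsValid` are hypothetical names; the third test string (an unbalanced quote) is added to show what the \k<q> backreference buys you.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParamFormat
{
    // The named group q captures an optional opening quote, and \k<q>
    // requires the same thing (a quote or nothing) at the end of the value.
    private static readonly Regex Pattern =
        new Regex(@"^([Ii]|[Oo])#\w+=(?<q>'?)[\w\d]+\k<q>$");

    public static bool IsValid(string s)
    {
        return Pattern.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("I#parameter='test'")); // True
        Console.WriteLine(IsValid("O#paramname=2827"));   // True
        Console.WriteLine(IsValid("I#parameter='test"));  // False: unbalanced quote
    }
}
```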
Use \w+ instead of \w to match one or more characters, or \w* to match zero or more:
Try this: Live demo
^([Ii]|[Oo])#\w+=\'*\w+\'*
If you are being a bit more strict with using paramname:
^([Ii]|[Oo])#paramname=[']?[\w]+[']?
Here is a demo
You could try something like this:
Regex rx = new Regex( @"^([IO])#(\w+)=(.*)$" , RegexOptions.IgnoreCase ) ;
Match group 1 will give you the value of I or O (the parameter direction?)
Match group 2 will give you the name of the parameter
Match group 3 will give you the value of the parameter
You could be stricter about the 3rd group and match it as
(([^']+)|('(('')|([^']+))*'))
The first alternative matches 1 or more non-quote characters; the second alternative matches a quoted string literal with any internal (embedded) quotes escaped by doubling them, so it would match things like
'' (the empty string)
'foo bar'
'That''s All, Folks!'
