Split string on whitespace ignoring parenthesis - c#

I have a string such as this
(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))
I need to split it into two such as this
ed
Karlsruhe Univ. (TH) (Germany, F.R.)
Basically, ignoring whitespace and parenthesis within a parenthesis
Is it possible to use a regex to achieve this?

If you can have more parentheses, it's better to use balancing groups:
string text = "(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))";
var charSetOccurences = new Regex(#"\(((?:[^()]|(?<o>\()|(?<-o>\)))+(?(o)(?!)))\)");
var charSetMatches = charSetOccurences.Matches(text);
foreach (Match match in charSetMatches)
{
Console.WriteLine(match.Groups[1].Value);
}
ideone demo
Breakdown:
\(( # First '(' and begin capture
(?:
[^()] # Match all non-parens
|
(?<o> \( ) # Match '(', and capture into 'o'
|
(?<-o> \) ) # Match ')', and delete the 'o' capture
)+
(?(o)(?!)) # Fails if 'o' stack isn't empty
)\) # Close capture and last opening brace

\((.*?)\)\s*\((.*)\)
you will get the two values in two match groups \1 and \2
demo here : http://regex101.com/r/rP5kG2
and this is what you get if you search and replace with the pattern \1\n\2 which also seems to be what you need exactly

string str = "(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))";
Regex re = new Regex(#"\((.*?)\)\s*\((.*)\)");
Match match = re.Match(str);

In general, No.
You can't describe recursive patterns in regular expression. ( Since it's not possible to recognize it with a finite automaton. )

Related

How to extract text that lies between parentheses

I have string like (CAT,A)(DOG,C)(MOUSE,D)
i want to get the DOG value C using Regular expression.
i tried following
Match match = Regex.Match(rspData, #"\(DOG,*?\)");
if (match.Success)
Console.WriteLine(match.Value);
But not working could any one help me to solve this issue.
You can use
(?<=\(DOG,)\w+(?=\))?
(?<=\(DOG,)[^()]*(?=\))
See the regex demo.
Details:
(?<=\(DOG,) - a positive lookbehind that matches a location that is immediately preceded with (DOG, string
\w+ - one or more letters, digits, connector punctuation
[^()]* - zero or more chars other than ( and )
(?=\)) - a positive lookahead that matches a location that is immediately followed with ).
As an alternative you can also use a capture group:
\(DOG,([^()]*)\)
Explanation
\(DOG, Match (DOG,
([^()]*) Capture group 1, match 0+ chars other than ( or )
\) Match )
Regex demo | C# demo
String rspData = "(CAT,A)(DOG,C)(MOUSE,D)";
Match match = Regex.Match(rspData, #"\(DOG,([^()]*)\)");
if (match.Success)
Console.WriteLine(match.Groups[1].Value);
}
Output
C

Regex match all words enclosed by parentheses and separated by a pipe

I think an image a better than words sometimes.
My problem as you can see, is that It only matches two words by two. How can I match all of the words ?
My current regex (PCRE) : ([^\|\(\)\|]+)\|([^\|\(\)\|]+)
The goal : retrieve all the words in a separate groupe for each of them
You can use an infinite length lookbehind in C# (with a lookahead):
(?<=\([^()]*)\w+(?=[^()]*\))
To match any kind of strings inside parentheses, that do not consist of (, ) and |, you will need to replace \w+ with [^()|]+:
(?<=\([^()]*)[^()|]+(?=[^()]*\))
// ^^^^^^
See the regex demo (and regex demo #2). Details:
(?<=\([^()]*) - a positive lookbehind that matches a location that is immediately preceded with ( and then zero or more chars other than ( and )
\w+ - one or more word chars
(?=[^()]*\)) - a positive lookahead that matches a location that is immediately followed with zero or more chars other than ( and ) and then a ) char.
Another way to capture these words is by using
(?:\G(?!^)\||\()(\w+)(?=[^()]*\)) // words as units consisting of letters/digits/diacritics/connector punctuation
(?:\G(?!^)\||\()([^()|]+)(?=[^()]*\)) // "words" that consist of any chars other than (, ) and |
See this regex demo. The words you need are now in Group 1. Details:
(?:\G(?!^)\||\() - a position after the previous match (\G(?!^)) and a | char (\|), or (|) a ( char (\()
(\w+) - Group 1: one or more word chars
(?=[^()]*\)) - a positive lookahead that makes sure there is a ) char after any zero or more chars other than ( and ) to the right of the current position.
Extracting the matches in C# can be done with
var matches = Regex.Matches(text, #"(?<=\([^()]*)\w+(?=[^()]*\))")
.Cast<Match>()
.Select(x => x.Value);
// Or
var matches = Regex.Matches(text, #"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
.Cast<Match>()
.Select(x => x.Groups[1].Value);
In c# you can also make use of the group captures using a capture group.
The matches are in named group word
\((?<word>\w+)(?:\|(?<word>\w+))*\)
\( Match (
(?<word>\w+) Match 1+ word chars in group word
(?: Non capture group
\| Match |
(?<word>\w+) Match 1+ word chars
)* Close the non capture group and optionally repeat to get all occurrences
\) Match the closing parenthesis
Code example provided by Wiktor Stribiżew in the comments:
var line = "I love (chocolate|fish|honey|more)";
var output = Regex.Matches(line, #"\((?<word>\w+)(?:\|(?<word>\w+))*\)")
.Cast<Match>()
.SelectMany(x => x.Groups["word"].Captures);
foreach (var s in output)
Console.WriteLine(s);
Output
chocolate
fish
honey
more
foreach (var s in output)
Console.WriteLine(s);
Regex demo

Match only the nth occurrence using a regular expression

I have a string with 3 dates in it like this:
XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx
I want to select the 2nd date in the string, the 20180208 one.
Is there away to do this purely in the regex, with have to resort to pulling out the 2 match in code. I'm using C# if that matters.
Thanks for any help.
You could use
^(?:[^_]+_){2}(\d+)
And take the first group, see a demo on regex101.com.
Broken down, this says
^ # start of the string
(?:[^_]+_){2} # not _ + _, twice
(\d+) # capture digits
C# demo:
var pattern = #"^(?:[^_]+_){2}(\d+)";
var text = "XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx";
var result = Regex.Match(text, pattern)?.Groups[1].Value;
Console.WriteLine(result); // => 20180208
Try this one
MatchCollection matches = Regex.Matches(sInputLine, #"\d{8}");
string sSecond = matches[1].ToString();
You could use the regular expression
^(?:.*?\d{8}_){1}.*?(\d{8})
to save the 2nd date to capture group 1.
Demo
Naturally, for n > 2, replace {1} with {n-1} to obtain the nth date. To obtain the 1st date use
^(?:.*?\d{8}_){0}.*?(\d{8})
Demo
The C#'s regex engine performs the following operations.
^ # match the beginning of a line
(?: # begin a non-capture group
.*? # match 0+ chars lazily
\d{8} # match 8 digits
_ # match '_'
) # end non-capture group
{n} # execute non-capture group n (n >= 0) times
.*? # match 0+ chars lazily
(\d{8}) # match 8 digits in capture group 1
The important thing to note is that the first instance of .*?, followed by \d{8}, because it is lazy, will gobble up as many characters as it can until the next 8 characters are digits (and are not preceded or followed by a digit. For example, in the string
_1234abcd_efghi_123456789_12345678_ABC
capture group 1 in (.*?)_\d{8}_ will contain "_1234abcd_efghi_123456789".
You can use System.Text.RegularExpressions.Regex
See the following example
Regex regex = new Regex(#"^(?:[^_]+_){2}(\d+)"); //Expression from Jan's answer just showing how to use C# to achieve your goal
GroupCollection groups = regex.Match("XXXXX_20160207_20180208_XXXXXXX_20190408T160742_xxxxx").Groups;
if (groups.Count > 1)
{
Console.WriteLine(groups[1].Value);
}

Regex matching excluding a specific context

I'm trying to search a string for words within single quotes, but only if those single quotes are not within parentheses.
Example string:
something, 'foo', something ('bar')
So for the given example I'd like to match foo, but not bar.
After searching for regex examples I'm able to match within single quotes (see below code snippet), but am not sure how to exclude matches in the context previously described.
string line = "something, 'foo', something ('bar')";
Match name = Regex.Match(line, #"'([^']*)");
if (name.Success)
{
string matchedName = name.Groups[1].Value;
Console.WriteLine(matchedName);
}
I would recommend using lookahead instead (see it live) using:
(?<!\()'([^']*)'(?!\))
Or with C#:
string line = "something, 'foo', something ('bar')";
Match name = Regex.Match(line, #"(?<!\()'([^']*)'(?!\))");
if (name.Success)
{
Console.WriteLine(name.Groups[1].Value);
}
The easiest way to get what you need is to use an alternation group and match and capture what you need and only match what you do not need:
\([^()]*\)|'([^']*)'
See the regex demo
Details:
\( - a (
[^()]* - 0+ chars other than ( and )
\) - a )
| - or
' - a '
([^']*) - Group 1 capturing 0+ chars other than '
' - a single quote.
In C#, use .Groups[1].Value to get the values you need. See the online demo:
var str = "something, 'foo', something ('bar')";
var result = Regex.Matches(str, #"\([^()]*\)|'([^']*)'")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();
Another alternative is the one mentioned by Thomas, but since it is .NET, you may use infinite-width lookbehind:
(?<!\([^()]*)'([^']*)'(?![^()]*\))
See this regex demo.
Details:
(?<!\([^()]*) - a negative lookbehind failing the match if there is ( followed with 0+ chars other than ( and ) up to
'([^']*)' - a quote, 0+ chars other than single quote captured into Group 1, and another single quote
(?![^()]*\)) - a negative lookahead that fails the match if there are 0+ chars other than ( and ) followed with ) right after the ' from the preceding subpattern.
Since you'd want to exclude ', the same code as above applies.

RegEx replace query to pick out wiki syntax

I've got a string of HTML that I need to grab the "[Title|http://www.test.com]" pattern out of e.g.
"dafasdfasdf, adfasd. [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad"
I need to replace "[Title|http://www.test.com]" this with "http://www.test.com/'>Title".
What is the best away to approach this?
I was getting close with:
string test = "dafasdfasdf adfasd [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad ";
string p18 = #"(\[.*?|.*?\])";
MatchCollection mc18 = Regex.Matches(test, p18, RegexOptions.Singleline | RegexOptions.IgnoreCase);
foreach (Match m in mc18)
{
string value = m.Groups[1].Value;
string fulltag = value.Substring(value.IndexOf("["), value.Length - value.IndexOf("["));
Console.WriteLine("text=" + fulltag);
}
There must be a cleaner way of getting the two values out e.g. the "Title" bit and the url itself.
Any suggestions?
Replace the pattern:
\[([^|]+)\|[^]]*]
with:
$1
A short explanation:
\[ # match the character '['
( # start capture group 1
[^|]+ # match any character except '|' and repeat it one or more times
) # end capture group 1
\| # match the character '|'
[^]]* # match any character except ']' and repeat it zero or more times
] # match the character ']'
A C# demo would look like:
string test = "dafasdfasdf adfasd [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad ";
string adjusted = Regex.Replace(test, #"\[([^|]+)\|[^]]*]", "$1");

Categories