Tokenize a string using multiple conditions

For the string below:
var str = "value0 'value 1/5' 'x ' value2";
Is there a way I can parse that string such that I get
arr[0] = "value0";
arr[1] = "value 1/5";
arr[2] = "x ";
arr[3] = "value2";
The order of values that might come with single quotes is arbitrary. Case does not matter.
I can get all values between single quotes using a regex like
"'(.*?)'"
but I need to keep the order of those values relative to the other, non-quoted values.

Use
'(?<val>.*?)'|(?<val>\S+)
See regex proof
EXPLANATION
--------------------------------------------------------------------------------
'            a single quote
--------------------------------------------------------------------------------
(?<val>      group and capture to "val":
--------------------------------------------------------------------------------
.*?          any character except \n (0 or more times
             (matching the least amount possible))
--------------------------------------------------------------------------------
)            end of "val"
--------------------------------------------------------------------------------
'            a single quote
--------------------------------------------------------------------------------
|            OR
--------------------------------------------------------------------------------
(?<val>      group and capture to "val":
--------------------------------------------------------------------------------
\S+          non-whitespace (all but \n, \r, \t, \f,
             and " ") (1 or more times (matching the
             most amount possible))
--------------------------------------------------------------------------------
)            end of "val"
C# code:
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        // Both alternatives capture into the same named group "val".
        string pattern = @"'(?<val>.*?)'|(?<val>\S+)";
        string input = @"value0 'value 1/5' 'x ' value2";
        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.WriteLine(m.Groups["val"].Value);
        }
    }
}

In C# you can reuse the same named capture group, so you could use an alternation | using the same group name for both parts.
'(?<val>[^']+)'|(?<val>\S+)
The pattern matches:
' Match a single quote
(?<val>[^']+) Capture in group val matching 1+ times any char except ' to not match an empty string
' Match a single quote
| Or
(?<val>\S+) Capture in group val matching 1+ times any non whitespace char
See a .NET regex demo or a C# demo
For example
string pattern = @"'(?<val>[^']+)'|(?<val>\S+)";
var str = "value0 'value 1/5' 'x ' value2";
foreach (Match m in Regex.Matches(str, pattern))
{
    Console.WriteLine(m.Groups["val"].Value);
}
Output
value0
value 1/5
x
value2
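
The question asks for the tokens in an array rather than printed one by one. A minimal sketch of that, assuming the `[^']+` pattern above; `Tokenizer` and `Tokenize` are hypothetical names and not part of either answer:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // The quoted alternative is tried first, so a quoted value
    // containing spaces stays a single token.
    private static readonly Regex TokenPattern =
        new Regex(@"'(?<val>[^']+)'|(?<val>\S+)");

    public static string[] Tokenize(string input)
    {
        // Cast<Match>() keeps this working on .NET Framework, where
        // MatchCollection only implements the non-generic IEnumerable.
        return TokenPattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups["val"].Value)
            .ToArray();
    }

    public static void Main()
    {
        string[] arr = Tokenize("value0 'value 1/5' 'x ' value2");
        for (int i = 0; i < arr.Length; i++)
        {
            Console.WriteLine("arr[{0}] = \"{1}\"", i, arr[i]);
        }
    }
}
```

Because the matches are returned in the order they occur in the input, the array preserves the relative order of quoted and unquoted values.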

Related

Regex for multiple matches .net c#

I'm working on a regex:
Regex regex = new Regex(@"(?<=[keyb])(.*?)(?=[\/keyb])");
With this regex I'm getting everything between the tags [keyb] and [/keyb]
example: [keyb]hello buddy[/keyb]
output: hello buddy
What if I want to get everything between [keyb][/keyb] and also [keyb2][/keyb2]?
example: [keyb]hello buddy[/keyb] [keyb2]bye buddy[/keyb2]
output: hello buddy
bye buddy

Use
\[(keyb|keyb2)]([\w\W]*?)\[/\1]
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
\[ '['
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
keyb 'keyb'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
keyb2 'keyb2'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
] ']'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
[\w\W]*? any character of: word characters (a-z,
A-Z, 0-9, _), non-word characters (all
but a-z, A-Z, 0-9, _) (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
\[ '['
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
] ']'
C# code:
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        // Group 1 is the tag name, group 2 the text between the tags.
        string pattern = @"\[(keyb|keyb2)]([\w\W]*?)\[/\1]";
        string input = @" [keyb]hello buddy[/keyb] [keyb2]bye buddy[/keyb2]";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("Match found: {0}", m.Groups[2].Value);
        }
    }
}
Results:
Match found: hello buddy
Match found: bye buddy

var pattern = @"\[keyb.*?](.+?)\[\/keyb.*?\]";

REGEX Expression C#. Split string by whitespace outside the quotation marks

I'm trying to define a regular expression for the Split function that splits a string on whitespace, ignoring any whitespace inside single quotation marks.
Example:
key1:value1 key2:'value2 value3'
I need these separated values:
key1:value1
key2:'value2 value3'
I've tried to do this in different ways:
Regex.Split(q, @"(\s)^('\s')").ToList();
Regex.Split(q, @"(\s)(^'.\s.')").ToList();
Regex.Split(q, @"(?=.*\s)").ToList();
What is wrong with this code?
Could you please help me with this?
Thanks in advance

A working example:
(\w+):(?:(\w+)|'([^']+)')
(\w+) # key: 1 or more word chars (captured)
: # literal
(?: # non-captured grouped alternatives
(\w+) # value: 1 or more word chars (captured)
| # or
'([^']+)' # 1 or more not "'" enclosed by "'" (captured)
) # end of group
Demo
Your try:
(\s)^('\s')
^ matches the beginning of a line and \s matches a whitespace character. ^ acts as a negation only inside a character class: [^\s] matches one character that is not whitespace.

var st = "key1:value1 key2:'value2 value3'";
var result = Regex.Matches(st, @"\w+:\w+|\w+:\'[^']+\'");
foreach (var item in result)
    Console.WriteLine(item);
The result should be:
key1:value1
key2:'value2 value3'

Try following :
static void Main(string[] args)
{
    string input = "key1:value1 key2:'value2 value3'";
    string pattern = @"\s*(?'key'[^:]+):((?'value'[^'][^\s]+)|'(?'value'[^']+))";
    MatchCollection matches = Regex.Matches(input, pattern);
    foreach (Match match in matches)
    {
        Console.WriteLine("Key : '{0}', Value : '{1}'", match.Groups["key"].Value, match.Groups["value"].Value);
    }
    Console.ReadLine();
}
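
The answers above all switch to Regex.Matches, but the question asked about Regex.Split. One way to sketch that, assuming quotes are always balanced and never escaped, is to split on whitespace only when an even number of single quotes remains to its right. `QuoteAwareSplit` and `SplitOutsideQuotes` are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteAwareSplit
{
    // \s+                          the separator
    // (?=(?:[^']*'[^']*')*[^']*$)  only if the rest of the string holds an
    //                              even number of quotes, i.e. the separator
    //                              is outside a quoted section
    private static readonly Regex Separator =
        new Regex(@"\s+(?=(?:[^']*'[^']*')*[^']*$)");

    public static string[] SplitOutsideQuotes(string input)
    {
        // The pattern has no capturing groups, so Split returns
        // only the pieces between separators.
        return Separator.Split(input);
    }

    public static void Main()
    {
        foreach (string part in SplitOutsideQuotes("key1:value1 key2:'value2 value3'"))
        {
            Console.WriteLine(part);
        }
    }
}
```

The lookahead rescans the rest of the string at every whitespace run, so this is best kept to short inputs; for long strings a Matches-based approach is likely the better choice.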

C# regex extract string enclosed into single quotes

I've the following string that I need to parse using RegEx.
abc = 'def' and size = '1 x(3\" x 5\")' and (name='Sam O\'neal')
This is an SQL filter, which I'd like to split into tokens using the following separators:
(, ), >,<,=, whitespace, <=, >=, !=
After the string is parsed, I'd like the output to be:
abc,
=,
def,
and,
size,
=,
'1 x(3\" x 5\")',
and,
(,
Sam O\'neal,
),
I've tried the following code:
string pattern = @"(<=|>=|!=|=|>|<|\)|\(|\s+)";
var tokens = new List<string>(Regex.Split(filter, pattern));
tokens.RemoveAll(x => String.IsNullOrWhiteSpace(x));
I'm not sure how to keep a string in single quotes as one token. I'm new to regex and would appreciate any help.

Your pattern needs an update with yet another alternative branch: '[^'\\]*(?:\\.[^'\\]*)*'.
It will match:
' - a single quote
[^'\\]* - 0+ chars other than ' and \
(?: - a non-capturing group matching sequences of:
\\. - any escape sequence
[^'\\]* - 0+ chars other than ' and \
)* - zero or more occurrences
' - a single quote
In C#:
string pattern = @"('[^'\\]*(?:\\.[^'\\]*)*'|<=|>=|!=|=|>|<|\)|\(|\s+)";
See the regex demo
C# demo:
var filter = @"abc = 'def' and size = '1 x(3"" x 5"")' and (name='Sam O\'neal')";
var pattern = @"('[^'\\]*(?:\\.[^'\\]*)*'|<=|>=|!=|=|>|<|\)|\(|\s+)";
// Split keeps the captured separators; drop the empty and whitespace-only pieces.
var tokens = Regex.Split(filter, pattern).Where(x => !string.IsNullOrWhiteSpace(x));
foreach (var tok in tokens)
    Console.WriteLine(tok);
Output:
abc
=
'def'
and
size
=
'1 x(3" x 5")'
and
(
name
=
'Sam O\'neal'
)

Split string on whitespace ignoring parenthesis

I have a string such as this
(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))
I need to split it into two such as this
ed
Karlsruhe Univ. (TH) (Germany, F.R.)
Basically, ignoring whitespace and parenthesis within a parenthesis
Is it possible to use a regex to achieve this?

If you can have more parentheses, it's better to use balancing groups:
string text = "(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))";
var charSetOccurences = new Regex(@"\(((?:[^()]|(?<o>\()|(?<-o>\)))+(?(o)(?!)))\)");
var charSetMatches = charSetOccurences.Matches(text);
foreach (Match match in charSetMatches)
{
    Console.WriteLine(match.Groups[1].Value);
}
ideone demo
Breakdown:
\(( # First '(' and begin capture
(?:
[^()] # Match all non-parens
|
(?<o> \( ) # Match '(', and capture into 'o'
|
(?<-o> \) ) # Match ')', and delete the 'o' capture
)+
(?(o)(?!)) # Fails if 'o' stack isn't empty
)\) # Close capture and last opening brace

\((.*?)\)\s*\((.*)\)
This gives you the two values in capture groups \1 and \2.
Demo here: http://regex101.com/r/rP5kG2
A search and replace with \1\n\2 as the replacement produces exactly the two lines you need.

string str = "(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))";
Regex re = new Regex(@"\((.*?)\)\s*\((.*)\)");
Match match = re.Match(str);

In general, no.
You can't describe recursive patterns with a regular expression, since a finite automaton cannot recognize them.
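
Where a regex is not a good fit, a small hand-written scanner that tracks nesting depth handles the same input. A minimal sketch, assuming the parentheses are balanced; `ParenSplitter` and `TopLevelGroups` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;

public static class ParenSplitter
{
    // Returns the contents of each top-level parenthesized group.
    public static List<string> TopLevelGroups(string text)
    {
        var groups = new List<string>();
        int depth = 0;   // current nesting level
        int start = -1;  // index just after the current outer '('

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '(')
            {
                if (depth == 0) start = i + 1;
                depth++;
            }
            else if (c == ')' && depth > 0)
            {
                depth--;
                // Back at depth 0: the outer group just closed.
                if (depth == 0) groups.Add(text.Substring(start, i - start));
            }
        }
        return groups;
    }

    public static void Main()
    {
        foreach (string g in TopLevelGroups("(ed) (Karlsruhe Univ. (TH) (Germany, F.R.))"))
        {
            Console.WriteLine(g);
        }
    }
}
```

Text outside any parentheses is ignored, and an unmatched closing parenthesis at depth 0 is skipped rather than reported; stricter error handling would be a separate design decision.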

RegEx replace query to pick out wiki syntax

I've got a string of HTML that I need to grab the "[Title|http://www.test.com]" pattern out of e.g.
"dafasdfasdf, adfasd. [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad"
I need to replace "[Title|http://www.test.com]" with "<a href='http://www.test.com/'>Title</a>".
What is the best way to approach this?
I was getting close with:
string test = "dafasdfasdf adfasd [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad ";
string p18 = @"(\[.*?|.*?\])";
MatchCollection mc18 = Regex.Matches(test, p18, RegexOptions.Singleline | RegexOptions.IgnoreCase);
foreach (Match m in mc18)
{
    string value = m.Groups[1].Value;
    string fulltag = value.Substring(value.IndexOf("["), value.Length - value.IndexOf("["));
    Console.WriteLine("text=" + fulltag);
}
There must be a cleaner way of getting the two values out e.g. the "Title" bit and the url itself.
Any suggestions?

Replace the pattern:
\[([^|]+)\|[^]]*]
with:
$1
A short explanation:
\[ # match the character '['
( # start capture group 1
[^|]+ # match any character except '|' and repeat it one or more times
) # end capture group 1
\| # match the character '|'
[^]]* # match any character except ']' and repeat it zero or more times
] # match the character ']'
A C# demo would look like:
string test = "dafasdfasdf adfasd [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad ";
string adjusted = Regex.Replace(test, @"\[([^|]+)\|[^]]*]", "$1");
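
The replacement above keeps only the title. If the goal is to turn each tag into an HTML link of the form <a href='URL'>Title</a>, which is an assumption about the intended output, a second capture group for the URL makes both parts available. `WikiLinks` and `ToAnchors` are hypothetical names, and the title class is tightened to also exclude ']':

```csharp
using System;
using System.Text.RegularExpressions;

public static class WikiLinks
{
    // Group 1: title (anything up to '|', not crossing ']').
    // Group 2: URL (anything up to the closing ']').
    private static readonly Regex LinkPattern =
        new Regex(@"\[([^|\]]+)\|([^\]]*)\]");

    public static string ToAnchors(string text)
    {
        // $1 and $2 insert the title and URL groups.
        // No HTML encoding is applied in this sketch.
        return LinkPattern.Replace(text, "<a href='$2'>$1</a>");
    }

    public static void Main()
    {
        string test = "adfasd [Test|http://www.test.com/] adf [SDAF|http://www.madee.com/] assg";
        Console.WriteLine(ToAnchors(test));
    }
}
```

If the titles or URLs can come from untrusted input, they would need HTML encoding before being placed in the markup.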
