Regex matching multiple field value in a single line


I want to match multiple field/value pairs, each delimited by a colon, in a single line, but the field and value text contain spaces
e.g.
field1 : value1a value1b
answer
match1: Group1=field1, Group2=value1a value1b
or
field1 : value1a value1b field2 : value2a value2b
answer
match1: Group1=field1, Group2=value1a value1b
match2: Group1=field2, Group2=value2a value2b
The best I can do right now is (\w+)\s*:\s*(\w+)
Regex regex = new Regex(@"(\w+)\s*:\s*(\w+)");
Match m = regex.Match("field1 : value1a value1b field2 : value2a value2b");
while (m.Success)
{
    string f = m.Groups[1].Value.Trim();
    string v = m.Groups[2].Value.Trim();
    m = m.NextMatch(); // advance, otherwise the loop never ends
}
I guess a lookahead may help, but I don't know how to write it
thank you

You may try
(\w+)\s*:\s*((?:(?!\s*\w+\s*:).)*)
(\w+) group 1, one or more consecutive word characters
\s*:\s* a colon with optional whitespace around it
(...) group 2
(?:...)* a non-capturing group, repeated any number of times
(?!\s*\w+\s*:). a negative lookahead followed by one character: the character is consumed only if the text at that position does not start with optional whitespace, a word, optional whitespace and a colon. Thus group 2 never consumes a word that precedes a colon
See the test cases
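The tempered pattern above can be tried out with a minimal sketch like the following. The class name FieldParser and the helper Parse are hypothetical names chosen for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // Pattern from the answer: group 2 stops before the next "word :" sequence.
    private static readonly Regex Pair =
        new Regex(@"(\w+)\s*:\s*((?:(?!\s*\w+\s*:).)*)");

    // Hypothetical helper: returns the (field, value) pairs in order of appearance.
    public static List<KeyValuePair<string, string>> Parse(string line)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(line))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value.Trim()));
        }
        return result;
    }

    public static void Main()
    {
        foreach (var pair in Parse("field1 : value1a value1b field2 : value2a value2b"))
        {
            Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
        }
    }
}
```

With the sample line from the question this should print field1 = value1a value1b and field2 = value2a value2b on separate lines.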

You can use a regex based on a lazy dot:
var matches = Regex.Matches(text, @"(\w+)\s*:\s*(.*?)(?=\s*\w+\s*:|$)");
See the C# demo online and the .NET regex demo (please mind that regex101.com does not support .NET regex flavor).
As you see, there is no need to use a tempered greedy token. The regex means:
(\w+) - Group 1: any one or more letters/digits/underscore
\s*:\s* - a colon enclosed with zero or more whitespace chars
(.*?) - Group 2: any zero or more chars other than a newline, as few as possible
(?=\s*\w+\s*:|$) - up to the first occurrence of one or more word chars enclosed with zero or more whitespaces and followed by a colon, or the end of string.
Full C# demo:
using System;
using System.Text.RegularExpressions;
public class Test
{
    public static void Main()
    {
        var text = "field1 : value1a value1b field2 : value2a value2b";
        var matches = Regex.Matches(text, @"(\w+)\s*:\s*(.*?)(?=\s*\w+\s*:|$)");
        foreach (Match m in matches)
        {
            Console.WriteLine("-- MATCH FOUND --\nKey: {0}, Value: {1}",
                m.Groups[1].Value, m.Groups[2].Value);
        }
    }
}
Output:
-- MATCH FOUND --
Key: field1, Value: value1a value1b
-- MATCH FOUND --
Key: field2, Value: value2a value2b

Related

Building a regular expression in C#

How to check the following text in C# with Regex:
key_in-get { 43243225543543543 };
or
key_in-set { password123 : 34980430943834 };
I tried to build a regular expression, but I failed after a few hours.
Here is my code:
string text1 = "key_in-get { 322389238237 };";
string text2 = "key_in-set { password123 : 322389238237 };";
string pattern = "key_in-(get|set) { .* };";
var result1 = Regex.IsMatch(text1, pattern);
Console.Write("Is valid: {0} ", result1);
var result2 = Regex.IsMatch(text2, pattern);
Console.Write("Is valid: {0} ", result2);
I have to check whether there is "set" or "get".
If the pattern finds "set", it should accept only the form "text123 : 123456789"; if it finds "get", it should accept only "123456789".

You can use
key_in-(?:get|(set)) {(?(1) \w+ :) \w+ };
key_in-(?:get|(set))\s*{(?(1)\s*\w+\s*:)\s*\w+\s*};
key_in-(?:get|(set))\s*{(?(1)\s*\w+\s*:)\s*\d+\s*};
See the regex demo. The second one allows any amount of any whitespace between the elements and the third one allows only digits after : or as part of the get expression.
If the whole string must match, add ^ at the start and $ at the end of the pattern.
Details:
key_in- - a substring
(?:get|(set)) - get or set (the latter is captured into Group 1)
\s* - zero or more whitespaces
{ - a { char
(?(1)\s*\w+\s*:) - a conditional construct: if Group 1 matched, match one or more word chars enclosed with zero or more whitespaces and then a colon
\s*\w+\s* - one or more word chars enclosed with zero or more whitespaces
}; - a literal substring.
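As a rough check of the conditional construct, here is a minimal sketch that uses the third pattern with ^ and $ added so the whole string must match. The class name KeyCommandValidator and the method IsValid are hypothetical names, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyCommandValidator
{
    // Third pattern from the answer, anchored so the whole string must match.
    // (?(1)...) is only applied when group 1, i.e. "set", took part in the match.
    private static readonly Regex Command =
        new Regex(@"^key_in-(?:get|(set))\s*{(?(1)\s*\w+\s*:)\s*\d+\s*};$");

    public static bool IsValid(string text)
    {
        return Command.IsMatch(text);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("key_in-get { 322389238237 };"));               // True
        Console.WriteLine(IsValid("key_in-set { password123 : 322389238237 };")); // True
        Console.WriteLine(IsValid("key_in-get { password123 : 322389238237 };")); // False
        Console.WriteLine(IsValid("key_in-set { 322389238237 };"));               // False
    }
}
```

The last two inputs are rejected because get must not carry a "name :" part, while set requires one.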

In the pattern that you tried, key_in-(get|set) { .* };, you are matching either get or set followed by {, and then the greedy .* runs until the last occurrence of }; so the pattern could also match key_in-get { }; };
As an alternative solution, you could use an alternation | specifying each of the accepted parts for the get and the set.
key_in-(?:get\s*{\s*\w+|set\s*{\s*\w+\s*:\s*\w+)\s*};
The pattern matches
key_in- Match literally
(?: Non capture group
get\s*{\s*\w+ Match get, { between optional whitespace chars and 1+ word chars
| Or
set\s*{\s*\w+\s*:\s*\w+ Match set, { between optional whitespace chars and word chars on either side with : in between.
) Close non capture group
\s*}; Match optional whitespace chars and };
Regex demo
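To compare the two approaches, the sketch below runs the pattern from the question and the alternation pattern side by side. The class and method names are hypothetical, and anchoring the alternation with ^ and $ is an assumption made here so that the whole string has to match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationValidator
{
    // Pattern from the question: the greedy .* accepts almost anything between the braces.
    private static readonly Regex Loose = new Regex(@"key_in-(get|set) { .* };");

    // Alternation pattern from the answer, anchored here as an assumption.
    private static readonly Regex Strict =
        new Regex(@"^key_in-(?:get\s*{\s*\w+|set\s*{\s*\w+\s*:\s*\w+)\s*};$");

    public static bool IsLooseMatch(string text)
    {
        return Loose.IsMatch(text);
    }

    public static bool IsStrictMatch(string text)
    {
        return Strict.IsMatch(text);
    }

    public static void Main()
    {
        Console.WriteLine(IsLooseMatch("key_in-get { }; };"));                          // True
        Console.WriteLine(IsStrictMatch("key_in-get { }; };"));                         // False
        Console.WriteLine(IsStrictMatch("key_in-get { 322389238237 };"));               // True
        Console.WriteLine(IsStrictMatch("key_in-set { password123 : 322389238237 };")); // True
        Console.WriteLine(IsStrictMatch("key_in-set { 322389238237 };"));               // False
    }
}
```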

Problem with brackets in regular expression in C#

can anybody help me with regular expression in C#?
I want to create a pattern for this input:
{a? ab 12 ?? cd}
This is my pattern:
([A-Fa-f0-9?]{2})+
The problem is the curly brackets. This doesn't work:
{(([A-Fa-f0-9?]{2})+)}
It just works for
{ab}

I would use {([A-Fa-f0-9?]+|[^}]+)}
It captures one group, which matches either:
[A-Fa-f0-9?]+ one or more characters from the list (hex digits or ?)
[^}]+ one or more characters other than }

If you allow leading/trailing whitespace within {...} string, the expression will look like
{(?:\s*([A-Fa-f0-9?]{2}))+\s*}
See this regex demo
If you only allow a single regular space only between the values inside {...} and no space after { and before }, you can use
{(?:([A-Fa-f0-9?]{2})(?: (?!}))?)+}
See this regex demo. Note this one is much stricter. Details:
{ - a { char
(?:\s*([A-Fa-f0-9?]{2}))+ - one or more occurrences of
\s* - zero or more whitespaces
([A-Fa-f0-9?]{2}) - Capturing group 1: two hex or ? chars
\s* - zero or more whitespaces
} - a } char.
See a C# demo:
var text = "{a? ab 12 ?? cd}";
var pattern = @"{(?:([A-Fa-f0-9?]{2})(?: (?!}))?)+}";
var result = Regex.Matches(text, pattern)
    .Cast<Match>()
    .Select(x => x.Groups[1].Captures.Cast<Capture>().Select(m => m.Value))
    .ToList();
foreach (var list in result)
    Console.WriteLine(string.Join("; ", list));
// => a?; ab; 12; ??; cd

If you want to capture pairs of chars between the curly braces, you can use a single capture group:
{([A-Fa-f0-9?]{2}(?: [A-Fa-f0-9?]{2})*)}
Explanation
{ Match {
( Capture group 1
[A-Fa-f0-9?]{2} Match 2 times any of the listed characters
(?: [A-Fa-f0-9?]{2})* Optionally repeat a space and again 2 of the listed characters
) Close group 1
} Match }
Regex demo | C# demo
Example code
string pattern = @"{([A-Fa-f0-9?]{2}(?: [A-Fa-f0-9?]{2})*)}";
string input = @"{a? ab 12 ?? cd}
{ab}";
foreach (Match m in Regex.Matches(input, pattern))
{
Console.WriteLine(m.Groups[1].Value);
}
Output
a? ab 12 ?? cd
ab

Get the middle part of a filename using regex

I need a regex that can return up to 10 characters in the middle of a file name.
filename: returns:
msl_0123456789_otherstuff.csv -> 0123456789
msl_test.xml -> test
anythingShort.w1 -> anythingSh
I can capture the beginning and end for removal with the following regex:
Regex.Replace(filename, "(^msl_)|([.][[:alnum:]]{1,3}$)", string.Empty);
but I also need to have only 10 characters when I am done.
Explanation of the regex above:
(^msl_) - match lines that start with "msl_"
| - or
([.] - match a period
[[:alnum:]]{1,3} - followed by 1-3 alphanumeric characters
$) - at the end of the line

Note [[:alnum:]] can't work in a .NET regex, because it does not support POSIX character classes. You may use \w (to match letters, digits, underscores) or [^\W_] (to match letters or digits).
You can use your regex and just keep the first 10 chars in the string:
new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$","").Take(10).ToArray())
See the C# demo online:
var strings = new List<string> { "msl_0123456789_otherstuff.csv", "msl_test.xml", "anythingShort.w1" };
foreach (var s in strings)
{
    Console.WriteLine("{0} => {1}", s, new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$","").Take(10).ToArray()));
}
Output:
msl_0123456789_otherstuff.csv => 0123456789
msl_test.xml => test
anythingShort.w1 => anythingSh

Using Replace with the alternation removes either alternative from the start and the end of the string, but it also succeeds when the extension is not present, and it does not limit the number of chars kept from the middle.
If the file extension should be present you might use a capturing group and make msl_ optional at the beginning.
Then match 1-10 times a word character except the _ followed by matching optional word characters until the .
^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$
.NET regex demo (Click on the table tab)
A bit broader match could be using \S instead of \w and match until the last dot:
^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$
See another regex demo | C# demo
string[] strings = {"msl_0123456789_otherstuff.csv", "msl_test.xml","anythingShort.w1", "123456testxxxxxxxx"};
string pattern = @"^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$";
foreach (String s in strings) {
    Match match = Regex.Match(s, pattern);
    if (match.Success)
    {
        Console.WriteLine(match.Groups[1]);
    }
}
Output
0123456789
test
anythingSh

RegEx for parsing repeated groups

The source string contains tags like this:
>>>tagA
contents 1
<<<tagA
...
>>>tagB
contents 2
<<<tagB
...
I need to extract the tag names and the contents inside them. This is what I've got, but it is still not working:
(?<=(>>>(?<tagName>.+)$))(?<contents2>.*?)(?=(<<<.+)$)
It results in two matches, but the tagName in the second match captured multiple lines:
tagA
contents 1
<<<tagA
What am I doing wrong?

You may use
>>>(?<tagName>.+?)[\r\n]+(?s:(?<contents>.*?))<<<
See the regex demo
Details
>>> - a >>> substring
(?<tagName>.+?) - Group "tagName": any 1+ chars as few as possible
[\r\n]+ - one or more CR or LF symbols
(?s:(?<contents>.*?)) - Group "contents": an inline modifier group matching any 0+ chars, but as few as possible
<<< - a <<< substring.
In C#:
var matches = Regex.Matches(s, @">>>(?<tagName>.+?)[\r\n]+(?s:(?<contents>.*?))<<<");
See the C# demo:
var s = ">>>tagA\ncontents 1\n<<<tagA\n...\n>>>tagB\ncontents 2\n<<<tagB\n...";
var matches = Regex.Matches(s, @">>>(?<tagName>.+?)[\r\n]+(?s:(?<contents>.*?))<<<");
foreach (Match m in matches) {
    Console.WriteLine(m.Groups["tagName"].Value);
    Console.WriteLine(m.Groups["contents"].Value);
}
Output:
tagA
contents 1
tagB
contents 2

Here, we could start with a simple expression bounded by >>> and <<<, maybe something similar to:
>>>(.+)\s*(.+)\s*<<<.+
Our desired data is then in the two capturing groups,
(.+)
and we would script the rest of the problem.
Demo
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @">>>(.+)\s*(.+)\s*<<<.+";
        string input = @">>>tagA
contents 1
<<<tagA
>>>tagB
contents 2
<<<tagB
>>>tagC
contents 2
<<<tagC
";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

Get particular parts from a string

I'm trying to get particular parts from a string. I have to get the part which starts after '#' and contains only letters from the Latin alphabet.
I suppose that I have to create a regex pattern, but I don't know how.
string test = "PQ#Alderaa1:30000!A!->20000";
var planet = "Alderaa"; //what I want to get
string test2 = "#Cantonica:3000!D!->4000NM";
var planet2 = "Cantonica";
There are some other parts which I have to get, but I will try to get them myself. (starts after ':' and is an Integer; may be "A" (attack) or "D" (destruction) and must be surrounded by "!" (exclamation mark); starts after "->" and should be an Integer)

You could get the separate parts using capturing groups:
#([a-zA-Z]+)[^:]*:(\d+)!([AD])!->(\d+)
That will match:
#([a-zA-Z]+) Match # and capture in group 1 1+ times a-zA-Z
[^:]*: Match 0+ times not a : using a negated character class, then match a : (If what follows could be only optional digits, you might also match 0+ times a digit [0-9]*)
(\d+) Capture in group 2 1+ digits
!([AD])! Match !, capture A or D in group 3, then match !
->(\d+) Match -> and capture in group 4 1+ digits
Demo | C# Demo
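A minimal sketch of how the four groups could be read in C#, using the two sample strings from the question. The class name PlanetLineParser and the method ParseLine are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlanetLineParser
{
    // Pattern from the answer: name, first number, A or D, second number.
    private static readonly Regex Line =
        new Regex(@"#([a-zA-Z]+)[^:]*:(\d+)!([AD])!->(\d+)");

    // Hypothetical helper: returns the four captured parts, or null when nothing matches.
    public static string[] ParseLine(string text)
    {
        Match m = Line.Match(text);
        if (!m.Success)
        {
            return null;
        }
        return new[]
        {
            m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value, m.Groups[4].Value
        };
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ParseLine("PQ#Alderaa1:30000!A!->20000")));
        // Alderaa, 30000, A, 20000
        Console.WriteLine(string.Join(", ", ParseLine("#Cantonica:3000!D!->4000NM")));
        // Cantonica, 3000, D, 4000
    }
}
```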

You can use this regex. It uses a positive lookbehind to ensure the matched text is preceded by #, captures one or more letters with [a-zA-Z]+, and uses a positive lookahead to ensure the match is followed by some optional text, a colon, one or more digits, a !, then either A or D, and then again a !
(?<=#)[a-zA-Z]+(?=[^:]*:\d+![AD]!)
Demo
C# code demo
string test = "PQ#Alderaa1:30000!A!->20000";
Match m1 = Regex.Match(test, @"(?<=#)[a-zA-Z]+(?=[^:]*:\d+![AD]!)");
Console.WriteLine(m1.Groups[0].Value);
test = "#Cantonica:3000!D!";
m1 = Regex.Match(test, @"(?<=#)[a-zA-Z]+(?=[^:]*:\d+![AD]!)");
Console.WriteLine(m1.Groups[0].Value);
Prints,
Alderaa
Cantonica

You already have good answers, but I would like to add a new one to show named capturing groups.
You can create a class for your planets like
class Planet
{
    public string Name;
    public int Value1; // name is not clear from context
    public string Category; // as above: rename it
    public string Value2; // same problem
}
Now you can use regex with named groups
#(?<name>[a-z]+)[^:]*:(?<value1>\d+)!(?<category>[^!]+)!->(?<value2>[\da-z]+)
Demo
Usage:
var input = new[]
{
    "PQ#Alderaa1:30000!A!->20000",
    "#Cantonica:3000!D!->4000NM",
};
var regex = new Regex("#(?<name>[a-z]+)[^:]*:(?<value1>\\d+)!(?<category>[^!]+)!->(?<value2>[\\da-z]+)",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
var planets = input
    .Select(p => regex.Match(p))
    .Select(m => new Planet
    {
        Name = m.Groups["name"].Value, // here and below, parts of the input string are accessed by group name
        Value1 = int.Parse(m.Groups["value1"].Value),
        Category = m.Groups["category"].Value,
        Value2 = m.Groups["value2"].Value
    })
    .ToList();
