Regex to extract substrings in C#

I have a string as:
string subjectString = @"(((43*('\\uth\Hgh.Green.two.190ITY.PCV')*9.8)/100000+('VBNJK.PVI.10JK.PCV'))*('ASFGED.Height Density.1JKHB01.PCV')/476)";
My expected output is:
Hgh.Green.two.190ITY.PCV
VBNJK.PVI.10JK.PCV
ASFGED.Height Density.1JKHB01.PCV
Here's what I have tried:
Regex regexObj = new Regex(@"'[^\\]*.PCV");
Match matchResults = regexObj.Match(subjectString);
string val = matchResults.Value;
This works when the input string is @"(((43*('\\uth\Hgh.Green.two.190ITY.PCV')*9.8)/100000+", but when the string grows and there is more than one substring to extract, I get undesired results.
How do I extract three substrings from the original string?

It seems you want to match word, whitespace and . chars before .PCV.
Use
[\w\s.]*\.PCV
See the regex demo
To force at least 1 word char at the start use
\w[\w\s.]*\.PCV
Optionally, if needed, add a word boundary at the start: @"\b\w[\w\s.]*\.PCV".
To force \w to match only ASCII letters and digits (and _), compile the regex object with the RegexOptions.ECMAScript option.
Here,
\w - matches any letter, digit or _
[\w\s.]* - matches 0+ whitespace, word or/and . chars
\. - a literal .
PCV - a PCV substring.
Sample usage:
var results = Regex.Matches(str, @"\w[\w\s.]*\.PCV")
.Cast<Match>()
.Select(m=>m.Value)
.ToList();
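To check the pattern against the string from the question, here is a minimal sketch written as C# top-level statements (C# 9 or later is assumed; on older compilers, wrap the statements in a Main method). It reuses the subjectString and results names from above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Input copied from the question; the verbatim string keeps the backslashes as-is.
string subjectString = @"(((43*('\\uth\Hgh.Green.two.190ITY.PCV')*9.8)/100000+('VBNJK.PVI.10JK.PCV'))*('ASFGED.Height Density.1JKHB01.PCV')/476)";

// One word char, then any run of word/whitespace/dot chars, then a literal ".PCV".
var results = Regex.Matches(subjectString, @"\w[\w\s.]*\.PCV")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

// Prints the three expected substrings, one per line.
foreach (var value in results)
    Console.WriteLine(value);
```

Regex.Matches returns every non-overlapping match, which is why it finds all three substrings where a single Regex.Match call stops at the first one.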

Related

Regex to match words between underscores after second occurrence of underscore

I would like to get the words between underscores after the second occurrence of an underscore.
this is my string
ABC_BC_BE08_C1000004_0124
I've assembled this expression
(?<=_)[^_]+
It matches what I need, but it only skips the first word, since there is no underscore before it. I would like it to skip ABC and BC and just get the last three strings. I've tried messing around, but I am stuck and can't make it work. Thanks!
You can use a non-regex approach here with Split and Skip:
var text = "ABC_BC_BE08_C1000004_0124";
var result = text.Split('_').Skip(2);
foreach (var s in result)
Console.WriteLine(s);
Output:
BE08
C1000004
0124
See the C# demo.
With regex, you can use
var result = Regex.Matches(text, @"(?<=^(?:[^_]*_){2,})[^_]+").Cast<Match>().Select(x => x.Value);
See the regex demo and the C# demo. The regex matches
(?<=^(?:[^_]*_){2,}) - a positive lookbehind that matches a location that matches the following patterns immediately to the left of the current location:
^ - start of string
(?:[^_]*_){2,} - two or more ({2,}) sequences of any zero or more chars other than _ ([^_]*) and then a _ char
[^_]+ - one or more chars other than _
In .NET there is also a Captures collection that you can use with a regex and a repeated capture group.
^[^_]*_[^_]*(?:_([^_]+))+
The pattern matches:
^ Start of string
[^_]*_[^_]* Match any char except an _, match _ and again any char except _
(?: Non capture group
_([^_]+) Match _ and capture 1 or more times any char except _ in group 1
)+ Close the non capture group and repeat 1 or more times
.NET regex demo | C# demo
For example:
var pattern = @"^[^_]*_[^_]*(?:_([^_]+))+";
var str = "ABC_BC_BE08_C1000004_0124";
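var strings = Regex.Match(str, pattern).Groups[1].Captures.Cast<Capture>().Select(c => c.Value); // Cast<Capture>() is needed on .NET Framework, where CaptureCollection is non-generic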
foreach (String s in strings)
{
Console.WriteLine(s);
}
Output
BE08
C1000004
0124
If you want to match only word characters in between the underscores, another option for a pattern could be using a negated character class [^\W_] excluding the underscore from the word characters in between:
^[^\W_]*_[^\W_]*(?:_([^\W_]+))+
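As a quick sanity check, the three approaches above can be run side by side on the sample string. This is only a sketch, written as C# top-level statements (C# 9 or later assumed); the names bySplit, byLookbehind and byCaptures are mine and do not come from the answers.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = "ABC_BC_BE08_C1000004_0124";

// Non-regex approach: split on '_' and drop the first two items.
var bySplit = text.Split('_').Skip(2).ToList();

// Lookbehind approach: each match must have at least two "xxx_" chunks to its left.
var byLookbehind = Regex.Matches(text, @"(?<=^(?:[^_]*_){2,})[^_]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

// Captures approach, using the [^\W_] variant: group 1 is repeated,
// and every repetition is kept in Groups[1].Captures.
var byCaptures = Regex.Match(text, @"^[^\W_]*_[^\W_]*(?:_([^\W_]+))+")
    .Groups[1].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToList();

Console.WriteLine(string.Join(", ", bySplit));
Console.WriteLine(string.Join(", ", byLookbehind));
Console.WriteLine(string.Join(", ", byCaptures));
```

All three lists hold BE08, C1000004 and 0124 for this input. Split with Skip is the easiest to read; the regex versions are mainly useful when the extraction has to be part of a larger pattern.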

Get the middle part of a filename using regex

I need a regex that can return up to 10 characters in the middle of a file name.
filename: returns:
msl_0123456789_otherstuff.csv -> 0123456789
msl_test.xml -> test
anythingShort.w1 -> anythingSh
I can capture the beginning and end for removal with the following regex:
Regex.Replace(filename, "(^msl_)|([.][[:alnum:]]{1,3}$)", string.Empty);
but I also need to have only 10 characters when I am done.
Explanation of the regex above:
(^msl_) - match lines that start with "msl_"
| - or
([.] - match a period
[[:alnum:]]{1,3} - followed by 1-3 alphanumeric characters
$) - at the end of the line
Note that [[:alnum:]] can't work in a .NET regex, because .NET does not support POSIX character classes. You may use \w (to match letters, digits, underscores) or [^\W_] (to match letters or digits).
You can use your regex and just keep the first 10 chars in the string:
new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$","").Take(10).ToArray())
See the C# demo online:
var strings = new List<string> { "msl_0123456789_otherstuff.csv", "msl_test.xml", "anythingShort.w1" };
foreach (var s in strings)
{
Console.WriteLine("{0} => {1}", s, new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$","").Take(10).ToArray()));
}
Output:
msl_0123456789_otherstuff.csv => 0123456789
msl_test.xml => test
anythingShort.w1 => anythingSh
Replacing with the alternation removes either alternative wherever it matches: msl_ at the start and the extension at the end. It also works when the extension is not present, but it does not limit the number of chars left in the middle.
If the file extension should be present you might use a capturing group and make msl_ optional at the beginning.
Then capture 1 to 10 word characters other than _, followed by optional word characters up to the .
^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$
.NET regex demo (Click on the table tab)
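A minimal sketch of how the capture-group pattern above could be applied, written as C# top-level statements (C# 9 or later assumed). The file name msl_my-file.txt is a made-up extra case, added only to show that a name containing a non-word character such as - is rejected by this stricter pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] fileNames =
{
    "msl_0123456789_otherstuff.csv",
    "msl_test.xml",
    "anythingShort.w1",
    "msl_my-file.txt" // hypothetical name, not from the question
};

// Optional "msl_" prefix, then up to 10 letters/digits captured in group 1,
// then the rest of the name (word chars only) and a required extension.
string middlePattern = @"^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$";

var middles = fileNames
    .Select(name => Regex.Match(name, middlePattern))
    .Where(m => m.Success)            // names that do not fit the pattern are skipped
    .Select(m => m.Groups[1].Value)
    .ToList();

foreach (var middle in middles)
    Console.WriteLine(middle);
```

Unlike the Replace plus Take(10) approach, this one silently drops names that do not fit the expected shape, so it only suits cases where such names should be ignored.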
A bit broader match could be using \S instead of \w and match until the last dot:
^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$
See another regex demo | C# demo
string[] strings = {"msl_0123456789_otherstuff.csv", "msl_test.xml","anythingShort.w1", "123456testxxxxxxxx"};
string pattern = @"^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$";
foreach (String s in strings) {
Match match = Regex.Match(s, pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1]);
}
}
Output
0123456789
test
anythingSh

Regex match with Arabic

I have a text in Arabic and I want to use Regex to extract numbers from it. Here is my attempt.
String :
"ما المجموع:
1+2"
Match match = Regex.Match(text, "المجموع: ([^\\r\\n]+)", RegexOptions.IgnoreCase);
It always returns false, and groups.value always returns null.
expected output:
match.Groups[1].Value //returns (1+2)
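The regex you wrote matches the word, then a colon, then a single space, and then 1 or more chars other than CR and LF (in a regular string literal, "[^\\r\\n]" reaches the regex engine as [^\r\n]). In your text the colon is followed by a line break, not a space, so the match fails.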
You want to match the whole line after the word, colon and any amount of whitespace chars:
var text = "ما المجموع:\n1+2";
var result = Regex.Match(text, @"المجموع:\s*(.+)")?.Groups[1].Value;
Console.WriteLine(result); // => 1+2
See the C# demo
Other possible patterns:
@"المجموع:\r?\n(.+)" // To match CRLF or LF line ending only
@"المجموع:\n(.+)" // To match just LF ending only
Also, if you run the regex against a long multiline text with CRLF endings, it makes sense to replace .+ with [^\r\n]+, since . in a .NET regex matches any char except LF and therefore also matches the CR symbol.
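The CRLF point is easy to miss, so here is a small sketch of it, written as C# top-level statements (C# 9 or later assumed). The input crlfText is a made-up variant of the question's text with CRLF line endings, and withDot and withClass are names of my own.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: the text from the question, but with CRLF line endings.
var crlfText = "ما المجموع:\r\n1+2\r\nnext line";

// "." stops only at LF, so the CR before it ends up inside the capture.
var withDot = Regex.Match(crlfText, @"المجموع:\s*(.+)").Groups[1].Value;

// [^\r\n]+ stops at either CR or LF, so the capture is clean.
var withClass = Regex.Match(crlfText, @"المجموع:\s*([^\r\n]+)").Groups[1].Value;

Console.WriteLine(withDot.Length);   // 4: "1+2" plus the trailing CR
Console.WriteLine(withClass.Length); // 3: just "1+2"
```

An alternative is to keep .+ and call TrimEnd('\r') on the captured value, at the cost of an extra step after every match.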

Regex match word followed by decimal from text

I want to be able to match the following examples and return array of matches
given text:
some word
another 50.00
some-more 10.10 text
another word
Matches should be a word, followed by a space, then a decimal number (optionally followed by another word):
another 50.00
some-more 10.10 text
I have the following so far:
string pat = @"\r\n[A-Za-z ]+\d+\.\d{1,2}([A-Za-z])?";
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
Match m = r.Match(input);
but it only matches first item: another 50.00
You do not account for - with [A-Za-z ] and only match some text after a newline.
You can use the following regex:
[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?
See the regex demo
The [\p{L}-]+ matches 1 or more letters and hyphens, \p{Zs}* matches 0 or more Unicode space separators (this does not include tabs), \d*\.?\d{1,2} matches a number that ends in 1 to 2 digits, with an optional integer part and dot before them, and (?:\p{Zs}*[\p{L}-]+)? matches an optional word after the number.
Here is a C# snippet matching all occurrences based on Regex.Matches method:
var res = Regex.Matches(str, @"[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?")
.Cast<Match>()
.Select(p => p.Value)
.ToList();
Just FYI: if you need to match whole words, you can also use word boundaries \b:
\b[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?\b
And just another note: if you need to match diacritics, too, you may add \p{M} to the character class containing \p{L}:
[\p{L}\p{M}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}\p{M}-]+)?\b
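To see the first pattern at work on the text from the question, here is a minimal sketch written as C# top-level statements (C# 9 or later assumed). It assumes the lines are separated by CRLF, as the \r\n in the asker's own pattern suggests; the variable name input is taken from the question's snippet.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text from the question, assuming CRLF line breaks.
var input = "some word\r\nanother 50.00\r\nsome-more 10.10 text\r\nanother word";

var res = Regex.Matches(input, @"[\p{L}-]+\p{Zs}*\d*\.?\d{1,2}(?:\p{Zs}*[\p{L}-]+)?")
    .Cast<Match>()
    .Select(p => p.Value)
    .ToList();

// Prints the two lines that contain a word followed by a number.
foreach (var line in res)
    Console.WriteLine(line);
```

Because \p{Zs} does not match CR or LF, a match can never run across a line break, so no multiline option or line anchors are needed here.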

Regex -> only letters and end with a dot

I'm trying to select all the tokens that contain only letters or only letters and end with a dot.
Example of valid words : "abc", "abc."
Invalid "a.b" "a2"
i've tried this
string[] tokens = text.Split(' ');
var words = from token in tokens
where Regex.IsMatch(token,"^[a-zA-Z]+.?$")
select token;
^[a-zA-Z]+ - only letters one or more times and start with letter
.?$ = ends with 0 or 1 dot ?? not sure about this
In regex, an unescaped . pattern matches any character (including digits). Thus, your regex would undesirably match tokens such as "a2".
You need to escape your dot character as \..
string[] tokens = text.Split(' ');
var words = from token in tokens
where Regex.IsMatch(token,@"^[a-zA-Z]+\.?$")
select token;
Edit: Furthermore, you can amalgamate your Split(' ') logic into your regex by using lookbehind and lookahead. This might improve efficiency, although it does reduce legibility a bit.
var words = Regex.Matches(text, @"(?<=\ |^)[a-zA-Z]+\.?(?=\ |$)")
.OfType<Match>()
.Select(m => m.Value);
The (?<=\ |^) lookbehind means that the match must be preceded by a space or start-of-string.
The (?=\ |$) lookahead means that the match must be followed by a space or end-of-string.
You need to escape .
^[a-zA-Z]+\.?$
Otherwise, . is a special character that matches (almost) all characters--not just periods.
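The difference between the escaped and unescaped dot can be seen on the four example tokens from the question. This is a small sketch, written as C# top-level statements (C# 9 or later assumed); the names unescaped, escaped and viaLookaround are mine.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] tokens = { "abc", "abc.", "a.b", "a2" };

// Unescaped dot: ".?" accepts any single trailing char, so "a2" slips through.
var unescaped = tokens.Where(t => Regex.IsMatch(t, "^[a-zA-Z]+.?$")).ToList();

// Escaped dot: only a literal "." is allowed after the letters.
var escaped = tokens.Where(t => Regex.IsMatch(t, @"^[a-zA-Z]+\.?$")).ToList();

// Lookaround variant from the answer, run on the space-joined text instead of split tokens.
var viaLookaround = Regex.Matches(string.Join(" ", tokens), @"(?<=\ |^)[a-zA-Z]+\.?(?=\ |$)")
    .OfType<Match>()
    .Select(m => m.Value)
    .ToList();

Console.WriteLine(string.Join(", ", unescaped));     // abc, abc., a2
Console.WriteLine(string.Join(", ", escaped));       // abc, abc.
Console.WriteLine(string.Join(", ", viaLookaround)); // abc, abc.
```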
