RegEx match on any of multiple groups - c#

I'm not sure if this is possible, but I would like to match on multiple regex groups
(^[0-9]) (^[$][0-9]) (^[$]{2}[0-9])
It would match the string if the first character is a number, or if the first character is a $ followed by a number, or if the first two characters are $$ followed by a number.
Example strings that would match:
15271%
$3C001%
$$8244150928223C001%
Can this be done in one go, or would I have to check each match individually?
Any help is appreciated. Thanks!

You can make use of the pipe symbol | to achieve that. It basically behaves like an "or" in your regex pattern.
For example:
(banana|apple)
would match both "banana" and "apple".
In your case, you can also use a pattern like this
(\${0,2}\d.+)
to match all options: without $, with one $ and with two $.
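As a minimal sketch of the optional-prefix idea, the snippet below checks the sample strings from the question plus two that should be rejected. The names PrefixDemo and StartsWithOptionalDollars are made up for illustration, and the pattern is anchored with ^ and drops the trailing .+ because only the prefix is being tested; that is an assumption about what is wanted, not the pattern above verbatim.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixDemo
{
    // Hypothetical helper: true when the string starts with a digit,
    // optionally preceded by one or two '$' characters.
    public static bool StartsWithOptionalDollars(string s)
    {
        return Regex.IsMatch(s, @"^\${0,2}\d");
    }

    public static void Main()
    {
        string[] samples = { "15271%", "$3C001%", "$$8244150928223C001%", "$$$1", "abc" };
        foreach (string s in samples)
        {
            // The first three print True, the last two print False.
            Console.WriteLine("{0} -> {1}", s, StartsWithOptionalDollars(s));
        }
    }
}
```

If the whole string should be consumed rather than just tested, the same prefix can be followed by .* as in the other answer.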

You could use:
^\d.*|^\$\d.*|^\$\$\d.*
try {
    if (Regex.IsMatch(subjectString, @"\A(?:^\d.*|^\$\d.*|^\$\$\d.*)\z", RegexOptions.Multiline)) {
        // Successful match
    } else {
        // Match attempt failed
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

Related

Remove optional last parenthesis

I'm trying to parse a file name and remove a trailing number in parentheses (as added when several files share the same base name), but only the last one.
Here are some expected results:
Test ==> Test
Test (1) ==> Test
Test (1) (2) ==> Test (1)
Test (123) (232) ==> Test (123)
Test (1) foo ==> Test (1) foo
I tried this regex: (.*)( ?\(\d+\))+, but the first test fails.
I also tried (.*)( ?\(\d+\))? but then only the first test succeeds.
I suspect there's something wrong with the quantifiers in the regex, but I couldn't find exactly what.
How can I fix my regex?
My guess is that you might likely want to design an expression similar to:
^(.*?)\s*(\(\s*\d+\)\s*)?$
Test
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string pattern = @"^(.*?)\s*(\(\s*\d+\)\s*)?$";
        // The lines of this verbatim string are deliberately left unindented.
        string input = @"Test
Test (1)
Test (1) (2)
Test (1) (2) (3)
Test (1) (2) (3) (4)
";
        RegexOptions options = RegexOptions.Multiline;
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
RegEx Circuit
jex.im visualizes regular expressions:
Just use a neg. lookahead:
\s*\([^()]+\)(?!.*\([^()]+\))
See a demo on regex101.com.
In verbose form, this is:
\s* # optional whitespace
\([^()]+\) # (...)
(?!.*\([^()]+\)) # neg. lookahead, no (...) must follow
As an alternative you could use an end of string / line anchor:
Regular Expression
\s*\(\d+\)$
Example usage
string resultString = null;
try {
    resultString = Regex.Replace(subjectString, @"\s*\(\d+\)$", "", RegexOptions.Multiline);
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
Human Readable
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line) \s*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Match the opening parenthesis character \(
Match a single character that is a “digit” (any decimal number in any Unicode script) \d+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Match the closing parenthesis character \)
Assert position at the end of a line (at the end of the string or before a line break character) (line feed) $
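To see how the end-anchored pattern behaves on the question's five test names, here is a small sketch. TrailingNumberDemo and StripLast are illustrative names, and RegexOptions.Multiline is left out on the assumption that each file name is processed as its own string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingNumberDemo
{
    // Hypothetical helper: removes one trailing "(digits)" group
    // together with any whitespace in front of it.
    public static string StripLast(string name)
    {
        return Regex.Replace(name, @"\s*\(\d+\)$", "");
    }

    public static void Main()
    {
        string[] names = { "Test", "Test (1)", "Test (1) (2)", "Test (123) (232)", "Test (1) foo" };
        foreach (string name in names)
        {
            Console.WriteLine("{0} ==> {1}", name, StripLast(name));
        }
    }
}
```

Because the group has to sit directly before the end of the string, "Test (1) foo" is left untouched and only the last group of "Test (1) (2)" is removed.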
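You can avoid regular expressions altogether. If you simply want everything before the last parenthesized number, you could do:
string example = @"Test (1) (2) (3) (4)";
public string GetPathName(string input)
{
    // Find where the last parenthesized group starts.
    var position = input.LastIndexOf('(');
    if (position == -1)
        return input;
    // Keep everything before it and drop the separating space.
    return input.Substring(0, position).TrimEnd();
}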
You know that the left parenthesis will always be at the start of the ending name, so why not find the index to that, then grab the rest from position zero? I know you requested Regular Expression, but if you do not need it why over engineer for it?
You could use your first pattern (.*)( ?\(\d+\))+ and replace with the first capturing group only.
To optimize it a bit, you could remove the quantifier + after the last group and omit the second capturing group.
Then this will remove the last parenthesis with a number between by matching until the end of the string and then backtrack until the last occurrence of parenthesis with a digit.
In the replacement use the first capturing group:
^(.*) \(\d+\)
Explanation
^ Start of string
(.*) Capture group 1, match any char 0+ times
 \(\d+\) Match a space, then ( followed by 1+ digits and )
.NET Regex demo | C# demo

C# Watin Find.ByText with Regex

I have the following problem here:
I'm trying to get an element from a webpage using Watin's Find.ByText. However, I fail to use regex in C#.
This statement will return the desired element.
return this.Document.Element(Find.ByText("781|262"));
When I try to use regex, I get back the whole page.
return this.Document.Element(Find.ByText(new Regex(@"781\|262")));
I am trying to get this element:
<td>781|262</td>
I also tried
return this.Document.Element(Find.ByText(Predicate));
private bool Predicate(string s)
{
return s.Equals("781|262");
}
The above works, while this does not:
private bool Predicate(string s)
{
    return new Regex(@"781\|262").IsMatch(s);
}
I now realized, in the predicate s is the whole page content. I guess the issue is with Document.Element.
Any help appreciated, thank you.
Try with :
return this.Document.Element(Find.ByText(new Regex("781\\|262")));
or
return this.Document.Element(Find.ByText(new Regex("781|262")));
Choose the one that fits your needs, I don't know if the character "\" is significant for you.
You don't need the string to be a verbatim string in order to instantiate the regex class.
Well, I did not realize the Regex will also match the body/html element too, since the pattern is obviously also included in them. I had to specify that the text must begin and end with the pattern by using ^ and $, so it only matches the desired element:
^781\u007c262$
\u007c matches |, I used this since MSDN documentation also did.
The final code:
<td>781|262</td>
return Document.TableCell(Find.ByText(new Regex(@"^\d{3}\|\d{3}$")));
Document.TableCell speeds up the search by only trying the Regex on td elements.
@ is used to prevent C# from interpreting the \ as an escape sequence.
^ is used to only match elements whose text begins with the following pattern
\d{3} match a digit 0-9 3 times
\| match | literally
\d{3} match a digit 0-9 3 times
$ the element's text must also end with this pattern
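The effect of the anchors can be reproduced without WatiN. The sketch below is only a stand-in: the two strings are invented to play the role of the text of an enclosing element and of the target td, and AnchorDemo is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // The same pattern, without and with the ^ and $ anchors.
    public static readonly Regex Unanchored = new Regex(@"\d{3}\|\d{3}");
    public static readonly Regex Anchored = new Regex(@"^\d{3}\|\d{3}$");

    public static void Main()
    {
        // Made-up stand-ins for the text of an enclosing element and of the target <td>.
        string[] texts = { "Villages: 781|262 and more", "781|262" };
        foreach (string text in texts)
        {
            Console.WriteLine("'{0}': unanchored={1}, anchored={2}",
                text, Unanchored.IsMatch(text), Anchored.IsMatch(text));
        }
    }
}
```

The unanchored regex accepts both strings, which is why an outer element was returned; the anchored one accepts only the text that consists of nothing but the pattern.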

.NET Regex: negate previous character for the first character in string

Consider following string
"Some" string with "quotes" and \"pre-slashed\" quotes
Using regex, I want to find all the double quotes with no slash before them. So I want the regex to find four matches for the example sentence
This....
[^\\]"
...would find only three of them. I suppose that's because of the regex's state machine which is first validating the command to negate the presence of the slash.
That means I need to write a regex with some kind of look-behind, but I don't know how to work with lookaheads and lookbehinds... I'm not even sure that's what I'm looking for.
The following attempt returns 6, not 4 matches...
"(?<!\\)
"(?<!\\")
Is what you're looking for
If you want to match "Some" and "quotes", then
(?<!\\")(?!\\")"[a-zA-Z0-9]*"
will do
Explanation:
(?<!\\") - Negative lookbehind. Specifies a group that can not match before your main expression
(?!\\") - Negative lookahead. Specifies a group that can not match after your main expression
"[a-zA-Z0-9]*" - String to match between regular quotes
Which means - match anything that doesn't come with \" before and \" after, but is contained inside double quotes
You almost got it, move the quote after the lookbehind, like:
(?<!\\)"
Also beware of cases like
"escaped" backslash \\"string\"
You can use an expression like this to handle those:
(?<!\\)(?:\\\\)*"
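To compare the patterns on the two example strings from this thread, here is a small counting sketch. QuoteDemo and Count are names invented for the illustration; inside a C# verbatim string each literal double quote is written as "".

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteDemo
{
    // Number of matches of the pattern in the input.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // "Some" string with "quotes" and \"pre-slashed\" quotes
        string sample = @"""Some"" string with ""quotes"" and \""pre-slashed\"" quotes";
        // "escaped" backslash \\"string\"
        string tricky = @"""escaped"" backslash \\""string\""";

        Console.WriteLine(Count(sample, @"[^\\]"""));             // 3: the quote at index 0 has no character before it
        Console.WriteLine(Count(sample, @"(?<!\\)"""));           // 4
        Console.WriteLine(Count(tricky, @"(?<!\\)"""));           // 2: misses the quote after the escaped backslash
        Console.WriteLine(Count(tricky, @"(?<!\\)(?:\\\\)*"""));  // 3
    }
}
```

Note that with the last pattern the match includes any pairs of backslashes in front of the quote, so the match position is not always the position of the quote itself.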
Try this
(?<!\\)(?<qs>"[^"]+")
Explanation
(?<!\\)(?<qs>"[^"]+")
Options: case insensitive
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
Match the character “\” literally «\\»
Match the regular expression below and capture its match into backreference with name “qs” «(?<qs>"[^"]+")»
Match the character “"” literally «"»
Match any character that is NOT a “"” «[^"]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “"” literally «"»
code
try {
    if (Regex.IsMatch(subjectString, @"(?<!\\)(?<qs>""[^""]+"")", RegexOptions.IgnoreCase)) {
        // Successful match
    } else {
        // Match attempt failed
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

regular expression ".*[^a-zA-Z0-9_].*"

As I am reading more about regular expressions in C#, I just want to check a conclusion I have reached.
For the expression ".*[^a-zA-Z0-9_].*", the ".*" at the beginning and end are useless, is that right? As I understand it, ".*" means zero or more occurrences of any character, and "[^a-zA-Z0-9_]" means any single character other than a letter, a digit or an underscore, so adding ".*" before and after "[^a-zA-Z0-9_]" changes nothing. Is that right?
Here is the code I am using to check if the expressions matches
// Here we call Regex.Match.
Match match = Regex.Match("anytest#", ".*[^a-z A-Z0-9_].*");
//Match match = Regex.Match("anytest#", "[^a-z A-Z0-9_]");
// Here we check the Match instance.
if (match.Success)
Console.WriteLine("error");
else
Console.WriteLine("no error");
.*[^a-zA-Z0-9_].* will match the entire input as long as there is a non-alphanumeric/underscore character somewhere in the input. [^a-zA-Z0-9_] will match only a single non-alphanumeric/underscore character (the first one in the input, since the search runs left to right) if there is one. Which one you want depends on the input and what you want to do once you find out if a non-alphanumeric/underscore character exists in the input.
The only difference would be whether the "margin characters" will be included in the result or not.
For:
ab41--_71j
It will match the whole string:
ab41--_71j
And without the .* at the beginning and end it will match a single character, the first -:
-
Any string will match the .*[^a-zA-Z0-9_].* regex at least once as long as it has at least one character that isn't a-zA-Z0-9_
From your latest comment, I understand that you actually use:
^[a-zA-Z0-9]*$
This will match only if all characters are digit/letters.
If it doesn't match, then the string is invalid.
If you also want to allow the _ character, then use:
^[a-zA-Z0-9_]*$
Which can even be shortened to (in .NET, \w also accepts non-ASCII letters and digits unless RegexOptions.ECMAScript is used):
^\w*$
In general, it is better to make regexes validate rather than invalidate strings. It just makes more sense and is more intuitive.
So my validation would look like:
if (Regex.IsMatch("anytest#", "^\\w*$"))
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Error");
}
Another option that is probably faster:
if ("anytest#".ToCharArray().All(c => char.IsLetterOrDigit(c) || c == '_'))
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Error");
}
And if you don't want '_' to be included, it can even look nicer;
if ("anytest#".ToCharArray().All(char.IsLetterOrDigit))
{
Console.WriteLine("Success");
}
else
{
Console.WriteLine("Error");
}
No, because there are characters other than a-z, A-Z and 0-9.
That regex matches any string that contains at least one character outside a-zA-Z0-9_, with any characters before and after it.
If you leave out the .*, you just have a regex that matches a single character that is not in a-zA-Z0-9_.
.*[^a-zA-Z0-9_].* matches for instance: ABC_ß_ABC
[^a-zA-Z0-9_] matches for instance: ß (and this regex just matches 1 character)
Input 1 : ABC_ß_ABC
Input 2 : ß
Regex 1: .*[^a-zA-Z0-9_].*
Regex 2: [^a-zA-Z0-9_]
Both inputs match both regexes.
For input 1
Regex 1 matches 9 characters
Regex 2 matches only 1 character
Only include those tokens in the Regex that you are actually looking for. In your case you didn't actually care whether there are any other characters before or after the excluding character class you specified. Adding .* before and after that doesn't change the success of the match, but makes matching more complicated. A Regex matches anywhere already, unless you specifically anchor it somehow, e.g. using ^ at the start.
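The difference between the two forms shows up in what is matched, not in whether a match is found. A minimal sketch, with WildcardDemo and FirstMatch as illustrative names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WildcardDemo
{
    // Returns the text of the first match, or null when there is none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string padded = ".*[^a-zA-Z0-9_].*";
        string bare = "[^a-zA-Z0-9_]";
        foreach (string input in new[] { "anytest#", "ab41--_71j", "anytest" })
        {
            Console.WriteLine("{0}: padded='{1}', bare='{2}'",
                input, FirstMatch(input, padded), FirstMatch(input, bare));
        }
    }
}
```

For "anytest#" the padded pattern returns the whole string and the bare one returns "#"; for "ab41--_71j" they return the whole string and "-"; for "anytest" neither matches. Success or failure is the same in every case.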

regex for capturing digits and digit ranges

I have the following string:
Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)
I want to capture:
2121,323.222
2–2.4
0.5
That is, I want the above three results from the string.
Can anyone help me with this regex?
I noticed that the hyphen in 2–2.4kg is not really a hyphen; it is Unicode U+2013, EN DASH.
So, here is another regex in C#:
@"[0-9]+([,.\u2013-][0-9]+)*"
Test
MatchCollection matches = Regex.Matches("Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)", @"[0-9]+([,.\u2013-][0-9]+)*");
foreach (Match m in matches) {
    Console.WriteLine(m.Groups[0]);
}
Here are the results. My console does not support printing the Unicode character U+2013, so it shows as "?", but it is properly matched.
2121,323.222
2?2.4
0.5
Okay I didn't notice the C# tag until now. I will leave the answer but I know that's not what you expected, see if you can do something with it. Perhaps the title should have mentioned the programming language?
Sure:
Fat mass loss was (.*) greater for GPLC \((.*) vs. (.*)kg\)
Find your substrings in \1, \2 and \3.
If for Emacs, swap all parentheses and escaped parentheses.
How about something like this:
^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$
A little more general, I think. I'm a little concerned about .* being greedy.
Fat mass loss was 2121,323.222 greater
for GPLC (2–2.4kg vs. 0.5kg)
a generalized extractor:
/\D+?([\d\,\.\-]+)/g
explanation:
/ # start pattern
\D+ # 1 or more non-digits
( # capture group 1
[\d,.-]+ # character class, 1 or more of digits, comma, period, hyphen
) # end capture group 1
/g # trailing regex g modifier (make regex continue after last match)
sorry I don't know c# well enough for a full writeup, but the pattern should plug right in.
see: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx for some implementation examples.
I came out with something like this atrocity:
-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?(?:[–-]-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?)?
In which -?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))? is repeated twice, with [–-] in the middle (note that the first character in the class is a long hyphen, an en dash).
This should take care of dots and commas outside of numbers, eg: hello,23,45.2-7world - will capture 23,45.2-7.
It looks like you're trying to find all numbers in the string (possibly with commas inside the number), and all ranges of numbers such as "2-2.4". Here is a regex that should work:
\d+(?:[,.-]\d+)*
From C# 3, you can use it like this:
var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
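var pattern = @"\d+(?:[,.-]\d+)*";
var matches = Regex.Matches(input, pattern);
// Declare the loop variable as Match: on older frameworks MatchCollection only enumerates as object.
foreach (Match match in matches)
    Console.WriteLine(match.Value);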
Hmm, this is a tricky question, especially because the input string contains unicode character – (EN DASH) instead of - (HYPHEN-MINUS). Therefore the correct regex to match the numbers in the original string would be:
\d+(?:[\u2013,.]\d+)*
A more generic approach would be:
\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*
which matches dash punctuation, connector punctuation and other punctuation. See here for more information about those.
An implementation in C# would look like this:
string input = "Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)";
try {
    Regex rx = new Regex(@"\d+(?:[\p{Pd}\p{Pc}\p{Po}\p{C}]\d+)*", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    Match match = rx.Match(input);
    while (match.Success) {
        // matched text: match.Value
        // match start: match.Index
        // match length: match.Length
        match = match.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
Let's try this one :
(?=\d)([0-9,.-]+)(?<=\d)
It captures all expressions containing only :
"[0-9,.-]" characters,
must start with a digit "(?=\d)",
must finish with a digit "(?<=\d)"
It works with a single digit expression and does not include beginning or trailing [.,-].
Hope this helps.
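A quick way to try the lookaround version is sketched below. NumberRangeDemo and Extract are made-up names, and the input is the question's sentence with the en dash replaced by an ASCII hyphen (an assumption, since the character class above only lists -), plus an extra number followed by a period to show that trailing punctuation is dropped.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberRangeDemo
{
    // Hypothetical helper: returns every run of digits, commas, periods and
    // hyphens that both starts and ends with a digit.
    public static string[] Extract(string input)
    {
        return Regex.Matches(input, @"(?=\d)([0-9,.-]+)(?<=\d)")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg). Total: 1,234.";
        // Prints: 2121,323.222 | 2-2.4 | 0.5 | 1,234
        Console.WriteLine(string.Join(" | ", Extract(input)));
    }
}
```

To handle the original en dash as well, \u2013 would have to be added to the character class, as the other answers do.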
I got the solution to my problem.
The following is the Regex that gave my desired result:
(([0-9]+)([–.,-]*))+
