C# regex for a specific character or nothing

Short version:
How do I match a single, specific character or nothing within a longer, potentially repeating, pattern?
Long version:
I'm forming a regex to count the occurrences of the string 'word' in strings that have this specific format:
a hyphen followed by an integer number (any length) followed by a hyphen followed by the string 'word' followed by a hyphen, potentially repeating.
E.g.
'-0-word-' (1 match)
'-10-word-' (1 match)
'-999-word-' (1 match)
'-1-word-1-word-' (2 matches)
'-1-word-1-word-222-word-' (3 matches) etc.
If the pattern repeats then I think the leading hyphen has to be optional as it is already the trailing hyphen for the previous match.
The best I have come up with so far is:
[-]?\d+-word-
which gives 3 matches for
'-1-word-1-word-222-word-'
but it also gives 3 matches for
'-1-word-1-word-X222-word-'
because the leading hyphen is optional and the 'X' is ignored. I want the leading hyphen to be only a hyphen or nothing. I want to make sure the whole string is rejected (no matches) if the format is not correct.
Thanks for your help!

^-\d+-word(?:-\d+-word)*-$
Try this. See the demo.
https://regex101.com/r/wU7sQ0/20
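As a rough sketch of how the anchored pattern above could be used for counting: because its repeated group is non-capturing, one option is to validate the whole string first and only then count the units with a simpler scan. The class name WordCounter and the method CountWords are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Hypothetical helper: returns the number of "<digits>-word" units,
    // or 0 when the string as a whole does not follow the format.
    public static int CountWords(string input)
    {
        // Step 1: validate the entire string with the anchored pattern.
        if (!Regex.IsMatch(input, @"^-\d+-word(?:-\d+-word)*-$"))
        {
            return 0;
        }

        // Step 2: the format is known to be valid, so a simple scan is safe.
        return Regex.Matches(input, @"\d+-word").Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWords("-1-word-1-word-222-word-"));  // 3
        Console.WriteLine(CountWords("-1-word-1-word-X222-word-")); // 0
    }
}
```

Splitting validation from counting keeps each pattern easy to read, at the cost of scanning the input twice.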

If you want to count the number of occurrences and check the string format at the same time, you can do this:
string input = "-1-word-1-word-222-word-";
// \A and \z anchor the whole string; group 1 repeats once per "-<digits>-word" unit.
string pattern = @"\A(-[0-9]+-word)+-\z";
Match m = Regex.Match(input, pattern);
if (m.Success) {
    // Each repetition of group 1 is kept in its Captures collection.
    Console.WriteLine(m.Groups[1].Captures.Count);
}
When you repeat a capture group, every capture is stored, and you can access them through the group's Captures property.

Please try this regex demo for the word occurrences.
Pattern is "-\d+-word(-\d+-word)*-"
http://regexr.com/3agkb

Related

How to perform a RegEx replace only if a separate filter is matched using .NET?

Given a string (a path) that matches /dir1/, I need to replace all spaces with dashes.
Ex: /dir1/path with spaces should become /dir1/path-with-spaces.
This could easily be done in 2 steps...
var rgx = new Regex(@"^\/dir1\/");
var path = "/dir1/path with spaces";
if (rgx.IsMatch(path))
{
    // Replace every space (or URL-encoded space) with a dash.
    path = (new Regex(@" |\%20")).Replace(path, "-");
}
Unfortunately for me, the application is already built with a simple RegEx replace and cannot be modified, so I need to have the RegEx do the work. I thought I had found the answer here:
regex: how to replace all occurrences of a string within another string, if the original string matches some filter
I was able to create and test (?:\G(?!^)|^(?=\/dir1\/.*$)).*?\K( |\%20), but then I learned it does not work in this app because \K is an unrecognized escape sequence (it is not supported in .NET).
I also tried a positive lookbehind, but I wasn't able to get it to replace all the spaces (only the last if the match was greedy or the first if not greedy). I could put in enough checks to handle the max number of spaces, but as soon as I check for 10 spaces, someone will pass in a path with 11 spaces.
Is there a RegEx only solution for this problem that will work in the .NET engine?
You can leverage the unlimited width lookbehind pattern in .NET:
Regex.Replace(path, @"(?<=^/dir1/.*?)(?: |%20)", "-")
See the regex demo
Regex details
(?<=^/dir1/.*?) - a positive lookbehind that matches a location that is immediately preceded with /dir1/ and then any zero or more chars other than a newline char, as few as possible
(?: |%20) - either a space or %20 substring.
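A small sketch of how the lookbehind pattern above behaves on a few paths. The class name PathFixer, the method Fix, and the extra sample paths are assumptions added for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathFixer
{
    // Hypothetical wrapper around the single Regex.Replace call from the answer.
    public static string Fix(string path)
    {
        // The lookbehind only succeeds when the text before the space
        // (or %20) starts with /dir1/, so other paths are left untouched.
        return Regex.Replace(path, @"(?<=^/dir1/.*?)(?: |%20)", "-");
    }

    public static void Main()
    {
        Console.WriteLine(Fix("/dir1/path with spaces")); // /dir1/path-with-spaces
        Console.WriteLine(Fix("/dir1/a%20b c"));          // /dir1/a-b-c
        Console.WriteLine(Fix("/dir2/path with spaces")); // /dir2/path with spaces
    }
}
```

This relies on .NET allowing quantifiers such as .*? inside a lookbehind, which many other regex engines reject.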

How to match a string between <>?

I tried \w+\:(\w+\-?\.?(\d+)?) but that is not correct.
I have the following text:
<staticText:HelloWorld>_<xmlNode:Node.03>_<date:yyy-MM-dd>_<time:HH-mm-ss-fff>
The end result I want is something like the following
["staticText:HelloWorld", "xmlNode:Node.03","date:yyy-MM-dd","time:HH-mm-ss-fff"]
You could use the following regex.
<(.*?)>
Then have a look at how groups work to retrieve the result.
Regex rx = new Regex("<(.*?)>");
string text = "<staticText:HelloWorld>_<xmlNode:Node.03>_<date:yyy-MM-dd>_<time:HH-mm-ss-fff>";
MatchCollection matches = rx.Matches(text);
Console.WriteLine(matches.Count);
foreach(Match match in matches){
var groups = match.Groups;
Console.WriteLine(groups[1]);
}
This line should be able to match the content:
<(.*?)>
The full match includes the angle brackets at each end, which you don't seem to want, but you can remove them afterwards without regex (or read capture group 1 instead).
You should consider a website like https://regexr.com - it helps a great deal when writing regex by letting you paste your cases and see how the pattern behaves on them.
Matches any string within the <>. Hope this helps.
<(.*?)>
Your pattern does not match the 3rd and the 4th part of the example data because in this part \w+\-?\.?(\d+)? the dash and the digits match only once and are not repeated.
For your example data, you might use a character class [\w.-]+ to match the part after the colon to make the match a bit more broad:
<(\w+\:[\w.-]+)>
Regex demo | C# demo
Or, to make it more specific, spell out a pattern for either the Node.03 part or the date and time parts, using a repeated group.
<(\w+\:\w+(?:\.\d+|\d+(?:-\d+)+)?)>
Explanation
< Match <
( Capturing group
\w+\:\w+ Match 1+ word chars, : and 1+ word chars
(?: Non capturing group
\.\d+ Match . and 1+ digits
| Or
\d+(?:-\d+)+ Match 1+ digits and repeat 1+ times matching - and 1+ digits
)? Close non capturing group and make it optional
) Close capturing group
> Match >
Regex demo | C# Demo
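To get the array asked for in the question, one possible sketch collects capture group 1 from every match of the broader pattern <(\w+:[\w.-]+)>. The class name TagExtractor and the method Extract are invented for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Hypothetical helper: returns the text inside each <...> pair.
    public static string[] Extract(string text)
    {
        return Regex.Matches(text, @"<(\w+:[\w.-]+)>")
                    .Cast<Match>()                  // MatchCollection -> IEnumerable<Match>
                    .Select(m => m.Groups[1].Value) // group 1 excludes the angle brackets
                    .ToArray();
    }

    public static void Main()
    {
        string text = "<staticText:HelloWorld>_<xmlNode:Node.03>_<date:yyy-MM-dd>_<time:HH-mm-ss-fff>";
        // Prints: staticText:HelloWorld, xmlNode:Node.03, date:yyy-MM-dd, time:HH-mm-ss-fff
        Console.WriteLine(string.Join(", ", Extract(text)));
    }
}
```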

Regex that validates a string consisting of 3-4 character extensions separated by semicolons

I would like to make a regex that validates that a string is in this format:
".xml;.mp4;.webm;.wmv;.ogg"
That is, file formats separated by semicolons.
What is the best way to do this?
We can try using the pattern ^(?:\.[A-Za-z0-9]{3,4})(?:;\.[A-Za-z0-9]{3,4})*$:
Regex regex = new Regex(@"^(?:\.[A-Za-z0-9]{3,4})(?:;\.[A-Za-z0-9]{3,4})*$");
Match match = regex.Match(".xml;.mp4;.webm;.wmv;.ogg");
if (match.Success)
{
Console.WriteLine("MATCH");
}
Explanation:
^ from the start of the string
(?:\.[A-Za-z0-9]{3,4}) match a dot followed by 3-4 alphanumeric characters
(?:;\.[A-Za-z0-9]{3,4})* then match a semicolon followed by a dot and 3-4 alphanumeric characters, zero or more times
$ match the end of the string
Side note: I used ?: inside the terms in parentheses, which in theory should tell the regex engine not to capture these terms. This might improve performance, though perhaps at the cost of the pattern being slightly less readable.
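A quick sketch that runs the pattern above against a few inputs. The class name ExtensionListValidator, the method IsValid, and every sample other than the one from the question are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtensionListValidator
{
    private static readonly Regex Pattern =
        new Regex(@"^(?:\.[A-Za-z0-9]{3,4})(?:;\.[A-Za-z0-9]{3,4})*$");

    // Hypothetical helper: true when the input is a semicolon-separated
    // list of dot-prefixed extensions of 3 or 4 alphanumeric characters.
    public static bool IsValid(string input)
    {
        return Pattern.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid(".xml;.mp4;.webm;.wmv;.ogg")); // True
        Console.WriteLine(IsValid(".xml"));                      // True  (single entry)
        Console.WriteLine(IsValid(".xml;"));                     // False (trailing semicolon)
        Console.WriteLine(IsValid("xml;mp4"));                   // False (missing dots)
        Console.WriteLine(IsValid(".jpeg2;.mp4"));               // False (5 characters)
    }
}
```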
Something like this should work, but you need to decide how a list with only one format is handled (for example, whether a single format must be followed by a semicolon).
^(?:\.[a-zA-Z0-9]+;)*\.[a-zA-Z0-9]+$

Regex that removes the 2 trailing letters from a string not preceded with other letters

This is in C#. I've been racking my brain but have had no luck so far.
So for example
123456BVC --> 123456BVC (keep the same)
123456BV --> 123456 (remove trailing letters)
12345V --> 12345V (keep the same)
12345 --> 12345 (keep the same)
ABC123AB --> ABC123 (remove trailing letters)
It can start with anything.
I've tried @".*[a-zA-Z]{2}$" but no luck.
This is in C# so that I always return a string removing the two trailing letters if they do exist and are not preceded with another letter.
Match result = Regex.Match(mystring, pattern);
return result.Value;
Your @".*[a-zA-Z]{2}$" regex matches any 0+ characters other than a newline (as many as possible) and 2 ASCII letters at the end of the string. You do not check the context, so the 2 letters are matched regardless of what comes before them.
You need a regex that will match the last two letters not preceded with a letter:
(?<!\p{L})\p{L}{2}$
See this regex demo.
Details:
(?<!\p{L}) - fails the match if a letter (\p{L}) is found before the current position (you may use [a-zA-Z] if you only want to deal with ASCII letters)
\p{L}{2} - 2 letters
$ - end of string.
In C#, use
var result = Regex.Replace(mystring, @"(?<!\p{L})\p{L}{2}$", string.Empty);
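As a sanity check, here is a sketch that applies the negative-lookbehind replacement to the sample inputs from the question. The class name TrailingLetters and the method Strip are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingLetters
{
    // Hypothetical helper: removes exactly two trailing letters,
    // but only when no letter comes directly before them.
    public static string Strip(string value)
    {
        return Regex.Replace(value, @"(?<!\p{L})\p{L}{2}$", string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(Strip("123456BVC")); // 123456BVC (three letters, unchanged)
        Console.WriteLine(Strip("123456BV"));  // 123456
        Console.WriteLine(Strip("12345V"));    // 12345V (one letter, unchanged)
        Console.WriteLine(Strip("12345"));     // 12345
        Console.WriteLine(Strip("ABC123AB"));  // ABC123
    }
}
```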
If you're looking to remove those last two letters, you can simply do this:
string result = Regex.Replace(originalString, @"[A-Za-z]{2}$", string.Empty);
Remember that in regex $ means the end of the input or the string before a newline.

Regex search for a string like "$12,56,450" using C#

I want to search for a string like "$12,56,450" using Regex in C#, but my pattern doesn't match the string.
Here is my code:
string input="Total earn for the year $12,56,450";
string pattern = @"\b(?mi)($12,56,450)\b";
Regex regex = new Regex(pattern);
if (regex.Match(input).Success)
{
return true;
}
This Regex will do the job, (?mi)(\$\d{2},\d{2},\d{3}), and here's a Regex 101 to prove it.
Now let's break it down a little:
\$ matches the literal $ at the start of the amount
\d{2} matches any two digits
, matches the literal ,
\d{2} matches any two digits
, matches the literal ,
\d{3} matches any three digits
Now, for the purposes of the demonstration I removed the word boundaries, \b, but I'm also pretty confident you don't need them anyway. See, word boundaries aren't generally necessary for such a finite string match. Consider their definition:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
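You need to escape $ and the other special regex characters. Also drop the leading \b: there is no word boundary between a space and $, because neither is a word character, so it would prevent the match.
Try this: @"(?mi)(\$12,56,450)\b";
If you want, you can use \d to match a digit, and \d{2,3} to match a run of 2 or 3 digits.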
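When the goal is to find a fixed literal such as "$12,56,450", one option is to let Regex.Escape add the backslashes rather than escaping by hand. This is a minimal sketch; the class name AmountFinder and the method ContainsLiteral are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmountFinder
{
    // Hypothetical helper: searches for a literal string by escaping
    // any regex metacharacters (such as $) before matching.
    public static bool ContainsLiteral(string input, string literal)
    {
        string pattern = Regex.Escape(literal); // "$12,56,450" becomes \$12,56,450
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        string input = "Total earn for the year $12,56,450";
        Console.WriteLine(ContainsLiteral(input, "$12,56,450")); // True
        Console.WriteLine(ContainsLiteral(input, "$99,99,999")); // False
    }
}
```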
