Regular Expression oddity, why does this happen?

This simple regular expression matches the text of Movie. Am I wrong in reading this as "Q repeated zero or more times"? Why does it match, shouldn't it return false?
using System;
using System.Text.RegularExpressions;

public class Program
{
    private static void Main(string[] args)
    {
        Regex regex = new Regex("Q*");
        string input = "Movie";
        if (regex.IsMatch(input))
        {
            Console.WriteLine("Yup.");
        }
        else
        {
            Console.WriteLine("Nope.");
        }
    }
}

As you say correctly, it means “Q repeated zero or more times”. In this case, it’s zero times, so you are essentially trying to match "" in your input string. As IsMatch doesn’t care where it matches, it can match the empty string anywhere within your input string, so it returns true.
If you want to make sure that the whole input string has to match, you can add ^ and $: "^Q*$".
Regex regex = new Regex("^Q*$");
Console.WriteLine(regex.IsMatch("Movie")); // false
Console.WriteLine(regex.IsMatch("QQQ")); // true
Console.WriteLine(regex.IsMatch("")); // true

You are right in reading this regex as Q repeated 0 or more times. The catch is the 0. When you run a regex, the engine tries to find any successful match.
The only way for this regex to match the string is to match an empty string (Q zero times), and an empty string can be found at every position between the characters. If you didn't know that before: yes, a regex can match empty strings between characters. You can try:
(Q*)
to get a capture group, and use .Matches and Groups[1].Value to see what has been captured. You'll see that it's an empty string.
Usually, if you want to check for the existence of a character, you don't use a regex but .Contains. Otherwise, if you do want to use a regex, you'd drop the quantifier, or use one that requires at least one occurrence of the character (such as Q+).
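
To make the empty matches visible, here is a minimal sketch that lists the position and length of every match. The class name EmptyMatchDemo and the helper Describe are made up for illustration; only Regex.Matches comes from the discussion above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmptyMatchDemo
{
    // Returns "index:length" for every match of the pattern in the input.
    public static string[] Describe(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Index + ":" + m.Length)
                    .ToArray();
    }

    public static void Main()
    {
        // Q* finds an empty match before each character and one at the end.
        Console.WriteLine(string.Join(" ", Describe("Movie", "Q*")));

        // Q+ requires at least one Q, so there are no matches at all.
        Console.WriteLine(Describe("Movie", "Q+").Length);
    }
}
```

For the five-character input "Movie", the first line should list six zero-length matches (indexes 0 through 5), and the second line should print 0.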

Related

Why isn't this C# regular expression working?

I tried to write an expression to validate the following pattern:
digit[0-9] at 1 time exactly
"dot"
digit[0-9] 1-2 times
"dot"
digit[0-9] 1-3 times
"dot"
digit[0-9] 1-3 times or “hyphen”
For example these are legal numbers:
1.10.23.5
1.10.23.-
these aren't:
10.10.23.5
1.254.25.3
I used RegexBuddy to write the following pattern:
[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}|[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.-
In RegexBuddy all seems perfect, but in my code I am getting true for illegal numbers (like 10.1.1.1).
I wrote the following method for validating this pattern:
public static bool IsVaildEc(string ec)
{
    try
    {
        if (String.IsNullOrEmpty(ec))
            return false;
        string pattern = @"[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}|[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.-";
        Regex check = new Regex(pattern);
        return check.IsMatch(ec);
    }
    catch (Exception ex)
    {
        //logger
        return false; // treat a failure as invalid input
    }
}
What am I doing wrong?

Your regex isn't anchored to the start and end of the string, so it also matches a substring (e.g. 0.1.1.1 in the string 10.1.1.1).
That is why the first "illegal" number is accepted: the regex matches a substring of it. The second number correctly fails to match because the three digits in the second octet can't be matched at all. Anchoring the whole alternation fixes the problem:
string pattern = @"^(?:[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}|[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.-)$";
Also, your regex is needlessly complicated. The following does the same but is simpler:
string pattern = @"^[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.(?:[0-9]{1,3}|-)$";
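
As a quick check of the anchored pattern, the sketch below runs it against the numbers mentioned in the question. The class name EcValidator and the method name IsValidEc are illustrative, not the asker's code. Note that in .NET `$` also matches just before a final newline; `\z` would be the stricter choice if that matters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcValidator
{
    // Simplified, fully anchored pattern: the alternation is limited
    // to the last component (1-3 digits or a hyphen).
    private static readonly Regex EcPattern =
        new Regex(@"^[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.(?:[0-9]{1,3}|-)$");

    public static bool IsValidEc(string ec)
    {
        return !string.IsNullOrEmpty(ec) && EcPattern.IsMatch(ec);
    }

    public static void Main()
    {
        string[] samples = { "1.10.23.5", "1.10.23.-", "10.10.23.5", "1.254.25.3", "10.1.1.1" };
        foreach (string s in samples)
        {
            Console.WriteLine(s + " -> " + IsValidEc(s));
        }
    }
}
```

The first two samples should be reported as valid and the last three as invalid, including 10.1.1.1, which the unanchored pattern accepted.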

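Try:
@"^(?:[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}|[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.-)$"
Your pattern does not start matching at the beginning of the text. The alternation has to be grouped, because a bare ^ in front would apply only to the first alternative, and the $ rejects trailing characters.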

If you match against "10.1.1.1", the "0.1.1.1" part of your string is a correct number, so the match returns true.
Matching against
@"^(?:[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}|[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.-)$"
with the ^ sign at the beginning means that the match has to start at the beginning of the string. The group makes ^ apply to both alternatives, and $ requires the match to run to the end.

You are missing the ^ char at the start of the regex (and the $ at the end).
Try this regex:
^(?:[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}|[0-9]\.[0-9]{1,2}\.[0-9]{1,3}\.-)$
This C# Regex Cheat Sheet can be handy

What is a good pattern to process each individual regex match through a method

I'm trying to figure out a pattern where I run a regex match on a long string, and each time it finds a match, it runs a replace on it. The thing is, the replace will vary depending on the matched value. This new value will be determined by a method. For example:
var matches = Regex.Match(myString, myPattern);
while (matches.Success) {
    Regex.Replace(myString, matches.Value, GetNewValue(matches.Groups[1]));
    matches = matches.NextMatch();
}
The problem (I think) is that if I run Regex.Replace, all of the match indexes get messed up, so the result ends up coming out wrong. Any suggestions?

If you replace each match with a fixed string, Regex.Replace does that for you. You don't need to iterate the matches:
Regex.Replace(myString, myPattern, "replacement");
Otherwise, if the replacement depends upon the matched value, use the MatchEvaluator delegate, as the 3rd argument to Regex.Replace. It receives an instance of Match and returns string. The return value is the replacement string. If you don't want to replace some matches, simply return match.Value:
string myString = "aa bb aa bb";
string myPattern = @"\w+";
string result = Regex.Replace(myString, myPattern,
    match => match.Value == "aa" ? "0" : "1");
Console.WriteLine(result);
// 0 1 0 1
If you really need to iterate the matches and replace them manually, you need to start replacement from the last match towards the first, so that the index of the string is not ruined for the upcoming matches. Here's an example:
var matches = Regex.Matches(myString, myPattern);
var matchesFromEndToStart = matches.Cast<Match>().OrderByDescending(m => m.Index);
var sb = new StringBuilder(myString);
foreach (var match in matchesFromEndToStart)
{
    if (IsGood(match))
    {
        sb.Remove(match.Index, match.Length)
          .Insert(match.Index, GetReplacementFor(match));
    }
}
Console.WriteLine(sb.ToString());
Just be careful that your matches do not contain nested instances. If they do, you either need to remove matches that lie inside another match, or rerun the regex pattern to generate new matches after each replacement. I still recommend the second approach, the one that uses the MatchEvaluator delegate.
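
For the case in the question, where the new value is computed by a method from a capture group, a named method can be passed as the MatchEvaluator. The following is only a sketch: the `{key}` placeholder syntax, the lookup table and the class name PlaceholderDemo are assumptions made for illustration, and GetNewValue merely stands in for whatever the asker's method does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Hypothetical lookup table standing in for the real replacement logic.
    private static readonly Dictionary<string, string> Values =
        new Dictionary<string, string> { { "name", "Ada" }, { "lang", "C#" } };

    // Called once per match; the return value replaces that match.
    private static string GetNewValue(Match match)
    {
        string key = match.Groups[1].Value;
        string value;
        // Unknown keys are left untouched by returning the original text.
        return Values.TryGetValue(key, out value) ? value : match.Value;
    }

    public static string Expand(string template)
    {
        // The method group converts to a MatchEvaluator delegate.
        return Regex.Replace(template, @"\{(\w+)\}", GetNewValue);
    }

    public static void Main()
    {
        Console.WriteLine(Expand("{name} writes {lang}, not {other}."));
        // Ada writes C#, not {other}.
    }
}
```

Because Regex.Replace builds the result in a single pass, the index bookkeeping that worried the asker never comes up.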

If I understand your question correctly, you want to perform a replace based on a constant regular expression, but the replacement text will change based on the actual text that the regex matches.
The Regex.Matches method (not the Match method) returns a MatchCollection of all the matches of your regex within the input string. Each Match contains information such as its position within the string (Index), the matched value (Value) and the length of the match (Length). If you iterate over this collection with a foreach loop, you can treat each match individually and perform string manipulations in which you dynamically modify the replacement value.

I would use something like
Regex regEx = new Regex("some.*?pattern");
string input = "someBLAHpattern!";
foreach (Match match in regEx.Matches(input))
{
    DoStuffWith(match.Value);
}

Using Regex to determine if string contains a repeated sequence of a particular substring with comma separators and nothing else

I want to find if a string contains a repeated sequence of a known substring (with comma separators) and nothing else and return true if this is the case; otherwise false. For example: the substring is "0,8"
String A: "0,8,0,8,0,8,0,8" returns true
String B: "0,8,0,8,1,0,8,0" returns false because of '1'
I tried using the C# string functions Contains but it does not suit my requirements. I am totally new to regular expression but I feel it should be powerful enough to do this. What RegEx should I use to do this?

The pattern for a string containing nothing but a repeated number of a given substring (possibly zero of them, resulting in an empty string) is \A(?:substring goes here)*\z. The \A matches the beginning of the string, the \z the end of the string, and the (?:...)* matches 0 or more copies of anything matching the thing between the colon and the close parenthesis.
But your string doesn't actually match \A(?:0,8)*\z, because of the extra commas; an example that would match is "0,80,80,80,8". You need to account for the commas explicitly with something like \A0,8(?:,0,8)*\z.
You can build such a thing in C# thus:
string okSubstring = "0,8";
string aOk = "0,8,0,8,0,8,0,8";
string bOk = "0,8,0,8,1,0,8,0";
Regex okRegex = new Regex(@"\A" + okSubstring + "(?:," + okSubstring + @")*\z");
okRegex.IsMatch(aOk); // True
okRegex.IsMatch(bOk); // False
That hard-codes the comma-delimiter; you could make it more general. Or maybe you just need the literal regex. Either way, that's the pattern you need.
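
One way to make it more general is sketched below, under the assumption that both the repeated unit and the delimiter are plain text rather than regex fragments. The names RepeatedSequence and IsRepeatOf are invented for this example. Regex.Escape is applied so that characters such as "." in the substring are matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedSequence
{
    // True if input is unit, optionally followed by (delimiter + unit)
    // any number of times, and nothing else.
    public static bool IsRepeatOf(string input, string unit, string delimiter)
    {
        string u = Regex.Escape(unit);
        string d = Regex.Escape(delimiter);
        string pattern = @"\A" + u + "(?:" + d + u + @")*\z";
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsRepeatOf("0,8,0,8,0,8,0,8", "0,8", ",")); // True
        Console.WriteLine(IsRepeatOf("0,8,0,8,1,0,8,0", "0,8", ",")); // False
        Console.WriteLine(IsRepeatOf("a.b;a.b", "a.b", ";"));         // True
        Console.WriteLine(IsRepeatOf("axb;a.b", "a.b", ";"));         // False
    }
}
```

Like \A0,8(?:,0,8)*\z, this version requires at least one copy of the substring, so the empty string is rejected.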
EDIT Changed the anchors per Mike Samuel's suggestion.

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx. I'm new to RegEx, so a Google search doesn't really help me.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong

You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every letter that is followed by the same letter. Each letter found is stored in a capturing group (because of the parentheses around it), and this capturing group is then reused inside the positive lookahead assertion (the (?=...)) via \1, which holds the matched character.
\p{L} matches any Unicode code point in the category "letter".
Code
String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
    Console.WriteLine(item.Index);
}
Output
0
1
2
5

The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other expression you're looking for
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
    Console.WriteLine(m.Index);
    // Restart one character after the start of the previous match,
    // as long as that start index is still inside the string.
    if (m.Index < textLength)
    {
        m = rx.Match(text, m.Index + 1);
    }
    else
    {
        m = null;
    }
}
Console.ReadKey();
It uses the option to set the start index of a regex search for each consecutive search. The underlying issue is that the regex engine, by default, always continues searching after the end of the previous match. So it will never find a match that overlaps another match unless you tell it to, either with a lookahead construct or by setting the start index manually.
Another, relatively easy, solution is to wrap the whole expression in a lookahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
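
Put together, the lookahead approach can be sketched as a small helper. OverlapFinder and FindStarts are names chosen for this example, and the helper assumes the caller passes a valid regex fragment.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapFinder
{
    // Returns the start index of every position at which the expression
    // matches, including positions that overlap an earlier match.
    public static int[] FindStarts(string text, string expression)
    {
        // The lookahead consumes nothing, so every match is empty and the
        // engine moves on by one character before trying again.
        Regex rx = new Regex("(?=" + expression + ")");
        return rx.Matches(text).Cast<Match>().Select(m => m.Index).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", FindStarts("aaaabaa", "aa")));
        // 0 1 2 5
    }
}
```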

Try this:
How can I find repeated characters with a regex in Java?
It is in Java, but both the regex and the non-regex approach are there. C# regex syntax is very similar to Java's.

regular expression that matches a string which comprises of only specific letters

I've tried several regex combinations to figure this out, but one condition or another always fails.
I have an input string that may only contain characters from a given, defined set,
let's say A, B or C.
How do I match something like this?
ABBBCCC -- isMatch True
AAASDFDCCC -- isMatch false
ps. I'm using C#

^[ABC]+$
should be enough: it uses a character class (also called a character set).
The anchors '^' and '$' are there only to ensure that the whole string, from start to end, contains nothing but those characters.
Regex.Match("ABACBA", "^[ABC]+$"); // => matches
A character set does not constrain the order of the characters matched. By contrast,
Regex.Match("ABACBA", "^A+B+C+$"); // => no match
enforces the order: one or more A, then one or more B, then one or more C.
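
As a small check against the strings from the question, here is a sketch using Regex.IsMatch. The class and method names are illustrative, and `\A`/`\z` are used instead of `^`/`$` only so that a trailing newline is rejected too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AllowedLetters
{
    // True if the input is non-empty and consists solely of A, B or C.
    public static bool OnlyAbc(string input)
    {
        return Regex.IsMatch(input, @"\A[ABC]+\z");
    }

    public static void Main()
    {
        Console.WriteLine(OnlyAbc("ABBBCCC"));    // True
        Console.WriteLine(OnlyAbc("AAASDFDCCC")); // False
        Console.WriteLine(OnlyAbc(""));           // False, because + needs at least one character
    }
}
```

With `*` instead of `+`, the empty string would be accepted as well.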

I think you are looking for this:
Match m = Regex.Match("abracadabra", "^[ABC]*$");
if (m.Success) {
    // Match (not reached for this input, which contains characters outside [ABC])
}
