Using '*' instead of '+' in a regex check

Using '*' instead of '+' in a regex check - c#

I have a regex check:
Match matchLeft = Regex.Match(Name.Substring(subName.Length), #"\d*");
This basically checks for the first digits at the end of the subName. Now, I have noticed that with the use of * in the regex (* = 0 or more), if the next characters are not digits, it will return nothing. If they are however, it will return the string of digits.
But
If I use #"\d+" instead, it will look for 1 or more digits, and return the first instance of digits, regardless of there position after the substring.
So if I had a string ("abcdef123") and a substring ("abc"):
#"\d*" would match null
#"\d+" would match "123"
Alternatively, if the substring was "abcdef", both would match "123".
So my question is - why does the use of * return nothing if the directly following characters are not digits? Will this occur every time?

When you get the substring you end up with def123. The following are true:
\d+ tries to get at least one match in the string and will greedily match more. It must traverse the string to find the first match, arriving at the 123.
On the other hand, \d* will start at the beginning of the string and will successfully match the start of the string with zero digits. Even though it is greedy, it is completely satisfied with matching zero digits. It is a successful match and is zero-width.
You can change this behavior by making it \d*$ to anchor at the end of the matched string.

I think you answered your question yourself. This behavior is default and will occur every time.
See Quantifier Cheat Sheet
A+ One or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
Since \d* can match an empty string it will match an empty string as regex engine always tries to return a valid match, and can even match empty substrings at the beginning, end and between characters in a string.

Related

Regular expression to match 0 or more occurrences not working

This question may sound stupid. But I have tried several options and none of them worked.
I have the following in a string variable.
string myText="*someText*someAnotherText*";
What I mean by above is that, there can be 0 or more characters before "someText". There can be 0 or more characters after "someText" and before "someAnotherText". Finally, there can be 0 or more occurrences of any character before "someAnotherText".
I tried the following.
string res= Regex.Replace(searchFor.ToLower(), "*", #"\S*");
It didn't work.
Then I tried the following.
string res= Regex.Replace(searchFor.ToLower(), "*", #"\*");
Even that didn't work.
Can someone help pls ?
Even though I have mentioned "*" to indicate 0 or more occurrences, it says that I haven't mentioned the number of occurrences.

Unlike the DOS wildcard character, the * character in a regular expression means repeat the previous item (character, group, whatever) 0 or more times. In your regular expression the first * has no preceding character, the second one follows the t character, so will repeat that any number of times.
To get '0 or more of any character' you need to use the composition .* where . is 'any character' and * is '0 or more times'.
In other words to search for someText followed any number of characters later by someAnotherText you would use the following Regex:
var re = new Regex(#"someText.*someAnotherText");
Note that unless you specify otherwise by putting start/end specifiers in (^ for start of string, $ for end) the Regex will match any substring of the test string.
Tests for the above, all returning true:
re.IsMatch("This is someText, followed by someAnotherText with text after.");
re.IsMatch("someTextsomeAnotherText");
re.IsMatch("start:someTextsomAnotherText:end");
And so on.
In Regex terms * is a quantifier. Other quantifiers are:
? Match 0 or 1
+ Match 1 or more
{n} Match 'n' times
{n,} Match at least 'n' times
{n,m} Match 'n' to 'm' times
All apply to the preceding term in the Regex.
Placing a ? after another quantifier (including ?) will convert it to lazy form, where it will match as few items as it can. This will allow following terms to also match the terms you specified.

The regular expression to match 0 or more occurrences of any character is
.*
where . matches any single character and * matches zero or more occurrences of it.
(This answer is a quick reference simplification of the current answer.)

Explain the Regex mentioned

Can any one please explain the regex below, this has been used in my application for a very long time even before I joined, and I am very new to regex's.
/^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$/
As far as I understand
this regex will validate
- for a minimum of 6 chars to a maximum of 10 characters
- will escape the characters like ^ and $
also, my basic need is that I want a regex for a minimum of 6 characters with 1 character being a digit and the other one being a special character.

^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$
^ is called an "anchor". It basically means that any following text must be immediately after the "start of the input". So ^B would match "B" but not "AB" because in the second "B" is not the first character.
.* matches 0 or more characters - any character except a newline (by default). This is what's known as a greedy quantifier - the regex engine will match ("consume") all of the characters to the end of the input (or the end of the line) and then work backwards for the rest of the expression (it "gives up" characters only when it must). In a regex, once a character is "matched" no other part of the expression can "match" it again (except for zero-width lookarounds, which is coming next).
(?=.{6,10}) is a lookahead anchor and it matches a position in the input. It finds a place in the input where there are 6 to 10 characters following, but it does not "consume" those characters, meaning that the following expressions are free to match them.
(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z]) is another lookahead anchor. It matches a position in the input where the following text contains four letters ([a-zA-Z] matches one lowercase or uppercase letter), but any number of other characters (including zero characters) may be between them. For example: "++a5b---C#D" would match. Again, being an anchor, it does not actually "consume" the matched characters - it only finds a position in the text where the following characters match the expression.
(?=.*\d.*\d) Another lookahead. This matches a position where two numbers follow (with any number of other characters in between).
.* Already covered this one.
$ This is another kind of anchor that matches the end of the input (or the end of a line - the position just before a newline character). It says that the preceding expression must match characters at the end of the string. When ^ and $ are used together, it means that the entire input must be matched (not just part of it). So /bcd/ would match "abcde", but /^bcd$/ would not match "abcde" because "a" and "e" could not be included in the match.
NOTE
This looks like a password validation regex. If it is, please note that it's broken. The .* at the beginning and end will allow the password to be arbitrarily longer than 10 characters. It could also be rewritten to be a bit shorter. I believe the following will be an acceptable (and slightly more readable) substitute:
^(?=(.*[a-zA-Z]){4})(?=(.*\d){2}).{6,10}$
Thanks to #nhahtdh for pointing out the correct way to implement the character length limit.

Check Cyborgx37's answer for the syntax explanation. I'll do some explanation on the meaning of the regex.
^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$
The first .* is redundant, since the rest are zero-width assertions that begins with any character ., and .* at the end.
The regex will match minimum 6 characters, due to the assertion (?=.{6,10}). However, there is no upper limit on the number of characters of the string that the regex can match. This is because of the .* at the end (the .* in the front also contributes).
This (?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z]) part asserts that there are at least 4 English alphabet character (uppercase or lowercase). And (?=.*\d.*\d) asserts that there are at least 2 digits (0-9). Since [a-zA-Z] and \d are disjoint sets, these 2 conditions combined makes the (?=.{6,10}) redundant.
The syntax of .*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z] is also needlessly verbose. It can be shorten with the use of repetition: (?:.*[a-zA-Z]){4}.
The following regex is equivalent your original regex. However, I really doubt your current one and this equivalent rewrite of your regex does what you want:
^(?=(?:.*[a-zA-Z]){4})(?=(?:.*\d){2}).*$
More explicit on the length, since clarity is always better. Meaning stay the same:
^(?=(?:.*[a-zA-Z]){4})(?=(?:.*\d){2}).{6,}$
Recap:
Minimum length = 6
No limit on maximum length
At least 4 English alphabet, lowercase or uppercase
At least 2 digits 0-9

REGEXPLANATION
/.../: slashes are often used to represent the area where the regex is defined
^: matches beginning of input string
.: this can match any character
*: matches the previous symbol 0 or more times
.{6,10}: matches .(any character) somewhere between 6 and 10 times
[a-zA-Z]: matches all characters between a and z and between A and Z
\d: matches a digit.
$: matches the end of input.
I think that just about does it for all the symbols in the regex you've posted

For your regex request, here is what you would use:
^(?=.{6,}$)(?=.*?\d)(?=.*?[!##$%&*()+_=?\^-]).*
And here it is unrolled for you:
^ // Anchor the beginning of the string (password).
(?=.{6,}$) // Look ahead: Six or more characters, then the end of the string.
(?=.*?\d) // Look ahead: Anything, then a single digit.
(?=.*?[!##$%&*()+_=?\^-]) // Look ahead: Anything, and a special character.
.* // Passes our look aheads, let's consume the entire string.
As you can see, the special characters have to be explicitly defined as there is not a reserved shorthand notation (like \w, \s, \d) for them. Here are the accepted ones (you can modify as you wish):
!, #, #, $, %, ^, &, *, (, ), -, +, _, =, ?
The key to understanding regex look aheads is to remember that they do not move the position of the parser. Meaning that (?=...) will start looking at the first character after the last pattern match, as will subsequent (?=...) look aheads.

Does regex + symbol apply to previous element only?

In order to match all strings beginning with 04 and only containing digits, will the following work?
Regex.IsMatch(str, "^04[0-9]+$")
Or is another set of brackets necessary?
Regex.IsMatch(str, "^04([0-9])+$")

In Regex:
[character_group]
Matches any single character in character_group.
\d
Matches any decimal digit.
+
Matches the previous element one or more times.
(subexpression)
Captures the matched subexpression and assigns it a ordinal number.
^
The match must start at the beginning of the string or line.
$
The match must occur at the end of the string or before \n at the end of the line or string.
so that this code could be helpful:
Regex.IsMatch(str, "^04\d+$")
and all of your code works correctly.

Your first regex is correct, but the second one isn't. It matches the same things as the first regex, but it does a lot of unnecessary work in the process. Check it out:
Regex.IsMatch("04123", #"^04([0-9])+$")
In this example, the 1 is captured in group #1, only to be overwritten by 2 and again by 3. It's almost never a good idea to add a quantifier to a capturing group. For a detailed explanation, read this.
But maybe it's precedence rules you're asking about. Quantifiers have higher precedence than concatenation, so there's no need to isolate the character class with parentheses (if that's what you're doing).

Nothing else but Regex for matching the string

I want to check whether there is string starting from number and then optional character with the help of the regex.So what should be the regex for matching the string which must be started with number and then character might be there or not.Like there is string "30a" or "30" it should be matched.But if there is "a" or some else character or sereis of characters, string should not be matched.

Sounds like there should be able to be any number of numeric characters at the beginning followed by optional other characters. To match any other character after a series of numbers at the beginning I would use:
\d+.*
To match only alpha numeric characters after the mandatory numeric beginning I would use:
\d+\w*
Note: as pointed out by Dav, if you add a ^ to the start of the expression and a $ to the end of the expression like this ^\d+\w*$ you will ensure the whole string matches. However if you leave those off, you will be able to search the input string for what you need. It just depends on what your needs are.

^\d.*
The ^ matches the start of the string, \d matches a single digit, and then the .* matches any number of additional characters.
Thus, the net result is that it will only match if the string begins with a digit.

why do these regex tests let certain characters pass?

I am checking a string with the following regexes:
[a-zA-Z0-9]+
[A-Za-z]+
For some reason, the characters:
.
-
_
are allowed to pass, why is that?

If you want to check that the complete string consists of only the wanted characters you need to anchor your regex like follows:
^[a-zA-Z0-9]+$
Otherwise every string will pass that contains a string of the allowed characters somewhere. The anchors essentially tell the regular expression engine to start looking for those characters at the start of the string and stop looking at the end of the string.
To clarify: If you just use [a-zA-Z0-9]+ as your regex, then the regex engine would rightfully reject the string -__-- as the regex doesn't match against that. There is no single character from the character class you defined.
However, with the string a-b it's different. The regular expression engine will match the first a here since that matches the expression you entered (at least one of the given characters) and won't care about the - or the b. It has done its job and successfully matched a substring according to your regular expression.
Similarly with _-abcdef- – the regex will match the substring abcdef just fine, because you didn't tell it to match only at the start or end of the string; and ignore the other characters.
So when using ^[a-zA-Z0-9]+$ as your regex you are telling the regex engine definitely that you are looking for one or more letters or digits, starting at the very beginning of the string right until the end of the string. There is no room for other characters to squeeze in or hide so this will do what you apparently want. But without the anchors, the match can be anywhere in your search string. For validation purposes you always want to use those anchors.

In regular expressions the + tells the engine to match one or more characters.
So this expression [A-Za-z]+ passes if the string contains a sequence of 1 or more alphabetic characters. The only strings that wouldn't pass are strings that contain no alphabetic characters at all.
The ^ symbol anchors the character class to the beginning of the string and the $ symbol anchors to the end of the string.
So ^[A-Za-z0-9]+ means 'match a string that begins with a sequence of one or more alphanumeric characters'. But would allow strings that include non-alphanumerics so long as those characters were not at the beginning of the string.
While ^[A-Za-z0-9]+$ means 'match a string that begins and ends with a sequence of one or more alphanumeric characters'. This is the only way to completely exclude non-alphanumerics from a string.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Using '*' instead of '+' in a regex check - c#

Related

Regular expression to match 0 or more occurrences not working

Explain the Regex mentioned

Does regex + symbol apply to previous element only?

Nothing else but Regex for matching the string

why do these regex tests let certain characters pass?

Categories

Resources