Finding characters between parentheses with a .NET Regex

How can I get 98 from the expression $RetailTransaction.IsContainsTender(98) using Regex?

As usual in such circumstances, you should first ask yourself what the data will look like (with more than a single example) and what to expect from it.
The easiest route may be just the regex
\d+
But this will fail if there are more digits in the line than the ones you want.
You could take parentheses into account:
(?<=\()\d+(?=\))
This uses a lookbehind and lookahead assertion so that the number is the complete match (and not tucked away in a capturing group).
You can also use other context, e.g. the method name:
(?<=IsContainsTender\()\d+(?=\))
All of these things can make the regex more robust against unwanted data that might accidentally match, but that's a tradeoff only you can make, because I have only a single example to work with here. If all you ever need is to match a 98, then 98 is a valid regex and does what you want with the above example. Hence my plea: think harder about the cases you want to match and the cases where an overly simplistic approach might give you trouble.
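As a rough sketch of how the method-name variant above might be used from C#, the following compares the lookaround form with an ordinary capturing group. The class and method names are made up for illustration, and the capturing-group pattern is an alternative I am adding, not something from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TenderIdExample
{
    // Lookbehind/lookahead: the whole match is just the digits.
    private static readonly Regex Lookaround =
        new Regex(@"(?<=IsContainsTender\()\d+(?=\))");

    // Alternative (my assumption): match the context and capture the digits.
    private static readonly Regex Capturing =
        new Regex(@"IsContainsTender\((\d+)\)");

    public static string ExtractWithLookaround(string input)
    {
        Match m = Lookaround.Match(input);
        return m.Success ? m.Value : null;
    }

    public static string ExtractWithGroup(string input)
    {
        Match m = Capturing.Match(input);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string input = "$RetailTransaction.IsContainsTender(98)";
        Console.WriteLine(ExtractWithLookaround(input)); // 98
        Console.WriteLine(ExtractWithGroup(input));      // 98
    }
}
```

Both return the same digits here; the lookaround form keeps the number as the full match, while the group form needs Groups[1].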

Related

Password complexity regex with number or special character

I've got some regex that'll check incoming passwords for complexity requirements. However it's not robust enough for my needs.
((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,20})
It ensures that a password meets minimum length, contains characters of both cases and includes a number.
However I need to modify this so that it can contain a number and/or an allowed special character. I've even been given a list of allowed special characters.
I have two problems: delimiting the special characters, and making the first condition do an and/or match for number or special character.
I'd really appreciate advice from one of the regex gods round these parts.
The allowed special characters are: #%+\/'!#$^?:.(){}[]~-_

If I understand your question correctly, you're looking for a way to additionally require a special character. This could be done as follows (see the last lookahead):
((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!§$%&/(/)]).{8,20})
See a demo for this approach here on regex101.com.
However, you can improve your expression further. The dot-star (.*) runs to the end of the line and then backtracks. If you have a password of, say, 10 characters and four lookaheads need to be fulfilled, you'll need at least 40 steps (more, in fact, as the engine needs to backtrack).
To optimize your expression, you could use the negation of each required character class (for example \D*\d instead of .*\d), so that the engine finishes sooner. Additionally, as already pointed out in the comments, do not limit your maximum password length.
In the language of regular expressions, this would come down to:
((?=\D*\d)(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=.*[!§$%&/(/)]).{8,})
With the first approach, 63 steps are needed, while the optimized version needs only 29 (less than half). Regarding your second question, allowing a digit or a special character, you could simply use an alternation (|) like so:
((?:(?=\D*\d)|(?=.*[!§$%&/(/)]))(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z]).{8,})
Or put the \d in the brackets as well, like so:
((?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=.*[\d!§$%&/(/)]).{8,})
This one would consider both ConsideredAgoodPassw!rd and C0nsideredAgoodPassword to be good passwords.
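A minimal sketch of how the last pattern could be wired into C# follows. The character class is copied from the answer as-is (it is not the asker's full list of allowed characters), the ^ and $ anchors are my addition so that the whole string is validated, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PasswordCheck
{
    // One lowercase, one uppercase, and a digit OR one of the listed specials;
    // at least 8 characters. Anchors added as an assumption.
    private static readonly Regex Policy = new Regex(
        @"^(?=[^a-z]*[a-z])(?=[^A-Z]*[A-Z])(?=.*[\d!§$%&/(/)]).{8,}$");

    public static bool IsAcceptable(string password)
    {
        return Policy.IsMatch(password);
    }

    public static void Main()
    {
        Console.WriteLine(IsAcceptable("ConsideredAgoodPassw!rd")); // True
        Console.WriteLine(IsAcceptable("C0nsideredAgoodPassword")); // True
        Console.WriteLine(IsAcceptable("NoDigitOrSpecial"));        // False
        Console.WriteLine(IsAcceptable("Short1A"));                 // False
    }
}
```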

Verify that my Regex works as expected

It's my first regex for production code. Until now I've always avoided writing them myself, and now I'm a bit worried whether it really works as expected. I made a lot of attempts to break it, but I really don't want to rely on that, especially when I have zero experience.
My regex should match exactly this pattern
first character must be one of the letters (not case sensitive) - K,C,M,X,S,W
second character must be a digit from 0-9
a hyphen -
4 alphanumeric characters (A-Z or 0-9) (not case sensitive) and
one letter (A-Z) (not case sensitive).
And that's it. It can't be shorter, it can't be longer, it must match exactly this pattern. What I have for now is this:
string RegExPattern = @"^(K|C|M|X|S|W){1}[0-9]{1}[-]{1}[A-Z0-9]{4}[A-Z]{1}$";
if (!Regex.IsMatch(txtCode.Text, RegExPattern, RegexOptions.IgnoreCase))
{
    MessageBox.Show("Fail");
    return false;
}
Is there any tool, or some other way to verify the behavior of a regex and is this regex correct for the matching pattern I explained above?

Yes, that is correct.
However, all the {1} quantifiers are redundant, you can use a character set for the first character instead of the | operator, and you don't need a set for the dash:
string RegExPattern = @"^[KCMXSW][0-9]-[A-Z0-9]{4}[A-Z]$";
There are tools for writing and testing regular expressions, but you can only use them to test any variations in the input that you can think of, and it seems that you have already done that.

A nice tool to verify and develop regular expressions: http://www.debuggex.com. Nevertheless, I'd advise you to back up your regular expressions with a set of unit tests.

The best tool is a suite of unit tests, or a single test that iterates over several dozen chunks of text.
Create a text file that has a whole bunch of lines of text similar to the data this pattern will be used against. Make sure that some lines match, and that other lines each fail a different part of the rule (e.g. a line that matches everything but the first character, one that matches everything but the last character, one with only 2 or 3 alphanumeric characters rather than four, etc.).
Then, write a small program that reads each line of text and runs your expression against it. Have it print the line numbers of the lines that match, and then compare that list of numbers against your expected results.
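The small program described above might look roughly like this. To keep the sketch self-contained, the test data is an inline table rather than a text file, and every sample value and name in it is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CodePatternTests
{
    private static readonly Regex CodePattern = new Regex(
        @"^[KCMXSW][0-9]-[A-Z0-9]{4}[A-Z]$", RegexOptions.IgnoreCase);

    // Hypothetical samples: each failing case breaks a different part of the rule.
    private static readonly (string Input, bool Expected)[] Cases =
    {
        ("K1-AB12C", true),
        ("k1-ab12c", true),    // case-insensitive
        ("A1-AB12C", false),   // first letter not allowed
        ("KA-AB12C", false),   // second character not a digit
        ("K1AB12C", false),    // missing hyphen
        ("K1-AB1C", false),    // only 3 alphanumerics before the last letter
        ("K1-AB121", false),   // last character not a letter
        ("K1-AB12CD", false),  // too long
    };

    // Returns the inputs whose actual result differs from the expected one.
    public static List<string> FindFailures()
    {
        var failures = new List<string>();
        foreach (var (input, expected) in Cases)
        {
            if (CodePattern.IsMatch(input) != expected)
                failures.Add(input);
        }
        return failures;
    }

    public static void Main()
    {
        List<string> failures = FindFailures();
        Console.WriteLine(failures.Count == 0
            ? "All cases behave as expected"
            : "Unexpected results for: " + string.Join(", ", failures));
    }
}
```

One case worth adding if input can contain line breaks: in .NET, $ also matches just before a trailing newline, so "K1-AB12C\n" would pass; \z is the stricter end anchor.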

How does non-backtracking subexpression work "(?>exp)"

I am trying to become better at regular expressions. I am having a hard time understanding what (?> expression ) means. Where can I find more info on non-backtracking subexpressions? The description at this link says:
Greedy subexpression, also known as a non-backtracking subexpression.
This is matched only once and then does not participate in
backtracking.
This other link, http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx, also has a definition of non-backtracking subexpression, but I am still having a hard time understanding what it means, and I cannot think of an example where I would use (?>exp).

As always, regular-expressions.info is a good place to start.
Use an atomic group if you want to make sure that whatever has once been matched will stay part of the match.
For example, to match a number of "words" that may or may not be separated by spaces, then followed by a colon, a user tried the regex:
(?:[A-Za-z0-9_.&,-]+\s*)+:
When there was a match, everything was fine. But when there wasn't, his PC would become non-responsive with 100% CPU load. The cause was catastrophic backtracking: the regex engine would vainly try every way of splitting the text into "words" that might allow the following colon to match, which was of course impossible.
By using an atomic group, this could have been prevented:
(?>[A-Za-z0-9_.&,-]+\s*)+:
Now whatever has been matched stays matched - no backtracking and therefore fast failing times.
Another good example I recently came across:
If you want to match all numbers that are not followed by an ASCII letter, you might want to use the regex \d+(?![A-Za-z]). However, this will fail with inputs like 123a because the regex engine will happily return the match 12 by backtracking until the following character is no longer a letter. If you use (?>\d+)(?![A-Za-z]), this won't happen. (Of course, \d+(?![\dA-Za-z]) would also work)
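The 123a example can be checked with a few lines of C#. This is only a sketch; the class and helper names are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicDigitsDemo
{
    // Returns the first match of the pattern, or null when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Greedy \d+ backtracks from "123" to "12", which is followed by '3', not a letter.
        Console.WriteLine(FirstMatch("123a", @"\d+(?![A-Za-z])") ?? "(no match)");      // 12

        // The atomic group refuses to give back digits, so no match is found.
        Console.WriteLine(FirstMatch("123a", @"(?>\d+)(?![A-Za-z])") ?? "(no match)");  // (no match)

        // The alternative without an atomic group behaves the same on this input.
        Console.WriteLine(FirstMatch("123a", @"\d+(?![\dA-Za-z])") ?? "(no match)");    // (no match)

        // A number that is not followed by a letter still matches in full.
        Console.WriteLine(FirstMatch("123 a", @"(?>\d+)(?![A-Za-z])") ?? "(no match)"); // 123
    }
}
```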

The Regex Tutorial has a page on it here: http://www.regular-expressions.info/atomic.html
Basically what it does is discards backtracking information, meaning that a(?>bc|b)c matches abcc but not abc.
The reason it doesn't match the second string is because it finds a match with bc, and discards backtracking information about the bc|b alternation. It essentially forgets the |b part of it. Therefore, there is no c after the bc, and the match fails.
The most useful method of using atomic groups, as they are called, is to optimize slow regexes. You can find more information on the aforementioned page.
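A quick way to see the a(?>bc|b)c behaviour for yourself is a sketch like the following, with an ordinary non-capturing group added for comparison (the class and method names are made up):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicAlternationDemo
{
    // True when the whole input matches the pattern.
    public static bool FullMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        // Atomic group: once "bc" is taken, the "b" alternative is forgotten.
        Console.WriteLine(FullMatch("abcc", "a(?>bc|b)c")); // True
        Console.WriteLine(FullMatch("abc", "a(?>bc|b)c"));  // False

        // Ordinary group: the engine backtracks into the alternation and tries "b".
        Console.WriteLine(FullMatch("abc", "a(?:bc|b)c"));  // True
    }
}
```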

Read up on possessive quantifiers. A possessive quantifier such as [a-z]*+ makes the backtracking engine remember only the last step that matched, not all of the previous steps that matched.
This is useful when a lot of acceptable steps are probable and they would eat up memory if each step were stored for possible backtracking.
Possessive quantifiers are a shorthand for atomic groups: [a-z]*+ is equivalent to (?>[a-z]*). As far as I know, the .NET regex engine does not accept the possessive syntax, so use the atomic group form there.

Is my C# Reg-ex correct?

Is this Regex correct if I have to match a string which is at least 7 characters long, not more than 20 characters, has at least 1 number, and at least 1 letter? It has no other constraints.
[0-9]+[A-Za-z]+{7,20}
Thanks

No, it's not. The quantifier {7,20} has nothing to apply to: it directly follows another quantifier (+), and you cannot stack a second quantifier on a token that already has one (in this case [A-Za-z]). Repetition in regexes is expressed with quantifiers such as *, +, ? or the more general {n,m}; a lazy form like *? is a single quantifier in its own right, so it doesn't fall under this rule. You'll need something like the following:
^(?=.*\d)(?=.*[a-zA-Z]).{7,20}$
This has two lookaheads making sure of at least one digit and at least one letter:
(?=.*\d)
(?=.*[a-zA-Z])
Lookarounds are zero-width assertions; they do not consume characters in the string so they are merely matching a position. But they make sure that the expression inside of them would match at the current point. In this case this expression would match arbitrarily many characters and then would require a digit or a letter, respectively.
The actual match itself,
.{7,20}
just makes sure the length matches. Which characters are used is irrelevant here because the lookaheads above have already checked those constraints.
Finally the whole expression is anchored in that a start-of-string and end-of-string anchor are inserted at the start and end:
^...$
This makes sure that the match really encompasses the whole string. While not strictly necessary in this case (it would match the whole string anyway in all valid cases) it's often a good idea to include because usually regexes match only substrings and this can lead to subtle problems where validation regexes match even though they should fail. E.g. using \d+ to make sure a string consists only of digits would match the string a4b which puzzles beginners quite often.
I also changed the pattern so that the order of letters and numbers doesn't matter. Your regex looks like it tries to impose a definite order where all numbers need to come before all letters, which usually isn't what's wanted here.
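Here is a small sketch that exercises the points above in C#: the corrected pattern, the anchoring pitfall with \d+, and the fact that the original pattern is rejected when the Regex is constructed. The names are hypothetical; catching ArgumentException relies on .NET reporting pattern syntax errors through that exception type.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LengthAndContentCheck
{
    private static readonly Regex Valid =
        new Regex(@"^(?=.*\d)(?=.*[a-zA-Z]).{7,20}$");

    public static bool IsValid(string s)
    {
        return Valid.IsMatch(s);
    }

    // True when the pattern can be parsed by the .NET regex engine.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("abc1234"));  // True: 7 chars, letter and digit
        Console.WriteLine(IsValid("1234abc"));  // True: order does not matter
        Console.WriteLine(IsValid("abcdefg"));  // False: no digit
        Console.WriteLine(IsValid("ab1"));      // False: too short

        // Unanchored \d+ finds the 4 inside a4b; the anchored form does not.
        Console.WriteLine(Regex.IsMatch("a4b", @"\d+"));    // True
        Console.WriteLine(Regex.IsMatch("a4b", @"^\d+$"));  // False

        // The pattern from the question stacks {7,20} on top of +.
        Console.WriteLine(Compiles("[0-9]+[A-Za-z]+{7,20}")); // False
    }
}
```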

How to Find All Matches in Regular Expressions when one Overlaps OR Contains the Other?

The question of how to find every match when they might overlap was asked in Overlapping matches in Regex. However, as far as I can see, the answers there do not cover a more general case.
How can we find all substrings that begin with "a" and end with "z"? For example, given "akzzaz", it should find "akz", "akzz", "az" and "akzzaz".
Since there may be more than one match starting at the same position, ("akz" and "akzz") and also there may be more than one match ending at the same position ("az" and "akzzaz") I cannot see how using a lookahead or lookbehind helps as in the mentioned link. (Also, please bear in mind that in the general case "a" and "z" might be more complex regular expressions)
I use C#, so, in case it matters, having any feature specific to .Net Regular Expressions is OK.

Regular expressions are designed to find one match at a time. Even a global match operation is simply repeated applications of the same regex, each starting at the end of the previous match in the target string. So no, regexes are not able to find all matches in this way.
I will stick my neck out and say that I don't believe you can even find "all strings beginning with 'a' in 'akzzaz'" with a regex. /(a.*)/g will find the entire string, while /(a.*?)/g will find just 'a' twice.
The way I would code this would be to locate all 'a's, and search each of the substrings from there to the end of the string for all 'z's. So search 'akzzaz' and 'az' for 'z', giving 'akz', 'akzz', 'akzzaz', and 'az'. That is a fairly simple thing to do, but not a job for a regex unless the actual 'a' and 'z' tokens are complex.
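The approach just described could be sketched in C# as follows, for the simple case where the start and end tokens are single characters. The class and method names are invented; with more complex tokens the two character comparisons would have to be replaced by regex matches.

```csharp
using System;
using System.Collections.Generic;

public static class OverlappingSubstrings
{
    // Collects every substring that starts with 'first' and ends with a later 'last'.
    public static List<string> FindAll(string text, char first, char last)
    {
        var results = new List<string>();
        for (int start = 0; start < text.Length; start++)
        {
            if (text[start] != first) continue;

            // From this start, every later 'last' closes one result.
            for (int end = start + 1; end < text.Length; end++)
            {
                if (text[end] == last)
                    results.Add(text.Substring(start, end - start + 1));
            }
        }
        return results;
    }

    public static void Main()
    {
        // Prints: akz, akzz, akzzaz, az
        Console.WriteLine(string.Join(", ", FindAll("akzzaz", 'a', 'z')));
    }
}
```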

For your current problem, String.StartsWith and String.EndsWith would do a better job. A regular expression is not necessarily faster in all cases.

Try this regular expression
a[akz]+z - in case a, k and z are the only characters
a[a-z]+z - in case any lowercase letter may appear

I think it's worth noting that there is actually a way for a regex to return more than one match at the same time. Although this doesn't answer your question, I think this would be a good place to mention this for others who may run into a similar situation.
The regex below, for example, returns every right-hand substring (suffix) of a string, with each match holding one of them in its capturing group:
(?=(\w+)).
This regex uses a capturing group inside a zero-width assertion, and for each match at position i (one per character) the capturing group holds the substring of length n-i that starts there.
Doing anything that would require the regex engine to stay in the same place after a match is probably overkill for a regular expression approach.
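To illustrate, this sketch runs the (?=(\w+)). pattern over a short word and collects group 1 from each match. The class and method names are hypothetical, and the input is assumed to consist of word characters only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SuffixCaptureDemo
{
    // Returns the text captured by the lookahead at each match position.
    public static List<string> Suffixes(string text)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(text, @"(?=(\w+))."))
        {
            // The match itself is one character; the group holds the rest of the word.
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Prints: abcd, bcd, cd, d
        Console.WriteLine(string.Join(", ", Suffixes("abcd")));
    }
}
```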
