Regular Expression for given scenario - c#

I already have an email address regular expression based on the RFC 2822 format:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
but I want to modify it to enforce the following additional conditions:
at least one full stop
at least one @ character
no consecutive full stops
must not start/end with special characters, i.e. should only start/end with [0-9a-zA-Z]
should still follow the RFC specification rules.
Currently the expression above allows the email to start with special characters. It also allows two consecutive full stops (except in the domain name, which is handled correctly: test@test..com fails, as it should).
Thanks.

^[a-zA-Z0-9]+(?:\.?[\w!#$%&'*+/=?^`{|}~\-]+)*@[a-zA-Z0-9](?:\.?[\w\-]+)+\.[A-Za-z0-9]+$
No .. and at least one . and one @.
Also starts/ends with letters/numbers.
The ^ (start) and $ (end) were just added to match a whole string, not just a substring. But you could replace those by a word boundary \b.
An alternative where the special characters aren't hardcoded:
^(?!.*[.]{2})[a-zA-Z0-9][^@\s]*?@[a-zA-Z0-9][^@\s]*?\.[A-Za-z0-9]+$
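A quick way to compare the two patterns is to run them against a handful of addresses. The sketch below assumes @ is the separator between local part and domain; the sample addresses are made up for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// First pattern from the answer: allowed special characters are listed explicitly.
var explicitPattern = new Regex(
    @"^[a-zA-Z0-9]+(?:\.?[\w!#$%&'*+/=?^`{|}~\-]+)*@[a-zA-Z0-9](?:\.?[\w\-]+)+\.[A-Za-z0-9]+$");

// Alternative pattern: a negative lookahead rejects "..", the rest is kept loose.
var loosePattern = new Regex(
    @"^(?!.*[.]{2})[a-zA-Z0-9][^@\s]*?@[a-zA-Z0-9][^@\s]*?\.[A-Za-z0-9]+$");

// Hypothetical sample inputs, one per rule being checked.
string[] samples =
{
    "test@test.com",      // valid
    "test@test..com",     // consecutive full stops in the domain
    "te..st@test.com",    // consecutive full stops in the local part
    ".test@test.com",     // starts with a special character
    "test@testcom",       // no full stop
    "testtest.com",       // no @
};

foreach (var s in samples)
{
    Console.WriteLine(
        $"{s,-18} explicit={explicitPattern.IsMatch(s)} loose={loosePattern.IsMatch(s)}");
}
```

The two patterns agree on these inputs, but they are not equivalent in general: the looser one accepts any non-space, non-@ characters between the anchoring letters.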

Conditional match without false force a match?

I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The regex is run with the option to ignore pattern whitespace.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplified regex and simplified input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I'm trying to understand why the conditional group doesn't match as written in the original regex.
I'm comfortable with regex and know that it can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I experimented a little more and found the following regex, which should match every input, even an empty one, but it doesn't:
(?(a)a).*
DEMO
I'm of the opinion that this is a bug in .net regex and reported it to microsoft: See here for more information
There is no error in the regex parser; the problem is in the usage of the . wildcard specifier. The . specifier will consume all characters, wait for it, except the linefeed character \n. (See "Any character" in Character Classes in Regular Expressions.)
If you want your regex to work, you need to consume all characters including the linefeed, and that can be done by specifying the Singleline option. To paraphrase the documentation:
Singleline tells the parser to let . match every character, including \n.
Why does it still fail outside Singleline mode, when the other lines are consumed? Because the final match leaves the current position at the \n, and the only remaining option in the pattern is .*, which, as mentioned, cannot consume it, so the parser stops. The $ also locks in the operation at this point.
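The behaviour of . around line breaks can be checked in isolation. This is a minimal sketch, assuming Windows line endings and RegexOptions.Multiline (the question does not state which options the demo used beyond ignoring pattern whitespace); the input string and variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input with a Windows line ending (\r\n) between two lines.
string input = "hello\r\n#world";

// Default: '.' matches every character except \n, so .+ cannot cross the line break.
bool crossesByDefault = Regex.IsMatch(input, @"hello.+world");

// Singleline: '.' also matches \n, so the same pattern now spans both lines.
bool crossesWithSingleline = Regex.IsMatch(input, @"hello.+world", RegexOptions.Singleline);

// Multiline: $ matches just before \n, so .* quietly captures the trailing \r.
Match perLine = Regex.Match(input, @"^(?<value>.*)$", RegexOptions.Multiline);
string firstValue = perLine.Groups["value"].Value;

// Multiline plus Singleline: the greedy .* runs to the end of the whole input.
Match whole = Regex.Match(input, @"^(?<value>.*)$",
    RegexOptions.Multiline | RegexOptions.Singleline);

Console.WriteLine($"default crosses lines:    {crossesByDefault}");
Console.WriteLine($"singleline crosses lines: {crossesWithSingleline}");
Console.WriteLine($"first value length:       {firstValue.Length}");
Console.WriteLine($"whole match length:       {whole.Value.Length}");
```

The captured \r is invisible when printed, which is why a value that looks empty or one character too long is easy to overlook.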
Let me demonstrate what is happening by a tool I have created which illustrates the issue. In the tool the upper left corner is what we see of the example text. Below that is what the parser sees with \r\n characters represented by ↵¶ respectively. Included in that pane is what happens to be matched at the time in yellow boxes enclosing the match. The middle box is the actual pattern and the final right side box shows the match results in detail by listing out the return structures and also showing the white space as mentioned.
Notice the second match (as index 1) has world in group capture id and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singleline and see what happens.
Now everything is consumed, but there is a different problem. :-)

Regular Expression to deny input of repeated characters

I want a regular expression which allows the users to enter the following values. Minimum of four and maximum of 30 characters, and the first character should be upper case.
Eg: John, Smith, Anderson, Emma
And I don't want the user to input the following types of values
Jooohnnnnnn, Smmmmith, Aaaanderson, Emmmmmmmmma
Can anyone provide me with a regular expression? I searched for quite some time but can't find a working regex.
I need it for my ASP.net MVC application Model validation.
Thanks
Edited: I don't know how to check for repeated characters. I just tried the following:
@"^[A-Z]{1}[a-zA-Z ]{2,29}$"
The rules that I would like to add are
1. First character Upper case
2. 4-30 characters
3. No character repeated more than twice in a row
To perform a check on your regex you can use a negative look ahead:
^(?!.*(.)\1{2})[A-Z][a-zA-Z ]{3,29}$
The look ahead (?!...) will fail the whole regex if what's inside it matches.
To look for repeated patterns, we use a capture group: (.)\1{2}. We capture the first character, then check if it is followed by (at least) two identical characters with the backreference \1.
See demo here.
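As a rough check, the lookahead pattern can be run against the names from the question. This is a sketch only; the helper name IsValidName is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the pattern from the answer above.
// (?!.*(.)\1{2}) rejects any character that appears three times in a row;
// [A-Z][a-zA-Z ]{3,29} enforces the capital first letter and the 4-30 length.
static bool IsValidName(string name) =>
    Regex.IsMatch(name, @"^(?!.*(.)\1{2})[A-Z][a-zA-Z ]{3,29}$");

string[] names =
{
    "John", "Smith", "Anderson", "Emma",                     // expected to pass
    "Jooohnnnnnn", "Smmmmith", "Aaaanderson", "Emmmmmmmmma", // repeated letters
    "john",                                                  // no capital
    "Jon",                                                   // too short
};

foreach (var name in names)
{
    Console.WriteLine($"{name,-12} {IsValidName(name)}");
}
```

Note that the backreference is case-sensitive here, so "Aaaanderson" is rejected because of the three lowercase a characters, not because of the leading capital.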
Here is what you are looking for:
^ (?# Starting of name)
(?=[A-Z]) (?# Ensure it starts with capital A-Z without consuming the text)
(?i:([a-z]) (?# Following letters ignoring case)
(?!\1{2,}) (?# Letter can't be followed by the same letter two or more times)
){4,30} (?# Allow condition to be repeated 4 to 30 times)
$

Noncapturing along with capturing match

I am trying to capture the subdomain from huge lists of domain names. For example I want to capture "funstuff" from "funstuff.mysite.com". I do not want to capture ".mysite.com" in the match. These occurrences are in a sea of text so I cannot depend on them being at the start of a line. I know the subdomain will not include any special characters or numbers. So what I have is:
[a-z]{2,10}(?=\.mysite\.com)
The problem is this will work only if the subdomain is NOT preceded by a number or special character. For example, "asdfbasdasdfdfunstuff.mysite.com" will return "fdfunstuff" but "asdfasf23/funstuff.mysite.com" won't make a match.
I can not depend on there being a special character before the subdomain, like a "/" as in "http://funstuff.mysite.com" so that can not be used as part of the condition.
It is ok if the capture gets erroneous text before the subdomain, although 99% of the time it will be preceded by something other than a lowercase letter. I have tried,
(?<=[^a-z])[a-z]{2,10}(?=\.mysite\.com)
but for some reason this does not capture text in a situation like:
afb"asdfunstuff.mysite.com
Where the quotation mark prevents a match for [a-z]{2-20}. Basically what I would want to do in that case would be to capture asdfunstuff.mysite.com. How can this be accomplished?
So you've got two problems to solve: first, you want to match ".mysite.com" but not capture it; second, you want to grab up to 10 alphabetic characters in the "subdomain" position.
First problem can be solved by using a capturing group. The regex
([a-z]{2,10})\.mysite\.com
will capture between 2 and 10 characters, and the returned match object exposes the captured text in one of its properties (which one depends on the language). In C#, Regex.Match returns a Match object, and the captured subdomain is available as match.Groups[1].Value.
Second problem can be solved by using the word-boundary anchor \b. In .NET, this matches at a position where a word character (\w) is next to a non-word character (\W) or the start or end of the string. Other languages (e.g. ECMAScript / JavaScript) work similarly.
So, I suggest the following regex to solve your problem:
\b([a-z]{2,10})\.mysite\.com
Note that numbers are legal in subdomain names, too, so the following might be generally correct (though perhaps not in your specific case):
\b(\w{2,10})\.mysite\.com
where the "word character" \w is equivalent to [a-zA-Z_0-9] in .NET's ECMAScript-compliant mode. (Further reading.)
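The suggested pattern can be tried out as follows. This is a sketch under stated assumptions: the helper name FindSubdomains and the sample text are made up, and the subdomain is read from capture group 1 of each match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns every subdomain found in front of ".mysite.com".
static List<string> FindSubdomains(string haystack)
{
    var found = new List<string>();
    foreach (Match m in Regex.Matches(haystack, @"\b([a-z]{2,10})\.mysite\.com"))
    {
        // Group 0 is the whole match; group 1 is only the subdomain.
        found.Add(m.Groups[1].Value);
    }
    return found;
}

// Illustrative text: one subdomain after "//", one after digits and a slash.
string sample = "see http://funstuff.mysite.com and asdfasf23/news.mysite.com";

foreach (var sub in FindSubdomains(sample))
{
    Console.WriteLine(sub);
}
```

Because \b needs a non-word character (or the string boundary) in front of the subdomain, a subdomain glued directly onto digits or letters, or one longer than 10 letters, will not be found by this pattern.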

Regex to match two or more consecutive characters

Using regular expressions I want to match a word which
starts with a letter
has English letters
numbers, period(.), hyphen(-), underscore(_)
should not have two or more consecutive periods or hyphens or underscores
can have multiple periods or hyphens or underscore
For example,
flin..stones or flin__stones or flin--stones
are not allowed.
fl_i_stones or fli_st.ones or flin.stones or flinstones
is allowed .
So far My regular expression is ^[a-zA-Z][a-zA-Z\d._-]+$
So My question is how to do it using regular expression
You can use a lookahead and a backreference to solve this. But note that right now you are requiring at least 2 characters: the starting letter and another one (due to the +). You probably want to change that + to * so that the second character class can be repeated 0 or more times:
^(?!.*(.)\1)[a-zA-Z][a-zA-Z\d._-]*$
How does the lookahead work? Firstly, it's a negative lookahead. If the pattern inside finds a match, the lookahead causes the entire pattern to fail, and vice versa. So we can have a pattern inside that matches if we do have two consecutive characters. First, we look for an arbitrary position in the string (.*), then we match a single (arbitrary) character (.) and capture it with the parentheses. Hence, that one character goes into capturing group 1. And then we require this capturing group to be followed by itself (referencing it with \1). So the inner pattern will try at every single position in the string (due to backtracking) whether there is a character that is followed by itself. If these two consecutive characters are found, the pattern will fail. If they cannot be found, the engine jumps back to where the lookahead started (the beginning of the string) and continues with matching the actual pattern.
Alternatively you can split this up into two separate checks. One for valid characters and the starting letter:
^[a-zA-Z][a-zA-Z\d._-]*$
And one for the consecutive characters (where you can invert the match result):
(.)\1
This would greatly increase the readability of your code (because it's less obscure than that lookahead) and it would also allow you to detect the actual problem in the input and return an appropriate and helpful error message.
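The two-check approach might look like the sketch below. The helper name Validate and the message strings are made up for illustration. Note that (.)\1 as given rejects any doubled character, letters included, so the sample words avoid double letters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: runs the two checks separately so the caller
// can tell which rule failed. Returns "ok" when both checks pass.
static string Validate(string word)
{
    // Check 1: starts with a letter, then only letters, digits, '.', '_' or '-'.
    if (!Regex.IsMatch(word, @"^[a-zA-Z][a-zA-Z\d._-]*$"))
        return "must start with a letter and contain only letters, digits, '.', '_' or '-'";

    // Check 2: inverted match, any character immediately followed by itself is an error.
    Match repeat = Regex.Match(word, @"(.)\1");
    if (repeat.Success)
        return $"character '{repeat.Groups[1].Value}' is repeated consecutively";

    return "ok";
}

string[] words = { "flinstones", "fl_i_stones", "fli_st.ones", "flin..stones", "flin__stones", "1flin" };

foreach (var word in words)
{
    Console.WriteLine($"{word,-14} {Validate(word)}");
}
```

Splitting the checks costs a second pass over the input, but the second match also tells you which character was repeated, which the single lookahead pattern cannot report.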

.NET regex matching

Broadly: how do I match a word with regex rules for a)the beginning, b)the whole word, and c)the end?
More specifically: How do I match an expression of length >= 1 that has the following rules:
It cannot have any of: ! @ #
It cannot begin with a space or =
It cannot end with a space
I tried:
^[^\s=][^!@#]*[^\s]$
But the ^[^\s=] matching moves past the first character in the word. Hence this also matches words that begin with '!' or '@' or '#' (e.g. '#ab' or '@aa'). This also forces the word to have at least 2 characters (one beginning character that is not a space or =, and one non-space character at the end).
I got to:
^[^\s=(!@#)]\1*$
for a regex matching the first two rules. But how do I match no trailing spaces in the word with allowing words of length 1?
Cameron's solution is both accurate and efficient (and should be used for any production code where speed needs to be optimized). The answer presented here is less efficient, but demonstrates a general approach for applying logic using regular expressions.
You can use multiple positive and negative lookahead regex assertions (all applied at one location in the target string, typically the beginning) to apply multiple logical constraints for a match. The commented regex below demonstrates how easy this is to do for this example case. You do need to understand how the regex engine actually matches (and doesn't match) to come up with the correct expressions, but it's not hard once you get the hang of it.
foundMatch = Regex.IsMatch(subjectString, @"
    # Match 'word' meeting multiple logical constraints.
    ^               # Anchor to start of string.
    (?=[^!@#]*$)    # It cannot have any of: ! @ #, AND
    (?![ =])        # It cannot begin with a space or =, AND
    (?!.*\s$)       # It cannot end with a space, AND
    .{1,}           # length >= 1 (ok to match special 'word')
    \z              # Anchor to end of string.
    ",
    RegexOptions.IgnorePatternWhitespace);
This application of "regex-logic" is frequently used for complex password validation.
Your first attempt was very close. You only need to exclude more characters for the first and last parts, and make the last two parts optional:
^[^\s=!@#](?:[^!@#]*[^\s!@#])?$
This ensures that all three sections will not include any of !@#. Then, if the word is more than one character long, it will need to end with a not-space, with only select characters filling the space in-between. This is all enforced properly because of the ^ and $ anchors.
I'm not quite sure what your second example matched, since the () should be taken as literal characters when embedded within a character class, not as a capturing group.
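The single-pattern answer above can be exercised with a few inputs. This is a sketch that assumes the three excluded characters are !, @ and #; the helper name IsValidWord and the sample words are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the pattern from the answer above:
// first char: not whitespace, '=', '!', '@' or '#';
// optional tail: anything except '!', '@', '#', ending in a non-space.
static bool IsValidWord(string word) =>
    Regex.IsMatch(word, @"^[^\s=!@#](?:[^!@#]*[^\s!@#])?$");

string[] words =
{
    "a",      // single character is allowed
    "a b",    // inner space is allowed
    "ab c",
    " ab",    // leading space
    "=ab",    // leading =
    "ab ",    // trailing space
    "a!b",    // forbidden character in the middle
    "#ab",    // forbidden character at the start
    "",       // length must be >= 1
};

foreach (var word in words)
{
    Console.WriteLine($"[{word}] {IsValidWord(word)}");
}
```

Making the whole tail optional is what permits one-character words while still forcing longer words to end in a non-space.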
