Validating email address with single character domain-names with a regex - c#

I have a regex that I am using to validate email addresses. I like this regex because it is fairly relaxed and has proven to work quite well.
Here is the regex:
(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+#[^\.][\w\.\-]+\.[A-Za-z]{2,}>?
Ok great, basically all reasonably valid email addresses that you can throw at it will validate. I know that maybe even some invalid ones will fall through but that is ok for my specific use-case.
Now it happens that joe#x.com does not validate, and x.com is actually a domain name that exists (owned by PayPal).
Looking at the regex part that validates the domain name:
#[^\.][\w\.\-]+
It looks like this should be able to parse the x.com domain name, but it doesn't. The culprit is the part that checks that a domain name can not begin with a dot (such as test#.test.com)
#[^\.]
If I remove the [^.] part of my regex the domain x.com validates, but now the regex allows domain names beginning with a dot, such as .test.com; this is a little bit too relaxed for me ;-)
So my question is: how can the negative character list affect my single-character check? The way I am reading the regex is "make sure this string does not start with a dot", but apparently it does more.
Any help would be appreciated.
Regards,
Waseem

As Luis suggested, you can use [^\.][\w\.\-]* to match the domain name; however, it will now also match addresses like john#x.....com and john##.com. You might want to make sure that there is only one period at a time, and that the first character after the # is more restricted than just not being a period.
Match the domain name and the period (and subdomains and their periods) using:
([\w\-]+\.)+
So your pattern would be:
(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+#([\w\-]+\.)+[A-Za-z]{2,}>?
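A minimal sketch comparing the two variants discussed in this answer, using Python's re module purely for illustration (the thread is about C#, and .NET may differ in details). The patterns are copied as written above, including the # separator; the names LOOSE, STRICT and is_valid are hypothetical, and anchoring with fullmatch is an assumption, since the thread does not say how the pattern is applied.

```python
import re

# Luis's variant: [^\.] followed by zero or more of [\w\.\-]
LOOSE = re.compile(
    r"""(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+#[^\.][\w\.\-]*\.[A-Za-z]{2,}>?"""
)
# Tightened variant: one or more "label." groups, then the top-level domain
STRICT = re.compile(
    r"""(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+#([\w\-]+\.)+[A-Za-z]{2,}>?"""
)


def is_valid(pattern, address):
    # Assumption: the whole input must match, not just a substring.
    return pattern.fullmatch(address) is not None


for address in ["joe#x.com", "john#x.....com", "john##.com", "test#.test.com"]:
    print(address, is_valid(LOOSE, address), is_valid(STRICT, address))
```

Under these assumptions both variants accept joe#x.com and reject test#.test.com, while only the looser one lets john#x.....com and john##.com through.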

If you change [^\.][\w\.\-]+ to [^\.][\w\.\-]*, it will work as you expect!
The reason is: [^\.] matches a single character that is not a dot (in your case, the "x" in "x.com"). [\w\.\-]+ then has to match one or more characters, so it consumes the dot after the x, and the \. that follows has no dot left to match. The * matches zero or more characters after the first one, which is what you want.

Change the quantifier +, meaning one or more, to *, meaning zero or more.

Change #[^\.][\w\.\-]+ to #[^\.][\w\.\-]*
The reason you need this is that [^\.] matches a single character that is not a dot (the x). The [\w\.\-]+ that follows must consume at least one character, and the only one available before "com" is the dot, which leaves nothing for the \. that comes next. Changing the plus to a star lets that part match zero characters, which fixes this.

Look at the broader context in your pattern:
#[^\.][\w\.\-]+\.[A-Za-z]{2,}
So for joe#x.com,
[^.] matches x
[\w.-]+ matches .
\. needs a dot but finds c
Change this part to #[^.][\w-]*\.[A-Za-z]{2,}
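The trace above can be checked with a short sketch. Python's re is used only for illustration, the patterns are the domain parts quoted in this thread (with # as written), and the names ORIGINAL, STAR, NARROW and matches are hypothetical.

```python
import re

ORIGINAL = re.compile(r"#[^\.][\w\.\-]+\.[A-Za-z]{2,}")  # pattern from the question
STAR = re.compile(r"#[^\.][\w\.\-]*\.[A-Za-z]{2,}")      # + changed to *
NARROW = re.compile(r"#[^.][\w-]*\.[A-Za-z]{2,}")        # dot removed from the class


def matches(pattern, address):
    # search() is used because these fragments are unanchored parts of a larger pattern
    return pattern.search(address) is not None


for address in ["joe#x.com", "joe#xy.com", "test#.test.com"]:
    print(address, matches(ORIGINAL, address), matches(STAR, address), matches(NARROW, address))
```

With a one-character domain label the original has nothing left for the mandatory second character class except the dot itself, so it fails; both modified forms accept joe#x.com and still reject a domain that starts with a dot.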

Related

Regex - optional suffix triggered by dash

I have a simple format that I'm already validating, but I want to let users append an optional dash plus whatever they want at the end, while still validating the first part. The dash should act as a trigger that tells the regex it can accept whatever comes after.
So if my existing regex is something like:
^\d{7}
then I want to be able to update my regex to pass these:
1234567-Covid19
1234567-Scenario
1234567-AnyString
but not these since they are missing the dash:
12345678
1234567*AnyString
1245567AnyString
Any help is much appreciated!
I fiddled around with a regex tester a bit and came up with the below to satisfy my initial question:
^\d{7}(-[A-Za-z0-9]{1,10})?
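This optionally matches a dash + 1-10 additional alphanumeric characters. Note that without a trailing $ the pattern still matches the first seven digits of inputs such as 12345678, so append $ if those must be rejected.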
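A small sketch of how the pattern behaves on the sample inputs, in Python for illustration only. The names UNANCHORED, ANCHORED and accepts are hypothetical, and the trailing $ in the second pattern is an addition that is not part of the original answer.

```python
import re

UNANCHORED = re.compile(r"^\d{7}(-[A-Za-z0-9]{1,10})?")   # as posted
ANCHORED = re.compile(r"^\d{7}(-[A-Za-z0-9]{1,10})?$")    # assumption: add an end anchor


def accepts(pattern, value):
    return pattern.match(value) is not None


samples = [
    "1234567-Covid19", "1234567-Scenario", "1234567-AnyString",
    "12345678", "1234567*AnyString", "1245567AnyString",
]
for value in samples:
    print(value, accepts(UNANCHORED, value), accepts(ANCHORED, value))
```

The unanchored form reports a match for every sample because the first seven digits alone satisfy it; the anchored form accepts only the three dash-suffixed inputs.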

ASP.NET Core RegularExpression attribute - multiple conditions

I have two regex that should be matched:
"^[a-z0-9\\!#\\$\\^&\\-\\+%\\=_\\(\\)\\{\\}\\<\\>'\";\\:/\\.,~`\\|\\\\]+$"
and
".*(g[o0]+gle).*"
The first one accepts any alphanumeric character (plus a few extras), like helloworld123. The second one should reject any string that contains the word "google" (in different forms, like gooo0gle).
Allowed:
hello
helloworld
helloworld123
Disallowed:
hellogoogle
google
...
I want to use the RegularExpression to match this string. Thought about something like:
[RegularExpression("^[a-z0-9\\!#\\$\\^&\\-\\+%\\=_\\(\\)\\{\\}\\<\\>'\";\\:/\\.,~`\\|\\\\]+$|.*(g[o0]+gle).*")]
But it's not working since the second part (.*(g[o0]+gle).*) should be NOT.
How to do it right?
Thanks.
You can use your second regex by placing it in a negative look ahead and use the first regex as character set and combine both to get following regex that you can use,
^(?!.*g[o0]+gle)[-a-z0-9!#$^&+%=_(){}<>'";:\/.,~`|]+$
Here, the negative look ahead (?!.*g[o0]+gle) will reject any string that contains google or any variation supported by your regex, and the character set [-a-z0-9!#$^&+%=_(){}<>'";:\/.,~`|]+ will match one or more of the characters it allows.
Also, you don't need to escape most special characters while they are in character set, hence I have unescaped most of them except / and also always place the hyphen - either as the very first character or very last character in the character set, else depending upon the regex dialects, you may see weird behavior.
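A quick sketch of the combined pattern against the sample inputs from the question. Python's re is used for illustration only (the question targets the .NET RegularExpression attribute), and the names ALLOWED and is_allowed are hypothetical.

```python
import re

# Negative lookahead first, then the allowed character set, as in the answer above
ALLOWED = re.compile(r"""^(?!.*g[o0]+gle)[-a-z0-9!#$^&+%=_(){}<>'";:\/.,~`|]+$""")


def is_allowed(value):
    return ALLOWED.match(value) is not None


for value in ["hello", "helloworld", "helloworld123", "hellogoogle", "google", "gooo0gle"]:
    print(value, is_allowed(value))
```

The lookahead is evaluated once at the start of the string, so a forbidden variant anywhere in the input causes the whole match to fail before the character set is tried.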

Noncapturing along with capturing match

I am trying to capture the subdomain from huge lists of domain names. For example, I want to capture "funstuff" from "funstuff.mysite.com". I do not want to capture ".mysite.com" in the match. These occurrences are in a sea of text, so I cannot depend on them being at the start of a line. I know the subdomain will not include any special characters or numbers. So what I have is:
[a-z]{2,10}(?=\.mysite\.com)
The problem is this will work only if the subdomain is NOT preceded by a number or special character. For example, "asdfbasdasdfdfunstuff.mysite.com" will return "fdfunstuff" but "asdfasf23/funstuff.mysite.com" won't make a match.
I can not depend on there being a special character before the subdomain, like a "/" as in "http://funstuff.mysite.com" so that can not be used as part of the condition.
It is ok if the capture gets erroneous text before the subdomain, although 99% of the time it will be preceded by something other than a lowercase letter. I have tried,
(?<=[^a-z])[a-z]{2,10}(?=\.mysite\.com)
but for some reason this does not capture text in a situation like:
afb"asdfunstuff.mysite.com
Where the quotation mark prevents a match for [a-z]{2,10}. Basically what I would want to do in that case would be to capture asdfunstuff.mysite.com. How can this be accomplished?
So you've got two problems to solve: first, you want to match ".mysite.com" but not capture it; second, you want to grab up to 10 alphabetic characters in the "subdomain" position.
First problem can be solved by using a capturing group. The regex
([a-z]{2,10})\.mysite\.com
will capture somewhere between 2 and 10 characters, and the returned match object will expose that in one of its properties (depends on the language). In C#, Regex.Match returns a Match object whose Groups collection holds the captured text in Groups[1].
Second problem can be solved by using the word-boundary character \b. In .NET, this matches where an alphanumeric (i.e. \w) is next to a non-alphanumeric (\W). Other languages (e.g. ECMAScript / JavaScript) work similarly.
So, I suggest the following regex to solve your problem:
\b([a-z]{2,10})\.mysite\.com
Note that numbers are legal in subdomain names, too, so the following might be generally correct (though perhaps not in your specific case):
\b(\w{2,10})\.mysite\.com
where the "word character" \w is equivalent to [a-zA-Z_0-9] in .NET's ECMAScript-compliant mode.
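A rough sketch of the capturing-group approach, written in Python for illustration rather than C#. The names SUBDOMAIN and find_subdomains are hypothetical; in Python the captured text is read with group(1), which plays the role of Groups[1] in .NET.

```python
import re

SUBDOMAIN = re.compile(r"\b([a-z]{2,10})\.mysite\.com")


def find_subdomains(text):
    # Return only the captured subdomain, not the ".mysite.com" tail
    return [m.group(1) for m in SUBDOMAIN.finditer(text)]


print(find_subdomains("see http://funstuff.mysite.com for details"))
print(find_subdomains("asdfasf23/funstuff.mysite.com"))
print(find_subdomains('afb"asdfunstuff.mysite.com'))
```

One consequence of combining \b with the {2,10} limit: a run of more than ten letters before .mysite.com, such as the eleven-letter asdfunstuff in the last sample, produces no match at all rather than a truncated one.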

Regex for ANY string except "www"? (subdomain)

I was wondering if someone out there could help me with a regex in C#. I think it's fairly simple but I've been wracking my brain over it and not quite sure why I'm having such a hard time. :)
I've found a few examples around but I can't seem to manipulate them to do what I need.
I just need to match ANY alphanumeric+dashes subdomain string that is not "www", and just up to the "."
Also, ideally, if someone were to type "www.subdomain.domain.com" I would like the www to be ignored if possible. If not, it's not a huge issue.
In other words, I would like to match:
(test).domain.com
(test2).domain.com
(wwwasdf).domain.com
(asdfwww).domain.com
(w).domain.com
(wwwwww).domain.com
(asfd-12345-www-bananas).domain.com
www.(subdomain).domain.com
And I don't want to match:
(www).domain.com
It seems to me like it should be easy, but I'm having troubles with the "not match" part.
For what it's worth, this is for use in the IIS 7 URL Rewrite Module, to rewrite for all non-www subdomains.
Thanks!
Is the remainder of the domain name constant, like .domain.com, as in your examples? Try this:
\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)
Explanation:
\w+(?:-\w+)* matches a generic domain-name component as you described (but a little more rigorously).
(?=\.domain\.com\b) makes sure it's the first subdomain (i.e., the last one before the actual domain name).
\b(?!www\.) makes sure it isn't www. (without the \b, it could skip over the first w and match just the ww.).
In my tests, this regex matches precisely the parts you highlighted in your examples, and does not match the www. in either of the last two examples.
EDIT: Here's another version which matches the whole name, capturing the pieces in different groups:
^((?:\w+(?:-\w+)*\.)*)((?!www\.)\w+(?:-\w+)*)(\.domain\.com)$
In most cases, group $1 will contain an empty string because there's nothing before the subdomain name, but here's how it breaks down www.subdomain.domain.com:
$1: "www."
$2: "subdomain"
$3: ".domain.com"
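Both patterns from this answer can be exercised with a short sketch. Python's re is used for illustration only (the question is about IIS URL Rewrite and C#), and the names FIRST, SPLIT, subdomain and split_host are hypothetical.

```python
import re

FIRST = re.compile(r"\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)")
SPLIT = re.compile(r"^((?:\w+(?:-\w+)*\.)*)((?!www\.)\w+(?:-\w+)*)(\.domain\.com)$")


def subdomain(host):
    # First pattern: return just the subdomain, or None when there is none
    m = FIRST.search(host)
    return m.group(1) if m else None


def split_host(host):
    # Second pattern: return (prefix, subdomain, domain) or None
    m = SPLIT.match(host)
    return m.groups() if m else None


for host in ["test.domain.com", "wwwasdf.domain.com", "www.subdomain.domain.com", "www.domain.com"]:
    print(host, subdomain(host), split_host(host))
```

In this sketch www.domain.com yields None from both helpers, which is the "do not match" case from the question.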
^www\.
And invert the logic for this bit, so if it matches, then your string does not meet your requirements.
This works:
^(?!www\.domain\.com)(?:[a-z\-\.]+\.domain\.com)$
Or, with the necessary backslashes for Java (or C#?) strings:
"^(?!www\\.domain\\.com)(?:[a-z\\-\\.]+\\.domain\\.com)$"
There may be a more concise way (i.e. only typing domain.com once), but this works ..
Just substitute the original with everything after the www, if present (Python-style; raw strings keep \. and \2 intact):
s = re.sub(r"(www\.)?(.+)", r"\2", s)
Or if you just want to match those which are "wrong" use this:
(www\.([^.]+)\.([^.]+))
And if you must match all those which are good use this:
(([^w]|w[^w]|ww[^w]|www[^.]|www\.([^.]+)\.([^.]+)\.).+)
Just thinking aloud here:
^(?:www\.)?([^\.]+)\.([^\.]+)\.
where...
(?:www\.)? looks for a possible "www" at the start, non-capturing
([^\.]+)\. looks for the sub-domain (anything except a dot at least once until a dot)
([^\.]+)\. looks for the domain, ending with a dot (anything except a dot at least once until a dot)
Note: This expression will not work with double sub-domains:
www.subsub.sub.domain.com
This:
^(?:www\.)?([^.]*)
It matches exactly what you put in parentheses in your question. You will find your answers sitting in group(1). You have to anchor it to the beginning of the line. Use this:
^(?:www\.)?(.*)
If you want everything in the URL except the "www.". One example you did not include in your test cases was "alpha.subdomain.domain.com". In the event you need to match everything, except "www.", that is not in the "domain.com" part of the string, use this:
^(?:www\.)?(.+)((?:\.(?:[^./\?]+)){2})
It will solve all of your cases, but in addition, will also return "alpha.subdomain" from my additional test case. And, for an encore, places ".domain.com" in group 2 and will not match beyond that if there are directories or parameters in the url.
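A small sketch of the last pattern in this answer, in Python for illustration. The names HOST and split_url are hypothetical, and the sample URLs other than the "alpha" one are made up for the demonstration.

```python
import re

HOST = re.compile(r"^(?:www\.)?(.+)((?:\.(?:[^./\?]+)){2})")


def split_url(url):
    # Group 1: everything before the last two labels; group 2: those two labels
    m = HOST.match(url)
    return m.groups() if m else None


print(split_url("www.alpha.subdomain.domain.com"))
print(split_url("www.test.domain.com/dir?x=1"))
print(split_url("www.domain.com"))
```

One thing this sketch brings out: for a bare www.domain.com the optional www. prefix is given back during backtracking, so group 1 ends up holding "www". If that input must be rejected, a separate check would be needed.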
I verified all of these responses here.
Finally, for the sake of overkill, if you want to reject addresses that begin with "www.", you can use negative lookbehind:
^....(?<!www\.).*
Thought I'd share this.
(\\.[A-Za-z]{2,3}){1,2}$
Removes any '.com.au' '.co.uk' from the end. Then you can do an additional lookup to detect whether a URL contains a subdomain.
E.g.
subdomain1.sitea.com.au
subdomain2.siteb.co.uk
subdomain3.sitec.net.au
all become:
subdomain1.sitea
subdomain2.siteb
subdomain3.sitec
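A minimal sketch of the suffix-stripping idea, in Python for illustration. The character class is written as [A-Za-z], since a range like A-z would also admit the punctuation characters that sit between Z and a in ASCII; the names SUFFIX and strip_suffix are hypothetical.

```python
import re

# One or two trailing labels of 2-3 letters, e.g. ".com.au" or ".net"
SUFFIX = re.compile(r"(\.[A-Za-z]{2,3}){1,2}$")


def strip_suffix(host):
    return SUFFIX.sub("", host)


for host in ["subdomain1.sitea.com.au", "subdomain2.siteb.co.uk", "subdomain3.sitec.net.au"]:
    print(host, "->", strip_suffix(host))
```

A caveat worth keeping in mind: the pattern cannot tell a short site name from a suffix, so sub.abc.com is reduced to sub rather than sub.abc.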

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
1. A single cell: "A1";
2. Multiple contiguous cells: "A1:B10"
3. Multiple separate cells: "A1,B6,I60,AA2"
4. A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default ReadOnly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
    Get
        Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
        If (Not validAddresses.IsMatch(cellsAddresses)) Then _
            Throw New FormatException("cellsAddresses")
        ' Proceed with getting the cells from the Interop here...
    End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. Which exception is more meaningful here, a FormatException or an InvalidExpressionException? I hesitate because the problem is the format in which the property expects the cells to be given, but I am also using a (regular) expression to check it.
Thank you kindly for your help and support! =)
I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
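The pattern above can be tried against the four input styles from the question. This is a sketch in Python for illustration only (the question uses VB.NET); the names CELLS and is_cell_list are hypothetical, and fullmatch is an assumption, since the pattern as posted has no anchors.

```python
import re

CELLS = re.compile(r"[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*")


def is_cell_list(text):
    # Assumption: the entire input must be consumed by the pattern
    return CELLS.fullmatch(text) is not None


for text in ["A1", "A1:B10", "A1,B6,I60,AA2", "A1:B10,C3,D4:E1000", "A1:B2:C3", "A1,"]:
    print(text, is_cell_list(text))
```

Because ':' and ',' are interchangeable separators here, a chain such as A1:B2:C3 is also accepted; a stricter pattern is needed if that should be rejected.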
Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with <cell>, then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>)?)*$
which more clearly shows that it is a cell or a range followed by zero or more "cell or range" entries with commas in between.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
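The idea of treating the single-cell expression as a reusable building block can be sketched by assembling the full pattern from smaller strings. This is Python for illustration only; the names CELL, CELL_OR_RANGE, ADDRESSES and is_valid are hypothetical.

```python
import re

CELL = r"[A-Z]{1,2}[1-9]\d*"               # one or two letters, then a row without leading zeros
CELL_OR_RANGE = rf"{CELL}(:{CELL})?"       # a cell, optionally followed by ":cell"
ADDRESSES = re.compile(rf"^{CELL_OR_RANGE}(,{CELL_OR_RANGE})*$")


def is_valid(text):
    return ADDRESSES.match(text) is not None


for text in ["A1:B10,C3,D4:E1000", "B2:B12,C13:C18,D4,E11000", "A1:B2:C3", "A0", "AAA1"]:
    print(text, is_valid(text))
```

Building the pattern this way produces the same long expression as the one written out in full, while keeping the single-cell definition in one place.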
I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!:\w*): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.
I think your regex is incorrect; try (([A-Za-z0-9]*)[:,]?)*
Edit: to correct the bug pointed out by Baud: (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
and finally, the best version: (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// ah ok, this probably won't work... but to answer 1: no, I don't think your regex is correct
( ) form a group
[ ] form a character class (you can use A-Z, a-d, 0-9 etc. or just single characters)
? means 0 or 1
* means 0 or more
I'd suggest reading http://www.regular-expressions.info/reference.html .
That's where I learned regexes some time ago ;)
And for building expressions I use Rad Software Regular Expression Designer
Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.
