I have been trying to find a good RegEx for email validation.
I have already gone through Comparing E-mail Address Validating Regular Expressions, and that didn't cover all my validation needs.
I have Googled/Binged and scanned the top 50-odd results, including the regular-expressions.info article and other material.
So finally I used the System.Net.Mail.MailAddress class to validate my email address, since if this fails, my email won't get sent to the user.
I want to customize the validation as used by the constructor of the class.
So how do I go ahead and get the validation/RegEx that the MailAddress class is using?
No, it does not use a RegEx, but rather a complicated process that would take far too long to explain here. How do I know? I looked at the implementation using .NET Reflector. And so can you :D
http://www.red-gate.com/products/reflector/ (it's free)
Thanks Reflector... forgot you were still free!
Reflected System.Net.Mail.MailAddress...
Found that it uses the void ParseValue(string address) and void GetParts(string address) methods as the primary check of the mail address format.
//Edited
Surprised, no RegEx was involved!
According to the Reflector, the class doesn't use regular expressions at all.
I used the following pattern to validate my email field.
return Regex.IsMatch(email,
    @"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
    @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-0-9a-z]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
    RegexOptions.IgnoreCase, TimeSpan.FromMilliseconds(250));
It uses the following reference:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/how-to-verify-that-strings-are-in-valid-email-format
My requirement is a maximum of 64 characters for the user part and a maximum of 254 characters for the whole email string. The pattern in the reference only allows a maximum of 134 characters. Can someone give a clear explanation of what the pattern means? What is the right pattern to achieve my goal?
The code you cited is over-engineered. All you need to verify an email is to check for an at symbol and for a dot. If you need anything more precise, you are probably at the point where you actually need to email the recipient and ask them to confirm that they hold the address, which is simpler than a complex regex and much more precise.
Such a regex would simply be:
.+@.+\..+
Annotated below:
.+ At least one of any character
@ The at symbol
.+ At least one character
\. The . symbol
.+ At least one character
Of course this means that some emails might be accepted as false positives, like tomas@company.c when the user intended tomas@company.com, but even if you design the most robust of regexes, one that checks against a list of accepted TLDs, you will never catch tomas@company.co, and you might introduce false negatives like tomas@company.blockchain when a new TLD is released and your code isn't updated.
So just keep it simple.
If you wanted to avoid using regex (which is, in my opinion, difficult to decipher), you could use the .Split() method on the email string using the "@" symbol as your delimiter. Then, you can check the string lengths of the two components from there.
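A minimal sketch of that split-and-measure idea, assuming the 64-character and 254-character limits from the question. The class and method names are hypothetical, and the check deliberately rejects quoted local parts that contain an "@" of their own:

```csharp
using System;

public static class EmailLengthCheck
{
    // Hypothetical helper: checks only the overall shape and the length limits
    // mentioned in the question (64 for the user part, 254 overall).
    public static bool HasPlausibleShape(string email)
    {
        if (string.IsNullOrEmpty(email) || email.Length > 254)
            return false;

        // Exactly one '@' yields exactly two parts.
        string[] parts = email.Split('@');
        if (parts.Length != 2)
            return false;

        string local = parts[0];
        string domain = parts[1];

        return local.Length >= 1 && local.Length <= 64 && domain.Length >= 1;
    }
}
```

This says nothing about whether the mailbox exists; it only filters out input that is obviously the wrong shape or too long.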
Several years back, I wrote an email validation attribute in C# that should recognize most of that subset of syntactically valid email addresses that have the form local-part@domain. I say "most" because I didn't bother to try to deal with things like punycode, IPv4 address literals (dotted quads), or IPv6 address literals.
I'm sure there's lots of other edge cases I missed as well. But it worked well enough for our purposes at the time.
Use it in good health: C# Email Address validation
Before you go down the road of writing your own, you might want to (1) read through the multiple relevant RFCs and try to understand the vagaries of what constitutes a "valid" email address (it's not what you think), and (2) stop trying to validate an RFC 822 email address. About the only way to "validate" an email address is to send mail to it and see whether it bounces. Even that doesn't mean that anybody is home at that address, or that the mailbox won't disappear next week.
https://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/
https://jackfoxy.github.io/FsRegEx/emailregex.html
Jeffrey Friedl's book Mastering Regular Expressions has a [more-or-less?] complete regular expression to match syntactically valid email addresses. It's 6,598 characters long.
Did you know that postmaster@. is a legal email address? It theoretically gets you to the postmaster of the root DNS server.
Or that [theoretically] "bang path" email addresses like MyDepartmentServer!MainServer!BigRouter!TheirDepartmentServer!SpecificServer!jsmith are valid? Here you define the actual path through the network that the email should take. It helps if you know the network topology involved.
Considering I parse user input, which is supposed to be an email address, into the MailAddress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the output even though I made sure that the address is a valid email address by parsing the string?
From what I could gather the RFC for mail addresses is pretty complicated, so I am unsure if cross site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith@contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.
Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to be perfect. It makes the attacker's job that much harder if they face an outgoing filter as well as an incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every nonvalid character in an email address (these shouldn't get through at this point but do this for comprehensiveness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a nonrecursive filter would remove the middle occurrence of <script> and leave the outer occurrence intact.
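To make the difference concrete, here is a small sketch of a filter that keeps removing a banned substring until the string stops changing. The names are hypothetical and the comparison is case-sensitive, so it only illustrates the repeat-until-stable idea and is not a complete sanitizer:

```csharp
using System;

public static class TagStripper
{
    // Removes every occurrence of `banned` and repeats until a pass
    // changes nothing, so fragments cannot reassemble into the banned text.
    // `banned` must be non-empty (string.Replace rejects an empty search string).
    public static string RemoveRepeatedly(string input, string banned)
    {
        string current = input;
        string previous;
        do
        {
            previous = current;
            current = current.Replace(banned, string.Empty);
        } while (current != previous);
        return current;
    }
}
```

A single pass over "<scr<script>ipt>" leaves "<script>" behind, which is exactly the case described above; the loop runs a second pass and removes that as well.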
Is it necessary to sanitize the output
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter: you'd get &lt; in the HTML source as a result, and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.
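For cases where the encoding has to be done in code rather than through the control's Mode property, a minimal sketch using System.Net.WebUtility.HtmlEncode could look like this. The class and method names are hypothetical; only the HtmlEncode call is a framework API:

```csharp
using System;
using System.Net;

public static class OutputEncoding
{
    // Builds the message from the question, HTML-encoding the address
    // at the point where it is placed into markup.
    public static string DescribeAddress(string address)
    {
        return "Your mail address is " + WebUtility.HtmlEncode(address);
    }
}
```

The encoding happens at the output stage and does not depend on what the input validation allowed, which is the separation of concerns described above.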
I've got a regular expression that I am using to check against a string to see if it an email address:
@"^((([\w]+\.[\w]+)+)|([\w]+))@(([\w]+\.)+)([A-Za-z]{1,3})$"
This works fine for all the email addresses I've tested, provided the bit before '@' is at least four characters long.
Works:
web1@domain.co.uk
Doesn't work:
web@domain.co.uk
How can I change the regex to allow prefixes of fewer than four characters?
The 'standard' regex used in asp.net mvc account models for email validation is as follows:
@"^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$"
It allows 1+ characters before the @
I believe the best way to check a valid email address is to make the user type it twice and then send him an email and challenge the fact that he received it using a validation link.
Check your regex against a list of weird valid email addresses and you will see regexes are not perfect for email validation tasks.
I recommend not using a regex to validate email (for reasons outlined here) http://davidcel.is/blog/2012/09/06/stop-validating-email-addresses-with-regex/
If you can't send a confirmation email, a good alternative in C# is to try creating a MailAddress and check whether it fails.
If you're using ASP.NET you can use a CustomValidator to call this validation method.
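// Requires: using System.Net.Mail;
bool isValidEmail(string email)
{
    try
    {
        // The constructor throws if the string cannot be parsed as an address.
        MailAddress m = new MailAddress(email);
        return true;
    }
    catch
    {
        return false;
    }
}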
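One caveat with the try/catch approach: the constructor also accepts the display-name form, such as "Tom Smith <tsmith@contoso.com>". A hedged variant that accepts only a bare address compares the parsed Address property with the input. The class and method names here are hypothetical:

```csharp
using System;
using System.Net.Mail;

public static class StrictMailAddress
{
    // Returns true only when the input parses and the parsed address is
    // the whole input, which rules out display-name forms.
    public static bool IsBareAddress(string email)
    {
        try
        {
            var parsed = new MailAddress(email);
            return parsed.Address == email;
        }
        catch (Exception ex) when (ex is FormatException || ex is ArgumentException)
        {
            // FormatException: not parseable; ArgumentException: null or empty input.
            return false;
        }
    }
}
```

This is still a syntax check only. It does not show that the mailbox exists.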
You can use this regex as an alternative:
^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$
Its description can be found here.
About your regex, the starting part (([\w]+\.[\w]+)+) forces the email address to have four characters at the beginning. Emending this part
would do the work for you.
The little trick used in the validated answer, i.e. catching exceptions on
new MailAddress(email);
doesn't seem very satisfying, as it considers "a@a" a valid address. In fact it doesn't raise an exception for almost any string matching the regex ".*@.*", which is clearly too permissive. For example,
new MailAddress("¦@°§¬|¢@¢¬|")
doesn't raise an exception.
Thus I clearly would go for regex matching
This example is quite satisfying
https://msdn.microsoft.com/en-us/library/01escwtf%28v=vs.110%29.aspx
You can also try this one
^[a-zA-Z0-9._-]*@[a-z0-9._-]{2,}\.[a-z]{2,4}$
I was wondering if anybody has found a solution that validates an email that includes unicode characters as in from a unicode domain? I have searched at length and have yet to find a solution that works.
Fully validating an email address through a regex is hard. Really hard. This is one that is fully compliant with RFC822. Even if you create a perfect regex that correctly validates all email addresses, that doesn't stop me from entering hi@hi.com (if you're trying to make sure that I enter a valid email address) or from accidentally misspelling my username (if you're trying to make sure that I enter my email address correctly).
Just send a link in an email saying, "click here to validate your email address."
I had the same issue and came up with a solution based on \p{L}, which matches any Unicode letter.
Please check it out:
private static bool IsEmailValid(string email) {
    System.Text.RegularExpressions.Regex re = new Regex(@"^[\p{L}0-9!$'*+\-_]+(\.[\p{L}0-9!$'*+\-_]+)*@[\p{L}0-9]+(\.[\p{L}0-9]+)*(\.[\p{L}]{2,})$", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
    return re.IsMatch(email);
}
Ok, so the only email validation I ever found that was truly awesome (instead of just OK) is part of the Zend Framework. Of course that means PHP, hopefully though, you can look at how they do it and emulate some of their better ideas: http://pastebin.com/SvZPBp31 Or just look up Zend_Validate_EmailAddress sourcecode.
sorry that this isn't in C# syntax / language.
As has been pointed out, validating e-mail addresses through a regular expression is a hard problem. You can get close with a fairly simple one, but there are many, many cases that it will fail to catch. I'm all for sending an email to a supposed email address as @Nick ODell suggests (after doing some basic sanity checking, like: does it contain an @ sign, does the domain name portion exist and have one or more of MX/A/AAAA RRs, and the like) and including a verification link.
That said, if by Unicode domain you mean a Punycode-encoded host name label, those should be covered by any half-way competent validation regexp, as in encoded form those are just xn-- followed by the regular set [a-z0-9-] (case insensitive comparison).
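If the input arrives with the domain still in its Unicode form, one option is to convert the domain to its Punycode form first and then apply whatever ASCII-only check you already have. A minimal sketch using System.Globalization.IdnMapping follows. The class and method names are hypothetical, and the local part is left untouched:

```csharp
using System;
using System.Globalization;

public static class IdnDomain
{
    // Converts only the domain part (after the last '@') to its
    // ASCII/Punycode form; the local part is passed through unchanged.
    public static string ToAsciiAddress(string email)
    {
        int at = email.LastIndexOf('@');
        if (at < 1 || at == email.Length - 1)
            throw new FormatException("Expected local-part@domain.");

        string localWithAt = email.Substring(0, at + 1);
        string domain = email.Substring(at + 1);

        // IdnMapping.GetAscii throws ArgumentException for invalid domain names.
        return localWithAt + new IdnMapping().GetAscii(domain);
    }
}
```

For example, info@bücher.example becomes info@xn--bcher-kva.example, which the [a-z0-9-] character class mentioned above covers.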
I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assembly language teaching tool which allows students to write, compile, and execute assembly-like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to the syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. For example:
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code (e.g. http://www.codeplex.com/NetMassDownloader to download the .NET source).
Change the code to have a read-only property with the failure index.
Make sure your code uses that Regex rather than Microsoft's.
I guess such an index would only have meaning in some simple case, like in your example.
If you take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should the index you are talking about be?
It will depend on the algorithm used for matching...
It could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to tell where a regex fails, so you need to take a different approach: you need to compare strings. Use a regex to remove all the things that could vary, and compare the result with the string that you know does not change.
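In the same spirit of comparing against a known form instead of asking the regex engine, the token stream from the question could be checked against an expected sequence, reporting the first position that differs. This is only a sketch: the Token enum below is a hypothetical stand-in for the asker's token type, and the class and method names are made up for illustration.

```csharp
using System;

// Hypothetical stand-in for the token kinds named in the question.
public enum Token { OPCODE, WHITESPACE, LABEL, EOL }

public static class SyntaxChecker
{
    // Returns the index of the first token that differs from the expected
    // sequence, or -1 when the two sequences are identical.
    public static int FirstMismatch(Token[] expected, Token[] actual)
    {
        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
                return i;
        }
        // One sequence is a prefix of the other: the mismatch sits at the
        // first missing or extra token.
        return expected.Length == actual.Length ? -1 : shared;
    }
}
```

The returned index can be mapped back to the token's position in the source line, so the error message can say where the statement went wrong. This works for fixed token sequences; optional or repeated parts would need a richer description of the syntax.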
I ran into the same problem, came across your question, and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps