Why is this Email regex so slow on Mvc?

I am currently building a system using ASP.NET, C#, and MVC 2 which uses the following regex:
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$
This is an e-mail regex that validates a 'valid' e-mail address format. My code is as follows:
if (!Regex.IsMatch(model.Email, @"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$"))
    ModelState.AddModelError("Email", "The field Email is invalid.");
The regex works fine for validating e-mails. However, if a particularly long invalid string is passed to it, the system keeps 'working' without ever resolving the page. For instance, this is the data that I tried to pass:
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
The above string causes the system to essentially lock up. I would like to know why, and whether I can use a regex that accomplishes the same thing in a simpler manner. My goal is to reject an incorrectly formed e-mail address such as the following:
host.@.host..com

You have nested repetition operators sharing the same characters, which is liable to cause catastrophic backtracking.
For example: ([-.\w]*[0-9a-zA-Z])*
This says: match zero or more of -._0-9a-zA-Z followed by a single 0-9a-zA-Z, and repeat that whole group zero or more times.
i falls in both of these classes.
Thus, when run on iiiiiiii... with no match possible, the regex tries every possible way of splitting the run into groups of (several "i"s followed by one "i"), and the number of such splits grows exponentially with the length of the input.
In general, validating email addresses with a regular expression is hard.
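One possible way to remove the ambiguity described above, offered as a sketch rather than a definitive fix: because a repeated ([-.\w]*[0-9a-zA-Z])* accepts exactly the strings that a single optional ([-.\w]*[0-9a-zA-Z])? accepts, the outer repetition can be dropped. The class and method names below are hypothetical, the 250 ms timeout is an arbitrary value, and the timeout overload assumes .NET 4.5 or later.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailCheck
{
    // Same language as the original pattern, but the local part uses one
    // optional group instead of a repeated one, so the engine has only one
    // way to split a run of letters.
    private static readonly Regex Pattern = new Regex(
        @"^[0-9a-zA-Z](?:[-.\w]*[0-9a-zA-Z])?@(?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9}$",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250)); // safety net; the value is an assumption

    public static bool LooksLikeEmail(string input)
    {
        if (string.IsNullOrEmpty(input)) return false;
        try
        {
            return Pattern.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timed-out match as invalid input.
            return false;
        }
    }
}
```

With this shape, the long run of "i" characters from the question fails quickly instead of hanging, and host.@.host..com is still rejected.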

Related

Email validation C# asp.net [duplicate]

I used the following pattern to validate my email field.
return Regex.IsMatch(email,
    @"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
    @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-0-9a-z]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
    RegexOptions.IgnoreCase, TimeSpan.FromMilliseconds(250));
It uses the following reference:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/how-to-verify-that-strings-are-in-valid-email-format
My requirement is a maximum of 64 characters for the user part and a maximum of 254 characters for the whole email string. The pattern in the reference only allows a maximum of 134 characters. Can someone give a clear explanation of what the pattern means? What is the right pattern to achieve my goal?

The code you cited is over-engineered. All you need to verify an email is to check for an at symbol and for a dot. If you need anything more precise, you are probably at a point where you actually need to email the recipient and ask them to confirm that they hold the address, which is simpler than a complex regex and much more precise.
Such a regex would simply be:
.+@.+\..+
Annotated:
.+ At least one of any character
@ The at symbol
.+ At least one character
\. The . symbol
.+ At least one character
Of course this means that some malformed emails will be accepted (false positives), like tomas@company.c when the user intended tomas@company.com. But even if you design the most robust of regexes, one that checks against a list of accepted TLDs, you will never catch tomas@company.co, and you might introduce false negatives, like rejecting tomas@company.blockchain when a new TLD is released and your code isn't updated.
So just keep it simple.

If you wanted to avoid using regex (which is, in my opinion, difficult to decipher), you could use the .Split() method on the email string using the "@" symbol as your delimiter. Then, you can check the string lengths of the two components from there.
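A minimal sketch of that Split-based approach, using the 64-character and 254-character limits stated in the question. The class and method names are hypothetical, and the check deliberately ignores quoted local parts, which may themselves contain an @ character.

```csharp
using System;

public static class EmailShape
{
    public static bool HasPlausibleShape(string email)
    {
        // Whole address: at most 254 characters (limit taken from the question).
        if (string.IsNullOrEmpty(email) || email.Length > 254) return false;

        // Exactly one '@' separating the local part from the domain.
        string[] parts = email.Split('@');
        if (parts.Length != 2) return false;

        string local = parts[0];
        string domain = parts[1];

        // Local part: 1 to 64 characters (limit taken from the question).
        if (local.Length == 0 || local.Length > 64) return false;

        // Domain: contains a dot that is neither its first nor its last character.
        int dot = domain.IndexOf('.');
        return dot > 0 && !domain.EndsWith(".", StringComparison.Ordinal);
    }
}
```

This only checks the overall shape and the lengths; whether the mailbox actually exists still has to be confirmed by sending mail to it.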

Several years back, I wrote an email validation attribute in C# that should recognize most of that subset of syntactically valid email addresses that have the form local-part@domain. I say "most" because I didn't bother to try to deal with things like punycode, IPv4 address literals (dotted quads), or IPv6 address literals.
I'm sure there are lots of other edge cases I missed as well. But it worked well enough for our purposes at the time.
Use it in good health: C# Email Address validation
Before you go down the road of writing your own, you might want to (1) read through the multiple relevant RFCs and try to understand the vagaries of what constitutes a "valid" email address (it's not what you think), and (2) stop trying to validate an RFC 822 email address. About the only way to "validate" an email address is to send mail to it and see if it bounces or not. Even that doesn't mean that anybody is home at that address, or that the mailbox won't disappear next week.
https://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/
https://jackfoxy.github.io/FsRegEx/emailregex.html
Jeffrey Friedl's book Mastering Regular Expressions has a [more-or-less?] complete regular expression to match syntactically valid email addresses. It's 6,598 characters long.
Did you know that postmaster@. is a legal email address? It theoretically gets you to the postmaster of the root DNS server.
Or that [theoretically] "bang path" email addresses like MyDepartmentServer!MainServer!BigRouter!TheirDepartmentServer!SpecificServer!jsmith are valid? Here you define the actual path through the network that the email should take. It helps if you know the network topology involved.

Is XSS possible through the MailAddress class?

Suppose I parse user input, which is supposed to be an email address, into the MailAddress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the output even though I made sure that the address is a valid email address by parsing the string?
From what I could gather the RFC for mail addresses is pretty complicated, so I am unsure if cross site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith@contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.

Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to be perfect. It makes the attacker's job that much harder if they face an outgoing filter as well as an incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every nonvalid character in an email address (these shouldn't get through at this point but do this for comprehensiveness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a nonrecursive filter would remove the middle occurrence of <script> and leave the outer fragments intact, which then join up to form a new <script> tag.

Is it necessary to sanitize the output
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter: you'd get &lt; in the HTML source as a result, and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.
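As a rough illustration of encoding at the output stage, the sketch below uses System.Net.WebUtility.HtmlEncode so that it runs without a web project; in a WebForms page, setting the Literal's Mode to Encode as described above achieves the same effect. The class and method names are hypothetical.

```csharp
using System;
using System.Net;

public static class OutputEncoding
{
    // Builds the message from the question, HTML-encoding the untrusted part
    // regardless of what the input validation is believed to guarantee.
    public static string BuildMessage(string address)
    {
        return "Your mail address is " + WebUtility.HtmlEncode(address);
    }
}
```

The fixed text is trusted markup-free content, so only the user-supplied value goes through the encoder.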

Email Regex that DOES include unicode domains

I was wondering if anybody has found a solution that validates an email that includes unicode characters as in from a unicode domain? I have searched at length and have yet to find a solution that works.

Fully validating an email address through a regex is hard. Really hard. This is one that is fully compliant with RFC 822. Even if you create a perfect regex that correctly validates all email addresses, that doesn't stop me from entering hi@hi.com (if you're trying to make sure that I enter a valid email address) or from accidentally misspelling my username (if you're trying to make sure that I enter my email address correctly).
Just send a link in an email saying, "click here to validate your email address."

I had the same issue and came up with a solution based on \p{L}, which matches any Unicode letter.
Please check it out:
private static bool IsEmailValid(string email) {
    System.Text.RegularExpressions.Regex re = new Regex(@"^[\p{L}0-9!$'*+\-_]+(\.[\p{L}0-9!$'*+\-_]+)*@[\p{L}0-9]+(\.[\p{L}0-9]+)*(\.[\p{L}]{2,})$", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
    return re.IsMatch(email);
}

OK, so the only email validation I ever found that was truly awesome (instead of just OK) is part of the Zend Framework. Of course that means PHP, but hopefully you can look at how they do it and emulate some of their better ideas: http://pastebin.com/SvZPBp31 Or just look up the Zend_Validate_EmailAddress source code.
Sorry that this isn't in C#.

As has been pointed out, validating e-mail addresses through a regular expression is a hard problem. You can get close with a fairly simple one, but there are many, many cases that it will fail to catch. I'm all for sending an email to a supposed email address as @Nick ODell suggests (after doing some basic sanity checking, like: does it contain an @ sign, does the domain name portion exist and have one or more of MX/A/AAAA RRs, and the like) and including a verification link.
That said, if by Unicode domain you mean a Punycode-encoded host name label, those should be covered by any half-way competent validation regexp, as in encoded form those are just xn-- followed by the regular set [a-z0-9-] (case insensitive comparison).
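If the input arrives in Unicode form rather than already Punycode-encoded, one option is to convert the domain part to its ASCII form first and then validate that. The sketch below uses System.Globalization.IdnMapping from the .NET base class library; the wrapper class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

public static class DomainNormalizer
{
    // Converts a (possibly Unicode) domain name to its ASCII/Punycode form.
    // Returns false when IdnMapping rejects the name.
    public static bool TryToAscii(string domain, out string ascii)
    {
        try
        {
            ascii = new IdnMapping().GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            ascii = null;
            return false;
        }
    }
}
```

Labels containing non-ASCII characters come back with the xn-- prefix, so an ASCII-only validation pattern can then be applied to the result.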

Regular Expression to validate a URL or domain name.

Can someone please let me know what is wrong with my regular expression? I'm trying to validate just the beginning of URLs, mainly just host names (e.g. www.yahoo.com).
Regular Expression: ^(((ht|f)tp(s?))\:\/\/)?(www.)?([a-zA-Z0-9\-\.]{1,63})+\.([a-zA-Z]{2,5})$
Testing Values:
test.com – passes
test.c2om – fails
test.test.com – passes
test.test.c2om – fails
test.test.test.com – passes
test.test.test.c2om – INVALID REGEX PATTERN
This should return false, but instead returns nothing, both using javascript and c#… If you remove the {1,63} restriction on the size of the subdomain, it works…

You've created a catastrophic pattern: the engine will try to match ([a-zA-Z0-9\-\.]{1,63})+ in many different ways before it finally fails. A simple solution is to remove {1,63}; as you've noted, it doesn't seem to be adding anything anyway.
Another option is to use the dots as anchors, so you cannot backtrack between them (this gives you only one way to match the text, which is presumably what you're trying to do):
([a-zA-Z0-9\-]{1,63}\.)*[a-zA-Z0-9\-]{1,63}
Keep in mind that it isn't really correct anymore to assume only ASCII English letters in domain names. For example, http://אתר.קום is a legal (and working) URL.
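A short sketch of the dot-anchored idea applied to the host names from the question. It combines the label pattern above with the asker's [a-zA-Z]{2,5} ending and leaves out the optional scheme prefix; the class and method names are hypothetical and the timeout value is an arbitrary assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostNameCheck
{
    // Each label excludes '.', and every repetition must end in a literal dot,
    // so there is only one way to divide the input into labels.
    private static readonly Regex Host = new Regex(
        @"^(?:[a-zA-Z0-9\-]{1,63}\.)+[a-zA-Z]{2,5}$",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250));

    public static bool IsHostName(string input)
    {
        if (string.IsNullOrEmpty(input)) return false;
        try
        {
            return Host.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }
}
```

With this pattern, test.test.test.c2om is rejected promptly rather than stalling the engine.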

Optimistic RegEx Matching for User Text Entry

I'm working on a text entry application that uses regular expressions to validate user input. The goal is to allow keypresses that fit a certain RegEx while rejecting invalid characters. One issue I've run into is that when a user starts inputting information they may create a string that doesn't yet match the given regex, but could cause a match in the future. These strings get erroneously rejected. Here's an example - given the following regex for inputting date information:
(0?[1-9]|10|11|12)/(0?[1-9]|[12]\\d|30|31)/\\d{2}\\d{2}
A user may begin entering "1/" which could be a valid date, but RegEx.IsMatch() will return false and my code ends up rejecting the string. Is there a way to "optimistically" test strings against a regular expression so that possible or partial matches are allowed?
Bonus: For this RegEx in particular there are some sequences which cause required characters. For example, if the user types "2/15" the only possible valid character they could enter next is "/". Is it possible to detect those scenarios so that the required characters could be automatically entered for the user to ease input?

What you can do is anchor your RegExp (i.e. add ^ and $, for start and end of line) and make each trailing component optional for validation, but strictly defined if present.
Something looking like this:
^(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\\d|30|31)(/(\\d{2}(\\d{2})?)?)?)?)?$
I do realize it looks horrible but as far as I know there is no way to tell the regexp engine to validate as long as the string satisfies the beginning of the regexp pattern.
In my opinion, the best way to achieve what you want is to create separate inputs for day/month/year and check their values when the user leaves the text field.
It also provides a better visibility and user-experience, as I believe no one likes to be prevented from typing certain characters into a text field with or without noticing them disappear as they type or having slashes inserted automatically and without notice.
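To make the nested-optional pattern above concrete, here is a small sketch that checks whether a partially typed string could still become a date. It uses the same pattern written as a C# verbatim string (so \d needs no doubling); the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PartialDateInput
{
    // Every component after the month is wrapped in an optional group,
    // so any prefix that stops at a component boundary still matches.
    private static readonly Regex Prefix = new Regex(
        @"^(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\d|30|31)(/(\d{2}(\d{2})?)?)?)?)?$");

    public static bool CouldBecomeDate(string typedSoFar)
    {
        return typedSoFar != null && Prefix.IsMatch(typedSoFar);
    }
}
```

Inputs such as "1/", "2/15" and "2/15/" are accepted while "13" and "1x" are rejected. One limitation of this particular pattern is that a lone "0" is still rejected, because 0?[1-9] needs the following digit before it can match.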

Have you ever used an app or form that worked that way, simply refusing to accept any keypress it didn't like? If the answer is Yes, did it blow an electronic raspberry each time you pressed a wrong key?
If you really need to validate the input before the form is submitted, use a passive feedback mechanism like a red border around the text field that disappears when the regex matches the input. Also, make sure there's a Help button or a tooltip nearby to provide constructive feedback.
Of course, the best option would be to use a dedicated control like a date-entry widget. But whatever you do, don't do it in such a way that it feels like you're playing guessing games with the user.
