Validate and correct invalid emailid formats in C#

Validate and correct invalid emailid formats in C# - c#

Is there a way we can validate and correct invalid format emailids in C#.I got a function which can only validate but not correction.Some emailds like "abc#def.com." can be corrected.I`m fetching all emailids from database and sending them a mail,if I just remove invalid emailids,the person may loose info,so instead of removing I thought of correcting the mailid and send him the mail.
Is there a way?Or a function to do this.???
Thanks in advance.

If you have the email address as a string, then you can manipulate the string. In your example, that would be removal of the trailing period. Other than this simple example, I suggest that you think long and hard about how useful this will be. What is the context? Can you pass the mail address back to a user to get the correct address, as opposed to your best guess?Adding code will clarify your question. From your question, I don't know why you assume you can only validate, as opposed to correcting the mail address string.

You could check wheather the Mail domain exists for example like this, you can check if the Email ends with an unvalid char like "." or "," and remove this if found but you can not really "correct" wrong Emails by trying to change each char and check if the Email exists or not, and its not desired sicne you would find probably for each change you make an exissting Email adress which is not the one you really wish to reach.

No. There is no way to do this. You may have a built-in guess system that will take care of common mistakes though.
For instance, if I type my email id as abc#gmali.com, you may change it to abc#gmail.com. This still does not guarantee that the email is is now correct.
Assume I had an email id as abc#gmail.com and intentionally I typed in asd#gmail.com. Now, there is no way you can correct it. With the same intention, if I type asd#gmial.com, your code might make it correct email id as asd#gmail.com which still is incorrect.
Essentially what you are looking for is called client side validation. What ever front end you have, place validation that check if email address is correct as per syntax. For verifying if the user has given his real email, send a mail to the given address with activation link and ask them to click on it if they want to use applciation.
Edit:
If you need to just format the emails in database, you can check for common mistakes using queries/external executable. These will validate the data against a valid format which then, can be changed. What are the options you have, technology wise, for doing this?

Related

Email validation C# asp.net [duplicate]

This question already has answers here:
How can I validate an email address using a regular expression?
(79 answers)
Closed 3 years ago.
I used the following pattern to validate my email field.
return Regex.IsMatch(email,
#"^(?("")("".+?(?<!\\)""#)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])#))" +
#"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-0-9a-z]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
RegexOptions.IgnoreCase, TimeSpan.FromMilliseconds(250));
It uses the following reference:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/how-to-verify-that-strings-are-in-valid-email-format
My requirement is to have maximum number of 64 characters for user part, and max length for whole email string is 254 characters. The pattern in the reference only allow max 134 characters. Can someone give clear explanation of the meaning for the pattern? What is the right pattern to achieve my goal?

The code you cited is over-engineered, all you need to verify an email is to check for an at symbol and for a dot. If you need anything more precise, you are probably at a point where you actually need to email the recipient and ask for their confirmation that they hold the email, something that is simpler than a complex regex, and which provides much more precision.
Such a regex would simply be:
.+#.+\..+
Commentated below
.+ At least one of any character
# The at symbol
.+ At least one character
\. The . symbol
.+ At least one character
Of course this means that some emails might be accepted as false positives, like tomas#company.c when the user intended tomas#company.com , but even if you design the most robust of regexes, one that checks against a list of accepted TLDs, you will never catch tomas#company.co, and you might insert positive falses like tomas#company.blockchain when a new TLD is released and your code isn't updated.
So just keep it simple.

If you wanted to avoid using regex (which is, in my opinion, difficult to decipher), you could use the .Split() method on the email string using the "#" symbol as your delimiter. Then, you can check the string lengths of the two components from there.

Several years back, I wrote an email validation attribute in C# that should recognize most of that subset of syntactically valid email addresses that have the form local-part#domain — I say "most" because I didn't bother to try do deal with things like punycode, IPv4 address literals (dotted quads), or IPv6 address literals.
I'm sure there's lots of other edge cases I missed as well. But it worked well enough for our purposes at the time.
Use it in good health: C# Email Address validation
Before you go down the road of writing you own, you might want to read through the multiple relevant RFCs and try to understand the vagaries of what constitutes a "valid" email address (it's not what you think), and (2) stop trying to validate an RFC 822 email address. About the only way to "validate" an email address is to send mail to it and see if it bounces or not. Which doesn't mean that anybody is home at that address, or that that mailbox won't disappear next week.
https://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/
https://jackfoxy.github.io/FsRegEx/emailregex.html
Jeffrey Friedl's book Mastering Regular Expressions has a [more-or-less?] complete regular expression to match syntactically valid email addresses. It's 6,598 characters long.
Did you know that postmaster#. is a legal email address? It theoretically gets you to the postmaster of the root DNS server.
Or that [theoretically] "bang path" email addresses like MyDepartmentServer!MainServer!BigRouter!TheirDepartmentServer!SpecificServer!jsmith are valid. Here you define the actual path through the network that the email should take. Helps if you know the network topology involved.

Is XSS possible through the MailAddress class?

Considering I parse user input, which is supposed to be an email address, into the MailAdress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the outpout even though I made sure that the address is a valid email address by parsing the string?
From what I could gather the RFC for mail addresses is pretty complicated, so I am unsure if cross site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith#contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.

Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to perfect. It makes the attackers job that much harder if they face and outgoing filter as well and as incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every nonvalid character in an email address (these shouldn't get through at this point but do this for comprehensiveness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a nonrecursive filter would remove the middle occurence of <script> and leave the outer occurrence in tact.

Is it necessary to sanitize the outpout
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter - you'd get < in the HTML source as a result and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.

Email Regex that DOES include unicode domains

I was wondering if anybody has found a solution that validates an email that includes unicode characters as in from a unicode domain? I have searched at length and have yet to find a solution that works.

Fully validating an email address through a regex is hard. Really hard. This is one that is fully compliant with RFC822. Even if you create a perfect regex that correct validates all email addresses, that doesn't stop me from entering hi#hi.com (If you're trying to make sure that I enter a valid email address) or from accidentally misspelling my username (If you're trying to make sure that I enter my email address correctly).
Just send a link in an email saying, "click here to validate your email address."

I had the same issue and came up with an intelligent solution \p{L}.
Please check it out:
private static bool IsEmailValid(string email) {
System.Text.RegularExpressions.Regex re = new Regex(#"^[\p{L}0-9!$'*+\-_]+(\.[\p{L}0-9!$'*+\-_]+)*#[\p{L}0-9]+(\.[\p{L}0-9]+)*(\.[\p{L}]{2,})$", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
return re.IsMatch(email);
}

Ok, so the only email validation I ever found that was truly awesome (instead of just OK) is part of the Zend Framework. Of course that means PHP, hopefully though, you can look at how they do it and emulate some of their better ideas: http://pastebin.com/SvZPBp31 Or just look up Zend_Validate_EmailAddress sourcecode.
sorry that this isn't in C# syntax / language.

Like has been pointed out, validating e-mail addresses through a regular expression is a hard problem. You can get close with a fairly simple one, but there are many, many cases that it will fail to catch. I'm all for sending an email to a supposed email address as #Nick ODell suggests (after doing some basic sanity checking, like, does it contain an # sign, does the domain name portion exist and have one or more of MX/A/AAAA RRs, and the likes) and including a verification link.
That said, if by Unicode domain you mean a Punycode-encoded host name label, those should be covered by any half-way competent validation regexp, as in encoded form those are just xn-- followed by the regular set [a-z0-9-] (case insensitive comparison).

Email Auto-Filtering

I would like to use the masked edit box to let people type in input a valid email address,
What is the best mask i should use in MaskedEditExtender.Mask ?

There are two different things: the mask and the validation.
Obviously, you cannot use one of the predefined MaskTypes (number, date,...) and neither a mask with a fixed length (and a very long mask with enough underscores for every possible email address will be ugly and confuse the user). My answer would be: don't use a mask.
However, you can still validate the email address setting the ValidationExpression to your favorite email regex (like the one that comes with the Regex validator control). Note that no "perfect" email validation regex exists, but you can warn the user if there are apparent errors.

Is it possible to programmatically 'clean' emails?

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may be also be present so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?

In email, there is couple of agreed markings that mean something you wish to strip. You can look for these lines using regular expressions. I doubt you can't really well "sanitize" your emails, but some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.

A few obvious things to look at:
if the mail is anything but pure plain text, the message will be multi-part mime. Any part whose type is "image/*" (image/jpeg, etc), can probably be dropped. In all likelyhood any part whose type is not "text/*" can go.
A HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do a HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.

Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do
stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(

Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering desctructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bold in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."

If you creating your own application i'd look into Regex, to find text and replace it. To make the application a little nice, i'd create a class Called Email and in that class i have a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!

SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
FromEmailAddress = "john.smith#example.com",
FromName = "John Smith",
TextBody = #"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.