I used the following pattern to validate my email field.
return Regex.IsMatch(email,
    @"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
    @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-0-9a-z]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
    RegexOptions.IgnoreCase, TimeSpan.FromMilliseconds(250));
It uses the following reference:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/how-to-verify-that-strings-are-in-valid-email-format
My requirement is a maximum of 64 characters for the user part and a maximum of 254 characters for the whole email string. The pattern in the reference only allows a maximum of 134 characters. Can someone give a clear explanation of what the pattern means? What is the right pattern to achieve my goal?
The code you cited is over-engineered. All you need to verify an email address is to check for an at symbol and a dot. If you need anything more precise, you are probably at the point where you should email the recipient and ask them to confirm that they hold the address, which is simpler than a complex regex and far more precise.
Such a regex would simply be:
.+@.+\..+
Annotated below:
.+ At least one of any character
@ The at symbol
.+ At least one character
\. The . symbol
.+ At least one character
Of course, this means that some invalid addresses will be accepted as false positives, like tomas@company.c when the user intended tomas@company.com. But even the most robust regex, one that checks against a list of accepted TLDs, will never catch tomas@company.co, and it may introduce false negatives like tomas@company.blockchain when a new TLD is released and your code isn't updated.
So just keep it simple.
If you wanted to avoid using regex (which is, in my opinion, difficult to decipher), you could use the .Split() method on the email string with the "@" symbol as your delimiter. Then you can check the string lengths of the two components from there.
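A minimal sketch of that idea, assuming the limits from the question (64 characters for the user part, 254 overall). The class and method names are made up for illustration, and it splits at the last "@" rather than calling .Split() so that an "@" inside a quoted user part does not produce extra pieces:

```csharp
using System;

public static class EmailLengthCheck
{
    // Limits taken from the question, not verified against any RFC here.
    private const int MaxLocalPartLength = 64;
    private const int MaxTotalLength = 254;

    public static bool IsPlausible(string email)
    {
        if (string.IsNullOrEmpty(email) || email.Length > MaxTotalLength)
            return false;

        // Split at the last '@': everything before it is the user part.
        int at = email.LastIndexOf('@');
        if (at < 1 || at == email.Length - 1)
            return false;

        string localPart = email.Substring(0, at);
        string domain = email.Substring(at + 1);

        // Require a dot inside the domain, but not as its first or last character.
        int dot = domain.IndexOf('.');
        return localPart.Length <= MaxLocalPartLength
            && dot > 0
            && domain[domain.Length - 1] != '.';
    }
}
```

This only checks shape and length; whether the mailbox exists still has to be confirmed by sending mail to it.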
Several years back, I wrote an email validation attribute in C# that should recognize most of that subset of syntactically valid email addresses that have the form local-part@domain. I say "most" because I didn't bother to try to deal with things like punycode, IPv4 address literals (dotted quads), or IPv6 address literals.
I'm sure there are lots of other edge cases I missed as well. But it worked well enough for our purposes at the time.
Use it in good health: C# Email Address validation
Before you go down the road of writing your own, you might want to (1) read through the multiple relevant RFCs and try to understand the vagaries of what constitutes a "valid" email address (it's not what you think), and (2) stop trying to validate an RFC 822 email address. About the only way to "validate" an email address is to send mail to it and see if it bounces or not. Which doesn't mean that anybody is home at that address, or that that mailbox won't disappear next week.
https://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/
https://jackfoxy.github.io/FsRegEx/emailregex.html
Jeffrey Friedl's book Mastering Regular Expressions has a [more-or-less?] complete regular expression to match syntactically valid email addresses. It's 6,598 characters long.
Did you know that postmaster@. is a legal email address? It theoretically gets you to the postmaster of the root DNS server.
Or that [theoretically] "bang path" email addresses like MyDepartmentServer!MainServer!BigRouter!TheirDepartmentServer!SpecificServer!jsmith are valid? Here you define the actual path through the network that the email should take. Helps if you know the network topology involved.
I'm working on a ticketing system where staff can send email to customers using SMTP; when a customer replies, I fetch the reply using IMAP and add it to the ticket.
Right now I'm adding the ticket ID to the subject line, so when an email comes in I can append it to the existing ticket.
But at times customers remove the subject line and reply, which creates a new ticket.
Can anyone advise me how to get around this? I think Zendesk appends the ticket ID to the From email address; not sure if that would work.
Most systems have a comment in the replies telling the customer NOT to edit the subject line.
However, if they remove the subject, you can instead search the email content, provided they included the previous reply with the subject line in it. Limit the search to, say, the first 200 lines if you like, and use a regular expression to match your subject line text and extract the ticket number.
Embed the ticket ID in all of the subject, the message-id and the body, and look for it in the subject, the References field and the body of the replies.
If you want to be extra careful and/or have many Outlook users as customers, you can even embed it in Thread-Index, which will make Outlook respond with the necessary details. Thread-Index normally has space for 160 bits of entropy, you can embed a ticket ID and still have >100 bits left over. http://rant.gulbrandsen.priv.no/aox/thread-index describes the format in enough detail.
An example:
To: customer@example.com
From: support@example.net
Subject: Blah blah [Ticket 34112]
Message-ID: <ticket-34112-7582349573489@example.net>
Thread-Index: 4242034112E0OYfxS/CjgSLFGePpiQAdZqFQACzEh/AAmOpSkA==
This relates to ticket 34112; please keep this line in your reply.
Blah.
If you get back a Thread-Index that starts with 4242, the next six digits are the ticket number; if you get back a References line that contains ticket-34112, you know the ticket number; et cetera.
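Here is a rough sketch of the lookup side, assuming the formats shown in the example message: "[Ticket 34112]" in the subject, "ticket-34112-..." in the referenced Message-ID, and a Thread-Index that begins with 4242 followed by a six-digit, zero-padded ticket number. The class name, method name and parameter list are made up for illustration; how you get the header values out of your IMAP library is up to you.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TicketIdFinder
{
    // Patterns mirror the example message; adjust them to whatever you actually emit.
    private static readonly Regex TicketWord =
        new Regex(@"\bticket\s+(\d+)", RegexOptions.IgnoreCase);
    private static readonly Regex MessageIdReference =
        new Regex(@"<ticket-(\d+)-", RegexOptions.IgnoreCase);
    private static readonly Regex ThreadIndexPrefix =
        new Regex(@"^4242(\d{6})");

    // Checks the subject, then References, then Thread-Index, then the body.
    // Returns null when no ticket ID is found anywhere.
    public static int? Find(string subject, string references, string threadIndex, string body)
    {
        return TryMatch(TicketWord, subject)
            ?? TryMatch(MessageIdReference, references)
            ?? TryMatch(ThreadIndexPrefix, threadIndex)
            ?? TryMatch(TicketWord, body);
    }

    private static int? TryMatch(Regex regex, string input)
    {
        if (string.IsNullOrEmpty(input))
            return null;

        Match match = regex.Match(input);
        if (match.Success && int.TryParse(match.Groups[1].Value, out int id))
            return id;

        return null;
    }
}
```

The body is checked last because it is the place most likely to contain an unrelated "ticket 123" typed by the customer.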
I'm retrieving raw text (headers and message) from a POP server. I need to capture everything after the headers, which are separated from the user message by a blank line.
At the same time I want to ignore anything from original messages if it's a reply. In the emails I'm parsing, the quoted original starts with
------Original Message------
An example email might look like this
Return-Path: ...
...
More Email Metadata: ...
Hello from regex land, I'm glad to hear from you.
------Original Message------
Metadata: ...
...
Hey regex dude, can you help me? Thanks!
Sincerely, Me.
I need to extract "Hello from regex land, I'm glad to hear from you." and any other text/lines prior to the original message.
I'm using this regex right now (C# in multiline mode) and it seems to work, except that it captures ------Original Message------ if the body is blank. I'd rather just get an empty string instead.
^\s*$\n(.*)(\n------Original Message------)?
Edit
I haven't down voted anyone and if you happen to downvote, it's usually helpful to include comments.
The reason for this is that you have an extra \n inside the parentheses. If the body is blank, there is no extra newline there. Therefore, try this:
^\s*$\r\n(.*)(^------Original Message------$)?
If you don’t want the newline at the end of the body, you can still use string.Trim() on the matched part.
Note: This assumes that the input uses \r\n line terminators (which is required in e-mail headers according to the MIME standard).
Why don't you use DotnetOpenMail? Using a regex for this is the wrong approach; you'd be better off with a dedicated email handler.
You need to replace (\n------Original Message------) with the lookahead (?=(\n------Original Message------)), which checks that the marker is there without including it in the match.
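For comparison, here is a sketch that avoids doing everything in a single pattern: find the blank line that ends the headers, then cut the body at the marker line. It is not the code from any of the answers above; the class and method names are made up, and it assumes the marker sits on a line of its own, as in the question's example. It accepts both \r\n and \n line endings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplyBodyExtractor
{
    // The header block ends at the first empty line.
    private static readonly Regex HeaderEnd = new Regex(@"\r?\n\r?\n");

    // A line made of hyphens, "Original Message", hyphens.
    // The trailing \s* lets the pattern absorb a \r before the line's \n.
    private static readonly Regex OriginalMarker = new Regex(
        @"^-+\s*Original Message\s*-+\s*$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);

    public static string Extract(string rawMessage)
    {
        Match headerEnd = HeaderEnd.Match(rawMessage);
        if (!headerEnd.Success)
            return string.Empty; // headers only, no body

        string body = rawMessage.Substring(headerEnd.Index + headerEnd.Length);

        // Drop the marker line and everything after it, if present.
        Match marker = OriginalMarker.Match(body);
        if (marker.Success)
            body = body.Substring(0, marker.Index);

        return body.Trim();
    }
}
```

With a blank body the marker is found at position 0, so the result is an empty string rather than the marker itself.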
Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may also be present so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there are a couple of agreed-upon markers for content you may wish to strip, and you can look for them using regular expressions. I doubt you can "sanitize" your emails really well, but here are some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
A few obvious things to look at:
If the mail is anything but pure plain text, the message will be multipart MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood, any part whose type is not "text/*" can go.
A HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do a HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
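The last two points can be sketched for a plain-text body roughly like this. It is only an illustration (the class and method names are made up), it assumes the MIME handling has already produced the text/plain part, and it drops every quoted line, so interleaved replies lose their context:

```csharp
using System;
using System.Collections.Generic;

public static class PlainTextCleaner
{
    public static string StripQuotesAndSignature(string body)
    {
        var kept = new List<string>();
        string[] lines = body.Replace("\r\n", "\n").Split('\n');

        foreach (string line in lines)
        {
            // "-- " starts the signature; the trailing space is often missing.
            // Everything from here on is dropped.
            if (line == "-- " || line == "--")
                break;

            if (line.StartsWith(">From ", StringComparison.Ordinal))
            {
                // Escaped "From " line: not a quote, so keep it without the ">".
                kept.Add(line.Substring(1));
                continue;
            }

            // Any other line starting with ">" is treated as quoted text.
            if (line.StartsWith(">", StringComparison.Ordinal))
                continue;

            kept.Add(line);
        }

        return string.Join("\n", kept).Trim();
    }
}
```

Keeping the original text next to the cleaned version makes it possible to recover anything this removes by mistake.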
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you are creating your own application, I'd look into regex to find text and replace it. To make the application a little nicer, I'd create a class called Email with a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
    FromEmailAddress = "john.smith@example.com",
    FromName = "John Smith",
    TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);
I'm trying to figure out how to parse out the text of an email from any quoted reply text that it might include. I've noticed that usually email clients will put an "On such and such date so and so wrote" or prefix the lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have any idea on how to programmatically detect reply text? I am using C# to write this parser.
I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:
When you have the thread:
If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To ID, and Thread-Index to determine the individual message, its parent, and the thread it belongs to. For more information on this, see RFC822, RFC2822, this interesting article on threading, or this article on threading. Once you have re-assembled the thread, you can then remove the external text (such as To, From, CC, etc... lines) and you're done.
If the messages you are working with do not have the headers, you can instead use similarity matching to determine which parts of an email are repeated reply text. In this case you might want to look into a Levenshtein Distance algorithm such as this one on Code Project or this one.
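For reference, a compact sketch of the Levenshtein distance itself (the standard dynamic-programming version keeping two rows), plus a normalized similarity score. This is not the code behind either link; the class and method names are made up, and the threshold at which two lines count as "the same" is something you would have to tune.

```csharp
using System;

public static class TextSimilarity
{
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    public static int LevenshteinDistance(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }

            // The row just computed becomes the previous row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    // 1.0 means identical, 0.0 means nothing in common.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0)
            return 1.0;

        return 1.0 - (double)LevenshteinDistance(a, b) / longest;
    }
}
```

Comparing line by line (or paragraph by paragraph) against the earlier messages keeps the cost manageable, since the algorithm is quadratic in the length of its inputs.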
No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads.
When you don't have the thread:
If you are stuck with only one message from the thread, you're going to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:
A line (as seen in Outlook)
Angle Brackets
"---Original Message---"
"On such-and-such day, so-and-so wrote:"
Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!
First of all, this is a tricky task.
You should collect typical responses from different e-mail clients and prepare correct regular expressions (or whatever) to parse them. I've collected responses from Outlook, Thunderbird, Gmail, Apple Mail, and mail.ru.
I am using regular expressions to parse responses in the following manner: if an expression did not match, I try to use the next one.
new Regex("From:\\s*" + Regex.Escape(_mail), RegexOptions.IgnoreCase);
new Regex("<" + Regex.Escape(_mail) + ">", RegexOptions.IgnoreCase);
new Regex(Regex.Escape(_mail) + "\\s+wrote:", RegexOptions.IgnoreCase);
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline);
new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase);
new Regex("from:\\s*$", RegexOptions.IgnoreCase);
To remove quotation in the end:
new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Here is my small collection of test responses (samples divided by ----):
From: test@test.com [mailto:test@test.com]
Sent: Tuesday, January 13, 2009 1:27 PM
----
2008/12/26 <test@test.com>
> text
----
test@test.com wrote:
> text
----
test@test.com wrote: text
text
----
2009/1/13 <test@test.com>
> text
----
test@test.com wrote: text
text
----
2009/1/13 <test@test.com>
> text
> text
----
2009/1/13 <test@test.com>
> text
> text
----
test@test.com wrote:
> text
> text
<response here>
----
--- On Fri, 23/1/09, test@test.com <test@test.com> wrote:
> text
> text
Thank you, Goleg, for the regexes! Really helped. This isn't C#, but for the googlers out there, here's my Ruby parsing script:
def extract_reply(text, address)
  # Single-quoted fragments keep \s as the regex whitespace class;
  # inside a double-quoted Ruby string, "\s" is just a literal space.
  regex_arr = [
    Regexp.new('From:\s*' + Regexp.escape(address), Regexp::IGNORECASE),
    Regexp.new('<' + Regexp.escape(address) + '>', Regexp::IGNORECASE),
    Regexp.new(Regexp.escape(address) + '\s+wrote:', Regexp::IGNORECASE),
    Regexp.new('^.*On.*(\n)?wrote:$', Regexp::IGNORECASE),
    Regexp.new('-+original\s+message-+\s*$', Regexp::IGNORECASE),
    Regexp.new('from:\s*$', Regexp::IGNORECASE)
  ]
  text_length = text.length
  # Find the position of the match closest to the top of the message
  index = regex_arr.inject(text_length) do |min, regex|
    [(text.index(regex) || text_length), min].min
  end
  text[0, index].strip
end
It's worked pretty well so far.
By far the easiest way to do this is by placing a marker in your content, such as:
--- Please reply above this line ---
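A sketch of the receiving side, assuming your outgoing mail contains exactly that marker text. The class and method names are made up for illustration. It cuts at the start of the line holding the marker, so a "> " quote prefix in front of the marker is removed too:

```csharp
using System;

public static class ReplyMarker
{
    // Must match the text placed in the outgoing message.
    public const string Marker = "--- Please reply above this line ---";

    public static string TextAboveMarker(string body)
    {
        int index = body.IndexOf(Marker, StringComparison.OrdinalIgnoreCase);
        if (index < 0)
            return body.Trim(); // marker was removed: keep the whole body

        // Back up to the start of the marker's line to drop any quote prefix.
        int lineStart = body.LastIndexOf('\n', index) + 1;
        return body.Substring(0, lineStart).Trim();
    }
}
```

The client's attribution line ("On ... wrote:") usually sits just above the marker and is still included in the result, so you may want to combine this with one of the attribution patterns from the other answers.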
As you have no doubt noticed, parsing out quoted text is not a trivial task as different email clients quote text in different ways. To solve this problem properly you need to account for and test in every email client.
Facebook can do this, but unless your project has a big budget, you probably can't.
Oleg has solved the problem using regexes to find the "On 13 Jul 2012, at 13:09, xxx wrote:" text. However, if the user deletes this text, or replies at the bottom of the email, as many people do, this solution will not work.
Likewise if the email client uses a different date string, or doesn't include a date string the regex will fail.
There is no universal indicator of a reply in an e-mail. The best you can do is try to catch the most common and parse new patterns as you come across them.
Keep in mind that some people insert replies inside the quoted text (My boss for example answers questions on the same line as I asked them) so whatever you do, you might lose some information you would have liked to keep.
Here is my C# version of @hurshagrawal's Ruby code. I don't know Ruby really well so it could be off, but I think I got it right.
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public string ExtractReply(string text, string address)
{
    // Patterns that typically mark the start of quoted text
    var regexes = new List<Regex>
    {
        new Regex("From:\\s*" + Regex.Escape(address), RegexOptions.IgnoreCase),
        new Regex("<" + Regex.Escape(address) + ">", RegexOptions.IgnoreCase),
        new Regex(Regex.Escape(address) + "\\s+wrote:", RegexOptions.IgnoreCase),
        new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline),
        new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase),
        new Regex("from:\\s*$", RegexOptions.IgnoreCase),
        new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline)
    };

    // Cut the text at the earliest position where any pattern matches
    var index = text.Length;
    foreach (var regex in regexes)
    {
        var match = regex.Match(text);
        if (match.Success && match.Index < index)
            index = match.Index;
    }
    return text.Substring(0, index).Trim();
}
If you control the original message (e.g. notifications from a web application), you can put a distinct, identifiable header in place, and use that as the delimiter for the original post.
It should be fairly easy these days, provided the text/html content type works for you (Outlook being an exception; see details below). Here is a table with the real test results for parsing options in various desktop email clients:
| Mail client | Reply message format | HTML can be parsed easily and reliably | HTML tags to be deleted | Plain text quote marker |
|---|---|---|---|---|
| web.de | always html | yes | <div name="quote"> | - (always html) |
| Thunderbird | same as in the original message | yes | <div class="moz-cite-prefix">, <blockquote type="cite"> | "On 26.10.2022 12:37, John Doe wrote:" |
| Gmail | both | yes | <div class="gmail_quote"> | "On Thu, Oct 27, 2022 at 1:39 PM John Doe john@inbox.test wrote:" |
| Outlook 2016, 2019 | same as in the original message | Probably impossible due to use of some weird Word processor | unknown | Plain text-only message: "-----Original Message-----", multipart: 3 blank lines with some space followed by "From: John Doe john@inbox.test" |
| Apple | unknown | yes | <blockquote type="cite"> | "> On 22. Dec 2021, at 12:50, John Doe john@inbox.test wrote:" |
This is a good solution. Found it after searching for so long.
One addition: as mentioned above, this has to be handled case by case, and the expressions above did not correctly parse my Gmail and Outlook (2010) responses, so I added the following two regexes. Let me know about any issues.
//Works for Gmail
new Regex("\\n.*On.*<(\\r\\n)?" + Regex.Escape(address) + "(\\r\\n)?>", RegexOptions.IgnoreCase),
//Works for Outlook 2010
new Regex("From:.*" + Regex.Escape(address), RegexOptions.IgnoreCase),
Cheers
This is an old post, but in case you are not aware, GitHub has a Ruby library for extracting the reply. If you use .NET, I have a .NET one at https://github.com/EricJWHuang/EmailReplyParser
If you use SigParser.com's API, it will give you an array of all the broken out emails in a reply chain from a single email text string. So if there are 10 emails, you'll get the text for all 10 of the emails.
You can view the detailed API spec here.
https://api.sigparser.com/