parsing email text reply/forward


I am creating a web-based email client using C# and ASP.NET.
What confuses me is that different email clients add the original text in a lot of different ways when replying.
Is there some standardized way to tell the reply apart from the original text?
Thank you
-Theo

I was thinking:
public String cleanMsgBody(String oBody, out Boolean isReply)
{
    isReply = false;
    Regex rx1 = new Regex("\n-----");
    Regex rx2 = new Regex("\n([^\n]+):([ \t\r\n\v\f]+)>");
    Regex rx3 = new Regex("([0-9]+)/([0-9]+)/([0-9]+)([^\n]+)<([^\n]+)>");
    String txtBody = oBody;
    // Collapse blank lines, spaces after a newline, and runs of spaces.
    while (txtBody.Contains("\n\n")) txtBody = txtBody.Replace("\n\n", "\n");
    while (new Regex("\n ").IsMatch(txtBody)) txtBody = (new Regex("\n ")).Replace(txtBody, "\n");
    while (txtBody.Contains("  ")) txtBody = txtBody.Replace("  ", " ");
    // Cut the body at the first marker that looks like the start of quoted text.
    if (isReply = (isReply || rx1.IsMatch(txtBody)))
        txtBody = rx1.Split(txtBody)[0]; // Maybe a loop through would be better
    if (isReply = (isReply || rx2.IsMatch(txtBody)))
        txtBody = rx2.Split(txtBody)[0]; // Maybe a loop through would be better
    if (isReply = (isReply || rx3.IsMatch(txtBody)))
        txtBody = rx3.Split(txtBody)[0]; // Maybe a loop through would be better
    return txtBody;
}

There isn't a standardized way, but a sensible heuristic will get you a good distance.
Some algorithms classify lines based on their initial character(s) and by comparing the text to a corpus of marked up text, yielding a statistical probability for each line that it is a) part of the same block as the next/previous one and b) quoted text, a signature, new text, etc.
It'd be worth trying out some of the most popular e-mail clients and creating and comparing some sample messages to see what the differences are. Usenet newsgroups may also help you build a reasonable corpus of messages to work from. HTML e-mail adds an extra level of complexity of course, though most compliant mail clients will include the corresponding plain text as well. Different languages also cause issues, as clients which can parse "Paul wrote:" may fall over at "Pablo ha scritto:".
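As a rough starting point for the "classify lines by their initial characters" idea, here is a minimal sketch. The names LineKind and LineClassifier are made up for this example, and the rules are assumptions that only cover the conventions mentioned in this thread; it is not the statistical approach described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical categories for this sketch only.
public enum LineKind { NewText, Quoted, SignatureStart, Attribution }

public static class LineClassifier
{
    // Classify one line of a plain-text body by simple surface rules.
    public static LineKind Classify(string line)
    {
        string trimmed = line.TrimEnd('\r');
        // "-- " is the usual signature delimiter; the trailing space is often lost.
        if (trimmed == "-- " || trimmed == "--") return LineKind.SignatureStart;
        if (trimmed.TrimStart().StartsWith(">", StringComparison.Ordinal)) return LineKind.Quoted;
        // English-only attribution check, so "Pablo ha scritto:" is missed.
        if (trimmed.TrimEnd().EndsWith("wrote:", StringComparison.OrdinalIgnoreCase)) return LineKind.Attribution;
        return LineKind.NewText;
    }

    // Classify every line of a body, in order.
    public static List<LineKind> ClassifyAll(string body)
    {
        var kinds = new List<LineKind>();
        foreach (string line in body.Split('\n')) kinds.Add(Classify(line));
        return kinds;
    }
}
```

Calling LineClassifier.Classify("Pablo ha scritto:") returns NewText, which is exactly the language problem described above; a real classifier would need per-language attribution patterns or a trained model.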

Not really, no.
The original RFC for Internet Message talks about the in-reply-to header, but doesn't specify the format of the body.
As you've found, different clients add the original text in different ways, which implies there is no standard. Users will do things differently as well:
Plain text, "rich text" and HTML each have a different way of separating the reply from the original
In Outlook I can choose from the following options when replying to a message:
Do not include
Attach original message
Include original message text
Include and indent original message text
Prefix each line of the original message
On top of that, I often send and receive replies that state "Responses in-line" where my comments are intermingled with the original message, so the original message no longer exists in its original form anyway.

Some heuristics you can try are
-Any number of > characters
-Looking for "wrote: " (be very careful with this one)
Also you can try relating the Message-ID header with the In-Reply-To header
And finally, if you cannot find a good library to do this, it is time to start this project. No more parsing emails the Cthulhu way :)

Related

Is XSS possible through the MailAddress class?

Considering I parse user input, which is supposed to be an email address, into the MailAddress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the output even though I made sure that the address is a valid email address by parsing the string?
From what I could gather the RFC for mail addresses is pretty complicated, so I am unsure if cross site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith@contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.

Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to be perfect. It makes the attacker's job that much harder if they face an outgoing filter as well as an incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every nonvalid character in an email address (these shouldn't get through at this point but do this for comprehensiveness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a non-recursive filter would remove the inner occurrence of <script> and leave the outer pieces to join into an intact <script>.
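To make the difference concrete, here is a small sketch that compares a single-pass removal with one that repeats until nothing is left to remove. TagFilter and its method names are invented for this illustration, and it only handles one exact, case-sensitive token, so treat it as a demonstration of the pitfall and not as a complete sanitizer.

```csharp
using System;

public static class TagFilter
{
    // Single pass: removes every occurrence present in the input right now.
    public static string RemoveOnce(string input, string banned)
    {
        if (string.IsNullOrEmpty(banned)) return input;
        return input.Replace(banned, "");
    }

    // Repeats the removal until the banned token no longer appears,
    // so tokens that re-form after a removal are caught as well.
    public static string RemoveUntilStable(string input, string banned)
    {
        if (string.IsNullOrEmpty(banned)) return input;
        string current = input;
        while (current.Contains(banned))
            current = current.Replace(banned, "");
        return current;
    }
}
```

With the input <scr<script>ipt>, RemoveOnce returns <script> while RemoveUntilStable returns an empty string.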

Is it necessary to sanitize the output
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter - you'd get &lt; in the HTML source as a result and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.
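For code that builds the string itself instead of relying on a control's encoding mode, a minimal sketch could look like the following. WebUtility.HtmlEncode is part of System.Net; the SafeOutput class and MailLine method are names assumed for this example.

```csharp
using System.Net;

public static class SafeOutput
{
    // Encode at the point of output, whatever the input validation did earlier.
    public static string MailLine(string address)
    {
        return "Your mail address is " + WebUtility.HtmlEncode(address);
    }
}
```

Passing "Tom <tsmith@contoso.com>" yields "Your mail address is Tom &lt;tsmith@contoso.com&gt;", which the browser shows as literal angle brackets.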

Extracting data from text using templates

I'm building a web service which receives emails from a number of CRM-systems. Emails typically contain a text status e.g. "Received" or "Completed" as well as a free text comment.
The formats of the incoming emails differ, e.g. some systems call the status "Status: ZZZZZ" and some "Action: ZZZZZ". The free text sometimes appears before the status and sometimes after. Status codes will be mapped to my system's interpretation, and the comment is required too.
Moreover, I'd expect the formats to change over time, so a solution that is configurable, possibly by customers providing their own templates through a web interface, would be ideal.
The service is built using .NET C# MVC 3 but I'd be interested in general strategies as well as any specific libraries/tools/approaches.
I've never quite got my head around RegExp. I'll make a new effort in case it is indeed the way to go. :)

I would go with regex:
First example, if you had only Status: ZZZZZ-like messages:
// emailText holds the message body (the variable name is assumed here)
String status = Regex.Match(emailText, @"(?<=Status: ).*").Value;
// Explanation of "(?<=Status: ).*" :
// (?<=      Start of the positive look-behind group: the following text
//           is required but won't appear in the returned string
// Status:   The text defining the email string format
// )         End of the positive look-behind group
// .*        Matches every character up to the end of the line
Second example, if you had only Status: ZZZZZ and Action: ZZZZZ-like messages:
String status = Regex.Match(emailText, @"(?<=(Status|Action): ).*").Value;
// We added (Status|Action), which allows the positive look-behind text to be
// either 'Status: ' or 'Action: '
Now if you want to let users provide their own format, you could come up with something like:
String userEntry = GetUserEntry(); // Get the text submitted by the user
String userFormatText = Regex.Escape(userEntry);
String status = Regex.Match(emailText, @"(?<=" + userFormatText + ").*").Value;
That would allow users to submit their own format, like Status:, or Action:, or This is my friggin format, now please read the status -->...
The Regex.Escape(userEntry) part is important to ensure that the user doesn't break your regex by submitting special characters like \, ?, *...
To know whether the status value comes before or after the format text, you have several solutions:
You could ask the user where the status value is, and then build your regex accordingly:
if (statusValueIsAfter) {
    // Example: "Status: Closed"
    regexPattern = @"(?<=Status: ).*";
} else {
    // Example: "Closed:Status"
    regexPattern = @".*(?=:Status)"; // We use here a positive look-AHEAD
}
Or you could be smarter and introduce a system of tags for the user entry. For instance, the user submits Status: <value> or <value>=The status and you build the regex by replacing the tag string.
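The tag idea could be sketched roughly like this. StatusTemplate, TryExtract and the single <value> placeholder are assumptions made for the example; the sketch escapes the literal text on both sides of the placeholder and puts a named capture group in its place.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StatusTemplate
{
    private const string Placeholder = "<value>";

    // Turn a user template such as "Status: <value>" into a regex.
    public static Regex Compile(string template)
    {
        int at = template.IndexOf(Placeholder, StringComparison.Ordinal);
        if (at < 0) throw new ArgumentException("Template must contain " + Placeholder);
        // Escape the user's literal text so characters like ? or * stay literal.
        string before = Regex.Escape(template.Substring(0, at));
        string after = Regex.Escape(template.Substring(at + Placeholder.Length));
        // The value is whatever sits between the two literals on one line.
        return new Regex(before + "(?<value>.+?)" + after + @"\s*$", RegexOptions.Multiline);
    }

    // Returns true and the trimmed value when the template matches somewhere in the text.
    public static bool TryExtract(string template, string emailText, out string value)
    {
        Match m = Compile(template).Match(emailText);
        value = m.Success ? m.Groups["value"].Value.Trim() : "";
        return m.Success;
    }
}
```

For example, TryExtract("Status: <value>", body, out var status) picks up "Closed" from a line reading "Status: Closed", and the template "<value>=The status" does the same for "Closed=The status". Customers could then store one template per CRM system without ever writing a regex themselves.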

Safely insert line breaks into HTML

I have an application that allows a user to copy and paste HTML into a form. This HTML gets sent as an email, and the email server will not allow more than 1000 characters per line. So, I'd like to insert line breaks (\r\n) into the HTML after the user has hit submit. How can I do this without changing the content?
My idea is this:
html.replace('<', '\r\n<');
But is that guaranteed to not change the result? Is '<' not allowed in attributes?
Edit: I'm actually thinking this will not work because the html could have a script block with something like if(x < 3). I guess what I need is an html pretty printer that works in either js or C#.

If you Base64 encode the content, then you can break up the content into however many lines you want.
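A minimal sketch of that idea in C#, assuming UTF-8 content and that the message part is sent with a Content-Transfer-Encoding: base64 header (the Base64Body class name is made up here). Convert.ToBase64String with Base64FormattingOptions.InsertLineBreaks breaks the output every 76 characters, well under the 1000 character limit.

```csharp
using System;
using System.Text;

public static class Base64Body
{
    // Encode the HTML as Base64 with a line break after every 76 characters.
    public static string Encode(string html)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(html);
        return Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
    }

    // Decoding ignores the inserted line breaks, so the content is unchanged.
    public static string Decode(string encoded)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
    }
}
```

Because the line breaks live in the encoded form only, the decoded HTML is byte-for-byte what the user pasted, script blocks included.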

Email MIME standard uses transfer encoding techniques to solve this problem. Ideally you would be using a mail library that takes care of this for you, so you can insert lines of any length.
Using the System.Net.Mail.MailMessage class in C#, you should be able to construct a normal message and it will transfer-encode it for you. If that doesn't work, you can also construct a multi-part message with a single System.Net.Mail.AlternateView and set the transfer encoding explicitly.
Here is a sample I am currently using (note it has a character encoding bug, so your body text must be a unicode string):
private void Send(string body, bool isHtml, string subject, string recipientAddress, string recipientName, string fromAddress)
{
    using (var message = new MailMessage(new MailAddress(fromAddress),
        new MailAddress(recipientAddress, recipientName)))
    {
        message.Subject = subject;
        var alternateView = AlternateView.CreateAlternateViewFromString(body, message.BodyEncoding,
            isHtml ? "text/html" : "text/plain");
        // Quoted-printable keeps the transmitted lines short regardless of the body's line lengths.
        alternateView.TransferEncoding = TransferEncoding.QuotedPrintable;
        message.AlternateViews.Add(alternateView);
        var client = new SmtpClient();
        client.Send(message);
    }
}

You're getting into dangerous territory attempting to parse HTML with a replace function. The easiest method would be to just display a warning box on the form that tells the user that lines cannot be longer than 1000 characters, and return an error message if they attempt to submit content with lines over that length.
Otherwise, you could insert a linebreak after X number of characters, and insert some special markup (like <!--AUTO-LINEBREAK-->, or similar) that informs whoever is receiving the e-mail that an automatic line break was inserted.

Add normal line breaks where you think they should be. For example:
Off the top of my head, find all <p>, <table>, <tr>,<td>,<br>, and <div> tags and add a \r\n right before them.
Once that is done, loop through all the lines one more time. If there are any that are still 1000+ characters long, I would insert a \r\n in the whitespace.
Also, you should be removing any script tags from the HTML email body. Having script tags can cause all types of problems (marked as spam, marked as a virus, blocked, etc..).

I am not sure how you are delivering your email... if it is handed off to a php script that then send it to a mail server or uses the mail() method, then this link might help.
http://php.net/manual/en/function.wordwrap.php
If not, can you clarify your question a bit?
Another simple thought is that you could use:
html.replace('','\r\n');
or:
html.replace('',''+String.fromCharCode(13));//inserts a carriage return
However, since the will ideally be parsed in the browser, inserting "\r\n" may not be effective and may actually just display as "\r\n"....
Hope any of this is helpful.

Is it possible to programmatically 'clean' emails?

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may also be present, so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?

In email there are a couple of agreed-upon markings for the things you wish to strip. You can look for these lines using regular expressions. I doubt you can "sanitize" your emails really well, but here are some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.

A few obvious things to look at:
If the mail is anything but pure plain text, the message will be multi-part MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood any part whose type is not "text/*" can go.
An HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do an HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
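The last two points can be sketched for a plain-text body as follows. PlainTextCleaner is a name invented for this example, and it assumes top-posted replies; it drops quoted lines, undoes the ">From " escaping, and stops at the signature delimiter with or without its trailing space.

```csharp
using System;
using System.Collections.Generic;

public static class PlainTextCleaner
{
    public static string Clean(string body)
    {
        var kept = new List<string>();
        foreach (string line in body.Replace("\r\n", "\n").Split('\n'))
        {
            // Signature delimiter: everything from here on is dropped.
            if (line == "-- " || line == "--") break;
            // ">From " is an escaped "From " line, not a quote: keep it without the ">".
            if (line.StartsWith(">From ", StringComparison.Ordinal))
            {
                kept.Add(line.Substring(1));
                continue;
            }
            // Any other line starting with ">" is treated as quoted text.
            if (line.StartsWith(">", StringComparison.Ordinal)) continue;
            kept.Add(line);
        }
        return string.Join("\n", kept).Trim();
    }
}
```

Note that the ">From" rule cannot tell an escaped line from a quoted line that happens to begin with "From ", so this heuristic will occasionally keep a quoted line.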

Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
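The "library of disclaimers" first cut might look something like this in C#. DisclaimerStripper is a hypothetical name, and the sketch assumes the disclaimers appear verbatim (ignoring case) at the very end of the message.

```csharp
using System;
using System.Collections.Generic;

public static class DisclaimerStripper
{
    // Remove any known disclaimer found at the end of the body,
    // repeating in case several are stacked on top of each other.
    public static string Strip(string body, IEnumerable<string> knownDisclaimers)
    {
        string result = body.TrimEnd();
        bool removed = true;
        while (removed)
        {
            removed = false;
            foreach (string disclaimer in knownDisclaimers)
            {
                string d = disclaimer.Trim();
                if (d.Length > 0 && result.EndsWith(d, StringComparison.OrdinalIgnoreCase))
                {
                    result = result.Substring(0, result.Length - d.Length).TrimEnd();
                    removed = true;
                }
            }
        }
        return result;
    }
}
```

Exact matching is brittle when a mail client rewraps the disclaimer, so normalizing whitespace on both sides before comparing would be a natural next step.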
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(

Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."

If you are creating your own application, I'd look into Regex to find text and replace it. To make the application a little nicer, I'd create a class called Email with a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!

SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
    FromEmailAddress = "john.smith@example.com",
    FromName = "John Smith",
    TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);

Parse email content from quoted reply

I'm trying to figure out how to parse out the text of an email from any quoted reply text that it might include. I've noticed that usually email clients will put an "On such and such date so and so wrote" or prefix the lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have any idea on how to programmatically detect reply text? I am using C# to write this parser.

I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:
When you have the thread:
If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To ID, and Thread-Index to determine the individual message, its parent, and the thread it belongs to. For more information on this, see RFC822, RFC2822, this interesting article on threading, or this article on threading. Once you have re-assembled the thread, you can then remove the external text (such as To, From, CC, etc... lines) and you're done.
If the messages you are working with do not have the headers, you can instead use similarity matching to determine which parts of an email repeat earlier text and are therefore the quoted reply. In this case you might want to look into a Levenshtein Distance algorithm such as this one on Code Project or this one.
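For reference, a compact Levenshtein distance in C# could look like the sketch below. This is the standard two-row dynamic programming version, not the code from either linked article, and the Similarity helper with its 0 to 1 scale is an assumption added for convenience.

```csharp
using System;

public static class TextSimilarity
{
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap the rows so the row just filled becomes the previous row.
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // 1.0 means identical, 0.0 means nothing in common (by edit distance).
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

One way to use it is to compare each line of the reply against the lines of the earlier message and treat lines above some similarity threshold as quoted; the threshold has to be tuned on real mail.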
No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads.
When you don't have the thread:
If you are stuck with only one message from the thread, you're going to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:
a line (as seen in outlook).
Angle Brackets
"---Original Message---"
"On such-and-such day, so-and-so wrote:"
Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!

First of all, this is a tricky task.
You should collect typical responses from different e-mail clients and prepare correct regular expressions (or whatever) to parse them. I've collected responses from outlook, thunderbird, Gmail, Apple mail, and mail.ru.
I am using regular expressions to parse responses in the following manner: if an expression did not match, I try to use the next one.
new Regex("From:\\s*" + Regex.Escape(_mail), RegexOptions.IgnoreCase);
new Regex("<" + Regex.Escape(_mail) + ">", RegexOptions.IgnoreCase);
new Regex(Regex.Escape(_mail) + "\\s+wrote:", RegexOptions.IgnoreCase);
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline);
new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase);
new Regex("from:\\s*$", RegexOptions.IgnoreCase);
To remove quotation in the end:
new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Here is my small collection of test responses (samples divided by --- ):
From: test@test.com [mailto:test@test.com]
Sent: Tuesday, January 13, 2009 1:27 PM
----
2008/12/26 <test@test.com>
> text
----
test@test.com wrote:
> text
----
test@test.com wrote: text
text
----
2009/1/13 <test@test.com>
> text
----
test@test.com wrote: text
text
----
2009/1/13 <test@test.com>
> text
> text
----
2009/1/13 <test@test.com>
> text
> text
----
test@test.com wrote:
> text
> text
<response here>
----
--- On Fri, 23/1/09, test@test.com <test@test.com> wrote:
> text
> text

Thank you, Goleg, for the regexes! Really helped. This isn't C#, but for the googlers out there, here's my Ruby parsing script:
def extract_reply(text, address)
  # "\\s" is used because "\s" inside a double-quoted Ruby string is just a space
  regex_arr = [
    Regexp.new("From:\\s*" + Regexp.escape(address), Regexp::IGNORECASE),
    Regexp.new("<" + Regexp.escape(address) + ">", Regexp::IGNORECASE),
    Regexp.new(Regexp.escape(address) + "\\s+wrote:", Regexp::IGNORECASE),
    Regexp.new("^.*On.*(\n)?wrote:$", Regexp::IGNORECASE),
    Regexp.new("-+original\\s+message-+\\s*$", Regexp::IGNORECASE),
    Regexp.new("from:\\s*$", Regexp::IGNORECASE)
  ]
  text_length = text.length
  # calculates the matching regex closest to top of page
  index = regex_arr.inject(text_length) do |min, regex|
    [(text.index(regex) || text_length), min].min
  end
  text[0, index].strip
end
It's worked pretty well so far.

By far the easiest way to do this is by placing a marker in your content, such as:
--- Please reply above this line ---
As you have no doubt noticed, parsing out quoted text is not a trivial task as different email clients quote text in different ways. To solve this problem properly you need to account for and test in every email client.
Facebook can do this, but unless your project has a big budget, you probably can't.
Oleg has solved the problem using regexes to find the "On 13 Jul 2012, at 13:09, xxx wrote:" text. However, if the user deletes this text, or replies at the bottom of the email, as many people do, this solution will not work.
Likewise if the email client uses a different date string, or doesn't include a date string the regex will fail.
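If you go the marker route, the extraction itself is short. Here is a sketch, assuming the marker text shown above and a made-up ReplyMarker class; it cuts at the start of the line containing the marker so that quote prefixes such as "> " in front of it are dropped too.

```csharp
using System;

public static class ReplyMarker
{
    public const string Marker = "--- Please reply above this line ---";

    // Return the text the user typed above the marker line.
    public static string TextAboveMarker(string body)
    {
        int at = body.IndexOf(Marker, StringComparison.Ordinal);
        if (at < 0) return body.Trim(); // marker deleted or never present

        // Back up to the start of the marker's line ("> --- Please reply ...").
        int lineStart = body.LastIndexOf('\n', at) + 1;
        return body.Substring(0, lineStart).Trim();
    }
}
```

An attribution line such as "On ... wrote:" that the client inserts above the quoted marker will still be left in the result, so this is usually combined with one of the regexes from the other answers.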

There is no universal indicator of a reply in an e-mail. The best you can do is try to catch the most common and parse new patterns as you come across them.
Keep in mind that some people insert replies inside the quoted text (My boss for example answers questions on the same line as I asked them) so whatever you do, you might lose some information you would have liked to keep.

Here is my C# version of @hurshagrawal's Ruby code. I don't know Ruby really well so it could be off, but I think I got it right.
public string ExtractReply(string text, string address)
{
    var regexes = new List<Regex>() {
        new Regex("From:\\s*" + Regex.Escape(address), RegexOptions.IgnoreCase),
        new Regex("<" + Regex.Escape(address) + ">", RegexOptions.IgnoreCase),
        new Regex(Regex.Escape(address) + "\\s+wrote:", RegexOptions.IgnoreCase),
        new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline),
        new Regex("-+original\\s+message-+\\s*$", RegexOptions.IgnoreCase),
        new Regex("from:\\s*$", RegexOptions.IgnoreCase),
        new Regex("^>.*$", RegexOptions.IgnoreCase | RegexOptions.Multiline)
    };
    // Find the earliest match of any pattern and keep only the text above it.
    var index = text.Length;
    foreach (var regex in regexes)
    {
        var match = regex.Match(text);
        if (match.Success && match.Index < index)
            index = match.Index;
    }
    return text.Substring(0, index).Trim();
}

If you control the original message (e.g. notifications from a web application), you can put a distinct, identifiable header in place, and use that as the delimiter for the original post.

It should be fairly easy these days, provided the text/html content type works for you (with Outlook being an exception; see details below). Here is a table with the real testing results of parsing options in various desktop email clients:
| Mail client | Reply message format | HTML can be parsed easily and reliably | HTML tags to be deleted | Plain text quote marker |
|---|---|---|---|---|
| web.de | always HTML | yes | <div name="quote"> | - (always HTML) |
| Thunderbird | same as in the original message | yes | <div class="moz-cite-prefix">, <blockquote type="cite"> | "On 26.10.2022 12:37, John Doe wrote:" |
| Gmail | both | yes | <div class="gmail_quote"> | "On Thu, Oct 27, 2022 at 1:39 PM John Doe <john@inbox.test> wrote:" |
| Outlook 2016, 2019 | same as in the original message | Probably impossible due to use of some weird Word processor | unknown | Plain text-only message: "-----Original Message-----"; multipart: 3 blank lines with some space followed by "From: John Doe <john@inbox.test>" |
| Apple | unknown | yes | <blockquote type="cite"> | "> On 22. Dec 2021, at 12:50, John Doe <john@inbox.test> wrote:" |

This is a good solution. Found it after searching for so long.
One addition: as mentioned above, this works case by case, and the expressions above did not correctly parse my Gmail and Outlook (2010) responses, so I added the following two regexes. Let me know if there are any issues.
//Works for Gmail
new Regex("\\n.*On.*<(\\r\\n)?" + Regex.Escape(address) + "(\\r\\n)?>", RegexOptions.IgnoreCase),
//Works for Outlook 2010
new Regex("From:.*" + Regex.Escape(address), RegexOptions.IgnoreCase),
Cheers

This is an old post, but in case you are not aware, GitHub has a Ruby library for extracting the reply. If you use .NET, I have a .NET one at https://github.com/EricJWHuang/EmailReplyParser

If you use SigParser.com's API, it will give you an array of all the broken out emails in a reply chain from a single email text string. So if there are 10 emails, you'll get the text for all 10 of the emails.
You can view the detailed API spec here.
https://api.sigparser.com/
