I have an application that allows a user to copy and paste HTML into a form. This HTML gets sent as an email, and the email server will not allow more than 1000 characters per line. So I'd like to insert line breaks (\r\n) into the HTML after the user has hit submit. How can I do this without changing the content?
My idea is this:
html.replace('<', '\r\n<');
But is that guaranteed to not change the result? Is '<' not allowed in attributes?
Edit: I'm actually thinking this will not work, because the HTML could have a script block with something like if(x < 3). I guess what I need is an HTML pretty printer that works in either JS or C#.
If you Base64 encode the content, then you can break up the content into however many lines you want.
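As a rough illustration of the Base64 idea, here is a minimal C# sketch. The class and method names are made up for this example; Convert.ToBase64String with Base64FormattingOptions.InsertLineBreaks is the standard .NET call that breaks the output every 76 characters. It assumes the message part is then sent with a base64 Content-Transfer-Encoding header so the mail client decodes it again.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class Base64Body
{
    // Encodes the HTML as UTF-8 bytes, then as Base64 with a CRLF inserted
    // after every 76 characters, so no line comes near the 1000-character limit.
    public static string Encode(string html)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(html);
        return Convert.ToBase64String(bytes, Base64FormattingOptions.InsertLineBreaks);
    }

    // Reverses Encode; Convert.FromBase64String ignores the inserted line breaks.
    public static string Decode(string encoded)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
    }
}
```

Because the original bytes are restored exactly on decoding, the HTML content itself (including any script blocks) is not touched.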
Email MIME standard uses transfer encoding techniques to solve this problem. Ideally you would be using a mail library that takes care of this for you, so you can insert lines of any length.
Using the System.Net.Mail.MailMessage class in C#, you should be able to construct a normal message and it will transfer-encode it for you. If that doesn't work, you can also construct a multi-part message with a single System.Net.Mail.AlternateView and set the transfer-encoding explicitly.
Here is a sample I am currently using (note it has a character encoding bug, so your body text must be a unicode string):
private void Send(string body, bool isHtml, string subject, string recipientAddress, string recipientName, string fromAddress)
{
    using (var message = new MailMessage(new MailAddress(fromAddress),
        new MailAddress(recipientAddress, recipientName)))
    {
        message.Subject = subject;
        var alternateView = AlternateView.CreateAlternateViewFromString(body, message.BodyEncoding,
            isHtml ? "text/html" : "text/plain");
        alternateView.TransferEncoding = TransferEncoding.QuotedPrintable;
        message.AlternateViews.Add(alternateView);
        var client = new SmtpClient();
        client.Send(message);
    }
}
You're getting into dangerous territory attempting to parse HTML with a replace function. The easiest method would be to just display a warning box on the form that tells the user that lines cannot be longer than 1000 characters, and return an error message if they attempt to submit content with lines over that length.
Otherwise, you could insert a linebreak after X number of characters, and insert some special markup (like <!--AUTO-LINEBREAK-->, or similar) that informs whoever is receiving the e-mail that an automatic line break was inserted.
Add normal line breaks where you think they should be. For example:
Off the top of my head, find all <p>, <table>, <tr>, <td>, <br>, and <div> tags and add a \r\n right before them.
Once that is done, loop through all the lines one more time. If there are any that are still 1000+ characters long, I would insert a \r\n in the whitespace.
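The two passes described above could be sketched roughly like this in C#. The class name, method name and maxLength parameter are invented for this example. It is a naive regex-based sketch, not an HTML parser: it assumes that turning a space into a line break is harmless, which is not true inside <pre> blocks, scripts or attribute values, and it leaves a line untouched if it contains no space to break at.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HtmlLineBreaker
{
    // Matches the opening of the block-level tags listed above
    // (\b keeps "<p" from matching "<pre").
    private static readonly Regex BlockTag =
        new Regex(@"<(p|table|tr|td|br|div)\b", RegexOptions.IgnoreCase);

    public static string InsertBreaks(string html, int maxLength)
    {
        // Pass 1: put a CRLF in front of every block-level tag.
        string withBreaks = BlockTag.Replace(html, m => "\r\n" + m.Value);

        // Pass 2: split any line that is still too long at a space.
        var output = new List<string>();
        foreach (string line in withBreaks.Split(new[] { "\r\n" }, StringSplitOptions.None))
        {
            string rest = line;
            while (rest.Length > maxLength)
            {
                int cut = rest.LastIndexOf(' ', maxLength - 1);
                if (cut <= 0) break; // no usable space; the line stays too long
                output.Add(rest.Substring(0, cut));
                rest = rest.Substring(cut + 1);
            }
            output.Add(rest);
        }
        return string.Join("\r\n", output);
    }
}
```

A caller would still want to check the result for lines over the limit, since the second pass gives up when it finds no whitespace.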
Also, you should be removing any script tags from the HTML email body. Having script tags can cause all types of problems (marked as spam, marked as a virus, blocked, etc.).
I am not sure how you are delivering your email... if it is handed off to a PHP script that then sends it to a mail server or uses the mail() method, then this link might help.
http://php.net/manual/en/function.wordwrap.php
If not, can you clarify your question a bit?
Another simple thought is that you could use:
html.replace('','\r\n');
or:
html.replace('',''+String.fromCharCode(13));//inserts a carriage return
However, since the HTML will ideally be parsed in the browser, inserting "\r\n" may not be effective and may actually just display as "\r\n"...
Hope any of this is helpful.
I'm using MimeKit to receive and send mail in my project. I forward received mails with some modifications (to the To and From parts). Now I also need to modify the body: I want to replace a specific word with asterisk characters. The specific text is different for every mail, and the mail may be in any format. You can see I found what I want, but I don't know how to replace it without causing an error.
MimeMessage.Body is a tree structure, like MIME, so you'll have to navigate to the MimePart that contains the content that you want to modify.
In this case, since you want to modify a text/* MimePart, it will actually be a subclass of MimePart called TextPart which is what has the .Text property (which is writable).
I've written documentation on how to traverse the MIME structure of a message to find the part that you are looking for here: http://www.mimekit.org/docs/html/WorkingWithMessages.htm
A very simple solution might be:
var part = message.BodyParts.OfType<TextPart> ().FirstOrDefault ();
// FirstOrDefault returns null when the message has no text/* part
if (part != null)
    part.Text = part.Text.Replace ("x", "y");
But keep in mind that that logic assumes that the first text/* part you find is the one you are looking for.
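For the asterisk replacement itself, a small sketch using only System.Text.RegularExpressions might look like this. WordMasker and Mask are names invented for this example; the returned string is what you would assign back to the part's Text property. Matching is case-insensitive here, which is an assumption, and for text/html parts a naive replace could also hit tag or attribute names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of MimeKit.
public static class WordMasker
{
    // Replaces every occurrence of 'word' with the same number of asterisks.
    public static string Mask(string text, string word)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(word)) return text;

        // Regex.Escape makes sure the word is treated literally, not as a pattern.
        return Regex.Replace(text, Regex.Escape(word),
            m => new string('*', m.Length), RegexOptions.IgnoreCase);
    }
}
```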
This is my code:
MailMessage m = new MailMessage(from, to, subject, body);
SmtpClient s = new SmtpClient("...");
s.Send(m);
Only subject and body are user input.
The code's fine, but it depends on where "subject" and "body" are coming from. If (as you note) they are user-supplied, make sure you encode them (HttpServerUtility.HtmlEncode).
You should validate subject and body then.
I am attempting to write up a test project to investigate this real fast, but from my look at Reflector and my reading of the documentation, the Subject and Body are strictly treated as System.Strings - this is to the point where you are welcome to explicitly set the encoding on the strings if you want (MailMessage.BodyEncoding).
Unless there is a major bug in how this class is put together, there should be no greater chance of code injection than there would be with any other string; especially if you explicitly set the BodyEncoding to be some manner of plain text, like UTF-8.
EDIT: Alternately, if you really really want to make sure that HTML isn't a part of the body you could use the regex
@"<[^>]*>"
to naively strip out anything inside a pair of angle brackets, either by calling Regex.Replace with string.Empty as the replacement, or by calling Regex.Match and returning an error when a match is found.
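Both variants of that regex approach can be sketched as follows. TagFilter and its method names are made up for this example; the pattern is the one quoted above, and it is just as naive here as described (it knows nothing about comments, scripts or malformed markup).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the two options.
public static class TagFilter
{
    private const string TagPattern = @"<[^>]*>";

    // Option 1: silently remove anything that looks like a tag.
    public static string StripTags(string input)
    {
        return Regex.Replace(input, TagPattern, string.Empty);
    }

    // Option 2: only detect tags, so the caller can reject the input.
    public static bool ContainsTags(string input)
    {
        return Regex.IsMatch(input, TagPattern);
    }
}
```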
From the link @DavidHall posted of Is .NET MailMessage class injection-safe?, @Slaks mentions that to/from is validated, but your body content is not. So you would need to validate the subject and body.
Make sure you encode the subject and body before sending it to the user; that should be enough to handle in most scenarios.
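A minimal sketch of the encoding step, assuming the body is sent as HTML: System.Net.WebUtility.HtmlEncode does the same job as HttpServerUtility.HtmlEncode without a dependency on System.Web. The wrapper class and method name are invented for this example.

```csharp
using System.Net;

// Hypothetical wrapper around the framework's HTML encoder.
public static class MailInputEncoder
{
    // Turns characters such as <, > and & into HTML entities so that
    // user-supplied text is displayed literally instead of being parsed as markup.
    public static string EncodeForHtmlBody(string userInput)
    {
        return WebUtility.HtmlEncode(userInput ?? string.Empty);
    }
}
```

If the message is sent as plain text (IsBodyHtml = false), this encoding is not needed and would show up literally as entities.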
I'm attempting to send emails from my C# app using the System.Net.Mail.MailMessage class. For the body of the email, I want to include line breaks to create separate blocks of text. How can I get an email client, like Outlook, to respect the line breaks when displaying the email as HTML to users? What character(s) do I insert in the body of the email text so that the line breaks are treated as line breaks?
Note: The body of my email is pure text, not HTML.
Line breaks are white space characters, and all white space characters are interpreted as spaces in HTML.
You can use break tags (<br/>) to add line breaks in the HTML code, and you can use paragraph tags (<p>...</p>) around paragraphs to get a distance between them.
You can just use Environment.NewLine property. That will insert a newline string defined for the current environment.
You need to insert HTML line-breaks i.e. <br/>
Using the paragraph tag is the best way, in my opinion, because it keeps the format clean and easy to maintain.
I hope this helps.
<p>Dear Mike</p>
<p>This is a sample message</p>
<p><b>Thank You,</b></p>
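If the body starts out as plain text, the <br/> and <p> suggestions can be combined in a small conversion step. This is only a sketch: PlainTextToHtml and ToHtml are hypothetical names, and it assumes that a blank line separates paragraphs and that the text should be HTML-encoded first.

```csharp
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of System.Net.Mail.
public static class PlainTextToHtml
{
    public static string ToHtml(string plainText)
    {
        // Normalize all line endings to \n so the splitting below is simple.
        string normalized = plainText.Replace("\r\n", "\n").Replace('\r', '\n');

        var html = new StringBuilder();
        // One or more blank lines separate paragraphs.
        foreach (string paragraph in Regex.Split(normalized, "\n{2,}"))
        {
            string trimmed = paragraph.Trim('\n');
            if (trimmed.Length == 0) continue;

            // Encode first, then turn the remaining single line breaks into <br/>.
            string encoded = WebUtility.HtmlEncode(trimmed);
            html.Append("<p>").Append(encoded.Replace("\n", "<br/>")).Append("</p>");
        }
        return html.ToString();
    }
}
```

The result would be assigned to the message body with IsBodyHtml = true.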
If you want to send the email as HTML, use the break tag as others have described. If you want line breaks to work, you'll need to send the message as plain text, which can be specified by setting IsBodyHtml = false.
If your existing line breaks aren't being used, try forcing with "\n" which is an escape sequence for a new line.
email.Body = new MessageBody(BodyType.Text, stringThatContainsMessage);
At least when using Microsoft.Exchange.WebServices.Data, this will solve the problem.
You need to use the overloaded method and the enum BodyType.Text.
I am creating a web-based email client using C# and ASP.NET.
What is confusing is that various email clients seem to add the original text in a lot of different ways when replying by email.
What I was wondering is whether there is some sort of standardized way to disambiguate this process.
Thank you
-Theo
I was thinking:
public String cleanMsgBody(String oBody, out Boolean isReply)
{
    isReply = false;
    Regex rx1 = new Regex("\n-----");
    Regex rx2 = new Regex("\n([^\n]+):([ \t\r\n\v\f]+)>");
    Regex rx3 = new Regex("([0-9]+)/([0-9]+)/([0-9]+)([^\n]+)<([^\n]+)>");
    String txtBody = oBody;
    // Collapse blank lines, spaces after a newline, and runs of spaces.
    while (txtBody.Contains("\n\n")) txtBody = txtBody.Replace("\n\n", "\n");
    while (new Regex("\n ").IsMatch(txtBody)) txtBody = (new Regex("\n ")).Replace(txtBody, "\n");
    while (txtBody.Contains("  ")) txtBody = txtBody.Replace("  ", " ");
    // Keep only the text before the first marker that looks like quoted content.
    if (isReply = (isReply || rx1.IsMatch(txtBody)))
        txtBody = rx1.Split(txtBody)[0]; // Maybe a loop through would be better
    if (isReply = (isReply || rx2.IsMatch(txtBody)))
        txtBody = rx2.Split(txtBody)[0]; // Maybe a loop through would be better
    if (isReply = (isReply || rx3.IsMatch(txtBody)))
        txtBody = rx3.Split(txtBody)[0]; // Maybe a loop through would be better
    return txtBody;
}
There isn't a standardized way, but a sensible heuristic will get you a good distance.
Some algorithms classify lines based on their initial character(s) and by comparing the text to a corpus of marked up text, yielding a statistical probability for each line that it is a) part of the same block as the next/previous one and b) quoted text, a signature, new text, etc.
It'd be worth trying out some of the most popular e-mail clients and creating and comparing some sample messages to see what the differences are. Usenet newsgroups may also help you build a reasonable corpus of messages to work from. HTML e-mail adds an extra level of complexity of course, though most compliant mail clients will include the corresponding plain text as well. Different languages also cause issues, as clients which can parse "Paul wrote:" may fall over at "Pablo ha scritto:".
Not really, no.
The original RFC for Internet messages talks about the In-Reply-To header, but doesn't specify the format of the body.
As you've found, different clients add the original text in different ways, implying there's not a standard, coupled with the fact that users will do things differently as well:
Plain text, "rich text", HTML will all have a different way of separating the reply from the original
In Outlook I can choose from the following options when replying to a message:
Do not include
Attach original message
Include original message text
Include and indent original message text
Prefix each line of the original message
On top of that, I often send and receive replies that state "Responses in-line" where my comments are intermingled with the original message, so the original message no longer exists in its original form anyway.
Some heuristics you can try are
-Any number of > characters
-Looking for "wrote: " (be very careful with this one)
Also, you can try matching the Message-ID header of one message against the In-Reply-To header of another.
And finally, if you cannot find a good library to do this, it is time to start this project. No more parsing emails the Cthulhu way :)
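The heuristics listed above might be sketched like this. ReplyHeuristics and its methods are hypothetical names; the "wrote:" check only covers English attribution lines at the end of a line, and a line starting with ">" is treated as quoted text, so expect false positives.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; a heuristic, not a standard.
public static class ReplyHeuristics
{
    // A line that starts with one or more '>' characters.
    private static readonly Regex QuotedLine = new Regex(@"^\s*>+");
    // An attribution line such as "On Monday, Paul wrote:".
    private static readonly Regex Attribution =
        new Regex(@"\bwrote:\s*$", RegexOptions.IgnoreCase);

    public static bool BodyLooksLikeReply(string body)
    {
        foreach (string line in body.Replace("\r\n", "\n").Split('\n'))
        {
            if (QuotedLine.IsMatch(line) || Attribution.IsMatch(line)) return true;
        }
        return false;
    }

    // Compares the In-Reply-To header of one message with the
    // Message-ID header of another (both passed in as raw header values).
    public static bool IsReplyTo(string inReplyToHeader, string messageIdHeader)
    {
        if (string.IsNullOrWhiteSpace(inReplyToHeader) || string.IsNullOrWhiteSpace(messageIdHeader))
            return false;
        return string.Equals(inReplyToHeader.Trim(), messageIdHeader.Trim(), StringComparison.Ordinal);
    }
}
```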
Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may also be present, so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there are a couple of agreed-upon markers for the things you wish to strip. You can look for these lines using regular expressions. I doubt you can really "sanitize" your emails well, but here are some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
A few obvious things to look at:
If the mail is anything but pure plain text, the message will be multi-part MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood any part whose type is not "text/*" can go.
An HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do an HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
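The quoting and signature rules above could be applied to a plain-text body roughly as follows. QuoteAndSignatureStripper is a made-up name; the sketch treats both "-- " and "--" as the signature delimiter (to cover the missing trailing space) and restores ">From " lines instead of dropping them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; works on an already decoded text/plain body.
public static class QuoteAndSignatureStripper
{
    public static string Clean(string body)
    {
        var kept = new List<string>();
        foreach (string line in body.Replace("\r\n", "\n").Split('\n'))
        {
            // Everything after the signature delimiter is dropped.
            if (line == "-- " || line == "--") break;

            if (line.StartsWith(">From ", StringComparison.Ordinal))
            {
                // Escaped "From " line: keep it, minus the inserted '>'.
                kept.Add(line.Substring(1));
                continue;
            }

            // Any other line starting with '>' is quoted text.
            if (line.StartsWith(">", StringComparison.Ordinal)) continue;

            kept.Add(line);
        }
        return string.Join("\r\n", kept).TrimEnd();
    }
}
```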
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
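The "library of disclaimers" first cut might look something like this in C#. DisclaimerStripper and the way the library is passed in are assumptions for this sketch; it only removes a disclaimer that appears verbatim (ignoring case) at the very end of the message.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the disclaimer library is supplied by the caller.
public static class DisclaimerStripper
{
    public static string Strip(string body, IEnumerable<string> knownDisclaimers)
    {
        string result = body.TrimEnd();
        foreach (string disclaimer in knownDisclaimers)
        {
            string text = disclaimer.Trim();
            // Only strip when the message ends with this exact disclaimer text.
            if (text.Length > 0 && result.EndsWith(text, StringComparison.OrdinalIgnoreCase))
            {
                result = result.Substring(0, result.Length - text.Length).TrimEnd();
            }
        }
        return result;
    }
}
```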
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you are creating your own application, I'd look into Regex to find text and replace it. To make the application a little nicer, I'd create a class called Email with a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
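One possible shape for such a class, as a sketch only: the property names follow the suggestion above (with Raw in place of RAW), and the single regex that drops ">"-quoted lines is just a placeholder for whatever cleaning rules you end up with. Keeping the raw text alongside the stripped text means the filtering is never destructive.

```csharp
using System.Text.RegularExpressions;

// Hypothetical class; the cleaning rule is a stand-in.
public class Email
{
    // A whole line that starts with '>', including its line ending.
    private static readonly Regex QuotedLine =
        new Regex(@"^>.*(\r?\n|$)", RegexOptions.Multiline);

    public Email(string raw)
    {
        Raw = raw ?? string.Empty;
        Stripped = QuotedLine.Replace(Raw, string.Empty);
    }

    // The message text exactly as received.
    public string Raw { get; private set; }

    // The message text with quoted lines removed.
    public string Stripped { get; private set; }
}
```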
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
    FromEmailAddress = "john.smith@example.com",
    FromName = "John Smith",
    TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);