MailMessage from MailBee conversion from HTML to TEXT adds extra space - c#

I am using MailBee to convert HTML to text, but it adds an extra space at the beginning of each line except the first one.
For example, I have this HTML:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">
</head>
<body>
<div style=\"font-size:13px;font-family:Arial;\"><br></div>
<div style=\"font-size:13px;font-family:Arial;\">test</div>
<div style=\"font-size:13px;font-family:Arial;\">test2</div>
<div style=\"font-size:13px;font-family:Arial;\">test3</div>
<div style=\"font-size:13px;font-family:Arial;\">test</div>
</body>
</html>
(The html is in one line. I have changed it to multi line just for readability.)
When I use this code to get text
MailMessage message = new MailMessage
{
BodyHtmlText = Html
};
message.MakePlainBodyFromHtmlBody();
return message.BodyPlainText;
I get this result
\r\ntest \r\n test2 \r\n test3 \r\n test \r\n
As you can see, before test2, test3 and test, there is an extra space added.
Is this a bug or am I doing something wrong?
Can someone help me?
Thanks

I suggest you use a simple regex to remove spaces at the start or end of each line.
The regex:
^[ ]*|[ ]*$
It matches zero or more spaces at either the start or the end of a line.
You need to set the 'Multiline' option so that ^ and $ apply to every line rather than only to the whole string.
Then replace the matched spaces with an empty string.
How to use:
message.BodyPlainText = Regex.Replace(message.BodyPlainText, "^[ ]*|[ ]*$", "", RegexOptions.Multiline);
After this, the leading space on each line is gone.
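One caveat for CRLF text like the sample above: in .NET, `$` with RegexOptions.Multiline matches just before `\n`, so a space sitting in front of `\r\n` is not seen as end-of-line whitespace. Below is a minimal sketch that trims both sides, assuming the body uses `\r\n` or `\n` line endings. The `LineTrimmer` class and `TrimLines` method are names made up for illustration; they are not part of MailBee.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineTrimmer
{
    // Hypothetical helper: strips spaces at the start of each line and at the
    // end of each line, where a line may end in "\n", "\r\n" or end of string.
    public static string TrimLines(string text)
    {
        // ^[ ]+         : one or more spaces right after a line start
        // [ ]+(?=\r?$)  : spaces followed by an optional '\r' and a line end
        return Regex.Replace(text, @"^[ ]+|[ ]+(?=\r?$)", "", RegexOptions.Multiline);
    }
}
```

It would be used as `message.BodyPlainText = LineTrimmer.TrimLines(message.BodyPlainText);`. The lookahead leaves the `\r` itself in place, so the original line endings are preserved.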

Related

Email sent from web app has >> in bullet points

I have a web application which notifies customers of their application status via email. Standard email messages are uploaded through a user web page and stored in a SQL Server db table. The email web service then reads the message content from the db table, converts it to a string and triggers the email.
System.Data.SqlClient.SqlDataReader Msg = RC.dbTable("EmailMessage", parm);
if (Msg == null)
{
returnString = "Error Sending Email->" + RC.ErrorMessage("Error Getting Standard Email Message->");
}
else
{
if (Msg.Read())
{
msg = Msg["MessageContent"].ToString().Replace("[", "<").Replace("]", ">");
topic = Msg["MessageTopic"].ToString();
}
Msg.Close();
}
This time, I had to include some bullet points in my email, so I created the email message in Word, saved it as an HTML file and uploaded it to the web page. The email message shows up perfectly in any browser.
Hello,
Please reply to xyz@abc.com with the following:
‐ a paper
‐ a pen
‐ a file cover
This needs to be completed.
Stay Safe.
But, when I tested the email functionality, I am getting the email like this:
Hello,
Please reply to xyz@abc.com with the following:
‐ >>a paper
‐ >>a pen
‐ >>a file cover
This needs to be completed.
Stay Safe.
I don't understand why the email message has >> in the bullet point text. Please find the HTML file snippet below.
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 15">
<meta name=Originator content="Microsoft Word 15">
<link rel=File-List href="Hello_files/filelist.xml">
<link rel=themeData href="Hello_files/themedata.thmx">
<link rel=colorSchemeMapping href="Hello_files/colorschememapping.xml">
<style>
</style>
</head>
<body lang=EN-US link="#0563C1" vlink="#954F72" style='tab-interval:.5in'>
<div class=WordSection1>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Verdana",sans-serif'>Hello, <br>
<br>
Please reply to </span><a
href="mailto:xyz@abc.com"><span style='font-size:12.0pt;font-family:
"Verdana",sans-serif'>xyz@abc.com</span></a><span style='font-size:
12.0pt;font-family:"Verdana",sans-serif'> with the following:<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Verdana",sans-serif'><o:p> </o:p></span></p>
<p class=MsoListParagraphCxSpFirst style='margin-bottom:8.0pt;mso-add-space:
auto;text-indent:-.25in;line-height:105%;mso-list:l0 level1 lfo1'><![if !supportLists]><span
lang=EN-CA style='font-size:12.0pt;line-height:105%;font-family:"Verdana",sans-serif;
mso-fareast-font-family:Verdana;mso-bidi-font-family:Verdana;mso-ansi-language:
EN-CA'><span style='mso-list:Ignore'>‐<span style='font:7.0pt "Times New Roman"'>
</span></span></span><![endif]><span lang=EN-CA style='font-size:12.0pt;
line-height:105%;font-family:"Verdana",sans-serif;mso-ansi-language:EN-CA'>a paper<o:p></o:p></span></p>
<p class=MsoListParagraphCxSpMiddle style='margin-bottom:8.0pt;mso-add-space:
auto;text-indent:-.25in;line-height:105%;mso-list:l0 level1 lfo1'><![if !supportLists]><span
lang=EN-CA style='font-size:12.0pt;line-height:105%;font-family:"Verdana",sans-serif;
mso-fareast-font-family:Verdana;mso-bidi-font-family:Verdana;mso-ansi-language:
EN-CA'><span style='mso-list:Ignore'>‐<span style='font:7.0pt "Times New Roman"'>
</span></span></span><![endif]><span lang=EN-CA style='font-size:12.0pt;
line-height:105%;font-family:"Verdana",sans-serif;mso-ansi-language:EN-CA'>a pen<o:p></o:p></span></p>
<p class=MsoListParagraphCxSpMiddle style='margin-bottom:8.0pt;mso-add-space:
auto;text-indent:-.25in;line-height:105%;mso-list:l0 level1 lfo1'><![if !supportLists]><span
lang=EN-CA style='font-size:12.0pt;line-height:105%;font-family:"Verdana",sans-serif;
mso-fareast-font-family:Verdana;mso-bidi-font-family:Verdana;mso-ansi-language:
EN-CA'><span style='mso-list:Ignore'>‐<span style='font:7.0pt "Times New Roman"'>
</span></span></span><![endif]><span lang=EN-CA style='font-size:12.0pt;
line-height:105%;font-family:"Verdana",sans-serif;mso-ansi-language:EN-CA'>a file cover<o:p></o:p></span></p>
<p class=MsoListParagraphCxSpLast style='margin-top:0in;margin-right:0in;
margin-bottom:8.0pt;margin-left:38.7pt;mso-add-space:auto;line-height:105%'><span
lang=EN-CA style='font-size:12.0pt;line-height:105%;font-family:"Verdana",sans-serif;
mso-ansi-language:EN-CA'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Verdana",sans-serif'>This needs to be completed.<o:p></o:p></span></p>
<p class=MsoNormal style='mso-margin-top-alt:auto'><span style='font-size:12.0pt;
font-family:"Verdana",sans-serif'>Stay Safe.<o:p></o:p></span></p>
</div>
</body>
</html>
Oh, MS Word was used to create the HTML. Word always makes a bit of a mess, with its elaborate stylesheets, masses of extra tags and other superfluous structure. You'd probably get a good result just by cutting the HTML down to the minimum you need, which looks really simple (maybe three p elements and a ul), but I think the problem comes from this:
msg = Msg["MessageContent"].ToString().Replace("[", "<").Replace("]", ">")
Plus these in the html:
lfo1'><![if !supportLists]><span
...
<![endif]>
Running that replacement generates HTML containing <!<endif>>, which is definitely invalid HTML. Just because a browser can read it without choking doesn't mean an email program will behave the same way; you're putting garbage in and getting garbage out.
Clean up the HTML:
<html>
<body style='font-family: sans-serif'>
<p>Hello,</p>
<p>Please reply to xyz@abc.com with the following:</p>
<ul>
<li>a paper</li>
<li>a pen</li>
<li>a file cover</li>
</ul>
<p>This needs to be completed.</p>
<p>Stay Safe.</p>
</body>
</html>
And don't replace square brackets with angle brackets. It's asking for trouble.
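To see the effect concretely, here is a small sketch that runs the same two Replace calls over the Word conditional-comment markers from the snippet. The `BracketDemo` class and `ExpandBrackets` method are hypothetical names for illustration; only the Replace chain is taken from the question's code.

```csharp
using System;

public static class BracketDemo
{
    // Same replacement chain as in the question's service code:
    // every '[' becomes '<' and every ']' becomes '>'.
    public static string ExpandBrackets(string stored)
    {
        return stored.Replace("[", "<").Replace("]", ">");
    }
}
```

`BracketDemo.ExpandBrackets("<![if !supportLists]>")` returns `<!<if !supportLists>>`, and `<![endif]>` becomes `<!<endif>>`. Each marker sits right before a bullet's text and now ends in a doubled `>`, which is a plausible source of the stray >> in the received email.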

CsQuery replace tags

I am using CsQuery to parse HTML documents. What I'm trying to do is replace all the "br" HTML tags with a "." character.
Assuming that this is my input HTML:
<html>
<body>
Hello
<br>
World
</body>
</html>
The requested output will be:
<html>
<body>
Hello
.
World
</body>
</html>
Pseudo code:
CQ dom = CQ.CreateFromUrl("http://my.url");
dom.ReplaceTag("<br>", ".");
Is this possible?
Thanks for any advice.
That's pretty easy, just replace the <br> elements by setting their OuterHTML.
The relevant selector is just "br":
foreach (var br in dom["br"])
br.OuterHTML = ".";
Call dom.Render() to see the result.

AntiXSS doesn't sanitize unclosed html tag

Why are unclosed html tags not sanitized with Microsoft AntiXSS?
string untrustedHtml = "<img src=x onmouseover=confirm(foo) y=";
string trustedHtml = AntiXSS.Sanitizer.GetSafeHtmlFragment(untrustedHtml); // returns "<img src=x onmouseover=confirm(foo) y="
Closed tags are sanitized:
string untrustedHtml = "<img src=x onmouseover=confirm(foo) y=a>";
string trustedHtml = AntiXSS.Sanitizer.GetSafeHtmlFragment(untrustedHtml); // returns ""
It is recommended to use HTML encoding instead of HTML sanitization whenever possible.
Sanitization should only be used if you actually need to allow some HTML but want to remove any unsafe code. 99% of the time you don't need users to insert any HTML at all, and encoding takes care of that.
Having said that, if you still want to sanitize, AntiXSS is not the best solution, both because of the example above and because it also removes perfectly safe HTML that it falsely recognizes as unsafe, which makes the AntiXSS sanitizer impractical.
The Ajax Control Toolkit has a better built-in sanitizer you can use, but note that it is less secure because it works partly with blacklists (searching for dangerous code instead of permitting only known-safe code).
If you still want to use AntiXSS sanitization, you can check whether the submitted HTML is well formed before you send it to the sanitizer. For example, if you require XHTML-style input, you can try loading it with an XML document class; note that ordinary HTML is not always well-formed XML, so this check would reject some valid HTML.
Hope this helps.
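As a minimal sketch of the encoding approach recommended above, the following uses `WebUtility.HtmlEncode` from `System.Net`. The `SafeOutput` class and `Encode` method are hypothetical names for illustration, and the sketch assumes the value is written into HTML element content (not into a script block or an unquoted attribute).

```csharp
using System;
using System.Net;

public static class SafeOutput
{
    // Encodes untrusted text so the browser displays it as literal characters
    // instead of interpreting it as markup.
    public static string Encode(string untrusted)
    {
        return WebUtility.HtmlEncode(untrusted);
    }
}
```

With the unclosed tag from the question, `SafeOutput.Encode("<img src=x onmouseover=confirm(foo) y=")` returns `&lt;img src=x onmouseover=confirm(foo) y=`, which renders as plain text whether or not the tag was ever closed.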
What version of the AntiXSS library are you using?
I used version 4.3.0.0, and when I ran this through Encoder.GetSafeHtmlFragment()
the output was "<img src=x onmouseover=test(1) y=".
As you can see, the non-HTML values were automatically encoded.
Here is the code I used:
protected void Page_Load(object sender, EventArgs e)
{
var testValue = "<img src=x onmouseover=test(1) y=";
litFirst.Text = testValue;
litSecond.Text = Sanitizer.GetSafeHtml(testValue);
litThird.Text = Sanitizer.GetSafeHtmlFragment(testValue);
}
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title></title>
<script>
function test(x) {
alert(x);
}
</script>
</head>
<body>
<form id="form1" runat="server">
<div>
First: <asp:Literal ID="litFirst" runat="server"/>
<br/>
Second: <asp:Literal ID="litSecond" runat="server"/>
<br/>
Third: <asp:Literal ID="litThird" runat="server"/>
</div>
</form>
</body>
</html>
But I also agree with Gil Cohen that you really should not allow users to enter HTML.
Like Gil Cohen, I would recommend that instead of letting them enter HTML directly, you use an intermediate language such as Markdown, Textile or wiki markup, to name a few. This gives users more control over their output without letting them write HTML directly.
There are JavaScript WYSIWYG editors that generate the markup and a preview for the user, and then let you save the markup for later use (it should be converted to HTML at output time, not before you save it to your data store).

How can I extract just text from the html

I have a requirement to extract all the text that is present in the <body> of the html. Sample Html input :-
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be :-
This is a big title. How are doing you? I am fine
I want to use only HtmlAgilityPack for this purpose. No regular expressions please.
I know how to load an HtmlDocument and then use an XPath query like '//body' to get the body contents. But how do I strip the HTML as shown in the output?
Thanks in advance :)
You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, with the tags simply removed. That issue is difficult to solve, as display is often determined by the CSS, not just by the markup.
How about using the XPath expression '//body//text()' to select all text nodes?
You can use NUglify that supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and it is very fast (no regex involved, just a pure recursive descent parser; faster than HtmlAgilityPack and more GC friendly).
Normally for parsing html I would recommend a HTML parser, however since you want to remove all html tags a simple regex should work.

Html Agility Pack - Get html fragment from an html document

Using the html agility pack; how would I extract an html "fragment" from a full html document? For my purposes, an html "fragment" is defined as all content inside of the <body> tags.
For example:
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> element (eg. assume that I was passed a fragment in the first place if it wasn't a full html document)
Can anyone point me in the right direction?
I think you need to do it in pieces.
You can select the body node of the document as follows:
doc.DocumentNode.SelectSingleNode("//body") // returns body with entire contents :)
Then check the result for null; if there is no body element, you can take the input string as it is.
Hope it helps :)
The following will work:
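public string GetFragment(HtmlDocument document)
{
// Look up <body> once; SelectSingleNode returns null when nothing matches.
HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
// No <body>: assume the input was already a fragment and return it unchanged.
return body == null ? document.DocumentNode.InnerHtml : body.InnerHtml;
}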
