What's the easiest way to generate an HTML file? - c#

I'm working on a console application that's supposed to spit out an html document that contains a table and maybe some javascript.
I thought about writing the html by hand:
streamWriter.WriteLine("<html>");
streamWriter.WriteLine("<body>");
streamWriter.WriteLine(GetHtmlTable());
streamWriter.WriteLine("</body>");
streamWriter.WriteLine("</html>");
... but was wondering if there is a more elegant way to do it. Something along these lines:
Page page = new Page();
GridView gridView = new GridView();
gridView.DataSource = GetDataTable();
gridView.DataBind();
page.Controls.Add(gridView);
page.RenderControl(htmlWriter);
htmlWriter.Flush();
Assuming that I'm on the right track, what's the proper way to build the rest of the HTML document (i.e. the html, head, title, and body elements) using the System.Web.UI.Page class? Do I need to use literal controls?

It would be a good idea for you to use a templating system to decouple your presentation and business logic.
Take a look at Razor Generator which allows the use of CSHTML templates within non ASP.NET applications.
http://razorgenerator.codeplex.com/

I do a lot of automated HTML page generation. I like to create an HTML page template with custom tags marking where to insert the dynamic controls, data, or literals. I then read the template file into a string, replace each custom tag with the generated HTML (as you are doing above), and write the resulting string back out as an HTML file. This saves me the time of creating all the tedious support HTML for the design template, CSS, and supporting JS.
Template File Example
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<CUSTOMHEAD />
</head>
<body>
<CUSTOMDATAGRID />
</body>
</html>
Create HTML From Template File loaded into string Example
private void GenerateHTML(string TemplateFile, string OutputFileName)
{
string strTemplate = TemplateFile;
string strHTMLPage = "";
string strCurrentTag = "";
int intStartIndex = 0;
int intEndIndex = 0;
while (strTemplate.IndexOf("<CUSTOM", intEndIndex) > -1)
{
intStartIndex = strTemplate.IndexOf("<CUSTOM", intEndIndex);
strHTMLPage += strTemplate.Substring(intEndIndex,
intStartIndex - intEndIndex);
// Extract the whole tag, up to and including the closing "/>".
strCurrentTag = strTemplate.Substring(intStartIndex,
strTemplate.IndexOf("/>", intStartIndex) + 2 - intStartIndex);
strCurrentTag = strCurrentTag.ToUpper();
switch (strCurrentTag)
{
case "<CUSTOMHEAD />":
strHTMLPage += GenerateHeadJavascript();
break;
case "<CUSTOMDATAGRID />":
StringWriter sw = new StringWriter();
GridView.RenderControl(new HtmlTextWriter(sw));
strHTMLPage += sw.ToString();
sw.Close();
break;
case "<CUSTOMANYOTHERTAGSYOUMAKE />":
//strHTMLPage += YourControlsRenderedAsString();
break;
}
intEndIndex = strTemplate.IndexOf("/>", intStartIndex) + 2;
}
strHTMLPage += strTemplate.Substring(intEndIndex);
try
{
StreamWriter swHTMLPage = new System.IO.StreamWriter(
OutputFileName, false, Encoding.UTF8);
swHTMLPage.Write(strHTMLPage);
swHTMLPage.Close();
}
catch (Exception ex)
{
// AppendLog("Write File Failed: " + OutputFileName + " - " + ex.Message);
}
}

Related

How to convert HTML to plain text? [duplicate]

I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML to text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available.
I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
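The approach described in EDIT 2 might look roughly like the sketch below. The class name, the lynxPath parameter, and the exit-code check are illustrative assumptions and not part of the original post; the only Lynx option relied on is "-dump".

```csharp
using System;
using System.Diagnostics;

// Sketch only: LynxDump, lynxPath and the exit-code handling are assumptions.
public static class LynxDump
{
    // Builds the start info described above: no shell, stdout redirected.
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string htmlFilePath)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            // -dump renders the page as plain text on standard output.
            Arguments = "-dump \"" + htmlFilePath + "\"",
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    public static string HtmlFileToText(string lynxPath, string htmlFilePath)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, htmlFilePath)))
        {
            // Read before WaitForExit so a full pipe cannot block the child process.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException("lynx exited with code " + process.ExitCode);
            }
            return text;
        }
    }
}
```

A call would then be something like LynxDump.HtmlFileToText(@"C:\tools\lynx.exe", @"C:\temp\page.html"), where both paths are placeholders.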
Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as noted by the OP, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, noted by others on this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this:
public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}
Updated
Thanks for the comments; I have updated the function to improve it.
I've heard from a reliable source that, if you're doing HTML parsing in .Net, you should look at the HTML agility pack again..
http://www.codeplex.com/htmlagilitypack
Some sample on SO..
HTML Agility pack - parsing tables
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.
I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.
Instead I used that utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.
Assuming you have well formed html, you could also maybe try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;
class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}
public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}
public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}
static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}
Because I wanted conversion to plain text with LF and bullets, I found this pretty solution on codeproject, which covers many conversion usecases:
Convert HTML to Plain Text
Yep, looks so big, but works fine.
The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's.
It shouldn't be too hard to extend this to tables.
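As a rough illustration of that idea, here is a minimal sketch. The class and method names are hypothetical, and regex-based handling like this is only reasonable for simple, well-behaved markup; it does not attempt tables.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Sketch only: SimpleHtmlToText is a hypothetical name, not an existing library.
public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        // Line breaks for <br> and for the end of each paragraph.
        string text = Regex.Replace(html, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"</p\s*>", "\n", RegexOptions.IgnoreCase);
        // A dash in front of every list item and a line break after it.
        text = Regex.Replace(text, @"<li(\s[^>]*)?>", "- ", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"</li\s*>", "\n", RegexOptions.IgnoreCase);
        // Strip whatever tags remain, then decode entities such as &amp;.
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

The replacements run before the generic tag strip on purpose: once the tags are gone there is nothing left to hang the layout characters on.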
Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:
var a = "This &amp; that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();
Using the two together you can get the plain text from the HTML.
Another post suggests the HTML agility pack:
This is an agile HTML parser that
builds a read/write DOM and supports
plain XPATH or XSLT (you actually
don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is
a .NET code library that allows you to
parse "out of the web" HTML files. The
parser is very tolerant with "real
world" malformed HTML. The object
model is very similar to what proposes
System.Xml, but for HTML documents (or
streams).
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
This function converts "what you see in the browser" to plain text with line breaks. (If you want to see the result in the browser, use the commented-out return statement instead.)
public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}
I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/
I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.
Try the easy and usable way: just call StripHTML(WebBrowserControl_name);
public string StripHTML(WebBrowser webp)
{
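try
{
// Assumption: webp is a WinForms WebBrowser and the mshtml interop assembly is referenced;
// the original snippet used "doc" without declaring it.
IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
doc.execCommand("SelectAll", true, null);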
IHTMLSelectionObject currentSelection = doc.selection;
if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";
}
In GeneXus you can do this with a regex:
&pattern = '<[^>]+>'
&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")
If you are using .NET Framework 4.5 you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired:
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
This is another solution to convert HTML to Text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free; it is a commercial product, and it is my own product.

Inject some string to a specific part of string in C#

How do I insert a string into a specific part of another string? What I am trying to achieve: I have an HTML string like this in a variable, say string stringContent;
<html><head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<meta name="Viewport" content="width=320; user-scaleable=no;
initial-scale=1.0">
<style type="text/css">
body {
background: black;
color: #80c0c0;
}
</style>
<script>
</script>
</head>
<body>
<button type="button" onclick="callNative();">Call to Native
Code!</button>
<br><br>
</body></html>
I need to add the string content below inside the <script></script> tag:
function callNative()
{
window.external.notify("Uulalaa!");
}
function addToBody(text)
{
document.body.innerHTML = document.body.innerHTML + "<br>" + text;
}
How can I achieve this in C#?
Assuming your content is stored in the string content, you can start by finding the script tag with:
int scriptpos = content.IndexOf("<script");
Then to get past the end of the script tag:
scriptpos = content.IndexOf(">", scriptpos) + 1;
And finally to insert your new content:
content = content.Insert(scriptpos, newContent);
This at least allows for potential attributes in the script tag.
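Put together, the three steps above could be wrapped as in the sketch below. The class and method names are hypothetical, and the checks for a missing tag are an addition: IndexOf returns -1 when nothing is found, which would otherwise make the later calls misbehave.

```csharp
using System;

// Sketch only: ScriptInjector and InsertIntoFirstScript are hypothetical names.
public static class ScriptInjector
{
    // Inserts newContent right after the first opening <script ...> tag.
    // Returns the input unchanged when no such tag is found.
    public static string InsertIntoFirstScript(string content, string newContent)
    {
        int scriptPos = content.IndexOf("<script", StringComparison.OrdinalIgnoreCase);
        if (scriptPos < 0) return content;

        // Skip past any attributes to the end of the opening tag.
        int tagEnd = content.IndexOf(">", scriptPos, StringComparison.Ordinal);
        if (tagEnd < 0) return content;

        return content.Insert(tagEnd + 1, newContent);
    }
}
```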
Use htmlString.Replace(what, with)
var htmlString = "you html bla bla where's the script tag? oooups here it is!!!<script></script>";
var yourScript = "alert('HA-HA-HA!!!')";
htmlString = htmlString.Replace("<script>", "<script>" + yourScript);
Note that this will insert yourScript inside all <script> elements.
var htmlString = @"<script>$var1</script> <script>$var2</script>"
.Replace("$var1", "alert('var1')")
.Replace("$var2", "alert('var2')");
var htmlString = "you html bla bla where's the script tag? oooups here it is!!!<script></script>";
var yourScript = "alert('HA-HA-HA!!!')";
htmlString = htmlString.Insert(htmlString.IndexOf("<script>") + "<script>".Length, yourScript);
This can be done in another (safer) way, using HTML Agility Pack (open source project http://htmlagilitypack.codeplex.com). It helps you to parse and edit html without you having to worry about malformed tags (<br/>, <br />, < br / > etc). It includes operations to make it easy to insert elements, like AppendChild.
If you are dealing with HTML, this is the way to go.
For this you can read the HTML file into a string using the File.ReadAllText method. Here, for example, I have used a sample HTML string. After that, with some string operations you can add tags inside the script element as follows.
string text = "<test> 10 </test>";
string htmlString =
@" <html>
<head>
<script>
<tag1> 5 </tag1>
</script>
</head>
</html>";
int startIndex = htmlString.IndexOf("<script>");
int length = htmlString.IndexOf("</script>") - startIndex;
string scriptTag = htmlString.Substring(startIndex, length) + "</script>";
string expectedScripTag = scriptTag.Replace("<script>", "<script><br>" + text);
htmlString = htmlString.Replace(scriptTag, expectedScripTag);

Issue regarding add html data to web browser control repeatedly

This is how I try to add HTML data to the web browser control.
private void Adddata()
{
webBrowser1.DocumentText =
"<html><body>Please enter your name:<br/>" +
"<input type='text' name='userName'/><br/>" +
"<a href='http://www.microsoft.com'>continue</a>" +
"</body></html>";
}
This works, but when I call the Adddata() routine repeatedly, data is added only the first time; on subsequent calls no data is added. I just want to add the data repeatedly. Is there any way to do this?
You can use this:
webBrowser1.DocumentText +=
But then you can't append markup that contains its own html and body tags.
Always build a new string with a single html and body tag,
and just append the HTML inside it.
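One way to keep a single html and body pair while still adding content repeatedly is to collect the body fragments separately and rebuild the whole document on each update. The sketch below assumes this; the HtmlPageBuilder class is a hypothetical helper, not part of WinForms.

```csharp
using System.Collections.Generic;
using System.Text;

// Sketch only: HtmlPageBuilder is a hypothetical helper class.
public class HtmlPageBuilder
{
    private readonly List<string> _fragments = new List<string>();

    // Remembers one more piece of markup that belongs inside <body>.
    public void Add(string bodyFragment)
    {
        _fragments.Add(bodyFragment);
    }

    // Wraps all fragments collected so far in exactly one html/body pair.
    public string Build()
    {
        var sb = new StringBuilder("<html><body>");
        foreach (string fragment in _fragments)
        {
            sb.Append(fragment);
        }
        sb.Append("</body></html>");
        return sb.ToString();
    }
}
```

Each call to Adddata() would then do something like builder.Add(...) followed by webBrowser1.DocumentText = builder.Build();, so the control always receives one complete document.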
Change
webBrowser1.DocumentText = //blah
to
webBrowser1.DocumentText += //blah
Well, not really. It wouldn't be the best of ideas to do that for html. What I would do is
//in class def
bool firstTime = true;
//in method
bool firstTimeLcl = firstTime;
firstTime = false;
if (firstTimeLcl)
{
//write header
}
else
{
String.Replace(/*closing tags*/, "");
}
//write everything within body
//write closing tags

C# WebBrowser control not applying css

I have a project that I am working on in VS2005. I have added a WebBrowser control. I add a basic empty page to the control
private const string _basicHtmlForm = "<html> "
+ "<head> "
+ "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'/> "
+ "<title>Test document</title> "
+ "<script type='text/javascript'> "
+ "function ShowAlert(message) { "
+ " alert(message); "
+ "} "
+ "</script> "
+ "</head> "
+ "<body><div id='mainDiv'> "
+ "</div></body> "
+ "</html> ";
private string _defaultFont = "font-family: Arial; font-size:10pt;";
private void LoadWebForm()
{
try
{
_webBrowser.DocumentText = _basicHtmlForm;
}
catch(Exception ex)
{
MessageBox.Show(ex.Message);
}
}
and then add various elements via the dom (using _webBrowser.Document.CreateElement). I am also loading a css file:
private void AddStyles()
{
try
{
mshtml.HTMLDocument currentDocument = (mshtml.HTMLDocument) _webBrowser.Document.DomDocument;
mshtml.IHTMLStyleSheet styleSheet = currentDocument.createStyleSheet("", 0);
TextReader reader = new StreamReader(Path.Combine(Path.GetDirectoryName(Application.ExecutablePath),"basic.css"));
string style = reader.ReadToEnd();
styleSheet.cssText = style;
}
catch(Exception ex)
{
MessageBox.Show(ex.Message);
}
}
Here is the css page contents:
body {
background-color: #DDDDDD;
}
.categoryDiv {
background-color: #999999;
}
.categoryTable {
width:599px; background-color:#BBBBBB;
}
#mainDiv {
overflow:auto; width:600px;
}
The style page is loading successfully, but the only elements on the page that are being affected are the ones that are initially in the page (body and mainDiv). I have also tried including the CSS in a <style> element in the head section, but it still only affects the elements that are there when the page is created.
So my question is, does anyone have any idea why the CSS is not being applied to elements that are created after the page is loaded? I have also tried not applying the CSS until after all of my elements are added, but the results don't change.
I made a slight modification to your AddStyles() method and it works for me.
Where are you calling it from? I called it from "_webBrowser_DocumentCompleted".
I have to point out that I am calling AddStyles after I modify the DOM.
private void AddStyles()
{
try
{
if (_webBrowser.Document != null)
{
IHTMLDocument2 currentDocument = (IHTMLDocument2)_webBrowser.Document.DomDocument;
int length = currentDocument.styleSheets.length;
IHTMLStyleSheet styleSheet = currentDocument.createStyleSheet(@"", length + 1);
//length = currentDocument.styleSheets.length;
//styleSheet.addRule("body", "background-color:blue");
TextReader reader = new StreamReader(Path.Combine(Path.GetDirectoryName(Application.ExecutablePath), "basic.css"));
string style = reader.ReadToEnd();
styleSheet.cssText = style;
}
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
Here is my DocumentCompleted handler (I added some styles to basic.css for testing):
private void _webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
HtmlElement element = _webBrowser.Document.CreateElement("p");
element.InnerText = "Hello World1";
_webBrowser.Document.Body.AppendChild(element);
HtmlElement divTag = _webBrowser.Document.CreateElement("div");
divTag.SetAttribute("class", "categoryDiv");
divTag.InnerHtml = "<p>Hello World2</p>";
_webBrowser.Document.Body.AppendChild(divTag);
HtmlElement divTag2 = _webBrowser.Document.CreateElement("div");
divTag2.SetAttribute("id", "mainDiv2");
divTag2.InnerHtml = "<p>Hello World3</p>";
_webBrowser.Document.Body.AppendChild(divTag2);
AddStyles();
}
This is what I get (modified the style to make it as ugly as a single human being can hope to make it :D ):
One solution is to inspect the HTML prior to setting the DocumentText and inject the CSS on the client side. I don't set the control's Url property but rather get the HTML via WebClient and then set the DocumentText. Maybe setting DocumentText (or in your case Document) after you manipulate the DOM could get it to re-render properly.
private const string CSS_960 = @"960.css";
private const string SCRIPT_FMT = @"<style TYPE=""text/css"">{0}</style>";
private const string HEADER_END = @"</head>";
public void SetDocumentText(string value)
{
this.Url = null; // can't have both URL and DocText
this.Navigate("About:blank");
string css = null;
string html = value;
// check for known CSS file links and inject the resourced versions
if(html.Contains(CSS_960))
{
css = GetEmbeddedResourceString(CSS_960);
html = html.Insert(html.IndexOf(HEADER_END), string.Format(SCRIPT_FMT,css));
}
if (Document != null) {
Document.Write(string.Empty);
}
DocumentText = html;
}
It would be quite hard to say unless you send a link to this.
But usually the best method for style-related work is to have the CSS already in the page, and in your C# code only add ids or classes to elements to get the styles applied.
I have found that generated tags with a class attribute do not get their styles applied.
This is my workaround that is done after the document is generated:
public static class WebBrowserExtensions
{
public static void Redraw(this WebBrowser browser)
{
string temp = Path.GetTempFileName();
File.WriteAllText(temp, browser.Document.Body.Parent.OuterHtml,
Encoding.GetEncoding(browser.Document.Encoding));
browser.Url = new Uri(temp);
}
}
I use a similar control instead of WebBrowser. I load an HTML page with "default" style rules and change the rules from within the program.
(Drawback: maintenance. When I need to add a rule, I also need to change it in code.)
' ----------------------------------------------------------------------
Public Sub mcFontOrColorsChanged(ByVal isRefresh As Boolean)
' ----------------------------------------------------------------------
' Notify whichever is concerned:
Dim doc As mshtml.HTMLDocument = Me.Document
If (doc.styleSheets Is Nothing) Then Return
If (doc.styleSheets.length = 0) Then Return
Dim docStyleSheet As mshtml.IHTMLStyleSheet = CType(doc.styleSheets.item(0), mshtml.IHTMLStyleSheet)
Dim docStyleRules As mshtml.HTMLStyleSheetRulesCollection = CType(docStyleSheet.rules, mshtml.HTMLStyleSheetRulesCollection)
' Note: the following is needed separately from 'Case "BODY"
Dim docBody As mshtml.HTMLBodyClass = CType(doc.body, mshtml.HTMLBodyClass)
If Not (docBody Is Nothing) Then
docBody.style.backgroundColor = colStrTextBg
End If
Dim i As Integer
Dim maxI As Integer = docStyleRules.length - 1
For i = 0 To maxI
Select Case (docStyleRules.item(i).selectorText)
Case "BODY"
docStyleRules.item(i).style.fontFamily = fName ' "Times New Roman" | "Verdana" | "courier new" | "comic sans ms" | "Arial"
Case "P.myStyle1"
docStyleRules.item(i).style.fontSize = fontSize.ToString & "pt"
Case "TD.myStyle2" ' do nothing
Case ".myStyle3"
docStyleRules.item(i).style.fontSize = fontSizePath.ToString & "pt"
docStyleRules.item(i).style.color = colStrTextFg
docStyleRules.item(i).style.backgroundColor = colStrTextBg
Case Else
Debug.WriteLine("Rule " & i.ToString & " " & docStyleRules.item(i).selectorText)
End Select
Next i
If (isRefresh) Then
Me.myRefresh(curNode)
End If
End Sub
It could be that the styles are applied only to objects on the page that EXIST at the time the page is being loaded. Just because you add a node to the DOM tree doesn't mean that it can have all of its attributes manipulated and rendered inside the browser.
The methods above seem to use an approach that reloads the page (DOM), which suggests that this may be the case.
In short, refresh the page after you've added an element.
It sounds as though phq has experienced this. The way I would approach it is to add a reference to jQuery to your HTML document (from the start).
Then, inside the page, create a JavaScript function that accepts the element id and the name of the class to apply. Inside the function, use jQuery to dynamically apply the class in question or to modify the CSS directly. For example, use jQuery's .addClass or .css functions to modify the element.
From there, in your C# code, after you add the element dynamically invoke this javascript as described by Rick Strahl here: http://www.west-wind.com/Weblog/posts/493536.aspx

How can I Convert HTML to Text in C#?

I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML to text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available.
I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as noted by the OP, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, noted by others on this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this (it needs System.Text.RegularExpressions and System.Web):
public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}
Updated
Thanks for the comments; I have updated this function to improve it.
I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again.
http://www.codeplex.com/htmlagilitypack
Some samples on SO:
HTML Agility pack - parsing tables
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other Text browsers...This is much harder to do than you would expect.
I had some decoding issues with HtmlAgilityPack and I didn't want to invest time investigating them.
Instead I used this utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.
Assuming you have well-formed HTML, you could also try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;
class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}
public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}
public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}
static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}
Because I wanted conversion to plain text with line feeds and bullets, I found this nice solution on CodeProject, which covers many conversion use cases:
Convert HTML to Plain Text
Yes, it looks big, but it works fine.
The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's.
It shouldn't be too hard to extend this to tables.
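A rough sketch of that idea, using only regular expressions and System.Net.WebUtility. The class name SimpleHtmlToText and the choice of which tags become line breaks are my own assumptions, not something from the answer above. Regexes will trip over unusual markup, so treat this as a starting point and not a parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        string text = html;

        // Drop script and style blocks entirely, including their content.
        text = Regex.Replace(text, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // <br> and the end of block-level elements become line breaks.
        text = Regex.Replace(text, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"</(p|div|ul|ol)\s*>", "\n", RegexOptions.IgnoreCase);

        // Each list item starts on its own line with a dash.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", RegexOptions.IgnoreCase);

        // Strip whatever tags are left, then decode entities such as &amp;.
        text = Regex.Replace(text, @"<[^>]+>", "");
        text = WebUtility.HtmlDecode(text);

        return text.Trim();
    }
}
```

Tables could be handled the same way, for example by turning the end of each cell into a tab and the end of each row into a line break before the final strip.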
Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:
var a = "This &amp; that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();
Using the two together you can get the plain text from the HTML.
Another post suggests the HTML Agility Pack:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
This function converts "what you see in the browser" to plain text with line breaks. (If you want to see the result in a browser, use the commented-out return value instead.)
public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}
I don't know C#, but there is a fairly small and easy-to-read Python html2txt script here: http://www.aaronsw.com/2002/html2text/
I have recently blogged about a solution that worked for me: using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.
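Try the easy way: just call StripHTML(WebBrowserControl_name). It needs a reference to the mshtml interop assembly.
public string StripHTML(WebBrowser webp)
{
try
{
// Get the underlying MSHTML document (this assumes the WinForms WebBrowser control).
IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
doc.execCommand("SelectAll", true, null);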
IHTMLSelectionObject currentSelection = doc.selection;
if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";
}
In GeneXus you can do it with a regex:
&pattern = '<[^>]+>'
&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")
If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns the decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
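A tiny illustration, assuming nothing beyond the System.Net namespace (so no System.Web reference is needed). The wrapper class DecodeExample is made up for the example.

```csharp
using System;
using System.Net;

public static class DecodeExample
{
    // Turns HTML entities (named or numeric) back into plain characters.
    // Note that this only decodes; it does not remove any tags.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

For example, DecodeExample.Decode("This &amp; that") returns "This & that".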
You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired:
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
This is another solution to convert HTML to Text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free; it is a commercial product, and it is my own product.
