Select elements added to the DOM by a script - c#

I've been trying to get either an <object> or an <embed> tag using:
HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
This doesn't seem to work.
Can anyone please tell me how to get these tags and their InnerHtml?
A YouTube embedded video looks like this:
<embed height="385" width="640" type="application/x-shockwave-flash"
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">
I got a feeling the JavaScript might stop the swf player from working, hope not...
Cheers

Update 2010-08-26 (in response to OP's comment):
I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:
string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.
Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
YouTubeScraper
OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.
class YouTubeScraper
{
public HtmlNode FindObjectElement(string url)
{
HtmlNodeCollection scriptNodes = FindScriptNodes(url);
for (int i = 0; i < scriptNodes.Count; ++i)
{
HtmlNode scriptNode = scriptNodes[i];
string javascript = scriptNode.InnerHtml;
int objectNodeLocation = javascript.IndexOf("<object");
if (objectNodeLocation != -1)
{
string htmlStart = javascript.Substring(objectNodeLocation);
int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
if (objectNodeEndLocation != -1)
{
string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
string unescaped = Unescape(finalEscapedHtml);
var objectDoc = new HtmlDocument();
objectDoc.LoadHtml(unescaped);
HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
return objectNode;
}
}
}
return null;
}
public HtmlNode FindEmbedElement(string url)
{
HtmlNodeCollection scriptNodes = FindScriptNodes(url);
for (int i = 0; i < scriptNodes.Count; ++i)
{
HtmlNode scriptNode = scriptNodes[i];
string javascript = scriptNode.InnerHtml;
int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
if (approxEmbedNodeLocation != -1)
{
string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);
int embedNodeEndLocation = htmlStart.IndexOf(">\";");
if (embedNodeEndLocation != -1)
{
string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
string unescaped = Unescape(finalEscapedHtml);
var embedDoc = new HtmlDocument();
embedDoc.LoadHtml(unescaped);
HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
return videoEmbedNode;
}
}
}
return null;
}
protected HtmlNodeCollection FindScriptNodes(string url)
{
var doc = new HtmlDocument();
WebRequest request = WebRequest.Create(url);
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
doc.Load(stream);
}
HtmlNode root = doc.DocumentNode;
HtmlNodeCollection scriptNodes = root.SelectNodes("//script");
return scriptNodes;
}
static string Unescape(string htmlFromJavascript)
{
// The JavaScript has escaped all of its HTML using backslashes. We need
// to reverse this.
// DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
// of this code. If you could improve it, please, I beg of you to do so. Personally,
// I tested it on a grand total of three inputs. It worked for those, at least.
return Regex.Replace(htmlFromJavascript, #"\\(.)", UnescapeFromBeginning);
}
static string UnescapeFromBeginning(Match match)
{
string text = match.ToString();
if (text.StartsWith("\\"))
{
return text.Substring(1);
}
return text;
}
}
And in case you're interested, here's a little demo I threw together (super fancy, I know):
class Program
{
static void Main(string[] args)
{
var scraper = new YouTubeScraper();
HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
Console.WriteLine("David After Dentist:");
Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
Console.WriteLine();
HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
Console.WriteLine("Drunk History:");
Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
Console.WriteLine();
HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
Console.WriteLine("Jessica's Daily Affirmation:");
Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
Console.WriteLine();
HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
Console.WriteLine("Jazzercise - Move your Boogie Body:");
Console.WriteLine(jazzerciseObjectNode.OuterHtml);
Console.WriteLine();
Console.Write("Finished! Hit Enter to quit.");
Console.ReadLine();
}
}
Original Answer
Why not try using the element's Id instead?
HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.

Related

Extracting full line of text using partial text [duplicate]

How do I use the HTML Agility Pack?
My XHTML document is not completely valid. That's why I wanted to use it. How do I use it in my project? My project is in C#.
First, install the HTMLAgilityPack nuget package into your project.
Then, as an example:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
htmlDoc.OptionFixNestedTags=true;
// filePath is a path to a file containing the html
htmlDoc.Load(filePath);
// Use: htmlDoc.LoadHtml(xmlString); to load from a string (was htmlDoc.LoadXML(xmlString)
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
// Handle any parse errors as required
}
else
{
if (htmlDoc.DocumentNode != null)
{
HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
if (bodyNode != null)
{
// Do something with bodyNode
}
}
}
(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)
The HtmlDocument.Load() method also accepts a stream which is very useful in integrating with other stream oriented classes in the .NET framework. While HtmlEntity.DeEntitize() is another useful method for processing html entities correctly. (thanks Matthew)
HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, it provides the selectSingleNode and selectNodes methods that accept XPath expressions.
Pay attention to the HtmlDocument.Option?????? boolean properties. These control how the Load and LoadXML methods will process your HTML/XHTML.
There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.
I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.
HtmlAgilityPack Article Series
Introduction To The HtmlAgilityPack Library
Easily extracting links from a snippet of html with HtmlAgilityPack
The next article is 95% complete, I just have to write up explanations of the last few parts of the code I have written. If you are interested then I will try to remember to post here when I publish it.
HtmlAgilityPack uses XPath syntax, and though many argues that it is poorly documented, I had no trouble using it with help from this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp
To parse
<h2>
Jack
</h2>
<ul>
<li class="tel">
81 75 53 60
</li>
</ul>
<h2>
Roy
</h2>
<ul>
<li class="tel">
44 52 16 87
</li>
</ul>
I did this:
string url = "http://website.com";
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a"))
{
names.Add(node.ChildNodes[0].InnerHtml);
}
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//li[#class='tel']//a"))
{
phones.Add(node.ChildNodes[0].InnerHtml);
}
Main HTMLAgilityPack related code is as follows
using System;
using System.Net;
using System.Web;
using System.Web.Services;
using System.Web.Script.Services;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
namespace GetMetaData
{
/// <summary>
/// Summary description for MetaDataWebService
/// </summary>
[WebService(Namespace = "http://tempuri.org/")]
[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
[System.ComponentModel.ToolboxItem(false)]
// To allow this Web Service to be called from script, using ASP.NET AJAX, uncomment the following line.
[System.Web.Script.Services.ScriptService]
public class MetaDataWebService: System.Web.Services.WebService
{
[WebMethod]
[ScriptMethod(UseHttpGet = false)]
public MetaData GetMetaData(string url)
{
MetaData objMetaData = new MetaData();
//Get Title
WebClient client = new WebClient();
string sourceUrl = client.DownloadString(url);
objMetaData.PageTitle = Regex.Match(sourceUrl, #
"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
//Method to get Meta Tags
objMetaData.MetaDescription = GetMetaDescription(url);
return objMetaData;
}
private string GetMetaDescription(string url)
{
string description = string.Empty;
//Get Meta Tags
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var metaTags = document.DocumentNode.SelectNodes("//meta");
if (metaTags != null)
{
foreach(var tag in metaTags)
{
if (tag.Attributes["name"] != null && tag.Attributes["content"] != null && tag.Attributes["name"].Value.ToLower() == "description")
{
description = tag.Attributes["content"].Value;
}
}
}
else
{
description = string.Empty;
}
return description;
}
}
}
public string HtmlAgi(string url, string key)
{
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[#name='{0}']", key));
if (ourNode != null)
{
return ourNode.GetAttributeValue("content", "");
}
else
{
return "not fount";
}
}
Getting Started - HTML Agility Pack
// From File
var doc = new HtmlDocument();
doc.Load(filePath);
// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);
// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);
try this
string htmlBody = ParseHmlBody(dtViewDetails.Rows[0]["Body"].ToString());
private string ParseHmlBody(string html)
{
string body = string.Empty;
try
{
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
body = htmlBody.OuterHtml;
}
catch (Exception ex)
{
dalPendingOrders.LogMessage("Error in ParseHmlBody" + ex.Message);
}
return body;
}

Suppress Tags from HTML with AgilityHTMLPack

I need help because I am not really used to work with HTML. I show a webdocument from my code, the web document read an HTML file, containing some Images.
Everytime, just before the Image tag, I observed two tags who create some wrong caracters. An example would be better.
<p ><br clear=all> </span>
<img border=0 width=265 height=105 id="Picture 84856"
src="Test_HTML/image272.jpg"></p>
the printing is partially correct because it shows the Images and a lots of wrong ÂÂÂÂÂÂÂÂÂ characters.
So I decided to try to cut the tags.
I don't know how to do this. Perhaps I am completely wrong but I think it is good start, isn't it?
My test to suppress these tags in a Html node is
public void ShowTag(string tag)
{
string innerHtml= "//div[#id='"+tag+ "']";
string inner = "//p";
string brToRemove = "//br";
string spanToRemove = "//span";
var nodes = document.DocumentNode.SelectSingleNode(innerHtml);
bool br_deleted = false;
foreach (HtmlNode nd in nodes.SelectNodes(inner))
{
foreach (HtmlNode child in nd.ChildNodes)
{
if (child.Name == "br")
{
int a = 0;
a++;
child.ParentNode.RemoveChild(child);
br_deleted = true;
}
if(child.Name=="span")
{
int b = 0;
b++;
if (br_deleted == true)
{
//nd.ParentNode.RemoveChild(child);
child.Remove();
br_deleted = false;
}
}
}
}
but I cannot remove the child, do you have any idea?
I founded where the problem came from: When selecting the good node, I needed to add the Headers so i could identify the encoding.
string innerHtml = "//div[#id='" + tag + "']";
string inner = "//p";
webbrowser.Navigate("about:blank");
LoadDocument();
HtmlNode nodes = document.DocumentNode.SelectSingleNode(innerHtml);
HtmlNode head = document.DocumentNode.SelectSingleNode("/html/head");
head.AppendChild(nodes);
webbrowser.NavigateToString(head.InnerHtml);

How to convert HTML to plain text? [duplicate]

I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping , but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just download the HTML Agility Pack from CodePlex, and ran the Html2Txt project. What a disappointment (at least the module that does html to text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like the Html2Txt # W3C produced. Too bad that source doesn't seem to be available.
I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction i wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occassionly, so i'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
Just a note about the HtmlAgilityPack for posterity. The project contains an example of parsing text to html, which, as noted by the OP, does not handle whitespace at all like anyone writing HTML would envisage. There are full-text rendering solutions out there, noted by others to this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), #"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a hyperlink.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this:
public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}
Updated
Thanks for the comments I have updated to improve this function
I've heard from a reliable source that, if you're doing HTML parsing in .Net, you should look at the HTML agility pack again..
http://www.codeplex.com/htmlagilitypack
Some sample on SO..
HTML Agility pack - parsing tables
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other Text browsers...This is much harder to do than you would expect.
I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.
Instead I used that utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Have you tried http://www.aaronsw.com/2002/html2text/ it's Python, but open source.
Assuming you have well formed html, you could also maybe try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;
class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}
public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}
public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(#"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}
static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}
Because I wanted conversion to plain text with LF and bullets, I found this pretty solution on codeproject, which covers many conversion usecases:
Convert HTML to Plain Text
Yep, looks so big, but works fine.
The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's.
It shouldn't be too hard to extend this to tables.
Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:
var a = "This & that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();
Using the two together you can get the plain text from the HTML.
Another post suggests the HTML agility pack:
This is an agile HTML parser that
builds a read/write DOM and supports
plain XPATH or XSLT (you actually
don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is
a .NET code library that allows you to
parse "out of the web" HTML files. The
parser is very tolerant with "real
world" malformed HTML. The object
model is very similar to what proposes
System.Xml, but for HTML documents (or
streams).
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
This function convert "What You See in the browser" to plain text with line breaks. (If you want to see result in the browser just use commented return value)
public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}
I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/
I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML Source. The HTML source will of course need to be valid XML first
Try the easy and usable way: just call StripHTML(WebBrowserControl_name);
public string StripHTML(WebBrowser webp)
{
try
{
doc.execCommand("SelectAll", true, null);
IHTMLSelectionObject currentSelection = doc.selection;
if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";
}
In Genexus You can made with Regex
&pattern = '<[^>]+>'
&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")
In Genexus possiamo gestirlo con Regex,
If you are using .NET framework 4.5 you can use System.Net.WebUtility.HtmlDecode() which takes a HTML encoded string and returns a decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
You can use WebBrowser control to render in memory your html content. After LoadCompleted event fired...
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
This is another solution to convert HTML to Text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free, this is commercial product and it is my own product.

Removing jquery and CSS from an Xml Document

I'm using sgmlreader to convert HTML to XML. The output goes into a XmlDocument object, which I can then use the InnerText method to extract the plain text from the website. I'm trying to get the text to look as clean as possible, by removing any javascript. Looping through the xml and removing any <script type="text/javascript"> is easy enough, but I've hit a brick wall when any jquery or styling isn't encapsulated in any tags. Can anybody help me out?
Sample Code:
Step one:
Once I use the webclient class to download the HTML, I save it, then open the file with the text reader class.
Step two:
Create sgmlreader class and set the input stream to the text reader:
// setup SGMLReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = reader;
// create document
doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
Step three:
Once I have a xmldocument, I use the doc.InnerText to get my plain text.
Step four:
I can easy remove JavaScript tags like so:
XmlNodeList nodes = document.GetElementsByTagName("text/javascript");
for (int i = nodes.Count - 1; i >= 0; i--)
{
nodes[i].ParentNode.RemoveChild(nodes[i]);
}
Some stuff still slips through. Heres an example of an ouput for one particular website I'm scriping:
Criminal and Civil Enforcement | Fraud | Office of Inspector General | U.S. Department of Health and Human Services
#fancybox-right {
right:-20px;
}
#fancybox-left {
left:-20px;
}
#fancybox-right:hover span, #fancybox-right span
#fancybox-right:hover span, #fancybox-right span {
left:auto;
right:0;
}
#fancybox-left:hover span, #fancybox-left span
#fancybox-left:hover span, #fancybox-left span {
right:auto;
left:0;
}
#fancybox-overlay {
/* background: url('/connections/images/wc-overlay.png'); */
/* background: url('/connections/images/banner.png') center center no-repeat; */
}
$(document).ready(function(){
$("a[rel=photo-show]").fancybox({
'titlePosition' : 'over',
'overlayColor' : '#000',
'overlayOpacity' : 0.9
});
$(".title-under").fancybox({
'titlePosition' : 'outside',
'overlayColor' : '#000',
'overlayOpacity' : 0.9
})
});
That jquery and styling needs to be removed.
I just threw this together in LinqPad based on the html of this page and it properly removes the script and style tags.
void Main()
{
string htmlPath = #"C:\Users\Jschubert\Desktop\html\test.html";
var sgmlReader = new Sgml.SgmlReader();
var stringReader = new StringReader(File.ReadAllText(htmlPath));
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = stringReader;
// create document
var doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
List<XmlNode> nodes = doc.GetElementsByTagName("script")
.Cast<XmlNode>().ToList();
var byType = doc.SelectNodes("script[#type = 'text/javascript']")
.Cast<XmlNode>().ToList();
var style = doc.GetElementsByTagName("style").Cast<XmlNode>().ToList();
nodes.AddRange(byType);
nodes.AddRange(style);
for (int i = nodes.Count - 1; i >= 0; i--)
{
nodes[i].ParentNode.RemoveChild(nodes[i]);
}
doc.DumpFormatted();
stringReader.Close();
sgmlReader.Close();
}
Casting to XmlNode to use the generic list is not ideal, but I did it for the sake of space and demonstration.
Also, you shouldn't need both
doc.GetElementsByTagName("script") and
doc.SelectNodes("script[#type = 'text/javascript']").
Again, I did that for the sake of demonstration.
If you have other scripts and you only want to remove JavaScript, use the latter. If you're removing all script tags, use the first one. Or, use both if you want.

How can I Convert HTML to Text in C#?

I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping , but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just download the HTML Agility Pack from CodePlex, and ran the Html2Txt project. What a disappointment (at least the module that does html to text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like the Html2Txt # W3C produced. Too bad that source doesn't seem to be available.
I was looking to see if there is a more "canned" solution available.
EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction i wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occassionly, so i'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
Just a note about the HtmlAgilityPack for posterity. The project contains an example of parsing text to html, which, as noted by the OP, does not handle whitespace at all like anyone writing HTML would envisage. There are full-text rendering solutions out there, noted by others to this question, which this is not (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
public static string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc (HtmlDocument doc)
{
using (StringWriter sw = new StringWriter())
{
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
foreach (HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText)
{
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
{
string html;
switch (node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
{
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
{
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0)
{
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
{
html= html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), #"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
{
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name)
{
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten)
{
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href"))
{
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
{
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if(textInfo.ListIndex>0)
{
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
}
else
{
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt"))
{
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src"))
{
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes)
{
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
}
if (endElementString != null)
{
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo
{
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
{
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace {get;set;}
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper
{
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper)
{
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper)
{
return new BoolWrapper{ Value = boolWrapper };
}
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a hyperlink.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this:
public static string StripHTML(string HTMLText, bool decode = true)
{
Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
var stripped = reg.Replace(HTMLText, "");
return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}
Updated
Thanks for the comments I have updated to improve this function
I've heard from a reliable source that, if you're doing HTML parsing in .Net, you should look at the HTML agility pack again..
http://www.codeplex.com/htmlagilitypack
Some sample on SO..
HTML Agility pack - parsing tables
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other Text browsers...This is much harder to do than you would expect.
I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.
Instead I used that utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Have you tried http://www.aaronsw.com/2002/html2text/ it's Python, but open source.
Assuming you have well formed html, you could also maybe try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;
class Html2TextExample
{
public static string Html2Text(XDocument source)
{
var writer = new StringWriter();
Html2Text(source, writer);
return writer.ToString();
}
public static void Html2Text(XDocument source, TextWriter output)
{
Transformer.Transform(source.CreateReader(), null, output);
}
public static XslCompiledTransform _transformer;
public static XslCompiledTransform Transformer
{
get
{
if (_transformer == null)
{
_transformer = new XslCompiledTransform();
var xsl = XDocument.Parse(#"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
_transformer.Load(xsl.CreateNavigator());
}
return _transformer;
}
}
static void Main(string[] args)
{
var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
var text = Html2Text(html);
Console.WriteLine(text);
}
}
Because I wanted conversion to plain text with LF and bullets, I found this pretty solution on codeproject, which covers many conversion usecases:
Convert HTML to Plain Text
Yep, looks so big, but works fine.
The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's.
It shouldn't be too hard to extend this to tables.
Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:
var a = "This & that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();
Using the two together you can get the plain text from the HTML.
Another post suggests the HTML agility pack:
This is an agile HTML parser that
builds a read/write DOM and supports
plain XPATH or XSLT (you actually
don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is
a .NET code library that allows you to
parse "out of the web" HTML files. The
parser is very tolerant with "real
world" malformed HTML. The object
model is very similar to what proposes
System.Xml, but for HTML documents (or
streams).
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
This function convert "What You See in the browser" to plain text with line breaks. (If you want to see result in the browser just use commented return value)
public string HtmlFileToText(string filePath)
{
using (var browser = new WebBrowser())
{
string text = File.ReadAllText(filePath);
browser.ScriptErrorsSuppressed = true;
browser.Navigate("about:blank");
browser?.Document?.OpenNew(false);
browser?.Document?.Write(text);
return browser.Document?.Body?.InnerText;
//return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
}
}
I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/
I have recently blogged on a solution that worked for me by using a Markdown XSLT file to transform the HTML Source. The HTML source will of course need to be valid XML first
Try the easy and usable way: just call StripHTML(WebBrowserControl_name);
public string StripHTML(WebBrowser webp)
{
try
{
doc.execCommand("SelectAll", true, null);
IHTMLSelectionObject currentSelection = doc.selection;
if (currentSelection != null)
{
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
if (range != null)
{
currentSelection.empty();
return range.text;
}
}
}
catch (Exception ep)
{
//MessageBox.Show(ep.Message);
}
return "";
}
In Genexus You can made with Regex
&pattern = '<[^>]+>'
&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")
In Genexus possiamo gestirlo con Regex,
If you are using .NET framework 4.5 you can use System.Net.WebUtility.HtmlDecode() which takes a HTML encoded string and returns a decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
You can use WebBrowser control to render in memory your html content. After LoadCompleted event fired...
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
This is another solution to convert HTML to Text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free, this is commercial product and it is my own product.

Categories