I am trying to parse an HTML document to retrieve all of the footnotes inside it; the document contains dozens of them. I can't figure out which expressions to use to extract the content I want. The problem is that the classes (e.g. "calibre34") are randomized in every document. The only way to locate a footnote is to search for "hide"; the footnote text always comes afterwards and is closed with a </td> tag. Below is an example of one of the footnotes in the HTML document; all I want is the text. Any ideas? Thanks guys!
<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>
Use HtmlAgilityPack to load the HTML document and then extract the footnotes with this XPath:
//td[contains(., '[hide]')]/following-sibling::td[1]
It first selects every td node whose text content includes [hide]. In the sample that text sits inside a nested span and a element, so comparing the td's own text() with '[hide]' would not match. It then moves to the td immediately following each match. Once you have this collection of nodes, you can read their inner text (in C#, through the InnerText property that HtmlAgilityPack provides).
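Here is a minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (FootnoteExtractor, Extract) are made up for illustration. Because the sample keeps [hide] inside a nested span/a, the sketch matches on the cell's whole string value with contains(., '[hide]') and takes only the first following td.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, assumed to be installed

static class FootnoteExtractor // hypothetical helper name
{
    // Returns the text of every cell that directly follows a cell containing "[hide]".
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(
            "//td[contains(., '[hide]')]/following-sibling::td[1]");
        if (cells == null) return results;

        foreach (HtmlNode cell in cells)
        {
            // Decode entities such as &amp; and trim the surrounding whitespace.
            results.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
        }
        return results;
    }
}
```

If the document nests tables inside table cells, an outer td could also match the contains() test, so the predicate may need tightening for such layouts.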
How about using MSHTML to parse the HTML source?
Here is some demo code. Enjoy.
using System;
using System.Collections.Generic;
using System.Net;
using mshtml; // requires a reference to the Microsoft.mshtml assembly

public class CHtmlPraseDemo
{
    private string strHtmlSource;
    public mshtml.IHTMLDocument2 oHtmlDoc;

    public CHtmlPraseDemo(string url)
    {
        // Download the page, then feed the markup to an in-memory MSHTML document.
        GetWebContent(url);
        oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
        oHtmlDoc.write(strHtmlSource);
    }

    // Returns the inner HTML of every td whose class attribute equals TdClassName.
    public List<String> GetTdNodes(string TdClassName)
    {
        List<String> listOut = new List<string>();
        IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
        IHTMLElementCollection iec = (IHTMLElementCollection)ie.getElementsByTagName("td");
        foreach (IHTMLElement item in iec)
        {
            if (item.className == TdClassName)
            {
                listOut.Add(item.innerHTML);
            }
        }
        return listOut;
    }

    void GetWebContent(string strUrl)
    {
        WebClient wc = new WebClient();
        strHtmlSource = wc.DownloadString(strUrl);
    }
}

class Program
{
    static void Main(string[] args)
    {
        CHtmlPraseDemo oH = new CHtmlPraseDemo("http://stackoverflow.com/faq");
        Console.Write(oH.oHtmlDoc.title);
        List<string> l = oH.GetTdNodes("x");
        foreach (string n in l)
        {
            Console.WriteLine("new td");
            Console.WriteLine(n.ToString());
        }
        Console.Read();
    }
}
Related
I'm trying to iterate through each table, and extract the headers of each table separately. This is what I've got so far, but whenever I run this it seems to be extracting the header of all tables per loop (headerCount goes up to 61 on each iteration).
namespace DataCollection
{
    internal class Program
    {
        static void Main(string[] args)
        {
            int headerCount;
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load("https://en.wikipedia.org/wiki/List_of_actors_who_have_played_the_Doctor");
            //Extracting the tables from the HTML
            foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
            {
                headerCount = 0;
                //Extracting the header cells from each table
                foreach (HtmlNode headerCol in table.SelectNodes("//th"))
                {
                    headerCount++;
                    Console.WriteLine(headerCount);
                }
            }
            Console.ReadLine();
        }
    }
}
What am I doing wrong? Thanks in advance!
I ran into the same problem, and it isn't intuitive.
Calling SelectNodes("//th") searches the entire document again instead of only the selected HtmlNode, which is surprising.
Try using ".//th" instead.
The leading dot tells it to search within the current node rather than the whole HtmlDocument.
Hope it helps.
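To see the difference side by side, here is a small sketch that assumes the HtmlAgilityPack NuGet package is referenced. The two-table markup and the HeaderCountDemo/CountPerTable names are invented for the example. It evaluates the same XPath on each table node and records how many th nodes come back.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, assumed to be installed

static class HeaderCountDemo // hypothetical name
{
    // Illustrative markup: the first table has two header cells, the second has one.
    const string Html =
        "<table><tr><th>A</th><th>B</th></tr></table>" +
        "<table><tr><th>C</th></tr></table>";

    // Counts the header cells found for each table when xpath is evaluated on the table node.
    public static int[] CountPerTable(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
        var counts = new int[tables.Count];
        for (int i = 0; i < tables.Count; i++)
        {
            // SelectNodes returns null when nothing matches, hence the null-conditional.
            HtmlNodeCollection headers = tables[i].SelectNodes(xpath);
            counts[i] = headers?.Count ?? 0;
        }
        return counts;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", CountPerTable("//th")));  // absolute path: 3,3
        Console.WriteLine(string.Join(",", CountPerTable(".//th"))); // relative path: 2,1
    }
}
```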
I've converted HTML into a PDF and created a table of contents from the header tags. The bookmarks in the document work fine, but clicking a line in the table of contents doesn't do anything. The cursor doesn't change icon the way it does when I put a URL in the link.
I used iText RUPS to inspect the final PDF, and the named destinations are in the final file.
I tried hard-coding a couple of the names just to see what happens, but they didn't work either. Putting in .CreateURL and google.com works fine.
The one thing I'm doing that may or may not be an issue is that I create the body document, then create the table of contents, and then merge the two documents.
Maybe Bruno can make a cameo on this one.
private static List ProcessOutlineChildren(PdfDocument pdfDocument, List tableOfContents, IEnumerable<PdfOutline> pdfOutlines, IDictionary<String, PdfObject> names = null)
{
    List<TabStop> tabStops = new List<TabStop>();
    tabStops.Add(new TabStop(580, TabAlignment.RIGHT));
    foreach (var o in pdfOutlines)
    {
        ListItem currentOutlineItem = new ListItem();
        Paragraph paragraph = new Paragraph();
        paragraph.AddTabStops(tabStops);
        paragraph.Add(o.GetTitle());
        paragraph.Add(new Tab());
        paragraph.Add((pdfDocument.GetPageNumber((PdfDictionary) o.GetDestination().GetDestinationPage(names))).ToString());
        paragraph.SetAction(PdfAction.CreateGoTo(o.GetDestination()));
        currentOutlineItem.Add(paragraph);
        if (o.GetAllChildren().Any())
        {
            currentOutlineItem.Add(ProcessOutlineChildren(pdfDocument, new List(), o.GetAllChildren(), names));
        }
        tableOfContents.Add(currentOutlineItem);
    }
    return tableOfContents;
}
public class CustomOutlineHandler : OutlineHandler
{
    //PDFs require a unique name for destinations; this is how the actions/bookmarks jump to a location.
    protected override string GenerateUniqueDestinationName(IElementNode element)
    {
        string destinationName = base.GenerateUniqueDestinationName(element);
        if ("p".Equals(element.Name()))
        {
            destinationName = destinationName.Replace(GetDestinationNamePrefix(), "paragraph-prefix-");
        }
        return destinationName;
    }
}

//From my main method converting things into PDF.
OutlineHandler customOutlineHandler = new CustomOutlineHandler().PutAllTagPriorityMappings(priorityMappings);
customOutlineHandler.SetDestinationNamePrefix("destination-name-");
properties.SetOutlineHandler(customOutlineHandler);
I need help because I am not really used to working with HTML. I display a web document from my code; the web document reads an HTML file that contains some images.
Every time, just before the image tag, I see two tags that produce some wrong characters. An example will make this clearer.
<p ><br clear=all> </span>
<img border=0 width=265 height=105 id="Picture 84856"
src="Test_HTML/image272.jpg"></p>
The output is partially correct: it shows the images, but also a lot of wrong ÂÂÂÂÂÂÂÂÂ characters.
So I decided to try to strip those tags.
I don't know how to do this. Perhaps I am completely wrong, but I think it is a good start, isn't it?
My attempt to remove these tags from an HTML node is:
public void ShowTag(string tag)
{
    string innerHtml = "//div[@id='" + tag + "']";
    string inner = "//p";
    string brToRemove = "//br";
    string spanToRemove = "//span";
    var nodes = document.DocumentNode.SelectSingleNode(innerHtml);
    bool br_deleted = false;
    foreach (HtmlNode nd in nodes.SelectNodes(inner))
    {
        foreach (HtmlNode child in nd.ChildNodes)
        {
            if (child.Name == "br")
            {
                int a = 0;
                a++;
                child.ParentNode.RemoveChild(child);
                br_deleted = true;
            }
            if (child.Name == "span")
            {
                int b = 0;
                b++;
                if (br_deleted == true)
                {
                    //nd.ParentNode.RemoveChild(child);
                    child.Remove();
                    br_deleted = false;
                }
            }
        }
    }
}
but I cannot remove the child. Do you have any idea?
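One possible cause of the failing removal, independent of the stray characters: as far as I can tell, removing a node from ChildNodes while a foreach is still enumerating that same collection invalidates the enumerator. A common workaround is to iterate over a snapshot. The sketch below assumes the HtmlAgilityPack NuGet package; the TagStripper and StripBreaksAndSpans names are invented, and it simplifies the logic above by removing every br and span child instead of tracking br_deleted.

```csharp
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

static class TagStripper // hypothetical name
{
    // Removes every <br> and <span> that is a direct child of a <p> under the given root.
    public static void StripBreaksAndSpans(HtmlNode root)
    {
        // ".//p" keeps the search inside root; "//p" would scan the whole document.
        HtmlNodeCollection paragraphs = root.SelectNodes(".//p");
        if (paragraphs == null) return; // SelectNodes yields null when nothing matches

        foreach (HtmlNode p in paragraphs)
        {
            // ToList() takes a snapshot, so removing nodes does not disturb the loop.
            foreach (HtmlNode child in p.ChildNodes.ToList())
            {
                if (child.Name == "br" || child.Name == "span")
                {
                    child.Remove();
                }
            }
        }
    }
}
```

Note that Remove() drops the node together with its children, so any text inside a removed span disappears with it.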
I found where the problem came from: when selecting the right node, I needed to add the headers (the document's head element) so that the encoding could be identified.
string innerHtml = "//div[@id='" + tag + "']";
string inner = "//p";
webbrowser.Navigate("about:blank");
LoadDocument();
HtmlNode nodes = document.DocumentNode.SelectSingleNode(innerHtml);
HtmlNode head = document.DocumentNode.SelectSingleNode("/html/head");
head.AppendChild(nodes);
webbrowser.NavigateToString(head.InnerHtml);
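A plausible explanation for the Â characters, assuming the HTML file is UTF-8 and contains non-breaking spaces: each non-breaking space is stored as two bytes, and a reader that assumes a single-byte encoding shows the first byte as Â. The sketch below reproduces the effect with the standard library only; MojibakeDemo and MisdecodeUtf8AsLatin1 are names invented for the example.

```csharp
using System;
using System.Text;

static class MojibakeDemo // hypothetical name
{
    // Encodes text as UTF-8, then decodes those bytes as if they were ISO-8859-1.
    public static string MisdecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    static void Main()
    {
        // U+00A0 (non-breaking space) becomes the two bytes 0xC2 0xA0 in UTF-8.
        string garbled = MisdecodeUtf8AsLatin1("\u00A0");
        Console.WriteLine(garbled.Length);  // 2
        Console.WriteLine((int)garbled[0]); // 194, which is the code point of 'Â'
    }
}
```

This is why making the charset declaration in the head available to the browser control can make the stray characters go away without removing any tags.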
I'm trying to update a site that uses a sanitizer based on AngleSharp to process user-generated HTML content. The site's users need to be able to embed iframes, and I am trying to use a whitelist to control which domains a frame can load. I'd like to rewrite the 'blocked' iframes to a new custom element, "blocked-iframe", that will then be stripped out by the sanitizer, so we can review whether other domains need to be added to the whitelist.
I'm trying to use a solution based on this answer: https://stackoverflow.com/a/55276825/794
It looks like so:
string BlockIFrames(string content)
{
    var parser = new HtmlParser(new HtmlParserOptions { });
    var doc = parser.Parse(content);
    foreach (var element in doc.QuerySelectorAll("iframe"))
    {
        var src = element.GetAttribute("src");
        if (string.IsNullOrEmpty(src) || !Settings.Sanitization.IFrameWhitelist.Any(wls => src.StartsWith(wls)))
        {
            var newElement = doc.CreateElement("blocked-iframe");
            foreach (var attr in element.Attributes)
            {
                newElement.SetAttribute(attr.Name, attr.Value);
            }
            element.Insert(AdjacentPosition.BeforeBegin, newElement.OuterHtml);
            element.Remove();
        }
    }
    return doc.FirstElementChild.OuterHtml;
}
It ostensibly works, but I notice that the angle brackets in the new element's tag are being escaped on insertion, so the result just gets written into the page as text. I think I could build a map of replacements and execute them against the string before sending it back, but I'm wondering if there's a way to do it using AngleSharp's API. The site is using 0.9.9 currently, and I'm not sure how far ahead we'll be able to update considering some of the other dependencies in play.
Digging around in the source, I found the ReplaceChild method in INode, which works if called from the parent of element:
string BlockIFrames(string content)
{
    var parser = new HtmlParser(new HtmlParserOptions { });
    var doc = parser.Parse(content);
    foreach (var element in doc.QuerySelectorAll("iframe"))
    {
        var src = element.GetAttribute("src");
        if (string.IsNullOrEmpty(src) ||
            !Settings.Sanitization.IFrameWhitelist.Any(wls => src.StartsWith(wls)))
        {
            var newElement = doc.CreateElement("blocked-iframe");
            foreach (var attr in element.Attributes)
            {
                newElement.SetAttribute(attr.Name, attr.Value);
            }
            element.Parent.ReplaceChild(newElement, element);
        }
    }
    return doc.FirstElementChild.OuterHtml;
}
I will keep testing, but this seems decent enough to me. If there is a better way, I'd love to hear it.
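One thing worth checking in the whitelist test itself: a prefix comparison such as src.StartsWith("https://www.youtube.com") would also accept "https://www.youtube.com.evil.example/". The sketch below compares the parsed host instead. It uses only the standard library; the IFrameSourcePolicy class, the IsAllowed method, and the host list are all hypothetical, since the original Settings.Sanitization.IFrameWhitelist appears to hold prefixes rather than host names.

```csharp
using System;
using System.Collections.Generic;

static class IFrameSourcePolicy // hypothetical name
{
    // Hypothetical host list, used here in place of the prefix-based whitelist.
    static readonly HashSet<string> AllowedHosts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "www.youtube.com",
            "player.vimeo.com",
        };

    public static bool IsAllowed(string src)
    {
        // Reject missing, relative, or unparsable URLs outright.
        if (!Uri.TryCreate(src, UriKind.Absolute, out Uri uri)) return false;
        // This sketch only accepts https sources.
        if (uri.Scheme != Uri.UriSchemeHttps) return false;
        // Compare the whole host name, not a string prefix of the URL.
        return AllowedHosts.Contains(uri.Host);
    }
}
```

In BlockIFrames, the condition could then become !IFrameSourcePolicy.IsAllowed(src), assuming the whitelist can be expressed as host names.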
I've been trying to get either an <object> or an <embed> tag using:
HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
This doesn't seem to work.
Can anyone please tell me how to get these tags and their InnerHtml?
A YouTube embedded video looks like this:
<embed height="385" width="640" type="application/x-shockwave-flash"
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">
I got a feeling the JavaScript might stop the swf player from working, hope not...
Cheers
Update 2010-08-26 (in response to OP's comment):
I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:
string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.
Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
YouTubeScraper
OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.
class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);
        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];
            string javascript = scriptNode.InnerHtml;
            int objectNodeLocation = javascript.IndexOf("<object");
            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);
                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
                if (objectNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
                    string unescaped = Unescape(finalEscapedHtml);
                    var objectDoc = new HtmlDocument();
                    objectDoc.LoadHtml(unescaped);
                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
                    return objectNode;
                }
            }
        }
        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);
        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];
            string javascript = scriptNode.InnerHtml;
            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
            if (approxEmbedNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);
                int embedNodeEndLocation = htmlStart.IndexOf(">\";");
                if (embedNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
                    string unescaped = Unescape(finalEscapedHtml);
                    var embedDoc = new HtmlDocument();
                    embedDoc.LoadHtml(unescaped);
                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
                    return videoEmbedNode;
                }
            }
        }
        return null;
    }

    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();
        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }
        HtmlNode root = doc.DocumentNode;
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");
        return scriptNodes;
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.
        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();
        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }
        return text;
    }
}
And in case you're interested, here's a little demo I threw together (super fancy, I know):
class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();
        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();
        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();
        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();
        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();
        Console.Write("Finished! Hit Enter to quit.");
        ReadLineAndExit();
    }

    static void ReadLineAndExit()
    {
        Console.ReadLine();
    }
}
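For what it's worth, the unescaping step can probably be written as a single replacement that keeps the character after each backslash. This is only a sketch using the standard library; JsHtmlUnescaper is an invented name, and the input string is a made-up fragment in the same style as the escaped markup described above. It does not interpret escapes such as \n or \uXXXX, which would need separate handling.

```csharp
using System;
using System.Text.RegularExpressions;

static class JsHtmlUnescaper // hypothetical name
{
    // Drops each backslash and keeps the character it escapes.
    public static string Unescape(string htmlFromJavascript)
    {
        return Regex.Replace(htmlFromJavascript, @"\\(.)", "$1");
    }

    static void Main()
    {
        // At runtime this string is: <embed id=\"movie_player\"><\/embed>
        string escaped = "<embed id=\\\"movie_player\\\"><\\/embed>";
        Console.WriteLine(Unescape(escaped)); // <embed id="movie_player"></embed>
    }
}
```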
Original Answer
Why not try using the element's Id instead?
HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.