I have some HTML that I would like to parse.
I have written the code below:
HtmlAgilityPack.HtmlWeb web5 = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://www.analytics4.co.uk/pdf.js/web/viewer.html?file=http://www.analytics4.co.uk/pdf.js/pdf/w15639.pdf");
//var divs5 = doc5.DocumentNode.SelectNodes("//div[id='viewerContainer']").SelectMany(x => x.Descendants("div"));
// HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://google.co.uk");
HtmlNodeCollection tl = doc5.DocumentNode.SelectNodes("//div[@id='viewerContainer']//div[@id='viewer']//");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Console.WriteLine(node.InnerHtml);
    Console.WriteLine(node.OuterHtml);
}
The result I get for Inner HTML is just
<div id="viewer" class="pdfViewer"></div>
which doesn't make sense. Could anyone explain how I can go deeper into the inner divs? Any help is appreciated.
To go deeper, you can use these techniques:
foreach (var node in tl)
{
    var a = node.ChildNodes[2]; // a is the third child node of node (index is zero-based)
    var b = node.SelectSingleNode("./div[3]"); // b is the third "div" element
    // among node's children. The "./" in XPath means "from the current node"
}
Good luck!
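As a minimal sketch of walking every nested level at once, here is the same idea on a small inline document. The sample markup, the class name DescendDemo and the method NestedDivIds are made up for illustration and are not taken from the page in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DescendDemo
{
    // Returns the id of every <div> nested anywhere below the <div> with the given id.
    public static List<string> NestedDivIds(string html, string containerId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches, so guard before descending.
        HtmlNode container =
            doc.DocumentNode.SelectSingleNode("//div[@id='" + containerId + "']");
        if (container == null)
            return new List<string>();

        // Descendants("div") visits all levels below the container, not just direct children.
        return container.Descendants("div")
                        .Select(d => d.GetAttributeValue("id", ""))
                        .ToList();
    }

    public static void Main()
    {
        const string html =
            "<div id='viewerContainer'><div id='viewer'><div id='page1'></div></div></div>";
        foreach (string id in NestedDivIds(html, "viewerContainer"))
            Console.WriteLine(id);
    }
}
```

Keep in mind that Html Agility Pack only parses the markup it is given and does not run scripts. If the inner divs are created by JavaScript in the browser, they will not be in the loaded document, however deep you look.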
I have an HtmlDocument that may or may not have a proper <head> and <body> section, or might just be an HTML fragment. Either way, I want to run it through a function that will ensure that it has a (more) proper HTML structure.
I know that I can check if it has a body by seeing if
doc.DocumentNode.SelectSingleNode("//body");
is null. If it does not have a body, how would I wrap the contents of doc.DocumentNode in a <body> element and assign it back to the HtmlDocument?
Edit: There seems to be some confusion about what I want to do. In jQuery terms:
$doc = $(document);
if( !$doc.has('body') ) {
    $doc.wrapInner('body');
}
Basically, if there is no body element, put a body element around everything.
You could do something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(MyTestHtm);
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body == null)
{
    HtmlNode html = doc.DocumentNode.SelectSingleNode("//html");
    // we presume html exists
    body = CloneAsParentNode(html.ChildNodes, "body");
}

static HtmlNode CloneAsParentNode(HtmlNodeCollection nodes, string name)
{
    // snapshot the nodes, because the parent's child collection changes below
    List<HtmlNode> clones = new List<HtmlNode>(nodes);
    HtmlNode parent = nodes[0].ParentNode;
    // create a new parent with the given name
    HtmlNode newParent = nodes[0].OwnerDocument.CreateElement(name);
    // insert before the first node in the selection
    parent.InsertBefore(newParent, nodes[0]);
    // clone all sub nodes
    foreach (HtmlNode node in clones)
    {
        HtmlNode clone = node.CloneNode(true);
        newParent.AppendChild(clone);
    }
    // remove all sub nodes
    foreach (HtmlNode node in clones)
    {
        parent.RemoveChild(node);
    }
    return newParent;
}
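The code above presumes an <html> element exists. For the bare-fragment case from the question, a rough sketch of the same clone-and-replace idea applied to the document root could look like this. The names BodyWrapper and EnsureBody are hypothetical, and the sketch deliberately ignores documents that have <html> but no <body>.

```csharp
using System;
using HtmlAgilityPack;

public static class BodyWrapper
{
    // Hypothetical helper: wraps all top-level nodes of a fragment in a new <body>.
    public static void EnsureBody(HtmlDocument doc)
    {
        // Nothing to do when a body is already present.
        if (doc.DocumentNode.SelectSingleNode("//body") != null)
            return;

        HtmlNode body = doc.CreateElement("body");

        // Deep-clone each top-level node into the new body element.
        foreach (HtmlNode child in doc.DocumentNode.ChildNodes)
            body.AppendChild(child.CloneNode(true));

        // Replace the original top-level nodes with the single body element.
        doc.DocumentNode.RemoveAllChildren();
        doc.DocumentNode.AppendChild(body);
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<p>one</p><p>two</p>");
        EnsureBody(doc);
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

When an <html> element is present, wrapping its children as in the answer above is the better fit, since this sketch would otherwise put the <html> element itself inside the new <body>.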
I need to select the nodes at the root of the document, and only the first level.
The HTML I'm parsing can be anything...
For example:
<p><img...></p>
<p><img...></p>
<ul><li><a>somelink</a></li></ul>
<a>...
What I tried was /*[following-sibling::*], but that only selects the two <p><img...></p>...
What I want is to select the first level, as in p, p, ul, a.
I'm using Html Agility Pack and my code looks like this:
var nodeCollection = new List<HtmlNode>();
var document = new HtmlDocument();
document.LoadHtml(html);
if (document.DocumentNode != null)
{
    foreach (var node in document.DocumentNode.SelectNodes("/*[following-sibling::*]"))
    {
        nodeCollection.Add(node);
    }
}
Does anyone know what I'm doing wrong?
I think I found a working XPath, and it was not so hard.
The following worked: /child::*
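The predicate [following-sibling::*] excludes the last element because nothing follows it, which is why dropping the predicate helps. As a hedged alternative that avoids XPath altogether, the sketch below filters the root's child nodes by node type. The class name TopLevelDemo, the method TopLevelNames and the sample markup are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TopLevelDemo
{
    // Returns the names of the element nodes directly under the document root.
    public static List<string> TopLevelNames(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Filter on NodeType to skip the text and comment nodes between elements.
        return document.DocumentNode.ChildNodes
                       .Where(n => n.NodeType == HtmlNodeType.Element)
                       .Select(n => n.Name)
                       .ToList();
    }

    public static void Main()
    {
        const string html =
            "<p>first</p>\n<p>second</p>\n<ul><li><a>somelink</a></li></ul>\n<a>last</a>";
        Console.WriteLine(string.Join(" ", TopLevelNames(html)));
    }
}
```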
I want to get all values of the 'id' attribute of 'span' tags with Html Agility Pack.
But instead of the attributes I got the tags themselves. Here's the code:
private static IEnumerable<string> GetAllID()
{
    HtmlDocument sourceDocument = new HtmlDocument();
    sourceDocument.Load(FileName);
    var nodes = sourceDocument.DocumentNode.SelectNodes(
        @"//span/@id");
    return nodes.Nodes().Select(x => x.Name);
}
I'd appreciate it if someone could tell me what's wrong here.
Try:
var nodes = sourceDocument.DocumentNode.SelectNodes("//span[@id]");
List<string> ids = new List<string>();
// SelectNodes returns null when nothing matches, so check before touching nodes
if (nodes != null)
{
    foreach (var node in nodes)
    {
        if (node.Id != null)
            ids.Add(node.Id);
    }
}
return ids;
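The same result can be sketched with LINQ instead of XPath. Descendants never returns null, so no null check is needed. The class name SpanIdDemo and the method SpanIds are hypothetical, and the sketch parses a string rather than the FileName used in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SpanIdDemo
{
    // Collects the non-empty id attribute values of all <span> elements.
    public static List<string> SpanIds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("span")
                  // The second argument is the fallback used when the attribute is missing.
                  .Select(s => s.GetAttributeValue("id", ""))
                  .Where(id => id.Length > 0)
                  .ToList();
    }

    public static void Main()
    {
        const string html =
            "<span id='a'>x</span><span>y</span><div><span id='b'>z</span></div>";
        foreach (string id in SpanIds(html))
            Console.WriteLine(id);
    }
}
```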
I'm trying to retrieve a specific image from an HTML document, using Html Agility Pack and this XPath:
//div[@id='topslot']/a/img/@src
As far as I can see, it finds the src attribute, but it returns the img tag. Why is that?
I would expect InnerHtml, InnerText or something similar to be set, but both are empty strings. OuterHtml is set to the complete img tag.
Is there any documentation for Html Agility Pack?
You can directly grab the attribute if you use the HtmlNodeNavigator instead.
//Load document from some html string
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(htmlContent);
//Load navigator for current document
HtmlNodeNavigator navigator = (HtmlNodeNavigator)hdoc.CreateNavigator();
//Get value from given xpath
string xpath = "//div[@id='topslot']/a/img/@src";
string val = navigator.SelectSingleNode(xpath).Value;
Html Agility Pack does not support attribute selection.
You may use the method "GetAttributeValue".
Example:
//[...] code before needs to load a html document
HtmlAgilityPack.HtmlDocument htmldoc = e.Document;
//get all nodes "a" matching the XPath expression
HtmlNodeCollection AllNodes = htmldoc.DocumentNode.SelectNodes("*[@class='item']/p/a");
//show a messagebox for each node found that shows the content of attribute "href"
foreach (var MensaNode in AllNodes)
{
    string url = MensaNode.GetAttributeValue("href", "not found");
    MessageBox.Show(url);
}
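Applied to the image from the question, the idea is to select the img element rather than its attribute and then read src from the node. This is only a sketch. The class name ImageSrcDemo, the method ImageSrc and the sample markup are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class ImageSrcDemo
{
    // Returns the src of the image inside div#topslot, or null when no such image exists.
    public static string ImageSrc(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the element itself; the XPath stops at img instead of /@src.
        HtmlNode img = doc.DocumentNode.SelectSingleNode("//div[@id='topslot']/a/img");
        if (img == null)
            return null;

        // The empty string is the fallback when img has no src attribute.
        return img.GetAttributeValue("src", "");
    }

    public static void Main()
    {
        const string html = "<div id='topslot'><a href='#'><img src='logo.png'></a></div>";
        Console.WriteLine(ImageSrc(html));
    }
}
```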
Html Agility Pack will support it soon.
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=204342
Reading and Writing Attributes with Html Agility Pack
You can both read and set attributes in Html Agility Pack. This example selects the <html> tag, checks whether it has a 'lang' (language) attribute, and if so reads the attribute and then writes a new value to it.
In the examples below, "this.All" in doc.LoadHtml(this.All) is a string representation of an HTML document.
Read and write:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
for (int i = 0; i < nodes.Count; i++)
{
    if (nodes[i] != null && nodes[i].Attributes.Count > 0 && nodes[i].Attributes.Contains("lang"))
    {
        language = nodes[i].Attributes["lang"].Value; //Get attribute
        nodes[i].Attributes["lang"].Value = "en-US"; //Set attribute
    }
}
Read only:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
foreach (HtmlNode a in nodes)
{
    if (a != null && a.Attributes.Count > 0 && a.Attributes.Contains("lang"))
    {
        language = a.Attributes["lang"].Value;
    }
}
I used the following to obtain an attribute of an image. Note that it returns the attribute object (or null if there is none); read its Value property to get the string.
var MainImageString = MainImageNode.Attributes.Where(i => i.Name == "src").FirstOrDefault();
You can specify the attribute name to get its value. If you don't know the attribute name, set a breakpoint after you have fetched the node and inspect its attributes by hovering over it.
Hope this helps.
I just faced this problem and solved it using the GetAttributeValue method. (QuerySelector and QuerySelectorAll are not part of the core Html Agility Pack; they come from a CSS selector extension package.)
//Selecting all tbody elements
IList<HtmlNode> nodes = doc.QuerySelectorAll("div.characterbox-main")[1]
    .QuerySelectorAll("div table tbody");
//Iterating over them and getting the src attribute value of img elements.
var data = nodes.Select((node) =>
{
    return new
    {
        name = node.QuerySelector("tr:nth-child(2) th a").InnerText,
        imageUrl = node.QuerySelector("tr td div a img")
            .GetAttributeValue("src", "default-url")
    };
});