XPath Select Siblings where sibling parent is root document - c#

I need to select the siblings of root and only for the first level.
I'm selecting html and it can be anyting...
ex:
<p><img...></p>
<p><img...></p>
<ul><li><a>somelink</a></li></ul>
<a>...
what i tryed was this /*[following-sibling::*] but that only selects the two <p><img...></p>...
what i want is to select the first level as in P P ul a.
Im useing Html Agillity Pack and my code looks like this:
var nodeCollection = new List<HtmlNode>();
var document = new HtmlDocument();
document.LoadHtml(html);
if (document.DocumentNode != null)
{
foreach (var node in document.DocumentNode.SelectNodes("/*[following-sibling::*]"))
{
nodeCollection.Add(node);
}
Anyone know what im doing wrong ?

I think i found a working xpath and it was not so hard xD
the following worked: /child::*

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML, I used the following code but I get all of it in one item instead of getting the items individually
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive 1 message for the first item and the second message displays all items
When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[#class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[#class='search-result']") -> retrieves the div that has all the search results and ignores the rest of the document.
Iterates through all the "subnodes" only that have href in it and adds it to a list. Subnodes are determined based on the dot notation SelectNodes(".//a") (Instead of .//, if you do //, it will search the entire page which is not what you want).
If statement makes sure its only adding unique non-null values.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link doc.DocumentNode.SelectNodes("//a[#href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

HTML Agility Issue

I have a the HTML code which I would like to parse.
I have written the code below:
HtmlAgilityPack.HtmlWeb web5 = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://www.analytics4.co.uk/pdf.js/web/viewer.html?file=http://www.analytics4.co.uk/pdf.js/pdf/w15639.pdf");
//var divs5 = doc5.DocumentNode.SelectNodes("//div[id='viewerContainer']").SelectMany(x => x.Descendants("div"));
// HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://google.co.uk");
HtmlNodeCollection tl = doc5.DocumentNode.SelectNodes("//div[#id='viewerContainer']//div[#id='viewer']//");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
Console.WriteLine(node.InnerHtml);
Console.WriteLine(node.OuterHtml);
}
The result I get for Inner HTML is just
<div id="viewer" class="pdfViewer"></div>
and it doesn't make sense. Could anyone explain me how can I go deeper and deeper to the inner divs and so on? Please guys...I need your help.
To go deeper you can use this techniques:
foreach (var node in tl){
var a = node.ChildNodes[2]; // a is the third child of node
var b = node.SelectSingleNode("./div[3]"); // b is the third "div"
// element in node children. The "./" in XPath means "from current node"
}
Good luck!

Trouble reading iTunes XML feed

I am trying to read an XML feed from http://itunes.apple.com/us/rss/topsongs/limit=10/genre=2/xml.
I want to access the fields like this:
<im:price amount="1.29000" currency="USD">$1.29</im:price>
<im:releaseDate label="December 31, 1960">1960-12-31T16:00:00-07:00</im:releaseDate>
Here is what I have done so far:
var xml = "http://itunes.apple.com/us/rss/topsongs/limit=10/genre=2/xml";
XmlDocument doc = new XmlDocument();
doc.Load(xml);
XmlNodeList items = doc.SelectNodes("//entry");
foreach (var item in items) {
// Do something with item.
}
No luck, though. items is null. Why? What am I doing wrong?
You need to create a namespace manager to map the RSS and also the iTunes custom tags namespace URIs to short prefixes (itunes and im in the example below):
var xml = "http://itunes.apple.com/us/rss/topsongs/limit=10/genre=2/xml";
XmlDocument doc = new XmlDocument();
doc.Load(xml);
var namespaceManager = new XmlNamespaceManager(doc.NameTable);
namespaceManager.AddNamespace("itunes", "http://www.w3.org/2005/Atom");
namespaceManager.AddNamespace("im", "http://itunes.apple.com/rss");
XmlNodeList items = doc.SelectNodes("//itunes:entry", namespaceManager);
foreach (XmlNode item in items)
{
var price = item.SelectSingleNode("im:price", namespaceManager);
var releaseDate = item.SelectSingleNode("im:releaseDate", namespaceManager);
if (price != null)
{
Console.WriteLine(price.Attributes["amount"].InnerText);
}
if (releaseDate != null)
{
Console.WriteLine(releaseDate.Attributes["label"].InnerText);
}
}
For that specific feed you should get 10 entries.
It's in the docs as well:
If the XPath expression does not include a prefix, it is assumed that
the namespace URI is the empty namespace. If your XML includes a
default namespace, you must still use the XmlNamespaceManager and add
a prefix and namespace URI to it; otherwise, you will not get any
nodes selected. For more information, see Select Nodes Using XPath
Navigation.
Alternatively you can use a namespace-agnostic XPath (from here):
XmlNodeList items = doc.SelectNodes("//*[local-name() = 'entry']");
Finally, not sure why you said items is null. It cannot be. When running your original code you should get this:

Gettig Htmlelement based on HtmlAgilityPack.HtmlNode

I use HtmlAgilityPack to parse the html document of a webbrowser control.
I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to retun the corresponding HtmlElement in the WebbrowserControl.Document.
In fact HtmlAgilityPack parse an offline copy of the live document, while I want to access live elements of the webbrowser Control to access some rendered attributes like currentStyle or runtimeStyle
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p");
// this selection could be more sophisticated
// and the answer shouldn't relay on it.
foreach (HtmlNode node in some_nodes)
{
HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
// CorrespondingElementFromWebBrowserControl is what I am searching for
}
If the element had a specific attribute it could be easy but I want a solution which works on any element.
Please help me what can I do about it.
In fact there seems to be no direct possibility to change the document directly in the webbroser control.
But you can extract the html from it, mnipulate it and write it back again like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
node.Attributes.Add("TEST", "TEST");
}
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
doc.Save(sw);
webBrowser1.DocumentText = sb.ToString();
}
For direct manipulation you can maybe use the unmanaged pointer webBrowser1.Document.DomDocument to the document, but this is outside of my knowledge.
HtmlAgilityPack definitely can't provide access to nodes in live HTML directly. Since you said there is no distinct style/class/id on the element you have to walk through the nodes manually and find matches.
Assuming HTML is reasonably valid (so both browser and HtmlAgilityPack perform normalization similarly) you can walk pairs of elements starting from the root of both trees and selecting the same child node.
Basically you can build "position-based" XPath to node in one tree and select it in another tree. Xpath would look something like (depending you want to pay attention to just positions or position and node name):
"/*[1]/*[4]/*[2]/*[7]"
"/body/div[2]/span[1]/p[3]"
Steps:
In using HtmlNode you've found collect all parent nodes up to the root.
Get root of element of HTML in browser
for each level of children find position of corresponding child on HtmlNodes collection on step 1 in its parent and than find live HtmlElement among children of current live node.
Move to newly found child and go back to 3 till found node you are looking for.
the XPath attribute of the HtmlAgilityPack.HtmlNode shows the nodes on the path from root to the node. For example \div[1]\div[2]\table[0]. You can traverse this path in the live document to find the corresponding live element. However this path may not be precise as HtmlAgilityPack removes some tags like <form> then before using this solution add the omitted tags back using
HtmlNode.ElementsFlags.Remove("form");
struct DocNode
{
public string Name;
public int Pos;
}
///// structure to hold the name and position of each node in the path
The following method finds the live element according to the XPath
static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
{
var pattern = #"/(.*?)\[(.*?)\]"; // like div[1]
// Parse the XPath to extract the nodes on the path
var matches = Regex.Matches(node.XPath, pattern);
List<DocNode> PathToNode = new List<DocNode>();
foreach (Match m in matches) // Make a path of nodes
{
DocNode n = new DocNode();
n.Name = n.Name = m.Groups[1].Value;
n.Pos = Convert.ToInt32(m.Groups[2].Value)-1;
PathToNode.Add(n); // add the node to path
}
HtmlElement elem = null; //Traverse to the element using the path
if (PathToNode.Count > 0)
{
elem = doc.Body; //begin from the body
foreach (DocNode n in PathToNode)
{
//Find the corresponding child by its name and position
elem = GetChild(elem, n);
}
}
return elem;
}
the code for GetChild Method used above
public static HtmlElement GetChild(HtmlElement el, DocNode node)
{
// Find corresponding child of the elemnt
// based on the name and position of the node
int childPos = 0;
foreach (HtmlElement child in el.Children)
{
if (child.TagName.Equals(node.Name,
StringComparison.OrdinalIgnoreCase))
{
if (childPos == node.Pos)
{
return child;
}
childPos++;
}
}
return null;
}

HtmlAgilityPack set node InnerText

I want to replace inner text of HTML tags with another text.
I am using HtmlAgilityPack
I use this code to extract all texts
HtmlDocument doc = new HtmlDocument();
doc.Load("some path")
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']")) {
// How to replace node.InnerText with some text ?
}
But InnerText is readonly. How can I replace texts with another text and save them to file ?
Try code below. It select all nodes without children and filtered out script nodes. Maybe you need to add some additional filtering. In addition to your XPath expression this one also looking for leaf nodes and filter out text content of <script> tags.
var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(htmlNode.InnerText + "_translated"), htmlNode);
}
Strange, but I found that InnerHtml isn't readonly. And when I tried to set it like that
aElement.InnerHtml = "sometext";
the value of InnerText also changed to "sometext"
The HtmlTextNode class has a Text property* which works perfectly for this purpose.
Here's an example:
var textNodes = doc.DocumentNode.SelectNodes("//body/text()").Cast<HtmlTextNode>();
foreach (var node in textNodes)
{
node.Text = node.Text.Replace("foo", "bar");
}
And if we have an HtmlNode that we want to change its direct text, we can do something like the following:
HtmlNode node = //...
var textNode = (HtmlTextNode)node.SelectSingleNode("text()");
textNode.Text = "new text";
Or we can use node.SelectNodes("text()") in case it has more than one.
* Not to be confused with the readonly InnerText property.

Categories