Count specific child nodes with HtmlAgilityPack


I am having a lot of trouble with an XPath selection that I use with HtmlAgilityPack.
I want to select all li elements (if they exist) nested in another li that has an a tag with id="menuItem2".
This is a sample of the HTML:
<div id="menu">
<ul>
<li><a id="menuItem1"></a></li>
<li><a id="menuItem2"></a>
<ul>
<li><a id="menuSubItem1"></a></li>
<li><a id="menuSubItem2"></a></li>
</ul>
</li>
<li><a id="menuItem3"></a></li>
</ul>
</div>
This is the XPath I have been using. When I drop the /ul/li part, it returns the a tag I wanted, but I need the nested li elements. With /ul/li included, this XPath always returns null.
string xpathExp = "//a[@id='" + parentIdHtml + "']/ul/li";
HtmlNodeCollection liNodes = htmlDoc.DocumentNode.SelectNodes(xpathExp);

The following XPath should work.
string xpathExp = "//li/a[@id='" + parentIdHtml + "']/following-sibling::ul/li";

Try this for your xpath:
string xpathExp = "//li[a/@id='" + parentIdHtml + "']/ul/li";
The problem is that you were selecting the a node itself, which has no ul children. You need to select the li node first, and filter on its a child.
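Since the goal in the title is to count the nested items, here is a minimal sketch of how the XPath above might be used for that. It assumes the sample markup from the question and a HtmlAgilityPack package reference; the class name CountNestedItems is made up for illustration. The one detail worth noting is that SelectNodes returns null rather than an empty collection when nothing matches, so the count is guarded.

```csharp
using System;
using HtmlAgilityPack;

class CountNestedItems
{
    static void Main()
    {
        // Sample markup taken from the question.
        const string html = @"
<div id=""menu"">
  <ul>
    <li><a id=""menuItem1""></a></li>
    <li><a id=""menuItem2""></a>
      <ul>
        <li><a id=""menuSubItem1""></a></li>
        <li><a id=""menuSubItem2""></a></li>
      </ul>
    </li>
    <li><a id=""menuItem3""></a></li>
  </ul>
</div>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        string parentIdHtml = "menuItem2";

        // Select the li that has the wanted <a> child, then its nested li elements.
        string xpathExp = "//li[a/@id='" + parentIdHtml + "']/ul/li";
        HtmlNodeCollection liNodes = htmlDoc.DocumentNode.SelectNodes(xpathExp);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        int count = liNodes?.Count ?? 0;

        Console.WriteLine(count);
    }
}
```

Treating a null result as a count of zero covers the "if they exist" part of the question without a separate existence check.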

XPath can get messy. Since you're using HtmlAgilityPack, you might as well leverage LINQ.
// Find the li that has a child <a> with the wanted id.
HtmlNode li = htmlDoc.DocumentNode.Descendants("li").FirstOrDefault(n => n.ChildNodes.Any(a => a.Name.Equals("a") && a.Id.Equals("menuItem2", StringComparison.InvariantCultureIgnoreCase)));
IEnumerable<HtmlNode> liNodes = null;
if (li != null)
{
// Node found, get all the descendant <li> elements
liNodes = li.Descendants("li");
}

From your description I think you want to select the two <li> elements that contain <a> tags with ids menuSubItem1 and menuSubItem2?
If so then this is what you need
//li[a/@id="menuItem2"]//li

Related

How to find parent of two elements that match CssSelectors?

Given the following (generic) dynamic HTML structure:
<ol id="myOrderedList">
<li id="someGuidICantPredict">
<span data-serial="someData1">someData1</span>
<span data-manufacturer="someDataB1">someDataB1</span>
</li>
//(repeated many times with different data)
</ol>
How do I find the following:
Find the <li> where the spans match by CssSelector for both data-serial and data-manufacturer?
I know how to do this for one or the other span tag thusly:
By.CssSelector($"#olCurrentTanks li span[data-serial={serial1}]")
or
By.CssSelector($"#olCurrentTanks li span[data-manufacturer={manufacturer1}]")
But I don't know how to find the parent <li> element where both spans match. That is, I need to get the IWebElement listItem where both spans' data attributes match the corresponding data, which I can predict.
Edit: Difficulty: Okay to use x-path to get the li parent but not to find the spans.

With XPath you can get the li element that has specific span children:
//li[./span[@data-serial="someData1"] and ./span[@data-manufacturer="someDataB1"]]
The selector below returns all matching li elements as a list with FindElements, and only the first one with FindElement:
By.XPath("//li[./span[@data-serial='someData1'] and ./span[@data-manufacturer='someDataB1']]")
Code examples:
IList<IWebElement> allMyLi = driver.FindElements(By.XPath($"//li[./span[@data-serial='{serial1}'] and ./span[@data-manufacturer='{manufacturer1}']]"));
foreach (var myLi in allMyLi)
{
IWebElement serial = myLi.FindElement(By.CssSelector($"span[data-serial={serial1}]"));
IWebElement manufacturer = myLi.FindElement(By.CssSelector($"span[data-manufacturer={manufacturer1}]"));
Console.WriteLine("serial, manufacturer: {0}, {1}", serial.Text, manufacturer.Text);
}

Combining tags from children in Umbraco

I have a blog section on an Umbraco site, where I want to get all tags from each blog item and combine them in a list without duplicates, so that I can use the tag list as a filter.
I have this section where tags will be listed
<ul id="blogTags" class="inline-list">
<li class="tag-item">Tag 1</li>
<li class="tag-item">Tag 2</li>
<li class="tag-item">Tag 3</li>
<li class="tag-item">Tag 4</li>
</ul>
On my BlogItem doctype I have a field tagsList where the editor can input a comma-separated list of tags.
So basically I want to get all tags from all BlogItems and combine them into a list where duplicates are removed.
I am getting all blog items using:
var blogItems = Umbraco.TypedContent(Model.Content.Id).Children.Where(x => x.DocumentTypeAlias == "BlogItem" && x.IsVisible());
But I am not sure how to get all tags, combine them, and remove duplicates.

One way of doing it is to create a HashSet to store the tags and use foreach to add them.
You can do something like:
HashSet<string> uniqueTagList = new HashSet<string>();
var blogItems = Umbraco.TypedContent(Model.Content.Id).Children.Where(x => x.DocumentTypeAlias == "BlogItem" && x.IsVisible());
var tags = blogItems.Select(x => x.GetPropertyValue<string>("tagsList"));
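foreach(var tag in tags)
{
// Skip blog items that have no tags
if (string.IsNullOrWhiteSpace(tag)) continue;
var splitTag = tag.Split(',');
foreach(var singleTag in splitTag)
{
// Trim so that "a, b" and "a,b" yield the same tags
uniqueTagList.Add(singleTag.Trim());
}
}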
Then uniqueTagList contains every distinct tag, which you can use to render your list:
<ul id="blogTags" class="inline-list">
@foreach(var tag in uniqueTagList)
{
<li class="tag-item">@tag</li>
}
</ul>
But beware that if you have a lot of children, this can take some time.
So I would suggest checking out the Umbraco tag data type for something like this:
https://shermandigital.com/blog/display-umbraco-tags-on-razor-templates/
https://shermandigital.com/blog/get-umbraco-content-by-tag/
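The split-and-deduplicate step can also be written as a single LINQ pipeline. The following is only a sketch in plain C#, independent of Umbraco: the CollectUniqueTags helper and the sample values are hypothetical, and the case-insensitive comparison and alphabetical ordering are assumptions about what a tag filter would want.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class UniqueTags
{
    // Splits each comma-separated value, trims the pieces, and removes duplicates.
    static List<string> CollectUniqueTags(IEnumerable<string> tagLists)
    {
        return tagLists
            .Where(list => !string.IsNullOrWhiteSpace(list))    // skip items with no tags
            .SelectMany(list => list.Split(','))                // one entry per tag
            .Select(tag => tag.Trim())                          // "a, b" and "a,b" match
            .Where(tag => tag.Length > 0)                       // drop empty pieces
            .Distinct(StringComparer.OrdinalIgnoreCase)         // assumption: ignore case
            .OrderBy(tag => tag, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    static void Main()
    {
        // Hypothetical values standing in for each blog item's tagsList property.
        var tagLists = new string[] { "news, umbraco", null, "Umbraco,razor ", "" };

        foreach (string tag in CollectUniqueTags(tagLists))
        {
            Console.WriteLine(tag);
        }
    }
}
```

In the view, the input would be the `tags` sequence from the answer above rather than a hard-coded array.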

Rather than a comma separated list, you could use the built in Umbraco tagging controls, which should allow you to do what you're after.
There are API methods for the tags that allow you to get all the tags for specific groups etc. So you could set up a tag property editor with a group called "blog" that you can assign to your blog posts, and then you can use the Tags API to pull out all of the unique tags from that group to build your cloud.

How to get href elements and attributes for each node?

I am working on a project that should read HTML, find all nodes that match a value, and then find elements and attributes of the located nodes.
I am having difficulty figuring out how to get the href attributes and elements, though.
I am using HtmlAgilityPack.
I have numerous nodes of
class="middle"
throughout the html. I need to get all of them, and from them, get the href element and attributes. Below is a sample of the html:
<div class="top">
<div class="left">
<a href="item123">
<img src="url.png" border="0" />
</a>
</div>
</div>
<div class="middle">
<div class="title">Captains Hat</div>
<div class="day">monday</div>
<div class="city">Tuscon, AZ | 100 Days | <script typs="text/javascript">document.write(ts_to_age_min(1445620427));</script></div>
</div>
I have been able to get the other attributes I need, but not for 'href'.
Here is the code I have:
List<string> listResults = new List<string>();
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(url);
//get each listing
foreach (HtmlNode node in doc.DocumentNode.Descendants("div").Where(d =>
d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("middle")))
{
string day = node.SelectSingleNode(".//*[contains(@class,'day')]").InnerHtml;
string city = node.SelectSingleNode(".//*[contains(@class,'city')]").InnerHtml;
string item = node.SelectSingleNode("//a").Attributes["href"].Value;
listResults.Add(day + Environment.NewLine
+ city + Environment.NewLine
+ item + Environment.NewLine + Environment.NewLine);
}
However, my code above gives me the first href value for the whole HTML page, and gives it for every node (visible by outputting the list to a message box). I thought that, inside my foreach loop, SelectSingleNode would return the first href attribute for that specific node. If so, why am I getting the first href attribute for the whole loaded page?
I've been going through lots of threads on here about getting href values with HtmlAgilityPack, but I haven't been able to get this to work.
How can I get the href attribute and elements for each node I'm selecting based on the class attribute (class="middle")?

Try replacing
string item = node.SelectSingleNode("//a").Attributes["href"].Value;
with
string item = node.SelectSingleNode(".//a").Attributes["href"].Value;
An XPath that starts with // is evaluated from the document root, even when it is called on a child node; the leading . makes it relative to the current node.
Other than that, the code above works for me.
Alternatively:
string item = node.SelectSingleNode(".//*[contains(@class,'title')]")
.Descendants("a").FirstOrDefault().Attributes["href"].Value;
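To see the difference between the two expressions side by side, here is a small sketch. The markup is hypothetical (two listings, each containing its own link, which is not the layout of the question's sample), and the class name RelativeXPathDemo is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class RelativeXPathDemo
{
    static void Main()
    {
        // Hypothetical markup: two listings, each with its own link.
        const string html = @"
<div class=""middle""><a href=""item1"">first</a></div>
<div class=""middle""><a href=""item2"">second</a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection listings = doc.DocumentNode.SelectNodes("//div[contains(@class,'middle')]");
        if (listings == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in listings)
        {
            // "//a" is absolute: it searches from the document root on every iteration.
            HtmlNode fromRoot = node.SelectSingleNode("//a");
            // ".//a" is relative: it searches only below the current node.
            HtmlNode fromNode = node.SelectSingleNode(".//a");

            Console.WriteLine("//a  -> " + fromRoot?.GetAttributeValue("href", "(none)"));
            Console.WriteLine(".//a -> " + fromNode?.GetAttributeValue("href", "(none)"));
        }
    }
}
```

SelectSingleNode also returns null when there is no match, so reading Attributes["href"] directly, as in the snippets above, will throw if a listing has no link; the null-conditional call here avoids that.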

XPath giving different results in browser and HtmlAgilityPack

I am attempting to parse a section of a webpage using HtmlAgilityPack in a C# program. Below is a simplified version of this section of the page (edited 1/30/2015 2:40PM EST):
<html>
<body>
<div id="main-box">
<div>
<div>...</div>
<div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
<a href="/some/other/path">
<img src="/path/to/img" />
</a>
</p>
<p>
...
Correct extra text
</p>
</div>
<div>
...
<p>
<ul>
...
<li>
<span>
Never Selected
and Never Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
No "a" tag this time
</p>
</div>
<div>
<p>
<ul>
<li>
<span>
<span style="display:none;">
Never Selected
</span>
</span>
</li>
<li>
<span>
Correct
and Wrongly Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
<span>
Correct
</span>
</p>
<p>
...
Wrongly Selected extra text
</p>
</div>
<div>
<p>
<ul>
...
<li>
<span>
Never Selected
and Never Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
</div>
</div>
</div>
</body>
</html>
I am attempting to get the first, and only the first, "a" tag with the GET parameter "a" in the 3rd or 4th child div of each div with the class "row-box" (the ones with the word "Correct" in them in the above example). I came up with the following XPath, which gets these nodes and only these nodes in both Chrome's inspector and the FirePath add-on for Firefox (wrapped for legibility):
//div[@id="main-box"]/div/div[2]/div[contains(@class, "row-box")]/div[
(position() = 3 or position() = 4) and descendant::a[
contains(@href, "a=")
]
][1]/descendant::a[contains(@href, "a=")][1]
However, when I load this page using HttpWebRequest, load the response stream into an HtmlDocument object, and call SelectNodes(xpath) on its DocumentNode property using this XPath, it returns not only the three correct nodes, but also the two tags with the text "Wrongly Selected" in the example above. I noticed that this is effectively the same as if I were to use the XPath above, except without the last "[1]", like this (wrapped for legibility):
//div[@id="main-box"]/div/div[2]/div[contains(@class, "row-box")]/div[
(position() = 3 or position() = 4) and descendant::a[
contains(@href, "a=")
]
][1]/descendant::a[contains(@href, "a=")]
I have made sure that I am using the latest version of HtmlAgilityPack, attempted several variations on my XPath to determine if maybe it was hitting some arbitrary maximum length or other simple issues like that, and tried to research similar issues without success. I tried throwing together an even simpler HTML structure using the same basic concept to test, but couldn't reproduce the issue with that, so I suspect that it may be some subtle issue with how HtmlAgilityPack parses something in this structure.
If anyone knows what might cause this issue, or has a better way to write an XPath expression that will get the correct nodes and hopefully not cause issues in HtmlAgilityPack, I would be greatly appreciative.
EDIT
As suggested, here is a simplified version of the C# code I'm using, which I have confirmed does reproduce the problem for me.
using System;
using System.Net;
using HtmlAgilityPack;
...
static void Main(string[] args)
{
string url = "http://www.deerso.com/test.html";
string xpath = "//div[@id=\"main-box\"]/div/div[2]/div[contains(@class, \"row-box\")]/div[(position() = 3 or position() = 4) and descendant::a[contains(@href, \"a=\")]][1]/descendant::a[contains(@href, \"a=\")][1]";
int statusCode;
string htmlText;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.Accept = "text/html,*/*";
request.Proxy = new WebProxy();
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0";
using (var response = (WebResponse)request.GetResponse())
{
statusCode = (int)((HttpWebResponse)response).StatusCode;
using (var stream = response.GetResponseStream())
{
if (stream != null)
{
using (var reader = new System.IO.StreamReader(stream))
{
htmlText = reader.ReadToEnd();
}
}
else
{
Console.WriteLine("Request to '{0}' failed, response stream was null", url);
htmlText = null;
return;
}
}
}
HtmlNode.ElementsFlags.Remove("form"); //fix for forms
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
foreach (HtmlNode node in nodes)
{
Console.WriteLine("Node Found:");
Console.WriteLine("Text: {0}", node.InnerText);
Console.WriteLine("Href: {0}", node.Attributes["href"].Value);
Console.WriteLine();
}
Console.WriteLine("Done!");
}

New answer based on the updated HTML
We can't use the //a[contains(@href,'a=')][1] filter, since that selects the first matching <a> element within each direct parent.
We need to add parentheses so that the descendant step is included in the filter, i.e.
(//a[contains(@href,'a=')])[1]
However, if we expand that to apply the first-descendant filter to each node in another node-set, the resulting XPath expression is invalid:
//div[contains(@class,'row-box')](//a[contains(@href,'a=')])[1]
I think we need to break it into two steps:
Get the group of div elements containing the particular link we want.
Get the first descendant link element from each element in that group
In C# this looks like:
// Get the <div> elements we know are ancestors to the <a> elements we want
HtmlNodeCollection topDivs = doc.DocumentNode.SelectNodes("//a[contains(@href,'?a=')]/ancestor::div[contains(@class,'row-box')]");
// Create a new list to hold the <a> elements
List<HtmlNode> linksWeWant = new List<HtmlNode>(topDivs.Count);
// Iterate through the <div> elements and get the first descendant
foreach(var div in topDivs)
{
// The leading . keeps the search inside the current <div> rather than the whole document
linksWeWant.Add(div.SelectSingleNode("(.//a[contains(@href,'?a=')])[1]"));
}
Old Answer
Using this page as a guide I put together the xpath expression:
When I run it in HtmlAgilityPack I'm getting only these three elements returned:
<a href = "/test/path?a=123">
<a href = "/test/path?a=abc&b=123">
<a href = "/test/path?a=ghi">
Here's a breakdown of the expression:
//div[contains(@class,'row-box')] -> Get the node-set of <div class="*row-box*"> elements
/descendant::a -> From here get all descendant <a> elements
[contains(@href,'a=') and position()=1] -> Filter according to the href value and the element being the first descendant
I believe the key difference from the XPath in your question is /descendant::a[contains(@href,'a=') and position()=1] vs /descendant::a[contains(@href,'a=')][1]. Applying the [1] separately filters as the first child instead of the first descendant.

I am attempting to get the first and only the first "a" tag with the GET parameter "a" in the 3rd or 4th child div of each div with the class "row-box"
I don't think such a query is possible in a single XPath expression. It would be quite easy in XQuery:
for $rowBox in //div[contains(@class, 'row-box')]
let $firstRelevant := ($rowBox/div[
(position() = 3 or position() = 4)
and .//a[contains(@href, 'a=')]
])[1]
return ($firstRelevant//a[contains(@href, 'a=')])[1]
But the amount of predicate grouping (i.e. (...)[...]) that is going on here exceeds the expressive capabilities of XPath.
Selecting the result in multiple steps in C# would be the way to go, in much the same way XQuery does it:
for each //div[contains(@class, 'row-box')]:
select ./div[(position() = 3 or position() = 4) and .//a[contains(@href, 'a=')]]
for the first one:
select .//a[contains(@href, 'a=')]
take the first one
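The multi-step outline above could be sketched in C# with HtmlAgilityPack roughly as follows. This is an illustration only: the RowBoxLinks class and FirstLinks method are hypothetical names, and the sketch assumes the document has already been loaded into an HtmlDocument. It relies on SelectSingleNode returning the first match in document order, and on each inner expression starting with . so that it stays relative to the current node.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class RowBoxLinks
{
    // Predicate shared by both steps: links whose href contains "a=".
    const string LinkTest = "contains(@href, 'a=')";

    // For each row-box div, returns the first matching link inside the first
    // 3rd-or-4th child div that contains such a link.
    public static List<HtmlNode> FirstLinks(HtmlDocument doc)
    {
        var result = new List<HtmlNode>();

        HtmlNodeCollection rowBoxes =
            doc.DocumentNode.SelectNodes("//div[contains(@class, 'row-box')]");
        if (rowBoxes == null) return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode rowBox in rowBoxes)
        {
            // Step 1: the first 3rd-or-4th child div that has a matching link below it.
            HtmlNode firstRelevant = rowBox.SelectSingleNode(
                "./div[(position() = 3 or position() = 4) and .//a[" + LinkTest + "]]");
            if (firstRelevant == null) continue;

            // Step 2: the first matching link inside that div.
            HtmlNode link = firstRelevant.SelectSingleNode(".//a[" + LinkTest + "]");
            if (link != null) result.Add(link);
        }

        return result;
    }
}
```

It would be called as `RowBoxLinks.FirstLinks(doc)` after `doc.LoadHtml(htmlText)`. Splitting the selection this way avoids the nested `(...)[1]` grouping that a single expression would need.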

Getting li values from multiple ul's using HtmlAgilityPack C#

This query works perfectly for some countries, like Germany:
"//h2[span/@id='Cities' or span/@id='Other_destinations']" + "/following-sibling::ul[1]" + "/li";
Where the HTML is formatted as:
<h2>
<span id='Other_destination'></span>
</h2>
<ul>
<li>...</li>
<li>...</li>
<li>...</li>
<li>...</li>
</ul>
However, in a country like Afghanistan the Div is formatted as such:
<h2>
<span id='Other_destination'></span>
</h2>
<ul>
<li>...</li>
</ul>
<ul>
<li>...</li>
</ul>
So the question becomes: how do I handle a country like Afghanistan, where "/following-sibling::ul[1]" + "/li" only gets the first ul under the 'Other_destinations' heading? I hope that getting a handle on this will help with the other exceptions and formatting issues that I will come across in other countries. Thank you.

I hope this code solves your problem:
var xpath = "//ul[preceding-sibling::h2[span/@id='Cities' or span/@id='Other_destinations'] and following-sibling::h2[span/@id='Get_in']]" + "/li";
var doc = new HtmlDocument
{
OptionDefaultStreamEncoding = Encoding.UTF8
};
// You need to call a WebClient here and set to the html variable.
var html = String.Empty;
doc.LoadHtml(html);
using (var write = new StreamWriter("testText.txt"))
{
foreach (var node in doc.DocumentNode.SelectNodes(xpath))
{
var all = node.InnerText;
//Writes to text file
write.WriteLine(all);
}
}
The above XPath can be read as:
Select all the ul tags that sit between an h2[span/@id='Cities' or span/@id='Other_destinations'] and an h2[span/@id='Get_in'].
I noticed that all the pages have a span tag with id='Get_in' near the end.
I hope this solves your problem.
