Navigate to a section of XML with XPath - C#

I can't see where I'm going wrong with my XPath logic.
Here is a section of a larger XML document that I am traversing (note: I'm using the Html Agility Pack):
<div>
<div></div>
<span class="pp-headline-item pp-headline-phone">
<span class="telephone" dir="ltr">
<nobr>(732) 562-1312</nobr>
<span class="pp-headline-phone-label" style="display:none">()</span>
</span>
</span>
<span> · </span>
<span class="pp-headline-item pp-headline-authority-page">
<span>
<a href="http://maps.google.com/local_url?q=http://www.fed.com/q=07746+pizza">
<span>fed.com</span>
</a>
</span>
</span>
</div>
My goal is to extract various data points from chunks of XML like this one, which I pull out of the master XML file using:
.SelectNodes("//div/span['pp-headline-item pp-headline-phone']/../..")
With this I expect to get all the sections like the one outlined above, so I can iterate over them and extract things like website, phone, address...
The problem is that when I iterate over this node set, I can't get to the data points I want, as if the node set were not the one outlined above.
My plan is to select a node set starting at the topmost div and, while iterating over it, use XPath to reach the data points I want.
I do it like this:
foreach (HtmlNode n in BuizRowsgoogMaps)
{
//get phone number
if (n.SelectSingleNode("span/nobr").InnerHtml != null)
{
strPhone = n.SelectSingleNode("span/nobr").InnerHtml;
//get phone site
strSite = n.SelectSingleNode("//span['pp-headline-item pp-headline-authority-page']/span/a/span").InnerHtml;
}
}
I suspect my XPath expressions don't mesh together to give me what I want, but when I validate the combined expression I get the desired results. I used the following to validate my thinking and it works, which leaves me at my wits' end:
//div/span['pp-headline-item pp-headline-phone']/../../span['pp-headline-item pp-headline-phone']/span/nobr

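Your code is almost right; you just need to modify your XPath a bit.
foreach (HtmlNode n in BuizRowsgoogMaps)
{
    // get phone number, searching below the current node only
    HtmlNode phoneNode = n.SelectSingleNode(".//span/nobr");
    if (phoneNode != null)
    {
        strPhone = phoneNode.InnerHtml;
        // get site, again relative to the current node
        strSite = n.SelectSingleNode(".//span[@class='pp-headline-item pp-headline-authority-page']/span/a/span").InnerHtml;
    }
}
The leading .// makes the XPath relative to the current node. An expression that starts with // always searches from the document root, no matter which node you call it on.
Two smaller points: SelectSingleNode returns null when nothing matches, so test the node itself rather than its InnerHtml. And a predicate such as ['pp-headline-item pp-headline-phone'] is just a non-empty string literal, which is always true; to filter by class it has to be written as [@class='...'].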
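The difference between // and .// can be seen in a small, self-contained sketch. It uses System.Xml's XmlDocument instead of the Html Agility Pack so that it runs without extra packages; the markup, the class name RelativeXPathDemo and the helper PhonesPerDiv are made up for illustration, and the XPath behaviour shown is standard XPath 1.0.

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // Hypothetical sample markup: two listings, each with one phone number.
    private const string Sample =
        "<root>" +
        "<div><span class='phone'><nobr>111</nobr></span></div>" +
        "<div><span class='phone'><nobr>222</nobr></span></div>" +
        "</root>";

    // Evaluates the given XPath once per <div> and returns the text found each time.
    public static string[] PhonesPerDiv(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        XmlNodeList divs = doc.SelectNodes("//div");
        var result = new string[divs.Count];
        for (int i = 0; i < divs.Count; i++)
        {
            XmlNode hit = divs[i].SelectSingleNode(xpath);
            result[i] = hit == null ? null : hit.InnerText;
        }
        return result;
    }

    public static void Main()
    {
        // Absolute path: evaluated from the document root for every div.
        Console.WriteLine(string.Join(",", PhonesPerDiv("//span/nobr")));  // 111,111
        // Relative path: evaluated from the div currently being visited.
        Console.WriteLine(string.Join(",", PhonesPerDiv(".//span/nobr"))); // 111,222
    }
}
```

With the absolute path every iteration returns the first match in the whole document, which is the symptom described in the question.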

Related

How to access nested div based on class name

I have this html code:
<div class="searchResult webResult">
<div class="resultTitlePane">
Google
</div>
<div class="resultDisplayUrlPane">
www.google.com
</div>
<div class="resultDescription">
Search
</div>
</div>
I want to put the inner text of each inner div into a different variable.
I know that to access a div by its class I should write:
var titles = hd.DocumentNode.SelectNodes("//div[@class='searchResult webResult']");
foreach (HtmlNode node in titles)
{?}
What code should I write to get the inner text of each div into a different variable? Thanks.

I would extend the XPath expression you already have so that it matches the inner div elements:
//div[@class='searchResult webResult']/div[contains(@class, 'result')]
Then, to get the text, use the .InnerText property:
C# - Get the text inside tags using HTML Agility Pack
C#: HtmlAgilityPack extract inner text

Since you don't know how many nodes will be returned, I suggest using a list:
List<string> titlesStringList = new List<string>();
foreach (HtmlNode node in titles)
{
titlesStringList.Add(node.InnerText);
}
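If each inner div should end up in its own variable, one option is to select each child by its class, relative to the result node. The following is a minimal sketch, not the answerers' code: it loads the question's markup with System.Xml's XmlDocument so it runs without extra packages, and the names SearchResultDemo, ChildText and LoadFirstResult are hypothetical.

```csharp
using System;
using System.Xml;

public static class SearchResultDemo
{
    // Markup from the question, written on one line for this sketch.
    private const string Sample =
        "<div class=\"searchResult webResult\">" +
        "<div class=\"resultTitlePane\"> Google </div>" +
        "<div class=\"resultDisplayUrlPane\"> www.google.com </div>" +
        "<div class=\"resultDescription\"> Search </div>" +
        "</div>";

    // Hypothetical helper: trimmed text of the direct child div with the given class.
    public static string ChildText(XmlNode result, string className)
    {
        XmlNode child = result.SelectSingleNode("./div[@class='" + className + "']");
        return child == null ? string.Empty : child.InnerText.Trim();
    }

    // Loads the sample and returns the outer result div.
    public static XmlNode LoadFirstResult()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectSingleNode("//div[@class='searchResult webResult']");
    }

    public static void Main()
    {
        XmlNode result = LoadFirstResult();
        string title = ChildText(result, "resultTitlePane");
        string url = ChildText(result, "resultDisplayUrlPane");
        string description = ChildText(result, "resultDescription");
        Console.WriteLine(title + " | " + url + " | " + description);
    }
}
```

The same relative expressions (./div[@class='...']) should carry over to HtmlNode.SelectSingleNode inside the foreach loop from the question.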

How to get href elements and attributes for each node?

I am working on a project that should read HTML, find all nodes that match a value, and then find elements and attributes of the located nodes.
I am having difficulty figuring out how to get the href attributes and elements, though.
I am using HtmlAgilityPack.
I have numerous nodes of
class="middle"
throughout the HTML. I need to get all of them and, from them, get the href element and attributes. Below is a sample of the HTML:
<div class="top">
<div class="left">
<a href="item123">
<img src="url.png" border="0" />
</a>
</div>
</div>
<div class="middle">
<div class="title">Captains Hat</div>
<div class="day">monday</div>
<div class="city">Tuscon, AZ | 100 Days | <script typs="text/javascript">document.write(ts_to_age_min(1445620427));</script></div>
</div>
I have been able to get the other attributes I need, but not for 'href'.
Here is the code I have:
List<string> listResults = new List<string>();
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(url);
//get each listing
foreach (HtmlNode node in doc.DocumentNode.Descendants("div").Where(d =>
d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("middle")))
{
string day = node.SelectSingleNode(".//*[contains(@class,'day')]").InnerHtml;
string city = node.SelectSingleNode(".//*[contains(@class,'city')]").InnerHtml;
string item = node.SelectSingleNode("//a").Attributes["href"].Value;
listResults.Add(day + Environment.NewLine
+ city + Environment.NewLine
+ item + Environment.NewLine + Environment.NewLine);
}
However, my code above gives me the first href value of the whole HTML page, and it gives that same value for every node (visible when I output the list to a message box). I thought that, inside my foreach loop, SelectSingleNode would return the first href attribute for that specific node. Why am I getting the first href attribute of the whole page instead?
I've been through lots of threads here about getting href values with HtmlAgilityPack, but I haven't been able to get this to work.
How can I get the href attribute and elements for each node I'm selecting based on the class attribute (class="middle")?

Try replacing
string item = node.SelectSingleNode("//a").Attributes["href"].Value;
with
string item = node.SelectSingleNode(".//a").Attributes["href"].Value;
Other than that, code above works for me.
Alternatively:
string item = node.SelectSingleNode(".//*[contains(@class,'title')]")
.Descendants("a").FirstOrDefault().Attributes["href"].Value;

C# or VB .net HtmlAgility: how to get nested span from portion of html page

Example file (a portion of HTML page):
<span class="test">English
<span> failed</span>
<span class="retake">no</span>
</span>
HtmlAgilityPack Code Example:
node.SelectSingleNode(".//span[@class='test']").InnerText()
gives me:
English failed no
But I only need code that returns line 2:
<span> failed</span>

You can use Descendants to get the child nodes:
var node = doc.DocumentNode.SelectSingleNode(".//span[@class='test']");
if (node != null && node.Descendants("span").Any())
{
string result = node.Descendants("span").First().InnerText;
}
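An alternative is to let XPath pick the first nested span directly with a positional predicate. This is a sketch under assumptions: it parses the question's fragment with System.Xml's XmlDocument so it runs without extra packages, and the names NestedSpanDemo and FirstInnerSpanText are made up for illustration.

```csharp
using System;
using System.Xml;

public static class NestedSpanDemo
{
    // Fragment from the question, written on one line for this sketch.
    private const string Sample =
        "<span class=\"test\">English " +
        "<span> failed</span> " +
        "<span class=\"retake\">no</span>" +
        "</span>";

    // Returns the trimmed text of the first <span> directly inside span.test.
    public static string FirstInnerSpanText()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        // span[1] means: the first span child of the matched parent.
        XmlNode inner = doc.SelectSingleNode("//span[@class='test']/span[1]");
        return inner == null ? null : inner.InnerText.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(FirstInnerSpanText()); // failed
    }
}
```

When the context is an already selected node rather than the whole document, the relative form ./span[1] would be the equivalent expression.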

XPath giving different results in browser and HtmlAgilityPack

I am attempting to parse a section of a webpage using HtmlAgilityPack in a C# program. Below is a simplified version of this section of the page (edited 1/30/2015 2:40PM EST):
<html>
<body>
<div id="main-box">
<div>
<div>...</div>
<div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
<a href="/some/other/path">
<img src="/path/to/img" />
</a>
</p>
<p>
...
Correct extra text
</p>
</div>
<div>
...
<p>
<ul>
...
<li>
<span>
Never Selected
and Never Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
No "a" tag this time
</p>
</div>
<div>
<p>
<ul>
<li>
<span>
<span style="display:none;">
Never Selected
</span>
</span>
</li>
<li>
<span>
Correct
and Wrongly Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
<span>
Correct
</span>
</p>
<p>
...
Wrongly Selected extra text
</p>
</div>
<div>
<p>
<ul>
...
<li>
<span>
Never Selected
and Never Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
</div>
</div>
</div>
</body>
</html>
I am attempting to get the first, and only the first, "a" tag with the GET parameter "a" in the 3rd or 4th child div of each div with the class "row-box" (the ones with the word "Correct" in them in the above example). I came up with the following XPath, which gets these nodes and only these nodes in both Chrome's inspector and the FirePath add-on for Firefox (wrapped for legibility):
//div[@id="main-box"]/div/div[2]/div[contains(@class, "row-box")]/div[
(position() = 3 or position() = 4) and descendant::a[
contains(@href, "a=")
]
][1]/descendant::a[contains(@href, "a=")][1]
However, when I load this page using HttpWebRequest, load the response stream into an HtmlDocument object, and call SelectNodes(xpath) on its DocumentNode property with this XPath, it returns not only the three correct nodes but also the two tags with the text "Wrongly Selected" in the example above. I noticed that this is effectively the same result as the XPath above without the last "[1]", like this (wrapped for legibility):
//div[@id="main-box"]/div/div[2]/div[contains(@class, "row-box")]/div[
(position() = 3 or position() = 4) and descendant::a[
contains(@href, "a=")
]
][1]/descendant::a[contains(@href, "a=")]
I have made sure that I am using the latest version of HtmlAgilityPack, attempted several variations on my XPath to determine if maybe it was hitting some arbitrary maximum length or other simple issues like that, and tried to research similar issues without success. I tried throwing together an even simpler HTML structure using the same basic concept to test, but couldn't reproduce the issue with that, so I suspect that it may be some subtle issue with how HtmlAgilityPack parses something in this structure.
If anyone knows what might cause this issue, or has a better way to write an XPath expression that will get the correct nodes and hopefully not cause issues in HtmlAgilityPack, I would be greatly appreciative.
EDIT
As suggested, here is a simplified version of the C# code I'm using, which I have confirmed does reproduce the problem for me.
using System;
using System.Net;
using HtmlAgilityPack;
...
static void Main(string[] args)
{
string url = "http://www.deerso.com/test.html";
string xpath = "//div[@id=\"main-box\"]/div/div[2]/div[contains(@class, \"row-box\")]/div[(position() = 3 or position() = 4) and descendant::a[contains(@href, \"a=\")]][1]/descendant::a[contains(@href, \"a=\")][1]";
int statusCode;
string htmlText;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.Accept = "text/html,*/*";
request.Proxy = new WebProxy();
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0";
using (var response = (WebResponse)request.GetResponse())
{
statusCode = (int)((HttpWebResponse)response).StatusCode;
using (var stream = response.GetResponseStream())
{
if (stream != null)
{
using (var reader = new System.IO.StreamReader(stream))
{
htmlText = reader.ReadToEnd();
}
}
else
{
Console.WriteLine("Request to '{0}' failed, response stream was null", url);
htmlText = null;
return;
}
}
}
HtmlNode.ElementsFlags.Remove("form"); //fix for forms
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
foreach (HtmlNode node in nodes)
{
Console.WriteLine("Node Found:");
Console.WriteLine("Text: {0}", node.InnerText);
Console.WriteLine("Href: {0}", node.Attributes["href"].Value);
Console.WriteLine();
}
Console.WriteLine("Done!");
}

New answer based on updated HTML
We can't use the //a[contains(@href,'a=')][1] filter, since that selects every <a> element that is the first match within its own direct parent.
We need to add parentheses so that the [1] applies to the whole result of the descendant search, i.e.
(//a[contains(@href,'a=')])[1]
However, if we expand that to apply the first-descendant filter to each node in another node set, the resulting XPath expression is invalid:
//div[contains(@class,'row-box')](//a[contains(@href,'a=')])[1]
I think we need to break it into two steps:
1. Get the group of div elements containing the particular link we want.
2. Get the first descendant link element from each element in that group.
In C# this looks like:
// Get the <div> elements we know are ancestors of the <a> elements we want
HtmlNodeCollection topDivs = doc.DocumentNode.SelectNodes("//a[contains(@href,'?a=')]/ancestor::div[contains(@class,'row-box')]");
// Create a new list to hold the <a> elements
List<HtmlNode> linksWeWant = new List<HtmlNode>(topDivs.Count);
// Iterate through the <div> elements and get the first matching descendant of each
foreach (var div in topDivs)
{
    // The leading dot keeps the search inside the current <div>
    linksWeWant.Add(div.SelectSingleNode("(.//a[contains(@href,'?a=')])[1]"));
}
Old Answer
Using this page as a guide I put together the XPath expression:
//div[contains(@class,'row-box')]/descendant::a[contains(@href,'a=') and position()=1]
When I run it in HtmlAgilityPack I get only these three elements returned:
<a href = "/test/path?a=123">
<a href = "/test/path?a=abc&b=123">
<a href = "/test/path?a=ghi">
Here's a breakdown of the expression:
//div[contains(@class,'row-box')] -> Get the node set of <div class="*row-box*"> elements
/descendant::a -> From here, get all descendant <a> elements
[contains(@href,'a=') and position()=1] -> Filter according to the href value and the element being the first descendant
I believe the key difference from the XPath in your question is /descendant::a[contains(@href,'a=') and position()=1] vs /descendant::a[contains(@href,'a=')][1]. Applying the [1] separately filters for the first child instead of the first descendant.

I am attempting to get the first and only the first "a" tag with the GET parameter "a" in the 3rd or 4th child div of each div with the class "row-box"
I don't think such a query is possible in a single XPath expression. It would be quite easy in XQuery:
for $rowBox in //div[contains(@class, 'row-box')]
let $firstRelevant := ($rowBox/div[
    (position() = 3 or position() = 4)
    and .//a[contains(@href, 'a=')]
])[1]
return ($firstRelevant//a[contains(@href, 'a=')])[1]
But the amount of predicate grouping (i.e. (...)[...]) that is going on here exceeds the expressive capabilities of XPath.
Selecting the result in multiple steps in C# would be the way to go, in much the same way XQuery does it:
for each //div[contains(@class, 'row-box')]:
    select ./div[(position() = 3 or position() = 4) and .//a[contains(@href, 'a=')]]
    for the first one:
        select .//a[contains(@href, 'a=')]
        take the first one
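The steps above can be sketched in C# roughly as follows. This is an illustration under assumptions rather than the answerer's code: it uses System.Xml's XmlDocument so that it runs without extra packages, and the sample markup, href values and the names FirstLinkPerRowBox, SampleDoc and FirstHrefs are hypothetical. It relies on SelectSingleNode returning the first match in document order.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FirstLinkPerRowBox
{
    // Hypothetical markup: two row-box divs with links in the 3rd or 4th child div.
    public static XmlDocument SampleDoc()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<body>" +
            "<div class='x row-box'><div/><div/>" +
            "<div><p><a href='/p?a=1'/><a href='/p?a=2'/></p></div>" +
            "<div><a href='/p?a=3'/></div></div>" +
            "<div class='x row-box'><div/><div/>" +
            "<div><p>none</p></div>" +
            "<div><ul><li><a href='/q?b=0'/></li>" +
            "<li><a href='/q?a=4'/><a href='/q?a=5'/></li></ul></div></div>" +
            "</body>");
        return doc;
    }

    // For each row-box: first qualifying child div, then its first matching link.
    public static List<string> FirstHrefs(XmlDocument doc)
    {
        var hrefs = new List<string>();
        foreach (XmlNode rowBox in doc.SelectNodes("//div[contains(@class, 'row-box')]"))
        {
            // Step 1: the first 3rd-or-4th child div that contains a matching link.
            XmlNode firstRelevant = rowBox.SelectSingleNode(
                "./div[(position() = 3 or position() = 4) and .//a[contains(@href, 'a=')]]");
            if (firstRelevant == null) continue;

            // Step 2: the first matching link below that div (relative path).
            XmlNode link = firstRelevant.SelectSingleNode(".//a[contains(@href, 'a=')]");
            hrefs.Add(link.Attributes["href"].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", FirstHrefs(SampleDoc()))); // /p?a=1,/q?a=4
    }
}
```

Every expression evaluated inside the loop starts with a dot, so each step stays within the node selected by the previous one.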

C# HtmlAgilityPack Xpath problems, trouble finding H4 innertext

I have a method that finds everything I am looking for in a section of a webpage, except that I am stuck trying to find an H4 within the nodes. The XPath //div[@class='job '] correctly finds all 8 occurrences that I am looking for, but when I try to traverse those 8 occurrences I hit problems.
Here is the HTML output of the code I am looking inside.
<div class="job_art ">
<div style="background: #444 url('https://a.akamaihd.net/mwfb/mwfb/graphics/jobs/chicago/meet_with_the_south_gang_family_ 760x225_01.jpg') 50% 0 no-repeat;">
</div>
</div>
<div class="job_details clearfix">
<h4>Meet With the South Gang Family</h4>
<div class="mastery_bar" title="Indicates how much of this Job you've mastered. Master Jobs to earn Skill Points."><div style="width: 0%" class="noHighlight"></div><p>100% Mastered</p><div style="width: 0%"><p>100% Mastered</p></div></div><ul class="uses clearfix" style="width:100px;"><li class="energy" base_value="2" current_value="2" title="Spend 2 Energy to do this Job once.">2</li></ul><ul class="pays clearfix" style="width:120px" title="Earn XP, City Cash and Loot items while doing Jobs."><li class="experience" base_value="2" current_value="2">2</li><li class="cash_icon_jobs_8" base_value="2" current_value="2">2</li></ul><a id='btn_dojob_1' class='sexy_button_new sexy_energy_new medium orange impulse_buy' selector='#inner_page' requirements='{"energy":2}' precall='BrazilJobs.preDoJob' callback='BrazilJobs.doJob' href='remote/h.php?job=1&tab=1&clkdiv=btn_dojob_1'><span><span>Do Job</span></span></a></div><div class="job_additional_results"><div id="loot-bandit-1" class="lootContainer"></div><div class="previous_loot"></div></div><div id="bandit-contextual-1" class="contextual bandit-contextual"></div>
It always finds something else, such as "Clams(Bank)", and I have no idea how. The problem starts with
string MissionName = node.SelectSingleNode("//h4").InnerText;
I have tried numerous XPath expressions, such as //div[h4[1]] and h4[1]. I only need the first occurrence, since it only occurs once. Where does the problem start in my code?
I need the inner text "Meet With the South Gang Family".
public static List<string> GetMissions()
{
List<string> FoundMissions = new List<string>();
HTML_CONTENT = HTML_CONTENT.Replace("\r", "");
HTML_CONTENT = HTML_CONTENT.Replace("\t", "");
HTML_CONTENT = HTML_CONTENT.Replace("\n", "");
HTML_CONTENT = HTML_CONTENT.Replace("\\", "");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(new StringReader(HTML_CONTENT));
if(doc.DocumentNode == null)
return FoundMissions;
var DivNodes = doc.DocumentNode.SelectNodes("//div[@class='job ']");
if (DivNodes != null)
{
string Count = DivNodes.Count.ToString();
Like I said, it finds all 8 occurrences fine. I debugged and got the HTML I put at the top of this question, so I think this part is fine.
foreach (HtmlNode node in DivNodes)
{
string MissionName = node.SelectSingleNode("//h4").InnerText;
}
}
return FoundMissions;
}
}

You need to state explicitly that the XPath query is relative to the current node by adding a single dot (.) at the beginning:
string MissionName = node.SelectSingleNode(".//h4").InnerText;
Otherwise, the XPath searches from the root node, which is most likely why your attempt returned the wrong result.
