C# HtmlAgilityPack Xpath problems, trouble finding H4 innertext

I have a method that finds everything I am looking for in a section of a webpage, except that I am stuck trying to find an H4 within the nodes. The XPath //div[@class='job '] correctly finds all 8 occurrences that I am looking for, but I hit problems once I try to traverse those 8 occurrences.
Here is the HTML of one of the nodes I am looking inside.
<div class="job_art ">
<div style="background: #444 url('https://a.akamaihd.net/mwfb/mwfb/graphics/jobs/chicago/meet_with_the_south_gang_family_ 760x225_01.jpg') 50% 0 no-repeat;">
</div>
</div>
<div class="job_details clearfix">
<h4>Meet With the South Gang Family</h4>
<div class="mastery_bar" title="Indicates how much of this Job you've mastered. Master Jobs to earn Skill Points."><div style="width: 0%" class="noHighlight"></div><p>100% Mastered</p><div style="width: 0%"><p>100% Mastered</p></div></div><ul class="uses clearfix" style="width:100px;"><li class="energy" base_value="2" current_value="2" title="Spend 2 Energy to do this Job once.">2</li></ul><ul class="pays clearfix" style="width:120px" title="Earn XP, City Cash and Loot items while doing Jobs."><li class="experience" base_value="2" current_value="2">2</li><li class="cash_icon_jobs_8" base_value="2" current_value="2">2</li></ul><a id='btn_dojob_1' class='sexy_button_new sexy_energy_new medium orange impulse_buy' selector='#inner_page' requirements='{"energy":2}' precall='BrazilJobs.preDoJob' callback='BrazilJobs.doJob' href='remote/h.php?job=1&tab=1&clkdiv=btn_dojob_1'><span><span>Do Job</span></span></a></div><div class="job_additional_results"><div id="loot-bandit-1" class="lootContainer"></div><div class="previous_loot"></div></div><div id="bandit-contextual-1" class="contextual bandit-contextual"></div>
It always finds something else, like "Clams(Bank)", and I have no idea how. The problem starts with
string MissionName = node.SelectSingleNode("//h4").InnerText;
I have tried numerous XPath expressions, like //div[h4[1]] and h4[1]. I only need the first occurrence, since it only occurs once. Where does the problem start in my code?
I need the inner text "Meet With the South Gang Family".
public static List<string> GetMissions()
{
List<string> FoundMissions = new List<string>();
HTML_CONTENT = HTML_CONTENT.Replace("\r", "");
HTML_CONTENT = HTML_CONTENT.Replace("\t", "");
HTML_CONTENT = HTML_CONTENT.Replace("\n", "");
HTML_CONTENT = HTML_CONTENT.Replace("\\", "");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(new StringReader(HTML_CONTENT));
if(doc.DocumentNode == null)
return FoundMissions;
var DivNodes = doc.DocumentNode.SelectNodes("//div[@class='job ']");
if (DivNodes != null)
{
string Count = DivNodes.Count.ToString();
Like I said, it finds all 8 occurrences fine. I debugged it and got the HTML I put at the top of this question, so I think this part is fine.
foreach (HtmlNode node in DivNodes)
{
string MissionName = node.SelectSingleNode("//h4").InnerText;
}
}
return FoundMissions;
}
}

You need to explicitly tell the XPath query that it is relative to the current node by adding a single dot (.) at the beginning:
string MissionName = node.SelectSingleNode(".//h4").InnerText;
Otherwise, the XPath searches from the root node of the document, no matter which node you call it on. That is likely why your attempt returned the wrong result.
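To see the difference between //h4 and .//h4 in isolation, here is a minimal sketch. It uses System.Xml.XmlDocument from the standard library instead of HtmlAgilityPack so that it runs without extra packages; as far as I know both follow the same XPath rules here. The markup, the class name RelativeXPathDemo and the method TitlesPerJob are made up for illustration.

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // Hypothetical page: an unrelated <h4> comes before the two job blocks.
    private const string Page = @"
<root>
  <h4>Clams (Bank)</h4>
  <div class='job'><h4>Meet With the South Gang Family</h4></div>
  <div class='job'><h4>Second Job</h4></div>
</root>";

    // Evaluates h4XPath once per job div and returns the text of each match.
    public static string[] TitlesPerJob(string h4XPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Page);
        XmlNodeList jobs = doc.SelectNodes("//div[@class='job']");
        var titles = new string[jobs.Count];
        for (int i = 0; i < jobs.Count; i++)
        {
            titles[i] = jobs[i].SelectSingleNode(h4XPath).InnerText;
        }
        return titles;
    }

    public static void Main()
    {
        // Absolute path: starts at the document root, so every job gets the same first <h4>.
        Console.WriteLine(string.Join(" | ", TitlesPerJob("//h4")));
        // Relative path: starts at the job div the method is called on.
        Console.WriteLine(string.Join(" | ", TitlesPerJob(".//h4")));
    }
}
```

The first line printed repeats "Clams (Bank)" for both jobs, which matches the symptom in the question; the second line prints each job's own title.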

Related

How to get href elements and attributes for each node?

I am working on a project that should read HTML, find all nodes that match a value, and then find elements and attributes of the located nodes.
I am having difficulty figuring out how to get the href attributes and elements, though.
I am using HtmlAgilityPack.
I have numerous nodes of
class="middle"
throughout the HTML. I need to get all of them and, from each, get the href element and attributes. Below is a sample of the HTML:
<div class="top">
<div class="left">
<a href="item123">
<img src="url.png" border="0" />
</a>
</div>
</div>
<div class="middle">
<div class="title">Captains Hat</div>
<div class="day">monday</div>
<div class="city">Tuscon, AZ | 100 Days | <script typs="text/javascript">document.write(ts_to_age_min(1445620427));</script></div>
</div>
I have been able to get the other attributes I need, but not for 'href'.
Here is the code I have:
List<string> listResults = new List<string>();
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(url);
//get each listing
foreach (HtmlNode node in doc.DocumentNode.Descendants("div").Where(d =>
d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("middle")))
{
string day = node.SelectSingleNode(".//*[contains(@class,'day')]").InnerHtml;
string city = node.SelectSingleNode(".//*[contains(@class,'city')]").InnerHtml;
string item = node.SelectSingleNode("//a").Attributes["href"].Value;
listResults.Add(day + Environment.NewLine
+ city + Environment.NewLine
+ item + Environment.NewLine + Environment.NewLine);
}
My code above, though, gives me the first href value for the whole HTML page, and gives that same value for each node (visible by outputting the list to a message box). I thought that, inside my foreach loop, SelectSingleNode would get the first href attribute for that specific node. If so, why am I getting the first href attribute for the whole HTML page loaded?
I've been going through lots of threads on here about getting href values with HtmlAgilityPack, but I haven't been able to get this to work.
How can I get the href attribute and elements for each node I'm selecting based on the class attribute (class="middle")?

Try replacing
string item = node.SelectSingleNode("//a").Attributes["href"].Value;
with
string item = node.SelectSingleNode(".//a").Attributes["href"].Value;
Other than that, code above works for me.
Alternatively:
string item = node.SelectSingleNode(".//*[contains(@class,'title')]")
.Descendants("a").FirstOrDefault().Attributes["href"].Value;

Need an XPath expressions to locate based on a sibling

I've got this code repeated in a div tag and want to write an XPath expression to find the dsd link so that I can click on it, based on the text in the h4 tag. Changing the HTML isn't an option.
<div>
<h4>Test Block</h4>
<br/>
<div>
Option 1
Option 2
</div>
</div>
At the moment I'm trying something like the following, where name is the text of the h4 tag:
var findSubmitButton = Driver.FindElement(By.XPath("//div/h4[contains(text(), '" + name + "')]"));
var submitButton = findSubmitButton.FindElement(By.XPath("../div/a[contains(@href,'dsd')]"));
submitButton.Click();
But I'm unable to get this to work. Any suggestions would be gratefully received.

I do not see an issue with your xpaths. The HTML you supplied is invalid due to your placeholders, but your xpaths appear to work with this:
void Main()
{
var xml = @"
<div>
<h4>Test Block</h4>
<br/>
<div>
Option 1
Option 2
</div>
</div>";
var xmldoc = new XmlDocument();
xmldoc.LoadXml(xml);
var node = xmldoc.DocumentElement.SelectSingleNode("//div/h4[contains(text(),'Test Block')]");
node = node.SelectSingleNode("../div/a[contains(@href,'dsd')]");
Console.WriteLine(node.InnerText);
}

I don't have a working machine so I can't test this, but you said any feedback would be well received. I'm pretty sure that with XPath you can grab individual child elements by position. If you know for sure that this HTML will always be the same, you could do:
../div[1] //(first div child of the parent; XPath positions start at 1, not 0)

You could use //div[h4[contains(., 'Test Block')]]//a[contains(@href, 'dsd')]. Also something like //div[h4[contains(., 'Test Block')]]//a[contains(., 'Option 1')] should work.

Why don't you use the following-sibling axis?
var findSubmitButton = Driver.FindElement(By.XPath("//div/h4[contains(text(), '" + name + "')]"));
var submitButton = findSubmitButton.FindElement(By.XPath("following-sibling::div/a[contains(@href,'dsd')]"));
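The two steps can also be combined into one expression. The sketch below checks that idea outside the browser with System.Xml.XmlDocument rather than Selenium. The links in the question's HTML were lost, so the <a> elements and their href values here are assumptions, as are the names SiblingXPathDemo and FindDsdHref.

```csharp
using System;
using System.Xml;

public static class SiblingXPathDemo
{
    // Assumed markup: two blocks with the same structure, each holding a 'dsd' link.
    private const string Markup = @"
<body>
  <div>
    <h4>Other Block</h4>
    <br/>
    <div><a href='/other/dsd'>Option 1</a><a href='/other/abc'>Option 2</a></div>
  </div>
  <div>
    <h4>Test Block</h4>
    <br/>
    <div><a href='/test/dsd'>Option 1</a><a href='/test/abc'>Option 2</a></div>
  </div>
</body>";

    // Returns the href of the 'dsd' link that follows the <h4> containing the given text,
    // or null when no such link exists. The heading must not contain an apostrophe,
    // because it is pasted into a single-quoted XPath string.
    public static string FindDsdHref(string heading)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        string xpath = "//div/h4[contains(., '" + heading + "')]"
                     + "/following-sibling::div/a[contains(@href, 'dsd')]";
        XmlNode link = doc.SelectSingleNode(xpath);
        return link == null ? null : link.Attributes["href"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(FindDsdHref("Test Block"));  // /test/dsd
        Console.WriteLine(FindDsdHref("Other Block")); // /other/dsd
    }
}
```

In Selenium the same combined expression could be passed to a single By.XPath call, assuming the real page has the structure sketched here.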

XPath: How to get data from div tag

<div id="caption">
<div>
Position: Passenger Side Front
<br></br>
Color: Black
<br></br>
Finish: Smooth / Paintable
<br></br>
Part Brand: LatchWell
<br></br>
Lifetime Warranty
</div>
I need an XPath that fetches the Part Brand: value. My desired output is
LatchWell
Here is my code :
tag = htmlDoc.DocumentNode.SelectSingleNode("//div[@id='caption']//div");
if (tag != null)
{
wi.Brand = tag.InnerText.Trim();
}
I am not able to split by using split functions because the data above and below Part Brand are dynamic.

Since your HTML markup offers nothing to select by element with HtmlAgilityPack except the two <div> tags, you'll have to use some other method, such as a Regex.
Assuming that Part Brand: something <br> always exists in your markup, you could select the text between Part Brand: and <br> to get the brand name.
HtmlNode brandNode = doc.DocumentNode.SelectSingleNode("//div[@id='caption']//div");
string brand = Regex.Match(brandNode.InnerHtml, "Part Brand: (.*?)<br>").Groups[1].Value;
Console.WriteLine(brand);
This simple use of Regex.Match(string, regexp) will output LatchWell.

Actually, you can select that particular line of text using XPath, for example:
var tag = htmlDoc.DocumentNode
.SelectSingleNode("//div[@id='caption']/div/text()[contains(.,'Part Brand:')]");
//given the HTML input as posted in this question, the following will print: "LatchWell"
Console.WriteLine(tag.InnerText.Trim().Replace("Part Brand: ", ""));
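The same text() selection can be wrapped in a small helper that works for any of the labels. This is only a sketch: it uses System.Xml.XmlDocument instead of HtmlAgilityPack so that it is self-contained, the markup is a shortened and closed copy of the question's sample, and the names CaptionTextDemo and ValueFor are hypothetical.

```csharp
using System;
using System.Xml;

public static class CaptionTextDemo
{
    // Shortened copy of the question's markup, with the outer <div> closed.
    private const string Markup = @"
<div id='caption'>
  <div>
    Position: Passenger Side Front
    <br></br>
    Color: Black
    <br></br>
    Part Brand: LatchWell
    <br></br>
    Lifetime Warranty
  </div>
</div>";

    // Finds the text node that contains the label and returns what follows the label.
    // Returns null when no text node contains it.
    public static string ValueFor(string label)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        XmlNode text = doc.SelectSingleNode(
            "//div[@id='caption']/div/text()[contains(., '" + label + "')]");
        if (text == null) return null;
        string line = text.Value.Trim();
        return line.Substring(line.IndexOf(label) + label.Length).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ValueFor("Part Brand:")); // LatchWell
        Console.WriteLine(ValueFor("Color:"));      // Black
    }
}
```

Each piece of text between two <br> elements is its own text node, which is why the predicate on text() can pick out one line even though the lines around it change.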

Removing DIV from a text file if it contains a certain classname

I am currently working with an XML document that contains RSS feeds. I want to parse it so that if a div tag with the class name "feedflare" is found, the code removes the whole DIV.
I could not find an example of doing this, as the search results are polluted with "HTML editor errors" and other irrelevant data.
Would anyone here be kind enough to share a method for reaching my goal?
I must state that I DO NOT want to use HtmlAgilityPack if I can avoid it.
This is my process:
Load the XML, parse through the elements and pick out Title, Description and Link.
Then save all this as HTML (with tags being added programmatically to build a web page). When all of the tags are added, I want to parse the resulting "HTML text" and remove the annoying DIV tag.
Let's assume "string HTML = textBox1.text", where textBox1 is where the resulting HTML is pasted after parsing the main XML document.
How would I then loop through the contents of textBox1.text and remove ONLY the div tag with the class "feedflare" (see below)?
<div class="feedflare">
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:yIl2AUoC8zA">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?d=yIl2AUoC8zA" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:H0mrP-F8Qgo">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?d=H0mrP-F8Qgo" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:D7DqB2pKExk">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?i=lB-zYAGjzDU:1zqeSgzxt90:D7DqB2pKExk" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:V_sGLiPBpWU">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?i=lB-zYAGjzDU:1zqeSgzxt90:V_sGLiPBpWU" border="0"></img></a>
</div>
Thank you in advance.

Using this xml library, do:
XElement root = XElement.Load(file); // or .Parse(string);
XElement div = root.XPathElement("//div[@class={0}]", "feedflare");
div.Remove();
root.Save(file); // or string = root.ToString();
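If you would rather not add a library, roughly the same thing can be done with LINQ to XML from the .NET base class library. This is a sketch under the assumption that the generated HTML is well-formed XML (as the feedflare sample is); the names FeedflareRemovalDemo and RemoveDivsByClass and the sample input are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class FeedflareRemovalDemo
{
    // Removes every <div> whose class attribute equals className, wherever it is nested.
    public static string RemoveDivsByClass(string xml, string className)
    {
        XElement root = XElement.Parse(xml);
        root.Descendants("div")
            .Where(d => (string)d.Attribute("class") == className)
            .Remove(); // the extension method collects the matches first, then detaches them
        return root.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        string input = "<body><p>Story</p>"
                     + "<div class=\"feedflare\"><a href=\"x\">share</a></div>"
                     + "<div class=\"keep\">text</div></body>";
        Console.WriteLine(RemoveDivsByClass(input, "feedflare"));
        // <body><p>Story</p><div class="keep">text</div></body>
    }
}
```

Unlike the single-element version above, this removes all matching divs in one pass, which fits a feed with one feedflare block per item.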

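Try this (needs using System.Linq for Cast and ToList):
System.Xml.XmlDocument d = new System.Xml.XmlDocument();
d.LoadXml(Your_XML_as_String);
// Copy the matches to a list first, so that removing nodes does not disturb the iteration.
foreach (System.Xml.XmlNode n in d.SelectNodes("//div[@class='feedflare']").Cast<System.Xml.XmlNode>().ToList())
n.ParentNode.RemoveChild(n); // a node can only be removed through its own parent
and use d.OuterXml to retrieve the new XML.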

My solution in JavaScript is:
function unrichText(texto) {
var n = texto.indexOf("\">"); //Finding end of "<div class="ExternalClass...">
var sub = texto.substring(0, n+2); //Adding first char and last two (">)
var tmp = texto.replace(sub, ""); //Removing it
tmp = replaceAll(tmp, "</div>", ""); //Removing last "div"
tmp = replaceAll(tmp, "<p>", ""); //Removing other stuff
tmp = replaceAll(tmp, "</p>", "");
tmp = replaceAll(tmp, " ", "");
return (tmp);
}
function replaceAll(str, find, replace) {
return str.replace(new RegExp(find, 'g'), replace);
}

navigate to section of XML with xpath

I am not able to see where I am going wrong with my XPath logic.
Here is a section of a larger XML document that I am working on traversing (note: I'm using the Html Agility Pack).
<div>
<div></div>
<span class="pp-headline-item pp-headline-phone">
<span class="telephone" dir="ltr">
<nobr>(732) 562-1312</nobr>
<span class="pp-headline-phone-label" style="display:none">()</span>
</span>‎
</span>
<span> · </span>
<span class="pp-headline-item pp-headline-authority-page">
<span>
<a href="http://maps.google.com/local_url?q=http://www.fed.com/q=07746+pizza">
<span>fed.com</span>
</a>
</span>
</span>
</div>
My goal is to extract various data points from these chunks of XML, which I get out of the master XML file by using
.SelectNodes("//div/span['pp-headline-item pp-headline-phone']/../..")
With this I expect to get all the sections outlined above, so that I can iterate over them and extract things like website, phone, address...
The problem is that when I iterate over this node set, I can't get to the data points I want, as if the node set were not the one outlined above.
My logic is to select the topmost div of each section into the node set and then, while iterating, use XPath from each node to reach the data points I want.
I do it like this:
foreach (HtmlNode n in BuizRowsgoogMaps)
{
//get phone number
if (n.SelectSingleNode("span/nobr").InnerHtml != null)
{
strPhone = n.SelectSingleNode("span/nobr").InnerHtml;
//get phone site
strSite = n.SelectSingleNode("//span['pp-headline-item pp-headline-authority-page']/span/a/span").InnerHtml;
}
}
I suspect my XPaths don't mesh together to get what I want, but when I validate my expression I get the desired results. I used this to validate my thinking, and it works, leaving me at my wits' end:
//div/span['pp-headline-item pp-headline-phone']/../../span['pp-headline-item pp-headline-phone']/span/nobr

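Your code is almost right; you just need to modify your XPath a bit.
foreach (HtmlNode n in BuizRowsgoogMaps)
{
//get phone number (SelectSingleNode returns null when nothing matches)
HtmlNode phone = n.SelectSingleNode(".//span/nobr");
if (phone != null)
{
strPhone = phone.InnerHtml;
//get phone site
strSite = n.SelectSingleNode(".//span[@class='pp-headline-item pp-headline-authority-page']/span/a/span").InnerHtml;
}
}
The .// tells XPath to match from the current node and not from the root. Also note that a predicate such as span['some text'] is just a non-empty string, which always counts as true, so it matches every span; to filter on the class you need span[@class='...'].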
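One more detail in the question's expressions is worth checking: a predicate that holds only a quoted string, like span['pp-headline-item pp-headline-phone'], does not test the class at all. The sketch below counts matches with System.Xml.XmlDocument (used here in place of the Html Agility Pack so that it is self-contained); the markup is a trimmed version of the question's sample, and the names PredicateDemo and Count are hypothetical.

```csharp
using System;
using System.Xml;

public static class PredicateDemo
{
    // Trimmed version of the question's markup: three <span> children under one <div>.
    private const string Markup = @"
<div>
  <span class='pp-headline-item pp-headline-phone'><span class='telephone'><nobr>(732) 562-1312</nobr></span></span>
  <span>-</span>
  <span class='pp-headline-item pp-headline-authority-page'><span><a href='http://example.com'><span>fed.com</span></a></span></span>
</div>";

    // Returns how many nodes the XPath selects in the sample markup.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // No predicate: all three child spans.
        Console.WriteLine(Count("//div/span"));
        // String-literal predicate: a non-empty string is true, so still all three.
        Console.WriteLine(Count("//div/span['pp-headline-item pp-headline-phone']"));
        // Attribute test: only the phone span.
        Console.WriteLine(Count("//div/span[@class='pp-headline-item pp-headline-phone']"));
    }
}
```

The first two counts come out the same, which is why the validation expression in the question appeared to work even though the class text played no part in the match.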
