Html Agility Pack getting the same output twice (C#)

<div class="header">
<span id="content">test1</span>
</div>
<div class="header">
<span id="content">test2</span>
</div>
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode.SelectNodes("//div[@class='header']");
foreach (var v in value)
{
    var name = v.SelectSingleNode("//span[@id='content']");
    Console.WriteLine(name.OuterHtml);
}
The code above outputs <span id="content">test1</span> twice, instead of <span id="content">test2</span> as the second output. So it finds the correct number of nodes but not the correct content.

An XPath expression that starts with // or / is evaluated from the document root, even when you call it on the current node. To search within the current node, use a relative path such as span[...] or .//span[...].
Here is the fix applied to your code:
var value = doc.DocumentNode.SelectNodes("//div[@class='header']");
foreach (var v in value)
{
    var name = v.SelectSingleNode("span[@id='content']");
    Console.WriteLine(name.OuterHtml);
}
See this fiddle. https://dotnetfiddle.net/nih2lw
As a side note, an id attribute should always be unique within the document. Use class instead.
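To make the difference concrete, here is a minimal sketch that runs both forms of the inner query against an inline copy of the markup from the question. It assumes the HtmlAgilityPack NuGet package is referenced; the class name XPathScopeDemo and the helper Collect are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Inline copy of the markup from the question, so no network call is needed.
    public const string Html =
        "<div class=\"header\"><span id=\"content\">test1</span></div>" +
        "<div class=\"header\"><span id=\"content\">test2</span></div>";

    // Runs innerXPath on every div.header and collects the text of the node it finds.
    public static List<string> Collect(string innerXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode
            .SelectNodes("//div[@class='header']")
            .Select(div => div.SelectSingleNode(innerXPath).InnerText)
            .ToList();
    }

    public static void Main()
    {
        // Absolute: "//" restarts at the document root on every iteration.
        Console.WriteLine(string.Join(", ", Collect("//span[@id='content']")));  // test1, test1
        // Relative: ".//" searches only below the current div.
        Console.WriteLine(string.Join(", ", Collect(".//span[@id='content']"))); // test1, test2
    }
}
```

The only change between the two calls is the leading dot, which anchors the search at the node the method was called on.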

Related

HtmlAgilityPack issue

Suppose I have the following HTML code:
<div class="MyDiv">
<h2>Josh</h2>
</div>
<div class="MyDiv">
<h2>Anna</h2>
</div>
<div class="MyDiv">
<h2>Peter</h2>
</div>
And I want to get the names, so this is what I did (C#):
string url = "https://...";
var web = new HtmlWeb();
HtmlNode[] nodes = null;
HtmlDocument doc = null;
doc = web.Load(url);
nodes = doc.DocumentNode.SelectNodes("//div[@class='MyDiv']").ToArray();
foreach (HtmlNode n in nodes){
var name = n.SelectSingleNode("//h2");
Console.WriteLine(name.InnerHtml);
}
Output:
Josh
Josh
Josh
and it is so strange because n contains only the desired <div>. How can I resolve this issue?
Fixed by writing .//h2 instead of //h2
It's because of your XPath expression "//h2". Change it to "h2" (or ".//h2" if the h2 is not a direct child of the div). An expression that starts with "//" is evaluated from the top of the document, so it selects "Josh" every time, because that is the first h2 node.
You could also do like this:
List<string> names =
    doc.DocumentNode.SelectNodes("//div[@class='MyDiv']/h2")
.Select(dn => dn.InnerText)
.ToList();
foreach (string name in names)
{
Console.WriteLine(name);
}
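The two suggested fixes, "h2" and ".//h2", are not quite equivalent: the first matches only direct children of the current node, the second matches any descendant. The sketch below shows the difference on hypothetical markup in which the second h2 sits inside an extra wrapper div. It assumes the HtmlAgilityPack package is referenced; RelativePathDemo and Names are illustrative names. It also guards against SelectNodes returning null, which it does when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativePathDemo
{
    // Hypothetical markup: the second h2 is wrapped in an extra, class-less <div>.
    public const string Html =
        "<div class=\"MyDiv\"><h2>Josh</h2></div>" +
        "<div class=\"MyDiv\"><div><h2>Anna</h2></div></div>";

    // Evaluates innerXPath relative to each div.MyDiv and collects the h2 text.
    public static List<string> Names(string innerXPath)
    {
        var names = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        var divs = doc.DocumentNode.SelectNodes("//div[@class='MyDiv']");
        if (divs == null) return names; // SelectNodes returns null when nothing matches

        foreach (var div in divs)
        {
            var h2 = div.SelectSingleNode(innerXPath);
            names.Add(h2 == null ? "(none)" : h2.InnerText);
        }
        return names;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Names("h2")));    // Josh, (none)
        Console.WriteLine(string.Join(", ", Names(".//h2"))); // Josh, Anna
    }
}
```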

Wrong string found while parsing HTML

Here is my regular expression for getting the version number from the Play Store HTML content:
var content = responseMsg.Content == null
? null
: await responseMsg.Content.ReadAsStringAsync();
var versionMatch = Regex.Match(
content,
"<div[^>]*>Current Version</div><span[^>]*><div><span[^>]*>(.*?)<").Groups[1];
if (versionMatch.Success)
{
version = versionMatch.Value.Trim();
}
With this code, versionMatch comes back as "{}" (an empty group).
How can I get the actual version, e.g. versionMatch = "1.9"?
The HTML content is very large, so here is the relevant excerpt:
<div class="hAyfc">
<div class="BgcNfc">Current Version</div>
<span class="htlgb">
<div class="IQ1z0d">
<span class="htlgb">1.9</span>
</div>
To skip over the intermediate text between Current Version</div> and the <span> that contains the version number, you can use a non-greedy .*?. The dot also matches \r\n if RegexOptions.Singleline is given. To get the correct span, specify its content as "digits and dots" ([\d\.]+) instead of "anything" (.*?):
var content = @"<div class=""hAyfc"">
<div class=""BgcNfc"">Current Version</div>
<span class=""htlgb"">
<div class=""IQ1z0d"">
<span class=""htlgb"">1.9</span>
</div>";
var versionMatch = Regex.Match(
    content,
    @"<div[^>]*>Current Version</div>.*?<span[^>]*>([\d\.]+)<", RegexOptions.Singleline).Groups[1];
versionMatch.Value is then "1.9"
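The effect of RegexOptions.Singleline can be checked in isolation. The sketch below uses only the .NET base class library and runs the same pattern with and without the option on a multi-line excerpt; VersionRegexDemo and Extract are illustrative names, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersionRegexDemo
{
    // Same pattern as in the answer: lazy skip, then a span holding only digits and dots.
    public const string Pattern =
        @"<div[^>]*>Current Version</div>.*?<span[^>]*>([\d\.]+)<";

    // Returns the captured version, or an empty string when the pattern does not match.
    public static string Extract(string html, RegexOptions options)
    {
        var match = Regex.Match(html, Pattern, options);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string html =
            "<div class=\"BgcNfc\">Current Version</div>\n" +
            "<span class=\"htlgb\">\n" +
            "<div class=\"IQ1z0d\">\n" +
            "<span class=\"htlgb\">1.9</span>\n" +
            "</div>";

        // With Singleline, "." also matches newlines, so ".*?" can cross lines.
        Console.WriteLine("[" + Extract(html, RegexOptions.Singleline) + "]"); // [1.9]
        // Without it, ".*?" stops at the first newline and the match fails.
        Console.WriteLine("[" + Extract(html, RegexOptions.None) + "]");       // []
    }
}
```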
You could try using HtmlAgilityPack with Fizzler.Systems.HtmlAgilityPack so you can basically do something like this:
var web = new HtmlWeb();
var html = web.Load(uri);
var documentNode = html.DocumentNode;
var version = documentNode.QuerySelector(".htlgb").InnerHtml;
And you don't have to worry about the regex

Getting InnerText ignoring script node by using Html Agility Pack in C#

I have following page from which I want to get a list of proxy servers from a table:
http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any
Each row in the table is a ul element. My problem comes when reading the first li element of each ul, the one whose class is "proxy". I want to obtain the IP and port, so I read InnerText, but since the li element has a script child node, it returns the text of the script node.
Below an image of the structure of the page:
I have tried below code using Html Agility Pack and LINQ:
WebClient webClient = new WebClient();
string page = webClient.DownloadString("http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
List<List<string>> table = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
.Descendants("ul")
.Where(ul => ul.Elements("li").Count() > 1)
.Select(ul => ul.Elements("li").Select(li =>
{
string result = string.Empty;
if (li.HasClass("proxy"))
{
HtmlNode liTmp = li.Clone();
liTmp.RemoveAllChildren();
result = liTmp.InnerText.Trim();
}
else
{
result = li.InnerText.Trim();
}
return result;
}).ToList()).ToList();
I get a list in which each item is a list containing the fields (Proxy, País, Tipo, Velocidad, HTTPS/SSL), but the Proxy field is always empty. I am also not getting the "País" and "Ciudad" columns at all.
That is because those values are injected into the DOM by JavaScript after page load. Actually the value inside the Proxy() is a Base64 representation of what you are looking for.
In the image you have posted above the value MTQ4LjI0My4zNy4xMDE6NTMyODE= decodes to 148.243.37.101:53281
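As a quick check of that claim, the following sketch pulls the argument out of a Proxy('...') call and Base64-decodes it, using only the .NET base class library. ProxyDecodeDemo and Decode are illustrative names, and ASCII is assumed as the encoding because the payload is an IP address and port.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ProxyDecodeDemo
{
    // Extracts the quoted argument of Proxy('...') and decodes it from Base64.
    // Returns an empty string when the text contains no Proxy('...') call.
    public static string Decode(string scriptText)
    {
        var match = Regex.Match(scriptText, @"Proxy\('(?<value>[^']*)'\)");
        if (!match.Success) return string.Empty;

        byte[] bytes = Convert.FromBase64String(match.Groups["value"].Value);
        return Encoding.ASCII.GetString(bytes); // assumption: payload is plain ASCII
    }

    public static void Main()
    {
        Console.WriteLine(Decode("Proxy('MTQ4LjI0My4zNy4xMDE6NTMyODE=')")); // 148.243.37.101:53281
    }
}
```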
The raw parsed string you are feeding to the Agility pack only contains the Proxy field...
<div class="table-wrap">
  <div class="table">
    <ul>
      <li class="proxy">
        <script type="text/javascript">
          Proxy('MTM4Ljk3LjkyLjI0OTo1MzgxNg==')
        </script>
      </li>
      <li class="https">HTTP</li>
      <li class="speed">29.5kbit</li>
      <li class="type">
        <strong>Elite</strong>
      </li>
      <li class="country-city">
        <div>
          <span class="country" title="Brazil">
            <span class="country-code">
              <span class="flag br"></span>
              <span class="name">BR Brasil</span>
            </span>
          </span>
          <!-- -->
          <span class="city">
            <span>Rondon</span>
          </span>
        </div>
      </li>
    </ul>
    <div class="clear"></div>
Using the following code:
HttpClient client = new HttpClient();
var docResult = client.GetStringAsync("http://proxy-list.org/spanish/search.php?search=&country=any&type=any&port=any&ssl=any").Result;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(docResult);
Regex reg = new Regex(@"Proxy\('(?<value>.*?)'\)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
var stuff = doc.DocumentNode.SelectSingleNode("//div[@class='table']")
.Descendants("li")
.Where(x => x.HasClass("proxy"))
.Select(li =>
{
return li.InnerText;
}).ToList();
foreach (var item in stuff)
{
var match = reg.Match(item);
var proxy = Encoding.Default.GetString(System.Convert.FromBase64String(match.Groups["value"].Value));
Console.WriteLine($"{item}\t\tproxy = {proxy}");
}
I get:

HtmlAgilityPack filtering HTML based on a query

I have a block of two HTML elements which look like this:
<div class="a-row">
<a class="a-size-small a-link-normal a-text-normal" href="/Chemical-Guys-CWS-107-Extreme-Synthetic/dp/B003U4P3U0/ref=sr_1_1_sns?s=automotive&ie=UTF8&qid=1504525216&sr=1-1">
<span aria-label="$19.51" class="a-color-base sx-zero-spacing">
<span class="sx-price sx-price-large">
<sup class="sx-price-currency">$</sup>
<span class="sx-price-whole">19</span>
<sup class="sx-price-fractional">51</sup>
</span>
</span>
<span class="a-letter-space"></span>Subscribe & Save
</a>
</div>
And next block of HTML:
<div class="a-row a-spacing-none">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B003U4P3U0" rel="nofollow noreferrer">
<span aria-label="$22.95" class="a-color-base sx-zero-spacing">
<span class="sx-price sx-price-large">
<sup class="sx-price-currency">$</sup>
<span class="sx-price-whole">22</span>
<sup class="sx-price-fractional">95</sup>
</span>
</span>
</a>
<span class="a-letter-space"></span>
<i class="a-icon a-icon-prime a-icon-small s-align-text-bottom" aria-label="Prime">
<span class="a-icon-alt">Prime</span>
</i>
</div>
Both of these elements are quite similar in structure, but the trick is that I want to extract the value from the block that also contains, next to the link, an i element with the attribute aria-label="Prime".
This is how I currently extract the price but it's not good:
if (htmlDoc.DocumentNode.SelectNodes("//span[@class='a-color-base sx-zero-spacing']") != null)
{
    var span = htmlDoc.DocumentNode.SelectSingleNode("//span[@class='a-color-base sx-zero-spacing']");
    price = span.Attributes["aria-label"].Value;
}
This basically selects HTML element at position 0, since there are more than one element. But the trick here is that I would like to select that span element which contains the prime value , just like the 2nd piece of HTML I've shown...
In case the 2nd element with such values doesn't exists I would just simply use this first method I wrote up there...
Can someone help me out with this ? =)
I've also tried something like this:
var pr = htmlDoc.DocumentNode.SelectNodes("//a[@class='a-link-normal a-text-normal']")
    .Where(x => x.SelectSingleNode("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']") != null)
    .Select(x => x.SelectSingleNode("//span[@class='a-color-base sx-zero-spacing']").Attributes["aria-label"].Value);
But it's still returning first element xD
New version guys:
var pr = htmlDoc.DocumentNode.SelectNodes("//a[@class='a-link-normal a-text-normal']");
string prrrrrr = "";
for (int i = 0; i < pr.Count; i++)
{
    if (pr.ElementAt(i).SelectNodes("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']").ElementAt(i) != null)
    {
        prrrrrr = pr.ElementAt(i).SelectNodes("//span[@class='a-color-base sx-zero-spacing']").ElementAt(i).Attributes["aria-label"].Value;
    }
}
So the idea is that I take out all "a" elements from the HTML file and create a HTML Node collection of a's, and then loop through them and see which one indeed contains the element that I'm looking for and then match it...?
The problem here is that this if statement always passes:
if (pr.ElementAt(i).SelectNodes("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']").ElementAt(i) != null)
How can I loop through each individual element in node collection ?
I think you should start at the div level, with the divs whose class starts with a-row. Then loop over them and check whether each div contains an i element whose aria-label attribute equals 'Prime'. Finally, get the span with the a-color-base sx-zero-spacing class and read the value of its aria-label attribute, like this:
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//div[starts-with(@class,'a-row')]");
foreach (HtmlNode node in nodes)
{
    HtmlNode i = node.SelectSingleNode("i[@aria-label='Prime']");
    if (i != null)
    {
        HtmlNode span = node.SelectSingleNode(".//span[@class='a-color-base sx-zero-spacing']");
        if (span != null)
        {
            string currentValue = span.Attributes["aria-label"].Value;
        }
    }
}
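The same check can also be folded into a single XPath predicate, with a fallback to the first price when no Prime block exists, as the question asks. This is only a sketch under stated assumptions: the HtmlAgilityPack package is referenced, the markup is a trimmed-down, hypothetical version of the two blocks from the question, and PrimePriceDemo and Price are illustrative names.

```csharp
using System;
using HtmlAgilityPack;

public static class PrimePriceDemo
{
    // Trimmed-down, hypothetical version of the two blocks from the question.
    public const string NonPrimeBlock =
        "<div class=\"a-row\"><a class=\"a-link-normal\">" +
        "<span aria-label=\"$19.51\" class=\"a-color-base sx-zero-spacing\"></span></a></div>";
    public const string PrimeBlock =
        "<div class=\"a-row a-spacing-none\"><a class=\"a-link-normal\">" +
        "<span aria-label=\"$22.95\" class=\"a-color-base sx-zero-spacing\"></span></a>" +
        "<i class=\"a-icon a-icon-prime\" aria-label=\"Prime\"></i></div>";

    // Prefers the price inside a row that has a Prime icon as a direct child;
    // falls back to the first price span in the document otherwise.
    public static string Price(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        const string priceSpan = "span[@class='a-color-base sx-zero-spacing']";
        var span = doc.DocumentNode.SelectSingleNode(
                       "//div[starts-with(@class,'a-row')][i[@aria-label='Prime']]//" + priceSpan)
                   ?? doc.DocumentNode.SelectSingleNode("//" + priceSpan);

        return span == null ? string.Empty : span.GetAttributeValue("aria-label", string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(Price(NonPrimeBlock + PrimeBlock)); // $22.95
        Console.WriteLine(Price(NonPrimeBlock));              // $19.51
    }
}
```

Both queries here are run on the document node, so the leading "//" is intentional; inside a loop over individual nodes the paths would need the leading dot.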

XPath giving different results in browser and HtmlAgilityPack

I am attempting to parse a section of a webpage using HtmlAgilityPack in a C# program. Below is a simplified version of this section of the page (edited 1/30/2015 2:40PM EST):
<html>
<body>
<div id="main-box">
<div>
<div>...</div>
<div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
<a href="/some/other/path">
<img src="/path/to/img" />
</a>
</p>
<p>
...
Correct extra text
</p>
</div>
<div>
...
<p>
<ul>
...
<li>
<span>
Never Selected
and Never Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
No "a" tag this time
</p>
</div>
<div>
<p>
<ul>
<li>
<span>
<span style="display:none;">
Never Selected
</span>
</span>
</li>
<li>
<span>
Correct
and Wrongly Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
<div class="other-classes row-box">
<div>...</div>
<div>...</div>
<div>
<p>
<span>
Correct
</span>
</p>
<p>
...
Wrongly Selected extra text
</p>
</div>
<div>
<p>
<ul>
...
<li>
<span>
Never Selected
and Never Selected.
</span>
</li>
</ul>
</p>
</div>
...
</div>
</div>
</div>
</div>
</body>
</html>
I am attempting to get the first and only the first "a" tag with the GET parameter "a" in the 3rd or 4th child div of each div with the class "row-box" (the ones with the word "Correct" in them in the above example). I came up with the following XPath that gets these nodes and only these nodes in both Chrome's inspector and the Firepath add-on for Firefox (wrapped for legibility):
//div[@id="main-box"]/div/div[2]/div[contains(@class, "row-box")]/div[
    (position() = 3 or position() = 4) and descendant::a[
        contains(@href, "a=")
    ]
][1]/descendant::a[contains(@href, "a=")][1]
However, when I load this page using HttpWebRequest, load the response stream into an HtmlDocument object, and call SelectNodes(xpath) on its DocumentNode property using this XPath, it returns not only the three correct nodes, but also the two tags with the text "Wrongly Selected" in the example above. I noticed that this is effectively the same as if I were to use the XPath above, except without the last "[1]", like this (wrapped for legibility):
//div[@id="main-box"]/div/div[2]/div[contains(@class, "row-box")]/div[
    (position() = 3 or position() = 4) and descendant::a[
        contains(@href, "a=")
    ]
][1]/descendant::a[contains(@href, "a=")]
I have made sure that I am using the latest version of HtmlAgilityPack, attempted several variations on my XPath to determine if maybe it was hitting some arbitrary maximum length or other simple issues like that, and tried to research similar issues without success. I tried throwing together an even simpler HTML structure using the same basic concept to test, but couldn't reproduce the issue with that, so I suspect that it may be some subtle issue with how HtmlAgilityPack parses something in this structure.
If anyone knows what might cause this issue, or has a better way to write an XPath expression that will get the correct nodes and hopefully not cause issues in HtmlAgilityPack, I would be greatly appreciative.
EDIT
As suggested, here is a simplified version of the C# code I'm using, which I have confirmed does reproduce the problem for me.
using System;
using System.Net;
using HtmlAgilityPack;
...
static void Main(string[] args)
{
string url = "http://www.deerso.com/test.html";
string xpath = "//div[@id=\"main-box\"]/div/div[2]/div[contains(@class, \"row-box\")]/div[(position() = 3 or position() = 4) and descendant::a[contains(@href, \"a=\")]][1]/descendant::a[contains(@href, \"a=\")][1]";
int statusCode;
string htmlText;
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.Accept = "text/html,*/*";
request.Proxy = new WebProxy();
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0";
using (var response = (WebResponse)request.GetResponse())
{
statusCode = (int)((HttpWebResponse)response).StatusCode;
using (var stream = response.GetResponseStream())
{
if (stream != null)
{
using (var reader = new System.IO.StreamReader(stream))
{
htmlText = reader.ReadToEnd();
}
}
else
{
Console.WriteLine("Request to '{0}' failed, response stream was null", url);
htmlText = null;
return;
}
}
}
HtmlNode.ElementsFlags.Remove("form"); //fix for forms
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
foreach (HtmlNode node in nodes)
{
Console.WriteLine("Node Found:");
Console.WriteLine("Text: {0}", node.InnerText);
Console.WriteLine("Href: {0}", node.Attributes["href"].Value);
Console.WriteLine();
}
Console.WriteLine("Done!");
}
New answer based on updated Html
We can't use the //a[contains(@href,'a=')][1] filter since that is selecting the first <a> element from its direct parent.
We need to add brackets to include the descendant operator in the filter, i.e.
(//a[contains(@href,'a=')])[1]
However, if we expand that to apply the first descendant filter to each node in another nodeset, the resultant xpath expression is invalid:
//div[contains(@class,'row-box')](//a[contains(@href,'a=')])[1]
I think we need to break it into two steps:
1. Get the group of div elements containing the particular link we want.
2. Get the first descendant link element from each element in that group.
In C# this looks like:
// Get the <div> elements we know are ancestors to the <a> elements we want
HtmlNodeCollection topDivs = doc.DocumentNode.SelectNodes("//a[contains(@href,'?a=')]/ancestor::div[contains(@class,'row-box')]");
// Create a new list to hold the <a> elements
List<HtmlNode> linksWeWant = new List<HtmlNode>(topDivs.Count);
// Iterate through the <div> elements and get the first descendant link of each.
// The leading "." keeps the search inside the current <div>; without it,
// "//" would start again at the document root.
foreach (var div in topDivs)
{
    linksWeWant.Add(div.SelectSingleNode("(.//a[contains(@href,'?a=')])[1]"));
}
Old Answer
Using this page as a guide I put together the xpath expression:
//div[contains(@class,'row-box')]/descendant::a[contains(@href,'a=') and position()=1]
When I run it in HtmlAgilityPack I'm getting only these three elements returned:
<a href = "/test/path?a=123">
<a href = "/test/path?a=abc&b=123">
<a href = "/test/path?a=ghi">
Here's a breakdown of the expression:
//div[contains(@class,'row-box')] -> Get nodeset of <div class="*row-box*"> elements
/descendant::a -> From here get all descendant <a> elements
[contains(@href,'a=') and position()=1] -> Filter according to href value and element being the first descendant
I believe the key difference to the xpath in your question is /descendant::a[contains(@href,'a=') and position()=1] vs /descendant::a[contains(@href,'a=')][1]. Applying the [1] separately is filtering as the first child instead of the first descendant.
I am attempting to get the first and only the first "a" tag with the GET parameter "a" in the 3rd or 4th child div of each div with the class "row-box"
I don't think such a query is possible in a single XPath expression. It would be quite easy in XQuery:
for $rowBox in //div[contains(@class, 'row-box')]
let $firstRelevant := ($rowBox/div[
    (position() = 3 or position() = 4)
    and .//a[contains(@href, 'a=')]
])[1]
return ($firstRelevant//a[contains(@href, 'a=')])[1]
But the amount of predicate grouping (i.e. (...)[...]) that is going on here exceeds the expressive capabilities of XPath.
Selecting the result in multiple steps in C# would be the way to go, in much the same way XQuery does it:
for each //div[contains(@class, 'row-box')]:
    select ./div[(position() = 3 or position() = 4) and .//a[contains(@href, 'a=')]]
    for the first one:
        select .//a[contains(@href, 'a=')]
        take the first one
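The steps above can be sketched in C# roughly as follows. This is an illustration rather than a tested fix for the page in the question: it assumes the HtmlAgilityPack package is referenced, uses small hypothetical markup built only from div and a elements, and the names MultiStepDemo and FirstLinks are made up.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MultiStepDemo
{
    // Hypothetical markup: two row-box divs, each with four child divs.
    public const string Html =
        "<div class=\"x row-box\">" +
          "<div>one</div><div>two</div>" +
          "<div><a href=\"/p?a=1\">first</a><a href=\"/p?a=2\">extra</a></div>" +
          "<div><a href=\"/p?a=3\">later</a></div>" +
        "</div>" +
        "<div class=\"x row-box\">" +
          "<div>one</div><div>two</div>" +
          "<div>no link here</div>" +
          "<div><div><a href=\"/p?a=4\">nested</a></div></div>" +
        "</div>";

    // For each row-box: take the first 3rd-or-4th child div that contains a
    // matching link, then take the first matching link inside that div.
    public static List<string> FirstLinks(string html)
    {
        var hrefs = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rowBoxes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'row-box')]");
        if (rowBoxes == null) return hrefs; // SelectNodes returns null when nothing matches

        foreach (var rowBox in rowBoxes)
        {
            var candidates = rowBox.SelectNodes(
                "./div[(position() = 3 or position() = 4) and .//a[contains(@href, 'a=')]]");
            if (candidates == null) continue;

            var link = candidates[0].SelectSingleNode(".//a[contains(@href, 'a=')]");
            if (link != null) hrefs.Add(link.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FirstLinks(Html))); // /p?a=1, /p?a=4
    }
}
```

Every inner query starts with "." so that it stays inside the node it is called on, which is the same point the earlier questions on this page turn on.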
