xpath and htmlagility pack

xpath and htmlagility pack - c#

I figured it out! I will leave this posted just in case some other newbie like myself has the same question.
Answer: **("./td[2]/span[#class='smallfont']")***
I am a novice at xpath and html agility. I am so close yet so far.
GOAL: to pull out 4:30am
by using the following with htmlagility pack:
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[#id='weekdays']/tr[2]")){
string time = table.SelectSingleNode("./td[2]").InnerText;
I get it down to "\r\n\t\t\r\n\t\t\t4:30am\r\n\t\t\r\n\t" when I try doing anything with the span I get xpath exceptions. What must I add to the ("./td[2]") to just end up with the 4:30am?
HTML
<td class="alt1 espace" nowrap="nowrap" style="text-align: center;">
<span class="smallfont">4:30am</span>
</td>

I don't know if Linq is an option, but you could have also done something like this:
var time = string.Empty;
var html =
"<td class=\"alt1 espace\" nowrap=\"nowrap\" style=\"text-align: center;\"><span class=\"smallfont\">4:30am</span></td>";
var document = new HtmlDocument() { OptionWriteEmptyNodes = true, OptionOutputAsXml = true };
document.LoadHtml(html);
var timeSpan =
document.DocumentNode.Descendants("span").Where(
n => n.Attributes["class"] != null && n.Attributes["class"].Value == "smallfont").FirstOrDefault();
if (timeSpan != null)
time = timeSpan.InnerHtml;

Related

HtmlAgilityPack filtering HTML based on a query

I have a block of two HTML elements which look like this:
<div class="a-row">
<a class="a-size-small a-link-normal a-text-normal" href="/Chemical-Guys-CWS-107-Extreme-Synthetic/dp/B003U4P3U0/ref=sr_1_1_sns?s=automotive&ie=UTF8&qid=1504525216&sr=1-1">
<span aria-label="$19.51" class="a-color-base sx-zero-spacing">
<span class="sx-price sx-price-large">
<sup class="sx-price-currency">$</sup>
<span class="sx-price-whole">19</span>
<sup class="sx-price-fractional">51</sup>
</span>
</span>
<span class="a-letter-space"></span>Subscribe & Save
</a>
</div>
And next block of HTML:
<div class="a-row a-spacing-none">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B003U4P3U0" rel="nofollow noreferrer">
<span aria-label="$22.95" class="a-color-base sx-zero-spacing">
<span class="sx-price sx-price-large">
<sup class="sx-price-currency">$</sup>
<span class="sx-price-whole">22</span>
<sup class="sx-price-fractional">95</sup>
</span>
</span>
</a>
<span class="a-letter-space"></span>
<i class="a-icon a-icon-prime a-icon-small s-align-text-bottom" aria-label="Prime">
<span class="a-icon-alt">Prime</span>
</i>
</div>
Both of these elements are quite similar in their structure, but the trick is that I want to extract the value of element which next to it contains a span element with a class: aria-label="Prime"
This is how I currently extract the price but it's not good:
if (htmlDoc.DocumentNode.SelectNodes("//span[#class='a-color-base sx-zero-spacing']") != null)
{
var span = htmlDoc.DocumentNode.SelectSingleNode("//span[#class='a-color-base sx-zero-spacing']");
price = span.Attributes["aria-label"].Value;
}
This basically selects HTML element at position 0, since there are more than one element. But the trick here is that I would like to select that span element which contains the prime value , just like the 2nd piece of HTML I've shown...
In case the 2nd element with such values doesn't exists I would just simply use this first method I wrote up there...
Can someone help me out with this ? =)
I've also tried something like this:
var pr = htmlDoc.DocumentNode.SelectNodes("//a[#class='a-link-normal a-text-normal']")
.Where(x => x.SelectSingleNode("//i[#class='a-icon a-icon-prime a-icon-small s-align-text-bottom']") != null)
.Select(x => x.SelectSingleNode("//span[#class='a-color-base sx-zero-spacing']").Attributes["aria-label"].Value);
But it's still returning first element xD
New version guys:
var pr = htmlDoc.DocumentNode.SelectNodes("//a[#class='a-link-normal a-text-normal']");
string prrrrrr = "";
for (int i = 0; i < pr.Count; i++)
{
if (pr.ElementAt(i).SelectNodes("//i[#class='a-icon a-icon-prime a-icon-small s-align-text-bottom']").ElementAt(i) != null)
{
prrrrrr = pr.ElementAt(i).SelectNodes("//span[#class='a-color-base sx-zero-spacing']").ElementAt(i).Attributes["aria-label"].Value;
}
}
So the idea is that I take out all "a" elements from the HTML file and create a HTML Node collection of a's, and then loop through them and see which one indeed contains the element that I'm looking for and then match it...?
The problem here is that this if statement always passes:
if (pr.ElementAt(i).SelectNodes("//i[#class='a-icon a-icon-prime a-icon-small s-align-text-bottom']").ElementAt(i) != null)
How can I loop through each individual element in node collection ?

I think you should start to look at div level with class a-row. Then loop and check if the div contains a i with class area-label equals to 'Prime'. And finally get the span with the a-color-base sx-zero-spacing class and the value of the attribute aria-label like this:
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//div[starts-with(#class,'a-row')]");
foreach (HtmlNode node in nodes)
{
HtmlNode i = node.SelectSingleNode("i[#aria-label='Prime']");
if (i != null)
{
HtmlNode span = node.SelectSingleNode(".//span[#class='a-color-base sx-zero-spacing']");
if (span != null)
{
string currentValue = span.Attributes["aria-label"].Value;
}
}
}

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value; //"found it!"; //
return targetURL;
}/* */
}
return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210

I find using XPath easier to use instead of writing a lot of code
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
.Attributes["href"].Value;
If you have 2 links with the same text, to select the 2nd one
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
.Attributes["href"].Value;
The Linq version of it
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();

How to parse this HTML text using htmlagilitypack?

So below are the lines of code,
<td class="line1left">SCN02_MS_AddNotes_CAM</td><td class="line1left">798 (6.14%)
</td><td class="line1left">0.9</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
<td class="line1left">SCN05_MS_UpdateCustomer_CAM</td><td class="line1left">888 (6.83%)
</td><td class="line1left">1.0</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
From the first block, I need to get SCN02_MS_AddNotes_CAM and 798. To get 798 I am using this code, but I am getting the (6.14%) also, which I don't want.
var content1 = doc1.DocumentNode.SelectNodes("//td[#class='line1left']")[1].InnerText;
I want to get 798 only. So can anybody help me?
I also want to know how to get the same values from the second block. I was under the impression that the number inside the brackets represents the different occurrences of the class line1left. But here it is representing the different InnerHtml elements.
[1]
Does anybody know how to get this to work?
Thanks a lot in advance.!

var line1left_list = (from d in document.DocumentNode.Descendants()
where d.Name == "td " && d.Attributes["class"] != null
&& (d.Attributes["class"].Value == "line1left")
select d);
foreach (HtmlNode line1left in line1left_list)
{
var _link = line1left.Descendants("a").FirstOrDefault();
string linkUrl = "";
string link = "";
if (_link != null)
{
linkUrl = _link.Attributes["href"].Value;
link = _link.InnerText
}
}

It looks like you want the InnerText of all <td> tags with the class attribute of "line1left", unless that <td> has an <a> inside of it, in which case you want the InnerText of <a>.
Here is an example that will do just that. If the <td> has an <a>, then <a> is selected, otherwise <td> is selected.
HtmlDocument doc1 = new HtmlDocument();
doc1.Load("xmlfile2.xml");
var nodes = doc1.DocumentNode.SelectNodes("(//td[#class='line1left']/a) | (//td[#class='line1left' and not(a)])");
foreach(var node in nodes)
Console.WriteLine(node.InnerText.Trim());
This will select all the nodes in the document. You can use regular C# code to strip off the unwanted formatting on the individual values.

How to Get element by class in HtmlAgilityPack

Hello i making HttpWebResponse and getting the HtmlPage with all data that i need for example table with date info that i need to save them to array list and save it to xml file
Example of html Page
<table>
<tr>
<td class="padding5 sorting_1">
<span class="DateHover">01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span class="DateHover" >10.03.14</span>
</td>
</tr>
</table>
my code that not working i using the HtmlAgilityPack
private static string GetDataByIClass(string HtmlIn, string ClassToGet)
{
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlIn);
HtmlAgilityPack.HtmlNode InputNode = DocToParse.GetElementbyId(ClassToGet);//here is the problem i dont have method DocToParse.GetElementbyClass
if (InputNode != null)
{
if (InputNode.Attributes["value"].Value != null)
{
return InputNode.Attributes["value"].Value;
}
}
return null;
}
Sow i need to read this data to get the date 01.03.14 and 10.02.14 for be able to save this to array list (and then to xml file)
Sow any ideas how can i get this dates(01.03.14 and 10.02.14)?

Html Agility Pack has XPATH support, so you can do something like this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span[#class='" + ClassToGet + "']"))
{
string value = node.InnerText;
// etc...
}
This means: get all SPAN elements from the top of the document (first /), recursively (second /) that have a given CLASS attribute. Then for each element, get the inner text.

Match and replace string in text using regular expressions

I have a large string and it might have the following:
<div id="Specs" class="plinks">
<div id="Specs" class="plinks2">
<div id="Specs" class="sdfsf">
<div id="Specs" class="ANY-OTHER_NAME">
How can I replace values in the string from anything above to:
<div id="Specs" class="">
this is what I came up with, but it does not work:
string source = "bunch of text";
string regex = "<div id=\"Specs\" class=[\"']([^\"']*)[\"']>";
string regexReplaceTo = "<div id=\"Specs\" class=\"\">";
string output = Regex.Replace(source, regex, regexReplaceTo);

What about...
Regex to match : class=\"[A-Za-z0-9_\-]+\"
Replace with : class=\"\"
This way, we ignore the first part (id="Specs", etc) and
just replace the class name... with nothing.

Looks like another case of http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html. What happens to the following valid tags with a Regex?
<div class="reversed" id="Specs">
<div id="Specs" class="additionalSpaces" >
<div id="Specs" class="additionalAttributes" style="" >
I don't see a how using Linq2Xml wouldn't work with any combination:
XElement root = XElement.Parse(xml); // XDocument.Load(xmlFile).Root
var specsDivs = root.Descendants()
.Where(e => e.Name == "div"
&& e.Attributes.Any(a => a.Name == "id")
&& e.Attributes.First(a => a.Name == "id").Value == "Specs"
&& e.Attributes.Any(a => a.Name == "class"));
foreach(var div in specsDivs)
{
div.Attributes.First(a => a.Name == "class").value = string.Empty;
}
string newXml = root.ToString()

If your input isn't XML compliant, which most HTML isn't, then you can use the HTML Agility Pack to parse the HTML and manipulate the contents. With the HTML Agility PAck, combined with Linq or Xpath, the order of your attributes no longer matters (which it does when you use Regex) and the overall stability of your solution increases a lot.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.DescendantNodes("div").Where(div => div.Id == "Specs");
foreach (var node in nodes)
{
var classAttribute = node.Attributes["class"];
if (classAttribute != null)
{
classAttribute.Value = string.Empty;
}
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

xpath and htmlagility pack - c#

Related

HtmlAgilityPack filtering HTML based on a query

Find specific link in html doc c# using HTML Agility Pack

How to parse this HTML text using htmlagilitypack?

How to Get element by class in HtmlAgilityPack

Match and replace string in text using regular expressions

Categories

Resources