Max Value with Substring with HTML Agility Pack - c#

I can't seem to get this xpath query to work with the HTMLAgilityPack with this code and I was wondering if anyone had any suggestions.
This is the query I have so far, but I can't seem to get it to return a number.
DocumentNode.GetAttributeValue("max(a[(@class='shackmsg')]/@href/substring-after(.,?id='))", "");
I'm trying to get the MAX value in the href attribute after the = sign on all hrefs with a class of shackmsg.
How long is the beta live before it goes retail? No one knows. We do know t</span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
</li>
<li id="item_31218936" class="">
<div class="oneline oneline3 op olmod_ontopic olauthor_189801">
<a class="shackmsg" rel="nofollow" href="?id=31218936" onclick="return clickItem( 31218933, 31218936);"><span class="oneline_body"><b><u><span class="jt_yellow">Current Multiplayer Servers</span>!</u></b>
<span class="jt_sample"><span class="jt_green">Nighteyes's Japan Server: </span> <span class="jt_lime">(PvE)</span>: <b>211.15.2.34</b></span>
<span class="jt_sample"><span class="jt_green">zolointo's Canada Server: </span> <span class="jt_lime">(</span></span></span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
</li>
<li id="item_31218938" class="last">
<div class="oneline oneline2 op olmod_ontopic olauthor_189801">
<div class="treecollapse">
<a class="open" rel="nofollow" href="#" onclick="toggle_collapse(31218938); return false;" title="Toggle">toggle</a>
</div>
<a class="shackmsg" rel="nofollow" href="?id=31218938" onclick="return clickItem( 31218933, 31218938);"><span class="oneline_body">Had fun freezing my ass off last night with a bunch of shackers. Not sure who started the big tower we f...</span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
<ul>
<li id="item_31218966" class="">
<div class="oneline oneline1 olmod_ontopic olauthor_128401">
<a class="shackmsg" rel="nofollow" href="?id=31218966" onclick="return clickItem( 31218933, 31218966);"><span class="oneline_body">wasn't me. I hung out on my ship for a bit listening to your kid play Christmas songs for a bit and then ...</span> : </a><span class="oneline_user ">jonin</span><a class="lightningbolt" rel=\"nofollow\" href="http://www.shacknews.com/user/jonin/posts?result_sort=postdate_asc"><img src="http://cf.shacknews.com/images/bolt.gif" alt="This person is cool!" /></a>
</div>
</li>
<li id="item_31219008" class="last">
<div class="oneline oneline0 olmod_ontopic olauthor_8618">
<a class="shackmsg" rel="nofollow" href="?id=31219008" onclick="return clickItem( 31218933, 31219008);"><span class="oneline_body">haha i heard you guys booby trapped some poor sap's space ship</span> : </a><span class="oneline_user ">Break</span><a class="lightningbolt" rel=\"nofollow\" href="http://www.shacknews.com/user/Break/posts?result_sort=postdate_asc"><img src="http://cf.shacknews.com/images/bolt.gif" alt="This person is cool!" /></a>
</div>
</li>
</ul>
Any suggestions?

There are two problems as far as I can see:
You're only scanning for anchor tags in the current context. You probably want to extend to scan everywhere (use // in the beginning of your query):
//a[@class='shackmsg']/@href/substring-after(., '?id=')
Note that I removed a pair of unnecessary parentheses.
If I'm not completely mistaken, HTML Agility Pack only supports XPath 1.0 (yet I'm not totally sure). While System.Xml.XPath says it implements the XPath 2.0 data model, it does not actually implement XPath 2.0 (probably this is done so third party APIs can implement this API and offer XPath 2.0/XQuery support at the same time). Also have a look at this discussion on .NET's XPath 2.0 support.
Missing XPath 2.0 support would show up as two problems:
Function max(...) does not exist in XPath 1.0. (substring-after(...) itself is an XPath 1.0 function, so it is available.)
To extract the number you could use substring-after($string, '?id='), or string-length($string) and substring($string, $start, $length) to extract the last n digits, or translate(...) to remove some characters:
translate('?id=31219008', '?id=', '')
will remove all occurrences of the characters in the set [?id=] (there are no others in this value; I just want to highlight that it does not match strings, but individual characters of this set!).
You cannot apply functions in axis steps. This means, you cannot find the maximum value of substrings.
Possible solution: Only fetch all substrings and find the maximum from outside XPath.
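The XPath 1.0 string functions mentioned above can be tried out without any HTML at all. The following is only a sketch: it assumes the XPath 1.0 engine in System.Xml.XPath (which, as far as I know, HTML Agility Pack builds on) and evaluates the expressions against a dummy one-element XML document. XPath1Demo, Eval and IsSupported are made-up names for this illustration.

```csharp
using System.IO;
using System.Xml.XPath;

public static class XPath1Demo
{
    // Evaluates an XPath expression against a dummy document, so only
    // the function call itself matters, not the document content.
    public static object Eval(string expression)
    {
        var doc = new XPathDocument(new StringReader("<root/>"));
        return doc.CreateNavigator().Evaluate(expression);
    }

    // Returns false when the engine rejects the expression with an XPathException.
    public static bool IsSupported(string expression)
    {
        try
        {
            Eval(expression);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }
}
```

XPath1Demo.Eval("translate('?id=31219008', '?id=', '')") and XPath1Demo.Eval("substring-after('?id=31219008', '?id=')") both give the string 31219008, while an XPath 2.0 function such as max(1) should make IsSupported return false.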

You can combine XPath with HTML Agility Pack and LINQ:
// Parse the ids as numbers so Max() compares them numerically, not as strings.
var value = doc.DocumentNode.SelectNodes("//a[@class='shackmsg']").Select(
x => long.Parse(x.Attributes["href"].Value.Substring(4))).Max();
Console.WriteLine(value);
And this is the output:
31219008
This code assumes that the href attribute always exists and always has the following structure:
"?id=XXXX"

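If the assumption about the href structure does not always hold, the maximum can be computed a bit more defensively. This is a minimal sketch that uses only the standard library; HrefIds and MaxId are hypothetical names, and the "?id=" prefix is taken from the sample HTML in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HrefIds
{
    // Returns the largest numeric id among hrefs of the form "?id=<digits>",
    // or null when no href has that form.
    public static long? MaxId(IEnumerable<string> hrefs)
    {
        const string prefix = "?id=";
        return hrefs
            // Skip missing hrefs and hrefs with a different structure, e.g. "#".
            .Where(h => h != null && h.StartsWith(prefix, StringComparison.Ordinal))
            // Parse numerically so that "?id=10" ranks above "?id=9".
            .Select(h => long.TryParse(h.Substring(prefix.Length), out var id) ? id : (long?)null)
            .Max();
    }
}
```

With HTML Agility Pack the input could come from something like SelectNodes("//a[@class='shackmsg']").Select(a => a.GetAttributeValue("href", "")). Keep in mind that SelectNodes returns null when nothing matches.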
Related

Scraping from a div

I am experimenting with web scraping and I am having trouble scraping a particular value out of some nested div classes. I am using the .NET HtmlAgilityPack class library in a .NET Framework C# Console App. Here is the div code:
<div class="ds-nearby-schools-list">
<div class="ds-school-row">
<div class="ds-school-rating">
<div class="ds-gs-rating-8">
<span class="ds-hero-headline ds-schools-display-rating">8</span>
<span class="ds-rating-denominator ds-legal">/10</span>
</div>
</div>
<div class="ds-nearby-schools-info-section">
<a class="ds-school-name ds-standard-label notranslate" href="https://www.greatschools.org/school?id=00870&state=MD" rel="nofollow noopener noreferrer" target="_blank">Candlewood Elementary School</a>
<ul class="ds-school-info-section">
<li class="ds-school-info">
<span class="ds-school-key ds-body-small">Grades:</span>
<span class="ds-school-value ds-body-small">K-5</span>
</li>
<li class="ds-school-info">
<span class="ds-school-key ds-body-small">Distance:</span>
<span class="ds-school-value ds-body-small">0.8 mi</span>
</li>
</ul>
</div>
</div>
</div>
I want to scrape the "8" from the ds-hero-headline ds-schools-display-rating class. I am having trouble formulating the selector for the SelectNodes method on the DocumentNode object of the HtmlNode.HtmlDocument class.
I guess you might be having trouble writing the XPath to select the node. Try //*[contains(@class, 'ds-hero-headline') and contains(@class, 'ds-schools-display-rating')] with the SelectNodes method.
However, this XPath could cause a problem if the page you're targeting also has a class name like ds-hero-headline-content, which ds-hero-headline partially matches. In that case, see the solution in How can I find an element by CSS class with XPath?
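A common XPath 1.0 idiom for matching a whole class token is to pad the class attribute with spaces before calling contains(). The sketch below shows the difference on a small XML fragment. It assumes System.Xml.XPath rather than HTML Agility Pack so that it stays self-contained, and ClassTokenXPath, HasClass and CountMatches are made-up names.

```csharp
using System.IO;
using System.Xml.XPath;

public static class ClassTokenXPath
{
    // Builds an XPath 1.0 predicate body that matches a complete class token,
    // so "ds-hero-headline" does not match "ds-hero-headline-content".
    public static string HasClass(string className) =>
        $"contains(concat(' ', normalize-space(@class), ' '), ' {className} ')";

    // Counts the nodes selected by an XPath expression in a well-formed XML string.
    public static int CountMatches(string xml, string xpath)
    {
        var navigator = new XPathDocument(new StringReader(xml)).CreateNavigator();
        return navigator.Select(xpath).Count;
    }
}
```

For a document with one span of class "ds-hero-headline ds-schools-display-rating" and one of class "ds-hero-headline-content", "//*[contains(@class, 'ds-hero-headline')]" selects both spans, whereas "//*[" + ClassTokenXPath.HasClass("ds-hero-headline") + "]" selects only the first.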
I would use this to extract 0.8 mi:
//div[@class='ds-nearby-schools-list']/div[@class='ds-school-row']/div[@class='ds-nearby-schools-info-section']/ul[@class='ds-school-info-section']/li[@class='ds-school-info']/span[@class='ds-school-value ds-body-small' and preceding-sibling::span[@class='ds-school-key ds-body-small' and text()='Distance:']]/text()
Then this regex to group data:
^[0-9\.]+ (.*)$
At the end you can use some kind of conversion to save distance to an object.
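The regex and conversion step could look roughly like this. It is a sketch under a few assumptions: the pattern is the one above with an extra capture group around the number (my addition, so that the value can be read as well as the unit), the number uses a dot as decimal separator, and DistanceParser is a hypothetical name.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class DistanceParser
{
    // Group 1 captures the number, group 2 the unit (for example "mi").
    private static readonly Regex Pattern = new Regex(@"^([0-9.]+) (.*)$");

    // Splits text such as "0.8 mi" into a numeric value and a unit.
    public static bool TryParse(string text, out double value, out string unit)
    {
        value = 0;
        unit = null;
        Match match = Pattern.Match(text.Trim());
        if (!match.Success)
        {
            return false;
        }
        // Invariant culture so that "0.8" parses the same on every machine.
        if (!double.TryParse(match.Groups[1].Value, NumberStyles.Float,
                CultureInfo.InvariantCulture, out value))
        {
            return false;
        }
        unit = match.Groups[2].Value;
        return true;
    }
}
```

DistanceParser.TryParse("0.8 mi", out var value, out var unit) yields 0.8 and "mi", while a value such as "K-5" is rejected.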
Have you tried the following to get the 8? You can search for a specific span element with the class name to get the inner text.
Note: I used a text file to load the HTML from your question.
string htmlFile = File.ReadAllText(@"TempFile.html");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlFile);
HtmlNode htmlDoc = doc.DocumentNode;
HtmlNode node = htmlDoc.SelectSingleNode("//span[@class='ds-hero-headline ds-schools-display-rating']");
Console.WriteLine(node.InnerText);
// output: 8
Alternate:
Another way is to specify the path that you want the value from, starting from the div element.
HtmlNode node2 = htmlDoc.SelectSingleNode("//div[@class='ds-gs-rating-8']//span[@class='ds-hero-headline ds-schools-display-rating']");
Console.WriteLine(node2.InnerText);
output
8

Xpath grabbing separate text in between link nodes

I'm currently retrieving text from inside <a> tags utilizing HtmlAgilityPack:
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
lblTest1.Text = lblTest1.Text + ", " + node.InnerText.ToString();
}
and the web code looks like this
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
Currently it is returning to me: Battery (1), Brakes (2), Cables/Lines (1) which is obviously all of the inner text. What I would like to know is how to split the two bits apart so I can store them each in a list for later usage. Something along the lines of: Battery, 1, Brakes, 2, Cables/Lines, 1 so as they are returned to me I can just toss them into lists.
The text in between the <em> tags is the number of results on the page that the <a> is taking you to. I could just parse the entire string after getting the line of text, but I feel as if there is a method to do this automatically with XPath and return one piece at a time for me to handle and store. I am very new to XPath and have been attempting to solve this for multiple days myself to no avail. Any help would be greatly appreciated.
Change your XPath expression to //div[@class='acTrigger']/a//text()[normalize-space()] to select the non-empty text nodes separately.
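To see what that expression returns and how the pieces could be cleaned up, here is a small sketch. It assumes well-formed markup and uses System.Xml.XPath so that it stays self-contained; with HTML Agility Pack the same expression would go into SelectNodes and each node's InnerText would be cleaned the same way. AcTriggerParts and Extract are made-up names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class AcTriggerParts
{
    // Returns the category names and counts in document order,
    // e.g. "Battery", "1", "Brakes", "2".
    public static List<string> Extract(string xml)
    {
        var navigator = new XPathDocument(new StringReader(xml)).CreateNavigator();
        var parts = new List<string>();
        // Each non-blank text node under the link is returned separately.
        XPathNodeIterator nodes =
            navigator.Select("//div[@class='acTrigger']/a//text()[normalize-space()]");
        while (nodes.MoveNext())
        {
            // Trim whitespace first, then the parentheses around the count.
            parts.Add(nodes.Current.Value.Trim().Trim('(', ')'));
        }
        return parts;
    }
}
```

Names end up at the even indexes and counts at the odd ones, so they can be split into two lists afterwards.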

Separated elements overlooked, now combined

Currently in the process of learning MVC and I think it's interfering with HTML code. I have just a basic navigational menu as a list and two <li> items seem to combine into one. Any way to make sure the two are separated when live?
@if ((Request.Url.AbsolutePath.ToString().ToLower() != "/home/index") && (Request.Url.AbsolutePath.ToString() != "/"))
{
<nav data-spy="affix" data-offset-top="500" style="border-radius:0px; left: 0" ng-hide="sideBar" id="nav">
<img src="~/Content/images/open.png" ng-model="sideBar" id="sideBarOpen" style="left:0px; top:0;"/>
<div id="sideBar" style="left: -200px">
<ul>
<li> Home </li>
<li><br /></li>
<li>About Me</li>
#############
<li>Experience</li>
<li>Resume</li>
############ These two seem to be recognized as 1 <li> and not two.
<li>Contact</li>
</ul>
<img src="/Content/images/myPic.jpg" />
</div>
</nav>
<div id="sideBarBack" style="width:0%;">
</div>
}
You're missing a quotation mark in the id="> part. This is not valid HTML, so your web browser tries to work around it, resulting in the two elements being combined.
To fix that instead of:
<li>Experience</li>
use a correct tag attribute id="":
<li>Experience</li>

Extract multiple tags from within a tag

Thanks in advance for any help you can provide. I'm trying to scrape some HTML with HtmlAgilityPack and am having trouble with the XPath syntax. The HTML I'm dealing with has multiple tags I'd like to access, all within a < p >.
<p class="row" data-pid="5687754180">
<a href="/bod/5687754180.html" class="i gallery" data-ids="1:00c0c_fapkFsQg3Dx">
<span class="price">$5000</span>
</a>
<span class="txt">
<span class="pl">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">
<? __("favorite this post") ?>
</span>
</span>
<time datetime="2016-07-17 19:36" title="Sun 17 Jul 07:36:03 PM">Jul 17</time> <a href="/bod/5687754180.html" data-id="5687754180" class="hdrlnk">
<span id="titletextonly">☇☇♔♔♔♔♔1998 Mastercraft Prostar&#12963</span>
</a>
</span>
<span class="l2">
<span class="price">$5000</span>
<span class="pnr">
<span class="px">
<span class="p"> pic</span>
</span>
</span>
</span>
<span class="js-only banish-unbanish">
<span class="banish">
<span class="icon icon-trash" role="button"/>
<span class="screen-reader-text">hide this posting</span>
</span>
<span class="unbanish">
<span class="icon icon-trash red" role="button"/> restore this posting</span>
</span>
</span>
</p>
My thought was that I could iterate over all the < p > tags and get the tags within each that I needed, but it's not working out so well. Here's what I would like to get:
and then move on to the next < p > and get the same thing. I feel like I'm getting close, but am missing something crucial. For example, this snippet gets me the "data-pid" from each < p >, but the "titletextonly" is the same one over and over.
Thanks for any help you can provide!!
Whenever your XPath starts with /, it will always be treated as absolute XPath (in other words, relative to the root document) ignoring current context element, which in this case is referenced by variable title. That said, SelectSingleNode() will always return the first element in the entire document matched by the XPath parameter, regardless of the context element.
To make the XPath relative to context element, you need to add a . at the beginning :
var node = title.SelectSingleNode(".//span[@id='titletextonly']");
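The difference between the absolute and the relative form can be reproduced in a few lines. This sketch assumes well-formed markup and uses XPathNavigator from System.Xml.XPath in place of HtmlNode, which resolves a leading / against the document root in the same way; RelativeXPathDemo and Titles are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class RelativeXPathDemo
{
    // For every <p class="row">, evaluates titleXPath with that <p> as the
    // context node and collects the text of the first match (or null).
    public static List<string> Titles(string xml, string titleXPath)
    {
        var navigator = new XPathDocument(new StringReader(xml)).CreateNavigator();
        var titles = new List<string>();
        XPathNodeIterator rows = navigator.Select("//p[@class='row']");
        while (rows.MoveNext())
        {
            XPathNavigator title = rows.Current.SelectSingleNode(titleXPath);
            titles.Add(title?.Value);
        }
        return titles;
    }
}
```

For two rows whose titles are A and B, passing "//span[@id='titletextonly']" gives A twice, because the search restarts at the document root for every row, while ".//span[@id='titletextonly']" gives A and then B.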

Cleaning up HTML created by contentEditable in c#

I've written a document editor which uses contentEditable to create HTML content. In some larger documents the style of syntax seems to be all over the place. This is most likely a result of content pasted in from WordPad and earlier versions of the editor.
The problem is, now I'm left with a lot of very inconsistent documents.
It starts off fairly normal. Simple <p> tags for each line
<p>It is a truth</p>
<p>universally acknowledged</p>
<p>that a single man</p>
The only "bad" HTML up to this point is a few empty <i></i> tags, and the occasional &nbsp; instead of whitespace (anyone know why?)
Then, about halfway down the document, the line breaks switch to this format.
<div>
<br>
CHAPTER 1<br>
<br>
The sky above the port
<br>
was the color of a television
<br>
tuned to a dead channel.
</div>
<div>
<br>
</div>
Then about 3/4 down the page, we get this. It seems to have reverted to <p></p> tags, but now embeds them randomly in <span> tags with empty lang attributes
<div>
<span lang="">
<p>It was the best of times,</p>
<p>it was the worst of times,</p>
</span>
<p>it was the age of wisdom,</p>
<p>it was the age of foolishness,</p>
</div>
Note: some lines are inside a <span>, others are outside.
Worse, later on we get nested <span> tags
<span lang="">
<div>
<span lang="EN-GB">
<p>Stately, plump </p>
<p>Buck Mulligan came </p>
<span lang="EN-GB">
<p>from the stairhead, </p>
<p>bearing a bowl of lather </p>
<span lang="EN-GB">
<p> on which a mirror and a razor lay crossed</p>
</span>
</span>
</span>
</div>
</span>
You may also notice the parentage of the <span> and <div> tags is now reversed at the outset, with the <div> now a child of the <span>
I've noticed other oddities. <i></i> is used at the start but later <em></em> is used.
What's the best way to clean this HTML up?
Should I try and surround orphaned lines with <p> tags?
How do I remove only those <div> tags which contain <p> tags themselves? And how do I avoid leaving orphaned text in the document?
This is a hard question. I had the same problem when editing HTML from texts.
I found this free, pure HTML + JS editor: TinyMCE
http://www.tinymce.com/
It includes text cleaning options, and you can choose the tags you want to clean from the text.
It is very powerful if you have the chance to change the editor you are using.
