Scraping from a div

I am experimenting with web scraping and I am having trouble scraping a particular value out of some nested div classes. I am using the .NET HtmlAgilityPack class library in a .NET Framework C# Console App. Here is the div code:
<div class="ds-nearby-schools-list">
<div class="ds-school-row">
<div class="ds-school-rating">
<div class="ds-gs-rating-8">
<span class="ds-hero-headline ds-schools-display-rating">8</span>
<span class="ds-rating-denominator ds-legal">/10</span>
</div>
</div>
<div class="ds-nearby-schools-info-section">
<a class="ds-school-name ds-standard-label notranslate" href="https://www.greatschools.org/school?id=00870&state=MD" rel="nofollow noopener noreferrer" target="_blank">Candlewood Elementary School</a>
<ul class="ds-school-info-section">
<li class="ds-school-info">
<span class="ds-school-key ds-body-small">Grades:</span>
<span class="ds-school-value ds-body-small">K-5</span>
</li>
<li class="ds-school-info">
<span class="ds-school-key ds-body-small">Distance:</span>
<span class="ds-school-value ds-body-small">0.8 mi</span>
</li>
</ul>
</div>
</div>
</div>
I want to scrape the "8" from the span with the classes ds-hero-headline and ds-schools-display-rating. I am having trouble formulating the selector for the SelectNodes method on the DocumentNode property of the HtmlDocument class.

I guess you are having trouble writing the XPath to select the node. Try //*[contains(@class, 'ds-hero-headline') and contains(@class, 'ds-schools-display-rating')] with the SelectNodes method.
However, this XPath could be a problem if the page you're targeting also has a class name like ds-hero-headline-content, which ds-hero-headline partially matches. In that case, see the solution in How can I find an element by CSS class with XPath?
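The partial-match problem can be avoided by comparing whole class tokens instead of substrings. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the sample markup, including the ds-hero-headline-content span, is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class ClassTokenExample
{
    static void Main()
    {
        // Hypothetical markup: the second span has a class that only partially matches.
        string html =
            "<div><span class=\"ds-hero-headline ds-schools-display-rating\">8</span>" +
            "<span class=\"ds-hero-headline-content\">not a rating</span></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class list with spaces so that only whole class tokens match.
        string xpath =
            "//span[contains(concat(' ', normalize-space(@class), ' '), ' ds-hero-headline ')" +
            " and contains(concat(' ', normalize-space(@class), ' '), ' ds-schools-display-rating ')]";

        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.InnerText.Trim());
            }
        }
    }
}
```

With this markup, only the first span should be selected, because ds-hero-headline-content is never equal to the whole token ds-hero-headline.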

I would use this to extract 0.8 mi:
//div[@class='ds-nearby-schools-list']/div[@class='ds-school-row']/div[@class='ds-nearby-schools-info-section']/ul[@class='ds-school-info-section']/li[@class='ds-school-info']/span[@class='ds-school-value ds-body-small' and preceding-sibling::span[@class='ds-school-key ds-body-small' and text()='Distance:']]/text()
Then use this regex, whose group captures the unit that follows the number:
^[0-9\.]+ (.*)$
Finally, you can convert the result and store the distance in an object.
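As a sketch of that last step, the snippet below parses the extracted text into a number and a unit. It uses a variant of the pattern with a second capture group so that the number is captured too; the variable names are illustrative only.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

class DistanceExample
{
    static void Main()
    {
        string raw = "0.8 mi"; // text of the Distance span from the question

        // Group 1 captures the number, group 2 the unit.
        Match m = Regex.Match(raw, @"^([0-9.]+) (.*)$");
        if (m.Success)
        {
            // Invariant culture so that "0.8" parses the same on every machine.
            double value = double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
            string unit = m.Groups[2].Value;
            Console.WriteLine(value.ToString(CultureInfo.InvariantCulture) + " | " + unit);
            // prints: 0.8 | mi
        }
    }
}
```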

Have you tried the following to get the 8? You can search for the specific span element by its class name and read its inner text.
Note: I loaded the HTML from your question from a text file.
string htmlFile = File.ReadAllText(@"TempFile.html");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlFile);
HtmlNode htmlDoc = doc.DocumentNode;
HtmlNode node = htmlDoc.SelectSingleNode("//span[@class='ds-hero-headline ds-schools-display-rating']");
Console.WriteLine(node.InnerText);
// output: 8
Alternate:
Another way is to specify the path that you want the value from, starting from the div element.
HtmlNode node2 = htmlDoc.SelectSingleNode("//div[@class='ds-gs-rating-8']//span[@class='ds-hero-headline ds-schools-display-rating']");
Console.WriteLine(node2.InnerText);
// output: 8

Related

TagBuilder Find Specific Inner Element and Add New Attribute

I have a TagBuilder which contains outer and inner elements. How do I traverse to the input element and add the following as a new attribute?
placeholder="Search"
<div class="form-group">
<div class="cont label-outside">
<label>Name</label>
<div class="group">
<input type="text" required="required" class="focusedOut">
<span class="highlight"></span>
<span class="bar"></span>
<span class="close-button" onclick="clear()"></span>
</div>
</div>
</div>
If no TagBuilder/TagHelper method exists, should the C# HTML Agility Pack be used to edit the tag tree? https://html-agility-pack.net/traversing If yes, how do I convert the TagBuilder or TagHelper output into an HTML Agility Pack HtmlDocument?
*This is different from the question "Add CSS Class to All Tags in TagBuilder, Edit Existing Attribute", which asks how to edit an existing attribute. This question is about adding a new attribute. See the article below:
Why isn't it good to ask multiple questions and answers in one question

For your task you can use HtmlAgilityPack.
With HtmlAgilityPack you can use an XPath query to select the necessary nodes and then add the attribute to each of them.
To select the nodes, use the SelectNodes method:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//input[contains(@class, 'focusedOut')]");
To add the attribute to one of the selected nodes (called node here), use its Attributes collection:
node.Attributes.Add("placeholder","Search");
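Put together, the steps might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that the TagBuilder output has already been rendered to a string (the html variable here is a shortened, hand-written stand-in for that output).

```csharp
using System;
using HtmlAgilityPack;

class PlaceholderExample
{
    static void Main()
    {
        // Assumption: this string stands in for the rendered TagBuilder output.
        string html =
            "<div class=\"group\">" +
            "<input type=\"text\" required=\"required\" class=\"focusedOut\">" +
            "</div>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        HtmlNodeCollection nodes =
            htmlDoc.DocumentNode.SelectNodes("//input[contains(@class, 'focusedOut')]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // Adds the attribute, or overwrites it if it already exists.
                node.SetAttributeValue("placeholder", "Search");
            }
        }

        // Serialize the modified tree back to markup.
        Console.WriteLine(htmlDoc.DocumentNode.OuterHtml);
    }
}
```

SetAttributeValue is used here instead of Attributes.Add so that running the code twice does not add the attribute a second time.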

c# HtmlAgilityPack select main nodes

I'm trying to figure out how I can select only the main (top-level) nodes from a loaded HTML document, as in the following example:
<div id="main">
<p>paragraph 1</p>
<p>paragraph 2</p>
<img src="exzample.jpg" />
</div>
<div id="main2">
<div>some text</div>
<p>some text</p>
<img src="exzample.jpg" />
</div>
<p class="a_class">
<div>some text</div>
<span>some text</span>
</p>
I know I can iterate over all elements, but in my case I only need to get the 3 top-level blocks (in this example) from the loaded HTML. I do not know how to select such nodes using the SelectNodes function or any other function.
I'm using HtmlAgilityPack library.
Note: Main nodes can be any html tag (div, p, span and so on...)

/* will select all immediate children of the root node (which the document posted in this question is lacking).
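An alternative that avoids XPath is to read the direct children of DocumentNode and keep only the element nodes. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the markup is a simplified stand-in for the example in the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class TopLevelExample
{
    static void Main()
    {
        // Simplified, hypothetical fragment with three top-level elements and no single root.
        string html =
            "<div id=\"main\"><p>paragraph 1</p></div>\n" +
            "<div id=\"main2\"><p>some text</p></div>\n" +
            "<span class=\"a_class\">some text</span>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ChildNodes also contains text and comment nodes, so keep elements only.
        var topLevel = doc.DocumentNode.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Element);

        foreach (HtmlNode node in topLevel)
        {
            Console.WriteLine(node.Name + " " + node.GetAttributeValue("id", "(no id)"));
        }
    }
}
```

Note that how the parser nests invalid markup (such as a div inside a p) can affect which nodes end up at the top level, so it is worth checking the result against the real page.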

Xpath grabbing separate text in between link nodes

I'm currently retrieving text from inside <a> tags utilizing HtmlAgilityPack:
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
lblTest1.Text = lblTest1.Text + ", " + node.InnerText.ToString();
}
and the web code looks like this
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
Currently it is returning to me: Battery (1), Brakes (2), Cables/Lines (1) which is obviously all of the inner text. What I would like to know is how to split the two bits apart so I can store them each in a list for later usage. Something along the lines of: Battery, 1, Brakes, 2, Cables/Lines, 1 so as they are returned to me I can just toss them into lists.
The text between the <em> tags is the number of results on the page that the <a> takes you to. I could just parse the entire string after getting the line of text, but I feel there should be a way to do this directly with XPath and return one piece at a time for me to handle and store. I am very new to XPath and have been trying to solve this myself for several days, to no avail. Any help would be greatly appreciated.

Change your XPath expression to //div[@class='acTrigger']/a//text()[normalize-space()] to select the separate, non-empty text nodes.
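A sketch of how the separate text nodes could be collected into a list, assuming the HtmlAgilityPack NuGet package is referenced. The markup is shortened from the question, and trimming the parentheses is an extra step that the XPath alone does not perform.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class TextNodeExample
{
    static void Main()
    {
        // Shortened version of the markup from the question.
        string html =
            "<li><div class=\"acTrigger\"><a href=\"/16014988/d/\">Battery <em> (1)</em></a></div></li>" +
            "<li><div class=\"acTrigger\"><a href=\"/15568540/d/\">Brakes <em> (2)</em></a></div></li>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes(
            "//div[@class='acTrigger']/a//text()[normalize-space()]");

        // SelectNodes returns null when nothing matches.
        if (textNodes != null)
        {
            foreach (HtmlNode text in textNodes)
            {
                // Strip surrounding whitespace, then the parentheses around the count.
                values.Add(text.InnerText.Trim().Trim('(', ')'));
            }
        }

        Console.WriteLine(string.Join(", ", values));
    }
}
```

Each link should contribute two entries, the category name followed by its count, which matches the Battery, 1, Brakes, 2 shape asked for in the question.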

c# selenium finding element using xpath

I am trying to find an element which is a div inside a div...
here is example of the code:
<div class="col-md-4">
<div style="display: none;" id="multiplier-win" class="label label-success multiplier">2X</div>
<div style="display: block;" id="multiplier-lose" class="label label-danger multiplier">0X</div>
<div style="display: none;" id="multiplier-tie" class="label label-warning multiplier">1X</div>
</div>
I want to find the element with class="label label-success multiplier" and check whether its style is "display: none".
How do I write this in c#?
Please help me
thank you!

In your case, the elements have a unique ID. So instead of finding them by class name (which could lead to multiple or inaccurate results), you should use By.Id(...). It is also easier to write by hand than XPath.
Let's say your IWebDriver instance is called driver. The code looks like this:
IWebElement element = driver.FindElement(By.Id("multiplier-win"));
String style = element.GetAttribute("style");
...
I don't want to offend you, but you should probably use Google before you post here. This is very basic code that you will find in multiple tutorials about Selenium.
Edit: In case you are looking for multiple elements of a class:
ReadOnlyCollection<IWebElement> elements = driver.FindElements(By.ClassName("..."));
foreach (IWebElement el in elements)
{
...
}

To Find the element:
IWebElement element = driver.FindElement(By.XPath("//div[@class='label label-success multiplier']"));
To check if an element is displayed, this returns a bool (true if displayed, false if not displayed). If you go with philn's element list code, you can throw this line into his foreach statement and it will tell you which ones are displayed.
el.Displayed;

Max Value with Substring with HTML Agility Pack

I can't seem to get this xpath query to work with the HTMLAgilityPack with this code and I was wondering if anyone had any suggestions.
This is the query I have so far, but I can't seem to get it to return a number.
DocumentNode.GetAttributeValue("max(a[(@class='shackmsg')]/@href/substring-after(.,?id='))", "");
I'm trying to get the MAX value in the href attribute after the = sign on all hrefs with a class of shackmsg.
How long is the beta live before it goes retail? No one knows. We do know t</span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
</li>
<li id="item_31218936" class="">
<div class="oneline oneline3 op olmod_ontopic olauthor_189801">
<a class="shackmsg" rel="nofollow" href="?id=31218936" onclick="return clickItem( 31218933, 31218936);"><span class="oneline_body"><b><u><span class="jt_yellow">Current Multiplayer Servers</span>!</u></b>
<span class="jt_sample"><span class="jt_green">Nighteyes's Japan Server: </span> <span class="jt_lime">(PvE)</span>: <b>211.15.2.34</b></span>
<span class="jt_sample"><span class="jt_green">zolointo's Canada Server: </span> <span class="jt_lime">(</span></span></span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
</li>
<li id="item_31218938" class="last">
<div class="oneline oneline2 op olmod_ontopic olauthor_189801">
<div class="treecollapse">
<a class="open" rel="nofollow" href="#" onclick="toggle_collapse(31218938); return false;" title="Toggle">toggle</a>
</div>
<a class="shackmsg" rel="nofollow" href="?id=31218938" onclick="return clickItem( 31218933, 31218938);"><span class="oneline_body">Had fun freezing my ass off last night with a bunch of shackers. Not sure who started the big tower we f...</span> : </a><span class="oneline_user ">legsbrogan</span>
</div>
<ul>
<li id="item_31218966" class="">
<div class="oneline oneline1 olmod_ontopic olauthor_128401">
<a class="shackmsg" rel="nofollow" href="?id=31218966" onclick="return clickItem( 31218933, 31218966);"><span class="oneline_body">wasn't me. I hung out on my ship for a bit listening to your kid play Christmas songs for a bit and then ...</span> : </a><span class="oneline_user ">jonin</span><a class="lightningbolt" rel=\"nofollow\" href="http://www.shacknews.com/user/jonin/posts?result_sort=postdate_asc"><img src="http://cf.shacknews.com/images/bolt.gif" alt="This person is cool!" /></a>
</div>
</li>
<li id="item_31219008" class="last">
<div class="oneline oneline0 olmod_ontopic olauthor_8618">
<a class="shackmsg" rel="nofollow" href="?id=31219008" onclick="return clickItem( 31218933, 31219008);"><span class="oneline_body">haha i heard you guys booby trapped some poor sap's space ship</span> : </a><span class="oneline_user ">Break</span><a class="lightningbolt" rel=\"nofollow\" href="http://www.shacknews.com/user/Break/posts?result_sort=postdate_asc"><img src="http://cf.shacknews.com/images/bolt.gif" alt="This person is cool!" /></a>
</div>
</li>
</ul>
Any suggestions?

There are two problems as far as I can see:
You're only scanning for anchor tags in the current context. You probably want to extend to scan everywhere (use // in the beginning of your query):
//a[@class='shackmsg']/@href/substring-after(., '?id=')
Note that I removed a pair of unnecessary parenthesis.
If I'm not completely mistaken, HTML Agility Pack only supports XPath 1.0 (yet I'm not totally sure). While System.Xml.XPath says it implements the XPath 2.0 data model, it does not actually implement XPath 2.0 (probably this is done so third party APIs can implement this API and offer XPath 2.0/XQuery support at the same time). Also have a look at this discussion on .NET's XPath 2.0 support.
Missing XPath 2.0 support shows up as two problems:
The function max(...) does not exist in XPath 1.0. substring-after(...) is available there, but only as an ordinary function call on a single string.
To extract the number from a single string you can use substring-after(...), string-length($string) together with substring($string, $start, $length) to take the last n digits, or translate(...) to remove some characters:
translate('?id=31219008', '?id=', '')
This removes every occurrence of the characters ?, i, d and = from the first argument. Note that translate does not match the string '?id=' as a whole, only the individual characters of that set.
You cannot apply functions in axis steps. This means you cannot compute the substrings for all matched attributes, or their maximum, inside a single XPath 1.0 expression.
Possible solution: fetch all the href values, extract the substrings, and find the maximum outside XPath.
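The per-character behaviour of translate can be checked with the XPath 1.0 engine that ships with .NET. This is a small sketch using only System.Xml; the second input string is made up to show that the letters are removed individually.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

class TranslateExample
{
    static void Main()
    {
        // Any document works; the expressions below only use string literals.
        var doc = new XmlDocument();
        doc.LoadXml("<root/>");
        XPathNavigator nav = doc.CreateNavigator();

        // Removes the characters ?, i, d and = wherever they occur.
        Console.WriteLine(nav.Evaluate("translate('?id=31219008', '?id=', '')"));
        // prints: 31219008

        // Hypothetical input: "did=" is not the string "?id=", but every one of
        // its characters is in the set, so all four are removed.
        Console.WriteLine(nav.Evaluate("translate('did=12', '?id=', '')"));
        // prints: 12
    }
}
```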

You can combine XPath in HTML Agility Pack with LINQ (using System.Linq) and write the following code:
var value = doc.DocumentNode.SelectNodes("//a[@class='shackmsg']")
    .Select(x => long.Parse(x.Attributes["href"].Value.Substring(4))) // drop the leading "?id="
    .Max();
Console.WriteLine(value);
The ids are parsed as numbers so that Max compares them numerically rather than as strings. This outputs:
31219008
This code assumes that every matched anchor has an href attribute and that it always has the following structure:
"?id=XXXX"
