c# HtmlAgilityPack select main nodes

I'm trying to figure out how I can select only the main (top-level) nodes from a loaded HTML document, as in the following example:
<div id="main">
<p>paragraph 1</p>
<p>paragraph 2</p>
<img src="exzample.jpg" />
</div>
<div id="main2">
<div>some text</div>
<p>some text</p>
<img src="exzample.jpg" />
</div>
<p class="a_class">
<div>some text</div>
<span>some text</span>
</p>
I know I can iterate over all elements, but in my case I need only the 3 top-level blocks (in this example) from the loaded HTML. I do not know how to select such nodes using the SelectNodes function or any other function.
I'm using the HtmlAgilityPack library.
Note: Main nodes can be any HTML tag (div, p, span, and so on).

The XPath /* selects all immediate child elements of the root node (which the document posted in this question lacks).
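As a minimal sketch of the same idea in C#, the top-level elements can also be collected by filtering the children of DocumentNode, which HtmlAgilityPack uses as the document root. The class name, method name, and sample markup below are hypothetical; passing "/*" to DocumentNode.SelectNodes should give an equivalent result.

```csharp
// Requires the HtmlAgilityPack NuGet package.
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TopLevelNodesDemo
{
    // Returns the element nodes that sit directly under the document root,
    // skipping whitespace text nodes and comments.
    public static List<HtmlNode> GetTopLevelElements(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.ChildNodes
                  .Where(n => n.NodeType == HtmlNodeType.Element)
                  .ToList();
    }

    public static void Main()
    {
        // Simplified sample input (an assumption, not the asker's exact markup).
        const string html =
            "<div id=\"main\"><span>paragraph 1</span></div>\n" +
            "<div id=\"main2\"><div>some text</div></div>\n" +
            "<span class=\"a_class\">some text</span>";

        foreach (HtmlNode node in GetTopLevelElements(html))
        {
            // Print the tag name and the id attribute (empty if absent).
            Console.WriteLine(node.Name + " id='" + node.GetAttributeValue("id", "") + "'");
        }
    }
}
```

Filtering on NodeType keeps the result independent of the tag name, which matches the note that a main node can be any HTML tag.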

Related

Selenium: How to find any ancestor based on one of ancestor's attribute using IWebElement.FindElement()

Let's say I have this HTML
<div class="item-wrapper">
<div>
<h6>My Header 1</h6>
<div>
<div>
<div>
<label>
<input type="checkbox" />
<span class="class1">Text 2</span>
<span class="class2"></span>
</label>
<div>
</div>
</div>
<div class="item-wrapper">
<div>
<h6>My Header 2</h6>
<div>
<div>
<div>
<label>
<input type="checkbox" />
<span class="class1">Text 2</span>
<span class="class2"></span>
</label>
<div>
</div>
</div>
How do I get from any child to the common ancestor, i.e. <div class="item-wrapper">? In this case the identifying attribute is the class, but it could be any attribute that identifies the common ancestor.
var xPathToAncestor = "ancestor::div[@class='item-wrapper']";
var ancestor = child.FindElement(By.XPath(xPathToAncestor));
I've tried many combinations, such as //ancestor::div[@class='item-wrapper'] and .//ancestor::div[@class='item-wrapper'], but nothing is working.

You can use
//div[@class='item-wrapper']//span
This finds span tags that are direct or indirect children (descendants) of the div.
To find direct children only:
//div[@class='item-wrapper']/span
How to use ancestor:
//ancestor::div[@class="item-wrapper"]
This finds a div element with class item-wrapper that is an ancestor of any node in the document.
//span/ancestor::div[@class="item-wrapper"]
This finds a div element with class item-wrapper that is an ancestor of a span tag.
In your case, just change the expression to
var xPathToAncestor = "./ancestor::div[@class='item-wrapper']";
Note the leading '.': it makes the child element the context node of the XPath, so the search starts from that element rather than from the document root.
//ancestor::div[@class="item-wrapper"]/span
finds a span tag that is a direct child of a div with class="item-wrapper".
//ancestor::div[@class="item-wrapper"]//span
finds a span tag that is a direct or indirect child of a div with class="item-wrapper".
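The effect of the context node can be checked without a browser. The sketch below is not Selenium code: it uses System.Xml.Linq from the .NET base class library on a simplified, well-formed version of the markup, and the id attributes (w1, w2) and class and method names are hypothetical additions for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class AncestorAxisDemo
{
    // Evaluates the relative XPath with `child` as the context node,
    // so only ancestors of that particular element are considered.
    public static XElement FindWrapper(XElement child)
    {
        return child.XPathSelectElement("./ancestor::div[@class='item-wrapper']");
    }

    public static void Main()
    {
        // Simplified stand-in for the page; ids are added only to tell the wrappers apart.
        XElement root = XElement.Parse(
            "<body>" +
            "<div class='item-wrapper' id='w1'><div><h6>My Header 1</h6>" +
            "<label><span class='class1'>Text 1</span></label></div></div>" +
            "<div class='item-wrapper' id='w2'><div><h6>My Header 2</h6>" +
            "<label><span class='class1'>Text 2</span></label></div></div>" +
            "</body>");

        foreach (XElement span in root.Descendants("span"))
        {
            XElement wrapper = FindWrapper(span);
            string id = wrapper == null ? "(none)" : (string)wrapper.Attribute("id");
            Console.WriteLine(span.Value + " -> " + id);
        }
    }
}
```

Each span should resolve to its own wrapper, which is the behaviour the './ancestor::' form is meant to give when the expression is evaluated from a child element.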

To get to any child with respect to the common parent <div class="item-wrapper">, you can use the following XPath-based locator strategies:
Getting <input type="checkbox" /> using XPath:
//div[@class='item-wrapper']//h6[text()='My Header']//following::label[1]/input[1]
Getting <span class="class1">Text 1</span> using XPath:
//div[@class='item-wrapper']//h6[text()='My Header']//following::label[1]//span[@class='class1']
Update
As per your question update, to get the common parent <div class="item-wrapper"> you can use the following XPath-based locator strategies:
Using the ancestor axis and the text Text 2:
//span[text()='Text 2']//ancestor::div[@class='item-wrapper']
Using the ancestor axis and the class="class1" attribute:
//span[@class='class1']//ancestor::div[@class='item-wrapper']

You could try using CSS selectors instead of XPath.
Using the HTML from your question as an example:
<div class="item-wrapper">
<div>
<h6>My Header</h6>
<div>
<div>
<div>
<label>
<input type="checkbox" />
<span class="class1">Text 1</span>
<span class="class2"></span>
</label>
<div>
</div>
</div>
To target class="class1" you would use this:
var classOne = driver.FindElement(By.CssSelector(".item-wrapper > div:nth-of-type(2) > div > label > .class1"));
This locates the item-wrapper class,
then the 2nd div (as the 1st div contains the header),
then the next div
then the label
then whatever has the class class1
Alternatively you could use the span instead of the class name, which makes it var classOne = driver.FindElement(By.CssSelector(".item-wrapper > div:nth-of-type(2) > div > label > span:nth-of-type(1)"));, because you want the first span rather than the 2nd.
Note that the CSS selector for a class is a leading dot (.). There are other CSS selectors that you can use.
This answer explains using CSS Selectors too.

Scraping from a div

I am experimenting with web scraping and I am having trouble scraping a particular value out of some nested div classes. I am using the .NET HtmlAgilityPack class library in a .NET Framework C# Console App. Here is the div code:
<div class="ds-nearby-schools-list">
<div class="ds-school-row">
<div class="ds-school-rating">
<div class="ds-gs-rating-8">
<span class="ds-hero-headline ds-schools-display-rating">8</span>
<span class="ds-rating-denominator ds-legal">/10</span>
</div>
</div>
<div class="ds-nearby-schools-info-section">
<a class="ds-school-name ds-standard-label notranslate" href="https://www.greatschools.org/school?id=00870&state=MD" rel="nofollow noopener noreferrer" target="_blank">Candlewood Elementary School</a>
<ul class="ds-school-info-section">
<li class="ds-school-info">
<span class="ds-school-key ds-body-small">Grades:</span>
<span class="ds-school-value ds-body-small">K-5</span>
</li>
<li class="ds-school-info">
<span class="ds-school-key ds-body-small">Distance:</span>
<span class="ds-school-value ds-body-small">0.8 mi</span>
</li>
</ul>
</div>
</div>
</div>
I want to scrape the "8" from the ds-hero-headline ds-schools-display-rating class. I am having trouble formulating the selector for the SelectNodes method on the DocumentNode object of the HtmlNode.HtmlDocument class.

I guess you are having trouble writing the XPath to select the node. Try //*[contains(@class, 'ds-hero-headline') and contains(@class, 'ds-schools-display-rating')] with the SelectNodes method.
However, this XPath has a problem if the page you're targeting also has a class name like ds-hero-headline-content, because contains(@class, 'ds-hero-headline') matches it as a substring. In that case, see the solution in How can I find an element by CSS class with XPath?
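A common way to avoid the substring problem is to pad the class attribute with spaces and match the class name as a whole token. The following is a minimal sketch with HtmlAgilityPack, assuming the class names contain no quote characters; the HasClass helper and the sample markup are hypothetical.

```csharp
// Requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

public static class ClassTokenDemo
{
    // Builds an XPath predicate that matches `token` only as a whole
    // space-separated class name, never as part of a longer name.
    public static string HasClass(string token)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')";
    }

    public static HtmlNode FindRating(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        string xpath = "//span[" + HasClass("ds-hero-headline") +
                       " and " + HasClass("ds-schools-display-rating") + "]";
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    public static void Main()
    {
        // The first span is a decoy whose class only contains the token as a substring.
        const string html =
            "<div class='ds-gs-rating-8'>" +
            "<span class='ds-hero-headline-content'>not this one</span>" +
            "<span class='ds-hero-headline ds-schools-display-rating'>8</span>" +
            "</div>";

        HtmlNode node = FindRating(html);
        Console.WriteLine(node == null ? "(no match)" : node.InnerText);
    }
}
```

Unlike an exact @class='...' comparison, this form does not depend on the order of the class names or on extra classes being present.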

I would use this to extract 0.8 mi
//div[@class='ds-nearby-schools-list']/div[@class='ds-school-row']/div[@class='ds-nearby-schools-info-section']/ul[@class='ds-school-info-section']/li[@class='ds-school-info']/span[@class='ds-school-value ds-body-small' and preceding-sibling::span[@class='ds-school-key ds-body-small' and text()='Distance:']]/text()
Then this regex to group data:
^[0-9\.]+ (.*)$
At the end you can use some kind of conversion to save distance to an object.
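One possible shape for that conversion step is sketched below. It uses a variation of the regex above with a second capture group around the number (the original groups only the unit); the DistanceParser class and TryParse method are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DistanceParser
{
    // Variation of the regex above: group 1 is the number, group 2 is the unit.
    private static readonly Regex Pattern = new Regex(@"^([0-9.]+) (.*)$");

    // Splits text such as "0.8 mi" into a numeric value and a unit string.
    public static bool TryParse(string text, out double value, out string unit)
    {
        value = 0;
        unit = "";
        Match m = Pattern.Match(text.Trim());
        if (!m.Success)
        {
            return false;
        }
        unit = m.Groups[2].Value;
        // Invariant culture so that "0.8" parses the same on every machine.
        return double.TryParse(m.Groups[1].Value, NumberStyles.Float,
                               CultureInfo.InvariantCulture, out value);
    }

    public static void Main()
    {
        double value;
        string unit;
        if (TryParse("0.8 mi", out value, out unit))
        {
            Console.WriteLine(value.ToString(CultureInfo.InvariantCulture) + " " + unit);
        }
    }
}
```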

Have you tried the following to get the 8? You can search for the specific span element by its class name and read its inner text.
Note: I used a text file to load the HTML from your question.
// Requires: using System; using System.IO; using HtmlAgilityPack;
string htmlFile = File.ReadAllText(@"TempFile.html");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlFile);
HtmlNode htmlDoc = doc.DocumentNode;
HtmlNode node = htmlDoc.SelectSingleNode("//span[@class='ds-hero-headline ds-schools-display-rating']");
Console.WriteLine(node.InnerText);
// output: 8
Alternate:
Another way is to specify the path that you want the value from, starting from the div element.
HtmlNode node2 = htmlDoc.SelectSingleNode("//div[@class='ds-gs-rating-8']//span[@class='ds-hero-headline ds-schools-display-rating']");
Console.WriteLine(node2.InnerText);
// output: 8

How to define the main DIV if one of its children contains some text? Using C# & Selenium

The main DIV contains text in one of its children and also contains a button that I need to click if the text is present. How can I locate the main div, so that I can continue working with it, when one of its children contains exactly the text I need?
The structure looks like this:
<Div class="Green"> (Main Div i mentioned in description)
<Div class="Yel">
<Div class="Ora">
<Div class="Pur">
<span>text must be present</span>
I need to locate the main div so that I can then proceed from it with FindElement.

Have you tried the below?
Browser.FindElement(By.XPath("//div[@class='Green' and .//span[normalize-space(.)='137']]//button[@class='needed_item']")).Click();
or
Browser.FindElement(By.XPath("//span[normalize-space(.)='137']/ancestor::div[@class='Green']//button[@class='needed_item']")).Click();
These XPaths first find the main div that has a descendant span with the text 137, and then click the button with the class needed_item inside it. Note the leading '.' in .//span in the first one: without it, //span inside the predicate would search the whole document rather than only inside the div. I assumed the structure below for the XPath.
<div understand='main_div' class='Green'>
<div understand='child_div1'></div>
<div understand='child_div2'>
<span understand='target_span'> 137 </span>
</div>
<div understand='child_div3'></div>
<div understand='div with button'>
<button class='needed_item'>Target Button</button>
</div>
</div>
Let me know if there is any change in the structure.

Cleaning up HTML created by contentEditable in c#

I've written a document editor which uses contentEditable to create HTML content. In some larger documents the style of the markup is all over the place. This is most likely a result of content pasted in from WordPad and from earlier versions of the editor.
The problem is, now I'm left with a lot of very inconsistent documents.
It starts off fairly normal. Simple <p> tags for each line
<p>It is a truth</p>
<p>universally acknowledged</p>
<p>that a single man</p>
The only "bad" HTML up to this point is a few empty <i></i> tags, and the occasional &nbsp; instead of whitespace (anyone know why?)
Then, about halfway down the document, the line breaks switch to this format.
<div>
<br>
CHAPTER 1<br>
<br>
The sky above the port
<br>
was the color of a television
<br>
tuned to a dead channel.
</div>
<div>
<br>
</div>
Then about 3/4 down the page, we get this. It seems to have reverted to <p></p> tags, but now embeds them randomly in <span> tags with empty lang attributes
<div>
<span lang="">
<p>It was the best of times,</p>
<p>it was the worst of times,</p>
</span>
<p>it was the age of wisdom,</p>
<p>it was the age of foolishness,</p>
</div>
Note: some lines are inside a <span>, others are outside.
Worse, later on we get nested <span> tags
<span lang="">
<div>
<span lang="EN-GB">
<p>Stately, plump </p>
<p>Buck Mulligan came </p>
<span lang="EN-GB">
<p>from the stairhead, </p>
<p>bearing a bowl of lather </p>
<span lang="EN-GB">
<p> on which a mirror and a razor lay crossed</p>
</span>
</span>
</span>
</div>
</span>
You may also notice the parentage of the <span> and <div> tags is now reversed at the outset, with the <div> now a child of the <span>
I've noticed other oddities. <i></i> is used at the start but later <em></em> is used.
What's the best way to clean this HTML up?
Should I try and surround orphaned lines with <p> tags?
How do I remove only those <div> tags which contain <p> tags themselves? And how do I avoid leaving orphaned text in the document?

This is a hard question; I had the same problem when editing HTML generated from text.
I found this free, pure HTML + JS editor: TinyMCE
http://www.tinymce.com/
It includes text cleanup options, and you can choose which tags you want to clean from the text.
It is very powerful, if you have the chance to change the editor you are using.

Visual Studio collapsing html code

I have some HTML code, shown below, which was supplied by our graphics developers. The issue is that when I import it into an ASP.NET (C#) page I see a lot of orphan divs. It looks as if there are no opening divs for several of the closing divs. The following is a code snippet.
<div class="col-lg-2 col-lg-3 quick-launch">
<div class="thumbnail">
<a href=""> <img src="assets/img/app_images/app_7.jpg" width="115" height="114">
<div class="caption">
<h3>TEST</h3>
</a></div>
</div>
</div>
Could someone please let me know if there is something in Visual Studio that is causing this?

You're inverting the <div> and <a> closing tags. HTML parsers tolerate this (it is not well-formed XHTML, so you'd better check your DOCTYPE), but it may confuse the Visual Studio editor:
<a href=""> <img src="assets/img/app_images/app_7.jpg" width="115" height="114">
<div class="caption">
<h3>TEST</h3>
</a>
</div>
Should be:
<a href=""> <img src="assets/img/app_images/app_7.jpg" width="115" height="114">
<div class="caption">
<h3>TEST</h3>
</div>
</a>
Edit: what's wrong with that? It works because an HTML parser doesn't complain about <a><div></a></div> (if the DOCTYPE isn't XHTML), but you should complain about it. Let me explain: the closing </div> tag isn't optional, so in theory the parser won't silently add it for you. In practice, browsers handle this in different ways. Some of them silently close the <div> when </a> is reached (so the following </div> closes the outer one); others don't (again, because it's not an optional closing tag), so the </div> closes the inner (and correct) one. IMO, with such unreliable behavior you should ask your developer/graphic designer to fix that code. In general (and with a few exceptions like <hr> and <br>) I would write HTML code as if it were XHTML.
