Using Html Agility Pack, Selecting the current element in a loop (XPATH)

I'm trying to do something simple, but somehow it doesn't work for me. Here's my code:
var items = html.DocumentNode.SelectNodes("//div[@class='itembox']");
foreach (HtmlNode e in items)
{
    int x = items.Count; // equals 10
    HtmlNode node = e;
    var test = e.SelectNodes("//a[@class='head']"); // I need this to return the
                                                    // anchor of the current itembox,
                                                    // but instead it returns the
                                                    // anchor of every itembox element
    int y = test.Count; // also equals 10! It is supposed to be only 1
}
My HTML page looks like this:
....
<div class="itembox">
<a Class="head" href="one.com">One</a>
</div>
<div class="itembox">
<a Class="head" href="two.com">Two</a>
</div>
<!-- 10 itembox elements-->
....
Is my XPath expression wrong? Am I missing something?

Use
var test = e.SelectNodes(".//a[@class='head']");
instead. Your current expression (//a[...]) searches all a elements starting from the document root, no matter which node you call it on. If you prefix it with a dot (.//a[...]), only the descendants of the current node are considered. Since the anchor is a direct child in your case, you could also write:
var test = e.SelectNodes("a[@class='head']");
As always, see the XPath spec for details.
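
To see the three forms side by side, here is a minimal sketch. It is an assumption-laden stand-in, not the asker's code: it uses System.Xml's XmlDocument instead of Html Agility Pack so that it runs without any NuGet package, and the two-item sample markup is invented. The XPath semantics of //, .// and a bare child step are the same in both libraries.

```csharp
using System;
using System.Xml;

class RelativeXPathDemo
{
    static void Main()
    {
        // Hypothetical two-item sample, loaded as XML so only the BCL is needed.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root>" +
            "<div class='itembox'><a class='head' href='one.com'>One</a></div>" +
            "<div class='itembox'><a class='head' href='two.com'>Two</a></div>" +
            "</root>");

        foreach (XmlNode box in doc.SelectNodes("//div[@class='itembox']"))
        {
            // "//" always starts at the document root, even when called on box.
            int absolute = box.SelectNodes("//a[@class='head']").Count;
            // ".//" searches only the descendants of box.
            int relative = box.SelectNodes(".//a[@class='head']").Count;
            // A bare step searches only the direct children of box.
            int child = box.SelectNodes("a[@class='head']").Count;

            Console.WriteLine($"{absolute} {relative} {child}"); // prints "2 1 1"
        }
    }
}
```

With ten itemboxes instead of two, the first number would be 10, which is exactly the symptom described in the question.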

var test = e.SelectNodes("//a[@class='head']");
This is an absolute expression, but you need a relative XPath expression, one that is evaluated relative to e.
Therefore use:
var test = e.SelectNodes("a[@class='head']");
Do note: avoid the XPath // pseudo-operator as much as possible, because it may cause significant inefficiency (slowdown).
In this particular document the a elements are direct children of the div, not nested at an indefinite depth below it.

Related

Need an XPath expressions to locate based on a sibling

I've got this code repeated in a div tag and want to write an XPath expression to find the dsd link so that I can click on it, based on the text in the h4 tag. Changing the HTML isn't an option.
<div>
<h4>Test Block</h4>
<br/>
<div>
Option 1
Option 2
</div>
</div>
At the moment I'm trying something like this, where name is the text of the h4 tag:
var findSubmitButton = Driver.FindElement(By.XPath("//div/h4[contains(text(), '" + name + "')]"));
var submitButton = findSubmitButton.FindElement(By.XPath("../div/a[contains(@href,'dsd')]"));
submitButton.Click();
But I'm unable to get this to work. Any suggestions would be gratefully received.

I do not see an issue with your xpaths. The HTML you supplied is invalid due to your placeholders, but your xpaths appear to work with this:
void Main()
{
    var xml = @"
<div>
<h4>Test Block</h4>
<br/>
<div>
Option 1
Option 2
</div>
</div>";
    var xmldoc = new XmlDocument();
    xmldoc.LoadXml(xml);
    var node = xmldoc.DocumentElement.SelectSingleNode("//div/h4[contains(text(),'Test Block')]");
    node = node.SelectSingleNode("../div/a[contains(@href,'dsd')]");
    Console.WriteLine(node.InnerText);
}

I don't have a working machine, so I can't test this, but you said any feedback would be well received. I'm pretty sure that with XPath you can pick an individual child element by position. If you know for sure that this HTML will always be the same, you could do:
../div[1] // first div child of the parent (XPath positions start at 1, not 0)

You could use //div[h4[contains(., 'Test Block')]]//a[contains(@href, 'dsd')]. Also something like //div[h4[contains(., 'Test Block')]]//a[contains(., 'Option 1')] should work.

Why don't you use the following-sibling axis?
var findSubmitButton = Driver.FindElement(By.XPath("//div/h4[contains(text(), '" + name + "')]"));
var submitButton = findSubmitButton.FindElement(By.XPath("following-sibling::div/a[contains(@href,'dsd')]"));
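
The expressions suggested in this thread can be compared in a small sketch. Two things are assumptions: the anchors and their href values are invented (the question's markup does not show them), and XmlDocument stands in for the Selenium driver so the example runs without a browser.

```csharp
using System;
using System.Xml;

class SiblingXPathDemo
{
    static void Main()
    {
        // The <a> elements and their href values are hypothetical.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<div>" +
            "<h4>Test Block</h4><br/>" +
            "<div>" +
            "<a href='/abc/1'>Option 1</a>" +
            "<a href='/dsd/2'>Option 2</a>" +
            "</div>" +
            "</div>");

        XmlNode h4 = doc.SelectSingleNode("//div/h4[contains(text(), 'Test Block')]");

        // Go up to the parent, then down into the sibling div.
        XmlNode viaParent = h4.SelectSingleNode("../div/a[contains(@href, 'dsd')]");
        // Step sideways to the sibling div directly.
        XmlNode viaSibling = h4.SelectSingleNode("following-sibling::div/a[contains(@href, 'dsd')]");
        // Do everything in one expression from the document root.
        XmlNode oneShot = doc.SelectSingleNode(
            "//div[h4[contains(., 'Test Block')]]//a[contains(@href, 'dsd')]");

        Console.WriteLine(viaParent.InnerText);  // Option 2
        Console.WriteLine(viaSibling.InnerText); // Option 2
        Console.WriteLine(oneShot.InnerText);    // Option 2
    }
}
```

All three find the same anchor here, which suggests that if the original code fails, the cause is more likely in the page itself (timing, frames, different markup) than in the XPath.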

Get text that lies after pattern without class or id

I am using the HtmlAgilityPack.
It is an excellent tool for parsing data. However, every time I have used it, I have had either a class or an id to aim at, e.g.:
string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();
However, I have come across a piece of text that isn't nested in any element with a class or id I can aim at, e.g.:
<p>Example Header</p>: This is the text I want!<br>
The example always follows the same pattern, though: the text always comes after </p>: and before <br>.
I can extract the text using a regular expression, but I would prefer to use the Agility Pack, as the rest of the code does. Is there a way of doing this with the pack?

This XPath works for me:
var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() selects all text nodes that are direct children of the <div>.
[(normalize-space())] excludes all text nodes that contain only whitespace (two newlines are excluded in this HTML sample: one before <p> and the other after <br>).
Result :
UPDATE I :
Every element must have a parent, like the <div> in the example above. If the fragment sits directly under the root node, the same approach should still work. The key is to use the /text() XPath step to get the text node:
var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
UPDATE II :
Ok, so you want to select text node after <p> element and before <br> element. You can use this XPath then :
var result =
doc.DocumentNode
.SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
.OuterHtml;
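
As a self-contained illustration of that last expression, here is a sketch that also strips the leading colon. It makes two assumptions: XmlDocument replaces Html Agility Pack so the example needs no NuGet package, and the fragment is wrapped in a <div> and uses <br/> so that it is well-formed XML.

```csharp
using System;
using System.Xml;

class TextAfterHeaderDemo
{
    static void Main()
    {
        // Well-formed stand-in for the question's fragment (<br/> instead of <br>).
        var doc = new XmlDocument();
        doc.LoadXml("<div><p>Example Header</p>: This is the text I want!<br/></div>");

        // The text node that has a <p> before it and a <br> after it.
        XmlNode text = doc.SelectSingleNode(
            "/div/text()[preceding-sibling::p and following-sibling::br]");

        // Drop the leading colon and the surrounding whitespace.
        string cleaned = text.Value.TrimStart(':').Trim();
        Console.WriteLine(cleaned); // This is the text I want!
    }
}
```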

C# parse html with xpath

I'm trying to parse stock exchange information out of an HTML document with a simple piece of C#. The problem is that I cannot get my head around the syntax: the tr class="LomakeTaustaVari" rows get parsed out, but how do I get the second kind of row, which has no class on its tr?
Here's a piece of the HTML; it repeats itself with different values.
<tr class="LomakeTaustaVari">
<td><div class="Ensimmainen">12:09</div></td>
<td><div>MSI</div></td>
<td><div>POH</div></td>
<td><div>42</div></td>
<td><div>64,50</div></td>
</tr>
<tr>
<td><div class="Ensimmainen">12:09</div></td>
<td><div>SRE</div></td>
<td><div>POH</div></td>
<td><div>156</div></td>
<td><div>64,50</div></td>
</tr>
My C# code:
{
    HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("https://www.op.fi/op/henkiloasiakkaat/saastot-ja-sijoitukset/kurssit-ja-markkinat/markkinat?sivu=alltrades.html&sym=KNEBV.HSE&from=10:00&to=19:00&id=32453");
    foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr[@class='LomakeTaustaVari']"))
    {
        Console.WriteLine(row.InnerText);
    }
    Console.ReadKey();
}

Try the following XPath: //tr[preceding-sibling::tr[@class='LomakeTaustaVari']]
var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[@class='LomakeTaustaVari']]");
It selects every tr that has a preceding sibling tr with the class LomakeTaustaVari.
Just FYI: if no nodes are found, the SelectNodes method returns null.

If you manage to get a reference to the <tr class="LomakeTaustaVari"> element, I see two possible solutions.
You can navigate to the parent and then find all its <tr> children:
lomakeTaustaVariElement.ParentNode.SelectNodes("tr"); // iterate over these if needed
You can also use NextSibling to get the next <tr>:
var trWithoutClass = lomakeTaustaVariElement.NextSibling;
Please note that with the second alternative you may run into issues, because whitespace between the rows in the HTML is parsed as a separate text node.
To overcome this, keep calling NextSibling until you reach a tr element.
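
The "keep calling NextSibling" idea can be sketched as a short loop. This is an illustration under stated assumptions: it uses XmlDocument with PreserveWhitespace enabled to reproduce the whitespace-node problem without Html Agility Pack, NextElementSibling is a hypothetical helper name, and the table is a cut-down version of the question's markup. The XPath step following-sibling::tr[1] is shown as an alternative that skips text nodes by itself.

```csharp
using System;
using System.Xml;

class NextRowDemo
{
    // Hypothetical helper: walks forward until the next element called `name`.
    static XmlNode NextElementSibling(XmlNode node, string name)
    {
        XmlNode current = node.NextSibling;
        while (current != null &&
               !(current.NodeType == XmlNodeType.Element && current.Name == name))
        {
            current = current.NextSibling;
        }
        return current; // null when there is no such sibling
    }

    static void Main()
    {
        // Keep whitespace nodes so the newline between the rows shows up as a sibling.
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(
            "<table>\n" +
            "<tr class='LomakeTaustaVari'><td>MSI</td></tr>\n" +
            "<tr><td>SRE</td></tr>\n" +
            "</table>");

        XmlNode first = doc.SelectSingleNode("//tr[@class='LomakeTaustaVari']");

        // The immediate sibling is the newline, not the next row.
        Console.WriteLine(first.NextSibling.NodeType == XmlNodeType.Element); // False

        Console.WriteLine(NextElementSibling(first, "tr").InnerText);               // SRE
        Console.WriteLine(first.SelectSingleNode("following-sibling::tr[1]").InnerText); // SRE
    }
}
```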

This will iterate over all tr nodes in the document. You will probably also need to be more specific about the starting node, so that you select only the rows you are interested in.
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
{
Console.WriteLine(row.InnerText);
}

I may be misunderstanding something, but the simplest XPath that selects any tr element should do the job:
doc.DocumentNode.SelectNodes("//tr")
Otherwise, if you would like to select only elements with specific class attributes, it could be:
doc.DocumentNode.SelectNodes("//tr[@class = 'someClass1' or @class = 'someClass2']")

If you do not want to load the page yourself and would rather use a ready HTML string, e.g. from a WebBrowser control, you can use the following example:
var web = new HtmlAgilityPack.HtmlDocument();
web.LoadHtml(webBrowser1.Document.Body.Parent.OuterHtml);
var q = web.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]"); // XPath /html/body/div[2]/div/div[1]

Searching for XElement with attribute name that contain hyphens/dashes

I wrote some code in VB.Net a while ago that is using XElement, XDocument, etc... to store and manipulate HTML. Some of the HTML makes use of attribute names that contain a hyphen/dash (-). I encountered issues using LinqToXml to search for XElements by these attributes.
Back then I found an article (can't find it now) that indicated the solution in VB.net was to use syntax like this:
Dim rootElement as XElement = GetARootXElement()
Dim query = From p In rootElement.<div> Where p.@<data-qid> = 5 Select p
The "magic" syntax is the @<> which somehow translates the hyphenated attribute name into a format that can be successfully used by Linq. This code works great in VB.Net.
The problem is that we have now converted all the VB.Net code to C# and the conversion utility choked on this syntax. I can't find anything about this "magic" syntax in VB.Net and so I was hoping someone could fill in the details for me, specifically, what the C# equivalent is. Thanks.
Here is an example:
<div id='stuff'>
<div id='stuff2'>
<div id='stuff' data-qid=5>
<!-- more html -->
</div>
</div>
</div>
In my code above the rootElement would be the stuff div and I would want to search for the inner div with the attribute data-qid=5.

I can get the following to compile in C# - I think it's equivalent to the original VB (note that the original VB had Option Strict Off):
XElement rootElement = GetARootXElement();
var query = from p in rootElement.Elements("div")
where p.Attribute("data-qid").Value == 5.ToString()
select p;
Here's my (revised) test, which finds the div with the 'data-qid' attribute:
var xml = System.Xml.Linq.XElement.Parse("<div id='stuff'><div id='stuff2'><div id='stuff3' data-qid='5'><!-- more html --></div></div></div>");
var rootElement = xml.Element("div");
var query = from p in rootElement.Elements("div")
where p.Attribute("data-qid").Value == 5.ToString()
select p;
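
One caveat with the queries above: Attribute("data-qid") returns null for elements that lack the attribute, so reading .Value on it throws, and Elements("div") looks only one level down. A minimal sketch of a more forgiving variant follows; it assumes the markup is well-formed XML, reuses the sample from the test above, and the class name is made up.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class DataAttributeDemo
{
    static void Main()
    {
        var root = XElement.Parse(
            "<div id='stuff'><div id='stuff2'>" +
            "<div id='stuff3' data-qid='5'><!-- more html --></div>" +
            "</div></div>");

        // Descendants searches every nesting level. Casting the attribute to
        // string yields null (instead of throwing) when the attribute is missing.
        var matches = root.Descendants("div")
                          .Where(d => (string)d.Attribute("data-qid") == "5");

        foreach (var div in matches)
        {
            Console.WriteLine((string)div.Attribute("id")); // stuff3
        }
    }
}
```

The hyphen needs no special treatment in C#, because the attribute name is passed as an ordinary string.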

Use HtmlAgilityPack (available from NuGet) to parse HTML. Here is an example:
HtmlDocument doc = new HtmlDocument();
doc.Load("index.html");
var innerDiv =
    doc.DocumentNode.SelectSingleNode("//div[@id='stuff']/*/div[@data-qid=5]");
This XPath query gets inner div tag which has data-qid equal to 5. Also outer div should have id equal to 'stuff'. And here is the way to get data-qid attribute value:
var qid = innerDiv.Attributes["data-qid"].Value; // 5

Instead of using HtmlAgilityPack as suggested by Sergey Berezovskiy, there is an easier way that does without it: the Extensions class in the System.Xml.XPath namespace, which contains extension methods for using XPath with LINQ to XML:
using System.Xml.XPath;
var xml = XElement.Parse(html);
var innerDiv = xml.XPathSelectElement("//div[@id='stuff' and @data-qid=5]");

How to change output HTML tag <b></b> to <strong></strong>?

How can I replace the <b></b> tag with the <strong></strong> tag in a specific div?
For example:
<div id="aaa">hello<b>wow</b>!</div>
I want to use JavaScript to replace it with
<div id="aaa">hello<strong>wow</strong>!</div>
Please help! Thanks in advance.
What I'm trying to do is change the output HTML code from <b></b> to <strong></strong> in order to pass W3C validation. Can I do that?
Or is there a solution that uses ASP.NET + C# to do that?

Here you go:
var root, elems;
root = document.getElementById( 'test' );
elems = root.getElementsByTagName( 'b' );
toArray( elems ).forEach( function ( elem ) {
var newElem = document.createElement( 'strong' );
newElem.textContent = elem.textContent;
elem.parentNode.replaceChild( newElem, elem );
});
where toArray is your preferred array-like to array converter function. I use this one:
function toArray( arrayLike ) { return [].slice.call( arrayLike ); }
Live demo: http://jsfiddle.net/mJSyH/3/
Note: this code doesn't work in IE8.

You can grab all <b> elements under a certain element, move all child nodes to a new <strong> element, and then replace the <b> with the <strong>.
<div id="aaa">hello<b>wow</b><b>2</b><b>3</b>!</div>
<script>
var container = document.getElementById("aaa")
var find = container.getElementsByTagName("b");
var bold, strong;
while (bold = find[0]) {
strong = document.createElement("strong");
while (bold.firstChild) {
strong.appendChild(bold.firstChild);
}
bold.parentNode.replaceChild(strong, bold);
}
</script>
The reason you can set bold = find[0] every time is that as the <b> elements are removed from the document, they are also removed from the NodeList find.
See the latest version at http://jsbin.com/eqikaj/13/edit.

Using jQuery you can find all b tags within your parent div container and then replace each of them with a strong element that gets the inner HTML of the source tag:
$('#aaa b').each(function() {
    $(this).replaceWith($('<strong>' + $(this).html() + '</strong>'));
});

If you use jQuery, you can simply go like this:
$('b').replaceWith(function() {
return $('<strong>').html($(this).html());
});
Just download or include the jQuery library somehow, and you can use the snippet.
http://docs.jquery.com/Downloading_jQuery

A solution using regular expressions:
var e = document.getElementById("aaa");
e.innerHTML = e.innerHTML.replace(/<b\b[^>]*>(.*?)<\/b>/ig, '<strong>$1</strong>');
(The \b keeps the pattern from also matching tags such as <br>.)
This may not be faster than the versions above, but the performance difference is very small (irrelevant in real applications).
Use whichever you think is best.
Note: you don't need a function such as toArray; you can do this:
Array.prototype.forEach.call(elems, function (elem) { ... });
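
For the ASP.NET + C# part of the question, the same regular-expression idea could be sketched on the server side. Treat this as a rough illustration rather than a robust HTML rewriter: BoldToStrong is a hypothetical helper name, and it assumes the markup is simple enough for a regex (no <b inside comments, scripts or attribute values).

```csharp
using System;
using System.Text.RegularExpressions;

class BoldToStrongDemo
{
    // Hypothetical helper. Group 1 is the optional "/" of a closing tag and
    // group 2 is any attributes; \b stops the pattern from matching <br> or <body>.
    static string BoldToStrong(string html)
    {
        return Regex.Replace(html, @"<(/?)b\b([^>]*)>", "<${1}strong${2}>",
                             RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(BoldToStrong("<div id=\"aaa\">hello<b>wow</b>!<br></div>"));
        // <div id="aaa">hello<strong>wow</strong>!<br></div>
    }
}
```

Unlike the client-side scripts, this changes the markup before it is sent, so a validator that fetches the page sees <strong> directly.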
