Html agility pack Addressing

In this HTML:
<div class="contacts-list">
<h4 class="title">Contact</h4>
<div class="contact-phone">
<span class="icon"><i class="ee-phone"></i></span><span class="type">تلفن</span>
<span class="contact-data">
<a dir='auto' href='tel:05138946697'>05138946697</a> </span>
</div>
I have to extract the value of the "a" tag, but I must be sure it is inside a "div" tag with the "contact-phone" class.
I don't really understand how to do this. Can someone help me?

I get the value I need like this, using Html Agility Pack and XPath:
foreach (HtmlNode node in htmlDocument.DocumentNode.SelectNodes("//div[@class='contact-phone']/span[@class='contact-data']/a"))
{
    value = node.InnerText;
}
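
A slightly more defensive version of the same idea might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the names PhoneExtractor and ExtractPhone are made up for illustration. SelectSingleNode returns null when nothing matches, so the result is checked before use.

```csharp
using HtmlAgilityPack;

public static class PhoneExtractor
{
    // Hypothetical helper: returns the text of the <a> that sits inside
    // div.contact-phone > span.contact-data, or null if no such node exists.
    public static string ExtractPhone(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The XPath only matches an <a> whose parent span and grandparent div
        // carry exactly these class values.
        HtmlNode link = doc.DocumentNode.SelectSingleNode(
            "//div[@class='contact-phone']/span[@class='contact-data']/a");

        // InnerText is trimmed because the markup has whitespace around the link.
        return link?.InnerText.Trim();
    }
}
```

Calling PhoneExtractor.ExtractPhone(html) with the markup from the question should give the phone number as a string, and null when the div or span is missing.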

Related

Html Agility Pack - Remove element by id

I'm trying to remove a specific piece of markup by element id with the help of Html Agility Pack. HTML:
<div id="id00">
<h1>Title</h1>
</div>
<div id="id10">
<div id="id11">
<h2>Title 2</h2>
<p>Some text</p>
</div>
<a id="idToRemove" href="#">Anchor text</a>
</div>
My method:
public static string RemoveElement(string html, string elementId)
{
elementId = "idToRemove";
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var node = htmlDoc.GetElementbyId(elementId);
node.Remove();
html = htmlDoc.Text;
return html;
}
Unfortunately it's not working at all.

It works, but htmlDoc.Text is the wrong property: it appears to hold the text that was originally loaded rather than the modified tree. Use:
return htmlDoc.DocumentNode.OuterHtml;
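
Put together, the corrected method could look like this sketch. It assumes the HtmlAgilityPack package is referenced; the class name HtmlCleaner is hypothetical, and the null check is an addition, since GetElementbyId returns null when the id is not present.

```csharp
using HtmlAgilityPack;

public static class HtmlCleaner
{
    // Removes the element with the given id (if any) and returns the
    // serialized, modified document.
    public static string RemoveElement(string html, string elementId)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        HtmlNode node = htmlDoc.GetElementbyId(elementId);
        if (node != null)
        {
            node.Remove(); // detaches the node from its parent
        }

        // OuterHtml of the document node reflects the current tree.
        return htmlDoc.DocumentNode.OuterHtml;
    }
}
```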

How to write XPath for the text present outside the tags?

<div class="accordion_browse">
<a target="_blank" href="https://www.silverpages.sg/tools/e-care-locator"
aria-expanded="false" aria-controls="collapseExample">
<span class="browse_img">
<img src="/sites/assets/assets/directory/icons/eldercare.png"/>
</span>
Eldercare Services
<i class="indicator fa fa-chevron-right pull-right" aria-hidden="true"/>
</a>
</div>
Note: the text required is 'Eldercare Services'
Code:
string Eldercare = driver.FindElement(By.XPath("//a[contains(.,'Eldercare Services')")).Text;
Console.WriteLine(Eldercare + " found");
Assert.IsTrue(Eldercare.Contains("Elder"), Eldercare + " not found");

A text node is selected with the text() node test.
XPath: "/body/div/text()"
That retrieves all child text nodes of /body/div.
"/body/div/text()[1]" retrieves the first text node inside /body/div.
Additional reading on XPath:
https://www.w3schools.com/xml/xpath_examples.asp

The given XPath looks good, but the example code By.XPath("//a[contains(.,'Eldercare Services')") is missing a closing square bracket. It should be: By.XPath("//a[contains(.,'Eldercare Services')]")
With this fix, the XPath returns the a element:
<a target="_blank" href="https://www.silverpages.sg/tools/e-care-locator" aria-expanded="false" aria-controls="collapseExample">
<span class="browse_img">
<img src="/sites/assets/assets/directory/icons/eldercare.png"/>
</span>
Eldercare Services longtext
<i class="indicator fa fa-chevron-right pull-right" aria-hidden="true"/>
</a>
The missing bracket should fix the code.
If Selenium could return text nodes, the XPath could be
//a[contains(.,'Eldercare Services')]/text()
The result would be three text nodes. The middle one holds the expected string; the ones before the span and after the i element would probably contain only whitespace.
Text =
Text =
Eldercare Services longtext
Text =
Adding a second filter on the text node would return only the expected string:
//a[contains(.,'Eldercare Services')]/text()[contains(.,'Elder')]
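
Since Selenium's FindElement only returns elements, the double filter above cannot be tried there directly. As a stand-in, the sketch below evaluates the same XPath with System.Xml.Linq and System.Xml.XPath from the .NET base library. This is a swapped-in technique, not Selenium, and it assumes the snippet is well-formed XML; the names TextNodeDemo and FindLabel are hypothetical.

```csharp
using System.Collections;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class TextNodeDemo
{
    // Returns the trimmed value of the first text node under the matching <a>
    // that contains "Elder", or null if there is none.
    public static string FindLabel(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // For a node-set expression, XPathEvaluate returns an enumerable of
        // nodes; text() results come back as XText instances.
        var result = (IEnumerable)doc.XPathEvaluate(
            "//a[contains(.,'Eldercare Services')]/text()[contains(.,'Elder')]");

        XText text = result.OfType<XText>().FirstOrDefault();
        return text?.Value.Trim();
    }
}
```

Note that XDocument.Parse drops whitespace-only text nodes by default, so the empty strings described above would not show up here even without the second filter.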

fetching span value from html document

I have the following XPath, obtained with a Firefox XPath plugin:
id('some_id')/x:ul/x:li[4]/x:span
Using Html Agility Pack I'm able to fetch id('some_id')/x:ul/x:li[4]:
htmlDoc.DocumentNode.SelectNodes(@"//div[@id='some_id']/ul/li[4]").FirstOrDefault();
but I don't know how to get the span value.
Update:
<div id="some_id">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>

You don't need to parse the HTML with LINQ2XML; HtmlAgilityPack is made for this, and it is easier to get the node this way:
var html = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var value = doc.DocumentNode.SelectSingleNode("div[@id='some_id']/ul/li/span").InnerText;
Console.WriteLine(value);

An alternative approach (without html-agility-pack) would be to use LINQ2XML. You can use the XDocument.Descendants method to find the span element and take its value:
var xml = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Root.Descendants("span").FirstOrDefault().Value);
The code can be extended to check that the div element has the matching id, using the XElement.Attribute method:
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Elements("div").Where (e => e.Attribute("id").Value == "some_id").Descendants("span").FirstOrDefault().Value);
One drawback of this solution is that the XML structure (HTML, XHTML) needs to be properly closed or else the parsing will fail.
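
The id check above throws a NullReferenceException if any div lacks an id attribute, or if no span is found. A more forgiving variant is sketched below. It relies on the explicit string conversions that XAttribute and XElement define, which return null for a null reference; the names SpanReader and ReadSpan are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class SpanReader
{
    // Returns the value of the first <span> under the <div> with the given id,
    // or null when the div or the span is missing.
    public static string ReadSpan(string xml, string divId)
    {
        XDocument doc = XDocument.Parse(xml);

        XElement span = doc.Descendants("div")
            // (string) on a null XAttribute yields null instead of throwing.
            .Where(e => (string)e.Attribute("id") == divId)
            .Descendants("span")
            .FirstOrDefault();

        // (string) on a null XElement also yields null.
        return (string)span;
    }
}
```

The input still has to be well-formed XML, as with the original answer.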

HtmlAgilityPack extracts text from all divs in a page and not just from the one div specified in the code

I am seeing strange behaviour with an XPath expression in HtmlAgilityPack.
I'm trying to use HtmlAgilityPack to extract all the values within a div declared as
<div class='cont'>. However, when I use the code below I get all values within
<div class='cont'> AND <div class='button'>. Does anyone know why this is happening?
Here is the full code to reproduce it:
using System;
using System.Xml.XPath;
using HtmlAgilityPack;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
const string text1 = @"<div class=""cont"">
<h3>content</h3>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content1</div><div style=""margin: 0cm 0cm 0pt"" class=""Normal""> content2</div>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content3 </div>
<div>content4 </div><strong>content5
<div>content6 </div><ul type=""disc"">
<div>content7 </div>
<div>content8 </div> </ul>
<p class='margin10'><font size=""2"">
<div>
<p><span style=""font-family: Arial"">content9</span></p>
</div>
<div>content10</font><u><font color=""#0000ff"" size=""2""><font color=""#0000ff"" size=""2""> content11 </u></font></font><font size=""2""> content12
<div>content13</div>
</div>
</font>
</p>
</div>
<div class=""button"">
<span class=""applybtn""><a class=""buttonGlobal buttonAlpha"" href=""/uk/job/apply/(id)/608735"">content14</a></span>
</div>";
foreach (XPathNavigator node in SearchInPage(text1, "//div[@class='cont']"))
{
Console.WriteLine("option " + node.Value);
}
}
private static XPathNodeIterator SearchInPage(string text, string xpath)
{
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(text);
XPathNavigator xpathNavigator = htmlDocument.CreateNavigator();
XPathNodeIterator nodes = xpathNavigator.Select(xpath);
return nodes;
}
}
}
The code returns:
'content', 'content1-13' PLUS 'content14' which exists within <div class='button'>

So if I'm understanding correctly, you want the values only for the child nodes of <div class="cont">?
Try this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html holds the markup as a string
HtmlNode node = doc.DocumentNode.SelectSingleNode(".//div[@class='cont']");
foreach (HtmlNode childNode in node.ChildNodes)
{
    Console.WriteLine(childNode.InnerText);
}
I don't have a way to debug this in front of me, but this should work. The expression ".//div[@class='cont']" should select only the specified node and its children, and ignore anything that lives outside it. The rest is plain HtmlAgilityPack. Remember that HtmlAgilityPack implements XPath itself, so look through the methods it offers before reaching for System.Xml.XPath. Also remember that XML and HTML are different languages, and what works for one won't necessarily work for the other.
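
One possible explanation, which this thread does not confirm, is that the unbalanced tags in the sample (for instance the strong element that is never closed) lead the parser to place <div class="button"> inside <div class="cont">. The sketch below is a way to check that assumption on a given input rather than a fix. It assumes the HtmlAgilityPack package is referenced, and the names NestingCheck and IsNestedInside are hypothetical.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class NestingCheck
{
    // Returns true if, after parsing, the first node matching innerXPath
    // has the first node matching outerXPath among its ancestors.
    public static bool IsNestedInside(string html, string innerXPath, string outerXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode outer = doc.DocumentNode.SelectSingleNode(outerXPath);
        HtmlNode inner = doc.DocumentNode.SelectSingleNode(innerXPath);
        if (outer == null || inner == null)
        {
            return false; // one of the two nodes was not found
        }

        // Ancestors() walks from the parent up to the document node.
        return inner.Ancestors().Contains(outer);
    }
}
```

If NestingCheck.IsNestedInside(text1, "//div[@class='button']", "//div[@class='cont']") returns true for the markup in the question, the extra 'content14' would be explained by the parsed tree rather than by the XPath expression.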

how to get html div element innertext by id using regular expression in C#

I'm downloading the full HTML of a page using WebClient, but I need to extract a specific div from it using a regular expression.
For example:
<body>
<div id="main">
<div id="left" style="float:left">this is a <b>left</b> side:<div style='color:red'> 1 </div>
</div>
<div id="right" style="float:left"> main side</div>
</div>
</body>
If I need the div named 'main', the function should return
<div id="left" style="float:left">this is a <b>left</b> side:<div style='color:red'> 1 </div>
</div>
<div id="right" style="float:left"> main side</div>
If I need the div named 'left', the function should return
this is a <b>left</b> side:<div style='color:red'> 1 </div>
If I need the div named 'right', the function should return
main side
How can I do this?

Why do people insist on trying to use regex to parse html? You can probably do it if you exclude a whole host of edge-cases... but just use HTML Agility Pack and you're done:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(...); // or Load
string main = doc.DocumentNode.SelectSingleNode("//div[@id='main']").InnerHtml;
(note I'm assuming it is not xhtml; if it is xhtml, use XmlDocument or XDocument, and very similar code to the above)
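
Wrapped as the function the question asks for, this might look like the sketch below. It assumes the HtmlAgilityPack package is referenced and that the id contains no quote characters, since it is concatenated into the XPath; the names DivReader and GetInnerHtmlById are hypothetical.

```csharp
using HtmlAgilityPack;

public static class DivReader
{
    // Returns the inner HTML of the <div> with the given id,
    // or null if no such div exists.
    public static string GetInnerHtmlById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: id has no single quotes, so plain concatenation is safe.
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='" + id + "']");

        // InnerHtml includes nested elements, so nested divs are handled
        // by the parser instead of by pattern matching.
        return div?.InnerHtml;
    }
}
```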

string divname = "somename";
Match m = Regex.Match(htmlContent, "<div[^>]*id=[\"']" + Regex.Escape(divname) + "[\"'][^>]*>(.*?)</div>", RegexOptions.Singleline);
string content = m.Groups[1].Value;
Note that this won't work if there are nested divs inside the desired div.
