C# and HtmlAgilityPack: get the value of an href attribute

Using VS 2019 and .NET 4.8, the C# code below gets the following HTML node, but I'm having trouble getting the href value.
The href attribute has a full URL, but the only text I'm getting is "/".
Can someone please let me know where I'm going wrong and how to get the full URL text?
Thank you.
The node:
<h2 class="n">
3.
<a class="business-name" href="/santa-monica-ca/mip/specialists-in-custom-software-16438720" data-analytics='{"target":"name","feature_click":""}' rel="" data-impressed="1">
<span>Specialists In Custom Software</span>
</a>
</h2>
My code:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("https://www.yellowpages.com/search?search_terms=custom+software&geo_location_terms=Los+Angeles%2C+CA");
HtmlNode[] nodes = document.DocumentNode.SelectNodes("//h2[@class='n']").ToArray();
foreach (HtmlNode node in nodes)
{
    Console.WriteLine(node.InnerHtml);
    Console.WriteLine(node.SelectNodes("//a//span").First().InnerText);
    Console.WriteLine(node.SelectNodes("//a").First().Attributes["href"].Value);
}

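This should do it. The original fails because an XPath that starts with // is evaluated from the document root even when it is called on a child node, so node.SelectNodes("//a") returns every anchor on the page and First() picks the first one, whose href is "/". A relative path such as a (or .//a) stays inside the current h2: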
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("https://www.yellowpages.com/search?search_terms=custom+software&geo_location_terms=Los+Angeles%2C+CA");
var nodes = document.DocumentNode.SelectNodes("//h2[@class='n']");
foreach (HtmlNode node in nodes)
{
    Console.WriteLine(node.InnerHtml);
    Console.WriteLine(node.SelectSingleNode("a/span").InnerText);
    Console.WriteLine(node.SelectSingleNode("a").Attributes["href"].Value);
    Console.WriteLine();
}
Fiddle: https://dotnetfiddle.net/dtoZGl
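The difference between an absolute and a relative XPath can be checked offline. The following is only a minimal sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the sample markup, the class name RelativeXPathDemo and the helper HrefOfFirstMatch are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical sample markup: a page-level link followed by one listing.
    private const string Html =
        "<html><body>" +
        "<a href=\"/\">Home</a>" +
        "<h2 class=\"n\"><a class=\"business-name\" href=\"/santa-monica-ca/mip/example-1\">" +
        "<span>Example</span></a></h2>" +
        "</body></html>";

    // Runs the given XPath on the h2 node and returns the href of the first match.
    public static string HrefOfFirstMatch(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNode h2 = doc.DocumentNode.SelectSingleNode("//h2[@class='n']");
        return h2.SelectSingleNode(xpath).GetAttributeValue("href", "");
    }

    public static void Main()
    {
        // Absolute path: searched from the document root, so the "Home" link wins.
        Console.WriteLine(HrefOfFirstMatch("//a"));   // prints "/"
        // Relative paths: searched from the h2 node itself.
        Console.WriteLine(HrefOfFirstMatch("a"));     // prints "/santa-monica-ca/mip/example-1"
        Console.WriteLine(HrefOfFirstMatch(".//a"));  // prints "/santa-monica-ca/mip/example-1"
    }
}
```

Use a when the anchor is a direct child of the node, and .//a when it may be nested deeper.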

Related

HTML Agility pack with c#

C# code:
var node = new HtmlWeb();
var doc = node.Load("http://ask.fm/");
HtmlNode ournode = doc.DocumentNode.SelectSingleNode("//div[@id='heads']");
textBox1.Text = ournode.InnerHtml;
html code :
<div id="heads">
<img alt="" class="head" id="face_30132803" src="http://img3.ask.fm/assets2/103/548/655/872/thumb_tiny/IMG_20150513_192250.jpg" />
<img alt="" class="head" id="face_56578735" src="http://img1.ask.fm/assets2/091/364/883/712/thumb_tiny/11094711_919135961470973_149663457_njpg720960png1280963.png" />
I want to see the following in the text box
/sudenur3434
/leylaulucay
I have adjusted the XPath to select the a element inside the div (the div itself has no href attribute) and added one line to read its href:
var node = new HtmlWeb();
var doc = node.Load("http://ask.fm/");
HtmlNode ournode = doc.DocumentNode.SelectSingleNode("//div[@id='heads']/a");
var val = ournode.Attributes["href"].Value;
textBox1.Text = val;
This gets the href attribute of the first link. Use the same approach to get the other nodes' href values and then add them to your text box.
Since a text box is usually used for one liners, I am giving you an example that will simply write all links in the direct output window of VS.
If you use e.g. a ListBox instead of a text box you can replace Debug.Print by e.g. ListBox1.Items.Add(href.Value)
The following gives you the href URLs of all a children of div id="heads":
var site = new HtmlWeb();
var htmldoc = site.Load("http://ask.fm/");
var headDiv = htmldoc.DocumentNode.SelectSingleNode("//div[@id='heads']");
if (headDiv != null)
{
    // SelectNodes returns null (not an empty collection) when nothing matches
    var anchors = headDiv.SelectNodes("a");
    if (anchors != null)
    {
        foreach (HtmlNode aNode in anchors)
        {
            var href = aNode.Attributes.AttributesWithName("href").FirstOrDefault();
            if (href != null)
                Debug.Print(href.Value);
        }
    }
}
< div id="heads" >
<img alt="" class="head" id="face_30132803" src="http://img3.ask.fm/assets2/103/548/655/872/thumb_tiny/IMG_20150513_192250.jpg" />
<a href="/leylaulucay" data-rlt-aid="welcome_head"><img alt="" class="head" id="face_5
how to agility pack parse in textbox

get value from web page using Html Agility Pack

I am trying to get the value of the "Pool Hashrate" using the HTML Agility Pack. Right when I hit my string hash, I get Object reference not set to an instance of an object. Can somebody tell me what I am doing wrong?
string url = "http://p2pool.org/ltcstats.php?address";
protected void Page_Load(string address)
{
    string url = address;
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    string hash = doc.DocumentNode.SelectNodes("/html/body/div/center/div/table/tbody/tr[1]")[0].InnerText;
}
Assuming you're trying to access that url, of course it should fail. That url doesn't return a full document, but just a fragment of html. There is no html tag, there is no body tag, just the div. Your xpath query returns nothing and thus the null reference exception. You need to query the right thing.
When I access that url, it returns this:
<div>
<center>
<div style="margin-right: 20px;">
<h3>Personal LTC Stats</h3>
<table class='zebra-striped'>
<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>
<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>
<tr><td>Estimated Payout: </td><td> LTC</td></tr>
</table>
</div>
</center>
</div>
Given this, if you wanted to get the Pool Hashrate, you'd use a query more like this:
/div/center/div/table/tr[1]/td[2]
In the end you need to do this:
var url = "http://p2pool.org/ltcstats.php?address";
var web = new HtmlWeb();
var doc = web.Load(url);
var xpath = "/div/center/div/table/tr[1]/td[2]";
var poolHashrate = doc.DocumentNode.SelectSingleNode(xpath);
if (poolHashrate != null)
{
    var hash = poolHashrate.InnerText;
    // do stuff with hash
}
The problem is that the XPath is not finding the specified node. You can give the table or the tr an id in order to use a shorter XPath.
Also, based on your code I assume that you're looking for a single node only, so you may want to use something like this:
doc.DocumentNode.SelectSingleNode("xpath");
Another good option is Fizzler, which lets you query HtmlAgilityPack documents with CSS selectors.
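As a rough illustration of a shorter XPath, the sketch below anchors the query on the table's class attribute instead of the full path. It assumes the HtmlAgilityPack NuGet package is referenced and inlines the fragment quoted earlier, so no network access is needed; the class name PoolHashrateDemo and the helper ReadPoolHashrate are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class PoolHashrateDemo
{
    // The fragment returned by the stats URL, inlined for an offline run.
    private const string Fragment =
        "<div><center><div style=\"margin-right: 20px;\">" +
        "<h3>Personal LTC Stats</h3>" +
        "<table class='zebra-striped'>" +
        "<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>" +
        "<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>" +
        "</table></div></center></div>";

    // Returns the second cell of the first row, or null when the table is missing.
    public static string ReadPoolHashrate(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//table[@class='zebra-striped']/tr[1]/td[2]");
        return cell == null ? null : cell.InnerText.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ReadPoolHashrate(Fragment)); // prints "66.896 Mh/s"
    }
}
```

Checking the result for null before reading InnerText avoids the "Object reference not set to an instance of an object" error when the markup changes.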

selecting Node does not work using HtmlAgilityPack

I am using VS2010 and HtmlAgilityPack 1.4.6 (from the Net40 folder).
Following is my HTML
<html>
<body>
<div id="header">
<h2 id="hd1">
Patient Name
</h2>
</div>
</body>
</html>
I am using following code in C# to access "hd1".
Please tell me correct way to do it.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    string filePath = "E:\\file1.htm";
    htmlDoc.LoadHtml(filePath);
    if (htmlDoc.DocumentNode != null)
    {
        HtmlNodeCollection _hdPatient = htmlDoc.DocumentNode.SelectNodes("//h2[@id=hd1]");
        // htmlDoc.DocumentNode.SelectNodes("//h2[@id='hd1']");
        //_hdPatient.InnerHtml = "Patient SurName";
    }
}
catch (Exception ex)
{
    throw ex;
}
I have tried many permutations and combinations, but I always get null.
Please help.
Your problem is the way you load data into HtmlDocument. To load data from a file you should use the Load(fileName) method. But you are using the LoadHtml(htmlString) method, which treats "E:\\file1.htm" as the document content. When HtmlAgilityPack tries to find h2 tags in the E:\\file1.htm string, it finds nothing. Here is the correct way to load an HTML file:
string filePath = "E:\\file1.htm";
htmlDoc.Load(filePath); // use instead of LoadHtml
Also, @Simon Mourier correctly pointed out that you should use the SelectSingleNode method if you are trying to get a single node:
// Single HtmlNode
var patient = htmlDoc.DocumentNode.SelectSingleNode("//h2[@id='hd1']");
patient.InnerHtml = "Patient SurName";
Or, if you are working with a collection of nodes, process them in a loop:
// Collection of nodes
var patients = htmlDoc.DocumentNode.SelectNodes("//div[@class='patient']");
foreach (var patient in patients)
    patient.SetAttributeValue("style", "visibility: hidden");
You were almost there:
HtmlNode _hdPatient = htmlDoc.DocumentNode.SelectSingleNode("//h2[@id='hd1']");
_hdPatient.InnerHtml = "Patient SurName";
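The effect of passing a file path to LoadHtml can be reproduced without any file on disk. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name LoadVersusLoadHtmlDemo and the helper ContainsHeading are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadVersusLoadHtmlDemo
{
    // Parses the given string as HTML and reports whether <h2 id="hd1"> exists.
    public static bool ContainsHeading(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup); // the string itself is the document content
        return doc.DocumentNode.SelectSingleNode("//h2[@id='hd1']") != null;
    }

    public static void Main()
    {
        // A path is just text to LoadHtml, so there is no h2 to find.
        Console.WriteLine(ContainsHeading("E:\\file1.htm")); // prints "False"

        // Real markup contains the heading.
        Console.WriteLine(ContainsHeading(
            "<html><body><div id=\"header\"><h2 id=\"hd1\">Patient Name</h2></div></body></html>")); // prints "True"
    }
}
```

To read the markup from disk, htmlDoc.Load(filePath) is the call to use, as described above.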

Trying to scrape a webpage with Agility on C#

I'm using C# and Agility Pack to scrape a website, but the result I'm getting differs from what I see in Firebug. I suppose this is because the site uses Ajax.
// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();
// Creates an HtmlDocument object from an URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.saxobank.com/market-insight/saxotools/forex-open-positions");
// Targets a specific node
HtmlNode someNode = document.GetElementbyId("href");
// If there is no node with that Id, someNode will be null
richTextBox1.Text = document.DocumentNode.OuterHtml;
Could someone advise how to do this correctly? What I'm getting back is just the plain HTML. What I'm looking for is a
div id= "ctl00_MainContent_PositionRatios1_canvasContainer"
Any ideas?
If you have the ID, then use it - href is the name of an attribute, not the id.
HtmlNode someNode =
document.GetElementbyId("ctl00_MainContent_PositionRatios1_canvasContainer");
This will be the div with that id.
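You can then select the first a descendant of that node and read its href attribute (note the leading dot, which keeps the search inside someNode instead of the whole document):
var href = someNode.SelectSingleNode(".//a").Attributes["href"].Value;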

How to get content via xpath

On the web page I have
<meta name="description" content="Learn about 94.100.179.159" />
How can I get exactly the text "Learn about 94.100.179.159" via XPath or HtmlAgilityPack?
I've tried:
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://whois.domaintools.com/94.100.179.159");
foreach (HtmlNode link in htmldocObject.DocumentNode.SelectNodes("//meta"))
{
    string s = link.InnerText;
    Console.WriteLine(s);
}
Console.ReadLine();
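But that does not give me what I want. How can I solve this?
//meta[@name='description']/@content
is the XPath for the attribute you specified. Note that HtmlAgilityPack's SelectNodes returns the owning element rather than the attribute itself, so read the value from the node:
string s = link.GetAttributeValue("content", "");
should return the attribute content.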
Meta tags don't have any inner text, they have attributes.
Try this:
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://whois.domaintools.com/94.100.179.159");
foreach (HtmlNode link in htmldocObject.DocumentNode.SelectNodes("//meta"))
{
    Console.WriteLine("-META-");
    var attribDump = link.Attributes.Select(a => a.Name + " : " + a.Value);
    foreach (var x in attribDump)
    {
        Console.WriteLine(x);
    }
}
Select the nodes as follows:
SelectNodes("//*[local-name()='meta']")
Then, for each HtmlNode,
Console.WriteLine(link.Attributes["content"].Value);
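Reading the content attribute of one specific meta tag can be sketched offline as follows. This assumes the HtmlAgilityPack NuGet package is referenced; the inline markup stands in for the real page, and the class name MetaDescriptionDemo and the helper GetDescription are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class MetaDescriptionDemo
{
    // Returns the content attribute of <meta name="description">, or null if absent.
    public static string GetDescription(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Select the element, then read the attribute from it.
        HtmlNode meta = doc.DocumentNode.SelectSingleNode("//meta[@name='description']");
        return meta == null ? null : meta.GetAttributeValue("content", "");
    }

    public static void Main()
    {
        string html =
            "<html><head>" +
            "<meta name=\"description\" content=\"Learn about 94.100.179.159\" />" +
            "</head><body></body></html>";
        Console.WriteLine(GetDescription(html)); // prints "Learn about 94.100.179.159"
    }
}
```

Filtering on the name attribute in the XPath avoids looping over every meta tag on the page.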
