Trying to scrape a webpage with Html Agility Pack in C#

I'm using C# and Html Agility Pack to scrape a website, but the result I get differs from what I see in Firebug. I suppose this is because the site uses Ajax.
// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();
// Creates an HtmlDocument object from an URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.saxobank.com/market-insight/saxotools/forex-open-positions");
// Targets a specific node
HtmlNode someNode = document.GetElementbyId("href");
// If there is no node with that Id, someNode will be null
richTextBox1.Text = document.DocumentNode.OuterHtml;
Could someone advise on how to do this correctly? All I'm getting back is the plain HTML. What I'm looking for is a
div id="ctl00_MainContent_PositionRatios1_canvasContainer"
Any ideas?

If you have the id, then use it. href is the name of an attribute, not an id.
HtmlNode someNode =
document.GetElementbyId("ctl00_MainContent_PositionRatios1_canvasContainer");
This will be the div with that id.
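You can then select any a node inside that div and read its href attribute. Note the leading dot: it keeps the search inside someNode, whereas //a would search the whole document.
var href = someNode.SelectNodes(".//a")[0].Attributes["href"].Value;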

Related

Problem getting an element value from a dynamic URL address

I want to scrape an element value from a URL in C#, but my program doesn't work.
I have tested many variations of the code but could not get the element value.
using HtmlAgilityPack;
var web = new HtmlWeb();
var doc = web.Load("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=5516102131364383");
var element = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'ltr inline')]");
MessageBox.Show(element.InnerText);

c# and HtmlAgilityPack: get value of a href attribute

Using VS 2019 and .NET 4.8, I'm running the C# code below against the following HTML node and I'm having trouble getting the href value.
The href attribute has a full URL, but the only text I'm getting is "/".
Can someone please tell me where I'm going wrong and how to get the full URL text?
Thank you.
The node:
<h2 class="n">
3.
<a class="business-name" href="/santa-monica-ca/mip/specialists-in-custom-software-16438720" data-analytics='{"target":"name","feature_click":""}' rel="" data-impressed="1">
<span>Specialists In Custom Software</span>
</a>
</h2>
My code:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("https://www.yellowpages.com/search?search_terms=custom+software&geo_location_terms=Los+Angeles%2C+CA");
HtmlNode[] nodes = document.DocumentNode.SelectNodes("//h2[@class='n']").ToArray();
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.InnerHtml);
Console.WriteLine(node.SelectNodes("//a//span").First().InnerText);
Console.WriteLine(node.SelectNodes("//a").First().Attributes["href"].Value);
}
This should do it, although I don't understand why the original version doesn't work:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("https://www.yellowpages.com/search?search_terms=custom+software&geo_location_terms=Los+Angeles%2C+CA");
var nodes = document.DocumentNode.SelectNodes("//h2[@class='n']");
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.InnerHtml);
Console.WriteLine(node.SelectSingleNode("a/span").InnerText);
Console.WriteLine(node.SelectSingleNode("a").Attributes["href"].Value);
Console.WriteLine();
}
Fiddle: https://dotnetfiddle.net/dtoZGl
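One likely explanation, sketched on made-up markup: an XPath expression that starts with // is evaluated from the document root, even when it is called on a child node, so node.SelectNodes("//a") returns every link on the page and First() picks the page's first link. The class name RelativeXPathDemo, the helper names, and the sample HTML below are assumptions for illustration; they are not part of the original page.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical markup with two listings, mirroring the <h2 class="n"> structure.
    public const string SampleHtml =
        "<html><body>" +
        "<h2 class='n'>1. <a href='/first'><span>First</span></a></h2>" +
        "<h2 class='n'>2. <a href='/second'><span>Second</span></a></h2>" +
        "</body></html>";

    // "//a" is evaluated from the document root, whatever node it is called on.
    public static string HrefFromRoot(HtmlNode node) =>
        node.SelectSingleNode("//a").GetAttributeValue("href", "");

    // "a" (or ".//a" for any depth) is evaluated relative to the node itself.
    public static string HrefFromNode(HtmlNode node) =>
        node.SelectSingleNode("a").GetAttributeValue("href", "");

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        foreach (HtmlNode h2 in doc.DocumentNode.SelectNodes("//h2[@class='n']"))
        {
            Console.WriteLine($"{HrefFromRoot(h2)} vs {HrefFromNode(h2)}");
        }
    }
}
```

With this sample, HrefFromRoot should return the first link for both headings, while HrefFromNode returns each heading's own link.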

How do I access the title attribute with xpath?

So I am trying to access the title attribute in the colors section.
If you hover over any of the small images on the right side of the product, you will see the color name.
I've managed to navigate to those elements, but I can't for the life of me figure out how to get the title attribute. Currently the code prints the nodes themselves, but I want the title attribute.
How do I properly access the title attribute and print the color that corresponds to each picture?
This is the test link I am using (AliExpress)
Console.WriteLine("Product URL: ");
//Declare the URL
string url = Console.ReadLine();
// HtmlWeb - A Utility class to get HTML document from http
var web = new HtmlWeb();
//Load() Method download the specified HTML document from an Internet resource.
var doc = web.Load(url);
var nodes = doc.DocumentNode.SelectNodes("//li[@class = 'item-sku-image']");
foreach (var node in nodes)
{
//var colors = node.Attributes["/a[title]"].Value;
Console.WriteLine(node);
}
Console.ReadLine();
You can try iterating over the following XPath: //li[@class = 'item-sku-image']/a/img/@title. Alternatively, replace //li[@class = 'item-sku-image'] with //li[@class = 'item-sku-image']/a/img and then check the attributes of the resulting nodes.
It should yield a series of strings which contain the title you are after.
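Because SelectNodes hands back HtmlNode objects, the second suggestion (select the img elements, then read the attribute from each) is the easier one to sketch. The markup below is hypothetical; the real product page may be structured differently or may build this list with JavaScript, in which case the nodes will not be in the downloaded HTML at all. TitleAttributeDemo and ReadTitles are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TitleAttributeDemo
{
    // Hypothetical markup standing in for the color swatches.
    public const string SampleHtml =
        "<ul>" +
        "<li class='item-sku-image'><a><img title='Red' src='r.jpg'></a></li>" +
        "<li class='item-sku-image'><a><img title='Blue' src='b.jpg'></a></li>" +
        "</ul>";

    public static List<string> ReadTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var images = doc.DocumentNode.SelectNodes("//li[@class='item-sku-image']/a/img");
        if (images == null)
        {
            return titles;
        }

        foreach (HtmlNode img in images)
        {
            // The second argument is the fallback used when the attribute is missing.
            titles.Add(img.GetAttributeValue("title", string.Empty));
        }
        return titles;
    }

    public static void Main()
    {
        foreach (string title in ReadTitles(SampleHtml))
        {
            Console.WriteLine(title);
        }
    }
}
```

GetAttributeValue is used instead of Attributes["title"].Value so that an image without a title yields an empty string rather than an exception.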

get value from web page using Html Agility Pack

I am trying to get the value of the "Pool Hashrate" using the HTML Agility Pack. As soon as execution reaches the string hash line, I get "Object reference not set to an instance of an object". Can somebody tell me what I am doing wrong?
string url = "http://p2pool.org/ltcstats.php?address";
protected void Page_Load(string address)
{
string url = address;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string hash = doc.DocumentNode.SelectNodes("/html/body/div/center/div/table/tbody/tr[1]")[0].InnerText;
}
Assuming you're trying to access that url, of course it should fail. That url doesn't return a full document, but just a fragment of html. There is no html tag, there is no body tag, just the div. Your xpath query returns nothing and thus the null reference exception. You need to query the right thing.
When I access that url, it returns this:
<div>
<center>
<div style="margin-right: 20px;">
<h3>Personal LTC Stats</h3>
<table class='zebra-striped'>
<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>
<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>
<tr><td>Estimated Payout: </td><td> LTC</td></tr>
</table>
</div>
</center>
</div>
Given this, if you wanted to get the Pool Hashrate, you'd use a query more like this:
/div/center/div/table/tr[1]/td[2]
In the end you need to do this:
var url = "http://p2pool.org/ltcstats.php?address";
var web = new HtmlWeb();
var doc = web.Load(url);
var xpath = "/div/center/div/table/tr[1]/td[2]";
var poolHashrate = doc.DocumentNode.SelectSingleNode(xpath);
if (poolHashrate != null)
{
var hash = poolHashrate.InnerText;
// do stuff with hash
}
The problem is that the XPath does not find the specified node. You can give the table or the tr an id so that a shorter XPath works.
Also, based on your code, I assume you are looking for a single node only, so you may want to use something like this:
doc.DocumentNode.SelectSingleNode("xpath");
Another good option is Fizzler, which lets you query with CSS selectors instead of XPath.
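As a rough illustration of the shorter-XPath idea, the sketch below anchors the query on the table's class attribute instead of spelling out the full path from the root. It parses the fragment quoted earlier from a string, so it needs no network access. PoolHashrateDemo and ReadPoolHashrate are made-up names, and the approach assumes the class zebra-striped stays on the table.

```csharp
using System;
using HtmlAgilityPack;

public static class PoolHashrateDemo
{
    // The fragment quoted in the answer above, inlined for the example.
    public const string Fragment =
        "<div><center><div style='margin-right: 20px;'>" +
        "<h3>Personal LTC Stats</h3>" +
        "<table class='zebra-striped'>" +
        "<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>" +
        "<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>" +
        "</table></div></center></div>";

    // Returns the text of the second cell in the first row, or null if it is not found.
    public static string ReadPoolHashrate(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchoring on the class attribute avoids depending on every ancestor element.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//table[@class='zebra-striped']/tr[1]/td[2]");
        return cell?.InnerText.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ReadPoolHashrate(Fragment) ?? "(not found)");
    }
}
```

The query still uses tr directly under table, matching the fragment, which has no tbody element.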

C# Html Agility Pack ( SelectSingleNode )

I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is loaded correctly. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?
The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;
Try this:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;
public Task<List<string>> GetAllTagLinkContent(string content)
{
// Wrap the fragment so that it parses as a complete document
string html = string.Format("<html><head></head><body>{0}</body></html>", content);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null when nothing matches
var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
if (nodes == null) return Task.FromResult(new List<string>());
return Task.FromResult(nodes.Select(r => r.InnerText).ToList());
}
It also works with ("//a[@href]"); you can try it as above. Hope this helps.
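Whichever selector is used, SelectSingleNode returns null when the element is not in the downloaded HTML (for example because JavaScript writes it later), and calling .InnerHtml or .InnerText on that null is what raises "Object reference not set to an instance of an object". A minimal null-safe sketch follows; SafeLookupDemo and InnerTextById are hypothetical names, and the sample markup is the h1 from the question.

```csharp
using System;
using HtmlAgilityPack;

public static class SafeLookupDemo
{
    // Returns the element's inner text, or null when no element has that id.
    public static string InnerTextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The id is concatenated into the XPath, so it must not contain a single quote.
        HtmlNode node = doc.DocumentNode.SelectSingleNode($"//*[@id='{id}']");

        // The null-conditional operator avoids the NullReferenceException.
        return node?.InnerText;
    }

    public static void Main()
    {
        const string html = "<h1 class=\"bla\" id=\"my_name\">namehere</h1>";
        Console.WriteLine(InnerTextById(html, "my_name") ?? "(not found)");
        Console.WriteLine(InnerTextById(html, "profile_name") ?? "(not found)");
    }
}
```

A null result here is a hint to inspect the raw HTML the server returns, rather than the DOM shown in the browser's developer tools.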
