get value from web page using Html Agility Pack - c#

I am trying to get the value of the "Pool Hashrate" using the HTML Agility Pack. Right when I hit my string hash, I get Object reference not set to an instance of an object. Can somebody tell me what I am doing wrong?
string url = "http://p2pool.org/ltcstats.php?address";
protected void Page_Load(string address)
{
string url = address;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string hash = doc.DocumentNode.SelectNodes("/html/body/div/center/div/table/tbody/tr[1]")[0].InnerText;
}

Assuming you're trying to access that URL, it is expected to fail. That URL doesn't return a full document, just a fragment of HTML: there is no html tag and no body tag, only the div. Your XPath query therefore matches nothing, SelectNodes returns null, and indexing it with [0] throws the null reference exception. You need to query the path that actually exists.
When I access that url, it returns this:
<div>
<center>
<div style="margin-right: 20px;">
<h3>Personal LTC Stats</h3>
<table class='zebra-striped'>
<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>
<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>
<tr><td>Estimated Payout: </td><td> LTC</td></tr>
</table>
</div>
</center>
</div>
Given this, if you wanted to get the Pool Hashrate, you'd use a query more like this:
/div/center/div/table/tr[1]/td[2]
In the end you need to do this:
var url = "http://p2pool.org/ltcstats.php?address";
var web = new HtmlWeb();
var doc = web.Load(url);
var xpath = "/div/center/div/table/tr[1]/td[2]";
var poolHashrate = doc.DocumentNode.SelectSingleNode(xpath);
if (poolHashrate != null)
{
var hash = poolHashrate.InnerText;
// do stuff with hash
}

The problem is that the XPath is not finding the specified node. You can give the table or the tr an id in order to use a shorter XPath.
Also, based on your code I assume that you're looking for a single node only, so you may want to use something like this:
doc.DocumentNode.SelectSingleNode("xpath");
Another good option is Fizzler, which lets you query the document with CSS selectors.
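As a rough sketch of the Fizzler route, the snippet below reads the pool hashrate from the fragment quoted in the first answer. It assumes the Fizzler adapter package for HtmlAgilityPack is installed; the class name FizzlerSketch and the hard-coded html string are only for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // assumption: Fizzler's HtmlAgilityPack adapter is referenced

class FizzlerSketch
{
    static void Main()
    {
        // Fragment copied from the response quoted in the first answer.
        var html = @"<div><center><div style=""margin-right: 20px;"">
<table class='zebra-striped'>
<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>
<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>
</table></div></center></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Every cell of the table in document order; the second cell holds the pool hashrate.
        var cell = doc.DocumentNode
            .QuerySelectorAll("table.zebra-striped td")
            .Skip(1)
            .FirstOrDefault();

        // FirstOrDefault yields null when the table is missing, so guard before reading.
        if (cell != null)
        {
            Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```

Selecting by the table's class keeps the query independent of the surrounding div and center wrappers, which is the same idea as giving the table an id.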

Related

How to scrape a variable data from a source code?

I'm trying to scrape a link from the source code of a website that varies with every source code.
For example:
<div align="center">
<a href="http://www10.site.com/d/the rest of the link">
<span class="button_upload green">
The next time I get the source code the http://www10 changes to any http://www + number like http://www65.
How can I scrape the exact link with the new changed number?
Edit:
Here's how I use the regex:
MatchCollection m1 = Regex.Matches(textBox6.Text, "(href=\"http://www10)(?<td_inner>.*?)(\">)", RegexOptions.Singleline);
You mentioned in the comments that you use regular expressions for parsing the HTML document. That is the hardest way to do this (and generally not recommended). Try using an HTML parser such as Html Agility Pack: http://html-agility-pack.net
For HTML Agility Pack: you install it via NuGet, and here is an example (based on the one posted on their website):
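HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so guard before looping
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        att.Value = FixLink(att); // FixLink is your own method that rewrites the URL
    }
}
doc.Save("file.htm");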
It can also load string contents, not just files. You use XPath (or CSS selectors through an add-on such as Fizzler) to navigate inside the document and select what you want.
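Applied to the markup from the question, a minimal sketch could look like the following. It loads the HTML from a string and finds the link by the span it contains, so the changing www number does not matter. The closing tags and the "Download" text are assumptions added to make the sample self-contained, and the class name LinkSketch is only illustrative.

```csharp
using System;
using HtmlAgilityPack;

class LinkSketch
{
    static void Main()
    {
        // Sample based on the question; closing tags and link text are assumed.
        var html = "<div align=\"center\">" +
                   "<a href=\"http://www65.site.com/d/the rest of the link\">" +
                   "<span class=\"button_upload green\">Download</span></a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml parses markup held in a string

        // Pick the <a> that directly contains the upload button span,
        // regardless of which host name appears in its href.
        var link = doc.DocumentNode.SelectSingleNode(
            "//a[span[contains(@class, 'button_upload')]]");

        // SelectSingleNode returns null when nothing matches.
        if (link != null)
        {
            Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}
```

Anchoring the query on the span's class rather than on the URL is a design choice: the class is the part of the page that stays stable between requests.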
How about a JS function like this, run when the page loads:
// jQuery is required!
var updateLinkUrl = function (num) {
$.each($('.button_upload.green'), function (pos, el) {
var orig = $(el).parent().prop("href");
var newurl = orig.replace("www10", "www" + num);
$(el).parent().prop("href", newurl);
});
};
$(document).ready(function () { updateLinkUrl(65); });

Xpath for <a></a> tag

Let's say this is my HTML code:
<a class="" data-tracking-id="0_Motorola"
href="/motorola?otracker=nmenu_sub_electronics_0_Motorola">
Motorola
</a>
I used C# code to find the href value like this
var tags = htmlDoc.DocumentNode.SelectNodes("//div[@class='top-menu unit']" +
"//ul//li//div[@id='submenu_electronics']//a");
if (tags != null)
{
foreach (var t in tags)
{
var name = t.InnerText.Trim();
var url =t.Attributes["href"].Value;
}
}
I am getting url='/motorola', but I need url='/motorola?otracker=nmenu_sub_electronics_0_Motorola'.
The text after ? and & is not included. Please clarify where I went wrong.
I have used HtmlAgilityPack in the past and I have previously used it like this :
var url = t.GetAttributeValue("href","");
You can try that and see if it works.

selecting Node does not work using HtmlAgilityPack

I am using VS2010 and HtmlAgilityPack 1.4.6 (from the Net40 folder).
The following is my HTML:
<html>
<body>
<div id="header">
<h2 id="hd1">
Patient Name
</h2>
</div>
</body>
</html>
I am using following code in C# to access "hd1".
Please tell me correct way to do it.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
string filePath = "E:\\file1.htm";
htmlDoc.LoadHtml(filePath);
if (htmlDoc.DocumentNode != null)
{
HtmlNodeCollection _hdPatient = htmlDoc.DocumentNode.SelectNodes("//h2[@id=hd1]");
// htmlDoc.DocumentNode.SelectNodes("//h2[@id='hd1']");
//_hdPatient.InnerHtml = "Patient SurName";
}
}
catch (Exception ex)
{
throw ex;
}
I have tried many permutations and combinations, but I always get null.
Please help.
Your problem is the way you load data into the HtmlDocument. To load data from a file you should use the Load(fileName) method. You are using the LoadHtml(htmlString) method instead, which treats "E:\\file1.htm" as the document content. When HtmlAgilityPack tries to find h2 tags in the string E:\\file1.htm, it finds nothing. Here is the correct way to load an HTML file:
string filePath = "E:\\file1.htm";
htmlDoc.Load(filePath); // use instead of LoadHtml
Also, @Simon Mourier correctly pointed out that you should use the SelectSingleNode method if you are trying to get a single node:
// Single HtmlNode
var patient = doc.DocumentNode.SelectSingleNode("//h2[@id='hd1']");
patient.InnerHtml = "Patient SurName";
Or, if you are working with a collection of nodes, process them in a loop:
// Collection of nodes
var patients = doc.DocumentNode.SelectNodes("//div[@class='patient']");
foreach (var patient in patients)
    patient.SetAttributeValue("style", "visibility: hidden");
You were almost there:
HtmlNode _hdPatient = htmlDoc.DocumentNode.SelectSingleNode("//h2[@id='hd1']");
_hdPatient.InnerHtml = "Patient SurName";

Trying to scrape a webpage with Agility on C#

I'm using C# and Agility Pack to scrape a website, but the result I'm getting differs from what I see in Firebug. I suppose this is because the site uses some Ajax.
// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();
// Creates an HtmlDocument object from an URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.saxobank.com/market-insight/saxotools/forex-open-positions");
// Targets a specific node
HtmlNode someNode = document.GetElementbyId("href");
// If there is no node with that Id, someNode will be null
richTextBox1.Text = document.DocumentNode.OuterHtml;
Could someone advise how to do this correctly? What I'm getting back is just plain HTML. What I'm looking for is a div with
id="ctl00_MainContent_PositionRatios1_canvasContainer"
Any ideas?
If you have the ID, then use it - href is the name of an attribute, not the id.
HtmlNode someNode =
document.GetElementbyId("ctl00_MainContent_PositionRatios1_canvasContainer");
This will be the div with that id.
You can then select a child a node of that div and read its href attribute using:
// ".//a" is relative to someNode; a bare "//a" would search the whole document
var href = someNode.SelectNodes(".//a")[0].Attributes["href"].Value;
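One detail worth illustrating: an XPath that starts with // is evaluated from the document root even when it is called on a child node, while a leading dot keeps the search inside that node. The sketch below uses made-up markup (the ids first and target and the class name RelativeXPathSketch are hypothetical) to show the difference.

```csharp
using System;
using HtmlAgilityPack;

class RelativeXPathSketch
{
    static void Main()
    {
        // Hypothetical markup: one link outside the target div, one inside it.
        var html = "<div id='first'><a href='/outside'>outside</a></div>" +
                   "<div id='target'><a href='/inside'>inside</a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode target = doc.GetElementbyId("target");

        // "//a" starts at the document root, so it can match links outside target.
        HtmlNode fromRoot = target.SelectSingleNode("//a");

        // ".//a" starts at target, so only its descendants are considered.
        HtmlNode relative = target.SelectSingleNode(".//a");

        Console.WriteLine(fromRoot.GetAttributeValue("href", ""));
        Console.WriteLine(relative.GetAttributeValue("href", ""));
    }
}
```

With this markup the first query picks up the link in the first div and the second query picks up the link inside target.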

C# Html Agility Pack ( SelectSingleNode )

I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is correctly loaded. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?
The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;
Try this:
var name = doc.DocumentNode.SelectSingleNode("//@id='my_name'").InnerHtml;
string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;
public async Task<List<string>> GetAllTagLinkContent(string content)
{
string html = string.Format("<html><head></head><body>{0}</body></html>", content);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
// SelectNodes returns null when nothing matches
return nodes == null ? new List<string>() : nodes.Select(n => n.InnerText).ToList();
}
It also works with ("//a[@href]"). You can try it as above. Hope this helps.
