C# Html Agility Pack (SelectSingleNode)

I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
@John - I can assure you that the HTML is loaded correctly. I am trying to read my Facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version I am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?

Your code doesn't work because JavaScript on the page writes out the <h1 id='profile_name'> tag. If you request the page with a user agent (or via AJAX) that doesn't execute JavaScript, the element is never created, so SelectSingleNode returns null.
I was able to get my own name using the following selector:
string name = doc.DocumentNode.SelectSingleNode("//a[@id='navAccountName']").InnerText;
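Since the element may simply be missing from the markup you downloaded, it helps to check for null before reading the node. The following is only a minimal sketch: the class and method names (NodeText, GetTextById) are made up for illustration, and it assumes the id contains no single quote, since it is pasted straight into the XPath.

```csharp
using HtmlAgilityPack;

public static class NodeText
{
    // Hypothetical helper: returns the inner text of the first element
    // with the given id, or null when the markup has no such element.
    public static string GetTextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches, so test the
        // result before dereferencing it to avoid a NullReferenceException.
        // Assumption: id contains no single quote.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");
        return node == null ? null : node.InnerText;
    }
}
```

Calling NodeText.GetTextById(html, "my_name") on the downloaded page would then tell you whether the element is really there, instead of throwing.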

Try this:
var name = doc.DocumentNode.SelectSingleNode("//*[@id='my_name']").InnerHtml;

string name = doc.DocumentNode.SelectSingleNode("//h1[@id='my_name']").InnerText;

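// Needs: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
public List<string> GetAllTagLinkContent(string content)
{
    // Wrap the fragment so the parser sees a complete document.
    string html = string.Format("<html><head></head><body>{0}</body></html>", content);
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//*[@id='my_name']");
    if (nodes == null)
        return new List<string>();
    return nodes.Select(n => n.InnerText).ToList();
}
It also works with ("//a[@href]"). You can try it as above. Hope this helps.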

Related

Using Regex to insert domain name into url

I am pulling in text from a database that is formatted like the sample below. I want to insert the domain name in front of every URL within this block of text.
<p>We recommend you check out the article
<a id="navitem" href="/article/why-apples-new-iphones-may-delight-and-worry-it-pros/" target="_top">
Why Apple's new iPhones may delight and worry IT pros</a> to learn more</p>
So with the example above in mind I want to insert http://www.mydomainname.com/ into the URL so it reads:
href="http://www.mydomainname.com/article/why-apples-new-iphones-may-delight-and-worry-it-pros/"
I figured I could use regex and replace href=" with href="http://www.mydomainname.com but this appears to not be working as I intended. Any suggestions or better methods I should be attempting?
var content = Regex.Replace(DataBinder.Eval(e.Item.DataItem, "Content").ToString(),
"^href=\"$", "href=\"https://www.mydomainname.com/");
You could use regex...
...but it's very much the wrong tool for the job.
Uri has some handy constructors/factory methods for just this purpose:
Uri ConvertHref(Uri sourcePageUri, string href)
{
    // could really just be return new Uri(sourcePageUri, href);
    // but TryCreate gives more options...
    Uri newAbsUri;
    if (Uri.TryCreate(sourcePageUri, href, out newAbsUri))
    {
        return newAbsUri;
    }
    throw new Exception();
}
so, say sourcePageUri is
var sourcePageUri = new Uri("https://somehost/some/page");
the output of our method with a few different values for href:
https://www.foo.com/woo/har => https://www.foo.com/woo/har
/woo/har => https://somehost/woo/har
woo/har => https://somehost/some/woo/har
...so it's the same interpretation as the browser makes. Perfect, no?
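To apply this to a whole block of HTML like the one in the question, one option is to pair the Uri resolution with an HTML parser. Here is a rough sketch using Html Agility Pack; the class and method names (LinkRewriter, MakeLinksAbsolute) are invented for illustration, and it assumes only a tags with an href need rewriting.

```csharp
using System;
using HtmlAgilityPack;

public static class LinkRewriter
{
    // Hypothetical helper: resolves every <a href> in the fragment
    // against baseUri and returns the rewritten markup.
    public static string MakeLinksAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when there are no matching anchors.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return html;

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            Uri absolute;
            // Already-absolute hrefs come back unchanged; relative ones
            // are resolved the way a browser would resolve them.
            if (Uri.TryCreate(baseUri, href, out absolute))
                anchor.SetAttributeValue("href", absolute.AbsoluteUri);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

With the question's data this would be called as LinkRewriter.MakeLinksAbsolute(content, new Uri("http://www.mydomainname.com/")). Links that already point at another host are left alone, which the plain string replace cannot guarantee.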
Try this code:
var content = Regex.Replace(DataBinder.Eval(e.Item.DataItem, "Content").ToString(),
    "(href=[ \t]*\")/", "$1https://www.mydomainname.com/", RegexOptions.Multiline);
Use an HTML parser, such as CsQuery.
var html = "your html text here";
var path = "http://www.mydomainname.com";
CQ dom = html;
CQ links = dom["a"];
// Prefix each link's href with the domain.
foreach (var link in links)
    link.SetAttribute("href", path + link["href"]);
html = dom.Html();

Get value from web page using Html Agility Pack

I am trying to get the value of the "Pool Hashrate" using the HTML Agility Pack. Right when I hit my string hash, I get Object reference not set to an instance of an object. Can somebody tell me what I am doing wrong?
string url = "http://p2pool.org/ltcstats.php?address";
protected void Page_Load(string address)
{
    string url = address;
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    string hash = doc.DocumentNode.SelectNodes("/html/body/div/center/div/table/tbody/tr[1]")[0].InnerText;
}
Assuming you're trying to access that url, of course it should fail. That url doesn't return a full document, but just a fragment of html. There is no html tag, there is no body tag, just the div. Your xpath query returns nothing and thus the null reference exception. You need to query the right thing.
When I access that url, it returns this:
<div>
  <center>
    <div style="margin-right: 20px;">
      <h3>Personal LTC Stats</h3>
      <table class='zebra-striped'>
        <tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>
        <tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>
        <tr><td>Estimated Payout: </td><td> LTC</td></tr>
      </table>
    </div>
  </center>
</div>
Given this, if you wanted to get the Pool Hashrate, you'd use a query more like this:
/div/center/div/table/tr[1]/td[2]
In the end you need to do this:
var url = "http://p2pool.org/ltcstats.php?address";
var web = new HtmlWeb();
var doc = web.Load(url);
var xpath = "/div/center/div/table/tr[1]/td[2]";
var poolHashrate = doc.DocumentNode.SelectSingleNode(xpath);
if (poolHashrate != null)
{
    var hash = poolHashrate.InnerText;
    // do stuff with hash
}
The problem is that the XPath is not finding the specified node. If you control the markup, you can give the table or the tr an id so that a shorter XPath can target it.
Also, based on your code I assume that you're looking for a single node only, so you may want to use something like this:
doc.DocumentNode.SelectSingleNode("xpath");
Another good option is using Fizzler.
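As a rough illustration of the shorter-XPath idea: the fragment shown earlier has no id attributes, but the table does carry class='zebra-striped', so the query can anchor on that instead of spelling out the full path. This is only a sketch; the class and method names (PoolStats, GetPoolHashrate) are invented, and it assumes the class attribute is exactly 'zebra-striped'.

```csharp
using HtmlAgilityPack;

public static class PoolStats
{
    // Hypothetical helper: returns the text of the second cell in the first
    // row of the zebra-striped table, or null if that cell is not found.
    public static string GetPoolHashrate(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchor on the table's class rather than on every ancestor element.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//table[@class='zebra-striped']//tr[1]/td[2]");

        return cell == null ? null : cell.InnerText.Trim();
    }
}
```

Because the query no longer depends on the surrounding div and center elements, it keeps working whether the server returns a bare fragment or a full page.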

Login through portal and trying to load HTML

I have a C# program that logs into a portal, and then needs to test if an element with a specific ID exists on the page or not. In order to test for this, I grab the HTML from the page and search for an element with a matching ID in the HTML.
However, whenever I try to access the HTML with this script, it always returns the HTML of the portal login page, and not the page after one logs in through the portal. I can confirm 100% that the program is logging into the portal, however for some reason it is still returning the wrong HTML.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
WebClient client = new WebClient();
string html = client.DownloadString(this.currentStepUrl);
doc.LoadHtml(html);
var foo = (from bar in doc.DocumentNode.DescendantNodes()
           where bar.GetAttributeValue("id", null) == expected
           select bar).FirstOrDefault();
if (foo != null)
{
    currentTestCaseResults[0]++;
}
else
{
    currentTestCaseResults[1]++;
}
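Easy fix, I guess. WebClient presumably makes its own request without the logged-in session, so it gets the login page again, whereas the WebBrowser control already holds the page after login.
I replaced everything but the if clause (which now tests expectedElement) with
HtmlElement expectedElement = webBrowser2.Document.GetElementById(expected);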

Trying to scrape a webpage with Agility on C#

I'm using C# and Html Agility Pack to scrape a website, but the result I'm getting differs from what I see in Firebug. I suppose this is because the site uses some Ajax.
// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();
// Creates an HtmlDocument object from an URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.saxobank.com/market-insight/saxotools/forex-open-positions");
// Targets a specific node
HtmlNode someNode = document.GetElementbyId("href");
// If there is no node with that Id, someNode will be null
richTextBox1.Text = document.DocumentNode.OuterHtml;
Could someone advise on how to do this correctly? What I'm getting back is just the plain HTML. What I'm looking for is a
div id= "ctl00_MainContent_PositionRatios1_canvasContainer"
Any ideas?
If you have the ID, then use it. href is the name of an attribute, not an id.
HtmlNode someNode = document.GetElementbyId("ctl00_MainContent_PositionRatios1_canvasContainer");
This will be the div with that id.
You can then select a child a node of that div and read its href attribute. Note the leading dot: a bare //a searches the whole document, while .//a stays inside someNode.
var href = someNode.SelectNodes(".//a")[0].Attributes["href"].Value;
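Both GetElementbyId and the XPath selectors return null when nothing matches, which is likely here if the div is filled in by Ajax after the page loads. A defensive version might look like the sketch below; the class and method names (ContainerLinks, FirstHrefInside) are made up for illustration.

```csharp
using HtmlAgilityPack;

public static class ContainerLinks
{
    // Hypothetical helper: returns the href of the first link inside the
    // element with the given id, or null if the element or link is missing.
    public static string FirstHrefInside(HtmlDocument document, string containerId)
    {
        HtmlNode container = document.GetElementbyId(containerId);
        if (container == null)
            return null; // element is not in the downloaded markup

        // ".//" keeps the search relative to the container; "//" would
        // start again from the document root.
        HtmlNode link = container.SelectSingleNode(".//a[@href]");
        return link == null ? null : link.GetAttributeValue("href", string.Empty);
    }
}
```

If this returns null for the canvas container, the content is most likely generated by script, and the downloaded HTML alone will not contain it.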

How to get XML-code of webpage that is opened in IE (without using WebRequest)?

I'm trying to get the XML text from a webpage that is already open in IE. Web requests are not allowed because of the security of the target page (long boring story with certificates etc). I use a method that walks through all open pages and, if I find a match with a page's URI, I need to get its XML.
Some time ago I needed to get the HTML code between the body tags. I used a method with IHTMLDocument2 like this:
private string GetSourceHTML()
{
    Regex reg = new Regex(patternURL);
    Match match;
    string result;
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        match = reg.Match(ie.LocationURL.ToString());
        if (!string.IsNullOrEmpty(match.Value))
        {
            mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)ie.Document;
            result = doc.body.innerHTML.ToString();
            return result;
        }
    }
    result = string.Empty;
    return result;
}
So now I need to get the whole XML of a target page. I've googled a lot, but didn't find anything useful. Any ideas? Thanks.
Have you tried this? It should get the HTML, which hopefully you could parse to XML?
Retrieving the HTML source code
