I am using VS2010 and using HTMLAGilityPack1.4.6 (from Net40-folder).
Following is my HTML
<html>
<body>
<div id="header">
<h2 id="hd1">
Patient Name
</h2>
</div>
</body>
</html>
I am using following code in C# to access "hd1".
Please tell me correct way to do it.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
string filePath = "E:\\file1.htm";
htmlDoc.LoadHtml(filePath);
if (htmlDoc.DocumentNode != null)
{
HtmlNodeCollection _hdPatient = htmlDoc.DocumentNode.SelectNodes("//h2[#id=hd1]");
// htmlDoc.DocumentNode.SelectNodes("//h2[#id='hd1']");
//_hdPatient.InnerHtml = "Patient SurName";
}
}
catch (Exception ex)
{
throw ex;
}
Tried many permutations and combinations... I get null.
plz help.
Your problem is the way how you load data into HtmlDocument. In order to load data from file you should use Load(fileName) method. But you are using LoadHtml(htmlString) method, which treats "E:\\file1.htm" as document content. When HtmlAgilityPack tries to find h2 tags in E:\\file1.htm string, it finds nothing. Here is the correct way to load html file:
string filePath = "E:\\file1.htm";
htmlDoc.Load(filePath); // use instead of LoadHtml
Also #Simon Mourier correctly pointed that you should use SelectSingleNode method if you are trying to get single node:
// Single HtmlNode
var patient = doc.DocumentNode.SelectSingleNode("//h2[#id='hd1'");
patient.InnerHtml = "Patient SurName";
Or if you are working with collection of nodes, then process them in a loop:
// Collection of nodes
var patients = doc.DocumentNode.SelectNodes("//div[#class='patient'");
foreach (var patient in patients)
patient.SetAttributeValue("style", "visibility: hidden");
You were almost there:
HtmlNode _hdPatient = htmlDoc.DocumentNode.SelectSingleNode("//h2[#id='hd1']");
_hdPatient.InnerHtml = "Patient SurName"
Related
I'm trying to scrape a link from the source code of a website that varies with every source code.
Form example:
<div align="center">
<a href="http://www10.site.com/d/the rest of the link">
<span class="button_upload green">
The next time I get the source code the http://www10 changes to any http://www + number like http://www65.
How can I scrape the exact link with the new changed number?
Edit :
Here's how i use RE MatchCollection m1 = Regex.Matches(textBox6.Text, "(href=\"http://www10)(?<td_inner>.*?)(\">)", RegexOptions.Singleline);
You mentioned in the comments that you use Regulars expressions for parsing the HTML Document. That is a the hardest way you can do this (also, generally not recommended!). Try using a HTML Parser like http://html-agility-pack.net
For HTML Agility Pack: You install it via NuGet Packeges and here is an example (posted on their website):
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[#href]")
{
HtmlAttribute att = link["href"];
att.Value = FixLink(att);
}
doc.Save("file.htm");
It can also load string contents, not just files. You use xPath or CSS Selectors to navigate inside the document and select what you want.
How about a JS function like this, run when the page loads:
// jQuery is required!
var updateLinkUrl = function (num) {
$.each($('.button_upload.green'), function (pos, el) {
var orig = $(el).parent().prop("href");
var newurl = orig.replace("www10", "www" + num);
$(el).parent().prop("href", newurl);
});
};
$(document).ready(function () { updateLinkUrl(65); });
I've been trying to extract data from a website by giving it the HTML string.
I did some research and figured out that I had to use the HtmlAgilityPack; however,
I can't figure out how to apply the examples to my case.
I've done different tests but none seem to work.
A webpage example could be
http://www.trivago.com/?aDateRange[arr]=2014-11-02&aDateRange[dep]=2014-11-03&iRoomType=7&iPathId=34741&iGeoDistanceItem=0&iViewType=0&bIsSeoPage=false&bIsSitemap=false&
I would only need to extract the contact data,
address, telephone, the official homepage link and the title of the
element in the list.
I tried moving through the source with Firebug and the class structure
to get to this data is as follows:
class="no-touch"
class="web10152"
class="page_wrapper"
class="main_content"
class="main"
class="centercol content"
class="content"
class="container_itemlist itemlist_simplified"
class="itemlist hotellist group component" // Has a List of each item
// Item (undernode of itemlist hotellist group component)
class="hotel item bookmarkable historisable" //item main class
// Path to get title
class="cf item_wrapper"
class="item_prices"
<h3 title="ITEM TITLE" </h3>
// Path to get contact info
class="slideout_wrapper component expand"
class="slideout_content_container"
class="slideout_content info item_info js_trivago_info active"
class="item_info_block contact" // Contains info
<em> ADDRESS INFORMATION </em>
<em> TELEPHONE INFO </em>
class="partnerHomepageLink link"
//Contains Link info
I don't know how to communicate this with HtmlAgilityPack.
Here is the last thing i tried...
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);
try
{
var table = doc.DocumentNode.SelectSingleNode("//h3[#class='jsheadline js_slideout_trigger js_trackable']/title");
var table1 = doc.DocumentNode.SelectSingleNode("//div[#class='item_info_block contact']");
var ele = table1.Elements("em");
}
catch { Program.ChangeColor(Program.TextColors.PROGRAM_ERROR);
Console.WriteLine("\nError Report: Failed to parse page!");
}
How can I achieve this?
Let say this is my html code
<a class="" data-tracking-id="0_Motorola"
href="/motorola?otracker=nmenu_sub_electronics_0_Motorola">
Motorola
</a>
I used C# code to find the href value like this
var tags = htmlDoc.DocumentNode.SelectNodes("//div[#class='top-menu unit']
//ul//li//div[#id='submenu_electronics']//a");
if (tags != null)
{
foreach (var t in tags)
{
var name = t.InnerText.Trim();
var url =t.Attributes["href"].Value;
}
}
I am getting url='/motorola' but I need url=/motorola?otracker=nmenu_sub_electronics_0_Motorola
its not appending text after ?,&.. Please clarify where I went wrong.
I have used HtmlAgilityPack in the past and I have previously used it like this :
var url = t.GetAttributeValue("href","");
You can try that and see if it works.
I am trying to get the value of the "Pool Hashrate" using the HTML Agility Pack. Right when I hit my string hash, I get Object reference not set to an instance of an object. Can somebody tell me what I am doing wrong?
string url = http://p2pool.org/ltcstats.php?address
protected void Page_Load(string address)
{
string url = address;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
string hash = doc.DocumentNode.SelectNodes("/html/body/div/center/div/table/tbody/tr[1]")[0].InnerText;
}
Assuming you're trying to access that url, of course it should fail. That url doesn't return a full document, but just a fragment of html. There is no html tag, there is no body tag, just the div. Your xpath query returns nothing and thus the null reference exception. You need to query the right thing.
When I access that url, it returns this:
<div>
<center>
<div style="margin-right: 20px;">
<h3>Personal LTC Stats</h3>
<table class='zebra-striped'>
<tr><td>Pool Hashrate: </td><td>66.896 Mh/s</td></tr>
<tr><td>Your Hashrate: </td><td>0 Mh/s</td></tr>
<tr><td>Estimated Payout: </td><td> LTC</td></tr>
</table>
</div>
</center>
</div>
Given this, if you wanted to get the Pool Hashrate, you'd use a query more like this:
/div/center/div/table/tr[1]/td[2]
In the end you need to do this:
var url = "http://p2pool.org/ltcstats.php?address";
var web = new HtmlWeb();
var doc = web.Load(url);
var xpath = "/div/center/div/table/tr[1]/td[2]";
var poolHashrate = doc.DocumentNode.SelectSingleNode(xpath);
if (poolHashrate != null)
{
var hash = poolHashrate.InnerText;
// do stuff with hash
}
The problem is that xpath is not finding the specified node. You can specify an id to the table or the tr in order to have a smaller xpath
Also, based on your code I assume that you're looking for a single node only, so you may want to use something like this
doc.DocumentNode.SelectSingleNode("xpath");
Another good option is using Fizzler
I'm trying to parse this field, but can't get it to work. Current attempt:
var name = doc.DocumentNode.SelectSingleNode("//*[#id='my_name']").InnerHtml;
<h1 class="bla" id="my_name">namehere</h1>
Error: Object reference not set to an instance of an object.
Appreciate any help.
#John - I can assure that the HTML is correctly loaded. I am trying to read my facebook name for learning purposes. Here is a screenshot from the Firebug plugin. The version i am using is 1.4.0.
http://i54.tinypic.com/kn3wo.jpg
I guess the problem is that profile_name is a child node or something, that's why I'm not able to read it?
The reason your code doesn't work is because there is JavaScript on the page that is actually writing out the <h1 id='profile_name'> tag, so if you're requesting the page from a User Agent (or via AJAX) that doesn't execute JavaScript then you won't find the element.
I was able to get my own name using the following selector:
string name =
doc.DocumentNode.SelectSingleNode("//a[#id='navAccountName']").InnerText;
Try this:
var name = doc.DocumentNode.SelectSingleNode("//#id='my_name'").InnerHtml;
HtmlAgilityPack.HtmlNode name = doc.DocumentNode.SelectSingleNode("//h1[#id='my_name']").InnerText;
public async Task<List<string>> GetAllTagLinkContent(string content)
{
string html = string.Format("<html><head></head><body>{0}</body></html>", content);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//[#id='my_name']");
return nodes.ToList().ConvertAll(r => r.InnerText).Select(j => j).ToList();
}
It's ok with ("//a[#href]"); You can try it as above.Hope helpful