I'm currently crawling some web sites and retrieving information from them to store into a database for later use. I'm using HtmlAgilityPack and I've successfully done this for a few sites now but for some reason this one is giving me issues. I'm fairly new to XPath syntax still so I'm probably messing up there.
Here's what the markup I'm trying to retrieve from the site looks like:
<form ... id="_subcat_ids_">
<input ....>
<ul ...>
<li ....>
<input .....>
<a class="facet-selection multiselect-facets "
.... href="INeedThisHref#1">
Text I Need //need to retrieve this text between the <a></a>
<span class="subtle-note">(2)</span> //I need that number from inside the span
</a>
</li>
<li ....>
<input .....>
<a class="facet-selection multiselect-facets "
.... href="INeedThisHref#2">
Text I Need #2 //need to retrieve this text between the <a></a>
<span class="subtle-note">(6)</span> //I need that number from inside the span
</a>
</li>
Each one of those represents an item on a page, but I'm only interested in what is happening with each <a></a>. I want to retrieve that href value from inside the <a>, then the text between the opening and closing, then I need the text inside the <span>. I left out the stuff inside of the other tags because they do not help uniquely identify each item, the class inside <a> is the only thing they share, and they are all inside of the form with id="_subcat_ids_".
Here's my code:
try
{
string fullUrl = "...";
HtmlWeb web = new HtmlWeb();
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
HtmlDocument html = web.Load(fullUrl);
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']")) //this gets me into the form
{
foreach (HtmlNode node2 in node.SelectNodes(".//a[@class='facet-selection multiselect-facets ']")) //this should get me into the <a> tags, but it throws 'object reference not set to an instance of an object'
{
//get the href
string tempHref = node2.GetAttributeValue("href", string.Empty);
//get the text between <a>
string tempCat = node2.InnerText.Trim();
//get the text between <span>
string tempNum = node2.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
}
}
}
catch (Exception ex)
{
Console.Write("\nError: " + ex.ToString());
}
That first foreach loop doesn't error, but the second one throws 'object reference not set to an instance of an object' on the line where the second foreach loop starts. As I mentioned, I'm still new to this syntax. I've used this approach on another website with great success, but I'm having trouble with this site. Any tips would be appreciated.
Well, I figured it out. Here's the code:
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']"))
{
//get the categories, store in list
foreach (HtmlNode node2 in node.SelectNodes("..//a[@class='facet-selection multiselect-facets ']//text()[normalize-space() and not(ancestor::span)]"))
{
string tempCat = node2.InnerText.Trim();
categoryList.Add(tempCat);
Console.Write("\nCategory: " + tempCat);
}
foreach (HtmlNode node3 in node.SelectNodes("..//a[@class='facet-selection multiselect-facets ']"))
{
//get href for each category, store in list
string tempHref = node3.GetAttributeValue("href", string.Empty);
LinkCatList.Add(tempHref);
Console.Write("\nhref: " + tempHref);
//get the number of items from categories, store in list
string tempNum = node3.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
//strip the surrounding parentheses, e.g. "(2)" becomes "2"
tempNum = tempNum.Replace("(", "").Replace(")", "");
Console.Write("\nNumber of items: " + tempNum + "\n\n");
}
}
Works like a charm.
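For readers who want the href, the link text, and the count in a single pass, here is a minimal sketch with null checks. It rests on several assumptions that the post does not confirm: `FacetScraper` and `ExtractFacets` are hypothetical names, the HTML is passed in as a string, and the `ElementsFlags` line is there because some HtmlAgilityPack versions are reported to parse the contents of a `<form>` as siblings rather than children, which would explain why `.//a` found nothing while `..//a` worked.

```csharp
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class FacetScraper
{
    // Hypothetical helper (not from the original post): returns one
    // (href, text, count) tuple per facet link found inside the form.
    public static List<(string Href, string Text, string Count)> ExtractFacets(string html)
    {
        // Assumption: ask the parser to treat <form> like any other container
        // element. Removing a key that is not present is harmless.
        HtmlNode.ElementsFlags.Remove("form");

        var results = new List<(string Href, string Text, string Count)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // contains() tolerates extra class names and the trailing space
        // in the class attribute.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//form[@id='_subcat_ids_']//a[contains(@class,'multiselect-facets')]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null)
            return results;

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", string.Empty);

            // The span may be missing on some items, so check before reading it.
            HtmlNode span = a.SelectSingleNode(".//span[@class='subtle-note']");
            string count = span == null
                ? string.Empty
                : span.InnerText.Trim().Trim('(', ')');

            // Collect only the anchor's direct text children, which skips
            // the text inside the nested span.
            var sb = new StringBuilder();
            foreach (HtmlNode child in a.ChildNodes)
            {
                if (child.NodeType == HtmlNodeType.Text)
                    sb.Append(child.InnerText);
            }

            results.Add((href, sb.ToString().Trim(), count));
        }
        return results;
    }
}
```

Calling `FacetScraper.ExtractFacets(pageHtml)` returns an empty list instead of throwing when the form or the links are absent, which makes the failure mode easier to spot than a NullReferenceException inside a foreach.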
Related
I want to write a program that reads values from a website with GeckoFX. My problem is that I don't get the values I need; the element comes back as null.
The HTML code I want to access:
<li id="box" class="tooltip" title="">
<div class="classname"></div>
<span class="value">
<span id="class_test" class="">48.066</span>
</span>
</li>
48.066 is the value I want to read.
I have searched for about two days for a solution so that I can continue with my private project. I hope someone can help :)
Solutions I tried:
Test 1:
GeckoElement testelement = null;
testelement = (GeckoElement)Browser.Document.GetElementById("class_test");
string text = testelement.GetAttribute("value");
Test 2:
GeckoHtmlElement testelement = null;
testelement = (GeckoHtmlElement)Browser.Document.GetHtmlElementById("class_test");
string text = testelement.InnerHtml;
If testelement is null, you're not loading the page correctly, or the source is incorrect.
This works fine:
string content = "<html><body><li id=\"box\" class=\"tooltip\" title=\"\">"+
"<div class=\"classname\"></div>" +
"<span class=\"value\">" +
"<span id=\"class_test\" class=\"\">48.066</span>" +
"</span></li></body></html>";
webBrowser1.LoadHtml(content, "http://www.example.com");
And in webBrowser_DocumentCompleted:
GeckoElement testelement = null;
testelement = (GeckoElement)webBrowser1.Document.GetElementById("class_test");
string text = testelement.InnerHtml; // 48.066
Most likely, you need to wait for the document to finish loading before looking for elements.
Browser.DocumentCompleted += (sender, e) =>
{
var testElement = Browser.Document.GetElementById("class_test") as GeckoElement;
// TODO: handle testElement being null
string text = testElement.GetAttribute("value");
};
This code works on some websites, but on others it throws an error, and I do not know how to fix it (the failing line was marked with stars).
var document = webBrowser1.Document;
var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)document.DomDocument;
var htmlString = documentAsIHtmlDocument3.documentElement.innerHTML;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlString);
// Use the nodes to get the text
HtmlNodeCollection texts = doc.DocumentNode.SelectNodes("//div[@id='footer']/p");
string kq = "";
// Loop over the nodes to collect the result
foreach (var item in texts)
{
kq += item.InnerText + Environment.NewLine;
}
richTextBox1.Text = kq;
HTML code:
<div id="divTop" >
<div id="text-conent" style="width: 500px; float: right;"></div>
<div id="grid" style="margin-removed 505px; height: 700px;"></div>
</div>
It seems that on the pages where this succeeds there is a div with the id footer, but on the pages where it fails no such div exists. In that case SelectNodes returns null, and the foreach throws.
So your logic may need to change: make the search expression passed to doc.DocumentNode.SelectNodes more forgiving.
Alternatively, create a few more search strings to try if your original one fails:
if(texts == null){
texts = doc.DocumentNode.SelectNodes("some other search string");
}
etc.
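The fallback idea above could be wrapped in a small helper, sketched here under stated assumptions: `XPathFallback` and `SelectFirstMatch` are hypothetical names, and the XPath strings you pass in are whatever suits your pages. It tries each expression in order and returns an empty list rather than null, so the calling foreach never throws.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathFallback
{
    // Hypothetical helper: tries each XPath in order and returns the
    // first non-empty match set, or an empty list if none of them match.
    public static IList<HtmlNode> SelectFirstMatch(HtmlNode root, params string[] xpaths)
    {
        foreach (string xpath in xpaths)
        {
            // SelectNodes returns null when the expression matches nothing.
            HtmlNodeCollection found = root.SelectNodes(xpath);
            if (found != null && found.Count > 0)
                return found;
        }
        return new List<HtmlNode>();
    }
}
```

Usage would look like `foreach (var item in XPathFallback.SelectFirstMatch(doc.DocumentNode, "//div[@id='footer']/p", "//div[@id='divTop']/div"))`, where the second expression is only an illustration based on the HTML shown in the question.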
C# code:
var node = new HtmlWeb();
var doc = node.Load("http://ask.fm/");
HtmlNode ournode = doc.DocumentNode.SelectSingleNode("//div[@id='heads']");
textBox1.Text = ournode.InnerHtml;
HTML code:
<div id="heads">
<img alt="" class="head" id="face_30132803" src="http://img3.ask.fm/assets2/103/548/655/872/thumb_tiny/IMG_20150513_192250.jpg" />
<img alt="" class="head" id="face_56578735" src="http://img1.ask.fm/assets2/091/364/883/712/thumb_tiny/11094711_919135961470973_149663457_njpg720960png1280963.png" />
I want to see the following in the text box
/sudenur3434
/leylaulucay
I have added an additional line to your code:
var node = new HtmlWeb();
var doc = node.Load("http://ask.fm/");
HtmlNode ournode = doc.DocumentNode.SelectSingleNode("//div[@id='heads']");
var val = ournode.Attributes["href"].Value;
textBox1.Text=val;
This would let you get the href attribute. Use the same code to get the other nodes' href values and then add them to your text box.
Since a text box is usually used for one liners, I am giving you an example that will simply write all links in the direct output window of VS.
If you use e.g. a ListBox instead of a text box you can replace Debug.Print by e.g. ListBox1.Items.Add(href.Value)
This here will give you all href urls from all a children in div id="heads":
var site = new HtmlWeb();
var htmldoc = site.Load("http://ask.fm/");
var headDiv = htmldoc.DocumentNode.SelectSingleNode("//div[@id='heads']");
if (headDiv != null)
{
var anchors = headDiv.SelectNodes("a");
foreach (HtmlNode aNode in anchors)
{
var href = aNode.Attributes.AttributesWithName("href").FirstOrDefault();
if (href != null)
Debug.Print(href.Value);
}
}
< div id="heads" >
<img alt="" class="head" id="face_30132803" src="http://img3.ask.fm/assets2/103/548/655/872/thumb_tiny/IMG_20150513_192250.jpg" />
<a href="/leylaulucay" data-rlt-aid="welcome_head"><img alt="" class="head" id="face_5
How do I parse this with HtmlAgilityPack and show it in a text box?
Let's say this is my HTML code:
<a class="" data-tracking-id="0_Motorola"
href="/motorola?otracker=nmenu_sub_electronics_0_Motorola">
Motorola
</a>
I used C# code like this to find the href value:
var tags = htmlDoc.DocumentNode.SelectNodes("//div[@class='top-menu unit']//ul//li//div[@id='submenu_electronics']//a");
if (tags != null)
{
foreach (var t in tags)
{
var name = t.InnerText.Trim();
var url =t.Attributes["href"].Value;
}
}
I am getting url='/motorola', but I need url='/motorola?otracker=nmenu_sub_electronics_0_Motorola'.
The text after ? and & is not included. Please clarify where I went wrong.
I have used HtmlAgilityPack in the past and I have previously used it like this :
var url = t.GetAttributeValue("href","");
You can try that and see if it works.
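One way to narrow this down is to check the attribute in isolation, outside the real page. The sketch below is only a diagnostic under stated assumptions: `HrefProbe` and `ReadFirstHref` are hypothetical names, and the HTML is passed in as a string. If the full query string comes back here but not from the live page, the difference is more likely in the HTML that was actually downloaded than in the attribute accessor.

```csharp
using HtmlAgilityPack;

public static class HrefProbe
{
    // Hypothetical helper: returns the href of the first <a> in the
    // given markup, or an empty string if there is no anchor.
    public static string ReadFirstHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode anchor = doc.DocumentNode.SelectSingleNode("//a");
        if (anchor == null)
            return string.Empty;

        // GetAttributeValue returns the supplied default instead of
        // throwing when the attribute is missing.
        return anchor.GetAttributeValue("href", string.Empty);
    }
}
```

If the page encodes ampersands as `&amp;` inside href values, `HtmlEntity.DeEntitize` can be applied to the returned string before using it as a URL.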
I am using VS2010 with HtmlAgilityPack 1.4.6 (from the Net40 folder).
The following is my HTML:
<html>
<body>
<div id="header">
<h2 id="hd1">
Patient Name
</h2>
</div>
</body>
</html>
I am using the following code in C# to access "hd1".
Please tell me the correct way to do it.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
string filePath = "E:\\file1.htm";
htmlDoc.LoadHtml(filePath);
if (htmlDoc.DocumentNode != null)
{
HtmlNodeCollection _hdPatient = htmlDoc.DocumentNode.SelectNodes("//h2[@id=hd1]");
// htmlDoc.DocumentNode.SelectNodes("//h2[@id='hd1']");
//_hdPatient.InnerHtml = "Patient SurName";
}
}
catch (Exception ex)
{
throw ex;
}
I have tried many permutations and combinations, but I always get null.
Please help.
Your problem is how you load data into HtmlDocument. To load data from a file you should use the Load(fileName) method. But you are using the LoadHtml(htmlString) method, which treats "E:\\file1.htm" as the document content. When HtmlAgilityPack tries to find h2 tags in the string E:\\file1.htm, it finds nothing. Here is the correct way to load an HTML file:
string filePath = "E:\\file1.htm";
htmlDoc.Load(filePath); // use instead of LoadHtml
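The difference between the two methods can be seen side by side in a minimal sketch. `LoadDemo` and its two methods are hypothetical names introduced for illustration; the XPath is the one from the question, and the path is whatever file you want to inspect.

```csharp
using HtmlAgilityPack;

public static class LoadDemo
{
    private const string XPath = "//h2[@id='hd1']";

    // LoadHtml parses its argument as markup, so passing a file path
    // produces a document whose only content is the path text itself.
    public static bool FoundWithLoadHtml(string path)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(path);
        return doc.DocumentNode.SelectSingleNode(XPath) != null;
    }

    // Load opens the file at the given path and parses its contents.
    public static bool FoundWithLoad(string path)
    {
        var doc = new HtmlDocument();
        doc.Load(path);
        return doc.DocumentNode.SelectSingleNode(XPath) != null;
    }
}
```

For a file containing the HTML from the question, the first method should report no match and the second should find the h2 element.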
Also, @Simon Mourier correctly pointed out that you should use the SelectSingleNode method if you are trying to get a single node:
// Single HtmlNode
var patient = htmlDoc.DocumentNode.SelectSingleNode("//h2[@id='hd1']");
patient.InnerHtml = "Patient SurName";
Or, if you are working with a collection of nodes, process them in a loop:
// Collection of nodes
var patients = htmlDoc.DocumentNode.SelectNodes("//div[@class='patient']");
foreach (var patient in patients)
patient.SetAttributeValue("style", "visibility: hidden");
You were almost there:
HtmlNode _hdPatient = htmlDoc.DocumentNode.SelectSingleNode("//h2[@id='hd1']");
_hdPatient.InnerHtml = "Patient SurName";