This is a page from an open database about food:
http://www.dabas.com/ProductSheet/Details.ashx/121308
I'm trying to get some information from this page using XPath.
The table I'm interested in is the one called Näringsvärde.
I want to get all the text nodes inside "Näringsvärde" saved into a string.
This is the relevant portion of the markup of the page linked above:
<!DOCTYPE html>
<html>
...
<body>
...
<table class="width100" style="page-break-inside: avoid">
<caption>
Produktinformation
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleProduktinformation"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyProduktinformation">
<tr>
<td class="col1">
Ursprungsland:
</td>
<td>
Sverige </td>
</tr>
...
</tbody>
</table>
<table id="tableHover" class="width100 marginTop30 bgTable">
<tr class="nohover">
<td class="tdLeft48 padding0">
<table id="nutritiveTabel" class="leftTable" style="page-break-inside: avoid">
<caption>
Näringsvärde
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleNutritiveValues"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyNutritiveValues">
<tr id="divNutritiveValues">
<td class="padding">
<table class="noBorder width100">
<tr>
<td class="col1">
Tillagningsstatus:
</td>
<td>Tillagad</td>
<td colspan="2">
&nbsp;
</td>
</tr>
...
</table>
</td>
</tr>
</tbody>
</table>
</td>
...
</html>
I tried using something like this so far, but it didn't work:
public List<string> GetNaring(string xid) {
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(xid);
var xpath = "/html/body/div/div[2]/div[2]/table[2]/tbody/tr/td/table/tbody";
var links = doc.DocumentNode.SelectNodes(xpath);
return links.Select(n => n.InnerText).ToList();
}
But this only returns null. What am I missing?
The XPath expression:
/html/body/div/div[2]/div[2]/table[2]/tbody/tr/td/table/tbody
does not match any nodes.
Since you have a unique string you can match, you should use it. Searching for that string in the source code, you will find:
...
<td class="tdLeft48 padding0">
<table id="nutritiveTabel" class="leftTable" style="page-break-inside: avoid">
<caption>
Näringsvärde
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleNutritiveValues"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyNutritiveValues">
<tr id="divNutritiveValues">
...
The string is a child of the caption element inside the table you want. You have to get the string value of that element, trim the extra spaces and use the result to compare to "Näringsvärde". You can select the correct table using this expression:
//table[normalize-space(caption/text())='Näringsvärde']
Once you have the correct table, you can navigate inside it and select the nodes you want, or you can get the string-value which is a concatenation of all the descendant text nodes:
//table[normalize-space(caption/text())='Näringsvärde']//td
This will return all td nodes, which is where the text is.
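The approach above could be sketched in C# roughly as follows. This is only an illustration, not the original poster's code: it assumes the HtmlAgilityPack NuGet package is referenced, the class and method names (`NutritionText`, `Extract`) are made up for the example, and the sample HTML is a trimmed-down stand-in for the real page. The extra `not(.//table)` predicate is an addition that skips the outer cell wrapping the nested table, so its text is not collected twice.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumed dependency: the HtmlAgilityPack NuGet package

public static class NutritionText
{
    // Joins the text of every leaf <td> inside the table captioned "Näringsvärde".
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // not(.//table) skips the outer cell that merely wraps the nested table.
        var cells = doc.DocumentNode.SelectNodes(
            "//table[normalize-space(caption/text())='Näringsvärde']//td[not(.//table)]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (cells == null) return string.Empty;

        var texts = cells
            .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim()) // decode "&nbsp;" etc.
            .Where(text => text.Length > 0);                          // drop empty cells
        return string.Join(" ", texts);
    }

    public static void Main()
    {
        // Hypothetical, shortened version of the markup shown in the question.
        const string html = @"<table><caption>
Näringsvärde
<img src='x.png' /></caption>
<tbody><tr><td><table><tr>
<td>
Tillagningsstatus:
</td><td>Tillagad</td><td>&nbsp;</td>
</tr></table></td></tr></tbody></table>";
        Console.WriteLine(Extract(html));
    }
}
```

Checking for null before enumerating matters here, because an XPath that matches nothing is exactly the situation described in the question.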
Related
Hello, I'm getting an HttpWebResponse that contains the HTML page with all the data I need, for example a table with date info that I need to save to an array list and then to an XML file.
Example of the HTML page:
<tbody>
<tr class="odd">
<tr class="even">
<td class="padding5 sorting_1">
<span class="DateHover" sort="14/03/18/22/56" title="18.03.14" ref="18.03.14">18.03.14</span>
</td>
<td class="CellStyleDefaultText">
<span class="transSpan">Info</span>
</td>
<td class="CellStyleDefaultText" title="UserNumber123">UserNumber123</td>
<td class="CellStyleSignedNumber floatopHomePage">
<span title="701,554.23 ">701,554.23 </span>
</td>
<td class="CellStyleAmount CellStyleAmountNew">
<div title="-3354999.71">-3354999.71</div>
</td>
<td class="CellStyleDetails CCMoreDetailsTd">
<span> 17.03.14 Info</span>
</td>
</tr>
</tbody>
OK, I got the first span with the date:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span[@class='DateHover']"))
and the span with the info:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='transSpan']"))
Then I got stuck trying to get UserNumber123. I did this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='CellStyleDefaultText']"))
but it returns the transSpan span as well, because it is inside a td with the same class,
and I can't get any of the other td elements (CellStyleSignedNumber, CellStyleAmount, CellStyleDetails).
Any ideas?
You can simply name an attribute in the predicate to select elements that have that attribute set. So you can try to get UserNumber123 this way:
doc.DocumentNode.SelectNodes("//td[@class='CellStyleDefaultText' and @title]")
The XPath above selects each <td> element that has a title attribute and whose class attribute equals 'CellStyleDefaultText'.
For the remaining <td> elements, try the XPath contains() function, for example:
doc.DocumentNode.SelectNodes("//td[contains(@class,'CellStyleSignedNumber')]")
UPDATE :
In response to the latter part of your comment: if you want to get a <td> that has a child <span> element, you can add that criterion as follows:
doc.DocumentNode.SelectNodes("//td[span and contains(@class,'CellStyleSignedNumber')]")
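Putting these selectors together, one possible way to read a whole row at a time is sketched below. It assumes the HtmlAgilityPack NuGet package; the `RowReader` class, its method names, and the shortened sample row are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumed dependency: the HtmlAgilityPack NuGet package

public static class RowReader
{
    // Returns { date, user, signed number, amount } for every row that has a DateHover span.
    public static List<string[]> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<string[]>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr[td/span[@class='DateHover']]");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            // Paths without a leading slash are evaluated relative to the current row.
            result.Add(new[]
            {
                Text(row, "td/span[@class='DateHover']"),
                Text(row, "td[@class='CellStyleDefaultText' and @title]"),
                Text(row, "td[contains(@class,'CellStyleSignedNumber')]"),
                Text(row, "td[contains(@class,'CellStyleAmount')]"),
            });
        }
        return result;
    }

    // Trimmed, entity-decoded text of the first match, or "" when there is none.
    private static string Text(HtmlNode context, string relativeXPath)
    {
        HtmlNode node = context.SelectSingleNode(relativeXPath);
        return node == null ? string.Empty : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static void Main()
    {
        // Hypothetical, shortened version of the row shown in the question.
        const string html = @"<table><tbody><tr class='even'>
<td class='padding5 sorting_1'><span class='DateHover' title='18.03.14'>18.03.14</span></td>
<td class='CellStyleDefaultText'><span class='transSpan'>Info</span></td>
<td class='CellStyleDefaultText' title='UserNumber123'>UserNumber123</td>
<td class='CellStyleSignedNumber floatopHomePage'><span title='701,554.23 '>701,554.23 </span></td>
<td class='CellStyleAmount CellStyleAmountNew'><div title='-3354999.71'>-3354999.71</div></td>
</tr></tbody></table>";
        foreach (string[] row in ReadRows(html))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

Selecting the row first and then querying inside it keeps the values of one row together, which is convenient when the goal is to write one XML record per row.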
I have a table in the HTML code below:
<table style="padding: 0px; border-collapse: collapse;">
<tr>
<td><h3>My Regional Financial Office</h3></td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td><h3>My Address</h3></td>
</tr>
<tr>
<td>000 Test Ave S Ste 000</td>
</tr>
<tr>
<td>Golden Valley, MN 00000</td>
</tr>
<tr>
<td>Get Directions</td>
</tr>
<tr>
<td> </td>
</tr>
</table>
How can I get the inner text of the next two <tr> elements after the table row containing the text "My Address"?
You can use the following XPath:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var tdOfInterests =
htmlDoc.DocumentNode
.SelectNodes("//tr[td/h3[.='My Address']]/following-sibling::tr[position() <= 2]/td");
foreach (HtmlNode td in tdOfInterests)
{
//given html input in question following code will print following 2 lines:
//000 Test Ave S Ste 000
//Golden Valley, MN 00000
Console.WriteLine(td.InnerText);
}
The key to the above XPath is using the following-sibling axis with a position() filter.
UPDATE :
A brief explanation of the XPath used in this answer:
//tr[td/h3[.='My Address']]
The first part selects the <tr> element that has a child <td> element which in turn has a child <h3> element whose value equals 'My Address'.
/following-sibling::tr[position() <= 2]
The next part selects the first two following sibling <tr> elements of the current <tr> element (the one selected by the previous part).
/td
The last part selects the child <td> elements of those <tr> elements.
I'm trying to parse some HTML using the HTML Agility Pack. The following code snippet selects the table element containing the information I need but I need to dig deeper into the table.
Once I have the InnerHtml of the table, I plan to look for a <td> with an innertext value of "Field #2", for example. But, then, I need to select the innertext of the NEXT <td>. I need the value 110, in this example. How do I do that?
foreach (var x in doc.DocumentNode.SelectNodes("//table[contains(@class,'data')]"))
{
// pseudo code - search for td and use "contains" on the inner text / html.
// Then, grab the next td inner html.
Console.WriteLine(x.InnerHtml);
}
<tr>
<td width="158"><strong>Field #1:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #2:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #3:</strong></td>
<td width="99">85</td>
<td width="119"><strong>Field #4:</strong></td>
<td width="176">-259.34</td>
</tr>
<tr>
<td width="158"><strong>Field #5:</strong></td>
<td width="99">1</td>
<td width="119"><strong>Field #6:</strong></td>
<td width="176">110</td>
</tr>
<tr>
<td width="158"><strong>Field #7:</strong></td>
<td width="99">12</td>
<td width="119"><strong>Field #8:</strong></td>
<td width="176">123.23</td>
</tr>
This piece of code will return the desired td cell.
//<td width="176">110</td>
// ".//td" keeps the search inside x; a leading "//td" would search the whole document.
var td = x.SelectNodes(".//td").SkipWhile(g => !g.InnerText.Contains("Field #2:")).Skip(1).FirstOrDefault();
In XPath you can query for the next sibling with the following-sibling axis, which HtmlAgilityPack supports:
doc.DocumentNode.SelectNodes(
"//table[contains(@class,'data')]/tr/" +
"td[strong/text()='Field #2:']" +
"/following-sibling::td[1]");
Essentially: find all of the td nodes with the given text, and for each one return its next sibling td node.
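As a small, self-contained illustration of the sibling lookup, the sketch below wraps it in a helper that takes the label text as a parameter. It assumes the HtmlAgilityPack NuGet package; `FieldLookup` and `ValueAfter` are hypothetical names, and the label is spliced directly into the XPath, so this version only works for labels that contain no single quote.

```csharp
using System;
using HtmlAgilityPack; // assumed dependency: the HtmlAgilityPack NuGet package

public static class FieldLookup
{
    // Returns the text of the <td> immediately after the cell labelled `label`,
    // or null when no such cell exists.
    public static string ValueAfter(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // NOTE: label is concatenated into the XPath and must not contain a single quote.
        // [1] limits the following-sibling axis to the nearest following <td>.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(
            "//td[normalize-space(strong)='" + label + "']/following-sibling::td[1]");

        return cell == null ? null : cell.InnerText.Trim();
    }

    public static void Main()
    {
        const string html = @"<table class='data'><tr>
<td width='158'><strong>Field #1:</strong></td><td width='99'>1</td>
<td width='119'><strong>Field #2:</strong></td><td width='176'>110</td>
</tr></table>";
        Console.WriteLine(ValueAfter(html, "Field #2:"));
    }
}
```

Without the `[1]`, a label in the first column would also return the later label and value cells of the same row.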
I have a table like the one below, and I want to get just the text FOO COMPANY from between the td tags. How can I get it?
<table class="left_company">
<tr>
<td style="BORDER-RIGHT: medium none; bordercolor="#FF0000" align="left" width="291" bgcolor="#FF0000">
<table cellspacing="0" cellpadding="0" width="103%" border="0">
<tr style="CURSOR: hand" onclick="window.open('http://www.foo.com')">
<td class="title_post" title="FOO" valign="center" align="left" colspan="2">
<font style="font-weight: 700" face="Tahoma" color="#FFFFFF" size="2">***FOO COMPANY***</font>
</td>
</tr>
</table>
</td>
</tr>
<table>
I'm using following code but nS is null.
doc = hw.Load("http://www.foo.aspx?page=" + j);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//table[@class='left_company']"))
{
nS = doc.DocumentNode.SelectNodes("//td[@class='title_post']");
}
var text = doc.DocumentNode.Descendants()
.FirstOrDefault(n => n.Attributes["class"] != null &&
n.Attributes["class"].Value == "title_post")
.Element("font").InnerText;
or
var text2 = doc.DocumentNode.SelectNodes("//td[@class='title_post']/font")
.First().InnerText;
The page you are calling likely generates the content of interest using JavaScript. HtmlAgilityPack does not execute JavaScript, so that content cannot be extracted. One way to confirm this is to visit the page with scripting turned off and check whether the element you are interested in still exists.
Add an attribute to the font element, such as company="FOO", then use jQuery to get that element:
alert($('font[company="FOO"]').html())
Close: nS = doc.DocumentNode.SelectNodes("//td[@class='title_post']//text()");
You can then read the text from the nodes in nS. If there's more than one text node, you'll need to iterate over them.
I'm trying to download a page that contains a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract the name_of_the_movie from td class="ttr_name" and the download link from td class="td_dl".
This is the code I used to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
HtmlNode nameNode = row.SelectSingleNode("td[0]");
HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to check nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
Regards
I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");
nameNode.Attributes["title"]
linkNode.Attributes["href"]
presuming you are getting the correct Nodes.
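On getting the correct nodes: XPath positions are 1-based, so `td[0]` never matches, and a path starting with `//` searches the whole document even when it is called on a row or table node. A sketch of the loop with relative, 1-based paths might look like the following. It assumes the HtmlAgilityPack NuGet package; `MovieRows` is a hypothetical name, and the `<a href>` wrapped around the download image is an assumption, since the markup in the question shows only the `<img>`.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumed dependency: the HtmlAgilityPack NuGet package

public static class MovieRows
{
    // Returns (name, href) pairs for every data row of the content table.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<KeyValuePair<string, string>>();

        // tr[td] skips the header row, which contains only <th> cells.
        var rows = doc.DocumentNode.SelectNodes("//table[@id='content-table']//tr[td]");
        if (rows == null) return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            // Relative paths (no leading slash) start at the current row; indexes start at 1.
            HtmlNode nameNode = row.SelectSingleNode("td[1]//b");
            HtmlNode linkNode = row.SelectSingleNode("td[2]//a"); // assumes the link is an <a>
            if (nameNode == null || linkNode == null) continue;

            result.Add(new KeyValuePair<string, string>(
                nameNode.InnerText.Trim(),
                linkNode.GetAttributeValue("href", string.Empty)));
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample; the <a href> in the second cell is not in the original markup.
        const string html = @"<table id='content-table'><tbody>
<tr><th id='name'>Name</th><th id='link'>link</th></tr>
<tr class='tt_row'>
<td class='ttr_name'><a title='name_of_the_movie' href='#'><b>name_of_the_movie</b></a><br><span class='pre'>message</span></td>
<td class='td_dl'><a href='/download/1'><img alt='Download' src='#'></a></td>
</tr></tbody></table>";
        foreach (var pair in Extract(html))
            Console.WriteLine(pair.Key + " -> " + pair.Value);
    }
}
```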
public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract all href URLs. I'm using that regex in one of my projects; you can modify it to match your needs and rewrite it to match the title as well. I guess it is more convenient to match them in bulk.