How to select text from that html code?

How to select text from that html code? - c#

I have some html code
<tfoot>
<tr>
<th class="first"> </th>
<td class="first-col">14</td>
<td class="">15</td>
<td class="">16</td>
<td class="">17</td>
<td class="">18</td>
<td class="">19</td>
<td class="">20</td>
<td class="last-col">21</td>
</tr>
</tfoot>
and i need to select text from first <td> (14). I use
HtmlAgilityPack and code like that:
_footer = _htmlDocument.DocumentNode.SelectSingleNode("//tfoot/tr/th[#class='first']/td[1]");
return _footer.First().InnerText;
It return me nothing. What I'm doing wrong?

td is not child element of th. They are at same level. You should select td as direct child of tr. And you can specify it's class first-col instead of using index:
//tfoot/tr/td[#class='first-col']
Also don't use First() thus you are selecting single node:
return _footer.InnerText;
NOTE: As Jon pointed, you can still use your code to select cell by index instead of using it's class:
//tfoot/tr/td[0]

Related

Select Node with size of Its specific child Node In Linq, HtmlAgilityPack

I'm trying to get following data.
<html>
<body>
<tr class="udline">
<th rowspan="2" class="noln">시간</th>
<th rowspan="2">개인</th>
<th rowspan="2">외국인</th>
<th rowspan="2">기관계</th>
<th colspan="6" class="eb">기관</th>
<th rowspan="2">기타법인</th>
</tr>
<tr class="udline">
<th class="sub">금융투자</th>
<th class="sub">보험</th>
<th class="sub">투신<br>(사모)</th>
<th class="sub">은행</th>
<th class="sub">기타금융기관</th>
<th class="sub">연기금등</th>
</tr>
<tr>
<td colspan="11" class="blank_07"></td>
</tr>
<!-- following are data -->
<tr>
<td class="date2">18:01</td>
<td class="rate_up3">2,024</td>
<td class="rate_down3">-3,307</td>
<td class="rate_up3">1,116</td>
<td class="rate_up3">824</td>
<td class="rate_down3">-16</td>
<td class="rate_up3">764</td>
<td class="rate_down3">-43</td>
<td class="rate_down3">-5</td>
<td class="rate_down3">-408</td>
<td class="rate_up3">166</td>
</tr>
<tr>
<td class="date2">18:00</td>
<td class="rate_up3">2,022</td>
<td class="rate_down3">-3,305</td>
<td class="rate_up3">1,116</td>
<td class="rate_up3">824</td>
<td class="rate_down3">-16</td>
<td class="rate_up3">764</td>
<td class="rate_down3">-43</td>
<td class="rate_down3">-5</td>
<td class="rate_down3">-408</td>
<td class="rate_up3">166</td>
</tr>
...
</body></html>
I want to get Nodes list of "tr" tag which has a data. but I have problem with getting "tr" tag.
I think it is enough if I can get sets of "tr" which has 11 td tags.
so I write following source.
result = await httpClient.GetStringAsync(new Uri(timeUrlAddress));
htmlDoc.LoadHtml(result);
var nodes =
htmlDoc.DocumentNode.SelectNodes("//tr")
.Where(i => i.ChildNodes.Any(j => j.Name.Equals("td")).Count>10); // <--- I have Problem.
foreach(var i in nodes) { ... } // <-- iterating list of <tr> tags.
and It doesn't work.
I could get List of tr tag with DoucmentNode.SelectNodes("//tr") ... and I appended .Where(i=>i.ChildNodes.Count >10 ) to get what i want.
but tr has several "text"childNodes and I get Unwanted Node. following picture shows that I got with .Where(i=>i.ChildNodes.Count>10).
I want to get tr node that has td tag as child nodes and has exactly 11 of td tag.
how can I get that tr nodes with Linq syntax..?

If you want tr node with exactrly 11 td children you can use below XPath:
//tr[count(td) = 11]

HtmlAgilityPack SelectNodes Syntax

I have the following HTML:
<tbody>
<tr>
<td class="metadata_name">Headquarters</td>
<td class="metadata_content">Princeton New Jersey, United States</td>
</tr>
<tr>
<td class="metadata_name">Industry</td>
<td class="metadata_content"><ul><li>Engineering Software</li><li>Software Development & Design</li><li>Software</li><li>Custom Software & Technical Consulting</li></ul></td>
</tr>
<tr>
<td class="metadata_name">Revenue</td>
<td class="metadata_content">$17.5 Million</td>
</tr>
<tr>
<td class="metadata_name">Employees</td>
<td class="metadata_content">201 to 500</td>
</tr>
<tr>
<td class="metadata_name">Links</td>
<td class="metadata_content"><ul><li>Company website</li></ul></td>
</tr>
</tbody>
I want to be able to load the metadata_content value (ex "$17.5 Million") in to a var where the metadata_name is = to a value (ex: "Revenue").
I have tried to use combinations of code like this for a few hours...
orgHtml.DocumentNode.SelectNodes("//td[#class='metadata_name']")[0].InnerHtml;
But I'm not getting the right combination down. If you have a helpful SelectNodes syntax - that will get me the solution I would appreciate it.

It seems what you're looking for is this:
var found = orgHtml.DocumentNode.SelectSingleNode(
"//tr[td[#class = 'metadata_name'] = 'Revenue']/td[#class = 'metadata_content']");
if (found != null)
{
string html = found.InnerHtml;
// use html
}
Note that to get the text of an element, you should use found.InnerText, not found.InnerHtml, unless you specifically need its HTML content.

Why my code is selecting all text() nodes in Htmldocument

HtmlNode node = doc.DocumentNode.SelectNodes("//tr")[0];
foreach(HtmlTextNode n in node.SelectNodes("//text()"))
Console.WriteLine(n.Text);
HTML:
<table class="infobox" style="width: 17em; font-size: 100%;float: left;">
<tr>
<th style="text-align: center; background: #f08080;" colspan="3">خدیجہ مستور</th>
</tr>
<tr style="text-align: center;">
<td colspan="3"><img alt="خدیجہ مستور" src="//upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/150px-Khatijamastoor.JPG" width="150" height="203" srcset="//upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/225px-Khatijamastoor.JPG 1.5x, //upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/300px-Khatijamastoor.JPG 2x"><br>
<div style="font-size: 90%">خدیجہ مستور</div>
</td>
</tr>
<tr>
<th style="background: #f08080;" colspan="3">ادیب</th>
</tr>
<tr>
<td><b>ولادت</b></td>
<td colspan="2">1930ء، لکھنؤ، برطانوی ہندوستان</td>
</tr>
<tr>
<td><b>اصناف ادب</b></td>
<td colspan="2">ناول</td>
</tr>
<tr>
<td><b>معروف تصانیف</b></td>
<td colspan="2">آنگن</td>
</tr>
</table>
Output Should be :
خدیجہ مستور
but i found :
خدیجہ مستور
خدیجہ مستور
ادیب
ولادت
1930ء
،
لکھنؤ
،
برطانوی ہندوستان
اصناف ادب
ناول
معروف تصانیف
آنگن
Why node.selectNodes("//text()") is selecting all text() nodes in document rather text() nodes from just first tr tag??

Because you are adding two forward slashes to the beginning of your XPath (//tr), which selects all of the elements in the document, not just descendants of the selected node.
Try this instead:
foreach (HtmlTextNode n in node.SelectNodes("text()"))
Or just simplify the XPath to:
var node = doc.DocumentNode.SelectSingleNode("//tr[1]/text()");
Console.WriteLine(node.Text);

Retrieve element within html hierarchy

I have this piece of html code. I want to get the text inside the <div> tag using WatiN. The C# code is below, but I'm pretty sure it could be done way better than my solution. Any suggestions?
HTML:
<table id="someId" cellspacing="0" border="1" style="border-collapse:collapse;" rules="all">
<tbody>
<tr>
<th scope="col"> </th>
</tr>
<tr>
<td>
<div>Some text</div>
</td>
</tr>
</tbody>
</table>
C#
// Get the table ElementContainer
IElementContainer diagnosisElementContainer = (IElementContainer)_control.GetElementById("someId");
// Get the tbody element
IElementContainer tbodyElementContainer = (IElementContainer)diagnosisElementContainer.ChildrenWithTag("tbody");
// Get the <tr> children
ElementCollection trElementContainer = tbodyElementContainer.ChildrenWithTag("tr");
// Get the <td> child of the last <tr>
IElementContainer tdElementContainer = (IElementContainer)trElementContainer.ElementAt<Element>(trElementContainer.Count - 1);
// Get the <div> element inside the <td>
Element divElement = tdElementContainer.Divs[0];

Based on the given, something like this is how I'd go for IE.
IE myIE = new IE();
myIE.GoTo("[theurl]");
string theText = myIE.Table("someId").Divs[0].Text;
The above is working on WatiN 2.1, Win7, IE9.

Html Agility Pack - loop through rows and columns

How can I loop through table and row that have an attribute id or name to get inner text in deep down in each td cell? I work on asp.net, c#, and the newest html agility package. Please guide. Thank you.
An html file have several tables. One of them has an attribute id=main-part. In that identified table, there are many rows. Some of those rows have same attribute name=display. In those named rows, there are many columns which I have to extract text from. Something like this:
<body>
<table>
...
</table>
<table>
...
</table>
<table id="main-part">
<tr>
<td></td>
...
</tr>
<tr>
<td></td>
...
</tr>
<tr name="display">
<td>Jan</td>
<td>Feb</td>
<td>Mar</td>
...
</tr>
<tr name="display">
<td>Apr</td>
<td>May</td>
<td>June</td>
...
</tr>
<tr name="display">
<td>Jul</td>
<td>Aug</td>
<td>Sep</td>
...
</tr>
<tr>
<td></td>
...
</tr>
<tr name="display">
<td>Oct</td>
<td>Nov</td>
<td>Dec</td>
...
</tr>
<tr>
<td></td>
...
</tr>
</table>
<table>
...
</table>
</body>

You need to select these nodes using xpath:
foreach(HtmlNode cell in doc.DocumentElement.SelectNodes("//tr[#name='display']/td")
{
// get cell data
}

It worked! Thank you very much Oded.
HtmlDocument doc = new HtmlDocument();
doc.Load(#"C:/samplefolder/sample.htm");
foreach(HtmlNode cell in doc.DocumentNode.SelectNodes("//tr[#name='display']/td"))
{
string test = cell.InnerText;
Response.Write(test);
}
It showed result like JanFebMarAprMayJuneJulAugSepOctNovDec. How can I sort them out, separate by a space or a tab? Thank you.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to select text from that html code? - c#

Related

Select Node with size of Its specific child Node In Linq, HtmlAgilityPack

HtmlAgilityPack SelectNodes Syntax

Why my code is selecting all text() nodes in Htmldocument

Retrieve element within html hierarchy

Html Agility Pack - loop through rows and columns

Categories

Resources