How to find last column of a table using Html Agility Pack - c#

I have a table like this:
<table border="0" cellpadding="0" cellspacing="0" id="table2">
<tr>
<th>Name
</th>
<th>Age
</th>
</tr>
<tr>
<td>Mario
</td>
<td>Age: 78
</td>
</tr>
<tr>
<td>Jane
</td>
<td>Age: 67
</td>
</tr>
<tr>
<td>James
</td>
<td>Age: 92
</td>
</tr>
</table>
I want to get the last td from all rows using Html Agility Pack.
Here is my C# code so far:
await page.GoToAsync(NumOfSaleItems, new NavigationOptions
{
WaitUntil = new WaitUntilNavigation[] { WaitUntilNavigation.DOMContentLoaded }
});
var html4 = await page.GetContentAsync();
var htmlDoc4 = new HtmlDocument();
htmlDoc4.LoadHtml(html4);
var SelectTable = htmlDoc4.DocumentNode.SelectNodes("/html/body/div[2]/div/div/div/table[2]/tbody/tr/td[1]/div[3]/div[2]/div/table[2]/tbody/tr/td[4]");
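if (SelectTable == null) // SelectNodes returns null, not an empty collection, when nothing matches
{
continue;
}
else
{
foreach (HtmlNode row in SelectTable)
{
string value = row.InnerText;
var firstSpaceIndex = value.IndexOf(" ");
// keep the whole text when it contains no space; Substring would throw on -1
var firstString = firstSpaceIndex < 0 ? value : value.Substring(0, firstSpaceIndex);
LastSellingDates.Add(firstString);
}
}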
How can I get only the last column of the table?

I think the XPath you want is: //table[@id='table2']//tr/td[last()].
//table[@id='table2'] finds the table by ID anywhere in the document. This is preferable to a long brittle path from the root, since a table ID is less likely to change than the rest of the HTML structure.
//tr gets the descendant rows in the table. I'm using two slashes in case there might be an intervening <tbody> element in the actual HTML.
/td[last()] gets the last <td> in each row.
From there you just need to select the InnerText of each <td>.
var tds = htmlDoc.DocumentNode.SelectNodes("//table[@id='table2']//tr/td[last()]");
var values = tds?.Select(td => td.InnerText).ToList() ?? new List<string>();
Working demo here: https://dotnetfiddle.net/7I8yk1
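For reference, here is a fuller sketch of the same approach. It assumes the HtmlAgilityPack NuGet package and C# 9 top-level statements. The helper names LastColumnValues and FirstToken and the inline sample HTML are made up for illustration. It also guards the two places where the question's code can throw: SelectNodes returns null when nothing matches, and IndexOf returns -1 when the text has no space.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper: text of the last <td> in every row of the table with the given id.
static List<string> LastColumnValues(string html, string tableId)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var cells = doc.DocumentNode.SelectNodes($"//table[@id='{tableId}']//tr/td[last()]");
    if (cells == null) return new List<string>();
    return cells.Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim()).ToList();
}

// Hypothetical helper: text up to the first space, or the whole string if there is none.
static string FirstToken(string value)
{
    int space = value.IndexOf(' ');
    return space < 0 ? value : value.Substring(0, space);
}

// Sample input modelled on the table in the question.
string html = @"<table id='table2'>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Mario</td><td>Age: 78</td></tr>
<tr><td>Jane</td><td>Age: 67</td></tr>
</table>";

// The header row has no <td>, so it contributes nothing.
foreach (string value in LastColumnValues(html, "table2"))
{
    Console.WriteLine(value);
}
```

Because the header row contains only <th> cells, it is skipped automatically. If the real page uses <th> in data rows, the XPath would need adjusting (for example, *[self::td or self::th][last()]).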

Related

c# - html "nested table" in class

WebClient client = new WebClient();
var data = client.DownloadString("a web link");
I am downloading an HTML page that contains a table like this:
<table>
<tr>
<td> Team 1 ID </td>
<td> Team 1 Name </td>
<td>
<table>
<tr>
<td> Member 1 name </td>
<td> Member 1 age </td>
</tr>
<tr>
<td> Member 2 name </td>
<td> Member 2 age </td>
</tr>
</table>
</td>
</tr>
<tr>
<td> Team 2 ID </td>
<td> Team 2 Name </td>
<td>
<table>
<tr>
<td> Member 1 name </td>
<td> Member 1 age </td>
</tr>
</table>
</td>
</tr>
</table>
That is, each row of the main table contains another table, which is why I call it a nested table.
Now I want to load this data into a class like this:
class Team
{
public int teamID;
public string teamName;
public struct Member
{
public string memberName;
public int memberAge;
}
public Member member1;
public Member member2;
}
Note that each team might have 0 to 3 members,
so I am looking for a sound solution to this problem.
Should I use RegEx or HtmlAgilityPack, or is another way more appropriate, and how?
Thanks in advance.
Just use HtmlAgilityPack. If you run into any troubles, I can help you.
Regular expressions can only match regular languages but HTML is a
context-free language. The only thing you can do with regexps on HTML
is heuristics but that will not work on every condition. It should be
possible to present a HTML file that will be matched wrongly by any
regular expression.
Using regular expressions to parse HTML: why not?
It will be easier if your HTML contains any identifiers (CSS classes or ids).
Updated code: here is my suggested approach to your problem:
string mainURL = "your url";
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(mainURL);
// Requires: using System.Linq;
// All tables that contain another table, i.e. the outer (team) tables.
var tables = doc.DocumentNode.Descendants("table").Where(t => t.Descendants("table").Any());
foreach (var table in tables)
{
    // Direct tr children only (not the rows of nested tables).
    var rows = table.ChildNodes.Where(n => n.Name.Equals("tr"));
    foreach (var row in rows)
    {
        // Direct td children; whitespace text nodes between cells are skipped.
        var cells = row.ChildNodes.Where(n => n.Name.Equals("td")).ToList();
        for (int i = 0; i < cells.Count; i++)
        {
            // A nested table is a child of a td, never a direct child of the tr.
            var nested = cells[i].Element("table");
            if (nested != null)
            {
                // here is your logic to handle the nested table
            }
            else
            {
                // put your logic here, e.g. when i == 0 assign the text to teamID, etc.
            }
        }
    }
}
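To make the idea concrete, here is a minimal sketch that turns the nested tables into a list of teams. It assumes the HtmlAgilityPack NuGet package and C# 9 top-level statements. The helper name ParseTeams, the tuple shape, and the sample data are all made up for illustration. Values are kept as strings because the sample cells in the question are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper: one (Id, Name, Members) tuple per row of the outer table.
static List<(string Id, string Name, List<(string Name, string Age)> Members)> ParseTeams(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var teams = new List<(string Id, string Name, List<(string Name, string Age)> Members)>();
    // Outer table = first table that is not nested inside another table.
    var outer = doc.DocumentNode.SelectSingleNode("//table[not(ancestor::table)]");
    if (outer == null) return teams;
    foreach (var row in outer.Elements("tr"))
    {
        var cells = row.Elements("td").ToList();
        if (cells.Count < 2) continue;
        var members = new List<(string Name, string Age)>();
        // The nested table, if present, sits inside the third cell.
        var nested = cells.Count > 2 ? cells[2].Element("table") : null;
        if (nested != null)
        {
            foreach (var memberRow in nested.Elements("tr"))
            {
                var m = memberRow.Elements("td").ToList();
                if (m.Count >= 2) members.Add((m[0].InnerText.Trim(), m[1].InnerText.Trim()));
            }
        }
        teams.Add((cells[0].InnerText.Trim(), cells[1].InnerText.Trim(), members));
    }
    return teams;
}

// Sample input: one team with two members, one team with none.
string html = @"<table>
<tr><td>1</td><td>Reds</td><td><table>
  <tr><td>Ann</td><td>30</td></tr>
  <tr><td>Bob</td><td>41</td></tr>
</table></td></tr>
<tr><td>2</td><td>Blues</td><td></td></tr>
</table>";

foreach (var team in ParseTeams(html))
{
    Console.WriteLine($"{team.Id} {team.Name}: {team.Members.Count} member(s)");
}
```

A list of members fits the "0 to 3 members" requirement better than fixed member1/member2 fields. To get numeric ids and ages, int.TryParse can be applied to the strings afterwards.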

Get HTML values from web response

I am trying to parse an HTML response for a couple of values and then insert them into SQL. I am able to get both values but, because the code is wrapped in a foreach statement, I get them twice.
Here is my HTML response
<div align="CENTER" class='dataTitle'>Host State Breakdowns:</div>
<p align='center'>
<a href='trends.cgi?host=hostname&includesoftstates=no&assumeinitialstates=yes&initialassumedhoststate=0&backtrack=4'><img src='trends.cgi?createimage&host=hostname&includesoftstates=no&initialassumedhoststate=0&backtrack=4' border="1" alt='Host State Trends' title='Host State Trends' width='500' height='20'></a><br>
</p>
<div align="CENTER">
<table border="0" class='data'>
<tr><th class='data'>State</th><th class='data'>Type / Reason</th><th class='data'>Time</th><th class='data'>% Total Time</th><th class='data'>% Known Time</th></tr>
<tr class='dataEven'><td class='hostUP' rowspan="3">UP</td><td class='dataEven'>Unscheduled</td><td class='dataEven'>0d 10h 5m 19s</td><td class='dataEven'>100.000%</td><td class='dataEven'>100.000%</td></tr>
<tr class='dataEven'><td class='dataEven'>Scheduled</td><td class='dataEven'>0d 0h 0m 0s</td><td class='dataEven'>0.000%</td><td class='dataEven'>0.000%</td></tr>
<tr class='hostUNREACHABLE'><td class='hostUNREACHABLE'>Total</td><td class='hostUNREACHABLE'>0d 0h 0m 0s</td><td class='hostUNREACHABLE'>0.000%</td><td class='hostUNREACHABLE'>0.000%</td></tr>
<tr class='dataOdd'><td class='dataOdd' rowspan="3">Undetermined</td><td class='dataOdd'>Nagios Not Running</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr class='dataOdd'><td class='dataOdd'>Insufficient Data</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr class='dataOdd'><td class='dataOdd'>Total</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr><td colspan="3"></td></tr>
<tr class='dataEven'><td class='dataEven'>All</td><td class='dataEven'>Total</td><td class='dataEven'>0d 10h 5m 19s</td><td class='dataEven'>100.000%</td><td class='dataEven'>100.000%</td></tr>
</table>
</div>
<br><br>
<div align="CENTER" class='dataTitle'>State Breakdowns For Host Services:</div>
<div align="CENTER">
<table border="0" class='data'>
<tr><th class='data'>Service</th><th class='data'>% Time OK</th><th class='data'>% Time Warning</th><th class='data'>% Time Unknown</th><th class='data'>% Time Critical</th><th class='data'>% Time Undetermined</th></tr>
<tr class='dataOdd'><td class='dataOdd'><a href='avail.cgi?host=hostname&service=servicename&t1=1478498400&t2=1478534719&backtrack=4&assumestateretention=yes&assumeinitialstates=yes&assumestatesduringnotrunning=yes&initialassumedhoststate=0&initialassumedservicestate=0&show_log_entries&showscheduleddowntime=yes&rpttimeperiod=24x7'>servicename</a></td><td class='serviceOK'>100.000% (100.000%)</td><td class='serviceWARNING'>0.000% (0.000%)</td><td class='serviceUNKNOWN'>0.000% (0.000%)</td><td class='serviceCRITICAL'>0.000% (0.000%)</td><td class='dataOdd'>0.000%</td></tr>
<tr class='dataEven'><td class='dataEven'><a href='avail.cgi?host=hostname&service=servicename2&t1=1478498400&t2=1478534719&backtrack=4&assumestateretention=yes&assumeinitialstates=yes&assumestatesduringnotrunning=yes&initialassumedhoststate=0&initialassumedservicestate=0&show_log_entries&showscheduleddowntime=yes&rpttimeperiod=24x7'>servicename2</a></td><td class='serviceOK'>100.000% (100.000%)</td><td class='serviceWARNING'>0.000% (0.000%)</td><td class='serviceUNKNOWN'>0.000% (0.000%)</td><td class='serviceCRITICAL'>0.000% (0.000%)</td><td class='dataEven'>0.000%</td></tr>
</table>
</div>
Here is my code:
var response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
HtmlDocument doc = new HtmlDocument();
doc.Load(stream);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//table[@class]"))
{
foreach (HtmlNode node2 in node.SelectNodes("//td[@class = 'serviceOK']"))
{
var value = node2.InnerText;
}
foreach (HtmlNode node3 in node.SelectNodes("//a[contains(@href, 'avail.cgi')]"))
{
var name = node3.InnerText;
}
}
name shows the servicename and value shows the class serviceOK but it repeats itself again because of the first foreach.
My results look like this:
100.000% (100.000%)
100.000% (100.000%)
servicename
servicename2
100.000% (100.000%)
100.000% (100.000%)
servicename
servicename2
Is there a way to, first, match the values up and, second, show them only once?
Your first foreach loops over every table, and both inner foreach loops search the entire document rather than just the current table: an XPath expression that starts with // is evaluated from the document root, no matter which node you call SelectNodes on.
Because there are 2 table elements matching your XPath expression
"//table[@class]"
you are getting your answer twice. If you had more table elements matching your XPath expression, say 7 for example, you would get the result 7 times.
What you want is to find all table cells (td) with class "serviceOK" that are within a table row (tr) within a table.
Once you have this HtmlNode you can just go to its previous sibling, which contains the service name.
var response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
HtmlDocument doc = new HtmlDocument();
doc.Load(stream);
foreach (HtmlNode serviceOkNode in doc.DocumentNode.SelectNodes("//table[@class]/tr/td[@class = 'serviceOK']"))
{
HtmlNode serviceNameNode = serviceOkNode.PreviousSibling;
var value = serviceOkNode.InnerText;
var name = serviceNameNode.InnerText;
}
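An alternative sketch that pairs each service name with its value by working row by row, using XPath relative to the row instead of PreviousSibling. It assumes the HtmlAgilityPack NuGet package and C# 9 top-level statements. The helper name ServiceOkValues and the trimmed-down sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper: (service name, "% Time OK" text) for every row with a serviceOK cell.
static List<(string Name, string Ok)> ServiceOkValues(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var result = new List<(string Name, string Ok)>();
    // One pass over the rows that actually contain a serviceOK cell.
    var rows = doc.DocumentNode.SelectNodes("//tr[td[@class='serviceOK']]");
    if (rows == null) return result;
    foreach (var row in rows)
    {
        // No leading slash: these paths are relative to `row`.
        // A path starting with "//" would restart at the document root.
        var link = row.SelectSingleNode("td/a[contains(@href, 'avail.cgi')]");
        var ok = row.SelectSingleNode("td[@class='serviceOK']");
        if (link != null && ok != null)
        {
            result.Add((link.InnerText.Trim(), ok.InnerText.Trim()));
        }
    }
    return result;
}

// Sample input cut down from the response in the question.
string html = @"<table class='data'>
<tr><th class='data'>Service</th><th class='data'>% Time OK</th></tr>
<tr><td><a href='avail.cgi?host=hostname&service=servicename'>servicename</a></td><td class='serviceOK'>100.000% (100.000%)</td></tr>
<tr><td><a href='avail.cgi?host=hostname&service=servicename2'>servicename2</a></td><td class='serviceOK'>100.000% (100.000%)</td></tr>
</table>";

foreach (var (name, ok) in ServiceOkValues(html))
{
    Console.WriteLine($"{name}: {ok}");
}
```

Selecting the row first keeps each name and value together, and it does not depend on the two cells being immediate siblings with no whitespace between them.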

How to get next 2 nodes in HTML + HTMLAgilitypack

I have a table in the HTML code below:
<table style="padding: 0px; border-collapse: collapse;">
<tr>
<td><h3>My Regional Financial Office</h3></td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td><h3>My Address</h3></td>
</tr>
<tr>
<td>000 Test Ave S Ste 000</td>
</tr>
<tr>
<td>Golden Valley, MN 00000</td>
</tr>
<tr>
<td>Get Directions</td>
</tr>
<tr>
<td> </td>
</tr>
</table>
How can I get the inner text of the next 2 <tr> tags after the table row containing the text "My Address"?
You can use the following XPath:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var tdOfInterests =
htmlDoc.DocumentNode
.SelectNodes("//tr[td/h3[.='My Address']]/following-sibling::tr[position() <= 2]/td");
foreach (HtmlNode td in tdOfInterests)
{
//given html input in question following code will print following 2 lines:
//000 Test Ave S Ste 000
//Golden Valley, MN 00000
Console.WriteLine(td.InnerText);
}
The key to the XPath above is following-sibling combined with a position() filter.
UPDATE:
A brief explanation of the XPath used in this answer:
//tr[td/h3[.='My Address']]
This part selects the <tr> element that has a child <td> element which in turn has a child <h3> element whose value equals 'My Address'.
/following-sibling::tr[position() <= 2]
The next part selects the following sibling <tr> elements at position <= 2, counted from the current <tr> element (the one selected by the previous part).
/td
The last part selects the child <td> element of each of those <tr> elements.

get a specific row from the html with a specific word using regular expression

I want to fetch all rows that contain a specific word/string and store them in an array.
I have a string like the one below:
<tr>
<td>Total</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>ABC</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>XYZ</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>Total</td>
<td>7676</td>
<td>8767</td>
</tr>
I want to fetch each row containing the string Total and store its values in an array.
So the output should be:
<tr>
<td>Total</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>Total</td>
<td>7676</td>
<td>8767</td>
</tr>
What regular expression would fetch the rows containing the string "Total"?
To build arrays for each table row that has a cell with the word "Total", you could use this regex:
(?<=<tr>\s*<td>Total</td>)(\s*<td>\d+</td>)+(?=\s*</tr>)
Which would give you the following 2 matches:
<td>123</td>
<td>567</td>
and
<td>7676</td>
<td>8767</td>
On these matches you could then split with this regex to get arrays in return:
\D+
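A minimal sketch of how the two regexes fit together, assuming .NET (C# 9 top-level statements). The helper name TotalRows is made up for illustration. Note that the first pattern relies on a variable-length lookbehind, which the .NET regex engine supports but many other engines do not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: one string[] of numbers per row whose first cell is "Total".
static List<string[]> TotalRows(string html)
{
    // The lookbehind anchors the match right after the "Total" cell;
    // the lookahead requires the row to end after the numeric cells.
    var rowPattern = new Regex(@"(?<=<tr>\s*<td>Total</td>)(\s*<td>\d+</td>)+(?=\s*</tr>)");
    return rowPattern.Matches(html)
        .Cast<Match>()
        // Splitting on non-digits leaves empty strings at both ends, so drop them.
        .Select(m => Regex.Split(m.Value, @"\D+").Where(s => s.Length > 0).ToArray())
        .ToList();
}

// Sample input taken from the question.
string html = @"<tr>
<td>Total</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>ABC</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>Total</td>
<td>7676</td>
<td>8767</td>
</tr>";

foreach (string[] row in TotalRows(html))
{
    Console.WriteLine(string.Join(", ", row));
}
// prints:
// 123, 567
// 7676, 8767
```

This only works while the markup stays exactly in this shape (no attributes, no nested tags); for anything less regular an HTML parser is the safer choice.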
In jQuery, a solution would be:
var tbl = $('#tblId');
var array = [];
$('tr td', tbl).each(function () {
if (this.innerHTML == 'Total')
{
// collect every cell of the row that contains "Total"
$('td', this.parentNode).each(function () {
array.push(this);
});
}
});
alert(array);
http://jsfiddle.net/gXGj6/13/
Good solution using jQuery:
http://jsfiddle.net/robfarmer/KaGBL/2/
HTML:
Source
<table id="source">
<tr>
<td>Total</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>ABC</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>XYZ</td>
<td>123</td>
<td>567</td>
</tr>
<tr>
<td>Total</td>
<td>7676</td>
<td>8767</td>
</tr>
</table>
Results
<table id="results"></table>
Array Results:
<ul id="arrayResults"></ul>
JavaScript
$(document).ready(function() {
$("#source tr td:contains('Total')").closest("tr").clone().appendTo("#results");
var cells = [];
$("#source tr td:contains('Total')").closest("tr")
.children("td").not(":contains('Total')").each(function(index, element) {
cells.push($(element).text());
});
$(cells).each(function(index, element) {
$("#arrayResults").append($("<li>").text(element));
});
});

How to get a link's title and href value separately with html agility pack?

I'm trying to download a page that contains a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract name_of_the_movie from the td with class="ttr_name" and the download link from the td with class="td_dl".
This is the code I use to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
HtmlNode nameNode = row.SelectSingleNode("td[0]");
HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to check nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
Regards
I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");
nameNode.Attributes["title"]
linkNode.Attributes["href"]
presuming you are getting the correct Nodes.
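On getting the correct nodes: XPath positions start at 1, so td[1] is the first cell, and a path starting with // searches the whole document even when called on a row. A minimal sketch under those rules, assuming the HtmlAgilityPack NuGet package and C# 9 top-level statements. The helper name MovieLinks and the sample URLs are made up, and since the second cell in the question only shows an <img>, reading its src is an assumption about where the download link lives.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper: title and href of the link in the first cell,
// plus the src of the image in the second cell (assumed download target).
static List<(string Title, string Href, string ImageSrc)> MovieLinks(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var result = new List<(string Title, string Href, string ImageSrc)>();
    var rows = doc.DocumentNode.SelectNodes("//table[@id='content-table']//tr[@class='tt_row']");
    if (rows == null) return result;
    foreach (var row in rows)
    {
        // Relative paths (no leading slash); td[1] is the first cell, td[2] the second.
        var link = row.SelectSingleNode("td[1]/a");
        var img = row.SelectSingleNode("td[2]//img");
        if (link == null) continue;
        result.Add((
            link.GetAttributeValue("title", ""),
            link.GetAttributeValue("href", ""),
            img?.GetAttributeValue("src", "") ?? ""));
    }
    return result;
}

// Sample input modelled on the table in the question; the URLs are placeholders.
string html = @"<table id='content-table'><tbody>
<tr><th id='name'>Name</th><th id='link'>link</th></tr>
<tr class='tt_row'>
  <td class='ttr_name'><a title='name_of_the_movie' href='/details/1'><b>name_of_the_movie</b></a><br><span class='pre'>message</span></td>
  <td class='td_dl'><img alt='Download' src='/download/1'></td>
</tr>
</tbody></table>";

foreach (var (title, href, imageSrc) in MovieLinks(html))
{
    Console.WriteLine($"{title} | {href} | {imageSrc}");
}
```

GetAttributeValue returns the supplied default when the attribute is missing, which avoids the null checks that Attributes["title"] would need.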
public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract all href URLs. I'm using this regex in one of my projects; you can modify it to suit your needs and extend it to match the title as well. I find it more convenient to match them in bulk.
