Get specific table from html document with HtmlAgilityPack C#

I have an HTML document with two tables. For example:
<html>
<body>
<p>This is where first table starts</p>
<table>
<tr>
<th>head</th>
<th>head1</th>
</tr>
<tr>
<td>data</td>
<td>data1</td>
</tr>
</table>
<p>This is where second table starts</p>
<table>
<tr>
<th>head</th>
<th>head1</th>
</tr>
<tr>
<td>data</td>
<td>data1</td>
</tr>
</table>
</body>
</html>
I want to parse the first and the second table, but separately.
Here is what I have so far:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(@richTextBox1.Text);
if (comboBox_tables.Text.Equals("Table1"))
{
    DataTable dt = new DataTable();
    dt.Columns.Add("id", typeof(string));
    dt.Columns.Add("inserted_at", typeof(string));
    dt.Columns.Add("DisplayName", typeof(string));
    HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
    foreach (var row in doc.DocumentNode.SelectNodes("//tr"))
    {
        var nodes = row.SelectNodes("td");
        if (nodes != null)
        {
            var id = nodes[0].InnerText;
            var inserted_at = nodes[1].InnerText;
            var DisplayName = nodes[2].InnerText;
            dt.Rows.Add(id, inserted_at, DisplayName);
        }
    }
    dataGridView1.DataSource = dt;
}
I'm trying to select the first table with //table[1], but it always takes both tables. How can I select the first table for if (Table1) and the second for else if (Table2)?

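You are selecting table[1], but not doing anything with the return value.
Use the table variable to select its tr nodes. Note the leading dot: an XPath that starts with // searches the whole document even when it is called on table, so the relative form .//tr is needed to stay inside the selected table.
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var row in table.SelectNodes(".//tr"))
.. rest of the code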
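A fuller sketch of reading each table separately, assuming the HtmlAgilityPack package is referenced. The names TableReader and ReadTableRows are hypothetical, and the (//table)[n] indexing is one possible way to address the n-th table in document order, not something taken from the answer above. The rows are queried with .//tr because a path that begins with // is evaluated against the whole document even when it is called on a single node.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableReader
{
    // Returns the data rows (rows containing td cells) of the n-th table
    // in document order. tableIndex is 1-based, as in XPath.
    public static List<string[]> ReadTableRows(string html, int tableIndex)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<string[]>();

        // (//table)[n] picks the n-th table of the whole document.
        HtmlNode table = doc.DocumentNode.SelectSingleNode($"(//table)[{tableIndex}]");
        if (table == null) return result;

        // Relative path: only rows inside this table.
        HtmlNodeCollection rows = table.SelectNodes(".//tr");
        if (rows == null) return result; // SelectNodes can return null when nothing matches

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null) continue; // header rows contain only th cells
            result.Add(cells.Select(c => c.InnerText.Trim()).ToArray());
        }
        return result;
    }
}
```

With this helper, the if (Table1) branch would call ReadTableRows(html, 1) and the else if (Table2) branch ReadTableRows(html, 2), then copy the returned arrays into the DataTable.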

Related

How to find last column of a table using Html Agility Pack

I have a table like this:
<table border="0" cellpadding="0" cellspacing="0" id="table2">
<tr>
<th>Name
</th>
<th>Age
</th>
</tr>
<tr>
<td>Mario
</td>
<td>Age: 78
</td>
</tr>
<tr>
<td>Jane
</td>
<td>Age: 67
</td>
</tr>
<tr>
<td>James
</td>
<td>Age: 92
</td>
</tr>
</table>
I want to get the last td from all rows using Html Agility Pack.
Here is my C# code so far:
await page.GoToAsync(NumOfSaleItems, new NavigationOptions
{
    WaitUntil = new WaitUntilNavigation[] { WaitUntilNavigation.DOMContentLoaded }
});
var html4 = page.GetContentAsync().GetAwaiter().GetResult();
var htmlDoc4 = new HtmlDocument();
htmlDoc4.LoadHtml(html4);
var SelectTable = htmlDoc4.DocumentNode.SelectNodes("/html/body/div[2]/div/div/div/table[2]/tbody/tr/td[1]/div[3]/div[2]/div/table[2]/tbody/tr/td[4]");
if (SelectTable.Count == 0)
{
    continue;
}
else
{
    foreach (HtmlNode row in SelectTable)
    {
        string value = row.InnerText;
        value = value.ToString();
        var firstSpaceIndex = value.IndexOf(" ");
        var firstString = value.Substring(0, firstSpaceIndex);
        LastSellingDates.Add(firstString);
    }
}
How can I get only the last column of the table?
I think the XPath you want is: //table[@id='table2']//tr/td[last()].
//table[@id='table2'] finds the table by ID anywhere in the document. This is preferable to a long, brittle path from the root, since a table ID is less likely to change than the rest of the HTML structure.
//tr gets the descendant rows in the table. I'm using two slashes in case there is an intervening <tbody> element in the actual HTML.
/td[last()] gets the last <td> in each row.
From there you just need to select the InnerText of each <td>.
var tds = htmlDoc.DocumentNode.SelectNodes("//table[@id='table2']//tr/td[last()]");
var values = tds?.Select(td => td.InnerText).ToList() ?? new List<string>();
Working demo here: https://dotnetfiddle.net/7I8yk1

How to get specific rows from a document?

Suppose I have these rows inside a table:
<table class="lineup">
<tbody>
<tr><td class="bookings"></td></tr>
<tr><td class="bookingd><span></img></span></td></tr>
...
How can I get only the tr that contains the img tag?
I tried:
HtmlNodeCollection events = doc.DocumentNode
    .SelectNodes("//table[contains(@class, 'lineup')]" +
                 "//tbody//tr[contains(@td, 'img')]");
but this returns null.
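One possible approach, offered as a sketch rather than a confirmed answer: contains(@td, 'img') tests an attribute named td on the tr element, and no such attribute exists, so nothing matches. A predicate on descendant elements, tr[.//img], asks instead for rows that have an img somewhere beneath them. The class and method names below (LineupRows, RowsWithImage) are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LineupRows
{
    // Returns the rows of the "lineup" table that contain an img element
    // at any depth, or an empty list when there are none.
    public static List<HtmlNode> RowsWithImage(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // //tr also matches rows nested under tbody, so tbody need not be spelled out.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(
            "//table[contains(@class, 'lineup')]//tr[.//img]");

        // SelectNodes can return null when nothing matches.
        return rows?.ToList() ?? new List<HtmlNode>();
    }
}
```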

Get HTML values from web response

I am trying to parse an HTML response for a couple of values and then insert them into SQL. I am able to get both values but, because the code is wrapped in a foreach statement, I get them twice.
Here is my HTML response
<div align="CENTER" class='dataTitle'>Host State Breakdowns:</div>
<p align='center'>
<a href='trends.cgi?host=hostname&includesoftstates=no&assumeinitialstates=yes&initialassumedhoststate=0&backtrack=4'><img src='trends.cgi?createimage&host=hostname&includesoftstates=no&initialassumedhoststate=0&backtrack=4' border="1" alt='Host State Trends' title='Host State Trends' width='500' height='20'></a><br>
</p>
<div align="CENTER">
<table border="0" class='data'>
<tr><th class='data'>State</th><th class='data'>Type / Reason</th><th class='data'>Time</th><th class='data'>% Total Time</th><th class='data'>% Known Time</th></tr>
<tr class='dataEven'><td class='hostUP' rowspan="3">UP</td><td class='dataEven'>Unscheduled</td><td class='dataEven'>0d 10h 5m 19s</td><td class='dataEven'>100.000%</td><td class='dataEven'>100.000%</td></tr>
<tr class='dataEven'><td class='dataEven'>Scheduled</td><td class='dataEven'>0d 0h 0m 0s</td><td class='dataEven'>0.000%</td><td class='dataEven'>0.000%</td></tr>
<tr class='hostUNREACHABLE'><td class='hostUNREACHABLE'>Total</td><td class='hostUNREACHABLE'>0d 0h 0m 0s</td><td class='hostUNREACHABLE'>0.000%</td><td class='hostUNREACHABLE'>0.000%</td></tr>
<tr class='dataOdd'><td class='dataOdd' rowspan="3">Undetermined</td><td class='dataOdd'>Nagios Not Running</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr class='dataOdd'><td class='dataOdd'>Insufficient Data</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr class='dataOdd'><td class='dataOdd'>Total</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr><td colspan="3"></td></tr>
<tr class='dataEven'><td class='dataEven'>All</td><td class='dataEven'>Total</td><td class='dataEven'>0d 10h 5m 19s</td><td class='dataEven'>100.000%</td><td class='dataEven'>100.000%</td></tr>
</table>
</div>
<br><br>
<div align="CENTER" class='dataTitle'>State Breakdowns For Host Services:</div>
<div align="CENTER">
<table border="0" class='data'>
<tr><th class='data'>Service</th><th class='data'>% Time OK</th><th class='data'>% Time Warning</th><th class='data'>% Time Unknown</th><th class='data'>% Time Critical</th><th class='data'>% Time Undetermined</th></tr>
<tr class='dataOdd'><td class='dataOdd'><a href='avail.cgi?host=hostname&service=servicename&t1=1478498400&t2=1478534719&backtrack=4&assumestateretention=yes&assumeinitialstates=yes&assumestatesduringnotrunning=yes&initialassumedhoststate=0&initialassumedservicestate=0&show_log_entries&showscheduleddowntime=yes&rpttimeperiod=24x7'>servicename</a></td><td class='serviceOK'>100.000% (100.000%)</td><td class='serviceWARNING'>0.000% (0.000%)</td><td class='serviceUNKNOWN'>0.000% (0.000%)</td><td class='serviceCRITICAL'>0.000% (0.000%)</td><td class='dataOdd'>0.000%</td></tr>
<tr class='dataEven'><td class='dataEven'><a href='avail.cgi?host=hostname&service=servicename2&t1=1478498400&t2=1478534719&backtrack=4&assumestateretention=yes&assumeinitialstates=yes&assumestatesduringnotrunning=yes&initialassumedhoststate=0&initialassumedservicestate=0&show_log_entries&showscheduleddowntime=yes&rpttimeperiod=24x7'>servicename2</a></td><td class='serviceOK'>100.000% (100.000%)</td><td class='serviceWARNING'>0.000% (0.000%)</td><td class='serviceUNKNOWN'>0.000% (0.000%)</td><td class='serviceCRITICAL'>0.000% (0.000%)</td><td class='dataEven'>0.000%</td></tr>
</table>
</div>
Here is my code:
var response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
HtmlDocument doc = new HtmlDocument();
doc.Load(stream);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//table[@class]"))
{
    foreach (HtmlNode node2 in node.SelectNodes("//td[@class = 'serviceOK']"))
    {
        var value = node2.InnerText;
    }
    foreach (HtmlNode node3 in node.SelectNodes("//a[contains(@href, 'avail.cgi')]"))
    {
        var name = node3.InnerText;
    }
}
name shows the service name and value shows the text of the serviceOK cell, but everything repeats because of the first foreach.
My results look like this:
100.000% (100.000%)
100.000% (100.000%)
servicename
servicename2
100.000% (100.000%)
100.000% (100.000%)
servicename
servicename2
Is there a way to, first, match the values up and, second, have them show only once?
Your first foreach traverses the entire document, as do both of the foreach statements nested inside it: an XPath that starts with // always searches from the document root, even when it is called on node.
Because there are 2 table elements matching your XPath expression
"//table[@class]"
you are getting your answer twice. If more table elements matched your XPath expression, say 7, you would get the result 7 times.
What you want is to find all table cells (td) with class "serviceOK" that are within a table row (tr) within a table.
Once you have this HtmlNode you can go to the previous sibling, which contains the service name.
var response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
HtmlDocument doc = new HtmlDocument();
doc.Load(stream);
foreach (HtmlNode serviceOkNode in doc.DocumentNode.SelectNodes("//table[@class]/tr/td[@class = 'serviceOK']"))
{
    HtmlNode serviceNameNode = serviceOkNode.PreviousSibling;
    var value = serviceOkNode.InnerText;
    var name = serviceNameNode.InnerText;
}
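A variant sketch that pairs the two values row by row instead of relying on PreviousSibling, which returns whatever node precedes the cell and could be a whitespace text node if the markup were formatted differently. It assumes each service row holds one link to avail.cgi and one serviceOK cell, as in the sample response. The names ServiceStates and Read are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ServiceStates
{
    // Returns (service name, "% Time OK" text) pairs, one per service row.
    public static List<KeyValuePair<string, string>> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<KeyValuePair<string, string>>();

        // Every row that has a serviceOK cell; each row is visited once.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr[td[@class='serviceOK']]");
        if (rows == null) return result; // no matching rows

        foreach (HtmlNode row in rows)
        {
            // Paths without a leading slash are relative to the current row.
            HtmlNode link = row.SelectSingleNode("td/a[contains(@href, 'avail.cgi')]");
            HtmlNode ok = row.SelectSingleNode("td[@class='serviceOK']");
            if (link == null || ok == null) continue;

            result.Add(new KeyValuePair<string, string>(
                link.InnerText.Trim(), ok.InnerText.Trim()));
        }
        return result;
    }
}
```

Each returned pair can then be written to SQL as a single record, which keeps the service name and its value matched.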

HTMLAgilityPack Skip HTML table caption

I have an html table like below:
<table>
<caption>Table 2</caption>
<tr><td>hd1</td><td>hd2</td></tr>
<tr><td>val01</td><td>val02</td></tr>
<tr>
<td colspan="2">
<table>
<caption>Subtable 2</caption>
<tr><td>subval01</td><td>subval02</td></tr>
</table>
</td>
</tr>
</table>
EDIT
Here is my code:
foreach (HtmlNode rows in htmltable.SelectNodes("tr"))
{
    DataRow dr = dt.NewRow();
    int iRow = 0;
    if (!rows.InnerHtml.Contains("<caption>"))
    {
        foreach (HtmlNode cell in rows.SelectNodes("td"))
        {
            iRow++;
            dr[iRow] = cell.InnerText;
        }
    }
    dt.Rows.Add(dr);
}
My code treats <caption> as a row and selects it as well.
I don't see how to skip the caption while parsing so that I parse ONLY the rows. The Skip(1) method is not working for me.
If I understand this correctly, you want to skip the <tr> that has a descendant <caption> node (the last <tr> within the outer <table> tag). In this case we can use XPath to select only the <tr> elements that don't contain a <caption>, like so:
foreach (HtmlNode rows in htmltable.SelectNodes("tr[not(.//caption)]"))
{
    DataRow dr = dt.NewRow();
    .....
    .....
    dt.Rows.Add(dr);
}
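A self-contained sketch of the same filter, assuming the outer table is the first table in the document, as in the sample. The names CaptionFreeRows and Read are hypothetical, and the cell texts are collected into string arrays rather than a DataTable to keep the example short.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class CaptionFreeRows
{
    // Returns the cell texts of the outer table's direct rows,
    // skipping any row that contains a caption at any depth.
    public static List<string[]> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<string[]>();

        // First table in document order, i.e. the outer one in the sample.
        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");
        if (table == null) return result;

        // "tr" is relative to the table, so only its direct child rows are tested.
        HtmlNodeCollection rows = table.SelectNodes("tr[not(.//caption)]");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null) continue;
            result.Add(cells.Select(c => c.InnerText.Trim()).ToArray());
        }
        return result;
    }
}
```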

C# How I can retrieve this information?

On the HTML page I have something like this:
<table class="information">
<tbody>
<tr>
<td class="name">Name:</td>
<td>John</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
....
</tbody>
</table>
How can I retrieve the name (there is other information too, but my example shows only the name)?
Note: the HTML has more than one table.
I tried this:
foreach (HtmlElement item in wb.Document.GetElementsByTagName("table"))
{
    if (item.OuterHtml.Contains("information"))
    {
        ... //Here i don't know how to continue
    }
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var table = doc.DocumentNode.SelectSingleNode("//table[@class='information']");
// ".//" keeps the search inside the selected table; "//" would search the whole document
var td = table.SelectSingleNode(".//td[@class='name']");
Console.WriteLine(td.InnerText);
or
var text = doc.DocumentNode.Descendants("td")
    .First(td => td.Attributes["class"] != null && td.Attributes["class"].Value == "name")
    .InnerText;
HtmlElementCollection tData = wb.Document.GetElementsByTagName("td");
foreach (HtmlElement td in tData)
{
    string name = "";
    if (td.GetAttribute("classname") == "name")
    {
        name = td.InnerText;
    }
}
Check out HtmlAgilityPack; it is a free and quite good library for working with HTML sources.
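The snippets above locate the label cell (its text is "Name:"). If the goal is the value next to it ("John" in the sample), one option is the following-sibling axis. This is a sketch under the assumption that the value always sits in the td immediately after the labelled one; the names InfoTable and ValueFor are hypothetical.

```csharp
using HtmlAgilityPack;

public static class InfoTable
{
    // Returns the text of the td that follows the td carrying the given class
    // inside the "information" table, or null when no such cell is found.
    public static string ValueFor(string html, string labelClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // following-sibling::td[1] is the first td after the label cell in the same row.
        HtmlNode valueCell = doc.DocumentNode.SelectSingleNode(
            "//table[@class='information']//td[@class='" + labelClass + "']/following-sibling::td[1]");

        return valueCell?.InnerText.Trim();
    }
}
```

Calling InfoTable.ValueFor(html, "name") on the sample markup would give the text of the cell next to "Name:".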
