Scraping HTML table to rectangular array using LINQ - c#

I would like to scrape the column headers and rows of data for each column into a two-dimensional array. The data looks like the following:
<div id="content">
<!-- start left col--><div id="LeftCol-wss">
<h1>Aircraft Names</h1>
<h3>Names by Type</h3>
<table cellspacing="1" cellpadding="2" class="data">
<tr valign="top" bgcolor="#FFFFFF">
<td valign="top" width="25%">
<table width="100%" cellpadding="3" cellspacing="0" border="0" class="data">
<tr class="datatop">
<td width="100%">
Fighter</td>
</tr>
<tr>
<td align="top" class="datatop" width="100%">
<br/>
<a href="/page/mig-29.html" >MiG-29</a>
<br/>
<a href="/page/f-15.html" >F-15</a>
<br/>
<a href="/page/f-86.html" >F-86</a>
<br/>
<br>
</td>
</tr>
</table>
</td>
<td valign="top" width="25%">
<table width="100%" cellpadding="3" cellspacing="0" border="0" class="data">
<tr class="datahead">
<td width="100%">
Bomber</td>
</tr>
<tr>
<td align="top" class="datatop" width="100%">
<br/>
<a href="/page/b-52.html" >B-52</a>
<br/>
<a href="/page/b-1b.html" >B-1B</a>
<br/>
<br>
</td>
</tr>
</table>
</td>
</div>
The result I am looking for will be a two-dimensional array that looks like:
Fighter MiG-29
Fighter F-15
Fighter F-86
Bomber B-52
Bomber B-1B
I am using C# and would prefer to use LINQ, but at this point I'll take any suggestions.

If you want to parse HTML in C#, the canonical answer is to use the HTML Agility Pack.
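A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is installed and that the page follows the structure shown in the question: an outer table whose cells each hold an inner table, with the type name in the first row and the links in the second. The names AircraftScraper and ToRectangularArray are made up for illustration and are not part of any library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AircraftScraper
{
    // Returns one row per link: column 0 is the type header, column 1 is the link text.
    public static string[,] ToRectangularArray(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Inner tables are the ones nested inside another table.
        // SelectNodes returns null (not an empty list) when nothing matches.
        var innerTables = doc.DocumentNode.SelectNodes("//table//table");
        if (innerTables == null)
            return new string[0, 2];

        var pairs = (from table in innerTables
                     let header = table.SelectSingleNode(".//tr[1]/td")
                     where header != null
                     from link in table.Descendants("a")
                     select new
                     {
                         Type = HtmlEntity.DeEntitize(header.InnerText).Trim(),
                         Name = HtmlEntity.DeEntitize(link.InnerText).Trim()
                     }).ToList();

        // Copy the LINQ result into a true two-dimensional array.
        var result = new string[pairs.Count, 2];
        for (int i = 0; i < pairs.Count; i++)
        {
            result[i, 0] = pairs[i].Type;
            result[i, 1] = pairs[i].Name;
        }
        return result;
    }
}
```

Calling AircraftScraper.ToRectangularArray(html) and looping from 0 to GetLength(0) - 1 gives one (type, name) pair per row. If the real page nests tables differently, the "//table//table" expression is the part that would need adjusting.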

Related

Parsing HTML tables with different row numbers

I am trying to parse HTML tables that have different numbers of rows. All of the tables sit under a (form), which I select with SelectSingleNode, but under each (tbody) I only get the rows, not the (td) cells, so I can't loop over all of the (td) elements.
Part of the HTML code:
<form name="DetailsForm" method="post" action="">
<input type="hidden" name="helpPageId" value="WF03">
<input type="hidden" name="withMenu" value="1">
<table width="100%" cellspacing="0" border="0">
<tbody>
<tr valign="center">
<td class="blackHeadingLeft">Details</td>
</tr>
<tr></tr>
<tr>
<td></td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" border="0">
<tbody>
<tr>
<td class="whiteTd" height="21"> AWB:</td>
<td class="whiteTdNormal" nowrap="nowrap" height="21"> 7777995585 </td>
<td class="whiteTd" nowrap="nowrap" height="21"> No of Shipment Details:</td>
<td class="whiteTdNormal" nowrap="nowrap" height="21"> 1 </td>
<td class="whiteTdNormal" width="100%" height="21"> </td>
</tr>
</tbody>
</table>
<table class="bordered-table" width="100%" border="0">
<tbody>
<tr>
<td class="grayTd" width="5%" height="21"> Details</td>
<td class="grayTd" width="5%" height="21" align="center"> Orig</td>
<td class="grayTd" width="8%" height="21" align="center"> Location</td>
<td class="grayTd" width="7%" height="21"> Dest</td>
<td class="grayTd" width="5%" height="21" align="center"> Pcs</td>
<td class="grayTd" width="5%" height="21"> Weight(kg)</td>
<td class="grayTd" width="11%" height="21"> Volumetric Weight(kg)</td>
<td class="grayTd" width="9%" height="21"> Date/Time</td>
<td class="grayTd" width="8%" height="21"> Route/Cycle</td>
<td class="grayTd" width="8%" height="21"> Post Code</td>
<td class="grayTd" width="6%" height="21"> Product</td>
<td class="grayTd" width="9%" height="21"> Amount</td>
<td class="grayTd" width="9%" height="21"> Duplicate</td>
</tr>
Here is the way that I am able to do it:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null when nothing matches, so guard each call before iterating.
var tables = doc.DocumentNode.SelectNodes("//table");
if (tables != null)
{
    foreach (HtmlNode table in tables)
    {
        Console.WriteLine("Table: ");
        var bodies = table.SelectNodes("tbody");
        if (bodies == null) continue; // table without a tbody
        foreach (HtmlNode tbody in bodies)
        {
            var rows = tbody.SelectNodes("tr");
            if (rows == null) continue; // tbody without rows
            Console.WriteLine("TBody: ");
            foreach (HtmlNode row in rows)
            {
                Console.WriteLine("TR: ");
                var cells = row.SelectNodes("td");
                if (cells != null)
                {
                    foreach (HtmlNode cell in cells)
                    {
                        Console.WriteLine("TD: ");
                        Console.WriteLine(cell.InnerHtml);
                    }
                }
                Console.WriteLine();
            }
        }
    }
}
This way it doesn't matter how many tr or td tags there are. One thing to note is that you have to add validation if there is a case in which there are no tr or td tags in the tbody.
I hope this helps.
Edited to include validation for tr and td tags. A similar logic can be used for all other tags that might be missing.
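The same traversal can also be sketched with LINQ, which sidesteps the null checks because HtmlNode.Elements and HtmlNode.Descendants return empty sequences rather than null. This is only an illustration, assuming the HtmlAgilityPack NuGet package; the names TableReader and ReadTables are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableReader
{
    // One entry per table; each table is a list of rows; each row is an array of cell texts.
    public static List<List<string[]>> ReadTables(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("table")
            .Select(table => table
                // Rows may sit directly under <table> or inside a <tbody>.
                .Elements("tr")
                .Concat(table.Elements("tbody").SelectMany(body => body.Elements("tr")))
                // A row with no <td> children simply becomes an empty array.
                .Select(row => row.Elements("td")
                                  .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                                  .ToArray())
                .ToList())
            .ToList();
    }
}
```

Because every level is a sequence, tables with different numbers of rows or cells need no special handling; empty rows such as the bare tr in the question come back as zero-length arrays.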

Convert a HTML Table with rowspans to DataTable C#

I need to convert a Html Table to DataTable in C#. I used HtmlAgilityPack but it does not convert it well because of rowspans.
The code I am currently using is:
private static DataTable convertHtmlTableToDataTable()
{
WebClient webClient = new WebClient();
string urlContent = webClient.DownloadString("http://example.com");
string tableCode = getTableCode(urlContent);
string htmlCode = tableCode.Replace(" ", " ");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlCode);
var headers = doc.DocumentNode.SelectNodes("//tr/th");
DataTable table = new DataTable();
foreach (HtmlNode header in headers)
{
table.Columns.Add(header.InnerText);
}
foreach (var row in doc.DocumentNode.SelectNodes("//tr[td]"))
{
table.Rows.Add(row.SelectNodes("td").Select(td => td.InnerText).ToArray());
}
return table;
}
And this is a part of Html Table:
<table class="tabel" cellspacing="0" border="0">
<caption style="font-family:Verdana; font-size:20px;">SEMGRP</caption>
<tr>
<th class="celula" >Ora</th>
<th class="latime_celula celula">Luni</th>
<th class="latime_celula celula">Marti</th>
<th class="latime_celula celula">Miercuri</th>
<th class="latime_celula celula">Joi</th>
<th class="latime_celula celula">Vineri</th>
</tr>
<tr>
<td class="celula" nowrap="nowrap">8-9</td>
<td class="celula" rowspan="2">
<table border="0" align="center">
<tr>
<td nowrap="nowrap" align="center">
Curs
<br />
<a class="link_celula" href="afis_n0.php?id_tip=287&tip=p">Prof</a>
<br />
<a class="link_celula" href="afis_n0.php?id_tip=9&tip=s">Sala</a>
<br />
</td>
</tr>
</table>
</td>
<td class="celula" rowspan="2">
<table border="0" align="center">
<tr>
<td nowrap="nowrap" align="center">
Curs
<br />
<a class="link_celula" href="afis_n0.php?id_tip=287&tip=p">Prof</a>
<br />
<a class="link_celula" href="afis_n0.php?id_tip=12&tip=s">Sala</a>
<br />
</td>
</tr>
</table>
</td>
<td class="celula"> </td>
<td class="celula"> </td>
<td class="celula" rowspan="2">
<table border="0" align="center">
<tr>
<td nowrap="nowrap" align="center">
Curs
<br />
<a class="link_celula" href="afis_n0.php?id_tip=293&tip=p">Prof</a>
<br />
<a class="link_celula" href="afis_n0.php?id_tip=9&tip=s">Sala</a>
<br />
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td class="celula" nowrap="nowrap">9-10</td>
<td class="celula"> </td>
<td class="celula"> </td>
</tr>
<tr>
<td class="celula" nowrap="nowrap">10-11</td>
<td class="celula" rowspan="2">
<table border="0" align="center">
<tr>
<td nowrap="nowrap" align="center"> Curs
<br /><a class="link_celula" href="afis_n0.php?id_tip=303&tip=p">Prof</a>
<br /><a class="link_celula" href="afis_n0.php?id_tip=9&tip=s">Sala</a>
<br />
</td>
</tr>
</table>
</td>
<td class="celula" rowspan="2">
<table border="0" align="center">
<tr>
<td nowrap="nowrap" align="center"> Curs
<br />
<a class="link_celula" href="afis_n0.php?id_tip=331&tip=p">Prof</a>
<br />
<a class="link_celula" href="afis_n0.php?id_tip=14&tip=s">Sala</a>
<br />
</td>
</tr>
</table>
</td>
<td class="celula" rowspan="2">
<table border="0" align="center">
<tr>
<td nowrap="nowrap" align="center"> Curs
<br /><a class="link_celula" href="afis_n0.php?id_tip=330&tip=p">Prof</a>
<br /><a class="link_celula" href="afis_n0.php?id_tip=9&tip=s">Sala</a>
<br />
</td>
</tr>
</table>
</td>
<td class="celula"> </td>
<td class="celula" rowspan="2">
<table border="0" align="center">
<tr>
<td nowrap="nowrap" align="center"> Curs
<br />
<a class="link_celula" href="afis_n0.php?id_tip=293&tip=p">Prof</a>
<br />
<a class="link_celula" href="afis_n0.php?id_tip=10&tip=s">Sala</a> <br />
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td class="celula" nowrap="nowrap">11-12</td>
<td class="celula"> </td>
</tr>
<tr>
I tried some solutions but I did not find any good...
Thanks for any help in advance.

Data binding in asp.net : listview

I'm new to web development using .NET and am having trouble binding data from my business logic to a table. I'm basically trying to populate a table dynamically.
The table, inside a ListView, that I want to populate:
<asp:ListView ID="processList" runat="server"
DataKeyNames="procName" GroupItemCount="1"
ItemType="SerMon.RemoteProcess" SelectMethod="fetchFromQueue">
<EmptyDataTemplate>
<table >
<tr>
<td>No data was returned.</td>
</tr>
</table>
</EmptyDataTemplate>
<EmptyItemTemplate>
<td/>
</EmptyItemTemplate>
<GroupTemplate>
<tr id="itemPlaceholderContainer" runat="server">
<td id="itemPlaceholder" runat="server"></td>
</tr>
</GroupTemplate>
<ItemTemplate>
<td runat="server">
<table id="myTable" class="table table-striped table-hover ">
<thead>
<tr>
<th>#</th>
<th>Process</th>
<th>Status</th>
<th>Machine</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>
<asp:Label runat="server" ID="lblId"><%#: Item.ProcName%></asp:Label></td>
<td>
<asp:Label runat="server" ID="Label1"><%#: Item.Procstatus%></asp:Label></td>
<td>
<asp:Label runat="server" ID="Label2"><%#: Item.mcName%></asp:Label></td>
</tr>
<tr>
</tbody>
</table>
</p>
</td>
</ItemTemplate>
</asp:ListView>
The method that is called to populate the table:
public List<RemoteProcess> fetchFromQueue()
{
    List<RemoteProcess> pl = new List<RemoteProcess>();
    foreach (CloudQueueMessage message in queue.GetMessages(5, TimeSpan.FromMinutes(1)))
    {
        Debug.Write(message.AsString);
        RemoteProcess m = JsonConvert.DeserializeObject<RemoteProcess>(message.AsString);
        pl.Add(m);
        //queue.DeleteMessage(message);
    }
    return pl;
}
The table is generated, but there's no data. Also, for some odd reason, five tables are generated (this is always equal to the number passed to the GetMessages call).
I am not sure whether a GroupTemplate is required for your ListView. You were getting multiple tables because you defined both the table header and the row in the ItemTemplate, and the ItemTemplate is rendered once per item. The ItemTemplate should contain only the markup for a single item; the LayoutTemplate should contain the overall layout that surrounds the items.
I have changed the ListView as follows to make it render as a proper table.
<asp:ListView ID="processList" runat="server"
DataKeyNames="procName"
ItemType="WebApplication1.Models.RemoteProcess" SelectMethod="fetchFromQueue">
<EmptyDataTemplate>
<table>
<tr>
<td>No data was returned.</td>
</tr>
</table>
</EmptyDataTemplate>
<EmptyItemTemplate>
<td />
</EmptyItemTemplate>
<LayoutTemplate>
<table runat="server" id="table1">
<thead>
<tr runat="server">
<th>#</th>
<th>Process</th>
<th>Status</th>
<th>Machine</th>
</tr>
<tr id="itemPlaceholder" runat="server"></tr>
</thead>
</table>
</LayoutTemplate>
<ItemTemplate>
<tr runat="server">
<td>1</td>
<td>
<asp:Label runat="server" ID="lblId"><%#: Item.procName%></asp:Label></td>
<td>
<asp:Label runat="server" ID="Label1"><%#: Item.Procstatus%></asp:Label></td>
<td>
<asp:Label runat="server" ID="Label2"><%#: Item.mcName%></asp:Label></td>
</tr>
</ItemTemplate>
</asp:ListView>
Please note that I have removed the GroupTemplate and the settings related to it, as I wasn't sure how you wanted to group the data.
This should help resolve your issue.

XPath query not working for this table

I have many tables in this format:
<table class="DataRows" frame="myFrames" rules="Standard" width="100%">
<colgroup><col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
</colgroup><thead>
<col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
<thead>
<tr>
<td valign="TOP"><span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
&nbsp&nbsp
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td></tr>
</thead>
</table>
I am using HtmlAgilityPack to loop thru each of the tables using this code
foreach (HtmlNode node in htmlAgilityPackDoc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]"))
{
}
This gives me the entire node for each iteration one of which is the table as above. I tried to access the company name in each iteration using the code below.
string str = node.ChildNodes.Descendants().SelectSingleNode("//td[@class='BOLD']").InnerText
but all I got was the company name of the first table for every table that is extracted in the loop. How do I get the next company name and address when I go thru each table in the loop?
This is a common mistake when writing a relative XPath that starts with the // axis. Even though you call SelectSingleNode() on the node variable, an XPath beginning with // is still evaluated globally, that is, from the root of the document. That is why you get the same element every time: it is the first matching element in the entire document.
To scope the XPath to the current node, put a single dot (.) at the beginning of the expression:
string str = node.SelectSingleNode(".//td[@class='BOLD']")
    .InnerText;
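For completeness, here is a small self-contained sketch of the loop with the relative expression in place. It assumes the HtmlAgilityPack NuGet package; the names CompanyNames and FromTables are hypothetical, and it relies on the first td with class BOLD in each table holding the company name, as in the sample markup.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CompanyNames
{
    // Collects the text of the first td.BOLD cell found inside each DataRows table.
    public static List<string> FromTables(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        var tables = doc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]");
        if (tables == null)
            return names; // SelectNodes returns null when nothing matches

        foreach (HtmlNode table in tables)
        {
            // ".//" searches below the current table only;
            // "//" would restart from the document root on every iteration.
            HtmlNode cell = table.SelectSingleNode(".//td[@class='BOLD']");
            if (cell != null)
                names.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
        }
        return names;
    }
}
```

With two tables on the page, the returned list holds two different names, whereas the "//td[...]" form would yield the first table's name twice.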
node.SelectSingleNode(".//td[@class='BOLD']").InnerText
This might work.
As noted in a comment, with HAP an XPath that continues from a previously selected node should start with "." (the current node), if I remember correctly.

ListView - Show LayoutTemplate on empty data source

For a shopping cart page, the list of items is displayed in a html table.
I use a ListView for that and it works great.
When the cart is empty, the text 'This cart is empty' appears, but only the markup in the EmptyDataTemplate is rendered. My goal is to display the table headers ('delete', 'product', 'quantity', etc.) without repeating that HTML in the EmptyDataTemplate.
Trying to be clever, I changed my EmptyDataTemplate into an EditItemTemplate and used the bit of code displayed below.
Can anyone think of a more elegant solution to this problem?
[C# code]
lvShoppingCart.DataSource = _cart.Items;
lvShoppingCart.DataBind();
if (_cart.ProductCount == 0)
{
    lvShoppingCart.DataSource = new List<string>() { "dummy cart item" };
    lvShoppingCart.EditIndex = 0;
    lvShoppingCart.DataBind();
}
[ASPX code]
<asp:ListView ID="lvShoppingCart" runat="server">
<LayoutTemplate>
<table style="width: 600px;" border="0" cellspacing="0" cellpadding="0">
<tr>
<td>
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="50">
<strong>Delete</strong>
</td>
<td width="400">
<strong>Product</strong>
</td>
<td width="100">
<strong>Quantity</strong>
</td>
<td width="100">
<strong>Price</strong>
</td>
<td width="100">
<strong>Total</strong>
</td>
</tr>
</table>
<hr />
</td>
</tr>
<tr id="itemPlaceHolder" runat="server">
</tr>
<tr id="trShoppingCartUpdateBtn" runat="server">
<td>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="50">
</td>
<td width="400">
</td>
<td colspan="3" width="300">
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td>
<asp:ImageButton ID="btnImgUpdateQuantities" ImageUrl="../img/refresh.gif" AlternateText="update shopping cart"
OnClick="btnUpdateQuantities_Click" runat="server" />
</td>
<td>
<asp:LinkButton ID="btnUpdateQuantities" Text="update cart" OnClick="btnUpdateQuantities_Click"
runat="server" />
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
<tr id="trShoppingCartTotals" runat="server">
<td>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td colspan="4">
<div align="right">
<strong>Totals: </strong>
</div>
</td>
<td width="100">
<asp:Label ID="lblCartTotal" runat="server" Text="0" />
</td>
</tr>
</table>
</td>
</tr>
</table>
</LayoutTemplate>
<EditItemTemplate>
<tr>
<td colspan="5" align="center">
<p>
<em>This cart is empty.</em>
</p>
</td>
</tr>
</EditItemTemplate>
<ItemTemplate>
<tr>
<td>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="50">
<a href='<%# ShoppingCartUrl %>?action=remove&id=<%# Eval("Product.Id") %>'>X</a>
</td>
<td width="400">
<%# Eval("Product.DisplayName") %>
</td>
<td width="100">
<label>
<asp:TextBox ID="txtQuantity" Text='<%# Eval("Quantity") %>' runat="server" size="3" />
</label>
</td>
<td width="100">
<%# Eval("Price", "{0:C}") %>
</td>
<td width="100">
<%# Eval("TotalPrice", "{0:C}") %>
</td>
</tr>
</table>
<hr />
</td>
</tr>
</ItemTemplate>
</asp:ListView>
You can add an empty InsertItemTemplate and set InsertItemPosition="LastItem"
Below you can find a simplified example of the shopping cart code.
It is using the 'InsertItemTemplate' solution provided in the answer of user757933.
I find this a more elegant solution than using the 'EditItemTemplate' which requires a 'dummy' data source.
Usage:
By default you should see an empty cart. When you uncomment the lines for 'bread', 'apples' and 'eggs', the message 'This cart is empty' is hidden and the three items appear in the cart instead.
[ASPX code]
<asp:ListView ID="lvShoppingCart" runat="server">
<LayoutTemplate>
<pre>
---------------------------------------------------------------------------
| Product | Quantity | Price | Total |
---------------------------------------------------------------------------
<div id="itemPlaceHolder" runat="server">
</div>
---------------------------------------------------------------------------
| | <asp:Label ID="lblCartTotal" runat="server" Text="0" /> |
---------------------------------------------------------------------------
</pre>
</LayoutTemplate>
<InsertItemTemplate>
| This cart is empty |
</InsertItemTemplate>
<ItemTemplate>
| <%# Container.DataItem.ToString().PadRight(17) %> | | | |
</ItemTemplate>
</asp:ListView>
[C# code]
internal class Cart : IEnumerable<string>
{
    public List<string> Items { get; set; }
    public Cart()
    {
        Items = new List<string>();
    }
    public IEnumerator<string> GetEnumerator()
    {
        return Items.GetEnumerator();
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
protected void Page_Load(object sender, EventArgs e)
{
    Cart _cart = new Cart();
    //_cart.Items.Add("bread");
    //_cart.Items.Add("apples");
    //_cart.Items.Add("eggs");
    lvShoppingCart.DataSource = _cart;
    // Make sure the 'InsertItemTemplate' is hidden from view when items are added to the cart.
    lvShoppingCart.InsertItemPosition = _cart.Items.Count == 0 ? InsertItemPosition.LastItem : InsertItemPosition.None;
    lvShoppingCart.DataBind();
    Label _lblCartTotal = lvShoppingCart.FindControl("lblCartTotal") as Label;
    if (_lblCartTotal != null)
    {
        _lblCartTotal.Text = string.Format("<strong>Total: </strong> {0}", _cart.Items.Count);
    }
}
