XPath query not working for this table - c#

I have many tables in this format:
<table class="DataRows" frame="myFrames" rules="Standard" width="100%">
<colgroup><col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
</colgroup><thead>
<col width="70" align="CENTER">
<col width="200" align="LEFT">
<col width="80" align="LEFT">
<col align="LEFT">
<col align="RIGHT">
<thead>
<tr>
<td valign="TOP"><span class="classicBold"> 20 </span> Kg.
<td class="BOLD" valign="TOP" nowrap="">
PA Passion Foods Inc.
<td class="BOLD">Fax:
<td>
222-555666
<td class="BOLD">
Processed foods and juices
<tr>
<td><a target="_blank" href="">See on Map </a>
<td>
120 NW 157TH AVE
<td class="BOLD">Warehouse Hours:
<td colspan="2">
<tr>
<td>
<td><span class="BOLD">
Jacksonville,
</span>
FL 300000
<td class="BOLD">Url:
<td colspan="2">
<a target="_blank" href="">PA Passion</a>
&nbsp&nbsp
<span class="BOLD">E-mail:</span>
zoro@xyz.com
<tr>
<td>
<td class="REDBOLD" colspan="4">
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
Nutrella
</span>
<tr>
<td>
<td colspan="4" align="LEFT">Franchisee for:<span class="BOLD">
APPLE Foods, Constants
</span>
<tr>
<td>
<td colspan="4" align="LEFT"><span class="BOLD">
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We service:<span class="BOLD">
All occasions and hospitality services
</span>
<tr>
<td>
<td colspan="4" align="LEFT">We sell :<span class="BOLD">
----
</span>
</td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></tr></td></td></td></td></tr></td></td></td></td></tr></td></td></td></td></td></tr>
</thead>
</table>
I am using HtmlAgilityPack to loop through each of the tables with this code:
foreach (HtmlNode node in htmlAgilityPackDoc.DocumentNode.SelectNodes("//table[contains(@class,'DataRows')]"))
{
}
Each iteration gives me one whole table node, such as the table above. I tried to read the company name in each iteration with the code below:
string str = node.ChildNodes.Descendants().SelectSingleNode("//td[@class='BOLD']").InnerText;
but all I get is the company name of the first table, for every table extracted in the loop. How do I get the next company name and address as I go through each table in the loop?

This is a common mistake when writing a relative XPath that starts with //. Even though you call SelectSingleNode() on the node variable, an XPath that begins with // is still evaluated globally, that is, from the root of the whole document. That is why you get the same element every time: it is the first matching element in the entire document.
To scope the XPath to the current node, simply put a single dot (.) at the beginning of the XPath:
string str = node.SelectSingleNode(".//td[@class='BOLD']")
    .InnerText;
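The difference between the two forms can be reproduced without HtmlAgilityPack. The sketch below uses System.Xml from the standard library as a stand-in, on the assumption (consistent with the answer above) that HAP's SelectSingleNode follows the same XPath rules. The class name, method name, and sample markup are hypothetical.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class ScopedXPathDemo
{
    // Hypothetical sample: two tables, each with its own company name.
    public const string Sample =
        "<root>" +
        "<table class='DataRows'><tr><td class='BOLD'>First Co.</td></tr></table>" +
        "<table class='DataRows'><tr><td class='BOLD'>Second Co.</td></tr></table>" +
        "</root>";

    // Runs cellXPath once per table and collects the text of the first match.
    public static List<string> FirstBoldPerTable(string xml, string cellXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var names = new List<string>();
        foreach (XmlNode table in doc.SelectNodes("//table[contains(@class,'DataRows')]"))
        {
            // "//td..." is evaluated from the document root, whichever node it is called on;
            // ".//td..." is evaluated from the current table node.
            names.Add(table.SelectSingleNode(cellXPath).InnerText);
        }
        return names;
    }
}
```

Calling ScopedXPathDemo.FirstBoldPerTable(ScopedXPathDemo.Sample, "//td[@class='BOLD']") yields "First Co." twice, while the same call with ".//td[@class='BOLD']" yields "First Co." and then "Second Co.".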

node.SelectSingleNode(".//td[@class='BOLD']").InnerText
This might work.
As noted in a comment, with HAP an XPath that continues from a previously selected node should start with "." (the current node), if I remember correctly.

Related

Parsing HTML tables with different row numbers

I am trying to parse HTML tables, but the tables have different numbers of rows. All the tables are inside a (form), which I select as a single node, but going through (tbody) gives me the rows rather than the (td) cells, so I can't loop over all the (td) elements.
Part of the HTML code:
<form name="DetailsForm" method="post" action="">
<input type="hidden" name="helpPageId" value="WF03">
<input type="hidden" name="withMenu" value="1">
<table width="100%" cellspacing="0" border="0">
<tbody>
<tr valign="center">
<td class="blackHeadingLeft">Details</td>
</tr>
<tr></tr>
<tr>
<td></td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" border="0">
<tbody>
<tr>
<td class="whiteTd" height="21"> AWB:</td>
<td class="whiteTdNormal" nowrap="nowrap" height="21"> 7777995585 </td>
<td class="whiteTd" nowrap="nowrap" height="21"> No of Shipment Details:</td>
<td class="whiteTdNormal" nowrap="nowrap" height="21"> 1 </td>
<td class="whiteTdNormal" width="100%" height="21"> </td>
</tr>
</tbody>
</table>
<table class="bordered-table" width="100%" border="0">
<tbody>
<tr>
<td class="grayTd" width="5%" height="21"> Details</td>
<td class="grayTd" width="5%" height="21" align="center"> Orig</td>
<td class="grayTd" width="8%" height="21" align="center"> Location</td>
<td class="grayTd" width="7%" height="21"> Dest</td>
<td class="grayTd" width="5%" height="21" align="center"> Pcs</td>
<td class="grayTd" width="5%" height="21"> Weight(kg)</td>
<td class="grayTd" width="11%" height="21"> Volumetric Weight(kg)</td>
<td class="grayTd" width="9%" height="21"> Date/Time</td>
<td class="grayTd" width="8%" height="21"> Route/Cycle</td>
<td class="grayTd" width="8%" height="21"> Post Code</td>
<td class="grayTd" width="6%" height="21"> Product</td>
<td class="grayTd" width="9%" height="21"> Amount</td>
<td class="grayTd" width="9%" height="21"> Duplicate</td>
</tr>
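Here is how I was able to do it:
// Requires: using System; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
    Console.WriteLine("Table: ");
    // SelectNodes returns null when nothing matches, so skip tables without a tbody.
    var tbodies = table.SelectNodes("tbody");
    if (tbodies == null) continue;
    foreach (HtmlNode tbody in tbodies)
    {
        // Only descend when the tbody actually contains rows.
        if (tbody.ChildNodes.Any(x => x.Name == "tr"))
        {
            Console.WriteLine("TBody: ");
            foreach (HtmlNode row in tbody.SelectNodes("tr"))
            {
                Console.WriteLine("TR: ");
                // Only descend when the row actually contains cells.
                if (row.ChildNodes.Any(c => c.Name == "td"))
                {
                    foreach (var item in row.SelectNodes("td"))
                    {
                        Console.WriteLine("TD: ");
                        Console.WriteLine(item.InnerHtml);
                    }
                }
                Console.WriteLine();
            }
        }
    }
}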
This way it doesn't matter how many tr or td tags there are. One thing to note is that you have to add validation if there is a case in which there are no tr or td tags in the tbody.
I hope this helps.
Edited to include validation for tr and td tags. A similar logic can be used for all other tags that might be missing.

C# Get all the id of the html tag and set inner text for <td></td> tag

I have an HTML string, and I want to get the id of every tag in it that has one.
The HTML string, read from a text file:
<tr>
<td class="X8">
</td>
<td colspan="6" class="X9"></td>
<td colspan="4" class="X12" id="closedate">
</td>
<td colspan="6" class="X9"></td>
<td colspan="4" class="X12" id="startdate">
</td>
<td class="X8">
</td>
<td class="X8" colspan="3">
</td>
<td class="X8">
</td>
<td colspan="9" class="X9"></td>
<td colspan="6" class="X15" id="totalpayment"></td>
<td class="X8">
</td>
<td class="X8">
</td>
</tr>
<tr>
<td class="X17">
</td>
<td class="X17" colspan="8">
</td>
<td class="X17" colspan="33">
</td>
<td class="X17">
</td>
</tr>
<tr>
<td class="X17">
</td>
<td class="X17" colspan="8">
<td class="X17" colspan="16">
</td>
<td class="X17">
</td>
<td colspan="9" class="X20"></td>
<td colspan="6" class="X23" id="approvaldate"></td>
<td class="X17">
</td>
<td class="X17">
</td>
</tr>
Expected result:
closedate, startdate, totalpayment, approvaldate.
Then I want to set the inner text of the tag with a given id
(e.g. <td colspan="6" class="X23" id="approvaldate">2018/07/18</td>)
using C#. Please help. Thanks a lot.
As I understand your question, you need the ids of all the elements as strings. Here is a simple example:
<form id="form1" runat="server">
<input id="Name" type="text" name="Full Name" runat="server" />
<input id="Email" type="text" name="Email Address" runat="server" />
<input id="Phone" type="text" name="Phone Number" runat="server" />
</form>
foreach (var control in Page.Form.Controls)
{
if (control is HtmlInputControl)
{
var htmlInputControl = control as HtmlInputControl;
string controlName = htmlInputControl.Name;
string controlId = htmlInputControl.ID;
}
}
Another approach:
HtmlElement table = testWebBrowser.Document.GetElementById("TableID");
if (table != null)
{
foreach (HtmlElement row in table.GetElementsByTagName("TR"))
{
// ...
}
}
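Neither snippet above works on a plain HTML string as the question describes. As a rough sketch of the two steps asked for (list every id, then set the text of one cell), the example below uses LINQ to XML from the standard library. It assumes the fragment has first been made well-formed, which the sample in the question is not (one td is never closed). The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class IdCells
{
    // Collects the value of every id attribute, in document order.
    public static List<string> GetIds(XElement root)
    {
        return root.DescendantsAndSelf()
                   .Attributes("id")
                   .Select(a => a.Value)
                   .ToList();
    }

    // Sets the text of the first element whose id matches; returns false if none is found.
    public static bool SetText(XElement root, string id, string text)
    {
        XElement target = root.DescendantsAndSelf()
                              .FirstOrDefault(e => (string)e.Attribute("id") == id);
        if (target == null) return false;
        target.Value = text;
        return true;
    }
}
```

For example, after var root = XElement.Parse(wellFormedHtml), string.Join(", ", IdCells.GetIds(root)) produces the comma-separated list, and IdCells.SetText(root, "approvaldate", "2018/07/18") fills that cell; root.ToString() then gives the updated markup. For HTML that cannot be made well-formed, an HTML parser such as HtmlAgilityPack would be the more robust choice.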

Xpath select all tr without table with id=x

Hello, I need to select all tr elements, but some of them contain a table with id=WHITE_BANKTABLE.
I need to select only the tr elements that don't contain this table.
My HTML:
<table id=mytable_body>
<TR id=TR_ROW_BANKTABLE class=TR_ROW_BANKTABLE style="BACKGROUND-COLOR: #f6f8fa" align=right bgColor=#f6f8fa>
<TD noWrap align=right w_idth="190"> </TD>
<TD align=right>010073/15922</TD>
</TR>
> **//This Tr with TABLE id=WHITE_BANKTABLE i don't need**
<TR>
<TD colSpan=8 align=center>
<TABLE id=WHITE_BANKTABLE cellSpacing=0 borderColorDark=#edf0f5 cellPadding=3 width="100%" bgColor=white borderColorLight=#edf0f5 border=1 isWhiteTable="Y">
<TBODY>
<TR class=TR_BANKTABLE align=right vAlign=top>
<TD> sdfsd </TD>
<TD>sdfs</TD>
</TR>
</TBODY>
</TABLE>
</TD>
</TR>
<TR id=TR_ROW_BANKTABLE class=TR_ROW_BANKTABLE style="BACKGROUND-COLOR: #f6f8fa" align=right bgColor=#f6f8fa>
<TD noWrap align=right w_idth="190"> </TD>
<TD align=right>010073/15922</TD>
</TR>
</table>
Thanx.
Assuming the above is well-formed XML (insert the missing double quotes):
var q =
    xml.XPathSelectElements(@"/tr[not(descendant::table[@id = 'WHITE_BANKTABLE'])]");
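A small runnable sketch of this predicate follows. It assumes the markup has been normalized to well-formed, lowercase XML, since XPath over XML is case-sensitive and the original mixes TR and table. It also uses /table/tr rather than /tr so that only the top-level rows are considered, which is an assumption about the intended structure. The class name, method name, and the row ids in the usage note are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class RowFilter
{
    // Returns the id of every top-level row that does not contain the nested table.
    public static string[] RowIdsWithoutNestedTable(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc
            .XPathSelectElements("/table/tr[not(descendant::table[@id = 'WHITE_BANKTABLE'])]")
            .Select(tr => (string)tr.Attribute("id"))
            .ToArray();
    }
}
```

Given a table with rows r1, r2, and r3 where only r2 wraps the WHITE_BANKTABLE table, the method returns r1 and r3. With //tr instead of /table/tr, the rows inside the nested table would be returned as well, because they do not themselves contain that table.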

Scraping HTML table to rectangular array using LINQ

I would like to scrape the column headers and rows of data for each column into a two-dimensional array. The data looks like the following:
<div id="content">
<!-- start left col--><div id="LeftCol-wss">
<h1>Aircraft Names</h1>
<h3>Names by Type</h3>
<table cellspacing="1" cellpadding="2" class="data">
<tr valign="top" bgcolor="#FFFFFF">
<td valign="top" width="25%">
<table width="100%" cellpadding="3" cellspacing="0" border="0" class="data">
<tr class="datatop">
<td width="100%">
Fighter</td>
</tr>
<tr>
<td align="top" class="datatop" width="100%">
<br/>
<a href="/page/mig-29.html" >MiG-29</a>
<br/>
<a href="/page/f-15.html" >F-15</a>
<br/>
<a href="/page/f-86.html" >F-86</a>
<br/>
<br>
</td>
</tr>
</table>
</td>
<td valign="top" width="25%">
<table width="100%" cellpadding="3" cellspacing="0" border="0" class="data">
<tr class="datahead">
<td width="100%">
Bomber</td>
</tr>
<tr>
<td align="top" class="datatop" width="100%">
<br/>
<a href="/page/b-52.html" >B-52</a>
<br/>
<a href="/page/b-1b.html" >B-1B</a>
<br/>
<br>
</td>
</tr>
</table>
</td>
</div>
The result I am looking for will be a two-dimensional array that looks like:
Fighter MiG-29
Fighter F-15
Fighter F-86
Bomber B-52
Bomber B-1B
I am using C# and would prefer to use LINQ, but at this point I'll take any suggestions.
If you want to parse HTML in C#, the canonical answer is to use the HTML Agility Pack.
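The answer names a library but shows no query. As a rough sketch of the LINQ shape only, the example below uses LINQ to XML from the standard library rather than the HTML Agility Pack, so it assumes the markup has first been made well-formed (the sample in the question is not: it has unclosed <br> tags and a missing </table>). The class name AircraftGrid and the method ToGrid are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class AircraftGrid
{
    // Builds a rows-by-2 array: column 0 is the category header, column 1 is the link text.
    public static string[,] ToGrid(string xml)
    {
        XElement root = XElement.Parse(xml);

        var pairs = root.Descendants("table")
            // Keep only the inner tables, i.e. tables nested inside another table.
            .Where(t => t.Ancestors("table").Any())
            .SelectMany(t =>
            {
                // Assumes the first row of each inner table holds the category name.
                string category = t.Elements("tr").First().Value.Trim();
                return t.Descendants("a")
                        .Select(a => new { Category = category, Name = a.Value.Trim() });
            })
            .ToList();

        var grid = new string[pairs.Count, 2];
        for (int i = 0; i < pairs.Count; i++)
        {
            grid[i, 0] = pairs[i].Category;
            grid[i, 1] = pairs[i].Name;
        }
        return grid;
    }
}
```

With the HTML Agility Pack the same projection would apply to the nodes it returns; only the parsing and selection calls would differ.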

regular expression - find and split links in a table

Another regular expression question from me; it's so complicated for me :S So I'd be happy for some additional help.
I have a table, and I'd like to read all the links inside it and split each one into groups.
The goal should be:
Person 1
Status of person 1
Person 2
Status of Person 2
So I have to get the values inside the links in this table:
<a class="darklink" href="testlink">Person 2, - Status of Person 2</a>
Is it possible to search only in a table that is preceded by a specific tag? Like this:
<p>title</p> (because there are other similar tables on my site)
<p>title</p>
<table cellspacing="0" cellpadding="0" border="0" width="95%">
<tbody>
<tr>
<td bgcolor="#999999" colspan="2"><img height="1" border="0" width="1" src="images/dot_transp.gif" alt=" "/> </td>
</tr>
<tr>
<td><a class="darklink" href="asdfer">Person1, - Status of Person1 </a> </td>
<td valign="bottom"></td>
</tr>
<tr>
<td bgcolor="#999999" colspan="2"><img height="1" border="0" width="1" src="images/dot_transp.gif" alt=" "/> </td>
</tr>
<tr>
<td><a class="darklink" href="aeraseraesr">Person 2, - Status of Person 2</a></td>
<td valign="bottom"> <img hspace="0" height="16" border="0" align="right" width="12" vspace="0" alt=" " src="images/ico_link.gif"/> </td>
</tr>
<tr>
<td bgcolor="#999999" colspan="2"><img height="1" border="0" width="1" src="images/dot_transp.gif" alt=" "/> </td>
</tr>
<tr>
<td><a class="darklink" href="asdfasdf">Person 3. - Status of Person 3</a></td>
<td valign="bottom"> </td>
</tr>
<tr> </tr>
</tbody>
</table>
Your regular expression should be:
<a class="darklink" .*?>(.*?). - (.*?)</a>
or if you get line breaks inside your <a> tag:
<a class="darklink" [\s\S]*?>([\s\S]*?). - *([\s\S]*?)</a>
So the following code should work:
Regex person = new Regex(@"<a class=""darklink"" .*?>(.*?). - (.*?)</a>");
foreach (Match m in person.Matches(input))
{
    Console.WriteLine("First group : {0}", m.Groups[1]);
    Console.WriteLine("Second group: {0}", m.Groups[2]);
}
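A self-contained variant of the same idea is sketched below. It is not the answer's exact pattern: the separator is restricted to a comma or a period (an assumption based on the sample rows, where the original used "." to match any character), and surrounding whitespace is trimmed inside the pattern. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DarkLinkParser
{
    // Group 1: text before ", - " or ". - "; group 2: text after it, without trailing spaces.
    private static readonly Regex Link = new Regex(
        @"<a class=""darklink"" [^>]*>\s*(.*?)[,.] - \s*(.*?)\s*</a>",
        RegexOptions.Singleline); // Singleline lets '.' span line breaks inside the tag

    // Returns (person, status) pairs in the order the links appear.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Link.Matches(html))
        {
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```

Each pair's Key holds the person and its Value holds the status. This does not restrict matching to the table that follows a given <p>title</p>; for that, the relevant table would have to be isolated first.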
