Parsing HTML in C# ASP.NET

Here's my sample HTML:
<html>
<table class="test" border="0" >
<tr bgColor="#e8f4ff">
<td width="50%" align="right">
<b>Invoice ID:</b>
</td>
<td width="50%">
<b>
1622579
</b>
</td>
</tr>
<tr bgColor="#e8f4ff">
<td align="right">
<b>Code:</b>
</td>
<td>
<b>
20475
</b>
</td>
</tr>
</html>
There's no ID, so I can't use SelectNodes().
How can I get the Code value (20475) using HtmlAgilityPack or regex?

Using the latest HtmlAgilityPack and relying only on the document structure. This will not be very resilient to changes in the HTML, so you should strongly consider adding appropriate ids (if this is your own HTML):
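HtmlDocument doc = new HtmlDocument();
doc.Load(@"test.html");
// Descendants() yields the <td> cells in document order (ToArray() needs using System.Linq).
var tds = doc.DocumentNode.Descendants("td").ToArray();
string codeValue = "";
for (int i = 1; i < tds.Length; i++)
{
    // The value cell immediately follows the cell whose <b> label reads "Code:".
    var label = tds[i - 1].Element("b");
    var value = tds[i].Element("b");
    // Trim() is needed because the sample puts line breaks around the text inside <b>.
    if (label != null && value != null && label.InnerText.Trim() == "Code:")
        codeValue = value.InnerText.Trim();
}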
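The same lookup can also be written as a single XPath query, which avoids the index arithmetic. The following is a minimal sketch, not the answerer's code: the class name `InvoiceParser` and the method `FindValueByLabel` are hypothetical, and it assumes the label cell's text is exactly the label once whitespace is normalized.

```csharp
using System;
using HtmlAgilityPack;

public static class InvoiceParser
{
    // Returns the trimmed text of the cell that follows the cell labelled
    // `label`, or null when no such cell exists.
    public static string FindValueByLabel(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: `label` contains no single quote, because it is
        // embedded in the XPath as a quoted literal.
        var cell = doc.DocumentNode.SelectSingleNode(
            "//td[normalize-space(.)='" + label + "']/following-sibling::td[1]");

        return cell == null ? null : cell.InnerText.Trim();
    }

    public static void Main()
    {
        const string html =
            "<table><tr><td><b>Code:</b></td><td><b>\n20475\n</b></td></tr></table>";
        Console.WriteLine(FindValueByLabel(html, "Code:"));
    }
}
```

Like the loop above, this still depends on the table layout, so adding ids remains the more robust option when you control the HTML.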

Related

Add tbody XML Element to table Element in XDocument

I want to add a <tbody> element to the <table> elements of an XDocument when it is missing.
<table class="newtable" id="item_559_Table1" cellpadding="0" cellspacing="0" data-its-style="width:11.4624em; border-spacing:0;">
<colgroup data-its-style="width:11.4624em; " />
<tr>
<td data-its-style="padding:0.2292em; vertical-align:top; ">
<p data-its-style="">My dad cooks up a pot of chicken soup, and</p>
</td>
</tr>
<tr>
<td data-its-style="padding:0.2292em; vertical-align:top; ">
<p data-its-style="font-weight:normal; ">This cold means I can’t taste a thing today!</p>
</td>
</tr>
</table>
The output should look like this:
<table class="newtable" id="item_559_Table1" cellpadding="0" cellspacing="0" data-its-style="width:11.4624em; border-spacing:0;">
<colgroup data-its-style="width:11.4624em; " />
<tbody>
<tr>
<td data-its-style="padding:0.2292em; vertical-align:top; ">
<p data-its-style="">My dad cooks up a pot of chicken soup, and</p>
</td>
</tr>
<tr>
<td data-its-style="padding:0.2292em; vertical-align:top; ">
<p data-its-style="font-weight:normal; ">This cold means I can’t taste a thing today!</p>
</td>
</tr>
</tbody>
</table>
Note: I am not looking for an XSLT solution.
One way to do it would be to grab the children of <table>, then add them back the way you want them.
var doc = XDocument.Load("file.xml");
var colgroup = doc.Root.Elements("colgroup");
var tr = doc.Root.Elements("tr");
// Add tr to tbody
var tbody = new XElement("tbody", tr);
// Replace the children of table with colgroup and tbody
doc.Root.ReplaceNodes(colgroup, tbody);
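The snippet above assumes the <table> is the document root and always adds a <tbody>. A hedged variation that only wraps the rows when no <tbody> exists could look like the sketch below. `TableFixer` and `EnsureTbody` are hypothetical names, and the elements are assumed to be in no XML namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TableFixer
{
    // Wraps the direct <tr> children of `table` in a new <tbody>,
    // unless the table already has a <tbody> or has no rows.
    public static void EnsureTbody(XElement table)
    {
        if (table.Elements("tbody").Any())
            return;

        // Snapshot the rows first, because the tree is modified below.
        var rows = table.Elements("tr").ToList();
        if (rows.Count == 0)
            return;

        // Insert the empty <tbody> where the first row currently sits,
        // so elements such as <colgroup> stay in front of it.
        var tbody = new XElement("tbody");
        rows[0].AddBeforeSelf(tbody);

        foreach (var row in rows)
        {
            row.Remove();   // detach from <table>
            tbody.Add(row); // re-attach under <tbody>
        }
    }

    public static void Main()
    {
        var table = XElement.Parse(
            "<table><colgroup /><tr><td>a</td></tr><tr><td>b</td></tr></table>");
        EnsureTbody(table);
        Console.WriteLine(table);
    }
}
```

To process a whole document, the same method could be called for each element returned by doc.Descendants("table").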

How to read a <table> inside an 'onmouseover' attribute with C# and HtmlAgilityPack

How can I read the <table> embedded in the onmouseover attribute with C# and HtmlAgilityPack?
Markup:
<a href="#" class="chan_live_not_free" onclick="return false;" onmouseover="return overlib('
<table>
<tr class=fieldRow>
<td class=posH_col width=40>
<strong>pos</strong>
</td>
<td class=rest_col width=90>
<strong>satellite</strong>
</td>
<td class=freqH_col width=50>
<strong>freq</strong>
</td>
<td class=rest_col width=90>
<strong>symbol</strong>
</td>
<td class=rest_col width=90>
<strong>encryption</strong>
</td>
</tr>
<tr>
<td class="pos_col">39.0°e</td>
<td class=rest_col>Hellas Sat 2</td>
<td class="freq_col">12.606 H</td>
<td class=rest_col>30000 - 2/3</td>
<td class=enc_not_live>MPEG-4 BulCrypt</td>
</tr>
</table>',CAPTION, 'Arena Sport 4 (serbia) – 19/10/14 - 11:30');" onmouseout="return nd();">
Arena Sport 4 (serbia)
</a>
I need to read the table inside the onmouseover attribute. How can I do that?
You could get the onmouseover attribute of the <a> tag with Html Agility Pack and then use a regular expression to extract the <table> from that string, something like the following code:
var html = @"<a href='#' class='chan_live_not_free' onclick='return false;' onmouseover='return overlib(
<table>
<tr class=fieldRow>
<td class=posH_col width=40>
<strong>pos</strong>
</td>
<td class=rest_col width=90>
<strong>satellite</strong>
.
.
.
<tr>
<td class=""pos_col"">39.0°e</td>
<td class=rest_col>Hellas Sat 2</td>
<td class=""freq_col"">12.606 H</td>
<td class=rest_col>30000 - 2/3</td>
<td class=enc_not_live>MPEG-4 BulCrypt</td>
</tr>
</table>,CAPTION, 'Arena Sport 4 (serbia) – 19/10/14 - 11:30');' onmouseout='return nd();'>
Arena Sport 4 (serbia)
</a>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var value = doc.DocumentNode.SelectSingleNode("//a[@class='chan_live_not_free']").Attributes["onmouseover"].Value;
var text = Regex.Matches(value, @"<table>([^)]*)</table>")[0].Value;
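The pattern above stops at the first closing parenthesis, so it fails when a table cell itself contains a `)`. A slightly more tolerant variant, sketched below, uses a lazy match across line breaks. It works on the attribute string only; `OverlibTable` and `ExtractTable` are hypothetical names, and the sample string is a shortened stand-in for the real attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OverlibTable
{
    // Returns the first <table>...</table> fragment in `attributeValue`,
    // or null when the string contains no table.
    public static string ExtractTable(string attributeValue)
    {
        // Singleline lets '.' match line breaks; '.*?' stops at the
        // first closing </table> tag instead of the last one.
        Match match = Regex.Match(
            attributeValue,
            @"<table\b[^>]*>.*?</table>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        string value =
            "return overlib('<table>\n<tr><td>39.0°e (east)</td></tr>\n</table>',CAPTION, 'Arena Sport 4');";
        Console.WriteLine(ExtractTable(value));
    }
}
```

The extracted fragment can then be loaded into a second HtmlDocument with LoadHtml if the individual cells are needed. Nested tables would still defeat this pattern.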

Html Agility Pack: how to process a table in a hyperlink

I am trying to get some information from an HTML table that has many rows like this. The given row is one piece of info in a table cell. I need to get the link, artist name, and artist type from this table.
<a href="http://somesite/music/view_album.php?albumid=6468" style="color:#000;" sl-processed="1">
<table width="100%" border="0" bgcolor="#FFFFFF">
<tbody><tr>
<td colspan="2" align="left" valign="top" style="color:#900;">album title</td>
</tr>
<tr> <td width="31%" align="left" valign="top"> <img src="./albums_files/No_cover.png" width="90" height="80" border="0">
</td>
<td width="69%" align="left" valign="top">
<a class="leftcat" href="http://somelink/toartiset" sl-processed="1"> <strong>Rizwan-Muazzam</strong>
</a>
<br>
(<a class="leftcat" href="http://linktoartisttype/" sl-processed="1">
Some Artist Type </a>) <br>
<span class="leftcat">
Rated +: 0<br>
Rated -: 0 </span>
</td>
</tr>
<tr> <td valign="top" align="center" colspan="2">
</td> </tr>
</tbody></table>
</a>
I have done this:
HtmlDocument doc = new HtmlWeb().Load(albumUrl);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
This gives me all the links I need. Now I want to get all the child information under each hyperlink.
Help will be appreciated.
Regards
Parminder
I would suggest using a loop to go through all the rows, then select the links in each row and extract the info from them:
var rows = doc.DocumentNode.SelectNodes("//tr");
foreach (var row in rows)
{
    var links = row.SelectNodes(".//a");
    // SelectNodes returns null when nothing matches; skip rows without both links.
    if (links == null || links.Count < 2)
        continue;
    // First link: the artist. Second link: the artist type.
    var artistLink = links[0].GetAttributeValue("href", "");
    var artistName = links[0].InnerText.Trim();
    var artistTypeLink = links[1].GetAttributeValue("href", "");
    var artistTypeName = links[1].InnerText.Trim();
    // Store the results...
}

Why is my code selecting all text() nodes in the HtmlDocument?

HtmlNode node = doc.DocumentNode.SelectNodes("//tr")[0];
foreach(HtmlTextNode n in node.SelectNodes("//text()"))
Console.WriteLine(n.Text);
HTML:
<table class="infobox" style="width: 17em; font-size: 100%;float: left;">
<tr>
<th style="text-align: center; background: #f08080;" colspan="3">خدیجہ مستور</th>
</tr>
<tr style="text-align: center;">
<td colspan="3"><img alt="خدیجہ مستور" src="//upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/150px-Khatijamastoor.JPG" width="150" height="203" srcset="//upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/225px-Khatijamastoor.JPG 1.5x, //upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/300px-Khatijamastoor.JPG 2x"><br>
<div style="font-size: 90%">خدیجہ مستور</div>
</td>
</tr>
<tr>
<th style="background: #f08080;" colspan="3">ادیب</th>
</tr>
<tr>
<td><b>ولادت</b></td>
<td colspan="2">1930ء، لکھنؤ، برطانوی ہندوستان</td>
</tr>
<tr>
<td><b>اصناف ادب</b></td>
<td colspan="2">ناول</td>
</tr>
<tr>
<td><b>معروف تصانیف</b></td>
<td colspan="2">آنگن</td>
</tr>
</table>
The output should be:
خدیجہ مستور
but I got:
خدیجہ مستور
خدیجہ مستور
ادیب
ولادت
1930ء
،
لکھنؤ
،
برطانوی ہندوستان
اصناف ادب
ناول
معروف تصانیف
آنگن
Why is node.SelectNodes("//text()") selecting all text() nodes in the document rather than only the text() nodes under the first tr tag?
Because your XPath (//text()) starts with two forward slashes, it is evaluated from the document root and selects every text node in the document, not just the descendants of the selected node.
Prefix it with a dot to make it relative to the current node:
foreach (HtmlTextNode n in node.SelectNodes(".//text()"))
Or put the whole path into one XPath expression (the normalize-space() predicate skips whitespace-only text nodes):
var text = doc.DocumentNode.SelectSingleNode("//tr[1]//text()[normalize-space()]");
Console.WriteLine(text.InnerText);
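The difference between an absolute and a relative path is a general XPath 1.0 rule, so it can be reproduced without Html Agility Pack. The sketch below uses System.Xml's XmlDocument on a small, made-up table; `XPathScopeDemo` and its methods are hypothetical names.

```csharp
using System;
using System.Xml;

public static class XPathScopeDemo
{
    // Counts the nodes that `xpath` selects when evaluated from `context`.
    public static int Count(XmlNode context, string xpath)
    {
        return context.SelectNodes(xpath).Count;
    }

    // Builds a two-row table and returns its first <tr>.
    public static XmlNode FirstRow()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<table><tr><th>first</th></tr>" +
            "<tr><td>second</td><td>third</td></tr></table>");
        return doc.SelectSingleNode("//tr");
    }

    public static void Main()
    {
        XmlNode row = FirstRow();
        // Leading "//" restarts at the document root: all 3 text nodes.
        Console.WriteLine(Count(row, "//text()"));   // 3
        // Leading ".//" stays below the context node: 1 text node.
        Console.WriteLine(Count(row, ".//text()"));  // 1
    }
}
```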

Modify Existing HTML Table

I have an HTML table that I request from a webserver in C#. I am then displaying the page in my aspx webform. How can I add a prerequisite based on the course ID to the last column in the table without hard-coding the prerequisite? Example of the table design is below.
<tr bgcolor="#E1E1CC">
<td width="7%">003597</td>
<td width="5%">01</td>
<td width="1%">OPT</td>
<td width="8%">MT H </td>
<td width="16%">2:00 pm - 2:50 pm </td>
<td width="17%">08/26/13 - 12/12/13</td>
<td width="8%">
<a href="http://www.mnsu.edu/registrar/building.html"target = _blank>
<b>TR C124</b>
</a>
</td>
<td width="19%">Staff</td>
<td width="4%">22</td>
<td width="4%">6</td>
<td width="4%"><font color="#000000">Open</font></td>
<td width="7%">
<a href=Notes.asp?SpclNote=20143+003597+IT+100 target = _blank>
<b>Notes</b>
</a>
</td>
</tr>
<tr bgcolor="#E1E1CC">
<td colspan="3"> </td>
<td width="8%">M </td>
<td width="16%">10:00 am - 11:50 am</td>
<td width="17%">08/26/13 - 12/09/13</td>
<td width="8%">
<a href="http://www.mnsu.edu/registrar/building.html"target = _blank>
<b>WH 0119 </b>
</a>
</td>
<td width="19%">Staff</td>
<td colspan="4"> </td>
</tr>
If you are getting the page html as a string you could just insert your html into it. Something like:
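private string SetValue(string pageHtml, string id, string textToInsert)
{
    // Find the ID first, then the next occurrence of the tag to insert before.
    // "{tag}" is a placeholder: substitute the markup that marks the insertion point.
    int splitIndex = pageHtml.IndexOf(id);
    if (splitIndex >= 0)
        splitIndex = pageHtml.IndexOf("{tag}", splitIndex);
    if (splitIndex < 0)
        return pageHtml; // ID or tag not found: leave the page unchanged
    string before = pageHtml.Substring(0, splitIndex);
    string after = pageHtml.Substring(splitIndex);
    return before + textToInsert + after;
}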
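As one concrete reading of this idea, the sketch below appends text to the last cell of the row that contains a given course ID. It assumes the ID appears in the first cell of its row and that the row is closed with </td></tr>. `RowEditor` and `AppendToLastCell` are hypothetical names, and the prerequisite text in the example is made up.

```csharp
using System;

public static class RowEditor
{
    // Inserts `textToInsert` just before the last </td> of the row that
    // contains `courseId`. Returns the input unchanged when the ID or the
    // closing tags cannot be found.
    public static string AppendToLastCell(string pageHtml, string courseId, string textToInsert)
    {
        int idIndex = pageHtml.IndexOf(courseId, StringComparison.Ordinal);
        if (idIndex < 0)
            return pageHtml;

        // End of the row that contains the ID.
        int rowEnd = pageHtml.IndexOf("</tr>", idIndex, StringComparison.OrdinalIgnoreCase);
        if (rowEnd < 0)
            return pageHtml;

        // Last cell end tag before the row end, searching backwards.
        int cellEnd = pageHtml.LastIndexOf("</td>", rowEnd, StringComparison.OrdinalIgnoreCase);
        if (cellEnd < idIndex)
            return pageHtml;

        return pageHtml.Insert(cellEnd, textToInsert);
    }

    public static void Main()
    {
        string html = "<tr><td>003597</td><td>Notes</td></tr>";
        Console.WriteLine(AppendToLastCell(html, "003597", " (prerequisite: IT 100)"));
    }
}
```

The prerequisite text itself could come from a lookup keyed by course ID, for example a Dictionary<string, string> filled from a database or a configuration file, which keeps it out of the code.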
