Add tbody XML Element to table Element in XDocument - c#

I want to add a <tbody> element to each <table> element in an XDocument when it is missing.
<table class="newtable" id="item_559_Table1" cellpadding="0" cellspacing="0" data-its-style="width:11.4624em; border-spacing:0;">
<colgroup data-its-style="width:11.4624em; " />
<tr>
<td data-its-style="padding:0.2292em; vertical-align:top; ">
<p data-its-style="">My dad cooks up a pot of chicken soup, and</p>
</td>
</tr>
<tr>
<td data-its-style="padding:0.2292em; vertical-align:top; ">
<p data-its-style="font-weight:normal; ">This cold means I can’t taste a thing today!</p>
</td>
</tr>
</table>
The output should look like this:
<table class="newtable" id="item_559_Table1" cellpadding="0" cellspacing="0" data-its-style="width:11.4624em; border-spacing:0;">
<colgroup data-its-style="width:11.4624em; " />
<tbody>
<tr>
<td data-its-style="padding:0.2292em; vertical-align:top; ">
<p data-its-style="">My dad cooks up a pot of chicken soup, and</p>
</td>
</tr>
<tr>
<td data-its-style="padding:0.2292em; vertical-align:top; ">
<p data-its-style="font-weight:normal; ">This cold means I can’t taste a thing today!</p>
</td>
</tr>
</tbody>
</table>
Note: I am not looking for an XSLT solution.

One way to do it would be to grab the children of <table>, then add them back the way you want them.
var doc = XDocument.Load("file.xml");
var colgroup = doc.Root.Elements("colgroup");
var tr = doc.Root.Elements("tr");
// Add tr to tbody
var tbody = new XElement("tbody", tr);
// Replace the children of table with colgroup and tbody
doc.Root.ReplaceNodes(colgroup, tbody);
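
The snippet above assumes that the <table> is the document root and that it contains only colgroup and tr children. A slightly more general sketch follows, assuming tables may appear anywhere in the document, the elements are not in an XML namespace, and the new tbody can simply go after the remaining children (such as colgroup). The names TbodyWrapper and WrapRowsInTbody are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TbodyWrapper
{
    // Hypothetical helper: wraps the direct <tr> children of every <table>
    // that has no <tbody> child into a new <tbody> element.
    public static void WrapRowsInTbody(XDocument doc)
    {
        // ToList() snapshots the tables so the tree can be modified while looping.
        foreach (var table in doc.Descendants("table").ToList())
        {
            if (table.Element("tbody") != null)
                continue; // already has a tbody, leave it alone

            var rows = table.Elements("tr").ToList();
            if (rows.Count == 0)
                continue; // nothing to wrap

            // Detach the rows from the table, then append them inside a new tbody.
            rows.Remove();
            table.Add(new XElement("tbody", rows));
        }
    }

    public static void Main()
    {
        var doc = XDocument.Parse(
            "<table><colgroup /><tr><td>a</td></tr><tr><td>b</td></tr></table>");
        WrapRowsInTbody(doc);
        Console.WriteLine(doc);
    }
}
```

Because the new tbody is appended at the end of the table, a table that also has a tfoot after its rows would need extra handling; that case is not covered here.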

Related

Failing to retrieve the third td nodes in an html list

I am trying to get the text "Very Good Country views" and "Good" using HTMLAgilityPack.
<div class="property-details-section">
<h5><span id="content_lblFurtherDetails">Further Details</span></h5>
<ul id="features">
<li style="display:block;">
<table border="0" cellpadding="0" cellspacing="0" width="500">
<tr>
<td style="width: 15px;">
<img src="../images/bullet.png" alt="bullet" />
</td>
<td style="width: 185px;">Views</td>
<td style="width: 300px;">Very Good Country views</td>
</tr>
</table>
</li>
</ul>
<li style="display:block;">
<table border="0" cellpadding="0" cellspacing="0" width="500">
<tr>
<td style="width: 15px;">
<img src="../images/bullet.png" alt="bullet" />
</td>
<td style="width: 185px;">Finish</td>
<td style="width: 300px;">Good</td>
<tr>
</table>
</li>
</div>
I have tried the following for "Very Good Country views" with no success:
HtmlNode text =
doc.DocumentNode.SelectSingleNode("//ul[@id='features']/li/table/tr/td[3]");

I am trying to get the text "Very Good Country views" and "Good"
You need to select two elements, so use SelectNodes instead of SelectSingleNode if you want both results at once.
var result = doc.DocumentNode.SelectNodes("//ul[@id='features']/li/*//td[last()]")
.Select(td => td.InnerText)
.ToList();

I think the problem with your XPath is that you need parentheses around the expression, so that [3] applies to the whole result set rather than to each row:
var text = doc.DocumentNode
.SelectSingleNode("(//ul[@id='features']/li/table/tr/td)[3]");
You can also try using LINQ:
var td = doc.DocumentNode.Descendants("ul")
.First(x => x.GetAttributeValue("id","") == "features")
.Descendants("td")
.Skip(2)
.First();
var text = td.InnerText;
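
To see what the parentheses change, here is a small sketch that uses System.Xml's XmlDocument as a stand-in for HtmlAgilityPack. That substitution is an assumption made so the example runs without extra packages; the positional semantics shown are standard XPath 1.0. The markup is a shortened, made-up version of the list in the question, and PositionDemo and Select are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PositionDemo
{
    // Returns the text of every node matched by the XPath, joined with '|'.
    public static string Select(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var texts = new List<string>();
        foreach (XmlNode n in doc.SelectNodes(xpath))
            texts.Add(n.InnerText);
        return string.Join("|", texts);
    }

    public static void Main()
    {
        const string xml =
            "<ul id='features'>" +
            "<li><table><tr><td>1</td><td>Views</td><td>Very Good Country views</td></tr></table></li>" +
            "<li><table><tr><td>2</td><td>Finish</td><td>Good</td></tr></table></li>" +
            "</ul>";

        // td[3]: the third td within each tr, so one match per row.
        // prints "Very Good Country views|Good"
        Console.WriteLine(Select(xml, "//ul[@id='features']/li/table/tr/td[3]"));

        // (...)[3]: the third td of the whole result set, so a single match.
        // prints "Very Good Country views"
        Console.WriteLine(Select(xml, "(//ul[@id='features']/li/table/tr/td)[3]"));
    }
}
```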

How to read <table> into 'onmouseover' event with C# and HTMLAgilityPack

How to read <table> into onmouseover event with C# and HTMLAgilityPack?
Markup:
<a href="#" class="chan_live_not_free" onclick="return false;" onmouseover="return overlib('
<table>
<tr class=fieldRow>
<td class=posH_col width=40>
<strong>pos</strong>
</td>
<td class=rest_col width=90>
<strong>satellite</strong>
</td>
<td class=freqH_col width=50>
<strong>freq</strong>
</td>
<td class=rest_col width=90>
<strong>symbol</strong>
</td>
<td class=rest_col width=90>
<strong>encryption</strong>
</td>
</tr>
<tr>
<td class="pos_col">39.0°e</td>
<td class=rest_col>Hellas Sat 2</td>
<td class="freq_col">12.606 H</td>
<td class=rest_col>30000 - 2/3</td>
<td class=enc_not_live>MPEG-4 BulCrypt</td>
</tr>
</table>',CAPTION, 'Arena Sport 4 (serbia) – 19/10/14 - 11:30');" onmouseout="return nd();">
Arena Sport 4 (serbia)
</a>
I need to read the table that is inside the onmouseover attribute. How can I do that?

You could get the onmouseover attribute of the <a> tag with HTML Agility Pack and then use a regular expression to extract the <table> from that string, something like the following code:
var html = @"<a href='#' class='chan_live_not_free' onclick='return false;' onmouseover='return overlib(
<table>
<tr class=fieldRow>
<td class=posH_col width=40>
<strong>pos</strong>
</td>
<td class=rest_col width=90>
<strong>satellite</strong>
.
.
.
<tr>
<td class=""pos_col"">39.0°e</td>
<td class=rest_col>Hellas Sat 2</td>
<td class=""freq_col"">12.606 H</td>
<td class=rest_col>30000 - 2/3</td>
<td class=enc_not_live>MPEG-4 BulCrypt</td>
</tr>
</table>,CAPTION, 'Arena Sport 4 (serbia) – 19/10/14 - 11:30');' onmouseout='return nd();'>
Arena Sport 4 (serbia)
</a>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var value = doc.DocumentNode.SelectSingleNode("//a[@class='chan_live_not_free']").Attributes["onmouseover"].Value;
var text = Regex.Matches(value, @"<table>([^)]*)</table>")[0].Value;

html agility how to process table in a hyperlink

I am trying to get some information from an HTML table that has many rows like this one. The row shown is one piece of info in a table cell. I need to get the link, artist name, and artist type from this table.
<a href="http://somesite/music/view_album.php?albumid=6468" style="color:#000;" sl-processed="1">
<table width="100%" border="0" bgcolor="#FFFFFF">
<tbody><tr>
<td colspan="2" align="left" valign="top" style="color:#900;">album title</td>
</tr>
<tr> <td width="31%" align="left" valign="top"> <img src="./albums_files/No_cover.png" width="90" height="80" border="0">
</td>
<td width="69%" align="left" valign="top">
<a class="leftcat" href="http://somelink/toartiset" sl-processed="1"> <strong>Rizwan-Muazzam</strong>
</a>
<br>
(<a class="leftcat" href="http://linktoartisttype/" sl-processed="1">
Some Artist Type </a>) <br>
<span class="leftcat">
Rated +: 0<br>
Rated -: 0 </span>
</td>
</tr>
<tr> <td valign="top" align="center" colspan="2">
</td> </tr>
</tbody></table>
</a>
So far I have this:
HtmlDocument doc = new HtmlWeb().Load(albumUrl);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
This gives me all the links I need. Now I want to get all the child information under each hyperlink.
Help will be appreciated.
Regards
Parminder

I would suggest using a loop to go through all the rows and then select the links and extract the info from them:
var rows = doc.DocumentNode.SelectNodes("//tr");
foreach (var row in rows)
{
    var links = row.SelectNodes(".//a");
    // SelectNodes returns null when nothing matches; skip rows without both links
    if (links == null || links.Count < 2)
        continue;
    var artistLink = links[0].Attributes["href"].Value;
    var artistName = links[0].SelectSingleNode(".//strong/text()").InnerText;
    var artistTypeLink = links[1].Attributes["href"].Value;
    var artistTypeName = links[1].SelectSingleNode(".//text()").InnerText;
    // Store the results...
}

Why is my code selecting all text() nodes in HtmlDocument

HtmlNode node = doc.DocumentNode.SelectNodes("//tr")[0];
foreach(HtmlTextNode n in node.SelectNodes("//text()"))
Console.WriteLine(n.Text);
HTML:
<table class="infobox" style="width: 17em; font-size: 100%;float: left;">
<tr>
<th style="text-align: center; background: #f08080;" colspan="3">خدیجہ مستور</th>
</tr>
<tr style="text-align: center;">
<td colspan="3"><img alt="خدیجہ مستور" src="//upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/150px-Khatijamastoor.JPG" width="150" height="203" srcset="//upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/225px-Khatijamastoor.JPG 1.5x, //upload.wikimedia.org/wikipedia/ur/thumb/7/7b/Khatijamastoor.JPG/300px-Khatijamastoor.JPG 2x"><br>
<div style="font-size: 90%">خدیجہ مستور</div>
</td>
</tr>
<tr>
<th style="background: #f08080;" colspan="3">ادیب</th>
</tr>
<tr>
<td><b>ولادت</b></td>
<td colspan="2">1930ء، لکھنؤ، برطانوی ہندوستان</td>
</tr>
<tr>
<td><b>اصناف ادب</b></td>
<td colspan="2">ناول</td>
</tr>
<tr>
<td><b>معروف تصانیف</b></td>
<td colspan="2">آنگن</td>
</tr>
</table>
Expected output:
خدیجہ مستور
Actual output:
خدیجہ مستور
خدیجہ مستور
ادیب
ولادت
1930ء
،
لکھنؤ
،
برطانوی ہندوستان
اصناف ادب
ناول
معروف تصانیف
آنگن
Why is node.SelectNodes("//text()") selecting all text() nodes in the document rather than just the text() nodes from the first tr tag?

Because the XPath you pass to node.SelectNodes starts with two forward slashes (//text()), it is evaluated from the document root and matches every text node in the document, not just the descendants of the selected node.
Start the expression with a dot to make it relative to the current node:
foreach (HtmlTextNode n in node.SelectNodes(".//text()"))
Or select the text nodes of the first row in a single XPath:
var texts = doc.DocumentNode.SelectNodes("(//tr)[1]//text()");
foreach (var t in texts)
Console.WriteLine(t.InnerText);
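
The difference between a leading // and a leading .// can be checked with a small sketch. It uses System.Xml's XmlDocument instead of HtmlAgilityPack (an assumption made so it runs without extra packages; the rule that a leading / restarts at the document root is standard XPath 1.0). The markup and the names RelativeXPathDemo and CountFromFirstRow are made up for illustration.

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // Counts the nodes matched by xpath when evaluated from the first <tr>.
    public static int CountFromFirstRow(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode firstRow = doc.SelectSingleNode("//tr");
        return firstRow.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        const string xml =
            "<table><tr><th>first</th></tr><tr><td>second</td></tr></table>";

        // Leading "//" restarts at the document root: both rows are searched.
        Console.WriteLine(CountFromFirstRow(xml, "//text()"));  // prints 2

        // Leading "." keeps the search under the context node: first row only.
        Console.WriteLine(CountFromFirstRow(xml, ".//text()")); // prints 1
    }
}
```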

Grabbing a timesheet HTMLAgilityPack

I need to grab a timesheet from a website. I want to store this timesheet in a DataTable in my C# application.
The structure of the data table looks like this:
| Day | Time  | Status |
| 1   | 7:00  | IN     |
| 1   | 9:45  | OUT    |
| 1   | 10:15 | IN     |
| 1   | 15:45 | OUT    |
| 1   | 8:45  | TOTAL  |
| 2   | ..    | ..     |
My C# code for the DataTable:
DataTable table = new DataTable("Worksheet");
table.Columns.Add("Day");
table.Columns.Add("Time");
table.Columns.Add("Status");
I tried different variants, and every time the data ended up scrambled.
For testing purposes I made a new WinForms form with a textbox (for the file path) and a button (to start the process).
Then I want HTMLAgilityPack to get all the data. One example:
public string[] GREYsource;
public Form1()
{
InitializeComponent();
}
private void btnSubmit_Click(object sender, EventArgs e)
{
var doc = new HtmlAgilityPack.HtmlDocument();
var fileName = txtPath.Text; // I downloaded the HTML-File
doc.Load(fileName);
string strGREYInner;
foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//tr[@class=\"tblDataGreyNH\"]"))
{
strGREYInner = td.InnerText.Trim();
string shorted = strGREYInner.Replace("\t", "");
string shorted2 = shorted.Replace("\n\n\n\n", "\n\n\n");
string shorted3 = shorted2.Replace("\n\n\n", "\n\n");
string shorted4 = shorted3.Replace("\n\n", "\n");
GREYsource = shorted4.Split(new Char[] { '\n', });
}
foreach (string str in GREYsource)
{
...
}
}
Problem: the result contains a lot of tabs (\t) and newlines (\n) that I need to trim.
Problem: this isn't a good way to do it, in my opinion, and it only grabs the total times. It can be done better.
This is just one example I tried (my other attempts turned into a pile of junk).
The HTML structure, in more depth:
<html>
<head>
</head>
<style type="text/css">
</style>
<body id="body" onload="handleMenuOverlapLogo();onload_column_expand();;firstElementFocus();">
<.. some (java)scripts> /* can be ignored. not necessary */
<.. some other divs> /* can be ignored. not necessary */
<div id="rowContent"> /* This <div> contains the content i need */
<div id="titleTab"> /* Title is not necessary */
</div>
<div id="rowContentInner"> /* Here the content starts */
<table class="tblList">
<tbody>
<tr> /* not necessary */
<tr class="tblHeader"> /* not necessary */
<tr class="tblHeader"> /* not necessary */
<tr class="tblDataWhiteNH"> /* IN : */
<td class="tblHeader" style="font-weight: bold; text-align: right"> In </td>
<td nowrap=""> /* "tblDataWhiteNH" always contains 7 "td nowrap" */
<td nowrap="">
<td nowrap=""> /* Example: if it contains a value */
<table width="100%" border="0" align="center">
<tbody>
<tr>
<td width="25%" align="left"> </td>
<td nowrap="" width="50%" align="center"> 7:53 </td> /* value = 7:53 (THIS!) */
<td width="25%" align="right"> </td>
</tr>
</tbody>
</table>
</td>
<td nowrap="">
<td nowrap=""> /* Example: if it contains no value */
<table width="100%" border="0" align="center">
<tbody>
<tr>
<td width="25%" align="left"> </td>
<td nowrap="" width="50%" align="center"> /* no value = 0:00 (THIS!) */
<td width="25%" align="right"> </td>
</tr>
</tbody>
</table>
</td>
<td nowrap="">
<td nowrap="">
<tr class="tblDataWhiteNH"> /* OUT : */
<td class="tblHeader" style="font-weight: bold; text-align: right"> Out </td>
<td nowrap=""> /* "tblDataWhiteNH" always contains 7 "td nowrap" */
<td nowrap="">
<td nowrap=""> /* Example: if it contains a value */
<table width="100%" border="0" align="center">
<tbody>
<tr>
<td width="25%" align="left"> </td>
<td nowrap="" width="50%" align="center"> 7:53 </td> /* value = 7:53 (THIS!) */
<td width="25%" align="right"> </td>
</tr>
</tbody>
</table>
</td>
<td nowrap="">
<td nowrap=""> /* Example: if it contains no value */
<table width="100%" border="0" align="center">
<tbody>
<tr>
<td width="25%" align="left"> </td>
<td nowrap="" width="50%" align="center"> /* no value = 0:00 (THIS!) */
<td width="25%" align="right"> </td>
</tr>
</tbody>
</table>
</td>
<td nowrap="">
<td nowrap="">
<tr class="tblDataGreyNH"> /* IN : */
<tr class="tblDataGreyNH"> /* OUT : */
... /* "tblDataGreyNH" is built up the same way as "tblDataWhiteNH". */
... /* sometimes there could be more "tblDataWhiteNH" and "tblDataGreyNH". */
... /* Usually there are just the "tblDataWhiteNH" (IN/OUT) */
<tr class="tblHeader"> /* not necessary */
/* It continues, for example, with "tblDataWhite" if the last row above was a "tblDataGrey" */
/* and vice versa ("grey" if there was a "white" before). */
<tr class="tblDataWhiteNH"> /* Worked : */
<td class="tblHeader" style="font-weight: bold; text-align: right"> Total Time </td>
<td> 07:47 </td> /* value = 7:47 (THIS!) */
<td> 04:48 </td>
<td> 00:00 </td> /* no value = 0:00 (THIS!) */
<td> 00:00 </td>
<td> 07:42 </td>
<td> 00:00 </td>
<td> 00:00 </td>
</tr>
<tr class="tblDataGreyNH"> /* Total : */
<td class="tblHeader" style="font-weight: bold; text-align: right"> Regular Time </td>
<td> 07:47 </td> /* value = 7:47 (THIS!) */
<td> 04:48 </td>
<td> </td> /* no value = 0:00 (THIS!) */
<td> </td>
<td> 07:42 </td>
<td> </td>
<td> </td>
</tr>
<tr class="tblHeader"> /* not necessary */
<tr valign="top"> /* not necessary */
</tbody>
</table>
</div>
</div>
</body>
</html>
A copy of the original HTML: http://time.wnb.dk/123/
I hope someone can help me get this to work.
Let me explain it with a picture: https://www.abload.de/img/eeeqnuwu.png
The picture shows the website and, below it, a table showing how the result should look.
Declaring the DataTable isn't the problem.
The main problem is that I can't get HTMLAgilityPack to return the right results, and when it does return something, the output is unreliable.
Some of the SelectNodes expressions I tried produced scrambled output after a while. So far I haven't been able to get all the data from the table on the website, only some values, and those are often wrong.
So I'm looking for someone who could take a look at this and help me find the right SelectNodes expressions.

I'm not sure I fully understand what you want to do, but here is some sample code that should help you get started. I strongly suggest you have a look at XPath to understand it.
HtmlDocument doc = new HtmlDocument();
doc.Load(yourFile);
// get all TR elements with a specific class name, anywhere in the document (//)
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//tr[@class='tblDataGreyNH' or @class='tblDataWhiteNH']"))
{
    // get the first TD directly below the current row that has the header class
    HtmlNode inOrOut = node.SelectSingleNode("td[@class='tblHeader']");
    if (inOrOut != null)
    {
        string io = inOrOut.InnerText.Trim();
        Console.WriteLine(io.ToUpper());
        if (io.Contains("Time"))
        {
            // normalize-space() trims surrounding whitespace (\r, \n, etc.)
            // text() selects the node's text children
            // SelectNodes returns null when nothing matches, so check before looping
            HtmlNodeCollection times = node.SelectNodes("td[normalize-space(@class)='' and normalize-space(text())!='' and normalize-space(text())!='00:00']");
            if (times != null)
            {
                foreach (HtmlNode td in times)
                {
                    Console.WriteLine("value:" + td.InnerText.Trim());
                }
            }
        }
    }
    // get all TD elements directly below the current row that define the NOWRAP attribute
    HtmlNodeCollection tdNoWraps = node.SelectNodes("td[@nowrap]");
    if (tdNoWraps != null)
    {
        foreach (HtmlNode tdNoWrap in tdNoWraps)
        {
            string value = tdNoWrap.InnerText.Trim();
            if (value == string.Empty)
                continue;
            Console.WriteLine("value:" + value);
        }
    }
}
It will output this from your sample page:
IN
value:7:47
value:7:46
value:7:45
value:7:51
OUT
value:15:35
value:15:33
value:12:38
value:8:59
IN
value:12:38
value:8:59
OUT
value:15:35
TOTAL TIME
value:07:48
value:07:47
value:07:50
value:01:08
REGULAR TIME
value:07:48
value:07:47
value:07:50
value:01:08
