parse table with href html agility pack - c#

Hi, I want to parse a table, but I can't get all of the information out of it.
I used the following code, which does not return the href links:
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]//tbody");
foreach (var cell in table.SelectNodes(".//tr/td"))
{
    string someVariable = cell.InnerText;
    Debug.WriteLine(someVariable);
}
I need to get the href values too. How can I do this?
<table>
<tbody>
<tr>
<td class="a1">
<a href="/subtitles/joker-2019/farsi_persian/2110062">
<span class="l r positive-icon">
Farsi/Persian
</span>
<span>
Joker.2019.WEBRip.XviD.MP3-SHITBOX
</span>
</a>
</td>
<td class="a3">
</td>
<td class="a40">
</td>
<td class="a5">
<a href="/u/695804">
meisam_t72
</a>
</td>
<td class="a6">
<div>
►► زیرنویس از میثم ططری - ویرایش شده ◄◄ - meisam_t72 کانال تلگرام </div>
</td>
</tr>
</tbody>
</table>

Inside your foreach, check whether the cell contains an <a> tag. If it does, read the href attribute from that tag.
Something like this (untested):
foreach (var cell in table.SelectNodes(".//tr/td"))
{
    string someVariable = cell.InnerText;
    Debug.WriteLine(someVariable);
    // SelectNodes returns null when the cell has no <a> descendants.
    var links = cell.SelectNodes(".//a");
    if (links == null || !links.Any())
    {
        continue;
    }
    foreach (var link in links)
    {
        // GetAttributeValue returns the fallback instead of throwing when href is missing.
        var href = link.GetAttributeValue("href", string.Empty);
        // do whatever you want with the link.
    }
}
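A self-contained version of the same idea might look like the sketch below. It assumes the HtmlAgilityPack NuGet package and uses a trimmed copy of the question's table as inline sample markup; the html constant and the hrefs list are names made up for this example. It filters on a[@href] and reads the attribute with GetAttributeValue, so an anchor without an href does not throw.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Sample markup, trimmed from the table in the question.
const string html = @"
<table><tbody><tr>
  <td class=""a1""><a href=""/subtitles/joker-2019/farsi_persian/2110062""><span>Farsi/Persian</span></a></td>
  <td class=""a3""></td>
  <td class=""a5""><a href=""/u/695804"">meisam_t72</a></td>
</tr></tbody></table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var hrefs = new List<string>();

// SelectNodes returns null (not an empty collection) when nothing matches.
var cells = doc.DocumentNode.SelectNodes("//table[1]//tr/td");
if (cells != null)
{
    foreach (var cell in cells)
    {
        Console.WriteLine(cell.InnerText.Trim());

        // Only anchors that actually carry an href attribute.
        var links = cell.SelectNodes(".//a[@href]");
        if (links == null) continue;

        foreach (var link in links)
        {
            // Falls back to an empty string instead of throwing.
            hrefs.Add(link.GetAttributeValue("href", string.Empty));
        }
    }
}

Console.WriteLine(string.Join(Environment.NewLine, hrefs));
```

The hrefs in this page are relative, so they would still need to be combined with the site's base address before they can be requested.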

Related

Blazor App Not Using Nested If Statements Properly

I know, I suck because I am using a table and not div tags but I could not get the div tags to display properly and my deadline was some time last week...
I am trying to lay out a bunch of devices along with their statuses and other simple options, and yet I cannot get the if statements to work. Here is my code:
@if (CurrentSystem == null)
{
    <p><em>Loading...</em></p>
}
else
{
    @foreach (Device thisDevice in CurrentSystem.LocalDevices)
    {
        menuCounter++;
        divCounter++;
        if (divCounter == 1)
        {
            //Starting with the first column
            <tr><td class=cardBox>
        }
        else
        {
            //Starting with the last column
            <tr><td class=outSideColumns></td>
            <td></td>
            <td class=cardBox>
        }
        targetName = "target" + @menuCounter;
        targetNameWithDot = "." + @targetName;
        menuId = "contextMenu" + @menuCounter;
        modalID = "modalWindow" + @menuCounter;
        <table>
            <tr>
                <td></td>
                <td>
                    <div class="targetName" style="text-align:right;justify-content:right;">
                        ...
                    </div>
                </td>
            </tr>
            <tr>
                <td colspan="2" align="center">
                    <h3>
                        thisDevice.DeviceName
                    </h3>
                    <img src="@ReturnDeviceImage(thisDevice)" class=deviceImage />
                    @if (thisDevice.CurrentStatus == Device.DeviceStatus.Alert)
                    {
                        <h4 Class="cardAlert">Status: Alert</h4>
                    }
                    else if (thisDevice.CurrentStatus == Device.DeviceStatus.Inactive)
                    {
                        <h4 Class="cardInactive">Status: Inactive</h4>
                    }
                    else if (thisDevice.CurrentStatus == Device.DeviceStatus.Unknown)
                    {
                        <h4 Class="cardUnknown">Status: Unknown</h4>
                    }
                    else if (thisDevice.CurrentStatus == Device.DeviceStatus.Normal)
                    {
                        <h4 Class=cardNormal>Status: Normal</h4>
                    }
                    else if (thisDevice.CurrentStatus == Device.DeviceStatus.Updating)
                    {
                        <h4 Class=cardUpdating>Status: Normal</h4>
                    }
                    else
                    {
                        <h4>Status: thisDevice.CurrentStatus</h4>
                    }
                    <TelerikContextMenu IdField="@menuId" Selector="@targetNameWithDot" Data="@MenuItems" OnClick="@((ContextMenuItem item) => OnItemClick(item, @thisDevice.DeviceIDHash))">
                    </TelerikContextMenu>
                </td>
            </tr>
            <tr>
                <td class=nameDivs>
                    Device Type:
                </td>
                <td>
                    @thisDevice.DeviceType
                </td>
            </tr>
            <tr>
                <td class=nameDivs>
                    Hostname:
                </td>
                <td>
                    @thisDevice.DeviceHostname
                </td>
            </tr>
            <tr>
                <td class=nameDivs>
                    Communications:
                </td>
                @if (thisDevice.UsingEncryption)
                {
                    <td class=cardNormal>Are Encrypted</td>
                }
                else
                {
                    <td> class=cardAlert>Are Not Encrypted</td>
                }
            </tr>
            <tr>
                <td class=nameDivs>
                    Anomaly Response Level:
                </td>
                <td>
                    @thisDevice.AnomalyResponse
                </td>
            </tr>
        </table>
        if (divCounter == 1)
        {
            //Ending the first column
            </td>
            <td></td>
            <td class=outSideColumns></td>
            </tr>
        }
        else
        {
            //Ending the last column
            </td></tr>
            divCounter = 0;
        }
    }
    </table>
}
The beginning if statement and the if statements that run the CurrentStatus and UsingEncryption seem to be working, however the last if statement is simply writing text to the screen.
If I add @ signs to the first and/or last if statements, I get a ton of errors about not having closing tags, objects not being defined, etc...
If I remove the @ signs from the CurrentStatus and UsingEncryption if statements, those statements stop working.
If I remove the @ from the foreach statement, nothing prints out.
What am I doing wrong?!?
To use tag helpers, your HTML structure must mirror your control flow. You can't open a tag inside an if block without also closing it inside that same block.
Razor doesn't understand unclosed HTML tags. You can escape unmatched tags with @:, but with a little effort you can eliminate the unmatched tags altogether:
<tr>
@if (divCounter != 1)
{
    //Starting with the last column
    <td class=outSideColumns></td>
    <td></td>
}
<td class=cardBox>
While Jeremy provided an excellent answer, I ended up running into issues where I absolutely needed open tags and, in fact, I had to rewrite everything back into div tags to make things work.
That is when I discovered my new bestest friend, the MarkupString, which I can use to insert any HTML I want without blowing up the IDE!
Here is a link that explains how to use it: https://www.meziantou.net/rendering-raw-unescaped-html-in-blazor.htm

Select specific html with "Html Agility pack"

I'm using html-agility-pack and trying to select a specific piece of HTML.
The part I want to get is every GTIN number in blocks like this:
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
The part I want is the number after the closing span tag, e.g. 07330155011068. Below are my HTML and my C# method:
<div class="table-wrapper" style='display: block;'>
<table id="tableSearchArticle">
<thead>
<tr>
<th>Article</th>
<th>art.nr.</th>
<th>Brand</th>
<th>GTIN</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/121308" target="_blank">
Dalapannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11068</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
</tr>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/124494" target="_blank">
Dessertpannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11405</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155114059</td>
</tr>
</tbody>
</table>
</div>
I'm using this method to try to get my values. The problem is that I don't know what to pass to SelectNodes() to get the inner HTML containing the GTIN numbers.
public void TestGetHtml()
{
    var doc = new HtmlDocument();
    doc.Load("C:/Users/Desktop/test.html");
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("TODO: Add code to select all GTIN"))
    {
    }
    doc.Save("file.htm");
}
Use XPath to select the fourth cell of every row in the body of the table with id tableSearchArticle. Then take the inner text of each cell (it comes back without HTML tags, like GTIN:07330155114059) and remove the GTIN: prefix:
var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
var gtins = doc.DocumentNode.SelectNodes(xpath)
    .Select(td => td.InnerText.Replace("GTIN:", ""));
Output:
[
"07330155011068",
"07330155114059"
]
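To try this without the original file, the sketch below loads two rows from the question as an inline string. It assumes the HtmlAgilityPack NuGet package; the html constant and the gtins list are names chosen for this example. It also guards against SelectNodes returning null, which happens when the XPath matches nothing (for instance, if the page has no table with that id).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Two rows copied from the question's table, shortened to the relevant cells.
const string html = @"
<table id=""tableSearchArticle"">
  <tbody>
    <tr>
      <td><a href=""http://www.dabas.com/ProductSheet/Detail.ashx/121308"">Dalapannkaka</a></td>
      <td><span class=""mobile-only"">Tillverkarens art.nr:</span>11068</td>
      <td><span class=""mobile-only"">Varumärke:</span>test</td>
      <td><span class=""mobile-only"">GTIN:</span>07330155011068</td>
    </tr>
    <tr>
      <td><a href=""http://www.dabas.com/ProductSheet/Detail.ashx/124494"">Dessertpannkaka</a></td>
      <td><span class=""mobile-only"">Tillverkarens art.nr:</span>11405</td>
      <td><span class=""mobile-only"">Varumärke:</span>test</td>
      <td><span class=""mobile-only"">GTIN:</span>07330155114059</td>
    </tr>
  </tbody>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var gtins = new List<string>();

// Fourth cell of every body row; null when nothing matches.
var cells = doc.DocumentNode.SelectNodes("//table[@id='tableSearchArticle']/tbody/tr/td[4]");
if (cells != null)
{
    gtins = cells
        .Select(td => td.InnerText.Replace("GTIN:", "").Trim())
        .ToList();
}

foreach (var gtin in gtins)
{
    Console.WriteLine(gtin);
}
```

The values are kept as strings on purpose: parsing them as numbers would drop the leading zero.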
SelectNodes takes an XPath expression, so you could start with this (untested):
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes(
    "//div[@class='table-wrapper']/table[@id='tableSearchArticle']/tbody/tr"))
{
    Console.WriteLine(tr.InnerHtml);
    Console.WriteLine(tr.SelectSingleNode(".//a").GetAttributeValue("href", ""));
    Console.WriteLine(tr.SelectSingleNode(".//td[last()]").InnerText);
}

Get <a href="https://www.google.se/"> address with html agility pack

I have this HTML:
<div class="table-wrapper" style='display: block;'>
<table id="tableSearchArticle">
<thead>
<tr>
<th>Article</th>
<th>art.nr.</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/121308" target="_blank">
Apple
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11068</td>
</tr>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/124494" target="_blank">
Banana
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11405</td>
</tr>
</tbody>
</table>
And this is my method that is supposed to get all the a href addresses in the table. Right now I only get a list of article names: my list returns Apple, Banana. I want to return a list of the href addresses instead. How can I do that?
public List<string> GetListOfHrefs()
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen");
    var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[1]//@href";
    var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
        .Select(td => td.InnerText.Replace("GTIN:", "")).ToList();
    return listOfGtins;
}
There are two problems with your XPath: href is an attribute of the a element, not of the td element, and SelectNodes returns nodes (HtmlNode objects), not attribute values, so you cannot use it to select an attribute directly. Select the a elements that have an href and read the attribute from each one:
var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td/a[@href]";
var links = doc.DocumentNode.SelectNodes(xpath)
    .Select(a => a.Attributes["href"].Value);
Output:
[
"http://www.dabas.com/ProductSheet/Detail.ashx/121308",
"http://www.dabas.com/ProductSheet/Detail.ashx/124494"
]

html agility how to process table in a hyperlink

I am working to get some information from a html table which has many rows like this. The given row is like one piece of info in a table cell. I need to get link, artist name, artist type from this table.
<a href="http://somesite/music/view_album.php?albumid=6468" style="color:#000;" sl-processed="1">
<table width="100%" border="0" bgcolor="#FFFFFF">
<tbody><tr>
<td colspan="2" align="left" valign="top" style="color:#900;">album title</td>
</tr>
<tr> <td width="31%" align="left" valign="top"> <img src="./albums_files/No_cover.png" width="90" height="80" border="0">
</td>
<td width="69%" align="left" valign="top">
<a class="leftcat" href="http://somelink/toartiset" sl-processed="1"> <strong>Rizwan-Muazzam</strong>
</a>
<br>
(<a class="leftcat" href="http://linktoartisttype/" sl-processed="1">
Some Artist Type </a>) <br>
<span class="leftcat">
Rated +: 0<br>
Rated -: 0 </span>
</td>
</tr>
<tr> <td valign="top" align="center" colspan="2">
</td> </tr>
</tbody></table>
</a>
I have done this:
HtmlDocument doc = new HtmlWeb().Load(albumUrl);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
This gives me all the links I need. Now I want to get all the child information under each hyperlink.
Help will be appreciated.
Regards
Parminder
I would suggest looping through all the rows, then selecting the links in each row and extracting the info from them:
var rows = doc.DocumentNode.SelectNodes("//tr");
foreach (var row in rows)
{
    var links = row.SelectNodes(".//a");
    // Skip rows without both links; SelectNodes returns null when nothing matches.
    if (links == null || links.Count < 2) continue;
    var artistLink = links[0].Attributes["href"].Value;
    var artistName = links[0].SelectSingleNode(".//strong/text()").InnerText;
    var artistTypeLink = links[1].Attributes["href"].Value;
    var artistTypeName = links[1].SelectSingleNode(".//text()").InnerText;
    // Store the results...
}
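A runnable variant of that loop could look like the sketch below. It assumes the HtmlAgilityPack NuGet package and a reduced copy of the question's markup (the outer hyperlink and the image are left out); the html constant, the albums list, and the tuple field names are made up for this example. It matches on the leftcat class from the question and skips rows that do not contain both links, such as the title row.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Reduced sample: a title row without links, then a row with the two links.
const string html = @"
<table><tbody>
  <tr><td colspan=""2"">album title</td></tr>
  <tr><td>
    <a class=""leftcat"" href=""http://somelink/toartiset""><strong>Rizwan-Muazzam</strong></a>
    (<a class=""leftcat"" href=""http://linktoartisttype/"">Some Artist Type</a>)
  </td></tr>
</tbody></table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var albums = new List<(string ArtistLink, string ArtistName, string TypeLink, string TypeName)>();

var rows = doc.DocumentNode.SelectNodes("//tr");
if (rows != null)
{
    foreach (var row in rows)
    {
        // Null when the row has no matching links, so check before indexing.
        var links = row.SelectNodes(".//a[@class='leftcat']");
        if (links == null || links.Count < 2) continue;

        albums.Add((
            links[0].GetAttributeValue("href", string.Empty),
            links[0].InnerText.Trim(),
            links[1].GetAttributeValue("href", string.Empty),
            links[1].InnerText.Trim()));
    }
}

foreach (var album in albums)
{
    Console.WriteLine($"{album.ArtistName} ({album.TypeName}): {album.ArtistLink}");
}
```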

Parsing error in cshtml file in ASP.NET MVC4

In the view, I am trying to display at most 5 images in a row.
The idea is to introduce a new row by using the </tr><tr> html tags to close the present row and start a new one as shown below, but this gives a Parser error.
Parser Error Message: The code block is missing a closing "}"
character. Make sure you have a matching "}" character for all the
"{" characters within this block, and that none of the "}" characters
are being interpreted as markup.
How can I correct this?
<table>
<tr>
@{
    int indx = 0;
    foreach(var item in Model) {
        indx++;
        <td>
            <a href ="@Url.Action("ShowPic", "ViewPhotos", new { id = item.ID })">
                <img src="@String.Format("data:image/jpg;base64,{0}", Convert.ToBase64String(item.Image))" />
            </a>
            <br />
            @Html.DisplayFor(modelItem => item.Caption)
        </td>
        if(indx%5==0) {
            </tr><tr><!--Error here-->
        }
    }
}
</tr>
</table>
Thanks.
Try changing the line in question to this:
@: </tr><tr><!--Error here-->
Because the tags on that line are not balanced, Razor cannot work out where the content block inside the if starts and ends. By using @: we indicate that the rest of the line should be treated as content.
Your first <tr> and last </tr> are outside the scope of their matching closing/opening tags:
<table>
@{
    <tr>
    int indx = 0; foreach (var item in Model) { indx++;
        <td>
            <a href ="@Url.Action("ShowPic", "ViewPhotos", new { id = item.ID })">
                <img src="@String.Format("data:image/jpg;base64,{0}", Convert.ToBase64String(item.Image))" />
            </a>
            <br />
            @Html.DisplayFor(modelItem => item.Caption)
        </td>
        if(indx%5==0) {
            </tr>
            <tr>
    } }
    </tr>
}
</table>
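An alternative that avoids unbalanced tags altogether, assuming you are free to reshape the data before rendering, is to split the items into rows of five in C# first. The view can then use two nested foreach loops, each of which opens and closes its own <tr> or <td>. The sketch below shows only the grouping step; the captions list is made-up sample data standing in for the Model, and perRow is a name chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up sample data standing in for the view's Model.
List<string> captions = Enumerable.Range(1, 12).Select(i => $"photo {i}").ToList();

const int perRow = 5;

// Group items by (index / perRow) so each group holds at most five items.
List<List<string>> rows = captions
    .Select((caption, index) => new { caption, index })
    .GroupBy(x => x.index / perRow)
    .Select(g => g.Select(x => x.caption).ToList())
    .ToList();

foreach (var row in rows)
{
    Console.WriteLine(string.Join(" | ", row));
}
```

With twelve items this yields three rows: two full rows of five and a final row of two.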
