Select specific html with "Html Agility pack" - c#

I'm using Html Agility Pack and trying to select a specific piece of HTML.
The part I want is every GTIN number in blocks like this:
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
The part I want is the number after the closing span tag, e.g. 07330155011068. Below are my HTML and my C# method:
<div class="table-wrapper" style='display: block;'>
<table id="tableSearchArticle">
<thead>
<tr>
<th>Article</th>
<th>art.nr.</th>
<th>Brand</th>
<th>GTIN</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/121308" target="_blank">
Dalapannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11068</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
</tr>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/124494" target="_blank">
Dessertpannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11405</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155114059</td>
</tr>
</tbody>
</table>
</div>
I'm using this method to try to get my values. The problem is that I don't know what XPath to pass to SelectNodes() to get the inner HTML containing the GTIN numbers.
public void TestGetHtml()
{
    var doc = new HtmlDocument();
    doc.Load("C:/Users/Desktop/test.html");
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("TODO: Add code to select all GTIN"))
    {
    }
    doc.Save("file.htm");
}

Use XPath to select the fourth cell of each row in the body of the table with id tableSearchArticle. Then take each cell's inner text (which contains no HTML tags, e.g. GTIN:07330155114059) and remove the GTIN: prefix:
var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
var gtins = doc.DocumentNode.SelectNodes(xpath)
    .Select(td => td.InnerText.Replace("GTIN:", ""));
Output:
[
"07330155011068",
"07330155114059"
]
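One caveat worth noting: SelectNodes returns null rather than an empty collection when nothing matches, so chaining .Select directly onto it can throw if the table is missing. Below is a minimal sketch with a guard, assuming the HtmlAgilityPack NuGet package is referenced. The inline `html` string is illustrative markup trimmed from the question, not the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Illustrative markup, trimmed from the table in the question.
var html = @"
<table id=""tableSearchArticle"">
  <tbody>
    <tr>
      <td>Article</td><td>11068</td><td>test</td>
      <td><span class=""mobile-only"">GTIN:</span>07330155011068</td>
    </tr>
  </tbody>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null (not an empty list) when nothing matches.
var cells = doc.DocumentNode.SelectNodes(
    "//table[@id='tableSearchArticle']/tbody/tr/td[4]");

// Fall back to an empty list so callers never see null.
List<string> gtins = cells == null
    ? new List<string>()
    : cells.Select(td => td.InnerText.Replace("GTIN:", "").Trim()).ToList();

foreach (var gtin in gtins)
{
    Console.WriteLine(gtin);
}
```

The Trim() call is only a precaution against whitespace around the number in real pages.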

SelectNodes takes an XPath expression, so you could start with this (untested):
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes(
    "//div[@class='table-wrapper']/table[@id='tableSearchArticle']/tbody/tr"))
{
    Console.WriteLine(tr.InnerHtml);
    Console.WriteLine(tr.SelectSingleNode(".//a").GetAttributeValue("href", ""));
    Console.WriteLine(tr.SelectSingleNode(".//td[last()]").InnerText);
}

Related

parse table with href html agility pack

Hi, I want to parse a table, but I can't get all of the information.
I used the following code, which does not return the href link:
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]//tbody");
foreach (var cell in table.SelectNodes(".//tr/td"))
{
    string someVariable = cell.InnerText;
    Debug.WriteLine(someVariable);
}
I need to get the href too. How can I do this?
<table>
<tbody>
<tr>
<td class="a1">
<a href="/subtitles/joker-2019/farsi_persian/2110062">
<span class="l r positive-icon">
Farsi/Persian
</span>
<span>
Joker.2019.WEBRip.XviD.MP3-SHITBOX
</span>
</a>
</td>
<td class="a3">
</td>
<td class="a40">
</td>
<td class="a5">
<a href="/u/695804">
meisam_t72
</a>
</td>
<td class="a6">
<div>
►► زیرنویس از میثم ططری - ویرایش شده ◄◄ - meisam_t72 کانال تلگرام </div>
</td>
</tr>
</tbody>
</table>
Inside your foreach, check whether the cell contains an <a> tag. If it does, read the href attribute from that tag.
Something like this (untested):
foreach (var cell in table.SelectNodes(".//tr/td"))
{
    string someVariable = cell.InnerText;
    Debug.WriteLine(someVariable);
    // SelectNodes returns null when the cell contains no <a> elements.
    var links = cell.SelectNodes(".//a");
    if (links == null || !links.Any())
    {
        continue;
    }
    foreach (var link in links)
    {
        var href = link.Attributes["href"].Value;
        // do whatever you want with the link.
    }
}

Scrape html located directly below div

I have some html and want to scrape some data from it.
The HTML is structured in the following way
<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
<tbody>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
</tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
<tbody>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
</tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>
I need to be able to scrape the Text value located in the span where class="someOtherClass" (I've already implemented this portion).
I then need to be able to scrape the table directly below the div. Since the "parent" div doesn't actually contain the table, I'm having some issues implementing this.
Regarding "I need to be able to scrape the Text value located in the span":
You don't need regex. An XPath query is enough.
var text = doc.DocumentNode
    .SelectNodes("//span[@class='someOtherClass']")
    .Select(x => x.InnerText)
    .ToList();
Regarding "I then need to be able to scrape the table directly below the div":
Use a similar XPath:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);
var tables = doc.DocumentNode
    .SelectNodes("//span[@class='someOtherClass']/following::table").ToList();
foreach (var table in tables)
{
    var list = table.Descendants("tr")
        .Select(tr => tr.Descendants("td")
            .Select(td => td.InnerText).ToList())
        .ToList();
}
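Note that following::table collects every table after a matching span, so the link between each heading and its own table is lost. If that pairing matters, one option is to start from each div and ask for its immediately following sibling element, keeping it only when it is a table. This is a minimal sketch under that assumption; the inline `html` string and the "First"/"Second" texts are illustrative, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Illustrative markup shaped like the question's HTML.
var html = @"
<div class=""someClass""><span class=""someOtherClass"">First</span></div>
<table><tbody>
  <tr><td>label</td><td>data</td></tr>
  <tr><td>label</td><td>data</td></tr>
</tbody></table>
<div class=""someClass""><span class=""someOtherClass"">Second</span></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var divs = doc.DocumentNode.SelectNodes("//div[@class='someClass']");
if (divs != null)
{
    foreach (var div in divs)
    {
        var title = div.SelectSingleNode("./span[@class='someOtherClass']")?.InnerText;

        // First following sibling element, kept only if it is a <table>.
        // Null when the div is not directly followed by a table.
        var table = div.SelectSingleNode("following-sibling::*[1][self::table]");
        if (table == null)
        {
            Console.WriteLine($"{title}: no table directly below");
            continue;
        }

        // One list of cell texts per row.
        var rows = table.Descendants("tr")
            .Select(tr => tr.Descendants("td").Select(td => td.InnerText.Trim()).ToList())
            .ToList();

        Console.WriteLine($"{title}: {rows.Count} row(s)");
    }
}
```

Using following-sibling::*[1][self::table] instead of following-sibling::table[1] avoids picking up a table that belongs to a later div when a div has no table of its own.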

Get <a href="https://www.google.se/"> address with Html Agility Pack

I have this HTML:
<div class="table-wrapper" style='display: block;'>
<table id="tableSearchArticle">
<thead>
<tr>
<th>Article</th>
<th>art.nr.</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/121308" target="_blank">
Apple
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11068</td>
</tr>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/124494" target="_blank">
Banana
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11405</td>
</tr>
</tbody>
</table>
This is my method, which is supposed to get all the a href addresses in the table. Right now I only get a list of article names: my list returns Apple, Banana. I want to return a list of the href URLs instead. How can I do that?
public List<string> GetListOfHrefs()
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen");
    var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[1]//@href";
    var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
        .Select(td => td.InnerText.Replace("GTIN:", "")).ToList();
    return listOfGtins;
}
There are two problems with your XPath: href is an attribute of the a element, not of the td element, and SelectNodes returns elements rather than attribute values. Select the a elements and then read href from each one:
var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td/a[@href]";
var links = doc.DocumentNode.SelectNodes(xpath)
    .Select(a => a.Attributes["href"].Value);
Output:
[
"http://www.dabas.com/ProductSheet/Detail.ashx/121308",
"http://www.dabas.com/ProductSheet/Detail.ashx/124494"
]

Retrieve element within html hierarchy

I have this piece of html code. I want to get the text inside the <div> tag using WatiN. The C# code is below, but I'm pretty sure it could be done way better than my solution. Any suggestions?
HTML:
<table id="someId" cellspacing="0" border="1" style="border-collapse:collapse;" rules="all">
<tbody>
<tr>
<th scope="col"> </th>
</tr>
<tr>
<td>
<div>Some text</div>
</td>
</tr>
</tbody>
</table>
C#
// Get the table ElementContainer
IElementContainer diagnosisElementContainer = (IElementContainer)_control.GetElementById("someId");
// Get the tbody element
IElementContainer tbodyElementContainer = (IElementContainer)diagnosisElementContainer.ChildrenWithTag("tbody");
// Get the <tr> children
ElementCollection trElementContainer = tbodyElementContainer.ChildrenWithTag("tr");
// Get the <td> child of the last <tr>
IElementContainer tdElementContainer = (IElementContainer)trElementContainer.ElementAt<Element>(trElementContainer.Count - 1);
// Get the <div> element inside the <td>
Element divElement = tdElementContainer.Divs[0];
Based on the HTML given, this is how I'd do it in IE:
IE myIE = new IE();
myIE.GoTo("[theurl]");
string theText = myIE.Table("someId").Divs[0].Text;
The above is working on WatiN 2.1, Win7, IE9.

Html Agility Pack - loop through rows and columns

How can I loop through a table and its rows, identified by an id or name attribute, to get the inner text of each td cell? I'm working with ASP.NET, C#, and the newest Html Agility Pack. Please advise. Thank you.
An HTML file has several tables. One of them has the attribute id=main-part. In that table there are many rows, and some of those rows share the attribute name=display. Those named rows contain many columns from which I need to extract text. Something like this:
<body>
<table>
...
</table>
<table>
...
</table>
<table id="main-part">
<tr>
<td></td>
...
</tr>
<tr>
<td></td>
...
</tr>
<tr name="display">
<td>Jan</td>
<td>Feb</td>
<td>Mar</td>
...
</tr>
<tr name="display">
<td>Apr</td>
<td>May</td>
<td>June</td>
...
</tr>
<tr name="display">
<td>Jul</td>
<td>Aug</td>
<td>Sep</td>
...
</tr>
<tr>
<td></td>
...
</tr>
<tr name="display">
<td>Oct</td>
<td>Nov</td>
<td>Dec</td>
...
</tr>
<tr>
<td></td>
...
</tr>
</table>
<table>
...
</table>
</body>
You need to select these nodes using XPath:
foreach (HtmlNode cell in doc.DocumentNode.SelectNodes("//tr[@name='display']/td"))
{
    // get cell data
}
It worked! Thank you very much Oded.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:/samplefolder/sample.htm");
foreach (HtmlNode cell in doc.DocumentNode.SelectNodes("//tr[@name='display']/td"))
{
    string test = cell.InnerText;
    Response.Write(test);
}
It showed a result like JanFebMarAprMayJuneJulAugSepOctNovDec. How can I separate the values with a space or a tab? Thank you.
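One way to separate the values is to collect the cell texts for each row and join them with a delimiter using string.Join. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The inline `html` string is a shortened, illustrative version of the table above, and Console.WriteLine stands in for Response.Write.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Shortened, illustrative version of the table in the question.
var html = @"
<table id=""main-part"">
  <tr><td></td></tr>
  <tr name=""display""><td>Jan</td><td>Feb</td><td>Mar</td></tr>
  <tr name=""display""><td>Apr</td><td>May</td><td>June</td></tr>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Select the named rows first, then the cells of each row,
// so the values can be grouped one row per line.
var rows = doc.DocumentNode.SelectNodes("//table[@id='main-part']//tr[@name='display']");
if (rows != null)
{
    foreach (var row in rows)
    {
        var cells = row.SelectNodes("./td");
        if (cells == null)
        {
            continue; // row without any <td> cells
        }

        // Tab-separated; use " " instead of "\t" for spaces.
        Console.WriteLine(string.Join("\t", cells.Select(td => td.InnerText.Trim())));
    }
}
```

Selecting rows and then cells keeps the row boundaries, which a single //tr[@name='display']/td query flattens away.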
