I'm having this html:
<div class="table-wrapper" style='display: block;'>
<table id="tableSearchArticle">
<thead>
<tr>
<th>Article</th>
<th>art.nr.</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/121308" target="_blank">
Apple
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11068</td>
</tr>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/124494" target="_blank">
Banana
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11405</td>
</tr>
</tbody>
</table>
And this is my method that is supposed to get all a href adresses in the table. But Now I only get a list of Article name. My list returns Apple, Banana. I want to return a list of the a href - http-adresses. How can I do that?
public List<string> GetListOfHrefs()
{
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen");
var xpath = "//table[#id='tableSearchArticle']/tbody/tr/td[1]//#href";
var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
.Select(td => td.InnerText.Replace("GTIN:", "")).ToList();
return listOfGtins;
}
Two problems in your XPath - href is attribute of a element, not of td element, and you cannot select attributes with XPath - you should select elements:
var xpath = "//table[#id='tableSearchArticle']/tbody/tr/td/a[#href]";
var links = doc.DocumentNode.SelectNodes(xpath)
.Select(a => a.Attributes["href"].Value);
Output:
[
"http://www.dabas.com/ProductSheet/Detail.ashx/121308",
"http://www.dabas.com/ProductSheet/Detail.ashx/124494"
]
Related
hi i want to parse table but I can't get the information completely
I used the following code that does not return the href link
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]//tbody");
foreach (var cell in table.SelectNodes(".//tr/td"))
{
string someVariable = cell.InnerText;
Debug.WriteLine(someVariable);
}
i need to get href too, how can i do this?
<table>
<tbody>
<tr>
<td class="a1">
<a href="/subtitles/joker-2019/farsi_persian/2110062">
<span class="l r positive-icon">
Farsi/Persian
</span>
<span>
Joker.2019.WEBRip.XviD.MP3-SHITBOX
</span>
</a>
</td>
<td class="a3">
</td>
<td class="a40">
</td>
<td class="a5">
<a href="/u/695804">
meisam_t72
</a>
</td>
<td class="a6">
<div>
►► زیرنویس از میثم ططری - ویرایش شده ◄◄ - meisam_t72 کانال تلگرام </div>
</td>
</tr>
</tbody>
</table>
Inside your foreach you need to check if the content of your cell contains a <a> tag. If it contains just get the attribute href from this tag.
Something like this (untested)
foreach (var cell in table.SelectNodes(".//tr/td"))
{
string someVariable = cell.InnerText;
Debug.WriteLine(someVariable);
var links = cell.SelectNodes(".//a");
if (links == null || !links.Any())
{
continue;
}
foreach (var link in links)
{
var href = link.Attributes["href"].Value;
// do whatever you want with the link.
}
}
I have some html and want to scrape some data from it.
The HTML is structured in the following way
<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
<tbody>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
</tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
<tbody>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
</tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>
I need to be able to scrape the Text value located in the span where class="someOtherClass" (I've already implemented this portion)
I then need to be able to scrape the table directly below the div. Since the "parent" div doesn't actually contain the table, I'm having some issues implementing this.
I need to be able to scrape the Text value located in the span
You don't need regex. An Xpath query is enough.
var text = doc.DocumentNode
.SelectNodes("//span[#class='someOtherClass']")
.Select(x => x.InnerText)
.ToList();
I then need to be able to scrape the table directly below the div.
using a similar xpath
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);
var tables = doc.DocumentNode
.SelectNodes("//span[#class='someOtherClass']/following::table").ToList();
foreach (var table in tables)
{
var list = table.Descendants("tr")
.Select(tr => tr.Descendants("td")
.Select(td => td.InnerText).ToList())
.ToList();
}
I'm using html-agility-pack and trying to select out a specific html in it.
The part I want to get is every GTIN-number in these blocks:
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
-The part I want is the numbers after the ending span-tag. Ex: 07330155011068. Below is my html, and my c#-method:
<div class="table-wrapper" style='display: block;'>
<table id="tableSearchArticle">
<thead>
<tr>
<th>Article</th>
<th>art.nr.</th>
<th>Brand</th>
<th>GTIN</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/121308" target="_blank">
Dalapannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11068</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
</tr>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/124494" target="_blank">
Dessertpannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11405</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155114059</td>
</tr>
</tbody>
</table>
</div>
And I'm using this method to trying to get my values. The problem is I don't know what code to write in the SelectNode() to get the innerHtml containing the GTIN-numbers.
public void TestGetHtml()
{
var doc = new HtmlDocument();
doc.Load("C:/Users/Desktop/test.html");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("TODO: Add code to select all GTIN"))
{
}
doc.Save("file.htm");
}
Use Xpath to select fourth cells from body of table with id tableSearchArticle. Then get inner text of cells (it will be without html tags, like GTIN:07330155114059) and remove GTIN prefix:
var xpath = "//table[#id='tableSearchArticle']/tbody/tr/td[4]";
var gtins = doc.DocumentNode.SelectNodes(xpath)
.Select(td => td.InnerText.Replace("GTIN:", ""));
Output:
[
"07330155011068",
"07330155114059"
]
SelectNodes receives an Xpath expression. So, you could start with this (untested):
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes(
"//div[#class='table-wrapper']/table[#id='tableSearchArticle']/tbody/tr"))
{
Console.WriteLine(tr.InnerHtml);
Console.WriteLine(tr.SelectSingleNode(".//a").GetAttribute("href"));
Console.WriteLine(tr.SelectSingleNode(".//td[last()]").InnerText);
}
I am working to get some information from a html table which has many rows like this. The given row is like one piece of info in a table cell. I need to get link, artist name, artist type from this table.
<a href="http://somesite/music/view_album.php?albumid=6468" style="color:#000;" sl-processed="1">
<table width="100%" border="0" bgcolor="#FFFFFF">
<tbody><tr>
<td colspan="2" align="left" valign="top" style="color:#900;">album title</td>
</tr>
<tr> <td width="31%" align="left" valign="top"> <img src="./albums_files/No_cover.png" width="90" height="80" border="0">
</td>
<td width="69%" align="left" valign="top">
<a class="leftcat" href="http://somelink/toartiset" sl-processed="1"> <strong>Rizwan-Muazzam</strong>
</a>
<br>
(<a class="leftcat" href="http://linktoartisttype/" sl-processed="1">
Some Artist Type </a>) <br>
<span class="leftcat">
Rated +: 0<br>
Rated -: 0 </span>
</td>
</tr>
<tr> <td valign="top" align="center" colspan="2">
</td> </tr>
</tbody></table>
</a>
I have done this
HtmlDocument doc = new HtmlDocument();
doc = new HtmlWeb().Load(albumUrl);
var nodes = doc.DocumentNode.SelectNodes("//a[#href]");
this gives me all the links which I need, now I want to get all the child information under the hyperlink.
Help will be appreciated.
Regards
Parminder
I would suggest using a loop to go through all the rows and then select the links and extract the info from them:
var rows = doc.DocumentNode.SelectNodes("//tr");
foreach (var row in rows)
{
var links = row.SelectNodes(".//a");
var artistLink = links[0].Attributes["href"];
var artistName = links[0].SelectSingleNode(".//strong/text()").InnerText;
var artistTypeLink = links[1].Attributes["href"];
var artistTypeName = links[1].SelectSingleNode(".//text()").InnerText;
// Store the results...
}
here's my sample HTML...
<html>
<table class="test" border="0" >
<tr bgColor="#e8f4ff">
<td width="50%" align="right">
<b>Invoice ID:</b>
</td>
<td width="50%">
<b>
1622579
</b>
</td>
</tr>
<tr bgColor="#e8f4ff">
<td align="right">
<b>Code:</b>
</td>
<td>
<b>
20475
</b>
</td>
</tr>
</html>
there's no ID so ican't use SelectNodes()
How can i get the Code: 20475 using HTMLAgilitypack or regex?
Using latest HtmlAgilityPack, just using the document structure - this will not be very resilient to changes in the HTML - you should strongly consider adding appropriate ids (if this is your html anyway):
HtmlDocument doc = new HtmlDocument();
doc.Load(#"test.html");
var tds = doc.DocumentNode.Descendants("td").ToArray();
string codeValue = "";
for (int i = 1; i < tds.Length; i++)
{
if (tds[i - 1].Element("b").InnerText == "Code:")
codeValue = tds[i].Element("b").InnerText;
}