Scrape html located directly below div - c#

I have some html and want to scrape some data from it.
The HTML is structured in the following way
<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
<tbody>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
</tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
<tbody>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
</tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>
I need to be able to scrape the Text value located in the span where class="someOtherClass" (I've already implemented this portion)
I then need to be able to scrape the table directly below the div. Since the "parent" div doesn't actually contain the table, I'm having some issues implementing this.

I need to be able to scrape the Text value located in the span
You don't need regex. An Xpath query is enough.
var text = doc.DocumentNode
.SelectNodes("//span[#class='someOtherClass']")
.Select(x => x.InnerText)
.ToList();
I then need to be able to scrape the table directly below the div.
using a similar xpath
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);
var tables = doc.DocumentNode
.SelectNodes("//span[#class='someOtherClass']/following::table").ToList();
foreach (var table in tables)
{
var list = table.Descendants("tr")
.Select(tr => tr.Descendants("td")
.Select(td => td.InnerText).ToList())
.ToList();
}

Related

Razor get value of input HTML bounded by foreach

I have one following table html, the data of rows in the table is looped by foreach.
Can I use Razor to get the value of each row into a List or an array in C#?
My C# code Razor (I tried this so far)
#{
var l = new List<string>();
l.Add(#<input id="updated_value2" data-bind="value:value,visible:isEditing()" />);
}
Here's my table
<table class="table table-hover">
<tbody data-bind="foreach: $root.mapJsons(parameters())">
<tr class="data-hover">
<td>
<strong>
<span data-bind="text:key" />
</strong>
</td>
<td>
#*display label and input for dictionary<value> false DIS true APP*#
<input id="updated_value" data-bind="value:value,visible:isEditing()" />
<label id="display_value" data-bind="text:value,visible:!isEditing()" />
</td>
</tr>
</tbody>
<thead>
<tr>
<th style="width: 30%">
Name
</th>
<th style="width: 30%">
Value
</th>
<th></th>
</tr>
</thead>
</table>
Your foreach loop is being executed client-side (looks like a KnockoutJS binding?) rather than server-side, so any Razor code you embed in the table is only going to be called once as it's rendered by the server. So the answer is no, you cannot populate a server-side list with this particular foreach loop.

Select specific html with "Html Agility pack"

I'm using html-agility-pack and trying to select out a specific html in it.
The part I want to get is every GTIN-number in these blocks:
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
-The part I want is the numbers after the ending span-tag. Ex: 07330155011068. Below is my html, and my c#-method:
<div class="table-wrapper" style='display: block;'>
<table id="tableSearchArticle">
<thead>
<tr>
<th>Article</th>
<th>art.nr.</th>
<th>Brand</th>
<th>GTIN</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/121308" target="_blank">
Dalapannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11068</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
</tr>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/124494" target="_blank">
Dessertpannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11405</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155114059</td>
</tr>
</tbody>
</table>
</div>
And I'm using this method to trying to get my values. The problem is I don't know what code to write in the SelectNode() to get the innerHtml containing the GTIN-numbers.
public void TestGetHtml()
{
var doc = new HtmlDocument();
doc.Load("C:/Users/Desktop/test.html");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("TODO: Add code to select all GTIN"))
{
}
doc.Save("file.htm");
}
Use Xpath to select fourth cells from body of table with id tableSearchArticle. Then get inner text of cells (it will be without html tags, like GTIN:07330155114059) and remove GTIN prefix:
var xpath = "//table[#id='tableSearchArticle']/tbody/tr/td[4]";
var gtins = doc.DocumentNode.SelectNodes(xpath)
.Select(td => td.InnerText.Replace("GTIN:", ""));
Output:
[
"07330155011068",
"07330155114059"
]
SelectNodes receives an Xpath expression. So, you could start with this (untested):
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes(
"//div[#class='table-wrapper']/table[#id='tableSearchArticle']/tbody/tr"))
{
Console.WriteLine(tr.InnerHtml);
Console.WriteLine(tr.SelectSingleNode(".//a").GetAttribute("href"));
Console.WriteLine(tr.SelectSingleNode(".//td[last()]").InnerText);
}

How to get the count of tables in an html file with C# and html-agility-pack

This is a newbie question so please provide working code.
How do I count the tables in an html file using C# and the html-agility-pack?
(I will need to get values from specific tables in an html file based on the count of tables. I will then perform some math on the values retrieved.)
Here is a sample file with three tables for your convenience:
<html>
<head>
<title>Tables</title>
</head>
<body>
<table border="1">
<tr>
<th>Name</th>
<th>Phone</th>
<th>City</th>
<th>Number</th>
</tr>
<tr>
<td>Scott</td>
<td>555-2345</td>
<td>Chicago</td>
<td>42</td>
</tr>
<tr>
<td>Bill</td>
<td>555-1243</td>
<td>Detroit</td>
<td>23</td>
</tr>
<tr>
<td>Ted</td>
<td>555-3567</td>
<td>Columbus</td>
<td>9</td>
</tr>
</table>
<p></p>
<table border="1">
<tr>
<th>Name</th>
<th>Year</th>
</tr>
<tr>
<td>Abraham</td>
<td>1865</td>
</tr>
<tr>
<td>Martin</td>
<td>1968</td>
</tr>
<tr>
<td>John</td>
<td>1963</td>
</tr>
</table>
<p></p>
<table border="1">
<tr>
<th>Animal</th>
<th>Location</th>
<th>Number</th>
</tr>
<tr>
<td>Tiger</td>
<td>Jungle</td>
<td>8</td>
</tr>
<tr>
<td>Hippo</td>
<td>River</td>
<td>4</td>
</tr>
<tr>
<td>Camel</td>
<td>Desert</td>
<td>3</td>
</tr>
</table>
</body>
</html>
If you would, please SHOW how to send the results to a new text file.
Thanks!
I think this can be a starting point
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var tables = doc.DocumentNode.Descendants("table");
int tablesCount = tables.Count();
foreach (var table in tables)
{
var rows = table.Descendants("tr")
.Select(tr => tr.Descendants("td").Select(td => td.InnerText).ToList())
.ToList();
foreach(var row in rows)
Console.WriteLine(String.Join(",", row));
}
Something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(myTestFile);
// get all TABLE elements recursively
int count = doc.DocumentNode.SelectNodes("//table").Count;
// output to a text file
File.WriteAllText("output.txt", count.ToString());

Retrieve element within html hierarchy

I have this piece of html code. I want to get the text inside the <div> tag using WatiN. The C# code is below, but I'm pretty sure it could be done way better than my solution. Any suggestions?
HTML:
<table id="someId" cellspacing="0" border="1" style="border-collapse:collapse;" rules="all">
<tbody>
<tr>
<th scope="col"> </th>
</tr>
<tr>
<td>
<div>Some text</div>
</td>
</tr>
</tbody>
</table>
C#
// Get the table ElementContainer
IElementContainer diagnosisElementContainer = (IElementContainer)_control.GetElementById("someId");
// Get the tbody element
IElementContainer tbodyElementContainer = (IElementContainer)diagnosisElementContainer.ChildrenWithTag("tbody");
// Get the <tr> children
ElementCollection trElementContainer = tbodyElementContainer.ChildrenWithTag("tr");
// Get the <td> child of the last <tr>
IElementContainer tdElementContainer = (IElementContainer)trElementContainer.ElementAt<Element>(trElementContainer.Count - 1);
// Get the <div> element inside the <td>
Element divElement = tdElementContainer.Divs[0];
Based on the given, something like this is how I'd go for IE.
IE myIE = new IE();
myIE.GoTo("[theurl]");
string theText = myIE.Table("someId").Divs[0].Text;
The above is working on WatiN 2.1, Win7, IE9.

Html Agility Pack - loop through rows and columns

How can I loop through table and row that have an attribute id or name to get inner text in deep down in each td cell? I work on asp.net, c#, and the newest html agility package. Please guide. Thank you.
An html file have several tables. One of them has an attribute id=main-part. In that identified table, there are many rows. Some of those rows have same attribute name=display. In those named rows, there are many columns which I have to extract text from. Something like this:
<body>
<table>
...
</table>
<table>
...
</table>
<table id="main-part">
<tr>
<td></td>
...
</tr>
<tr>
<td></td>
...
</tr>
<tr name="display">
<td>Jan</td>
<td>Feb</td>
<td>Mar</td>
...
</tr>
<tr name="display">
<td>Apr</td>
<td>May</td>
<td>June</td>
...
</tr>
<tr name="display">
<td>Jul</td>
<td>Aug</td>
<td>Sep</td>
...
</tr>
<tr>
<td></td>
...
</tr>
<tr name="display">
<td>Oct</td>
<td>Nov</td>
<td>Dec</td>
...
</tr>
<tr>
<td></td>
...
</tr>
</table>
<table>
...
</table>
</body>
You need to select these nodes using xpath:
foreach(HtmlNode cell in doc.DocumentElement.SelectNodes("//tr[#name='display']/td")
{
// get cell data
}
It worked! Thank you very much Oded.
HtmlDocument doc = new HtmlDocument();
doc.Load(#"C:/samplefolder/sample.htm");
foreach(HtmlNode cell in doc.DocumentNode.SelectNodes("//tr[#name='display']/td"))
{
string test = cell.InnerText;
Response.Write(test);
}
It showed result like JanFebMarAprMayJuneJulAugSepOctNovDec. How can I sort them out, separate by a space or a tab? Thank you.

Categories