How can I loop through a table and its rows, identified by an id or name attribute, to get the inner text of each td cell? I am working with ASP.NET, C#, and the latest Html Agility Pack. Please guide me. Thank you.
An HTML file has several tables. One of them has the attribute id="main-part". That table has many rows, and some of those rows share the attribute name="display". Those named rows contain many columns from which I have to extract the text. Something like this:
<body>
<table>
...
</table>
<table>
...
</table>
<table id="main-part">
<tr>
<td></td>
...
</tr>
<tr>
<td></td>
...
</tr>
<tr name="display">
<td>Jan</td>
<td>Feb</td>
<td>Mar</td>
...
</tr>
<tr name="display">
<td>Apr</td>
<td>May</td>
<td>June</td>
...
</tr>
<tr name="display">
<td>Jul</td>
<td>Aug</td>
<td>Sep</td>
...
</tr>
<tr>
<td></td>
...
</tr>
<tr name="display">
<td>Oct</td>
<td>Nov</td>
<td>Dec</td>
...
</tr>
<tr>
<td></td>
...
</tr>
</table>
<table>
...
</table>
</body>
You need to select these nodes using XPath:
foreach (HtmlNode cell in doc.DocumentNode.SelectNodes("//tr[@name='display']/td"))
{
// get cell data
}
It worked! Thank you very much Oded.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:/samplefolder/sample.htm");
foreach (HtmlNode cell in doc.DocumentNode.SelectNodes("//tr[@name='display']/td"))
{
string test = cell.InnerText;
Response.Write(test);
}
The result was displayed as JanFebMarAprMayJuneJulAugSepOctNovDec. How can I separate the values with a space or a tab? Thank you.
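One possible way to separate the values is to collect the cell texts of each row and join them with a delimiter. The sketch below rests on a few assumptions: the HTML is inlined instead of loaded from sample.htm, the XPath is narrowed to the table with id="main-part", and Console.WriteLine stands in for Response.Write.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Inline sample standing in for sample.htm (an assumption for this sketch).
        const string html = @"
<table id=""main-part"">
  <tr><td>ignored</td></tr>
  <tr name=""display""><td>Jan</td><td>Feb</td><td>Mar</td></tr>
  <tr name=""display""><td>Apr</td><td>May</td><td>June</td></tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var rows = doc.DocumentNode.SelectNodes(
            "//table[@id='main-part']//tr[@name='display']");
        if (rows == null) return;

        foreach (HtmlNode row in rows)
        {
            // Elements("td") yields only the direct td children of this row.
            var cells = row.Elements("td").Select(td => td.InnerText.Trim());

            // Join the cell texts with a tab; use " " for a space instead.
            Console.WriteLine(string.Join("\t", cells));
        }
    }
}
```

In a Web Forms page, the equivalent of the last statement would be something like Response.Write(string.Join(" ", cells) + "<br />"), since a tab or line break in the response is not rendered as such by the browser.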
I have some html and want to scrape some data from it.
The HTML is structured in the following way
<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
<tbody>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
</tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>
<table>
<tbody>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
<tr>
<td>label</td>
<td>data</td>
</tr>
</tbody>
</table>
<div class="someClass"><span class="someOtherClass">Text</span></div>
I need to be able to scrape the Text value located in the span where class="someOtherClass" (I've already implemented this portion)
I then need to be able to scrape the table directly below the div. Since the "parent" div doesn't actually contain the table, I'm having some issues implementing this.
I need to be able to scrape the Text value located in the span
You don't need regex. An XPath query is enough:
var text = doc.DocumentNode
.SelectNodes("//span[@class='someOtherClass']")
.Select(x => x.InnerText)
.ToList();
I then need to be able to scrape the table directly below the div.
Use a similar XPath:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);
var tables = doc.DocumentNode
.SelectNodes("//span[@class='someOtherClass']/following::table").ToList();
foreach (var table in tables)
{
var list = table.Descendants("tr")
.Select(tr => tr.Descendants("td")
.Select(td => td.InnerText).ToList())
.ToList();
}
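Note that following::table matches every table after a span, not only the next one, so the result above is not grouped per heading. If each heading has to be paired with its own table, one option is to start from each div and ask for its nearest following sibling table. This is only a sketch under assumptions: the HTML is inlined, the span texts "First" and "Second" are made up, and following-sibling::table[1] would pick a later table if a div had none of its own before the next div.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Inline sample mirroring the structure in the question (texts are made up).
        const string html = @"
<div class=""someClass""><span class=""someOtherClass"">First</span></div>
<table><tbody>
  <tr><td>label</td><td>data</td></tr>
  <tr><td>label</td><td>data</td></tr>
</tbody></table>
<div class=""someClass""><span class=""someOtherClass"">Second</span></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var divs = doc.DocumentNode.SelectNodes("//div[@class='someClass']");
        if (divs == null) return;

        foreach (HtmlNode div in divs)
        {
            // Heading text from the span inside this div (null if the span is missing).
            string heading = div
                .SelectSingleNode("./span[@class='someOtherClass']")?.InnerText;

            // Nearest table that follows this div at the same nesting level.
            HtmlNode table = div.SelectSingleNode("following-sibling::table[1]");
            if (table == null)
            {
                Console.WriteLine(heading + ": no table");
                continue;
            }

            Console.WriteLine(heading + ":");
            foreach (HtmlNode tr in table.Descendants("tr"))
            {
                var cells = tr.Descendants("td").Select(td => td.InnerText.Trim());
                Console.WriteLine("  " + string.Join(" | ", cells));
            }
        }
    }
}
```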
I'm not a professional in C# and ASP.NET, so please have some patience with me.
I have the following problem.
I'm using ASP.NET Web Forms with C# to create a dashboard.
I have a generic HTML table (built from a SQL query) which is displayed on the page. Now I want to implement the following feature: when the user clicks on a cell, for example in the ID column, they should get a details view in a Bootstrap modal.
For that I need the ID value in that cell. How can I get this value?
With the value I will run a new SQL query and show more specific information.
Here is my .aspx structure:
<table id="MyTable" class="table table-striped table-bordered table-condensed table-responsive">
<thead>
<tr>
<th>ID</th>
<th>Name</th>
<th>Typ</th>
<th>Something else</th>
<th>Date</th>
</tr>
</thead>
<tbody>
<%=Tabelle.GetTable.dataTable_all%>
</tbody>
</table>
<script type="text/javascript">
$(document).ready(function () {
$('#MyTable').DataTable();
});
</script>
The variable dataTable_all is a string, so this is my table as HTML code.
The result for <tbody> is 366 rows long; here is an extract:
<tr>
<td>154789</td>
<td>Testproject X</td>
<td>Good</td>
<td>greencolored</td>
<td>01.01.2015</td>
</tr>
<tr>
<td>189365</td>
<td>Testproject B</td>
<td>Good</td>
<td>redcolored</td>
<td>08.01.2015</td>
</tr>
<tr>
<td>136471</td>
<td>Testproject Y</td>
<td>Bad</td>
<td>pinkcolored</td>
<td>15.04.2015</td>
</tr>
So how can I make it so that when I click on, for example, ID 136471, the value is passed to a variable in my C# code?
Change to:
<tr data-id="154789">
<td>154789</td>
<td>Testproject X</td>
<td>Good</td>
<td>greencolored</td>
<td>01.01.2015</td>
</tr>
<tr data-id="189365">
<td>189365</td>
<td>Testproject B</td>
<td>Good</td>
<td>redcolored</td>
<td>08.01.2015</td>
</tr>
<tr data-id="136471">
<td>136471</td>
<td>Testproject Y</td>
<td>Bad</td>
<td>pinkcolored</td>
<td>15.04.2015</td>
</tr>
Then use:
$('tbody tr').click(function() {
alert($(this).data('id'));
});
Working demo: https://jsfiddle.net/jknysneo/
I'm using html-agility-pack and trying to select a specific part of the HTML.
The part I want to get is every GTIN number in blocks like this:
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
The part I want is the number after the closing span tag, e.g. 07330155011068. Below are my HTML and my C# method:
<div class="table-wrapper" style='display: block;'>
<table id="tableSearchArticle">
<thead>
<tr>
<th>Article</th>
<th>art.nr.</th>
<th>Brand</th>
<th>GTIN</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/121308" target="_blank">
Dalapannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11068</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155011068</td>
</tr>
<tr>
<td>
<a href="http://www.dabas.com/ProductSheet/Detail.ashx/124494" target="_blank">
Dessertpannkaka fryst ca100st 6kg
</a>
</td>
<td><span class="mobile-only">Tillverkarens art.nr:</span>11405</td>
<td><span class="mobile-only">Varumärke:</span>test</td>
<td><span class="mobile-only">GTIN:</span>07330155114059</td>
</tr>
</tbody>
</table>
</div>
I'm using this method to try to get my values. The problem is that I don't know what to pass to SelectNodes() to get the inner HTML containing the GTIN numbers.
public void TestGetHtml()
{
var doc = new HtmlDocument();
doc.Load("C:/Users/Desktop/test.html");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("TODO: Add code to select all GTIN"))
{
}
doc.Save("file.htm");
}
Use XPath to select the fourth cell of each row in the body of the table with id tableSearchArticle. Then take the inner text of each cell (it comes without HTML tags, e.g. GTIN:07330155114059) and remove the GTIN: prefix:
var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
var gtins = doc.DocumentNode.SelectNodes(xpath)
.Select(td => td.InnerText.Replace("GTIN:", ""));
Output:
[
"07330155011068",
"07330155114059"
]
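A variation that does not depend on the literal "GTIN:" label is to read only the text nodes that sit directly inside each cell, which skips the span. The sketch below assumes an inlined, shortened copy of the table (the first three cells of each row are placeholders) and guards against SelectNodes returning null when nothing matches.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened stand-in for test.html; cells 1-3 are placeholders.
        const string html = @"
<table id=""tableSearchArticle"">
  <tbody>
    <tr>
      <td>a</td><td>b</td><td>c</td>
      <td><span class=""mobile-only"">GTIN:</span>07330155011068</td>
    </tr>
    <tr>
      <td>a</td><td>b</td><td>c</td>
      <td><span class=""mobile-only"">GTIN:</span>07330155114059</td>
    </tr>
  </tbody>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = doc.DocumentNode.SelectNodes(
            "//table[@id='tableSearchArticle']/tbody/tr/td[4]");
        if (cells == null) return; // no matching cells

        foreach (HtmlNode cell in cells)
        {
            // Keep only text nodes that are direct children of the td,
            // so the label inside the span is left out.
            var parts = cell.ChildNodes
                .Where(n => n.NodeType == HtmlNodeType.Text)
                .Select(n => n.InnerText.Trim())
                .Where(t => t.Length > 0);

            Console.WriteLine(string.Concat(parts));
        }
    }
}
```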
SelectNodes takes an XPath expression, so you could start with this (untested):
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes(
"//div[@class='table-wrapper']/table[@id='tableSearchArticle']/tbody/tr"))
{
Console.WriteLine(tr.InnerHtml);
// GetAttributeValue returns the fallback ("") when the attribute is missing
Console.WriteLine(tr.SelectSingleNode(".//a").GetAttributeValue("href", ""));
Console.WriteLine(tr.SelectSingleNode(".//td[last()]").InnerText);
}
This is a newbie question so please provide working code.
How do I count the tables in an html file using C# and the html-agility-pack?
(I will need to get values from specific tables in an html file based on the count of tables. I will then perform some math on the values retrieved.)
Here is a sample file with three tables for your convenience:
<html>
<head>
<title>Tables</title>
</head>
<body>
<table border="1">
<tr>
<th>Name</th>
<th>Phone</th>
<th>City</th>
<th>Number</th>
</tr>
<tr>
<td>Scott</td>
<td>555-2345</td>
<td>Chicago</td>
<td>42</td>
</tr>
<tr>
<td>Bill</td>
<td>555-1243</td>
<td>Detroit</td>
<td>23</td>
</tr>
<tr>
<td>Ted</td>
<td>555-3567</td>
<td>Columbus</td>
<td>9</td>
</tr>
</table>
<p></p>
<table border="1">
<tr>
<th>Name</th>
<th>Year</th>
</tr>
<tr>
<td>Abraham</td>
<td>1865</td>
</tr>
<tr>
<td>Martin</td>
<td>1968</td>
</tr>
<tr>
<td>John</td>
<td>1963</td>
</tr>
</table>
<p></p>
<table border="1">
<tr>
<th>Animal</th>
<th>Location</th>
<th>Number</th>
</tr>
<tr>
<td>Tiger</td>
<td>Jungle</td>
<td>8</td>
</tr>
<tr>
<td>Hippo</td>
<td>River</td>
<td>4</td>
</tr>
<tr>
<td>Camel</td>
<td>Desert</td>
<td>3</td>
</tr>
</table>
</body>
</html>
If you would, please SHOW how to send the results to a new text file.
Thanks!
I think this can be a starting point
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var tables = doc.DocumentNode.Descendants("table");
int tablesCount = tables.Count();
foreach (var table in tables)
{
var rows = table.Descendants("tr")
.Select(tr => tr.Descendants("td").Select(td => td.InnerText).ToList())
.ToList();
foreach(var row in rows)
Console.WriteLine(String.Join(",", row));
}
Something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(myTestFile);
// get all TABLE elements recursively
int count = doc.DocumentNode.SelectNodes("//table").Count;
// output to a text file
File.WriteAllText("output.txt", count.ToString());
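The two answers can be combined into one sketch that counts the tables and writes both the count and the cell values to a text file. The assumptions here: the HTML is inlined and shortened to two tables, the file name output.txt is reused from the snippet above, and header cells (th) are included alongside td cells. Descendants("table") also counts nested tables, which may or may not be wanted.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened stand-in for the sample file in the question.
        const string html = @"<html><body>
<table><tr><th>Name</th><th>Year</th></tr><tr><td>Abraham</td><td>1865</td></tr></table>
<table><tr><th>Animal</th><th>Number</th></tr><tr><td>Tiger</td><td>8</td></tr></table>
</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants never returns null; an empty list means no tables.
        var tables = doc.DocumentNode.Descendants("table").ToList();
        var lines = new List<string> { "Table count: " + tables.Count };

        for (int i = 0; i < tables.Count; i++)
        {
            lines.Add("Table " + (i + 1));
            foreach (HtmlNode tr in tables[i].Descendants("tr"))
            {
                // Take both header and data cells that are direct children of the row.
                var cells = tr.ChildNodes
                    .Where(n => n.Name == "th" || n.Name == "td")
                    .Select(n => n.InnerText.Trim());
                lines.Add(string.Join(",", cells));
            }
        }

        // One line per entry; the file is created or overwritten.
        File.WriteAllLines("output.txt", lines);
    }
}
```

Numeric columns can then be parsed from the joined values (for example with int.TryParse) before doing any math on them.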
I have this piece of html code. I want to get the text inside the <div> tag using WatiN. The C# code is below, but I'm pretty sure it could be done way better than my solution. Any suggestions?
HTML:
<table id="someId" cellspacing="0" border="1" style="border-collapse:collapse;" rules="all">
<tbody>
<tr>
<th scope="col"> </th>
</tr>
<tr>
<td>
<div>Some text</div>
</td>
</tr>
</tbody>
</table>
C#
// Get the table ElementContainer
IElementContainer diagnosisElementContainer = (IElementContainer)_control.GetElementById("someId");
// Get the tbody element
IElementContainer tbodyElementContainer = (IElementContainer)diagnosisElementContainer.ChildrenWithTag("tbody");
// Get the <tr> children
ElementCollection trElementContainer = tbodyElementContainer.ChildrenWithTag("tr");
// Get the <td> child of the last <tr>
IElementContainer tdElementContainer = (IElementContainer)trElementContainer.ElementAt<Element>(trElementContainer.Count - 1);
// Get the <div> element inside the <td>
Element divElement = tdElementContainer.Divs[0];
Based on the HTML given, this is roughly how I'd do it for IE:
IE myIE = new IE();
myIE.GoTo("[theurl]");
string theText = myIE.Table("someId").Divs[0].Text;
The above is working on WatiN 2.1, Win7, IE9.