Taking different Table of Elements with HtmlAgilityPack - c#

I have this loop structure several times.
Table 1
<table>
<tbody>
<tr>
<th>titulo</th>
</tr>
</tbody>
</table>
Table 2
<table>
<tbody>
<tr>
<th>Texto</th>
<th>Texto</th>
<th>Texto</th>
<th>Texto</th>
</tr>
</tbody>
</table>
This pattern is repeated several times.
How do I load them into an array or a list so that I can get the values of each?

Short Demo using a Console App:
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load("Demo.html");
        var result = doc.DocumentNode.SelectNodes("//table")
            .Select(table => new // create anonymous type
            {
                Table = table,
                // the th subnodes; SelectNodes returns null when nothing matches
                HeaderNodes = table.SelectNodes("./tbody/tr/th")?.ToList() ?? new List<HtmlNode>()
            });
        foreach (var table in result)
        {
            foreach (HtmlNode headerNode in table.HeaderNodes)
            {
                Console.WriteLine(headerNode.InnerText);
            }
            Console.WriteLine("--------------------------");
        }
    }
}
Output:
titulo
--------------------------
Texto
Texto
Texto
Texto
--------------------------
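
Since the question asks for an array and a list, here is a minimal sketch of the same idea that returns a plain `List<string[]>` instead of an anonymous type: one array of header texts per table. The class name `TableHeaders` and the method `Read` are hypothetical names chosen for illustration, and the sketch assumes the HTML is already available as a string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TableHeaders
{
    // Hypothetical helper: returns one string[] per <table>,
    // holding the trimmed text of each <th> inside that table.
    public static List<string[]> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null)
        {
            return new List<string[]>();
        }

        return tables
            .Select(table => table.SelectNodes(".//th")) // relative to this table
            .Select(cells => cells == null
                ? new string[0]
                : cells.Select(th => th.InnerText.Trim()).ToArray())
            .ToList();
    }
}
```

With the two sample tables, `TableHeaders.Read(html)[1][2]` would be the third header of the second table. Note that `.//th` also picks up headers of nested tables, which the sample does not contain.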

Related

Find indexes in String using multiple search items and one single iteration

I have the following HTML sample document:
.....
<div class="TableElement">
<table>
<tr>
<th class="boxToolTip" title="La quotazione di A2A è in rialzo o in ribasso?"> </th>
..
<th class="boxToolTip" class="ColumnLast" title="Trades più recenti su A2A">Ora <img title='' alt='' class='quotePageRTupgradeLink' href='#quotePageRTupgradeContainer' id='cautionImageEnt' src='/common/images/icons/caution_sign.gif'/></th>
</tr>
<tr class="odd">
..
<td align="center"><span id="quoteElementPiece6" class="PriceTextUp">1,619</span></td>
<td align="center"><span id="quoteElementPiece7" class="">1,6235</span></td>
<td align="center"><span id="quoteElementPiece8" class="">1,591</span></td>
<td align="center"><span id="quoteElementPiece9" class="">1,5995</span></td>
..
</tr>
</table>
</div>
......
I need to get the values corresponding to quoteElementPiece 6, 7, 8, 9 and 17 (the last one appears further down in the document).
I am simply searching one by one in the code at the moment:
int index6 = doc.IndexOf("quoteElementPiece6");
..
int index17 = doc.IndexOf("quoteElementPiece17");
I want to improve this by scanning in one go and having all the indexes for the substrings I need. Example:
var searchstrings = new string[]
{
"quoteElementPiece6",
"quoteElementPiece7",
"quoteElementPiece8",
"quoteElementPiece9",
"quoteElementPiece17"
};
int[] indexes = getIndexes(document, searchstrings); // indexes should be in the same order as searchstrings
Is there anything native in .NET that does this (LINQ, for instance)?
I know there are HTML parser libraries, but I would prefer to avoid them; I would like to learn how to do this for any kind of document.
var words = new []{
"quoteElementPiece6",
"quoteElementPiece7"};
// I take for granted your `document` is a string and not an `HtmlDocument` or whatnot.
var result = words.Select(word=>document.IndexOf(word));
Console.WriteLine(string.Join(",", result));
You can do this with List<T>.ForEach. Check my solution (it uses LastIndexOf, so it reports the last occurrence of each string):
var doc = "this is my document";
List<string> searchstrings = new List<string>
{
"quoteElementPiece6",
"quoteElementPiece7",
"quoteElementPiece8",
"quoteElementPiece9",
"quoteElementPiece17"
};
var lastIndexOfList = new List<int>(searchstrings.Count);
searchstrings.ForEach(x => lastIndexOfList.Add(doc.LastIndexOf(x)));
var pattern = @"(?s)<tr class=""odd"">.+?</tr>";
// strip &nbsp; entities, which XElement.Parse would reject as undefined
var tr = Regex.Match(html, pattern).Value.Replace("&nbsp;", "");
var xml = XElement.Parse(tr);
var nums = xml
.Descendants()
.Where(n => (string)n.Attribute("id") != null)
.Where(n => n.Attribute("id").Value.StartsWith("quoteElementPiece"))
.Select(n => Regex.Match(n.Attribute("id").Value, "[0-9]+").Value);
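
The answers above call IndexOf once per search string, so the document is scanned once per string. For the "one go" variant the question asks about, here is a minimal sketch that uses only the base class library. The class name `MultiSearch` is hypothetical, `GetIndexes` mirrors the name used in the question, and the sketch assumes ordinal (case-sensitive) matching and that the first occurrence of each string is wanted.

```csharp
using System;

static class MultiSearch
{
    // Hypothetical helper: walks `text` once from left to right and records,
    // for each search string, the index of its first occurrence (-1 if absent).
    // The result is in the same order as `searchStrings`.
    public static int[] GetIndexes(string text, string[] searchStrings)
    {
        var indexes = new int[searchStrings.Length];
        for (int i = 0; i < indexes.Length; i++)
        {
            indexes[i] = -1;
        }

        int remaining = searchStrings.Length;
        for (int pos = 0; pos < text.Length && remaining > 0; pos++)
        {
            for (int i = 0; i < searchStrings.Length; i++)
            {
                string s = searchStrings[i];
                // Skip strings already found or too long to fit at this position.
                if (indexes[i] != -1 || pos + s.Length > text.Length)
                {
                    continue;
                }
                if (string.CompareOrdinal(text, pos, s, 0, s.Length) == 0)
                {
                    indexes[i] = pos;
                    remaining--;
                }
            }
        }
        return indexes;
    }
}
```

This stops as soon as every string has been found, but it still compares each candidate string at each position. For a large number of search strings, a dedicated multi-pattern algorithm such as Aho-Corasick would be the usual choice.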

Html Agility Pack parsing table into object

So I have HTML like this:
<tr class="row1">
<td class="id">123</td>
<td class="date">2014-08-08</td>
<td class="time">12:31:25</td>
<td class="notes">something here</td>
</tr>
<tr class="row0">
<td class="id">432</td>
<td class="date">2015-02-09</td>
<td class="time">12:22:21</td>
<td class="notes">something here</td>
</tr>
And it continues like that for each customer row. I want to parse the contents of each table row into an object. I've tried a few methods but can't seem to get it to work right.
This is what I have currently:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//table[@id='customerlist']//tr"))
{
    Customer cust = new Customer();
    foreach (HtmlNode info in row.SelectNodes("//td"))
    {
        if (info.GetAttributeValue("class", String.Empty) == "id")
        {
            cust.ID = info.InnerText;
        }
        if (info.GetAttributeValue("class", String.Empty) == "date")
        {
            cust.DateAdded = info.InnerText;
        }
        if (info.GetAttributeValue("class", String.Empty) == "time")
        {
            cust.TimeAdded = info.InnerText;
        }
        if (info.GetAttributeValue("class", String.Empty) == "notes")
        {
            cust.Notes = info.InnerText;
        }
    }
    Console.WriteLine(cust.ID + " " + cust.TimeAdded + " " + cust.DateAdded + " " + cust.Notes);
}
It works to the point that it prints info of the last row of the table on each loop. I'm just missing something very simple but cannot see what.
Also is my way of creating the object fine, or should I use a constructor and create the object from variables? E.g.
string Notes = String.Empty;
if (info.GetAttributeValue("class", String.Empty) == "notes")
{
Notes = info.InnerText;
}
..
Customer cust = new Customer(id, other_variables, Notes, etc);
Your XPath query is wrong. You need to use td instead of //td:
foreach (HtmlNode info in row.SelectNodes("td"))
Passing //td to SelectNodes() matches all <td> elements in the document, so your inner loop runs 8 times instead of 4, and the last 4 iterations always override the values previously set in your Customer object.
See XPath Examples
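
Putting the relative XPath to work, here is a minimal sketch that collects every row into a list of objects. The shape of `Customer` is an assumption (the question does not show the class), and `CustomerParser`, `Parse` and `CellText` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Assumed shape of the Customer class; the question does not show it.
class Customer
{
    public string ID { get; set; }
    public string DateAdded { get; set; }
    public string TimeAdded { get; set; }
    public string Notes { get; set; }
}

static class CustomerParser
{
    public static List<Customer> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var customers = new List<Customer>();

        // SelectNodes returns null when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table[@id='customerlist']//tr");
        if (rows == null)
        {
            return customers;
        }

        foreach (HtmlNode row in rows)
        {
            customers.Add(new Customer
            {
                ID = CellText(row, "id"),
                DateAdded = CellText(row, "date"),
                TimeAdded = CellText(row, "time"),
                Notes = CellText(row, "notes")
            });
        }
        return customers;
    }

    // "td[...]" has no leading slashes, so it only looks at <td> children of this row.
    private static string CellText(HtmlNode row, string cssClass)
    {
        HtmlNode cell = row.SelectSingleNode("td[@class='" + cssClass + "']");
        return cell == null ? string.Empty : cell.InnerText.Trim();
    }
}
```

As for the second part of the question, an object initializer like the one above and a constructor taking the same values do the same job; which one to use is mostly a matter of style.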

HTML Agility Pack: How to access HTML attributes?

I have the following HTML code:
<tr>
<td headers="header1"><b>TITLE </b></td>
<td headers="header2"></td>
<td headers="header3" class="centrato">23/04/2014</td>
</tr>
I need to store in a datatable:
HREF VALUE in "Link" column;
TITLE in "Title" column;
23/04/2014 in "Date" column;
I tried this:
int i = 0;
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//td[@headers='header1']"))
{
    table.Rows.Add();
    table.Rows[i]["Post"] = node.InnerText;
    i++;
}
This code allows me to add every title to the datatable, but I'm not able to add the date and the href. Can you help me, please?
You can do it this way:
int i = 0;
// select every <tr> that contains the specific <td>
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//tr[td[@headers='header1']]"))
{
    table.Rows.Add();
    // get <td headers='header1'> in the current <tr>
    var header1 = node.SelectSingleNode("./td[@headers='header1']");
    table.Rows[i]["Title"] = header1.InnerText;
    // get the <a> inside header1, then its href attribute value
    table.Rows[i]["Link"] = header1.SelectSingleNode(".//a").GetAttributeValue("href", "");
    // get the InnerText of <td headers='header3'> in the current <tr>
    table.Rows[i]["Date"] = node.SelectSingleNode("./td[@headers='header3']").InnerText;
    i++;
}
InnerText only gives you the text between the tags. To access href, id, or any other attribute, use the GetAttributeValue method:
int i = 0;
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//tr"))
{
    // ".//a" and "./td" are relative to the current row; "//a" would always
    // return the first <a> in the whole document.
    HtmlNode link = node.SelectSingleNode(".//a");
    if (link == null) continue; // skip rows without a link
    table.Rows.Add();
    table.Rows[i]["Link"] = link.GetAttributeValue("href", "");
    table.Rows[i]["Title"] = link.InnerText;
    table.Rows[i]["Date"] = node.SelectSingleNode("./td[@headers='header3']").InnerText;
    i++;
}

How to Get element that inside another element by class in HtmlAgilityPack

Hello, I am getting an HttpWebResponse containing the HTML page with all the data I need, for example a table with date info that I need to store in a list and save to an XML file.
Example of the HTML page:
<table>
<tr>
<td class="padding5 sorting_1">
<span>01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span>10.03.14</span>
</td>
</tr>
</table>
My code, which is not working, uses HtmlAgilityPack. With it I can get the text from a span that has a class:
private static List<string> GetListDataByClass(string HtmlSourse, string Class)
{
    List<string> data = new List<string>();
    HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
    DocToParse.LoadHtml(HtmlSourse);
    foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//span[@class='" + Class + "']"))
    {
        if (node.InnerText != null) data.Add(node.InnerText);
    }
    return data;
}
But in my case it is the td that has the class. I tried:
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//td[@class='" + Class + "']"))
but this did not work.
So I need to read this data to get the dates 01.03.14 and 10.03.14. Any ideas how I can do that?
Just change the XPath query to:
DocToParse.DocumentNode.SelectNodes("//td[@class='" + Class + "']/span")
This will select all the spans that are inside a td element with the corresponding class.
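
One caveat: `@class='...'` compares against the whole attribute value, so with the sample markup it only matches if `Class` is exactly "padding5 sorting_1". Here is a minimal sketch that matches a single class token instead, using standard XPath 1.0 functions. `ClassLookup` and `SpanTextsInCellsWithClass` are hypothetical names, and the sketch assumes the class name contains no quote characters.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class ClassLookup
{
    // Hypothetical helper: text of every <span> that sits directly inside a <td>
    // whose class attribute contains `cssClass` as one of its space-separated tokens.
    public static List<string> SpanTextsInCellsWithClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<string>();

        // Padding both sides with spaces stops "sorting_1" from also matching "sorting_10".
        string xpath = "//td[contains(concat(' ', normalize-space(@class), ' '), ' "
            + cssClass + " ')]/span";

        // SelectNodes returns null when nothing matches.
        var spans = doc.DocumentNode.SelectNodes(xpath);
        if (spans == null)
        {
            return result;
        }

        foreach (HtmlNode span in spans)
        {
            result.Add(span.InnerText.Trim());
        }
        return result;
    }
}
```

Called with "sorting_1" or with "padding5" on the sample table, this should return both dates.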

C# HtmlAgilityPack Select table from specific h2

I have some html:
<h2>Results</h2>
<div class="box">
<table class="tFormat">
<th>Head</th>
<tr>1</tr>
</table>
</div>
<h2>Grades</h2>
<div class="box">
<table class="tFormat">
<th>Head</th>
<tr>1</tr>
</table>
</div>
How would I get the table under "Results"?
I've tried:
var nodes = doc.DocumentNode.SelectNodes("//h2");
foreach (var o in nodes)
{
    if (o.InnerText.Equals("Results"))
    {
        foreach (var c in o.SelectNodes("//table"))
        {
            Console.WriteLine(c.InnerText);
        }
    }
}
It works, but it also gets the table under the "Grades" h2.
Note that the div is not hierarchically inside the header, so it doesn't make sense to look for it there.
This can work for you - it finds the next element after the title:
if (o.InnerText.Equals("Results"))
{
    var nextDiv = o.NextSibling;
    while (nextDiv != null && nextDiv.NodeType != HtmlNodeType.Element)
        nextDiv = nextDiv.NextSibling;
    // nextDiv should be correct here.
}
You can also write a more specific xpath to find just that div:
doc.DocumentNode.SelectNodes("//h2[text()='Results']/following-sibling::div[1]");
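
Extending that XPath one step further selects the table itself rather than the div. Here is a minimal sketch, assuming the heading text contains no quote characters. `SectionTables` and `TablesUnderHeading` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class SectionTables
{
    // Hypothetical helper: InnerText of every <table> inside the first <div>
    // that follows the <h2> whose text equals `heading`.
    public static List<string> TablesUnderHeading(string html, string heading)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<string>();

        // normalize-space() ignores stray whitespace around the heading text;
        // following-sibling::div[1] is the nearest <div> after that <h2>.
        var tables = doc.DocumentNode.SelectNodes(
            "//h2[normalize-space()='" + heading + "']/following-sibling::div[1]//table");

        // SelectNodes returns null when nothing matches.
        if (tables == null)
        {
            return result;
        }

        foreach (HtmlNode table in tables)
        {
            result.Add(table.InnerText.Trim());
        }
        return result;
    }
}
```

Because the path starts from the matching h2 and only descends into the div right after it, the table under "Grades" is not included when "Results" is requested.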
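var nodes = doc.DocumentNode.SelectNodes("//h2");
var o = nodes == null ? null : nodes.FirstOrDefault();
if (o != null && o.InnerText.Equals("Results"))
{
    // A relative path keeps the search to the <div> that follows this <h2>;
    // "//table" would match every table in the document.
    var tables = o.SelectNodes("following-sibling::div[1]//table");
    if (tables != null)
    {
        foreach (var c in tables)
        {
            Console.WriteLine(c.InnerText);
        }
    }
}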
