Html Agility Pack parsing table into object - c#

So I have HTML like this:
<tr class="row1">
<td class="id">123</td>
<td class="date">2014-08-08</td>
<td class="time">12:31:25</td>
<td class="notes">something here</td>
</tr>
<tr class="row0">
<td class="id">432</td>
<td class="date">2015-02-09</td>
<td class="time">12:22:21</td>
<td class="notes">something here</td>
</tr>
And it continues like that for each customer row. I want to parse the contents of each table row into an object. I've tried a few methods, but I can't get it to work correctly.
This is what I have currently:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//table[@id='customerlist']//tr"))
{
Customer cust = new Customer();
foreach (HtmlNode info in row.SelectNodes("//td"))
{
if (info.GetAttributeValue("class", String.Empty) == "id")
{
cust.ID = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "date")
{
cust.DateAdded = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "time")
{
cust.TimeAdded = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "notes")
{
cust.Notes = info.InnerText;
}
}
Console.WriteLine(cust.ID + " " + cust.TimeAdded + " " + cust.DateAdded + " " + cust.Notes);
}
It works to the point that it prints info of the last row of the table on each loop. I'm just missing something very simple but cannot see what.
Also is my way of creating the object fine, or should I use a constructor and create the object from variables? E.g.
string Notes = String.Empty;
if (info.GetAttributeValue("class", String.Empty) == "notes")
{
Notes = info.InnerText;
}
..
Customer cust = new Customer(id, other_variables, Notes, etc);

Your XPath query is wrong. You need to use td instead of //td:
foreach (HtmlNode info in row.SelectNodes("td"))
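An XPath expression that starts with // is evaluated from the document root, no matter which node you call it on. Passing //td to SelectNodes() therefore matches all <td> elements in the document, so with the two rows shown your inner loop runs 8 times instead of 4, and the last 4 iterations always overwrite the values previously set on your Customer object. The relative path td matches only the <td> children of the current row.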
See XPath Examples
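For reference, here is a minimal sketch of the whole loop with the relative path applied. It assumes the HtmlAgilityPack NuGet package is referenced; the Customer class below is a hypothetical stand-in for the asker's own type, and CustomerParser.Parse is an illustrative name, not something from the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

// Hypothetical stand-in for the asker's Customer class.
public class Customer
{
    public string ID { get; set; }
    public string DateAdded { get; set; }
    public string TimeAdded { get; set; }
    public string Notes { get; set; }
}

public static class CustomerParser
{
    // Builds one Customer per <tr> of the table with id 'customerlist'.
    public static List<Customer> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var customers = new List<Customer>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table[@id='customerlist']//tr");
        if (rows == null) return customers;

        foreach (HtmlNode row in rows)
        {
            // Relative path: only the <td> children of this row.
            var cells = row.SelectNodes("td");
            if (cells == null) continue; // e.g. a header row that has no <td>

            var cust = new Customer();
            foreach (HtmlNode cell in cells)
            {
                switch (cell.GetAttributeValue("class", string.Empty))
                {
                    case "id": cust.ID = cell.InnerText; break;
                    case "date": cust.DateAdded = cell.InnerText; break;
                    case "time": cust.TimeAdded = cell.InnerText; break;
                    case "notes": cust.Notes = cell.InnerText; break;
                }
            }
            customers.Add(cust);
        }
        return customers;
    }
}
```

Collecting the objects in a list rather than printing inside the loop is a design choice made for this sketch; the property-setter style from the question works equally well with it.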

Related

How to parse this HTML text using htmlagilitypack?

So below are the lines of code,
<td class="line1left">SCN02_MS_AddNotes_CAM</td><td class="line1left">798 (6.14%)
</td><td class="line1left">0.9</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
<td class="line1left">SCN05_MS_UpdateCustomer_CAM</td><td class="line1left">888 (6.83%)
</td><td class="line1left">1.0</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
From the first block, I need to get SCN02_MS_AddNotes_CAM and 798. To get 798 I am using this code, but I am getting the (6.14%) also, which I don't want.
var content1 = doc1.DocumentNode.SelectNodes("//td[@class='line1left']")[1].InnerText;
I want to get 798 only. So can anybody help me?
I also want to know how to get the same values from the second block. I was under the impression that the number inside the brackets represents the different occurrences of the class line1left. But here it is representing the different InnerHtml elements.
[1]
Does anybody know how to get this to work?
Thanks a lot in advance!
// Select every <td> whose class attribute is exactly "line1left".
var line1left_list = (from d in document.DocumentNode.Descendants()
where d.Name == "td" && d.Attributes["class"] != null
&& (d.Attributes["class"].Value == "line1left")
select d);
foreach (HtmlNode line1left in line1left_list)
{
// Look for a link inside the cell; FirstOrDefault returns null if there is none.
var _link = line1left.Descendants("a").FirstOrDefault();
string linkUrl = "";
string link = "";
if (_link != null)
{
linkUrl = _link.Attributes["href"].Value;
link = _link.InnerText;
}
}
It looks like you want the InnerText of all <td> tags with the class attribute of "line1left", unless that <td> has an <a> inside of it, in which case you want the InnerText of <a>.
Here is an example that will do just that. If the <td> has an <a>, then <a> is selected, otherwise <td> is selected.
HtmlDocument doc1 = new HtmlDocument();
doc1.Load("xmlfile2.xml");
var nodes = doc1.DocumentNode.SelectNodes("(//td[@class='line1left']/a) | (//td[@class='line1left' and not(a)])");
foreach(var node in nodes)
Console.WriteLine(node.InnerText.Trim());
This selects every matching node in the document. You can then use regular C# code to strip the unwanted formatting, such as the percentage in parentheses, from the individual values.
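As a rough illustration of that last step, the sketch below keeps only the text before the first opening parenthesis, which turns a value such as "798 (6.14%)" into "798". The CellText class and BeforeParenthesis method are hypothetical names introduced for this example; it uses plain string operations and does not depend on Html Agility Pack.

```csharp
public static class CellText
{
    // Returns the text that precedes the first '(' with surrounding whitespace removed,
    // or the whole trimmed text when there is no parenthesis.
    public static string BeforeParenthesis(string innerText)
    {
        int open = innerText.IndexOf('(');
        string head = open >= 0 ? innerText.Substring(0, open) : innerText;
        return head.Trim();
    }
}
```

Inside the loop above, Console.WriteLine(CellText.BeforeParenthesis(node.InnerText)) would then print the bare values. Values without parentheses, such as SCN02_MS_AddNotes_CAM, pass through unchanged apart from trimming.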

HTML Agility Pack: How to access HTML attributes?

I have the following HTML:
<tr>
<td headers="header1"><b>TITLE </b></td>
<td headers="header2"></td>
<td headers="header3" class="centrato">23/04/2014</td>
</tr>
I need to store in a datatable:
HREF VALUE in "Link" column;
TITLE in "Title" column;
23/04/2014 in "Date" column;
I tried this:
int i = 0;
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//td[@headers='header1']"))
{
table.Rows.Add();
table.Rows[i]["Post"] = node.InnerText;
i++;
}
This code lets me add every title to the DataTable, but I'm not able to add the date and the href. Can you help me, please?
You can do it this way:
int i = 0;
//select all <tr> that contain the specific <td>
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//tr[td[@headers='header1']]"))
{
table.Rows.Add();
//get <td headers='header1'> in current <tr>
var header1 = node.SelectSingleNode("./td[@headers='header1']");
table.Rows[i]["Title"] = header1.InnerText;
//get <a> in header1, then get its href attribute value
table.Rows[i]["Link"] = header1.SelectSingleNode(".//a").GetAttributeValue("href", "");
//get InnerText of <td headers='header3'> in current <tr>
table.Rows[i]["Date"] = node.SelectSingleNode("./td[@headers='header3']").InnerText;
i++;
}
InnerText only gives you the text between the tags. To access href, id, or any other attribute, use the GetAttributeValue method.
int i = 0;
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//tr"))
{
table.Rows.Add();
// ".//a" and "./td" are relative to the current row; "//a" would always return the first <a> in the document
table.Rows[i]["Link"] = node.SelectSingleNode(".//a").GetAttributeValue("href", "");
table.Rows[i]["Title"] = node.SelectSingleNode(".//a").InnerText;
table.Rows[i]["Date"] = node.SelectSingleNode("./td[@headers='header3']").InnerText;
i++;
}

How to Get element that inside another element by class in HtmlAgilityPack

Hello, I am making an HttpWebRequest and getting back an HTML page with all the data I need, for example a table with dates. I need to save those dates to an ArrayList and then write it to an XML file.
Example of the HTML page:
<table>
<tr>
<td class="padding5 sorting_1">
<span>01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span>10.03.14</span>
</td>
</tr>
</table>
My code below, which uses HtmlAgilityPack, is not working for this page. With it I can get the text from a span that has a class:
private static List<string> GetListDataByClass(string HtmlSourse, string Class)
{
List<string> data = new List<string>();
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlSourse);
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//span[@class='" + Class + "']"))
{
if(node.InnerText!=null) data.Add(node.InnerText);
}
return data;
}
But in my case it is the td that has the class. I tried
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//td[@class='" + Class + "']"))
but this did not work.
So I need to read this data to get the dates 01.03.14 and 10.03.14.
Any ideas how I can get these dates?
Just change the XPath query to:
DocToParse.DocumentNode.SelectNodes("//td[@class='" + Class + "']/span")
This will select all the spans that are inside a td element with the corresponding class.
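One detail worth keeping in mind: the XPath test @class='...' compares the whole attribute value, so for the sample page Class would have to be the full string "padding5 sorting_1". The sketch below shows one common XPath 1.0 idiom for matching a single class name among several. It assumes the HtmlAgilityPack NuGet package is referenced and that the class name contains no quote characters; ClassQuery and SpanTextInCellsWithClass are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class ClassQuery
{
    // Returns the trimmed text of every <span> directly inside a <td>
    // whose class attribute contains cssClass as a whole, space-separated token.
    public static List<string> SpanTextInCellsWithClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Padding the normalized attribute with spaces makes " padding5 " match
        // "padding5 sorting_1" but not "padding50".
        string xpath = "//td[contains(concat(' ', normalize-space(@class), ' '), ' "
            + cssClass + " ')]/span";

        var result = new List<string>();
        var spans = doc.DocumentNode.SelectNodes(xpath);
        if (spans == null) return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode span in spans)
        {
            result.Add(span.InnerText.Trim());
        }
        return result;
    }
}
```

Calling ClassQuery.SpanTextInCellsWithClass(html, "padding5") on the sample table should then yield both dates, whereas an exact comparison against "padding5" alone would match nothing.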

How to Get element by class in HtmlAgilityPack

Hello, I am making an HttpWebRequest and getting back an HTML page with all the data I need, for example a table with dates. I need to save those dates to an ArrayList and then write it to an XML file.
Example of the HTML page:
<table>
<tr>
<td class="padding5 sorting_1">
<span class="DateHover">01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span class="DateHover" >10.03.14</span>
</td>
</tr>
</table>
My code, which uses HtmlAgilityPack, is not working:
private static string GetDataByIClass(string HtmlIn, string ClassToGet)
{
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlIn);
HtmlAgilityPack.HtmlNode InputNode = DocToParse.GetElementbyId(ClassToGet); // here is the problem: there is no DocToParse.GetElementbyClass method
if (InputNode != null)
{
if (InputNode.Attributes["value"].Value != null)
{
return InputNode.Attributes["value"].Value;
}
}
return null;
}
So I need to read this data to get the dates 01.03.14 and 10.03.14, so that I can save them to an ArrayList (and then to an XML file).
Any ideas how I can get these dates?
Html Agility Pack has XPATH support, so you can do something like this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span[@class='" + ClassToGet + "']"))
{
string value = node.InnerText;
// etc...
}
This means: get all SPAN elements from the top of the document (first /), recursively (second /) that have a given CLASS attribute. Then for each element, get the inner text.

C# HtmlAgilityPack Select table from specific h2

I have some html:
<h2>Results</h2>
<div class="box">
<table class="tFormat">
<th>Head</th>
<tr>1</tr>
</table>
</div>
<h2>Grades</h2>
<div class="box">
<table class="tFormat">
<th>Head</th>
<tr>1</tr>
</table>
</div>
I was wondering how I would get the table under "Results".
I've tried:
var nodes = doc.DocumentNode.SelectNodes("//h2");
foreach (var o in nodes)
{
if (o.InnerText.Equals("Results"))
{
foreach (var c in o.SelectNodes("//table"))
{
Console.WriteLine(c.InnerText);
}
}
}
It works but it also gets the table under Grades h2
Note that the div is not hierarchically inside the header, so it doesn't make sense to look for it there.
This can work for you - it finds the next element after the title:
if (o.InnerText.Equals("Results"))
{
var nextDiv = o.NextSibling;
while (nextDiv != null && nextDiv.NodeType != HtmlNodeType.Element)
nextDiv = nextDiv.NextSibling;
// nextDiv should be correct here.
}
You can also write a more specific xpath to find just that div:
doc.DocumentNode.SelectNodes("//h2[text()='Results']/following-sibling::div[1]");
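A small sketch of how that XPath could be wrapped in a helper that returns the table itself. It assumes the HtmlAgilityPack NuGet package is referenced and that the heading text contains no quote characters; SectionTables and TableUnderHeading are hypothetical names. It uses normalize-space() instead of text() so that stray whitespace around the heading text does not prevent a match, which is a variation on the expression above rather than part of the original answer.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class SectionTables
{
    // Returns the first <table> inside the first <div> that follows the <h2>
    // with the given heading text, or null when there is no such table.
    public static HtmlNode TableUnderHeading(string html, string heading)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // following-sibling::div[1] is the nearest <div> after the heading,
        // so tables under later headings are never reached.
        return doc.DocumentNode.SelectSingleNode(
            "//h2[normalize-space()='" + heading + "']/following-sibling::div[1]//table");
    }
}
```

With the HTML from the question, SectionTables.TableUnderHeading(html, "Results") should return only the first table; check the result for null before reading InnerText.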
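// Alternative: take the first <h2> and search only the <div> that follows it.
var o = doc.DocumentNode.SelectSingleNode("//h2");
if (o != null && o.InnerText.Equals("Results"))
{
// A relative path keeps the search under this heading; "//table" would match every table in the document.
var tables = o.SelectNodes("following-sibling::div[1]//table");
if (tables != null)
{
foreach (var c in tables)
{
Console.WriteLine(c.InnerText);
}
}
}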
