HTML Agility Pack: How to access HTML attributes? - c#

I have the following HTML code:
<tr>
<td headers="header1"><b>TITLE </b></td>
<td headers="header2"></td>
<td headers="header3" class="centrato">23/04/2014</td>
</tr>
I need to store the following in a DataTable:
the href value in the "Link" column;
TITLE in the "Title" column;
23/04/2014 in the "Date" column.
I tried this:
int i = 0;
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//td[@headers='header1']"))
{
table.Rows.Add();
table.Rows[i]["Post"] = node.InnerText;
i++;
}
This code lets me add every title to the DataTable, but I'm not able to add the date and the href. Can you help me, please?

You can do it this way:
int i = 0;
// select every <tr> that contains the specific <td>
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//tr[td[@headers='header1']]"))
{
table.Rows.Add();
// get <td headers='header1'> in the current <tr>
var header1 = node.SelectSingleNode("./td[@headers='header1']");
table.Rows[i]["Title"] = header1.InnerText;
// get the <a> inside header1, then its href attribute value
table.Rows[i]["Link"] = header1.SelectSingleNode(".//a").GetAttributeValue("href", "");
// get the InnerText of <td headers='header3'> in the current <tr>
table.Rows[i]["Date"] = node.SelectSingleNode("./td[@headers='header3']").InnerText;
i++;
}

InnerText only gives you the text between the tags. To access href, id, or any other attribute, use the GetAttributeValue method. Note the leading dot in the XPath expressions below: it keeps each lookup inside the current <tr> instead of searching the whole document.
int i = 0;
foreach (HtmlNode node in tmlDoc.DocumentNode.SelectNodes("//tr"))
{
table.Rows.Add();
table.Rows[i]["Link"] = node.SelectSingleNode(".//a").GetAttributeValue("href", "");
table.Rows[i]["Title"] = node.SelectSingleNode(".//a").InnerText;
table.Rows[i]["Date"] = node.SelectSingleNode(".//td[@headers='header3']").InnerText;
i++;
}

Related

Using HTML Agility Pack to load all data into listboxes?

I have a page with 300-something rows and want to load them all into list boxes, with each column going to a different list.
I want to put the date in one box and the other two numbers in two other boxes.
Example HTML:
<table>
<tr>
<td>01/01/2017</td>
<td>100</td>
<td>500</td>
</tr>
<tr>
<td>01/02/2017</td>
<td>200</td>
<td>400</td>
</tr>
</table>
My code that pulls this:
private void LoadHTML()
{
int count = 0;
var link = @"http://example.com/data";
HtmlWeb Web = new HtmlWeb();
var htmlDoc = Web.Load(link);
var node = htmlDoc.DocumentNode.SelectNodes("//td");
foreach (var x in node)
{
count = count + 1;
if (count > 5)
{
listBox1.Items.Add(x.InnerText);
}
}
}
listBox1 receives all the data from x, since everything is a td. Selecting tr would give me each row, but then I have nothing to split the data with. My data starts after the first 5 cells, hence the count check. There are headers, but I don't know how to pull the data for specific headers in this form.
First, get the tr nodes.
Next, iterate over them and get the td nodes of each row.
var trNodes = htmlDoc.DocumentNode.SelectNodes("//tr");
foreach (var tr in trNodes)
{
var tdNodes = tr.SelectNodes("./td");
// SelectNodes returns null when nothing matches, e.g. for a header row without td cells
if (tdNodes == null || tdNodes.Count < 3) continue;
listBox1.Items.Add(tdNodes[0].InnerText);
listBox2.Items.Add(tdNodes[1].InnerText);
listBox3.Items.Add(tdNodes[2].InnerText);
}

Html Agility Pack parsing table into object

So I have HTML like this:
<tr class="row1">
<td class="id">123</td>
<td class="date">2014-08-08</td>
<td class="time">12:31:25</td>
<td class="notes">something here</td>
</tr>
<tr class="row0">
<td class="id">432</td>
<td class="date">2015-02-09</td>
<td class="time">12:22:21</td>
<td class="notes">something here</td>
</tr>
And it continues like that for each customer row. I want to parse the contents of each table row into an object. I've tried a few methods, but I can't seem to get it to work right.
This is what I have currently
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//table[@id='customerlist']//tr"))
{
Customer cust = new Customer();
foreach (HtmlNode info in row.SelectNodes("//td"))
{
if (info.GetAttributeValue("class", String.Empty) == "id")
{
cust.ID = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "date")
{
cust.DateAdded = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "time")
{
cust.TimeAdded = info.InnerText;
}
if (info.GetAttributeValue("class", String.Empty) == "notes")
{
cust.Notes = info.InnerText;
}
}
Console.WriteLine(cust.ID + " " + cust.TimeAdded + " " + cust.DateAdded + " " + cust.Notes);
}
It works to the point that it prints info of the last row of the table on each loop. I'm just missing something very simple but cannot see what.
Also is my way of creating the object fine, or should I use a constructor and create the object from variables? E.g.
string Notes = String.Empty;
if (info.GetAttributeValue("class", String.Empty) == "notes")
{
Notes = info.InnerText;
}
..
Customer cust = new Customer(id, other_variables, Notes, etc);
Your XPath query is wrong. You need to use td instead of //td:
foreach (HtmlNode info in row.SelectNodes("td"))
Passing //td to SelectNodes() matches all <td> elements in the document, not just those in the current row. Hence your inner loop runs 8 times instead of 4, and the last 4 iterations always override the values previously set in your Customer object.
See XPath Examples
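The difference between the absolute and relative forms can be checked with a small sketch. To stay self-contained it uses System.Xml from the .NET base library rather than Html Agility Pack, on the assumption that both evaluate these XPath expressions the same way; the class name, method name, and two-row markup are made up for illustration.

```csharp
using System;
using System.Xml;

public static class XPathScopeDemo
{
    // Hypothetical two-row table, shaped like the markup in the question.
    private const string Markup =
        "<table id='customerlist'>" +
        "<tr><td class='id'>123</td><td class='notes'>first</td></tr>" +
        "<tr><td class='id'>432</td><td class='notes'>second</td></tr>" +
        "</table>";

    // Counts the nodes matched by an XPath evaluated with the first <tr> as context.
    public static int CountFromFirstRow(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        XmlNode firstRow = doc.SelectSingleNode("//tr");
        return firstRow.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountFromFirstRow("//td"));  // 4: every <td> in the document
        Console.WriteLine(CountFromFirstRow("td"));    // 2: only children of this row
        Console.WriteLine(CountFromFirstRow(".//td")); // 2: descendants of this row
    }
}
```

A leading // always starts at the document root, regardless of which node the query is called on; td or .//td stays within the current row.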

How to parse this HTML text using htmlagilitypack?

So below are the lines of code,
<td class="line1left">SCN02_MS_AddNotes_CAM</td><td class="line1left">798 (6.14%)
</td><td class="line1left">0.9</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
<td class="line1left">SCN05_MS_UpdateCustomer_CAM</td><td class="line1left">888 (6.83%)
</td><td class="line1left">1.0</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
From the first block, I need to get SCN02_MS_AddNotes_CAM and 798. To get 798 I am using this code, but I am getting the (6.14%) also, which I don't want.
var content1 = doc1.DocumentNode.SelectNodes("//td[@class='line1left']")[1].InnerText;
I want to get 798 only. So can anybody help me?
I also want to know how to get the same values from the second block. I was under the impression that the number inside the brackets represents the different occurrences of the class line1left. But here it is representing the different InnerHtml elements.
[1]
Does anybody know how to get this to work?
Thanks a lot in advance!
var line1left_list = (from d in document.DocumentNode.Descendants()
where d.Name == "td" && d.Attributes["class"] != null
&& (d.Attributes["class"].Value == "line1left")
select d);
foreach (HtmlNode line1left in line1left_list)
{
var _link = line1left.Descendants("a").FirstOrDefault();
string linkUrl = "";
string link = "";
if (_link != null)
{
linkUrl = _link.Attributes["href"].Value;
link = _link.InnerText;
}
}
It looks like you want the InnerText of all <td> tags with the class attribute of "line1left", unless that <td> has an <a> inside of it, in which case you want the InnerText of <a>.
Here is an example that will do just that. If the <td> has an <a>, then <a> is selected, otherwise <td> is selected.
HtmlDocument doc1 = new HtmlDocument();
doc1.Load("xmlfile2.xml");
var nodes = doc1.DocumentNode.SelectNodes("(//td[@class='line1left']/a) | (//td[@class='line1left' and not(a)])");
foreach(var node in nodes)
Console.WriteLine(node.InnerText.Trim());
This selects every matching node in the document. You can then use regular C# code to strip the unwanted formatting, such as the percentage in parentheses, from the individual values.
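As one possible way to do that stripping, the sketch below keeps only the text before the first opening parenthesis, which turns "798 (6.14%)" into "798". The class and method names are hypothetical, and the approach assumes the wanted value never contains a parenthesis itself.

```csharp
using System;

public static class CellText
{
    // Returns the text before the first '(' with surrounding whitespace removed.
    // Input without a parenthesis is returned trimmed but otherwise unchanged.
    public static string BeforeParenthesis(string innerText)
    {
        if (innerText == null) return string.Empty;
        int open = innerText.IndexOf('(');
        string head = open >= 0 ? innerText.Substring(0, open) : innerText;
        return head.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(BeforeParenthesis("798 (6.14%)\n"));         // 798
        Console.WriteLine(BeforeParenthesis("SCN02_MS_AddNotes_CAM")); // SCN02_MS_AddNotes_CAM
    }
}
```

In the loop above, this would be applied to node.InnerText in place of the plain Trim() call.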

How to Get element that inside another element by class in HtmlAgilityPack

Hello, I am making an HttpWebResponse request and getting the HTML page with all the data I need, for example a table with date info that I need to save to an array list and then to an XML file.
Example of the HTML page:
<table>
<tr>
<td class="padding5 sorting_1">
<span>01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span>10.03.14</span>
</td>
</tr>
</table>
My code below is not working. I am using the HtmlAgilityPack, and with this I can get info from a span that has a class:
private static List<string> GetListDataByClass(string HtmlSourse, string Class)
{
List<string> data = new List<string>();
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlSourse);
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//span[@class='" + Class + "']"))
{
if(node.InnerText!=null) data.Add(node.InnerText);
}
return data;
}
But in my case it is the td that has the class. I tried:
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//td[@class='" + Class + "']"))
but this did not work.
So I need to read this data to get the dates 01.03.14 and 10.03.14. Any ideas how I can get them?
Just change the XPath query to:
DocToParse.DocumentNode.SelectNodes("//td[@class='" + Class + "']/span")
This will select all the spans that are inside a td element with the corresponding class.
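One thing to watch for with this query: @class='...' compares the whole attribute string, and the td elements in the example carry two classes ("padding5 sorting_1"). The sketch below illustrates the difference between an exact match and a match on a single class token. It uses System.Xml from the .NET base library rather than Html Agility Pack, assuming both evaluate these XPath 1.0 functions the same way; the class and helper names are hypothetical.

```csharp
using System;
using System.Xml;

public static class ClassTokenDemo
{
    // Markup shaped like the example page in the question.
    private const string Markup =
        "<table><tr>" +
        "<td class='padding5 sorting_1'><span>01.03.14</span></td>" +
        "<td class='padding5 sorting_1'><span>10.03.14</span></td>" +
        "</tr></table>";

    // Counts the spans found under td elements that satisfy the given predicate.
    public static int CountSpans(string tdPredicate)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        return doc.SelectNodes("//td[" + tdPredicate + "]/span").Count;
    }

    // Builds a predicate that matches one whitespace-separated class token.
    public static string TokenPredicate(string token)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')";
    }

    public static void Main()
    {
        Console.WriteLine(CountSpans("@class='sorting_1'"));          // 0: not the full attribute value
        Console.WriteLine(CountSpans("@class='padding5 sorting_1'")); // 2: exact string match
        Console.WriteLine(CountSpans(TokenPredicate("sorting_1")));   // 2: single-token match
    }
}
```

So the Class argument either has to be the complete attribute value or the query has to test for a single token as in TokenPredicate.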

How to Get element by class in HtmlAgilityPack

Hello, I am making an HttpWebResponse request and getting the HTML page with all the data I need, for example a table with date info that I need to save to an array list and then to an XML file.
Example of the HTML page:
<table>
<tr>
<td class="padding5 sorting_1">
<span class="DateHover">01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span class="DateHover" >10.03.14</span>
</td>
</tr>
</table>
My code below is not working. I am using the HtmlAgilityPack:
private static string GetDataByIClass(string HtmlIn, string ClassToGet)
{
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlIn);
HtmlAgilityPack.HtmlNode InputNode = DocToParse.GetElementbyId(ClassToGet); // here is the problem: there is no DocToParse.GetElementbyClass method
if (InputNode != null)
{
if (InputNode.Attributes["value"].Value != null)
{
return InputNode.Attributes["value"].Value;
}
}
return null;
}
So I need to read this data to get the dates 01.03.14 and 10.03.14, so that I can save them to an array list (and then to an XML file).
Any ideas how I can get these dates?
Html Agility Pack has XPATH support, so you can do something like this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span[@class='" + ClassToGet + "']"))
{
string value = node.InnerText;
// etc...
}
This means: starting from the top of the document (first /) and searching recursively (second /), get all span elements whose class attribute equals the given value. Then, for each element, get the inner text.
