How to Get element by class in HtmlAgilityPack - c#

Hello i making HttpWebResponse and getting the HtmlPage with all data that i need for example table with date info that i need to save them to array list and save it to xml file
Example of html Page
<table>
<tr>
<td class="padding5 sorting_1">
<span class="DateHover">01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span class="DateHover" >10.03.14</span>
</td>
</tr>
</table>
my code that not working i using the HtmlAgilityPack
private static string GetDataByIClass(string HtmlIn, string ClassToGet)
{
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlIn);
HtmlAgilityPack.HtmlNode InputNode = DocToParse.GetElementbyId(ClassToGet);//here is the problem i dont have method DocToParse.GetElementbyClass
if (InputNode != null)
{
if (InputNode.Attributes["value"].Value != null)
{
return InputNode.Attributes["value"].Value;
}
}
return null;
}
Sow i need to read this data to get the date 01.03.14 and 10.02.14 for be able to save this to array list (and then to xml file)
Sow any ideas how can i get this dates(01.03.14 and 10.02.14)?

Html Agility Pack has XPATH support, so you can do something like this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span[#class='" + ClassToGet + "']"))
{
string value = node.InnerText;
// etc...
}
This means: get all SPAN elements from the top of the document (first /), recursively (second /) that have a given CLASS attribute. Then for each element, get the inner text.

Related

I need to get a text without tags in my HTML, using xpath

I can get to the tag, the problem is that I cant get the text right after the tag.
I've already tried:
y.SelectSingleNode($"td[5]/font/b[.='{labelCampo}']")
y.SelectSingleNode($"td[5]/font/b[.='{labelCampo}']/text()").InnerText
And many other forms.
<td align="left" width="623">
<font class="normal">
<b>Protocolo:</b>
"850160251675 (09/11/2016)"
<br>
I want to get this number and date and not the text "Protocolo:"
enter code here
You can select "normal" class and use innerText attribute to take number and date.
I have tried this code and it's totally work.
var html =
#"<html>
<body><p>
<table class=\'foo\'>
<tr><th>hello</th></tr>
<tr><td>world</td></tr>
</table>
<div class=\'test\'>
<b>Protocolo:</b>
850160251675 (09/11/2016)
</div>
</body></html>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var body = htmlDoc.DocumentNode.SelectSingleNode("//div[contains(#class, 'test')]");
Console.WriteLine(body.InnerText);
https://dotnetfiddle.net/d4chXg
I solve it!
I just needed to use the .Nextsibiling.innertext in my original statement
y.SelectSingleNode($"td[5]/font/b[.='{labelCampo}']").NextSibling.InnerText

How to extract a text version of the HTML in an XML document?

Suppose I have an XML document that looks something like (basically represents an HTML report):
<html>
<head>...</head>
<body>
<div>
<table>
<tr>
<td>Stuff</td>
</tr>
<tr>
<td>More stuff<br /><br />More stuff on another line and some whitespace... </td>
</tr>
<tr>
<td> Some leading whitespace before this stuff<br />Stuff</td>
</tr>
</table>
</div>
</body>
</html>
I want to (using C#) convert this document into a simple text string that looks something like:
Stuff
More stuff
More stuff on another line and some whitespace...
Some leading whitespace before this stuff
Stuff
It should be smart enough to convert table rows into new lines and insert new lines where any inline br tags were added within a cell. It should also keep any whitespace in the table cells intact. I tried using the XmlDocument class and used the InnerText method on the body node, but it doesn't seem to create the output I am looking for (newlines and whitespace are not intact). Is there a simple way to do this? I know one way to do this would be to extract the HTML as one string and do several regular expressions on it to handle the newlines and whitespace. Thanks!
Try this please:
var doc = XElement.Load("test.xml");
var sb = new StringBuilder();
foreach (var text in doc.DescendantNodes().Where(node => node.NodeType == XmlNodeType.Text))
{
sb.AppendLine(((XText)text).Value);
}
More concise:
foreach (var text in doc.DescendantNodes().OfType<XText>())
{
sb.AppendLine(text.ToString());
}

How to parse this HTML text using htmlagilitypack?

So below are the lines of code,
<td class="line1left">SCN02_MS_AddNotes_CAM</td><td class="line1left">798 (6.14%)
</td><td class="line1left">0.9</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
<td class="line1left">SCN05_MS_UpdateCustomer_CAM</td><td class="line1left">888 (6.83%)
</td><td class="line1left">1.0</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
From the first block, I need to get SCN02_MS_AddNotes_CAM and 798. To get 798 I am using this code, but I am getting the (6.14%) also, which I don't want.
var content1 = doc1.DocumentNode.SelectNodes("//td[#class='line1left']")[1].InnerText;
I want to get 798 only. So can anybody help me?
I also want to know how to get the same values from the second block. I was under the impression that the number inside the brackets represents the different occurrences of the class line1left. But here it is representing the different InnerHtml elements.
[1]
Does anybody know how to get this to work?
Thanks a lot in advance.!
var line1left_list = (from d in document.DocumentNode.Descendants()
where d.Name == "td " && d.Attributes["class"] != null
&& (d.Attributes["class"].Value == "line1left")
select d);
foreach (HtmlNode line1left in line1left_list)
{
var _link = line1left.Descendants("a").FirstOrDefault();
string linkUrl = "";
string link = "";
if (_link != null)
{
linkUrl = _link.Attributes["href"].Value;
link = _link.InnerText
}
}
It looks like you want the InnerText of all <td> tags with the class attribute of "line1left", unless that <td> has an <a> inside of it, in which case you want the InnerText of <a>.
Here is an example that will do just that. If the <td> has an <a>, then <a> is selected, otherwise <td> is selected.
HtmlDocument doc1 = new HtmlDocument();
doc1.Load("xmlfile2.xml");
var nodes = doc1.DocumentNode.SelectNodes("(//td[#class='line1left']/a) | (//td[#class='line1left' and not(a)])");
foreach(var node in nodes)
Console.WriteLine(node.InnerText.Trim());
This will select all the nodes in the document. You can use regular C# code to strip off the unwanted formatting on the individual values.

How to Get element that inside another element by class in HtmlAgilityPack

Hello i making HttpWebResponse and getting the HtmlPage with all data that i need for example table with date info that i need to save them to array list and save it to xml file
Example of html Page
<table>
<tr>
<td class="padding5 sorting_1">
<span>01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span>10.03.14</span>
</td>
</tr>
</table>
my code that not working i using the HtmlAgilityPack,with this i can get info from span that have class
private static List<string> GetListDataByClass(string HtmlSourse, string Class)
{
List<string> data = new List<string>();
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlSourse);
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//span[#class='" + Class + "']"))
{
if(node.InnerText!=null) data.Add(node.InnerText);
}
return data;
}
,but in my case td have the class i tryied
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//td[#class='" + Class + "']"))
but this not worked
Sow i need to read this data to get the date 01.03.14 and 10.02.14
Sow any ideas how can i get this dates(01.03.14 and 10.02.14)?
Just change the XPath query to:
DocToParse.DocumentNode.SelectNodes("//td[#class='" + Class + "']/span")
This will select all the spans that are inside a td element with the corresponding class.

xpath and htmlagility pack

I figured it out! I will leave this posted just in case some other newbie like myself has the same question.
Answer: **("./td[2]/span[#class='smallfont']")***
I am a novice at xpath and html agility. I am so close yet so far.
GOAL: to pull out 4:30am
by using the following with htmlagility pack:
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[#id='weekdays']/tr[2]")){
string time = table.SelectSingleNode("./td[2]").InnerText;
I get it down to "\r\n\t\t\r\n\t\t\t4:30am\r\n\t\t\r\n\t" when I try doing anything with the span I get xpath exceptions. What must I add to the ("./td[2]") to just end up with the 4:30am?
HTML
<td class="alt1 espace" nowrap="nowrap" style="text-align: center;">
<span class="smallfont">4:30am</span>
</td>
I don't know if Linq is an option, but you could have also done something like this:
var time = string.Empty;
var html =
"<td class=\"alt1 espace\" nowrap=\"nowrap\" style=\"text-align: center;\"><span class=\"smallfont\">4:30am</span></td>";
var document = new HtmlDocument() { OptionWriteEmptyNodes = true, OptionOutputAsXml = true };
document.LoadHtml(html);
var timeSpan =
document.DocumentNode.Descendants("span").Where(
n => n.Attributes["class"] != null && n.Attributes["class"].Value == "smallfont").FirstOrDefault();
if (timeSpan != null)
time = timeSpan.InnerHtml;

Categories