HtmlAgilityPack Get all links inside a DIV - c#

I want to be able to get 2 links from inside a div.
Currently I can select one but whene there's more it doesn't seem to work.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[#class='myclass']");
if (node != null)
{
foreach (HtmlNode type in node.SelectNodes("//a#href"))
{
recipe.type += type.InnerText;
}
}
else
recipe.type = "Error fetching type.";
Trying to get it from this piece of HTML:
<div class="myclass">
<h3>Not Relevant Header</h3>
This text,
and this text
</div>
Any help is appreciated, Thanks in advance.

var div = doc.DocumentNode.SelectSingleNode("//div[#class='myclass']");
if(div!=null)
{
var links = div.Descendants("a")
.Select(a => a.InnerText)
.ToList();
}

Use this XPath:
//div[#class = 'myclass']//a
It grabs all descendant a elements in div with class = 'myclass'.
And //a#href is incorrect XPath.

Use:
//div[contains(concat(' ', #class, ' '), ' myclass ')]//a
This selects any a element that is a descendant of any div whose class attribute contains a classname of "myclass".
The classname may be single, or the attribute may also contain other classnames. In this case the classname may be the starting one, or the last one or may be surrounded by other classnames -- the above XPath expression correctly selects the wanted nodes in all of these different cases.

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML, I used the following code but I get all of it in one item instead of getting the items individually
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive 1 message for the first item and the second message displays all items
When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[#class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[#class='search-result']") -> retrieves the div that has all the search results and ignores the rest of the document.
Iterates through all the "subnodes" only that have href in it and adds it to a list. Subnodes are determined based on the dot notation SelectNodes(".//a") (Instead of .//, if you do //, it will search the entire page which is not what you want).
If statement makes sure its only adding unique non-null values.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link doc.DocumentNode.SelectNodes("//a[#href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

get all nodes and its content using htmldocument/HtmlAgilityPack

I need to get all nodes from a html, then from that nodes I need to get the text and sub-nodes, and the same thing but from that sub-sub-nodes.
For example, I have this HTML:
<p>This <b>is a Link</b> with <b>bold</b></p>
So I need a way to get the p node, then the non-formatted text (this), the only-bold text (is a), the bolded link (Link) and the rest formatted and not formatted text.
I know that with the htmldocument I can select all nodes and sub-nodes, but, how Can I get the text before the sub-node, then the sub-node, and its text/sub-nodes so I can make the rendered version of the html ("This is a Link with bold")?
Please note that the above example is a simple one. The HTML would have more complex things like list, frames, numbered list, triple-formatted text, etc. Also note that the rendered thing is not a problem. I have already done that but in another way. What I need is the part to get the nodes and its content only.
Also, I can't ignore any node, so I can't filter by nothing. And the main node could start as p, div, frame, ul, etc.
After looking in the htmldoc and its properties, and thanks to #HungCao 's observation, I got a working simple way to interpretate a HTML code.
My code is a little more complex to add it as example, so I will post a lite version of it.
First of all, the htmlDoc has to be loaded. It could be on any function:
HtmlDocument htmlDoc = new HtmlDocument();
string html = #"<p>This <b>is a Link</b> with <b>bold</b></p>";
htmlDoc.LoadHtml(html);
Then we need to interpretate each "main" node (p in this case) and, depending its type, we need to load a LoopFunction (InterNode)
HtmlNodeCollection nodes = htmlDoc.DocumentNode.ChildNodes;
foreach (HtmlNode node in nodes)
{
if(node.Name.ToLower() == "p") //Low the typeName just in case
{
Paragraph newPPara = new Paragraph();
foreach(HtmlNode childNode in node.ChildNodes)
{
InterNode(childNode, ref newPPara);
}
richTextBlock.Blocks.Add(newPPara);
}
}
Please note that there is a property called "NodeType", but it will not return the correct type. So, instead use the "Name" property (Also note that the Name property in htmlNode is not the same as the Name attribute in HTML).
Finally, we have the InterNode function that will add inlines to the referred (ref) Paragraph
public bool InterNode(HtmlNode htmlNode, ref Paragraph originalPar)
{
string htmlNodeName = htmlNode.Name.ToLower();
List<string> nodeAttList = new List<string>();
HtmlNode parentNode = htmlNode.ParentNode;
while (parentNode != null) {
nodeAttList.Add(parentNode.Name);
parentNode = parentNode.ParentNode;
} //we need to get it multiple types, because it could be b(old) and i(talic) at the same time.
Inline newRun = new Run();
foreach (string noteAttStr in nodeAttList) //with this we can set all the attributes to the inline
{
switch (noteAttStr)
{
case ("b"):
case ("strong"):
{
newRun.FontWeight = FontWeights.Bold;
break;
}
case ("i"):
case ("em"):
{
newRun.FontStyle = FontStyle.Italic;
break;
}
}
}
if(htmlNodeName == "#text") //the #text means that its a text node. Like <i><#text/></i>. Thanks #HungCao
{
((Run)newRun).Text = htmlNode.InnerText;
} else //if it is not a #text, don't load its innertext, as it's another node and it will always have a #text node as a child (if it has any text)
{
foreach (HtmlNode childNode in htmlNode.ChildNodes)
{
InterNode(childNode, ref originalPar);
}
}
return true;
}
Note: I know that I said that my app need to render the HTML in another way that a webview does, and I know that this example code generate the same thing as a Webview, but, as I said before, this is just a lite version of my final code. In fact, my original/full code is working as I need to and this is just the base.

Retrieving value of element using HTMLAgility pack

I am using HTMLAgility pack to parse html and then using xpath retrieve a table column with a specific class.
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("www.url.com");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("(//td[#class='titleColumn'])[2]"))
{
Response.Write(row.InnerHtml + "<br />");
}
I retrieve the data and row.Innerhtml looks like this.
<a>Title</a> <span>Year</span><br />
I want to save the value of a and span element in separate string variables. Please help
Your xpath expression selects the second <td> that has the class titleColumn. According to the node's inner html, this <td> hode has two child nodes: <a> and <span>. So you could easily find these nodes, and then put inner text (or inner html) into your string variables. See, this:
foreach (var row in doc.DocumentNode.SelectNodes("(//td[#class='titleColumn'])[2]"))
{
var a = row.SelectSingleNode("a");
var span = row.SelectSingleNode("span");
Console.WriteLine(a.InnerText);
Console.WriteLine(span.InnerText);
}
will output:
Title
Year

GetElementsByTagName in Htmlagilitypack

How do I select an element for e.g. textbox if I don't know its id?
If I know its id then I can simply write:
HtmlAgilityPack.HtmlNode node = doc.GetElementbyId(id);
But I don't know textbox's ID and I can't find GetElementsByTagName method in HtmlagilityPack which is available in webbrowser control.
In web browser control I could have simply written:
HtmlElementCollection elements = browser[i].Document.GetElementsByTagName("form");
foreach (HtmlElement currentElement in elements)
{
}
EDIT
Here is the HTML form I am talking about
<form id="searchform" method="get" action="/test.php">
<input name="sometext" type="text">
</form>
Please note I don't know the ID of form. And there can be several forms on same page. The only thing I know is "sometext" and I want to get this element using just this name. So I guess I will have to parse all forms one by one and then find this name "sometext" but how do I do that?
If you're looking for the tag by its tagName (such as form for <form name="someForm">), then you can use:
var forms = document.DocumentNode.Descendants("form");
If you're looking for the tag by its name property (such as someForm for <form name="someForm">, then you can use:
var forms = document.DocumentNode.Descendants().Where(node => node.Name == "formName");
For the last one you could create a simple extension method:
public static class HtmlNodeExtensions
{
public static IEnumerable<HtmlNode> GetElementsByName(this HtmlNode parent, string name)
{
return parent.Descendants().Where(node => node.Name == name);
}
public static IEnumerable<HtmlNode> GetElementsByTagName(this HtmlNode parent, string name)
{
return parent.Descendants(name);
}
}
Note: You can also use SelectNodes and XPath to query your document:
var nodes = doc.DocumentNode.SelectNodes("//form//input");
Would give you all inputs on the page that are in a form tag.
var nodes = doc.DocumentNode.SelectNodes("//form[1]//input");
Would give you all the inputs of the first form on the page
I think you are looking for something like this
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("....");
var inputs = doc.DocumentNode.Descendants("input")
.Where(n => n.Attributes["name"]!=null && n.Attributes["name"].Value == "sometext")
.ToArray();
Any node by name:
doc.DocumentNode.SelectNodes("//*[#name='name']")
Input nodes by name:
doc.DocumentNode.SelectNodes("//input[#name='name']")

HtmlAgilityPack set node InnerText

I want to replace inner text of HTML tags with another text.
I am using HtmlAgilityPack
I use this code to extract all texts
HtmlDocument doc = new HtmlDocument();
doc.Load("some path")
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']")) {
// How to replace node.InnerText with some text ?
}
But InnerText is readonly. How can I replace texts with another text and save them to file ?
Try code below. It select all nodes without children and filtered out script nodes. Maybe you need to add some additional filtering. In addition to your XPath expression this one also looking for leaf nodes and filter out text content of <script> tags.
var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(htmlNode.InnerText + "_translated"), htmlNode);
}
Strange, but I found that InnerHtml isn't readonly. And when I tried to set it like that
aElement.InnerHtml = "sometext";
the value of InnerText also changed to "sometext"
The HtmlTextNode class has a Text property* which works perfectly for this purpose.
Here's an example:
var textNodes = doc.DocumentNode.SelectNodes("//body/text()").Cast<HtmlTextNode>();
foreach (var node in textNodes)
{
node.Text = node.Text.Replace("foo", "bar");
}
And if we have an HtmlNode that we want to change its direct text, we can do something like the following:
HtmlNode node = //...
var textNode = (HtmlTextNode)node.SelectSingleNode("text()");
textNode.Text = "new text";
Or we can use node.SelectNodes("text()") in case it has more than one.
* Not to be confused with the readonly InnerText property.

Categories