Cannot extract <link> element using HtmlAgilityPack and XPath

I am using the Html Agility Pack to select textual data from RSS XML. For every other node type (title, pubDate, guid, etc.) I can select the inner text using XPath, but querying "//link" or "item/link" returns empty strings.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    // Create a new document.
    var document = new HtmlDocument();
    // Populate the document with an RSS file.
    document.LoadHtml(rssSource);
    // Select all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("item/link");
    // If zero nodes were found, return an empty list; otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}
Does anybody understand why this element behaves differently from the others?
Additional: Running "item/link" returns zero nodes. Running "//link" returns the correct number of nodes, but the inner text of each is zero characters long.
Using the test data below, "//name" returns a single record for "Fred", but "//link" returns a single record with an empty string.
<site><link>Hello World</link><name>Fred</name></site>
I am certain it is because of the word "link". If I change it to "linkz", it works perfectly.
The workaround below works perfectly. However, I would like to understand why searching on "//link" does not work the way it does for other elements.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    rssSource = rssSource.Replace("<link>", "<link-renamed>");
    rssSource = rssSource.Replace("</link>", "</link-renamed>");
    // Create a new document.
    var document = new HtmlDocument();
    // Populate the document with an RSS file.
    document.LoadHtml(rssSource);
    // Select all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
    // If zero nodes were found, return an empty list; otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

If you print DocumentNode.OuterHtml, you will see the problem:
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
output :
<site><link>Hello World<name>Fred</name></site>
link happens to be one of the special tags* that HAP treats as an empty (self-closing) element, so the closing tag is dropped and the text ends up outside the element. You can alter this behavior by changing ElementsFlags before parsing the HTML, for example:
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
// Remove link from the list of special tags.
// ElementsFlags is static, so this affects every document parsed afterwards.
HtmlNode.ElementsFlags.Remove("link");
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
    Console.WriteLine(link.InnerText);
}
Dotnetfiddle Demo
output :
<site><link>Hello World</link><name>Fred</name></site>
Hello World
*) The complete list of special tags that are included in the ElementsFlags dictionary by default, besides link, can be seen in the source code of HtmlNode.cs. Some of the most common among them are <meta>, <img>, <frame>, <input>, <form>, <option>, etc.
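Because an RSS feed is XML rather than HTML, another option is to sidestep the HTML parser's tag rules and read the feed with an XML parser. The following is only a minimal sketch under that assumption, using System.Xml.Linq. The class name RssLinks and the sample feed are made up for illustration, and it assumes a well-formed RSS 2.0 feed whose elements carry no XML namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssLinks
{
    // Hypothetical helper: returns the text of every <link> that is a direct child of an <item>.
    public static List<string> ExtractAllLinks(string rssSource)
    {
        // XDocument.Parse throws an XmlException if the input is not well-formed XML.
        XDocument document = XDocument.Parse(rssSource);

        return document.Descendants("item")
                       .Elements("link")
                       .Select(link => link.Value.Trim())
                       .ToList();
    }

    public static void Main()
    {
        // Made-up sample feed, for illustration only.
        const string rss =
            @"<rss version=""2.0""><channel>
                <title>Sample</title>
                <link>http://example.com/</link>
                <item><title>First</title><link>http://example.com/1</link></item>
                <item><title>Second</title><link>http://example.com/2</link></item>
              </channel></rss>";

        foreach (string link in ExtractAllLinks(rss))
        {
            Console.WriteLine(link);
        }
    }
}
```

Feeds found in the wild are not always well-formed XML. In that case the more forgiving HAP parser, with "link" removed from ElementsFlags, is probably the safer route.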

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually.
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
    doc.DocumentNode.Descendants()
        .Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
    string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
    MessageBox.Show(itemx);
    MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all of the items.

When you search the page for elements with class 'search-result', only one node is returned. Instead of iterating through its children, you loop over that single div, which is why you get only one result.
If you want a list of all the links inside the div with class "search-result", you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the descendant nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
    string thisUrl = node.GetAttributeValue("href", "");
    if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
        listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") retrieves the div that holds all the search results and ignores the rest of the document.
The loop iterates only over the descendant nodes with the 'a' tag and adds each href to a list. The descendants are selected by the leading dot in SelectNodes(".//a"); with // instead of .// it would search the entire page, which is not what you want.
The if statement makes sure that only unique, non-empty values are added.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
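To make the difference between .// and // concrete, here is a small sketch on made-up markup (it does not use the real page). It assumes the HtmlAgilityPack NuGet package is referenced, and the class name RelativeXPathDemo is invented for the example. As described above, the scoped query should find only the links inside the div, while the unscoped one should also pick up the link outside it.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

public static class RelativeXPathDemo
{
    public static void Main()
    {
        // Made-up markup: two links inside the container, one outside it.
        const string html =
            @"<div class=""search-result""><a href=""/one"">One</a><a href=""/two"">Two</a></div>
              <a href=""/elsewhere"">Elsewhere</a>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode container = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");

        // ".//a" is evaluated relative to the container node.
        HtmlNodeCollection scoped = container.SelectNodes(".//a");
        // "//a" starts again from the document root, even though it is called on the container.
        HtmlNodeCollection unscoped = container.SelectNodes("//a");

        Console.WriteLine($"scoped: {scoped.Count}, unscoped: {unscoped.Count}");
    }
}
```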

I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string hrefValue = link.GetAttributeValue("href", string.Empty);
    MessageBox.Show(hrefValue);
    MessageBox.Show(link.InnerText);
}

HtmlAgilityPack cannot find node

I am trying to get the span called "Start" from here.
Chrome gives me this XPath: //*[@id="guide-pages"]/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/div[2]/div[1]/h2
But HtmlAgilityPack returns null. After removing the steps one by one, this works: //*[@id="guide-pages"]/div[2]/div[1], but nothing longer does.
My full Code:
HtmlDocument doc = new HtmlDocument();
var text = await ReadUrl();
doc.LoadHtml(text);
Console.WriteLine($"Getting Data From: {doc.DocumentNode.SelectSingleNode("//head/title").InnerText}"); //Works fine
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id='guide-pages']/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/div[2]/div[1]/h2") == null);
Output:
Getting Data From: Miss Fortune Build Guide : [7.11] KOREAN MF Build - Destroy the Carry! [Added Support] :: League of Legends Strategy Builds
True

Don't use the XPath from Chrome. Use LINQ with HtmlAgilityPack instead. For example,
.Descendants("div") gives you all the divs under an HTML node. Each node carries metadata such as its id and attributes (classes, ...), so you can query for the div you want from there.
This is a handy method to check whether an HtmlNode has all of the given classes.
public static bool HasClass(this HtmlNode node, params string[] classValueArray)
{
    var classValue = node.GetAttributeValue("class", "");
    var classValues = classValue.Split(' ');
    return classValueArray.All(c => classValues.Contains(c));
}
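As a rough sketch of what such a LINQ query might look like: the markup below is invented for illustration, since the real page structure is not shown in the question. HasAllClasses is a hypothetical name, chosen so that it does not clash with a HasClass method that HtmlAgilityPack itself may provide. The HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

public static class LinqQueryDemo
{
    // Hypothetical helper: true when the node carries every class name given.
    public static bool HasAllClasses(HtmlNode node, params string[] wanted)
    {
        string[] actual = node.GetAttributeValue("class", "")
                              .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return wanted.All(c => actual.Contains(c));
    }

    public static void Main()
    {
        // Invented markup, standing in for the real page.
        const string html =
            @"<div id=""guide-pages"">
                <div class=""item title""><h2>Start</h2></div>
                <div class=""item""><h2>Core</h2></div>
              </div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Find the container by id, then search its descendants instead of
        // relying on a fixed div[2]/div[1]/... path.
        HtmlNode guide = doc.DocumentNode.Descendants("div")
            .FirstOrDefault(d => d.GetAttributeValue("id", "") == "guide-pages");

        if (guide == null)
        {
            Console.WriteLine("Container not found");
            return;
        }

        // Keep only the headings whose parent div has both classes.
        foreach (HtmlNode heading in guide.Descendants("h2")
                     .Where(h => HasAllClasses(h.ParentNode, "item", "title")))
        {
            Console.WriteLine(heading.InnerText.Trim());
        }
    }
}
```

Querying by id and class like this tends to survive small layout changes better than a long positional path.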

How do I read the value for an XML node and check if the value is correct

I am fairly new to programming and need to check a single node in an XML file for a certain value and verify that the value is correct.
I need to validate these 3 nodes in 3 different classes:
<RunCodeAnalysis>true</RunCodeAnalysis>
I need to select this specific node from an XML file and check whether its value is true. I hope to get some help with validating that the value is actually true.
private void CodeAnalysisEnabled(XDocument xmlDoc)
{
    var codeAnalysis = from doc in xmlDoc.Root?.Descendants("RunCodeAnalysis") select doc;
    foreach (var codeAnalysisNode in codeAnalysis)
    {
        codeAnalysisNode.Value = "true";
    }
}

Load the XML and use XPath to search for tags:
var document = new XmlDocument();
document.Load(fileName);
var root = document.DocumentElement;
var nodes = root.SelectNodes("//RunCodeAnalysis");
This sample selects all RunCodeAnalysis nodes from the XML. Using XPath you can select exactly what you want. Check this MSDN article.
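Selecting the nodes is only half of the question; the value still has to be checked. Below is a minimal sketch of one way to do that with XDocument, which the question already uses. The method name IsCodeAnalysisEnabled and the sample XML are assumptions made for illustration. Elements are matched by local name in case the file declares a default XML namespace, as MSBuild project files often do.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CodeAnalysisCheck
{
    // Hypothetical helper: true only if at least one RunCodeAnalysis element
    // exists and every one of them contains "true" (case-insensitive).
    public static bool IsCodeAnalysisEnabled(XDocument xmlDoc)
    {
        var nodes = xmlDoc.Descendants()
                          .Where(e => e.Name.LocalName == "RunCodeAnalysis")
                          .ToList();

        return nodes.Count > 0 &&
               nodes.All(e => string.Equals(e.Value.Trim(), "true",
                                            StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        // Made-up sample input with a default namespace.
        const string xml =
            @"<Project xmlns=""http://schemas.microsoft.com/developer/msbuild/2003"">
                <PropertyGroup>
                  <RunCodeAnalysis>true</RunCodeAnalysis>
                </PropertyGroup>
              </Project>";

        Console.WriteLine(IsCodeAnalysisEnabled(XDocument.Parse(xml)));
    }
}
```

Note that the method in the question assigns "true" to the node rather than reading it, so it would overwrite the value instead of validating it.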

Get text that lies after pattern without class or id

I am using the HtmlAgilityPack.
It is an excellent tool for parsing data. However, every time I have used it, I have had either a class or an id to aim at, e.g.:
string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();
However, I have come across a piece of text that isn't nested in any element with a class or id I can aim at, e.g.:
<p>Example Header</p>: This is the text I want!<br>
The example does always follow the same pattern, though: the text is always after </p>: and before <br>.
I can extract the text using a regular expression, but I would prefer to use the Agility Pack, as the rest of the code does. Is there a means of doing this using the pack?

This XPath works for me:
var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() selects all text nodes that are direct children of the <div>
[(normalize-space())] excludes all text nodes that contain only whitespace (two are excluded in this HTML sample: the newline before <p> and the one after <br>)
Result :
UPDATE I :
Every element must have a parent, like the <div> in the example above. If it is the root node you're talking about, the same approach should still work. The key is to use the /text() XPath to get the text node:
var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
UPDATE II :
OK, so you want to select the text node after the <p> element and before the <br> element. You can use this XPath then:
var result =
    doc.DocumentNode
        .SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
        .OuterHtml;
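The same idea can also be expressed without the sibling axes, by locating the <p> and stepping to the node that follows it. This is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced and that the text node comes immediately after </p>, as in the sample. The class name and the trimming of the leading colon are additions for illustration. On the sample input it should print the wanted sentence without the colon.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

public static class TextAfterParagraphDemo
{
    public static void Main()
    {
        const string html = @"<p>Example Header</p>: This is the text I want!<br>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Locate the header paragraph, then look at the node right after it.
        HtmlNode header = doc.DocumentNode.SelectSingleNode("//p");
        HtmlNode next = header?.NextSibling;

        if (next != null && next.NodeType == HtmlNodeType.Text)
        {
            // Drop surrounding whitespace and the leading colon.
            string text = next.InnerText.Trim().TrimStart(':').Trim();
            Console.WriteLine(text);
        }
        else
        {
            Console.WriteLine("No text node found directly after <p>");
        }
    }
}
```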

Getting HtmlElement based on HtmlAgilityPack.HtmlNode

I use HtmlAgilityPack to parse the HTML document of a WebBrowser control.
I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to return the corresponding HtmlElement in the WebBrowser control's Document.
HtmlAgilityPack parses an offline copy of the live document, whereas I want to access live elements of the WebBrowser control to read rendered attributes such as currentStyle or runtimeStyle.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p");
// this selection could be more sophisticated
// and the answer shouldn't rely on it.
foreach (HtmlNode node in some_nodes)
{
    HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
    // CorrespondingElementFromWebBrowserControl is what I am searching for
}
If the element had a specific attribute, it would be easy, but I want a solution that works on any element.
Please help me with what I can do about it.

There seems to be no way to change the document directly in the WebBrowser control.
But you can extract the HTML from it, manipulate it, and write it back again like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
    node.Attributes.Add("TEST", "TEST");
}
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
    doc.Save(sw);
    webBrowser1.DocumentText = sb.ToString();
}
For direct manipulation you could perhaps use the unmanaged pointer webBrowser1.Document.DomDocument, but this is outside my knowledge.

HtmlAgilityPack definitely can't provide access to nodes in the live HTML directly. Since you said there is no distinct style/class/id on the element, you have to walk through the nodes manually and find matches.
Assuming the HTML is reasonably valid (so that both the browser and HtmlAgilityPack normalize it similarly), you can walk pairs of elements starting from the root of both trees, selecting the same child node at each level.
Basically, you can build a "position-based" XPath to the node in one tree and select it in the other tree. The XPath would look something like one of these (depending on whether you want to match on positions only, or on position and node name):
"/*[1]/*[4]/*[2]/*[7]"
"/body/div[2]/span[1]/p[3]"
Steps:
1. For the HtmlNode you've found, collect all parent nodes up to the root.
2. Get the root element of the HTML in the browser.
3. For each level of children, find the position of the corresponding child (from the collection of HtmlNodes in step 1) within its parent, then find the live HtmlElement at that position among the children of the current live node.
4. Move to the newly found child and go back to step 3 until you reach the node you are looking for.
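The HtmlAgilityPack half of these steps can be sketched as follows. This is an illustration only: GetElementIndexPath is a hypothetical helper, the HtmlAgilityPack NuGet package is assumed to be referenced, and only element nodes are counted, on the assumption that the live side would be walked through element-only collections such as HtmlElement.Children. The live-side walk is left out because it needs a WebBrowser control.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

public static class PositionPathDemo
{
    // Hypothetical helper: zero-based index of each ancestor (and of the node itself)
    // among the element children of its parent, ordered from the root downwards.
    public static List<int> GetElementIndexPath(HtmlNode node)
    {
        var path = new List<int>();

        for (HtmlNode current = node; current.ParentNode != null; current = current.ParentNode)
        {
            int index = 0;
            foreach (HtmlNode sibling in current.ParentNode.ChildNodes)
            {
                if (ReferenceEquals(sibling, current))
                {
                    break;
                }
                // Skip text and comment nodes so that only elements are counted.
                if (sibling.NodeType == HtmlNodeType.Element)
                {
                    index++;
                }
            }
            path.Add(index);
        }

        path.Reverse(); // collected leaf-to-root, so flip to root-to-leaf
        return path;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><p>first</p><p>second</p></div>");

        HtmlNode target = doc.DocumentNode.SelectSingleNode("//p[2]");
        Console.WriteLine(string.Join("/", GetElementIndexPath(target))); // prints 0/1
    }
}
```

If the HtmlAgilityPack document was loaded from Body.InnerHtml, as in the question, the first index is relative to the live body element rather than to the live document root.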

The XPath property of HtmlAgilityPack.HtmlNode shows the nodes on the path from the root to the node, for example /div[1]/div[2]/table[1]. You can traverse this path in the live document to find the corresponding live element. However, this path may not be precise, because HtmlAgilityPack drops some tags such as <form>. Before using this solution, add the omitted tags back using
HtmlNode.ElementsFlags.Remove("form");
// Structure to hold the name and position of each node in the path
struct DocNode
{
    public string Name;
    public int Pos;
}
The following method finds the live element according to the XPath:
// HtmlNode is the HtmlAgilityPack type; HtmlElement and HtmlDocument are the Windows Forms types.
static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
{
    var pattern = @"/(.*?)\[(.*?)\]"; // like div[1]
    // Parse the XPath to extract the nodes on the path
    var matches = Regex.Matches(node.XPath, pattern);
    List<DocNode> PathToNode = new List<DocNode>();
    foreach (Match m in matches) // Make a path of nodes
    {
        DocNode n = new DocNode();
        n.Name = m.Groups[1].Value;
        n.Pos = Convert.ToInt32(m.Groups[2].Value) - 1; // XPath positions are 1-based
        PathToNode.Add(n); // add the node to path
    }
    HtmlElement elem = null; // Traverse to the element using the path
    if (PathToNode.Count > 0)
    {
        elem = doc.Body; // begin from the body
        foreach (DocNode n in PathToNode)
        {
            // Find the corresponding child by its name and position
            elem = GetChild(elem, n);
        }
    }
    return elem;
}
The code for the GetChild method used above:
public static HtmlElement GetChild(HtmlElement el, DocNode node)
{
    // Find the corresponding child of the element
    // based on the name and position of the node
    int childPos = 0;
    foreach (HtmlElement child in el.Children)
    {
        if (child.TagName.Equals(node.Name,
            StringComparison.OrdinalIgnoreCase))
        {
            if (childPos == node.Pos)
            {
                return child;
            }
            childPos++;
        }
    }
    return null;
}
