HtmlAgilityPack cannot find node - c#

I am trying to get the span called "Start" from here.
Chrome gives me this XPath: //*[@id="guide-pages"]/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/div[2]/div[1]/h2
But HtmlAgilityPack returns null. I tried removing the steps one by one; this works: //*[@id="guide-pages"]/div[2]/div[1], but nothing longer than that does.
My full Code:
HtmlDocument doc = new HtmlDocument();
var text = await ReadUrl();
doc.LoadHtml(text);
Console.WriteLine($"Getting Data From: {doc.DocumentNode.SelectSingleNode("//head/title").InnerText}"); //Works fine
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id='guide-pages']/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/div[2]/div[1]/h2") == null);
Output:
Getting Data From: Miss Fortune Build Guide : [7.11] KOREAN MF Build - Destroy the Carry! [Added Support] :: League of Legends Strategy Builds
True

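Don't use the XPath that Chrome gives you. Chrome builds it from the live DOM, after the browser has repaired the markup and scripts have run, so it often does not match the raw HTML that HtmlAgilityPack parses. Use LINQ in HtmlAgilityPack instead. For example,
.Descendants("div") gives you all the div elements under a given node. Each node exposes its metadata, such as its id and other attributes (class, ...), so you can query for the div you want from there.
Here is a handy extension method that checks whether an HtmlNode has all of the given classes.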
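using System.Linq;
using HtmlAgilityPack;

// Extension methods must live in a static class; the class name here is arbitrary.
public static class HtmlNodeExtensions
{
    public static bool HasClass(this HtmlNode node, params string[] classValueArray)
    {
        var classValue = node.GetAttributeValue("class", "");
        var classValues = classValue.Split(' ');
        return classValueArray.All(c => classValues.Contains(c));
    }
}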
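To make the LINQ approach concrete, here is a minimal sketch. The markup and the class names ("intro", "chapter", "start") are invented for illustration, since the real page is not shown here; only the id "guide-pages" comes from the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup standing in for the real page.
        var html = @"<div id=""guide-pages"">
  <div class=""intro""><h2>Overview</h2></div>
  <div class=""chapter start""><h2>Start</h2></div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchor on the id, then filter descendants by class instead of by position.
        var root = doc.GetElementbyId("guide-pages");
        var heading = root?
            .Descendants("div")
            .Where(d => d.GetAttributeValue("class", "").Split(' ').Contains("chapter"))
            .SelectMany(d => d.Descendants("h2"))
            .FirstOrDefault(h => h.InnerText.Trim() == "Start");

        // heading is null when nothing matched, so guard before reading it.
        Console.WriteLine(heading?.InnerText ?? "(not found)");
    }
}
```

Filtering by id and class tends to survive layout changes better than a long chain of div[n] steps, which breaks as soon as one wrapper element is added or removed.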

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually.
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
    doc.DocumentNode.Descendants()
        .Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
    string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
    MessageBox.Show(itemx);
    MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all the items.
When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the descendant nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
    string thisUrl = node.GetAttributeValue("href", "");
    if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
        listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") -> retrieves the div that holds all the search results and ignores the rest of the document.
The loop iterates only over the a elements below that div and adds each href to a list. The leading dot in SelectNodes(".//a") restricts the search to that node (with // instead of .//, it would search the entire page, which is not what you want).
The if statement makes sure only unique, non-empty values are added.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string hrefValue = link.GetAttributeValue("href", string.Empty);
    MessageBox.Show(hrefValue);
    MessageBox.Show(link.InnerText);
}

Cannot extract <link> element using HtmlAgilityPack and XPath

I am using the Html Agility Pack to select textual data from within RSS XML. For every other node type (title, pubdate, guid, etc.) I can select the inner text using XPath conventions; however, when querying "//link" or indeed "item/link", empty strings are returned.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
//Create a new document.
var document = new HtmlDocument();
//Populate the document with an rss file.
document.LoadHtml(rssSource);
//Select out all of the required nodes.
var itemNodes = document.DocumentNode.SelectNodes("item/link");
//If zero nodes were found, return an empty list, otherwise return the content of those nodes.
return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}
Does anybody have an understanding of why this element behaves differently to the others?
Additional: Running "item/link" returns zero nodes. Running "//link" returns the correct number of nodes however the inner text is zero chars in length.
Using the below test data, with "//name" returns a single record for "fred" however with "//link" a single record with an empty string is returned.
<site><link>Hello World</link><name>Fred</name></site>
I am certain it's because of the word "link". If I change it to "linkz" it works perfectly.
The below workaround works perfectly. However I would like to understand why searching on "//link" does not work as other elements do.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
rssSource = rssSource.Replace("<link>", "<link-renamed>");
rssSource = rssSource.Replace("</link>", "</link-renamed>");
//Create a new document.
var document = new HtmlDocument();
//Populate the document with an rss file.
document.LoadHtml(rssSource);
//Select out all of the required nodes.
var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
//If zero nodes were found, return an empty list, otherwise return the content of those nodes.
return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}
If you print the DocumentNode.OuterHtml, you will see the problem :
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
output :
<site><link>Hello World<name>Fred</name></site>
link happens to be one of the special tags* that HAP treats as self-closing. You can alter this behavior by changing ElementsFlags before parsing the HTML, for example:
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
HtmlNode.ElementsFlags.Remove("link"); //remove link from list of special tags
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
    Console.WriteLine(link.InnerText);
}
Dotnetfiddle Demo
output :
<site><link>Hello World</link><name>Fred</name></site>
Hello World
*) The complete list of special tags besides link that are included in the ElementsFlags dictionary by default can be seen in the source code of HtmlNode.cs. Some of the most common among them are <meta>, <img>, <frame>, <input>, <form>, <option>, etc.
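One thing worth keeping in mind: HtmlNode.ElementsFlags is a static dictionary, so removing an entry affects every document parsed afterwards in the same process. If that matters, a sketch like the following restores the original entry once the RSS has been parsed. The try/finally arrangement is just one possible way to do it, not something the library prescribes.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var xml = @"<site><link>Hello World</link><name>Fred</name></site>";

        // Remember the current flag for "link" (if any) so it can be put back later.
        bool hadFlag = HtmlNode.ElementsFlags.TryGetValue("link", out HtmlElementFlag oldFlag);
        HtmlNode.ElementsFlags.Remove("link");
        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(xml);

            // SelectNodes returns null when nothing matches.
            var links = doc.DocumentNode.SelectNodes("//link");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                {
                    Console.WriteLine(link.InnerText);
                }
            }
        }
        finally
        {
            // Restore the default so ordinary HTML parsing elsewhere is unaffected.
            if (hadFlag)
            {
                HtmlNode.ElementsFlags["link"] = oldFlag;
            }
        }
    }
}
```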

C# Null Exception using HTML Agility Pack

I have a function where I am trying to get some text from this webpage:
http://www.nla.gd/winning-numbers/
public static string get_webpage(string url)
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    string date = doc.DocumentNode.InnerText;
    string lotto_winning_numbers = doc.DocumentNode.SelectNodes("//[@id=\"main\"]/div/strong/div/div[2]/div[1]/div[1]").ToString();
    return lotto_winning_numbers;
}
When I run the function I get a NULL Exception.
Is my xpath correct?
You can't use a filter by itself in XPath (like [@id='main']). You need to apply the filter to a collection of nodes, such as div or *.
Note that you also want to combine the values of the elements in the resulting collection, not convert the collection itself to a string.
Something like:
// Note "*" in front of the filter
var lotto_winning_numbers = doc.DocumentNode.SelectNodes(
    "//*[@id=\"main\"]/div/strong/div/div[2]/div[1]/div[1]");
// lotto_winning_numbers is a collection of nodes here (null if nothing matched).
// Select requires "using System.Linq;".
return lotto_winning_numbers == null ? String.Empty :
    String.Join(", ", lotto_winning_numbers.Select(n => n.InnerText.Trim()));
Check MSDN article XPath Examples or many other tutorials to learn more.

Get text that lies after pattern without class or id

I am using the HtmlAgilityPack.
It is an excellent tool for parsing data; however, every time I have used it, I have had either a class or an id to aim at, e.g. -
string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();
However, I have come across a piece of text that isn't nested in any particular pattern with a class or id I can aim at. E.g. -
<p>Example Header</p>: This is the text I want!<br>
The example given does always follow the same pattern, i.e. the text will always be after </p>: and before <br>.
I can extract the text using a regular expression, but I would prefer to use the agility pack as the rest of the code follows suit. Is there a means of doing this using the pack?
This XPath works for me :
var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() selects all text nodes that are direct children of the <div>.
[(normalize-space())] excludes all text nodes that contain only whitespace (two newlines are excluded from this HTML sample: one before <p> and the other after <br>).
Result :
UPDATE I :
Every element must have a parent, like the <div> in the example above. If it is the root node you're talking about, the same approach should still work. The key is to use the /text() XPath step to get the text node:
var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
UPDATE II :
Ok, so you want to select text node after <p> element and before <br> element. You can use this XPath then :
var result =
doc.DocumentNode
.SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
.OuterHtml;
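Putting the pieces together, here is a self-contained sketch of the sibling-based selection. The wrapping <div>, the null check, and the trimming of the leading ": " are additions for illustration and were not part of the answer above.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var html = @"<div><p>Example Header</p>: This is the text I want!<br></div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Text node that has a <p> somewhere before it and a <br> somewhere after it
        // among its siblings.
        var node = doc.DocumentNode.SelectSingleNode(
            "//text()[preceding-sibling::p and following-sibling::br]");

        // SelectSingleNode returns null when nothing matches, so guard before use.
        var text = node == null
            ? string.Empty
            : HtmlEntity.DeEntitize(node.InnerText).TrimStart(':', ' ').Trim();

        Console.WriteLine(text);
    }
}
```

InnerText is used instead of OuterHtml here so that HTML entities can be decoded with HtmlEntity.DeEntitize; for a plain text node the two return the same raw string.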

How can I read an HTML file a Paragraph at a time?

I reckon it would be something like (pseudocode):
var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
par = getNextParagraph();
pars.Add(par);
}
...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph so that it is not found over and over again). Or some such.
Does anybody have insight on how exactly to do this / a better methodology?
UPDATE
I tried to use Aurelien Souchet's code.
I have the following usings:
using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;
...but this code:
HtmlDocument doc = new HtmlDocument();
is unwanted ("Cannot access private constructor 'HtmlDocument' here")
Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" err msg
UPDATE 2
Okay, I had to prepend "HtmlAgilityPack." so that the ambiguous reference was disambiguated.
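An alternative to prefixing every use is a using alias, roughly like the sketch below. The alias name HapDocument is made up; any identifier works.

```csharp
// Alias the HtmlAgilityPack type so the short name no longer clashes with
// System.Windows.Forms.HtmlDocument. "HapDocument" is an arbitrary alias name.
using HapDocument = HtmlAgilityPack.HtmlDocument;

class Program
{
    static void Main()
    {
        var doc = new HapDocument();
        doc.LoadHtml("<p>One</p><p>Two</p>");
        System.Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```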
As people suggest in the comments, I think HtmlAgilityPack is the best choice; it's easy to use, and good examples and tutorials are easy to find.
Here is what I would write:
//don't forget to add the reference
using System.Collections.Generic;
using HtmlAgilityPack;
//Function that takes the html as a string parameter and returns a list
//of strings with the paragraphs' content.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();
    //first create an HtmlDocument
    HtmlDocument doc = new HtmlDocument();
    //load the html (from a string)
    doc.LoadHtml(sourceHtml);
    //Select all the <p> nodes in a HtmlNodeCollection
    HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");
    //SelectNodes returns null when there are no <p> nodes
    if (paragraphs == null)
    {
        return pars;
    }
    //Iterate over every node in the collection
    foreach (HtmlNode paragraph in paragraphs)
    {
        //Add the InnerText to the list
        pars.Add(paragraph.InnerText);
        //Or paragraph.InnerHtml, depending on what you want
    }
    return pars;
}
It's just a basic example. If your HTML contains nested paragraphs, this code may not work as expected; it all depends on the HTML you are parsing and what you want to do with it.
Hope it helps!
