HtmlAgilityPack issue - c#

Suppose I have the following HTML code:
<div class="MyDiv">
<h2>Josh</h2>
</div>
<div class="MyDiv">
<h2>Anna</h2>
</div>
<div class="MyDiv">
<h2>Peter</h2>
</div>
And I want to get the names, so this is what I did (C#):
string url = "https://...";
var web = new HtmlWeb();
HtmlNode[] nodes = null;
HtmlDocument doc = null;
doc = web.Load(url);
nodes = doc.DocumentNode.SelectNodes("//div[@class='MyDiv']").ToArray();
foreach (HtmlNode n in nodes){
var name = n.SelectSingleNode("//h2");
Console.WriteLine(name.InnerHtml);
}
Output:
Josh
Josh
Josh
This is strange, because n contains only the desired <div>. How can I resolve this issue?
Update: fixed by writing .//h2 instead of //h2.

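It's because of your XPath expression "//h2". A path that starts with "//" is evaluated from the root of the document, not from the node you call it on, so it returns the first h2 in the whole document ("Josh") on every iteration. Change it to "h2" (a direct child of the current node) or ".//h2" (any descendant of the current node).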
You could also do like this:
List<string> names =
doc.DocumentNode.SelectNodes("//div[@class='MyDiv']/h2")
.Select(dn => dn.InnerText)
.ToList();
foreach (string name in names)
{
Console.WriteLine(name);
}
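The difference between the three forms can be checked directly against the sample markup. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and a compiler that supports top-level statements (C# 9 or later); it loads the HTML from a string instead of a URL, and the variable names are illustrative.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div class=\"MyDiv\"><h2>Josh</h2></div>" +
    "<div class=\"MyDiv\"><h2>Anna</h2></div>" +
    "<div class=\"MyDiv\"><h2>Peter</h2></div>");

// SelectNodes returns null when nothing matches, so guard before iterating.
var divs = doc.DocumentNode.SelectNodes("//div[@class='MyDiv']");
if (divs != null)
{
    foreach (HtmlNode div in divs)
    {
        // "//h2" is absolute: it searches the whole document from the root.
        string absolute = div.SelectSingleNode("//h2").InnerText;
        // ".//h2" is relative: it searches only the descendants of div.
        string relative = div.SelectSingleNode(".//h2").InnerText;
        // "h2" is also relative, but matches direct children only.
        string child = div.SelectSingleNode("h2").InnerText;
        Console.WriteLine($"{absolute} | {relative} | {child}");
    }
}
```

With this markup the first column should read Josh on every line, while the second and third columns follow the current div.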

Related

selecting href from <a> node using HtmlAgilityPack

I'm trying to learn web scraping and want to get the href value from the "a" node using HtmlAgilityPack in C#. The grid view contains multiple grid cells, each holding an article with a smaller cell, and I want the href value of the "a" node from all of them.
<div class="Tabpanel">
<div class="GridW">
<div class="GridCell">
<article>
<div class="smallerCell">
<a href=".........."></a>
</div>
</article>
</div>
</div>
<div class="random">
</div>
<div class="random">
</div>
</div>
This is what I have come up with so far, feels like I'm making it way more complicated than it has to be. Where do I go from here? Or is there an easier way to do this?
httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(Url);
var htmldoc = new HtmlDocument();
htmldoc.LoadHtml(html);
var ReceptLista = new List<HtmlNode>();
ReceptLista = htmldoc.DocumentNode.Descendants("div")
.Where(node => node.GetAttributeValue("class", "")
.Equals("GridW")).ToList();
var finalList = new List<HtmlNode>();
finalList = ReceptLista[0].Descendants("article").ToList();
var finalList2 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList.Count; i++) {
finalList2.Add(finalList[i].DescendantNodes().Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-content")).ToList());
}
var finalList3 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList2.Count; i++) {
finalList3.Add(finalList2[i].Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-link js-searchRecipeLink")).ToList());
}
You can probably make things a lot simpler by using XPath.
If you want all the links inside article tags, you can do the following.
var anchors = htmldoc.DocumentNode.SelectNodes("//article//a");
var links = anchors.Select(a => a.Attributes["href"].Value).ToList();
The text of the attribute is exposed through its Value property.
If you want only the anchor tags that sit in a div with class smallerCell directly under an article, you can change the XPath to //article/div[@class='smallerCell']/a.
You get the idea. I think you're just missing XPath knowledge. Also note that HtmlAgilityPack has plugins that add CSS selectors, so that's also an option if you don't want to use XPath.
Simplest way I'd go about it would be this...
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var nodesWithARef = doc.DocumentNode.Descendants("a");
foreach (HtmlNode node in nodesWithARef)
{
Console.WriteLine(node.GetAttributeValue("href", ""));
}
Reasoning: the Descendants function gives you a sequence (IEnumerable<HtmlNode>) of all the <a> elements in the entire HTML. You can go over the nodes and do what you need; I am simply printing the href.
Another way to go about it would be to look up all the nodes that have the class 'smallerCell'. Then, for each of those nodes, look up the links under it and print each href (or do something else with it).
var nodesWithSmallerCells = doc.DocumentNode.SelectNodes("//div[@class='smallerCell']");
if (nodesWithSmallerCells != null)
foreach (HtmlNode node in nodesWithSmallerCells)
{
HtmlNodeCollection children = node.SelectNodes(".//a");
if (children != null)
foreach (HtmlNode child in children)
Console.WriteLine(child.GetAttributeValue("href", ""));
}
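The two approaches can also be folded into a single query. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package and top-level statements; the markup mirrors the structure from the question, and the href values (/recipe/1, /recipe/2, /elsewhere) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(@"
<div class=""Tabpanel"">
  <div class=""GridW"">
    <div class=""GridCell"">
      <article>
        <div class=""smallerCell""><a href=""/recipe/1"">One</a></div>
      </article>
    </div>
    <div class=""GridCell"">
      <article>
        <div class=""smallerCell""><a href=""/recipe/2"">Two</a></div>
      </article>
    </div>
  </div>
  <div class=""random""><a href=""/elsewhere"">Other</a></div>
</div>");

// One query: every <a> that has an href attribute and sits inside an
// <article> somewhere under the GridW container.
var anchors = doc.DocumentNode.SelectNodes(
    "//div[@class='GridW']//article//a[@href]");

// SelectNodes returns null (not an empty collection) when nothing matches.
List<string> links = anchors == null
    ? new List<string>()
    : anchors.Select(a => a.GetAttributeValue("href", "")).ToList();

foreach (string link in links)
    Console.WriteLine(link);
```

The [@href] predicate skips anchors without a link, and the link under the "random" div is excluded because it is outside the GridW container.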

Combine all the content from the multiple <p> tag inside a div, into a single string

I have the html content as:
<div class="editor-box">
<div class="insert-ad">
Some ad content
</div>
<p>paragraph 1</p>
<p>paragraph2</p>
<p>paragraph3</p>
<div class="media ad-item">
Another Ad Content
</div>
<p>Paragraph4</p>
<p>Paragraph5</p>
<p></p>
</div>
I want to merge all the text inside the <p> elements into a single string in one step.
The final output string should be:
string Output = "paragraph 1 paragraph2 paragraph3 Paragraph4 Paragraph5";
I have tried:
var doc = await GetAsync(href);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='editor-box']/p"))
{
string text = node.InnerText;
}
This gives me the text of each individual <p> element, but is there a way to select the content of all the <p> elements with a single query, so that I do not need to loop over every node and append to another string object?
If you don't want to loop over the paragraphs manually, you can use LINQ and string.Join to achieve the same result:
//1. Get the document
var doc = await GetAsync(href);
//2. Select all the paragraphs:
var paragraphNodes = doc.DocumentNode.SelectNodes("//div[@class='editor-box']/p");
//3. Select the content inside them:
var paragraphContentList = paragraphNodes.Select(node => node.InnerText);
//4. Join all the contents in a single string
var finalString = string.Join(" ", paragraphContentList);
//5. Done!
Console.WriteLine(finalString);
Remember to import the LINQ namespace: using System.Linq;
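One detail the sample markup brings out: the empty <p></p> at the end contributes an empty string, so a plain string.Join leaves a trailing space. The following is a minimal sketch of one way to handle that, assuming the HtmlAgilityPack NuGet package and top-level statements; the markup is a shortened copy of the question's HTML loaded from a string rather than through GetAsync.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(@"
<div class=""editor-box"">
  <div class=""insert-ad"">Some ad content</div>
  <p>paragraph 1</p>
  <p>paragraph2</p>
  <div class=""media ad-item"">Another Ad Content</div>
  <p>Paragraph4</p>
  <p></p>
</div>");

// Direct <p> children of the editor box only; the ad divs are not selected.
var paragraphNodes = doc.DocumentNode.SelectNodes("//div[@class='editor-box']/p");

// SelectNodes returns null when nothing matches, so fall back to an empty string.
string finalString = paragraphNodes == null
    ? string.Empty
    : string.Join(" ", paragraphNodes
        // Decode entities such as &amp; and trim surrounding whitespace.
        .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim())
        // Drop empty paragraphs so they do not add stray separators.
        .Where(text => text.Length > 0));

Console.WriteLine(finalString);
```

The query still touches every paragraph internally; LINQ only hides the loop, which is usually fine for a page-sized document.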
You may also try this if the div is a server-side control in an ASP.NET Web Forms page.
Assign an id to your div element and add runat="server":
System.IO.StringWriter sw = new System.IO.StringWriter();
System.Web.UI.HtmlTextWriter htmltext = new System.Web.UI.HtmlTextWriter(sw);
DivId.RenderControl(htmltext);
string str = sw.GetStringBuilder().ToString();
Here DivId is the id assigned to the div.

html agility pack getting same output twice c#

<div class="header">
<span id="content">test1</span>
</div>
<div class="header">
<span id="content">test2</span>
</div>
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode.SelectNodes("//div[@class='header']");
foreach (var v in value)
{
var name = v.SelectSingleNode("//span[@id='content']");
Console.WriteLine(name.OuterHtml);
}
The code above prints <span id="content">test1</span> twice, instead of printing <span id="content">test2</span> the second time. So it finds the correct number of nodes but does not produce the correct output.
An XPath expression that starts with // or / is evaluated from the document root, even when you call it on a specific node.
Here is the fix applied to your code; the path to the span is now relative to the current node.
var value = doc.DocumentNode.SelectNodes("//div[@class='header']");
foreach (var v in value)
{
var name = v.SelectSingleNode("span[@id='content']");
Console.WriteLine(name.OuterHtml);
}
See this fiddle. https://dotnetfiddle.net/nih2lw
As a side note, an id attribute should always be unique in the document. Use class instead.

Get URLs inside a HTML page with HTML Agility Pack

I have this code:
foreach (HtmlNode node in hd.DocumentNode.SelectNodes("//div[@class='compTitle options-toggle']//a"))
{
string s=("node:" + node.GetAttributeValue("href", string.Empty));
}
I want to get urls in tags like this:
<div class="compTitle options-toggle">
<a class=" ac-algo fz-l ac-21th lh-24" href="http://www.bestbuy.com">
<b>Huawei</b> Products - Best Buy
</a>
</div>
I want to get "http://www.bestbuy.com" and "Huawei Products - Best Buy".
What should I do? Is my code correct?
This is an example of working code:
var document = new HtmlDocument();
document.LoadHtml("<div class=\"compTitle options-toggle\"><a class=\" ac-algo fz-l ac-21th lh-24\" href=\"http://www.bestbuy.com\"><b>Huawei</b> Products - Best Buy</a></div>");
var tags = document.DocumentNode.SelectNodes("//div[@class='compTitle options-toggle']//a").ToList();
foreach (var tag in tags)
{
var link = tag.Attributes["href"].Value; // http://www.bestbuy.com
var text = tag.InnerText; // Huawei Products - Best Buy
}
The closing double quote should fix the selecting (it worked for me).
Get the plain text as
string contentText = node.InnerText;
or having the Huawei word in bold, like this:
string contentHtml = node.InnerHtml;

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value; //"found it!"; //
return targetURL;
}/* */
}
return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210
I find XPath easier than writing a lot of code:
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
.Attributes["href"].Value;
If you have two links with the same text, select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
.Attributes["href"].Value;
The LINQ version:
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();
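None of the snippets above says where the NullReferenceException comes from. A likely cause is that DescendantsAndSelf() enumerates every node, including wrapper elements and text nodes; several of those have the same InnerText as the link but no href attribute, so Attributes["href"] returns null and reading .Value throws. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package and top-level statements; the markup is a simplified stand-in for the real page, and the footer href is made up.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div id=\"footer\"><a href=\"/footer-link\">College of Science</a></div>" +
    "<table><tr><td id=\"acalog-navigation\"><div class=\"n2_links\">" +
    "<a href=\"/content.php?catoid=10&navoid=1210\">College of Science</a>" +
    "</div></td></tr></table>");

// Every node whose InnerText equals the link text: wrapper elements and
// text nodes match as well, and they carry no href attribute.
var matches = doc.DocumentNode.DescendantsAndSelf()
    .Where(n => n.InnerText == "College of Science")
    .ToList();
foreach (HtmlNode n in matches)
    Console.WriteLine($"{n.Name}: has href = {n.Attributes["href"] != null}");

// Scope the query to the navigation cell instead of counting matches, and
// read the attribute with a default so a missing href cannot throw.
HtmlNode navLink = doc.DocumentNode.SelectSingleNode(
    "//td[@id='acalog-navigation']//a[text()='College of Science']");
string url = navLink == null
    ? "DID NOT WORK"
    : navLink.GetAttributeValue("href", "DID NOT WORK");
Console.WriteLine(url);
```

Selecting by the container id does not depend on whether the footer or the navigation bar comes first in the document, unlike the [2] index.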
