C# count paragraphs in div from a website's html source code

Using Html Agility Pack, I have been trying to count the number of paragraph tags in each div tag and get the id and class (if they exist) of the div that has the most paragraphs, but I'm having trouble with the syntax.
My code looks like this:
// HtmlDocument is stored in doc
HtmlAgilityPack.HtmlNodeCollection div = doc.DocumentNode.SelectNodes("//div");
foreach (HtmlAgilityPack.HtmlNode divNode in div)
{
var x = divNode.DescendantNodes("p").Count; // doesn't actually work
// x should also be stored in a list
}
If anyone could point me in the right direction or provide examples, it would really help. Thanks!

How about this:
// Get the highest paragraph count found in any div
int maxNumberOfParagraph =
    doc.DocumentNode
       .SelectNodes("//div[.//p]")
       .Max(o => o.SelectNodes(".//p").Count);
// Get the divs whose paragraph count equals maxNumberOfParagraph
var divs = doc.DocumentNode
              .SelectNodes("//div[.//p]")
              .Where(o => o.SelectNodes(".//p").Count == maxNumberOfParagraph);
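The question also asks for the id and class of the winning div. Here is a minimal sketch of that step, assuming well-formed sample markup and using the standard System.Xml.Linq API as a stand-in for Html Agility Pack so it runs without extra packages. The sample page and the names best, bestId, bestClass and bestCount are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical, well-formed sample markup standing in for a downloaded page.
var page = XDocument.Parse(@"<html><body>
  <div id='intro' class='lead'><p>one</p></div>
  <div id='main'><p>one</p><section><p>two</p><p>three</p></section></div>
  <div><span>no paragraphs</span></div>
</body></html>");

// Pair every div with the number of p elements anywhere beneath it,
// then keep the pair with the highest count.
var best = page.Descendants("div")
    .Select(d => new { Div = d, Count = d.Descendants("p").Count() })
    .OrderByDescending(x => x.Count)
    .First();

// Attribute() returns null when the attribute is absent, so fall back to "".
string bestId = (string)best.Div.Attribute("id") ?? "";
string bestClass = (string)best.Div.Attribute("class") ?? "";
int bestCount = best.Count;

Console.WriteLine($"id='{bestId}' class='{bestClass}' paragraphs={bestCount}");
// prints: id='main' class='' paragraphs=3
```

With Html Agility Pack the same shape should carry over using divNode.Descendants("p").Count() and divNode.GetAttributeValue("id", ""), but treat that as a sketch to verify against a real page rather than a drop-in answer.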

Related

How do I get all the values of a table from a website

string Url = "http://www.dsebd.org/latest_share_price_scroll_l.php";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
string a = doc.DocumentNode.SelectNodes("//iframe*[@src=latest_share_price_all\"]//html/body/div/table/tbody")[0].InnerText;
I have tried this, but I get a null value for string a.

Ok this one confused me for a while but I've got it now. Instead of pulling the whole page from http://www.dsebd.org/latest_share_price_scroll_l.php, you can get just the table data from http://www.dsebd.org/latest_share_price_all.php.
There was some strange behaviour with trying to select child elements of the #document node under the iframe element. Someone with more xpath experience might be able to explain this.
Now you can get all the table row nodes by using the following xpath:
string url = "http://www.dsebd.org/latest_share_price_all.php";
HtmlDocument doc = new HtmlWeb().Load(url);
HtmlNode docNode = doc.DocumentNode;
var nodes = docNode.SelectNodes("//body/div/table/tr");
That will give you all the table row nodes. Then you need to go through each node you just got and get the values you want.
Just for example if you wanted to get the trading code, high, and volume you would do the following:
// Remove the first node because it is the header row at the top of the table
nodes.RemoveAt(0);
foreach (HtmlNode rowNode in nodes)
{
    HtmlNode tradingCodeNode = rowNode.SelectSingleNode("td[2]/a");
    string tradingCode = tradingCodeNode.InnerText;
    HtmlNode highNode = rowNode.SelectSingleNode("td[4]");
    string highValue = highNode.InnerText;
    HtmlNode volumeNode = rowNode.SelectSingleNode("td[11]");
    string volumeValue = volumeNode.InnerText;
    // Do whatever you want with the values here
    // Put them in a class or add them to a list
}
XPath uses 1-based indices so when you are referring to a particular cell in a table row by number the first element is at index 1, instead of using index 0 as in a C# array.
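The difference between the two indexing schemes can be seen in a small sketch. It assumes a hypothetical table row with made-up values and uses the standard System.Xml.XPath extensions rather than Html Agility Pack, since the XPath positions behave the same way.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical table row with made-up values; only the cell positions matter.
var row = XElement.Parse(
    "<tr><td>1</td><td><a>ABC</a></td><td>10.5</td><td>11.0</td></tr>");

// XPath positions start at 1, so td[2] is the second cell.
string secondCell = row.XPathSelectElement("td[2]/a").Value;

// The same cell through a C# collection uses the zero-based index 1.
string sameCell = row.Elements("td").ElementAt(1).Value;

// td[1] is the first cell, which a C# collection exposes at index 0.
string firstCell = row.XPathSelectElement("td[1]").Value;

Console.WriteLine($"{secondCell} {sameCell} {firstCell}");
// prints: ABC ABC 1
```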

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value; //"found it!"; //
return targetURL;
}/* */
}
return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210

I find XPath easier than writing a lot of code:
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
              .Attributes["href"].Value;
If you have two links with the same text, select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
              .Attributes["href"].Value;
The LINQ version:
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();

Cannot extract <link> element using HtmlAgilityPack and XPath

I am using the Html Agility Pack to select textual data from RSS XML. For every other node type (title, pubdate, guid, etc.) I can select the inner text using XPath conventions; however, when querying "//link" or indeed "item/link", empty strings are returned.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
//Create a new document.
var document = new HtmlDocument();
//Populate the document with an rss file.
document.LoadHtml(rssSource);
//Select out all of the required nodes.
var itemNodes = document.DocumentNode.SelectNodes("item/link");
//If zero nodes were found, return an empty list, otherwise return the content of those nodes.
return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}
Does anybody have an understanding of why this element behaves differently to the others?
Additional: Running "item/link" returns zero nodes. Running "//link" returns the correct number of nodes however the inner text is zero chars in length.
Using the below test data, with "//name" returns a single record for "fred" however with "//link" a single record with an empty string is returned.
<site><link>Hello World</link><name>Fred</name></site>
I am certain it's because of the word "link". If I change it to "linkz" it works perfectly.
The below workaround works perfectly. However I would like to understand why searching on "//link" does not work as other elements do.
public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
rssSource = rssSource.Replace("<link>", "<link-renamed>");
rssSource = rssSource.Replace("</link>", "</link-renamed>");
//Create a new document.
var document = new HtmlDocument();
//Populate the document with an rss file.
document.LoadHtml(rssSource);
//Select out all of the required nodes.
var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
//If zero nodes were found, return an empty list, otherwise return the content of those nodes.
return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

If you print the DocumentNode.OuterHtml, you will see the problem:
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
Output:
<site><link>Hello World<name>Fred</name></site>
link happens to be one of the special tags* that HAP treats as self-closing. You can alter this behavior by setting ElementsFlags before parsing the HTML, for example:
var html = @"<site><link>Hello World</link><name>Fred</name></site>";
HtmlNode.ElementsFlags.Remove("link"); //remove link from list of special tags
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
Console.WriteLine(link.InnerText);
}
Output:
<site><link>Hello World</link><name>Fred</name></site>
Hello World
*) The complete list of special tags that are included in the ElementsFlags dictionary by default, besides link, can be seen in the source code of HtmlNode.cs. Some of the most common are <meta>, <img>, <frame>, <input>, <form>, and <option>.

Extracting content from Webpage

I am attempting to use Html Agility Pack to extract all the content from a webpage.
foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
sb.AppendLine(node.Text);
}
When I try to parse google.com using the above code, I get lots of JavaScript. All I want is to extract the content of the webpage, such as the text in h or p tags: for example, taking the question, answers, and comments on this page and removing everything else.
I am really new to XPath and don't know how to move forward, so any help would be appreciated.

You can filter the unwanted tags by name and remove them from your document.
doc = page.Load("http://www.google.com");
doc.DocumentNode.Descendants().Where(n => n.Name == "script" || n.Name == "style").ToList().ForEach(n => n.Remove());

You could use this XPath expression:
//body//*[local-name() != 'script']/text()
It takes only the elements inside the body and skips the script elements
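As a rough illustration of how that expression behaves, the sketch below runs it against a small, hypothetical page using the standard System.Xml.XPath extensions instead of Html Agility Pack. It also extends the predicate to skip style elements, which is an addition and not part of the expression above.

```csharp
using System;
using System.Collections;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical, well-formed sample page with script and style inside body.
var page = XDocument.Parse(
    "<html><body><h1>Title</h1><script>var x = 1;</script>" +
    "<style>p { color: red; }</style><p>Body text</p></body></html>");

// Same shape as the expression above, extended (as an assumption about
// what is wanted) to skip style elements as well as script elements.
const string xpath =
    "//body//*[local-name() != 'script' and local-name() != 'style']/text()";

// XPathEvaluate returns the matching text nodes as an untyped sequence.
var texts = ((IEnumerable)page.XPathEvaluate(xpath))
    .Cast<XText>()
    .Select(t => t.Value.Trim())
    .Where(s => s.Length > 0)
    .ToList();

Console.WriteLine(string.Join(" | ", texts));
```

With Html Agility Pack the same string could be passed to doc.DocumentNode.SelectNodes, keeping in mind that SelectNodes returns null when nothing matches.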

HtmlAgilityPack Get all links inside a DIV

I want to be able to get two links from inside a div.
Currently I can select one, but when there are more it doesn't seem to work.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='myclass']");
if (node != null)
{
    foreach (HtmlNode type in node.SelectNodes("//a@href"))
    {
        recipe.type += type.InnerText;
    }
}
else
    recipe.type = "Error fetching type.";
Trying to get it from this piece of HTML:
<div class="myclass">
<h3>Not Relevant Header</h3>
This text,
and this text
</div>
Any help is appreciated, Thanks in advance.

var div = doc.DocumentNode.SelectSingleNode("//div[@class='myclass']");
if (div != null)
{
    var links = div.Descendants("a")
                   .Select(a => a.InnerText)
                   .ToList();
}

Use this XPath:
//div[@class = 'myclass']//a
It grabs all descendant a elements of the div with class = 'myclass'.
Also, //a@href is not a valid XPath expression.

Use:
//div[contains(concat(' ', @class, ' '), ' myclass ')]//a
This selects any a element that is a descendant of any div whose class attribute contains a classname of "myclass".
The classname may be single, or the attribute may also contain other classnames. In this case the classname may be the starting one, or the last one or may be surrounded by other classnames -- the above XPath expression correctly selects the wanted nodes in all of these different cases.
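To make the difference between an exact class comparison and the token-based test concrete, here is a small sketch. It assumes hypothetical sample markup and uses the standard System.Xml.XPath extensions in place of Html Agility Pack; the XPath strings themselves are the ones discussed in the answers.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical markup: one div with only the class, one with extra
// classes around it, and one whose class merely starts with "myclass".
var page = XDocument.Parse(
    "<body>" +
    "<div class='myclass'><a>first</a></div>" +
    "<div class='box myclass wide'><a>second</a></div>" +
    "<div class='myclass2'><a>near miss</a></div>" +
    "</body>");

// Exact comparison: matches only when the whole attribute is "myclass".
const string exact = "//div[@class = 'myclass']//a";

// Token test: pads the attribute with spaces and looks for " myclass ".
const string token = "//div[contains(concat(' ', @class, ' '), ' myclass ')]//a";

var exactHits = page.XPathSelectElements(exact).Select(a => a.Value).ToList();
var tokenHits = page.XPathSelectElements(token).Select(a => a.Value).ToList();

Console.WriteLine(string.Join(", ", exactHits)); // prints: first
Console.WriteLine(string.Join(", ", tokenHits)); // prints: first, second
```

The padding with spaces is what keeps "myclass2" from matching, which a plain contains(@class, 'myclass') would not do.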
