Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value; //"found it!"; //
return targetURL;
}/* */
}
return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210

I find XPath easier than writing a lot of code:
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
.Attributes["href"].Value;
If there are two links with the same text, select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
.Attributes["href"].Value;
The LINQ version, which returns every matching link:
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();
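On the NullReferenceException itself: a likely cause (an assumption, since the full page is not shown) is that DescendantsAndSelf() also visits non-anchor nodes, such as the text node inside the link. Their InnerText matches, but they have no href attribute, so Attributes["href"] returns null and .Value throws. Below is a minimal null-safe sketch of the XPath approach. The LinkFinder class, the FindHref method and its parameters are hypothetical names, not from the question.

```csharp
using HtmlAgilityPack;

public static class LinkFinder
{
    // Returns the href of the n-th <a> (1-based) whose trimmed text equals linkText.
    // Returns null when there is no such anchor, and "" when the anchor has no href.
    // Assumption: linkText contains no apostrophes, which would break the XPath literal.
    public static string FindHref(string html, string linkText, int occurrence)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Parentheses make [n] index the whole result set, not each parent's children.
        string xpath = $"(//a[normalize-space(.)='{linkText}'])[{occurrence}]";
        HtmlNode anchor = doc.DocumentNode.SelectSingleNode(xpath);

        // SelectSingleNode returns null when nothing matches, so guard before reading.
        return anchor?.GetAttributeValue("href", string.Empty);
    }
}
```

Called as LinkFinder.FindHref(catalog, "College of Science", 2), this would pick the second match in document order, which is the navigation bar link if the footer really comes first.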

Related

selecting href from <a> node using HtmlAgilityPack

I'm trying to learn web scraping and want to get the href value from the "a" nodes using HtmlAgilityPack in C#. There are multiple grid cells within the grid view; each contains an article with a smaller cell, and I want the href value of the "a" node from all of them.
<div class="Tabpanel">
<div class="GridW">
<div class="GridCell">
<article>
<div class="smallerCell">
<a href="..........">
</div>
</article>
</div>
</div>
<div class="random">
</div>
<div class="random">
</div>
</div>
This is what I have come up with so far, feels like I'm making it way more complicated than it has to be. Where do I go from here? Or is there an easier way to do this?
httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(Url);
var htmldoc = new HtmlDocument();
htmldoc.LoadHtml(html);
var ReceptLista = new List<HtmlNode>();
ReceptLista = htmldoc.DocumentNode.Descendants("div")
.Where(node => node.GetAttributeValue("class", "")
.Equals("GridW")).ToList();
var finalList = new List<HtmlNode>();
finalList = ReceptLista[0].Descendants("article").ToList();
var finalList2 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList.Count; i++) {
finalList2.Add(finalList[i].DescendantNodes().Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-content")).ToList());
}
var finalList3 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList2.Count; i++) {
finalList3.Add(finalList2[i].Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-link js-searchRecipeLink")).ToList());
}

You can probably make things a lot simpler by using XPath.
If you want all the links inside article tags, you can do the following:
var anchors = htmldoc.DocumentNode.SelectNodes("//article//a");
var links = anchors.Select(a => a.Attributes["href"].Value).ToList();
The href string is in the attribute's Value property.
If you want only the anchor tags that sit directly inside a div with class smallerCell, which is itself a child of article, you can change the XPath to //article/div[@class='smallerCell']/a.
You get the idea. I think you're just missing XPath knowledge. Also note that HtmlAgilityPack has plugins that add CSS selectors, so that's an option if you don't want to use XPath.

Simplest way I'd go about it would be this...
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var nodesWithARef = doc.DocumentNode.Descendants("a");
foreach (HtmlNode node in nodesWithARef)
{
Console.WriteLine(node.GetAttributeValue("href", ""));
}
Reasoning: Descendants("a") returns every a element in the entire HTML as a lazily evaluated sequence (not an array). You can go over the nodes and do what you need; I am simply printing the href.
Another way to go about it would be to look up all the nodes that have the class 'smallerCell'. Then, for each of those nodes, look up the links under it and print each href (or do something else with it).
var nodesWithSmallerCells = doc.DocumentNode.SelectNodes("//div[@class='smallerCell']");
if (nodesWithSmallerCells != null)
foreach (HtmlNode node in nodesWithSmallerCells)
{
HtmlNodeCollection children = node.SelectNodes(".//a");
if (children != null)
foreach (HtmlNode child in children)
Console.WriteLine(child.GetAttributeValue("href", ""));
}
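The null checks above matter because SelectNodes returns null, not an empty collection, when nothing matches. As a sketch, the two lookups can also be folded into one XPath that returns a plain list of hrefs. HrefExtractor, GetHrefs and the cssClass parameter are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HrefExtractor
{
    // Collects the href of every <a> nested inside a <div> whose class attribute
    // is exactly cssClass. Returns an empty list when nothing matches.
    public static List<string> GetHrefs(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // a[@href] skips anchors that have no href attribute at all.
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes($"//div[@class='{cssClass}']//a[@href]");

        if (anchors == null)
            return new List<string>();

        return anchors.Select(a => a.GetAttributeValue("href", string.Empty)).ToList();
    }
}
```

Note that @class='...' is an exact string comparison; for an element carrying several classes, contains(@class, '...') is the usual looser alternative.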

Get URLs inside a HTML page with HTML Agility Pack

I have this code:
foreach (HtmlNode node in hd.DocumentNode.SelectNodes("//div[@class='compTitle options-toggle']//a"))
{
string s=("node:" + node.GetAttributeValue("href", string.Empty));
}
I want to get urls in tags like this:
<div class="compTitle options-toggle">
<a class=" ac-algo fz-l ac-21th lh-24" href="http://www.bestbuy.com">
<b>Huawei</b> Products - Best Buy
</a>
</div>
I want to get "http://www.bestbuy.com" and "Huawei Products - Best Buy".
What should I do? Is my code correct?

This is an example of working code:
var document = new HtmlDocument();
document.LoadHtml("<div class=\"compTitle options-toggle\"><a class=\" ac-algo fz-l ac-21th lh-24\" href=\"http://www.bestbuy.com\"><b>Huawei</b> Products - Best Buy</a></div>");
var tags = document.DocumentNode.SelectNodes("//div[@class='compTitle options-toggle']//a").ToList();
foreach (var tag in tags)
{
var link = tag.Attributes["href"].Value; // http://www.bestbuy.com
var text = tag.InnerText; // Huawei Products - Best Buy
}

The closing double quote should fix the selection (it worked for me).
Get the plain text with
string contentText = node.InnerText;
or, to keep the word Huawei in bold, get the markup instead:
string contentHtml = node.InnerHtml;
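Putting the pieces together, here is a hedged sketch that returns each link's URL together with its visible text. The ResultLinks class and the Extract method are hypothetical names. HtmlEntity.DeEntitize is used because InnerText leaves entities such as &amp;amp; undecoded.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ResultLinks
{
    // Returns (url, text) pairs for every link inside the result-title divs.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//div[@class='compTitle options-toggle']//a[@href]");
        if (anchors == null)
            return results; // SelectNodes yields null when nothing matches

        foreach (HtmlNode a in anchors)
        {
            string url = a.GetAttributeValue("href", string.Empty);
            // InnerText drops the <b> tags; DeEntitize decodes entities; Trim removes layout whitespace.
            string text = HtmlEntity.DeEntitize(a.InnerText).Trim();
            results.Add(new KeyValuePair<string, string>(url, text));
        }
        return results;
    }
}
```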

AngleSharp Parsing

Can't find many examples of using AngleSharp for parsing when you don't have a class name or id to use.
HTML
<span><span class="icon icon_none"></span></span>
<span><a href="bing.com" title="Bing"><span class="icon icon_none"></span></a></span>
<span><span class="icon icon_none"></span></span>
I want to find the href from any <a> tags that have a title = Bing
In Python BeautifulSoup I would use
item_needed = a_row.find('a', {'title': 'Bing'})
and then grab the href attribute
or jQuery
a[title='Bing']
But, I'm stuck using AngleSharp
eg. following example
https://github.com/AngleSharp/AngleSharp/wiki/Examples#getting-certain-elements
c# AngleSharp
var parser = new AngleSharp.Parser.Html.HtmlParser();
var document = parser.Parse(@"<span><span class=""icon icon_none""></span></span>< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//Do something with LINQ
var blueListItemsLinq = document.All.Where(m => m.LocalName == "a" && //stuck);

It looks like a problem in your HTML markup caused AngleSharp to miss the target element, namely the spaces around the angle brackets:
< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none"">
With the HTML fixed, both LINQ and the CSS selector successfully select the target link:
// AngleSharp 0.9.x API, as in the question; in 0.10+ the parser is AngleSharp.Html.Parser.HtmlParser and the method is ParseDocument
var parser = new AngleSharp.Parser.Html.HtmlParser();
var document = parser.Parse(@"<span><span class=""icon icon_none""></span></span><span><a href=""bing.com"" title=""Bing""><span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//LINQ example
var blueListItemsLinq = document.All
.Where(m => m.LocalName == "a" &&
m.GetAttribute("title") == "Bing"
);
//LINQ equivalent CSS selector example
var blueListItemsCSS = document.QuerySelectorAll("a[title='Bing']");
//print href attributes value to console
foreach (var item in blueListItemsCSS)
{
Console.WriteLine(item.GetAttribute("href"));
}
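For newer AngleSharp releases (0.10 and later, where the parser lives in the AngleSharp.Html.Parser namespace), the same lookup might be sketched as follows. BingLink and FindHrefByTitle are hypothetical names, and the title is assumed not to contain an apostrophe.

```csharp
using AngleSharp.Html.Parser;

public static class BingLink
{
    // Returns the href of the first <a> whose title attribute equals the given
    // title, or null when no such element exists.
    public static string FindHrefByTitle(string html, string title)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // CSS attribute selector, the same form as the jQuery selector in the question.
        var anchor = document.QuerySelector($"a[title='{title}']");
        return anchor?.GetAttribute("href");
    }
}
```

QuerySelector returns only the first match; QuerySelectorAll, as used above, is the one to pick when several links share the title.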

HtmlAgilityPack scraping "href"

I wrote this code (warning: the link points to an adult site):
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load("http://xhamster.com/movies/2808613/jewel_is_a_sexy_cougar_who_loves_to_fuck_lucky_younger_guys.html");
var aTags = document.DocumentNode.SelectNodes("//div[contains(@class,'noFlash')]");
if (aTags != null)
foreach (var aTag in aTags)
{
var href = aTag.Attributes["href"].Value;
textBox2.Text = href;
}
I get an error when I try to run this program.
If I put something else in "var href", for example:
var href = aTag.InnerHtml
I get the inner HTML, and I can see the "href=" link there along with some other data.
But I need only the link after the href!

You are selecting div elements, and a div element can't have an href attribute. If you want the hrefs of the anchor tags inside them, you can use:
var hrefs = aTags.Descendants("a")
.Select(node => node.GetAttributeValue("href",""))
.ToList();

Parsing dl with HtmlAgilityPack

This is the sample HTML I am trying to parse with Html Agility Pack in ASP.Net (C#).
<div class="content-div">
<dl>
<dt>
<b><a href="1.html">1</a></b>
</dt>
<dd> First Entry</dd>
<dt>
<b><a href="2.html">2</a></b>
</dt>
<dd> Second Entry</dd>
<dt>
<b><a href="3.html">3</a></b>
</dt>
<dd> Third Entry</dd>
</dl>
</div>
The Values I want are :
The hyperlink -> 1.html
The Anchor Text ->1
Inner Text of dd -> First Entry
(I have taken examples of the first entry here but I want the values for these elements for all the entries in the list )
This is the code I am using currently,
var webGet = new HtmlWeb();
var document = webGet.Load(url2);
var parsedValues=
from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
from content in info.SelectNodes("dl//dd")
from link in info.SelectNodes("dl//dt/b/a")
.Where(x => x.Attributes.Contains("href"))
select new
{
Text = content.InnerText,
Url = link.Attributes["href"].Value,
AnchorText = link.InnerText,
};
GridView1.DataSource = parsedValues;
GridView1.DataBind();
The problem is that the link and the anchor text come out correctly, but the dd inner text does not: the first entry's text is repeated once for every link, then the second entry's text is repeated the same way, and so on. In case that explanation is unclear, here is a sample of the output I get with this code:
First Entry 1.html 1
First Entry 2.html 2
First Entry 3.html 3
Second Entry 1.html 1
Second Entry 2.html 2
Second Entry 3.html 3
Third Entry 1.html 1
Third Entry 2.html 2
Third Entry 3.html 3
Whereas I am trying to get
First Entry 1.html 1
Second Entry 2.html 2
Third Entry 3.html 3
I am pretty new to HAP and have very little knowledge of XPath, so I am sure I am doing something wrong here, but I couldn't make it work even after spending hours on it. Any help would be much appreciated.

Solution 1
I have defined a function that given a dt node will return the next dd node after it:
private static HtmlNode GetNextDDSibling(HtmlNode dtElement)
{
// Walk forward through the siblings and stop at the first <dd> element
var currentNode = dtElement.NextSibling;
while (currentNode != null)
{
if (currentNode.NodeType == HtmlNodeType.Element && currentNode.Name == "dd")
return currentNode;
currentNode = currentNode.NextSibling;
}
return null; // no <dd> follows this <dt>
}
and now the LINQ code can be transformed to:
var parsedValues =
from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
from dtElement in info.SelectNodes("dl/dt")
let link = dtElement.SelectSingleNode("b/a[@href]")
let ddElement = GetNextDDSibling(dtElement)
where link != null && ddElement != null
select new
{
Text = ddElement.InnerHtml,
Url = link.GetAttributeValue("href", ""),
AnchorText = link.InnerText
};
Solution 2
Without additional functions:
var infoNode =
document.DocumentNode.SelectSingleNode("//div[@class='content-div']");
var dts = infoNode.SelectNodes("dl/dt");
var dds = infoNode.SelectNodes("dl/dd");
var parsedValues = dts.Zip(dds,
(dt, dd) => new
{
Text = dd.InnerHtml,
Url = dt.SelectSingleNode("b/a[@href]").GetAttributeValue("href", ""),
AnchorText = dt.SelectSingleNode("b/a[@href]").InnerText
});
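Solution 2 pairs dt and dd purely by position, so it assumes both lists have the same length and order. A third option, sketched here under the assumption that the markup matches the sample in the question, lets XPath find each dt's own dd through the following-sibling axis. DlEntry and DlParser are hypothetical names introduced for this sketch.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public sealed class DlEntry
{
    public string Url;
    public string AnchorText;
    public string Text;
}

public static class DlParser
{
    // Builds one entry per <dt>, pairing it with the first <dd> that follows it.
    public static List<DlEntry> Parse(string html)
    {
        var entries = new List<DlEntry>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection dts =
            doc.DocumentNode.SelectNodes("//div[@class='content-div']/dl/dt");
        if (dts == null)
            return entries; // nothing matched

        foreach (HtmlNode dt in dts)
        {
            HtmlNode link = dt.SelectSingleNode("b/a[@href]");
            // [1] picks the nearest following <dd> sibling of this <dt>.
            HtmlNode dd = dt.SelectSingleNode("following-sibling::dd[1]");
            if (link == null || dd == null)
                continue; // skip incomplete entries instead of throwing

            entries.Add(new DlEntry
            {
                Url = link.GetAttributeValue("href", string.Empty),
                AnchorText = link.InnerText.Trim(),
                Text = dd.InnerText.Trim()
            });
        }
        return entries;
    }
}
```

One caveat: a dt that has no dd of its own would pick up the next entry's dd, so this still relies on the dt/dd alternation shown in the sample.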

Just an example of how you can parse some elements using Html Agility Pack:
public string ParseHtml()
{
string output = null;
HtmlDocument htmldocument = new HtmlDocument();
htmldocument.LoadHtml(YourHTML);
HtmlNode node = htmldocument.DocumentNode;
HtmlNodeCollection dds = node.SelectNodes("//dd"); // Select all dd tags
HtmlNodeCollection anchors = node.SelectNodes("//b/a[@href]"); // Select all 'a' tags that contain an href attribute
// Assumes every dd has a matching anchor at the same index
for (int i = 0; i < dds.Count; i++)
{
string attributeValue = null; // default returned when href is missing
string text = dds[i].InnerText;
string url = anchors[i].GetAttributeValue("href", attributeValue);
string anchorText = anchors[i].InnerText;
//Your code...
}
return output;
}
