HtmlAgilityPack scraping "href"

I wrote this code:
Warning: the link points to an adult site!
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load("http://xhamster.com/movies/2808613/jewel_is_a_sexy_cougar_who_loves_to_fuck_lucky_younger_guys.html");
var aTags = document.DocumentNode.SelectNodes("//div[contains(@class,'noFlash')]");
if (aTags != null)
foreach (var aTag in aTags)
{
var href = aTag.Attributes["href"].Value;
textBox2.Text = href;
}
I get an error when I try to run this program.
If I assign something else to href, for example:
var href = aTag.InnerHtml;
I get the inner HTML, and I can see the "href=" link there along with some other data.
But I only need the link after href!

You are selecting div elements, and a div element can't have an href attribute. If you want to get the hrefs of the anchor tags inside those divs, you can use:
var hrefs = aTags.Descendants("a")
.Select(node => node.GetAttributeValue("href",""))
.ToList();
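A null-safe variant of the same idea can be sketched as follows. This is only an illustration: it assumes the anchors of interest sit inside a div whose class contains a given fragment (noFlash in the question), and the names HrefExtractor and GetHrefs are made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HrefExtractor
{
    // Hypothetical helper: returns the href of every <a> nested inside
    // a <div> whose class attribute contains classFragment.
    // classFragment must not contain a single quote (it is spliced into the XPath).
    public static List<string> GetHrefs(string html, string classFragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the anchors directly instead of the divs; anchors without href are skipped.
        var anchors = doc.DocumentNode.SelectNodes(
            "//div[contains(@class,'" + classFragment + "')]//a[@href]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null)
            return new List<string>();

        return anchors
            .Select(a => a.GetAttributeValue("href", ""))
            .ToList();
    }
}
```

Calling HrefExtractor.GetHrefs(html, "noFlash") on the downloaded page source would then give a plain list of strings to show in the text box.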

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually:
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all the items.

When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") retrieves the div that holds all the search results and ignores the rest of the document.
The loop iterates only through the descendant a tags of that div and adds each href to a list. Descendants are selected with the relative path SelectNodes(".//a"); if you use // instead of .//, it will search the entire page, which is not what you want.
The if statement makes sure it's only adding unique, non-empty values.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
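The difference between .//a and //a can be seen in a small sketch. The class name XPathScopeDemo and the method CountLinks are hypothetical; the sketch assumes a page with one div whose class is exactly search-result.

```csharp
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical helper: counts the <a> elements found from the
    // search-result div with a relative path (.//a) and with an
    // absolute path (//a), which ignores the context node.
    public static (int Scoped, int WholeDocument) CountLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var container = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
        if (container == null)
            return (0, 0);

        // Relative: only descendants of the container.
        var scoped = container.SelectNodes(".//a");
        // Absolute: starts again at the document root, even though it is called on the container.
        var all = container.SelectNodes("//a");

        // SelectNodes returns null when there is no match.
        return (scoped?.Count ?? 0, all?.Count ?? 0);
    }
}
```

For markup with one link outside the div and two inside it, the first count is 2 and the second is 3.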

I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value; //"found it!"; //
return targetURL;
}/* */
}
return "DID NOT WORK";
}
The program enters the else if branch, but when it attempts to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210

I find XPath easier than writing a lot of code:
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
.Attributes["href"].Value;
If you have two links with the same text, select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
.Attributes["href"].Value;
The LINQ version:
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();
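Instead of relying on the position of the match, the lookup can also be scoped to the navigation container. The sketch below assumes the wanted link sits somewhere inside the td with id acalog-navigation shown in the question; NavLinkFinder and FindNavHref are hypothetical names.

```csharp
using HtmlAgilityPack;

public static class NavLinkFinder
{
    // Hypothetical helper: finds the first <a> with the given text inside
    // the td#acalog-navigation cell and returns its href, or null if there is none.
    // linkText must not contain a single quote (it is spliced into the XPath).
    public static string FindNavHref(string html, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Restricting the search to the navigation cell skips the footer link,
        // wherever it appears in the document.
        var link = doc.DocumentNode.SelectSingleNode(
            "//td[@id='acalog-navigation']//a[normalize-space(text())='" + linkText + "']");

        // SelectSingleNode returns null when nothing matches.
        return link == null ? null : link.GetAttributeValue("href", "");
    }
}
```

This avoids the counter logic entirely, and it keeps working if the footer and navigation bar ever swap places in the markup.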

AngleSharp Parsing

Can't find many examples of using AngleSharp for parsing when you don't have a class name or id to use.
HTML
<span><span class="icon icon_none"></span></span>
<span><a href="bing.com" title="Bing"><span class="icon icon_none"></span></a></span>
<span><span class="icon icon_none"></span></span>
I want to find the href of any <a> tag that has title = Bing.
In Python BeautifulSoup I would use
item_needed = a_row.find('a', {'title': 'Bing'})
and then grab the href attribute
or jQuery
a[title='Bing']
But I'm stuck using AngleSharp, e.g. when following this example:
https://github.com/AngleSharp/AngleSharp/wiki/Examples#getting-certain-elements
C# AngleSharp:
var parser = new AngleSharp.Parser.Html.HtmlParser();
var document = parser.Parse(@"<span><span class=""icon icon_none""></span></span>< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//Do something with LINQ
var blueListItemsLinq = document.All.Where(m => m.LocalName == "a" && //stuck);

It looks like there was a problem in your HTML markup that caused AngleSharp to fail to find the target element, i.e. the spaces around the angle brackets:
< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none"">
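With the HTML fixed, both LINQ and the CSS selector successfully select the target link:
// ParseDocument lives in AngleSharp.Html.Parser (AngleSharp 0.10 and later); older versions use AngleSharp.Parser.Html and Parse.
var parser = new AngleSharp.Html.Parser.HtmlParser();
var document = parser.ParseDocument(@"<span><span class=""icon icon_none""></span></span><span><a href=""bing.com"" title=""Bing""><span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");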
//LINQ example
var blueListItemsLinq = document.All
.Where(m => m.LocalName == "a" &&
m.GetAttribute("title") == "Bing"
);
//LINQ equivalent CSS selector example
var blueListItemsCSS = document.QuerySelectorAll("a[title='Bing']");
//print href attributes value to console
foreach (var item in blueListItemsCSS)
{
Console.WriteLine(item.GetAttribute("href"));
}
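When only one link is expected, the CSS selector approach can be wrapped in a small helper. This is a sketch that assumes AngleSharp 0.10 or later (the AngleSharp.Html.Parser namespace and ParseDocument); BingLinkFinder and FindHrefByTitle are hypothetical names.

```csharp
using AngleSharp.Html.Parser;

public static class BingLinkFinder
{
    // Hypothetical helper: returns the href of the first <a> whose title
    // attribute equals the given value, or null if no such element exists.
    // title must not contain a single quote (it is spliced into the selector).
    public static string FindHrefByTitle(string html, string title)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // Same selector as in jQuery: a[title='Bing'].
        var anchor = document.QuerySelector("a[title='" + title + "']");

        // QuerySelector returns null when nothing matches.
        return anchor?.GetAttribute("href");
    }
}
```

This mirrors the BeautifulSoup call from the question: find the element by its title attribute first, then read href from it.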

strip all tags from string except anchor have class videoLink c#

I am trying to strip all tags from a paragraph string, except anchor tags that have the class videoLink, using Regex.Replace. Can anybody help me out? Thanks in advance. The text is in Urdu.
Until now I have been using this function, but it deletes all tags:
public string ScrubHtml(string value)
{
var step1 = System.Text.RegularExpressions.Regex.Replace(value, @"<[^>]+>| ", "").Trim();
var Message_ = System.Text.RegularExpressions.Regex.Replace(step1, @"\s{2,}", " ");
return Message_;
}

Use a real HTML parser like HtmlAgilityPack instead of Regex.
Here is an example that gets all links from a site:
HttpClient client = new HttpClient();
var html = await client.GetStringAsync("http://google.com");
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var links = doc.DocumentNode.Descendants()
.Where(x => x.Name == "a")
.Select(x => x.GetAttributeValue("href", "")) // avoids a NullReferenceException on anchors without href
.ToList();

htmlagilitypack xpath incorrect

My XPath is not working.
I am trying to get the URLs from Google.com's search result list into a string list,
but I am unable to reach the URLs using XPath.
Please help me correct my XPath. Also, tell me what should go in place of the question marks.
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["?????????"];
urls.Add(link.Value);
}
for (int i = 0; i <= urls.Count - 1; i++)
{
if (urls.ElementAt(i) != null)
{
if (IsValid(urls.ElementAt(i)) != true)
{
grid.Rows.Add(urls.ElementAt(i));
}
}
}

The URLs seem to live in the cite elements under the selected divs, so the XPath to select those is //div[@class='f kv']/cite.
Since these contain markup but you only want the text, read the InnerText of the selected nodes. Note that the values do not begin with http://.
HtmlNodeCollection linkNodes =
doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
// InnerText is a string, not an HtmlAttribute.
string link = linkNode.InnerText;
urls.Add(link);
}

The correct XPath is "//div[@class='kv']/cite". The f class you see in the browser's element inspector is (probably) added by JavaScript after the page is rendered.
Also, the link text is not in an attribute; you can get it from the InnerText property of the <cite> element(s) selected by that XPath.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
urls.Add(linkNode.InnerText);
}
There's a caveat, though: some links are trimmed (you'll see a ... in the middle).
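Because SelectNodes returns null when the XPath matches nothing (for instance if the class name on the live page changes), a defensive version of the loop can be sketched as below. CiteTextReader and GetCiteTexts are hypothetical names, and the div class is passed in rather than hard-coded, since the real class on Google's page may differ.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class CiteTextReader
{
    // Hypothetical helper: returns the plain text of every <cite> that is a
    // direct child of a <div> whose class attribute equals divClass.
    // divClass must not contain a single quote (it is spliced into the XPath).
    public static List<string> GetCiteTexts(string html, string divClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var citeNodes = doc.DocumentNode.SelectNodes("//div[@class='" + divClass + "']/cite");

        // No match yields null, so return an empty list instead of throwing later.
        if (citeNodes == null)
            return new List<string>();

        // InnerText drops nested markup such as <b>; DeEntitize decodes entities like &amp;.
        return citeNodes
            .Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())
            .ToList();
    }
}
```

The returned strings are still display text, so trimmed links containing ... cannot be turned back into full URLs this way.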
