On the web page I have:
<meta name="description" content="Learn about 94.100.179.159" />
How can I get exactly the text "Learn about 94.100.179.159" via XPath or HtmlAgilityPack?
I've tried:
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://whois.domaintools.com/94.100.179.159");
foreach (HtmlNode link in htmldocObject.DocumentNode.SelectNodes("//meta"))
{
string s = link.InnerText;
Console.WriteLine(s);
}
Console.ReadLine();
but that does not give me what I want. How can I solve that?
//meta[@name = 'description']/@content
is the XPath for the attribute you specified.
string s = link.GetAttributeValue("content", "");
should return the attribute content.
Meta tags don't have any inner text, they have attributes.
Try this:
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://whois.domaintools.com/94.100.179.159");
foreach (HtmlNode link in htmldocObject.DocumentNode.SelectNodes("//meta"))
{
Console.WriteLine("-META-");
var attribDump=link.Attributes.Select(a=>a.Name+" : "+a.Value);
foreach (var x in attribDump)
{
Console.WriteLine(x);
}
}
Select the nodes as follows
SelectNodes("//*[local-name()='meta']")
Then, for each HtmlNode,
Console.WriteLine(link.Attributes["content"].Value);
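The answers above can be combined into one small sketch: select the meta element by its name attribute, then read its content attribute. This is only an illustration, not a definitive implementation. It parses an inline HTML string instead of loading the live page, and the class and method names (MetaDescriptionSketch, GetDescription) are made up for the example.

```csharp
using System;
using HtmlAgilityPack;

public static class MetaDescriptionSketch
{
    // Returns the content attribute of <meta name="description">,
    // or null when the document has no such element.
    public static string GetDescription(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The predicate filters on the name attribute.
        // SelectSingleNode returns null when nothing matches.
        HtmlNode meta = doc.DocumentNode.SelectSingleNode("//meta[@name='description']");
        return meta?.GetAttributeValue("content", "");
    }

    public static void Main()
    {
        // Inline markup standing in for the downloaded page (assumption).
        string html = "<html><head><meta name=\"description\" content=\"Learn about 94.100.179.159\" /></head><body></body></html>";
        Console.WriteLine(GetDescription(html));
    }
}
```

To use it against the real page, the string passed to GetDescription could come from HtmlWeb.Load(url).DocumentNode.OuterHtml, or the same XPath could be applied directly to the loaded document.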
Related
Using VS 2019 and .NET 4.8, the C# code below gets the following HTML node, but I'm having trouble getting the href value.
The href attribute has a full url but the only text I'm getting is "/".
Can someone please let me know where I'm going wrong and how to get the full url text?
Thank you.
The node:
<h2 class="n">
3.
<a class="business-name" href="/santa-monica-ca/mip/specialists-in-custom-software-16438720" data-analytics='{"target":"name","feature_click":""}' rel="" data-impressed="1">
<span>Specialists In Custom Software</span>
</a>
</h2>
My code:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("https://www.yellowpages.com/search?search_terms=custom+software&geo_location_terms=Los+Angeles%2C+CA");
HtmlNode[] nodes = document.DocumentNode.SelectNodes("//h2[@class='n']").ToArray();
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.InnerHtml);
Console.WriteLine(node.SelectNodes("//a//span").First().InnerText);
Console.WriteLine(node.SelectNodes("//a").First().Attributes["href"].Value);
}
That should do it, although I didn't understand why it doesn't work the way it is:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("https://www.yellowpages.com/search?search_terms=custom+software&geo_location_terms=Los+Angeles%2C+CA");
var nodes = document.DocumentNode.SelectNodes("//h2[@class='n']");
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.InnerHtml);
Console.WriteLine(node.SelectSingleNode("a/span").InnerText);
Console.WriteLine(node.SelectSingleNode("a").Attributes["href"].Value);
Console.WriteLine();
}
Fiddle: https://dotnetfiddle.net/dtoZGl
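One likely explanation for the "/" result, offered as a sketch rather than a confirmed diagnosis of the live page: in HtmlAgilityPack an XPath that starts with // is evaluated from the document root even when it is called on a child node, so node.SelectNodes("//a").First() returns the first anchor in the whole document. The markup below is hypothetical (a home link placed before one search result), and the class and method names are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathSketch
{
    // Hypothetical markup: a site-wide home link followed by one search result.
    private const string Html =
        "<html><body>" +
        "<a href=\"/\">Home</a>" +
        "<h2 class=\"n\"><a class=\"business-name\" href=\"/santa-monica-ca/mip/specialists-in-custom-software-16438720\">" +
        "<span>Specialists In Custom Software</span></a></h2>" +
        "</body></html>";

    private static HtmlNode LoadHeading()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode("//h2[@class='n']");
    }

    // "//a" searches from the document root, so the home link is found first.
    public static string HrefViaAbsolutePath()
    {
        return LoadHeading().SelectSingleNode("//a").GetAttributeValue("href", "");
    }

    // "a" is evaluated relative to the h2 node (".//a" would search all its descendants).
    public static string HrefViaRelativePath()
    {
        return LoadHeading().SelectSingleNode("a").GetAttributeValue("href", "");
    }

    public static void Main()
    {
        Console.WriteLine(HrefViaAbsolutePath());
        Console.WriteLine(HrefViaRelativePath());
    }
}
```

With this sample markup the first call yields the home link's href and the second yields the business link, which matches the behaviour described in the question.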
How do I use the HTML Agility Pack?
My XHTML document is not completely valid. That's why I wanted to use it. How do I use it in my project? My project is in C#.
First, install the HtmlAgilityPack NuGet package into your project.
Then, as an example:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
htmlDoc.OptionFixNestedTags=true;
// filePath is a path to a file containing the html
htmlDoc.Load(filePath);
// Use: htmlDoc.LoadHtml(xmlString); to load from a string (was htmlDoc.LoadXML(xmlString))
// ParseErrors is an IEnumerable<HtmlParseError> containing any errors from the Load statement (Count() needs using System.Linq)
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
// Handle any parse errors as required
}
else
{
if (htmlDoc.DocumentNode != null)
{
HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
if (bodyNode != null)
{
// Do something with bodyNode
}
}
}
(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)
The HtmlDocument.Load() method also accepts a stream, which is very useful for integrating with other stream-oriented classes in the .NET Framework. HtmlEntity.DeEntitize() is another useful method for processing HTML entities correctly. (thanks Matthew)
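As a rough illustration of those two points, the sketch below loads a document from a MemoryStream and passes the title text through HtmlEntity.DeEntitize. The sample markup and the names StreamAndEntitySketch and TitleFromStream are assumptions made for the example. Any readable Stream (a file or a network response, for instance) could be passed instead.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class StreamAndEntitySketch
{
    // Reads HTML from a stream and returns the decoded <title> text,
    // or null when the document has no title element.
    public static string TitleFromStream(Stream stream)
    {
        var doc = new HtmlDocument();
        doc.Load(stream, Encoding.UTF8);

        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        // DeEntitize turns entities such as &amp; back into plain characters.
        return title == null ? null : HtmlEntity.DeEntitize(title.InnerText);
    }

    public static void Main()
    {
        // Sample document held in memory (assumption, stands in for a real source).
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<html><head><title>Fish &amp; Chips</title></head><body></body></html>");

        using (var stream = new MemoryStream(bytes))
        {
            Console.WriteLine(TitleFromStream(stream));
        }
    }
}
```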
HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, HtmlNode provides the SelectSingleNode and SelectNodes methods that accept XPath expressions.
Pay attention to the HtmlDocument.Option* boolean properties (OptionFixNestedTags, for example). These control how the Load and LoadHtml methods will process your HTML/XHTML.
There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.
I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.
HtmlAgilityPack Article Series
Introduction To The HtmlAgilityPack Library
Easily extracting links from a snippet of html with HtmlAgilityPack
The next article is 95% complete, I just have to write up explanations of the last few parts of the code I have written. If you are interested then I will try to remember to post here when I publish it.
HtmlAgilityPack uses XPath syntax, and though many argue that it is poorly documented, I had no trouble using it with help from this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp
To parse
<h2>
Jack
</h2>
<ul>
<li class="tel">
81 75 53 60
</li>
</ul>
<h2>
Roy
</h2>
<ul>
<li class="tel">
44 52 16 87
</li>
</ul>
I did this:
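string url = "http://website.com";
var names = new List<string>();
var phones = new List<string>();
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
// Collect the text of every link inside an h2
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a"))
{
names.Add(node.ChildNodes[0].InnerHtml);
}
// Collect the text of every link inside an li with class "tel"
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//li[@class='tel']//a"))
{
phones.Add(node.ChildNodes[0].InnerHtml);
}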
Main HTMLAgilityPack related code is as follows
using System;
using System.Net;
using System.Web;
using System.Web.Services;
using System.Web.Script.Services;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
namespace GetMetaData
{
/// <summary>
/// Summary description for MetaDataWebService
/// </summary>
[WebService(Namespace = "http://tempuri.org/")]
[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
[System.ComponentModel.ToolboxItem(false)]
// To allow this Web Service to be called from script, using ASP.NET AJAX, uncomment the following line.
[System.Web.Script.Services.ScriptService]
public class MetaDataWebService: System.Web.Services.WebService
{
[WebMethod]
[ScriptMethod(UseHttpGet = false)]
public MetaData GetMetaData(string url)
{
MetaData objMetaData = new MetaData();
//Get Title
WebClient client = new WebClient();
string sourceUrl = client.DownloadString(url);
objMetaData.PageTitle = Regex.Match(sourceUrl, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
//Method to get Meta Tags
objMetaData.MetaDescription = GetMetaDescription(url);
return objMetaData;
}
private string GetMetaDescription(string url)
{
string description = string.Empty;
//Get Meta Tags
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var metaTags = document.DocumentNode.SelectNodes("//meta");
if (metaTags != null)
{
foreach(var tag in metaTags)
{
if (tag.Attributes["name"] != null && tag.Attributes["content"] != null && tag.Attributes["name"].Value.ToLower() == "description")
{
description = tag.Attributes["content"].Value;
}
}
}
else
{
description = string.Empty;
}
return description;
}
}
}
public string HtmlAgi(string url, string key)
{
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[@name='{0}']", key));
if (ourNode != null)
{
return ourNode.GetAttributeValue("content", "");
}
else
{
return "not found";
}
}
Getting Started - HTML Agility Pack
// From File
var doc = new HtmlDocument();
doc.Load(filePath);
// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);
// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);
try this
string htmlBody = ParseHmlBody(dtViewDetails.Rows[0]["Body"].ToString());
private string ParseHmlBody(string html)
{
string body = string.Empty;
try
{
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
body = htmlBody.OuterHtml;
}
catch (Exception ex)
{
dalPendingOrders.LogMessage("Error in ParseHmlBody" + ex.Message);
}
return body;
}
C# code:
var node = new HtmlWeb();
var doc = node.Load("http://ask.fm/");
HtmlNode ournode = doc.DocumentNode.SelectSingleNode("//div[@id='heads']");
textBox1.Text=ournode.InnerHtml;
HTML code:
<div id="heads">
<img alt="" class="head" id="face_30132803" src="http://img3.ask.fm/assets2/103/548/655/872/thumb_tiny/IMG_20150513_192250.jpg" />
<img alt="" class="head" id="face_56578735" src="http://img1.ask.fm/assets2/091/364/883/712/thumb_tiny/11094711_919135961470973_149663457_njpg720960png1280963.png" />
I want to see the following in the text box
/sudenur3434
/leylaulucay
I have added an additional line to your code and pointed the XPath at the first link inside the div, since the div itself has no href attribute:
var node = new HtmlWeb();
var doc = node.Load("http://ask.fm/");
HtmlNode ournode = doc.DocumentNode.SelectSingleNode("//div[@id='heads']/a");
var val = ournode.Attributes["href"].Value;
textBox1.Text=val;
This lets you get the href attribute. Use the same code to get the other nodes' href values and then add them to your text box.
Since a text box is usually used for one liners, I am giving you an example that will simply write all links in the direct output window of VS.
If you use e.g. a ListBox instead of a text box you can replace Debug.Print by e.g. ListBox1.Items.Add(href.Value)
This will give you all href URLs from all <a> children of the div with id="heads":
var site = new HtmlWeb();
var htmldoc = site.Load("http://ask.fm/");
var headDiv = htmldoc.DocumentNode.SelectSingleNode("//div[@id='heads']");
if (headDiv != null)
{
var anchors = headDiv.SelectNodes("a");
if (anchors != null) // SelectNodes returns null when nothing matches
{
foreach (HtmlNode aNode in anchors)
{
var href = aNode.Attributes.AttributesWithName("href").FirstOrDefault();
if (href != null)
Debug.Print(href.Value);
}
}
}
<div id="heads">
<img alt="" class="head" id="face_30132803" src="http://img3.ask.fm/assets2/103/548/655/872/thumb_tiny/IMG_20150513_192250.jpg" />
<a href="/leylaulucay" data-rlt-aid="welcome_head"><img alt="" class="head" id="face_5
how to agility pack parse in textbox
I have 2 lists:
public List<string> my_link = new List<string>();
public List<string> english_word = new List<string>();
I am scraping some links from a page and saving them to "my_link"; for this I am using code like:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://search.freefind.com/find.html?id=59478474&pid=r&ics=1&query=" + x);
HtmlNodeCollection nodes=doc.DocumentNode.SelectNodes("//font[@class='search-results']//a");
try
{
foreach (HtmlNode n in nodes)
{
link = n.InnerHtml;
link = link.Trim();
my_link.Add(link);
}
}
catch (NullReferenceException )
{
MessageBox.Show("NO link found ");
}
Then I scrape some content from each of the links I collected and store it with english_word.Add(q);. It can scrape content from all links except the last one. My code looks like this:
foreach (string ss in my_link)
{
HtmlWeb web2 = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc2 = web2.Load(ss);
HtmlNodeCollection nodes2 = doc2.DocumentNode.SelectNodes("//table[@id='table1']//tr[position()>1]//td[position()=2]");
try
{
foreach (HtmlNode nn in nodes2)
{
q = nn.InnerText;
q = System.Net.WebUtility.HtmlDecode(q);
q = q.Trim();
english_word.Add(q);
}
}
catch (NullReferenceException ex)
{
MessageBox.Show("No english word is found ");
}
}
For last link only it shows "No english word is found "
What am I doing wrong?
First, catching a NullReferenceException here is not a very good idea. It's better to check for null where you're expecting nulls.
Second, you most probably get this exception because the HtmlNode.SelectNodes method returns null (not an empty collection of nodes, as you might have expected) if no nodes are found. See HTML Agility Pack Null Reference, C#/ Html Agility pack error “Value cannot be null. Parameter name: Source.”, and a discussion on CodePlex.
So, instead of a try .. catch block you could use something like:
if (nodes2 != null)
{
foreach (HtmlNode nn in nodes2)
{
q = nn.InnerText;
q = System.Net.WebUtility.HtmlDecode(q);
q = q.Trim();
english_word.Add(q);
}
}
else
{
MessageBox.Show("No english word is found ");
}
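A variation on the null check above, sketched under the assumption that an empty result is acceptable to the caller: substitute an empty sequence when SelectNodes returns null, so the rest of the code never sees a null collection. The class and method names (NullSafeSelectSketch, SecondColumn) and the sample table are hypothetical. The XPath is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

public static class NullSafeSelectSketch
{
    // Returns the trimmed, decoded text of the second cell of every row
    // after the first in table1, or an empty list when nothing matches.
    public static List<string> SecondColumn(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(
            "//table[@id='table1']//tr[position()>1]//td[position()=2]");

        // SelectNodes yields null when nothing matches, so fall back to an empty sequence.
        IEnumerable<HtmlNode> safeCells = cells ?? Enumerable.Empty<HtmlNode>();

        return safeCells
            .Select(td => WebUtility.HtmlDecode(td.InnerText).Trim())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical page content with one header row and one data row.
        string html =
            "<table id=\"table1\">" +
            "<tr><th>No</th><th>Word</th></tr>" +
            "<tr><td>1</td><td> apple &amp; pear </td></tr>" +
            "</table>";

        foreach (string word in SecondColumn(html))
        {
            Console.WriteLine(word);
        }

        // A page without the table gives an empty list instead of an exception.
        Console.WriteLine(SecondColumn("<p>nothing here</p>").Count);
    }
}
```

The caller can then decide whether an empty list deserves a message such as "No english word is found", without relying on exceptions for control flow.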
Change your catch statement to catch all exceptions, not just NullReferenceException.
The debugger is your friend; use it. I'm guessing that you get an exception somewhere before adding a new word to the list. Set a breakpoint in your foreach loop.