Getting text from h1 className - c#

I'm trying to get the text 'getthis' using the Html Agility Pack.
<h1 class="point">getthis< span class="level">Niveau 0</span> </h1>
I've tried:
var links = webBrowser1.Document.GetElementsByTagName("point");
foreach (HtmlElement link in links)
{
Console.WriteLine("AAAAA");
}
var varit = doc.DocumentNode
.SelectSingleNode("//h1[@class='point']")
.InnerText;
Console.WriteLine("WB2 TRUE ", varit.ToString());
var varit = doc.DocumentNode
.SelectSingleNode("//h1[@class='point']")
.InnerHtml;
Console.WriteLine("WB2 TRUE ", varit.ToString());
Why isn't my code working?

Try this:
htmlDoc.DocumentNode.Descendants("h1")
.Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("point"))
.First().InnerHtml;
For example, this code:
string html = "<h1 class=\"point\">getthis< span class=\"level\">Niveau 0</span> </h1>";
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var str = htmlDoc.DocumentNode.Descendants("h1").Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("point")).First().ChildNodes[0].InnerText;
returns: "getthis"
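The same lookup can also be written with XPath, where attribute tests use the @ prefix. Below is a minimal sketch, assuming the markup is well formed (the span tag written as <span ...> with no space after the bracket); the class name HeadingText and the method GetOwnText are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HeadingText
{
    // Hypothetical helper: returns the text that sits directly inside the
    // first <h1 class="point">, ignoring text in nested elements such as
    // the <span>. Returns null when no such heading exists.
    public static string GetOwnText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath attribute tests use '@', e.g. @class.
        HtmlNode h1 = doc.DocumentNode.SelectSingleNode("//h1[@class='point']");
        if (h1 == null)
        {
            return null; // SelectSingleNode returns null when nothing matches
        }

        // The first text node is the heading's own text; the <span> is a
        // separate element child and is skipped.
        HtmlNode text = h1.ChildNodes
            .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text);
        return text?.InnerText.Trim();
    }
}
```

Checking for null before reading InnerText avoids a NullReferenceException when the page has no matching h1.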

I solved this after finding the following:
var a = doc.DocumentNode.SelectNodes("//h1").Skip(0).Take(1).Single();

Related

Html Agility Pack get contents from table

I need to get the location, address, and phone number from "http://anytimefitness.com/find-gym/list/AL" So far I have this...
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(stateURLs[0].ToString());
var BlankNode =
htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='']");
var GrayNode =
htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='gray_bk']");
I have looked around Stack Overflow for a while, but none of the existing posts about HtmlAgilityPack has really helped. I have also been using http://www.w3schools.com/xpath/xpath_syntax.asp
Since the <div> you're after is not a direct child of the root node, you need to use // instead of /. Then you can combine the XPath for BlankNode and GrayNode using the or operator, for example:
var htmlweb = new HtmlWeb();
HtmlDocument htmlDoc = htmlweb.Load("http://anytimefitness.com/find-gym/list/AL");
htmlDoc.OptionFixNestedTags = true;
var AllNode =
htmlDoc.DocumentNode.SelectNodes("//div[@class='segmentwhite']/table//tr[@class='' or @class='gray_bk']");
foreach (HtmlNode node in AllNode)
{
var location = node.SelectSingleNode("./td[2]").InnerText;
var address = node.SelectSingleNode("./td[3]").InnerText;
var phone = node.SelectSingleNode("./td[4]").InnerText;
// do something with the values above
}
Here's an example I tested in LinqPad.
string url = @"http://anytimefitness.com/find-gym/list/AL";
var client = new System.Net.WebClient();
var data = client.DownloadData(url);
var html = Encoding.UTF8.GetString(data);
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(html);
var gyms = htmlDoc.DocumentNode.SelectNodes("//tbody/tr[@class='' or @class='gray_bk']");
foreach (var gym in gyms) {
var city = gym.SelectSingleNode("./td[2]").InnerText;
var address = gym.SelectSingleNode("./td[3]").InnerText;
var phone = gym.SelectSingleNode("./td[4]").InnerText;
}
Since HtmlAgilityPack also supports LINQ, you could also do something like:
string [] classes = {"", "gray_bk"};
var gyms = htmlDoc
.DocumentNode
.Descendants("tr")
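.Where(t => t.Attributes.Contains("class") && classes.Contains(t.Attributes["class"].Value)) // rows without a class attribute are skipped; Attributes["class"] would be null for them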
.ToList();
gyms.ForEach(gym => {
var city = gym.SelectSingleNode("./td[2]").InnerText;
var address = gym.SelectSingleNode("./td[3]").InnerText;
var phone = gym.SelectSingleNode("./td[4]").InnerText;
});

Get text from webpage using HtmlAgilityPack

I would like to get the inner text of the <p> tags. I use HtmlAgilityPack to get the HTML from the website, but this isn't working. What am I doing wrong?
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml("urlwebsite");
var itemList = doc.DocumentNode.SelectNodes("//p")
.Select(p => p.InnerText)
.ToList();
LoadHtml expects HTML markup, not a URL. To download and parse the page, use HtmlWeb.Load instead:
HtmlAgilityPack.HtmlDocument doc;
var web = new HtmlAgilityPack.HtmlWeb();
doc = web.Load("urlwebsite");
var itemList = doc.DocumentNode.SelectNodes("//p")
.Select(p => p.InnerText)
.ToList();
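One more thing to watch for: SelectNodes returns null rather than an empty collection when nothing matches, so calling Select on the result fails on a page with no <p> tags. Here is a rough sketch of a guarded version; the class name Paragraphs and its method names are invented for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class Paragraphs
{
    // Parses markup that is already in memory (LoadHtml takes markup, not a URL).
    public static List<string> FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return Extract(doc);
    }

    // Downloads the page with HtmlWeb first, then extracts the same way.
    public static List<string> FromUrl(string url)
    {
        HtmlDocument doc = new HtmlWeb().Load(url);
        return Extract(doc);
    }

    // Returns the trimmed text of every <p>, or an empty list if there are none.
    private static List<string> Extract(HtmlDocument doc)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return new List<string>(); // no match: SelectNodes gave us null
        }
        return nodes.Select(p => p.InnerText.Trim()).ToList();
    }
}
```

With this, Paragraphs.FromHtml("urlwebsite") just returns an empty list instead of throwing, which makes the original mistake easier to spot.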

Get a value of an attribute by HtmlAgilityPack

I want to get a value of an attribute by HtmlAgilityPack. Html code:
<link href="style.css">
<link href="anotherstyle.css">
<link href="anotherstyle2.css">
<link itemprop="thumbnailUrl" href="http://image.jpg">
<link href="anotherstyle5.css">
<link href="anotherstyle7.css">
I want to get the last href attribute.
My C# code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmldoc = web.Load(Url);
htmldoc.OptionFixNestedTags = true;
var navigator = (HtmlNodeNavigator)htmldoc.CreateNavigator();
string xpath = "//link/@href";
string val = navigator.SelectSingleNode(xpath).Value;
But that code returns the first href value.
The following XPath selects link elements that have an href attribute defined. From those links you then take the last one:
var link = doc.DocumentNode.SelectNodes("//link[@href]").LastOrDefault();
// you can also check if link is not null
var href = link.Attributes["href"].Value; // "anotherstyle7.css"
You can also use the XPath last() function:
var link = doc.DocumentNode.SelectSingleNode("//link[@href][last()]");
var href = link.Attributes["href"].Value;
UPDATE: If you want the last element that has both itemprop and href attributes, use the XPath //link[@href and @itemprop][last()], or //link[@href and @itemprop] if you go with the first approach.
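A note on last(): in //link[@href][last()] the predicate is evaluated per parent element, so it picks the last matching link among its siblings. If the links could be spread over several parents, wrapping the path in parentheses applies last() to the whole result set instead. The sketch below assumes that is what you want; LastLink and LastHref are illustrative names only.

```csharp
using HtmlAgilityPack;

public static class LastLink
{
    // Hypothetical helper: returns the href of the last <link href="...">
    // in document order, or null if the document has no such element.
    public static string LastHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The parentheses make last() apply to all matches in the document,
        // not to each group of sibling <link> elements separately.
        HtmlNode link = doc.DocumentNode.SelectSingleNode("(//link[@href])[last()]");
        if (link == null)
        {
            return null;
        }
        return link.GetAttributeValue("href", "");
    }
}
```

For the markup in the question, where all link elements are siblings, both forms select the same element.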
Load the web page as an HtmlDocument and directly select the last link tag:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var output = doc.DocumentNode.SelectNodes("//link[@href]").LastOrDefault();
var data = output.Attributes["href"].Value;
Or load the web page as an HtmlDocument, get the collection of all selected link tags, then loop through it and read the attribute of the last tag:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
int count = 0;
string data = "";
var output = doc.DocumentNode.SelectNodes("//link[@href]");
foreach (var item in output)
{
count++;
if (count == output.Count)
{
data=item.Attributes["href"].Value;
break;
}
}
You need something like this:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmldoc = web.Load(Url);
htmldoc.OptionFixNestedTags = true;
var navigator = (HtmlNodeNavigator)htmldoc.CreateNavigator();
string xpath = "//link[@itemprop]/@href";
string val = navigator.SelectSingleNode(xpath).Value;
Get a HtmlNode by attribute value:
public static class Extensions
{
public static HtmlNode GetNodeByAttributeValue(this HtmlNode htmlNode, string attributeName, string attributeValue)
{
if (htmlNode.Attributes.Contains(attributeName))
{
if (string.Compare(htmlNode.Attributes[attributeName].Value, attributeValue, true) == 0)
{
return htmlNode;
}
}
foreach (var childHtmlNode in htmlNode.ChildNodes)
{
var resultNode = GetNodeByAttributeValue(childHtmlNode, attributeName, attributeValue);
if (resultNode != null) return resultNode;
}
return null;
}
}
Usage
var searchResultsDiv = pageDocument.DocumentNode.GetNodeByAttributeValue("someattributename", "resultsofsearch");
OK, I ended up with this:
var link = htmldoc.DocumentNode.SelectSingleNode("//link[@itemprop='thumbnailUrl']");
var href = link.Attributes["href"].Value;

Use HtmlAgilityPack to determine if string contains ONLY tags from list of allowed tags

Cf. "Finding HTML strings in document" and similar questions.
I have seen examples of using HtmlAgilityPack to parse a string looking for specific tags, but what if I want to make sure that the input string contains ONLY tags from a list, List<string> AllowedTags?
In other words, how can I iterate over doc.DocumentNode.Descendants to get each tag name and check whether it is in the list?
var allowedTags = new List<string> { "html", "head", "body", "div" };
bool containsOnlyAllowedTags =
doc.DocumentNode
.Descendants()
.Where(n => n.NodeType == HtmlNodeType.Element)
.All(n => allowedTags.Contains(n.Name));
List<string> AllowedTags = new List<string>() { "br", "a" };
HtmlDocument goodDoc = new HtmlDocument();
goodDoc.LoadHtml("<a href='asdf'>asdf</a><br /><a href='qwer'>qwer</a>");
bool containsBadTags = goodDoc.DocumentNode.Descendants()
.Where(node => node.NodeType == HtmlNodeType.Element)
.Select(node => node.Name)
.Except(AllowedTags)
.Any();
HtmlDocument badDoc = new HtmlDocument();
badDoc.LoadHtml("<a href='asdf'><b>asdf</b></a><br /><a href='qwer'>qwer</a>");
containsBadTags = badDoc.DocumentNode.Descendants()
.Where(node => node.NodeType == HtmlNodeType.Element)
.Select(node => node.Name)
.Except(AllowedTags)
.Any();
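If you also want to know which tags are the offending ones, the same query can return the names instead of a bool. This is only a sketch; TagWhitelist and FindDisallowedTags are names I made up, and the case-insensitive comparison is my own choice rather than something the question asked for.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TagWhitelist
{
    // Hypothetical helper: returns the distinct element names found in the
    // markup that are not in allowedTags. An empty list means the input
    // contains only allowed tags.
    public static List<string> FindDisallowedTags(string html, IEnumerable<string> allowedTags)
    {
        // A HashSet gives constant-time lookups; names are compared ignoring case.
        var allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element) // skip text and comment nodes
            .Select(n => n.Name)
            .Where(name => !allowed.Contains(name))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Usage would look like TagWhitelist.FindDisallowedTags(input, AllowedTags).Count == 0 to get the same yes/no answer as above.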

extracting values of text from html source file

In this code, the variable TempTxt holds the HTML body content as a string.
How can I extract the inner text/HTML of a <table> or <td> element using lambda syntax?
public string ExtractPageValue(IWebDriver DDriver, string url="")
{
if(string.IsNullOrEmpty(url))
url = @"http://www.boi.org.il/he/Markets/ExchangeRates/Pages/Default.aspx";
var service = InternetExplorerDriverService.CreateDefaultService(directory);
service.LogFile = directory + @"\seleniumlog.txt";
service.LoggingLevel = InternetExplorerDriverLogLevel.Trace;
var options = new InternetExplorerOptions();
options.IntroduceInstabilityByIgnoringProtectedModeSettings = true;
DDriver = new InternetExplorerDriver(service, options, TimeSpan.FromSeconds(60));
DDriver.Navigate().GoToUrl(url);
var TempTxt = DDriver.PageSource;
return "";//Math.Round(Convert.ToDouble( TempTxt.Split(' ')[10]),2).ToString();
}
If you are open to trying HtmlAgilityPack:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is the page source string, e.g. TempTxt (DDriver.PageSource) in the question
var table = doc.DocumentNode.SelectNodes("//table/tr")
.Select(tr => tr.Elements("td").Select(td => td.InnerText).ToList())
.ToList();
