I want to get a value of an attribute by HtmlAgilityPack. Html code:
<link href="style.css">
<link href="anotherstyle.css">
<link href="anotherstyle2.css">
<link itemprop="thumbnailUrl" href="http://image.jpg">
<link href="anotherstyle5.css">
<link href="anotherstyle7.css">
I want to get last href attribute.
My c# code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmldoc = web.Load(Url);
htmldoc.OptionFixNestedTags = true;
var navigator = (HtmlNodeNavigator)htmldoc.CreateNavigator();
string xpath = "//link/#href";
string val = navigator.SelectSingleNode(xpath).Value;
But that code return first href value.
Following XPath selects link elements which have href attribute defined. Then from links you are selecting last one:
var link = doc.DocumentNode.SelectNodes("//link[#href]").LastOrDefault();
// you can also check if link is not null
var href = link.Attributes["href"].Value; // "anotherstyle7.css"
You can also use last() XPath operator
var link = doc.DocumentNode.SelectSingleNode("/link[#href][last()]");
var href = link.Attributes["href"].Value;
UPDATE: If you want to get last element which has both itemprop and href attributes, then use XPath //link[#href and #itemprop][last()] or //link[#href and #itemprop] if you'll go with first approach.
load the webpage as Htmldocument and directly select the last link tag.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
var output = doc.DocumentNode.SelectNodes("//link[#href]").LastOrDefault();
var data = output.Attributes["href"].Value;
or
load the webpage as Htmldocument and get the collection of all selected link tags
then travel using loop then access last select tag attribute.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
int count = 0;
string data = "";
var output = doc.DocumentNode.SelectNodes("//link[#href]");
foreach (var item in output)
{
count++;
if (count == output.Count)
{
data=item.Attributes["href"].Value;
break;
}
}
you need something like that:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmldoc = web.Load(Url);
htmldoc.OptionFixNestedTags = true;
var navigator = (HtmlNodeNavigator)htmldoc.CreateNavigator();
string xpath = "//link[#itemprop]/#href";
string val = navigator.SelectSingleNode(xpath).Value;
Get a HtmlNode by attribute value:
public static class Extensions
{
public static HtmlNode GetNodeByAttributeValue(this HtmlNode htmlNode, string attributeName, string attributeValue)
{
if (htmlNode.Attributes.Contains(attributeName))
{
if (string.Compare(htmlNode.Attributes[attributeName].Value, attributeValue, true) == 0)
{
return htmlNode;
}
}
foreach (var childHtmlNode in htmlNode.ChildNodes)
{
var resultNode = GetNodeByAttributeValue(childHtmlNode, attributeName, attributeValue);
if (resultNode != null) return resultNode;
}
return null;
}
}
Usage
var searchResultsDiv = pageDocument.DocumentNode.GetNodeByAttributeValue("someattributename", "resultsofsearch");
Ok, I came to this:
var link = htmldoc.DocumentNode.SelectSingleNode("//link[#itemprop='thumbnailUrl']");
var href = link.Attributes["href"].Value;
Related
I want to load content into a htmlDocument list using HTML Agility pack.
I have successfully achieved what I want using:
var htmllist = new List<HtmlDocument>();
int counter = 0;
foreach(var c in content)
{
htmllist.Add(new HtmlDocument());
htmllist[counter].LoadHtml(c);
counter += 1;
}
How can i write this in a Lambda expression? I tried:
var htmllist = content.Select(p => new HtmlDocument() {Text = p })
You need to add a ToList() to execute the query like
var htmllist = content.Select(p => new HtmlDocument() { Text = p }).ToList();
Per your comment and another comment: you can change your existing code a bit like
private HtmlDocument LoadHtmlFromContent(string content)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(content);
return doc;
}
Now call this in your Linq query like
var htmllist = content.Select(p => this.LoadHtmlFromContent(p)).ToList();
Enumerable.Selectaccepts an arbitrary selector as Func<TSource,TResult>. So you can inline the conversion method, but imho it really doesn't look great…
content.Select(c => {var doc = new HtmlDocument(); doc.LoadHtml(c); return doc;});
If you're using C# >=7.0 you could think about using a local function for that. E.g.
void Convert(IEnumerable<string> content)
{
var htmls = content.Select(ConvertToHtml);
HtmlDocument ConvertToHtml(string c)
{
var doc = new HtmlDocument();
doc.LoadHtml(c);
return doc;
}
}
That looks more maintainable to me.
I have a file which contain text in that. I need to search for a string and extract the href on that line.
file.txt is the file which contain basic wordpress homepage
finally I want the http://example.com like link.
I tried several ways like
DateTime dateTime = DateTime.UtcNow.Date;
string stringpart = dateTime.ToString("-dd-M-yyyy");
string finalword = "candy" + stringpart;
List<List<string>> groups = new List<List<string>>();
List<string> current = null;
foreach (var line in File.ReadAllLines(#"E:/file.txt"))
{
if (line.Contains("-22-8-2014") && current == null)
current = new List<string>();
else if (line.Contains("candy") && current != null)
{
groups.Add(current);
current = null;
}
if (current != null)
current.Add(line);
}
foreach (object o in groups)
{
Console.WriteLine(o);
}
Console.ReadLine();
}
To do this correctly, you must parse this html-file. Use something like CSquery, HTML Agility Pack, or SgmlReader.
Solution of your problem with CSQuery:
public IEnumerable<string> ExtractLinks(string htmlFile)
{
var page = CQ.CreateFromFile(htmlFile);
return page.Select("a[href]").Select(tag => tag.GetAttribute("href"));
}
In case you decided to use HtmlAgilityPack, this should be easy :
var doc = new HtmlDocument();
//load your HTML file to HtmlDocument
doc.Load("path_to_your_html.html");
//select all <a> tags containing href attribute
var links = doc.DocumentNode.SelectNodes("//a[#href]");
foreach(HtmlNode link in links)
{
//print value of href attribute
Console.WriteLine(link.GetAttributeValue("href", "");
}
I need to get the location, address, and phone number from "http://anytimefitness.com/find-gym/list/AL" So far I have this...
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(stateURLs[0].ToString());
var BlankNode =
htmlDoc.DocumentNode.SelectNodes("/div[#class='segmentwhite']/table[#style='width: 100%;']//tr[#class='']");
var GrayNode =
htmlDoc.DocumentNode.SelectNodes("/div[#class='segmentwhite']/table[#style='width: 100%;']//tr[#class='gray_bk']");
I have looked around stackoverflow for a while but none of the present post regarding htmlagilitypack has really helped. I have also have been using http://www.w3schools.com/xpath/xpath_syntax.asp
Since <div> you're after is not direct child of root node, you need to use // instead of /. Then you can combine XPath for BlankNode and GrayNode using or operator, for example :
var htmlweb = new HtmlWeb();
HtmlDocument htmlDoc = htmlweb.Load("http://anytimefitness.com/find-gym/list/AL");
htmlDoc.OptionFixNestedTags = true;
var AllNode =
htmlDoc.DocumentNode.SelectNodes("//div[#class='segmentwhite']/table//tr[#class='' or #class='gray_bk']");
foreach (HtmlNode node in AllNode)
{
var location = node.SelectSingleNode("./td[2]").InnerText;
var address = node.SelectSingleNode("./td[3]").InnerText;
var phone = node.SelectSingleNode("./td[4]").InnerText;
//do something with above informations
}
Here's an example I tested in LinqPad.
string url = #"http://anytimefitness.com/find-gym/list/AL";
var client = new System.Net.WebClient();
var data = client.DownloadData(url);
var html = Encoding.UTF8.GetString(data);
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(html);
var gyms = htmlDoc.DocumentNode.SelectNodes("//tbody/tr[#class='' or #class='gray_bk']");
foreach (var gym in gyms) {
var city = gym.SelectSingleNode("./td[2]").InnerText;
var address = gym.SelectSingleNode("./td[3]").InnerText;
var phone = gym.SelectSingleNode("./td[4]").InnerText;
}
Since the HtmlAgilityPack also supports Linq, you could also do something like:
string [] classes = {"", "gray_bk"};
var gyms = htmlDoc
.DocumentNode
.Descendants("tr")
.Where(t => classes.Contains(t.Attributes["class"].Value))
.ToList();
gyms.ForEach(gym => {
var city = gym.SelectSingleNode("./td[2]").InnerText;
var address = gym.SelectSingleNode("./td[3]").InnerText;
var phone = gym.SelectSingleNode("./td[4]").InnerText;
});
I'm having this method to select out specific html and put it In a list.
Works perfect when I'm using a html-file saved on my computer. But how can a load a content from a website
This is my method loading the .html-file, witch works:
public void TestGetHtml()
{
var doc = new HtmlDocument();
doc.Load("C:/Users/Jonathan/Desktop/laggen.html");
var xpath = "//table[#id='tableSearchArticle']/tbody/tr/td[4]";
var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
.Select(td => td.InnerText.Replace("GTIN:", ""));
}
But I want to load a website instead of a file, like this:
public void TestGetHtml()
{
var doc = new HtmlDocument();
doc.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen"); <--- this is the site I want to load
var xpath = "//table[#id='tableSearchArticle']/tbody/tr/td[4]";
var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
.Select(td => td.InnerText.Replace("GTIN:", ""));
}
Use
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen");
var xpath = "//table[#id='tableSearchArticle']/tbody/tr/td[4]";
var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
.Select(td => td.InnerText.Replace("GTIN:", ""));
foreach (string gtin in listOfGtins)
{
Console.WriteLine(gtin);
}
if you want to load HTML over HTTP from a URL.
Please help-me with this question:
private void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
var doc = new HtmlDocument();
doc.LoadHtml("http://www.unnu.com/popular-music-videos");
//var query = doc.DocumentNode.Descendants("img");
MessageBox.Show("chegou");
foreach (HtmlNode linkNode in doc.DocumentNode.SelectNodes("#//img[#src]"))
{
HtmlAttribute link = linkNode.Attributes[#"href"];
HtmlNode imageNode = linkNode.SelectSingleNode(#"//.php?src");
HtmlAttribute src = imageNode.Attributes[#"src"];
string Link = link.Value;
Uri imageUrl = new Uri(src.Value);
MessageBox.Show("chegou");
}
}
I need to get the all images and titles with your respective urls. I'm Using windows phone 7.5. The dll is the same.
doc.LoadHtml expects html, not the url. Are you looking for something like this?
var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://www.unnu.com/popular-music-videos");
var imgs = doc.DocumentNode.SelectNodes(#"//img[#src]")
.Select(img => new
{
Link = img.Attributes["src"].Value,
Title = img.Attributes["alt"].Value
})
.ToList();
If HtmlAgilityPack's WP7 version doesn't support HtmlWeb you can also use WebClient to get the html string and it can be used as parameter to doc.LoadHtml