I'm trying to retrieve a specific image from a html document, using html agility pack and this xpath:
//div[#id='topslot']/a/img/#src
As far as I can see, it finds the src-attribute, but it returns the img-tag. Why is that?
I would expect the InnerHtml/InnerText or something to be set, but both are empty strings. OuterHtml is set to the complete img-tag.
Are there any documentation for Html Agility Pack?
You can directly grab the attribute if you use the HtmlNavigator instead.
//Load document from some html string
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(htmlContent);
//Load navigator for current document
HtmlNodeNavigator navigator = (HtmlNodeNavigator)hdoc.CreateNavigator();
//Get value from given xpath
string xpath = "//div[#id='topslot']/a/img/#src";
string val = navigator.SelectSingleNode(xpath).Value;
Html Agility Pack does not support attribute selection.
You may use the method "GetAttributeValue".
Example:
//[...] code before needs to load a html document
HtmlAgilityPack.HtmlDocument htmldoc = e.Document;
//get all nodes "a" matching the XPath expression
HtmlNodeCollection AllNodes = htmldoc.DocumentNode.SelectNodes("*[#class='item']/p/a");
//show a messagebox for each node found that shows the content of attribute "href"
foreach (var MensaNode in AllNodes)
{
string url = MensaNode.GetAttributeValue("href", "not found");
MessageBox.Show(url);
}
Html Agility Pack will support it soon.
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=204342
Reading and Writing Attributes with Html Agility Pack
You can both read and set the attributes in HtmlAgilityPack. This example selects the < html> tag and selects the 'lang' (language) attribute if it exists and then reads and writes to the 'lang' attribute.
In the example below, the doc.LoadHtml(this.All), "this.All" is a string representation of a html document.
Read and write:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
for (int i = 0; i < nodes.Count; i++)
{
if (nodes[i] != null && nodes[i].Attributes.Count > 0 && nodes[i].Attributes.Contains("lang"))
{
language = nodes[i].Attributes["lang"].Value; //Get attribute
nodes[i].Attributes["lang"].Value = "en-US"; //Set attribute
}
}
Read only:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
foreach (HtmlNode a in nodes)
{
if (a != null && a.Attributes.Count > 0 && a.Attributes.Contains("lang"))
{
language = a.Attributes["lang"].Value;
}
}
I used the following way to obtain the attributes of an image.
var MainImageString = MainImageNode.Attributes.Where(i=> i.Name=="src").FirstOrDefault();
You can specify the attribute name to get its value; if you don't know the attribute name, give a breakpoint after you have fetched the node and see its attributes by hovering over it.
Hope I helped.
I just faced this problem and solved it using GetAttributeValue method.
//Selecting all tbody elements
IList<HtmlNode> nodes = doc.QuerySelectorAll("div.characterbox-main")[1]
.QuerySelectorAll("div table tbody");
//Iterating over them and getting the src attribute value of img elements.
var data = nodes.Select((node) =>
{
return new
{
name = node.QuerySelector("tr:nth-child(2) th a").InnerText,
imageUrl = node.QuerySelector("tr td div a img")
.GetAttributeValue("src", "default-url")
};
});
Related
I'm using Html Agility Pack (1.4.9.5) to remove a node within a specified class:
var document = new HtmlDocument();
document.LoadHtml("<p><div class=\"remove-it\"></div></p>");
var nodesToRemove = document.QuerySelectorAll(".remove-it");
if (nodesToRemove != null)
{
foreach (var node in nodesToRemove)
{
node.Remove();
}
}
var res = document.DocumentNode.OuterHtml;
The problem is that at the end res is equal to:
<p>
but it should be:
<p></p>
How can I fix this?
Almost there! You are missing
HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed; before document.LoadHtml("<p><div class=\"remove-it\"></div></p>");.
What that does is that the p element will be automatically closed when parsing the document.
Is it possible to change the name of a tag from code? Something like this:
var tag = doc.QuerySelector("i");
tag.TagName = "em";
This won't work, because TagName is read-only.
But, what are my options for getting to the same end? Would I have to construct an entirely new tag and set the InnerHtml to the contents of the old tag, then delete and swap? Is this even possible?
If you mean to replace the elements in html string then it can be done this way:
private static string RefineImageElement(string htmlContent)
{
var parser = new HtmlParser();
var document = parser.ParseDocument(htmlContent);
foreach (var element in document.All)
{
if (element.LocalName == "img")
{
var newElement = document.CreateElement("v-img");
newElement.SetAttribute("src", element.Attributes["src"] == null ? "" :
element.Attributes["src"].Value);
newElement.SetAttribute("alt", "Article Image");
element.Insert(AdjacentPosition.BeforeBegin, newElement.OuterHtml);
element.Remove();
}
}
return document.FirstElementChild.OuterHtml;
}
To change element name, you can replace OuterHtml of initial element with combination of :
new element opening tag
initial element's InnerHtml
new element closing tag
Here is an example :
var raw = #"<div>
<i>foo</i>
</div>";
var parser = new AngleSharp.Parser.Html.HtmlParser();
var doc = parser.Parse(raw);
var tag = doc.QuerySelector("i");
tag.OuterHtml = $"<em>{tag.InnerHtml}</em>";
Console.WriteLine(doc.DocumentElement.OuterHtml);
Output :
<html><head></head><body><div>
<em>foo</em>
</div></body></html>
I was having the same question the other day. Later the only solution came to me is simply create another element and copy every attributes from the original element. Then remove the original element.
I have a problem that my xpath is not working.
I am trying to get the url from Google.com's search result list into a string list.
But i am unable to reach on url using Xpath.
Please help me in correcting my xpath. Also tell me what should be on the place of ??
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[#class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["?????????"];
urls.Add(link.Value);
}
for (int i = 0; i <= urls.Count - 1; i++)
{
if (urls.ElementAt(i) != null)
{
if (IsValid(urls.ElementAt(i)) != true)
{
grid.Rows.Add(urls.ElementAt(i));
}
}
}
The URLs seem to live in the cite element under that selected divs, so the XPath to select those is //div[#class='f kv']/cite.
Now, since these contain markup but you only want the text, select the InnerText of the selected nodes. Note that these do not begin with http://.
HtmlNodeCollection linkNodes =
doc.DocumentNode.SelectNodes("//div[#class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.InnerText;
urls.Add(link.Value);
}
The correct XPath is "//div[#class='kv']/cite". The f class you see in the browser element inspector is (probably) added after the page is rendered using javascript.
Also, the link text is not in an attribute, you can get it using the InnerText property of the <div> element(s) obtained at the earlier step.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[#class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
urls.Add(linkNode.InnerText);
}
There's a caveat though: some links are trimmed (you'll see a ... in the middle)
I want to replace inner text of HTML tags with another text.
I am using HtmlAgilityPack
I use this code to extract all texts
HtmlDocument doc = new HtmlDocument();
doc.Load("some path")
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']")) {
// How to replace node.InnerText with some text ?
}
But InnerText is readonly. How can I replace texts with another text and save them to file ?
Try code below. It select all nodes without children and filtered out script nodes. Maybe you need to add some additional filtering. In addition to your XPath expression this one also looking for leaf nodes and filter out text content of <script> tags.
var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(htmlNode.InnerText + "_translated"), htmlNode);
}
Strange, but I found that InnerHtml isn't readonly. And when I tried to set it like that
aElement.InnerHtml = "sometext";
the value of InnerText also changed to "sometext"
The HtmlTextNode class has a Text property* which works perfectly for this purpose.
Here's an example:
var textNodes = doc.DocumentNode.SelectNodes("//body/text()").Cast<HtmlTextNode>();
foreach (var node in textNodes)
{
node.Text = node.Text.Replace("foo", "bar");
}
And if we have an HtmlNode that we want to change its direct text, we can do something like the following:
HtmlNode node = //...
var textNode = (HtmlTextNode)node.SelectSingleNode("text()");
textNode.Text = "new text";
Or we can use node.SelectNodes("text()") in case it has more than one.
* Not to be confused with the readonly InnerText property.
I am attempting to use the Agility pack to parse certain bits of info from various pages. I am kind of worried that using this might be overkill for what I need, if that is case feel free to let me know. Anyway, I am attempting to parse a page from motley fool to get the name of a company based on the ticker. I will be parsing several pages to get stock info in a similar way.
The HTML that I want to parse looks like:
<h1 class="subHead">
Microsoft Corp <span>(NASDAQ:MSFT)</span>
</h1>
Also, the page I want to parse is: http://caps.fool.com/Ticker/MSFT.aspx
So, I guess my question is how do I simply get the Microsoft Corp from the html and should I even be using the agility pack to do things like this?
Edit: Current code
public String getStockName(String ticker)
{
String text ="";
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://caps.fool.com/Ticker/" + ticker + ".aspx");
var node = doc.DocumentNode.SelectSingleNode("/h1[#class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
}
This would give you a list of all stock names, for your sample Html just of Microsoft:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("test.html");
var nodes = doc.DocumentNode.SelectNodes("//h1[#class='subHead']");
foreach (var node in nodes)
{
string text = node.FirstChild.InnerText; //output: "Microsoft Corp"
string textAll = node.InnerText; //output: "Microsoft Corp (NASDAQ:MSFT)"
}
Edit based on updated question - this should work for you:
string text = "";
HtmlWeb web = new HtmlWeb();
string url = string.Format("http://caps.fool.com/Ticker/{0}.aspx", ticker);
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
var node = doc.DocumentNode.SelectSingleNode("//h1[#class='subHead']");
text = node.FirstChild.InnerText.Trim();
return text;
Use an xpath expression to select the element then pickup the text.
foreach (var element in doc.DocumentNode.SelectNodes("//h1[#clsss='subHead']/span"))
{
Console.WriteLine (element.InnerText);
}