I want to get the Google result stats as a div, but I get the error "Object reference not set to an instance of an object."
My code:
var doc = new HtmlWeb().Load("http://www.google.com/search?q=love");
var div = doc.DocumentNode.SelectSingleNode("//div[@id='resultStats']");
var text = div.InnerHtml.ToString(); // <--- this line
textBox1.Text = div.ToString();
var matches = Regex.Matches(text, @"About ([0-9,]+) ");
var total = matches[0].Groups[1].Value;
I tried this code:
int counter = 0;
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=love");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
counter = counter + 1;
}
MessageBox.Show(counter.ToString());
I see 97 in the message box.
But when I try this code:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=love");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
MessageBox.Show(link.ToString());
}
I see "HtmlAgilityPack.HtmlNode" in the message box 97 times.
The first error occurs because the HTML you loaded doesn't contain a <div> element with id='resultStats'. That's why your div variable is null, and hence div.InnerHtml throws a NullReferenceException.
As for the second issue: link.ToString() calls the .ToString() method on a variable of type HtmlNode, which apparently is not overridden and just returns the type name. I suspect you want to output the link node itself. To do this, use the .OuterHtml property of your link:
MessageBox.Show(link.OuterHtml);
Just a side note: the HtmlNode.InnerHtml property is already of type string, so calling ToString() on it is unnecessary.
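A minimal sketch of both fixes, assuming a small inline HTML string in place of the downloaded Google page (the markup and the fallback message are made up for illustration): check the result of SelectSingleNode for null before using it, and read OuterHtml rather than calling ToString().

```csharp
using System;
using HtmlAgilityPack;

// Inline HTML stands in for the downloaded page (an assumption for illustration).
var doc = new HtmlDocument();
doc.LoadHtml("<html><body><a href='/a'>first</a></body></html>");

// No element has id='resultStats' here, so SelectSingleNode returns null.
HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='resultStats']");
string statsText = div != null ? div.InnerHtml : "(resultStats not found)";
Console.WriteLine(statsText);

// OuterHtml gives the markup of the node; ToString() would only give the type name.
HtmlNode link = doc.DocumentNode.SelectSingleNode("//a[@href]");
string linkMarkup = link != null ? link.OuterHtml : "";
Console.WriteLine(linkMarkup);
```

The same null check applies to SelectNodes, which also returns null rather than an empty collection when nothing matches.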
I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually.
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only get one message for the first item, and the second message displays all the items.
When you select nodes from that URL by the class 'search-result', only one node is returned. Instead of iterating through its children, you iterate over that single div, which is why you get only one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") -> retrieves the div that holds all the search results and ignores the rest of the document.
The loop iterates only over the <a> nodes below that div and adds each href to a list. The leading dot in SelectNodes(".//a") restricts the search to descendants of the current node (with // instead of .//, the whole page would be searched, which is not what you want).
The if statement makes sure that only unique, non-empty values are added.
You now have all the links.
Fiddle: https://dotnetfiddle.net/j5aQFp
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}
I wrote this code:
Warning: the link points to an adult site!
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load("http://xhamster.com/movies/2808613/jewel_is_a_sexy_cougar_who_loves_to_fuck_lucky_younger_guys.html");
var aTags = document.DocumentNode.SelectNodes("//div[contains(@class,'noFlash')]");
if (aTags != null)
foreach (var aTag in aTags)
{
var href = aTag.Attributes["href"].Value;
textBox2.Text = href;
}
I get an error when I try to run this program.
If I assign something else to "var href", for example:
var href = aTag.InnerHtml
I get the inner HTML, and I can see the "href=" link there along with some other data.
But I only need the link from the href!
You are selecting div elements. A div element can't have an href attribute. If you want to get the hrefs of the anchor tags inside them, you can use:
var hrefs = aTags.Descendants("a")
.Select(node => node.GetAttributeValue("href",""))
.ToList();
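As a self-contained sketch of the same idea, assuming a made-up inline HTML fragment in place of the real page: select the divs, guard against SelectNodes returning null, then collect the href of every anchor below them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical markup standing in for the real page.
var doc = new HtmlDocument();
doc.LoadHtml("<div class='noFlash box'><a href='/one'>1</a><a href='/two'>2</a><a>no link</a></div>");

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//div[contains(@class,'noFlash')]");

List<string> hrefs = divs == null
    ? new List<string>()
    : divs.SelectMany(d => d.Descendants("a"))           // anchors below each div
          .Select(a => a.GetAttributeValue("href", ""))  // "" when href is missing
          .Where(h => h.Length > 0)
          .ToList();

foreach (string h in hrefs)
{
    Console.WriteLine(h);
}
```

GetAttributeValue with a default avoids the exception that Attributes["href"].Value throws when the attribute is absent.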
My XPath is not working.
I am trying to get the URLs from Google.com's search result list into a list of strings,
but I am unable to reach the URLs using XPath.
Please help me correct my XPath, and tell me what should go in place of the ?????????
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["?????????"];
urls.Add(link.Value);
}
for (int i = 0; i <= urls.Count - 1; i++)
{
if (urls.ElementAt(i) != null)
{
if (IsValid(urls.ElementAt(i)) != true)
{
grid.Rows.Add(urls.ElementAt(i));
}
}
}
The URLs seem to live in the cite element under the selected divs, so the XPath to select them is //div[@class='f kv']/cite.
Now, since these contain markup but you only want the text, select the InnerText of the selected nodes. Note that these do not begin with http://.
HtmlNodeCollection linkNodes =
doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
// InnerText is a string, not an HtmlAttribute
string link = linkNode.InnerText;
urls.Add(link);
}
The correct XPath is "//div[@class='kv']/cite". The f class you see in the browser's element inspector is (probably) added by JavaScript after the page is rendered.
Also, the link text is not in an attribute; you can get it from the InnerText property of the <cite> element(s) that this XPath selects.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
urls.Add(linkNode.InnerText);
}
There's a caveat though: some links are trimmed (you'll see a ... in the middle)
I want to replace a node with a new node. How can I get the exact position of the node and do a complete replace?
I've tried the following, but I can't figure out how to get the index of the node or which parent node to call ReplaceChild() on.
string html = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var bolds = document.DocumentNode.Descendants().Where(item => item.Name == "b");
foreach (var item in bolds)
{
string newNodeHtml = GenerateNewNodeHtml();
HtmlNode newNode = new HtmlNode(HtmlNodeType.Text, document, ?);
item.ParentNode.ReplaceChild( )
}
To create a new node, use the HtmlNode.CreateNode() factory method, do not use the constructor directly.
This code should work out for you:
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodeStr = "<foo>bar</foo>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
Note that we need to call ToList() on the query: we will be modifying the document while iterating, so the loop would fail if we enumerated the live query directly.
If you wish to replace with this string:
"some text <b>node</b> <strong>another node</strong>"
The problem is that it is no longer a single node but a series of nodes. You can parse it fine using HtmlNode.CreateNode() but in the end, you're only referencing the first node of the sequence. You would need to replace using the parent node.
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodesStr = "some text <b>node</b> <strong>another node</strong>";
var newHeadNode = HtmlNode.CreateNode(newNodesStr);
item.ParentNode.ReplaceChild(newHeadNode.ParentNode, item);
}
I implemented the following solution to achieve the same thing.
var htmlStr = "<b>bold_one</b><div class='LatestLayout'><div class='olddiv'><strong>strong</strong></div></div><b>bold_two</b>";
var htmlDoc = new HtmlDocument();
// LoadHtml parses a string; Load expects a file path or stream
htmlDoc.LoadHtml(htmlStr);
// Remove the old node, then prepend the replacement to its container.
// newChild is not defined in the original post; it is assumed to be an HtmlNode created elsewhere.
htmlDoc.DocumentNode.SelectSingleNode("//div[@class='olddiv']").Remove();
htmlDoc.DocumentNode.SelectSingleNode("//div[@class='LatestLayout']").PrependChild(newChild);
htmlDoc.Save(FilePath); // FilePath: full path of the .html file, if you need to save the result.
In other words: select the node and remove it from the document,
then add the new node as a child of the containing element.
I'm trying to retrieve a specific image from an HTML document, using Html Agility Pack and this XPath:
//div[@id='topslot']/a/img/@src
As far as I can see, it finds the src attribute, but it returns the img tag. Why is that?
I would expect InnerHtml/InnerText or something to be set, but both are empty strings. OuterHtml is set to the complete img tag.
Is there any documentation for Html Agility Pack?
You can grab the attribute directly if you use an HtmlNodeNavigator instead.
//Load document from some html string
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(htmlContent);
//Load navigator for current document
HtmlNodeNavigator navigator = (HtmlNodeNavigator)hdoc.CreateNavigator();
//Get value from given xpath
string xpath = "//div[@id='topslot']/a/img/@src";
string val = navigator.SelectSingleNode(xpath).Value;
Html Agility Pack does not support attribute selection.
You may use the method "GetAttributeValue".
Example:
//[...] code before needs to load a html document
HtmlAgilityPack.HtmlDocument htmldoc = e.Document;
//get all nodes "a" matching the XPath expression
HtmlNodeCollection AllNodes = htmldoc.DocumentNode.SelectNodes("*[@class='item']/p/a");
//show a messagebox for each node found that shows the content of attribute "href"
foreach (var MensaNode in AllNodes)
{
string url = MensaNode.GetAttributeValue("href", "not found");
MessageBox.Show(url);
}
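Applied to the image question above, the GetAttributeValue approach might look like the following sketch. The inline HTML is a made-up stand-in for the real page; the idea is to select the img element itself (without the trailing /@src step) and then read the attribute from the node.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup that mimics the structure named in the question.
var doc = new HtmlDocument();
doc.LoadHtml("<div id='topslot'><a href='/x'><img src='/img/top.png'></a></div>");

// Select the element, not the attribute.
HtmlNode img = doc.DocumentNode.SelectSingleNode("//div[@id='topslot']/a/img");

// Fall back to an empty string if the node or the attribute is missing.
string src = img != null ? img.GetAttributeValue("src", "") : "";
Console.WriteLine(src);
```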
Html Agility Pack will support it soon.
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=204342
Reading and Writing Attributes with Html Agility Pack
You can both read and set attributes in HtmlAgilityPack. This example selects the <html> tag, checks whether it has a 'lang' (language) attribute, and then reads and writes that attribute.
In the example below, "this.All" in doc.LoadHtml(this.All) is a string representation of an HTML document.
Read and write:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
for (int i = 0; i < nodes.Count; i++)
{
if (nodes[i] != null && nodes[i].Attributes.Count > 0 && nodes[i].Attributes.Contains("lang"))
{
language = nodes[i].Attributes["lang"].Value; //Get attribute
nodes[i].Attributes["lang"].Value = "en-US"; //Set attribute
}
}
Read only:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
foreach (HtmlNode a in nodes)
{
if (a != null && a.Attributes.Count > 0 && a.Attributes.Contains("lang"))
{
language = a.Attributes["lang"].Value;
}
}
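A shorter variant of the read-and-write case, sketched with GetAttributeValue and SetAttributeValue instead of indexing into Attributes. The inline HTML is an assumption that replaces this.All; the default passed to GetAttributeValue covers the case where 'lang' is absent.

```csharp
using System;
using HtmlAgilityPack;

// Made-up document standing in for this.All.
var doc = new HtmlDocument();
doc.LoadHtml("<html lang='de'><body><p>Hallo</p></body></html>");

HtmlNode htmlNode = doc.DocumentNode.SelectSingleNode("//html");

// Read: returns the default ("") if the attribute does not exist.
string before = htmlNode.GetAttributeValue("lang", "");

// Write: updates the attribute, or creates it if it is missing.
htmlNode.SetAttributeValue("lang", "en-US");
string after = htmlNode.GetAttributeValue("lang", "");

Console.WriteLine(before + " -> " + after);
```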
I used the following way to obtain the attributes of an image.
var MainImageString = MainImageNode.Attributes.Where(i=> i.Name=="src").FirstOrDefault();
You can specify the attribute name to get its value (the result here is an HtmlAttribute, or null if there is no match, so read its Value property). If you don't know the attribute name, set a breakpoint after you have fetched the node and inspect its attributes by hovering over it.
Hope I helped.
I just faced this problem and solved it using GetAttributeValue method.
//Selecting all tbody elements
IList<HtmlNode> nodes = doc.QuerySelectorAll("div.characterbox-main")[1]
.QuerySelectorAll("div table tbody");
//Iterating over them and getting the src attribute value of img elements.
var data = nodes.Select((node) =>
{
return new
{
name = node.QuerySelector("tr:nth-child(2) th a").InnerText,
imageUrl = node.QuerySelector("tr td div a img")
.GetAttributeValue("src", "default-url")
};
});