Getting specific data from html - c#

I want to get specific data from HTML. I'm using C# and HtmlAgilityPack.
Here's the HTML sample:
<p class="heading"><span>Greeting!</span>
<p class='verse'>Hi!<br> //
Hello!</p><p class='verse'>Hello!<br> // i want to get this g
Hi!</p> //
<p class="writers"><strong>WE</strong><br/>
Here's my code in C#:
StringBuilder pureText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Lyrics);
var s = doc.DocumentNode.Descendants("p");
try
{
foreach (HtmlNode childNode in s)
{
pureText.Append(childNode.InnerText);
}
}
catch
{ }
UPDATE:
StringBuilder pureText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(URL);
var s = doc.DocumentNode.SelectNodes("//p[@class='verse']"); // error
try
{
foreach (HtmlNode childNode in s)
{
pureText.Append(childNode.InnerText);
}
}
catch
{ }
ERROR:
'HtmlAgilityPack.HtmlNode' does not contain a definition for 'SelectNodes' and no extension method 'SelectNodes' accepting a first argument of type 'HtmlAgilityPack.HtmlNode' could be found (are you missing a using directive or an assembly reference?)

You can try XPath query syntax to select all <p> elements having class='verse', like this:
var s = doc.DocumentNode.SelectNodes("//p[@class='verse']");
Then do the same foreach as you already have.
UPDATE I:
I don't know why the code above throws an error for you. It was tested on my PC and should work fine. Anyway, if you accept a workaround, the same query can be written without XPath this way:
var s = doc.DocumentNode.Descendants("p").Where(o => o.Attributes["class"] != null && o.Attributes["class"].Value == "verse");
This solution is longer because we need to check whether a node has a class attribute before checking the attribute's value. Otherwise we'll get a NullReferenceException if there is any <p> without a class attribute.
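For what it's worth, the LINQ workaround can be shortened a bit. Below is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name VerseExtractor, the method ExtractVerses and the sample markup in Main are made up for illustration. It relies on GetAttributeValue, which returns the supplied fallback when the attribute is missing, so the explicit null check goes away.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class VerseExtractor
{
    // Returns the text of every <p> whose class attribute is exactly "verse".
    public static List<string> ExtractVerses(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetAttributeValue returns the fallback ("") when the attribute is
        // missing, so no separate null check on Attributes["class"] is needed.
        return doc.DocumentNode
                  .Descendants("p")
                  .Where(p => p.GetAttributeValue("class", "") == "verse")
                  .Select(p => p.InnerText.Trim())
                  .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample, loosely based on the HTML in the question.
        string html = "<p class=\"heading\"><span>Greeting!</span></p>"
                    + "<p class='verse'>Hi!<br>Hello!</p>"
                    + "<p class='verse'>Hello!<br>Hi!</p>"
                    + "<p class=\"writers\"><strong>WE</strong></p>";

        foreach (string verse in ExtractVerses(html))
            Console.WriteLine(verse);
    }
}
```

Returning a list instead of appending to a StringBuilder inside an empty catch also makes it easier to see when nothing matched.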

Related

Remove node of single child parent in html agility pack

I'm using Html Agility Pack (1.4.9.5) to remove a node within a specified class:
var document = new HtmlDocument();
document.LoadHtml("<p><div class=\"remove-it\"></div></p>");
var nodesToRemove = document.QuerySelectorAll(".remove-it");
if (nodesToRemove != null)
{
foreach (var node in nodesToRemove)
{
node.Remove();
}
}
var res = document.DocumentNode.OuterHtml;
The problem is that at the end res is equal to:
<p>
but it should be:
<p></p>
How can I fix this?
Almost there! You are missing
HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed; before document.LoadHtml("<p><div class=\"remove-it\"></div></p>");.
With that flag set, the p element is closed automatically when the document is parsed.

htmlagilitypack xpath incorrect

My XPath is not working.
I am trying to get the URLs from Google.com's search result list into a string list,
but I am unable to reach the URL using XPath.
Please help me correct my XPath. Also, tell me what should go in place of the ?????????
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["?????????"];
urls.Add(link.Value);
}
for (int i = 0; i <= urls.Count - 1; i++)
{
if (urls.ElementAt(i) != null)
{
if (IsValid(urls.ElementAt(i)) != true)
{
grid.Rows.Add(urls.ElementAt(i));
}
}
}
The URLs seem to live in the cite element under those selected divs, so the XPath to select them is //div[@class='f kv']/cite.
Now, since these contain markup but you only want the text, select the InnerText of the selected nodes. Note that these do not begin with http://.
HtmlNodeCollection linkNodes =
doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
// InnerText is a string, not an HtmlAttribute
string link = linkNode.InnerText;
urls.Add(link);
}
The correct XPath is "//div[@class='kv']/cite". The f class you see in the browser element inspector is (probably) added by JavaScript after the page is rendered.
Also, the link text is not in an attribute; you can get it using the InnerText property of the <cite> element(s) selected by that XPath.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
urls.Add(linkNode.InnerText);
}
There's a caveat though: some links are trimmed (you'll see a ... in the middle)

HtmlAgilityPack Get all links inside a DIV

I want to be able to get 2 links from inside a div.
Currently I can select one, but when there are more it doesn't seem to work.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='myclass']");
if (node != null)
{
foreach (HtmlNode type in node.SelectNodes("//a@href"))
{
recipe.type += type.InnerText;
}
}
else
recipe.type = "Error fetching type.";
Trying to get it from this piece of HTML:
<div class="myclass">
<h3>Not Relevant Header</h3>
This text,
and this text
</div>
Any help is appreciated, Thanks in advance.
var div = doc.DocumentNode.SelectSingleNode("//div[@class='myclass']");
if(div!=null)
{
var links = div.Descendants("a")
.Select(a => a.InnerText)
.ToList();
}
Use this XPath:
//div[@class = 'myclass']//a
It grabs all descendant a elements in the div with class = 'myclass'.
And //a@href is incorrect XPath.
Use:
//div[contains(concat(' ', @class, ' '), ' myclass ')]//a
This selects any a element that is a descendant of any div whose class attribute contains a classname of "myclass".
The classname may be single, or the attribute may also contain other classnames. In this case the classname may be the starting one, or the last one or may be surrounded by other classnames -- the above XPath expression correctly selects the wanted nodes in all of these different cases.
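The same whole-token class test can also be written without XPath. This is only a sketch, assuming the HtmlAgilityPack NuGet package is referenced; ClassLinkFinder, LinkTextsInClass and the markup in Main are hypothetical (the links in the question's sample did not survive, so the ones here are invented). It splits the class attribute on whitespace and checks for an exact token, which is what the concat/contains expression does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ClassLinkFinder
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Returns the text of every <a> inside any <div> whose class attribute
    // contains className as a whole token (not merely as a substring).
    public static List<string> LinkTextsInClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants("div")
                  .Where(div => div.GetAttributeValue("class", "")
                                   .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
                                   .Contains(className))
                  .SelectMany(div => div.Descendants("a"))
                  .Distinct() // nested matching divs would otherwise yield the same <a> twice
                  .Select(a => a.InnerText.Trim())
                  .ToList();
    }

    public static void Main()
    {
        // Invented markup: the first div should match, the second should not.
        string html = "<div class='box myclass'><h3>Header</h3>"
                    + "<a href='/one'>one</a>, <a href='/two'>two</a></div>"
                    + "<div class='myclassy'><a href='/three'>three</a></div>";

        foreach (string text in LinkTextsInClass(html, "myclass"))
            Console.WriteLine(text);
    }
}
```

Note that a plain contains(@class, 'myclass') would also match the second div, which is why the token check matters.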

"//div[@id='resultStats']" Error

I want to get the Google results as a div, but I get the error "Object reference not set to an instance of an object."
My code:
var doc = new HtmlWeb().Load("http://www.google.com/search?q=love");
var div = doc.DocumentNode.SelectSingleNode("//div[@id='resultStats']");
var text = div.InnerHtml.ToString(); // <--- this line
textBox1.Text = div.ToString();
var matches = Regex.Matches(text, @"About ([0-9,]+) ");
var total = matches[0].Groups[1].Value;
I tried this code:
int counter = 0;
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=love");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
counter = counter + 1;
}
MessageBox.Show(counter.ToString());
I see 97 in the message box.
But when I try this code:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=love");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
MessageBox.Show(link.ToString());
}
I see "HtmlAgilityPack.HtmlNode" in the message box 97 times.
The first error occurs because the HTML you loaded doesn't contain a <div> element with id='resultStats'. That's why your div variable is null, and hence div.InnerHtml gives you a NullReferenceException.
As to the second issue: link.ToString() calls the ToString() method of HtmlNode, which apparently is not overridden and so returns just the type name. I suspect you want to output the link node itself. To do this, just use the .OuterHtml property on your link:
MessageBox.Show(link.OuterHtml);
Just a side note: the HtmlNode.InnerHtml property is a type of string, so calling ToString() method on a type of string is not necessary here.
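To make the first point concrete, here is a rough sketch of the same lookup with a null guard. It assumes the HtmlAgilityPack NuGet package is referenced and that XPath support (SelectSingleNode) is available in your build; ResultStats, TryGetTotal and the markup in Main are hypothetical, and the sketch works on an HTML string so it doesn't depend on what Google currently serves.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class ResultStats
{
    // Returns the number captured after "About ", or null when the div
    // (or the expected text inside it) is not present in the HTML.
    public static string TryGetTotal(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='resultStats']");
        if (div == null)
            return null;

        // Regex.Match never returns null; check Success instead of indexing.
        Match match = Regex.Match(div.InnerText, @"About ([0-9,]+) ");
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Invented markup for illustration only.
        string total = TryGetTotal("<div id='resultStats'>About 1,234 results</div>");
        Console.WriteLine(total ?? "resultStats not found");
    }
}
```

Checking match.Success also avoids the exception that matches[0] would throw if the pattern found nothing.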

Get all attribute values of given tag with Html Agility Pack

I want to get all values of the 'id' attribute of 'span' tags with Html Agility Pack.
But instead of the attributes I get the tags themselves. Here's the code:
private static IEnumerable<string> GetAllID()
{
HtmlDocument sourceDocument = new HtmlDocument();
sourceDocument.Load(FileName);
var nodes = sourceDocument.DocumentNode.SelectNodes(
@"//span/@id");
return nodes.Nodes().Select(x => x.Name);
}
I'll appreciate if someone tells me what's wrong here.
Try this:
var nodes = sourceDocument.DocumentNode.SelectNodes("//span[@id]");
List<string> ids = new List<string>();
// SelectNodes returns null when nothing matches, so check before using nodes
if(nodes != null)
{
foreach(var node in nodes)
{
if(node.Id != null)
ids.Add(node.Id);
}
}
return ids;
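If you'd rather skip XPath here, the same result can be sketched with LINQ. This assumes the HtmlAgilityPack NuGet package is referenced; SpanIds, GetAllIds and the sample markup are hypothetical, and the sketch takes an HTML string rather than the FileName from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SpanIds
{
    // Returns the id attribute value of every <span> that has a non-empty id.
    public static List<string> GetAllIds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants("span")
                  // "" is the fallback for spans without an id attribute
                  .Select(span => span.GetAttributeValue("id", ""))
                  .Where(id => id.Length > 0)
                  .ToList();
    }

    public static void Main()
    {
        // Invented markup: two spans with ids, one without.
        string html = "<span id='a'>x</span><span>y</span><span id='b'>z</span>";

        foreach (string id in GetAllIds(html))
            Console.WriteLine(id);
    }
}
```

Descendants returns an empty sequence rather than null when there are no spans, so no null check is needed in this version.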
