This may just be my inexperience with XPath, but let me ask to make sure, because I've already googled plenty.
I have a website I want to get the news headings from: www.farsnews.com (it is in Persian).
Using the Firebug and FireXPath extensions under Firefox, I extracted and tested by hand several XPath expressions that match the headings, such as:
* html/body/div[2]/div[2]/div[2]/div[*]/div[2]/a/div[2]
* .//*[@class="topnewsinfotitle "]
* .//div[@class="topnewsinfotitle "]
I also tested these using the XPather extension and they seem to work pretty well, but when I test them in code... SelectNodes returns null!
Any clue or hint?
Here is a chunk of the code:
listBox2.ResetText();
HtmlAgilityPack.HtmlWeb w = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = w.Load("http://www.farsnews.com");
HtmlAgilityPack.HtmlNodeCollection nc = doc.DocumentNode.SelectNodes(".//div[@class=\"topnewsinfotitle \"]");
listBox2.Items.Add(nc.Count + " Items selected!");
foreach (HtmlAgilityPack.HtmlNode node in nc) {
    listBox2.Items.Add(node.InnerText);
}
Thanks.
I have tested your expressions. As mentioned by Dialecticus in a comment, you have a trailing space which shouldn't be there.
//div[@class='topnewsinfotitle ']/text()
Returns 'empty sequence', see evaluation: http://xmltools.dk/EQA-ACA6
//div[@class='topnewsinfotitle']/text()
Returns a list of your headlines, see: http://xmltools.dk/EgA2APAj
However, if there could be other classes as well, use this ( http://xmltools.dk/EwA8AJAW ):
//div[contains(@class, 'topnewsinfotitle')]/text()
(I see there is an encoding issue in the links I've provided; however, it shouldn't matter for the meaning. For all the XPath expressions, you can remove /text() to get the nodes instead of only the text.)
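For completeness, here is roughly how the corrected expression could be plugged back into the HtmlAgilityPack code from the question. This is only a sketch (not tested against the live site); the null check matters because SelectNodes returns null, rather than an empty collection, when nothing matches.
var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://www.farsnews.com");
// contains() makes the trailing space in the class attribute irrelevant
var nodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'topnewsinfotitle')]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        Console.WriteLine(node.InnerText);  // or add to listBox2 as in the question
    }
}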
BUT, if you own this site, you should provide the headlines as XML (maybe RSS or Atom) or JSON, which will have better performance and, most importantly, be more bullet-proof.
Trying to scrape a .pdf from a site but the XPath is being stubborn.
Site I'm trying to get the .pdf from
XPath given by Inspect > Copy > Copy XPath:
//*[@id="content"]/div/table[2]/tbody/tr[0]/td[3]/a
For some reason /tbody does nothing but cause an issue. Removing it has worked for all the other XPath expressions I'm using, and seems to be the way to go here as well.
//*[@id="content"]/div/table[2]/tr[0]/td[3]/a
This yields the result:
<img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small>
Which seems to be a child node?
In any case, backing the XPath up a bit to:
//*[@id="content"]/div/table[2]/tr[0]/td[3]
gets me
<a target="_blank" href="/apps/cba/docs/1088-CBA6-2017_Redacted.pdf"><img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small></a>
This is nice since all I need is the value in the href attribute and I can reconstruct the URL and so on. I'm not a wizard with XPath but it seems to me that this final adjustment should get me what I want:
//*[#id="content"]/div/table[2]/tr[0]/td[3]/#href
However it returns the tag again.
I'm stumped on this. Any suggestions?
Edit:
The marked solution made it apparent to me that I was making an assumption. I assumed that I could dereference the href attribute in the same manner that I was dereferencing other nodes. This is not the case, and I had to adjust my dereferencing to something like this:
var node_collection = hdoc.DocumentNode.SelectNodes(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href");
string output = node_collection[0].Attributes["href"].Value;
The problem was not with the XPath at all. The problem was my lack of understanding of the HtmlDocument object that I was dealing with. Pasting where I was trying to get at the href attribute would have made this obvious to anyone experienced. Being too self-conscious to copy-paste my whole block of messy code made it impossible for anyone to help me. Learn from my mistakes, kids: posting complete sections of code makes it easier to accurately identify the problem.
You are right: tbody is added by Chrome's Copy XPath and should be removed, since it is not present in the raw HTML code.*
Selecting the href attribute should work as suggested: //*[@id="content"]/div/table[2]/tr[1]/td[3]/a/@href
I could load the first href like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument hdoc = web.Load("https://work.alberta.ca/apps/cba/searchresults.asp?query=&employer=&union=&locality=&local=&effective_fy=&effective_fm=&effective_ty=&effective_tm=&expiry_fy=&expiry_fm=&expiry_ty=&expiry_tm=");
var nav = (HtmlNodeNavigator)hdoc.CreateNavigator();
var val = nav.SelectSingleNode(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href").Value;
Or all of them like this:
XPathNavigator nav2 = hdoc.CreateNavigator();
XPathNodeIterator xiter = nav2.Select(@"//*[@id=""content""]/div/table[2]/tr/td[3]/a/@href");
while (xiter.MoveNext())
{
    Console.WriteLine(xiter.Current.Value);
}
* However, some engines do require tbody to be present in the XPath, as demonstrated here; only then do we get a result. See this answer for why tbody is added by Chrome, Firebug, and the like in the first place.
This is the XPath expression I tried to use with the HtmlAgilityPack C# parser.
//div[@id = 'sc1']/table/tbody/tr/td/span[@class='blacktxt']
I tried to evaluate the XPath expression with a Firefox XPath add-on and successfully got the required items, but the C# code throws a null reference exception.
HtmlAgilityPack.HtmlNodeCollection node = htmldoc.DocumentNode.SelectNodes("//div[@id ='sc1']/table/tbody/tr/td/span[@class='blacktxt']");
MessageBox.Show(node.ToString());
The node collection is always null...
Please help me find a way to get around this problem...
Thank you..
DOM Requires <tbody/> Tags to be Inserted
All common browser extensions for building XPath expressions work on the DOM. Contrary to the HTML specs, the DOM specs require <tr/> elements to be inside <tbody/> elements, so browsers add such elements if they are missing. You can easily see the difference by looking at the HTML source in Firebug (or similar developer tools working on the DOM) versus displaying the raw page source (using wget or similar tools that do not interpret anything).
The Solution
Remove the /tbody axis step, and your XPath expression will probably work.
//div[@id = 'sc1']/table/tr/td/span[@class='blacktxt']
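In HtmlAgilityPack terms, that would look roughly like the sketch below. It reuses the htmldoc variable from the question; the null check is there because SelectNodes returns null, not an empty collection, when nothing matches.
var nodes = htmldoc.DocumentNode.SelectNodes(
    "//div[@id='sc1']/table/tr/td/span[@class='blacktxt']");
if (nodes != null)
{
    foreach (var span in nodes)
    {
        Console.WriteLine(span.InnerText);  // or show in a MessageBox as in the question
    }
}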
If you Need to Support Both HTML With and Without <tbody/> Tags
For a more general solution, you could replace the /tbody axis step with a descendant-or-self step //, but this could jump into "inner tables":
//div[@id = 'sc1']/table//tr/td/span[@class='blacktxt']
Better would be to use alternative XPath expressions:
//div[@id = 'sc1']/table/tr/td/span[@class='blacktxt'] | //div[@id = 'sc1']/table/tbody/tr/td/span[@class='blacktxt']
A cleaner, XPath 2.0-only solution would be:
//div[@id = 'sc1']/table/(tbody, self::*)/tr/td/span[@class='blacktxt']
I am trying to parse out some information from Google's geocoding API, but I am having a little trouble efficiently getting the data out of the XML. See the link for an example.
All I really care about is getting the short_name from the address_component whose type is administrative_area_level_1, and the long_name from administrative_area_level_2.
However, with my test program, both XPath queries return no results.
public static void Main(string[] args)
{
    using (WebClient webclient = new WebClient())
    {
        webclient.Proxy = null;
        string locationXml = webclient.DownloadString("http://maps.google.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false");
        using (var reader = new StringReader(locationXml))
        {
            var doc = new XPathDocument(reader);
            var nav = doc.CreateNavigator();
            Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/address_component[type=administrative_area_level_1]/short_name").InnerXml);
            Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/address_component[type=administrative_area_level_2]/long_name").InnerXml);
        }
    }
}
Can anyone help me find what I am doing wrong, or recommend a better way?
You need to put the value of the node you're looking for in quotes:
".../address_component[type='administrative_area_level_1']/short_name"
↑ ↑
I'd definitely recommend using LINQ to XML instead of XPathNavigator. It makes XML querying a breeze, in my experience. In this case I'm not sure exactly what's wrong... but I'll come up with a LINQ to XML snippet instead.
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

class Test
{
    public static void Main(string[] args)
    {
        using (WebClient webclient = new WebClient())
        {
            webclient.Proxy = null;
            string locationXml = webclient.DownloadString
                ("http://maps.google.com/maps/api/geocode/xml?address=1600"
                 + "+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false");
            XElement root = XElement.Parse(locationXml);
            XElement result = root.Element("result");
            Console.WriteLine(result.Elements("address_component")
                .Where(x => (string) x.Element("type") ==
                       "administrative_area_level_1")
                .Select(x => x.Element("short_name").Value)
                .First());
            Console.WriteLine(result.Elements("address_component")
                .Where(x => (string) x.Element("type") ==
                       "administrative_area_level_2")
                .Select(x => x.Element("long_name").Value)
                .First());
        }
    }
}
Now this is more code¹... but I personally find it easier to get right than XPath, because the compiler is helping me more.
EDIT: I feel it's worth going into a little more detail about why I generally prefer code like this over using XPath, even though it's clearly longer.
When you use XPath within a C# program, you have two different languages - but only one is in control (C#). XPath is relegated to the realm of strings: Visual Studio doesn't give an XPath expression any special handling; it doesn't understand that it's meant to be an XPath expression, so it can't help you. It's not that Visual Studio doesn't know about XPath; as Dimitre points out, it's perfectly capable of spotting errors if you're editing an XSLT file, just not a C# file.
This is the case whenever you have one language embedded within another and the tool is unaware of it. Common examples are:
SQL
Regular expressions
HTML
XPath
When code is presented as data within another language, the secondary language loses a lot of its tooling benefits.
You can context-switch all over the place, pulling the XPath (or SQL, or regular expressions, etc.) out into its own tooling (possibly within the same actual program, but in a separate file or window), but I find this makes for harder-to-read code in the long run. If code were only ever written and never read afterwards, that might be okay - but you do need to be able to read code afterwards, and I personally believe the readability suffers when this happens.
The LINQ to XML version above only ever uses strings for pure data - the names of elements etc - and uses code (method calls) to represent actions such as "find elements with a given name" or "apply this filter". That's more idiomatic C# code, in my view.
Obviously others don't share this viewpoint, but I thought it worth expanding on to show where I'm coming from.
Note that this isn't a hard and fast rule of course... in some cases XPath, regular expressions etc are the best solution. In this case, I'd prefer the LINQ to XML, that's all.
¹ Of course I could have kept each Console.WriteLine call on a single line, but I don't like posting code with horizontal scrollbars on SO. Note that writing the correct XPath version with the same indentation as the above and avoiding scrolling is still pretty nasty:
Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/" +
                  "address_component[type='administrative_area_level_1']" +
                  "/short_name").InnerXml);
In general, long lines work a lot better in Visual Studio than they do on Stack Overflow...
I would recommend just typing the XPath expression as part of an XSLT file in Visual Studio. You'll get error messages "as you type" -- this is an excellent XML/XSLT/XPath editor.
For example, I am typing:
<xsl:apply-templates select="@* | node() x"/>
and immediately get in the Error List window the following error:
Error 9 Expected end of the expression, found 'x'. @* | node() -->x<--
XSLTFile1.xslt 9 14 Miscellaneous Files
Only when the XPath expression does not raise any errors (I might also test that it selects the intended nodes, too), would I put this expression into my C# code.
This ensures that I will have no XPath -- syntax and semantic -- errors when I run the C# program.
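If you also want a programmatic safety net, this is not part of the editor-based workflow above, but XPathExpression.Compile from System.Xml.XPath rejects syntactically invalid expressions before any query runs. It only catches syntax errors, of course, not the semantic question of whether the expression selects the intended nodes. A minimal sketch:
using System;
using System.Xml.XPath;

class XPathCheck
{
    static void Main()
    {
        try
        {
            // The faulty expression from the example above fails to compile.
            XPathExpression.Compile("@* | node() x");
        }
        catch (XPathException ex)
        {
            Console.WriteLine("Invalid XPath: " + ex.Message);
        }
    }
}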
dtb's response is accurate. I wanted to add that you can use XPath testing tools like the link below to help find the correct XPath:
http://www.bit-101.com/xpath/
string url = @"http://maps.google.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false";
string value = "administrative_area_level_1";
using (WebClient client = new WebClient())
{
    string wcResult = client.DownloadString(url);
    XDocument xDoc = XDocument.Parse(wcResult);
    var result = xDoc.Descendants("address_component")
                     .Where(p => p.Descendants("type")
                                  .Any(q => q.Value.Contains(value))
                           );
}
The result is an enumeration of "address_component" elements that have at least one "type" node containing the value you're searching for. The matching XElement from the query above contains the following data:
<address_component>
<long_name>California</long_name>
<short_name>CA</short_name>
<type>administrative_area_level_1</type>
<type>political</type>
</address_component>
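To then read the actual values out of that enumeration, here is a short continuation of the query above (a sketch; FirstOrDefault guards against an empty result, and the expected values come from the sample data shown above):
// result is an IEnumerable<XElement>; take the first matching address_component.
var component = result.FirstOrDefault();
if (component != null)
{
    Console.WriteLine(component.Element("short_name").Value);  // "CA"
    Console.WriteLine(component.Element("long_name").Value);   // "California"
}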
I would really recommend spending a little time learning LINQ in general, because it's very useful for manipulating and querying in-memory objects, querying databases, and tends to be easier than using XPath when working with XML. My favorite site to reference is http://www.hookedonlinq.com/
I have been trying to extract links from a class called "tim_new", and I have been given a solution as well.
The solution, the snippet, and the necessary information are given here.
The XPath query in question was "//a[@class='tim_new']". My question is, how did this query differentiate between the first line of the snippet (given in the link above) and the second line of the snippet?
More specifically, what is the literal translation (in English) of this XPath query?
Furthermore, I want to write a few lines of code to extract the text written against NSE:
<div class="FL gL_12 PL10 PT15">BSE: 523395 | NSE: 3MINDIA | ISIN: INE470A01017</div>
Would appreciate help in forming the necessary selection query.
My code is written as:
IEnumerable<string> NSECODE = doc.DocumentNode.SelectSingleNode("//div[@NSE:]");
But this doesn't look right. I would appreciate some help.
The XPath in the first selection reads "select all a elements, anywhere in the document, that have an attribute named class with a value of tim_new". The stuff in brackets is not what you're returning; it's the criteria you're applying to the search.
I don't have the HTML Agility Pack, but if you are trying to query the divs that have "NSE:" in their text, your XPath for the second query should just be "//div", and then you'll want to filter using LINQ.
Something like
var nodes =
doc.DocumentNode.SelectNodes("//div[text()]").Where(a => a.InnerText.IndexOf("NSE:") > -1);
So in English: "return all the div elements that immediately contain text" to LINQ, which then checks that the inner text value contains "NSE:".
Again, I'm not sure the syntax is perfect, but that's the idea.
The XPath "//div[#NSE:]" would return all divs that have and attribute named, NSE:, which would be illegal anyway because ":" isn't allowed in an attribute name. Youre looking for the text of the element, not one of its attributes.
Hope that helps.'
Note: If you have nested divs that both contain text as in <div>NSE: some text<div>NSE: more text</div></div> you're going to get duplicate results.
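For the second part of the question (pulling the value after "NSE:" out of that div), here is a rough sketch along the same lines. The class fragment comes from the snippet in the question, the string handling is just one possibility, and it assumes using System.Linq is in scope:
// Find the div whose text contains "NSE:" (class fragment taken from the question's snippet).
var div = doc.DocumentNode
    .SelectNodes("//div[contains(@class, 'gL_12')]")
    ?.FirstOrDefault(d => d.InnerText.Contains("NSE:"));
if (div != null)
{
    // InnerText looks like "BSE: 523395 | NSE: 3MINDIA | ISIN: INE470A01017"
    string nsePart = div.InnerText
        .Split('|')
        .Select(p => p.Trim())
        .First(p => p.StartsWith("NSE:"));
    string nseCode = nsePart.Substring("NSE:".Length).Trim();  // "3MINDIA"
    Console.WriteLine(nseCode);
}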
What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff
I like using the Html Agility Pack; it's very easy to use. Although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
HtmlNodeCollection Nodes = doc.DocumentNode.SelectNodes("//a[#href]");
foreach (var link in Nodes)
{
    Console.WriteLine(link.Attributes["href"].Value);
}
Regular Expression would be my way. ;)
If the HTML you're using is XHTML compliant, you could load it as an XML document and then use XPath/XSL - long-winded but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
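A hedged middle ground is to let a parser such as the Html Agility Pack walk only the text nodes and wrap matches there, so tag names and attribute values are never touched. This sketch handles the simple case where the word appears whole inside one text node; the split-up <span class="firstLetter">B</span>ook case above still needs a real renderer. The "highlight" class and the html variable are assumptions:
// Sketch: wrap occurrences of "Book" found in text nodes only.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);  // html is the C# string holding the markup

var textNodes = doc.DocumentNode.SelectNodes("//text()[contains(., 'Book')]");
if (textNodes != null)
{
    foreach (var textNode in textNodes)
    {
        string wrapped = textNode.InnerText.Replace(
            "Book", "<span class=\"highlight\">Book</span>");
        // CreateNode wants a single root, so the replacement is wrapped in one extra span.
        var replacement = HtmlAgilityPack.HtmlNode.CreateNode("<span>" + wrapped + "</span>");
        textNode.ParentNode.ReplaceChild(replacement, textNode);
    }
}
string highlighted = doc.DocumentNode.OuterHtml;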
You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.
For searching for strings, you'll want to look up regular expressions. As for marking the match, once you have the position of the substring, it should be simple enough to use that to add in something to wrap around the phrase.