Trouble getting data out of an XML file - C#

I am trying to parse some information out of Google's geocoding API, but I am having a little trouble getting the data out of the XML efficiently. See link for example
All I really care about is getting the short_name from the address_component whose type is administrative_area_level_1, and the long_name from the one whose type is administrative_area_level_2.
However, in my test program neither XPath query returns any results.
public static void Main(string[] args)
{
    using(WebClient webclient = new WebClient())
    {
        webclient.Proxy = null;
        string locationXml = webclient.DownloadString("http://maps.google.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false");
        using(var reader = new StringReader(locationXml))
        {
            var doc = new XPathDocument(reader);
            var nav = doc.CreateNavigator();
            Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/address_component[type=administrative_area_level_1]/short_name").InnerXml);
            Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/address_component[type=administrative_area_level_2]/long_name").InnerXml);
        }
    }
}
Can anyone help me find what I am doing wrong, or recommend a better way?

You need to put the value of the node you're looking for in quotes:
".../address_component[type='administrative_area_level_1']/short_name"
                            ↑                           ↑
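To see the difference without calling the web service, here is a minimal sketch that runs both forms of the predicate against an inline document. The sample XML is a trimmed, hypothetical stand-in for the real geocoding response, and the class and method names (GeocodeXPathDemo, Select) are made up for illustration. Without quotes, XPath treats administrative_area_level_1 as the name of a child element, which does not exist, so the comparison never succeeds.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class GeocodeXPathDemo
{
    // Hypothetical, trimmed-down sample shaped like the geocoding response.
    const string SampleXml = @"<GeocodeResponse>
  <status>OK</status>
  <result>
    <address_component>
      <long_name>Santa Clara</long_name>
      <short_name>Santa Clara</short_name>
      <type>administrative_area_level_2</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>California</long_name>
      <short_name>CA</short_name>
      <type>administrative_area_level_1</type>
      <type>political</type>
    </address_component>
  </result>
</GeocodeResponse>";

    // Returns the text of the first matching node, or null when nothing matches.
    public static string Select(string xpath)
    {
        using (var reader = new StringReader(SampleXml))
        {
            XPathNavigator nav = new XPathDocument(reader).CreateNavigator();
            XPathNavigator node = nav.SelectSingleNode(xpath);
            return node == null ? null : node.Value;
        }
    }

    public static void Main()
    {
        const string prefix = "/GeocodeResponse/result/address_component";

        // Unquoted: compares <type> against a child element that does not exist.
        Console.WriteLine(Select(prefix + "[type=administrative_area_level_1]/short_name") ?? "(no match)");

        // Quoted: compares <type> against a string literal.
        Console.WriteLine(Select(prefix + "[type='administrative_area_level_1']/short_name"));
        Console.WriteLine(Select(prefix + "[type='administrative_area_level_2']/long_name"));
    }
}
```

Note that SelectSingleNode returns null when nothing matches, so it is worth checking the result before touching InnerXml or Value.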

I'd definitely recommend using LINQ to XML instead of XPathNavigator. It makes XML querying a breeze, in my experience. In this case I'm not sure exactly what's wrong... but I'll come up with a LINQ to XML snippet instead.
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

class Test
{
    public static void Main(string[] args)
    {
        using(WebClient webclient = new WebClient())
        {
            webclient.Proxy = null;
            string locationXml = webclient.DownloadString
                ("http://maps.google.com/maps/api/geocode/xml?address=1600"
                 + "+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false");
            XElement root = XElement.Parse(locationXml);
            XElement result = root.Element("result");
            Console.WriteLine(result.Elements("address_component")
                                    .Where(x => (string) x.Element("type") ==
                                                "administrative_area_level_1")
                                    .Select(x => x.Element("short_name").Value)
                                    .First());
            Console.WriteLine(result.Elements("address_component")
                                    .Where(x => (string) x.Element("type") ==
                                                "administrative_area_level_2")
                                    .Select(x => x.Element("long_name").Value)
                                    .First());
        }
    }
}
Now this is more code¹... but I personally find it easier to get right than XPath, because the compiler is helping me more.
EDIT: I feel it's worth going into a little more detail about why I generally prefer code like this over using XPath, even though it's clearly longer.
When you use XPath within a C# program, you have two different languages - but only one is in control (C#). XPath is relegated to the realm of strings: Visual Studio doesn't give an XPath expression any special handling; it doesn't understand that it's meant to be an XPath expression, so it can't help you. It's not that Visual Studio doesn't know about XPath; as Dimitre points out, it's perfectly capable of spotting errors if you're editing an XSLT file, just not a C# file.
This is the case whenever you have one language embedded within another and the tool is unaware of it. Common examples are:
SQL
Regular expressions
HTML
XPath
When code is presented as data within another language, the secondary language loses a lot of its tooling benefits.
While you can context switch all over the place, pulling out the XPath (or SQL, or regular expressions etc) into their own tooling (possibly within the same actual program, but in a separate file or window) I find this makes for harder-to-read code in the long run. If code were only ever written and never read afterwards, that might be okay - but you do need to be able to read code afterwards, and I personally believe the readability suffers when this happens.
The LINQ to XML version above only ever uses strings for pure data - the names of elements etc - and uses code (method calls) to represent actions such as "find elements with a given name" or "apply this filter". That's more idiomatic C# code, in my view.
Obviously others don't share this viewpoint, but I thought it worth expanding on to show where I'm coming from.
Note that this isn't a hard and fast rule of course... in some cases XPath, regular expressions etc are the best solution. In this case, I'd prefer the LINQ to XML, that's all.
¹ Of course I could have kept each Console.WriteLine call on a single line, but I don't like posting code with horizontal scrollbars on SO. Note that writing the correct XPath version with the same indentation as the above and avoiding scrolling is still pretty nasty:
Console.WriteLine(nav.SelectSingleNode("/GeocodeResponse/result/" +
"address_component[type='administrative_area_level_1']" +
"/short_name").InnerXml);
In general, long lines work a lot better in Visual Studio than they do on Stack Overflow...

I would recommend just typing the XPath expression as part of an XSLT file in Visual Studio. You'll get error messages "as you type" -- this is an excellent XML/XSLT/XPath editor.
For example, I am typing:
<xsl:apply-templates select="@* | node() x"/>
and immediately get in the Error List window the following error:
Error 9 Expected end of the expression, found 'x'. @* | node() -->x<--
XSLTFile1.xslt 9 14 Miscellaneous Files
Only when the XPath expression does not raise any errors (I might also test that it selects the intended nodes, too), would I put this expression into my C# code.
This ensures that I will have no XPath -- syntax and semantic -- errors when I run the C# program.
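A related check can also be done from C# itself. The sketch below uses XPathExpression.Compile, which throws an XPathException for a syntactically invalid expression; the XPathSyntaxCheck class and IsValid helper are hypothetical names. Keep in mind that this only catches syntax errors: the unquoted predicate from the question is valid XPath, so it compiles fine and simply selects nothing.

```csharp
using System;
using System.Xml.XPath;

static class XPathSyntaxCheck
{
    // Returns true when the expression parses; otherwise reports the parser's message.
    public static bool IsValid(string xpath, out string error)
    {
        try
        {
            XPathExpression.Compile(xpath);
            error = null;
            return true;
        }
        catch (XPathException ex)
        {
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        string[] candidates =
        {
            "@* | node()",
            "@* | node() x",
            "/GeocodeResponse/result/address_component[type=administrative_area_level_1]/short_name"
        };

        foreach (string xpath in candidates)
        {
            string error;
            bool ok = IsValid(xpath, out error);
            Console.WriteLine((ok ? "valid:   " : "invalid: ") + xpath);
            if (!ok)
            {
                Console.WriteLine("         " + error);
            }
        }
    }
}
```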

dtb's response is accurate. I wanted to add that you can use xpath testing tools like the link below to help find the correct xpath:
http://www.bit-101.com/xpath/

string url = @"http://maps.google.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false";
string value = "administrative_area_level_1";
using(WebClient client = new WebClient())
{
    string wcResult = client.DownloadString(url);
    XDocument xDoc = XDocument.Parse(wcResult);
    var result = xDoc.Descendants("address_component")
                     .Where(p=>p.Descendants("type")
                               .Any(q=>q.Value.Contains(value))
                     );
}
The result is an enumeration of "address_component" elements that have at least one "type" node containing the value you're searching for. For the query above, it yields a single XElement that contains the following data.
<address_component>
<long_name>California</long_name>
<short_name>CA</short_name>
<type>administrative_area_level_1</type>
<type>political</type>
</address_component>
I would really recommend spending a little time learning LINQ in general, because it's very useful for manipulating and querying in-memory objects and querying databases, and it tends to be easier than XPath when working with XML. My favorite site to reference is http://www.hookedonlinq.com/

Related

Parsing XML file in C# - how to report errors

First I load the file in a structure
XElement xTree = XElement.Load(xml_file);
Then I create an enumerable collection of the elements.
IEnumerable<XElement> elements = xTree.Elements();
And iterate elements
foreach (XElement el in elements)
{
}
The problem is: when I fail to parse an element (a user made a typo or inserted a wrong value), how can I report the exact line in the file?
Is there any way to tie an element to its corresponding line in the file?
One way to do it (although not a proper one) –
When you find a wrong value, add an invalid char (e.g. ‘<’) to it.
So instead of: <ExeMode>Bla bla bla</ExeMode>
You’ll have: <ExeMode><Bla bla bla</ExeMode>
Then load the XML again with try / catch (System.Xml.XmlException ex).
This XmlException has LineNumber and LinePosition.
If there is a limited set of acceptable values, I believe XML Schemas have the concept of an enumerated type -- so write a schema for the file and have the parser validate against that. Assuming the parser you're using supports Schemas, which most should by now.
I haven't looked at DTDs in decades, but they may have the same kind of capability.
Otherwise, you would have to consider this semantic checking rather than syntactic checking, and that makes it your application's responsibility. If you are using a SAX parser and interpreting the data as you go, you may be able to get the line number; check your parser's features.
Otherwise the best answer I've found is to report the problem using an xpath to the affected node/token rather than a line number. You may be able to find a canned solution for that, either as a routine to run against a DOM tree or as a state machine you can run alongside your SAX code to track the path as you go.
(All the "maybe"s are because I haven't looked at what's currently available in a very long time, and because I'm trying to give an answer that is valid for all languages and parser implementations. This should still get you pointed in some useful directions.)
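For LINQ to XML specifically, line information can be kept at load time, so neither the corrupt-and-reload trick nor a separate SAX pass is needed. The sketch below assumes a hypothetical file layout with a numeric Retries element; the LineInfoDemo class and its methods are illustrative names only. Passing LoadOptions.SetLineInfo to XElement.Load (or Parse, as here) makes each element expose its position through the IXmlLineInfo interface.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static class LineInfoDemo
{
    // Line on which the element's start tag appears, or -1 if no line info was recorded.
    public static int LineOf(XElement el)
    {
        IXmlLineInfo info = el; // XElement implements IXmlLineInfo explicitly
        return info.HasLineInfo() ? info.LineNumber : -1;
    }

    // Hypothetical semantic check: <Retries> must hold an integer.
    public static List<string> Validate(string xml)
    {
        var errors = new List<string>();
        XElement root = XElement.Parse(xml, LoadOptions.SetLineInfo);
        foreach (XElement el in root.Elements())
        {
            int ignored;
            if (el.Name == "Retries" && !int.TryParse(el.Value, out ignored))
            {
                errors.Add("Line " + LineOf(el) + ": <" + el.Name + "> has invalid value '" + el.Value + "'");
            }
        }
        return errors;
    }

    public static void Main()
    {
        string xml = "<Config>\n  <ExeMode>Fast</ExeMode>\n  <Retries>abc</Retries>\n</Config>";
        foreach (string error in Validate(xml))
        {
            Console.WriteLine(error);
        }
    }
}
```

With a file on disk, XElement.Load(xml_file, LoadOptions.SetLineInfo) should work the same way.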

HtmlAgilityPack C#--- Selectnodes Always returns a Null

This is the XPath expression I tried to use with the HtmlAgilityPack C# parser.
//div[@id = 'sc1']/table/tbody/tr/td/span[@class='blacktxt']
I evaluated the XPath expression with a Firefox XPath add-on and successfully got the required items, but the C# code throws a null reference exception.
HtmlAgilityPack.HtmlNodeCollection node = htmldoc.DocumentNode.SelectNodes("//div[@id ='sc1']/table/tbody/tr/td/span[@class='blacktxt']");
MessageBox.Show(node.ToString());
node is always null...
Please help me find a way around this problem...
Thank you.
DOM Requires <tbody/> Tags to be Inserted
All common browser extensions for building XPath expressions work on the DOM. Contrary to the HTML specs, the DOM specs require <tr/> elements to be inside <tbody/> elements, so browsers add such elements if missing. You can easily see the difference by comparing the HTML shown in Firebug (or similar developer tools working on the DOM) with the raw page source (fetched with wget or a similar tool that does not interpret the markup).
The Solution
Remove the /tbody axis step, and your XPath expression will probably work.
//div[@id = 'sc1']/table/tr/td/span[@class='blacktxt']
If you Need to Support Both HTML With and Without <tbody/> Tags
For a more general solution, you could replace the /tbody axis step with a descendant-or-self step //, but this could jump into "inner tables":
//div[@id = 'sc1']/table//tr/td/span[@class='blacktxt']
Better would be to use alternative XPath expressions:
//div[@id = 'sc1']/table/tr/td/span[@class='blacktxt'] | //div[@id = 'sc1']/table/tbody/tr/td/span[@class='blacktxt']
A cleaner XPath 2.0 only solution would be
//div[@id = 'sc1']/table/(tbody, self::*)/tr/td/span[@class='blacktxt']
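The union expression can be tried out with the XPath 1.0 engine that ships with .NET. The sketch below deliberately uses System.Xml's XmlDocument on small well-formed snippets rather than HtmlAgilityPack, so it only illustrates the XPath behaviour; the markup and the TbodyXPathDemo class are hypothetical.

```csharp
using System;
using System.Xml;

static class TbodyXPathDemo
{
    public const string WithTbodyOnly =
        "//div[@id='sc1']/table/tbody/tr/td/span[@class='blacktxt']";

    // Union of the two alternatives: rows directly under <table>, or under <tbody>.
    public const string Union =
        "//div[@id='sc1']/table/tr/td/span[@class='blacktxt']" +
        " | //div[@id='sc1']/table/tbody/tr/td/span[@class='blacktxt']";

    // Hypothetical markup as served (no <tbody>) and as a browser DOM would show it.
    public const string RawSource =
        "<html><body><div id='sc1'><table><tr><td>" +
        "<span class='blacktxt'>42</span></td></tr></table></div></body></html>";
    public const string BrowserDom =
        "<html><body><div id='sc1'><table><tbody><tr><td>" +
        "<span class='blacktxt'>42</span></td></tr></tbody></table></div></body></html>";

    public static int CountMatches(string markup, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine("tbody-only on raw source: " + CountMatches(RawSource, WithTbodyOnly));
        Console.WriteLine("union on raw source:      " + CountMatches(RawSource, Union));
        Console.WriteLine("union on browser DOM:     " + CountMatches(BrowserDom, Union));
    }
}
```

One difference to keep in mind: XmlDocument.SelectNodes returns an empty list when nothing matches, whereas HtmlAgilityPack's SelectNodes returns null (which is what the question ran into), so with HtmlAgilityPack a null check before using the collection is advisable.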

Is there a C# utility for matching patterns in (syntactic parse) trees?

I'm working on a Natural Language Processing (NLP) project in which I use a syntactic parser to create a syntactic parse tree out of a given sentence.
Example Input: I ran into Joe and Jill and then we went shopping
Example Output: [TOP [S [S [NP [PRP I]] [VP [VBD ran] [PP [IN into] [NP [NNP Joe] [CC and] [NNP Jill]]]]] [CC and] [S [ADVP [RB then]] [NP [PRP we]] [VP [VBD went] [NP [NN shopping]]]]]]
I'm looking for a C# utility that will let me do complex queries like:
Get the first VBD related to 'Joe'
Get the NP closest to 'Shopping'
Here's a Java utility that does this, I'm looking for a C# equivalent.
Any help would be much appreciated.
There are at least two NLP frameworks, i.e.
SharpNLP (NOTE: project inactive since 2006)
Proxem
And here you can find instructions to use a java NLP in .NET:
Using OpenNLP in .NET project
This page is about using java OpenNLP, but could apply to the java library you've mentioned in your post
Or use NLTK following these guidelines:
Open Source NLP in C# 3.5 using NLTK
We already use
One option would be to parse the output in C# and then encode it as XML, turning every node into string.Format("<{0}>", this.Name); and string.Format("</{0}>", this._name); with all the child nodes emitted recursively in between.
After you do this, I would use a tool for querying XML/HTML to query the tree. Thousands of people already use query selectors and jQuery to navigate tree-like structures based on the relations between nodes. I think this is far superior to TRegex or other outdated and unmaintained Java utilities.
For example, this is to answer your first example:
var xml = CQ.Create(d.ToXml());
//this can be simpler with CSS selectors but I chose Linq since you'll probably find it easier
//Find joe, in our case the node that has the text 'Joe'
var joe = xml["*"].First(x => x.InnerHTML.Equals("Joe"));
//Find the last (deepest) element that meets the criteria: it has "Joe" in it, and has a VBD in it
//in our case the VP
var closestToVbd = xml["*"].Last(x => x.Cq().Has(joe).Has("VBD").Any());
Console.WriteLine("Closest node to VBD:\n " +closestToVbd.OuterHTML);
//If we want the VBD itself we can just find the VBD in that element
Console.WriteLine("\n\n VBD itself is " + closestToVbd.Cq().Find("VBD")[0].OuterHTML);
Here is your second example
//Now for NP closest to 'Shopping', find the element with the text 'shopping' and find its closest NP
var closest = xml["*"].First(x => x.InnerHTML.Equals("shopping")).Cq()
.Closest("NP")[0].OuterHTML;
Console.WriteLine("\n\n NP closest to shopping is: " + closest);
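The conversion step itself can be sketched with nothing but LINQ to XML. The code below is an assumption-laden illustration, not the d.ToXml() used above: ParseTreeDemo, ToXml, Leaf and NearestRelated are hypothetical names, and the parser assumes every label is a valid XML element name (labels such as PRP$ or punctuation tags would need escaping first). Once the tree is an XElement, both example queries become ordinary Ancestors/Descendants calls.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ParseTreeDemo
{
    // Converts bracket notation such as "[NP [PRP I]]" into nested XElements.
    public static XElement ToXml(string tree)
    {
        int pos = 0;
        return ReadNode(tree, ref pos);
    }

    static XElement ReadNode(string s, ref int pos)
    {
        SkipSpaces(s, ref pos);
        if (pos >= s.Length || s[pos] != '[')
            throw new FormatException("Expected '[' at position " + pos);
        pos++; // consume '['
        var node = new XElement(ReadToken(s, ref pos)); // label becomes the element name
        while (true)
        {
            SkipSpaces(s, ref pos);
            if (pos >= s.Length) throw new FormatException("Unbalanced brackets");
            if (s[pos] == ']') { pos++; return node; }
            if (s[pos] == '[') node.Add(ReadNode(s, ref pos)); // child constituent
            else node.Add(ReadToken(s, ref pos));              // leaf word
        }
    }

    static string ReadToken(string s, ref int pos)
    {
        int start = pos;
        while (pos < s.Length && !char.IsWhiteSpace(s[pos]) && s[pos] != '[' && s[pos] != ']') pos++;
        return s.Substring(start, pos - start);
    }

    static void SkipSpaces(string s, ref int pos)
    {
        while (pos < s.Length && char.IsWhiteSpace(s[pos])) pos++;
    }

    // First leaf (element without child elements) whose text equals the word.
    public static XElement Leaf(XElement root, string word)
    {
        return root.DescendantsAndSelf().First(e => !e.HasElements && e.Value == word);
    }

    // Climbs from the leaf and returns the first element with the given label
    // found under the nearest ancestor that contains one.
    public static XElement NearestRelated(XElement leaf, string label)
    {
        return leaf.Ancestors()
                   .Select(a => a.Descendants(label).FirstOrDefault())
                   .FirstOrDefault(found => found != null);
    }

    public static void Main()
    {
        XElement tree = ToXml(
            "[TOP [S [S [NP [PRP I]] [VP [VBD ran] [PP [IN into] [NP [NNP Joe] [CC and] [NNP Jill]]]]] " +
            "[CC and] [S [ADVP [RB then]] [NP [PRP we]] [VP [VBD went] [NP [NN shopping]]]]]]");

        Console.WriteLine(NearestRelated(Leaf(tree, "Joe"), "VBD"));
        Console.WriteLine(Leaf(tree, "shopping").Ancestors("NP").First());
    }
}
```

Here "related" and "closest" are interpreted structurally (nearest enclosing constituent), which is one reasonable reading of the question but not the only one.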

Parsing XML with C#

I have an XML file as follows:
I uploaded the XML file : http://dl.dropbox.com/u/10773282/2011/result.xml . It's a machine generated XML, so you might need some XML viewer/editor.
I use this C# code to get the elements in CoverageDSPriv/Module/*.
using System;
using System.Xml;
using System.Xml.Linq;

namespace HIR {
    class Dummy {
        static void Main(String[] argv) {
            XDocument doc = XDocument.Load("result.xml");
            var coveragePriv = doc.Descendants("CoverageDSPriv"); //.First();
            var cons = coveragePriv.Elements("Module");
            foreach (var con in cons)
            {
                var id = con.Value;
                Console.WriteLine(id);
            }
        }
    }
}
Running the code, I get this result.
hello.exe6144008016161810hello.exehello.exehello.exe81061hello.exehello.exe!17main_main40030170170010180180011190190012200200013hello.exe!107testfunctiontestfunction(int)40131505001460600158080216120120017140140018AA
I expect to get
hello.exe
61440
...
However, I get just one line of long string.
Q1 : What might be wrong?
Q2 : How do I get the number of elements in cons? I tried cons.Count, but it doesn't work.
Q3 : If I need to get the nested value of <CoverageDSPriv><Module><ModuleName>, I use this code:
var coveragePriv = doc.Descendants("CoverageDSPriv"); //.First();
var cons = coveragePriv.Elements("Module").Elements("ModuleName");
I can live with this, but if the elements are deeply nested, I might be wanting to have direct way to get the elements. Are there any other ways to do that?
ADDED
var cons = coveragePriv.Elements("Module").Elements();
solves this issue, but for the NamespaceTable, it again prints out all the elements in one line.
hello.exe
61440
0
8
0
1
6
1
61810hello.exehello.exehello.exe81061hello.exehello.exe!17main_main40030170170010180180011190190012200200013hello.exe!107testfunctiontestfunction(int)40131505001460600158080216120120017140140018
Or, Linq to XML can be a better solution, as this post.
It looks to me like you only have one element named Module -- so .Value is simply returning you the InnerText of that entire element. Were you intending this instead?
coveragePriv.Element("Module").Elements();
This would return all the child elements of the Module element, which seems to be what you're after.
Update:
<NamespaceTable> is a child of <Module> but you appear to want to handle it similarly to <Module> in that you want to write out each child element. Thus, one brute-force approach would be to add another loop for <NamespaceTable>:
foreach (var con in cons)
{
    if (con.Name == "NamespaceTable")
    {
        foreach (var nsElement in con.Elements())
        {
            var nsId = nsElement.Value;
            Console.WriteLine(nsId);
        }
    }
    else
    {
        var id = con.Value;
        Console.WriteLine(id);
    }
}
Alternatively, perhaps you'd rather just denormalize them altogether via .Descendants():
var cons = coveragePriv.Element("Module").Descendants();
foreach (var con in cons)
{
    var id = con.Value;
    Console.WriteLine(id);
}
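To make the behaviour concrete, here is a small self-contained sketch. The inline XML is a hypothetical, much-reduced stand-in for result.xml (the ImageSize and BlocksCovered element names are invented), and ModuleValueDemo is an illustrative name. It shows that Value concatenates the text of all descendants, that the element count from Q2 is available through the Count() extension method once System.Linq is imported, and that filtering on HasElements prints only leaf values, so container elements such as NamespaceTable are not echoed as one long string.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ModuleValueDemo
{
    // Hypothetical, reduced sample; the real file has many more elements.
    public const string SampleXml =
        "<CoverageDSPriv><Module>" +
        "<ModuleName>hello.exe</ModuleName>" +
        "<ImageSize>61440</ImageSize>" +
        "<NamespaceTable><BlocksCovered>6</BlocksCovered><ModuleName>hello.exe</ModuleName></NamespaceTable>" +
        "</Module></CoverageDSPriv>";

    public static XElement LoadModule()
    {
        return XDocument.Parse(SampleXml)
                        .Descendants("CoverageDSPriv")
                        .Elements("Module")
                        .First();
    }

    public static void Main()
    {
        XElement module = LoadModule();

        // Value is the concatenated text of every descendant, hence the single long line.
        Console.WriteLine(module.Value);

        // Count is not a property of IEnumerable<T>; Count() comes from System.Linq.
        Console.WriteLine(module.Elements().Count());

        // Only leaf elements, one value per line.
        foreach (XElement leaf in module.Descendants().Where(e => !e.HasElements))
        {
            Console.WriteLine(leaf.Name + " = " + leaf.Value);
        }
    }
}
```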
XElement.Value can have unexpected results. With XML in .NET you are really in charge of manually traversing the tree. If the element contains only text, then Value may return what you want, but if it contains other elements you get the concatenated text of all of them.
I have done a lot of xml parsing and I find there are way better ways to handle XML depending on what you are doing with the data.
1) You can look into XSLT transforms if you plan on outputting this data as text, more xml, or html. This is a great way to convert the data to some other readable format. We use this when we want to display our metadata on our website in html.
2) Look into XML Serialization. C# makes this very easy, and it is amazing to use because you can then work with a regular C# object when consuming the data. MS even has tools to create the serialization class from the XML. I usually start with that, clean it up and add my own tweaks to make it work as I wish. The best way to check it is to serialize the object to XML and see if that matches what you have.
3) Try Linq to XML. It will allow you to query the XML as if it were a database. It is a little slower generally but unless you need absolute performance it works very well for minimizing your work.
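As a rough illustration of option 2, the sketch below deserializes a small fragment with XmlSerializer. The Module class and its two properties are assumptions made up for this example (only ModuleName appears in the question), not the class a generator tool would produce for the real file.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical shape: by default the root element name matches the class name
// and each public property maps to a child element of the same name.
public class Module
{
    public string ModuleName { get; set; }
    public int ImageSize { get; set; }
}

static class SerializationDemo
{
    public static Module Read(string xml)
    {
        var serializer = new XmlSerializer(typeof(Module));
        using (var reader = new StringReader(xml))
        {
            return (Module)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        Module module = Read(
            "<Module><ModuleName>hello.exe</ModuleName><ImageSize>61440</ImageSize></Module>");
        Console.WriteLine(module.ModuleName + " (" + module.ImageSize + " bytes)");
    }
}
```

The trade-off is that the class has to mirror the document structure, so this pays off mostly when the same format is read in many places.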

XPath Expression not working in HtmlAgilityPack

I know it may be of my noobness in XPath, but let me ask to make sure, cuz I've googled enough.
I have a website and wanna get the news headings from it: www.farsnews.com (it is Persian)
Using FireBug and FireXpath extensions under firefox and by hand I extract and test multiple Xpath expressions that matches the headings, such as:
* html/body/div[2]/div[2]/div[2]/div[*]/div[2]/a/div[2]
* .//*[@class="topnewsinfotitle "]
* .//div[@class="topnewsinfotitle "]
I also tested these using XPather extension and they seem to work pretty well, but when I get to test them... the SelectNodes returns null!
Any clue or hint?
here is a chunk of the code:
listBox2.ResetText();
HtmlAgilityPack.HtmlWeb w = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = w.Load("http://www.farsnews.com");
HtmlAgilityPack.HtmlNodeCollection nc = doc.DocumentNode.SelectNodes(".//div[@class=\"topnewsinfotitle \"]");
listBox2.Items.Add(nc.Count+" Items selected!");
foreach (HtmlAgilityPack.HtmlNode node in nc) {
    listBox2.Items.Add(node.InnerText);
}
Thanks.
I have tested your expressions. As mentioned by Dialecticus in a comment, you have a trailing space which shouldn't be there.
//div[@class='topnewsinfotitle ']/text()
Returns 'empty sequence', see evaluation: http://xmltools.dk/EQA-ACA6
//div[@class='topnewsinfotitle']/text()
Returns a list of your headlines, see: http://xmltools.dk/EgA2APAj
However, if there could be other classes, you can use this ( http://xmltools.dk/EwA8AJAW ):
//div[contains(@class, 'topnewsinfotitle')]/text()
(I see there is an encoding issue in the links I've provided; however, it shouldn't matter for the meaning. For all the XPath expressions, you can remove /text() to get the nodes instead of only the text.)
BUT, if you own this site, you should provide the headlines as XML (maybe RSS or Atom) or JSON, which will perform better and, most importantly, be more bullet-proof.
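The three variants can be compared side by side with the XPath 1.0 engine built into .NET. This sketch uses System.Xml's XmlDocument on a hypothetical well-formed snippet instead of HtmlAgilityPack and the live page, so the class name ClassMatchDemo and the sample markup are assumptions. It also shows a common XPath 1.0 idiom for matching a whole class token, since a plain contains() also matches longer names such as topnewsinfotitle-old.

```csharp
using System;
using System.Xml;

static class ClassMatchDemo
{
    // Hypothetical markup with three differently classed headline divs.
    public const string Sample =
        "<body>" +
        "<div class='topnewsinfotitle'>A</div>" +
        "<div class='topnewsinfotitle hot'>B</div>" +
        "<div class='topnewsinfotitle-old'>C</div>" +
        "</body>";

    public const string TrailingSpace = "//div[@class='topnewsinfotitle ']";
    public const string Exact = "//div[@class='topnewsinfotitle']";
    public const string Contains = "//div[contains(@class, 'topnewsinfotitle')]";

    // Pads the attribute with spaces so only a complete class token matches.
    public const string Token =
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' topnewsinfotitle ')]";

    public static int CountMatches(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        foreach (string xpath in new[] { TrailingSpace, Exact, Contains, Token })
        {
            Console.WriteLine(CountMatches(xpath) + "  " + xpath);
        }
    }
}
```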
