XPath OR, alternative - c#

I use CSharp, XPath and HTMLAgility Pack. I use XPath strings such as:
"//table[3]/td[1]/span[2]/text() | //table[6]/td[1]/span[2]/text()"
"//table[8]/td[1]/span[2]/text() | //table[10]/td[1]/span[2]/text()"
The only difference is the table numbers. Is it possible to use some other XPath function to replace the XPath OR (|)?
What I actually do: with the first XPath string (table numbers 3 and 6) I extract one value. With the second XPath string (table numbers 8 and 10) I extract another value.
An additional question about performance: is the XPath string //table[8]/td[1]/span[2]/text() faster than the XPath string with OR, //table[8]/td[1]/span[2]/text() | //table[10]/td[1]/span[2]/text()? I ask because I have a great many XPath strings for a great many values, and if the difference is significant I need to try something else. I can't measure it right now, which is why I'm asking you to share your experience.

Firstly, //table[6] looks odd. Are you sure you don't mean (//table)[6]? (The first selects every table that is the 6th table child of its parent; the second selects the sixth table in the document.) I will assume the latter.
In XPath 2.0 you can write
(//table)[position()=(3,6,8,10)]/td[1]/span[2]/text()
In 1.0 that would have to be
(//table)[position()=3 or position()=6 or position()=8 or position()=10]
/td[1]/span[2]/text()
Or (in either release) you could write
((//table)[3] | (//table)[6] | (//table)[8] | (//table)[10])/td[1]/span[2]/text()
Your question about performance can't be answered without knowing what XPath implementation you are using.
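To make the difference between //table[3] and (//table)[3] concrete, here is a minimal sketch using the XPath 1.0 engine in System.Xml. As far as I know, HtmlAgilityPack evaluates XPath through the same .NET XPath 1.0 machinery, so the 1.0 form above is the one to use there, but treat that as an assumption and check it against your version. The class name TableXPathDemo and the sample document (ten tables split across two div elements) are invented for illustration.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml;

public static class TableXPathDemo
{
    // Builds a small well-formed document: two <div> elements holding five
    // tables each, so tables 1-5 share one parent and tables 6-10 another.
    public static XmlDocument BuildDocument()
    {
        var sb = new StringBuilder("<root>");
        for (int n = 1; n <= 10; n++)
        {
            if (n == 1 || n == 6) sb.Append("<div>");
            sb.Append($"<table><td><span>label</span><span>value-{n}</span></td></table>");
            if (n == 5 || n == 10) sb.Append("</div>");
        }
        sb.Append("</root>");

        var doc = new XmlDocument();
        doc.LoadXml(sb.ToString());
        return doc;
    }

    // Runs an XPath 1.0 query and returns the text of each selected node
    // in document order.
    public static string[] Select(XmlDocument doc, string xpath)
    {
        return doc.SelectNodes(xpath).Cast<XmlNode>().Select(n => n.InnerText).ToArray();
    }
}
```

With this document, //table[3]/td[1]/span[2]/text() picks the third table under each div, while (//table)[position()=3 or position()=6]/td[1]/span[2]/text() picks the third and sixth tables of the whole document.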

Related

Is there a C# utility for matching patterns in (syntactic parse) trees?

I'm working on a Natural Language Processing (NLP) project in which I use a syntactic parser to create a syntactic parse tree out of a given sentence.
Example Input: I ran into Joe and Jill and then we went shopping
Example Output: [TOP [S [S [NP [PRP I]] [VP [VBD ran] [PP [IN into] [NP [NNP Joe] [CC and] [NNP Jill]]]]] [CC and] [S [ADVP [RB then]] [NP [PRP we]] [VP [VBD went] [NP [NN shopping]]]]]]
I'm looking for a C# utility that will let me do complex queries like:
Get the first VBD related to 'Joe'
Get the NP closest to 'Shopping'
Here's a Java utility that does this, I'm looking for a C# equivalent.
Any help would be much appreciated.
There are at least two NLP frameworks, i.e.
SharpNLP (NOTE: project inactive since 2006)
Proxem
And here you can find instructions for using a Java NLP library in .NET:
Using OpenNLP in .NET project
This page is about using Java OpenNLP, but it could also apply to the Java library you mentioned in your post.
Or use NLTK following these guidelines:
Open Source NLP in C# 3.5 using NLTK
We already use
One option would be to parse the output into C# objects and then encode them as XML, turning every node into string.Format("<{0}>", this.Name); and string.Format("</{0}>", this.Name); with all the child nodes written recursively in between.
After you do this, I would use a tool for querying XML/HTML to search the tree. Thousands of people already use query selectors and jQuery to navigate tree-like structures based on the relations between nodes. I think this is far superior to TRegex or other outdated and unmaintained Java utilities.
For example, this is to answer your first example:
var xml = CQ.Create(d.ToXml());
//This can be simpler with CSS selectors, but I chose LINQ since you'll probably find it easier
//Find Joe, in our case the node that has the text 'Joe'
var joe = xml["*"].First(x => x.InnerHTML.Equals("Joe"));
//Find the last (deepest) element that meets the criteria: it contains "Joe" and it contains a VBD
//in our case the VP
var closestToVbd = xml["*"].Last(x => x.Cq().Has(joe).Has("VBD").Any());
Console.WriteLine("Closest node to VBD:\n " + closestToVbd.OuterHTML);
//If we want the VBD itself we can just find the VBD in that element
Console.WriteLine("\n\n VBD itself is " + closestToVbd.Cq().Find("VBD")[0].OuterHTML);
Here is your second example:
//Now for the NP closest to 'shopping': find the element with the text 'shopping' and find its closest NP
var closest = xml["*"].First(x => x.InnerHTML.Equals("shopping")).Cq()
.Closest("NP")[0].OuterHTML;
Console.WriteLine("\n\n NP closest to shopping is: " + closest);
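The conversion step described at the start of this answer (bracketed parser output to XML) can be sketched as follows. This is only an illustration using LINQ to XML from the standard library rather than CsQuery; the class name ParseTreeXml is made up, and it assumes every node label is a valid XML element name, which will not hold for tags such as "PRP$" or punctuation labels.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class ParseTreeXml
{
    // Converts "[NP [NNP Joe] [CC and]]" into <NP><NNP>Joe</NNP><CC>and</CC></NP>.
    public static XElement ToXml(string bracketed)
    {
        // Tokens are "[", "]" or a run of non-space, non-bracket characters.
        var tokens = new Queue<string>();
        foreach (Match m in Regex.Matches(bracketed, @"\[|\]|[^\s\[\]]+"))
            tokens.Enqueue(m.Value);

        XElement root = ParseNode(tokens);
        if (tokens.Count > 0)
            throw new FormatException("Trailing input after the root node");
        return root;
    }

    // node := "[" LABEL (node | WORD)* "]"
    private static XElement ParseNode(Queue<string> tokens)
    {
        if (tokens.Count < 2 || tokens.Dequeue() != "[")
            throw new FormatException("Expected '[' followed by a label");

        var element = new XElement(tokens.Dequeue());   // the label, e.g. NP
        while (tokens.Count > 0 && tokens.Peek() != "]")
        {
            if (tokens.Peek() == "[")
                element.Add(ParseNode(tokens));         // nested constituent
            else
                element.Add(tokens.Dequeue());          // leaf word
        }

        if (tokens.Count == 0)
            throw new FormatException("Missing ']'");
        tokens.Dequeue();                               // consume "]"
        return element;
    }
}
```

The resulting XElement can be serialized with ToString() and handed to whichever XML or HTML query tool you prefer.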

Match expressions in Strings

I have a database with certain rules I need to apply to a bunch of Strings. The rules are expressions that can occur within the Strings. They are expressed like
(word1 AND word2) OR (word3)
I can't hardcode those (because they may be changed in the database), so I thought about programmatically turning those expressions into Regex patterns.
Has anybody done such a task before, or does anybody have an idea of the best way to do it?
I'm not quite sure how to deal with more complex expressions, how to take them apart and so on.
Edit: I'm using C# in Visual Studio / .NET.
The data is basically directory paths. A customer wants to get their documents organized, so the Strings I have are paths, and the expressions in the DB could look like:
(office OR headquarter) AND (official OR confidential)
So if the file's directory path contains office and confidential, it should match.
Hope this makes it clearer.
EDIT2:
Here are some dummy examples:
The paths could look like:
c:\documents\official\johnmeyer\court\out\letter.doc
c:\documents\internal\appointments\court\in\september.doc
c:\documents\official\stevemiller\meeting\in\letter.doc
And the expressions like:
(meyer or miller) AND (court OR jail)
So this expression would match the 1st path/file, but not the 2nd or 3rd.
No answer, but a good hint:
The expressions you have are actual trees constructed by the parentheses. You need a stack machine to parse the text into a (binary) tree structure, where each node is an AND or OR element and the leaves are the words.
Afterwards, you can simply construct your regex in whatever language you need by walking the tree using depth first search and adding prefix and suffix data as needed before/after reading the subtree.
Consider an abstract class TreeNode having a method GenerateExpression(StringBuilder result).
Each actual TreeNode item will be either a CombinationTreeNode (with a CombinationMode of And/Or) or a SearchTextTreeNode (with a SearchText property).
GenerateExpression(StringBuilder result) for CombinationTreeNode will look something like this:
result.Append("(");
leftSubTree.GenerateExpression(result);
result.Append(") " + this.CombinationMode.ToString() + " (");
rightSubTree.GenerateExpression(result);
result.Append(")");
GenerateExpression(StringBuilder result) for SearchTextTreeNode is much easier:
result.Append(this.SearchText);
Of course, your code will need to produce a regular expression rather than reproduce the input text, as mine does.
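As a rough sketch of how the regex generation could look: the code below does not build explicit TreeNode objects. Instead it uses a small recursive-descent parser that emits a regex fragment for each subexpression as it goes. Each word becomes a lookahead (?=.*word), AND concatenates lookaheads, and OR becomes an alternation of them. The class name RuleCompiler, the case-insensitive matching, and the precedence (AND binds tighter than OR) are my own assumptions, not something the question specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RuleCompiler
{
    // Turns e.g. "(meyer OR miller) AND (court OR jail)" into a Regex that
    // matches any string containing the required words.
    public static Regex Compile(string rule)
    {
        var tokens = new Queue<string>(
            Regex.Matches(rule, @"\(|\)|[^\s()]+").Cast<Match>().Select(m => m.Value));
        string pattern = ParseOr(tokens);
        if (tokens.Count > 0)
            throw new FormatException("Unexpected token: " + tokens.Peek());
        return new Regex("^" + pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    private static bool NextIs(Queue<string> t, string keyword)
    {
        return t.Count > 0 && string.Equals(t.Peek(), keyword, StringComparison.OrdinalIgnoreCase);
    }

    // or := and ("OR" and)*   => (?:a|b|...)
    private static string ParseOr(Queue<string> t)
    {
        var alternatives = new List<string> { ParseAnd(t) };
        while (NextIs(t, "OR")) { t.Dequeue(); alternatives.Add(ParseAnd(t)); }
        return alternatives.Count == 1
            ? alternatives[0]
            : "(?:" + string.Join("|", alternatives) + ")";
    }

    // and := atom ("AND" atom)*   => zero-width assertions placed back to back
    private static string ParseAnd(Queue<string> t)
    {
        var parts = new List<string> { ParseAtom(t) };
        while (NextIs(t, "AND")) { t.Dequeue(); parts.Add(ParseAtom(t)); }
        return string.Concat(parts);
    }

    // atom := "(" or ")" | WORD   => a word becomes (?=.*word)
    private static string ParseAtom(Queue<string> t)
    {
        if (t.Count == 0) throw new FormatException("Unexpected end of rule");
        string token = t.Dequeue();
        if (token == "(")
        {
            string inner = ParseOr(t);
            if (t.Count == 0 || t.Dequeue() != ")") throw new FormatException("Missing ')'");
            return inner;
        }
        if (token == ")") throw new FormatException("Unexpected ')'");
        return "(?=.*" + Regex.Escape(token) + ")";
    }
}
```

Calling RuleCompiler.Compile("(meyer or miller) AND (court OR jail)").IsMatch(path) then tells you whether a given path satisfies the rule. Since every fragment is a lookahead anchored at the start, the order of the words in the path does not matter.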

Parse expression (with custom functions and operations)

I have a string, which contains a custom expression, I have to parse and evaluate:
For example:
(FUNCTION_A(5,4,5) UNION FUNCTION_B(3,3))
INTERSECT (FUNCTION_C(5,4,5) UNION FUNCTION_D(3,3))
FUNCTION_X represent functions, which are implemented in C# and return ILists.
UNION or INTERSECT are custom functions which should be applied to the lists, which are returned from those functions.
Union and intersect are implemented via Enumerable.Intersect/Enumerable.Union.
How can the parsing and evaluating be implemented in an elegant and expandable manner?
It depends on how complex your expressions will become, how many different operators are going to be available, and a number of other factors. Whichever way you do it, you will probably need to first determine a grammar for your mini-language.
For simple grammars, you can just write a custom parser. In the case of many calculators and similar applications, a recursive descent parser is expressive enough to handle the grammar and is intuitive to write. The linked Wikipedia page gives a sample grammar and the implementation of a C parser for it. Eric White also has a blog post on building recursive descent parsers in C#.
For more complex grammars, you will likely want to skip the work of creating this yourself and use a lex/yacc-type lexer and parser toolset. Normally you give as input to these a grammar in EBNF or similar syntax, and they will produce the code necessary to parse the input for you. The parser will typically return a syntax tree which you can traverse, allowing you to apply logic for each token in the input stream (each node in the tree). For C#, I have worked with GPLex and GPPG, but others such as ANTLR are also available.
Basic Parsing Concepts
In general, you want to be able to split each item in the input into a meaningful token, and build a tree based on those tokens. Once the tree is built, you can traverse the tree and perform the necessary action at each node. A syntax tree for FUNCTION_A(5,4,5) UNION FUNCTION_B(3,3) might look like this, where the node types are in capital letters and their values are in parentheses:
                     PROGRAM
                        |
                        |
                      UNION
                        |
          ------------------------------
          |                            |
 FUNCTION(FUNCTION_A)         FUNCTION(FUNCTION_B)
          |                            |
    -------------                 ----------
    |     |     |                 |        |
 INT(5) INT(4) INT(5)          INT(3)   INT(3)
The parser needs to be smart enough to know that when a UNION is found, it needs to be supplied with two items to union, etc. Given this tree, you would start at the root (PROGRAM) and do a depth-first traversal. At the UNION node, the action would be to first visit all children, and then union the results together. At a FUNCTION node, the action would be to first visit all of the children, find their values, and use those values as parameters to the function, and secondly to evaluate the function on those inputs and return the value.
This would continue for all tokens, for any expression you can come up with. In this way, if you spend the time to get the parser to produce the right tree and each node knows how to perform whatever action it needs to, your design is very extensible and can handle any input that matches the grammar it was designed for.
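For a grammar as small as the one in the question, a hand-written recursive-descent evaluator might look like the sketch below. It evaluates while parsing instead of building an explicit syntax tree first, which is a simplification of the tree-walking design described above. The class name SetExpressionEvaluator, the use of int lists and int arguments, and the decision to give UNION and INTERSECT equal precedence (left to right, with parentheses for grouping) are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class SetExpressionEvaluator
{
    private readonly Dictionary<string, Func<int[], IEnumerable<int>>> _functions;
    private Queue<string> _tokens = new Queue<string>();

    // The caller registers the C# implementations under their expression names.
    public SetExpressionEvaluator(Dictionary<string, Func<int[], IEnumerable<int>>> functions)
    {
        _functions = functions;
    }

    public List<int> Evaluate(string expression)
    {
        // Tokens: identifiers, integer literals, parentheses and commas.
        _tokens = new Queue<string>(
            Regex.Matches(expression, @"[A-Za-z_]\w*|\d+|[(),]").Cast<Match>().Select(m => m.Value));
        List<int> result = ParseExpression().ToList();
        if (_tokens.Count > 0)
            throw new FormatException("Unexpected token: " + _tokens.Peek());
        return result;
    }

    // expression := operand (("UNION" | "INTERSECT") operand)*
    private IEnumerable<int> ParseExpression()
    {
        IEnumerable<int> left = ParseOperand();
        while (_tokens.Count > 0 && (_tokens.Peek() == "UNION" || _tokens.Peek() == "INTERSECT"))
        {
            string op = _tokens.Dequeue();
            IEnumerable<int> right = ParseOperand();
            left = op == "UNION" ? left.Union(right) : left.Intersect(right);
        }
        return left;
    }

    // operand := "(" expression ")" | NAME "(" [INT ("," INT)*] ")"
    private IEnumerable<int> ParseOperand()
    {
        string token = Next();
        if (token == "(")
        {
            IEnumerable<int> inner = ParseExpression();
            Expect(")");
            return inner;
        }
        if (!_functions.TryGetValue(token, out var function))
            throw new FormatException("Unknown function: " + token);

        Expect("(");
        var args = new List<int>();
        if (_tokens.Count > 0 && _tokens.Peek() != ")")
        {
            args.Add(int.Parse(Next()));
            while (_tokens.Count > 0 && _tokens.Peek() == ",")
            {
                _tokens.Dequeue();
                args.Add(int.Parse(Next()));
            }
        }
        Expect(")");
        return function(args.ToArray());
    }

    private string Next()
    {
        if (_tokens.Count == 0) throw new FormatException("Unexpected end of input");
        return _tokens.Dequeue();
    }

    private void Expect(string expected)
    {
        string actual = Next();
        if (actual != expected)
            throw new FormatException("Expected '" + expected + "' but found '" + actual + "'");
    }
}
```

New functions are added by putting another entry in the dictionary, and a new operator needs one more branch in ParseExpression. Once the grammar grows beyond that (precedence levels, other argument types), a generated parser is probably the better investment.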

C# regex html table inside a table

I am using the following regex:
(<(table|h[1-6])[^>]*>(?<op>.+?)<\/(table|h[1-6])>)
to extract tables (and headings) from a html document.
I've found it to work quite well on the documents we are using (documents converted with Word's "save as filtered HTML"). However, if a table contains another table, the regex matches from the outer table's start tag to the inner table's end tag rather than to the outer table's end tag.
Is there a way in regex to specify that if it finds another table tag within the match, it should ignore the next match of </table> and go for the next one, and so on?
Don't do this.
HTML is not a regular grammar, so a regular expression is not a good tool with which to parse it. What you are asking for in your last sentence is a contextual parser, not a regular expression. Bare regular expression parsing is too likely to fail on real HTML to be responsible coding.
HtmlAgilityPack is a MsPL-licensed solution I've used in the past that has widely acceptable license terms and provides a well-formed DOM which can be probed with XPath or manipulated in other useful ways ("Extract all text, dropping out tags" being a popular one for importing HTML mail for search, for example, that is nigh trivial after letting a DOM parser rip through the HTML and only coding the part that adds value for your specific business case).
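To show what the DOM-plus-XPath approach buys you for nested tables, here is a minimal sketch. So that it runs with the standard library alone, it uses System.Xml and therefore assumes the markup is well-formed XML (XHTML). For real-world HTML you would load the document with HtmlAgilityPack instead and pass the same XPath to its SelectNodes method. The class name OuterTableExtractor is made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;

public static class OuterTableExtractor
{
    // Selects tables and h1-h6 headings that are not nested inside a table,
    // so an outer table is returned once, with its inner tables intact.
    private const string Query =
        "//table[not(ancestor::table)]" +
        " | //*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]" +
        "[not(ancestor::table)]";

    // Returns the markup of each selected element in document order.
    public static List<string> Extract(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        return doc.SelectNodes(Query).Cast<XmlNode>().Select(n => n.OuterXml).ToList();
    }
}
```

The ancestor:: test is exactly the kind of context a regular expression cannot express: the parser has already matched every start tag with its end tag, so the query only has to ask where an element sits in the tree.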
"Is there a way in regex to specify that if it finds another table tag within the match, it should ignore the next match of </table> and go for the next one, and so on?"
Since nobody's actually answered this part, I will—No.
This is part of what makes regular languages "regular". A regular language is one that can be recognized by a certain regular grammar, often described in syntax that looks very much like basic regular expressions (10* to match 1 followed by any number of 0s), or a DFA. "Regular Expressions" are based strongly off of these regular languages, as their name implies, but add some functions such as lookaheads and lookbehinds. As a general rule, a regular language knows nothing about what's around it or what it's seen, only what it's looking at currently, and which of its finite states it's in.
TLDNR: Why does this matter to you? Since a regular language cannot "count" elements in that way, it is impossible to keep a tally of the number of <table> and </table> elements you have seen. An HTML Parser does just that - since it is not trying to emulate a regular language, it can count the number of opening and closing tags it sees.
This is the prime example of why it's best not to use regular expressions to parse HTML; even though you know how it may be formed, you cannot parse it since there may be nested elements. If you could guarantee there would be no nested tables, it may be feasible to do this, but even then, using a parser would be much simpler.
Plea to the theoretical computer scientists: I did my best to explain what I know from the CS Theory classes I've taken in a way that most people here should be able to understand. I know that regular languages can "count" finite numbers of things. Feel free to correct me, but please be kind!
Regular expressions are not really suited for this as what you're trying to do contains knowledge about the fact that this is a nested language. Without this knowledge it will be really hard (and also hard to read and maintain) to extract this information.
Maybe do something with an XPath navigator?

Xpath, retrieving node value

I get this return value from SharePoint. I have included only the first part of the XML snippet:
<Result ID=\"1,New\" xmlns=\"http://schemas.microsoft.com/sharepoint/soap/\">
<ErrorCode>0x00000000</ErrorCode><ID /><z:row ows_ID=\"9\"
It populates an XmlNode object.
How can I get the value of ows_ID using XPath?
My code so far...
XmlNode results = list.UpdateListItems("MySharePointList", batch);
Update
So far I have this : results.FirstChild.ChildNodes[2].Attributes["ows_ID"].Value
But I am not sure how reliable it is, can anyone improve on it?
I don't know if it's necessarily an improvement, but it might be more readable, though more verbose:
/*[local-name() = 'Result']/*[local-name() = 'row']/@ows_ID
There is probably more to the fragment you posted so this XPath query might need a fixup when used against the actual xml result.
The function, local-name(), lets you ignore namespaces, which can be both a boon and a curse. :)
When you start from the root:
/Result/z:row/@ows_ID
You can also narrow the search if there are multiple Result elements:
/Result[@ID='1,New']/z:row/@ows_ID
<xsl:value-of select="Result/b:row/@ows_ID"/>
or
<xsl:value-of select="Result/b:row[@ows_ID = '9']"/>
Depending on what value you wanted
You probably need to make sure the z namespace prefix is declared correctly - that's implementation dependent. Here's how you do it in Java's XPath implementation.
Then to select the value of the ows_ID attribute, you need to navigate to the element itself, then use @ows_ID to get the value.
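In C#, declaring the prefixes is done with an XmlNamespaceManager. The sketch below is hedged in two ways. First, the snippet in the question is cut off before the z prefix is declared, so the "#RowsetSchema" URI used here is an assumption based on what SharePoint list responses usually contain; check it against your actual response. Second, the prefix "sp" and the class name OwsIdReader are arbitrary names chosen for the example. Note that Result sits in a default namespace, and XPath 1.0 only matches namespaced elements through a prefix, so Result needs one as well.

```csharp
using System;
using System.Xml;

public static class OwsIdReader
{
    // Returns the ows_ID attribute of the first z:row under Result,
    // or null when no such row exists.
    public static string ReadOwsId(XmlNode result)
    {
        XmlDocument owner = result as XmlDocument ?? result.OwnerDocument;
        var ns = new XmlNamespaceManager(owner.NameTable);
        ns.AddNamespace("sp", "http://schemas.microsoft.com/sharepoint/soap/");
        ns.AddNamespace("z", "#RowsetSchema"); // assumed URI, see the note above

        // descendant-or-self lets this work whether `result` is the document
        // or the Result element itself.
        XmlNode attribute = result.SelectSingleNode(
            "descendant-or-self::sp:Result/z:row/@ows_ID", ns);
        return attribute == null ? null : attribute.Value;
    }
}
```

With the node returned by UpdateListItems in the question, the call would be OwsIdReader.ReadOwsId(results). Unlike indexing into ChildNodes[2], this does not depend on the position of the row element among its siblings.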
The specific xpath calls depend on what library you use (e.g. libxml xpath implementation).
But the generic xpath statement would be:
"//z:row[@ows_ID='9']"
This will select all z:row nodes with an attribute ows_ID of value 9.
You can modify this query to match all z:row nodes or only those with a specific attribute.
For details look here: W3Schools XPath syntax
