I have written a chunk of XML parsing which works successfully provided I use an absolute path.
I now need to take an XMLNode as an argument and run an xpath against this.
Does anyone know how to do this?
I tried using relative XPath queries without any success!!
Should it be this hard??
It would help to see examples of XPath expressions that don't work as you think they should. Here are some possible causes (mistakes I frequently make).
Assume an XML document such as:
<A>
<B>
<C d='e'/>
</B>
<C/>
<D xmlns="http://foo"/>
</A>
forgetting to remove the top-level slash ('/') representing the document:
document.XPathSelectElements("/A") // selects a single A node
document.XPathSelectElements("//B") // selects a single B node
document.XPathSelectElements("//C") // selects two C nodes
but
aNode.XPathSelectElements("/B") // selects nothing (this looks for a rootNode with name B)
aNode.XPathSelectElements("B") // selects a B node
bNode.XPathSelectElements("//C") // selects TWO C nodes - all descendants of the root node
bNode.XPathSelectElements(".//C") // selects one C node - all descendants of B
forgetting namespaces.
aNode.XPathSelectElements("D") // selects nothing (D is in a different namespace from A)
aNode.XPathSelectElements("*[local-name()='D' and namespace-uri()='http://foo']") // one D node
(This is often a problem when the root node carries a prefixless namespace - easy to miss)
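Both pitfalls above can be checked with a small console program. This is only a sketch: the class name, the CountFrom helper and the prefix "f" are made up for illustration, and registering a prefix with an XmlNamespaceManager is one alternative to the local-name()/namespace-uri() predicate.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

public static class ContextAndNamespaceDemo
{
    // The sample document from above, written on one line.
    private const string Xml =
        "<A><B><C d='e'/></B><C/><D xmlns='http://foo'/></A>";

    // Evaluates xpath with the first element named contextName as the
    // context node and returns how many elements it selects.
    public static int CountFrom(string contextName, string xpath)
    {
        XDocument document = XDocument.Parse(Xml);
        XElement context = document.Descendants(contextName).First();

        // "f" is an arbitrary prefix chosen for this sketch; only the URI matters.
        var namespaces = new XmlNamespaceManager(new NameTable());
        namespaces.AddNamespace("f", "http://foo");

        return context.XPathSelectElements(xpath, namespaces).Count();
    }

    public static void Main()
    {
        Console.WriteLine(CountFrom("A", "/B"));   // 0: absolute path, and the root element is A
        Console.WriteLine(CountFrom("A", "B"));    // 1: child of the context node
        Console.WriteLine(CountFrom("B", "//C"));  // 2: searched from the document
        Console.WriteLine(CountFrom("B", ".//C")); // 1: searched from B
        Console.WriteLine(CountFrom("A", "D"));    // 0: looks for a D in no namespace
        Console.WriteLine(CountFrom("A", "f:D"));  // 1: D in http://foo
    }
}
```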
I am trying to figure out how to write a regex that will strip out the values enclosed in an xml tag. For example,
string xml = "<MyElement1 attribute=\"bla\"><MyElement1>12345</MyElement1></MyElement1>";
I want to know how to do the following:
match on MyElement1 nodes that do not have an attribute
So specifically, using my example I would match <MyElement1>12345</MyElement1> and replace <MyElement1> and </MyElement1> so that my final node looks like this: <MyElement1 attribute="bla">12345</MyElement1>
I've tried: [<][^>]*[>] but this matches on all elements. I'm not sure how to specify specific elements I want to match on.
I have edited the question to make it more focused and clearer, as suggested in response to the downvotes. I understand that I can parse and navigate my document tree, but I would prefer a regex replace of some sort because I want to apply this logic to any number of XML files with different tree structures, elements, and attributes.
Well you really don't need to use regular expressions, you just need to parse your XML using an XML parser.
One option is to use the XDocument.Parse(xml) method together with XElement: the first parses the string, and the second lets you read each element's tag name and its value. An example of reading them would be the following:
string xml = "<MyElement1>12345</MyElement1><MyElement2>abcd</MyElement2><MyElement3>12345</MyElement3><MyElement4>12345</MyElement4>";
// wrap your element in a rootnode (you seem to be missing one in your example)
var document = XDocument.Parse( $"<root>{xml}</root>");
// get the root node and loop over its children (keeping only the XElement nodes)
foreach (var node in document.Root.Nodes().OfType<XElement>()) {
// Name is the tag name, Value is the element's text content
Console.WriteLine($"{node.Name}: {node.Value}");
}
Note that for the example to parse correctly you must add a root node, as an XML document can have only one root element. In my sample, I wrapped the input in a root node while parsing.
This sample code uses the System.Xml.Linq namespace, so don't forget to import that one.
One additional comment would be that your supplied XML code had an error in it (MyElemen4 opening tag with MyElement4 closing tag)
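For the nested case described in the question (an attribute-less MyElement1 inside a MyElement1 that has an attribute), a parser-based version might look like the sketch below. The class and method names are hypothetical, and the rule "no attributes and same name as the parent" is an assumption about what the asker wants to unwrap.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class UnwrapDemo
{
    // Replaces every attribute-less element that has the same name as its
    // parent with its own content, and returns the resulting XML.
    public static string Unwrap(string xml)
    {
        XDocument document = XDocument.Parse(xml);

        // Collect first, deepest elements first, so the tree is not
        // modified while it is being enumerated.
        var redundant = document.Descendants()
            .Where(e => !e.HasAttributes && e.Parent != null && e.Parent.Name == e.Name)
            .Reverse()
            .ToList();

        foreach (XElement element in redundant)
        {
            element.ReplaceWith(element.Nodes());
        }

        return document.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        string xml = "<MyElement1 attribute=\"bla\"><MyElement1>12345</MyElement1></MyElement1>";
        Console.WriteLine(Unwrap(xml)); // <MyElement1 attribute="bla">12345</MyElement1>
    }
}
```

Because the rule is expressed in terms of element names and attributes rather than a fixed path, the same method can be applied to documents with different tree structures.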
I would recommend using an XML parser, but if you want, you can use a simple regex like <([\w]*)>(.*?)<\/[\w]*>, which returns the name of the tag and the value inside.
Output:
Match 1
Full match 0-30 <MyElement1>12345</MyElement1>
Group 1. 1-11 MyElement1
Group 2. 12-17 12345
Match 2
Full match 30-59 <MyElement2>abcd</MyElement2>
Group 1. 31-41 MyElement2
Group 2. 42-46 abcd
Match 3
Full match 59-89 <MyElement3>12345</MyElement3>
Group 1. 60-70 MyElement3
Group 2. 71-76 12345
Match 4
Full match 89-118 <MyElemen4>12345</MyElement4>
Group 1. 90-99 MyElemen4
Group 2. 100-105 12345
Keep in mind that it does not take tag attributes into consideration: an opening tag that carries attributes will not match. If you want to fetch a specific tag, you can replace [\w]* with the tag name you want.
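As a rough illustration of the "specific tag" variant in C#, the sketch below substitutes a tag name into the pattern. The class and method names are invented for this example, and the usual caveat applies: the pattern only sees attribute-less tags whose content sits on a single line.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagRegexDemo
{
    // Returns the content of every attribute-less <tagName>...</tagName> pair.
    public static string[] ValuesOf(string xml, string tagName)
    {
        string name = Regex.Escape(tagName);
        string pattern = "<" + name + ">(.*?)</" + name + ">";

        return Regex.Matches(xml, pattern)
            .Cast<Match>()
            .Select(match => match.Groups[1].Value)
            .ToArray();
    }

    public static void Main()
    {
        string flat = "<MyElement1>12345</MyElement1><MyElement2>abcd</MyElement2>";
        Console.WriteLine(string.Join(",", ValuesOf(flat, "MyElement2"))); // abcd

        // The opening tag with an attribute is skipped, so only the inner pair matches.
        string nested = "<MyElement1 attribute=\"bla\"><MyElement1>12345</MyElement1></MyElement1>";
        Console.WriteLine(string.Join(",", ValuesOf(nested, "MyElement1"))); // 12345
    }
}
```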
Take the following XML as example:
<root>
<lines>
<line>
<number>1</number>
</line>
<line>
<number>2</number>
</line>
</lines>
</root>
XmlNodeList nodeList = doc.SelectNodes("//lines/line");
foreach(XmlNode node in nodeList)
{
int index = int.Parse(node.SelectSingleNode("//number").InnerText);
}
The above code will result in index = 1 for both iterations.
foreach(XmlNode node in nodeList)
{
int index = int.Parse(node.SelectSingleNode("number").InnerText);
}
The above code will result in 1 and 2 respectively. I know that // finds the first occurrence of the XPath, but I feel like the first occurrence should be relative to the node itself. The behavior appears to find the first occurrence from the root even when selecting nodes from a child node. Is this the way Microsoft intended this to work, or is this a bug?
yeah thanks but just removing the slashes worked as well as in my second example.
Removing the slashes only works because number is an immediate child element of line. If it were further down in the hierarchy:
<root>
<lines>
<line>
<other>
<number>1</number>
</other>
</line>
</lines>
</root>
you would still need to use .//number.
I just think it is confusing that if you are searching for node within a node that // would go back to the whole document.
That's just how XPath syntax is designed. // at the beginning of an XPath expression means that the evaluation context is the document node - the outermost node of an XML document. .// means that the context of the path expression is the current context node.
If you think about it, it is actually useful to have a way to select from the whole document in any context.
Is this the way Microsoft intended this to work, or is this a bug?
Microsoft is implementing the XPath standard, and yes, this is how the W3C intended an XPath library to work and it's not a bug.
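The three spellings discussed in this thread can be compared side by side. The sketch below uses the nested <other> layout from the comment above, extended to two lines; the class and method names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class RelativePathDemo
{
    private const string Xml =
        "<root><lines>" +
        "<line><other><number>1</number></other></line>" +
        "<line><other><number>2</number></other></line>" +
        "</lines></root>";

    // Runs xpath once per <line>, with that <line> as the context node,
    // and returns the text of each hit ("(none)" when nothing matches).
    public static string[] NumbersVia(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);

        var results = new List<string>();
        foreach (XmlNode line in doc.SelectNodes("//lines/line"))
        {
            var hit = line.SelectSingleNode(xpath);
            results.Add(hit == null ? "(none)" : hit.InnerText);
        }
        return results.ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", NumbersVia("//number")));  // 1,1
        Console.WriteLine(string.Join(",", NumbersVia("number")));    // (none),(none)
        Console.WriteLine(string.Join(",", NumbersVia(".//number"))); // 1,2
    }
}
```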
If you have an XML document and need to find certain nodes based on certain attribute values (four in number), which would be the better approach in terms of performance?
a) Filter the XML document (with XPath) to get node list which match any of the attribute values and then traverse through the filtered Node list to get nodes having a particular attribute value using If-else.
b) Filter the XML document (with XPath) for each attribute value separately.
<Nodes>
<a class="myclass" type="type1">some text</a>
<a class="myclass" type="type2">some text</a>
<img src = "myGraphic.jpg?id={Guid}"/>
</Nodes>
I am using the below XPath (which might be incorrect :-))
"//A[@class] | //a[@class] | //IMG[@src] | //img[@src]"
The goal is to get the separate list of all a having type="type1" a separate list of type="type2" and a separate list of ids in the img tag.
My rough answer would be, performance is not going to really matter much unless you have a very large document or set of documents.
In that case, you probably will want to use SAX, and in any case, you'll want to traverse the document(s) only once, and not hold the whole thing in memory. So you'll be streaming through the documents, stopping at each a element, and appending it to list1 or list2 depending on its type.
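In .NET the usual streaming reader is XmlReader rather than a SAX API, so a single-pass version of this idea might look like the sketch below. The class name, the dictionary keys and the sample values are made up for illustration, and it assumes the id is whatever follows "?id=" in the src attribute.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SinglePassDemo
{
    // Reads the document once and sorts what it finds into three lists.
    public static Dictionary<string, List<string>> Collect(string xml)
    {
        var lists = new Dictionary<string, List<string>>
        {
            ["type1"] = new List<string>(),
            ["type2"] = new List<string>(),
            ["imgIds"] = new List<string>(),
        };
        string currentType = ""; // type of the <a> currently being read, if any

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
                {
                    // An empty <a/> has no content and no end tag to wait for.
                    currentType = reader.IsEmptyElement ? "" : (reader.GetAttribute("type") ?? "");
                }
                else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "a")
                {
                    currentType = "";
                }
                else if (reader.NodeType == XmlNodeType.Text
                         && (currentType == "type1" || currentType == "type2"))
                {
                    lists[currentType].Add(reader.Value);
                }
                else if (reader.NodeType == XmlNodeType.Element && reader.Name == "img")
                {
                    string src = reader.GetAttribute("src") ?? "";
                    int marker = src.IndexOf("?id=", StringComparison.Ordinal);
                    if (marker >= 0) lists["imgIds"].Add(src.Substring(marker + 4));
                }
            }
        }
        return lists;
    }

    public static void Main()
    {
        string xml =
            "<Nodes>" +
            "<a class='myclass' type='type1'>first</a>" +
            "<a class='myclass' type='type2'>second</a>" +
            "<img src='myGraphic.jpg?id=1234'/>" +
            "</Nodes>";

        foreach (var pair in Collect(xml))
        {
            Console.WriteLine(pair.Key + ": " + string.Join(",", pair.Value));
        }
    }
}
```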
My XML looks like:
<?xml version="1.0"?>
<itemSet>
<Item>one</Item>
<Item>two</Item>
<Item>three</Item>
.....maybe more Items here.
</itemSet>
Some of the individual Item may or may not be present. Say I want to retrieve the element <Item>two</Item> if it's present. I've tried the following XPaths (in C#).
XmlNode node = myXMLdoc.SelectSingleNode("/itemSet[Item='two']") --- If Item two is present, then it returns only the first element, one. Maybe this query just points to the first element in itemSet if it has an Item of value two somewhere as a child. Is this interpretation correct?
So I tried:
XmlNode node = myXMLdoc.SelectSingleNode("/itemSet[Item='two']/Item[1]") --- I read this query as: return the first <Item> element within itemSet that has value = 'two'. Am I correct?
This still returns only the first element one. What am I doing wrong?
In both cases, using the siblings I can traverse the child nodes and get to two, but that's not what I am looking for. Also, if two is absent then SelectSingleNode returns null. Thus the very fact that I am getting a successful return node does indicate the presence of element two, so had I wanted a boolean test to check for the presence of two, any of the above XPaths would suffice, but I actually need the full element <Item>two</Item> as my return node.
[My first question here, and my first time working with web programming, so I just learned the above XPaths and related xml stuff on the fly right now from past questions in SO. So be gentle, and let me know if I am a doofus or flouting any community rules. Thanks.]
I think you want:
myXMLdoc.SelectSingleNode("/itemSet/Item[text()='two']")
In other words, you want the Item which has text of two, not the itemSet containing it.
You can also use a single dot to indicate the context node, in your case:
myXMLdoc.SelectSingleNode("/itemSet/Item[.='two']")
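To see which node each of these queries actually returns, a small sketch like the following can help. The class and the Describe helper are hypothetical; they print the name and text of whatever SelectSingleNode returns.

```csharp
using System;
using System.Xml;

public static class ItemSetDemo
{
    private const string Xml =
        "<?xml version=\"1.0\"?>" +
        "<itemSet><Item>one</Item><Item>two</Item><Item>three</Item></itemSet>";

    // Returns "Name: InnerText" for the selected node, or "(null)" when nothing matches.
    public static string Describe(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);

        var node = doc.SelectSingleNode(xpath);
        return node == null ? "(null)" : node.Name + ": " + node.InnerText;
    }

    public static void Main()
    {
        // The predicate filters itemSet, so itemSet itself is returned.
        Console.WriteLine(Describe("/itemSet[Item='two']"));         // itemSet: onetwothree
        // Item[1] is the first Item child of that itemSet.
        Console.WriteLine(Describe("/itemSet[Item='two']/Item[1]")); // Item: one
        // Here the predicate filters the Item elements instead.
        Console.WriteLine(Describe("/itemSet/Item[.='two']"));       // Item: two
        Console.WriteLine(Describe("/itemSet/Item[.='four']"));      // (null)
    }
}
```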
EDIT: The difference between . and text() is that . means "this node" effectively, and text() means "all the text node children of this node". In both cases the comparison will be against the "string-value" of the LHS. For an element node, the string-value is "the concatenation of the string-values of all text node descendants of the element node in document order" and for a collection of text nodes, the comparison will check whether any text node is equal to the one you're testing against.
So it doesn't matter when the element content only has a single text node, but suppose we had:
<root>
<item name="first">x<foo/>y</item>
<item name="second">xy<foo/>ab</item>
</root>
Then an XPath expression of "root/item[.='xy']" will match the first item, but "root/item[text()='xy']" will match the second.
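The mixed-content example can be run as is. The following is a minimal sketch with an invented NameOfMatch helper that reports the name attribute of the first item matching a given predicate.

```csharp
using System;
using System.Xml;

public static class DotVersusTextDemo
{
    private const string Xml =
        "<root>" +
        "<item name=\"first\">x<foo/>y</item>" +
        "<item name=\"second\">xy<foo/>ab</item>" +
        "</root>";

    // Returns the name attribute of the first item that satisfies the predicate.
    public static string NameOfMatch(string predicate)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);

        var name = doc.SelectSingleNode("/root/item[" + predicate + "]/@name");
        return name?.Value ?? "(none)";
    }

    public static void Main()
    {
        // "." compares the whole string-value: "x" + "y" for the first item.
        Console.WriteLine(NameOfMatch(".='xy'"));      // first
        // text() compares each text node child on its own.
        Console.WriteLine(NameOfMatch("text()='xy'")); // second
        Console.WriteLine(NameOfMatch("text()='ab'")); // second
    }
}
```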
Presuming that I don't know the name of my base node or its children, what is the XPath syntax for "all nodes exactly one below the base node?"
With pattern being an XmlNode, I have the following code:
XmlNodeList kvpsList = pattern.SelectNodes(@"//");
Which looks right to me, but I get the following exception:
System.Xml.XPath.XPathException: Expression must evaluate to a node-set.
What is the correct syntax?
The path you're looking for is
/*/*
// isn't a meaningful XPath expression, since it is an operator. If you wrote something like //element, it would match every element named element anywhere in the XML document, no matter how deep into the hierarchy it is.
/*/* is saying "match every node that has two levels of depth in the hierarchy".
The two current answers are wrong:
/*/*
does not select all nodes that are children of the top node. It does not select any text nodes, processing-instructions or comments that are children of the top element.
One XPath expression that selects all nodes that are children of the top element is:
/*/node()
// is not a syntactically correct XPath expression; according to the XPath Spec:
// is short for
/descendant-or-self::node()/
Do note the beginning of an unfinished location step at the very end of the expanded abbreviation. If nothing is added to it, the whole XPath expression containing the abbreviation is finished and, therefore, syntactically incorrect.
Another note is that the // abbreviation is not necessary in specifying the selection of all nodes that are children of the top element. If you wanted to select all nodes in the XML document that descend from the top element, then one XPath expression that selects these is:
/*//node()
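The difference between these expressions shows up as soon as the top element has children that are not elements. The sketch below uses a made-up document with a comment, a text node, an element and a processing instruction under the root; the class and helper names are hypothetical. Since the question starts from an XmlNode, it also tries the relative forms * and node() with the top element as the context node.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class ChildNodesDemo
{
    private const string Xml =
        "<root><!--note-->text<a><b/></a><?pi data?></root>";

    // Counts the nodes selected by xpath, using the top element as context node.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc.DocumentElement.SelectNodes(xpath).Count;
    }

    // Reports whether the expression can be evaluated at all.
    public static bool IsValid(string xpath)
    {
        try
        {
            Count(xpath);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Count("/*/*"));       // 1: only the element a
        Console.WriteLine(Count("/*/node()"));  // 4: comment, text, a, processing instruction
        Console.WriteLine(Count("/*//node()")); // 5: the four children plus b
        Console.WriteLine(Count("*"));          // 1: relative form of /*/* from the top element
        Console.WriteLine(Count("node()"));     // 4: relative form of /*/node()
        Console.WriteLine(IsValid("//"));       // False: the location step is unfinished
    }
}
```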