I have the following XML sample, for which I need an XPath query that returns only node1 and node3.
<root>
<node1 />
<node2 anyAttribute="anyText" />
<node3> </node3>
<node4>anyText</node4>
<node5>
<anyChildNode />
</node5>
</root>
In other words, an XPath query which returns all nodes which have (simultaneously):
no attributes
no child nodes
no or whitespace-only content
I've found some solutions (1 & 2), but each is applicable to only one of the points above at a time:
for 1. /root/node()[not(node())] - tested and works
for 2. /root/node()[not(@*)] - tested and works
for 3. /root/node()[string-length(normalize-space(text())) = 0] - not working (I don't know why)
Yes, I know I could combine the three variants above, but I would like to avoid that. I would think that just searching for empty nodes/elements should have an easy solution, shouldn't it?
I'm also limited to XPath 1.0 on .NET, since there is no progress on supporting newer versions.
This XPath,
/root/*[not(@* or * or text()[normalize-space()])]
will select only node1 and node3, as requested.
Explanation:
Select all element children of root (note the difference from node()) that have no attributes (@*), no child elements (*), and no non-whitespace text children (text()[normalize-space()]).
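For anyone who wants to try this from .NET, here is a minimal sketch using XmlDocument.SelectNodes against the sample from the question. The variable names are only for illustration.

```csharp
using System;
using System.Linq;
using System.Xml;

// Sample document from the question.
const string xml = @"<root>
  <node1 />
  <node2 anyAttribute=""anyText"" />
  <node3> </node3>
  <node4>anyText</node4>
  <node5>
    <anyChildNode />
  </node5>
</root>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// Elements with no attributes, no child elements and no non-whitespace text.
var emptyNames = doc
    .SelectNodes("/root/*[not(@* or * or text()[normalize-space()])]")
    .Cast<XmlNode>()
    .Select(n => n.Name)
    .ToList();

Console.WriteLine(string.Join(", ", emptyNames)); // node1, node3
```

Note that anyChildNode is not returned, because the path only looks at direct children of root.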
Related
Take the following XML as example:
<root>
<lines>
<line>
<number>1</number>
</line>
<line>
<number>2</number>
</line>
</lines>
</root>
XmlNodeList nodeList = doc.SelectNodes("//lines/line");
foreach(XmlNode node in nodeList)
{
// "//number" is evaluated from the document node, not from the current line
int index = int.Parse(node.SelectSingleNode("//number").InnerText);
}
The above code will result in index = 1 for both iterations.
foreach(XmlNode node in nodeList)
{
// "number" is evaluated relative to the current line
int index = int.Parse(node.SelectSingleNode("number").InnerText);
}
The above code will result in 1 and 2 respectively. I know that // finds the first occurrence of the XPath, but I feel like the first occurrence should be relative to the node itself. The behavior appears to find the first occurrence from the root, even when selecting from a child node. Is this the way Microsoft intended this to work, or is this a bug?
Yeah, thanks, but just removing the slashes worked as well, as in my second example.
yeah thanks but just removing the slashes worked as well as in my second example.
Removing the slashes only works because number is an immediate child element of line. If it were further down in the hierarchy:
<root>
<lines>
<line>
<other>
<number>1</number>
</other>
</line>
</lines>
</root>
you would still need to use .//number.
I just think it is confusing that, when you are searching for a node within a node, // goes back to the whole document.
That's just how XPath syntax is designed. // at the beginning of an XPath expression means that the evaluation context is the document node - the outermost node of an XML document. .// means that the context of the path expression is the current context node.
If you think about it, it is actually useful to have a way to select from the whole document in any context.
Is this the way Microsoft intended this to work, or is this a bug?
Microsoft is implementing the XPath standard, and yes, this is how the W3C intended an XPath library to work and it's not a bug.
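To make the difference concrete, here is a small sketch that runs both forms against the nested sample above, extended with a second line so the results differ. The variable names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

const string linesXml = @"<root><lines>
  <line><other><number>1</number></other></line>
  <line><other><number>2</number></other></line>
</lines></root>";

var linesDoc = new XmlDocument();
linesDoc.LoadXml(linesXml);

var fromDocument = new List<string>();
var fromContext = new List<string>();

foreach (XmlNode line in linesDoc.SelectNodes("//lines/line"))
{
    // "//number" starts at the document node, so every line sees the first <number>.
    fromDocument.Add(line.SelectSingleNode("//number").InnerText);
    // ".//number" searches only the descendants of the current line.
    fromContext.Add(line.SelectSingleNode(".//number").InnerText);
}

Console.WriteLine(string.Join(",", fromDocument)); // 1,1
Console.WriteLine(string.Join(",", fromContext));  // 1,2
```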
I am looking for a good approach that can remove empty tags from XML efficiently. What do you recommend? Regex? XDocument? XmlTextReader?
For example,
const string original =
@"<?xml version=""1.0"" encoding=""utf-16""?>
<pet>
<cat>Tom</cat>
<pig />
<dog>Puppy</dog>
<snake></snake>
<elephant>
<africanElephant></africanElephant>
<asianElephant>Biggy</asianElephant>
</elephant>
<tiger>
<tigerWoods></tigerWoods>
<americanTiger></americanTiger>
</tiger>
</pet>";
Could become:
const string expected =
@"<?xml version=""1.0"" encoding=""utf-16""?>
<pet>
<cat>Tom</cat>
<dog>Puppy</dog>
<elephant>
<asianElephant>Biggy</asianElephant>
</elephant>
</pet>";
Loading your original into an XDocument and using the following code gives your desired output:
var document = XDocument.Parse(original);
document.Descendants()
.Where(e => e.IsEmpty || String.IsNullOrWhiteSpace(e.Value))
.Remove();
This is meant to be an improvement on the accepted answer to handle attributes:
XDocument xd = XDocument.Parse(original);
xd.Descendants()
.Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(a.Value))
&& string.IsNullOrWhiteSpace(e.Value)
&& e.Descendants().SelectMany(c => c.Attributes()).All(ca => ca.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(ca.Value))))
.Remove();
The idea here is to check that all attributes on an element are also empty before removing it. There is also the case that empty descendants can have non-empty attributes. I inserted a third condition to check that the element has all empty attributes among its descendants. Considering the following document with node8 added:
<root>
<node />
<node2 blah='' adf='2'></node2>
<node3>
<child />
</node3>
<node4></node4>
<node5><![CDATA[asdfasdf]]></node5>
<node6 xmlns='urn://blah' d='a'/>
<node7 xmlns='urn://blah2' />
<node8>
<child2 d='a' />
</node8>
</root>
This would become:
<root>
<node2 blah="" adf="2"></node2>
<node5><![CDATA[asdfasdf]]></node5>
<node6 xmlns="urn://blah" d="a" />
<node8>
<child2 d="a" />
</node8>
</root>
The original answer and the earlier improved version would lose the node2, node6 and node8 nodes. Checking for e.IsEmpty would work if you only want to strip out nodes like <node />, but it's redundant if you're going for both <node /> and <node></node>. If you also need to remove empty attributes, you could do this:
xd.Descendants().Attributes().Where(a => string.IsNullOrWhiteSpace(a.Value)).Remove();
xd.Descendants()
.Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration))
&& string.IsNullOrWhiteSpace(e.Value))
.Remove();
which would give you:
<root>
<node2 adf="2"></node2>
<node5><![CDATA[asdfasdf]]></node5>
<node6 xmlns="urn://blah" d="a" />
</root>
As always, it depends on your requirements.
Do you know how the empty tags will appear (e.g. <pig />, <pig></pig>, etc.)? I usually do not recommend using regular expressions for this (they are really useful, but at the same time they are evil). A string.Replace approach is also problematic unless your XML has a known, fixed structure.
In the end, I would recommend an XML parser approach (make sure your input is valid XML).
var doc = XDocument.Parse(original);
var emptyElements = from descendant in doc.Descendants()
where descendant.IsEmpty || string.IsNullOrWhiteSpace(descendant.Value)
select descendant;
emptyElements.Remove();
Anything you use will have to pass through the file at least once. If it's just a single named tag that you know, then regex is your friend; otherwise use a stack approach. Start with the parent tag, and if it has a sub tag, push it onto the stack. If you find an empty tag, remove it. Once you have gone through the child tags and reached the ending tag of what is on top of the stack, pop it and check it as well. If it's empty, remove it too. This way you can remove all empty tags, including tags with only empty children.
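A rough sketch of that idea, with two assumptions that are mine rather than part of the answer: it uses XDocument instead of a raw text scan, and it lets recursion play the role of the explicit stack. Treating an element with attributes as non-empty is also my assumption. PruneEmpty is a made-up helper name.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

const string petXml = @"<pet>
  <cat>Tom</cat>
  <pig />
  <dog>Puppy</dog>
  <snake></snake>
  <elephant>
    <africanElephant></africanElephant>
    <asianElephant>Biggy</asianElephant>
  </elephant>
  <tiger>
    <tigerWoods></tigerWoods>
    <americanTiger></americanTiger>
  </tiger>
</pet>";

var pets = XDocument.Parse(petXml);
PruneEmpty(pets.Root);
Console.WriteLine(pets);

// Depth-first: children are pruned before their parent is tested, so a parent
// that only held empty children is itself reported as empty.
static bool PruneEmpty(XElement element)
{
    // Copy the child list first, because we remove while iterating.
    foreach (var child in element.Elements().ToList())
    {
        if (PruneEmpty(child)) child.Remove();
    }
    return !element.HasAttributes
        && !element.HasElements
        && string.IsNullOrWhiteSpace(element.Value);
}
```

With the sample from the question this leaves cat, dog and elephant (with only asianElephant inside).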
If you are after a reg ex expression use this
XDocument is probably simplest to implement, and will give adequate performance if you know your documents are reasonably small.
XmlTextReader will be faster and use less memory than XDocument when processing very large documents.
Regex is best for handling text rather than XML. It might not handle all edge cases as you would like (e.g. a tag within a CDATA section; a tag with an xmlns attribute), so is probably not a good idea for a general implementation, but may be adequate depending on how much control you have of the input XML.
XmlTextReader is preferable if we are talking about performance (it provides fast, forward-only access to XML). You can determine whether a tag is self-closing, such as <pig />, using the XmlReader.IsEmptyElement property; note that it returns false for <pig></pig>.
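A quick sketch of what IsEmptyElement reports, using XmlReader.Create rather than XmlTextReader directly (my choice; the property is the same). The tiny input string is made up for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

var isEmptyByName = new Dictionary<string, bool>();

using (var reader = XmlReader.Create(new StringReader("<pet><pig /><snake></snake></pet>")))
{
    while (reader.Read())
    {
        // IsEmptyElement is only meaningful while positioned on a start tag.
        if (reader.NodeType == XmlNodeType.Element)
        {
            isEmptyByName[reader.Name] = reader.IsEmptyElement;
        }
    }
}

Console.WriteLine(isEmptyByName["pig"]);   // True
Console.WriteLine(isEmptyByName["snake"]); // False
```

So with a reader you would still have to look ahead for the matching end tag to catch the <snake></snake> form.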
XDocument approach which produces desired output:
public static bool IsEmpty(XElement n)
{
return n.IsEmpty
|| (string.IsNullOrEmpty(n.Value)
&& (!n.HasElements || n.Elements().All(IsEmpty)));
}
var doc = XDocument.Parse(original);
var emptyNodes = doc.Descendants().Where(IsEmpty);
foreach (var emptyNode in emptyNodes.ToArray())
{
emptyNode.Remove();
}
If you have an XML document and you need to find certain nodes based on certain attribute values (four of them), which would be the better approach in terms of performance?
a) Filter the XML document (with XPath) to get a node list matching any of the attribute values, and then traverse the filtered node list with if-else to pick out the nodes having a particular attribute value.
b) Filter the XML document (with XPath) for each attribute value separately.
<Nodes>
<a class="myclass" type="type1">some text</a>
<a class="myclass" type="type2">some text</a>
<img src = "myGraphic.jpg?id={Guid}"/>
</Nodes>
I am using the below XPath (which might be incorrect :-))
"//A[@class] | //a[@class] | //IMG[@src] | //img[@src]"
The goal is to get a separate list of all a elements having type="type1", a separate list of those having type="type2", and a separate list of the ids in the img tags.
My rough answer would be, performance is not going to really matter much unless you have a very large document or set of documents.
In that case, you probably will want to use SAX, and in any case, you'll want to traverse the document(s) only once, and not hold the whole thing in memory. So you'll be streaming through the documents, stopping at each a element, and appending it to list1 or list2 depending on its type.
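Here is a hedged sketch of that single-pass idea. .NET has no SAX parser as such, so I am assuming XmlReader as the streaming equivalent; the list names and the way the id is cut out of src are illustrative guesses at what is wanted.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

const string nodesXml = @"<Nodes>
  <a class=""myclass"" type=""type1"">some text</a>
  <a class=""myclass"" type=""type2"">some text</a>
  <img src=""myGraphic.jpg?id={Guid}""/>
</Nodes>";

var type1Links = new List<XElement>();
var type2Links = new List<XElement>();
var imageIds = new List<string>();

using (var reader = XmlReader.Create(new StringReader(nodesXml)))
{
    reader.MoveToContent();
    while (!reader.EOF)
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
        {
            // ReadFrom materializes just this element and moves the reader past it.
            var link = (XElement)XNode.ReadFrom(reader);
            string type = (string)link.Attribute("type");
            if (type == "type1") type1Links.Add(link);
            else if (type == "type2") type2Links.Add(link);
        }
        else if (reader.NodeType == XmlNodeType.Element && reader.Name == "img")
        {
            string src = reader.GetAttribute("src") ?? "";
            int idAt = src.IndexOf("id=", StringComparison.Ordinal);
            if (idAt >= 0) imageIds.Add(src.Substring(idAt + 3));
            reader.Read();
        }
        else
        {
            reader.Read();
        }
    }
}

Console.WriteLine($"{type1Links.Count} type1, {type2Links.Count} type2, {imageIds.Count} image id(s)");
```

Only one a element is held in memory at a time, which is the point of streaming for very large inputs.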
I have written a chunk of XML parsing which works successfully provided I use an absolute path.
I now need to take an XmlNode as an argument and run an XPath query against it.
Does anyone know how to do this?
I tried using relative XPath queries without any success!!
Should it be this hard??
It would help to see examples of XPath expressions that don't work as you think they should. Here are some possible causes (mistakes I frequently make).
Assume an XML document such as:
<A>
<B>
<C d='e'/>
</B>
<C/>
<D xmlns="http://foo"/>
</A>
forgetting to remove the top-level slash ('/') representing the document:
document.XPathSelectElements("/A") // selects a single A node
document.XPathSelectElements("//B") // selects a single B node
document.XPathSelectElements("//C") // selects two C nodes
but
aNode.XPathSelectElements("/B") // selects nothing (this looks for a rootNode with name B)
aNode.XPathSelectElements("B") // selects a B node
bNode.XPathSelectElements("//C") // selects TWO C nodes - all descendants of the root node
bNode.XPathSelectElements(".//C") // selects one C node - all descendants of B
forgetting namespaces.
aNode.XPathSelectElements("D") // selects nothing (D is in a different namespace from A)
aNode.XPathSelectElements("*[local-name()='D' and namespace-uri()='http://foo']") // one D node
(This is often a problem when the root node carries a prefixless namespace - easy to miss)
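As an alternative to the local-name() form, here is a sketch that registers the namespace with an XmlNamespaceManager and passes it to XPathSelectElements. The prefix "f" is arbitrary and only exists in the query; the document itself uses a default namespace.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

const string nsXml = @"<A>
  <B><C d='e'/></B>
  <C/>
  <D xmlns='http://foo'/>
</A>";

var nsDoc = XDocument.Parse(nsXml);
var aNode = nsDoc.Root;

// Only the URI has to match the document; the prefix is our own choice.
var namespaces = new XmlNamespaceManager(new NameTable());
namespaces.AddNamespace("f", "http://foo");

int unqualified = aNode.XPathSelectElements("D").Count();               // 0
int qualified = aNode.XPathSelectElements("f:D", namespaces).Count();   // 1
int viaLinq = aNode.Elements(XName.Get("D", "http://foo")).Count();     // 1

Console.WriteLine($"{unqualified} {qualified} {viaLinq}");
```

The last line shows the same lookup without XPath, using a namespace-qualified XName.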
My XML looks like :
<?xml version="1.0"?>
<itemSet>
<Item>one</Item>
<Item>two</Item>
<Item>three</Item>
.....maybe more Items here.
</itemSet>
Any individual Item may or may not be present. Say I want to retrieve the element <Item>two</Item> if it's present. I've tried the following XPaths (in C#).
XmlNode node = myXMLdoc.SelectSingleNode("/itemSet[Item='two']") --- If Item two is present, this returns only the first element, one. Maybe this query just points to the first element in itemSet if it has an Item of value two somewhere as a child. Is this interpretation correct?
So I tried:
XmlNode node = myXMLdoc.SelectSingleNode("/itemSet[Item='two']/Item[1]") --- I read this query as: return the first <Item> element within itemSet that has value = 'two'. Am I correct?
This still returns only the first element, one. What am I doing wrong?
In both cases I can traverse the child nodes via the siblings and get to two, but that's not what I am looking for. Also, if two is absent, SelectSingleNode returns null. So the very fact that I get a node back does indicate the presence of element two, and had I wanted a boolean test to check for the presence of two, any of the above XPaths would suffice. But I actually need the full element <Item>two</Item> as my return node.
[My first question here, and my first time working with web programming, so I just learned the above XPaths and related xml stuff on the fly right now from past questions in SO. So be gentle, and let me know if I am a doofus or flouting any community rules. Thanks.]
I think you want:
myXMLdoc.SelectSingleNode("/itemSet/Item[text()='two']")
In other words, you want the Item which has text of two, not the itemSet containing it.
You can also use a single dot to indicate the context node, in your case:
myXMLdoc.SelectSingleNode("/itemSet/Item[.='two']")
EDIT: The difference between . and text() is that . means "this node" effectively, and text() means "all the text node children of this node". In both cases the comparison will be against the "string-value" of the LHS. For an element node, the string-value is "the concatenation of the string-values of all text node descendants of the element node in document order" and for a collection of text nodes, the comparison will check whether any text node is equal to the one you're testing against.
So it doesn't matter when the element content only has a single text node, but suppose we had:
<root>
<item name="first">x<foo/>y</item>
<item name="second">xy<foo/>ab</item>
</root>
Then an XPath expression of "root/item[.='xy']" will match the first item, but "root/item[text()='xy']" will match the second.
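If you want to check this yourself, here is a short sketch that runs both predicates against that sample with XmlDocument. The variable names are only for illustration.

```csharp
using System;
using System.Xml;

const string itemsXml = @"<root>
  <item name=""first"">x<foo/>y</item>
  <item name=""second"">xy<foo/>ab</item>
</root>";

var itemsDoc = new XmlDocument();
itemsDoc.LoadXml(itemsXml);

// "." compares the element's whole string-value: "x" + "y" = "xy" for the first item.
string byStringValue = itemsDoc
    .SelectSingleNode("/root/item[.='xy']")
    .Attributes["name"].Value;

// "text()" compares each text node child on its own: only the second item has an "xy" node.
string byTextNode = itemsDoc
    .SelectSingleNode("/root/item[text()='xy']")
    .Attributes["name"].Value;

Console.WriteLine(byStringValue); // first
Console.WriteLine(byTextNode);    // second
```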