I am looking for a good approach that can remove empty tags from XML efficiently. What do you recommend? Regex? XDocument? XmlTextReader?
For example,
const string original =
#"<?xml version=""1.0"" encoding=""utf-16""?>
<pet>
<cat>Tom</cat>
<pig />
<dog>Puppy</dog>
<snake></snake>
<elephant>
<africanElephant></africanElephant>
<asianElephant>Biggy</asianElephant>
</elephant>
<tiger>
<tigerWoods></tigerWoods>
<americanTiger></americanTiger>
</tiger>
</pet>";
Could become:
const string expected =
#"<?xml version=""1.0"" encoding=""utf-16""?>
<pet>
<cat>Tom</cat>
<dog>Puppy</dog>
<elephant>
<asianElephant>Biggy</asianElephant>
</elephant>
</pet>";
Loading your original into an XDocument and using the following code gives your desired output:
var document = XDocument.Parse(original);
document.Descendants()
.Where(e => e.IsEmpty || String.IsNullOrWhiteSpace(e.Value))
.Remove();
This is meant to be an improvement on the accepted answer to handle attributes:
XDocument xd = XDocument.Parse(original);
xd.Descendants()
.Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(a.Value))
&& string.IsNullOrWhiteSpace(e.Value)
&& e.Descendants().SelectMany(c => c.Attributes()).All(ca => ca.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(ca.Value))))
.Remove();
The idea here is to check that all attributes on an element are also empty before removing it. There is also the case that empty descendants can have non-empty attributes. I inserted a third condition to check that the element has all empty attributes among its descendants. Considering the following document with node8 added:
<root>
<node />
<node2 blah='' adf='2'></node2>
<node3>
<child />
</node3>
<node4></node4>
<node5><![CDATA[asdfasdf]]></node5>
<node6 xmlns='urn://blah' d='a'/>
<node7 xmlns='urn://blah2' />
<node8>
<child2 d='a' />
</node8>
</root>
This would become:
<root>
<node2 blah="" adf="2"></node2>
<node5><![CDATA[asdfasdf]]></node5>
<node6 xmlns="urn://blah" d="a" />
<node8>
<child2 d='a' />
</node8>
</root>
The original and improved answer to this question would lose the node2 and node6 and node8 nodes. Checking for e.IsEmpty would work if you only want to strip out nodes like <node />, but it's redunant if you're going for both <node /> and <node></node>. If you also need to remove empty attributes, you could do this:
xd.Descendants().Attributes().Where(a => string.IsNullOrWhiteSpace(a.Value)).Remove();
xd.Descendants()
.Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration))
&& string.IsNullOrWhiteSpace(e.Value))
.Remove();
which would give you:
<root>
<node2 adf="2"></node2>
<node5><![CDATA[asdfasdf]]></node5>
<node6 xmlns="urn://blah" d="a" />
</root>
As always, it depends on your requirements.
Do you know how the empty tag will display? (e.g. <pig />, <pig></pig>, etc.) I usually do not recommend using Regular Expressions (they are really useful but at the same time they are evil). Also considering a string.Replace approach seems to be problematic unless your XML doesn't have a certain structure.
Finally, I would recommend using an XML parser approach (make sure your code is valid XML).
var doc = XDocument.Parse(original);
var emptyElements = from descendant in doc.Descendants()
where descendant.IsEmpty || string.IsNullOrWhiteSpace(descendant.Value)
select descendant;
emptyElements.Remove();
Anything you use will have to pass through the file once at least. If its just a single named tag that you know then regex is your friend otherwise use a stack approach. Start with parent tag and if it has a sub tag place it in stack. If you find an empty tag remove it then once you have gone through child tags and reached the ending tag of what you have on top of stack then pop it and check it as well. If its empty remove it as well. This way you can remove all empty tags including tags with empty children.
If you are after a reg ex expression use this
XDocument is probably simplest to implement, and will give adequate performance if you know your documents are reasonably small.
XmlTextReader will be faster and use less memory than XDocument when processing very large documents.
Regex is best for handling text rather than XML. It might not handle all edge cases as you would like (e.g. a tag within a CDATA section; a tag with an xmlns attribute), so is probably not a good idea for a general implementation, but may be adequate depending on how much control you have of the input XML.
XmlTextReader is preferable if we are talking about performance (it provides fast, forward-only access to XML). You can determine if tag is empty using XmlReader.IsEmptyElement property.
XDocument approach which produces desired output:
public static bool IsEmpty(XElement n)
{
return n.IsEmpty
|| (string.IsNullOrEmpty(n.Value)
&& (!n.HasElements || n.Elements().All(IsEmpty)));
}
var doc = XDocument.Parse(original);
var emptyNodes = doc.Descendants().Where(IsEmpty);
foreach (var emptyNode in emptyNodes.ToArray())
{
emptyNode.Remove();
}
Related
I have below XML files. While I'm using it in Process its gives me error like
[2016-06-22 19:29:53 IST] ERROR: "Line 15 column 20: character content of element "electronics" invalid; must be equal to "foods", "goods" or "mobiles" at XPath
/products/product[1]/item_type"
This XPath /products/product[1] is not correct. it's returns by their system.
So, I have to search it and replace with original value through Line No only.
Order.xml
<products>
<product>
<country>AE</country>
<sale>false</sale>
<item_type>foods</item_type>
</product>
<product>
<country>IN</country>
<sale>false</sale>
<item_type>goods</item_type>
</product>
<product>
<country>US</country>
<sale>false</sale>
<item_type>electronics</item_type>
</product>
<product>
<country>AM</country>
<sale>false</sale>
<item_type>mobiles</item_type>
</product>
</products>
Please let me know to do search using Line No(15) in XML document
You should not rely on line/colmns numbers when dealing with xml. Xml does not care much for whitespace and formatting is a merely convenience for readabilty; Although it is correctly reported in your example, I would not trust it will alway be that way in future. Interstingly the xpath is also incorrect but lets not dwell on that for the moment.
You need to take back control of the problem... You should create your own validating reader and capture the validation error while the xml is being parsed. How you do this will depend on which framework your using. Either XmlValidatingReader or, XmlReader with the appropriate XmlReaderSettings.
You can set up a callback/eventhandler to capture the error as the xml file is being read, so you can be certain your in the correct place and have all the information you need to handle the error to hand. Using an XmlReader will also allow you to continue to process the entire file and not just stop at the first error.
The code is too big for SO and we'd require a lot more information from you to do it but a google search will find lots of examples including this one from microsoft: https://msdn.microsoft.com/en-us/library/w5aahf2a(v=vs.100).aspx
Construct a new XmlReaderSettings instance.
Add an XML schema to the Schemas property of the XmlReaderSettings instance.
Specify Schema as the ValidationType.
Specify ValidationFlags and a ValidationEventHandler to handle schema validation errors and warnings encountered during validation.
Pass the XmlReaderSettings object to the Create method of the XmlReader class along with the XML document, creating a schema-validating XmlReader.
Call Read() to run through the xml from start to end.
Would this work for you?
string[] lines = File.ReadAllLines(#"C:\order.xml", Encoding.UTF8);
var line15 = lines[14];
XmlDocument doc = new XmlDocument();
doc.Load(#"C:\order.xml");
XmlElement el =(XmlElement)doc.SelectSingleNode("/products/product[1]/item_type");
if(el != null) { el.ParentNode.RemoveChild(el); }
you can remove node by using above code...using proper Xpath you can remove all the item_type which is not equals to foods,good or mobiles.
Try this,
var xml = XDocument.Load(#"C:\order.xml", LoadOptions.SetLineInfo);
var lineNumbers = xml.Descendants()
.Where(x => !x.Descendants().Any() && //exact node contains the value
x.Value.Contains("foods"))
.Cast<IXmlLineInfo>()
.Select(x => x.LineNumber);
int getLineNumber = lineNumbers.First();
I have a UI that uses the DataGridView to display the content of XML files.
If XmlNode contains only InnerText, it's quite simple, however I'm having a problem with nodes that contains childnodes (and not only string).
Simple
<node>value</node>
Displayed as "value" in DataGridViewCell.
Complex
<node>
<foo>bar</foo>
<foo2>bar</foo2>
</node>
The problem is that the InnerXml code is not intended and it's very hard to modify in UI.
I've tried to use XmlTextWriter to "beautify" the string - it works quite well, however requires a XmlNode (includes node, not only childnodes) and I cannot assign it back to InnerXml.
I would like to either see following in the UI:
<foo>bar</foo>
<foo2>bar</foo2>
(this can be assigned to InnerXml afterwards)
Or
<node>
<foo>bar</foo>
<foo2>bar</foo2>
</node>
(and find a way how to replace OuterXml with this string).
Thanks for any ideas,
Martin
You can load the OuterXml to XElement, then use String.Join() to join all child elements of the root node (in other point-of-view, the InnerXml) separated by line break, for example :
XElement e = e.Parse(something.OuterXml);
var result = string.Join(
Environment.NewLine,
e.Elements().Select(o => o.ToString())
);
I have an XDocument object where I am trying to get the direct parent element based on a child element's value.
Getting the child element's value has been no issue, but I am struggling with finding the correct way to get only the parent element. Having not worked with XML much, I have a suspicion that the solution is simple and I am overthinking it.
Essentially, based on the below XML, if <Active>true</Active> then I want the direct parent element (i.e. <AlertNotification>) and no other elements.
Thank you in advance.
An example of the XML
<?xml version="1.0" encoding="utf-16"?>
<Policies xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLschema">
<PolicyID>1</PolicyID>
<EmailNotification>
<Active>false</Active>
</EmailNotification>
<AlertNotification>
<Active>true</Active>
</AlertNotification>
<AlarmEnabled>
<Active>false</Active>
</AlarmEnabled>
</Policies>
I thinks you should replace the utf-16 in the first line to utf-8. Then you may try this:
XDocument doc = XDocument.Load(your file);
var elements = doc.Descendants("Active")
.Where(i => i.Value == "true")
.Select(i => i.Parent);
<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<stock-items>
<stock-item>
<name>Loader 34</name>
<sku>45GH6</sku>
<vendor>HITINANY</vendor>
<useage>Lifter 45 models B to C</useage>
<typeid>01</typeid>
<version>01</version>
<reference>33</reference>
<comments>EOL item. No Re-order</comments>
<traits>
<header>56765</header>
<site>H4</site>
<site>A6</site>
<site>V1</site>
</traits>
<type-validators>
<actions>
<endurance-tester>bake/shake</endurance-tester>
</actions>
<rules>
<results-file>Test-Results.txt</results-file>
<file-must-contain file-name="Test-Results.xml">
<search>
<term>[<![CDATA[<"TEST TYPES 23 & 49 PASSED"/>]]></term>
<search-type>exactMatch</search-type>
</search>
</file-must-contain>
</rules>
</type-validators>
</stock-item>
</stock-items>
Im trying to get the rules fragment from the xml above into a string so it can be added to a database. Currently the search element and its contents are added twice. I know why this is happing but cant figure out how to prevent it.
Heres my code
var Rules = from rules in Type.Descendants("rules")
select rules.Descendants();
StringBuilder RulesString = new StringBuilder();
foreach (var rule in Rules)
{
foreach (var item in rule)
{
RulesString.AppendLine(item.ToString());
}
}
Console.WriteLine(RulesString);
Finally any elements in rules are optional and some of these elements may or may not contain other child elements up to 4 or 5 levels deep. TIA
UPDATE:
To try and make it clearer what im trying to achieve.
From the xml above I should end up with a string containing everthing in the rules element, exactly like this:
<results-file>Test-Results.txt</results-file>
<file-must-contain file-name="Test-Results.xml">
<search>
<term>[<![CDATA[<"TEST TYPES 23 & 49 PASSED"/>]]></term>
<search-type>exactMatch</search-type>
</search>
</file-must-contain>
Objective is to extract the entire contents of the rules element as is while taking account that the rules element may or may not contains child elements several levels deep
If you just want the entirety of the rules element as a string (rather than caring about its contents as xml), you don't need to dig into its contents, you just need to get the element as an XNode and then call ToString() on it :
The following example uses this method to retrieve indented XML.
XElement xmlTree = new XElement("Root",
new XElement("Child1", 1)
);
Console.WriteLine(xmlTree);
This example produces the following output:
<Root>
<Child1>1</Child1>
</Root>
if you want to prevent duplicates than you will need to use Distinct() or GroupBy() after parsing the xml and before building the string.
I'm still not fully understanding exactly what the output should be, so I can't provide a clear solution on what exactly to use, or how, in terms of locating duplicates. If you can refine the original post that would help.
we need the structure of the xml as it would appear in your scenario. nesting and all.
we need an example of the final string.
saving it to a db doesn't really matter for this post so you only need to briefly mention that once, if at all.
I have several XML documents, all of which have the same structure (element names, attribute names and hierarchy).
However, some of the elements and attribute have custom namespaces in each XML document which are not known at design time. They change, don't ask...
How can I deal with this when traversing the documents using a single set of XPath?
Should I remove all the namespaces before processing?
Can I automatically register all namespaces with an XmlNamespaceManager?
Any thoughts?
Update: some examples (with namespace declarations omitted for clarity):
<root>
<child attr="val" />
</root>
<root>
<x:child attr="val" />
</root>
<root>
<y:child z:attr="val" />
</root>
Thanks
Suppose you have following xml:
<root xmlns="first">
<el1 xmlns="second">
<el2 xmlns="third">...
You can write you queries to ignore namespaces in the following way:
/*[local-name()='root']/*[local-name()='el1']/*[local-name()='el2']
etc.
Of course you can iterate over the whole document to get namespaces and load them into nsmanager. But in general case this will cause you to evaluate every node in the document. In this case it will be faster to just treat document as a tree of objects and don't use XPath.
I believe you'll find some good insight in this Stackoverflow thread
XPath + Namespace Driving me crazy
In my opinion you have either of two solutions:
1- If the set of all possible namespaces are know before hand, then you can register them all in a XmlNamespaceManager before you begin parsing
2- Use Xpath namespace-agnostic selectors
Of course you can always scrub the xml document from any inline namespaces and start your parsing on a clean unfiorm xml without namespace.. but honestly I don't see the gain in adding this overhead step.
Scott Hanselman has a nice article about extracting all of the XML Namespaces in an XML document. Presumably, when you get all of the XML Namespaces, you can just iterate over all of them and register them in your namespace manager.
You could try something like this to strip the namespaces:
//Implemented based on interface, not part of algorithm
public string RemoveAllNamespaces(string xmlDocument)
{
return RemoveAllNamespaces(XElement.Parse(xmlDocument)).ToString();
}
//Core recursion function
private XElement RemoveAllNamespaces(XElement xmlDocument)
{
if (!xmlDocument.HasElements)
{
XElement xElement = new XElement(xmlDocument.Name.LocalName);
xElement.Value = xmlDocument.Value;
return xElement;
}
return new XElement(xmlDocument.Name.LocalName, xmlDocument.Elements().Select(el => RemoveAllNamespaces(el)));
}
See Peter Stegnar's answer here for more details:
How to remove all namespaces from XML with C#?
You can also use direct node tests with wildcards, which will match any namespace (or lack thereof):
$your-document/*:root/*:child/#*:attr