XDocument.Parse is retaining unwanted white space when parsing my XML. It appears that my XML is "not indented," which means that white space is retained regardless of whether or not I send in the LoadOptions.PreserveWhitespace flag (http://msdn.microsoft.com/en-us/library/bb551294(v=vs.110).aspx).
This means that when I have XML like the following:
<?xml version="1.0" encoding="UTF-8"?>
<blah:Root xmlns:blah="example.blah.com">
<blah:Element>
value
</blah:Element>
</blah:Root>
and then look at
XDocument xDoc = XDocument.Parse(blahXml);
XNamespace blah = "example.blah.com";
XElement xEl = xDoc.Root.Element(blah + "Element");
string value = xEl.Value;
Console.WriteLine(value);
it will print "\n value\n" instead of "value".
How do I make XDocument.Parse always ignore white space regardless of whether or not I give it indented or not-indented XML?
White space between elements can be ignored (e.g.
<root>
<foo>foo 1</foo>
<foo>foo 2</foo>
</root>
can be parsed into a root element node with two foo child element nodes if white space is ignored, or into a root element node with five child nodes: text node, foo element node, text node, foo element node, text node), but if an element contains text data that includes white space, then that white space is considered significant. So your only option is indeed to use a method like Trim in the .NET Framework, or a function like normalize-space in XPath, to remove the white space when you process the element.
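As a minimal sketch of the Trim approach, assuming the XML from the question with its namespace declared as shown there (the class name TrimmedValueDemo and the helper TrimmedValue are made up for this illustration):

```csharp
using System;
using System.Xml.Linq;

public static class TrimmedValueDemo
{
    // Hypothetical helper: returns the element's text with leading and
    // trailing white space removed.
    public static string TrimmedValue(XElement element)
    {
        return element.Value.Trim();
    }

    public static void Main()
    {
        const string xml =
            "<blah:Root xmlns:blah=\"example.blah.com\">\n" +
            "  <blah:Element>\n" +
            "    value\n" +
            "  </blah:Element>\n" +
            "</blah:Root>";

        XNamespace blah = "example.blah.com";
        XElement element = XDocument.Parse(xml).Root.Element(blah + "Element");

        // The raw value still carries the line breaks and indentation.
        Console.WriteLine("[" + element.Value + "]");

        // Trimming happens when the value is consumed, not when it is parsed.
        Console.WriteLine("[" + TrimmedValue(element) + "]"); // prints "[value]"
    }
}
```

Note that Trim only removes white space at the two ends of the string; white space in the middle of the text is left alone.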
I have occasionally run across XML with some junk characters tossed in between the elements, which appears to be confusing whatever internal XNode/XElement method handles prettifying the Element.
The following...
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>");
prints out
<b>+
<inner1 /><inner2 /></b>
while this...
var badNode = XElement.Parse(@"<b>
<inner1/>
<inner2/>
</b>");
gives the expected
<b>
<inner1 />
<inner2 />
</b>
According to the debugger, the junk character gets parsed in as the XElement's "NextNode" property, which then apparently assigns the remaining XML as its "NextNode", causing the single line prettifying.
Is there any way to prevent/ignore this behavior, short of pre-screening the XML for any errant characters in between tag markers?
You are getting awkward indentation for badNode because, by adding the non-whitespace + character to the content of the <b> element, you have made the element contain mixed content, which the W3C defines as follows:
3.2.2 Mixed Content
[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]
The presence of mixed content inside an element triggers special formatting rules for XmlWriter (which is used internally by XElement.ToString() to actually write itself to an XML string) that are explained in the documentation remarks for XmlWriterSettings.Indent:
This property only applies to XmlWriter instances that output text content; otherwise, this setting is ignored.
The elements are indented as long as the element does not contain mixed content. Once the WriteString or WriteWhitespace method is called to write out a mixed element content, the XmlWriter stops indenting. The indenting resumes once the mixed content element is closed.
This explains the behavior you are seeing.
As a workaround, parsing your XML with LoadOptions.PreserveWhitespace, which preserves insignificant white space while parsing, might be what you want:
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>",
LoadOptions.PreserveWhitespace);
Console.WriteLine(badNode);
Which outputs:
<b>+
<inner1 />
<inner2 />
</b>
Alternatively, if you are sure that badNode should not have character data, you could strip it manually after parsing:
badNode.Nodes().OfType<XText>().Remove();
Now badNode will no longer contain mixed content and XmlWriter will indent it nicely.
Does anybody know the difference between the two statements below?
xdoc.Root.Value;
and
xdoc.Root.ToString();
From my own research, I can see that the first line removes the root node and replaces '\r\n' with '\n', whereas the second one keeps the content as in the original. Am I correct? Is there any documentation to back that up?
As I want to use the first line but keep the original Windows new lines, is there a way to do that?
Did you read the documentation?
Value:
A String that contains all of the text content of this element. If there are multiple text nodes, they will be concatenated.
ToString():
Returns the indented XML for this node.
The primary difference is:
ToString() includes the root element tags and the indentation/tabs.
For example:
<Root>
<Child1>1</Child1>
</Root>
Whereas Value contains no markup at all: it returns only the concatenated text of the element and its descendants, so you see neither the root tags nor the tags of the children.
For example:
1
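A quick way to check what each member returns is to print both for the same element. The sketch below uses a made-up two-child document (Child2 and the class name are additions for illustration) and also shows SaveOptions.DisableFormatting, which serializes the markup without added indentation:

```csharp
using System;
using System.Xml.Linq;

public static class ValueVersusToStringDemo
{
    public static void Main()
    {
        XElement root = XElement.Parse(
            "<Root><Child1>1</Child1><Child2>2</Child2></Root>");

        // Value: the concatenated text of all descendant text nodes, no tags.
        Console.WriteLine(root.Value); // prints "12"

        // ToString(): the element serialized as indented XML, tags included.
        Console.WriteLine(root.ToString());

        // ToString(SaveOptions.DisableFormatting): tags included, but no
        // indentation or line breaks are added by the writer.
        Console.WriteLine(root.ToString(SaveOptions.DisableFormatting));
    }
}
```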
I have a UI that uses the DataGridView to display the content of XML files.
If an XmlNode contains only InnerText, it's quite simple; however, I'm having a problem with nodes that contain child nodes (and not only a string).
Simple
<node>value</node>
Displayed as "value" in DataGridViewCell.
Complex
<node>
<foo>bar</foo>
<foo2>bar</foo2>
</node>
The problem is that the InnerXml markup is not indented, and it's very hard to modify in the UI.
I've tried to use XmlTextWriter to "beautify" the string. It works quite well, but it requires an XmlNode (it includes the node itself, not only its child nodes), and I cannot assign the result back to InnerXml.
I would like to either see following in the UI:
<foo>bar</foo>
<foo2>bar</foo2>
(this can be assigned to InnerXml afterwards)
Or
<node>
<foo>bar</foo>
<foo2>bar</foo2>
</node>
(and find a way how to replace OuterXml with this string).
Thanks for any ideas,
Martin
You can load the OuterXml into an XElement, then use String.Join() to join all child elements of the root node (in other words, the InnerXml), separated by line breaks. For example:
// Requires: using System; using System.Linq; using System.Xml.Linq;
XElement e = XElement.Parse(something.OuterXml);
var result = string.Join(
Environment.NewLine,
e.Elements().Select(o => o.ToString())
);
I have following XElement:
<title>
<bold>Foo</bold>
<italic>Bar</italic>
</title>
When I get the Value property, it returns FooBar without a space. How can I fix this?
By definition, the Value of the <title> element is the concatenation of all text in this element. By default whitespace between elements and their contents is ignored, so it gives "FooBar". You can specify that you want to preserve whitespace:
var element = XElement.Parse(xml, LoadOptions.PreserveWhitespace);
However it will preserve all whitespace, including the line feeds and indentation. In your XML, there is a line feed and two spaces between "Foo" and "Bar"; how is it supposed to guess that you only want to keep one space?
From the documentation for the Value property of the XElement class:
Gets or sets the concatenated text contents of this element.
Given your example, this behavior is expected. If you want spaces, you will have to provide the logic to do it.
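One possible form of that logic, as a sketch only: collect the text nodes below the element, trim each one, and join the non-empty pieces with a single space. The helper name JoinText and the choice of one space as the separator are assumptions for this illustration, not something LINQ to XML provides.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class JoinedTextDemo
{
    // Hypothetical helper: joins the trimmed, non-empty text nodes below
    // the element with a single space.
    public static string JoinText(XElement element)
    {
        var parts = element.DescendantNodes()
            .OfType<XText>()
            .Select(t => t.Value.Trim())
            .Where(s => s.Length > 0);
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        XElement title = XElement.Parse(
            "<title>\n  <bold>Foo</bold>\n  <italic>Bar</italic>\n</title>");

        Console.WriteLine(title.Value);     // prints "FooBar"
        Console.WriteLine(JoinText(title)); // prints "Foo Bar"
    }
}
```

Because empty pieces are filtered out, the helper gives the same result whether or not the XML was parsed with LoadOptions.PreserveWhitespace.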
My XML looks like :
<?xml version="1.0"?>
<itemSet>
<Item>one</Item>
<Item>two</Item>
<Item>three</Item>
.....maybe more Items here.
</itemSet>
Some of the individual Item may or may not be present. Say I want to retrieve the element <Item>two</Item> if it's present. I've tried the following XPaths (in C#).
XmlNode node = myXMLdoc.SelectSingleNode("/itemSet[Item='two']") --- If Item two is present, then it returns me only the first element one. Maybe this query just points to the first element in itemSet, if it has an Item of value two somewhere as a child. Is this interpretation correct?
So I tried:
XmlNode node = myXMLdoc.SelectSingleNode("/itemSet[Item='two']/Item[1]") --- I read this query as: return me the first <Item> element within itemSet that has value = 'two'. Am I correct?
This still returns only the first element one. What am I doing wrong?
In both cases, I can traverse the child nodes using the siblings and get to two, but that's not what I am looking for. Also, if two is absent, then SelectSingleNode returns null. Thus the very fact that I am getting a successful return node does indicate the presence of element two, so had I wanted a boolean test to check for the presence of two, either of the above XPaths would suffice, but I actually need the full element <Item>two</Item> as my return node.
[My first question here, and my first time working with web programming, so I just learned the above XPaths and related xml stuff on the fly right now from past questions in SO. So be gentle, and let me know if I am a doofus or flouting any community rules. Thanks.]
I think you want:
myXMLdoc.SelectSingleNode("/itemSet/Item[text()='two']")
In other words, you want the Item which has text of two, not the itemSet containing it.
You can also use a single dot to indicate the context node, in your case:
myXMLdoc.SelectSingleNode("/itemSet/Item[.='two']")
EDIT: The difference between . and text() is that . means "this node" effectively, and text() means "all the text node children of this node". In both cases the comparison will be against the "string-value" of the LHS. For an element node, the string-value is "the concatenation of the string-values of all text node descendants of the element node in document order" and for a collection of text nodes, the comparison will check whether any text node is equal to the one you're testing against.
So it doesn't matter when the element content only has a single text node, but suppose we had:
<root>
<item name="first">x<foo/>y</item>
<item name="second">xy<foo/>ab</item>
</root>
Then an XPath expression of "root/item[.='xy']" will match the first item, but "root/item[text()='xy']" will match the second.
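The difference can be checked with a short sketch against that same document. The class name XPathTextDemo and the helper FirstMatchName are made up for this illustration; SelectSingleNode and LoadXml are the standard XmlDocument members.

```csharp
using System;
using System.Xml;

public static class XPathTextDemo
{
    // Hypothetical helper: returns the name attribute of the first node
    // matched by the XPath expression, or null when nothing matches.
    public static string FirstMatchName(XmlDocument doc, string xpath)
    {
        XmlNode node = doc.SelectSingleNode(xpath);
        return node?.Attributes["name"].Value;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root>" +
            "<item name=\"first\">x<foo/>y</item>" +
            "<item name=\"second\">xy<foo/>ab</item>" +
            "</root>");

        // "." compares against the whole string-value of the element ("xy").
        Console.WriteLine(FirstMatchName(doc, "/root/item[.='xy']"));      // prints "first"

        // text() compares against each individual text node child.
        Console.WriteLine(FirstMatchName(doc, "/root/item[text()='xy']")); // prints "second"
    }
}
```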