XmlWriter inserting spaces when xml:space=preserve

XmlWriter inserting spaces when xml:space=preserve - c#

Given this code (C#, .NET 3.5 SP1):
var doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\"?><root>"
+ "<value xml:space=\"preserve\">"
+ "<item>content</item>"
+ "<item>content</item>"
+ "</value></root>");
var text = new StringWriter();
var settings = new XmlWriterSettings() { Indent = true, CloseOutput = true };
using (var writer = XmlWriter.Create(text, settings))
{
doc.DocumentElement.WriteTo(writer);
}
var xml = text.GetStringBuilder().ToString();
Assert.AreEqual("<?xml version=\"1.0\" encoding=\"utf-16\"?>\r\n<root>\r\n"
+ " <value xml:space=\"preserve\"><item>content</item>"
+ "<item>content</item></value>\r\n</root>", xml);
The assertion fails because the XmlWriter is inserting a newline and indent around the <item> elements, which would seem to contradict the xml:space="preserve" attribute.
I am trying to take input with no whitespace (or only significant whitespace, and already loaded into an XmlDocument) and pretty-print it without adding any whitespace inside elements marked to preserve whitespace (for obvious reasons).
Is this a bug or am I doing something wrong? Is there a better way to achieve what I'm trying to do?
Edit: I should probably add that I do have to use an XmlWriter with Indent=true on the output side. In the "real" code, this is being passed in from outside of my code.

Ok, I've found a workaround.
It turns out that XmlWriter does the correct thing if there actually is any whitespace within the xml:space="preserve" block -- it's only when there isn't any that it screws up and adds some. And conveniently, this also works if there are some whitespace nodes, even if they're empty. So the trick that I've come up with is to decorate the document with extra 0-length whitespace in the appropriate places before trying to write it out. The result is exactly what I want: pretty printing everywhere except where whitespace is significant.
The workaround is to change the inner block to:
PreserveWhitespace(doc.DocumentElement);
doc.DocumentElement.WriteTo(writer);
...
private static void PreserveWhitespace(XmlElement root)
{
var nsmgr = new XmlNamespaceManager(root.OwnerDocument.NameTable);
foreach (var element in root.SelectNodes("//*[#xml:space='preserve']", nsmgr)
.OfType<XmlElement>())
{
if (element.HasChildNodes && !(element.FirstChild is XmlSignificantWhitespace))
{
var whitespace = element.OwnerDocument.CreateSignificantWhitespace("");
element.InsertBefore(whitespace, element.FirstChild);
}
}
}
I'm still thinking that this behaviour of XmlWriter is a bug, though.

Related

C# XmlReader reads XML wrong and different based on how I invoke the reader's methods

So my current understanding of how the C# XmlReader works is that it takes a given XML File and reads it node-by-node when I wrap it in a following construct:
using System.Xml;
using System;
using System.Diagnostics;
...
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreComments = true;
settings.IgnoreWhitespace = true;
settings.IgnoreProcessingInstructions = true;
using (XmlReader reader = XmlReader.Create(path, settings))
{
while (reader.Read())
{
// All reader methods I call here will reference the current node
// until I move the pointer to some further node by calling methods like
// reader.Read(), reader.MoveToContent(), reader.MoveToElement() etc
}
}
Why will the following two snippets (within the above construct) produce two very different results, even though they both call the same methods?
I used this example file for testing.
Debug.WriteLine(new string(' ', reader.Depth * 2) + "<" + reader.NodeType.ToString() + "|" + reader.Name + ">" + reader.ReadString() + "</>");
(Snippet 1)
vs
(Snippet 2)
string xmlcontent = reader.ReadString();
string xmlname = reader.Name.ToString();
string xmltype = reader.NodeType.ToString();
int xmldepth = reader.Depth;
Debug.WriteLine(new string(' ', xmldepth * 2) + "<" + xmltype + "|" + xmlname + ">" + xmlcontent + "</>");
Output of Snippet 1:
<XmlDeclaration|xml></>
<Element|rss></>
<Element|head></>
<Text|>Test Xml File</>
<Element|description>This will test my xml reader</>
<EndElement|head></>
<Element|body></>
<Element|g:id>1QBX23</>
<Element|g:title>Example Title</>
<Element|g:description>Example Description</>
<EndElement|item></>
<Element|item></>
<Text|>2QXB32</>
<Element|g:title>Example Title</>
<Element|g:description>Example Description</>
<EndElement|item></>
<EndElement|body></>
<EndElement|xml></>
<EndElement|rss></>
Yes, this is formatted as it was in my output window. As to be seen it skipped certain elements and outputted a wrong depth for a few others. Therefore, the NodeTypes are correct, unlike Snippet Number 2, which outputs:
<XmlDeclaration|xml></>
<Element|xml></>
<Element|title></>
<EndElement|title>Test Xml File</>
<EndElement|description>This will test my xml reader</>
<EndElement|head></>
<Element|item></>
<EndElement|g:id>1QBX23</>
<EndElement|g:title>Example Title</>
<EndElement|g:description>Example Description</>
<EndElement|item></>
<Element|g:id></>
<EndElement|g:id>2QXB32</>
<EndElement|g:title>Example Title</>
<EndElement|g:description>Example Description</>
<EndElement|item></>
<EndElement|body></>
<EndElement|xml></>
<EndElement|rss></>
Once again, the depth is messed up, but it's not as critical as with Snippet Number 1. It also skipped some elements and assigned wrong NodeTypes.
Why can't it output the expected result? And why do these two snippets produce two totally different outputs with different depths, NodeTypes and skipped nodes?
I'd appreciate any help on this. I searched a lot for any answers on this but it seems like I'm the only one experiencing these issues. I'm using the .NET Framework 4.6.2 with Asp.net Web Forms in Visual Studio 2017.

Firstly, you are using a method XmlReader.ReadString() that is deprecated:
XmlReader.ReadString Method
... reads the contents of an element or text node as a string. However, we recommend that you use the ReadElementContentAsString method instead, because it provides a more straightforward way to handle this operation.
However, beyond warning us off the method, the documentation doesn't precisely specify what it actually does. To determine that, we need to go to the reference source:
public virtual string ReadString() {
if (this.ReadState != ReadState.Interactive) {
return string.Empty;
}
this.MoveToElement();
if (this.NodeType == XmlNodeType.Element) {
if (this.IsEmptyElement) {
return string.Empty;
}
else if (!this.Read()) {
throw new InvalidOperationException(Res.GetString(Res.Xml_InvalidOperation));
}
if (this.NodeType == XmlNodeType.EndElement) {
return string.Empty;
}
}
string result = string.Empty;
while (IsTextualNode(this.NodeType)) {
result += this.Value;
if (!this.Read()) {
break;
}
}
return result;
}
This method does the following:
If the current node is an empty element node, return an empty string.
If the current node is an element that is not empty, advance the reader.
If the now-current node is the end of the element, return an empty string.
While the current node is a text node, add the text to a string and advance the reader. As soon as the current node is not a text node, return the accumulated string.
Thus we can see that this method is designed to advance the reader. We can also see that, given mixed-content XML like <head>text <b>BOLD</b> more text</head>, ReadString() will only partially read the <head> element, leaving the reader positioned on <b>. This oddity is likely why Microsoft deprecated the method.
We can also see why your two snippets function differently. In the first, you get reader.Depth and reader.NodeType before calling ReadString() and advancing the reader. In the second you get these properties after advancing the reader.
Since your intent is to iterate through the nodes and get the value of each, rather than ReadString() or ReadElementContentAsString() you should just use XmlReader.Value:
gets the text value of the current node.
Thus your corrected code should look like:
string xmlcontent = reader.Value;
string xmlname = reader.Name.ToString();
string xmltype = reader.NodeType.ToString();
int xmldepth = reader.Depth;
Console.WriteLine(new string(' ', xmldepth * 2) + "<" + xmltype + "|" + xmlname + ">" + xmlcontent + "</>");
XmlReader is tricky to work with. You always need to check the documentation to determine exactly where a given method positions the reader. For instance, XmlReader.ReadElementContentAsString() moves the reader past the end of the element, whereas XmlReader.ReadSubtree() moves the reader to the end of the element. But as a general rule any method named Read is going to advance the reader, so you need to be careful using a Read method inside an outer while (reader.Read()) loop.
Demo fiddle here.

IgnoreWhiteSpace not ignoring whitespace at beginning of xml string

Question
Should whitespace be ignored at the beginning of my multi-line string literal xml?
Code
string XML = #"
<?xml version=""1.0"" encoding=""utf-8"" ?>"
using (StringReader stringReader = new StringReader(XML))
using (XmlReader xmlReader = XmlReader.Create(stringReader,
new XmlReaderSettings() { IgnoreWhitespace = true }))
{
xmlReader.MoveToContent();
// further implementation withheld
}
Notice in the above code that there is white space before the XML declaration, this doesn't seem to be being ignored despite my setting of the IgnoreWhiteSpace property. Where am I going wrong?!
Note: I have the same behaviour when the XML string does not have a line break, and just a whitespace, as below. I know this will run if I remove the whitespace, my question is as to why the property doesn't take care of this?
string XML = #" <?xml version=""1.0"" encoding=""utf-8"" ?>"

The documentations say that the IgnoreWhitespace property will "Gets or sets a value indicating whether to ignore insignificant white space.". While that first whitespace (and also linebreak) should be insignificant, the one who made XmlReader apparently didn't think so. Just trim XML before use, and you'll be fine.
As stated in comments and for clarity, change your code to:
string XML = #"<?xml version=""1.0"" encoding=""utf-8"" ?>"
using (StringReader stringReader = new StringReader(XML.Trim()))
using (XmlReader xmlReader = XmlReader.Create(stringReader,
new XmlReaderSettings() { IgnoreWhitespace = true }))
{
xmlReader.MoveToContent();
// further implementation withheld
}

According to Microsoft's documentation regarding XML Declaration
The XML declaration typically appears as the first line in an XML
document. The XML declaration is not required, however, if used it
must be the first line in the document and no other content or white
space can precede it.
The parse should fail for your code because white space precedes the XML declaration. Removing either the white space OR the xml declaration will result in a successful parse.
In other words it would be a bug if XmlReaderSettings were at odds with the documentation for XML Declaration - it is defined behavior.
Here's some code demonstrating the above rules.
using System;
using System.Web;
using System.Xml;
using System.Xml.Linq;
public class Program
{
public static void Main()
{
//The XML declaration is not required, however, if used it must
// be the first line in the document and no other content or
//white space can precede it.
// here, no problem because this does not have an XML declaration
string xml = #"
<xml></xml>";
XDocument doc = XDocument.Parse(xml);
Console.WriteLine(doc.Document.Declaration);
Console.WriteLine(doc.Document);
//
// problem here because this does have an XML declaration
//
xml = #"
<?xml version=""1.0"" encoding=""utf-8"" ?><xml></xml>";
try
{
doc = XDocument.Parse(xml);
Console.WriteLine(doc.Document.Declaration);
Console.WriteLine(doc.Document);
} catch(Exception e) {
Console.WriteLine(e.Message);
}
}
}

How can I add specific escape characters to xmlserializer?

I have a method that serializes an object to xml and returns the string:
public static string SerializeType<T>(T item)
{
var serializer = new XmlSerializer(typeof(T));
var builder = new StringBuilder();
var settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
using (var stringWriter = XmlWriter.Create(builder, settings))
{
serializer.Serialize(stringWriter, item);
return builder.ToString();
}
}
However, it is not removing all the reserved characters from strings in objects I pass in. Microsoft lists the Reserved Characters as <>&% but when I input an item with a "abc&cd%d" string field, it spits out "a &lt ;ab&gt ;bc&amp ;cd%d" without out the spaces preceding the semicolons. % is not being escaped. How can I add the correct escape sequence for percent? The % causes an error when I send it to a client's app. The escaping listed on that page fixes the problem.

% isn't really a reserved character in XML. The documentation you've referred to is for SQL server, and there's a small note under the table:
The Notification Services XML vocabulary reserves the percent sign (%) for denoting parameters.
But you shouldn't expect XmlSerializer (or any other general-purpose XML library) to escape % for you. Unless you're using "Notification Services XML" I wouldn't expect this to be a problem.

XmlTextWriter incorrectly writing control characters

.NET's XmlTextWriter creates invalid xml files.
In XML, some control characters are allowed, like 'horizontal tab' ( ), but others are not, like 'vertical tab' (). (See spec.)
I have a string which contains a UTF-8 control character that is not allowed in XML.
Although XmlTextWriter escapes the character, the resulting XML is ofcourse still invalid.
How can I make sure that XmlTextWriter never produces an illegal XML file?
Or, if it's not possible to do this with XmlTextWriter, how can I strip the specific control characters that aren't allowed in XML from a string?
Example code:
using (XmlTextWriter writer =
new XmlTextWriter("test.xml", Encoding.UTF8))
{
writer.WriteStartDocument();
writer.WriteStartElement("Test");
writer.WriteValue("hello \xb world");
writer.WriteEndElement();
writer.WriteEndDocument();
}
Output:
<?xml version="1.0" encoding="utf-8"?><Test>hello  world</Test>

This documentation of a behaviour is hidden in the documentation of the WriteString method but it sounds like it applies to the whole class.
The default behavior of an XmlWriter created using Create is to throw
an ArgumentException when attempting to write character values in the
range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD).
These invalid XML characters can be written by creating the XmlWriter
with the CheckCharacters property set to false. Doing so will result
in the characters being replaced with numeric character entities (
through &#0x1F). Additionally, an XmlTextWriter created with the new
operator will replace the invalid characters with numeric character
entities by default.
So it seems that you end up writing invalid characters because you are using the XmlTextWriter class. A better solution for you would be to use the XmlWriter Class instead.

Just found this question when I was struggling with the same issue and I ended up solving it with an regex:
return Regex.Replace(s, #"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Hope it helps someone as an alternative solution.

Built in .NET escapers such as SecurityElement.Escape don't properly escape/strip it either.
You could set CheckCharacters to false on both the writer and the reader if your application is the only one interacting with the file. The resulting XML file would still be technically invalid though.
See:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = new UTF8Encoding(false);
xmlWriterSettings.CheckCharacters = false;
var sb = new StringBuilder();
var w = XmlWriter.Create(sb, xmlWriterSettings);
w.WriteStartDocument();
w.WriteStartElement("Test");
w.WriteString("hello \xb world");
w.WriteEndElement();
w.WriteEndDocument();
w.Close();
var xml = sb.ToString();
If setting CheckCharacters to true(which it is by default) is a bit too strict since it will simply throw an exception an alternative approach that's more lenient to invalid XML characters would be to just strip them:
Googling a bit yielded the whitelist XmlTextEncoder however it'll also remove DEL and others in the range U+007F–U+0084, U+0086–U+009F that according to Valid XML Characters on wikipedia are only valid in certain contexts and which the RFC mentions as discouraged but still valid characters.
public static class XmlTextExtentions
{
private static readonly Dictionary<char, string> textEntities = new Dictionary<char, string> {
{ '&', "&"}, { '<', "<" }, { '>', ">" },
{ '"', """ }, { '\'', "&apos;" }
};
public static string ToValidXmlString(this string str)
{
var stripped = str
.Select((c,i) => new
{
c1 = c,
c2 = i + 1 < str.Length ? str[i+1]: default(char),
v = XmlConvert.IsXmlChar(c),
p = i + 1 < str.Length ? XmlConvert.IsXmlSurrogatePair(str[i + 1], c) : false,
pp = i > 0 ? XmlConvert.IsXmlSurrogatePair(c, str[i - 1]) : false
})
.Aggregate("", (s, c) => {
if (c.pp)
return s;
if (textEntities.ContainsKey(c.c1))
s += textEntities[c.c1];
else if (c.v)
s += c.c1.ToString();
else if (c.p)
s += c.c1.ToString() + c.c2.ToString();
return s;
});
return stripped;
}
}
This passes all the XmlTextEncoder tests except for the one that expects it to strip DEL which XmlConvert.IsXmlChar, Wikipedia, and the spec marks as a valid (although discouraged) character.

Unicode escape characters not being read by XmlReader

I've got an XML document that I'm importing into an XmlReader that has some unicode formatting I need to preserve. I'm preserving the whitespace but it's dropping the encoded #x2028 which I assume should be expressed as a line break.
Here's my code:
var settings = new XmlReaderSettings
{
ProhibitDtd = false,
XmlResolver = null,
IgnoreWhitespace = false
};
var reader = XmlReader.Create(new StreamReader(fu.PostedFile.InputStream), settings);
var document = new XmlDocument {PreserveWhitespace = true};
document.Load(reader);
return document;
XML example:
<td valign="top" align="center">Camels and camel  resting place</td>
How do I get to those characters to I can render br tags?

Your question is unclear: do you expect the XmlReader to translate the   into an HTML <br> tag? That isn't going to happen.
Or are you examining the actual character content of the <td> element (within the code, not as printed/displayed) and seeing "camel resting place"? If yes, please show the code that you're using to verify this, because it would be a pretty major bug.
Or something else?

After importing the code into the reader I was able to find and replace that character:
Regex.Replace(s, "\u2028", "<br/>");

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

XmlWriter inserting spaces when xml:space=preserve - c#

Related

C# XmlReader reads XML wrong and different based on how I invoke the reader's methods

IgnoreWhiteSpace not ignoring whitespace at beginning of xml string

How can I add specific escape characters to xmlserializer?

XmlTextWriter incorrectly writing control characters

Unicode escape characters not being read by XmlReader

Categories

Resources