Question
Should whitespace be ignored at the beginning of my multi-line string literal xml?
Code
string XML = @"
<?xml version=""1.0"" encoding=""utf-8"" ?>";
using (StringReader stringReader = new StringReader(XML))
using (XmlReader xmlReader = XmlReader.Create(stringReader,
new XmlReaderSettings() { IgnoreWhitespace = true }))
{
xmlReader.MoveToContent();
// further implementation withheld
}
Notice that in the code above there is white space before the XML declaration. It doesn't seem to be ignored, despite my setting the IgnoreWhitespace property. Where am I going wrong?
Note: I get the same behaviour when the XML string has no line break, just a single leading space, as below. I know this will run if I remove the white space; my question is why the property doesn't take care of it.
string XML = @" <?xml version=""1.0"" encoding=""utf-8"" ?>";
The documentation says that the IgnoreWhitespace property "Gets or sets a value indicating whether to ignore insignificant white space." You might expect that leading white space (and line break) to count as insignificant, but XmlReader doesn't treat it that way. Just trim the XML before use, and you'll be fine.
As stated in comments and for clarity, change your code to:
string XML = @"<?xml version=""1.0"" encoding=""utf-8"" ?>";
using (StringReader stringReader = new StringReader(XML.Trim()))
using (XmlReader xmlReader = XmlReader.Create(stringReader,
new XmlReaderSettings() { IgnoreWhitespace = true }))
{
xmlReader.MoveToContent();
// further implementation withheld
}
According to Microsoft's documentation regarding XML Declaration
The XML declaration typically appears as the first line in an XML
document. The XML declaration is not required, however, if used it
must be the first line in the document and no other content or white
space can precede it.
The parse should fail for your code because white space precedes the XML declaration. Removing either the white space OR the xml declaration will result in a successful parse.
In other words it would be a bug if XmlReaderSettings were at odds with the documentation for XML Declaration - it is defined behavior.
Here's some code demonstrating the above rules.
using System;
using System.Xml;
using System.Xml.Linq;
public class Program
{
public static void Main()
{
//The XML declaration is not required, however, if used it must
// be the first line in the document and no other content or
//white space can precede it.
// here, no problem because this does not have an XML declaration
string xml = @"
<xml></xml>";
XDocument doc = XDocument.Parse(xml);
Console.WriteLine(doc.Document.Declaration);
Console.WriteLine(doc.Document);
//
// problem here because this does have an XML declaration
//
xml = @"
<?xml version=""1.0"" encoding=""utf-8"" ?><xml></xml>";
try
{
doc = XDocument.Parse(xml);
Console.WriteLine(doc.Document.Declaration);
Console.WriteLine(doc.Document);
} catch(Exception e) {
Console.WriteLine(e.Message);
}
}
}
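For contrast, here is a minimal sketch of what IgnoreWhitespace does cover: white space between markup inside the document, which the reader otherwise reports as Whitespace nodes. The class and method names (WhitespaceDemo, CountWhitespaceNodes) are made up for this illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WhitespaceDemo
{
    // Counts the nodes of type XmlNodeType.Whitespace that the reader reports.
    public static int CountWhitespaceNodes(string xml, bool ignoreWhitespace)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = ignoreWhitespace };
        int count = 0;
        using (var stringReader = new StringReader(xml))
        using (var xmlReader = XmlReader.Create(stringReader, settings))
        {
            while (xmlReader.Read())
            {
                if (xmlReader.NodeType == XmlNodeType.Whitespace)
                    count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        // The declaration is the very first thing in the string;
        // the only white space sits between elements.
        string xml = "<?xml version=\"1.0\"?><root>\n  <item/>\n</root>";
        Console.WriteLine(CountWhitespaceNodes(xml, false)); // white space nodes are reported
        Console.WriteLine(CountWhitespaceNodes(xml, true));  // white space nodes are skipped
    }
}
```

With the setting off, the reader surfaces the line breaks and indentation around the item element as separate nodes; with it on, they are skipped. Neither setting makes white space before the declaration acceptable.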
Related
My question is simple, but I just can't find why I have this problem and can't resolve it.
I need to read an XML file with values and use them in Unity. For now, I load the document from its path:
XmlDocument doc = new XmlDocument();
doc.Load(path);
XmlElement root = doc.DocumentElement;
I have a Namespace Manager already configured.
I read my data like this :
string text = node.SelectSingleNode("x:textRuns/x:DOMTextRun/x:characters", nsmgr).InnerText.Replace("&#xD;", Environment.NewLine);
My XML and the data I would like to extract :
<characters>Third occupant&#xD;folding seat</characters>
My objective is to replace this character entity, "&#xD;", with Environment.NewLine.
I tried to :
Formalize the Xml in a file with a replace
Read with an InnerText, and an InnerXml
Make an entity char "detector"
Get the node with all its content (OuterXML)
It looks like this character, however I read it, is excluded and not readable; I just can't get it to show up in my console.
The entity has already been resolved by the time you read InnerText. The problem is that you have a CR (carriage return; 0x0D, \r) rather than an LF (line feed; 0x0A, \n). So replace "\r" with Environment.NewLine:
public static void Main() {
XmlDocument doc = new XmlDocument();
doc.LoadXml("<characters>Third occupant&#xD;folding seat</characters>");
string text = doc.SelectSingleNode("/characters").InnerText;
text = text.Replace("\r", Environment.NewLine);
Console.WriteLine(text);
}
I am trying to replace within a string
<?xml version="1.0" encoding="UTF-8"?>
<response success="true">
<output><![CDATA[
And
]]></output>
</response>
with nothing.
The problem I am running into is that the <, > and " characters interfere with the replace: instead of reading those lines as one complete string, it breaks the string whenever it reaches a <, > or ". Here is what I have, but I know it isn't right:
String responseString = reader.ReadToEnd();
responseString.Replace(@"<<?xml version=""1.0"" encoding=""UTF-8""?><response success=""true""><output><![CDATA[[", "");
responseString.Replace(@"]]\></output\></response\>", "");
What would be the correct code to get the replace to see these lines as just a string?
Strings are immutable, so Replace never changes the original; it returns a new string that you have to use or assign:
string x = "AAA";
string y = x.Replace("A", "B");
//x == "AAA", y == "BBB"
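Applied to a response string, that means chaining or assigning the result of each call. This is only a sketch of the immutability point; the class and method names (ReplaceDemo, StripWrapper) and the sample markup are hypothetical.

```csharp
using System;

public static class ReplaceDemo
{
    // Returns a copy of response with the given prefix and suffix text removed.
    // Each Replace call produces a new string; the input is left untouched.
    public static string StripWrapper(string response, string prefix, string suffix)
    {
        return response.Replace(prefix, "").Replace(suffix, "");
    }

    public static void Main()
    {
        string response = "<output>hello</output>";
        string stripped = StripWrapper(response, "<output>", "</output>");
        Console.WriteLine(response); // still the original text
        Console.WriteLine(stripped); // hello
    }
}
```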
However, the real problem is how you handle the XML response data.
You should reconsider your approach of handling incoming XML by string replacement. Just get the CDATA content using the standard XML library. It's as easy as this:
using System.Linq;
using System.Xml.Linq;
...
XDocument doc = XDocument.Load(reader);
var responseString = doc.Descendants("output").First().Value;
The CDATA will already be removed. This tutorial will teach more about working with XML documents in C#.
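As a self-contained sketch of the same idea, the following parses a response from a string instead of a reader. The sample payload and the names CDataDemo and ReadOutput are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CDataDemo
{
    // Returns the text of the first <output> element; the CDATA wrapper
    // is markup, so it is not part of the returned value.
    public static string ReadOutput(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("output").First().Value;
    }

    public static void Main()
    {
        string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<response success=\"true\">"
            + "<output><![CDATA[some <raw> text]]></output>"
            + "</response>";
        Console.WriteLine(ReadOutput(xml)); // prints: some <raw> text
    }
}
```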
Given your document structure, you could simply say something like this:
string response = @"<?xml version=""1.0"" encoding=""UTF-8""?>"
+ @"<response success=""true"">"
+ @" <output><![CDATA["
+ @"The output is some arbitrary text and it may be found here."
+ "]]></output>"
+ "</response>"
;
XmlDocument document = new XmlDocument() ;
document.LoadXml( response ) ;
bool success ;
bool.TryParse( document.DocumentElement.GetAttribute("success"), out success) ;
string content = document.DocumentElement.InnerText ;
Console.WriteLine( "The response indicated {0}." , success ? "success" : "failure" ) ;
Console.WriteLine( "response content: {0}" , content ) ;
And see the expected results on the console:
The response indicated success.
response content: The output is some arbitrary text and it may be found here.
If your XML document is a wee bit more complex, you can easily select the desired node(s) using an XPath query, thus:
string content = document.SelectSingleNode( @"/response/output" ).InnerText;
I have a xml-document that simplified looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Node1 separator=" " />
There is a \t as attribute value.
When executing this code
var path = @"C:\test.xml";
var doc = XDocument.Load(path);
doc.Save(path);
the attribute value changed from tab to space.
<?xml version="1.0" encoding="utf-8"?>
<Node1 separator=" " />
Is there a way to preserve the origin value, because it is required to be a tab?
This is the "white space normalization in attributes" part of XML attribute-value normalization, which is the default behavior when handling XML documents. The specification says:
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value
You should be able to use the XmlTextReader.Normalization property to avoid this (it is false by default for an XmlTextReader), and then pass the reader to XmlDocument.Load:
var path = @"C:\test.xml";
XmlDocument doc = new XmlDocument();
XmlTextReader reader = new XmlTextReader(path);
doc.Load(reader);
var s = doc.SelectSingleNode("*/@*").InnerText;
Console.WriteLine("|{0}|, {1}", (int)s[0], s.Length); // prints 9 - ASCII code of tab
doc.Save(path);
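The normalization rule quoted above applies to literal white space characters in the attribute value; a character reference such as &#x9; is not rewritten by it. The sketch below illustrates the difference with XDocument.Parse. The names NormalizationDemo and ReadSeparator are made up, and whether you can change the source file to use a character reference is an assumption about your situation.

```csharp
using System;
using System.Xml.Linq;

public static class NormalizationDemo
{
    // Parses the XML and returns the value of the root's "separator" attribute.
    public static string ReadSeparator(string xml)
    {
        return XDocument.Parse(xml).Root.Attribute("separator").Value;
    }

    public static void Main()
    {
        // A literal tab character inside the attribute value.
        string literalTab = "<Node1 separator=\"\t\" />";
        // The same tab written as a character reference.
        string referencedTab = "<Node1 separator=\"&#x9;\" />";

        Console.WriteLine((int)ReadSeparator(literalTab)[0]);    // 32: normalized to a space
        Console.WriteLine((int)ReadSeparator(referencedTab)[0]); // 9: the tab survives
    }
}
```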
Given this code (C#, .NET 3.5 SP1):
var doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\"?><root>"
+ "<value xml:space=\"preserve\">"
+ "<item>content</item>"
+ "<item>content</item>"
+ "</value></root>");
var text = new StringWriter();
var settings = new XmlWriterSettings() { Indent = true, CloseOutput = true };
using (var writer = XmlWriter.Create(text, settings))
{
doc.DocumentElement.WriteTo(writer);
}
var xml = text.GetStringBuilder().ToString();
Assert.AreEqual("<?xml version=\"1.0\" encoding=\"utf-16\"?>\r\n<root>\r\n"
+ " <value xml:space=\"preserve\"><item>content</item>"
+ "<item>content</item></value>\r\n</root>", xml);
The assertion fails because the XmlWriter is inserting a newline and indent around the <item> elements, which would seem to contradict the xml:space="preserve" attribute.
I am trying to take input with no whitespace (or only significant whitespace, and already loaded into an XmlDocument) and pretty-print it without adding any whitespace inside elements marked to preserve whitespace (for obvious reasons).
Is this a bug or am I doing something wrong? Is there a better way to achieve what I'm trying to do?
Edit: I should probably add that I do have to use an XmlWriter with Indent=true on the output side. In the "real" code, this is being passed in from outside of my code.
Ok, I've found a workaround.
It turns out that XmlWriter does the correct thing if there actually is any whitespace within the xml:space="preserve" block -- it's only when there isn't any that it screws up and adds some. And conveniently, this also works if there are some whitespace nodes, even if they're empty. So the trick that I've come up with is to decorate the document with extra 0-length whitespace in the appropriate places before trying to write it out. The result is exactly what I want: pretty printing everywhere except where whitespace is significant.
The workaround is to change the inner block to:
PreserveWhitespace(doc.DocumentElement);
doc.DocumentElement.WriteTo(writer);
...
private static void PreserveWhitespace(XmlElement root)
{
var nsmgr = new XmlNamespaceManager(root.OwnerDocument.NameTable);
foreach (var element in root.SelectNodes("//*[@xml:space='preserve']", nsmgr)
.OfType<XmlElement>())
{
if (element.HasChildNodes && !(element.FirstChild is XmlSignificantWhitespace))
{
var whitespace = element.OwnerDocument.CreateSignificantWhitespace("");
element.InsertBefore(whitespace, element.FirstChild);
}
}
}
I'm still thinking that this behaviour of XmlWriter is a bug, though.
I've got an XML document that I'm importing into an XmlReader that has some Unicode formatting I need to preserve. I'm preserving the white space, but it's dropping the encoded &#x2028;, which I assume should be expressed as a line break.
Here's my code:
var settings = new XmlReaderSettings
{
ProhibitDtd = false,
XmlResolver = null,
IgnoreWhitespace = false
};
var reader = XmlReader.Create(new StreamReader(fu.PostedFile.InputStream), settings);
var document = new XmlDocument {PreserveWhitespace = true};
document.Load(reader);
return document;
XML example:
<td valign="top" align="center">Camels and camel&#x2028;resting place</td>
How do I get to those characters so I can render br tags?
Your question is unclear: do you expect the XmlReader to translate the &#x2028; into an HTML <br> tag? That isn't going to happen.
Or are you examining the actual character content of the <td> element (within the code, not as printed/displayed) and seeing "camel resting place"? If yes, please show the code that you're using to verify this, because it would be a pretty major bug.
Or something else?
After loading the document through the reader, I was able to find and replace that character:
Regex.Replace(s, "\u2028", "<br/>");
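A minimal end-to-end sketch of that replacement, assuming the character arrives as the reference &#x2028; in the markup. A plain string Replace is used here in place of Regex.Replace, since no pattern matching is needed; the names LineSeparatorDemo and ToHtmlBreaks are hypothetical.

```csharp
using System;
using System.Xml;

public static class LineSeparatorDemo
{
    // Replaces each U+2028 (LINE SEPARATOR) with an HTML line break tag.
    public static string ToHtmlBreaks(string text)
    {
        return text.Replace("\u2028", "<br/>");
    }

    public static void Main()
    {
        var document = new XmlDocument { PreserveWhitespace = true };
        document.LoadXml("<td>Camels and camel&#x2028;resting place</td>");

        // The parser resolves the reference, so InnerText contains U+2028 itself.
        string text = document.DocumentElement.InnerText;
        Console.WriteLine(ToHtmlBreaks(text)); // Camels and camel<br/>resting place
    }
}
```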