How to stop XDocument.Save writing escape chars - c#

I'm reading XML data from a varchar column in a SQL db, into a linq to sql XElement belonging to an XDocument.
When I execute the XDocument.Save method, the XML is written to file but includes the escape characters. For example, ">" is changed to "&gt".
Is there an easy way to prevent this?

First, there seems to be no reason to prevent it. Like kenny mentioned, unless special characters are XML encoded, no parser would be able to parse produced XML (because '<' or '>' characters means a lot for that parser). Second, when your parser decodes XML (e.g. you call XElement.Value), all special characters will be converted back to what they originally were. Finally, if you want to keep the original string (e.g. for purposes other than XML parsing), you can use CDATA, which in case of Linq2XML is represented by XCData class.
EDIT: As Rob pointed out, I might have gotten it wrong. If the point is to save add existing XML to a document, without special characters appear, use the following code:
XDocument document = new XDocument();
var xmlFromDb = "<xml>content</xml>";
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xmlFromDb)))
{
using (var reader = XmlReader.Create(stream)) {
reader.MoveToContent();
document.Add(XElement.ReadFrom(reader));
}
}

Related

Strip < Character from XML content

I have an XML Document where it contains data with < character.
<Tunings>
<Notes>Norm <150 mg/dl</Notes>
</Tunings>
The code I am using is:
StreamReader objReader = new StreamReader(strFile);
string strData = objReader.ReadToEnd();
XmlDocument doc = new XmlDocument();
// Here I want to strip those characters from "strData"
doc.LoadXml(strData);
So it gives error:
Name cannot begin with the '1' character, hexadecimal value 0x31.
So is there a way to strip those characters from XML before Load calls.?
If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
<Notes><![CDATA[Norm <150 mg/dl]]></Notes>
The CDATA tag tells XML parsers to not parse the characters between the <![CDATA[ and ]]>. This allows you have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
As the comments state, you do not have an XML document. If you know that the only way that these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?:\d) with &. This will find the < adjacent to a number and properly encode it.

How should the '\t' character be handled within XML attribute values?

I seem to have found something of an inconsistency between the various XML implementations within .Net 3.5 and I'm struggling to work out which is nominally correct.
The issue is actually fairly easy to reproduce:
Create a simple xml document with a text element containing '\t' characters and give it an attribute that contains '\t' characters:
var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
xmlDoc.Save(#"d:\TabTest.xml");
NB: This means that XmlDocument itself is quite happy with '\t' characters in an attribuite value.
Load the document using new XmlTextReader:
var rawFile = XmlReader.Create(#"D:\TabTest.xml");
var rawDoc = new XmlDocument();
rawDoc.Load(rawFile);
Load the document using XmlReader.Create:
var rawFile2 = new XmlTextReader(#"D:\TabTest.xml");
var rawDoc2 = new XmlDocument();
rawDoc2.Load(rawFile2);
Compare the documents in the debugger:
(rawDoc).InnerXml "<test><text attrib=\"Tab' 'space' '\">Tab'\t'space' '</text></test>" string
(rawDoc2).InnerXml "<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>" string
The document read using new XmlTextReader was what I expected, both the '\t' in the text value and attribute value was there as expected.
However, if you look at the document read by XmlReader.Create you find that the '\t' character in the attribute value will have been converted into a ' ' character.
What the....!! :-)
After a bit of a Google search I found that I could encode a '\t' as ' ' - if I used this instead of '\t' in the example XML both readers work as expected.
Now Altova XmlSpy and various other XML readers seem to be perfectly happy with '\t' characters in attribute values, my question is what is the correct way to handle this?
Should I be writing XML file with '\t' characters encoded in attribute values like XmlReader.Create expects or are the other XML tools right and '\t' characters are valid and XmlReader.Create is broken?
Which way should I go to fix/work around this issue?
Probably something to do with Attribute Value Normalization. For CDATA attributes an XML parser is required to replace newlines and tabs in attribute values by spaces, unless they are written in escaped form as character references.
#all: Thanks for all your answers and comments.
It would seem that Justin and Michael Kay are correct and white space should be encoded according to the W3C XML specifications and that the issue is that a significant number of the MS implementations do not honour this requirement.
In my case, XML specification aside, all I really want is for the attribute values to be correctly persisted - i.e. the values saved should be exactly the values read.
The answer to that is to force the use of an XmlWriter created by using XmlWriter.Create method when saving the XML files in the first place.
While both Dataset and XmlDocument provide save/write mechanisms neither of them correctly encode white space in attributes when used in their default form. If I force them to use a manually created XmlWriter, however, the correct encoding is applied and written to the file.
So the original file save code becomes:
var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
using (var xmlWriter = XmlWriter.Create(#"d:\TabTest.Encoded.xml"))
{
xmlDoc.Save(xmlWriter);
}
This writer then correctly encodes the white space in a symmetrical way for the XmlReader.Create reader to read without altering the attribute values.
The other thing to note here is that this solution encapsulates the encoding from my code entirely as the reader and writer perform the encoding and decoding transparently on read and write.
Check out XmlReaderSettings.ComformanceLevel. In particular, this description:
Note that XmlReader objects created by the Create method are more compliant by default than the XmlTextReader class. The following are conformance improvements that are not enabled on XmlTextReader, but are available by default on readers created by the Create method
At a glance it seems that XmlTextReader is not compliant with the W3C recommendation. See the section in the recommendation on attribute value normalization, specifically
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
Hence the behaviour that you weren't expecting (seeing a space instead of a tab) is actually the correct recommended behaviour.
I have no idea why XmlTextReader is behaving this way (there is nothing in the documentation), however you seem to have already seem to have identified the correct workaround - encode the attribute as instead. In this case the normalised string will contain the tab character itself.

Writing XMLDocument unescaped in C#

Currently I'm writing XHTML in a XmlDocument. This works perfect, but I'm stuck on one problem. Some XmlText elements can contain things like . When I want to write such things to a stream it uses the innerXML instead of the innerText value for such nodes. The problem is that the ouput is wrong because now its outputting &nbsp; instead of . How can I use xmlwriter and xmldocument without performing such escaping when writing to a stream? I just want unescaped output.
If you use XmlWriter.WriteRaw, it won't perform any escaping - it assumes you've got raw XML.
For example:
using System;
using System.Xml;
class Test
{
static void Main()
{
using (XmlWriter writer = XmlWriter.Create(Console.Out))
{
writer.WriteStartDocument();
writer.WriteStartElement("root");
writer.WriteRaw("<element> </element>");
writer.WriteEndElement();
writer.WriteEndDocument();
}
}
}
Output:
<?xml version="1.0" encoding="IBM437"?><root><element> </element></root>
You're almost certainly trying to solve the wrong problem here. If you want text with non-breaking spaces, then you should use the non-breaking space character. In a C# string literal you can write it as the escape sequence \u00A0, for example:
var xmldoc = new XmlDocument();
XmlElement test = xmldoc.CreateElement("test");
xmldoc.AppendChild(test);
XmlText nbsp = xmldoc.CreateTextNode("\u00A0");
test.AppendChild(nbsp);
HTML entities like nbsp are just a way to encode such characters in a non-unicode text file. You shouldn't be using them when constructing an XML DOM. By the way, if you force .NET to write the above DOM to an ASCII encoded file (via the proper XmlWriterSettings) then it will probably write the non-breaking space character as  . In an UTF-8 encoded file (the default) it will just appear as a space.
If you force certain literal character sequences to appear in the XML output, then you risk creating invalid XML that cannot be loaded by conforming XML processors. For example, try to load <test> </test> in an empty XmlDocument. This will throw an exception. To be fair, you can declare such entities, and the XHTML schema does so. But I hope you see my point.
edit: XmlDocument is doing it's job correctly. If it wouldn't escape characters such as & < > then you could create invalid XML that's impossible to load again. To force an XML entity in the output you should use XmlDocument.CreateEntityReference. The bug is in whatever code is using entities in XmlText nodes instead of generating XmlEntityReference nodes.
Assuming you are using .NET 3.x ,learn and use LINQ-to-XML... the API is very simple and more capable. That way you need not walk/traverse the DOM...instead you can just query the object tree.
Specifically, look into the XDocument clas of the API.

C# Special Characters not displayed propely in XML

i have a string that contains special character like (trademark sign etc). This string is set as an XML node value. But the special character is not rendered properly in XML, shows ??. This is how im using it.
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.htmlEncode(xxxx) but it converts it into "&amp ;#8482;" so the output of xml is "&#8482"; instead of ™
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.

XML Exception: Invalid Character(s)

I am working on a small project that is receiving XML data in string form from a long running application. I am trying to load this string data into an XDocument (System.Xml.Linq.XDocument), and then from there do some XML Magic and create an xlsx file for a report on the data.
On occasion, I receive the data that has invalid XML characters, and when trying to parse the string into an XDocument, I get this error.
[System.Xml.XmlException]
Message: '?', hexadecimal value 0x1C, is an invalid character.
Since I have no control over the remote application, you could expect ANY kind of character.
I am well aware that XML has a way where you can put characters in it such as &#x1C or something like that.
If at all possible I would SERIOUSLY like to keep ALL the data. If not, than let it be.
I have thought about editing the response string programatically, then going back and trying to re-parse should an exception be thrown, but I have tried a few methods and none of them seem successful.
Thank you for your thought.
Code is something along the line of this:
TextReader tr;
XDocument doc;
string response; //XML string received from server.
...
tr = new StringReader (response);
try
{
doc = XDocument.Load(tr);
}
catch (XmlException e)
{
//handle here?
}
You can use the XmlReader and set the XmlReaderSettings.CheckCharacters property to false. This will let you to read the XML file despite the invalid characters. From there you can import pass it to a XmlDocument or XDocument object.
You can read a little more about in my blog.
To load the data to a System.Xml.Linq.XDocument it will look a little something like this:
XDocument xDocument = null;
XmlReaderSettings xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };
using (XmlReader xmlReader = XmlReader.Create(filename, xmlReaderSettings))
{
xmlReader.MoveToContent();
xDocument = XDocument.Load(xmlReader);
}
More information can be found here.
XML can handle just about any character, but there are ranges, control codes and such, that it won't.
Your best bet, if you can't get them to fix their output, is to sanitize the raw data you're receiving. You need replace illegal characters with the character reference format you noted.
(You can't even resort to CDATA, as there is no way to escape these characters there.)
Would something as described in this blog post be helpful?
Basically, he creates a sanitizing xml stream.
If your input is not XML, you should use something like Tidy or Tagsoup to clean the mess up.
They would take any input and try, hopefully, to make a useful DOM from it.
I don't know how relevant dark side libraries are called.
Garbage In, Garbage Out. If the remote application is sending you garbage, then that's all you'll get. If they think they're sending XML, then they need to be fixed. In this case, you're not doing them any favors by working around their bug.
You should also make sure of what they think they're sending. What did the %1C mean to them? What did they want it to be?
IMHO the best solution would be to modify the code/program/whatever produced the invalid XML that is being fed to your program. Unfortunately this is not always possible. In this case you need to escape all characters < 0x20 before trying to load the document.
If you really can't fix the source XML data, consider taking an approach like I described in this answer. Basically, you create a TextReader subclass (e.g StripTextReader) that wraps an existing TextReader (tr) and discards invalid characters.
Its a late answer, but may help someone. When you read or serialize an XML it may have 1 invisible character at the beginning of the XML. XDocument don't like this invisible character.
So while reading the XML, just start reading from the first < character:
var myXml = XDocument.Parse(loadedString.Substring(loadedString.IndexOf("<")));
That's it and it loads just fine.

Categories