I seem to have found an inconsistency between the various XML implementations within .NET 3.5, and I'm struggling to work out which is nominally correct.
The issue is actually fairly easy to reproduce:
Create a simple xml document with a text element containing '\t' characters and give it an attribute that contains '\t' characters:
var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
xmlDoc.Save(@"d:\TabTest.xml");
NB: This means that XmlDocument itself is quite happy with '\t' characters in an attribute value.
Load the document using XmlReader.Create:
var rawFile = XmlReader.Create(@"D:\TabTest.xml");
var rawDoc = new XmlDocument();
rawDoc.Load(rawFile);
Load the document using new XmlTextReader:
var rawFile2 = new XmlTextReader(@"D:\TabTest.xml");
var rawDoc2 = new XmlDocument();
rawDoc2.Load(rawFile2);
Compare the documents in the debugger:
(rawDoc).InnerXml "<test><text attrib=\"Tab' 'space' '\">Tab'\t'space' '</text></test>" string
(rawDoc2).InnerXml "<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>" string
The document read using new XmlTextReader was what I expected: the '\t' in both the text value and the attribute value was preserved.
However, if you look at the document read by XmlReader.Create, you find that the '\t' character in the attribute value has been converted into a ' ' character.
What the....!! :-)
After a bit of a Google search I found that I could encode a '\t' as '&#9;' - if I used this instead of '\t' in the example XML, both readers work as expected.
Now Altova XmlSpy and various other XML readers seem to be perfectly happy with '\t' characters in attribute values, my question is what is the correct way to handle this?
Should I be writing XML files with '\t' characters encoded in attribute values as XmlReader.Create expects, or are the other XML tools right, '\t' characters are valid, and XmlReader.Create is broken?
Which way should I go to fix/work around this issue?
Probably something to do with Attribute Value Normalization. For CDATA attributes an XML parser is required to replace newlines and tabs in attribute values by spaces, unless they are written in escaped form as character references.
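The normalization described above can be observed directly with a small sketch. The class name, method name, and the sample element and attribute names are made up for illustration; the only assumption is the behaviour already reported in the question, namely that a reader from XmlReader.Create normalizes attribute values.

```csharp
using System;
using System.IO;
using System.Xml;

public static class AttributeNormalizationDemo
{
    // Reads one attribute from the root element using a reader
    // created by XmlReader.Create (the normalizing reader).
    public static string ReadAttribute(string xml, string attributeName)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();              // position on the root element
            return reader.GetAttribute(attributeName);
        }
    }
}
```

A literal tab in the attribute should come back as a space, while a tab written as the character reference &#9; should come back as a real tab.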
@all: Thanks for all your answers and comments.
It would seem that Justin and Michael Kay are correct and white space should be encoded according to the W3C XML specifications and that the issue is that a significant number of the MS implementations do not honour this requirement.
In my case, XML specification aside, all I really want is for the attribute values to be correctly persisted - i.e. the values saved should be exactly the values read.
The answer to that is to force the use of an XmlWriter created by using XmlWriter.Create method when saving the XML files in the first place.
While both DataSet and XmlDocument provide save/write mechanisms, neither of them correctly encodes white space in attributes when used in its default form. If I force them to use a manually created XmlWriter, however, the correct encoding is applied and written to the file.
So the original file save code becomes:
var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
using (var xmlWriter = XmlWriter.Create(@"d:\TabTest.Encoded.xml"))
{
    xmlDoc.Save(xmlWriter);
}
This writer then correctly encodes the white space in a symmetrical way for the XmlReader.Create reader to read without altering the attribute values.
The other thing to note here is that this solution encapsulates the encoding from my code entirely as the reader and writer perform the encoding and decoding transparently on read and write.
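For anyone who wants to check the symmetry without touching the disk, here is a minimal in-memory sketch of the same round trip. The class, method, and element/attribute names are hypothetical, and it assumes the writer behaviour described above (XmlWriter.Create with default settings encoding the tab in the attribute).

```csharp
using System;
using System.IO;
using System.Xml;

public static class AttributeRoundTripDemo
{
    // Saves a one-element document through XmlWriter.Create and reads the
    // attribute back through XmlReader.Create, returning the value read.
    public static string RoundTrip(string attributeValue)
    {
        var doc = new XmlDocument();
        var root = doc.CreateElement("text");
        root.SetAttribute("attrib", attributeValue);
        doc.AppendChild(root);

        var buffer = new StringWriter();
        using (var writer = XmlWriter.Create(buffer))
        {
            doc.Save(writer);                    // writer applies the encoding
        }

        using (var reader = XmlReader.Create(new StringReader(buffer.ToString())))
        {
            reader.MoveToContent();
            return reader.GetAttribute("attrib");
        }
    }
}
```

If the writer and reader really are symmetrical, RoundTrip("Tab\tspace") should return the string unchanged.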
Check out XmlReaderSettings.ConformanceLevel. In particular, this description:
Note that XmlReader objects created by the Create method are more compliant by default than the XmlTextReader class. The following are conformance improvements that are not enabled on XmlTextReader, but are available by default on readers created by the Create method
At a glance it seems that XmlTextReader is not compliant with the W3C recommendation. See the section in the recommendation on attribute value normalization, specifically
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
Hence the behaviour that you weren't expecting (seeing a space instead of a tab) is actually the correct recommended behaviour.
I have no idea why XmlTextReader is behaving this way (there is nothing in the documentation), however you seem to have already identified the correct workaround - encode the tab in the attribute as &#9; instead. In this case the normalised string will contain the tab character itself.
Related
We receive an xml string from an external API, and one element has a bunch of GT/LT signs.
When we run this code, it fails:
var xml = @"<SomeNode>10040:<->10110:<->10130:<->10150:<->10160:<->10180:<->10330:Value=><->10330:Matching=><->10330:Value2=><->10330:Value3=><->10330:Value4=><->10447:<->10418:No<->10419:No<->10430:No
</SomeNode>";
var doc = new XmlDocument();
doc.LoadXml(xml);
//System.Xml.XmlException: 'Name cannot begin with the '-' character, hexadecimal value 0x2D
I looked into escaping those characters, but as far as I can tell there isn't a way to escape only the ones inside SomeNode.
So I know that I could run some kind of string replacement using a regex or something to clear that out. But, is there an elegant way to solve this using existing XML related tools?
Based on the comments, there isn't an xml tools solution, and so it'll be a custom string replacement solution.
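One possible shape for that custom replacement is sketched below. It is not taken from the comments; the class and method names are invented, and it assumes the SomeNode element is never nested, carries no attributes, and contains no already-escaped entities (those would get double-escaped).

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class SomeNodeSanitizer
{
    // Captures the opening tag, the raw body, and the closing tag separately.
    private static readonly Regex SomeNodeBody = new Regex(
        @"(?<open><SomeNode>)(?<body>.*?)(?<close></SomeNode>)",
        RegexOptions.Singleline);

    // Escapes &, < and > only inside the body of <SomeNode>.
    public static string EscapeSomeNode(string xml)
    {
        return SomeNodeBody.Replace(xml, m =>
            m.Groups["open"].Value
            + m.Groups["body"].Value
                .Replace("&", "&amp;")   // must run first
                .Replace("<", "&lt;")
                .Replace(">", "&gt;")
            + m.Groups["close"].Value);
    }
}
```

After the replacement, doc.LoadXml(SomeNodeSanitizer.EscapeSomeNode(xml)) should parse, and InnerText gives back the original unescaped text.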
I faced a problem with reading the XML. The solution was found, but there are still some questions. The incorrect XML file is encoded in UTF-8 and has the appropriate mark in its header. But it also includes a char encoded in UTF-16 - 'é'. This code was used to read the XML file for validating its content:
var xDoc = XDocument.Load(taxFile);
It raises exception for specified incorrect XML file: "Invalid character in the given encoding. Line 59, position 104." The quick fix is as follows:
XDocument xDoc = null;
using (var oReader = new StreamReader(taxFile, Encoding.UTF8))
{
xDoc = XDocument.Load(oReader);
}
This code doesn't raise an exception for the incorrect file. But the 'é' character is loaded as �. My first question is "why does it work?".
Another point is that using XmlReader doesn't raise an exception until the node with 'é' is loaded.
XmlReader xmlTax = XmlReader.Create(filePath);
And again the workaround with StreamReader helps. The same question.
It seems like the fix is not good enough, because one day :) XML encoded in another format may appear and it could be processed in the wrong way. BUT I've tried to process a UTF-16 formatted XML file and it worked fine (configured to UTF-8).
The final question is whether there are any options that can be provided to XDocument/XmlReader to ignore character encoding or something like this.
Looking forward to your replies. Thanks in advance.
The first thing to note is that the XML file is in fact flawed - mixing text encodings in the same file like this should not be done. The error is even more obvious when the file actually has an explicit encoding embedded.
As for why it can be read without exception with StreamReader, it's because Encoding contains settings to control what happens when incompatible data is encountered.
Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:
The UTF8Encoding object that is returned by this property may not have
the appropriate behavior for your application. It uses replacement
fallback to replace each string that it cannot encode and each byte
that it cannot decode with a question mark ("?") character.
You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default.
http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.
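To make the difference between the two fallback settings concrete, here is a small sketch (class and method names are invented for the example). It compares the shared Encoding.UTF8 instance with a UTF8Encoding constructed with throwOnInvalidBytes set to true. It does not show what XDocument.Load does internally, which remains a guess.

```csharp
using System;
using System.Text;

public static class Utf8FallbackDemo
{
    // Encoding.UTF8 uses a replacement fallback: bad bytes become U+FFFD.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // A UTF8Encoding created with throwOnInvalidBytes = true raises
    // DecoderFallbackException instead of silently replacing.
    public static bool StrictDecodeThrows(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```

For example, the bytes 0x61 0xE9 0x62 ('a', a lone Latin-1 'é', 'b') are not valid UTF-8: the lenient decoder substitutes a replacement character for the middle byte, while the strict one throws.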
I'm reading XML data from a varchar column in a SQL db, into a linq to sql XElement belonging to an XDocument.
When I execute the XDocument.Save method, the XML is written to file but includes the escape characters. For example, ">" is changed to "&gt;".
Is there an easy way to prevent this?
First, there seems to be no reason to prevent it. Like kenny mentioned, unless special characters are XML encoded, no parser would be able to parse produced XML (because '<' or '>' characters means a lot for that parser). Second, when your parser decodes XML (e.g. you call XElement.Value), all special characters will be converted back to what they originally were. Finally, if you want to keep the original string (e.g. for purposes other than XML parsing), you can use CDATA, which in case of Linq2XML is represented by XCData class.
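A short sketch of those three points, using a made-up element name "note" and hypothetical helper names: the text is escaped on output, restored by Value on input, and left as-is inside an XCData node.

```csharp
using System;
using System.Xml.Linq;

public static class EscapingDemo
{
    // Serializes text as ordinary element content (special characters escaped).
    public static string Serialize(string text)
    {
        return new XElement("note", text).ToString();
    }

    // Parses the serialized form again; Value returns the decoded text.
    public static string ReadBack(string text)
    {
        return XElement.Parse(Serialize(text)).Value;
    }

    // Serializes the same text wrapped in a CDATA section instead.
    public static string SerializeAsCData(string text)
    {
        return new XElement("note", new XCData(text)).ToString();
    }
}
```

ReadBack returns exactly the string that went in, so the escaping on disk is invisible to any code that reads the value through the XML API.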
EDIT: As Rob pointed out, I might have gotten it wrong. If the point is to add existing XML to a document without the special characters being escaped, use the following code:
XDocument document = new XDocument();
var xmlFromDb = "<xml>content</xml>";
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xmlFromDb)))
{
using (var reader = XmlReader.Create(stream)) {
reader.MoveToContent();
document.Add(XElement.ReadFrom(reader));
}
}
I have a string that contains special characters (trademark sign etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows ??. This is how I'm using it:
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.HtmlEncode(xxxx), but it converts it into "&#8482;", so the output of the XML is "&amp;#8482;" instead of ™
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
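The save-and-reload check suggested above could look roughly like this. It is only a sketch using LINQ to XML and an in-memory stream, with invented class, method, and element names, rather than the asker's actual code.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class SpecialCharacterRoundTrip
{
    // Writes the text as element content (UTF-8 by default for a stream),
    // reloads the document, and returns the text that was read back.
    public static string RoundTripText(string text)
    {
        var doc = new XDocument(new XElement("node", text));
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream))
            {
                doc.Save(writer);                // no manual encoding needed
            }
            stream.Position = 0;
            using (var reader = XmlReader.Create(stream))
            {
                return XDocument.Load(reader).Root.Value;
            }
        }
    }
}
```

If RoundTripText("Brand\u2122") returns the same string, the XML itself is fine and the ?? is coming from whatever is displaying the file.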
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.
I'm using the new System.Xml.Linq to create HTML documents (Yes, I know about HtmlDocument, but much prefer the XDocument/XElement classes). I'm having a problem inserting &nbsp; (or any other HTML entity). What I've tried already:
Just putting text in directly doesn't work because the & gets turned into &amp;.
new XElement("h1", "Text&nbsp;to&nbsp;keep&nbsp;together.");
I tried parsing in the raw XML using the following, but it barfs with this error:
XElement.Parse("<h1>Text&nbsp;to&nbsp;keep&nbsp;together.</h1>");
--> Reference to undeclared entity 'nbsp'.
Try number three looks like the following. If I save to a file, there is just a space; the &nbsp; gets lost.
var X = new XDocument(new XElement("Name", KeepTogether("Hi Mom!")));
private static XNode KeepTogether(string p)
{
    return XElement.Parse("<xml>" + p.Replace(" ", "&#160;") + "</xml>").FirstNode;
}
I couldn't find a way to just shove the raw text through without it getting escaped. Am I missing something obvious?
I couldn't find a way to just shove the raw text through without it getting escaped.
Just put the Unicode character that &nbsp; refers to (U+00A0 NO-BREAK SPACE) directly in a text node, then let the serializer worry about whether it needs escaping to &#160; or not. (Probably not: if you are using UTF-8 or ISO-8859-1 as your page encoding the character can be included directly without having to worry about encoding it into an entity reference or character reference).
new XElement("h1", "Text\u00A0to\u00A0keep\u00A0together");
Replacing the & with a marker like ##AMP## solved my problem. Maybe not the prettiest solution but I got a demo for a customer in 10 mins so...I don't care :)
Thx for the idea
I know this is old, but I found something and I'm rather surprised!
XElement element = new XElement("name",new XCData("<br /> & etc"));
And there you go! CDATA text!
You could also try using numbered entities; they need no declaration.
The numbered entity equivalent to the named entity &nbsp; is &#160;
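As a quick illustration (the class and method names are made up), parsing the numeric character reference and putting U+00A0 in the string directly should produce the same text node:

```csharp
using System;
using System.Xml.Linq;

public static class NbspDemo
{
    // Numeric character references need no DOCTYPE declaration.
    public static string ParseNumeric()
    {
        return XElement.Parse("<h1>Text&#160;to&#160;keep&#160;together</h1>").Value;
    }

    // The same text built directly from the U+00A0 character.
    public static string BuildDirect()
    {
        return new XElement("h1", "Text\u00A0to\u00A0keep\u00A0together").Value;
    }
}
```

Both methods return a string whose word separators are no-break spaces rather than ordinary spaces.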
Unlike amp (&), lt (<) etc., nbsp is not a known entity in XML, so you need to declare it.
In XML, e.g. &xyz; is treated as an entity. The parser will look up its value to produce the output.
// the xml
string xml = "<xml>test&nbsp;test</xml>";
// declare nbsp as xml entity and its value is " " in this case.
string declareEntity = "<!DOCTYPE xml [<!ENTITY nbsp \" \">]>";
XElement x = XElement.Parse( declareEntity + xml );
// output with a space between tests
// <xml>test test</xml>
or
XElement.Parse("<xml>" + HttpUtility.HtmlEncode("Text&nbsp;keep everything") + "</xml>");
You can paste the character as you wish to see it if you copy it from somewhere else. Visual Studio allows that.
Though this is hard to do if you need &nbsp;, it is easy if you need any symbols, for example:
&bull; ...just paste •
&harr; ...just paste ↔
I came up with this slightly daft approach which suits me:
String-replace all the & with ##AMP## when you store the data...
And reverse that operation on output.
I am using this in conjunction with an XElement SQL column and it works a treat.
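In code, the idea is roughly the following. This is a bare-bones sketch with hypothetical names; note that any text which already contains the literal marker ##AMP## would be changed on the way back out.

```csharp
using System;

public static class AmpersandMarker
{
    private const string Marker = "##AMP##";

    // Applied before the text is stored as XML content.
    public static string Protect(string text)
    {
        return text.Replace("&", Marker);
    }

    // Applied to the text read back, just before output.
    public static string Restore(string text)
    {
        return text.Replace(Marker, "&");
    }
}
```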
Regards
Neil