C# Load XML with special characters inside node

We receive an xml string from an external API, and one element has a bunch of GT/LT signs.
When we run this code, it fails:
var xml = @"<SomeNode>10040:<->10110:<->10130:<->10150:<->10160:<->10180:<->10330:Value=><->10330:Matching=><->10330:Value2=><->10330:Value3=><->10330:Value4=><->10447:<->10418:No<->10419:No<->10430:No
</SomeNode>";
var doc = new XmlDocument();
doc.LoadXml(xml);
// System.Xml.XmlException: 'Name cannot begin with the '-' character, hexadecimal value 0x2D.'
I looked into escaping those characters, but as far as I can tell there isn't a way to escape only the ones inside SomeNode.
So I know that I could run some kind of string replacement using a regex or something to clear that out. But, is there an elegant way to solve this using existing XML related tools?

Based on the comments, there isn't an xml tools solution, and so it'll be a custom string replacement solution.
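One way such a custom replacement could look is sketched below. It is only a sketch under assumptions: the helper name EscapeElementBody is hypothetical, the payload is shortened, and it assumes SomeNode has no attributes, no child elements, and no entities that are already escaped (SecurityElement.Escape would escape their & a second time).

```csharp
using System;
using System.Security;
using System.Text.RegularExpressions;
using System.Xml;

// Shortened payload shaped like the one in the question.
var raw = "<SomeNode>10040:<->10110:<->10330:Value=><->10418:No\n</SomeNode>";

var doc = new XmlDocument();
doc.LoadXml(EscapeElementBody(raw, "SomeNode"));
Console.WriteLine(doc.DocumentElement.InnerText);

// Hypothetical helper: escapes the text between <name> and </name>
// and leaves the tags themselves untouched.
static string EscapeElementBody(string xml, string name)
{
    var tag = Regex.Escape(name);
    var pattern = "(<" + tag + ">)(.*?)(</" + tag + ">)";
    return Regex.Replace(
        xml,
        pattern,
        m => m.Groups[1].Value
             + SecurityElement.Escape(m.Groups[2].Value) // escapes < > & " '
             + m.Groups[3].Value,
        RegexOptions.Singleline); // the body may span several lines
}
```

After loading, InnerText returns the original unescaped value, so the rest of the code never sees the entities.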

Related

Encoding Special Characters Before Processing Invalid XML

I have some invalid XML from a vendor that I need to process. Here is an example:
<a>foo</a>
<b>bar</b>
<c>foobar is < $15</c>
So, we have a few problems. First, there is no root element. I overcome that by adding one. No problem. The second, and more difficult, problem is the less-than symbol. I can just encode the whole thing, but that will encode the XML tags too. Is there a library or simple method out there somewhere for handling this? I really don't want to reinvent the wheel, as I'm sure hundreds of people have dealt with "quasi-XML" like this. Appreciate any help.

I would read the file line by line and use a regex to get the values between the tags. Your example doesn't have nested elements, so this is pretty easy. While reading line by line you can encode the inner values. The named capture group (?<xml>.*?) will get everything between the tags into the group named xml.
var regex = "<.*?>(?<xml>.*?)</.*?>";
var badXML = Regex.Match(line, regex, RegexOptions.IgnoreCase).Groups["xml"].Value;
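To make the line-by-line idea concrete, here is a rough sketch. The helper name WrapAndEscape and the root name are made up for illustration, and it assumes every line holds exactly one flat element with no attributes. The regex is slightly stricter than the one above: it requires the closing tag to match the opening tag.

```csharp
using System;
using System.Collections.Generic;
using System.Security;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

var vendorLines = new[] { "<a>foo</a>", "<b>bar</b>", "<c>foobar is < $15</c>" };
var repairedXml = WrapAndEscape(vendorLines, "root");

var vendorDoc = new XmlDocument();
vendorDoc.LoadXml(repairedXml);
Console.WriteLine(vendorDoc.SelectSingleNode("/root/c").InnerText);

// Hypothetical helper: escapes each line's inner value and wraps all lines in one root element.
static string WrapAndEscape(IEnumerable<string> lines, string rootName)
{
    // <name>value</name> on a single line, same name in both tags.
    var element = new Regex(@"^\s*<(?<name>[\w.-]+)>(?<xml>.*)</\k<name>>\s*$");
    var sb = new StringBuilder("<" + rootName + ">");
    foreach (var line in lines)
    {
        var m = element.Match(line);
        if (!m.Success)
        {
            sb.Append(line); // not a flat element: passed through unchanged
            continue;
        }
        var name = m.Groups["name"].Value;
        sb.Append('<').Append(name).Append('>')
          .Append(SecurityElement.Escape(m.Groups["xml"].Value))
          .Append("</").Append(name).Append('>');
    }
    return sb.Append("</").Append(rootName).Append('>').ToString();
}
```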

Strip < Character from XML content

I have an XML document that contains data with a < character.
<Tunings>
<Notes>Norm <150 mg/dl</Notes>
</Tunings>
The code I am using is:
StreamReader objReader = new StreamReader(strFile);
string strData = objReader.ReadToEnd();
XmlDocument doc = new XmlDocument();
// Here I want to strip those characters from "strData"
doc.LoadXml(strData);
It gives this error:
Name cannot begin with the '1' character, hexadecimal value 0x31.
Is there a way to strip those characters from the XML before LoadXml is called?

If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
<Notes><![CDATA[Norm <150 mg/dl]]></Notes>
The CDATA tag tells XML parsers not to parse the characters between the <![CDATA[ and ]]>. This allows you to have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
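If the producer happens to be .NET code, the CDATA approach could look roughly like the sketch below. This assumes you control the code that creates the file, which the question does not confirm; the variable names are illustrative.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Producer side: wrap the free-text note in a CDATA section.
var tunings = new XElement("Tunings",
    new XElement("Notes", new XCData("Norm <150 mg/dl")));
string produced = tunings.ToString(SaveOptions.DisableFormatting);
Console.WriteLine(produced);

// Consumer side: the document now loads, and the text comes back unchanged.
var consumer = new XmlDocument();
consumer.LoadXml(produced);
string notes = consumer.SelectSingleNode("/Tunings/Notes").InnerText;
Console.WriteLine(notes);
```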

As the comments state, you do not have an XML document. If you know that the only way these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?=\d) with &lt;. This finds a < that is immediately followed by a digit and encodes it properly, leaving the digit itself in place.
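A minimal sketch of that replacement, using a lookahead so the digit itself is kept. The helper name is hypothetical. The approach relies on the fact that an XML name cannot start with a digit, so a < directly before a digit can never open a real tag.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

var strData = "<Tunings>\n<Notes>Norm <150 mg/dl</Notes>\n</Tunings>";

var tuningsDoc = new XmlDocument();
tuningsDoc.LoadXml(EscapeLessThanBeforeDigit(strData));
Console.WriteLine(tuningsDoc.SelectSingleNode("/Tunings/Notes").InnerText); // Norm <150 mg/dl

// Hypothetical helper: encodes only a '<' that is directly followed by a digit.
static string EscapeLessThanBeforeDigit(string xml)
{
    return Regex.Replace(xml, @"<(?=\d)", "&lt;");
}
```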

How should the '\t' character be handled within XML attribute values?

I seem to have found something of an inconsistency between the various XML implementations within .Net 3.5 and I'm struggling to work out which is nominally correct.
The issue is actually fairly easy to reproduce:
Create a simple xml document with a text element containing '\t' characters and give it an attribute that contains '\t' characters:
var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
xmlDoc.Save(@"d:\TabTest.xml");
NB: This means that XmlDocument itself is quite happy with '\t' characters in an attribute value.
Load the document using XmlReader.Create:
var rawFile = XmlReader.Create(@"D:\TabTest.xml");
var rawDoc = new XmlDocument();
rawDoc.Load(rawFile);
Load the document using new XmlTextReader:
var rawFile2 = new XmlTextReader(@"D:\TabTest.xml");
var rawDoc2 = new XmlDocument();
rawDoc2.Load(rawFile2);
Compare the documents in the debugger:
(rawDoc).InnerXml "<test><text attrib=\"Tab' 'space' '\">Tab'\t'space' '</text></test>" string
(rawDoc2).InnerXml "<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>" string
The document read using new XmlTextReader was what I expected, both the '\t' in the text value and attribute value was there as expected.
However, if you look at the document read by XmlReader.Create you find that the '\t' character in the attribute value will have been converted into a ' ' character.
What the....!! :-)
After a bit of a Google search I found that I could encode a '\t' as '&#x9;' - if I used this instead of '\t' in the example XML, both readers work as expected.
Now Altova XmlSpy and various other XML readers seem to be perfectly happy with '\t' characters in attribute values, my question is what is the correct way to handle this?
Should I be writing XML file with '\t' characters encoded in attribute values like XmlReader.Create expects or are the other XML tools right and '\t' characters are valid and XmlReader.Create is broken?
Which way should I go to fix/work around this issue?

Probably something to do with Attribute Value Normalization. For CDATA attributes an XML parser is required to replace newlines and tabs in attribute values by spaces, unless they are written in escaped form as character references.

@all: Thanks for all your answers and comments.
It would seem that Justin and Michael Kay are correct and white space should be encoded according to the W3C XML specifications and that the issue is that a significant number of the MS implementations do not honour this requirement.
In my case, XML specification aside, all I really want is for the attribute values to be correctly persisted - i.e. the values saved should be exactly the values read.
The answer to that is to force the use of an XmlWriter created by using XmlWriter.Create method when saving the XML files in the first place.
While both Dataset and XmlDocument provide save/write mechanisms neither of them correctly encode white space in attributes when used in their default form. If I force them to use a manually created XmlWriter, however, the correct encoding is applied and written to the file.
So the original file save code becomes:
var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
using (var xmlWriter = XmlWriter.Create(@"d:\TabTest.Encoded.xml"))
{
    xmlDoc.Save(xmlWriter);
}
This writer then correctly encodes the white space in a symmetrical way for the XmlReader.Create reader to read without altering the attribute values.
The other thing to note here is that this solution encapsulates the encoding from my code entirely as the reader and writer perform the encoding and decoding transparently on read and write.

Check out XmlReaderSettings.ConformanceLevel. In particular, this description:
Note that XmlReader objects created by the Create method are more compliant by default than the XmlTextReader class. The following are conformance improvements that are not enabled on XmlTextReader, but are available by default on readers created by the Create method

At a glance it seems that XmlTextReader is not compliant with the W3C recommendation. See the section in the recommendation on attribute value normalization, specifically
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
Hence the behaviour that you weren't expecting (seeing a space instead of a tab) is actually the correct recommended behaviour.
I have no idea why XmlTextReader is behaving this way (there is nothing in the documentation), however you seem to have already identified the correct workaround - encode the tab in the attribute as &#x9; instead. In this case the normalised string will contain the tab character itself.
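The normalization rule can be checked with a small sketch like the one below. The helper ReadAttrib is hypothetical; it reads the attribute through a reader from XmlReader.Create, once with a literal tab and once with the &#x9; character reference.

```csharp
using System;
using System.IO;
using System.Xml;

var literalTab = ReadAttrib("<text attrib=\"Tab\tspace\" />");
var encodedTab = ReadAttrib("<text attrib=\"Tab&#x9;space\" />");

Console.WriteLine(literalTab == "Tab space");  // literal tab was normalized to a space
Console.WriteLine(encodedTab == "Tab\tspace"); // character reference survives as a tab

// Hypothetical helper: returns the value of "attrib" on the root element.
static string ReadAttrib(string xml)
{
    using var reader = XmlReader.Create(new StringReader(xml));
    reader.MoveToContent();
    return reader.GetAttribute("attrib");
}
```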

How to stop XDocument.Save writing escape chars

I'm reading XML data from a varchar column in a SQL db, into a linq to sql XElement belonging to an XDocument.
When I execute the XDocument.Save method, the XML is written to file but includes the escape characters. For example, ">" is changed to "&gt;".
Is there an easy way to prevent this?

First, there seems to be no reason to prevent it. Like kenny mentioned, unless special characters are XML encoded, no parser would be able to parse produced XML (because '<' or '>' characters means a lot for that parser). Second, when your parser decodes XML (e.g. you call XElement.Value), all special characters will be converted back to what they originally were. Finally, if you want to keep the original string (e.g. for purposes other than XML parsing), you can use CDATA, which in case of Linq2XML is represented by XCData class.
EDIT: As Rob pointed out, I might have gotten it wrong. If the point is to add existing XML to a document without the special characters being escaped, use the following code:
XDocument document = new XDocument();
var xmlFromDb = "<xml>content</xml>";
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xmlFromDb)))
{
    using (var reader = XmlReader.Create(stream))
    {
        // Parse the stored string as markup instead of adding it as text.
        reader.MoveToContent();
        document.Add(XElement.ReadFrom(reader));
    }
}
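The first two points of this answer (escaping happens on write and is undone on read) can be seen in a short round trip. This is just an illustrative sketch; the element name and text are made up.

```csharp
using System;
using System.Xml.Linq;

var original = "a > b & c < d";
var note = new XElement("note", original);

// Serializing escapes the markup characters in the text...
string serialized = note.ToString();
Console.WriteLine(serialized);

// ...and parsing it again gives back the original string through Value.
string roundTripped = XElement.Parse(serialized).Value;
Console.WriteLine(roundTripped == original);
```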

C# Special Characters not displayed properly in XML

I have a string that contains special characters (a trademark sign, etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows as ??. This is how I'm using it:
string str = xxxx; // special character string
var doc = new XmlDocument();
XmlNode node = doc.CreateElement("node"); // XmlNode is abstract, so create it via the document; "node" is a placeholder name
node.InnerText = str;
I tried HttpUtility.HtmlEncode(xxxx), but it converts the character into "&amp;#8482;", so the XML output is "&#8482;" instead of ™.
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName(), but they give ??.

I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
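A quick way to run that check is sketched below, assuming the document is written as UTF-8. The element name and sample text are made up. If the reloaded text matches, the XML itself is fine and the ?? comes from whatever is displaying it.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

var text = "Acme\u2122"; // \u2122 is the trademark sign

var source = new XmlDocument();
var element = source.CreateElement("name");
element.InnerText = text; // no manual encoding
source.AppendChild(element);

// Save as UTF-8 (without a byte order mark) into memory.
using var buffer = new MemoryStream();
var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
using (var writer = XmlWriter.Create(buffer, settings))
{
    source.Save(writer);
}

// Reload and compare with the original string.
buffer.Position = 0;
var reloaded = new XmlDocument();
reloaded.Load(buffer);
Console.WriteLine(reloaded.DocumentElement.InnerText == text);
```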

I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.
