Writing XMLDocument unescaped in C#


Currently I'm writing XHTML in an XmlDocument. This works perfectly, but I'm stuck on one problem. Some XmlText nodes can contain things like &nbsp;. When I write such nodes to a stream, it uses the InnerXml value instead of the InnerText value. The output is wrong because it now contains &amp;nbsp; instead of &nbsp;. How can I use XmlWriter and XmlDocument without this escaping when writing to a stream? I just want unescaped output.

If you use XmlWriter.WriteRaw, it won't perform any escaping - it assumes you've got raw XML.
For example:
    using System;
    using System.Xml;

    class Test
    {
        static void Main()
        {
            using (XmlWriter writer = XmlWriter.Create(Console.Out))
            {
                writer.WriteStartDocument();
                writer.WriteStartElement("root");
                // WriteRaw emits the string verbatim, so the entity is not escaped.
                writer.WriteRaw("<element>&nbsp;</element>");
                writer.WriteEndElement();
                writer.WriteEndDocument();
            }
        }
    }

Output:

    <?xml version="1.0" encoding="IBM437"?><root><element>&nbsp;</element></root>

You're almost certainly trying to solve the wrong problem here. If you want text with non-breaking spaces, then you should use the non-breaking space character. In a C# string literal you can write it as the escape sequence \u00A0, for example:
    var xmldoc = new XmlDocument();
    XmlElement test = xmldoc.CreateElement("test");
    xmldoc.AppendChild(test);
    // U+00A0 is the non-breaking space character itself, not an entity.
    XmlText nbsp = xmldoc.CreateTextNode("\u00A0");
    test.AppendChild(nbsp);
HTML entities like &nbsp; are just a way to encode such characters in a non-Unicode text file. You shouldn't be using them when constructing an XML DOM. By the way, if you force .NET to write the above DOM to an ASCII-encoded file (via the appropriate XmlWriterSettings), it will probably write the non-breaking space as the numeric character reference &#xA0;. In a UTF-8 encoded file (the default), the character is written as-is and just looks like a space.
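The encoding behaviour described above can be checked with a small sketch. The class and method names here (AsciiXmlDemo, WriteAsAscii) are made up for illustration; the sketch assumes that forcing Encoding.ASCII through XmlWriterSettings is what "the proper XmlWriterSettings" refers to.

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class AsciiXmlDemo
{
    // Builds <test> containing a single U+00A0 character and serializes
    // the document with an ASCII-encoded XmlWriter.
    public static string WriteAsAscii()
    {
        var doc = new XmlDocument();
        XmlElement test = doc.CreateElement("test");
        doc.AppendChild(test);
        test.AppendChild(doc.CreateTextNode("\u00A0"));

        // Hypothetical choice for this sketch: ASCII cannot represent U+00A0,
        // so the writer has to fall back to a character reference.
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

The returned string should contain a numeric character reference in place of the raw character, and loading it back into an XmlDocument yields the original U+00A0 in InnerText.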
If you force certain literal character sequences to appear in the XML output, you risk creating invalid XML that cannot be loaded by conforming XML processors. For example, try to load <test>&nbsp;</test> into an empty XmlDocument. This throws an exception because the nbsp entity is not declared. To be fair, you can declare such entities, and the XHTML DTD does so. But I hope you see my point.
Edit: XmlDocument is doing its job correctly. If it didn't escape characters such as & < > then you could create invalid XML that's impossible to load again. To force an XML entity in the output you should use XmlDocument.CreateEntityReference. The bug is in whatever code is putting entities in XmlText nodes instead of generating XmlEntityReference nodes.
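A minimal sketch of the CreateEntityReference approach follows. The class name EntityReferenceDemo and the surrounding text nodes "a" and "b" are invented for illustration; the document has no DTD, so the entity stays unresolved in the DOM and is only written out by name.

```csharp
using System.Xml;

static class EntityReferenceDemo
{
    // Builds <test> with text, an entity reference node, and more text,
    // then returns the serialized markup.
    public static string Build()
    {
        var doc = new XmlDocument();
        XmlElement test = doc.CreateElement("test");
        doc.AppendChild(test);

        test.AppendChild(doc.CreateTextNode("a"));
        // An XmlEntityReference node is serialized as &name; rather than
        // being escaped like the content of an XmlText node.
        test.AppendChild(doc.CreateEntityReference("nbsp"));
        test.AppendChild(doc.CreateTextNode("b"));

        return doc.OuterXml;
    }
}
```

Note that the resulting markup is only loadable by a parser that knows the nbsp entity, for example through the XHTML DTD, which is exactly the trade-off described above.") throw new System.Exception("unexpected markup: " + xml);
</test>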

Assuming you are using .NET 3.5 or later, learn and use LINQ to XML. The API is simple and more capable. That way you don't need to walk the DOM; you can just query the object tree.
Specifically, look into the XDocument class of the API.

Related

C# Load XML with special characters inside node

We receive an xml string from an external API, and one element has a bunch of GT/LT signs.
When we run this code, it fails:
    var xml = @"<SomeNode>10040:<->10110:<->10130:<->10150:<->10160:<->10180:<->10330:Value=><->10330:Matching=><->10330:Value2=><->10330:Value3=><->10330:Value4=><->10447:<->10418:No<->10419:No<->10430:No
    </SomeNode>";
    var doc = new XmlDocument();
    doc.LoadXml(xml);
    // System.Xml.XmlException: Name cannot begin with the '-' character, hexadecimal value 0x2D
I looked into escaping those characters, but as far as I can tell there isn't a way to escape only the ones inside SomeNode.
So I know that I could run some kind of string replacement using a regex or something to clear that out. But, is there an elegant way to solve this using existing XML related tools?

Based on the comments, there isn't an xml tools solution, and so it'll be a custom string replacement solution.

Strip < Character from XML content

I have an XML document that contains data with a < character.

    <Tunings>
      <Notes>Norm <150 mg/dl</Notes>
    </Tunings>

The code I am using is:

    StreamReader objReader = new StreamReader(strFile);
    string strData = objReader.ReadToEnd();
    XmlDocument doc = new XmlDocument();
    // Here I want to strip those characters from "strData"
    doc.LoadXml(strData);

It gives this error:

    Name cannot begin with the '1' character, hexadecimal value 0x31.

Is there a way to strip those characters from the XML before calling LoadXml?

If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
    <Notes><![CDATA[Norm <150 mg/dl]]></Notes>

The CDATA section tells XML parsers not to parse the characters between <![CDATA[ and ]]>. This allows you to have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
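If the XML is produced from .NET code, one way to emit such a CDATA section is LINQ to XML's XCData class. This is only a sketch: the class and method names (CDataDemo, BuildNotes) are hypothetical, and it assumes you control the code that creates the file.

```csharp
using System.Xml.Linq;

static class CDataDemo
{
    // Wraps arbitrary text in a CDATA section inside a <Notes> element.
    public static string BuildNotes(string text)
    {
        var notes = new XElement("Notes", new XCData(text));
        return notes.ToString();
    }
}
```

Calling CDataDemo.BuildNotes("Norm <150 mg/dl") produces the <Notes> element shown above, and parsing that markup again returns the original text, including the < character, from XElement.Value.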

As the comments state, you do not have an XML document. If you know that the only way these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?=\d) with &lt;. The lookahead matches a < that is immediately followed by a digit without consuming the digit, so only the < is replaced with its escaped form.
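A minimal sketch of that replacement, assuming the stray < is always directly followed by a digit (the class and method names are invented for illustration):

```csharp
using System.Text.RegularExpressions;

static class LooseXmlRepair
{
    // Escapes every '<' that is immediately followed by a digit.
    // (?=\d) is a lookahead, so the digit itself is left in place.
    public static string EscapeLessThanBeforeDigit(string raw)
    {
        return Regex.Replace(raw, @"<(?=\d)", "&lt;");
    }
}
```

Real tags are untouched because XML element names cannot start with a digit. After the repair, XmlDocument.LoadXml accepts the text and the Notes element's InnerText is "Norm <150 mg/dl" again.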

Reading Xml files with umlaut chars

I asked this question yesterday and got a reply:
Writing encoded values for umlauts
In that code, the Parse method works if the input is a string, like so:

    XDocument xDoc = XDocument.Parse("<description>Top Shelf-ÖÄÜookcase</description>");

To pass the input XML file as a string, I have to read it first. The read fails if there are umlauts in the input XML.
How do I get past that?
I tried both the Load and Parse methods of XDocument.
Load:

    Invalid character in the given encoding. Line 3, position 35.

Parse:

    Data at the root level is invalid. Line 1, position 1.

Here is a sample XML file after using CDATA:

    <?xml version="1.0" encoding="utf-8"?>
    <kal>
      <description><![CDATA[Top Shelf-ÖÄÜookcase]]> </description>
    </kal>

Change encoding to "iso-8859-1"

Have you tried wrapping the description data with a CDATA?
    <description><![CDATA[Top Shelf-ÖÄÜookcase]]> </description>
Special characters don't particularly parse well in XML unless you wrap them with CDATA.

As Besi stated, you have to use the correct encoding of the XML file to handle the umlauts correctly.
Even though you said that the creation of the incoming XML file is not in your hands, you can still control the encoding used for parsing by using a dedicated StreamReader:

    // create your XDocument
    XDocument Doc;
    // set up a StreamReader for your file, specifying the encoding you need
    using (StreamReader Reader = new StreamReader(@"C:\your-file.xml", System.Text.Encoding.GetEncoding("ISO-8859-1")))
    {
        // parse the string returned by StreamReader.ReadToEnd()
        Doc = XDocument.Parse(Reader.ReadToEnd());
    }

How to stop XDocument.Save writing escape chars

I'm reading XML data from a varchar column in a SQL database into a LINQ to SQL XElement belonging to an XDocument.
When I execute the XDocument.Save method, the XML is written to file but includes the escape characters. For example, ">" is changed to "&gt;".
Is there an easy way to prevent this?

First, there seems to be no reason to prevent it. Like kenny mentioned, unless special characters are XML encoded, no parser would be able to parse produced XML (because '<' or '>' characters means a lot for that parser). Second, when your parser decodes XML (e.g. you call XElement.Value), all special characters will be converted back to what they originally were. Finally, if you want to keep the original string (e.g. for purposes other than XML parsing), you can use CDATA, which in case of Linq2XML is represented by XCData class.
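The round trip described above can be seen in a short sketch (the class and method names are made up for illustration): the serialized markup contains the escaped form, while XElement.Value hands back the original characters.

```csharp
using System.Xml.Linq;

static class EscapeRoundTripDemo
{
    // Returns the serialized markup for an element holding the given text.
    public static string Serialize(string text)
    {
        return new XElement("item", text).ToString();
    }

    // Parses markup and returns the decoded text content.
    public static string ReadBack(string markup)
    {
        return XElement.Parse(markup).Value;
    }
}
```

For the text "a > b", Serialize yields markup in which > appears as &gt;, and ReadBack on that markup returns "a > b" unchanged, so nothing is lost by leaving the escaping in place.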
EDIT: As Rob pointed out, I might have gotten it wrong. If the point is to add existing XML to a document without its markup being escaped, use the following code:

    // requires: using System.IO; using System.Text; using System.Xml; using System.Xml.Linq;
    XDocument document = new XDocument();
    var xmlFromDb = "<xml>content</xml>";
    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xmlFromDb)))
    {
        using (var reader = XmlReader.Create(stream))
        {
            // Skip to the root element, then read it as a node so it is
            // added as markup rather than as escaped text.
            reader.MoveToContent();
            document.Add(XElement.ReadFrom(reader));
        }
    }

How make XMLDocument do not put spaces on self-closed tags?

I have well-formed XML without any extra whitespace, and it must stay that way.
When I load it into an XmlDocument to sign it, the self-closing tags get an extra space, so

    <cEAN/>

becomes:

    <cEAN />

Because the document must be signed, I can't remove the whitespace afterwards.
The PreserveWhitespace property doesn't make any difference to the result.
How can I change this behavior?

There is no space before the closing "/" in the XmlDocument. XmlDocument is a data structure consisting of nodes. It is binary. It is not text.
Any extra space you are seeing exists only when you serialize the document as text.
Are you actually having a problem with signing, or do you only think you will have such a problem?

I have had this problem before. The XML was signed with a basic hash, so it couldn't change when serialized. I solved it by writing my own serializer so that I could be sure it would output the correct XML.
The basic recipe is to read the XML with an XmlReader and write out each chunk as it comes.

Try this:
    XmlDocument doc;
    ...
    // Strip the space that serialization adds before "/>" in self-closing tags.
    string xmlString = doc.OuterXml.Replace(" />", "/>");
