Efficient way to encode CDATA elements - c#

Ok, I'm reading data from a stream using a StreamReader. The data inside the stream is not xml, it could be anything.
Based on the input StreamReader I'm writing to an output stream using an XmlTextWriter. Basically, when all is said and done, the output stream contains data from the input stream wrapped in an element contained in a parent element.
My problem is twofold. Data gets read from the input stream in chunks, and the StreamReader class returns char[]. If data in the input stream contains a "]]>" it needs to be split across two CDATA elements. First, how do I search for "]]>" in a char array? And second, because I'm reading in chunks, the "]]>" substring could be split across two chunks, so how do I account for this?
I could probably convert the char[] to a string, and do a search replace on it. That would solve my first problem. On each read, I could also check to see if the last character was a "]", so that on the next read, if the first two characters are "]>" I would start a new CDATA section.
This hardly seems efficient because it involves converting the char array to a string, which means spending time copying the data and using twice the memory. Is there a more efficient way, in terms of both speed and memory?

According to HOWTO Avoid Being Called a Bozo When Producing XML:
Don’t bother with CDATA sections
XML provides two ways of escaping markup-significant characters: predefined entities and CDATA sections. CDATA sections are only syntactic sugar. The two alternative syntactic constructs have no semantic difference.
CDATA sections are convenient when you are editing XML manually and need to paste a large chunk of text that includes markup-significant characters (eg. code samples). However, when producing XML using a serializer, the serializer takes care of escaping automatically and trying to micromanage the choice of escaping method only opens up possibilities for bugs.
...
Only <, >, & and (in attribute values) " need escaping.
So long as the small set of special characters are encoded/escaped it should just work.
Whether you have to handle the escaping yourself is a different matter, but certainly a much more straightforward-to-solve problem.
Then just append the whole lot as a child text node to the relevant XML element.

I know of exactly two real use cases for CDATA:
One is in an XHTML document containing script:
<script type="text/javascript">
<![CDATA[
function foo()
{
alert("You don't want <this> text escaped.");
}
]]>
</script>
The other is in hand-authored XML documents where the text contains embedded markup, e.g.:
<p>
A typical XML element looks like this:
</p>
<p>
<pre>
<![CDATA[
<sample>
<text>
I'm using CDATA here so that I don't have to manually escape
all of the special characters in this example.
</text>
</sample>
]]>
</pre>
</p>
In all other cases, just letting the DOM (or the XmlWriter, or whatever tool you're using to create the XML) escape the text nodes works just fine.

second, because I'm reading in chunks, the "]]>" substring could be split across two chunks, so how do I account for this?
Indeed, you would have to keep back the last two characters in a queue instead of spitting them out immediately. Then when new input comes in, append it to the queue and again take all but the last two characters, search-and-replace over them, and output.
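If CDATA really is required, here is a minimal sketch of one way to handle a "]]>" that straddles two reads. Note that it swaps the queue described above for a small counter of consecutive ']' characters that survives between chunks, so nothing has to be held back and the char[] is never converted to a string. The class name CDataSplitter and its Write method are hypothetical, and the caller is assumed to write the opening "<![CDATA[" and the closing "]]>" itself.

```csharp
using System;
using System.IO;

public sealed class CDataSplitter
{
    // Number of consecutive ']' characters seen so far (capped at 2).
    // Kept between calls so a "]]>" split across chunks is still detected.
    private int brackets;

    // Writes chunk[0..count) as CDATA content, splitting the section at "]]>".
    public void Write(TextWriter output, char[] chunk, int count)
    {
        for (int i = 0; i < count; i++)
        {
            char c = chunk[i];
            if (c == '>' && brackets == 2)
            {
                // "]]" has already been written: close this section and
                // open a new one so that '>' starts the next section.
                output.Write("]]><![CDATA[");
            }
            brackets = c == ']' ? Math.Min(brackets + 1, 2) : 0;
            output.Write(c);
        }
    }
}
```

Writing one character at a time keeps the sketch short; a real implementation would probably write runs of characters between split points instead.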
Better: don't bother with a CDATA section at all. They're only there for the convenience of hand-authoring. If you're already doing search-and-replace, there's no reason you shouldn't just search-and-replace ‘<’, ‘>’ and ‘&’ with their predefined entities, and include those in a normal Text node. Since those are simple single-character replacements, you don't need to worry about buffering.
But: if you're using an XmlTextWriter as you say, it's as simple as calling WriteString() on it for each chunk of incoming text.
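As a rough sketch of that last suggestion, the loop below copies chunks from a TextReader straight into a text node and lets the writer do the escaping. It uses XmlWriter.WriteChars rather than WriteString so that no intermediate string is created. The element names "root" and "data", the class name ChunkedTextWriter and the buffer size are assumptions made for illustration. It also assumes the default writer settings are acceptable: characters that are illegal in XML (most control characters, for example) will still make the writer throw, and newline handling follows the writer's defaults.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ChunkedTextWriter
{
    // Copies all text from 'input' into <root><data>...</data></root>.
    public static void Copy(TextReader input, TextWriter output, int bufferSize = 4096)
    {
        if (bufferSize < 2) throw new ArgumentOutOfRangeException(nameof(bufferSize));

        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("root");
            writer.WriteStartElement("data");

            char[] buffer = new char[bufferSize];
            int held = 0; // 1 when a high surrogate is carried into the next chunk
            int read;
            while ((read = input.Read(buffer, held, buffer.Length - held)) > 0)
            {
                int count = held + read;
                // Never hand the writer the first half of a surrogate pair.
                held = char.IsHighSurrogate(buffer[count - 1]) ? 1 : 0;
                if (count - held > 0)
                {
                    writer.WriteChars(buffer, 0, count - held); // escapes <, > and &
                }
                if (held == 1)
                {
                    buffer[0] = buffer[count - 1];
                }
            }
            // A dangling high surrogate at the very end of the input is dropped.

            writer.WriteEndElement();
            writer.WriteEndElement();
        }
    }
}
```

Because the escaping works one character at a time, a "]]>" that spans two reads needs no special treatment here.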

Related

XML: how to pre-parse when only SOME data is escaped?

XML snippet:
<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>
I'm looking for suggestions on how I could go about pre-parsing any XML to escape everything not escaped prior to running the XML through a parser?
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue is that feeding the XML as-is into a parser, such as the one below, throws an exception because some of the content is not escaped properly:
string xml = "<field>& is not escaped</field>";
XmlReader.Create(new StringReader(xml));
I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
This question is helpful as it gives you a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");
And then you'll be able to parse your XML.
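Put together, the approach could look like the sketch below. The class name AmpersandFixer and its two methods are hypothetical, and the pattern only recognises the five predefined entities and anything that starts with "&#", so it is a best-effort repair rather than a validator.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandFixer
{
    // Matches '&' that does not begin a predefined entity or a character reference.
    private static readonly Regex RogueAmpersand =
        new Regex("&(?!(?:apos|quot|[gl]t|amp);|#)");

    // Escapes only the ampersands that are not already part of an entity.
    public static string Escape(string text)
    {
        return RogueAmpersand.Replace(text, "&amp;");
    }

    // Repairs the text first, then parses it as a single element.
    public static XElement Parse(string text)
    {
        return XElement.Parse(Escape(text));
    }
}
```

Calling AmpersandFixer.Parse on one of the unescaped fields from the question should then succeed where parsing the raw text throws.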
Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.
If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like &nbsp; when there's no definition of &nbsp;, then life starts to become rather more difficult.
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.

Strip < Character from XML content

I have an XML Document where it contains data with < character.
<Tunings>
<Notes>Norm <150 mg/dl</Notes>
</Tunings>
The code I am using is:
StreamReader objReader = new StreamReader(strFile);
string strData = objReader.ReadToEnd();
XmlDocument doc = new XmlDocument();
// Here I want to strip those characters from "strData"
doc.LoadXml(strData);
So it gives error:
Name cannot begin with the '1' character, hexadecimal value 0x31.
So is there a way to strip those characters from XML before Load calls.?
If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
<Notes><![CDATA[Norm <150 mg/dl]]></Notes>
The CDATA tag tells XML parsers to not parse the characters between the <![CDATA[ and ]]>. This allows you to have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
As the comments state, you do not have an XML document. If you know that the only way that these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?=\d) with &lt;. This will find each < that is immediately followed by a digit and properly encode it, leaving the digit itself in place.
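A minimal sketch of that repair, assuming the stray "<" is always directly followed by a digit as in the sample. The class name LessThanFixer is hypothetical. A lookahead is used so that the digit is not consumed by the match.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class LessThanFixer
{
    // Encodes every '<' that is immediately followed by a digit, then loads the result.
    public static XmlDocument Load(string text)
    {
        // (?=\d) only peeks at the digit, so "<150" becomes "&lt;150".
        string repaired = Regex.Replace(text, @"<(?=\d)", "&lt;");
        var doc = new XmlDocument();
        doc.LoadXml(repaired);
        return doc;
    }
}
```

The Notes text keeps its meaning this way: after loading, InnerText still contains the "<" character.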

Writing xml with at most two tags per line

I am saving xml from .NET's XElement. I've been using the method ToString, but the formatting doesn't look how I'd like (examples below). I'd like at most two tags per line. How can I achieve that?
Saving XElement.Parse("<a><b><c>one</c><c>two</c></b><b>three<c>four</c><c>five</c></b></a>").ToString() gives me
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three<c>four</c><c>five</c></b>
</a>
But for readability I would rather 'three', 'four' and 'five' were on separate lines:
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three
<c>four</c>
<c>five</c>
</b>
</a>
Edit: Yes I understand this is syntactically different and "not in the spirit of xml", but I'm being pragmatic. Recently I've seen megabyte-size xml files with as few as 3 lines—these are challenging to text editors, source control, and diff tools. Something needs to be done! I've tested that changing the formatting above is compatible with our application.
If you want exactly that output, you'll need to do it manually, adding whitespace around nodes as necessary.
Almost all whitespace in XML documents is significant, even if we only think of it as indenting. A serializer that indents the document for us is changing the content that can be extracted from it, so serializers try to be as conservative as possible. The elements
<tag>foo</tag>
and
<tag>
foo
</tag>
have different content, and if a serializer changed the former into the latter, it would change what you get back from your XML API when asking for the contents of <tag>.
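A small sketch that makes the difference visible with LINQ to XML. The class name WhitespaceDemo is hypothetical, and the two-space indent is just an example.

```csharp
using System.Xml.Linq;

public static class WhitespaceDemo
{
    // Content of the compact form.
    public static string Compact()
    {
        return XElement.Parse("<tag>foo</tag>").Value;
    }

    // Content of the indented form: the line breaks and spaces around "foo"
    // belong to the text node, so they come back as part of the value.
    public static string Indented()
    {
        return XElement.Parse("<tag>\n  foo\n</tag>").Value;
    }
}
```

The two values only compare equal after trimming, which is exactly the kind of change an indenting serializer is not allowed to make silently.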
The usual rule of thumb is that no indenting will be applied if there's any existing non-whitespace between the elements. In this case, your three between the tags would be modified if a serializer applied the indenting you desire, so nothing will do it for you automatically.
If you have control over the XML format, it's inadvisable to mix element and text children like this, where <b> has both text (three) and element (<c>) children, as it causes issues like what you're seeing.
The formatting isn't working the way you want because of the naked "three". Is there a reason it's not in its own tag? Should it be an attribute of "b" instead?
Explained reasons to colleagues - we're going to change the file format. I recommend you try to do the same. It's nigh impossible to do what I wanted, because most xml tools assume whitespace is significant.
XML is an information exchange format, intended for computers. The whitespace is irrelevant (depending on location and schema, really) and as such, it would be arbitrary to use one or the other.
You could use XmlTextWriter with XElement.Save and see whether you can tweak it to your liking with the XmlWriter.Settings Property
I've had to do something similar before (for a client request). All I ended up doing was writing a custom .ToString() method used only for displaying the XML in a browser (ugh, I know) or for letting them download an XML file of the content. Because the code did not have to be computationally efficient, it was merely a matter of checking the children of each tag and arranging the 'hanging' text as such.
Eventually we were able to convince the user that the text should be an attribute instead.

What is the proper way to store a file name in XML?

I'm using XDocument to cache a list of files.
<file id="20" size="244318208">a file with an &amp;ersand.txt</file>
In this example, I used XText, and let it automatically escape characters in the file name, such as the & with &amp;
<file id="20" size="244318208"><![CDATA[a file with an &ersand.txt]]></file>
In this one, I used XCData to let me use a literal string rather than an escaped one, so it appears in the XML as it would in my application.
I'm wondering if either of them is better than the other under any certain conditions, or if it is just personal taste. Also, if it means anything, the file names may or may not contain illegal characters.
I wouldn't explicitly use either XText or XCData - I'd just provide a string and let LINQ to XML do whatever it wants.
I do think the non-CDATA version is generally clearer though. Yes, ampersands are escaped - and < will be too - but that's still considerably less fluff than the CDATA start/end section.
Don't forget that it should be pretty rare for humans to see the XML representation itself - the idea is that it's a transport for information which is reasonably readable in that representation when you need to. I wouldn't get too hung up about it.
Both are essentially the same and there is no specific "best practice".
Personally, I reserve <![CDATA[]]> for large amounts of text that requires lots of escaping (say bits of code or HTML markup).
In this specific case, I would rather escape the & to &amp; as in your first example.
Most file names will not contain ampersands or less-than symbols. So go with XText. Reserve XCData for cases where you expect a lot of those characters, such as when embedding an HTML fragment in an XML document.
Rationale: difference in CPU utilization to serialize and parse text are completely negligible. But there is a (small) difference in storage, bandwidth or memory needs. Everything else being equal, use the format that uses the least space (even if the differences are small).
It doesn't matter.
They're both valid XML, and they both have the same meaning.
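A short sketch that shows the equivalence in LINQ to XML. The class name FileNameDemo is hypothetical, and the file name is the one from the question.

```csharp
using System.Xml.Linq;

public static class FileNameDemo
{
    private const string Name = "a file with an &ersand.txt";

    // Plain string content: LINQ to XML escapes the '&' when serializing.
    public static XElement AsText()
    {
        return new XElement("file", Name);
    }

    // CDATA content: the '&' is written literally inside the section.
    public static XElement AsCData()
    {
        return new XElement("file", new XCData(Name));
    }
}
```

The serialized forms differ, but the Value of both elements is the same string, so code reading the file back cannot tell them apart unless it inspects node types.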

Clean out/replace invalid XML characters in element attributes

UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are two reasons why it isn't accepted as valid XML.
I hope someone has a solution for fixing these bugs gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE=&quot;C01&quot;>WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add CDATA markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the <s and >s. They're at least escaping embedded quotes, so a regex that matches from one literal " to the next " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"
