I have occasionally run across XML with some junk characters tossed in between the elements, which appears to be confusing whatever internal XNode/XElement method handles prettifying the Element.
The following...
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>");
Console.WriteLine(badNode);
prints out
<b>+
<inner1 /><inner2 /></b>
while this...
var badNode = XElement.Parse(@"<b>
<inner1/>
<inner2/>
</b>");
Console.WriteLine(badNode);
gives the expected
<b>
<inner1 />
<inner2 />
</b>
According to the debugger, the junk character gets parsed in as the XElement's "NextNode" property, which then apparently assigns the remaining XML as its "NextNode", causing the single line prettifying.
Is there any way to prevent/ignore this behavior, short of pre-screening the XML for any errant characters in between tag markers?
You are getting awkward indentation for badNode because, by adding the non-whitespace + character to the <b> element's value, you have given the element mixed content, which the W3C defines as follows:
3.2.2 Mixed Content
[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]
The presence of mixed content inside an element triggers special formatting rules for XmlWriter (which is used internally by XElement.ToString() to actually write itself to an XML string) that are explained in the documentation remarks for XmlWriterSettings.Indent:
This property only applies to XmlWriter instances that output text content; otherwise, this setting is ignored.
The elements are indented as long as the element does not contain mixed content. Once the WriteString or WriteWhitespace method is called to write out a mixed element content, the XmlWriter stops indenting. The indenting resumes once the mixed content element is closed.
This explains the behavior you are seeing.
As a workaround, parsing your XML with LoadOptions.PreserveWhitespace, which preserves insignificant white space while parsing, might be what you want:
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>",
LoadOptions.PreserveWhitespace);
Console.WriteLine(badNode);
Which outputs:
<b>+
<inner1 />
<inner2 />
</b>
Alternatively, if you are sure that badNode should not have character data, you could strip it manually after parsing:
badNode.Nodes().OfType<XText>().Remove();
Now badNode will no longer contain mixed content and XmlWriter will indent it nicely.
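For reference, here is a minimal end-to-end sketch of that second workaround. The class and method names (MixedContentCleaner, StripText) are made up for illustration, and it assumes that any text sitting directly under the element is junk you are happy to lose. Note that XCData derives from XText, so CDATA children are removed as well.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class MixedContentCleaner
{
    // Hypothetical helper: removes every text node (including CDATA) that is
    // a direct child of the element. Text inside child elements is untouched.
    public static XElement StripText(XElement element)
    {
        element.Nodes().OfType<XText>().Remove();
        return element;
    }
}
```

Calling MixedContentCleaner.StripText(badNode) before Console.WriteLine(badNode) should leave <b> with only its two child elements, so the writer no longer sees mixed content.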
Related
I have an XML document that contains data with a < character.
<Tunings>
<Notes>Norm <150 mg/dl</Notes>
</Tunings>
The code I am using is:
StreamReader objReader = new StreamReader(strFile);
string strData = objReader.ReadToEnd();
XmlDocument doc = new XmlDocument();
// Here I want to strip those characters from "strData"
doc.LoadXml(strData);
So it gives error:
Name cannot begin with the '1' character, hexadecimal value 0x31.
So is there a way to strip those characters from the XML before calling Load?
If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
<Notes><![CDATA[Norm <150 mg/dl]]></Notes>
The CDATA tag tells XML parsers not to parse the characters between the <![CDATA[ and ]]>. This allows you to have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
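If the file is produced from .NET code, a sketch along these lines shows both ways of getting the text out safely. This assumes the producer can use LINQ to XML; the NotesWriter class and its method names are invented for the example.

```csharp
using System.Xml.Linq;

public static class NotesWriter
{
    // Plain text content: the serializer escapes "<" as "&lt;" on output.
    // "Norm <150 mg/dl" becomes <Notes>Norm &lt;150 mg/dl</Notes>
    public static string Escaped(string text) =>
        new XElement("Notes", text).ToString();

    // CDATA content: the text is written verbatim inside <![CDATA[ ... ]]>.
    // "Norm <150 mg/dl" becomes <Notes><![CDATA[Norm <150 mg/dl]]></Notes>
    public static string AsCData(string text) =>
        new XElement("Notes", new XCData(text)).ToString();
}
```

Either output loads without the "Name cannot begin with" error, and a reader gets back the original string including the < character.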
As the comments state, you do not have an XML document. If you know that the only way these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?=\d) with &lt;. This will find each < immediately followed by a digit and properly encode it.
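A rough sketch of that regex repair, assuming the only defect really is a bare < directly in front of a digit. The pattern uses a lookahead, (?=\d), so the digit itself is not consumed by the replacement. LessThanFixer and LoadLenient are names made up for this example.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class LessThanFixer
{
    // Matches "<" only when the next character is a digit. The lookahead
    // does not consume the digit, so it survives the replacement.
    private static readonly Regex LtBeforeDigit = new Regex(@"<(?=\d)");

    // Hypothetical helper: repairs the string, then loads it as usual.
    public static XmlDocument LoadLenient(string almostXml)
    {
        string repaired = LtBeforeDigit.Replace(almostXml, "&lt;");
        var doc = new XmlDocument();
        doc.LoadXml(repaired);
        return doc;
    }
}
```

With the sample above, LoadLenient(strData) should load cleanly and the Notes text comes back as "Norm <150 mg/dl". Anything else that is broken in the input (a stray &, a < before a letter) will still throw.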
UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are two problems that prevent it from being accepted as valid XML.
I hope someone has a solution for fixing these bugs gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE=&quot;C01&quot;>WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add CDATA markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the <s and >s. They're at least escaping embedded quotes, so a regex to go from " to " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
    match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"
I have following XElement:
<title>
<bold>Foo</bold>
<italic>Bar</italic>
</title>
When I get Value property it returns FooBar without space. How to fix it?
By definition, the Value of the <title> element is the concatenation of all text in this element. By default whitespace between elements and their contents is ignored, so it gives "FooBar". You can specify that you want to preserve whitespace:
var element = XElement.Parse(xml, LoadOptions.PreserveWhitespace);
However it will preserve all whitespace, including the line feeds and indentation. In your XML, there is a line feed and two spaces between "Foo" and "Bar"; how is it supposed to guess that you only want to keep one space?
From the documentation for the Value property of the XElement class:
Gets or sets the concatenated text contents of this element.
Given your example, this behavior is expected. If you want spaces, you will have to provide the logic to do it.
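One possible way to provide that logic, as a sketch: collect the descendant text nodes, trim each one, and join the non-empty pieces with a single space. TextJoiner and SpacedValue are invented names, and the "one space between runs" rule is an assumption about what you want.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class TextJoiner
{
    // Hypothetical helper: like XElement.Value, but with a single space
    // between the text of neighbouring descendants. Whitespace-only text
    // nodes are dropped, so it behaves the same whether or not the XML
    // was parsed with LoadOptions.PreserveWhitespace.
    public static string SpacedValue(XElement element) =>
        string.Join(" ", element.DescendantNodes()
            .OfType<XText>()
            .Select(t => t.Value.Trim())
            .Where(s => s.Length > 0));
}
```

For the <title> element in the question, TextJoiner.SpacedValue(title) gives "Foo Bar".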
I am using LINQ to XML (XLinq) to parse an XML document, and one part of the document represents rich text and uses the xml:space="preserve" attribute to preserve whitespace within the rich-text element.
The issue I'm experiencing is that when I have an element inside the rich-text which only contains a sub-element but no text, XLinq reformats the XML and puts the element on its own line. This, of course, causes additional white space to be created which changes the original content.
Example:
<rich-text xml:space="preserve">
<text-run><br/></text-run>
</rich-text>
results in:
<rich-text xml:space="preserve">
<text-run>
<br/>
</text-run>
</rich-text>
If I add a space or any other text before the <br/> in the original xml like so
<rich-text xml:space="preserve">
<text-run> <br/></text-run>
</rich-text>
the parser doesn't reformat the xml
<rich-text xml:space="preserve">
<text-run> <br/></text-run>
</rich-text>
How can I prevent the xml parser from reformatting my element?
Is this reformatting normal for XML parsing or is this just an unwanted side effect of the XLinq parser?
EDIT:
I am parsing the document like this:
using (var reader = System.Xml.XmlReader.Create(stream))
return XElement.Load(reader);
I am not using any custom XmlReaderSettings or LoadOptions
The problem occurs when I use the .Value property on the text-run XElement to get the text value of the element. Instead of receiving \n which would be the correct output from the original xml, I will receive
\n \n
Note the additional whitespace and line break due to the reformatting! The reformatting can also be observed when inspecting the element in the debugger or calling .ToString().
Have you tried this:
yourXElement.ToString(SaveOptions.DisableFormatting)
This should solve your problem.
btw - you should also do a similar thing on load:
XElement.Load(sr, LoadOptions.PreserveWhitespace);
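Putting the two options together, a small sketch of a round trip that should leave the text-run element alone. RichTextRoundTrip and Reserialize are names invented for the example; it takes the XML as a string for simplicity.

```csharp
using System.Xml.Linq;

public static class RichTextRoundTrip
{
    // Hypothetical helper: parse without discarding whitespace, then
    // serialize without adding indentation.
    public static string Reserialize(string xml)
    {
        XElement element = XElement.Parse(xml, LoadOptions.PreserveWhitespace);
        return element.ToString(SaveOptions.DisableFormatting);
    }
}
```

With the example from the question, the result still contains <text-run><br /></text-run> on a single line. The writer emits the empty element as <br /> rather than <br/>, which is the same thing to a parser.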
Ok, I'm reading data from a stream using a StreamReader. The data inside the stream is not xml, it could be anything.
Based on the input StreamReader I'm writing to an output stream using an XmlTextWriter. Basically, when all is said and done, the output stream contains data from the input stream wrapped in an element contained in a parent element.
My problem is twofold. Data gets read from the input stream in chunks, and the StreamReader class returns char[]. If data in the input stream contains a "]]>" it needs to be split across two CDATA elements. First, how do I search for "]]>" in a char array? And second, because I'm reading in chunks, the "]]>" substring could be split across two chunks, so how do I account for this?
I could probably convert the char[] to a string, and do a search replace on it. That would solve my first problem. On each read, I could also check to see if the last character was a "]", so that on the next read, if the first two characters are "]>" I would start a new CDATA section.
This hardly seems efficient because it involves converting the char array to a string, which means spending time to copy the data, and eating up twice the memory. Is there a more efficient way, both speedwise and memory wise?
According to HOWTO Avoid Being Called a Bozo When Producing XML:
Don’t bother with CDATA sections
XML provides two ways of escaping markup-significant characters: predefined entities and CDATA sections. CDATA sections are only syntactic sugar. The two alternative syntactic constructs have no semantic difference.
CDATA sections are convenient when you are editing XML manually and need to paste a large chunk of text that includes markup-significant characters (eg. code samples). However, when producing XML using a serializer, the serializer takes care of escaping automatically and trying to micromanage the choice of escaping method only opens up possibilities for bugs.
...
Only <, >, & and (in attribute values) " need escaping.
So long as the small set of special characters are encoded/escaped it should just work.
Whether you have to handle the escaping yourself is a different matter, but certainly a much more straightforward-to-solve problem.
Then just append the whole lot as a child text node to the relevant XML element.
I know of exactly two real use cases for CDATA:
One is in an XHTML document containing script:
<script type="text/javascript">
<![CDATA[
function foo()
{
alert("You don't want <this> text escaped.");
}
]]>
</script>
The other is in hand-authored XML documents where the text contains embedded markup, e.g.:
<p>
A typical XML element looks like this:
</p>
<p>
<pre>
<![CDATA[
<sample>
<text>
I'm using CDATA here so that I don't have to manually escape
all of the special characters in this example.
</text>
</sample>
]]>
</pre>
</p>
In all other cases, just letting the DOM (or the XmlWriter, or whatever tool you're using to create the XML) escape the text nodes works just fine.
second, because I'm reading in chunks, the "]]>" substring could be split across two chunks, so how do I account for this?
Indeed, you would have to keep back the last two characters in a queue instead of spitting them out immediately. Then when new input comes in, append it to the queue and again take all but the last two characters, search-and-replace over them, and output.
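If you do want to stay with CDATA, here is one way the hold-back idea could look, working directly on the char[] chunks with no string conversion. This is only a sketch: ChunkedCData is a made-up name, it writes raw markup to a TextWriter rather than going through XmlTextWriter, and instead of a queue it just counts how many trailing ']' characters are being held back (at most two). When a '>' arrives while two ']' are pending, the section is closed and reopened between "]]" and ">".

```csharp
using System.IO;

public static class ChunkedCData
{
    // Hypothetical helper: copies input into one or more CDATA sections,
    // splitting wherever the data itself contains "]]>".
    public static void Write(TextReader input, TextWriter output, int bufferSize = 4096)
    {
        var buffer = new char[bufferSize];
        int pending = 0; // number of ']' characters held back (0..2)
        int read;

        output.Write("<![CDATA[");
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (c == ']')
                {
                    // A third ']' pushes the oldest held one out.
                    if (pending == 2) output.Write(']');
                    else pending++;
                }
                else if (c == '>' && pending == 2)
                {
                    // "]]>" in the data: end this section after "]]",
                    // start a new one that begins with ">".
                    output.Write("]]]]><![CDATA[>");
                    pending = 0;
                }
                else
                {
                    // Not part of a terminator: release held ']' and the char.
                    for (int k = 0; k < pending; k++) output.Write(']');
                    output.Write(c);
                    pending = 0;
                }
            }
        }
        // End of input: release anything still held, then close the section.
        for (int k = 0; k < pending; k++) output.Write(']');
        output.Write("]]>");
    }
}
```

Because the pending count survives from one Read to the next, it does not matter where the chunk boundary falls. For instance, the input a]]>b]] read two characters at a time comes out as <![CDATA[a]]]]><![CDATA[>b]]]]>, which a parser reads back as the original text. Writing character by character is not the fastest approach; it keeps the sketch short.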
Better: don't bother with a CDATA section at all. They're only there for the convenience of hand-authoring. If you're already doing search-and-replace, there's no reason you shouldn't just search-and-replace ‘<’, ‘>’ and ‘&’ with their predefined entities, and include those in a normal Text node. Since those are simple single-character replacements, you don't need to worry about buffering.
But: if you're using an XmlTextWriter as you say, it's as simple as calling WriteString() on it for each chunk of incoming text.
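To make that concrete, a sketch of the no-CDATA version. It uses XmlWriter.WriteChars, which takes the char[] chunk directly and escapes it like WriteString would, so no intermediate string is needed. StreamWrapper and the element names "parent" and "data" are placeholders, and XmlWriter.Create is used here in place of XmlTextWriter. NewLineHandling.Entitize is set on the assumption that you want line endings in the data to survive a round trip unchanged.

```csharp
using System.IO;
using System.Xml;

public static class StreamWrapper
{
    // Hypothetical helper: wraps arbitrary text in <parent><data>...</data></parent>.
    public static void Wrap(TextReader input, TextWriter output, int bufferSize = 4096)
    {
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            NewLineHandling = NewLineHandling.Entitize
        };

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("parent"); // placeholder name
            writer.WriteStartElement("data");   // placeholder name

            var buffer = new char[bufferSize];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // The writer escapes markup characters in each chunk.
                // Caveat: a surrogate pair must not be split across two
                // calls, and characters that are not legal in XML may
                // make the writer throw.
                writer.WriteChars(buffer, 0, read);
            }

            writer.WriteEndElement();
            writer.WriteEndElement();
        }
    }
}
```

Since each character is escaped on its own, a "]]>" that straddles two chunks needs no special handling here.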