I am using LINQ to XML (XLinq) to parse an XML document. One part of the document represents rich text and uses the xml:space="preserve" attribute to preserve whitespace within the rich-text element.
The issue I'm experiencing is that when an element inside the rich text contains only a sub-element and no text, XLinq reformats the XML and puts the sub-element on its own line. This, of course, creates additional whitespace that changes the original content.
Example:
<rich-text xml:space="preserve">
<text-run><br/></text-run>
</rich-text>
results in:
<rich-text xml:space="preserve">
<text-run>
<br/>
</text-run>
</rich-text>
If I add a space or any other text before the <br/> in the original xml like so
<rich-text xml:space="preserve">
<text-run> <br/></text-run>
</rich-text>
the parser doesn't reformat the xml
<rich-text xml:space="preserve">
<text-run> <br/></text-run>
</rich-text>
How can I prevent the xml parser from reformatting my element?
Is this reformatting normal for XML parsing or is this just an unwanted side effect of the XLinq parser?
EDIT:
I am parsing the document like this:
using (var reader = System.Xml.XmlReader.Create(stream))
return XElement.Load(reader);
I am not using any custom XmlReaderSettings or LoadOptions
The problem occurs when I use the .Value property on the text-run XElement to get the text value of the element. Instead of receiving \n which would be the correct output from the original xml, I will receive
\n \n
Note the additional whitespace and line break due to the reformatting! The reformatting can also be observed when inspecting the element in the debugger or calling .ToString().
Have you tried this:
yourXElement.ToString(SaveOptions.DisableFormatting)
This should solve your problem.
By the way, you should also do a similar thing on load (Parse takes a string, so use Load when reading from a reader such as sr):
XElement.Load(sr, LoadOptions.PreserveWhitespace);
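To see both options working together, here is a minimal sketch of a round trip. The class and method names (WhitespaceRoundTrip, RoundTrip) are made up for illustration, and the input is passed as a string rather than a stream, which is an assumption that keeps the example self-contained.

```csharp
using System;
using System.Xml.Linq;

static class WhitespaceRoundTrip
{
    // Hypothetical helper: parse while keeping every whitespace node,
    // then serialize without letting the writer re-indent anything.
    public static string RoundTrip(string xml)
    {
        XElement element = XElement.Parse(xml, LoadOptions.PreserveWhitespace);
        return element.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        string xml =
            "<rich-text xml:space=\"preserve\">\n" +
            "<text-run><br/></text-run>\n" +
            "</rich-text>";

        // The <br/> element should stay on the same line as <text-run>.
        Console.WriteLine(RoundTrip(xml));
    }
}
```

Note that the writer may still normalize details that are not whitespace content, such as writing the empty element as <br />.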
I have occasionally run across XML with some junk characters tossed in between the elements, which appears to be confusing whatever internal XNode/XElement method handles prettifying the Element.
The following...
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>");
prints out
<b>+
<inner1 /><inner2 /></b>
while this...
var badNode = XElement.Parse(@"<b>
<inner1/>
<inner2/>
</b>");
gives the expected
<b>
<inner1 />
<inner2 />
</b>
According to the debugger, the junk character gets parsed in as the XElement's "NextNode" property, which then apparently assigns the remaining XML as its "NextNode", causing the single line prettifying.
Is there any way to prevent/ignore this behavior, short of pre-screening the XML for any errant characters in between tag markers?
You are getting awkward indentation for badNode because, by adding the non-whitespace + character to the <b> element's value, you have given the element mixed content, which the W3C defines as follows:
3.2.2 Mixed Content
[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]
The presence of mixed content inside an element triggers special formatting rules for XmlWriter (which is used internally by XElement.ToString() to actually write itself to an XML string) that are explained in the documentation remarks for XmlWriterSettings.Indent:
This property only applies to XmlWriter instances that output text content; otherwise, this setting is ignored.
The elements are indented as long as the element does not contain mixed content. Once the WriteString or WriteWhitespace method is called to write out a mixed element content, the XmlWriter stops indenting. The indenting resumes once the mixed content element is closed.
This explains the behavior you are seeing.
As a workaround, parsing your XML with LoadOptions.PreserveWhitespace, which preserves insignificant white space while parsing, might be what you want:
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>",
LoadOptions.PreserveWhitespace);
Console.WriteLine(badNode);
Which outputs:
<b>+
<inner1 />
<inner2 />
</b>
Demo fiddle #1 here.
Alternatively, if you are sure that badNode should not have character data, you could strip it manually after parsing:
badNode.Nodes().OfType<XText>().Remove();
Now badNode will no longer contain mixed content and XmlWriter will indent it nicely.
Demo fiddle #2 here.
I want to use an XML document to store print strings that will be sent to an IPL printer along with some other data using C#.
Is there an attribute I can assign to a node so that any XML syntax within that node is ignored and its content is treated as a plain string? For example, the node below, called string, contains the print job that I want to send to the printer. The problem is that the string has XML-style tags within it, which causes formatting issues when viewed in Visual Studio. I am using the C# serializer to read the XML tag and copy its contents to a string. What would be the best way to accomplish this task?
<string>
<STX>R<ETX>
<STX><ESC>C<SI>W565<SI>h<ETX>
<STX><ESC>P<ETX>
<STX>E3;F3<ETX>
<STX>U1,LOGO;f3;o130,30;c2;h0,w0<ETX>
<STX>H2,A1;f3;o130,220;c26;b0;h12;w12;d3,PART NO:<ETX>
<STX>H3,B1;f3;o35,220;c26;b0;h8;w7;d3,Date Code<ETX>
<STX>H4,C1;f3;o35,390;c26;b0;h8;w7;d3,Supplier Code<ETX>
<STX>H5,D1;f3;o15,220;c26;b0;h8;w7;d3,lss 1<ETX>
<STX>H6,E1;f3;o15,435;c26;b0;h8;w7;d3,ASSEMBLED IN USA<ETX>
<STX>B7,CODEA;f3;o95,220;c0,3;w2;h55;r0;d0,10<ETX>
<STX>H8,DATAA;f3;o130,350;c26;b0;h12;w12;d0;10<ETX>
<STX>H9,DATAB;f3;o35,300;c26;b0;h8;w7;d0,8<ETX>
<STX>H10,DATAC;f3;o35,510;c26;b0;h8;w7;d0,7<ETX>
<STX>R<ETX>
<STX><ESC>E3<CAN><ETX>
<STX><ESC>F7<DEL>var0<ETX>
<STX><ESC>F8<DEL>var1<ETX>
<STX><ESC>F9<DEL>var2<ETX>
<STX><ESC>F10<DEL>var3<ETX>
<STX><RS>1<US>1<ETX>
<STX><ETB><FF><ETX>
</string>
You can turn xml parsing off for your string by using CDATA.
<string>
<![CDATA[
<STX>R<ETX>
[...]
<STX><ETB><FF><ETX>
]]>
</string>
All content between the opening and closing CDATA tag will strictly be treated as string. For more see: What does CDATA in XML mean?
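As a rough illustration of reading such a node from C#, the sketch below uses LINQ to XML rather than the serializer mentioned in the question; that choice, and the names CDataExample, ReadPrintJob and WritePrintJob, are assumptions made for the example.

```csharp
using System;
using System.Xml.Linq;

static class CDataExample
{
    // Hypothetical helper: returns the raw text of the root element.
    // CDATA content comes back verbatim, angle brackets included.
    public static string ReadPrintJob(string xml)
    {
        return XElement.Parse(xml).Value;
    }

    // Hypothetical helper: wraps a print job in a CDATA section on output.
    public static string WritePrintJob(string job)
    {
        return new XElement("string", new XCData(job))
            .ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        string xml = "<string><![CDATA[<STX>R<ETX>]]></string>";
        Console.WriteLine(ReadPrintJob(xml));
        Console.WriteLine(WritePrintJob("<STX>R<ETX>"));
    }
}
```

One limitation to keep in mind: a CDATA section cannot itself contain the sequence ]]>, so a print job that includes it would need to be split or escaped differently.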
I have an XML Document where it contains data with < character.
<Tunings>
<Notes>Norm <150 mg/dl</Notes>
</Tunings>
The code I am using is:
StreamReader objReader = new StreamReader(strFile);
string strData = objReader.ReadToEnd();
XmlDocument doc = new XmlDocument();
// Here I want to strip those characters from "strData"
doc.LoadXml(strData);
So it gives error:
Name cannot begin with the '1' character, hexadecimal value 0x31.
So is there a way to strip those characters from the XML before LoadXml is called?
If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
<Notes><![CDATA[Norm <150 mg/dl]]></Notes>
The CDATA tag tells XML parsers not to parse the characters between the <![CDATA[ and ]]>. This allows you to have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
As the comments state, you do not have an XML document. If you know that the only way these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?=\d) with &lt;. This finds each < that is immediately followed by a digit and encodes it properly, leaving the digit itself in place.
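A minimal sketch of that pre-processing step, using a lookahead so the digit after the < is not consumed. The names EscapeStrayLessThan and Escape are invented for the example, and it assumes a < followed by a digit never starts real markup in these files.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

static class EscapeStrayLessThan
{
    // Hypothetical helper: encodes any '<' that is directly followed by a digit.
    // The lookahead (?=\d) checks for the digit without replacing it.
    public static string Escape(string text)
    {
        return Regex.Replace(text, @"<(?=\d)", "&lt;");
    }

    static void Main()
    {
        string strData = "<Tunings><Notes>Norm <150 mg/dl</Notes></Tunings>";

        var doc = new XmlDocument();
        doc.LoadXml(Escape(strData));

        // InnerText decodes the entity again, so the original text is intact.
        Console.WriteLine(doc.SelectSingleNode("/Tunings/Notes")?.InnerText);
    }
}
```

Because element names cannot begin with a digit, this particular replacement should not touch well-formed tags, but it will not help with other stray characters such as an unescaped &.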
I am using XDocument in LINQ to edit (insert) and save xml document.
XDocument doc = XDocument.Load("c:\\sample.xml", LoadOptions.PreserveWhitespace);
doc.Save("c:\\sample.xml", SaveOptions.DisableFormatting);
sample.xml before doc.Save :
<ELEMENT ATTRIB1="attrib1"  ATTRIB2="attrib2" >
value
</ELEMENT>
sample.xml after doc.Save
<ELEMENT ATTRIB1="attrib1" ATTRIB2="attrib2">
value
</ELEMENT>
As you can see, there is a double space after ATTRIB1 and a single space after ATTRIB2 in the original document.
But these spaces are removed by LINQ to XML when I call doc.Save.
How can I preserve the whitespace inside the tag?
I believe that LoadOptions.PreserveWhitespace and SaveOptions.DisableFormatting only instruct XDocument on how to handle whitespace in terms of indentation and the content of text nodes. It would still normalize the attributes, etc.
You may wish to use an overload where you specify an XmlWriter that is configured to do what you want, and if you can't find a configuration that works with the default XmlTextWriter, you could always create your own XmlWriter.
These are "not significant whitespaces" and are removed at the moment of reading the XML. By the time you call save there is no information about spacing between attributes. (Note that strictly speaking even order of attributes may not be known as it has no significance in XML).
If you want to read/write XML in a way that is not directly supported by XML standard you need to provide some custom handling. Depending on requirements custom XmlWriter may be enough (i.e. if you want uniformly separate attributes with 2 whitespaces) or you'll need to build whole stack (readers/writers/nodes) yourself if you want to actually preserve information from original XML (treating it as text, not XML).
UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are two problems that prevent it from being accepted as valid XML.
I hope someone has a solution for fixing these bugs gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentiniƫ love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE="C01">WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add CDATA markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the <s and >s. They're at least escaping embedded quotes, so a regex to go from " to " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"
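In C#, the match-and-replace above could be sketched as follows. The names WrapInCData and Wrap are invented for the example, and RegexOptions.Singleline is an addition so that descriptions spanning several lines are matched too. It assumes <description> is an element without attributes and that its content never contains ]]>.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class WrapInCData
{
    // Hypothetical helper: wraps the content of every <description> element
    // in a CDATA section so a parser treats it as plain text.
    public static string Wrap(string text)
    {
        return Regex.Replace(
            text,
            "<description>(.*?)</description>",
            "<description><![CDATA[$1]]></description>",
            RegexOptions.Singleline);
    }

    static void Main()
    {
        string soCalledXml =
            "<car><description>great car <L CODE=\"C01\">WWW.cars.com</L></description></car>";

        XElement car = XElement.Parse(Wrap(soCalledXml));

        // The inner markup is now ordinary character data.
        Console.WriteLine(car.Element("description")?.Value);
    }
}
```

Like the string-replacement approach, this is brittle: it is a stopgap for known-bad input, not a substitute for well-formed XML from the producer.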