Processing large CDATA section in C# - c#

I am trying to retrieve a cdata section from an xml document, the format of the xml is like this:
<Configuration>
<ConfigItem>
<Key>Hello World</Key>
<Value><![CDATA[For the value we have a large chunk of XAML stored in a CDATA section]]></Value>
</ConfigItem>
</Configuration>
What I am trying to do is retrieve the XAML from the CDATA section, my code so far is as follows:
XmlDocument document = new XmlDocument();
document.Load("Configuration.xml");
XmlCDataSection cDataNode = (XmlCDataSection) document.SelectSingleNode("//*[local-name()='Value']").ChildNodes[0];
String cdata = cDataNode.Data;
However the cdata string has been truncated and is incomplete, I guess because the actual cdata is too large to fit in the string object.
Whats the correct way to do this?
EDIT:
So my original assumption that the string was too long was incorrect. The problem now is that my CDATA contains a nested CDATA within it. Reading online it appears that the proper way to escape the nested cdata is to use ]]]]><![CDATA[> which this xml is using, but it seems like when I select the node it is escaping at the wrong place.

When there's nested CDATA sections, what you need to do is piece the data back together. At present, you're just selecting ChildNodes[0] and ignoring all of the other children. What you'll probably find is that ChildNodes[1] contains some plain text, and then ChildNodes[2] contains another CDATA section, and so on.
You need to extract all of these, extract the data from the CData sections, and concatenate them all together to get the effective "text" contents of the Value element.

Related

Strip < Character from XML content

I have an XML Document where it contains data with < character.
<Tunings>
<Notes>Norm <150 mg/dl</Notes>
</Tunings>
The code I am using is:
StreamReader objReader = new StreamReader(strFile);
string strData = objReader.ReadToEnd();
XmlDocument doc = new XmlDocument();
// Here I want to strip those characters from "strData"
doc.LoadXml(strData);
So it gives error:
Name cannot begin with the '1' character, hexadecimal value 0x31.
So is there a way to strip those characters from XML before Load calls.?
If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
<Notes><![CDATA[Norm <150 mg/dl]]></Notes>
The CDATA tag tells XML parsers to not parse the characters between the <![CDATA[ and ]]>. This allows you have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
As the comments state, you do not have an XML document. If you know that the only way that these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?:\d) with &. This will find the < adjacent to a number and properly encode it.

count number of "elements" in an XML tag using c#

I'm using C# in reading an XML file and counting how many "elements" there are in an XML tag, like this for example...
<Languages>English, Deutsche, Francais</Languages>
there are 3 "elements" inside the Languages tag: English, Deutsche, and Francais . I need to know how to count them and return the value of how much elements there are. The contents of the tag have the possibility of changing over time, because the XML file has to expand/accommodate additional languages (whenever needed).
IF this is not possible, please do suggest workarounds for the problem. Thank you.
EDIT: I haven't come up with the code to read the XML file, but I'm also interested in learning how to.
EDIT 2: revisions made to question
string xml = #"<Languages>English, Deutsche, Francais</Languages>";
var doc = XDocument.Parse(xml);
string languages = doc.Elements("Languages").FirstOrDefault().Value;
int count = languages.Split(',').Count();
In response to your edits which indicate that you're not simply trying to pull out comma separated strings from an XML element, then your approach to storing the XML in the first place is incorrect. As another poster commented, it should be:
<Languages>
<Language>English</Language>
<Language>Deutsche</Language>
<Language>Francais</Language>
</Languages>
Then, to get the count of languages:
string xml = #"<Languages>
<Language>English</Language>
<Language>Deutsche</Language>
<Language>Francais</Language>
</Languages>";
var doc = XDocument.Parse(xml);
int count = doc.Element("Languages").Elements().Count();
First, an "ideal" solution: do not put more than one piece of information in a single tag. Rather, put each language in its own tag, like this:
<Languages>
<Language>English</Language>
<Language>Deutsche</Language>
<Language>Francais</Language>
</Languages>
If this is not possible, retrieve the content of the tag with multiple languages, split using allLanguages.Split(',', ' '), and obtain the count by checking the length of the resultant array.
Ok, but just to be clear, an XML Element has a very specific meaning. In fact, the entire codeblock you have is an XML Element.
XElement xElm = new XElement("Languages", "English, Deutsche, Francais");
string[] elements = xElm.Value.Split(",".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

Save string to XML File

I want to save the following string in an XML File:
<text><![CDATA[<p>what is my pet name</p>]]></text>
When I am saving it, it looks like:
<text><![CDATA[<p>what is my pet name</p>]]></text>
I have tried File.WriteAllText(), XmlDocument.Save() methods but didnt get the proper response.
basically everywhere other than opening and closing tags in the XML, < is replaced by < and > is replaced by >.
What is happening is that the XML parser is encoding your string. When you try to access the string later, it can be decoded again at that time.
What I suggest, is that you either try to load the text as into a new 'XmlDocument' with XmlDocument.LoadXml(string s), and then import that into your current document, or leave it encoded.
You should not try to both use an XML parser, and manually add text at the same time.
I guess you add the CDATA manually and the XML writing mechanism correctly escapes your CDATA because it treats it as text content. Instead explicitly add a CDATA section with just the contents.
If you are using the old XML API (System.XML), then use this method to create the CDATA Section: http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.createcdatasection
Then append the node to the element just like in the example in the link.
XML is being written correctly.
XML has special characters that are reserved for commands, just like C# reserves words like "if" and "string".
XML is encoding your string for storage. What you need to do is when you retrieve your string, run it through a similar decode process.
Use this: HttpServerUtility.HtmlDecode(encodedString)
Reference:
Decode XML returned by a webservice (< and > are replaced with < and &gt)?

Clean out/replace invalid XML characters in element attributes

UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are a two problems with this why this isn't accepted as valid XML.
Hope anyone has a solution for fixing these bug gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE="C01">WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add [CDATA][1] markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the <s and >s. They're at least escaping embedded quotes, so a regex to go from " to " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
match => match.Value.Replace("<", "<").Replace(">", ">"));
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"

Preserving whitespaces and line-breaks in XML Document

I have an xml document like this,
<Customer ID = "000A551"
Name = "Robert"
Salaried = "yes"
Area = "VA"
/>
Please note the how attributes are line-breaked and white-spaced for the editing and reading convenience. When using XDocument or XmlDocument to modify this document whole formatting goes away. Looks like PreserveWhitespace will only handle significant whitespace.
Is there any way out there to maintain line-breaks and whitespaces ?
No, not handling XML natively. The ‘infoset’ (the data model of XML) does not maintain any record of attribute order or whitespace inside an attribute list. Some XML processors maintain attribute order as a side-effect, but none store whitespace between attributes. It is extremely unusual for anyone to care.

Categories