Remove all CDATA nodes and replace with encoded text - c#

So, I've got a massive XML file and I want to remove all CDATA sections and replace the CDATA node contents with safe, html encoded text nodes.
Just stripping out the CDATA with a regex will of course break the parsing. Is there a LINQ or XmlDocument or XmlTextWriter technique to swap out the CDATA with encoded text?
I'm not too concerned with the final encoding quite yet, just how to replace the sections with the encoding of my choice.
Original Example
---
<COLLECTION type="presentation" autoplay="false">
<TITLE><![CDATA[Rights & Responsibilities]]></TITLE>
<ITEM id="2802725d-dbac-e011-bcd6-005056af18ff" presenterGender="male">
<TITLE><![CDATA[Watch the demo]]></TITLE>
<LINK><![CDATA[_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4]]></LINK>
</ITEM>
</COLLECTION>
---
Should become
<COLLECTION type="presentation" autoplay="false">
<TITLE>Rights &amp; Responsibilities</TITLE>
<ITEM id="2802725d-dbac-e011-bcd6-005056af18ff" presenterGender="male">
<TITLE>Watch the demo</TITLE>
<LINK>_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4</LINK>
</ITEM>
</COLLECTION>
I guess the ultimate goal is to move to JSON. I've tried this
XmlDocument doc = new XmlDocument();
doc.Load(Server.MapPath(@"~/somefile.xml"));
string jsonText = JsonConvert.SerializeXmlNode(doc);
But I end up with ugly nodes, i.e. "#cdata-section" keys. It would take WAAAAY too many hours to have the front end redeveloped to accept this.
"COLLECTION":[{"@type":"whitepaper","TITLE":{"#cdata-section":"SUPPORTING DOCUMENTS"}},{"@type":"presentation","@autoplay":"false","TITLE":{"#cdata-section":"Demo Presentation"},"ITEM":{"@id":"2802725d-dbac-e011-bcd6-005056af18ff","@presenterGender":"male","TITLE":{"#cdata-section":"Watch the demo"},"LINK":{"#cdata-section":"_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4"}

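Process the XML with an XSLT that just copies input to output. The parser resolves the CDATA sections on input, and the serializer writes their content back as ordinary escaped text. C# code:
XslCompiledTransform transform = new XslCompiledTransform();
transform.Load(@"c:\temp\id.xslt");
transform.Transform(@"c:\temp\cdata.xml", @"c:\temp\clean.xml");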
id.xslt:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

Using LINQ to XML, you can do it like this:
XDocument doc = …;
var cDataNodes = doc.DescendantNodes().OfType<XCData>().ToArray();
foreach (var cDataNode in cDataNodes)
cDataNode.ReplaceWith(new XText(cDataNode));
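
As a self-contained sketch of the LINQ to XML approach above: the class name CDataToText and the method name Convert are made up for illustration, and the XML is passed in as a string rather than loaded from a file.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class CDataToText
{
    // Replaces every CDATA node with a plain text node holding the same value.
    // The serializer then escapes characters such as & and < on output.
    public static string Convert(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Materialize the list first, because the tree is modified while iterating.
        var cDataNodes = doc.DescendantNodes().OfType<XCData>().ToArray();
        foreach (var cDataNode in cDataNodes)
            cDataNode.ReplaceWith(new XText(cDataNode.Value));

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling CDataToText.Convert("<TITLE><![CDATA[Rights & Responsibilities]]></TITLE>") gives <TITLE>Rights &amp; Responsibilities</TITLE>.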

I think you can load the XML into an XmlDocument. Then recursively process each XmlNode and look for XmlCDataSection nodes. Each XmlCDataSection node should be replaced with an XmlText node (created with CreateTextNode) that has the same value.
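
A minimal sketch of that recursive XmlDocument walk might look like the following. The names CDataReplacer and ReplaceCData are hypothetical, and the code assumes the whole document fits in memory.

```csharp
using System.Linq;
using System.Xml;

public static class CDataReplacer
{
    // Walks the tree below 'node' and swaps each CDATA section for a text node.
    public static void ReplaceCData(XmlNode node)
    {
        // OwnerDocument is null when 'node' is the document itself.
        XmlDocument doc = node as XmlDocument ?? node.OwnerDocument;

        // Snapshot the children, because ReplaceChild changes the live list.
        foreach (XmlNode child in node.ChildNodes.Cast<XmlNode>().ToList())
        {
            if (child is XmlCDataSection)
                node.ReplaceChild(doc.CreateTextNode(child.Value), child);
            else
                ReplaceCData(child);
        }
    }
}
```

Call CDataReplacer.ReplaceCData(doc) after doc.Load(...) and before JsonConvert.SerializeXmlNode(doc); with the CDATA nodes gone, the "#cdata-section" keys should no longer appear in the JSON.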

Related

XSLT invalid token results in invalid XML document

I am using an XSLT file to transform an XML file to another XML file and then creating this XML file locally. I get this error:
System.InvalidOperationException: 'Token Text in state Start would result in an invalid XML document. Make sure that the ConformanceLevel setting is set to ConformanceLevel.Fragment or ConformanceLevel.Auto if you want to write an XML fragment. '
The XSLT file was debugged in Visual Studio and it looks like it works correctly, but I don't understand this error. What does this mean and how can it be fixed?
This is my XML:
<?xml version="1.0" encoding="utf-8"?>
<In xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="take.xsd">
<Submit ID="1234">
<Values>
<Code>34</Code>
<Source>27</Source>
</Values>
<Information>
<Number>55</Number>
<Date>2018-05-20</Date>
<IsFile>1</IsFile>
<Location></Location>
<Files>
<File>
<Name>Red.pdf</Name>
<Type>COLOR</Type>
</File>
<File>
<Name>picture.pdf</Name>
<Type>IMAGE</Type>
</File>
</Files>
</Information>
</Submit>
</In>
My XSLT code:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
<xsl:output method="xml" indent="yes"/>
<!-- identity template - copies all elements and its children and attributes -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
</xsl:copy>
</xsl:template>
<xsl:template match="/In">
<!-- Remove the 'In' element -->
<xsl:apply-templates select="node()"/>
</xsl:template>
<xsl:template match="Submit">
<!-- Create the 'Q' element and its sub-elements -->
<Q xmlns:tns="Q" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://schema.xsd" Source="{Values/Source}" Notification="true">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates select="Values" />
<xsl:apply-templates select="Information" />
<xsl:apply-templates select="Information/Files" />
</xsl:copy>
</Q>
</xsl:template>
<xsl:template match="Information">
<!-- Create the 'Data' sub-element without all of its children -->
<xsl:copy>
<xsl:copy-of select="Number"/>
<xsl:copy-of select="Date"/>
<xsl:copy-of select="IsFile"/>
<xsl:copy-of select="Location"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
And this is the C# code used to transform the file:
XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(@"D:\\Main\XLSTFiles\Test.xslt");
string xmlPath = @"D:\Documents\Test2.xml";
using (XmlWriter w = XmlWriter.Create(@"D:\Documents\NewFile.xml"))
{
xslt.Transform(xmlPath, w);
}
Also, is there a way to produce the new XML file with proper indentation? It seems to create each node after the last one is closed and on the custom template it just appends each item one after another.
It's an amazingly unhelpful message, isn't it? But I think I can decipher it for you.
The XSLT processor is producing its output by writing events such as start-document, start-element, output-text to an XML Writer.
If you want to produce a well-formed XML document, then you can't have any text before the start of the first element. The message is saying that if the last thing you did is to issue start-document, then the next thing isn't allowed to be text, because the document would be ill-formed (it says invalid, but it means ill-formed).
Now, XSLT stylesheets are allowed to produce "well-formed fragments" rather than only being allowed to write "well-formed documents". Actually, the term used in the XML spec is "well-formed external general parsed entity", but that's a bit of a mouthful, so everyone calls them "fragments" because that's what DOM calls them, and there's no point using correct terminology in error messages if no-one understands it. The difference is that a fragment can contain multiple elements and text nodes at the top level, for example this <b>really</b> is a <i>well-formed</i> fragment. The problem is that the destination to which you write the XSLT output might not handle fragments, and in this particular case, the XML Writer can handle a fragment only if it's configured to do so.
I suspect you didn't actually intend to produce a fragment, and you need to fix your XSLT code so it outputs a well-formed document.
To expand on Michael Kay's excellent answer (as this was too long to write in comments), for your particular input XML the issue is with whitespace. In the template matching /In you do this...
<xsl:template match="/In">
<!-- Remove the 'In' element -->
<xsl:apply-templates select="node()"/>
</xsl:template>
But by selecting node() you are selecting the whitespace nodes before and after the child Submit, so you end up with a text node before your root Q element causing the error.
So, what you could do in this case, is simply strip out the whitespace from your XML by adding this to your XSLT
<xsl:strip-space elements="*" />
Alternatively, you could select only elements, as opposed to other nodes (although this would omit comments and processing instructions)
<xsl:apply-templates select="*" />
However, if you have multiple Submit elements in your XML, you then get multiple Q elements in your output, which will be a fragment, as there would no longer be a single root element. If this is what you really intend, then you should make the following change to your C#...
using (XmlWriter w = XmlWriter.Create(@"C:\Users\tcase.BGT\Documents\NewFile.xml", xslt.OutputSettings))
The ConformanceLevel of xslt.OutputSettings is ConformanceLevel.Auto, which I think allows fragments. Adding this will also solve your indentation problem, as it will use the settings in your xsl:output.
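
To tie the suggestions above together, here is a rough, self-contained sketch. The stylesheet is a simplified stand-in for the one in the question (not the original), the class name FragmentSafeTransform is invented, and input and output are strings rather than files so it can be run anywhere.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class FragmentSafeTransform
{
    // Simplified stand-in for the question's stylesheet: strips whitespace,
    // drops the In wrapper and wraps each Submit in a Q element.
    const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' indent='yes'/>
  <xsl:strip-space elements='*'/>
  <xsl:template match='node()|@*'>
    <xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>
  </xsl:template>
  <xsl:template match='/In'>
    <xsl:apply-templates select='*'/>
  </xsl:template>
  <xsl:template match='Submit'>
    <Q><xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy></Q>
  </xsl:template>
</xsl:stylesheet>";

    public static string Run(string inputXml)
    {
        var xslt = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
            xslt.Load(xsltReader);

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(inputXml)))
        // Passing OutputSettings makes the writer honour xsl:output (indent etc.).
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
            xslt.Transform(input, writer);

        return output.ToString();
    }
}
```

Because the /In template selects only elements, no whitespace text node reaches the writer before the first Q element, which is the situation the error message complains about.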

Saxon XSLT: Serializer producing weird indents

I'm using Saxon HE 9.5.1.8 to transform an XML to another XML file.
My problem is that the XML content written by the Serializer() class of Saxon prints out several additional indents that I don't want to have in there. I'm assuming that this is "wrong" because I got the expected output when using the DomDestination() class (but then the outer XML document information is missing) or other XSL transformers like the one that is shipped with Visual Studio / .NET Framework.
This is the input XML:
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>$44.95</price>
<publish_date>2000-10-01</publish_date>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>$5.95</price>
<publish_date>2000-12-16</publish_date>
</book>
</catalog>
This is the XSLT file:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="book">
<book>
<xsl:copy-of select="@*|book/@*" />
<xsl:for-each select="*">
<xsl:attribute name="{name()}">
<xsl:value-of select="text()"/>
</xsl:attribute>
</xsl:for-each>
</book>
</xsl:template>
</xsl:stylesheet>
That is the expected output:
<?xml version="1.0" encoding="utf-8"?>
<catalog>
<book id="bk101" author="Gambardella, Matthew" title="XML Developer's Guide" genre="Computer" price="$44.95" publish_date="2000-10-01" />
<book id="bk102" author="Ralls, Kim" title="Midnight Rain" genre="Fantasy" price="$5.95" publish_date="2000-12-16" />
</catalog>
And that is the output when using Saxon:
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
<book id="bk101"
author="Gambardella, Matthew"
title="XML Developer's Guide"
genre="Computer"
price="$44.95"
publish_date="2000-10-01"/>
<book id="bk102"
author="Ralls, Kim"
title="Midnight Rain"
genre="Fantasy"
price="$5.95"
publish_date="2000-12-16"/>
</catalog>
Does anybody know how to suppress or modify this behavior of Saxon? That is the C# code that is used to call the Saxon API:
public Stream Transform(string xmlFilePath, string xsltFilePath)
{
var result = new MemoryStream();
var xslt = new FileInfo(xsltFilePath);
var input = new FileInfo(xmlFilePath);
var processor = new Processor();
var compiler = processor.NewXsltCompiler();
var executable = compiler.Compile(new Uri(xslt.FullName));
var destination = new Serializer();
destination.SetOutputStream(result);
using(var inputStream = input.OpenRead())
{
var transformer = executable.Load();
transformer.SetInputStream(inputStream, new Uri(input.DirectoryName));
transformer.Run(destination);
}
result.Position = 0;
return result;
}
Try setting the saxon:line-length serialization attribute (http://saxonica.com/documentation9.5/extensions/output-extras/line-length.html) to a very large value so that attributes are not wrapped onto new lines: <xsl:output xmlns:saxon="http://saxon.sf.net/" saxon:line-length="1000"/>.
Your goal of having multiple processors produce output in the same format is hopelessly misguided. That's especially so if you choose indented output: the spec leaves it entirely to implementations how to do indentation, saying only that the goal is to make it human-readable. (And placing constraints on where extra whitespace can be inserted.)
I'm sorry you don't find Saxon's way of wrapping long attribute lists pleasing, but it is entirely within the letter and the spirit of the specification. Without it, if you have an element with eight namespace declarations, you can easily get a line that is 400 characters long, which I certainly don't regard as human-readable.
There are many reasons that comparing two XML documents lexically is never going to work. For example, the attributes can be in a different order. There are two ways of comparing XML: convert the documents into canonical form using a "Canonical XML" processor, or compare them at the tree level for example by using the XPath 2.0 deep-equal() function. Ideally (especially if you want to know where the differences are, rather than just whether differences exist), use a specialist XML comparison tool such as DeltaXML.
For what it's worth, when we do unit testing, we first attempt a lexical comparison of the results. If that fails, we parse both documents and compare them using saxon:deep-equal(), which is a modified form of the deep-equal() function that gives fine control over the comparison rules, e.g. handling of whitespace and handling of namespaces.
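
On the .NET side, a tree-level comparison along the lines described above can be sketched with LINQ to XML. This is only an approximation of deep-equal(): the class and method names are invented, and as far as I know XNode.DeepEquals is sensitive to attribute order, so treat it as covering layout differences (indentation, wrapped attributes) only.

```csharp
using System.Xml.Linq;

public static class XmlTreeCompare
{
    // Compares two XML strings as trees rather than as text.
    // XDocument.Parse drops insignificant whitespace by default, so
    // differences in indentation or attribute wrapping do not matter.
    public static bool SameTree(string xmlA, string xmlB)
    {
        XDocument a = XDocument.Parse(xmlA);
        XDocument b = XDocument.Parse(xmlB);
        return XNode.DeepEquals(a.Root, b.Root);
    }
}
```

With this, the .NET output and the Saxon output shown in the question would compare as equal even though they differ textually.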

Preserving whitespace within XML elements between attributes when using XslCompiledTransform

I am applying an XSL-T file xsltUri to an XML file TargetXmlFile using the XslCompiledTransform class:
XslCompiledTransform xslTransform = new XslCompiledTransform(false);
xslTransform.Load(xsltUri);
using (var outStream = new MemoryStream())
{
var writer = new StreamWriter(outStream, new UTF8Encoding());
using (var reader = new XmlTextReader(TargetXmlFileName)
{
WhitespaceHandling = WhitespaceHandling.All,
DtdProcessing = DtdProcessing.Ignore
})
{
xslTransform.Transform(reader, xsltArguments, writer);
}
outStream.Position = 0;
using (FileStream outFile = new FileStream(outputFileName, FileMode.Create))
{
outStream.CopyTo(outFile);
}
}
Input XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<element
id="1"
attr1="value11"
attr2="value12"/>
<element id="2" attr1="value21" attr2="value22"/>
</root>
Input XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="//element[@id='2']/@attr1">
<xsl:attribute name="attr1">
<xsl:value-of select="'newvalue21'"/>
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
Actual output XML:
<?xml version="1.0" encoding="utf-8"?><root>
<element id="1" attr1="value11" attr2="value12" />
<element id="2" attr1="newvalue21" attr2="value22" />
</root>
Desired output XML:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<element
id="1"
attr1="value11"
attr2="value12"/>
<element id="2" attr1="newvalue21" attr2="value22"/>
</root>
Question: How can I preserve the whitespace (particularly, line breaks) of the input XML file within the "element" tags in the output XML file? I have experimented with different options, but nothing worked for this case.
Thanks for any hints!
This has nothing to do with XSLT. The whitespace you're referring to does not exist in the XML document model, and it cannot be made significant to a conformant XML processor, even with xml:space="preserve". There is no place for it in the DOM, and it will be skipped by the reader; as such there is no way to copy it to the writer. You would have to emit the XML with custom code (in other words, not with an XmlWriter).
The internal formatting of a tag (whitespace between attributes) is completely ephemeral in XML.
1. As far as XML documents are concerned, it does not exist.
2. As far as XML parsers are concerned, it is ignored, because of 1). The only exception is that whitespace is illegal immediately after a <.
3. As far as XML serializers are concerned, they can do what they want, because of 1) and 2). Most (if not all) will use a single space character to separate attributes from each other.
So...
Don't try to build an application that depends on the source code layout of XML.
Since this kind of source code layout in XML is technically irrelevant… get over your OCD. ;)
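
The points above are easy to confirm with a small round trip. This sketch uses LINQ to XML instead of XslCompiledTransform purely for brevity, and the class name TagWhitespaceDemo is made up.

```csharp
using System.Xml.Linq;

public static class TagWhitespaceDemo
{
    // Parses and re-serializes an element. PreserveWhitespace only keeps
    // whitespace *text nodes*; whitespace between attributes is not a node,
    // so it is gone by the time the tree exists.
    public static string RoundTrip(string xml)
    {
        XElement element = XElement.Parse(xml, LoadOptions.PreserveWhitespace);
        return element.ToString(SaveOptions.DisableFormatting);
    }
}
```

Feeding it the first element from the question, with each attribute on its own line, returns <element id="1" attr1="value11" attr2="value12" /> on a single line.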

Modify XSLT using C# Code

I am working in Visual Studio 2012 with C#.
I want to update the value of a node of a XSLT.
This abc.xslt is like:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:output method="xml" encoding="UTF-8" indent="yes" />
<xsl:template match="/">
<DocumentElement>
<PositionMaster>
<Name>
<xsl:value-of select = "'Ryan'"/>
</Name>
</PositionMaster>
</DocumentElement>
</xsl:template>
</xsl:stylesheet>
The code I have written to modify this XSLT in C# is:
XmlDocument xslDoc = new XmlDocument();
xslDoc.Load("abc.xslt");
XmlNamespaceManager nsMgr = new XmlNamespaceManager(xslDoc.NameTable);
nsMgr.AddNamespace("xsl", "http://www.w3.org/1999/XSL/Transform");
I am looking to change the value of the Name field to David. What should I write further here?
XmlElement valueOf = xslDoc.SelectSingleNode("/xsl:stylesheet/xsl:template[@match = '/']/DocumentElement/PositionMaster/Name/xsl:value-of", nsMgr) as XmlElement;
if (valueOf != null)
{
valueOf.SetAttribute("select", "'David'");
xslDoc.Save("new.xslt");
}
else
{
// handle case here that element was not found
}
You seem to be going about this a very odd way. Why not just use a stylesheet parameter (a global xsl:param element)?
And if you do need to modify a source stylesheet, as you sometimes do, surely it makes more sense to use XSLT for the purpose?
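
The stylesheet-parameter route could look roughly like this. It is a sketch, not the asker's actual stylesheet: the parameter name "name", the class NameParamDemo and the dummy input document are assumptions made for the example.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class NameParamDemo
{
    // The global xsl:param supplies the default ('Ryan'); the caller may override it.
    const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' omit-xml-declaration='yes'/>
  <xsl:param name='name'>Ryan</xsl:param>
  <xsl:template match='/'>
    <DocumentElement><PositionMaster><Name><xsl:value-of select='$name'/></Name></PositionMaster></DocumentElement>
  </xsl:template>
</xsl:stylesheet>";

    public static string Run(string name)
    {
        var xslt = new XslCompiledTransform();
        using (var reader = XmlReader.Create(new StringReader(Stylesheet)))
            xslt.Load(reader);

        // Only pass the parameter when a value is supplied; otherwise the default applies.
        var args = new XsltArgumentList();
        if (name != null)
            args.AddParam("name", "", name);

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader("<dummy/>")))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
            xslt.Transform(input, args, writer);

        return output.ToString();
    }
}
```

NameParamDemo.Run("David") puts David in the Name element without touching the .xslt file, and NameParamDemo.Run(null) falls back to Ryan.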

XML iteration and rearrangement- C#

I have an XML file like the one below:
<?xml version="1.0" ?>
<System>
<LP1>
<Equipment>
<FromName>Receptacle</FromName>
<Wire>1-#10, 1-#10, 1-#10</Wire>
<Length>89.8411846136344</Length>
</Equipment>
</LP1>
<X-1>
<Equipment>
<FromName>LP1</FromName>
<Wire>3-#3/0, 1-#3/0, 1-#6</Wire>
<Length>10.170412377555</Length>
</Equipment>
</X-1>
<HP1>
<Equipment>
<FromName>X-1</FromName>
<Wire>3-#3/0, 1-#3/0, 1-#6</Wire>
<Length>8.2423259796908</Length>
</Equipment>
<Equipment>
<FromName>AH-1</FromName>
<Wire>3-#6, 1-#10</Wire>
<Length>32.4019419736209</Length>
</Equipment>
<Equipment>
<FromName>EF-1</FromName>
<Wire>3-#12, 1-#12, 1-#12</Wire>
<Length>8.33572105849677</Length>
</Equipment>
</HP1>
</System>
I need to read it and rearrange it to look like this:
<?xml version="1.0" ?>
<System>
<HP1>
<Equipment>
<FromName>X-1</FromName>
<Wire>3-#3/0, 1-#3/0, 1-#6</Wire>
<Length>8.2423259796908</Length>
<Equipment>
<FromName>LP1</FromName>
<Wire>3-#3/0, 1-#3/0, 1-#6</Wire>
<Length>10.170412377555</Length>
<Equipment>
<FromName>Receptacle</FromName>
<Wire>1-#10, 1-#10, 1-#10</Wire>
<Length>89.8411846136344</Length>
</Equipment>
</Equipment>
</Equipment>
<Equipment>
<FromName>AH-1</FromName>
<Wire>3-#6, 1-#10</Wire>
<Length>32.4019419736209</Length>
</Equipment>
<Equipment>
<FromName>EF-1</FromName>
<Wire>3-#12, 1-#12, 1-#12</Wire>
<Length>8.33572105849677</Length>
</Equipment>
</HP1>
</System>
Basically, the original XML has separate Elements (LP1, X-1, HP1) that I want to put as sub elements when the equipment "FromName" matches the parent element name of the system.
I am guessing that I will need to do some recursive function, but I am kind of new to C# and programming in general and haven't had much experience with XML or recursive function.
Any help would be appreciated.
Thank you
While it can be definitely compressed to a one-liner as Steven suggested :) I chose to sprawl around a bit to make it more understandable. Of course I still failed so I'll also explain a bit.
XDocument x = XDocument.Parse(xml);
Func<string, XName> xn = s => XName.Get(s, "");
var systems = x.Elements().First();
var equipments = x.Descendants(xn("Equipment"));
equipments.ToList().ForEach(e =>
{
string fromName = e.Element(xn("FromName")).Value;
var found = systems.Element(xn(fromName));
if (found != null)
{
e.Add(found.Elements(xn("Equipment")));
found.Remove();
};
});
string result = x.ToString();
Assuming xml is the string in the OP, I simply parsed an XDocument from it. Then, a simple shortcut for getting XNames, since the code would have been even more crowded with it inline.
We get the root System element and store it for later; for lack of a better term I called it systems. If the elements to be nested can appear at multiple levels, the logic will of course need to be adjusted to find them all reliably.
Then we iterate through all the elements with the name Equipment (equipments), get the FromName element value, and search for an element with the same name in systems. If we find it, we simply add it to the current element and remove it from its parent; since the elements are still all part of the x tree, it works as expected.
Aaand... done. result is the desired result posted by the OP.
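
Since the question asked about recursion, here is an alternative recursive sketch that builds a new tree instead of editing the original in place, so it does not depend on the order of the groups in the file. The names EquipmentNester, Nest and Expand are invented, and the code assumes that group names under System are unique and that FromName references never form a cycle.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class EquipmentNester
{
    // Returns a new System element in which every group referenced by a
    // FromName is nested under the Equipment element that references it.
    public static XElement Nest(XElement system)
    {
        // Index the top-level groups (LP1, X-1, HP1, ...) by element name.
        var groups = system.Elements().ToDictionary(g => g.Name.LocalName);

        // A group that some FromName refers to is not a root of the output.
        var referenced = new HashSet<string>(
            system.Descendants("FromName").Select(f => f.Value));

        var result = new XElement(system.Name);
        foreach (var group in system.Elements()
                                    .Where(g => !referenced.Contains(g.Name.LocalName)))
        {
            result.Add(new XElement(group.Name,
                group.Elements("Equipment").Select(e => Expand(e, groups))));
        }
        return result;
    }

    // Deep-copies one Equipment element and appends the expanded Equipment
    // elements of the group named by its FromName, if such a group exists.
    static XElement Expand(XElement equipment, Dictionary<string, XElement> groups)
    {
        var copy = new XElement(equipment);
        string fromName = (string)equipment.Element("FromName") ?? "";
        if (groups.TryGetValue(fromName, out XElement source))
            copy.Add(source.Elements("Equipment").Select(e => Expand(e, groups)));
        return copy;
    }
}
```

Usage would be something like XElement nested = EquipmentNester.Nest(XDocument.Parse(xml).Root); for the sample in the question, only HP1 remains directly under System.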
There are more than a few tutorials on XML file manipulation. Example of both save & load (reversed for tutorial, but both there):
http://www.java2s.com/Code/CSharp/XML/Loadxmldocumentfromxmlfile.htm
The steps should be roughly...
open Input file
Load input into XmlNodeList
Parse input into output XmlNodeList()
Save output into new file.
I actually think I see what you're wanting, after a bit of staring... I'd have to think about the parse step for a bit personally.
If Xml output is what you're after, Xslt is probably your best bet. You'll need to load the xml and stylesheet into memory and then process accordingly. Something like:
// load the stylesheet (XslCompiledTransform replaces the obsolete XslTransform)
XslCompiledTransform stylesheet = new XslCompiledTransform();
stylesheet.Load(xsltFilePath);
// load the xml
XPathDocument doc = new XPathDocument(xmlFilePath);
//create the output stream
XmlTextWriter myWriter = new XmlTextWriter("output.xml", null);
// autobots! Transform!
stylesheet.Transform(doc, null, myWriter);
myWriter.Close();
Regarding the stylesheet, I'm assuming that you'd be taking the last node in the xml file and then nesting the Equipment nodes from above. For the recursive lookups, no extension function such as ms:node-set is needed (it converts a result tree fragment into a node-set; it does not evaluate an XPath string). Plain XSLT 1.0 can select the element under System whose name equals the FromName value, using /System/*[name() = $fromName].
Tip: Xslt is processed from the bottom up, so scroll down and then read-up.
PS -- I'm doing this from memory. Xslt can be finicky so you may want to play with this a bit.
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<!-- wildcard: other content is copied as is -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
</xsl:copy>
</xsl:template>
<!-- this is our magic recursive template. Any element that matches "Equipment"
will be caught and processed here. Everything else will default to our
Wildcard above -->
<xsl:template match="Equipment">
<!-- Read the FromName element into the variable $fromName -->
<xsl:variable name="fromName" select="FromName/text()" />
<!-- Manually reconstruct the Equipment Node -->
<Equipment>
<!-- copy out FromName, Wire and Length -->
<xsl:copy-of select="FromName" />
<xsl:copy-of select="Wire" />
<xsl:copy-of select="Length" />
<!-- this is how we recursively pull our Element nodes in, which
will match on this template -->
<xsl:apply-templates select="/System/*[name() = $fromName]/Equipment" />
</Equipment>
</xsl:template>
<!-- Starting point: Find the last node under system -->
<xsl:template match="/System/*[last()]">
<!-- copy the elements and attributes of this node to the output stream -->
<xsl:copy>
<!-- match templates on its contents -->
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
