Removing non-printing characters from XML text (or any string) - c#

I'm getting an XML document back from a company and it has embedded tabs, newlines and other non-printing garbage in it. Is there some method in the framework that will take such a string and remove these unwanted characters? Some screenshots are below. These are not debugger/visualiser artefacts: the characters actually come into play when I do string compares.
Example #1:
Example #2:
FWIW these XML documents come from UTF8 encoding the response to a web request.
EDIT 2014-09-03 20:20 IST
In response to comments below from @CodeCaster: I upload values (in the form of a NameValueCollection) using an instance of a WebClient. The response comes back to me and I do the following:
string reply = System.Text.Encoding.UTF8.GetString(response);
XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(new NameTable());
xmlNamespaceManager.AddNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance");
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.LoadXml(reply);
It is this xmlDocument that has the offending characters throughout.

That's a trivial task for XSLT.
This XSLT stylesheet normalizes (removes excessive whitespace from) all text nodes from the input XML document, leaving everything else untouched.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*" />
</xsl:copy>
</xsl:template>
<xsl:template match="text()">
<xsl:value-of select="normalize-space()" />
</xsl:template>
</xsl:stylesheet>
Use the XslCompiledTransform class to apply it to your input XML.
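As a rough illustration of that step, here is a minimal sketch of applying the stylesheet with XslCompiledTransform. The class name WhitespaceNormalizer and the method Normalize are hypothetical names chosen for this example, and the stylesheet is embedded as a string only to keep the sketch self-contained.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, not part of the framework.
public static class WhitespaceNormalizer
{
    // The stylesheet from the answer, with @* as the attribute wildcard.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""node() | @*"">
    <xsl:copy><xsl:apply-templates select=""node() | @*"" /></xsl:copy>
  </xsl:template>
  <xsl:template match=""text()"">
    <xsl:value-of select=""normalize-space()"" />
  </xsl:template>
</xsl:stylesheet>";

    public static string Normalize(string inputXml)
    {
        // Compile the stylesheet.
        var transform = new XslCompiledTransform();
        using (var stylesheetReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(stylesheetReader);
        }

        // Run the transform and collect the result as a string.
        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(inputXml)))
        using (var writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

Calling WhitespaceNormalizer.Normalize(reply) on the string from the question should return the document with each text node trimmed and its internal runs of whitespace collapsed to single spaces. In real code the compiled transform would normally be loaded once and reused.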
Be aware that whitespace may sometimes carry meaning. Clobbering all of it might be counter-productive.
When in doubt, adapt the match expression (<xsl:template match="text()">) to something more specific (like <xsl:template match="message//text()"> or <xsl:template match="status/text()">) to affect only those text nodes that you really want to straighten out.
Of course you can achieve the same effect by applying a regular expression to the offending string value after you extracted it from the document:
return Regex.Replace(value, @"\s+", " ").Trim();
Using XSLT to clean up the input XML up-front in one step might be more convenient.

Related

Convert CDATA node to encoded string in .Net

TL;DR - in .Net and XmlDocument/XDocument is there an easy way (XPath?) to find CDATA nodes, so they can be removed and the contents encoded?
Details...
My system has lots of situations where it builds XML strings manually (e.g. string concatenation, rather than building via XmlDocument or XDocument) which could contain multiple <![CDATA[...]]> nodes (which could appear at any level of the structure)... e.g.
<data><one><![CDATA[ab&cd]]></one><two><inner><![CDATA[xy<z]]></inner></two></data>
When storing this data in a SQLServer XML column, the <![CDATA[..]]> is automatically removed and the inner text encoded... this is standard for SQLServer which doesn't "do" CDATA.
My issue is that I have complex code that takes two instances of a class, and audit-trails differences between them... one or more could be a string property containing XML.
This results in a mismatch (and therefore an audit-trail entry) when nothing is actually changing, because the code creates one format of XML and SQLServer returns a different form, e.g...
// Manually generated XML string...
<data><one><![CDATA[ab&cd]]></one><two><inner><![CDATA[xy<z]]></inner></two></data>
// SQLServer returned string...
<data><one>ab&amp;cd</one><two><inner>xy&lt;z</inner></two></data>
Is there an easy way in .Net to process the manually generated XML and convert each CDATA node into its encoded version, so I can compare the string to the one returned by SQLServer?
Is there a SelectNodes XPath that would find all those elements?
(And before anybody states it, the obvious solution is to not use CDATA in the manual creation of the XML in the first place... however, this is not possible due to the sheer number of instances.)
Easy with one foreach loop and ReplaceChild:
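using System;
using System.Linq;
using System.Xml;
var doc = new XmlDocument();
doc.LoadXml(@"<data><one><![CDATA[ab&cd]]></one><two><inner><![CDATA[xy<z]]></inner></two><three><inner>a &lt; b</inner></three></data>");
// Materialize the matches first so that replacing nodes cannot disturb the enumeration.
foreach (var cdata in doc.SelectNodes("//text()").OfType<XmlCDataSection>().ToList())
{
// Swap each CDATA section for a plain text node; & and < are escaped when the document is serialized.
cdata.ParentNode.ReplaceChild(doc.CreateTextNode(cdata.Data), cdata);
}
Console.WriteLine(doc.OuterXml);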
Outputs
<data><one>ab&amp;cd</one><two><inner>xy&lt;z</inner></two><three><inner>a &lt; b</inner></three></data>
Another option would be to run the XML through an XSLT identity transformation with XslCompiledTransform and e.g.
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

JSON to XML using XSLT

I have a requirement to transform JSON data into various XML documents on the basis of XSLT.
The same JSON data goes to different systems, and each has its own object structure (property nesting level etc.) to store it.
I use XslCompiledTransform in C# to transform XML into JSON, and I am now looking for an efficient way of transforming JSON into XML using XSLT.
I don't think this will work. JSON is not XML based, so you can't apply XSLT transformations on it. XML to JSON would work, but not JSON to XML
Edit. I was wrong, check this: https://github.com/bramstein/xsltjson and this: How to convert json to xml using xslt
XSLT is for changing one XML document into another XML document; JSON, however, is not an XML document at all.
You can write a simple application to convert the format.
Setting aside the fact that XSLT is definitely not the right tool for that job, here's a pseudo approach to how I'd do it if I ever had to:
Create an extension function in C# that does the real job, i.e., getting a JSON string as an argument, returning a generic XPathNodeIterator XML chunk.
Process that result normally with XSLT to return the final custom converted format.
The XSLT would then look something like this (assuming XSLT 1.0 since you're in C#):
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:util="urn:JsonUtility.Converter"
>
<!-- Supplied from environment -->
<xsl:param name="json" />
<xsl:template match="/">
<xsl:variable name="xml" select="util:JSON2XML($json)" />
<!-- Start processing the returned XML -->
<xsl:apply-templates select="$xml/json" />
</xsl:template>
<xsl:template match="key">
<!-- output -->
</xsl:template>
<xsl:template match="array">
<!-- output -->
</xsl:template>
<!-- etc. -->
</xsl:stylesheet>
(Alternatively, if you create the final format in the C# extension, you could just do a <xsl:copy-of select="$xml" /> in the root template.)
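The extension function from the first step could be sketched roughly as follows. This is only one possible shape: the class name JsonConverter and the helper BuildArguments are hypothetical, and the JSON-to-XML mapping here comes from JsonReaderWriterFactory, which produces a root element named root rather than json, so the stylesheet above would select $xml/root with this sketch. Depending on the framework version, a reference to System.Runtime.Serialization (or System.ServiceModel.Web on .NET 3.5) may be needed.

```csharp
using System.Runtime.Serialization.Json;
using System.Text;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical extension object; exposed to XSLT under urn:JsonUtility.Converter.
public class JsonConverter
{
    // Called from the stylesheet as util:JSON2XML($json).
    public XPathNodeIterator JSON2XML(string json)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(json);
        using (XmlDictionaryReader reader =
            JsonReaderWriterFactory.CreateJsonReader(bytes, XmlDictionaryReaderQuotas.Max))
        {
            // XPathDocument reads the whole input in its constructor,
            // so the reader can be disposed afterwards.
            var document = new XPathDocument(reader);
            // Return the document root as a one-node node-set.
            return document.CreateNavigator().Select("/");
        }
    }

    // Builds the argument list to pass to XslCompiledTransform.Transform.
    public static XsltArgumentList BuildArguments(string json)
    {
        var arguments = new XsltArgumentList();
        arguments.AddExtensionObject("urn:JsonUtility.Converter", new JsonConverter());
        arguments.AddParam("json", "", json);
        return arguments;
    }
}
```

The argument list returned by BuildArguments would then be passed to one of the Transform overloads that accept an XsltArgumentList, together with any placeholder input document, since the stylesheet only works on the $json parameter.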

XSLT getting a piece of xml to custommethod

I have a small question about XSLT, I've only recently started with xslt.
The problem is that I need to pass my custom method the piece of XML that matches the template, but what arrives is a string that no longer has any tags.
For example, if my XML looks like this:
<a>hi</a>
<a>bye</a>
I receive only a string like this: "hi bye"
So instead of only the value/text of the node, I need to pass the whole node, with its tags, attributes, child elements and so on.
My xslt looks like this:
<xsl:template match="SpecialNode">
<xsl:value-of select="CustomMethod:Handler(node()[*], @name)"/>
</xsl:template>
But whatever I try (./node(), descendant::node(), * and so on), I always get the string without the XML tags :(
What I need passed to my method is a string like this:
<a>hi</a><a>bye</a>
If you just want to get the tag name, try
<xsl:template match="SpecialNode">
<xsl:value-of select="CustomMethod:Handler(name(.))"/>
</xsl:template>
If you want the whole element, as well as the tag name, try
<xsl:template match="SpecialNode">
<xsl:value-of select="CustomMethod:Handler(., name(.))"/>
</xsl:template>
Use:
CustomMethod:Handler(.)
Your XSLT stylesheet is processing a tree of nodes, and you want your external C# (?) code to see lexical serialized XML containing angle brackets. So the tree of nodes needs to be serialized into lexical XML somewhere along the line. That's not going to happen by magic as an implicit conversion done by a function call. It's probably best to let the C# code receive the data as nodes, and do the serialization from there, assuming the processing can't be done at the tree level.
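A minimal sketch of that idea, assuming the extension object is a C# class registered through XsltArgumentList.AddExtensionObject: declaring the parameter as XPathNodeIterator lets the method receive the nodes themselves, and it can serialize them on its own side. The class and method names simply mirror the question; the single-parameter signature is an assumption made for this example.

```csharp
using System.Text;
using System.Xml.XPath;

// Hypothetical extension object mirroring the CustomMethod prefix in the question.
public class CustomMethod
{
    // Receives a node-set from XSLT and returns its markup as a string.
    public string Handler(XPathNodeIterator nodes)
    {
        var markup = new StringBuilder();
        while (nodes.MoveNext())
        {
            // OuterXml serializes the current node including tags and attributes.
            markup.Append(nodes.Current.OuterXml);
        }
        return markup.ToString();
    }
}
```

With this signature, calling CustomMethod:Handler(*) from the SpecialNode template would hand over the child elements, so the method would see something like <a>hi</a><a>bye</a>. Note that OuterXml may add indentation when elements are nested.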

How determine the right xml to write out

<?xml version="1.0" encoding="UTF-8"?>
<idmef:IDMEF-Message version="1.0" xmlns:idmef="http://iana.org/idmef">
<idmef:Alert messageid="abc123456789">
<idmef:Analyzer analyzerid="bc-corr-01">
<idmef:Node category="dns">
<idmef:name>correlator01.example.com</idmef:name>
</idmef:Node>
</idmef:Analyzer>
<idmef:CreateTime ntpstamp="0xbc72423b.0x00000000">2000-03-09T15:31:07Z
</idmef:CreateTime>
<idmef:Source ident="a1">
<idmef:Node ident="a1-1">
<idmef:Address ident="a1-2" category="ipv4-addr">
<idmef:address>192.0.2.200</idmef:address>
</idmef:Address>
</idmef:Node>
</idmef:Source>
<idmef:Target ident="a2">
<idmef:Node ident="a2-1" category="dns">
<idmef:name>www.example.com</idmef:name>
<idmef:Address ident="a2-2" category="ipv4-addr">
<idmef:address>192.0.2.50</idmef:address>
</idmef:Address>
</idmef:Node>
<idmef:Service ident="a2-3">
<idmef:portlist>5
</idmef:portlist>
</idmef:Service>
</idmef:Target>
<idmef:Classification text="Login Authentication">
<idmef:Reference origin="vendor-specific">
<idmef:name>portscan</idmef:name>
<idmef:url>http://www.vendor.com/portscan</idmef:url>
</idmef:Reference>
</idmef:Classification>
<idmef:Assessment>
<idmef:Impact severity ="high" completion ="failed" type ="file" >
</idmef:Impact>
</idmef:Assessment>
</idmef:Alert>
</idmef:IDMEF-Message>
I'm working with an XML messaging system, where a message packet is read from a queue and applied against a rule with a pattern in it. If the pattern matches, the rule fires and some elements, nodes etc. of the XML are read and stored. What is to be read from the message is defined using XPath expressions. For example, the following XPath takes the severity attribute and stores it.
name.set(".//idmef:Classification/idmef:Assesment/idmef:Impact/@severity","high");
So, I would take that XPath, compile it, read the severity attribute and store it for later use.
When I go to create the new XML message using the stored value, there may be a case where the completion and type attributes are mandatory.
So the question is, how do I check whether those attributes need to be written out? I know that the schema is involved somehow, but how do you do it? More to the point, if the user selects only the severity attribute, how would I go about adding in the rest of the structure, like Classification, Message and other elements, when there are additional XPath lookups, for example down at
Bob.
The commenters are correct - you need to first fix your XML to make it well formed.
However, if I understand your problem correctly, you need to write out some XML, adding or changing some attributes.
If this is what you need, I would try using an XSL transform to add the attributes.
Here is a modified version of the identity transform that should be close to what you need.
If you need some conditional logic, surround the attribute tags with xsl:if.
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:idmef="http://iana.org/idmef" xpath-default-namespace="http://iana.org/idmef">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="Impact">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:attribute name="severity">high</xsl:attribute>
<xsl:attribute name="completion">failed</xsl:attribute>
<xsl:attribute name="type">file</xsl:attribute>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
You could:
Open original XML (A)
Create a new XML document (B)
Run your xpath against (A)
Add matching results to (B)
Save (B)
Does this make any sense?
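The steps above can be sketched in C# roughly as follows. This is only an illustration under several assumptions: the question does not say which platform is in use, the names NodeCopier and CopyMatches are hypothetical, and the sketch expects the XPath to select elements (an expression selecting attributes would need to attach them to an element in B instead).

```csharp
using System.Xml;

// Hypothetical helper illustrating the A -> B copy.
public static class NodeCopier
{
    public static XmlDocument CopyMatches(string sourceXml, string xpath)
    {
        // Open the original XML (A).
        var source = new XmlDocument();
        source.LoadXml(sourceXml);
        var namespaces = new XmlNamespaceManager(source.NameTable);
        namespaces.AddNamespace("idmef", "http://iana.org/idmef");

        // Create a new XML document (B) with a root element to hold the matches.
        var target = new XmlDocument();
        XmlElement root = target.CreateElement("idmef", "IDMEF-Message", "http://iana.org/idmef");
        target.AppendChild(root);

        // Run the XPath against (A) and add each matching element to (B).
        foreach (XmlNode match in source.SelectNodes(xpath, namespaces))
        {
            // ImportNode makes a deep copy that belongs to the target document.
            root.AppendChild(target.ImportNode(match, true));
        }
        return target;
    }
}
```

The returned document can then be written out with its Save method. This flattens the matches directly under the root; it does not recreate intermediate elements such as Alert or Assessment.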
I found an answer here on Stack Overflow: Create XML Nodes from XPath. I know it is a long way from how I described it above, but at the time I was designing it, I didn't have a scooby how it would work.

Transforming flat file to XML using XSLT-like technology

I'm designing a system which is receiving data from a number of partners in the form of CSV files. The files may differ in the number and ordering of columns. For the most part, I will want to choose a subset of the columns, maybe reorder them, and hand them off to a parser. I would obviously prefer to be able to transform the incoming data into some canonical format so as to make the parser as simple as possible.
Ideally, I would like to be able to generate a transformation for each incoming data format using some graphical tool and store the transformation as a document in a database or on disk. Upon receival of data, I would apply the correct transformation (never mind how I determine the correct transformation) to get an XML document in a canonical format. If the incoming files had contained XML I would just have created an XSLT document for each format and been on my way.
I've used BizTalk's Flat File XSLT Extensions (or whatever they are called) for something similar in the past, but I don't want the hassle of BizTalk (and I can't afford it either) on this project.
Does anyone know if there are alternative technologies and/or XSLT extensions which would enable me to achieve my goal in an elegant way?
I'm developing my app in C# on .NET 3.5 SP1 (thus would prefer technologies supported by .NET).
XSLT 2.0 provides new features that make it easier to parse non-XML files.
Andrew Welch posted an XSLT 2.0 example that converts CSV into XML
I think you need something like this (sorry, not supported by .NET but code is very simple)
http://csv2xml.sourceforge.net
IIRC someone has created a "LINQ to CSV" library that might be a starting point to create the intermediate XML (in memory) as input into the transform.
Found it here.
You might try LINQ to CSV. There is one offering from Microsoft's Eric White and another from Matt Perdeck. Others are out there...
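To illustrate what the intermediate XML might look like, here is a minimal sketch using LINQ to XML, which is available on .NET 3.5. It is not the API of either LINQ to CSV library; the names CsvToXml and Convert and the rows/row/field element names are made up for this example, and the naive Split call does not handle quoted fields or embedded commas.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper; a naive stand-in for a real CSV parser.
public static class CsvToXml
{
    // Treats the first line as the header and every later line as a data row.
    public static XElement Convert(IEnumerable<string> lines)
    {
        string[] header = null;
        var root = new XElement("rows");
        foreach (string line in lines)
        {
            string[] fields = line.Split(',');
            if (header == null)
            {
                header = fields;
                continue;
            }

            var row = new XElement("row");
            for (int i = 0; i < fields.Length && i < header.Length; i++)
            {
                // The column name goes into an attribute because header text
                // is not guaranteed to be a valid XML element name.
                row.Add(new XElement("field",
                    new XAttribute("name", header[i].Trim()),
                    fields[i].Trim()));
            }
            root.Add(row);
        }
        return root;
    }
}
```

The resulting element could be fed to XslCompiledTransform through its CreateReader method, with one stylesheet per partner format selecting and reordering the field elements by their name attribute.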
I have found 2 potential solutions when looking into a similar problem space.
Progress Software has a set of tools and an API (.Net) which, when used in conjunction with .conv (flat to XML converter) files created in their Stylus Studio tool, allow for transformation of any pre-defined flat file format into XML at run time. More info here: http://www.datadirect.com/developer/data-integration/tutorials/converter-sample-code/index.ssp
In addition there is an XML format called XFLAT which allows for the description of flat files in a variety of formats: delimited, fixed width etc. There is a Java program which will convert flat files, for which you've provided the XFLAT description, into XML so that you can continue with a standard XML to XML XSLT transformation. More details can be found here: http://www.unidex.com/overview.htm
I have never actually used either of these tools, but found them when researching a similar problem.
Check out this article on implementing an XmlReader that processes non-XML input. It's not a terrifically difficult task, and once you've got it working you don't need to use an XSLT-like technology, you can just use XSLT.
This will parse the output from the Linux ip route list command. It's just what I had lying around.
You must wrap the output from the command in an element called 'output' and the stylesheet will take it from there. The real key here is the tokenize function in the XPath 2.0 spec. I don't know how you could do this before that. Also, this doesn't make a single root element, as that was not what I needed it for. In your case, instead of splitting on space, I'd split on a ','.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" />
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="//output">
<!-- split things up for each new line -->
<xsl:variable name="line" select="tokenize(.,'\n')"/>
<xsl:for-each select="$line">
<!-- split each line into pieces based on space -->
<xsl:variable name="split" select="tokenize(.,' +')"/>
<xsl:if test="count($split) > 1">
<xsl:element name="route">
<xsl:for-each select="$split">
<xsl:choose>
<xsl:when test="position() = 1">
<xsl:attribute name="address" select="."/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="index" select="position()"/>
<xsl:variable name="fieldName" select="."/>
<!-- even positions hold a field name; the token after it is the value -->
<xsl:if test="$fieldName and position() mod 2 = 0">
<xsl:attribute name="{$fieldName}" select="$split[$index + 1]"/>
</xsl:if>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</xsl:element>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
You can also take a look at altova's MapForce
