Sanitizing string before adding it to XML? - c#

Consider the following code:
private XmlDocument CreateMessage(string dirtyInput)
{
XmlDocument xd = new XmlDocument();
string str = @"<Message><Request>%REQ%</Request></Message>";
str = str.Replace("%REQ%", dirtyInput);
xd.LoadXml(str);
return xd;
}
What steps should I take to sanitize/validate this dirtyInput string (it can come from untrusted sources)?
EDIT:
To provide a bit more context, this XML "message" is then sent (by me) to a third-party web service. I am mostly concerned with mitigating the risk that someone could pass me a string that exploits vulnerabilities in my XML parser, or perhaps even in the parser on the target [third party] end (to whom I am sending this message). So clearly I could focus on special XML characters like < > & etc. -- do I also need to worry about escaped/encoded forms of those characters? Is the SecurityElement.Escape method mentioned in the possible dupe link adequate for this?

Since you're generating an XmlDocument, you could rely on the DOM methods to handle all escaping for you:
private XmlDocument CreateMessage(string dirtyInput)
{
XmlDocument xd = new XmlDocument();
xd.LoadXml(@"<Message><Request></Request></Message>");
xd["Message"]["Request"].InnerText = dirtyInput;
return xd;
}
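A minimal sketch of why this works, assuming the same Message/Request shape as above: setting InnerText stores the input as a text node, so any markup characters in it are escaped when the document is serialized instead of being parsed. The class name and the sample input are hypothetical.

```csharp
using System.Xml;

public static class MessageBuilder
{
    // Builds the message; the untrusted value only ever becomes a text node.
    public static XmlDocument CreateMessage(string dirtyInput)
    {
        XmlDocument xd = new XmlDocument();
        xd.LoadXml("<Message><Request></Request></Message>");
        // InnerText replaces the element's children with a single text node.
        xd["Message"]["Request"].InnerText = dirtyInput;
        return xd;
    }
}
```

For example, calling CreateMessage with "</Request><Evil/>" yields a Request element with no child elements, and reading InnerText back returns the original string. This covers escaping only; characters that are not legal in XML 1.0 at all (such as most control characters) would still need to be rejected or stripped separately.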

It depends on the environment in which this string is going to be used (web? database?).
If it is the web and you're trying to prevent XSS, this will do the trick:
HttpUtility.HtmlEncode(dirtyInput);
For databases, I'd forgo sanitization in favour of parameterized queries.
As mentioned in the comments, you should wrap the dirtyInput in a character data (CDATA) section:
<![CDATA[
...
]]>

Related

Insert double quotes around all html tag attribute

I am trying to convert HTML to XML, but the attribute values of the HTML tags are not wrapped in double quotes,
so converting it to XML gives me an error.
How can I add double quotes around all the attribute values in my file?
I am using a VB.NET Windows Forms application.
Converting HTML to XML directly will not work reliably. There are various corner cases where an HTML-to-XML conversion may fail.
The best way to convert HTML to XML would be to:
1. Extract the relevant data from the HTML using a parser such as HtmlAgilityPack.
2. Store the extracted data in XML using an XML API such as XmlWriter or LINQ to XML.
I wonder what method you use to convert. You say nothing about that. Nevertheless, that method is obviously the core problem. And what do you plan to do once the HTML is converted into XML?
To tell the truth, no conversion is needed, given that HTML is already XML (well-formed HTML at least). Simply load your HTML into an XDocument, for example, and that's it. Nothing special to do.
Try this:
Install SgmlReader from NuGet.
If you have the HTML in a string variable, as below, you will have to wrap it in a TextReader object.
Then use the installed package:
static XmlDocument HTMLTEST()
{
string html = "<table frame=all><tgroup></tgroup></table>";
TextReader reader = new StringReader(html);
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All;
sgmlReader.InputStream = reader;
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true; //false if you dont want whitespace
doc.XmlResolver = null;
doc.Load(sgmlReader);
return doc;
}
The input is an HTML string, and the return value is an XmlDocument.
Your frame=all from html will become frame="all".

XML Illegal Characters in path

I am querying a SOAP-based service and wish to analyze the XML returned. However, when I try to load the XML into an XDocument in order to query the data, I get an 'illegal characters in path' error message. Below is the XML returned from the service. I simply want to get the list of competitions and put them into a List I have set up. The XML does load into an XmlDocument, though, so it must be correctly formatted?
Any advice on the best way to do this and get round the error would be greatly appreciated.
<?xml version="1.0" ?>
- <gsmrs version="2.0" sport="soccer" lang="en" last_generated="2010-08-27 20:40:05">
- <method method_id="3" name="get_competitions">
<parameter name="area_id" value="1" />
<parameter name="authorized" value="yes" />
<parameter name="lang" value="en" />
</method>
<competition competition_id="11" name="2. Bundesliga" soccertype="default" teamtype="default" display_order="20" type="club" area_id="80" last_updated="2010-08-27 19:53:14" area_name="Germany" countrycode="DEU" />
</gsmrs>
Here is my code, I need to be able to query the data in an XDoc:
string theXml = myGSM.get_competitions("", "", 1, "en", "yes");
XmlDocument myDoc = new XmlDocument();
myDoc.LoadXml(theXml);
XDocument xDoc = XDocument.Load(myDoc.InnerXml);
You don't show your source code, however I guess what you are doing is this:
string xml = ... retrieve ...;
XmlDocument doc = new XmlDocument();
doc.Load(xml); // error thrown here
The Load method expects a file name, not the XML itself, which is why passing an XML string produces the 'illegal characters in path' error. To load an actual XML string, just use the LoadXml method:
... same code ...
doc.LoadXml(xml);
Similarly, the XDocument.Load(string) method expects a file name, not the XML itself. XDocument has no LoadXml method, so one way of loading the XML from a string is like this:
string xml = ... retrieve ...;
XDocument doc;
using (StringReader s = new StringReader(xml))
{
doc = XDocument.Load(s);
}
As a matter of fact when developing anything, it's a very good idea to pay attention to the semantics (meaning) of parameters not just their types. When the type of a parameter is a string it doesn't mean one can feed in just anything that is a string.
Also, with respect to your updated question, it makes no sense to use XmlDocument and XDocument at the same time. Choose one or the other.
Following up on Ondrej Tucny's answer :
If you would like to use an XML string instead, you can use an XElement and call its Parse method (for your needs, either XElement or XDocument would do).
For example:
string theXML = "... get something xml-ish ...";
XElement xEle = XElement.Parse(theXML);
// do something with your XElement
The XElement's Parse method lets you pass in an XML string, while the Load method needs a file name.
Why not
XDocument.Parse(theXml);
I assume this will be the right solution
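As a sketch of the end goal described in the question (getting the competitions into a List), the following parses the string and pulls out the name attribute of each competition element with LINQ to XML. The class and method names are hypothetical, and the element and attribute names are taken from the sample response above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CompetitionParser
{
    // Parses an XML string (not a file name) and returns the competition names.
    public static List<string> GetCompetitionNames(string theXml)
    {
        XDocument doc = XDocument.Parse(theXml);
        return doc.Descendants("competition")
                  .Select(c => (string)c.Attribute("name"))
                  .Where(name => name != null) // skip elements without a name
                  .ToList();
    }
}
```

With the sample response, the returned list would contain the single entry "2. Bundesliga".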
If this is really your output it is illegal XML because of the minus characters ('-'). I suspect that you have cut and pasted this from a browser such as IE. You must show the exact XML from a text editor, not a browser.

How to handle xml that contains nested xml using c# xmlreader?

I'm using c# to interact with a database that has an exposed REST API. The table that I'm interested in contains forum posts, some of which themselves contain xml.
Whenever my result set contains a post that has xml, my application throws an error as follows:
Exception Details: System.Xml.XmlException: '>' is an unexpected token. The expected token is '"' or '''. Line 1, position 62.
And this is the line that fails:
Line 44: ds.ReadXml(xmlData);
And this is the code I'm using:
var webClient = new WebClient();
string searchString = searchValue.Text;
string requestUrl = "http://myserver/restapi.ashx/search.xml?pagesize=4&pageindex=0&query=";
requestUrl += searchString;
XmlReaderSettings settings = new XmlReaderSettings();
settings.ProhibitDtd = false;
XmlReader xmlData = XmlReader.Create(webClient.OpenRead(requestUrl),settings);
DataSet ds = new DataSet();
ds.ReadXml(xmlData);
Repeater1.DataSource = ds.Tables[1];
Repeater1.DataBind();
And this is the type of XML record that it's choking on (the stuff in the <Body> node is causing the problem):
<SearchResults PageSize="1" PageIndex="0" TotalCount="342">
<SearchResult>
<ContentId>994</ContentId>
<Title>Help Files: What are they written in?</Title>
<Url>http://myserver/linktest.aspx</Url>
<Date>2008-10-16T16:18:00+01:00</Date><ContentType>post</ContentType>
<Body><div class="ForumPostBodyArea"> <div class="ForumPostContentText"> <p>Can anyone see anything obviously wrong with this xml, when its fired to CRM Its creating 13 null records.</p> <p><?xml version="1.0" encoding="UTF-8"?><soap:Envelope xmlns:typens="http://tempuri.org/type" soap:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" xmlns:wsdlns="http://tempuri.org/wsdl/" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Header><SessionHeader><sessionId xsi:type="xsd:long">18208442035524</sessionId></SessionHeader></soap:Header><soap:Body><typens:add><entityname xsi:type="xsd:string">lead</entityname><records xsi:nil="true" xsi:type="typens:ewarebase" /><status xsi:type="xsd:string">PreRegistration</status><requester xsi:type="xsd:string">Mimnagh</requester><personfirstname xsi:type="xsd:string">Sean</personfirstname><personlastname xsi:type="xsd:string">Test2</personlastname><personsalutation xsi:type="xsd:string">Mr</personsalutation><details xsi:type="xsd:string">test project details</details><description xsi:type="xsd:string">test description details</description><comments xsi:type="xsd:string">test project comments</comments><personemail xsi:type="xsd:string">smimnagh#mac.com</personemail><personphonenumber xsi:type="xsd:string">12334566777</personphonenumber><type xsi:type="xsd:string">PreReg</type><companyname xsi:type="xsd:string">Site Client</companyname></typens:add></soap:Body></soap:Envelope></p> <p>Many thanks</p> </div> </div>
</Body>
<Tags>
<Tag>xml</Tag>
</Tags>
<IndexedAt>2010-07-08T11:53:46.848+01:00</IndexedAt>
</SearchResult>
</SearchResults>
Is there something that I can do with the xmlreader to make it ignore whatever's causing the problem?
Please note that I can't change the XML prior to consuming it - so if it's malformed then I wonder if there's a way to ignore or modify that particular record without generating an error?
Thanks!
It looks like some of the quotes in the contents of some of your elements need escaping. Try using
&quot;
for quote marks that aren't wrapping attribute values.
UPDATE:
Because the data you want to read isn't strictly XML (it's nearly XML), your best bet is one of the following:
1. You or your boss, if you have one, scream at the third party because they're not sending you well-formed XML.
2. Perform some horrible hack to try to convert whatever you might get to XML.
If you have to go with point 2, the simplest thing that pops into my head is to read the characters of the 'XML', counting in and out of angle brackets. If you find any " characters and you're not within any angle brackets, replace the " with
&quot;
But note that doing that is a complete last resort.
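A rough sketch of that last-resort hack, assuming the only defect is bare double quotes in text content. It tracks whether the current position is inside angle brackets and escapes quotes only outside them. The class and method names are hypothetical, and it will not cope with comments, CDATA sections, or stray angle brackets in the data.

```csharp
using System.Text;

public static class QuoteFixer
{
    // Escapes double quotes that appear outside of <...> tags.
    public static string EscapeQuotesOutsideTags(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        int depth = 0; // greater than zero while inside angle brackets
        foreach (char c in input)
        {
            if (c == '<') depth++;
            else if (c == '>' && depth > 0) depth--;

            if (c == '"' && depth == 0) sb.Append("&quot;");
            else sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Quotes that delimit attribute values are left alone because they occur while the counter is above zero; only quotes in element text are rewritten.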
The Content of your <Body> tag is not well formed. XML is very strict with the syntax of data. Either embed a CDATA section into your XML or escape the string properly.

Is there a quick way to format an XmlDocument for display in C#?

I want to output my InnerXml property for display in a web page. I would like to see indentation of the various tags. Is there an easy way to do this?
Here's a little class that I put together some time ago to do exactly this.
It assumes that you're working with the XML in string format.
public static class FormatXML
{
public static string FormatXMLString(string sUnformattedXML)
{
XmlDocument xd = new XmlDocument();
xd.LoadXml(sUnformattedXML);
StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
XmlTextWriter xtw = null;
try
{
xtw = new XmlTextWriter(sw);
xtw.Formatting = Formatting.Indented;
xd.WriteTo(xtw);
}
finally
{
if(xtw!=null)
xtw.Close();
}
return sb.ToString();
}
}
You should be able to do this with code formatters. You would have to html encode the xml into the page first.
Google has a nice prettifier that is capable of visualizing XML as well as several programming languages.
Basically, put your XML into a pre tag like this:
<pre class="prettyprint">
<link href="prettify.css" type="text/css" rel="stylesheet" />
<script type="text/javascript" src="prettify.js"></script>
</pre>
Use the XML Web Server Control to display the content of an xml document on a web page.
EDIT: You should pass the entire XmlDocument to the Document property of the XML Web Server Control to display it. You don't need to use the InnerXml property.
If indentation is your only concern and you can afford to launch an external process, you can process the XML file with the HTML Tidy console tool (~100K).
The code is:
tidy --input-xml y --output-xhtml y --indent "1" $(FilePath)
Then you can display the indented string on the web page once you have encoded the special characters.
It would also be easy to create a recursive function that produces such output: simply iterate over the nodes starting from the root and recurse into each child node, passing the indentation as a parameter to each new recursive call.
Check out the free Actipro CodeHighlighter for ASP.NET - it can neatly display XML and other formats.
Or are you more interested in actually formatting your XML? Then have a look at the XmlTextWriter - you can specify things like Formatting (indented or not) and the indent level, and then write out your XML to e.g. a MemoryStream and read it back from there into a string for display.
Marc
Use an XmlWriter created with XmlWriterSettings that have indentation enabled (XmlTextWriter has its own Formatting property instead). You can use a StringWriter as "temporary storage" if you want to write the resulting string to the screen.
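A minimal sketch of this approach, assuming the XML is already loaded in an XmlDocument. It uses XmlWriter.Create with XmlWriterSettings.Indent and a StringWriter as the temporary storage; the class and method names are hypothetical.

```csharp
using System.IO;
using System.Xml;

public static class XmlPrettyPrinter
{
    // Serializes the document to a string with indentation enabled.
    public static string ToIndentedString(XmlDocument doc)
    {
        XmlWriterSettings settings = new XmlWriterSettings
        {
            Indent = true,
            // Leave out the XML declaration the writer would otherwise emit.
            OmitXmlDeclaration = true
        };
        StringWriter sw = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(sw, settings))
        {
            doc.Save(writer);
        }
        return sw.ToString();
    }
}
```

Before placing the result in a page, HTML-encode it (for example with HttpUtility.HtmlEncode) and wrap it in a pre element so the indentation is preserved.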

UTF-8 encoding issue

I am trying to fetch data from rss feed (feed location is http://www.bgsvetionik.com/rss/ ) in c# win form. Take a look at the following code:
public static XmlDocument FromUri(string uri)
{
XmlDocument xmlDoc;
WebClient webClient = new WebClient();
using (Stream rssStream = webClient.OpenRead(uri))
{
XmlTextReader reader = new XmlTextReader(rssStream);
xmlDoc = new XmlDocument();
xmlDoc.XmlResolver = null;
xmlDoc.Load(reader);
}
return xmlDoc;
}
Although xmlDoc.InnerXml contains an XML declaration with UTF-8 encoding, I get &scaron; instead of š etc.
How can I solve it?
The feed's data is incorrect. The &scaron; is inside a CDATA section, so it isn't being treated as an entity by the XML parser.
If you look at the source XML, you'll find that there's a mixture of entities and "raw" characters, e.g. čišćenja in the middle of the first title.
If you need to correct that, you'll have to do it yourself with a Replace call - the XML parser is doing exactly what it's meant to.
EDIT: For the replacement, you could get hold of all the HTML entities and replace them one by one, or just find out which ones are actually being used. Then do:
string text = element.Value.Replace("&scaron;", "š")
.Replace(...);
Of course, this means that anything which is actually correctly escaped and should really be that text will get accidentally replaced... but such is the problem with broken data :(
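A small sketch of that replacement step, assuming you have worked out which entities the feed actually uses. The entity table here is hypothetical and deliberately tiny; extend it with the entities you observe in the data.

```csharp
using System.Collections.Generic;

public static class FeedTextFixer
{
    // Hypothetical table: only the entities actually seen in the feed.
    private static readonly Dictionary<string, string> Entities =
        new Dictionary<string, string>
        {
            { "&scaron;", "\u0161" }, // š
            { "&Scaron;", "\u0160" }  // Š
        };

    // Replaces each known entity with its character; everything else is untouched.
    public static string ReplaceKnownEntities(string text)
    {
        foreach (KeyValuePair<string, string> pair in Entities)
        {
            text = text.Replace(pair.Key, pair.Value);
        }
        return text;
    }
}
```

Entities that are not in the table, such as &amp;, pass through unchanged. System.Net.WebUtility.HtmlDecode is a broader alternative that decodes many entities in one call, with the same caveat about text that was escaped on purpose.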
