Xslt: escape XML nodes when producing HTML - c#

I am using Microsoft's System.Xml.Xsl to create HTML. When transforming XML that contains escape sequences (e.g. &lt;script&gt;) into HTML, the values are not escaped when they are emitted as attributes.
I would like to produce HTML in which both attribute values and text nodes are escaped.
XML sample:
<Contact>
<Name>hello &lt;script&gt;alert('!')&lt;/script&gt;</Name>
</Contact>
XSLT sample:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes" doctype-system="html" />
<xsl:template match="/">
<span data-title="{{ 'title': '{/Contact/Name}' }}">
Name: <xsl:value-of select="/Contact/Name"/>
Input: <input type="text" value="{/Contact/Name}"/>
</span>
</xsl:template>
</xsl:stylesheet>
Sample code:
using System;
using System.Xml;
using System.Xml.Xsl;
using System.IO;
public class Program
{
public static void Main()
{
var transform = new XslCompiledTransform();
var xml = @"<Contact><Name>hello &lt;script&gt;alert('!')&lt;/script&gt;</Name></Contact>";
var xslt = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
<xsl:output method=""html"" indent=""yes"" doctype-system=""html"" />
<xsl:template match=""/"">
<span data-title=""{{ 'title': '{/Contact/Name}' }}"">
Name: <xsl:value-of select=""/Contact/Name""/>
Input: <input type=""text"" value=""{/Contact/Name}""/>
</span>
</xsl:template></xsl:stylesheet>
";
transform.Load(XmlReader.Create(new StringReader(xslt)));
var settings = transform.OutputSettings.Clone();
using (var output = new MemoryStream())
using (var writer = XmlWriter.Create(output, settings))
{
var args = new System.Xml.Xsl.XsltArgumentList();
transform.Transform(XmlReader.Create(new StringReader(xml)), args, writer);
writer.Flush();
output.Position = 0;
Console.Write(new StreamReader(output).ReadToEnd());
}
}
}
Actual result (with a fiddle):
<!DOCTYPE html SYSTEM "html"><span data-title="{ 'title': 'hello <script>alert('!')</script>' }">
Name: hello &lt;script&gt;alert('!')&lt;/script&gt;
Input: <input type="text" value="hello <script>alert('!')</script>"></span>
Expected / Desired result (with a fiddle):
<!DOCTYPE span SYSTEM "html">
<span data-title="{ 'title': 'hello &lt;script&gt;alert('!')&lt;/script&gt;' }">
Name: hello &lt;script&gt;alert('!')&lt;/script&gt;
Input: <input type="text" value="hello &lt;script&gt;alert('!')&lt;/script&gt;" /></span>

XSLT 3 with xsl:output method="xhtml" might give a result closer to your needs. It is available on .NET Framework through Saxon .NET 10.8 HE from Saxonica, and on .NET 6/7 (Core) through SaxonCS 11 or 12 (a commercial enterprise package) or through the IKVM cross-compiled Saxon HE 11.4 Java, which is shown below:
using net.sf.saxon.s9api;
using net.liberty_development.SaxonHE11s9apiExtensions;
using System.Reflection;
// force loading of updated xmlresolver (workaround until Saxon HE 11.5)
ikvm.runtime.Startup.addBootClassPathAssembly(Assembly.Load("org.xmlresolver.xmlresolver"));
ikvm.runtime.Startup.addBootClassPathAssembly(Assembly.Load("org.xmlresolver.xmlresolver_data"));
var processor = new Processor(false);
var xml = @"<Contact><Name>hello &lt;script&gt;alert('!')&lt;/script&gt;</Name></Contact>";
var xslt = @"<xsl:stylesheet version=""3.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
<xsl:output method=""xhtml"" indent=""yes"" html-version=""5.0"" doctype-system=""about:legacy-compat"" omit-xml-declaration=""yes""/>
<xsl:template match=""/"">
<span data-title=""{{ 'title': '{/Contact/Name}' }}"">
Name: <xsl:value-of select=""/Contact/Name""/>
Input: <input type=""text"" value=""{/Contact/Name}""/>
</span>
</xsl:template></xsl:stylesheet>
";
var xslt30Transformer = processor.newXsltCompiler().compile(xslt.AsSource()).load30();
var inputDoc = processor.newDocumentBuilder().build(xml.AsSource());
using var resultWriter = new StringWriter();
xslt30Transformer.applyTemplates(inputDoc, processor.NewSerializer(resultWriter));
var result = resultWriter.ToString();
Console.WriteLine(result);
Output is e.g.
<!DOCTYPE span
SYSTEM "about:legacy-compat">
<span data-title="{ 'title': 'hello &lt;script&gt;alert('!')&lt;/script&gt;' }">
Name: hello &lt;script&gt;alert('!')&lt;/script&gt;
Input: <input type="text" value="hello &lt;script&gt;alert('!')&lt;/script&gt;"/></span>
An example project showing how to use IKVM and Maven to include Saxon HE 11.4 Java is at https://github.com/martin-honnen/SaxonHE11NET7XHTMLOutputMethodExample1.
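If adding Saxon is not an option, a lighter variation may be worth trying. The following is only a sketch, not part of the answer above: it assumes that switching the stylesheet to method="xml" is acceptable, since XML serialization has to escape < in attribute values. The class name XmlOutputMethodSketch and the omit-xml-declaration setting are illustrative choices. The xml method applies none of the HTML-specific serialization rules, so empty elements come out as <input ... />.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XmlOutputMethodSketch
{
    // Same template as in the question, but serialized with method="xml".
    private const string Xslt = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" indent=""yes"" omit-xml-declaration=""yes"" />
  <xsl:template match=""/"">
    <span data-title=""{{ 'title': '{/Contact/Name}' }}"">
      Name: <xsl:value-of select=""/Contact/Name""/>
      Input: <input type=""text"" value=""{/Contact/Name}""/>
    </span>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var stylesheet = XmlReader.Create(new StringReader(Xslt)))
        {
            transform.Load(stylesheet);
        }

        using (var result = new StringWriter())
        {
            using (var input = XmlReader.Create(new StringReader(xml)))
            using (var writer = XmlWriter.Create(result, transform.OutputSettings))
            {
                // The XmlWriter escapes markup characters in text and attribute values.
                transform.Transform(input, new XsltArgumentList(), writer);
            }
            return result.ToString();
        }
    }

    public static void Main()
    {
        var xml = @"<Contact><Name>hello &lt;script&gt;alert('!')&lt;/script&gt;</Name></Contact>";
        Console.WriteLine(Transform(xml));
    }
}
```

Whether the XML-style output is acceptable depends on how the result is consumed; browsers parsing it as HTML generally tolerate the trailing slash on void elements such as input.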

Related

C# HtmlAgilityPack // Getting p tag's innertext

I'm trying to get the innertext of p tag of below source:
<div class="blah" role="blaah">
<h3 class="blaaah" id="blaaaah">blaaaaah</h3>
<p> Text </p><p> // ... so on
</div>
What I want is: 'Text',
And below is what I tried:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = web.Load(url);
var div = htmlDoc.DocumentNode.SelectSingleNode("//h3[contains(@id, 'blaaaah')]//p");
textBox2.Text = div.InnerText;
It constantly gives me a null.
I'm not familiar with HTML.
It'd be great if anyone can help me.
Thank you in advance!

How to get the inner text for a single node using HtmlAgilityPack

My HTML looks like this:
<div id="footer">
<div id="footertext">
<p>
Copyright © FUCHS Online Ltd, 2013. All Rights Reserved.
</p>
</div>
</div>
I would like to obtain this text from the markup and store it as a string in my C# code: "Copyright © FUCHS Online Ltd, 2013. All Rights ".
This is what I have tried:
public string getvalue()
{
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("www.fuchsonline.com");
var link = doc.DocumentNode.SelectNodes("//div[@id='footertext']");
return link.ToString();
}
This returns an object of type "HtmlAgilityPack.HtmlNodeCollection". How do I get just this text value?
You need the value of a single node, so it is better to use the SelectSingleNode method.
HtmlWeb web = new HtmlWeb();
var doc = web.Load("http://www.fuchsonline.com");
var link = doc.DocumentNode.SelectSingleNode("//div[@id='footertext']/p");
string rawText = link.InnerText.Trim();
string decodedText = HttpUtility.HtmlDecode(rawText); // or WebUtility.HtmlDecode
return decodedText;
You may also need to decode the HTML entity &copy;, which is what the HtmlDecode call above does.
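As a small illustration of the decoding step on its own, here is a sketch that uses only System.Net.WebUtility from the base class library. The class and method names (EntityDecodeSketch, Clean) are made up for this example, and the input string stands in for what InnerText might return.

```csharp
using System;
using System.Net;

public static class EntityDecodeSketch
{
    // Trims surrounding whitespace and turns entities such as &copy; into characters.
    public static string Clean(string innerText)
    {
        return WebUtility.HtmlDecode(innerText.Trim());
    }

    public static void Main()
    {
        string raw = "\n  Copyright &copy; FUCHS Online Ltd, 2013. All Rights Reserved.\n";
        Console.WriteLine(Clean(raw));
    }
}
```

WebUtility is used here instead of HttpUtility so that the sketch does not need a reference to System.Web.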
Here's what you can do:
string html = @"
<div id='footer'>
<div id='footertext'>
<p>
Copyright © FUCHS Online Ltd, 2013. All Rights Reserved.
</p>
</div>
</div>";
// In my example I am not using HtmlWeb because I am working with the piece of HTML you provided. You will continue to use HtmlWeb and access the URL.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var texts = htmlDoc.DocumentNode.SelectNodes("//*[@id='footertext']").Select(n => n.InnerText.Trim());
foreach (var text in texts)
{
Console.WriteLine(text);
}
Output:
public string getvalue()
{
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.fuchsonline.com");
// SelectSingleNode returns one HtmlNode, which exposes InnerText directly
var link = doc.DocumentNode.SelectSingleNode("//div[@id='footertext']");
return link.InnerText.Trim();
}

Basecamp classic api: how to add a new line into comment body

I use the Basecamp Classic API, and I want to add a new line to a comment body. I have tried using Environment.NewLine and CDATA, but Basecamp removes them from the resulting text.
Does anybody know how to do it? Is it possible?
Getting a comment through a REST call reveals a div tag within the XML structure for each line break.
Call
GET https://#{account_url}.basecamphq.com/comments/#{comment_id}.xml
Result
<?xml version="1.0" encoding="UTF-8" ?>
<comments count="1" type="array">
<comment>
...
<body>
<div> Comment-Text line ONE</div>
<div> Comment-Text line TWO</div>
</body>
...
</comment>
</comments>
However, posting xml to the API applying the same structure as above results in the following terrible looking comment in Basecamp Classic:
{"div"=>[" Comment-Text line ONE", " Comment-Text line TWO"]}
The CDATA tag does work, but it has to wrap the content of the body element, in the following manner:
<comment>
<body><![CDATA[
<div> Comment-Text line ONE</div>
<div> Comment-Text line TWO</div>
]]></body>
</comment>
Or, as an example of feeding in dynamic content with PHP:
$comment_xml = "<comment><body><![CDATA[<div>Person: " . $first_name . " " . $last_name . "</div><div>Email: " . $email . "</div>]]></body></comment>";
Both <div> and <br /> tags will work for new lines.
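For a C# client, the same payload shape can be produced with LINQ to XML. This is a sketch under assumptions: the class and method names are invented for illustration, each line is HTML-encoded before being wrapped in a div, and only the building of the XML string is shown, not the HTTP call to Basecamp.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public static class CommentXmlSketch
{
    // Wraps each line in <div>...</div> and places the markup in a CDATA section
    // inside <body>, mirroring the structure described above.
    public static string BuildCommentXml(params string[] lines)
    {
        string html = string.Concat(
            lines.Select(line => "<div>" + WebUtility.HtmlEncode(line) + "</div>"));

        var comment = new XElement("comment",
            new XElement("body", new XCData(html)));

        return comment.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        Console.WriteLine(BuildCommentXml("Comment-Text line ONE", "Comment-Text line TWO"));
    }
}
```

Encoding each line first means user-supplied text cannot close the div early or terminate the CDATA section.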
I tried with \n, but it didn't work.
table_config = [
{
'dbName': f'gomez_datalake_{env}_{team}_{dataset}_db',
'table': 'ConFac',
'partitionKey': 'DL_PERIODO',
'schema': [
['TIPO_DE_VALOR', 'STRING', 2, None,
"CÓDIGO DEL PARÁMETRO DE SISTEMA."
"EJEMPLOS:"
"UF: VALOR DE LA UF"
"IP: VALOR DEL IPC"
"MO: MONEDA"
"IV: VALOR DEL VA"
"UT: VALOR DEL UTM"],
['ORIGEN', 'STRING', 4, None, "IDENTIFICADOR DE USUARIO"]
]
}
]

Getting HTML values in Store apps

I am parsing an HTML file from my storage folder to extract some values.
StorageFile store = await appfolder.GetFileAsync("01MB154.html");
string content = await FileIO.ReadTextAsync(store);
XmlDocument doc = new XmlDocument();
doc.LoadXml(content);
XmlNodeList names = doc.GetElementsByTagName("img");
I get an exception on the LoadXml(content) line:
"An exception of type 'System.Exception' occurred in IMG.exe but was not handled in user code,
Additional information: Exception from HRESULT: 0xC00CE584"
I tried this answer, but it did not work for me.
This is some part from my HTML file.
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
<meta name="generator" content="Web Books Publishing" />
<link rel="stylesheet" type="text/css" href="style.css" />
<title>Main Text</title>
</head>
<body>
<div>
<div class="figcenter">
<img src="images/img2.jpg" alt="Cinderella" title="" />
</div>
I checked some of the files I want to work with, and they do not load either.
I want to know whether there is any other way to get values from HTML.
Thanks,
Your HTML is not well formed according to W3Schools.
Try this:
StorageFile store = await appfolder.GetFileAsync("01MB154.html");
string content = await FileIO.ReadTextAsync(store);
XmlDocument doc = new XmlDocument();
XmlLoadSettings loadSettings = new XmlLoadSettings();
loadSettings.ProhibitDtd = false;
doc.LoadXml(content, loadSettings);
XmlNodeList names = doc.GetElementsByTagName("img");
UPDATE 1
Here's my working code
StorageFile store = await Windows.ApplicationModel.Package.Current.InstalledLocation.GetFileAsync("01MB154.html");
string content = await FileIO.ReadTextAsync(store);
XmlDocument doc = new XmlDocument();
XmlLoadSettings loadSettings = new XmlLoadSettings();
loadSettings.ProhibitDtd = false;
doc.LoadXml(content, loadSettings);
XmlNodeList names = doc.GetElementsByTagName("img");
UPDATE 2
replace to &nbsp;, it worked for me.

HttpWebRequest login with hidden input

I'm trying to log in to a website using an HttpWebRequest in a Windows 8 Store application. The login form looks like this:
<form method="post" action="" onsubmit="javascript:return FormSubmit();">
<div>
<div>
<span>gebruikersnaam</span>
<input size="17" type="text" id="username" name="username" tabindex="1" accesskey="g" />
</div>
<div>
<span>wachtwoord</span>
<input size="17" type="password" id="password" name="password" maxlength="48" tabindex="2" accesskey="w" autocomplete="off" />
</div>
<div class="controls">
<input type="submit" id="submit" accesskey="l" value="inloggen" tabindex="4" class="button"/>
</div>
<!-- The following hidden field must be part of the submitted Form -->
<input type="hidden" name="lt" value="_s3E91853A-222D-76B6-16F9-DB4D1FD397B7_c8424159E-BFAB-EA2A-0576-CD5058A579B4" />
<input type="hidden" name="_eventId" value="submit" />
<input type="hidden" name="credentialsType" value="ldap" />
</div>
</form>
I'm able to send most of the required inputs, except for the hidden input named "lt". This is a randomly generated code used for security purposes, so I can't hard-code it in my script. My current script looks like this:
HttpWebRequest loginRequest2 = (HttpWebRequest)WebRequest.Create(LOGIN_URL_REDIRECT);
loginRequest2.CookieContainer = CC;
loginRequest2.Method = "POST";
loginRequest2.Accept = "image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-ms-application, application/x-ms-xbap, application/vnd.ms-xpsdocument, application/xaml+xml, */*";
loginRequest2.ContentType = "application/x-www-form-urlencoded";
loginRequest2.Headers["Accept-Encoding"] = "gzip,deflate";
loginRequest2.Headers["Accept-Language"] = "en-us";
StreamWriter sw = new StreamWriter(await loginRequest2.GetRequestStreamAsync());
sw.Write("username=" + userName + "&password=" + passWord + "&_eventId=submit&credentialsType=ldap");
await sw.FlushAsync();
HttpWebResponse response2 = (HttpWebResponse)await loginRequest2.GetResponseAsync();
How can I get the content of the hidden input "lt" before making the request?
Using HTML Agility Pack, you can use this code snippet to get the hidden field value:
var doc = new HtmlWeb().Load(LOGIN_URL_REDIRECT);
var nodes = doc.DocumentNode
.SelectNodes("//input[@type='hidden' and @name='lt' and @value]");
foreach (var node in nodes) {
var inputName = node.Attributes["name"].Value;
var inputValue = node.Attributes["value"].Value;
Console.WriteLine("Name: {0}, Value: {1}", inputName, inputValue);
}
Using Linq:
var nodes = from n in doc.DocumentNode.DescendantNodes()
where n.Name == "input" &&
n.GetAttributeValue("type", "") != "" &&
n.GetAttributeValue("name", "") == "lt" &&
n.Attributes.Contains("value")
select new
{
Name = n.Attributes["name"].Value, // the input's name ("lt"), matching the XPath version
n.Attributes["value"].Value
};
foreach (var node in nodes) {
Console.WriteLine("Name: {0}, Value: {1}", node.Name, node.Value);
}
You need to load the web page, read the random key from it, and send the same key when you post to the second page. I wrote a tool that can do this; read about it at http://innosia.com/Home/Article/WEBSCRAPER
If you don't want to use WebScraper, you at least need a cookie-aware client such as the CookieWebClient class found in the downloaded solution. Using it looks something like this:
// Use a cookie-aware class; this class can be found in my WebScraper solution
CookieWebClient cwc = new CookieWebClient();
string page = cwc.DownloadString("http://YourUrl.com"); // Cookie is set to the key
// Filter the key
string search = "name=\"lt\" value=\"";
int start = page.IndexOf(search) + search.Length; // first character of the value
int end = page.IndexOf("\"", start); // closing quote of the value
string key = page.Substring(start, end - start);
// Use special method in CookieWebClient to post data since .NET implementation has some issues.
// CookieWebClient is the class I wrote found in WebScraper solution you can download from my site
// Re use the cookie that is set to the key
string afterloginpage = cwc.PostPage("http://YourUrl.com", string.Format("username={0}&password={1}&lt={2}&_eventId=submit&credentialsType=ldap", userid, password, key));
// DONE
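Whichever way the "lt" value is obtained, the posted body should be URL-encoded, otherwise a password or key containing & or = would corrupt the form data. A minimal sketch using only the base class library follows; the class name FormBodySketch and the sample values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormBodySketch
{
    // Builds an application/x-www-form-urlencoded body, escaping every name and value.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }

    public static void Main()
    {
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("username", "someUser"),   // placeholder value
            new KeyValuePair<string, string>("password", "p&ss=word"),  // placeholder value
            new KeyValuePair<string, string>("lt", "_s3E9-example"),    // value scraped from the form
            new KeyValuePair<string, string>("_eventId", "submit"),
            new KeyValuePair<string, string>("credentialsType", "ldap")
        };
        Console.WriteLine(Encode(fields));
    }
}
```

The resulting string can be written to the request stream in place of the hand-concatenated body.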
