HtmlAgilityPack treats everything after < (less than sign) as attributes

I have some input I get via a textarea, and I convert that input into an HTML document that is later parsed into a PDF document.
When my users input the less-than sign (<), everything breaks in my HtmlDocument. HtmlAgilityPack suddenly handles everything after the less-than sign as an attribute. See the output:
Within this Character Data block I can use double dashes as much as I want (along with <, &,="" ',="" and="" ')="" *and="" *="" %="" myparamentity;="" will="" be="" expanded="" to="" the="" text="" 'has="" been="" expanded'...however,="" i="" can't="" use="" the="" cend="" sequence(if="" i="" need="" to="" use="" it="" i="" must="" escape="" one="" of="" the="" brackets="" or="" the="" greater-than="" sign).="">
It gets a little better if I just add the
htmlDocument.OptionOutputOptimizeAttributeValues = true;
which gives me:
Within this Character Data block I can use double dashes as much as I want (along with <, &,= ',= and= ')= *and= *= %= myparamentity;= will= be= expanded= to= the= text= 'has= been= expanded'...however,= i= can't= use= the= cend= sequence(if= i= need= to= use= it= i= must= escape= one= of= the= brackets= or= the= greater-than= sign).=>
I have tried all of the options on the htmldocument and none of them lets me specify that the parser should not be strict. On the other hand I might be able to live with it stripping away the <, but adding all the equal signs doesn't really work for me.
void Main()
{
var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDoc = WrapContentInHtml(input);
htmlDoc.DocumentNode.OuterHtml.ToString().Dump();
}
private HtmlDocument WrapContentInHtml(string content)
{
var htmlBuilder = new StringBuilder();
htmlBuilder.AppendLine("<!DOCTYPE html>");
htmlBuilder.AppendLine("<html>");
htmlBuilder.AppendLine("<head>");
htmlBuilder.AppendLine("<title></title>");
htmlBuilder.AppendLine("</head>");
htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
htmlBuilder.AppendLine(content);
htmlBuilder.AppendLine("</div></body></html>");
var htmlDocument = new HtmlDocument();
htmlDocument.OptionOutputOptimizeAttributeValues = true;
var htmlDoc = htmlBuilder.ToString();
htmlDocument.LoadHtml(htmlDoc);
return htmlDocument;
}
Does anybody have an idea how I can solve this problem?
The closest question I can find is this:
Losing the 'less than' sign in HtmlAgilityPack loadhtml
Where he actually complains about the < disappearing which would be fine for me. Of course fixing the parsing error is the best solution.
EDIT:
I am using HtmlAgilityPack 1.4.9

Your content is blatantly wrong. This is not about "strictness", it's really about the fact that you're pretending a piece of text is valid HTML. In fact, the results you are getting are exactly because the parser is not strict.
When you need to insert plain text into HTML, you need to encode it first, so that all the various HTML control characters are converted to HTML properly - for example, < must be changed to &lt; and & to &amp;.
One way to handle this is to use the DOM - use InnerText on the target div, instead of slapping strings together and pretending they're HTML. Another is to use some explicit encoding method - for example HttpUtility.HtmlEncode.
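As a minimal sketch of the explicit-encoding route (the class name HtmlWrapSketch is made up for illustration, and the wrapper markup is simply copied from the question), the user's text can be passed through WebUtility.HtmlEncode before it is appended to the markup:

```csharp
using System;
using System.Net;
using System.Text;

static class HtmlWrapSketch
{
    // Hypothetical helper: treats 'plainText' as text, not markup,
    // and encodes it before embedding it in the HTML skeleton.
    public static string WrapContentInHtml(string plainText)
    {
        var html = new StringBuilder();
        html.AppendLine("<!DOCTYPE html>");
        html.AppendLine("<html><head><title></title></head>");
        html.AppendLine("<body><div id='sagsfremstillingContainer'>");
        // '<' becomes "&lt;" and '&' becomes "&amp;", so the parser
        // no longer sees the start of a tag inside the user's text.
        html.AppendLine(WebUtility.HtmlEncode(plainText));
        html.AppendLine("</div></body></html>");
        return html.ToString();
    }
}
```

The resulting string can then be handed to htmlDocument.LoadHtml as in the question; only the user-supplied part is encoded, the surrounding markup stays as it is.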

You can use System.Net.WebUtility.HtmlEncode, which works even without a reference to System.Web.dll (which also has HttpServerUtility.HtmlEncode):
var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(System.Net.WebUtility.HtmlEncode(input));
Debug.Assert(!htmlDocument.ParseErrors.Any());
Result:
Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).

Related

C# Regex pattern for finding a tag in a string

For the string below, I want to select only the inner script tag containing the URL http://cdn.walkme.com/users and replace the selected tag with an empty string. Can somebody help me with the regex pattern?
<script><script type="text/javascript">(function() {var walkme = document.createElement('script'); walkme.type = 'text/javascript'; walkme.async = true; walkme.src='http://cdn.walkme.com/users/cb643dab0d6f4c7cbc9d436e7c06f719/walkme_cb643dab0d6f4c7cbc9d436e7c06f719.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(walkme, s); window._walkmeConfig = {smartLoad:true}; })();</script></script>
I have tried this < script(.+)http://cdn.walkme.com/users/.+?\/script>

I agree that it's not really possible to have a comprehensive, generic regex to parse any (X)HTML the standard supports. That's true just by the nature of these things.
But you're perfectly fine to do lots of smaller cool tasks using Regex. Just like in your case, in order to strip particular script out of the page markup, you could just use the following regex to find an entry and then replace it with an empty string:
\<script\>\<script type="text/javascript"\>\(function\(\) \{var walkme =.*\</script\>
It does a very simple thing - takes everything in between
<script><script type="text/javascript">(function() {var walkme =
(you can include more text to be more specific) and
</script>
Just ensure special symbols (like /, ( or )) are escaped properly.
Edited
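To select only the inner tag, you need to use what is called a positive lookahead to find the first closing tag right after the opening one (note the lazy .*? so the match stops at the first closing tag rather than the last):
<script type="text/javascript">\(function\(\) {var walkme =.*?(?=</script>)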
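A hedged sketch of how the removal could look in C#, assuming the inner tag always opens with exactly <script type="text/javascript"> and contains the walkme URL. The class name InnerScriptRemover and the pattern are illustrative and differ from the patterns above: this one consumes the inner closing tag as well, so only the outer pair is left behind.

```csharp
using System;
using System.Text.RegularExpressions;

static class InnerScriptRemover
{
    // (?:(?!</script>).)*? walks forward one character at a time without
    // crossing a closing tag, so the URL must sit inside the same script.
    static readonly Regex WalkmeScript = new Regex(
        @"<script type=""text/javascript"">(?:(?!</script>).)*?http://cdn\.walkme\.com/users/.*?</script>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Replaces every matching inner script element with an empty string.
    public static string Remove(string html)
    {
        return WalkmeScript.Replace(html, string.Empty);
    }
}
```

Scripts that do not reference the walkme URL are left untouched, because the pattern cannot reach the URL without passing a closing tag first.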

Why is this not HTMLEncoding - "<" or "&"

Can anyone tell me why this is not being encoded by HtmlEncode?
Any string that has < or & before the text, i.e.
<something or &something
is not displayed back on the HTML page. When looking at the encoding, the < and & are not encoded. I would have expected these characters to be encoded to &lt; or &amp;.
edit: this is the code I use to encode the string:
var replacedHtml = Regex.Replace(html,
@"</?(\w*)[^>]*>",
me => AllowedTags.Any(s => s.Equals(me.Groups[1].Value, StringComparison.OrdinalIgnoreCase))
? me.Value
: HttpUtility.HtmlEncode(me.Value), RegexOptions.Singleline);
return replacedHtml;
edit: I think the issue is not on the server side but rather on the Angular side, in the ng-bind-html:
<span ng-bind-html="ctl.linkGroup.Notes | TextToHtmlSafe">
angular.module('CPSCore.Filters')
.filter('TextToHtmlSafe', ['$sce',function ($sce) {
return function (text) {
if (!text)
return text;
var htmlText = text.replace(/\n/g, '<br />');
return $sce.trustAsHtml(htmlText);
};
}]);
is declaring that
<something
without the closing tag is not safe and therefore removes it from the view

Try System.Net.WebUtility.HtmlDecode to properly decode special characters. Using this, &lt; changes to < and &amp; changes to &, which is then displayed properly on HTML pages.

In HTML, the ampersand character (“&”) declares the beginning of an entity reference (a special character). If you want one to appear in text on a web page you should use the encoded named entity “&amp;” (more technical mumbo-jumbo at w3c.org). While most web browsers will let you get away without encoding them, stuff can get dicey in weird edge cases and fail completely in XML.
The other main characters to remember to encode are < (&lt;) and > (&gt;); you don’t want to confuse your browser about where HTML tags start and end.

Ignoring &nbsp; when parsing with HtmlAgilityPack

I'm parsing html table in c# using Html Agility Pack that contains non-breaking space.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);
Where page is a string containing a table with the special character &nbsp; within the text.
<td>&nbsp;test</td>
<td>number = 123&nbsp;</td>
Using SelectSingleNode(".//td").InnerText will contain these special characters, but I want to ignore them.
Is there some elegant way to ignore them (with or without the help of Html Agility Pack) without modifying the source table?

You could use HtmlDecode
string foo = HttpUtility.HtmlDecode("Special char: &nbsp;");
Will give you a string:
Special char:

The "Special Character" non-breaking-space of which you speak is a valid character which can perfectly legitimately appear in text, just as "fancy quotes", em-dash etc can.
Often we want to treat certain characters as being equivalent.
So you might want to treat an em-dash, en-dash and minus sign/dash as
being the same.
Or fancy quotes as the same as straight quotes.
Or the non-breaking-space as an ordinary space.
However this is not something HTML Agility pack can help with. You need to use something like string.Replace or your own canonicalization function to do this.
I would suggest something like:
static string CleanupStringForMyApp(string s){
// replace characters with their equivalents
s = s.Replace('\u00A0', ' '); // U+00A0 (char code 160) is the non-breaking space
// Add any more replacements you want to do here
return s;
}
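The two suggestions can be combined. The sketch below assumes the text read from InnerText still contains the literal entity &nbsp;, decodes it first, and then maps the resulting non-breaking space to an ordinary space. The names TextCleanup and NormalizeSpaces are invented for illustration.

```csharp
using System;
using System.Net;

static class TextCleanup
{
    // Decodes HTML entities, then treats the non-breaking space (U+00A0)
    // as equivalent to an ordinary space.
    public static string NormalizeSpaces(string innerText)
    {
        string decoded = WebUtility.HtmlDecode(innerText);
        return decoded.Replace('\u00A0', ' ');
    }
}
```

Calling TextCleanup.NormalizeSpaces("number = 123&nbsp;") returns "number = 123 " with a plain trailing space; add a Trim() call if surrounding whitespace should go as well.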

How to read/write text and avoid special character signs (<, , >, etc)

I am currently parsing some C# scripts that are stored in a database, extracting the body of some methods in the code, and then writing an XML file that shows the id, the body of the extracted methods, etc.
The problem I have right now is that when I write the code in the XML I have to write it as a literal string, so I thought I'd need to add " at the beginning and end:
new XElement("MethodName", @"""" + Extractor.GetMethodBody(rule.RuleScript, "MethodName") + @"""")
This works, but I have a problem, things that are written in the DB as
for (int n = 1; n < 10; n++)
are written into the XML file (or printed to console) as:
for (int n = 1; n &lt; 10; n++)
How can I get it to print the actual character and not its code? The code in the database is written with the actual characters, not the "safe" &lt;-like ones.

Inside XML (as a text value) it is correct for < to be encoded as &lt;. The internal representation of the XML doesn't affect the value, so let it get encoded. You can get around this by forcing a CDATA section, but in all honesty - it isn't worth it. But here is an example using CDATA:
string noEncoding = new XElement("foo", new XCData("a < b")).ToString();

Why do you think that you have to write it as a literal string? That is not so. Besides, you are not writing it as a literal string at all, it's still a dynamic string value only that you have added quotation marks around it.
A literal string is a string that is written literally in the code, like "Hello world". If you get the string in any other way, it's not a literal string.
The quotation marks that you have added to the string simply add quotation marks to the value; they don't do anything else to the string. You can add the string without the quotation marks just fine:
new XElement("MethodName", Extractor.GetMethodBody(rule.RuleScript, "MethodName"))
Now, the characters that get encoded when they are put in the XML are encoded because they need to be. You can't put a < character inside a value without encoding it.
If you show the XML, you will see the encoded values, and that is just a sign that it works as it should. When you read the XML, the encoded characters will be decoded, and you end up with the original string.
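A small sketch of that round trip with LINQ to XML (the class and method names are made up for illustration): the serialized text shows the encoded form, while reading the element's value back returns the original characters.

```csharp
using System;
using System.Xml.Linq;

static class XmlRoundTrip
{
    // Serializing: the '<' in the method body is written as "&lt;".
    public static string Serialize(string methodBody)
    {
        return new XElement("MethodName", methodBody).ToString();
    }

    // Reading: the parser decodes "&lt;" back to '<'.
    public static string ReadBack(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

For the body "for (int n = 1; n < 10; n++)", Serialize produces <MethodName>for (int n = 1; n &lt; 10; n++)</MethodName>, and ReadBack on that text returns the original line unchanged.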

I don't know what software he's going to use to read the XML, but any that I know of will throw an error on parsing any XML that does not escape < and > chars which aren't used as tag starts and ends. It's just part of the XML specification; these chars are reserved as part of the structure.
If I were you, then, I'd part ways with the System.XML utilities and write this file yourself. Any decent XML tool is going to encode those chars for you, so you should probably not use them. Go with a StreamWriter and create the output the way you are being told to. That way you can control the XML output yourself, even if it means breaking the XML specification.
using (StreamWriter sw = new StreamWriter("c:\\xmlText.xml", false, Encoding.UTF8))
{
sw.WriteLine("<?xml version=\"1.0\"?>");
sw.WriteLine("<Class>");
sw.Write("\t<Method Name=\"MethodName\">");
sw.Write(@"""" + Extractor.GetMethodBody(rule.RuleScript, "MethodName") + @"""");
sw.WriteLine("</Method>");
// ... and so on and so forth
sw.WriteLine("</Class>");
}

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot#somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast#somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?

Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup; if this is under your control, you should consider dealing with it there.

The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
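To see the capture-and-reinsert trick in action, here is a minimal wrapper around the same pattern (the class name AngleBracketEscaper is hypothetical). It assumes the input contains only simple tags, with no comments, CDATA sections, or > characters inside attribute values.

```csharp
using System;
using System.Text.RegularExpressions;

static class AngleBracketEscaper
{
    // \G chains each match to the end of the previous one; group 1 swallows
    // text runs and whole tags, so the final '>' is always a stray one.
    static readonly Regex StrayGreaterThan =
        new Regex(@"\G((?>[^<>]+|<[^>]*>)*)>", RegexOptions.Compiled);

    // Puts the captured prefix back and writes the stray '>' as "&gt;".
    public static string Escape(string html)
    {
        return StrayGreaterThan.Replace(html, "$1&gt;");
    }
}
```

For example, AngleBracketEscaper.Escape("<br/>><br/>>> hi") returns "<br/>&gt;<br/>&gt;&gt; hi": the tags survive and only the quote markers are converted.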

Maybe read your HTML into an XML parser which should take care of the conversions for you.

Are you talking about the > chars inside of an HTML tag (like in Java's innerText), or in the arguments list of an HTML tag?
If you want to just sanitize the text between the opening and closing tag, that should be rather simple. Just locate any > char and replace it with &gt;. (I'd also do the same with &lt;.) But the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry

Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.

Steve_C, you may try this regex. It captures the HTML tag name in group 1, and the text between the tags in group 2. I didn't fully test this, just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
