Remove Encoded HTML from Strings using RegEx

Remove Encoded HTML from Strings using RegEx - c#

I currently have an extension method from removing any HTML from strings.
Regex.Replace(s, #"<(.|\n)*?>", string.Empty);
This works fine on the whole, however, I am occasionally getting passed strings that have both standard HTML markup within them, along with encoded markup (I don't have control of the source data so can't correct things at the point of entry), e.g.
<p><p>Sample text</p></p>
I need an expression that will remove both encoded and non-encoded HTML (whether it be paragraph tags, anchor tags, formatting tags etc.) from a string.

I think you can do that in two passes with your same Extension method.
First Replace the usual un-encoded tags then Decode the returned string and do it again. Simple

Related

Format/Safe string for "title" attribute in anchor

I have a function that builds an anchor tag. The function recieves the URL, Title as parameters. The problem is that sometime the text includes quotation marks and this results in a anchor tag generated with syntax errors.
Which is the best way to solve this problems? Is there any function that parses the text into a safe string, in this case, for the title attribute.
Otherwise I can check the string and strip all quotation marks, but I would like know if there is a better way to do this, e.g there might be some other characters that can crash my function as well.

Actually you want to use HttpUtility.HtmlAttributeEncode to encode your title attribute. The other encoders will do more work (and have different uses) whereas this one only escapes ", &, and < to generate a valid text for an attribute.
Example:
This is a <"test"> & something else. becomes This is a <"Test"> & something else.

How to convert UTF-8 to text in HTML entity?

I have a downloader program that download pages from internet .
the encoding of each page is different , some are in UTF-8 and some are Unicode.
For example : a that shows 'a' character ; pages full of this characters .We should convert this encodings to normal text .
I used the UnicodeEncoding class in c# , but they do not help me .
How can i decode this encodings to real characters? Is there a class or method that converting this ?
Thanks .

That is html-encoded; try HtmlDecode? (you'll need a reference to System.Web.dll)

Text in html pages which are in the form of starting with & and ending with ;, are HTML encoded.
You can decode these by using:
string html = ...; //your html
string decoded = System.Web.HttpUtility.HtmlDecode( html );
Also see Characters in string changed after downloading HTML from the internet for code on how to make sure you download the page in the correct character set.

You're getting confused between HTML/XML escaping and UTF-8/Unicode.
If the page is valid XML, life will be easier - you can just parse it as any other XML document, and then just get the relevant text nodes... all the XML escaping will be "unescaped" when you get the text.
If it's arbitrary - and possibly invalid - HTML then life is a bit harder. You may well want to normalize it into valid HTML first, then parse it and again ask for the text nodes.
If you can give us a more concrete example, it will be easier to advise you.
The HtmlDecode method suggested in other answers may very well be all you need - but you should definitely try to understand what's going on first. For example, you may well want to only decode certain fragments of the HTML - if you decode the whole document, then you could end up with text which looks it contains like HTML tags, but actually just contained text in the original document.

Safely split/paginate an HTML String with ASP.NET

I have rendered one of my controls into a string. I want to safely split the html string. I don't want any hanging html tags. I am working on a pagination control adapter.
How can I split my string, around less than a set number of chars) safely taking HTML into account?

Take a look at HtmlAgilityPack. You can use it to parse and manipulate the html in your string without having to resort to regex.

If you're looking for a nice way to show the HTML code you should try HTML Tidy.
I did not use it with a limitation on the number of charecters per line, but I think HTML Tidy wrap option might get you close to your target.

ASP.NET MVC Html.Encode - New lines

Html.Encode seems to simply call HttpUtility.HtmlEncode to replace a few html specific characters with their escape sequences.
However this doesn't provide any consideration for how new lines and multiple spaces will be interpretted (markup whitespace). So I provide a text area for the a user to enter a plain text block of information, and then later display that data on another screen (using Html.Encode), the new lines and spacing will not be preserved.
I think there are 2 options, but maybe there is a better 3rd someone can suggest.
One option would be to just write a static method that uses HtmlEncode, and then replaces new lines in the resulting string with <br> and groups of multiple spaces with
Another option would be to mess about with the white-space: pre attribute in my style sheets - however I'm not sure if this would produce side effects when Html helper methods include new lines and tabbing to make the page source pretty.
Is there a third option, like a global flag, event or method override I can use to change how html encoding is done without having to redo the html helper methods?

HtmlEncode is only meant to encode characters for display in HTML. It specifically does not encode whitespace characters.
I would go with your first option, and make it an extension method for HtmlHelper. Something like:
public static string HtmlEncode(this HtmlHelper htmlHelper,
string text,
bool preserveWhitespace)
{
// ...
}
You could use String.Replace() to encode the newlines and spaces (or Regex.Replace if you need better matching).

Using the style="white-space:pre-wrap;" worked for me. Per this article.

If you use Razor you can do:
#MvcHtmlString.Create(Html.Encode(strToEncode).Replace(Environment.NewLine, "<br />"))
in your view, or in your controller:
HttpServerUtility httpUtil = new HttpServerUtility();
MvcHtmlString encoded = httpUtil.HtmlEncode(strToEncode).Replace(Environment.NewLine, "<br />");
I have not tested the controller method, but it should work the same way.

Put your output inside <pre></pre> and/or <code></code> blocks. E.g:
<pre>#someValue</pre> / <code>#someValue</code>
Use the equivalent css on an existing div:
<div style="white-space:pre-wrap;">#someValue</div>
Depends whether you want the semantic markup or whether you want to fiddle with css. I think these are both neater than inserting <br/> tags.

/// <summary>
/// Returns html string with new lines as br tags
/// </summary>
public static MvcHtmlString ConvertNewLinesToBr<TModel>(this HtmlHelper<TModel> html, string text)
{
return new MvcHtmlString(html.Encode(text).Replace(Environment.NewLine, "<br />"));
}

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, >, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot#somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast#somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to >. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">>
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?

Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HtmlUtilty.HtmlEncode before concatenating them with strings containing HTML markup, hence if this is under your control you should consider dealing with it there.

The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, #"\G((?>[^<>]+|<[^>]*>)*)>", "$1>");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.

Maybe read your HTML into an XML parser which should take care of the conversions for you.

Are you talking about the > chars inside of an HTML tag, (Like in Java's innerText), or in the arguements list of an HTML tag?
If you want to just sanitize the text between the opening and closing tag, that should be rather simple. Just locate any > char, and replace it with the &gt ;. (I'd also do it with the &lt tag), but the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we an find the best solution for it.
Larry

Could you read the string into an XML document and look at the values and replace the > with > in the values. This would require recursively going into each node in the document but that shouldn't be too hard to do.

Steve_C, you may try this RegEx. This will give capture any HTML tags in reference 1, and the text between the tags is stored in capture 2. I didn't fully test this, just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Remove Encoded HTML from Strings using RegEx - c#

I think you can do that in two passes with your same Extension method. First Replace the usual un-encoded tags then Decode the returned string and do it again. Simple

Related

Format/Safe string for "title" attribute in anchor

How to convert UTF-8 to text in HTML entity?

Safely split/paginate an HTML String with ASP.NET

ASP.NET MVC Html.Encode - New lines

Convert > to HTML entity equivalent within HTML string

Categories

Resources