How to convert UTF-8 to text in HTML entity? - c#

I have a downloader program that downloads pages from the internet.
The encoding of each page is different: some are in UTF-8 and some are Unicode.
For example: &#97; shows the 'a' character, and the pages are full of such sequences. We need to convert these encodings to normal text.
I used the UnicodeEncoding class in C#, but it does not help.
How can I decode these encodings to real characters? Is there a class or method that does this conversion?
Thanks.

That is html-encoded; try HtmlDecode? (you'll need a reference to System.Web.dll)

Text in HTML pages of the form that starts with & and ends with ; is HTML-encoded.
You can decode it using:
string html = ...; //your html
string decoded = System.Web.HttpUtility.HtmlDecode( html );
Also see "Characters in string changed after downloading HTML from the internet" for code on how to make sure you download the page in the correct character set.
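As a minimal sketch of the decoding step above: System.Net.WebUtility.HtmlDecode does the same job as HttpUtility.HtmlDecode without a reference to System.Web.dll. The class name EntityDecoder and the sample string are made up for illustration.

```csharp
using System.Net;

static class EntityDecoder
{
    // Turns numeric (&#97;) and named (&amp;, &lt;, &pound;) character
    // references back into the characters they stand for.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

For example, EntityDecoder.Decode("&#97;&#98;c &amp; &lt;b&gt; &pound;5") gives the string abc & <b> £5. Note that this only undoes the entity encoding; the byte-level character set of the download is a separate concern.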

You're getting confused between HTML/XML escaping and UTF-8/Unicode.
If the page is valid XML, life will be easier - you can just parse it as any other XML document, and then just get the relevant text nodes... all the XML escaping will be "unescaped" when you get the text.
If it's arbitrary - and possibly invalid - HTML then life is a bit harder. You may well want to normalize it into valid HTML first, then parse it and again ask for the text nodes.
If you can give us a more concrete example, it will be easier to advise you.
The HtmlDecode method suggested in other answers may very well be all you need - but you should definitely try to understand what's going on first. For example, you may well want to only decode certain fragments of the HTML - if you decode the whole document, then you could end up with text which looks like it contains HTML tags, but was actually just text in the original document.
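To illustrate the point about well-formed XML, here is a small sketch using LINQ to XML. It assumes the input really is valid XML; the class name XmlText and the sample markup are hypothetical.

```csharp
using System.Xml.Linq;

static class XmlText
{
    // Parses the markup and returns the concatenated text content.
    // The parser unescapes character references, so no separate
    // decoding call is needed.
    public static string TextOf(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

XmlText.TextOf("<p>&#97;bc &amp; <b>d&lt;e</b></p>") returns abc & d<e. The tags are consumed as structure and only the text nodes are unescaped, which avoids the problem of decoded text being mistaken for markup.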

Related

XML invalid using the following characters £ ` –

I am trying to create an RSS feed that will validate using the W3C validator.
I keep getting problems from the following URLs containing the characters £, ` or –
Here are the URLs:
http://www.example.co.uk/news/2012/april/stamp-rationing-–-why-the-royal-mail-are-ripping-you-off
Here is the error:
This feed does not validate.
line 14, column 119: link must be a full and valid URL: http://www.example.co.uk/news/2012/april/stamp-rationing-–-why-the-royal-mail-are-ripping-you-off [help]
... –-why-the-royal-mail-are-ripping-you-off
I have tried replacing the symbols with escape characters but this doesn't work. Here are the escape characters I have been using:
Text = Text.Replace("-", "&#45");
Text = Text.Replace("£", "%C2%A");
Text = Text.Replace("`", "%60");
Text = Text.Replace("’", "%60");
Does anyone have any idea how to solve this problem? Here are some more links that are causing me problems:
http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000
Error:
This feed does not validate.
line 14, column 106: link must be a full and valid URL: http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000 [help]
... -sale-3-bed-detached-london-home-£15,000
You will need to URL encode the URLs before posting them in the RSS:
var encoded = HttpUtility.UrlEncode(aUrl);
Note that the URLs will not be usable directly as :, / etc will also get encoded.
If you want the values of these to be valid XML, use SecurityElement.Escape instead.
var escaped = SecurityElement.Escape(aUrl);
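One way to keep the URL usable while still escaping the problem characters is to percent-encode each path segment separately, so the scheme and the / separators are left alone. This is only a sketch: FeedUrls and EncodePath are hypothetical names, and it assumes you can get at the base URL and the path separately.

```csharp
using System;
using System.Linq;

static class FeedUrls
{
    // Percent-encodes every path segment (as UTF-8) but leaves the
    // "/" separators and the base URL untouched.
    public static string EncodePath(string baseUrl, string path)
    {
        var segments = path.Split('/').Select(Uri.EscapeDataString);
        return baseUrl.TrimEnd('/') + "/" + string.Join("/", segments);
    }
}
```

FeedUrls.EncodePath("http://www.example.co.uk", "news/2012/march/home-£15,000") produces http://www.example.co.uk/news/2012/march/home-%C2%A315%2C000. Note that £ encodes to %C2%A3 in UTF-8, whereas the manual replacement in the question used "%C2%A", which is missing the final digit.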
I'm building an API for my system, and I've been using some stuff to normalize the fields. Try filtering this with PHP:
$value = preg_replace('/[^a-z]/i', '', $value);
$value = preg_replace('/[^\x09\x0A\x0D\x20-\x7F]/e', '"&#".ord($0).";"', $value);
$value = htmlentities($value, ENT_NOQUOTES, 'UTF-8', false);
The answer is either to use UTF-8 encoding or to convert non-ASCII characters to XML entities.
UTF-8 encoding: Make sure the document is output in UTF-8, and includes the relevant encoding headers.
See also UTF-8 encoding xml in PHP
Entity encoding: Convert all non ASCII characters to XML entities.
XML entities look like this: &#163; (that one is for the £ sign). Most programming languages will either do this automatically for you as you generate the XML document, or provide standard functions for doing it. You didn't specify the language you're using, but the above should help you find the appropriate API functions.
One thing you should not be doing is generating XML data manually (i.e. outputting tags and attributes as strings), or string-replacing the entities manually. You should be using the proper APIs for it. Generating XML (or any other standard data format) manually is always likely to end in problems like this, and it does seem a bit crazy to do it the hard way if the tools are right there in front of you to do it properly.
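Since the question's snippets look like C#, here is a rough sketch of letting an XML API do the escaping, using LINQ to XML. The element names and the FeedItem class are illustrative only, not a complete RSS item.

```csharp
using System.Xml.Linq;

static class FeedItem
{
    // Builds the element tree from plain strings; the serializer
    // escapes markup characters such as & and < on output.
    public static string Build(string title, string link)
    {
        var item = new XElement("item",
            new XElement("title", title),
            new XElement("link", link));
        return item.ToString(SaveOptions.DisableFormatting);
    }
}
```

FeedItem.Build("Fish & chips £5", "http://www.example.co.uk/a?b=1&c=2") returns <item><title>Fish &amp; chips £5</title><link>http://www.example.co.uk/a?b=1&amp;c=2</link></item>. The £ is written as-is, so the document still has to be saved and served as UTF-8 (or another encoding that contains it).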

Remove Encoded HTML from Strings using RegEx

I currently have an extension method for removing any HTML from strings.
Regex.Replace(s, @"<(.|\n)*?>", string.Empty);
This works fine on the whole; however, I am occasionally getting passed strings that have both standard HTML markup within them, along with encoded markup (I don't have control of the source data so can't correct things at the point of entry), e.g.
<p>&lt;p&gt;Sample text&lt;/p&gt;</p>
I need an expression that will remove both encoded and non-encoded HTML (whether it be paragraph tags, anchor tags, formatting tags etc.) from a string.
I think you can do that in two passes with your same extension method.
First replace the usual un-encoded tags, then decode the returned string and run the replacement again. Simple.
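A minimal sketch of the two-pass idea, reusing the regex from the question. HtmlStripper and StripHtml are hypothetical names, and WebUtility.HtmlDecode is used here as the decode step.

```csharp
using System.Net;
using System.Text.RegularExpressions;

static class HtmlStripper
{
    // Same pattern as in the question: anything between < and >.
    static readonly Regex Tag = new Regex(@"<(.|\n)*?>");

    public static string StripHtml(string s)
    {
        // Pass 1: remove the ordinary, un-encoded tags.
        string withoutTags = Tag.Replace(s, string.Empty);
        // Decode so that &lt;p&gt; becomes <p>.
        string decoded = WebUtility.HtmlDecode(withoutTags);
        // Pass 2: remove the tags that were hidden by the encoding.
        return Tag.Replace(decoded, string.Empty);
    }
}
```

HtmlStripper.StripHtml("<p>&lt;p&gt;Sample text&lt;/p&gt;</p>") returns Sample text. One caveat: text that legitimately contains an encoded < followed later by an encoded > (for example "a &lt; b &gt; c") will lose the part in between during the second pass.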

Question about Encodings: How can I output from HtmlAgilityPack to a StringWriter and keep the encoding?

I am reading HTML in with HtmlAgilityPack, editing it, then outputting it to a StreamWriter. The HtmlAgilityPack Encoding is Latin1, and the StreamWriter is UnicodeEncoding.
I am losing some characters in the conversion, and I do not want to be.
I don't seem to be able to change the Encoding of a StreamWriter. What is the best way around this problem?
If the web page is really Latin-1 (ISO-8859-1), it can't have any curly quotes in it; Latin-1 has no mappings for those characters. If you can see curly quotes when you open the page in your browser, they could be in the form of HTML entities (&ldquo; and &rdquo; or &#8220; and &#8221;). But I suspect the page's encoding is really windows-1252 despite what the headers and embedded declarations say.
windows-1252 is identical to Latin-1 except that it replaces the control characters in the \x80..\x9F range (decimal 128..159) with more useful (or at least prettier) printing characters. If HtmlAgilityPack is taking the page at its word and decoding it as ISO-8859-1, it will convert \x93 to the control character \u0093, which will look like garbage if you can get it to display at all. The browser, meanwhile, will convert it to \u201C, the Unicode code point for the Left Double Quotation Mark.
I'm not familiar with HtmlAgilityPack and I can't find any docs for it, but I would try to force it to use windows-1252. For example, you could create a windows-1252 (or "ANSI") StreamReader and have HAP use that.
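The difference between the two decodings can be seen with a few bytes. This is a sketch assuming .NET Core or .NET 5+, where windows-1252 has to be made available by registering CodePagesEncodingProvider (on .NET Framework it is built in); the QuoteDecoding class is made up for illustration.

```csharp
using System.Text;

static class QuoteDecoding
{
    // Decodes raw bytes with the named encoding, e.g. "iso-8859-1"
    // or "windows-1252".
    public static string Decode(byte[] bytes, string encodingName)
    {
        // Makes code-page encodings such as windows-1252 available
        // on .NET Core / .NET 5+. Safe to call more than once.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }
}
```

With the bytes 0x93 0x48 0x69 0x94, decoding as "iso-8859-1" yields the control characters \u0093 and \u0094 around "Hi", while decoding as "windows-1252" yields \u201C and \u201D, the curly quotes the browser shows. A StreamReader constructed with the windows-1252 Encoding would apply the same mapping while reading the page.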
At a guess: write to a Stream (not a string). If you write to a string (including StringWriter/StringBuilder), you are implicitly using .NET's UTF-16 string.
If you just want to tweak the reported encoding (but use a string), then look at Jon's answer here.
It is not clear which end you're losing characters at. In any case, a mere encoding mismatch isn't by itself an issue - you're still supposed to get the correct characters. If a Unicode StreamWriter writes out garbled characters, it means that it had received garbage on input in the first place. Which probably means that HtmlAgilityPack got encoding for your page wrong. If it has an option of setting the encoding manually, you might want to do just that.
It may also be that you have an HTML page which has a wrong encoding declaration in it. E.g. it might be a UTF-8 file which contains a <meta> element declaring it as Latin-1. Where do you get the text from? Do you download it straight from the Web, or do you have it in a text file - and if it's the latter, how do you create that file? If you did it manually via Notepad, or in the code via StreamWriter, then you might have a UTF-8 file.

C# Special Characters not displayed properly in XML

I have a string that contains special characters (trademark sign etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows as ??. This is how I'm using it:
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.HtmlEncode(xxxx), but it converts the character into "&amp;#8482;", so the XML output shows "&#8482;" instead of ™
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
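A quick way to run the check suggested above is to set the text through the XML API, serialize, reload, and compare. This is a sketch only; the element name "name" and the class TrademarkRoundTrip are hypothetical.

```csharp
using System.Xml;

static class TrademarkRoundTrip
{
    // Stores text in an element, serializes the document, parses it
    // again, and returns the text that comes back out.
    public static string RoundTrip(string text)
    {
        var doc = new XmlDocument();
        XmlElement node = doc.CreateElement("name");
        node.InnerText = text;          // no manual encoding here
        doc.AppendChild(node);

        var reloaded = new XmlDocument();
        reloaded.LoadXml(doc.OuterXml); // the API escapes & and < itself
        return reloaded.DocumentElement.InnerText;
    }
}
```

TrademarkRoundTrip.RoundTrip("Acme™ & Co") gives back the same string. If the round trip is intact but the file still shows ??, the likely culprit is the encoding used when saving or viewing the file rather than the XML itself.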
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot#somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast#somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?
Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup; if this is under your control, you should consider dealing with it there.
The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
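For reference, here is the capture-and-reinsert replacement wrapped in a small helper. BracketEscaper is a hypothetical name, and as noted above the pattern does not try to handle comments or CDATA.

```csharp
using System.Text.RegularExpressions;

static class BracketEscaper
{
    // Group 1 swallows any run of plain text and complete tags; the
    // trailing > is therefore a stray bracket outside a tag. \G makes
    // each match start where the previous one ended.
    static readonly Regex StrayBracket =
        new Regex(@"\G((?>[^<>]+|<[^>]*>)*)>", RegexOptions.Compiled);

    public static string Escape(string html)
    {
        // Put the captured prefix back, followed by the entity.
        return StrayBracket.Replace(html, "$1&gt;");
    }
}
```

BracketEscaper.Escape("<b>a > b</b>") returns <b>a &gt; b</b>, and the fragment from the question, <div class"quotedReply">> Ok<br/>>, comes out with both stray brackets replaced while the div and br tags are untouched.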
Maybe read your HTML into an XML parser which should take care of the conversions for you.
Are you talking about the > chars inside of an HTML tag (like in JavaScript's innerText), or in the arguments list of an HTML tag?
If you want to just sanitize the text between the opening and closing tag, that should be rather simple. Just locate any > char and replace it with &gt;. (I'd also do the same for < with &lt;.) But the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry
Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.
Steve_C, you may try this regex. It captures the HTML tag name in group 1 and the text between the tags in group 2. I didn't fully test this, just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
