XML invalid using the following characters £ ` – - c#

I am trying to create an RSS feed that will validate using the W3C validator.
I keep getting errors for URLs that contain the characters £, ` or –.
Here is one such URL:
http://www.example.co.uk/news/2012/april/stamp-rationing-–-why-the-royal-mail-are-ripping-you-off
Here is the error:
This feed does not validate.
line 14, column 119: link must be a full and valid URL: http://www.example.co.uk/news/2012/april/stamp-rationing-–-why-the-royal-mail-are-ripping-you-off [help]
... –-why-the-royal-mail-are-ripping-you-off
I have tried replacing the symbols with escape characters but this doesn't work. Here are the escape characters I have been using:
Text = Text.Replace("-", "&#45");
Text = Text.Replace("£", "%C2%A");
Text = Text.Replace("`", "%60");
Text = Text.Replace("’", "%60");
Does anyone have any idea how to solve this problem? Here are some more links that are causing me problems:
http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000
Error:
This feed does not validate.
line 14, column 106: link must be a full and valid URL: http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000 [help]
... -sale-3-bed-detached-london-home-£15,000

You will need to URL encode the URLs before posting them in the RSS:
var encoded = HttpUtility.UrlEncode(aUrl);
Note that the URLs will not be usable directly as :, / etc will also get encoded.
If you want the values of these to be valid XML, use SecurityElement.Escape instead.
var escaped = SecurityElement.Escape(aUrl);
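One way to combine the two steps without encoding the scheme and path separators is sketched below. It relies on Uri.AbsoluteUri returning the escaped form of an absolute URL; the class and method names (FeedLinks, ToFeedLink) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Security;

public static class FeedLinks
{
    // Percent-encodes non-ASCII characters in an absolute URL, then escapes
    // XML-significant characters so the value can go inside a <link> element.
    public static string ToFeedLink(string rawUrl)
    {
        // AbsoluteUri keeps "http://", the host and the '/' separators intact
        // and percent-encodes characters such as £ as UTF-8 bytes.
        string encoded = new Uri(rawUrl).AbsoluteUri;
        return SecurityElement.Escape(encoded);
    }

    public static void Main()
    {
        Console.WriteLine(ToFeedLink(
            "http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000"));
    }
}
```

With this approach the £ sign should come out as %C2%A3 and the URL stays clickable, which plain HttpUtility.UrlEncode on the whole string does not give you.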

I'm building an API for my system, and I've been using some stuff to normalize the fields. Try filtering this with PHP:
$value = preg_replace('/[^a-z]/i', '', $value);
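$value = preg_replace_callback('/[^\x09\x0A\x0D\x20-\x7F]/', function ($m) { return '&#' . ord($m[0]) . ';'; }, $value); // replaces the /e modifier, which was removed in PHP 7; note this works byte by byte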
$value = htmlentities($value, ENT_NOQUOTES, 'UTF-8', false);

The answer is either to use UTF-8 encoding or to convert non-ASCII characters to XML entities.
UTF-8 encoding: Make sure the document is output in UTF-8 and includes the relevant encoding headers.
See also UTF-8 encoding xml in PHP
Entity encoding: Convert all non-ASCII characters to XML entities.
XML entities look like this: &#163; (that one is for the £ sign). Most programming languages will either do this automatically for you as you generate the XML document, or provide standard functions for doing it. You didn't specify the language you're using, but the above should help you find the appropriate API functions.
One thing you should not be doing is generating XML data manually (i.e. outputting tags and attributes as strings), or string-replacing the entities manually. You should be using the proper APIs for it. Generating XML (or any other standard data format) manually is always likely to end in problems like this, and it does seem a bit crazy to do it the hard way if the tools are right there in front of you to do it properly.
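Since the question is tagged C#, here is a minimal sketch of what "using the proper APIs" could look like with System.Xml.XmlWriter. The class name RssSketch and the feed layout are assumptions for illustration; a real RSS channel needs more elements (title, link, description) than shown here.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class RssSketch
{
    // Writes a tiny RSS skeleton as UTF-8 and returns it as a string.
    public static string BuildFeed(string title, string link)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false), // UTF-8 without a byte order mark
            Indent = true
        };

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();
                writer.WriteStartElement("rss");
                writer.WriteAttributeString("version", "2.0");
                writer.WriteStartElement("channel");
                writer.WriteStartElement("item");
                writer.WriteElementString("title", title); // the writer escapes & and <
                writer.WriteElementString("link", link);
                writer.WriteEndElement(); // item
                writer.WriteEndElement(); // channel
                writer.WriteEndElement(); // rss
                writer.WriteEndDocument();
            }
            // The writer has been disposed (and flushed) at this point.
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        Console.WriteLine(BuildFeed("Fish & chips for £5", "http://www.example.co.uk/news"));
    }
}
```

The writer takes care of the XML declaration and of escaping, so no manual Replace calls are needed. The link itself still has to be a valid, percent-encoded URL to pass the feed validator.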

Related

Is it possible to read French characters into a C# string from an .eml file?

I have a project where I need to generate a .pdf file based on the content of an .eml file. When dealing with just English characters, I'm fine: the pdf is created flawlessly and everything works (after I strip all the needless HTML junk).
However, an issue arises when I try to read in an .eml file that is filled with French characters. In particular, the French characters are stored as number codes like =E9, =E8, &#339; and so on.
So my issue is this. I read the .eml file in with:
string content = File.ReadAllText(filePath, Encoding.UTF8);
However, it comes in as plain text and I don't know how to make the system interpret the =E9, =E8, etc. codes as French characters. I can always Regex.Replace everything, but I'm hoping for a more elegant solution. Is there any way to take that long string of plain text and interpret the embedded codes properly, so that the French characters appear instead of their respective codes, without using something like 30 Regex.Replace expressions?
Do note that I can't use any built-in iTextSharp functionality, since I also need to be able to incorporate French characters (pulled from that .eml file) into the file name of the pdf.
Thanks

You can use regexes, but two regexes should be enough:
text = Regex.Replace(text, @"=([0-9A-Fa-f]{2})", match => ((char)uint.Parse(match.Groups[1].Value, NumberStyles.HexNumber)).ToString());
text = Regex.Replace(text, @"&#(\d+);", match => ((char)uint.Parse(match.Groups[1].Value)).ToString());
A different way would be to find a MIME parsing library which exposes methods for parsing parts of MIME messages, that way you'd decode the =E9 codes. Then, you'd need to call WebUtility.HtmlDecode to parse the HTML entities.
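If a MIME library is not an option, the same two steps can be sketched by hand. The helper below (EmlText.Decode is a made-up name) treats each =XX code as a byte in a charset the caller supplies, which also covers multi-byte UTF-8 sequences such as =C3=A9, and then calls WebUtility.HtmlDecode. It assumes the remaining text is plain ASCII and does not handle quoted-printable soft line breaks ("=" at the end of a line).

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class EmlText
{
    // Decodes "=XX" escapes as bytes in `charset`, then resolves HTML
    // entities such as "&#339;".
    public static string Decode(string input, Encoding charset)
    {
        var bytes = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '=' && i + 2 < input.Length
                && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                // Two hex digits after '=' form one encoded byte.
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 2;
            }
            else
            {
                // Literal character: re-encode it in the same charset.
                bytes.AddRange(charset.GetBytes(c.ToString()));
            }
        }

        string text = charset.GetString(bytes.ToArray());
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(Decode("caf=E9 &#339;uvre", latin1));
    }
}
```

The charset should come from the Content-Type header of the message part rather than being guessed.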

Encoding special characters in c#

In my C# application, I receive POST data in the form of XML. Within the XML, one attribute arrives as "SmÃ¥senter (Sandvika SmÃ¥senter)". Before inserting it into the database, I need to convert it to "Småsenter (Sandvika Småsenter)". I tried the code below:
string name = "SmÃ¥senter (Sandvika SmÃ¥senter)";
name = HttpUtility.HtmlDecode(name);
I also tried name = HttpUtility.HtmlEncode(name);
But neither gives the expected output.
Any suggestions for getting the expected characters?
Regards
Sangeetha

You have just encountered Mojibake, which is caused by mixing text encodings. You need to use the same encoding for writing and reading the XML, preferably a Unicode encoding such as UTF-8. You should not try to repair a broken string such as "SmÃ¥senter", but rather make it not break in the first place.
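To see where a string like "SmÃ¥senter" comes from, the following sketch reproduces the mix-up on purpose: the text is written as UTF-8 but read back as ISO-8859-1. The class and method names are illustrative only, and the actual wrong charset in the asker's pipeline is an assumption.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Decodes the UTF-8 bytes of `text` with the wrong charset (ISO-8859-1),
    // reproducing the kind of corruption shown in the question.
    public static string MisreadUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Decodes the bytes with the encoding they were actually written in.
    public static string ReadUtf8(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        Console.WriteLine(MisreadUtf8AsLatin1("Småsenter"));
        Console.WriteLine(ReadUtf8(Encoding.UTF8.GetBytes("Småsenter")));
    }
}
```

The fix therefore belongs at the point where the incoming bytes are turned into a string: that step has to use the encoding the sender used.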

How do you correctly escape a document name in .NET?

We store a bunch of weird document names on our web server (people upload them) that have various characters like spaces, ampersands, etc. When we generate links to these documents, we need to escape them so the server can look up the file by its raw name in the database. However, none of the built in .NET escape functions will work correctly in all cases.
Take the document Hello#There.docx:
UrlEncode will handle this correctly:
HttpUtility.UrlEncode("Hello#There");
"Hello%23There"
However, UrlEncode will not handle Hello There.docx correctly:
HttpUtility.UrlEncode("Hello There.docx");
"Hello+There.docx"
The + symbol is only valid for URL parameters, not document names. Interestingly enough, this actually works on the Visual Studio test web server but not on IIS.
The UrlPathEncode function works fine for spaces:
HttpUtility.UrlPathEncode("Hello There.docx");
"Hello%20There.docx"
However, it will not escape other characters such as the # character:
HttpUtility.UrlPathEncode("Hello#There.docx");
"Hello#There.docx"
This link is invalid as the # is interpreted as a URL hash and never even gets to the server.
Is there a .NET utility method to escape all non-alphanumeric characters in a document name, or would I have to write my own?

Have a look at the Uri.EscapeDataString Method:
Uri.EscapeDataString("Hello There.docx") // "Hello%20There.docx"
Uri.EscapeDataString("Hello#There.docx") // "Hello%23There.docx"
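A short sketch of how this might be used end to end: escape the name when building the link, and unescape it on the server before the database lookup. The base URL and the DocumentLinks helper are hypothetical.

```csharp
using System;

public static class DocumentLinks
{
    // baseUrl is assumed to be a folder URL ending in '/'.
    public static string BuildLink(string baseUrl, string documentName)
    {
        return baseUrl + Uri.EscapeDataString(documentName);
    }

    // Recovers the raw document name from the escaped path segment.
    public static string RecoverName(string escapedSegment)
    {
        return Uri.UnescapeDataString(escapedSegment);
    }

    public static void Main()
    {
        string link = BuildLink("http://www.example.co.uk/docs/", "Hello#There & more.docx");
        Console.WriteLine(link);
        Console.WriteLine(RecoverName("Hello%23There%20%26%20more.docx"));
    }
}
```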

I would approach it a different way: Do not use the document name as key in your look-up - use a Guid or some other id parameter that you can map to the document name on disk in your database. Not only would that guarantee uniqueness but you also would not have this problem of escaping in the first place.

You can use the @ character to create verbatim strings, in which backslashes are not treated as escape sequences. See the pieces of code below.
string str = @"\n\n\n\n";
Console.WriteLine(str);
Output: \n\n\n\n
string str1 = @"\df\%%^\^\)\t%%";
Console.WriteLine(str1);
Output: \df\%%^\^\)\t%%
This kind of formatting is very useful for pathnames and for creating regexes.

How to convert UTF-8 to text in HTML entity?

I have a downloader program that downloads pages from the internet.
The encoding of each page is different; some are in UTF-8 and some are in Unicode.
For example: &#97; which displays the 'a' character; the pages are full of these codes. We need to convert them to normal text.
I used the UnicodeEncoding class in C#, but it did not help.
How can I decode these codes to real characters? Is there a class or method that does this conversion?
Thanks.

That is html-encoded; try HtmlDecode? (you'll need a reference to System.Web.dll)

Text in HTML pages that starts with & and ends with ; is HTML-encoded.
You can decode it by using:
string html = ...; //your html
string decoded = System.Web.HttpUtility.HtmlDecode( html );
Also see Characters in string changed after downloading HTML from the internet for code on how to make sure you download the page in the correct character set.

You're getting confused between HTML/XML escaping and UTF-8/Unicode.
If the page is valid XML, life will be easier - you can just parse it as any other XML document, and then just get the relevant text nodes... all the XML escaping will be "unescaped" when you get the text.
If it's arbitrary - and possibly invalid - HTML then life is a bit harder. You may well want to normalize it into valid HTML first, then parse it and again ask for the text nodes.
If you can give us a more concrete example, it will be easier to advise you.
The HtmlDecode method suggested in other answers may very well be all you need, but you should definitely try to understand what's going on first. For example, you may well want to decode only certain fragments of the HTML: if you decode the whole document, you could end up with text that looks like it contains HTML tags, when the original document just contained that as text.
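Both points can be illustrated with a small sketch. It uses XElement.Parse for the well-formed case and WebUtility.HtmlDecode (from System.Net, so no System.Web reference is needed) for a text fragment; the EntityDemo class and the sample strings are made up for illustration.

```csharp
using System;
using System.Net;
using System.Xml.Linq;

public static class EntityDemo
{
    // For well-formed markup: parse it and read the text; the parser
    // resolves character references and &amp; on its own.
    public static string TextOf(string xmlFragment)
    {
        return XElement.Parse(xmlFragment).Value;
    }

    // For a fragment that is known to be text: decode the entities directly.
    public static string DecodeFragment(string htmlText)
    {
        return WebUtility.HtmlDecode(htmlText);
    }

    public static void Main()
    {
        Console.WriteLine(TextOf("<p>&#97;bc &amp; &#163;15</p>"));
        // Decoding escaped markup yields something that looks like a tag,
        // which is the pitfall of decoding a whole document at once.
        Console.WriteLine(DecodeFragment("&lt;b&gt;not a tag&lt;/b&gt;"));
    }
}
```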

C# Special Characters not displayed properly in XML

I have a string that contains special characters (a trademark sign, etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows ??. This is how I'm using it:
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.HtmlEncode(xxxx), but it converts the value into "&amp;#8482;", so the XML output shows "&#8482;" instead of ™.
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName, but they give ??.

I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
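The save-and-reload check suggested above can be sketched as follows. The element name and sample text are assumptions; the point is only that the string is assigned to InnerText as-is and the XML API handles the escaping and the encoding.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlRoundTrip
{
    // Stores `text` in an element, saves the document to a stream,
    // reloads it and returns the inner text that comes back.
    public static string SaveAndReload(string text)
    {
        var doc = new XmlDocument();
        XmlElement node = doc.CreateElement("name"); // XmlNode itself is abstract
        node.InnerText = text;                       // no manual encoding
        doc.AppendChild(node);

        using (var stream = new MemoryStream())
        {
            doc.Save(stream);
            stream.Position = 0;

            var reloaded = new XmlDocument();
            reloaded.Load(stream);
            return reloaded.DocumentElement.InnerText;
        }
    }

    public static void Main()
    {
        string original = "Acme™ & Co";
        Console.WriteLine(SaveAndReload(original) == original);
    }
}
```

If the round trip returns the original string, the data is fine and the ?? most likely comes from the tool or console used to view the file.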

I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.
