Encoding special characters in c# - c#

In my C# application, I receive POST data in the form of XML. Within the XML, one attribute arrives as "SmÃ¥senter (Sandvika SmÃ¥senter)". Before inserting it into the database I need to convert it to "Småsenter (Sandvika Småsenter)". I tried the code below:
string name = "Småsenter (Sandvika Småsenter)";
name = HttpUtility.HtmlDecode(name);
I also tried name = HttpUtility.HtmlEncode(name);
But neither gives the expected output.
Are there any suggestions for getting the expected characters?
Regards
Sangeetha

You have just encountered mojibake, which is caused by mixing text encodings: the text was written in one encoding and read back in another. You need to use the same encoding for writing and reading the XML, preferably a Unicode encoding such as UTF-8. You should not try to repair a broken string such as "SmÃ¥senter", but rather make it not break in the first place.
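To illustrate how the broken string comes about, here is a minimal sketch that encodes the text as UTF-8 and then decodes the same bytes with a single-byte encoding. The choice of ISO-8859-1 as the "wrong" encoding is an assumption (Windows-1252 gives the same result for this string), and the class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Decodes UTF-8 bytes with the wrong (ISO-8859-1) encoding,
    // which is how the mojibake arises.
    public static string DecodeWithWrongEncoding(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Decodes the same bytes with the encoding they were written in.
    public static string DecodeWithRightEncoding(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string original = "Sm\u00E5senter"; // "Småsenter"

        // 'å' is the two bytes C3 A5 in UTF-8; read as ISO-8859-1
        // they become the two characters 'Ã' and '¥'.
        Console.WriteLine(DecodeWithWrongEncoding(original));

        // Reading with the matching encoding returns the original text.
        Console.WriteLine(DecodeWithRightEncoding(original) == original);
    }
}
```

The point of the sketch is that the bytes themselves are fine; only the decoding step is wrong, which is why the fix belongs where the XML is read rather than after the string is already broken.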

Related

How do I extract UTF-8 strings out of a JSON file using LitJSON, as JsonData does not seem to convert?

I've tried many methods to extract some strings out of a JSON file using LitJson in Unity.
I've tried encoding conversions all over, tried getting byte arrays and sending them around, and nothing seems to work.
I went to the very start of where I create the JsonData object and tried to run the following test:
public JsonData CreateJSONDataObject()
{
Debug.Assert(pathName != null, "No JSON Data path name set. Please set before commencing read.");
string jsonString = File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.UTF8);
JsonData jsonDataObject = JsonMapper.ToObject(jsonString);
Debug.Log("Test compatibility: ë | " + jsonDataObject["Roots"][2]["name"]);
return jsonDataObject;
}
I made sure my jsonString is using UTF-8, however the output shows this:
Test compatibility: ë | W�den
I've tried many other methods, but as this is making sure to encode right when creating a JsonData object I can't think of what I am doing wrong as I just don't know enough about JSON.
Thank you in advance.
This type of problem occurs when a text file is written with one encoding and read using a different one. I was able to reproduce your problem with the following program, which removes the JSON serialization from the equation entirely:
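string file = @"c:\temp\test.txt";
string text = "Wöden";
// Write with the system ANSI code page (what Encoding.Default means on .NET Framework)...
File.WriteAllText(file, text, Encoding.Default);
// ...then read back as UTF-8: the single byte for 'ö' is not valid UTF-8 and becomes U+FFFD.
string text2 = File.ReadAllText(file, Encoding.UTF8);
Debug.WriteLine(text2);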
Since you are reading with UTF-8 and it is not working, the real question is, what encoding was used to write the file originally? You should be using the same encoding to read it back. I suspect that the file was originally created using either Windows-1252 or iso-8859-1 instead of UTF-8. Try using one of those when you read the file, e.g.:
string jsonString = File.ReadAllText(Application.dataPath + pathName,
Encoding.GetEncoding("Windows-1252"));
You said in the comments that your JSON file was not created programmatically, but was "written by hand", meaning you used Notepad or some other text editor to make the file. If that is so, then that explains how you got into this situation. When you save the file, you should have the option to choose an encoding. For Notepad at least, the default encoding is "ANSI", which most likely maps to Windows-1252 (Western European), but depends on your locale. If you are in the Baltic region, for example, it would be Windows-1257 (Baltic). In any case, "ANSI" is not UTF-8. If you want to save the file in UTF-8 encoding, you have to specifically choose that option. Whatever option you use to save the file, that is the encoding you need to use to read it the next time, whether it is with a text editor or with code. Using the wrong encoding to read the file is what causes the corruption.
To change the encoding of a file, you first have to read it in using the same encoding that it was saved in originally, and then you can write it back out using a different encoding. You can do that with your text editor, simply by re-saving the file with a different encoding, or you can do that programmatically:
string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding);
The key is knowing which encoding was used originally, and therein lies the rub. For legacy encodings (such as Windows-12xx) there is no way to tell because there is no marker in the file which identifies it. Unicode encodings (e.g. UTF-8, UTF-16), on the other hand, can write out a marker at the beginning of the file, called a BOM, or byte-order mark, which can be detected programmatically (for UTF-8 the BOM is optional, so its absence does not rule UTF-8 out). That, coupled with the fact that Unicode encodings can represent all characters, is why they are much preferred over legacy encodings.
For more information, I highly recommend reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
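The BOM detection mentioned above can be sketched with StreamReader, which switches to the encoding announced by a BOM and otherwise keeps the fallback you pass in. The helper name DetectByBom and the use of ISO-8859-1 as the fallback are assumptions for the example, not something from the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Returns the encoding announced by a BOM, or the fallback when the file has none.
    public static Encoding DetectByBom(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            reader.Read(); // detection only happens on the first read
            return reader.CurrentEncoding;
        }
    }

    public static void Main()
    {
        string withBom = Path.GetTempFileName();
        string withoutBom = Path.GetTempFileName();
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // One file saved as UTF-8 with a BOM, one saved in a legacy single-byte encoding.
        File.WriteAllText(withBom, "W\u00F6den", new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));
        File.WriteAllText(withoutBom, "W\u00F6den", latin1);

        Console.WriteLine(DetectByBom(withBom, latin1).WebName);    // BOM found: utf-8
        Console.WriteLine(DetectByBom(withoutBom, latin1).WebName); // no BOM: fallback is kept

        File.Delete(withBom);
        File.Delete(withoutBom);
    }
}
```

Note that a missing BOM only tells you the reader kept the fallback; it does not prove that the fallback is the encoding the file was actually saved in.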

Can not read turkish characters from text file to string array

I am trying to do some kind of sentence processing in Turkish, and I am using a text file as the database. But I cannot read Turkish characters from the text file, and because of that I cannot process the data correctly.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output:
It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase =
File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));
You can fiddle around using Encoding as much as you like. This might eventually yield the expected result, but bear in mind that this may not work with other files.
Usually, C# processes strings and files using Unicode by default. So unless you really need something else, you should try this instead:
Open your text file in Notepad (or any other program) and save it as a UTF-8 file. Then you should get the expected results without any modifications to your code. This is because File.ReadAllLines assumes UTF-8 when you do not specify an encoding; it does not know which encoding the file was saved with. This is the default behavior, which should be preferred.
When you save your text file as UTF-8, then C# will interpret it as such.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (parsed with ASCII)
The file contains the text in a specific Turkish character set, not Unicode. If you don't specify any other behaviour, .NET will assume UTF-8 when reading text from a text file. You have two possible solutions:
Either change the text file to use Unicode (for example utf8) using an external text editor.
Or specify a specific character set to read for example:
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("Windows-1254"));
This will use the Turkish character set defined by Microsoft.
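One caveat with the code page approach: on .NET Core and .NET 5 or later, legacy code pages such as Windows-1254 are only available after registering CodePagesEncodingProvider (on some older targets this may require the System.Text.Encoding.CodePages package). The following is a minimal sketch under that assumption; the class name, helper names and the sample text are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class TurkishFileDemo
{
    // Returns the Turkish Windows code page, registering the provider first
    // so the lookup also works on .NET Core / .NET 5+.
    public static Encoding TurkishEncoding()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("windows-1254");
    }

    // Reads all lines of a file that was saved in Windows-1254.
    public static string[] ReadTurkishLines(string path)
    {
        return File.ReadAllLines(path, TurkishEncoding());
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        string[] lines = { "\u00E7\u011F\u0131\u00F6\u015F\u00FC", "merhaba d\u00FCnya" };

        // Simulate a file saved by an editor in the Turkish code page.
        File.WriteAllLines(path, lines, TurkishEncoding());

        string[] readBack = ReadTurkishLines(path);
        Console.WriteLine(readBack[0] == lines[0] && readBack[1] == lines[1]);

        File.Delete(path);
    }
}
```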

XML invalid using the following characters £ ` –

I am trying to create an RSS feed that will validate using the W3C validator.
I keep getting problems from the following URLS containing the characters £, ` or -
Here are the URLs:
http://www.example.co.uk/news/2012/april/stamp-rationing-–-why-the-royal-mail-are-ripping-you-off
Here is the error:
This feed does not validate.
line 14, column 119: link must be a full and valid URL: http://www.example.co.uk/news/2012/april/stamp-rationing-–-why-the-royal-mail-are-ripping-you-off [help]
... –-why-the-royal-mail-are-ripping-you-off
I have tried replacing the symbols with escape characters but this doesn't work. Here are the escape characters I have been using:
Text = Text.Replace("-", "&#45");
Text = Text.Replace("£", "%C2%A");
Text = Text.Replace("`", "%60");
Text = Text.Replace("’", "%60");
Does anyone have any idea how to solve this problem? Here are some more links that are causing me problems:
http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000
Error:
This feed does not validate.
line 14, column 106: link must be a full and valid URL: http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000 [help]
... -sale-3-bed-detached-london-home-£15,000
You will need to URL encode the URLs before posting them in the RSS:
var encoded = HttpUtility.UrlEncode(aUrl);
Note that the URLs will not be usable directly as :, / etc will also get encoded.
If you want the values of these to be valid XML, use SecurityElement.Escape instead.
var escaped = SecurityElement.Escape(aUrl);
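Since HttpUtility.UrlEncode also encodes the : and / characters, one alternative is to let System.Uri percent-encode only the non-ASCII characters and then XML-escape the result. This is a sketch of that idea rather than the answer's exact method; the ToFeedLink helper name is made up for the example.

```csharp
using System;
using System.Security;

public static class FeedLinkDemo
{
    // Percent-encodes non-ASCII characters (as UTF-8) while leaving the URL structure intact,
    // then escapes XML-reserved characters such as & so the value is safe inside the feed.
    public static string ToFeedLink(string rawUrl)
    {
        string percentEncoded = new Uri(rawUrl).AbsoluteUri;
        return SecurityElement.Escape(percentEncoded);
    }

    public static void Main()
    {
        // The pound sign becomes %C2%A3; the scheme, host and slashes are unchanged.
        Console.WriteLine(ToFeedLink(
            "http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-\u00A315,000"));

        // An ampersand in a query string is escaped as &amp; for XML.
        Console.WriteLine(ToFeedLink("http://www.example.co.uk/search?a=1&b=2"));
    }
}
```

If the feed is written through an XML API such as XmlWriter, the SecurityElement.Escape step should be dropped, because the writer escapes the text itself and the value would otherwise be escaped twice.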
I'm building an API for my system, and I've been using some stuff to normalize the fields. Try filtering this with PHP:
$value = preg_replace('/[^a-z]/i', '', $value);
$value = preg_replace('/[^\x09\x0A\x0D\x20-\x7F]/e', '"&#".ord($0).";"', $value);
$value = htmlentities($value, ENT_NOQUOTES, 'UTF-8', false);
The answer is either to use UTF-8 encoding or to convert non-ASCII characters to XML entities.
UTF-8 encoding: Make sure the document is output in UTF-8, and includes the relevant encoding headers.
See also UTF-8 encoding xml in PHP
Entity encoding: Convert all non ASCII characters to XML entities.
XML entities look like this: &#163; (that one is for the £ sign). Most programming languages will either do this automatically for you as you generate the XML document, or provide standard functions for doing it. You didn't specify the language you're using, but the above should help you find the appropriate API functions.
One thing you should not be doing is generating XML data manually (i.e. outputting tags and attributes as strings), or string-replacing the entities manually. You should be using the proper APIs for it. Generating XML (or any other standard data format) manually is always likely to end in problems like this, and it does seem a bit crazy to do it the hard way if the tools to do it properly are right there in front of you.

How to convert UTF-8 to text in HTML entity?

I have a downloader program that downloads pages from the internet.
The encoding of each page is different; some are in UTF-8 and some are in Unicode.
For example: &#97; shows the 'a' character, and the pages are full of such sequences. We need to convert these encodings to normal text.
I used the UnicodeEncoding class in C#, but it did not help.
How can I decode these encodings to real characters? Is there a class or method that does this conversion?
Thanks.
That is html-encoded; try HtmlDecode? (you'll need a reference to System.Web.dll)
Sequences in HTML pages that start with & and end with ; are HTML-encoded.
You can decode these by using:
string html = ...; //your html
string decoded = System.Web.HttpUtility.HtmlDecode( html );
Also see Characters in string changed after downloading HTML from the internet for code on how to make sure you download the page in the correct character set.
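As a small sketch of the decoding step, System.Net.WebUtility.HtmlDecode offers the same decoding without the System.Web.dll reference (swapping it in for HttpUtility is my choice here, not something the answers above prescribe). The sample input string and the DecodeEntities helper name are made up.

```csharp
using System;
using System.Net;

public static class EntityDecodeDemo
{
    // Turns numeric and named character references back into the characters they stand for.
    public static string DecodeEntities(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    public static void Main()
    {
        // &#97; &#98; &#99; are the decimal code points of 'a', 'b', 'c'.
        Console.WriteLine(DecodeEntities("&#97;&#98;&#99; &amp; &lt;b&gt;")); // abc & <b>
    }
}
```

The decoded result contains a literal <b>, which is why decoding a whole document instead of individual text fragments can produce text that looks like markup.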
You're getting confused between HTML/XML escaping and UTF-8/Unicode.
If the page is valid XML, life will be easier - you can just parse it as any other XML document, and then just get the relevant text nodes... all the XML escaping will be "unescaped" when you get the text.
If it's arbitrary - and possibly invalid - HTML then life is a bit harder. You may well want to normalize it into valid HTML first, then parse it and again ask for the text nodes.
If you can give us a more concrete example, it will be easier to advise you.
The HtmlDecode method suggested in other answers may very well be all you need, but you should definitely try to understand what's going on first. For example, you may well want to decode only certain fragments of the HTML: if you decode the whole document, you could end up with text which looks like it contains HTML tags, but which was actually just plain text in the original document.

C# Special Characters not displayed properly in XML

I have a string that contains special characters (the trademark sign, etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows ??. This is how I'm using it:
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.HtmlEncode(xxxx), but it converts it into "&amp;#8482;", so the XML output shows "&#8482;" instead of ™
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.
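The "save, reload and compare" check suggested above can be sketched as follows. It sets the raw string as InnerText, lets the XML API handle escaping, writes the file as UTF-8 and reads it back. The element name, the sample value and the RoundTrip helper are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlRoundTripDemo
{
    // Writes the value into an XML file as UTF-8 and returns what comes back after reloading.
    public static string RoundTrip(string value, string path)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("product");
        root.InnerText = value; // no manual encoding: the API escapes <, > and & itself
        doc.AppendChild(root);

        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false), Indent = true };
        using (XmlWriter writer = XmlWriter.Create(path, settings))
        {
            doc.Save(writer);
        }

        var reloaded = new XmlDocument();
        reloaded.Load(path);
        return reloaded.DocumentElement.InnerText;
    }

    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "roundtrip-demo.xml");
        string value = "Acme\u2122 <Pro> & Co"; // contains the trademark sign U+2122

        Console.WriteLine(RoundTrip(value, path) == value);
        File.Delete(path);
    }
}
```

If the round trip returns the original string, the XML itself is fine and the ?? most likely comes from the viewer or console using a different encoding.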
