How to properly decode accented characters for display

How to properly decode accented characters for display - c#

My raw input file text file contains a string:
Caf&eacute (Should be Café)
The text file is a UTF8 file.
The output lets say is to another text file, so its not necessarily for a web page.
What C# method(s) can i use to output the correct format, Café?
Apparently a common problem?

Have you tried System.Web.HttpUtility.HtmlDecode("Café")? it returns 538M results

This is HTML encoded text. You need to decode it:
string decoded = HttpUtility.HtmlDecode(text);
UPDATE: french symbol "é" has HTML code "é" so, you need to fix your input string.

You should use SecurityElement.Escape when working with XML files.
HtmlEncode will encode a lot of extra entities that are not required. XML only requires that you escape >, <, &, ", and ', which SecurityElement.Escape does.
When reading the file back through an XML parser, this conversion is done for you by the parser, you shouldn't need to "decode" it.
EDIT: Of course this is only helpful when writing XML files.

I think this works:
string utf8String = "Your string";
Encoding utf8 = Encoding.UTF8;
Encoding unicode = Encoding.Unicode;
byte[] utf8Bytes = utf8.GetBytes(utf8String);
byte[] unicodeBytes = Encoding.Convert(utf8, unicode, utf8Bytes);
char[] uniChars = new char[unicode.GetCharCount(unicodeBytes, 0, unicodeBytes.Length)];
unicode.GetChars(unicodeBytes, 0, unicodeBytes.Length, uniChars, 0);
string unicodeString = new string(uniChars);

Use HttpUtility.HtmlDecode. Example:
class Program
{
static void Main()
{
XDocument doc = new XDocument(new XElement("test",
HttpUtility.HtmlDecode("café")));
Console.WriteLine(doc);
Console.ReadKey();
}
}

Related

Converting unknown characters to Greek characters

I have a file which contains the following characters:
ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ
I am trying to convert that to Greek words and the result should be:
ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ
The file that the above value is stored in in Unicode format.
I am applying all possible encodings but no luck in the conversion.
private void Convert()
{
string textFilePhysicalPath = (#"C:\Users\Nec\Desktop\a.txt");
string contents = File.ReadAllText(textFilePhysicalPath);
List<string> sLines = new List<string>();
// For every encoding, get the property values.
EncodingInfo ei;
foreach (var ei in Encoding.GetEncodings())
{
Encoding e = ei.GetEncoding();
Encoding iso = Encoding.GetEncoding(ei.Name);
Encoding utfx = Encoding.Unicode;
byte[] utfBytes = utfx.GetBytes(contents);
byte[] isoBytes = Encoding.Convert(utfx, iso, utfBytes);
string msg = iso.GetString(isoBytes);
string xx = (ei.Name + " " + msg);
sLines.Add(xx);
}
using (StreamWriter file = new StreamWriter(#"C:\Users\Nec\Desktop\result.txt"))
{
foreach (var line in sLines)
file.WriteLine(line);
}
}
A website that converts it correctly is http://www.online-decoder.com/el
but even when I use the ISO-8859-1 to ISO-8859-7 it still doesn't work in .NET.

This code converts the string from the C# which is UTF-16 to an 8-bit representation using the common ISO-8859-1 codepage. Then it converts it back to UTF-16 using the greek codepage windows-1253. The result is ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ as you want.
string errorneousString = "ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ";
byte[] asIso88591Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(errorneousString);
string asGreekString = Encoding.GetEncoding("windows-1253").GetString(asIso88591Bytes);
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(asGreekString);
Edit: Since your file is encoded in an 8-bit format, you need to specify the codepage when reading it. Use this:
string fileContents = File.ReadAllText("189.dat", Encoding.GetEncoding("windows-1253"));
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(fileContents);
That reads the content as
'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.' 'EH','Hb
ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ' 'EP','PROTEIN
ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ' 'FB','HAEMATOLOGY -
FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC' 'FR','FREE TEXT', 'GT','GLUCOSE
TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ' 'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ
ΔΕΛΤΙΟ' 'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ' 'SE','SEMEN
ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ' 'SP','SPECIAL PATHOLOGY','SPECIAL
PATHOLOGY' 'ST','STOOL EXAMINATION
','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ' 'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL' 'UR','URINE
ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ' 'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ
ΝΕΡΟΥ' 'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'

This is an ASCII file stored using the Greek (1253) codepage which was read using a different codepage.
File.ReadAllText tries to detect whether the file is UTF16 or UTF8 by checking the BOM bytes and falls back to UTF8 by default. UTF8 is essentially the 7-bit ANSI codepage for single-byte text, which means that trying to read a nonUnicode, nonANSI file like this will result in garbled text.
To load a file using a specific encoding/codepage, just pass the encoding as the Encoding parametter, eg :
var enc = Encoding.GetEncoding(1253);
var text=File.ReadAllText(#"189.dat",enc);
Strings in .NET are Unicode, specifically UTF16. This means that text doesn't need any conversions. Its contents will be :
'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.'
'EH','Hb ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ'
'EP','PROTEIN ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ'
'FB','HAEMATOLOGY - FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC'
'FR','FREE TEXT',
'GT','GLUCOSE TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ'
'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ ΔΕΛΤΙΟ'
'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ'
'SE','SEMEN ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ'
'SP','SPECIAL PATHOLOGY','SPECIAL PATHOLOGY'
'ST','STOOL EXAMINATION ','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ'
'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL'
'UR','URINE ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ'
'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ ΝΕΡΟΥ'
'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'
UTF16 uses two bytes for every character. If a UTF16 file was opened in a hex browser, every other character would be a NUL (0x00). It's not UTF8 either - outside the 7-bit ANSI range each character uses two or more bytes that always have the high bit set. Instead of one garbled character there would be two at least.
File and stream methods that could be affected by encoding or culture in .NET always have an overload that accepts an Encoding or CultureInfo parameter.
Console
Writing the output to the Console may display in garbled text. The text isn't really converted, just displayed the wrong way.
While the console can display Unicode text it assumes that the system's codepage is used by default. In the past it couldn't even support UTF8 as a codepage - there was no such option in the settings. After all, the label for the system locale settings is Language used for non-Unicode programs.
The latest Windows 10 Insider releases offer UTF8 as the system codepage as a beta option.
To ensure Unicode text appears properly in the console one would have to set its encoding to UTF8, eg :
var text=File.ReadAllText(#"189.dat",enc);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(text);

I don't know what codepage this is, but it seems to be simply offset by some values. You can convert the source string to the target string by adding 11 to the first byte and 16 to the second byte:
var input = Encoding.Default.GetBytes("ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ");
for (var i = 0; i < input.Length; i++)
{
if (input[i] == 32) continue;
input[i++] += 11;
input[i] += 16;
}
var output = Encoding.UTF8.GetString(input);
Result: ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ
Not sure if this is a solution, but it may give you a hint.

Just it c#
HtmlWeb web = new HtmlWeb();
web.OverrideEncoding = Encoding.GetEncoding(65001);

Decoding Base64 string doesn't translate special chars

I have an XML file encoded as Base64, this file has embeded images in it.
My problem is when i try to decode it:
String sample is here
Here is my code:
UTF8 Version: This one can be decoded and loaded into XML but special chars doesn't decode properly (á,é,í,ó,ú,ñ,Ñ) all are decoded as ? symbol.
public XmlDocument GetXmlDocument(string Base64Text)
{
string DeCoded = string.Empty;
XmlDocument Datos = new XmlDocument();
byte[] Base64 = Convert.FromBase64String(Base64Text);
DeCoded = Encoding.UTF8.GetString(Base64);
Datos.LoadXml(DeCoded);
return Datos;
}
UTF7 Version: This one can be decoded and special chars are all decoded properly. If Has images can't be loaded to XML Document due embeded images doesn't decode properly.
public XmlDocument GetXmlDocument(string Base64Text)
{
string DeCoded = string.Empty;
XmlDocument Datos = new XmlDocument();
byte[] Base64 = Convert.FromBase64String(Base64Text);
DeCoded = Encoding.UTF7.GetString(Base64);
Datos.LoadXml(DeCoded);
return Datos;
}
EDIT 1: My encoded Base64 string came from VFP app it was encoded using
VFP Sintax
STRCONV(lcImage, 6)
In VFP help it says:
Converts character expressions between single-byte, double-byte, UNICODE, and locale-specific representations.
And 6 means:
6 Converts UNICODE (wide characters) to double-byte characters.
I'm trying with UTF7 and UTF8, just testing wich one could translate my string.
I just need my original XML with it's embeded images in .NET, If this is imposible with standard functions, How can I catch and translate ? symbol s to readable chars?

From your XML document:
<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
So, this works:
DeCoded = Encoding.GetEncoding(1252).GetString(Base64);

German character in Livelink

I want to extract file/folder and other item type that contains some German special characters in their name from Livelink server. Livelink server has encoding UTF-8.
Value is Test Dokument äöüß
var bytes = new List<byte>(value.Length);
foreach (var c in value)
bytes.Add((byte)c);
var retValue = Encoding.UTF8.GetString(bytes.ToArray());
the above code sample fixes s encoding problem in some character but ß is seen as ? character in Latin( ISO 8859-2) encoding. can anybody help me fix the problem.
Thanks in advance

You have to set UTF-8 encoding on the LL session:
LLSTATUS llSessionStatus = LL_SessionAllocEx( &llSession, server, port, "", login, password, NULL);
LLSTATUS status = LL_SetCodePage( llSession , LL_TRUE, LL_TRUE, (LLLONG) 65001 );
65001 - is the codepage for UTF-8

It doesn't make sense to store ISO-8859-1 in a C# string, since it stores Unicode characters.
What really makes sense is to convert the Unicode string into a byte[] representing the string in ISO-8859-1.
var test ="äöüßÄÖÜ";
var iso = Encoding.GetEncoding("ISO-8859-1");
var bytes = iso.GetBytes(test);
File.WriteAllBytes("Sample file ISO-8859-1.txt", bytes);
Try this and you'll see that the text file is correctly encoded.
You can even check with a hex editor or with the debugger that the ß is correctly encoded as 0xDF (see table on wikipedia)

Encoding not converting

An ASP.NET page (ashx) receives a GET request with a UTF8 string. It reads a SqlServer database with Windows-1255 data.
I can't seem to get them to work together. I've used information gathered on SO (mainly Convert a string's character encoding from windows-1252 to utf-8) as well as msdn on the subject.
When I run anything through the functions below - it always ends up the same as it started - not converted at all.
Is something done wrong?
EDIT
What I'm specifically trying to do (getData returns a Dictionary<int, string>):
getData().Where(a => a.Value.Contains(context.Request.QueryString["q"]))
Result is empty, unless I send a "neutral" character such as "'" or ",".
CODE
string windows1255FromUTF8(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(p);
byte[] winBytes = Encoding.Convert(utf8, win, utfBytes);
return win.GetString(winBytes);
}
string UTF8FromWindows1255(string p)
{
Encoding win = Encoding.GetEncoding(1255);
Encoding utf8 = Encoding.UTF8;
byte[] winBytes = win.GetBytes(p);
byte[] utfBytes = Encoding.Convert(win, utf8, winBytes);
return utf8.GetString(utfBytes);
}

There is nothing wrong with the functions, they are simply useless.
What the functions do is to encode the strings into bytes, convert the data from one encoding to another, then decode the bytes back to a string. Unless the string contains a character that is not possible to encode using the windows-1255 encoding, the returned value should be identical to the input.
Strings in .NET doesn't have an encoding. If you get a string from a source where the text was encoded using for example UTF-8, once it's decoded into a string it doesn't have that encoding any more. You don't have to do anyting to a string to use it when the destination has a specific encoding, whatever library you are using that takes the string will take care of the encoding.

For some reason this worked:
byte[] fromBytes = (fromEncoding.UTF8).GetBytes(myString);
string finalString = (Encoding.GetEncoding(1255)).GetString(fromBytes);
Switching encoding without the conversion...

C# Encoding.Converting Latin to Hebrew

I'm trying to fetch and parse an online excel document which is written in hebrew but unfortunately in a non-hebrew encoding.
As an example I'm trying to convert the following string: "âìéåï_1", which serves as the 1st sheet name to hebrew using C# code, but I'm unable to do so.
I know the above is convertible, since when I open it up in NotePad++ and select Encoding/Character Sets/Hebrew/Windows 1255, I can see: "גליון_1" which is the correct hebrew representation of the above string.
I'm using the below code
string str = "âìéåï_1";
Encoding windows = Encoding.GetEncoding("Windows-1255");
Encoding ascii = Encoding.GetEncoding("Windows-1252");
byte[] asciiBytes = ascii.GetBytes(str);
byte[] windowsBytes = Encoding.Convert(ascii, windows, asciiBytes);
char[] windowsChars = new char[windows.GetCharCount(windowsBytes, 0, windowsBytes.Length)];
windows.GetChars(windowsBytes, 0, windowsBytes.Length, windowsChars, 0);
string windowsString = new string(windowsChars);
I assumed that the encoding of the origin string is Windows-1252 since when I paste it in NotePad++ and change the encoding to Windows-1252 the string remains the same...
I'm probably doing something wrong here, anyone know how to convert the above correctly?
Thanks,
Mikey

const string Str = "âìéåï_1";
Encoding latinEncoding = Encoding.GetEncoding("Windows-1252");
Encoding hebrewEncoding = Encoding.GetEncoding("Windows-1255");
byte[] latinBytes = latinEncoding.GetBytes(Str);
string hebrewString = hebrewEncoding.GetString(latinBytes);
hebrewString:
גליון_1
In your supplied example "Window-1252" is not actualy ASCII, it is extended ASCII, and for some reason Encoding.Convert with these two encodings cannot convert extended range ASCII, so all +127 characters are converted as 63 (i.e. ?). When "converting" from one extended ASCII character byte[] to another, I would expect the bytes to be the same, it is only when you convert them to a .Net unicode string I would expect them to be different. Not sure why Convert is converting +127 chars to '?'.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to properly decode accented characters for display - c#

My raw input file text file contains a string: Caf&eacute (Should be Café) The text file is a UTF8 file. The output lets say is to another text file, so its not necessarily for a web page. What C# method(s) can i use to output the correct format, Café? Apparently a common problem?

Have you tried System.Web.HttpUtility.HtmlDecode("Café")? it returns 538M results

This is HTML encoded text. You need to decode it: string decoded = HttpUtility.HtmlDecode(text); UPDATE: french symbol "é" has HTML code "é" so, you need to fix your input string.

Use HttpUtility.HtmlDecode. Example: class Program { static void Main() { XDocument doc = new XDocument(new XElement("test", HttpUtility.HtmlDecode("café"))); Console.WriteLine(doc); Console.ReadKey(); } }

Related

Converting unknown characters to Greek characters

Decoding Base64 string doesn't translate special chars

German character in Livelink

Encoding not converting

C# Encoding.Converting Latin to Hebrew

Categories

Resources