I have a problem with Unicode characters. When I try to encode surrogate characters (between D800 and DFFF) they are encoded as FFFD. I have tried the Encoding.Unicode.GetString() method and the Decoder.GetChars() method, and neither works with these surrogate characters.
I am using the following code:
Encoding code:
string unicodeChars = "a\uD800\uDA65";
FileStream stream = new FileStream(@"unicode_encoding.txt", FileMode.Create, FileAccess.Write);
byte[] buffer = Encoding.Unicode.GetBytes(unicodeChars);
stream.Write(buffer, 0, buffer.Length);
stream.Close();
Decoding code:
string decodedUnicodeChars;
FileStream stream2 = new FileStream(@"unicode_encoding.txt", FileMode.Open, FileAccess.Read);
StreamReader reader = new StreamReader(stream2, Encoding.Unicode);
decodedUnicodeChars = reader.ReadToEnd();
foreach (char c in decodedUnicodeChars)
{
    Console.Write("{0} ", Convert.ToInt32(c).ToString("X4"));
}
Output is:
0061 FFFD FFFD
string unicodeChars = "a\uD800\uDA65";
This is a case of GIGO: Garbage In, Garbage Out. The surrogate pair is not valid; the second character must be in the range \uDC00..\uDFFF, which is why the encoder substitutes the replacement character U+FFFD.
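For contrast, here is a minimal sketch showing that a valid pair round-trips without producing FFFD. U+10437 is an arbitrary supplementary code point chosen for illustration; char.ConvertFromUtf32 produces its surrogate pair \uD801\uDC37 for you:
string valid = "a" + char.ConvertFromUtf32(0x10437); // "a\uD801\uDC37"
byte[] bytes = Encoding.Unicode.GetBytes(valid);
string roundTripped = Encoding.Unicode.GetString(bytes);
foreach (char c in roundTripped)
{
    Console.Write("{0} ", Convert.ToInt32(c).ToString("X4")); // 0061 D801 DC37
}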
Related
I have a file which contains the following characters:
ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ
I am trying to convert that to Greek words and the result should be:
ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ
The file that the above value is stored in is in Unicode format.
I have tried applying every available encoding, but the conversion never succeeds.
private void Convert()
{
    string textFilePhysicalPath = @"C:\Users\Nec\Desktop\a.txt";
    string contents = File.ReadAllText(textFilePhysicalPath);
    List<string> sLines = new List<string>();
    // For every encoding, get the property values.
    foreach (EncodingInfo ei in Encoding.GetEncodings())
    {
        Encoding iso = ei.GetEncoding();
        Encoding utfx = Encoding.Unicode;
        byte[] utfBytes = utfx.GetBytes(contents);
        byte[] isoBytes = Encoding.Convert(utfx, iso, utfBytes);
        string msg = iso.GetString(isoBytes);
        sLines.Add(ei.Name + " " + msg);
    }
    using (StreamWriter file = new StreamWriter(@"C:\Users\Nec\Desktop\result.txt"))
    {
        foreach (var line in sLines)
            file.WriteLine(line);
    }
}
A website that converts it correctly is http://www.online-decoder.com/el, but even when I use ISO-8859-1 to ISO-8859-7 it still doesn't work in .NET.
This code converts the C# string, which is UTF-16, to an 8-bit representation using the common ISO-8859-1 codepage. Then it converts it back to UTF-16 using the Greek codepage windows-1253. The result is ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ, as you want.
string erroneousString = "ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ";
byte[] asIso88591Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(erroneousString);
string asGreekString = Encoding.GetEncoding("windows-1253").GetString(asIso88591Bytes);
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(asGreekString);
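The same repair can be wrapped in a small helper. This is only a sketch: it assumes the text was produced by decoding windows-1253 bytes as ISO-8859-1 (which maps all 256 byte values, so the round trip is lossless), and the codepage names are parameters you would adjust for your data:
static string RepairMojibake(string garbled, string wrongCodepage, string rightCodepage)
{
    // Recover the original bytes by re-encoding with the codepage that was
    // wrongly used to decode them, then decode with the right one.
    byte[] originalBytes = Encoding.GetEncoding(wrongCodepage).GetBytes(garbled);
    return Encoding.GetEncoding(rightCodepage).GetString(originalBytes);
}
// Usage: RepairMojibake(erroneousString, "ISO-8859-1", "windows-1253")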
Edit: Since your file is encoded in an 8-bit format, you need to specify the codepage when reading it. Use this:
string fileContents = File.ReadAllText("189.dat", Encoding.GetEncoding("windows-1253"));
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(fileContents);
That reads the content as
'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.'
'EH','Hb ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ'
'EP','PROTEIN ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ'
'FB','HAEMATOLOGY - FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC'
'FR','FREE TEXT',
'GT','GLUCOSE TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ'
'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ ΔΕΛΤΙΟ'
'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ'
'SE','SEMEN ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ'
'SP','SPECIAL PATHOLOGY','SPECIAL PATHOLOGY'
'ST','STOOL EXAMINATION ','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ'
'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL'
'UR','URINE ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ'
'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ ΝΕΡΟΥ'
'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'
This is a single-byte text file stored using the Greek (1253) codepage which was read using a different codepage.
File.ReadAllText tries to detect whether the file is UTF16 or UTF8 by checking for BOM bytes, and falls back to UTF8 by default. UTF8 only matches the single-byte codepages in the 7-bit ASCII range, which means that trying to read a non-Unicode, non-ASCII file like this will produce garbled text.
To load a file using a specific encoding/codepage, just pass the encoding as the Encoding parameter, e.g.:
var enc = Encoding.GetEncoding(1253);
var text = File.ReadAllText(@"189.dat", enc);
Strings in .NET are Unicode, specifically UTF16. This means that text doesn't need any conversions. Its contents will be:
'CS','C.S.F. EXAMINATION','ΕΞΕΤΑΣΗ Ε.Ν.Υ.'
'EH','Hb ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ'
'EP','PROTEIN ELECTROPHORESIS','ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΠΡΩΤΕΙΝΩΝ'
'FB','HAEMATOLOGY - FBC','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΑΙΜΑΤΟΣ - FBC'
'FR','FREE TEXT',
'GT','GLUCOSE TOLERANCE TEST','ΔΟΚΙΜΑΣΙΑ ΑΝΟΧΗΣ ΓΛΥΚΟΖΗΣ'
'MI','MICROBIOLOGY','ΜΙΚΡΟΒΙΟΛΟΓΙΑ'
'NO','NORMAL FORM','ΚΑΝΟΝΙΚΟ ΔΕΛΤΙΟ'
'RE','RENAL CALCULUS','ΧΗΜΙΚΗ ΑΝΑΛΥΣΗ ΟΥΡΟΛΙΘΟΥ'
'SE','SEMEN ANALYSIS','ΣΠΕΡΜΟΔΙΑΓΡΑΜΜΑ'
'SP','SPECIAL PATHOLOGY','SPECIAL PATHOLOGY'
'ST','STOOL EXAMINATION ','ΕΞΕΤΑΣΗ ΚΟΠΡΑΝΩΝ'
'SW','SEMEN WASH','SEMEN WASH'
'TH','THROMBOPHILIA PANEL','THROMBOPHILIA PANEL'
'UR','URINE ANALYSIS','ΓΕΝΙΚΗ ΕΞΕΤΑΣΗ ΟΥΡΩΝ'
'WA','WATER CULTURE REPORT','ΑΝΑΛΥΣΗ ΝΕΡΟΥ'
'WI','WIDAL ','ΑΝΟΣΟΒΙΟΛΟΓΙΑ'
UTF16 uses two bytes for every character, so if a UTF16 file is opened in a hex viewer, every other byte of Latin text is a NUL (0x00). It's not UTF8 either: outside the 7-bit ASCII range each character uses two or more bytes that always have the high bit set, so instead of one garbled character there would be at least two.
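A short sketch makes those byte patterns visible; the sample string is an arbitrary mix of one ASCII and one Greek character:
string sample = "aΗ"; // 'a' plus Greek capital eta, U+0397
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(sample)));
// 61-00-97-03 : UTF16, two bytes per character, with NULs next to the ASCII byte
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(sample)));
// 61-CE-97 : UTF8, 'a' is one byte, eta is two bytes with the high bit set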
File and stream methods that could be affected by encoding or culture in .NET always have an overload that accepts an Encoding or CultureInfo parameter.
Console
Writing the output to the console may produce garbled text. The text isn't really converted, just displayed the wrong way.
While the console can display Unicode text, it assumes the system's codepage by default. In the past it couldn't even use UTF8 as a codepage - there was no such option in the settings. After all, the label on the system locale setting reads Language used for non-Unicode programs.
The latest Windows 10 Insider releases offer UTF8 as the system codepage as a beta option.
To ensure Unicode text appears properly in the console, set its output encoding to UTF8, e.g.:
var text = File.ReadAllText(@"189.dat", enc);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(text);
I don't know what codepage this is, but it seems to be simply offset by some values. You can convert the source string to the target string by adding 11 to the first byte and 16 to the second byte:
var input = Encoding.Default.GetBytes("ÇËÅÊÔÑÏÖÏÑÇÓÇ ÁÉÌÏÓÖÁÉÑÉÍÇÓ");
for (var i = 0; i < input.Length; i++)
{
    if (input[i] == 32) continue;
    input[i++] += 11;
    input[i] += 16;
}
var output = Encoding.UTF8.GetString(input);
Result: ΗΛΕΚΤΡΟΦΟΡΗΣΗ ΑΙΜΟΣΦΑΙΡΙΝΗΣ
Not sure if this is a solution, but it may give you a hint.
In C# (this snippet uses the HtmlWeb class from the HtmlAgilityPack library; 65001 is the UTF-8 codepage):
HtmlWeb web = new HtmlWeb();
web.OverrideEncoding = Encoding.GetEncoding(65001);
I am trying to write and read a binary file using the C# BinaryWriter and BinaryReader classes.
When I store a string in the file, it is stored properly, but when I try to read it back I get a string with a '\0' character in every alternate position.
Here is the code:
public void writeBinary(BinaryWriter bw)
{
    bw.Write("Hello");
}
public void readBinary(BinaryReader br)
{
    string s = br.ReadString();
}
Here s gets the value "H\0e\0l\0l\0o\0".
You are using different encodings when reading and writing the file.
You are using UTF-16 when writing the file, so each character ends up as a 16-bit code unit, i.e. two bytes.
You are using UTF-8 or one of the 8-bit encodings when reading the file, so each byte ends up as one character.
Pick one encoding and use it for both reading and writing the file, as in the sketch below.
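A minimal sketch of what that looks like; the file name demo.bin is arbitrary, and UTF8 is used here, but any single encoding passed to both constructors will do:
using (var fs = new FileStream("demo.bin", FileMode.Create))
using (var bw = new BinaryWriter(fs, Encoding.UTF8))
{
    bw.Write("Hello"); // writes a length prefix, then the UTF-8 bytes
}
using (var fs = new FileStream("demo.bin", FileMode.Open))
using (var br = new BinaryReader(fs, Encoding.UTF8))
{
    Console.WriteLine(br.ReadString()); // "Hello", with no embedded '\0'
}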
An ASP.NET page (ashx) receives a GET request with a UTF8 string. It reads a SqlServer database with Windows-1255 data.
I can't seem to get them to work together. I've used information gathered on SO (mainly Convert a string's character encoding from windows-1252 to utf-8) as well as msdn on the subject.
When I run anything through the functions below, it always ends up the same as it started - not converted at all.
Am I doing something wrong?
EDIT
What I'm specifically trying to do (getData returns a Dictionary<int, string>):
getData().Where(a => a.Value.Contains(context.Request.QueryString["q"]))
Result is empty, unless I send a "neutral" character such as "'" or ",".
CODE
string windows1255FromUTF8(string p)
{
    Encoding win = Encoding.GetEncoding(1255);
    Encoding utf8 = Encoding.UTF8;
    byte[] utfBytes = utf8.GetBytes(p);
    byte[] winBytes = Encoding.Convert(utf8, win, utfBytes);
    return win.GetString(winBytes);
}

string UTF8FromWindows1255(string p)
{
    Encoding win = Encoding.GetEncoding(1255);
    Encoding utf8 = Encoding.UTF8;
    byte[] winBytes = win.GetBytes(p);
    byte[] utfBytes = Encoding.Convert(win, utf8, winBytes);
    return utf8.GetString(utfBytes);
}
There is nothing wrong with the functions; they are simply useless.
What the functions do is encode the string into bytes, convert the bytes from one encoding to the other, then decode the bytes back into a string. Unless the string contains a character that cannot be encoded in windows-1255, the returned value is identical to the input.
Strings in .NET don't have an encoding. If you get a string from a source where the text was encoded as, say, UTF-8, once it has been decoded into a string it no longer carries that encoding. You don't have to do anything to a string before using it where the destination has a specific encoding; whatever library you are using that takes the string will take care of the encoding.
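You can see this by round-tripping a value yourself. In this sketch (the Hebrew sample string is arbitrary, but every character in it exists in windows-1255) the output is True, i.e. the conversion changed nothing:
string input = "שלום";
byte[] winBytes = Encoding.GetEncoding(1255).GetBytes(input);
byte[] utfBytes = Encoding.Convert(Encoding.GetEncoding(1255), Encoding.UTF8, winBytes);
string output = Encoding.UTF8.GetString(utfBytes);
Console.WriteLine(output == input); // True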
For some reason this worked:
byte[] fromBytes = Encoding.UTF8.GetBytes(myString);
string finalString = Encoding.GetEncoding(1255).GetString(fromBytes);
Switching encoding without the conversion...
I am reading a simple text file, which contains a single line, using the FileStream class. But FileStream.Read seems to prepend some junk characters at the beginning.
Below is the code:
using (var _fs = File.Open(_idFilePath, FileMode.Open, FileAccess.ReadWrite, FileShare.Read))
{
    byte[] b = new byte[_fs.Length];
    UTF8Encoding temp = new UTF8Encoding(true);
    while (_fs.Read(b, 0, b.Length) > 0)
    {
        Console.WriteLine(temp.GetString(b));
        Console.WriteLine(Encoding.ASCII.GetString(b));
    }
}
For example, my data in the text file is just "sample", but the above code returns
"?sample" and
"???sample"
What's the reason? Is it a start-of-file indicator? Is there a way to read only my actual content?
The byte order mark (BOM) consists of the Unicode character 0xFEFF and is used to mark a file with the encoding used for it.
So if you correctly decode the file as UTF8 you get that character as the first char of your string. If you incorrectly decode it as ANSI you get three chars, since the UTF8 encoding of 0xFEFF is the byte sequence EF BB BF, which is 3 bytes.
But your whole code can be replaced with
File.ReadAllText(fileName, Encoding.UTF8)
and that should remove the BOM too. Or leave out the encoding parameter and let the function autodetect the encoding (for which it uses the BOM).
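If you do need to work with the raw bytes yourself, the BOM bytes are exposed as the encoding's preamble, so you can strip them manually. A sketch, assuming the file is UTF8:
byte[] data = File.ReadAllBytes(fileName);
byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
int offset = data.Length >= bom.Length &&
             data[0] == bom[0] && data[1] == bom[1] && data[2] == bom[2]
             ? bom.Length : 0;
string text = Encoding.UTF8.GetString(data, offset, data.Length - offset);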
Could be the BOM, a.k.a. the byte order mark.
You are reading the BOM from the stream. If you are reading text, try using a StreamReader, which will handle this automatically.
Try instead:
using (StreamReader sr = new StreamReader(File.Open(path, FileMode.Open), Encoding.UTF8))
It will strip the BOM for you.
I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?
I have tried to use a StreamReader with ASCIIEncoding, and then converting the ASCII string to UTF-8 by instantiating an ASCII encoding and a UTF-8 encoding and then using Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are rendered as question marks.
What step am I missing?
You need to use the proper Encoding object. ASCII is just what its name says: it only supports 7-bit ASCII characters, so anything outside that range becomes a question mark. If what you want to do is convert files, the following is likely easier than dealing with the byte arrays directly.
using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
       Encoding.GetEncoding("iso-8859-1")))
using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
       outFileName, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}
However, if you want to have the byte arrays yourself, it's easy enough to do with Encoding.Convert.
byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
Encoding.UTF8, data);
It's important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.
In the interest of fully exploring the issue, something like this would work:
using (System.IO.FileStream input = new System.IO.FileStream(fileName,
System.IO.FileMode.Open,
System.IO.FileAccess.Read))
{
byte[] buffer = new byte[input.Length];
int readLength = 0;
while (readLength < buffer.Length)
readLength += input.Read(buffer, readLength, buffer.Length - readLength);
byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
Encoding.UTF8, buffer);
using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
System.IO.FileMode.Create,
System.IO.FileAccess.Write))
{
output.Write(converted, 0, converted.Length);
}
}
In this example, the buffer variable gets filled with the actual data in the file as a byte[], so no conversion is done. Encoding.Convert specifies a source and destination encoding, then stores the converted bytes in the variable named...converted. This is then written to the output file directly.
Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you're doing, but the latter example should give you more of a hint as to what's actually going on.
If the files are relatively small (say, ~10 megabytes), you'll only need two lines of code:
string txt = System.IO.File.ReadAllText(inpPath, Encoding.GetEncoding("iso-8859-1"));
System.IO.File.WriteAllText(outPath, txt);
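One caveat: the two-argument WriteAllText writes UTF-8 without a byte order mark. If the consumer of the file expects a BOM, pass Encoding.UTF8 explicitly, which makes the writer emit one:
// Writes a UTF-8 BOM (EF BB BF) followed by the converted text.
System.IO.File.WriteAllText(outPath, txt, Encoding.UTF8);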