Converting codepage input to display as Unicode in C#

I searched and found some partial answers that work in some instances, but nothing that works in all. The scenario: I am sent an XML file over the network. The file names its encoding, e.g. encoding="Windows-932" or encoding="Windows-1254" and so on. I need to parse the file, get certain info, convert that info to Unicode chars, and send it on to another machine that can only read Unicode.
So if the encoding is
1251, it is Cyrillic, so char E1 (decimal 225) = Unicode 0431.
1254, it is Turkish, so char E1 (decimal 225) = Unicode 00E1.
1253, it is Greek, so char E1 (decimal 225) = Unicode 03B1.
So far I thought I could have a lookup table that looked at the encoding and then just put the Unicode page in front of the E1, BUT that will not work, as in Unicode they do not have the same page position, as you see above.
To further complicate things, I can also get an encoding such as Japanese (Shift-JIS), which is code page 932. This does not get all the Japanese characters from the same page, and almost every character on those pages comes from a different Unicode page.
So the question is: how, in C#, do I convert the XML data to Unicode and get it correct every time? Any ideas?

Encoding.GetEncoding("windows-1253").GetString(new byte[] {0xE1}) // -> "\u03B1" α
Encoding.GetEncoding("windows-1254").GetString(new byte[] {0xE1}) // -> "\u00E1" á
Encoding.GetEncoding("windows-1251").GetString(new byte[] {0xE1}) // -> "\u0431" б
But for an XML file you should be using an existing XML parser (e.g. XmlReader or XDocument.Load), which will deal with encodings for you.
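For instance, a minimal sketch of letting the parser pick the encoding up from the XML declaration (on .NET Core/.NET 5+ you would first need to register CodePagesEncodingProvider for legacy code pages; on .NET Framework they are available out of the box):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Build a sample document declaring a legacy code page
        // (windows-1253 = Greek). Byte 0xE1 is "α" in that code page.
        byte[] fileBytes;
        using (var ms = new MemoryStream())
        {
            byte[] header = Encoding.ASCII.GetBytes(
                "<?xml version=\"1.0\" encoding=\"windows-1253\"?><name>");
            ms.Write(header, 0, header.Length);
            ms.WriteByte(0xE1);
            byte[] footer = Encoding.ASCII.GetBytes("</name>");
            ms.Write(footer, 0, footer.Length);
            fileBytes = ms.ToArray();
        }

        // XDocument.Load reads the encoding attribute from the XML
        // declaration and decodes the rest of the stream with it.
        // .NET strings are already Unicode (UTF-16), so nothing more
        // needs to be converted afterwards.
        XDocument doc = XDocument.Load(new MemoryStream(fileBytes));
        Console.WriteLine(doc.Root.Value); // α (U+03B1)
    }
}
```

The same bytes fed through a windows-1251 declaration would come out as б (U+0431), which is exactly the per-code-page lookup the question was trying to build by hand.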

Related

What encoding should be used to create an MS-DOS txt file using C#? (UTF8Encoding vs Encoding)

I am trying to create a flat file for a legacy system, and they mandate that the data be presented in the text encoding of an MS-DOS .txt file (Text Document - MS-DOS Format, CP_OEM). I am a bit confused between files generated by using the UTF8Encoding class in C# (.NET 4.0 framework), which I think produces a file in the default txt encoding (CP_ACP).
I think the encoding names CP_ACP, Windows and ANSI refer to the same thing; the Windows default is ANSI, and it will omit any Unicode character information.
If I use the UTF8Encoding class in the C# library to create a text file (as below), is it going to be in the MS-DOS txt file format?
byte[] title = new UTF8Encoding(true).GetBytes("New Text File");
As per the answer supplied, it is evident that UTF-8 is NOT equivalent to the MS-DOS txt format, and one should use the Encoding.GetEncoding(850) method to get the right encoding.
I read the following posts to check on my information but nothing conclusive yet.
https://blogs.msdn.microsoft.com/oldnewthing/20120220-00?p=8273
https://blog.mh-nexus.de/2015/01/character-encoding-confusion
https://blogs.msdn.microsoft.com/oldnewthing/20090115-00?p=19483
Finally, the conclusion is to go with Encoding.GetEncoding(850) when creating a byte array to be converted back to the actual file (note: I am using a byte array as I can leverage existing middleware).
You can use the File.ReadXY(String, Encoding) and File.WriteXY(String, String[], Encoding) methods, where XY is AllLines, Lines or AllText, working with string[], IEnumerable<string> and string respectively.
MS-DOS uses different code pages. Probably code page 850 "Western European / Latin-1" or code page 437 "OEM-US / OEM / PC-8 / DOS Latin US" (as @HansPassant suggests) will be okay. If you are not sure which code page you need, create example files containing letters like ä, ö, ü, é, è, ê, ç, à or Greek letters with the legacy system and see whether they work. If you don't use such letters or other special characters, then the code page is not very critical.
File.WriteAllText(path, "Hello World", Encoding.GetEncoding(850));
The character codes from 0 to 127 (7-bit) are the same for all MS-DOS code pages, for ANSI and UTF-8. UTF files are sometimes introduced with a BOM (byte order mark).
MS-DOS knows only 8-bit characters. The codes 128 to 255 differ for the different national code pages.
See: File Class, Encoding Class and Wikipedia: Code Page.
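To see the 128-255 difference concretely, here is a small sketch comparing the bytes the OEM code pages, the ANSI code page and UTF-8 produce for one accented letter (on .NET Core/.NET 5+ the commented-out provider registration would be required first; on .NET Framework 4.x it is not):

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ for legacy OEM/ANSI code pages:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string text = "é";
        // OEM code pages 850 and 437 happen to agree on "é",
        // but ANSI 1252 and UTF-8 encode it differently.
        Console.WriteLine(BitConverter.ToString(Encoding.GetEncoding(850).GetBytes(text)));  // 82
        Console.WriteLine(BitConverter.ToString(Encoding.GetEncoding(437).GetBytes(text)));  // 82
        Console.WriteLine(BitConverter.ToString(Encoding.GetEncoding(1252).GetBytes(text))); // E9
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(text)));              // C3-A9
    }
}
```

A file full of such bytes written with UTF8Encoding would therefore show up as mojibake on the DOS side, which is why Encoding.GetEncoding(850) is the right tool here.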

Reading file containing Arabic language

I have a file that contains Arabic and English words, letters and numbers. I'm trying to print the file using the code from Here. When I open the file in Notepad, I see funny, unprintable chars. When I save the same file via Save As... in Notepad and choose Unicode, the file is displayed properly (I see Arabic letters, etc.).
When I open the same file in notepad++ the only option that displays the file correctly is
Menu->Encoding->Character set->Arabic
With C#, I'm trying to read the file line by line and print it using
ev.Graphics.DrawString(line, printFont, Brushes.Red, leftMargin, yPos, _sf);
where line is the line from the file. When the file is saved in the right encoding, everything prints out fine. But when we have encoding issues, we get a bunch of diamonds, question marks, etc.
Here are a few ways (from various sources) that I tried opening the file with right encoding (please let me know if one of them should work and I'll try again):
Attempt 1
var arabic = Encoding.GetEncoding(1252);
var bytes = arabic.GetBytes(line);
line = arabic.GetString(bytes);
Attempt 2
streamToPrint = new StreamReader(this.filepath,System.Text.Encoding.UTF8,true);
Attempt 3
byte[] utf8Bytes = Encoding.UTF8.GetBytes(line);
line = Encoding.Unicode.GetString(utf8Bytes);
None of them work. Can someone kindly show me what changes I have to make to the code from Here so that it will read the file and print it?
var arabic = Encoding.GetEncoding(1252);
That's not it; 1252 is the Windows code page for Western Europe and the Americas. Your next guess is 1256, the default Windows code page for Arabic. Your guesses after that should be the legacy MS-DOS code pages, 864 and 720.
This kind of misery ought to inspire you to contact the company or programmer that created the file. It is high time they updated. The best argument you can give them is that you are available now and probably won't be whenever they need to update.
You need to look at the BOM (Byte Order Mark, U+FEFF), which should be the first Unicode character in the file. If it's not found, it's either plain ASCII, UTF-8 without a byte order mark, or something odd.
Read the first several octets of the file. The BOM is encoded differently for different encodings:
hex EF BB BF indicates UTF-8. HOWEVER, for UTF-8 the BOM is optional, it being meaningless, what with UTF-8 being an 8-bit encoding and all. If it's not found, that's no guarantee the file is UTF-8, though. It could be plain ASCII or encoded with some other non-Unicode DBCS scheme.
hex FE FF indicates UTF-16, big-endian (network byte order).
hex FF FE indicates UTF-16, little-endian.
hex 00 00 FE FF indicates UTF-32, big-endian (network byte order).
hex FF FE 00 00 indicates UTF-32, little endian.
etc. See http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding for more.
You might notice that this isn't fool-proof. A little-endian UTF-16 encoded file would be hard to differentiate from a little-endian UTF-32 encoded file... if its first non-BOM Unicode character were an ASCII NUL (U+0000).
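In C# you rarely need to inspect these bytes yourself: StreamReader will detect the BOMs listed above if asked. A minimal sketch, where the file path is a placeholder and code page 1256 (Windows Arabic) is assumed as the fallback for BOM-less files:

```csharp
using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        string path = "arabic.txt"; // hypothetical file

        // When a BOM is present, StreamReader switches to the encoding it
        // indicates; when none is found, it keeps the encoding passed in.
        // So pass the most likely legacy code page as the fallback.
        using (var reader = new StreamReader(path,
            Encoding.GetEncoding(1256),
            detectEncodingFromByteOrderMarks: true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Console.WriteLine(line);
            }
        }
    }
}
```

Each line read this way is already a Unicode string, so it can be passed straight to Graphics.DrawString.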

German character ß encoding in Livelink using C#

I have a folder name that contains German special characters such as äÄéöÖüß. The following screenshot displays the contents of the Livelink server.
I want to extract the folder from the Livelink server using C#.
value is obtained from the LL server:
var bytes = new List<byte>(value.Length);
foreach (var c in value)
{
bytes.Add((byte)c);
}
var result = Encoding.UTF8.GetString(bytes.ToArray());
Finally, the result is äÄéöÖü�x, where ß is seen as the box character '�x'. All the other characters present in the folder name are decoded properly; only the ß character is not.
I am just wondering why the same code works for all other German special characters but not for ß.
Could anybody help to fix this problem in C#?
Thanks in advance.
Go to the admin panel of the server (Livelink/livelink.exe?func=admin.sysvars), set Character Set: UTF-8, and change the code section as follows:
byte[] bytes = Encoding.Default.GetBytes(value);
var retValue = Encoding.UTF8.GetString(bytes);
It works fine.
You guessed your encoding to be UTF-8 and it obviously is not. You will need to find out what encoding the byte stream really represents and use that instead. We cannot help you with that; you will have to ask the sender of said bytes.
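For illustration, here is a sketch of the general repair when UTF-8 bytes have been mis-decoded through a single-byte code page (Windows-1252 is an assumption here; the real culprit depends on the server's configuration). Re-encoding the garbled string with the code page that produced it recovers the original bytes, which can then be decoded correctly:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        string original = "äÄéöÖüß";

        // Simulate the damage: UTF-8 bytes wrongly decoded as Windows-1252.
        // ß (UTF-8 bytes C3 9F) shows up as the two characters "ÃŸ".
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        string garbled = Encoding.GetEncoding(1252).GetString(utf8Bytes);

        // The repair: reverse the wrong decode to get the raw bytes back,
        // then decode them with the encoding that was actually used.
        byte[] recovered = Encoding.GetEncoding(1252).GetBytes(garbled);
        string fixedText = Encoding.UTF8.GetString(recovered);

        Console.WriteLine(fixedText == original); // True
    }
}
```

This only works when the wrong decode was lossless, which is why finding out the real source encoding, rather than guessing, is the robust fix.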

C# UNICODE to ANSI conversion

I need your help concerning something which disturbs me when working with UNICODE encoding in .NET Framework ...
I have to interface with some customer data systems which are non-Unicode applications, and those customers have worldwide companies (Chinese, Korean, Russian, ...). So they have to provide me an 8-bit ASCII file, which will be encoded with their Windows code page.
So, if a Greek customer sends me a text file containing 'Σ' (the sigma letter, '\u03A3') in a product name, I will get the equivalent letter corresponding to ANSI code point 211, represented in my own code page. My computer is a French Windows, which means the code page is Windows-1252, so in its place I will have 'Ó' in this text file... OK.
I know this customer is a Greek one, so I can read his file by forcing the windows-1253 code page in my import parameters.
/// <summary>
/// Convert a string value read with a specific code page to Unicode encoding
/// </summary>
/// <param name="value">The string read from the import file</param>
/// <param name="codePage">The Windows code page the file was encoded with</param>
/// <returns></returns>
public static string ToUnicode(string value, int codePage)
{
    Encoding windows = Encoding.Default;
    Encoding unicode = Encoding.Unicode;
    Encoding sp = Encoding.GetEncoding(codePage);
    if (sp != null && !String.IsNullOrEmpty(value))
    {
        // First get bytes in the Windows default encoding
        byte[] wbytes = windows.GetBytes(value);
        // Check if the code page to use differs from the current Windows one
        if (windows.CodePage != sp.CodePage)
        {
            // Convert to Unicode using the specified code page
            byte[] ubytes = Encoding.Convert(sp, unicode, wbytes);
            return unicode.GetString(ubytes);
        }
        else
        {
            // Directly convert to Unicode using the Windows code page
            byte[] ubytes = Encoding.Convert(windows, unicode, wbytes);
            return unicode.GetString(ubytes);
        }
    }
    else
    {
        return value;
    }
}
Well in the end I got 'Σ' in my application and I am able to save this into my SQL Server database. Now my application has to perform some complex computations, and then I have to give back this file to the customer with an automatic export...
So my problem is that I have to perform a UNICODE => ANSI conversion?! But this is not as simple as I thought at the beginning...
I don't want to save the code page used during import, so my first idea was to convert UNICODE to windows-1252, and then automatically send the file to the customers. They will read the exported text file with their own code page so this idea was interesting for me.
But the problem is that the conversion in this way has a strange behaviour... Here are two different examples:
1st example (я)
char ya = '\u042F';
string strYa = Char.ConvertFromUtf32(ya);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1251 = System.Text.Encoding.GetEncoding(1251);
string strYa1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strYa)));
string strYa1251 = ansi1251.GetString(System.Text.Encoding.Convert(unicode, ansi1251, unicode.GetBytes(strYa)));
So strYa1252 contains '?', whereas strYa1251 contains the valid char 'я'. So it seems it is impossible to convert to ANSI if the right code page is not indicated to the Convert() function... So nothing in the Unicode Encoding class helps the user get equivalences between ANSI and Unicode code points? :\
2nd example (Σ)
char sigma = '\u03A3';
string strSigma = Char.ConvertFromUtf32(sigma);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1253 = System.Text.Encoding.GetEncoding(1253);
string strSigma1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strSigma)));
string strSigma1253 = ansi1253.GetString(System.Text.Encoding.Convert(unicode, ansi1253, unicode.GetBytes(strSigma)));
At this point I have the correct 'Σ' in the strSigma1253 string, but I also have 'S' for strSigma1252. As indicated at the beginning, I should have 'Ó' if an ANSI code had been found, or '?' if the character had not been found, but not 'S'. Why?
Yes, of course, a linguist could say that 'S' is equivalent to the Greek sigma character because they sound the same in both alphabets, but they don't have the same ANSI code!
So how does the Convert() function in the .NET Framework manage this kind of equivalence?
And does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?
I should have ...'?' if the character has not been found, but not 'S'. Why?
This is known as 'best-fit' encoding, and in most cases it's a bad thing. When Windows can't encode a character to the target code page (because Σ does not exist in code page 1252), it makes best efforts to map the character to something a bit like it. This can mean losing the diacritical marks (ë→e), or mapping to a cognate (Σ→S), a character that's related (≤→=), a character that's unrelated but looks a bit similar (∞→8), or whatever other madcap replacement seemed like a good idea at the time but turns out to be culturally or mathematically offensive in practice.
You can see the tables for cp1252, including that Sigma mapping, here.
Apart from being a silent mangling of dubious usefulness, it also has some quite bad security implications. You should be able to stop it happening by setting EncoderFallback to ReplacementFallback or ExceptionFallback.
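A sketch of turning best-fit off, using the GetEncoding overload that takes explicit encoder and decoder fallbacks:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // With the default best-fit fallback, Σ silently becomes "S".
        Encoding bestFit = Encoding.GetEncoding(1252);
        Console.WriteLine(bestFit.GetString(bestFit.GetBytes("Σ"))); // S

        // Replacement fallback: unmappable characters become "?" instead.
        Encoding strict = Encoding.GetEncoding(1252,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        Console.WriteLine(strict.GetString(strict.GetBytes("Σ"))); // ?

        // Exception fallback: unmappable characters throw instead of mangling,
        // which lets you detect and report the problem.
        Encoding throwing = Encoding.GetEncoding(1252,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            throwing.GetBytes("Σ");
        }
        catch (EncoderFallbackException e)
        {
            Console.WriteLine("Cannot encode: " + e.CharUnknown);
        }
    }
}
```

The exception variant is usually the safest default for export pipelines: a loud failure beats a customer discovering 'S' where their Σ used to be.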
does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?
You'll have to keep a table of encodings for each customer. Read their input files using that encoding to decode; write their output files using the same encoding.
(For sanity, set new customers to UTF-8 and document that this is the preferred encoding.)
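A sketch of such a per-customer table; the customer names and code-page assignments below are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class CustomerFiles
{
    // Hypothetical mapping from customer identifier to the code page
    // agreed with that customer; new customers default to UTF-8.
    static readonly Dictionary<string, Encoding> CustomerEncodings =
        new Dictionary<string, Encoding>
        {
            ["GreekCustomer"] = Encoding.GetEncoding(1253),
            ["RussianCustomer"] = Encoding.GetEncoding(1251),
            ["NewCustomer"] = Encoding.UTF8,
        };

    // Decode the customer's incoming file with their agreed code page.
    public static string Import(string customer, string path)
    {
        return File.ReadAllText(path, CustomerEncodings[customer]);
    }

    // Encode the outgoing file with the same code page, so the customer
    // reads back exactly what their system expects.
    public static void Export(string customer, string path, string content)
    {
        File.WriteAllText(path, content, CustomerEncodings[customer]);
    }
}
```

Because import and export always go through the same table entry, there is no round-trip ambiguity and no need to store the code page inside the data itself.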

Convert from ASCII (DOS) to Windows

Hi, I have a string written in ASCII code whose output is " ”˜‰ƒ ‰™˜€"; this is a name in Hebrew. How can I convert it to Hebrew letters?
(.NET, C#, WinForms)
There are no Hebrew letters in ASCII, so you must mean ANSI. The system has a default encoding that is used for ANSI, which you need to know in order to decode the data.
It's probably the Windows-1255 or ISO 8859-8 encoding that was used. You can use the Encoding class to decode the data. Example:
Encoding.GetEncoding("ISO 8859-8").GetString(data);
If you already have a string, the problem is that you decoded the data using the wrong encoding. You have to go further back in the process, before the data became a string, so that you can get the actual encoded bytes.
If you are, for example, reading the string from a file, you have to either read the file as bytes instead, or set the encoding that the stream reader uses to decode the file data into characters.
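A sketch of both options, where the file path is a placeholder and windows-1255 is the guess suggested above:

```csharp
using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        string path = "hebrew.txt"; // hypothetical file

        // Option 1: read the raw bytes and decode them explicitly.
        byte[] data = File.ReadAllBytes(path);
        string text = Encoding.GetEncoding("windows-1255").GetString(data);
        Console.WriteLine(text);

        // Option 2: let a StreamReader decode while reading.
        using (var reader = new StreamReader(path, Encoding.GetEncoding("windows-1255")))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
```

If the letters still come out wrong, try ISO 8859-8 instead; the two encodings place the Hebrew block at the same byte values but differ in some punctuation positions.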
