What's the relation between the CharacterSet setting here:
http://msdn.microsoft.com/en-us/library/ms709353(VS.85).aspx
and the ASCII encoding here:
http://msdn.microsoft.com/en-us/library/system.text.asciiencoding.getbytes(VS.71).aspx
ANSI is the current Windows ANSI code page, equivalent to Encoding.Default.
OEM is the current OEM code page typically used by console applications.
You can get this using:
Encoding.GetEncoding(CultureInfo.CurrentCulture.TextInfo.OEMCodePage)
In a console application, the OEM encoding will also be available using
Console.OutputEncoding
This is really, really ancient. ODBC dates from the stone age, back when Windows started taking over from MS-DOS. Back then, lots of text was still encoded in the original IBM-PC character set, named the "OEM Character Set" by Microsoft. The standard IBM-PC set had some accented characters and pseudo-graphics glyphs in the upper half, codes 0x80-0xff.
That was too limited for text output in non-English languages, so Microsoft started using code pages: ranges of character glyphs suitable for a certain language group. The American English set of characters was standardized by ANSI, and that label is now attached (incorrectly) to any non-OEM code page.
Nobody encodes text in the OEM character set anymore; it went the way of the dodo at least 10 years ago. The proper setting here is ANSI, while keeping your fingers crossed behind your back that the code page used to encode the text matches your system's default code page. That's dodo too; Unicode solved it.
The short answer to your question: there's no direct relation.
The longer version:
CharacterSet for the "Schema.ini" file can be either ANSI or OEM.
ANSI and ASCII refer to different things.
You can read more about it here:
Understanding ASCII and ANSI Characters
ASCII vs ANSI Encoding by Alex Hoffman
From my understanding, CharacterSet=ANSI is equivalent to Encoding.Default. OEM might be ASCIIEncoding then.
However, ANSI uses the system ANSI code page, so incompatibilities may arise if the same file is accessed from computers with different code pages.
I've compiled my own reference in order to switch between the two:
Windows code page  Name          System.Text.Encoding  schema.ini CharacterSet
20127              ASCII (US)    ASCII                 20127
1252               ANSI Latin I  Default               ANSI
65001              UTF-8         UTF8                  65001
1200               UTF-16 LE     Unicode               Unicode
1201               UTF-16 BE     BigEndianUnicode      1201
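To make the table a bit more concrete, here is a minimal sketch of how a schema.ini CharacterSet value could be turned into a System.Text.Encoding. The class and method names are hypothetical (they are not part of ODBC or .NET), and the sketch assumes .NET Core 3.0 or later, where legacy code pages have to be registered through CodePagesEncodingProvider first; on the classic .NET Framework that registration line may be unnecessary or unavailable.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SchemaIniCharacterSet
{
    // Hypothetical helper: maps a schema.ini CharacterSet value
    // (ANSI, OEM, Unicode, or a numeric code page) to an Encoding.
    public static Encoding ToEncoding(string characterSet)
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages such as
        // 1252 or 437 are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string value = characterSet.Trim();
        TextInfo textInfo = CultureInfo.CurrentCulture.TextInfo;

        switch (value.ToUpperInvariant())
        {
            case "ANSI":
                // The ANSI code page of the current culture.
                return Encoding.GetEncoding(textInfo.ANSICodePage);
            case "OEM":
                // The OEM code page of the current culture.
                return Encoding.GetEncoding(textInfo.OEMCodePage);
            case "UNICODE":
                // UTF-16 little-endian, code page 1200.
                return Encoding.Unicode;
            default:
                // Anything else is treated as a numeric code page, e.g. 65001 or 1201.
                return Encoding.GetEncoding(int.Parse(value, CultureInfo.InvariantCulture));
        }
    }
}
```

For example, SchemaIniCharacterSet.ToEncoding("65001") should give you a UTF-8 encoding, while "ANSI" and "OEM" depend on the culture the code runs under.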
Well, when using IO.File.ReadAllText(path) or ReadAllText(path, System.Text.Encoding.UTF8) to read a text file which is saved in ANSI encoding, non-Latin characters aren't displayed correctly.
So, I decided to use Encoding.Default. It worked just fine, but I see recommendations against using it everywhere (like here and here) because it "will only guarantee that all UTF-7 character sets will be read correctly". Also, Microsoft says:
Gets an encoding for the operating system's current ANSI code page.
However, it seems to me that it can recognize a file with any encoding. I tested that on a file that contains Chinese, Japanese, and Arabic characters (the file is saved in UTF-8 encoding), and I was able to display the file correctly.
Code used:
Dim loadedText As String = IO.File.ReadAllText(path, System.Text.Encoding.Default)
MessageBox.Show(loadedText, "utf8")
Output:
So my question in points:
Is there something I'm missing here?
Why is it not recommended to use Encoding.Default when reading a file?
I know that a file with ANSI encoding would be displayed incorrectly if the default system encoding/system locale is changed, which is something I don't care about in my current case. But..
Is there even another way to prevent this from happening?
Side note: Please don't mind me using the c# tag. Although my code is in VB, any answer with C# code is welcomed.
File.ReadAllText actually tries to auto-detect the encoding. If the encoding cannot be determined from a BOM, then the encoding argument is used to decode the file.
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
If you used Encoding.UTF8 to write the file, then it would include a BOM. Your Encoding.Default is likely being ignored.
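You can check the BOM behaviour yourself with a small sketch like the one below. The BomDemo class is a hypothetical helper written for this answer: it writes text as UTF-8, with or without a BOM, and then reads it back while deliberately passing the wrong encoding (ASCII) to File.ReadAllText.

```csharp
using System.IO;
using System.Text;

public static class BomDemo
{
    // Writes `text` as UTF-8 to a temp file (optionally with a BOM), then reads
    // it back with a deliberately wrong encoding argument (ASCII).
    public static string RoundTrip(string text, bool writeBom)
    {
        string path = Path.GetTempFileName();
        try
        {
            var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: writeBom);
            File.WriteAllText(path, text, utf8);

            // If a BOM is present, ReadAllText uses it and ignores Encoding.ASCII.
            // Without a BOM, the ASCII argument is what actually decodes the bytes.
            return File.ReadAllText(path, Encoding.ASCII);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

With writeBom set to true the text should come back intact even though ASCII was requested. With writeBom set to false the non-ASCII characters are lost, which is the situation where the encoding argument really matters.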
Using Encoding.Default is not recommended because it is the operating system's ANSI code page, which is limited to that code page's character set. In other words, a text file created in Notepad (ANSI encoding) on Czech Windows will be displayed incorrectly on English Windows. For this reason, everything should be saved and opened in UTF-8 encoding.
Saved in ANSI and opened in Unicode may not work
Saved in Unicode and opened in ANSI will not work
Saved in ANSI and opened in another ANSI may not work
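The three cases above can be reproduced without touching the system locale by encoding a string with one code page and decoding the bytes with another. This is only a sketch: the CodePageMismatch class is hypothetical, and it assumes .NET Core 3.0 or later, where Windows-1250 and Windows-1252 become available after registering CodePagesEncodingProvider.

```csharp
using System.Text;

public static class CodePageMismatch
{
    // Simulates "saved in one encoding, opened in another" by encoding the text
    // with the `savedAs` code page and decoding the bytes with `openedAs`.
    public static string Reinterpret(string text, int savedAs, int openedAs)
    {
        // Assumption: .NET Core / .NET 5+; legacy code pages need this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] bytes = Encoding.GetEncoding(savedAs).GetBytes(text);
        return Encoding.GetEncoding(openedAs).GetString(bytes);
    }
}
```

For instance, Reinterpret("žluťoučký kůň", 1250, 1252) plays the Czech-Notepad-file-on-English-Windows scenario and does not return the original text, while using 65001 (UTF-8) on both sides round-trips cleanly.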
I know the default encoding for Windows in Western Europe is ISO-8859-1 and the default for web standards is UTF-8, but I'm hoping (Google is failing me) that someone knows the default for Windows/Visual Studio/C# software in India.
The reason is that we have an India-based company contacting our web services and getting a parse exception. My suspicion is that they aren't setting the encoding correctly (to UTF-8), but testing with the English Windows default (ISO-8859-1) works, so I'm investigating alternatives.
I may be wrong, but after a bit of research I concluded that if they are not using the en_IN locale, they have no codepage for either the GUI or the console.
This MS official source lists the Hindi codepage as 0.
This random copy of this list says that Hindi is a Unicode-only locale.
IANA claims codepage numbers 0, 1 and 2 are reserved.
Here we have a Moodle developer who discovered that while he could use specialised codepages for text files under most locales, he had to resort to UTF-8 (aka codepage 65001) text files under the Hindi locale, files which in most other versions of Windows are called "Unicode files".
Here we have another developer who discovered that Hindi doesn't have a default codepage.
According to MSDN, all locale-sensitive functions default to the C locale, which means ASCII for 8-bit strings.
So:
you cannot type Hindi without Unicode
the Hindi locale probably treats all bytes >=128 in 8-bit strings as invalid characters, while in Windows-1252 most of them are valid; I'm guessing the application performs too many byte-to-text conversions without taking encoding into account (or the client's code does)
and finally, other languages of India also have no ANSI codepage
I'm on Linux right now, but if you can, I suggest running programs via AppLocale under various locales. I recommend Hindi, Japanese and Turkish, for the largest chance of revealing bugs.
But my bet is that they read that XML off the wire, convert to string with default encoding and it blows up.
I have been trying to figure out the difference for quite some time now. The issue is with a file in ANSI encoding that has Japanese characters showing up like: ‚È‚‚Æ‚à1‚‚ÌINCREMENTs‚ª•K—v‚Å‚·. Its equivalent in Shift-JIS is 少なくとも1つのINCREMENT行が必要です, which is the Japanese text I expect.
I need to display these characters on a webpage after reading them from the file (in ANSI). Some other files in UTF-8 display their characters correctly and do not show this problem. I am finding it difficult to figure out what the difference is and how to change the encoding to do the right thing here.
I use C# for reading this file and displaying it. I also need to write the string back into the file if it's modified on the web. Are there any encoding and decoding schemes I should use here?
As far as code pages are concerned, "ANSI" (and Encoding.Default in .NET) basically just means "the non-Unicode codepage used by this system" - exactly what codepage that is, depends on how the system is configured, but on a Western European system, it's likely to be Windows-1252.
On the system where that text comes from, "ANSI" would appear to mean Shift-JIS, so unless your system has the same code page, you'll need to tell your code to read the text as Shift-JIS.
Assuming you're reading the file with a StreamReader, there are various constructors that take an Encoding, so just grab a Shift-JIS encoding with Encoding.GetEncoding("shift_jis") or Encoding.GetEncoding(932) and use it to construct your StreamReader.
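As a rough sketch of that suggestion, reading and writing the file could look like the following. The ShiftJisFile class is a hypothetical helper, and the RegisterProvider call is an assumption for .NET Core 3.0 or later, where Shift-JIS is not available until CodePagesEncodingProvider is registered.

```csharp
using System.IO;
using System.Text;

public static class ShiftJisFile
{
    private static Encoding GetShiftJis()
    {
        // Assumption: .NET Core / .NET 5+; legacy code pages need this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("shift_jis"); // same as Encoding.GetEncoding(932)
    }

    // Reads the whole file, decoding its bytes as Shift-JIS.
    public static string Read(string path)
    {
        using (var reader = new StreamReader(path, GetShiftJis()))
        {
            return reader.ReadToEnd();
        }
    }

    // Writes the (possibly edited) text back in the same encoding,
    // so other tools that expect Shift-JIS keep working.
    public static void Write(string path, string text)
    {
        File.WriteAllText(path, text, GetShiftJis());
    }
}
```

Once the text is in a .NET string it is Unicode, so the web page can emit it as UTF-8 like your other files; the Shift-JIS encoding only matters at the point where you read or write this particular file.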
I have a C# form based program and have been using
System.Text.Encoding.GetEncoding(1252)
but I've had trouble reading non-English characters. I've discovered that
System.Text.Encoding.GetEncoding(1255)
works; however, I don't know the implications of changing this, so I'm hoping someone can shed some light on the difference and possible implications.
I recommend that you read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
When you use GetEncoding(1252), you're specifying the Windows-1252 encoding, which specifies a Latin alphabet for Western Europe. GetEncoding(1255) is the Windows-1255 encoding, which is used to write Hebrew.
Character encoding 1255 includes Hebrew symbols whereas 1252 is geared towards Western Languages. Is it the case that the non-English symbols happen to be Hebrew?
1252 is Windows-1252 Western European (Windows)
1255 is Windows-1255 Hebrew (Windows)
source: http://msdn.microsoft.com/en-us/library/system.text.encodinginfo.codepage.aspx
Your encoding should always match the one that was used to create the file. If there is no metadata (or person) available to guide this selection, then the only thing to do would be to try each one and see which is legible. Since this is apparently in a language that you don't know, you may need to ask someone who speaks the language if it's legible. Do you know anyone who can read Hebrew?
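The try-each-one approach can be sketched as follows: decode the same bytes under several candidate code pages and look at (or show someone) the results. The EncodingCandidates class is hypothetical, and the sketch assumes .NET Core 3.0 or later, where code pages such as 1252 and 1255 require CodePagesEncodingProvider to be registered.

```csharp
using System.Collections.Generic;
using System.Text;

public static class EncodingCandidates
{
    // Decodes the same bytes once per candidate code page so the results can be
    // compared side by side. It does not guess which one is right; a human
    // still has to judge which output is legible.
    public static Dictionary<int, string> DecodeWithEach(byte[] bytes, params int[] codePages)
    {
        // Assumption: .NET Core / .NET 5+; legacy code pages need this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var results = new Dictionary<int, string>();
        foreach (int codePage in codePages)
        {
            results[codePage] = Encoding.GetEncoding(codePage).GetString(bytes);
        }
        return results;
    }
}
```

For example, calling DecodeWithEach(File.ReadAllBytes(path), 1252, 1255) gives you both readings of the file; if the 1255 entry is readable Hebrew and the 1252 entry is a run of accented Latin letters, that tells you which code page the file was written with.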
You probably want to use one of the "named" Unicode encodings, e.g. Encoding.UTF8. But, to answer your question: page 1252 is "Western European (Windows)" and 1255 is "Hebrew (Windows)".
If you're not aware, code pages are pretty much a relic of ASCII and you should try to stick with Unicode where possible.
How can I print UTF-8 characters in the console?
With Console.WriteLine("îăşâţ") I see îasât in the console.
Console.OutputEncoding = Encoding.UTF8;
There are some hacks you can find that demonstrate how to write multibyte character sets to the Console, but they are unreliable. They require your console font to be one that supports it, and in general, are something I would avoid. (All of these techniques break if your user doesn't do extra work on their part... so they are not reliable.)
If you need to write Unicode output, I highly recommend making a GUI application to handle this, instead of using the Console. It's fairly easy to make a simple GUI to just write your output to a control which supports Unicode.
Try this:
using System.Diagnostics;
...
Debug.WriteLine(..); // writes to the debug listeners (e.g. the Visual Studio Output window), not the console
Using Console.OutputEncoding will be sufficient for this. All string objects in .NET are Unicode (UTF-16), so changing the output encoding for the console to UTF-8 will work as you want on modern Windows installations.
The default console encoding depends on configuration, but it will most likely be IBM437 on US English systems or some local code page elsewhere.
You can't print Unicode characters in the console; it only supports the characters that are available in the current code page. Characters that are not available are converted to the closest equivalent, or a question mark.