What's the default character encoding for Windows in India? - c#

I know the default encoding for Windows in Western Europe is ISO-8859-1 and the default for web standards is UTF-8, but I'm hoping (Google is failing me) that someone knows the default for Windows/Visual Studio/C# software in India.
The reason I ask is that an India-based company is calling our web services and getting a parse exception. My suspicion is that they aren't setting the encoding correctly (to UTF-8), but testing with the English Windows default (ISO-8859-1) works, so I'm investigating alternatives.

I may be wrong, but after a bit of research I concluded that unless they are using the en_IN locale, they have no ANSI codepage for either the GUI or the console.
This official MS source lists the Hindi codepage as 0.
This random copy of this list says that Hindi is a Unicode-only locale.
IANA claims codepage numbers 0, 1 and 2 are reserved.
Here we have a Moodle developer who discovered that, while he could use specialised codepages for text files under most locales, he had to resort to UTF-8 (a.k.a. codepage 65001) text files under the Hindi locale – files which in most other versions of Windows are called "Unicode files".
Here we have another developer who discovered that Hindi doesn't have a default codepage.
According to MSDN, all locale-sensitive functions default to the C locale, which means ASCII for 8-bit strings.
So:
you cannot type Hindi without Unicode
the Hindi locale probably treats all bytes >= 128 in 8-bit strings as invalid characters, while in Windows-1252 most of them are valid; my guess is that the application (or the Indian client's code) converts between bytes and text without taking the encoding into account
and finally, other languages of India also have no ANSI codepage
I'm on Linux right now, but if you can, I suggest running the programs via AppLocale under various locales. I recommend Hindi, Japanese and Turkish – for the largest chance of revealing bugs.
But my bet is that they read the XML off the wire, convert it to a string with the default encoding, and it blows up.
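A minimal sketch of that suspected failure mode follows. The payload and variable names are hypothetical, and note that Encoding.Default is the ANSI code page only on .NET Framework (on .NET Core/5+ it is already UTF-8):

// Suspected failure mode: UTF-8 bytes read off the wire, decoded with Encoding.Default.
using System;
using System.Text;

class DefaultEncodingPitfall
{
    static void Main()
    {
        // The client sends UTF-8 XML, here containing Devanagari text (hypothetical payload).
        byte[] wireBytes = Encoding.UTF8.GetBytes("<name>नमस्ते</name>");

        // Correct: decode with the encoding that was actually used on the wire.
        Console.WriteLine(Encoding.UTF8.GetString(wireBytes));    // <name>नमस्ते</name>

        // Suspect: decode with the machine's ANSI code page. On a Western European box the
        // non-ASCII bytes come out as mojibake; the mangled payload would explain a parse
        // exception further down the pipeline.
        Console.WriteLine(Encoding.Default.GetString(wireBytes)); // garbled
    }
}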

Related

Can Encoding.Default recognize utf8 characters? Should I really not use it?

Well, when using IO.File.ReadAllText(path) or ReadAllText(path, System.Text.Encoding.UTF8) to read a text file which is saved in ANSI encoding, non-Latin characters aren't displayed correctly.
So, I decided to use Encoding.Default. It worked just fine, but I see recommendations against using it everywhere (like here and here) because it "will only guarantee that all UTF-7 character sets will be read correctly". Also, Microsoft says:
Gets an encoding for the operating system's current ANSI code page.
However, it seems to me that it can recognize a file with any encoding. I tested that on a file that contains Chinese, Japanese, and Arabic characters (the file is saved in UTF-8 encoding), and I was able to display the file correctly.
Code used:
Dim loadedText As String = IO.File.ReadAllText(path, System.Text.Encoding.Default)
MessageBox.Show(loadedText, "utf8")
Output: (screenshot of the message box showing the Chinese, Japanese, and Arabic text displayed correctly)
So my question in points:
Is there something I'm missing here?
Why is it not recommended to use Encoding.Default when reading a file?
I know that a file with ANSI encoding would be displayed incorrectly if the default system encoding/system locale is changed, which is something I don't care about in my current case. But is there another way to prevent this from happening?
Side note: please don't mind me using the C# tag. Although my code is in VB, any answer with C# code is welcome.
File.ReadAllText actually tries to auto-detect the encoding. If the encoding cannot be determined from a BOM, then the encoding argument is used to decode the file.
This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.
If you used Encoding.UTF8 to write the file, then it would include a BOM. Your Encoding.Default is likely being ignored.
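Here is a small sketch of that behaviour (the temp-file path is illustrative): File.WriteAllText with the shared Encoding.UTF8 instance emits a UTF-8 BOM, and File.ReadAllText detects that BOM and decodes as UTF-8 regardless of which encoding you pass in.

using System;
using System.IO;
using System.Text;

class BomDetectionDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "bom-demo.txt");

        // Encoding.UTF8 writes a BOM at the start of the file.
        File.WriteAllText(path, "中文 日本語 العربية", Encoding.UTF8);

        // The BOM wins: the text round-trips even though we asked for Encoding.Default.
        Console.WriteLine(File.ReadAllText(path, Encoding.Default));

        // Without a BOM (new UTF8Encoding(false)), the bytes really are decoded with the
        // ANSI code page on .NET Framework, and the non-Latin characters are garbled.
        File.WriteAllText(path, "中文 日本語 العربية", new UTF8Encoding(false));
        Console.WriteLine(File.ReadAllText(path, Encoding.Default));
    }
}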
Using Encoding.Default is not recommended because it is the operating system's current ANSI code page, which is limited to that code page's character set. In other words, a text file created in Notepad (ANSI encoding) on Czech Windows will be displayed incorrectly on English Windows. For this reason, everything should be saved and opened in UTF-8 encoding:
Saved in ANSI and opened in Unicode may not work
Saved in Unicode and opened in ANSI will not work
Saved in ANSI and opened in another ANSI may not work
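A brief sketch of the first and third bullets, assuming Czech text saved under Windows-1250 and opened under Windows-1252; the sample string is illustrative.

using System;
using System.Text;

class AnsiMismatchSketch
{
    static void Main()
    {
        // On .NET Core/5+, Windows code pages need:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        string czech = "příliš žluťoučký kůň";

        // "Saved in ANSI" on a Czech system (Windows-1250)...
        byte[] ansiBytes = Encoding.GetEncoding(1250).GetBytes(czech);

        // ..."opened in another ANSI" on a Western European system (Windows-1252): wrong letters.
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(ansiBytes));

        // Saved and opened as UTF-8: the text survives on any system.
        Console.WriteLine(Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(czech)));
    }
}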

ANSI vs SHIFT JIS vs UTF-8 in c#

I have been trying to figure out the difference for quite some time now. The issue is with a file in ANSI encoding that contains Japanese characters like: ­‚È‚­‚Æ‚à1‚‚ÌINCREMENTs‚ª•K—v‚Å‚·. Its equivalent in Shift-JIS is 少なくとも1つのINCREMENT行が必要です, which is the expected Japanese.
I need to display these characters on a web page after reading them from the file (in ANSI). There are some other files in UTF-8 that display their characters correctly and don't have this problem. I am finding it difficult to figure out what the difference is and how to change the encoding to do the right thing here.
I use C# to read this file and display it, and I also need to write the string back to the file if it is modified on the web. Any encoding and decoding schemes to use here?
As far as code pages are concerned, "ANSI" (and Encoding.Default in .NET) basically just means "the non-Unicode codepage used by this system" - exactly what codepage that is, depends on how the system is configured, but on a Western European system, it's likely to be Windows-1252.
For the system that text comes from, "ANSI" would appear to mean Shift-JIS - so unless your system has the same code page, you'll need to tell your code to read the text as Shift-JIS.
Assuming you're reading the file with a StreamReader, there are various constructors that take an Encoding, so just grab a Shift-JIS encoding with Encoding.GetEncoding("shift_jis") or Encoding.GetEncoding(932) and use it to construct your StreamReader.
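A minimal sketch of that approach; "input.txt" is a hypothetical path, and on .NET Core/5+ the code-pages encoding provider must be registered first.

// Read and write a Shift-JIS file with an explicit encoding.
// On .NET Core/5+: Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
using System.IO;
using System.Text;

class ShiftJisIo
{
    static void Main()
    {
        Encoding shiftJis = Encoding.GetEncoding("shift_jis"); // or Encoding.GetEncoding(932)

        // Read the file with the encoding it was actually written in.
        string text;
        using (var reader = new StreamReader("input.txt", shiftJis))
        {
            text = reader.ReadToEnd(); // 少なくとも1つのINCREMENT行が必要です
        }

        // When writing edits back from the web page, keep the same encoding
        // (or convert the whole pipeline to UTF-8 once and stay there).
        using (var writer = new StreamWriter("input.txt", false, shiftJis))
        {
            writer.Write(text);
        }
    }
}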

Japanese culture in C# application in English OS

I have an application in C# using .NET 3.5. Using this application I save a file and zip it with the vjlib library, and when opening the file I unzip it. However, when I give the file a Japanese name while saving it on an English-OS machine, the application is not able to understand the Japanese characters when opening it. Is it due to some Windows language pack, etc.?
This problem is most likely induced by the application that created the .zip file. File names are encoded in 8-bit characters inside the file. The ZIP specification says that the name should be encoded either in code page 437 or in UTF-8. Code page 437 is the original IBM PC character set, an encoding that doesn't support any Japanese characters. It is not unusual for an app to just use its own 8-bit encoding, typically determined by the default system code page.
The library you use is the .NET runtime support library for Visual J#. I'm not sure it supports specifying a different encoding, and it is hard to find docs for it these days since it has been deprecated for so long. Consider, say, DotNetZip. Its ZipFile class has an AlternateEncoding property you can initialize from Encoding.GetEncoding(). You still need to find out what encoding was used; knowing where the file came from is important help in making the right guess. One common code page for Japanese is 932.
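A hedged sketch assuming the DotNetZip (Ionic.Zip) library suggested above. The paths and the choice of code page 932 (Shift-JIS) are illustrative; you still have to determine which encoding the application that created the archive actually used.

// On .NET Core/5+, code page 932 needs:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
using System.Text;
using Ionic.Zip;

class JapaneseZipNames
{
    static void Main()
    {
        var sjis = Encoding.GetEncoding(932);

        // Reading an archive whose entry names were stored in Shift-JIS.
        using (var zip = ZipFile.Read("archive.zip", new ReadOptions { Encoding = sjis }))
        {
            zip.ExtractAll("output", ExtractExistingFileAction.OverwriteSilently);
        }

        // Creating an archive so the Japanese names survive on other systems: store them as UTF-8.
        using (var zip = new ZipFile())
        {
            zip.AlternateEncoding = Encoding.UTF8;
            zip.AlternateEncodingUsage = ZipOption.Always;
            zip.AddFile("日本語ファイル名.txt");
            zip.Save("archive-utf8.zip");
        }
    }
}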

What's the difference between Encoding.GetEncoding(1255) and Encoding.GetEncoding(1252)?

I have a C# forms-based program and have been using
System.Text.Encoding.GetEncoding(1252)
but I've had trouble reading non-English characters. I've discovered that
System.Text.Encoding.GetEncoding(1255)
works. However, I don't know the implications of changing this, so I'm hoping someone can shed some light on the difference and the possible implications.
I recommend that you read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
When you use GetEncoding(1252), you're specifying the Windows-1252 encoding, which provides a Latin alphabet for Western Europe. GetEncoding(1255) is the Windows-1255 encoding, which is used to write Hebrew.
Character encoding 1255 includes Hebrew symbols, whereas 1252 is geared towards Western European languages. Is it the case that the non-English symbols happen to be Hebrew?
1252 is Windows-1252 Western European (Windows)
1255 is Windows-1255 Hebrew (Windows)
source: http://msdn.microsoft.com/en-us/library/system.text.encodinginfo.codepage.aspx
Your encoding should always match the one that was used to create the file. If there is no metadata (or person) available to guide this selection, then the only thing to do would be to try each one and see which is legible. Since this is apparently in a language that you don't know, you may need to ask someone who speaks the language if it's legible. Do you know anyone who can read Hebrew?
You probably want to use one of the "named" Unicode encodings, e.g. Encoding.UTF8. But, to answer your question - page 1252 is "Western European (Windows)" and 1255 is "Hebrew (Windows)".
If you're not aware, code pages are pretty much a relic of the pre-Unicode era, and you should try to stick with Unicode where possible.
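To see the difference in practice, here is a short sketch (the sample text is illustrative): the same bytes decode to different characters under the two code pages.

// On .NET Core/5+: Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
using System;
using System.Text;

class CodePageComparison
{
    static void Main()
    {
        // Hebrew text as it would be stored by a Windows-1255 ("ANSI Hebrew") application.
        byte[] bytes = Encoding.GetEncoding(1255).GetBytes("שלום");

        Console.WriteLine(Encoding.GetEncoding(1255).GetString(bytes)); // שלום (correct)
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(bytes)); // ùìåí (wrong code page)

        // With a Unicode encoding such as UTF-8 there is nothing to guess.
        Console.WriteLine(Encoding.UTF8.GetString(Encoding.UTF8.GetBytes("שלום")));
    }
}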

Relation between .NET Encoding and Characterset

What's relation between CharacterSet here:
http://msdn.microsoft.com/en-us/library/ms709353(VS.85).aspx
and ascii encoding here:
http://msdn.microsoft.com/en-us/library/system.text.asciiencoding.getbytes(VS.71).aspx
ANSI is the current Windows ANSI code page, equivalent to Encoding.Default.
OEM is the current OEM code page typically used by console applications.
You can get this using:
Encoding.GetEncoding(CultureInfo.CurrentCulture.TextInfo.OEMCodePage)
In a console application, the OEM encoding will also be available using
Console.OutputEncoding
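Putting those together, a quick sketch that prints both code pages. The values depend on the machine's regional settings (for example ANSI 1252 and OEM 437 on a US-English system), and on .NET Core/5+ Encoding.Default reports UTF-8 rather than the ANSI code page.

using System;
using System.Globalization;
using System.Text;

class AnsiVsOem
{
    static void Main()
    {
        Encoding ansi = Encoding.Default; // Windows ANSI code page on .NET Framework
        Encoding oem = Encoding.GetEncoding(CultureInfo.CurrentCulture.TextInfo.OEMCodePage);

        Console.WriteLine($"ANSI code page: {ansi.CodePage}");
        Console.WriteLine($"OEM code page:  {oem.CodePage}");
        Console.WriteLine($"Console output: {Console.OutputEncoding.CodePage}");
    }
}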
This is really, really ancient. ODBC dates from the stone age, back when Windows started taking over from MS-DOS. Back then, lots of text was still encoded in the original IBM-PC character set, named the "OEM character set" by Microsoft. The standard IBM-PC set had some accented characters and pseudo-graphics glyphs in the upper half, codes 0x80-0xFF.
Because that was too limited for text in non-English languages, Microsoft started using code pages, ranges of character glyphs suitable for a certain language group. The American English set of characters was standardized by ANSI; that label is now attached (incorrectly) to any non-OEM code page.
Nobody encodes text in the OEM character set anymore; it went the way of the dodo at least 10 years ago. The proper setting here is ANSI, while keeping your fingers crossed that the code page used to encode the text matches your system's default code page. That's a dodo too; Unicode solved it.
The short answer to your question, there's no direct relation.
The longer version:
CharacterSet for the "Schema.ini" file can be either ANSI or OEM.
ANSI and ASCII refer to different things.
You can read more of it here:
Understanding ASCII and ANSI Characters
ASCII vs ANSI Encoding by Alex Hoffman
From my understanding, CharacterSet=ANSI is equivalent to Encoding.Default. OEM might be ASCIIEncoding then.
However, ANSI uses the system ANSI code page, so incompatibilities may arise if the same file is accessed from computers with different code pages.
I've compiled my own reference in order to switch between the two:
Windows code page | Name         | System.Text.Encoding | schema.ini CharacterSet
20127             | ASCII (US)   | ASCII                | 20127
1252              | ANSI Latin I | Default              | ANSI
65001             | UTF-8        | UTF8                 | 65001
1200              | UTF-16 LE    | Unicode              | Unicode
1201              | UTF-16 BE    | BigEndianUnicode     | 1201
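As a hypothetical helper (not part of any library), the table can be expressed as a small mapping from a schema.ini CharacterSet value to the corresponding System.Text.Encoding:

using System.Globalization;
using System.Text;

static class SchemaIniEncoding
{
    public static Encoding FromCharacterSet(string characterSet)
    {
        switch (characterSet.Trim())
        {
            case "ANSI":    return Encoding.Default;           // current Windows ANSI code page
            case "OEM":     return Encoding.GetEncoding(CultureInfo.CurrentCulture.TextInfo.OEMCodePage);
            case "Unicode": return Encoding.Unicode;           // UTF-16 LE, code page 1200
            case "20127":   return Encoding.ASCII;
            case "65001":   return Encoding.UTF8;
            case "1201":    return Encoding.BigEndianUnicode;  // UTF-16 BE
            default:        return Encoding.GetEncoding(int.Parse(characterSet)); // any other numeric code page
        }
    }
}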
