What's the difference between Encoding.GetEncoding(1255) and Encoding.GetEncoding(1252)? - c#

I have a C# form based program and have been using
System.Text.Encoding.GetEncoding(1252)
but I've had trouble reading non-English characters. I've discovered that
System.Text.Encoding.GetEncoding(1255)
works; however, I don't know the implications of changing this, so I'm hoping someone can shed some light on the difference and the possible implications.

I recommend that you read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

When you use GetEncoding(1252), you're specifying the Windows-1252 encoding, which covers the Latin alphabet as used in Western Europe. GetEncoding(1255) is the Windows-1255 encoding, which is used to write Hebrew.
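To make the difference concrete, here is a minimal sketch that decodes the same raw byte with both code pages. The class name `CodePageDemo` is made up for illustration. The `RegisterProvider` call is an assumption about the runtime: it is needed on .NET Core / .NET 5+ (via System.Text.Encoding.CodePages), while on .NET Framework these code pages are built in and the line can be dropped.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class CodePageDemo
{
    static CodePageDemo()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy
        // Windows code pages must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes the same raw bytes using the given Windows code page.
    public static string Decode(byte[] bytes, int codePage)
    {
        return Encoding.GetEncoding(codePage).GetString(bytes);
    }
}
```

For example, the single byte 0xE0 decodes to "à" (U+00E0) under 1252 and to "א" (U+05D0) under 1255, so the bytes on disk are identical and only their interpretation changes.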

Character encoding 1255 includes Hebrew symbols, whereas 1252 is geared towards Western European languages. Is it the case that the non-English symbols happen to be Hebrew?

1252 is Windows-1252 Western European (Windows)
1255 is Windows-1255 Hebrew (Windows)
source: http://msdn.microsoft.com/en-us/library/system.text.encodinginfo.codepage.aspx

Your encoding should always match the one that was used to create the file. If there is no metadata (or person) available to guide this selection, then the only thing to do would be to try each one and see which is legible. Since this is apparently in a language that you don't know, you may need to ask someone who speaks the language if it's legible. Do you know anyone who can read Hebrew?
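If you do end up trying candidates by hand, something along these lines may help. It is only a sketch: `EncodingGuess`, `IsValidUtf8` and `Preview` are names invented here, and the provider registration assumes .NET Core / .NET 5+. The idea is to rule UTF-8 in or out first with a strict decoder, then produce one decoded preview per candidate code page for a human to judge.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of any library.
public static class EncodingGuess
{
    static EncodingGuess()
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // True when the bytes are well-formed UTF-8. The strict decoder
    // throws on invalid sequences instead of substituting U+FFFD.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Decodes the same bytes with each candidate code page so that
    // someone who reads the language can pick the legible one.
    public static Dictionary<int, string> Preview(byte[] bytes, params int[] codePages)
    {
        var result = new Dictionary<int, string>();
        foreach (int codePage in codePages)
        {
            result[codePage] = Encoding.GetEncoding(codePage).GetString(bytes);
        }
        return result;
    }
}
```

A call such as `EncodingGuess.Preview(bytes, 1252, 1255)` returns both readings side by side. Single-byte code pages accept almost any byte sequence, so legibility, not the absence of an exception, is the real test.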

You probably want to use one of the "named" Unicode encodings, e.g. Encoding.UTF8. But, to answer your question: code page 1252 is "Western European (Windows)" and 1255 is "Hebrew (Windows)".
If you're not aware, code pages are pretty much a relic of the pre-Unicode era, and you should try to stick with Unicode where possible.
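One way to move towards Unicode is to transcode legacy data once and keep it as UTF-8 from then on. A rough sketch using `Encoding.Convert` follows. The class and method names are hypothetical, the source code page must be known in advance, and the provider registration again assumes .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class LegacyToUtf8
{
    static LegacyToUtf8()
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Re-encodes bytes written in a known legacy code page as UTF-8.
    public static byte[] Transcode(byte[] legacyBytes, int sourceCodePage)
    {
        Encoding source = Encoding.GetEncoding(sourceCodePage);
        return Encoding.Convert(source, Encoding.UTF8, legacyBytes);
    }
}
```

With 1255 as the source, the byte 0xE0 (alef) becomes the two UTF-8 bytes 0xD7 0x90.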

Related

What's the default character encoding for Windows in India?

I know the default encoding for Windows in Western Europe is ISO-8859-1 and the default for web standards is UTF-8, but I'm hoping (Google is failing me) that someone knows the default for Windows/Visual Studio/C# software in India.
The reason is that we have an India-based company contacting our web services and getting a parse exception. My suspicion is that they aren't setting the encoding correctly (to UTF-8), but testing with the English Windows default (ISO-8859-1) works, so I'm investigating alternatives.
I may be wrong, but after a bit of research I concluded that if they are not using en_IN locale, they have no codepage for either GUI or console.
This MS official source lists Hindi codepage as 0.
This random copy of this list says that Hindi is a Unicode-only locale.
IANA claims codepage numbers 0, 1 and 2 are reserved.
Here we have a Moodle developer who discovered that, while he could use specialised code pages for text files under most locales, they had to resort to UTF-8 (aka code page 65001) text files under the Hindi locale – files which in most other versions of Windows are called "Unicode files".
Here we have another developer who discovered that Hindi doesn't have a default codepage.
According to MSDN, all locale-sensitive functions default to C locale, which means ASCII for 8-bit strings.
So:
- you cannot type Hindi without Unicode;
- the Hindi locale probably treats all bytes >= 128 in 8-bit strings as invalid characters, while in Windows-1252 most of them are valid; I'm guessing the application (or the client's code) performs too many byte-to-text conversions without taking the encoding into account;
- and finally, other languages of India also have no ANSI code page.
I'm on Linux right now, but if you can, I suggest running programs via AppLocale under various locales. I recommend Hindi, Japanese and Turkish, for the largest chance of revealing bugs.
But my bet is that they read that XML off the wire, convert to string with default encoding and it blows up.
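If that guess is right, one fix is to avoid the intermediate string altogether and hand the raw bytes to the XML parser, which reads the byte order mark or the encoding declaration itself. A minimal sketch, assuming the payload arrives as a byte array (`WireXml` is a made-up name):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class WireXml
{
    // Parses the payload straight from bytes. XDocument.Load(Stream)
    // picks the encoding from the BOM or the XML declaration, so the
    // machine's default code page never gets involved.
    public static XDocument Parse(byte[] payload)
    {
        using (var stream = new MemoryStream(payload))
        {
            return XDocument.Load(stream);
        }
    }
}
```

By contrast, decoding the same UTF-8 bytes with a single-byte encoding such as ISO-8859-1 before parsing turns every non-ASCII character into two or three wrong ones, which is consistent with a bug that only shows up once non-ASCII text appears.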

Special characters in SharpPDF

I'm using the sharpPDF DLL (http://sharppdf.sourceforge.net) to create PDFs in C#. Everything works great, but I don't get any special characters (actually these are Polish letters such as "ą, ć, ł, Ó...") in my output. I'm saving strings in that PDF.
Is there any way to get that working?
Thanks.
Unfortunately, SharpPDF has a lot of issues with special characters, and no fix for the special characters problem is planned.
Sorry.

In which order are RTL language (Hebrew, Arabic, etc.) strings stored in memory?

And how does the OS know whether to apply bidi algorithms on the string for displaying purposes?
I know that Hebrew might come in an ISO-Logical form, but how does the OS know that a specific string contains Hebrew (or any other RTL language)?
According to How to detect whether a character belongs to a Right To Left language? - it seems they are stored left-to-right, and it's the character codes that dictate whether it's a RTL language.
The way to do this nowadays, as recommended by the Unicode standard, is to store text in logical order (good explanation here), which means the order in which it is read.
The OS knows that a specific string contains Hebrew by looking at the character codes. It applies the Unicode Bidirectional Algorithm to determine the correct display order. Typically an OS will do a quick scan of the string first to see if there are any right-to-left characters or control codes constraining the order. If not, the string doesn't need reordering.
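The "quick scan" can be pictured roughly like this. It is a deliberately simplified sketch: a real implementation consults the Unicode Bidi_Class property of every character, whereas this hypothetical `RtlScan` helper only checks the Hebrew and Arabic blocks plus a few right-to-left control characters.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class RtlScan
{
    // Returns true if the string contains anything that could require
    // bidirectional reordering. Simplification: only two script blocks
    // and three control characters are checked.
    public static bool ContainsRtl(string text)
    {
        foreach (char c in text)
        {
            bool hebrew = c >= '\u0590' && c <= '\u05FF';
            bool arabic = c >= '\u0600' && c <= '\u06FF';
            // RLM, RLE and RLO respectively.
            bool control = c == '\u200F' || c == '\u202B' || c == '\u202E';
            if (hebrew || arabic || control)
            {
                return true;
            }
        }
        return false;
    }
}
```

If the scan finds nothing, the text can be drawn left to right as stored. Otherwise the full bidirectional algorithm has to run, still on the logically ordered string.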

ANSI vs SHIFT JIS vs UTF-8 in c#

I have been trying to figure out the difference for quite some time now. The issue is with a file in ANSI encoding that has Japanese characters like: ­‚È‚­‚Æ‚à1‚‚ÌINCREMENTs‚ª•K—v‚Å‚·. Its equivalent in Shift-JIS is 少なくとも1つのINCREMENT行が必要です, which is the expected Japanese text.
I need to display these characters on a webpage after reading them from the file (in ANSI). There are some other files in UTF-8 that display their characters correctly and don't show this problem. I am finding it difficult to figure out what the difference is and how to change the encoding to do the right thing here.
I use C# for reading this file and displaying it. I also need to write the string back into the file if it's modified on the web. Are there any encoding and decoding schemes for this?
As far as code pages are concerned, "ANSI" (and Encoding.Default in .NET) basically just means "the non-Unicode codepage used by this system" - exactly what codepage that is, depends on how the system is configured, but on a Western European system, it's likely to be Windows-1252.
For the system where that text comes from, then "ANSI" would appear to mean Shift-JIS - so unless your system has the same code page, you'll need to tell your code to read the text as Shift-JIS.
Assuming you're reading the file with a StreamReader, there are various constructors that take an Encoding, so just grab a Shift-JIS encoding with Encoding.GetEncoding("shift_jis") or Encoding.GetEncoding(932) and use it to construct your StreamReader.
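A minimal sketch of that, covering both the read and the write-back the question asks about. `ShiftJisFile` is a name invented here, and the `RegisterProvider` call assumes .NET Core / .NET 5+ (on .NET Framework code page 932 is available without it).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class ShiftJisFile
{
    static ShiftJisFile()
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Reads the whole file, interpreting its bytes as Shift-JIS (code page 932).
    public static string Read(string path)
    {
        using (var reader = new StreamReader(path, Encoding.GetEncoding(932)))
        {
            return reader.ReadToEnd();
        }
    }

    // Overwrites the file, encoding the text back to Shift-JIS.
    public static void Write(string path, string text)
    {
        using (var writer = new StreamWriter(path, false, Encoding.GetEncoding(932)))
        {
            writer.Write(text);
        }
    }
}
```

Once read, the text is an ordinary .NET string (UTF-16 in memory), so it can be sent to the web page as UTF-8 like the other files. Only the file I/O needs to know about Shift-JIS.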

Relation between .NET Encoding and Characterset

What's relation between CharacterSet here:
http://msdn.microsoft.com/en-us/library/ms709353(VS.85).aspx
and ascii encoding here:
http://msdn.microsoft.com/en-us/library/system.text.asciiencoding.getbytes(VS.71).aspx
ANSI is the current Windows ANSI code page, equivalent to Encoding.Default.
OEM is the current OEM code page typically used by console applications.
You can get this using:
Encoding.GetEncoding(CultureInfo.CurrentCulture.TextInfo.OEMCodePage)
In a console application, the OEM encoding will also be available using
Console.OutputEncoding
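As a small sketch of how the two code pages can be looked up for any culture, not just the current one (`CulturePages` is a hypothetical helper, and the provider registration assumes .NET Core / .NET 5+):

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of any library.
public static class CulturePages
{
    static CulturePages()
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // The "ANSI" encoding associated with a culture.
    public static Encoding Ansi(CultureInfo culture)
    {
        return Encoding.GetEncoding(culture.TextInfo.ANSICodePage);
    }

    // The "OEM" encoding associated with a culture.
    public static Encoding Oem(CultureInfo culture)
    {
        return Encoding.GetEncoding(culture.TextInfo.OEMCodePage);
    }
}
```

For the invariant culture this yields code page 1252 for ANSI and 437 (the original IBM-PC set) for OEM.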
This is really, really ancient. ODBC dates from the stone age, back when Windows started taking over from MS-DOS. Back then, lots of text was still encoded in the original IBM-PC character set, named the "OEM Character Set" by Microsoft. The standard IBM-PC set had some accented characters and pseudo-graphics glyphs in the upper half, codes 0x80-0xff.
Because that set was too limited for text output in non-English languages, Microsoft started using code pages, ranges of character glyphs suitable for a certain language group. The American English set of characters was standardized by ANSI; that label is now attached (incorrectly) to any non-OEM code page.
Nobody encodes text in the OEM character set anymore; it went the way of the dodo at least 10 years ago. The proper setting here is ANSI, while keeping your fingers crossed behind your back that the code page used to encode the text matches your system's default code page. That's dodo too; Unicode solved it.
The short answer to your question: there's no direct relation.
The longer version:
CharacterSet for the "Schema.ini" file can be either ANSI or OEM.
ANSI and ASCII refer to different things.
You can read more of it here:
Understanding ASCII and ANSI Characters
ASCII vs ANSI Encoding by Alex Hoffman
From my understanding, CharacterSet=ANSI is equivalent to Encoding.Default. OEM might be ASCIIEncoding then.
However, ANSI uses the system ANSI code page, so incompatibilities may arise if the same file is accessed from computers with different code pages.
I've compiled my own reference in order to switch between the two:
Windows code page   Name           System.Text.Encoding   schema.ini CharacterSet
20127               ASCII (US)     ASCII                  20127
1252                ANSI Latin I   Default                ANSI
65001               UTF-8          UTF8                   65001
1200                UTF-16 LE      Unicode                Unicode
1201                UTF-16 BE      BigEndianUnicode       1201
