I have well-formed text encoded in UTF-8. I can't predict what languages the text will be in.
To print this text on a thermal receipt printer, I need to choose the best encoding to render the text and convert to it. Unfortunately, UTF-8 is not supported. All available encodings can represent only a subset of characters.
So I would need to find the best option (e.g. from the list mentioned here https://reference.epson-biz.com/modules/ref_escpos/index.php?content_id=321) to convert the data to, losing as few characters as possible in the process (I am aware I will not be able to print Cyrillic together with Thai).
It is not a "guess the encoding" question, but rather choosing the best encoding for minimum loss of representable characters.
Has anybody seen a good approach?
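For what it's worth, a brute-force sketch in C#: encode the text into each candidate code page with a replacement fallback, count how many characters were lost, and pick the page with the fewest losses. The candidate list below is a placeholder for whatever code pages your printer actually supports, and on .NET Core/5+ you would first need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).

// using System.Linq; using System.Text;
static int CountUnrepresentable(string text, int codePage)
{
    // Unsupported characters are replaced with '\0', which nothing else
    // maps to in these code pages, so counting zero bytes counts the losses.
    var enc = Encoding.GetEncoding(codePage,
        new EncoderReplacementFallback("\0"),
        DecoderFallback.ReplacementFallback);
    return enc.GetBytes(text).Count(b => b == 0);
}

static int PickBestCodePage(string text, int[] candidates)
{
    // Fewest lost characters wins; ties go to the first candidate listed.
    return candidates.OrderBy(cp => CountUnrepresentable(text, cp)).First();
}

// Example (placeholder list): PickBestCodePage(text, new[] { 437, 850, 852, 858, 866, 1251, 1252, 1253 });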
Related
I have a project where everything that is stored in the database is encrypted. For encoding we use System.Text.Encoding.Default.GetBytes(text).
The problem is that now the client wants to add support for Polish (and other Nordic) characters, and using the Default encoding doesn't work: the Polish characters get converted to English ones (e.g. Ą gets converted to A).
I can't simply change the encoding (Unicode seems to work), as the previously stored data would be lost.
Is there any way to get around this and add support for new characters while keeping the old data?
To be clear: you realise that "encoding" is not "encrypting", but I suppose you encrypt the byte array you get from encoding your string data?
Then I'd suggest either decrypting, re-encoding and re-encrypting all existing data using UTF-8 (the most efficient encoding for Western alphabets), or adding a "version" or "encoding" column indicating which encoding the data was encrypted with.
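As a rough sketch of the one-off migration (LoadAllRows, SaveRow, Encrypt and Decrypt are hypothetical stand-ins for whatever data access and crypto helpers the project already has):

// using System.Text;
foreach (var row in LoadAllRows())                       // hypothetical data access
{
    byte[] oldBytes = Decrypt(row.Payload);              // hypothetical crypto helper
    string text = Encoding.Default.GetString(oldBytes);  // decode with the legacy encoding
    row.Payload = Encrypt(Encoding.UTF8.GetBytes(text)); // re-encrypt as UTF-8
    row.EncodingVersion = 2;                             // or use the suggested "encoding" column
    SaveRow(row);                                        // hypothetical data access
}

One caveat: run the migration on a machine whose default code page matches the one that originally wrote the data, since Encoding.Default differs from machine to machine.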
In the XML I need to read in C#, I find characters such as
é, É.
As far as I know, I should not find those characters in a Windows-1252 encoded XML. Can I fix that problem in C#, or must the XML itself be updated?
Thanks in advance.
It does look like the XML needs to be updated.
You could certainly write something that reads it in as the UTF-8 it really is and writes it back out as the Windows-1252 it claimed to be, but why bother? XML in Windows-1252 is like someone using their smartphone while dressed as ye olde knight at a Renaissance Faire anyway. Just drop the incorrect declaration from the first line and away you go.
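If you do take the re-save route, a minimal sketch (the file name is a placeholder, and it assumes the bytes really are UTF-8 despite the declaration):

// using System.IO; using System.Text;
string xml = File.ReadAllText("data.xml", Encoding.UTF8);             // read as the UTF-8 it really is
xml = xml.Replace("encoding=\"windows-1252\"", "encoding=\"utf-8\""); // fix the lying declaration
File.WriteAllText("data.xml", xml, new UTF8Encoding(false));          // write back, no BOM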
The simple answer is: you're probably using the wrong encoding. From this I'd say you should be using UTF-8. You can force it by downloading the document before parsing it.
I should note that downloading URLs is tricky: web servers often report the wrong encoding. That is also the reason why the HTML5 standard includes a section on encoding detection. I'm afraid there's no easy generic solution for this -- we ended up implementing our own encoding detection algorithms for our web crawlers.
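A sketch of the download-first approach (url is a placeholder, and it assumes the document really is UTF-8):

// using System.Net.Http; using System.Text; using System.Xml.Linq;
using var client = new HttpClient();
byte[] raw = await client.GetByteArrayAsync(url);        // grab raw bytes, ignoring the charset header
var doc = XDocument.Parse(Encoding.UTF8.GetString(raw)); // decode explicitly, then parse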
I know the default encoding for Windows in Western Europe is ISO-8859-1 and the default for web standards is UTF-8, but I'm hoping (Google is failing me) that someone knows the default for Windows/Visual Studio/C# software in India.
The reason is that we have an India-based company contacting our web services and getting a parse exception. My suspicion is that they aren't setting the encoding right (to UTF-8), but testing with the English Windows default (ISO-8859-1) works, so I'm investigating alternatives.
I may be wrong, but after a bit of research I concluded that if they are not using the en_IN locale, they have no ANSI codepage for either GUI or console.
This MS official source lists Hindi codepage as 0.
This random copy of this list says that Hindi is a Unicode-only locale.
IANA claims codepage numbers 0, 1 and 2 are reserved.
Here we have a Moodle developer who discovered that while he could use specialised codepages for text files under most locales, they had to resort to UTF-8 (aka codepage 65001) text files under the Hindi locale – files which in most other versions of Windows are called "Unicode files".
Here we have another developer who discovered that Hindi doesn't have a default codepage.
According to MSDN, all locale-sensitive functions default to the C locale, which means ASCII for 8-bit strings.
So:
you cannot type Hindi without Unicode
the Hindi locale probably treats all bytes >= 128 in 8-bit strings as invalid characters, while in Windows-1252 most of them are valid; I'm guessing the application performs too many bytes-to-text conversions without taking encoding into account (either on your side or theirs)
and finally, other languages of India also have no ANSI codepage
I'm on Linux right now, but if you can, I suggest running the programs via AppLocale under various locales. I recommend Hindi, Japanese and Turkish – for the best chance of revealing bugs.
But my bet is that they read that XML off the wire and convert it to a string with the default encoding, and it blows up – roughly the sketch below.
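In code, the suspected bug and its fix look roughly like this (ReceiveBody is a hypothetical stand-in for however they read the request):

// using System.Text;
byte[] wire = ReceiveBody();                       // hypothetical: raw request bytes
string broken = Encoding.Default.GetString(wire);  // mangles non-ASCII bytes on a Unicode-only locale
string correct = Encoding.UTF8.GetString(wire);    // honour the encoding the XML actually declares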
I have been trying to figure out the difference for quite some time now. The issue is with a file in ANSI encoding that has Japanese characters like: ‚È‚‚Æ‚à1‚‚ÌINCREMENTs‚ª•K—v‚Å‚·. Its equivalent in Shift-JIS is 少なくとも1つのINCREMENT行が必要です, which is the Japanese that is expected.
I need to display these characters on a webpage after reading them from the file (in ANSI). There are some other files in UTF-8 that display their characters correctly and don't show this problem. I am finding it difficult to figure out what the difference is and how to change the encoding to do the right thing here.
I use C# to read this file and display it, and I also need to write the string back into the file if it's modified on the web. Any encoding and decoding schemes for this?
As far as code pages are concerned, "ANSI" (and Encoding.Default in .NET) basically just means "the non-Unicode codepage used by this system" - exactly what codepage that is, depends on how the system is configured, but on a Western European system, it's likely to be Windows-1252.
For the system that text comes from, "ANSI" would appear to mean Shift-JIS - so unless your system has the same code page, you'll need to tell your code to read the text as Shift-JIS.
Assuming you're reading the file with a StreamReader, there are various constructors that take an Encoding, so just grab a Shift-JIS encoding with Encoding.GetEncoding("shift_jis") or Encoding.GetEncoding(932) and use it to construct your StreamReader.
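Roughly like this (the file name is a placeholder; on .NET Core/5+, register CodePagesEncodingProvider first or Shift-JIS won't be available):

// using System.IO; using System.Text;
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // .NET Core/5+ only
var sjis = Encoding.GetEncoding("shift_jis");
string text;
using (var reader = new StreamReader("input.txt", sjis))
    text = reader.ReadToEnd();
// When writing back, as the question asks, use the same encoding:
File.WriteAllText("input.txt", text, sjis);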
I'm reading a file using:
var source = File.ReadAllText(path);
and the character © wasn't being loaded correctly.
Then, I changed it to:
var source = File.ReadAllText(path, Encoding.UTF8);
and nothing.
I decided to try using
var source = File.ReadAllText(path, Encoding.Default);
and it worked perfectly.
Then I debugged it and tried to find which Encoding did the trick, and I found that it was UTF-7.
What I want to know is:
Is it recommended to use Encoding.Default, and can it guarantee all the characters of the file will be read without problems?
Encoding.Default will only guarantee that the plain 7-bit ASCII range is read correctly; everything above that depends on the machine's current ANSI code page. On the other hand, if you try to read a file not encoded with UTF-8 in UTF-8 mode, you'll get corrupted characters like you did.
For instance, if the file is encoded as UTF-16 and you read it in UTF-16 mode, you'll be fine even if the file does not contain a single character that actually needs UTF-16. It all boils down to the file's encoding.
You'll need to save and reopen with the same encoding to be safe from corruption. Note that UTF-7 only appeared to work here because the .NET UTF-7 decoder passes bytes it doesn't recognise straight through; UTF-7 is obsolete and is not the default anywhere in .NET, so don't build on it.
It is not recommended to use Encoding.Default.
Quote from MSDN:
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended.
To ensure that encoded bytes are decoded properly, your application should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
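Following that advice, a small sketch of a safe round trip with a preamble (using the path variable from the question):

// using System.IO; using System.Text;
// Write with a BOM so any reader can detect the encoding later.
File.WriteAllText(path, source, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));
// Called without an explicit encoding, ReadAllText detects UTF-8/UTF-16 BOMs itself.
string roundTripped = File.ReadAllText(path);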
It sounds like you are interested in auto-detecting the encoding of a file, in some sort of situation where you are not in control of the encoding used to save it. There are several questions on StackOverflow addressing this; some cursory browsing points to Determine a string's encoding in C# as a pretty good one. My favorite answer is the one pointing to a C# port of Mozilla's universal charset detector.
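For example, with the Ude package (a C# port of Mozilla's detector; the API below is from memory, so treat it as a sketch rather than gospel):

// using System.IO; using System.Text; using Ude;
byte[] bytes = File.ReadAllBytes(path);
var detector = new CharsetDetector();
detector.Feed(bytes, 0, bytes.Length);
detector.DataEnd();
// Charset is null when nothing was detected; fall back to something sane.
var encoding = Encoding.GetEncoding(detector.Charset ?? "utf-8");
string text = encoding.GetString(bytes);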
I think your file is in UTF-7 encoding, nothing more.