How to encode & decode non Ascii characters? - c#

I am developing an application in which i want to encode the Spanish text.
But the problem is that,it doesn't encode the special characters such as á, é, í, ó, ú, ü,Á, É, Í, Ó, Ú, Ü,Ñ,ñ .
How can i do this?i want to encode-decode the spanish text.

For international support using simple UTF-8 encoding to encode/decode your data should be enough.
Utf-8 has a beautiful capability to be able to read ASCII with one byte, as ordinary ASCII, and Unicode characters with 2 bytes. So it's able "to shrink" when it's necesary.
For complete C# documentation look on
UTF-8
EDIT
Encoding enc = new UTF8Encoding(true, true);
string value = " á, é, í, ó, ú, ü,Á, É, Í, Ó, Ú, Ü,Ñ,ñ ";
byte[] bytes= enc.GetBytes(value); //convert to BYTE array
//save in some file
//after can read from the file like
string decodedString = enc.GetString(byteArrayReadFromFile);

ok,I am answering my own question ,Hope it will help someone; to print spanish or any other non-ascii character in the given string replace all non-ascii characters by their unicode escape character set
E.g repalce á by \u00e1
And then simply print the string.
i.e
string str="árgrgrgrááhhttá";
str=str.Replace("á", "\u00e1");

Related

Chinese Simplified to Hex GB2312 encoding in C#

I am having issue trying to convert a string containing Simplified Chinese to double byte encoding (GB2312). This is for printing Chinese characters to a zebra printer.
The specs I am looking at show an example with the text of "冈区色呆" which they show as converting to a hex value of 38_54_47_78_49_2b_34_74.
In my C# code I am trying to convert this using the below code as a test. My result seems to be off by 7 in the leading hex value. What am I missing here?
private const string SimplifiedChineseChars = "冈区色呆";
[TestMethod]
public void GetBackCorrectHexValues()
{
byte[] bytes = Encoding.GetEncoding(20936).GetBytes(SimplifiedChineseChars);
string hex = BitConverter.ToString(bytes).Replace("-", "_");
//I get the following: B8_D4_C7_F8_C9_AB_B4_F4
//I am expecting: 38_54_47_78_49_2b_34_74
}
The only thing that makes sense to me is that 38_54_47_78_49_2b_34_74 is some form of 7-bit encoding.
Interestingly, a 7-bit version of the GB2312 encoding does exist, and is called the HZ character encoding.
Here is the wikipedia entry on HZ. Interesting parts:
The HZ ... encoding was invented to facilitate the use of Chinese characters through e-mail, which at that time only allowed 7-bit characters.
the HZ code uses only printable, 7-bit characters to represent Chinese characters.
And, according to this Microsoft reference page on EncodingInfo.GetEncoding, this character encoding is supported in .NET:
52936 hz-gb-2312 Chinese Simplified (HZ)
If I try your code, and replace the character encoding to use HZ, I get:
static void Main(string[] args)
{
const string SimplifiedChineseChars = "冈区色呆";
byte[] bytes = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
string hex = BitConverter.ToString(bytes).Replace("-", "_");
Console.WriteLine(hex);
}
Output:
7E_7B_38_54_47_78_49_2B_34_74_7E_7D
So, you basically get exactly what you are looking for, except that it adds the escape sequences ~{ and ~} before and after the chinese character bytes. Those escape sequences are necessary because this encoding supports mixing ASCII character bytes (single byte encoding) with GB chinese character bytes (double byte encoding). The escape sequences mark the areas that should not be interpreted as ASCII.
If you choose to use the hz-gb-2312 encoding, you would have to strip any unwanted escape sequences yourself, if you think you don't need them. But, perhaps you do need them. You'll have to figure out exactly what your printer is expecting.
Alternatively, if you really don't want to have those escape sequences and if you are not worried about having to handle ASCII characters, and are confident that you only have to deal with chinese double byte characters, then you could choose to stick with using the vanilla GB2312 encoding, and then drop the most significant bit of every byte yourself to essentially convert the results to 7-bit encoding.
Here is what the code could look like. Notice that I mask each byte value with 0x7F to drop the 8th bit.
static void Main(string[] args)
{
const string SimplifiedChineseChars = "冈区色呆";
byte[] bytes = Encoding.GetEncoding("gb2312") // vanilla gb2312 encoding
.GetBytes(SimplifiedChineseChars)
.Select(b => (byte)(b & 0x7F)) // retain 7 bits only
.ToArray();
string hex = BitConverter.ToString(bytes).Replace("-", "_");
Console.WriteLine(hex);
}
Output:
38_54_47_78_49_2B_34_74

convert utf-8 in sting into character

How can i convert a string that contain UTF8 encode into a string that contain only character ?
At the beginning I have a UTF8 string example : "Thành" but after some action i perfom it convert all UTF8 character into UTF8 encode ( in this case it convert "Thành" into "Thành" ). How can i convert it back to origin string ? ( convert "Thành" into "Thành" ). I'm using c#. Thank you all
The text you see is encoded with XML/HTML numeric character references, nothing to do with UTF-8. You can use HttpUtility.HtmlDecode to decode it.

Change Encoding in C#?

Theoretical question :
Let's say there is one source which knows only how to transmit ASCII chars. (0..127)
And let's say there is an endpoint which receives these chars .
Can the endpoint decode those chars as utf8 ?
ascii chars
...
...
|
|
V
read as utf ?
Something like this pseudo code :
var txt="אבג";
var _bytes=Encoding.ASCII.GetBytes(txt); <= it wont recognize [א] here
...transmit...
var myUtfString=Encoding.UTF8.GetString(getBytesFromWire(); <= some magic has to be done here
That is possible, but not using UTF8.
UTF8 works by encoding multibyte characters into sequences of bytes that are between 128 and 255.
Your ASCII protocol will not be able to transmit those bytes.
Instead, you need some mechanism to store arbitrary Unicode codepoints or bytes in pure ASCII text:
You can encode the Unicode text using any encoding to get a stream of (non-ASCII) bytes, then transmit those bytes using Base64 encoding
You can use the UTF7 encoding to encode Unicode codepoints using pure ASCII characters.
This will be substantially more space-efficient than Base64 if your text is mostly ASCII.
var txt = "אבג";
var str = Convert.ToBase64String(Encoding.UTF8.GetBytes(txt)); //<--ASCII
//Transmit
var txt2 = Encoding.UTF8.GetString(Convert.FromBase64String(str));

What's the difference between UTF8/UTF16 and Base64 in terms of encoding

In. c#
We can use below classes to do encoding:
System.Text.Encoding.UTF8
System.Text.Encoding.UTF16
System.Text.Encoding.ASCII
Why there is no System.Text.Encoding.Base64?
We can only use Convert.From(To)Base64String method, what's special of base64?
Can I say base64 is the same encoding method as UTF-8? Or UTF-8 is one of base64?
UTF-8 and UTF-16 are methods to encode Unicode strings to byte sequences.
See: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Base64 is a method to encode a byte sequence to a string.
So, these are widely different concepts and should not be confused.
Things to keep in mind:
Not every byte sequence represents an Unicode string encoded in UTF-8 or UTF-16.
Not every Unicode string represents a byte sequence encoded in Base64.
Base64 is a way to encode binary data, while UTF8 and UTF16 are ways to encode Unicode text. Note that in a language like Python 2.x, where binary data and strings are mixed, you can encode strings into base64 or utf8 the same way:
u'abc'.encode('utf16')
u'abc'.encode('base64')
But in languages where there's a more well-defined separation between the two types of data, the two ways of representing data generally have quite different utilities, to keep the concerns separate.
UTF-8 is like the other UTF encodings a character encoding to encode characters of the Unicode character set UCS.
Base64 is an encoding to represent any byte sequence by a sequence of printable characters (i.e. A–Z, a–z, 0–9, +, and /).
There is no System.Text.Encoding.Base64 because Base64 is not a text encoding but rather a base conversion like the hexadecimal that uses 0–9 and A–F (or a–f) to represent numbers.
Simply speaking, a charcter enconding, like UTF8 , or UTF16 are useful for to match numbers, i.e. bytes to characters and viceversa, for example in ASCII 65 is matched to "A" , while a base encoding is used mainly to translate bytes to bytes so that the resulting bytes converted from a single byte are printable and are a subset of the ASCII charachter encoding, for that reason you can see Base64 also as a bytes to text encoding mechanism. The main reason to use Base64 is to be trasmit data over a channel that doesn't allow binary data transfer.
That said, now it should be clear that you can have a stream encoded in Base64 that rapresent a stream UTF8 encoded.

ASCII Encoding and Umlauts and Accents

I have a requirement to produce text files with ASCII encoding. I have a database full of Greek, French, and German characters with Umlauts and Accents. Is this even possible?
string reportString = report.makeReport();
Dictionary<string, string> replaceCharacters = new Dictionary<string, string>();
byte[] encodedReport = Encoding.ASCII.GetBytes(reportString);
Response.BufferOutput = false;
Response.ContentType = "text/plain";
Response.AddHeader("Content-Disposition", "attachment;filename=" + reportName + ".txt");
Response.OutputStream.Write(encodedReport, 0, encodedReport.Length);
Response.End();
When I get the reportString back the characters are represented faithfully. When I save the text file I have ? in place of the special characters.
As I understand it the ASCII standard is for American English only and something UTF 8 would be for the international audience. Is this a correct?
I'm going to make the statement that if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly.
Or, am I way off and doing/saying something stupid?
You cannot represent accents and umlauts in an ASCII encoded file simply because these characters are not defined in the standard ASCII charset.
Before Unicode this was handled by "code pages", you can think of a code page as a mapping between Unicode characters and the 256 values that can fit into a single byte (obviously, in every code page most of the Unicode characters are missing).
The original ASCII code page includes only English letters - but it's unlikely someone really wants the original 7-bit code page, they probably call any 8-bit character set ASCII.
The English code page known as Latin-1 is ISO-8859-1 or Windows-1252 (the first is the ISO standard, the second is the closest code page supported by Windows).
To support characters not in Latin-1 you need to encode using different code pages, for example:
874 — Thai
932 — Japanese
936 — Chinese (simplified) (PRC, Singapore)
949 — Korean
950 — Chinese (traditional) (Taiwan, Hong Kong)
1250 — Latin (Central European languages)
1251 — Cyrillic
1252 — Latin (Western European languages)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Latin (Baltic languages)
1258 — Vietnamese
UTF-8 is something completely different, it encodes the entire Unicode character set by using variable number of bytes per characters, numbers and English letters are encoded the same as ASCII (and Windows-1252) most other languages are encoded at 2 to 4 bytes per character.
UTF-8 is mostly compatible with ASCII systems because English is encoded the same as ASCII and there are no embedded nulls in the strings.
Converting between .net strings (UTF-16LE) and other encoding is done by the System.Text.Encoding class.
IMPORTANT NOTE: the most important thing is that the system on the receiving end will use the same code page and teh system on the sending end - otherwise you will get gibberish.
The ASCII characer set only contains A-Z in upper and lowe case, digits, and some punctuation. No greek characters, no umlauts, no accents.
You can use a character set from the group that is sometimes referred to as "extended ASCII", which uses 256 characters instead of 128.
The problem with using a different character set than ASCII is that you have to use the correct one, i.e. the one that the receiving part is expecting, or it will fail to interpret any of the extended characters correctly.
You can use Encoding.GetEncoding(...) to create an extended encoding. See the reference for the Encoding class for a list of possible encodings.
You are correct.
Pure US ASCII is a 7-bit encoding, featuring English characters only.
You need a different encoding to capture characters from other alphabets. UTF-8 is a good choice.
UTF-8 is backward compatible with ASCII, so if you encode your files as UTF-8, then ASCII clients can read whatever is in their character set, and Unicode clients can read all the extended characters.
There's no way to get all the accents you want in ASCII; some accented characters (like ü) are however available in the "extended ASCII" (8-bit) character set.
Various of the encodings mentioned by other answers can be loosely described as extended ASCII.
When your users are asking for ASCII encoding, they are probably asking for one of these.
A statement like "if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly" risks sounding pedantic to a non-technical user. An alternative is to get a sample of what they want (probably either the ANSI or OEM code page of their PC), determine the appropriate code page, and specify that.
The above is only partially correct. While it's true that you can't encode those characters in ASCII, you can represent them. They exist because some typewriters and early computers couldn't handle those characters.
Ä=Ae
ä=ae
ö=oe
Ö=Oe
ü=ue
Ü=Ue
ß=sz
Edit:
Andyraddaz has already written code that replaces lots of Unicode Characters with ASCII Representations. They might not be correct for some Languages/Cultures, but at least you wont have encoding errors.
https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c

Categories