Chinese Simplified to Hex GB2312 encoding in C#

I am having an issue trying to convert a string containing Simplified Chinese to a double-byte encoding (GB2312). This is for printing Chinese characters to a Zebra printer.
The specs I am looking at show an example with the text "冈区色呆", which they show as converting to a hex value of 38_54_47_78_49_2b_34_74.
In my C# code I am trying to convert this using the code below as a test. My result seems to be off by 8 in the leading hex digit of each byte. What am I missing here?
private const string SimplifiedChineseChars = "冈区色呆";
[TestMethod]
public void GetBackCorrectHexValues()
{
    byte[] bytes = Encoding.GetEncoding(20936).GetBytes(SimplifiedChineseChars);
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    //I get the following: B8_D4_C7_F8_C9_AB_B4_F4
    //I am expecting: 38_54_47_78_49_2b_34_74
}

The only thing that makes sense to me is that 38_54_47_78_49_2b_34_74 is some form of 7-bit encoding.
Interestingly, a 7-bit version of the GB2312 encoding does exist, and is called the HZ character encoding.
Here is the Wikipedia entry on HZ. Interesting parts:
The HZ ... encoding was invented to facilitate the use of Chinese characters through e-mail, which at that time only allowed 7-bit characters.
the HZ code uses only printable, 7-bit characters to represent Chinese characters.
And, according to this Microsoft reference page on EncodingInfo.GetEncoding, this character encoding is supported in .NET:
52936 hz-gb-2312 Chinese Simplified (HZ)
If I try your code, and replace the character encoding to use HZ, I get:
static void Main(string[] args)
{
    const string SimplifiedChineseChars = "冈区色呆";
    byte[] bytes = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    Console.WriteLine(hex);
}
Output:
7E_7B_38_54_47_78_49_2B_34_74_7E_7D
So you basically get exactly what you are looking for, except that it adds the escape sequences ~{ and ~} before and after the Chinese character bytes. Those escape sequences are necessary because this encoding supports mixing ASCII character bytes (single-byte encoding) with GB Chinese character bytes (double-byte encoding). The escape sequences mark the areas that should not be interpreted as ASCII.
If you choose to use the hz-gb-2312 encoding, you would have to strip any unwanted escape sequences yourself, if you think you don't need them. But perhaps you do need them; you'll have to figure out exactly what your printer is expecting.
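For example, if your input is purely Chinese text (so the only escape sequences are the single leading ~{ and trailing ~}), a minimal sketch of stripping them could be as simple as dropping the first two and last two bytes; a general solution would have to actually parse the escape sequences:
static void Main(string[] args)
{
    const string SimplifiedChineseChars = "冈区色呆";
    byte[] hzBytes = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
    // For purely Chinese input the encoder emits exactly one ~{ (0x7E 0x7B) at the
    // start and one ~} (0x7E 0x7D) at the end, so skip two bytes on each side.
    byte[] stripped = hzBytes.Skip(2).Take(hzBytes.Length - 4).ToArray(); // needs System.Linq
    string hex = BitConverter.ToString(stripped).Replace("-", "_");
    Console.WriteLine(hex); // 38_54_47_78_49_2B_34_74
}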
Alternatively, if you really don't want those escape sequences, are not worried about handling ASCII characters, and are confident that you only have to deal with Chinese double-byte characters, then you could stick with the vanilla GB2312 encoding and drop the most significant bit of every byte yourself to essentially convert the result to a 7-bit encoding.
Here is what the code could look like. Notice that I mask each byte value with 0x7F to drop the 8th bit.
static void Main(string[] args)
{
    const string SimplifiedChineseChars = "冈区色呆";
    byte[] bytes = Encoding.GetEncoding("gb2312")   // vanilla gb2312 encoding
        .GetBytes(SimplifiedChineseChars)
        .Select(b => (byte)(b & 0x7F))              // retain 7 bits only
        .ToArray();
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    Console.WriteLine(hex);
}
Output:
38_54_47_78_49_2B_34_74

Related

Convert Latin characters from Shift JIS to Latin characters in Unicode

I'm working on parsing files with Shift-JIS encoded strings within the binary data. My current code is this:
public static string DecodeShiftJISString(this byte[] data, int index, int length)
{
    // Convert the requested slice of Shift-JIS (code page 932) bytes to UTF-8, then decode.
    byte[] utf8Bytes = Encoding.Convert(Encoding.GetEncoding(932), Encoding.UTF8, data, index, length);
    return Encoding.UTF8.GetString(utf8Bytes);
}
It works fine and I am able to get usable strings from this method, although when I display strings with Latin characters in my WinForms application, I see that the characters are wider than normal.
[Image: Latin characters in Shift-JIS string]
I'm not sure if this is an issue with my encoding logic, or the way I'm supposed to display the strings (I just pass them directly into my controls). Any help would be appreciated!
These aren't normal ASCII characters, they're ‘fullwidth variants’ in the range U+FF01 fullwidth exclamation mark upwards. They're for lining up formatting when setting a mixture of Latin and CJK characters.
Unicode would prefer weird characters like this, which are just semantically-identical stylistic variants of existing characters, not to exist. But it has to include them to round-trip to legacy encodings like Shift-JIS. For this reason they are called Compatibility characters.
You can convert compatibility characters to their basic variants by using Unicode normalisation with a ‘K’ format such as NFKC. In Win32 you can do this using NormalizeString().
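In C#, the same normalisation is available through string.Normalize; a minimal sketch (the fullwidth sample string is just for illustration):
// NormalizationForm lives in System.Text
string fullwidth = "ＡＢＣ１２３";                         // fullwidth compatibility characters
string normalized = fullwidth.Normalize(NormalizationForm.FormKC);
Console.WriteLine(normalized);                             // ABC123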

ASCII Code Of Characters

In C# I need to get the ASCII code of some characters.
So I convert the char to byte or int, then print the result.
String sample="A";
int AsciiInt = sample[0];
byte AsciiByte = (byte)sample[0];
For characters with ASCII code 128 and less, I get the right answer.
But for characters greater than 128 I get irrelevant answers!
I am sure all characters are less than 0xFF.
Also I have tested System.Text.Encoding and got the same results.
For example: I get 172 for a char with an actual byte value of 129!
Actually, ASCII characters like ƒ, ‡, ‹, “, ¥, ©, Ï, ³, ·, ½, », Á each take 1 byte and go up to more than 193.
I guess there is a Unicode equivalent for them and .NET returns that because it interprets strings as Unicode!
What if someone needs to access the actual value of a byte, whether it is a valid known ASCII character or not?
But for characters greater than 128 I get irrelevant answers
No you don't. You get the bottom 8 bits of the UTF-16 code unit corresponding to the char.
Now if your text were all ASCII, that would be fine - because ASCII only goes up to 127 anyway. It sounds like you're actually expecting the representation in some other encoding - so you need to work out which encoding that is, at which point you can use:
Encoding encoding = ...;
byte[] bytes = encoding.GetBytes(sample);
// Now extract the bytes you want. Note that a character may be represented by more than
// one byte.
If you're essentially looking for an encoding which treats bytes 0 to 255 as U+0000 to U+00FF respectively, you should use ISO-8859-1, which you can access using Encoding.GetEncoding(28591).
You can't just ignore the issue of encoding. There is no inherent mapping between bytes and characters - that's defined by the encoding.
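For example, a small sketch of the ISO-8859-1 approach (assuming every character in the string falls in the U+0000 to U+00FF range):
string sample = "A©½Á";
byte[] bytes = Encoding.GetEncoding(28591).GetBytes(sample); // ISO-8859-1 / Latin-1
// bytes is { 65, 169, 189, 193 } - each byte equals the character's code point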
If I use your example of 131, on my system this produces â. However, since you're obviously on an Arabic system, you most likely have the Windows-1256 encoding, which produces ƒ for 131.
In other words, you need to use the correct encoding when converting characters to bytes and vice versa. In your case:
var sample = "ƒ";
var byteValue = Encoding.GetEncoding("windows-1256").GetBytes(sample)[0];
Which produces 131, as you seem to expect. Most importantly, this will work on all computers - if you want to have this system locale-specific, Encoding.Default can also work for you.
The only reason your method seems to work for values under 128 is that the first 128 Unicode code points correspond to the ASCII standard mapping. However, you're misusing the term ASCII - it really only refers to these 7-bit characters. What you're calling ASCII is actually an extended 8-bit charset - all characters with the 8th bit set are charset-dependent.
We're no longer in a world where you can assume your application will only run on computers with the same locale you have - .NET is designed for this, which is why all strings are Unicode. At the very least, read http://www.joelonsoftware.com/articles/Unicode.html for an explanation of how encodings work, and to get rid of some of the serious and dangerous misconceptions you seem to have.

How can I display extended Unicode character in a C# console?

I'm trying to display a set of playing cards, which have Unicode values in the 1F0A0 to 1F0DF range. Whenever I try to use chars with more than 4 hex digits in their escape code, I get errors. Is it possible to use these characters in this context? I'm using Visual Studio 2012.
char AceOfSpades = '\u1F0A0'; immediately gives me the error "Too many characters in character literal". This still shows up with either the Unicode or UTF8 output encodings. If I try to display '\u1F0A' like above, with Unicode it shows '?' and with UTF8 it shows 3 characters.
I tried all the given options for OutputEncoding with string AceOfSpades = "\U0001F0A0":
Default, Unicode, ASCII: ??
UTF7: +2DzcoA-
UTF8: four weird characters
UTF32 , BigEndianUnicode: IOException
Console.OutputEncoding = System.Text.Encoding.UTF32;, despite being an option, crashes even if it's the only line of code.
UTF16 was not on the list.
How can I check which version of Unicode I'm using?
In order to use characters from outside the Basic Multilingual Plane, you need to escape them with \U, not \u. That requires eight hexadecimal digits (which, to be honest, makes little sense, since all Unicode characters can be written with six hexadecimal digits).
However, the type char in .NET can only represent UTF-16 code units, meaning that characters outside the BMP require a pair of chars (a surrogate pair). So you have to make it a string.
string AceOfSpades = "\U0001F0A0";
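To illustrate the surrogate-pair point, a small sketch (the numeric values shown are simply the standard UTF-16 encoding of U+1F0A0):
string AceOfSpades = "\U0001F0A0";
Console.WriteLine(AceOfSpades.Length);                             // 2 - it takes two chars
Console.WriteLine((int)AceOfSpades[0]);                            // 55356 (0xD83C, high surrogate)
Console.WriteLine((int)AceOfSpades[1]);                            // 56480 (0xDCA0, low surrogate)
Console.WriteLine(char.ConvertFromUtf32(0x1F0A0) == AceOfSpades);  // True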
I am going to assume (until you edit your post for clarity) that your symbols are not displaying properly. If this is not the case, I will delete this answer.
Set your console's encoding to Unicode or UTF-8.
Console.OutputEncoding = System.Text.Encoding.Unicode;
or
Console.OutputEncoding = System.Text.Encoding.UTF8;
Make sure the font can display Unicode/UTF-8 characters (like Lucida Console).
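Putting the two together, a minimal sketch (assuming the chosen console font actually contains a glyph for the card symbol):
Console.OutputEncoding = System.Text.Encoding.UTF8;
string AceOfSpades = "\U0001F0A0";
Console.WriteLine(AceOfSpades); // prints the Ace of Spades card if the font supports it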
I am displaying 10 Egyptian hieroglyphs from the extended (supplementary) Unicode range in a Windows application (not a console) like this:
string single_character = "\U00013000";//first ancient hieroglyph
//get the Unicode index
Encoding enc = new UTF32Encoding(false, true, true);
byte[] b = enc.GetBytes(single_character);
Int32 code = BitConverter.ToInt32(b, 0);
for (int i = 0; i < 10; i++)
{
    //convert from int Unicode index to display character
    string glyph = Char.ConvertFromUtf32(code); //single one
    textBox1.Text += glyph;
    code++;
}
You also need a font that supports these.

Change Encoding in C#?

Theoretical question:
Let's say there is a source which knows only how to transmit ASCII chars (0..127).
And let's say there is an endpoint which receives these chars.
Can the endpoint decode those chars as UTF-8?
ascii chars
...
...
|
|
V
read as utf ?
Something like this pseudo-code:
var txt = "אבג";
var _bytes = Encoding.ASCII.GetBytes(txt); // <= it won't recognize [א] here
...transmit...
var myUtfString = Encoding.UTF8.GetString(getBytesFromWire()); // <= some magic has to be done here
That is possible, but not using UTF-8.
UTF-8 works by encoding non-ASCII characters as sequences of bytes whose values are between 128 and 255.
Your ASCII protocol will not be able to transmit those bytes.
Instead, you need some mechanism to store arbitrary Unicode codepoints or bytes in pure ASCII text:
You can encode the Unicode text using any encoding to get a stream of (non-ASCII) bytes, then transmit those bytes using Base64 encoding
You can use the UTF7 encoding to encode Unicode codepoints using pure ASCII characters.
This will be substantially more space-efficient than Base64 if your text is mostly ASCII.
var txt = "אבג";
var str = Convert.ToBase64String(Encoding.UTF8.GetBytes(txt)); //<--ASCII
//Transmit
var txt2 = Encoding.UTF8.GetString(Convert.FromBase64String(str));
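The UTF-7 alternative would look like the sketch below; note that UTF-7 is considered obsolete in current .NET, so the Base64 route above is usually the safer choice:
var txt = "אבג";
var asciiBytes = Encoding.UTF7.GetBytes(txt);   // every byte is in the 0..127 range
//Transmit
var txt2 = Encoding.UTF7.GetString(asciiBytes); // back to "אבג"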

How to recognize if a string contains unicode chars?

I have a string and I want to know whether it has Unicode characters inside or not
(i.e., whether it consists entirely of ASCII or not).
How can I achieve that?
Thanks!
If my assumptions are correct you wish to know if your string contains any "non-ANSI" characters. You can derive this as follows.
public void test()
{
    const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
    const string WithoutUnicodeCharacter = "an ANSI character:Æ";

    bool hasUnicode;

    //true
    hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
    Console.WriteLine(hasUnicode);

    //false
    hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
    Console.WriteLine(hasUnicode);
}

public bool ContainsUnicodeCharacter(string input)
{
    const int MaxAnsiCode = 255;
    return input.Any(c => c > MaxAnsiCode);
}
Update
This will detect extended ASCII. If you only test for the true ASCII character range (up to 127), then you could potentially get false positives for extended ASCII characters, which do not denote Unicode. I have alluded to this in my sample.
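If you do want the strict 7-bit interpretation instead, the same check is a one-line change (a sketch; the method name is just illustrative):
public bool ContainsNonAsciiCharacter(string input)
{
    const int MaxAsciiCode = 127; // strict 7-bit ASCII
    return input.Any(c => c > MaxAsciiCode);
}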
If a string contains only ASCII characters, a serialization + deserialization step using ASCII encoding should get back the same string
so a one-liner check in C# could look like this:
String s1 = "testभारत";
bool isUnicode = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(s1)) != s1;
ASCII defines only character codes in the range 0-127. Unicode is explicitly defined such as to overlap in that same range with ASCII. Thus, if you look at the character codes in your string, and it contains anything that is higher than 127, the string contains Unicode characters that are not ASCII characters.
Note that ASCII includes only the English alphabet. Thus, if you (for whatever reason) need to apply that same approach to strings that might contain accented characters (Spanish text, for example), ASCII is not sufficient and you need to look for another differentiator.
The ANSI character set [*] does extend the ASCII characters with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not overlap with ANSI in that range, so technically a Unicode string might contain characters that are not part of ANSI but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).
As for the actual code to do this, @chibacity's answer should work, although you should modify it to cover strict ASCII, because it won't work for ANSI.
[*] Also known as Windows Latin-1 (Windows-1252)
As long as it contains characters, it contains Unicode characters.
From System.String:
Represents text as a series of Unicode characters.
public static bool ContainsUnicodeChars(string text)
{
    return !string.IsNullOrEmpty(text);
}
You normally have to worry about different Unicode encodings when you have to:
Encode a string into a stream of bytes with a particular encoding.
Decode a string from a stream of bytes with a particular encoding.
Once you're into string land though, the encoding that the string was originally represented with, if any, is irrelevant.
Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
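To illustrate that boundary, a minimal sketch (the string contents are only an example):
string s = "héllo";                                        // a string is just Unicode characters
byte[] utf8 = Encoding.UTF8.GetBytes(s);                   // encode: 6 bytes
byte[] latin1 = Encoding.GetEncoding(28591).GetBytes(s);   // same string, different bytes: 5 bytes
string back = Encoding.UTF8.GetString(utf8);               // decode
Console.WriteLine(back == s);                              // True - the string itself carries no encoding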
Perhaps you might also find these questions relevant:
How can you strip non-ASCII characters from a string? (in C#)
C# Ensure string contains only ASCII
And this article by Jon Skeet: Unicode and .NET
This is another solution without using lambda expressions. It is in VB.NET, but you can convert it easily to C#:
Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
    Dim inputCharArray() As Char = inputstr.ToCharArray
    For i As Integer = 0 To inputCharArray.Length - 1
        If CInt(AscW(inputCharArray(i))) > 255 Then Return True
    Next
    Return False
End Function
