Convert Latin characters from Shift-JIS to Latin characters in Unicode - C#

I'm working on parsing files with Shift-JIS encoded strings within the binary data. My current code is this:
public static string DecodeShiftJISString(this byte[] data, int index, int length)
{
    // Use the (srcEncoding, dstEncoding, bytes, index, count) overload so the
    // index and length parameters are actually honoured
    byte[] utf8Bytes = Encoding.Convert(Encoding.GetEncoding(932), Encoding.UTF8, data, index, length);
    return Encoding.UTF8.GetString(utf8Bytes);
}
It works fine and I am able to get usable strings from this method, but when I display strings containing Latin characters in my WinForms application, the characters are wider than normal.
[Image: Latin characters in Shift-JIS string]
I'm not sure if this is an issue with my encoding logic, or the way I'm supposed to display the strings (I just pass them directly into my controls). Any help would be appreciated!

These aren't normal ASCII characters; they're 'fullwidth variants' in the range from U+FF01 FULLWIDTH EXCLAMATION MARK upwards. They exist for lining up formatting when typesetting a mixture of Latin and CJK characters.
Unicode would prefer that weird characters like these, which are just semantically identical stylistic variants of existing characters, not exist. But it has to include them to allow round-tripping with legacy encodings like Shift-JIS. For this reason they are called compatibility characters.
You can convert compatibility characters to their basic variants by using Unicode normalisation with a 'K' form such as NFKC. In Win32 you can do this using NormalizeString().
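In .NET the same normalisation is exposed as String.Normalize. A minimal sketch (the sample string is illustrative):

string fullwidth = "ＨＥＬＬＯ！１２３"; // fullwidth compatibility characters
// NFKC folds compatibility variants down to their basic equivalents
string normalized = fullwidth.Normalize(System.Text.NormalizationForm.FormKC);
Console.WriteLine(normalized); // HELLO!123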


WriteAllText, Character Encoding, £ and ?

Take the following example:
string testfile1 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\" + "test1.txt");
if (!System.IO.File.Exists(testfile1))
{
    System.IO.File.WriteAllText(testfile1, "£100", System.Text.Encoding.ASCII);
}
string testfile2 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\" + "test2.txt");
if (!System.IO.File.Exists(testfile2))
{
    System.IO.File.WriteAllText(testfile2, "£100", System.Text.Encoding.UTF8);
}
Note the encoding. The first outputs ?100. The second outputs £100.
I know the encoding is different, but can somebody explain why ASCII encoding can't write the £ sign?
ASCII doesn't include the "£" character. That is, there is no byte value (nor any multi-byte value; those don't exist in ASCII) that denotes that symbol, so the encoder shows you a "?" to tell you that. UTF-8, on the other hand, does include it.
See here for a list of all of the printable characters in ASCII.
If you must use ASCII, consider using "GBP" as mentioned here for Pound sterling. (Also might be relevant: Extended ASCII.)
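To see the substitution concretely, a quick sketch (Encoding.ASCII replaces anything it cannot encode with '?'):

byte[] bytes = System.Text.Encoding.ASCII.GetBytes("£100");
Console.WriteLine(BitConverter.ToString(bytes)); // 3F-31-30-30: the '£' became '?' (0x3F)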
How ASCII deals with certain characters depends largely on which code page you're using. £ isn't a character that is required or used universally within the Latin alphabet, so it didn't appear in the standard ASCII set.
Look at this article or this one on code pages to see how the character limitation was resolved and for an idea as to why it won't show up everywhere.
As Hans pointed out, ASCII was designed for Americans, using only code points 0-127; the negligible rest of the English-speaking world can live with that unless they try to use obscure symbols like £, whose code points fall outside the range 0-127. I presume you live in the UK and aim only at customers from the UK or Western Europe. Don't use Encoding.ASCII but Encoding.Default, which would be code page 1252 in the UK (not in Turkey, of course). You get real ASCII for every character in the ASCII range 0-127, but can also use characters in the range 128-255, where the pound symbol lives. Note, however, that if someone reads the file assuming it is encoded in UTF-8, the £ sign will garble the content, since its byte value does not exist in valid UTF-8. This is indicated by some weird glyph like �.
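A sketch of that suggestion, with the code page pinned explicitly to 1252 rather than relying on Encoding.Default (the file name is illustrative):

string testfile3 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\test3.txt");
// Windows-1252 maps '£' to the single byte 0xA3
System.IO.File.WriteAllText(testfile3, "£100", System.Text.Encoding.GetEncoding(1252));
// a UTF-8 reader would render that 0xA3 byte as � because it is not valid UTF-8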

Chinese Simplified to Hex GB2312 encoding in C#

I am having an issue trying to convert a string containing Simplified Chinese to a double-byte encoding (GB2312). This is for printing Chinese characters on a Zebra printer.
The specs I am looking at show an example with the text "冈区色呆", which they show as converting to the hex value 38_54_47_78_49_2b_34_74.
In my C# code I am trying to convert this using the code below as a test. Every byte in my result seems to be off by 0x80 from the expected value. What am I missing here?
private const string SimplifiedChineseChars = "冈区色呆";

[TestMethod]
public void GetBackCorrectHexValues()
{
    byte[] bytes = Encoding.GetEncoding(20936).GetBytes(SimplifiedChineseChars);
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    // I get the following: B8_D4_C7_F8_C9_AB_B4_F4
    // I am expecting:      38_54_47_78_49_2B_34_74
}
The only thing that makes sense to me is that 38_54_47_78_49_2b_34_74 is some form of 7-bit encoding.
Interestingly, a 7-bit version of the GB2312 encoding does exist, and is called the HZ character encoding.
Here is the wikipedia entry on HZ. Interesting parts:
The HZ ... encoding was invented to facilitate the use of Chinese characters through e-mail, which at that time only allowed 7-bit characters.
the HZ code uses only printable, 7-bit characters to represent Chinese characters.
And, according to this Microsoft reference page on EncodingInfo.GetEncoding, this character encoding is supported in .NET:
52936 hz-gb-2312 Chinese Simplified (HZ)
If I try your code and replace the character encoding with HZ, I get:
static void Main(string[] args)
{
    const string SimplifiedChineseChars = "冈区色呆";
    byte[] bytes = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    Console.WriteLine(hex);
}
Output:
7E_7B_38_54_47_78_49_2B_34_74_7E_7D
So you basically get exactly what you are looking for, except that it adds the escape sequences ~{ and ~} before and after the Chinese character bytes. Those escape sequences are necessary because this encoding supports mixing ASCII bytes (single-byte encoding) with GB Chinese character bytes (double-byte encoding). The escape sequences mark the regions that should not be interpreted as ASCII.
If you choose to use the hz-gb-2312 encoding, you will have to strip any unwanted escape sequences yourself, if you decide you don't need them. But perhaps you do need them; you'll have to figure out exactly what your printer expects.
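For example, if the whole string is Chinese text (so there is exactly one ~{ ... ~} block), stripping the escapes could look like this rough sketch (requires using System.Linq;):

byte[] raw = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
byte[] payload = raw
    .Skip(2)              // drop the leading ~{ (0x7E 0x7B)
    .Take(raw.Length - 4) // drop the trailing ~} (0x7E 0x7D)
    .ToArray();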
Alternatively, if you really don't want those escape sequences, are not worried about handling ASCII characters, and are confident that you only have to deal with Chinese double-byte characters, then you could stick with the vanilla GB2312 encoding and drop the most significant bit of every byte yourself, essentially converting the result to a 7-bit encoding.
Here is what the code could look like. Notice that I mask each byte value with 0x7F to drop the 8th bit.
static void Main(string[] args)
{
    const string SimplifiedChineseChars = "冈区色呆";
    byte[] bytes = Encoding.GetEncoding("gb2312")  // vanilla gb2312 encoding
        .GetBytes(SimplifiedChineseChars)
        .Select(b => (byte)(b & 0x7F))             // retain the low 7 bits only (requires using System.Linq;)
        .ToArray();
    string hex = BitConverter.ToString(bytes).Replace("-", "_");
    Console.WriteLine(hex);
}
Output:
38_54_47_78_49_2B_34_74

How can I display extended Unicode characters in a C# console?

I'm trying to display a set of playing cards, which have Unicode values in the 1F0A0 to 1F0DF range. Whenever I try to use characters whose code is more than 4 hex digits long, I get errors. Is it possible to use these characters in this context? I'm using Visual Studio 2012.
char AceOfSpades = '\u1F0A0'; immediately gives me the error "Too many characters in character literal" as I type it. This happens with both the Unicode and UTF8 output encodings. If I try to display '\u1F0A' instead, Unicode shows '?' and UTF8 shows 3 characters.
I tried all the given options for OutputEncoding with string AceOfSpades = "\U0001F0A0";
Default, Unicode, ASCII: ??
UTF7: +2DzcoA-
UTF8: four weird characters
UTF32, BigEndianUnicode: IOException
Console.OutputEncoding = System.Text.Encoding.UTF32;, despite being an option, crashes even if it's the only line of code.
UTF16 was not on the list.
How can I check which version of Unicode I'm using?
In order to use characters from outside the Basic Multilingual Plane, you need to escape them with \U, not \u. That requires eight hexadecimal digits (which, to be honest, makes little sense, since all Unicode characters can be written with six hexadecimal digits).
However, the type char in .NET can only represent UTF-16 code units, meaning that characters outside the BMP require a pair of chars (a surrogate pair). So you have to make it a string.
string AceOfSpades = "\U0001F0A0";
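You can verify that the string holds a surrogate pair rather than a single char (a small sketch):

string aceOfSpades = "\U0001F0A0";
Console.WriteLine(aceOfSpades.Length);                                // 2: two UTF-16 code units
Console.WriteLine(char.IsHighSurrogate(aceOfSpades[0]));              // True
Console.WriteLine(char.ConvertToUtf32(aceOfSpades, 0).ToString("X")); // 1F0A0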
I am going to assume (until you edit your post for clarity) that your symbols are not displaying properly. If this is not the case, I will delete this answer.
Set your console's encoding to Unicode or UTF-8.
Console.OutputEncoding = System.Text.Encoding.Unicode;
or
Console.OutputEncoding = System.Text.Encoding.UTF8;
Make sure the font can display Unicode/UTF-8 characters (like Lucida Console).
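Putting the two together, a minimal sketch (whether the card actually renders still depends on the console font):

Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine("\U0001F0A0"); // Ace of Spades, if the font supports it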
I am displaying 10 Egyptian hieroglyphs from the extended Unicode range in a Windows application (not a console) like this:
string single_character = "\U00013000"; // first ancient hieroglyph

// get the Unicode index
Encoding enc = new UTF32Encoding(false, true, true);
byte[] b = enc.GetBytes(single_character);
Int32 code = BitConverter.ToInt32(b, 0);

for (int i = 0; i < 10; i++)
{
    // convert from int Unicode index to a displayable character
    string glyph = Char.ConvertFromUtf32(code); // a single glyph
    textBox1.Text += glyph;
    code++;
}
You also need a font that supports these.

How to recognize if a string contains unicode chars?

I have a string and I want to know if it has Unicode characters inside or not
(that is, whether it consists entirely of ASCII or not).
How can I achieve that?
Thanks!
If my assumptions are correct you wish to know if your string contains any "non-ANSI" characters. You can derive this as follows.
public void test()
{
    const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
    const string WithoutUnicodeCharacter = "an ANSI character:Æ";

    bool hasUnicode;

    // true
    hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
    Console.WriteLine(hasUnicode);

    // false
    hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
    Console.WriteLine(hasUnicode);
}

public bool ContainsUnicodeCharacter(string input)
{
    const int MaxAnsiCode = 255;
    return input.Any(c => c > MaxAnsiCode);
}
Update
This check treats extended ASCII (codes up to 255) as non-Unicode. If you instead test only the true ASCII range (up to 127), you could get false positives for extended-ASCII characters, which do not necessarily denote Unicode. I have alluded to this in my sample.
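If you do want the strict 7-bit check instead, a sketch of that variant (the helper name is mine; requires using System.Linq;):

public static bool ContainsNonAsciiCharacter(string input)
{
    const int MaxAsciiCode = 127; // true ASCII ends at 127
    return input.Any(c => c > MaxAsciiCode);
}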
If a string contains only ASCII characters, a serialization + deserialization step using ASCII encoding should get back the same string
so a one-liner check in C# could look like this:
string s1 = "testभारत";
bool isUnicode = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(s1)) != s1;
ASCII defines only character codes in the range 0-127. Unicode is explicitly defined such as to overlap in that same range with ASCII. Thus, if you look at the character codes in your string, and it contains anything that is higher than 127, the string contains Unicode characters that are not ASCII characters.
Note that ASCII includes only the English alphabet. Thus, if you (for whatever reason) need to apply the same approach to strings that might contain accented characters (Spanish text, for example), ASCII is not sufficient and you need to look for another differentiator.
The ANSI character set [*] does extend the ASCII characters with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not overlap with ANSI in that range, so technically a Unicode string might contain characters that are not part of ANSI but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).
As for the actual code to do this, @chibacity's answer should work, although you should modify it to cover strict ASCII, because it won't work for ANSI.
[*] Also known as Windows Latin-1 (Win-1252).
As long as it contains characters, it contains Unicode characters.
From System.String:
Represents text as a series of Unicode characters.
public static bool ContainsUnicodeChars(string text)
{
    return !string.IsNullOrEmpty(text);
}
You normally have to worry about different Unicode encodings when you have to:
Encode a string into a stream of bytes with a particular encoding.
Decode a string from a stream of bytes with a particular encoding.
Once you're into string land though, the encoding that the string was originally represented with, if any, is irrelevant.
Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
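Where encodings do enter the picture is exactly at those two byte boundaries. A small sketch of a round trip (the literal is illustrative):

// encode: string -> bytes, with an explicitly chosen encoding
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("héllo");
// decode: bytes -> string, which must use the same encoding
string text = System.Text.Encoding.UTF8.GetString(bytes);
Console.WriteLine(text == "héllo"); // True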
Perhaps you might also find these questions relevant:
How can you strip non-ASCII characters from a string? (in C#)
C# Ensure string contains only ASCII
And this article by Jon Skeet: Unicode and .NET
Here is another solution without using lambda expressions. It is in VB.NET, but you can convert it easily to C#:
Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
    Dim inputCharArray() As Char = inputstr.ToCharArray()
    For i As Integer = 0 To inputCharArray.Length - 1
        If CInt(AscW(inputCharArray(i))) > 255 Then Return True
    Next
    Return False
End Function

ASCII Encoding and Umlauts and Accents

I have a requirement to produce text files with ASCII encoding. I have a database full of Greek, French, and German characters with Umlauts and Accents. Is this even possible?
string reportString = report.makeReport();
Dictionary<string, string> replaceCharacters = new Dictionary<string, string>();
byte[] encodedReport = Encoding.ASCII.GetBytes(reportString);
Response.BufferOutput = false;
Response.ContentType = "text/plain";
Response.AddHeader("Content-Disposition", "attachment;filename=" + reportName + ".txt");
Response.OutputStream.Write(encodedReport, 0, encodedReport.Length);
Response.End();
When I get the reportString back the characters are represented faithfully. When I save the text file I have ? in place of the special characters.
As I understand it, the ASCII standard is for American English only, and something like UTF-8 would be for an international audience. Is this correct?
I'm going to make the statement that if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly.
Or, am I way off and doing/saying something stupid?
You cannot represent accents and umlauts in an ASCII encoded file simply because these characters are not defined in the standard ASCII charset.
Before Unicode this was handled by "code pages", you can think of a code page as a mapping between Unicode characters and the 256 values that can fit into a single byte (obviously, in every code page most of the Unicode characters are missing).
The original ASCII code page includes only English letters, but it's unlikely someone really wants the original 7-bit code page; people tend to call any 8-bit character set "ASCII".
The English code page known as Latin-1 is ISO-8859-1 or Windows-1252 (the first is the ISO standard, the second is the closest code page supported by Windows).
To support characters not in Latin-1 you need to encode using different code pages, for example:
874 — Thai
932 — Japanese
936 — Chinese (simplified) (PRC, Singapore)
949 — Korean
950 — Chinese (traditional) (Taiwan, Hong Kong)
1250 — Latin (Central European languages)
1251 — Cyrillic
1252 — Latin (Western European languages)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Latin (Baltic languages)
1258 — Vietnamese
UTF-8 is something completely different: it encodes the entire Unicode character set using a variable number of bytes per character. Numbers and English letters are encoded the same as in ASCII (and Windows-1252); most other languages are encoded at 2 to 4 bytes per character.
UTF-8 is mostly compatible with ASCII systems because English is encoded the same as ASCII and there are no embedded nulls in the strings.
Converting between .NET strings (UTF-16LE) and other encodings is done by the System.Text.Encoding class.
IMPORTANT NOTE: the most important thing is that the system on the receiving end uses the same code page as the system on the sending end - otherwise you will get gibberish.
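A minimal sketch of such a conversion, assuming both ends have agreed on Windows-1252:

System.Text.Encoding win1252 = System.Text.Encoding.GetEncoding(1252);
byte[] bytes = win1252.GetBytes("Grüße"); // .NET string (UTF-16) -> code-page bytes
string text = win1252.GetString(bytes);   // the same code page decodes it back intact
Console.WriteLine(text);                  // Grüße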
The ASCII character set only contains A-Z in upper and lower case, digits, and some punctuation. No Greek characters, no umlauts, no accents.
You can use a character set from the group that is sometimes referred to as "extended ASCII", which uses 256 characters instead of 128.
The problem with using a different character set than ASCII is that you have to use the correct one, i.e. the one that the receiving part is expecting, or it will fail to interpret any of the extended characters correctly.
You can use Encoding.GetEncoding(...) to create an extended encoding. See the reference for the Encoding class for a list of possible encodings.
You are correct.
Pure US ASCII is a 7-bit encoding, featuring English characters only.
You need a different encoding to capture characters from other alphabets. UTF-8 is a good choice.
UTF-8 is backward compatible with ASCII, so if you encode your files as UTF-8, then ASCII clients can read whatever is in their character set, and Unicode clients can read all the extended characters.
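You can see that compatibility directly; a quick sketch (requires using System.Linq;):

byte[] ascii = System.Text.Encoding.ASCII.GetBytes("GBP 100");
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes("GBP 100");
Console.WriteLine(ascii.SequenceEqual(utf8)); // True: pure ASCII text is byte-for-byte identical in UTF-8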
There's no way to get all the accents you want in ASCII; some accented characters (like ü) are however available in the "extended ASCII" (8-bit) character set.
Several of the encodings mentioned in other answers can be loosely described as extended ASCII.
When your users are asking for ASCII encoding, they are probably asking for one of these.
A statement like "if the requirement is ASCII encoding we can't have the accents and umlauts represented correctly" risks sounding pedantic to a non-technical user. An alternative is to get a sample of what they want (probably either the ANSI or OEM code page of their PC), determine the appropriate code page, and specify that.
The above is only partially correct. While it's true that you can't encode those characters in ASCII, you can represent them with substitutions. The substitutions exist because some typewriters and early computers couldn't handle those characters:
Ä=Ae
ä=ae
ö=oe
Ö=Oe
ü=ue
Ü=Ue
ß=sz
Edit:
andyraddatz has already written code that replaces many Unicode characters with ASCII representations. They might not be correct for some languages/cultures, but at least you won't have encoding errors.
https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
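A sketch of the substitution idea (the map covers only the German characters listed above, and the member names are mine; requires using System.Collections.Generic;):

static readonly Dictionary<string, string> Transliterations = new Dictionary<string, string>
{
    ["Ä"] = "Ae", ["ä"] = "ae", ["Ö"] = "Oe", ["ö"] = "oe",
    ["Ü"] = "Ue", ["ü"] = "ue", ["ß"] = "sz"
};

static string ToAsciiFriendly(string input)
{
    foreach (KeyValuePair<string, string> pair in Transliterations)
        input = input.Replace(pair.Key, pair.Value);
    return input; // now safe to pass to Encoding.ASCII.GetBytes
}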
