I am about to show my total ignorance of how encoding works and different string formats.
I am passing a string to a compiler (Microsoft's, as it happens, for their Flight Simulator). The string is passed as part of an XML document which is used as the source for the compiler. This is created using standard .NET strings. I have not needed to specify any encoding or type, since the XML is just text.
The string is just a collection of characters. This is an example of one that gives the error:
ARG, AFL, AMX, ACA, DAH, CCA, AEL, AGN, MAU, SEY, TSC, AZA, AAL, ANA, BBC, CPA, CAL, COA, CUB, DAL, UGX, ELY, UAE, ERT, ETH, EEZ, GHA, IRA, JAL, NWA, KAL, KAC, LAN, LDI, MAS, MEA, PIA, QTR, RAM, RJA, SVA, SIA, SWR, ROT, THA, THY, AUI, UAL, USA, ACA, TAR, UZB, IYE, QFA
If I create the string using my C# managed program then there is no issue. However, this string is coming from a C++ program that can create the compiled file using its own compiler, which is not compliant with the Microsoft one.
The MS compiler does not like the string. It throws two errors:
INTERNAL COMPILER ERROR: #C2621: Couldn't convert WChar string!
INTERNAL COMPILER ERROR: #C2029: Failed to convert attribute value from UNICODE!
Unfortunately there is no useful documentation with the compiler on its errors. We just make the best of what we see!
I have seen other errors of this type but these contain hidden characters and control characters that I can trap and remove.
In this case I looked at the string as a Char[] and could not see anything unusual. Only what I expected. No values above the ASCII limit of 127 and no control characters.
I understand that WChar is something that C++ understands (but I don't), Unicode is a two-byte representation of characters and ASCII is a one-byte representation.
I would like to do two things - first identify a string that will fail if passed to the compiler and second fix the string. I assume the compiler is expecting ASCII.
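A minimal sketch of both steps, assuming the compiler really does expect printable ASCII (the method names and the "printable ASCII" definition are my own guesses, not documented compiler behaviour):

```csharp
using System;
using System.Linq;

class AsciiCheck
{
    // Returns true if any character falls outside printable ASCII.
    public static bool HasSuspectChars(string s) =>
        s.Any(c => c > 127 || char.IsControl(c));

    // Strips anything outside printable ASCII (a blunt fix; adjust as needed).
    public static string ToPrintableAscii(string s) =>
        new string(s.Where(c => c <= 127 && !char.IsControl(c)).ToArray());

    static void Main()
    {
        // Hypothetical sample with a hidden NUL character embedded
        string input = "ARG, AFL\u0000, AMX";
        Console.WriteLine(HasSuspectChars(input));  // True
        Console.WriteLine(ToPrintableAscii(input)); // ARG, AFL, AMX
    }
}
```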
EDIT
I told an untruth - in fact I do use encoding. I checked the code I used to convert a byte array into a string.
public static string Bytes2String(byte[] bytes, int start, int length) {
    return Encoding.Default.GetString(bytes, start, length);
}
I realized that Default might be an issue but changing it to ASCII makes no difference. I am beginning to believe that the error message is not what it seems.
It looks like you are taking a byte array, and converting it as a string using the encoding returned by Encoding.Default.
The Microsoft documentation recommends against doing this.
You need to work out what encoding is being used in the C++ program to generate the byte array, and use the same one (or a compatible one) to convert the byte array back to a string again in the C# code.
E.g. if the byte array is using ASCII encoding, you could use:
System.Text.Encoding.ASCII.GetString(bytes, start, length);
or
System.Text.Encoding.UTF8.GetString(bytes, start, length);
P.S. I hope Joel doesn't catch you ;)
I have to come clean that the compiler error has nothing to do with the encoding format of the string. It turns out that it is the length of the string that is at fault. As per the sample, there are a number of entries separated by commas. The compiler throws the rather unhelpful messages if the entry count exceeds 50.
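Given that finding, a simple guard is to count the entries before handing the string to the compiler. A sketch, with the 50-entry limit hard-coded from the observation above (the limit is empirical, not documented):

```csharp
using System;
using System.Linq;

class EntryCountCheck
{
    // Observed limit before the compiler errors out (empirical, undocumented).
    const int MaxEntries = 50;

    public static bool ExceedsLimit(string commaList) =>
        commaList.Split(',').Length > MaxEntries;

    static void Main()
    {
        // 53 codes, more than the failing sample's limit
        string list = string.Join(", ", Enumerable.Repeat("AAL", 53));
        Console.WriteLine(ExceedsLimit(list)); // True
    }
}
```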
However, thanks everyone for your help - it has raised the issue of encoding in my mind and I will now look at it much more carefully.
Related
I have an application that reads binary data from a database. Each byte array retrieved represents a string. The strings, though, have all come from different encodings (most commonly ASCII, UTF-8 BOM, and UTF-16 LE, but there are others). In my own application, I'm trying to convert the byte array back to a string, but the encoding that was used to go from string to bytes is not stored with the bytes. Is it possible in C# to determine or infer the encoding used from the byte array?
The use case is simplified below. Assume the byte array is always a string. Also assume the string can use any encoding.
byte[] bytes = Convert.FromBase64String(stringAsBytesAsBase64);
string originalString = Encoding.???.GetString(bytes);
For text that is XML, the XML specification gives requirements and how to determine the encoding.
In the absence of external character encoding information (such as
MIME headers), parsed entities which are stored in an encoding other
than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The
Text Declaration) containing an encoding declaration:
…
In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8.
—https://www.w3.org/TR/xml/#charencoding
It seems that the storage design was to drop any "information provided by an external transport protocol". It is possible that what was stored meets the specification. You can inspect your data.
If the data is complete, just let your XML processing do the job:
byte[] bytes = Convert.FromBase64String(stringAsBytesAsBase64);
using (var stream = new MemoryStream(bytes))
{
var doc = XDocument.Load(stream);
}
If you do need the XML back as text with a known encoding, you can then serialize it using whichever encoding you need.
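For example, a sketch of re-serializing with an explicit encoding via XmlWriterSettings (the method name here is my own):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

class SaveWithEncoding
{
    // Serializes the document with whatever encoding the caller chooses;
    // the XML declaration is written to match.
    public static byte[] SerializeAs(XDocument doc, Encoding encoding)
    {
        var settings = new XmlWriterSettings { Encoding = encoding };
        using var stream = new MemoryStream();
        using (var writer = XmlWriter.Create(stream, settings))
        {
            doc.Save(writer);
        }
        return stream.ToArray();
    }

    static void Main()
    {
        var doc = new XDocument(new XElement("root", "value"));
        byte[] utf8 = SerializeAs(doc, new UTF8Encoding(false)); // UTF-8, no BOM
        Console.WriteLine(Encoding.UTF8.GetString(utf8));
    }
}
```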
Someone downvoted this. Perhaps because it didn't start out with a clear answer:
Is it possible in C# to determine or infer the encoding used from the byte array?
No.
Below is the best you can do and you'll see why it's problematic:
You can start with a list of known Encodings.GetEncodings() and eliminate possibilities. In the end, you will have many known possibilities, many known impossibilities and potentially unknown possibilities (for encodings that aren't supported in .NET, if any). That is all as far as hard facts go.
You could then apply heuristics or some knowledge of expected content to narrow the list further. And if the results of applying each of the remaining encodings are all the same, then you've very probably got the correct text even if you didn't identify the original encoding.
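A sketch of that elimination step, using round-trip decoding with exception fallbacks (the candidate list is illustrative, not exhaustive):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class EncodingGuess
{
    // Decode with each candidate, then re-encode; keep candidates whose
    // round trip reproduces the original bytes. This only eliminates
    // impossibilities - it never proves which encoding was used.
    public static List<string> PlausibleEncodings(byte[] bytes)
    {
        var plausible = new List<string>();
        foreach (var name in new[] { "us-ascii", "utf-8", "utf-16", "iso-8859-1" })
        {
            var enc = Encoding.GetEncoding(name,
                EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
            try
            {
                string text = enc.GetString(bytes);
                if (enc.GetBytes(text).SequenceEqual(bytes))
                    plausible.Add(name);
            }
            catch (DecoderFallbackException) { /* bytes invalid in this encoding */ }
            catch (EncoderFallbackException) { /* decoded text not re-encodable */ }
        }
        return plausible;
    }

    static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("héllo");
        // Note how several candidates survive - that is exactly the problem.
        Console.WriteLine(string.Join(", ", PlausibleEncodings(utf8)));
    }
}
```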
Somehow I'm getting a weird result from a GetString(). So, in my project I got this code:
byte[] arrayBytes = System.Convert.FromBase64String(n["spo_fdat"].InnerText);
string str = System.Text.Encoding.UTF8.GetString(arrayBytes);
The InnerText Value and the code is in: https://dotnetfiddle.net/mMUlti
So, my problem is that somehow I'm getting this result on my Visual Studio:
While in the online compiler that I post above the output is as expected.
This output is an output for a printer and this \0 are destroying the format.
Anyone have a clue of what is going on and what should I do/try?
It looks like for some reason every other byte in your input is null. If you strip those out you get something that looks much more plausible as printer commands (though I am no expert). Hopefully you can verify things...
To do this all I did was added this line in:
arrayBytes = arrayBytes.Where((x,i)=>i%2==0).ToArray();
The Where clause takes the value (x) and index (i), and if the index mod 2 is 0 (i.e. it's even) the element is kept - if it's odd it is thrown away.
The output I get from this starts:
CT~~CD,~CC^~CT~
^XA~TA000~JSN^LT0^MNW^MTT^PON^PMN^LH0,0^JMA^PR2,2~SD15^JUS^LRN^CI0^XZ
^XA
^MMT
^PW607
^LL0406
There are some non-printing characters in there too that look like possible printing commands (e.g. the first character is 0x10, the "data link escape" character).
Edited afterthought:
The problem you have here is obviously a problem with the specification. It seems that your input is wrong. You need to talk to whoever generated it, find out the specification they are using to generate it, make sure their code matches that spec, and then write your code to accept that spec. With a solid specification you should both be writing compatible code.
Try inspecting the bytes instead. You'll see that what you have encoded in the base-64 string is much closer to what Visual Studio shows to you in comparison to the output from dotnetfiddle. Consoles usually don't escape non-printables (such as \0 - the null character) whereas Visual Studio string inspector does so in attempt to provide as much value to its user as possible.
Looking at your base-64 encoded data, it looks way more like UTF-16 than UTF-8. If you decode it like so, you'll perhaps get rid of the null characters in Visual Studio inspector as well.
Regardless of that, the base-64 data doesn't make much sense. More semantic context is required to figure out what the issue is.
According to inspection by Chris, it looks like the data is UTF-8 encoded in UTF-16.
You should be able to get proper results with the following:
var xml = //your base-64 input...
var arrayBytes = Convert.FromBase64String(xml);
var utf16 = Encoding.Unicode.GetString(arrayBytes);
var utf8Bytes = utf16.Select(c => (byte)c).ToArray();
var utf8 = Encoding.UTF8.GetString(utf8Bytes);
Console.WriteLine(utf8);
The opposite is probably how your input was created. However, you could also go for Chris' solution of ignoring every odd byte as it is basically the same with less weird encoding things going on (although this may be more explicit to what really goes on: UTF-8 inside UTF-16).
I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
مدل-رنگ-موی-جدید-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This link convert this correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How I can do it in c#?
It is very hard to tell exactly what is going on from the description of your question. We would all be much better off if you provided us with an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (u2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as a byte sequence of D8 B1, but what you see is "ر", and that's because in UTF-16 Ø is u00D8 and ± is u00B1. So, the incoming text was originally in UTF-8, but in the process of importing it to a dotNet Unicode String in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode String which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there, that's why the UTF-8 tool that you linked to can still kind of make sense out of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
I cannot easily reproduce your situation, so here are some things to try:
byte[] bytes = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes( garbledUnicodeString );
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
There is this environment I'm working on which only allows some very limited namespaces. I've come up with an encoding struct which encodes a file into a single hard-coded string, and then I can load the string as a file during the runtime.
After I refined the struct to use the char type as an unsigned 16-bit value, I encountered a problem: not all chars can be displayed and hard-coded into a string, and sometimes a generated string is shortened when cast back. Is there any way I can approach this with a better method?
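One common way around this (assuming it fits your constrained environment) is to avoid raw chars entirely and hard-code the file as Base64, which uses only printable ASCII and so survives source files and copy-paste unchanged:

```csharp
using System;

class HardCodedFile
{
    static void Main()
    {
        // Original file content (any bytes at all, including NUL and 0xFF).
        byte[] original = { 0x00, 0xFF, 0x10, 0x41 };

        // Base64 emits only printable ASCII, so the literal is safe to
        // paste into source code.
        string literal = Convert.ToBase64String(original);
        Console.WriteLine(literal); // AP8QQQ==

        // At runtime, decode it back to the exact original bytes.
        byte[] roundTrip = Convert.FromBase64String(literal);
        Console.WriteLine(roundTrip.Length == original.Length); // True
    }
}
```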
I'm pulling some internationalized text from a MS SQL Server 2005 database. As per the defaults for that DB, the characters are stored as UCS-2. However, I need to output the data in UTF-8 format, as I'm sending it out over the web. Currently, I have the following code to convert:
SqlString dbString = resultReader.GetSqlString(0);
byte[] dbBytes = dbString.GetUnicodeBytes();
byte[] utf8Bytes = System.Text.Encoding.Convert(System.Text.Encoding.Unicode,
System.Text.Encoding.UTF8, dbBytes);
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
string outputString = encoder.GetString(utf8Bytes);
However, when I examine the output in the browser, it appears to be garbage, no matter what I set the encoding to.
What am I missing?
EDIT:
In response to the answers below, the reason I thought I had to perform a conversion is because I can output literal multibyte strings just fine. For example:
OutputControl.Text = "カルフォルニア工科大学とチューリッヒ工科大学は共同で、太陽光を保管可能な燃料に直接変えることのできる装置の開発に成功したとのこと";
works. Here, OutputControl is an ASP.Net Literal. However,
OutputControl.Text = outputString; //Output from above snippet
results in mangled output as described above. My hypothesis was that the database's output was somehow getting mangled by ASP.Net. If that's not the case, then what are some other possibilities?
EDIT 2:
Okay, I'm stupid. It turns out that there's nothing wrong with the database at all. When I tried inserting my own literal double byte characters (材料,原料;木料), I could read and output them just fine even without any conversion process at all. It seems to me that whatever is inserting the data into the DB is mangling the characters somehow, so I'm going to look at that. With my verified, "clean" data, the following code works:
OutputControl.Text = dbString.ToString();
as the responses below indicate it should.
Your code does essentially the same as:
SqlString dbString = resultReader.GetSqlString(0);
string outputString = dbString.ToString();
string itself is a UNICODE string (specifically, UTF-16, which is 'almost' the same as UCS-2, except for codepoints not fitting into the lowest 16 bits). In other words, the conversions you are performing are redundant.
Your web app most likely mangles the encoding somewhere else as well, or sets a wrong encoding for the HTML output. However, that can't be diagnosed from the information you provided so far.
String in .net is 'encoding agnostic'.
You can convert bytes to string using a particular encoding to tell .net how to interprets your bytes.
You can convert string to bytes using a particular encoding to tell .net how you want your bytes served.
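A minimal illustration of those two directions, and of why there is no string-to-string "conversion":

```csharp
using System;
using System.Text;

class EncodingRoundTrip
{
    static void Main()
    {
        string text = "héllo";

        // string -> bytes: you choose how the characters are served.
        byte[] asUtf8  = Encoding.UTF8.GetBytes(text);    // 6 bytes
        byte[] asUtf16 = Encoding.Unicode.GetBytes(text); // 10 bytes

        // bytes -> string: you say how the bytes are to be interpreted.
        Console.WriteLine(Encoding.UTF8.GetString(asUtf8));     // héllo
        Console.WriteLine(Encoding.Unicode.GetString(asUtf16)); // héllo

        // Both strings above are identical: a .NET string is always
        // UTF-16 internally, so "converting a string to an encoding"
        // is not a meaningful operation.
    }
}
```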
But trying to convert a string to another string using encodings makes no sense at all.