How do I hard code invalid chars in a string in VS - c#

There is this environment I'm working on which only allows some very limited namespaces. I've come up with an encoding struct which encodes a file into a single hard-coded string, and then I can load the string as a file at runtime.
After I refined the struct to treat the char type as an unsigned 16-bit value, I ran into a problem: not all chars can be displayed and hard-coded into a string, and sometimes a generated string comes back shortened when converted back. Is there a better way to approach this?
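(As a hedged illustration of the kind of failure described, assuming raw bytes are packed two per char: values in the surrogate range 0xD800-0xDFFF cannot survive a round trip through any Unicode encoder.)
char packed = (char)0xD801; // a lone high surrogate, as can happen when two arbitrary bytes are packed into one char
string s = new string(packed, 1);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(s); // the encoder replaces the lone surrogate
string back = System.Text.Encoding.UTF8.GetString(utf8);
System.Console.WriteLine((int)back[0]); // 65533 (U+FFFD), not 0xD801: the data did not survive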

Related

Convert python byte in string format to byte array in c#

In my Python code I have a value as bytes; when I print it, it gives something like this:
b'\xe0\xb6\x9c\xe0\xb7\x92\xe0\xb6\xb1\xe0\xb7\x8a\xe0\xb6\xaf\xe0\xb6\xbb'
Now, that value as a string in C# is:
string byteString = "b'\xe0\xb6\x9c\xe0\xb7\x92\xe0\xb6\xb1\xe0\xb7\x8a\xe0\xb6\xaf\xe0\xb6\xbb'";
So the question is: how can I convert that byteString to a byte array in C#?
My actual problem is that I have a string value in Python which is not in English; when I run the Python code directly, it prints the string (non-English) just fine.
But when I run that Python code from C# via the Process class, it works for English and I can get the value, but it does not work for non-English characters; I get a null value. However, if I print that non-English value as bytes in Python, I can read it from C#. The problem is how to convert those bytes into a byte array in C#.
First, you want to modify your string slightly for usage in C#.
var str = "\xe0\xb6\x9c\xe0\xb7\x92\xe0\xb6\xb1\xe0\xb7\x8a\xe0\xb6\xaf\xe0\xb6\xbb";
You can then get your bytes fairly easily with LINQ.
var bytes = str.Select(x => Convert.ToByte(x)).ToArray();
An odd case that can occur when passing byte strings between Python and C# is that Python will sometimes emit plain ASCII characters for certain byte values, leaving you with a mixed string like b'\xe0ello'. C# recognizes \x##, but it also attempts to parse \x### and \x####, which tends to break when dealing with the output of a Python bytestring that mixes hex codes and ASCII.
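Putting the pieces together, here is a hedged end-to-end sketch (the class name is made up, and it assumes the Python bytes are UTF-8, which the sample above is):
using System;
using System.Linq;
using System.Text;

class PythonBytesDemo
{
    static void Main()
    {
        // the Python repr b'\xe0\xb6\x9c...' written with C# \x escapes; each char holds one byte value
        var str = "\xe0\xb6\x9c\xe0\xb7\x92\xe0\xb6\xb1\xe0\xb7\x8a\xe0\xb6\xaf\xe0\xb6\xbb";

        // one byte per char, as shown above
        var bytes = str.Select(c => Convert.ToByte(c)).ToArray();

        // the bytes are UTF-8, so decoding them recovers the original non-English text
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Encoding.UTF8.GetString(bytes));
    }
}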

Decode UTF-8 bytes as Latin-1 characters

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
Ù…Ø¯Ù„-Ø±Ù†Ú¯-Ù…ÙˆÛŒ-Ø¬Ø¯ÛŒØ¯-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This link converts it correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How can I do it in C#?
It is very hard to tell exactly what is going on from the description of your question. We would all be much better off if you provided us with an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (U+2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±", and that's because in UTF-16 Ø is U+00D8 and ± is U+00B1. So the incoming text was originally in UTF-8, but in the process of importing it into a .NET Unicode string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode string which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there, that's why the UTF-8 tool that you linked to can still kind of make sense out of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
I cannot easily reproduce your situation, so here are some things to try:
byte[] bytes = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes( garbledUnicodeString ); // Latin-1; Encoding.Default (the ANSI code page) is the other thing to try
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
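Putting those two steps together on the single character discussed above (a hedged sketch; "Ø±" is what "ر" looks like after the wrong decode):
string garbled = "Ø±"; // U+00D8 then U+00B1, i.e. the UTF-8 bytes D8 B1 mis-decoded as an 8-bit encoding
byte[] latin1Bytes = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes(garbled); // back to { 0xD8, 0xB1 }
string repaired = System.Text.Encoding.UTF8.GetString(latin1Bytes); // "ر" (U+0631)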

Where does (char)int get its symbols from?

Being a computer programming rookie, I was given homework involving the use of the playing card suit symbols. In the course of my research I came across an easy way to retrieve the symbols:
Console.Write((char)6);
gives you ♠
Console.Write((char)3);
gives you ♥
and so on...
However, I still don't understand what logic C# uses to retrieve those symbols. I mean, the ♠ symbol in the Unicode table is U+2660, yet I didn't use it. The ASCII table doesn't even contain these symbols.
So my question is, what is the logic behind (char)int?
For these low numbers (below 32), this is an aspect of the console rather than C#, and it comes from Code page 437 - though it won't include the ones that have other meanings that the console actually uses, such as tab, carriage return, and bell. This isn't really portable to any context where you're not running directly in a console window, and you should use e.g. 0x2660 instead, or just '\u2660'.
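For example, a minimal sketch of that recommendation (assuming the console font can actually render the glyph):
Console.OutputEncoding = System.Text.Encoding.UTF8; // don't rely on the console's legacy code page
Console.Write((char)0x2660); // ♠ written from its codepoint value
Console.Write('\u2660');     // the same character as an escape in a char literal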
The logic behind (char)int is that char is a UTF-16 code unit, one or two of which encode a Unicode codepoint. Codepoints are naturally ordinal numbers, being an identifier for a member of a character set. They are often written in hexadecimal, and specifically for Unicode, preceded by U+, for example U+2660.
UTF-16 is a mapping between codepoints and code units. Code units, being 16 bits, can be operated on as integers. Since a char holds one code unit, you can convert a short to a char. Since the different integer types can interoperate, you can convert an int to a char.
So, your short (or int) has meaning as text only when it represents a UTF-16 code unit for a codepoint that only has one code unit. (You could also convert an int holding a whole codepoint to a string.)
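To make that distinction concrete, a small sketch (char.ConvertFromUtf32 is the standard helper for turning a whole codepoint into a string; the playing-card codepoint is just an illustrative choice):
char oneUnit = (char)0x2660;                           // U+2660 fits in a single UTF-16 code unit
string fromCodePoint = char.ConvertFromUtf32(0x1F0A1); // U+1F0A1 needs a surrogate pair, so you get a string
System.Console.WriteLine(fromCodePoint.Length);        // 2 code units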
Of course, you could let the compiler figure it out for you and make it easier for your readers, too, with:
Console.Write('♥');
Also, forget ASCII. It's never the right encoding (except when it is). In case it's not clear, a string is a counted sequence of UTF-16 code units.

Why do some character literals cause Syntax Errors in Java?

In the latest edition of JavaSpecialists newsletter, the author mentions a piece of code that is un-compilable in Java
public class A1 {
Character aChar = '\u000d';
}
Try to compile it, and you will get an error such as:
A1.java:2: illegal line end in character literal
Character aChar = '\u000d';
^
Why does an equivalent piece of C# code not show such a problem?
public class CharacterFixture
{
char aChar = '\u000d';
}
Am I missing anything?
EDIT: My original intent was to ask how the C# compiler got Unicode file parsing correct (if it did), and why Java still sticks with the incorrect (if it is) parsing.
EDIT: Also, I want my original question title to be restored. Why such heavy editing? I strongly suspect it has changed my intent.
Java's compiler translates \uxxxx escape sequences as one of the very first steps, even before the tokenizer gets a crack at the code. By the time it actually starts tokenizing, there are no \uxxxx sequences anymore; they're already turned into the chars they represent, so to the compiler your Java example looks the same as if you'd actually typed a carriage return in there somehow. It does this in order to provide a way to use Unicode within the source, regardless of the source file's encoding. Even ASCII text can still fully represent Unicode chars if necessary (at the cost of readability), and since it's done so early, you can have them almost anywhere in the code. (You could say \u0063\u006c\u0061\u0073\u0073\u0020\u0053\u0074\u0075\u0066\u0066\u0020\u007b\u007d, and the compiler would read it as class Stuff {}, if you wanted to be annoying or torture yourself.)
C# doesn't do that. \uxxxx is translated later, with the rest of the program, and is only valid in certain types of tokens (namely, identifiers and string/char literals). This means it can't be used in certain places where it can be used in Java. cl\u0061ss is not a keyword, for example.
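A small sketch of that difference (hedged; the class name is made up, and the commented-out line is the one the C# compiler rejects):
public class EscapeDemo
{
    char aChar = '\u000d';    // fine: the escape is handled while lexing the char literal, not before
    int \u0061bc = 1;         // fine: \u0061 is 'a', and escapes are allowed inside identifiers
    // cl\u0061ss Broken { }  // error: the escape does not turn this into the keyword "class"
}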

How does WChar relate to Unicode and ASCII

I am about to show my total ignorance of how encoding works and different string formats.
I am passing a string to a compiler (Microsoft, as it happens, and for their Flight Simulator). The string is passed as part of an XML document which is used as the source for the compiler. This is created using standard .NET strings. I have not needed to specify any encoding or type, since the XML is just text.
The string is just a collection of characters. This is an example of one that gives the error:
ARG, AFL, AMX, ACA, DAH, CCA, AEL, AGN, MAU, SEY, TSC, AZA, AAL, ANA, BBC, CPA, CAL, COA, CUB, DAL, UGX, ELY, UAE, ERT, ETH, EEZ, GHA, IRA, JAL, NWA, KAL, KAC, LAN, LDI, MAS, MEA, PIA, QTR, RAM, RJA, SVA, SIA, SWR, ROT, THA, THY, AUI, UAL, USA, ACA, TAR, UZB, IYE, QFA
If I create the string in my managed C# program then there is no issue. However, this string is coming from a C++ program that can create the compiled file using its own compiler, which is not compliant with the MS one.
The MS compiler does not like the string. It throws two errors:
INTERNAL COMPILER ERROR: #C2621: Couldn't convert WChar string!
INTERNAL COMPILER ERROR: #C2029: Failed to convert attribute value from UNICODE!
Unfortunately there is not any useful documentation with the compiler on its errors. We just make the best of what we see!
I have seen other errors of this type but these contain hidden characters and control characters that I can trap and remove.
In this case I looked at the string as a char[] and could not see anything unusual, only what I expected: no values above the ASCII limit of 127 and no control characters.
I understand that WChar is something that C++ understands (but I don't), Unicode is a two byte representation of characters and ASCII is a one byte representation.
I would like to do two things - first identify a string that will fail if passed to the compiler and second fix the string. I assume the compiler is expecting ASCII.
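As a hedged sketch of the first part (the helper name is hypothetical, and it assumes the compiler really does want plain printable ASCII):
static bool LooksSafeForAscii(string s)
{
    // reject control characters and anything above the 7-bit printable range
    foreach (char c in s)
        if (c < 0x20 || c > 0x7E)
            return false;
    return true;
}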
EDIT
I told an untruth - in fact I do use encoding. I checked the code I used to convert a byte array into a string.
public static string Bytes2String(byte[] bytes, int start, int length) {
    string temp = Encoding.Default.GetString(bytes, start, length);
    return temp;
}
I realized that Default might be an issue but changing it to ASCII makes no difference. I am beginning to believe that the error message is not what it seems.
It looks like you are taking a byte array, and converting it as a string using the encoding returned by Encoding.Default.
It is recommended that you do not do this (in the Microsoft documentation).
You need to work out what encoding is being used in the C++ program to generate the byte array, and use the same one (or a compatible one) to convert the byte array back to a string again in the C# code.
E.g. if the byte array is using ASCII encoding, you could use:
System.Text.Encoding.ASCII.GetString(bytes, start, length);
or
System.Text.Encoding.UTF8.GetString(bytes, start, length);
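Or, as a minimal sketch of passing the encoding explicitly to the conversion helper from the question (the UTF-8 choice is an assumption; match whatever the C++ side actually writes):
public static string Bytes2String(byte[] bytes, int start, int length, System.Text.Encoding encoding)
{
    // pass the encoding explicitly instead of relying on Encoding.Default
    return encoding.GetString(bytes, start, length);
}

// usage, if the C++ side is known to emit UTF-8:
// string s = Bytes2String(buffer, 0, buffer.Length, System.Text.Encoding.UTF8);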
P.S. I hope Joel doesn't catch you ;)
I have to come clean that the compiler error has nothing to do with the encoding of the string. It turns out that it is the length of the string that is at fault. As per the sample, there are a number of entries separated by commas, and the compiler throws the rather unhelpful messages if the entry count exceeds 50.
However, thanks everyone for your help; it has raised the issue of encoding in my mind and I will now look at it much more carefully.
