In the latest edition of the JavaSpecialists newsletter, the author mentions a piece of code that does not compile in Java:
public class A1 {
    Character aChar = '\u000d';
}
Try compiling it, and you will get an error such as:
A1.java:2: illegal line end in character literal
Character aChar = '\u000d';
^
Why does an equivalent piece of C# code not show this problem?
public class CharacterFixture
{
    char aChar = '\u000d';
}
Am I missing anything?
EDIT: My original intention in asking was: how did the C# compiler get Unicode file parsing correct (if it did), and why does Java still stick with its incorrect (if it is) parsing?
EDIT: Also, I want my original question title to be restored. Why such heavy editing? I strongly suspect it changed the intent of my question.
Java's compiler translates \uxxxx escape sequences as one of the very first steps, even before the tokenizer gets a crack at the code. By the time it actually starts tokenizing, there are no \uxxxx sequences anymore; they're already turned into the chars they represent, so to the compiler your Java example looks the same as if you'd actually typed a carriage return in there somehow. It does this in order to provide a way to use Unicode within the source, regardless of the source file's encoding. Even ASCII text can still fully represent Unicode chars if necessary (at the cost of readability), and since it's done so early, you can have them almost anywhere in the code. (You could say \u0063\u006c\u0061\u0073\u0073\u0020\u0053\u0074\u0075\u0066\u0066\u0020\u007b\u007d, and the compiler would read it as class Stuff {}, if you wanted to be annoying or torture yourself.)
C# doesn't do that. \uxxxx is translated later, with the rest of the program, and is only valid in certain types of tokens (namely, identifiers and string/char literals). This means it can't be used in certain places where it can be used in Java. cl\u0061ss is not a keyword, for example.
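A small sketch of the difference in C# (illustrative only; the commented-out line is the part that would not compile):

public class UnicodeEscapes
{
    // Fine: inside a char literal the escape is just data, so the
    // carriage return never reaches the tokenizer as raw source text.
    char aChar = '\u000d';

    // Fine: escapes are allowed inside identifiers; this field is named "data".
    int d\u0061ta = 5;

    // Error if uncommented: escapes are not expanded before keyword
    // recognition, so cl\u0061ss is an identifier, not the keyword "class".
    // cl\u0061ss Broken { }
}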
In C#, the StringInfo and TextElementEnumerator classes provide methods and properties for working with text elements.
And here we can find the definition of a text element:
The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be any of the following: [...]
Yes, it says a text element is a grapheme in .NET. I tested this with some Unicode characters myself, and it really seemed true, until I tested the Korean letter '가'.
As we all know, some user-perceived characters consist of multiple code points, and we may face such code point sequences in input. That's the reason I'm using StringInfo and TextElementEnumerator instead of a plain String.
StringInfo and TextElementEnumerator correctly identified surrogate pairs. And "\u0061\u0308", a character which consists of multiple code points, was recognized as one text element, just as expected. But for "\u1100\u1161", they failed to report that it, too, is one text element.
"\u1100" is a leading letter "ㄱ", and "\u1161" is a vowel letter "ㅏ". They can be individual characters and shown to the users just as I write here and you can see them now. But if they are used together, they are rendered as one character "가" instead of "ㄱㅏ".
There are two ways to represent the Korean character "가":
Using a single code point, U+AC00, from the Hangul Syllables block.
Using two code points, U+1100 and U+1161, from the Hangul Jamo block.
Most of the time the former is used. The latter is rarely used; to be honest, I can't imagine when it's used at all.
Anyway, the first is just one precomposed letter, and the second is a sequence of a Lead and a Vowel which is treated as one character. When rendered they look exactly the same, and they are canonically equivalent.
Also, the following line returns true in C#:
"\u1100\u1161".Normalize() == "\uAC00"
I wonder why Normalize() works just fine here when C# doesn't think they are one complete text element.
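For reference, a minimal repro sketch of everything above; the commented results are what I'm seeing:

using System;
using System.Globalization;

class TextElementDemo
{
    static void Main()
    {
        string composed = "\uAC00";         // 가 as one precomposed syllable
        string decomposed = "\u1100\u1161"; // the same syllable as Lead + Vowel jamo

        Console.WriteLine(new StringInfo("\u0061\u0308").LengthInTextElements); // 1, as expected
        Console.WriteLine(new StringInfo(composed).LengthInTextElements);       // 1
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements);     // 2, not 1!

        Console.WriteLine(decomposed.Normalize() == composed); // True: canonically equivalent
    }
}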
I thought it had something to do with my .NET version, but it turns out that's not the case. It happens in Mono too.
I tested this with ICU as well, and it treated "\u1100\u1161" as one grapheme correctly!
I initially thought StringInfo and TextElementEnumerator could eliminate the need for ICU4C in some simple cases, so I'm very disappointed now.
Here's my question :
Am I doing something wrong here?
Or is a text element in .NET not a user-perceived character, unlike in ICU?
The basic issue here is that per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.
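If pulling in ICU is not an option, one partial workaround sketch (my own suggestion, with a hypothetical helper name) is to normalize to NFC before counting text elements. That collapses jamo sequences that have a precomposed form, but it will not help for grapheme clusters without one:

using System.Globalization;
using System.Text;

static class GraphemeUtil
{
    // Hypothetical helper: count text elements after NFC normalization,
    // so "\u1100\u1161" is counted as the single syllable "\uAC00".
    public static int CountTextElements(string s)
    {
        string nfc = s.Normalize(NormalizationForm.FormC);
        return new StringInfo(nfc).LengthInTextElements;
    }
}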
Being a computer programming rookie, I was given homework involving the use of the playing card suit symbols. In the course of my research I came across an easy way to retrieve the symbols:
Console.Write((char)6);
gives you ♠
Console.Write((char)3);
gives you ♥
and so on...
However, I still don't understand what logic C# uses to retrieve those symbols. I mean, the ♠ symbol in the Unicode table is U+2660, yet I didn't use it. The ASCII table doesn't even contain these symbols.
So my question is, what is the logic behind (char)int?
For these low numbers (below 32), this is an aspect of the console rather than C#, and it comes from Code page 437 - though it won't include the ones that have other meanings that the console actually uses, such as tab, carriage return, and bell. This isn't really portable to any context where you're not running directly in a console window, and you should use e.g. 0x2660 instead, or just '\u2660'.
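If you do want the suit symbols in a console app, a sketch of the more portable route (assuming the console font can actually display them):

Console.OutputEncoding = System.Text.Encoding.UTF8; // let the console emit non-ASCII
Console.Write('\u2660'); // ♠ by codepoint
Console.Write('♥');      // or paste the character directly into the source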
The logic behind (char)int is that char is a UTF-16 code unit, one or two of which encode a Unicode codepoint. Codepoints are naturally ordinal numbers: each one identifies a member of a character set. They are often written in hexadecimal and, specifically for Unicode, preceded by U+, for example U+2660.
UTF-16 is a mapping between codepoints and code units. Code units are 16 bits, so they can be operated on as integers. Since a char holds one code unit, you can convert a short to a char, and since the different integer types can interoperate, you can convert an int to a char.
So, your short (or int) has meaning as text only when it represents a UTF-16 code unit for a codepoint that only has one code unit. (You could also convert an int holding a whole codepoint to a string.)
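A short sketch of both cases (U+1F0A1, the playing card ace of spades, lies outside the Basic Multilingual Plane, so it needs two code units and cannot be a single char):

char spade = (char)0x2660;                   // int -> char: one code unit, '♠'
string ace = char.ConvertFromUtf32(0x1F0A1); // a whole codepoint -> string of two code units
Console.WriteLine(spade);
Console.WriteLine(ace.Length);               // 2: a surrogate pair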
Of course, you could let the compiler figure it out for you and make it easier for your readers, too, with:
Console.Write('♥');
Also, forget ASCII. It's never the right encoding (except when it is). In case it's not clear, a string is a counted sequence of UTF-16 code units.
I'm trying to use a DLL generated by ikvmc from a jar file compiled from Scala code (yeah my day is THAT great). The Scala compiler seems to generate identifiers containing dollar signs for operator overloads, and IKVM uses those in the generated DLL (I can see it in Reflector). The problem is, dollar signs are illegal in C# code, and so I can't reference those methods.
Any way to work around this problem?
You should be able to access the funky methods using reflection. Not a nice solution, but at least it should work. Depending on the structure of the API in the DLL it may be feasible to create a wrapper around the methods to localise the reflection code. Then from the rest of your code just call the nice wrapper.
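For example, a minimal sketch of that wrapper idea; the type and member names are hypothetical stand-ins, but Scala does encode the + operator as "$plus", so check the real names in Reflector:

using System;
using System.Reflection;

class ScalaOpWrapper
{
    private readonly object target;
    private readonly MethodInfo plus;

    public ScalaOpWrapper(object target)
    {
        this.target = target;
        // Look up the Scala-generated method whose name is illegal in C# source.
        this.plus = target.GetType().GetMethod("$plus");
    }

    // The rest of the code calls this nice wrapper instead.
    public object Plus(object a, object b)
    {
        return plus.Invoke(target, new object[] { a, b });
    }
}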
The alternative would be to hack on the IL in the target DLL and change the identifiers. Or do some post-build IL-hacking on your own code.
Perhaps you can teach IKVM to rename these identifiers so that they have no dollar sign? I'm not super familiar with it, but a quick search pointed me at these:
http://weblog.ikvm.net/default.aspx?date=2005-05-02
What is the format of the Remap XML file for IKVM?
String and complex data types in Map.xml for IKVM!
Good Hunting
Write synonyms for those methods in the Scala source, so the generated DLL also exposes names that are legal in C#:
def +(a: A, b: A): A = a + b
def plus(a: A, b: A): A = this.+(a, b)
I fear that you will have to use Reflection in order to access those members. Escaping simply doesn't work in your case.
But for those of you who are interested in the escaping mechanics, I've written an explanation.
In C# you can use the @ sign to escape keywords and use them as identifiers. However, this does not help with escaping invalid characters:
bool @bool = false;
There is a way to write identifiers differently by using a Unicode escape sequence:
int i\u0064; // '\u0064' == 'd'
id = 5;
Yes, this works. However, even with this trick you still cannot use the $ sign in an identifier. Trying...
int i\u0024; // '\u0024' == '$'
... gives the compiler error "Unexpected character '\u0024'". The identifier must still be a valid identifier! The C# compiler probably resolves the escape sequence in a kind of preprocessing step and treats the resulting identifier as if it had been typed normally.
So what is this escaping good for? Maybe it can help if someone uses a foreign-language character that is not on your keyboard:
int \u00E4; // German a-Umlaut
ä = 5;
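Putting the pieces together, a small sketch of what does and does not compile (the field names are arbitrary):

class EscapeDemo
{
    bool @bool = false; // @ escapes a keyword
    int i\u0064 = 5;    // identifier escape: this field is named "id"
    int \u00E4 = 7;     // non-ASCII identifier: this field is named "ä"
    // int i\u0024;     // error: '$' is still not a legal identifier character
}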
I need to check whether a given string is a valid C++ identifier. E.g.:
isValidCppIdentifier("_foo") // returns true
isValidCppIdentifier("9bar") // returns false
isValidCppIdentifier("var'") // returns false
I wrote some quick code, but it fails:
my regex is "[a-zA-Z_$][a-zA-Z0-9_$]*"
and I simply call regex.IsMatch(inputString).
Thanks.
It should work with some added anchoring:
"^[a-zA-Z_][a-zA-Z0-9_]*$"
Without the ^ and $ anchors, IsMatch succeeds if the pattern matches anywhere in the input, which is why "9bar" slipped through: the pattern matches its substring "bar".
If you really need to support ludicrous identifiers using Unicode, feel free to read one of the various versions of the standard and add all the ranges into your regexp (for example, pages 713 and 714 of http://www-d0.fnal.gov/~dladams/cxx_standard.pdf)
Matti's answer will work to sanitize identifiers before inserting into C++ code, but won't handle C++ code as input very well. It will be annoying to separate things like L"wchar_t string", where L is not an identifier. And there's Unicode.
Clang, Apple's compiler, which is built on a philosophy of modularity, provides a set of tokenizer functions. It looks like you would want clang_createTranslationUnitFromSourceFile and clang_tokenize.
I didn't check to see if it handles \Uxxxx or anything; I can't make any kind of guarantees. The last time I used LLVM was five years ago, and it wasn't the greatest experience... but not the worst either.
On the other hand, GCC certainly has it, although you have to figure out how to use cpp_lex_direct.
In C#, if you want a String to be taken literally, i.e. ignore escape characters, you can use:
string myString = @"sadasd/asdaljsdl";
However there is no equivalent in Java. Is there any reason Java has not included something similar?
Edit:
After reviewing some answers and thinking about it, what I'm really asking is:
Is there any compelling argument against adding this syntax to Java? Some negative to it, that I'm just not seeing?
Java has always struck me as a minimalist language - I would imagine that since verbatim strings are not a necessity (like properties for instance) they were not included.
For instance, in C# there are many quick ways to do things like properties:
public int Foo { get; set; }
and verbatim strings:
String bar = @"some
string";
Java tends to avoid as much syntactic sugar as possible. If you want getters and setters for a field, you must write them out:
private int foo;
public int getFoo() { return this.foo; }
public void setFoo(int foo) { this.foo = foo; }
and strings must be escaped:
String bar = "some\nstring";
I think it is because in a lot of ways C# and Java have different design goals. C# is rapidly developed, with many features constantly being added, most of which tend to be syntactic sugar. Java, on the other hand, is about simplicity and ease of understanding. A lot of the reasons Java was created in the first place were reactions against C++'s complexity of syntax.
I find "why" questions funny. C# is a newer language and tries to improve on what are seen as shortcomings in other languages such as Java. The simple answer to the "why" is that the Java standard does not define an @-style verbatim string syntax like C#'s.
As others have said, most of the time the reason you want to avoid escaping is regexes. In that case, use:
Pattern.quote()
I think one of the reasons is that regular expressions (which are a major motivation for this kind of string literal) were not part of the Java platform until Java 1.4 (if I remember correctly). There simply wasn't much need for this when the language was defined.
Java (unfortunately) doesn't have anything like this, but Groovy does:
assert '''hello,
world''' == 'hello,\nworld'
//triple-quotes for multi-line strings, adds '\n' regardless of host system
assert 'hello, \
world' == 'hello, world' //backslash joins lines within string
I really liked this feature of C# back when I did some .NET work. It was especially helpful for cut and pasted SQL queries.
I am not sure about the why, but you can work around it by escaping the escape character. Since all escape sequences begin with a backslash, inserting a double backslash effectively cancels the escape: "\now" will produce a newline followed by the letters "ow", but "\\now" will produce the literal text "\now".
I think this question is like asking: "Why is Java not indentation-sensitive like Python?" The mentioned syntax is sugar, and it is redundant (superfluous).
You should find that your IDE handles the problem for you: if you are in the middle of a String and paste raw text into it, it should escape the text for you.
Perl has a wider variety of ways to write string literals, and I sometimes wish Java supported these as well. ;)