Why doesn't Java have a way of specifying unescaped String literals? - c#

In C#, if you want a String to be taken literally, i.e. ignore escape characters, you can use:
string myString = @"sadasd/asdaljsdl";
However there is no equivalent in Java. Is there any reason Java has not included something similar?
Edit:
After reviewing some answers and thinking about it, what I'm really asking is:
Is there any compelling argument against adding this syntax to Java? Some negative to it, that I'm just not seeing?

Java has always struck me as a minimalist language - I would imagine that since verbatim strings are not a necessity (like properties for instance) they were not included.
For instance, in C# there are many quick ways to do things like properties:
public int Foo { get; set; }
and verbatim strings:
String bar = @"some
string";
Java tends to avoid as much syntax-sugar as possible. If you want getters and setters for a field you must do this:
private int foo;
public int getFoo() { return this.foo; }
public void setFoo(int foo) { this.foo = foo; }
and strings must be escaped:
String bar = "some\nstring";
I think it is because in a lot of ways C# and Java have different design goals. C# is rapidly developed with many features being constantly added but most of which tend to be syntax sugar. Java on the other hand is about simplicity and ease of understanding. A lot of the reasons that Java was created in the first place were reactions against C++'s complexity of syntax.

I find "why" questions funny. C# is a newer language, and it tries to improve on what are seen as shortcomings in other languages such as Java. The simple answer to the "why" question is that the Java standard does not define a verbatim-string prefix like the @ in C#.

As already said, the main case where you want to avoid escaping characters is regexes. In that case use:
Pattern.quote()
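As a rough illustration of what Pattern.quote() buys you, here is a minimal sketch. The class and method names are made up for the example; only Pattern.quote, Pattern.compile and Matcher.find come from the standard library. Note that Pattern.quote() only removes the regex meaning of characters; Java string-literal escapes (such as doubling a backslash) still apply.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class PatternQuoteDemo {
    // Returns true if 'text' contains 'literal' exactly as written,
    // with no regex interpretation of characters such as '+' or '.'.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        // Quoted: "1+1" is searched for as the three characters 1, +, 1.
        System.out.println(containsLiteral("price: 1+1=2", "1+1"));
        // Quoted: the text "111" does not contain a plus sign.
        System.out.println(containsLiteral("111", "1+1"));
    }
}
```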

I think one of the reasons is that regular expressions (which are a major reason for this kind of String literal) were not part of the Java platform until Java 1.4 (if I remember correctly). There simply wasn't much of a need for this when the language was defined.

Java (unfortunately) doesn't have anything like this, but Groovy does:
assert '''hello,
world''' == 'hello,\nworld'
//triple-quotes for multi-line strings, adds '\n' regardless of host system
assert 'hello, \
world' == 'hello, world' //backslash joins lines within string
I really liked this feature of C# back when I did some .NET work. It was especially helpful for cut and pasted SQL queries.

I am not sure about the why, but you can do it by escaping the escape character. Since all escape sequences start with a backslash, writing a double backslash effectively cancels the escape. E.g. "\now" will produce a newline followed by the letters "ow", but "\\now" will produce "\now".
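A small Java sketch of that difference, with a hypothetical class name chosen for the example:

```java
// Hypothetical demo class showing the effect of doubling the backslash.
class BackslashEscapeDemo {
    // "\n" is an escape sequence: a newline followed by the letters "ow".
    static String withNewline() {
        return "\now";
    }

    // "\\" is an escaped backslash: a literal backslash followed by "now".
    static String withBackslash() {
        return "\\now";
    }

    public static void main(String[] args) {
        System.out.println(withNewline().length());   // three characters
        System.out.println(withBackslash().length()); // four characters
        System.out.println(withBackslash());
    }
}
```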

I think this question is like asking: "Why is Java not indentation-sensitive like Python?"
Mentioned syntax is a sugar, but it is redundant (superfluous).

You should find your IDE handles the problem for you.
If you are in the middle of a String and copy-paste raw text into it, it should escape the text for you.
Perl has a wider variety of ways to write String literals, and I sometimes wish Java supported these as well. ;)

Related

Any way to use string (without escaping manually) that contains double quotes

Let's say I want to assign a text (which contains many double quotes) into variable. However, the only way seems to manually escape:
string t = "Lorem \"Ipsum\" dummy......
//or//
string t = @"Lorem ""Ipsum"" dummy.....
Is there any way to avoid manual escaping, and instead use some universal keyword/method (which I don't know of in C#) to do that automatically? In PHP, it's incredibly simple, by just using single quotes:
$t = 'Lorem "Ipsum" dummy .......
btw, please don't bomb me with critiques like "Why do you need to use that" etc. I need an answer to the question I asked.
I know this answer may not be satisfying, but C# syntax simply won't allow you to do such a thing (at the time of writing this answer).
I think the best solution is to use resources. Adding/removing and using strings from resources is super easy:
internal class Program
{
private static void Main(string[] args)
{
string myStringVariable = Strings.MyString;
Console.WriteLine(myStringVariable);
}
}
Strings is the name of the resources file without the extension (resx).
MyString is the name of your string in the resources file.
I may be wrong, but I conjecture this is the simplest solution.
No. In C# syntax, the only way to define string literals is the use of the double quote " with optional modifiers @ and/or $ in front. The single quote is the character literal delimiter, and cannot be used in the way PHP would allow - in any version, including the current 8.0.
Note that the PHP approach suffers from the need to escape ' as well, which is, especially in the English language, frequently used as the apostrophe.
To back that up, the EBNF of the string literal in current C# is still this:
regular_string_literal '"' { regular_string_literal_character } '"'
The only change in the compiler in version 8.0 was that now, the order of the prefix modifiers $ (interpolated) and @ (verbatim) can be either @$ or $@; it used to matter annoyingly in earlier versions.
Alternatives:
Save it to a file and use File.ReadAllText for the assignment, or embed it as a managed resource; the compiler will then provide a variable in the namespace of your choice with the verbatim text as its runtime value.
Or use single quotes (or any other special character of your choice), and go
var t = @"Text with 'many quotes' inside".Replace("'", @"""");
where the Replace part could be modeled as an extension to the String class for brevity.
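The same placeholder trick carries over to other languages. As a hedged sketch in Java (the class and method names are invented for illustration), the literal is written with single quotes and converted once at runtime. The obvious caveat is that genuine apostrophes in the text get converted too.

```java
// Hypothetical helper illustrating the "placeholder quote" workaround.
class QuotePlaceholderDemo {
    // Replaces every single quote with a double quote.
    // Caveat: real apostrophes in the text are replaced as well.
    static String withDoubleQuotes(String text) {
        return text.replace('\'', '"');
    }

    public static void main(String[] args) {
        System.out.println(withDoubleQuotes("Lorem 'Ipsum' dummy"));
    }
}
```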

Why do some character literals cause Syntax Errors in Java?

In the latest edition of JavaSpecialists newsletter, the author mentions a piece of code that is un-compilable in Java
public class A1 {
Character aChar = '\u000d';
}
Try to compile it, and you will get an error, such as:
A1.java:2: illegal line end in character literal
Character aChar = '\u000d';
^
Why does an equivalent piece of C# code not show such a problem?
public class CharacterFixture
{
char aChar = '\u000d';
}
Am I missing anything?
EDIT: The original intention of my question was how the C# compiler got Unicode file parsing correct (if so), and why Java should still stick with the incorrect (if so) parsing.
EDIT: I would also like my original question title restored. Why such heavy editing? I strongly suspect it has changed my intent.
Java's compiler translates \uxxxx escape sequences as one of the very first steps, even before the tokenizer gets a crack at the code. By the time it actually starts tokenizing, there are no \uxxxx sequences anymore; they're already turned into the chars they represent, so to the compiler your Java example looks the same as if you'd actually typed a carriage return in there somehow. It does this in order to provide a way to use Unicode within the source, regardless of the source file's encoding. Even ASCII text can still fully represent Unicode chars if necessary (at the cost of readability), and since it's done so early, you can have them almost anywhere in the code. (You could say \u0063\u006c\u0061\u0073\u0073\u0020\u0053\u0074\u0075\u0066\u0066\u0020\u007b\u007d, and the compiler would read it as class Stuff {}, if you wanted to be annoying or torture yourself.)
C# doesn't do that. \uxxxx is translated later, with the rest of the program, and is only valid in certain types of tokens (namely, identifiers and string/char literals). This means it can't be used in certain places where it can be used in Java. cl\u0061ss is not a keyword, for example.
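To make the Java side concrete, here is a small sketch with made-up class and member names. It relies only on the behaviour described above: Unicode escapes are translated before tokenizing, so they may appear in identifiers and even inside keywords. For a carriage return in a char literal, the ordinary escape sequence is used instead, since the Unicode escape for it would be turned into a real line break first.

```java
// Hypothetical demo class; the names are chosen for illustration only.
class UnicodeEscapeDemo {
    // The escape is translated before tokenizing, so this declares a field named "a".
    static int \u0061 = 42;

    // Escapes are legal even inside a keyword: after translation this is "char".
    static ch\u0061r letter = 'x';

    static int readA() {
        return a; // refers to the field declared with the escape above
    }

    static char readLetter() {
        return letter;
    }

    // An ordinary escape sequence is handled by the tokenizer, so it is safe here.
    static char carriageReturn() {
        return '\r';
    }

    public static void main(String[] args) {
        System.out.println(readA());
        System.out.println(readLetter());
        System.out.println((int) carriageReturn());
    }
}
```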

How to reference identifiers with dollar signs from C#?

I'm trying to use a DLL generated by ikvmc from a jar file compiled from Scala code (yeah my day is THAT great). The Scala compiler seems to generate identifiers containing dollar signs for operator overloads, and IKVM uses those in the generated DLL (I can see it in Reflector). The problem is, dollar signs are illegal in C# code, and so I can't reference those methods.
Any way to work around this problem?
You should be able to access the funky methods using reflection. Not a nice solution, but at least it should work. Depending on the structure of the API in the DLL it may be feasible to create a wrapper around the methods to localise the reflection code. Then from the rest of your code just call the nice wrapper.
The alternative would be to hack on the IL in the target DLL and change the identifiers. Or do some post-build IL-hacking on your own code.
Perhaps you can teach IKVM to rename these identifiers so that they have no dollar sign? I'm not super familiar with it, but a quick search pointed me at these:
http://weblog.ikvm.net/default.aspx?date=2005-05-02
What is the format of the Remap XML file for IKVM?
String and complex data types in Map.xml for IKVM!
Good Hunting
Write synonyms for those methods:
def +(a:A,b:A) = a + b
val plus = + _
I fear that you will have to use Reflection in order to access those members. Escaping simply doesn't work in your case.
But for those of you who are interested in the escaping mechanics, I've written an explanation.
In C# you can use the @-sign to escape keywords and use them as identifiers. However, this does not help to escape invalid characters:
bool @bool = false;
There is a way to write identifiers differently by using a Unicode escape sequence:
int i\u0064; // '\u0064' == 'd'
id = 5;
Yes this works. However, even with this trick you can still not use the $-sign in an identifier. Trying...
int i\u0024; // '\u0024' == '$'
... gives the compiler error "Unexpected character '\u0024'". The identifier must still be a valid identifier! The C# compiler probably resolves the escape sequence in a kind of pre-processing step and treats the resulting identifier as if it had been entered normally.
So what is this escaping good for? Maybe it can help you, if someone uses a foreign language character that is not on your keyboard.
int \u00E4; // German a-Umlaut
ä = 5;

Splitting string on commas when data can contain commas

I have a CSV file (which I didn't design and I can't change now nor will I ever be able to change it) that contains lines like the following:
"Surname, Firstname", yes, no, somestring, whatever, etc
As you can see here, the first , is not a comma on which I'd want to split the string. Notice that this particular comma is enclosed within the quotation marks.
Because of this, a simple string.split(',') obviously won't work, as it would give me an array of length 7 for the above string instead of 6.
Is there a way to get around this? I was thinking of using regex to split the string instead but I'm not competent enough in regex to think of a pattern that would only split on commas that are not enclosed inside quotation marks.
I can think of ugly, hacky ways to do it by reading each string char by char but this would have to be a last resort as I'm sure there's a better way to do it!
You can handle this easily by using the TextFieldParser class. Just set HasFieldsEnclosedInQuotes to true.
I would suggest using a CSV parser library - there are other cases that you wouldn't have thought of (new line as part of a quoted field).
The VisualBasic namespace has a nice library that can help - the TextFieldParser.
I know there are a lot of people here who think character-by-character comparisons should never be used and who will strongly disagree with me, but I'm not convinced companies like Microsoft are the only ones who should be doing that sort of programming.
After all, Split does character-by-character comparisons, so why is it any less ugly when you call existing code that doesn't do quite what you want?
At any rate, my approach was to write my own code. And I've posted the code online at http://www.blackbeltcoder.com/Articles/files/reading-and-writing-csv-files-in-c.
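For readers who want to see what the character-by-character approach looks like, here is a minimal sketch. It is not the code from the linked article and not how TextFieldParser works internally; it is written in Java, the class name is hypothetical, and it assumes one record per line (no line breaks inside quoted fields). It trims whitespace around every field, including quoted ones, and treats a doubled quote inside a quoted field as a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one CSV line on commas that are outside double quotes.
class QuotedCsvSplitter {
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;                 // skip the second quote of the pair
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not part of the value
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString().trim()); // field separator
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"Surname, Firstname\", yes, no, somestring, whatever, etc";
        System.out.println(split(line));
    }
}
```

For anything beyond simple input, a dedicated CSV parser remains the safer choice, as the other answers point out.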

How to detect a C++ identifier string?

E.g:
isValidCppIdentifier("_foo") // returns true
isValidCppIdentifier("9bar") // returns false
isValidCppIdentifier("var'") // returns false
I wrote some quick code but it fails:
my regex is "[a-zA-Z_$][a-zA-Z0-9_$]*"
and I simply do regex.IsMatch(inputString).
Thanks..
It should work with some added anchoring:
"^[a-zA-Z_][a-zA-Z0-9_]*$"
If you really need to support ludicrous identifiers using Unicode, feel free to read one of the various versions of the standard and add all the ranges into your regexp (for example, pages 713 and 714 of http://www-d0.fnal.gov/~dladams/cxx_standard.pdf)
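A hedged sketch of the same check in Java, using the function name from the question (the class name is made up). One difference from Regex.IsMatch is worth noting: Java's Matcher.matches() only succeeds when the whole input matches, so the explicit ^ and $ anchors are not needed there. The sketch covers ASCII identifiers only and does not reject C++ keywords such as "class".

```java
import java.util.regex.Pattern;

// Hypothetical helper: ASCII-only check, keywords are not rejected.
class CppIdentifierCheck {
    // First character: letter or underscore; the rest: letters, digits, underscores.
    private static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    static boolean isValidCppIdentifier(String s) {
        // matches() requires the entire string to match, which acts as implicit anchoring.
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidCppIdentifier("_foo"));
        System.out.println(isValidCppIdentifier("9bar"));
        System.out.println(isValidCppIdentifier("var'"));
    }
}
```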
Matti's answer will work to sanitize identifiers before inserting into C++ code, but won't handle C++ code as input very well. It will be annoying to separate things like L"wchar_t string", where L is not an identifier. And there's Unicode.
Clang, Apple's compiler which is built on a philosophy of modularity, provides a set of tokenizer functions. It looks like you would want clang_createTranslationUnitFromSourceFile and clang_tokenize.
I didn't check to see if it handles \Uxxxx or anything. I can't make any kind of guarantees. The last time I used LLVM was five years ago and it wasn't the greatest experience… but not the worst either.
On the other hand, GCC certainly has it, although you have to figure out how to use cpp_lex_direct.
