First of all: I know how to work around this issue. I'm not searching for a solution. I am interested in the reasoning behind the design choices that led to some implicit conversions and didn't lead to others.
Today I came across a small but consequential error in our code base, where an int constant was initialised with the char representation of that same number. This results in an ASCII conversion of the char to an int. Something like this:
char a = 'a';
int z = a;
Console.WriteLine(z);
// Result: 97
I was confused why C# would allow something like this. After searching around I found the following SO question with an answer by Eric Lippert himself: Implicit Type cast in C#
An excerpt:
However, we can make educated guesses as to why implicit
char-to-ushort was considered a good idea. The key idea here is that
the conversion from number to character is a "possibly dodgy"
conversion. It's taking something that you do not KNOW is intended to
be a character, and choosing to treat it as one. That seems like the
sort of thing you want to call out that you are doing explicitly,
rather than accidentally allowing it. But the reverse is much less
dodgy. There is a long tradition in C programming of treating
characters as integers -- to obtain their underlying values, or to do
mathematics on them.
I can agree with the reasoning behind it, though an IDE hint would be awesome. However, I have another situation where the implicit conversion suddenly is not legal:
char a = 'a';
string z = a; // CS0029 Cannot implicitly convert type 'char' to 'string'
This conversion is, in my humble opinion, very logical. It cannot lead to data loss, and the intention of the writer is also very clear. Even after reading the rest of the answer on the char-to-int implicit conversion, I still don't see any reason why this should not be legal.
So that leads me to my actual question:
What reasons could the C# design team have had for not implementing the implicit conversion from char to string, when it appears so obvious (especially when compared to the char-to-int conversion)?
First off, as I always say when someone asks a "why not?" question about C#: the design team doesn't have to provide a reason not to do a feature. Features cost time, effort and money, and every feature you do takes time, effort and money away from better features.
But I don't want to just reject the premise out of hand; the question might be better phrased as "what are design pros and cons of this proposed feature?"
It's an entirely reasonable feature, and there are languages which allow you to treat single characters as strings. (Tim mentioned VB in a comment, and Python also treats chars and one-character strings as interchangeable IIRC. I'm sure there are others.) However, were I pitched the feature, I'd point out a few downsides:
This is a new form of boxing conversion. Chars are cheap value types. Strings are heap-allocated reference types. Boxing conversions can cause performance problems and produce collection pressure, and so there's an argument to be made that they should be more visible in the program, not less visible.
The feature will not be perceived as "chars are convertible to one-character strings". It will be perceived by users as "chars are one-character strings", and now it is perfectly reasonable to ask lots of knock-on questions, like: can I call .Length on a char? If I can pass a char to a method that expects a string, and I can pass a string to a method that expects an IEnumerable<char>, can I pass a char to a method that expects an IEnumerable<char>? That seems... odd. I can call Select and Where on a string; can I on a char? That seems even more odd. All the proposed feature does is move your question; had it been implemented, you'd now be asking "why can't I call Select on a char?" or some such thing.
Now combine the previous two points together. If I think of chars as one-character strings, and I convert a char to an object, do I get a boxed char or a string?
We can also generalize the second point a bit further. A string is a collection of chars. If we're going to say that a char is convertible to a collection of chars, why stop with strings? Why not also say that a char can also be used as a List<char>? Why stop with char? Should we say that an int is convertible to IEnumerable<int>?
We can generalize even further: if there's an obvious conversion from char to sequence-of-chars-in-a-string, then there is also an obvious conversion from char to Task<char> -- just create a completed task that returns the char -- and to Func<char> -- just create a lambda that returns the char -- and to Lazy<char>, and to Nullable<char> -- oh, wait, we do allow a conversion to Nullable<char>. :-)
All of these problems are solvable, and some languages have solved them. That's not the issue. The issue is: all of these problems are problems that the language design team must identify, discuss and resolve. One of the fundamental problems in language design is how general should this feature be? In two minutes I've gone from "chars are convertible to single-character strings" to "any value of an underlying type is convertible to an equivalent value of a monadic type". There is an argument to be made for both features, and for various other points on the spectrum of generality. If you make your language features too specific, it becomes a mass of special cases that interact poorly with each other. If you make them too general, well, I guess you have Haskell. :-)
Suppose the design team comes to a conclusion about the feature: all of that has to be written up in the design documents and the specification, and the code, and tests have to be written, and, oh, did I mention that any time you make a change to convertibility rules, someone's overload resolution code breaks? Convertibility rules you really have to get right in the first version, because changing them later makes existing code more fragile. There are real design costs, and there are real costs to real users if you make this sort of change in version 8 instead of version 1.
Now compare these downsides -- and I'm sure there are more that I haven't listed -- to the upsides. The upsides are pretty tiny: you avoid a single call to ToString or + "" or whatever you do to convert a char to a string explicitly.
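For reference, the explicit spellings that the feature would save are all one-liners; a quick sketch of the usual ones:
char c = 'a';
string s1 = c.ToString();        // instance call
string s2 = char.ToString(c);    // static helper
string s3 = new string(c, 1);    // a string consisting of c, once
string s4 = "" + c;              // concatenation forces the conversion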
That's not even close to a good enough benefit to justify the design, implementation, testing, and backwards-compat-breaking costs.
Like I said, it's a reasonable feature, and had it been in version 1 of the language -- which did not have generics, or an installed base of billions of lines of code -- then it would have been a much easier sell. But now, there are a lot of features that have bigger bang for smaller buck.
Char is a value type. No, it is not numerical like int or double, but it still derives from System.ValueType. So a char variable typically lives on the stack.
String is a reference type, so it lives on the heap. If you want to use a value type as a reference type, you need to box it first. The compiler wants to be sure that you know you're boxing it, so it doesn't offer an implicit cast.
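A quick sketch of the boxing this answer refers to: converting a char to object today gives you a boxed System.Char, not a string.
char c = 'a';
object o = c;                        // boxing conversion
Console.WriteLine(o.GetType());      // System.Char
Console.WriteLine(o is string);      // False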
Because when you cast a char to an int, you aren't just changing its type; you are getting its position in the ASCII table, which is a really different thing. They are not letting you "convert a char to an int"; they are giving you useful data for a possible need.
Why not allow the cast of a char to string?
Because they actually represent very different things, and are stored in very different ways.
A char and a string aren't two ways of expressing the same thing.
I think I understand strong typing, but every time I look for examples for what is weak typing I end up finding examples of programming languages that simply coerce/convert types automatically.
For instance, in this article named Typing: Strong vs. Weak, Static vs. Dynamic, the author says that Python is strongly typed because you get an exception if you try to:
Python
1 + "1"
Traceback (most recent call last):
File "", line 1, in ?
TypeError: unsupported operand type(s) for +: 'int' and 'str'
However, such a thing is possible in Java and in C#, and we do not consider them weakly typed just for that.
Java
int a = 10;
String b = "b";
String result = a + b;
System.out.println(result);
C#
int a = 10;
string b = "b";
string c = a + b;
Console.WriteLine(c);
In another article named Weakly Type Languages, the author says that Perl is weakly typed simply because I can concatenate a string to a number and vice versa without any explicit conversion.
Perl
$a=10;
$b="a";
$c=$a.$b;
print $c; #10a
So the same example makes Perl weakly typed, but not Java and C#?
Gee, this is confusing
The authors seem to imply that a language that prevents the application of certain operations on values of different types is strongly typed and the contrary means weakly typed.
Therefore, at some point I have felt prompted to believe that a language that provides a lot of automatic conversions or coercions between types (as Perl does) may end up being considered weakly typed, whereas other languages that provide only a few conversions may end up being considered strongly typed.
I am inclined to believe, though, that I must be wrong in this interpretation; I just do not know why or how to explain it.
So, my questions are:
What does it really mean for a language to be truly weakly typed?
Could you mention any good examples of weakly typing that are not related to automatic conversion/automatic coercion done by the language?
Can a language be weakly typed and strongly typed at the same time?
UPDATE: This question was the subject of my blog on the 15th of October, 2012. Thanks for the great question!
What does it really mean for a language to be "weakly typed"?
It means "this language uses a type system that I find distasteful". A "strongly typed" language by contrast is a language with a type system that I find pleasant.
The terms are essentially meaningless and you should avoid them. Wikipedia lists eleven different meanings for "strongly typed", several of which are contradictory. This indicates that the odds of confusion being created are high in any conversation involving the term "strongly typed" or "weakly typed".
All that you can really say with any certainty is that a "strongly typed" language under discussion has some additional restriction in the type system, either at runtime or compile time, that a "weakly typed" language under discussion lacks. What that restriction might be cannot be determined without further context.
Instead of using "strongly typed" and "weakly typed", you should describe in detail what kind of type safety you mean. For example, C# is a statically typed language and a type safe language and a memory safe language, for the most part. C# allows all three of those forms of "strong" typing to be violated. The cast operator violates static typing; it says to the compiler "I know more about the runtime type of this expression than you do". If the developer is wrong, then the runtime will throw an exception in order to protect type safety. If the developer wishes to break type safety or memory safety, they can do so by turning off the type safety system with an "unsafe" block. In an unsafe block you can use pointer magic to treat an int as a float (violating type safety) or to write to memory you do not own (violating memory safety).
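As a rough illustration of the pointer magic Eric describes (a sketch only; it has to be compiled with unsafe code enabled):
unsafe
{
    int i = 1065353216;          // the bit pattern of the float 1.0f
    float f = *(float*)&i;       // reinterpret the int's bits as a float
    Console.WriteLine(f);        // prints 1: static typing bypassed
    // the same pointer arithmetic can also walk past memory you own,
    // which is the memory-safety violation mentioned above
}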
C# imposes type restrictions that are checked at both compile-time and at runtime, thereby making it a "strongly typed" language compared to languages that do less compile-time checking or less runtime checking. C# also allows you to in special circumstances do an end-run around those restrictions, making it a "weakly typed" language compared with languages which do not allow you to do such an end-run.
Which is it really? It is impossible to say; it depends on the point of view of the speaker and their attitude towards the various language features.
As others have noted, the terms "strongly typed" and "weakly typed" have so many different meanings that there's no single answer to your question. However, since you specifically mentioned Perl in your question, let me try to explain in what sense Perl is weakly typed.
The point is that, in Perl, there is no such thing as an "integer variable", a "float variable", a "string variable" or a "boolean variable". In fact, as far as the user can (usually) tell, there aren't even integer, float, string or boolean values: all you have are "scalars", which are all of these things at the same time. So you can, for example, write:
$foo = "123" + "456"; # $foo = 579
$bar = substr($foo, 2, 1); # $bar = 9
$bar .= " lives"; # $bar = "9 lives"
$foo -= $bar; # $foo = 579 - 9 = 570
Of course, as you correctly note, all of this can be seen as just type coercion. But the point is that, in Perl, types are always coerced. In fact, it's quite hard for a user to tell what the internal "type" of a variable might be: at line 2 in my example above, asking whether the value of $bar is the string "9" or the number 9 is pretty much meaningless, since, as far as Perl is concerned, those are the same thing. Indeed, it's even possible for a Perl scalar to internally have both a string and a numeric value at the same time, as is e.g. the case for $foo after line 2 above.
The flip side of all this is that, since Perl variables are untyped (or, rather, don't expose their internal type to the user), operators cannot be overloaded to do different things for different types of arguments; you can't just say "this operator will do X for numbers and Y for strings", because the operator can't (won't) tell which kind of values its arguments are.
Thus, for example, Perl has and needs both a numeric addition operator (+) and a string concatenation operator (.): as you saw above, it's perfectly fine to add strings ("1" + "2" == "3") or to concatenate numbers (1 . 2 == 12). Similarly, the numeric comparison operators ==, !=, <, >, <=, >= and <=> compare the numeric values of their arguments, while the string comparison operators eq, ne, lt, gt, le, ge and cmp compare them lexicographically as strings. So 2 < 10, but 2 gt 10 (but "02" lt 10, while "02" == 2). (Mind you, certain other languages, like JavaScript, try to accommodate Perl-like weak typing while also doing operator overloading. This often leads to ugliness, like the loss of associativity for +.)
(The fly in the ointment here is that, for historical reasons, Perl 5 does have a few corner cases, like the bitwise logical operators, whose behavior depends on the internal representation of their arguments. Those are generally considered an annoying design flaw, since the internal representation can change for surprising reasons, and so predicting just what those operators do in a given situation can be tricky.)
All that said, one could argue that Perl does have strong types; they're just not the kind of types you might expect. Specifically, in addition to the "scalar" type discussed above, Perl also has two structured types: "array" and "hash". Those are very distinct from scalars, to the point where Perl variables have different sigils indicating their type ($ for scalars, @ for arrays, % for hashes)1. There are coercion rules between these types, so you can write e.g. %foo = @bar, but many of them are quite lossy: for example, $foo = @bar assigns the length of the array @bar to $foo, not its contents. (Also, there are a few other strange types, like typeglobs and I/O handles, that you don't often see exposed.)
Also, a slight chink in this nice design is the existence of reference types, which are a special kind of scalars (and which can be distinguished from normal scalars, using the ref operator). It's possible to use references as normal scalars, but their string/numeric values are not particularly useful, and they tend to lose their special reference-ness if you modify them using normal scalar operations. Also, any Perl variable2 can be blessed to a class, turning it into an object of that class; the OO class system in Perl is somewhat orthogonal to the primitive type (or typelessness) system described above, although it's also "weak" in the sense of following the duck typing paradigm. The general opinion is that, if you find yourself checking the class of an object in Perl, you're doing something wrong.
1 Actually, the sigil denotes the type of the value being accessed, so that e.g. the first scalar in the array @foo is denoted $foo[0]. See perlfaq4 for more details.
2 Objects in Perl are (normally) accessed through references to them, but what actually gets blessed is the (possibly anonymous) variable the reference points to. However, the blessing is indeed a property of the variable, not of its value, so e.g. assigning the actual blessed variable to another one just gives you a shallow, unblessed copy of it. See perlobj for more details.
In addition to what Eric has said, consider the following C code:
void f(void* x);
f(42);
f("hello");
In contrast to languages such as Python, C#, Java or whatnot, the above is weakly typed because we lose type information. Eric correctly pointed out that in C# we can circumvent the compiler by casting, effectively telling it “I know more about the type of this variable than you”.
But even then, the runtime will still check the type! If the cast is invalid, the runtime system will catch it and throw an exception.
With type erasure, this doesn’t happen – type information is thrown away. A cast to void* in C does exactly that. In this regard, the above is fundamentally different from a C# method declaration such as void f(Object x).
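A small C# sketch of that contrast: even after up-casting to object, the runtime still knows the real type and refuses a bad cast instead of silently reinterpreting the value.
static void f(object x)
{
    string s = (string)x;        // throws InvalidCastException if x is not a string
    Console.WriteLine(s.Length);
}
f("hello");   // fine
f(42);        // compiles, but fails at runtime with InvalidCastException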
(Technically, C# also allows type erasure through unsafe code or marshalling.)
This is as weakly typed as it gets. Everything else is just a matter of static vs. dynamic type checking, i.e. of the time when a type is checked.
A perfect example comes from the Wikipedia article on Strong Typing:
Generally strong typing implies that the programming language places severe restrictions on the intermixing that is permitted to occur.
Weak Typing
a = 2
b = "2"
concatenate(a, b) # returns "22"
add(a, b) # returns 4
Strong Typing
a = 2
b = "2"
concatenate(a, b) # Type Error
add(a, b) # Type Error
concatenate(str(a), b) #Returns "22"
add(a, int(b)) # Returns 4
Notice that a weakly typed language can intermix different types without errors. A strongly typed language requires the input types to be the expected types. In a strongly typed language a type can be converted (str(a) converts an integer to a string) or cast (int(b)).
This all depends on the interpretation of typing.
I would like to contribute to the discussion with my own research on the subject. As others have commented and contributed, I have been reading their answers and following their references, and I have found interesting information. As suggested, it is probable that most of this would be better discussed on the Programmers forum, since it appears to be more theoretical than practical.
From a theoretical standpoint, I think the article by Luca Cardelli and Peter Wegner named On Understanding Types, Data Abstraction and Polymorphism has one of the best arguments I have read.
A type may be viewed as a set of clothes (or a suit of armor) that
protects an underlying untyped representation from arbitrary or
unintended use. It provides a protective covering that hides the
underlying representation and constrains the way objects may interact
with other objects. In an untyped system untyped objects are naked
in that the underlying representation is exposed for all to see.
Violating the type system involves removing the protective set of
clothing and operating directly on the naked representation.
This statement seems to suggest that weak typing would let us access the inner structure of a type and manipulate it as if it were something else (another type); perhaps what we can do with unsafe code (mentioned by Eric) or with C's type-erased pointers (mentioned by Konrad).
The article continues...
Languages in which all expressions are type-consistent are called
strongly typed languages. If a language is strongly typed its compiler
can guarantee that the programs it accepts will execute without type
errors. In general, we should strive for strong typing, and adopt
static typing whenever possible. Note that every statically typed
language is strongly typed but the converse is not necessarily true.
As such, strong typing means the absence of type errors, I can only assume that weak typing means the contrary: the likely presence of type errors. At runtime or compile time? Seems irrelevant here.
Funnily enough, by this definition a language with powerful type coercions like Perl would be considered strongly typed, because the system does not fail; it deals with the types by coercing them into appropriate and well-defined equivalences.
On the other hand, could I say that the allowance of ClassCastException and ArrayStoreException (in Java) and InvalidCastException and ArrayTypeMismatchException (in C#) would indicate a level of weak typing, at least at compile time? Eric's answer seems to agree with this.
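Array covariance in C# is a concrete example: the compiler accepts the assignment, and the check is deferred to runtime, which is exactly where ArrayTypeMismatchException comes from. A small sketch:
object[] items = new string[1];   // legal: arrays are covariant
items[0] = 42;                    // compiles, but throws ArrayTypeMismatchException at runtime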
In a second article named Typeful Programming provided in one of the references provided in one of the answers in this question, Luca Cardelli delves into the concept of type violations:
Most system programming languages allow arbitrary type violations,
some indiscriminately, some only in restricted parts of a program.
Operations that involve type violations are called unsound. Type
violations fall in several classes [among which we can mention]:
Basic-value coercions: These include conversions between integers, booleans, characters, sets, etc. There is no need for type violations
here, because built-in interfaces can be provided to carry out the
coercions in a type-sound way.
As such, type coercions like those provided by operators could be considered type violations, but unless they break the consistency of the type system, we might say that they do not lead to a weakly typed system.
Based on this neither Python, Perl, Java or C# are weakly typed.
Cardelli mentions two type violations that I do consider cases of truly weak typing:
Address arithmetic. If necessary, there should be a built-in (unsound) interface, providing the adequate operations on addresses
and type conversions. Various situations involve pointers into the
heap (very dangerous with relocating collectors), pointers to the
stack, pointers to static areas, and pointers into other address
spaces. Sometimes array indexing can replace address arithmetic.
Memory mapping. This involves looking at an area of memory as an unstructured array, although it contains structured data. This is
typical of memory allocators and collectors.
These kinds of things, possible in languages like C (mentioned by Konrad) or through unsafe code in .NET (mentioned by Eric), would truly imply weak typing.
I believe the best answer so far is Eric's, because the definition of this concepts is very theoretical, and when it comes to a particular language, the interpretations of all these concepts may lead to different debatable conclusions.
Weak typing does indeed mean that a high percentage of types can be implicitly coerced, attempting to guess what the coder intended.
Strong typing means that types are not coerced, or at least coerced less.
Static typing means your variables' types are determined at compile time.
Many people have recently been confusing "manifestly typed" with "strongly typed". "Manifestly typed" means that you declare your variables' types explicitly.
Python is mostly strongly typed, though you can use almost anything in a boolean context, and booleans can be used in an integer context, and you can use an integer in a float context. It is not manifestly typed, because you don't need to declare your types (except for Cython, which isn't entirely python, albeit interesting). It is also not statically typed.
C and C++ are manifestly typed, statically typed, and somewhat strongly typed, because you declare your types, types are determined at compile time, and you can mix integers and pointers, or integers and doubles, or even cast a pointer to one type into a pointer to another type.
Haskell is an interesting example, because it is not manifestly typed, but it's also statically and strongly typed.
Strong vs. weak typing is not only about the continuum of how much or how little of the values are coerced automatically by the language from one datatype to another, but also about how strongly or weakly the actual values themselves are typed. In Python and Java, and mostly in C#, the values have their types set in stone. In Perl, not so much - there are really only a handful of different value types you can store in a variable.
Let's open the cases one by one.
Python
In the Python example 1 + "1", the + operator calls the __add__ of type int, giving it the string "1" as an argument - however, this results in NotImplemented:
>>> (1).__add__('1')
NotImplemented
Next, the interpreter tries the __radd__ of str:
>>> '1'.__radd__(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute '__radd__'
As that fails too, the + operator fails with the result TypeError: unsupported operand type(s) for +: 'int' and 'str'. As such, the exception does not say much about strong typing, but the fact that the operator + does not coerce its arguments automatically to the same type is a pointer to the fact that Python is not the most weakly typed language in the continuum.
On the other hand, in Python 'a' * 5 is implemented:
>>> 'a' * 5
'aaaaa'
That is,
>>> 'a'.__mul__(5)
'aaaaa'
The fact that the operation is different requires some strong typing - however, the opposite behaviour of * coercing the values to numbers before multiplying still would not necessarily make the values weakly typed.
Java
The Java example, String result = "1" + 1;, works only because, as a matter of convenience, the operator + is overloaded for strings. The Java + operator replaces the expression with the creation of a StringBuilder (see this):
String result = a + b;
// becomes something like
String result = new StringBuilder().append(a).append(b).toString()
This is rather an example of very static typing, with no actual coercion - StringBuilder has a method append(Object) that is specifically used here. The documentation says the following:
Appends the string representation of the Object argument.
The overall effect is exactly as if the argument were converted to a
string by the method String.valueOf(Object), and the characters of
that string were then appended to this character sequence.
Where String.valueOf then
Returns the string representation of the Object argument.
[Returns] if the argument is null, then a string equal to "null"; otherwise, the value of obj.toString() is returned.
Thus this is a case of absolutely no coercion by the language - delegating every concern to the objects themselves.
C#
According to Jon Skeet's answer here, operator + is not even overloaded for the string class - akin to Java, this is just convenience generated by the compiler, thanks to both static and strong typing.
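In other words, the compiler lowers the mixed concatenation to a call into the String class; roughly like this (a sketch of the lowering, not the exact generated code):
int a = 10;
string b = "b";
string c1 = a + b;                   // what you write
string c2 = string.Concat(a, b);     // roughly what the compiler emits; the int is boxed and converted with ToString
There is no implicit int-to-string conversion in the language itself; the work is done by the library call.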
Perl
As the perldata explains,
Perl has three built-in data types: scalars, arrays of scalars, and associative arrays of scalars, known as "hashes". A scalar is a single string (of any size, limited only by the available memory), number, or a reference to something (which will be discussed in perlref). Normal arrays are ordered lists of scalars indexed by number, starting with 0. Hashes are unordered collections of scalar values indexed by their associated string key.
Perl, however, does not have a separate data type for numbers, booleans, strings, nulls, undefineds, references to other objects, etc. - it just has one type for all of these, the scalar type; 0 is a scalar value as much as "0" is. A scalar variable that was set as a string can really change into a number, and from there on behave differently from "just a string" if it is accessed in a numerical context. The scalar can hold anything in Perl; it is the object as it exists in the system. Whereas in Python the names just refer to the objects, in Perl the scalar values in the names are changeable objects. Furthermore, the object-oriented type system is glued on top of this: there are just 3 data types in Perl - scalars, lists and hashes. A user-defined object in Perl is a reference (that is, a pointer to any of the 3 previous) blessed into a package - you can take any such value and bless it into any class at any instant you want.
Perl even allows you to change the classes of values at whim - this is not possible in Python, where to create a value of some class you need to explicitly construct the value belonging to that class with object.__new__ or similar. In Python you cannot really change the essence of the object after its creation; in Perl you can do pretty much anything:
package Foo;
package Bar;
my $val = 42;
# $val is now a scalar value holding a number
bless \$val, 'Foo';
# all references to $val now belong to class Foo
my $obj = \$val;
# now $obj refers to the SV stored in $val
# thus this prints: Foo=SCALAR(0x1c7d8c8)
print \$val, "\n";
# all references to $val now belong to class Bar
bless \$val, 'Bar';
# thus this prints Bar=SCALAR(0x1c7d8c8)
print \$val, "\n";
# we change the value stored in $val from number to a string
$val = 'abc';
# yet still the SV is blessed: Bar=SCALAR(0x1c7d8c8)
print \$val, "\n";
# and on the course, the $obj now refers to a "Bar" even though
# at the time of copying it did refer to a "Foo".
print $obj, "\n";
Thus the type identity is weakly bound to the variable, and it can be changed through any reference on the fly. In fact, if you do
my $another = $val;
\$another does not have the class identity, even though \$val will still give the blessed reference.
TL;DR
There is much more to weak typing in Perl than just automatic coercions; the point is rather that the types of the values themselves are not set in stone, unlike in Python, which is a dynamically yet very strongly typed language. That Python gives a TypeError on 1 + "1" is an indication that the language is strongly typed, even though the contrary behaviour of doing something useful, as in Java or C#, does not preclude them from being strongly typed languages.
As many others have expressed, the entire notion of "strong" vs "weak" typing is problematic.
As an archetype, Smalltalk is very strongly typed -- it will always raise an exception if an operation between two objects is incompatible. However, I suspect few on this list would call Smalltalk a strongly-typed language, because it is dynamically typed.
I find the notion of "static" versus "dynamic" typing more useful than "strong" versus "weak". A statically-typed language has all the types figured out at compile time, and the programmer generally has to declare them explicitly.
Contrast with a dynamically-typed language, where typing is performed at run-time. This is typically a requirement for polymorphic languages, so that decisions about whether an operation between two objects is legal does not have to be decided by the programmer in advance.
In polymorphic, dynamically-typed languages (like Smalltalk and Ruby), it's more useful to think of a "type" as a "conformance to protocol." If an object obeys a protocol the same way another object does -- even if the two objects do not share any inheritance or mixins or other voodoo -- they are considered the same "type" by the run-time system. More correctly, an object in such systems is autonomous, and can decide if it makes sense to respond to any particular message referring to any particular argument.
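For what it is worth, C# can emulate this style with dynamic, where member lookup is deferred to runtime and any object that happens to respond to the message will do. A small sketch with hypothetical Duck and Robot types:
class Duck  { public string Quack() => "Quack!"; }
class Robot { public string Quack() => "Beep. Quack. Beep."; }

static void MakeItQuack(dynamic quacker)
{
    // No shared interface or base class is required; the call is resolved at
    // runtime, and fails with RuntimeBinderException only if the object
    // genuinely cannot respond.
    Console.WriteLine(quacker.Quack());
}

MakeItQuack(new Duck());
MakeItQuack(new Robot());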
Want an object that can make some meaningful response to the message "+" with an object argument that describes the colour blue? You can do that in dynamically-typed languages, but it is a pain in statically-typed languages.
I like @Eric Lippert's answer, but to address the question - strongly typed languages typically have explicit knowledge of the types of variables at each point of the program. Weakly typed languages do not, so they can attempt to perform an operation that may not be possible for a particular type.
I think the easiest way to see this is in a function.
C++:
void func(string a) {...}
The variable a is known to be of type string and any incompatible operation will be caught at compile time.
Python:
def func(a):
...
The variable a could be anything and we can have code that calls an invalid method, which will only get caught at runtime.
What are the differences between these two and which one should I use?
string s = "Hello world!";
String s = "Hello world!";
string is an alias in C# for System.String.
So technically, there is no difference. It's like int vs. System.Int32.
As far as guidelines, it's generally recommended to use string any time you're referring to an object.
e.g.
string place = "world";
Likewise, I think it's generally recommended to use String if you need to refer specifically to the class.
e.g.
string greet = String.Format("Hello {0}!", place);
This is the style that Microsoft tends to use in their examples.
It appears that the guidance in this area may have changed, as StyleCop now enforces the use of the C# specific aliases.
Just for the sake of completeness, here's a brain dump of related information...
As others have noted, string is an alias for System.String. Assuming your code using String compiles to System.String (i.e. you haven't got a using directive for some other namespace with a different String type), they compile to the same code, so at execution time there is no difference whatsoever. This is just one of the aliases in C#. The complete list is:
object: System.Object
string: System.String
bool: System.Boolean
byte: System.Byte
sbyte: System.SByte
short: System.Int16
ushort: System.UInt16
int: System.Int32
uint: System.UInt32
long: System.Int64
ulong: System.UInt64
float: System.Single
double: System.Double
decimal: System.Decimal
char: System.Char
Apart from string and object, the aliases are all to value types. decimal is a value type, but not a primitive type in the CLR. The only primitive type which doesn't have an alias is System.IntPtr.
In the spec, the value type aliases are known as "simple types". Literals can be used for constant values of every simple type; no other value types have literal forms available. (Compare this with VB, which allows DateTime literals, and has an alias for it too.)
There is one circumstance in which you have to use the aliases: when explicitly specifying an enum's underlying type. For instance:
public enum Foo : UInt32 {} // Invalid
public enum Bar : uint {} // Valid
That's just a matter of the way the spec defines enum declarations - the part after the colon has to be the integral-type production, which is one token of sbyte, byte, short, ushort, int, uint, long, ulong, char... as opposed to a type production as used by variable declarations for example. It doesn't indicate any other difference.
Finally, when it comes to which to use: personally I use the aliases everywhere for the implementation, but the CLR type for any APIs. It really doesn't matter too much which you use in terms of implementation - consistency among your team is nice, but no-one else is going to care. On the other hand, it's genuinely important that if you refer to a type in an API, you do so in a language-neutral way. A method called ReadInt32 is unambiguous, whereas a method called ReadInt requires interpretation. The caller could be using a language that defines an int alias for Int16, for example. The .NET framework designers have followed this pattern, good examples being in the BitConverter, BinaryReader and Convert classes.
String stands for System.String and it is a .NET Framework type. string is an alias in the C# language for System.String. Both of them are compiled to System.String in IL (Intermediate Language), so there is no difference. Choose what you like and use that. If you code in C#, I'd prefer string as it's a C# type alias and well-known by C# programmers.
I can say the same about (int, System.Int32), etc.
The best answer I have ever heard about using the provided type aliases in C# comes from Jeffrey Richter in his book CLR Via C#. Here are his 3 reasons:
I've seen a number of developers confused, not knowing whether to use string or String in their code. Because in C# the string (a keyword) maps exactly to System.String (an FCL type), there is no difference and either can be used.
In C#, long maps to System.Int64, but in a different programming language, long could map to an Int16 or Int32. In fact, C++/CLI does treat long as an Int32. Someone reading source code in one language could easily misinterpret the code's intention if he or she were used to programming in a different programming language. In fact, most languages won't even treat long as a keyword and won't compile code that uses it.
The FCL has many methods that have type names as part of their method names. For example, the BinaryReader type offers methods such as ReadBoolean, ReadInt32, ReadSingle, and so on, and the System.Convert type offers methods such as ToBoolean, ToInt32, ToSingle, and so on. Although it's legal to write the following code, the line with float feels very unnatural to me, and it's not obvious that the line is correct:
BinaryReader br = new BinaryReader(...);
float val = br.ReadSingle(); // OK, but feels unnatural
Single val = br.ReadSingle(); // OK and feels good
So there you have it. I think these are all really good points. I however, don't find myself using Jeffrey's advice in my own code. Maybe I am too stuck in my C# world but I end up trying to make my code look like the framework code.
string is a reserved word, but String is just a class name.
This means that string cannot be used as a variable name by itself.
If for some reason you wanted a variable called string, you'd see only the first of these compiles:
StringBuilder String = new StringBuilder(); // compiles
StringBuilder string = new StringBuilder(); // doesn't compile
If you really want a variable name called string you can use @ as a prefix:
StringBuilder @string = new StringBuilder();
Another critical difference: Stack Overflow highlights them differently.
There is one difference - you can't use String without using System; beforehand.
It's been covered above; however, you can't use string in reflection; you must use String.
System.String is the .NET string class - in C# string is an alias for System.String - so in use they are the same.
As for guidelines I wouldn't get too bogged down and just use whichever you feel like - there are more important things in life and the code is going to be the same anyway.
If you find yourself building systems where it is necessary to specify the size of the integers you are using, and so tend to use Int16, Int32, UInt16, UInt32, etc., then it might look more natural to use String - and when moving around between different .NET languages it might make things more understandable - otherwise I would use string and int.
I prefer the capitalized .NET types (rather than the aliases) for formatting reasons. The .NET types are colored the same as other object types (the value types are proper objects, after all).
Conditional and control keywords (like if, switch, and return) are lowercase and colored dark blue (by default). And I would rather not have the disagreement in use and format.
Consider:
String someString;
string anotherString;
string and String are identical in all ways (except the uppercase "S"). There are no performance implications either way.
Lowercase string is preferred in most projects due to the syntax highlighting
This YouTube video demonstrates practically how they differ.
But now for a long textual answer.
When we talk about .NET there are two different things: there is the .NET framework, and there are the languages (C#, VB.NET, etc.) which use that framework.
"System.String" a.k.a "String" (capital "S") is a .NET framework data type while "string" is a C# data type.
In short "String" is an alias (the same thing called with different names) of "string". So technically both the below code statements will give the same output.
String s = "I am String";
or
string s = "I am String";
In the same way, there are aliases for other C# data types as shown below:
object: System.Object, string: System.String, bool: System.Boolean, byte: System.Byte, sbyte: System.SByte, short: System.Int16 and so on.
Now the million-dollar question from a programmer's point of view: so when should we use "String" and when "string"?
The first thing, to avoid confusion: use one of them consistently. But from a best-practices perspective, when you make a variable declaration it's good to use "string" (small "s"), and when you are using it as a class name then "String" (capital "S") is preferred.
In the below code the left-hand side is a variable declaration and it is declared using "string". On the right-hand side, we are calling a method so "String" is more sensible.
string s = String.Concat("I am ", "String");
C# is a language which is used together with the CLR.
string is a type in C#.
System.String is a type in the CLR.
When you use C# together with the CLR string will be mapped to System.String.
Theoretically, you could implement a C#-compiler that generated Java bytecode. A sensible implementation of this compiler would probably map string to java.lang.String in order to interoperate with the Java runtime library.
Lower case string is an alias for System.String.
They are the same in C#.
There's a debate over whether you should use the System types (System.Int32, System.String, etc.) types or the C# aliases (int, string, etc). I personally believe you should use the C# aliases, but that's just my personal preference.
string is just an alias for System.String. The compiler will treat them identically.
The only practical difference is the syntax highlighting as you mention, and that you have to write using System if you use String.
Both are the same. But from a coding guidelines perspective it's better to use string instead of String. This is what developers generally use. E.g. instead of using Int32 we use int, as int is an alias for Int32.
FYI
“The keyword string is simply an alias for the predefined class System.String.” - C# Language Specification 4.2.3
http://msdn2.microsoft.com/En-US/library/aa691153.aspx
As the others are saying, they're the same. StyleCop rules, by default, will enforce you to use string as a C# code style best practice, except when referencing System.String static functions, such as String.Format, String.Join, String.Concat, etc...
New answer after 6 years and 5 months (procrastination).
While string is a reserved C# keyword that always has a fixed meaning, String is just an ordinary identifier which could refer to anything. Depending on members of the current type, the current namespace and the applied using directives and their placement, String could be a value or a type distinct from global::System.String.
I shall provide two examples where using directives will not help.
First, when String is a value of the current type (or a local variable):
class MySequence<TElement>
{
public IEnumerable<TElement> String { get; set; }
void Example()
{
var test = String.Format("Hello {0}.", DateTime.Today.DayOfWeek);
}
}
The above will not compile because IEnumerable<> does not have a non-static member called Format, and no extension methods apply. In the above case, it may still be possible to use String in other contexts where a type is the only possibility syntactically. For example String local = "Hi mum!"; could be OK (depending on namespace and using directives).
Worse: Saying String.Concat(someSequence) will likely (depending on usings) go to the Linq extension method Enumerable.Concat. It will not go to the static method string.Concat.
Secondly, when String is another type, nested inside the current type:
class MyPiano
{
protected class String
{
}
void Example()
{
var test1 = String.Format("Hello {0}.", DateTime.Today.DayOfWeek);
String test2 = "Goodbye";
}
}
Neither statement in the Example method compiles. Here String is always a piano string, MyPiano.String. No member (static or not) Format exists on it (or is inherited from its base class). And the value "Goodbye" cannot be converted into it.
Using System types makes it easier to port between C# and VB.Net, if you are into that sort of thing.
Against what seems to be common practice among other programmers, I prefer String over string, just to highlight the fact that String is a reference type, as Jon Skeet mentioned.
string is an alias (or shorthand) for System.String. That means, by typing string we mean System.String. You can read more at this link: 'string' is an alias/shorthand of System.String.
I'd just like to add this to lfoust's answer, from Richter's book:
The C# language specification states, “As a matter of style, use of the keyword is favored over
use of the complete system type name.” I disagree with the language specification; I prefer
to use the FCL type names and completely avoid the primitive type names. In fact, I wish that
compilers didn’t even offer the primitive type names and forced developers to use the FCL
type names instead. Here are my reasons:
I’ve seen a number of developers confused, not knowing whether to use string
or String in their code. Because in C# string (a keyword) maps exactly to
System.String (an FCL type), there is no difference and either can be used. Similarly,
I’ve heard some developers say that int represents a 32-bit integer when the application
is running on a 32-bit OS and that it represents a 64-bit integer when the application
is running on a 64-bit OS. This statement is absolutely false: in C#, an int always maps
to System.Int32, and therefore it represents a 32-bit integer regardless of the OS the
code is running on. If programmers would use Int32 in their code, then this potential
confusion is also eliminated.
In C#, long maps to System.Int64, but in a different programming language, long
could map to an Int16 or Int32. In fact, C++/CLI does treat long as an Int32.
Someone reading source code in one language could easily misinterpret the code’s
intention if he or she were used to programming in a different programming language.
In fact, most languages won’t even treat long as a keyword and won’t compile code
that uses it.
The FCL has many methods that have type names as part of their method names. For
example, the BinaryReader type offers methods such as ReadBoolean, ReadInt32,
ReadSingle, and so on, and the System.Convert type offers methods such as
ToBoolean, ToInt32, ToSingle, and so on. Although it’s legal to write the following
code, the line with float feels very unnatural to me, and it’s not obvious that the line is
correct:
BinaryReader br = new BinaryReader(...);
float val = br.ReadSingle(); // OK, but feels unnatural
Single val = br.ReadSingle(); // OK and feels good
Many programmers that use C# exclusively tend to forget that other programming
languages can be used against the CLR, and because of this, C#-isms creep into the
class library code. For example, Microsoft’s FCL is almost exclusively written in C# and
developers on the FCL team have now introduced methods into the library such as
Array’s GetLongLength, which returns an Int64 value that is a long in C# but not
in other languages (like C++/CLI). Another example is System.Linq.Enumerable’s
LongCount method.
I didn't get his opinion until I read the complete paragraph.
String (System.String) is a class in the base class library. string (lower case) is a reserved word in C# that is an alias for System.String. Int32 vs int is a similar situation, as is Boolean vs. bool. These C#-language-specific keywords enable you to declare primitives in a style similar to C.
It's a matter of convention, really. string just looks more like C/C++ style. The general convention is to use whatever shortcuts your chosen language has provided (int/Int for Int32). This goes for "object" and decimal as well.
Theoretically this could help to port code into some future 64-bit standard in which "int" might mean Int64, but that's not the point, and I would expect any upgrade wizard to change any int references to Int32 anyway just to be safe.
@JaredPar (a developer on the C# compiler and prolific SO user!) wrote a great blog post on this issue. I think it is worth sharing here. It is a nice perspective on our subject.
string vs. String is not a style debate
[...]
The keyword string has concrete meaning in C#. It is the type System.String which exists in the core runtime assembly. The runtime intrinsically understands this type and provides the capabilities developers expect for strings in .NET. Its presence is so critical to C# that if that type doesn’t exist the compiler will exit before attempting to even parse a line of code. Hence string has a precise, unambiguous meaning in C# code.
The identifier String though has no concrete meaning in C#. It is an identifier that goes through all the name lookup rules as Widget, Student, etc … It could bind to string or it could bind to a type in another assembly entirely whose purposes may be entirely different than string. Worse it could be defined in a way such that code like String s = "hello"; continued to compile.
class TricksterString {
void Example() {
String s = "Hello World"; // Okay but probably not what you expect.
}
}
class String {
public static implicit operator String(string s) => null;
}
The actual meaning of String will always depend on name resolution.
That means it depends on all the source files in the project and all
the types defined in all the referenced assemblies. In short it
requires quite a bit of context to know what it means.
True that in the vast majority of cases String and string will bind to
the same type. But using String still means developers are leaving
their program up to interpretation in places where there is only one
correct answer. When String does bind to the wrong type it can leave
developers debugging for hours, filing bugs on the compiler team, and
generally wasting time that could’ve been saved by using string.
Another way to visualize the difference is with this sample:
string s1 = 42; // Errors 100% of the time
String s2 = 42; // Might error, might not, depends on the code
Many will argue that while this information is technically accurate, using String is still fine because it's exceedingly rare that a codebase would define a type of this name. Or that when String is defined it's a sign of a bad codebase.
[...]
You’ll see that String is defined for a number of completely valid purposes: reflection helpers, serialization libraries, lexers, protocols, etc … For any of these libraries String vs. string has real consequences depending on where the code is used.
So remember when you see the String vs. string debate this is about semantics, not style. Choosing string gives crisp meaning to your codebase. Choosing String isn’t wrong but it’s leaving the door open for surprises in the future.
Note: I copy/pasted most of the blog post for archival reasons. I omitted some parts, so I recommend skipping my excerpt and reading the blog post if you can.
String is not a keyword and it can be used as an identifier, whereas string is a keyword and cannot be used as an identifier. From a functional point of view both are the same.
Coming late to the party: I use the CLR types 100% of the time (well, except if forced to use the C# type, but I don't remember when the last time that was).
I originally started doing this years ago, as per the CLR books by Richter. It made sense to me that all CLR languages ultimately have to be able to support the set of CLR types, so using the CLR types yourself provided clearer, and possibly more "reusable", code.
Now that I've been doing it for years, it's a habit and I like the coloration that VS shows for the CLR types.
The only real downer is that auto-complete uses the C# type, so I end up re-typing automatically generated types to specify the CLR type instead.
Also, now, when I see "int" or "string", it just looks really wrong to me, like I'm looking at 1970's C code.
There is no difference.
The C# keyword string maps to the .NET type System.String - it is an alias that keeps to the naming conventions of the language.
Similarly, int maps to System.Int32.
There's a quote on this issue from Daniel Solis' book.
All the predefined types are mapped directly to
underlying .NET types. The C# type names (string) are simply aliases for the
.NET types (String or System.String), so using the .NET names works fine syntactically, although
this is discouraged. Within a C# program, you should use the C# names
rather than the .NET names.
Yes, there's no difference between them, just like bool and Boolean.
string is a keyword, and you can't use string as an identifier.
String is not a keyword, and you can use it as an identifier:
Example
string String = "I am a string";
The keyword string is an alias for System.String. Aside from the keyword issue, the two are exactly equivalent.
typeof(string) == typeof(String) == typeof(System.String)
I was told that decimal is implemented as a user-defined type and other C# types like int have specific opcodes devoted to them. What's the reasoning behind this?
decimal isn't alone here; DateTime, TimeSpan, Guid, etc. are also custom types. I guess the main reason is that they don't map to CPU primitives. float (IEEE 754), int, etc. are pretty ubiquitous here, but decimal is bespoke to .NET.
This only really causes a problem if you want to talk to the operators directly via reflection (since they don't exist for int etc). I can't think of any other scenarios where you'd notice the difference.
(actually, there are still structs etc to represent the others - they are just lacking most of what you might expect to be in them, such as operators)
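One way to see the difference: decimal's operators are ordinary static methods on System.Decimal, while int has none, because int arithmetic is emitted as dedicated IL opcodes. A sketch using reflection:
// System.Decimal defines op_Addition as a real method...
var decimalAdd = typeof(decimal).GetMethod("op_Addition", new[] { typeof(decimal), typeof(decimal) });
Console.WriteLine(decimalAdd != null);   // True

// ...whereas System.Int32 has no such method: int + int is just the 'add' opcode.
var intAdd = typeof(int).GetMethod("op_Addition", new[] { typeof(int), typeof(int) });
Console.WriteLine(intAdd != null);       // False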
"What's the reasoning behind this?"
Decimal math is handled in software versus hardware. Currently, many processors don't support native decimal (financial decimal versus float) math. That's changing though with the adoption of IEEE 754R.
See also:
Decimal vs Double Speed
http://software.intel.com/en-us/blogs/2008/03/06/intel-decimal-floating-point-math-library/
http://grouper.ieee.org/groups/754/
It seems that unsigned integers would be useful for method parameters and class members that should never be negative, but I don't see many people writing code that way. I tried it myself and found the need to cast from int to uint somewhat annoying...
Anyhow, what are your thoughts on this?
Duplicate: Why is Array Length an Int and not an UInt?
The idea that unsigned will prevent you from problems with methods/members that should not have to deal with negative values is somewhat flawed:
now you have to check for big values ('overflow') in case of error, whereas you could have checked for <= 0 with signed
use only one signed int in your methods and you are back to square "signed" :)
Use unsigned when dealing with bits. But don't use bits today anyway, unless you have so many of them that they fill some megabytes, or at least your small embedded memory.
Using the standard ones probably avoids the casting to unsigned versions. In your own code you can probably maintain the distinction OK, but lots of other inputs and 3rd party libraries won't, and thus the casting will just drive people mad!
I can't remember exactly how C# does its implicit conversions, but in C++, widening conversions are done implicitly. Unsigned is considered wider than signed, and so this leads to unexpected problems:
// Both values positive: the comparison behaves as expected.
int s = 5;
unsigned int u = 25;
// s > u is false

// A negative signed value is converted to unsigned for the comparison.
int s = -1;
unsigned int u = 25;
// s > u is **TRUE**! Error error!
In the second example above, s is converted to unsigned for the comparison, so its value becomes something like 4294967295. This has caused me problems before; I often have methods return -1 to say "no match" or something like that, and with the implicit conversion it just fails to do what I think it should.
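For what it is worth, C# sidesteps this particular trap: when an int is compared to a uint, both operands are promoted to long, so there is no wraparound. A quick sketch:
int s = -1;
uint u = 25;
Console.WriteLine(s > u);   // False: both operands are converted to long first
Console.WriteLine(s < u);   // True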
After a while programmers learnt to almost always use signed variables, except in exceptional cases. Compilers these days also produce warnings for this which is very helpful.
One reason is that public methods or properties referring to unsigned types aren't CLS compliant.
You'll almost always see this attribute applied to .Net assemblies as various wizards include it by default:
[assembly: CLSCompliant(true)]
So basically if your assembly includes the attribute above, and you try to use unsigned types in your public interface with the outside world, you'll get compiler warnings about CLS compliance.
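A minimal sketch of what triggers the diagnostic (the Counter type here is just an illustrative example):
using System;
[assembly: CLSCompliant(true)]

public class Counter
{
    // Flagged by the compiler: uint is not a CLS-compliant type to expose publicly.
    public uint Count { get; set; }

    // Internal or private members are not checked, so this is fine.
    internal uint InternalCount { get; set; }
}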
For simplicity. Modern software involves enough casts and conversions. There's a benefit to stick to as few, commonly available data types as possible to reduce complexity and ambiguity about proper interfaces.
There is no real need. Declaring something as unsigned to say numbers should be positive is a poor man's attempt at validation.
In fact it would be better to just have a single number class that represented all numbers.
To Validate numbers you should use some other technique because generally the constraint isn't just positive numbers, it's a set range. It's usually best to use the most unconstrained method for representing numbers and then if you want to change the rules for allowable values, you change JUST the validation rules NOT the type.
Unsigned data types are carried over from the old days when memory was at a premium. So now we don't really need them for that purpose. Combine that with casting and they are a little cumbersome.
It's not advisable to use an unsigned integer because if you assign negative values to it, all hell breaks loose. However, if you insist on doing it the right way, try using Spec#: declare it as an integer (where you would have used uint) and attach an invariant to it saying it can never be negative.
You're right in that it probably would be better to use uint for things which should never be negative. In practice, though, there's a few reasons against it:
int is the 'standard' or 'default'; it has a lot of inertia behind it.
You have to use annoying/ugly explicit casts everywhere to get back to int, which you're probably going to need to do a lot due to the first point after all.
When you do cast back to int, what happens if your uint value is going to overflow it?
A lot of the time, people use ints for things that are never negative, and then use -1 or -99 or something like that for error codes or uninitialised values. It's a little lazy perhaps, but again this is something that uint is less flexible with.
The biggest reason is that people are usually too lazy, or not thinking closely enough, to know where they're appropriate. Something like a size_t can never be negative, so unsigned is correct.
Casting from signed to unsigned can be fraught with peril though, because of peculiarities in how sign bits are handled by the underlying architecture.
You shouldn't need to manually cast it, I don't think.
Anyway, it's because for most applications, it doesn't matter - the range for either is big enough for most uses.
It is because they are not CLS compliant, which means your code might not run as you expect on other .NET framework implementations, or might not even be usable from other .NET languages (unless supported).
Also, they are not interoperable with other systems if you try to pass them across a boundary, like a web service to be consumed by Java, or a call to the Win32 API.
See this SO post on the reason why as well.