Why is there no empty char literal? - c#

Is there any specific reason why there is no empty char literal?
What comes closest to what I think of, the '' is the '\0' the null character.
In C++ the char is represented by an int, which means empty char goes directly to the 0 integer value, which is in C++ "the same as null".
The practical part of coming up with that question:
In a class I want to represent char values as enum attributes.
Unbiased I tried to initialize an instance with '', which of course does not work.
But shouldn't there be a char null value? Not to be confused with string.Empty,
more in the nature of a null reference.
So the question is: Why is there no empty char?
-edit-
Having seen this question, I can extend mine:
An empty char value would enable concatenating strings and chars without
destroying the string. Would that not be preferable? Or should this
"just work as expected"?

A char by definition has a length of one character. Empty simply doesn't fit the bill.
Don't confuse a char with a string of max length 1. They sure look similar, but are very different beasts.

To give a slightly more technical explanation: There is no character that can serve as the identity element when performing concatenation. This is different from integers, where 0 serves as the identity element for addition.
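To make the identity-element point concrete, here is a minimal sketch. The class and method names (IdentityDemo, AppendEmptyString, AppendNullChar) are made up for illustration. It shows that string.Empty leaves a string unchanged under concatenation, whereas '\0', the closest thing to an "empty" char, still adds a character:

```csharp
using System;

public static class IdentityDemo
{
    // Concatenating string.Empty leaves the other operand unchanged.
    public static string AppendEmptyString(string s) => s + string.Empty;

    // Concatenating the null character '\0' adds one (invisible) character.
    public static string AppendNullChar(string s) => s + '\0';

    public static void Main()
    {
        string word = "abc";
        Console.WriteLine(AppendEmptyString(word).Length); // 3
        Console.WriteLine(AppendNullChar(word).Length);    // 4
    }
}
```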

Related

How to Determine Unicode Characters from a UTF-16 String?

I have a string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate to
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP characters are up to 2 bytes in length (values 0x0000-0xffff), so there's a good bit of coverage there. Characters from the Chinese, Thai, even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like the one at http://www.fileformat.info/info/unicode/char/10330/index.htm won't be handled correctly by code that assumes every character fits into two bytes.
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2 or 4 byte sequence. In .net, Char might roughly correspond to either a 2 byte UTF-16 sequence or half of a 4 byte UTF-16 sequence. When Char contains half of a 4 byte sequence, it is considered a “surrogate” because it only has meaning when combined with another Char which it must be kept with. To get started with inspecting your .net string, you can get .net to tell you the code points contained in the string, automatically combining surrogate pairs together if necessary. .net provides Char.ConvertToUtf32 which is described the following way:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert that to a .net string before passing to the above method:
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob)); // Encoding.Unicode is .NET's UTF-16 (little-endian) encoding
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char with regards to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when encountered. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances are from “good” sources, you needn’t worry about Char.ConvertToUtf32() throwing (unless you pass in random values for the index offset because your offset might be in the middle of a surrogate pair).
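As a usage sketch, the code points can be printed in the usual U+XXXX notation, which makes an odd space character easy to look up. The wrapper class CodePointDemo, the helper Describe, and the sample string are assumptions made for illustration. GetCodePoints is the method shown above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CodePointDemo
{
    // Same approach as GetCodePoints above: a surrogate pair is consumed as one unit.
    public static IEnumerable<int> GetCodePoints(string s)
    {
        for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
        {
            yield return char.ConvertToUtf32(s, i);
        }
    }

    // Formats each code point as U+XXXX (at least four hex digits).
    public static string Describe(string s) =>
        string.Join(" ", GetCodePoints(s).Select(cp => "U+" + cp.ToString("X4")));

    public static void Main()
    {
        // 'a', NO-BREAK SPACE (U+00A0), 'b', then U+10330, which needs a surrogate pair.
        string sample = "a\u00A0b\U00010330";
        Console.WriteLine(Describe(sample)); // U+0061 U+00A0 U+0062 U+10330
    }
}
```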

string.IndexOf() not recognizing modified characters

When using IndexOf to find a char which is followed by a large valued char (e.g. char 700 which is ʼ) then the IndexOf fails to recognize the char you are looking for.
e.g.
string find = "abcʼabcabc";
int index = find.IndexOf("c");
In this code, index should be 2, but it returns 6.
Is there a way to get around this?
Unicode letter 700 is a modifier apostrophe: in other words, it modifies the letter c. In the same way, if you were to use an 'e' followed by character 769 (0x301), it would not really be an 'e' anymore: the e has been modified to be e with an acute accent. To wit: é. You'll see that letter is actually two characters: copy it to notepad and hit backspace (neat, huh?).
You need to do an "Ordinal" comparison (char-by-char on the raw UTF-16 values) without any linguistic comparison. That will find the 'c', and ignore the linguistic fact that it is modified by the next letter. In my 'e' example, the char values are (101)(769), so if you go char-by-char looking for 101, you will find it, and that ignores the fact that (101)(769) is linguistically the same as (233): é. If you search for (233) linguistically it will find the "equivalent" (101)(769):
string find = "abéabcabc";
int index = find.IndexOf("é"); // gives you 2 even though the é in "find" is two chars and the search string is one
Hopefully that's not too confusing. If you're doing this in real code you should explain in comments exactly what you're doing: as in my 'e' example generally you would want to do semantic equivalence for user data, and ordinal equivalence for e.g. constants (which hopefully wouldn't be different like this, lest your successor hunt you down with an axe).
The cʼ construct is being handled as linguistically different to the simple bytes. Use the Ordinal string comparison to force a byte comparison.
string find = "abcʼabcabc";
int index = find.IndexOf("c", StringComparison.Ordinal);
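The following sketch spells the special characters out as escape sequences so that the difference is visible in the source. The class and method names (OrdinalSearchDemo, OrdinalIndexOf) are hypothetical. It only exercises the ordinal overload, whose result does not depend on the current culture:

```csharp
using System;

public static class OrdinalSearchDemo
{
    // Ordinal search compares UTF-16 code units and ignores linguistic rules.
    public static int OrdinalIndexOf(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        // 'c' followed by U+02BC (MODIFIER LETTER APOSTROPHE, decimal 700).
        string text = "abc\u02BCabcabc";
        Console.WriteLine(OrdinalIndexOf(text, "c")); // 2

        // 'e' followed by U+0301 (COMBINING ACUTE ACCENT): two chars, one visible letter.
        string decomposed = "abe\u0301abc";
        Console.WriteLine(OrdinalIndexOf(decomposed, "e"));      // 2
        Console.WriteLine(OrdinalIndexOf(decomposed, "\u00E9")); // -1, precomposed é is a different char
    }
}
```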

Why char array displays an additional text in the array elements?

Why do we get the extra text within the char array in C#? It looks like a char equivalent, but I'm wondering whether there are any advantages to having this within the char array and, if so, where this feature is used.
That is the integer representation of the char within the ASCII table
They're the numeric (ASCII) values of the characters.
http://www.asciitable.com/
This is purely the watch window showing you the char's numeric value next to the actual character.
So, from ASCII Table you can see that 48 is '0' and 49 is '1'
Characters in C# are Unicode, and there are plenty of them. And there are plenty of them that look similar or downright identical or even unreadable. In such cases you need these numbers.
As per the C# Reference, each char in the char array is shown together with its decimal value.
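The same numbers can be obtained in code, since a char converts implicitly to int. A minimal sketch, with made-up names CharValueDemo and NumericValue:

```csharp
using System;

public static class CharValueDemo
{
    // A char converts implicitly to int, yielding its UTF-16 code unit value.
    public static int NumericValue(char c) => c;

    public static void Main()
    {
        char[] chars = { '0', '1', 'A' };
        foreach (char c in chars)
        {
            // Prints the value/character pair, e.g. 48 '0'.
            Console.WriteLine($"{NumericValue(c)} '{c}'");
        }
    }
}
```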

String must be exactly one character long

I have what I think is an easy problem. For some reason the following code generates the exception, "String must be exactly one character long".
int n = 0;
foreach (char letter in charMsg)
{
    // Get the integral value of the character.
    int value = Convert.ToInt32(letter);
    // Convert the decimal value to a hexadecimal value in string form.
    string hexOutput = String.Format("{0:X}", value);
    //Console.WriteLine("Hexadecimal value of {0} is {1}", letter, hexOutput);
    charMsg[n] = Convert.ToChar(hexOutput);
    n++;
}
The exception occurs at the charMsg[n] = Convert.ToChar(hexOutput); line. Why does it happen? When I check the values of CharMsg it seems to contain all of them properly, yet still throws an error at me.
UPDATE: I've solved this problem, it was my mistake. Sorry for bothering you.
OK, this was a really stupid mistake on my part. Point is, with my problem I'm not even supposed to do this as hex values clearly won't help me in any way.
What I am trying to do is to encrypt a message in an image. I've already encrypted the length of said message in the last digits of each color channel of the first pixel. Now I'm trying to put the message itself in there. I looked here: http://en.wikipedia.org/wiki/ASCII and said to myself without thinking that using hexes would be a good idea. Can't believe I thought that.
Convert.ToChar( string s ), per the documentation requires a single character string, otherwise it throws a FormatException as you've noted. It is a rough, though more restrictive, equivalent of
public char string2char(string s)
{
    return s[0];
}
Your code does the following:
Iterates over all the characters in some enumerable collection of characters.
For each such character, it...
Converts the char to an int. Hint: a char is an integral type: it's an unsigned 16-bit integral value.
Converts that value to a string containing a hex representation of the character in question. For most characters, that string will be at least two characters in length: for instance, converting the space character (' ', 0x20) this way will give you the string "20".
You then try to convert that back to a char and replace the current item being iterated over. This is where your exception is thrown. One thing you should note here is that altering a collection being enumerated can cause the enumerator to throw an exception (List<T> does this; a plain array does not).
What exactly are you trying to accomplish here? For instance, given a charMsg that consists of 3 characters, 'a', 'b' and 'c', what should happen? A clear problem statement helps us to help you.
Since printable Unicode characters can be anywhere in the range from 0x0000 to 0xFFFF, your hexOutput variable can hold more than one character. This is why the error is thrown.
Convert.ToChar(string) always checks the length of the string, and throws if it is not equal to 1. So it will not parse the string "30" as a hexadecimal number and then turn it into the corresponding ASCII symbol, '0'.
Can you elaborate on what you are trying to achieve?
Your hexOutput is a string, and I'm assuming charMsg is a character array. Suppose the first element in charMsg is 'p', or hex value 70. The documentation for Convert.ToChar(string) says it'll use just the first character of the string ('7'), but it's wrong. It'll throw this error. You can test this with a static example, like charMsg[n] = Convert.ToChar("70");. You'll get the same error.
Are you trying to replace characters with hex values? If so, you might try using a StringBuilder object instead of your array assignments.
Convert.ToChar(string) also throws this error if the string is empty. Instead use CChar() (note that CChar is a Visual Basic function, not C#).
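If the goal really were to replace characters with hex values, the StringBuilder suggestion above could look roughly like this. This is only a sketch under assumptions: the names HexDemo, ToHex and FromHex are hypothetical, and the fixed width of four hex digits per char is a choice made here so that the output can be parsed back unambiguously:

```csharp
using System;
using System.Text;

public static class HexDemo
{
    // Builds a new string of four-digit hex values instead of writing back into the char array.
    public static string ToHex(char[] chars)
    {
        var sb = new StringBuilder(chars.Length * 4);
        foreach (char c in chars)
        {
            sb.Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    // Reverses ToHex: parses each group of four hex digits back into one char.
    public static char[] FromHex(string hex)
    {
        var result = new char[hex.Length / 4];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = (char)Convert.ToInt32(hex.Substring(i * 4, 4), 16);
        }
        return result;
    }

    public static void Main()
    {
        string hex = ToHex("abc".ToCharArray());
        Console.WriteLine(hex);                      // 006100620063
        Console.WriteLine(new string(FromHex(hex))); // abc
    }
}
```

Parsing with Convert.ToInt32(text, 16) is what turns a hex string such as "0070" back into the number 112, which Convert.ToChar(string) never does.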

What is a binary null character?

I have a requirement to create a sysDesk log file. In this requirement I am supposed to create an XML file, that in certain places between the elements contains a binary null character.
Can someone please explain to me, firstly what is a binary null character, and how can I write one to a text file?
I suspect it means Unicode U+0000. However, that's not a valid character in an XML file... you should see if you can get a very clear specification of the file format to work out what's actually required. Sample files would also be useful :)
Comments are failing me at the moment, so to address a couple of other answers:
It's not a string termination character in C#, as C# doesn't rely on null termination to find the end of a string. In fact, all .NET strings are null-terminated for the sake of interop, but more importantly the length is stored independently. In particular, a C# string can entirely validly include a null character without terminating it:
string embeddedNull = "a\0b";
Console.WriteLine(embeddedNull.Length); // Prints 3
The method given by rwmnau for getting a null character or string is very inefficient for something simple. Better would be:
string justNullString = "\0";
char justNullChar = '\0';
A binary null character is just a char with an integer/ASCII value of 0.
You can create a null character with Convert.ToChar(0) or the more common, more well-recognized '\0'.
A binary NULL character is one that's all zeros (0x00 in Hex). You can write:
System.Text.Encoding.ASCII.GetChars(new byte[] {00});
to get it in C#.
The null character is the special character that's represented by U+0000 (encoded by all-zero bits). The null character is represented in C# by the escape sequence \0, as in "This string ends with a null character.\0".
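To see what ends up in a file, it helps to look at the encoded bytes. The sketch below uses made-up names (NullCharDemo, EncodeUtf8) and assumes UTF-8 output, which the original requirement does not specify. It shows that the null character encodes to a single zero byte:

```csharp
using System;
using System.Text;

public static class NullCharDemo
{
    // Encodes text as UTF-8; U+0000 becomes a single zero byte.
    public static byte[] EncodeUtf8(string text) => Encoding.UTF8.GetBytes(text);

    public static void Main()
    {
        string record = "a\0b";
        Console.WriteLine(record.Length);                             // 3
        Console.WriteLine(BitConverter.ToString(EncodeUtf8(record))); // 61-00-62
        Console.WriteLine('\0' == Convert.ToChar(0));                 // True
    }
}
```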
