I have a requirement to create a sysDesk log file. As part of this requirement I am supposed to create an XML file that, in certain places between the elements, contains a binary null character.
Can someone please explain to me, firstly, what a binary null character is, and secondly, how I can write one to a text file?
I suspect it means Unicode U+0000. However, that's not a valid character in an XML file... you should see if you can get a very clear specification of the file format to work out what's actually required. Sample files would also be useful :)
Comments are failing me at the moment, so to address a couple of other answers:
It's not a string termination character in C#, because C# doesn't use null-terminated strings. (In fact, all .NET strings are null-terminated behind the scenes for the sake of interop, but more importantly, the length is stored independently.) In particular, a C# string can quite validly include a null character without terminating it:
string embeddedNull = "a\0b";
Console.WriteLine(embeddedNull.Length); // Prints 3
The method given by rwmnau for getting a null character or string is very inefficient for something simple. Better would be:
string justNullString = "\0";
char justNullChar = '\0';
A binary null character is just a char with an integer/ASCII value of 0.
You can create a null character with Convert.ToChar(0) or the more common, more well-recognized '\0'.
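To address the other half of the question, writing one to a text file, here is a minimal sketch (the file name is just an example):

using System.IO;

// "a\0b" is three characters: 'a', U+0000, 'b'.
string content = "a\0b";

// The default encoding here (UTF-8) writes U+0000 as a single 0x00 byte.
File.WriteAllText("demo.txt", content);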
A binary NULL character is one whose byte is all zeros (0x00 in hex). You can write:

System.Text.Encoding.ASCII.GetChars(new byte[] { 0 });

to get it (as a one-element char array) in C#.
The null character is the special character that's represented by U+0000 (encoded by all-zero bits). The null character is represented in C# by the escape sequence \0, as in "This string ends with a null character.\0".
Related
I am doing a bulk import to PostgreSQL from C# and one of the records gives me this error:
22021: invalid byte sequence for encoding "UTF8": 0x00
I googled it and the general advice is that this refers to a null field but in my instance this is not the case. I tracked down the string that causes the error and it is this:
Addresses the following: Let $A$ be a Banach algebra, and let $\sum:\0\rightarrow I\rightarrow\mathfrak A\overset\pi\to\longrightarrow A\rightarrow 0$ be an extension of $A$, where $\mathfrak A$ is a Banach algebra and $I$ is a closed ideal in $\mathfrak A$.
I am reading this from an XML file and have UTF-8 defined on the file stream.
The escaped string on my deserialized C# class is:
"Addresses the following: Let $A$ be a Banach algebra, and let $\\sum\\:\\0\\rightarrow I\\rightarrow\\mathfrak A\\overset\\pi\\to\\longrightarrow A\\rightarrow 0$ be an extension of $A$, where $\\mathfrak A$ is a Banach algebra and $I$ is a closed ideal in $\\mathfrak A$."
Obviously something is not right with the string. I am guessing some sort of mathematical symbols should be there, but what exactly about this is breaking the import and making PostgreSQL report that it is a null field? What format should that be read in?
If I manually overwrite this field the import works, so it is 100% an issue with this string.
Since it's a bulk import, I'm assuming you're creating a file or some kind of big string to send to Postgres? In that case the strings are probably being processed with escape sequences enabled, as opposed to being sent via, say, a prepared statement. So it's probably the \0 in your string that Postgres is interpreting as the escaped byte 0x00.
From the docs: https://www.postgresql.org/docs/8.3/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
PostgreSQL also accepts "escape" string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter E (upper or lower case) just before the opening single quote, e.g. E'foo'. (When continuing an escape string constant across lines, write E only before the first opening quote.) Within an escape string, a backslash character (\) begins a C-like backslash escape sequence, in which the combination of backslash and following character(s) represents a special byte value. \b is a backspace, \f is a form feed, \n is a newline, \r is a carriage return, \t is a tab. Also supported are \digits, where digits represents an octal byte value, and \xhexdigits, where hexdigits represents a hexadecimal byte value. (It is your responsibility that the byte sequences you create are valid characters in the server character set encoding.) Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (\\). Also, a single quote can be included in an escape string by writing \', in addition to the normal way of ''.
So if your bulk statement is prepending strings with E, like E'hello', don't do that.
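For completeness, here is a minimal sketch of the prepared-statement route, assuming you're using Npgsql (the connection string, table, and column names here are made up):

using Npgsql;

// Hypothetical connection string and data; substitute your own.
var connectionString = "Host=localhost;Database=mydb;Username=me";
var abstractText = "Addresses the following: ... \\0\\rightarrow ...";

using var conn = new NpgsqlConnection(connectionString);
conn.Open();

// Parameter values travel separately from the SQL text, so backslashes
// in the value are never interpreted as escape sequences by the server.
using var cmd = new NpgsqlCommand(
    "INSERT INTO papers (abstract) VALUES (@abstract)", conn);
cmd.Parameters.AddWithValue("abstract", abstractText);
cmd.ExecuteNonQuery();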
I have string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate of
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP characters are up to 2 bytes in length (values 0x0000-0xFFFF), so there's a good bit of coverage there. Characters from the Chinese, Thai, and even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like the one shown at http://www.fileformat.info/info/unicode/char/10330/index.htm won't be correctly handled by code that assumes every character fits into two bytes.
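You can see the difference directly in C#; this quick sketch uses the Gothic letter from that link (U+10330):

using System;

// U+10330 (GOTHIC LETTER AHSA) is outside the BMP; U+0E01 (Thai) is inside it.
Console.WriteLine(char.ConvertFromUtf32(0x10330).Length); // 2 (a surrogate pair)
Console.WriteLine(char.ConvertFromUtf32(0x0E01).Length);  // 1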
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2- or 4-byte sequence. In .NET, a Char corresponds to either a 2-byte UTF-16 sequence or half of a 4-byte UTF-16 sequence. When a Char contains half of a 4-byte sequence, it is considered a "surrogate" because it only has meaning when combined with the other Char it must be kept with. To get started inspecting your .NET string, you can get .NET to tell you the code points contained in the string, automatically combining surrogate pairs together where necessary. .NET provides Char.ConvertToUtf32, which is described the following way:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
    // Step by two chars when the current one opens a surrogate pair,
    // otherwise by one; ConvertToUtf32 handles both cases.
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
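A quick usage sketch (the sample string here is made up):

// Prints 61, 62 and 1F01C: two BMP characters plus one surrogate pair.
foreach (int cp in GetCodePoints("ab\U0001F01C"))
{
    Console.WriteLine(cp.ToString("X"));
}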
When you say "from a UTF-16 String", that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert them to a .NET string before passing to the above method:
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob)); // Encoding.Unicode is UTF-16
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char values with regard to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when it encounters one. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances come from "good" sources, you needn't worry about Char.ConvertToUtf32() throwing (unless you pass in arbitrary values for the index offset, since your offset might land in the middle of a surrogate pair).
Is there any specific reason why there is no empty char literal?
The closest thing to what I have in mind, '', would be '\0', the null character.
In C++, a char is an integral type, which means an empty char would presumably map directly to the integer value 0, which in C++ is "the same as null".
The practical part of coming up with that question:
In a class I want to represent char values as enum attributes.
Naively, I tried to initialize an instance with '', which of course does not work.
But shouldn't there be a char null value? Not to be confused with string.Empty,
more in the nature of a null reference.
So the question is: Why is there no empty char?
-edit-
Having seen this question, mine can be extended:

An empty char value would enable concatenating strings and chars without
destroying the string. Would that not be preferable? Or should this
"just work as expected"?
A char by definition has a length of one character. Empty simply doesn't fit the bill.
Don't run into confusion between a char and a string of max length 1. They sure look similar, but are very different beasts.
To give a slightly more technical explanation: There is no character that can serve as the identity element when performing concatenation. This is different from integers, where 0 serves as the identity element for addition.
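A short sketch makes the point concrete:

string s = "ab";

// '\0' is a real character, not "nothing": appending it changes the string.
Console.WriteLine((s + '\0').Length); // 3, not 2

// Contrast with integers, where 0 is the identity for addition.
Console.WriteLine(5 + 0); // 5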
I have what I think is an easy problem. For some reason the following code generates the exception, "String must be exactly one character long".
int n = 0;
foreach (char letter in charMsg)
{
// Get the integral value of the character.
int value = Convert.ToInt32(letter);
// Convert the decimal value to a hexadecimal value in string form.
string hexOutput = String.Format("{0:X}", value);
//Console.WriteLine("Hexadecimal value of {0} is {1}", letter, hexOutput);
charMsg[n] = Convert.ToChar(hexOutput);
n++;
}
The exception occurs at the charMsg[n] = Convert.ToChar(hexOutput); line. Why does it happen? When I check the values of charMsg it seems to contain all of them properly, yet it still throws an error at me.
UPDATE: I've solved this problem, it was my mistake. Sorry for bothering you.
OK, this was a really stupid mistake on my part. The point is, for my problem I'm not even supposed to do this, as hex values clearly won't help me in any way.
What I am trying to do is to encrypt a message in an image. I've already encoded the length of said message in the last digits of each color channel of the first pixel. Now I'm trying to put the message itself in there. I looked here: http://en.wikipedia.org/wiki/ASCII and said to myself, without thinking, that using hex values would be a good idea. Can't believe I thought that.
Convert.ToChar(string s), per the documentation, requires a single-character string, otherwise it throws a FormatException as you've noted. It is a rough, though more restrictive, equivalent of
public char string2char( string s )
{
    return s[0];
}
Your code does the following:
Iterates over all the characters in some enumerable collection of characters.
For each such character, it...
Converts the char to an int. Hint: a char is an integral type: it's an unsigned 16-bit integral value.
Converts that value to a string containing a hex representation of the character in question. For most characters, that string will be at least two characters in length: for instance, converting the space character (' ', 0x20) this way will give you the string "20".
You then try to convert that back to a char and replace the current item being iterated over. This is where your exception is thrown. One thing you should note here is that altering a collection while enumerating it will throw for most collection types, although plain arrays, as used here, are an exception.
What exactly are you trying to accomplish here? For instance, given a charMsg that consists of the 3 characters 'a', 'b' and 'c', what should happen? A clear problem statement helps us to help you.
Since printable Unicode characters can be anywhere in the range 0x0000 to 0xFFFF, your hexOutput variable can hold more than one character; this is why the error is thrown.
Convert.ToChar(string) always checks the length of the string, and if it is not equal to 1, it throws. So it will not convert the string "30" to a hexadecimal number and then to its ASCII representation, the symbol '0'.
Can you elaborate on what you are trying to achieve?
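If the intent was to round-trip a character through a hex string, the parse has to be explicit; a sketch:

char letter = 'p';

// Char -> hex string ("70"), as in the original code.
string hexOutput = ((int)letter).ToString("X");

// Hex string -> char: parse the number back first, then cast.
char roundTripped = (char)Convert.ToInt32(hexOutput, 16);

Console.WriteLine(roundTripped); // p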
Your hexOutput is a string, and I'm assuming charMsg is a character array. Suppose the first element in charMsg is 'p', or hex value 70. The documentation for Convert.ToChar(string) suggests it'll use just the first character of the string ('7'), but in fact it throws this error. You can test this with a static example, like charMsg[n] = Convert.ToChar("70");. You'll get the same error.
Are you trying to replace characters with hex values? If so, you might try using a StringBuilder object instead of your array assignments.
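For example, a sketch of that StringBuilder approach (charMsg here stands in for the question's array):

using System;
using System.Text;

char[] charMsg = { 'p', 'q' }; // stand-in for the question's array

var sb = new StringBuilder();
foreach (char letter in charMsg)
{
    // At least two hex digits per character, e.g. 'p' -> "70".
    sb.AppendFormat("{0:X2}", (int)letter);
}
Console.WriteLine(sb.ToString()); // 7071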
Convert.ToChar(string) also leads to this error if it is given an empty string; in VB.NET you could use CChar() instead.
I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair which will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself?
The simplest way is to use \U######## where the U is capital and the #s denote exactly eight hexadecimal digits. If the value exceeds 0000FFFF hexadecimal, a surrogate pair will be needed:
string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";
You can check myString.Length to see that the one Unicode character occupies two .NET Char values. Note that the char type has a couple of static methods that will help you determine if a char is a part of a surrogate pair.
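For example, using the mahjong character from above:

string tile = "\U0001F01C"; // one code point, two Char values

Console.WriteLine(tile.Length);                            // 2
Console.WriteLine(char.IsHighSurrogate(tile[0]));          // True
Console.WriteLine(char.IsLowSurrogate(tile[1]));           // True
Console.WriteLine(char.IsSurrogatePair(tile[0], tile[1])); // True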
If you use a .NET language that does not have something like the \U######## escape sequence, you can use the method ConvertFromUtf32, for example:
string fourCircles = char.ConvertFromUtf32(0x1F01C);
Addition: If your C# source file has an encoding that allows all Unicode characters, like UTF-8, you can just put the character directly in the file (by copy-paste). For example:
string myString = "In the game of mahjong 🀜 denotes the Four of circles";
The character is UTF-8 encoded in the source file (in my example) but will be UTF-16 encoded (surrogate pairs) when the application runs and the string is in memory.
(Not sure if Stack Overflow software handles my mahjong character correctly. Try clicking "edit" to this answer and copy-paste from the text there, if the "funny" character is not here.)
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme (see this page for more information);
In the Unicode character encoding, characters are mapped to values between 0x000000 and 0x10FFFF. Internally, a UTF-16 encoding scheme is used to store strings of Unicode text in which two-byte (16-bit) code sequences are considered. Since two bytes can only contain the range of characters from 0x0000 to 0xFFFF, some additional complexity is used to store values above this range (0x010000 to 0x10FFFF).
This is done using pairs of code points known as surrogates. The surrogate characters are classified in two distinct ranges known as low surrogates and high surrogates, depending on whether they are allowed at the start or the end of the two-code sequence.
Try this yourself:
String surrogate = "abc" + Char.ConvertFromUtf32(Int32.Parse("2A601", NumberStyles.HexNumber)) + "def";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);
or this, if you want to stick with the blog example (U+0301 is a combining accent rather than a surrogate pair, but it breaks naive reversal in the same way):
String surrogate = "Les Mise" + Char.ConvertFromUtf32(Int32.Parse("0301", NumberStyles.HexNumber)) + "rables";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);
and then check the string values with the debugger. Jon Skeet is damn right... strings and dates seem easy but they are absolutely NOT.
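If you want a reversal that survives both surrogate pairs and combining characters, one approach (a sketch, not anything from the blog post) is to reverse by text element rather than by Char:

using System;
using System.Collections.Generic;
using System.Globalization;

static string ReverseByTextElement(string s)
{
    // A "text element" groups a surrogate pair (or a base character plus
    // its combining characters) into one string, so reversing the elements
    // keeps each of them intact.
    var elements = new List<string>();
    TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
    while (e.MoveNext())
    {
        elements.Add(e.GetTextElement());
    }
    elements.Reverse();
    return string.Concat(elements);
}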