22021: invalid byte sequence for encoding "UTF8": 0x00 - c#

I am doing a bulk import to PostgreSQL from C# and one of the records gives me this error:
22021: invalid byte sequence for encoding "UTF8": 0x00
I googled it and the general advice is that this refers to a null field but in my instance this is not the case. I tracked down the string that causes the error and it is this:
Addresses the following: Let $A$ be a Banach algebra, and let $\sum:\0\rightarrow I\rightarrow\mathfrak A\overset\pi\to\longrightarrow A\rightarrow 0$ be an extension of $A$, where $\mathfrak A$ is a Banach algebra and $I$ is a closed ideal in $\mathfrak A$.
I am reading this from an XML file and have UTF-8 defined on the file stream.
The escaped string on my deserialized C# class is:
"Addresses the following: Let $A$ be a Banach algebra, and let $\\sum\\:\\0\\rightarrow I\\rightarrow\\mathfrak A\\overset\\pi\\to\\longrightarrow A\\rightarrow 0$ be an extension of $A$, where $\\mathfrak A$ is a Banach algebra and $I$ is a closed ideal in $\\mathfrak A$."
Obviously something is not right with the string. I am guessing some sort of mathematical symbols should be there, but what exactly about this is breaking the import and making PostgreSQL report a null byte? What format should it be read in?
If I manually overwrite this field the import works, so it is 100% an issue with this string.

Since it's a bulk import, I'm assuming you're creating a file or some kind of big string to send to Postgres? In that case the strings probably have escape characters enabled, as opposed to executing this via, say, a prepared statement. So it's probably the \0 in your string that Postgres treats as an escape sequence and interprets as the byte 0x00.
From the docs: https://www.postgresql.org/docs/8.3/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
PostgreSQL also accepts "escape" string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter E (upper or lower case) just before the opening single quote, e.g. E'foo'. (When continuing an escape string constant across lines, write E only before the first opening quote.) Within an escape string, a backslash character (\) begins a C-like backslash escape sequence, in which the combination of backslash and following character(s) represents a special byte value. \b is a backspace, \f is a form feed, \n is a newline, \r is a carriage return, \t is a tab. Also supported are \digits, where digits represents an octal byte value, and \xhexdigits, where hexdigits represents a hexadecimal byte value. (It is your responsibility that the byte sequences you create are valid characters in the server character set encoding.) Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (\\). Also, a single quote can be included in an escape string by writing \', in addition to the normal way of ''.
So if your bulk statement is prepending strings with E, like E'hello', don't do that.
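If the import instead feeds rows to COPY in text format (an assumption; the question doesn't say which mechanism is used), backslash is also an escape character there, so a literal backslash in the data has to be doubled. A minimal sketch, with a hypothetical helper name, that escapes one field value and drops any real U+0000 characters, which PostgreSQL text columns cannot store:

```csharp
using System;
using System.Text;

static class CopyTextEscaper
{
    // Hypothetical helper: escapes one field value for COPY's text format.
    // Assumes the default tab delimiter.
    public static string EscapeForCopyText(string value)
    {
        var sb = new StringBuilder(value.Length + 8);
        foreach (char c in value)
        {
            switch (c)
            {
                case '\\': sb.Append(@"\\"); break; // literal backslash must be doubled
                case '\t': sb.Append(@"\t"); break; // column delimiter
                case '\n': sb.Append(@"\n"); break; // row delimiter
                case '\r': sb.Append(@"\r"); break;
                case '\0': break;                   // drop real null characters
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

With this, the LaTeX fragment \sum\:\0\rightarrow is sent as \\sum\\:\\0\\rightarrow, so the server no longer reads \0 as an octal escape for byte 0x00.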

Related

How to Determine Unicode Characters from a UTF-16 String?

I have string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate to
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP characters are up to 2 bytes in length (values 0x0000-0xffff), so there's a good bit of coverage there. Characters from the Chinese, Thai, even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. All the same, a character such as the one at http://www.fileformat.info/info/unicode/char/10330/index.htm (U+10330) won't be handled correctly by code that assumes every character fits into two bytes.
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2 or 4 byte sequence. In .NET, a Char corresponds to either a 2 byte UTF-16 sequence or half of a 4 byte UTF-16 sequence. When a Char contains half of a 4 byte sequence, it is considered a "surrogate" because it only has meaning when combined with another Char which it must be kept with. To get started with inspecting your .NET string, you can get .NET to tell you the code points contained in the string, automatically combining surrogate pairs together if necessary. .NET provides Char.ConvertToUtf32, which is described as follows:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert that to a .net string before passing to the above method:
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob)); // Encoding.Unicode is UTF-16 little-endian
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char with regards to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when encountered. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances are from “good” sources, you needn’t worry about Char.ConvertToUtf32() throwing (unless you pass in random values for the index offset because your offset might be in the middle of a surrogate pair).
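To actually spot an odd space character, it helps to print each code point in the usual U+XXXX notation. The following is a small sketch along the same lines as the method above; the class and method names are made up for illustration, and unlike the original it passes a lone surrogate through as its raw value instead of throwing:

```csharp
using System;
using System.Collections.Generic;

static class CodePointDump
{
    // Hypothetical helper: yields "U+XXXX" for every code point in the string.
    public static IEnumerable<string> Describe(string s)
    {
        for (var i = 0; i < s.Length; i++)
        {
            bool isPair = char.IsSurrogatePair(s, i);
            // Combine a valid surrogate pair; otherwise use the single Char value.
            int codePoint = isPair ? char.ConvertToUtf32(s, i) : s[i];
            if (isPair) i++; // skip the low surrogate that was just consumed
            yield return "U+" + codePoint.ToString("X4");
        }
    }
}
```

For example, string.Join(" ", CodePointDump.Describe("a\u00A0b")) gives "U+0061 U+00A0 U+0062", which reveals that the "space" is really a no-break space.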

Remove double backslashes c# (for use ESC/POS programming)

I've searched for a long time, and it seems that my problem is widely known. But none of the answers given work for me. Most of the time, people say 'there is no problem'.
The problem: I'm programming a POS solution, and I'm using an Epson POS printer. To print the bottom of the receipt, I'm storing a string in the database. This is so users can adjust the text at the bottom of the receipt. But when I pull the string out of the database, C# adds slashes to the string, so my escape characters won't work. I know that usually this is not a problem, but in my case it is, because my ESC/POS commands won't work.
I've already tried some snippets that replace the double \ with a single \, but they don't work (e.g. String.Replace(@"\\", @"\")).
Problem:
I have a string: "foo \n bar"
Needs to print as:
foo
bar
C# adds slashes: "foo \\n bar"
Now it's printed as:
foo \n bar
Anyone an idea?
The problem is a misunderstanding of how C# handles strings. Take the following sample code:
string foo = "a\nb";
int fooLength = foo.Length; // 3 characters.
int bar = (int)(foo[1]); // 10 = linefeed character.
versus:
string foo = @"a\nb"; // NB: @ prefix!
int fooLength = foo.Length; // 4 characters.
int bar = (int)(foo[1]); // 92 = backslash character.
The first example uses a regular string literal ("a\nb"), which the C# compiler interprets to yield three characters. The second example uses a verbatim string literal, due to the prefix @, which suppresses the interpretation of escape sequences.
Note that the debugger is designed to add to the confusion by displaying strings with escape codes added, e.g. string foo = "a\nb" + (Char)9; results in a string that the debugger shows as "a\nb\t". If you use the "text visualizer" in the debugger (by clicking on the magnifying glass when examining the variable's value) you can see the difference between literal and interpreted characters.
Databases are, as a rule, designed to accept and return string values without interpretation. That way you needn't worry about names like "Delete D'table". Neither the presence of a SQL keyword, nor punctuation used in SQL statements, should present a problem in a data column.
Now the OP's issue should be becoming clearer. The string retrieved from the database does not contain a linefeed, but instead contains the characters '\' and 'n'. .NET has no reason to change those values when the string is read from the database and written to a printer. Unfortunately, the debugger confounds the difference. (Use the text visualizer as described above.)
The solution involves adding code to reproduce the C# compiler's processing of escape sequences. (This should include escaping escape characters!) Alternatively, tokens can be added that are suitable for the application at hand, e.g. occurrences of «ESC» could be replaced with an ASCII escape character. This can be employed for longer sequences, for example if a printer uses several characters to introduce a font change, then write the code to replace «SetFont» with the correct sequence. More generally, you can replace a snippet with a dynamic value, e.g. «Now» could be replaced with the current date/time when the receipt is being printed. (Register number, cashier name, store hours, ... .) This makes the values in the database more human readable than embedded Unicode oddities and more flexible than fixed strings.
Left as an exercise for the reader: extend snippets to support formatting and null value substitution. «Now|DD.MM.YY hh:mm» to specify a format, «Discount|*|n/a» to specify a value ("n/a") to be displayed if the field is null.
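The token approach could look roughly like the sketch below. The «ESC» and «Now» tokens come from the description above; the «LF» token, the class name and the date format are assumptions made for the example, and a real receipt template would likely need more tokens:

```csharp
using System;
using System.Globalization;

static class ReceiptTokens
{
    // Hypothetical helper: expands readable tokens stored in the database
    // into the control characters and dynamic values the printer needs.
    public static string Expand(string template, DateTime now)
    {
        return template
            .Replace("«ESC»", "\u001B") // ASCII escape character (27)
            .Replace("«LF»", "\n")      // assumed token for a linefeed
            .Replace("«Now»", now.ToString("dd.MM.yy HH:mm", CultureInfo.InvariantCulture));
    }
}
```

A footer stored as "foo«LF»bar" then reaches the printer as two lines, with no backslash handling involved at all.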

Escape string from file

I have to parse some files that contain some string that has characters in them that I need to escape. To make a short example you can imagine something like this:
var stringFromFile = "This is \\n a test \\u0085";
Console.WriteLine(stringFromFile);
The above results in the output:
This is \n a test \u0085
but I want the escape sequences interpreted (so that \n becomes a newline, for example). How do I do this in C#? The text contains Unicode characters too.
To make clear; The above code is just an example. The text contains the \n and unicode \u00xx characters from the file.
Example of the file contents:
Fisika (vanaf Grieks, \u03C6\u03C5\u03C3\u03B9\u03BA\u03CC\u03C2,
\"Natuurlik\", en \u03C6\u03CD\u03C3\u03B9\u03C2, \"Natuur\") is die
wetenskap van die Natuur
Try Regex.Unescape(string).
That should be the right way.
Don't use the @ symbol -- this interprets the string as 100% literal. Just take it off and all shall be well.
EDIT
I may have been a bit hasty with my reply. I think what you're asking is: how can I have C# turn the literal string '\n' into a newline, when read from a file (similar question for other escaped literals).
The answer is: you write it yourself. You need to search for "\\n" and convert it to "\n". Keep in mind that in C#, it's the compiler not the language that changes your strings into actual literals, so there's not some library call to do this (actually there could be -- someone look this up, quick).
EDIT
Aha! Eureka! Behold:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.unescape.aspx
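As a small sketch of how Regex.Unescape could be applied to text read from a file (the wrapper class and method names are made up for the example): it turns the two-character sequence backslash + n into a newline and \uXXXX into the corresponding character. Note that it is meant for regex escapes, so it throws an ArgumentException for a backslash followed by a letter or digit it does not recognize.

```csharp
using System;
using System.Text.RegularExpressions;

static class FileTextUnescaper
{
    // Hypothetical wrapper around Regex.Unescape for text that contains
    // literal backslash sequences such as \n or \u03C6.
    public static string Unescape(string raw)
    {
        return Regex.Unescape(raw);
    }
}
```

For example, FileTextUnescaper.Unescape(@"Grieks, \u03C6\u03C5") returns "Grieks, φυ".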
Since you are reading the string from a file, \n is not read as a unicode character but rather as two characters \ and n.
I would say you probably need a search and replace function to convert the string "\n" to the actual character '\n', and so on.
I don't think there's any easy way to do this. Because it's the job of lexical analyzer to parse literals.
I would try generating and compiling a class via CodeDOM with the string inserted there as constant. It's not very fast but it will do all escaping.

Why does .NET add an additional slash to the already existent slashes in a path?

I've noticed that C# adds additional slashes (\) to paths. Consider the path C:\Test. When I inspect the string with this path in the text visualiser, the actual string is C:\\Test.
Why is this? It confuses me, as sometimes I may want to split the path up (using string.Split()), but have to wonder which string to use (one or two slashes).
The \\ is used because \ is an escape character, so two of them are needed to represent a single \.
It says to treat the first \ as the escape character and the second \ as the actual value. Otherwise the character after the first \ would be parsed as part of an escape sequence.
Here is a list of available escape characters:
\' - single quote, needed for character literals
\" - double quote, needed for string literals
\\ - backslash
\0 - Null
\a - Alert
\b - Backspace
\f - Form feed
\n - New line
\r - Carriage return
\t - Horizontal tab
\v - Vertical tab
\u - Unicode escape sequence for character
\U - Unicode escape sequence for surrogate pairs.
\x - Unicode escape sequence similar to "\u" except with variable length.
EDIT: To answer your question regarding Split, it should be no issue. Use Split as you would normally. The \\ will be treated as only the one character of \.
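A short sketch to make the Split point concrete (the class and method names are just for illustration): the regular literal "C:\\Test" and the verbatim literal @"C:\Test" are the same seven-character string, and splitting on a single backslash character works on both.

```csharp
using System;

static class PathSlashDemo
{
    // '\\' in source code is one backslash character at run time.
    public static string[] SplitPath(string path)
    {
        return path.Split('\\');
    }
}
```

PathSlashDemo.SplitPath("C:\\Test") returns the two parts "C:" and "Test".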
.NET is not adding anything to your string here. What you're seeing is an effect of how the debugger chooses to display strings. C# string literals can be written in 2 forms:
Verbatim strings: prefixed with an @ sign, which removes the need to escape \ characters
Normal strings: standard C-style strings where \ characters need to be escaped as \\
The debugger will display a string as a normal string literal rather than a verbatim one. It's just an issue of display though; it doesn't affect its underlying value.
Debugger visualizers display strings in the form in which they would appear in C# code. Since \ is used to escape characters in non-verbatim C# strings, \\ is the correct escaped form.
Okay, so the answers above are not wholly correct. As such I am adding my findings for the next person who reads this post.
You cannot split a string using any of the chars in the table above if you are reading said string(s) from an external source.
i.e,
string[] splitStrings = File.ReadAllText([path]).Split((char)7);
will not split by those chars. However internally created strings work fine.
i.e.,
string[] splitStrings = "hello\agoodbye".Split((char)7);
This may not hold true for other methods of reading text from a file. I am unsure as I have not tested with other methods. With that in mind, it is probably best not to use those chars for delimiting strings!
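One way to check this, sketched under the assumption that a temporary file is an acceptable stand-in for the external source: write a string to a file, read it back and split on (char)7. If the file holds the actual bell character (U+0007), the split works; if it holds the two characters \ and a, there is no bell character to split on, which may be what happened above.

```csharp
using System;
using System.IO;

static class BellSplitCheck
{
    // Hypothetical helper: round-trips the content through a temp file
    // and reports how many parts Split((char)7) produces.
    public static int CountPartsViaFile(string content)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, content);
            return File.ReadAllText(path).Split((char)7).Length;
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

CountPartsViaFile("hello\agoodbye") writes a real bell character and yields 2 parts, while CountPartsViaFile(@"hello\agoodbye") writes a backslash followed by the letter a and yields 1.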

What is a binary null character?

I have a requirement to create a sysDesk log file. In this requirement I am supposed to create an XML file, that in certain places between the elements contains a binary null character.
Can someone please explain to me, firstly what is a binary null character, and how can I write one to a text file?
I suspect it means Unicode U+0000. However, that's not a valid character in an XML file... you should see if you can get a very clear specification of the file format to work out what's actually required. Sample files would also be useful :)
Comments are failing me at the moment, so to address a couple of other answers:
It's not a string termination character in C#, as C# doesn't rely on null termination to find the end of a string. All .NET strings are in fact null-terminated for the sake of interop, but more importantly the length is stored independently. In particular, a C# string can entirely validly include a null character without terminating it:
string embeddedNull = "a\0b";
Console.WriteLine(embeddedNull.Length); // Prints 3
The method given by rwmnau for getting a null character or string is very inefficient for something simple. Better would be:
string justNullString = "\0";
char justNullChar = '\0';
A binary null character is just a char with an integer/ASCII value of 0.
You can create a null character with Convert.ToChar(0) or the more common, more well-recognized '\0'.
A binary NULL character is one that's all zeros (0x00 in Hex). You can write:
System.Text.Encoding.ASCII.GetChars(new byte[] {00});
to get it in C#.
The null character is the special character that's represented by U+0000 (encoded by all-zero bits). The null character is represented in C# by the escape sequence \0, as in "This string ends with a null character.\0".
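To tie this back to encodings, here is a minimal sketch (the class and method names are made up for the example) showing that U+0000 encodes to a single 0x00 byte in UTF-8, and how a string could be checked for, or stripped of, null characters before being handed to a system that rejects them:

```csharp
using System;
using System.Text;

static class NullCharDemo
{
    // U+0000 becomes exactly one zero byte in UTF-8.
    public static byte[] Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetBytes(s);
    }

    // The char overload of IndexOf performs an ordinal search.
    public static bool ContainsNull(string s)
    {
        return s.IndexOf('\0') >= 0;
    }

    // Removes every null character from the string.
    public static string StripNulls(string s)
    {
        return s.Replace("\0", string.Empty);
    }
}
```

NullCharDemo.Utf8Bytes("a\0b") yields the three bytes 0x61, 0x00, 0x62, and NullCharDemo.StripNulls("a\0b") returns "ab".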
