Store UTF8 data in UTF16 column - c#

I'm storing XML in an XML column in SQL Server. SQL Server stores the data internally in UTF-16. Therefore the XML that is stored has to be in UTF-16.
The XML I have is in utf-8, it has this declaration on top:
<?xml version="1.0" encoding="UTF-8" ?>
When I try to insert xml with the UTF-8 declaration I get an exception saying something about the encoding. I can easily fix this in two ways:
by removing the declaration, or
by changing the declaration to:
<?xml version="1.0" encoding="UTF-16" ?>
Problem
I don't know if it's 'safe' or correct to just remove or replace the declaration. Will I lose data, or will the XML become corrupt? Or do I have to convert the string in C# from utf-8 to utf-16?

C# strings are sequences of UTF-16 code units (often loosely called UCS-2, the fixed-width predecessor of UTF-16). So when you read a UTF-8 string in C#, it is decoded into that in-memory form, and that is what you transmit to SQL Server.
You can change the xml declaration to encoding="UTF-16" or omit it altogether. UCS-2 and UTF-16 differ only for characters outside the Basic Multilingual Plane, which UTF-16 represents as surrogate pairs; I'd be interested in knowing how that affects C# and SQL Server!
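A minimal C# sketch of the "omit the declaration" route, assuming the XML arrives as UTF-8 bytes. The class and method names (XmlForSqlServer, ToSqlXmlText) are made up for illustration. It lets the XML reader decode the bytes, then re-serializes the document without a declaration, so the resulting .NET string can be sent as a Unicode (nvarchar or xml) parameter:

```csharp
using System.IO;
using System.Xml.Linq;

public static class XmlForSqlServer
{
    // Hypothetical helper: takes UTF-8 encoded XML bytes and returns the XML
    // text without an XML declaration.
    public static string ToSqlXmlText(byte[] utf8Xml)
    {
        using (var stream = new MemoryStream(utf8Xml))
        {
            // Loading from a stream lets the reader honour the BOM and the
            // encoding declaration found in the bytes.
            XDocument doc = XDocument.Load(stream);

            // XDocument.ToString() does not emit the XML declaration.
            return doc.ToString(SaveOptions.DisableFormatting);
        }
    }
}
```

No character data is lost this way: only the byte representation changes, not the characters themselves.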

SQL Server internally uses UCS-2 to store XML data, but this has nothing to do with the form in which you pass the data to SQL Server.
If for example you insert it using a varchar literal, make it an nvarchar literal instead and declare the encoding to be UTF-16. Sample:
DECLARE @VAR XML
INSERT INTO MyTable (MyXmlColumn)
VALUES (N'<?xml version="1.0" encoding="UTF-16" ?><doc></doc>')

Related

UTF-16 XML to HTML in .NET

I have a string stored in a SQL Server database table column that is currently a VarChar(Max) but the content is UTF-16 XML. Here is a sample:
<?xml version="1.0" encoding="utf-16" standalone="yes"?><Content><control name="txtGeneral" value="Hi Bryan,
This is a sample message stored in the database that I need to get out in HTML. I can&amp;#39;t seem to figure out how to get it out into HTML.
Thanks!
Robot.
-----Original Message-----
Date: 08-21-15 19:57
From: System Test, Microsoft Corp
To: Framework.NET
Subject: RE: RE: RE: RE:
" /></Content>
The data, stored raw, is not XML/datatype but I can do the conversion in my select (see below). I am pulling it out via .NET/ADO so I have it locally in a string for display in HTML. I just need to convert it for a textbox or HTML element so that it is displayed on the screen.
I can parse in t-sql the element (@value) I want but this does not do the encoding changes for me. Here is my sample query:
SELECT TOP 1 CONVERT(XML,CONVERT(NVARCHAR(MAX),m.Content)).value('(/Content/control/@value)[1]', 'varchar(max)')
FROM Messages m
WHERE MessageID = 85713;
I can use either .NET or t-sql for the conversion. I will be selecting only a single message at a time so performance should not be an issue.
This is what I would like it to look like:
Hi Bryan,
This is a sample message stored in the database that I need to get out in HTML. I can&amp;#39;t seem to figure out how to get it out into HTML.
Thanks!
Robot.
-----Original Message-----
Date: 08-21-15 19:57
From: System Test, Microsoft Corp
To: Framework.NET
Subject: RE: RE: RE: RE:
convert via: https://r12a.github.io/apps/conversion/
Thanks!
There are many serious flaws:
Do not store XML on string base, use the native XML type
Do not handle XML as a string, use the native XML methods
If - for any reason - you have to deal with it on string level use NVARCHAR(MAX)
Never use 1-byte-encoded VARCHAR(MAX). This will need extra conversions and can lead to silly errors.
Do not store the xml-declaration <?xml blah ?>. This is needed to specify a file's encoding. Within SQL-Server an XML is always unicode / UCS-2.
If you can change the above, you should really consider doing so. If not, here's an approach:
First cast the VARCHAR(MAX) to NVARCHAR(MAX), then to XML. Once the string is NVARCHAR(MAX), the UTF-16 declaration no longer causes a conflict. Then use .value() to retrieve the value of the attribute named value.
DECLARE @mockMessages TABLE(Content VARCHAR(MAX));
INSERT INTO @mockMessages VALUES
('<?xml version="1.0" encoding="utf-16" standalone="yes"?><Content><control name="txtGeneral" value="Hi Bryan,
This is a sample message stored in the database that I need to get out in HTML. I can&amp;#39;t seem to figure out how to get it out into HTML.
Thanks!
Robot.
-----Original Message-----
Date: 08-21-15 19:57
From: System Test, Microsoft Corp
To: Framework.NET
Subject: RE: RE: RE: RE:
" /></Content>');
SELECT CAST(CAST(m.Content AS NVARCHAR(MAX)) AS XML).value(N'(/Content/control/@value)[1]',N'nvarchar(max)')
FROM @mockMessages AS m;
The same is, in principle, valid for .Net.
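On the .Net side, a rough sketch of the same idea could look like this, assuming the column content has already been read into a string via ADO. The names MessageContent and ExtractValue are hypothetical. It parses the string as XML, reads the value attribute of the control element, and then decodes leftover HTML character references such as &#39;:

```csharp
using System.Net;
using System.Xml.Linq;

public static class MessageContent
{
    // Hypothetical helper: extracts /Content/control/@value and decodes
    // HTML character references left inside the attribute value.
    public static string ExtractValue(string xml)
    {
        // Parsing from a string: the text is already decoded, so the
        // encoding="utf-16" declaration causes no conflict here.
        XDocument doc = XDocument.Parse(xml);
        string raw = (string)doc.Root?.Element("control")?.Attribute("value");

        // "can&#39;t" becomes "can't"; returns null if the attribute is missing.
        return WebUtility.HtmlDecode(raw);
    }
}
```

One caveat: the XML specification normalizes literal line breaks inside attribute values to spaces, so the line structure of the message may not survive parsing unless the breaks were stored as character references.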
UPDATE: Some words about encoding
SQL-Server supports neither UTF-8 nor real UTF-16. There is a 1-byte encoding, which is extended ASCII (codepage/character mapping), and a 2-byte encoding, which is unicode / UCS-2 (which is almost UTF-16, at least for 99% of the usually seen characters). If you need your output UTF-8 encoded you must do this in your application. In almost any case you can consider SQL Server's XML output (in UCS-2) as UTF-16. The communication between SQL-Server and .Net-code is unicode by default.

Create xml file in binary format in C#

I want to create a xml file which has xml declaration, root node and child nodes.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<Tag1>
<SubTag>
<Id>
</Id>
<Name>IdentityManagement</Name>
<Time>4/11/2017 6:26:15 PM</Time>
<Message>Message1</Message>
</SubTag>
<SubTag>
<Id>
</Id>
<Name>MainWindow</Name>
<Time>4/11/2017 6:26:20 PM</Time>
<Message>Message2</Message>
</SubTag>
</Tag1>
But I need to write this xml in binary format, so no one can read it.
On calling of one function, one can add another SubTag.
So there can be any number of SubTag elements.
If you want to convert it into a form that is not trivially readable by a human, encode it to base64:
Convert.ToBase64String(textAsBytes);
If it should not be readable by anyone under any circumstances, encrypt it.
I am not sure what you mean when you say 'binary' though, all text is already binary when stored in a file, it is just encoded using an encoding scheme like ASCII or UTF8.
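To make the base64 suggestion concrete, here is a small sketch. The class name XmlObfuscation is made up, and UTF-8 is an assumed choice for turning the XML text into bytes. Note that base64 only makes the content unreadable at a glance; anyone can decode it.

```csharp
using System;
using System.Text;

public static class XmlObfuscation
{
    // Encodes the XML text as UTF-8 bytes, then as a base64 string.
    public static string ToBase64(string xml)
    {
        byte[] textAsBytes = Encoding.UTF8.GetBytes(xml);
        return Convert.ToBase64String(textAsBytes);
    }

    // Reverses ToBase64: base64 string back to the original XML text.
    public static string FromBase64(string base64)
    {
        byte[] textAsBytes = Convert.FromBase64String(base64);
        return Encoding.UTF8.GetString(textAsBytes);
    }
}
```

To add another SubTag later, the file would be decoded with FromBase64, modified as XML, and encoded again.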

Why does en-dash (–) trigger illegal XML character error (C#/SSMS)?

This is not a question on how to overcome the "XML parsing: ... illegal xml character" error, but about why it is happening? I know that there are fixes(1, 2, 3), but need to know where the problem arises from before choosing the best solution (what causes the error under the hood?).
We are calling a Java-based webservice using C#. From the strongly-typed data returned, we are creating an XML file that will be passed to SQL Server. The webservice data is encoded using UTF-8, so in C# we create the file, and specify UTF-8 where appropriate:
var encodingType = Encoding.UTF8;
// logic removed...
var xdoc = new XDocument();
xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
// logic removed...
System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);
This creates an XML file on disk that contains the following (abbreviated) data:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>
Notice that in the second record, - is different to –. I believe the second instance is en-dash.
If I open that XML file in Firefox/IE/VS2015, it opens without error. The W3C XML validator also works fine. But SSMS 2012 does not like it:
declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 25, illegal xml character
So why does en-dash cause the error? From my research, it would appear that
...only a few entities that need escaping: <,>,\,' and & in both HTML and
XML.
Source
...of which en-dash is not one. An encoded version (replacing – with &#8211;) works fine.
UPDATE
Based on the input, people state that en-dash isn't recognised as UTF-8, but yet it is listed here http://www.fileformat.info/info/unicode/char/2013/index.htm
So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 OR UTF-16)?
Please permit me to answer my own question, for the purpose of me understanding it fully myself. I won't accept this as the answer; it is the combination of the other answers that lead me here. If this answer helps you in the future, please upvote the other posts also.
The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate XML as UTF-16; the SSMS and .Net default.
Cause of original problem
This variable declares XML with UTF-8 encoding, but under that declaration the en-dash character cannot be used as-is; it has to be encoded. This is wrong:
DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 29, illegal xml character
Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML. The string here is not unicode, so the implicit conversion fails:
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 1, character 56, unable to switch the encoding
Solutions
Alternatives that work are:
1) Leave as UTF-8 but encode with hexadecimal on the entity (reference):
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#x2013; Bar" />
</records>';
2) As above but with decimal encoding on the entity (reference):
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#8211; Bar" />
</records>';
3) Include the original entity, but remove UTF-8 encoding in declaration (SSMS then applies UTF-16; its default):
DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
4) Retain the UTF-16 declaration, but cast the XML to Unicode (note the preceding N before casting as XML):
DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
Can you modify the XML encoding declaration? If so;
declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml
(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>
Speculative Edit
Both of these fail with illegal xml character:
set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'
because they pass a non-unicode varchar to the XML parser; the string contains Unicode so must be treated as such, i.e. as an nvarchar (utf-16) (otherwise the 3 bytes comprising the – are misinterpreted as multiple characters and one or more is not in the acceptable range for XML)
This does pass a nvarchar string to the parser,
but fails with unable to switch the encoding:
set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'
This is because an nvarchar (utf-16) string is passed to the XML parser but the XML document states it is utf-8, and the – is not represented the same way in the two encodings
This works as everything is utf-16
set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'
SQL Server internally uses UTF-16. Either leave the encoding out or cast to unicode
The reason you are looking for: With UTF-8 specified, this character is not known.
--without your directive, SQL Server picks its default
declare @xml XML =
'<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml;
--or UNICODE, but you must use UTF-16
declare @xml2 XML =
CAST('<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>' AS NVARCHAR(MAX));
select @xml2
UPDATE
UTF-8 means that chunks of 8 bits are used to carry information. The base characters are just one chunk, easy going...
Other characters can be encoded as well, using several chunks each. Lead bytes such as "c2" and "c3" start two-chunk sequences, and the en-dash (U+2013) needs three chunks: E2 80 93. But the internally used UTF16 expects 2 byte encoded characters.
Hope this is clear now...
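The byte-level difference can be checked from C# with a few lines. This is only an illustrative sketch; the class and method names (EnDashBytes, Hex) are invented here. It prints the bytes a piece of text occupies in a given encoding:

```csharp
using System;
using System.Text;

public static class EnDashBytes
{
    // Returns the bytes of the text in the given encoding as
    // dash-separated hex, e.g. "E2-80-93".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

With this helper, Hex(Encoding.UTF8, "\u2013") gives E2-80-93 (three bytes), Hex(Encoding.Unicode, "\u2013") gives 13-20 (one 2-byte unit, little-endian), and Hex(Encoding.UTF8, "-") gives 2D for the plain hyphen.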
UPDATE 2
This code will show you, that the Hyphen has the ASCII code 45 and your en-dash 150:
DECLARE @x VARCHAR(100)=
'<r RecordName="Option - Foo" /><r RecordName="Option – Bar" />';
WITH RunningNumbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM sys.objects
)
SELECT SUBSTRING(@x,Nmbr,1), ASCII(SUBSTRING(@x,Nmbr,1)) AS ASCII_Code
FROM RunningNumbers
WHERE ASCII(SUBSTRING(@x,Nmbr,1)) IS NOT NULL;
All characters with 7 bits are "plain" and should encode without problems. The "extended ASCII" range depends on code tables and can vary. 150 might be en-dash or something else. UTF8 uses some tricky encodings to allow strange characters to be "legal". Obviously (this was new to me too) the internally used UTF16 cannot cope with these multi-byte sequences.
The MSDN guidelines say:
SQLXML 4.0 relies upon the limited support for DTDs provided in SQL
Server. SQL Server allows for an internal DTD in xml data type data,
which can be used to supply default values and to replace entity
references with their expanded contents. SQLXML passes the XML data
"as is" (including the internal DTD) to the server. You can convert
DTDs to XML Schema (XSD) documents using third-party tools, and load
the data with inline XSD schemas into the database.

Write utf-8 to a sql server Text field using ADO.Net and maintain the UTF-8 bytes

I have some xml encoded as UTF-8 and I want to write this to a Text field in SQL Server. UTF-8 is byte compatible with Text so it should be able to do this and then read out the xml later still encoded as utf-8.
However special characters such as ÄÅÖ, which are multi-byte in UTF-8 get changed on the way.
I have code like this:
byte[] myXML = ...
SqlCommand _MyCommand = new SqlCommand(storeProcedureName, pmiDB.GetADOConnection());
_MyCommand.CommandType = CommandType.StoredProcedure;
_MyCommand.Parameters.Add("xmlText", SqlDbType.Text);
_MyCommand.Parameters["xmlText"].Value = Encoding.UTF8.GetString(myXML);
_MyCommand.ExecuteNonQuery();
My guess is that changing the xml byte array to string changes the special characters to UTF-16 characters which are then changed again to the Latin1. And Latin1 ÖÄÅ are not the same as UTF-8 ÖÄÅ.
How can I write the UTF-8 xml bytes to the Text field without them getting changed?
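The guess above can be reproduced without a database. The following sketch uses made-up names (RoundTripCheck, ThroughLatin1) and uses ISO-8859-1 as a stand-in for whatever single-byte code page the Text column's collation actually implies. It decodes UTF-8 bytes into a .NET string and re-encodes that string with the single-byte code page, which is roughly what happens to a string parameter bound to a Text column:

```csharp
using System.Text;

public static class RoundTripCheck
{
    // Decodes UTF-8 input to a .NET string, then re-encodes it with a
    // single-byte code page (ISO-8859-1 is an assumption made here).
    public static byte[] ThroughLatin1(byte[] utf8Bytes)
    {
        string text = Encoding.UTF8.GetString(utf8Bytes);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetBytes(text);
    }
}
```

For example, Ä is the two bytes C3 84 in UTF-8 but comes back as the single byte C4, so the stored bytes are no longer valid UTF-8 even though the character is the same.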
Define your column as NText or NVarchar
The solution that I got to work was to change the Stored Procedure so that the myXml parameter was Varbinary(Max), which allowed me to pass in the byte array. Then in the SP I Cast the Varbinary(max) to Varchar(max). This preserves the bytes as required for UTF-8
SET @myXMLText = CAST(@myXMLBinary AS VARCHAR(MAX))
if you want to store UTF-8 use binary then, because text is stored internally as UTF-16
If it's XML and if you're on SQL Server 2005 and up - use the XML column type! It's faster, it's more compact than VARCHAR(MAX) or NVARCHAR(MAX), you can associate it with an XML schema and thus ensure only valid XML is stored.... only benefits!
If you can't use the XML column type for whatever reason, then please at least drop the TEXT for VARCHAR(MAX) or NVARCHAR(MAX)! TEXT/NTEXT is deprecated and will go away - plus, with (N)VARCHAR(MAX), you get all the usual strings functions, too, that don't work on TEXT/NTEXT.

XML UTF-8 encoding checking

I have an XML structure like this; some Student items contain invalid UTF-8 byte sequences, which may cause XML parsing to fail for the whole XML document.
What I want to do is filter out the Student items which contain invalid UTF-8 byte sequences, and keep the ones with valid byte sequences. Any advice or samples about how to do this in .Net (C# preferred)?
BTW: invalid byte sequences I mean => http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
<?xml version="1.0" encoding="utf-8"?>
<AllStudents>
<Student>
Mike
</Student>
<Student>
(Invalid name here)
</Student>
</AllStudents>
thanks in advance,
George
That's pretty hard to do. You won't get an XML parser to parse a document with invalid characters in it, so I think you're reduced to a couple of options:
Figure out why the encoding is wrong - a common problem is labeling the document as UTF-8 (or having no encoding declaration) when the document is actually written in Latin-1.
Take out the bad sections by hand.
Try and find a tag soup parser for .NET that will continue parsing after the error.
Reject the invalid XML document.
I don't know C#, so I'm afraid I can't give you code to do this, but the basic idea is to read the whole file as a utf-8 text file, using a DecoderFallback to replace invalid sequences with either question mark characters or the unicode character 0xFFFD. Then write the file back out as a utf-8 text file, and parse that.
Basically, you separate out the operation of "wiping out bad utf-8 sequences" from the operation of "parsing the xml file".
You should probably even be able to skip writing the file back out again before running the XML parser to read in the fixed data; there should be some way to write the file to an in-memory byte stream and parse that byte stream as XML. (Again, sorry for not knowing C#)
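A possible C# rendering of that idea, offered as a sketch rather than a tested solution. The class and method names are hypothetical, and the Student element name is taken from the sample in the question. Invalid byte sequences are replaced with U+FFFD while decoding, the cleaned text is parsed in memory, and Student items containing the replacement character are then skipped:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class Utf8Sanitizer
{
    // Strict check: returns false if the bytes contain an invalid UTF-8 sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            // Second argument 'true' makes the decoder throw on invalid bytes.
            new UTF8Encoding(false, true).GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Lenient parse: invalid sequences become U+FFFD before the XML is parsed.
    public static XDocument ParseLenient(byte[] bytes)
    {
        Encoding lenientUtf8 = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback("\uFFFD"),
            new DecoderReplacementFallback("\uFFFD"));

        // Drop a leading byte order mark, if any, before parsing.
        string text = lenientUtf8.GetString(bytes).TrimStart('\uFEFF');
        return XDocument.Parse(text);
    }

    // Keeps only the Student elements whose text has no replacement character.
    public static IEnumerable<XElement> ValidStudents(XDocument doc)
    {
        return doc.Root.Elements("Student")
                  .Where(s => s.Value.IndexOf('\uFFFD') < 0);
    }
}
```

This keeps "wiping out bad utf-8 sequences" separate from "parsing the xml", and nothing is written back to disk. It cannot tell a replaced sequence from a U+FFFD that was genuinely present in the data.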
Very close to the "XML encoding issue" question.
