Why does en-dash (–) trigger illegal XML character error (C#/SSMS)?

This is not a question on how to overcome the "XML parsing: ... illegal xml character" error, but about why it happens. I know that there are fixes (1, 2, 3), but I need to know where the problem arises before choosing the best solution (what causes the error under the hood?).
We are calling a Java-based web service using C#. From the strongly-typed data returned, we create an XML file that is passed to SQL Server. The web service data is encoded using UTF-8, so in C# we create the file and specify UTF-8 where appropriate:
var encodingType = Encoding.UTF8;
// logic removed...
var xdoc = new XDocument();
xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
// logic removed...
System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);
This creates an XML file on disk that contains the following (abbreviated) data:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>
Notice that in the second record, - is different from –. I believe the second instance is an en-dash.
If I open that XML file in Firefox/IE/VS2015, it opens without error. The W3C XML validator also works fine. But SSMS 2012 does not like it:
declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 25, illegal xml character
So why does en-dash cause the error? From my research, it would appear that
...only a few entities that need escaping: <,>,\,' and & in both HTML and
XML.
Source
...of which en-dash is not one. An encoded version (replacing – with &#8211;) works fine.
UPDATE
Based on the input, people state that en-dash isn't recognised as UTF-8, but yet it is listed here http://www.fileformat.info/info/unicode/char/2013/index.htm
So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 OR UTF-16)?

Please permit me to answer my own question, for the purpose of understanding it fully myself. I won't accept this as the answer; it is the combination of the other answers that led me here. If this answer helps you in the future, please upvote the other posts also.
The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate XML as UTF-16, the SSMS and .NET default.
Cause of original problem
This variable declares XML with UTF-8 encoding, but the entity en-dash cannot be used without being encoded in UTF-8. This is wrong:
DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 29, illegal xml character
Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML. The string here is not unicode, so the implicit conversion fails:
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 1, character 56, unable to switch the encoding
Solutions
Alternatives that work are:
1) Leave as UTF-8 but use a hexadecimal character reference for the en-dash (reference):
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#x2013; Bar" />
</records>';
2) As above but with a decimal character reference (reference):
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#8211; Bar" />
</records>';
3) Include the original entity, but remove UTF-8 encoding in declaration (SSMS then applies UTF-16; its default):
DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
4) Retain the UTF-16 declaration, but cast the XML to Unicode (note the preceding N before casting as XML):
DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';

Can you modify the XML encoding declaration? If so:
declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml
(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>
Speculative Edit
Both of these fail with illegal xml character:
set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'
because they pass a non-unicode varchar to the XML parser; the string contains Unicode so must be treated as such, i.e. as an nvarchar (utf-16) (otherwise the 3 bytes comprising the – are misinterpreted as multiple characters and one or more is not in the acceptable range for XML)
This does pass an nvarchar string to the parser,
but fails with unable to switch the encoding:
set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'
This is because an nvarchar (UTF-16) string is passed to the XML parser, but the XML document states it is UTF-8, and the – is not encoded the same way in the two encodings.
This works as everything is UTF-16:
set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'
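The byte-level mismatch described above can be made visible outside SQL Server. The following is only a sketch in Python, not what SQL Server runs internally, and it assumes the non-Unicode (varchar) literal is stored in Windows code page 1252, which is common but depends on the database collation. The helper name can_decode is made up for this illustration.

```python
# Assumption: varchar data uses Windows code page 1252 (collation-dependent).
EN_DASH = "\u2013"


def can_decode(raw: bytes, declared_encoding: str) -> bool:
    """Return True if raw is a valid byte sequence in declared_encoding."""
    try:
        raw.decode(declared_encoding)
        return True
    except UnicodeDecodeError:
        return False


varchar_bytes = EN_DASH.encode("cp1252")      # one byte: 0x96
utf8_bytes = EN_DASH.encode("utf-8")          # three bytes: 0xE2 0x80 0x93
nvarchar_bytes = EN_DASH.encode("utf-16-le")  # one 16-bit code unit: 0x13 0x20

print(varchar_bytes.hex(), utf8_bytes.hex(), nvarchar_bytes.hex())

# A lone 0x96 is a stray continuation byte, so it is not valid UTF-8.
print(can_decode(varchar_bytes, "utf-8"))
# The genuine UTF-8 and UTF-16 byte sequences decode under their own encodings.
print(can_decode(utf8_bytes, "utf-8"))
print(can_decode(nvarchar_bytes, "utf-16-le"))
```

Under that assumption, the en-dash is legal in every encoding involved; what fails is reading the bytes of one encoding under the declaration of another.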

SQL Server internally uses UTF-16. Either leave the encoding out or cast to Unicode.
The reason you are looking for: With UTF-8 specified, this character is not known.
--without your directive, SQL Server picks its default
declare @xml XML =
'<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml;
--or UNICODE, but you must use UTF-16
declare @xml2 XML =
CAST('<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>' AS NVARCHAR(MAX));
select @xml2
UPDATE
UTF-8 means that chunks of 8 bits are used to carry information. The base (ASCII) characters take just one chunk, easy going...
Other characters are encoded as sequences of several chunks. There are "c2" and "c3" lead bytes (look here), which start two-chunk sequences; the en-dash (U+2013) needs three chunks (E2 80 93). The internally used UTF-16, by contrast, stores it as a single 2-byte code unit.
Hope this is clear now...
UPDATE 2
This code shows that the hyphen has the character code 45 and your en-dash 150:
DECLARE @x VARCHAR(100)=
'<r RecordName="Option - Foo" /><r RecordName="Option – Bar" />';
WITH RunningNumbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM sys.objects
)
SELECT SUBSTRING(@x,Nmbr,1), ASCII(SUBSTRING(@x,Nmbr,1)) AS ASCII_Code
FROM RunningNumbers
WHERE ASCII(SUBSTRING(@x,Nmbr,1)) IS NOT NULL;
Have a look here. All characters with 7 bits are "plain" and should encode without problems. The "extended ASCII" range depends on the code page and can vary: 150 might be an en-dash or something else. UTF-8 uses multi-byte sequences to make such characters "legal", and a lone byte 150 is not a valid UTF-8 sequence, so a parser that was told to expect UTF-8 rejects it (this was new to me too).
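For readers without a SQL Server instance at hand, the character-code listing above can be approximated in Python. This is a rough sketch that assumes code page 1252 for the VARCHAR value, as in the previous remarks; the function name non_ascii_codes is hypothetical.

```python
# Assumption: the VARCHAR value uses Windows code page 1252.
TEXT = '<r RecordName="Option - Foo" /><r RecordName="Option \u2013 Bar" />'


def non_ascii_codes(text: str, code_page: str = "cp1252") -> list:
    """List (character, byte value) pairs for characters outside 7-bit ASCII."""
    return [(ch, ch.encode(code_page)[0]) for ch in text if ord(ch) > 127]


print(ord("-"))               # the plain hyphen is 7-bit ASCII
print(non_ascii_codes(TEXT))  # only the en-dash falls outside ASCII
```

With a different single-byte code page the reported value may differ or the encode call may fail, which is the code-page dependence mentioned above.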

The MSDN guidelines say:
SQLXML 4.0 relies upon the limited support for DTDs provided in SQL
Server. SQL Server allows for an internal DTD in xml data type data,
which can be used to supply default values and to replace entity
references with their expanded contents. SQLXML passes the XML data
"as is" (including the internal DTD) to the server. You can convert
DTDs to XML Schema (XSD) documents using third-party tools, and load
the data with inline XSD schemas into the database.

Related

Adding line breaks to XML text [duplicate]

I have an XML file and I would like to make a new line in the text
"Sample Text 123" like this
Sample
Text 123
I've already tried everything, I mean &#xA &#xD \n, but nothing works:
<?xml version="1.0" encoding="UTF-8" ?>
<item>
<text>Address</text>
<data>
Sample &#xA; Text 123
</data>
</item>

A newline (aka line break or end-of-line, EOL) is a special character or character sequence that marks the end of a line of text. The exact codes used vary across operating systems:
Operating System          End-of-Line (EOL) marker
Unix                      LF
Mac OS up to version 9    CR
Windows, DOS              CR+LF
You can use &#xA; for line feed (LF) or &#xD; for carriage return (CR), and an XML parser will replace it with the respective character when handing off the parsed text to an application. These can be added manually, as you show in your example, but are particularly convenient when needing to add newlines programmatically within a string:
Common programming languages:
LF: "&#xA;"
CR: "&#xD;"
XSLT:
LF: <xsl:text>&#xA;</xsl:text>
CR: <xsl:text>&#xD;</xsl:text>
Or, if you want to see it in the XML immediately, simply put it in literally:
<?xml version="1.0" encoding="UTF-8" ?>
<item>
<text>Address</text>
<data>
Sample
Text 123
</data>
</item>
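To check how a parser hands these forms to an application, a small sketch with Python's standard xml.etree.ElementTree can help. Python is used here only as a convenient test bench; other conforming parsers should behave the same way for these inputs.

```python
import xml.etree.ElementTree as ET

# A character reference and a literal newline both reach the application as LF.
with_reference = ET.fromstring("<data>Sample&#xA;Text 123</data>").text
with_literal = ET.fromstring("<data>Sample\nText 123</data>").text

# A carriage return written as a character reference is delivered as CR.
with_cr = ET.fromstring("<data>Sample&#xD;Text 123</data>").text

print(repr(with_reference))
print(repr(with_literal))
print(repr(with_cr))
```

If the parsed text contains the newline but the output still shows one line, the cause lies in how the consuming application renders the text rather than in the XML.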
Newline still not showing up?
Keep in mind that how an application interprets text, including newlines, is up to it. If you find that your newlines are being ignored, it might be that the application automatically runs together text separated by newlines.
HTML browsers, for example, will ignore newlines (and will normalize space within text such that multiple spaces are consolidated). To break lines in HTML,
use <br/>; or
wrap block in an element such as a div or p which by default causes a line break after the enclosed text, or in an element such as pre which by default typically will preserve whitespace and line breaks; or
use CSS styling such as white-space to control newline rendering.
XML application not cooperating?
If an XML application isn't respecting your newlines, and working within the application's processing model isn't helping, another possible recourse is to use CDATA to tell the XML parser not to parse the text containing the newline.
<?xml version="1.0" encoding="UTF-8" ?>
<item>
<text>Address</text>
<data>
<![CDATA[Sample
Text 123]]>
</data>
</item>
or, if HTML markup is recognized downstream:
<?xml version="1.0" encoding="UTF-8" ?>
<item>
<text>Address</text>
<data>
<![CDATA[Sample <br/>
Text 123]]>
</data>
</item>
Whether this helps will depend upon application-defined semantics of one or more stages in the pipeline of XML processing that the XML passes through.
Bottom line
A newline (aka line break or end-of-line, EOL) can be added much like any character in XML, but be mindful of
differing OS conventions
differing XML application semantics

Create xml file in binary format in C#

I want to create a xml file which has xml declaration, root node and child nodes.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<Tag1>
<SubTag>
<Id>
</Id>
<Name>IdentityManagement</Name>
<Time>4/11/2017 6:26:15 PM</Time>
<Message>Message1</Message>
</SubTag>
<SubTag>
<Id>
</Id>
<Name>MainWindow</Name>
<Time>4/11/2017 6:26:20 PM</Time>
<Message>Message2</Message>
</SubTag>
</Tag1>
But I need to write this xml in binary format, so no one can read it.
On calling of one function, one can add another SubTag.
So there can be n number of <SubTag> elements.

If you want to convert it into a form that is not trivially readable by a human, encode it to base64:
Convert.ToBase64String(textAsBytes);
If it should not be readable by anyone under any circumstances, encrypt it.
I am not sure what you mean when you say 'binary' though, all text is already binary when stored in a file, it is just encoded using an encoding scheme like ASCII or UTF8.
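To illustrate why base64 only hides the text from casual reading, here is a short round-trip sketch in Python (the C# equivalents are Convert.ToBase64String and Convert.FromBase64String). The sample XML is shortened from the question.

```python
import base64

xml_text = "<Tag1><SubTag><Name>MainWindow</Name></SubTag></Tag1>"

# Encode: text -> UTF-8 bytes -> base64 text. No angle brackets remain visible.
encoded = base64.b64encode(xml_text.encode("utf-8")).decode("ascii")

# Decode: anyone can reverse this step without a key, so it is not encryption.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == xml_text)
```

Because the decode step needs no secret, this is obfuscation only; real confidentiality needs encryption, as noted above.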

Using IntermediateSerializer, how do I deserialize a list in an object?

Let's say, for example, this is my class...
public class DoodadData
{
public List<Color> colorVariations;
}
...and this is my XML data I'm deserializing...
<?xml version="1.0" encoding="utf-8" ?>
<XnaContent>
<Asset Type="Data.DoodadData">
<colorVariations>
<Item>
<R>0</R>
<G>0</G>
<B>0</B>
<A>0</A>
</Item>
</colorVariations>
</Asset>
</XnaContent>
Is there something I need to change to get this to work? The error that MSVC is giving me says...
"There was an error while deserializing intermediate XML. 'Element' is an invalid XmlNodeType. Line 20, position 5."
Which is pointing me to the first "Item" tag in the colorVariations List. Everything I've found on Google tells me that naming the elements "Item" is correct when using IntermediateSerializer. I've also tried naming them "Element" and "Color" to no avail. (I've also tried other things, like renaming the RGBA properties, which also didn't work).

After messing around with it, I've found that it seems like the colors need to be entered tag-less and in hex format, like so:
<colorVariations>
FFFFFFFF
FFFFFFFF
FFFFFFFF
</colorVariations>

Store UTF8 data in UTF16 column

I'm storing XML in an XML column in SQL Server. SQL Server stores the data internally in UTF-16. Therefore the XML that is stored has to be in UTF-16.
The XML I have is in utf-8, it has this declaration on top:
<?xml version="1.0" encoding="UTF-8" ?>
When I try to insert xml with the UTF-8 declaration I get an exception saying something about the encoding. I can easily fix this in two ways:
by removing the declaration or
by changing the declaration to:
<?xml version="1.0" encoding="UTF-16" ?>
Problem
I don't know if it's 'safe' or correct to just remove or replace the declaration. Will I lose data, or will the XML become corrupt? Or do I have to convert the string in C# from utf-8 to utf-16?

C# strings are sequences of UTF-16 code units (UCS-2, the older fixed-width encoding, covers only the Basic Multilingual Plane). So when you read a UTF-8 string in C#, it is converted to UTF-16, and that is what you transmit to SQL Server.
You can change the XML declaration to encoding="UTF-16" or omit it altogether. There are some differences between UCS-2 and UTF-16; I'd be interested in knowing how that affects C# and SQL Server!

SQL Server internally uses UCS-2 to store XML data, but this has nothing to do with the form in which you pass the data to SQL Server.
If for example you insert it using a varchar literal, make it an nvarchar literal instead and declare the encoding to be UTF-16. Sample:
DECLARE @VAR XML
INSERT INTO MyTable (MyXmlColumn)
VALUES (N'<?xml version="1.0" encoding="UTF-16" ?><doc></doc>')
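On the worry about losing data: converting between UTF-8 and UTF-16 is lossless, because both can represent every Unicode character. A minimal sketch in Python follows; the helper to_utf16_document is hypothetical, and the simple regular expression assumes the first encoding="..." in the text is the one in the XML declaration.

```python
import re


def to_utf16_document(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8 XML, update the declared encoding, re-encode as UTF-16."""
    text = utf8_bytes.decode("utf-8")
    # Assumption: the first encoding="..." belongs to the XML declaration.
    text = re.sub(r'encoding="[^"]*"', 'encoding="UTF-16"', text, count=1)
    return text.encode("utf-16")  # Python prepends a byte order mark here


source = '<?xml version="1.0" encoding="UTF-8" ?><doc>Option \u2013 Bar</doc>'
converted = to_utf16_document(source.encode("utf-8"))

# Decoding the result shows the same characters with the new declaration.
print(converted.decode("utf-16"))
```

In C# the decode step has already happened once the XML is held in a string, so only the declaration needs to match what is sent.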

Is there any way to FORCIBLY allow the usage of '<' and/or '>' in XML files?

I have written a C# application that loads XML files, parses them and uses the information to run SQL queries and send the results to email distribution lists.
These XML files are usually created by END users.
Currently I have them replacing > and < with &gt; and &lt; in the SQL; of course, being END users, they sometimes forget. In fact they ALWAYS forget. I'd prefer to keep the query in an XML file. So, is there ANY way to force/allow the use of these special characters in XML files?
Right now my user must type this:
<?xml version="1.0" encoding="utf-8" ?>
<report>
<queries>
<query>
SELECT * FROM THETABLE WHERE THEVALUE &gt; 100
</query>
</queries>
</report>
I'd like them to be able to type this:
<?xml version="1.0" encoding="utf-8" ?>
<report>
<queries>
<query>
SELECT * FROM THETABLE WHERE THEVALUE > 100
</query>
</queries>
</report>

You can wrap your queries in CDATA:
<?xml version="1.0" encoding="utf-8" ?>
<report>
<queries>
<query><![CDATA[
SELECT * FROM THETABLE WHERE THEVALUE > 100
]]></query>
</queries>
</report>

Use CDATA; text inside CDATA is not parsed. Something like this:
<query><![CDATA[SELECT * FROM THETABLE WHERE THEVALUE > 100]]></query>

You would need to surround the text with CDATA so it looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<report>
<queries>
<query>
<![CDATA[SELECT * FROM THETABLE WHERE THEVALUE > 100]]>
</query>
</queries>
</report>
This tells the parser that everything between should be treated as text and should not be interpreted.

Use CDATA. So:
<query><![CDATA[SELECT * FROM THETABLE WHERE THEVALUE > 100]]></query>
The text inside a CDATA section is ignored by the parser.
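The effect of CDATA can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree purely as a test bench; the column names A and B are made up for the example.

```python
import xml.etree.ElementTree as ET


def query_text(xml_source: str) -> str:
    """Parse a <query> element and return its text without outer whitespace."""
    return ET.fromstring(xml_source).text.strip()


# Hypothetical query using both comparison operators.
wrapped = "<query><![CDATA[SELECT * FROM THETABLE WHERE A < 100 AND B > 5]]></query>"
raw = "<query>SELECT * FROM THETABLE WHERE A < 100 AND B > 5</query>"

print(query_text(wrapped))  # the SQL arrives unchanged

try:
    query_text(raw)
except ET.ParseError as err:
    print("rejected:", err)  # the bare < is not well-formed XML
```

Of the two characters, it is the bare < that makes the document ill-formed. The one sequence a CDATA section cannot contain is ]]>, which is unlikely to appear in a SQL query.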

You can preprocess the file with a regular expression which looks for < and > characters that don't belong to a tag, and replace them accordingly.
You can use this regex:
(?sx)
\s*
(?:<\?.*?\?>)(?:\s*)
(?:
(?:<[^\s]*?>)\s*
|(?:[^<>]*\s)
|(?<lt><)
|(?<gt>>)
)*
\s*
(Be aware that you must use the single-line and ignore-whitespace options, as established by (?sx).)
This expression captures the less-than and greater-than symbols that do not belong to tags in the lt and gt groups.
You can then replace the matches.
If you want to know how it works, this captures everything in named groups:
(?sx)
\s*
(?<head><\?.*?\?>)(?:\s*)
(?:
(?<tag><[^\s]*?>)\s*
|(?<others>[^<>]*\s)
|(?<lt><)
|(?<gt>>)
)*
\s*
