This is not a question on how to overcome the "XML parsing: ... illegal xml character" error, but about why it is happening. I know that there are fixes (1, 2, 3), but I need to know where the problem arises from before choosing the best solution (what causes the error under the hood?).
We are calling a Java-based webservice using C#. From the strongly-typed data returned, we are creating an XML file that will be passed to SQL Server. The webservice data is encoded using UTF-8, so in C# we create the file and specify UTF-8 where appropriate:
var encodingType = Encoding.UTF8;
// logic removed...
var xdoc = new XDocument();
xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
// logic removed...
System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);
This creates an XML file on disk that contains the following (abbreviated) data:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>
Notice that in the second record, - is different to –. I believe the second instance is en-dash.
If I open that XML file in Firefox/IE/VS2015, it opens without error. The W3C XML validator also works fine. But SSMS 2012 does not like it:
declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 25, illegal xml character
So why does en-dash cause the error? From my research, it would appear that
...only a few entities that need escaping: <,>,",' and & in both HTML and
XML.
Source
...of which en-dash is not one. An encoded version (replacing – with &#8211;) works fine.
UPDATE
Based on the input, people state that en-dash isn't recognised as UTF-8, but yet it is listed here http://www.fileformat.info/info/unicode/char/2013/index.htm
So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 OR UTF-16)?
Please permit me to answer my own question, for the purpose of understanding it fully myself. I won't accept this as the answer; it is the combination of the other answers that led me here. If this answer helps you in the future, please upvote the other posts also.
The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate XML as UTF-16; the SSMS and .Net default.
Cause of original problem
This variable declares XML with UTF-8 encoding, but the entity en-dash cannot be used without being encoded in UTF-8. This is wrong:
DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 29, illegal xml character
Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML. The string here is not Unicode, so the implicit conversion fails:
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 1, character 56, unable to switch the encoding
Solutions
Alternatives that work are:
1) Leave as UTF-8 but encode with hexadecimal on the entity (reference):
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#x2013; Bar" />
</records>';
2) As above but with decimal encoding on the entity (reference):
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#8211; Bar" />
</records>';
3) Include the original entity, but remove UTF-8 encoding in declaration (SSMS then applies UTF-16; its default):
DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
4) Retain the UTF-16 declaration, but cast the XML to Unicode (note the preceding N before casting as XML):
DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
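Options 1 and 2 amount to replacing every non-ASCII character with a numeric character reference before the XML text reaches SQL Server. Here is a minimal sketch of both forms, written in Python rather than the C# used in the question; the function names are hypothetical and only serve the illustration:

```python
def to_decimal_refs(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference such as &#8211;."""
    # "xmlcharrefreplace" is a built-in codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_hex_refs(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal reference such as &#x2013;."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


record = '<r RecordName="Option \u2013 Bar" />'
print(to_decimal_refs(record))  # <r RecordName="Option &#8211; Bar" />
print(to_hex_refs(record))      # <r RecordName="Option &#x2013; Bar" />
```

Because the result is pure ASCII, it reads the same under any of the encodings discussed here, which is presumably why these two options work regardless of the declaration.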
Can you modify the XML encoding declaration? If so:
declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml
(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>
Speculative Edit
Both of these fail with illegal xml character:
set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'
because they pass a non-unicode varchar to the XML parser; the string contains Unicode so must be treated as such, i.e. as an nvarchar (utf-16) (otherwise the 3 bytes comprising the – are misinterpreted as multiple characters and one or more is not in the acceptable range for XML)
This does pass a nvarchar string to the parser,
but fails with unable to switch the encoding:
set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'
This is because an nvarchar (utf-16) string is passed to the XML parser but the XML document states it is utf-8, and the – is not equivalent in the two encodings
This works as everything is utf-16
set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'
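The byte-level difference behind these cases can be checked outside SQL Server. The following Python sketch assumes the non-Unicode literal is stored using code page 1252; that is an assumption, since the actual code page depends on the database collation:

```python
EN_DASH = "\u2013"

as_cp1252 = EN_DASH.encode("cp1252")    # one byte, as in a varchar under code page 1252 (assumed)
as_utf8 = EN_DASH.encode("utf-8")       # three bytes
as_utf16 = EN_DASH.encode("utf-16-le")  # one 16-bit code unit, as in an nvarchar

print(as_cp1252.hex(), as_utf8.hex(), as_utf16.hex())  # 96 e28093 1320

# The single cp1252 byte is not a valid UTF-8 sequence, so a parser that has been
# told encoding="utf-8" cannot decode it.
try:
    as_cp1252.decode("utf-8")
    cp1252_is_valid_utf8 = True
except UnicodeDecodeError:
    cp1252_is_valid_utf8 = False

print(cp1252_is_valid_utf8)  # False
```

Under that assumption, the varchar literal never contains the three UTF-8 bytes at all; it contains a single byte that is meaningful only in the code page.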
SQL Server internally uses UTF-16. Either leave the encoding out or cast to Unicode.
The reason you are looking for: With UTF-8 specified, this character is not known.
--without your directive, SQL Server picks its default
declare @xml XML =
'<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml;
--or UNICODE, but you must use UTF-16
declare @xml2 XML =
CAST('<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>' AS NVARCHAR(MAX));
select @xml2
UPDATE
UTF-8 means that chunks of 8 bits are used to carry information. The base (ASCII) characters are just one chunk, easy going...
Other characters can be encoded as well, using multi-byte sequences. There are "c2" and "c3" lead bytes (look here), which start two-chunk sequences; the en-dash needs three chunks. But the internally used UTF-16 expects 2-byte code units.
Hope this is clear now...
UPDATE 2
This code will show you that the hyphen has the ASCII code 45 and your en-dash 150:
DECLARE @x VARCHAR(100)=
'<r RecordName="Option - Foo" /><r RecordName="Option – Bar" />';
WITH RunningNumbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM sys.objects
)
SELECT SUBSTRING(@x,Nmbr,1), ASCII(SUBSTRING(@x,Nmbr,1)) AS ASCII_Code
FROM RunningNumbers
WHERE ASCII(SUBSTRING(@x,Nmbr,1)) IS NOT NULL;
Have a look here. All characters with 7 bits are "plain" and should encode without problems. The "extended ASCII" range depends on code tables and can vary: 150 might be an en-dash or something else. UTF-8 uses some tricky encodings to allow strange characters to be "legal". Obviously (this was new to me too) the internally used UTF-16 cannot cope with c3-characters.
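To see how much the meaning of code 150 depends on the code table, here is a small Python sketch that decodes the same byte under three single-byte encodings. The choice of code pages is only illustrative:

```python
hyphen_code = ord("-")
byte_150 = bytes([150])

as_cp1252 = byte_150.decode("cp1252")   # Windows Western European: en-dash
as_cp437 = byte_150.decode("cp437")     # original IBM PC code page: a different letter
as_latin1 = byte_150.decode("latin-1")  # ISO 8859-1: a C1 control character

print(hyphen_code, repr(as_cp1252), repr(as_cp437), repr(as_latin1))
```

Only the 7-bit value 45 is stable; the value 150 turns into an en-dash only when the byte is read as code page 1252.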
The MSDN guidelines say:
SQLXML 4.0 relies upon the limited support for DTDs provided in SQL
Server. SQL Server allows for an internal DTD in xml data type data,
which can be used to supply default values and to replace entity
references with their expanded contents. SQLXML passes the XML data
"as is" (including the internal DTD) to the server. You can convert
DTDs to XML Schema (XSD) documents using third-party tools, and load
the data with inline XSD schemas into the database.
I am using an xsl stylesheet to output an xsl:fo document with an SVG chart embedded.
I am having trouble taking an array from the input, and splitting it into several smaller arrays, stored in variables so that the SVG template can be applied to the different arrays to generate 3 different charts. The input looks like this (please note the custom ns):
<root xmlns="http://xml.mynamespace.com">
<data>
<list>
<item>
<id>1</id>
<title>Foo</title>
<score>10</score>
</item>
<item>
<id>2</id>
<title>Bar</title>
<score>6</score>
</item>
<item>
<id>3</id>
<title>Baz</title>
<score>16</score>
</item>
<item>
<id>4</id>
<title>Fizz</title>
<score>14</score>
</item>
<item>
<id>5</id>
<title>Buzz</title>
<score>7</score>
</item>
</list>
</data>
</root>
These values can be split into 3 distinct groups. I am trying to split the array list into 3 separate variables so that a template can be applied to turn them into an SVG chart. The SVG transform is known to work for the array as above, so I think the problem is the way I am trying to create the variables. I have tried a few different ways, but I have had the most success (if you can call it that) using xsl:copy-of like so (again, please be aware of the ns):
<xslt:stylesheet xmlns:m="http://xml.mynamespace.com" version="1.0">
<xsl:variable name="group1">
<xsl:element name="m:list">
<xsl:copy-of select="/m:root/m:data/m:list/m:item[id &lt;= 3]"/>
</xsl:element>
</xsl:variable>
</xslt:stylesheet>
and then later the variable is used like so:
<xsl:apply-templates select="msxsl:node-set($group1)/m:list" />
The reason I am putting them in variables is because the template that creates the SVG expects input in the format of <list> with one or more child item elements. The SVG transform template is as so:
<xsl:template match="m:list">
<xsl:variable name="canvasHeight" select="28 * count(m:item)"/>
<svg height="{$canvasHeight}">
<xsl:for-each select="m:item">
<!-- Draw bar here -->
</xsl:for-each>
</svg>
</xsl:template>
The output when I try to transform the variable to SVG as above indicates that the list element is created correctly (because the template matches and the SVG element is output) but the item elements aren't copied because the for-each doesn't seem to have executed and the outputted height is 0.
Am I incorrectly creating the variable group1? Or is there an easier way to do this that doesn't require splitting the initial list into separate variables?
Well with /m:root/m:data/m:list/m:item[Id <= 3] you simply have the wrong case (Id versus id) and the wrong namespace (none versus m:id) in the predicate.
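The effect of the missing namespace prefix can be reproduced outside XSLT. This is a minimal Python sketch using xml.etree.ElementTree on a shortened copy of the input; it is an analogy for the XPath predicate, not the XSLT fix itself, and the variable names are made up for the illustration:

```python
import xml.etree.ElementTree as ET

SOURCE = """<root xmlns="http://xml.mynamespace.com">
  <data><list>
    <item><id>1</id><title>Foo</title><score>10</score></item>
    <item><id>4</id><title>Fizz</title><score>14</score></item>
  </list></data>
</root>"""

NS = {"m": "http://xml.mynamespace.com"}
root = ET.fromstring(SOURCE)
items = root.findall("m:data/m:list/m:item", NS)

# Unqualified name: looks for <id> in no namespace, which does not exist here,
# so nothing is selected (the empty-list symptom from the question).
unqualified = [item for item in items if item.find("id") is not None]

# Qualified name: finds the <id> children, so the numeric filter can apply.
group1 = [item for item in items if int(item.findtext("m:id", namespaces=NS)) <= 3]
titles = [item.findtext("m:title", namespaces=NS) for item in group1]

print(len(unqualified), titles)  # 0 ['Foo']
```

In the stylesheet, the equivalent change is to qualify the element inside the predicate, i.e. m:id rather than id.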
I have an XML schema document in this format:
<Schema xmlns="urn:schemas-microsoft-com:xml-data"
xmlns:dt="urn:schemas-microsoft-com:datatypes">
<AttributeType name="scale" default="4.0"/>
<ElementType name="GPA" content="textOnly" dt:type="float">
<attribute type="scale"/>
</ElementType>
<AttributeType name="studentID"/>
<ElementType name="student" content="eltOnly" model="open" order="many">
<attribute type="studentID"/>
<element type="GPA"/>
</ElementType>
</Schema>
I want to generate classes in C# from the given schema. I read online that XSD.exe
can generate classes only from W3C XML Schemas.
Is there any way to convert this format into W3C XML Schema?
I am new to XSD and tried rewriting it, but I am getting lots of errors.
Please help.
Thanks in advance.
You could try this:
C# Auto generation of class objects from XSD
Alternatively, I just wrote a generic modeler that's open source. You can use it to generate classes in whatever language you'd like.
https://github.com/homer6/modeler
If you fork the crudecppmodeler branch, it'd be similar to C#. Either that, or I can specifically design it for you.
It's not based off of XSD, but I could adapt it to. I plan to write support for multiplicity in the next few days.
Here's the sample format: https://github.com/homer6/modeler/blob/crudecppmodeler/simple.jm
Hope that helps...
How can I parse this XML with C# on WP7 into different lists, for binding to different panorama pages?
<root>
<main1>
<item>
<id>1</id>
</item>
<item>
<id>2</id>
</item>
</main1>
<main2>
<item>
<id>1</id>
</item>
<item>
<id>2</id>
</item>
</main2>
</root>
The most efficient (and memory-preserving) way to parse large XML documents is to use the XmlReader. See the MSDN how-to for a reasonable example.
The easiest way to parse an XML document is usually the XDocument class, but that class reads the whole document into memory at once and is not recommended for large documents.
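The trade-off between the two approaches can be sketched in a language-neutral way. The Python example below is not WP7 C#: it uses xml.etree.ElementTree as a stand-in, with a whole-document parse playing the role of XDocument and iterparse playing the role of XmlReader. It assumes the sample XML with the closing tag of main2 corrected:

```python
import io
import xml.etree.ElementTree as ET

XML = """<root>
  <main1><item><id>1</id></item><item><id>2</id></item></main1>
  <main2><item><id>1</id></item><item><id>2</id></item></main2>
</root>"""

# Whole document in memory (the XDocument-style approach).
root = ET.fromstring(XML)
lists_dom = {
    section.tag: [item.findtext("id") for item in section.findall("item")]
    for section in root
}

# Streaming (the XmlReader-style approach): handle each section when its end tag
# is seen, then release its children so memory use stays small.
lists_stream = {}
for _event, elem in ET.iterparse(io.BytesIO(XML.encode("utf-8")), events=("end",)):
    if elem.tag in ("main1", "main2"):
        lists_stream[elem.tag] = [item.findtext("id") for item in elem.findall("item")]
        elem.clear()

print(lists_dom)
print(lists_stream)
```

Both variants end up with one list per section, which is the shape needed to bind each list to its own page.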
Duplicate: This is a duplicate of Best practices to parse xml files with C#? and many others (see https://stackoverflow.com/search?q=c%23+parse+xml). Please close it and do not answer.
How do you parse an XML document from the bottom up in C#?
For example:
<Employee>
<Name> Test </Name>
<ID> 123 </ID>
</Employee>
<Company>
<Name>ABC</Name>
<Email>test@ABC.com</Email>
</Company>
There are many nodes like these. I need to start parsing from the bottom up, i.e. first parse <Company>, then the node above it, and so on. How do I go about this in C#?
Try this:
XmlDocument doc = new XmlDocument();
doc.Load(@"C:\Path\To\Xml\File.xml");
Or alternatively if you have the XML in a string use the LoadXml method.
Once you have it loaded, you can use SelectNodes and SelectSingleNode to query specific values, for example:
XmlNode node = doc.SelectSingleNode("//Company/Email/text()");
// node.Value contains "test@ABC.com"
Finally, note that your XML is invalid as it doesn't contain a single root node. It must be something like this:
<Data>
<Employee>
<Name>Test</Name>
<ID>123</ID>
</Employee>
<Company>
<Name>ABC</Name>
<Email>test@ABC.com</Email>
</Company>
</Data>
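Once the document has a single root, "bottom up" can be read as visiting the root's children in reverse document order. A minimal sketch of that idea in Python (an illustration of the traversal order only, not the XmlDocument API; the variable names are made up):

```python
import xml.etree.ElementTree as ET

XML = """<Data>
  <Employee><Name>Test</Name><ID>123</ID></Employee>
  <Company><Name>ABC</Name><Email>test@ABC.com</Email></Company>
</Data>"""

root = ET.fromstring(XML)

# Top-level children in reverse document order: the last element comes first.
bottom_up = [child.tag for child in reversed(list(root))]

# Same lookup as the //Company/Email query shown earlier.
email = root.findtext(".//Company/Email")

print(bottom_up, email)  # ['Company', 'Employee'] test@ABC.com
```

With XmlDocument, the same order could presumably be obtained by walking the root's child nodes from the last index down to the first.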