Adding line breaks to XML text [duplicate] - c#

I have an XML file and I would like to make a new line in the text
"Sample Text 123" like this
Sample
Text 123
I've already tried everything, I mean &#xA &#xD \n, but nothing works:
<?xml version="1.0" encoding="UTF-8" ?>
<item>
<text>Address</text>
<data>
Sample
Text 123
</data>
</item>

A newline (aka line break or end-of-line, EOL) is a special character or character sequence that marks the end of a line of text. The exact codes used vary across operating systems:
Operating System         End-of-Line (EOL) marker
Unix                     LF
Mac OS up to version 9   CR
Windows, DOS             CR+LF
You can use &#xA; for line feed (LF) or &#xD; for carriage return (CR), and an XML parser will replace it with the respective character when handing off the parsed text to an application. These can be added manually, as you show in your example, but are particularly convenient when needing to add newlines programmatically within a string:
Common programming languages:
LF: "&#xA;"
CR: "&#xD;"
XSLT:
LF: <xsl:text>&#xA;</xsl:text>
CR: <xsl:text>&#xD;</xsl:text>
Or, if you want to see it in the XML immediately, simply put it in literally:
<?xml version="1.0" encoding="UTF-8" ?>
<item>
<text>Address</text>
<data>
Sample
Text 123
</data>
</item>
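To check what a parser actually hands to the application, here is a minimal Python sketch using the standard library's xml.etree.ElementTree (the choice of language and parser is an assumption for illustration, not part of the original answer). It parses the same element written once with the &#xA; reference and once with a literal newline:

```python
import xml.etree.ElementTree as ET

# The same element written two ways: with a character reference
# and with a literal newline inside the text.
with_reference = ET.fromstring("<data>Sample&#xA;Text 123</data>")
with_literal = ET.fromstring("<data>Sample\nText 123</data>")

# The parser resolves &#xA; to the line feed character (U+000A),
# so the application receives identical text in both cases.
print(repr(with_reference.text))
print(with_reference.text == with_literal.text)
```

If both forms compare equal here but the line break still does not show up in your output, the parser is not the culprit; the consuming application is.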
Newline still not showing up?
Keep in mind that how an application interprets text, including newlines, is up to it. If you find that your newlines are being ignored, it might be that the application automatically runs together text separated by newlines.
HTML browsers, for example, will ignore newlines (and will normalize space within text such that multiple spaces are consolidated). To break lines in HTML,
use <br/>; or
wrap block in an element such as a div or p which by default causes a line break after the enclosed text, or in an element such as pre which by default typically will preserve whitespace and line breaks; or
use CSS styling such as white-space to control newline rendering.
XML application not cooperating?
If an XML application isn't respecting your newlines, and working within the application's processing model isn't helping, another possible recourse is to use CDATA to tell the XML parser not to parse the text containing the newline.
<?xml version="1.0" encoding="UTF-8" ?>
<item>
<text>Address</text>
<data>
<![CDATA[Sample
Text 123]]>
</data>
</item>
or, if HTML markup is recognized downstream:
<?xml version="1.0" encoding="UTF-8" ?>
<item>
<text>Address</text>
<data>
<![CDATA[Sample <br/>
Text 123]]>
</data>
</item>
Whether this helps will depend upon application-defined semantics of one or more stages in the pipeline of XML processing that the XML passes through.
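As a rough illustration of what CDATA changes, the following Python sketch (standard library xml.etree.ElementTree, chosen here only as an example parser) parses a shortened version of the second document. The <br/> inside the CDATA section arrives as plain text rather than as a child element, and the newline is still an ordinary newline character:

```python
import xml.etree.ElementTree as ET

# Shortened version of the CDATA example: markup inside CDATA is not parsed.
doc = "<data><![CDATA[Sample <br/>\nText 123]]></data>"
data = ET.fromstring(doc)

# The text contains the literal characters "<br/>" followed by a newline.
print(repr(data.text))

# No child elements were created from the content of the CDATA section.
print(len(data))
```

Whether the downstream application then interprets that "<br/>" string as HTML is outside the parser's control.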
Bottom line
A newline (aka line break or end-of-line, EOL) can be added much like any character in XML, but be mindful of
differing OS conventions
differing XML application semantics

Related

How to read an XML file with carriage return in its contents?

I need to read an XML file that has &#xD; chars in some node contents, and I need to keep those chars as is and avoid converting them into new lines. Those nodes have xmldsig signatures, and converting &#xD; chars into new lines invalidates the signatures.
I have tried loading the XML with XmlDocument.Load, XmlReader, and StreamReader, and the special chars end up converted into new lines.
UPDATE with an XML sample
<?xml version="1.0"?>
<catalog>
<book>
<description>description
with
several
lines
</description>
</book>
<Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
...
</Signature>
</catalog>
If the CR characters are literal 0x0D bytes, any conformant XML parser is obliged to drop these or convert them to newlines, under the rules for normalizing line endings in the XML recommendation: see https://www.w3.org/TR/REC-xml/#sec-line-ends.
Generally, any processing of an XML file is going to make changes at the binary level, for example whitespace between attributes will be lost. Your expectation that you can parse and serialize an XML file while preserving its binary representation is fundamentally wrong.
However, the algorithm for XML digital signatures is careful to ignore such variations. It works at a logical level, and should ignore things such as the whitespace within start tags, or the exact representation of line endings. You state that converting CR to NL is invalidating the signature: that sounds wrong to me. The signature should be unaffected.
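The difference between a literal CR byte and a CR written as a character reference can be seen in a small Python sketch. It uses the standard library's xml.etree.ElementTree (backed by expat) purely as an example of a conformant parser; the element name is borrowed from the sample above:

```python
import xml.etree.ElementTree as ET

# Literal CR+LF bytes in the file: line-end normalization turns them into LF.
literal_crlf = ET.fromstring(b"<description>with\r\nlines</description>")

# CR written as the character reference &#xD;: it is not a literal line end,
# so it is delivered to the application as a carriage return.
referenced_cr = ET.fromstring(b"<description>with&#xD;\nlines</description>")

print(repr(literal_crlf.text))
print(repr(referenced_cr.text))
```

In both cases the parser is doing what the recommendation asks of it; what the application does with the resulting characters afterwards is a separate matter.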
There are a few ways to read an XML file with carriage return &#xD; in its contents:
Use an XML parser that supports &#xD; as a line ending character.
Use a text editor that supports &#xD; as a line ending character.
Use a tool that can convert &#xD; to a different line ending character.

Why does en-dash (–) trigger illegal XML character error (C#/SSMS)?

This is not a question on how to overcome the "XML parsing: ... illegal xml character" error, but about why it is happening? I know that there are fixes(1, 2, 3), but need to know where the problem arises from before choosing the best solution (what causes the error under the hood?).
We are calling a Java-based webservice using C#. From the strongly-typed data returned, we are creating an XML file that will be passed to SQL Server. The webservice data is encoded using UTF-8, so in C# we create the file, and specify UTF-8 where appropriate:
var encodingType = Encoding.UTF8;
// logic removed...
var xdoc = new XDocument();
xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
// logic removed...
System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);
This creates an XML file on disk that contains the following (abbreviated) data:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>
Notice that in the second record, - is different to –. I believe the second instance is en-dash.
If I open that XML file in Firefox/IE/VS2015, it opens without error. The W3C XML validator also works fine. But SSMS 2012 does not like it:
declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 25, illegal xml character
So why does en-dash cause the error? From my research, it would appear that
...only a few entities that need escaping: <,>,\,' and & in both HTML and
XML.
Source
...of which en-dash is not one. An encoded version (replacing – with &#8211;) works fine.
UPDATE
Based on the input, people state that en-dash isn't recognised as UTF-8, but yet it is listed here http://www.fileformat.info/info/unicode/char/2013/index.htm
So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 OR UTF-16)?
Please permit me to answer my own question, for the purpose of me understanding it fully myself. I won't accept this as the answer; it is the combination of the other answers that led me here. If this answer helps you in the future, please upvote the other posts also.
The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate XML as UTF-16; the SSMS and .Net default.
Cause of original problem
This variable declares XML with UTF-8 encoding, but the entity en-dash cannot be used without being encoded in UTF-8. This is wrong:
DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 29, illegal xml character
Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML. The string here is not unicode, so the implicit conversion fails:
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 1, character 56, unable to switch the encoding
Solutions
Alternatives that work are:
1) Leave as UTF-8 but encode with hexadecimal on the entity (reference):
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#x2013; Bar" />
</records>';
2) As above but with decimal encoding on the entity (reference):
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#8211; Bar" />
</records>';
3) Include the original entity, but remove UTF-8 encoding in declaration (SSMS then applies UTF-16; its default):
DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
4) Retain the UTF-16 declaration, but cast the XML to Unicode (note the preceding N before casting as XML):
DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
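Options 1 and 2 above can also be produced programmatically before the XML reaches SQL Server. The sketch below is in Python rather than C# and only illustrates the idea; the variable names are hypothetical. It replaces every non-ASCII character with a decimal or hexadecimal character reference, and it deliberately does not escape markup characters such as < or &:

```python
# Attribute value containing an en-dash (U+2013).
text = "Option \u2013 Bar"

# Decimal references: the built-in "xmlcharrefreplace" error handler
# substitutes &#NNNN; for anything that cannot be encoded as ASCII.
decimal_form = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Hexadecimal references: built by hand for every non-ASCII character.
hex_form = "".join(
    ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text
)

print(decimal_form)
print(hex_form)
```

Because the result is pure ASCII, it reads the same whichever encoding the declaration names.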
Can you modify the XML encoding declaration? If so;
declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml
(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>
Speculative Edit
Both of these fail with illegal xml character:
set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'
because they pass a non-unicode varchar to the XML parser; the string contains Unicode so must be treated as such, i.e. as an nvarchar (utf-16) (otherwise the 3 bytes comprising the – are misinterpreted as multiple characters and one or more is not in the acceptable range for XML)
This does pass a nvarchar string to the parser,
but fails with unable to switch the encoding:
set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'
This is because an nvarchar (utf-16) string is passed to the XML parser but the XML document states it's utf-8 and the – is not equivalent in the two encodings
This works as everything is utf-16
set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'
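The byte-level reasoning can be checked outside SQL Server. The following Python sketch prints how the en-dash is represented in each encoding. It assumes that the non-Unicode varchar literal is stored in Windows-1252, which depends on the database collation, so treat that part as an assumption rather than a statement about your server:

```python
EN_DASH = "\u2013"

utf8_bytes = EN_DASH.encode("utf-8")       # three bytes: E2 80 93
utf16_bytes = EN_DASH.encode("utf-16-le")  # one 2-byte code unit: 13 20
cp1252_bytes = EN_DASH.encode("cp1252")    # one byte: 96 (decimal 150)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(), utf16_bytes.hex(), cp1252_bytes.hex())

# A lone 0x96 byte is a continuation byte with no lead byte, so a parser
# that was told the data is UTF-8 has to reject it.
print(is_valid_utf8(cp1252_bytes))
```

This is consistent with the behaviour described above: the single-byte form is only meaningful in its code page, while the declaration promises UTF-8.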
SQL Server internally uses UTF-16. Either leave the encoding out or cast to Unicode.
The reason you are looking for: With UTF-8 specified, this character is not known.
--without your directive, SQL Server picks its default
declare @xml XML =
'<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml;
--or UNICODE, but you must use UTF-16
declare @xml2 XML =
CAST('<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>' AS NVARCHAR(MAX));
select @xml2
UPDATE
UTF-8 means that chunks of 8 bits (bytes) are used to carry the information. The base ASCII characters take just one chunk, easy going...
Other characters are encoded as multi-byte sequences. The en-dash (U+2013) needs three bytes in UTF-8 (E2 80 93), whereas the internally used UTF-16 stores it as a single 2-byte code unit.
Hope this is clear now...
UPDATE 2
This code will show you, that the Hyphen has the ASCII code 45 and your en-dash 150:
DECLARE @x VARCHAR(100)=
'<r RecordName="Option - Foo" /><r RecordName="Option – Bar" />';
WITH RunningNumbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM sys.objects
)
SELECT SUBSTRING(@x,Nmbr,1), ASCII(SUBSTRING(@x,Nmbr,1)) AS ASCII_Code
FROM RunningNumbers
WHERE ASCII(SUBSTRING(@x,Nmbr,1)) IS NOT NULL;
All characters with 7 bits are "plain" and should encode without problems. The "extended ASCII" range depends on the code page and can vary: 150 is the en-dash in Windows-1252, but might be something else in another code page. UTF-8 uses multi-byte sequences to represent such characters, and a lone byte 150 (0x96) is not a valid UTF-8 sequence, so the parser rejects it when the declaration says utf-8 (this was new to me too).
The MSDN guidelines say:
SQLXML 4.0 relies upon the limited support for DTDs provided in SQL
Server. SQL Server allows for an internal DTD in xml data type data,
which can be used to supply default values and to replace entity
references with their expanded contents. SQLXML passes the XML data
"as is" (including the internal DTD) to the server. You can convert
DTDs to XML Schema (XSD) documents using third-party tools, and load
the data with inline XSD schemas into the database.

remove nested element using regular expression

I am new to regex. I want to capture only the text portion of <firstpar>, or to remove every <asmbly> with all its child nodes and values. Can anyone show me how to do that? The following is a snapshot of the XML file. Thanks.
<?xml version="1.0" encoding="UTF-8"?>
<firstpar>
<thumbcred>Sample 1 thumbcred</thumbcred>
<asmbly>
<caption>
<p><work ty="drawing">Two Fabulous Animals</work>Sample 1 <e> sample 1caption </e></p>
</caption>
<credit>Paul Miller/AP</credit>
<asset id="126099" hgt="450" wdth="289" tmstp="24-OCT-08"
bintype="2" filename="images/sample126099.jpg" source="eb" bighgt="1600"
bigwdth="1029" bigfilename="botany003.jpg"
bigdeployfullfilename="/eb-media/99/126099-050-CAD1EF0A.jpg"
/>
<copyright>Copyright © 1994-2013 Encyclopædia Britannica, Inc.</copyright>
</asmbly>
Sample firstpar text <e>Sample e</e> just some
text <sub>sample sub </sub><e>sample e text again</e> more text with sup sub e.
</firstpar>
Unfortunately, one of the known limitations of regex is that it does not handle nesting.
You can and should use whatever XML parser is available in whatever language you're using.
If you have a very specifically formed piece of XML and a very specific goal, then it is possible to use regex to perform some operations on it, but once you try to apply your regex to a non-specific piece of XML, it will be unable to handle it.
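As a sketch of the parser-based route, here is one way to do it in Python with the standard library's xml.etree.ElementTree. The question does not say which language is in use, so the language, the helper name remove_elements, and the shortened sample are all assumptions for illustration. The one subtlety is that ElementTree stores the text following an element in that element's tail, so the tail has to be moved before the element is removed:

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample from the question.
SAMPLE = """<firstpar>
<thumbcred>Sample 1 thumbcred</thumbcred>
<asmbly>
<caption><p><work ty="drawing">Two Fabulous Animals</work>Sample 1</p></caption>
<credit>Paul Miller/AP</credit>
</asmbly>
Sample firstpar text <e>Sample e</e> just some
text <sub>sample sub </sub><e>sample e text again</e> more text.
</firstpar>"""


def remove_elements(root, tag):
    """Remove every <tag> element below root, keeping the text after it."""
    for parent in list(root.iter()):
        last_kept = None
        for child in list(parent):
            if child.tag == tag:
                # Re-attach the text that followed the removed element.
                tail = child.tail or ""
                if last_kept is None:
                    parent.text = (parent.text or "") + tail
                else:
                    last_kept.tail = (last_kept.tail or "") + tail
                parent.remove(child)
            else:
                last_kept = child
    return root


root = remove_elements(ET.fromstring(SAMPLE), "asmbly")
text = "".join(root.itertext())
print(text)
```

The remaining text still includes the content of <thumbcred>; if that should go too, the same helper can be called again with that tag.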

Clean out/replace invalid XML characters in element attributes

UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are two problems that keep it from being accepted as valid XML.
I hope someone has a solution for fixing these bugs gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE=&quot;C01&quot;>WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add CDATA markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the < and > characters. They're at least escaping embedded quotes (as &quot;), so a regex that matches from one " to the next " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
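The same idea can be tried out end to end in a short Python sketch, which also confirms that the repaired string parses. It assumes, as the answer does, that embedded quotes arrive already escaped as &quot; and that only the description attribute is affected; the helper names are hypothetical:

```python
import re
import xml.etree.ElementTree as ET

raw = (
    '<cars><car model="opel" description="great car '
    '<L CODE=&quot;C01&quot;>WWW.cars.com</L>"/></cars>'
)


def is_well_formed(xml_text):
    """Return True if the string parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def escape_angle_brackets(match):
    # Escape < and > inside the matched description="..." attribute only.
    return match.group(0).replace("<", "&lt;").replace(">", "&gt;")


fixed = re.sub(r'description="[^"]*"', escape_angle_brackets, raw)
description = ET.fromstring(fixed).find("car").get("description")

print(is_well_formed(raw), is_well_formed(fixed))
print(description)
```

As with the C# version, this is brittle: an unescaped quote or ampersand inside the attribute would defeat it.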
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"

XmlDocument dropping encoded characters

My C# application loads XML documents using the following code:
XmlDocument doc = new XmlDocument();
doc.Load(path);
Some of these documents contain encoded characters, for example:
<xsl:text>&#10;</xsl:text>
I notice that when these documents are loaded, &#10; gets dropped.
My question: How can I preserve <xsl:text>&#10;</xsl:text>?
FYI - The XML declaration used for these documents:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
Are you sure the character is dropped? Character 10 is just a line feed; it wouldn't exactly show up in your debugger window. It could also be treated as whitespace. Have you tried playing with the whitespace settings on your XmlDocument?
If you need to preserve the encoding you only have two choices: a CDATA section or reading as plain text rather than Xml. I suspect you have absolutely 0 control over the documents that come into the system, therefore eliminating the CDATA option.
Plain-text rather than Xml is probably distasteful as well, but it's all you have left. If you need to do validation or other processing you could first load and verify the xml, and then concatenate your files using simple file streams as a separate step. Again: not ideal, but it's all that's left.
&#10; is a linefeed - i.e. whitespace. The XML parser will load it in as a linefeed, and thereafter ignore the fact that it was originally encoded. The encoding is just part of the serialization of the data to text format - it's not part of the data itself.
Now, XML sometimes ignores whitespace and sometimes doesn't, depending on context, API etc. As Joel says you may find that it's not missing at all - or you may find that using it with an API which allows you to preserve whitespace fixes the problem. I wouldn't be at all surprised to see it turned into an unencoded linefeed character when you output the data though.
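The point that the reference is part of the serialization rather than the data can be seen in a round trip. This sketch uses Python's xml.etree.ElementTree instead of XmlDocument, so it illustrates the general XML behaviour and says nothing about .NET's whitespace settings:

```python
import xml.etree.ElementTree as ET

# A line feed written as a numeric character reference.
source = "<text>&#10;</text>"
element = ET.fromstring(source)

# After parsing, the text is simply the line feed character.
print(repr(element.text))

# Serializing writes the character itself, not the original reference.
serialized = ET.tostring(element, encoding="unicode")
print(repr(serialized))
```

So the line feed is preserved as data, while the way it was originally spelled in the file is not.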
Maybe it would be better to keep the data in a <![CDATA[...]]> section?
http://www.w3schools.com/XML/xml_cdata.asp
