I'm using the new System.Xml.Linq to create HTML documents (yes, I know about HtmlDocument, but I much prefer the XDocument/XElement classes). I'm having a problem inserting &nbsp; (or any other HTML entity). What I've tried already:
Just putting the text in directly doesn't work because the & gets turned into &amp;.
new XElement("h1", "Text&nbsp;to&nbsp;keep&nbsp;together.");
I tried parsing in the raw XML using the following, but it barfs with this error:
XElement.Parse("<h1>Text&nbsp;to&nbsp;keep&nbsp;together.</h1>");
--> Reference to undeclared entity 'nbsp'.
Try number three looks like the following. If I save to a file, there is just a space; the &#160; gets lost.
var X = new XDocument(new XElement("Name", KeepTogether("Hi Mom!")));
private static XNode KeepTogether(string p)
{
return XElement.Parse("<xml>" + p.Replace(" ", "&#160;") + "</xml>").FirstNode;
}
}
I couldn't find a way to just shove the raw text through without it getting escaped. Am I missing something obvious?
"I couldn't find a way to just shove the raw text through without it getting escaped."
Just put the Unicode character that &nbsp; refers to (U+00A0 NO-BREAK SPACE) directly in a text node, then let the serializer worry about whether it needs escaping to &#160; or not. (Probably not: if you are using UTF-8 or ISO-8859-1 as your page encoding, the character can be included directly without having to encode it as an entity reference or character reference.)
new XElement("h1", "Text\u00A0to\u00A0keep\u00A0together");
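A minimal sketch for checking that approach, assuming a plain console program and the element name and text from the question. It only confirms that the no-break space survives as a real character in the element's value and that no escaped ampersand appears in the serialized markup.

```csharp
using System;
using System.Xml.Linq;

// U+00A0 goes into the text node as an ordinary character.
var heading = new XElement("h1", "Text\u00A0to\u00A0keep\u00A0together");

// The value still contains the no-break space...
Console.WriteLine(heading.Value.IndexOf('\u00A0') >= 0);   // True
// ...and nothing was escaped to &amp; during serialization.
Console.WriteLine(heading.ToString().Contains("&amp;"));   // False
```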
Replacing the & with a marker like ##AMP## solved my problem. Maybe not the prettiest solution but I got a demo for a customer in 10 mins so...I don't care :)
Thx for the idea
I know this is old, but I found something and I'm rather surprised!
XElement element = new XElement("name",new XCData("<br /> & etc"));
And there you go! CDATA text!
You could also try using numbered entities; they need no declaration.
The numbered entity equivalent to the named entity &nbsp; is &#160;.
Unlike &amp; (&), &lt; (<), etc., &nbsp; is not an entity known to XML, so you need to declare it.
In XML, something like &xyz; is treated as an entity reference. The parser looks up the entity's declared value and substitutes it into the output.
// the xml to parse
string xml = "<xml>test&nbsp;test</xml>";
// declare nbsp as xml entity and its value is " " in this case.
string declareEntity = "<!DOCTYPE xml [<!ENTITY nbsp \" \">]>";
XElement x = XElement.Parse( declareEntity + xml );
// output with a space between tests
// <xml>test test</xml>
or
XElement.Parse("<xml>" + HttpUtility.HtmlEncode("Text&nbsp;keep everything") + "</xml>");
You can paste the character itself, exactly as you want to see it, by copying it from somewhere else. Visual Studio allows that.
Though this is hard to do if you need &nbsp;, it is easy if you need any visible symbol, for example:
&bull; ...just paste •
&harr; ...just paste ↔
I came up with this slightly daft approach which suits me:
String replace all the & with ##AMP## when you store the data....
And reverse that operation on output.
I am using this in conjunction with an XElement SQL column and it works a treat.
Regards
Neil
Related
We receive an xml string from an external API, and one element has a bunch of GT/LT signs.
When we run this code, it fails:
var xml = @"<SomeNode>10040:<->10110:<->10130:<->10150:<->10160:<->10180:<->10330:Value=><->10330:Matching=><->10330:Value2=><->10330:Value3=><->10330:Value4=><->10447:<->10418:No<->10419:No<->10430:No
</SomeNode>";
var doc = new XmlDocument();
doc.LoadXml(xml);
//System.Xml.XmlException: 'Name cannot begin with the '-' character, hexadecimal value 0x2D
I looked into escaping those characters, but as far as I can tell there isn't a way to escape only the ones inside SomeNode.
So I know that I could run some kind of string replacement using a regex or something to clear that out. But, is there an elegant way to solve this using existing XML related tools?
Based on the comments, there isn't an xml tools solution, and so it'll be a custom string replacement solution.
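A rough sketch of what such a string replacement might look like. `EscapeNodeContent` is a hypothetical helper, not part of any XML library. It assumes the node has no attributes, is not nested inside itself, and that its content contains no entities that are already escaped (an existing `&lt;` would be double-escaped).

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: escape &, < and > only between <nodeName> and </nodeName>.
string EscapeNodeContent(string xml, string nodeName)
{
    var name = Regex.Escape(nodeName);
    var pattern = $"(<{name}>)(.*?)(</{name}>)";
    return Regex.Replace(
        xml,
        pattern,
        m => m.Groups[1].Value
             + m.Groups[2].Value
                 .Replace("&", "&amp;")   // ampersand first, so the others are not double-escaped
                 .Replace("<", "&lt;")
                 .Replace(">", "&gt;")
             + m.Groups[3].Value,
        RegexOptions.Singleline);         // let '.' span the line break inside the node
}

var raw = "<SomeNode>10040:<->10110:Value=><->10447:No\n</SomeNode>";

var doc = new XmlDocument();
doc.LoadXml(EscapeNodeContent(raw, "SomeNode"));

// The parser hands back the original characters.
Console.WriteLine(doc.DocumentElement!.InnerText.Trim());
```

Escaping the whole node content, rather than hunting for individual `<->` markers, keeps the pattern independent of what the external API puts between the tags.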
Using MS Visual Studio 2013 to create a C# application, I am trying to get the following output in an XML document.
<UnitsOfMeasure>
&uom-data;
</UnitsOfMeasure>
I keep getting
<UnitsOfMeasure>
&amp;uom-data;
</UnitsOfMeasure>
Here is the code I have tried
XElement uom = new XElement("UnitsOfMeasure");
uom.Add("\n" + tab2, new XText("&uom-data;"), "\n" + tab1);
sd.Add("\n" + tab1, uom);
sd.Add("\n");
XElement uom = new XElement("UnitsOfMeasure");
uom.Add("\n" + tab2, new XText((char)38 + "uom-data;"), "\n" + tab1);
sd.Add("\n" + tab1, uom);
sd.Add("\n");
Thanks
The problem is that & has a special meaning in XML: it's used to escape other things (see "Beware of the ampersand when using XML", for example). What's being written for you, &amp;, is the correct way to include an ampersand inside XML, and when an XML parser reads it back in, it should convert the &amp; back to &.
So perhaps, if anything, you may have a problem with whatever code is reading that XML back in again as it should be converting it back for you.
XML has things called "entities", which take the form ampersand-characters-semicolon.
An XML entity is an alias for a different block of text (although in most cases entities are used just to insert a single character, generally one not on the keyboard).
&amp; is the most commonly used; it inserts an &. &copy; is for the copyright symbol (in HTML).
In addition to the standard ones, you are allowed to define your own.
Since what you are trying to enter, &uom-data;, so neatly follows the entity format, I suspect that it really IS an entity and you are just missing the part where it's defined.
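If it really is an entity reference, one possible way to emit it unescaped is sketched below. Note that it swaps in XmlDocument instead of the XElement API used in the question, because XmlDocument has an explicit entity-reference node type. The element and entity names come from the question. Whatever reads the file back still needs a DTD that declares uom-data, which is not shown here.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
var uom = doc.CreateElement("UnitsOfMeasure");
doc.AppendChild(uom);

// An entity reference node is serialized as &name; rather than as escaped text.
uom.AppendChild(doc.CreateEntityReference("uom-data"));

Console.WriteLine(doc.OuterXml);
```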
I seem to have found something of an inconsistency between the various XML implementations within .Net 3.5 and I'm struggling to work out which is nominally correct.
The issue is actually fairly easy to reproduce:
Create a simple xml document with a text element containing '\t' characters and give it an attribute that contains '\t' characters:
var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
xmlDoc.Save(@"d:\TabTest.xml");
NB: This means that XmlDocument itself is quite happy with '\t' characters in an attribute value.
Load the document using XmlReader.Create:
var rawFile = XmlReader.Create(@"D:\TabTest.xml");
var rawDoc = new XmlDocument();
rawDoc.Load(rawFile);
Load the document using new XmlTextReader:
var rawFile2 = new XmlTextReader(@"D:\TabTest.xml");
var rawDoc2 = new XmlDocument();
rawDoc2.Load(rawFile2);
Compare the documents in the debugger:
(rawDoc).InnerXml "<test><text attrib=\"Tab' 'space' '\">Tab'\t'space' '</text></test>" string
(rawDoc2).InnerXml "<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>" string
The document read using new XmlTextReader was what I expected: the '\t' in both the text value and the attribute value was there.
However, if you look at the document read by XmlReader.Create you find that the '\t' character in the attribute value will have been converted into a ' ' character.
What the....!! :-)
After a bit of a Google search I found that I could encode a '\t' as '&#x9;' - if I used this instead of '\t' in the example XML both readers work as expected.
Now Altova XmlSpy and various other XML readers seem to be perfectly happy with '\t' characters in attribute values, my question is what is the correct way to handle this?
Should I be writing XML file with '\t' characters encoded in attribute values like XmlReader.Create expects or are the other XML tools right and '\t' characters are valid and XmlReader.Create is broken?
Which way should I go to fix/work around this issue?
Probably something to do with Attribute Value Normalization. For CDATA attributes an XML parser is required to replace newlines and tabs in attribute values by spaces, unless they are written in escaped form as character references.
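A small sketch of that normalization rule, assuming a reader obtained from XmlReader.Create. The element and attribute names are made up for the illustration. One attribute carries a literal tab, the other the character reference &#x9;.

```csharp
using System;
using System.IO;
using System.Xml;

// 'literal' holds a real tab character, 'escaped' holds the character reference.
var xml = "<t literal=\"a\tb\" escaped=\"a&#x9;b\" />";

using var reader = XmlReader.Create(new StringReader(xml));
reader.MoveToContent();

var literal = reader.GetAttribute("literal");
var escaped = reader.GetAttribute("escaped");

Console.WriteLine(literal == "a b");    // True: the literal tab was normalized to a space
Console.WriteLine(escaped == "a\tb");   // True: the character reference yields a real tab
```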
@all: Thanks for all your answers and comments.
It would seem that Justin and Michael Kay are correct and white space should be encoded according to the W3C XML specifications and that the issue is that a significant number of the MS implementations do not honour this requirement.
In my case, XML specification aside, all I really want is for the attribute values to be correctly persisted - i.e. the values saved should be exactly the values read.
The answer to that is to force the use of an XmlWriter created by the XmlWriter.Create method when saving the XML files in the first place.
While both DataSet and XmlDocument provide save/write mechanisms, neither of them correctly encodes white space in attributes when used in its default form. If I force them to use a manually created XmlWriter, however, the correct encoding is applied and written to the file.
So the original file save code becomes:
var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
using (var xmlWriter = XmlWriter.Create(@"d:\TabTest.Encoded.xml"))
{
xmlDoc.Save(xmlWriter);
}
This writer then correctly encodes the white space in a symmetrical way for the XmlReader.Create reader to read without altering the attribute values.
The other thing to note here is that this solution encapsulates the encoding from my code entirely as the reader and writer perform the encoding and decoding transparently on read and write.
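The round trip can be checked in memory with a sketch like the following. It assumes the default XmlWriterSettings and writes to a StringBuilder instead of a file so that it stays self-contained. The attribute value is a shortened stand-in for the one in the question.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

var original = new XmlDocument();
original.LoadXml("<test><text attrib=\"Tab\there\">x</text></test>");

// Save through a writer obtained from XmlWriter.Create.
var buffer = new StringBuilder();
using (var writer = XmlWriter.Create(buffer))
{
    original.Save(writer);
}
var saved = buffer.ToString();

// Read it back through a reader obtained from XmlReader.Create.
var reloaded = new XmlDocument();
using (var reader = XmlReader.Create(new StringReader(saved)))
{
    reloaded.Load(reader);
}

var attribute = reloaded.DocumentElement!.FirstChild!.Attributes!["attrib"]!.Value;
Console.WriteLine(saved.Contains("&#x9;"));     // the tab is written as a character reference
Console.WriteLine(attribute == "Tab\there");    // and comes back as a real tab
```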
Check out XmlReaderSettings.ConformanceLevel. In particular, this description:
Note that XmlReader objects created by the Create method are more compliant by default than the XmlTextReader class. The following are conformance improvements that are not enabled on XmlTextReader, but are available by default on readers created by the Create method
At a glance it seems that XmlTextReader is not compliant with the W3C recommendation. See the section in the recommendation on attribute value normalization, specifically
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
Hence the behaviour that you weren't expecting (seeing a space instead of a tab) is actually the correct recommended behaviour.
I have no idea why XmlTextReader behaves this way (there is nothing in the documentation), but you seem to have already identified the correct workaround: encode the tab as &#x9; instead. In this case the normalised string will contain the tab character itself.
I am trying to load something which claims to be an XML document into any type of .net XML object: XElement, XmlDocument, or XmlTextReader. All of them throw an exception :
Name cannot begin with the '0' character, hexadecimal value 0x30
The error related to a bit of 'XML'
<chart_value
color="ff4400"
alpha="100"
size="12"
position="cursor"
decimal_char="."
0=""
/>
I believe the problem is the author should not have named an attribute as 0.
If I could change this I would, but I do not have control of this feed. I suppose those who use it are using more permissive tools. Is there any way I can load this as XML without throwing an error?
There is no XML declaration either, nor a namespace or contract definition. I was thinking I might have to turn it into a string and do a replace, but this is not very elegant. I was wondering if there were any other options.
As many have said, this is not XML.
Having said that, it's almost XML and WANTS to be XML, so I don't think you should use a regex to screw around inside of it (here's why).
Wherever you're getting the stream, dump it into a string, change 0= to something like zero= and try parsing it.
Don't forget to reverse the operation if you have to return-to-sender.
If you're reading from a file, you can do something like this:
var txt = File.ReadAllText(@"\path\to\wannabe.xml");
var clean = txt.Replace("0=", "zero=");
var doc = new XmlDocument();
doc.LoadXml(clean);
This is not guaranteed to remove all potential XML problems -- but it should remove the one you have.
Just prefix the numeric attribute name with '_'.
Example: "0=" replace to "_0="
I hope that will fix the problem, thanks.
It might claim to be an XML document, but the claim is clearly false, so you should reject the document.
The only good way to deal with bad XML is to find out what bit of software is producing it, and either fix it or throw it away. All the benefits of XML go out of the window if people start tolerating stuff that's nearly XML but not quite.
The 0="" obviously uses an invalid attribute name 0. You'd probably have to do a find/replace to try and fix the XML if you cannot fix it at the source that created it. You might be able to use RegEx to try to do more efficient manipulation of the XML string.
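A sketch of a regex-based variant of that find/replace, combining it with the underscore prefix suggested above. `PrefixNumericAttributeNames` is a hypothetical helper. It assumes a digit-led name followed by `="` only ever appears as an attribute name, so text content or attribute values that happen to contain the same shape would be rewritten too.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: put '_' in front of attribute names that start with a digit.
string PrefixNumericAttributeNames(string text) =>
    Regex.Replace(text, @"(?<=\s)(\d[\w.-]*)(?=\s*=\s*"")", "_$1");

var feed = "<chart_value\n color=\"ff4400\"\n decimal_char=\".\"\n 0=\"\"\n/>";

var element = XElement.Parse(PrefixNumericAttributeNames(feed));

Console.WriteLine(element.Attribute("_0") != null);    // True: the renamed attribute parses
Console.WriteLine(element.Attribute("color")!.Value);  // ff4400: other attributes are untouched
```

Requiring whitespace before the name and `="` after it is narrower than a plain `Replace("0=", ...)`, which would also hit any name or value that merely ends in 0 followed by an equals sign.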
I have a string that contains special characters (the trademark sign, etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows as ??. This is how I'm using it:
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.HtmlEncode(xxxx) but it converts it into "&#8482;", so the output in the XML is "&amp;#8482;" instead of ™.
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
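A minimal sketch of that check, assuming UTF-8 output and an in-memory stream in place of a file. The element name and sample text are made up. The string is assigned as-is, and the writer and reader handle the encoding.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

var doc = new XmlDocument();
var node = doc.CreateElement("name");
node.InnerText = "Acme\u2122";          // trademark sign, no manual encoding
doc.AppendChild(node);

// Save as UTF-8 (without a byte order mark).
using var stream = new MemoryStream();
var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
using (var writer = XmlWriter.Create(stream, settings))
{
    doc.Save(writer);
}

// Reload and compare: the character should come back unchanged.
stream.Position = 0;
var reloaded = new XmlDocument();
reloaded.Load(stream);

Console.WriteLine(reloaded.DocumentElement!.InnerText == "Acme\u2122");  // True
```

If this prints True but the file still shows ?? in a viewer, the viewer is most likely decoding the bytes with the wrong encoding.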
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.