I have the following XML element defined in a document:
<modelDef name="EmployeeOutput" description="Current Model Run output.">
<resultSet name="Values">
<field name="Net P&L Impact" typeName="Decimal" formatString="C2"/>
</resultSet>
</modelDef>
Notice the name attribute on the field element. When I retrieve the value it comes back with the ampersand, where I want to keep the encoded value as defined in the xml. This is the simple code I use to retrieve the value of the attribute
field.Attribute("name").Value
How can I make sure that I get the & code back rather than the actual ampersand symbol?
Thanks!
Actually, "Net P&L Impact" is the value you've encoded in XML, because & is the XML code for an ampersand. If the actual encoded value should be "Net P&L Impact" then you need to change your XML:
<modelDef name="EmployeeOutput" description="Current Model Run output.">
<resultSet name="Values">
<field name="Net P&L Impact" typeName="Decimal" formatString="C2"/>
</resultSet>
</modelDef>
However, it's rare to encounter real-world entities with & in their names. Why do you want it to be encoded that way? Is it because you're outputting it to HTML? If so, use HttpUtility.HtmlEncode() on the resulting value as you go to output it in HTML. That's the correct and safe way. Don't expect your XML-parsing code to anticipate the context that the value will be output into.
Related
Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyways, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to cleanup the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities credit: RomanPerekhrest
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more
suggestions for dealing with not-well-formed markup in Python,
including especially lxml's recover=True option.
See also this answer for how to use codecs.EncodedFile() to cleanup illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
#jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
#jdweng also reports that XmlReader.ReadToFollowing() can sometimes
be used to work-around XML syntactical issues, but note
rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by #chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible as
what appears to be
predictable often is not -- rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &: credit: blhsin, demo
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-wellformed and/or DTD-invalid XML can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as SGML empty element and then use eg. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
Below is a not-really answer for this specific case, but found this on the web (thanks to inuyasha82 on Coderwall). This code bit did inspire me for another similar problem while dealing with malformed XMLs, so I share it here.
Please do not edit what is below, as it is as it on the original website.
The XML format, requires to be valid a unique root element declared in the document.
So for example a valid xml is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered a malformed XML, so many xml parsers just throw an Exception complaining about no root element. Etc.
In this example there is a solution on how to solve that problem and succesfully parse the malformed xml above.
Basically what we will do is to add programmatically a root element.
So first of all you have to open the resource that contains your "malformed" xml (i. e. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at that point we will raise the malformed document Exception.
Now we create a list of InputStream objects with three lements:
A ByteIputStream element that contains the string: <root>
Our FileInputStream
A ByteInputStream with the string: </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now using a SequenceInputStream, we create a container for the List created above:
InputStream cntr =
new SequenceInputStream(Collections.enumeration(str));
Now we can use any XML Parser library, on the cntr, and it will be parsed without any problem. (Checked with Stax library);
One of my element in an xml has a value like
<item name="abc_def>" />
The actual value pulled from the data source is "abc_def!!>". I have no control over this data source and this cannot be changed.
I wanted to know how do I escape these characters when xml serialization is taking place. I have tried a couple of things, but they didnt work.
I tried all methods explained here
What is the correct way to escape these characters ? The end output is an api which our clients hit using their browsers and because of this issue, the xml parsing in browser is breaking.
If all of your strings look like that one, you can do something like this:
string input = "abc_def>";
input = input.Replace("", "!!");
string output = HttpUtility.HtmlDecode(input);
You need to use:
System.Net.WebUtility.HtmlEncode(stringToEncode);
Of course when you later decode that you use:
System.Net.WebUtility.HtmlDecode(stringToDecode);
This is for UWP, namespace may vary depending on what framework you use.
I am using XDocument to switch a value in an xml document.
In the new value I need to use the character '&' (ampersand)
but after XDocument.save() the xml has & instead!
I tried using encoding and stuff… nothing worked
XDocument is doing exactly what it's supposed to do.
& is invalid XML. (it's an unfinished character/entity reference)
& means "Start of an entity" in XML so if you want to include an & as data you must express it as an entity — & (or use it in a CDATA block).
What you describe is normal behaviour and the XML would break otherwise.
There are two options. Either to ensure proper XML encoding/decoding of all your content in the XML document. Remember that HTML and XML encoding/decoding is slightly different.
Option two is to use base64 encoding on whatever content in the xml that might contain invalid elements.
Is your output file app.config supposed to be an XML file?
If it is, then the & must be escaped as &.
If it isn't, then you should be using the text output method instead of the xml output method: use <xsl:output method='text'/>.
PS: this question appears to be a duplicate of How can I add an ampersand for a value in a ASP.net/C# app config file value
UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are a two problems with this why this isn't accepted as valid XML.
Hope anyone has a solution for fixing these bug gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE="C01">WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add [CDATA][1] markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the <s and >s. They're at least escaping embedded quotes, so a regex to go from " to " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
match => match.Value.Replace("<", "<").Replace(">", ">"));
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"
I've got to get a quick and dirty configuration editor up and running. The flow goes something like this:
configuration (POCOs on server) are serialized to XML.
The XML is well formed at this point. The configuration is sent to the web server in XElements.
On the web server, the XML (Yes, ALL OF IT) is dumped into a textarea for editing.
The user edits the XML directly in the webpage and clicks Submit.
In the response, I retrieve the altered text of the XML configuration. At this point, ALL escapes have been reverted by the process of displaying them in a webpage.
I attempt to load the string into an XML object (XmlElement, XElement, whatever). KABOOM.
The problem is that serialization escapes attribute strings, but this is lost in translation along the way.
For example, let's say I have an object that has a regex. Here's the configuration as it comes to the web server:
<Configuration>
<Validator Expression="[^<]" />
</Configuration>
So, I put this into a textarea, where it looks like this to the user:
<Configuration>
<Validator Expression="[^<]" />
</Configuration>
So the user makes a slight modification and submits the changes back. On the web server, the response string looks like:
<Configuration>
<Validator Expression="[^<]" />
<Validator Expression="[^&]" />
</Configuration>
So, the user added another validator thingie, and now BOTH have attributes with illegal characters. If I try to load this into any XML object, it throws an exception because < and & are not valid within a text string. I CANNOT CANNOT CANNOT CANNOT use any kind of encoding function, as it encodes the entire bloody thing:
var result = Server.HttpEncode(editedConfig);
results in
<Configuration>
<Validator Expression="[^<]" />
<Validator Expression="[^&]" />
</Configuration>
This is NOT valid XML. If I try to load this into an XML element of any kind I will be hit by a falling anvil. I don't like falling anvils.
SO, the question remains... Is the ONLY way I can get this string XML ready for parsing into an XML object is by using regex replaces? Is there any way to "turn off constraints" when I load? How do you get around this???
One last response and then wiki-izing this, as I don't think there is a valid answer.
The XML I place in the textarea IS valid, escaped XML. The process of 1) putting it in the text area 2) sending it to the client 3) displaying it to the client 4) submitting the form it's in 5) sending it back to the server and 6) retrieving the value from the form REMOVES ANY AND ALL ESCAPES.
Let me say this again: I'M not un-escaping ANYTHING. Just displaying it in the browser does this!
Things to mull over: Is there a way to prevent this un-escaping from happening in the first place? Is there a way to take almost-valid XML and "clean" it in a safe manner?
This question now has a bounty on it. To collect the bounty, you demonstrate how to edit VALID XML in a browser window WITHOUT a 3rd party/open source tool that doesn't require me to use regex to escape attribute values manually, that doesn't require users to escape their attributes, and that doesn't fail when roundtripping (&amp;amp;etc;)
Erm … How do you serialize? Usually, the XML serializer should never produce invalid XML.
/EDIT in response to your update: Do not display invalid XML to your user to edit! Instead, display the properly escaped XML in the TextBox. Repairing broken XML isn't fun and I actually see no reason not to display/edit the XML in a valid, escaped form.
Again I could ask: how do you display the XML in the TextBox? You seem to intentionally unescape the XML at some point.
/EDIT in response to your latest comment: Well yes, obviously, since the it can contain HTML. You need to escape your XML properly before writing it out into an HTML page. With that, I mean the whole XML. So this:
<foo mean-attribute="<">
becomes this:
<foo mean-attribute="&<">
Of course when you put entity references inside a textarea they come out unescaped. Textareas aren't magic, you have to &escape; everything you put in them just like every other element. Browsers might display a raw '<' in a textarea, but only because they're trying to clean up your mistakes.
So if you're putting editable XML in a textarea, you need to escape the attribute value once to make it valid XML, and then you have to escape the whole XML again to make it valid HTML. The final source you want to appear in the page would be:
<textarea name="somexml">
<Configuration>
<Validator Expression="[^<]" />
<Validator Expression="[^&]" />
</Configuration>
</textarea>
Question is based on a misunderstanding of the content model of the textarea element - a validator would have picked up the problem right away.
ETA re comment: Well, what problem remains? That's the issue on the serialisation side. All that remains is parsing it back in, and for that you have to assume the user can create well-formed XML.
Trying to parse non-well-formed XML, in order to allow errors like having '<' or '&' unescaped in an attribute value is a loss, totally against how XML is supposed to work. If you can't trust your users to write well-formed XML, give them an easier non-XML interface, such as a simple newline-separated list of regexp strings.
As you say, the normal serializer should escape everything for you.
The problem, then, is the text block: you need to handle anything passed through the textblock yourself.
You might try HttpUtility.HtmlEncode(), but I think the simplest method is to just encase anything you pass through the text block in a CDATA section.
Normally of course I would want everything properly escaped rather than relying on the CDATA "crutch", but I would also want to use the built-in tools to do the escaping. For something that is edited in it's "hibernated" state by a user, I think CDATA might be the way to go.
Also see this earlier question:
Best way to encode text data for XML
Update
Based on a comment to another response, I've realized you're showing the users the markup, not just the contents. Xml parsers are, well, picky. I think the best thing you could do in this case is to check for well-formedness before accepting the edited xml.
Perhaps try to automatically correct certain kinds of errors (like bad ampersands from my linked question), but then get the line number and column number of the first validation error from the .Net xml parser and use that to show users where their mistake is until they give you something acceptable. Bonus points if you also validate against a schema.
You could take a look at something like TinyMCE, which allows you to edit html in a rich text box. If you can't configure it to do exactly what you want, you could use it as inspiration.
Note: firefox (in my test) does not unescape in text areas as you describe. Specifically, this code:
<textarea cols="80" rows="10" id="1"></textarea>
<script>
elem = document.getElementById("1");
elem.value = '\
<Configuration>\n\
<Validator Expression="[^<]" />\n\
</Configuration>\
'
alert(elem.value);
</script>
Is alerted and displayed to the user unchanged, as:
<Configuration>
<Validator Expression="[^<]" />
</Configuration>
So maybe one (un-viable?) solution is for your users to use firefox.
It seems two parts to your question have been revealed:
1 XML that you display is getting unescaped.
For example, "<" is unescaped as "<". But since "<" is also unescaped as "<", information is lost and you can't get it back.
One solution is for you to escape all the "&" characters, so that "<" becomes "<". This will then be unescaped by the textarea as "<". When you read it back, it will be as it was in the first place. (I'm assuming that the textarea actually changes the string, but firefox isn't behaving as you report, so I can't check this)
Another solution (mentioned already I think) is to build/buy/borrow a custom text area (not bad if simple, but there's all the editing keys, ctrl-C, ctrl-shift-left and so on).
2 You would like users to not have to bother escaping.
You're in escape-hell:
A regex replace will mostly work... but how can you reliably detect the end quote ("), when the user might (legitimately, within the terms you've given) enter :
<Configuration>
<Validator Expression="[^"<]" />
</Configuration>
Looking at it from the point of view of the regex syntax, it also can't tell whether the final " is part of the regex, or the end of it. Regex syntax usually solves this problem with an explicit terminator eg:
/[^"<]/
If users used this syntax (with the terminator), and you wrote a parser for it, then you could determine when the regex has ended, and therefore that the next " character is not part of the regex, but part of the XML, and therefore which parts need to be escaped. I'm not saying you should this! I'm saying it's theoretically possible. It's pretty far from quick and dirty.
BTW: The same problem arises for text within an element. The following is legitimate, within the terms you've given, but has the same parsing problems:
<Configuration>
<Expression></Expression></Expression>
</Configuration>
The basic rule in a syntax that allows "any text" is that the delimiter must be escaped, (e.g. " or <), so that the end can be recognized. Most syntax also escapes a bunch of other stuff, for convenience/inconvenience. (EDIT it will need to have an escape for the escape character itself: for XML, it is "&", which when literal is escaped as "&" For regex, it is the C/unix-style "\", which when literal is escaped as "\\").
Nest syntaxes, and you're in escape-hell.
One simple solution for you is to tell your users: this is a quick and dirty configuration editor, so you're not getting any fancy "no need to escape" mamby-pamby:
List the characters and escapes next
to the text area, eg: "<" as
"<".
For XML that won't
validate, show them the list again.
Looking back, I see bobince gave the same basic answer before me.
Inserting CDATA around all text would give you another escape mechanism that would (1) save users from manually escaping, and (2) enable the text that was automatically unescaped by the textarea to be read back correctly.
<Configuration>
<Validator Expression="<![CDATA[ [^<] ]]>" />
</Configuration>
:-)
This special character - "<" - should have replaced with other characters so that your XML will be valid. Check this link for XML special characters:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Try also to encode your TextBlock content before sending it to the deserializer:
HttpServerUtility utility = new HttpServerUtility();
string encodedText = utility.HtmlEncode(text);
Is this really my only option? Isn't this a common enough problem that it has a solution somewhere in the framework?
private string EscapeAttributes(string configuration)
{
var lt = #"(?<=\w+\s*=\s*""[^""]*)<(?=[^""]*"")";
configuration = Regex.Replace(configuration, lt, "<");
return configuration;
}
(edit: deleted ampersand replacement as it causes problems roundtripping)