I've got to get a quick and dirty configuration editor up and running. The flow goes something like this:
Configuration objects (POCOs on the server) are serialized to XML.
The XML is well formed at this point. The configuration is sent to the web server in XElements.
On the web server, the XML (Yes, ALL OF IT) is dumped into a textarea for editing.
The user edits the XML directly in the webpage and clicks Submit.
In the response, I retrieve the altered text of the XML configuration. At this point, ALL escapes have been reverted by the process of displaying them in a webpage.
I attempt to load the string into an XML object (XmlElement, XElement, whatever). KABOOM.
The problem is that serialization escapes attribute strings, but this is lost in translation along the way.
For example, let's say I have an object that has a regex. Here's the configuration as it comes to the web server:
<Configuration>
<Validator Expression="[^&lt;]" />
</Configuration>
So, I put this into a textarea, where it looks like this to the user:
<Configuration>
<Validator Expression="[^<]" />
</Configuration>
So the user makes a slight modification and submits the changes back. On the web server, the response string looks like:
<Configuration>
<Validator Expression="[^<]" />
<Validator Expression="[^&]" />
</Configuration>
So, the user added another validator thingie, and now BOTH have attributes with illegal characters. If I try to load this into any XML object, it throws an exception because < and & are not valid within a text string. I CANNOT CANNOT CANNOT CANNOT use any kind of encoding function, as it encodes the entire bloody thing:
var result = Server.HtmlEncode(editedConfig);
results in
&lt;Configuration&gt;
&lt;Validator Expression=&quot;[^&lt;]&quot; /&gt;
&lt;Validator Expression=&quot;[^&amp;]&quot; /&gt;
&lt;/Configuration&gt;
This is NOT valid XML. If I try to load this into an XML element of any kind I will be hit by a falling anvil. I don't like falling anvils.
SO, the question remains... Is using regex replaces the ONLY way I can get this string ready for parsing into an XML object? Is there any way to "turn off constraints" when I load? How do you get around this???
One last response and then wiki-izing this, as I don't think there is a valid answer.
The XML I place in the textarea IS valid, escaped XML. The process of 1) putting it in the text area 2) sending it to the client 3) displaying it to the client 4) submitting the form it's in 5) sending it back to the server and 6) retrieving the value from the form REMOVES ANY AND ALL ESCAPES.
Let me say this again: I'M not un-escaping ANYTHING. Just displaying it in the browser does this!
Things to mull over: Is there a way to prevent this un-escaping from happening in the first place? Is there a way to take almost-valid XML and "clean" it in a safe manner?
This question now has a bounty on it. To collect the bounty, you demonstrate how to edit VALID XML in a browser window WITHOUT a 3rd party/open source tool that doesn't require me to use regex to escape attribute values manually, that doesn't require users to escape their attributes, and that doesn't fail when roundtripping (&amp;amp;etc;)
Erm … How do you serialize? Usually, the XML serializer should never produce invalid XML.
/EDIT in response to your update: Do not display invalid XML to your user to edit! Instead, display the properly escaped XML in the TextBox. Repairing broken XML isn't fun and I actually see no reason not to display/edit the XML in a valid, escaped form.
Again I could ask: how do you display the XML in the TextBox? You seem to intentionally unescape the XML at some point.
/EDIT in response to your latest comment: Well yes, obviously, since it can contain HTML. You need to escape your XML properly before writing it out into an HTML page. By that, I mean the whole XML. So this:
<foo mean-attribute="&lt;">
becomes this:
&lt;foo mean-attribute="&amp;lt;"&gt;
Of course when you put entity references inside a textarea they come out unescaped. Textareas aren't magic, you have to &escape; everything you put in them just like every other element. Browsers might display a raw '<' in a textarea, but only because they're trying to clean up your mistakes.
So if you're putting editable XML in a textarea, you need to escape the attribute value once to make it valid XML, and then you have to escape the whole XML again to make it valid HTML. The final source you want to appear in the page would be:
<textarea name="somexml">
&lt;Configuration&gt;
&lt;Validator Expression="[^&amp;lt;]" /&gt;
&lt;Validator Expression="[^&amp;amp;]" /&gt;
&lt;/Configuration&gt;
</textarea>
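The two escaping layers described above can be sketched as follows. This is only an illustration in Python (standing in for the .NET serializer and HtmlEncode), and `html.unescape` is used here as an assumed stand-in for what the browser does when it parses the textarea content; the variable names are made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# Build the configuration with a serializer so attribute values are escaped once.
root = ET.Element("Configuration")
ET.SubElement(root, "Validator", Expression="[^<]")
ET.SubElement(root, "Validator", Expression="[^&]")
xml_text = ET.tostring(root, encoding="unicode")

# Escape the whole XML string a second time before writing it into the page.
textarea_body = html.escape(xml_text, quote=False)
page_source = f'<textarea name="somexml">{textarea_body}</textarea>'

# The browser decodes the textarea content once; html.unescape stands in for that.
submitted = html.unescape(textarea_body)

# What comes back is the original well-formed XML, so it parses cleanly.
parsed = ET.fromstring(submitted)
print([v.get("Expression") for v in parsed])
```

The point of the sketch is that only one layer of escaping is removed on the way through the browser, so the XML layer survives the round trip.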
Question is based on a misunderstanding of the content model of the textarea element - a validator would have picked up the problem right away.
ETA re comment: Well, what problem remains? That's the issue on the serialisation side. All that remains is parsing it back in, and for that you have to assume the user can create well-formed XML.
Trying to parse non-well-formed XML, in order to allow errors like having '<' or '&' unescaped in an attribute value is a loss, totally against how XML is supposed to work. If you can't trust your users to write well-formed XML, give them an easier non-XML interface, such as a simple newline-separated list of regexp strings.
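As a rough illustration of the "easier non-XML interface" idea, the sketch below (Python, with a hypothetical helper name `build_configuration`) takes one raw regex per line from the user and lets the serializer do all the escaping. The element and attribute names are taken from the question; everything else is an assumption.

```python
import xml.etree.ElementTree as ET

def build_configuration(regex_lines: str) -> str:
    """Turn a newline-separated list of raw regexes into escaped XML."""
    root = ET.Element("Configuration")
    for line in regex_lines.splitlines():
        if line.strip():  # skip blank lines
            ET.SubElement(root, "Validator", Expression=line)
    # The serializer escapes <, & and " inside attribute values for us.
    return ET.tostring(root, encoding="unicode")

xml_text = build_configuration('[^<]\n[^&]\n[^"<]')
expressions = [v.get("Expression") for v in ET.fromstring(xml_text)]
print(expressions)
```

Because the user never sees or types markup, there is nothing for them to escape and nothing to repair on the way back in.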
As you say, the normal serializer should escape everything for you.
The problem, then, is the text block: you need to handle anything passed through the textblock yourself.
You might try HttpUtility.HtmlEncode(), but I think the simplest method is to just encase anything you pass through the text block in a CDATA section.
Normally of course I would want everything properly escaped rather than relying on the CDATA "crutch", but I would also want to use the built-in tools to do the escaping. For something that is edited in its "hibernated" state by a user, I think CDATA might be the way to go.
Also see this earlier question:
Best way to encode text data for XML
Update
Based on a comment to another response, I've realized you're showing the users the markup, not just the contents. Xml parsers are, well, picky. I think the best thing you could do in this case is to check for well-formedness before accepting the edited xml.
Perhaps try to automatically correct certain kinds of errors (like bad ampersands from my linked question), but then get the line number and column number of the first validation error from the .Net xml parser and use that to show users where their mistake is until they give you something acceptable. Bonus points if you also validate against a schema.
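A minimal sketch of the "check well-formedness and report the position" idea follows. It uses Python's standard parser rather than the .NET one, and `first_xml_error` is a hypothetical helper name; the .NET equivalent would read the line and position from the parser's exception instead.

```python
import xml.etree.ElementTree as ET

def first_xml_error(text):
    """Return (line, column, message) for the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based
        return line, column, str(err)
    return None

edited = '<Configuration>\n<Validator Expression="[^<]" />\n</Configuration>'
error = first_xml_error(edited)
if error is not None:
    print(f"Line {error[0]}, column {error[1]}: {error[2]}")
```

The returned position can be shown next to the textarea so the user can fix the input and resubmit.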
You could take a look at something like TinyMCE, which allows you to edit html in a rich text box. If you can't configure it to do exactly what you want, you could use it as inspiration.
Note: firefox (in my test) does not unescape in text areas as you describe. Specifically, this code:
<textarea cols="80" rows="10" id="1"></textarea>
<script>
elem = document.getElementById("1");
elem.value = '\
<Configuration>\n\
<Validator Expression="[^&lt;]" />\n\
</Configuration>\
'
alert(elem.value);
</script>
Is alerted and displayed to the user unchanged, as:
<Configuration>
<Validator Expression="[^&lt;]" />
</Configuration>
So maybe one (un-viable?) solution is for your users to use firefox.
It seems two parts to your question have been revealed:
1 XML that you display is getting unescaped.
For example, "&lt;" is unescaped as "<". But since "<" is also unescaped as "<", information is lost and you can't get it back.
One solution is for you to escape all the "&" characters, so that "&lt;" becomes "&amp;lt;". This will then be unescaped by the textarea as "&lt;". When you read it back, it will be as it was in the first place. (I'm assuming that the textarea actually changes the string, but firefox isn't behaving as you report, so I can't check this)
Another solution (mentioned already I think) is to build/buy/borrow a custom text area (not bad if simple, but there's all the editing keys, ctrl-C, ctrl-shift-left and so on).
2 You would like users to not have to bother escaping.
You're in escape-hell:
A regex replace will mostly work... but how can you reliably detect the end quote ("), when the user might (legitimately, within the terms you've given) enter :
<Configuration>
<Validator Expression="[^"<]" />
</Configuration>
Looking at it from the point of view of the regex syntax, it also can't tell whether the final " is part of the regex, or the end of it. Regex syntax usually solves this problem with an explicit terminator eg:
/[^"<]/
If users used this syntax (with the terminator), and you wrote a parser for it, then you could determine when the regex has ended, and therefore that the next " character is not part of the regex, but part of the XML, and therefore which parts need to be escaped. I'm not saying you should do this! I'm saying it's theoretically possible. It's pretty far from quick and dirty.
BTW: The same problem arises for text within an element. The following is legitimate, within the terms you've given, but has the same parsing problems:
<Configuration>
<Expression></Expression></Expression>
</Configuration>
The basic rule in a syntax that allows "any text" is that the delimiter must be escaped, (e.g. " or <), so that the end can be recognized. Most syntax also escapes a bunch of other stuff, for convenience/inconvenience. (EDIT it will need to have an escape for the escape character itself: for XML, it is "&", which when literal is escaped as "&amp;". For regex, it is the C/unix-style "\", which when literal is escaped as "\\").
Nest syntaxes, and you're in escape-hell.
One simple solution for you is to tell your users: this is a quick and dirty configuration editor, so you're not getting any fancy "no need to escape" mamby-pamby:
List the characters and escapes next to the text area, eg: "<" as "&lt;".
For XML that won't validate, show them the list again.
Looking back, I see bobince gave the same basic answer before me.
Inserting CDATA around all text would give you another escape mechanism that would (1) save users from manually escaping, and (2) enable the text that was automatically unescaped by the textarea to be read back correctly.
<Configuration>
<Validator Expression="<![CDATA[ [^<] ]]>" />
</Configuration>
:-)
This special character - "<" - should be replaced with its entity reference ("&lt;") so that your XML will be valid. Check this link for XML special characters:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Try also to encode your TextBlock content before sending it to the deserializer:
// HttpServerUtility has no public constructor; the static HttpUtility class exposes the same encoder.
string encodedText = System.Web.HttpUtility.HtmlEncode(text);
Is this really my only option? Isn't this a common enough problem that it has a solution somewhere in the framework?
private string EscapeAttributes(string configuration)
{
var lt = @"(?<=\w+\s*=\s*""[^""]*)<(?=[^""]*"")";
configuration = Regex.Replace(configuration, lt, "&lt;");
return configuration;
}
(edit: deleted ampersand replacement as it causes problems roundtripping)
Related
One of my element in an xml has a value like
<item name="abc_def>" />
The actual value pulled from the data source is "abc_def!!>". I have no control over this data source and this cannot be changed.
I would like to know how to escape these characters when XML serialization is taking place. I have tried a couple of things, but they didn't work.
I tried all methods explained here
What is the correct way to escape these characters ? The end output is an api which our clients hit using their browsers and because of this issue, the xml parsing in browser is breaking.
If all of your strings look like that one, you can do something like this:
string input = "abc_def>";
input = input.Replace("", "!!");
string output = HttpUtility.HtmlDecode(input);
You need to use:
System.Net.WebUtility.HtmlEncode(stringToEncode);
Of course when you later decode that you use:
System.Net.WebUtility.HtmlDecode(stringToDecode);
This is for UWP, namespace may vary depending on what framework you use.
I have looked at most of the questions about parsing XML with special characters into SQL and could not find anything relevant that didn't involve having control over the XML output itself.
I understand that the way to do this is to make sure all special characters are escaped. The issue I have is that I do not have control over the XML until after it has been generated. The output could be something like the below. I need to find a way to replace all the special characters within the element content without touching the characters that are valid XML markup. This could be done using a CLR function or in straight SQL; I will even consider other options.
<?xml version="1.0" ?>
<A>
<B>this is my test <myemail@gmail.com</B>
<B>>>>this is another test<<<</B>
</A>
You are probably looking for something similar to HtmlEncode() of the contents. Loop through your XML structure and encode the fields you need to prior to writing to the DB, and perform the HtmlDecode() on the read from the DB.
https://msdn.microsoft.com/en-us/library/w3te6wfz%28v=vs.110%29.aspx
If you are sure the XML element names are valid, then the solution could be to use regular expressions to parse the XML as text and substitute the & with &amp;, the > with &gt; and the < with &lt;.
Have a look here regular expression to find special character & between xml tags for example.
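A rough sketch of that text-level approach, in Python: it assumes the complete set of legitimate element names is known up front (here just `A` and `B`, as in the sample) and that the input contains no escapes at all, since any existing `&amp;` would be escaped a second time. The helper name `escape_text_between_tags` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Everything that matches is treated as real markup and left alone.
KNOWN_MARKUP = re.compile(r"(<\?xml[^>]*\?>|</?(?:A|B)>)")

def escape_text_between_tags(raw):
    # re.split with one capturing group alternates: text, markup, text, ...
    parts = KNOWN_MARKUP.split(raw)
    return "".join(p if i % 2 else escape(p) for i, p in enumerate(parts))

raw = ('<?xml version="1.0" ?>\n<A>\n'
       '<B>this is my test <myemail@gmail.com</B>\n'
       '<B>>>>this is another test<<<</B>\n</A>')
root = ET.fromstring(escape_text_between_tags(raw))
print([b.text for b in root.findall("B")])
```

This breaks down as soon as the text itself contains something that looks like one of the known tags, so it is a stopgap rather than a general repair.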
I am saving xml from .NET's XElement. I've been using the method ToString, but the formatting doesn't look how I'd like (examples below). I'd like at most two tags per line. How can I achieve that?
Saving XElement.Parse("<a><b><c>one</c><c>two</c></b><b>three<c>four</c><c>five</c></b></a>").ToString() gives me
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three<c>four</c><c>five</c></b>
</a>
But for readability I would rather 'three', 'four' and 'five' were on separate lines:
<a>
<b>
<c>one</c>
<c>two</c>
</b>
<b>three
<c>four</c>
<c>five</c>
</b>
</a>
Edit: Yes I understand this is syntactically different and "not in the spirit of xml", but I'm being pragmatic. Recently I've seen megabyte-size xml files with as few as 3 lines—these are challenging to text editors, source control, and diff tools. Something needs to be done! I've tested that changing the formatting above is compatible with our application.
If you want exactly that output, you'll need to do it manually, adding whitespace around nodes as necessary.
Almost all whitespace in XML documents is significant, even if we only think of it as indenting. When we ask the serializer to indent the document for us, it is making changes to the content that can get extracted, so they try to be as conservative as possible. The elements
<tag>foo</tag>
and
<tag>
foo
</tag>
have different content, and if a serializer changed the former into the latter, it would change what you get back from your XML API when asking for the contents of <tag>.
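The difference is easy to see from any XML API. A small sketch using Python's standard parser (chosen only for illustration; XElement behaves analogously for the element's value):

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<tag>foo</tag>")
indented = ET.fromstring("<tag>\n  foo\n</tag>")

# The indentation and newlines are part of the element's text content.
print(repr(compact.text))
print(repr(indented.text))
```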
The usual rule of thumb is that no indenting will be applied if there's any existing non-whitespace between the elements. In this case, your three between the tags would be modified if a serializer applied the indenting you desire, so nothing will do it for you automatically.
If you have control over the XML format, it's inadvisable to mix element and text children like this, where <b> has both text (three) and element (<c>) children, as it causes issues like what you're seeing.
The formatting isn't working the way you want because of the naked "three". Is there a reason it's not in its own tag? Should it be an attribute of "b" instead?
Explained reasons to colleagues - we're going to change the file format. I recommend you try to do the same. It's nigh impossible to do what I wanted, because most xml tools assume whitespace is significant.
XML is an information exchange format, intended for computers. The whitespace is irrelevant (depending on location and schema, really) and as such, it would be arbitrary to use one or the other.
You could use XmlTextWriter with XElement.Save and see whether you can tweak it to your liking with the XmlWriter.Settings Property
I've had to do something similar before (for a client request). All I ended up doing was writing a custom .ToString() method only used for either displaying the XML in a browser(ugh, i know) or for their use in downloading an xml file of the content. Because the code did not have to be computationally efficient, it was merely a matter of checking the children of each tag and arranging the 'hanging' text as such.
Eventually we were able to convince the user that the text should be an attribute instead.
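For reference, a custom formatter of the kind described above might look like the sketch below. It is written in Python rather than as an XElement extension, `format_element` is a hypothetical name, and it deliberately trims and re-indents text, so it changes whitespace content and is only suitable when the consuming application tolerates that.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def format_element(elem, depth=0, indent="  "):
    """Serialize with one child element per line, keeping leading text on the open tag's line."""
    pad = indent * depth
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
    text = escape((elem.text or "").strip())
    children = list(elem)
    if not children:
        return f"{pad}<{elem.tag}{attrs}>{text}</{elem.tag}>"
    lines = [f"{pad}<{elem.tag}{attrs}>{text}"]
    for child in children:
        # Text that follows a child element stays on that child's line.
        tail = escape((child.tail or "").strip())
        lines.append(format_element(child, depth + 1, indent) + tail)
    lines.append(f"{pad}</{elem.tag}>")
    return "\n".join(lines)

doc = ET.fromstring(
    "<a><b><c>one</c><c>two</c></b><b>three<c>four</c><c>five</c></b></a>"
)
print(format_element(doc))
```

For the sample document this puts 'three', 'four' and 'five' on separate lines, at the cost of adding whitespace after 'three'.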
While loading XML file in a C# application, I am getting
Name cannot begin with the '1' character, hexadecimal value 0x31.
Line 2, position 2.
The XML tag begins like this.
<version="1.0" encoding="us-ascii" standalone="yes" />
<1212041205115912>
I am not supposed to change this tag at any cost.
How can I resolve this?
You are supposed to change the tag name since the one you wrote violates the xml standard.
Just to remember the interesting portion of it here:
XML Naming Rules
XML elements MUST follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used, no words are reserved.
As a suggestion for solving your problem while maintaining the standard:
Use an attribute, ie <Number value="1212041205115912"/>
Add a prefix to the tag ie <_1212041205115912/>
Of course you can maintain the structure you propose by writing your own format parser, but I think it would be a really bad idea: in the future someone will probably extend the format and will not be happy to find that a file that looks like XML actually is not. Furthermore, if you want a custom format, use something simpler. Sprinkling '<' and '>' through a text file adds no value if the result is not an officially recognized format; a simple plain text file would be better.
IF you absolutely cant change it, eg. for some reason the format is already out in the wild and used by other systems/customers/whatever.
Since it is an invalid xml document, try to clean it up before parsing it.
eg. make a regex that replaces all < number> tags with < IMessedUp>number< /IMessedUp> and then parse it.
Sort of an iffy way to do it, but it will solve your problem.
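One possible shape for such a cleanup pass, sketched in Python: instead of wrapping the number in a new element, this variant applies the prefix idea suggested earlier in the thread, turning `<1212...>` into `<_1212...>` before parsing. The helper name and the underscore prefix are assumptions, and the pattern would also rewrite a `<` followed by a digit inside a CDATA section or comment.

```python
import re
import xml.etree.ElementTree as ET

# A "<" or "</" immediately followed by a digit can only be an illegal tag name.
NUMERIC_TAG = re.compile(r"<(/?)(\d)")

def prefix_numeric_tags(text, prefix="_"):
    return NUMERIC_TAG.sub(lambda m: f"<{m.group(1)}{prefix}{m.group(2)}", text)

broken = "<1212041205115912><item>ok</item></1212041205115912>"
root = ET.fromstring(prefix_numeric_tags(broken))
print(root.tag)
```

The original file is left untouched; the prefix only exists in the in-memory copy handed to the parser, and it would need to be stripped again if the document is written back out.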
If you need to process this document, then stop thinking of it as XML, and cast aside any thoughts of using XML tools to process it. You're dealing with a proprietary format and you will need to write your own tools to handle it. If you want the benefits of using XML technology, you will have to redesign your documents so they are valid XML.
UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are two problems that prevent it from being accepted as valid XML.
I hope someone has a solution for fixing these bugs gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE=&quot;C01&quot;>WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add CDATA markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the <s and >s. They're at least escaping embedded quotes, so a regex to go from " to " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
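To illustrate the brittleness mentioned above, here is a Python sketch of the same attribute repair that also handles a bare `&`. It assumes, as the answer does, that embedded quotes arrive already escaped as `&quot;`, and it only recognizes the five predefined XML entities plus numeric references; the names `escape_description` and `BARE_AMP` are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

DESCRIPTION = re.compile(r'description="[^"]*"')
# An ampersand that does not start a predefined or numeric entity reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")

def escape_description(so_called_xml):
    def fix(match):
        value = BARE_AMP.sub("&amp;", match.group(0))
        return value.replace("<", "&lt;").replace(">", "&gt;")
    return DESCRIPTION.sub(fix, so_called_xml)

broken = ('<cars><car model="opel" description="fast & cheap '
          '<L CODE=&quot;C01&quot;>WWW.cars.com</L>"/></cars>')
car = ET.fromstring(escape_description(broken)).find("car")
print(car.get("description"))
```

If the producer ever sends an unescaped quote inside the attribute, the `[^"]*` match ends early and the repair fails, which is exactly the kind of breakage to expect from this approach.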
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"