XML snippet:
<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>
I'm looking for suggestions on how to pre-process XML so that everything not yet escaped gets escaped before the XML is run through a parser.
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue is that feeding the XML as-is into a parser, such as the one below, throws an exception because some of the content is not escaped properly:
string xml = "<field>& is not escaped</field>";
var reader = XmlReader.Create(new StringReader(xml));
while (reader.Read()) { } // throws XmlException at the unescaped '&'
I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
This question is helpful as it gives you a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");
And then you'll be able to parse your XML.
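The replace-then-parse approach above could be wrapped up roughly as follows. This is a minimal sketch, not a definitive implementation: the class and method names (`AmpersandFixer`, `EscapeRogueAmpersands`, `Parse`) are made up for illustration, and it assumes the only defect in the input is bare ampersands. Note that the pattern skips anything starting with `&#`, so a malformed character reference would still reach the parser.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandFixer
{
    // Matches '&' that does not begin one of the five predefined entities
    // and is not followed by '#' (the start of a character reference).
    private static readonly Regex RogueAmpersand =
        new Regex(@"&(?!(?:apos|quot|[gl]t|amp);|#)");

    // Hypothetical helper: escapes only the ampersands that are not already
    // part of an entity, so text that is escaped correctly is left untouched.
    public static string EscapeRogueAmpersands(string xml)
    {
        return RogueAmpersand.Replace(xml, "&amp;");
    }

    // Hypothetical helper: pre-processes the text, then hands it to the parser.
    public static XElement Parse(string xml)
    {
        return XElement.Parse(EscapeRogueAmpersands(xml));
    }
}
```

Because escaped ampersands are skipped, running the pre-processing step twice gives the same result as running it once.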
Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.
If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like &nbsp; when there's no definition of &nbsp;, then life starts to become rather more difficult.
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.
I have looked at most of the questions about parsing XML into SQL with special characters and could not find anything relevant that didn't involve having control over the XML output itself.
I understand that the way to do this is to make sure all special characters are escaped. The issue I have is that I do not have control over the XML until after it has been generated. The output could be something like the example below. I need to find a way to replace all the special characters within the text without touching the characters that are valid XML markup. This could be done using a CLR function or in straight SQL; I will even consider other options.
<?xml version="1.0" ?>
<A>
<B>this is my test <myemail@gmail.com</B>
<B>>>>this is another test<<<</B>
</A>
You are probably looking for something similar to HtmlEncode() of the contents. Loop through your XML structure and encode the fields you need to prior to writing to the DB, and perform the HtmlDecode() on the read from the DB.
https://msdn.microsoft.com/en-us/library/w3te6wfz%28v=vs.110%29.aspx
If you are sure the XML element names are valid, then the solution could be to use regular expressions to parse the XML as text and substitute & with &amp;, > with &gt; and < with &lt;.
Have a look here regular expression to find special character & between xml tags for example.
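The regex idea above might look something like this. It is only a sketch under strong assumptions: element names are valid, leaf elements have no attributes and sit on a single line (as in the sample from the question), and the text never contains its own closing tag. The names `LeafTextEscaper` and `Escape` are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class LeafTextEscaper
{
    // <name>text</name> on a single line. '.' does not match a newline here,
    // so parent elements whose children are on separate lines are not matched.
    private static readonly Regex Leaf = new Regex(@"<(\w+)>(.*?)</\1>");

    // '&' that is not already the start of an entity or character reference.
    private static readonly Regex RogueAmpersand =
        new Regex(@"&(?!(?:apos|quot|[gl]t|amp);|#)");

    public static string Escape(string xml)
    {
        return Leaf.Replace(xml, m =>
            "<" + m.Groups[1].Value + ">" +
            EscapeText(m.Groups[2].Value) +
            "</" + m.Groups[1].Value + ">");
    }

    private static string EscapeText(string text)
    {
        // Ampersands first, so the '&' of the entities added next is not re-escaped.
        text = RogueAmpersand.Replace(text, "&amp;");
        return text.Replace("<", "&lt;").Replace(">", "&gt;");
    }
}
```

With the sample document, the `<A>` wrapper is left alone while the text inside each `<B>` is escaped. An element that mixes text and child elements on one line would be damaged by this, which is why the single-line leaf assumption matters.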
While loading XML file in a C# application, I am getting
Name cannot begin with the '1' character, hexadecimal value 0x31.
Line 2, position 2.
The XML tag begins like this.
<?xml version="1.0" encoding="us-ascii" standalone="yes" ?>
<1212041205115912>
I am not supposed to change this tag at any cost.
How can I resolve this?
You are supposed to change the tag name since the one you wrote violates the xml standard.
Just to remember the interesting portion of it here:
XML Naming Rules
XML elements MUST follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used, no words are reserved.
As a suggestion for solving your problem while keeping to the standard:
Use an attribute, i.e. <Number value="1212041205115912"/>
Add a prefix to the tag, i.e. <_1212041205115912/>
Of course you can keep the structure you propose by writing your own format parser, but I think it would be a really bad idea: in the future someone will probably extend the format and will not be happy to discover that a file that looks like XML actually is not. Furthermore, if you want a custom format, use something simpler. Sprinkling '<' and '>' through a text file adds no value if the result is not an officially recognized format; a simple plain text file would be better.
If you absolutely can't change it, e.g. because the format is already out in the wild and used by other systems or customers:
Since it is an invalid XML document, try to clean it up before parsing it.
E.g. write a regex that replaces all <number> tags with <IMessedUp>number</IMessedUp> and then parse the result.
It's a sort of iffy way to do it, but it will solve your problem.
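One possible shape for such a clean-up step is sketched below. It uses the underscore prefix suggested in an earlier answer rather than a wrapper element, since a prefix handles opening and closing tags the same way. The names are hypothetical, and it assumes a `<` followed directly by a digit only ever occurs in a tag name, never in text content.

```csharp
using System.Text.RegularExpressions;

public static class NumericTagFixer
{
    // "<" or "</" followed directly by a digit, i.e. a tag name starting with a number.
    private static readonly Regex NumericTag = new Regex(@"<(/?)(\d)");

    // Hypothetical helper: inserts '_' before the digit so the name becomes legal XML.
    public static string PrefixNumericTags(string text)
    {
        return NumericTag.Replace(text, "<${1}_${2}");
    }
}
```

The original name can be recovered after parsing by stripping the leading underscore, so whatever consumes the document still sees the number it expects.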
If you need to process this document, then stop thinking of it as XML, and cast aside any thoughts of using XML tools to process it. You're dealing with a proprietary format and you will need to write your own tools to handle it. If you want the benefits of using XML technology, you will have to redesign your documents so they are valid XML.
I have a string that contains special characters (a trademark sign, etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows as ??. This is how I'm using it:
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.HtmlEncode(xxxx), but it converts the character into "&#8482;", so the XML output contains "&amp;#8482;" instead of ™.
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
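To check whether the value survives a save and reload, something along these lines could be used. It is a rough sketch with made-up names (`XmlRoundTrip`, `SaveAndReload`); it relies on LINQ to XML to do all escaping and encoding, with no manual encoding step.

```csharp
using System.IO;
using System.Xml.Linq;

public static class XmlRoundTrip
{
    // Hypothetical helper: writes the value as element text, saves the document
    // to a stream (UTF-8 by default), reloads it and returns the text read back.
    public static string SaveAndReload(string value)
    {
        var doc = new XDocument(new XElement("node", value));
        using (var stream = new MemoryStream())
        {
            doc.Save(stream);
            stream.Position = 0;
            return XDocument.Load(stream).Root.Value;
        }
    }
}
```

If the string that comes back equals the one that went in, the XML itself is fine and the "??" is most likely coming from the viewer or from a later step that re-encodes the file.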
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.
I've got to get a quick and dirty configuration editor up and running. The flow goes something like this:
configuration (POCOs on server) are serialized to XML.
The XML is well formed at this point. The configuration is sent to the web server in XElements.
On the web server, the XML (Yes, ALL OF IT) is dumped into a textarea for editing.
The user edits the XML directly in the webpage and clicks Submit.
In the response, I retrieve the altered text of the XML configuration. At this point, ALL escapes have been reverted by the process of displaying them in a webpage.
I attempt to load the string into an XML object (XmlElement, XElement, whatever). KABOOM.
The problem is that serialization escapes attribute strings, but this is lost in translation along the way.
For example, let's say I have an object that has a regex. Here's the configuration as it comes to the web server:
<Configuration>
<Validator Expression="[^&lt;]" />
</Configuration>
So, I put this into a textarea, where it looks like this to the user:
<Configuration>
<Validator Expression="[^<]" />
</Configuration>
So the user makes a slight modification and submits the changes back. On the web server, the response string looks like:
<Configuration>
<Validator Expression="[^<]" />
<Validator Expression="[^&]" />
</Configuration>
So, the user added another validator thingie, and now BOTH have attributes with illegal characters. If I try to load this into any XML object, it throws an exception because < and & are not valid within a text string. I CANNOT CANNOT CANNOT CANNOT use any kind of encoding function, as it encodes the entire bloody thing:
var result = Server.HtmlEncode(editedConfig);
results in
&lt;Configuration&gt;
&lt;Validator Expression=&quot;[^&lt;]&quot; /&gt;
&lt;Validator Expression=&quot;[^&amp;]&quot; /&gt;
&lt;/Configuration&gt;
This is NOT valid XML. If I try to load this into an XML element of any kind I will be hit by a falling anvil. I don't like falling anvils.
SO, the question remains... Is using regex replaces the ONLY way I can get this string XML ready for parsing into an XML object? Is there any way to "turn off constraints" when I load? How do you get around this???
One last response and then wiki-izing this, as I don't think there is a valid answer.
The XML I place in the textarea IS valid, escaped XML. The process of 1) putting it in the text area 2) sending it to the client 3) displaying it to the client 4) submitting the form it's in 5) sending it back to the server and 6) retrieving the value from the form REMOVES ANY AND ALL ESCAPES.
Let me say this again: I'M not un-escaping ANYTHING. Just displaying it in the browser does this!
Things to mull over: Is there a way to prevent this un-escaping from happening in the first place? Is there a way to take almost-valid XML and "clean" it in a safe manner?
This question now has a bounty on it. To collect the bounty, you demonstrate how to edit VALID XML in a browser window WITHOUT a 3rd party/open source tool that doesn't require me to use regex to escape attribute values manually, that doesn't require users to escape their attributes, and that doesn't fail when roundtripping (&amp;amp;etc;)
Erm … How do you serialize? Usually, the XML serializer should never produce invalid XML.
/EDIT in response to your update: Do not display invalid XML to your user to edit! Instead, display the properly escaped XML in the TextBox. Repairing broken XML isn't fun and I actually see no reason not to display/edit the XML in a valid, escaped form.
Again I could ask: how do you display the XML in the TextBox? You seem to intentionally unescape the XML at some point.
/EDIT in response to your latest comment: Well yes, obviously, since it can contain HTML. You need to escape your XML properly before writing it out into an HTML page. With that, I mean the whole XML. So this:
<foo mean-attribute="&lt;">
becomes this:
&lt;foo mean-attribute="&amp;lt;"&gt;
Of course when you put entity references inside a textarea they come out unescaped. Textareas aren't magic, you have to &escape; everything you put in them just like every other element. Browsers might display a raw '<' in a textarea, but only because they're trying to clean up your mistakes.
So if you're putting editable XML in a textarea, you need to escape the attribute value once to make it valid XML, and then you have to escape the whole XML again to make it valid HTML. The final source you want to appear in the page would be:
<textarea name="somexml">
&lt;Configuration&gt;
&lt;Validator Expression="[^&amp;lt;]" /&gt;
&lt;Validator Expression="[^&amp;amp;]" /&gt;
&lt;/Configuration&gt;
</textarea>
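The two levels of escaping described above could be produced on the server roughly like this. It is a sketch, not the asker's actual code: `TextareaXml` and its methods are invented names, and `WebUtility.HtmlDecode` is used in the usage note only to stand in for what the browser does when it parses the page.

```csharp
using System.Net;
using System.Xml.Linq;

public static class TextareaXml
{
    // First level: XElement.ToString() escapes attribute values, giving valid XML.
    // Second level: HtmlEncode escapes that whole XML string for use as textarea content.
    public static string EncodeForTextarea(XElement config)
    {
        string xml = config.ToString();
        return WebUtility.HtmlEncode(xml);
    }

    // The browser undoes only the HTML level, so the posted value is valid XML again.
    public static XElement ParsePostedValue(string posted)
    {
        return XElement.Parse(posted);
    }
}
```

For a validator whose expression is `[^<]`, the encoded text contains `[^&amp;lt;]`, the user sees `[^&lt;]` in the textarea, and the posted value parses back to `[^<]`. The user still has to keep the XML well-formed while editing.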
Question is based on a misunderstanding of the content model of the textarea element - a validator would have picked up the problem right away.
ETA re comment: Well, what problem remains? That's the issue on the serialisation side. All that remains is parsing it back in, and for that you have to assume the user can create well-formed XML.
Trying to parse non-well-formed XML, in order to allow errors like having '<' or '&' unescaped in an attribute value is a loss, totally against how XML is supposed to work. If you can't trust your users to write well-formed XML, give them an easier non-XML interface, such as a simple newline-separated list of regexp strings.
As you say, the normal serializer should escape everything for you.
The problem, then, is the text block: you need to handle anything passed through the textblock yourself.
You might try HttpUtility.HtmlEncode(), but I think the simplest method is to just encase anything you pass through the text block in a CDATA section.
Normally of course I would want everything properly escaped rather than relying on the CDATA "crutch", but I would also want to use the built-in tools to do the escaping. For something that is edited in its "hibernated" state by a user, I think CDATA might be the way to go.
Also see this earlier question:
Best way to encode text data for XML
Update
Based on a comment to another response, I've realized you're showing the users the markup, not just the contents. Xml parsers are, well, picky. I think the best thing you could do in this case is to check for well-formedness before accepting the edited xml.
Perhaps try to automatically correct certain kinds of errors (like bad ampersands from my linked question), but then get the line number and column number of the first validation error from the .Net xml parser and use that to show users where their mistake is until they give you something acceptable. Bonus points if you also validate against a schema.
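A well-formedness check that reports where the first problem is could be sketched like this. `XmlCheck` and `FirstError` are hypothetical names; the position comes from the `LineNumber` and `LinePosition` properties of `XmlException`. Schema validation is not covered here.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class XmlCheck
{
    // Returns null when the text is well-formed XML; otherwise a message
    // with the line and column of the first error the parser hit.
    public static string FirstError(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return null;
        }
        catch (XmlException ex)
        {
            return "Line " + ex.LineNumber + ", position " + ex.LinePosition + ": " + ex.Message;
        }
    }
}
```

The submitted text would be rejected, with this message shown next to the textarea, until `FirstError` returns null.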
You could take a look at something like TinyMCE, which allows you to edit html in a rich text box. If you can't configure it to do exactly what you want, you could use it as inspiration.
Note: firefox (in my test) does not unescape in text areas as you describe. Specifically, this code:
<textarea cols="80" rows="10" id="1"></textarea>
<script>
elem = document.getElementById("1");
elem.value = '\
<Configuration>\n\
<Validator Expression="[^&lt;]" />\n\
</Configuration>\
'
alert(elem.value);
</script>
Is alerted and displayed to the user unchanged, as:
<Configuration>
<Validator Expression="[^&lt;]" />
</Configuration>
So maybe one (un-viable?) solution is for your users to use firefox.
It seems two parts to your question have been revealed:
1 XML that you display is getting unescaped.
For example, "&lt;" is unescaped as "<". But since "<" is also unescaped as "<", information is lost and you can't get it back.
One solution is for you to escape all the "&" characters, so that "&lt;" becomes "&amp;lt;". This will then be unescaped by the textarea as "&lt;". When you read it back, it will be as it was in the first place. (I'm assuming that the textarea actually changes the string, but firefox isn't behaving as you report, so I can't check this)
Another solution (mentioned already I think) is to build/buy/borrow a custom text area (not bad if simple, but there's all the editing keys, ctrl-C, ctrl-shift-left and so on).
2 You would like users to not have to bother escaping.
You're in escape-hell:
A regex replace will mostly work... but how can you reliably detect the end quote ("), when the user might (legitimately, within the terms you've given) enter :
<Configuration>
<Validator Expression="[^"<]" />
</Configuration>
Looking at it from the point of view of the regex syntax, it also can't tell whether the final " is part of the regex, or the end of it. Regex syntax usually solves this problem with an explicit terminator eg:
/[^"<]/
If users used this syntax (with the terminator), and you wrote a parser for it, then you could determine when the regex has ended, and therefore that the next " character is not part of the regex, but part of the XML, and therefore which parts need to be escaped. I'm not saying you should do this! I'm saying it's theoretically possible. It's pretty far from quick and dirty.
BTW: The same problem arises for text within an element. The following is legitimate, within the terms you've given, but has the same parsing problems:
<Configuration>
<Expression></Expression></Expression>
</Configuration>
The basic rule in a syntax that allows "any text" is that the delimiter must be escaped, (e.g. " or <), so that the end can be recognized. Most syntax also escapes a bunch of other stuff, for convenience/inconvenience. (EDIT it will need to have an escape for the escape character itself: for XML, it is "&", which when literal is escaped as "&amp;". For regex, it is the C/unix-style "\", which when literal is escaped as "\\").
Nest syntaxes, and you're in escape-hell.
One simple solution for you is to tell your users: this is a quick and dirty configuration editor, so you're not getting any fancy "no need to escape" namby-pamby:
List the characters and their escapes next to the text area, e.g. "<" as "&lt;".
For XML that won't validate, show them the list again.
Looking back, I see bobince gave the same basic answer before me.
Inserting CDATA around all text would give you another escape mechanism that would (1) save users from manually escaping, and (2) enable the text that was automatically unescaped by the textarea to be read back correctly.
<Configuration>
<Validator Expression="<![CDATA[ [^<] ]]>" />
</Configuration>
:-)
The special character "<" should have been replaced with its entity reference ("&lt;") so that your XML is valid. Check this link for XML special characters:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Try also to encode your TextBlock content before sending it to the deserializer:
// HttpServerUtility has no public constructor; the static HttpUtility class (System.Web) offers the same method.
string encodedText = HttpUtility.HtmlEncode(text);
Is this really my only option? Isn't this a common enough problem that it has a solution somewhere in the framework?
private string EscapeAttributes(string configuration)
{
var lt = @"(?<=\w+\s*=\s*""[^""]*)<(?=[^""]*"")";
configuration = Regex.Replace(configuration, lt, "&lt;");
return configuration;
}
(edit: deleted ampersand replacement as it causes problems roundtripping)