Currently my understanding of XML-legal strings is that all that is required is to replace any instances of &, ", ', <, > with &amp;, &quot;, &apos;, &lt;, &gt; respectively.
So I made the following parser:
private static string ToXmlCompliantStr(string uriStr)
{
    string uriXml = uriStr;
    // Replace & first so the entities added below are not escaped again.
    uriXml = uriXml.Replace("&", "&amp;");
    uriXml = uriXml.Replace("\"", "&quot;");
    uriXml = uriXml.Replace("'", "&apos;");
    uriXml = uriXml.Replace("<", "&lt;");
    uriXml = uriXml.Replace(">", "&gt;");
    return uriXml;
}
I am aware that there are similar questions out there with good answers (which is how I was able to write this function). I am asking whether this code will translate ANY string that C# can throw at it so that XDocument parses it as part of a whole document without complaint. All the questions I've found state that these are the only escape characters, not that replacing them guarantees a 100% valid XML string. I've gone as far as reading through the decompiled XNode class to see how it parses its input.
Thanks
Firstly, you should absolutely not do this yourself. Use an XML API - that way you can trust that to do the right thing, rather than worrying about covering corner cases etc. You generally shouldn't be trying to come up with an "escaped string" at all - you should pass the string to the XElement constructor (or XAttribute, or whatever your situation is).
In other words, I think you should try really hard to design your application so that you don't need a method of the kind you've shown in your question at all. Look at where you'd be using that method, and see whether you can just create an XElement (or whatever) instead. If you treat XML as a data structure in itself rather than just as text, you'll have a much better time, in my experience.
Secondly, you need to understand that in XML 1.0 at least, there are Unicode characters that cannot be validly represented in XML, no matter how much escaping you use. In particular, values U+0000 to U+001F are unrepresentable other than U+0009 (tab), U+000A (line feed) and U+000D (carriage return). Also if you have a string which contains invalid UTF-16 (e.g. an unmatched half of a surrogate pair), that can't be correctly represented in XML.
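As a rough illustration of both points, here is a minimal sketch. The class and method names (XmlTextSketch, StripInvalidXmlChars, ToElementXml) are hypothetical, and silently dropping unrepresentable characters is only one possible policy; rejecting the input may suit your application better. It relies on XElement to do the escaping and on XmlConvert.IsXmlChar to detect characters XML 1.0 cannot carry.

```csharp
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class XmlTextSketch
{
    // Hypothetical helper: drops characters that XML 1.0 cannot represent,
    // including unpaired surrogate halves. Dropping is an assumed policy.
    public static string StripInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair: keep both halves.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // The XML API performs the escaping; no manual Replace calls are needed.
    public static string ToElementXml(string name, string text)
    {
        return new XElement(name, StripInvalidXmlChars(text)).ToString();
    }
}
```

Calling XmlTextSketch.ToElementXml("field", "a & b < c") yields the element with the ampersand and angle bracket escaped for you.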
Related
XML snippet:
<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>
I'm looking for suggestions on how I could go about pre-parsing any XML to escape everything not escaped prior to running the XML through a parser?
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue I'm running into is that feeding the XML as-is into a parser, as below, throws an exception because some of it is not escaped properly:
string xml = "<field>& is not escaped</field>";
XmlReader.Create(new StringReader(xml));
I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
This question is helpful as it gives you a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");
And then you'll be able to parse your XML.
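Put together, the approach might look like the sketch below. The names AmpersandRepair, EscapeRogueAmpersands and ParseLenient are hypothetical. Note the assumptions baked into the pattern: anything starting with &# is left alone even if it is not a valid character reference, and ampersands inside comments or CDATA sections are rewritten too.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandRepair
{
    // Matches an ampersand that does not start one of the five predefined
    // entities or a numeric character reference.
    private static readonly Regex RogueAmpersand =
        new Regex("&(?!(?:apos|quot|[gl]t|amp);|#)");

    public static string EscapeRogueAmpersands(string xml)
    {
        return RogueAmpersand.Replace(xml, "&amp;");
    }

    // Repairs the text first, then hands it to a real XML parser.
    public static XDocument ParseLenient(string xml)
    {
        return XDocument.Parse(EscapeRogueAmpersands(xml));
    }
}
```

Ampersands that are already part of an entity such as &amp; are left untouched, so running the repair twice does not double-escape them.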
Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.
If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like &nbsp; when there's no definition of &nbsp;, then life starts to become rather more difficult.
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.
Here is my problem: I want String.Format() to take four objects and format a string, but it throws an "Input string was not in a correct format" error.
Here is my code:
string jsonData = string.Format("{{\"sectionTitle\":\"{0}\",\"strPushMsg\":\"{1}\",\"Language\":\"{2}\",}\",\"articleid\":\"{3}\"}}", urlsectiontitle, formatHeadline, Language, articleid);
\"{2}\",}\"
Looks like you need to escape that closing brace by doubling it:
string.Format("{{\"sectionTitle\":\"{0}\",\"strPushMsg\":\"{1}\",\"Language\":\"{2}\",}}\",\"articleid\":\"{3}\"}}", urlsectiontitle, formatHeadline, Language, articleid);
It appears you are creating JSON. Strict JSON requires double quotes, although lenient parsers such as JSON.Net also accept single quotes (which would avoid all the escaping). Even better, use a tool like JSON.Net that is designed to create JSON. While your (partial) structure here is quite small (the unmatched } shows this is only partial), as the JSON gets bigger it is much easier to use a tool to get it right.
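As a sketch of the "use a tool" route: the example below uses System.Text.Json, which ships with recent .NET versions, rather than JSON.Net; that substitution is my assumption, made only to keep the example free of external packages. The PushPayload class and its Build method are hypothetical, and the property names are taken from the format string in the question.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class PushPayload
{
    // Builds the JSON object from the four values; the serializer takes
    // care of quoting and escaping, so no brace doubling is involved.
    public static string Build(string sectionTitle, string pushMsg, string language, string articleId)
    {
        var payload = new Dictionary<string, string>
        {
            ["sectionTitle"] = sectionTitle,
            ["strPushMsg"] = pushMsg,
            ["Language"] = language,
            ["articleid"] = articleId
        };
        return JsonSerializer.Serialize(payload);
    }
}
```

A value containing quotes or braces, such as a headline with a quoted phrase, then needs no special handling on your side.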
I have some VB.Net code which is parsing an XML string.
The XML String comes from a TCP 3rd Party stream and as such we have to take the data we get and deal with it.
The issue we have is that one element's data can sometimes contain special characters, e.g. &, $, <, and so "XMLDoc.LoadXml(XML)" fails when it is executed. Note that XMLDoc is declared as "Dim XMLDoc As XmlDocument = New XmlDocument()".
I have tried to Google answers for this but I am really struggling to find a solution. I have looked at a regex but realised it has some limitations, or I just don't understand it well enough.
If it helps, here is an example of the XML we would have streamed to us (just for info, the Message tag comes from an SMS message):
(The only part that will ever have an error, and all I have to check, is the <Message>O&N</Message> section; in this case the message has come in with an &.)
<IncomingMessage><DeviceSendTime>19/02/2013 14:00:50</DeviceSendTime>
<Sender>0000111111</Sender>
<Status>New</Status>
<Transport>Sms</Transport>
<Id>-1</Id>
<Message>O&N</Message>
<Timestamp>19/02/2013 14:00:50</Timestamp>
<ReadTimestamp>19/02/2013 14:00:50</ReadTimestamp>
</IncomingMessage>
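If we're looking specifically within Message elements, and assuming there are no nested elements within the Message element:
' Requires: Imports System.Net, System.Text.RegularExpressions, System.Xml.Linq
Dim url = "put url here"
Dim s As String
' Characters that must be escaped in XML text, mapped to their entities.
Dim characterMappings = New Dictionary(Of String, String) From {
    {"&", "&amp;"},
    {"<", "&lt;"},
    {">", "&gt;"},
    {"""", "&quot;"}
}
Using client As New WebClient
    s = client.DownloadString(url)
End Using
' Match only the text between the Message tags, then escape each special
' character inside it, leaving the tags themselves untouched.
' Assumes the message text contains no entities that are already escaped.
s = Regex.Replace(s,
    "(?<=<Message>).*?(?=</Message>)",
    Function(message) Regex.Replace(message.Value,
        String.Join("|", characterMappings.Keys),
        Function(m) characterMappings(m.Value))
)
Dim x = XDocument.Parse(s)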
$ should not be an issue with XML, but if it is you can add it to the dictionary.
Use of WebClient comes from here.
Updated
Since $ has special meaning in regular expressions, it cannot be simply added to the dictionary; it needs to be escaped with \ in the regular expression pattern. The simplest way to do this, would be to write the pattern manually, instead of joining the keys to the dictionary:
' Assumes a "$" entry has been added to characterMappings.
s = Regex.Replace(s,
    "(?<=<Message>).*?(?=</Message>)",
    Function(message) Regex.Replace(message.Value,
        "&|<|>|\$",
        Function(m) characterMappings(m.Value))
)
Also, I highly recommend Expresso for working with regular expressions.
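For readers working in C#, the same idea (escape only the text between the Message tags, then parse) could be sketched as follows. The MessageRepair class and its method names are hypothetical, and using SecurityElement.Escape for the entity replacement is my choice rather than part of the answer above. The sketch assumes the message body contains no nested elements and no entities that are already escaped.

```csharp
using System.Security;
using System.Text.RegularExpressions;
using System.Xml;

public static class MessageRepair
{
    // Lookarounds keep the <Message> tags out of the match, so only the
    // body text is handed to the evaluator.
    private static readonly Regex MessageBody =
        new Regex("(?<=<Message>).*?(?=</Message>)", RegexOptions.Singleline);

    public static string EscapeMessageBody(string xml)
    {
        return MessageBody.Replace(xml, m => SecurityElement.Escape(m.Value));
    }

    public static XmlDocument Load(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(EscapeMessageBody(xml));
        return doc;
    }
}
```

After loading, the InnerText of the Message node gives back the original, unescaped text of the SMS.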
Your XML is invalid, and hence it is not XML. Either fix the code that generates the XML (the correct approach) or treat this as a text file and accept all the problems of parsing unstructured text.
As you've stated in the question, <Message>O&N</Message> is not valid XML. The most likely cause of such "XML" is that it was constructed by string concatenation instead of proper XML manipulation methods. Unless you use some arcane language, every language in practical use has built-in or library support for XML creation, so it should not be too hard to create the XML correctly.
How do I get my C# code to recognize "ö"?
The output of the query is nicely formatted and all special characters are visible, but in the code-behind I cannot use them for sorting.
Example:
if (link.Contains("teborg"))
{
    CountRss++;
    Response.Write("<p class='RssCont'><a href='" + link + "' target='new'><b>" + title + "</b></a><br/>");
    Response.Write(description + "</p>");
}
will give several results with "Göteborg" in title but:
if (link.Contains("Göteborg"))
{
    CountRss++;
    Response.Write("<p class='RssCont'><a href='" + link + "' target='new'><b>" + title + "</b></a><br/>");
    Response.Write(description + "</p>");
}
will give no results at all.
Your code is perfectly sensible and good as code; the issue is with the data. There are four general possibilities here.
The first is an encoding issue, but since you say it renders okay I highly doubt that, or you'd have problems there too.
The second is a conflict between a precomposed ö and an ö formed from an o followed by a combining diaeresis. This is unlikely, but putting the string into NFC with link.Normalize() will catch that.
The third is that since it's a URI it might be in URI rather than IRI form. So it'll be G%c3%b6teborg (indeed, it could be G%C3%b6teborg, G%c3%B6teborg or G%C3%B6teborg). Unescape the string with Uri.UnescapeDataString(link) or any of the various methods for this. This is the one I'd bet on.
The fourth is that it could be XML escaped (since it's from RSS, to judge from the names used), in which case HtmlDecode should sort that out since, barring a DTD defining other entities, the encoding for HTML is a superset of that for XML. However, this is only possible if you are parsing the RSS with text-based rather than XML-based methods, in which case you've got bigger problems. If you're using XmlReader or XmlDocument or any other XML-based class, this decoding will have been done for you already if necessary, so that's not the issue.
So the third seems by far the most likely, and Uri.UnescapeDataString(link) the most promising.
You might want a less precise check than a case-sensitive, exact char-for-char match. Other methods will let you match göteborg and GÖTEBORG too. There are also some that would e.g. match goeteborg (it's common to transliterate ö to oe in English; this is more often done with German than Swedish, but it might still be done). (Matching e.g. the English Gothenburg or the Danish Gøteborg is a much more involved matter.)
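A minimal sketch combining the second and third possibilities with a case-insensitive match might look like this. LinkMatcher and LinkContains are hypothetical names, and OrdinalIgnoreCase is just one reasonable choice of comparison; it will not match transliterations such as goeteborg.

```csharp
using System;
using System.Text;

public static class LinkMatcher
{
    public static bool LinkContains(string link, string word)
    {
        // Undo percent-encoding (G%C3%B6teborg -> Göteborg), then put both
        // strings into NFC so precomposed and combining forms compare equal.
        string decoded = Uri.UnescapeDataString(link).Normalize(NormalizationForm.FormC);
        string needle = word.Normalize(NormalizationForm.FormC);
        return decoded.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

In the question's code, the check would then read if (LinkMatcher.LinkContains(link, "Göteborg")).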
If your code renders the link correctly, it is probably percent-encoded and as a result will not contain non-ASCII characters.
Depending on the position of the word in the URL, you may need to search for different text to find a match.
Note that using the proper Uri class to deal with URLs will make life easier. Also make sure you encode link correctly to avoid script injection attacks on your page.
I have a string that contains special characters (trademark sign etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows ??. This is how I'm using it:
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.HtmlEncode(xxxx) but it converts it into "&#8482;", so the output of the XML is "&amp;#8482;" instead of ™.
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName, but they give ??.
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
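The save-and-reload check suggested above could be sketched like this. TrademarkRoundTrip is a hypothetical name, and the element name "node" is only a placeholder. The value is assigned to InnerText as-is, with no HtmlEncode step, and the document is saved to and reloaded from an in-memory stream.

```csharp
using System.IO;
using System.Xml;

public static class TrademarkRoundTrip
{
    public static string RoundTrip(string value)
    {
        var doc = new XmlDocument();
        XmlElement node = doc.CreateElement("node");
        node.InnerText = value; // no manual encoding; the API handles it
        doc.AppendChild(node);

        using (var stream = new MemoryStream())
        {
            doc.Save(stream);
            stream.Position = 0;

            // Reload from the saved bytes and read the text back.
            var reloaded = new XmlDocument();
            reloaded.Load(stream);
            return reloaded.DocumentElement.InnerText;
        }
    }
}
```

If the value survives this round trip, the data is fine and the ?? is most likely coming from the viewer or from the encoding used when the output is displayed.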
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.