I have some VB.Net code which is parsing an XML string.
The XML String comes from a TCP 3rd Party stream and as such we have to take the data we get and deal with it.
The issue we have is that one of the elements data can sometimes contain special characters e.g. &, $ , < and thus when the “XMLDoc.LoadXml(XML)” is executed it fails - note XMLDoc is configured as "Dim XMLDoc As XmlDocument = New XmlDocument()".
Have tried to Google answers for this but I am really struggling to find a solution. Have looked at a RegEX but realised this has some limitations; or I just dont understand it enough lol.
If it helps here is an example of XLM we would have streamed to us (just for info the message tag comes from an SMS message):-
(if it helps the only bit that will ever have an error is (and all I have to check) the <Message>O&N</Message> section so in this case the message has come in with an &)
<IncomingMessage><DeviceSendTime>19/02/2013 14:00:50</DeviceSendTime>
<Sender>0000111111</Sender>
<Status>New</Status>
<Transport>Sms</Transport>
<Id>-1</Id>
<Message>O&N</Message>
<Timestamp>19/02/2013 14:00:50</Timestamp>
<ReadTimestamp>19/02/2013 14:00:50</ReadTimestamp>
</IncomingMessage>
If we're looking specifically within Message elements, and assuming there are no nested elements within the Message element:
Dim url = "put url here"
Dim s As String
Dim characterMappings = New Dictionary(Of String, String) From {
{"&", "&"},
{"<", "<"},
{">", ">"},
{"""", """}
}
Using client As New WebClient
s = client.DownloadString(url)
End Using
s = Regex.Replace(s,
"(?:<Message>).*?(" & String.Join("|", characterMappings.Keys) & ").*?(?:</Message>)",
Function(match) characterMappings(match.Groups(1).Value)
)
Dim x = XDocument.Parse(s)
$ should not be an issue with XML, but if it is you can add it to the dictionary.
Use of WebClient comes from here.
Updated
Since $ has special meaning in regular expressions, it cannot be simply added to the dictionary; it needs to be escaped with \ in the regular expression pattern. The simplest way to do this, would be to write the pattern manually, instead of joining the keys to the dictionary:
s = Regex.Replace(s,
"(?:<Message>).*?(&|<|>|\$).*?(?:</Message>)",
Function(match) characterMappings(match.Groups(1).Value)
)
Also, I highly recommend Expresso for working with regular expressions.
Your XML is invalid and hence it is not XML. Either fix code that generates XML (correct approach) or pretend this is text file and enjoy all problems with parsing non-structured text.
As you've stated in the question <Message>O&N</Message> is not valid XML. Most likely reason of such "XML" is using string concatenation to construct it instead of using proper XML manipulation methods. Unless you use some arcane language all practically used languages have built in or library support for XML creation so it should not be to hard to create XML right.
Related
I am posting some XML to an API Gateway method in AWS, which has an integration to SNS. An SQS queue is then subscribed to the topic; and I have a C# process which polls the queue intermittently and needs to deserialize the XML.
The trouble is, the whitespace between the XML tags ends up getting encoded along the line somewhere, so tabs become \t and new lines become \r\n. But these end up as physical tokens inside the string.
Example XML which is posted to API Gateway:
<?xml version="1.0" encoding="utf-8"?>
<ProfileInformation>
<Username>bgs264</Username>
</ProfileInformation>
String which is read off the SQS queue:
<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<ProfileInformation>\n\t<Username>bgs264</Username>\n</ProfileInformation>
Note that the attributes in the declaration end up as \" and the whitespace posted ends up as \t, \r\n, etc.
However these aren't "the strings appearing as such in the debugger, but it's actually a tab", they are actually like this in the string.
So when I try to deserialize, using
using (var reader = new StringReader(message))
var myObj = serializer.Deserialize(reader) as ProfileInformation);
I get:
InvalidOperationException: There is an error in XML document (1, 15).
It refers to the first \ character in the declaration, as in version=\"1.0\"
My immediate idea was to simply string.Replace \t to empty string, etc, but that's unacceptable because it might be valid that the user's username is actually is bgs\t264 and the replace here would cause an inconsistency. In this example, I presume I would get bgs\\t264 in the message, so a replace would leave me, erroneously, with bgs\264 for example.
So I need to fix these \n\t characters where they occur between XML tags.
For what it's worth, I also have a lambda written in Go which has no problem with this and simply deserializes the exact same string straight into XML. So it must be possible.
My intial thoughts:
Can I somehow decode the string before passing it for
deserialization? I tried this with HttpUtility.DecodeHtml but I
don't think it's actually HTML that I'm trying to decode!
Is there a different XML library I can use that would work?
I would guess, and some googling seems to support the theory, that the message you're seeing has been converted to JSON & the escape sequences are as a consequence of that.
The ideal approach would be to investigate and prevent this from happening. I don't know enough about SNS to advise & you indicate this is a non-starter, so the simplest approach would be to reverse this process once you receive the message.
You can use a JSON library like Json.NET to do this:
var jsonString = string.Format("\"{0}\"", message);
var xmlString = JsonConvert.DeserializeObject<string>(jsonString);
using (var reader = new StringReader(xmlString))
{
var profileInformation = (ProfileInformation) serializer.Deserialize(reader);
}
XML snippet:
<field>& is escaped</field>
<field>"also escaped"</field>
<field>is & "not" escaped</field>
<field>is " and is not & escaped</field>
I'm looking for suggestions on how I could go about pre-parsing any XML to escape everything not escaped prior to running the XML through a parser?
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue I'm running into is that running the XML as is into a parser, such as (below) will throw an exception due to the XML being bad due to some of it not being escaped properly
string xml = "<field>& is not escaped</field>";
XmlReader.Create(new StringReader(xml))
I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
This question is helpful as it gives you a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&");
And then you'll be able to parse your XML.
Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.
If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like when there's no definition of &npsp;, then life starts to become rather more difficult.
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.
Currently my understanding of XML legal strings is that all is required is that you convert any instances of: &, ", ', <, > with & " ' < >
So I made the following parser:
private static string ToXmlCompliantStr(string uriStr)
{
string uriXml = uriStr;
uriXml = uriXml.Replace("&", "&");
uriXml = uriXml.Replace("\"", """);
uriXml = uriXml.Replace("'", "'");
uriXml = uriXml.Replace("<", "<");
uriXml = uriXml.Replace(">", ">");
return uriXml;
}
I am aware that there are similar questions out there with good answers (which is how I was able to write this function) I am writing this question to ask if this code will translate ANY string that C# can throw at it and have XDocument parse it as a part of a whole document without any complaints as all the questions out there that I've found state that these are the only escape characters, not that parsing them will cause 100% valid XML string. I've gone as far as reading through the decompiled XNode class trying to see how that parse it.
Thanks
Firstly, you should absolutely not do this yourself. Use an XML API - that way you can trust that to do the right thing, rather than worrying about covering corner cases etc. You generally shouldn't be trying to come up with an "escaped string" at all - you should pass the string to the XElement constructor (or XAttribute, or whatever your situation is).
In other words, I think you should try really hard to design your application so that you don't need a method of the kind you've shown in your question at all. Look at where you'd be using that method, and see whether you can just create an XElement (or whatever) instead. If you try to treat XML as a data structure in itself rather than just as text, you'll have a much better experience in my experience.
Secondly, you need to understand that in XML 1.0 at least, there are Unicode characters that cannot be validly represented in XML, no matter how much escaping you use. In particular, values U+0000 to U+001F are unrepresentable other than U+0009 (tab), U+000A (line feed) and U+000D (carriage return). Also if you have a string which contains invalid UTF-16 (e.g. an unmatched half of a surrogate pair), that can't be correctly represented in XML.
I have a program that generates some data and saves it as an xml, unfortunately for my purposes I cant save it in the newer XML that allows for characters like 0x1f. As a result, I need to eliminate this character from my xml. All I have been able to find that seems to do this is this http://benjchristensen.com/2008/02/07/how-to-strip-invalid-xml-characters/ but I don't know java-script, and would like to be able to use a script that I am able to understand. I do know basic C#, but am not great in it. Anyway, what would be the easiest way to filter this character? I do think this is a good question for the online community anyway as finding a working method in C# from Google proves to be challenging.
From this post: How can you strip non-ASCII characters from a string? (in C#)
Adjusting it for your case:
string s = File.ReadAllText(filepath);
s = Regex.Replace(s, #"[\u0000-\u001F]", string.Empty);
File.WriteAllText(newFilepath, s);
Then test the new file. Don't overwrite the old until you know if this works or not.
i have a string that contains special character like (trademark sign etc). This string is set as an XML node value. But the special character is not rendered properly in XML, shows ??. This is how im using it.
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.htmlEncode(xxxx) but it converts it into "& ;#8482;" so the output of xml is "™"; instead of ™
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.