i have a string that contains special character like (trademark sign etc). This string is set as an XML node value. But the special character is not rendered properly in XML, shows ??. This is how im using it.
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.htmlEncode(xxxx) but it converts it into "& ;#8482;" so the output of xml is "™"; instead of ™
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
I've had issues with some characters using htmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter), these are the methods I typically use.
Related
We receive an xml string from an external API, and one element has a bunch of GT/LT signs.
When we run this code, it fails:
var xml = #"<SomeNode>10040:<->10110:<->10130:<->10150:<->10160:<->10180:<->10330:Value=><->10330:Matching=><->10330:Value2=><->10330:Value3=><->10330:Value4=><->10447:<->10418:No<->10419:No<->10430:No
</SomeNode>";
var doc = new XmlDocument();
doc.LoadXml(xml);
//System.Xml.XmlException: 'Name cannot begin with the '-' character, hexadecimal value 0x2D
I looked into escaping those characters, but as far as I can tell there isn't a way to escape only the ones inside SomeNode.
So I know that I could run some kind of string replacement using a regex or something to clear that out. But, is there an elegant way to solve this using existing XML related tools?
Based on the comments, there isn't an xml tools solution, and so it'll be a custom string replacement solution.
XML snippet:
<field>& is escaped</field>
<field>"also escaped"</field>
<field>is & "not" escaped</field>
<field>is " and is not & escaped</field>
I'm looking for suggestions on how I could go about pre-parsing any XML to escape everything not escaped prior to running the XML through a parser?
I do not have control over the XML being passed to me, they likely won't fix it anytime soon, and I have to find a way to parse it.
The primary issue I'm running into is that running the XML as is into a parser, such as (below) will throw an exception due to the XML being bad due to some of it not being escaped properly
string xml = "<field>& is not escaped</field>";
XmlReader.Create(new StringReader(xml))
I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.
This question is helpful as it gives you a Regex to find these rogue ampersands:
&(?!(?:apos|quot|[gl]t|amp);|#)
And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:
var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&");
And then you'll be able to parse your XML.
Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.
If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.
For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like when there's no definition of &npsp;, then life starts to become rather more difficult.
Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.
One of my element in an xml has a value like
<item name="abc_def>" />
The actual value pulled from the data source is "abc_def!!>". I have no control over this data source and this cannot be changed.
I wanted to know how do I escape these characters when xml serialization is taking place. I have tried a couple of things, but they didnt work.
I tried all methods explained here
What is the correct way to escape these characters ? The end output is an api which our clients hit using their browsers and because of this issue, the xml parsing in browser is breaking.
If all of your strings look like that one, you can do something like this:
string input = "abc_def>";
input = input.Replace("", "!!");
string output = HttpUtility.HtmlDecode(input);
You need to use:
System.Net.WebUtility.HtmlEncode(stringToEncode);
Of course when you later decode that you use:
System.Net.WebUtility.HtmlDecode(stringToDecode);
This is for UWP, namespace may vary depending on what framework you use.
I have some VB.Net code which is parsing an XML string.
The XML String comes from a TCP 3rd Party stream and as such we have to take the data we get and deal with it.
The issue we have is that one of the elements data can sometimes contain special characters e.g. &, $ , < and thus when the “XMLDoc.LoadXml(XML)” is executed it fails - note XMLDoc is configured as "Dim XMLDoc As XmlDocument = New XmlDocument()".
Have tried to Google answers for this but I am really struggling to find a solution. Have looked at a RegEX but realised this has some limitations; or I just dont understand it enough lol.
If it helps here is an example of XLM we would have streamed to us (just for info the message tag comes from an SMS message):-
(if it helps the only bit that will ever have an error is (and all I have to check) the <Message>O&N</Message> section so in this case the message has come in with an &)
<IncomingMessage><DeviceSendTime>19/02/2013 14:00:50</DeviceSendTime>
<Sender>0000111111</Sender>
<Status>New</Status>
<Transport>Sms</Transport>
<Id>-1</Id>
<Message>O&N</Message>
<Timestamp>19/02/2013 14:00:50</Timestamp>
<ReadTimestamp>19/02/2013 14:00:50</ReadTimestamp>
</IncomingMessage>
If we're looking specifically within Message elements, and assuming there are no nested elements within the Message element:
Dim url = "put url here"
Dim s As String
Dim characterMappings = New Dictionary(Of String, String) From {
{"&", "&"},
{"<", "<"},
{">", ">"},
{"""", """}
}
Using client As New WebClient
s = client.DownloadString(url)
End Using
s = Regex.Replace(s,
"(?:<Message>).*?(" & String.Join("|", characterMappings.Keys) & ").*?(?:</Message>)",
Function(match) characterMappings(match.Groups(1).Value)
)
Dim x = XDocument.Parse(s)
$ should not be an issue with XML, but if it is you can add it to the dictionary.
Use of WebClient comes from here.
Updated
Since $ has special meaning in regular expressions, it cannot be simply added to the dictionary; it needs to be escaped with \ in the regular expression pattern. The simplest way to do this, would be to write the pattern manually, instead of joining the keys to the dictionary:
s = Regex.Replace(s,
"(?:<Message>).*?(&|<|>|\$).*?(?:</Message>)",
Function(match) characterMappings(match.Groups(1).Value)
)
Also, I highly recommend Expresso for working with regular expressions.
Your XML is invalid and hence it is not XML. Either fix code that generates XML (correct approach) or pretend this is text file and enjoy all problems with parsing non-structured text.
As you've stated in the question <Message>O&N</Message> is not valid XML. Most likely reason of such "XML" is using string concatenation to construct it instead of using proper XML manipulation methods. Unless you use some arcane language all practically used languages have built in or library support for XML creation so it should not be to hard to create XML right.
While loading XML file in a C# application, I am getting
Name cannot begin with the '1' character, hexadecimal value 0x31.
Line 2, position 2.
The XML tag begins like this.
<version="1.0" encoding="us-ascii" standalone="yes" />
<1212041205115912>
I am not supposed to change this tag at any cost.
How can I resolve this?
You are supposed to change the tag name since the one you wrote violates the xml standard.
Just to remember the interesting portion of it here:
XML Naming Rules
XML elements MUST follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used, no words are reserved.
as a suggestion to solve your problem mantaining the standard:
Use an attribute, ie <Number value="1212041205115912"/>
Add a prefix to the tag ie <_1212041205115912/>
Of course you can mantain the structure you propose by writing your own format parser, but I can state it would be a really bad idea, because in the future someone would probably extend the format and would not be happy to see that the file that seems xml it is actually not, and he/she can get angry for that. Furthermore, if you want your custom format, use something simpler, I mean: messing a text file with some '<' and '>' does not add any value if it is not an officially recognized format, it is better to use someting like a simple plain text file instead.
IF you absolutely cant change it, eg. for some reason the format is already out in the wild and used by other systems/customers/whatever.
Since it is an invalid xml document, try to clean it up before parsing it.
eg. make a regex that replaces all < number> tags with < IMessedUp>number< /IMessedUp> and then parse it.
Sort of iffy way to do it, but I will solve your problem.
If you need to process this document, then stop thinking of it as XML, and cast aside any thoughts of using XML tools to process it. You're dealing with a proprietary format and you will need to write your own tools to handle it. If you want the benefits of using XML technology, you will have to redesign your documents so they are valid XML.