I have an XML document that contains data with a < character.
<Tunings>
<Notes>Norm <150 mg/dl</Notes>
</Tunings>
The code I am using is:
StreamReader objReader = new StreamReader(strFile);
string strData = objReader.ReadToEnd();
XmlDocument doc = new XmlDocument();
// Here I want to strip those characters from "strData"
doc.LoadXml(strData);
It gives this error:
Name cannot begin with the '1' character, hexadecimal value 0x31.
Is there a way to strip those characters from the XML before calling LoadXml?
If this is only occurring in the <Notes> section, I'd recommend you modify the creation of the XML file to use a CDATA tag to contain the text in Notes, like this:
<Notes><![CDATA[Norm <150 mg/dl]]></Notes>
The CDATA tag tells XML parsers to not parse the characters between the <![CDATA[ and ]]>. This allows you to have characters in your XML that would otherwise break the parsing.
You can use the CDATA tag for any situation where you know (or have reasonable expectations) of special characters in that data.
Trying to handle special characters at parsing time (without the CDATA) will be more labor intensive (and frustrating) than simply fixing the creation of the XML in the first place, IMO. Plus, "Norm <150 mg/dl" is not the same thing as "Norm 150 mg/dl", and that distinction might be important for whoever needs that information.
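As a minimal sketch of what this looks like on the producing side, assuming the file is generated with LINQ to XML (the question does not say how it is created), XCData writes the CDATA section for you and the text survives a round trip unchanged:

```csharp
using System;
using System.Xml.Linq;

// Assumption: the producer builds the document with LINQ to XML.
var notesElement = new XElement("Notes", new XCData("Norm <150 mg/dl"));

// Serializing emits the CDATA markers around the raw text.
string serialized = notesElement.ToString();
Console.WriteLine(serialized); // <Notes><![CDATA[Norm <150 mg/dl]]></Notes>

// Parsing it back yields the original text, including the '<'.
XElement reparsed = XElement.Parse(serialized);
Console.WriteLine(reparsed.Value); // Norm <150 mg/dl
```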
As the comments state, you do not have an XML document. If you know that the only way these documents deviate from legal XML is as in your example, you could run the file through a regular expression and replace <(?=\d) with &lt;. This finds each < that is immediately followed by a digit and encodes it properly; the lookahead leaves the digit itself untouched.
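A rough sketch of that replacement, assuming the only defect is a raw < directly in front of a digit (the sample string is taken from the question, collapsed onto one line):

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Sample input from the question: not well-formed because of the raw '<'.
string strData = "<Tunings><Notes>Norm <150 mg/dl</Notes></Tunings>";

// Replace only a '<' that is immediately followed by a digit. The lookahead
// (?=\d) checks for the digit without consuming it.
string repaired = Regex.Replace(strData, @"<(?=\d)", "&lt;");

XmlDocument doc = new XmlDocument();
doc.LoadXml(repaired);

// The parser decodes the entity again, so the original text is preserved.
string notes = doc.SelectSingleNode("/Tunings/Notes").InnerText;
Console.WriteLine(notes); // Norm <150 mg/dl
```

Nothing is stripped here, so the "<150" meaning is kept. Any other malformed construct in the input would still make LoadXml throw.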
We receive an xml string from an external API, and one element has a bunch of GT/LT signs.
When we run this code, it fails:
var xml = @"<SomeNode>10040:<->10110:<->10130:<->10150:<->10160:<->10180:<->10330:Value=><->10330:Matching=><->10330:Value2=><->10330:Value3=><->10330:Value4=><->10447:<->10418:No<->10419:No<->10430:No
</SomeNode>";
var doc = new XmlDocument();
doc.LoadXml(xml);
//System.Xml.XmlException: 'Name cannot begin with the '-' character, hexadecimal value 0x2D
I looked into escaping those characters, but as far as I can tell there isn't a way to escape only the ones inside SomeNode.
So I know that I could run some kind of string replacement using a regex or something to clear that out. But, is there an elegant way to solve this using existing XML related tools?
Based on the comments, there isn't an xml tools solution, and so it'll be a custom string replacement solution.
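One possible shape for such a string replacement, as a sketch only: it assumes the element is always written exactly as <SomeNode>...</SomeNode> with no attributes, and that the payload never contains already-escaped entities (they would be escaped a second time). The input is a shortened version of the string in the question.

```csharp
using System;
using System.Security;
using System.Text.RegularExpressions;
using System.Xml;

// Shortened sample of the payload from the question.
string raw = "<SomeNode>10040:<->10110:<->10330:Value=><->10418:No</SomeNode>";

// Capture everything between the known open and close tags and escape it.
// SecurityElement.Escape replaces <, >, &, " and ' with XML entities.
string repaired = Regex.Replace(
    raw,
    @"<SomeNode>(.*?)</SomeNode>",
    m => "<SomeNode>" + SecurityElement.Escape(m.Groups[1].Value) + "</SomeNode>",
    RegexOptions.Singleline);

var doc = new XmlDocument();
doc.LoadXml(repaired);

// The parser decodes the entities, giving back the original payload.
string payload = doc.DocumentElement.InnerText;
Console.WriteLine(payload); // 10040:<->10110:<->10330:Value=><->10418:No
```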
I have a string like this in C#:
string sxml="<id>xpto</id>";
and I want to save an xml file using XLinq:
XElement cursosXML = new XElement("MenuItem", new XCData(sxml));
cursosXML.Save("C:\\xpto.xml");
when I read my xml file this appears:
<MenuItem><![CDATA[<id>xpto</id>]]></MenuItem>
but I do not want <![CDATA[
How can I get this result?
<MenuItem><id>xpto</id></MenuItem>
Here's how you should add the string to your element
XElement sxml = XElement.Parse("<id>xpto</id>");
XElement cursosXML = new XElement("MenuItem", sxml);
cursosXML.Save("C:\\xpto.xml");
Below is an explanation of the trouble you've been having. It's just here for your reference.
<id>xpto</id> is an XML element, but you're adding it to your current element as though it were a literal string. When you do this, the computer doesn't think you're adding a new child element to your main element, so it escapes it (explained later).
The code I posted calls XElement.Parse(string), which will take your serialized XML string and try to generate a valid XElement from it.
What's happening is that &lt; is an escape sequence for < and &gt; is an escape sequence for >.
The reason why your XML writer escapes < and > is that those characters have special meaning, and including them unescaped in your XML changes the structure of the XML.
It's kinda similar to how \n is an escape sequence for the newline character, and how \\ is an escape sequence for \.
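To make the difference concrete, here is a small sketch comparing the two ways of adding the string (variable names other than sxml are just for illustration):

```csharp
using System;
using System.Xml.Linq;

string sxml = "<id>xpto</id>";

// Added as a string: the content is treated as text, so it is escaped on output.
var asText = new XElement("MenuItem", sxml);
string escaped = asText.ToString();
Console.WriteLine(escaped);      // <MenuItem>&lt;id&gt;xpto&lt;/id&gt;</MenuItem>

// Reading the value back undoes the escaping.
Console.WriteLine(asText.Value); // <id>xpto</id>

// Parsed first: the content becomes a real child element, so nothing is escaped.
var asChild = new XElement("MenuItem", XElement.Parse(sxml));
string nested = asChild.ToString(SaveOptions.DisableFormatting);
Console.WriteLine(nested);       // <MenuItem><id>xpto</id></MenuItem>
```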
I'm reading XML data from a varchar column in a SQL db, into a linq to sql XElement belonging to an XDocument.
When I execute the XDocument.Save method, the XML is written to file but includes the escape characters. For example, ">" is changed to "&gt;".
Is there an easy way to prevent this?
First, there seems to be no reason to prevent it. Like kenny mentioned, unless special characters are XML encoded, no parser would be able to parse produced XML (because '<' or '>' characters means a lot for that parser). Second, when your parser decodes XML (e.g. you call XElement.Value), all special characters will be converted back to what they originally were. Finally, if you want to keep the original string (e.g. for purposes other than XML parsing), you can use CDATA, which in case of Linq2XML is represented by XCData class.
EDIT: As Rob pointed out, I might have gotten it wrong. If the point is to add existing XML to a document without the special characters being escaped, use the following code:
XDocument document = new XDocument();
var xmlFromDb = "<xml>content</xml>";
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xmlFromDb)))
{
using (var reader = XmlReader.Create(stream)) {
reader.MoveToContent();
document.Add(XElement.ReadFrom(reader));
}
}
UPDATE: The invalid characters are actually in the attributes instead of the elements, this will prevent me from using the CDATA solution as suggested below.
In my application I receive the following XML as a string. There are two problems that prevent it from being accepted as valid XML.
I hope someone has a solution for fixing these problems gracefully.
There are ASCII characters in the XML that aren't allowed. Not only the one displayed in the example but I would like to replace all the ASCII code with their corresponding characters.
Within an element the '<' exists - I would like to remove all these entire 'inner elements' (<L CODE="C01">WWW.cars.com</L>) from the XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<cars>
<car model="ford" description="Argentinië love this"/>
<car model="kia" description="a small family car"/>
<car model="opel" description="great car <L CODE=&quot;C01&quot;>WWW.cars.com</L>"/>
</cars>
For a quick fix, you could load this not-XML into a string, and add CDATA markers inside any XML tags that you know usually tend to contain invalid data. For example, if you only ever see bad data inside <description> tags, you could do:
var soCalledXml = ...;
var xml = soCalledXml
.Replace("<description>", "<description><![CDATA[")
.Replace("</description>", "]]></description>");
This would turn the tag into this:
<description><![CDATA[great car <L CODE="C01">WWW.cars.com</L>]]></description>
which you could then process successfully -- it would be a <description> tag that contains the simple string great car <L CODE="C01">WWW.cars.com</L>.
If the <description> tag could ever have any attributes, then this kind of string replacement would be fraught with problems. But if you can count on the open tag to always be exactly the string <description> with no attributes and no extra whitespace inside the tag, and if you can count on the close tag to always be </description> with no whitespace before the >, then this should get you by until you can convince whoever is producing your crap input that they need to produce well-formed XML.
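The same wrapping can also be sketched with a regular expression, under the same assumptions (the open tag is always exactly <description>, the close tag exactly </description>). The <car> wrapper element in the sample is made up for illustration, and content that itself contains ]]> would still break the result.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical input: a <description> element holding unescaped markup.
string soCalledXml =
    "<car><description>great car <L CODE=\"C01\">WWW.cars.com</L></description></car>";

// Wrap whatever sits between the tags in a CDATA section.
// Singleline lets '.' match newlines in case the content spans lines.
string xml = Regex.Replace(
    soCalledXml,
    "<description>(.*?)</description>",
    "<description><![CDATA[$1]]></description>",
    RegexOptions.Singleline);

XElement car = XElement.Parse(xml);
string description = car.Element("description").Value;
Console.WriteLine(description); // great car <L CODE="C01">WWW.cars.com</L>
```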
Update
Since the malformed data is inside an attribute, CDATA won't work. But you could use a regular expression to find everything inside those quote characters, and then do string manipulation to properly escape the <s and >s. They're at least escaping embedded quotes, so a regex to go from " to " would work.
Keep in mind that it's generally a bad idea to use regexes on XML. Of course, what you're getting isn't actually XML, but it's still hard to get right for all the same reasons. So expect this to be brittle -- it'll work for your sample input, but it may break when they send you the next file, especially if they don't escape & properly. Your best bet is still to convince them to give you well-formed XML.
using System.Text.RegularExpressions;
var soCalledXml = ...;
var xml = Regex.Replace(soCalledXml, "description=\"[^\"]*\"",
match => match.Value.Replace("<", "&lt;").Replace(">", "&gt;"));
You could wrap that content in a CDATA section.
With regex it will be something like this, match
"<description>(.*?)</description>"
and replace with
"<description><![CDATA[$1]]></description>"
I have a string that contains special characters (a trademark sign, etc.). This string is set as an XML node value, but the special character is not rendered properly in the XML; it shows up as ??. This is how I'm using it:
String str=xxxx; //special character string
XmlNode node = new XmlNode();
node.InnerText = xxxx;
I tried HttpUtility.HtmlEncode(xxxx), but it converts the character into "&#8482;", so the XML output is "&amp;#8482;" instead of ™.
I have also tried XmlConvert.ToString() and XmlConvert.EncodeName but it gives ??
I strongly suspect that the problem is how you're viewing the XML. Have you made sure that whatever you're viewing it in is using the right encoding?
If you save the XML and then reload it and fetch the inner text as a string, does it have the right value? If so, where's the problem?
You shouldn't perform extra encoding yourself - let the XML APIs do their job.
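One way to check this, as a sketch: write the document with an explicit UTF-8 encoding, load it back, and compare the text. The element name "name" and the sample value are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical value containing a trademark sign (U+2122).
string str = "Acme\u2122";

var doc = new XmlDocument();
XmlElement node = doc.CreateElement("name");
node.InnerText = str; // no manual encoding; the XML API handles it
doc.AppendChild(node);

// Write the document as UTF-8 so the declaration and the bytes agree.
var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
using var stream = new MemoryStream();
using (XmlWriter writer = XmlWriter.Create(stream, settings))
{
    doc.Save(writer);
}

// Load the bytes back and compare with the original string.
stream.Position = 0;
var reloaded = new XmlDocument();
reloaded.Load(stream);
bool same = reloaded.DocumentElement.InnerText == str;
Console.WriteLine(same); // True
```

If the round trip succeeds but the file still shows ?? or garbage in a viewer, the viewer is most likely reading it with the wrong encoding.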
I've had issues with some characters using HtmlEncode() before, as well. Here's a good example of different ways to write your XML: Different Ways to Escape an XML String in C#. Check out #3 (System.Security.SecurityElement.Escape()) and #4 (System.Xml.XmlTextWriter); these are the methods I typically use.