special chars in XML - c#

I want to parse the following XML
XmlElement costCenterElement2 = doc.CreateElement("CostCenter");
costCenterElement2.InnerXml =
"<CostCenterNumber>2</CostCenterNumber> <CostCenter>" +
"G&A: Fin & Acctng" +
"</CostCenter>";
but I get an XmlException:
An error occurred while parsing EntityName.

Yeah - a bare & is not valid in XML and needs to be escaped as &amp;.
The other special characters and their escapes:
< - &lt;
> - &gt;
" - &quot;
' - &apos;
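If the text is held separately from the markup, the escapes listed above do not have to be applied by hand. A minimal sketch, assuming the System.Security.SecurityElement class is available in your target framework; the EscapeDemo and EscapeText names are made up for illustration:

```csharp
using System;
using System.Security;

static class EscapeDemo
{
    // Hypothetical helper: escapes <, >, ", ' and & in plain text.
    // It must only be given text content, never markup or already-escaped text.
    public static string EscapeText(string text)
    {
        return SecurityElement.Escape(text);
    }

    static void Main()
    {
        Console.WriteLine(EscapeText("G&A: Fin & Acctng")); // G&amp;A: Fin &amp; Acctng
    }
}
```

Running it on a string that is already escaped would escape it a second time, so it is only safe on raw text.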
The following should work:
XmlElement costCenterElement2 = doc.CreateElement("CostCenter");
costCenterElement2.InnerXml =
"<CostCenterNumber>2</CostCenterNumber> <CostCenter>" +
"G&amp;A: Fin &amp; Acctng" +
"</CostCenter>";
However, you really should be creating the CostCenterNumber and CostCenter as elements and not as InnerXml.

private string SanitizeXml(string source)
{
if (string.IsNullOrEmpty(source))
{
return source;
}
if (source.IndexOf('&') < 0)
{
return source;
}
StringBuilder result = new StringBuilder(source);
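// Protect entities that are already escaped by swapping them for placeholders
result = result.Replace("&lt;", "<>lt;")
.Replace("&gt;", "<>gt;")
.Replace("&amp;", "<>amp;")
.Replace("&apos;", "<>apos;")
.Replace("&quot;", "<>quot;");
// Escape every remaining bare ampersand
result = result.Replace("&", "&amp;");
// Restore the protected entities
result = result.Replace("<>lt;", "&lt;")
.Replace("<>gt;", "&gt;")
.Replace("<>amp;", "&amp;")
.Replace("<>apos;", "&apos;")
.Replace("<>quot;", "&quot;");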
return result.ToString();
}

Updated:
@thabet, if the string "<CostCenterNumber>...G&A: Fin & Acctng</CostCenter>" is coming in as a parameter, and it's supposed to represent XML to be parsed, then it has to be well-formed XML to start with. In the example you gave, it isn't. & signals the start of an entity reference, is followed by an entity name, and is terminated by ;, which never appears in the string above.
If you are given that whole string as a parameter, some of which is markup that must be parsed (i.e. the start/end tags), and some of which may contain markup that should not be parsed (i.e. the &), there is no clean and reliable way to "escape" the latter and not escape the former. You could replace all & characters with &amp;, but in doing so you might accidentally turn &#160; into &amp;#160; and your resulting content would be wrong. If this is your situation, that you are receiving input "XML" where markup is mixed with unparseable text, the best recourse is to tell the person from whom you are getting the XML that it's not well-formed and they need to fix their output. There are ways for them to do that that are not difficult with standard XML tools.
If on the other hand you have
<CostCenterNumber>2</CostCenterNumber>
<CostCenter>...</CostCenter>
separately from the passed string, and you need to plug in the passed string as the text content of the child <CostCenter>, and you know it is not to be parsed (does not contain elements), then you can do this:
create <CostCenterNumber> and <CostCenter> as elements
make them children of the parent <CostCenter>
set CostCenterNumber's text content using InnerXml, assuming there is no risk of markup in there: eltCCN.InnerXml = "2";
create for the child CostCenter element a Text node child whose value is the passed string: textCC = doc.CreateTextNode(argStr);
assign that text node as a child of the child CostCenter element: eltCC.AppendChild(textCC);
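The steps above can be sketched as follows. This is only an illustration: the CostCenterBuilder class and its Build method are hypothetical names, and the number is also added as a text node rather than through InnerXml so that neither value is ever parsed as markup.

```csharp
using System;
using System.Xml;

static class CostCenterBuilder
{
    // Hypothetical helper: builds the parent <CostCenter> with both children.
    public static XmlElement Build(XmlDocument doc, string number, string name)
    {
        XmlElement parent = doc.CreateElement("CostCenter");

        XmlElement eltCCN = doc.CreateElement("CostCenterNumber");
        eltCCN.AppendChild(doc.CreateTextNode(number));

        XmlElement eltCC = doc.CreateElement("CostCenter");
        // The text node is escaped on serialization, so a raw & is fine here.
        eltCC.AppendChild(doc.CreateTextNode(name));

        parent.AppendChild(eltCCN);
        parent.AppendChild(eltCC);
        return parent;
    }

    static void Main()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement costCenter = Build(doc, "2", "G&A: Fin & Acctng");
        Console.WriteLine(costCenter.OuterXml);
    }
}
```

The serialized output contains G&amp;A: Fin &amp; Acctng, while InnerText on the child element still returns the original string.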

Related

C# XmlReader reads XML wrong and different based on how I invoke the reader's methods

So my current understanding of how the C# XmlReader works is that it takes a given XML File and reads it node-by-node when I wrap it in a following construct:
using System.Xml;
using System;
using System.Diagnostics;
...
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreComments = true;
settings.IgnoreWhitespace = true;
settings.IgnoreProcessingInstructions = true;
using (XmlReader reader = XmlReader.Create(path, settings))
{
while (reader.Read())
{
// All reader methods I call here will reference the current node
// until I move the pointer to some further node by calling methods like
// reader.Read(), reader.MoveToContent(), reader.MoveToElement() etc
}
}
Why will the following two snippets (within the above construct) produce two very different results, even though they both call the same methods?
I used this example file for testing.
Debug.WriteLine(new string(' ', reader.Depth * 2) + "<" + reader.NodeType.ToString() + "|" + reader.Name + ">" + reader.ReadString() + "</>");
(Snippet 1)
vs
(Snippet 2)
string xmlcontent = reader.ReadString();
string xmlname = reader.Name.ToString();
string xmltype = reader.NodeType.ToString();
int xmldepth = reader.Depth;
Debug.WriteLine(new string(' ', xmldepth * 2) + "<" + xmltype + "|" + xmlname + ">" + xmlcontent + "</>");
Output of Snippet 1:
<XmlDeclaration|xml></>
<Element|rss></>
<Element|head></>
<Text|>Test Xml File</>
<Element|description>This will test my xml reader</>
<EndElement|head></>
<Element|body></>
<Element|g:id>1QBX23</>
<Element|g:title>Example Title</>
<Element|g:description>Example Description</>
<EndElement|item></>
<Element|item></>
<Text|>2QXB32</>
<Element|g:title>Example Title</>
<Element|g:description>Example Description</>
<EndElement|item></>
<EndElement|body></>
<EndElement|xml></>
<EndElement|rss></>
Yes, this is formatted as it was in my output window. As can be seen, it skipped certain elements and output a wrong depth for a few others. The NodeTypes are correct, however, unlike Snippet 2, which outputs:
<XmlDeclaration|xml></>
<Element|xml></>
<Element|title></>
<EndElement|title>Test Xml File</>
<EndElement|description>This will test my xml reader</>
<EndElement|head></>
<Element|item></>
<EndElement|g:id>1QBX23</>
<EndElement|g:title>Example Title</>
<EndElement|g:description>Example Description</>
<EndElement|item></>
<Element|g:id></>
<EndElement|g:id>2QXB32</>
<EndElement|g:title>Example Title</>
<EndElement|g:description>Example Description</>
<EndElement|item></>
<EndElement|body></>
<EndElement|xml></>
<EndElement|rss></>
Once again, the depth is messed up, but it's not as critical as with Snippet Number 1. It also skipped some elements and assigned wrong NodeTypes.
Why can't it output the expected result? And why do these two snippets produce two totally different outputs with different depths, NodeTypes and skipped nodes?
I'd appreciate any help on this. I searched a lot for any answers on this but it seems like I'm the only one experiencing these issues. I'm using the .NET Framework 4.6.2 with Asp.net Web Forms in Visual Studio 2017.
Firstly, you are using a method XmlReader.ReadString() that is deprecated:
XmlReader.ReadString Method
... reads the contents of an element or text node as a string. However, we recommend that you use the ReadElementContentAsString method instead, because it provides a more straightforward way to handle this operation.
However, beyond warning us off the method, the documentation doesn't precisely specify what it actually does. To determine that, we need to go to the reference source:
public virtual string ReadString() {
if (this.ReadState != ReadState.Interactive) {
return string.Empty;
}
this.MoveToElement();
if (this.NodeType == XmlNodeType.Element) {
if (this.IsEmptyElement) {
return string.Empty;
}
else if (!this.Read()) {
throw new InvalidOperationException(Res.GetString(Res.Xml_InvalidOperation));
}
if (this.NodeType == XmlNodeType.EndElement) {
return string.Empty;
}
}
string result = string.Empty;
while (IsTextualNode(this.NodeType)) {
result += this.Value;
if (!this.Read()) {
break;
}
}
return result;
}
This method does the following:
If the current node is an empty element node, return an empty string.
If the current node is an element that is not empty, advance the reader.
If the now-current node is the end of the element, return an empty string.
While the current node is a text node, add the text to a string and advance the reader. As soon as the current node is not a text node, return the accumulated string.
Thus we can see that this method is designed to advance the reader. We can also see that, given mixed-content XML like <head>text <b>BOLD</b> more text</head>, ReadString() will only partially read the <head> element, leaving the reader positioned on <b>. This oddity is likely why Microsoft deprecated the method.
We can also see why your two snippets function differently. In the first, you get reader.Depth and reader.NodeType before calling ReadString() and advancing the reader. In the second you get these properties after advancing the reader.
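The side effect is easy to observe on the mixed-content example. A small sketch (the ReadStringDemo class and Trace method are names invented here) that records the reader position before and after the call:

```csharp
using System;
using System.IO;
using System.Xml;

static class ReadStringDemo
{
    // Returns: position before ReadString(), the text it read, position after.
    public static string[] Trace(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();                     // now on the root element
            string before = reader.NodeType + ":" + reader.Name;
            string text = reader.ReadString();          // advances the reader
            string after = reader.NodeType + ":" + reader.Name;
            return new[] { before, text, after };
        }
    }

    static void Main()
    {
        foreach (string s in Trace("<head>text <b>BOLD</b> more text</head>"))
            Console.WriteLine("[" + s + "]");
        // [Element:head]
        // [text ]
        // [Element:b]
    }
}
```

Any property read after ReadString() therefore describes a different node than the one the text came from, which is exactly the difference between the two snippets.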
Since your intent is to iterate through the nodes and get the value of each, rather than ReadString() or ReadElementContentAsString() you should just use XmlReader.Value:
gets the text value of the current node.
Thus your corrected code should look like:
string xmlcontent = reader.Value;
string xmlname = reader.Name.ToString();
string xmltype = reader.NodeType.ToString();
int xmldepth = reader.Depth;
Console.WriteLine(new string(' ', xmldepth * 2) + "<" + xmltype + "|" + xmlname + ">" + xmlcontent + "</>");
XmlReader is tricky to work with. You always need to check the documentation to determine exactly where a given method positions the reader. For instance, XmlReader.ReadElementContentAsString() moves the reader past the end of the element, whereas XmlReader.ReadSubtree() moves the reader to the end of the element. But as a general rule any method named Read is going to advance the reader, so you need to be careful using a Read method inside an outer while (reader.Read()) loop.
Demo fiddle here.
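As an illustration of that last rule, one common way to avoid advancing twice is to call Read() only when no other method has already moved the reader. This is a sketch, not the only possible loop shape; the ItemReader class and the element name "item" are assumptions for the example:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class ItemReader
{
    // Collects the text of every <item> element without skipping adjacent ones.
    public static List<string> ReadItems(string xml)
    {
        var values = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                {
                    // Already moves past </item>, so do not call Read() as well.
                    values.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return values;
    }

    static void Main()
    {
        foreach (string v in ReadItems("<r><item>a</item><item>b</item></r>"))
            Console.WriteLine(v);
    }
}
```

With a plain while (reader.Read()) loop around the same call, the second <item> in this input would be skipped, because the loop's Read() steps over the node that ReadElementContentAsString() had just landed on.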

How to correctly encode & in xml?

I'm web-requesting an XML document. XDocument.Load(stream) throws an exception because the XML contains a bare &, and the parser therefore expects a terminating ; as in &amp;.
I read the stream into a string and replaced & with &amp;, but that broke all the other correctly encoded special characters such as &#248; (ø).
Is there a simple way to encode all disallowed chars in the string before parsing to XDocument?
Try CDATA Sections in xml
A CDATA section can only be used in places where you could have a text node.
<foo><![CDATA[Here is some data including < , > or & etc) ]]></foo>
This kind of method is not encouraged! The reason lies in your question
(replacing & by &amp; turns &gt; into &amp;gt;).
A better suggestion than using a regex is to modify the source code that generates such unencoded XML.
I have come across (.NET) code that uses string concatenation to build XML! (One should use the XML DOM instead.)
If you have access to the source code, then better go ahead and fix it there, because repairing such half-encoded XML cannot be guaranteed to be perfect!
@espvar,
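When the producing code can be changed, a CDATA section can be created through the DOM instead of by string concatenation. A minimal sketch; the CDataDemo class and Wrap method are hypothetical, and it assumes the raw text never contains the sequence ]]>:

```csharp
using System;
using System.Xml;

static class CDataDemo
{
    // Wraps raw text in a CDATA section under a single root element.
    public static string Wrap(string elementName, string rawText)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement root = doc.CreateElement(elementName);
        root.AppendChild(doc.CreateCDataSection(rawText));
        doc.AppendChild(root);
        return doc.OuterXml;
    }

    static void Main()
    {
        Console.WriteLine(Wrap("foo", "a < b & c"));
        // <foo><![CDATA[a < b & c]]></foo>
    }
}
```

A plain text node (doc.CreateTextNode) works just as well; the serializer then writes &lt; and &amp; instead of a CDATA section.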
This is an input XML:
<root><child>nospecialchars</child><specialchild>data&data</specialchild><specialchild2>You.. & I in this beautiful world</specialchild2>data&</root>
And the Main function:
string EncodedXML = encodeWithCDATA(XMLInput); //Calling our Custom function
XmlDocument xdDoc = new XmlDocument();
xdDoc.LoadXml(EncodedXML); //passed
The function encodeWithCDATA():
private string encodeWithCDATA(string stringXML)
{
if (stringXML.IndexOf('&') != -1)
{
int indexofClosingtag = stringXML.Substring(0, stringXML.IndexOf('&')).LastIndexOf('>');
int indexofNextOpeningtag = stringXML.Substring(indexofClosingtag).IndexOf('<');
// Start one character after the '>' so the tag delimiter does not end up inside the CDATA section
string CDATAsection = string.Concat("<![CDATA[", stringXML.Substring(indexofClosingtag + 1, indexofNextOpeningtag - 1), "]]>");
string encodedLeftPart = string.Concat(stringXML.Substring(0, indexofClosingtag+1), CDATAsection);
string UncodedRightPart = stringXML.Substring(indexofClosingtag+indexofNextOpeningtag);
return (string.Concat(encodedLeftPart, encodeWithCDATA(UncodedRightPart)));
}
else
{
return (stringXML);
}
}
Encoded XML (ie, xdDoc.OuterXml):
<root>
<child>nospecialchars</child>
<specialchild>
<![CDATA[data&data]]>
</specialchild>
<specialchild2>
<![CDATA[You.. & I in this beautiful world]]>
</specialchild2>
<![CDATA[data&]]>
</root>
All I have used is Substring, IndexOf, string.Concat and a recursive function call. Let me know if you don't understand any part of the code.
The sample XML that I have provided has data directly in the parent node as well, which is the kind of mixed content you see in HTML, e.g. <div>this is <b>bold</b> text</div>, and my code takes care of encoding data outside the <b> tag if it contains the special character &.
Please note that I have taken care of encoding '&' only; the data cannot contain characters like '<' or '>' or single or double quotes.

Remove self-closing tags (e.g. />) in an XmlDocument

In an XmlDocument, either when writing it or when modifying it later, is it possible to remove the self-closing tags (i.e. />) for a certain element?
For example: change
<img /> or <img></img> to <img>.
<br /> to <br>.
Why you ask? I'm trying to conform to the HTML for Word 2007 schema; the resulting HTML will be displayed in Microsoft Outlook 2007 or later.
After reading another StackOverflow question, I tried setting the IsEmpty property to false like so.
var imgElements = finalHtmlDoc.SelectNodes("//*[local-name()=\"img\"]").OfType<XmlElement>();
foreach (var element in imgElements)
{
element.IsEmpty = false;
}
However, that resulted in <img /> becoming <img></img>. Also, as a hack, I tried changing the OuterXml property directly, but that doesn't work (I didn't expect it to).
Question
Can you remove the self-closing tags from an XmlDocument? I honestly do not think you can, as it would then be invalid XML (no closing tag), but I thought I would throw the question out to the community.
Update:
I ended up fixing the HTML string after exporting from the XmlDocument using a regular expression (written in the wonderful RegexBuddy).
var fixHtmlRegex = new Regex("<(?<tag>meta|img|br)(?<attributes>.*?)/>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
return fixHtmlRegex.Replace(htmlStringBuilder.ToString(), "<$1$2>");
It cleared many errors from the validation pass and allowed me to focus on the real compatibility problems.
You're right: it's not possible simply because it's invalid (or rather, not well-formed) XML. Empty elements in XML must be closed, be it with the shortcut syntax /> or with an immediate closing tag.
HTML was defined as an application of SGML, and XML as a restricted subset of it. While HTML and SGML allow unclosed tags like <br>, XML does not.
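The two forms that XmlDocument can actually produce for an empty element are easy to check. A small sketch (EmptyElementDemo and Serialize are illustrative names) that toggles IsEmpty on a <br> element:

```csharp
using System;
using System.Xml;

static class EmptyElementDemo
{
    // Serializes <p><br/></p> with the <br> element's IsEmpty flag set as requested.
    public static string Serialize(bool isEmpty)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<p><br/></p>");
        XmlElement br = (XmlElement)doc.DocumentElement.FirstChild;
        br.IsEmpty = isEmpty;
        return doc.OuterXml;
    }

    static void Main()
    {
        Console.WriteLine(Serialize(true));  // <p><br /></p>
        Console.WriteLine(Serialize(false)); // <p><br></br></p>
    }
}
```

Neither setting yields a bare <br>, so any HTML-style output has to be produced after serialization, for example with a string or regex pass over the result.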
A bit embarrassed by my answer, but it worked for what I needed. After you have a complete xml document you can string manipulate it to clean it up...
private string RemoveSelfClosingTags(string xml)
{
char[] seperators = { ' ', '\t', '\r', '\n', '/' }; // '/' ends the name of tags written without a space, e.g. <br/>
int prevIndex = -1;
while (xml.Contains("/>"))
{
int selfCloseIndex = xml.IndexOf("/>");
if (prevIndex == selfCloseIndex)
return xml; // we are in a loop...
prevIndex = selfCloseIndex;
int tagStartIndex = -1;
string tag = "";
// search backwards from the "/>" for the '<' that opens this tag
tagStartIndex = xml.LastIndexOf('<', selfCloseIndex);
int tagEndIndex = xml.IndexOfAny(seperators, tagStartIndex);
int tagLength = tagEndIndex - tagStartIndex;
tag = xml.Substring(tagStartIndex + 1, tagLength - 1);
xml = xml.Substring(0, selfCloseIndex) + "></" + tag + ">" + xml.Substring(selfCloseIndex + 2);
}
return xml;
}
<img> would not be valid XML, so no, you can't do this.

XmlDocument throwing "An error occurred while parsing EntityName"

I have a function where I am passing a string as params called filterXML which contains '&' in one of the properties.
I know that XML will not recognize it and it will throw me an err. Here is my code:
public XmlDocument TestXMLDoc(string filterXml)
{
XmlDocument doc = new XmlDocument();
XmlNode root = doc.CreateElement("ResponseItems");
// put that root into our document (which is an empty placeholder now)
doc.AppendChild(root);
try
{
XmlDocument docFilter = new XmlDocument();
docFilter.PreserveWhitespace = true;
if (string.IsNullOrEmpty(filterXml) == false)
docFilter.LoadXml(filterXml); //ERROR THROWN HERE!!!
What should I change in my code to edit or parse filterXml? My filterXml looks like this:
<Testing>
<Test>CITY & COUNTY</Test>
</Testing>
I am changing my string value from & to &amp;. Here is my code for that:
string editXml = filterXml;
if (editXml.Contains("&"))
{
editXml.Replace('&', '&amp;');
}
But it's giving me an error inside the if statement: Too many literals.
The file shown above is not well-formed XML because the ampersand is not escaped.
You can try with:
<Testing>
<Test>CITY &amp; COUNTY</Test>
</Testing>
or:
<Testing>
<Test><![CDATA[CITY & COUNTY]]></Test>
</Testing>
About the second question: there are two signatures for String.Replace. One takes characters, the other takes strings. Using single quotes attempts to build character literals - but "&amp;", for C#, is really a string (it has five characters).
Does it work with double quotes? Note that Replace returns a new string rather than modifying editXml, so the result has to be assigned:
editXml = editXml.Replace("&", "&amp;");
If you would like to be a bit more conservative, you could also write code to ensure that the &s you are replacing are not followed by one of
amp; quot; apos; gt; lt; or #
(but this would still not be a perfect filtering)
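The more conservative replacement described above could be sketched with a negative lookahead. This is an assumption-laden illustration, not a complete fix: the AmpersandFixer class is a made-up name, only the five predefined entities and numeric references are recognized, and ampersands inside CDATA sections or comments are not treated specially.

```csharp
using System;
using System.Text.RegularExpressions;

static class AmpersandFixer
{
    // Matches an & that does NOT start amp; lt; gt; quot; apos; or a numeric reference.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }

    static void Main()
    {
        Console.WriteLine(EscapeBareAmpersands("<Test>CITY & COUNTY</Test>"));
        // <Test>CITY &amp; COUNTY</Test>
    }
}
```

Entities that are already well-formed, such as &amp; or &#160;, pass through unchanged, which avoids the double-escaping problem of a blanket Replace.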
To specify an ampersand in XML you should use &amp; since the ampersand sign ('&') has a special meaning in XML.

Convert character entities to their unicode equivalents

I have HTML-encoded strings in a database, but many of the character entities are not just the standard &amp; and &lt;. There are entities like &ldquo; and &mdash;. Unfortunately we need to feed this data into a Flash-based RSS reader, and Flash doesn't read these entities, but it does read the Unicode equivalent (e.g. &#8220;).
Using .Net 4.0, is there any utility method that will convert the html encoded string to use unicode encoded character entities?
Here is a better example of what I need. The db has html strings like: <p>John &amp; Sarah went to see &ldquo;Scream 4&rdquo;.</p> and what I need to output in the rss/xml document within the <description> tag is: <p>John &#38; Sarah went to see &#8220;Scream 4&#8221;.</p>
I'm using an XmlTextWriter to create the xml document from the database records similar to this example code http://www.dotnettutorials.com/tutorials/advanced/rss-feed-asp-net-csharp.aspx
So I need to replace all of the character entities within the html string from the db with their Unicode equivalent, because the Flash-based RSS reader doesn't recognize any entities beyond the most common ones like &amp;.
My first thought is, can your RSS reader accept the actual characters? If so, you can use HtmlDecode and feed it directly in.
If you do need to convert it to the numeric representations, you could parse out each entity, HtmlDecode it, and then cast it to an int to get the base-10 unicode value. Then re-insert it into the string.
EDIT:
Here's some code to demonstrate what I mean (it is untested, but gets the idea across):
string input = "Something with &mdash; or other character entities.";
StringBuilder output = new StringBuilder(input.Length);
for (int i = 0; i < input.Length; i++)
{
if (input[i] == '&')
{
int startOfEntity = i; // just for easier reading
int endOfEntity = input.IndexOf(';', startOfEntity);
string entity = input.Substring(startOfEntity, endOfEntity - startOfEntity + 1); // + 1 so the terminating ';' is included
int unicodeNumber = (int)(HttpUtility.HtmlDecode(entity)[0]);
output.Append("&#" + unicodeNumber + ";");
i = endOfEntity; // continue parsing after the end of the entity
}
else
output.Append(input[i]);
}
I may have an off-by-one error somewhere in there, but it should be close.
would HttpUtility.HtmlDecode work for you?
I realize it doesn't convert to unicode equivalent entities, but instead converts it to unicode. Is there a specific reason you want the unicode equivalent entities?
updated edit
string test = "<p>John &amp; Sarah went to see &ldquo;Scream 4&rdquo;.</p>";
string decode = HttpUtility.HtmlDecode(test);
string encode = HttpUtility.HtmlEncode(decode);
StringBuilder builder = new StringBuilder();
foreach (char c in encode)
{
if ((int)c > 127)
{
builder.Append("&#");
builder.Append((int)c);
builder.Append(";");
}
else
{
builder.Append(c);
}
}
string result = builder.ToString();
You can download a local copy of the appropriate HTML and/or XHTML DTDs from the W3C. Then set up an XmlResolver and use it to expand any entities found in the document.
You could use a regular expression to find/expand the entities, but that won't know anything about context (e.g., anything in a CDATA section shouldn't be expanded).
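A rough sketch of the regular-expression route, with the caveat above still applying (it knows nothing about CDATA sections or other context). The EntityConverter class is a hypothetical name, and it assumes System.Net.WebUtility.HtmlDecode recognizes the named entities in your data:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class EntityConverter
{
    // Matches named entities such as &amp; or &ldquo; (not numeric ones).
    private static readonly Regex NamedEntity = new Regex(@"&[A-Za-z][A-Za-z0-9]*;");

    public static string ToNumericEntities(string html)
    {
        return NamedEntity.Replace(html, m =>
        {
            string decoded = WebUtility.HtmlDecode(m.Value);
            // Leave anything the decoder does not recognize untouched.
            if (decoded == m.Value || decoded.Length == 0)
                return m.Value;
            return "&#" + char.ConvertToUtf32(decoded, 0) + ";";
        });
    }

    static void Main()
    {
        Console.WriteLine(ToNumericEntities(
            "<p>John &amp; Sarah went to see &ldquo;Scream 4&rdquo;.</p>"));
        // <p>John &#38; Sarah went to see &#8220;Scream 4&#8221;.</p>
    }
}
```

Because only the matched entity is decoded, the surrounding markup and any existing numeric references are left exactly as they were.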
This might help you; put the input path in the text box.
try
{
FileInfo n = new FileInfo(textBox1.Text);
string initContent = File.ReadAllText(textBox1.Text);
int contentLength = initContent.Length;
Match m;
while ((m = Regex.Match(initContent, "[^a-zA-Z0-9<>/\\s(&#\\d+;)-]")).Value != String.Empty)
initContent = initContent.Remove(m.Index, 1).Insert(m.Index, string.Format("&#{0};", (int)m.Value[0]));
File.WriteAllText("outputpath", initContent);
}
catch (System.Exception excep)
{
MessageBox.Show(excep.Message);
}
}
