XmlDocument throwing "An error occurred while parsing EntityName" - c#

I have a function that takes a string parameter called filterXml, which contains '&' in one of the element values.
I know that XML will not accept it as-is and that the parser will throw an error. Here is my code:
public XmlDocument TestXMLDoc(string filterXml)
{
XmlDocument doc = new XmlDocument();
XmlNode root = doc.CreateElement("ResponseItems");
// put that root into our document (which is an empty placeholder now)
doc.AppendChild(root);
try
{
XmlDocument docFilter = new XmlDocument();
docFilter.PreserveWhitespace = true;
if (string.IsNullOrEmpty(filterXml) == false)
docFilter.LoadXml(filterXml); //ERROR THROWN HERE!!!
What should I change in my code to edit or parse filterXml? My filterXml looks like this:
<Testing>
<Test>CITY & COUNTY</Test>
</Testing>
I am trying to change '&' to '&amp;' in my string. Here is my code for that:
string editXml = filterXml;
if (editXml.Contains("&"))
{
editXml.Replace('&', '&amp;');
}
But it's giving me an error inside the if statement: Too many literals.

The file shown above is not well-formed XML because the ampersand is not escaped.
You can try with:
<Testing>
<Test>CITY &amp; COUNTY</Test>
</Testing>
or:
<Testing>
<Test><![CDATA[CITY & COUNTY]]></Test>
</Testing>

About the second question: String.Replace has two overloads, one that takes characters and one that takes strings. Single quotes build character literals, but "&amp;" is really a string in C# (it has five characters).
Does it work with double quotes? Note that strings are immutable, so Replace returns a new string that has to be assigned:
editXml = editXml.Replace("&", "&amp;");
If you would like to be a bit more conservative, you could also write code to ensure that the &s you are replacing are not followed by one of
amp; quot; apos; gt; lt; or #
(but this would still not be a perfect filtering)
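The more conservative replacement described above can be sketched with a negative lookahead. This is only a sketch: the class and method names are hypothetical, and, as noted, it is still not a perfect filter (for example, it does not know about entities declared in a DTD).

```csharp
using System.Text.RegularExpressions;

public static class AmpersandFixer
{
    // Matches '&' unless it already starts one of the predefined entities
    // (amp, quot, apos, gt, lt) or a numeric character reference
    // such as &#248; or &#xF8;.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:amp|quot|apos|gt|lt|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Returns a copy of the input in which only the bare ampersands are escaped.
    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }
}
```

Calling AmpersandFixer.EscapeBareAmpersands(filterXml) before docFilter.LoadXml(...) leaves ampersands that are already escaped untouched.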

To specify an ampersand in XML you should use &amp;, since the ampersand sign ('&') has a special meaning in XML.

Related

an error occurred while parsing entityname with '&'

I have the following program that opens an .xml document in Visual C#. I can't open the XML because it contains a '&', and I don't know how to open it.
private void button1_Click(object sender, EventArgs e)
{
XmlDocument doc;
doc = new XmlDocument();
doc.Load("nuevo.xml");
XmlNodeList menus;
menus = doc.GetElementsByTagName("menu");
foreach (XmlNode unMenu in menus)
{
if (unMenu.Attributes["precio"].Value == "50")
{
//Console.WriteLine(unMenu.Attributes["type"].Value);
XPathNavigator navegador = doc.CreateNavigator();
XPathNodeIterator nodos = navegador.Select("/restaurante");
while (nodos.MoveNext())
{
Console.WriteLine(nodos.Current.OuterXml);
Console.WriteLine();
textBox1.Text = nodos.Current.OuterXml;
}
}
}
}
If you get the error
an error occurred while parsing entityname with '&'
then there is an unescaped "&" somewhere in the XML document. This is not allowed in well-formed XML, and you cannot load such a file with the XmlDocument (or XDocument) class.
There are several things you can do:
1) Make sure that the XML files are always well-formed before trying to read them. This however depends on your scenario and may not be possible.
2) Preprocess your XML file to fix the invalid content by replacing "&" with "&amp;". You can either do this manually or at run-time.
3) Use HtmlAgilityPack to parse the invalid file.
Personally, I would go with 1) if possible or 2) otherwise.
Replace all occurrences of & with &amp; in the XML.
So after spending hours on this issue: it turns out that if you have an ampersand ("&") or any other XML special character within your XML string, reading the XML will always fail. To solve this, replace the special characters with their escaped forms:
// Apply this to text values only, since it also escapes '<' and '>'.
// '&' must be replaced first, otherwise the entities produced by the other replacements get escaped again.
YourXmlString = YourXmlString.Replace("&", "&amp;").Replace("'", "&apos;").Replace("\"", "&quot;").Replace(">", "&gt;").Replace("<", "&lt;");
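Instead of chaining Replace calls by hand, the framework's own System.Security.SecurityElement.Escape can escape a text value before it is placed between tags. A minimal sketch, assuming the value is plain text and not markup (the class and method names are hypothetical):

```csharp
using System.Security;

public static class XmlTextEscaper
{
    // Wraps a plain-text value in a <Test> element, escaping the
    // characters <, >, &, " and ' with their predefined entities.
    public static string BuildTestElement(string value)
    {
        return "<Test>" + SecurityElement.Escape(value) + "</Test>";
    }
}
```

For example, XmlTextEscaper.BuildTestElement("CITY & COUNTY") yields a string that XmlDocument.LoadXml can parse.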

Dealing with awkward XML layout in c# using XmlTextReader

So I have an XML document I'm trying to import using XmlTextReader in C#. My code works well except for one case: when the tags are not on the same line as the actual text content, for example with product_name:
<product>
<sku>27939</sku>
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<supplier_number>ALNN1064</supplier_number>
</product>
My code to try to sort the XML document is as such:
while (reader.Read())
{
switch (reader.Name)
{
case "sku":
newEle = new XMLElement();
newEle.SKU = reader.ReadString();
break;
case "product_name":
newEle.ProductName = reader.ReadString();
break;
case "supplier_number":
newEle.SupplierNumber = reader.ReadString();
products.Add(newEle);
break;
}
}
I have tried almost everything I found in the XmlTextReader documentation
reader.MoveToElement();
reader.MoveToContent();
reader.MoveToNextAttribute();
and a couple of others that made less sense, but none of them seems to deal with this issue consistently. Obviously I could fix this one case, but then it would break the regular cases. So my question is: after I find the "product_name" tag, is there a way to move to the next line that contains text and extract it?
I should have mentioned that I am outputting the result to an HTML table afterwards and the element comes up blank, so I'm fairly certain it is not being read correctly.
Thanks in advance!
I think you will find Linq To Xml easier to use
var xDoc = XDocument.Parse(xmlstring); //or XDocument.Load(filename);
int sku = (int)xDoc.Root.Element("sku");
string name = (string)xDoc.Root.Element("product_name");
string supplier = (string)xDoc.Root.Element("supplier_number");
You can also convert your xml to dictionary
var dict = xDoc.Root.Elements()
.ToDictionary(e => e.Name.LocalName, e => (string)e);
Console.WriteLine(dict["sku"]);
It looks like you may need to remove the carriage returns, line feeds, tabs, and spaces before and after the text in the XML element. In your example, you have
<!-- 1. Original example -->
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 2. It should probably be. If possible correct the XML generator. -->
<product_name>Sof-Therm Warm-Up Jacket</product_name>
<!-- 3a. If white space is important, then preserve it -->
<product_name xml:space='preserve'>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 3b. If white space is important, use CDATA -->
<product_name><![CDATA[
Sof-Therm Warm-Up Jacket
]]></product_name>
The XmlTextReader has a WhitespaceHandling property, but when I tested it, it still included the returns and indentation:
reader.WhitespaceHandling = WhitespaceHandling.None;
An option is to use a method to remove the extra characters while you are parsing the document. This method removes the normal white space at the beginning and end of a string:
string TrimCrLf(string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
// Then in your loop...
case "product_name":
// Trim the contents of the 'product_name' element to remove extra returns
newEle.ProductName = TrimCrLf(reader.ReadString());
break;
You can also use this method, TrimCrLf(), with Linq to Xml and the traditional XmlDocument. You can even make it an extension method:
public static class StringExtensions
{
public static string TrimCrLf(this string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
}
// Use it like:
newEle.ProductName = reader.ReadString().TrimCrLf();
Regular expression explanation:
^ = Beginning of field
$ = End of field
[]+= Match 1 or more of any of the contained characters
\r = carriage return (0x0D / 13)
\n = line feed (0x0A / 10)
\t = tab (0x09 / 9)
' '= space (0x20 / 32)
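As an alternative to the regex, string.Trim() with no arguments already strips leading and trailing white space, including \r, \n and tabs. A minimal sketch combining it with the Linq to Xml approach, assuming the <product> element is the document root as in the sample (the class and method names are hypothetical):

```csharp
using System.Xml.Linq;

public static class ProductReader
{
    // Returns the text of <product_name> without the surrounding
    // line breaks and indentation, or null if the element is missing.
    public static string ReadProductName(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        // The (string) cast returns the raw element text, line breaks included.
        string raw = (string)doc.Root.Element("product_name");
        return raw == null ? null : raw.Trim();
    }
}
```

Because Trim() only touches the ends of the string, white space inside the product name is left alone.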
I have run into a similar problem before when dealing with text that originated on a Mac platform due to reversed \r\n in newlines. Suggest you try Ryan's regex solution, but with the following regex:
"^[\r\n]+|[\r\n]+$"

How to correctly encode & in xml?

I'm requesting an XML document over the web. XDocument.Load(stream) throws an exception because the XML contains a bare &, after which the parser expects a ; as in &amp;.
I read the stream into a string and replaced & with &amp;, but that broke all the other, correctly encoded special characters like ø.
Is there a simple way to encode all disallowed characters in the string before parsing it into an XDocument?
Try CDATA Sections in xml
A CDATA section can only be used in places where you could have a text node.
<foo><![CDATA[Here is some data including < , > or & etc) ]]></foo>
This kind of method is not encouraged! The reason lies in your question
(replacing & with &amp; turns &gt; into &amp;gt;).
A better suggestion, apart from using a regex, is to modify the source code that generates such unencoded XML.
I have come across (.NET) code that uses string concatenation to build XML! (One should use the XML DOM instead.)
If you have access to that source code, you had better go ahead and fix it there, because re-encoding such half-encoded XML cannot be guaranteed to work perfectly.
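To illustrate the point about building XML through an API rather than by string concatenation, here is a minimal sketch using XElement from Linq to Xml. The element name is taken from the sample input; the class and method names are hypothetical.

```csharp
using System.Xml.Linq;

public static class XmlBuilder
{
    // Builds a <specialchild> element from plain text. The API escapes
    // special characters such as '&' when the element is serialized.
    public static string BuildSpecialChild(string text)
    {
        var element = new XElement("specialchild", text);
        return element.ToString(SaveOptions.DisableFormatting);
    }
}
```

A generator written this way cannot emit a bare ampersand, so the consumer never needs to repair the document.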
@espvar,
This is an input XML:
<root><child>nospecialchars</child><specialchild>data&data</specialchild><specialchild2>You.. & I in this beautiful world</specialchild2>data&</root>
And the Main function:
string EncodedXML = encodeWithCDATA(XMLInput); //Calling our Custom function
XmlDocument xdDoc = new XmlDocument();
xdDoc.LoadXml(EncodedXML); //passed
The function encodeWithCDATA():
private string encodeWithCDATA(string stringXML)
{
if (stringXML.IndexOf('&') != -1)
{
int indexofClosingtag = stringXML.Substring(0, stringXML.IndexOf('&')).LastIndexOf('>');
int indexofNextOpeningtag = stringXML.Substring(indexofClosingtag).IndexOf('<');
string CDATAsection = string.Concat("<![CDATA[", stringXML.Substring(indexofClosingtag, indexofNextOpeningtag), "]]>");
string encodedLeftPart = string.Concat(stringXML.Substring(0, indexofClosingtag+1), CDATAsection);
string UncodedRightPart = stringXML.Substring(indexofClosingtag+indexofNextOpeningtag);
return (string.Concat(encodedLeftPart, encodeWithCDATA(UncodedRightPart)));
}
else
{
return (stringXML);
}
}
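Encoded XML (i.e., xdDoc.OuterXml). Note that each CDATA section starts with a stray '>', because Substring(indexofClosingtag, ...) begins at the closing '>' itself rather than one character past it: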
<root>
<child>nospecialchars</child>
<specialchild>
<![CDATA[>data&data]]>
</specialchild>
<specialchild2>
<![CDATA[>You.. & I in this beautiful world]]>
</specialchild2>
<![CDATA[>data&]]>
</root>
All I have used is Substring, IndexOf, string.Concat and a recursive function call. Let me know if you don't understand any part of the code.
The sample XML that I have provided has data in the parent nodes as well, as HTML often does, e.g. <div>this is <b>bold</b> text</div>, and my code also takes care of encoding data outside the <b> tag if it contains the special character &.
Please note that I have taken care of encoding '&' only; the data cannot contain characters like '<', '>', single quotes or double quotes.

Error getting the exact value from an XML node when the string contains '\' and the XML is passed as a string instead of a file

I have XML in a string as below
String s = @"<user>abc.int\abhi</user>";
but when I write the following code
XmlDocument doc = new XmlDocument();
doc.InnerXml = s;
XmlElement root = doc.DocumentElement;
String User = root.InnerText; // root is the <user> element itself
User has the value abc.int\\abhi instead of abc.int\abhi; the '\' character appears twice in the string.
Thank you in advance.
Are you checking that value in the Visual Studio watch window? If so, it is normal for it to display \\, because the watch window shows a string as it would be written in code, not the actual string.
In code, if you want to put a \ into a string, you have to write string s = "\\"; and this creates an actual string with a single \ in it.
Try outputting your string to the console or a message box, and you should see that it is correct.
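A quick way to convince yourself is to compare the value read from the XML with both ways of writing the literal. A minimal sketch (the class and method names are hypothetical):

```csharp
using System.Xml;

public static class BackslashDemo
{
    // Loads the sample XML from a verbatim string and returns the text
    // of the <user> element exactly as it is stored in memory.
    public static string ReadUser()
    {
        var doc = new XmlDocument();
        doc.LoadXml(@"<user>abc.int\abhi</user>");
        return doc.DocumentElement.InnerText;
    }
}
```

The returned string contains a single backslash. The regular literal "abc.int\\abhi" and the verbatim literal @"abc.int\abhi" denote that same 12-character string, and the doubled backslash exists only in the debugger's display.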

special chars in XML

I want to parse the following XML
XmlElement costCenterElement2 = doc.CreateElement("CostCenter");
costCenterElement2.InnerXml =
"<CostCenterNumber>2</CostCenterNumber> <CostCenter>" +
"G&A: Fin & Acctng" +
"</CostCenter>";
but I get an XmlException:
An error occurred while parsing EntityName.
Yeah - a & is not valid in XML and needs to be escaped to &amp;.
The other special characters and their escapes:
< - &lt;
> - &gt;
" - &quot;
' - &apos;
The following should work:
XmlElement costCenterElement2 = doc.CreateElement("CostCenter");
costCenterElement2.InnerXml =
"<CostCenterNumber>2</CostCenterNumber> <CostCenter>" +
"G&amp;A: Fin &amp; Acctng" +
"</CostCenter>";
However, you really should be creating the CostCenterNumber and CostCenter as elements and not as InnerXml.
private string SanitizeXml(string source)
{
if (string.IsNullOrEmpty(source))
{
return source;
}
if (source.IndexOf('&') < 0)
{
return source;
}
StringBuilder result = new StringBuilder(source);
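// Protect entities that are already escaped by swapping them for placeholders.
result = result.Replace("&lt;", "<>lt;")
.Replace("&gt;", "<>gt;")
.Replace("&amp;", "<>amp;")
.Replace("&apos;", "<>apos;")
.Replace("&quot;", "<>quot;");
// Escape the remaining bare ampersands.
result = result.Replace("&", "&amp;");
// Restore the protected entities.
result = result.Replace("<>lt;", "&lt;")
.Replace("<>gt;", "&gt;")
.Replace("<>amp;", "&amp;")
.Replace("<>apos;", "&apos;")
.Replace("<>quot;", "&quot;");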
return result.ToString();
}
Updated:
@thabet, if the string "<CostCenterNumber>...G&A: Fin & Acctng</CostCenter>" is coming in as a parameter, and it's supposed to represent XML to be parsed, then it has to be well-formed XML to start with. In the example you gave, it isn't. & signals the start of an entity reference, is followed by an entity name, and is terminated by ;, which never appears in the string above.
If you are given that whole string as a parameter, some of which is markup that must be parsed (i.e. the start/end tags), and some of which may contain markup that should not be parsed (i.e. the &), there is no clean and reliable way to "escape" the latter and not escape the former. You could replace all & characters with &amp;, but in doing so you might accidentally turn &#160; into &amp;#160; and your resulting content would be wrong. If this is your situation, that you are receiving input "XML" where markup is mixed with unparseable text, the best recourse is to tell the person from whom you are getting the XML that it's not well-formed and that they need to fix their output. There are ways for them to do that which are not difficult with standard XML tools.
If on the other hand you have
<CostCenterNumber>2</CostCenterNumber>
<CostCenter>...</CostCenter>
separately from the passed string, and you need to plug in the passed string as the text content of the child <CostCenter>, and you know it is not to be parsed (does not contain elements), then you can do this:
create <CostCenterNumber> and <CostCenter> as elements
make them children of the parent <CostCenter>
set CostCenterNumber's text content using InnerXml, assuming there is no risk of markup in there: eltCCN.InnerXml = "2";
create for the child CostCenter element a Text node child whose value is the passed string: textCC = doc.CreateTextNode(argStr);
assign that text node as a child of the child CostCenter element: eltCC.AppendChild(textCC);
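The steps above can be sketched as follows. This is a minimal sketch, assuming the passed string is plain text; the variable names eltCCN, eltCC and textCC follow the steps, while the class and method names are hypothetical.

```csharp
using System.Xml;

public static class CostCenterBuilder
{
    // Builds <CostCenter><CostCenterNumber>2</CostCenterNumber><CostCenter>...</CostCenter></CostCenter>
    // with argStr inserted as a text node, so it is never parsed as markup.
    public static string Build(string argStr)
    {
        var doc = new XmlDocument();
        XmlElement parent = doc.CreateElement("CostCenter");
        doc.AppendChild(parent);

        XmlElement eltCCN = doc.CreateElement("CostCenterNumber");
        eltCCN.InnerXml = "2";
        parent.AppendChild(eltCCN);

        XmlElement eltCC = doc.CreateElement("CostCenter");
        XmlText textCC = doc.CreateTextNode(argStr);
        eltCC.AppendChild(textCC);
        parent.AppendChild(eltCC);

        // The ampersands are escaped only when the document is serialized.
        return doc.OuterXml;
    }
}
```

Calling CostCenterBuilder.Build("G&A: Fin & Acctng") produces well-formed XML without any manual escaping.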
