Unable to convert special characters on reading XML - c#

I am using the following code to import XML into a dataset:
DataSet dataSet = new DataSet();
dataSet.ReadXml(file.FullName);
if (dataSet.Tables.Count > 0) //not empty XML file
{
da.ClearFieldsForInsert();
DataRow order = dataSet.Tables["Orders"].Rows[0];
da.AddStringForInsert("ProductDescription", order["ProductDescription"].ToString());
}
Special characters such as &apos; are not getting translated to ' as I would have thought they should.
I can convert them myself in code, but would have thought the ReadXML method should do it automatically.
Is there anything I've missed here?
EDIT:
Relevant line of XML file:
<ProductDescription>Grey &apos;Aberdeen&apos; double wardrobe</ProductDescription>
EDIT:
I then tried using XElement:
XDocument doc = XDocument.Load(file.FullName);
XElement order = doc.Root.Elements("Orders").FirstOrDefault();
...
if (order != null)
{
da.ClearFieldsForInsert();
IEnumerable<XElement> items = doc.Root.Elements("Orders");
foreach (XElement item in items)
{
da.ClearFieldsForInsert();
da.AddStringForInsert("ProductDescription", item.Element("ProductDescription").Value);
}
}
Still not getting converted!

As stated here, &apos; is a valid XML escape code.
However, it is not necessary to escape ' in element values.
<ProductDescription>Grey 'Aberdeen' double wardrobe</ProductDescription>
is valid XML.
Workaround aside, a standards compliant XML parser should honour the predefined entities, wherever they occur (except in CDATA.)
This frailty of DataSet.ReadXml, and its deviation from standard XML parsing, is noted in the documentation. I quote:
The DataSet itself only escapes illegal
XML characters in XML element names and hence can only consume the
same. When legal characters in XML element name are escaped, the
element is ignored while processing.
Due to its limitations, I wouldn't use DataSet.ReadXml for XML parsing. Instead you could use XDocument, something like this:
using System.Linq;
using System.Xml.Linq;
...
var doc = XDocument.Load(file.FullName);
// "Orders" matches the element name used in the question
var order = doc.Root.Elements("Orders").FirstOrDefault();
if (order != null)
{
da.ClearFieldsForInsert();
var productDescription = order.Element("ProductDescription");
da.AddStringForInsert(
"ProductDescription",
productDescription.Value);
}
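To check the claim that a standards-compliant parser decodes the predefined entities, here is a minimal sketch that parses the line from the question with XElement.Parse. The class name EntityDemo is only illustrative and is not part of the original code.

```csharp
using System;
using System.Xml.Linq;

class EntityDemo
{
    static void Main()
    {
        // The relevant line from the question's XML file.
        var element = XElement.Parse(
            "<ProductDescription>Grey &apos;Aberdeen&apos; double wardrobe</ProductDescription>");

        // Value returns the decoded text, with &apos; turned back into '.
        Console.WriteLine(element.Value); // Grey 'Aberdeen' double wardrobe
    }
}
```

If this prints apostrophes but your database still shows &apos;, the entity was probably double-escaped in the file (for example &amp;apos;), which is worth checking before blaming the parser.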

Related

Remove all text not wrapped in XML braces

I want to remove all invalid text from an XML document. I consider any text not wrapped in <> XML brackets to be invalid, and want to strip these prior to translation.
The post "Regular expression to remove text outside the tags in a string" explains how to match XML brackets together. However, on my example it doesn't clean up the text outside of the XML, as can be seen here: https://regex101.com/r/6iUyia/1
From my initial research, I don't think this specific example has been asked on S/O before.
Currently in my code, I have this XML as a string, before I compose an XDocument from it later on. So I potentially have string, Regex and XDocument methods available to assist in removing this. There could also be more than one piece of invalid text present in these documents. Additionally, I do not wish to use XSLT to remove these values.
One very rudimentary idea I tried and failed to implement was to iterate over the string as a char array and remove any text that fell outside of '>' and '<', but I decided there must be a better way to achieve this (hence the question).
This is an example of the input, with invalid text being displayed between nested-A and nested-B
<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<A>
<nested-A>valid text</nested-A>
Remove text not inside valid xml braces
<nested-B>more valid text here</nested-B>
</A>
</ASchema>
I expect the output to be in a format like the below.
<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<A>
<nested-A>valid text</nested-A>
<nested-B>more valid text here</nested-B>
</A>
</ASchema>
You could do the following. Please note I have done very limited testing, so kindly let me know if it fails in some scenarios.
XmlDocument doc = new XmlDocument();
doc.LoadXml(str);
var json = JsonConvert.SerializeXmlNode(doc);
string result = JToken.Parse(json).RemoveFields().ToString(Newtonsoft.Json.Formatting.None);
var xml = (XmlDocument)JsonConvert.DeserializeXmlNode(result);
Where RemoveFields is defined as
public static class Extensions
{
public static JToken RemoveFields(this JToken token)
{
JContainer container = token as JContainer;
if (container == null) return token;
List<JToken> removeList = new List<JToken>();
foreach (JToken el in container.Children())
{
JProperty p = el as JProperty;
if (p != null && p.Name.StartsWith("#"))
{
removeList.Add(el);
}
el.RemoveFields();
}
foreach (JToken el in removeList)
el.Remove();
return token;
}
}
Output
<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<A>
<nested-A>valid text</nested-A>
<nested-B>more valid text here</nested-B>
</A>
</ASchema>
Please note I am using Json.NET in the above code.
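An alternative sketch that avoids the JSON round trip, using LINQ to XML directly. This is a different technique from the answer above, and it assumes the stray text always sits next to sibling elements (mixed content): every text node whose parent also has child elements is removed. The class name StripStrayText and the shortened sample input are illustrative only.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class StripStrayText
{
    static void Main()
    {
        // Shortened version of the question's input (namespaces omitted).
        string input =
            "<ASchema><A>" +
            "<nested-A>valid text</nested-A>" +
            "Remove text not inside valid xml braces" +
            "<nested-B>more valid text here</nested-B>" +
            "</A></ASchema>";

        XDocument doc = XDocument.Parse(input);

        // A text node is treated as stray when its parent also contains elements.
        // Text inside leaf elements such as <nested-A> is left alone.
        doc.Root.DescendantNodes()
            .OfType<XText>()
            .Where(text => text.Parent != null && text.Parent.Elements().Any())
            .Remove();

        Console.WriteLine(doc.ToString(SaveOptions.DisableFormatting));
    }
}
```

Note that this would also delete legitimate mixed content (for example text next to a <b> tag inside a paragraph), so it only fits documents where elements hold either text or child elements, never both.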

How do I retrieve an XDocument's Element Value in XML escaped format?

I am building an XDocument and I have unit tests to test the output. One of the things I want to test for is that invalid strings are being formatted for XML properly. I have discovered that calling .ToString() on the XDoc itself properly formats the invalid strings for XML. However, in my testing I am retrieving specific Elements or Attributes off of the XDoc and testing the values. This does not format the values for XML. How do I go about getting these values in their escaped format?
Answer: (thx Ed Plunkett)
myXDoc.Descendants("element2").First().FirstNode.ToString();
// result "Pork&amp;Beans"
Sample:
var xml =
"<element1>" +
"<element2>Pork&amp;Beans</element2>" +
"</element1>";
var myXDoc = XDocument.Parse(xml);
var xDocString = myXDoc.ToString();
// result is formatted - <element1> <element2>Pork&amp;Beans</element2> </element1>
var element2Value = myXDoc.Descendants("element2").First().Value;
// result is unformatted - Pork&Beans
Got it: Text elements in XML are nodes too.
var el2XML = myXDoc.Descendants("element2").First();
var porkAndAmpSemicolonBeans = el2XML.FirstNode.ToString();
You'll also want to check el2XML.Nodes().Count() to make sure there's exactly one child in there.
System.Xml.XmlDocument is another option, because XmlNode has an InnerXml property that'll give you what you want:
var morePorkAndBeans = doc.SelectSingleNode("//element2").InnerXml;
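The difference between the decoded value and the escaped text node can be seen side by side in a small, self-contained sketch. The class name EscapedValueDemo is illustrative; the element names follow the sample above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class EscapedValueDemo
{
    static void Main()
    {
        var doc = XDocument.Parse(
            "<element1><element2>Pork&amp;Beans</element2></element1>");
        XElement element2 = doc.Descendants("element2").First();

        // Value gives the decoded text.
        Console.WriteLine(element2.Value);                // Pork&Beans

        // The text node serializes itself back to escaped XML.
        Console.WriteLine(element2.FirstNode.ToString()); // Pork&amp;Beans
    }
}
```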

XML problem - HTML within a node is being removed (ASP.NET C# LINQ to XML)

When I load this XML node, the HTML within the node is being completely stripped out.
This is the code I use to get the value within the node, which is text combined with HTML:
var stuff = innerXml.Descendants("root").Elements("details").FirstOrDefault().Value;
Inside the "details" node is text that looks like this:
"This is <strong>test copy</strong>. This is A Link"
When I look in "stuff" var I see this:
"This is test copy. This is A Link". There is no HTML in the output... it is pulled out.
Maybe Value should be innerXml or innerHtml? Does FirstOrDefault() have anything to do with this?
I don't think the xml needs a "cdata" block...
Here is a more complete code snippet:
announcements =
from link in xdoc.Descendants(textContainer).Elements(textElement)
where link.Parent.Attribute("id").Value == Announcement.NodeId
select new AnnouncmentXml
{
NodeId = link.Attribute("id").Value,
InnerXml = link.Value
};
XDocument innerXml;
innerXml = XDocument.Parse(item.InnerXml);
var abstractText = innerXml.Descendants("root").Elements("abstract").FirstOrDefault().Value; // "abstract" is a reserved word in C#, so the variable is renamed
Finally, here is a snippet of the Xml Node. Notice how there is "InnerXml" within the standard xml structure. It starts with . I call this the "InnerXml" and this is what I am passing into the XDocument called InnerXml:
<text id="T_403080"><root> <title>How do I do stuff?</title> <details> Look Here Some Form. Please note that lorem ipsum dlor sit amet.</details> </root></text>
[UPDATE]
I tried to use this helper lambda, and it returns the HTML, but escaped, so when it displays on the page I see the actual HTML in the view (instead of rendering a link, the tag is printed to the screen):
Title = innerXml.Descendants("root").Elements("title").FirstOrDefault().Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString());
So I tried both HTMLEncode and HTMLDecode but neither helped. One showed the escaped chars on the screen and the other did nothing:
Title =
System.Web.HttpContext.Current.Server.HtmlDecode(
innerXml.Descendants("root").Elements("details").Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString())
);
I ended up using an XmlDocument instead of an XDocument. It doesn't seem like LINQ to XML is mature enough to support what I am trying to do. There is no InnerXml property on an XDocument, only Value.
Maybe someday I will be able to revert to LINQ. For now, I just had to get this off my plate. Here is my solution:
// XmlDoc to hold custom Xml within each node
XmlDocument innerXml = new XmlDocument();
try
{
// Parse inner xml of each item and create objects
foreach (var faq in faqs)
{
innerXml.LoadXml(faq.InnerXml);
FAQ oFaq = new FAQ();
#region Fields
// Get Title value if node exists and is not null
if (innerXml.SelectSingleNode("root/title") != null)
{
oFaq.Title = innerXml.SelectSingleNode("root/title").InnerXml;
}
// Get Details value if node exists and is not null
if (innerXml.SelectSingleNode("root/details") != null)
{
oFaq.Description = innerXml.SelectSingleNode("root/details").InnerXml;
}
#endregion
result.Add(oFaq);
}
}
catch (Exception ex)
{
// Handle Exception
}
I do think wrapping your details node in a CDATA block is the right decision. CDATA basically indicates that the information contained within it should be treated as text, and not parsed for XML special characters. The HTML characters in the details node, especially the < and >, are in direct conflict with the XML spec, and should really be marked as text.
You might be able to hack around this by grabbing the innerXml, but if you have control over the document content, cdata is the correct decision.
In case you need an example of how that should look, here's a modified version of the detail node:
<details>
<![CDATA[
This is <strong>test copy</strong>. This is A Link
]]>
</details>
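A minimal sketch of how both cases behave in LINQ to XML, assuming a simplified details node (the link from the question is left out, and the class name DetailsDemo is illustrative). With CDATA, Value already returns the markup as text; without CDATA, concatenating the child nodes gives the equivalent of InnerXml.

```csharp
using System;
using System.Xml.Linq;

class DetailsDemo
{
    static void Main()
    {
        // Case 1: HTML wrapped in CDATA is plain text to the parser.
        var withCData = XElement.Parse(
            "<details><![CDATA[This is <strong>test copy</strong>.]]></details>");
        Console.WriteLine(withCData.Value);
        // This is <strong>test copy</strong>.

        // Case 2: HTML as real child elements. Value drops the tags...
        var withMarkup = XElement.Parse(
            "<details>This is <strong>test copy</strong>.</details>");
        Console.WriteLine(withMarkup.Value);
        // This is test copy.

        // ...but serializing each child node keeps them, like InnerXml does.
        Console.WriteLine(string.Concat(withMarkup.Nodes()));
        // This is <strong>test copy</strong>.
    }
}
```

If the tags still show up as literal text on the page, the view layer is most likely HTML-encoding the string on output, which is a separate issue from the XML parsing.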

C# , xml parsing. get data between tags

I have a string :
responsestring = "<?xml version="1.0" encoding="utf-8"?>
<upload><image><name></name><hash>SOmetext</hash>"
How can i get the value between
<hash> and </hash>
?
My attempts :
responseString.Substring(responseString.LastIndexOf("<hash>") + 6, 8); // this sort of works , but won't work in every situation.
I also tried messing around with XmlReader, but couldn't find a solution.
ty
Try
XDocument doc = XDocument.Parse(str);
var a = from hash in doc.Descendants("hash")
select hash.Value;
you will need System.Core and System.Xml.Linq assembly references
Others have suggested LINQ to XML solutions, which is what I'd use as well, if possible.
If you're stuck with .NET 2.0, use XmlDocument or even XmlReader.
But don't try to manipulate the raw string yourself using Substring and IndexOf. Use an XML API of some description. Otherwise you will get it wrong. It's a matter of using the right tool for the job. Parsing XML properly is a significant chunk of work - work that's already been done.
Now, just to make this a full answer, here's a short but complete program using your sample data:
using System;
using System.Xml.Linq;
class Test
{
static void Main()
{
string response = @"<?xml version='1.0' encoding='utf-8'?>
<upload><image><name></name><hash>Some text</hash></image></upload>";
XDocument doc = XDocument.Parse(response);
foreach (XElement hashElement in doc.Descendants("hash"))
{
string hashValue = (string) hashElement;
Console.WriteLine(hashValue);
}
}
}
Obviously that will loop over all the hash elements. If you only want one, you could use doc.Descendants("hash").Single() or doc.Descendants("hash").First() depending on your requirements.
Note that both the conversion I've used here and the Value property will return the concatenation of all text nodes within the element. Hopefully that's okay for you - or you could get just the first text node which is a direct child if necessary.
var val = XElement.Parse(responseString);
string hash = val.Descendants("hash").First().Value; // First() needs using System.Linq;
Make your XML well formed and escape the double quotes with a backslash. Then apply the following code:
XDocument resp = XDocument.Parse("<hash>SOmetext</hash>");
var r= from element in resp.Elements()
where element.Name == "hash"
select element;
foreach (var item in r)
{
Console.WriteLine(item.Value);
}
You can use an xmlreader and/or xpath queries to get all desired data.
XmlReader_Object.ReadToFollowing("hash");
string value = XmlReader_Object.ReadInnerXml();
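For completeness, a self-contained sketch of the XmlReader approach, which also works on .NET 2.0. It assumes the response has been completed into well-formed XML, as in the sample program earlier; the class name ReaderDemo is illustrative.

```csharp
using System;
using System.IO;
using System.Xml;

class ReaderDemo
{
    static void Main()
    {
        string xml =
            "<upload><image><name></name><hash>Some text</hash></image></upload>";

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // ReadToFollowing returns false when no <hash> element exists.
            if (reader.ReadToFollowing("hash"))
            {
                Console.WriteLine(reader.ReadElementContentAsString()); // Some text
            }
        }
    }
}
```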

Converting one XML document into another XML document

I want to convert an XML document containing many elements within a node (around 150) into another XML document with a slightly different schema but mostly the same element names. Do I have to manually map each element/node between the two documents? For that I would have to hardcode 150 lines of mapping and element names. Something like this:
XElement newOrder = new XElement("Order");
newOrder.Add(new XElement("OrderId", (string)oldOrder.Element("OrderId")));
newOrder.Add(new XElement("OrderName", (string)oldOrder.Element("OrderName")));
...............
...............
...............and so on
The newOrder document may contain additional nodes which will be set to null if nothing is found for them in the oldOrder. So do I have any other choice than to hardcode 150 element names like orderId, orderName and so on... Or is there some better more maintainable way?
Use an XSLT transform instead. You can use the built-in .NET XslCompiledTransform to do the transformation. It saves you from having to type out stacks of code. If you don't already know XSL/XSLT, then learning it is something that'll boost your CV :)
Good luck!
Use an XSLT transformation to translate your old xml document into the new format.
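A rough sketch of what the XSLT route could look like with XslCompiledTransform. The stylesheet copies everything unchanged (the identity template) and overrides only the elements that differ, so 150 mappings shrink to the handful of exceptions. The rename of OrderName to Name is a made-up example, as are the sample input and the class name XsltDemo.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

class XsltDemo
{
    static void Main()
    {
        // Identity transform plus one hypothetical rename: OrderName -> Name.
        string xslt = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output omit-xml-declaration='yes'/>
  <xsl:template match='@*|node()'>
    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match='OrderName'>
    <Name><xsl:apply-templates select='@*|node()'/></Name>
  </xsl:template>
</xsl:stylesheet>";

        string input =
            "<Order><OrderId>1</OrderId><OrderName>Test</OrderName></Order>";

        var transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(styleReader);
        }

        var output = new StringWriter();
        using (XmlReader inputReader = XmlReader.Create(new StringReader(input)))
        using (XmlWriter writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(inputReader, writer);
        }

        // Elements without a specific template are copied through as-is.
        Console.WriteLine(output.ToString());
    }
}
```

In a real project the stylesheet would normally live in its own .xslt file and be loaded by path, which keeps the mapping editable without recompiling.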
XElement.Add has an overload that takes object[].
List<string> elementNames = GetElementNames();
newOrder.Add(
elementNames
.Select(name => GetElement(name, oldOrder))
.Where(element => element != null)
.ToArray()
);
//
public XElement GetElement(string name, XElement source)
{
XElement result = null;
XElement original = source.Elements(name).FirstOrDefault();
if (original != null)
{
result = new XElement(name, (string)original);
}
return result;
}
