I want to remove all invalid text from an XML document. I consider any text not wrapped in <> XML brackets to be invalid, and want to strip these prior to translation.
The post "Regular expression to remove text outside the tags in a string" explains how to match XML brackets together. However, on my input it doesn't clean up the text outside of the XML, as can be seen in this example: https://regex101.com/r/6iUyia/1
I don't think this specific case has been asked on S/O before, based on my initial research.
Currently in my code, I have this XML as a string, before I compose an XDocument from it later on. So I potentially have string, Regex and XDocument methods available to help remove this text. There could also be more than one piece of invalid text present in these documents. Additionally, I do not wish to use XSLT to remove these values.
One of the very rudimentary ideas I tried, and failed, to put together was to iterate over the string as a char array and remove any text that fell outside of '>' and '<', but I decided there must be a better way to achieve this (hence the question).
This is an example of the input, with invalid text being displayed between nested-A and nested-B
<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<A>
<nested-A>valid text</nested-A>
Remove text not inside valid xml braces
<nested-B>more valid text here</nested-B>
</A>
</ASchema>
I expect the output to be in a format like the below.
<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<A>
<nested-A>valid text</nested-A>
<nested-B>more valid text here</nested-B>
</A>
</ASchema>
You could do the following. Please note I have done very limited testing, so kindly let me know if it fails in some scenarios.
XmlDocument doc = new XmlDocument();
doc.LoadXml(str);
var json = JsonConvert.SerializeXmlNode(doc);
string result = JToken.Parse(json).RemoveFields().ToString(Newtonsoft.Json.Formatting.None);
var xml = (XmlDocument)JsonConvert.DeserializeXmlNode(result);
Where RemoveFields is defined as
public static class Extensions
{
public static JToken RemoveFields(this JToken token)
{
JContainer container = token as JContainer;
if (container == null) return token;
List<JToken> removeList = new List<JToken>();
foreach (JToken el in container.Children())
{
JProperty p = el as JProperty;
if (p != null && p.Name.StartsWith("#"))
{
removeList.Add(el);
}
el.RemoveFields();
}
foreach (JToken el in removeList)
el.Remove();
return token;
}
}
Output
<ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
<A>
<nested-A>valid text</nested-A>
<nested-B>more valid text here</nested-B>
</A>
</ASchema>
Please note I am using Json.NET in the code above.
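If you would rather stay within LINQ to XML and skip the JSON round trip, a minimal sketch along these lines might also work. It assumes that "invalid" text means any text node that has element siblings, and the helper name StripStrayText is made up for illustration. Like the code above, it has only been reasoned through against the sample input.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: removes text nodes whose parent also contains elements.
// Text inside leaf elements (e.g. <nested-A>valid text</nested-A>) is kept.
static string StripStrayText(string xml)
{
    XDocument doc = XDocument.Parse(xml);

    doc.DescendantNodes()
       .OfType<XText>()
       .Where(t => t.Parent != null && t.Parent.Elements().Any())
       .Remove(); // Extensions.Remove snapshots the sequence before removing

    return doc.ToString();
}

string input =
    "<A><nested-A>valid text</nested-A>\n" +
    "Remove text not inside valid xml braces\n" +
    "<nested-B>more valid text here</nested-B></A>";

Console.WriteLine(StripStrayText(input));
```

Note that this would also strip legitimate mixed content, so it only fits documents where text is expected solely in leaf elements.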
If you open an XML document with XDocument.Load(path), look through Descendants to find the element you want, and call SetElementValue with an empty string ("") or null, the tag ends up being removed, so it is lost when you save the document.
I want to be able to keep the tag when the value is null or an empty string. I've not been able to find a way to do this.
Is my only option to deserialize the entire XML document into objects edit those objects and write over the file rather than just loading the XmlDocument and editing it?
Sorry, it took me a while to get back to this. I found what my issue was so the code was all correct. What I hadn't noticed before was that this line was at the end before the save.
xDocument.Descendants().Where(e => string.IsNullOrEmpty(e.Value)).Remove();
This goes through all the descendants and finds any that are null or empty strings and removes them which was my problem.
XElement.SetElementValue(elementName, elementValue);
This does exactly as documented. When the elementValue is null it removes the element, but when it's an empty string it leaves the element in place as an empty element in long form (not the short form), which is fine for my case.
For completeness of this answer and since those asked for example code here is some.
Sample.cfg
<?xml version="1.0" encoding="utf-8"?>
<ParentNode>
<ChildNode>
<PropertyOne>1</PropertyOne>
<PropertyTwo>Y</PropertyTwo>
</ChildNode>
<ChildNode>
<PropertyOne>2</PropertyOne>
<PropertyTwo>N</PropertyTwo>
</ChildNode>
</ParentNode>
Sample Code
// See https://aka.ms/new-console-template for more information
using System.Xml.Linq;
var xDocument = XDocument.Load("Sample.cfg");
foreach (var childNode in xDocument.Descendants("ChildNode"))
{
    foreach (var element in childNode.Elements())
    {
        if (element.Name == "PropertyOne" && element.Value == "2")
        {
            // An empty string keeps the element; null would remove it
            childNode.SetElementValue("PropertyTwo", "");
        }
    }
}
// Uncomment this line to always have it remove null and empty string descendants
//xDocument.Descendants().Where(e => string.IsNullOrEmpty(e.Value)).Remove();
// Save once, after all edits have been made
xDocument.Save("Sample.cfg");
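To see the null versus empty-string behaviour of SetElementValue in isolation, here is a small in-memory sketch. It builds the element from a literal rather than loading Sample.cfg, so nothing is written to disk; the variable names are only for illustration.

```csharp
using System;
using System.Xml.Linq;

var parent = XElement.Parse(
    "<ChildNode><PropertyOne>1</PropertyOne><PropertyTwo>Y</PropertyTwo></ChildNode>");

// Empty string: the element stays in the tree with no content.
parent.SetElementValue("PropertyTwo", "");

// null: the element is removed from the tree.
parent.SetElementValue("PropertyOne", null);

Console.WriteLine(parent.Element("PropertyTwo") != null); // element still present
Console.WriteLine(parent.Element("PropertyOne") == null); // element removed
```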
Here's some fantastic example XML:
<root>
<section>Here is some text<mightbe>a tag</mightbe>might <not attribute="be" />. Things are just<label>a mess</label>but I have to parse it because that's what needs to be done and I can't <font stupid="true">control</font> the source. <p>Why are there p tags here?</p>Who knows, but there may or may not be spaces around them so that's awesome. The point here is, there's node soup inside the section node and no definition for the document.</section>
</root>
I'd like to just grab the text from the section node and all sub nodes as strings. BUT, note that there may or may not be spaces around the sub nodes, so I want to pad each sub node with a space on either side.
Here's a more precise example of what input might look like, and what I'd like output to be:
<root>
<sample>A good story is the<book>Hitchhikers Guide to the Galaxy</book>. It was published<date>a long time ago</date>. I usually read at<time>9pm</time>.</sample>
</root>
I'd like the output to be:
A good story is the Hitchhikers Guide to the Galaxy. It was published a long time ago. I usually read at 9pm.
Note that the child nodes don't have spaces around them, so I need to pad them otherwise the words run together.
I was attempting to use this sample code:
XDocument doc = XDocument.Parse(xml);
foreach(var node in doc.Root.Elements("section"))
{
output += String.Join(" ", node.Nodes().Select(x => x.ToString()).ToArray()) + " ";
}
But the output includes the child tags, and is not going to work out.
Any suggestions here?
TL;DR: Was given node soup xml and want to stringify it with padding around child nodes.
In case you have tags nested to an unknown depth (e.g. <date>a <i>long</i> time ago</date>), you might also want to recurse so that the formatting is applied consistently throughout. For example:
private static string Parse(XElement root)
{
return root
.Nodes()
.Select(a => a.NodeType == XmlNodeType.Text ? ((XText)a).Value : Parse((XElement)a))
.Aggregate((a, b) => String.Concat(a.Trim(), b.StartsWith(".") ? String.Empty : " ", b.Trim()));
}
You could try using xpath to extract what you need
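// Requires: using System.IO; using System.Xml.XPath;
// XPathDocument(string) expects a URI, so wrap the XML text in a reader.
var docNav = new XPathDocument(new StringReader(xml));
// Create a navigator to query with XPath.
var nav = docNav.CreateNavigator();
// Find the text of every element under the root node
var expression = "/root//*/text()";
// Execute the XPath expression; a node-set expression yields an iterator
var results = nav.Select(expression);
// Do some stuff with each text node (results.Current.Value after MoveNext())
....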
References:
Querying XML, XPath syntax
Here is a possible solution following your initial code:
private string extractSectionContents(XElement section)
{
string output = "";
foreach(var node in section.Nodes())
{
if(node.NodeType == System.Xml.XmlNodeType.Text)
{
output += string.Format("{0}", node);
}
else if(node.NodeType == System.Xml.XmlNodeType.Element)
{
output += string.Format(" {0} ", ((XElement)node).Value);
}
}
return output;
}
A problem with your logic is that periods will be preceded by a space when placed right after an element.
You are looking at "mixed content" nodes. There is nothing particularly special about them: just get all child nodes (text nodes are nodes too) and join their values with a space.
Something like
var result = String.Join(" ",
root.Nodes().Select(x => x is XText ? ((XText)x).Value : ((XElement)x).Value));
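Pulling the suggestions above together, here is one possible sketch that pads every text node with a space and then tidies up the space that ends up before punctuation. The helper name FlattenText and the punctuation list are assumptions for illustration, not something from the original posts.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: joins all descendant text nodes with single spaces,
// then removes the space that padding leaves in front of punctuation.
static string FlattenText(XElement element)
{
    string joined = string.Join(" ",
        element.DescendantNodes().OfType<XText>().Select(t => t.Value.Trim()));

    joined = Regex.Replace(joined, @"\s+", " ");                 // collapse runs of whitespace
    return Regex.Replace(joined, @"\s+([.,;:!?])", "$1").Trim(); // "Galaxy ." -> "Galaxy."
}

var sample = XElement.Parse(
    "<sample>A good story is the<book>Hitchhikers Guide to the Galaxy</book>. " +
    "It was published<date>a long time ago</date>. I usually read at<time>9pm</time>.</sample>");

Console.WriteLine(FlattenText(sample));
```

Because it walks DescendantNodes rather than Nodes, nested tags at any depth are handled without explicit recursion.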
I am using the following code to import XML into a dataset:
DataSet dataSet = new DataSet();
dataSet.ReadXml(file.FullName);
if (dataSet.Tables.Count > 0) //not empty XML file
{
da.ClearFieldsForInsert();
DataRow order = dataSet.Tables["Orders"].Rows[0];
da.AddStringForInsert("ProductDescription", order["ProductDescription"].ToString());
}
Special characters such as &apos; are not getting translated to ' as I would have thought they should.
I can convert them myself in code, but would have thought the ReadXML method should do it automatically.
Is there anything I've missed here?
EDIT:
Relevant line of XML file:
<ProductDescription>Grey &apos;Aberdeen&apos; double wardrobe</ProductDescription>
EDIT:
I then tried using XElement:
XDocument doc = XDocument.Load(file.FullName);
XElement order = doc.Root.Elements("Orders").FirstOrDefault();
...
if (order != null)
{
da.ClearFieldsForInsert();
IEnumerable<XElement> items = doc.Root.Elements("Orders");
foreach (XElement item in items)
{
da.ClearFieldsForInsert();
da.AddStringForInsert("ProductDescription", item.Element("ProductDescription").Value);
}
}
Still not getting converted!
As stated here, &apos; is a valid XML escape code.
However, it is not necessary to escape ' in element values.
<ProductDescription>Grey 'Aberdeen' double wardrobe</ProductDescription>
is valid XML.
Workaround aside, a standards compliant XML parser should honour the predefined entities, wherever they occur (except in CDATA.)
This frailty of DataSet.ReadXml, and its deviation from standard XML parsing, is noted in the documentation. I quote:
The DataSet itself only escapes illegal
XML characters in XML element names and hence can only consume the
same. When legal characters in XML element name are escaped, the
element is ignored while processing.
Due to its limitations, I wouldn't use DataTable.ReadXml for XML parsing. Instead, you could use XDocument, something like this:
using System.Xml.Linq;
...
var doc = XDocument.Load(file.FullName);
var order = doc.Root.Elements("Orders").FirstOrDefault();
if (order != null)
{
da.ClearFieldsForInsert();
var productDescription = order.Element("ProductDescription");
da.AddStringForInsert(
"ProductDescription",
productDescription.Value);
}
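As a quick check that LINQ to XML resolves the predefined entities, here is a self-contained sketch. The XML literal is a cut-down stand-in for the file in the question, not its real structure.

```csharp
using System;
using System.Xml.Linq;

// Assumed, simplified input: a single order with an escaped description.
var doc = XDocument.Parse(
    "<Orders><ProductDescription>Grey &apos;Aberdeen&apos; double wardrobe</ProductDescription></Orders>");

// Value returns the decoded text, so &apos; comes back as a plain apostrophe.
string description = doc.Root.Element("ProductDescription").Value;

Console.WriteLine(description);
```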
I have a whole pile of HTML which is just a bunch of this:
<li id="entry-c7" data-user="ThisIsSomeonesUsername">
<img width="28" height="28" class="avatar" src="http://very_long_url.png">
<span class="time">6:07</span>
<span class="username">ThisIsSomeonesUsername</span>
<span class="message">This is my message. It is nice, no?</span>
</li>
Repeated over and over again about a hundred thousand times (with different content, of course). This is all taken from an HTMLDocument by retrieving the element which holds all this. The document is retrieved from a WebBrowser in a Windows Form. This looks like:
HtmlDocument document = webBrowser1.Document;
HtmlElement element = document.GetElementById(chatElementId);
Assume "chatElementId" is just some known ID. What I would like to do is retrieve the content in "time" (6:07 in this example), "username" (ThisIsSomeonesUsername), and "message" (This is my message... etc.). The message portion can contain almost anything, including further html (such as links, images, etc.), but I want to keep all that intact. I was going to use a regular expression to parse the InnerHtml of the element retrieved using the method above, but apparently this will bring about the destruction of the universe. How then should I go about doing this?
Edit: People keep suggesting Html Agility Pack, so is there an easy way to go about doing this in Html Agility Pack without using the full HTML source? I'm not sure if the rest of the html outside of this class is all that great... but should I just pass the whole html anyway?
Read the link in Nico's answer ... I was about to post the same one (it's hilarious).
Having said that, from your comments it seems like you're intent on regex. So, regex it away.
It shouldn't be hard to do.
Go to http://regexpal.com/, paste your data in the bottom part, play with the regex in the top part until you're happy with the result, then just loop over your data and extract what you need to your heart's content.
(I'm not sure if I'd do it, but sometimes a quick fix is better than a long more "correct" answer).
Just an FYI: regex can't parse HTML in any usable fashion. See "RegEx match open tags except XHTML self-contained tags", for those who stumble across this post.
Now, for your requirement, have you tried using XmlDocument or XDocument?
Just try the following. Note that the img tag is missing the closing />; if that is the case in your HTML, this won't work, as it's not valid XML.
//parse the xml
var xDoc = XDocument.Parse(html);
//create our list of results (basic tuple here, could be your class)
List<Tuple<string, string, string>> attributes = new List<Tuple<string, string, string>>();
//iterate all li elements
foreach (var element in xDoc.Root.Elements("li"))
{
//set the default values
string time = "",
username = "",
message = "";
//get the time, username message attributes
XElement tElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "time");
XElement uElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "username");
XElement mElem = element.Elements("span").FirstOrDefault(x => x.Attributes("class").Count() > 0 && x.Attribute("class").Value == "message");
//set our values based on element results
if (tElem != null)
time = tElem.Value;
if (uElem != null)
username = uElem.Value;
if (mElem != null)
message = mElem.Value;
//add to our list
attributes.Add(new Tuple<string, string, string>(time, username, message));
}
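The three near-identical lookups above could be folded into one small helper. This is only a sketch: SpanByClass is a made-up name, and the sample markup assumes the img tag has been self-closed so that the fragment is well-formed XML.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: first <span> child whose class attribute equals cssClass.
// Casting a missing attribute to string yields null, so no Count() check is needed.
static XElement SpanByClass(XElement li, string cssClass) =>
    li.Elements("span").FirstOrDefault(s => (string)s.Attribute("class") == cssClass);

var li = XElement.Parse(
    "<li id=\"entry-c7\" data-user=\"ThisIsSomeonesUsername\">" +
    "<img width=\"28\" height=\"28\" class=\"avatar\" src=\"http://very_long_url.png\" />" +
    "<span class=\"time\">6:07</span>" +
    "<span class=\"username\">ThisIsSomeonesUsername</span>" +
    "<span class=\"message\">This is my message. It is nice, no?</span>" +
    "</li>");

// Casting a null XElement to string also yields null, hence the ?? fallback.
string time = (string)SpanByClass(li, "time") ?? "";
string username = (string)SpanByClass(li, "username") ?? "";
string message = (string)SpanByClass(li, "message") ?? "";

Console.WriteLine($"{time} {username}: {message}");
```

Keep in mind that reading the element as a string (like Value) flattens any nested markup inside the message to plain text.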
When I load this XML node, the HTML within the node is being completely stripped out.
This is the code I use to get the value within the node, which is text combined with HTML:
var stuff = innerXml.Descendants("root").Elements("details").FirstOrDefault().Value;
Inside the "details" node is text that looks like this:
"This is <strong>test copy</strong>. This is A Link"
When I look in "stuff" var I see this:
"This is test copy. This is A Link". There is no HTML in the output... it is pulled out.
Maybe Value should be innerXml or innerHtml? Does FirstOrDefault() have anything to do with this?
I don't think the xml needs a "cdata" block...
Here is a more complete code snippet:
announcements =
from link in xdoc.Descendants(textContainer).Elements(textElement)
where link.Parent.Attribute("id").Value == Announcement.NodeId
select new AnnouncmentXml
{
NodeId = link.Attribute("id").Value,
InnerXml = link.Value
};
XDocument innerXml;
innerXml = XDocument.Parse(item.InnerXml);
var abstractText = innerXml.Descendants("root").Elements("abstract").FirstOrDefault().Value;
Finally, here is a snippet of the XML node. Notice how there is "InnerXml" within the standard XML structure. It starts with <root>. I call this the "InnerXml", and this is what I am passing into the XDocument called innerXml:
<text id="T_403080"><root> <title>How do I do stuff?</title> <details> Look Here Some Form. Please note that lorem ipsum dlor sit amet.</details> </root></text>
[UPDATE]
I tried to use this helper lambda, and it returns the HTML, but it is escaped, so when it displays on the page I see the actual HTML in the view (instead of giving a link, the tag is printed to the screen):
Title = innerXml.Descendants("root").Elements("title").FirstOrDefault().Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString());
So I tried both HTMLEncode and HTMLDecode but neither helped. One showed the escaped chars on the screen and the other did nothing:
Title =
System.Web.HttpContext.Current.Server.HtmlDecode(
innerXml.Descendants("root").Elements("details").Nodes().Aggregate(new System.Text.StringBuilder(), (sb, node) => sb.Append(node.ToString()), sb => sb.ToString())
);
I ended up using an XmlDocument instead of an XDocument. It doesn't seem like LINQ to XML is mature enough to support what I am trying to do. There is no InnerXml property on an XDoc, only Value.
Maybe someday I will be able to revert to LINQ. For now, I just had to get this off my plate. Here is my solution:
// XmlDoc to hold custom Xml within each node
XmlDocument innerXml = new XmlDocument();
try
{
// Parse inner xml of each item and create objects
foreach (var faq in faqs)
{
innerXml.LoadXml(faq.InnerXml);
FAQ oFaq = new FAQ();
#region Fields
// Get Title value if node exists and is not null
if (innerXml.SelectSingleNode("root/title") != null)
{
oFaq.Title = innerXml.SelectSingleNode("root/title").InnerXml;
}
// Get Details value if node exists and is not null
if (innerXml.SelectSingleNode("root/details") != null)
{
oFaq.Description = innerXml.SelectSingleNode("root/details").InnerXml;
}
#endregion
result.Add(oFaq);
}
}
catch (Exception ex)
{
// Handle Exception
}
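For anyone who wants to stay with LINQ to XML, one common workaround is to concatenate the serialized child nodes, which gives roughly what InnerXml returns on an XmlNode. The extension-free helper below is a sketch with a made-up name, not a built-in API.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: serializes each child node and joins the results,
// so nested tags such as <strong> are kept instead of flattened like Value does.
static string InnerXml(XElement element) =>
    string.Concat(element.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));

var details = XElement.Parse(
    "<details>This is <strong>test copy</strong>. This is A Link</details>");

Console.WriteLine(details.Value);      // text only, tags dropped
Console.WriteLine(InnerXml(details));  // child markup preserved
```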
I do think wrapping your details node in a CDATA block is the right decision. CDATA basically indicates that the information contained within it should be treated as text, and not parsed for XML special characters. The HTML characters in the details node, especially the < and >, are in direct conflict with the XML spec, and should really be marked as text.
You might be able to hack around this by grabbing the innerXml, but if you have control over the document content, cdata is the correct decision.
In case you need an example of how that should look, here's a modified version of the detail node:
<details>
<![CDATA[
This is <strong>test copy</strong>. This is A Link
]]>
</details>
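To illustrate how a CDATA section behaves when read back with LINQ to XML, here is a short sketch. The content is trimmed onto one line for brevity, and the variable names are only illustrative.

```csharp
using System;
using System.Xml.Linq;

var details = XElement.Parse(
    "<details><![CDATA[This is <strong>test copy</strong>. This is A Link]]></details>");

// Value returns the CDATA content verbatim, markup included.
string html = details.Value;
Console.WriteLine(html);

// Going the other way: XCData wraps raw markup without escaping it on output.
var rebuilt = new XElement("details", new XCData("This is <strong>test copy</strong>."));
Console.WriteLine(rebuilt.ToString(SaveOptions.DisableFormatting));
```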