I'm working on this in C# (.NET, Visual Studio 2013).
I have a scenario where the XML structure looks like this:
<td>
<text text="abc">abc
<tspan text = "bcd">bcd
<tspan text = "def">def
<tspan text = "gef">gef
</tspan>
</tspan>
</tspan>
</text>
</td>
As shown above, I don't know how many tspan nodes there will be. Currently I have 3, but I may get 4 or more.
Once I have found the text node, I get its value with:
labelNode.Attributes["text"].Value
To get the tspan node nested directly inside it, I use:
labelNode.FirstChild.Attributes["text"].Value
To get the tspan node nested one level deeper, I use:
labelNode.FirstChild.FirstChild.Attributes["text"].Value
And so on for each further level.
Now my question is: if I know that I have 5 tags, is there any way to dynamically apply "FirstChild" 5 times to "labelNode" so that I can get the text value of the last node, like this:
labelNode.FirstChild.FirstChild.FirstChild.FirstChild.FirstChild.Attributes["text"].Value
If I need the 2nd value I have to add it twice; if I need the 3rd, three times.
Please let me know if there is a solution for this.
Please ask if anything in my question is unclear.
Thank you all in advance.
Rather than adding FirstChild dynamically, I think this would be a simpler solution:
static XmlNode GetFirstChildNested(XmlNode node, int level) {
XmlNode ret = node;
while (level > 0 && ret != null) {
ret = ret.FirstChild;
level--;
}
return ret;
}
Then you could use this function like this:
var firstChild5 = GetFirstChildNested(labelNode, 5);
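Note that in the XML exactly as posted, the first child of `<text>` is the text node "abc" rather than the nested `<tspan>`, so a pure FirstChild chain only works when each element's first child really is the next element. As a hedged alternative, the sketch below uses an XPath query to collect the "text" attribute at every nesting level regardless of depth. The class and method names (`NestedTextDemo`, `CollectTextValues`) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class NestedTextDemo
{
    // Returns the "text" attribute of labelNode and of every element nested
    // below it, in document order (outermost first).
    public static List<string> CollectTextValues(XmlNode labelNode)
    {
        var values = new List<string>();
        // The node itself plus all descendant elements that carry a "text" attribute.
        XmlNodeList nodes = labelNode.SelectNodes("descendant-or-self::*[@text]");
        foreach (XmlNode node in nodes)
        {
            values.Add(node.Attributes["text"].Value);
        }
        return values;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<td><text text=\"abc\">abc<tspan text=\"bcd\">bcd" +
                    "<tspan text=\"def\">def<tspan text=\"gef\">gef" +
                    "</tspan></tspan></tspan></text></td>");

        // The indexer returns the first child element named "text".
        XmlNode labelNode = doc.DocumentElement["text"];
        List<string> values = CollectTextValues(labelNode);

        Console.WriteLine(string.Join(", ", values)); // abc, bcd, def, gef
        Console.WriteLine(values[values.Count - 1]);  // gef (the innermost level)
    }
}
```

With the list in hand, the value at level n is simply `values[n]`, so no chain of FirstChild calls has to be built.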
I would suggest using LINQ to XML, which offers a cleaner way of parsing XML.
Using XElement (or XDocument), you can flatten the hierarchy by calling the Descendants method and then run whatever queries you need.
For example:
XElement doc = XElement.Load(filepath);
var results = doc.Descendants()
    .Select(x => (string)x.Attribute("text"));
// which returns:
// abc,
// bcd,
// def,
// gef
If you want to get the last child, you can simply use:
XElement doc= XElement.Load(filepath);
doc.Descendants()
.Last() // get last element in hierarchy.
.Attribute("text").Value
If you want to get the third element, you could do something like this:
XElement doc= XElement.Load(filepath);
doc.Descendants()
.Skip(2) // Skip first two.
.First()
.Attribute("text").Value ;
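Since the number of nested elements is not known in advance, `Skip(n).First()` throws when there are fewer elements than expected. A minimal sketch of a more forgiving lookup follows, using `ElementAtOrDefault`. The helper name `TextAt` and the sample XML are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DescendantLookupDemo
{
    // Returns the "text" attribute of the descendant at the given zero-based
    // position, or null when the position is out of range or the attribute is missing.
    public static string TextAt(XElement root, int index)
    {
        XElement element = root.Descendants().ElementAtOrDefault(index);
        return element == null ? null : (string)element.Attribute("text");
    }

    public static void Main()
    {
        XElement root = XElement.Parse(
            "<td><text text=\"abc\"><tspan text=\"bcd\"><tspan text=\"def\"/></tspan></text></td>");

        Console.WriteLine(TextAt(root, 2));          // def
        Console.WriteLine(TextAt(root, 10) == null); // True (no such element)
    }
}
```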
Check this Demo
Here's some fantastic example XML:
<root>
<section>Here is some text<mightbe>a tag</mightbe>might <not attribute="be" />. Things are just<label>a mess</label>but I have to parse it because that's what needs to be done and I can't <font stupid="true">control</font> the source. <p>Why are there p tags here?</p>Who knows, but there may or may not be spaces around them so that's awesome. The point here is, there's node soup inside the section node and no definition for the document.</section>
</root>
I'd like to just grab the text from the section node and all sub-nodes as strings. BUT, note that there may or may not be spaces around the sub-nodes, so I want to pad the sub-nodes and append a space.
Here's a more precise example of what input might look like, and what I'd like output to be:
<root>
<sample>A good story is the<book>Hitchhikers Guide to the Galaxy</book>. It was published<date>a long time ago</date>. I usually read at<time>9pm</time>.</sample>
</root>
I'd like the output to be:
A good story is the Hitchhikers Guide to the Galaxy. It was published a long time ago. I usually read at 9pm.
Note that the child nodes don't have spaces around them, so I need to pad them otherwise the words run together.
I was attempting to use this sample code:
XDocument doc = XDocument.Parse(xml);
foreach(var node in doc.Root.Elements("section"))
{
output += String.Join(" ", node.Nodes().Select(x => x.ToString()).ToArray()) + " ";
}
But the output includes the child tags, and is not going to work out.
Any suggestions here?
TL;DR: Was given node soup xml and want to stringify it with padding around child nodes.
In case you have tags nested to an unknown level (e.g. <date>a <i>long</i> time ago</date>), you might also want to recurse so that the formatting is applied consistently throughout. For example:
private static string Parse(XElement root)
{
return root
.Nodes()
.Select(a => a.NodeType == XmlNodeType.Text ? ((XText)a).Value : Parse((XElement)a))
.Aggregate((a, b) => String.Concat(a.Trim(), b.StartsWith(".") ? String.Empty : " ", b.Trim()));
}
You could try using xpath to extract what you need
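// If xml holds the XML text (not a file path or URI), wrap it in a StringReader;
// XPathDocument(string) would otherwise treat the string as a location to load.
var docNav = new XPathDocument(new StringReader(xml));
// Create a navigator to query with XPath.
var nav = docNav.CreateNavigator();
// Find the text of every element under the root node
var expression = "/root//*/text()";
// Execute the XPath expression; Select returns an iterator over the matching text nodes
XPathNodeIterator textNodes = nav.Select(expression);
// Do some stuff with textNodes, e.g. read textNodes.Current.Value inside while (textNodes.MoveNext())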
....
References:
Querying XML, XPath syntax
Here is a possible solution following your initial code:
private string extractSectionContents(XElement section)
{
string output = "";
foreach(var node in section.Nodes())
{
if(node.NodeType == System.Xml.XmlNodeType.Text)
{
output += string.Format("{0}", node);
}
else if(node.NodeType == System.Xml.XmlNodeType.Element)
{
output += string.Format(" {0} ", ((XElement)node).Value);
}
}
return output;
}
A problem with your logic is that periods will be preceded by a space when placed right after an element.
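One possible way around the space-before-period issue is to join all text nodes with a single space and then remove any whitespace that ends up directly in front of punctuation. This is only a sketch: the names `MixedContentDemo` and `Flatten` are made up, and the punctuation set in the regular expression is an assumption that may need adjusting for real input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class MixedContentDemo
{
    public static string Flatten(XElement section)
    {
        // Every text node at any depth, trimmed; empty pieces are dropped.
        var pieces = section.DescendantNodes()
            .OfType<XText>()
            .Select(t => t.Value.Trim())
            .Where(s => s.Length > 0);

        string joined = string.Join(" ", pieces);

        // Remove the whitespace that the join leaves in front of punctuation.
        return Regex.Replace(joined, @"\s+([.,;:!?])", "$1");
    }

    public static void Main()
    {
        XElement sample = XElement.Parse(
            "<sample>A good story is the<book>Hitchhikers Guide to the Galaxy</book>" +
            ". It was published<date>a long time ago</date>" +
            ". I usually read at<time>9pm</time>.</sample>");

        // A good story is the Hitchhikers Guide to the Galaxy. It was published a long time ago. I usually read at 9pm.
        Console.WriteLine(Flatten(sample));
    }
}
```

Because it walks `DescendantNodes()`, the same code also covers tags nested to an arbitrary depth.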
You are looking at "mixed content" nodes. There is nothing particularly special about them: just get all child nodes (text nodes are nodes too) and join their values with a space.
Something like
var result = String.Join("",
root.Nodes().Select(x => x is XText ? ((XText)x).Value : ((XElement)x).Value));
This is my XML:
<?xml version="1.0"?>
<formatlist>
<format>
<formatName>WHC format</formatName>
<delCol>ID</delCol>
<delCol>CDRID</delCol>
<delCol>TGIN</delCol>
<delCol>IPIn</delCol>
<delCol>TGOUT</delCol>
<delCol>IPOut</delCol>
<srcNum>SRCNum</srcNum>
<distNum>DSTNum</distNum>
<connectTime>ConnectTime</connectTime>
<duration>Duration</duration>
</format>
<format>
<formatName existCombineCol="1">Umobile format</formatName> <!-- this format -->
<delCol>billing_operator</delCol>
<hideCol>event_start_date</hideCol>
<hideCol>event_start_time</hideCol>
<afCombineName dateType="DateTime" format="dd/MM/yyyy HH:mm:ss"> <!-- node I want -->
<name>ConnectdateTimeAFcombine</name>
<combineDate>event_start_date</combineDate>
<combineTime>event_start_time</combineTime>
</afCombineName>
<afCombineName dateType="DateTime" format="dd/MM/yyyy HH:mm:ss"> <!-- node I want -->
<name>aaa</name>
<combineDate>bbb</combineDate>
<combineTime>ccc</combineTime>
</afCombineName>
<modifyPerfixCol action="add" perfix="60">bnum</modifyPerfixCol>
<srcNum>anum</srcNum>
<distNum>bnum</distNum>
<connectTime>ConnectdateTimeAFcombine</connectTime>
<duration>event_duration</duration>
</format>
</formatlist>
I want to find the format whose formatName is "Umobile format" and then iterate over these two nodes:
<afCombineName dateType="DateTime" format="dd/MM/yyyy HH:mm:ss"> <!-- node I want -->
<name>ConnectdateTimeAFcombine</name>
<combineDate>event_start_date</combineDate>
<combineTime>event_start_time</combineTime>
</afCombineName>
<afCombineName dateType="DateTime" format="dd/MM/yyyy HH:mm:ss"> <!-- node I want -->
<name>aaa</name>
<combineDate>bbb</combineDate>
<combineTime>ccc</combineTime>
</afCombineName>
and list the child nodes of both. The result should look like this:
ConnectdateTimeAFcombine,event_start_date,event_start_time.
aaa,bbb,ccc
How can I do this?
foreach(var children in format.Descendants())
{
//Do something with the child nodes of format.
}
For all XML-related traversal, you should get used to using XPath expressions. They are very useful. Even if something simpler would do in your specific case, it is good practice to use XPath. This way, if your schema changes at some point, you just update your XPath expression and your code will be up and running again.
For a complete example, you can have a look at this article.
You can use the System.Xml namespace APIs along with System.Xml.XPath namespace API. Here is a quick algorithm that will help you do your task:
Fetch the formatName element whose text is Umobile format using the XPath below:
XmlNode umobileFormatNameNode = document.SelectSingleNode("//formatName[text()='Umobile format']");
Now the parent of umobileFormatNameNode will be the node that you are interested in:
XmlNode formatNode = umobileFormatNameNode.ParentNode;
Now get the children for this node:
XmlNodeList afCombineFormatNodes = formatNode.SelectNodes("afCombineName");
You can now process the list of afCombineFormatNodes:
foreach (XmlNode xmlNode in afCombineFormatNodes)
{
//process nodes
}
This way you can access those elements:
var doc = System.Xml.Linq.XDocument.Load("PATH TO YOUR XML FILE");
var result = doc.Descendants("format")
.Where(x => (string)x.Element("formatName") == "Umobile format")
.SelectMany(x => x.Elements("afCombineName")); // Elements (plural) returns every afCombineName, not just the first
Then you can iterate the result this way:
foreach (var item in result)
{
// Value is already a string, so no ToString() call is needed
string format = item.Attribute("format").Value;
string name = item.Element("name").Value;
string combineDate = item.Element("combineDate").Value;
string combineTime = item.Element("combineTime").Value;
}
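To produce the comma-separated lines asked for in the question, the child values of each afCombineName can be joined directly. The following is a minimal sketch under the assumption that every child element of afCombineName should appear in the output; the names `FormatListDemo` and `CombineRows` are illustrative only, and the XML in `Main` is a trimmed copy of the question's document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FormatListDemo
{
    // One comma-separated line per <afCombineName> inside the named <format>.
    public static List<string> CombineRows(XDocument doc, string formatName)
    {
        return doc.Descendants("format")
            .Where(f => (string)f.Element("formatName") == formatName)
            .SelectMany(f => f.Elements("afCombineName"))
            .Select(a => string.Join(",", a.Elements().Select(e => e.Value)))
            .ToList();
    }

    public static void Main()
    {
        XDocument doc = XDocument.Parse(
            "<formatlist>" +
            "<format><formatName>WHC format</formatName><delCol>ID</delCol></format>" +
            "<format><formatName existCombineCol=\"1\">Umobile format</formatName>" +
            "<afCombineName dateType=\"DateTime\"><name>ConnectdateTimeAFcombine</name>" +
            "<combineDate>event_start_date</combineDate><combineTime>event_start_time</combineTime></afCombineName>" +
            "<afCombineName dateType=\"DateTime\"><name>aaa</name>" +
            "<combineDate>bbb</combineDate><combineTime>ccc</combineTime></afCombineName>" +
            "</format></formatlist>");

        foreach (string row in CombineRows(doc, "Umobile format"))
        {
            Console.WriteLine(row);
        }
        // ConnectdateTimeAFcombine,event_start_date,event_start_time
        // aaa,bbb,ccc
    }
}
```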
Here is what I have so far:
HtmlAgilityPack.HtmlDocument ht = new HtmlAgilityPack.HtmlDocument();
TextReader reader = File.OpenText(@"C:\Users\TheGateKeeper\Desktop\New folder\html.txt");
ht.Load(reader);
reader.Close();
HtmlNode select= ht.GetElementbyId("cats[]");
List<HtmlNode> options = new List<HtmlNode>();
foreach (HtmlNode option in select.ChildNodes)
{
if (option.Name == "option")
{
options.Add(option);
}
}
Now I have a list of all the "options" for the select element. What properties do I need to access to get the key and the text?
So if for example the html for one option would be:
<option class="level-1" value="1">Funky Town</option>
I want to get as output:
1 - Funky Town
Thanks
Edit: I just noticed something. When I got the child elements of the "Select" elements, it returned elements of type "option" and elements of type "#text".
Hmmm .. #text has the string I want, but select has the value.
I thought HtmlAgilityPack was an HTML parser? Why did it give me confusing values like this?
This is due to the default configuration for the html parser; it has configured the <option> as HtmlElementFlag.Empty (with the comment 'they sometimes contain, and sometimes they don't...'). The <form> tag has the same setup (CanOverlap + Empty) which causes them to appear as empty nodes in the dom, without any child nodes.
You need to remove that flag before parsing the document.
HtmlNode.ElementsFlags.Remove("option");
Notice that the ElementsFlags property is static and any changes will affect all further parsing.
Edit: you should probably select the option nodes directly via XPath. I think this should work for that:
var options = select.SelectNodes("option");
That will get your options without the text nodes. The options should contain the string you want somewhere. Waiting for your HTML sample.
foreach (var option in options)
{
int value = int.Parse(option.Attributes["value"].Value);
string text = option.InnerText;
}
You can add some sanity checking on the attribute to make sure it exists before reading its value.
I have a simple XML
<AllBands>
<Band>
<Beatles ID="1234" started="1962">greatest Band<![CDATA[lalala]]></Beatles>
<Last>1</Last>
<Salary>2</Salary>
</Band>
<Band>
<Doors ID="222" started="1968">regular Band<![CDATA[lalala]]></Doors>
<Last>1</Last>
<Salary>2</Salary>
</Band>
</AllBands>
However, when I try to reach the "Doors" band to change its ID:
using (var stream = new StringReader(result))
{
XDocument xmlFile = XDocument.Load(stream);
var query = from c in xmlFile.Elements("Band")
select c;
...
the query has no results.
But if I write xmlFile.Elements().Elements("Band"), it does find it.
What is the problem?
Is the full path from the root needed?
And if so, why did it work without specifying AllBands?
Does XDocument navigation require me to know the full structure, level by level, down to the required element?
Elements() will only check direct children - which in the first case is the root element, in the second case children of the root element, hence you get a match in the second case. If you just want any matching descendant use Descendants() instead:
var query = from c in xmlFile.Descendants("Band") select c;
Also, I would suggest you restructure your XML: the band name should be an attribute or element value, not the element name itself. Using it as the element name makes querying (and schema validation, for that matter) much harder. Something like this would be easier to work with:
<Band>
<BandProperties Name ="Doors" ID="222" started="1968" />
<Description>regular Band<![CDATA[lalala]]></Description>
<Last>1</Last>
<Salary>2</Salary>
</Band>
You can do it this way:
xml.Descendants().SingleOrDefault(p => p.Name.LocalName == "Name of the node to find")
where xml is a XDocument.
Be aware that the property Name returns an object that has a LocalName and a Namespace. That's why you have to use Name.LocalName if you want to compare by name.
You should use Root to refer to the root element:
xmlFile.Root.Elements("Band")
If you want to find elements anywhere in the document use Descendants instead:
xmlFile.Descendants("Band")
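A small sketch comparing the three calls side by side, and then changing the ID of the Doors element as the question intended. The XML is a shortened version of the question's document, and `BandQueryDemo` and `SetDoorsId` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class BandQueryDemo
{
    public const string Xml =
        "<AllBands>" +
        "<Band><Beatles ID=\"1234\" started=\"1962\">greatest Band</Beatles></Band>" +
        "<Band><Doors ID=\"222\" started=\"1968\">regular Band</Doors></Band>" +
        "</AllBands>";

    // Sets the ID attribute of the first <Doors> element; returns false if there is none.
    public static bool SetDoorsId(XDocument doc, string newId)
    {
        XElement doors = doc.Descendants("Band").Elements("Doors").FirstOrDefault();
        if (doors == null)
        {
            return false;
        }
        doors.SetAttributeValue("ID", newId);
        return true;
    }

    public static void Main()
    {
        XDocument doc = XDocument.Parse(Xml);

        // The document's only child element is <AllBands>, so this finds nothing.
        Console.WriteLine(doc.Elements("Band").Count());      // 0
        // Direct children of the root element.
        Console.WriteLine(doc.Root.Elements("Band").Count()); // 2
        // Any depth below the document.
        Console.WriteLine(doc.Descendants("Band").Count());   // 2

        SetDoorsId(doc, "333");
        Console.WriteLine(doc.Descendants("Doors").First().Attribute("ID").Value); // 333
    }
}
```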
The problem is that Elements only takes the direct child elements of whatever you call it on. If you want all descendants, use the Descendants method:
var query = from c in xmlFile.Descendants("Band")
My experience when working with large & complicated XML files is that sometimes neither Elements nor Descendants seem to work in retrieving a specific Element (and I still do not know why).
In such cases, I found that a much safer option is to manually search for the Element, as described by the following MSDN post:
https://social.msdn.microsoft.com/Forums/vstudio/en-US/3d457c3b-292c-49e1-9fd4-9b6a950f9010/how-to-get-tag-name-of-xml-by-using-xdocument?forum=csharpgeneral
In short, you can create a GetElement function:
private XElement GetElement(XDocument doc,string elementName)
{
foreach (XNode node in doc.DescendantNodes())
{
if (node is XElement)
{
XElement element = (XElement)node;
if (element.Name.LocalName.Equals(elementName))
return element;
}
}
return null;
}
Which you can then call like this:
XElement element = GetElement(doc,"Band");
Note that this will return null if no matching element is found.
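One common reason why Elements or Descendants appear to find nothing, offered here as an assumption since the original files are not shown, is an XML namespace. `Descendants("Band")` matches only elements named Band in no namespace, whereas comparing `Name.LocalName` ignores the namespace, which would explain why the manual search succeeds. A minimal sketch with a made-up namespace URI (`urn:example:bands`) and illustrative method names:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NamespaceDemo
{
    // Counts descendants whose full name (namespace + local name) matches.
    public static int CountByFullName(XDocument doc, XName name)
    {
        return doc.Descendants(name).Count();
    }

    // Counts descendants by local name only, ignoring any namespace.
    public static int CountByLocalName(XDocument doc, string localName)
    {
        return doc.Descendants().Count(e => e.Name.LocalName == localName);
    }

    public static void Main()
    {
        // The default namespace applies to <AllBands> and to both <Band> elements.
        XDocument doc = XDocument.Parse(
            "<AllBands xmlns=\"urn:example:bands\"><Band/><Band/></AllBands>");
        XNamespace ns = "urn:example:bands";

        Console.WriteLine(CountByFullName(doc, "Band"));      // 0: no namespace given
        Console.WriteLine(CountByFullName(doc, ns + "Band")); // 2: namespace-qualified name
        Console.WriteLine(CountByLocalName(doc, "Band"));     // 2: namespace ignored
    }
}
```

If the document does use a namespace, passing `ns + "Band"` to Descendants is usually enough and avoids the manual loop.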
The Elements() method returns an IEnumerable<XElement> containing all child elements of the current node. For an XDocument, that collection only contains the Root element. Therefore the following is required:
var query = from c in xmlFile.Root.Elements("Band")
select c;
Sebastian's answer was the only answer that worked for me while examining a xaml document. If, like me, you'd like a list of all the elements then the method would look a lot like Sebastian's answer above but just returning a list...
private static List<XElement> GetElements(XDocument doc, string elementName)
{
List<XElement> elements = new List<XElement>();
foreach (XNode node in doc.DescendantNodes())
{
if (node is XElement)
{
XElement element = (XElement)node;
if (element.Name.LocalName.Equals(elementName))
elements.Add(element);
}
}
return elements;
}
Call it thus:
var elements = GetElements(xamlFile, "Band");
or in the case of my xaml doc where I wanted all the TextBlocks, call it thus:
var elements = GetElements(xamlFile, "TextBlock");
Is it possible to get the path of the current XElement in an XDocument? For example, if I'm iterating over the nodes in a document is there some way I can get the path of that node (XElement) so that it returns something like \root\item\child\currentnode ?
There's nothing built in, but you could write your own extension method:
public static string GetPath(this XElement node)
{
string path = node.Name.ToString();
XElement currentNode = node;
while (currentNode.Parent != null)
{
currentNode = currentNode.Parent;
path = currentNode.Name.ToString() + @"\" + path;
}
return path;
}
XElement node = ..
string path = node.GetPath();
This doesn't account for the position of the element within its peer group though.
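If the position within the peer group matters, a variant can append a 1-based index counted among preceding siblings of the same name. This is a hedged sketch rather than a complete XPath generator: `GetIndexedPath` is a hypothetical name, forward slashes are used so the result resembles XPath, and only local names are emitted, so namespaces are ignored.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XElementPathExtensions
{
    // Builds a path such as /root/item[2]/child[1] for the given element.
    public static string GetIndexedPath(this XElement node)
    {
        var steps = node.AncestorsAndSelf()
            .Reverse() // outermost element first
            .Select(e =>
            {
                // 1-based position among preceding siblings with the same name.
                int position = e.ElementsBeforeSelf(e.Name).Count() + 1;
                // The root element has no siblings, so it gets no index.
                return e.Parent == null
                    ? e.Name.LocalName
                    : e.Name.LocalName + "[" + position + "]";
            });
        return "/" + string.Join("/", steps);
    }

    public static void Main()
    {
        XElement root = XElement.Parse("<root><item/><item><child/></item></root>");
        XElement child = root.Descendants("child").First();

        Console.WriteLine(child.GetIndexedPath()); // /root/item[2]/child[1]
    }
}
```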
I know the question is old, but in case someone wants to get a simple one liner:
XElement element = GetXElement();
var xpath = string.Join ("/", element.AncestorsAndSelf().Reverse().Select(a => a.Name.LocalName).ToArray());
Usings:
using System.Linq;
using System.Xml.Linq;
It depends on how you want to use the XPath. In either case you'll need to walk up the tree yourself and build the XPath on the way.
If you want a readable string for display, just joining the names of the parent nodes (see BrokenGlass's suggestion) works fine.
If you want to select with the XPath later on:
positional XPath (specify the position of each node in its parent) is one option (something like //[3]/*). You need to treat attributes as a special case, as there is no order defined on attributes.
XPath with pre-defined prefixes (the namespace-to-prefix mapping needs to be stored separately): /my:root/my:child[3]/o:prop1/@t:attr3
XPath with inlined namespaces, when you want a semi-readable and portable XPath: /*[name()='root' and namespace-uri()='http://my.namespace']/.... (see the specification for the name and namespace-uri functions: http://www.w3.org/TR/xpath/#function-namespace-uri)
Note that special nodes like comments and processing instructions may need to be taken into account if you want a truly generic version of the XPath to a node inside XML.