Getting text from between two html nodes using HtmlAgilityPack - c#

Suppose I have the following HTML
<p id="definition">
<span class="hw">emolument</span> \ih-MOL-yuh-muhnt\, <i>noun</i>:
The wages or perquisites arising from office, employment, or labor
</p>
I want to extract each part separately using HTMLAgilityPack in C#
I can get the word and word class easily enough
var definition = doc.DocumentNode.Descendants()
    .Where(x => x.Name == "p" && x.GetAttributeValue("id", "") == "definition")
    .FirstOrDefault();
string word = definition.Descendants()
.Where(x => x.Name == "span")
.FirstOrDefault().InnerText;
string word_class = definition.Descendants()
.Where(x => x.Name == "i")
.FirstOrDefault().InnerText;
But how do I get the pronunciation or the actual definition? These fall between nodes, and if I use definition.InnerText I get the whole lot in one string. Is there a way to do this in XPath perhaps?
How do I select text between nodes in HtmlAgilityPack?

Is there a way to do this in XPath perhaps?
Yes - and quite an easy one.
The key concept you need to understand is how text and child element nodes are organized in XML/HTML - and thus XPath.
If the textual content of an element is punctuated by child elements, they end up in separate text nodes. You can access individual text nodes by their position.
Simply using text() on any element retrieves all child text nodes. Applying //p/text() to the snippet you have shown yields (individual results separated by -------):
[EMPTY TEXT NODE, EXCEPT WHITESPACE]
-----------------------
\ih-MOL-yuh-muhnt\,
-----------------------
:
The wages or perquisites arising from office, employment, or labor
The first text node of this p element only contains whitespace, so that's probably not what you're after. //p/text()[2] retrieves
\ih-MOL-yuh-muhnt\,
and //p/text()[3]:
:
The wages or perquisites arising from office, employment, or labor
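
As a minimal sketch of the positional approach above in HtmlAgilityPack: the helper below selects the n-th direct text child of the p element and trims it. The class and method names (DefinitionParts, TextNodeAt) are made up for illustration, and it assumes the markup has the same shape as the snippet in the question.

```csharp
using HtmlAgilityPack;

public static class DefinitionParts
{
    // Returns the trimmed content of the n-th (1-based) text node that is a
    // direct child of <p id="definition">, or null if there is no such node.
    // Hypothetical helper; the name is not part of HtmlAgilityPack.
    public static string TextNodeAt(string html, int position)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // text()[n] counts only the text children of p, not the text
        // inside the nested <span> or <i> elements.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//p[@id='definition']/text()[" + position + "]");

        return node?.InnerText.Trim();
    }
}
```

With the snippet from the question, TextNodeAt(html, 2) should give the pronunciation and TextNodeAt(html, 3) the definition, still prefixed by the colon; a TrimStart(':') followed by Trim() would strip that if needed.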

HtmlNode text = doc.DocumentNode.Descendants().Where(x => x.Name == "p" && x.Id == "definition").FirstOrDefault();
foreach (HtmlNode node in text.SelectNodes(".//text()"))
{
Console.WriteLine(node.InnerText.Trim());
}
Output of this will be:
emolument
\ih-MOL-yuh-muhnt\,
noun
:
The wages or perquisites arising from office, employment, or labor
If you want only the second result, \ih-MOL-yuh-muhnt\,, use this:
HtmlNode a = text.SelectNodes(".//text()[2]").FirstOrDefault();

Related

Creating XML nodes dynamically to fetch its attribute values

I'm working on this using C# .net VS 2013.
I have a scenario where I'm having the structure as below,
<td>
<text text="abc">abc
<tspan text = "bcd">bcd
<tspan text = "def">def
<tspan text = "gef">gef
</tspan>
</tspan>
</tspan>
</text>
</td>
As shown above, I don't know how many tspan nodes will be there, currently I have 3, I may get 4 or more than that.
Once after finding the text node, to get the value of that node I'll use the code,
labelNode.Attributes["text"].Value
to get its adjacent tspan node, I have to use it like
labelNode.FirstChild.Attributes["text"].Value
and to get the tspan node nested inside that one, I have to use
labelNode.FirstChild.FirstChild.Attributes["text"].Value
Like this it will keep on going.
Now my question is, if I know that i have 5 tags, is there any way to dynamically add "FirstChild" 5 times to "labelNode" so that I can get the text value of the last node, like this
labelNode.FirstChild.FirstChild.FirstChild.FirstChild.FirstChild.Attributes["text"].Value
If I need 2nd value i need to add it 2 times, if I need 3rd then I need to add it thrice.
Please let me know is there any solution for this.
Please ask me, if you got confused with my question.
Thanking you all in advance.
Rather than adding FirstChild dynamically, I think this would be a simpler solution:
static XmlNode GetFirstChildNested(XmlNode node, int level) {
XmlNode ret = node;
while (level > 0 && ret != null) {
ret = ret.FirstChild;
level--;
}
return ret;
}
Then you could use this function like this:
var firstChild5 = GetFirstChildNested(labelNode, 5);
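
One caveat with the sample XML as posted: each element's first child is its own text content ("abc", "bcd", ...), so FirstChild returns a text node rather than the nested tspan. A hedged variation that steps through child elements only might look like this; the class and method names are illustrative and not taken from the answer.

```csharp
using System.Linq;
using System.Xml;

public static class NestedLookup
{
    // Follows the first child *element* `level` times, skipping text nodes.
    // Returns null if the chain ends before `level` steps are taken.
    public static XmlElement GetFirstChildElementNested(XmlElement element, int level)
    {
        XmlElement current = element;
        while (level > 0 && current != null)
        {
            // ChildNodes also contains text nodes; keep only elements.
            current = current.ChildNodes.OfType<XmlElement>().FirstOrDefault();
            level--;
        }
        return current;
    }

    // Convenience wrapper: reads the "text" attribute `level` elements below
    // `labelNode`, or returns null if the nesting is not that deep.
    public static string TextAttributeAtDepth(XmlElement labelNode, int level)
    {
        XmlElement target = GetFirstChildElementNested(labelNode, level);
        return target?.GetAttribute("text");
    }
}
```

Here level 0 is the text element itself, level 1 the first tspan, and so on, so the caller only needs to pass the depth it wants.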
I would suggest using LINQ to XML, which offers a cleaner way of parsing XML.
Using XElement (or XDocument), you can flatten the hierarchy by calling the Descendants method and then run all the required queries.
ex..
XElement doc= XElement.Load(filepath);
var results =doc.Descendants()
.Select(x=>(string)x.Attribute("text"));
//which returns
abc,
bcd,
def,
gef
If you want to get the last child you could simply use.
ex..
XElement doc= XElement.Load(filepath);
doc.Descendants()
.Last() // get last element in hierarchy.
.Attribute("text").Value
If you want to get third element, you could do something like this.
XElement doc= XElement.Load(filepath);
doc.Descendants()
.Skip(2) // Skip first two.
.First()
.Attribute("text").Value ;

Poorly defined XML, get node and contents of all child nodes as string concat with spaces?

Here's some fantastic example XML:
<root>
<section>Here is some text<mightbe>a tag</mightbe>might <not attribute="be" />. Things are just<label>a mess</label>but I have to parse it because that's what needs to be done and I can't <font stupid="true">control</font> the source. <p>Why are there p tags here?</p>Who knows, but there may or may not be spaces around them so that's awesome. The point here is, there's node soup inside the section node and no definition for the document.</section>
</root>
I'd like to just grab the text from the section node and all sub-nodes as strings. BUT, note that there may or may not be spaces around the sub-nodes, so I want to pad the sub-nodes and append a space.
Here's a more precise example of what input might look like, and what I'd like output to be:
<root>
<sample>A good story is the<book>Hitchhikers Guide to the Galaxy</book>. It was published<date>a long time ago</date>. I usually read at<time>9pm</time>.</sample>
</root>
I'd like the output to be:
A good story is the Hitchhikers Guide to the Galaxy. It was published a long time ago. I usually read at 9pm.
Note that the child nodes don't have spaces around them, so I need to pad them otherwise the words run together.
I was attempting to use this sample code:
XDocument doc = XDocument.Parse(xml);
foreach(var node in doc.Root.Elements("section"))
{
output += String.Join(" ", node.Nodes().Select(x => x.ToString()).ToArray()) + " ";
}
But the output includes the child tags, and is not going to work out.
Any suggestions here?
TL;DR: Was given node soup xml and want to stringify it with padding around child nodes.
In case you have tags nested to an unknown level (e.g. <date>a <i>long</i> time ago</date>), you might also want to recurse so that the formatting is applied consistently throughout. For example:
private static string Parse(XElement root)
{
return root
.Nodes()
.Select(a => a.NodeType == XmlNodeType.Text ? ((XText)a).Value : Parse((XElement)a))
.Aggregate((a, b) => String.Concat(a.Trim(), b.StartsWith(".") ? String.Empty : " ", b.Trim()));
}
You could try using xpath to extract what you need
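// XPathDocument takes a URI or a reader, so wrap the XML string in a StringReader
var docNav = new XPathDocument(new StringReader(xml));
// Create a navigator to query with XPath.
var nav = docNav.CreateNavigator();
// Find the text of every element under the root node
var expression = "/root//*/text()";
// Execute the XPath expression (a node-set expression returns an XPathNodeIterator)
var resultString = nav.Evaluate(expression);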
// Do some stuff with resultString
....
References:
Querying XML, XPath syntax
Here is a possible solution following your initial code:
private string extractSectionContents(XElement section)
{
string output = "";
foreach(var node in section.Nodes())
{
if(node.NodeType == System.Xml.XmlNodeType.Text)
{
output += string.Format("{0}", node);
}
else if(node.NodeType == System.Xml.XmlNodeType.Element)
{
output += string.Format(" {0} ", ((XElement)node).Value);
}
}
return output;
}
A problem with your logic is that periods will be preceded by a space when placed right after an element.
You are looking at "mixed content" nodes. There is nothing particularly special about them: just get all child nodes (text nodes are nodes too) and join their values with a space.
Something like
var result = String.Join(" ",
root.Nodes().Select(x => x is XText ? ((XText)x).Value : ((XElement)x).Value));
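
Putting the ideas from these answers together, here is a minimal sketch that joins every descendant text node with a single space and then removes the space that the join leaves in front of punctuation. The names (MixedContent, Flatten) and the punctuation set are assumptions for illustration, not something the source XML defines.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class MixedContent
{
    // Flattens an element with mixed content into one padded string.
    public static string Flatten(XElement element)
    {
        // DescendantNodes() walks nested elements too, so text at any
        // depth is picked up in document order.
        var pieces = element.DescendantNodes()
            .OfType<XText>()
            .Select(t => t.Value.Trim())
            .Where(s => s.Length > 0);

        string joined = string.Join(" ", pieces);

        // Joining with a space puts one before a period that directly
        // follows an element; pull such punctuation back to the word.
        return Regex.Replace(joined, @"\s+([.,;:!?])", "$1");
    }
}
```

Called on the sample element from the question, this should produce the padded sentence the question asks for, without the stray space before each period.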

Xpath to find first occurrence of two different elements

Using the example below, I would like to use XPath to find the first occurrence of two different elements. For example, I want to figure out whether b or d appears first. We can obviously tell that b appears before d (looking top-down, and not at the tree level). But how can I solve this using XPath?
<a>
<b>
</b>
</a>
<c>
</c>
<d>
</d>
Right now, I find the node (b and d in this case) by getting the first element in the nodeset, which I find using the following code:
String xPathExpression = "//*[local-name()='b']";
XPathNodeIterator nodeSet = (XPathNodeIterator)navigator.Evaluate(xPathExpression);
and
String xPathExpression = "//*[local-name()='d']";
XPathNodeIterator nodeSet = (XPathNodeIterator)navigator.Evaluate(xPathExpression);
Now using xpath, I just can't figure out which comes first, b or d.
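You want to scan the tree in document order (the order in which the elements occur). Conveniently, this is the default order of XPath results, so all you have to do is select the first element that is a <b/> or <d/> node. The parentheses matter: without them, [1] applies per parent and matches every b or d that is the first such child of its parent:
(//*[local-name() = 'b' or local-name() = 'd'])[1]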
If you want the name, add another local-name(...) call:
local-name(//*[local-name() = 'b' or local-name() = 'd'][1])
If you wanted to use LINQ to XML to solve the same problem, you can try the following:
XDocument xmlDoc = XDocument.Load(filepath);
XElement first = (from x in xmlDoc.Descendants()
                  where x.Name == "b" || x.Name == "d"
                  select x).FirstOrDefault();
Now you can run a simple if statement to determine if "b" or "d" was the first element found that matches our criteria.
if (first != null && first.Name == "b")
{
    // code for b being first
}
else if (first != null && first.Name == "d")
{
    // code for d being first
}
Per a commenter's suggestion, the code is cleaner with a lambda expression instead of a full LINQ query, although this can sometimes be confusing if you are new to LINQ. The following is exactly the same query as my XElement assignment above:
XElement first = xmlDoc.Descendants().FirstOrDefault(x => x.Name == "b" || x.Name == "d");
I hope that's what you're looking for if you were open to not using xpath.
This XPath expression would work (the parentheses make [1] apply to the whole result set rather than to each parent's children):
(//*[local-name()='b' or local-name()='d'])[1]
Or for a shorter solution, you could try this:
(//b|//d)[1]
Both expressions will select either b elements or d elements, in the order in which they appear, and only select the first element from the result set.
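
For reference, a small sketch of running such an expression from C# through the XPath extensions for LINQ to XML. It assumes the fragment from the question is wrapped in a single root element (the posted sample has none) and that the element names are plain names without quotes; FirstOccurrence and FirstOf are illustrative names.

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

public static class FirstOccurrence
{
    // Returns the local name of whichever element called nameA or nameB
    // comes first in document order, or null if neither is present.
    public static string FirstOf(XDocument doc, string nameA, string nameB)
    {
        // The outer parentheses make [1] pick the first node of the whole
        // node-set instead of the first match under each parent.
        string xpath = "(//*[local-name()='" + nameA +
                       "' or local-name()='" + nameB + "'])[1]";

        XElement first = doc.XPathSelectElement(xpath);
        return first?.Name.LocalName;
    }
}
```

Comparing the returned name against "b" or "d" then tells you which one appeared first.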

Extracting content from Webpage

I am attempting to use HTMLagilitypack to extract all the content from the webpage.
foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
sb.AppendLine(node.Text);
}
When I try to parse google.com using the above code, I get lots of JavaScript. All I want is to extract the content of the webpage, such as the text in h or p tags: for example, taking the question, answers, and comments on this page and removing everything else.
I am really new to XPath and don't exactly know where to move forward. So any help would be appreciated.
You can filter for the non-wanted tags by name and remove them from your document.
doc = page.Load("http://www.google.com");
doc.DocumentNode.Descendants().Where(n => n.Name == "script" || n.Name == "style").ToList().ForEach(n => n.Remove());
You could use this XPath expression:
//body//*[local-name() != 'script']/text()
It takes only the elements inside the body and skips the script elements

How to find nodes entirely between two specified nodes

In a XML document such as the following:
<root>
<fish value="Start"/>
<pointlessContainer>
<anotherNode value="Pick me!"/>
<anotherNode value="Pick me too!"/>
<fish value="End"/>
</pointlessContainer>
</root>
How can I use the wonder of LINQ to XML to find any nodes completely contained by the fish nodes? Note that in this example, I have deliberately placed the fish nodes at different levels in the document, as I anticipate this scenario will occur in the wild.
Obviously, in this example, I would be looking to get the two anotherNode nodes, but not the pointlessContainer node.
NB: the two 'delimiting' nodes may have the same type (e.g. fish) as other non-delimiting nodes in the document, but they would have unique attributes and therefore be easy to identify.
For your sample, the following should do
XDocument doc = XDocument.Load(@"..\..\XMLFile2.xml");
XElement start = doc.Descendants("fish").First(f => f.Attribute("value").Value == "Start");
XElement end = doc.Descendants("fish").First(f => f.Attribute("value").Value == "End");
foreach (XElement el in
doc
.Descendants()
.Where(d =>
XNode.CompareDocumentOrder(d, end) == -1
&& XNode.CompareDocumentOrder(d, start) == 1
&& !end.Ancestors().Contains(d)))
{
Console.WriteLine(el);
}
But I haven't tested or thoroughly pondered whether it works for other cases. Maybe you can check for some of your sample data and report back whether it works.
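
To make that easier to try against other inputs, here is the same filter wrapped in a helper, as a sketch only. BetweenNodes and ElementsBetween are hypothetical names; it assumes both delimiters belong to the same XDocument, and it does not treat descendants of the start node specially.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BetweenNodes
{
    // Returns the elements that come after `start` and before `end` in
    // document order, excluding ancestors of `end` (they begin before
    // `end` but are not closed until after it).
    public static List<XElement> ElementsBetween(XElement start, XElement end)
    {
        var endAncestors = new HashSet<XElement>(end.Ancestors());

        return start.Document.Descendants()
            .Where(d => XNode.CompareDocumentOrder(d, start) > 0
                     && XNode.CompareDocumentOrder(d, end) < 0
                     && !endAncestors.Contains(d))
            .ToList();
    }
}
```

For the sample document in the question, passing the two fish elements should return just the two anotherNode elements.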
