HtmlAgilityPack remove childnode of childnode - c#

I have a string like this:
string text = "<p>test <span> <font> here </font> </span> try</p><p> <font> try 2</font> </p>";
I need to filter it as follows:
Keep the text inside P
Remove each Span and its content (font and text)
Keep the text inside a font if its direct parent is not a Span
What I have is:
StringBuilder sbtexttoCorrect = new StringBuilder();
HtmlDocument html = new HtmlDocument();
html.LoadHtml(textToFormat);
var nodes = html.DocumentNode.SelectNodes("//p");
foreach (var line in nodes)
{
if (line.Name =="SPAN")
{
line.RemoveAllChildren();
line.Remove();
}
}
foreach (var txt in nodes)
{
sbtexttoCorrect.Append(txt.InnerText);
}
But at the end sbtexttoCorrect still contains the text of the font child of the span, even after calling RemoveAllChildren and Remove on it.
What am I missing?
Note: in another post someone suggested:
foreach (var line in nodes.Select(node => node.ChildNodes.Where(
childNode => childNode.Name != "span"))
.Select(
textNodes => textNodes.Aggregate(String.Empty, (current, node) => current + node.InnerText)))
{
sbtexttoCorrect.Append(line);
}
But I do not understand all of the syntax, so I wanted to rewrite my own attempt. It also did not work every time: it still returns the text inside the font inside the span.
Note 2: I can't find any documentation for HtmlAgilityPack. If someone knows where to find it, I'd like to learn more about this library.
Edit: The real HTML is far more complex, with a number of child nodes that I can't know in advance; they can be TD or DIV. The only thing I know for sure is that when there is a span, I need to skip its content and its child nodes.

I see two problems in your code:
You compare the name with upper-case "SPAN", whereas HtmlAgilityPack exposes element names in lower case, so your if block is never entered.
You only loop over the p elements (instead of over their children), so the if block is never entered for that reason either.
Based on your additional explanations, this should work:
It selects all spans with an XPath expression (this works for upper- and lower-case tags)
It removes the spans
It collects the text of the remaining nodes and drops all HTML tags (as indicated here)
string text = "<p>test <SPAN> <font> here </font> </SPAN> try</p><p><table> <tr><td><span>test</span></td></tr></table><font> try 2</font> </p>";
StringBuilder sbtexttoCorrect = new StringBuilder();
HtmlDocument html = new HtmlDocument();
html.LoadHtml(text);
var nodes = html.DocumentNode.SelectNodes("//span");
if (nodes != null) // SelectNodes returns null when nothing matches
{
foreach (var node in nodes)
{
node.Remove();
}
}
foreach (var node in html.DocumentNode.DescendantsAndSelf())
{
if (!node.HasChildNodes)
{
string t = node.InnerText;
if (!string.IsNullOrEmpty(t))
sbtexttoCorrect.AppendLine(t);
}
}
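As an alternative to removing the spans first, the same filtering can be sketched by skipping every text node that has a span ancestor. This is only a minimal sketch: the class and method names (SpanFilter, TextOutsideSpans) are hypothetical, and it assumes HtmlAgilityPack's usual behaviour of exposing element names in lower case.

```csharp
using System.Linq;
using System.Text;
using HtmlAgilityPack;

public static class SpanFilter
{
    // Hypothetical helper: concatenates the text of every text node
    // that is not located somewhere inside a <span>.
    public static string TextOutsideSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Only text nodes carry the text we want to keep.
            if (node.NodeType != HtmlNodeType.Text)
                continue;

            // Element names are lower-cased by the parser,
            // so "span" also matches <SPAN>.
            if (node.Ancestors("span").Any())
                continue;

            sb.Append(node.InnerText);
        }
        return sb.ToString();
    }
}
```

Called on the string from the question, SpanFilter.TextOutsideSpans(text) keeps "test", "try" and "try 2" and drops "here", at any nesting depth (td, div, and so on). Whitespace-only text nodes are kept as they are, so the result may need trimming.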

Related

Remove a series of tags from a document

I am using HtmlAgilityPack, but I am not really used to HTML documents.
A combination of tags causes problems when printed, so I decided to remove them, but how?
<p ><br clear=all>
</span></p>
I have an HtmlDocument that I load at the beginning, and then I try to remove the tags above.
To remove them I have tried:
HtmlAgilityPack.HtmlDocument document; // get the document
foreach (var node in document.DocumentNode.SelectNodes("//div"))
{
IEnumerable<HtmlNode> test= node.ChildNodes;
foreach(HtmlNode val in test)
{
if (val.Name == "br") //Want to check if I go through the node, I am looking for
{
int a = 0;
a++;
IEnumerable <HtmlAttribute> attribute = node.GetAttributes();
foreach(HtmlAttribute att in attribute)
{
if (att.Name == "clear")
{
HtmlNode getNode = node.NextSibling;
node.Remove();
}
}
}
}
}
It doesn't work.
If I use this loop instead:
foreach (var node in document.DocumentNode.SelectNodes("//br"))
I can remove the tag, but I cannot access the node that follows it.
Could you help me?

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually:
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all the items.
When you search the page for the class 'search-result', only one node is returned. Instead of iterating through its children, you only visit that one div, which is why you get a single result.
If you want a list of all the links inside the div with class "search-result", you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") -> retrieves the div that holds all the search results and ignores the rest of the document.
It then iterates only through the sub-nodes that are a elements and adds each href to a list. Sub-nodes are selected with the dot notation SelectNodes(".//a") (if you use // instead of .//, it searches the entire page, which is not what you want).
The if statement makes sure it only adds unique, non-empty values.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
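To illustrate the difference between .//a and //a described in this answer, here is a small sketch on an inline HTML string instead of the live page. The class name, method name and sample markup are made up for the example.

```csharp
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical helper: counts the <a> elements found from the
    // "search-result" div with a relative and with an absolute XPath.
    public static (int Relative, int Absolute) CountLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode searchResult =
            doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
        if (searchResult == null)
            return (0, 0);

        // SelectNodes returns null when nothing matches, hence the ?. and ??.
        int relative = searchResult.SelectNodes(".//a")?.Count ?? 0; // inside the div only
        int absolute = searchResult.SelectNodes("//a")?.Count ?? 0;  // whole document
        return (relative, absolute);
    }
}
```

For the sample string <div class='nav'><a href='/home'>Home</a></div><div class='search-result'><a href='/a'>A</a><a href='/b'>B</a></div>, the relative query finds the 2 links inside the div, while the absolute query finds all 3 links of the document even though it is called on the div.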
I think the problem is how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

get all nodes and its content using htmldocument/HtmlAgilityPack

I need to get all the nodes of an HTML document, then get the text and sub-nodes of each node, and do the same for the sub-sub-nodes.
For example, I have this HTML:
<p>This <b>is a Link</b> with <b>bold</b></p>
So I need a way to get the p node, then the unformatted text (This), the bold-only text (is a), the bold link (Link), and the rest of the formatted and unformatted text.
I know that with HtmlDocument I can select all nodes and sub-nodes, but how can I get the text before a sub-node, then the sub-node and its own text and sub-nodes, so that I can build the rendered version of the HTML ("This is a Link with bold")?
Please note that the example above is a simple one. The real HTML will contain more complex things such as lists, frames, numbered lists, triple-formatted text, etc. Also note that the rendering itself is not the problem; I have already done that in another way. What I need is only the part that gets the nodes and their content.
Also, I can't ignore any node, so I can't filter anything out. And the main node could be a p, div, frame, ul, etc.
After looking at HtmlDocument and its properties, and thanks to @HungCao's observation, I found a simple working way to interpret HTML code.
My code is too complex to post as an example, so I will post a lite version of it.
First of all, the HtmlDocument has to be loaded. This can be done in any function:
HtmlDocument htmlDoc = new HtmlDocument();
string html = @"<p>This <b>is a Link</b> with <b>bold</b></p>";
htmlDoc.LoadHtml(html);
Then we need to interpret each "main" node (p in this case) and, depending on its type, call a recursive function (InterNode) on its children:
HtmlNodeCollection nodes = htmlDoc.DocumentNode.ChildNodes;
foreach (HtmlNode node in nodes)
{
if(node.Name.ToLower() == "p") //Low the typeName just in case
{
Paragraph newPPara = new Paragraph();
foreach(HtmlNode childNode in node.ChildNodes)
{
InterNode(childNode, ref newPPara);
}
richTextBlock.Blocks.Add(newPPara);
}
}
Please note that there is a property called "NodeType", but it only tells you the kind of node (element, text, comment or document), not the tag. So use the "Name" property instead (also note that the Name property of HtmlNode is not the same as the name attribute in HTML).
Finally, we have the InterNode function, which adds inlines to the Paragraph passed by reference (ref):
public bool InterNode(HtmlNode htmlNode, ref Paragraph originalPar)
{
string htmlNodeName = htmlNode.Name.ToLower();
List<string> nodeAttList = new List<string>();
HtmlNode parentNode = htmlNode.ParentNode;
while (parentNode != null) {
nodeAttList.Add(parentNode.Name);
parentNode = parentNode.ParentNode;
} //we need to get it multiple types, because it could be b(old) and i(talic) at the same time.
Inline newRun = new Run();
foreach (string noteAttStr in nodeAttList) //with this we can set all the attributes to the inline
{
switch (noteAttStr)
{
case ("b"):
case ("strong"):
{
newRun.FontWeight = FontWeights.Bold;
break;
}
case ("i"):
case ("em"):
{
newRun.FontStyle = FontStyle.Italic;
break;
}
}
}
if(htmlNodeName == "#text") //"#text" is the name of a text node, as in <i>#text</i>. Thanks @HungCao
{
((Run)newRun).Text = htmlNode.InnerText;
originalPar.Inlines.Add(newRun); //attach the run to the paragraph, otherwise it is never displayed
} else //if it is not a #text, don't load its InnerText, as it's another node and its text (if any) is always in a #text child
{
foreach (HtmlNode childNode in htmlNode.ChildNodes)
{
InterNode(childNode, ref originalPar);
}
}
return true;
}
Note: I know I said that my app needs to render the HTML differently from a WebView, and that this example produces the same result as a WebView, but, as I said before, this is just a lite version of my final code. My full code works as I need it to; this is just the base.
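The same traversal can be sketched without any UI types, which may make the idea easier to follow: walk the tree recursively and, for each text node, record the text together with the formatting implied by its ancestors. The TextRun and RunCollector names are hypothetical and not part of the code above; only b/strong and i/em are handled, as in the lite version.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container for one piece of text and its formatting.
public sealed class TextRun
{
    public string Text { get; set; }
    public bool Bold { get; set; }
    public bool Italic { get; set; }
}

public static class RunCollector
{
    // Returns the text nodes of the document in reading order.
    public static List<TextRun> Collect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var runs = new List<TextRun>();
        Visit(doc.DocumentNode, runs);
        return runs;
    }

    private static void Visit(HtmlNode node, List<TextRun> runs)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            // A run can be bold and italic at once, so look at all ancestors.
            List<string> names = node.Ancestors().Select(a => a.Name).ToList();
            runs.Add(new TextRun
            {
                Text = node.InnerText,
                Bold = names.Contains("b") || names.Contains("strong"),
                Italic = names.Contains("i") || names.Contains("em")
            });
            return;
        }

        // Not a text node: its text, if any, lives in its children.
        foreach (HtmlNode child in node.ChildNodes)
            Visit(child, runs);
    }
}
```

For the sample HTML, RunCollector.Collect returns four runs in order: "This " (plain), "is a Link" (bold), " with " (plain) and "bold" (bold).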

How to wrap text with a span tag using HtmlAgilityPack

I'm currently using HtmlAgilityPack to strip all the unnecessary tags from the content of a div (with contentEditable), so that I keep only the text between <p></p> tags. After getting the text, I send it to a spell checker that gives me back the words in error inside that specific <p></p>.
Dictionary<string, List<string>> DicoError = new Dictionary<string, List<string>>();
int nbError = 0;
HtmlDocument html = new HtmlDocument();
html.LoadHtml(texteAFormater);
var nodesSpan = html.DocumentNode.SelectNodes("//span");
var nodesA = html.DocumentNode.SelectNodes("//div");
if (nodesSpan != null)
{
foreach (var node in nodesSpan)
{
node.Remove();
}
}
if (nodesA != null)
{
foreach (var node in nodesA)
{
if (node.Attributes["edth_type"] != null)
{
if (string.Equals(node.Attributes["edth_type"].Value, "contenu", StringComparison.InvariantCultureIgnoreCase)==false)
{
node.Remove();
}
}
}
}
var paragraphe = html.DocumentNode.SelectNodes("p");
for(int i =0; i< paragraphe.Count; i++){
string texteToCorrect = paragraphe[i].innerText;
List<string> errorInsideParagraph = new List<string>();
errorInsideParagraph = callProlexis(HtmlEntity.DeEntitize(texteToCorrect), nbError, DicoError);
for(int j=0;j<motEnErreur.Count; j++){
HtmlNode spanNode = html.CreateElement("span");
spanNode.Attributes.Add("class", typeError);
spanNode.Attributes.Add("id", nbError);
spanNode.Attributes.Add("oncontextmenu","rightClickMustWork(event, this);return false");
}
}
I manage to send the InnerText to my corrector. My problem is this: suppose the paragraph is:
<p>this is some text <em>error</em> how should this work</p>
In this paragraph two words are in error: error and should.
How can I add my spanNode so that it keeps the <em></em> around error? (I need to keep the existing tag around the word in error, if there is one, and just wrap the spanNode around it.)
So the expected result would be:
<p>this is some text <span ...><em>error</em></span> how <span ...>should</span> this work</p>
Edit: I was thinking of finding the word in error inside the InnerHtml and then getting the parent node of that word. If the parent is <p>, there is no tag around the word and we can just add the spanNode. If it is another tag, we need to add the spanNode as that tag's parent, so that spanNode is a child of <p> but the parent of the tag around the word. I'm not sure how to do it.
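The idea described in the edit could be sketched roughly as follows. This is a minimal sketch under several assumptions, not a complete solution: ErrorWrapper and WrapWord are hypothetical names, only the first occurrence of the word in each text node is handled, and the match is a plain substring search (so "should" would also match inside "shoulder").

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ErrorWrapper
{
    // Hypothetical helper: wraps `word` inside `paragraph` in
    // <span class="cssClass">, keeping an existing inline tag around it.
    public static void WrapWord(HtmlDocument doc, HtmlNode paragraph,
                                string word, string cssClass)
    {
        // Snapshot the text nodes first because the loop modifies the tree.
        var textNodes = paragraph.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .ToList();

        foreach (HtmlNode textNode in textNodes)
        {
            string text = textNode.InnerText;
            int index = text.IndexOf(word, StringComparison.Ordinal);
            if (index < 0)
                continue;

            HtmlNode span = doc.CreateElement("span");
            span.SetAttributeValue("class", cssClass);
            HtmlNode parent = textNode.ParentNode;

            if (parent != paragraph && text.Trim() == word)
            {
                // The word is the whole content of an inline tag such as <em>:
                // put the span where the tag was and move the tag inside it.
                parent.ParentNode.ReplaceChild(span, parent);
                span.AppendChild(parent);
            }
            else
            {
                // Plain text: split it into "before", <span>word</span>, "after".
                parent.InsertBefore(doc.CreateTextNode(text.Substring(0, index)), textNode);
                span.AppendChild(doc.CreateTextNode(word));
                parent.InsertBefore(span, textNode);
                parent.ReplaceChild(
                    doc.CreateTextNode(text.Substring(index + word.Length)), textNode);
            }
        }
    }
}
```

Calling WrapWord once for "error" and once for "should" on the sample paragraph should leave the em inside the first span and put the second span directly around the plain word, while the visible text of the paragraph stays the same. The id and oncontextmenu attributes from the question can be added with further SetAttributeValue calls.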

HtmlAgilityPack set node InnerText

I want to replace the inner text of HTML tags with other text.
I am using HtmlAgilityPack.
I use this code to extract all the text nodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("some path");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']")) {
// How to replace node.InnerText with some text ?
}
But InnerText is read-only. How can I replace the text and save the result to a file?
Try the code below. It selects all non-empty text nodes inside body and filters out script content; you may need to add some additional filtering. Compared with your XPath expression, this one also restricts the search to leaf nodes and filters out the text content of <script> tags.
var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(htmlNode.InnerText + "_translated"), htmlNode);
}
Strangely, I found that InnerHtml isn't read-only. When I set it like this:
aElement.InnerHtml = "sometext";
the value of InnerText also changed to "sometext"
The HtmlTextNode class has a Text property* which works perfectly for this purpose.
Here's an example:
var textNodes = doc.DocumentNode.SelectNodes("//body/text()").Cast<HtmlTextNode>();
foreach (var node in textNodes)
{
node.Text = node.Text.Replace("foo", "bar");
}
And if we have an HtmlNode whose direct text we want to change, we can do something like the following:
HtmlNode node = //...
var textNode = (HtmlTextNode)node.SelectSingleNode("text()");
textNode.Text = "new text";
Or we can use node.SelectNodes("text()") if it has more than one text node.
* Not to be confused with the read-only InnerText property.
