How to get text after link in htmlagilitypack

I have the following piece of HTML:
<div class="resum_card">
<p><a class="no_decore">Experience:</a> 5 years</p>
</div>
And this is what I tried in the code:
nodeValue = hd.DocumentNode.SelectSingleNode("//div[@class='resum_card']//p//a[@class='no_decore']//following-sibling::a");
But it returns null. Can anybody help me?

You can try this XPath to get the text node after the <a> element:
nodeValue = hd.DocumentNode
    .SelectSingleNode("//div[@class='resum_card']/p/a/following-sibling::text()");
Note: use a single slash (/) instead of a double slash (//) to select an element that is a direct child of the current element. It also performs better, because // searches all descendants.
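A minimal sketch of the expression above in a complete program. It assumes the HtmlAgilityPack NuGet package is referenced and that the paragraph holds a link followed by plain text; the sample markup, the href value and the names TextAfterLinkDemo and GetTextAfterLink are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public static class TextAfterLinkDemo
{
    // Returns the trimmed text that follows the <a> inside the card, or null if nothing matches.
    public static string GetTextAfterLink(string html)
    {
        var hd = new HtmlDocument();
        hd.LoadHtml(html);

        // following-sibling::text() selects text nodes, not elements.
        HtmlNode textNode = hd.DocumentNode.SelectSingleNode(
            "//div[@class='resum_card']/p/a/following-sibling::text()");

        // SelectSingleNode returns null when the expression matches nothing.
        return textNode == null ? null : textNode.InnerText.Trim();
    }

    public static void Main()
    {
        // Assumed sample markup: a link followed by plain text in the same <p>.
        string html = "<div class=\"resum_card\"><p>"
                    + "<a class=\"no_decore\" href=\"#\">Experience:</a> 5 years"
                    + "</p></div>";

        Console.WriteLine(GetTextAfterLink(html) ?? "(no match)");
    }
}
```

The original attempt ended in following-sibling::a, which looks for another <a> element after the link. The text after the link is a text node, so that expression has nothing to select.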

Related

Find element with selenium by display text

I am trying to hover over an element in a menu bar with Selenium, but I am having difficulty locating the element. The element is shown below:
<DIV onmouseover="function(blah blah);" class=mainItem>TextToFind</DIV>
There are multiple elements of this type so I need to find this element by TextToFind.
I've tried :
driver.FindElement(By.XPath("TextToFind"))
and
driver.FindElement(By.LinkText("TextToFind"))
which both didn't work. I even tried:
driver.FindElement(By.ClassName("mainItem"))
which also did not work. Can someone tell me what I am doing incorrectly?

The argument you pass to By.XPath is not valid XPath syntax, and By.LinkText only works on <a> elements. By.ClassName looks fine, but there may be several elements with that class name, which would explain why you did not get the right one. Try one of the XPath expressions below, which match on the text:
driver.FindElement(By.XPath("//div[text() = 'TextToFind']"));
Or
driver.FindElement(By.XPath("//div[. = 'TextToFind']"));
Or
driver.FindElement(By.XPath("//*[contains(., 'TextToFind')]"));
Hope it works...:)

It is better to ignore the whitespace around the text, like this:
var elm = driver.FindElement(By.XPath("//a[normalize-space() = 'TextToFind']"));
This searches for the text within an <a> element; replace a with whichever element you are interested in (div, span, etc.).
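To see how these expressions differ without starting a browser, the sketch below evaluates them with System.Xml against a small well-formed fragment. The fragment and the names TextMatchDemo and Matches are assumptions for illustration; Selenium hands the XPath to the browser, but the XPath 1.0 string rules are the same.

```csharp
using System;
using System.Xml;

public static class TextMatchDemo
{
    // True when the XPath expression selects at least one node in the XML.
    public static bool Matches(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectSingleNode(xpath) != null;
    }

    public static void Main()
    {
        // Assumed fragment: the visible text is surrounded by line breaks and spaces.
        string xml = "<body><div class=\"mainItem\">\n  TextToFind\n</div></body>";

        // Exact comparison fails because of the surrounding whitespace.
        Console.WriteLine(Matches(xml, "//div[text() = 'TextToFind']"));            // False

        // normalize-space() trims and collapses whitespace before comparing.
        Console.WriteLine(Matches(xml, "//div[normalize-space() = 'TextToFind']")); // True

        // contains() also succeeds, but with //* it matches every ancestor as well.
        Console.WriteLine(Matches(xml, "//body[contains(., 'TextToFind')]"));       // True
    }
}
```

The last line shows why //*[contains(., 'TextToFind')] can return an unexpected element: the string value of an ancestor includes the text of all its descendants, so the first match in document order is usually an outer container.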

How can I wrap a <span> around matched words in HTML without breaking the HTML

Using C# - WinForms
I have a valid HTML string which may or may not contain various HTML elements such as <a>.
I need to search this HTML and highlight certain keywords - the highlighting is done by adding a <span> around the text with inline styling. I should not be doing this for <a> tags, or any other HTML tag that isn't actually visible to the user.
e.g. currently I am doing this:
html = html.Replace(phraseToCount, "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">" + phraseToCount + "</span>");
This kind of works but it breaks <a> tags. So in the example below only the 1st instance of the word cereal should end up with a <span> around it:
<p>To view more types of cereal click here.</p>
How could I do this?
EDIT - more info.
This will be running in a Winforms app as the best way to get the HTML is using the WebBrowser control - I will be scraping web pages and highlighting various words.

You're handling HTML as plain text. You don't want that. You only want to search through the "InnerText" of your HTML elements, as in <p attribute="value">innertext</p>. Not through tags, comments, styles and script and whatever else can be included in your document.
In order to do that properly, you need to parse the HTML, and then obtain all elements' InnerTexts and do your logic on that.
In fact, InnerText is a simplification: when you have an element like <p>FooBar<span>BarBaz</span></p> where "Baz" is to be replaced, then you need to actually recursively iterate all the nodes in the DOM, and only replace text nodes, because writing into the InnerText property will remove all child nodes.
For how to do that, you'd want to use a library. You don't want to build an HTML parser on your own. See for example C#: HtmlAgilityPack extract inner text, Extracting Inner text from HTML BODY node with Html Agility Pack, How can i parse InnerText of <option> tag with HtmlAgilityPack?, Parsing HTML with CSQuery, HtmlAgilityPack - get all nodes in a document and so on.
The most relevant one seems to be How can I retrieve all the text nodes of a HTMLDocument in the fastest way in C#?:
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");
foreach (HtmlTextNode node in coll.Cast<HtmlTextNode>())
{
    node.Text = node.Text.Replace(...);
}
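A rough sketch of that idea applied to the highlighting problem, assuming the HtmlAgilityPack NuGet package is referenced. The XPath filter that skips text inside a, script and style elements, the sample markup and the names HighlightDemo and Highlight are illustrative assumptions, not part of the linked answers.

```csharp
using System;
using System.IO;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public static class HighlightDemo
{
    // Wraps every occurrence of phrase in a styled <span>, touching text nodes only.
    public static string Highlight(string html, string phrase)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Text nodes that are not inside a link, a script or a style element.
        HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes(
            "//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]");

        if (coll != null) // SelectNodes returns null when nothing matches
        {
            string wrapped =
                "<span style=\"background: #FF0000; color: #FFFFFF; font-weight: bold;\">"
                + phrase + "</span>";

            foreach (HtmlNode node in coll)
            {
                var textNode = node as HtmlTextNode;
                if (textNode != null)
                {
                    textNode.Text = textNode.Text.Replace(phrase, wrapped);
                }
            }
        }

        // Serialize the modified document back to a string.
        using (var writer = new StringWriter())
        {
            htmlDoc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Assumed sample: the phrase appears once in plain text and once inside a link.
        string html = "<p>To view more types of cereal click "
                    + "<a href=\"cereal.html\">here for cereal</a>.</p>";
        Console.WriteLine(Highlight(html, "cereal"));
    }
}
```

Because only text nodes are edited, attribute values such as the href are never touched, and the text inside the link is left alone by the ancestor::a filter. The replacement is case-sensitive, like the string.Replace call in the question.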

Here's how you would do what @CodeCaster suggested, using CsQuery:
string str = "<p>To view more types of cereal click here cereal.</p>";
var cq = CQ.Create(str);
foreach (IDomElement node in cq.Elements)
{
    PerformActionOnTextNodeRecursively(node, domNode => domNode.NodeValue = domNode.NodeValue.Replace("cereal", "<span>cereal</span>"));
}
Console.WriteLine(cq.Render());
private static void PerformActionOnTextNodeRecursively(IDomNode node, Action<IDomNode> action)
{
    foreach (var childNode in node.ChildNodes)
    {
        if (childNode.NodeType == NodeType.TEXT_NODE)
        {
            action(childNode);
        }
        else
        {
            PerformActionOnTextNodeRecursively(childNode, action);
        }
    }
}
Hope it helps.

Custom tag not considered HTMLelement/parent in webbrowser DOM c#

I am working on a XPATH generator (using absolute paths).
The idea is that I have a function where you pass a HTMLElement (that is found in the webbrowser) and it will return the XPATH like:
/html/body/div[3]/div[1]/a
The function to generate the xpath looks something like this:
HTMLElement node=...;
while (node != null)
{
    int i = FindElementIndex(node); //find the index of our current node in the parent elements
    if(i==1)
        xpath.Insert(0, "/" + node.TagName.ToLower());
    else
        xpath.Insert(0, "/" + node.TagName.ToLower() + "[" + i+ "]");
    node = node.Parent;
}
The idea is this:
a)take the element
b)find the index position of element in element.parent
c)append xpath
The problem appears when the parent is a custom HTML tag such as <layer>.
Example:
<html>
<body>
<div>
<layer>
<a>aaa</a>
</layer>
</div>
</body>
</html>
If our HTMLElement is the <a> element (aaa) and we call
ourelement.Parent, it returns the DIV element and NOT the <layer> element.
So instead of having:
/html/body/div/layer/a
We will have (which is incorrect)
/html/body/div/a
How can this be solved? Really hope someone can help figure this out.
EDIT 1: Just for testing purposes I implemented the function from Get the full path of a node, after get it with an XPath query in JavaScript
The result was that if the page contained a "custom" tag (like <layer>) AND was opened in Firefox, the XPath was shown correctly.
If the page was opened in Internet Explorer (which is what the WebBrowser control uses), <layer> was not included as a parent.
So the issue is that Internet Explorer does not "parse" the DOM correctly. What is the solution? What function can help create an XPath in cases like this (when using the WebBrowser HtmlElement)?

This is not a direct answer to your question, but have you considered using http://htmlagilitypack.codeplex.com/ to load the HTML? It does not have the problem of ignoring the element.
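A small sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced and the page source is available as a string. HtmlNode exposes an XPath property with the absolute path of a node, so the path does not have to be assembled by hand. The sample markup and the names XPathOfNodeDemo and GetXPathOfFirst are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public static class XPathOfNodeDemo
{
    // Returns the absolute XPath of the first node matched by query, or null.
    public static string GetXPathOfFirst(string html, string query)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(query);
        return node == null ? null : node.XPath;
    }

    public static void Main()
    {
        // Assumed markup with a non-standard <layer> element wrapping the link.
        string html = "<html><body><div><layer><a>aaa</a></layer></div></body></html>";

        // The printed path should contain a step for the layer element.
        Console.WriteLine(GetXPathOfFirst(html, "//a"));
    }
}
```

The path format produced by the library may carry an index on every step, so it may need normalizing if it has to match the /html/body/div[3]/div[1]/a style shown in the question.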

How to remove nodes using HTML agility pack and XPath so as to clean the HTML page

I need to extract text from web pages, mostly related to business news.
Say the HTML page is as follows:
<html>
<body>
<div>
<p> <span>Desired Content - 1</span></p>
<p> <span>Desired Content - 2</span></p>
<p> <span>Desired Content - 3</span></p>
</div>
</body>
</html>
I have a sample stored in a string that takes me directly to Desired Content - 1, so I can collect that content. But I also need to collect Desired Content - 2 and 3.
What I tried: from the span node of Desired Content - 1, I moved up to the enclosing paragraph node with ParentNode and got its content, but I actually need the entire desired content of the div. How can I do that? You might suggest going straight to the div with ParentNode.ParentNode from the span, but that would be specific to this example; I need a general approach.
News articles mostly keep the desired content in a div, and my search lands on some nested inner node of that div. I need to move up out of those inner nodes only until I reach a div, and then get its InnerText.
I am using XPath and HTMLagilitypack.
Xpath i am using is -
variable = doc.DocumentNode.SelectSingleNode("//*[contains(text(),'" + searchData + "')]").ParentNode.ParentNode.InnerText;
Here "searchData" is a variable holding a sample of Desired Content - 1, used to search the entire body of the web page for the node that contains the news.
One idea is to clean up the web pages so that only the main tags remain (HTML, BODY, tables, divs and paragraphs) with no spans or other formatting elements. But some websites might use spans instead of divs, so I am not sure how to implement this.
The basic requirement is to extract the news content from many different web pages (almost 250 different websites), so I cannot write code specific to each page; I need a generic method.
Any ideas appreciated. Thank you.

This XPath expression selects the innermost div element with $searchData variable reference value as part of its string value.
//div[contains(.,$searchData)]
[not(.//div[contains(.,$searchData)])]
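SelectSingleNode in HtmlAgilityPack takes a plain XPath string, so one way to use this expression is to inline the search text in place of $searchData. The sketch below assumes the HtmlAgilityPack NuGet package is referenced; the sample markup and the names InnermostDivDemo and InnermostDivText are illustrative, and the simple quoting only works when the search text contains no apostrophe.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public static class InnermostDivDemo
{
    // Returns the InnerText of the innermost div whose text contains searchData, or null.
    public static string InnermostDivText(string html, string searchData)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: searchData contains no apostrophe, so simple quoting is enough.
        string literal = "'" + searchData + "'";
        string xpath = "//div[contains(., " + literal + ")]"
                     + "[not(.//div[contains(., " + literal + ")])]";

        HtmlNode div = doc.DocumentNode.SelectSingleNode(xpath);
        return div == null ? null : div.InnerText;
    }

    public static void Main()
    {
        // Assumed markup: the story div is nested inside an outer layout div.
        string html = "<div id=\"outer\"><h1>Site</h1><div id=\"story\">"
                    + "<p><span>Desired Content - 1</span></p>"
                    + "<p><span>Desired Content - 2</span></p>"
                    + "</div></div>";

        Console.WriteLine(InnermostDivText(html, "Desired Content - 1"));
    }
}
```

The second predicate is what excludes the outer div: it also contains the text, but it has a descendant div that contains it too.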

Found out the answer myself...
Using a while loop that moves up until it reaches a div, and then getting the InnerText, works.
//Select the desired node, move up until you find a div, and then get the inner text.
node = hd.DocumentNode.SelectSingleNode("//*[contains(text(),'" + searchData + "')]"); //Find the desired node.
while (node != null && node.Name != "div") //Move up until you reach the enclosing div node.
{
    node = node.ParentNode;
}
if (node != null)
{
    Body = node.InnerText; //Text of the whole div, not only of the matched node.
}
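The same walk up the tree can also be written with the AncestorsAndSelf method of HtmlNode, which yields the node and then its parents from the nearest outwards. This is only a sketch, assuming the HtmlAgilityPack NuGet package is referenced; the sample markup and the names EnclosingDivDemo and EnclosingDivText are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public static class EnclosingDivDemo
{
    // Finds the node whose own text contains searchData and returns the
    // InnerText of the nearest enclosing div, or null if there is none.
    public static string EnclosingDivText(string html, string searchData)
    {
        var hd = new HtmlDocument();
        hd.LoadHtml(html);

        // Assumption: searchData contains no apostrophe.
        HtmlNode node = hd.DocumentNode.SelectSingleNode(
            "//*[contains(text(),'" + searchData + "')]");
        if (node == null)
        {
            return null;
        }

        // The first match is the closest div, including the node itself.
        HtmlNode div = node.AncestorsAndSelf("div").FirstOrDefault();
        return div == null ? null : div.InnerText;
    }

    public static void Main()
    {
        string html = "<div><p><span>Desired Content - 1</span></p>"
                    + "<p><span>Desired Content - 2</span></p></div>";
        Console.WriteLine(EnclosingDivText(html, "Desired Content - 1"));
    }
}
```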

Check Empty XML data validation before displaying

I want to check the XML before displaying it. I am using XPath, not XSL, for this. For example:
<title></title>
<url></url>
<submit></submit>
I want to check whether the XML data for an element is missing and, if it is, not display it, because I am putting these values in <a href=<%#container.dataitem,url%>>new link</a>.
So if url is empty, the new link should not be displayed; otherwise it should. Similarly, title should be displayed only if it is not empty.
The main problem is that in the ascx.cs file I can check with
if(iterator.current.value="") and not display it, but in the ascx file I have
<a href=<%#container.dataitem,url%>>new link</a>
and I want the new link not to appear if url is empty.
Any idea how to check this condition?

I've seen this handled using an asp:Literal control.
In the web form, you'd have <asp:Literal id='literal' runat='server' text='<%# GetAnchorTag(Container.DataItem) %>' />
And in the code behind, you'd have:
protected string GetAnchorTag(object dataItem) {
    if(dataItem != null) {
        string url = Convert.ToString(DataBinder.Eval(dataItem, "url"));
        if(!string.IsNullOrEmpty(url)) {
            // Build the anchor tag; the link text is taken from the question.
            string anchor = "<a href=\"" + url + "\">new link</a>";
            return anchor;
        }
    }
    return string.Empty;
}
this way, you either output a full anchor tag or an empty string. I don't know how this would fit in with your title and submit nodes, but it solves the anchor display issue.
Personally, I don't like this approach, but I've seen it quite a bit.

Use XPath. Assuming that the elements are enclosed in an element named link:
link[title != '' and url !='']
will find the link elements whose title and url child elements are both non-empty. To make it a little more bulletproof,
link[normalize-space(title) != '' and normalize-space(url) !='']
will keep the expression from matching link elements whose title or url children contain only whitespace.
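A minimal sketch of the second expression using System.Xml. The wrapping links and link elements, the sample values and the names NonEmptyLinksDemo and CountDisplayable are assumptions for illustration.

```csharp
using System;
using System.Xml;

public static class NonEmptyLinksDemo
{
    // Counts the link elements that have a non-blank title and a non-blank url.
    public static int CountDisplayable(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNodeList links = doc.SelectNodes(
            "//link[normalize-space(title) != '' and normalize-space(url) != '']");
        return links.Count;
    }

    public static void Main()
    {
        // Assumed data: one complete link, one with an empty url, one with a blank title.
        string xml =
            "<links>"
          + "<link><title>Home</title><url>http://example.com/</url><submit></submit></link>"
          + "<link><title>No url</title><url></url><submit></submit></link>"
          + "<link><title>   </title><url>http://example.com/x</url><submit></submit></link>"
          + "</links>";

        Console.WriteLine(CountDisplayable(xml)); // 1
    }
}
```

Only the nodes returned by the query need to be bound to the display, so the markup never has to render an anchor with an empty href.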

If you don't have access to the .cs file for this then you can still embed the code right in the .ascx file. Remember, you don't HAVE to put all your code in the code behind file, it can go inline right inside the .ascx file.
<%
if(iterator.current.value!="") {
%>
<a href=<%#container.dataitem,url%>>new link</a>
<%
}
%>

What about //a[not(./@href) or not(text()='')]
