get text inside a html element - c#

I'm using HTMLAgilityPack to get text inside a list of certain nodes.
Basically, I'm reading an HTML page with the following markup:
<div class="video-overview yt-grid-fluid">
<h3 class="video-title-container">
<span class="yt-badge-std">
WATCHED
</span>
<a href="/watch?v=S5FCdx7Dn0o&list=FLArRQZAMoAgECBIOn08gNeA&index=1" title="Bob Marley - Buffalo soldier" class="yt-uix-tile-link yt-uix-sessionlink" data-sessionlink="ei=c9tWUb3eJaOzhgG6kYHYAg&feature=plpp_video">
<span class="title video-title" dir="ltr">Bob Marley - Buffalo soldier</span>
</a>
</h3>
<p class="video-details">
<span class="video-owner">
by <span class="yt-user-name " dir="ltr">Pgroenberg</span>
</span>
<span class="video-view-count">
44,342,136 views
</span>
</p>
</div>
I want to get the text "Bob Marley - Buffalo soldier", which is inside <span class="title video-title" dir="ltr">.
I can't seem to find the right pattern:
string expression = @"//span[@class='title video-title' and @dir='ltr']/text()";
HtmlNodeCollection hnc = htmlDoc.DocumentNode.SelectNodes(expression);
hnc will be null because no nodes match the expression. Why won't my expression work?

Related

Selecting a parent of a child element using XPath

Before I start: in the source code included in this post, I have replaced any sensitive data with a string of "x".
<div class="pod productPod icon-life clearFix nonwrap"
id="xxxxxxxxxx"
internalid="xxxxxxxxxx"
data-properties-type="xxxxxxxxxx"
data-properties-code="xxxxxxxxxx"
data-properties-loaded="False"
data-properties-url="xxxxxxxxxx"
data-properties-showmaintenanceerrormessage="False"
data-properties-product-type="xxxxxxxxxx">
<span class="podTitle">Your policies</span>
<div id="1" class="productPodInner" data-qa-productpod="">
<span class="notificationBubble fullProdLarge" aria-hidden="true"></span>
<h2><a data-feedback="" class="clearFix roundelLink" href="#productId0" data-roundel-id="0">
<span class="productIconWrapper">
<span class="productIcon">
</span>
</span>
<span class="productData">
<span class="productName" style="min-height: auto;">Protection</span>
<span class="notificationBubble fullProd"></span>
</span>
</a></h2>
<div class="productDetails podContent" id="#productId0" style="width: 1903px; left: -306.703px;">
<div class="productDetailsInner clearFix">
<h3>xxxxxxxxxx</h3>
<!-- TODO:: Display this in the 2nd column-->
<dl class="initialPolicyContainer detailsList" style="display: none;">
<dt>Policy Number</dt>
<dd class="initialPolicyNumber" data-qa-text="xxxxxxxxxx">xxxxxxxxxx</dd>
</dl>
<div class="clearFix variant3Container" style=""><div class="group-1-2 clearFix">
<div class="column">
<dl class="detailsList">
<dt>Policy number</dt>
<dd class="policyNumberVal">xxxxxxxxxx</dd>
<dt>Term</dt>
<dd>xxxxxxxxxx</dd>
</dl>
My goal is to select the top parent using an XPath locator and then click on it with Selenium.
On the website there is a pod with multiple roundel sections, each containing its own policy. Normally I can select a roundel using an XPath query that finds it by the policyType text, but if there are two policies with the same policyType, this locator only finds the first one.
Each roundel has a unique policyNumberVal (line 29):
<dd class="policyNumberVal">xxxxxxxxxx</dd>
which I can locate using this XPath:
//dd[@class='policyNumberVal' and text() = '{policyNumber}']
What I would like to know is: once I have located the policyNumberVal, how do I step up and select its parent?
Please let me know if you need any more information.
You can use the parent axis,
//dd[@class='policyNumberVal' and text() = '{policyNumber}']/parent::*
its abbreviation,
//dd[@class='policyNumberVal' and text() = '{policyNumber}']/..
or elevate the predicate and select the parent directly:
//dl[dd[@class='policyNumberVal' and text() = '{policyNumber}']]
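As a quick sanity check that stepping up with `..` and elevating the predicate land on the same element, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. ElementTree implements only a subset of XPath (no `and`, no `text()`, no `parent::` axis), so the predicates are chained and `.` stands in for `text()`. The fragment and the policy number P-123 are made up for illustration and are not taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment modelled on the question's markup.
FRAGMENT = """
<div class="column">
  <dl class="detailsList">
    <dt>Policy number</dt>
    <dd class="policyNumberVal">P-123</dd>
  </dl>
  <dl class="detailsList">
    <dt>Policy number</dt>
    <dd class="policyNumberVal">P-456</dd>
  </dl>
</div>
"""

root = ET.fromstring(FRAGMENT)
policy_number = "P-123"  # assumed value, stands in for {policyNumber}

# Locate the dd, then step up to its parent with the abbreviated parent axis.
via_parent = root.findall(
    f".//dd[@class='policyNumberVal'][.='{policy_number}']/.."
)

# Select the dl directly by moving the condition on its child into a predicate.
via_predicate = root.findall(f".//dl[dd='{policy_number}']")

print(via_parent[0].tag, via_parent[0] is via_predicate[0])
```

With a full XPath 1.0 engine, such as the one Selenium uses in the browser, the three expressions from the answer can be used unchanged.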

XPath for an element with same class names

Can anyone help me derive the XPath for the span element with the label GP in the second div?
<div class="ohg-patient-banner-suppl-info-section ohg-patient-banner-suppl-info-custom pure-u-1-5">
<div class="ohg-patient-banner-suppl-info-component-container ohg-patient-banner-suppl-info-component-left-bordered ohg-patient-banner-suppl-info-custom" aria-label="Visit Info" role="group"><div class="ohg-patient-banner-suppl-info-section-summary" aria-hidden="false">
</div>
<div class="ohg-patient-banner-suppl-info-section-detail" aria-label="" aria-hidden="true">
<div class="ohg-patient-banner-suppl-info-custom-field">
<span class=" " aria-hidden="true"></span>
<span class=" ohp-metadata-label">Location</span>
<span class="ohg-patient-banner-suppl-info-custom-row-value-icon " aria-hidden="true"></span>
<span class=" ohg-patient-banner-suppl-info-value">Tauranga Hospital - Assmt Plan Unit TAU - </span>
</div>
</div>
</div>
</div>
<div class="ohg-patient-banner-suppl-info-section ohg-patient-banner-suppl-info-custom pure-u-1-5">
<div class="ohg-patient-banner-suppl-info-component-container ohg-patient-banner-suppl-info-component-left-bordered ohg-patient-banner-suppl-info-custom" aria-label="GP Info" role="group"><div class="ohg-patient-banner-suppl-info-section-summary" aria-hidden="false">
</div>
<div class="ohg-patient-banner-suppl-info-section-detail" aria-label="" aria-hidden="true">
<div class="ohg-patient-banner-suppl-info-custom-field">
<span class=" " aria-hidden="true"></span>
<span class=" ohp-metadata-label">GP</span>
<span class="ohg-patient-banner-suppl-info-custom-row-value-icon " aria-hidden="true"></span>
<span class=" ohg-patient-banner-suppl-info-value">-</span>
</div>
</div>
</div>
</div>
The XPath I wrote only works for the first div and returns the value Location:
.//*[@class='ohg-patient-banner-suppl-info-section-detail']/div[@class='ohg-patient-banner-suppl-info-custom-field']/span[@class=' ohp-metadata-label']
XPath:
//*[@class='ohg-patient-banner-suppl-info-custom-field']//span[2]
Then you can use getText() to get GP as text.
Try this XPath 1.0 expression:
//*[@class='ohg-patient-banner-suppl-info-section-detail']/div[@class='ohg-patient-banner-suppl-info-custom-field' and span[@class=' ohg-patient-banner-suppl-info-value']='-']/span[@class=' ohp-metadata-label']
Its result is:
GP
Programmatic solution:
Your XPath actually returns both elements. Your Selenium library most likely gives you an array, so you can select its second element.
Select the second element:
If it is always the second element, adding [2] to the XPath helps, e.g.
"(.//*[@class='ohg-patient-banner-suppl-info-section-detail'])[2]/div[@class='ohg-patient-banner-suppl-info-custom-field']/span[@class=' ohp-metadata-label']"
By a fixed text:
If the page contains some fixed text, e.g. Location, you can use it as a reference and navigate from there with the ancestor and sibling axes:
".//span[.='Location']//ancestor::div[@class='ohg-patient-banner-suppl-info-section-detail']//following-sibling::div[@class='ohg-patient-banner-suppl-info-section-detail']//span[@class=' ohp-metadata-label']"
Since you don't have other unique identifiers, and the class name is used by two spans, you can identify that span by its index, which is also the shortest way:
XPath: (//span[@class=' ohp-metadata-label'])[2]
Now you can scrape the text using Selenium's getText() method.
Note that this locator relies on index 2: if the HTML changes and similar spans are added before this one, you will need to change the index.
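The "it returns both elements, pick the second one" idea from the answers above can be sketched as follows. This is a minimal Python illustration using only the standard library's xml.etree.ElementTree, not Selenium. The fragment is a trimmed, hypothetical version of the question's markup, and the leading space in the class value is kept on purpose because the attribute comparison is exact.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical version of the markup from the question.
FRAGMENT = """
<body>
  <div class="ohg-patient-banner-suppl-info-custom-field">
    <span class=" ohp-metadata-label">Location</span>
    <span class=" ohg-patient-banner-suppl-info-value">Tauranga Hospital</span>
  </div>
  <div class="ohg-patient-banner-suppl-info-custom-field">
    <span class=" ohp-metadata-label">GP</span>
    <span class=" ohg-patient-banner-suppl-info-value">-</span>
  </div>
</body>
"""

root = ET.fromstring(FRAGMENT)

# The class-based locator matches one label span per section, in document order.
labels = root.findall(".//span[@class=' ohp-metadata-label']")
label_texts = [span.text for span in labels]

# Python lists are 0-based, so index 1 corresponds to XPath's (...)[2].
second_label = labels[1].text

print(label_texts, second_label)
```

As noted in the answer, the index-based approach is brittle: it silently picks a different label if another matching span is added earlier in the page.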

Find text from html with matches with regex pattern

I want a regex solution to find text values that look like MLA818214667, where the value is placed in an id attribute, like id="MLA818214667". The values to find in the HTML must satisfy three conditions:
It should start with MLA and be placed in id="".
The number after MLA should be more than 6 characters long.
The number should be fully numeric, not mixed with letters.
Note: I want to avoid HtmlAgilityPack in this case because the text is not always valid HTML. So I want to treat it as text, not HTML, and need a solution without any HTML parser.
C#:
var listOfIds = new List<string>();
string html = @"below html sample goes here";
Match match = Regex.Match(html, @"/([A-Za-z0-9\-]+)\.$",
RegexOptions.IgnoreCase);
// matched ids should be added to listOfIds
Html:
<span class="main-title">
Casco Integral Halcon H57 + Combo Termico Invierno Sti Motos
</span>
</h2>
<div class="item__status">
<div class="item__condition">541 vendidos</div>
</div>
</div>
</a>
<form class="item__bookmark-form" action="/search/bookmarks/MLA614364106/make" method="post" id="bookmarkForm" class="bookmark-form">
<button type="submit" class="bookmarks favorite" data-id="MLA614364106">
<div class="item__bookmark">
<div class="icon"></div>
</div>
</button>
<input type="hidden" name="method" value='add'/>
<input type="hidden" name="itemId" value='MLA614364106'/>
<input type="hidden" name="_csrf" value="5fe7b4e6-19d3-42bc-a3bb-15eaeee81f64"/>
</form>
</div>
</li>
<li class="results-item highlighted article grid item-info-height-179">
<div class="rowItem item highlighted item--grid item--has-row-logo new" id="MLA751765547">
<div class="item__image item__image--grid">
<div class="images-viewer" item-url="https://articulo.mercadolibre.com.ar/MLA-751765547-casco-moto-hawk-htl-dr46-rebatible-lett-store-_JM#position=5&type=item&tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" item-id="MLA751765547">
<div class="carousel">
<ul>
<li><a href="https://articulo.mercadolibre.com.ar/MLA-751765547-casco-moto-hawk-htl-dr46-rebatible-lett-store-_JM#position=5&type=item&tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" class="item-link item__js-link">
<img class='lazy-load' width='284' height='284' alt='Casco Moto Hawk Htl Dr46 Rebatible Lett Store' src='https://http2.mlstatic.com/casco-moto-hawk-htl-dr46-rebatible-lett-store-D_NQ_NP_624166-MLA31021954439_062019-W.jpg'/>
</a>
</li>
</ul>
</div>
</div>
</div>
<span class="item-loading__status-bar item-loading__hide"></span>
<a href="https://articulo.mercadolibre.com.ar/MLA-751765547-casco-moto-hawk-htl-dr46-rebatible-lett-store-_JM#position=5&type=item&tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" class="item__info-link item__js-link">
<div class="item__info ">
<div class="item__price ">
<span class="price__symbol">$</span>
<span class="price__fraction">3.725</span>
</div>
<span class="item-installments item__installments--show-card-icon highlighted free-interest item--has-shipping">
<span class="item-installments-text">Hasta 6 cuotas sin interés</span>
</span>
<div class="item__shipping-promise item__shipping highlighted free-shipping">
<span class="text-shipping next_day">Llega gratis el lunes</span>
</div>
<div class="item__brand-logo item__brand-img--ultra-wide">
<span class="item__brand-img-container">
<img src="https://http2.mlstatic.com/D_NQ_NP_796276-MLA31050681849_062019-T.jpg"/>
</span>
</div>
<h2 class="item__title list-view-item-title">
<span class="main-title">Casco Moto Hawk Htl Dr46 Rebatible Lett Store</span>
</h2>
<div class="item__status">
<div class="item__condition">362 vendidos</div>
</div>
</div>
</a>
<form class="item__bookmark-form" action="/search/bookmarks/MLA751765547/make" method="post" id="bookmarkForm" class="bookmark-form">
<button type="submit" class="bookmarks favorite" data-id="MLA751765547">
<div class="item__bookmark">
<div class="icon"></div>
</div>
</button>
<input type="hidden" name="method" value='add'/>
<input type="hidden" name="itemId" value='MLA751765547'/>
<input type="hidden" name="_csrf" value="5fe7b4e6-19d3-42bc-a3bb-15eaeee81f64"/>
</form>
</div>
</li>
<li class="results-item highlighted article grid item-info-height-179">
<div class="rowItem item highlighted item--grid item--has-row-logo new to-item" id="MLA817988063">
<div class="item__image item__image--grid">
<div class="images-viewer" item-url="https://articulo.mercadolibre.com.ar/MLA-817988063-cascos-motos-vega-vflow-motocross-mx-enduro-atv-acces-cam-_JM#position=6&type=item&tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" item-id="MLA817988063">
<div class="carousel">
<ul>
<li>
<a href="https://articulo.mercadolibre.com.ar/MLA-817988063-cascos-motos-vega-vflow-motocross-mx-enduro-atv-acces-cam-_JM#position=6&type=item&tracking_id=897c653e-1565-4371-8a4d-b2ea29d09d4d" class="item-link item__js-link">
<img class='lazy-load' width='284' height='284' alt='Cascos Motos Vega Vflow Motocross Mx Enduro Atv + Acces Cam' src='https://http2.mlstatic.com/cascos-motos-vega-vflow-motocross-mx-enduro-atv-acces-cam-D_NQ_NP_629038-MLA32405702773_102019-W.jpg' />
</a>
</li>
</ul>
</div>
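You can use the pattern "id=\"(MLA[0-9]{6,})\"" to find all the id values in the HTML.
Paste the regex into https://regex101.com to see how it works.
using System.Collections.Generic;
using System.Text.RegularExpressions;
class Program
{
    static void Main(string[] args)
    {
        var listOfIds = new List<string>();
        string html = " id=\"MLA12334566\" id=\"MLA123354566\" id=\"MLA123346566\"";
        // {6,} means "at least 6 digits"; use {7,} if the number must be longer than 6 digits.
        Regex idRegex = new Regex("id=\"(MLA[0-9]{6,})\"");
        foreach (Match match in idRegex.Matches(html))
        {
            // Groups[1] is the captured MLA value; match.Value would also include the surrounding id="...".
            listOfIds.Add(match.Groups[1].Value);
        }
    }
}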
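One thing worth checking with this pattern is that `id="` also occurs inside `data-id="` and `item-id="`, both of which appear in the question's HTML. The sketch below, written in Python with the standard `re` module rather than C#, adds a negative lookbehind so that only a standalone `id` attribute matches, and uses `{7,}` on the assumption that "more than 6 characters" means at least seven digits. The helper name `extract_ids` and the sample string are made up for illustration.

```python
import re

# (?<![\w-]) rejects matches where "id" is the tail of a longer attribute
# name such as data-id or item-id. {7,} assumes "more than 6" means >= 7 digits.
ID_PATTERN = re.compile(r'(?<![\w-])id="(MLA[0-9]{7,})"')


def extract_ids(html):
    """Return the captured MLA values of standalone id attributes, in order."""
    return ID_PATTERN.findall(html)


# Hypothetical sample mixing id, data-id and a too-short value.
sample = (
    '<div class="rowItem" id="MLA751765547">'
    '<form id="bookmarkForm">'
    '<button data-id="MLA614364106"></button>'
    '<div id="MLA12345"></div>'
    '<div class="rowItem" id="MLA817988063">'
)

ids = extract_ids(sample)
print(ids)
```

The same lookbehind syntax is also accepted by the .NET regex engine, so the pattern should carry over to `new Regex(...)` with the quotes escaped for a C# string.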

Getting next element

I need to get the next (sibling) element of the one with "Yes" as its text. I can use the text "Yes", the CSS class and part of the id, but the number (e.g. 106) is unfortunately excluded. I also can't select that sibling directly, because of that exclusion. Here is part of the HTML code:
<a style="right: auto;" class="x-btn x-box-item x-toolbar-item" id="button-106">
<span id="button-106-btnWrap" role="presentation" class="x-btn-wrap" unselectable="on">
<span id="button-106-btnEl" class="x-btn-button" role="presentation">
<span id="button-106-btnInnerEl" class="x-btn-inner x-btn-inner-center">Yes
</span>
<span role="presentation" id="button-106-btnIconEl" class="x-btn-icon-el">
</span>
</span>
</span>
</a>
I came up with this query, but it doesn't seem to work:
By.XPath(".//*[text() = 'Yes' and contains(id(), '-btnInnerEl')/following-sibling::*]")
How can I alter this query so I can get the next element?
To select the span with id button-106-btnIconEl:
//span[contains(@id,'-btnInnerEl')][normalize-space(text())='Yes']/following-sibling::span
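To make explicit what the predicates and the following-sibling step are doing, here is a rough hand-written equivalent in Python using the standard library's xml.etree.ElementTree, which does not support the following-sibling axis itself. It returns only the first following sibling, and the function name `next_sibling_of_yes` and the trimmed fragment are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the button markup from the question.
FRAGMENT = """
<a id="button-106">
  <span id="button-106-btnWrap">
    <span id="button-106-btnEl">
      <span id="button-106-btnInnerEl">Yes
      </span>
      <span id="button-106-btnIconEl"></span>
    </span>
  </span>
</a>
"""


def next_sibling_of_yes(root):
    """Return the element right after the '-btnInnerEl' element whose text is 'Yes'."""
    for parent in root.iter():
        children = list(parent)
        # The last child has no following sibling, so it is skipped.
        for index, child in enumerate(children[:-1]):
            id_matches = "-btnInnerEl" in child.get("id", "")   # contains(@id, ...)
            is_yes = (child.text or "").strip() == "Yes"        # normalize-space(text())='Yes'
            if id_matches and is_yes:
                return children[index + 1]
    return None


sibling = next_sibling_of_yes(ET.fromstring(FRAGMENT))
print(sibling.get("id"))
```

The `strip()` call plays the role of normalize-space() here: the text node in the markup is "Yes" followed by a line break, so an exact comparison without trimming would fail.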

Regex for specific html tag in C# [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I am trying to scrape specific html tags including their data from a google products page. I want to get all the <li> tags within this ordered list and put them in a list.
Here is the code:
<td valign="top">
<div id="center_col">
<div id="res">
<div id="ires">
<ol>
<li class="g">
<div class="pslires">
<div class="psliimg">
<a href=
"https://www.google.com">
</a>
</div>
<div class="psliprice">
<div>
<b>$59.99</b> used
</div><cite>google auctions</cite>
</div>
<div class="pslimain">
<h3 class="r"><a href=
"https://www.google.com">
google</a></h3>
<div>
dummy data </div>
</div>
</div>
</li>
<li class="g">
<div class="pslires">
<div class="psliimg">
<a href=
"https://www.google.com">
</a>
</div>
<div class="psliprice">
<div>
<b>$59.99</b> used
</div><cite>google auctions</cite>
</div>
<div class="pslimain">
<h3 class="r"><a href=
"https://www.google.com">
google</a></h3>
<div>
dummy data </div>
</div>
</div>
</li>
<li class="g">
<div class="pslires">
<div class="psliimg">
<a href=
"https://www.google.com">
</a>
</div>
<div class="psliprice">
<div>
<b>$59.99</b> used
</div><cite>google auctions</cite>
</div>
<div class="pslimain">
<h3 class="r"><a href=
"https://www.google.com">
google</a></h3>
<div>
dummy data </div>
</div>
</div>
</li>
<li class="g">
<div class="pslires">
<div class="psliimg">
<a href=
"https://www.google.com">
</a>
</div>
<div class="psliprice">
<div>
<b>$59.99</b> used
</div><cite>google auctions</cite>
</div>
<div class="pslimain">
<h3 class="r"><a href=
"https://www.google.com">
google</a></h3>
<div>
dummy data </div>
</div>
</div>
</li>
</ol>
</div>
</div>
</div>
<div id="foot">
<p class="flc" id="bfl" style="margin:19px 0 0;text-align:center"><a href=
"/support/websearch/bin/answer.py?answer=134479&hl=en">Search Help</a>
<a href=
"/quality_form?q=Pioneer+Automotive+PF-555-2000&hl=en&tbm=shop">Give us
feedback</a></p>
<div class="flc" id="fll" style="margin:19px auto 19px auto;text-align:center">
Google Home <a href=
"/intl/en/ads">Advertising Programs</a> <a href="/services">Business
Solutions</a> Privacy & Terms <a href=
"/intl/en/about.html">About Google</a>
</div>
</div>
</td>
I want to get all the <li class="g"> tags and the data in each of them. Is that possible?
Instead of using a regex, something like an XML parser may be more useful in your situation. Load the markup into an XML document and then use something like SelectNodes to get the data you are looking for.
http://msdn.microsoft.com/en-us/library/4bektfx9.aspx
I wouldn't use regex for this particular problem.
Instead I would attack it thus:
1) Save the page as an HTML string.
2) Use the aforementioned HtmlAgilityPack or HTML Tidy (my preference) to convert it to XML.
3) Use XDocument to navigate through the DOM by tag and save the data.
Trying to create a regex to extract data from a possibly fluid HTML page will break your heart.
Instead of using regex you can use HtmlAgilityPack to parse the HTML.
var doc = new HtmlDocument();
doc.LoadHtml(html);
var listItems = doc.DocumentNode.SelectNodes("//li");
The code above will give you all <li> items in the document. To add them to a list you'll just have to iterate the collection and add each item to the list.
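The "parse, then collect" approach is not specific to HtmlAgilityPack. As a minimal sketch of the same idea, the Python snippet below uses the standard library's html.parser to gather the text inside each <li class="g">. The class name `ListItemCollector` and the shortened sample are made up for illustration, and the collector keeps only text, not the nested tags.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text found inside every <li class="g"> element."""

    def __init__(self):
        super().__init__()
        self.items = []     # one text string per matching <li>
        self._depth = 0     # > 0 while inside a matching <li>
        self._chunks = []   # text pieces of the <li> being read

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        if self._depth > 0:
            self._depth += 1    # nested <li> inside a match
        elif "g" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.items.append(" ".join(self._chunks))

    def handle_data(self, data):
        # Ignore whitespace-only text and collapse runs of whitespace.
        if self._depth > 0 and data.strip():
            self._chunks.append(" ".join(data.split()))


# Shortened, hypothetical sample based on the question's markup.
sample = """
<ol>
  <li class="g"><div class="psliprice"><b>$59.99</b> used</div><cite>google auctions</cite></li>
  <li class="other">skip me</li>
  <li class="g"><h3 class="r"><a href="https://www.google.com">google</a></h3></li>
</ol>
"""

collector = ListItemCollector()
collector.feed(sample)
collector.close()
print(collector.items)
```

In the HtmlAgilityPack version, restricting the selection to the same elements would be a matter of using an XPath such as "//li[@class='g']" instead of "//li".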
