how to find and extract text from webpage in c#


I want to know how I can get data from a webpage.
Example:
<li id="hello1">about me
<ul class="square">
<li><strong>name: john</strong></li>
</ul>
</li>
I want to read "john", the value after "name:". How can I read it in C#?
I have tried to use HTML Agility Pack, but due to its poor documentation I was not able to use it, so I need help.

Use HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
var nameElement = doc.DocumentNode.SelectSingleNode("//li[@id='hello1']").InnerText;
// nameElement would contain `about me name: john`
Regex.Match(nameElement, @"(?<=name:\s*)\w+").Value; // john
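The lookbehind pattern in this answer can be tried on its own, without HtmlAgilityPack. Below is a minimal sketch using only the standard library. The class name NameExtractor, the method ExtractName, and the sample string are assumptions made for illustration; the string stands in for the InnerText of the li node.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameExtractor
{
    // Returns the first word that follows "name:" (optional whitespace allowed),
    // or an empty string when the label is not present.
    public static string ExtractName(string innerText)
    {
        // .NET supports variable-length lookbehind, so \s* is allowed inside (?<= ... ).
        Match match = Regex.Match(innerText, @"(?<=name:\s*)\w+");
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical stand-in for the InnerText of <li id="hello1">.
        string innerText = "about me\n name: john\n";
        Console.WriteLine(ExtractName(innerText)); // john
    }
}
```

Checking match.Success first avoids silently treating a missing label as a valid empty name.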

I have used HTML Agility Pack before and it is a great tool:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(YourHTML);
var collection = document.DocumentNode.SelectNodes("//li[@id='hello1']");

Related

Grab text value using HTML Agility Pack

Please check the code below. I am trying to grab a text value from this HTML document. I want to grab the text "Quick Kill 32 oz. Mosquito Yard Spray". I already tried SelectSingleNode as shown below, but it can't grab this text value. Any idea how to fix it?
string html = @"<div class='pod-plp__description js-podclick-analytics' data-podaction='product name'>
<a class='' data-pos='0' data-request-type='sr' data-pod-type='pr' href='/p/AMDRO-Quick-Kill-32-oz-Mosquito-Yard-Spray-100530440/304755303'>
<span class='pod-plp__brand-name'>AMDRO</span>
Quick Kill 32 oz. Mosquito Yard Spray
</a>
</div>";
var doc = new HtmlDocument();
doc.Load(html);
string title = doc.DocumentNode
.SelectSingleNode("//div[@class='pod-plp__description js-podclick-analytics']span[@class='pod-plp__brand-name']")
.InnerText;

You are targeting only span[@class='pod-plp__brand-name'], which returns only the text inside the span, but you need following-sibling::text() to grab the text after the span. Please see my example code below. You can also learn more from the official html-agility-pack site.
var titleNode = doc.DocumentNode.SelectSingleNode("//span[@class='pod-plp__brand-name']/following-sibling::text()[1]");
string title = titleNode.InnerText.Trim();
Found solution from here

HTML Agility Pack Node Selection

I'm brand new to HTML Agility Pack (as well as network-based programming in general). I am trying to extract a specific line of HTML, but I don't know enough about HTML Agility Pack's syntax to understand what I'm not writing correctly (and am lost in their documentation). URLs here are modified.
string html;
using (WebClient client = new WebClient())
{
html = client.DownloadString("https://google.com/");
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a"))
{
Debug.Log(img.GetAttributeValue("href", null));
}
return null;
This is what the HTML looks like
<div id="ngg-image-3" class="ngg-gallery-thumbnail-box" >
<div class="ngg-gallery-thumbnail">
<a href="https://urlhere.png"
// More code here
</a>
</div>
</div>
The problem occurs on the foreach line. I've tried matching examples online the best I can but am missing it. TIA.

HtmlAgilityPack (HAP) uses XPath syntax to query nodes: it parses the HTML into a node tree that you can query much like an XML document. So the trick is learning enough XPath to combine tags and attributes into the query you need.
The HTML snippet you pasted isn't well formed (there's no closing > on the anchor tag). Assuming that it is closed, then
//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]
will return an HtmlNodeCollection of only those a tags that have href attributes.
If no nodes meet your criteria, SelectNodes returns null by default rather than an empty collection, so the foreach itself throws a NullReferenceException; check the result for null before looping.
For debugging purposes, perhaps log the node count or the OuterHtml of a less specific query to see what you're getting, e.g.
Debug.Log(doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']")[0].OuterHtml);
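To get a feel for the XPath itself, you can try the expression against a well-formed copy of the snippet with the built-in System.Xml classes. This is only a sketch under that assumption: real pages are rarely valid XML, which is why HtmlAgilityPack is used in practice. The class name GalleryLinks, the method ExtractHrefs, and the sample markup are made up for illustration. One difference to keep in mind: XmlDocument.SelectNodes returns an empty list when nothing matches, so no null check is needed here.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class GalleryLinks
{
    // Returns the href of every <a> that has one, nested under the two gallery divs.
    public static List<string> ExtractHrefs(string wellFormedMarkup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedMarkup); // throws XmlException if the markup is not well formed

        var hrefs = new List<string>();
        XmlNodeList anchors = doc.SelectNodes(
            "//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]");
        foreach (XmlNode anchor in anchors)
        {
            hrefs.Add(anchor.Attributes["href"].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        // Hypothetical, well-formed version of the snippet from the question.
        string markup =
            "<div id=\"ngg-image-3\" class=\"ngg-gallery-thumbnail-box\">" +
            "<div class=\"ngg-gallery-thumbnail\">" +
            "<a href=\"https://urlhere.png\">thumbnail</a>" +
            "<a>no href here</a>" +
            "</div></div>";

        foreach (string href in ExtractHrefs(markup))
        {
            Console.WriteLine(href); // https://urlhere.png
        }
    }
}
```

The second anchor has no href attribute, so the [@href] predicate filters it out.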

Unable to build a regex to match the article tag

I have been trying to create a regex to match the article tag and get all the text.
Here is my article tag:
<article id="post-82" class="post-82 post type-post status-publish format-standard hentry category-publishing">
<div class="entry-content clearfix">
<div class="abh_box abh_box_up abh_box_drop-down"><ul class="abh_tabs"> <li class="abh_about abh_active">
<p>With India playing host,</p>
<footer class="entry-meta-bar clearfix"><div class="entry-meta clearfix">
<span class="comments">No Comments</span>
</div></footer>
</article>
I need everything that is inside the article tag. So far I have tried the following regexes:
<article (.*?)</article>
(?:<article>)(.*?)(?:</article>)
Neither of them works. Please help.

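Don't use regex to parse HTML. Use an HTML parser such as Html Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
// "//article" searches the whole document; a bare "article" only matches direct children of the root
var result = doc.DocumentNode.SelectSingleNode("//article");
// result.InnerHtml then holds everything inside the article tag (result is null if there is no article)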

You don't want to use regex for something like this and you don't need to load an XML parser. Just use .getAttribute("innerHTML") on the element you want the contained HTML for.
For example, this gets only the article element in your supplied HTML by ID.
System.out.println(driver.findElement(By.id("post-82")).getAttribute("innerHTML"));
This gets the HTML for all articles on the page.
for (WebElement article : driver.findElements(By.tagName("article")))
{
System.out.println(article.getAttribute("innerHTML"));
}

You can try this regex:
<article[^>]*>((.|\n)*?)<\/article>
https://regex101.com/r/oOJ9bt/2

How to get an element using c#

I'm new to C#, and I'm trying to access an element from a website using webBrowser. How can I get the "Developers" string from the site:
<div id="title" style="display: block;">
<b>Title:</b> **Developers**
</div>
I tried to use webBrowser1.Document.GetElementById("title"), but I have no idea how to keep going from here.
Thanks :)

You can download the page source using the WebClient class,
then look within it for <b>Title:</b>**Developers**</div> and omit everything besides "Developers".
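The "look for the label and drop the rest" step could look roughly like this. It is a sketch only: the class name TitleExtractor, the method ExtractAfterLabel, and the sample string are assumptions for illustration, and the string stands in for what WebClient.DownloadString would return. Plain string searching like this is brittle, so it only makes sense when the markup is as simple and stable as in the question.

```csharp
using System;

public static class TitleExtractor
{
    // Returns the trimmed text between the first occurrence of `label` and the next `closingTag`,
    // or null when either marker is missing.
    public static string ExtractAfterLabel(string html, string label, string closingTag)
    {
        int labelIndex = html.IndexOf(label, StringComparison.Ordinal);
        if (labelIndex < 0)
        {
            return null;
        }

        int start = labelIndex + label.Length;
        int end = html.IndexOf(closingTag, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null;
        }

        return html.Substring(start, end - start).Trim();
    }

    public static void Main()
    {
        // Hypothetical stand-in for the downloaded page source.
        string html = "<div id=\"title\" style=\"display: block;\">\n<b>Title:</b> Developers\n</div>";
        Console.WriteLine(ExtractAfterLabel(html, "<b>Title:</b>", "</div>")); // Developers
    }
}
```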

HtmlAgilityPack and CsQuery are what many people use to work with HTML pages in .NET, and I'd recommend them too.
But if your task is limited to this simple requirement, and you have <div> markup that is valid XHTML (like the markup sample you posted), then you can treat it as XML. That means you can use a native .NET API such as XDocument or XmlDocument to parse the HTML and run an XPath query to get a specific part of it, for example:
var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";
// or, according to your code snippet, you may be able to do as follows:
// var xml = webBrowser1.Document.GetElementById("title").OuterHtml;
var doc = new XmlDocument();
doc.LoadXml(xml);
var text = doc.DocumentElement.SelectSingleNode("//div/b/following-sibling::text()");
Console.WriteLine(text.InnerText);
// the above prints " Developers"
The XPath above selects the text node (" Developers") next to the <b> node.

You can use HtmlAgilityPack (As mentioned by Giannis http://htmlagilitypack.codeplex.com/). Using a web browser control is too much for this task:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
var el = doc.GetElementbyId("title");
string s = el.InnerHtml; // get the : <b>Title:</b> **Developers**
I haven't tried this code but it should be very close to working.
There must be an InnerText in HtmlAgilityPack as well, allowing you to do this:
string s = el.InnerText; // get the : Title: **Developers**
You can also remove the Title: by removing the appropriate node:
el.SelectSingleNode("//b").Remove();
string s = el.InnerText; // get the : **Developers**
If for some reason you want to stick to the web browser control, I think you can do this:
var el = webBrowser1.Document.GetElementById("title");
string s = el.InnerText; // get the : Title: **Developers**
UPDATE
Note that the //b above is XPath syntax which may be interesting for you to learn:
http://www.w3schools.com/XPath/xpath_syntax.asp
http://www.freeformatter.com/xpath-tester.html

Reading hidden web site textboxes with C#?

First of all, I am pretty much still a beginner, especially when it comes to web stuff.
I am trying to read the content of a text box from a web page that is open in a browser with my winforms application and I am not able to modify the source code of the web page itself. Sadly, the string I am looking for is not simply written in the source code of the page. So I can't just read the page source and parse it. It seems as if the content of the textbox is populated via javascript.
I am generally speaking not sure where to even start here. Any suggestions are very welcome.
Also, I am not sure what other information I should put here. I don't have an idea where to start, so I don't have any code yet to show.
Edit:
I have been trying to use the agility pack, but I am still not sure how to get to what I need. Here is my code so far:
WebClient client = new WebClient();
String html = client.DownloadString(URL);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//div[@class='ember-view']"))
{
HtmlAttribute div = link.Attributes["div"];
if (div != null)
{
outputBox.Text += div.Value;
}
}
When I run the code, I get this:
An unhandled exception of type 'System.NullReferenceException' occurred.
Additional information: Object reference not set to an instance of an object.
When I go to the web page and do Inspect Element I get this (I only copied a few lines):
<html class="no-js" lang="en">
<head></head>
<body class="ember-application" lang="en-US" data-environment="production">
<div id="booting" style="display: none;"></div>
<div id="ember2493" class="ember-view">
<div id="alert" class="ember-view"></div>
I am not sure how to get to, let's say, the inner code of this line:
<div id="alert" class="ember-view"></div>
Also, my apologies if this is something obvious that I am missing, but again, this is all new for me. Thanks for the help so far.

Do you know Html Agility Pack? I always use it for HTML crawling.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
att.Value = FixLink(att); // FixLink is a placeholder for your own logic
}
doc.Save("file.htm");

Perhaps something along the following lines may help ?
var inputs = webBrowser1.Document.GetElementsByTagName("input");
foreach (HtmlElement input in inputs)
{
var id = input.Id;
var name = input.Name;
var val = input.OuterHtml; // can parse value from here
}
