I'm new to C#, and I'm trying to access an element from a website using the WebBrowser control. How can I get the "Developers" string from this markup:
<div id="title" style="display: block;">
<b>Title:</b> **Developers**
</div>
I tried webBrowser1.Document.GetElementById("title"), but I have no idea how to continue from there.
Thanks :)
You can download the page source using the WebClient class,
then search the downloaded text for <b>Title:</b>**Developers**</div> and strip everything except "Developers".
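A minimal sketch of that string-search approach, assuming the page has already been downloaded into a string (for example with WebClient.DownloadString). ExtractBetween is a hypothetical helper, not a framework method; the markup is hard-coded so the sample runs offline, and it is written as a C# 9+ top-level program:

```csharp
using System;

// Hypothetical helper: returns the trimmed text between two markers,
// or an empty string when either marker is missing.
static string ExtractBetween(string source, string start, string end)
{
    int from = source.IndexOf(start, StringComparison.Ordinal);
    if (from < 0) return string.Empty;
    from += start.Length;

    int to = source.IndexOf(end, from, StringComparison.Ordinal);
    if (to < 0) return string.Empty;

    return source.Substring(from, to - from).Trim();
}

// Stand-in for the downloaded page source (assumption: same shape as the question's markup).
string html = "<div id=\"title\" style=\"display: block;\">\n<b>Title:</b> Developers\n</div>";

Console.WriteLine(ExtractBetween(html, "<b>Title:</b>", "</div>"));
```

Plain string searching like this breaks as soon as the markup changes slightly, which is why the other answers suggest a real parser.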
HtmlAgilityPack and CsQuery are what many people use to work with HTML pages in .NET, and I'd recommend them too.
But if your task is limited to this simple requirement, and the <div> markup is valid XHTML (like the sample you posted), then you can treat it as XML. That means you can use a native .NET API such as XDocument or XmlDocument to parse the HTML and run an XPath query to get a specific part of it, for example:
var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";
// or, based on your code snippet, you may be able to do the following:
//var xml = webBrowser1.Document.GetElementById("title").OuterHtml;
var doc = new XmlDocument();
doc.LoadXml(xml);
var text = doc.DocumentElement.SelectSingleNode("//div/b/following-sibling::text()");
Console.WriteLine(text.InnerText);
//above prints " Developers"
The XPath above selects the text node (" Developers") that follows the <b> node.
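For comparison, here is a rough sketch of the XDocument-family route mentioned above. It uses LINQ to XML navigation (Element and NextNode) rather than XPath, which is my substitution and not part of the original answer; the variable names are illustrative only:

```csharp
using System;
using System.Xml.Linq;

var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";

// Parse the fragment; this only works because the markup is well-formed XML.
XElement div = XElement.Parse(xml);

// Find the <b> label, then take the text node that immediately follows it.
XElement label = div.Element("b");
var textNode = label?.NextNode as XText;

// Fall back to an empty string if the expected structure is not there.
string title = textNode?.Value.Trim() ?? "";

Console.WriteLine(title);
```

As with XmlDocument, this throws on real-world HTML that is not well-formed, so it is only suitable for markup you know to be valid XHTML.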
You can use HtmlAgilityPack (As mentioned by Giannis http://htmlagilitypack.codeplex.com/). Using a web browser control is too much for this task:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
var el = doc.GetElementbyId("title");
string s = el.InnerHtml; // get the : <b>Title:</b> **Developers**
I haven't tried this code but it should be very close to working.
HtmlAgilityPack nodes also have an InnerText property, which lets you do this:
string s = el.InnerText; // get the : Title: **Developers**
You can also remove the Title: by removing the appropriate node:
el.SelectSingleNode(".//b").Remove(); // the leading "." limits the //b search to el; a bare "//b" searches the whole document
string s = el.InnerText; // get the : **Developers**
If for some reason you want to stick to the web browser control, I think you can do this:
var el = webBrowser1.Document.GetElementById("title");
string s = el.InnerText; // get the : Title: **Developers**
UPDATE
Note that the //b above is XPath syntax which may be interesting for you to learn:
http://www.w3schools.com/XPath/xpath_syntax.asp
http://www.freeformatter.com/xpath-tester.html
Related
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("https://www.google.com/search?q=" + "msg");
HttpWebResponse myres = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myres.GetResponseStream()))
{
pageContent = sr.ReadToEnd();
}
if (pageContent.Contains("find"))
{
display = "done";
}
Currently this code checks whether "find" exists in the page at a URL and sets display to "done" if it is present.
What I want is to display the whole line or paragraph that contains "find".
So instead of display = "done", I want to store the line that contains "find" in display.
HTML pages don't have lines. Whitespace outside tags is ignored, and an entire minified page may have no newlines at all. Even where newlines do exist, they are simply ignored, even inside tags. That's why <br> is necessary. If you want to find a specific element, you'll have to use an HTML parser like HtmlAgilityPack and identify the element using an XPath or CSS selector expression.
Copying from the landing page examples:
var url = $"https://www.google.com/search?q={msg}";
var web = new HtmlWeb();
var doc = web.Load(url);
var value = doc.DocumentNode
.SelectNodes("//div[@id='center_col']")
.First()
.Attributes["value"].Value;
What you put in SelectNodes depends on what you want to find.
One way to test various expressions is to open the web page you want in a browser, open the browser's Developer Tools and start searching in the Element inspector. The search functionality there accepts XPath and CSS selectors.
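As a rough illustration of selecting "the paragraph that contains a word" instead of a line, the sketch below runs an XPath contains() query. To keep it runnable without extra packages it uses the built-in XmlDocument on a small, hard-coded, well-formed page (both are my assumptions); the same XPath string should be usable with HtmlAgilityPack's SelectSingleNode on real HTML:

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Stand-in page; note that the matching paragraph spans two physical lines.
var page = @"<html><body>
<p>Nothing of interest here.</p>
<p>You can find
the answer in this paragraph.</p>
</body></html>";

var doc = new XmlDocument();
doc.LoadXml(page);

// Select the first <p> whose text contains the search word.
XmlNode match = doc.SelectSingleNode("//p[contains(., 'find')]");

// Collapse internal whitespace so the paragraph reads as a single line.
string display = match == null
    ? ""
    : Regex.Replace(match.InnerText.Trim(), @"\s+", " ");

Console.WriteLine(display);
```

The element, not the physical line, is the unit being selected, so the line break inside the paragraph does not matter.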
Please check the code below. I am trying to grab a text value from this HTML document. I want to get the text Quick Kill 32 oz. Mosquito Yard Spray. I already tried using SelectSingleNode as shown below, but it can't grab this text. Any idea how to fix it?
string html = @"<div class='pod-plp__description js-podclick-analytics' data-podaction='product name'>
<a class='' data-pos='0' data-request-type='sr' data-pod-type='pr' href='/p/AMDRO-Quick-Kill-32-oz-Mosquito-Yard-Spray-100530440/304755303'>
<span class='pod-plp__brand-name'>AMDRO</span>
Quick Kill 32 oz. Mosquito Yard Spray
</a>
</div>";
var doc = new HtmlDocument();
doc.Load(html);
string title = doc.DocumentNode
.SelectSingleNode("//div[@class='pod-plp__description js-podclick-analytics']span[@class='pod-plp__brand-name']")
.InnerText;
You are targeting only span[@class='pod-plp__brand-name'], which returns just the text inside the span, but you need following-sibling::text() to grab the text after the span. Please see my example code below. You can also learn more from the official html-agility-pack site.
var titleNode = doc.DocumentNode.SelectSingleNode("//span[@class='pod-plp__brand-name']/following-sibling::text()[1]");
string title = titleNode.InnerText.Trim();
Found solution from here
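To make the difference between the two selections concrete, here is a small sketch that runs both XPath expressions against the question's markup. It assumes the snippet is well-formed enough to load with the built-in XmlDocument (used here only so the sample runs without HtmlAgilityPack); the markup is slightly trimmed and the variable names are illustrative:

```csharp
using System;
using System.Xml;

var html = @"<div class='pod-plp__description js-podclick-analytics' data-podaction='product name'>
  <a class='' data-pos='0' href='/p/AMDRO-Quick-Kill-32-oz-Mosquito-Yard-Spray-100530440/304755303'>
    <span class='pod-plp__brand-name'>AMDRO</span>
    Quick Kill 32 oz. Mosquito Yard Spray
  </a>
</div>";

var doc = new XmlDocument();
doc.LoadXml(html);

// Selecting the span itself only yields the brand name.
string brand = doc.SelectSingleNode("//span[@class='pod-plp__brand-name']").InnerText;

// The first text node after the span holds the product title.
string title = doc
    .SelectSingleNode("//span[@class='pod-plp__brand-name']/following-sibling::text()[1]")
    .InnerText
    .Trim();

Console.WriteLine(brand);
Console.WriteLine(title);
```

The Trim() call matters because the text node includes the surrounding newlines and indentation.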
I wanted to make a program which reads the description of a picture album over at imgur.com (this for example: https://imgur.com/gallery/DsAE9cv)
The element would be
<div class="post-image-description">One owner?</div>
but I have a hard time getting the description (One owner).
Would be very helpful to get some tips!
I tried using HtmlAgilityPack and using the XPath, but it's not working.
string link = txt_Link.Text;
var web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(link);
var description = doc.DocumentNode.SelectSingleNode("/html[1]/body[1]/div[8]/div[2]/div[2]/div[2]/div[1]/div[2]/p[1]");
txt_Return.Text = description.ToString();
I expected the output "One owner?" but I got "NULL" (the textbox is showing "HtmlAgility.Node").
description.ToString() does not return the node's text, which is why the textbox shows a type name instead of the description.
Use the description.InnerText property to read the text instead.
It returns "One owner?" in your example.
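The same ToString() versus InnerText distinction can be seen with the standard library's XmlNode, which is used below purely as a stand-in so the sketch runs without HtmlAgilityPack. The markup is the element from the question, hard-coded:

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<div class=\"post-image-description\">One owner?</div>");

// Select by class rather than by a long absolute path.
XmlNode description = doc.SelectSingleNode("//div[@class='post-image-description']");

// ToString() is not overridden for elements, so it yields the type name.
string viaToString = description.ToString();

// InnerText yields the text content of the node.
string viaInnerText = description.InnerText;

Console.WriteLine(viaToString);
Console.WriteLine(viaInnerText);
```

In real code it is worth checking the selected node for null before reading InnerText, since a non-matching XPath gives no node at all.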
Try using some online XPath tester tool, like http://xpather.com/
You might try this XPath to get the result you need:
//p[@class='post-image-description']/text()
I'm brand new to HTML Agility Pack (as well as network-based programming in general). I am trying to extract a specific line of HTML, but I don't know enough about HTML Agility Pack's syntax to understand what I'm not writing correctly (and am lost in their documentation). URLs here are modified.
string html;
using (WebClient client = new WebClient())
{
html = client.DownloadString("https://google.com/");
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a"))
{
Debug.Log(img.GetAttributeValue("href", null));
}
return null;
This is what the HTML looks like
<div id="ngg-image-3" class="ngg-gallery-thumbnail-box" >
<div class="ngg-gallery-thumbnail">
<a href="https://urlhere.png"
// More code here
</a>
</div>
</div>
The problem occurs on the foreach line. I've tried matching examples online the best I can but am missing it. TIA.
HtmlAgilityPack uses XPath syntax to query nodes; HAP effectively converts the HTML document into an XML document. So the trick is learning XPath querying so you can get the right combination of tags and attributes for the result you need.
The HTML snippet you pasted isn't well formed (there's no closing > on the anchor tag). Assuming that it is closed, then
//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]
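will return an HtmlNodeCollection containing only those <a> tags that have an href attribute.
If no nodes meet your criteria, SelectNodes returns null rather than an empty collection, so the foreach throws a NullReferenceException instead of simply writing nothing. Check the result for null before looping.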
For debugging purposes, perhaps log the node count or OuterHtml of a less specific query to see what you're getting, e.g.
Debug.Log(doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']")[0].OuterHtml);
I am trying to pull some timer values off of websites using the xpath in the HtmlAgilityPack. However, when I am using the xpath, I get null reference exceptions because a particular node does not exist when I am grabbing it. To test why this was, I used a doc.Save to check the nodes myself, and I found that the nodes truly do not exist. From my understanding, HtmlAgilityPack should download the webpage almost exactly how I see it, with all the data in there as well. However, most of the data in fact is missing.
How exactly am I supposed to grab the timer values, or even an event title from either of the following websites:
http://dulfy.net/2014/04/23/event-timer/
http://guildwarstemple.com/dragontimer/eventsb.php?serverKey=108&langKey=1
My current code to pull just the title of the event from the first timebox from guildwarstemple is:
public void updateEventData()
{
//string Url = "http://dulfy.net/2014/04/23/event-timer/";
string Url = "http://guildwarstemple.com/dragontimer/eventsb.php?serverKey=108&langKey=1";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
doc.Save("c:/doc.html");
Title = doc.DocumentNode.SelectNodes("//*[@id='ep1']/p")[0].InnerText;
//*[@id="scheduleList"]/div[3]
//*[@id="scheduleList"]/div[3]/div[3]/text()
}
Your XPath expression fails because there is only one div with @id='ep1' in the document, and it has no p inside:
<div id="ep1" class="eventTimeBox"></div>
In fact, all the divs in megaContainer are empty in the link you are trying to load with your code.
If you think there should be p elements in there, they are probably being added dynamically via JavaScript, so they might not be available when you scrape the site with a C# client.
In fact, there are some JavaScript variables:
<script>
...
var e7 = 'ep1';
...
var e7t = '57600';
...
Maybe you want to get that data. This:
substring-before(substring-after(normalize-space(//script[contains(.,"var e7t")]),"var e7t = '"),"'")
selects the <script> which contains var e7t and extracts the string in the apostrophes. It will return:
57600
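A sketch of evaluating that expression from C#, assuming a trimmed-down, well-formed stand-in for the page (the real page would normally be loaded through HtmlAgilityPack). It uses the standard XPathNavigator.Evaluate, which returns the string directly because the expression is a string function rather than a node selection:

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

// Hard-coded stand-in for the downloaded page (assumption: only the relevant parts are kept).
var page = @"<html><body>
<div id=""ep1"" class=""eventTimeBox""></div>
<script>
var e7 = 'ep1';
var e7t = '57600';
</script>
</body></html>";

var doc = new XmlDocument();
doc.LoadXml(page);
XPathNavigator nav = doc.CreateNavigator();

// Same expression as above, written as a verbatim string ("" is an escaped quote).
string expr = @"substring-before(substring-after(normalize-space(//script[contains(.,""var e7t"")]),""var e7t = '""),""'"")";

// A string-valued XPath expression evaluates to a System.String.
string timerValue = (string)nav.Evaluate(expr);

Console.WriteLine(timerValue);
```

If the variable is missing from the page, the expression evaluates to an empty string, so it is worth checking for that before parsing the value as a number.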
The same with your other link. The expression:
//*[@id="scheduleList"]
selects an empty div. You can't navigate further inside it:
<div id="scheduleList" style="width: 720px; min-width: 720px; background: #1a1717; color: #656565;"></div>
But this time there seems to be no nested JavaScript that refers to it in the page.