Apologies in advance if this has already been answered (if so, please point me to the right location); I searched here, the web, YouTube and so on for two days and still haven't found an answer.
I would like to extract some data from the following url: https://betcity.ru/en/results/sp_fl=a:46;
I am trying to get all the event names for the day (the first one is Ho Kwan Kit/Wong Chun Ting — Fan Zhendong/Xu Xin, followed by all the others). When I inspect that element I can see this part of the HTML:
<div class="content-results-data__event"><span>Ho Kwan Kit/Wong Chun Ting — Fan Zhendong/Xu Xin</span></div>
My plan was to get all divs with class="content-results-data__event" and then read the inner text from those divs. Every time I run my code I get zero results. Why am I not getting any nodes when I can see that divs with that class exist, and how can I get all the events? (If I learn how to do that, I can get the other info I need from this site.) Here is my code (I have to say I am fairly new to this).
public partial class Scrapper : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
List<string> Events = new List<string>();
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = NewMethod(web);
var Nodes = doc.DocumentNode.SelectNodes(xpath: "//div[@class='content-results-data__event']").ToList();
foreach (var item in Nodes)
{
Events.Add(item.InnerText);
}
GridView1.DataSource = Events;
GridView1.DataBind();
}
private static HtmlDocument NewMethod(HtmlAgilityPack.HtmlWeb web)
{
return web.Load("https://betcity.ru/en/results/sp_fl=a:46;");
}
}
Here is how to get the HTML for one day of matches using Selenium. The rest is HtmlAgilityPack. The site uses self-signed certificates, so I had to configure the driver to accept them. Have fun.
var ffOptions = new FirefoxOptions();
ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe";
ffOptions.LogLevel = FirefoxDriverLogLevel.Default;
ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true };
var service = FirefoxDriverService.CreateDefaultService();
var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120));
string url = "https://betcity.ru/en/results/date=2017-11-19;"; //remember to update the date accordingly.
driver.Navigate().GoToUrl(url);
Thread.Sleep(2000);
Console.Write(driver.PageSource);
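From there, the HtmlAgilityPack side is the same idea as in the question, just fed from the rendered page source instead of HtmlWeb.Load; a minimal sketch (the class name is the one quoted from the markup above):
// Hand the rendered page source to HtmlAgilityPack; unlike HtmlWeb.Load,
// this HTML already contains whatever markup the browser rendered.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(driver.PageSource);

var eventNodes = doc.DocumentNode.SelectNodes("//div[@class='content-results-data__event']");
if (eventNodes != null)
{
    foreach (var node in eventNodes)
    {
        // e.g. "Ho Kwan Kit/Wong Chun Ting — Fan Zhendong/Xu Xin"
        Console.WriteLine(node.InnerText.Trim());
    }
}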
I am new to C# programming. I am trying to scrape data from a div (I want to display the temperature from a web page in a Forms application).
This is my code:
private void btnOnet_Click(object sender, EventArgs e)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("https://pogoda.onet.pl/");
var temperatura = doc.DocumentNode.SelectSingleNode("/html/body/div[1]/div[3]/div/section/div/div[1]/div[2]/div[1]/div[1]/div[2]/div[1]/div[1]/div[1]");
onet.Text = temperatura.InnerText;
}
This is the exception:
System.NullReferenceException:
temperatura was null.
You can use this:
public static bool TryGetTemperature(HtmlAgilityPack.HtmlDocument doc, out int temperature)
{
temperature = 0;
var temp = doc.DocumentNode.SelectSingleNode(
"//div[contains(#class, 'temperature')]/div[contains(#class, 'temp')]");
if (temp == null)
{
return false;
}
// InnerText keeps the encoded degree sign ("&deg;", 5 characters), so strip it before parsing
var text = temp.InnerText.EndsWith("&deg;") ?
temp.InnerText.Substring(0, temp.InnerText.Length - 5) :
temp.InnerText;
return int.TryParse(text, out temperature);
}
If you use XPath, you can select your target more precisely. With your query, a small change in the HTML structure will break your application. Some points:
// means search anywhere in the document
You look for any div that contains the class "temperature" and, inside that node:
you look for a child div with the "temp" class
If you get that node (!= null), you try to convert the degrees (removing the degree suffix first)
And check:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
HtmlWeb web = new HtmlWeb();
doc = web.Load("https://pogoda.onet.pl/");
if (TryGetTemperature(doc, out int temperature))
{
onet.Text = temperature.ToString();
}
UPDATE
I updated TryGetTemperature a bit because the degree sign is encoded. The main problem is the HTML itself: when you request the source code you get HTML that the browser later updates dynamically, so the HTML you receive is not useful for you. It doesn't contain the temperature.
So, I see two alternatives:
You can use a browser control (in Common Controls -> WebBrowser, in the Form Tools alongside Button, Label...), insert it into your form and Navigate to the page. It's not difficult, but you need to learn a few things: wait for the page-downloaded events and then get the source code from the control. Also, I suppose you'll want to hide the browser control. Be careful, sometimes the browser doesn't work correctly when hidden. In that case, you can use a visible Form positioned outside the desktop and handle the activate events to avoid activating that window. Also, hide it from the task switcher (Alt+Tab). Things get harder this way, but sometimes it is the only way.
The simple way is to search for the location you want (e.g. Madryt) and look in DevTools for the request that is made (e.g. https://pogoda.onet.pl/prognoza-pogody/madryt-396099). Use that Url and you get valid HTML.
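With the Url found that way, the TryGetTemperature helper above can be reused; a minimal sketch (Madryt is just the example location from above):
// Load the location-specific page found via DevTools; unlike the landing page,
// this response already contains the temperature markup.
var web = new HtmlWeb();
var doc = web.Load("https://pogoda.onet.pl/prognoza-pogody/madryt-396099");
if (TryGetTemperature(doc, out int temperature))
{
    onet.Text = temperature.ToString();
}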
I am working on web scraping to get values from Yellow Pages, and while iterating through pages the loop isn't moving on to the next page. I added a loop, but it keeps showing data from the same page. I am attaching my code below.
static void Main(string[] args)
{
string webUrl = "https://www.yellowpages.com";
bool Loop = true;
HtmlWeb Web = new HtmlWeb();
//First Url
HtmlDocument doc = Web.Load(webUrl + "/search?search_terms=software&geo_location_terms=Los+Angeles%2C+CA");
var HeaderName = doc.DocumentNode.SelectNodes("//a[@class='business-name']").ToList();
foreach (var abc in HeaderName)
{
Console.WriteLine(abc.InnerText);
}
//Loop through different pages from the paging of that first url and then keep on doing it until Next button returns nothing
while (Loop == true)
{
var NextPageCheck = doc.DocumentNode.SelectNodes("//a[text()='Next']/@href").ToList();
if (NextPageCheck.Count != 0)
{
string link = webUrl + NextPageCheck[0].Attributes["href"].Value;
doc = Web.Load(link);
HeaderName = doc.DocumentNode.SelectNodes("//a[@class='business-name']").ToList();
foreach (var abc in HeaderName)
{
Console.WriteLine(abc.InnerText);
}
}
else
{
Loop = false;
}
}
}
So the issue I am facing is that it keeps showing the results from the 2nd page. I want it to keep iterating until there are no pages left; if there are 400 pages in total, it should follow the page url all the way to 400:
https://www.yellowpages.com/search?search_terms=software&geo_location_terms=Los%20Angeles%2C%20CA&page=2
page=2
Whilst debugging your code, I was getting a null error on the line where you look for the business names the second time around. In the version of HtmlAgilityPack I had installed, the urls were coming back HTML-encoded, so I simply added a decoding step for the url:
string link = webUrl + NextPageCheck[0].Attributes["href"].Value;
var urlDecode = HttpUtility.HtmlDecode(link);
doc = Web.Load(urlDecode);
And it seemed to work fine. As the comment says, next time you post it would be helpful to include the error you are getting and the line it occurs on, so it's easier and faster to track down the actual bug.
Hope this helps.
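Putting it together, the original paging loop only needs the decode step (and ideally a null check); a rough sketch using the question's variable names (here the Next link element is selected directly rather than its @href attribute, and HttpUtility lives in System.Web):
while (Loop)
{
    // The href of the "Next" link can come back HTML-encoded (e.g. &amp; instead of &),
    // so decode it before requesting the next page.
    var nextLinks = doc.DocumentNode.SelectNodes("//a[text()='Next']");
    if (nextLinks != null && nextLinks.Count != 0)
    {
        string link = webUrl + nextLinks[0].Attributes["href"].Value;
        doc = Web.Load(HttpUtility.HtmlDecode(link));

        var names = doc.DocumentNode.SelectNodes("//a[@class='business-name']");
        if (names != null)
        {
            foreach (var name in names)
            {
                Console.WriteLine(name.InnerText);
            }
        }
    }
    else
    {
        Loop = false;
    }
}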
I am trying to get the value of the div with class "darkgreen", which is 46.98. I tried the following code but I am getting a null exception.
Below is the code I am trying:
private void button1_Click(object sender, EventArgs e)
{
var doc = new HtmlWeb().Load("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='darkgreen']");
foreach (HtmlAgilityPack.HtmlNode node in nodes)
{
Console.WriteLine(node.InnerText);
}
}
If I run the same code but with doc.DocumentNode.SelectNodes("//div[@class='rgt-hdr colorize']") it does pull the header data with no error.
I am thinking that maybe child nodes could be a solution, but I am not sure, as I still haven't been able to get it to work.
Your problem is that the HTML you're looking at is created by javascript, and the HTML you load into your document variable is from before whatever the javascript creates. If you look at the page source in your web browser you will see the exact HTML that gets loaded into your HtmlDocument variable.
The example below will give you the data(JSON) that is used to create the table. I don't know whether that is enough for whatever you're trying to do.
public static void Main(string[] args)
{
Console.WriteLine("Program Started!");
HtmlDocument doc;
doc = new HtmlWeb().Load("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//section[@class='bdy content article full cflex reset long table-page']/following-sibling::script[1]");
int start = node.InnerText.IndexOf("[");
int length = node.InnerText.IndexOf("]") - start + 1;
Console.WriteLine(node.InnerText.Substring(start, length));
Console.WriteLine("Program Ended!");
Console.ReadKey();
}
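If that JSON is enough for what you're trying to do, a library such as Newtonsoft.Json can turn the extracted substring into something queryable; a small sketch, assuming the payload really is a JSON array (the "team" field below is made up purely for illustration):
// Parse the substring extracted above with Json.NET (Newtonsoft.Json NuGet package).
var json = node.InnerText.Substring(start, length);
var rows = Newtonsoft.Json.Linq.JArray.Parse(json);
foreach (var row in rows)
{
    // Inspect the real payload in the browser to see which properties exist;
    // "team" here is only a placeholder.
    Console.WriteLine(row["team"]);
}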
Alternative solution
Alternatively, you can use Selenium with PhantomJS, load the HTML from the headless browser into your document variable, and then your xpath will work.
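A rough sketch of that alternative, assuming the Selenium WebDriver package with its PhantomJS driver is installed (PhantomJS is no longer maintained, so treat this purely as an illustration):
// Let the headless browser execute the page's javascript first, then give the
// rendered HTML to HtmlAgilityPack so the //div[@class='darkgreen'] xpath
// from the question has something to match.
using (var driver = new OpenQA.Selenium.PhantomJS.PhantomJSDriver())
{
    driver.Navigate().GoToUrl("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(driver.PageSource);

    var nodes = doc.DocumentNode.SelectNodes("//div[@class='darkgreen']");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerText); // expected to include 46.98
        }
    }
}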
This is the code to get the links:
private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
{
List<string> mainLinks = new List<string>();
var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
foreach (HtmlNode link in linkNodes)
{
var href = link.Attributes["href"].Value;
mainLinks.Add(href);
}
}
return mainLinks;
}
Sometimes the links I'm getting start with "/", for example:
"/videos?feature=mh"
or
"//www.youtube.com/my_videos_upload"
I'm not sure whether just "/" means a proper site, or a site that starts with "/videos?...
or "//www.youtube...
I need to get, each time, the links from a website that start with http or https; maybe links starting with just www also count as a proper site. The question is: what do I define as a proper site address and link, and what not?
I'm sure my getLinks function is not good; the code is not written the way it should be.
This is how I'm adding the links to the list:
private List<string> test(string url, int levels , DoWorkEventArgs eve)
{
HtmlAgilityPack.HtmlDocument doc;
HtmlWeb hw = new HtmlWeb();
List<string> webSites;// = new List<string>();
List<string> csFiles = new List<string>();
try
{
doc = hw.Load(url);
webSites = getLinks(doc);
webSites is a List<string>.
After a few runs I see in the list sites like "/" or, as above, "//videos... or "//www....
Not sure if I understood your question, but
/Videos means it is accessing the Videos folder from the root of the host you are accessing,
ex:
www.somesite.com/Videos
There are absolute and relative Urls, so you are getting different flavors from different links; you need to turn them into absolute urls appropriately (the Uri class will mostly handle it for you; see the sketch after this list).
foo/bar.txt - relative url from the same path as current page
../foo/bar.txt - relative path from one folder above current
/foo/bar.txt - server-relative path from root - same server, path starting from root
//www.sample.com/foo/bar.txt - absolute url with the same scheme (http/https) as current page
http://www.sample.com/foo/bar.txt - complete absolute url
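A small sketch of letting the Uri class do that resolution (the base address below is just an example, not taken from your code):
// Resolve each href flavour against the address of the page the links came from;
// the Uri(baseUri, relative) constructor handles relative, server-relative and
// scheme-relative forms, and leaves complete absolute urls untouched.
var baseUri = new Uri("https://www.youtube.com/somepage");
string[] hrefs =
{
    "foo/bar.txt",
    "../foo/bar.txt",
    "/videos?feature=mh",
    "//www.youtube.com/my_videos_upload",
    "http://www.sample.com/foo/bar.txt"
};
foreach (var href in hrefs)
{
    Console.WriteLine(new Uri(baseUri, href).AbsoluteUri);
}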
It looks like you are using a library that is able to parse/read html tags.
From my understanding,
var href = link.Attributes["href"].Value;
is doing nothing but reading the value of the "href" attribute.
So assuming the website's source code is using links like href="/news"
it will grab and save even the relative links to your list.
Just view the target website's source code and check it against your results.
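If you only want to keep the hrefs that are already absolute http/https addresses, you can filter on a parsed Uri; a minimal sketch of such a helper (the name is mine):
// Returns true only for fully qualified http/https links; relative hrefs and
// scheme-relative ones ("//www....") won't pass, either because they don't parse
// as absolute urls or because their scheme isn't http/https.
private static bool IsAbsoluteHttpLink(string href)
{
    return Uri.TryCreate(href, UriKind.Absolute, out Uri uri)
        && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
}
You could then filter the collected list with something like mainLinks.Where(IsAbsoluteHttpLink).ToList().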
I need to create an HTML parser that, given a blog url, returns a list with all the posts on the page.
I.e. if a page has 10 posts, it should return a list of 10 divs, where each div contains an h1 and a p.
I can't use its RSS feed, because I need to know exactly how it looks to the user, whether it has any ads, images, etc.; also, some blogs show just a summary of their content while the feed has it all, and vice versa.
Anyway, I've made one that downloads the feed and searches the HTML for similar content; it works very well for some blogs, but not for others.
I don't think I can make a parser that works for 100% of the blogs it parses, but I want to make the best possible.
What would be the best approach? Look for tags whose id attribute equals "post" or "content"? Look for p tags? Etc., etc...
Thanks in advance for any help!
I don't think you will be successful at that. You might be able to parse one blog, but if the blog engine changes things, it won't work any more. I also don't think you'll be able to write a generic parser. You might even be partially successful, but it's going to be a fleeting success, because everything is so error prone in this context. If you need content, you should go with RSS. If you need to (simply) store how it looks, you can also do that. But parsing by the way it looks? I don't see concrete success in that.
"Best possible" turns out to be "best reasonable," and you get to define what is reasonable. You can get a very large number of blogs by looking at how common blogging tools (WordPress, LiveJournal, etc.) generate their pages, and code specially for each one.
The general case turns out to be a very hard problem because every blogging tool has its own format. You might be able to infer things using "standard" identifiers like "post", "content", etc., but it's doubtful.
You'll also have difficulty with ads. A lot of ads are generated with JavaScript. So downloading the page will give you just the JavaScript code rather than the HTML that gets generated. If you really want to identify the ads, you'll have to identify the JavaScript code that generates them. Or, your program will have to execute the JavaScript to create the final DOM. And then you're faced with a problem similar to that above: figuring out if some particular bit of HTML is an ad.
There are heuristic methods that are somewhat successful. Check out Identifying a Page's Primary Content for answers to a similar question.
Use the HTML Agility Pack. It is an HTML parser made for this.
I just did something like this for our company's blog, which uses WordPress. This works for us because our WordPress blog hasn't changed in years, but the others are right in that if your HTML changes a lot, parsing becomes a cumbersome solution.
Here is what I recommend:
Using NuGet, install RestSharp and HtmlAgilityPack. Then download Fizzler and include those references in your project (http://code.google.com/p/fizzler/downloads/list).
Here is some sample code I used to implement the blog's search on my site.
using System;
using System.Collections.Generic;
using Fizzler.Systems.HtmlAgilityPack;
using RestSharp;
using RestSharp.Contrib;
namespace BlogSearch
{
public class BlogSearcher
{
const string Site = "http://yourblog.com";
public static List<SearchResult> Get(string searchTerms, int count=10)
{
var searchResults = new List<SearchResult>();
var client = new RestSharp.RestClient(Site);
//note 10 is the page size for the search results
var pages = (int)Math.Ceiling((double)count/10);
for (int page = 1; page <= pages; page++)
{
var request = new RestSharp.RestRequest
{
Method = Method.GET,
//the part after .com/
Resource = "page/" + page
};
//Your search params here
request.AddParameter("s", HttpUtility.UrlEncode(searchTerms));
var res = client.Execute(request);
searchResults.AddRange(ParseHtml(res.Content));
}
return searchResults;
}
public static List<SearchResult> ParseHtml(string html)
{
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var results = doc.DocumentNode.QuerySelectorAll("#content-main > div");
var searchResults = new List<SearchResult>();
foreach(var node in results)
{
bool add = false;
var sr = new SearchResult();
var a = node.QuerySelector(".posttitle > h2 > a");
if (a != null)
{
add = true;
sr.Title = a.InnerText;
sr.Link = a.Attributes["href"].Value;
}
var p = node.QuerySelector(".entry > p");
if (p != null)
{
add = true;
sr.Excerpt = p.InnerText;
}
if(add)
searchResults.Add(sr);
}
return searchResults;
}
}
public class SearchResult
{
public string Title { get; set; }
public string Link { get; set; }
public string Excerpt { get; set; }
}
}
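Calling it then looks something like this (the search term is arbitrary):
// Fetch the first two result pages (count 20, page size 10) and print the hits.
var results = BlogSearch.BlogSearcher.Get("html agility", 20);
foreach (var result in results)
{
    Console.WriteLine(result.Title + " - " + result.Link);
}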
Good luck,
Eric