Getting through multiple pages in web scraping - C#

I am working on web scraping to get values from the Yellow Pages, and while iterating through pages the loop isn't incrementing the page count: I added a loop, but it keeps showing data from the same page. I am attaching my code below.
static void Main(string[] args)
{
    string webUrl = "https://www.yellowpages.com";
    bool Loop = true;
    HtmlWeb Web = new HtmlWeb();

    // First URL
    HtmlDocument doc = Web.Load(webUrl + "/search?search_terms=software&geo_location_terms=Los+Angeles%2C+CA");
    var HeaderName = doc.DocumentNode.SelectNodes("//a[@class='business-name']").ToList();
    foreach (var abc in HeaderName)
    {
        Console.WriteLine(abc.InnerText);
    }

    // Loop through the paging of that first URL and keep going until the Next button returns nothing
    while (Loop == true)
    {
        var NextPageCheck = doc.DocumentNode.SelectNodes("//a[text()='Next']/@href").ToList();
        if (NextPageCheck.Count != 0)
        {
            string link = webUrl + NextPageCheck[0].Attributes["href"].Value;
            doc = Web.Load(link);
            HeaderName = doc.DocumentNode.SelectNodes("//a[@class='business-name']").ToList();
            foreach (var abc in HeaderName)
            {
                Console.WriteLine(abc.InnerText);
            }
        }
        else
        {
            Loop = false;
        }
    }
}
So the issue I am facing is that it keeps showing the results from the 2nd page. I want it to keep iterating until there are no pages left; if there are 400 pages in total, it should follow the page URL all the way to page 400:
https://www.yellowpages.com/search?search_terms=software&geo_location_terms=Los%20Angeles%2C%20CA&page=2
page=2

While debugging your code, I was getting a null error on the line where you look for the business names the second time around. In the version of HtmlAgilityPack I had installed, the URLs were being HTML-encoded, so I simply added a decode of the URL:
string link = webUrl + NextPageCheck[0].Attributes["href"].Value;
// HttpUtility lives in System.Web; this turns the encoded "&amp;" back into "&"
var urlDecode = HttpUtility.HtmlDecode(link);
doc = Web.Load(urlDecode);
And it seemed to work fine. As the comment says, next time you post it would be helpful to include the error you are getting and the line it occurs on, so it's easier and faster to track down the actual bug.
Hope this helps.
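For reference, here is a minimal sketch of the whole corrected loop (assuming HtmlAgilityPack plus System.Web for HttpUtility; note that SelectNodes returns null rather than an empty list when nothing matches, so a null check replaces the Count test):
using System;
using System.Web;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string webUrl = "https://www.yellowpages.com";
        var web = new HtmlWeb();
        var doc = web.Load(webUrl + "/search?search_terms=software&geo_location_terms=Los+Angeles%2C+CA");

        while (true)
        {
            // Print the business names on the current page
            var names = doc.DocumentNode.SelectNodes("//a[@class='business-name']");
            if (names != null)
            {
                foreach (var name in names)
                    Console.WriteLine(name.InnerText);
            }

            // No Next link means we've reached the last page
            var next = doc.DocumentNode.SelectSingleNode("//a[text()='Next']");
            if (next == null)
                break;

            // The href comes back HTML-encoded, so decode it before loading
            string link = webUrl + HttpUtility.HtmlDecode(next.Attributes["href"].Value);
            doc = web.Load(link);
        }
    }
}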

Related

Can the PageSource property in Selenium be updated as JavaScript loads data?

I'm trying to determine if there's specific text on the page. I'm doing this:
public static void WaitForPageToLoad(this IWebDriver driver, string textOnPage)
{
    var timeOut = 0;
    while (timeOut < 60)
    {
        Thread.Sleep(1000);
        timeOut++;
        // Re-read the page source on every pass
        var pageSource = driver.PageSource.ToLower();
        if (pageSource.Contains(textOnPage.ToLower()))
        {
            return;
        }
    }
}
The problem is that the web driver's PageSource property isn't updated after the initial load. The page I'm navigating to loads a bunch of data via JS after the page has already loaded. I don't control the site, so I'm trying to figure out a method to get the updated HTML.
You are trying to solve the wrong problem. You need to wait for the text to appear using an XPath locator:
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(60));
var xpath = $"//*[contains(., '{textOnPage}')]";
wait.Until(ExpectedConditions.ElementIsVisible(By.XPath(xpath)));
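(Depending on your Selenium version, ExpectedConditions may come from the DotNetSeleniumExtras.WaitHelpers NuGet package, namespace SeleniumExtras.WaitHelpers, rather than OpenQA.Selenium.Support.UI, where it has been deprecated.)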
Do you really need to search the entire page?
I'll refer you to this answer: https://stackoverflow.com/a/41223770/1387701. The code there is Java; the C# equivalent is:
string verifyText = driver.FindElement(By.TagName("body")).Text.Trim();
You can then check whether verifyText contains the string you're looking for.
This works MUCH better if you can narrow the location of the text down to a particular web element other than the body.
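A minimal sketch of that approach combined with an explicit wait (a hypothetical helper, assuming the standard Selenium .NET bindings):
public static bool WaitForText(IWebDriver driver, string textOnPage, int timeoutSeconds = 60)
{
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(timeoutSeconds));
    // Re-reads the body text on every poll, so content loaded later by JS is seen
    return wait.Until(d =>
        d.FindElement(By.TagName("body")).Text
         .IndexOf(textOnPage, StringComparison.OrdinalIgnoreCase) >= 0);
}
Until polls the lambda until it returns true (or throws WebDriverTimeoutException), so there is no need for a hand-rolled Thread.Sleep loop.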

Aspose PDF - get text from page that has a matching string

I'm working with an existing library; its goal is to pull text out of PDFs to verify against expected values, quality-checking recorded data against the data in the PDF.
I'm looking for a way to succinctly pull a specific page worth of text given a string that should only fall on that specific page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber
{
    ExtractionOptions =
    {
        FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
    }
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach(var page in pdfDocument.Pages) portion... or is that the right area to be looking?
Answer: the TextAbsorber has to be recreated for each page, inside the foreach loop.
If the absorber isn't recreated, it keeps accumulating the text from previous iterations.
public List<string> ProcessPage(MyInfoClass file, string find)
{
    var pdfDocument = new Document(file.PdfFilePath);
    foreach (Page page in pdfDocument.Pages)
    {
        // A fresh absorber per page, so Text holds only this page's text
        var textAbsorber = new TextAbsorber
        {
            ExtractionOptions =
            {
                FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
            }
        };
        page.Accept(textAbsorber);

        var ext = textAbsorber.Text;
        var exts = ext.Replace("\n", "").Split('\r').ToList();
        if (ext.Contains(find))
            return exts;
    }
    return null;
}
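Usage would then look something like this (MyInfoClass is from the asker's codebase; the property assignment and file path here are illustrative):
var file = new MyInfoClass { PdfFilePath = @"C:\docs\sample.pdf" };
// Returns the lines of the first page whose text contains the search string
List<string> pageLines = ProcessPage(file, "expected value");
if (pageLines != null)
{
    foreach (var line in pageLines)
        Console.WriteLine(line);
}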

HtmlAgilityPack scraping - extracting specific nodes from html document

Apologies in advance if this has already been answered (if so, please point me to the right location). I searched here, the web, YouTube and so on for two days and still haven't found an answer.
I would like to extract some data from following url: https://betcity.ru/en/results/sp_fl=a:46;
I am trying to get all event names for the day (the 1st one is
Ho Kwan Kit/Wong Chun Ting — Fan Zhendong/Xu Xin, and all the others after it). When I inspect that element I can see this part of the HTML:
<div class="content-results-data__event"><span>Ho Kwan Kit/Wong Chun Ting — Fan Zhendong/Xu Xin</span></div>
I was thinking of getting all divs with class="content-results-data__event" and then getting the inner text from those divs. Every time I run my code I get zero results. Why am I not getting any nodes when I can see that divs with this class exist, and how can I get all the events? (If I learn how to do that, I can get the other info I need from this site.) Here is my code (I have to say I am fairly new to this).
public partial class Scrapper : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        List<string> Events = new List<string>();
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = NewMethod(web);
        var Nodes = doc.DocumentNode.SelectNodes(xpath: "//div[@class='content-results-data__event']").ToList();
        foreach (var item in Nodes)
        {
            Events.Add(item.InnerText);
        }
        GridView1.DataSource = Events;
        GridView1.DataBind();
    }

    private static HtmlDocument NewMethod(HtmlAgilityPack.HtmlWeb web)
    {
        return web.Load("https://betcity.ru/en/results/sp_fl=a:46;");
    }
}
Here is how to get the HTML for one day of matches using Selenium; the rest is HtmlAgilityPack. The site uses self-signed certificates, so I had to configure the driver to accept them. Have fun.
var ffOptions = new FirefoxOptions();
ffOptions.BrowserExecutableLocation = @"C:\Program Files (x86)\Mozilla Firefox\firefox.exe";
ffOptions.LogLevel = FirefoxDriverLogLevel.Default;
ffOptions.Profile = new FirefoxProfile { AcceptUntrustedCertificates = true };
var service = FirefoxDriverService.CreateDefaultService();
var driver = new FirefoxDriver(service, ffOptions, TimeSpan.FromSeconds(120));

string url = "https://betcity.ru/en/results/date=2017-11-19;"; // remember to update the date accordingly
driver.Navigate().GoToUrl(url);
Thread.Sleep(2000);
Console.Write(driver.PageSource);
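From there, the HtmlAgilityPack side might look like this (a sketch, assuming the class name from the question's HTML snippet):
var doc = new HtmlDocument();
doc.LoadHtml(driver.PageSource);

// The JavaScript-rendered markup is now in PageSource, so the divs are selectable
var nodes = doc.DocumentNode.SelectNodes("//div[@class='content-results-data__event']");
if (nodes != null)
{
    foreach (var node in nodes)
        Console.WriteLine(node.InnerText.Trim());
}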

C# WebClient parent page

I need to obtain some data from a website. Manually, I would open the search form, enter a document ID, click search, get the search result (page 1), click a specific icon near the only hit (the document ID is unique), and get the desired string on another page (page 2).
The search query URL contains the document ID explicitly, so there is no problem loading page 1. The URL of page 2 can be found in the source code of page 1. But the server seems to check the referrer before serving page 2, and the code execution terminates with error 500.
The C# code is:
using (WebClient client = new WebClient())
{
    string htmlCode0 = client.DownloadString("page_1_url");

    // page 2 URL consists of a common part and a 6-digit ID
    string toSearch = "page_2_url_common";
    int index = htmlCode0.IndexOf(toSearch);
    string toFind = htmlCode0.Substring(index, toSearch.Length + 6);
    string htmlCode1 = client.DownloadString(toFind);
}
Firebug showed this while loading page 2:
if (jQuery) {
    jQuery(window).bind('beforeunload', function () {
        if (window.doNotLock2 === undefined) {
            if (window.doNotLock) {
                window.doNotLock = false;
            } else {
                showLoading();
            }
        }
        try {
            if (window.childPopup) {
                window.childPopup.close();
                unBlockWindow();
                window.childPopup = null;
            }
            var win = window.opener;
            while (win.parent && win.parent != win) win = win.parent;
            win.unBlockWindow();
            win = null;
        } catch (error) {
            // if window open from other context - window.opener access denied, do nothing
        }
        return;
    });
}
Is there any way to skip or override this check and get page 2 content?
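If the server really is validating the Referer header (an assumption; the 500 could equally come from cookies or other session state), one minimal thing to try is presenting page 1's URL as the referrer on the second request. Note that WebClient clears custom headers after every request, so the header is set immediately before the second call:
using (WebClient client = new WebClient())
{
    string htmlCode0 = client.DownloadString("page_1_url");

    // page 2 URL consists of a common part and a 6-digit ID
    string toSearch = "page_2_url_common";
    int index = htmlCode0.IndexOf(toSearch);
    string toFind = htmlCode0.Substring(index, toSearch.Length + 6);

    // Pretend this request was triggered from page 1
    client.Headers[HttpRequestHeader.Referer] = "page_1_url";
    string htmlCode1 = client.DownloadString(toFind);
}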

Select elements added to the DOM by a script

I've been trying to get either an <object> or an <embed> tag using:
HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
This doesn't seem to work.
Can anyone please tell me how to get these tags and their InnerHtml?
A YouTube embedded video looks like this:
<embed height="385" width="640" type="application/x-shockwave-flash"
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">
I got a feeling the JavaScript might stop the swf player from working, hope not...
Cheers
Update 2010-08-26 (in response to OP's comment):
I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:
string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.
Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
YouTubeScraper
OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.
class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);
        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];
            string javascript = scriptNode.InnerHtml;
            int objectNodeLocation = javascript.IndexOf("<object");
            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);
                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
                if (objectNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
                    string unescaped = Unescape(finalEscapedHtml);
                    var objectDoc = new HtmlDocument();
                    objectDoc.LoadHtml(unescaped);
                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
                    return objectNode;
                }
            }
        }
        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);
        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];
            string javascript = scriptNode.InnerHtml;
            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
            if (approxEmbedNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);
                int embedNodeEndLocation = htmlStart.IndexOf(">\";");
                if (embedNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
                    string unescaped = Unescape(finalEscapedHtml);
                    var embedDoc = new HtmlDocument();
                    embedDoc.LoadHtml(unescaped);
                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
                    return videoEmbedNode;
                }
            }
        }
        return null;
    }

    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();
        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }
        HtmlNode root = doc.DocumentNode;
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");
        return scriptNodes;
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.
        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();
        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }
        return text;
    }
}
And in case you're interested, here's a little demo I threw together (super fancy, I know):
class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();

        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();

        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}
Original Answer
Why not try using the element's Id instead?
HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.
