C# .NET: Scraping dynamic (JS) websites

After hours of failed attempts, I am asking here. I need to scrape a dynamically generated webpage (built with Vue.js, but I would prefer not to share the link).
I have tried multiple approaches (1, 2, 3). None of them work on this webpage.
The most promising solution was using Selenium and PhantomJS. I tried it like this, and I'm not sure why it doesn't even work for Google:
private void button1_Click(object sender, EventArgs e) {
    PhantomJSDriverService service = PhantomJSDriverService.CreateDefaultService();
    service.IgnoreSslErrors = true;
    service.LoadImages = false;
    service.ProxyType = "none";
    var driver = new PhantomJSDriver(service); // I also tried: new PhantomJSDriver();
    driver.Manage().Timeouts().PageLoad = TimeSpan.FromSeconds(10);
    driver.Url = "https://google.com";
    driver.Navigate();
    var source = driver.PageSource;
    textBox1.AppendText(source);
}
It did not work.
I also tried a WebBrowser control, but the page never fully loads.
(EDIT: I found out that WebBrowser just hosts IE. When I open the target website in standalone IE, the page also never loads completely, so it makes sense to see the same behaviour inside the WebBrowser control. I think this leaves me bound to Selenium and PhantomJS.)
Surely this shouldn't be so complicated. How do I do it properly?

If you need to scrape a website, you can use the ScrapySharp scraping framework. You can add it to a project as a NuGet package.
https://www.nuget.org/packages/ScrapySharp/
Install-Package ScrapySharp -Version 2.6.2
It has many useful properties for accessing different elements on the page. For example, to get the entire HTML of the page you can use the following:
ScrapingBrowser Browser = new ScrapingBrowser();
WebPage PageResult = Browser.NavigateToPage(new Uri("http://www.example-site.com"));
HtmlNode rawHTML = PageResult.Html;
Console.WriteLine(rawHTML.InnerHtml);
Console.ReadLine();
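As far as I know, ScrapySharp downloads and parses the HTML the server returns but does not execute JavaScript, so a Vue.js page may come back mostly empty. Below is a minimal sketch of the Selenium route from the question, using headless Chrome instead of PhantomJS. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages and a chromedriver matching the installed Chrome. The class name DynamicPageScraper, the method name GetRenderedHtml, and the selector passed in are hypothetical, not taken from the question.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

static class DynamicPageScraper
{
    // Loads the page in headless Chrome, waits until an element matching
    // cssSelector is present in the DOM, then returns the current page source.
    public static string GetRenderedHtml(string url, string cssSelector, TimeSpan timeout)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // no visible browser window

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // WebDriverWait retries the lambda until it returns a non-null element
            // or the timeout expires (then it throws WebDriverTimeoutException).
            var wait = new WebDriverWait(driver, timeout);
            wait.Until(d => d.FindElement(By.CssSelector(cssSelector)));

            return driver.PageSource;
        }
    }
}
```

A call could look like DynamicPageScraper.GetRenderedHtml("https://example.com", ".result-row", TimeSpan.FromSeconds(10)). The selector should point at something that only exists after the scripts have rendered; an element that is already in the static HTML would end the wait too early.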

Related

Avoid load images and other resources like css when use Selenium in c#

Good morning.
I am developing a spider to review a few web pages. I can't do it without Selenium, but Selenium consumes a lot of resources and is slow. I am looking for a way to optimize it.
From what I see, the main problem is that Selenium loads the entire website with all its resources. I only need the JavaScript and HTML to work; I don't need images. Can I somehow prevent images from loading in the Selenium browser in C#?
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

using (IWebDriver driver = SeleniumUtility.GetChromeDriverHidden())
{
    driver.Url = "https://stackoverflow.com/";
    string html = driver.PageSource;
}

internal static ChromeDriver GetChromeDriverHidden(bool hidden = true)
{
    ChromeDriverService service = ChromeDriverService.CreateDefaultService(".");
    service.HideCommandPromptWindow = true; // Hide output commands in console
    var options = new ChromeOptions()
    {
        AcceptInsecureCertificates = true // This lets the browser accept the insecure certificate. Set hidden = false
    };
    if (hidden)
    {
        options.AddArgument("headless"); // hide window if added to options
    }
    return new ChromeDriver(service, options);
}
I see one solution, but in C# I don't understand how to do it.

Try this, I hope it helps
ChromeOptions options = new ChromeOptions();
options.AddArguments("headless", "--blink-settings=imagesEnabled=false");
Or
IWebDriver driver;
ChromeOptions options = new ChromeOptions();
options.AddUserProfilePreference("profile.default_content_setting_values.images", 2);
driver = new ChromeDriver(options);
See the original answer here

WatiN doesn't find anything

I'm new to C# and I'm trying to write an application that automates Internet Explorer.
When I click a button, the application does this:
using ( var Browser = new IE())
{
Browser.GoTo("http://testweb.com");
Browser.TextField(Find.ByName("username")).TypeText("User");
Browser.TextField(Find.ByName("password")).TypeText("Pass");
}
But it doesn't type anything. It navigates to the website, but...

Try this:
IE ie = null;
ie = new IE();
ie.GoTo("Link");
ie.WaitForComplete();
At least to get started.
For the other bit, you need to identify the element exactly, and then you can tell WatiN to interact with it.
TextField userTextBox = ie.TextField(Find.ByName("name"));
userTextBox.TypeText("user");
This may seem banal, but now you can add a peek definition in your code and see whether "userTextBox" gets found by name. If it doesn't, you need to find it through another method (ID or class).

Open link in new tab selenium c#

I am writing a program that plays the videos listed on my site for testing purposes, and I need to run the videos in different tabs of the same browser window.
I have a hundred video URLs in the list List<string> videoLinks = getVideoUrls();
and I need to play these videos 5 at a time.
ChromeDriver driver = new ChromeDriver();
driver.Navigate().GoToUrl("https://www.withoutabox.com" + videoLink);
If I go the above way then for all videos I will have to create a new ChromeDriver object. I want to use single chrome browser object.
I have tried this
IWebElement body = driver.FindElement(By.TagName("body"));
body.SendKeys(Keys.Control + "t");
It only adds a new tab but does not open a link there.
Please let me know how I should go about it. I have googled but couldn't find a solution, so I thought I would ask for help.

Try this:
public void SwitchToTab(object pageId)
{
    webDriver.SwitchTo().Window(pageId.ToString());
}
You can use CurrentWindowHandle to find current tab.
webDriver.CurrentWindowHandle;
For your scenario I'm using that code:
public IPageAdapter OpenNewTab(string url)
{
    var windowHandles = webDriver.WindowHandles;
    scriptExecutor.ExecuteScript(string.Format("window.open('{0}', '_blank');", url));
    var newWindowHandles = webDriver.WindowHandles;
    var openedWindowHandle = newWindowHandles.Except(windowHandles).Single();
    webDriver.SwitchTo().Window(openedWindowHandle);
    return new SeleniumPage(webDriver);
}
Update
window.open creates a new popup. By default this can be blocked by browser settings. Disable popup blocking in your browser manually.
To check this, open the JS console in your browser and try to execute the command window.open('http://facebook.com', '_blank');
If a new window opens successfully, then everything is OK.
You can also create your Chrome driver with a specific setting. Here is my code:
var chromeDriverService = ChromeDriverService.CreateDefaultService();
var chromeOptions = new ChromeOptions();
chromeOptions.AddUserProfilePreference("profile.default_content_settings.popups", 0);
return new ChromeDriver(chromeDriverService, chromeOptions, TimeSpan.FromSeconds(150));

Here is a simple solution for opening a new tab in Selenium C#:
driver.Url = "http://www.gmail.net";
IJavaScriptExecutor js = (IJavaScriptExecutor)driver;
js.ExecuteScript("window.open();");
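The question asks for five videos at a time, which neither snippet above covers directly. Here is a rough sketch that combines the window.open approach from the answers with simple batching. The names TabBatchRunner, Chunk, OpenBatchInTabs and CloseTabs are hypothetical, and it assumes popups are allowed and that the new window handle is already listed right after window.open returns (which may need a short wait in practice).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;

static class TabBatchRunner
{
    // Splits the URL list into consecutive groups of at most batchSize items.
    public static List<List<string>> Chunk(IReadOnlyList<string> urls, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batches = new List<List<string>>();
        for (int i = 0; i < urls.Count; i += batchSize)
        {
            batches.Add(urls.Skip(i).Take(batchSize).ToList());
        }
        return batches;
    }

    // Opens every URL of one batch in its own tab and returns the new window handles.
    public static List<string> OpenBatchInTabs(IWebDriver driver, IEnumerable<string> batch)
    {
        var js = (IJavaScriptExecutor)driver;
        var opened = new List<string>();
        foreach (string url in batch)
        {
            var before = driver.WindowHandles;
            // Pass the URL as a script argument instead of formatting it into the script text.
            js.ExecuteScript("window.open(arguments[0], '_blank');", url);
            opened.Add(driver.WindowHandles.Except(before).Single());
        }
        return opened;
    }

    // Closes the given tabs and switches back to the original one.
    public static void CloseTabs(IWebDriver driver, IEnumerable<string> handles, string originalHandle)
    {
        foreach (string handle in handles)
        {
            driver.SwitchTo().Window(handle);
            driver.Close();
        }
        driver.SwitchTo().Window(originalHandle);
    }
}
```

With a single driver, one would remember driver.CurrentWindowHandle, loop over TabBatchRunner.Chunk(videoLinks, 5), call OpenBatchInTabs for each batch, wait as long as the videos need, and then call CloseTabs before moving on to the next batch.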

disable IE visibility while using WatiN

I use WatiN because I need to open some websites in the background, and the sites require JavaScript support. I don't know if WatiN is the best tool for this job; at the moment it takes very long until Internet Explorer becomes visible. I need to stop Internet Explorer from popping up while using WatiN, since the user doesn't need to see the sites opening. Is it possible with WatiN to visit a website without showing it to the user, or should I use an alternative that supports client-side JS?
My code at the moment:
public static void visitURL()
{
    IE iehandler = new IE("http://www.isjavascriptenabled.com");
    if (iehandler.ContainsText("Yes"))
        Console.WriteLine("js on");
    else
        Console.WriteLine("js off");
}

The WatiN.Core.IE class has a Visible property. You can initialize the object like this:
new WatiN.Core.IE() { Visible = false }
This way IE will just blink on the screen when it's created, and then it will be hidden. You can later control the visibility of IE with the ShowWindow method of the WatiN.Core.IE class: you can show it on the screen if you need to, or hide it again.

I use exactly that trick (of hiding IE) for writing unit tests (using https://github.com/o2platform/FluentSharp_Fork.WatiN) that run in a hidden IE window.
For example, here is how I create a helper class (with a configurable hidden value):
public IE_TeamMentor(string webRoot, string path_XmlLibraries, Uri siteUri, bool startHidden)
{
this.ie = "Test_IE_TeamMentor".popupWindow(1000,700,startHidden).add_IE();
this.path_XmlLibraries = path_XmlLibraries;
this.webRoot = webRoot;
this.siteUri = siteUri;
}
which is then consumed by this test:
[Test] public void View_Markdown_Article__Edit__Save()
{
var article = tmProxy.editor_Assert() // assert the editor user (or the calls below will fail due to security demands)
.library_New_Article_New() // create new article
.assert_Not_Null();
var ieTeamMentor = this.new_IE_TeamMentor_Hidden();
var ie = ieTeamMentor.ie;
ieTeamMentor.login_Default_Admin_Account("/article/{0}".format(article.Metadata.Id)); // Login as admin and redirect to article page
var original_Content = ie.element("guidanceItem").innerText().assert_Not_Null(); // get reference to current content
ie.assert_Has_Link("Markdown Editor")
.link ("Markdown Editor").click(); // open markdown editor page
ie.wait_For_Element_InnerHtml("Content").assert_Not_Null()
.element ("Content").innerHtml()
.assert_Is(original_Content); // confirm content matches what was on the view page
var new_Content = "This is the new content of this article".add_5_RandomLetters(); // new 'test content'
ie.element("Content").to_Field().value(new_Content); // put new content in markdown editor
ie.button("Save").click(); // save
ie.wait_For_Element_InnerHtml("guidanceItem").assert_Not_Null()
.element ("guidanceItem").innerHtml()
.assert_Is("<P>{0}</P>".format(new_Content)); // confirm that 'test content' was saved ok (and was markdown transformed)
ieTeamMentor.close();
}
Here are a number of posts that might help you to understand how I use it:
https://github.com/TeamMentor/Dev/tree/master/Source_Code/TM_UnitTests/TeamMentor.UnitTests.QA/TeamMentor_QA_IE
http://blog.diniscruz.com/2014/07/how-to-debug-cassini-hosted-website-and.html
http://blog.diniscruz.com/2014/07/using-watin-and-embedded-cassini-to-run.html
http://blog.diniscruz.com/search/label/WatiN

WatiN need to declare browser outside of STAThreaded thread

I'm currently trying to use WatiN to do some automatic data collection. I used to use a WebBrowser control: I started an STA thread and ran the control from there, then in Browser.DocumentCompleted I invoked a void LoginPageLoaded method, in which I set the user and password and logged in. With WatiN, I'm trying to do this:
var th = new Thread(() =>
{
    Browser browser = new IE(url);
    browser.WaitForComplete();
    HolyThunder hl = new HolyThunder();
    hl.LoginPageLoaded();
});
th.SetApartmentState(ApartmentState.STA);
th.Start();
But obviously, when I try to use the browser instance in LoginPageLoaded, it says it doesn't know what it is, because it was declared inside the th thread. I didn't have this problem when I ran LoginPageLoaded through browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(LoginPageLoaded);, and I obviously can't declare the browser outside of that STA thread. What can I do to fix this?
EDIT - I failed so badly... If I do everything I need to do inside that STA thread, everything works. My question now is: in WatiN, is there any way to open a "webbrowser" without actually having the visual component of the web browser?

You can try using NHtmlUnit, it's a headless browser. I am not sure if you can do it with WatiN.

You can run WebKit headless (see, for example http://phantomjs.org/), but I don't know if you can drive it from WatiN.

I guess you want to hide web browser window.
In that case try this:
Settings.Instance.MakeNewIeInstanceVisible = false;

To open a "webbrowser" without actually having the visual component of the webbrowser, use
var browser = new IE();
browser.Visible = false;
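Putting the question's EDIT and this answer together, a minimal sketch could create, use and dispose the browser entirely inside one STA thread and hand only the result back to the caller. StaBrowserRunner, RunOnStaThread and PageContains are hypothetical names; the WatiN calls are limited to ones that already appear on this page (new IE(), Visible, GoTo, WaitForComplete, ContainsText), and I have not verified it against a particular WatiN version.

```csharp
using System;
using System.Threading;
using WatiN.Core;

static class StaBrowserRunner
{
    // Runs work on a dedicated STA thread, blocks until it finishes,
    // and returns its result (or rethrows its failure) on the calling thread.
    public static T RunOnStaThread<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;

        var thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();

        if (error != null) throw new InvalidOperationException("STA work failed.", error);
        return result;
    }

    // The browser never leaves the STA thread; only the bool does.
    public static bool PageContains(string url, string text)
    {
        return RunOnStaThread(() =>
        {
            using (var browser = new IE())
            {
                browser.Visible = false; // hide the IE window
                browser.GoTo(url);
                browser.WaitForComplete();
                return browser.ContainsText(text);
            }
        });
    }
}
```

Because the browser object is confined to the lambda, nothing outside the thread needs a reference to it, which sidesteps the scoping problem described in the question.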
