Capture failed loads of content using Selenium - C#

When loading a web page, the browser executes many GET requests to fetch resources such as images, CSS files, fonts, and so on.
Is there a way to capture failed GET requests using Selenium in C#?

Selenium does not natively provide this capability. I'm coming to this conclusion for two reasons:
I've not seen any function exported by Selenium's API that would allow doing what you want in a cross-platform way.
(I say "cross-platform way" because I'm excluding from consideration possible non-standard APIs that could be exported by one browser but not others.)
If there is any doubt that I may have missed something, then consider that ...
The Selenium team has quite consciously decided not to provide any means to get the response code of the HTTP request that downloads the page in the first place. It is extremely doubtful that they would have quietly slipped in a way to get the response codes of the other HTTP requests launched to load additional resources.
The way to check on such requests is to have the browser launched by Selenium connect through a proxy that records the responses, or to load the page with something other than Selenium.
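As a starting point, here is a minimal sketch of the proxy approach: route the Selenium-driven browser through a local recording proxy (Fiddler, mitmproxy, or similar) and read the failed requests off the proxy's log. The localhost:8888 address and the URL are assumptions for illustration; adjust them to whatever you actually run.

// Sketch: route a Selenium-driven Chrome through a local recording proxy.
// Assumes a proxy (e.g., Fiddler) is already listening on localhost:8888;
// failed GETs (4xx/5xx) then show up in the proxy, not in Selenium itself.
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

var proxy = new Proxy
{
    HttpProxy = "localhost:8888",
    SslProxy = "localhost:8888"
};

var options = new ChromeOptions { Proxy = proxy };
using (IWebDriver driver = new ChromeDriver(options))
{
    driver.Navigate().GoToUrl("https://example.com"); // example URL
    // Inspect the proxy's log afterwards for failed resource loads.
}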

Related

Scraping web pages (including AJAX) from a .NET solution

Has anyone had success with scraping software in an Azure Function? It needs to be performed with some kind of dynamic content loading, like the WebBrowser control or Selenium, where all content is loaded before scraping starts. Selenium seems not to be an option due to the nature of Azure Functions.
I am trying to scrape some web pages and extract content. The pages are pretty dynamic: first the HTML is loaded, and then the data is lazy-loaded through JavaScript. Using a standard HTTP request, I will not get the data. I could use the WebBrowser control in .NET and wait for the Ready state, but that control requires a browser and cannot be used in an Azure Function. HtmlAgilityPack could be the right answer. I tried it 5 years ago, and at that point it was pretty terrible at handling HTML formatting. I can see they have some kind of JavaScript library that could be worth a try. Have you tried using that part of HtmlAgilityPack?
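For context, a basic HtmlAgilityPack fetch looks like the sketch below. Note the caveat that matters here: HtmlAgilityPack only parses static HTML and does not execute JavaScript, so the lazy-loaded data would still be missing. The URL and XPath are placeholders.

// Sketch: HtmlAgilityPack parses static markup only; no JavaScript runs.
using HtmlAgilityPack;

var web = new HtmlWeb();
HtmlDocument doc = web.Load("https://example.com");        // placeholder URL
var nodes = doc.DocumentNode.SelectNodes("//article//h2"); // placeholder XPath; null if no match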
Your question is purely a .NET/C# question (at least I assume you use .NET and C#).
Refer to this answer, please. If you can achieve your goal in some way via .NET, you can do it in an Azure Function - there are no restrictions on that side of the road.
You will certainly need an external third-party library that somehow simulates a web browser. I know that Selenium uses browser "drivers" (not sure of the details) - this could be an idea to research more thoroughly.
I was (and soon will be again) challenged with a similar request, and I found no obvious solution. My personal expectation is that a dedicated external service should be developed, which could then send the result to an Azure HTTP-trigger function that proceeds with the analysis. This "service" could even expose a Web API interface to be consumed from anywhere (e.g., an Azure Function).
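To make the receiving side of that idea concrete, here is a minimal sketch assuming the in-process Azure Functions programming model; the function name and route are made up for illustration, and the analysis step is left as a comment.

// Sketch: an HTTP-triggered Azure Function that receives pre-rendered HTML
// from the hypothetical external scraping service and analyzes it.
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class ReceiveScrapedHtml
{
    [FunctionName("ReceiveScrapedHtml")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        string html = await new StreamReader(req.Body).ReadToEndAsync();
        // ... analyze the HTML here (e.g., with HtmlAgilityPack) ...
        return new OkObjectResult("received " + html.Length + " characters");
    }
}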

Prevent Selenium from opening new window

Today, I use Selenium to parse data from a website. Here is my code:
public ActionResult ParseData()
{
    IWebDriver driver = new FirefoxDriver();
    driver.Navigate().GoToUrl(myURL);
    IList<IWebElement> nameList = driver.FindElements(By.XPath(myXPath));
    return View(nameList);
}
The problem is, whenever it runs, it opens a new window at the myURL location, gets the data, and leaves that window open.
I don't want Selenium to open any new window here - just run in the background and give me the parsed data. How can I achieve that? Please help me. Thanks a lot.
Generally I agree with andrei: why use Selenium if you are not planning to interact with the browser window?
Having said that, the simplest way to prevent Selenium from leaving the window open is to close it before returning from the function:
driver.Quit();
Another option, if the page doesn't have to be loaded in Firefox, is to use the HtmlUnit driver instead (it has no UI).
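As a side note, newer Selenium versions can also run Firefox headless, which avoids the visible window entirely. A sketch, assuming a Selenium/geckodriver combination that supports headless mode (myURL and myXPath are the placeholders from the question):

// Sketch: run Firefox without a visible window (requires headless support
// in your Selenium/geckodriver versions).
var options = new FirefoxOptions();
options.AddArgument("--headless");
using (IWebDriver driver = new FirefoxDriver(options))
{
    driver.Navigate().GoToUrl(myURL);
    var nameList = driver.FindElements(By.XPath(myXPath));
    // ... use nameList; no window ever appears, and the driver is disposed ...
}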
Well, it seems that on each web request you are creating (but not closing or disposing) a Selenium driver object. As I said in the comment, there may be better solutions for your problem...
As you want to fetch a web page and extract some data from it, feel free to use one of the following (a sketch follows the list):
WebClient
WebRequest
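A minimal WebClient sketch, assuming the data you need is present in the static HTML (myURL is the same placeholder as in the question):

// Sketch: fetch the raw HTML without a browser; works only if the data
// is in the static markup (no JavaScript is executed).
using (var client = new System.Net.WebClient())
{
    string html = client.DownloadString(myURL);
    // ... extract the data from html, e.g., with HtmlAgilityPack ...
}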
A web application is not a very hospitable environment for a Selenium driver instance, IMHO. Though, if you still want to play with it, make the Selenium instance static and reuse it among requests. Still, if it is used from concurrent requests (multiple threads running at the same time), a crash is very probable :) You have the option to protect the instance (locks, critical sections, etc.), but then you will have zero scalability.
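To make that trade-off concrete, a sketch of the shared-instance idea (hypothetical; it serializes every request behind one lock, which is exactly the scalability problem just described):

// Sketch: one static driver shared by all requests, protected by a lock.
// Every request waits its turn - zero scalability, as noted above.
private static readonly object DriverLock = new object();
private static IWebDriver driver;

public ActionResult ParseData()
{
    lock (DriverLock)
    {
        if (driver == null)
            driver = new FirefoxDriver();
        driver.Navigate().GoToUrl(myURL);
        var nameList = driver.FindElements(By.XPath(myXPath));
        return View(nameList);
    }
}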
Short answer: fetch the data another way; Selenium is meant for automated browser tests, as far as I know...
But...
If you really have to explore that website - the source of your data - with Selenium, then fetch the data with Selenium in advance - speculatively, in another process (a console application running in the background) - and store it in files or in a database. Then, from the web application, read the stored data and return it to your clients :)
If you do not yet have the data the client asked for, respond with an error - "please try again in 5 minutes" - and tell the background console application to fetch that data. There are various ways of communicating across process boundaries - the web app and the console app, in our case - but you can use a simple file or database table for queuing the "data requests" - whatever works...
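A sketch of the web-application side of that design; DataStore and RequestQueue are hypothetical helpers standing in for whatever file or database mechanism you pick:

// Sketch of the web side: serve pre-fetched data, queue misses for the
// background console app. DataStore/RequestQueue are hypothetical helpers.
public ActionResult ParseData()
{
    IList<string> names = DataStore.TryGet(myURL);   // previously saved by the console app
    if (names != null)
        return View(names);

    RequestQueue.Enqueue(myURL);                     // the console app polls this queue
    return new HttpStatusCodeResult(202, "Please try again in 5 minutes");
}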

WebClient runs javascript

I have an .aspx page that has some JavaScript functions that control paging.
I can run such a JavaScript function via the WebBrowser control, with the following call inside WebBrowser1_DocumentCompleted:
WebBrowser1.Document.Window.DomWindow.execScript("somefunction();", "JavaScript")
The WebBrowser control is very slow, and I would prefer to use System.Net.WebClient.DownloadString.
Is there some way to run this script with the System.Net.WebClient methods, which are faster, or some other way?
Well, no. WebClient is an HTTP client, not a web browser.
An HTTP client follows the HTTP spec; the fact that your HTTP requests result in HTML is irrelevant to the client.
A web browser, on the other hand, in addition to being an HTTP client, also knows how to parse HTML responses (and execute JavaScript, etc.).
It seems that what you are looking for is called a "headless browser", which supports loading HTML and running JavaScript on the DOM, exactly like you need. Headless browsers are also generally quite fast compared to normal browsers, since they don't need to do any rendering.
There are several headless browsers. HtmlUnit (which can be converted to run on .NET) seems like a good choice, as does envjs (it's written in JavaScript, which can be embedded in .NET). Unfortunately, I have no experience with either, but they both look super-cool, especially envjs. Update: a nice, more up-to-date list of headless browsers has been published on GitHub.
There are also other alternatives to the WebBrowser control, which may or may not be faster in your case, if you want to stay with a control.
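Incidentally, if you do stay with the WebBrowser control, the documented way to call a page-level JavaScript function from C# is HtmlDocument.InvokeScript; a sketch, where somefunction is the page's own function from the question:

// Sketch: the WinForms equivalent of the execScript call above.
private void WebBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Calls the page's own JavaScript function once the document is ready.
    WebBrowser1.Document.InvokeScript("somefunction");
}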

How to capture visited URLs and their HTML from any browser

I want to find a decent solution to track the URLs and HTML content that users are visiting, and provide more information to the user. The solution should have minimal impact on end users.
I don't want to write plugins for different browsers; that is hard to maintain.
I can't accept a proxy-based method, since I don't want to change any of the user's proxy settings.
My application is written in C# and targets Windows. It's best if the solution can support other OSes as well.
Based on my research, I found the following methods that look workable for me, but all of them have drawbacks, and I can't determine which one is best.
Use WinPcap
WinPcap sniffs all TCP packets without changing any user settings and only requires installing the WinPcap setup, which is acceptable to me. But I have two questions:
a. How do I convert TCP packets into URLs and HTML?
b. Does it really impact performance? I don't know whether sniffing all TCP traffic is too much overhead for this requirement.
Find the history files of the different browsers
This looks like the easiest way, but I wonder if the solution is stable. I am not sure whether the browser writes the history reliably, or when it writes it. My application needs to pop up information before the user leaves the current page, so this won't work for me if the browser only writes the history file when the user closes it.
Use FindWindow, accessibility objects, or a COM interface to find the UI element which contains the URL
I find this approach incomplete; for example, Chrome will only show the active tab's URL, not all of them.
Another drawback is that I would have to request the URL a second time to get its HTML content.
Any comment or suggestion is welcome.
BTW, I am not writing spyware. The application tries to find all RSS feeds on a web page and show them to end users. I could easily do that in a browser plugin, but I really want to support multiple browsers with a single UI. Thanks.
Though this is a very old post, I thought I would give some input.
Approach 1 (WinPcap) is the best one. It will work for any browser, even the built-in browser of any other installed application, and it is the least resource-consuming approach too.
There is a library, Pcap.Net, that has an HTTP parser. You can reconstruct the HTTP stream and use its HttpResponseDatagram to parse the body, which your application can then consume.
This link gave me more insight:
Tcp Session Reconstruction with Winpcap
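A minimal Pcap.Net capture sketch to illustrate the idea. Be aware that it only parses HTTP data contained in a single packet; for real pages you still need the TCP session reconstruction described in the linked article. The device index and port filter are assumptions.

// Sketch: capture port-80 traffic with Pcap.Net and inspect HTTP responses.
// Only single-packet responses parse here; real use needs TCP reassembly.
using PcapDotNet.Core;
using PcapDotNet.Packets;
using PcapDotNet.Packets.Http;

var device = LivePacketDevice.AllLocalMachine[0];   // assumption: first NIC
using (PacketCommunicator communicator =
           device.Open(65536, PacketDeviceOpenAttributes.Promiscuous, 1000))
{
    communicator.SetFilter("tcp port 80");          // plain HTTP only
    communicator.ReceivePackets(0, packet =>
    {
        var response = packet.Ethernet.IpV4.Tcp.Http as HttpResponseDatagram;
        if (response != null && response.StatusCode != null)
        {
            // response.StatusCode, response.Header, and response.Body
            // are available here for the application to consume.
        }
    });
}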

Web Scraper design

Can we make an application that searches Google for a word and navigates to the various result pages, using HttpWebResponse, and then searches for the word on each rendered page? It should also use multiple proxies, i.e., all of the above is multi-threaded and each thread has a different proxy.
So far I have failed to do so; GetResponse returns "Method not allowed".
You might want to look at the Google API, as that would be much easier to access, but I do not know whether your usage would require you to buy a service license.
Also, verify that you get the right response, as Google sometimes uses iframes for content - maybe you only get the outer frame?
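For the "Method not allowed" error, double-check the HTTP verb you send. A hedged sketch of a GET through a proxy follows; the proxy address and search URL are placeholders, and Google may still block automated queries regardless:

// Sketch: a GET request through a proxy with a browser-like User-Agent.
// "Method not allowed" usually means the wrong verb was sent for the URL.
// Needs: using System.IO; using System.Net;
var request = (HttpWebRequest)WebRequest.Create("https://www.google.com/search?q=word");
request.Method = "GET";
request.UserAgent = "Mozilla/5.0";                // some endpoints reject empty agents
request.Proxy = new WebProxy("proxyhost", 8080);  // placeholder: one proxy per thread
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
    bool found = html.Contains("word");           // search the fetched page source
}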
