I have an .aspx page with some JavaScript functions that control paging.
I can run such a function via the WebBrowser control with the following call inside WebBrowser1_DocumentCompleted:
WebBrowser1.Document.Window.DomWindow.execScript("somefunction();", "JavaScript")
The WebBrowser control is very slow, and I would prefer to use System.Net.WebClient.DownloadString.
Is there some way to run this script with the (faster) System.Net.WebClient methods, or some other way?
Well, no. WebClient is an HTTP client, not a web browser.
An HTTP client follows the HTTP spec; the fact that your HTTP requests result in HTML is irrelevant to the client.
A web browser, on the other hand, in addition to being an HTTP client, also knows how to parse HTML responses (and execute JavaScript, etc.).
It seems that what you are looking for is called a "headless browser", which supports loading HTML and running JavaScript on the DOM, exactly like you need. Headless browsers are also generally quite fast compared to normal browsers, since they don't need to do any rendering.
There are several headless browsers. HtmlUnit (which can be converted to run on .NET) seems like a good choice, as does envjs (it's written in JavaScript, which can be embedded in .NET). Unfortunately, I have no experience with either, but they both look super-cool, especially envjs. Update: a nice, more up-to-date list of headless browsers has been published on GitHub.
There are also other alternatives to the WebBrowser control which may or may not be faster in your case, if you want to stay with a control.
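To see the distinction concretely, here is a small sketch of what a plain HTTP download hands you for a script-driven page. The markup is hypothetical, not taken from the asker's page: the element that JavaScript would have populated is still empty, because an HTTP client never runs the script.

```csharp
using System;
using System.Text.RegularExpressions;

class StaticHtmlDemo
{
    // Roughly what WebClient.DownloadString would return for a script-driven
    // page (hypothetical markup): only the server-sent HTML, with the target
    // element still empty because no JavaScript has run.
    public const string RawHtml =
        "<html><body><div id='results'></div>"
        + "<script>loadResults();</script></body></html>";

    public static bool DivIsEmpty(string html)
    {
        // Check whether the element the script was supposed to fill has content.
        var m = Regex.Match(html, "<div id='results'>(.*?)</div>");
        return m.Success && m.Groups[1].Value.Length == 0;
    }

    static void Main()
    {
        // The data the script would have fetched is simply not in the string.
        Console.WriteLine(DivIsEmpty(RawHtml)); // prints True
    }
}
```

This is why no amount of speed-up on the WebClient side helps: the data isn't slow to arrive, it was never in the response at all.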
Related
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I GET the page with Html Agility Pack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the JavaScript or bind it to its internal representation of the document. If you wanted to run the script, you would need a web browser.

The perfect answer to your problem would be a complete "headless" web browser: something that incorporates an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
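Whichever browser wrapper you end up with, the usage pattern is the same: load the page, then poll until the scripts have populated the DOM or a timeout expires. A stdlib-only sketch of that wait loop follows; the WebBrowser-specific predicate in the comment, including the element id "results", is an assumption about what your page looks like.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class DomWait
{
    // Polls `ready` until it returns true or `timeout` expires.
    // With the WinForms WebBrowser control, `ready` would be something like:
    //   () => browser.ReadyState == WebBrowserReadyState.Complete
    //         && browser.Document.GetElementById("results").InnerText != ""
    // (the id "results" is a placeholder for whatever element the script fills).
    public static bool WaitUntil(Func<bool> ready, TimeSpan timeout, int pollMs = 100)
    {
        var sw = Stopwatch.StartNew();
        do
        {
            if (ready()) return true;
            Thread.Sleep(pollMs);
        } while (sw.Elapsed < timeout);
        return false;
    }

    static void Main()
    {
        // Simulated page that becomes "ready" on the third poll.
        int polls = 0;
        Console.WriteLine(WaitUntil(() => ++polls >= 3, TimeSpan.FromSeconds(2), 10)); // prints True
    }
}
```

Note that in a real WinForms app the loop must keep pumping messages (e.g. Application.DoEvents()) or run off the UI thread, otherwise the WebBrowser control never finishes loading.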
You can use Awesomium for this: http://www.awesomium.com/. It works fairly well, but it has no x64 support and is not thread-safe. I'm using it to scan some web sites 24x7; it runs fine for at least a couple of days in a row, but then it usually crashes.
I am using the Abot library to crawl a web page. The crawler can request the pages correctly but the problem is that almost all of the content is loaded dynamically through knockout.js. The crawler currently has no way of requesting this content which results in only a small part of the page being loaded.
I've tried making the program wait, in the hope that the requests for the dynamic content would be sent anyway, but that doesn't seem to work.
I want the entire page to be loaded, but instead only the base of the page is loaded.
What can I do to make the crawler request all data?
Thanks!
Short answer:
Not possible this way; you need something that can handle the JS for you, like browsers do.
I would recommend Splash from Scrapy (it can be integrated with any language through its REST API).
But in my humble opinion, if you don't need an enterprise solution, don't use C# for web crawling; there are easier solutions and more complete libraries in Python, for example.
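That said, since Splash is driven over plain HTTP, using it from C# only takes an HTTP call. A sketch of building the request to Splash's render.html endpoint, which returns the HTML after JavaScript has run; the localhost:8050 address and the target URL are assumptions, so adjust them for your own Splash instance:

```csharp
using System;

class SplashRequest
{
    // Builds the URL for Splash's render.html endpoint. The `wait` parameter
    // gives the page's scripts that many seconds to finish before rendering.
    public static string BuildRenderUrl(string splashHost, string targetUrl, double waitSeconds)
    {
        return splashHost.TrimEnd('/')
            + "/render.html?url=" + Uri.EscapeDataString(targetUrl)
            + "&wait=" + waitSeconds;
    }

    static void Main()
    {
        // You would pass this URL to WebClient.DownloadString or HttpClient;
        // the response is the post-JavaScript HTML, ready for a parser.
        Console.WriteLine(BuildRenderUrl("http://localhost:8050", "http://example.com/page?id=1", 2.0));
    }
}
```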
Has anyone had success making scraping software in an Azure Function? It needs to be performed with some kind of dynamic content loading, like the WebBrowser control or Selenium, where all content is loaded before scraping starts. It seems that Selenium is not an option due to the nature of Azure Functions.
I am trying to scrape some web pages and extract content. The pages are pretty dynamic: first the HTML is loaded, and then data is lazy-loaded through JavaScript, so with a standard HTTP request I will not get the data. I could use the WebBrowser control in .NET and wait for the ready state, but the WebBrowser control requires a browser and cannot be used in an Azure Function. Html Agility Pack could be the right answer; I tried it 5 years ago, and at that point it was pretty terrible at formatting HTML. I can see they have some kind of JavaScript library that could be worth a try. Have you tried using that part of Html Agility Pack?
Your question is purely a .NET/C# question (at least I assume you use .NET C#).
Refer to this answer, please. If you can achieve your goal in some way via .NET, you can do it in an Azure Function - no restrictions on this side of the road.
For sure you will need an external third-party library that somehow simulates a web browser. I know that Selenium uses browser "drivers" in some way (I'm not sure of the details) - this could be an idea to research more thoroughly.
I was (and soon will be again) challenged with a similar request, and I found no obvious solution. My personal expectation is that a dedicated external service (or something similar) will have to be developed, which could then send the result to an Azure HTTP Trigger function that proceeds with the analysis. This so-called "service" could even have a Web API interface so it can be consumed from anywhere (e.g. an Azure Function).
When loading a web page, the browser executes many GET requests to fetch resources such as images, CSS files, fonts and other assets.
Is there a possibility to capture failed GET requests using Selenium in C#?
Selenium does not natively provide this capability. I'm coming to this conclusion for two reasons:
I've not seen any function exported by Selenium's API which would allow doing what you want in a cross-platform way.
(I'm saying "cross-platform way" because I'm excluding from considerations possible non-standard APIs that could be exported by one browser but not others.)
If there is any doubt I may have missed something, then consider that ...
The Selenium team has quite consciously decided not to provide any means to get the response code of the HTTP request that downloads the page in the first place. It is extremely doubtful that they would have slipped behind the scenes a way to get the response code of the other HTTP requests that are launched to load other resources.
The way to check on any such requests is to have the browser launched by Selenium connect through a proxy that records such responses, or to load the page with something other than Selenium.
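If you do route the traffic through a proxy, the remaining work is just filtering the recorded responses. A minimal sketch, assuming your proxy gives you (URL, status code) pairs in whatever format it logs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FailedRequests
{
    // HTTP status codes >= 400 indicate the resource failed to load
    // (4xx client errors such as 404, 5xx server errors).
    public static List<string> Failed(IEnumerable<(string Url, int Status)> log)
    {
        return log.Where(e => e.Status >= 400)
                  .Select(e => e.Url + " -> " + e.Status)
                  .ToList();
    }

    static void Main()
    {
        // Hypothetical proxy log for one page load.
        var log = new[]
        {
            ("https://site/app.css", 200),
            ("https://site/logo.png", 404),   // missing image
            ("https://site/font.woff2", 500), // server error
        };
        foreach (var line in Failed(log))
            Console.WriteLine(line);
    }
}
```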
We are using Html Agility Pack to scrape data from an HTML-based site; is there any DLL like Html Agility Pack for scraping Flash-based sites?
It really depends on the site you are trying to scrape. There are two types of sites in this regard:
If the site has the data inside the SWF file, then you'll have to decompile the SWF file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since it probably isn't going to change much.
In most cases, however, especially with sites that have a lot of data, the Flash file is actually contacting an external API. In that case you can simply ignore the Flash altogether and go to the API directly. If you're not sure, just activate Firebug's Net panel and start browsing; if the site is using an external API, it should become obvious.
Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
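Once the endpoint is found, the Flash layer drops out entirely: you fetch the same URL the SWF calls and parse the response yourself. A sketch of the parsing step, assuming a hypothetical JSON response (the endpoint, the field names, and the sample data below are all made up for illustration):

```csharp
using System;
using System.Text.Json;

class FlashApiDemo
{
    // A hypothetical response from the API the SWF was observed calling,
    // e.g. fetched via WebClient.DownloadString("https://example.com/api/items").
    public const string SampleJson =
        "{\"items\":[{\"name\":\"widget\",\"price\":9.5},"
        + "{\"name\":\"gadget\",\"price\":12.0}]}";

    // Walks the JSON structure and sums the "price" field of every item.
    public static double TotalPrice(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            double total = 0;
            foreach (JsonElement item in doc.RootElement.GetProperty("items").EnumerateArray())
                total += item.GetProperty("price").GetDouble();
            return total;
        }
    }

    static void Main()
    {
        Console.WriteLine(TotalPrice(SampleJson));
    }
}
```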
Also note that if it's a big enough site, there are probably non-flash ways to get to the same data:
It might have a mobile site (with no flash) - try accessing the site with an iPhone user-agent.
It might have a site for crawlers (like googlebot) - try accessing the site with a googlebot user-agent.
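Both of those tricks come down to sending a different User-Agent header. A sketch with HttpWebRequest follows; the Googlebot UA string shown is a commonly published one, but check Google's current documentation before relying on it:

```csharp
using System;
using System.Net;

class UserAgentDemo
{
    public const string GooglebotUa =
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

    // Prepares a request that identifies itself as Googlebot; many sites
    // serve a crawler-friendly (non-Flash) version to this user agent.
    public static HttpWebRequest CreateCrawlerRequest(string url)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.UserAgent = GooglebotUa;
        return req;
    }

    static void Main()
    {
        var req = CreateCrawlerRequest("http://example.com/");
        Console.WriteLine(req.UserAgent);
        // Calling req.GetResponse() would then fetch the crawler version.
    }
}
```

The same approach works for the mobile-site trick; just swap in an iPhone user-agent string.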
EDIT:
If you're talking about crawling (getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even Googlebot isn't scraping Flash content. That's mostly because, unlike HTML, Flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, etc.
You won't have much luck with the Html Agility Pack. One method would be to use something like FiddlerCore to proxy HTTP requests to/from a Flash site. You would start the FiddlerCore proxy, then use something like the C# WebBrowser control to go to the URL you want to scrape. As the page loads, all those HTTP requests will get proxied, and you can inspect their contents. However, you wouldn't get most of the text, since that's often static within the Flash. Instead, you'd mostly get larger content (videos, audio, and maybe images) that is usually stored separately. This will be slower than more traditional scraping/crawling, because you'll actually have to execute/run the page in the browser.
If you're familiar with all of those YouTube Downloader types of extensions, they work on this same principle, except that they intercept HTTP requests directly from Firefox (for example) rather than using a separate proxy.
I believe that Google and some of the big search engines have a special arrangement with Adobe/Flash and are provided with some software that lets their search engine crawlers see more of the text and things that Google relies on. Same goes for PDF content. I don't know if any of this software is publicly available.
Scraping Flash content would be quite involved, and the reliability of any component that claims to do so is questionable at best. However, if you wish to "crawl" or follow hyperlinks in a Flash animation on some web page, you might have some luck with Infant. Infant is a free Java library for web crawling, and offers limited / best-effort Flash content hyperlink following abilities. Infant is not open source, but is free for personal and commercial use. No registration required!
How about capturing the whole page as an image and running OCR on it to read the data?