I am using the Abot library to crawl a web page. The crawler can request the pages correctly but the problem is that almost all of the content is loaded dynamically through knockout.js. The crawler currently has no way of requesting this content which results in only a small part of the page being loaded.
I've tried making the program wait, in the hope that the requests for the dynamic content would be sent anyway, but that doesn't seem to work.
I want the entire page to be loaded but instead, only the base of the page is loaded.
What can I do to make the crawler request all data?
Thanks!
Short answer:
Not possible this way, you need something that can handle the JS for you like browsers do.
I would recommend Splash from the Scrapy ecosystem (it can be integrated with any language through its REST API).
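For illustration, here is a minimal sketch of calling Splash's render.html endpoint from C#. It assumes a Splash instance running locally on its default port (8050); the target URL is just a placeholder:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class SplashExample
{
    static async Task Main()
    {
        // render.html returns the page HTML after the JavaScript has run;
        // "wait" gives the knockout.js bindings a couple of seconds to finish.
        string target = Uri.EscapeDataString("http://example.com/"); // placeholder URL
        string endpoint = $"http://localhost:8050/render.html?url={target}&wait=2";

        using var client = new HttpClient();
        string renderedHtml = await client.GetStringAsync(endpoint);

        // Feed renderedHtml to Abot/HtmlAgilityPack instead of the raw server response.
        Console.WriteLine(renderedHtml.Length);
    }
}
```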
But in my humble opinion, if you don't need an enterprise solution, don't use C# for web crawling; there are easier solutions and more complete libraries in Python, for example.
Related
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I get the page with HtmlAgilityPack, the script doesn't run, so what I get is essentially a mostly-blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser.

The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
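For what it's worth, here is a minimal sketch of the WebBrowser-control approach described above. The URL is a placeholder, and heavily dynamic pages may still need extra waiting or polling after DocumentCompleted fires, since scripts can keep modifying the DOM after the load event:

```csharp
using System;
using System.Windows.Forms;

class Program
{
    [STAThread]
    static void Main()
    {
        // The WebBrowser control needs an STA thread and a Windows message loop.
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // The document has loaded and scripts have had a chance to run.
            string renderedHtml = browser.Document.Body.OuterHtml;
            Console.WriteLine(renderedHtml.Length);
            Application.ExitThread();
        };

        browser.Navigate("http://example.com/page-with-scripts"); // placeholder URL
        Application.Run(); // pump messages until ExitThread is called
    }
}
```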
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well but has no support for x64 and is not thread safe. I'm using it to scan some web sites 24x7 and it runs fine for a couple of days in a row, but then it usually crashes.
Has anyone had success with making scraping software in an Azure Function? It needs to be performed with some kind of dynamic content loading, like the WebBrowser control or Selenium, where all content is loaded before scraping starts. It seems like Selenium is not an option due to the nature of Azure Functions.
I am trying to scrape some web pages and extract content. The pages are pretty dynamic: first the HTML is loaded, and then the data is lazy-loaded through javascript. Using a standard HTTP request I will not get the data. I could use the WebBrowser control in .NET and wait for the ready state, but the WebBrowser control requires a browser and cannot be used in an Azure Function. It could be that HtmlAgilityPack is the right answer. I tried it 5 years ago, and at that point it was pretty terrible at handling HTML. I can see they have some kind of javascript library that could be worth a try. Have you tried using that part of HtmlAgilityPack?
Your question is purely a .NET/C# one (at least I assume you use .NET and C#).
Refer to this answer, please. If you achieve your goal in some way via .NET, you can do it in an Azure function - no restrictions on this side of the road.
For sure you will need an external third-party library that somehow simulates a web browser. I know that Selenium in a way uses browser "drivers" (not sure) - this could be an idea to research more thoroughly.
I was (and soon will be again) challenged with a similar request, and I found no obvious solution. My personal expectation is that a dedicated external service (or something similar) should be developed, which could then send the result to an Azure HTTP trigger function that proceeds with the analysis. This so-called "service" could even have a Web API interface to be consumed from anywhere (e.g. an Azure Function).
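As a rough illustration of that split, here is a hypothetical HTTP-triggered function that delegates the JavaScript rendering to an external service and only does the analysis inside the function. The rendering-service host name, route, and query parameter are made-up placeholders:

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class ScrapeProxy
{
    private static readonly HttpClient Client = new HttpClient();

    [FunctionName("ScrapeProxy")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req)
    {
        string target = req.Query["url"];

        // Placeholder rendering service (e.g. a Splash or headless-browser box
        // you host yourself) that returns the fully rendered HTML.
        string renderedHtml = await Client.GetStringAsync(
            $"http://render-service.example/render?url={target}");

        // ...parse/analyse renderedHtml here (an HTML parser works fine in a function)...
        return new OkObjectResult(renderedHtml.Length);
    }
}
```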
We are using Html Agility Pack to scrape data from an HTML-based site; is there any DLL like Html Agility Pack for scraping a Flash-based site?
It really depends on the site you are trying to scrape. There are two types of sites in this regard:
If the site has the data inside the swf file, then you'll have to decompile the swf file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since it probably isn't going to change much.
In most cases, however, especially with sites that have a lot of data, the flash file is actually contacting an external API. In that case you can simply ignore the flash altogether and get to the API directly. If you're not sure, just activate Firebug's net panel and start browsing. If it's using an external API it should become obvious.
Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
Also note that if it's a big enough site, there are probably non-flash ways to get to the same data:
It might have a mobile site (with no flash) - try accessing the site with an iPhone user-agent.
It might have a site for crawlers (like googlebot) - try accessing the site with a googlebot user-agent.
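For either of those user-agent tricks, a minimal sketch along these lines should work (the user-agent string and the target URL are just examples):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class UserAgentExample
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Pretend to be Googlebot; some sites serve a crawler-friendly,
        // Flash-free version of the page to search-engine user agents.
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "User-Agent",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");

        string html = await client.GetStringAsync("http://example.com/"); // placeholder site
        Console.WriteLine(html.Length);
    }
}
```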
EDIT:
If you're talking about crawling (crawling means getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even googlebot isn't scraping flash content. That's mostly because, unlike HTML, flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, etc.
You won't have much luck with the HTML Agility Pack. One method would be to use something like FiddlerCore to proxy HTTP requests to/from a Flash site. You would start the FiddlerCore proxy, then use something like the C# WebBrowser control to go to the URL you want to scrape. As the page loads, all those HTTP requests will get proxied and you can inspect their contents. However, you wouldn't get most text since that's often static within the Flash. Instead, you'd get mostly larger content (videos, audio, and maybe images) that is usually stored separately. This will be slow compared to more traditional scraping/crawling because you'll actually have to execute/run the page in the browser.
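Here is a rough sketch of that proxy idea, assuming the classic FiddlerCore API (Startup, AfterSessionComplete); treat the details as an assumption to verify against the version you install:

```csharp
using System;
using Fiddler;

class FiddlerCoreSketch
{
    static void Main()
    {
        // Log every completed HTTP session that goes through the proxy while
        // a browser (or a WebBrowser control) loads the Flash page.
        FiddlerApplication.AfterSessionComplete += session =>
        {
            Console.WriteLine(session.fullUrl);
            // For text responses (e.g. the XML/JSON the Flash movie fetches
            // from its back-end API), session.GetResponseBodyAsString() has the body.
        };

        FiddlerApplication.Startup(8888, FiddlerCoreStartupFlags.Default);

        Console.WriteLine("Proxy running on port 8888; load the page now, then press Enter.");
        Console.ReadLine();

        FiddlerApplication.Shutdown();
    }
}
```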
If you're familiar with all of those YouTube Downloader type of extensions, they work on this same principle, except that they intercept HTTP requests directly from Firefox (for example) rather than through a separate proxy.
I believe that Google and some of the big search engines have a special arrangement with Adobe/Flash and are provided with some software that lets their search engine crawlers see more of the text and things that Google relies on. Same goes for PDF content. I don't know if any of this software is publicly available.
Scraping Flash content would be quite involved, and the reliability of any component that claims to do so is questionable at best. However, if you wish to "crawl" or follow hyperlinks in a Flash animation on some web page, you might have some luck with Infant. Infant is a free Java library for web crawling, and offers limited / best-effort Flash content hyperlink following abilities. Infant is not open source, but is free for personal and commercial use. No registration required!
How about capturing the whole page as an image and running OCR on it to read the data?
Is there a way in C# to get the output of AJAX or Java? What I'm trying to do is grab the specifics of items on a webpage; however, the webpage does not load them into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by javascript through ajax calls, and this modified data is what you are trying to capture then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script, otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
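As an illustration of calling such an endpoint directly, here is a hedged sketch; the URL, query string, and JSON property names ("results", "title") are made up and would have to be taken from whatever requests the browser's network panel shows the page making:

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class AjaxCallExample
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Hypothetical listings endpoint discovered by watching the page's
        // own AJAX traffic in Firebug / browser dev tools.
        string json = await client.GetStringAsync(
            "http://www.example.com/inventory/search?type=preowned&page=1");

        using var doc = JsonDocument.Parse(json);
        foreach (var item in doc.RootElement.GetProperty("results").EnumerateArray())
        {
            Console.WriteLine(item.GetProperty("title").GetString());
        }
    }
}
```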
A long time ago I wrote a VB app to screen scrape financial sites, and made it so that you could fire up multiple of these "harvester" screen scrapers, which helped cut down the time spent loading data. We could do thousands of scrapes a day with multiple of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, like which customer to get next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentCompleted event, which should only fire when the web page is completely loaded. Then check out this post, which gives an overview of how to do it.
Use the Html Agility Pack. It allows you to download the HTML and scrape it via XPath.
See How to use HTML Agility pack
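For the XPath part, a minimal HtmlAgilityPack sketch looks like this (the URL is a placeholder; for dynamically loaded listings you would point it at the AJAX endpoint or at HTML captured from a browser, not at the raw page):

```csharp
using System;
using HtmlAgilityPack;

class HapExample
{
    static void Main()
    {
        // Download the page and pick out links with an XPath query.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/"); // placeholder URL

        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }
    }
}
```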
I want to make a program that will simulate a user browsing a site and clicking on links. Cookies and javascript have to be enabled. I've successfully done this in Python, but I want to write it in a compilable language (Python IDEs don't cut it). The links on the site are generated with javascript and are dynamic. With Python I used PAMIE (a third-party module that uses win32com) to launch an instance of Internet Explorer, scrape the generated HTML for the links, then navigate to one of them. The point is for the whole process to be transparent to the server. What's the best (compilable) language and method to do this? I was thinking C# with the WebBrowser control, but I don't want to spend a lot of time learning something if it isn't going to work. Any help is appreciated!
You might want to look at the automated testing via browser suites:
http://www.teknologika.com/blog/the-holy-grail-net-automated-web-gui-testing-for-internet-explorer/
http://watin.sourceforge.net/
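To give a flavour of the WatiN approach linked above, a hypothetical sketch might look like this (the URL and link text are placeholders, and the exact API should be checked against the WatiN version you use):

```csharp
using System;
using WatiN.Core;

class WatiNSketch
{
    [STAThread]
    static void Main()
    {
        // Drives a real Internet Explorer instance, so javascript runs and
        // cookies are handled exactly as they are for a human visitor.
        using (var browser = new IE("http://example.com/")) // placeholder site
        {
            browser.Link(Find.ByText("Next page")).Click(); // placeholder link text
            Console.WriteLine(browser.Title);
        }
    }
}
```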
I wrote a blog post on this a while back: Web scraping in .NET. That discusses cookies but not JavaScript; I don't know if that would require additional coding.
It might be worth having a look at Selenium.
We use it for web testing in a C# ASP.NET environment.
The documentation isn't too bad.
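As a rough sketch of driving Selenium from C# (assuming the WebDriver NuGet package and a browser driver are installed; the URL is a placeholder):

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class SeleniumSketch
{
    static void Main()
    {
        // A real browser is driven, so cookies and javascript-generated links
        // behave exactly as they do for an ordinary visitor.
        using (IWebDriver driver = new FirefoxDriver())
        {
            driver.Navigate().GoToUrl("http://example.com/"); // placeholder site

            // Collect the anchors the scripts have generated and follow the first one.
            var links = driver.FindElements(By.TagName("a"));
            if (links.Count > 0)
            {
                links[0].Click();
            }

            Console.WriteLine(driver.PageSource.Length);
        }
    }
}
```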