I have to write a Console Application that grap and parse data from a website.
Unluckly, the website uses some kind of Javascript framework to compose the page.
So what I need to do is get the HTML once time the page is rendered by Javascript.
This is just the first step, my second step is to navigate the website to collect data from different page but... Unluckly the pages that I have to parse does not have Urls, but they are loaded from Javascript too...
Do you have some ideas ?
Thanks to support
Dario
Related
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I Get the page with HtmlAgilityPack - the script doesn't run so I get what it essentially a mostly-blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well but has no support for x64 and is not thread safe. I'm using it to scan some web sites 24x7 and it's running fine for at least a couple of days in a row but then it usually crashes.
I am working on scrapping of website. so i make one desktop application for that.
I check website using inspect element then i can see whole data of website but when i try to check website data using page source(ctrl+U) then there is nothing.
means i can't find any website data in page source but can see in firebug(inspect element).
because of this when i am trying to get data using c# coding then i am getting only page source data which doesn't contains any website data only contains schema(structure) and js links.
see below image of firebug.
And this is page source image.
You met the js-powered site. The content is dynamicly loaded thru js, thus it's not visible in page-source. Turn to the scrape libraries that support js code evaluation. See here an example.
I have developed crawler in C#.
I am reading data from one page that is list page, It uses javascript for redirecting to next page.
function is - <a onclick="redirectToNextPage(PageID)">More</a>
How i can run this function in serverside and get url of the next page, so that by that url i can save that page.
I want to run javascript function in C# to get url of next page
You'll almost certainly need a headless browser to do that, not just running JavaScript code without the context it expects to run in. This question and its answer list some headless browsers that can be used from C# (not all of them have JavaScript support, though). That list may well be out of date now, but that's the term you need to search for.
Try https://javascriptdotnet.codeplex.com/.
It exposes Google V8 JS engine to CLI and also allows to CLI objects to be manipulated by JS
Is there a way in C# to get the output of AJAX or Java? What I'm trying to do is grab the specifics of items on a webpage, however the webpage does not load it into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by javascript through ajax calls, and this modified data is what you are trying to capture then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script, otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
A long time ago I wrote a VB app to screen scrape financial sites and made it so that you could fire up multiple of these "harvester" screen scrapers. That might ease the time period loading data. We could do thousands of scrapes a day with multiple of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, like what customer to get next and what was needed to scrape (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentComplete event. That should only fire when the web page is completely loaded. Then check out this post which gives an overview of how to do it.
Use the Html Agility Pack. It allows download of .html and scraping via XPath.
See How to use HTML Agility pack
One of my friends is working on having a good solution to generate aspx pages, out of html pages generated from a legacy asp application.
The idea is to run the legacy app, capture html output, clean the html using some tool (say HtmlTidy) and parse it/transform it to aspx, (using Xslt or a custom tool) so that existing html elements, divs, images, styles etc gets converted neatly to an aspx page (too much ;) ).
Any existing tools/scripts/utilities to do the same?
Here's what you do.
Define what the legacy app is supposed to do. Write down the scenarios of getting pages, posting forms, navigating, etc.
Write unit test-like scripts for the various scenarios.
Use the Python HTTP client library to exercise the legacy app in your various scripts.
If your scripts work, you (a) actually understand the legacy app, (b) can make it do the various things it's supposed to do, and (c) you can reliably capture the HTML response pages.
Update your scripts to capture the HTML responses.
You have the pages. Now you can think about what you need for your ASPX pages.
Edit the HTML by hand to make it into ASPX.
Write something that uses Beautiful Soup to massage the HTML into a form suitable for ASPX. This might be some replacement of text or tags with <asp:... tags.
Create some other, more useful data structure out of the HTML -- one that reflects the structure and meaning of the pages, not just the HTML tags. Generate the ASPX pages from that more useful structure.
Just found HTML agility pack to be useful enough, as they understand C# better than python.
I know this is an old question, but in a similar situation (50k+ legacy ASP pages that need to display in a .NET framework), I did the following.
Created a rewrite engine (HttpModule) which catches all incoming requests and looks for anything that is from the old site.
(in a separate class - keep things organized!) use WebClient or HttpRequest, etc to open a connection to the old server and download the rendered HTML.
Use the HTML agility toolkit (very slick) to extract the content that I'm interested in - in our case, this is always inside if a div with the class "bdy".
Throw this into a cache - a SQL table in this example.
Each hit checks the cache and either a)retrieves the page and builds the cache entry, or b) just gets the page from the cache.
An aspx page built specifically for displaying legacy content receives the rewrite request and displays the relevant content from the legacy page inside of an asp literal control.
The cache is there for performance - since the first request for a given page has a minimum of two hits - one from the browser to the new server, one from the new server to the old server - I store cachable data on the new server so that subsequent requests don't have to go back to the old server. We also cache images, css, scripts, etc.
It gets messy when you have to handle forms, cookies, etc, but these can all be stored in your cache and passed through to the old server with each request if necessary. I also store content expiration dates and other headers that I get back from the legacy server and am sure to pass those back to the browser when rendering the cached page. Just remember to take as content-agnostic an approach as possible. You're effectively building an in-page web proxy that lets IIS render old ASP the way it wants, and manipulating the output.
Works very well - I have all of the old pages working seamlessly within our ASP.NET app. This saved us a solid year of development time that would have been required if we had to touch every legacy asp page.
Good luck!