I am scraping a certain web page using HAP (the HTML Agility Pack), and I want to access the submit button on the page. The problem is that I don't know how this could be done with HAP and C#. Is there a way to do this?
The HTML Agility Pack is not a browser, so while it can parse an HTML file, there is no way to really interact with it. You can find the submit object, read its properties and so forth, but you can't make it do anything.
You have two options:
Either read the form, build an HTTP request that matches the form's fields and its post method, and send it to the server yourself. This is all manual work; the Agility Pack only helps you list the fields on the form and their properties. (A sketch of this approach follows the second option below.)
If you need to interact with the page, you'll need a browser. There are headless browsers, like PhantomJS, that will actually load the page, parse the JavaScript and run what's sent by the server. There are C# wrappers around such browsers; one example is Awesomium. It's similar to the HTML Agility Pack in that it allows you to parse HTML documents, but it takes things one step further, actually running the page without ever showing a browser window.
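Returning to the first option, here is a minimal sketch of what that manual work can look like: the Agility Pack enumerates the form's fields, and HttpClient posts them back. The URL, the form selector, and the field name being overridden are placeholders, not details from the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class FormPoster
{
    static async Task Main()
    {
        // Hypothetical page containing the form to submit.
        var pageUrl = "https://example.com/search";

        using var client = new HttpClient();
        var html = await client.GetStringAsync(pageUrl);

        // Parse the page and locate the form and its fields with the Agility Pack.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var form = doc.DocumentNode.SelectSingleNode("//form");
        if (form == null) return;

        var fields = new Dictionary<string, string>();
        var inputs = form.SelectNodes(".//input[@name]");
        if (inputs != null)
        {
            foreach (var input in inputs)
            {
                fields[input.GetAttributeValue("name", "")] =
                    input.GetAttributeValue("value", "");
            }
        }

        // Set whatever value the form expects; the field name here is a placeholder.
        fields["q"] = "search term";

        // Resolve the form's action (it may be relative) and post the fields,
        // which is what the browser would have done when the submit button was clicked.
        var action = form.GetAttributeValue("action", pageUrl);
        var postUrl = new Uri(new Uri(pageUrl), action);
        var response = await client.PostAsync(postUrl, new FormUrlEncodedContent(fields));

        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```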
Related
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I get the page with HtmlAgilityPack, the script doesn't run, so what I get back is essentially a mostly-blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well, but it has no x64 support and is not thread-safe. I'm using it to scan some web sites 24x7; it runs fine for a couple of days in a row, but then it usually crashes.
I am trying to get information with C# and (AngleSharp or the HTML Agility Pack) about available schedules from a web page. The problem is that to see which schedules are available on different days, you have to press a "div" (previous, next), so to get one month of schedules I would have to go through it page by page. The problem I find is that I cannot click the div, whereas from JavaScript in the Chrome console I can. I have seen a similar answer using DoClick on an IHtmlElement, but it does not work; the page does not change, and the browser context keeps returning the same HTML in the Document.
Let's first visit what can be done with AngleSharp:
Any kind of requests incl. their manipulation (on request, but also before response)
General cookie management (and their manipulation, of course)
Querying the DOM and performing "simple" actions (e.g., clicking a button, submitting a form)
Running trivial JavaScript files
Here "trivial" means scripts that do not need any capabilities beyond what AngleSharp offers (e.g., rendering-tree information, advanced CSSOM access, ...) and that do not require a non-ES5-compliant parser (e.g., scripts that use ES6 features or special non-standard capabilities).
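For reference, exercising those capabilities (opening a page and clicking an element) looks roughly like the snippet below. The URL and the selector are placeholders, and DoClick alone only dispatches the DOM click event; any script handler attached to that element will not run unless AngleSharp is configured to run it.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Html.Dom;

class ClickExample
{
    static async Task Main()
    {
        // Standard AngleSharp setup with a loader so OpenAsync can fetch the URL.
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);

        // Placeholder URL for the schedule page.
        var document = await context.OpenAsync("https://example.com/schedules");

        // Placeholder selector for the "next" div.
        var next = document.QuerySelector("div.next") as IHtmlElement;
        next?.DoClick();

        Console.WriteLine(document.Title);
    }
}
```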
The problem I see in your question description is that in order to "click" a div on a page, a script needs to be run. That script may fall into the "trivial" category; however, most likely it does not. Now you have two options:
Try it out; maybe it works, great. Otherwise ...
See what the script is doing (obviously some HTTP request eventually ...) and do the same
The latter can of course be re-implemented in C# with AngleSharp. You can create an HTTP request, get the data, and either work on that data directly (it may be JSON and already exactly what you want) or, if the server returns partial HTML, re-parse it and integrate it into the real page.
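As a rough sketch of that second option, assuming the paging div ultimately issues an HTTP request that returns partial HTML (the endpoint URL and the selector below are invented for illustration):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using AngleSharp.Html.Parser;

class ScheduleFetcher
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Hypothetical endpoint that the "next" div ends up calling; find the real
        // one by watching the network tab while clicking the div in a browser.
        var partialHtml = await client.GetStringAsync("https://example.com/schedules?week=2");

        // Re-parse the returned fragment and pull out the pieces of interest.
        var parser = new HtmlParser();
        var document = parser.ParseDocument(partialHtml);

        // Placeholder selector for whatever markup holds the schedule entries.
        foreach (var entry in document.QuerySelectorAll(".schedule-entry"))
        {
            Console.WriteLine(entry.TextContent.Trim());
        }
    }
}
```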
HTH!
I am attempting to get the resulting web page content so I can extract the display text. I have tried the code below, but it gets me the source HTML, not the resulting HTML.
string urlPath = "http://www.cbsnews.com/news/jamar-clark-protests-follow-decision-not-to-file-charges-in-minneapolis-police-shooting/";
WebClient client = new WebClient();
string str = client.DownloadString(urlPath);
Compare the text in the str variable with the HTML shown in Chrome's Developer Tools and you will see different results.
Any recommendations will be appreciated.
I'm assuming you mean that you want the article text. If so, you will need to take a different approach. The page you refer to is loaded with client-side script that injects most of the content into the base HTML document. You will need to parse the DOM after that script has executed to get the content you're interested in.
As others have pointed out, an actual web browser will parse the downloaded HTML and execute javascript against it, potentially altering its content. While you could try to do that parsing yourself, the easiest route is to ask a real web browser to do it for you and then grab the results.
The easiest solution specifically in C# would be to use the WebBrowser Control from Windows Forms, which essentially exposes IE to your program, allowing you to control it. Use the Navigate method to load the URL in question, then use the Document property to navigate the DOM. You can, at that point, get the outerHTML to get the final content of the DOM as HTML.
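A minimal sketch of that approach is below. The URL is a placeholder, and note that DocumentCompleted can fire before slow script-injected content arrives, so some pages need an extra wait or a re-check.

```csharp
using System;
using System.Windows.Forms;

class Program
{
    [STAThread]
    static void Main()
    {
        // The WebBrowser control needs an STA thread and a running message loop.
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // IE has now parsed the page and run its scripts; the DOM reflects
            // the injected content. Grab the root element's outerHTML.
            var html = browser.Document?.Body?.Parent?.OuterHtml;
            Console.WriteLine(html?.Length);
            Application.ExitThread();
        };

        // Placeholder URL; use the article page you actually want.
        browser.Navigate("https://www.cbsnews.com/");
        Application.Run();
    }
}
```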
If you're not writing a Windows program and are interested more in headless operation, have a look at PhantomJS. It's a headless Webkit browser that is scriptable from javascript and would give you similar capability, although not in C#.
Is there a way in C# to get the output of AJAX or JavaScript? What I'm trying to do is grab the details of items on a webpage; however, the page does not include them in the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by JavaScript through AJAX calls, and this modified data is what you are trying to capture, then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script; otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
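For illustration, replaying such a request and reading a JSON response might look like the following; the endpoint URL and the property names are assumptions, not the site's real API.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class InventoryFetcher
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Hypothetical endpoint discovered by watching the page's AJAX traffic;
        // the real URL and query string will differ.
        var json = await client.GetStringAsync(
            "https://www.madisonhonda.com/api/inventory?type=preowned&page=1");

        // Walk the JSON directly instead of scraping HTML. The property names
        // ("listings", "year", "model") are placeholders for illustration.
        using var doc = JsonDocument.Parse(json);
        foreach (var car in doc.RootElement.GetProperty("listings").EnumerateArray())
        {
            Console.WriteLine($"{car.GetProperty("year").GetInt32()} {car.GetProperty("model").GetString()}");
        }
    }
}
```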
A long time ago I wrote a VB app to screen-scrape financial sites and made it so that you could fire up several of these "harvester" screen scrapers at once, which might cut down the time it takes to load the data. We could do thousands of scrapes a day with several of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, such as which customer to get next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to handle the DocumentCompleted event, which should only fire when the web page is completely loaded. Then check out this post, which gives an overview of how to do it.
Use the HTML Agility Pack. It lets you download the HTML and scrape it via XPath; a short example follows the link below.
See How to use HTML Agility pack
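A minimal example of that usage, with a placeholder URL and XPath:

```csharp
using System;
using HtmlAgilityPack;

class Scraper
{
    static void Main()
    {
        // HtmlWeb downloads and parses the page in one step (no scripts are run).
        var web = new HtmlWeb();
        var doc = web.Load("https://example.com/listings");

        // Placeholder XPath; point it at the elements you actually want.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='listing']/h2");
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```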
One of my friends is working on a good solution for generating ASPX pages out of HTML pages produced by a legacy ASP application.
The idea is to run the legacy app, capture the HTML output, clean the HTML using some tool (say HTML Tidy), and parse/transform it to ASPX (using XSLT or a custom tool) so that the existing HTML elements, divs, images, styles, etc. get converted neatly to an ASPX page (asking a lot ;) ).
Any existing tools/scripts/utilities to do the same?
Here's what you do.
Define what the legacy app is supposed to do. Write down the scenarios of getting pages, posting forms, navigating, etc.
Write unit test-like scripts for the various scenarios.
Use the Python HTTP client library to exercise the legacy app in your various scripts.
If your scripts work, you (a) actually understand the legacy app, (b) can make it do the various things it's supposed to do, and (c) you can reliably capture the HTML response pages.
Update your scripts to capture the HTML responses.
You have the pages. Now you can think about what you need for your ASPX pages.
Edit the HTML by hand to make it into ASPX.
Write something that uses Beautiful Soup to massage the HTML into a form suitable for ASPX. This might be some replacement of text or tags with <asp:... tags.
Create some other, more useful data structure out of the HTML -- one that reflects the structure and meaning of the pages, not just the HTML tags. Generate the ASPX pages from that more useful structure.
In the end, the HTML Agility Pack turned out to be useful enough, as they understand C# better than Python.
I know this is an old question, but in a similar situation (50k+ legacy ASP pages that need to display in a .NET framework), I did the following.
Created a rewrite engine (HttpModule) which catches all incoming requests and looks for anything that is from the old site.
(In a separate class; keep things organized!) Use WebClient or HttpWebRequest, etc., to open a connection to the old server and download the rendered HTML.
Use the HTML Agility Pack (very slick) to extract the content I'm interested in; in our case, this is always inside of a div with the class "bdy" (a sketch of this step follows this list).
Throw this into a cache - a SQL table in this example.
Each hit checks the cache and either (a) retrieves the page and builds the cache entry, or (b) just gets the page from the cache.
An aspx page built specifically for displaying legacy content receives the rewrite request and displays the relevant content from the legacy page inside of an asp literal control.
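To make steps 2 and 3 concrete, a rough sketch of the fetch-and-extract part might look like this (the legacy host name is a placeholder; the "bdy" class is the one mentioned above):

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class LegacyContentFetcher
{
    private static readonly HttpClient Client = new HttpClient();

    // Fetches the rendered HTML from the old server and returns only the inner
    // HTML of the div with class "bdy", or null if it is not found.
    public static async Task<string> GetLegacyBodyAsync(string legacyPath)
    {
        // Placeholder host name for the legacy ASP server.
        var html = await Client.GetStringAsync("http://legacy-server.example.com" + legacyPath);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var bdy = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'bdy')]");
        return bdy?.InnerHtml;
    }
}
```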
The cache is there for performance - since the first request for a given page has a minimum of two hits - one from the browser to the new server, one from the new server to the old server - I store cachable data on the new server so that subsequent requests don't have to go back to the old server. We also cache images, css, scripts, etc.
It gets messy when you have to handle forms, cookies, etc, but these can all be stored in your cache and passed through to the old server with each request if necessary. I also store content expiration dates and other headers that I get back from the legacy server and am sure to pass those back to the browser when rendering the cached page. Just remember to take as content-agnostic an approach as possible. You're effectively building an in-page web proxy that lets IIS render old ASP the way it wants, and manipulating the output.
Works very well - I have all of the old pages working seamlessly within our ASP.NET app. This saved us a solid year of development time that would have been required if we had to touch every legacy asp page.
Good luck!