I am trying to use C# with AngleSharp or the HTML Agility Pack to get information about available schedules from a web page. The problem is that to see which schedules are available on different days, you have to click a "div" (previous, next), so to collect a month of schedules I would have to page through the results one page at a time. The trouble is that I cannot click the div from code, whereas from the JavaScript console in Chrome I can. I have seen a similar answer suggesting DoClick on IHtmlElement, but it does not work for me: the page does not change, and the Document keeps holding the same HTML.
Let's first visit what can be done with AngleSharp:
Any kind of requests incl. their manipulation (on request, but also before response)
General cookie management (and their manipulation, of course)
Querying the DOM and performing "simple" actions (e.g., clicking a button, submitting a form; see the short sketch after this list)
Running trivial JavaScript files
Here trivial means: scripts that do not need any capabilities beyond what AngleSharp offers (e.g., no rendering-tree information, no advanced CSSOM access, ...) and that do not require a non-ES5-compliant parser (i.e., that do not use ES6 syntax or special non-standard capabilities).
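As a concrete illustration of the "querying the DOM" point above, here is a minimal sketch of loading a page and clicking an element with AngleSharp. The URL and the CSS selector are placeholders, the namespaces assume a recent AngleSharp version, and whether the click actually changes anything depends on the page's scripts, as discussed below.

// Minimal AngleSharp sketch: load the page, find the "next" div, click it.
// URL and selector are placeholders; namespaces are for recent AngleSharp versions.
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Html.Dom;

class AngleSharpClickSketch
{
    static async Task Main()
    {
        var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
        var document = await context.OpenAsync("https://example.com/schedules"); // placeholder URL

        var next = document.QuerySelector("div.next") as IHtmlElement; // placeholder selector
        next?.DoClick();

        // If the paging relies on non-trivial scripts, the document will still
        // contain the original markup after the click.
        Console.WriteLine(document.DocumentElement.OuterHtml);
    }
}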
The problem I see in your question is that in order to "click" a div on that page, a script needs to run. This script may fall into the "trivial" category, but most likely it does not. Now you have 2 options:
Try it out and maybe it works / great, otherwise ...
See what the script is doing (obviously some HTTP request eventually ...) and do the same
The latter can of course be re-implemented in C# / AngleSharp: create the HTTP request yourself, get the data, and either work on that data set directly (it may be JSON and already exactly what you want) or, if the endpoint serves partial HTML, re-parse it and integrate it into the real page.
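A sketch of that second option, assuming purely for illustration that the paging script calls an endpoint that returns partial HTML; the URL, query string, and selector are all hypothetical and would have to be copied from the browser's network tab.

// Option 2, sketched: call the endpoint the page's script would call and
// re-parse the returned fragment. URL, query string and selector are
// hypothetical - take the real ones from the dev tools network tab.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using AngleSharp.Html.Parser;

class PartialHtmlSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var html = await http.GetStringAsync("https://example.com/schedules?page=2"); // hypothetical endpoint

        var fragment = new HtmlParser().ParseDocument(html);
        foreach (var cell in fragment.QuerySelectorAll(".schedule")) // hypothetical selector
            Console.WriteLine(cell.TextContent.Trim());
    }
}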
HTH!
Related
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I GET the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
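If you go that route, the shape of it is roughly the sketch below: a WinForms WebBrowser control, navigating and then reading the document once DocumentCompleted fires. The URL is a placeholder, and script-heavy pages may raise DocumentCompleted more than once, so treat this as a starting point rather than a finished scraper.

// Rough shape of the WebBrowser approach (WinForms). The URL is a placeholder;
// script-heavy pages can raise DocumentCompleted several times.
using System;
using System.Windows.Forms;

class BrowserScrapeSketch
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (sender, e) =>
        {
            // By now the scripts have run, so this HTML includes their output.
            string renderedHtml = browser.Document.Body.InnerHtml;
            Console.WriteLine(renderedHtml);
            Application.ExitThread();
        };
        browser.Navigate("https://example.com/page-with-scripts"); // placeholder
        Application.Run(); // message loop so the control can do its work
    }
}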
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well but has no support for x64 and is not thread safe. I'm using it to scan some web sites 24x7 and it's running fine for at least a couple of days in a row but then it usually crashes.
I am trying to make this feature, and I'm really stuck.
I have two applications that run on the same domain, and I need to have one application load pages from the other one inside its own (the first application's) master page.
I have full control of the code of both sides, of course.
I have tried using HttpRequest and HttpResponse, and I have tried using a WebBrowser control. Both work great as long as the pages are static (plain HTML). However,
those pages are actually dynamic: the user needs to press server-side buttons (postback) and generally uses the session, ViewState, and/or cookies.
Because of that, HttpRequest and WebBrowser fail me, as they do not cause a postback, and therefore those server-side controls do not work. What's more, if I try to "fake" a postback by saving the ViewState after each response and then resending it on the next request, after a few (3-4) times the original page will return a "The state information is invalid for this page and might be corrupted" error, even if I use
EnableViewStateMac="false" EnableSessionState="True" EnableEventValidation="false" ValidateRequest="false" ViewStateEncryptionMode="Never"
So... any ideas how can I solve this issue?
Thanks in advance
What is the main desire here?
Wrap one site's content in another without any architecture changes?
ANSWER: Iframe
Have a single submit button submit from two sites?
ANSWER: Not a good idea. You might be able to kludge this by creating a scraper and parser, but it would only be cool as an "I can do it" trophy. Better to rearchitect the solution. But assuming you really want to do this, you will have to parse the result from the embedded site and redirect the submit to the new site. That site will then take the values and submit the form to the first site and wait for the result, which it will scrape to give a response to the user. It is actually quite a bit more complex, as you have to parse the HTML DOM (easier if all of the HTML is XHTML compliant, of course) to figure out what to intercept.
Caveat: Any changes to the embedded site can blow up your code, so the persons who maintain the first site must be aware of this artificially created dependency so they don't change anything that might cause problems. Ouch, that sounds brittle! ;-)
Other?
If using an iFrame does not work, then I would look at the business problem and draw up an ideal architecture to solve it, which might mean making the functionality of the embedded site available via a web service for the second site.
I am writing a web scraper for my company. Our client gives us access to their website for this purpose, but our client's IT team does not communicate with us, so I have to write the program with no help from the source.
Their website uses javascript on all of their buttons/dropdown menus to send postData to the server so that the screen will update to show the end user the correct info.
I am trying to get my program to simulate clicking the 'next page'. The 'next page' button has an onclick event that reads like this...
onclick="javascript:WebForm_DoPostBackWithOptions(
new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$ucTaxQueueListView$lviewOrderQueue$DataPager2$ctl00$btnNextPage"
, "", true, "", "", false, false))"
In my C# program, I am using the HttpWebRequest class and the HTML Agility Pack to do my requests and scraping, respectively.
I've done all I can in my code to try to get this to work. The only thing that works is to use Fiddler to copy the postData and paste that verbatim into my WebRequest function. This is very impractical when I potentially have to go through 1000+ 'next pages'.
I have also tried extracting the ViewState from the page and using that, but that always gives me an 'error' page.
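Roughly, what I am attempting looks like the sketch below: pull the WebForms hidden fields out of the current page, point __EVENTTARGET at the pager control named in the onclick above, and POST everything back. The URL is a placeholder, and a real page usually needs all of its other form fields posted back as well, which may well be where my attempts go wrong.

// Sketch of simulating the 'next page' postback. The URL is a placeholder;
// the __EVENTTARGET value comes from the onclick shown above. A real page
// usually needs every other form field included in the POST as well.
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class NextPageSketch
{
    static async Task Main()
    {
        const string url = "https://example.com/TaxQueue.aspx"; // placeholder

        using var http = new HttpClient();
        var doc = new HtmlDocument();
        doc.LoadHtml(await http.GetStringAsync(url));

        string Hidden(string name) =>
            doc.DocumentNode.SelectSingleNode($"//input[@name='{name}']")
               ?.GetAttributeValue("value", "") ?? "";

        var form = new Dictionary<string, string>
        {
            ["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$ucTaxQueueListView$lviewOrderQueue$DataPager2$ctl00$btnNextPage",
            ["__EVENTARGUMENT"] = "",
            ["__VIEWSTATE"] = Hidden("__VIEWSTATE"),
            ["__VIEWSTATEGENERATOR"] = Hidden("__VIEWSTATEGENERATOR"),
            ["__EVENTVALIDATION"] = Hidden("__EVENTVALIDATION"),
        };

        var response = await http.PostAsync(url, new FormUrlEncodedContent(form));
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}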
Any help or guidance would be appreciated and even compensated...my boss wants this project completed this weekend!!!
The last time I had to do a project similar to this, I took a very different approach.
I used GreaseMonkey (though you could also use a Windows HTA file with the same effect)
and let the GreaseMonkey script run and step through the pages one by one. To handle the DoPostBack I simply invoked the click handler on the appropriate elements.
I had several data stores going.
One DataStore covered every menu item that I had "clicked" on to avoid duplicating things.
Another DataStore held the raw HTML of each page (taken from body.innerHTML)
Once I had cloned all the pages, I wrote another GreaseMonkey script to load up each saved page and mine whatever info I needed off of it. I built up a third datastore of resources (images and CSS) and then pulled those down with a big text file piped into cURL.
Is there a way in C# to get the output of AJAX or Java? What I'm trying to do is grab the specifics of items on a webpage, however the webpage does not load it into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by javascript through ajax calls, and this modified data is what you are trying to capture then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script, otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
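If the site turns out to return JSON, a sketch of that approach might look like the following. The endpoint, query string, and property names are guesses standing in for whatever Firebug or the network tab shows; only the overall shape is the point.

// Sketch: call the same endpoint the page's JavaScript calls and read the JSON.
// Endpoint, query string and property names are hypothetical - copy the real
// ones from the requests you see in Firebug / the browser's network tab.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class AjaxCallSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var json = await http.GetStringAsync(
            "https://www.example.com/inventory/search?type=used&page=1"); // hypothetical endpoint

        using var parsed = JsonDocument.Parse(json);
        foreach (var item in parsed.RootElement.GetProperty("results").EnumerateArray()) // hypothetical field
            Console.WriteLine(item.GetProperty("title").GetString()); // hypothetical field
    }
}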
A long time ago I wrote a VB app to screen scrape financial sites and made it so that you could fire up multiple of these "harvester" screen scrapers, which eased the time it took to load the data. We could do thousands of scrapes a day with multiple of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, like which customer to get next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentCompleted event, which should only fire when the web page is completely loaded. Then check out this post, which gives an overview of how to do it.
Use the Html Agility Pack. It allows download of .html and scraping via XPath.
See How to use HTML Agility pack
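A bare-bones example of that usage is below. The URL and the XPath expression are placeholders, and keep in mind (per the earlier answers) that this only sees the HTML the server sends, not anything the page's scripts add afterwards.

// Bare-bones HTML Agility Pack usage: download a page and query it with XPath.
// The URL and the XPath expression are placeholders.
using System;
using HtmlAgilityPack;

class AgilityPackSketch
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com/listings"); // placeholder URL

        var nodes = doc.DocumentNode.SelectNodes("//div[@class='listing']//h3/a"); // placeholder XPath
        if (nodes == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in nodes)
            Console.WriteLine(node.InnerText.Trim());
    }
}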
One of my friends is working on a good solution for generating aspx pages out of the HTML pages produced by a legacy ASP application.
The idea is to run the legacy app, capture the HTML output, clean the HTML using some tool (say HtmlTidy) and parse/transform it to aspx (using XSLT or a custom tool) so that existing HTML elements, divs, images, styles, etc. get converted neatly to an aspx page (too much ;) ).
Any existing tools/scripts/utilities to do the same?
Here's what you do.
Define what the legacy app is supposed to do. Write down the scenarios of getting pages, posting forms, navigating, etc.
Write unit test-like scripts for the various scenarios.
Use the Python HTTP client library to exercise the legacy app in your various scripts.
If your scripts work, you (a) actually understand the legacy app, (b) can make it do the various things it's supposed to do, and (c) you can reliably capture the HTML response pages.
Update your scripts to capture the HTML responses.
You have the pages. Now you can think about what you need for your ASPX pages.
Edit the HTML by hand to make it into ASPX.
Write something that uses Beautiful Soup to massage the HTML into a form suitable for ASPX. This might be some replacement of text or tags with <asp:... tags.
Create some other, more useful data structure out of the HTML -- one that reflects the structure and meaning of the pages, not just the HTML tags. Generate the ASPX pages from that more useful structure.
Just found the HTML Agility Pack to be useful enough, as they understand C# better than Python.
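If you stay in C#, the Beautiful-Soup-style massaging step could be sketched with the Agility Pack roughly as below. The file names and the div-to-asp:Panel substitution are purely illustrative; a real conversion would need its own mapping rules (and attention to tag casing in the saved output).

// Illustration only: load captured legacy HTML and swap a plain tag for a
// server-side equivalent before saving the result as an .aspx file.
// File names and the chosen substitution are hypothetical.
using HtmlAgilityPack;

class HtmlToAspxSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("captured/legacy-page.html"); // hypothetical captured output

        // Example substitution: turn <div id="..."> into <asp:Panel runat="server">.
        var divs = doc.DocumentNode.SelectNodes("//div[@id]");
        if (divs != null)
        {
            foreach (var div in divs)
            {
                var panel = HtmlNode.CreateNode(
                    "<asp:Panel runat=\"server\" ID=\"" + div.Id + "\">" + div.InnerHtml + "</asp:Panel>");
                div.ParentNode.ReplaceChild(panel, div);
            }
        }

        doc.Save("converted/legacy-page.aspx"); // hypothetical output path
    }
}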
I know this is an old question, but in a similar situation (50k+ legacy ASP pages that need to display in a .NET framework), I did the following.
Created a rewrite engine (HttpModule) which catches all incoming requests and looks for anything that is from the old site.
(In a separate class - keep things organized!) Use WebClient or HttpWebRequest, etc., to open a connection to the old server and download the rendered HTML.
Use the HTML Agility Pack (very slick) to extract the content that I'm interested in - in our case, this is always inside of a div with the class "bdy" (see the sketch after these steps).
Throw this into a cache - a SQL table in this example.
Each hit checks the cache and either a) retrieves the page and builds the cache entry, or b) just gets the page from the cache.
An aspx page built specifically for displaying legacy content receives the rewrite request and displays the relevant content from the legacy page inside of an asp literal control.
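The download-and-extract step referenced above is essentially the sketch below; the "bdy" class name comes from our pages, and the legacy URL is a placeholder.

// Download the rendered page from the old server and keep only the contents
// of the div with class "bdy". The legacy URL passed in is a placeholder.
using System;
using System.Net;
using HtmlAgilityPack;

class LegacyContentFetcher
{
    public static string GetLegacyBody(string legacyUrl)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(legacyUrl);

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var bdy = doc.DocumentNode.SelectSingleNode("//div[@class='bdy']");
            return bdy != null ? bdy.InnerHtml : string.Empty;
        }
    }

    static void Main()
    {
        Console.WriteLine(GetLegacyBody("http://legacy-server/somepage.asp")); // placeholder URL
    }
}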
The cache is there for performance - since the first request for a given page has a minimum of two hits - one from the browser to the new server, one from the new server to the old server - I store cachable data on the new server so that subsequent requests don't have to go back to the old server. We also cache images, css, scripts, etc.
It gets messy when you have to handle forms, cookies, etc, but these can all be stored in your cache and passed through to the old server with each request if necessary. I also store content expiration dates and other headers that I get back from the legacy server and am sure to pass those back to the browser when rendering the cached page. Just remember to take as content-agnostic an approach as possible. You're effectively building an in-page web proxy that lets IIS render old ASP the way it wants, and manipulating the output.
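In outline, the per-request cache-or-fetch flow is something like the sketch below. A dictionary stands in for the SQL cache table, expiration is reduced to a single stored date, and the fetch itself is passed in as a delegate; the real implementation also caches images, css, scripts, and selected response headers.

// Outline of the cache-or-fetch flow. A dictionary stands in for the SQL
// cache table and expiration is reduced to one stored date; the fetch from
// the legacy server is supplied as a delegate.
using System;
using System.Collections.Generic;

class CachedLegacyPage
{
    public string Html;
    public DateTime Expires;
}

class LegacyPageCache
{
    readonly Dictionary<string, CachedLegacyPage> cache = new Dictionary<string, CachedLegacyPage>();
    readonly Func<string, string> fetchFromLegacyServer; // e.g. the download/extract sketch above

    public LegacyPageCache(Func<string, string> fetchFromLegacyServer)
    {
        this.fetchFromLegacyServer = fetchFromLegacyServer;
    }

    public string GetPage(string legacyUrl)
    {
        if (cache.TryGetValue(legacyUrl, out var entry) && entry.Expires > DateTime.UtcNow)
            return entry.Html; // b) just gets the page from the cache

        // a) retrieves the page from the old server and builds the cache entry
        string html = fetchFromLegacyServer(legacyUrl);
        cache[legacyUrl] = new CachedLegacyPage { Html = html, Expires = DateTime.UtcNow.AddMinutes(30) };
        return html;
    }
}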
Works very well - I have all of the old pages working seamlessly within our ASP.NET app. This saved us a solid year of development time that would have been required if we had to touch every legacy asp page.
Good luck!