I have a webpage with stock market information that I want to monitor, read, and store. The information gathered is to be stored somewhere, say a .csv file or similar, for later analysis.
The first problem I have is detecting when this page has fully loaded. The time taken to load can vary enormously. The event handlers I have tried all fire multiple times (I know this has been covered and I have tried the various techniques, but to no avail). Perhaps it is something specific to this webpage? Anyway, I need to know when this page has fully loaded and is sitting pretty with all graphics displayed properly.
The second problem is that I cannot get the true source of the page into the webbrowser. As a consequence, all access to the DOM fails, as the HTML representation inside the webbrowser control appears not to match what is actually happening on the webpage. I have dumped the text (webBrowser2.DocumentText) and it looks nothing like what you see when you check the source in a browser, Chrome for example. (I also use the Firebug extension in Firefox to double-check things.) How can I get the correct page into the webbrowser so I can start to manipulate things?
Essentially, in terms of the data, I need the GMT Time, Strike Rate and expiration time. My process will monitor with a timer control. Being able to read all the other element data on screen is a nice-to-have.
Can this be done?
I am an experienced programmer new to web programming and C#.
I think you want this AJAX request.
As a review: the web works by first loading the web page, then scanning it for additional files it needs to load (JS, CSS, images, etc.). When those finish, the onload event is triggered and some AJAX functions may run.
In this case, only some of the page is loaded up front, and AJAX functions update the data in the graph later. As you've seen, "Show Source" only shows the original file that was downloaded; it is not a dump of the page's current state.
The easiest way to get the data is to find the URL of the AJAX request that loads the graph data. It is already conveniently formatted in JSON for you to scrape.
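Once you've found that URL with Firebug's Net panel or Fiddler, a plain HTTP client can poll it on a timer and append rows to your .csv, no WebBrowser control needed. A minimal sketch, where the endpoint URL and the field handling are assumptions you'd replace with what you actually observe:

using System;
using System.IO;
using System.Net;

class GraphPoller
{
    static void Main()
    {
        // Placeholder URL: capture the real AJAX request with Fiddler or
        // Firebug's Net panel and paste it here.
        const string endpoint = "https://example.com/api/graphdata";

        using (var client = new WebClient())
        {
            string json = client.DownloadString(endpoint);

            // Parse the JSON with a library such as Json.NET to pull out the
            // fields you care about (GMT time, strike rate, expiration time).
            // For brevity this sketch just timestamps the raw payload.
            File.AppendAllText("quotes.csv",
                DateTime.UtcNow.ToString("o") + "," + json.Replace("\n", " ") + Environment.NewLine);
        }
    }
}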
I am writing a web scraper for my company. Our client gives us access to their website for this purpose, but their IT team does not communicate with us, so I have to write the program with no help from the source.
Their website uses JavaScript on all of their buttons/dropdown menus to send postData to the server so that the screen updates to show the end user the correct info.
I am trying to get my program to simulate clicking 'next page'. The 'next page' button has an onclick event that reads like this:
onclick="javascript:WebForm_DoPostBackWithOptions(
new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$ucTaxQueueListView$lviewOrderQueue$DataPager2$ctl00$btnNextPage"
, "", true, "", "", false, false))"
In my C# program, I am using the HttpWebRequest class and the HtmlAgilityPack to do my requests and scraping, respectively.
I've done all I can in my code to try and get this to work. The only thing that works is to use Fiddler to copy the postData and paste it verbatim into my WebRequest function. This is very impractical when I potentially have to go through 1000+ 'next pages'.
I have also tried extracting the ViewState from the page and using that, but that always gives me an 'error' page.
Any help or guidance would be appreciated and even compensated...my boss wants this project completed this weekend!!!
The last time I had to do a project similar to this, I took a very different approach.
I used GreaseMonkey (though you could also use a Windows HTA file to the same effect) and let the script run and step through the pages one by one. To handle the DoPostBack I simply invoked the click handler on the appropriate elements.
I had several data stores going.
One data store covered every menu item that I had "clicked" on, to avoid duplicating work.
Another data store held the raw HTML of each page (taken from body.innerHTML).
Once I had cloned all the pages, I wrote another GreaseMonkey script to load each saved page and mine whatever info I needed from it. I built up a third data store of resources (images and CSS) and then pulled those down with a big text file piped into cURL.
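If you'd rather stay with HttpWebRequest and the HtmlAgilityPack as in the question, the usual trick is to echo back every hidden state field the server rendered, not just __VIEWSTATE; a missing __EVENTVALIDATION is a classic cause of the 'error' page. A rough sketch, where the page URL is a placeholder and the control name is taken from the onclick quoted above:

using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class PostBackPager
{
    // Fetches the next page by replaying the WebForms postback over HTTP.
    static string GetNextPage(string pageUrl, string currentHtml, CookieContainer cookies)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(currentHtml);

        // Echo back the hidden state fields the server rendered into the page.
        string viewState = HiddenValue(doc, "__VIEWSTATE");
        string eventValidation = HiddenValue(doc, "__EVENTVALIDATION");

        // The control name comes from the onclick handler quoted above.
        const string target = "ctl00$ContentPlaceHolder1$ucTaxQueueListView$lviewOrderQueue$DataPager2$ctl00$btnNextPage";

        string body = "__EVENTTARGET=" + Uri.EscapeDataString(target)
                    + "&__EVENTARGUMENT="
                    + "&__VIEWSTATE=" + Uri.EscapeDataString(viewState)
                    + "&__EVENTVALIDATION=" + Uri.EscapeDataString(eventValidation);

        var request = (HttpWebRequest)WebRequest.Create(pageUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies; // reuse the session's cookies

        byte[] data = Encoding.UTF8.GetBytes(body);
        using (Stream stream = request.GetRequestStream())
            stream.Write(data, 0, data.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }

    static string HiddenValue(HtmlDocument doc, string id)
    {
        HtmlNode node = doc.GetElementbyId(id);
        return node == null ? "" : node.GetAttributeValue("value", "");
    }
}

Some pages also render __VIEWSTATEGENERATOR or other hidden inputs; to be safe, echo back every hidden field you find in the form.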
The user fills in a form to download a file. The form results load in a new window (target="_blank"). The MVC controller action returns a FileResult on success or my "SelfClosingPage" view on failure.
The goal behind this is to have the user download the file in a new page; if any errors occur, the original calling page's URL doesn't change (to the /DownloadFile URL) and the user remains on the form page instead of being directed to an error page.
This all works great, except I need to know when the file download is complete, because I'd like to 1) hide the "File is downloading, please be patient" message if the download succeeds, and 2) show an error message if it fails.
I was using a Cookie to do this and a JS interval to regularly check the cookies value. It either never worked or doesn't work any more (I can never get the cookie to show up on the original page).
Please advise. I can't use C# code in my JS because it executes once when the page is loaded, and I'm trying to decouple the JS from the C# code.
I think my only solution is to do AJAX callbacks from JavaScript, but I'd like to avoid that.
UPDATE:
Found these related SO links that use the same approach I was trying to use.
MVC3 - File Download - Wait Status indicator
Detect when browser receives file download
Update 2
It's working again. I think the cookie's expiry date was not long enough (though it should have been). I just changed it from 10 minutes (a file download should not take longer than that) to half a day.
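For reference, the server half of that cookie handshake can look roughly like the sketch below; the cookie name, the token parameter, and the BuildFile helper are all illustrative. The JS side simply polls document.cookie on an interval until the token shows up.

using System;
using System.Web;
using System.Web.Mvc;

public class ExportController : Controller
{
    public ActionResult DownloadFile(string token)
    {
        try
        {
            byte[] contents = BuildFile(); // hypothetical helper that produces the file

            // Signal completion to the polling script. A generous lifetime and
            // a root path keep the cookie visible to the original page even on
            // slow downloads (the short 10-minute expiry was the suspect above).
            Response.Cookies.Add(new HttpCookie("fileDownloadToken", token)
            {
                Expires = DateTime.Now.AddHours(12),
                Path = "/"
            });

            return File(contents, "application/octet-stream", "export.csv");
        }
        catch (Exception)
        {
            // Failure: the new window renders the self-closing view and the
            // user stays on the form page.
            return View("SelfClosingPage");
        }
    }

    private static byte[] BuildFile() { return new byte[0]; } // placeholder
}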
Hello guys, I have an issue that has been bugging me for the past few weeks.
What I'm trying to accomplish: I need a webbrowser control with the ability to change the user agent (once at start) and the referrer, but most importantly, the ability to see the URL responses. What I mean by that: when you navigate to a website, you get back images, JavaScript files, and dynamic URLs in response. I need access to those URLs, some of which contain dynamic variables. A regular WebBrowser control will not show you those, and you can't access them in any way besides using FiddlerCore.
I was able to do that with a webbrowser plus FiddlerCore; I can see and do whatever I want with those URL addresses. The problem is that if you run a few instances of this program (or sometimes even one, if the program has automation that works with the URL responses), it gets stuck or stops working. I tried to fix it, but my solution is hacky and doesn't work right. I need a simple way to access those URLs, just as if you had used HttpWebRequest, but as a webbrowser. Why do I need a webbrowser? Because of the way I work, I need all the tracking pixels, scripts, images, and so on to execute, i.e. normal webbrowser behavior. With HttpWebRequest you can't just navigate and have all the scripts execute as in a webbrowser, or can you?
Using the System.Windows.Forms.WebBrowser control in a WinForms app, set the webBrowser.Url property to the URL of the page you're interested in.
The webbrowser's DocumentCompleted event fires after the page has loaded; any dynamically loaded JavaScript should be done by then. Hook the DocumentCompleted event and use webBrowser.Document.Images to get a list of all image elements on the page. From those images you can get their src attributes, which contain their URLs, including any query parameters hanging off the end. You can use webBrowser.Document.Links to get a list of all hyperlinks on the page. For other HTML elements of interest, you can use GetElementsByTagName("foo") to fetch all elements with that tag name from the page, then dig into their attributes to pull out URL properties.
With webbrowser.Document you can get to any HTML element, whether it is statically or dynamically created.
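Putting that together, a minimal WinForms sketch; the URL is a placeholder, and the frame check is there because DocumentCompleted fires once per frame, not just once per page:

using System;
using System.Windows.Forms;

class ScrapeForm : Form
{
    readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };

    ScrapeForm()
    {
        Controls.Add(browser);
        browser.DocumentCompleted += OnDocumentCompleted;
        browser.Url = new Uri("https://example.com/page-of-interest"); // placeholder
    }

    void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Only act on the top-level document, not on every child frame.
        if (e.Url != browser.Document.Url)
            return;

        foreach (HtmlElement img in browser.Document.Images)
            Console.WriteLine("image: " + img.GetAttribute("src"));

        foreach (HtmlElement link in browser.Document.Links)
            Console.WriteLine("link: " + link.GetAttribute("href"));
    }

    [STAThread]
    static void Main()
    {
        Application.Run(new ScrapeForm());
    }
}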
What you can't get to through webBrowser.Document is data that is loaded asynchronously using XMLHttpRequest() and kept in script variables, because that data is not part of the browser's Document Object Model. Pages built around scripted fake buttons will be difficult to intercept this way.
However, if you know where the data is stored by the JavaScript executing on the page, you may be able to access it using webbrowser.Document.InvokeScript(). If the JavaScript on the page stores URLs in a mydata property of the window object, for example, you could try webbrowser.Document.InvokeScript("window.mydata") or some variation to retrieve the value of mydata into the C# app.
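One variation that tends to work with the IE-based control is to route the expression through eval, since InvokeScript calls a named script function rather than evaluating arbitrary code. This would slot into the DocumentCompleted handler of the sketch above (window.mydata is still hypothetical):

// "eval" is a real global function in the page, so it can evaluate the
// (hypothetical) expression and hand the result back to C#.
object result = browser.Document.InvokeScript("eval", new object[] { "window.mydata" });
if (result != null)
    Console.WriteLine("mydata: " + result);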
Is there a way in C# to get the output of AJAX or JavaScript? What I'm trying to do is grab the specifics of items on a webpage; however, the webpage does not include them in the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by JavaScript through AJAX calls, and this modified data is what you are trying to capture, then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script; otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
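A rough sketch of that direct-request idea, for the easy end of the spectrum; the endpoint and query string are assumptions you'd lift from Firebug's Net tab, and some endpoints also check the X-Requested-With header before answering:

using System;
using System.Net;

class InventoryFetcher
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Headers some AJAX endpoints expect; harmless otherwise.
            client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";
            client.Headers["X-Requested-With"] = "XMLHttpRequest";

            // Placeholder URL and parameters, captured from the page's own request.
            string payload = client.DownloadString(
                "https://example.com/inventory/search?make=Honda&page=1");

            Console.WriteLine(payload); // likely JSON or XML rather than HTML
        }
    }
}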
A long time ago I wrote a VB app to screen scrape financial sites, and made it so that you could fire up multiple of these "harvester" screen scrapers at once, which might ease the long load times. We could do thousands of scrapes a day with several of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, such as which customer to get next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentCompleted event, which should only fire when the web page is completely loaded. Then check out this post, which gives an overview of how to do it.
Use the Html Agility Pack. It lets you download the HTML and scrape it via XPath.
See How to use HTML Agility pack
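A small HtmlAgilityPack sketch to show the shape of it; the URL and the XPath class name are illustrative, and remember it only sees the downloaded source, not anything AJAX adds afterwards:

using System;
using HtmlAgilityPack;

class HapExample
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com/listings"); // placeholder

        // SelectNodes returns null (not an empty list) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='vehicle-listing']");
        if (nodes == null)
            return;

        foreach (HtmlNode node in nodes)
            Console.WriteLine(node.InnerText.Trim());
    }
}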
I'm working with a CMS which allows you to develop your own custom controls that get dynamically included at runtime. I have a custom control which alters a data source (an NHibernate cache), but by that point in the request the CMS has already read this data from the cache, so I need to restart the processing of the page somehow so that the CMS picks up the new cache data.
I know there are probably more elegant ways of doing this, but because I am unable to directly alter the data held by the CMS' core once it has read from the cache and because of the way the control gets loaded by the CMS I am out of alternatives (I think).
I have tried doing a Response.Redirect() to the requested URL, but most browsers will think this is an infinite loop and kill the request. Any other ideas?
You can do this from your initial page:
Response.Clear();
Server.Transfer(Request.Url.PathAndQuery, true);
The second argument (preserveForm) passes along the initial page's QueryString and Form values. Because Server.Transfer happens entirely on the server, the browser never sees a second request to the same URL, so there is no redirect loop for it to kill.