I am writing a web scraper for my company. Our client gives us access to their website for this purpose, but our client's IT team does not communicate with us, so I have to write the program with no help from the source.
Their website uses JavaScript on all of its buttons and dropdown menus to send postData to the server so that the screen updates to show the end user the correct info.
I am trying to get my program to simulate clicking the 'next page' button. The button has an onclick event that reads like this...
onclick="javascript:WebForm_DoPostBackWithOptions(
new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$ucTaxQueueListView$lviewOrderQueue$DataPager2$ctl00$btnNextPage"
, "", true, "", "", false, false))"
In my C# program, I am using the HttpWebRequest class and the HtmlAgilityPack to do my requests and scraping, respectively.
I've done all I can in my code to try to get this to work. The only thing that works is to use Fiddler to copy the postData and paste that verbatim into my WebRequest function. This is very impractical when I potentially have to go through 1000+ 'next pages'.
I have also tried extracting the ViewState from the page and using that, but that always gives me an 'error' page.
Any help or guidance would be appreciated and even compensated...my boss wants this project completed this weekend!!!
The last time I had to do a project similar to this, I took a very different approach.
I used GreaseMonkey (though you could also use a Windows HTA file to the same effect) and let the GreaseMonkey script run and step through the pages one by one. To handle the DoPostBack I simply invoked the click handler on the appropriate elements.
I had several data stores going.
One DataStore covered every menu item that I had "clicked" on to avoid duplicating things.
Another DataStore was the raw HTML of the page (taken by body.innerHTML)
Once I had cloned all the pages, I wrote another GreaseMonkey script to load up each saved page and mine whatever info I needed off of it. I built up a third datastore of resources (images and CSS) and then pulled those down with a big text file piped into cURL.
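If you would rather stay with plain HTTP requests, here is a rough sketch of building the post body from the page itself instead of pasting it from Fiddler. It is written in Python for brevity, but the same steps map onto HttpWebRequest plus HtmlAgilityPack. Two assumptions on my part: the final `false` in the WebForm_PostBackOptions call is the clientSubmit flag, which would mean the button is an ordinary submit input whose name/value pair has to be in the post; and the 'error' page comes from sending __VIEWSTATE without the other hidden fields (such as __EVENTVALIDATION). The class and function names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collects the <input> name/value pairs a browser would submit."""

    def __init__(self):
        super().__init__()
        self.fields = {}          # hidden/text inputs and checked boxes
        self.submit_buttons = {}  # kept apart: only the clicked one is sent

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if not name:
            return
        kind = (attr.get("type") or "text").lower()
        value = attr.get("value") or ""
        if kind in ("submit", "image", "button"):
            self.submit_buttons[name] = value
        elif kind in ("checkbox", "radio"):
            if "checked" in attr:
                self.fields[name] = value or "on"
        else:
            self.fields[name] = value


def build_postback_body(html, button_name):
    """Return the urlencoded body for 'clicking' the named submit button."""
    collector = FormFieldCollector()
    collector.feed(html)
    collector.close()
    fields = dict(collector.fields)
    # Only the button that was "clicked" goes into the post.
    fields[button_name] = collector.submit_buttons.get(button_name, "")
    return urlencode(fields)


# Hypothetical, heavily shortened page; real control names are much longer.
sample_page = (
    '<form>'
    '<input type="hidden" name="__VIEWSTATE" value="abc/1=" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="xyz+2" />'
    '<input type="submit" name="ctl00$pager$btnNextPage" value="Next" />'
    '<input type="submit" name="ctl00$pager$btnPrev" value="Prev" />'
    '</form>'
)
body = build_postback_body(sample_page, "ctl00$pager$btnNextPage")
```

Each response carries fresh hidden-field values, so for 1000+ pages you would re-parse every page you get back and build the next body from it, rather than reusing the values from the first page.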
I am trying to use C# with AngleSharp or HTML Agility Pack to get information about available schedules from a web page. The problem is that to see which schedules are available on different days, you have to press a "div" (previous, next). So to get one month of schedules, I would have to go through them page by page. The problem I find is that I cannot click on the div. In contrast, I can do it with JavaScript in the Chrome console. I have seen a similar answer that uses DoClick on IHtmlElement, but it does not work for me: the page does not change, and the browser keeps returning the same HTML in the Document.
Let's first visit what can be done with AngleSharp:
Any kind of requests incl. their manipulation (on request, but also before response)
General cookie management (and their manipulation, of course)
Querying the DOM and perform "simple" actions (e.g., clicking a button, submitting a form)
Running trivial JavaScript files
Here trivial means: Scripts that do not need any capabilities beyond what AngleSharp offers, e.g., rendering tree information, advanced CSSOM access, ... - or scripts that require non-ES5 compliant parsers (e.g., make use of ES6 or some special non-standard capabilities).
The problem I see in your question description is that in order to "click" a div on a page a script needs to be run. This script may fall into the "trivial" category; most likely, however, it does not. Now you have 2 options:
Try it out and maybe it works / great, otherwise ...
See what the script is doing (obviously some HTTP request eventually ...) and do the same
The latter can of course be re-implemented in C# / AngleSharp. So you can create an HTTP request, get the data and either do something on that data set directly (it may be JSON and already what you want ....) or (if it is serving partial HTML) re-parse it and integrate it on the real page.
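To illustrate that last option, here is a minimal sketch (in Python for brevity; the idea carries over to C# unchanged) of handling whatever the endpoint returns: use the data directly if it is JSON, otherwise re-parse the HTML fragment. The function name, content types and sample payloads are hypothetical; the real ones depend on what the page's script actually requests.

```python
import json
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def interpret_response(content_type, body):
    """Return parsed JSON, or the text nodes of a partial-HTML response."""
    if "json" in content_type.lower():
        return json.loads(body)
    collector = TextCollector()
    collector.feed(body)
    collector.close()
    return collector.chunks


# Hypothetical responses for a "next day" request.
as_json = interpret_response("application/json; charset=utf-8",
                             '{"slots": ["09:00", "10:30"]}')
as_html = interpret_response("text/html",
                             "<ul><li>09:00</li><li> 10:30 </li></ul>")
```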
HTH!
I have a webpage that I want to monitor that has stock market information that I want to read and store. The information gathered is to be stored somewhere, say a .csv file or similar for later analysis.
The first problem I have is detecting when this page has fully loaded. The time taken to load can vary enormously. The event handlers I have tried all fire multiple times (I know this has been covered and I have tried the various techniques, but to no avail). Perhaps it is something specific to do with this web-page? Anyway, I need to know when this page has fully loaded and is sitting pretty with all graphics displayed properly.
The second problem is that I cannot get the true source page into the webbrowser. As a consequence, all access to the DOM fails, because the HTML representation inside the webbrowser control does not appear to match what is actually happening on the webpage. I have dumped the text (webBrowser2.DocumentText) and it looks nothing like what I see when I check the source in a browser, Chrome for example. (I also use the Firebug extension in Firefox to double check things.) How can I get the correct page into the webbrowser so I can start to manipulate things?
Essentially, in terms of the data, I need the GMT Time, Strike Rate and expiration time. My process will monitor with a timer control. To be able to read all the other element data on screen is a nice-to-have.
Can this be done?
I am an experienced programmer new to web programming and C#.
I think you want this AJAX request.
As a review, the web works by first loading the web page, then scanning the web page for additional files it needs to load (js, css, images, etc). When those finish, the onload event is triggered and some AJAX functions may run.
In this case, only some of the page is loaded and AJAX functions update the data in the graph later. As you've seen "Show Source" only shows the original file that was downloaded and is not a dump of its current state.
The easiest way to get the data is to find the URL of the AJAX request that loads the graph data. It is already conveniently formatted in JSON for you to scrape.
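Once you have that JSON, storing it for later analysis is a small step. Here is a minimal sketch in Python (the logic is the same in C#) that appends selected fields from a JSON array to CSV output. The field names `time`, `strike` and `expiry` are placeholders I made up; use whatever keys the real response contains.

```python
import csv
import io
import json


def json_rows_to_csv(json_text, columns, out):
    """Write one CSV row per JSON record; return the number of rows written."""
    records = json.loads(json_text)
    writer = csv.writer(out)
    for record in records:
        # Missing keys become empty cells instead of raising.
        writer.writerow([record.get(column, "") for column in columns])
    return len(records)


# Hypothetical payload; a real run would pass the AJAX response body and a
# file opened with open("quotes.csv", "a", newline="").
sample = '[{"time": "12:00:00", "strike": 1.2345, "expiry": "12:05:00"}]'
buffer = io.StringIO()
written = json_rows_to_csv(sample, ["time", "strike", "expiry"], buffer)
```

Polling on a timer then becomes: request the URL, call this function, sleep, repeat.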
So I'm making an online chat program.
Technologies: -AJAX(methodology) -PHP -C# -ASP.net -JQuery -HTML5 -MYSQL -IIS
Issue (Long):
I've implemented group chat, which works fine so far. My issue is with multi-chat. Mind you, I now realize I should have done the entire thing in PHP, but I only knew ASP.NET and C# when I started, and I will switch to PHP only as a last resort.
Anyway, when a multi-chat window is made, the pre-made code is injected via jQuery into a div and stored in sessionStorage. When the page refreshes, the code is loaded from sessionStorage and all IDs are incremented by one, so that each user can have a maximum of 6 windows open at any given time. Now I'm trying to run specific queries for each specific user's request, like "SELECT * WHERE user1 privateChatID = '1' AND user2 privateChatID = '1'; (not the actual query, just pseudo code)", but since I'm using AJAX to get the query, I can't really manipulate the PHP file that gets loaded, because the main file is an .aspx page.
For group chat I'm using an UpdatePanel, which works fine, but I can't dynamically make a draggable chat window inside the UpdatePanel, because I'd have to use a ' runat="server" ' attribute, and if I put that in the pre-scripted jQuery and try to increment the ID (like so: 'IDName "+ i +"'), Visual Studio/IIS gives an error. Hence the reason I'm trying to use AJAX. I've worked out all the other problems so far, and once I get a working version I'll probably rethink the whole structure altogether. My only issue now:
Issue (short):
Since I can't manipulate functions or variables in the PHP file, would it be easier to just select everything from the DB ('message table') and sort everything client side? Or would that not be optimal? Or is there a way to alter queries externally for a PHP file that is loaded by jQuery?
So my solution was to create PHP files on the server that have their own queries. Users have their own directories on the server for those files and, I guess, for any other files that might need to be added in the future. I'm taking a chance on whether the bandwidth load is too much, because a number of users will be creating files on the server whenever a chat window is created. I don't even know how secure it'll be, but it's working for the time being; I'll tweak the security issues after I get everything working.
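For comparison, the usual alternative to one query file per user is a single endpoint that takes the chat ID as a request parameter and binds it into a parameterized query. Below is a rough sketch of just the query side, in Python with an in-memory SQLite database standing in for MySQL. The table and column names are invented for the example, and the same pattern exists in PHP (prepared statements) and in ADO.NET.

```python
import sqlite3


def fetch_private_messages(conn, chat_id):
    """Return (sender, body) rows for one private chat, oldest first."""
    cursor = conn.execute(
        "SELECT sender, body FROM messages "
        "WHERE private_chat_id = ? ORDER BY id",
        (chat_id,),  # bound as data, never spliced into the SQL text
    )
    return cursor.fetchall()


# Hypothetical schema and rows, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, private_chat_id TEXT, sender TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO messages (private_chat_id, sender, body) VALUES (?, ?, ?)",
    [("1", "alice", "hi"), ("2", "carol", "other chat"), ("1", "bob", "hello")],
)
chat_one = fetch_private_messages(conn, "1")
```

The jQuery call would then send the chat ID along with the request (for example as a POST field), and every chat window would share the same server-side file.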
Is there a way in C# to get the output of AJAX or Java? What I'm trying to do is grab the specifics of items on a webpage, however the webpage does not load it into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by javascript through ajax calls, and this modified data is what you are trying to capture then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script, otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
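As a small illustration of the "easy" end of that range, here is a sketch in Python (the C# equivalent would use UriBuilder) of turning a page URL into a request URL with query string arguments. One thing worth knowing: anything after `#` in a URL is a fragment that the browser keeps to itself and never sends to the server, so it has to be dropped or translated into real parameters. The host and the `page` and `layout` parameter names below are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_request_url(page_url, params):
    """Replace the query string of page_url with params and drop the fragment."""
    parts = urlsplit(page_url)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )


# Hypothetical listing page whose state lives in the fragment.
page = "http://example.com/Inventory.aspx#layout=layout1"
fragment = urlsplit(page).fragment
request_url = build_request_url(page, {"page": 2, "layout": "layout1"})
```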
A long time ago I wrote a VB app to screen scrape financial sites and made it so that you could fire up multiple of these "harvester" screen scrapers. That might ease the time period loading data. We could do thousands of scrapes a day with multiple of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, like what customer to get next and what was needed to scrape (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentCompleted event. That should only fire when the web page is completely loaded. Then check out this post which gives an overview of how to do it.
Use the Html Agility Pack. It allows download of .html and scraping via XPath.
See How to use HTML Agility pack
I have
an ashx file,
Visual Studio 10,
no knowledge at all about C# ASP.NET
What is the proper way to compile and run this?
Context
The ashx file in question can be found in this zip, and is a demo application for a Tetris AI competition. It is a very enticing idea even if it depends a great deal on luck, and I thought I might use the occasion to learn a new language.
An ashx file is just a generic HTTP handler, so the easiest way to get this working is to create a new Web Site from the File menu, and just add the Handler.ashx file to the website root directory.
Then, just run the site (F5) and browse to "YourSite/Handler.ashx".
An ASHX file is like an ASPX file, but it's a handler. That means it doesn't respond back with HTML by default, and can therefore "handle" otherwise unhandled file types, but it's not necessarily tied to that meaning. In this case, you'll only be presenting the response
position=8&degrees=180
...to a posted board and piece. So you don't need HTML, so you want an ASHX.
You can make .ashx files the startup page in your project, just the same as .aspx pages. If I were writing a HelloUser.ashx page, I might set it as the start page, with some parameters passed in as querystrings or something.
You're probably going to want a test harness that posts a board / piece to your service, and that could be any kind of project. Command line program, website, test class run through NUnit, whatever. There's a lot of logic to keep track of beyond the "player" logic.
If you need a more detailed answer than that, SO might not be the place for this question. But I wish you all kinds of luck with this - it's an interesting problem.
You need to deploy it to an IIS server that has the proper .NET framework installed and that should be it.
If you are trying to get it working locally, create a web site project in visual studio, go to "add existing items" in solution explorer, and locate your ashx. Then click the play button (or press F5) to compile and run it.
Good luck!
You're missing some form (an ASPX file maybe) that goes with this handler. It looks like this thing probably handles some AJAX request from another page.
It's expecting 2 pieces of data with the request as well:
string board = context.Request.Form["board"];
string piece = context.Request["piece"];
You could reverse engineer the form that this is for, but it will probably take some time to get that board array right.
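If you end up writing a small test client instead, both ends of the exchange are simple to sketch. The Python below (any language with an HTTP client would do) builds a form-encoded body with the `board` and `piece` fields the handler reads, and parses a reply shaped like `position=8&degrees=180`. The board and piece values here are placeholders, not the real encoding, which you would still have to work out from the competition's demo.

```python
from urllib.parse import parse_qs, urlencode


def build_move_request(board, piece):
    """Form-encode the two fields the handler reads from the request."""
    return urlencode({"board": board, "piece": piece}).encode("ascii")


def parse_move_reply(reply):
    """Turn 'position=8&degrees=180' into the tuple (8, 180)."""
    fields = parse_qs(reply, strict_parsing=True)
    return int(fields["position"][0]), int(fields["degrees"][0])


# Placeholder board/piece values; the real format is not known here.
request_body = build_move_request("0000", "T")
move = parse_move_reply("position=8&degrees=180")
```

The request body would be sent as a POST with the content type application/x-www-form-urlencoded to wherever Handler.ashx is hosted.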