How to trigger ajax calls from a HTMLDocument Object in C# - c#

I am trying to retrieve some data from a web page by pulling the HTML into a string and parsing through it. The problem is that the information I want only shows up on the page when a user browses to the bottom of the page and triggers an Ajax call to update the DOM. Is there any way to do this in code without loading the HTML into a browser control and telling it to scroll?

Related

Getting content from web page, that is populated through a javascript

I'm trying to parse a web page using Html Agility Pack. What I have understood from my attempts is that the web page is "populated" using JavaScript. When I load the page using
HtmlDocument doc = web.Load(linkToPage);
I get an empty page. The page is a sub-page, so to say, and I'm using the original page to scrape the links to these sub-pages (it works for the main page since that one does not use JavaScript to populate the page, I assume).
Is there a way to parse a web page that is populated through JavaScript, or is there a better tool for this?
See this if you wish to use JAVA, I worked with FTL and also JSrender, both were pretty cool

Find URL Responses? Alternative To Default WebBrowser Control?

Hello guys I have an issue bugging me for the past few weeks.
What I'm trying to accomplish: I need a web browser control with the ability to change the user agent (once at start) and the referrer. Most important is the ability to see the URLs of the responses. For example, when you navigate to a website you get back images, JavaScript files, and dynamic URLs in response. I need access to those URLs, some of which have dynamic variables (the regular WebBrowser control will not show you those, and you can't access them in any way besides using FiddlerCore).
I was able to do that with WebBrowser + FiddlerCore: I can see those URLs and do whatever I want with them. The problem is that if you run a few instances of this program (or sometimes just one, if the program has some automation working with the URL responses), it gets stuck or doesn't work. I tried fixing it, but it's a hacky solution that doesn't work right. I need a simple way to access those URLs, just as if I used HttpWebRequest, but as a web browser. Why do I need it as a web browser? The way I work, I need all the tracking pixels, scripts, images, and so on to execute, i.e. normal web browser behavior. With HttpWebRequest you can't just navigate and have all the scripts execute as in a web browser, or can you?
Using the System.Windows.Forms.WebBrowser control in a WinForms app, set the webBrowser.Url property to the URL of the page you're interested in.
The WebBrowser's DocumentCompleted event fires after the page has loaded. Any dynamically loaded JavaScript should be done by then. Hook the DocumentCompleted event and use webBrowser.Document.Images to get a list of all image elements on the page. From those images you can get their SRC attributes, which contain their URLs, including any query parameters hanging off the end. You can use webBrowser.Document.Links to get a list of all hyperlinks on the page. For other HTML elements of interest, you can use GetElementsByTagName("foo") to fetch all elements with that tag name from the page, then dig into their attributes to pull out URL properties.
With webbrowser.Document you can get to any HTML element, whether it is statically or dynamically created.
What you can't get to through webbrowser.Document is data that is loaded asynchronously using XMLHttpRequest(), because this data is not part of the browser Document Object Model. Web pages with scripted false buttons will be difficult to intercept.
However, if you know where the data is stored by the JavaScript executing on the page, you may be able to access it using webbrowser.Document.InvokeScript(). Note that InvokeScript() calls a script function by name, so it cannot read a property directly. If the JavaScript on the page stores URLs in a mydata property of the window object, for example, you could try webbrowser.Document.InvokeScript("eval", new object[] { "window.mydata" }) or some variation to retrieve the value of mydata into the C# app.

How to extract the extra content loaded in a web page

How can I extract the extra content loaded in a web page that is not visible in View Page Source? The extra content is loaded using ajax. This data can be seen under the Net tab in Firebug. How can I extract this data using C# code?
Two ways:
1- You can use a WebBrowser control to load the same page and get the active document.
2- You can replicate the ajax call that the page makes, and use that to get the extra bits that are appended to the document.
And reading your LinkedIn example above:
When you select the checkbox, an ajax call is made, which brings back results and populates the table. You can see that call in the Firebug console window, look at the POST parameters, and replicate them to get the same result.
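The second option (replicating the ajax call) could look roughly like the sketch below. It only builds the request and prints it; the endpoint URL and the "page" form field are hypothetical placeholders, and the real values would have to be copied from the request shown in the browser's network tab. The X-Requested-With header is an assumption about what the server checks, not something the original page is known to require.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;

// Builds a POST that imitates the request sent by the page's script.
// Both the URL and the form fields are supplied by the caller.
static HttpRequestMessage BuildAjaxPost(string url, IEnumerable<KeyValuePair<string, string>> form)
{
    var message = new HttpRequestMessage(HttpMethod.Post, url)
    {
        Content = new FormUrlEncodedContent(form)
    };
    // Header that many ajax libraries send; some servers check it (assumption).
    message.Headers.Add("X-Requested-With", "XMLHttpRequest");
    message.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
    return message;
}

var request = BuildAjaxPost(
    "https://example.com/search/results",              // hypothetical endpoint
    new Dictionary<string, string> { ["page"] = "2" }); // hypothetical parameter

Console.WriteLine($"{request.Method} {request.RequestUri}");
if (request.Content != null)
{
    Console.WriteLine(await request.Content.ReadAsStringAsync());
}

// To actually send it:
// using var client = new HttpClient();
// var response = await client.SendAsync(request);
// string body = await response.Content.ReadAsStringAsync();
```

The returned body is usually JSON or an HTML fragment, which can then be parsed without loading the page into a browser control.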
It depends on your application in the first place. If you are using a C# application as the client for reading a web page, then the ajax content may not be visible until you put in a JavaScript engine.
If you are serving the said pages, you only have to log the request and response on the server.
More specific question would be appreciated
That extra content is dynamically generated by ajax (e.g. a GridView is generated as a table). It is stored in the browser's memory and can be viewed with client-side debugging tools (IE has a developer tools option).
Once you do a post back, all the control's values are available for C#.
If you are saying extra content, can you please clarify what exactly you are trying to extract using c#?

Webbrowser Trying to parse some page which uses ajax to load data

I am trying to parse some images from this site..
I was using htmlagilitypack for the other pages
but this page uses ajax to load images
so this is how the webpage works.
has a div tag including nothing.
right below the div they have a script tag in cdata thing
<script type="text/javascript">
//<![CDATA[
(function(){
// ajax POST request to an 'aaaaaa.js' file with the id parameter; if the request succeeds, it updates the blank div by changing its innerHTML value.
})();
//]]>
</script>
So, what I tried was:
1- Navigate to that page using the WebBrowser control. This loads the image just fine, but when I try to get the value of 'DocumentText' it only shows the blank div tag.
2- Try to get the data straight from the ajax POST using WebRequest and WebResponse. Maybe because it's a .js file, it doesn't work. I only get HTTP errors.
3- Browse straight to the .js file with the parameters attached. This gives me an error page.
4- Browse the page I'm trying to parse and then navigate to the .js page. (I guess the browser caches something when I browse the original page, but I don't know what it is.) I do get the JSON response! I can use this data, but since the WebBrowser control uses IE, it just asks me if I want to download the returned .js file.
So there's the method of using the DOM that I mentioned in my comment.
Another method could be to use FiddlerCore ( http://www.fiddler2.com/fiddler/Core/)
Or make the ajax call yourself. You will have to make sure you respect cookies, redirects, and all the headers.
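Making the call yourself while respecting cookies, redirects, and headers might be sketched as follows. The idea is to load the page first so the server can set whatever session cookies it expects, then repeat the POST with the same cookie container and a Referer header. This is an assumption about why browsing the page first made the .js request succeed; the URLs, the user agent string, and the helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Creates a client that keeps cookies between requests and follows redirects.
static HttpClient CreateSessionClient(CookieContainer cookies)
{
    var handler = new HttpClientHandler
    {
        CookieContainer = cookies,
        UseCookies = true,
        AllowAutoRedirect = true
    };
    var httpClient = new HttpClient(handler);
    // Placeholder user agent; some sites reject requests that have none.
    httpClient.DefaultRequestHeaders.TryAddWithoutValidation(
        "User-Agent", "Mozilla/5.0 (compatible; ExampleScraper/1.0)");
    return httpClient;
}

// Loads the page first (to collect cookies), then repeats the script's POST.
static async Task<string> FetchAjaxPayloadAsync(
    HttpClient httpClient, string pageUrl, string scriptUrl, string id)
{
    using (var pageResponse = await httpClient.GetAsync(pageUrl))
    {
        pageResponse.EnsureSuccessStatusCode();
    }

    var post = new HttpRequestMessage(HttpMethod.Post, scriptUrl)
    {
        Content = new FormUrlEncodedContent(
            new Dictionary<string, string> { ["id"] = id })
    };
    post.Headers.Referrer = new Uri(pageUrl);
    post.Headers.Add("X-Requested-With", "XMLHttpRequest");

    using var response = await httpClient.SendAsync(post);
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync();
}

var cookies = new CookieContainer();
var client = CreateSessionClient(cookies);
Console.WriteLine($"Cookies stored before any request: {cookies.Count}");

// Hypothetical usage; replace both URLs and the id with the real values:
// string json = await FetchAjaxPayloadAsync(
//     client, "https://example.com/gallery?id=42", "https://example.com/aaaaaa.js", "42");
```

If the server still returns an error, comparing the request headers against the ones captured in Fiddler is usually the quickest way to find what is missing.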

Opening an external page inside our page

I used to implement what the title describes by using an iframe, but I don't want to use one any more. I have some plans in mind that require opening an external page inside our ASP.NET page without using any iframe. I have only a simple aspx page with a div tag, a panel, and some other server-side components. How can I do this without an iframe? I don't want to design a new complex control; I am looking for some existing methods that can do it for me.
I should mention that I need to control the area that is loaded from the external site in the same way as an iframe. The difference is that an iframe cannot be handled by ajax: even if you put the iframe inside an UpdatePanel, your page refreshes and posts back when you change the src value programmatically (in C# code). So we have to find some other method. What is the solution?
I thought I could make a request, get some HTML, and show it in a div, but I couldn't implement it.
You could:
1- Make a WebRequest on the server side and then set the div's text to the HTML returned.
2- Make an invisible iframe to make the request and then use JavaScript to grab the HTML from the iframe and put it in a div. (EDIT: A comment suggests this won't work.)
You can't generally make calls (like XMLHttpRequest) to external websites because of the browser's same-origin policy.
Your direct request, "opening an external page inside our asp.net page without using any iframe" is not possible, by design.
You mention AJAX. You can use AJAX to load your page, remove the headers (or do that serverside) and replace the <body> tag with a <div> tag (or do that server side too). This way, you can place the contents of your page anywhere you like. As a container, I suggest you use a block level element, a <div> would suffice.
The only (!) problem here is: cross-site requests like this are not honored by browsers. You can solve this server-side by loading the page from elsewhere using WebRequest or similar means.
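The "replace the <body> tag with a <div> tag" step described above can be done on the server once the external page has been fetched. The sketch below works on an HTML string so it runs without a network connection; the helper name, the CSS class, and the sample markup are made up for illustration, and a regular expression is used only to keep the example short (an HTML parser such as Html Agility Pack would be more robust on real pages).

```csharp
using System;
using System.Text.RegularExpressions;

// Returns the content of the <body> element wrapped in a <div>.
// Falls back to wrapping the whole input when no <body> element is found.
static string ExtractBodyAsDiv(string html)
{
    Match match = Regex.Match(
        html,
        @"<body[^>]*>(.*?)</body>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    string inner = match.Success ? match.Groups[1].Value : html;
    return "<div class=\"external-content\">" + inner.Trim() + "</div>";
}

// Sample input standing in for the HTML downloaded from the external site.
string sample =
    "<html><head><title>t</title></head><body class=\"x\"><p>Hello</p></body></html>";
Console.WriteLine(ExtractBodyAsDiv(sample));
```

The resulting string can then be assigned to the container on your own page. Relative links, scripts, and styles in the fetched markup will not resolve against the external site any more, so they may need rewriting.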
Depends on where you'd like to merge the data. If you'd like to merge the data on the client browser, your only other option besides frames is to use Javascript/Ajax.
You can do a jQuery.ajax() on page load and use the html() method on a div to populate it with the textual result of that AJAX call.
Try to use as little of the WebForms control hierarchy and life-cycle as possible. It sounds like your problem can be fixed with AJAX if you don't mind the second request on page load.
If you would like to merge the content on the server side ( rarely the right thing to do ) you can use System.Net.HttpWebRequest to get and merge the data before returning it to the browser.
There's no substitute for an iframe in your situation. You're not going to be able to make ajax requests to the other site due to security concerns. You could retrieve the contents of a single page server side and render it to the client, but none of the functionality will be included, since the content is now running in the context of your own site.
