I'm trying to parse a web page using Html Agility Pack. From my attempts, I have understood that the web page is "populated" using JavaScript. When I load the page using
HtmlDocument doc = web.Load(linkToPage);
I get an empty page. The page is a sub page, so to speak, and I'm using the original page to scrape the links to these sub pages (this works for the main page, I assume because that page does not use JavaScript to populate itself).
Is there a way to parse a web page that is populated through JavaScript, or is there a better tool for this?
See this if you wish to use Java. I have worked with FTL and also JsRender; both were pretty cool.
I have to write a console application that grabs and parses data from a website.
Unfortunately, the website uses some kind of JavaScript framework to compose the page.
So what I need to do is get the HTML once the page has been rendered by JavaScript.
This is just the first step. My second step is to navigate the website to collect data from different pages, but unfortunately the pages that I have to parse do not have URLs; they are loaded by JavaScript too.
Do you have any ideas?
Thanks for your support,
Dario
I am scraping a web page using HAP, and I want to access the submit button on the page, but I don't know how this could be done with HAP and C#. Is there a way to do this?
The HTML Agility Pack is not a browser, so while it can parse an HTML file, there is no way to really interact with it. You can find the submit object, read its properties and so forth, but you can't make it do anything.
You have two options:
Read the form, build an HTTP request object that matches the form's fields and method, and send it to the server. This is all manual work; the Agility Pack only helps you list the fields on the form and their properties.
If you need to interact with the page, you'll need a browser. There are headless browsers, like PhantomJS, that will actually load the page, parse the JavaScript and run what's sent by the server. There are wrappers around such browsers for C#; one example is Awesomium. It's similar to the HTML Agility Pack in that it allows you to parse HTML documents, but it takes things one step further by actually running the page without ever showing a browser window.
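To make the first option a bit more concrete, here is a minimal sketch of the "build the request by hand" step. It is written in JavaScript rather than C# purely for illustration. The function name buildFormRequest and the shape of the form object (action, method, fields) are assumptions of this sketch; they stand in for whatever you read out of the parsed <form> element. It also assumes the form uses the default URL-encoded encoding, not multipart.

```javascript
// Sketch: turn the fields read from a <form> into the request a browser would send.
// `form` is a hypothetical plain object: { action, method, fields: [{ name, value }] }.
function buildFormRequest(pageUrl, form) {
  const method = (form.method || "GET").toUpperCase();
  // A relative (or missing) action is resolved against the URL of the page.
  const url = new URL(form.action || "", pageUrl);

  // Encode every named field, including hidden inputs, as name=value pairs.
  const params = new URLSearchParams();
  for (const { name, value } of form.fields) {
    if (name) params.append(name, value ?? "");
  }

  if (method === "POST") {
    return {
      url: url.toString(),
      options: {
        method,
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
        body: params.toString(),
      },
    };
  }

  // GET forms carry the fields in the query string instead of a body.
  url.search = params.toString();
  return { url: url.toString(), options: { method } };
}
```

The result can be handed to any HTTP client, for example fetch(request.url, request.options). The server may also expect cookies or hidden fields from the original page, so include every input the form contains.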
I am trying to parse some images from this site.
I was using HtmlAgilityPack for the other pages,
but this page uses Ajax to load its images.
This is how the web page works:
It has a div tag that contains nothing.
Right below the div there is a script tag with a CDATA section:
<script type="text/javascript"> //<![CDATA[
(function(){
// Ajax POST request to an 'aaaaaa.js' file with the id parameter; if the request succeeds, it updates the blank div by changing its innerHTML value.
})(); //]]>
</script>
So, this is what I tried:
Navigate to that page using the WebBrowser control. It loads the images just fine, but when I try to get the value of 'DocumentText', it only shows the blank div tag.
Get the data straight from the Ajax POST using WebRequest and WebResponse. Maybe because it is a .js file, it doesn't work; I only get HTTP errors.
Browse directly to the .js file with the parameters attached. This gives me an error page.
Browse to the page I'm trying to parse and then navigate to the .js page. (I guess the browser caches something when I browse the original page, but I don't know what it is.) I do get the JSON response, and I can use this data! But since the WebBrowser control uses IE, it just asks me whether I want to download the returned .js file.
So there's the method of using the DOM that I mentioned in my comment.
Another method could be to use FiddlerCore (http://www.fiddler2.com/fiddler/Core/).
Or make the ajax call yourself. You will have to make sure you respect cookies, redirects, and all the headers.
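As a rough illustration of "make the Ajax call yourself", the sketch below first loads the page to pick up its cookies and then repeats the POST with those cookies and browser-like headers. It is JavaScript, assuming a runtime with fetch and Headers.getSetCookie (for example a recent Node.js). The names fetchAjaxData and cookieHeaderFrom, and the choice of Referer and X-Requested-With headers, are assumptions of this sketch; which headers the real server checks has to be found by inspecting the browser's own request.

```javascript
// Keep only the "name=value" part of each Set-Cookie line and join them
// into a single Cookie request header value.
function cookieHeaderFrom(setCookieLines) {
  return setCookieLines
    .map((line) => line.split(";")[0].trim())
    .filter(Boolean)
    .join("; ");
}

// Hypothetical helper: load the page, then replay the script's POST.
async function fetchAjaxData(pageUrl, ajaxUrl, formBody, fetchImpl = fetch) {
  // 1. Request the original page so the server issues its session cookies.
  const page = await fetchImpl(pageUrl, { redirect: "follow" });
  const setCookies =
    typeof page.headers.getSetCookie === "function"
      ? page.headers.getSetCookie()
      : [];

  // 2. Repeat the POST the page's script makes, with the headers a browser sends.
  const headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": pageUrl,
    "X-Requested-With": "XMLHttpRequest",
  };
  const cookie = cookieHeaderFrom(setCookies);
  if (cookie) headers["Cookie"] = cookie;

  const response = await fetchImpl(ajaxUrl, {
    method: "POST",
    redirect: "follow",
    headers,
    body: formBody,
  });
  if (!response.ok) {
    throw new Error(`Ajax request failed: HTTP ${response.status}`);
  }
  return response.text();
}
```

The same two-step flow applies in C#: keep one cookie container across both requests and copy the headers the browser sends.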
I have a WebBrowser component and I would like to save the modified HTML code to a file.
I don't know if that is clear: the browser navigates to a page and receives HTML + JS, then the JS modifies the HTML code, and I need to save that modified HTML code.
I have tried to use DocumentText, but it returns the original HTML code, not the HTML code modified by JS.
Does anyone know how to solve this problem?
A lot of developer tools (Firebug for Firefox, or the developer tools for IE or Chrome) will allow you to see the updated HTML.
You can use the OuterHtml of the element you are interested in (e.g. BODY); it reflects the current DOM rather than the originally downloaded source.
Look at the methods of HtmlDocument, such as http://msdn.microsoft.com/en-us/library/system.windows.forms.htmldocument.getelementsbytagname.aspx, and of HtmlElement: http://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement.outerhtml.aspx
I have one HTML file containing several <div> elements. I want to refresh just part of the page using either JavaScript or C#. Can someone help?
I am trying to do it this way:
document.location.reload(document.getElementById("contentdiv"));
It reloads the whole page. I want to reload only contentdiv. If contentdiv is in the middle of the page, then only that part should be reloaded.
Thank you.
You could move the contents of everything you want reloaded into an external file, and either use the <iframe> tag and only refresh that frame, or you could use JavaScript and refresh the div with Ajax.
Ajax isn't that simple to explain in a short answer, but you can find plenty of information on it here: http://www.w3schools.com/Ajax/ajax_example.asp or, if you use a framework like jQuery, Ajax is much easier.
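For what it's worth, the Ajax variant can be quite short with the browser's fetch API. This is only a sketch under some assumptions: refreshElement is a hypothetical helper name, and the content to reload is assumed to live in its own file (here mypagecontent.html) that returns an HTML fragment.

```javascript
// Fetch an HTML fragment and put it into one element, leaving the rest
// of the page untouched. `fetchImpl` defaults to the browser's fetch.
async function refreshElement(element, url, fetchImpl = fetch) {
  // "no-store" asks the browser not to serve a cached copy.
  const response = await fetchImpl(url, { cache: "no-store" });
  if (!response.ok) {
    throw new Error(`Refresh failed: HTTP ${response.status}`);
  }
  element.innerHTML = await response.text();
  return element;
}

// In the page, for example from a button click or a timer:
// refreshElement(document.getElementById("contentdiv"), "mypagecontent.html");
```

Only the contents of contentdiv change; scripts inside the fetched fragment are not executed when assigned through innerHTML, so this fits plain markup best.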
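Iframes can be implemented (on mypage.html, for example) like so: <iframe src='mypagecontent.html'></iframe> and in mypagecontent.html a script can call window.location.reload() to refresh the frame. Trigger it from a timer or an event handler, since <script type='text/javascript'>window.location.reload();</script> on its own would reload the frame in an endless loop as soon as it loads.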
Not sure if this is what you're looking for, but hope it helps somewhat.
Which ASP.NET framework are you using? If you are using Web Forms, look into UpdatePanel.