Get output of web page in C#

I am trying to get the resulting web page content so I can extract the display text. I have tried the code below, but it gets me the source HTML and not the rendered HTML.
// Requires a reference to System.Net (using System.Net;)
string urlPath = "http://www.cbsnews.com/news/jamar-clark-protests-follow-decision-not-to-file-charges-in-minneapolis-police-shooting/";
WebClient client = new WebClient();
// DownloadString returns the raw server response (the source HTML), before any JavaScript runs.
string str = client.DownloadString(urlPath);
Compare the text in the str variable with the HTML shown in Chrome's Developer Tools and you will see different results.
Any recommendations will be appreciated.

I'm assuming you mean that you want the article text. If so, you will need a different approach. The page you refer to is full of client-side script that injects much of the content into the base HTML document as it runs. To get the content you're interested in, you will need to parse the DOM after that script has executed.

As others have pointed out, an actual web browser will parse the downloaded HTML and execute javascript against it, potentially altering its content. While you could try to do that parsing yourself, the easiest route is to ask a real web browser to do it for you and then grab the results.
The easiest solution specifically in C# would be to use the WebBrowser Control from Windows Forms, which essentially exposes IE to your program, allowing you to control it. Use the Navigate method to load the URL in question, then use the Document property to navigate the DOM. You can, at that point, get the outerHTML to get the final content of the DOM as HTML.
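A minimal sketch of that approach (assuming a WinForms project and an STA entry point; heavily scripted pages may still be injecting content after DocumentCompleted fires, so you may need a delay or a retry):

using System;
using System.Windows.Forms;

class RenderedHtmlFetcher
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (sender, e) =>
        {
            // OuterHtml reflects the DOM after the browser has run the page's scripts,
            // unlike DocumentText, which returns the original source.
            string renderedHtml = browser.Document.GetElementsByTagName("html")[0].OuterHtml;
            Console.WriteLine(renderedHtml.Length);
            Application.ExitThread();   // stop the message loop once we have the content
        };

        browser.Navigate("http://www.cbsnews.com/news/jamar-clark-protests-follow-decision-not-to-file-charges-in-minneapolis-police-shooting/");
        Application.Run();   // the WebBrowser control needs a running message loop
    }
}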
If you're not writing a Windows program and are interested more in headless operation, have a look at PhantomJS. It's a headless Webkit browser that is scriptable from javascript and would give you similar capability, although not in C#.

Related

HtmlAgilityPack table returns null when selecting nodes

I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I get the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly-blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
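Here is a rough sketch of that combination: let the WebBrowser control run the page's scripts, then hand the rendered HTML back to Html Agility Pack for parsing. It assumes a WinForms/STA context and the HtmlAgilityPack package; the URL and XPath are placeholders for whatever table you are actually after.

using System;
using System.Windows.Forms;

class BrowserThenParse
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            string rendered = browser.Document.GetElementsByTagName("html")[0].OuterHtml;

            // Parse the script-modified HTML with Html Agility Pack.
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(rendered);

            // Placeholder XPath - replace it with the table you actually need.
            var rows = doc.DocumentNode.SelectNodes("//table//tr");
            Console.WriteLine(rows == null ? 0 : rows.Count);

            Application.ExitThread();
        };

        browser.Navigate("http://example.com/page-built-by-script");   // placeholder URL
        Application.Run();
    }
}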
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well, but it has no x64 support and is not thread safe. I'm using it to scan some web sites 24x7; it runs fine for a couple of days at a time, but then it usually crashes.

Parsing web page with HtmlAgilityPack and simulate a click

I am scraping a certain web page using HAP, and I want to access the submit button on the page. The problem is I don't know how this can be done with HAP and C#. Is there a way I could do this?
The HTML Agility Pack is not a browser, so while it can parse an HTML file, there is no way to really interact with it. You can find the submit object, read its properties and so forth, but you can't make it do anything.
You have two options:
Either read the form, build an HTTP request that matches the form's fields and its post method, and send it to the server. This is all manual work; the Agility Pack only helps you list the fields on the form and their properties (a rough sketch of this approach follows after this list).
If you need to interact with the page, you'll need a browser. There are headless browsers, like PhantomJS, that will actually load the page, parse the JavaScript, and run what's sent by the server. There are wrappers around those browsers for C#; one such example is Awesomium. It's similar to the HTML Agility Pack in that it allows you to parse HTML documents, but it takes things one step further, actually running the page without ever showing a browser screen.
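A rough sketch of the first option, assuming the HtmlAgilityPack package. The URL, field names, and values are hypothetical, and a real ASP.NET form will usually also need hidden fields such as __VIEWSTATE copied across.

using System;
using System.Collections.Specialized;
using System.Net;
using HtmlAgilityPack;

class ManualFormPost
{
    static void Main()
    {
        // By default HAP treats <form> as an empty element; remove that flag so the
        // form's child inputs are parsed as its children.
        HtmlNode.ElementsFlags.Remove("form");

        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/form-page");   // hypothetical URL

        var form = doc.DocumentNode.SelectSingleNode("//form");
        string action = form.GetAttributeValue("action", "http://example.com/submit");

        // List the form's input fields and their default values.
        var fields = new NameValueCollection();
        var inputs = form.SelectNodes(".//input");
        if (inputs != null)
        {
            foreach (var input in inputs)
            {
                string name = input.GetAttributeValue("name", null);
                if (name != null)
                    fields[name] = input.GetAttributeValue("value", "");
            }
        }

        fields["q"] = "my search term";   // hypothetical field you actually want to set

        // Post the fields back to the form's action URL, as the browser would on submit.
        using (var client = new WebClient())
        {
            byte[] response = client.UploadValues(action, "POST", fields);
            Console.WriteLine(System.Text.Encoding.UTF8.GetString(response));
        }
    }
}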

Find URL Responses? Alternative To Default WebBrowser Control?

Hello guys, I have an issue that has been bugging me for the past few weeks.
What I'm trying to accomplish: I need a webbrowser control with the ability to change the user agent (once at start) and the referrer, but most importantly the ability to see the URL responses. What I mean by that: when you navigate to a website, you get back images, JavaScript files, and dynamic URLs in response. I need access to those URLs, some of which contain dynamic variables. (The regular WebBrowser control will not show you those, and you can't access them in any way besides using FiddlerCore.)
I was able to do that with the webbrowser + FiddlerCore; I could see and do whatever I wanted with those URLs. The problem was that if you run a few instances of this program (or sometimes just one, if the program has some automation working with the URL responses), it gets stuck or doesn't work. I tried fixing it and making it work, but it's kind of a hacky solution that doesn't work right. I need a simple way to access those URLs, just as if you used HttpWebRequest, but as a webbrowser. Why do I need it as a webbrowser? Because of the way I work, I need the execution of all the tracking pixels, scripts, images, etc. - normal webbrowser behavior. With HttpWebRequest you can't just navigate and have all the scripts executed as a webbrowser would, or can you?
Using the System.Windows.Forms.WebBrowser control in a WinForms app, set the webBrowser.Url property to the URL of the page you're interested in.
The webbrowser's DocumentCompleted event fires after the page has loaded; any dynamically loaded JavaScript should be done by then. Hook the DocumentCompleted event and use webbrowser.Document.Images to get a list of all image elements on the page. From those images you can get their SRC attributes, which contain their URLs, including any query parameters hanging off the end. You can use webbrowser.Document.Links to get a list of all hyperlinks on the page. For other HTML elements of interest, you can use GetElementsByTagName("foo") to fetch all elements with that tag name from the page, then dig into their attributes to pull out URL properties.
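A minimal sketch of that (assuming a WinForms/STA context; the URL is a placeholder):

using System;
using System.Windows.Forms;

class UrlCollector
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // Images and Links expose the elements present in the rendered DOM,
            // including ones that were injected by script after the initial load.
            foreach (HtmlElement img in browser.Document.Images)
                Console.WriteLine("IMG:    " + img.GetAttribute("src"));

            foreach (HtmlElement link in browser.Document.Links)
                Console.WriteLine("LINK:   " + link.GetAttribute("href"));

            // The same idea works for any other tag of interest.
            foreach (HtmlElement script in browser.Document.GetElementsByTagName("script"))
                Console.WriteLine("SCRIPT: " + script.GetAttribute("src"));

            Application.ExitThread();
        };

        browser.Navigate("http://example.com");   // placeholder URL
        Application.Run();
    }
}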
With webbrowser.Document you can get to any HTML element, whether it is statically or dynamically created.
What you can't get to through webbrowser.Document is data that is loaded asynchronously using XMLHttpRequest(), because this data is not part of the browser Document Object Model. Web pages with scripted false buttons will be difficult to intercept.
However, if you know where the data is stored by the JavaScript executing on the page, you may be able to access it using webbrowser.Document.InvokeScript(). If the JavaScript on the page stores URLs in a mydata property of the window object, for example, you could try webbrowser.Document.InvokeScript("window.mydata") or some variation to retrieve the value of mydata into the C# app.
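One possible variation, since InvokeScript calls a script function by name rather than evaluating an arbitrary expression, is to route the lookup through eval. The window.mydata property is hypothetical here; substitute wherever the page's script actually stores its data, and note that very old document modes may lack JSON.stringify.

using System.Windows.Forms;

static class DomDataReader
{
    // Ask the page to serialize the (hypothetical) window.mydata object to a string.
    public static string TryReadMyData(WebBrowser webBrowser)
    {
        object result = webBrowser.Document.InvokeScript("eval",
            new object[] { "JSON.stringify(window.mydata)" });
        return result as string;   // null if mydata is undefined or eval is unavailable
    }
}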

webbrowser modified HTML code c#

I have a WebBrowser component and I would like to save the modified HTML code to a file.
I don't know if that's clear, but the browser navigates to a page, receives HTML + JS, and then the JS modifies the HTML; it's that modified HTML I need to save.
I have tried to use DocumentText, but from the result I get, it outputs the original HTML code, not the HTML code modified by JS.
Does anyone know how to solve this problem?
A lot of developer plug-ins (Firebug for Firefox, or the Developer Tools in IE or Chrome) will let you see the updated HTML.
You can use the outerHTML of an element you are interested in (e.g. BODY).
Look at methods of HtmlDocument like http://msdn.microsoft.com/en-us/library/system.windows.forms.htmldocument.getelementsbytagname.aspx and HtmlElement - http://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement.outerhtml.aspx
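For example, something along these lines inside the form that hosts webBrowser1 (ideally from its DocumentCompleted handler, so the scripts have had a chance to run; the file path is a placeholder):

// DocumentText returns the original source, while OuterHtml reflects the DOM
// after the page's JavaScript has modified it.
string modifiedHtml = webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml;
System.IO.File.WriteAllText(@"C:\temp\rendered.html", modifiedHtml);   // placeholder path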

C# AJAX or Java response HTML scraping

Is there a way in C# to get the output of AJAX or Java? What I'm trying to do is grab the specifics of items on a webpage; however, the webpage does not load them into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by javascript through ajax calls, and this modified data is what you are trying to capture then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script, otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
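A sketch of that idea: find the request the page makes (with Firebug or the browser's network tab) and issue it yourself. The endpoint URL, query string, and headers below are hypothetical; mirror whatever the real page actually sends.

using System;
using System.Net;

class AjaxEndpointClient
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Some endpoints only respond to requests that look like they came from the page.
            client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";
            client.Headers["X-Requested-With"] = "XMLHttpRequest";

            string data = client.DownloadString(
                "http://www.example.com/inventory/search?type=used&page=1");   // hypothetical endpoint

            Console.WriteLine(data);   // typically JSON or XML rather than an HTML document
        }
    }
}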
A long time ago I wrote a VB app to screen scrape financial sites, and made it so that you could fire up multiple of these "harvester" screen scrapers, which helped cut down the time spent loading data. We could do thousands of scrapes a day with several of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, such as which customer to get next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentCompleted event, which should only fire when the web page is completely loaded. Then check out this post, which gives an overview of how to do it.
Use the Html Agility Pack. It lets you download the HTML and scrape it via XPath.
See How to use HTML Agility pack
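A minimal Html Agility Pack sketch (the URL and XPath are placeholders; remember that HAP only sees the HTML the server returns, not anything injected later by script):

using System;
using HtmlAgilityPack;

class HapXPathExample
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/listings");   // placeholder URL

        // SelectNodes returns null when nothing matches the XPath expression.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='vehicle-title']");
        if (nodes != null)
            foreach (var node in nodes)
                Console.WriteLine(node.InnerText.Trim());
    }
}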
