I have a WebBrowser component and I would like to save the modified HTML code to a file.
To clarify: the browser navigates to a page, receives HTML + JS, and the JS then modifies the HTML. Now I need to save that modified HTML code.
I have tried using DocumentText, but the result I get is the original HTML code, not the HTML code modified by JS.
Does anyone know how to solve this problem?
A lot of developer plug-ins (Firebug for Firefox, or the Developer Tools in IE or Chrome) will allow you to see the updated HTML.
In code, you can use the outerHTML of the element you are interested in (e.g. BODY).
Look at the methods of HtmlDocument, such as GetElementsByTagName (http://msdn.microsoft.com/en-us/library/system.windows.forms.htmldocument.getelementsbytagname.aspx), and at HtmlElement.OuterHtml (http://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement.outerhtml.aspx).
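A minimal sketch of that approach (assuming a Windows Forms app with a WebBrowser control named webBrowser1; the output path is just an example):

private void SaveRenderedHtml()
{
    // DocumentText only returns the HTML as it was originally downloaded.
    // The live DOM (after any JavaScript has run) is exposed through Document,
    // so read the outerHTML of the root element instead.
    HtmlElement root = webBrowser1.Document.GetElementsByTagName("HTML")[0];
    System.IO.File.WriteAllText(@"C:\temp\rendered.html", root.OuterHtml);
}

Call it from the DocumentCompleted event so the page's scripts have had a chance to run before you read the DOM.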
I am scraping a certain web page using HAP, and I want to access the submit button on the page. The problem is I don't know how that could be done in HAP and C#. Is there a way to do this?
The HTML Agility Pack is not a browser, so while it can parse an HTML file, there is no way to really interact with it. You can find the submit object, read its properties and so forth, but you can't make it do anything.
You have two options:
Either read the form, build an HTTP request that matches the form's fields and post method, and send it to the server. This is all manual work; the Agility Pack only helps you list the fields on the form and their properties (see the sketch after this list).
Or, if you need to interact with the page, you'll need a browser. There are headless browsers, like PhantomJS, that will actually load the page, parse the JavaScript and run what's sent by the server. There are wrappers around such browsers for C#; one example is Awesomium. It's similar to the HTML Agility Pack in that it lets you parse HTML documents, but it takes things one step further, actually running the page without ever showing a browser window.
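A rough sketch of the first option: read the form's fields with the Agility Pack, then post them yourself. The URL, the form selector and the field names filled in below are assumptions for illustration only.

using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class FormPoster
{
    static async Task Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/search");   // hypothetical page

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form");
        string action = form.GetAttributeValue("action", "");

        // List the form's inputs and their current values (assumes unique input names).
        var fields = form.SelectNodes(".//input")
                         .ToDictionary(i => i.GetAttributeValue("name", ""),
                                       i => i.GetAttributeValue("value", ""));

        fields["q"] = "hello";   // hypothetical field you want to "submit"

        using (var client = new HttpClient())
        {
            var response = await client.PostAsync(
                new Uri(new Uri("http://example.com/"), action),
                new FormUrlEncodedContent(fields));
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}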
I am attempting to get the resulting web page content so I can extract the display text. I have attempted the code below, but it gets me the source HTML and not the resulting HTML.
string urlPath = "http://www.cbsnews.com/news/jamar-clark-protests-follow-decision-not-to-file-charges-in-minneapolis-police-shooting/";
WebClient client = new WebClient();
string str = client.DownloadString(urlPath);
Compare the text in the str variable with the HTML shown in Chrome's Developer Tools and you will see different results.
Any recommendations will be appreciated.
I'm assuming you mean that you want the article text. If so, you will need to follow a different course of action. The page you refer to is loaded with client-side script that injects a great deal of content into the base HTML document. You will need to parse the DOM after that script has executed to get the content you're interested in.
As others have pointed out, an actual web browser will parse the downloaded HTML and execute javascript against it, potentially altering its content. While you could try to do that parsing yourself, the easiest route is to ask a real web browser to do it for you and then grab the results.
The easiest solution specifically in C# would be to use the WebBrowser Control from Windows Forms, which essentially exposes IE to your program, allowing you to control it. Use the Navigate method to load the URL in question, then use the Document property to navigate the DOM. You can, at that point, get the outerHTML to get the final content of the DOM as HTML.
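A rough sketch of that flow (it has to run inside a Windows Forms message loop; the URL is the one from the question, and the output paths are just examples):

var browser = new System.Windows.Forms.WebBrowser();
browser.ScriptErrorsSuppressed = true;
browser.DocumentCompleted += (s, e) =>
{
    // By now the DOM reflects whatever the page's scripts have built
    // (late-running AJAX may still need an extra delay).
    string renderedHtml = browser.Document.GetElementsByTagName("HTML")[0].OuterHtml;
    string displayText = browser.Document.Body.InnerText;
    System.IO.File.WriteAllText(@"C:\temp\page.html", renderedHtml);
    System.IO.File.WriteAllText(@"C:\temp\page.txt", displayText);
};
browser.Navigate("http://www.cbsnews.com/news/jamar-clark-protests-follow-decision-not-to-file-charges-in-minneapolis-police-shooting/");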
If you're not writing a Windows program and are interested more in headless operation, have a look at PhantomJS. It's a headless Webkit browser that is scriptable from javascript and would give you similar capability, although not in C#.
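If C# still has to drive the process, one hedged option is to shell out to PhantomJS and capture its output: write a tiny PhantomJS script to disk, run phantomjs as a child process, and read the rendered HTML from its standard output. This assumes phantomjs is on the PATH; the URL is just an example.

using System;
using System.Diagnostics;
using System.IO;

class PhantomDump
{
    static void Main()
    {
        // PhantomJS script: open the page, give its scripts a moment to run,
        // then print the rendered DOM to stdout.
        const string script = @"
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function () {
    window.setTimeout(function () {
        console.log(page.content);
        phantom.exit();
    }, 2000);
});";
        string scriptPath = Path.Combine(Path.GetTempPath(), "dump.js");
        File.WriteAllText(scriptPath, script);

        var psi = new ProcessStartInfo("phantomjs",
            "\"" + scriptPath + "\" http://example.com/")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (Process phantom = Process.Start(psi))
        {
            string renderedHtml = phantom.StandardOutput.ReadToEnd();
            phantom.WaitForExit();
            Console.WriteLine(renderedHtml);
        }
    }
}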
I'm trying to parse a web page using the Html Agility Pack. What I have understood from my attempts is that the web page is "populated" using JavaScript. When I load the page using
HtmlDocument doc = web.Load(linkToPage);
I get an empty page. The page is a sub page, so to speak, and I'm using the original page to scrape the links to these sub pages (it works for the main page, since that one does not use JavaScript to populate the page, I assume).
Is there a way to parse a web page that is populated through JavaScript, or is there a better tool for this?
See this if you wish to use Java; I worked with FTL and also JsRender, and both were pretty cool.
How can I force HtmlAgilityPack to use Chrome's interpretation of something in XPath?
For example, these two XPath expressions point to the exact same element on the web page, yet they are completely different.
for Chrome:
/html/body[@class=' hasGoogleVoiceExt']/div[@class='fjfe-bodywrapper']/div[@id='fjfe-real-body']/div[@id='fjfe-click-wrapper']/div[@id='appbar']/div[@class='elastic']/div[@class='appbar-center']/div[@class='appbar-snippet-primary']/span
for Firefox:
//*[@id='appbar']/div/div[2]/div[1]/span
I would like to use Chrome's version; however, I receive null for both queries.
The Html Agility Pack has no dependency on any browser whatsoever. It uses the .NET XPath implementation, and you can't change this unless you rewrite it completely.
The HTML you see in a browser can be very different from the HTML you download for a URL, as the former may have been modified by dynamic code (JavaScript, DHTML).
If you have an existing HTML file or URL, we could help you more.
Here is what I found using an XPath copied from Chrome: I had to remove all of the tbody elements and double up the forward slashes, and then the following code would return the proper element.
doc.DocumentNode.SelectSingleNode(
"//html//body//center//table[3]//tr//td//table//tr//td//table//tr//td//table[3]//tr[3]//td[3]//table//tr//td//table");
I have the following code in my C# Windows app, which places the data from my WebBrowser control onto the clipboard. However, when I come to paste this into MS Word, it pastes the HTML markup rather than the contents of the page.
Clipboard.SetDataObject(WebBrowser.DocumentText, true);
Any idea how I can get around this?
OK this feels like a dirty hack, but it solves my problem:
WebBrowser1.Document.ExecCommand("SelectAll", false, null);
WebBrowser1.Document.ExecCommand("Copy", false, null);`
Another option would be to capture an image of the page, rather than the HTML, and paste that into the document. I don't think the WebBrowser control can handle this, but WatiN (http://watin.sourceforge.net/) can. WatiN's CaptureWebPageToFile() function works well for this. I have had to use this instead of capturing HTML because Outlook cannot format HTML well at all.
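A hedged sketch of that approach with WatiN (assumes the WatiN package is referenced and IE is installed; the URL and file path are just examples):

using (var ie = new WatiN.Core.IE("http://www.example.com/"))
{
    ie.WaitForComplete();
    // Renders the whole page to an image file that can then be pasted or embedded.
    ie.CaptureWebPageToFile(@"C:\temp\page.jpg");
}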
string allText = WebBrowser1.DocumentText;
will return all of the currently loaded document markup. Is that what you are looking for?
I guess that happens because what the WebBrowser actually contains is the markup, not the images etc.
You might be best off using the WebBrowser to save the full page to disk, and then using Word to open that. That way it'll all be available locally for IE to use. It just means you have to clean up afterwards, though.
The link below has some information about saving pages using the WebBrowser in C#:
http://www.c-sharpcorner.com/UploadFile/mahesh/WebBrowserInCS12072005232330PM/WebBrowserInCS.aspx
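If that approach fits, a minimal sketch (assuming a WebBrowser control named webBrowser1 that has finished loading; the saved path passed to Word below is purely a hypothetical example):

webBrowser1.ShowSaveAsDialog();   // pick "Webpage, complete" so images and styles are saved too
// Then open the saved copy in Word (hypothetical path chosen in the dialog above).
System.Diagnostics.Process.Start("winword.exe", @"""C:\temp\saved-page.htm""");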