I am trying to obtain html from the WebBrowser control, but it must include the value attributes of input elements on the page as well.
If I use webBrowser.DocumentText, I get the full HTML of the page as it was initially loaded. The input field values are not included.
If I use webBrowser.Document.Body.OuterHtml, I get the values, but not the rest of the document outside the body, which I need so I can get the stylesheet links, etc.
Is there a clean dependable way to obtain the full HTML of the DOM in its current state from the WebBrowser? I am passing the HTML to a library for it to be rendered to PDF, so suggestions for programmatically saving from the WebBrowser control to PDF will also be appreciated.
Thanks
There are some undocumented ways (changing the registry, an undocumented DLL export) to print the document to XPS or PDF printers without parsing the page, provided you can afford to roll out the required printer drivers to your customer's network.
If you want to parse the web page, documentElement.outerHTML should give you the full canonicalized document, but not the linked image, script or stylesheet files. You need to parse the page, enumerate the elements, check the element types and collect the resource URLs before digging through the WinInet cache or downloading the additional resources. To get the documentElement property, cast HtmlDocument.DomDocument to mshtml.IHTMLDocument3 if you use Windows Forms, or cast WebBrowser.Document to mshtml.IHTMLDocument3 if you use WPF. If you need to wait for Ajax code to finish executing, start a timer when the DocumentComplete event is raised.
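To illustrate the "enumerate elements and collect resource URLs" step, here is a minimal sketch in Python rather than .NET, using only the standard library. The names ResourceCollector and collect_resources, and the choice of which tags to inspect, are my own assumptions for illustration. A real page may reference resources in other ways (CSS url(...), srcset, and so on).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that carries the resource URL (an illustrative subset).
RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}


class ResourceCollector(HTMLParser):
    """Collects absolute URLs of images, scripts and stylesheets (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr_name = RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return
        attr_map = dict(attrs)
        # For <link>, keep only stylesheets; skip icons, canonical links, etc.
        if tag == "link":
            rel_values = (attr_map.get("rel") or "").lower().split()
            if "stylesheet" not in rel_values:
                return
        url = attr_map.get(attr_name)
        if url:
            # Resolve relative URLs against the page URL.
            self.resources.append(urljoin(self.base_url, url))


def collect_resources(html, base_url):
    """Return the resource URLs referenced by html, in document order."""
    parser = ResourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources
```

You would feed it the string obtained from documentElement.outerHTML together with the page's URL, then download or look up each returned URL.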
At this stage, I would parse the HTML DOM and get the necessary data in order to generate a report via a template, so you always have the option to generate other formats supported by the report engine, such as Microsoft Word. I very rarely need to render the HTML as parsed, for example when printing a long table without adding a customized header and footer on each page. That said, you can check Convert HTML to PDF in .NET and test which of the suggested components works best with your target web site, if you do not have long tables.
Related
I want to capture blog posts from some blog sites. I know how to use HttpClient to get the HTML string, and then use Html Agility Pack to extract the content under a specific HTML tag. But if you use a WebView to show this HTML string, it does not display well on mobile. For example, the CSS styles are not loaded correctly, some code blocks do not wrap automatically, and some pictures do not show (an x is shown instead).
Some advertisements are also shown, which I don't want.
Does anyone know how to do this? Any suggestions will be appreciated.
Try running the HTML string through something like Google Mobilizer. This should produce a more mobile-friendly HTML string, which you can then 'unpack' with the Agility Pack.
Ideally you should capture the HTML page and all its associated resources: CSS files, images, scripts, ...
Then update the HTML content so that the resources are retrieved from your local data storage (for example, relative URLs will no longer work if you saved the HTML page locally).
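As a rough sketch of that rewriting step (in Python, standard library only): the function name rewrite_to_local and the local_paths mapping are hypothetical, and the regular expression only handles quoted src/href attributes. A proper HTML parser would be more robust, so treat this as an illustration rather than a finished implementation.

```python
import re
from urllib.parse import urljoin

# Matches quoted src="..." or href="..." attributes (but not data-src etc.).
ATTR_PATTERN = re.compile(
    r'(?<![\w-])(?P<attr>src|href)\s*=\s*(?P<quote>["\'])(?P<url>.*?)(?P=quote)',
    re.IGNORECASE,
)


def rewrite_to_local(html, base_url, local_paths):
    """Point saved resources at their local copies.

    local_paths maps an absolute resource URL to the path where that
    resource was stored locally (hypothetical structure). URLs that are
    not in the mapping are left untouched.
    """

    def replace(match):
        absolute = urljoin(base_url, match.group("url"))
        local = local_paths.get(absolute)
        if local is None:
            return match.group(0)
        quote = match.group("quote")
        return f'{match.group("attr")}={quote}{local}{quote}'

    return ATTR_PATTERN.sub(replace, html)
```

Resolving each URL against the original page address before the lookup is what makes relative references such as "logo.png" map to the right saved file.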
You may also send your HTTP request with a User-Agent header that matches the one used by the Microsoft browser, in order to obtain the corresponding version of the page from the website (if it does some kind of User-Agent sniffing).
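For illustration, setting the header looks roughly like this in Python's standard library (the same idea applies to HttpClient). The helper names build_request and fetch_html are my own, and the User-Agent value is something you must supply yourself. I am not assuming any particular browser string here.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent, timeout=10):
    """Download a page as text, sending the chosen User-Agent header."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html with the User-Agent string of the browser whose WebView you render in should return the variant of the page the site serves to that browser, if the site varies its output at all.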
I have an application where I have a PDF to show. I used AxAcroPDFLib, and I can successfully show any PDF in that control. Now I want to get the current page of that PDF. There is no method like getCurrentPage in AxAcroPDFLib.
How can I get the current page number? I searched for it but did not find any solution.
You're not showing the code you currently are using, but your problem likely stems from the fact that you don't realize there is an additional layer present here. The document methods in the Adobe PDF library truly only deal with the PDF file itself and a PDF file doesn't have a current page number.
To display PDF documents, Acrobat uses an AVPageView. The AVPageView is your link for anything that concerns the display of your PDF files.
AVPageView has a method to get the currently visible page:
PDPageNumber AVPageViewGetPageNum(AVPageView pageView)
So from the document, get its page view and work with that to get the page number, zoom factor, display mode and so on.
On the server that my application is being run on, a virtual PDF printer is being installed (I don't know much about this yet, except that it's from Adobe), and my application needs to use this 'printer' to create PDFs from HTML pages (mostly a GridView), and then redirect the user to the URL where the PDF is stored.
I've been looking at the PrintDocument object in System.Drawing.Printing; however, I've read that you can't simply feed it an HTML page. What are my choices? The easiest option would be to 'print' a given HTML page (choosing what and what not to print using CSS), but from what I've read this is fairly difficult, so I'm thinking about somehow constructing whatever object PrintDocument needs programmatically, if that makes sense.
Any ideas on how I should do this?
There are some free/cheap libraries for creating PDFs on the fly. I've used iTextSharp before and it worked pretty well. It takes a bit of time to get up to speed on how it works, but I'd suggest checking it out.
There are also printing services like Neevia DocConverter that will monitor a folder and automatically convert whatever you put in the folder to a PDF, JPG, etc. You can set it up so that if you drop a URL shortcut in the folder, it will render the web page at that URL to PDF. It's a bit more of a pain if you want to do real-time rendering, but it works very well for generating mass reports in batches that you want to post to a website or email later.
Is there a way to write a PDF to a div from a database, i.e. retrieve a Byte[] from the database and Response.BinaryWrite it to a div?
We do a similar thing for images using src = "anotherpage.aspx", where the image is written on anotherpage.
Is it possible with PDF without using IFrame?
If what you're trying to do is show a PDF file inside a DIV, you're going down the wrong path. You either need to:
Convert the PDF to Flash (ala Flash Paper)
or
Convert the PDF to HTML (like Scribd does using HTML 5).
Then you can embed the PDF inside a DIV. But no browser I know of supports directly embedding PDFs.
Otherwise you have to put the PDF in an IFRAME, but how this is shown is PDF plug-in dependent.
No. The reason it works with a src=otherpage.aspx request is that the src attribute results in the user's web browser making a completely separate request for the other resource. You're serving up an additional page to make that happen. Writing a PDF file directly is trying to inject the PDF into the same request as your page - not really "similar" to your img src at all. In fact, what is most similar to the "src=otherpage.aspx" method is the iframe approach that you mentioned.
As a side note, your "AnotherPage.aspx" example should really be changed to "AnotherPage.ashx". Note the letter 'h' in there. That means you're using a handler rather than a page, which will perform better.
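To make the "separate request" point concrete, here is a small sketch using Python's WSGI interface instead of ASP.NET (so it is an analogy, not the .ashx code itself). The page only contains a reference to /document.pdf, and the bytes are served by a second request to that path with its own Content-Type. PDF_BYTES is placeholder data standing in for the Byte[] loaded from the database, and the paths are made up for the example.

```python
# Placeholder bytes standing in for a PDF loaded from a database (hypothetical).
PDF_BYTES = b"%PDF-1.4\n% placeholder content\n"

# The page itself only references the PDF; it does not contain the bytes.
PAGE_HTML = (
    b"<html><body>"
    b'<iframe src="/document.pdf" width="600" height="800"></iframe>'
    b"</body></html>"
)


def app(environ, start_response):
    """Minimal WSGI app: one route for the page, one for the PDF bytes."""
    path = environ.get("PATH_INFO", "/")
    if path == "/document.pdf":
        # Separate request: the response is the binary data, nothing else.
        body, content_type = PDF_BYTES, "application/pdf"
    elif path == "/":
        body, content_type = PAGE_HTML, "text/html; charset=utf-8"
    else:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    start_response(
        "200 OK",
        [("Content-Type", content_type), ("Content-Length", str(len(body)))],
    )
    return [body]
```

To try it locally, you can serve app with wsgiref.simple_server.make_server from the standard library. The browser loads "/" first and then issues its own request for "/document.pdf", which is the same pattern as an img src or iframe src pointing at a handler.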
I am trying to implement a feature where I open (say, in an iframe) a multi-page PDF file, highlight a section of the document and get the page number (the one that is displayed in the PDF toolbar).
E.g. if the toolbar displays 2/7, which means I am currently on page 2, I need to capture that page number. It sounds simple, but I have not been able to find a .dll/function that exposes this property.
Any help would be appreciated. Thanks.
I wouldn't think this would be possible, there's no way to control PDFs with JavaScript in the browser, which is what you'd need to do.
This article suggests the same: http://codingforums.com/showthread.php?t=43436.
Content of link:
in short, no, you can't do that.
really don't think JS can read properties of PDFs, since PDFs are viewed in the browser thru a plugin, ie a viewport for another application (for want of a better explanation).
You may be better trying a different route, such as generating the pages as images and implementing your own paging. Depends on your content and requirements, of course. ABCPDF from http://www.websupergoo.com/ is free (with a link-back), not sure if that's any help for you.