I am trying to save a web page (the way browsers do) along with all its content and formatting.
I tried the WebClient and WebRequest examples, but they only download the HTML text and sometimes the JavaScript, not the CSS, images, etc.
Is there an API for this in .NET, or a third-party library for .NET?
I think it must be possible, because many offline-reading applications show saved pages with the same formatting and styling.
How is it done?
Any ideas?
EDIT 1:
Web pages can be parsed and saved using HtmlAgilityPack. But is there any way to separate the main article from other content such as ads and external links? Is there any way to tell relevant content apart from irrelevant content?
(I am sorry if this question is not clear.)
Also, can anyone suggest how offline-reading applications (Read Later, Pocket, etc.) save a web page and format it?
Is there any way to do the same in C#?
You can download the page text as HTML, parse it to find the <link rel="stylesheet" type="text/css" href="..."> and <img src="..."/> elements, and then download the URLs in their href or src attributes separately.
HtmlAgilityPack is a reliable and useful library for parsing HTML.
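The parsing step described above can be sketched as follows. This is only an illustration, written in Python with the standard library so that it runs without extra packages; in C#, HtmlAgilityPack would play the role of the parser. The class name `ResourceCollector`, the function `find_resources`, and the example.com URLs are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collects absolute URLs of stylesheets and images referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel:
            ref = attrs.get("href")
        elif tag == "img":
            ref = attrs.get("src")
        else:
            ref = None
        if ref:
            # Relative references are resolved against the page URL.
            self.resources.append(urljoin(self.base_url, ref))


def find_resources(html, base_url):
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources


# Hypothetical page used only to exercise the parser.
page = """<html><head>
<link rel="stylesheet" type="text/css" href="/css/site.css">
</head><body><img src="images/logo.png"/></body></html>"""

for url in find_resources(page, "http://example.com/blog/post.html"):
    print(url)
```

Each URL in the returned list can then be downloaded separately. Resources referenced from inside CSS files (fonts, background images) or loaded by scripts are not covered by this sketch.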
You can use Wget
https://www.gnu.org/software/wget/manual/html_node/Recursive-Download.html#Recursive-Download
You could look at saving the page as an MHT file.
These files bundle the web page and all of its references into a single compact file (.mht).
Stackoverflow topic about mht via c#
Note: MHT was introduced by Microsoft, and not all browsers support the format. Opera is the only other popular browser that can save pages as MHT. Firefox users can rely on two add-ons for this format, Mozilla Archive Format and UnMHT. Both add-ons can be installed and used to open and save complete web pages.
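For orientation, an MHT file is a MIME multipart/related message in which every part carries a Content-Location header naming the URL it was fetched from. The sketch below assembles such a message with Python's standard email package, purely as an illustration of the file layout. The function name `build_mht`, the example.com URLs, and the placeholder image bytes are assumptions, and whether a particular browser opens the result has not been verified here.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(page_url, html, images):
    """Bundles a page and its images into one multipart/related message.

    `images` maps an absolute image URL to a (subtype, raw bytes) pair.
    """
    archive = MIMEMultipart("related")
    archive["Subject"] = page_url

    # The first part is the HTML document itself.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Every further part is a resource, identified by its original URL.
    for url, (subtype, data) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        archive.attach(part)
    return archive


# Hypothetical inputs; the bytes are a placeholder, not a real PNG.
archive = build_mht(
    "http://example.com/index.html",
    "<html><body><img src='http://example.com/logo.png'></body></html>",
    {"http://example.com/logo.png": ("png", b"placeholder bytes")},
)
mht_text = archive.as_string()  # this text would be written to a .mht file
```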
Related
I want to capture blog posts from several blog sites. I know how to use HttpClient to get the HTML string and then use Html Agility Pack to extract the content under a specific HTML tag. But if you show this HTML string in a WebView, it does not look good on mobile. For example, the CSS styles are not loaded correctly, some code blocks do not wrap automatically, and some pictures do not show (an x is displayed instead).
Some advertisements also show up, which I don't want.
Does anyone know how to do this? Any suggestions would be appreciated.
Try running the HTML string through something like Google Mobilizer. This should produce a more mobile-friendly HTML string, which you can then 'unpack' with the Agility Pack.
Ideally you should capture the HTML page and all its associated resources: CSS files, images, scripts, ...
Then update the HTML content so that the resources are retrieved from your local data storage (for example, relative URLs will no longer work once the HTML page is saved locally).
You may also send your HTTP request with a User-Agent header matching the one used by the Microsoft browser, in order to obtain the corresponding version of the page from the website (if it does some kind of User-Agent sniffing).
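A minimal sketch of these two ideas, in Python with the standard library only. The helper names (`make_request`, `local_name`, `rewrite_references`), the User-Agent string, and the example.com URLs are hypothetical, and the plain string replacement is a simplification; a real implementation would rewrite the parsed document instead.

```python
import hashlib
import posixpath
import urllib.request
from urllib.parse import urljoin, urlparse

# Hypothetical User-Agent string; substitute the one of the browser you target.
USER_AGENT = "Mozilla/5.0 (example; replace with a real browser string)"


def make_request(url):
    """Builds (but does not send) a request carrying the chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def local_name(url):
    """Derives a stable local file name that keeps the original extension."""
    extension = posixpath.splitext(urlparse(url).path)[1]
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:12] + extension


def rewrite_references(html, base_url, refs):
    """Points each quoted reference in `html` at its local copy.

    `refs` are attribute values exactly as written in the page. Returns the
    rewritten HTML and a map from absolute URL to local file name.
    """
    saved = {}
    for ref in refs:
        absolute = urljoin(base_url, ref)
        saved[absolute] = local_name(absolute)
        html = html.replace('"%s"' % ref, '"%s"' % saved[absolute])
    return html, saved


# Hypothetical page fragment used only to exercise the helpers.
rewritten, saved = rewrite_references(
    '<img src="images/logo.png">',
    "http://example.com/blog/",
    ["images/logo.png"],
)
request = make_request("http://example.com/blog/")
```

The `saved` map tells the caller which absolute URL to download into which local file, and the request object can be passed to `urllib.request.urlopen` when the page is actually fetched.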
I want to display the content of a Word file in the browser, the same way a PDF file is displayed. I don't want to use a plug-in, because I would have to install it for every browser. I want a single solution that works in all browsers.
I searched on Google, but every link I found downloads the Word file and opens it directly.
Currently I am using an object tag to display PDF files, but it does not work for Word files. It shows the message: The plug-in is not supported.
Using a browser plug-in (such as the free Word Viewer) is by far the easiest method, and arguably the most correct - however, there are some alternatives if you really don't want to do this:
Convert the Word document to another format (e.g. HTML/PDF) on-the-fly before the response is sent. For Word 97-2003 documents, you can do this with VSTO/Automation. For Word 2007+ documents, you can use the OpenXML SDK (although you will have to write the conversion algorithm yourself).
Use an XSL stylesheet to transform the Word markup (docx) into html/css. You can do this server-side or, potentially, with client-side scripting (JavaScript). Some useful resources here and here.
Great question. In principle, browsers only really tend to support viewing websites (e.g. html). Most, however, also support viewing PDFs, and, as you've correctly identified, you could use plugins to extend the behaviour. Crucially, though, some browsers provide document viewing with a javascript-based viewer.
I wasn't aware of it before you asked, but there are apparently javascript implementations of non-PDF document readers--for example, ViewerJS--that seem to directly support .odt. With a little digging, you might be able to find an implementation/plugin for a javascript viewer that supports .docx. However, I can't recommend one from personal experience at the moment. I would recommend searching for javascript document viewers though.
I have an ASP .NET web page which lists e-mail attachments. These attachments are your typical .docx, .pdf, .jpg, .tiff etc formats.
I'm looking for a solution (perhaps a component?) that will allow me to view the contents of these attachments in a scrollable panel for review by the user.
We have decided against the option of downloading the file and viewing it locally - so that's not an option.
Any ideas will be very helpful.
As I mentioned in the comments, Accusoft's Prizm Content Connect software has an HTML5 viewer for over 300 different formats, but has a heavy price tag on it.
Considering the priority of file formats I need to support, I settled on the solutions below:
PDF - Free; pdfobject provides a lightweight javascript option that embeds into the page.
Image Formats - Free; A simple native image control can be used.
Docx/Doc - License; Aspose.Words provides a component to build and view Word documents in WinForms and ASP .NET
I am developing a desktop application. I would like to grab the HTML source of a remote page, but most of the page is rendered by JavaScript after it loads.
I have been searching for a few days but could not find anything useful. I studied and tried to apply the following suggestions, but I only get the base HTML.
WebBrowser Threads don't seem to be closing
Get the final generated html source using c# or vb.net
View Generated Source (After AJAX/JavaScript) in C#
Is there any way to get the complete generated source, the way Firebug's HTML console shows it?
Thank you in advance.
What are you trying to do? A web browser will do more than just grab the HTML. It needs to be parsed (which will likely download further files) and rendered. You could use the WebKit C# wrapper [http://webkitdotnet.sourceforge.net/] - I have used this previously to get thumbnails of web pages.
Does anyone know of a component (open source or 3rd party) that would allow you to export a fully rendered HTML page to PDF in c#? We have a page that has its DOM modified with jquery but the methods we have tried (ABCpdf.NET, WebClient, etc) don't register any DOM changes from javascript in the PDF. We need to programmatically export that rendered HTML (post-jquery) to PDF on the fly.
ExpertPDF HtmlToPdf Converter v7.0
I was looking for something similar many months ago and, as far as I can remember, it's not possible with any free third-party controls. There are paid ones available. The closest you can get is iTextSharp. It will allow you to export the contents of specific HTML tags and user controls, but it's a bit of a pain to deal with.
I've never tried it, but there's an open source solution called wkhtmltopdf that renders a PDF from HTML/JavaScript/CSS using the WebKit engine. This post talks a little bit about using it. If it works I'd like to know, because I've heard this request a couple of times here.
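For anyone who wants to try wkhtmltopdf from code, it is a command-line tool, so any language that can start a process can drive it (in C#, System.Diagnostics.Process would be the equivalent of the call below). Here is a rough Python sketch; it assumes the wkhtmltopdf executable is installed and on the PATH, and the function names and the two-second delay are arbitrary choices for illustration. The `--javascript-delay` option asks the tool to wait the given number of milliseconds for scripts to finish before rendering, which is relevant when the DOM is modified by jQuery.

```python
import subprocess


def wkhtmltopdf_command(source, output_pdf, js_delay_ms=2000):
    """Builds the command line: wkhtmltopdf [options] <input> <output>."""
    return [
        "wkhtmltopdf",
        "--javascript-delay", str(js_delay_ms),  # wait for scripts to run
        source,       # a URL or a local HTML file
        output_pdf,   # path of the PDF to create
    ]


def render_pdf(source, output_pdf, js_delay_ms=2000):
    """Runs wkhtmltopdf; raises CalledProcessError if the tool reports failure."""
    subprocess.run(wkhtmltopdf_command(source, output_pdf, js_delay_ms), check=True)


# Hypothetical inputs; only the command is built here, nothing is executed.
command = wkhtmltopdf_command("http://example.com/report.html", "report.pdf")
```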