How to show an appropriate page on mobile - C#

I want to capture blog posts from some blog sites. I know how to use HttpClient to get the HTML string, and then use the Html Agility Pack to extract the content under a specific HTML tag. But if you show this HTML string in a WebView, the result looks bad on mobile: CSS styles are not loaded correctly, some code blocks do not wrap automatically, and some pictures do not display (an "x" placeholder is shown instead).
Some advertisements also show up, and I don't want them.
Does anyone know how to handle this? Any suggestions would be appreciated.

Try running the HTML string through something like Google Mobilizer. This should produce a more mobile-friendly HTML string, which you can then 'unpack' with the Agility Pack.

Ideally you should capture the HTML page and all of its associated resources: CSS files, images, scripts, ...
Then update the HTML content so that the resources are retrieved from your local data storage (for example, relative URLs will no longer work once you have saved the HTML page locally).
You may also send your HTTP request with a User-Agent header that matches the one used by the Microsoft mobile browser, in order to obtain the corresponding version of the site (if it does some kind of User-Agent sniffing).
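For example, a minimal sketch of the User-Agent idea with HttpClient (the URL and the exact User-Agent string are placeholders; substitute whatever your target mobile browser really sends):

using System;
using System.Net.Http;
using System.Threading.Tasks;

class MobileFetcher
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Hypothetical mobile User-Agent string; replace it with the one your target browser sends.
            client.DefaultRequestHeaders.TryAddWithoutValidation(
                "User-Agent",
                "Mozilla/5.0 (Windows Phone 10.0; Microsoft; Lumia 950) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 Edge/13.10586");

            // Placeholder URL; the server may return a mobile-optimized page for this User-Agent.
            string html = await client.GetStringAsync("https://example.com/blog-post");
            Console.WriteLine(html.Length);
        }
    }
}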

Related

Read something from a website in code-behind

Hello, I want to ask something... Is there a way to read some information, from code-behind, from a website that I do not own?
For example, I want to read the title of every page on some website. Can I do it, and how?
This is not a way of hacking; I just want to read the plain text, not the HTML code.
I don't know what to do or how to do it, so I need ideas.
Also, is there a way to search for a specific word across several websites, and an API I can use to search a website?
You still have to read the HTML, since that's how the title is transmitted.
Use the HttpWebRequest class to make a request to the web server, the HttpWebResponse class to get the response back, and its GetResponseStream() method to read the response. Then you need to parse it in some way.
Look at the Html Agility Pack in order to parse the HTML. You can use it to get the title element out of the HTML and read it. You can then get all the anchor elements within the page and determine which ones on the same site you want to visit next to scan their titles.
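A minimal sketch of that approach (the URL is a placeholder, and the Html Agility Pack is assumed to be installed from NuGet):

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class TitleReader
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/"); // placeholder URL
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(reader.ReadToEnd());

            // Read the <title> element.
            var title = doc.DocumentNode.SelectSingleNode("//title");
            Console.WriteLine(title?.InnerText);

            // Enumerate anchors to decide which pages to visit next.
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors != null)
                foreach (var a in anchors)
                    Console.WriteLine(a.GetAttributeValue("href", ""));
        }
    }
}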
There is a powerful HTML parser available for .NET that you can use with XPath to read HTML pages:
HTML Agility Pack
Or you can use the built-in WebClient class to get the page's data as a string and then do string manipulation.
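A short sketch of the WebClient route (the URL is a placeholder); for something as simple as the title, crude string manipulation is enough:

using System;
using System.Net;

class WebClientExample
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString("https://example.com/"); // placeholder URL

            // Crude string manipulation: locate the <title> element by hand.
            int start = html.IndexOf("<title>", StringComparison.OrdinalIgnoreCase);
            int end = html.IndexOf("</title>", StringComparison.OrdinalIgnoreCase);
            if (start >= 0 && end > start)
                Console.WriteLine(html.Substring(start + 7, end - start - 7));
        }
    }
}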

Getting the final rendered HTML source

I am developing a desktop application, and I would like to grab the HTML source of a remote page. But the remote page is largely rendered by JavaScript after the page loads.
I've been searching for a few days but could not find anything useful. I've studied and tried to apply the following suggestions, but I can only get the base HTML:
WebBrowser Threads don't seem to be closing
Get the final generated html source using c# or vb.net
View Generated Source (After AJAX/JavaScript) in C#
Is there any way to get all the data, the way Firebug's HTML console does?
Thank you in advance.
What are you trying to do? A web browser does more than just grab the HTML: the page needs to be parsed (which will likely download further files) and rendered. You could use the WebKit C# wrapper [http://webkitdotnet.sourceforge.net/]; I have used this previously to get thumbnails of web pages.
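If you'd rather stay with the stock WinForms WebBrowser control, a common workaround (a sketch, with a placeholder URL, and not guaranteed for every site) is to wait for DocumentCompleted plus a grace period for the scripts to run, then read the live DOM instead of DocumentText:

using System;
using System.Windows.Forms;

class RenderedSourceForm : Form
{
    readonly WebBrowser browser = new WebBrowser { ScriptErrorsSuppressed = true, Dock = DockStyle.Fill };
    readonly Timer settleTimer = new Timer { Interval = 3000 }; // grace period for JS; tune per site

    public RenderedSourceForm()
    {
        browser.DocumentCompleted += (s, e) => settleTimer.Start();
        settleTimer.Tick += (s, e) =>
        {
            settleTimer.Stop();
            // The live DOM, after the scripts have (hopefully) finished.
            Console.WriteLine(browser.Document.Body.OuterHtml);
        };
        Controls.Add(browser);
        browser.Navigate("https://example.com/"); // placeholder URL
    }

    [STAThread]
    static void Main() => Application.Run(new RenderedSourceForm());
}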

How to process an HTML email for previewing on a web page?

I'm working on a testing tool in ASP.NET MVC. One of its features is previewing HTML emails.
However, these emails are going to contain things like doctypes and CSS. What are my options for displaying these emails properly without breaking the HTML of my own page?
There is no need to keep the formatting and CSS, just the text and the links that come with it. Any ideas?
You could load the whole content into an iframe:
<div id="preview">
<!-- YOUR ACTUAL PREVIEW SITE -->
<iframe src="/path/to/newsletter"></iframe>
<!-- YOUR ACTUAL PREVIEW SITE -->
</div>
Assuming these are your own HTML emails, so you don't need to worry about security, loading them in an iframe will give you the closest rendering.
Note that an iframe will not strip out external content the way most e-mail clients do, so while the preview will be very close to your HTML, it may not reflect exactly what users will see.
Other options:
if you just care about the text/links: parse the HTML with the Html Agility Pack and extract the text and links, then show them however you see fit (see the sketch after this list)
if you care about security issues and the look of the mail: search for libraries that filter out "unsafe" HTML (external links, scripts, ...), or use the Html Agility Pack and filter out everything except content you consider absolutely safe.
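A small sketch of the text/links route (emailHtml is a placeholder for the raw email body; the Html Agility Pack from NuGet is assumed):

using System;
using HtmlAgilityPack;

class EmailPreview
{
    static void Main()
    {
        // Placeholder email body.
        string emailHtml = "<html><body><p>Hi!</p><a href='https://example.com'>Shop</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(emailHtml);

        // Plain text of the whole email, tags stripped.
        Console.WriteLine(doc.DocumentNode.InnerText);

        // Every link, with its target.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
            foreach (var a in anchors)
                Console.WriteLine(a.InnerText + " -> " + a.GetAttributeValue("href", ""));
    }
}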

Saving a Web Page with all its content in C#

I am trying to save a web page (just like we do in browsers) along with all of its content and formatting. I tried the WebClient and WebRequest examples, but they only download the text part and sometimes the JavaScript, not the CSS, images, etc.
Is there any API for this in .NET, or any third-party API for .NET?
It must be possible, because a lot of applications exist for offline reading, and they show the saved pages with the same formatting and styling.
How is it done? Any ideas?
EDIT 1:
Web pages can be parsed and saved using HtmlAgilityPack. But is there any way to separate the main article from the other content, like ads and external links? Is there any way to differentiate between the content that is relevant and the content that is not?
(I am sorry if this question is not clear.)
Also, can anyone suggest how these offline-reading applications (like Read Later/Pocket etc.) save a web page and format it? Is there any way to do the same in C#?
You can download the page's text as HTML, then parse it, find elements such as <link rel="stylesheet" type="text/css" href="..."> or <img src="..."/>, and download the targets of attributes like href or src separately.
HtmlAgilityPack is a reliable and useful library for parsing HTML.
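A sketch of that approach (the page URL and output folder are placeholders; HtmlAgilityPack from NuGet is assumed, and error handling is omitted):

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class PageSaver
{
    static void Main()
    {
        var baseUri = new Uri("https://example.com/article.html"); // placeholder URL
        string outDir = "saved_page";                               // placeholder folder
        Directory.CreateDirectory(outDir);

        using (var client = new WebClient())
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(client.DownloadString(baseUri));

            // Stylesheets: <link rel="stylesheet" href="...">; images: <img src="...">
            Save(client, baseUri, outDir, doc.DocumentNode.SelectNodes("//link[@rel='stylesheet'][@href]"), "href");
            Save(client, baseUri, outDir, doc.DocumentNode.SelectNodes("//img[@src]"), "src");
        }
    }

    static void Save(WebClient client, Uri baseUri, string outDir, HtmlNodeCollection nodes, string attr)
    {
        if (nodes == null) return; // SelectNodes returns null when nothing matches
        foreach (var node in nodes)
        {
            // Resolve relative URLs against the page URL, then download a local copy.
            var url = new Uri(baseUri, node.GetAttributeValue(attr, ""));
            client.DownloadFile(url, Path.Combine(outDir, Path.GetFileName(url.LocalPath)));
        }
    }
}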
You can use Wget
https://www.gnu.org/software/wget/manual/html_node/Recursive-Download.html#Recursive-Download
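For example, the following invocation (real wget options; the URL is a placeholder) downloads the page together with the images and stylesheets it needs and rewrites the links so the local copy works offline; add --recursive --level=1 to also fetch the pages it links to:
wget --page-requisites --convert-links --adjust-extension https://example.com/article.html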
You could have a look at saving the page as an MHT file.
These files bundle the web page and all of its references into a single compact file (.mht):
Stack Overflow topic about MHT via C#
Note: MHT was introduced by Microsoft, and not all browsers support the format. Opera is the only other popular browser that can save MHT. Firefox users, though, can call upon two add-ons to handle this file standard, Mozilla Archive Format and UnMHT. Both of these add-ons can be installed and used to open and save complete web pages.

Windows Forms WebBrowser control: DocumentText vs Document.Body.OuterHtml

I am trying to obtain the HTML from the WebBrowser control, and it must include the value attributes of the input elements on the page as well.
If I use webBrowser.DocumentText, I get the full HTML of the page as it was initially loaded; the input field values are not included.
If I use webBrowser.Document.Body.OuterHtml, I get the values, but not the rest of the document outside <body> (the <head> contents, for example), which I need so I can get the stylesheet links, etc.
Is there a clean, dependable way to obtain the full HTML of the DOM in its current state from the WebBrowser? I am passing the HTML to a library to be rendered to PDF, so suggestions for programmatically saving from the WebBrowser control to PDF would also be appreciated.
Thanks
There are some undocumented ways (changing the registry, an undocumented dll export) to print the document to XPS or PDF printers without parsing the page, that is, if you can afford to roll out the required printer drivers to your customers' network.
If you want to parse the web page, documentElement.outerHTML should give you the full canonicalized document, but not the linked image, script, or stylesheet files. You need to parse the page, enumerate the elements, check the element types, and collect the resource URLs before digging into the WinInet cache or downloading the additional resources. To get the documentElement property, cast HtmlDocument.DomDocument to mshtml.IHTMLDocument3 (the interface that exposes documentElement) if you use Windows Forms, or cast WebBrowser.Document to mshtml.IHTMLDocument3 if you use WPF. If you need to wait for the Ajax code to finish executing, start a timer when the DocumentComplete event is raised.
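A small sketch of the Windows Forms variant (assuming a WebBrowser control named webBrowser whose document has finished loading; dynamic is used here so the snippet works without referencing the mshtml interop assembly):

using System.Windows.Forms;

static class DomSnapshot
{
    // Returns the whole DOM in its current state, including <head> and live input values.
    public static string GetFullHtml(WebBrowser webBrowser)
    {
        // DomDocument is the underlying COM document; documentElement.outerHTML
        // serializes the live DOM (the late-bound equivalent of the mshtml.IHTMLDocument3 cast).
        dynamic dom = webBrowser.Document.DomDocument;
        return (string)dom.documentElement.outerHTML;
    }
}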
At this stage, I would parse the HTML DOM and extract the necessary data in order to generate a report via a template, so you always have the option of generating other formats supported by the report engine, such as Microsoft Word. Very rarely do I need to render the HTML exactly as parsed, for example when printing a long table without adding a customized header and footer on each page. That said, you can check Convert HTML to PDF in .NET and test which of the suggested software/components works best with your target web site, if you do not have long tables.
