Parse HTML Page, include all css styles - c#

I want to send a complete HTML page as an email and include all the CSS styles in the email. Is there a library that produces a single HTML page with all the CSS styles correctly included? (Consider that CSS files can import other CSS files, which also have to be opened and included.)
Any help is appreciated.

The closest I came to this was sending a control without includes: I built it as a server control that read the CSS files (and JS files) and wrote them out. For an entire page, however, you might have more difficulty.
I do not believe there is anything that does this for you. If you can read the entire page code, find-and-replace may be your easiest answer: find each CSS tag and replace its contents with the contents of the file the tag references.
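A minimal sketch of that find-and-replace idea, assuming the page markup is already in a string and that the caller supplies a loader which maps an href to the stylesheet text. The class name CssInliner, the method names and the loader delegate are hypothetical, and regular expressions are only a rough stand-in for a real HTML parser. It also follows simple @import rules, as the question asks:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CssInliner
{
    // Any <link ...> tag; attribute order may vary, so rel and href are checked separately.
    private static readonly Regex LinkTag = new Regex(
        @"<link\b[^>]*>", RegexOptions.IgnoreCase);

    private static readonly Regex RelStylesheet = new Regex(
        @"\brel\s*=\s*[""']stylesheet[""']", RegexOptions.IgnoreCase);

    private static readonly Regex Href = new Regex(
        @"\bhref\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase);

    // Simple forms only: @import "x.css"; and @import url(x.css);
    // Imports that carry a media query are left untouched.
    private static readonly Regex Import = new Regex(
        @"@import\s+(?:url\(\s*)?[""']?([^""')\s;]+)[""']?\s*\)?\s*;",
        RegexOptions.IgnoreCase);

    // Replaces every stylesheet <link> with a <style> block holding the file contents.
    // loadCss is an assumption: it must resolve the path as written and return the CSS text.
    public static string InlineStylesheets(string html, Func<string, string> loadCss)
    {
        return LinkTag.Replace(html, m =>
        {
            if (!RelStylesheet.IsMatch(m.Value)) return m.Value;
            Match href = Href.Match(m.Value);
            if (!href.Success) return m.Value;
            string css = ResolveImports(loadCss(href.Groups[1].Value), loadCss, 0);
            return "<style type=\"text/css\">\n" + css + "\n</style>";
        });
    }

    // Recursively replaces @import rules with the imported file contents.
    private static string ResolveImports(string css, Func<string, string> loadCss, int depth)
    {
        if (depth > 10) return css; // crude guard against circular imports
        return Import.Replace(css,
            m => ResolveImports(loadCss(m.Groups[1].Value), loadCss, depth + 1));
    }
}
```

In practice the loader would probably read from disk or download the file, and it would have to resolve relative paths itself, since an imported path is relative to the stylesheet that imports it. Note that many mail clients ignore style blocks, so styles moved onto the elements themselves may still be needed.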

You could try using this: http://martinnormark.com/move-css-inline-premailer-net
I'm in the process of testing it myself in order to generate some Word documents with inline styles. I will add the results later.
Update:
It works, although Word 2010 applies some of the inline styles as it likes. I didn't try it with previous versions of Word.

Related

Find all labels and literals without meta:resourcekey tag in aspx page

I'm planning to write a console application that finds all aspx pages under a given main directory. It will then open each page with a FileStream and read it. If it finds any Label or Literal without a meta:resourcekey tag in a page, it will add that control's ID to an Excel file.
First of all, is there any open source tool that does something like this? I could not find one on Google. I just need some advice on parsing the aspx format in a page. Can someone give me advice on this, please? Or maybe some regular expressions can do this for me?
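A rough sketch of the regular-expression route the question asks about, assuming each page's markup has already been read into a string. The class and method names are hypothetical. It looks only at the opening tag of each asp:Label or asp:Literal, so a tag whose attributes contain a ">" character (for example a data-binding expression) would be cut short:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ResourceKeyScanner
{
    // Opening tag of an asp:Label or asp:Literal control.
    private static readonly Regex ControlTag = new Regex(
        @"<asp:(Label|Literal)\b[^>]*>", RegexOptions.IgnoreCase);

    private static readonly Regex ResourceKey = new Regex(
        @"\bmeta:resourcekey\s*=", RegexOptions.IgnoreCase);

    private static readonly Regex IdAttribute = new Regex(
        @"\bID\s*=\s*[""']([^""']*)[""']", RegexOptions.IgnoreCase);

    // Returns the IDs of all Label/Literal controls that lack meta:resourcekey.
    public static List<string> FindControlsWithoutResourceKey(string aspxMarkup)
    {
        var ids = new List<string>();
        foreach (Match tag in ControlTag.Matches(aspxMarkup))
        {
            if (ResourceKey.IsMatch(tag.Value)) continue; // already localized
            Match id = IdAttribute.Match(tag.Value);
            ids.Add(id.Success ? id.Groups[1].Value : "(no ID)");
        }
        return ids;
    }
}
```

The console application could call this for every file returned by Directory.EnumerateFiles with the "*.aspx" pattern and SearchOption.AllDirectories. Writing the IDs to Excel is left out here.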

create table of contents with Pechkin html to pdf

I'm currently working with the Pechkin library for creating pdf-files based on html.
It all works great.
But I want to add one thing, a table of contents (TOC). But I can't get this working.
With only wkhtmltopdf it's easy to do:
wkhtmltopdf toc --xsl-style-sheet toc.xsl index.html index.pdf
But with Pechkin it won't work. I already have a bookmark (which works in Adobe Reader), but that is not the real TOC I want.
I've tried to add
ObjectConfig().SetTocXsl("tocXslStyleSheetUri")
But it seems to have no effect.
I also tried to work with:
ObjectConfig().SetCreateToc(true);
This will create an empty pdf because this function is obsolete.
So I get a nice PDF result, but without a table of contents. Does anyone know how I can get the TOC to appear in my PDF file?
I also raised this question as an issue on GitHub, but because they are not always quick to react there, or don't react at all, I am asking it here as well.

What if you create the table of contents on a separate page and give its URL to the ObjectConfig object? I can show you the code if required.

How to include all resources into one html file?

Is there any C# library or free tool that can convert an HTML file with many referenced resources into one "all-in-one" HTML file?
The main task is to end up with only one file, which means I need to include:
JavaScript external files - this will probably mean replacing all 'script' tags that have a 'src' attribute with 'script' tags whose content is read from the referenced file.
Images - replace src="picture.png" with a data URI, something like src="data:image/png;base64,encodedContent..."
CSS files
Maybe I forgot something :)
This HTML file must be readable in all browsers, that's why I cannot use MHT file format (unreadable on Safari, iPad...)

You can use HTML Agility Pack to read and write the HTML document. HTML Agility Pack supports XPath, so you can get a list of the nodes you want to modify.
Using this, changing the attribute value of image tags should be easy. You can also get a list of external js references, read them and then update the script tag accordingly.
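As an illustration of the image step, here is a sketch that rewrites img src values to data URIs. To stay dependency-free it uses Regex rather than HTML Agility Pack, so it is a substitute for the approach above, not an example of that library. The class name, the loader delegate and the extension-based MIME guess are all assumptions:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImageInliner
{
    // Group 1: everything up to the opening quote of src; group 2: the URL; group 3: the closing quote.
    // \s before src keeps attributes such as data-src from matching.
    private static readonly Regex ImgSrc = new Regex(
        @"(<img\b[^>]*?\ssrc\s*=\s*[""'])([^""']+)([""'])", RegexOptions.IgnoreCase);

    // Builds a base64 data URI for the given bytes.
    public static string ToDataUri(byte[] content, string mimeType)
    {
        return "data:" + mimeType + ";base64," + Convert.ToBase64String(content);
    }

    // Replaces each img src with a data URI; loadBytes is assumed to resolve the path as written.
    public static string InlineImages(string html, Func<string, byte[]> loadBytes)
    {
        return ImgSrc.Replace(html, m =>
        {
            string src = m.Groups[2].Value;
            if (src.StartsWith("data:", StringComparison.OrdinalIgnoreCase)) return m.Value;
            string dataUri = ToDataUri(loadBytes(src), GuessMimeType(src));
            return m.Groups[1].Value + dataUri + m.Groups[3].Value;
        });
    }

    // Very small extension-to-MIME table; extend as needed.
    private static string GuessMimeType(string path)
    {
        string p = path.ToLowerInvariant();
        if (p.EndsWith(".png")) return "image/png";
        if (p.EndsWith(".jpg") || p.EndsWith(".jpeg")) return "image/jpeg";
        if (p.EndsWith(".gif")) return "image/gif";
        if (p.EndsWith(".svg")) return "image/svg+xml";
        return "application/octet-stream";
    }
}
```

With HTML Agility Pack the same replacement would be applied to the src attribute of each node selected by XPath instead of to regex matches. External script files can be handled in a similar way by replacing the tag with one that carries the file's text.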

Windows Forms WebBrowser control: DocumentText vs Document.Body.OuterHtml

I am trying to obtain html from the WebBrowser control, but it must include the value attributes of input elements on the page as well.
If I use webBrowser.DocumentText, I get the full HTML of the page as it was initially loaded. The input field values are not included.
If I use webBrowser.Document.Body.OuterHtml, I get the values, but not the rest of the document outside the body (the head, for example), which I need so I can get the stylesheet links, etc.
Is there a clean dependable way to obtain the full HTML of the DOM in its current state from the WebBrowser? I am passing the HTML to a library for it to be rendered to PDF, so suggestions for programmatically saving from the WebBrowser control to PDF will also be appreciated.
Thanks

There are some undocumented ways (changing the registry, an undocumented DLL export) to print the document to XPS or PDF printers without parsing the page, that is, if you can afford to roll out the required printer drivers to your customer's network.
If you want to parse the web page, documentElement.outerHTML should give you the full canonicalized document, but not the linked image, script or stylesheet files. You need to parse the page, enumerate the elements, check the element types and get the resource URLs before digging through the WinInet cache or downloading the additional resources. To get the documentElement property, cast HtmlDocument.DomDocument to mshtml.IHTMLDocument3 if you use Windows Forms, or cast WebBrowser.Document to mshtml.IHTMLDocument3 if you use WPF. If you need to wait until the Ajax code finishes executing, start a timer when the DocumentComplete event is raised.
At this stage, I would parse the HTML DOM and get the necessary data in order to generate a report via a template, so you always have the option to generate other formats supported by the report engine, such as Microsoft Word. Very rarely I need to render the HTML as parsed, for example, printing a long table without adding customized header and footer on each page. That said, you can check Convert HTML to PDF in .NET and test which one of the suggested software/components works best with your target web site, if you do not have long tables.

How to create a word document using html written in C#

I am creating a C# application that has to create a Word document.
I'm using Microsoft.Office.Interop.Word to do this and I've successfully managed to output some Word documents, but creating the content through code is very time-consuming work.
I noticed that Word is able to open HTML pages and show them as normal content, so I created a simple test table in HTML and inserted it into the Word document. But when I output the document the obvious happened: the tags were still there! Word did not interpret the tags as HTML. It just output exactly what I put in there.
How can I tell Word to reformat the text as HTML?
edit: (through the C# code of course)
edit 2: Please note that I'm parsing through some data to make this, so I will end up with about 4 pages of the same table/HTML, and I will need to be able to tell Word to start on the next page each time I've finished a loop. So an HTML-only method will probably not work.

If you're only wanting to output simple HTML content as a Word document, you could always cheat and write out the HTML content with a .doc extension.
Word will open that just fine.
If you need to add a page break, you can use a CSS page-break-before, like so:
<br style="page-break-before: always;"/>
If you're set on using Interop, having read up a little bit, this post states that you need a converter to insert HTML, and the converters are only accessible when:
you paste HTML from the Clipboard
open/insert HTML from a file
So, this answer looks like it provides a clipboard-based solution: Adding html text to Word using Interop
However, if there's any money to spend on the project, I can heartily recommend Aspose.Words which will do all of this for you.

As requested by the OP, and to make it easier for others to find this solution, here is the answer I posted as a comment (plus extra results from testing):
When opening an HTML file, MS Word honors the CSS properties page-break-before and page-break-after. There is a caveat, however:
In "Web design" view, page breaks are never shown (this doesn't mean that they aren't there), just like browsers don't "show" them. And Word opens HTML files in Web design view by default (which makes sense). You need to print the document or switch to some other view (typically "Print design") to see your breaks in all their glory.
So, saving an HTML file with a .doc extension is a viable solution (also tested: Word opens it properly despite the extension).
Note: all the testing was done on MS Word 2003 using this snippet: <html>asdf<br style="page-break-before: always;">new page!</html>
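For the looped tables described in the question, the approach could be sketched like this, assuming each loop iteration produces one HTML fragment. The class name, method name and output file name are made up for illustration:

```csharp
using System.Collections.Generic;
using System.Text;

public static class HtmlDocWriter
{
    // Joins the HTML fragments into one document, starting each fragment
    // after the first on a new page via page-break-before.
    public static string BuildDocument(IEnumerable<string> pages)
    {
        var sb = new StringBuilder();
        sb.Append("<html><body>");
        bool first = true;
        foreach (string page in pages)
        {
            if (!first) sb.Append("<br style=\"page-break-before: always;\"/>");
            sb.Append(page);
            first = false;
        }
        sb.Append("</body></html>");
        return sb.ToString();
    }
}
```

The result would then be saved with a .doc extension, for example with File.WriteAllText("report.doc", HtmlDocWriter.BuildDocument(tables)), where tables holds the generated fragments.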

Don't build the document in code. Create it in Word as a template or mail merge template, and then use code to merge or replace the field data.
See this answer here
MS Word Office Automation - Filling Text Form Fields And Check Box Form Fields And Mail Merge
And see this from the mothership:
http://msdn.microsoft.com/en-us/library/ff433638.aspx

If you don't want to use an external lib, Interop is too slow for you and neither pure HTML nor mail merge template are flexible enough, you could write your content as text or HTML into one or more files (using C#), create a VBA macro in a Word document which by itself creates a second Word document, reads the content files and does any formatting you want afterwards.
You can run this macro programmatically by starting Word using the command line switch /m.

Another possible approach, if your html is xhtml (i.e. XML compliant), you could use XSLT to convert it to a Word XML format. But this would take a LOOOOOOOOOOONG time to code.
If you don't have to use HTML as the starting point you could simply build the Word XML document yourself rather than using XSLT, which would be easier. Time consuming but possible - it's something I do quite a lot in my work.

If a third party component is an option I would recommend the stuff from Aspose.
I have been pretty happy with their tools so far. The API is a little messy but everything works as one would expect.
