How to convert a System.IO.Packaging.Package to HTML? - c#

Microsoft Word interoperability classes will let you get at a property called WordOpenXML. This represents a package that will be stored - zipped up - in a .docx file and can be opened by Microsoft Word. However, is there a way to convert this Package to other formats, notably HTML?
I read in an answer to an old question that "Word 2007 has an API that you can use to convert to HTML. [...] You can find documentation around the API, but I remember that there is a convert to HTML function in the API." I'm not 100% sure which API that guy is talking about but perhaps it's System.IO.Packaging.Package or something similar. I can't seem to find any "convert to HTML function"; does anyone know how you can convert a Package format Word document into HTML?

The API in question is probably the Save method on the document; when a file type of HTML is chosen, Word transforms the document into HTML, and applies the appropriate styling.
Chances are, given that the docx format is XML, there is an XSLT transformation of some sort going on; this is just speculation, but it's not far-fetched, as XSLT is commonly used to create HTML from XML.
That said, what you are looking for probably does not reside in the Package class, nor should it. The Package class is used for creating packages of content, not with the transformation of that content.
However, there's nothing stopping you from providing the transformation of that content; you can get the XML that is the basis of the Word document and then apply your own XSLT which would produce the HTML that you want.

Related

Display MSWord file content in any browser

I want to display content of word file in browser same like we display pdf file in browser. I don't want any plugin because if I use plugin I have to install for all browser. I want just one solution which works in all browser.
I have searched on google, but I found all link which directly download word file and open it.
Currently I am using object tag for displaying pdf file but it is not working for word file. It is showing message: The plug-in is not supported.
Using a browser plug-in (such as the free Word Viewer) is by far the easiest method, and arguably the most correct - however, there are some alternatives if you really don't want to do this:
Convert the Word document to another format (e.g. HTML/PDF) on-the-fly before the response is sent. For Word 97-2003 documents, you can do this with VSTO/Automation. For Word 2007+ documents, you can use the OpenXML SDK (although you will have to write the conversion algorithm yourself).
Use an XSL stylesheet to transform the Word markup (docx) into html/css. You can do this server-side or, potentially, with client-side scripting (JavaScript). Some useful resources here and here.
Great question. In principle, browsers only really tend to support viewing websites (e.g. html). Most, however, also support viewing PDFs, and, as you've correctly identified, you could use plugins to extend the behaviour. Crucially, though, some browsers provide document viewing with a javascript-based viewer.
I wasn't aware of it before you asked, but there are apparently javascript implementations of non-PDF document readers--for example, ViewerJS--that seem to directly support .odt. With a little digging, you might be able to find an implementation/plugin for a javascript viewer that supports .docx. However, I can't recommend one from personal experience at the moment. I would recommend searching for javascript document viewers though.

Send Output direct to HTML or through XML first?

I am writing a parser to parse incoming text files. I have it to where it will parse everything accurately.
I have an option for it to output to text - this was done to check the accuracy of the parsing. I am currently implementing an option to write to a spreadsheet but it doesn't output everything yet.
I have a request to output as static HTML. Is it worth outputting to XML and then generating HTML from that?
I see C# has the XMLTransform class which looks like it would do what I need. Is using the XML designer in VS and writing the XSLT file easier than hand-coding all of the HTML output? I know Excel will import XML files, but it is a little messy and I don't get the formatting options I can get if I generate the .xls file directly
I would give you a qualified No.
It is generally not worth building XML then running it through an XSLT transformation to build HTML.
That said, I might consider such an option if I wanted to easily swap out transformations, such as if this is an app used by multiple clients and the generated HTML would be client dependent. Even then I'd investigate using a simple tokenized HTML template in which I just plugged in the data I wanted. However, if the transformation was sufficiently complex then, yes, I'd go the XSLT route.
The reason for the No is that by the conversion adds such a level of complexity that it is usually not worth the time involved.

How to create a word document using html written in C#

I creating a C# application that has to create a word document.
I'm using the Microsoft.Office.Interop.Word to do this and I've successfully managed to output some word documents, but creating the content trough the code is a very time consuming work.
I noted that word is able to open html pages and show it as a normal content so I created a simple test table in html and inserted it into the word document. But when I outputted the document the obvious happened: The tags where still there! Word did not format the tags as html. It just outputted exactly what I put in there.
How can I tell word to reformat the text as html?
edit: (trough the C# code of course)
edit 2: Please note that I'm parsing trough some data to make this, so I will end up with about 4 pages of the same table/html, so I will need to be able to tell word to start at the next page each time I've finished a loop. So a html-only method will probably not work.
If you're only wanting to output simple HTML content as a Word document, you could always cheat and write out the HTML content with a .doc extension.
Word will open that just fine.
If you need to add a page break, you can use a CSS page-break-before, like so:
<br style="page-break-before: always;"/>
If you're set on using Interop, having read up a little bit, this post states that you need a converter to insert HTML, and the converters are only accessible when:
you paste HTML from the Clipboard
open/insert HTML from a file
So, this answer looks like it provides a clipboard-based solution : Adding html text to Word using Interop
However, if there's any money to spend on the project, I can heartily recommend Aspose.Words which will do all of this for you.
As requested by the OP, and to make easier for others to find this solution, here it goes the answer I posted as a comment (plus extra results from testing):
When opening an HTML file, MS Word honors the CSS properties page-break-before and page-break-after. There is a caveat, however:
On "Web design" view, page-breaks are never shown (this doesn't mean that they aren't there), just like browsers don't "show" them. And Word opens html files on Web design view by default (which quite makes sense). You need to print the document or switch to some other view (typicall "Print design") to see your breaks in all their glory.
So, saving an HTML file with a .doc extension is a viable solution (also tested: Word opens it properly despite of the extension).
Note: all the testing was done on MS Word 2003 using this snippet: <html>asdf<br style="page-break-before: always;">new page!</html>
Don't build the document in code, create it in Word as template or mail merge template and the use code to merge or replace the fields data.
See this answer here
MS Word Office Automation - Filling Text Form Fields And Check Box Form Fields And Mail Merge
And See this from the mothership:
http://msdn.microsoft.com/en-us/library/ff433638.aspx
If you don't want to use an external lib, Interop is too slow for you and neither pure HTML nor mail merge template are flexible enough, you could write your content as text or HTML into one or more files (using C#), create a VBA macro in a Word document which by itself creates a second Word document, reads the content files and does any formatting you want afterwards.
You can run this macro programmatically by starting Word using the command line switch /m.
Another possible approach, if your html is xhtml (i.e. XML compliant), you could use XSLT to convert it to a Word XML format. But this would take a LOOOOOOOOOOONG time to code.
If you don't have to use HTML as the starting point you could simply build the Word XML document yourself rather than using XSLT, which would be easier. Time consuming but possible - it's something I do quite a lot in my work.
If a third party component is an option I would recommend the stuff from Aspose.
I have been pretty happy with their tools so far. The API is a little messy but everything works as one would expect.

Is there a way to generate word documents dynamically without having word on the machine

I am planning on generating a Word document on the webserver dynamically. Is there good way of doing this in c#? I know I could script Word to do this but I would prefer another option.
I've worked at a company in the past that really wanted generated word documents, in the end they were perfectly satisfied with RTF docs that had a ".doc" extension. Word has no problem recognizing and opening them.
The RTF docs were generated with iText.net (free .net library), the API is pretty easy to use, performs extremely well, you don't need word on the machine, also, you could extend to generating PDF, HTML, and Text docs in the future with very little effort. After four years the solution I created is still in place, so that's a little testimony in iText.net's favor.
It looks like the official iText page suggests that iText Sharp is the best .Net choice right now, so that's another option
You'd be better off generating an rtf file, which word will know how to open.
If want to generate Office 2007 documents check the Open XML File Formats, they're simple zipped XML files, check this links:
Open XML File Formats: What is it, and how can I get started?
Introducing the Office (2007) Open XML File Formats
Edit: Check this project, can serve you as a good starting point:
DocumentMaker
Seems very simple and customizable, look this code snippet:
Paragraph p = new Paragraph();
p.Runs.Add(new Run("Text can have multiple format styles, they can be "));
p.Runs.Add(new Run("bold and italic",
TextFormats.Format.Bold | TextFormats.Format.Italic));
doc.Paragraphs.Add(p);
Word will quite happily open a HTML with a .doc extension. If you include an internal style sheet, you can have it fully formatted. There was previous post on this subject:
Export to Word Document in C#
Creating the old .DOC files (pre-Word 2007) is nigh-impossible without Word itself. The format is just too complex. Microsoft has released the format description, but it's enough to reduce a grown programmer to tears. There is a reason for that too (historical), but that doesn't make things better.
The new .DOCX would be easier, although quite a bit of hassle still. However depending on which Word versions you are targeting, there are some other options too.
For one, there is the classic .RTF. The format is pretty complex still, yet well documented and has strong support across many applications and platforms. And you might use some string-replacing into template files to make things easier (it's non-binary).
Then there are the "old" Word XML files. I think they worked starting with Word XP. Kinda the predecessors of .DOCX. I've used them, not bad. And the documentation is pretty OK.
Finally, the easy way that I would choose, is to make a simple HTML. Word can load HTML files just fine starting with version 2000. In the simplest way just change the extension of a HTML file to .DOC and you have it. You can also add a few word-specific tags and comments to make it look even better in Word. Use the Word's Save As...HTML option to see what they are.
There are third party libraries about that will do the job.
Doing a quick google came up with this one, for example.
I haven't tried any, so I can't give you specific advice, I'm afraid!
Let us know how you get on...
In Office 2007 Microsoft introduced a new file format called the Microsoft Open Office XML Format (.docx). This format is not compatible with older versions of Microsoft Word. Since this is XML you can create or read with out having a Word installed.
Here is the component that generates document based on the custom template. The documents are generated from the sharepoint list ... so the data is pulled from the list item into the document on the fly:
http://store.sharemuch.com/products/generate-word-documents-from-sharepoint-list
Hope that helps,
Yaroslav Pentsarskyy
Blog: www.sharemuch.com

Converting between document formats in C#

What is the best way to convert between HTML, XML, and XSL-FO in C#?
I already have the HTML (piped in from FCKEditor) and I'd like to print a PDF (I have an XSL->PDF converter). I just can't seem to find a library that will convert from HTML into anything XSL friendly.
A year or two back, I had to generate pdfs from a C++/C# program. In the end I settled on launching Apache's Java FOP as a separate process to do the conversion. The experience with xsl-fo was not a pleasant one. At the time, there didn't appear to be a single tool that had implemented xsl-fo completely. Tools tended to pick a subset of the specification and hack away at that. Given the sprawling complexity of xsl-fo, I'm starting to wonder if there will ever be a full implementation.
FOP tended to be buggy and considerable time was spent working around issues. XSLT and XPaths were difficult to learn. It took a few weeks before I was seeing past the verbosity and could quickly get things done. I don't think I ever quite got my head around xsl-fo though. It makes the html and css model look like a child's toy. Luckily, the pdfs generate, and don't have too many problems. :-)
Anyway, the task at hand: generating pdfs from xhtml output from FCKEditor.
I just can't seem to find a library that will convert from HTML into anything XSL friendly.
Heh. Yeah, that's 'cos there isn't one, and probably won't be an html to xsl-fo converter that's any good. Such a converter has a few things against it: complexity of browsers and complexity of xsl-fo. For such a converter to deal with an average html document, it needs the guts of a web browser: the layout, css support probably even JavaScript. Then it has to take the rendered page, and figure out what xsl-fo is needed to get something which looks similar, and fits within the paged constraints of xsl-fo.
It's like the problem with making a word viewer: without reimplementing a lot of word, it sucks most of the time because it doesn't look the same.
So... what can you do? Well, having a small subset of html to work with is a good start. Hopefully the output from FCKEditor is xhtml, as getting html into xml is a world of pain in itself (which tidy can be useful for). Next, unless some poor soul has already made an FCKEditor xhtml -> xsl-fo xslt for your xsl-fo implementation, you'll have to make one. That involves learning xsl-fo, xslt and xpath. In my experience it'll take a few weeks and will be a cobbled together solution.
To get started with xsl-fo I found the following links useful:
XSL-FOTutorial
XSL Standard
Apache FOP Compliance Page
XSL-FO: Ready for Prime Time? outlines the problem xsl-fo tries to solve
For three quick intros see a, b and c
So what's all this xsl-fo, xslt stuff and all the other things? The XSL-FO: Ready for Prime Time? lays it out as:
The Extensible Stylesheet Language Family (XSL) XSL is a family of recommendations for defining XML document transformation and presentation. It consists of three parts:
XSL Transformations (XSLT), a language for transforming XML
The XML Path Language (XPath), an expression language used by XSLT to access or refer to parts of an XML document. (XPath is also used by the XML Linking specification)
XSL Formatting Objects (XSL-FO), an XML vocabulary for specifying formatting semantics
My advice? Run. Find another away. Find another solution. Generate LaTeX files, and convert them into pdfs. Generate something else. Make word documents and print them using PDFCreator. Generate images. Control Firefox to print pages as pdfs. Find away to avoid needing pdfs at all. Anything, as long as it isn't fighting html, xsl-fo, FOP, xslt and xpath.
PS: Let me know if you need any help. :-)
I'd first try XSLT. When you're talking about formatting XML documents (and that's pretty much what you're talking about), that's the tool designed to do it.
From Wiki:
"The general idea behind XSL-FO's use
is that the user writes a document,
not in FO, but in an XML language.
XHTML, DocBook, and TEI are all
possibilities, but it could be any XML
language. Then, the user obtains an
XSLT transform, either by writing one
themselves or by finding one for the
document type in question. This XSLT
transform converts the XML into
XSL-FO."
You need an XSLT transform for HTML to XSL-FO. Not sure where to get one, but apparently the concept isn't alien.
Very informative exchange here. I have created a web application using ASP.NET and C#.NET for my IT contract business. One of the major goals of the web app is to generate customized resumes in various formats. I store my resume content in a SQL Server database and build the XML mostly raw in a C# method. I used XSLT to convert to HTML and with a little akwardness have finally got a basic presentable resume. My next goal is to get a printable version of the resume. I got a book on XML from the library and touched up the XSLT a little. Then I came to the XSL-FO chapter. That's when the iceberg hit. I wanted to take on the challenge of having a PDF option that would be a menu choice and do a tranform to XSLT to XSL-FO to PDF. Thing is all the book recommendations had references to commercial products. It is just not worth the money as PDF is not neccessary. I looked at Altova XMLSpy on a 30 day trail basis but as soon as I tried my first transform of a XSL-FO example file I got a message stating that I needed to download more software. That download was taking forever from their site so I gave up and removed the software. Free versions of the commmercial software from other vendors do not have the transform option. After reading the notes here I have decided to avoid the XSL-FO myself. I am going to try getting an MS Word version now and if my clients want to convert it to PDF they can pay for the PDF create version from Adobe.
This is a dead question but I would like to add for future readers that the current incarnation on FCKEditor (CKEditor now) is better at producing high quality XHTML (even a user-definable set of tags is possible).
I have gotten around similar issues by actually not using XSL-FO but using a (X)HTML to PDF converter that renders the PDF from your source without XSL Transforms. I validate the produced XHTML and fix the rare issues with HtmlAgilityPack - that way will get you a long way from non-semantic HTML complexities. There are many converters to choose from, my choice is wkhtmltopdf (If money is not an issue PrinceXML is a superior alternative - I would love to use it but it's simply too expensive).

Categories