Converting between document formats in C#

What is the best way to convert between HTML, XML, and XSL-FO in C#?
I already have the HTML (piped in from FCKEditor) and I'd like to print a PDF (I have an XSL->PDF converter). I just can't seem to find a library that will convert from HTML into anything XSL friendly.

A year or two back, I had to generate pdfs from a C++/C# program. In the end I settled on launching Apache's Java FOP as a separate process to do the conversion. The experience with xsl-fo was not a pleasant one. At the time, there didn't appear to be a single tool that had implemented xsl-fo completely. Tools tended to pick a subset of the specification and hack away at that. Given the sprawling complexity of xsl-fo, I'm starting to wonder if there will ever be a full implementation.
FOP tended to be buggy and considerable time was spent working around issues. XSLT and XPaths were difficult to learn. It took a few weeks before I was seeing past the verbosity and could quickly get things done. I don't think I ever quite got my head around xsl-fo though. It makes the html and css model look like a child's toy. Luckily, the pdfs generate, and don't have too many problems. :-)
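For readers wondering what "launching FOP as a separate process" looks like from C#, here is a minimal sketch, not the code I actually used. It assumes the FOP launcher script is reachable as "fop" (on Windows you would probably pass the full path to the fop batch file instead), that the input is already an XSL-FO file, and that you are on .NET Core 2.1 or later, where ProcessStartInfo.ArgumentList is available. The FopRunner class and its method names are made up for illustration.

```csharp
using System;
using System.Diagnostics;

public static class FopRunner
{
    // Builds the equivalent of: fop -fo <input.fo> -pdf <output.pdf>
    // "launcher" is an assumption: whatever starts FOP on your machine.
    public static ProcessStartInfo BuildStartInfo(string launcher, string foPath, string pdfPath)
    {
        var psi = new ProcessStartInfo(launcher)
        {
            UseShellExecute = false,
            RedirectStandardError = true,
            CreateNoWindow = true
        };
        psi.ArgumentList.Add("-fo");
        psi.ArgumentList.Add(foPath);
        psi.ArgumentList.Add("-pdf");
        psi.ArgumentList.Add(pdfPath);
        return psi;
    }

    public static void Convert(string launcher, string foPath, string pdfPath)
    {
        using var process = Process.Start(BuildStartInfo(launcher, foPath, pdfPath))
            ?? throw new InvalidOperationException("Could not start FOP.");

        // Drain stderr before waiting so a full pipe cannot stall FOP.
        string errors = process.StandardError.ReadToEnd();
        process.WaitForExit();

        if (process.ExitCode != 0)
            throw new InvalidOperationException("FOP failed: " + errors);
    }
}
```

A call would look something like FopRunner.Convert("fop", "resume.fo", "resume.pdf"). Expect to pay the JVM start-up cost on every call.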
Anyway, the task at hand: generating pdfs from xhtml output from FCKEditor.
I just can't seem to find a library that will convert from HTML into anything XSL friendly.
Heh. Yeah, that's 'cos there isn't one, and probably won't be an html to xsl-fo converter that's any good. Such a converter has a few things against it: complexity of browsers and complexity of xsl-fo. For such a converter to deal with an average html document, it needs the guts of a web browser: the layout, CSS support, probably even JavaScript. Then it has to take the rendered page, and figure out what xsl-fo is needed to get something which looks similar, and fits within the paged constraints of xsl-fo.
It's like the problem with making a word viewer: without reimplementing a lot of word, it sucks most of the time because it doesn't look the same.
So... what can you do? Well, having a small subset of html to work with is a good start. Hopefully the output from FCKEditor is xhtml, as getting html into xml is a world of pain in itself (which tidy can be useful for). Next, unless some poor soul has already made an FCKEditor xhtml -> xsl-fo xslt for your xsl-fo implementation, you'll have to make one. That involves learning xsl-fo, xslt and xpath. In my experience it'll take a few weeks and will be a cobbled together solution.
To get started with xsl-fo I found the following links useful:
XSL-FO Tutorial
XSL Standard
Apache FOP Compliance Page
XSL-FO: Ready for Prime Time? outlines the problem xsl-fo tries to solve
For three quick intros see a, b and c
So what's all this xsl-fo, xslt stuff and all the other things? The article "XSL-FO: Ready for Prime Time?" lays it out as:
The Extensible Stylesheet Language Family (XSL): XSL is a family of recommendations for defining XML document transformation and presentation. It consists of three parts:
XSL Transformations (XSLT), a language for transforming XML
The XML Path Language (XPath), an expression language used by XSLT to access or refer to parts of an XML document. (XPath is also used by the XML Linking specification)
XSL Formatting Objects (XSL-FO), an XML vocabulary for specifying formatting semantics
My advice? Run. Find another way. Find another solution. Generate LaTeX files, and convert them into pdfs. Generate something else. Make word documents and print them using PDFCreator. Generate images. Control Firefox to print pages as pdfs. Find a way to avoid needing pdfs at all. Anything, as long as it isn't fighting html, xsl-fo, FOP, xslt and xpath.
PS: Let me know if you need any help. :-)

I'd first try XSLT. When you're talking about formatting XML documents (and that's pretty much what you're talking about), that's the tool designed to do it.
From Wiki:
"The general idea behind XSL-FO's use is that the user writes a document, not in FO, but in an XML language. XHTML, DocBook, and TEI are all possibilities, but it could be any XML language. Then, the user obtains an XSLT transform, either by writing one themselves or by finding one for the document type in question. This XSLT transform converts the XML into XSL-FO."
You need an XSLT transform for HTML to XSL-FO. Not sure where to get one, but apparently the concept isn't alien.
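To give a feel for what such a transform involves, here is a minimal sketch using XslCompiledTransform from System.Xml.Xsl. The stylesheet is a toy I made up for illustration: it only understands html, body, p, b and strong in the XHTML namespace, and anything else (including bare text directly under body) would produce output that a formatter is likely to reject. A real FCKEditor stylesheet would need templates for every tag the editor can emit.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XhtmlToFo
{
    // Toy stylesheet (an assumption, not a complete XHTML mapping).
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""
    xmlns:fo=""http://www.w3.org/1999/XSL/Format""
    xmlns:h=""http://www.w3.org/1999/xhtml""
    exclude-result-prefixes=""h"">
  <xsl:template match=""/h:html"">
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name=""page"" page-height=""297mm"" page-width=""210mm"">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference=""page"">
        <fo:flow flow-name=""xsl-region-body"">
          <xsl:apply-templates select=""h:body/*""/>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>
  <xsl:template match=""h:p"">
    <fo:block><xsl:apply-templates/></fo:block>
  </xsl:template>
  <xsl:template match=""h:b | h:strong"">
    <fo:inline font-weight=""bold""><xsl:apply-templates/></fo:inline>
  </xsl:template>
</xsl:stylesheet>";

    // Runs the stylesheet over well-formed XHTML and returns the XSL-FO as a string.
    public static string Transform(string xhtml)
    {
        var xslt = new XslCompiledTransform();
        using (var stylesheetReader = XmlReader.Create(new StringReader(Stylesheet)))
            xslt.Load(stylesheetReader);

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(xhtml)))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
            xslt.Transform(input, writer);

        return output.ToString();
    }

    public static void Main()
    {
        string xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
                       "<body><p>Hello <b>world</b></p></body></html>";
        Console.WriteLine(Transform(xhtml));
    }
}
```

The resulting XSL-FO would then be handed to whatever XSL->PDF converter you already have. Note that the input must be well-formed XML; plain HTML will make XmlReader throw.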

Very informative exchange here. I have created a web application using ASP.NET and C# for my IT contract business. One of the major goals of the web app is to generate customized resumes in various formats. I store my resume content in a SQL Server database and build the XML mostly raw in a C# method. I used XSLT to convert to HTML and, with a little awkwardness, have finally got a basic presentable resume. My next goal is to get a printable version of the resume. I got a book on XML from the library and touched up the XSLT a little. Then I came to the XSL-FO chapter. That's when the iceberg hit. I wanted to take on the challenge of having a PDF option that would be a menu choice and would run an XSLT transform to XSL-FO and then to PDF. The thing is, all of the book's recommendations referred to commercial products. It is just not worth the money, as PDF is not necessary. I looked at Altova XMLSpy on a 30-day trial basis, but as soon as I tried my first transform of an XSL-FO example file I got a message stating that I needed to download more software. That download was taking forever from their site, so I gave up and removed the software. Free versions of the commercial software from other vendors do not have the transform option. After reading the notes here I have decided to avoid XSL-FO myself. I am going to try getting an MS Word version now, and if my clients want to convert it to PDF they can pay for the PDF creation software from Adobe.

This is a dead question, but I would like to add for future readers that the current incarnation of FCKEditor (CKEditor now) is better at producing high quality XHTML (even a user-definable set of tags is possible).
I have gotten around similar issues by not using XSL-FO at all, but using an (X)HTML to PDF converter that renders the PDF from your source without XSL transforms. I validate the produced XHTML and fix the rare issues with HtmlAgilityPack, which gets you a long way past the complexities of non-semantic HTML. There are many converters to choose from; my choice is wkhtmltopdf. (If money is not an issue, PrinceXML is a superior alternative. I would love to use it, but it's simply too expensive.)

Related

Is there a high fidelity way to convert HTML into PDF and DOCX?

I need to convert HTML files into PDF and DOCX respectively (just the HTML -> PDF part would be good enough for now though).
Obviously I know there are some projects that help with what I want to achieve, I am currently using HTML-Renderer for the PDF part, and OpenXML for the DOCX.
I've tried HTML-Renderer, but the fidelity of the conversion is not great: I read somewhere that I can't make headers and footers with HTML for multipage formats. Furthermore, the conversion cuts off the end of the text when it passes from one page to another.
As for the DOCX, I don't know what the best options are.
I want, if possible, to know what the good high-fidelity ways to convert HTML to those formats are. Any help is greatly appreciated.
I'm open to ideas/advice on how to make it myself, but right now I don't have the time to do so, so I would much rather use an existent NuGet/DLL/library.

You could consider shelling out to pandoc:
https://pandoc.org
For visual appeal, you might like the Eisvogel template:
https://github.com/Wandmalfarbe/pandoc-latex-template
...which although designed for Markdown, ought to work for well structured, semantic HTML as input to Pandoc too.
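If the HTML is already in memory, shelling out could look roughly like the sketch below. It relies on pandoc reading from standard input when no input file is given and inferring the output format from the extension passed to -o. It assumes pandoc is on the PATH and .NET Core 2.1 or later (for ArgumentList). The PandocRunner class is a made-up name. PDF output additionally needs a PDF engine such as a LaTeX installation, which this sketch does not check for.

```csharp
using System;
using System.Diagnostics;

public static class PandocRunner
{
    // Builds the equivalent of: pandoc -f html -o <outputPath>
    // With no input file, pandoc reads the document from stdin.
    public static ProcessStartInfo BuildStartInfo(string outputPath)
    {
        var psi = new ProcessStartInfo("pandoc")
        {
            UseShellExecute = false,
            RedirectStandardInput = true,
            RedirectStandardError = true,
            CreateNoWindow = true
        };
        psi.ArgumentList.Add("-f");
        psi.ArgumentList.Add("html");
        psi.ArgumentList.Add("-o");
        psi.ArgumentList.Add(outputPath);
        return psi;
    }

    public static void Convert(string html, string outputPath)
    {
        using var process = Process.Start(BuildStartInfo(outputPath))
            ?? throw new InvalidOperationException("Could not start pandoc.");

        // Send the HTML and close stdin so pandoc knows the input is complete.
        process.StandardInput.Write(html);
        process.StandardInput.Close();

        string errors = process.StandardError.ReadToEnd();
        process.WaitForExit();

        if (process.ExitCode != 0)
            throw new InvalidOperationException("pandoc failed: " + errors);
    }
}
```

For example, PandocRunner.Convert(html, "report.docx") for Word output. Adding a template such as Eisvogel would mean extra arguments (pandoc has a --template option), which I have left out here.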

Create PDF from HTML form results in C#

I have a project where I need to create an HTML form (no problem) and then create a PDF file from the results using C#.
I have done this before in PHP using FPDF but this one needs to be C#. Ideally I want to put the code into a user control and then stick it in an Umbraco website.
Can anyone recommend a good way to do this? PDF doesn't need to be fancy, it'll just display text, we aim to create a generic purchase order based on what the customer wants from the form, which can then be emailed to them to print off on headed paper.
Thanks

There are a couple of recent problems with iTextSharp. The most annoying is that in the latest version they've deprecated the HTML parser. So now everything has to work through the XMLWorkerHelper singleton and parse through ParseXHtml. I find this a real pain, since HTML pages which aren't well formed appear fine in a browser and parse OK with the old method, but now crash out with an exception. So it necessitates an extra step to make sure your HTML is well formed (as XHTML) first. If you are generating your HTML from an ASPX page and using Server.Execute() to get the stream, then this might be useful to you for iTextSharp:
http://jwcooney.com/2012/12/30/generate-a-pdf-from-an-asp-net-web-page-using-the-itextsharp-xmlworker-namespace/
Be mindful that iTextSharp has a distinct lack of any decent documentation of the modern changes (and the Java iText documents don't translate perfectly to C#). This makes the learning curve far too long and steep for any practical use in short spaces of time. I've basically given up on that platform, though I may just create a baseline system to get something working lean whilst I learn another framework.
As a result, I'm looking at PDFizer and PDFSharp libraries. If I have some success, I'll report back.
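The "extra step" of checking that the markup is well formed before handing it to ParseXHtml can be done with the framework's own XML parser. This is a minimal sketch using XDocument.Parse; the XhtmlCheck class is a made-up name. It only tests XML well-formedness. Markup with a DOCTYPE or with named entities such as &nbsp; needs extra handling that is not shown here.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlCheck
{
    // Returns true if the markup parses as XML; otherwise returns false
    // and reports the parser's message (which includes line and position).
    public static bool IsWellFormed(string markup, out string error)
    {
        try
        {
            XDocument.Parse(markup);
            error = string.Empty;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        // An unclosed <br> is fine for a browser but not for an XML parser.
        Console.WriteLine(IsWellFormed("<p>one<br>two</p>", out string error));
        Console.WriteLine(error);
    }
}
```

If the check fails, something like HtmlAgilityPack or tidy can be used to repair the markup before parsing.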

Here is a library for converting HTML to PDF:
http://pdfcrowd.com/web-html-to-pdf-net/

I like the PDFsharp library. Not sure how it would work for your needs, though.

Word/PDF Generation in C# with hundreds of pages is too slow

I'm having a speed issue generating documentation in C#.
I am basically trying to create documents with 600+ pages, but the tools I have used handle this very slowly.
I first tried using DocX by Novacode. Creating this document with 600+ pages takes upwards of 3 minutes. I learned that there could be an issue with the function "InsertDocument", so I tried to find a different solution.
I started looking into opening an HTML document in Word. While this is a fast solution, images are not embedded into the document. The HTML syntax (src="data:image/png;base64,xxxx") is not supported in MS Word.
I could use URLs to the images, but then if the internet connection is down, the images would not display.
I then started looking into a HTML->PDF solution. iTextSharp is a little faster than the DocX solution, but still takes 1-2 minutes to generate this document.
I am simply out of ideas. I'm not sure a commercial product would be better, and I don't want to shell out that kind of cash, to just have the same speed issue.
Has anyone had experience with creating Word/PDF documents with 600+ pages in C# fairly quickly (1-5 seconds)?

If you are trying to do this from a web server, you should be careful about the resource consumption of this process, since you can quite easily run out of memory, for example.
If at some point you decide to consider commercial libraries, maybe you could give Amyuni PDF Creator .Net a try. Amyuni PDF Creator .Net provides a "page by page" mode that saves resources when processing exceptionally long PDF documents. The idea is to save each page to the output file as soon as it is generated, maybe keeping a few pages in memory in case they need to be modified.
Take a look on these links for more details:
StartSave Method
EndSave Method
Processing large PDF files
usual disclaimer applies

You should be able to create a rich formatted DOCX file with 600+ pages in that time frame, but for PDF file I'm not sure... it will probably depend on your document content.
Anyway, I'm able to create a rather large DOCX file with GemBox.Document in just a few seconds (0-4 sec), and a PDF file as well, but it does take a bit more time than the DOCX output.
You can also convert HTML to DOCX or HTML to PDF really fast, but that can depend on the HTML content itself.
If possible, you should prefer having well written HTML content that's "printer-friendly", doesn't have too many nesting levels, has optimized images, has a single CSS file, etc. Also, if you're providing a URL as an input path then I think it's better to have embedded base64 images than links, in order to avoid making additional web requests.
Last, I don't think there is much difference between Flat OPC XML and DOCX. Basically they both contain the same content; it's just that a DOCX file is additionally zipped, which is a negligible performance penalty.
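For the embedded base64 images mentioned above, building the data URI yourself takes only a few lines. A minimal sketch follows; the DataUri class and its method names are made up for illustration, and the caller is assumed to know the MIME type. Keep in mind the question's observation that MS Word does not render such images when it opens an HTML file directly, so this helps only with converters that understand data URIs.

```csharp
using System;
using System.IO;

public static class DataUri
{
    // Encodes raw bytes as a data URI, e.g. data:image/png;base64,....
    public static string FromBytes(byte[] data, string mimeType)
        => $"data:{mimeType};base64,{Convert.ToBase64String(data)}";

    // Reads an image file and returns an <img> tag with the image inlined.
    public static string ImgTag(string path, string mimeType)
        => $"<img src=\"{FromBytes(File.ReadAllBytes(path), mimeType)}\" />";
}
```

Base64 inflates the payload by roughly a third, so for a 600+ page document with many images it is worth measuring whether inlining actually beats fetching from a local path.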

Send Output direct to HTML or through XML first?

I am writing a parser to parse incoming text files. I have it to where it will parse everything accurately.
I have an option for it to output to text - this was done to check the accuracy of the parsing. I am currently implementing an option to write to a spreadsheet but it doesn't output everything yet.
I have a request to output as static HTML. Is it worth outputting to XML and then generating HTML from that?
I see C# has the XMLTransform class which looks like it would do what I need. Is using the XML designer in VS and writing the XSLT file easier than hand-coding all of the HTML output? I know Excel will import XML files, but it is a little messy and I don't get the formatting options I can get if I generate the .xls file directly.

I would give you a qualified No.
It is generally not worth building XML then running it through an XSLT transformation to build HTML.
That said, I might consider such an option if I wanted to easily swap out transformations, such as if this is an app used by multiple clients and the generated HTML would be client dependent. Even then I'd investigate using a simple tokenized HTML template in which I just plugged in the data I wanted. However, if the transformation was sufficiently complex then, yes, I'd go the XSLT route.
The reason for the No is that the conversion adds such a level of complexity that it is usually not worth the time involved.
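By "simple tokenized HTML template" I mean something along these lines. This is a minimal sketch: the {{name}} token syntax and the HtmlTemplate class are just conventions picked for the example, and values are HTML-encoded with WebUtility.HtmlEncode so parsed text cannot break the markup.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class HtmlTemplate
{
    // Replaces each {{key}} token in the template with the encoded value.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        foreach (var pair in values)
        {
            template = template.Replace(
                "{{" + pair.Key + "}}",
                WebUtility.HtmlEncode(pair.Value));
        }
        return template;
    }

    public static void Main()
    {
        var values = new Dictionary<string, string>
        {
            ["title"] = "Parse results",
            ["body"] = "3 records, 0 errors"
        };
        string html = Fill(
            "<html><body><h1>{{title}}</h1><p>{{body}}</p></body></html>",
            values);
        Console.WriteLine(html);
    }
}
```

Swapping templates per client then means swapping a text file, with no XSLT involved. Once you need loops or conditionals in the template, that is roughly the point where XSLT or a real templating engine starts to pay off.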

Creating an Excel SpreadsheetML in code. (Without Excel!)

With Excel 2003 and higher it is possible to use the SpreadsheetML format to generate Excel spreadsheets with just an XML stylesheet and an XML data file. I've used this in some projects and it works quite nicely, even though it's not easy to do.
From the Microsoft Download site I've downloaded the XSD's that make up SpreadsheetML and in my ignorance, I've tried to convert them to C# classes. Unfortunately, xsd.exe isn't very happy about these schema files so I tend to be stuck.
I don't need an alternative solution to SpreadsheetML since it works fine for my needs. It's just that my code would be a bit easier to maintain for my team members if it's not written in a complex stylesheet. (It sucks to be the only XSLT expert in your company.)
All I want to know is whether someone has successfully created Excel SpreadsheetML files with .NET without the use of third-party code and without XSLT. And if you have, how did you solve this?
(Or maybe I just have to discover how to add namespaces to XML elements within XML.Linq...)

A while ago I used the XmlDocument and friends to create a SpreadsheetML document with formulae, formats and so on, so it is possible if a bit fiddly.
This MSDN page is what you need to get started with using the namespace in LINQ.
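As a rough idea of how the namespace part works in LINQ to XML: you combine an XNamespace with a local name using the + operator, for elements and attributes alike. The sketch below builds a bare-bones SpreadsheetML workbook with string cells only. The SpreadsheetMl class is a made-up name, and I have not included styles, formulae or number types, so treat it as a starting point and not a complete writer.

```csharp
using System;
using System.Xml.Linq;

public static class SpreadsheetMl
{
    // The SpreadsheetML 2003 namespace, conventionally bound to the "ss" prefix.
    private static readonly XNamespace ss = "urn:schemas-microsoft-com:office:spreadsheet";

    public static XDocument Build(string sheetName, string[][] rows)
    {
        var table = new XElement(ss + "Table");
        foreach (var row in rows)
        {
            var rowElement = new XElement(ss + "Row");
            foreach (var cell in row)
            {
                rowElement.Add(new XElement(ss + "Cell",
                    new XElement(ss + "Data",
                        new XAttribute(ss + "Type", "String"),
                        cell)));
            }
            table.Add(rowElement);
        }

        return new XDocument(
            new XDeclaration("1.0", "utf-8", null),
            // Tells Windows to open the file with Excel.
            new XProcessingInstruction("mso-application", "progid=\"Excel.Sheet\""),
            new XElement(ss + "Workbook",
                // Declares the "ss" prefix used by the namespaced attributes.
                new XAttribute(XNamespace.Xmlns + "ss", ss.NamespaceName),
                new XElement(ss + "Worksheet",
                    new XAttribute(ss + "Name", sheetName),
                    table)));
    }

    public static void Main()
    {
        var doc = Build("Sheet1", new[]
        {
            new[] { "Name", "City" },
            new[] { "Ada", "London" }
        });
        doc.Save(Console.Out);
    }
}
```

Writing to a file is just doc.Save("book.xml").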

I have used this library, and there is even a tool to generate the C# code that you need from an existing Excel file.
http://www.carlosag.net/Tools/ExcelXmlWriter/

I had started on a similar problem a few weeks back, but due to some pressing issues I had to put it on the back burner.
Back then I referred to this http://www.codeproject.com/KB/aspnet/ExportClassLibrary.aspx?fid=113399&df=90&mpp=25&sort=Position&tid=2609600
I really couldn't get started with it but plan to get back on it soon. I hope the link helps.
cheers
