parse word document - c#

The word documents I want to parse will have a known format, defined by a word template. Users will use the word template to create the document. I need to parse data, including values from drop downs, from the word document using C#. This will be done on a SharePoint 2010 server. What is the recommended way to do this? I've seen people mention Open XML SDK 2.0; should I use that? If so, do I need to convert the .docx to XML, then parse it? In some cases, I will also have to write to the Word document, how should this be done?
Preferably a solution will support Word 2010 and 2007 but if tools for 2010 are significantly better I'd like to know that as well. Thanks.

The file extension for Office Open XML is .docx. The .docx file can be described as an archive of several different files. Files that handle what fonts, styles, objects will exist in the word document. Those files itself will be describe as XML.

Related

Simple Template Based PDF Reports in .Net

My customer gave me some Word and Powerpoint documents which specify how certain 'reports' generated by our product are supposed to look like.
That means, I need to modify those documents (replace placeholders etc.) and then I need to export them as PDF.
How would you solve this problem in C# ?
TL;DR: Editing the office document is no problem at all, but exporting that document to PDF (using Interop) allegedly causes issues when running it as a web server application. That's the whole problem here.
I agree that Interop is not suitable for document manipulation in server environment. I would approach this problem by preparing MS Word template documents with placeholders for data. Then I would use c# to load the data for the reports and merge the data with templates to get final documents (docx, pdf, xps or various image formats). There are 3rd party toolkits which make it quite easy. Here is the code used by one such toolkit needed for merging xml data with the template to get a pdf document:
XElement customers = XElement.Load("Customers.xml");
DocumentGenerator dg = new DocumentGenerator(customers);
DocumentGenerationResult result = dg.GenerateDocument("MyTemplate.docx", "MyReport.pdf");
You can of course also use free libraries and SDKs based on OpenXML but you should expect a steep learning curve, lots of debugging and lots of time invested.
Wkthmltopdf might be an option.
A completely different "report approach" could be, to save those office documents with the placeholders as mht (That's MHTML a web archive format). This could be done directly in MS Office or even programatically.
The placeholders could be easily exchanged by string search and replace. The mht files could directly be used to show the report instead of the PDF. A clear disadvantage of the mht format, is the HTML formatting. With PDF you have a clear and fix positioning.
We are using this kind of report creation. There are some flaws, but it works and the customer could edit the mht templates directly by right-click Open-With the prefered MS Office flavor.
You can use report generators, like FastReport.Net for solving your problems. It can assign different data for placeholders and also allow export to PDF.

Convert PDF document to Word document by programmatically without any third party tool (SSRS 2005)

I am using SQL Server Reporting Service 2005(SSRS 2005) to export report to Excel and PDF and VS2008. But now i want an option to Export to Word also, but it is not possible in SSRS 2005 report that i came to know after googling. Here problem is that I CAN'T USE SSRS 2008 REPORT. So i thought that i will follow the steps as....
-- Export to Word
1. Export to PDF
2. Convert that PDF to Word document
Even after so much of googling i didn't got the proper answer. I told once and even telling that i can't use any third party tools so don't give me wrong path.
There are many fundamental differences between PDF and Word making the approach you want highly undesirable as a general workflow. I'll give just one example: PDF typically does not store information about document structure - sentences, paragraphs, columns, tables... All it stores is the actual text at certain locations at a page. Word of course does have those concepts.
Is it possible to do what you want? Yes, to some extent. In the general case with guesswork and approximation. If you know which information you want to convert it might be possible to search for it in the PDF file generated by SSRS and then generate a Word file out of it. However, if SSRS allows export to text, XML, RTF or any other structure based file format (however slightly structure based), you'd have a much easier time.
If you insist on doing what you suggest here, you would have to:
1) Write code to take the PDF exported from SSRS and interpret it (find the textual content you want)
2) Recreate the necessary structural information from that information (what are paragraphs, where and what are the tables, what's the formatting etc...)
3) Write that into a file Word can read (or create a new Word document directly using automation).
This would be a considerable amount of work, but you have all of the necessary information as the PDF specification is freely downloadable from the Adobe web site and it contains all of the information you need.

How to create *.docx files from a template in C#

I have a working ASP.NET MVC web application to manage projects and customers. Now I want to generate a word file for some customers. In this file should be displayed some data about the customer. Every generated file should have the same data and the same design. So I want to craete a new Word Template with the fields and want to fill the placeholders programmatically.
My problem is that I couldn't find a clear way to do that. Does anybody know a good learning resources?
Try this page:
Building Office Open XML Files
Open XML files (docx) are ZIP packages containing XML files. In your case, I would create a copy of your original template, then use the System.IO.Packaging API to open the file and modify it. By opening, the correct XML file and replacing certain placeholders in XML, you should be able to achieve the result you want.
While trying to do the same I have found some libraries to create/edit DOC or DOCX in .Net
GemBox.Document
TemplateEngine.Docx
DocXTemplateEngine
Templater - Nuget
Spire.Doc - Nuget
for the new document generation from a template file (.dot) should be very easy, I think it's a parameter you specify in the word application file open or so, when you pass the path of the .dot file, telling word to create a new .doc based on that file and not edit the actual template document.
for form fields and bookmarks filling, lots of examples online on how to do it from C#, see here:
MS Word Office Automation - Filling Text Form Fields And Check Box Form Fields And Mail Merge

Convert Open XML Excel files to HTML

I'm developing printing solution for MS Office 2007. Office automation is not right for me, because it requires Office to be installed. Open XML Document Viewer is solution for converting Word files (.docx) to HTML format by XSLT transform, but it works only for .docx. Can the same technology be used for Excel spreadsheets files?
You could use this article XSL transformation of SpreadsheetML to HTML as a starting point to develop your own transform. You can also look at the open source XSLTs in OpenXML/ODF Translator Add-ins for Office to get some ideas on things you may need to account for in any conversion outside of OOXML. The one thing to keep in mind is that SpreadsheetML is more similiar to PresentationML than it is to WordprocessingML in file structure inside the package (i.e. for every sheet, there is a seperate file).
If your doing this from .NET, I'd do this from LINQ instead of XSLT. I've done transforms from DrawingML into SVG and Linq makes it easy (in terms of similiar functionality to XSLT, staying within .NET, etc.)
If you're looking at Excel 97-03 (xls) or Excel 2007 (xlsx) files then I'd recommend FlexCel. I've used it, is very good and honestly quite cheap compared to it's competition.
Note that it doesn't fully support all formatting present in Excel 2007 yet I don't think. But it does have built in functionality to export to HTML.
You could write a SpreadsheetML parser. The schema is available online from Microsoft.
I wrote one a while back that covered data, structure and basic formatting to throw it throw a library and re-save it as an XLS file. Wasn't too difficult.

Is there a way to generate word documents dynamically without having word on the machine

I am planning on generating a Word document on the webserver dynamically. Is there good way of doing this in c#? I know I could script Word to do this but I would prefer another option.
I've worked at a company in the past that really wanted generated word documents, in the end they were perfectly satisfied with RTF docs that had a ".doc" extension. Word has no problem recognizing and opening them.
The RTF docs were generated with iText.net (free .net library), the API is pretty easy to use, performs extremely well, you don't need word on the machine, also, you could extend to generating PDF, HTML, and Text docs in the future with very little effort. After four years the solution I created is still in place, so that's a little testimony in iText.net's favor.
It looks like the official iText page suggests that iText Sharp is the best .Net choice right now, so that's another option
You'd be better off generating an rtf file, which word will know how to open.
If want to generate Office 2007 documents check the Open XML File Formats, they're simple zipped XML files, check this links:
Open XML File Formats: What is it, and how can I get started?
Introducing the Office (2007) Open XML File Formats
Edit: Check this project, can serve you as a good starting point:
DocumentMaker
Seems very simple and customizable, look this code snippet:
Paragraph p = new Paragraph();
p.Runs.Add(new Run("Text can have multiple format styles, they can be "));
p.Runs.Add(new Run("bold and italic",
TextFormats.Format.Bold | TextFormats.Format.Italic));
doc.Paragraphs.Add(p);
Word will quite happily open a HTML with a .doc extension. If you include an internal style sheet, you can have it fully formatted. There was previous post on this subject:
Export to Word Document in C#
Creating the old .DOC files (pre-Word 2007) is nigh-impossible without Word itself. The format is just too complex. Microsoft has released the format description, but it's enough to reduce a grown programmer to tears. There is a reason for that too (historical), but that doesn't make things better.
The new .DOCX would be easier, although quite a bit of hassle still. However depending on which Word versions you are targeting, there are some other options too.
For one, there is the classic .RTF. The format is pretty complex still, yet well documented and has strong support across many applications and platforms. And you might use some string-replacing into template files to make things easier (it's non-binary).
Then there are the "old" Word XML files. I think they worked starting with Word XP. Kinda the predecessors of .DOCX. I've used them, not bad. And the documentation is pretty OK.
Finally, the easy way that I would choose, is to make a simple HTML. Word can load HTML files just fine starting with version 2000. In the simplest way just change the extension of a HTML file to .DOC and you have it. You can also add a few word-specific tags and comments to make it look even better in Word. Use the Word's Save As...HTML option to see what they are.
There are third party libraries about that will do the job.
Doing a quick google came up with this one, for example.
I haven't tried any, so I can't give you specific advice, I'm afraid!
Let us know how you get on...
In Office 2007 Microsoft introduced a new file format called the Microsoft Open Office XML Format (.docx). This format is not compatible with older versions of Microsoft Word. Since this is XML you can create or read with out having a Word installed.
Here is the component that generates document based on the custom template. The documents are generated from the sharepoint list ... so the data is pulled from the list item into the document on the fly:
http://store.sharemuch.com/products/generate-word-documents-from-sharepoint-list
Hope that helps,
Yaroslav Pentsarskyy
Blog: www.sharemuch.com

Categories