I need a script (or other code, C#, etc.) that will fetch every paragraph/sentence containing a particular word in a set of Word 2007 documents and move them to a new Word document, recording the filename of the original (source) document they were extracted from.
What about using a document indexer such as dtSearch to index your documents (Word, PDF, etc.) and then tapping into its API to run your searches? From the sound of it, that might be the fastest way to accomplish this. Granted, indexers like dtSearch cost money (not a whole lot), but that can be worth it compared to the hours you would spend writing your own code to do the same thing.
Some articles that I have found that might lead you in the right direction if you don't want to use an indexer are:
http://omegacoder.com/?p=555
and
http://weblogs.asp.net/guystarbuck/archive/2008/05/13/automated-search-and-replace-in-multiple-word-2007-documents-with-c.aspx
Edit
To find a sentence that contains a specific word, you can try this link http://msdn.microsoft.com/en-us/library/bb546163.aspx
This might give you a start: http://msdn.microsoft.com/en-us/library/ff834910.aspx
Office Interop is an option, but beware: Microsoft does not support it in server-like scenarios (ASP.NET, Windows services, and similar). See http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2
You will need to use some library to achieve what you want:
MS provides the OpenXML SDK V 2.0 (free)
Aspose.Words (commercial)
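If you would rather see the idea before picking a library, here is a minimal sketch in Python using only the standard library. It is not the OpenXML SDK or Aspose API; it relies on the fact that a .docx is a zip package whose main text lives in word/document.xml. The function names are made up for illustration, it matches whole paragraphs rather than sentences, and paragraphs nested inside text boxes may be reported more than once.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in .docx packages.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_containing(docx_file, word):
    """Yield the text of each paragraph in docx_file that contains word.

    docx_file may be a path or a binary file-like object.
    The match is case-insensitive and on whole words only.
    """
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is usually split across several <w:t> runs.
        text = "".join(t.text or "" for t in para.iter(W_NS + "t"))
        if pattern.search(text):
            yield text


def collect_matches(paths, word):
    """Return (source filename, paragraph text) pairs for every match."""
    return [(str(path), text)
            for path in paths
            for text in paragraphs_containing(path, word)]
```

Something like collect_matches(pathlib.Path("reports").glob("*.docx"), "invoice") (folder and search word are placeholders) would give you the pairs to write into the new document; producing that output document is a separate step.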
Related
I have been searching for the last two days but have not found anything.
My requirement is to create a document viewer in my web application (C#/.NET), and I don't want to use any third-party tool for this. Can I convert the files to images, PDF, or some other common format that can easily be rendered on a web page? I also cannot use the Interop objects.
Any help will be highly appreciated
You mention in one of your comments that you'd like to write all the code yourself but don't know where to start. Here's how I would go about it...
First, you'll need to familiarize yourself with the Microsoft Office Format specification. You can find that here (there's a link to the technical specification). Office documents are actually a .zip file with XML files inside, along with any binary data representing attachments. Just rename a .docx file to .zip and you'll be able to open it up and see the XML and any other supporting documents inside (the same is true for .xlsx, etc.).
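To see the zip structure without renaming anything, a quick sketch in Python (standard library only; the function name is just for illustration) can list what is inside a package:

```python
import zipfile


def list_package_parts(office_file):
    """Return the sorted names of the parts stored inside an Office 2007+ file.

    office_file may be a path or a binary file-like object (.docx, .xlsx, ...).
    """
    # zipfile looks at the file's content, not its extension,
    # so the .docx does not have to be renamed to .zip first.
    with zipfile.ZipFile(office_file) as package:
        return sorted(package.namelist())
```

For a Word document you should see entries such as [Content_Types].xml and word/document.xml, plus whatever images and other parts the document embeds.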
Then you'll need to become intimately familiar with either PDF or HTML, as your job now will be to convert the various Office document structure into PDF or HTML structure, being sure to respect page layout, margins, order, etc...
As others have said, this is a large task, which is why third-party tools exist today. Also, each third-party toolset has its limitations, as this is really hard to "get right" in all situations, and there will be edge cases that work for one document and not another (because maybe the author didn't use Microsoft Word to save the .docx; maybe they used OpenOffice, and OpenOffice interpreted the standard slightly differently...)
If you cannot use COM/Interop technologies in your solution, you can take a look at the specialized 3rd party options. I see that you prefer not to use them, however, there are no existing built-in solutions in the .NET Framework. Check out my answer in a similar thread that describes how to accomplish exactly the same task using 3rd party libraries (for example, DevExpress, since I have experience with it). In addition, take a look at the Documents demo, where you can see how to create images/thumbnails from different types of MS Office documents.
I believe what you need is an intermediate representation of the documents which can be converted into an image for the viewer to display.
Let me try to explain with the diagram below:
You can use tools like smallpdf or OfficeToPDF to do that. Just integrate them into your application.
Small PDF (https://smallpdf.com/library-detail)
officetopdf (https://officetopdf.codeplex.com/)
Our product is going to support Word (and PDF) report generation, and I'm investigating which technique to choose.
Currently what I know is Word automation and OpenXML SDK. There are pros & cons of each.
Do you have any experience, suggestions, or comments about these two or any other techniques? Or are there any third-party utilities/products (whether or not they are based on these two techniques) we can use? We want to analyze as many solutions as possible.
If you have the choice I'd go for OpenXML any day of the week.
It has quite a number of advantages over Office Automation.
The most interesting one for me is the fact that it can run on a server, where Office Automation can't (because you need an instance of office on the pc/server running your software). That brings us to my second point, it doesn't need an instance of Office to generate your documents, where Office automation needs one. (This is because office automation will run an instance of office in the background and perform all your actions on it).
Especially when we are talking about large documents or being able to generate quite a few at the same time, OpenXML will perform a lot better than Office Automation because of this.
To make a long story short, Office automation is a thing of the past; OpenXML is the future ;)
If you want to dive into OpenXML, take a peek here: OpenXML Developer
Good luck !
For PDF generation I have used http://www.html-to-pdf.net in the past. It provides good support, and I assume it can be used to generate Word documents as well... Check out their website...
If you are using Web Forms: I ran into one issue with HTTPS, and I have written up the solution here:
http://blogs.msdn.com/b/sajoshi/archive/2010/12/13/using-pdfconverter-http-www-html-to-pdf-net-with-https-in-asp-net-mvc.aspx
Docmosis offers a cloud service that can produce MS Word and PDF output via a simple api. The report or document templates are either Word or Open Office documents which can be edited and maintained by non-developers. Once uploaded to the system your application can then simply call the service and specify the data to inject into the document(s) as either JSON or XML. The result is then streamed back, emailed, or placed in storage for access later. Output can be doc, pdf, or html.
The service offers a wide range of templating features and so supports quite complex reporting requirements.
The best thing we found was that cosmetic changes to the output could be handled by the document authors and not the developers which saved us heaps of valuable time (not to mention saving the sanity of our developers).
www.docmosis.com
If you want to build your documents in code, the OpenXML SDK is definitely the way to go. It is a very well designed API that makes full use of LINQ-style syntax. Once you're up to speed on it you will find it very powerful and easy to use.
With that said, you then have all the logic of your document in code. Any change to the document requires a change in your code, and that tends to become a pain over time. If you want a system where you design the document in Word, you've got a couple of choices, and Word automation is the worst of them. Even Microsoft says don't do Office automation on a server.
One of the best choices where you design in Word is Windward Reports (disclaimer - I'm the CTO there). With Windward you get the power and ease of Word for your design and new documents or revisions of existing documents don't require a change in code. Other products that take this approach are XpertDoc and SoftArtisans (although both of them do have a code component with each template).
I need a way of generating a word document (from a template or something) and inserting an image at a specific place. Does anyone have any pointers on the best way to do this?
I worked on a project that used Office Automation in .NET 1.1 a few years ago, and it was really unspeakably poor. I'm assuming OA has either been improved or been superseded by a better solution, but I'm not finding much advice on Google.
Edit to clarify, this will be running on a web or sharepoint server
Alternatively, and if you do not need to generate word documents that will work with versions prior to Word 2007, you could use the OpenXML SDK to create your Word Document. It's all managed code and way easier in my opinion to use than OA.
Having done something very similar, I would advise against it. Office Automation within a server environment is buggy. Further, COM Interop requires an interactive user, i.e. there is no 'headless' mode.
Use OpenXML as suggested by Gimly; this would be a cleaner approach.
If you do use Interop, the following line adds the image to the Word document:
wordDoc.InlineShapes.AddPicture(filePath, ref link, ref save, ref range);
Here, link should be false and save should be true, so the picture is stored in the document rather than linked to the file. range should be the location where the image is to be inserted.
This link should help out dealing with the Interop.
I am planning on generating a Word document on the web server dynamically. Is there a good way of doing this in C#? I know I could script Word to do this, but I would prefer another option.
I've worked at a company in the past that really wanted generated Word documents; in the end they were perfectly satisfied with RTF docs that had a ".doc" extension. Word has no problem recognizing and opening them.
The RTF docs were generated with iText.net (a free .NET library). The API is pretty easy to use, it performs extremely well, and you don't need Word on the machine. You could also extend it to generate PDF, HTML, and text docs in the future with very little effort. After four years the solution I created is still in place, so that's a little testimony in iText.net's favor.
It looks like the official iText page suggests that iText Sharp is the best .Net choice right now, so that's another option
You'd be better off generating an rtf file, which word will know how to open.
If you want to generate Office 2007 documents, look at the Open XML File Formats. The files are simply zipped XML. Check these links:
Open XML File Formats: What is it, and how can I get started?
Introducing the Office (2007) Open XML File Formats
Edit: Check out this project; it can serve as a good starting point:
DocumentMaker
It seems very simple and customizable; look at this code snippet:
Paragraph p = new Paragraph();
p.Runs.Add(new Run("Text can have multiple format styles, they can be "));
p.Runs.Add(new Run("bold and italic",
    TextFormats.Format.Bold | TextFormats.Format.Italic));
doc.Paragraphs.Add(p);
Word will quite happily open an HTML file with a .doc extension. If you include an internal style sheet, you can have it fully formatted. There was a previous post on this subject:
Export to Word Document in C#
Creating the old .DOC files (pre-Word 2007) is nigh-impossible without Word itself. The format is just too complex. Microsoft has released the format description, but it's enough to reduce a grown programmer to tears. There is a reason for that too (historical), but that doesn't make things better.
The new .DOCX would be easier, although quite a bit of hassle still. However depending on which Word versions you are targeting, there are some other options too.
For one, there is the classic .RTF. The format is pretty complex still, yet well documented and has strong support across many applications and platforms. And you might use some string-replacing into template files to make things easier (it's non-binary).
Then there are the "old" Word XML files. I think they worked starting with Word XP. Kinda the predecessors of .DOCX. I've used them, not bad. And the documentation is pretty OK.
Finally, the easy way, and the one I would choose, is to make a simple HTML file. Word can load HTML files just fine starting with version 2000. In the simplest case, just change the extension of an HTML file to .DOC and you have it. You can also add a few Word-specific tags and comments to make it look even better in Word. Use Word's Save As...HTML option to see what they are.
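As a rough illustration of the HTML route, here is a sketch in Python (standard library only). The function names, fonts, and sizes are arbitrary choices for the example, and it uses only plain HTML with an internal style sheet, none of the Word-specific tags mentioned above.

```python
import html


def build_word_html(title, paragraphs):
    """Return a small HTML document with an internal style sheet."""
    body = "\n".join("<p>%s</p>" % html.escape(p) for p in paragraphs)
    safe_title = html.escape(title)
    return (
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "<title>%s</title>\n"
        "<style>\n"
        "body { font-family: Arial, sans-serif; font-size: 11pt; }\n"
        "h1 { font-size: 16pt; }\n"
        "</style>\n</head>\n<body>\n"
        "<h1>%s</h1>\n%s\n</body>\n</html>\n"
    ) % (safe_title, safe_title, body)


def save_as_doc(path, title, paragraphs):
    """Write the HTML to path; give it a .doc extension so Word opens it."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_word_html(title, paragraphs))
```

For example, save_as_doc("report.doc", "Monthly report", ["First paragraph.", "Second paragraph."]) writes a file that is still plain HTML underneath; newer Word versions may warn that the content does not match the extension.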
There are third party libraries about that will do the job.
Doing a quick google came up with this one, for example.
I haven't tried any, so I can't give you specific advice, I'm afraid!
Let us know how you get on...
In Office 2007 Microsoft introduced a new file format called Office Open XML (.docx). Older versions of Microsoft Word cannot open this format natively (they need Microsoft's compatibility pack). Since it is XML, you can create or read these files without having Word installed.
Here is a component that generates documents based on a custom template. The documents are generated from a SharePoint list, so the data is pulled from the list item into the document on the fly:
http://store.sharemuch.com/products/generate-word-documents-from-sharepoint-list
Hope that helps,
Yaroslav Pentsarskyy
Blog: www.sharemuch.com
See title...
No.
You can use WordML (Word XML)
Word 2007 version
You can create Word 2007 documents using its XML format without the need of installing Word in your server.
This can be a starting point.
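To give a feel for how little is needed, here is a sketch in Python (standard library only) that writes a bare-bones .docx package by hand: a content-types part, a package relationship, and the main document part. The function name is invented for the example, it produces unstyled paragraphs only, and I have not checked the output against every Word version, so treat it as an assumption-laden starting point rather than a complete generator.

```python
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Points the package at its main document part.
PACKAGE_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def write_minimal_docx(target, paragraphs):
    """Write paragraphs (plain strings) to target as a minimal .docx.

    target may be a path or a writable binary file-like object.
    """
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(p)
        for p in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>'
        % (W_NS, body)
    )
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", PACKAGE_RELS)
        package.writestr("word/document.xml", document)
```

Calling write_minimal_docx("hello.docx", ["Hello from the server."]) needs nothing installed beyond Python itself; styles, images, and headers each add further parts and relationships.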
I've already +1'd Mitch's reply, but as an aside: Word isn't even supported for use in service applications; it is designed to be user-interactive. So installing Word, even if it worked, wouldn't leave you in a great place.
If you're just generating the documents from scratch the solutions so far proposed work well. My situation was that I had an existing template that I needed to use and substitute in my own text in a few places (mail merge, if you will). This was several years ago - prior to Office 2007 - but we ended up going with the Aspose library of components for this. I've used the Words and Cells (Excel) components to generate documents from templates and spreadsheets on the fly to download from web sites. The interfaces are a little clunky and can be inconsistent between the various products. The installer, frankly, is awful, but the products work pretty well and made it much easier to do what needed to be done.
Word recognizes RTF natively, and if your intended document can be constructed as whatever.rtf (which, for all of its fancy formatting, is plain ASCII markup), then you should be able to write the document without Word installed.
To get the picture, create an example document and save it as an rtf file. Then view that file with an ascii text editor (like Notepad). You'll have to learn rtf syntax, but there's at least one handbook around on that.
AS
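A minimal sketch of what writing RTF by hand can look like, in Python with the standard library only. The function names, the Arial font, and the 12 pt size (\fs24 is in half-points) are arbitrary choices for the example; real documents will need more of the RTF syntax than this.

```python
def rtf_escape(text):
    """Escape plain text so it can be placed inside an RTF group."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\line ")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: \uN with N as a signed 16-bit UTF-16 code unit,
            # followed by "?" as the fallback for readers without Unicode.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append("\\u%d?" % unit)
    return "".join(out)


def build_rtf(paragraphs):
    """Return a complete RTF document containing the given paragraphs."""
    body = "".join(rtf_escape(p) + "\\par\n" for p in paragraphs)
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0\\fs24 "
    return header + body + "}"
```

The result is pure ASCII, so it can be written with open(path, "w", encoding="ascii") and saved with an .rtf (or .doc) extension.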
Just to add another potential solution for you, OfficeWriter is a Word/Excel API that lets you create documents and spreadsheets in ASP.NET without using Office:
http://www.officewriter.com