Without getting into a large debate about the merits of doing so, can someone provide direction on using a VSTO application-level add-in (Word 2007) to open an MS Word document from either a database or a web service?
Thank you
Jacob,
Are you suggesting
PC/AddIn Queries Server for a Document
Server returns document to PC/AddIn
PC/AddIn saves document locally (as temp file)
PC/AddIn uses Word's Open document functionality to open the file locally
Then
PC/AddIn saves the file locally
PC/AddIn Uploads the file back to the server
That doesn't sound quite so hard... In fact it is the type of solution that has a level of simplicity that makes writing / debugging easy.
What advantage does one have using the above methodology as opposed to WebDAV? Apparently WebDAV is what Alfresco uses...
Another question, though: does Word not have functionality in its API to open documents from a stream?
T
As Jacob noted, you could save the blob as a tmp file, and then open it in the normal way. This is the easiest, though if you need to write the edits back, you'll also need to think about locking.
If you need to worry about those things, WebDAV starts to look more interesting. You could open via WebDAV if you can make your server-side support this, and let Word do the rest (although the document may be read only, depending on client config and server).
Finally, if it is a docx, you could avoid the tmp file by inserting a Flat OPC version into a new Word document using InsertXML. This is a bit more complicated (since you have to build the Flat OPC XML, though there is code for this in an MSDN blog post somewhere), but if you find yourself using InsertXML for other reasons, this might be attractive.
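The temp-file round trip from the first option can be sketched as follows. This is a minimal illustration in Python, not VSTO code: `fetch`, `edit`, and `upload` are hypothetical callables standing in for the server download, for Word opening and saving the file, and for the upload. It does not deal with locking.

```python
import os
import tempfile
from typing import Callable


def round_trip(fetch: Callable[[], bytes],
               edit: Callable[[str], None],
               upload: Callable[[bytes], None],
               suffix: str = ".docx") -> None:
    """Download a document, edit it through a temp file, and upload the result."""
    # Create a uniquely named temp file and write the server's blob to it.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(fetch())
        # Hypothetical step: in an add-in, this is where Word would open
        # the file and the user would edit and save it.
        edit(path)
        # Read the saved file back and send it to the server.
        with open(path, "rb") as f:
            upload(f.read())
    finally:
        # Remove the temp copy whether or not the upload succeeded.
        os.remove(path)
```

Passing the three steps in as callables keeps the file handling separate from whatever transport (database, web service) ends up being used.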
Related
I'm comfortable generating Word documents using Aspose.Words (which can also save as a PDF) but I've recently been asked to do the same thing using a PDF as the starter template. We recently bought Aspose.Total and whilst Aspose.Pdf looks like it can do some manipulations it doesn't look to be all that flexible/easy (like adding a big line of text and getting it to wrap, and shifting other content down the page if it takes up more space).
What would be the best way of using a PDF as a template for what is basically a bit like a mail merge from a database? Should I turn it into a PDF form and merge it from an XML data source? Is this even viable or would such a form still have a limitation on spacing (so that longer lines/paragraphs of text won't reflow the document where necessary)?
From what I can tell it doesn't look like InDesign can be manipulated in C# even via a COM object (which would be nasty on a web server anyway).
If I recreated the InDesign/PDF as a Word document I'm sure I could work wonders, but you know what these publishing types are like, who think Word documents are the tool of the devil. These PDFs are never going to a professional printer anyway; they're just brochures for a client to download from a web page (based on information in a database) for printing/use at home.
You do indeed have many options for such a web-to-print project. Choosing one is a matter of budget, requirements and user count. At its simplest, dynamic content can be placed using PDF forms filled from XML data.
On the other hand, you can work with InDesign Server and output PDFs based on InDesign templates. That's generally a good choice when a large number of users need to get rich PDF files in parallel. But the costs are heavy.
You can also consider PitStop Server or callas pdfToolbox Server to place dynamic text based on variables that you supply. The advantage is that you don't need much coding. Those apps are ready to use.
Finally, you can consider command-line tools. A few of them, such as pdftk or cpdf, may have some useful commands for merging text.
I know this is a little subjective, but I'm looking into the following situation:
I need to produce a number of documents automatically from data in a SQL Server database. There will be an MVC3 app sat on the database to allow data entry etc. and (probably) a "Go" button to produce the documents.
There needs to be some business logic about how these documents are created, named and stored (e.g. "Parent" documents get one name and go in one folder, "Child" documents get a computed name and go in a sub-folder).
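To make the naming requirement concrete, here is a minimal sketch of such a rule in Python. The layout is purely an assumption for illustration: the "children" sub-folder name and the "parent-key" naming pattern are hypothetical, not something the requirements above specify.

```python
from pathlib import PurePosixPath
from typing import Optional


def output_path(root: str, parent_name: str,
                child_key: Optional[str] = None,
                ext: str = "pdf") -> PurePosixPath:
    """Return where a document should be stored.

    Hypothetical rule: a parent goes to <root>/<parent>/<parent>.<ext>;
    a child goes to a "children" sub-folder of the parent's folder, with
    a name computed from the parent name and the child's key.
    """
    folder = PurePosixPath(root) / parent_name
    if child_key is None:
        return folder / f"{parent_name}.{ext}"
    return folder / "children" / f"{parent_name}-{child_key}.{ext}"
```

Keeping the rule in one small function means the same logic can drive whichever generator is chosen (PDF, docx, or both).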
The documents can either be PDF or Doc(x) (or even both), as long as the output can be in EN-US and AR-QA (RTL text)
I know there are a number of options from SSRS, Crystal Reports, VSTO, "manual" PDF in code, Word mail merge, etc... and we already have an HTML to PDF tool if that's any use?
Does anyone have any real world advice on how to go about this and what the "best" (most pragmatic) approach would be? The less "extras" I need to install and configure on a server the better - the faster the development the better (as always!!)
Findings so far:
Word Mail Merge (or VSTO)
Simply doesn't offer the simplicity, control and flexibility I require - shame really. Would be nice to define a dotx and be able to pass in the data to it on an individual basis to generate the docx. The only way I could achieve this (and I may be wrong here) was to loop through controls/bookmarks by name and replace the values... messy.
OpenXML
Creating documents based on dotx templates, even using OpenXML is not as simple as (IMHO) it should be. You have to replace each Content control by name, so maintenance isn't the simplest task.
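One way to take some of the sting out of that maintenance is to check a template's content controls against the data before generating anything. The sketch below is Python using only the standard library, not the OpenXML SDK. It assumes the main document part sits at the conventional path word/document.xml and that each content control carries a tag (w:tag); it ignores headers and footers. `content_control_tags` and `missing_fields` are names made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def content_control_tags(docx_bytes: bytes) -> list:
    """Return the w:tag values of all content controls in the main part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        # Assumption: the main part is at the conventional location.
        root = ET.fromstring(pkg.read("word/document.xml"))
    tags = []
    for sdt in root.iter(f"{{{W}}}sdt"):
        tag = sdt.find(f"{{{W}}}sdtPr/{{{W}}}tag")
        if tag is not None:
            tags.append(tag.get(f"{{{W}}}val"))
    return tags


def missing_fields(docx_bytes: bytes, data: dict) -> set:
    """Tags defined in the template that the data dictionary does not supply."""
    return set(content_control_tags(docx_bytes)) - set(data)
```

This only reads the package; writing values back is where the real work (and the care needed to keep the XML valid for Word) lies.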
SSRS
On the face of it this is a good solution (although it needs SQL Enterprise), however it gets more complicated if you want to dynamically produce the folders and documents. Data driven subscription gets very close to what I want though.
Winnovative HTML to PDF Converter
This is the tool we already have (albeit a .NET 2.0 version). It allows me to generate the HTML pages and convert them to PDF. A good option for me, since I can run it on an MVC3 website and pass the parameters into the controllers to generate the PDFs. This gives me much finer-grained control over the folder and naming structures; the issue with this method is simply generating the pages in the correct way. A bonus is that it automatically gives me a "preview"... basically just the HTML page!
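For the HTML route, the EN-US / AR-QA requirement mostly comes down to emitting the right lang and dir attributes on the page before it is handed to the converter. A minimal sketch in Python, assuming a trivial page layout; the template, the `render_page` name, and the list of right-to-left languages are illustrative assumptions, and the converter call itself is not shown.

```python
from html import escape
from string import Template

# Hypothetical page skeleton; a real page would carry the actual layout.
PAGE = Template(
    '<!DOCTYPE html>\n'
    '<html lang="$lang" dir="$direction">\n'
    '<head><meta charset="utf-8"><title>$title</title></head>\n'
    '<body><h1>$title</h1><p>$body</p></body>\n'
    '</html>\n'
)

# Assumption: a small set of right-to-left language codes is enough here.
RTL_LANGUAGES = {"ar", "he", "fa", "ur"}


def render_page(title: str, body: str, culture: str = "en-US") -> str:
    """Render an HTML page with text direction chosen from the culture name."""
    language = culture.split("-")[0].lower()
    direction = "rtl" if language in RTL_LANGUAGES else "ltr"
    return PAGE.substitute(
        lang=escape(culture),
        direction=direction,
        title=escape(title),   # escape values so data cannot inject markup
        body=escape(body),
    )
```

The same rendered string serves as both the browser "preview" and the converter's input.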
Office OpenXML is a nice and simple way of generating Office files. XSLT can be a strong tool for formatting your content. This technology will not let you create PDFs.
Fast development without using any third party components will be difficult. But if you do consider using a report server, make sure to check out BIRT or Jasper.
To generate PDFs I have been using the deprecated Report.NET. It has many ports to different languages and is still sufficient for making simple PDFs. Report.NET on SourceForge
I don't think SQL Server itself can produce PDF files. What you can do is, as you mentioned, install an instance of SSRS and create a report that produces the information you need. Then you can create a subscription to deliver your report to where you want, when you want.
Here is an example of a simple subscription:
Go for SSRS only if you are OK with setting it up on a server and there is a definite need for scheduled reporting and complex reports.
If you already have code for manual PDF/docx generation, I would suggest going ahead with it, provided the complexity of that code is not a problem for you.
I have used both in separate scenarios. We used the Excel classes and objects from .NET for minimal reporting from a web application,
but went for an elaborate reporting scheme for a system which required thousands of reports to be generated on a schedule and delivered to a selected set of people.
Currently we are saving files (PDF, DOC) into the database as BLOB fields. I would like to be able to retrieve the raw text of the file to be able to manipulate it for hit-highlighting and other functions.
Does anyone know of a simple way to parse the files and save the raw text on save, either via SQL or .NET code? I have found that Adobe has a filtdump utility that will convert the PDF to text. Filtdump seems to be a command-line tool, and I don't see a way to use a file stream. And what would the extractor be for Office documents and other file types?
-or-
Is there a way to pull out the raw text from the SQL Full text index, without using 3rd party filters?
Note: I am trying to build a .NET and MSSQL solution without having to use a third-party tool such as Lucene.
If it isn't absolutely necessary to stream directly from SQL Server into your app, the hard part is parsing the PDF or DOC file formats.
The iTextSharp library will give you access to the innards of a PDF file:
http://itextsharp.sourceforge.net/
Here's a commercial product that claims to parse Word docs:
Aspose.Words
Edited to add:
I think you're also asking if there are ways to make SQL Server Full-text Indexing do the work for you by adding IFilters. This sounds like a good idea. I haven't done this myself, but MS has apparently supported a Word filter for a long time, and now Adobe has released a (free) PDF filter. There's a lot of information here:
Filter Central
10 Ways to Optimize SQL Server Full-text Indexing
SQL Server Full Text Search: Language Features - a little out of date but easy to understand.
The SQL Server Full-Text Search feature uses IFilters to extract plain text from PDF or Office file formats. You can install IFilters on your server, or, if your code is running on the same machine as SQL Server, you already have them.
Here is an article which shows how to use IFilters from .NET: http://www.codeproject.com/KB/cs/IFilter.aspx
From your C# application, you could open the .doc file, save it as text, and put both the text and the .doc document into the database.
If you are using SQL 2008, then you could consider using the new FILESTREAM feature.
Your data is stored in a varbinary(max) column, but you can also access the raw data via a regular Win32 handle.
Here's some sample code showing how to get the handle.
I had this same issue... I solved it by adding the following to my application:
EPocalipse.IFilter.dll (for everything but Office 2007 documents, due to 64-bit Windows issues)
OpenXML SDK 2.0 (for Office 2007 documents)
I use these to grab the plain text and then store it in the database alongside the binary data. Keep in mind that I am certainly not an expert, so there may be a better way to do this, but this works for everything but "Quick Save" pre-2007 Word documents, which apparently aren't read by IFilters. I just have my users resave the document if that error occurs, and everything works fine.
Let me know if you'd like some sample code... I would post it right now, but it's a bit long.
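For readers who want a rough idea of the Office 2007 half of this approach, the sketch below pulls paragraph text out of a docx using only the Python standard library. It is not the poster's code and does not use the OpenXML SDK. It assumes the main part is at the conventional path word/document.xml, and it ignores headers, footers, tabs and line breaks; `docx_text` is a name made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(docx_bytes: bytes) -> str:
    """Return the text of the main document part, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        # Assumption: the main part is at the conventional location.
        root = ET.fromstring(pkg.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        # Join the text runs (w:t elements) that make up this paragraph.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return "\n".join(paragraphs)
```

The resulting string can be stored in a text column next to the BLOB and used for hit-highlighting.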
I currently have a website (ASP.NET 3.5, IIS 7.0) that allows users to upload Excel files for processing.
Should I be concerned with viruses and malicious code being executed when the document is opened?
We are currently using the .NET Office.Interop assemblies to fetch the information from the document. The information isn't exactly tabular and requires a little bit of interrogation to get it into the required format.
Once the document has been uploaded it will be stored in the database; it is written to disk only when the document is inspected.
Are there any recommendations that would provide a secure implementation?
Using the xlsx (Open XML) file format will be safer than using xls or xlsm since xlsx workbooks cannot contain macros.
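Since a file's extension says nothing reliable about its contents, one possible first check on an upload is to look at the bytes themselves. The sketch below (Python, standard library only) is a heuristic under stated assumptions: it treats the OLE compound-file signature as legacy xls, treats a ZIP package containing xl/workbook.xml (the conventional workbook part name) as Open XML, and flags any package containing a vbaProject.bin part as macro-enabled. It is not a virus scanner, and `classify_workbook` is a name made up for this example.

```python
import io
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE compound file)
ZIP_MAGIC = b"PK\x03\x04"                      # Open XML package (.xlsx, .xlsm)


def classify_workbook(data: bytes) -> str:
    """Return 'xls', 'xlsx', 'macro-enabled', or 'unknown' for uploaded bytes."""
    if data.startswith(OLE_MAGIC):
        return "xls"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as pkg:
                names = [name.lower() for name in pkg.namelist()]
        except zipfile.BadZipFile:
            return "unknown"
        # Assumption: the workbook part uses its conventional name.
        if "xl/workbook.xml" not in names:
            return "unknown"
        # A VBA project part means the package carries macros.
        if any(name.endswith("vbaproject.bin") for name in names):
            return "macro-enabled"
        return "xlsx"
    return "unknown"
```

An upload handler could accept only the "xlsx" result and reject everything else with a generic format error.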
You might consider using a pure .NET component which does not use COM Interop or any native calls and does not require FullTrust. SpreadsheetGear for .NET is an example of such a component.
Disclaimer: I own SpreadsheetGear LLC
Sanitizing is the only way to be sure. Since it's not simply form input, you want to take extra precautions. The simplest method I can imagine is to nuke any binary-indicators, like control-characters.
As far as best practices, you can't really tell your users "Please don't hack me", so you have to have a certain level of trust (or give up on Excel files)... I would say if the first pass picks up any binary flags, incinerate it and throw a fairly obtuse error like "error in file format", etc.
But of course, your users will murder you if they ever get that error for a good file.
I am just playing around and trying to make a very simple CMS. Right now I use "FtpWebRequest" to get the file that they want to change and stick it into a jQuery plugin called html area (a rich HTML editor).
Now I am wondering how I could allow them to add images that are not already hosted, i.e. not on ImageShack or something.
I am guessing I need to somehow upload the file and then store it somewhere, but I'm not sure how to do all that.
Thanks
A common approach for CMS systems that need to work in low-trust environments (like shared hosting) is to use the FileUpload control, and save the uploaded file as a binary (BLOB) in a database. This avoids dealing with the headache of disk access rights on the web server.
If you're using SQL Server, here's a great article on the database side of things (storing images as BLOBs).
The .NET side of things is pretty straightforward. The FileUpload.PostedFile property has all the information about the uploaded file, including a byte stream of its data.
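The database side of this pattern is small. As a rough illustration only, here it is in Python with SQLite standing in for SQL Server; the `images` table, its columns, and the `save_image` / `load_image` helpers are all hypothetical, and the byte string plays the role of the uploaded file's data.

```python
import sqlite3
from typing import Optional, Tuple


def save_image(conn: sqlite3.Connection, filename: str,
               content_type: str, data: bytes) -> int:
    """Store uploaded bytes as a BLOB and return the new row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        " id INTEGER PRIMARY KEY,"
        " filename TEXT NOT NULL,"
        " content_type TEXT NOT NULL,"
        " data BLOB NOT NULL)"
    )
    cur = conn.execute(
        "INSERT INTO images (filename, content_type, data) VALUES (?, ?, ?)",
        (filename, content_type, data),
    )
    conn.commit()
    return cur.lastrowid


def load_image(conn: sqlite3.Connection,
               image_id: int) -> Optional[Tuple[str, bytes]]:
    """Return (content_type, data) for an image, or None if it does not exist."""
    return conn.execute(
        "SELECT content_type, data FROM images WHERE id = ?", (image_id,)
    ).fetchone()
```

A page that serves the image would look the row up by id and write the bytes to the response with the stored content type.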