Architecture to Save word documents with search options

We are building an internal application where users can save Word documents in the system. The issue is that users must also be able to search for these documents by keyword.
We use ASP.NET, C# and SQL Server 2008. I was thinking of saving these documents in a varchar field and then searching that field for keywords, or do I need to use full-text search with Solr/Lucene?
I would like to know whether this is an efficient design for this purpose.
Thanks in advance!

If you have to store Word documents in the database and want to search them by a fixed set of keywords, use a Virtual Path Provider: each time a document is saved, put its keywords in a DB field and search on that field. This method gets around the DB copy that John3136 mentioned.
If you need to search the content of the documents, you won't be able to do that if the files are saved as blobs. For that purpose it may make more sense to save the documents as Word 2003 XML and have the full-text search ignore everything between angle brackets, e.g.:
Regex.Replace(dBFieldOfWordXMLData, @"<[^>]*>", string.Empty);
I think the most efficient way is to use a Virtual Path Provider; MSDN articles and SharePoint documents use one, and they are searchable. I've done some research on what the most efficient solution would be and came across EpiServer CMS on Azure: http://episerverazurevpp.codeplex.com/

Without more details this is impossible to answer sensibly. A few things to consider:
Are you saying you would save the whole doc into a varchar field in a DB? That doesn't really sound smart - you have the whole problem of keeping the DB copy in sync with the disk copy (not to mention the whole idea of a DB copy in the first place...)
You mention keywords: if there is a limited number of keywords, then it is fairly easy to write an Office interop app that searches a Word doc for keywords. You could either do this on save and keep a DB of which docs contain which words, or you could do it "on the fly" (i.e. an app that searches a whole folder full of docs for the ones containing a particular word) - it all depends on how many docs you are likely to have, required performance etc.
Could you do something with the document properties (add your own custom property corresponding to a keyword) and search for files with that property?
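To make the "do this on save and keep a DB of which docs contain which words" idea a bit more concrete, here is a minimal sketch. It assumes the plain text has already been extracted from the Word file (via interop or otherwise), and it keeps the mapping in memory rather than in a real table. The class name KeywordIndex and its methods are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: maps each tracked keyword to the documents that contain it.
public class KeywordIndex
{
    // keyword -> set of document paths (keys compared case-insensitively)
    private readonly Dictionary<string, HashSet<string>> _index =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Call this on save, passing the document's already-extracted plain text.
    public void AddDocument(string path, string text, IEnumerable<string> keywords)
    {
        foreach (var keyword in keywords)
        {
            // Whole-word, case-insensitive match; intended for plain alphanumeric keywords.
            var pattern = @"\b" + Regex.Escape(keyword) + @"\b";
            if (!Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase))
                continue;

            HashSet<string> docs;
            if (!_index.TryGetValue(keyword, out docs))
            {
                docs = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
                _index[keyword] = docs;
            }
            docs.Add(path);
        }
    }

    // Returns the paths of all documents recorded for the keyword, sorted by path.
    public List<string> Find(string keyword)
    {
        HashSet<string> docs;
        return _index.TryGetValue(keyword, out docs)
            ? docs.OrderBy(p => p, StringComparer.OrdinalIgnoreCase).ToList()
            : new List<string>();
    }
}
```

In a real system the dictionary would probably be a keyword/document table in SQL Server, but the lookup logic stays the same.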

Related

Search for keywords in Word documents and index them

I'm looking for a way to search in Word documents and show a list of the documents that match the search criteria. I'll try to describe the scenario in more detail here.
On a Windows system I have a bunch of folders. Each folder has a lot of Word documents. Now I need an application that can search inside a specific folder for keywords that might occur in those Word documents. Something like the FULLTEXT search that MySQL has.
So if I search for the following keywords: microsoft, windows XP then I want it to list every Word document that contains one or more of those keywords.
Of course, the more often those keywords appear in a document, the higher its rank should be in the resulting list.
Now my question is, is there a tool out there that does exactly this? Or am I better off writing such a tool myself in C#.NET? If so, which APIs do I have to look at?
PS. They are .doc and .docx files.

Looks like you need a full-blown search engine to me, including parsing, indexing, ranking, search, etc. Probably not very pleasant to implement it yourself... You could have a look at Apache Lucene.
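Just to illustrate the ranking part on its own (this is not how Lucene scores documents), here is a rough sketch that ranks documents by the raw number of keyword occurrences, as the question describes. It assumes the text of each .doc/.docx has already been extracted into a string. KeywordRanker and its method names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: ranks documents by how often the keywords occur in them.
public static class KeywordRanker
{
    // Counts case-insensitive occurrences of a keyword or phrase.
    // Note: this is plain substring matching, so "window" also matches "windows".
    public static int CountOccurrences(string text, string keyword)
    {
        return Regex.Matches(text, Regex.Escape(keyword), RegexOptions.IgnoreCase).Count;
    }

    // textByPath: document path -> already-extracted plain text.
    // Returns (path, score) pairs for documents with at least one hit, highest score first.
    public static List<KeyValuePair<string, int>> Rank(
        IDictionary<string, string> textByPath, IEnumerable<string> keywords)
    {
        var keywordList = keywords.ToList();
        return textByPath
            .Select(d => new KeyValuePair<string, int>(
                d.Key, keywordList.Sum(k => CountOccurrences(d.Value, k))))
            .Where(r => r.Value > 0)
            .OrderByDescending(r => r.Value)
            .ThenBy(r => r.Key, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Scanning every document on each query like this only works for small folders; an index (Lucene, Windows Search) is what avoids the full scan.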

There is a tool right under your nose. It's Windows Search and it has an API which should meet your needs perfectly.
You might have to install the filter packs to provide Office-specific indexing if you don't have Office installed.

Indexing is available within Windows and can deal with Word documents:
http://windows.microsoft.com/en-US/windows7/Improve-Windows-searches-using-the-index-frequently-asked-questions
See also: make a windows highlight search in c#
If you want to build your own index, you can use IFilters to extract text from documents: How to extract text from MS office documents in C#

You could try the SmartFinder app, available on the Microsoft Store.
It's developed with Java and the Apache Lucene library.
You can search the text and immediately get an extract of the document with the searched words highlighted in the results.
You can refine your search with metadata (authors, keyword, publisher, ...) and you can also search with wildcards (for example with the * or ? special chars).
This is the Microsoft Store link to download the app: https://www.microsoft.com/store/apps/9PD0BCV3WKD1

How to produce documents (docx or pdf) from SQL Server?

I know this is a little subjective, but I'm looking into the following situation:
I need to produce a number of documents automatically from data in a SQL Server database. There will be an MVC3 app sat on the database to allow data entry etc. and (probably) a "Go" button to produce the documents.
There needs to be some business logic about how these documents are created, named and stored (e.g. "Parent" documents get one name and go in one folder, "Child" documents get a computed name and go in a sub-folder).
The documents can either be PDF or Doc(x) (or even both), as long as the output can be in EN-US and AR-QA (RTL text).
I know there are a number of options from SSRS, Crystal Reports, VSTO, "manual" PDF in code, Word mail merge, etc... and we already have an HTML to PDF tool if that's any use?
Does anyone have any real world advice on how to go about this and what the "best" (most pragmatic) approach would be? The less "extras" I need to install and configure on a server the better - the faster the development the better (as always!!)
Findings so far:
Word Mail Merge (or VSTO)
Simply doesn't offer the simplicity, control and flexibility I require - shame really. It would be nice to define a dotx and pass the data to it on an individual basis to generate the docx. The only way I could achieve this (and I may be wrong here) was to loop through controls/bookmarks by name and replace the values... messy.
OpenXML
Creating documents based on dotx templates, even using OpenXML is not as simple as (IMHO) it should be. You have to replace each Content control by name, so maintenance isn't the simplest task.
SSRS
On the face of it this is a good solution (although it needs SQL Enterprise), however it gets more complicated if you want to dynamically produce the folders and documents. Data driven subscription gets very close to what I want though.
Winnovative HTML to PDF Converter
This is the tool we already have (albeit a .NET 2.0 version). It allows me to generate the HTML pages and convert those to PDF. A good option for me, since I can run this on an MVC3 website and pass the parameters into the controllers to generate the PDFs. This gives me much finer-grained control over the folder and naming structures - the issue with this method is simply generating the pages in the correct way. A bonus is that it automatically gives me a "preview"... basically just the HTML page!

Office OpenXML is a nice and simple way of generating Office files. XSLT can be a strong tool to format your content. This technology will not let you create PDFs.
Fast development without using any third-party components will be difficult. But if you do consider using a report server, make sure to check out BIRT or Jasper.
To generate PDFs I have been using the deprecated Report.net. It has many ports to different languages and is still sufficient for making simple PDFs. Report.net on sourceforge

I don't think SQL Server itself can produce PDF files. What you can do is, as you mentioned, install an instance of SSRS and create a report that produces the information you need. Then you can create a subscription to deliver your report where you want, when you want.

Go for SSRS only if you are OK with setting it up on a server and there is a definite need for scheduled reporting and complex reports.
If you already have code for manual PDF/docx generation, I would suggest going ahead with it, as long as the complexity of that code is not a concern for you.
I have used both in separate scenarios. We used the Excel classes and objects from .NET for minimal reporting from a web application, but went for an elaborate reporting scheme for a system that required thousands of reports to be generated on a schedule and delivered to a selected set of people.

how to index a folder using lucene.net

I am trying to develop a search engine in ASP.NET using Lucene.NET. I have gone through many tutorials and pages but couldn't get the right results.
I have a folder with some files (doc, ppt, pdf, Excel etc.) and I want to search the contents of the files in that folder only; if no results are found there, the user should be asked to search the web instead.
For example, I have a folder with thousands of files at C:\test.
If the user searches for "miller", it should search every document. If results are found, it should display them like this:
Searched text    File                    No. of occurrences
miller           C:\test\1\file.doc      5
miller           C:\test\1\11\new.doc    2
Please help me, I am not getting the right results.

Lucene / Lucene.NET is just an indexing engine; you still have to extract the text yourself from the file types you want to support. On Windows you can use the IFilter interface for many file types; if you have Acrobat Reader 7+ installed, there should be built-in IFilter support for PDF files. As for the indexing part itself, there are many, many samples out there.
Also see this thread: What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?

Search filters with Lucene.NET

I'm using Lucene.Net to create a website to search books, articles, etc, stored as PDFs. I need to be able to filter my search results based on author name, for example. Can this be done with just Lucene? Or do I need a DB to store the filter fields for each document?
Also, what's the best way to index my documents? I'll have about 50 documents to start with, and periodically I'll have to add a bunch of documents to the index, maybe through a web form. Should I use a DB to store the document paths?
Thanks.

Here is a list of what you need to do IMO:
1. Extract raw text from PDF - please see this question which recommends iTextSharp for this purpose.
2. For each PDF document, create a Lucene.net document that has several fields: author, title, document text and whatever you want to search. It is recommended to also have a unique id field per document. I suggest you also store a field with the path to the original PDF document.
3. After indexing all the documents, you will have a Lucene index you can search by fields.
4. You can add new documents by repeating step 2. It is easier to do this offline - incremental updates are tough.

Lucene has a couple of different Analyzers that can scrub out the noise and do "stemming" which is helpful when you want to do fulltext searching, but you're still going to need to store the PDF itself somewhere. Lucene.Net is happy to build an index on the file system, and you could add a field to the Document it builds called something like "PATH" with the path to the document.

C# - Templated Printing from Object(s)

I'm in need of a solution to print or export (pdf/doc) from C#. I want to be able to design a template with place holders, bind an object (or xml) to this template, and get out a finished document.
I'm not really sure if this is a reporting solution or not.
I also don't want to have to roll my own printing / graphics code -- I'd like all display concerns handled in a template.
I initially think of this as something Crystal Reports can do (although I've never used CR), but I'm not sure if I'm abusing the system here -- I'm not really interested in binding ADO.NET datasets at the moment (screw datasets). Can Crystal deal with binding to objects?
Does SSRS or WPF play in this field too?

A subset of WPF-P is XPS which can be used to present your objects via databinding.
One of the best choices if you are already using WPF.
Google Keywords: XPS, FixedDocument, FlowDocument, WPF Printing

Might read through this thread:
http://groups.google.com/group/nhusers/browse_thread/thread/e2c2b8f834ae7ea8
Seems a lot of people like iTextSharp
http://itextsharp.sourceforge.net/

For Word docs, look into Word's Mail Merge feature and Word automation. I did this recently in a form letter printing project. Basically, I created a Word template file (file extension .dot) and defined MergeFields in it for a standard form letter. My application queries a database for the records it needs to print; for each record returned, it matches the database fields with these merge fields and sends the result (the merged doc) to the printer.
It's working really well, and if I had a link that gave a definitive explanation, I'd provide it (check back here, I'll see if I can find the most useful ones). Hopefully I've provided enough keywords to let you find your own resources. I can go into more detail if you need.
I've never had to export PDF files, but for a project I'm working on now I'll have to. For a free solution my research has led to iTextSharp (like Will Shaver points out), but I've only done the initial investigation and I have found a few paid solutions I might end up resorting to.
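The field-matching step described above can be sketched without Word at all. The following is not Word automation: it works on a plain string and uses a made-up {{FieldName}} placeholder syntax in place of real MergeFields, purely to show how each record's values get matched to the fields in a template. SimpleMerge is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: fills {{FieldName}} placeholders in a text template.
public static class SimpleMerge
{
    // record: field name -> value for a single database row.
    // Placeholders with no matching field are left untouched so they are easy to spot.
    public static string Merge(string template, IDictionary<string, string> record)
    {
        return Regex.Replace(template, @"\{\{(\w+)\}\}", match =>
        {
            string value;
            return record.TryGetValue(match.Groups[1].Value, out value)
                ? value
                : match.Value;
        });
    }
}
```

Calling Merge once per record gives one finished letter per record, which mirrors the loop over query results in the mail merge approach.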
