Search for keywords in Word documents and index them - c#

I'm looking for a way to search in Word documents and show a result of documents that matched the search criteria. I'll try to describe the scenario in more detail here.
On a Windows system I have a bunch of folders. Each folder has a lot of Word documents. Now I need an application that can search inside a specific folder for keywords that might occur in those Word documents. Something like the FULLTEXT search that MySQL has.
So if I search for the keywords microsoft, windows XP, I want it to list every Word document that contains one or more of those keywords.
Of course, the more often those keywords appear in a document, the higher its rank should be in the resulting list.
Now my question is: is there a tool out there that does exactly this? Or am I better off writing such a tool myself in C#/.NET? If so, which APIs should I look at?
PS. They are .doc and .docx files.

Looks like you need a full-blown search engine to me, including parsing, indexing, ranking, search, etc. Probably not very pleasant to implement it yourself... You could have a look at Apache Lucene.

There is a tool right under your nose. It's Windows Search and it has an API which should meet your needs perfectly.
You might have to install the filter packs to provide Office-specific indexing if you don't have Office installed.
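As a rough sketch of what querying Windows Search from C# could look like: the index is exposed through an OLE DB provider that accepts a SQL-like syntax. The class and method names below are illustrative, the folder and keywords are placeholders, and it assumes the folder is already included in the Windows index. On newer .NET versions System.Data.OleDb comes from a separate NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Data.OleDb;
using System.Linq;

public static class WindowsSearchSketch
{
    // Builds a Windows Search SQL query matching any of the keywords under a folder.
    public static string BuildQuery(string folder, IEnumerable<string> keywords)
    {
        // Each keyword becomes a quoted phrase; quote characters are dropped, not escaped.
        var phrases = keywords.Select(k => "\"" + k.Replace("\"", "").Replace("'", "") + "\"");
        string scope = folder.Replace('\\', '/').Replace("'", "''");
        return "SELECT System.ItemPathDisplay, System.Search.Rank FROM SystemIndex " +
               "WHERE SCOPE='file:" + scope + "' " +
               "AND CONTAINS('" + string.Join(" OR ", phrases) + "') " +
               "ORDER BY System.Search.Rank DESC";
    }

    // Runs the query against the local index and prints path and rank per match.
    public static void PrintMatches(string folder, IEnumerable<string> keywords)
    {
        const string connectionString =
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";
        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(BuildQuery(folder, keywords), connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}\t{1}", reader[0], reader[1]);
            }
        }
    }
}
```

Note that System.Search.Rank is the indexer's own relevance score, not a raw keyword count, so the ordering may differ from a simple "most occurrences first" list.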

Indexing is available within Windows and can deal with Word documents:
http://windows.microsoft.com/en-US/windows7/Improve-Windows-searches-using-the-index-frequently-asked-questions
make a windows highlight search in c#
If you want to build your own index, you can use IFilters to extract text from documents : How to extract text from MS office documents in C#
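If you do build your own index, the ranking the question asks for (more keyword hits, higher rank) could be sketched as below. It assumes the text has already been extracted from each document, for example via an IFilter; KeywordRanker, CountOccurrences and Rank are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordRanker
{
    // Counts non-overlapping, case-insensitive occurrences of keyword in text.
    public static int CountOccurrences(string text, string keyword)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(keyword)) return 0;
        int count = 0, index = 0;
        while ((index = text.IndexOf(keyword, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += keyword.Length;
        }
        return count;
    }

    // Returns (path, total hits) for every document with at least one hit, best first.
    public static List<KeyValuePair<string, int>> Rank(
        IDictionary<string, string> textByPath, IEnumerable<string> keywords)
    {
        var terms = keywords.ToList();
        return textByPath
            .Select(d => new KeyValuePair<string, int>(
                d.Key, terms.Sum(k => CountOccurrences(d.Value, k))))
            .Where(r => r.Value > 0)
            .OrderByDescending(r => r.Value)
            .ToList();
    }
}
```

This is plain substring matching, so "microsoft" also matches inside longer words. A real search engine would tokenize the text and weight terms rather than just count them.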

You could try SmartFinder APP available on Microsoft Store.
It's developed with Java and Apache Lucene library.
You can search the text and immediately have an extract of the document with the searched words highlighted in the results.
You can refine your search with metadata (authors, keyword, publisher, ...) and you can search also with wildcard (for example with * or ? special chars).
This is the Microsoft Store link to download the APP: https://www.microsoft.com/store/apps/9PD0BCV3WKD1

Related

Architecture to Save word documents with search options

We are building an internal application where users have the option to save Word documents in the system, but the issue is that users should be able to search for these documents by keyword.
We use ASP.NET, C# and SQL Server 2008. I was thinking of saving these documents in a varchar field and then searching those fields for keywords, or do I need to use full-text search with Solr/Lucene?
I would like to know whether this is an efficient design for this purpose.
Thanks in advance!
If you have to store Word documents in the database and you want to be able to search them by some classic keywords, then use a Virtual Path Provider: each time a document is saved, put some keywords in a DB field and search using those keywords. This method will get around the DB copy that John3136 mentioned.
If you need to be able to search on the content of the documents, you won't be able to do that if the files are saved as blobs, so for this purpose it may make more sense to save the documents as Word 2003 XML and configure a full-text search to ignore angle brackets, e.g.:
Regex.Replace(dBFieldOfWordXMLData, @"<[^>]*>", string.Empty);
I think the most efficient way is to use a Virtual Path Provider; MSDN articles and SharePoint documents use a Virtual Path Provider and they are searchable. I've done some research on what the most efficient solution would be and came across EPiServer CMS on Azure: http://episerverazurevpp.codeplex.com/
Without more details this is impossible to answer sensibly. A few things to consider:
Are you saying save the whole doc into a varchar field in a DB? That doesn't really sound smart - you have the whole problem of keeping the DB copy in sync with the disc copy (not to mention the whole idea of a DB copy in the first place...)
You mention keywords: If there are a limited number of keywords then it is fairly easy to write an office interop app that searches a word doc for keywords. You could either do this on save and keep a DB of which docs contain which words, or you could do it "on the fly" (i.e. an app that searches a whole folder full of docs for the ones containing a particular word) - it all depends on how many docs you are likely to have, required performance etc.
Could you do something with the document properties (add your own custom property corresponding to a keyword) and search for files with that property?

Code to read Word docs

I need a script (or other code, C#, etc.) that will fetch every paragraph/sentence containing a particular word in a set of Word 2007 documents and move them to a new Word document, recording the filename of the original (source) document they were extracted from.
What about using a document indexer such as dtSearch to index your documents (Word, PDF, etc.), and then tapping into its API to do your searches? From the sound of it, that might be the fastest way to accomplish this. Granted, indexers like dtSearch cost money (not a whole lot), but sometimes that is worth it compared to the hours you would spend writing your own code to do the same thing.
Some articles that I have found that might lead you in the right direction if you don't want to use an indexer are:
http://omegacoder.com/?p=555
and
http://weblogs.asp.net/guystarbuck/archive/2008/05/13/automated-search-and-replace-in-multiple-word-2007-documents-with-c.aspx
Edit
To find a sentence that contains a specific word, you can try this link http://msdn.microsoft.com/en-us/library/bb546163.aspx
This might give you a start: http://msdn.microsoft.com/en-us/library/ff834910.aspx
Office Interop is an option, but beware: it is not supported by MS in server-like scenarios (such as ASP.NET or a Windows Service), see http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2
You will need to use some library to achieve what you want:
MS provides the OpenXML SDK V 2.0 (free)
Aspose.Words (commercial)
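For simple cases the text can also be read without either library, because a .docx file is a zip archive whose main text lives in word/document.xml. The following is a minimal sketch using only System.IO.Compression and LINQ to XML; the class and method names are illustrative, and it only covers .docx (not .doc) and only the main document body, not headers or footnotes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxParagraphSearch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph in a .docx stream that contains the word.
    public static List<string> FindParagraphs(Stream docx, string word)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null) return new List<string>();
            using (Stream xml = entry.Open())
            {
                return XDocument.Load(xml)
                    .Descendants(W + "p")   // paragraphs
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                    .Where(text => text.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
                    .ToList();
            }
        }
    }

    // Yields (source file path, paragraph text) for every match in a folder.
    public static IEnumerable<KeyValuePair<string, string>> SearchFolder(string folder, string word)
    {
        foreach (string path in Directory.EnumerateFiles(folder, "*.docx"))
            using (FileStream file = File.OpenRead(path))
                foreach (string paragraph in FindParagraphs(file, word))
                    yield return new KeyValuePair<string, string>(path, paragraph);
    }
}
```

Each result keeps the source filename next to the paragraph, which is what the question asks to record. Writing the matches into a new Word document is left out of this sketch.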

File system searching (C#) against specified file list

My client wants to add a file system searching feature in a B/S application based on C#. It is a little special that the search shall be in a scope of specified file list but not a whole directory with just certain file extension.
I did some research on Microsoft Office Sharepoint Server Search Service, but couldn't get a clue whether it supports searching against specific files. I'm now using it to search PDF files, but not the same case of what I'm asking for.
Can anyone give me some suggestions what 3rd party search service/engine I should take for the requirement?
Thanks.
Elaine
I assume you are wanting full text indexing of a certain set of files?
Java has the best selection of libraries for this, but there are C# ports as well.
I highly recommend Lucene for indexing and retrieval:
http://incubator.apache.org/lucene.net/
If this is on a server, it might be easiest to run a Solr instance and use C# as the client:
http://crazorsharp.blogspot.com/2010/01/full-text-search-using-solr-lucene-and.html
Lucene has many examples on indexing different document types, but if you use Solr, it will handle that for you.

how to index a folder using lucene.net

I am trying to develop a search engine in ASP.NET using Lucene.NET. I have gone through many tutorials and pages but couldn't get the results I need.
I have a folder with some files (doc, ppt, pdf, excel, etc.) and I want to search the contents of the files within that folder only. If nothing is found within that folder, the user should be asked to search on the web.
For example, I have a folder with thousands of files at C:\test
and if the user searches for "miller", it should search every document. If results are found, it should display them like this:
Searched text   File                      No. of occurrences
miller          C:\test\1\file.doc        5
miller          C:\test\1\11\new.doc      2
Please help me, I am not getting appropriate results.
Lucene / Lucene.NET is just an indexing engine; you still have to extract the text from the file types that you want to support yourself. On Windows you can use the IFilter interface for many file types; if you have Acrobat Reader 7+ installed, there should be built-in IFilter support for PDF files. As for the indexing part itself, there are many, many samples out there.
Also see this thread What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?
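A minimal sketch of the indexing and search part, assuming the Lucene.NET 4.8 packages (Lucene.Net, Lucene.Net.Analysis.Common, Lucene.Net.QueryParser) and assuming the text has already been extracted from each file, e.g. through IFilter. The field names "path" and "content" and the class name are made up for illustration, and older Lucene.NET releases use a somewhat different API.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;

public static class LuceneSketch
{
    const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

    // Adds one Lucene document per file: the path is stored, the text is only indexed.
    public static void Index(Lucene.Net.Store.Directory dir, IDictionary<string, string> textByPath)
    {
        var analyzer = new StandardAnalyzer(MatchVersion);
        using (var writer = new IndexWriter(dir, new IndexWriterConfig(MatchVersion, analyzer)))
        {
            foreach (var pair in textByPath)
            {
                var doc = new Document();
                doc.Add(new StringField("path", pair.Key, Field.Store.YES));
                doc.Add(new TextField("content", pair.Value, Field.Store.NO));
                writer.AddDocument(doc);
            }
            writer.Commit();
        }
    }

    // Returns the stored paths of the best-scoring documents for the query text.
    public static List<string> Search(Lucene.Net.Store.Directory dir, string queryText, int maxHits)
    {
        var analyzer = new StandardAnalyzer(MatchVersion);
        using (var reader = DirectoryReader.Open(dir))
        {
            var searcher = new IndexSearcher(reader);
            Query query = new QueryParser(MatchVersion, "content", analyzer).Parse(queryText);
            var paths = new List<string>();
            foreach (ScoreDoc hit in searcher.Search(query, maxHits).ScoreDocs)
                paths.Add(searcher.Doc(hit.Doc).Get("path"));
            return paths;
        }
    }
}
```

For a folder such as C:\test the directory could be opened with Lucene.Net.Store.FSDirectory.Open on a separate index folder. Lucene orders hits by a relevance score rather than a plain occurrence count, so the "no. of occurrences" column from the question would have to be computed separately.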

Is there a way to generate word documents dynamically without having word on the machine

I am planning on generating a Word document on the webserver dynamically. Is there a good way of doing this in C#? I know I could script Word to do this, but I would prefer another option.
I've worked at a company in the past that really wanted generated word documents, in the end they were perfectly satisfied with RTF docs that had a ".doc" extension. Word has no problem recognizing and opening them.
The RTF docs were generated with iText.net (free .net library), the API is pretty easy to use, performs extremely well, you don't need word on the machine, also, you could extend to generating PDF, HTML, and Text docs in the future with very little effort. After four years the solution I created is still in place, so that's a little testimony in iText.net's favor.
It looks like the official iText page suggests that iText Sharp is the best .Net choice right now, so that's another option
You'd be better off generating an rtf file, which word will know how to open.
If you want to generate Office 2007 documents, check the Open XML File Formats; they're simple zipped XML files. Check these links:
Open XML File Formats: What is it, and how can I get started?
Introducing the Office (2007) Open XML File Formats
Edit: Check this project, can serve you as a good starting point:
DocumentMaker
It seems very simple and customizable; look at this code snippet:
Paragraph p = new Paragraph();
p.Runs.Add(new Run("Text can have multiple format styles, they can be "));
p.Runs.Add(new Run("bold and italic",
TextFormats.Format.Bold | TextFormats.Format.Italic));
doc.Paragraphs.Add(p);
Word will quite happily open an HTML file with a .doc extension. If you include an internal style sheet, you can have it fully formatted. There was a previous post on this subject:
Export to Word Document in C#
Creating the old .DOC files (pre-Word 2007) is nigh-impossible without Word itself. The format is just too complex. Microsoft has released the format description, but it's enough to reduce a grown programmer to tears. There is a reason for that too (historical), but that doesn't make things better.
The new .DOCX would be easier, although quite a bit of hassle still. However depending on which Word versions you are targeting, there are some other options too.
For one, there is the classic .RTF. The format is pretty complex still, yet well documented and has strong support across many applications and platforms. And you might use some string-replacing into template files to make things easier (it's non-binary).
Then there are the "old" Word XML files. I think they worked starting with Word XP. Kinda the predecessors of .DOCX. I've used them, not bad. And the documentation is pretty OK.
Finally, the easy way that I would choose, is to make a simple HTML. Word can load HTML files just fine starting with version 2000. In the simplest way just change the extension of a HTML file to .DOC and you have it. You can also add a few word-specific tags and comments to make it look even better in Word. Use the Word's Save As...HTML option to see what they are.
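The HTML route could be sketched roughly as follows. The class name, the styles and the output path are placeholders, and the Word-specific tags mentioned above are not included.

```csharp
using System.IO;
using System.Net;
using System.Text;

public static class HtmlDocWriter
{
    // Builds a small HTML page with an internal style sheet; user text is HTML-encoded.
    public static string BuildHtml(string title, string bodyText)
    {
        string safeTitle = WebUtility.HtmlEncode(title);
        return "<html><head><meta charset=\"utf-8\"><title>" + safeTitle + "</title>" +
               "<style>body { font-family: Calibri; } h1 { color: #336699; }</style></head>" +
               "<body><h1>" + safeTitle + "</h1><p>" + WebUtility.HtmlEncode(bodyText) +
               "</p></body></html>";
    }

    // Saves the HTML under a .doc file name so that Word is used to open it.
    public static void Save(string path, string title, string bodyText)
    {
        File.WriteAllText(path, BuildHtml(title, bodyText), Encoding.UTF8);
    }
}
```

Usage would be something like HtmlDocWriter.Save(@"C:\temp\report.doc", "Report", "Generated on the server"). The file is still HTML on disk; only the extension says .doc.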
There are third party libraries about that will do the job.
Doing a quick google came up with this one, for example.
I haven't tried any, so I can't give you specific advice, I'm afraid!
Let us know how you get on...
In Office 2007 Microsoft introduced a new file format called the Office Open XML format (.docx). This format is not compatible with older versions of Microsoft Word. Since it is (zipped) XML, you can create or read such files without having Word installed.
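To illustrate, here is a hedged sketch that writes a bare-bones .docx using only System.IO.Compression and LINQ to XML. It emits what should be the minimum set of parts (content types, package relationships, main document) with plain unformatted paragraphs; MinimalDocx is a made-up name, and anything beyond plain text (styles, tables, images) is better handled with the Open XML SDK or another library.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

public static class MinimalDocx
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    const string XmlHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>";

    const string ContentTypes = XmlHeader +
        "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">" +
        "<Default Extension=\"rels\" ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/>" +
        "<Default Extension=\"xml\" ContentType=\"application/xml\"/>" +
        "<Override PartName=\"/word/document.xml\" ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"/>" +
        "</Types>";

    const string Rels = XmlHeader +
        "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">" +
        "<Relationship Id=\"rId1\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\" Target=\"word/document.xml\"/>" +
        "</Relationships>";

    // Writes a .docx package with one plain paragraph per input string.
    public static void Write(Stream output, params string[] paragraphs)
    {
        var body = new XElement(W + "body");
        foreach (string text in paragraphs)
            body.Add(new XElement(W + "p",
                new XElement(W + "r", new XElement(W + "t", text))));   // text is XML-escaped

        var document = new XElement(W + "document",
            new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName), body);

        using (var zip = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true))
        {
            AddPart(zip, "[Content_Types].xml", ContentTypes);
            AddPart(zip, "_rels/.rels", Rels);
            AddPart(zip, "word/document.xml",
                XmlHeader + document.ToString(SaveOptions.DisableFormatting));
        }
    }

    static void AddPart(ZipArchive zip, string name, string xml)
    {
        // UTF-8 without a byte order mark
        using (var writer = new StreamWriter(zip.CreateEntry(name).Open(), new UTF8Encoding(false)))
            writer.Write(xml);
    }
}
```

On a web server the output stream could be the HTTP response stream or a MemoryStream. Leading and trailing spaces inside a run would additionally need xml:space="preserve" on the w:t element, which this sketch omits.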
Here is the component that generates document based on the custom template. The documents are generated from the sharepoint list ... so the data is pulled from the list item into the document on the fly:
http://store.sharemuch.com/products/generate-word-documents-from-sharepoint-list
Hope that helps,
Yaroslav Pentsarskyy
Blog: www.sharemuch.com
