How to index a folder using Lucene.NET - C#

I am trying to develop a search engine in ASP.NET using Lucene.NET. I have gone through many tutorials and pages but could not get the results I want.
I have a folder with some files (doc, ppt, pdf, Excel, etc.) and I want to search the contents of the files within that folder only. If no results are found within that folder, the user should be asked to search the web instead.
For example, I have a folder with thousands of files at C:\test
If the user searches for "miller", every document should be searched. If results are found, they should be displayed like this:
Searched text    File                     No. of occurrences
miller           C:\test\1\file.doc       5
miller           C:\test\1\11\new.doc     2
Please help, I am not getting appropriate results.

Lucene / Lucene.NET is just an indexing engine; you still have to extract the text yourself from the file types that you want to support. On Windows you can use the IFilter interface for many file types. If you have Acrobat Reader 7+ installed, there should be built-in IFilter support for PDF files. As for the indexing part itself, there are many, many samples out there.
Also see this thread: What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)?
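As a rough illustration of the split described above (text extraction on one side, indexing on the other), here is a minimal sketch of indexing a folder. It assumes the Lucene.NET 4.8 packages (Lucene.Net and Lucene.Net.Analysis.Common), which are newer than what was available when this was asked. The field names "path" and "content" and the extractText delegate are hypothetical; the delegate stands in for whatever IFilter-based or other extractor you use.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class FolderIndexer
{
    // extractText is a placeholder: it must return the plain text of a file
    // (for example via IFilter). Lucene.NET does not do this for you.
    public static void IndexFolder(string sourceFolder, string indexFolder,
                                   Func<string, string> extractText)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        using (var directory = FSDirectory.Open(indexFolder))
        using (var analyzer = new StandardAnalyzer(version))
        using (var writer = new IndexWriter(directory, new IndexWriterConfig(version, analyzer)))
        {
            // System.IO.Directory is fully qualified to avoid a clash with
            // Lucene.Net.Store.Directory.
            foreach (string path in System.IO.Directory.EnumerateFiles(
                         sourceFolder, "*", SearchOption.AllDirectories))
            {
                var doc = new Document
                {
                    // Stored verbatim so the path can be shown in results.
                    new StringField("path", path, Field.Store.YES),
                    // Analyzed for full-text search; not stored.
                    new TextField("content", extractText(path), Field.Store.NO)
                };
                writer.AddDocument(doc);
            }
            writer.Commit();
        }
    }
}
```

For plain text files you could pass File.ReadAllText as the extractor; binary formats such as .doc or .pdf need a real extractor.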

Related

Search for keywords in Word documents and index them

I'm looking for a way to search in Word documents and show a list of the documents that match the search criteria. I'll try to describe the scenario in more detail here.
On a Windows system I have a bunch of folders. Each folder has a lot of Word documents. Now I need an application that can search inside a specific folder for keywords that might occur in those Word documents. Something like the FULLTEXT search that MySQL has.
So if I search for the following keywords: microsoft, windows XP then I want it to list every Word document that contains one or more of those keywords.
Of course, the more often those keywords appear in a document, the higher its rank should be in the resulting list.
Now my question is: is there a tool out there that does exactly this? Or am I better off writing such a tool myself in C#/.NET? If so, which APIs do I have to look at?
PS. They are .doc and .docx files.
Looks like you need a full-blown search engine to me, including parsing, indexing, ranking, search, etc. Probably not very pleasant to implement it yourself... You could have a look at Apache Lucene.
There is a tool right under your nose. It's Windows Search and it has an API which should meet your needs perfectly.
You might have to install the filter packs to provide Office-specific indexing if you don't have Office installed.
Indexing is available within Windows and can deal with Word documents:
http://windows.microsoft.com/en-US/windows7/Improve-Windows-searches-using-the-index-frequently-asked-questions
make a windows highlight search in c#
If you want to build your own index, you can use IFilters to extract text from documents: How to extract text from MS office documents in C#
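To illustrate the ranking the question asks for (more keyword occurrences means a higher rank), here is a minimal sketch that uses only the .NET base class library. It assumes the text has already been extracted from each document, for example with an IFilter; the KeywordRanker class and its method names are made up for this example. A real search engine such as Lucene uses a more refined relevance score than a raw count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordRanker
{
    // Counts whole-word, case-insensitive occurrences of all keywords in the text.
    public static int CountOccurrences(string text, IEnumerable<string> keywords)
    {
        return keywords.Sum(k =>
            Regex.Matches(text, @"\b" + Regex.Escape(k) + @"\b", RegexOptions.IgnoreCase).Count);
    }

    // documents maps a file path to its already-extracted plain text.
    // Returns (path, count) pairs for matching documents, highest count first.
    public static List<KeyValuePair<string, int>> Rank(
        IDictionary<string, string> documents, IEnumerable<string> keywords)
    {
        List<string> terms = keywords.ToList();
        return documents
            .Select(d => new KeyValuePair<string, int>(d.Key, CountOccurrences(d.Value, terms)))
            .Where(r => r.Value > 0)
            .OrderByDescending(r => r.Value)
            .ToList();
    }
}
```

Calling Rank with the keywords "microsoft" and "windows XP" returns only the documents that contain at least one of them, ordered by total occurrence count.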
You could try the SmartFinder app, available on the Microsoft Store.
It's developed with Java and the Apache Lucene library.
You can search the text and immediately get an extract of the document with the searched words highlighted in the results.
You can refine your search with metadata (authors, keywords, publisher, ...) and you can also search with wildcards (for example with the * or ? special characters).
This is the Microsoft Store link to download the app: https://www.microsoft.com/store/apps/9PD0BCV3WKD1

Architecture to Save word documents with search options

We are building an internal application where users have the option to save Word documents in the system. The issue is that users should be able to search for these documents by keywords.
We use ASP.NET, C# and SQL Server 2008. I was thinking of saving these documents in a varchar field and then searching those fields for keywords, or do I need to use full-text search using Solr/Lucene?
I would like to know if this is an efficient design for this purpose.
Thanks in advance!
If you have to store Word documents in the database and you wish to be able to search them by some classic keywords, then use a Virtual Path Provider: each time a document is saved, put some keywords in a DB field and search using those keywords. This method gets around the DB copy that John3136 mentioned.
If you need to be able to search on the content of the documents, you won't be able to do that if the files are saved as blobs. For this purpose it may make more sense to save the documents as Word 2003 XML and configure the full-text search to ignore everything in angle brackets, e.g.:
Regex.Replace(dBFieldOfWordXMLData, @"<[^>]*>", string.Empty);
I think the most efficient way is to use a Virtual Path Provider; MSDN articles and SharePoint documents use a Virtual Path Provider and they are searchable. I've done some research on what the most efficient solution would be and came across EPiServer CMS on Azure: http://episerverazurevpp.codeplex.com/
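As a small, hedged illustration of the tag-stripping idea above, the sketch below removes everything in angle brackets from a Word XML string and then does a case-insensitive keyword check. The WordXmlText class and its method names are invented for this example. Note one deliberate difference from the one-liner above: tags are replaced with a space rather than an empty string so that text from adjacent elements does not run together. That choice can split a word that Word happened to store across two runs, and XML entities such as &amp; are not decoded, so treat this as a starting point only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordXmlText
{
    // Replaces every <...> tag with a space, then collapses runs of whitespace.
    public static string StripTags(string wordXml)
    {
        string text = Regex.Replace(wordXml, @"<[^>]*>", " ");
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Case-insensitive substring check against the stripped text.
    public static bool ContainsKeyword(string wordXml, string keyword)
    {
        return StripTags(wordXml).IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

You could run StripTags once when a document is saved and store the result in a separate column, so that the full-text index only ever sees plain text.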
Without more details this is impossible to answer sensibly. A few things to consider:
Are you saying save the whole doc into a varchar field in a DB? That doesn't really sound smart - you have the whole problem of keeping the DB copy in sync with the disk copy (not to mention the whole idea of a DB copy in the first place...)
You mention keywords: If there are a limited number of keywords then it is fairly easy to write an office interop app that searches a word doc for keywords. You could either do this on save and keep a DB of which docs contain which words, or you could do it "on the fly" (i.e. an app that searches a whole folder full of docs for the ones containing a particular word) - it all depends on how many docs you are likely to have, required performance etc.
Could you do something with the document properties (add your own custom property corresponding to a keyword) and search for files with that property?

File system searching (C#) against specified file list

My client wants to add a file system search feature to a B/S (browser/server) application based on C#. What is a little special is that the search must be scoped to a specified list of files, not a whole directory filtered by certain file extensions.
I did some research on the Microsoft Office SharePoint Server Search Service, but couldn't find out whether it supports searching against specific files. I'm currently using it to search PDF files, but that is not the same case as what I'm asking about.
Can anyone suggest which 3rd party search service/engine I should use for this requirement?
Thanks.
Elaine
I assume you want full-text indexing of a certain set of files?
Java has the best selection of libraries for this, but there are C# ports as well.
I highly recommend Lucene for indexing and retrieval:
http://incubator.apache.org/lucene.net/
If this is on a server, it might be easiest to run a Solr instance and use C# as the client:
http://crazorsharp.blogspot.com/2010/01/full-text-search-using-solr-lucene-and.html
Lucene has many examples on indexing different document types, but if you use Solr, it will handle that for you.
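If you go the Solr route, one way to restrict a search to a specified file list is a filter query on a path field. The sketch below only builds the request URL for Solr's select handler using its standard q, df, fq and wt parameters. Everything else is an assumption for illustration: the core name, the "content" and "path" field names, and the SolrQueryBuilder class itself all depend on how your own Solr schema is set up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SolrQueryBuilder
{
    // Builds a select URL that searches the (hypothetical) "content" field for
    // the user's text, restricted to documents whose (hypothetical) "path"
    // field matches one of the allowed paths.
    public static string BuildSelectUrl(string baseUrl, string core,
                                        string text, IEnumerable<string> allowedPaths)
    {
        // Quote each path as a phrase; backslashes and quotes must be escaped
        // for the Solr query parser.
        IEnumerable<string> quoted = allowedPaths.Select(
            p => "\"" + p.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"");
        string filter = "path:(" + string.Join(" OR ", quoted) + ")";

        return baseUrl.TrimEnd('/') + "/" + core + "/select"
             + "?q=" + Uri.EscapeDataString(text)
             + "&df=content"
             + "&fq=" + Uri.EscapeDataString(filter)
             + "&wt=json";
    }
}
```

For a long file list the URL can grow beyond what a GET request comfortably carries, in which case sending the same parameters in a POST body is the usual workaround.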

concatenating word documents and converting them to pdf

What is the best possible way to merge multiple documents and convert them to PDF? We also need to insert a blank page after every odd page.
A fully supported, server-side automated version of this (mostly baked into the MS camp though) involves using the OpenXML SDK to do any field inserts, then using SharePoint's Word Automation Services (SP 2010) to convert the documents to PDF, and then picking your favorite PDF toolkit (iTextSharp for me) for any post-processing (merging documents, inserting blank pages, or images that must be positioned relative to specific pages).
The reason for doing the document merge in PDF rather than OpenXML is simplicity - you don't have to deal with merging styles, headers etc.
The reason for doing the blank pages and image insertion is that OpenXML has no idea how to render the content, and so it has no idea where page breaks would occur naturally (you can still insert breaks like you would in Word though).
If you are using C# and you are OK with a server based solution then have a look at this post. It uses a .net friendly web services interface.
There is an optional SharePoint version available as well, but as you did not include a SharePoint tag I assume that won't be of interest to you.
Full disclosure, I wrote that post.

Lucene.net folder search

I am a newbie in Lucene.NET. I want to search for content in a folder which may contain all types of files (.txt, .xls, .pdf, .exe, .ppt, .doc, ...).
When I search for any content, I want to list the file path and the matched content inside the file (it should be highlighted), if any.
Any sample code would be appreciated.
Note: I want to use this result in a C# class library.
I haven't used it myself, but you should look into using SOLR. AFAIK you cannot host it on a .NET server, but you can connect to it from .NET using solrSHARP.
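Whichever engine ends up producing the matches, the "show the matched content highlighted" part can be sketched independently of it. The example below uses only the .NET base class library and is not the highlighting API of Lucene.NET or Solr. The SnippetHighlighter class is hypothetical; it marks the first match with square brackets and keeps a few characters of context on each side.

```csharp
using System;

public static class SnippetHighlighter
{
    // Returns a snippet around the first case-insensitive match of term,
    // with the match wrapped in [ ], or null if the term does not occur.
    public static string Highlight(string text, string term, int context = 30)
    {
        int index = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        if (index < 0) return null;

        int start = Math.Max(0, index - context);
        int end = Math.Min(text.Length, index + term.Length + context);

        return text.Substring(start, index - start)
             + "[" + text.Substring(index, term.Length) + "]"
             + text.Substring(index + term.Length, end - index - term.Length);
    }
}
```

In a web page you would emit a markup element instead of the brackets, after HTML-encoding the surrounding text.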
