I'm using Lucene.Net to create a website to search books, articles, etc, stored as PDFs. I need to be able to filter my search results based on author name, for example. Can this be done with just Lucene? Or do I need a DB to store the filter fields for each document?
Also, what's the best way to index my documents? I'll have about 50 documents to start with, and periodically I'll have to add a bunch of documents to the index, maybe through a web form. Should I use a DB to store the document paths?
Thanks.
Here is a list of what you need to do IMO:
1. Extract raw text from each PDF. Please see this question, which recommends iTextSharp for this purpose.
2. For each PDF document, create a Lucene.Net document with several fields: author, title, document text and whatever else you want to search. It is recommended to also have a unique id field per document. I suggest you also store a field with the path to the original PDF document.
3. After indexing all the documents, you will have a Lucene index you can search by fields.
4. You can add new documents by repeating step 2. It is easier to do this offline - incremental updates are tough.
Lucene has a couple of different Analyzers that can scrub out the noise and do "stemming" which is helpful when you want to do fulltext searching, but you're still going to need to store the PDF itself somewhere. Lucene.Net is happy to build an index on the file system, and you could add a field to the Document it builds called something like "PATH" with the path to the document.
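As a rough illustration of the field-per-document approach described in these answers, here is a minimal sketch. It assumes the Lucene.Net 4.8 packages (Lucene.Net, Lucene.Net.Analysis.Common, Lucene.Net.QueryParser), which may differ from the version you are using, and it assumes the PDF text has already been extracted. The class name PdfIndex and the field names are made up for the example.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical helper; the names are illustrative, not part of Lucene.Net.
public static class PdfIndex
{
    private const LuceneVersion Version = LuceneVersion.LUCENE_48;

    // Adds (or replaces) one PDF whose text has already been extracted.
    public static void Add(IndexWriter writer, string id, string author,
                           string title, string text, string path)
    {
        var doc = new Document();
        // StringField is indexed as a single exact token: good for ids and filters.
        doc.Add(new StringField("id", id, Field.Store.YES));
        doc.Add(new StringField("author", author, Field.Store.YES));
        // TextField is run through the analyzer: good for full-text search.
        doc.Add(new TextField("title", title, Field.Store.YES));
        doc.Add(new TextField("text", text, Field.Store.NO));
        // The PDF itself stays on disk; only its path is stored in the index.
        doc.Add(new StringField("path", path, Field.Store.YES));
        // Replaces any existing document with the same id, then adds this one.
        writer.UpdateDocument(new Term("id", id), doc);
    }

    // Full-text search on "text", restricted to an exact author name.
    public static List<string> Search(Directory dir, string words, string author)
    {
        using var analyzer = new StandardAnalyzer(Version);
        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);

        var query = new BooleanQuery();
        query.Add(new QueryParser(Version, "text", analyzer).Parse(words), Occur.MUST);
        query.Add(new TermQuery(new Term("author", author)), Occur.MUST);

        var paths = new List<string>();
        foreach (var hit in searcher.Search(query, 10).ScoreDocs)
            paths.Add(searcher.Doc(hit.Doc).Get("path"));
        return paths;
    }
}
```

To use it, open a directory with FSDirectory.Open("some-folder"), create an IndexWriter over it with an IndexWriterConfig and a StandardAnalyzer, call Add for each PDF, and dispose the writer before searching. Because the author is a StringField, the filter matches the exact, case-sensitive name only; index it as a TextField instead if you want analyzed matching.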
Related
I am trying to generate a Word document in an ASP.NET MVC application using an OpenXML template document.
The main challenges for me are:
How can I create a Word document as an OpenXML template? My Word document has some paragraphs of text, and in every paragraph I have to fill in information from the database. The Word file contains instances of placeholder text, and these should be filled with actual data. But I don't know how to convert a normal Word document into an OpenXML template file.
How can I fill in the values with data from the DB? If I have a model, say WordModel, with filled values for the properties FirstName, TotalAmount, AmountUnit, TotalCopies, etc., how can I fill those details into the template and allow the user to download the file?
What you want to do is called Mail Merge: populating Word documents that act as templates with your application data. There are some existing (commercial) solutions out there, but if you want to do it yourself using the Open XML SDK, you need to set up some kind of tagging mechanism to mark the parts of the document where the data/text from the database should be placed. For Word documents you have the following options: Bookmarks, Content Controls, Merge Fields, or special text (<% FirstName %>). I would personally go with Content Controls, as they offer the best user experience and they are pretty easy to parse and replace.
So the templates would be ordinary Word documents containing these tags. You could then use the Open XML SDK to parse these templates, find the tags in them and replace them with your application data according to each tag's name/code/title. This is a very abstract, high-level picture of a mail merge processor. Of course, a system like this is not easy to implement, and note that using Open XML requires some learning. There was a similar question answered, but you can probably find many more - just google.
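To make the Content Control option a bit more concrete, here is a minimal sketch using the Open XML SDK (the DocumentFormat.OpenXml NuGet package). It assumes each content control in the template has its Tag property set to a key such as FirstName; the class name TemplateFiller and the dictionary of values are made up for the example. It only looks at the main document body, not headers or footers.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper; the name and the tag-to-value convention are assumptions.
public static class TemplateFiller
{
    // Edits the file in place, so run it on a copy of the template.
    public static void Fill(string docxPath, IDictionary<string, string> values)
    {
        using var doc = WordprocessingDocument.Open(docxPath, true);
        var body = doc.MainDocumentPart.Document.Body;

        // SdtElement is the common base class of all content control elements.
        foreach (var sdt in body.Descendants<SdtElement>().ToList())
        {
            var tag = sdt.SdtProperties?.GetFirstChild<Tag>()?.Val?.Value;
            if (tag == null || !values.TryGetValue(tag, out var value))
                continue;

            // Put the value into the first text node and drop the others,
            // which discards any placeholder text that was split across runs.
            var texts = sdt.Descendants<Text>().ToList();
            if (texts.Count == 0)
                continue;
            texts[0].Text = value;
            foreach (var extra in texts.Skip(1))
                extra.Remove();
        }

        doc.MainDocumentPart.Document.Save();
    }
}
```

In an MVC action you could copy the template to a temporary file, call Fill with values taken from your model, and return the result as a file download. Formatting inside the control beyond the first run is lost in this simple version.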
I'm looking for a way to search in Word documents and show a result of documents that matched the search criteria. I'll try to describe the scenario in more detail here.
On a Windows system I have a bunch of folders. Each folder has a lot of Word documents. Now I need an application that can search inside a specific folder for keywords that might occur in those Word documents. Something like the FULLTEXT search that MySQL has.
So if i search for the following keywords: microsoft, windows XP then i want it to list every Word document that contains one or more of those keywords.
Of course, the more often those keywords appear in a document, the higher its rank should be in the resulting list.
Now my question is: is there a tool out there that does exactly this? Or am I better off writing such a tool myself in C#.NET? If so, which APIs should I look at?
PS. They are .doc and .docx files.
Looks like you need a full-blown search engine to me, including parsing, indexing, ranking, search, etc. Probably not very pleasant to implement it yourself... You could have a look at Apache Lucene.
There is a tool right under your nose. It's Windows Search and it has an API which should meet your needs perfectly.
You might have to install the filter packs to provide Office-specific indexing if you don't have Office installed.
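For what it's worth, one way to reach Windows Search from C# is through its OLE DB provider. The sketch below is only a starting point: it assumes Windows with the folder included in the index, and System.Data.OleDb being available (built into .NET Framework, a separate package on newer .NET). The class name WindowsSearchQuery is made up. System.Search.Rank is the indexer's own relevance score, so it only approximates "more occurrences rank higher".

```csharp
using System;
using System.Collections.Generic;
using System.Data.OleDb;
using System.Linq;

// Hypothetical helper around the Windows Search OLE DB provider.
public static class WindowsSearchQuery
{
    // Builds a query for files under 'folder' containing any of the keywords.
    public static string BuildSql(string folder, IEnumerable<string> keywords)
    {
        // Each keyword becomes a quoted phrase; quotes inside keywords are
        // removed or doubled so they cannot break out of the SQL literal.
        var terms = string.Join(" OR ", keywords.Select(
            k => "\"" + k.Replace("\"", "").Replace("'", "''") + "\""));

        return "SELECT System.ItemPathDisplay, System.Search.Rank FROM SystemIndex " +
               "WHERE SCOPE='file:" + folder.Replace("'", "''") + "' " +
               "AND CONTAINS('" + terms + "') " +
               "ORDER BY System.Search.Rank DESC";
    }

    // Runs the query and returns the matching file paths, best match first.
    public static List<string> Run(string sql)
    {
        var paths = new List<string>();
        using var conn = new OleDbConnection(
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';");
        conn.Open();
        using var cmd = new OleDbCommand(sql, conn);
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            paths.Add(Convert.ToString(reader[0]));
        return paths;
    }
}
```

For the example in the question you would call BuildSql with the folder path and the keywords "microsoft" and "windows XP", then pass the result to Run.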
Indexing is available within Windows and can deal with Word documents:
http://windows.microsoft.com/en-US/windows7/Improve-Windows-searches-using-the-index-frequently-asked-questions
make a windows highlight search in c#
If you want to build your own index, you can use IFilters to extract text from documents: How to extract text from MS office documents in C#
You could try SmartFinder APP available on Microsoft Store.
It's developed with Java and Apache Lucene library.
You can search the text and immediately have an extract of the document with the searched words highlighted in the results.
You can refine your search with metadata (authors, keyword, publisher, ...) and you can search also with wildcard (for example with * or ? special chars).
This is the Microsoft Store link to download the APP: https://www.microsoft.com/store/apps/9PD0BCV3WKD1
We are building an internal application where users have the option to save Word documents in the system, but the issue is that users should be able to search for these documents by keywords.
We use ASP.NET, C# and SQL Server 2008. I was thinking of saving these documents in a varchar field and then searching those fields for keywords, or do I need to use full-text search with Solr/Lucene?
I would like to know if this is an efficient design for this purpose.
Thanks in advance !
If you have to store Word documents in the database and want to search them by some classic keywords, use a Virtual Path Provider: each time a document is saved, put some keywords in a DB field and search using those keywords. This method gets around the DB copy that John3136 mentioned.
If you need to be able to search on the content of the documents, you won't be able to do that if the files are saved as blobs. For this purpose it may make more sense to save the documents as Word 2003 XML and configure a full-text search to ignore the angle-bracket tags, e.g.:
Regex.Replace(dBFieldOfWordXMLData, @"<[^>]*>", string.Empty);
I think the most efficient way is to use a Virtual Path Provider. MSDN articles and SharePoint documents use a Virtual Path Provider and they are searchable. I did some research on what the most efficient solution would be and came across EpiServer CMS on Azure: http://episerverazurevpp.codeplex.com/
Without more details this is impossible to answer sensibly. A few things to consider:
Are you saying save the whole doc into a varchar field in a DB? That doesn't really sound smart - you have the whole problem of keeping the DB copy in sync with the disc copy (not to mention the whole idea of a DB copy in the first place...)
You mention keywords: If there are a limited number of keywords then it is fairly easy to write an office interop app that searches a word doc for keywords. You could either do this on save and keep a DB of which docs contain which words, or you could do it "on the fly" (i.e. an app that searches a whole folder full of docs for the ones containing a particular word) - it all depends on how many docs you are likely to have, required performance etc.
Could you do something with the document properties (add your own custom property corresponding to a keyword) and search for files with that property?
I've created a Word document using the OpenXML SDK in C++/CLI, in which I've inserted bookmarks.
I need to open that Word document, search for the bookmarks present in it, and replace them with some text value.
Please suggest how to do this, with sample code or any links I can refer to.
Thanks in advance
I suggest you be a lot more specific. A bookmark can have paragraphs, images, tables, textboxes, etc. all in it. It can also start in the middle of a table and end outside the table. So replacing what's inside it can be very problematic.
So I'm going to take a guess as to what you want and from that might have an answer for you. I am guessing you want something where you place tags in the document and then your program can replace those tags with data. Instead of bookmarks use fields. There are a number of mailmerge fields that work great for this.
If this will work for you, then for the actual code, Descendants is the main thing you need.
I have several Word templates and I wish to use these to dynamically create Word documents in my app. I wish to avoid using automation at all costs as this is no good. I know that I can use both HTML and XML to create word documents but I just don't know where to start with regards to using a template that may well have images in the footer or the header of a document.
I use the OpenXML SDK with Word 2007. After you get the hang of it, it's not so bad. I have several template docx files that I scan through to search and replace for placeholder strings with what I want, and then can stitch together multiple templates into one document if I want to. It's nice because I can start with docx files as the template and modify them while the whole time staying within the realm of the docx format. If an image is in the docx when you start modifying it, it'll be there after you re-save it after modification (provided you didn't programmatically remove it of course).
If you have more details with what you'll be doing, let us know.
You could use DocX. It's free, very easy to use, has nice tutorials and is feature rich. It works only with DOCX documents, though. Also, development is currently on hold until the author finishes his semester. Here's a detailed blog about it.
It has a good example of using a template in its Invoice Example.
MigraDoc (http://www.pdfsharp.net/MigraDocOverview.ashx) is a free utility for exporting PDF/Word/HTML files. I have not worked with it using templates yet; however, you could use DDL files to persist a layout for your files so it can be reused.