I'm trying to create a search engine for all literature (books, articles, etc), music, and videos relating to a particular spiritual group. When a keyword is entered, I want to display a link to all the PDF articles where the keyword appears, and also all the music files and video files which are tagged with the keyword in question. The user should be able to filter it with information such as author/artist, place, date/time, etc. When the user clicks on one of the results links (book names, for instance), they are taken to another page where snippets from that book everywhere the keyword is found are displayed.
I thought of using the Lucene library (or Searcharoo) to implement my PDF search, but I also need a database to tag all the other information so that results can be filtered by author/artist information, etc. So I was thinking of having tables for Text, Music, and Videos, and a field containing the path to the file for each. When a keyword is entered, I need to search the DB for music and video files, and also need to search the PDF's, and when a filter is applied, the music and video search is easy, but limiting the text search based on the filters is getting confusing.
Is my approach correct? Are there better ways to do this? Since the search content is limited only to the spiritual group, there is not an infinite number of items to search. I'd say about 100-500 books and 1000-5000 songs.
Lucene is a great way to get up and running quickly without too much effort, along with several areas for extending the indexing and searching functionality to better suit your needs. It also has several built-in analyzers for common file types, such as HTML/XML, PDF, MS Word Documents, etc.
It provides the ability to use a variety of Fields, and they don't necessarily have to be uniform across all Documents (in other words, music files might have different attributes than text-based content, such as artist, title, length, etc.), which is great for storing different types of content.
Not knowing the exact implementation of what you're working on, this may or may not be feasible, but for tagging and other related features, you might also consider using a database, such as MySQL or SQL Server side-by-side with the Lucene index. Use the Lucene index for full-text search, then once you have a result set, go to the database to extract all the relational content. Our company has done this before, and it's actually not as big of a headache as it sounds.
NOTE: If you decide to go this route, BE CAREFUL, as the "unique id" provided by Lucene is highly volatile (it changes everytime the index is optimized), so you will want to store the actual id (the primary key in the database) as a separate field on the Document.
Another added benefit, if you are set on using C#.NET, there is a port called Lucene.Net, which is written entirely in C#. The down-side here is that you're a few months behind on all the latest features, but if you really need them, you can always check out the Java source and implement the required updates manually.
Yes, there is a better approach. Try Solr and in particular check out facets. It will save you a lot of trouble.
If you definitely want to go the database route then you should use SQL Server with Full Text Search enabled. You can use this with Express versions, too. You can then store and search the contents of PDFs very easily (so long as you install the free Adobe PDF iFilter).
You could try using MS Search Server Express Edition, one of the major benefits is that it is free.
http://www.microsoft.com/enterprisesearch/en/us/search-server-express.aspx#none
Related
I am making a search engine in VisualStudio 2013. I am using Lucene.Net and I am wondering if it is possible to search in multiple tables within the database i have. I know how to search in multiple fields within a table, but I need to be able to search in multiple tables as well.
Is this possible in any way?
Yes, it is possible. The implementation will likely be unique to your needs so I cannot really help give you code to get started. Lucene uses the concept of documents, the structure of which is completely up to you. The more information you choose to store in those documents, the slower your search and indexing operations will be.
What you want to do is figure out what information users need to be able to search by, and what information you need to fetch relevant database information based upon those indexed fields. For example, you might index the title of a document or some/all of its body. If you query against those fields, then you will want to other information, likely table key values, that will allow you to fetch information relevant to that search. For example, you could store info to allow the fetching of related articles, or comments on the document, etc.
I hope that clears up how Lucene can be used, unfortunately the implementation details for your application are likely too specific to give more detailed answers.
I'm developing a ASP.Net MVC3 app which will have few hundred videos. I want to create a search system based on tags and other parameters like the user type that uploaded the video, the date of the video, video category, etc..
I have been looking around and Lucene.NET seems really good tool for full text search, but I don't know if it's the best solution for my project... I have read the tutorials and they recommend to keep the search index to a minimum but also that you should NOT hit your database for retrieving extra data that is not stored in the search index...
How this can be possible?
Lets put an example: I have a video row (as a concept, this is really held in different SQL tables) which has columns for the video id, the video name, the video file name, the full path, user id, user type, tags, creation date, video category, video subcategory, video location, etc... If I want to create a lucene search index I think I will have to put all the information in there so that later on I can query on every parameter, right?
This seems to me a duplicate of the SQL Database but with the overload of adding, editing and removing documents from lucene search index. Is this the standard scenario when using lucene? All the examples I have seen with lucene are based on a post id, post title and post body..
What do you think? Can you give me some light?
Yes, if you want to query multiple fields (including things like tags) from within lucene, you'll need to make that data available to lucene. It might sound like this is duplication, but it is not redundant duplication - it is restructuring the data into a very different layout - indexed for search.
It should work fine; it is pretty much how search works here on stackoverflow (which is using lucene.net to perform the search).
It should be noted, however, that a few hundred is not a large sample: frankly you could do that any way you like, and it'll take about the same amount of time. Writing a complex SQL query should work, as should full-text-search in the database (that is how stackoverflow's search used to work), as should filtering objects in-memory (at the few-hundred level, you could trivially just cache all the data excluding video frames in memory).
I have an application where I need to search in various text-based fields. The application is developed using NHibernate as an ORM.
I would like to implement Porter Stemming in searches, in order to be able to return relevant results even when the keyword matches a similar word, for example the description of a product contains memories while the search keyword is memory.
Can anyone suggest the best practices for such types of searches? The first idea that comes to mind is to store two version of the same field in database, for example:
Description
Description_Search
The Description column would be the text as entered by the website administrator, and is the text visible on the frontend.
The Description_Search would include the same text, but passed through a Porter-Stemming algorithm. Search queries would then be based on the Description_Search field, rather than Description.
Does this make sense? Is it a waste of space having to store two version of almost the same text?
Also, would Lucene.Net help in such a case? I am also looking into integrating Lucene.Net for full-text based searches but haven't yet looked into it in detail.
Thanks in advance!
There's no need to use two fields for this, one would be enough. A field has two "values", one stored that can be retrieved using Document.Get(...), and one indexed that's used for searching. It's not technically required to store the values either, a common solution is to store a id that's used to lookup the original content in a database. This would also allow you to lookup more information, like author information and document location.
Lucene.Net would help in this case, but it requires you to write the infrastructure yourself. You would need to take care of configuring analyzers (usually nothing to configure), and index your content. As mention in a comment, you could go for SQL Server's Full Text Search functionality, but that itself has some limitations (which may not affect you).
One big problem I've had using SQL Server's FTS, but works in Lucene.Net (this isn't really fair since you can do almost anything in Lucene.Net since you write the code that does it) is accent sensitivity. I've been unable to configure it using Swedish language rules, where åäö should be treated as real characters. Enabling accent sensitivity would do this, but it would also mean that diacritics is parsed as real characters, which means that ñ differs from n. (Imagine searching for a "jalapeno" and get no matches for "jalapeño"). Disabling accent sensitivity basically removes all diacritics, turning åäö into aao, and words turn out completely different.
Writing things in Lucene.Net (compared to SQL Server FTS) allows you to provide result highlighting (present which phrases in a document that matches the query), search for similar documents, spell-checking, custom result boosting, facets, and other things that would enhance your users' search experience.
I've got a requirement where a user enters a few terms into a search box and clicks "go".
Does anyone have any good resources on how to implement a dynamic search that spans a few database tables?
Thanks,
Mike
I'm gonna throw in my vote for Lucene. While SQL Server does provide full text indexing and some search capabilities, it is not the greatest search engine. In my experience, it does not provide the best results or result ranking until you have a significant volume of indexed items (tens of thousands to hundreds of thousands minimum).
In contrast, Lucene is explicitly a search engine. It is an inverted index, behaving much like your run of the mill internet search engine. Lucene provides a very rich indexing and search platform, as well as some rich C# and .NET API's for querying the indexes. There is even a LINQ to Lucene provider that will allow you to query a Lucene index with LINQ.
The one drawback to using Lucene is that you have to build an index, which is a side-band process that runs independently of the database. You have to write your own tool to manage the index as well. Your search index, depending on how frequently you update it, may not be 100% up-to-the-minute up to date. Generally, that is not a huge concern, but if you have the resources, the Lucene index culd be incrementally updated every few minutes to keep things "fresh".
It is called Full-text Search.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
This is a pretty loaded question given the lack of detail. If you just need a simple search over a few tables/columns then a single (cludgy) search SP may be enough for you.
That said, if you need more features such as:
Searching a large set of tables
Support for large amounts of data
Searching over forms of a word
Logical operations
etc
then you might want to look into Full-Text Search (which is a part of MS Sql 2000 and above). The initial investment to get up to speed with Full-Text Search can be a bit offsetting, but compared to implementing the above features you'll likely save yourself a ton of time and energy.
Here are some Full-Text Search links to get you started:
Msdn Page
Initial Set Up
Set Up Video
Hope that helps.
Ok there were a few requests for more info so let me provide some.
I have several tables (ie. users, companies, addresses) and I'd like a user to be able to enter something like this:
"microsoft wa gates"
and bring up a result list containing results for "gates", "microsoft", and "washington".
Lucene seems like it could be pretty cool.
You can create a SP that receive the search terms as parameters and retun some "selects" (recordsets) to the program that launched. It can return a select for each table and you can do whatever you need with the data in your app code.
If you need to receive only a dataset, you can make a View using UNION of the tables for consolidate the columns in a common schema and then filter the view same way. You will receive in your application only a dataset with all the information consolidated in the view and filtered.
Is there a good tag search system that i can use in my C# .NET prototype that will also run on ASP.NET ?
When you say "tag search system" I am going to assume that you mean the ability in a social network to allow your users to tag content thereby bubbling up the things that are most popular in your site by way of a tag cloud. Also allowing your users to navigate by way of tagged content, etc. ??
I like to create a SystemObjects table which holds the various tables in my system that might have tags applied to it (or comments, or ratings, etc.) thereby allowing me to have a generic tagging system that can span my entire database. Then I would also have a SystemObjectTags table that would have a reference to the SystemObjectID (telling me which table has the record that I am interested in) as well as the SystemObjectRecordID (which tells me which specific record I am interested in). From there I have all of my detail data with regards to the tags and what was tagged. I then like to keep a running list of the tag specific information in a Tags table which keeps the unique tag (string value of a tag) the TagID (which the SystemObjectTags table references), the count of that tags usage across the system (a summed value of all uses), etc. If you have a high traffic site this data should be kept in your cache so that you don't hit the data too frequently.
With this subsystem in place you can then move to the search capabilities. You have all the data that you need with these three tables to easily be able to perform filtering, searching, etc. However, you might find that there is so much data in here and that the tables are so generic that your searches are not as fast as a more optimized table structure might be. For this reason I suggest that you use a Lucene.NET index to hold all of your searchable data. Lucene.NET provides a very fast read time and provides far more flexibility in search algorithms than SQL Servers freetext stuff does.
This would then allow you to provide filtering of your content by tags, searching for content by tag, tag counts, etc. Lucene.net is a big scary topic though! Be prepared to do some reading to get your past the basics.
An option we are using is to put our "tags" in the Meta Keywords on the page, and then we use Bing for our search.
http://msdn.microsoft.com/en-us/library/dd251056.aspx
Our architect said it best. "Let the search engines do what they do best. Search."
You can limit the search to your site only, pull back the results and display them yourself...on your own page with your own formatting.
The only downside is that until your site is live and has been indexed, you can't fully test your search.