I am making a search engine in Visual Studio 2013. I am using Lucene.Net and I am wondering if it is possible to search in multiple tables within the database I have. I know how to search in multiple fields within a table, but I need to be able to search in multiple tables as well.
Is this possible in any way?
Yes, it is possible. The implementation will likely be unique to your needs so I cannot really help give you code to get started. Lucene uses the concept of documents, the structure of which is completely up to you. The more information you choose to store in those documents, the slower your search and indexing operations will be.
What you want to do is figure out what information users need to be able to search by, and what information you need in order to fetch the relevant database rows based upon those indexed fields. For example, you might index the title of a document or some/all of its body. If you query against those fields, then you will also want to store other information, most likely table key values, that will allow you to fetch the data relevant to that search. For example, you could store enough information to fetch related articles, comments on the document, etc.
I hope that clears up how Lucene can be used. Unfortunately, the implementation details for your application are likely too specific to give more detailed answers.
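As a rough illustration of the idea above (one index, documents coming from several tables, each carrying its table name and key), here is a minimal Python sketch. It is not Lucene.Net code: `TinyIndex`, `tokenize`, and the `Articles`/`Comments` tables are made-up names, and the whitespace tokenizer only stands in for a real analyzer.

```python
from collections import defaultdict

def tokenize(text):
    """Lower-case and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

class TinyIndex:
    """Toy inverted index: term -> set of (table, primary_key) pairs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, table, key, searchable_text):
        # Only the table name and key are kept; the row itself stays in the database.
        for term in tokenize(searchable_text):
            self.postings[term].add((table, key))

    def search(self, query):
        # Return documents that contain every query term, whatever table they came from.
        term_sets = [self.postings.get(term, set()) for term in tokenize(query)]
        if not term_sets:
            return []
        return sorted(set.intersection(*term_sets))

# Hypothetical rows from two different tables.
articles = {1: "Lucene indexing basics", 2: "Cooking with chili"}
comments = {10: "Great intro to Lucene", 11: "Too spicy for me"}

index = TinyIndex()
for key, title in articles.items():
    index.add_document("Articles", key, title)
for key, body in comments.items():
    index.add_document("Comments", key, body)

hits = index.search("lucene")
```

Each hit is a `(table, key)` pair, so the application knows which table to query for the full row.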
Related
I have an application where I need to search in various text-based fields. The application is developed using NHibernate as an ORM.
I would like to implement Porter Stemming in searches, in order to be able to return relevant results even when the keyword matches a similar word, for example the description of a product contains memories while the search keyword is memory.
Can anyone suggest best practices for this type of search? The first idea that comes to mind is to store two versions of the same field in the database, for example:
Description
Description_Search
The Description column would be the text as entered by the website administrator, and is the text visible on the frontend.
The Description_Search would include the same text, but passed through a Porter-Stemming algorithm. Search queries would then be based on the Description_Search field, rather than Description.
Does this make sense? Is it a waste of space to store two versions of almost the same text?
Also, would Lucene.Net help in such a case? I am also looking into integrating Lucene.Net for full-text based searches but haven't yet looked into it in detail.
Thanks in advance!
There's no need to use two fields for this; one is enough. A field has two "values": a stored one that can be retrieved using Document.Get(...), and an indexed one that's used for searching. It's not technically required to store the values either. A common solution is to store an ID that's used to look up the original content in a database. This would also allow you to look up more information, like author information and document location.
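To make the indexed/stored distinction concrete, here is a small Python sketch under loud assumptions: `crude_stem` is a toy suffix stripper, not the Porter algorithm, and the dictionary stands in for a real index. Only the product ID is kept alongside the stemmed terms; the description itself would stay in the database.

```python
from collections import defaultdict

def crude_stem(word):
    """Very rough stand-in for Porter stemming: handles a few plural endings only."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def analyze(text):
    # Lower-case, strip simple punctuation, drop empty tokens, then stem.
    tokens = (w.strip(".,!?") for w in text.lower().split())
    return [crude_stem(t) for t in tokens if t]

index = defaultdict(set)  # indexed value: stemmed term -> product IDs

def add_product(product_id, description):
    # The description is indexed but not stored; only the ID is kept for lookup.
    for term in analyze(description):
        index[term].add(product_id)

def search(keyword):
    # The query goes through the same analysis as the indexed text.
    ids = set()
    for term in analyze(keyword):
        ids |= index.get(term, set())
    return sorted(ids)

add_product(42, "Fast memories for laptops.")
add_product(43, "Leather laptop bag")
```

Because the query and the indexed text share one analysis step, `search("memory")` finds the product whose description says "memories" without a second column.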
Lucene.Net would help in this case, but it requires you to write the infrastructure yourself. You would need to take care of configuring analyzers (usually there is nothing to configure) and indexing your content. As mentioned in a comment, you could go for SQL Server's Full Text Search functionality, but that has some limitations of its own (which may not affect you).
One big problem I've had with SQL Server's FTS that is solvable in Lucene.Net is accent sensitivity. (The comparison isn't really fair, since you can do almost anything in Lucene.Net when you write the code that does it.) I've been unable to configure SQL Server to use Swedish language rules, where åäö should be treated as real characters. Enabling accent sensitivity would do this, but it would also mean that all diacritics are parsed as real characters, so ñ differs from n. (Imagine searching for "jalapeno" and getting no matches for "jalapeño".) Disabling accent sensitivity basically removes all diacritics, turning åäö into aao, and the words turn out completely different.
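The kind of rule described above (fold most diacritics, but keep åäö) is straightforward once you control the text analysis yourself. The following Python sketch shows the logic only; it is not a Lucene.Net filter, and `fold_accents` and `KEEP` are names made up for this example.

```python
import unicodedata

KEEP = set("åäöÅÄÖ")  # letters treated as distinct characters in Swedish

def fold_accents(text, keep=KEEP):
    """Strip diacritics from every character except those listed in `keep`."""
    out = []
    # Normalize to composed form first so that "a" + combining ring is seen as "å".
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the character and drop its combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

Applying the same function to both the indexed text and the query means "jalapeno" matches "jalapeño", while "smörgås" is left untouched.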
Writing things in Lucene.Net (compared to SQL Server FTS) allows you to provide result highlighting (showing which phrases in a document match the query), search for similar documents, spell-checking, custom result boosting, facets, and other things that would enhance your users' search experience.
I'm designing a MongoDB schema for a site like Stack Overflow. There are questions and users.
Users can add questions to their favorite list and they can search for a question within the favorite list.
I have 2 collections, Users and Questions. The problem is how to store favorites. There are 2 options:
Store a list of favorite question IDs with the user.
Store a list of user IDs (of the users who added this question to their favorites) with the question.
Which approach should I take? Remember I need to search the favorites of a user too.
For an estimate of database and record sizes, assume the number of questions, users, and database operations that Stack Overflow has.
For more info:
This app is an ASP.NET MVC application written in C#, and I hope to use Lucene.NET for search.
Thanks in advance
Having a separate collection for UserFavorites is the better approach, because the number of favorites is unknown at any time and keeps growing.
UserFavorites
- UserID (BSON ObjectId)
- ID of the user who posted
- Name of the user who posted
- Name of the question
- Question ID
- URL of the question
You might think that storing the user ID and question ID is enough to find the favorites most of the time. But in NoSQL it's better to store the most relevant information along with the IDs (to avoid joins). In this case you store the ID and name of the user who posted the question, and the name, ID, and URL of the question, so you can easily display the favorites by querying this document alone.
It's not an exact way of doing this, but it'll give you an idea.
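As a sketch of the denormalized favorites document described above, here is a small Python example that uses a plain list in place of a MongoDB collection. The field names and sample data are assumptions made for illustration, not a fixed schema.

```python
# In-memory stand-in for a separate favorites collection; field names are illustrative.
user_favorites = [
    {
        "user_id": "u1",                      # user who favorited the question
        "question_id": "q100",
        "question_title": "How do I index arrays in MongoDB?",
        "question_url": "/questions/q100",
        "posted_by_id": "u7",                 # denormalized: author of the question
        "posted_by_name": "alice",
    },
    {
        "user_id": "u1",
        "question_id": "q101",
        "question_title": "Lucene.Net analyzer basics",
        "question_url": "/questions/q101",
        "posted_by_id": "u8",
        "posted_by_name": "carol",
    },
    {
        "user_id": "u2",
        "question_id": "q100",
        "question_title": "How do I index arrays in MongoDB?",
        "question_url": "/questions/q100",
        "posted_by_id": "u7",
        "posted_by_name": "alice",
    },
]

def favorites_of(user_id, keyword=None):
    """Return a user's favorites, optionally filtered by a keyword in the title."""
    hits = [fav for fav in user_favorites if fav["user_id"] == user_id]
    if keyword is not None:
        needle = keyword.lower()
        hits = [fav for fav in hits if needle in fav["question_title"].lower()]
    return hits
```

Everything needed to render a favorites list (title, URL, author name) comes back from this one collection, at the cost of updating the copies if a title or user name changes.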
If you are designing a site like SO and want to achieve the same performance, you will definitely need to denormalize your data. So I suggest storing the user's favorite question IDs within the user, and storing the user IDs within the question. During a favorite operation you will need to insert data in two places (user and question), but you will be able to retrieve a user's favorites, or a question's favoriting users, quickly.
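The two-place write can be sketched as follows, with Python dictionaries standing in for the two collections. All field names are assumptions, and `favorite_count` is an extra convenience field added for this example. In a real database the two writes are separate operations, so the application has to cope with one succeeding and the other failing.

```python
# In-memory stand-ins for the Users and Questions collections.
users = {"u1": {"name": "bob", "favorite_question_ids": []}}
questions = {
    "q100": {"title": "Example question", "favorited_by_user_ids": [], "favorite_count": 0}
}

def add_favorite(user_id, question_id):
    """Record a favorite on both the user and the question; return False if it already exists."""
    user = users[user_id]
    question = questions[question_id]
    if question_id in user["favorite_question_ids"]:
        return False
    # Two writes: one per document. Neither read path needs a join afterwards.
    user["favorite_question_ids"].append(question_id)
    question["favorited_by_user_ids"].append(user_id)
    question["favorite_count"] += 1
    return True

first = add_favorite("u1", "q100")
second = add_favorite("u1", "q100")  # duplicate, ignored
```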
BTW: If you use Lucene with MongoDB, you will run into problems loading results from MongoDB in relevance order.
If you need real full-text search you can try RavenDB. It is also a great NoSQL database, and it natively supports Lucene syntax.
Edit:
When you are designing a site like SO, keep in mind:
denormalization
async request processing
background jobs
If you want to display the number of favorite flags for each question, you should probably store them with the question to avoid searching through the user database.
Please excuse my english, I'm still trying to master it.
I've started to learn MongoDB (coming from a C# background) and I like the idea behind MongoDB. I have some issues with the examples on the internet.
Take the popular blog post / comments example. A Post has zero or more Comments associated with it. I create a Post object and add a few Comment objects to the IList in Post. That's fine.
Do I add that to just a "Posts" collection in MongoDB, or should I have two collections, blog.posts and blog.posts.comments?
I have a fairly complicated object model. The easiest way to think of it is as a banking system (ours is mining). I have tried to highlight tables with square brackets.
[Users] have one or many [Accounts] that have one or many [Transactions] which has one and only one [Type]. [Transactions] can have one or more [Tag] assigned to the transaction. [Users] create their own [Tags] unique to that user account and we sometimes need to offer reporting by those tags (Eg. for May, tag drilling-expense was $123456.78).
For indexing, I would have thought separating them would be good, but I'm worried that this thinking is a bad practice carried over from my old RDBMS days.
In a way, it's like the blog example. I'm not sure if I should have one [Account] collection and persist all information there, or have an intermediate step that splits it up into separate collections.
The other related query is, when you persist back and forth, do you usually get back everything associated with that record - even if not required or do you limit?
It depends.
It depends on how many of each of these types of objects you expect to have. Can you fit them all into a single MongoDB document for a given User? Probably not.
It depends on the relationships. Is User-Account a one-to-many or a many-to-many relationship? If it's one-to-many and the number of Accounts is small, you might choose to put them in an IList on a User document.
You can still model relationships in MongoDB with separate collections BUT there are no joins in the database so you have to do that in code. Loading a User and then loading their Accounts might be just fine from a performance perspective.
You can index INTO arrays on documents. Don't think of an Index as just being an index on a simple field on a document (like SQL). You can use, say, a Tag collection on a document and index into the tags. (See http://www.mongodb.org/display/DOCS/Indexes#Indexes-Arrays)
When you retrieve or write data you can do a partial read and a partial write of any document. (see http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields)
And, finally, when you can't see how to get what you want using collections and indexes, you might be able to achieve it using map reduce. For example, to find all the tags currently in use sorted by their frequency of use, you would map each document, emitting the tags used in it, and then reduce that set to get the result you want. You might then store the result of that map reduce permanently and only update it when you need to.
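The tag-frequency example can be sketched in plain Python to show the shape of the map and reduce steps. This is not MongoDB's own map-reduce API; the `transactions` data and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical transaction documents, each with an embedded list of tags.
transactions = [
    {"amount": 100.0, "tags": ["drilling-expense", "may"]},
    {"amount": 50.0, "tags": ["drilling-expense"]},
    {"amount": 20.0, "tags": ["fuel"]},
]

def map_tags(doc):
    # Map step: emit (tag, 1) for every tag used in the document.
    for tag in doc["tags"]:
        yield tag, 1

def reduce_counts(key, values):
    # Reduce step: collapse all values emitted for one key into a single total.
    return sum(values)

def map_reduce(docs, mapper, reducer):
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

tag_counts = map_reduce(transactions, map_tags, reduce_counts)
# Most frequently used tags first; ties are broken alphabetically.
by_frequency = sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Emitting the transaction amount instead of `1` in the map step would give a total per tag rather than a usage count.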
One further concern: You mention calculating totals by tag. If you want accounting-quality transactional consistency, MongoDB might not be the right choice for you. "Eventual-consistency" is the name of the game for NoSQL data stores and they generally aren't a good fit for financial transactions. For example, it doesn't matter if one user sees a blog post with 3 comments while another sees 4 because they hit different replica copies that aren't in sync yet, but for a financial report, that kind of consistency does matter - your report might not add up!
I'm trying to create a search engine for all literature (books, articles, etc), music, and videos relating to a particular spiritual group. When a keyword is entered, I want to display a link to all the PDF articles where the keyword appears, and also all the music files and video files which are tagged with the keyword in question. The user should be able to filter it with information such as author/artist, place, date/time, etc. When the user clicks on one of the results links (book names, for instance), they are taken to another page where snippets from that book everywhere the keyword is found are displayed.
I thought of using the Lucene library (or Searcharoo) to implement my PDF search, but I also need a database to tag all the other information so that results can be filtered by author/artist information, etc. So I was thinking of having tables for Text, Music, and Videos, each with a field containing the path to the file. When a keyword is entered, I need to search the DB for music and video files, and I also need to search the PDFs. When a filter is applied, the music and video search is easy, but limiting the text search based on the filters is getting confusing.
Is my approach correct? Are there better ways to do this? Since the search content is limited only to the spiritual group, there is not an infinite number of items to search. I'd say about 100-500 books and 1000-5000 songs.
Lucene is a great way to get up and running quickly without too much effort, along with several areas for extending the indexing and searching functionality to better suit your needs. It also has several built-in analyzers for common file types, such as HTML/XML, PDF, MS Word Documents, etc.
It provides the ability to use a variety of Fields, and they don't necessarily have to be uniform across all Documents (in other words, music files might have different attributes than text-based content, such as artist, title, length, etc.), which is great for storing different types of content.
Not knowing the exact implementation of what you're working on, this may or may not be feasible, but for tagging and other related features, you might also consider using a database, such as MySQL or SQL Server side-by-side with the Lucene index. Use the Lucene index for full-text search, then once you have a result set, go to the database to extract all the relational content. Our company has done this before, and it's actually not as big of a headache as it sounds.
NOTE: If you decide to go this route, BE CAREFUL, as the "unique id" provided by Lucene is highly volatile (it changes every time the index is optimized), so you will want to store the actual ID (the primary key in the database) as a separate field on the Document.
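A minimal sketch of that side-by-side arrangement, assuming Python with SQLite for the relational part and a dictionary standing in for the Lucene index. The `texts` table, its columns, and the sample rows are invented for this example. The point to notice is that the index holds the database primary key, which then drives the relational query and the filter.

```python
import sqlite3
from collections import defaultdict

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE texts (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
db.executemany(
    "INSERT INTO texts (id, title, author) VALUES (?, ?, ?)",
    [(1, "On Meditation", "Author A"), (2, "Daily Practice", "Author B")],
)

# Stand-in for the full-text index: term -> database primary keys.
# The key is kept as an ordinary field, never the index's internal document number.
full_text = {1: "meditation calms the mind", 2: "practice meditation daily"}
index = defaultdict(set)
for db_id, body in full_text.items():
    for term in body.lower().split():
        index[term].add(db_id)

def search(keyword, author=None):
    """Full-text lookup first, then fetch and filter the matching rows in the database."""
    ids = sorted(index.get(keyword.lower(), set()))
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, title, author FROM texts WHERE id IN ({placeholders})"
    params = list(ids)
    if author is not None:
        sql += " AND author = ?"
        params.append(author)
    sql += " ORDER BY id"
    return db.execute(sql, params).fetchall()
```

For example, `search("meditation", author="Author B")` runs the keyword lookup against the index and leaves the author filter to the database.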
Another added benefit, if you are set on using C#.NET, there is a port called Lucene.Net, which is written entirely in C#. The down-side here is that you're a few months behind on all the latest features, but if you really need them, you can always check out the Java source and implement the required updates manually.
Yes, there is a better approach. Try Solr and in particular check out facets. It will save you a lot of trouble.
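For anyone unfamiliar with the term, facets are essentially counts of how many results fall under each value of a field, which the UI then offers as filters. The Python sketch below shows only the concept; it is not Solr's API, and the result records are invented.

```python
from collections import Counter

# Hypothetical search results with a few filterable fields.
results = [
    {"title": "Song of Dawn", "type": "music", "artist": "Artist A"},
    {"title": "Evening Talk", "type": "video", "artist": "Artist B"},
    {"title": "Morning Hymn", "type": "music", "artist": "Artist A"},
]

def facet_counts(hits, field):
    """Count how many hits carry each value of the given field."""
    return Counter(hit[field] for hit in hits)

def apply_filter(hits, field, value):
    """Narrow the result set to hits whose field equals the chosen facet value."""
    return [hit for hit in hits if hit[field] == value]

type_facets = facet_counts(results, "type")
music_only = apply_filter(results, "type", "music")
```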
If you definitely want to go the database route then you should use SQL Server with Full Text Search enabled. You can use this with Express versions, too. You can then store and search the contents of PDFs very easily (so long as you install the free Adobe PDF iFilter).
You could try using MS Search Server Express Edition. One of its major benefits is that it is free.
http://www.microsoft.com/enterprisesearch/en/us/search-server-express.aspx#none
Is there a good tag search system that I can use in my C# .NET prototype that will also run on ASP.NET?
When you say "tag search system" I am going to assume that you mean the ability in a social network to allow your users to tag content, thereby bubbling up the things that are most popular on your site by way of a tag cloud, and also allowing your users to navigate by way of tagged content, etc.
I like to create a SystemObjects table which holds the various tables in my system that might have tags applied to them (or comments, or ratings, etc.), thereby allowing me to have a generic tagging system that can span my entire database. Then I would also have a SystemObjectTags table that would have a reference to the SystemObjectID (telling me which table has the record that I am interested in) as well as the SystemObjectRecordID (which tells me which specific record I am interested in). From there I have all of my detail data with regard to the tags and what was tagged. I then like to keep a running list of the tag-specific information in a Tags table, which keeps the unique tag (the string value of a tag), the TagID (which the SystemObjectTags table references), the count of that tag's usage across the system (a summed value of all uses), etc. If you have a high-traffic site this data should be kept in your cache so that you don't hit the database too frequently.
With this subsystem in place you can then move to the search capabilities. You have all the data that you need in these three tables to easily perform filtering, searching, etc. However, you might find that there is so much data in here, and that the tables are so generic, that your searches are not as fast as they would be with a more optimized table structure. For this reason I suggest that you use a Lucene.NET index to hold all of your searchable data. Lucene.NET provides very fast reads and far more flexibility in search algorithms than SQL Server's freetext features do.
This would then allow you to provide filtering of your content by tags, searching for content by tag, tag counts, etc. Lucene.NET is a big scary topic though! Be prepared to do some reading to get you past the basics.
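Here is one way the three tables could look, sketched with Python and SQLite so it can be run as is. The table names and the SystemObjectID, SystemObjectRecordID, and TagID columns come from the description above; `TableName`, `Tag`, `UsageCount`, and the helper functions are assumptions made for this example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE SystemObjects (
    SystemObjectID INTEGER PRIMARY KEY,
    TableName TEXT UNIQUE
);
CREATE TABLE Tags (
    TagID INTEGER PRIMARY KEY,
    Tag TEXT UNIQUE,
    UsageCount INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE SystemObjectTags (
    SystemObjectID INTEGER REFERENCES SystemObjects(SystemObjectID),
    SystemObjectRecordID INTEGER,
    TagID INTEGER REFERENCES Tags(TagID),
    PRIMARY KEY (SystemObjectID, SystemObjectRecordID, TagID)
);
""")

def tag_record(table_name, record_id, tag):
    """Tag one record of any table; return False if that tag was already applied."""
    db.execute("INSERT OR IGNORE INTO SystemObjects (TableName) VALUES (?)", (table_name,))
    db.execute("INSERT OR IGNORE INTO Tags (Tag) VALUES (?)", (tag,))
    object_id = db.execute(
        "SELECT SystemObjectID FROM SystemObjects WHERE TableName = ?", (table_name,)
    ).fetchone()[0]
    tag_id = db.execute("SELECT TagID FROM Tags WHERE Tag = ?", (tag,)).fetchone()[0]
    key = (object_id, record_id, tag_id)
    already = db.execute(
        "SELECT 1 FROM SystemObjectTags"
        " WHERE SystemObjectID = ? AND SystemObjectRecordID = ? AND TagID = ?",
        key,
    ).fetchone()
    if already:
        return False
    db.execute("INSERT INTO SystemObjectTags VALUES (?, ?, ?)", key)
    # Keep the running usage count on the Tags row, as described above.
    db.execute("UPDATE Tags SET UsageCount = UsageCount + 1 WHERE TagID = ?", (tag_id,))
    return True

def tag_cloud():
    """Tags ordered by usage, most used first."""
    return db.execute(
        "SELECT Tag, UsageCount FROM Tags ORDER BY UsageCount DESC, Tag"
    ).fetchall()

tag_record("Articles", 1, "lucene")
tag_record("Photos", 5, "lucene")
tag_record("Articles", 1, "search")
duplicate = tag_record("Articles", 1, "lucene")
```

The `tag_cloud()` result is the kind of data worth caching on a busy site, since it changes only when a tag is applied.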
An option we are using is to put our "tags" in the Meta Keywords on the page, and then we use Bing for our search.
http://msdn.microsoft.com/en-us/library/dd251056.aspx
Our architect said it best. "Let the search engines do what they do best. Search."
You can limit the search to your site only, pull back the results and display them yourself...on your own page with your own formatting.
The only downside is that until your site is live and has been indexed, you can't fully test your search.