I have a pretty small collection of string values in memory (around 8400 records with an average of 10 words each).
What I'm trying to find out is whether there's a library or something that, when I search for strings within that collection, returns the matching values and can also attach some kind of weight to the results.
This is what I'm trying to do; let's say that I have these records in a List in memory:
Department Store General Manager
General and Operations Manager
General Manager
Restaurant Generally Managers
Restaurant General Manager
Let's say that I'm working on a method that receives a search string and will analyze that collection in order to retrieve the results:
List<string> SearchJobTitles("General Manager")
I want something that will return all the records that contain the words General AND Manager. So far it should be easy: I could do it with regular expressions.
But the tricky part is that I want to apply some weighting rules, saying:
"OK, the third record is a bigger match because it's an EXACT match."
"The first and last records should come next because they have the two words with no distance between them."
"The second record should be next because it has the two exact words but in a different order."
"The 4th record should be last because it only has a partial match of both words."
THAT's the kind of logic I want to apply.
I know there are some libraries like Lucene.NET or Sphinx: I'm not discarding them; I'm just not convinced they're worth using for such a small in-memory collection.
In the worst-case scenario I'll work on an IComparer implementation for the entities, but I want to know if there's something out there I could already use.
Thanks and regards,
In this particular example the volume of records is small, but that does not reduce the complexity of full-text search.
If you have only 5 records it might be a good idea to implement a simple Levenshtein distance (or find an implementation online), tokenize all phrases, and write your own custom matching algorithm (word distance, maybe synonyms, etc.).
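If you do go the hand-rolled route, the distance function itself is short. A minimal sketch of the classic dynamic-programming Levenshtein distance in C#:

    using System;

    // Classic dynamic-programming Levenshtein distance: the number of
    // single-character inserts, deletes and substitutions needed to turn
    // one string into the other.
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,           // deletion
                    d[i, j - 1] + 1),          // insertion
                    d[i - 1, j - 1] + cost);   // substitution
            }
        return d[a.Length, b.Length];
    }

Levenshtein("Generally", "General") is 2, for instance, so you could treat small distances as partial word matches when scoring each record.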
On the other hand, using Lucene.NET gives you that out of the box. You can use RAMDirectory to store the index in memory. And, most importantly, you don't have to spend hours trying to figure out why your custom algorithm doesn't work as it should. Why reinvent the wheel?
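For illustration, a minimal sketch of an in-memory index, assuming Lucene.Net 3.x (jobTitles stands in for your in-memory List<string>, and the "title" field name is mine):

    using System;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    // Build the index in memory...
    var directory = new RAMDirectory();
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    using (var writer = new IndexWriter(directory, analyzer,
                                        IndexWriter.MaxFieldLength.UNLIMITED))
    {
        foreach (var title in jobTitles)
        {
            var doc = new Document();
            doc.Add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }

    // ...then search it; the hit scores give you the relevance weighting for free.
    using (var searcher = new IndexSearcher(directory))
    {
        var parser = new QueryParser(Version.LUCENE_30, "title", analyzer);
        foreach (var hit in searcher.Search(parser.Parse("general manager"), 10).ScoreDocs)
            Console.WriteLine("{0:0.00}  {1}", hit.Score, searcher.Doc(hit.Doc).Get("title"));
    }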
Alternative?
Are you using any SQL database in your application? If so, it might be worth leveraging the full-text search built into modern SQL databases.
Related
I have an application where I need to search in various text-based fields. The application is developed using NHibernate as an ORM.
I would like to implement Porter Stemming in searches, in order to be able to return relevant results even when the keyword matches a similar word, for example the description of a product contains memories while the search keyword is memory.
Can anyone suggest best practices for this type of search? The first idea that comes to mind is to store two versions of the same field in the database, for example:
Description
Description_Search
The Description column would be the text as entered by the website administrator, and is the text visible on the frontend.
The Description_Search would include the same text, but passed through a Porter-Stemming algorithm. Search queries would then be based on the Description_Search field, rather than Description.
Does this make sense? Is it a waste of space to store two versions of almost the same text?
Also, would Lucene.Net help in such a case? I am also looking into integrating Lucene.Net for full-text based searches but haven't yet looked into it in detail.
Thanks in advance!
There's no need to use two fields for this; one is enough. A field has two "values": one stored, which can be retrieved using Document.Get(...), and one indexed, which is used for searching. It's not technically required to store the values either; a common solution is to store an id that's used to look up the original content in a database. This would also allow you to look up more information, like author information and document location.
Lucene.Net would help in this case, but it requires you to write the infrastructure yourself. You would need to take care of configuring analyzers (usually there's nothing to configure) and index your content. As mentioned in a comment, you could go for SQL Server's Full Text Search functionality, but that has some limitations of its own (which may not affect you).
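For the stemming itself, Lucene.Net ships a PorterStemFilter, so the analyzer work amounts to a few lines. A sketch against the 3.x API:

    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Version = Lucene.Net.Util.Version;

    // Lower-cases tokens and reduces them to their Porter stems, so
    // "memory" and "memories" are both indexed (and searched) as "memori".
    public class StemmingAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
            stream = new StandardFilter(stream);
            stream = new LowerCaseFilter(stream);
            return new PorterStemFilter(stream);
        }
    }

Use the same analyzer at index time and at query time, so both sides reduce words to the same stem.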
One big problem I've had using SQL Server's FTS, but which works in Lucene.Net (this isn't really fair, since you can do almost anything in Lucene.Net given that you write the code that does it), is accent sensitivity. I've been unable to configure it using Swedish language rules, where åäö should be treated as real characters. Enabling accent sensitivity would do this, but it would also mean that all diacritics are parsed as real characters, so that ñ differs from n. (Imagine searching for "jalapeno" and getting no matches for "jalapeño".) Disabling accent sensitivity basically removes all diacritics, turning åäö into aao, and words come out completely different.
Writing things in Lucene.Net (compared to SQL Server FTS) allows you to provide result highlighting (showing which phrases in a document match the query), similar-document search, spell-checking, custom result boosting, facets, and other things that would enhance your users' search experience.
Please excuse my English, I'm still trying to master it.
I've started to learn MongoDB (coming from a C# background) and I like the idea of MongoDB, but I have some issues with the examples on the internet.
Take the popular blog post / comments example. A Post has none or many Comments associated with it. I create a Post object and add a few Comment objects to the IList in Post. That's fine.
But do I add that to just a "Posts" collection in MongoDB, or should I have two collections: blog.posts and blog.posts.comments?
I have a fairly complicated object model; the easiest way to think of it is as a banking system (ours is mining). I've tried to highlight the tables with square brackets.
[Users] have one or many [Accounts], which have one or many [Transactions], each of which has one and only one [Type]. [Transactions] can have one or more [Tags] assigned to them. [Users] create their own [Tags], unique to that user account, and we sometimes need to offer reporting by those tags (e.g. for May, the tag drilling-expense came to $123456.78).
For indexing, I would have thought separating them would be good, but I'm worried that's bad-practice thinking left over from my old RDBMS days.
In a way it's like the blog example. I'm not sure if I should have one [Account] collection and persist all the information there, or have an intermediate step that splits it up into separate collections.
The other related query is: when you persist back and forth, do you usually get back everything associated with that record, even if it's not required, or do you limit the fields?
It depends.
It depends on how many of each of these types of objects you expect to have. Can you fit them all into a single MongoDB document for a given User? Probably not.
It depends on the relationships: is User-Account a one-to-many or a many-to-many relationship? If it's one-to-many and the number of Accounts is small, you might choose to put them in an IList on a User document.
You can still model relationships in MongoDB with separate collections BUT there are no joins in the database so you have to do that in code. Loading a User and then loading their Accounts might be just fine from a performance perspective.
You can index INTO arrays on documents. Don't think of an index as just being an index on a simple field of a document (like in SQL). You can use, say, a Tags collection on a document and index into the tags. (See http://www.mongodb.org/display/DOCS/Indexes#Indexes-Arrays)
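To make that concrete, here's a hedged sketch of the embedded model as C# classes (the class and property names are mine, not a prescribed schema); a multikey index on "Transactions.Tags" would then reach into the embedded arrays:

    using System;
    using System.Collections.Generic;
    using MongoDB.Bson;  // the official C# driver's BSON types

    // One document per Account; Transactions are embedded, and each
    // transaction carries a plain string array of user-defined tags.
    public class Account
    {
        public ObjectId Id { get; set; }
        public ObjectId UserId { get; set; }  // a reference, resolved in code (no joins)
        public List<Transaction> Transactions { get; set; }
    }

    public class Transaction
    {
        public DateTime Date { get; set; }
        public double Amount { get; set; }
        public string Type { get; set; }      // the one-and-only-one [Type]
        public List<string> Tags { get; set; }
    }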
When you retrieve or write data you can do a partial read and a partial write of any document. (see http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields)
And, finally, when you can't see how to get what you want using collections and indexes, you might be able to achieve it using map-reduce. For example, to find all the tags currently in use, sorted by their frequency of use, you would map each document, emitting the tags used in it, and then reduce that set to get the result you want. You might then store the result of that map-reduce permanently and only update it when you need to.
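A sketch of that map/reduce pair, written as the JavaScript strings you'd send to the server (the field names follow the hypothetical model above):

    // Map: emit each tag once per transaction that uses it.
    const string map = @"
        function () {
            this.Transactions.forEach(function (t) {
                t.Tags.forEach(function (tag) { emit(tag, 1); });
            });
        }";

    // Reduce: sum the counts emitted for each tag.
    const string reduce = @"
        function (key, values) {
            return Array.sum(values);
        }";

    // Run these through your driver's map-reduce call and sort the output
    // collection by value to get tags ordered by frequency of use.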
One further concern: You mention calculating totals by tag. If you want accounting-quality transactional consistency, MongoDB might not be the right choice for you. "Eventual-consistency" is the name of the game for NoSQL data stores and they generally aren't a good fit for financial transactions. For example, it doesn't matter if one user sees a blog post with 3 comments while another sees 4 because they hit different replica copies that aren't in sync yet, but for a financial report, that kind of consistency does matter - your report might not add up!
I am creating a search page where users can find a product by entering text.
e.g.: Brings on the night.
My query brings back the records which contain at least one word from this phrase.
Needs:
1. The first row should contain the record with the given sentence.
2. The second row should be the next-best match.
3. The third row the next match after that, etc.
How can I achieve this? Is there an algorithm for it? It would be very helpful if anyone could share their ideas.
Edit:
Sample search order:
1. Brings on the night
2. Whoever Brings the Night
3. Night Baseball Brings
4. Night ride
5. Night Round
6. Brings flower
Geetha
Building a search engine is a very complex undertaking, dealing with ambiguity, human language, typos, and much more. You should try to use whatever comes with your database engine. SQL Server and SQLite have them out of the box and most other databases probably have similar capabilities. These engines aren't particularly good, but they should suffice for simple scenarios. For more serious work, try Lucene, which comes in various flavors for different programming languages.
Have you tried full-text search?
http://msdn.microsoft.com/en-us/library/ms142583.aspx
As a really simple solution you could use SQL's LIKE operator. Instead of
select object_name from table_name where parameter = something
You would do
select object_name from table_name where parameter LIKE '%something%'
This might work for very simple scenarios.
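If you fetch candidate rows with LIKE and need the ordering shown in the question, you can rank them in application code. A naive sketch (the ScoreTitle helper and its weights are made up, and matching short words like "on" by substring is crude; a real version would tokenize):

    using System.Linq;

    // Rank: exact match > contains the whole phrase > more matching words.
    static int ScoreTitle(string title, string query)
    {
        string t = title.ToLowerInvariant();
        string q = query.ToLowerInvariant();

        if (t == q) return 1000;                                // exact match wins
        int score = t.Contains(q) ? 500 : 0;                    // whole phrase present
        score += q.Split(' ').Count(w => t.Contains(w)) * 10;   // per-word hits
        return score;
    }

    // usage: results.OrderByDescending(r => ScoreTitle(r, searchText))

With the sample data this puts the exact title first, "Whoever Brings the Night" next, then "Night Baseball Brings", with the single-word matches tied at the bottom.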
Some pointers
- try your RDBMS full-text search or investigate solutions such as Lucene/Solr
- there are implementations of edit distance (Levenshtein) in SQL, for less trivial hand-made ranking
- n-grams (bigrams, trigrams) can do a lot; see for example all the options in Postgres' built-in search compared to MySQL or MSSQL
Built-in RDBMS searches (Postgres might be an exception) usually offer too few options; implementing your own is usually too hard, or the RDBMS won't let you do it (efficiently).
In Java you have Lucene.
There is also a port of it in PHP (Zend Lucene).
You also have a port to C#: Lucene.NET.
Just by adapting your DB models you can integrate them into the search engine.
Have a look. I've used Lucene in the past and it's always been very effective and efficient.
I've got a requirement where a user enters a few terms into a search box and clicks "go".
Does anyone have any good resources on how to implement a dynamic search that spans a few database tables?
Thanks,
Mike
I'm gonna throw in my vote for Lucene. While SQL Server does provide full text indexing and some search capabilities, it is not the greatest search engine. In my experience, it does not provide the best results or result ranking until you have a significant volume of indexed items (tens of thousands to hundreds of thousands minimum).
In contrast, Lucene is explicitly a search engine. It is an inverted index, behaving much like your run-of-the-mill internet search engine. Lucene provides a very rich indexing and search platform, as well as rich C# and .NET APIs for querying the indexes. There is even a LINQ to Lucene provider that will allow you to query a Lucene index with LINQ.
The one drawback to using Lucene is that you have to build an index, which is a side-band process that runs independently of the database. You have to write your own tool to manage the index as well. And your search index, depending on how frequently you update it, may not be 100% up to date. Generally that is not a huge concern, but if you have the resources, the Lucene index could be incrementally updated every few minutes to keep things "fresh".
It is called Full-text Search.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
This is a pretty loaded question given the lack of detail. If you just need a simple search over a few tables/columns then a single (kludgy) search SP may be enough for you.
That said, if you need more features such as:
Searching a large set of tables
Support for large amounts of data
Searching over forms of a word
Logical operations
etc
then you might want to look into Full-Text Search (which is part of MS SQL 2000 and above). The initial investment to get up to speed with Full-Text Search can be a bit off-putting, but compared to implementing the above features yourself you'll likely save a ton of time and energy.
Here are some Full-Text Search links to get you started:
Msdn Page
Initial Set Up
Set Up Video
Hope that helps.
OK, there were a few requests for more info, so let me provide some.
I have several tables (ie. users, companies, addresses) and I'd like a user to be able to enter something like this:
"microsoft wa gates"
and bring up a result list containing results for "gates", "microsoft", and "washington".
Lucene seems like it could be pretty cool.
You can create an SP that receives the search terms as parameters and returns several SELECTs (recordsets) to the calling program. It can return a SELECT for each table, and you can do whatever you need with the data in your app code.
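To sketch the consuming side (the procedure and parameter names here are made up), ADO.NET walks the returned recordsets with SqlDataReader.NextResult():

    using System.Data;
    using System.Data.SqlClient;

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.SearchAllTables", conn))  // hypothetical SP
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@terms", searchTerms);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            do
            {
                while (reader.Read())
                {
                    // Each result set has its own schema (users, companies,
                    // addresses, ...); inspect the reader's columns here.
                }
            } while (reader.NextResult());  // advance to the next table's SELECT
        }
    }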
If you need to receive only one dataset, you can make a view using a UNION of the tables to consolidate the columns into a common schema, and then filter the view the same way. Your application will then receive a single dataset with all the information consolidated in the view and filtered.
The issue: there is a database with around 20k customer records, and I want to make a best effort to avoid duplicate entries. The database is Microsoft SQL Server 2005; the application that maintains it is Microsoft Dynamics/SL. I am creating an ASP.NET web service that interacts with that database. My service can insert customer records into the database, read records from it, or modify them. Either in my web service, through MS Dynamics, or in SQL Server, I would like to present a list of possible matches before a user confirms adding a new record.
So the user would submit a record; if it seems to be unique, the record is saved and a new ID returned. If there are possible duplicates, the user can then resubmit with a confirmation saying, "yes, I see the possible duplicates; this is a new record, and I want to submit it".
This is easy if it is just a punctuation or spacing issue (such as entering "Company, Inc." when there is a "Company Inc" in the database). But what about slight changes, such as "Company Corp." instead of "Company Inc", or a fat-fingered misspelling such as "Cmpany, Inc."? Is it even possible to return records like that in the list? If it's absolutely not possible, I'll deal with what I have. It just causes more work later on if records need to be merged due to duplication.
The specifics of which algorithm will work best for you depends greatly on your domain, so I'd suggest experimenting with a few different ones - you may even need to combine a few to get optimal results. Abbreviations, especially domain specific ones, may need to be preprocessed or standardized as well.
For the names, you'd probably be best off with a phonetic algorithm - which takes into account pronunciation. These will score Smith and Schmidt close together, as they are easy to confuse when saying the words. Double Metaphone is a good first choice.
For fat fingering, you'd probably be better off with an edit distance algorithm - which gives a "difference" between 2 words. These would score Smith and Smoth close together - even though the 2 may slip through the phonetic search.
T-SQL has SOUNDEX and DIFFERENCE - but they are pretty poor. A Levenshtein variant is the canonical choice, but there's other good choices - most of which are fairly easy to implement in C#, if you can't find a suitably licensed implementation.
All of these are going to be much easier to code and use from C# than T-SQL (though I did find Double Metaphone in a horrendous abuse of T-SQL that might work there).
Though this example is in Access (and I've never actually looked at the code, or used the implementation) the included presentation gives a fairly good idea of what you'll probably end up needing to do. The code is probably worth a look, and perhaps a port from VBA.
Look into SOUNDEXing within SQL Server. I believe it will give you the fuzziness of probable matches that you're looking for.
SOUNDEX # MSDN
SOUNDEX # Wikipedia
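If you'd rather pre-screen candidates in application code, Soundex is only a few lines of C#. This simplified sketch skips the full algorithm's H/W special case, so it can disagree with T-SQL's SOUNDEX on some inputs:

    using System.Linq;
    using System.Text;

    // Simplified Soundex: first letter plus up to three digits, in the same
    // output format as T-SQL's SOUNDEX(). Smith -> S530, Schmidt -> S530.
    static string Soundex(string word)
    {
        const string codes = "01230120022455012623010202";  // one digit per letter A..Z
        var letters = word.ToUpperInvariant().Where(c => c >= 'A' && c <= 'Z').ToArray();
        if (letters.Length == 0) return "0000";

        var result = new StringBuilder().Append(letters[0]);
        char last = codes[letters[0] - 'A'];
        foreach (char c in letters.Skip(1))
        {
            char code = codes[c - 'A'];
            if (code != '0' && code != last)
                result.Append(code);
            last = code;
            if (result.Length == 4) break;
        }
        return result.ToString().PadRight(4, '0');
    }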
If it's possible to integrate Lucene.NET into your solution, you should definitely try it out.
You could try using Full-Text Search with the FreeText (or FreeTextTable) functions to find possible matches.
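For instance, a hedged sketch of a ranked lookup using FREETEXTTABLE (it assumes a full-text index already exists; the table, column, and variable names are illustrative):

    using System;
    using System.Data.SqlClient;

    // FREETEXTTABLE returns [KEY] and RANK columns you can join and sort on.
    const string sql = @"
        SELECT c.CustomerId, c.Name, k.RANK
        FROM FREETEXTTABLE(dbo.Customers, Name, @newName) AS k
        JOIN dbo.Customers AS c ON c.CustomerId = k.[KEY]
        ORDER BY k.RANK DESC;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@newName", proposedCustomerName);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                Console.WriteLine("{0}  {1}", reader["RANK"], reader["Name"]);
    }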