Algorithm for Search page - c#

I am creating a search page where users can find a product by entering text.
For example: "Brings on the night".
My query brings back the records that contain at least one word from this phrase.
Requirements:
1. The first row should contain the record matching the given sentence exactly.
2. The second row should be the next best match.
3. The third row the next best match after that, and so on.
How can I achieve this? Is there an algorithm for this? It would be very helpful if anyone could share their ideas.
Edit:
Sample search Order:
1. Brings on the night
2. Whoever Brings the Night
3. Night Baseball Brings
4. Night ride
5. Night Round
6. Brings flower
Geetha

Building a search engine is a very complex undertaking, dealing with ambiguity, human language, typos, and much more. You should try to use whatever full-text search comes with your database engine. SQL Server and SQLite have it out of the box, and most other databases probably have similar capabilities. These engines aren't particularly good, but they should suffice for simple scenarios. For more serious work, try Lucene, which comes in various flavors for different programming languages.

Have you tried full-text search?
http://msdn.microsoft.com/en-us/library/ms142583.aspx

As a really simple solution you could use SQL's LIKE operator. Instead of
SELECT object_name FROM table_name WHERE parameter = something
you would do
SELECT object_name FROM table_name WHERE parameter LIKE '%something%'
(note the wildcards). This might work for very simple scenarios.
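If you issue that from C#, parameterise it rather than concatenating the user's text into the SQL. A minimal sketch (the connection string, table and column names below are placeholders):

using System;
using System.Data.SqlClient;

string connectionString = "Server=.;Database=MyDb;Integrated Security=true"; // placeholder
string searchText = "Brings on the night";

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "SELECT object_name FROM table_name WHERE parameter LIKE '%' + @term + '%'", connection))
{
    command.Parameters.AddWithValue("@term", searchText);
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
            Console.WriteLine(reader.GetString(0));
    }
}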

Some pointers:
- try your RDBMS's full-text search, or investigate solutions such as Lucene/Solr
- there are SQL implementations of edit distance (Levenshtein) for less trivial hand-made ranking (a small ranking sketch follows below)
- n-grams (bigrams, trigrams) can do a lot; see for example all the options in Postgres's built-in search compared to MySQL or MSSQL
Built-in RDBMS search (Postgres might be an exception) usually offers too few options, and implementing your own is usually too hard, or the RDBMS won't let you do it efficiently.
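To illustrate the hand-made ranking idea for the original question: a minimal sketch (the scoring rule is invented for illustration; exact matches first, then by word overlap) that roughly reproduces the sample order:

using System;
using System.Collections.Generic;
using System.Linq;

static class NaiveRanker
{
    // Splits a phrase into lower-cased words.
    static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Exact matches score highest; otherwise score = number of query words present.
    static int Score(string query, string candidate)
    {
        string[] queryWords = Tokenize(query);
        var candidateWords = new HashSet<string>(Tokenize(candidate));

        if (candidateWords.SetEquals(queryWords))
            return int.MaxValue;                                   // whole sentence matches

        return queryWords.Count(w => candidateWords.Contains(w));  // word overlap
    }

    public static IEnumerable<string> Rank(string query, IEnumerable<string> titles)
    {
        return titles.Select(t => new { Title = t, Score = Score(query, t) })
                     .Where(x => x.Score > 0)
                     .OrderByDescending(x => x.Score)
                     .Select(x => x.Title);
    }
}

Rank("Brings on the night", titles) puts the exact match first, "Whoever Brings the Night" next (three shared words), then "Night Baseball Brings", and the single-word matches last. Anything beyond that (stemming, typos, phrase proximity) is where full-text search or Lucene earns its keep.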

In Java you have Lucene.
There is also a PHP port of it (Zend Lucene).
There is a C# port too: Lucene.NET.
With only small changes to your DB models you can integrate it into the search engine.
Have a look. I've used Lucene in the past and it's always been very effective and efficient.

Related

Smart string search for small collections

I have a pretty small collection of string values in memory (around 8,400 records with an average of 10 words each).
What I'm trying to find out is whether there's a library or something that, when I search for strings within that collection, returns the matching values and can also attach some kind of weight to the results.
This is what I'm trying to do; let's say that I have these records in a List in memory:
Department Store General Manager
General and Operations Manager
General Manager
Restaurant Generally Managers
Restaurant General Manager
Let's say that I'm working on a method that receives a search string and analyzes that collection in order to retrieve the results:
List<string> SearchJobTitles("General Manager")
I want something that will return all the records that contain the words General AND Manager. So far it should be easy: I could do it with regular expressions.
But the tricky part is that I want to apply some weighting rules, such as:
"OK, the third record is a bigger match because it's an EXACT match." "The first and last records should be next because they have the two words with no distance between them." "The second record should be next because it has the two exact words but in a different order." "The 4th record should be last because it only has a partial match of both words."
THAT's the kind of logic I want to apply.
I know there are some libraries like Lucene.NET or Sphinx; I'm not discarding them, I'm just not convinced they're worth using for such a small in-memory collection.
In the worst-case scenario, I'll write an IComparer implementation for the entities, but I want to know if there's something already out there that I could use.
Thanks and regards,
In this particular example the volume of records is small, but that does not reduce the complexity of full-text search.
If you have only a handful of records, it might be a good idea to implement a simple Levenshtein distance (or find an implementation online), tokenise all the phrases, and run your own custom matching algorithm (word distance, maybe synonyms, etc.).
On the other hand, using Lucene.NET gives you that out of the box. You can use RAMDirectory to store the index in memory. And, most importantly, you don't have to spend hours trying to figure out why your custom algorithm does not work as it should. Why reinvent the wheel?
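For illustration, an in-memory index is only a few lines. This is a sketch assuming the Lucene.Net 3.x API (other versions differ slightly), with "titles" standing in for the in-memory list from the question and LoadTitles() being a hypothetical loader:

using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

List<string> titles = LoadTitles();   // hypothetical: the ~8,400 in-memory job titles

var directory = new RAMDirectory();
var analyzer = new StandardAnalyzer(Version.LUCENE_30);

// Index every title into a single analysed field.
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var title in titles)
    {
        var doc = new Document();
        doc.Add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }
    writer.Commit();
}

// Search; Lucene's scoring gives you the weighted ordering for free.
var query = new QueryParser(Version.LUCENE_30, "title", analyzer).Parse("General Manager");
using (var searcher = new IndexSearcher(directory, true))
{
    foreach (var hit in searcher.Search(query, 10).ScoreDocs)
        Console.WriteLine(searcher.Doc(hit.Doc).Get("title") + "  score=" + hit.Score);
}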
Alternative?
Are you using any SQL database in your application? If so, it might be worth leveraging the full-text search built into modern SQL databases.

Parsing existing "complex" SQL statements and converting into calls to custom API calls

I have a situation where I have several hundred complex Excel spreadsheets, each with multiple pivot tables running queries against a SQL database. I need to be able to convert these SQL queries into function calls against a proprietary data store. This is complicated at many levels, but the part I am asking about now, which seems likely to have been addressed before in computer science, is how to "parse" the SQL statements into a well-defined structure that I can work with programmatically.
An example of my starting point:
SELECT vwFlowDataBest.MeasurementDate, vwFlowDataBest.LocationType, vwFlowDataBest.ScheduledVolume, tblPoints.Zone, tblPoints.Name AS SOME_ALIAS_FOR_NAME, vwFlowDataBest.PointID, tblCustomerType.Name, vwFlowDataBest.OperationallyAvailable, tblPoints.County, tblPoints.State, tblConnectingParty.Name
FROM Pipe2Pipe.dbo.tblConnectingParty tblConnectingParty, Pipe2Pipe.dbo.tblCustomerType tblCustomerType, Pipe2Pipe.dbo.tblPipelines tblPipelines, Pipe2Pipe.dbo.tblPoints tblPoints, Pipe2Pipe.dbo.vwFlowDataBest vwFlowDataBest
WHERE tblCustomerType.ID = tblPoints.CustomerTypeID AND tblPipelines.ID = vwFlowDataBest.PipelineID AND tblPoints.ID = vwFlowDataBest.PointID AND tblPoints.ConnectingPartyID = tblConnectingParty.ID AND ((tblPipelines.ID=16) AND (vwFlowDataBest.ScheduledVolume<>0) AND (tblPoints.Zone In ('mid 1','mid 2','mid 3','mid 4','mid 5','mid 6','mid 7')) AND (tblCustomerType.ID=16) AND (vwFlowDataBest.MeasurementDate>={ts '2010-05-15 00:00:00'}) AND (tblPipelines.ID<155))
So for this statement, I need to programmatically handle the SELECT portion, the FROM portion, and the WHERE portion, and the subordinate parts within each. Complications include aliases, differentiating between a join between tables and a plain old value filter in the WHERE clause, the grouping (brackets) within the WHERE clause, and other issues. Dealing with the complexities of Excel pivot tables is entirely outside the scope of this question; I can figure that out.
For now, I don't mind not supporting certain SQL features, such as GROUP BY, HAVING, etc. For my problem those are small enough that, if necessary, I can handle them manually. But if there's a known way to handle those as well, I'd be most happy.
My feeling is that I can probably get 70% of the way there (for my problem) just by splitting the SQL statement into three parts, and then further breaking each of those down into its logical subordinate parts and dealing with them accordingly. But as I write this I can already see holes in my plan... this feels like a tarpit of complexity and edge cases.
I can't imagine I'm the first person to want to do such a thing, so my question is, are there old, proven approaches to this sort of problem, existing libraries, innovative approaches I could take, or any suggestions in general to apply to this task?
You seem to need a SQL parser (or at least part of one). It may be overkill for your purposes (more complete than you need), but there's a PL/SQL parser for ANTLR that might be useful.
Edit: I didn't really read that grammar as carefully as I should have before I posted the link. Doing a bit of looking, it doesn't really parse select statements at all -- it just recognizes where one is, and skips across it.
The ANTLR grammars page lists several more SQL grammars though (for the variants supported/used by MySQL, Oracle, etc.) Since you have C# and such in the tags, it's probably fair to guess you want to parse the MS SQL Server variant. There's a grammar strictly for its select statement that may be a reasonable fit for your needs.
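Whichever grammar you pick, the C# side with ANTLR is fairly mechanical. A sketch only, assuming the ANTLR4 C# runtime: TSqlLexer, TSqlParser and select_statement are hypothetical stand-ins for whatever names the grammar you choose actually generates.

using Antlr4.Runtime;
using Antlr4.Runtime.Tree;

// TSqlLexer / TSqlParser are hypothetical names for the classes ANTLR generates
// from whichever SQL grammar you pick; the entry rule ("select_statement")
// also depends on that grammar.
string sqlText = "...";   // the statement pulled from the pivot table

var input  = new AntlrInputStream(sqlText);
var lexer  = new TSqlLexer(input);
var tokens = new CommonTokenStream(lexer);
var parser = new TSqlParser(tokens);

IParseTree tree = parser.select_statement();

// From here, implement the generated base listener (or visitor) and collect the
// column list, table sources and WHERE predicates as the walker fires callbacks:
// ParseTreeWalker.Default.Walk(myListener, tree);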

RDF integration with C#

I'm working on a project where I have been asked to do a semantic search. The scenario is a database with a table containing three pieces of information: Doctor Name, Patient Name, and Date of Visit. I have been asked to create a form that contains three fields: Doctor, Patient and Date. So when a user wants to search for a patient's corresponding doctor, or the doctors' corresponding patients, or their dates, they can just enter any of the fields to retrieve the information from the database. I have done the coding in C# using regular expressions for string manipulation and information retrieval. But the main task is that the search should work using RDF and URIs.
Now that I have done most of the coding, can someone help me with how to create the search using RDF and URIs? Is there any solution for this? How can I integrate RDF in C#? Is there any documentation?
As per my supervisor's requirements, the search has to work with RDF: I mean the details of patients (e.g. a patient's name), the doctor's name and the date would be in the form of URIs which locate the details of the patients, doctors and dates in the database, so that anyone searching for information such as a doctor or patient can just enter the name in the corresponding field and retrieve the information. I'm attaching two snapshots of my code for your understanding.
Image 1: http://img29.imageshack.us/i/15035706.jpg
Image 2: http://img31.imageshack.us/img31/1117/86105845.jpg
The first image is where I enter all the details to the database and the second image is the search.
This is the overall idea of my project; can you advise me on how this can be done?
I would be really grateful if someone could help me with this as soon as possible.
Doing an RDF and URI based search is going to depend on whether your data is in RDF in the first place. If it's not, you've either got to convert it from its current form into RDF on the fly or permanently. To do it on the fly you could use a technology like D2R, which maps relational databases to RDF: http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/
There's some other Semantic Web C# stuff about, like Rowlex (http://rowlex.nc3a.nato.int/), which is more OWL based, or there's my own dotNetRDF library (http://www.dotnetrdf.org), but that's only just about to have its first Alpha release so I wouldn't recommend it for any production systems yet. SemWeb, as Alex mentions, is pretty good and scales particularly well; the only disadvantage is that it's .NET 2.0, so you need a separate library if you want to do LINQ with it.
A question about your question...
Your question is unclear about what you mean by semantic search, are you sure you're actually meaning to do an RDF search or did someone just specify "semantic search" in the spec and you googled it and got articles about RDF? Semantic search doesn't necessarily imply a need for RDF, it could be that you actually want to do natural language search.
By this I mean that it could be that you want the ability to search for things like "patients of Dr Smith" and that your search engine should be able to interpret this as a search for patients where the doctor field corresponds to Dr Smith.
Equally I could be wrong and you could indeed be attempting to build something that sounds very like TimBL's example from his 2001 Scientific American article on the Semantic Web.
Edit
Since you do want to do a proper RDF search, I would advise that you put your data into a triple store rather than a relational database, and preferably use a triple store that provides SPARQL querying, so you can convert the inputs on your query form into a SPARQL query and run that against the store.
Maybe take a look at Talis http://www.talis.com or Virtuoso http://www.openlinksw.com/virtuoso/
If you decide to use SemWeb then you could just use the Triple Store that it provides.
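Either way, the form-to-SPARQL step is mostly string building: add a FILTER only for the fields the user filled in. A rough sketch (the ex: predicates are invented placeholders for whatever vocabulary your triples actually use, and user input is not escaped here):

using System.Text;

static class VisitSearch
{
    // Builds a SPARQL query from the three (optional) form fields.
    public static string BuildQuery(string doctor, string patient, string date)
    {
        var sb = new StringBuilder();
        sb.AppendLine("PREFIX ex: <http://example.org/hospital#>");
        sb.AppendLine("SELECT ?doctor ?patient ?date WHERE {");
        sb.AppendLine("  ?visit ex:doctorName ?doctor ;");
        sb.AppendLine("         ex:patientName ?patient ;");
        sb.AppendLine("         ex:visitDate ?date .");

        if (!string.IsNullOrEmpty(doctor))
            sb.AppendLine("  FILTER(regex(?doctor, \"" + doctor + "\", \"i\"))");
        if (!string.IsNullOrEmpty(patient))
            sb.AppendLine("  FILTER(regex(?patient, \"" + patient + "\", \"i\"))");
        if (!string.IsNullOrEmpty(date))
            sb.AppendLine("  FILTER(str(?date) = \"" + date + "\")");

        sb.AppendLine("}");
        return sb.ToString();
    }
}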
You may be able to do what you need using LinqToRdf. LinqToRdf exposes two LINQ query providers (i.e. you will need .NET 3.5+) including one that produces standards compliant SPARQL queries.
Here's a typical LinqToRdf Query, which if you're familiar with LINQ to SQL, should be totally natural:
MusicDataContext ctx = new MusicDataContext(@"http://localhost/linqtordf/SparqlQuery.aspx");
var q = (from t in ctx.Tracks
         where t.Year == "2006" &&
               t.GenreName == "History 5 | Fall 2006 | UC Berkeley"
         orderby t.FileLocation
         select new { t.Title, t.FileLocation }).Skip(10).Take(5);

foreach (var track in q)
{
    Console.WriteLine(track.Title + ": " + track.FileLocation);
}
I suggest you try RDFSharp (http://rdfsharp.codeplex.com/) because, as far as I can understand from your question, you probably need to quickly set up an RDF application capable of performing elementary triple-based searches like SUBJECT="xxx"; PREDICATE=NULL; OBJECT="yyy".
Feel free to try it; of course there are more powerful tools, but for your scenario I believe it is the simplest to apply.
Using semantic web technologies for the scenario you describe is overkill. However, if you are interested in a mature .NET library for working with Semantic Web standards in .NET and SQL definitely take a look at Intellidimension's offerings.
A C# library for RDF which seems to be becoming quite popular in the community is LinqToRDF. The project was originated by Andrew Matthews and has been going since 2007, I think. The software is up on Google Code and can be found here:
LinqToRDF
Together with the library, there's also something called "LinqToRDF Designer" which fits into Visual Studio and allows you to model RDF graphically.

What's the best way to implement a search?

I've got a requirement where a user enters a few terms into a search box and clicks "go".
Does anyone have any good resources on how to implement a dynamic search that spans a few database tables?
Thanks,
Mike
I'm gonna throw in my vote for Lucene. While SQL Server does provide full text indexing and some search capabilities, it is not the greatest search engine. In my experience, it does not provide the best results or result ranking until you have a significant volume of indexed items (tens of thousands to hundreds of thousands minimum).
In contrast, Lucene is explicitly a search engine. It is an inverted index, behaving much like your run-of-the-mill internet search engine. Lucene provides a very rich indexing and search platform, as well as rich C# and .NET APIs for querying the indexes. There is even a LINQ to Lucene provider that will allow you to query a Lucene index with LINQ.
The one drawback to using Lucene is that you have to build an index, which is a side-band process that runs independently of the database. You have to write your own tool to manage the index as well. Your search index, depending on how frequently you update it, may not be 100% up to date. Generally, that is not a huge concern, but if you have the resources, the Lucene index could be incrementally updated every few minutes to keep things "fresh".
It is called Full-text Search.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
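Once the relevant columns have a full-text index, a query across a few tables can be as simple as this sketch (the table and column names are placeholders for your schema):

// Sketch only: assumes full-text indexes already exist on the named columns.
const string searchSql = @"
    SELECT Name AS Result FROM Users     WHERE CONTAINS(Name, @terms)
    UNION ALL
    SELECT Name AS Result FROM Companies WHERE CONTAINS(Name, @terms)
    UNION ALL
    SELECT City AS Result FROM Addresses WHERE CONTAINS(City, @terms)";

// @terms uses the CONTAINS search syntax, e.g. '"microsoft" OR "gates" OR "washington"';
// run it with SqlCommand, adding @terms as a parameter, like any other ADO.NET query.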
This is a pretty loaded question given the lack of detail. If you just need a simple search over a few tables/columns then a single (kludgy) search stored procedure may be enough for you.
That said, if you need more features such as:
Searching a large set of tables
Support for large amounts of data
Searching over forms of a word
Logical operations
etc
then you might want to look into Full-Text Search (which is a part of MS SQL Server 2000 and above). The initial investment to get up to speed with Full-Text Search can be a bit off-putting, but compared to implementing the above features yourself you'll likely save a ton of time and energy.
Here are some Full-Text Search links to get you started:
Msdn Page
Initial Set Up
Set Up Video
Hope that helps.
Ok there were a few requests for more info so let me provide some.
I have several tables (ie. users, companies, addresses) and I'd like a user to be able to enter something like this:
"microsoft wa gates"
and bring up a result list containing results for "gates", "microsoft", and "washington".
Lucene seems like it could be pretty cool.
You can create a stored procedure that receives the search terms as parameters and returns several "selects" (recordsets) to the calling program. It can return a select for each table, and you can do whatever you need with the data in your application code.
If you need to receive only one dataset, you can make a view using a UNION of the tables to consolidate the columns into a common schema, and then filter the view the same way. Your application will then receive a single dataset with all the information consolidated in the view and filtered.
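If you go the stored procedure route, ADO.NET reads the one-recordset-per-table result with SqlDataReader.NextResult(). A rough sketch (the procedure name, parameters and connection string are made up):

using System;
using System.Data;
using System.Data.SqlClient;

string connectionString = "Server=.;Database=MyDb;Integrated Security=true"; // placeholder

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("usp_SearchAll", connection))
{
    command.CommandType = CommandType.StoredProcedure;      // hypothetical procedure
    command.Parameters.AddWithValue("@terms", "microsoft wa gates");
    connection.Open();

    using (var reader = command.ExecuteReader())
    {
        do
        {
            while (reader.Read())
                Console.WriteLine(reader.GetString(0));      // first column of each recordset
        }
        while (reader.NextResult());                         // move to the next table's select
    }
}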

intelligent database search

The issue is there is a database with around 20k customer records and I want to make a best effort to avoid duplicate entries. The database is Microsoft SQL Server 2005, the application that maintains that database is Microsoft Dynamics/SL. I am creating an ASP.NET webservice that interacts with that database. My service can insert customer records into the database, read records from it, or modify those records. Either in my webservice, or through MS Dynamics, or in Sql Server, I would like to give a list of possible matches before a user confirms a new record add.
So the user would submit a record, if it seems to be unique, the record will save and return a new ID. If there are possible duplications, the user can then resubmit with a confirmation saying, "yes, I see the possible duplicates, this is a new record, and I want to submit it".
This is easy if it is just a punctuation or space thing (such as if you are entering "Company, Inc." and there is a "Company Inc" in the database). But what if there are slight changes such as "Company Corp." instead of "Company Inc", or a fat-fingered misspelling such as "Cmpany, Inc."? Is it even possible to return records like that in the list? If it's absolutely not possible, I'll deal with what I have. It just causes more work later on if records need to be merged due to duplications.
The specifics of which algorithm will work best for you depends greatly on your domain, so I'd suggest experimenting with a few different ones - you may even need to combine a few to get optimal results. Abbreviations, especially domain specific ones, may need to be preprocessed or standardized as well.
For the names, you'd probably be best off with a phonetic algorithm - which takes into account pronunciation. These will score Smith and Schmidt close together, as they are easy to confuse when saying the words. Double Metaphone is a good first choice.
For fat fingering, you'd probably be better off with an edit distance algorithm - which gives a "difference" between 2 words. These would score Smith and Smoth close together - even though the 2 may slip through the phonetic search.
T-SQL has SOUNDEX and DIFFERENCE, but they are pretty poor. A Levenshtein variant is the canonical choice, but there are other good options, most of which are fairly easy to implement in C# if you can't find a suitably licensed implementation.
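For reference, a plain C# Levenshtein distance is about twenty lines (the classic dynamic-programming version; pick a threshold that suits your data):

using System;

static class Fuzzy
{
    // Classic dynamic-programming Levenshtein edit distance: the number of
    // single-character inserts, deletes and substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,           // deletion
                    d[i, j - 1] + 1),          // insertion
                    d[i - 1, j - 1] + cost);   // substitution
            }
        }
        return d[a.Length, b.Length];
    }
}

// e.g. Fuzzy.Levenshtein("Company, Inc.", "Cmpany, Inc.") == 1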
All of these are going to be much easier to code/use from C# than T-SQL (though I did find double metaphone in a horrendous abuse of T-SQL that may work in SQL).
Though this example is in Access (and I've never actually looked at the code, or used the implementation) the included presentation gives a fairly good idea of what you'll probably end up needing to do. The code is probably worth a look, and perhaps a port from VBA.
Look into SOUNDEXing within SQL Server. I believe it will give you the fuzziness of probable matches that you're looking for.
SOUNDEX @ MSDN
SOUNDEX @ Wikipedia
If it's possible to integrate Lucene.NET into your solution, you should definitely try it out.
You could try using Full Text Search with FreeText (or FreeTextTable) functions to try to find possible matches.
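For example, FREETEXTTABLE returns a relevance RANK you can sort on, which gives you a ready-made "possible duplicates" list. A sketch only (it assumes a full-text index on a tblCustomer.CompanyName column and that ID is the table's full-text key; both are placeholders for your actual schema):

// Sketch only: assumes a full-text index on tblCustomer(CompanyName).
const string possibleDuplicatesSql = @"
    SELECT   c.ID, c.CompanyName, ft.RANK
    FROM     FREETEXTTABLE(tblCustomer, CompanyName, @newName) ft
    JOIN     tblCustomer c ON c.ID = ft.[KEY]
    ORDER BY ft.RANK DESC;";

// Run it with @newName set to the name the user is about to add, and show the
// top rows as "possible duplicates" before committing the insert.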
