The issue: there is a database with around 20k customer records, and I want to make a best effort to avoid duplicate entries. The database is Microsoft SQL Server 2005, and the application that maintains it is Microsoft Dynamics/SL. I am creating an ASP.NET webservice that interacts with that database. My service can insert customer records into the database, read records from it, or modify those records. Either in my webservice, through MS Dynamics, or in SQL Server, I would like to give a list of possible matches before a user confirms adding a new record.
So the user would submit a record; if it seems to be unique, the record saves and returns a new ID. If there are possible duplicates, the user can then resubmit with a confirmation saying, "yes, I see the possible duplicates, this is a new record, and I want to submit it".
This is easy if it is just a punctuation or spacing difference (such as entering "Company, Inc." when there is a "Company Inc" in the database). But what if there are slight changes, such as "Company Corp." instead of "Company Inc", or a fat-fingered misspelling such as "Cmpany, Inc."? Is it even possible to return records like that in the list? If it's absolutely not possible, I'll deal with what I have. It just causes more work later on if records need to be merged due to duplication.
The specifics of which algorithm will work best for you depend greatly on your domain, so I'd suggest experimenting with a few different ones - you may even need to combine a few to get optimal results. Abbreviations, especially domain-specific ones, may need to be preprocessed or standardized as well.
For the names, you'd probably be best off with a phonetic algorithm - which takes into account pronunciation. These will score Smith and Schmidt close together, as they are easy to confuse when saying the words. Double Metaphone is a good first choice.
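Double Metaphone itself is too long to reproduce here, but as a feel for what a phonetic key looks like, here is a minimal C# sketch of the much cruder classic Soundex (simplified, and no substitute for a real Double Metaphone implementation):

using System.Linq;
using System.Text;

static class Phonetic
{
    // Simplified Soundex: the first letter plus digits for consonant classes.
    // Illustrative only - Double Metaphone handles far more edge cases.
    public static string Soundex(string word)
    {
        const string codes = "01230120022455012623010202"; // A..Z, '0' = ignored
        string upper = new string((word ?? "").ToUpperInvariant()
                                              .Where(c => c >= 'A' && c <= 'Z').ToArray());
        if (upper.Length == 0) return string.Empty;

        var key = new StringBuilder().Append(upper[0]);
        char previous = codes[upper[0] - 'A'];
        foreach (char c in upper.Skip(1))
        {
            char code = codes[c - 'A'];
            if (code != '0' && code != previous) key.Append(code);
            previous = code;
        }
        return key.ToString().PadRight(4, '0').Substring(0, 4);
    }
}

// Soundex("Smith") and Soundex("Schmidt") both come out as "S530",
// so the two would be offered as possible duplicates.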
For fat fingering, you'd probably be better off with an edit distance algorithm - which gives a "difference" between 2 words. These would score Smith and Smoth close together - even though the 2 may slip through the phonetic search.
T-SQL has SOUNDEX and DIFFERENCE - but they are pretty poor. A Levenshtein variant is the canonical choice, but there are other good choices - most of which are fairly easy to implement in C#, if you can't find a suitably licensed implementation.
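To give an idea of how little code that takes, here is a plain (unoptimized) Levenshtein distance sketch in C#:

using System;

static class EditDistance
{
    // Classic dynamic-programming Levenshtein distance: the number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1,      // deletion
                             d[i, j - 1] + 1),     // insertion
                    d[i - 1, j - 1] + cost);       // substitution
            }
        }
        return d[a.Length, b.Length];
    }
}

// Levenshtein("Smith", "Smoth") == 1, so a fat-fingered "Cmpany" lands right
// next to "Company" even if a phonetic search misses it.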
All of these are going to be much easier to code/use from C# than T-SQL (though I did find Double Metaphone in a horrendous abuse of T-SQL that may work in SQL).
Though this example is in Access (and I've never actually looked at the code, or used the implementation) the included presentation gives a fairly good idea of what you'll probably end up needing to do. The code is probably worth a look, and perhaps a port from VBA.
Look into SOUNDEXing within SQL Server. I believe it will give you the fuzziness of probable matches that you're looking for.
SOUNDEX # MSDN
SOUNDEX # Wikipedia
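For example, from the web service you could rank existing customers by DIFFERENCE, which compares SOUNDEX codes and returns 0-4 (4 being the closest). This is only a sketch - the table, column and variable names (Customer, CompanyName, connectionString, newCompanyName) are invented for illustration:

using System.Collections.Generic;
using System.Data.SqlClient;

// Find existing customers whose name "sounds like" the one about to be added.
var possibleDuplicates = new List<string>();

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    @"SELECT TOP 10 CustomerId, CompanyName
      FROM Customer
      WHERE DIFFERENCE(CompanyName, @newName) >= 3
      ORDER BY DIFFERENCE(CompanyName, @newName) DESC", conn))
{
    cmd.Parameters.AddWithValue("@newName", newCompanyName);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            possibleDuplicates.Add(reader.GetString(1));
    }
}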
If it's possible to integrate Lucene.NET into your solution, you should definitely try it out.
You could try using Full Text Search with FreeText (or FreeTextTable) functions to try to find possible matches.
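Roughly, and assuming a full-text index has already been created on the customer name column (the names below are made up), the query side would look something like this:

using System.Data.SqlClient;

// FREETEXT matches on word forms rather than exact spelling, but it
// requires a full-text index on Customer.CompanyName to exist.
var cmd = new SqlCommand(
    @"SELECT CustomerId, CompanyName
      FROM Customer
      WHERE FREETEXT(CompanyName, @newName)", conn);
cmd.Parameters.AddWithValue("@newName", newCompanyName);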
I have a pretty small collection of string values in memory (around 8400 records with an average of 10 words each).
What I'm trying to find out is whether there's a library or something that, when I search for strings within that collection, returns the matching values, and that could also attach some kind of weight to the results.
This is what I'm trying to do; let's say that I have these records in a List in memory:
Department Store General Manager
General and Operations Manager
General Manager
Restaurant Generally Managers
Restaurant General Manager
Let's say that I'm working on a method that receives a search string and it will analyze that collection in order to retrieve the results:
List<string> SearchJobTitles("General Manager")
I want something that will return all the records that contain the words General AND Manager. So far it should be easy: I could do it with regular expressions.
But the tricky part is that I want to apply some weighting rules, saying:
"OK: the third record is a bigger match cause it's an EXACT match." "The first and last record should be next cause they have the two words with no distance between them". "The second record should be next cause it has the two exact words but in different order" "The 4th record should be last cause it has a partial match of both words"
THAT's the kind of logic I want to apply.
I know there are some libraries like Lucene.NET or Sphinx: I'm not discarding them; I'm just not convinced they're worth using for such a small in-memory collection.
In the worst-case scenario, I'll work on an IComparer implementation for the entities, but I want to know if there's something out there I could already use.
Thanks and regards,
In this particular example the volume of records is small, but that still does not decrease the complexity of full-text search.
If you have only 5 records it might be a good idea to implement a simple Levenshtein distance (or find an implementation online), tokenise all phrases and do your custom matching algorithm (word distance, maybe synonyms, etc.).
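A minimal sketch of that kind of hand-rolled scoring, following the weighting rules from the question (exact match, then adjacent words in order, then both words in any order, then partial matches) - the score values are arbitrary:

using System;
using System.Linq;

static int Score(string title, string[] searchTokens)
{
    var tokens = title.ToLowerInvariant()
                      .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    string normalized = string.Join(" ", tokens);
    string search = string.Join(" ", searchTokens);

    if (normalized == search) return 100;                                    // exact match
    if (normalized.Contains(search)) return 80;                              // words adjacent, in order
    if (searchTokens.All(t => tokens.Contains(t))) return 60;                // all words, any order
    if (searchTokens.All(t => tokens.Any(w => w.StartsWith(t)))) return 40;  // partial matches
    return 0;
}

// var ranked = titles
//     .Select(t => new { Title = t, Score = Score(t, new[] { "general", "manager" }) })
//     .Where(x => x.Score > 0)
//     .OrderByDescending(x => x.Score)
//     .Select(x => x.Title)
//     .ToList();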
On the other hand, using Lucene.NET gives you that out of the box. You can use RAMDirectory to store the index in memory. And what's most important, you don't have to spend hours trying to figure out why your custom algorithm does not work as it should. Why reinvent the wheel?
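For a collection this size the whole thing is only a handful of lines. A sketch assuming Lucene.Net 3.x (the field name and variables are my own):

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Index the in-memory titles into a RAMDirectory...
var directory = new RAMDirectory();
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var title in titles)
    {
        var doc = new Document();
        doc.Add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }
}

// ...then search it and let Lucene do the scoring and ranking.
var parser = new QueryParser(Version.LUCENE_30, "title", analyzer);
var query = parser.Parse("general manager");
using (var searcher = new IndexSearcher(directory, true))
{
    foreach (var hit in searcher.Search(query, 10).ScoreDocs)
        Console.WriteLine("{0:0.00}  {1}", hit.Score, searcher.Doc(hit.Doc).Get("title"));
}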
Alternative?
Are you using any SQL database in your application? If so, it may be worth leveraging the full-text search built into modern SQL databases.
I have an application where I need to search in various text-based fields. The application is developed using NHibernate as an ORM.
I would like to implement Porter Stemming in searches, in order to be able to return relevant results even when the keyword matches a similar word, for example the description of a product contains memories while the search keyword is memory.
Can anyone suggest best practices for this type of search? The first idea that comes to mind is to store two versions of the same field in the database, for example:
Description
Description_Search
The Description column would be the text as entered by the website administrator, and is the text visible on the frontend.
The Description_Search would include the same text, but passed through a Porter-Stemming algorithm. Search queries would then be based on the Description_Search field, rather than Description.
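Roughly, the write and read sides of that idea would look like the sketch below. PorterStemmer is a stand-in for whatever stemmer implementation ends up being used (it is not a built-in class), and the commented NHibernate criteria call is just one possible way to query the search column:

using System.Linq;

// Hypothetical stemmer - swap in a real Porter stemmer implementation.
var stemmer = new PorterStemmer();

// On save: keep the original text, plus a stemmed copy used only for searching.
product.Description = rawText;
product.DescriptionSearch = string.Join(" ",
    rawText.Split(' ').Select(word => stemmer.Stem(word)));

// On search: stem the keyword the same way, so "memory" matches "memories".
string stemmedKeyword = stemmer.Stem(keyword);
// var results = session.CreateCriteria<Product>()
//     .Add(Restrictions.Like("DescriptionSearch", "%" + stemmedKeyword + "%"))
//     .List<Product>();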
Does this make sense? Is it a waste of space having to store two versions of almost the same text?
Also, would Lucene.Net help in such a case? I am also looking into integrating Lucene.Net for full-text based searches but haven't yet looked into it in detail.
Thanks in advance!
There's no need to use two fields for this; one would be enough. A field has two "values": one stored, which can be retrieved using Document.Get(...), and one indexed, which is used for searching. It's not technically required to store the values either; a common solution is to store an id that's used to look up the original content in a database. This would also allow you to look up more information, like author information and document location.
Lucene.Net would help in this case, but it requires you to write the infrastructure yourself. You would need to take care of configuring analyzers (usually nothing to configure) and index your content. As mentioned in a comment, you could go for SQL Server's Full Text Search functionality, but that itself has some limitations (which may not affect you).
One big problem I've had using SQL Server's FTS, but which works in Lucene.Net (this isn't really fair, since in Lucene.Net you can do almost anything because you write the code that does it), is accent sensitivity. I've been unable to configure it using Swedish language rules, where åäö should be treated as real characters. Enabling accent sensitivity would do this, but it would also mean that diacritics are parsed as real characters, which means that ñ differs from n. (Imagine searching for "jalapeno" and getting no matches for "jalapeño".) Disabling accent sensitivity basically removes all diacritics, turning åäö into aao, and words turn out completely different.
Writing things in Lucene.Net (compared to SQL Server FTS) allows you to provide result highlighting (present which phrases in a document that matches the query), search for similar documents, spell-checking, custom result boosting, facets, and other things that would enhance your users' search experience.
I have a situation where I have several hundreds of complex Excel spreadsheets, each with multiple pivot tables running queries against a SQL database. I need to be able to convert these SQL queries into function calls against a proprietary data store. This is complicated at many levels, but the part I am asking about now, which seems likely to have been addressed before in computer science, is how to "parse" the SQL statements into a well-defined structure that I can work with programmatically.
An example of my starting point:
SELECT vwFlowDataBest.MeasurementDate, vwFlowDataBest.LocationType, vwFlowDataBest.ScheduledVolume, tblPoints.Zone, tblPoints.Name AS SOME_ALIAS_FOR_NAME, vwFlowDataBest.PointID, tblCustomerType.Name, vwFlowDataBest.OperationallyAvailable, tblPoints.County, tblPoints.State, tblConnectingParty.Name
FROM Pipe2Pipe.dbo.tblConnectingParty tblConnectingParty, Pipe2Pipe.dbo.tblCustomerType tblCustomerType, Pipe2Pipe.dbo.tblPipelines tblPipelines, Pipe2Pipe.dbo.tblPoints tblPoints, Pipe2Pipe.dbo.vwFlowDataBest vwFlowDataBest
WHERE tblCustomerType.ID = tblPoints.CustomerTypeID AND tblPipelines.ID = vwFlowDataBest.PipelineID AND tblPoints.ID = vwFlowDataBest.PointID AND tblPoints.ConnectingPartyID = tblConnectingParty.ID AND ((tblPipelines.ID=16) AND (vwFlowDataBest.ScheduledVolume<>0) AND (tblPoints.Zone In ('mid 1','mid 2','mid 3','mid 4','mid 5','mid 6','mid 7')) AND (tblCustomerType.ID=16) AND (vwFlowDataBest.MeasurementDate>={ts '2010-05-15 00:00:00'}) AND (tblPipelines.ID<155))
So for this statement, I need to programmatically handle the SELECT portion, the FROM portion, and the WHERE portion, and the subordinate parts within each. Complications include aliases, differentiating between a join between tables and a plain old value filter in the WHERE clause, the grouping (brackets) within the WHERE clause, and other issues. Dealing with the complexities of Excel pivot tables is entirely outside the scope of this question; I can figure that out.
For now, I don't mind not supporting certain sql functions, such as "group by", "having", etc...for my problem, those are small enough that if necessary I can handle those manually. But if there's a known way to handle that as well, I'd be most happy.
My feeling is that I can probably get 70% of the way there (for my problem) just by splitting the sql statement into 3 parts, and then further breaking each of those down into their logical subordinate parts and then deal with them accordingly. But as I write this I can already see holes in my plan...this feels like a tarpit of complexity and edge cases.
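To make the plan concrete, the naive three-way split I have in mind would be something like the sketch below - it only copes with a single flat SELECT ... FROM ... WHERE statement (no subqueries, GROUP BY, nested SELECTs), which is exactly where I expect it to fall apart:

using System.Linq;
using System.Text.RegularExpressions;

// Naive split of a flat SELECT statement into its three clauses.
var match = Regex.Match(sql,
    @"^\s*SELECT\s+(?<select>.+?)\s+FROM\s+(?<from>.+?)(\s+WHERE\s+(?<where>.+))?\s*$",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

var selectList = match.Groups["select"].Value.Split(',').Select(s => s.Trim()).ToList();
var fromList   = match.Groups["from"].Value.Split(',').Select(s => s.Trim()).ToList();
var whereText  = match.Groups["where"].Value; // still needs real parsing of AND/OR and parentheses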
I can't imagine I'm the first person to want to do such a thing, so my question is, are there old, proven approaches to this sort of problem, existing libraries, innovative approaches I could take, or any suggestions in general to apply to this task?
You seem to need a SQL parser (or at least part of one). It may be overkill for your purposes (more complete than you need), but there's a PL/SQL parser for ANTLR that might be useful.
Edit: I didn't really read that grammar as carefully as I should have before I posted the link. Doing a bit of looking, it doesn't really parse select statements at all -- it just recognizes where one is, and skips across it.
The ANTLR grammars page lists several more SQL grammars though (for the variants supported/used by MySQL, Oracle, etc.) Since you have C# and such in the tags, it's probably fair to guess you want to parse the MS SQL Server variant. There's a grammar strictly for its select statement that may be a reasonable fit for your needs.
I am creating a search page where we can find the product by entering the text.
ex: Brings on the night.
My query brings back the records which contain at least one word from this.
Needs:
1. The first row should contain the record with the given sentence.
2. The second row the next closest match.
3. The third row the next closest match, etc.
How can I achieve this? Is there an algorithm for this? It would be very helpful if anyone could share their ideas.
Edit:
Sample search Order:
1. Brings on the night
2. Whoever Brings the Night
3. Night Baseball Brings
4. Night ride
5. Night Round
6. Brings flower
Geetha
Building a search engine is a very complex undertaking, dealing with ambiguity, human language, typos, and much more. You should try to use whatever comes with your database engine. SQL Server and SQLite have them out of the box and most other databases probably have similar capabilities. These engines aren't particularly good, but they should suffice for simple scenarios. For more serious work, try Lucene, which comes in various flavors for different programming languages.
Have you tried full-text search?
http://msdn.microsoft.com/en-us/library/ms142583.aspx
As a really simple solution you could use SQL's LIKE operator. Instead of
select object_name from table_name where parameter = something
You would do
select object_name from table_name where parameter LIKE '%something%'
This might work for very simple scenarios.
Some pointers
- try your RDBMS full text search or investigate solutions such as Lucene/Solr
- there are implementations of edit distance (Levenshtein) in SQL, for not-so-trivial hand-made ranking
- n-grams (bigrams, trigrams) can do a lot - see for example all the options in Postgres' internal search compared to MySQL or MSSQL (a small sketch follows after these pointers)
Internal RDBMS searches (Postgres might be an exception) usually offer too few options; implementing your own is usually too hard, or the RDBMS won't let you do it (efficiently).
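As a rough illustration of the n-gram idea, here is a trigram similarity sketch in C# (a Dice coefficient over character 3-grams - the same general idea as Postgres' pg_trgm), purely illustrative and untuned:

using System.Collections.Generic;
using System.Linq;

static class Trigram
{
    static HashSet<string> Grams(string s)
    {
        s = "  " + s.ToLowerInvariant() + " ";      // pad so word beginnings count
        var grams = new HashSet<string>();
        for (int i = 0; i + 3 <= s.Length; i++)
            grams.Add(s.Substring(i, 3));
        return grams;
    }

    // Dice coefficient over the two trigram sets: 1.0 = identical, 0.0 = nothing shared.
    public static double Similarity(string a, string b)
    {
        var ga = Grams(a);
        var gb = Grams(b);
        int shared = ga.Count(g => gb.Contains(g));
        return (2.0 * shared) / (ga.Count + gb.Count);
    }
}

// Ordering candidates by Trigram.Similarity("Brings on the night", candidate)
// descending gives roughly the ranking asked for in the question above.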
In Java you have Lucene.
There is also a port of it in PHP (Zend Lucene).
You also have a port to C#, Lucene.NET.
Just by changing your DB models you can integrate it into the search engine.
Have a look. I've used Lucene in the past and it's always been very effective and efficient.
In our organization we have the need to let employees filter data in our web application by supplying WHERE clauses. It's worked great for a long time, but we occasionally run into users providing queries that require full table scans on large tables or inefficient joins, etc.
Some clown might write something like:
select * from big_table where
Name in (select name from some_table where name like '%search everything%')
or name in ('a', 'b', 'c')
or price < 20
or price > 40
or exists (select 1 from some_other_table where col1 + col2 + col3 = 4)
or exists (select 1 from table_a, table_b)
Obviously, this is not a great way to query these tables with computed values, non-indexed columns, lots of OR's and an unrestricted join on table_a and table_b.
But for a user, this may make total sense.
So what's the best way, if any, to allow internal users to supply a query to the database while ensuring that it won't lock a dozen tables and hang the webserver for 5 minutes?
I'm guessing there's a programmatic way in C#/SQL Server to get the execution plan for a query before it runs. And if so, what factors contribute to cost? Estimated I/O cost? Estimated CPU cost? What would be reasonable limits at which to tell the user that his query's no good?
EDIT: We're a market research company. We have thousands of surveys, each with their own data. We have dozens of researchers that want to slice that data in arbitrary ways. We have tools to let them construct "valid" filters using a GUI, but some "power users" want to supply their own queries. I realize this isn't standard or best practice, but how else can I let dozens of users query tables for the rows they want using arbitrarily complex conditions and ever-changing conditions?
The premise of your question states:
In our organization we have the need to let employees filter data in our web application by supplying WHERE clauses.
I find this premise to be flawed on its face. I can't imagine a situation where I would allow users to do this. In addition to the problems you have already identified, you are opening yourself up to SQL Injection attacks.
I would highly recommend reassessing your requirements to see if you can't build a safer, more focused way of allowing your users to search.
However, if your users really are sophisticated (and trusted!) enough to be supplying WHERE clauses directly, they need to be educated on what they can and can't submit as a filter.
You can try using the following:
SET SHOWPLAN_ALL ON
GO
SET FMTONLY ON
GO
<<< Your SQL code here >>>
GO
SET FMTONLY OFF
GO
SET SHOWPLAN_ALL OFF
GO
Then you can parse through what you've got. As to where to draw the line on various things, that's going to take some experience. There are some things to watch for, but nothing that is cut and dried. It's often more of an art to examine the query plans than a science.
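A rough sketch of driving that from C#: with SHOWPLAN_ALL on, the query is not executed; instead you get back a result set describing the estimated plan, including EstimateIO, EstimateCPU and TotalSubtreeCost columns you can inspect before deciding whether to run the real thing. Variable names and the threshold are illustrative:

using System;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    using (var setOn = new SqlCommand("SET SHOWPLAN_ALL ON", conn))
        setOn.ExecuteNonQuery();

    // Running the user's query now returns plan rows, not data rows.
    double worstCost = 0;
    using (var plan = new SqlCommand(userSuppliedQuery, conn))
    using (var reader = plan.ExecuteReader())
    {
        while (reader.Read())
        {
            object cost = reader["TotalSubtreeCost"];
            if (cost != DBNull.Value)
                worstCost = Math.Max(worstCost, Convert.ToDouble(cost));
        }
    }

    using (var setOff = new SqlCommand("SET SHOWPLAN_ALL OFF", conn))
        setOff.ExecuteNonQuery();

    // Only execute the real query if the estimate looks sane.
    if (worstCost > maxAllowedCost)
        throw new InvalidOperationException("Query looks too expensive to run.");
}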
As others have pointed out though, I think that your problem goes deeper than the technology implications. The fact that you let unqualified people access your database in such a way is the underlying problem. From past experience, I often see this in companies where they are too lazy or too inexperienced to properly capture their application's requirements. I'm not saying that this is necessarily the case with your corporate environment, but that's what I've seen.
In addition to trying to control what the users enter (which is a losing battle; there will always be a new hire who comes up with an imaginative query), I'd look into Resource Governor - see Managing SQL Server Workloads with Resource Governor. You put the ad-hoc queries into a separate pool and cap the allocated resources. This way you can mitigate the problem by limiting the amount of damage a bad query can do to other tasks.
And you should also consider giving access to the data by other means, like Power Pivot, and let users massage their data as hard as they want in their own Excel. Business power users love that, and the impact on the transaction processing server is minimal.
Instead of allowing employees to directly write (append to) queries, and then trying to calculate the query cost before running it, why not create some kind of Advanced Search or filter feature that is NOT writing SQL you cannot control?
In very large enterprise organizations, this is a common practice on internal applications. Often during your design phase you will limit the criteria or put sensible limits on data ranges, but once the business gets hold of the app there will be calls from business unit management to remove the restrictions. In my organization this is a management problem, not an engineering issue.
What we did was profile all of the criteria and find the largest offenders - both the users and the types of queries that caused the most problems - and put limitations on some of the queries. Also, some very expensive queries that were used on a regular basis were added to the app, and the app cached the results and ran the queries when load was low. We also created canned, optimized queries for standard users and gave only specified users the ability to search for anything. Just a couple of ideas.
You could make a data model for your database and allow users to use SQL Reporting Services' Report Builder. It's GUI-based and doesn't require writing WHERE clauses, so there should be a limit to how much damage they can do.
Or you could warehouse a copy of the db for the purpose of user queries, update the db every hour or so, and let them go to town... :)
I have worked a few places where this also came up. What we ended up doing was NOT allowing users unconstrained access, and promising to have IT do their best to provide queries when needed. The issue was that the database is fairly complicated, and even if users could write grammatically and syntactically correct SQL, they don't necessarily understand the relationships between the tables. In other words, even if they could write their own SQL they would get the wrong answers. We convinced the users that the risk of making the wrong decision based on a flawed or incomplete understanding of the 200 tables in the database was too high. Better to get the right answer after a day than the wrong one instantly.
The other part of this is: what does IT do when user A writes a query and gets one answer, then user B writes what he thinks is the same query and gets a different answer? Is it IT's job to find the differences? To fix both pieces of SQL? Etc. The bottom line is that I would not allow them access. I would load the system with predefined queries, as others have mentioned, and try to convince management why that is the only way it will work in the long run.
If you have so much data and you want to provide your customers the ability to analyse and view the information as they want to, I strongly recommend thinking about OLAP technologies.
I guess you've never heard of SQL injection attacks? What if the user enters a DROP DATABASE command after the WHERE clause?
This is the reason that direct SELECT permission is almost never given to users in the vast majority of applications.
A far better approach would be to engineer your application around use cases so that you are able to cover a reasonable percentage of requirements with specifically designed filters/aggregation/layout options.
There are a myriad of ways to do this so some analysis of your specific problem domain will definitely be required together with research into viable methods.
Whilst direct SQL access is the most flexible option for your users, long-running queries are likely to be just the start of your headaches. SQL injection is a big concern here, whether its source is malicious or simply misguided.
(Chad mentioned this in a comment, but I think it deserves to be an answer.)
Maybe you should copy data that needs to be queried ad-hoc into a separate database, to isolate any problems from the majority of users.