How to do partial word searches in Lucene.NET? - c#

I have a relatively small index containing around 4,000 locations. Among other things, I'm using it to populate an autocomplete field on a search form.
My index contains documents with a Location field containing values like
Ohio
Dayton, Ohio
Dublin, Ohio
Columbus, Ohio
I want to be able to type in "ohi" and have all of these results appear, but right now nothing shows up until I type the full word "ohio".
I'm using Lucene.NET v2.3.2.1, and the relevant portion of my code for setting up the query is as follows:
BooleanQuery keywords = new BooleanQuery();
QueryParser parser = new QueryParser("location", new StandardAnalyzer());
parser.SetAllowLeadingWildcard(true);
keywords.Add(parser.Parse("\"*" + location + "*\""), BooleanClause.Occur.SHOULD);
luceneQuery.Add(keywords, BooleanClause.Occur.MUST);
In short, I'd like to get this working like a LIKE clause similar to
SELECT * from Location where Name LIKE '%ohi%'
Can I do this with Lucene?

Try this query:
parser.Parse(query.Keywords.ToLower() + "*")

Yes, this can be done, but a leading wildcard can result in slow queries; check the documentation. Also, if you are indexing the entire string (e.g. "Dayton, Ohio") as a single token, most queries will degenerate into leading-wildcard queries. Using a tokenizing analyzer such as StandardAnalyzer (which I suppose you are already doing) reduces the need for a leading wildcard.
If you want to avoid leading wildcards for performance reasons, you can try indexing n-grams. That way there will not be any leading wildcard queries. An n-gram tokenizer (assuming a fixed length of 4) will turn "Dayton Ohio" into tokens such as "dayt", "ayto", "yton" and so on.
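The n-gram idea can be sketched outside Lucene. The snippet below is a minimal illustration in Python (used only for brevity; it does not call the Lucene.NET API), and the helper names char_ngrams, build_index and search are hypothetical. It assumes 3-grams so that a fragment as short as "ohi" can be looked up without any wildcard.

```python
import re
from collections import defaultdict


def char_ngrams(text, n):
    """Lowercase the text, split it into words, and return every length-n slice of each word."""
    grams = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Words shorter than n produce no grams in this simple scheme.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def build_index(locations, n=3):
    """Map each n-gram to the set of locations that contain it."""
    index = defaultdict(set)
    for location in locations:
        for gram in char_ngrams(location, n):
            index[gram].add(location)
    return index


def search(index, fragment, n=3):
    """Return locations that contain every n-gram of the fragment.

    These are candidates: a strict substring check could be applied afterwards.
    Fragments shorter than n yield an empty result.
    """
    grams = char_ngrams(fragment, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result
```

The trade-off is the one the answers describe: lookups become exact term matches, at the cost of a larger index.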

It's more a matter of populating your index with partial words in the first place. Your analyzer needs to put the partial keywords into the index as it analyzes (and ideally weight them lower than full keywords as it does).
Lucene's index lookup trees work from left to right. If you want to search in the middle of a keyword, you have to break it up as you analyze. The problem is that partial keywords will usually explode your index size.
People usually use creative analyzers that break words down into root words (stripping prefixes and suffixes).
Dig deep into understanding Lucene. It's good stuff. :-)

Related

regex that can handle horribly misspelled words

Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (20 characters, for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as "misspelling". I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. it originally said "misspelling", but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think you need not a regex, but a database of all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes on both the word and snd columns.
For example, for the word mispeling you would fill snd with M214. If you use SQLite, note that soundex() is only available when SQLite is built with the SQLITE_SOUNDEX compile-time option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln, soundex('mshpeln') = M214. This way you can get back the correct word.
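To make the lookup concrete, here is a minimal sketch in Python of the classic American Soundex rules together with an in-memory stand-in for the indexed table. The function names soundex and build_lookup are hypothetical, and database implementations of soundex() may differ in edge cases, so treat this as an illustration rather than a drop-in replacement.

```python
from collections import defaultdict

# Letter groups of classic American Soundex; vowels, h, w and y carry no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters that share a code
        code = _CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (first.upper() + "".join(digits) + "000")[:4]


def build_lookup(dictionary_words):
    """Group dictionary words by Soundex key (stand-in for the indexed snd column)."""
    table = defaultdict(list)
    for word in dictionary_words:
        table[soundex(word)].append(word)
    return table
```

With this sketch, both "mispeling" and "mshpeln" map to M214, so a garbled word can be used to fetch dictionary candidates that share its key.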
But this would not look anything like regex - sorry.
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for one or two people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense: a collection of elements guaranteed to be unique) with all of the unique letters in misspelling: {e, g, i, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then intersect this set with the set of the word you are matching against; this gives you the letters contained in both sets. If the intersection has 5 or more characters in it, you have a possible match.
If the OCR can add erroneous spaces, then consider two words at a time instead of single words, and so on, depending on your requirements.
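The three steps above can be sketched as follows. This is a minimal Python illustration; the function name candidate_matches and the min_shared parameter are hypothetical. Note that a set ignores letter order and repetition, so this is a coarse filter that accepts false positives, not an in-order match.

```python
import re


def candidate_matches(text, target, min_shared=5):
    """Return words in text sharing at least min_shared distinct letters with target."""
    target_letters = set(target.lower())          # step 1: unique letters of the target
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):   # step 2: split the text into words
        shared = set(word.lower()) & target_letters  # step 3: set intersection
        if len(shared) >= min_shared:
            hits.append(word)
    return hits
```

For example, "mshpeln" shares six distinct letters with "misspelling" (m, s, p, e, l, n) and is reported, while "hello" shares only two and is not.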
I have no solution for this problem; in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error made by the OCR algorithm, as it can range between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

Matching a term that contains nested HTML

I have been having trouble finding a solution to this problem.
I am parsing the content of a number of ebooks, finding specific terms and characters, marking the locations and lengths of each term.
A normal case would be something like this (excerpts from A Game of Thrones):
"When he paused to look down, his head swam dizzily and he felt his fingers slipping. Bran cried out and clung for dear life."
If we are searching for the character "Bran", its location is 85 and length is 4. Easy enough.
My issue arises when there is a paragraph like this:
<span height="-0em"><font size="7">D</font></span>aenerys Targaryen wed Khal Drogo
We need to match "Daenerys Targaryen". It is easy enough to strip the HTML and match the string, but in this example the result needs to include the HTML. Thus the expected result here would be location = 0, length = 67.
Another situation, caused by random anchor tags scattered throughout:
Did anyone outside the Vale even suspect where Catelyn <a></a>Stark had taken him?
Again, searching for "Catelyn Stark" needs to include the HTML, so location = 47, length = 20.
I have been able to get around it temporarily by adding those specific cases (searching for "Catelyn <a></a>Stark" specifically), but clearly I should have a more robust solution, which I cannot seem to get my head around. My attempts have been using RegEx but with limited success.
I have found various questions regarding HTML matching/stripping (and whether or not to use RegEx =)), but this case seems to be somewhat unique.
Stripping the tags isn't an option as the content must be preserved.
This is within a stand-alone C# application.
Any ideas, steps in the right direction, or similar examples should your search go better than mine would be greatly appreciated!
One possible approach would be to insert the following between each letter in your search string:
(?:<[^>]*>)*
So when searching for the character "Bran" your regex would become the following:
(?:<[^>]*>)*B(?:<[^>]*>)*r(?:<[^>]*>)*a(?:<[^>]*>)*n
This will allow your regex to match any number of HTML tags anywhere within the search string. Note that this will only work if your search strings are always something simple like a character's name, and not regular expressions (this method will fail if there is repetition like a* in your search string).
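As a rough illustration of this approach, the sketch below builds such a pattern from a plain search term. It is written in Python for brevity (the same pattern syntax works with .NET's Regex class), and the helper names tag_tolerant_pattern and locate are hypothetical. Each character is escaped, so the term is treated as literal text.

```python
import re

# Matches any run of HTML tags, including none at all.
TAGS = r"(?:<[^>]*>)*"


def tag_tolerant_pattern(term):
    """Compile a regex that matches term with optional tags before and between its characters."""
    parts = [re.escape(ch) for ch in term]
    return re.compile(TAGS + TAGS.join(parts))


def locate(term, html):
    """Return (location, length) of the first match including embedded tags, or None."""
    match = tag_tolerant_pattern(term).search(html)
    if match is None:
        return None
    return match.start(), match.end() - match.start()
```

On the excerpts from the question, locate("Daenerys Targaryen", ...) gives (0, 67) and locate("Catelyn Stark", ...) gives (47, 20). Because the pattern allows tags before the first character, any tags immediately preceding the term are included in the match.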
I would create a function that takes "Daenerys Targaryen" as a parameter and strips the first letter. It would then search only for "aenerys Targaryen" and, if found, search for ">D<" or whatever the first letter is. Does that make sense?
Example:
public static string searchFor(string str)
{
    string result = null;
    // strip first letter of search string (in this case "D")
    // search for the rest of the string ("aenerys Targaryen")
    // if found, search for ">D<"
    // if found, search for HTML tags with "D" inside (using regex)
    // if found, search for HTML tags with the previous HTML tag in them (using regex)
    return result;
}
Well, using JavaScript or PHP you can get the text of elements and of documents, search there, and then use a regex to return the closest match (containing the HTML):
Another option would be to index the books first using something like the Lucene search engine, which happens to let you index different formats (HTML being one of them).
You can then use the Lucene API to search your documents a little more easily.
In PHP we have Zend_Search_Lucene, which works perfectly for this kind of thing.
Lucene Search can be found at:
http://lucene.apache.org/core/
Have fun!

Match expressions in Strings

I have a database here with certain rules I need to apply to a bunch of Strings; they're expressions that can occur within the Strings. They are expressed like
(word1 AND word2) OR (word3)
I can't hardcode those (because they may be changed in the database), so I thought about programmatically turning those expressions into Regex patterns.
Has anybody done such a task yet or has an idea on how to do this the best way?
I'm not quite sure how to deal with more complex expressions, how to take them apart and so on.
Edit: I'm using C# in VisualStudio / .NET.
The data is basically directory paths. A customer wants to get their documents organized, so the Strings I have are paths, and the expressions in the DB could look like:
(office OR headquarter) AND (official OR confidential)
So if the file's directory path contains office and confidential, it should match.
Hope this makes it clearer.
EDIT2:
Here are some dummy examples:
The paths could look like:
c:\documents\official\johnmeyer\court\out\letter.doc
c:\documents\internal\appointments\court\in\september.doc
c:\documents\official\stevemiller\meeting\in\letter.doc
And the expressions like:
(meyer OR miller) AND (court OR jail)
So this expression would match the 1st path/file, but not the 2nd and 3rd ones.
No answer, but a good hint:
The expressions you have are actual trees constructed by the parentheses. You need a stack machine to parse the text into a (binary) tree structure, where each node is an AND or OR element and the leaves are the words.
Afterwards, you can simply construct your regex in whatever language you need by walking the tree using depth first search and adding prefix and suffix data as needed before/after reading the subtree.
Consider an abstract class TreeNode having a method GenerateExpression(StringBuilder result).
Each actual TreeNode item will be either a CombinationTreeNode (with a CombinationMode of And/Or) or a SearchTextTreeNode (with a SearchText property).
GenerateExpression(StringBuilder result) for CombinationTreeNode will look something like this:
result.Append("(");
leftSubTree.GenerateExpression(result);   // left operand first
result.Append(") " + this.CombinationMode.ToString() + " (");
rightSubTree.GenerateExpression(result);  // then the right operand
result.Append(")");
GenerateExpression(StringBuilder result) for SearchTextTreeNode is much easier:
result.Append(this.SearchText);
Of course, your code will produce a regular expression instead of the input text, as mine does.
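As a rough sketch of the tree idea, the Python snippet below parses an expression into nested tuples and then evaluates the tree directly against a path instead of emitting a regex (a deliberate simplification, not the answerer's exact design). The names tokenize, parse and matches are hypothetical, and the sketch assumes AND and OR have equal precedence and group left to right unless parentheses say otherwise.

```python
import re


def tokenize(expr):
    # Parentheses are single tokens; everything else is split on whitespace.
    return re.findall(r"[()]|[^\s()]+", expr)


def parse(expr):
    """Parse 'a AND (b OR c)' into nested tuples such as ('AND', left, right)."""
    tokens = tokenize(expr)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] == ")":
            raise ValueError("expected a word or '('")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        return ("WORD", tok.lower())

    def parse_expr():
        nonlocal pos
        node = parse_term()
        # Operators are matched case-insensitively, so 'or' and 'OR' both work.
        while pos < len(tokens) and tokens[pos].upper() in ("AND", "OR"):
            op = tokens[pos].upper()
            pos += 1
            node = (op, node, parse_term())
        return node

    tree = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree


def matches(node, text):
    """Evaluate the tree: a WORD leaf is a case-insensitive substring test."""
    if node[0] == "WORD":
        return node[1] in text.lower()
    left, right = matches(node[1], text), matches(node[2], text)
    return (left and right) if node[0] == "AND" else (left or right)
```

Applied to the sample paths from the question, the expression "(meyer or miller) AND (court OR jail)" matches only the first path. A regex generator could walk the same tuples if a pattern string is really required.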

how to create a parser for search queries

For example, I need to create something like a Google search query parser to parse expressions such as:
flying hiking or swiming
-"walking in boots" author:hamish author:reid
or
house in new york priced over $500000 with a swimming pool
How would I even start building something like this? Any good resources?
C#-relevant, please (if possible).
Edit: this is something that I should somehow be able to translate into a SQL query.
How many keywords do you have (like 'or', 'in', 'priced over', 'with a')? If you only have a couple of them I'd suggest going with simple string processing (regexes) too.
But if you have more than that, you might want to look into implementing a real parser for those search expressions. Irony.net might help you with that (I found it extremely easy to use, as you can express your grammar in a near-BNF form directly in code).
The Lucene/NLucene projects have functionality for boolean queries and some other query formats as well. I don't know whether you can add your own extensions, like author in your case, but it might be worthwhile to check it out.
There are a few ways of doing it; here are two of them:
Parsing using a grammar (useful for a complex language)
Parsing using regular expressions and basic string manipulation (for a simpler language)
Judging by your example, the language is very basic, so splitting the string by keyword may be the best solution.
string sentence = "house in new york priced over $500000 with a swimming pool";
string[] values = sentence.Split(new []{" in ", " priced over ", " with a "},
StringSplitOptions.None);
string type = values[0];
string area = values[1];
string price = values[2];
string accessories = values[3];
However, some issues that may arise are: how to verify if the sentence stands in the expected form? What happens if some of the keywords can appear as part of the values?
If you run into these cases, there are libraries you can use to parse input using a defined grammar. Two such libraries that work with .NET are ANTLR and Gold Parser; both are free. The main challenge is defining the grammar.
A grammar would work very well for the second example you gave but the first (any order keyword/command strings) would be best handled using Split() and a class to handle the various keywords and commands. You will have to do initial processing to handle quoted regions before the split (for example replacing spaces within quoted regions with a rare/unused character).
The ":" commands are easy to find and pull out of the search string for processing after the split is completed. Simply traverse the array looking for them.
The +/- keywords are also easy to find and add to the SQL query as AND/AND NOT clauses.
The only place you might run into issues is with the "or" since you'll have to define how it is handled. What if there are multiple "or"s? But the order of keywords in the array is the same as in the query so that won't be an issue.
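A minimal sketch of this split-and-classify approach is shown below in Python; the regex carries over to .NET largely unchanged. The names TOKEN and parse_query and the result layout are hypothetical. It keeps quoted phrases together, pulls out field:value commands, and sorts the remaining terms by their +/- prefix. Handling of "or" is deliberately left out, since its semantics still have to be defined, and a sign in front of a field:value token is ignored.

```python
import re

# One token: optional +/- sign, optional "field:" prefix, then a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Classify query tokens into included terms, excluded terms and field filters."""
    parsed = {"include": [], "exclude": [], "fields": {}}
    for sign, field, quoted, bare in TOKEN.findall(query):
        value = (quoted or bare).strip().lower()
        if not value:
            continue  # skip empty quotes
        if field:
            # field:value commands such as author:hamish (any sign is ignored here)
            parsed["fields"].setdefault(field.lower(), []).append(value)
        elif sign == "-":
            parsed["exclude"].append(value)  # would become an AND NOT clause
        else:
            parsed["include"].append(value)  # would become an AND clause
    return parsed
```

The resulting lists map naturally onto parameterized AND / AND NOT clauses when the SQL query is built.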
I think you should just do some string processing. There is no smart way of doing this.
So replace "OR" with your own or operator (e.g. ||). As far as I know, there is no library for this.
I suggest you go with regexes.

validating user input tags

I know this question might sound a little cheesy but this is the first time I am implementing a "tagging" feature to one of my project sites and I want to make sure I do everything right.
Right now, I am using the very same tagging system as SO: space-separated, with a dash (-) combining multiple words. So when I am validating a user-input tag field, I am checking for:
Empty string (cannot be empty)
Make sure the string doesn't contain particular letters (suggestions are welcome here)
At least one word
If there is a space (more than one word), split the string
For each part, insert into the DB
Am I missing something here, or is this roughly OK?
Split the string at " ", iterate over the parts, make sure that they comply with your expectations. If they do, put them into the DB.
For example, you can use this regex to check the individual parts:
^[-\w]{2,25}$
This would limit allowed input to consecutive strings of alphanumerics (plus "_", which is part of "\w", and "-", because you asked for it) 2 to 25 characters long. This essentially removes any code injection threat you might be facing.
EDIT: In place of the "\w", you are free to take any more closely defined range of characters, I chose it for simplicity only.
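As a sketch of the split-then-validate flow, the Python snippet below applies the same character class to each part. The name split_tags is hypothetical, and lowercasing plus de-duplication are assumptions added for illustration. The ASCII flag is also an assumption: without it, Python's "\w" accepts non-ASCII letters as well.

```python
import re

# Same character class and length limits as the regex above.
TAG_RE = re.compile(r"[-\w]{2,25}", re.ASCII)


def split_tags(raw):
    """Split raw user input on whitespace and return (accepted, rejected) tag lists."""
    accepted, rejected = [], []
    # str.split() with no argument ignores leading, trailing and repeated spaces.
    for part in raw.split():
        tag = part.lower()
        if TAG_RE.fullmatch(tag):
            if tag not in accepted:  # drop duplicates such as "regex Regex"
                accepted.append(tag)
        else:
            rejected.append(part)
    return accepted, rejected
```

Returning the rejected parts separately makes it easy to tell the user which tags were refused instead of silently dropping them.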
I've never implemented a tagging system, but am likely to do so soon for a project I'm working on. I'm primarily a database guy and it occurs to me that for performance reasons it may be best to relate your tagged entities with the tag keywords via a resolution table. So, for instance, with example tables such as:
TechQuestion
    TechQuestionID (pk)
    SubjectLine
    QuestionBody
TechQuestionTag
    TechQuestionID (pk)
    TagID (pk)
    Active (indexed)
Tag
    TagID (pk)
    TagText (indexed)
... you'd only add new Tag table entries when never-before-used tags were used. You'd re-associate previously provided tags via the TechQuestionTag table entry. And your query to pull TechQuestions related to a given tag would look like:
SELECT
    q.TechQuestionID,
    q.SubjectLine,
    q.QuestionBody
FROM
    Tag t INNER JOIN TechQuestionTag qt
        ON t.TagID = qt.TagID AND qt.Active = 1
    INNER JOIN TechQuestion q
        ON qt.TechQuestionID = q.TechQuestionID
WHERE
    t.TagText = @tagText
... or what have you. I don't know, perhaps this was obvious to everyone already, but I thought I'd put it out there, because I don't believe the alternative (redundant, indexed, text-tag entries) would query as efficiently.
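To try the resolution-table layout quickly, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The column types, the index name and the helper functions (make_db, add_question, tag_question, questions_for_tag) are assumptions for illustration; the table and column names follow the answer above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE TechQuestion (
    TechQuestionID INTEGER PRIMARY KEY,
    SubjectLine    TEXT,
    QuestionBody   TEXT
);
CREATE TABLE Tag (
    TagID   INTEGER PRIMARY KEY,
    TagText TEXT UNIQUE              -- UNIQUE also creates an index on TagText
);
CREATE TABLE TechQuestionTag (
    TechQuestionID INTEGER REFERENCES TechQuestion(TechQuestionID),
    TagID          INTEGER REFERENCES Tag(TagID),
    Active         INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (TechQuestionID, TagID)
);
CREATE INDEX idx_TechQuestionTag_Active ON TechQuestionTag (Active);
"""


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_question(conn, subject, body):
    cur = conn.execute(
        "INSERT INTO TechQuestion (SubjectLine, QuestionBody) VALUES (?, ?)",
        (subject, body),
    )
    return cur.lastrowid


def tag_question(conn, question_id, tag_text):
    # A Tag row is only created the first time a tag text is seen.
    conn.execute("INSERT OR IGNORE INTO Tag (TagText) VALUES (?)", (tag_text,))
    tag_id = conn.execute(
        "SELECT TagID FROM Tag WHERE TagText = ?", (tag_text,)
    ).fetchone()[0]
    # Re-tagging reactivates the existing association instead of duplicating it.
    conn.execute(
        "INSERT OR REPLACE INTO TechQuestionTag (TechQuestionID, TagID, Active) "
        "VALUES (?, ?, 1)",
        (question_id, tag_id),
    )


def questions_for_tag(conn, tag_text):
    rows = conn.execute(
        """
        SELECT q.TechQuestionID, q.SubjectLine
        FROM Tag t
        INNER JOIN TechQuestionTag qt ON t.TagID = qt.TagID AND qt.Active = 1
        INNER JOIN TechQuestion q ON qt.TechQuestionID = q.TechQuestionID
        WHERE t.TagText = ?
        ORDER BY q.TechQuestionID
        """,
        (tag_text,),
    )
    return rows.fetchall()
```

The tag text is always passed as a bound parameter, which also addresses the injection concern raised elsewhere in this thread.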
Be sure your algorithm can handle leading/trailing/extra spaces with no trouble = )
Also worth thinking about might be a tag blacklist for inappropriate tags (profanity for example).
I hope you're doing the usual protection against injection attacks - maybe that's included under #2.
At the very least, you're going to want to escape quote characters and make embedded HTML harmless - in PHP, functions like addslashes and htmlentities can help you with that. Given that it's for a tagging system, my guess is you'll only want to allow alphanumeric characters. I'm not sure what the best way to accomplish that is, maybe using regular expressions.
