We're designing a system that can accept commands in this format:
command context
The context is defined from a list of about 200 tuples of words such as:
physical therapy
cardiac
physician visit
hospital inpatient
hospital outpatient
etc.
We want the system to correct user errors such as spelling mistakes, to understand that "physical therapy" is the same as "physical therapist", and to accept synonyms.
Finally, if there is no exact match, it should ask the user to disambiguate between the best matches.
This is how I'm thinking of doing it:
1) Stem both the context words and incoming queries
2) Delete/isolate command strings from the query
3) Check for and correct any anagrams (however: this only covers one category of spelling mistakes)
4) Look for an exact word match
5) Look for "close matches"
This doesn't feel like a neat solution, especially steps 3 and 5.
What's a better/easier way to do this? Any libraries to do it in C#, bonus.
Can Lucene do this perhaps? Any guidance appreciated.
Thanks!
It may be too imprecise for your purposes, but Soundex is a common algorithm for telling if two words "sound similar".
I think Lucene would be best applied only at steps 4 and 5, as Lucene currently only supports approximate matching in the "glob" sense (wildcard characters -- "?" for matching a single character and "*" for matching multiple characters).
There is a whole set of literature on approximate matching -- I would start with the agrep work and proceed from there (but in part that is because I'm familiar with agrep).
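As a rough illustration of steps 4 and 5 plus the disambiguation prompt, here is a minimal sketch in Python using the standard library's difflib rather than Lucene or a C# library. The CONTEXTS list is copied from the question; match_context, the 0.6 cutoff, and the limit of 3 candidates are assumptions made for the example, and stemming and synonyms are not covered.

```python
import difflib

# Sample contexts taken from the question; the real list has about 200 entries.
CONTEXTS = [
    "physical therapy",
    "cardiac",
    "physician visit",
    "hospital inpatient",
    "hospital outpatient",
]

def match_context(query, contexts=CONTEXTS, cutoff=0.6, limit=3):
    """Return (status, candidates); status is 'exact', 'ambiguous' or 'none'."""
    # Normalize case and whitespace before comparing.
    normalized = " ".join(query.lower().split())
    if normalized in contexts:
        return "exact", [normalized]
    # Candidates come back best-first; cutoff and limit are arbitrary choices.
    candidates = difflib.get_close_matches(normalized, contexts, n=limit, cutoff=cutoff)
    return ("ambiguous" if candidates else "none"), candidates

status, candidates = match_context("physical therapist")
print(status, candidates)
```

When the status is 'ambiguous', the candidate list is what you would show the user to choose from.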
Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (20 chars, for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as "misspelling". I do not want to do OCR. The OCR job has already been done, but it was done poorly (i.e. the original said "misspelling", but the OCR'd copy reads "mshpeln"). I do not know in advance what text I will have to match against (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think you need not a regex but a database of all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table of all valid words (a dictionary) with columns such as word and snd (computed as soundex(word)), and create indexes on both the word and snd columns.
For example, for the word mispeling you would store M214 in snd. SQLite provides soundex(), but only in builds compiled with the SQLITE_SOUNDEX option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, soundex('mshpeln') = M214, which leads you back to the correct word.
But this would not look anything like regex - sorry.
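To make the lookup concrete without a database, here is a minimal in-memory sketch in Python. It implements the common American Soundex rules by hand and uses a plain dict in place of the indexed snd column; the function names and the three-word dictionary are made up for the example.

```python
from collections import defaultdict

# Letter -> digit table for American Soundex; vowels, h, w and y have no code.
CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}

def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate letters with the same code
            prev = code
    return (result + "000")[:4]

def build_index(words):
    """Group dictionary words by Soundex code (stands in for the snd column)."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index

index = build_index(["misspelling", "mountain", "therapy"])
print(index.get(soundex("mshpeln"), []))
```

Several dictionary words can share one code, so in practice the lookup returns a candidate list that may need a second ranking step.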
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense: a collection of elements guaranteed to be unique) with all of the unique letters that are in misspelling - {e, g, i, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add erroneous spaces, then consider two words at a time instead of single words, and so on, depending on your requirements.
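The three steps above can be sketched in a few lines of Python. The function name and the min_shared parameter are made up for the example, and note that, unlike the original request, this check ignores the order of the letters.

```python
def shared_letter_candidates(text, target, min_shared=5):
    """Return the words in text sharing at least min_shared distinct letters with target."""
    # Step 1: the set of unique letters in the word we are looking for.
    target_letters = set(target.lower())
    hits = []
    # Step 2: split the body of text into words.
    for word in text.split():
        # Step 3: intersect each word's letter set with the target's.
        letters = {ch for ch in word.lower() if ch.isalpha()}
        if len(letters & target_letters) >= min_shared:
            hits.append(word)
    return hits

print(shared_letter_candidates("the mshpeln was found", "misspelling"))
```

Because order and repetition are ignored, expect false positives from unrelated words that happen to share letters, which the question said is acceptable.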
I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error made by the OCR algorithm, as it can range anywhere between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures
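One widely used string similarity measure is the Levenshtein edit distance. Here is a minimal Python sketch of the textbook dynamic-programming version, applied to the two OCR guesses above; it illustrates the measure only and is not taken from the linked reference.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

print(levenshtein("hello world", "nello world"))
print(levenshtein("hello world", "noiio verio"))
```

The distance for the second guess comes out much larger than for the first, which is the point made above: a threshold that accepts one rejects the other, so you have to decide up front how much error to tolerate.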
I am wondering if anyone can help me out with parsing out data for key words.
Say I am looking for this keyword: My Example Yo (this is one of many keywords).
I have data like this
MY EXAMPLE YO #108
my-example-yo #108
my-example #108
MY Example #108
These are just a few combinations. There could be words or numbers in front of these phrases, they could be in any case, and something may or may not come after them, as in the examples above.
A few ideas came to mind.
Store all the combinations that I can possibly think of in my database, then use contains.
The downside is that I would end up with a huge database table holding every combination of everything I need to find. I would then have to load the data into memory (through NHibernate) and check every combination. I am trying to determine which category to use based on the keyword, and users can upload thousands of rows to check.
Even if I load subsets and look through them I still picture this will be slow.
Remove all special characters and make single spaces and ignore case and try to use regex to see how much of the keyword matches up.
Not sure what to do if the keyword has special characters like dashes and such.
I know I will not get every combination out there but I want to try get as many as I can.
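The second idea (strip special characters, collapse spaces, ignore case, then see how much of the keyword matches) might look roughly like this in Python. The function names are made up, and "how much matches" is interpreted here as the longest leading run of keyword words found as a whole phrase in the row, which is only one possible reading.

```python
import re

def normalize(text):
    """Lowercase and turn every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def match_fraction(keyword, row):
    """Fraction of the keyword's leading words that appear as a phrase in row."""
    words = normalize(keyword).split()
    padded = f" {normalize(row)} "  # pad so only whole words match
    for n in range(len(words), 0, -1):
        if f" {' '.join(words[:n])} " in padded:
            return n / len(words)
    return 0.0

for row in ["MY EXAMPLE YO #108", "my-example-yo #108", "my-example #108", "MY Example #108"]:
    print(row, "->", match_fraction("My Example Yo", row))
```

Since dashes are normalized in both the keyword and the row, keywords that themselves contain dashes or similar characters are handled the same way as the data.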
Have you considered Lucene.Net? I haven't used it myself, but I hear it's a great tool for full text searching. It might do well with keyword searching too. I believe that stackoverflow uses Lucene.
In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
    @"(?<a>(?<!^)[A-Z][a-z])", @" ${a}");
result = Regex.Replace(result,
    @"(?<a>[a-z])(?<b>[A-Z0-9])", @"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?
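For anyone who wants to try the two-pass approach outside .NET, here is a rough Python port of the two Replace calls with a few of the test identifiers from above. The function name friendly_name is made up, and Python's re module may differ from .NET's regex engine in edge cases not exercised here.

```python
import re

def friendly_name(identifier):
    """Insert spaces into an UpperCamelCase identifier using two passes."""
    # Pass 1: space before an uppercase letter followed by a lowercase letter,
    # except at the very start of the string.
    result = re.sub(r"(?<!^)([A-Z][a-z])", r" \1", identifier)
    # Pass 2: space between a lowercase letter and a following uppercase
    # letter or digit (middle or trailing acronyms, numbers).
    return re.sub(r"([a-z])([A-Z0-9])", r"\1 \2", result)

for name in ["NoDescription", "AAANoDescription", "ThisHasTheAcronymABCInTheMiddle",
             "MyTrailingAcronymID", "IDo3Things", "Basic"]:
    print(name, "->", friendly_name(name))
```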
So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, @"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", @" ${a}");
Does it in one pass.
Not that this directly answers the question, but why not test by taking the standard C# API and converting each class name into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.
Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market is so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available in other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.
I want to enable my users to specify the allowed characters in a given string.
So... regexes are great but too tough for my users.
My plan is to enable users to specify a list of allowed characters - for example
a-z|A-Z|0-9|,
I can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However, I'm a little lost on how to deal with all the escaping - imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually, but before I write such a nasty solution - is there some nifty way to do this?
Thanks
David
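One way to avoid handling each special character by hand is to let the regex library do the escaping. Here is a minimal sketch in Python using re.escape (in .NET the analogous call is Regex.Escape, though characters inside a character class may need extra care there). The function name is made up, and treating an empty item between pipes as a literal pipe is an assumption about how the spec above is meant to be read.

```python
import re

def build_validator(spec):
    """Compile a pipe-separated spec such as 'a-z|A-Z|0-9|,' into a regex."""
    parts = []
    for token in spec.split("|"):
        if token == "":
            # Assumption: an empty item (as in '|||') means a literal pipe.
            parts.append(re.escape("|"))
        elif len(token) == 3 and token[1] == "-":
            # A range such as a-z: escape both ends, keep the dash as a range.
            parts.append(re.escape(token[0]) + "-" + re.escape(token[2]))
        else:
            # Anything else is taken as literal characters, each escaped.
            parts.extend(re.escape(ch) for ch in token)
    return re.compile("[" + "".join(parts) + "]*")

validator = build_validator(r"a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)")
print(bool(validator.fullmatch("Hello, World [1]|{2}")))
print(bool(validator.fullmatch("semi;colon")))
```

A reversed range such as z-a still raises an error at compile time, so user input would need to be validated or the error caught.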
Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));
You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jettison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi-line text box, for instance).
To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.
I am working on an augmentative and alternative communication (AAC) program. My current goal is to store a history of input/spoken text and search for common phrase fragments or word n-grams. I am currently using an implementation based on the LZW compression algorithm, as discussed at CodeProject - N-gram and Fast Pattern Extraction Algorithm. Although this approach produces n-grams, it does not behave as needed.
Let's say for example that I enter "over the mountain and through the woods" several times. My desired output would be the entire phrase "over the mountain and through the woods". Using my current implementation the phrase is broken into trigrams and on each repeated entry one word is added. So on the first entry I get "over the mountain". On the second entry "over the mountain and", etc.
Let's assume we have the following text:
this is a test
this is another test
this is also a test
the test of the emergency broadcasting system interrupted my favorite song
My goal would be that if "this is a test of the emergency broadcasting system" were entered next, I could use that within a regex to return "this is a test" and "test of the emergency broadcasting system". Is this something that is possible through regex, or am I walking the wrong path? I appreciate any help.
I have been unable to find a way to do what I need with regular expressions alone although the technique shown at Matching parts of a string when the string contains part of a regex pattern comes close.
I ended up using a combination of my initial system along with some regex as shown below.
flow chart http://www.alsmatters.org/files/phraseextractor.png
This parses the transcript of the first presidential debate (about 16,500 words) in about 30 seconds which for my purposes is quite fast.
From your use case it appears you do not want fixed-length n-gram matches, but rather a longest sequence of n-gram match. Just saw your answer to your own post, which confirms ;)
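A longest-shared-word-sequence match can be sketched in Python with difflib.SequenceMatcher applied to word lists instead of characters. This is not the regex-based system described above; shared_phrases and the three-word minimum are assumptions chosen so that the example history yields the two phrases the question asks for.

```python
from difflib import SequenceMatcher

def shared_phrases(history, entry, min_words=3):
    """Longest word run each history line shares with entry, if long enough."""
    entry_words = entry.split()
    found = []
    for line in history:
        line_words = line.split()
        matcher = SequenceMatcher(None, line_words, entry_words, autojunk=False)
        m = matcher.find_longest_match(0, len(line_words), 0, len(entry_words))
        if m.size >= min_words:
            phrase = " ".join(entry_words[m.b:m.b + m.size])
            if phrase not in found:
                found.append(phrase)
    return found

history = [
    "this is a test",
    "this is another test",
    "this is also a test",
    "the test of the emergency broadcasting system interrupted my favorite song",
]
print(shared_phrases(history, "this is a test of the emergency broadcasting system"))
```

Only the single longest run per history line is reported, so a line sharing two separate phrases with the entry contributes just one of them.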
In Python you can use the fuzzywuzzy library to match a set of phrases to a canonical/normalized set of phrases through an associated list of "synonym" phrases or words. The trick is segmenting your phrases appropriately (e.g. when do commas separate phrases, and when do they join lists of related words within a phrase?).
Here's the structure of the python dict in RAM. Your data structure in C or a database would be similar:
phrase_dict = {
'alternative phrase': 'canonical phrase',
'alternative two': 'canonical phrase',
'less common phrasing': 'different canonical phrase',
}
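from fuzzywuzzy.process import extractOne
# Pass the keys as a list: given a dict, extractOne scores the values instead
best_key, score = extractOne('unknown phrase', list(phrase_dict))
phrase_dict[best_key]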
and that returns
'canonical phrase'
FuzzyWuzzy seems to use something like a simplified Levenshtein edit-distance... it's fast but doesn't deal well with capitalization (normalize your case first), word sounds (there are other libraries, like soundex, that can hash phrases by what they sound like), or word meanings (that's what your phrase dictionary is for).