regex that can handle horribly misspelled words - c#

Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (20 chars, for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. the original said misspelling, but the OCR'd copy reads "mshpeln"). I do not know in advance what text I will have to match against (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.

I think you need not a regex but a database of all valid words, plus creative use of functions like soundex() and/or levenshtein().
You can do it like this: create a table of all valid words (a dictionary) with columns such as word and snd (computed as soundex(word)), and create indexes on both columns.
For example, for the word mispeling you would store snd as M214. If you use SQLite, note that soundex() is only available when the library is built with the SQLITE_SOUNDEX compile-time option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, soundex('mshpeln') = M214, which leads you back to the correct word.
But this would not look anything like regex - sorry.
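To make the lookup concrete, here is a minimal Python sketch of the same idea without a database. It assumes the standard American Soundex rules; the helper names soundex, build_index and lookup are made up for illustration, and an in-memory dict stands in for the indexed snd column.

```python
from collections import defaultdict

# Letter-to-digit table used by standard American Soundex.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def build_index(words):
    """Map each Soundex code to the dictionary words that share it."""
    index = defaultdict(list)
    for word in words:
        index[soundex(word)].append(word)
    return index


def lookup(index, bad_word):
    """Return the dictionary words whose code equals that of bad_word."""
    return index.get(soundex(bad_word), [])


index = build_index(["misspelling", "hello", "world"])
print(lookup(index, "mshpeln"))
```

Several words can share one code, so in practice the candidates would still need ranking, for instance by edit distance.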

To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.

This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add erroneous spaces, then consider two words at a time instead of single words, and so on, depending on your requirements.
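The three steps above can be sketched in a few lines of Python. The function name candidate_matches and the default threshold of 5 are assumptions for illustration. Note that a set comparison ignores letter order and repetition, so it is looser than the "in order" requirement in the question and will produce extra false positives.

```python
def candidate_matches(text, target, min_shared=5):
    """Return words in text sharing at least min_shared distinct letters with target."""
    target_letters = set(target.lower())          # step 1: unique letters of the target
    matches = []
    for word in text.split():                     # step 2: split the text into words
        shared = target_letters & set(word.lower())  # step 3: set intersection
        if len(shared) >= min_shared:
            matches.append(word)
    return matches


print(candidate_matches("the word mshpeln was scanned", "misspelling"))
```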

I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error made by the OCR algorithm, as it can be anywhere between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable threshold.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

Related

how do I make my program guess for the correct word?

I am interested in doing some AI/algorithmic explorations. So I have this idea to make a simple application kind of like hangman, where I assign a word and leave some letters as clues. But instead of a user guessing the word, I want my application to try to figure it out based on the clues I leave it. Does anyone know where I should start? Thanks.
Create a database of words of the desired language (index wikipedia dumps).
That probably shouldn't exceed 1 million words.
Then you can simply query a database:
for example: fxxulxxs
--> SELECT * FROM T_Words WHERE word LIKE 'f__ul__s'
--> fabulous
If there is more than one word in the result set, return the one that is statistically the most used.
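Here is a small sketch of that query using Python's built-in sqlite3 module. The T_Words table comes from the query above; the frequency column, the sample rows, and the best_match helper are assumptions added to illustrate picking the most used word.

```python
import sqlite3


def best_match(conn, pattern):
    """Return the most frequent word matching a LIKE pattern ('_' = one unknown letter)."""
    row = conn.execute(
        "SELECT word FROM T_Words WHERE word LIKE ? "
        "ORDER BY frequency DESC LIMIT 1",
        (pattern,),
    ).fetchone()
    return row[0] if row else None


# In-memory database with made-up usage counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Words (word TEXT PRIMARY KEY, frequency INTEGER)")
conn.executemany(
    "INSERT INTO T_Words VALUES (?, ?)",
    [("fabulous", 120), ("famously", 80), ("cat", 500), ("bat", 200), ("hat", 300)],
)

print(best_match(conn, "f__ul__s"))
print(best_match(conn, "_at"))
```

Passing the pattern as a bound parameter avoids having to quote it by hand.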
Another method would be to take a look at NHunspell.
If you want to do it more analytically, you need to find a statistical method to correlate stems, endings and beginnings, or basically a measurement for word similarity.
Language research shows that you can easily read words when you only have the start and the ending. If you only have the middle, then it gets difficult.
You might want to check out some form of algorithm for measuring edit distance, such as Damerau-Levenshtein distance (wikipedia). That is typically used to find the one word among several that most closely matches some other given word.
It is used a lot for searching and comparison when processing DNA and Protein sequences, but might be useful in your case too.
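As a rough illustration, the sketch below implements the restricted form of Damerau-Levenshtein (often called optimal string alignment), which counts insertions, deletions, substitutions, and swaps of adjacent letters. The names osa_distance and closest are made up for this example.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent letters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def closest(word, candidates):
    """Return the candidate with the smallest distance to word."""
    return min(candidates, key=lambda c: osa_distance(word, c))


print(closest("fabulus", ["fabulous", "famous", "fibula"]))
```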
The first step is to build a data structure containing all the valid words and which can be queried easily to retrieve all the words matching the current pattern. Then with this list of matching words you can compute the most frequent letter to get the next candidate. Another approach could be to find the letter which will give the smallest next matching words set.
next_guess(pattern, played_chars, dictionary)
    // find all the words matching the pattern that do not contain
    // played letters absent from the pattern
    words_set = find_words_matching(pattern, played_chars, dictionary)
    // build an array containing, for each letter, its frequency in the words set
    letter_freq = build_frequency_array(words_set)
    // build an array containing, for each letter, the size of the words set
    // if that letter appears at least once in the word (I call it its power)
    letter_power = build_power_array(words_set)
    // find the letter minimizing a function of the last two arrays
    // (this is the AI part)
    candidate = minimize(weighted_function, letter_freq, letter_power)

Regular expression validator

I have a text box for a phone number and I need to validate it. My requirements are:
Accept only numeric input of more than 10 digits
Accept symbols like (, ), - and ,
Can anyone help with this? I tried
^[\d{10,14} +\s +\( +\)-]+$
but it is not working.
You may take a look at the following article which will help you build such expression.
You haven't said what is wrong with your regex (why it's not working as expected), but I'm guessing that the issue is that it matches far more than it should, i.e. it will match 1 or more of any of the characters in your set (rather than just between 10 and 14 digits).
I think your mistake is that you have put way too much in your character set. You've got the + symbol in there 3 times, and it looks like you're trying to use quantifiers from within the set as well, which does not work: inside a set, {10,14} is just a list of literal characters. Character sets are the equivalent of single-character alternations. So, [abc] is the equivalent of a|b|c.
I'm assuming that you want the input to be between 10 and 14 numbers while still allowing any number (zero or more) of the following characters:
+()-,
As some others have suggested, you could just put the chars you want in a set and then specify the quantifier after it, like this: ^[0-9(),+-]{10,14}$ (the hyphen goes last so that it is read as a literal rather than as a range operator). This will almost get you there. The only problem is that it allows between 10 and 14 of any of these characters, so it would successfully match this:
,,,,,++()---
Which clearly you don't want (do you?)
So, in order to better solve this problem, you'll need to be more specific about what is allowed and where in the subject it is allowed. Because I don't know exactly what you want to match, I can't take you much further.
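As one possible starting point, here is a Python sketch that counts digits separately from punctuation. It assumes the rule is "10 to 14 digits, with any number of (, ), +, -, comma, or whitespace characters around them"; that rule, the pattern, and the name is_valid_phone are guesses at the requirement, not something stated in the question.

```python
import re

# Zero or more allowed separator characters.
SEPARATORS = r"[()+,\-\s]*"

# Optional separators, then 10 to 14 digits, each optionally followed by separators.
PHONE_RE = re.compile(rf"{SEPARATORS}(?:[0-9]{SEPARATORS}){{10,14}}")


def is_valid_phone(text):
    """True if text has 10-14 digits and otherwise only allowed separators."""
    return PHONE_RE.fullmatch(text) is not None


for sample in ["(555) 123-4567", "+1 (555) 123-4567", ",,,,,++()---", "12345"]:
    print(repr(sample), is_valid_phone(sample))
```

Because the quantifier applies to a group that must start with a digit, punctuation alone can no longer satisfy the length requirement.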
Hopefully the information I've provided here should be good enough to get you started, and if you have more questions... well that's what we're all here for right, so ask away.
To help you out with learning, below are a few resources you might find useful (this is a small subset of what's available, so do go ahead and search for yourself):
Testing tools
Rubular (ruby)
GSkinner Regex Tester
RegexHero (dotnet)
Helpful info
Regular-Expressions.Info
Codeproject 30 Minute Tutorial

How to parse through data efficiently

I am wondering if anyone can help me out with parsing out data for key words.
Say I am looking for this keyword: My Example Yo (this is one of many keywords)
I have data like this
MY EXAMPLE YO #108
my-example-yo #108
my-example #108
MY Example #108
These are just a few combinations. There could be words or numbers in front of these phrases, they could be in any case, and something may or may not come after them, as in the examples above.
A few ideas came to mind.
Store all the combinations that I can possibly think of in my database, then use contains.
The downside is that I would end up with a huge database table holding every combination of everything I need to find. I would then have to load the data into memory (through NHibernate) and check every combination. I am trying to determine what category to use based on the keyword, and users can upload thousands of rows to check.
Even if I load subsets and look through them, I still picture this being slow.
Remove all special characters, collapse whitespace to single spaces, ignore case, and try to use a regex to see how much of the keyword matches.
I am not sure what to do if the keyword itself has special characters like dashes.
I know I will not catch every combination out there, but I want to try to get as many as I can.
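The second idea (normalize, then measure how much of the keyword matches) can be sketched in Python as below. The scoring rule is an assumption made for illustration: it reports the fraction of the keyword's leading words that appear as a consecutive run in the data. Keywords are normalized the same way as the data, so dashes in a keyword stop being a special case. The names tokenize and keyword_score are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase, turn every run of non-alphanumerics into a space, split into words."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def keyword_score(text, keyword):
    """Fraction of the keyword's leading words found as a consecutive run in text."""
    words = tokenize(text)
    key = tokenize(keyword)
    for n in range(len(key), 0, -1):  # try the longest prefix first
        prefix = key[:n]
        if any(words[i:i + n] == prefix for i in range(len(words) - n + 1)):
            return n / len(key)
    return 0.0


for row in ["MY EXAMPLE YO #108", "my-example-yo #108", "my-example #108", "MY Example #108"]:
    print(row, keyword_score(row, "My Example Yo"))
```

A threshold on the score then decides which rows count as a match for the category.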
Have you considered Lucene.Net? I haven't used it myself, but I hear it's a great tool for full-text searching. It might do well with keyword searching too. I believe that Stack Overflow uses Lucene.

CamelCase conversion to friendly name, i.e. Enum constants; Problems?

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
    @"(?<a>(?<!^)[A-Z][a-z])", @" ${a}");
result = Regex.Replace(result,
    @"(?<a>[a-z])(?<b>[A-Z0-9])", @"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
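For anyone who wants to experiment outside .NET, here is a rough Python port of the same two passes, run against the identifiers listed above. The function name friendly_name is made up; the patterns are the same apart from using numbered groups instead of named ones.

```python
import re


def friendly_name(identifier):
    """Insert spaces into an UpperCamelCase identifier using the two passes."""
    # Pass 1: space before an uppercase+lowercase pair, except at the start.
    result = re.sub(r"(?<!^)([A-Z][a-z])", r" \1", identifier)
    # Pass 2: space between a lowercase letter and a following uppercase letter or digit.
    return re.sub(r"([a-z])([A-Z0-9])", r"\1 \2", result)


for name in ["NoDescription", "AAANoDescription", "ThisHasTheAcronymABCInTheMiddle",
             "MyTrailingAcronymID", "IDo3Things", "Basic"]:
    print(friendly_name(name))
```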
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?
So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, @"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", @" ${a}");
Does it in one pass.
Not that this directly answers the question, but why not test by taking the standard C# API and converting each class name into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.
Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market is so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available in other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... Regex's are great but too tough for my users.
my plan is to enable users to specify a list of allowed characters - for example
a-z|A-Z|0-9|,
I can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However, I'm a little lost on how to deal with all the escaping. Imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually, but before I write such a nasty solution: is there some nifty way to do this?
Thanks
David
Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));
You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jettison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi-line text box, for instance).
To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 *{}()
You get the idea. I think this will be much simpler.
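With a plain list of allowed characters, the escaping problem mostly goes away. The Python sketch below shows two options under that assumption: a direct membership check, and, if a regex is still wanted, a character class built by escaping every character individually with re.escape. The helper names are invented for this example.

```python
import re


def is_valid_by_set(text, allowed):
    """True if every character of text is one of the allowed characters."""
    allowed_set = set(allowed)
    return all(ch in allowed_set for ch in text)


def build_validator(allowed):
    """Compile a regex equivalent to 'zero or more of the allowed characters'."""
    if not allowed:
        return re.compile(r"")  # nothing allowed: only the empty string can match
    # Escaping each character keeps ], \, ^ and - literal inside the class.
    char_class = "".join(re.escape(ch) for ch in sorted(set(allowed)))
    return re.compile("[" + char_class + "]*")


allowed = "abc*[]\\-^ |"
validator = build_validator(allowed)
print(validator.fullmatch("a*b|c") is not None)
print(is_valid_by_set("abd", allowed))
```

Use fullmatch (or anchor the pattern) so that a partial match is not mistaken for a valid string.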
