Algorithm for sentence analysis and tokenization - C#

I need to analyze a document and compile statistics on how many times each sequence of words is used (so the analysis is not of single words but of batches of recurring words). I read that compression algorithms do something similar to what I want: they create dictionaries of blocks of text along with a piece of information reporting each block's frequency.
It should be something similar to http://www.codeproject.com/KB/recipes/Patterns.aspx
Do you have anything written in C#?

This is very simple to implement.
Use Split (a member of the string class) to split the string into words (you can use the delimiters from the CodeProject article).
Then use a for loop to enumerate all the n-grams and a Dictionary<string, int> to keep the counts.
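A minimal sketch of that approach (class and method names are mine):

using System;
using System.Collections.Generic;

static class NGrams
{
    static readonly char[] Delimiters = { ' ', ',', '.', ';', ':', '!', '?', '\t', '\n' };

    // Count every word n-gram of the given size in the text.
    public static Dictionary<string, int> Count(string text, int n)
    {
        string[] words = text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + n <= words.Length; i++)
        {
            string gram = string.Join(" ", words, i, n);
            counts.TryGetValue(gram, out int c);
            counts[gram] = c + 1;
        }
        return counts;
    }
}

Run it for each n you care about (bigrams, trigrams, ...) and keep the entries whose count exceeds some threshold.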

Related

regex that can handle horribly misspelled words

Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (like 20 chars, for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as a misspelling. I do not want to do OCR; the OCR job has been done, but it has been done poorly (i.e. the original said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely to find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is temporarily impossible.
I think you need not a regex, but a database of all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), give it columns like word and snd (computed as soundex(word)), and create indexes on both the word and snd columns.
For example, for the word mispeling you would fill snd with M214. SQLite has a soundex() function, though only in builds compiled with the SQLITE_SOUNDEX option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln it would be soundex('mshpeln') = M214. There you go: this way you can get the correct word back.
But this would not look anything like regex - sorry.
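For completeness, a minimal C# sketch of classic American Soundex (my own rendering; database implementations differ in details):

using System.Text;

static string Soundex(string word)
{
    // Digit codes for 'a'..'z'; '0' marks letters that produce no digit.
    const string codes = "01230120022455012623010202";
    var sb = new StringBuilder();
    sb.Append(char.ToUpperInvariant(word[0]));
    char last = codes[char.ToLowerInvariant(word[0]) - 'a'];
    foreach (char ch in word.Substring(1).ToLowerInvariant())
    {
        if (ch < 'a' || ch > 'z') continue;
        char code = codes[ch - 'a'];
        if (code != '0' && code != last) sb.Append(code);
        if (ch != 'h' && ch != 'w') last = code;  // h/w do not break a run
        if (sb.Length == 4) break;
    }
    return sb.ToString().PadRight(4, '0');
}

// Soundex("mispeling") == Soundex("mshpeln") == "M214"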
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for one or two people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense: a collection of guaranteed-unique elements) with all of the unique letters in misspelling: {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then perform a set intersection between this set and the set of the word you are matching against; this yields the letters contained in both sets. If this intersection has 5 or more characters in it, you have a possible match.
If the OCR can add erroneous spaces, then consider two words at a time instead of single words, and so on, depending on your requirements.
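A rough C# rendering of those three steps (names are mine):

using System;
using System.Collections.Generic;
using System.Linq;

static class FuzzyFinder
{
    // Yield the words of the text sharing at least minCommon distinct letters
    // with the target word, e.g. PossibleMatches(ocrText, "misspelling", 5).
    public static IEnumerable<string> PossibleMatches(string text, string target, int minCommon)
    {
        var targetLetters = new HashSet<char>(target);              // step 1
        foreach (string word in text.Split(' '))                    // step 2
        {
            var letters = new HashSet<char>(word);                  // step 3
            letters.IntersectWith(targetLetters);
            if (letters.Count >= minCommon)
                yield return word;
        }
    }
}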
I have no solution for this problem; in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error made by the OCR algorithm, as it can range anywhere between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be a first guess for "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess for the same expression is noiio verio. How should a computer know that these words would have been similar had they been better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

how do I make my program guess the correct word?

I am interested in doing some AI/algorithmic explorations. I have this idea to make a simple application, kind of like hangman, where I assign a word and leave some letters as clues. But instead of a user guessing the word, I want my application to try to figure it out based on the clues I leave it. Does anyone know where I should start? Thanks.
Create a database of words in the desired language (index Wikipedia dumps, for example).
That probably shouldn't exceed 1 million words.
Then you can simply query a database:
for example: fxxulxxs
--> SELECT * FROM T_Words WHERE word LIKE 'f__ul__s'
--> fabulous
If there is more than one word in the result set, you need to return the one that is statistically the most used.
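A hedged C# sketch of that lookup (assuming 'x' marks the unknown letters, as in the example; note this collides with the real letter 'x', so a real application would pick a safer marker):

// Turn a masked word such as "fxxulxxs" into the LIKE pattern "f__ul__s".
static string ToLikePattern(string masked) => masked.Replace('x', '_');

// Usage with any ADO.NET provider (sketch):
//   cmd.CommandText = "SELECT word FROM T_Words WHERE word LIKE @p";
//   cmd.Parameters.AddWithValue("@p", ToLikePattern("fxxulxxs"));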
Another method would be to take a look at nhunspell
If you want to do it more analytically, you need to find a statistical method to correlate stems, endings and beginnings - basically, a measure of word similarity.
Language research shows that you can easily read words when you only have the start and the ending. If you only have the middle, then it gets difficult.
You might want to check out some form of algorithm for measuring edit distance, such as Damerau-Levenshtein distance (wikipedia). That is typically used to find the one word among several that most closely matches some other given word.
It is used a lot for searching and comparison when processing DNA and Protein sequences, but might be useful in your case too.
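A compact C# sketch of the optimal-string-alignment variant of Damerau-Levenshtein:

using System;

static int Distance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;   // deletions
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;   // insertions
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;   // substitution
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + cost);
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);  // transposition
        }
    return d[a.Length, b.Length];
}

The dictionary word with the smallest distance to the partially revealed word is the best candidate.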
The first step is to build a data structure containing all the valid words, one that can be queried easily to retrieve all the words matching the current pattern. Then, with this list of matching words, you can compute the most frequent letter to get the next candidate. Another approach could be to find the letter which will yield the smallest next set of matching words.
next_guess(pattern, played_chars, dictionary)
    // find all the words matching the pattern and not containing
    // played letters that are absent from the pattern
    words_set = find_words_matching(pattern, played_chars, dictionary)
    // for each letter, its frequency across the words set
    letter_freq = build_frequency_array(words_set)
    // for each letter, the size of the words set if the letter appears
    // at least once in a word (I name it its "power")
    letter_power = build_power_array(words_set)
    // find the letter minimizing a function of the last two arrays
    // (this is the AI part)
    candidate = minimize(weighted_function, letter_freq, letter_power)
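A simplified C# rendering of the sketch above (it folds letter_freq and letter_power into one heuristic - guess the unplayed letter that appears in the most candidate words; a real weighted_function could combine both signals):

using System.Collections.Generic;
using System.Linq;

// pattern uses '_' for unknown letters, e.g. "_a__an"
static char NextGuess(string pattern, ISet<char> played, IEnumerable<string> dictionary)
{
    var candidates = dictionary.Where(w =>
        w.Length == pattern.Length &&
        Enumerable.Range(0, w.Length).All(i =>
            pattern[i] == '_' ? !played.Contains(w[i]) : pattern[i] == w[i]));

    return candidates
        .SelectMany(w => w.Distinct())     // each letter counted once per word
        .Where(c => !played.Contains(c))
        .GroupBy(c => c)
        .OrderByDescending(g => g.Count()) // most widespread letter first
        .First().Key;
}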

Regex to parse C/C++ functions declarations

I need to parse and split C and C++ function declarations into their main components (return type, function name/class and method, parameters, etc.).
I'm working from either headers or a list where the signatures take the form:
public: void __thiscall myClass::method(int, class myOtherClass * )
I have the following regex, which works for most functions:
(?<expo>public\:|protected\:|private\:) (?<ret>(const )*(void|int|unsigned int|long|unsigned long|float|double|(class .*)|(enum .*))) (?<decl>__thiscall|__cdecl|__stdcall|__fastcall|__clrcall) (?<ns>.*)\:\:(?<class>(.*)((<.*>)*))\:\:(?<method>(.*)((<.*>)*))\((?<params>((.*(<.*>)?)(,)?)*)\)
There are a few functions that it doesn't like to parse even though they appear to match the pattern. I'm not worried about matching functions that aren't members of a class at the moment (I can handle that later). The expression is used in a C# program, so the <label>s are for easily retrieving the groups.
I'm wondering if there is a standard regex to parse all functions, or how to improve mine to handle the odd exceptions?
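For reference, this is how the named groups are read back in C# (line and signaturePattern are placeholders for an input line and the expression above):

using System.Text.RegularExpressions;

Match m = Regex.Match(line, signaturePattern);
if (m.Success)
{
    string visibility = m.Groups["expo"].Value;
    string returnType = m.Groups["ret"].Value;
    string className  = m.Groups["class"].Value;
    string methodName = m.Groups["method"].Value;
    string parameters = m.Groups["params"].Value;
}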
C++ is notoriously hard to parse; it is impossible to write a regex that catches all cases. For example, there can be an unlimited number of nested parentheses, which shows that even this subset of the C++ language is not regular.
But it seems that you're going for practicality, not theoretical correctness. Just keep improving your regex until it catches the cases it needs to catch, and try to make it as stringent as possible so you don't get any false matches.
Without knowing the "odd exceptions" that it doesn't catch, it's hard to say how to improve the regex.
Take a look at Boost.Spirit: it is a Boost library that allows the implementation of recursive-descent parsers using only C++ code and no preprocessors. You have to specify a BNF grammar and then pass a string for it to parse. You can even generate an abstract syntax tree (AST), which is useful for processing the parsed data.
The BNF specification for a list of integers or words separated by spaces might look like:
using spirit::alpha_p;
using spirit::digit_p;
using spirit::anychar_p;
using spirit::end_p;
using spirit::space_p;
// Inside the definition...
integer = +digit_p; // One or more digits.
word = +alpha_p; // One or more letters.
token = integer | word; // An integer or a word.
token_list = token >> *(+space_p >> token); // A token, followed by 0 or more space-separated tokens.
For more information refer to the documentation, the library is a bit complex at the beginning, but then it gets easier to use (and more powerful).
No. Even function prototypes can have arbitrary levels of nesting, so they cannot be expressed with a single regular expression.
If you really are restricting yourself to things very close to your example (exactly 2 arguments, etc.), then could you provide an example of something that doesn't match?

n-grams using regex

I am working on an augmentative and alternative communication (AAC) program. My current goal is to store a history of input/spoken text and search for common phrase fragments or word n-grams. I am currently using an implementation based on the LZW compression algorithm, as discussed at CodeProject - N-gram and Fast Pattern Extraction Algorithm. This approach, although it produces n-grams, does not behave as needed.
Let's say, for example, that I enter "over the mountain and through the woods" several times. My desired output would be the entire phrase "over the mountain and through the woods". Using my current implementation, the phrase is broken into trigrams, and on each repeated entry one word is added. So on the first entry I get "over the mountain"; on the second entry, "over the mountain and"; etc.
Let's assume we have the following text:
this is a test
this is another test
this is also a test
the test of the emergency broadcasting system interrupted my favorite song
My goal would be that if "this is a test of the emergency broadcasting system" were entered next, I could use a regex to return "this is a test" and "test of the emergency broadcasting system". Is this something that is possible with a regex, or am I walking the wrong path? I appreciate any help.
I have been unable to find a way to do what I need with regular expressions alone, although the technique shown at Matching parts of a string when the string contains part of a regex pattern comes close.
I ended up using a combination of my initial system along with some regex as shown below.
(flow chart: http://www.alsmatters.org/files/phraseextractor.png)
This parses the transcript of the first presidential debate (about 16,500 words) in about 30 seconds, which for my purposes is quite fast.
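The flow chart itself isn't reproduced here, but the core scan can be sketched roughly like this (an illustration, not the exact implementation behind the chart):

using System.Collections.Generic;
using System.Linq;

// Yield the word n-grams of the input that also occur in the stored
// history, longest first. Space padding keeps matches on word boundaries.
static IEnumerable<string> KnownFragments(string input, IEnumerable<string> history)
{
    string[] words = input.Split(' ');
    var stored = history.Select(h => " " + h + " ").ToList();
    for (int len = words.Length; len >= 2; len--)
        for (int start = 0; start + len <= words.Length; start++)
        {
            string fragment = " " + string.Join(" ", words, start, len) + " ";
            if (stored.Any(h => h.Contains(fragment)))
                yield return fragment.Trim();
        }
}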
From your use case, it appears you do not want fixed-length n-gram matches but rather the longest n-gram match. I just saw your answer to your own post, which confirms this ;)
In Python you can use the fuzzywuzzy library to match a set of phrases to a canonical/normalized set of phrases through an associated list of "synonym" phrases or words. The trick is segmenting your phrases appropriately (e.g. when do commas separate phrases, and when do they join lists of related words within a phrase?)
Here's the structure of the python dict in RAM. Your data structure in C or a database would be similar:
phrase_dict = {
    'alternative phrase': 'canonical phrase',
    'alternative two': 'canonical phrase',
    'less common phrasing': 'different canonical phrase',
}
from fuzzywuzzy.process import extractOne
phrase_dict[extractOne('unknown phrase', phrase_dict)[0]]
and that returns
'canonical phrase'
FuzzyWuzzy seems to use something like a simplified Levenshtein edit-distance... it's fast but doesn't deal well with capitalization (normalize your case first), word sounds (there are other libraries, like soundex, that can hash phrases by what they sound like), or word meanings (that's what your phrase dictionary is for).

how to create a parser for search queries

For example, I'd need to create something like the Google search query parser to parse expressions such as:
flying hiking or swimming
-"walking in boots" author:hamish author:reid
or
house in new york priced over
$500000 with a swimming pool
How would I even go about starting to build something like this? Any good resources?
C# relevant, please (if possible).
Edit: this is something that I should somehow be able to translate into a SQL query.
How many keywords do you have (like 'or', 'in', 'priced over', 'with a')? If you only have a couple of them, I'd suggest going with simple string processing (regexes) too.
But if you have more than that, you might want to look into implementing a real parser for those search expressions. Irony.net might help you with that (I found it extremely easy to use, as you can express your grammar in a near-BNF form directly in code).
The Lucene/NLucene project has functionality for boolean queries and some other query formats as well. I don't know about the possibilities of adding your own extensions, like author in your case, but it might be worthwhile to check it out.
There are a few ways of doing it; two of them are:
Parsing using a grammar (useful for a complex language)
Parsing using regular expressions and basic string manipulation (for a simpler language)
According to your example, the language is very basic, so splitting the string on keywords may be the best solution:
string sentence = "house in new york priced over $500000 with a swimming pool";
string[] values = sentence.Split(new[] { " in ", " priced over ", " with a " },
                                 StringSplitOptions.None);
string type = values[0];
string area = values[1];
string price = values[2];
string accessories = values[3];
However, some issues may arise: how do you verify that the sentence is in the expected form? What happens if some of the keywords can appear as part of the values?
If that is the case, there are libraries you can use to parse the input using a defined grammar. Two such libraries that work with .NET are ANTLR and GOLD Parser, both free. The main challenge is defining the grammar.
A grammar would work very well for the second example you gave, but the first (keyword/command strings in any order) would be best handled using Split() and a class to handle the various keywords and commands. You will have to do some initial processing to handle quoted regions before the split (for example, replacing spaces within quoted regions with a rare/unused character - see the sketch below).
The ":" commands are easy to find and pull out of the search string for processing after the split is completed. Simply traverse the array looking for them.
The +/- keywords are also easy to find and add to the SQL query as AND/AND NOT clauses.
The only place you might run into issues is with the "or", since you'll have to define how it is handled. What if there are multiple "or"s? But the order of keywords in the array is the same as in the query, so that won't be an issue.
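The quoted-region preprocessing could look like this (the placeholder character is arbitrary; it just has to be something that cannot occur in a query):

using System.Text.RegularExpressions;

// Replace spaces inside double-quoted regions with a placeholder (and drop
// the quotes) so that a later Split(' ') keeps each phrase as one token.
static string ProtectQuotedRegions(string query) =>
    Regex.Replace(query, "\"([^\"]*)\"",
        m => m.Groups[1].Value.Replace(' ', '\u0001'));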
I think you should just do some string processing; there is no smart way of doing this.
So replace "or" with your own OR operator (e.g. ||). As far as I know, there is no library for this.
I suggest you go with regexes.
