Algorithm to find keywords and keyphrases in a string

I need advice or directions on how to write an algorithm which will find keywords or keyphrases in a string.
The string contains:
Technical information written in English (GB)
Words are mostly separated by spaces
A keyword does not contain a space but it may contain a hyphen, apostrophe, colon etc.
A keyphrase may contain a space, a comma or other punctuation
If two or more keywords appear together then it is likely a keyphrase e.g. "inverter drive"
The text also contains HTML but this can be removed beforehand if necessary
Non-keywords would be words like "and", "the", "we", "see", "look" etc.
Keywords are case-insensitive e.g. "Inverter" and "inverter" are the same keyword
The algorithm has the following requirements:
Operate in a batch-processing scenario e.g. run once or twice a day
Process strings varying in length from roughly 200 to 7000 characters
Process 1000 strings in less than 1 hour
Will execute on a server with moderately good power
Written in one of the following: C#, VB.NET, or T-SQL maybe even F#, Python or Lua etc.
Does not rely on a list of predefined keywords or keyphrases
But can rely on a list of keyword exclusions e.g. "and", "the", "go" etc.
Ideally transferable to other languages e.g. doesn't rely on language-specific features e.g. metaprogramming
Output a list of keyphrases (descending order of frequency) followed by a list of keywords (descending order of frequency)
It would be extra cool if it could process up to 8000 characters in a matter of seconds, so that it could be run in real-time, but I'm already asking enough!
Just looking for advice and directions:
Should this be regarded as two separate algorithms?
Are there any established algorithms which I could follow?
Are my requirements feasible?
Many thanks.
P.S. The strings will be retrieved from a SQL Server 2008 R2 database, so ideally the language would have support for this, if not then it must be able to read/write to STDOUT, a pipe, a stream or a file etc.

The logic involved makes this awkward to program in T-SQL. Choose a language like C#. First try to make a simple desktop application. Later, if you find that loading all the records into this application is too slow, you could write a C# stored procedure that is executed on the SQL Server. Depending on the SQL Server's security policy, the assembly may need to be signed with a strong-name key.
To the algorithm now. A list of excluded words is commonly called a stop word list. If you search for that term, you will find stop word lists you can start with. Add these stop words to a HashSet<string> (I'll be using C# here):
// Assuming that each line contains one stop word.
// Needs: using System; using System.Collections.Generic; using System.IO;
HashSet<string> stopWords =
    new HashSet<string>(File.ReadLines(@"C:\stopwords.txt"), StringComparer.OrdinalIgnoreCase);
Later you can check whether a keyword candidate is in the stop word list with
if (!stopWords.Contains(candidate)) {
    // We have a keyword
}
HashSets are fast. Lookups take O(1) time on average, meaning that the time required for a lookup does not depend on the number of items the set contains.
Looking for the keywords can easily be done with Regex.
string text = ...; // Load text from DB
MatchCollection matches = Regex.Matches(text, "[a-z]([:']?[a-z])*",
    RegexOptions.IgnoreCase);
foreach (Match match in matches) {
    if (!stopWords.Contains(match.Value)) {
        ProcessKeyword(match.Value); // Do whatever you need to do here
    }
}
If you find that a-z is too restrictive and you need accented letters, you can change the regular expression to @"\p{L}([:']?\p{L})*". The Unicode category \p{L} matches any letter, including modifier letters.
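The keyword step described above can be sketched in Python, one of the languages the question allows. This is only an illustration under assumptions: the stop word set is a tiny hypothetical sample, the names `STOP_WORDS`, `WORD_RE` and `keyword_counts` are invented for the example, and a hyphen is added to the joining characters because the question says keywords may contain one.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be loaded from a file.
STOP_WORDS = {"and", "the", "we", "see", "look", "a", "is", "to", "of"}

# [^\W\d_] is Python's idiom for "any Unicode letter" (the role \p{L} plays in .NET).
# Letters may be joined by a single colon, apostrophe or hyphen (hyphen is an assumption).
WORD_RE = re.compile(r"[^\W\d_](?:[:'-]?[^\W\d_])*")


def keyword_counts(text):
    """Count case-insensitive keywords, skipping stop words."""
    words = (m.group(0).lower() for m in WORD_RE.finditer(text))
    return Counter(w for w in words if w not in STOP_WORDS)


if __name__ == "__main__":
    sample = "Inverter drive, inverter and the DRIVE; see drive-train"
    # most_common() yields keywords in descending order of frequency.
    for word, count in keyword_counts(sample).most_common():
        print(count, word)
```

Lower-casing before counting is what makes "Inverter" and "inverter" the same keyword; `Counter.most_common()` gives the descending-frequency order the question asks for.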
The phrases are more complicated. You could try to split the text into phrases first and then apply the keyword search on these phrases instead of searching the keywords in the whole text. This would give you the number of keywords in a phrase at the same time.
Splitting the text into phrases involves searching for sentences ending with "." or "?" or "!" or ":". You should exclude dots and colons that appear within a word.
// The group is non-capturing; Regex.Split would otherwise add the captured whitespace to the result.
string[] phrases = Regex.Split(text, @"[\.\?!:](?:\s|$)");
This looks for punctuation followed either by whitespace or by the end of the text. I must admit that this is not perfect: it might erroneously treat the dot after an abbreviation as a sentence end. You will have to experiment in order to refine the splitting mechanism.
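One way to turn the phrase idea into keyphrase counts is sketched below in Python. It relies on the heuristic from the question that two or more keywords appearing together are likely a keyphrase. Everything here is an assumption made for illustration: the small stop word set, the names `keyphrase_counts`, `PHRASE_SPLIT_RE` and `WORD_RE`, and the choice to treat every run of consecutive non-stop words inside a phrase as one keyphrase.

```python
import re
from collections import Counter

# Hypothetical sample stop words; a real list would be much longer.
STOP_WORDS = {"and", "the", "we", "see", "look", "a", "is", "to", "of", "in", "at"}

# Same splitting rule as above: punctuation followed by whitespace or end of text.
PHRASE_SPLIT_RE = re.compile(r"[.?!:](?:\s|$)")
WORD_RE = re.compile(r"[a-z](?:[:'-]?[a-z])*", re.IGNORECASE)


def keyphrase_counts(text):
    """Count runs of two or more consecutive non-stop words within each phrase."""
    counts = Counter()
    for phrase in PHRASE_SPLIT_RE.split(text):
        run = []
        # The trailing None acts as a sentinel that flushes the last run.
        for word in WORD_RE.findall(phrase) + [None]:
            if word is not None and word.lower() not in STOP_WORDS:
                run.append(word.lower())
                continue
            if len(run) >= 2:
                counts[" ".join(run)] += 1
            run = []
    return counts


if __name__ == "__main__":
    sample = "The inverter drive is new. We look at the Inverter Drive and the fan."
    print(keyphrase_counts(sample).most_common())
```

Because a stop word or a phrase boundary ends the current run, a keyphrase never spans two sentences. Whether long runs should also be broken into shorter sub-phrases is a design decision this sketch leaves open.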

Related

Matching strings between whitespaces without including them [duplicate]

I'm trying to come up with an example where positive look-around works but non-capture groups won't work, to further understand their usages. The examples I'm coming up with all work with non-capture groups as well, so I feel like I'm not fully grasping the usage of positive look-around.
Here is a string (taken from an SO example) that uses positive look-ahead in the answer. The user wanted to grab the second column value, only if the value of the first column started with ABC, and the last column had the value 'active'.
string ='''ABC1 1.1.1.1 20151118 active
ABC2 2.2.2.2 20151118 inactive
xxx x.x.x.x xxxxxxxx active'''
The solution given used 'positive look-ahead', but I noticed that I could use non-capturing groups to arrive at the same answer.
So, I'm having trouble coming up with an example where positive look-around works and a non-capturing group doesn't.
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?=\S+\s+active)')  # solution
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?:\S+\s+active)')  # solution without lookaround
If anyone would be kind enough to provide an example, I would be grateful.
Thanks.

The fundamental difference is that non-capturing groups still consume the part of the string they match, thus moving the cursor forward.
One example where this matters is when you try to match strings that are surrounded by certain boundaries, and these boundaries can overlap. Sample task:
Match every a in a given string that is surrounded by bs. The given string is bababaca. There should be two matches, at positions 2 and 4.
Using lookarounds this is rather easy: you can use b(a)(?=b) or (?<=b)a(?=b) and match them. But (?:b)a(?:b) won't work: the first match also consumes the b at position 3, which is needed as the boundary for the second match. (Note: the non-capturing group isn't actually needed here; a plain bab behaves the same way.)
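The overlap can be checked directly in Python. The snippet below is a small sketch using only the standard `re` module; the variable names are made up for the example, and positions are converted to the 1-based numbering used above.

```python
import re

text = "bababaca"

# Lookarounds assert the surrounding b's without consuming them,
# so the b at position 3 can serve as a boundary for both matches.
lookaround_positions = [m.start() + 1 for m in re.finditer(r"(?<=b)a(?=b)", text)]

# Non-capturing groups consume both b's. Each match starts on the leading b,
# so the a sits one character later (hence + 2 for a 1-based position).
consuming_positions = [m.start() + 2 for m in re.finditer(r"(?:b)a(?:b)", text)]

print(lookaround_positions)  # [2, 4]
print(consuming_positions)   # [2]
```

After the consuming pattern matches "bab" at the start, the search resumes at the second a, which no longer has a b available on its left.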
Another prominent example is password validation: checking that the password contains uppercase letters, lowercase letters, numbers, and so on. You can use a bunch of alternations to match these, but lookaheads are far easier:
(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])
vs
(?:.*[a-z].*[A-Z].*[0-9].*[!?.])|(?:.*[A-Z].*[a-z].*[0-9].*[!?.])|(?:.*[0-9].*[a-z].*[A-Z].*[!?.])|(?:.*[!?.].*[a-z].*[A-Z].*[0-9])|(?:.*[A-Z].*[a-z].*[!?.].*[0-9])|...
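As a quick illustration of the lookahead version, here is a minimal Python sketch. The function name `is_strong` is hypothetical; the pattern is the one shown above. Each lookahead scans from the start of the string independently, so the order in which the character classes appear in the password does not matter.

```python
import re

# Four independent lookaheads, all anchored at the start by re.match.
PASSWORD_RE = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])")


def is_strong(password):
    """True if the password has a lowercase letter, an uppercase letter, a digit and one of ! ? ."""
    return PASSWORD_RE.match(password) is not None


if __name__ == "__main__":
    for candidate in ("aB3!", "!3Ba", "abc123!"):
        print(candidate, is_strong(candidate))
```

The alternation version would need one branch per ordering of the four classes, which is 24 branches for the same check.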

What is the best way to match substring from a big string to a huge list of keywords

Imagine you have millions of records, each containing text with an average of 2000 words, and you also have another list with about 100000 items.
E.g. the keywords list has an item like "president Obama", and one of the text records contains something like "..... president Obama ....". I want to find this keyword in the text and replace it with something like "..... {president Obama} ...." to highlight it. The keywords list contains multi-word noun phrases like the one in the example.
What is the fastest way to do this for such a huge list with millions of text records?
Notes:
Currently I do this in a greedy way, checking word by word and matching, but it takes about 2 seconds per text record, and I want something near zero time.
I also know this is something like named-entity recognition, and I have worked with many NER frameworks such as GATE and ..., but because I want this for a language that the frameworks don't support, I want to do this manually.

Assumptions: Most keywords are single words, but there are some multi-word keywords.
My suggestion:
Hash the keywords based on the first word. So "President", "President Obama" and "President Clinton" will all hash to the same value.
Then search word by word, computing the hash of each word. On a hash match, implement logic to check whether you have a match on a multi-word keyword.
Calculating the hashes will be the most expensive operation of this solution and should be linear in the length of the input string.
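A rough Python sketch of this first-word index follows. It is an illustration under assumptions, not a tuned implementation: a plain `dict` plays the role of the hash table, words are whatever `\w+` matches, matching is case-insensitive, the longest candidate wins, and the names `build_index` and `highlight` are invented for the example.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"\w+")


def build_index(keywords):
    """Map the first word of each keyword to all keywords starting with it."""
    index = defaultdict(list)
    for keyword in keywords:
        words = tuple(keyword.lower().split())
        if words:
            index[words[0]].append(words)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)  # try the longest keyword first
    return index


def highlight(text, index):
    """Wrap every keyword occurrence in braces, scanning the text once."""
    tokens = list(WORD_RE.finditer(text))
    lowered = [t.group(0).lower() for t in tokens]
    out, pos, i = [], 0, 0
    while i < len(tokens):
        # One hash lookup per word; multi-word checks only happen on a hit.
        matched = next((c for c in index.get(lowered[i], ())
                        if tuple(lowered[i:i + len(c)]) == c), None)
        if matched is None:
            i += 1
            continue
        start, end = tokens[i].start(), tokens[i + len(matched) - 1].end()
        out.append(text[pos:start] + "{" + text[start:end] + "}")
        pos = end
        i += len(matched)
    out.append(text[pos:])
    return "".join(out)


if __name__ == "__main__":
    idx = build_index(["president", "president Obama"])
    print(highlight("Today president Obama met the President.", idx))
```

The work per record is proportional to the number of words in it, independent of how many keywords the list contains. One known gap in this sketch: punctuation between words is ignored, so "president. Obama" would also be highlighted as a single keyword.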

As for the exact keyword match:
10^6 records * 2*10^3 words = 2*10^9 words to check. Comparing each of them against 10^5 keywords leads to up to 10^6 * 2*10^3 * 10^5 = 2*10^14 operations in the worst case (no match). The probability of no match is high, because 100000 keywords is small compared with all possible words.
"and I want something near zero time"
Not possible.
As for NER, you must drop the keywords list and classify the text into the grammatical categories you would like to highlight.
Things like:
verbs
adverbs
nouns
names
quantities
etc.
can be identified. After you have done that, you could define a special list containing special words by category. E.g. "President" might be in such a (noun) list so that it can be highlighted with special properties. Because you'll end up with a much smaller special list, split into several categories, you can decrease the number of operations needed.
(I just realized that, as you know all about NER, you already know that.)
So, you could build NER-like logic (or another algorithm that does not require a 100% match) for the language you're targeting.
Another approach might be:
Put all your keywords in a hashtable or another (indexed) dictionary, and check whether the targeted word exists in that hashtable. As it is indexed, it will be significantly faster than regular matching. You can store additional info for the keyword in the hashtable.

Verify that my Regex works as expected

It's my first regex for production code. Until now I've always avoided writing them myself, and now I'm a bit worried about whether it really works as expected. I made a lot of attempts to break it, but I really don't want to rely on that alone, especially since I have zero experience.
My regex should match exactly this pattern
first character must be one of the letters (not case sensitive) - K,C,M,X,S,W
second character must be a digit from 0-9
a hyphen -
4 alphanumeric characters (A-Z or 0-9) (not case sensitive) and
one letter (A-Z) (not case sensitive).
And that's it. It can't be shorter, it can't be longer, it must match exactly this pattern. What I have for now is this:
string RegExPattern = @"^(K|C|M|X|S|W){1}[0-9]{1}[-]{1}[A-Z0-9]{4}[A-Z]{1}$";
if (!Regex.IsMatch(txtCode.Text, RegExPattern, RegexOptions.IgnoreCase))
{
MessageBox.Show("Fail");
return false;
}
Is there any tool, or some other way to verify the behavior of a regex and is this regex correct for the matching pattern I explained above?

Yes, that is correct.
However, all the {1} quantifiers are redundant, you can use a character set for the first character instead of the | operator, and you don't need a set for the hyphen:
string RegExPattern = @"^[KCMXSW][0-9]-[A-Z0-9]{4}[A-Z]$";
There are tools for writing and testing regular expressions, but they can only test the input variations that you think of, and it seems that you have already done that.

A nice tool to verify and develop regular expressions: http://www.debuggex.com. Nevertheless, I'd advise you to back up your regular expressions with a set of unit tests.

The best tool is a suite of unit tests, or a single test that iterates over several dozen chunks of text.
Create a text file that has a whole bunch of lines of text that are similar to the data this pattern will be used against. Make sure that some lines match, and that other lines fail different parts of the rule (e.g. a line that matches everything but the first character, one that matches everything but the last character, one with only 2 or 3 alphanumeric characters rather than four, etc.).
Then, write a small program that reads each line of text and runs your expression against it. Have it print the line numbers of the lines that match, and then compare that list of numbers against your expected results.
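A table-driven check along these lines might look like the following Python sketch. The cases are kept in a list rather than a text file to keep the example self-contained, and the names `CODE_RE`, `CASES` and `failures` are hypothetical. The pattern is the simplified one from the accepted answer.

```python
import re

CODE_RE = re.compile(r"^[KCMXSW][0-9]-[A-Z0-9]{4}[A-Z]$", re.IGNORECASE)

# (input, should_match): each failing case breaks exactly one part of the rule.
CASES = [
    ("K1-AB12C", True),
    ("w9-0000z", True),     # lower case is allowed
    ("A1-AB12C", False),    # first letter not in K, C, M, X, S, W
    ("KK-AB12C", False),    # second character is not a digit
    ("K1AB12C", False),     # hyphen missing
    ("K1-AB1C", False),     # only three alphanumerics before the final letter
    ("K1-AB123", False),    # last character is a digit
    ("K1-AB12CD", False),   # one character too long
    (" K1-AB12C", False),   # leading space
]


def failures(cases=CASES):
    """Return the cases whose actual result differs from the expected one."""
    return [(text, expected) for text, expected in cases
            if (CODE_RE.match(text) is not None) != expected]


if __name__ == "__main__":
    print(failures() or "all cases behave as expected")
```

One edge case worth adding to such a list: in both Python and .NET, `$` also matches just before a final newline, so "K1-AB12C\n" passes this pattern. If that matters for the input source, Python's `re.fullmatch` (or `\z` instead of `$` in .NET) closes the gap.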

Simple lexical parser

I want to write a lexical parser for regular text.
So I need to detect the following tokens:
1) Word
2) Number
3) dot and other punctuation
4) "..." "!?" "!!!" and so on
I think it is not trivial to write an "if else" condition for each item.
So are there any finite state machine generators for C#?
I know about ANTLR and others, but in the time it would take me to learn those tools I could write my own "if else" FSM.
I hope to find something like:
FiniteStateMachine.AddTokenDefinition(":)","smile");
FiniteStateMachine.AddTokenDefinition(".","dot");
FiniteStateMachine.ParseText(text);

I suggest using regular expressions. Something like @"[a-zA-Z\-]+" will pick up words (a-z and dashes), while @"[0-9]+(\.[0-9]+)?" will pick up numbers (including decimal numbers). Dots and such are similar, @"[!\.\?]+", and you can just add whatever punctuation you need inside the square brackets (escaping special regex characters with a backslash).
Poor man's "lexer" for C# is very close to what you are looking for, in terms of being a lexer. I recommend searching for regular expressions for words, numbers or whatever else you need, to find out exactly which expressions you need.
EDIT:
Or see Justin's answer for the particular regexes.

We need to know specifics on what you consider a word or a number. That being said, I'll assume "word" means "a C#-style identifier," and "number" means "a string of base-10 numerals, possibly including (but not starting or ending with) a decimal point."
Under those definitions, words would be anything matching the following regex:
@"\b(?!\d)\w+\b"
Note that this would also match Unicode letters and digits. Numbers would match the following:
@"\b\d+(?:\.\d+)?\b"
Note again that this doesn't cover hexadecimal, octal, or scientific notation, although you could add that in without too much difficulty. It also doesn't cover numeric literal suffixes.
After matching those, you could probably get away with this for punctuation:
@"[^\w\d\s]+"
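The three expressions can be combined into a single alternation with named groups, which gives a small lexer without a generator. The Python sketch below illustrates that idea; the group names and the `tokenize` function are assumptions made for the example, and `[^\w\s]` is used for punctuation because `\d` is already covered by `\w`.

```python
import re

# Order matters: numbers are tried before words, punctuation last.
TOKEN_RE = re.compile(r"""
    (?P<number>\b\d+(?:\.\d+)?\b)
  | (?P<word>\b(?!\d)\w+\b)
  | (?P<punct>[^\w\s]+)
""", re.VERBOSE)


def tokenize(text):
    """Return (token_type, token_text) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for kind, value in tokenize("Wait... it costs 3.50!?"):
        print(kind, repr(value))
```

Runs such as "..." or "!?" come out as one punctuation token because of the `+` quantifier. Mapping specific strings (for example ":)") to their own token type would mean adding another named alternative ahead of the generic punctuation one.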

Split text file at sentence boundary

I have to process a text file (an e-book). I'd like to process it so that there is one sentence per line (a "newline-separated file", yes?). How would I do this task using sed the UNIX utility? Does it have a symbol for "sentence boundary" like a symbol for "word boundary" (I think the GNU version has that). Please note that the sentence can end in a period, ellipsis, question or exclamation mark, the last two in combination (for example, ?, !, !?, !!!!! are all valid "sentence terminators"). The input file is formatted in such a way that some sentences contain newlines that have to be removed.
I thought about a script like s/...|. |[!?]+ |/\n/g (unescaped for better reading). But it does not remove the newlines from inside the sentences.
How about in C#? Would it be noticeably faster if I used regular expressions as in sed? (I think not.) Is there another, faster way?
Either way (sed or C#) is fine. Thank you.

Regex is a good option, and one that I used for a long time.
A very good regex that worked fine for me is
string[] sentences = Regex.Split(sentence, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
However, regex is not efficient. Also, though the logic works for ideal cases, it does not work well in a production environment.
For example, if my text is,
U.S.A. is a wonderful nation. Most people feel happy living there.
Splitting naively at each period would break it into 5 pieces, and even the regex above will split wrongly whenever an abbreviation is followed by a capitalized word (for example "Mr. Smith"). Logically, the text should be split into only two sentences.
This is what made me look for a machine learning technique, and in the end SharpNLP worked pretty well for me.
private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;
private string[] SplitSentences(string paragraph)
{
    if (mSentenceDetector == null)
    {
        mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
    }
    return mSentenceDetector.SentenceDetect(paragraph);
}
In this example I have made use of SharpNLP, with EnglishSD.nbin, a pre-trained model for sentence detection.
Now if I apply the same input to this method, it splits the text into two logical sentences.
You can even tokenize, POS-tag, chunk etc. using the SharpNLP project.
For step-by-step integration of SharpNLP into your C# application, read through the detailed article I have written. It explains the integration with code snippets.
Thanks

Sentence splitting is a non-trivial problem for which machine learning algorithms have been developed. But splitting on whitespace between [.\?!]+ and a capital letter [A-Z] might be a good heuristic. Remove the newlines first with tr, then apply the RE:
tr '\r\n' ' ' | sed 's/\([.?!]\)\s\s*\([A-Z]\)/\1\n\2/g'
The output should be one sentence per line. Inspect the output and refine the RE if you find errors. (E.g., mr. Ed would be handled incorrectly. Maybe compile a list of such abbreviations.)
Whether C# or sed is faster can only be determined experimentally.
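The same heuristic, including the suggested abbreviation list, can be sketched in Python. The abbreviation set below is a small hypothetical sample and the names `ABBREVIATIONS`, `BOUNDARY_RE` and `split_sentences` are invented for the example. Newlines inside sentences are collapsed to spaces first, as `tr` does in the pipeline above.

```python
import re

# Hypothetical sample; a real list would be compiled by inspecting the output.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}

# Whitespace that follows a terminator and precedes a capital letter.
BOUNDARY_RE = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences, one list item per sentence."""
    text = re.sub(r"\s*[\r\n]+\s*", " ", text).strip()
    sentences, start = [], 0
    for m in BOUNDARY_RE.finditer(text):
        # Skip the boundary if the word before it is a known abbreviation.
        last_word = text[start:m.start()].rsplit(None, 1)[-1].lower()
        if last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:m.start()])
        start = m.end()
    sentences.append(text[start:])
    return sentences


if __name__ == "__main__":
    sample = "Mr. Ed is a horse.\nHe talks! Does\nhe? Yes."
    print("\n".join(split_sentences(sample)))
```

Joining the result with newlines gives the one-sentence-per-line output the question asks for. Abbreviations followed by a lower-case word (such as "U.S.A. is") are never split, because the boundary requires a capital letter after the whitespace.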

You could use something like this to extract the sentences:
var sentences = Regex.Matches(input, @"[\w ,]+[\.!?]+");
foreach (Match match in sentences)
{
    Console.WriteLine(match.Value);
}
This should match sentences containing words, spaces and commas and ending with (any number of) periods, exclamation and question marks.

You can check my tutorial http://code.google.com/p/graph-expression/wiki/SentenceSplitting
The basic idea is to have split characters and impossible pre/post conditions at every split. This simple heuristic works very well.

The task you're interested in is often referred to as 'sentence segmentation'. As larsmans said, it's a non-trivial problem, but heuristic approaches often perform reasonably well, at least for English.
It sounds like you're primarily interested in English, so the regex heuristics already presented may perform adequately for your needs. If you'd like a somewhat more accurate solution (at the cost of just a little more complexity), you might consider using LingPipe, an open-source NLP framework. I've had pretty good luck with LingPipe, the few times I've used it.
See http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html for a detailed tutorial on sentence segmentation.
