Determine POS tagging in English based on database files

I'm a little confused about how to determine part-of-speech (POS) tags in English. I assume that each English word has exactly one type; for example, the word "book" is recognized as a NOUN, not as a VERB. I want to recognize English sentences based on tense. For example, "I sent the book" is recognized as past tense.
Description:
I have a number of database (*.txt) files: NounList.txt, verbList.txt, adjectiveList.txt, adverbList.txt, conjunctionList.txt, prepositionList.txt, articleList.txt. If the input words are present in the databases, I assume their types can be determined. But how should the lookup begin? For example, for "I sent the book": how do I search the databases for every word, so that "I" is a noun, "sent" a verb, "the" an article, and "book" a noun? Is there a better approach than searching for every word in every database? I doubt that every database contains only unique elements.
My attempt so far is below.
private List<string> ParseInput(String allInput)
{
    List<string> listSentence = new List<string>();
    char[] delimiter = ".?!;".ToCharArray();
    var sentences = allInput.Split(delimiter, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim());
    foreach (var s in sentences)
        listSentence.Add(s);
    return listSentence;
}
private void tenseReviewMenu_Click(object sender, EventArgs e)
{
    string allInput = rtbInput.Text;
    List<string> listWord = new List<string>();
    List<string> listSentence = new List<string>();
    HashSet<string> nounList = new HashSet<string>(getDBList("nounList.txt"));
    HashSet<string> verbList = new HashSet<string>(getDBList("verbList.txt"));
    HashSet<string> adjectiveList = new HashSet<string>(getDBList("adjectiveList.txt"));
    HashSet<string> adverbList = new HashSet<string>(getDBList("adverbList.txt"));
    char[] separator = new char[] { ' ', '\t', '\n', ',' /* etc... */ };
    listSentence = ParseInput(allInput);
    foreach (string sentence in listSentence)
    {
        foreach (string word in sentence.Split(separator))
            if (word.Trim() != "")
                listWord.Add(word);
    }
    string testPOS = "";
    foreach (string word in listWord)
    {
        // The first list that contains the word wins; words in no list are skipped.
        if (nounList.Contains(word.ToLowerInvariant()))
            testPOS += "noun ";
        else if (verbList.Contains(word.ToLowerInvariant()))
            testPOS += "verb ";
        else if (adjectiveList.Contains(word.ToLowerInvariant()))
            testPOS += "adj ";
        else if (adverbList.Contains(word.ToLowerInvariant()))
            testPOS += "adv ";
    }
    tbTest.Text = testPOS;
}
POS tagging is only a secondary topic in my assignment, so I use a simple, database-based approach to determine POS tags. But if there's a simpler approach to POS tagging (easy to use, easy to understand, easy to turn into pseudocode, easy to design), please let me know.

I hope the pseudocode I present below proves helpful to you. If I find time, I'd also write some code for you.
This problem can be tackled by following the steps below:
Create a dictionary of all the common sentence patterns in the English language. For example, Subject + Verb is an English pattern, and sentences like I sleep, Dog barked and Ship will arrive all match the S-V pattern. You can find a list of the most common English patterns here. Please note that for some time you may need to keep revising this dictionary to improve the accuracy of your program.
Try to fit the input sentence in one of the patterns in the dictionary you created above, for example, if the input sentence is Snakes, unlike elephants, are venomous., then your code must be able to find a match with the pattern: Subject, unlike AnotherSubject, Verb Object or S-,unlike-S`-, -V-O. To successfully perform this step, you may need to write code that's good at spotting Structure Markers like the word unlike, in this example sentence.
When you have found a match for your input sentence in your pattern dictionary, you can easily assign a tag to each word in the sentence. For example, in our sentence, the word Snakes would be tagged as a subject, just like the word elephants, the word are would be tagged as a verb and finally the word venomous would be tagged as an object.
Once you have assigned a unique tag to each of the words in your sentence, you can go lookup the word in the appropriate text files that you already have and determine whether or not your sentence is valid.
If your sentence doesn't match any sentence pattern, then you have two options:
a) Add the pattern of this unrecognized sentence in your pattern dictionary if it is a valid English sentence.
b) Or, discard the input sentence as an invalid English sentence.
Things like what you're trying to achieve are best solved using machine learning techniques so that the system can learn any new patterns. So, you may want to include a trainer system that would add a new pattern to your pattern dictionary whenever it finds a valid English sentence not matching any of the existing patterns. I haven't thought much about how this can be done, but for now, you may manually revise your Sentence Pattern dictionary.
I'd be glad to hear your opinion about this pseudocode and would be available to brainstorm it further.
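For the final lookup step, and for the question of whether every word has to be searched in every list, one option is to merge all lists into a single word-to-tags map when the program starts. The sketch below is only an illustration under assumptions: the class name PosLexicon, the tag strings, and the "word/tag" output format are invented here, and the word lists are passed in as arrays rather than read from the .txt files.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PosLexicon
{
    // Merges several (tag -> words) lists into one (word -> tags) map,
    // so each input word needs a single lookup instead of one per list.
    public static Dictionary<string, HashSet<string>> Build(Dictionary<string, string[]> wordsByTag)
    {
        var lexicon = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (var list in wordsByTag)
        {
            foreach (var word in list.Value)
            {
                HashSet<string> tags;
                if (!lexicon.TryGetValue(word, out tags))
                {
                    tags = new HashSet<string>();
                    lexicon[word] = tags;
                }
                tags.Add(list.Key); // a word may appear in more than one list
            }
        }
        return lexicon;
    }

    // Returns one "word/tag1|tag2" entry per token; unknown words get "?".
    public static List<string> Tag(string sentence, Dictionary<string, HashSet<string>> lexicon)
    {
        var tagged = new List<string>();
        var separators = new[] { ' ', '\t', '\n', ',' };
        foreach (var word in sentence.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            HashSet<string> tags;
            string label = lexicon.TryGetValue(word, out tags)
                ? string.Join("|", tags.OrderBy(t => t, StringComparer.Ordinal))
                : "?";
            tagged.Add(word + "/" + label);
        }
        return tagged;
    }
}
```

A word that appears in several lists, such as "book", comes back with all of its candidate tags; choosing between them is then a job for the sentence pattern rather than for the lookup.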

Related

Parsing this special format file

I have a file that is formatted this way --
{2000}000000012199{3100}123456789*{3320}110009558*{3400}9876
54321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX
78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLAS
TX 73920**
Basically, the number in curly brackets denotes field, followed by the value for that field. For example, {2000} is the field for "Amount", and the value for it is 121.99 (implied decimal). {3100} is the field for "AccountNumber" and the value for it is 123456789*.
I am trying to figure out a way to split the file into "records" and each record would contain the record type (the value in the curly brackets) and record value, but I don't see how.
How do I do this without a loop going through each character in the input?

A different way to look at it.... The { character is a record delimiter, and the } character is a field delimiter. You can just use Split().
var input = @"{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var rows = input.Split(new[] { "{" }, StringSplitOptions.RemoveEmptyEntries);
foreach (var row in rows)
{
    var fields = row.Split(new[] { "}" }, StringSplitOptions.RemoveEmptyEntries);
    Console.WriteLine("{0} = {1}", fields[0], fields[1]);
}
Output:
2000 = 000000012199
3100 = 123456789*
3320 = 110009558*
3400 = 987654321*
3600 = CTR
4200 = D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**
5000 = D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**
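As a possible follow-up to the Split() approach, the sketch below also strips the line breaks that appear inside the raw file, tolerates a field with an empty value, and converts the implied-decimal amount. The names RecordParser, Parse and ToAmount are made up for this example, and the assumption that amounts always carry exactly two implied decimal places comes only from the single {2000} example in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class RecordParser
{
    // Splits "{id}value{id}value..." into (id, value) pairs.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var records = new List<KeyValuePair<string, string>>();
        // The file wraps lines in the middle of values, so drop the line breaks first.
        string flat = input.Replace("\r", "").Replace("\n", "");
        foreach (var row in flat.Split(new[] { '{' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int close = row.IndexOf('}');
            if (close < 0) continue; // stray text before the first '{', or a malformed row
            records.Add(new KeyValuePair<string, string>(
                row.Substring(0, close), row.Substring(close + 1)));
        }
        return records;
    }

    // Assumption: the last two digits are the implied decimal places.
    public static decimal ToAmount(string raw)
    {
        return decimal.Parse(raw, NumberStyles.None, CultureInfo.InvariantCulture) / 100m;
    }
}
```

With the sample data, the value of the {2000} record would be passed to ToAmount to obtain the amount as a decimal.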

This regular expression should get you going:
Match a literal {
Match 1 or more digits ("a number")
Match a literal }
Match all characters that are not an opening {
\{\d+\}[^{]+
It assumes that the values themselves cannot contain an opening curly brace. If they can, you need to be more clever, e.g. @"\{\d+\}(?:\\{|[^{])+" (there are likely better ways).
Create a Regex instance and have it match against the text. Each "field" will be a separate match.
var text = @"{123}abc{456}xyz";
var regex = new Regex(@"\{\d+\}[^{]+", RegexOptions.Compiled);
foreach (Match match in regex.Matches(text)) {
    Console.WriteLine(match.Groups[0].Value);
}
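If the field number and the value are needed separately, the same pattern can be given named groups. This is a small sketch rather than a drop-in solution: the class name FieldRegex and the group names id and value are chosen here for illustration, and the value part uses * instead of + so that an empty value still produces a match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldRegex
{
    // {digits} followed by everything up to the next '{'.
    private static readonly Regex Field =
        new Regex(@"\{(?<id>\d+)\}(?<value>[^{]*)", RegexOptions.Compiled);

    public static List<Tuple<string, string>> Extract(string text)
    {
        var fields = new List<Tuple<string, string>>();
        foreach (Match match in Field.Matches(text))
        {
            fields.Add(Tuple.Create(match.Groups["id"].Value, match.Groups["value"].Value));
        }
        return fields;
    }
}
```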

This doesn't fully answer the question, but it was getting too long to be a comment, so I'm leaving it here in Community Wiki mode. It does, at least, present a better strategy that may lead to a solution:
The main thing to understand here is it's rare — like, REALLY rare — to genuinely encounter a whole new kind of a file format for which an existing parser doesn't already exist. Even custom applications with custom file types will still typically build the basic structure of their file around a generic format like JSON or XML, or sometimes an industry-specific format like HL7 or MARC.
The strategy you should follow, then, is to first determine exactly what you're dealing with. Look at the software that generates the file; is there an existing SDK, reference, or package for the format? Or look at the industry surrounding this data; is there a special set of formats related to that industry?
Once you know this, you will almost always find an existing parser ready and waiting, and it's usually as easy as adding a NuGet package. These parsers are genuinely faster, need less code, and will be less susceptible to bugs (because most will have already been found by someone else). It's just an all-around better way to address the issue.
Now what I see in the question isn't something I recognize, so it's just possible you genuinely do have a custom format for which you'll need to write a parser from scratch... but even so, it doesn't seem like we're to that point yet.

Here is how to do it in LINQ, without a slow regex:
string x = "{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var result =
    x.Split('{', StringSplitOptions.RemoveEmptyEntries)
     .Aggregate(new List<Tuple<string, string>>(),
         (l, z) =>
         {
             var az = z.Split('}');
             l.Add(new Tuple<string, string>(az[0], az[1]));
             return l;
         });

RegEx for a Glossary Function

I'm working on a web-based help system that will auto-insert links into the explanatory text, taking users to other topics in help. I have hundreds of terms that should be linked, i.e.
"Manuals and labels" (describes these concepts in general)
"Delete Manuals and Labels" (describes this specific action)
"Learn more about adding manuals and labels" (again, more specific action)
I have a RegEx to find / replace whole words (good ol' \b), which works great, except for linked terms found inside other linked terms. Instead of:
<a href="#">Learn more about manuals and labels</a>
I end up with
<a href="#">Learn more about <a href="#">manuals and labels</a></a>
Which makes everyone cry a little. Changing the order in which the terms are replaced (going shortest to longest) means that I'd get:
Learn more about <a href="#">manuals and labels</a>
Without the outer link I really need.
The further complication is that the capitalization of the search terms can vary, and I need to retain the original capitalization. If I could do something like this, I'd be all set:
Regex _regex = new Regex("\\b" + termToFind + "(|s)" + "\\b", RegexOptions.IgnoreCase);
string resultingText = _regex.Replace(textThatNeedsLinksInserted, "<a>" + "$&".Replace(" ", "_") + "</a>");
And then after all the terms are done, remove the "_", that would be perfect. "Learn_more_about_manuals_and_labels" wouldn't match "manuals and labels," and all is well.
It would be hard to have the help authors delimit the terms that need to be replaced when writing the text -- they're not used to coding. Also, this would limit the flexibility to add new terms later, since we'd have to go back and add delimiters to all the previously written text.
Is there a RegEx that would let me replace whitespace with "_" in the original match? Or is there a different solution that's eluding me?

From your examples with nested links it sounds like you're making individual passes over the terms and performing multiple Regex.Replace calls. Since you're using a regex you should let it do the heavy lifting and put a nice pattern together that makes use of alternation.
In other words, you likely want a pattern like this: \b(term1|term2|termN)\b
var input = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
var terms = new[]
{
    "Learn more about adding manuals and labels",
    "Delete Manuals and Labels",
    "manuals and labels"
};
var pattern = @"\b(" + String.Join("|", terms) + @")\b";
var replacement = @"<a href=""#"">$1</a>";
var result = Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase);
Console.WriteLine(result);
Now, to address the issue of a corresponding href value for each term, you can use a dictionary and change the regex to use a MatchEvaluator that will return the custom format and look up the value from the dictionary. The dictionary also ignores case by passing in StringComparer.OrdinalIgnoreCase. I tweaked the pattern slightly by adding ?: at the start of the group to make it a non-capturing group since I am no longer referring to the captured item as I did in the first example.
var terms = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    { "Learn more about adding manuals and labels", "2.html" },
    { "Delete Manuals and Labels", "3.html" },
    { "manuals and labels", "1.html" }
};
var pattern = @"\b(?:" + String.Join("|", terms.Select(t => t.Key)) + @")\b";
var result = Regex.Replace(input, pattern,
    m => String.Format(@"<a href=""{0}"">{1}</a>", terms[m.Value], m.Value),
    RegexOptions.IgnoreCase);
Console.WriteLine(result);
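Two details may be worth adding once there are hundreds of terms: .NET tries alternatives from left to right, so a term that is a prefix of a longer term should come after it, and terms may contain regex metacharacters. The sketch below sorts the terms longest-first and escapes them with Regex.Escape. The class name GlossaryLinker and its Link method are invented for this illustration, and the anchor markup is an assumption about the desired output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GlossaryLinker
{
    public static string Link(string input, IDictionary<string, string> hrefByTerm)
    {
        // Case-insensitive copy so a match like "delete ..." finds the key "Delete ...".
        var lookup = new Dictionary<string, string>(hrefByTerm, StringComparer.OrdinalIgnoreCase);

        // Longest terms first: the regex engine takes the first alternative that matches.
        var alternatives = lookup.Keys
            .OrderByDescending(term => term.Length)
            .Select(term => Regex.Escape(term));
        var pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

        // A single pass over the text, so replaced text is never scanned again.
        return Regex.Replace(input, pattern,
            m => string.Format("<a href=\"{0}\">{1}</a>", lookup[m.Value], m.Value),
            RegexOptions.IgnoreCase);
    }
}
```

Because the whole text is processed in one Regex.Replace call, a shorter term cannot be linked inside a longer term that has already been matched, and the original capitalization of the matched text is kept.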

I would use an ordered dictionary like this, making sure the smallest term is last:
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
public class Test
{
    public static void Main()
    {
        OrderedDictionary Links = new OrderedDictionary();
        Links.Add("Learn more about adding manuals and labels", "2");
        Links.Add("Delete Manuals and Labels", "3");
        Links.Add("manuals and labels", "1");
        string text = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
        foreach (string termToFind in Links.Keys)
        {
            Regex _regex = new Regex(@"\b" + termToFind + @"s?\b(?![^<>]*</)", RegexOptions.IgnoreCase);
            text = _regex.Replace(text, @"<a href=""" + Links[termToFind] + @""">$&</a>");
        }
        Console.WriteLine(text);
    }
}
The negative lookahead ((?![^<>]*</)) I added prevents replacing text that was already replaced and now sits between anchor tags.

First, you can prevent your regex for manuals and labels from matching inside Learn more about manuals and labels by using a lookbehind. Modified, your regex looks like this:
(?<!Learn more about )(manuals and labels)
But for your specific request I would suggest a different solution. You should define a rule or a priority list for your regexes, or both. A possible rule could be "always search first for the regex that matches the most characters". This, however, requires that your regexes are always fixed length. And it does not prevent one regex from consuming and replacing characters that would have been matched by a different regex (maybe even one of the same size).
Of course, you will need to add an additional lookbehind and lookahead to each of your regexes to prevent replacing strings that are inside your replacement elements.

Efficient and fast way to parse a string with different languages

I have a string something like (generated via Google Transliterate REST call, and transliterated into 2 languages):
" This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड
থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious
अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস "
Now Google Transliterate REST call allows FIVE words at a time, so I had to loop, add it to the list and then concatenate the string. That's why we see that each CHUNK (of each language) is of 5 words. The total number of words is 7 words, so first 5 (This world is beautiful and) lies before rest 2 (amazingly mysterious) later.
How do I most efficiently parse the sentence such that I get something like:
This world is beautiful and amazingly mysterious थिस वर्ल्ड इस बेऔतिफुल एंड अमज़िन्ग्ली म्य्स्तेरिऔस থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ আমাজিন্গ্লি ম্য্স্তেরীয়ুস
Since the length of the sentence and the number of languages it is converted into can both be dynamic, maybe keeping a list for each language and concatenating them later would work?
I used an approach where I transliterated each word, one at a time, it works well, but too slow as it increases the number of calls to the API.
Can someone help me with an efficient (and dynamic) implementation of such a scenario? Thanks a bunch!

One list per language is the way to go.
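A rough sketch of that idea, working on the already-concatenated string: each word is assigned to a list according to the Unicode block of its first letter, and the lists are joined at the end in order of first appearance. The names ScriptGrouper, ScriptOf and Regroup are invented for this example, and only the Latin, Devanagari (U+0900 to U+097F) and Bengali (U+0980 to U+09FF) ranges are handled; other target languages would need further ranges.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ScriptGrouper
{
    // Classifies a word by the Unicode block of its first recognizable character.
    private static string ScriptOf(string word)
    {
        foreach (char c in word)
        {
            if (c >= '\u0900' && c <= '\u097F') return "Devanagari";
            if (c >= '\u0980' && c <= '\u09FF') return "Bengali";
            if (c <= '\u024F' && char.IsLetter(c)) return "Latin";
        }
        return "Other"; // digits, punctuation, or a script not listed above
    }

    // Keeps one word list per script and joins them in order of first appearance.
    public static string Regroup(string mixed)
    {
        var order = new List<string>();
        var wordsByScript = new Dictionary<string, List<string>>();
        foreach (var word in mixed.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string script = ScriptOf(word);
            List<string> words;
            if (!wordsByScript.TryGetValue(script, out words))
            {
                words = new List<string>();
                wordsByScript[script] = words;
                order.Add(script);
            }
            words.Add(word);
        }
        return string.Join(" ", order.Select(s => string.Join(" ", wordsByScript[s])));
    }
}
```

If the calling code still has each five-word chunk and its language at hand, appending to one list per language at that point avoids the script detection entirely.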

If by "different languages" you mean different character code ranges, you can use the answer here:
Regular expression Spanish and Arabic words

Pay for google translate's API and then your length restriction goes up to 5,000 characters per request https://developers.google.com/translate/v2/faq
Also, yes, as Daniel has said - grouping the text by language will be necessary

I have tried a workaround; correct me if I have misinterpreted your question.
string statement = "This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস ";
string otherLangStmt = statement;
MatchCollection matchCollection = Regex.Matches(statement, "([a-zA-Z]+)");
string result = "";
foreach (Match match in matchCollection)
{
    if (match.Groups.Count > 0)
    {
        // Collect the English word and remove it from the remaining text.
        result += match.Groups[0].Value + " ";
        otherLangStmt = otherLangStmt.Replace(match.Groups[0].Value, string.Empty);
    }
}
// Collapse the runs of whitespace left behind by the removed words.
otherLangStmt = Regex.Replace(otherLangStmt.Trim(), "\\s+", " ");
Console.WriteLine(result);
Console.WriteLine(otherLangStmt);

How to implement a simple String search

I want to implement a simple search in my application, based on search query I have.
Let's say I have an array containing 2 paragraphs or articles and I want to search in these articles for related subject or related keywords I enter.
For example:
//this is my search query
string mySearchQuery = "how to play with matches";
//these are my articles
string[] myarticles = new string[] {"article 1: this article will teach newbies how to start fire by playing with the awesome matches..", "article 2: this article doesn't contain anything"};
How can I get the first article based on the search query I provided above? Any idea?

This would return any string in myarticles that contains all of the words in mySearchQuery:
var tokens = mySearchQuery.Split(' ');
var matches = myarticles.Where(m => tokens.All(t => m.Contains(t)));
foreach (var match in matches)
{
    // do whatever you wish with them here
}

I'm sure you can find a nice framework for string search, because it's a wide subject with many search rules.
But for this simple sample, try splitting the search query on " ", and for each word do a simple string search. If you find it, add 1 point to that paragraph's match score. At the end, return the paragraph with the most points.
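The point-scoring idea could look roughly like the sketch below. The class and method names (ArticleSearch, BestMatch) are invented for this example, matching is a plain case-insensitive substring test, and ties go to the earlier article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ArticleSearch
{
    // Returns the article containing the most query words, or null if none matches.
    public static string BestMatch(string query, IEnumerable<string> articles)
    {
        var tokens = query
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();

        string best = null;
        int bestScore = 0;
        foreach (var article in articles)
        {
            // One point per query word found anywhere in the article.
            int score = tokens.Count(
                t => article.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);
            if (score > bestScore)
            {
                bestScore = score;
                best = article;
            }
        }
        return best;
    }
}
```

Because this is a substring test, "play" also scores against "playing"; a whole-word comparison would need the articles to be tokenized as well.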

parsing words in a continuous string

If I have a string with words and no spaces, how should I parse those words, given that I have a dictionary/list that contains those words?
For example, if my string is "thisisastringwithwords" how could I use a dictionary to create an output "this is a string with words"?
I hear that using the data structure Tries could help but maybe if someone could help with the pseudo code? For example, I was thinking that maybe you could index the dictionary into a trie structure, then follow each char down the trie; problem is, I'm unfamiliar with how to do this in (pseudo)code.

I'm assuming that you want an efficient solution, not the obvious one where you repeatedly check if your text starts with a dictionary word.
If the dictionary is small enough, I think you could try and modify the standard KMP algorithm. Basically, build a finite-state machine on your dictionary which consumes the text character by character and yields the constructed words.
EDIT: It appeared that I was reinventing tries.

I already did something similar. You cannot use a simple dictionary; the result will be messy. It also depends on whether you only have to do this once or as a whole program.
My solution was to:
Connect to a database with working words from a dictionary list (for example, an online dictionary).
Filter long and short words in the dictionary and check whether you want to trim anything (for example, don't use words with only one character, like 'I').
Start with short words and compare your bigString with the database dictionary.
Now you need to create a "table of possibility", because a lot of words can fit 100% but still be wrong. The longer the word, the more sure you can be that it is the right one.
It is CPU intensive, but it can give precise results.
So let's say you are using a small dictionary of 10,000 words and 3,000 of them have a length of 8 characters: you need to compare your bigString at the start with all 3,000 words, and only if a result was found is it allowed to proceed to the next word. If you have 2000 characters in your bigString, you need about (2000 chars / 8 average chars) = 250 full loops minimum with comparison.
For me, I also added a small check for misspelled words to the comparison.
Example of the procedure (don't copy and paste):
Dim bigString As String = "helloworld.thisisastackoverflowtest!"
Dim dictionary As New List(Of String) 'contains the original words. let's make it case insensitive
dictionary.Add("Hello")
dictionary.Add("World")
dictionary.Add("this")
dictionary.Add("is")
dictionary.Add("a")
dictionary.Add("stack")
dictionary.Add("over")
dictionary.Add("flow")
dictionary.Add("stackoverflow")
dictionary.Add("test")
dictionary.Add("!")
For Each word As String In dictionary
If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real)
word = word.ToLower 'make it case insensitive
Next
Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result
Dim i As Integer = 0 'start at the beginning
Dim Found As Boolean = False
Do
For Each word In dictionary
If bigString.IndexOf(word, i) > 0 Then
ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value
Found = True
End If
Next
If Found = True Then
i += ResultComparer(BestWordWithBestWeight).Length
Else
i += 1
End If
Loop

I told you that it seems like an impossible task. But you can have a look at this related SO question - it may help you.

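If you are sure you have all the words of the phrase in the dictionary, you can use this algorithm:
String phrase = "thisisastringwithwords";
String fullPhrase = "";
Set<String> myDictionary; // must be filled with the dictionary words first
boolean matched;
do {
    matched = false;
    for (String item : myDictionary) {
        if (phrase.startsWith(item)) {
            fullPhrase += item + " ";
            phrase = phrase.substring(item.length()); // drop the matched prefix
            matched = true;
            break;
        }
    }
} while (matched && !phrase.isEmpty()); // stop when nothing matches, to avoid looping forever
There are many complications, such as several items starting with the same characters, so the code would need to be changed to use some kind of tree search, a BST or similar.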

This is the exact problem one has when trying to programmatically parse languages like Chinese where there are no spaces between words. One method that works with those languages is to start by splitting text on punctuation. This gives you phrases. Next you iterate over the phrases and try to break them into words starting with the length of the longest word in your dictionary. Let's say that length is 13 characters. Take the first 13 characters from the phrase and see if it is in your dictionary. If so, take it as a correct word for now, move forward in the phrase and repeat. Otherwise, shorten your substring to 12 characters, then 11 characters, etc.
This works extremely well, but not perfectly because we've accidentally put in a bias towards words that come first. One way to remove this bias and double check your result is to repeat the process starting at the end of the phrase. If you get the same word breaks you can probably call it good. If not, you have an overlapping word segment. For example, when you parse your sample phrase starting at the end you might get (backwards for emphasis)
words with string a Isis th
At first, the word Isis (Egyptian Goddess) appears to be the correct word. When you find that "th" is not in your dictionary, however, you know there is a word segmentation problem nearby. Resolve this by going with the forward segmentation result "this is" for the non-aligned sequence "thisis" since both words are in the dictionary.
A less common variant of this problem is when adjacent words share a sequence which could go either way. If you had a sequence like "archand" (to make something up), should it be "arc hand" or "arch and"? The way to determine is to apply a grammar checker to the results. This should be done to the whole text anyway.
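The forward and backward passes described above might be sketched as follows. This is only an illustration: the class name MaxMatch is invented, maxLen is assumed to be the length of the longest dictionary word, the punctuation split is left out, and a character that starts no known word is emitted on its own so that the scan always advances.

```csharp
using System;
using System.Collections.Generic;

public static class MaxMatch
{
    // Scans left to right, always taking the longest dictionary word available.
    public static List<string> Forward(string text, HashSet<string> dict, int maxLen)
    {
        var words = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int len = Math.Min(maxLen, text.Length - pos);
            while (len > 1 && !dict.Contains(text.Substring(pos, len))) len--;
            words.Add(text.Substring(pos, len)); // len == 1 may be an unknown character
            pos += len;
        }
        return words;
    }

    // Same idea, scanning from the end of the text towards the start.
    public static List<string> Backward(string text, HashSet<string> dict, int maxLen)
    {
        var words = new List<string>();
        int end = text.Length;
        while (end > 0)
        {
            int len = Math.Min(maxLen, end);
            while (len > 1 && !dict.Contains(text.Substring(end - len, len))) len--;
            words.Add(text.Substring(end - len, len));
            end -= len;
        }
        words.Reverse();
        return words;
    }
}
```

Comparing the two lists shows where the passes disagree; those stretches are the ones that need a second look or the grammar check mentioned above.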

OK, I will make a hand-wavy attempt at this. The perfect(ish) data structure for your problem is (as you've said) a trie made up of the words in the dictionary. A trie is best visualised as a DFA, a nice state machine where you go from one state to the next on every new character. This is really easy to do in code; a Java(ish) style class for this would be:
class State
{
    String matchedWord;
    Map<Character, State> mapChildren;
}
From here on, building the trie is easy. It's like having a rooted tree structure with each node having multiple children. Each child is visited on one character transition. Using a HashMap kind of structure cuts down the time needed to look up character-to-next-State mappings. Alternatively, if all you have are the 26 characters of the alphabet, a fixed-size array of 26 would do the trick as well.
Now, assuming all of that made sense and you have a trie, your problem still isn't fully solved. This is where you start doing what regular expression engines do: walk down the trie, keep track of states which match a whole word in the dictionary (that's what I had the matchedWord for in the State structure), and use some backtracking logic to jump to a previous match state if the current trail hits a dead end. I know it's general, but given the trie structure, the rest is fairly straightforward.
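To make the walk-and-backtrack idea a little more concrete, here is one possible C# sketch. The names TrieNode, TrieSegmenter, Build and Segment are invented for this example. It tries the longest word first at each position and remembers positions that are known dead ends; it returns the first segmentation it finds, which is not necessarily the intended one.

```csharp
using System;
using System.Collections.Generic;

public sealed class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public string MatchedWord; // non-null when a dictionary word ends at this node
}

public static class TrieSegmenter
{
    public static TrieNode Build(IEnumerable<string> words)
    {
        var root = new TrieNode();
        foreach (var word in words)
        {
            var node = root;
            foreach (char c in word)
            {
                TrieNode next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new TrieNode();
                    node.Children[c] = next;
                }
                node = next;
            }
            node.MatchedWord = word;
        }
        return root;
    }

    // Returns one segmentation of text into dictionary words, or null if none exists.
    public static List<string> Segment(string text, TrieNode root)
    {
        var result = new List<string>();
        return Walk(text, 0, root, result, new HashSet<int>()) ? result : null;
    }

    private static bool Walk(string text, int start, TrieNode root,
                             List<string> result, HashSet<int> deadEnds)
    {
        if (start == text.Length) return true;
        if (deadEnds.Contains(start)) return false;

        // Walk down the trie, collecting every word that begins at 'start'.
        var candidates = new List<string>();
        var node = root;
        for (int i = start; i < text.Length && node.Children.TryGetValue(text[i], out node); i++)
        {
            if (node.MatchedWord != null) candidates.Add(node.MatchedWord);
        }

        // Try the longest candidate first; backtrack if the rest cannot be segmented.
        for (int c = candidates.Count - 1; c >= 0; c--)
        {
            result.Add(candidates[c]);
            if (Walk(text, start + candidates[c].Length, root, result, deadEnds)) return true;
            result.RemoveAt(result.Count - 1);
        }
        deadEnds.Add(start);
        return false;
    }
}
```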

If you have a dictionary of words and need a quick implementation, this can be solved efficiently with dynamic programming in O(n^2) time, assuming the dictionary lookups are O(1). Below is some C# code; the substring extraction and the dictionary lookup could be improved.
public static String[] StringToWords(String str, HashSet<string> words)
{
    // bps[j] = start index of a dictionary word that ends just before position j, or -1
    int[] bps = new int[str.Length + 1];
    for (int i = 0; i < bps.Length; i++)
        bps[i] = -1;
    for (int i = 0; i < str.Length; i++)
    {
        for (int j = i + 1; j <= str.Length; j++)
        {
            if (bps[j] == -1)
            {
                // Destination cell doesn't have a valid backpointer yet
                // Try with the current substring
                String s = str.Substring(i, j - i);
                if (words.Contains(s))
                    bps[j] = i;
            }
        }
    }
    // Backtrack to recover the sequence and then reverse it
    List<String> seg = new List<string>();
    for (int bp = str.Length; bps[bp] != -1; bp = bps[bp])
        seg.Add(str.Substring(bps[bp], bp - bps[bp]));
    seg.Reverse();
    return seg.ToArray();
}
Building a HashSet with the word list from /usr/share/dict/words and testing with
foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict))
    Console.WriteLine(s);
I get the output "t hi sis a string with words". As others have pointed out, this algorithm will return a valid segmentation (if one exists), but it may not be the segmentation you expect. The presence of short words reduces the segmentation quality; you might be able to add a heuristic that favours longer words when two valid sub-segmentations reach the same element.
There are more sophisticated methods, using finite state machines and language models, that can generate multiple segmentations and apply probabilistic ranking.
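One simple heuristic along those lines is to let the dynamic program minimise the number of words, which tends to favour longer words. The sketch below is a variation written for illustration, not the code above: the class name FewestWordsSegmenter is invented, and it only extends prefixes that are themselves segmentable, returning null when no full segmentation exists.

```csharp
using System;
using System.Collections.Generic;

public static class FewestWordsSegmenter
{
    public static string[] Segment(string str, HashSet<string> words)
    {
        int n = str.Length;
        var count = new int[n + 1]; // fewest words covering str[0..j), int.MaxValue if impossible
        var back = new int[n + 1];  // start index of the last word in that best cover
        for (int j = 1; j <= n; j++) count[j] = int.MaxValue;

        for (int j = 1; j <= n; j++)
        {
            for (int i = 0; i < j; i++)
            {
                // Only extend prefixes that can themselves be segmented.
                if (count[i] != int.MaxValue && count[i] + 1 < count[j]
                    && words.Contains(str.Substring(i, j - i)))
                {
                    count[j] = count[i] + 1;
                    back[j] = i;
                }
            }
        }
        if (count[n] == int.MaxValue) return null;

        // Follow the backpointers from the end, then reverse.
        var seg = new List<string>();
        for (int j = n; j > 0; j = back[j])
            seg.Add(str.Substring(back[j], j - back[j]));
        seg.Reverse();
        return seg.ToArray();
    }
}
```

Fewest words is still only a heuristic; a ranking based on word frequencies would be the next step towards the probabilistic methods mentioned above.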
