Natural language query processing - C#

I have an NLP (natural language processing) application running that gives me a tree of the parsed sentence. The question is how I should proceed from there.
What is the time
\-SBAR - Subordinate clause
  |-WHNP - Wh-noun phrase
  | \-WP - Wh-pronoun
  |   \-What
  \-S - Simple declarative clause
    \-VP - Verb phrase
      |-VBZ - Verb, 3rd person singular present
      | \-is
      \-NP - Noun phrase
        |-DT - Determiner
        | \-the
        \-NN - Noun, singular or mass
          \-time
The application has a built-in JavaScript interpreter, and I was trying to turn the phrase into a simple function such as
function getReply() {
    return Resource.Time();
}
In basic terms: "what" is the request, so it creates the function; "is" would be the returned object; and "the time" would reference the time. It would be easy to write a simple parser for just that, but then there are also phrasings like "what is the time now" or "do you know what time it is". I need something that can be developed further based on the English language as the project grows.
The source is C# on .NET 4.5.
Thanks in advance.

As far as I can see, dependency parse trees will be more helpful. The number of ways a question is usually asked is limited (I mean that the statistically significant variations are limited; there will probably be corner cases that people do not ordinarily use), and questions are expressed through words like who, what, when, where, why and how.
Dependency parsing will enable you to extract the nominal subject and the direct as well as indirect objects in a query. Typically, these will express the basic intent of the query. Consider the example of two equivalent queries:
What is the time?
Do you know what the time is?
Their dependency parse structures are as follows:
root(ROOT-0, What-1)
cop(What-1, is-2)
det(time-4, the-3)
nsubj(What-1, time-4)
and
aux(know-3, Do-1)
nsubj(know-3, you-2)
root(ROOT-0, know-3)
dobj(is-7, what-4)
det(time-6, the-5)
nsubj(is-7, time-6)
ccomp(know-3, is-7)
Both are what-queries, and both contain "time" as a nominal subject. The latter also contains "you" as a nominal subject, but I think expressions like "do you know", "can you please tell me", etc. can be removed based on heuristics.
You will find the Stanford Parser helpful for this approach. They also have this online demo, if you want to see some more examples at work.
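To make the idea concrete, here is a minimal sketch in Python (the technique matters more than the language). It assumes the dependencies arrive as text lines in exactly the format shown above; the names `query_intent`, `WH_WORDS` and `FILLER_SUBJECTS` are made up for this illustration and are not part of any parser's API.

```python
import re

# Matches lines such as "nsubj(What-1, time-4)".
DEP_RE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")
WH_WORDS = {"who", "what", "when", "where", "why", "how"}
# Heuristic (assumption): subjects of polite wrappers like "do you know" are ignored.
FILLER_SUBJECTS = {"you", "i", "we"}

def parse_deps(lines):
    """Turn textual dependencies into (relation, governor, dependent) triples."""
    deps = []
    for line in lines:
        m = DEP_RE.fullmatch(line.strip())
        if m:
            rel, gov, _, dep, _ = m.groups()
            deps.append((rel, gov.lower(), dep.lower()))
    return deps

def query_intent(lines):
    """Return (wh_word, nominal_subject), or None where one is missing."""
    deps = parse_deps(lines)
    words = {w for _, gov, dep in deps for w in (gov, dep)}
    wh = sorted(words & WH_WORDS)
    subjects = [dep for rel, _, dep in deps
                if rel == "nsubj" and dep not in FILLER_SUBJECTS]
    return (wh[0] if wh else None, subjects[0] if subjects else None)

simple = ["root(ROOT-0, What-1)", "cop(What-1, is-2)",
          "det(time-4, the-3)", "nsubj(What-1, time-4)"]
polite = ["aux(know-3, Do-1)", "nsubj(know-3, you-2)", "root(ROOT-0, know-3)",
          "dobj(is-7, what-4)", "det(time-6, the-5)", "nsubj(is-7, time-6)",
          "ccomp(know-3, is-7)"]
print(query_intent(simple))  # ('what', 'time')
print(query_intent(polite))  # ('what', 'time')
```

Both phrasings reduce to the same pair, which you could then map to something like Resource.Time() in your interpreter.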

Related

Fuzzy Entity Recognition

I am new to NLP. What I am trying to do (in C#) is, given a list of custom entities along the lines of
> NAME|ENTITY TYPE|ID
> Cubbies|Baseball Team|CHI
> Chicago Cubs|Baseball Team|CHI
> Dubs|Basketball Team|GSW
> Golden State Warriors|Basketball Team|GSW
I am looking to take short sentences and tag fuzzy matches of these entities.
For example, parse
Jordan Bell is going to make Golden St. much better next year
into
Jordan Bell is going to make [Basketball Team|GSW] much better next year
Ideally this would be in conjunction with generalized name recognition, e.g.:
[Person:Jordan Bell] is going to make [Basketball Team:GSW] much better [Time:next year]
Grateful for any help or direction. Thanks!

It's probably best to think of your problem in two parts: role labelling (Named Entity Recognition) and label unification (fuzzy matching).
For determining labels - that is, marking tokens in the sentences as team name, person, and so on - a Conditional Random Field (CRF) is a good model. CRF++ is a popular toolkit. The New York Times used CRF++ with some success on recipe data a few years back.
Since you're identifying the names of sports teams, you have two options for dealing with the fuzzy matching you described. You can do actual fuzzy matching using string similarity; this article explains how that was done in the Python library FuzzyWuzzy, at a high enough level that it should be easy to re-implement.
Your other option is Named Entity Resolution, which is tying named entities (your labelled bits) to an external database. When you do this with Wikipedia it's called "Wikification", for example. This article describes someone using Wikipedia redirect information to recognize alternate names for companies - you could do the same thing by checking that Wikipedia redirects Cubbies to Chicago Cubs (it does).
Without knowing your data, it's hard to say whether fuzzy matching or Named Entity Resolution would be easier, so it's probably best to give them both a shot.
Sorry for not including resources explicitly for C# - that said, the techniques here are usually more important than the implementations.
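For the string-similarity option, a rough sketch might look like the following. It uses Python's standard `difflib.SequenceMatcher` rather than FuzzyWuzzy itself, the entity rows are the ones from the question, and `best_match` with its 0.8 threshold is an arbitrary choice made for illustration only.

```python
from difflib import SequenceMatcher

# Entity table copied from the question: (name, entity type, id).
ENTITIES = [
    ("Cubbies", "Baseball Team", "CHI"),
    ("Chicago Cubs", "Baseball Team", "CHI"),
    ("Dubs", "Basketball Team", "GSW"),
    ("Golden State Warriors", "Basketball Team", "GSW"),
]

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(mention, threshold=0.8):
    """Return (entity type, id) of the closest known name, or None."""
    score, entity = max(
        (similarity(mention, name), (etype, eid))
        for name, etype, eid in ENTITIES
    )
    return entity if score >= threshold else None

print(best_match("Golden State Warrior"))  # ('Basketball Team', 'GSW')
print(best_match("Golden St."))            # None
```

Note that a heavy abbreviation like "Golden St." scores only around 0.58 against "Golden State Warriors" with this measure, so it falls below the threshold. Cases like that are probably better handled by adding alias rows (or by the redirect lookup described above) than by lowering the threshold.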

Speech to text in c#

I have a C# program that listens to my microphone; when I speak, it executes commands and talks back. For example, when I say "What's the weather tomorrow?", it replies with tomorrow's weather.
The only problem is that I have to type out every phrase I want to say and have it pre-recorded. So if I want to ask for the weather, I HAVE to say it exactly as I coded it, with no variations. Is there code that can change this?
I want to be able to say "What's the weather for tomorrow", "what's tomorrow's weather" or "can you tell me tomorrow's weather" and have it tell me the next day's weather, but I don't want to have to type each phrase into the code. I saw something about e.Result.Alternates; is that what I need to use?

This cannot be done without involving linguistic resources. Let me explain what I mean by this.
As you may have noticed, your C# program only recognizes pre-recorded phrases, and only if you say the exact same words. (As an aside, this is quite an achievement in itself, because you can hardly say a sentence twice without altering it a bit. Small changes, e.g. in sound frequency or length, might not be relevant to your colleagues, but they matter to your program.)
Therefore, you need to incorporate some kind of linguistic resource in your program. In other words, make it "understand" facts about human language. Two suggestions of increasing complexity follow below. Both approaches assume that your tool is capable of tokenizing an audio input stream in a sensible way, i.e. of extracting words from it.
Pattern matching
To avoid hard-coding the sentences like
Tell me about the weather.
What's the weather tomorrow?
Weather report!
you can instead define a pattern that matches any of those sentences:
if a sentence contains "weather", then output a weather report
This can be further refined in many ways, e.g.:
if a sentence contains "weather" and "tomorrow", output tomorrow's forecast.
if a sentence contains "weather" and "Bristol", output a forecast for Bristol
This kind of knowledge must be put into your program explicitly, for instance in the form of a dictionary or lookup table.
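As a minimal sketch of such a lookup table (in Python for brevity; the same shape works as a C# list of keyword sets), assuming the recognizer already hands you the sentence as text. The intent names and the `match_intent` function are invented for this example.

```python
import re

# Rules are checked in order, so the more specific ones come first.
RULES = [
    ({"weather", "tomorrow"}, "forecast_tomorrow"),
    ({"weather", "bristol"}, "forecast_bristol"),
    ({"weather"}, "weather_report"),
]

def match_intent(sentence):
    """Return the first intent whose keywords all occur in the sentence."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    for keywords, intent in RULES:
        if keywords <= words:  # every keyword is present
            return intent
    return None

print(match_intent("What's the weather tomorrow?"))  # forecast_tomorrow
print(match_intent("Weather report!"))               # weather_report
print(match_intent("What time is it?"))              # None
```

A phrase like "tomorrows weather" would only hit the generic rule here, because "tomorrows" is not the same token as "tomorrow"; some normalization of word forms would be needed to catch it.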
Measuring Similarity
If you plan to spend more time on this, you could implement a means for finding the similarity between input sentences. There are many approaches to this as well, but a prominent one is a bag of words, represented as a vector.
In this model, each sentence is represented as a vector, with each word as one dimension of the vector. For example, the sentence "I hate green apples" could be represented as
I = 1
hate = 1
green = 1
apples = 1
red = 0
you = 0
Note that the words that do not occur in this particular sentence, but in other phrases the program is likely to encounter, also represent dimensions (for example the red = 0).
The big advantage of this approach is that the similarity of vectors can be computed easily, no matter how many dimensions they have. There are several techniques for estimating similarity; one of them is cosine similarity (see for example http://en.wikipedia.org/wiki/Cosine_similarity).
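A small sketch of the bag-of-words comparison, in Python and using only the standard library. The tokenization (lowercase, letters and apostrophes only) is a simplifying assumption, and the function names are chosen for this example.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    """Count each word; words that do not occur implicitly have count 0."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

green = bag_of_words("I hate green apples")
red = bag_of_words("I hate red apples")
print(cosine_similarity(green, red))  # 0.75
```

The two sentences share three of their four words, hence 3 / (2 * 2) = 0.75. To pick a command, you would compare the spoken sentence against one stored example per command and take the highest score.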
On a more general note, there are many other considerations to be made of course.
For example, some words might be utterly irrelevant to the message you want to convey, as in the following sentence:
I want you to output a weather report.
Here, at least "I", "you", "to" and "a" could be done away with without damaging the basic semantics of the sentence. Such words are called stop words and are discarded early in many tools that perform speech-to-text analysis.
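Filtering stop words is a one-liner once you have a list. In this sketch the list contains only the four words named above; a real list would be much longer, and `remove_stop_words` is just an illustrative name.

```python
import re

# Only the four words mentioned above; real stop-word lists are far longer.
STOP_WORDS = {"i", "you", "to", "a"}

def remove_stop_words(sentence):
    """Lowercase, split into words, and drop the stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

print(remove_stop_words("I want you to output a weather report."))
# ['want', 'output', 'weather', 'report']
```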
Also note that we started out assuming that your program reliably identifies sound input. In reality, no tool is capable of infallibly identifying speech.
Humans tend to forget that sound actually exists without cues as to where word or sentence boundaries are. This makes so-called disambiguation of input a gargantuan task that is easily underestimated - and ambiguity one of the hardest problems of computational linguistics in general.

The code won't be able to work that out by itself. You need to split the command into an array of words, such as
Tomorrow
Weather
What
This way, you can compare each word with the vocabulary your program knows: say, the command (what), the type (weather) and the time (tomorrow).
It is better to read and interpret each word than to guess at the whole phrase. Google does something similar: it breaks the string down and compares the parts.

How do I determine if two similar band names represent the same band?

I'm currently working on a project that requires me to match our database of Bands and venues with a number of external services.
Basically I'm looking for some direction on the best method for determining if two names are the same. For Example:
Our database venue name - "The Pig and Whistle"
service 1 - "Pig and Whistle"
service 2 - "The Pig & Whistle"
etc etc
I think the main differences are going to be things like missing "the" or using "&" instead of "and" but there could also be things like slightly different spelling and words in different orders.
What algorithms/techniques are commonly used in this situation, do I need to filter noise words or do some sort of spell check type match?
Have you seen any examples of something similar in C#?
UPDATE: In case anyone is interested in a C# example, there are plenty you can find by doing a Google code search for Levenshtein distance.

The canonical (and probably the easiest) way to do this is to measure the Levenshtein distance between the two strings. If the distance is small relative to the size of the string, it's probably the same string. Note that if you have to compare a lot of very small strings it'll be harder to tell whether they're the same or not. It works better with longer strings.
A smarter approach might be to compare the Levenshtein distance between the two strings but to assign a distance of zero to the more obvious transformations, like "and"/"&", "Snoop Doggy Dogg"/"Snoop", etc.
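Here is a rough sketch of both ideas in Python: a plain Levenshtein distance, plus a normalization step that makes the obvious transformations free by rewriting them before comparing. The specific rules (lowercasing, "&" to "and", dropping a leading "the") and the 0.2 cut-off are assumptions picked for the venue example, not established values.

```python
import re

def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalize(name):
    """Apply the 'free' transformations before measuring distance."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"^\s*the\s+", "", name)
    return " ".join(name.split())

def same_name(a, b, max_ratio=0.2):
    """Treat names as equal if the distance is small relative to their length."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b)) or 1
    return levenshtein(a, b) / longest <= max_ratio

print(levenshtein("Pig and Whistle", "Pig & Whistle"))     # 3
print(same_name("The Pig and Whistle", "Pig & Whistle"))   # True
print(same_name("The Pig and Whistle", "The Dog and Duck"))  # False
```

Be aware that a purely distance-based check will also accept near-identical names that belong to different entities, so the threshold needs tuning against your own data.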

I did something like this a while ago. I used the Discogs database (which is public domain), which also tracks artist aliases.
You can either:
Use an API call (namevariations field).
Download the monthly data dumps (*_artists.xml.gz) & import it in your database. This contains the same data, but is obviously a lot faster.
One advantage of this over the Levenshtein distance solution is that you'll get far fewer false matches.
For example, Ryan Adams and Bryan Adams have a score of 2, which is quite good (lower means a better match; Pig and Whistle and Pig & Whistle have a score of 3), yet they're obviously different people.
While you could make a smarter algorithm (which also looks at string length, for example), using the alias DB is a lot simpler & less error-prone; after implementing this, I could completely remove the solution that was suggested in the other answer & had better matches.

Soundex may also be useful.

In bioinformatics we use this to compare DNA or protein sequences all the time.
There are plenty of algorithms; you probably want to look at global alignments.
In this respect the Needleman-Wunsch algorithm is probably what you seek.
If you have particularly long recurring strings to compare you might also want to consider heuristic searches like BLAST.
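For a feel of what a global alignment computes, here is a minimal Needleman-Wunsch score in Python. The scoring values (match +1, mismatch -1, gap -1) are a common textbook choice and an assumption here; for band names you would want to tune them, and this sketch returns only the score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (higher is more similar)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per character.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap]
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ca == cb else mismatch)
            current.append(max(diagonal,               # align ca with cb
                               previous[j] + gap,      # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current
    return previous[-1]

print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("GATTACA", "GCATGCG"))  # 0
```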

What's the best way to write a parser by hand? [closed]

We've used ANTLR to create a parser for a SQL-like grammar, and while the results are satisfactory in most cases, there are a few edge cases that we need to fix; and since we didn't write the parser ourselves we don't really understand it well enough to be able to make sensible changes.
So, we'd like to write our own parser. What's the best way to go about writing a parser by hand? What sort of parser should we use - recursive descent has been recommended; is that right? We'll be writing it in C#, so any tutorials for writing parsers in that language would be gratefully received.
UPDATE: I'd also be interested in answers that involve F# - I've been looking for a reason to use that in a project.

I would highly recommend the F# language as your language of choice for parsing on the .NET Platform. Its roots in the ML family of languages mean it has excellent support for language-oriented programming.
Discriminated unions and pattern-matching allow for a very succinct and powerful specification of your AST. Higher-order functions allow for definition of parse operations and their composition. First-class support for monadic types allows for state management to be handled implicitly, greatly simplifying the composition of parsers. Powerful type-inference greatly aids the definition of these (complex) types. And all of this can be specified and executed interactively, allowing you to rapidly prototype.
Stephan Tolksdorf has put this into practice with his parser combinator library FParsec
From his examples we see how naturally an AST is specified:
type expr =
    | Val of string
    | Int of int
    | Float of float
    | Decr of expr
type stmt =
    | Assign of string * expr
    | While of expr * stmt
    | Seq of stmt list
    | IfThen of expr * stmt
    | IfThenElse of expr * stmt * stmt
    | Print of expr
type prog = Prog of stmt list
The implementation of the parser (partially elided) is just as succinct:
let stmt, stmtRef = createParserForwardedToRef()
let stmtList = sepBy1 stmt (ch ';')
let assign =
    pipe2 id (str ":=" >>. expr) (fun id e -> Assign(id, e))
let print = str "print" >>. expr |>> Print
let pwhile =
    pipe2 (str "while" >>. expr) (str "do" >>. stmt) (fun e s -> While(e, s))
let seq =
    str "begin" >>. stmtList .>> str "end" |>> Seq
let ifthen =
    pipe3 (str "if" >>. expr) (str "then" >>. stmt) (opt (str "else" >>. stmt))
          (fun e s1 optS2 ->
               match optS2 with
               | None -> IfThen(e, s1)
               | Some s2 -> IfThenElse(e, s1, s2))
do stmtRef := choice [ifthen; pwhile; seq; print; assign]
let prog =
    ws >>. stmtList .>> eof |>> Prog
On the second line, as an example, stmt and ch are parsers and sepBy1 is a monadic parser combinator that takes two simple parsers and returns a combination parser. In this case sepBy1 p sep returns a parser that parses one or more occurrences of p separated by sep. You can thus see how quickly a powerful parser can be combined from simple parsers. F#'s support for custom operators also allows for concise infix notation, e.g. the sequencing combinator and the choice combinator can be specified as >>. and <|>.
Best of luck,
Danny

The only kind of parser that can be handwritten by a sane human being is a recursive-descent parser. That said, writing a bottom-up parser by hand is still possible, but very undesirable.
If you're up for an RD parser, you have to verify that the SQL grammar is not left-recursive (and eliminate the recursion if necessary), and then basically write a function for each grammar rule. See this for further reference.

Adding my voice to the chorus in favor of recursive-descent (LL1). They are simple, fast, and IMO, not at all hard to maintain.
However, take a good look at your language to make sure it is LL1. If you have any syntax like C has, like ((((type))foo)[ ]) where you might have to descend multiple layers of parentheses before you even find out if you are looking at a type, variable, or expression, then LL1 will be very difficult, and bottom-up wins.

Recursive descent will give you the simplest way to go, but I would have to agree with mouviciel that flex and bison are definitely worth learning. When you find out you have a mistake in your grammar, fixing a definition of the language in flex/bison will be a hell of a lot easier than rewriting your recursive descent code.
FYI the C# parser is written as a recursive-descent parser and it tends to be quite robust.

Recursive descent parsers are indeed the best, maybe the only, parsers that can be built by hand. You will still have to bone up on what exactly a formal, context-free language is and put your language in a normal form. I would personally suggest that you remove left-recursion and put your language in Greibach Normal Form. When you do that, the parser just about writes itself.
For example, this production:
A => aC
A => bD
A => eF
becomes something simple like:
void A() {
    chr = read();
    switch (chr) {
        case 'a': C(); break;   /* A => aC */
        case 'b': D(); break;   /* A => bD */
        case 'e': F(); break;   /* A => eF */
        default: throw_syntax_error('A', chr);
    }
}
And there aren't any much more difficult cases here (what's more difficult is making sure your grammar is in exactly the right form, but this gives you the control that you mentioned).
Anton's link also seems excellent.

If you want to write it by hand, recursive descent is the most sensible way to go.
You could use a table parser, but that will be extremely hard to maintain.
Example:
Data = Object | Value;
Value = Ident, '=', Literal;
Object = '{', DataList, '}';
DataList = Data | DataList, Data;
ParseData {
    if PeekToken = '{' then
        return ParseObject;
    if PeekToken = Ident then
        return ParseValue;
    return Error;
}
ParseValue {
    ident = TokenValue;
    if NextToken <> '=' then
        return Error;
    if NextToken <> Literal then
        return Error;
    return(ident, TokenValue);
}
ParseObject {
    AssertToken('{');
    temp = ParseDataList;
    AssertToken('}');
    return temp;
}
ParseDataList {
    data = ParseData;
    temp = []
    while Ok(data) {
        temp = temp + data;
        data = ParseData;
    }
    return temp;
}
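For anyone who wants to run it, here is one possible translation of the pseudocode above into Python. Several details are my own assumptions: literals are restricted to integers, tokens are separated by optional whitespace, and errors are raised as SyntaxError instead of returned. Note how the left-recursive DataList rule turns into a loop.

```python
import re

TOKEN_RE = re.compile(
    r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<literal>\d+)|(?P<punct>[{}=]))")

def tokenize(text):
    """Split the input into (kind, text) pairs."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("eof", "")

    def expect(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1] or tok[0]!r}")
        self.pos += 1
        return tok[1]

    def starts_data(self):
        return self.peek() == ("punct", "{") or self.peek()[0] == "ident"

    def parse_data(self):            # Data = Object | Value
        if self.peek() == ("punct", "{"):
            return self.parse_object()
        if self.peek()[0] == "ident":
            return self.parse_value()
        raise SyntaxError(f"unexpected token {self.peek()[1] or 'eof'!r}")

    def parse_value(self):           # Value = Ident, '=', Literal
        name = self.expect("ident")
        self.expect("punct", "=")
        return (name, int(self.expect("literal")))

    def parse_object(self):          # Object = '{', DataList, '}'
        self.expect("punct", "{")
        items = self.parse_data_list()
        self.expect("punct", "}")
        return items

    def parse_data_list(self):       # DataList = Data, { Data } (recursion removed)
        items = [self.parse_data()]
        while self.starts_data():
            items.append(self.parse_data())
        return items

def parse(text):
    parser = Parser(tokenize(text))
    result = parser.parse_data()
    parser.expect("eof")             # reject trailing input
    return result

print(parse("{ a = 1 { b = 2 } c = 3 }"))  # [('a', 1), [('b', 2)], ('c', 3)]
```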

I suggest that you don't write the lexer by hand - use flex or similar. The task of recognising tokens is not that hard to do by hand, but I don't think you'd gain much.
As others have said, recursive descent parsers are easiest to write by hand. Otherwise you have to maintain the table of state transitions for each token, which isn't really human-readable.
I'm pretty sure ANTLR implements a recursive descent parser anyway: there's a mention of it in an interview about ANTLR 3.0.
I've also found a series of blog posts about writing a parser in C#. It seems quite gentle.

At the risk of offending the OP, writing a parser by hand for a large language like some specific vendor's SQL, when good parser generator tools (such as ANTLR) are available, is simply crazy. You'll spend far more time rewriting your parser by hand than you will fixing the "edge cases" using the parser generator, and you'll invariably have to go back and revise the parser anyway as the SQL standards move or you discover you misunderstood something else. If you don't understand your parsing technology well enough, invest the time to understand it. It won't take you months to figure out how to deal with the edge cases with the parser generator, and you've already admitted you are willing to spend months doing it by hand.
Having said that, if you are hell-bent on redoing it manually, using a recursive descent parser is the best way to do it by hand.
(NOTE: OP clarified below that he wants a reference implementation. IMHO you should NEVER code a reference implementation of a grammar procedurally, so recursive descent is out.)

There is no "one best" way. Depending on your needs you may want bottom-up (LALR(1)) or recursive descent (LL(k)). Articles such as this one give personal reasons for preferring LALR(1) (bottom-up) to LL(k). However, each type of parser has its benefits and drawbacks. Generally LALR will be faster, as a finite-state machine is generated as a lookup table.
To pick what's right for you examine your situation; familiarize yourself with the tools and technologies. Starting with some LALR and LL Wikipedia articles isn't a bad choice. In both cases you should ALWAYS start with specifying the grammar in BNF or EBNF. I prefer EBNF for its succinctness.
Once you've gotten your mind wrapped around what you want to do, and how to represent it as a grammar, (BNF or EBNF) try a couple of different tools and run them across representative samples of text to be parsed.
Anecdotally, I've heard that LL(k) is more flexible, though I've never bothered to find out for myself. From my few parser-building experiences I have noticed that, regardless of whether it's LALR or LL(k), the best way to pick what suits your needs is to start by writing the grammar. I've written my own C++ EBNF RD parser builder template library, used Lex/YACC, and coded a small R-D parser. This was spread over the better part of 15 years, and I spent no more than 2 months on the longest of the three projects.

In C/Unix, the traditional way is to use lex and yacc. With GNU, the equivalent tools are flex and bison. I don't know for Windows/C#.

If I were you I would have another go at ANTLRv3 using the GUI ANTLRWorks, which gives you a very convenient way of testing your grammar. We use ANTLR in our project, and although the learning curve may be a bit steep in the beginning, once you learn it, it is quite convenient. Also, on their email newsletter there are a lot of people who are very helpful.
PS. IIRC they also have a SQL-grammar you could take a look at.
hth

Well if you don't mind using another compiler compiler tool like ANTLR I suggest you take a look at Coco/R
I've used it in the past and it was pretty good...

The reason you don't want a table-driven parser is that you will not be able to create sensible error-messages. That is ok for a generated language, but not one where humans are involved. The error messages produced by c-like language compilers provide ample evidence that people can adapt to anything, no matter how bad.

I would also vote for using an existing parser + lexer.
The only reason I can see in doing it by hand:
if you want something relatively simple (like validating/parsing some input)
to learn/understand the principles.

Look at gplex and gppg, lexer and parser generators for .NET. They work well, and are based on the same (and almost compatible) input as lex and yacc, which is relatively easy to use.

JFlex is a flex implementation for Java, and there is now a C# port of that project http://sourceforge.net/projects/csflex/. There also appears to be a C# port of CUP in progress, which can be found here: http://sourceforge.net/projects/datagraph/
I too would recommend avoiding hand crafting your own solution. I tried this once myself for a very simple language (part of a university project) and it was incredibly time consuming and difficult. It is also exceedingly hard to maintain and change once written.
Using an existing parser generator is the way to go, as the bulk of the hard work has been done and has been well tested over the years.

C#: How to parse arbitrary strings into expression trees?

In a project that I'm working on I have to work with a rather weird data source. I can give it a "query" and it will return me a DataTable. But the query is not a traditional string. It's more like... a set of method calls that define the criteria that I want. Something along these lines:
var tbl = MySource.GetObject("TheTable");
tbl.AddFilterRow(new FilterRow("Column1", 123, FilterRow.Expression.Equals));
tbl.AddFilterRow(new FilterRow("Column2", 456, FilterRow.Expression.LessThan));
var result = tbl.GetDataTable();
In essence, it supports all the standard stuff (boolean operators, parentheses, a few functions, etc.) but the syntax for writing it is quite verbose and uncomfortable for everyday use.
I wanted to make a little parser that would parse a given expression (like "Column1 = 123 AND Column2 < 456") and convert it to the above function calls. Also, it would be nice if I could add parameters there, so I would be protected against injection attacks. The last little piece of sugar on the top would be if it could cache the parse results and reuse them when the same query is to be re-executed on another object.
So I was wondering - are there any existing solutions that I could use for this, or will I have to roll out my own expression parser? It's not too complicated, but if I can save myself two or three days of coding and a heapload of bugs to fix, it would be worth it.

Try out Irony. Though the documentation is lacking, the samples will get you up and running very quickly. Irony is a project for parsing code and building abstract syntax trees, but you might have to write a little logic to create a form that suits your needs. The DLR may be the complement for this, since it can dynamically generate / execute code from abstract syntax trees (it's used for IronPython and IronRuby). The two should make a good pair.
Oh, and they're both first-class .NET solutions and open source.

Bison or JavaCC or the like will generate a parser from a grammar. You can then augment the nodes of the tree with your own code to transform the expression.
OP comments:
I really don't want to ship 3rd party executables with my soft. I want it to be compiled in my code.
Both tools generate source code, which you link with.

I wrote a parser for exactly this usage and complexity level by hand. It took about 2 days. I'm glad I did it, but I wouldn't do it again. I'd use ANTLR or F#'s Fslex.
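To give an idea of the size of such a hand-written parser, here is a heavily simplified sketch in Python (the structure ports directly to C#). It only handles "column op value" comparisons joined by AND, supports just the two operators shown in the question, and uses an invented `@name` syntax for parameters. The output tuples stand in for the FilterRow calls; none of these function names come from the original data source.

```python
import re
from functools import lru_cache

# Only the two operators that appear in the question are mapped.
OPERATORS = {"=": "Equals", "<": "LessThan"}

TOKEN_RE = re.compile(
    r"\s*(?:(?P<name>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<param>@\w+)|(?P<op>[=<]))")

def _tokenize(expression):
    tokens, pos, expression = [], 0, expression.strip()
    while pos < len(expression):
        m = TOKEN_RE.match(expression, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

@lru_cache(maxsize=128)  # parse each distinct query text only once
def parse_filter(expression):
    """Parse 'Col = 1 AND Col2 < @p' into a tuple of (column, op, operand)."""
    tokens, rows, pos = _tokenize(expression), [], 0
    while True:
        triple = tokens[pos:pos + 3]
        kinds = [kind for kind, _ in triple]
        if (len(triple) != 3 or kinds[:2] != ["name", "op"]
                or kinds[2] not in ("number", "param")):
            raise ValueError(f"expected 'column op value' at token {pos}")
        (_, column), (_, op), (kind, text) = triple
        operand = ("const", int(text)) if kind == "number" else ("param", text[1:])
        rows.append((column, OPERATORS[op], operand))
        pos += 3
        if pos == len(tokens):
            return tuple(rows)
        if tokens[pos] != ("name", "AND"):
            raise ValueError(f"expected AND at token {pos}")
        pos += 1

def bind(rows, params):
    """Substitute parameter values; they never pass through the parser."""
    return [(col, op, val if kind == "const" else params[val])
            for col, op, (kind, val) in rows]

rows = parse_filter("Column1 = @id AND Column2 < 456")
print(bind(rows, {"id": 123}))
# [('Column1', 'Equals', 123), ('Column2', 'LessThan', 456)]
```

Because parameter values are substituted after parsing, they cannot change the structure of the filter, which covers the injection concern. The cached parse result is immutable, so it can be re-bound with different parameters and applied to another table object.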
