Sentence generator using Thesaurus - C#

I am creating an application in .NET.
I found a working application, http://www.spinnerchief.com/. It does what I need, but I did not get any help from Google on how to build something similar.
I need this functionality in my application: the user enters one sentence and gets back the same sentence worded differently.
Here is an example of what I want.
Suppose I enter the sentence "Pankaj is a good man." The output should be similar to the following:
Pankaj is a great person.
Pankaj is a superb man.
Pankaj is an acceptable guy.
Pankaj is a wonderful dude.
Pankaj is a superb male.
Pankaj is a good human.
Pankaj is a splendid gentleman.

To do this correctly for any arbitrary sentence you would need to perform natural language analysis of the source sentence. You may want to look into the SharpNLP library - it's a free library of natural language processing tools for C#/.NET.
If you're looking for a simpler approach, you have to be willing to sacrifice correctness to some degree. For instance, you could create a dictionary of trigger words, which - when they appear in a sentence - are replaced with synonyms from a thesaurus. The problem with this approach is making sure that you replace a word with an equivalent part of speech. In English, it's possible for certain words to be different parts of speech (verb, adjective, adverb, etc) based on their contextual usage in a sentence.
An additional consideration you'll need to address (if you're not using an NLP library) is stemming. In most languages, certain parts of speech are conjugated/modified (verbs in English) based on the subject they apply to (or the object, speaker, or tense of the sentence).
If all you want to do is replace adjectives (as in your example), the approach of using trigger words may work - but it won't be readily extensible. Before you do anything, I would suggest that you clearly define the requirements and rules for your problem domain ... and use that to decide which route to take.
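A minimal sketch of that trigger-word idea, assuming a tiny hand-built synonym dictionary (the words and alternatives below are purely illustrative):

using System;
using System.Collections.Generic;

class TriggerWordRewriter
{
    // Hand-built thesaurus keyed by exact word form; illustrative only.
    static readonly Dictionary<string, string[]> Synonyms =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "good", new[] { "great", "superb", "wonderful", "splendid" } },
            { "man",  new[] { "person", "guy", "gentleman" } }
        };

    // Produce one variant per synonym of each trigger word found in the sentence.
    static IEnumerable<string> Variants(string sentence)
    {
        string[] words = sentence.Split(' ');
        for (int i = 0; i < words.Length; i++)
        {
            string bare = words[i].TrimEnd('.');
            if (!Synonyms.TryGetValue(bare, out string[] alternatives)) continue;
            foreach (string alt in alternatives)
            {
                string[] copy = (string[])words.Clone();
                copy[i] = words[i].EndsWith(".") ? alt + "." : alt;
                yield return string.Join(" ", copy);
            }
        }
    }

    static void Main()
    {
        foreach (string s in Variants("Pankaj is a good man."))
            Console.WriteLine(s);
    }
}

Note that this does nothing about parts of speech or stemming - it is exactly the naive approach whose limits are described above.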

For this, the best thing to use is WordNet and its hyponym/hypernym relations. There is a WordNet .Net library. For each word you want to alternate, you can get either its hypernym (e.g. for "person", a hypernym answers "person is a kind of ...") or a hyponym ("X is a kind of person"). Then just replace the word you are alternating.
You will want to make sure you have the correct part of speech (noun, adjective, verb, ...), and there is also the issue of word senses, which may introduce some undesired alternations (sense #1 is usually the most common).
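A rough sketch of that replacement step, assuming you have wrapped the WordNet .Net library behind a lookup of your own (GetRelatedNouns below is a hypothetical stand-in, not part of any library; a real version would query WordNet for hypernyms/hyponyms of the most common noun sense):

using System;
using System.Collections.Generic;
using System.Linq;

class WordNetRewriter
{
    // Hypothetical wrapper around the WordNet .Net library: given a noun,
    // return hypernyms/hyponyms for its most common sense. The body here is
    // a hard-coded stand-in for demonstration only.
    static IEnumerable<string> GetRelatedNouns(string noun) =>
        noun == "man" ? new[] { "person", "male", "human" } : Enumerable.Empty<string>();

    // Produce one variant of the sentence per related noun.
    static IEnumerable<string> Alternate(string sentence, string target) =>
        GetRelatedNouns(target).Select(r => sentence.Replace(target, r));

    static void Main()
    {
        foreach (string s in Alternate("Pankaj is a good man.", "man"))
            Console.WriteLine(s);
    }
}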

I don't know anything about .NET, but you should look into using a dictionary function (I'm sure there is one, or at least a library that streamlines the process if there isn't).
Then you'd have to go through the string and omit words like "is" or "a", only taking the words you want synonyms for.
After this, it's pretty simple to have a loop spit out your sentences.
Good luck.

Related

Implement Language Auto-Completion based on ANTLR4 Grammar

I am wondering whether there are any examples (I haven't found any by googling) of TAB auto-completion for a command-line interface (console) that use ANTLR4 grammars to predict the next term (as in a REPL).
I've written a PL/SQL grammar for an open-source database, and now I would like to implement a command-line interface to the database that lets the user complete statements according to the grammar, or discover the proper database object name to use (e.g. a table name, a trigger name, the name of a column, etc.).
Thanks for pointing me in the right direction.
Actually it is possible! (Depending, of course, on the complexity of your grammar.) The problem with auto-completion and ANTLR is that you do not have a complete expression, yet you want to parse it. If you had a complete expression, it wouldn't be a big problem to know what kind of element sits at each place and what can be used there. But you do not have a complete expression, and you cannot parse an incomplete one. So what you need to do is wrap the input in some wrapper/helper that completes the expression to create a parseable one. Note that nothing added just to complete the expression matters to you - you will only ask for members up to the last character the user actually typed.
So:
A) Create a wrapper that changes this (Excel formula) '=If(' into '=If()' (see the sketch after this list)
B) Parse the wrapped input
C) Realize that you are in the IF function at the first parameter
D) Return all that can go into that place.
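As a crude illustration of step A only, here is a sketch that just balances brackets so the fragment becomes parseable; a real wrapper needs to know much more about the grammar, and the function name is invented for this example:

using System;
using System.Collections.Generic;

class ExpressionWrapper
{
    // Append whatever closing brackets are still open so the fragment the user
    // has typed so far becomes syntactically complete, e.g. "=If(" -> "=If()".
    static string CompleteBrackets(string partialInput)
    {
        var open = new Stack<char>();
        foreach (char c in partialInput)
        {
            if (c == '(' || c == '[') open.Push(c);
            else if ((c == ')' && open.Count > 0 && open.Peek() == '(') ||
                     (c == ']' && open.Count > 0 && open.Peek() == '[')) open.Pop();
        }

        var completed = partialInput;
        while (open.Count > 0)
            completed += open.Pop() == '(' ? ')' : ']';
        return completed;
    }

    static void Main()
    {
        Console.WriteLine(CompleteBrackets("=If("));          // =If()
        Console.WriteLine(CompleteBrackets("=Sum(A1, If("));  // =Sum(A1, If())
    }
}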
It actually works; I have built intellisense editors for several simple languages this way. There is much more infrastructure involved than this, but the basic idea is as written above. Just be careful: writing the wrapper is hard, if not impossible, if the grammar is really complex. In that case look at the Papa Carlo project. http://lakhin.com/projects/papa-carlo/
As already mentioned, auto-completion is based on the follow set at a given position, simply because that is what the grammar defines as valid language. But that's only a small part of the task. What you need is context (as Sam Harwell wrote: it's a semantic process, not a syntactic one), and that information is independent of the parser. And since a parser is made to parse valid input (and during auto-completion you have invalid input most of the time), it's not the right tool for this task.
Knowing which token can follow at a given position is useful for controlling the overall process (e.g. you don't want to show suggestions if only a string can appear), but it is most of the time not what you actually want to suggest (except for keywords). If an ID is possible at the current position, that doesn't tell you which IDs are actually allowed (a variable name? a namespace? etc.). So what you need is essentially 3 things:
1. A symbol table that provides all possible names, organized by scope. Building this depends heavily on the parsed language, but it is a task where a parser is very helpful. You may want to cache this info, as running this analysis step is time-consuming. (A rough sketch of such a table follows after this list.)
2. Determine which scope you are in when auto-completion is invoked. You could use a parser here as well (maybe in conjunction with step 1).
3. Determine what type of symbol(s) you want to show. Many people think this is where a parser can give you all the necessary information (the follow set), but as mentioned above that's not true (keywords aside).
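To make step 1 a bit more concrete, here is a minimal sketch of a scoped symbol table; all names and the shape of the API are invented for illustration, and a real one would be populated by walking the parse tree:

using System.Collections.Generic;

// Symbols grouped by scope; lookup walks outward so inner declarations shadow
// outer ones - the kind of structure step 1 asks for.
enum SymbolKind { Table, Column, Variable, Keyword }

class Scope
{
    public Scope Parent;
    public readonly Dictionary<string, SymbolKind> Symbols = new Dictionary<string, SymbolKind>();
}

class SymbolTable
{
    private Scope current = new Scope();

    public void EnterScope() => current = new Scope { Parent = current };
    public void LeaveScope() => current = current.Parent ?? current;
    public void Declare(string name, SymbolKind kind) => current.Symbols[name] = kind;

    // All visible symbols of a given kind, innermost scope first - exactly what a
    // completion list for "a table name is expected here" would be built from.
    public IEnumerable<string> Visible(SymbolKind kind)
    {
        for (Scope s = current; s != null; s = s.Parent)
            foreach (var entry in s.Symbols)
                if (entry.Value == kind)
                    yield return entry.Key;
    }
}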
In my blog post Universal Code Completion using ANTLR3 I especially addressed the third step. There I don't use a parser but simulate one, except that I don't stop where a parser would stop, but where the caret position is reached (so it is essential that the input is valid syntax up to that point). After reaching the caret, the collection process starts, which not only collects terminal nodes (for keywords) but also looks at the rule names to learn what else needs to be collected. Using specific rule names is my way of putting context into the grammar: when the collection code finds a rule table_ref, it knows it doesn't need to go further down the rule chain (to the ultimate ID token), but can instead use this information to provide a list of tables as suggestions.
With ANTLR4 things might become even simpler. I haven't used it myself yet, but the parser interpreter could be a big help here, as it essentially does what I do manually in my implementation (with the ANTLR3 backend).
This is probably pretty hard to do.
Fundamentally you want to use some parser to predict "what comes next" to display as auto-completion. This has to at least predict what the FIRST token is at the point where the user's input stops.
For ANTLR, I think this will be very difficult. The reason is that ANTLR generates essentially procedural, recursive descent parsers. So at runtime, when you need to figure out what FIRST tokens are, you have to inspect the procedural source code of the generated parser. That way lies madness.
This blog entry claims to achieve autocompletion by collecting error reports rather than inspecting the parser code. It's sort of an interesting idea, but I do not understand how the method really works, and I cannot see how it would offer all possible FIRST tokens; it might acquire some of them. This SO answer confirms my intuition.
Sam Harwell discusses how he has tackled this; he is one of the ANTLR4 implementers and if anybody can make this work, he can. It wouldn't surprise me if he reached inside ANTLR to extract the information he needs; as an ANTLR implementer he would certainly know where to tap in. You are not likely to be so well positioned. Even so, he doesn't really describe what he did in detail. Good luck replicating. You might ask him what he really did.
What you want is a parsing engine for which that FIRST token information is either directly available (the parser generator could produce it) or computable based on the parser state. This is actually possible to do with bottom up parsers such as LALR(k); you can build an algorithm that walks the state tables and computes this information. (We do this with our DMS Software Reengineering Toolkit for its GLR parser precisely to produce syntax error reports that say "missing token, could be any of these [set]")

Culture Independent IsVowel

I need to create a function that will tell me if a character is a vowel or a consonant but I need it to be culture independent. In other words, using a string with "aeiou" isn't good enough because some languages use other vowels such as those with accents. Do I have to compile a list of all unicode characters that could be vowels or is there an easier way to do this?
I don't think this is possible. Very few languages have a one-to-one match between characters and sounds to begin with. Take iota - some will pronounce the first i as a vowel, others as a consonant.
The phonetic alphabet is supposed to help with this. See for instance:
http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
You would have to use the phonetic alphabet as an intermediary, and take the vowels from there. Then, however, you still have the problem of translating words into that phonetic alphabet. Some online dictionaries may be able to help you with that, but even then the same word will likely appear multiple times, sometimes with different pronunciations, and I don't know whether there are any that can be hooked up through a web service, or whether there are any offline options.
http://www.photransedit.com/online/text2phonetics.aspx (example with horrible full-screen ads)
This problem borders on the complexity of translation software, where you would really need some understanding of the context to understand which word you even need to look up and in what database.
So depending on your requirements, you may want to start as simply as you can, but take the above into account. To allow your application to gain precision later on, you could start by making a function that returns the IPA vowels, and then build a lookup table that maps letters and letter combinations to them. Later on you can look into getting or creating better data.
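As a very rough sketch of that starting point (the mappings are illustrative and far from complete; a real table would be built from charts like the one linked just below):

using System.Collections.Generic;
using System.Linq;

static class VowelLookup
{
    // A handful of IPA vowel symbols (monophthongs); a real list would be longer.
    static readonly HashSet<char> IpaVowels = new HashSet<char>("iɪeɛæaɑɒɔoʊuʌəɜ");

    // Illustrative English letter/letter-combination to IPA mappings.
    // These entries are examples only, not a complete or authoritative table.
    static readonly Dictionary<string, string> LetterToIpa = new Dictionary<string, string>
    {
        { "ee", "iː" }, { "ea", "iː" }, { "oo", "uː" },
        { "a", "æ" }, { "e", "ɛ" }, { "i", "ɪ" }, { "o", "ɒ" }, { "u", "ʌ" }
    };

    // True if the letter (or letter combination) maps to an IPA vowel sound.
    public static bool IsLikelyVowel(string letters)
    {
        return LetterToIpa.TryGetValue(letters.ToLowerInvariant(), out string ipa)
            && ipa.Any(IpaVowels.Contains);
    }
}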
You can use charts like these as input:
http://www.antimoon.com/how/pronunc-soundsipa.htm
Many language training books also have an overview. I've always liked the 'Teach Yourself ... ' series, as they always have an overview of the sounds of a language.

C# library for complex queries against in-memory strings

I am looking for something that will take a complex search string and allow me to test it against some text to determine whether the text meets the search criteria.
I would like to support query syntax similar to google/twitter (i.e. support for: and, or, not, exact string, wildcards, etc) and would also like it to handle plurals of words (maybe synonyms if I could have my cake and eat it). I guess what I want is the analysis and query aspects of a search engine without building and maintaining an index.
I really would like to avoid developing this, and thought that it seems like it might be a fairly common requirement. But I have been unable to identify anything in the .net world that specifically meets my needs.
I thought I might be able to use elements of Lucene.net to do this, but have no experience with it. So I would like to know if anybody out there has any ideas that might help or if they have done this before (and what they used). Would be happy to consider non-.NET solutions if integration is possible.
Any input is much appreciated.
Regards
Allen
Regex is exactly your solution.
The only things you mentioned that it doesn't support are synonyms and plurals, obviously, because those are language dependent. But I guess you can easily get a list of synonyms, or of irregular English plurals, or something like that, and then write your own Regex builder for those (really easy - see the sketch at the end of this answer).
Regex is short for Regular Expressions, a well-known engine that exists in most languages' libraries.
A nice site you can learn Regex from is http://www.regular-expressions.info/.
In .NET, all the Regex-related classes are in System.Text.RegularExpressions. You can figure out quite easily how to use them (or just google "C# regex").
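A small sketch of that Regex-builder idea: expand each query term into an alternation of its known forms (the expansion list here is hand-written and purely illustrative) and test the text with word boundaries:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class QueryMatcher
{
    // Hand-written expansions; a real application would load these from a
    // thesaurus or a pluralisation routine.
    static readonly Dictionary<string, string[]> Expansions =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "car", new[] { "car", "cars", "automobile", "automobiles" } }
        };

    // Build a pattern such as \b(?:car|cars|automobile|automobiles)\b for one term.
    static string TermPattern(string term)
    {
        string[] forms = Expansions.TryGetValue(term, out var f) ? f : new[] { term };
        return @"\b(?:" + string.Join("|", forms.Select(Regex.Escape)) + @")\b";
    }

    // "AND" semantics: every term must appear somewhere in the text.
    static bool MatchesAll(string text, params string[] terms)
    {
        return terms.All(t => Regex.IsMatch(text, TermPattern(t), RegexOptions.IgnoreCase));
    }

    static void Main()
    {
        Console.WriteLine(MatchesAll("Two automobiles were parked outside.", "car")); // True
    }
}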

C++ scanner (string-fu!)

I'm writing a scanner as part of a compiler.
I'm having a major headache trying to write this one portion:
I need to be able to parse a stream of tokens and push them one by one into a vector, ignoring whitespace and tokenizing special symbols (simple case, let's just consider parentheses and braces).
Example:
int main(){ }
should parse into 6 different tokens:
int
main
(
)
{
}
How would you go about solving this? I'm writing this in C++, but a Java/C# solution would be appreciated as well.
Some points:
- No, I can't use Boost; I can't guarantee that the libraries will be available to me. (Don't ask...)
- I don't want to use lex or any other special tools. I've never done this before and just want to try it once to say I've done it.
Stroustrup's book, The C++ Programming Language, has a great example in it about building a lexer/parser for a simple calculator program. It should serve as a good starting point to learn how to do what you want.
Buy a copy of Compilers: Principles, Techniques, and Tools (the Dragon Book). What you're attempting to write is a lexer, not a "scanner".
Why write your own - look at Lex.
If you must have your own, you just read the input character by character and maintain some minimal state to accumulate identifiers.
The problem itself is not hard. If you can't solve it, you must be burned out, you just need a rest. Look at it again in the morning.
If you really want to learn something from this exercise, just start coding. It doesn't demand a lot of code, so you can fail repeatedly without blowing more than an afternoon.
At this point you'll have a good feel for the problem.
Then look in any random compilers book to see what the "usual" methods are, and you'll grok them immediately.
Umm... I'd just do a while loop with iterators, testing each character's type. On an alpha-to-non-alpha change, emit the accumulated string if it's non-empty; if the character is non-alpha and non-whitespace, push it onto the token stack on its own. This is really a trivial parsing task. Shoot, I've been meaning to learn lex/yacc, but the level of parsing you want is really easy. I wrote an HTML tokenizer once which is more complicated than this... I mean, you are just looking for names, whitespace, and single non-alphanumeric characters. Just do it.
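That loop, written out as a rough sketch (in C#, since the question says a Java/C# solution is fine), handling only identifiers, whitespace, and single symbol characters:

using System;
using System.Collections.Generic;
using System.Text;

class Scanner
{
    // Splits input into identifier/keyword tokens and single-character symbols,
    // skipping whitespace - just enough for "int main(){ }".
    static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (char c in input)
        {
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                current.Append(c);          // still inside an identifier
            }
            else
            {
                if (current.Length > 0)     // alpha -> non-alpha boundary: emit it
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                }
                if (!char.IsWhiteSpace(c))  // '(', ')', '{', '}', ...
                    tokens.Add(c.ToString());
            }
        }
        if (current.Length > 0)
            tokens.Add(current.ToString());
        return tokens;
    }

    static void Main()
    {
        foreach (string t in Tokenize("int main(){ }"))
            Console.WriteLine(t);           // int, main, (, ), {, }
    }
}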
If you want to write this from scratch, you could look into writing a finite state machine (states in an enum, a big switch/case block for state switching). You'd have to push the state to a stack since everything can be nested.
I know that this is not the ideal method; I'm just trying to directly address the question.

A "regex for words" (semantic replacement) - any example syntax and libraries?

I'm looking for syntactic examples or common techniques for doing regular-expression-style transformations on words instead of characters, given a procedural language.
For example, to trace copying, one would want to create a document with similar meaning but with different word choices.
I'd like to be able to concisely define these possible transformations that I can apply to a text stream.
Eg. "fast noun" to "rapid noun", but "go fast." wouldn't get transformed (no noun afterwards.
Or: "Alice will sing song" to "song will be sung by Alice"
I'd expect this to be done in grammatical checkers, such as detecting passive voice.
A C# implementation for this sort of language processing would be really neat, but I think the bulk of any effort is coming up with the right rules. Keeping the rules clear and understandable seems like the place to begin.
You could try Jason Rennie's WordNet::QueryData Perl module (WordNet-QueryData-1.47).
One good place to start researching would be WordNet - it's a dictionary of semantics, grouping words together by similar meaning and also recording the relationships between words in useful ways.
There are a bunch of software projects leveraging the WordNet corpus; one of them may be what you need.
If you aren't tied to a particular language, Haskell has Aarne Ranta's Grammatical Framework:
http://www.grammaticalframework.org/
which is explicitly designed to generate parsers, etc for natural language processing of this sort.
A good place to start would be SIL's CARLAStudio for its "Computer Assisted Related Language Adaptation" suite. Alternatively SIL's Adapt It. SIL has a huge range of linguistic analysis software, which is the direction you appear to be going. It's certainly a big jump from regular expressions, which don't care about the meaning, to something that can handle linguistic analysis.
If you want something more robust for natural language parsing/transforming, you could try the C# port of OpenNLP.
I am not aware of any existing syntax for English-language processing like the one you describe. You would need to create your own DSL on top of one of the toolsets (such as WordNet) out there.
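As a very rough sketch of what applying such a rule could look like: sentences become (word, part-of-speech tag) pairs, and a rule like "fast NOUN -> rapid NOUN" only fires when the tag pattern matches. TagSentence and the tiny lexicon below are hypothetical stand-ins for a real tagger (e.g. one from an NLP library):

using System;
using System.Collections.Generic;
using System.Linq;

// A sketch of "regex over words": rules match on tags, not characters.
class Token
{
    public string Word;
    public string Tag;   // "NOUN", "VERB", "ADJ", ...
}

class WordRules
{
    // Toy lexicon standing in for a real part-of-speech tagger.
    static readonly Dictionary<string, string> Lexicon =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "fast", "ADJ" }, { "car", "NOUN" }, { "go", "VERB" }
        };

    static List<Token> TagSentence(string sentence) =>
        sentence.Split(' ')
                .Select(w => new Token { Word = w, Tag = Lexicon.TryGetValue(w, out var t) ? t : "OTHER" })
                .ToList();

    // Replace "fast" only when the next token is tagged as a noun.
    static string ApplyFastToRapid(string sentence)
    {
        var tokens = TagSentence(sentence);
        for (int i = 0; i < tokens.Count - 1; i++)
            if (tokens[i].Word.Equals("fast", StringComparison.OrdinalIgnoreCase) && tokens[i + 1].Tag == "NOUN")
                tokens[i].Word = "rapid";
        return string.Join(" ", tokens.Select(t => t.Word));
    }

    static void Main()
    {
        Console.WriteLine(ApplyFastToRapid("a fast car"));  // a rapid car
        Console.WriteLine(ApplyFastToRapid("go fast"));     // go fast (unchanged)
    }
}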
