I came across the term 'quotation' and I'm trying to figure out some real-life examples of how it is used. The ability to have an AST for each code expression sounds awesome, but how is it used in real life?
Does anyone know of such examples?
F# and Nemerle quotations are both used for metaprogramming, but the approaches are different: Nemerle uses quotations at compile time to extend the language, while F# uses them at run time.
Nemerle
In Nemerle, quotations are used within macros to take apart pieces of code and generate new ones. Much of the language itself is implemented this way. For example, here is the macro from the official library that implements the when conditional construct. Nemerle does not have statements, so an if has to have an else part; the when and unless macros provide shorthand for an if with an empty else part and an empty then part, respectively. The when macro also has extended pattern-matching functionality.
macro whenmacro (cond, body)
syntax ("when", "(", cond, ")", body)
{
  match (cond)
  {
    | <[ $subCond is $pattern ]> with guard = null
    | <[ $subCond is $pattern when $guard ]> =>
      match (pattern)
      {
        | PT.PExpr.Call when guard != null =>
          // generate expression to replace 'when (expr is call when guard) body'
          <[ match ($subCond) { | $pattern when $guard => $body : void | _ => () } ]>
        | PT.PExpr.Call =>
          // generate expression to replace 'when (expr is call) body'
          <[ match ($subCond) { | $pattern => $body : void | _ => () } ]>
        | _ =>
          // generate expression to replace 'when (expr is pattern) body'
          <[ match ($cond) { | true => $body : void | _ => () } ]>
      }
    | _ =>
      // generate expression to replace 'when (cond) body'
      <[ match ($cond : bool) { | true => $body : void | _ => () } ]>
  }
}
The code uses quotations to recognize patterns that match some predefined templates and to replace them with the corresponding match expressions. For example, matching the cond expression given to the macro against:
<[ $subCond is $pattern when $guard ]>
checks whether it follows the x is y when z pattern and gives us the expressions composing it. If the match succeeds, we can generate a new expression from the parts we got using:
<[
  match ($subCond)
  {
    | $pattern when $guard => $body : void
    | _ => ()
  }
]>
This converts when (x is y when z) body into a basic pattern-matching expression. All of this is automatically type-safe and produces reasonable compilation errors when used incorrectly. So, as you can see, quotations provide a very convenient and type-safe way of manipulating code.
Well, anytime you want to manipulate code programmatically, or do some metaprogramming, quotations make it more declarative, which is a good thing.
I've written two posts about how this makes life easier in Nemerle: here and here.
For real life examples, it's interesting to note that Nemerle itself defines many common statements as macros (where quotations are used). Some examples include: if, for, foreach, while, break, continue and using.
I think quotations have quite different uses in F# and Nemerle. In F#, you don't use quotations to extend the F# language itself, but you use them to take an AST (data representation of code) of some program written in standard F#.
In F#, this is done either by wrapping a piece of code in <@ ...F# code... @>, or by adding a special attribute to a function:
[<ReflectedDefinition>]
let foo () =
    // body of a function (standard F# code)
Robert already mentioned some uses of this mechanism: you can take the code and translate F# to SQL to query a database. But there are several other uses. You can, for example:
translate F# code to run on GPU
translate F# code to JavaScript using WebSharper
As Jordão has already mentioned, quotations enable metaprogramming. One real-world example of this is the ability to use quotations to translate F# into another language, such as SQL. In this way quotations serve much the same purpose as expression trees do in C#: they enable LINQ queries to be translated into SQL (or another data-access language) and executed against a data store.
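For illustration, here is a minimal C# sketch of that analogy (the example itself is made up): the lambda is captured as data, inspected, and then compiled back into a delegate, which is exactly what a LINQ provider does when it walks such a tree to emit SQL.

using System;
using System.Linq.Expressions;

class ExpressionTreeDemo
{
    static void Main()
    {
        // The lambda is captured as a data structure (an AST), not as compiled code.
        Expression<Func<int, bool>> expr = x => x > 42;

        // The tree can be inspected at run time...
        var body = (BinaryExpression)expr.Body;
        Console.WriteLine(body.NodeType);   // GreaterThan
        Console.WriteLine(body.Right);      // 42

        // ...or compiled back into an executable delegate.
        Func<int, bool> f = expr.Compile();
        Console.WriteLine(f(100));          // True
    }
}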
Unquote is a real-life example of quotation usage.
I have a grammar that goes something along these lines:
method_declaration : protection? expression identifier LEFT_PARENTHESES (method_argument (COMMA method_argument)*)? RIGHT_PARENTHESES method_block;
expression
    : ...
    | ...
    | identifier
    | kind
    ;
identifier : IDENTIFIER ;
kind : ... | ... | VOID_KIND; // 'void' for example; there are more kinds
IDENTIFIER : (LETTER | '_') (LETTER | DIGIT | '_')*;
VOID_KIND : 'void';
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
*The other rules in the method_declaration are not relevant to this question
What happens is that when I input something such as void Start() { }
and look at the ParseTree, it seems to think void is an identifier and not a kind, and treats it as such.
I tried changing the order in which kind and identifier are written in the .g4 file... but it doesn't quite seem to make any difference... why does this happen and how can I fix it?
The order in which parser rules are defined makes no difference in ANTLR. The order in which token rules are defined does though. Specifically, when multiple token rules match on the current input and produce a token of the same length, ANTLR will apply the one that is defined first in the grammar.
So in your case, you'll want to move VOID_KIND (and any other keyword rules you may have) before IDENTIFIER. So pretty much what you already tried except with the lexer rules instead of the parser rules.
PS: I'm somewhat surprised that ANTLR doesn't produce a warning about VOID_KIND being unmatchable in this case. I'm pretty sure other lexer generators would produce such a warning in cases like this.
I am making an interpreter with ANTLR and I am having trouble with the following:
One of my rules states:
expression
    : { .. }
    | { ... }
    | identifier
    | expression DOT expression
    | get_index
    | invoke_method
    | { ... }
    ;
I thought this way I'd evaluate expressions such as a.b[2].Run() or whatever... the idea is being able to use variables, indexing, and methods together. All three syntaxes are in place... How do I chain them, though? Is my approach correct, or is there a better way? (The example above just throws those alternatives in with no particular order, for the sake of clarity; it's not the actual grammar, but I can assure you the other three rules (identifier, get_index and invoke_method) are defined and are proper; identifier is a single one, not chained.)
I am new to C#. I have a question about parsing a string. I have a file that contains some lines such as PC: SWITCH_A == ON, or a string like PC: defined(SWITCH_B) && SWITCH_C == OFF. All the operators (==, &&, defined) are strings here, and all the switch names (SWITCH_A) and their values (OFF) are identifiers. How do I parse this kind of string? Do I first have to tokenize it, splitting by new lines or white space, and then build an abstract syntax tree for parsing? Also, do I need to store all the identifiers in a dictionary first? I have no idea about parsing; can anyone help and show me with an example what methods and classes should be involved? Thanks.
Unfortunately, yes. You have to tokenize them if the syntax you are parsing is custom rather than a standard syntax for which a parser already exists.
You could take advantage of Expression Trees. They are there in the .NET Framework for building and evaluating dynamic languages.
To start parsing the syntax you have to have a grammar document that describes all the possible cases of the syntax in each line. After that, you can start parsing the lines and building your expression tree.
Parsing any source code typically goes a character at a time since each character might change the entire semantics of the piece that is being parsed.
So, I suggest you start with a grammar document for the syntax that you have and then start writing your parser.
Make sure that there isn't anything already out there for the syntax you are trying to parse, as these kinds of projects tend to be error-prone and time-consuming.
Now since your high-level grammar is
Expression ::= Identifier | IntegerValue | BooleanExpression
Identifier and IntegerValue are constant literals in the source, so you need to start looking for a BooleanExpression.
To find a BooleanExpression you need to look for either BooleanBinaryExpression, BooleanUnaryExpression, TrueExpression or FalseExpression.
You can detect a BooleanBinaryExpression by looking for the && or == operators and then taking the left and right operands.
To detect a BooleanUnaryExpression you need to look for the word defined and then parse the identifier in the parentheses.
And so on...
Notice that your grammar supports recursion in the syntax: look at the definitions of AndExpression and EqualsExpression, which point back to Expression:
AndExpression ::= Expression '&&' Expression
EqualsExpression ::= Expression '==' Expression
You have a bunch of methods in the String class in the .NET Framework to assist you in detecting and parsing your grammar.
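To make this concrete, here is a minimal sketch of a tokenizer plus recursive-descent evaluator for lines like defined(SWITCH_B) && SWITCH_C == OFF (the PC: prefix is assumed to be stripped first; all class and method names here are illustrative, not a definitive design):

using System.Collections.Generic;
using System.Text.RegularExpressions;

class ConditionParser
{
    readonly List<string> tokens;
    int pos;
    readonly Dictionary<string, string> switches; // e.g. { "SWITCH_A": "ON" }

    public ConditionParser(string line, Dictionary<string, string> switches)
    {
        // Tokenize: '==', '&&', parentheses, and identifiers.
        tokens = new List<string>();
        foreach (Match m in Regex.Matches(line, @"==|&&|\(|\)|\w+"))
            tokens.Add(m.Value);
        this.switches = switches;
    }

    string Peek() => pos < tokens.Count ? tokens[pos] : null;
    string Next() => tokens[pos++];

    // Expression ::= Term ('&&' Term)*
    public bool ParseExpression()
    {
        bool result = ParseTerm();
        while (Peek() == "&&") { Next(); result &= ParseTerm(); }
        return result;
    }

    // Term ::= 'defined' '(' Identifier ')' | Identifier '==' Identifier
    bool ParseTerm()
    {
        if (Peek() == "defined")
        {
            Next(); Next();                           // consume 'defined' and '('
            bool defined = switches.ContainsKey(Next());
            Next();                                   // consume ')'
            return defined;
        }
        string name = Next();                         // switch name
        Next();                                       // consume '=='
        string value = Next();                        // expected value, e.g. OFF
        return switches.TryGetValue(name, out var v) && v == value;
    }
}

// Usage: new ConditionParser("defined(SWITCH_B) && SWITCH_C == OFF", switches).ParseExpression()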
Another alternative is to look for a parser generator that targets C#. For example, see ANTLR.
I'm working on parsing a language that will have user-defined function calls. At parse time, each of these identifiers will already be known. My goal is to tokenize each instance of a user-defined identifier during the lexical analysis stage. To do so, I've used a method similar to the one in this answer with the following changes:
// Lexer.g4
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4.cs
bool IsUserDefinedFunction()
{
    foreach (string function in listOfUserDefinedFunctions)
    {
        if (this.Text == function)
        {
            return true;
        }
    }
    return false;
}
However, I've found that just having the semantic predicate {IsUserDefinedFunction()}? makes parsing extremely slow (~1-20 ms without, ~2 sec with). Defining IsUserDefinedFunction() to always return false had no impact, so I'm positive the issue is in the parser. Is there any way to speed up the parsing of these keywords?
A major issue with the language being parsed is that it doesn't use a lot of whitespace between tokens, so a user defined function might begin with a language defined keyword.
For example: Given the language defined keyword GOTO and a user-defined function GOTO20Something, a typical piece of program text could look like:
GOTO20
GOTO30
GOTO20Something
GOTO20GOTO20Something
and should be tokenized as GOTO NUMBER GOTO NUMBER USER_FUNCTION GOTO NUMBER USER_FUNCTION
Edit to clarify:
Even rewriting IsUserDefinedFunction() as:
bool IsUserDefinedFunction() { return false; }
I still get the same slow performance.
Also, to clarify, my performance baseline is compared with "hard-coding" the dynamic keywords into the Lexer like so:
// Lexer.g4 - Poor Performance (2000 line input, ~ 2 seconds)
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4 - Good Performance (2000 line input, ~ 20 milliseconds)
USER_FUNCTION
    : 'ActualUserKeyword'
    | 'AnotherActualUserKeyword'
    | 'MoreKeywords'
    ...
    ;
Using the semantic predicate provides the correct behavior, but is terribly slow since it has to be checked for every alphanumeric character. Is there another way to handle tokens added at runtime?
Edit: In response to there not being any other identifiers in this language, I would take a different approach.
Use the original grammar, but remove the semantic predicate altogether. This means both valid and invalid user-defined function identifiers will result in USER_FUNCTION tokens.
Use a listener or visitor after the parse is complete to validate instances of USER_FUNCTION in the parse tree, and report an error at that time if the code uses a function that has not been defined.
This strategy results in better error messages, greatly improves the ability of the lexer and parser to recover from these types of errors, and produces a usable parse tree from the file (even though it's not completely semantically valid, it can still be used for analysis, reporting, and potentially to support IDE features down the road).
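A sketch of such a post-parse check, assuming the C# target (the MyLanguage* names below stand in for whatever classes ANTLR generates from your grammar):

using System;
using System.Collections.Generic;
using Antlr4.Runtime.Tree;

// Hypothetical listener; MyLanguageBaseListener would be generated from the grammar.
class FunctionValidationListener : MyLanguageBaseListener
{
    readonly ISet<string> definedFunctions;

    public FunctionValidationListener(ISet<string> definedFunctions)
    {
        this.definedFunctions = definedFunctions;
    }

    public override void VisitTerminal(ITerminalNode node)
    {
        // Flag any USER_FUNCTION token whose text was never defined.
        if (node.Symbol.Type == MyLanguageLexer.USER_FUNCTION
            && !definedFunctions.Contains(node.Symbol.Text))
        {
            Console.Error.WriteLine(
                "line " + node.Symbol.Line + ": undefined function '" + node.Symbol.Text + "'");
        }
    }
}

// Walk the tree after parsing: ParseTreeWalker.Default.Walk(listener, tree);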
Original answer assuming that identifiers which are not USER_FUNCTION should result in IDENTIFIER tokens.
The problem is the predicate is getting executed after every letter, digit, and underscore during the lexing phase. You can improve performance by declaring your USER_FUNCTION as a token (and removing the USER_FUNCTION rule from the grammar):
tokens {
USER_FUNCTION
}
Then, in the Lexer.g4.cs file, override the Emit() method to perform the test and override the token type if necessary.
public override IToken Emit()
{
    if (_type == IDENTIFIER && IsUserDefinedFunction())
        _type = USER_FUNCTION;
    return base.Emit();
}
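With this approach the check runs once per token instead of once per character. The linear scan inside IsUserDefinedFunction can also be replaced by a set lookup; a sketch (assuming the set is populated when the lexer is constructed):

// Lexer.g4.cs (sketch; requires System.Collections.Generic)
private readonly HashSet<string> userDefinedFunctions = new HashSet<string>();

bool IsUserDefinedFunction()
{
    // O(1) set lookup instead of a linear scan over the list.
    return userDefinedFunctions.Contains(this.Text);
}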
My solution for this specific language was to use a System.Text.RegularExpressions.Regex to surround all instances of user-defined functions in the input string with a special character (I chose the § (\u00A7) character).
Then the lexer defines:
USER_FUNCTION : '\u00A7' [a-zA-Z0-9_]+ '\u00A7';
In the parser listener, I strip the surrounding §'s from the function name.
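That preprocessing step could look something like this (a sketch; the class and method names are made up):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class FunctionMarker
{
    // Wraps each occurrence of a user-defined function name in U+00A7 markers,
    // e.g. "GOTO20GOTO20Something" becomes "GOTO20\u00A7GOTO20Something\u00A7".
    public static string MarkUserFunctions(string input, IEnumerable<string> userFunctions)
    {
        // Longest names first, so a name that contains another name wins.
        string pattern = string.Join("|",
            userFunctions.OrderByDescending(f => f.Length).Select(Regex.Escape));
        return Regex.Replace(input, pattern, "\u00A7$0\u00A7");
    }
}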
I'm looking to write a Truth Table Generator as a personal project.
There are several web-based online ones here and here.
(Example screenshot of an existing Truth Table Generator)
I have the following questions:
How should I go about parsing expressions like: ((P => Q) & (Q => R)) => (P => R)
Should I use a parser generator like ANTLR or YACC, or use straight regular expressions?
Once I have the expression parsed, how should I go about generating the truth table? Each section of the expression needs to be divided up into its smallest components and re-built from the left side of the table to the right. How would I evaluate something like that?
Can anyone provide me with tips concerning the parsing of these arbitrary expressions and eventually evaluating the parsed expression?
This sounds like a great personal project. You'll learn a lot about how the basic parts of a compiler work. I would skip trying to use a parser generator; if this is for your own edification, you'll learn more by doing it all from scratch.
The way such systems work is a formalization of how we understand natural languages. If I give you a sentence: "The dog, Rover, ate his food.", the first thing you do is break it up into words and punctuation. "The", "SPACE", "dog", "COMMA", "SPACE", "Rover", ... That's "tokenizing" or "lexing".
The next thing you do is analyze the token stream to see if the sentence is grammatical. The grammar of English is extremely complicated, but this sentence is pretty straightforward. SUBJECT-APPOSITIVE-VERB-OBJECT. This is "parsing".
Once you know that the sentence is grammatical, you can then analyze the sentence to actually get meaning out of it. For instance, you can see that there are three parts of this sentence -- the subject, the appositive, and the "his" in the object -- that all refer to the same entity, namely, the dog. You can figure out that the dog is the thing doing the eating, and the food is the thing being eaten. This is the semantic analysis phase.
Compilers then have a fourth phase that humans do not, which is they generate code that represents the actions described in the language.
So, do all that. Start by defining what the tokens of your language are. Define a base class Token and a bunch of derived classes for each (IdentifierToken, OrToken, AndToken, ImpliesToken, RightParenToken...). Then write a method that takes a string and returns an IEnumerable<Token>. That's your lexer.
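A rough sketch of that first step (all names are illustrative, and '~', '&', '|', '=>' are arbitrary spellings for the operators):

using System.Collections.Generic;

abstract class Token { }
sealed class IdentifierToken : Token { public string Name; }
sealed class AndToken : Token { }
sealed class OrToken : Token { }
sealed class NotToken : Token { }
sealed class ImpliesToken : Token { }
sealed class LeftParenToken : Token { }
sealed class RightParenToken : Token { }

static class Lexer
{
    // Turns "(P => Q) & R" into a stream of Token objects.
    public static IEnumerable<Token> Lex(string input)
    {
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c)) continue;
            switch (c)
            {
                case '&': yield return new AndToken(); break;
                case '|': yield return new OrToken(); break;
                case '~': yield return new NotToken(); break;
                case '(': yield return new LeftParenToken(); break;
                case ')': yield return new RightParenToken(); break;
                case '=': i++; yield return new ImpliesToken(); break; // skip '>' of "=>"
                default:
                    // A variable name like P or Q: collect the whole word.
                    int start = i;
                    while (i + 1 < input.Length && char.IsLetterOrDigit(input[i + 1])) i++;
                    yield return new IdentifierToken { Name = input.Substring(start, i - start + 1) };
                    break;
            }
        }
    }
}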
Second, figure out what the grammar of your language is, and write a recursive descent parser that breaks up an IEnumerable<Token> into an abstract syntax tree that represents grammatical entities in your language.
Then write an analyzer that looks at that tree and figures stuff out, like "how many distinct free variables do I have?"
Then write a code generator that spits out the code necessary to evaluate the truth tables. Spitting IL seems like overkill, but if you wanted to be really buff, you could. It might be easier to let the expression tree library do that for you; you can transform your parse tree into an expression tree, and then turn the expression tree into a delegate, and evaluate the delegate.
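That last route might look like this (a rough sketch; the formula is hard-coded here, where a real implementation would build the tree by walking the parse tree):

using System;
using System.Linq.Expressions;

class TruthTableRow
{
    static void Main()
    {
        // Suppose the parser produced an AST for "P & (Q | ~P)".
        // Build the equivalent expression tree...
        var p = Expression.Parameter(typeof(bool), "P");
        var q = Expression.Parameter(typeof(bool), "Q");
        var body = Expression.AndAlso(p, Expression.OrElse(q, Expression.Not(p)));

        // ...compile it into a delegate, and evaluate one row of the table.
        var f = Expression.Lambda<Func<bool, bool, bool>>(body, p, q).Compile();
        Console.WriteLine(f(true, false));   // False
    }
}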
Good luck!
I think a parser generator is overkill. You could convert the expression to postfix and evaluate the postfix expression (or directly build an expression tree out of the infix expression and use that to generate the truth table), as sketched below.
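The postfix-evaluation half is only a few lines; a minimal sketch, with made-up token spellings and the truth assignment passed in as a dictionary:

using System.Collections.Generic;

static class PostfixEval
{
    // Evaluates a postfix token sequence such as ["P", "Q", "&", "!"].
    public static bool Eval(IEnumerable<string> postfix, IDictionary<string, bool> env)
    {
        var stack = new Stack<bool>();
        foreach (var tok in postfix)
        {
            switch (tok)
            {
                case "!": stack.Push(!stack.Pop()); break;
                case "&": { var b = stack.Pop(); var a = stack.Pop(); stack.Push(a && b); break; }
                case "|": { var b = stack.Pop(); var a = stack.Pop(); stack.Push(a || b); break; }
                default: stack.Push(env[tok]); break; // a variable name
            }
        }
        return stack.Pop();
    }
}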
As Mehrdad mentions, you should be able to hand-roll the parsing in the same time it would take to learn the syntax of a lexer/parser generator. The end result you want is an abstract syntax tree (AST) of the expression you have been given.
You then need to build an input generator that creates all the input combinations for the symbols defined in the expression.
Then iterate across the input set, generating the results for each input combination, given the rules (AST) you parsed in the first step.
How I would do it:
I could imagine using lambda functions to express the AST/rules as you parse the tree, and building a symbol table as you parse; you could then build the input set, passing the symbol table to the lambda expression tree to calculate the results.
If your goal is processing boolean expressions, a parser generator and all the machinery that goes with it is a waste of time, unless you want to learn how they work (in which case any of them would be fine).
But it is easy to build a recursive-descent parser by hand for boolean expressions that computes and returns the result of "evaluating" the expression. Such a parser can be used in a first pass to determine the number of unique variables, where "evaluation" means "count 1 for each new variable name".
Writing a generator to produce all possible truth values for N variables is trivial; for each set of values, simply call the parser again and use it to evaluate the expression, where "evaluate" means "combine the values of the subexpressions according to the operator".
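That generator is just a counting loop; a sketch, assuming the expression evaluation is exposed as a callback:

using System;

static class TruthTable
{
    // Calls evaluate once per row; bit i of mask is the value of variable i.
    public static void Generate(int numberOfVariables, Func<bool[], bool> evaluate)
    {
        for (int mask = 0; mask < (1 << numberOfVariables); mask++)
        {
            var values = new bool[numberOfVariables];
            for (int i = 0; i < numberOfVariables; i++)
                values[i] = (mask & (1 << i)) != 0;
            Console.WriteLine(string.Join(" ", values) + " => " + evaluate(values));
        }
    }
}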
You need a grammar:
formula = disjunction ;
disjunction = conjunction
| disjunction "or" conjunction ;
conjunction = term
| conjunction "and" term ;
term = variable
| "not" term
| "(" formula ")" ;
Yours can be more complicated, but for boolean expressions it can't be that much more complicated.
For each grammar rule, write 1 subroutine that uses a global "scan" index into the string being parsed:
int disjunction()
// returns -1 ==> "not a disjunction"
// in mode 1:
//   returns 0 if disjunction is false
//   returns 1 if disjunction is true
{
    skipblanks(); // advance scan past blanks
    temp1 = conjunction();
    if (temp1 == -1) return -1; // syntax error
    while (true)
    {
        skipblanks();
        if (matchinput("or") == false) return temp1;
        temp2 = conjunction();
        if (temp2 == -1) return temp1;
        temp1 = temp1 or temp2;
    }
}

int term()
{
    skipblanks();
    if (inputmatchesvariablename())
    {
        variablename = getvariablenamefrominput();
        if (unique(variablename)) numberofvariables += 1;
        return lookupvariablename(variablename); // get truth-table value for name
    }
    ...
}
Each of your parse routines will be about this complicated. Seriously.
You can get the source code of the pyttgen program at http://code.google.com/p/pyttgen/source/browse/#hg/src It generates truth tables for logical expressions. The code is based on the ply library, so it's very simple :)