I'm working on parsing a language that will have user-defined function calls. At parse time, each of these identifiers will already be known. My goal is to tokenize each instance of a user-defined identifier during the lexical analysis stage. To do so, I've used a method similar to the one in this answer with the following changes:
// Lexer.g4
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4.cs
bool IsUserDefinedFunction()
{
    foreach (string function in listOfUserDefinedFunctions)
    {
        if (this.Text == function)
        {
            return true;
        }
    }
    return false;
}
However, I've found that merely having the semantic predicate {IsUserDefinedFunction()}? makes parsing extremely slow (~1-20 ms without it, ~2 sec with it). Defining IsUserDefinedFunction() to always return false had no impact, so I'm positive the issue is the presence of the predicate itself rather than my implementation of it. Is there any way to speed up the parsing of these keywords?
A major issue with the language being parsed is that it doesn't use a lot of whitespace between tokens, so a user-defined function might begin with a language-defined keyword.
For example: Given the language-defined keyword GOTO and a user-defined function GOTO20Something, a typical piece of program text could look like:
GOTO20
GOTO30
GOTO20Something
GOTO20GOTO20Something
and should be tokenized as GOTO NUMBER GOTO NUMBER USER_FUNCTION GOTO NUMBER USER_FUNCTION
Edit to clarify:
Even rewriting IsUserDefinedFunction() as:
bool IsUserDefinedFunction() { return false; }
I still get the same slow performance.
Also, to clarify, my performance baseline is compared with "hard-coding" the dynamic keywords into the Lexer like so:
// Lexer.g4 - Poor Performance (2000 line input, ~ 2 seconds)
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4 - Good Performance (2000 line input, ~ 20 milliseconds)
USER_FUNCTION
    : 'ActualUserKeyword'
    | 'AnotherActualUserKeyword'
    | 'MoreKeywords'
    ...
    ;
Using the semantic predicate provides the correct behavior, but is terribly slow since it has to be checked for every alphanumeric character. Is there another way to handle tokens added at runtime?
Edit: Since there are no other identifiers in this language, I would take a different approach.
1. Use the original grammar, but remove the semantic predicate altogether. This means both valid and invalid user-defined function identifiers will result in USER_FUNCTION tokens.
2. Use a listener or visitor after the parse is complete to validate instances of USER_FUNCTION in the parse tree, and report an error at that time if the code uses a function that has not been defined (see the sketch below).
This strategy results in better error messages, greatly improves the ability of the lexer and parser to recover from these types of errors, and produces a usable parse tree from the file (even though it's not completely semantically valid, it can still be used for analysis, reporting, and potentially to support IDE features down the road).
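A minimal sketch of that validation step, assuming the grammar has a functionCall rule; every type and rule name here (MyLangBaseListener, MyLangParser, FunctionCallContext) is hypothetical and would come from ANTLR's generated code:
// Sketch only: MyLangBaseListener, MyLangParser and the functionCall rule
// are hypothetical names from a grammar this answer does not define.
using System;
using System.Collections.Generic;

public class UserFunctionValidator : MyLangBaseListener
{
    private readonly HashSet<string> definedFunctions;

    public UserFunctionValidator(HashSet<string> definedFunctions)
    {
        this.definedFunctions = definedFunctions;
    }

    public override void EnterFunctionCall(MyLangParser.FunctionCallContext context)
    {
        var token = context.USER_FUNCTION().Symbol;
        if (!definedFunctions.Contains(token.Text))
        {
            // Report instead of failing the parse; the tree stays usable.
            Console.Error.WriteLine(
                $"line {token.Line}:{token.Column}: unknown function '{token.Text}'");
        }
    }
}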
Original answer, assuming that identifiers which are not a USER_FUNCTION should result in IDENTIFIER tokens:
The problem is the predicate is getting executed after every letter, digit, and underscore during the lexing phase. You can improve performance by declaring your USER_FUNCTION as a token (and removing the USER_FUNCTION rule from the grammar):
tokens {
    USER_FUNCTION
}
Then, in the Lexer.g4.cs file, override the Emit() method to perform the test and override the token type if necessary.
public override IToken Emit() {
    if (_type == IDENTIFIER && IsUserDefinedFunction())
        _type = USER_FUNCTION;
    return base.Emit();
}
My solution for this specific language was to use a System.Text.RegularExpressions.Regex to surround all instances of user-defined functions in the input string with a special character (I chose the § (\u00A7) character).
Then the lexer defines:
USER_FUNCTION : '\u00A7' [a-zA-Z0-9_]+ '\u00A7';
In the parser listener, I strip the surrounding §'s from the function name.
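For reference, a sketch of that preprocessing step; the helper name is mine, and Regex.Escape guards against user-defined names containing regex metacharacters:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static string MarkUserFunctions(string source, IEnumerable<string> functions)
{
    // Longest names first, so GOTO20Something wins over any shorter prefix.
    string pattern = string.Join("|",
        functions.OrderByDescending(f => f.Length).Select(Regex.Escape));

    // Wrap each occurrence in U+00A7 so the lexer rule above can see it.
    return Regex.Replace(source, pattern, m => "\u00A7" + m.Value + "\u00A7");
}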
Related
I have a string like this.
string strex = "Insert|Update|Delete";
I am retrieving another string, such as string strex1 = "Insert" (it may instead be Update or Delete).
I need to match strex1 against strex in an if condition in C#.
Do I need to split strex and match each part with strex1?
The string you posted is a regular expression pattern that matches the words Insert, Update or Delete. Regular expressions are a very common way of specifying validation rules in web applications.
Regular expressions can express far more complex rules than a simple comparison. They're also far faster (think 10x) in validation scenarios than splitting. In a web application, that translates to using fewer servers to serve the same traffic.
You can use .NET's Regex to match strings with that pattern, e.g.:
var strex = "Insert|Update|Delete";
if (Regex.IsMatch(input, strex))
{
    ....
}
This will create a new regular expression object each time. You can avoid this by creating a static Regex instance and reusing it. Regex is thread-safe, which means there's no problem using the same instance from multiple threads:
static Regex _cmdRegex = new Regex("Insert|Update|Delete");
...
void MyMethod(string input)
{
    if (_cmdRegex.IsMatch(input))
    {
        ...
    }
}
The Regex class methods will match if the pattern appears anywhere in the input. Regex.IsMatch("Insert1", strex) will return true. If you want an exact match, you have to specify that the pattern starts at the beginning of the input with ^ and ends at the end with $:
static Regex _cmdRegex = new Regex("^(Insert|Update|Delete)$");
With this change, _cmdRegex.IsMatch("Insert1") will return false but _cmdRegex.IsMatch("Insert") will return true.
Performance
In this case a regular expression is a lot faster than splitting and trying exact matches. Think 10-100x over time. There are two reasons for this:
Strings are immutable, so every string modification operation like Split() will generate new temporary strings that have to be allocated and garbage collected. In a busy web application this adds up, eventually using up a lot of RAM and CPU for little or no benefit. One of the reasons ASP.NET Core is 10x faster than the old ASP.NET is that it eliminates such substring operations wherever possible.
A regular expression is compiled into a program that performs matching in the most efficient way. When you use Split().Any(), the program will compare the input with all the substrings even if it's obvious there's no possible match, e.g. because the first letter is Z. A Regex program, on the other hand, would only proceed if the first character was I, U or D.
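If the check sits on a hot path, the pattern can additionally be compiled to IL up front with RegexOptions.Compiled; a sketch (the method name is illustrative):
using System.Text.RegularExpressions;

// Compiled once; anchored so "Insert1" does not match.
static readonly Regex _cmdRegex =
    new Regex("^(Insert|Update|Delete)$", RegexOptions.Compiled);

static bool IsCommand(string input) => _cmdRegex.IsMatch(input);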
The most efficient way I can think of is using string.Contains():
if (strex.Contains($"{strex1}|") || strex.Contains($"|{strex1}"))
{
    //Your code goes here
}
Solution using LINQ: split string strex by '|' and check whether strex1 is present in the resulting array.
The issue with the solution below is pointed out by @PanagiotisKanavos in the comments.
Using .Any():
if (strex.Split('|').Any(x => x.Equals(strex1)))
{
    //Your code goes here
}
or using Contains():
if (strex.Split('|').Contains(strex1))
{
    //Your code goes here
}
If you want to ignore case while comparing strings, you can use StringComparison.OrdinalIgnoreCase:
if (strex.Split('|').Any(x => x.Equals(strex1, StringComparison.OrdinalIgnoreCase)))
{
    //Your code goes here
}
In the latest edition of JavaSpecialists newsletter, the author mentions a piece of code that is un-compilable in Java
public class A1 {
    Character aChar = '\u000d';
}
Try to compile it, and you will get an error such as:
A1.java:2: illegal line end in character literal
Character aChar = '\u000d';
^
Why does an equivalent piece of C# code not show such a problem?
public class CharacterFixture
{
    char aChar = '\u000d';
}
Am I missing anything?
EDIT: My original intention with this question was to ask how the C# compiler got Unicode file parsing correct (if it did), and why Java still sticks with its (arguably incorrect) parsing.
EDIT: I would also like my original question title to be restored. Why such heavy editing? I strongly suspect it modified my intentions.
Java's compiler translates \uxxxx escape sequences as one of the very first steps, even before the tokenizer gets a crack at the code. By the time it actually starts tokenizing, there are no \uxxxx sequences anymore; they're already turned into the chars they represent, so to the compiler your Java example looks the same as if you'd actually typed a carriage return in there somehow. It does this in order to provide a way to use Unicode within the source, regardless of the source file's encoding. Even ASCII text can still fully represent Unicode chars if necessary (at the cost of readability), and since it's done so early, you can have them almost anywhere in the code. (You could say \u0063\u006c\u0061\u0073\u0073\u0020\u0053\u0074\u0075\u0066\u0066\u0020\u007b\u007d, and the compiler would read it as class Stuff {}, if you wanted to be annoying or torture yourself.)
C# doesn't do that. \uxxxx is translated later, with the rest of the program, and is only valid in certain types of tokens (namely, identifiers and string/char literals). This means it can't be used in certain places where it can be used in Java. cl\u0061ss is not a keyword, for example.
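A short C# illustration of the difference; the char literal compiles precisely because the escape is decoded as part of the literal token rather than before tokenization, and (as noted above) escapes are also accepted inside identifiers:
public class CharacterFixture
{
    // Legal C#: '\u000d' is decoded while lexing the char literal itself,
    // so no raw carriage return ever reaches the token stream.
    char aChar = '\u000d';

    // Escapes are also allowed inside identifiers: this declares "abc".
    int \u0061bc = 1;
}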
I need to parse and split C and C++ functions into the main components (return type, function name/class and method, parameters, etc).
I'm working from either headers or a list where the signatures take the form:
public: void __thiscall myClass::method(int, class myOtherClass * )
I have the following regex, which works for most functions:
(?<expo>public\:|protected\:|private\:) (?<ret>(const )*(void|int|unsigned int|long|unsigned long|float|double|(class .*)|(enum .*))) (?<decl>__thiscall|__cdecl|__stdcall|__fastcall|__clrcall) (?<ns>.*)\:\:(?<class>(.*)((<.*>)*))\:\:(?<method>(.*)((<.*>)*))\((?<params>((.*(<.*>)?)(,)?)*)\)
There are a few functions that it doesn't like to parse, but appear to match the pattern. I'm not worried about matching functions that aren't members of a class at the moment (can handle that later). The expression is used in a C# program, so the <label>s are for easily retrieving the groups.
I'm wondering if there is a standard regex to parse all functions, or how to improve mine to handle the odd exceptions?
C++ is notoriously hard to parse; it is impossible to write a regex that catches all cases. For example, there can be an unlimited number of nested parentheses, which shows that even this subset of the C++ language is not regular.
But it seems that you're going for practicality, not theoretical correctness. Just keep improving your regex until it catches the cases it needs to catch, and try to make it as stringent as possible so you don't get any false matches.
Without knowing the "odd exceptions" that it doesn't catch, it's hard to say how to improve the regex.
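Once the pattern matches, pulling the pieces out in C# is straightforward via the named groups; a sketch with a deliberately simplified stand-in pattern (substitute the full one from the question):
using System;
using System.Text.RegularExpressions;

static class SignatureParser
{
    // Simplified stand-in for the full pattern in the question.
    static readonly Regex SignatureRegex = new Regex(
        @"(?<expo>public\:|protected\:|private\:)\s+(?<ret>.+?)\s+" +
        @"(?<decl>__thiscall|__cdecl|__stdcall|__fastcall|__clrcall)\s+" +
        @"(?<ns>[^:]+)\:\:(?<method>[^(]+)\((?<params>[^)]*)\)",
        RegexOptions.Compiled);

    static void Dump(string signature)
    {
        Match m = SignatureRegex.Match(signature);
        if (!m.Success) { Console.WriteLine("no match"); return; }
        Console.WriteLine($"return: {m.Groups["ret"].Value}");
        Console.WriteLine($"method: {m.Groups["method"].Value.Trim()}");
        Console.WriteLine($"params: {m.Groups["params"].Value.Trim()}");
    }
}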
Take a look at Boost.Spirit: it is a Boost library that allows implementing recursive descent parsers using only C++ code and no preprocessors. You have to specify a BNF grammar, and then pass a string for it to parse. You can even generate an abstract syntax tree (AST), which is useful for processing the parsed data.
The BNF specification for a list of integers or words separated by spaces might look like:
using spirit::alpha_p;
using spirit::digit_p;
using spirit::anychar_p;
using spirit::end_p;
using spirit::space_p;
// Inside the definition...
integer = +digit_p;                          // One or more digits.
word = +alpha_p;                             // One or more letters.
token = integer | word;                      // An integer or a word.
token_list = token >> *(+space_p >> token);  // A token, followed by 0 or more tokens.
For more information refer to the documentation, the library is a bit complex at the beginning, but then it gets easier to use (and more powerful).
No. Even function prototypes can have arbitrary levels of nesting, so they cannot be expressed with a single regular expression.
If you really are restricting yourself to things very close to your example (exactly 2 arguments, etc.), then could you provide an example of something that doesn't match?
I'm attempting to write an application to extract properties and code from proprietary IDE design files. The file format looks something like this:
HEADING
{
    SUBHEADING1
    {
        PropName1 = PropVal1;
        PropName2 = PropVal2;
    }
    SUBHEADING2
    {
        { 1 ; PropVal1 ; PropValue2 }
        { 2 ; PropVal1 ; PropValue2 ; OnEvent1=BEGIN
              MESSAGE('Hello, World!');
              { block comments are between braces }
              //inline comments are after double-slashes
              END;
          PropVal3 }
        { 1 ; PropVal1 ; PropVal2; PropVal3 }
    }
}
What I am trying to do is extract the contents under the subheading blocks. In the case of SUBHEADING2, I would also separate each token as delimited by the semicolons. I had reasonably good success with just counting the brackets and keeping track of what subheading I'm currently under. The main issue I encountered involves dealing with the code comments.
This language happens to use {} for block comments, which interferes with the brackets in the file format. To make it even more interesting, it also needs to take into account double-slash inline comments and ignore everything up to the end of the line.
What is the best approach to tackling this? I looked at some of the compiler libraries discussed in another article (ANTLR, Doxygen, etc.) but they seem like overkill for solving this specific parsing issue.
I'd suggest writing a tokenizer and parser; this will give you more flexibility. The tokenizer does a simple text-wise breakdown of the source code and puts it into a more usable data structure; the parser figures out what to do with it, often leveraging recursion.
Terms to google: tokenizer, parser, compiler design, grammars
Math expression evaluator: http://www.codeproject.com/KB/vb/math_expression_evaluator.aspx
(you might be able to take an example like this and hack it apart into what you want)
More info about parsing: http://www.codeproject.com/KB/recipes/TinyPG.aspx
You won't have to go nearly as far as those articles go, but you're going to want to study a bit on this one first.
You should be able to put something together in a few hours, using regular expressions in combination with some code that uses the results.
Something like this should work:
- Initialize the process by loading the file into a string.
- Pull each top-level block from the string, using regex tags to separately identify the block keyword and contents.
- If a block is found:
  - Make a decision based on the keyword.
  - Pass the content to this process recursively.
Following this, you would process HEADING, then the first SUBHEADING, then the second SUBHEADING, then each sub-block. For the sub-block containing the block comment, you would presumably know based on the block's lack of a keyword that any sub-block is a comment, so there is no need to process the sub-blocks.
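One possible shape for the "pull each top-level block" step: .NET regexes can match nested braces with balancing groups. A sketch (the names are mine, and it only finds keyworded blocks; the keyword-less sub-blocks and the comment problem still need the recursive handling described above):
using System;
using System.Text.RegularExpressions;

static class BlockWalker
{
    // Matches KEYWORD { ... } with arbitrarily nested braces, using .NET
    // balancing groups: <d> pushes on '{', <-d> pops on '}', and the
    // conditional (?(d)(?!)) rejects the match if any '{' is left open.
    static readonly Regex BlockRegex = new Regex(
        @"(?<keyword>\w+)\s*\{(?<content>(?:[^{}]|(?<d>\{)|(?<-d>\}))*(?(d)(?!)))\}",
        RegexOptions.Compiled);

    // Prints the keyword hierarchy; comments must be stripped beforehand,
    // since the { } block comments would otherwise unbalance the count.
    public static void Walk(string text, int depth = 0)
    {
        foreach (Match m in BlockRegex.Matches(text))
        {
            Console.WriteLine(new string(' ', depth * 2) + m.Groups["keyword"].Value);
            Walk(m.Groups["content"].Value, depth + 1);
        }
    }
}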
No matter which solution you choose, I'm pretty sure the best way is to have two parsers/tokenizers: one for the main file structure with {} as grouping characters, and one for the code blocks.
I'm looking to write a Truth Table Generator as a personal project.
There are several web-based online ones here and here.
(Example screenshot of an existing Truth Table Generator)
I have the following questions:
How should I go about parsing expressions like: ((P => Q) & (Q => R)) => (P => R)
Should I use a parser generator like ANTLR or YACC, or straight regular expressions?
Once I have the expression parsed, how should I go about generating the truth table? Each section of the expression needs to be divided up into its smallest components and re-built from the left side of the table to the right. How would I evaluate something like that?
Can anyone provide me with tips concerning the parsing of these arbitrary expressions and eventually evaluating the parsed expression?
This sounds like a great personal project. You'll learn a lot about how the basic parts of a compiler work. I would skip trying to use a parser generator; if this is for your own edification, you'll learn more by doing it all from scratch.
The way such systems work is a formalization of how we understand natural languages. If I give you a sentence: "The dog, Rover, ate his food.", the first thing you do is break it up into words and punctuation. "The", "SPACE", "dog", "COMMA", "SPACE", "Rover", ... That's "tokenizing" or "lexing".
The next thing you do is analyze the token stream to see if the sentence is grammatical. The grammar of English is extremely complicated, but this sentence is pretty straightforward. SUBJECT-APPOSITIVE-VERB-OBJECT. This is "parsing".
Once you know that the sentence is grammatical, you can then analyze the sentence to actually get meaning out of it. For instance, you can see that there are three parts of this sentence -- the subject, the appositive, and the "his" in the object -- that all refer to the same entity, namely, the dog. You can figure out that the dog is the thing doing the eating, and the food is the thing being eaten. This is the semantic analysis phase.
Compilers then have a fourth phase that humans do not, which is they generate code that represents the actions described in the language.
So, do all that. Start by defining what the tokens of your language are. Define a base class Token and a bunch of derived classes for each (IdentifierToken, OrToken, AndToken, ImpliesToken, RightParenToken...). Then write a method that takes a string and returns an IEnumerable<Token>. That's your lexer.
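A minimal sketch of that first step; the exact token set and the choice of '~' for "not" are my own assumptions:
using System;
using System.Collections.Generic;

abstract class Token { }
sealed class IdentifierToken : Token
{
    public string Name { get; }
    public IdentifierToken(string name) { Name = name; }
}
sealed class AndToken : Token { }
sealed class OrToken : Token { }
sealed class NotToken : Token { }
sealed class ImpliesToken : Token { }
sealed class LeftParenToken : Token { }
sealed class RightParenToken : Token { }

static class Lexer
{
    public static IEnumerable<Token> Lex(string input)
    {
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c)) { i++; }
            else if (c == '&') { i++; yield return new AndToken(); }
            else if (c == '|') { i++; yield return new OrToken(); }
            else if (c == '~') { i++; yield return new NotToken(); }
            else if (c == '(') { i++; yield return new LeftParenToken(); }
            else if (c == ')') { i++; yield return new RightParenToken(); }
            else if (c == '=' && i + 1 < input.Length && input[i + 1] == '>')
            { i += 2; yield return new ImpliesToken(); }
            else if (char.IsLetter(c))
            {
                int start = i;
                while (i < input.Length && char.IsLetterOrDigit(input[i])) i++;
                yield return new IdentifierToken(input.Substring(start, i - start));
            }
            else throw new FormatException($"Unexpected character '{c}' at position {i}");
        }
    }
}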
Second, figure out what the grammar of your language is, and write a recursive descent parser that breaks up an IEnumerable<Token> into an abstract syntax tree that represents grammatical entities in your language.
Then write an analyzer that looks at that tree and figures stuff out, like "how many distinct free variables do I have?"
Then write a code generator that spits out the code necessary to evaluate the truth tables. Spitting IL seems like overkill, but if you wanted to be really buff, you could. It might be easier to let the expression tree library do that for you; you can transform your parse tree into an expression tree, and then turn the expression tree into a delegate, and evaluate the delegate.
Good luck!
I think a parser generator is overkill. You could use the idea of converting an expression to postfix and evaluating postfix expressions (or directly building an expression tree out of the infix expression and using that to generate the truth table) to solve this problem. A sketch of the postfix route is below.
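A sketch of that idea for single-letter variables and the connectives !, & and |; the precedences and the omission of => are my own simplifications:
using System;
using System.Collections.Generic;

static class Postfix
{
    static readonly Dictionary<char, int> Prec = new Dictionary<char, int>
    {
        ['!'] = 3, ['&'] = 2, ['|'] = 1
    };

    // Shunting-yard: turns infix like "(P&Q)|!R" into postfix "PQ&R!|".
    public static string FromInfix(string infix)
    {
        var output = new List<char>();
        var ops = new Stack<char>();
        foreach (char c in infix)
        {
            if (char.IsLetter(c)) output.Add(c);
            else if (c == '(') ops.Push(c);
            else if (c == ')')
            {
                while (ops.Peek() != '(') output.Add(ops.Pop());
                ops.Pop(); // discard the '('
            }
            else if (Prec.ContainsKey(c))
            {
                // '!' is right-associative (unary); '&' and '|' are left-associative.
                while (ops.Count > 0 && ops.Peek() != '(' &&
                       (Prec[ops.Peek()] > Prec[c] ||
                        (Prec[ops.Peek()] == Prec[c] && c != '!')))
                    output.Add(ops.Pop());
                ops.Push(c);
            }
        }
        while (ops.Count > 0) output.Add(ops.Pop());
        return new string(output.ToArray());
    }

    // Evaluates a postfix string against one row of variable assignments.
    public static bool Eval(string postfix, IDictionary<char, bool> vars)
    {
        var stack = new Stack<bool>();
        foreach (char c in postfix)
        {
            if (char.IsLetter(c)) stack.Push(vars[c]);
            else if (c == '!') stack.Push(!stack.Pop());
            else
            {
                bool b = stack.Pop(), a = stack.Pop();
                stack.Push(c == '&' ? a & b : a | b);
            }
        }
        return stack.Pop();
    }
}
Calling Eval(FromInfix("(P&Q)|!R"), row) once per assignment row then yields one output column of the truth table.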
As Mehrdad mentions, you should be able to hand-roll the parsing in about the same time it would take to learn the syntax of a lexer/parser generator. The end result you want is some abstract syntax tree (AST) of the expression you have been given.
You then need to build some input generator that creates the input combinations for the symbols defined in the expression.
Then iterate across the input set, generating the results for each input combo, given the rules (AST) you parsed in the first step.
How I would do it:
I could imagine using lambda functions to express the AST/rules as you parse the tree, building a symbol table as you go. You could then build the input set, passing the symbol table to the lambda expression tree to calculate the results.
If your goal is processing boolean expressions, a parser generator and all the machinery that goes with it is a waste of time, unless you want to learn how they work (in which case any of them would be fine).
But it is easy to build a recursive-descent parser by hand for boolean expressions that computes and returns the results of "evaluating" the expression. Such a parser could be used on a first pass to determine the number of unique variables, where "evaluation" means "count 1 for each new variable name".
Writing a generator to produce all possible truth values for N variables is trivial; for each set of values, simply call the parser again and use it to evaluate the expression, where evaluate means "combine the values of the subexpressions according to the operator".
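The generator part really is small; a sketch that enumerates all 2^N rows as dictionaries (names are illustrative):
using System.Collections.Generic;

static class TruthInputs
{
    // Enumerate all 2^n truth assignments for the given variable names.
    public static IEnumerable<Dictionary<char, bool>> Assignments(IList<char> vars)
    {
        for (int mask = 0; mask < (1 << vars.Count); mask++)
        {
            var row = new Dictionary<char, bool>();
            for (int i = 0; i < vars.Count; i++)
                row[vars[i]] = (mask & (1 << i)) != 0;
            yield return row;
        }
    }
}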
You need a grammar:
formula = disjunction ;
disjunction = conjunction
            | disjunction "or" conjunction ;
conjunction = term
            | conjunction "and" term ;
term = variable
     | "not" term
     | "(" formula ")" ;
Yours can be more complicated, but for boolean expressions it can't be that much more complicated.
For each grammar rule, write 1 subroutine that uses a global "scan" index into the string being parsed:
int disjunction()
// returns "-1" ==> "not a disjunction"
// in mode 1:
//   returns "0" if disjunction is false
//   returns "1" if disjunction is true
{  skipblanks(); // advance scan past blanks (duh)
   temp1 = conjunction();
   if (temp1 == -1) return -1; // syntax error
   while (true)
   {  skipblanks();
      if (matchinput("or") == false) return temp1;
      temp2 = conjunction();
      if (temp2 == -1) return temp1;
      temp1 = temp1 or temp2;
   }
}
int term()
{  skipblanks();
   if (inputmatchesvariablename())
   {  variablename = getvariablenamefrominput();
      if (unique(variablename)) numberofvariables += 1;
      return lookupvariablename(variablename); // get truth table value for name
   }
   ...
}
Each of your parse routines will be about this complicated. Seriously.
You can get the source code of the pyttgen program at http://code.google.com/p/pyttgen/source/browse/#hg/src It generates truth tables for logical expressions. The code is based on the ply library, so it's very simple :)