I am trying to write a regular expression that checks that variable names use camel casing.
The expression I have got so far is:
(?xm-isn:(?:\b\w*(?:-)\w*\s*\=)|(?:\b[A-Z0-9_-]+(?=\s*\W*\b)\s*\=))
which works fine.
The question is, how can I make an exception for the following part of the code so it doesn't consider this naming convention for that particular part of the code in the file?
public enum ProjectType
{
    [DisplayName("All")]
    All = 0,
    [DisplayName("All .NET - Windows Forms and Web Forms")]
    AllNet = 1,
}
Regex is great for pattern matching but not for lexical analysis. I suggest you look into a dedicated tool for that, such as Gardens Point LEX (GPLEX).
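To illustrate why a little state helps here, below is a minimal sketch (in Python for brevity; the same idea carries over to C#) of a line scanner that remembers whether it is inside an enum body and only applies the naming check outside of it. The naming rule (an assigned name must start with a lowercase letter) and the function name find_violations are assumptions made for illustration, not the rule from the question's regex, and the scanner deliberately ignores strings and comments.

```python
import re

# Hypothetical rule: a name followed by a single '=' must start in lowercase.
ASSIGNMENT = re.compile(r"\b([A-Za-z_]\w*)\s*=(?!=)")
ENUM_START = re.compile(r"\benum\b")

def find_violations(source):
    """Return (line_number, name) pairs for flagged names outside enum bodies."""
    violations = []
    pending_enum = False  # saw 'enum', still waiting for its opening brace
    enum_depth = 0        # greater than 0 while inside an enum body
    for lineno, line in enumerate(source.splitlines(), start=1):
        if enum_depth == 0 and ENUM_START.search(line):
            pending_enum = True
        # Only check lines that are neither an enum header nor an enum body.
        if enum_depth == 0 and not pending_enum:
            for m in ASSIGNMENT.finditer(line):
                name = m.group(1)
                if not name[0].islower():
                    violations.append((lineno, name))
        # Track braces so we know when the enum body starts and ends.
        for ch in line:
            if ch == "{" and (pending_enum or enum_depth > 0):
                enum_depth += 1
                pending_enum = False
            elif ch == "}" and enum_depth > 0:
                enum_depth -= 1
    return violations
```

A real lexer would also skip string literals and comments, which is the kind of bookkeeping a scanner generator takes care of.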
I am new to C#. I have a question about parsing a string. Suppose I have a file that contains some lines such as PC: SWITCH_A == ON, or a string like PC: defined(SWITCH_B) && SWITCH_C == OFF. All the operators (==, &&, defined) are strings here, and all the switch names (SWITCH_A) and their values (OFF) are identifiers. How do I parse these kinds of strings? Do I first have to tokenize them, splitting them by new lines or white space, and then build an abstract syntax tree to parse them? Also, do I need to store all the identifiers in a dictionary first? I have no idea about parsing. Can anyone help and show me with an example how to do it, and which methods and classes should be included? Thanks.
Unfortunately, yes. You have to tokenize them if the syntax that you are parsing is something custom and not a standard syntax for which a compiler already exists to parse the source.
You could take advantage of Expression Trees. They are there in the .NET Framework for building and evaluating dynamic languages.
To start parsing the syntax you have to have a grammar document that describes all the possible cases of the syntax in each line. After that, you can start parsing the lines and building your expression tree.
Parsing any source code typically goes a character at a time since each character might change the entire semantics of the piece that is being parsed.
So, I suggest you start with a grammar document for the syntax that you have and then start writing your parser.
Make sure that there isn't anything already out there for the syntax you are trying to parse, as these kinds of projects tend to be error-prone and time-consuming.
Now since your high-level grammar is
Expression ::= Identifier | IntegerValue | BooleanExpression
Identifier and IntegerValue are constant literals in the source, so you need to start looking for a BooleanExpression.
To find a BooleanExpression you need to look for either BooleanBinaryExpression, BooleanUnaryExpression, TrueExpression or FalseExpression.
You can detect a BooleanBinaryExpression by looking for the && or == operators and then taking the left and right operands.
To detect a BooleanUnaryExpression you need to look for the word defined and then parse the identifier in the parentheses.
And so on...
Notice that your grammar supports recursion in the syntax. Look at the definitions of AndExpression and EqualsExpression; they point back to Expression:
AndExpression ::= Expression '&&' Expression
EqualsExpression ::= Expression '==' Expression
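The tokenize-then-parse steps described above can be sketched as follows. This is a minimal illustration in Python rather than C#, under a few assumptions that are not in the question: == binds tighter than &&, an identifier that is not a known switch evaluates to its own name (so OFF compares as the literal "OFF"), and the names tokenize, Parser and evaluate are made up for the example.

```python
import re

# Operators from the question, parentheses, and identifiers.
TOKEN_RE = re.compile(r"&&|==|\(|\)|[A-Za-z_]\w*")

def tokenize(text):
    """Split a condition into a list of token strings."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at index {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens

class Parser:
    """Recursive descent: one method per grammar rule."""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, wanted):
        if self.peek() != wanted:
            raise SyntaxError(f"expected {wanted!r}, found {self.peek()!r}")
        self.pos += 1

    def identifier(self):
        tok = self.peek()
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError(f"expected an identifier, found {tok!r}")
        self.pos += 1
        return tok

    def parse(self):
        tree = self.and_expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def and_expr(self):      # equals_expr ('&&' equals_expr)*
        node = self.equals_expr()
        while self.peek() == "&&":
            self.pos += 1
            node = ("and", node, self.equals_expr())
        return node

    def equals_expr(self):   # primary ('==' primary)?
        node = self.primary()
        if self.peek() == "==":
            self.pos += 1
            node = ("eq", node, self.primary())
        return node

    def primary(self):       # '(' and_expr ')' | 'defined' '(' id ')' | id
        if self.peek() == "(":
            self.pos += 1
            node = self.and_expr()
            self.expect(")")
            return node
        name = self.identifier()
        if name == "defined":
            self.expect("(")
            target = self.identifier()
            self.expect(")")
            return ("defined", target)
        return ("id", name)

def evaluate(node, switches):
    """Walk the tuple-based tree; 'switches' maps switch names to values."""
    kind = node[0]
    if kind == "id":
        return switches.get(node[1], node[1])
    if kind == "defined":
        return node[1] in switches
    left, right = evaluate(node[1], switches), evaluate(node[2], switches)
    return left == right if kind == "eq" else bool(left) and bool(right)
```

For a line such as PC: defined(SWITCH_B) && SWITCH_C == OFF, you would strip the PC: prefix first and then call evaluate(Parser(tokenize(rest)).parse(), switches). The dictionary of switches plays the role of the identifier table asked about in the question.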
There are a bunch of methods in the String class in the .NET Framework to assist you in detecting and parsing your grammar.
Another alternative is to look for a parser generator that targets C#. For example, see ANTLR.
I need to parse and split C and C++ functions into the main components (return type, function name/class and method, parameters, etc).
I'm working from either headers or a list where the signatures take the form:
public: void __thiscall myClass::method(int, class myOtherClass * )
I have the following regex, which works for most functions:
(?<expo>public\:|protected\:|private\:) (?<ret>(const )*(void|int|unsigned int|long|unsigned long|float|double|(class .*)|(enum .*))) (?<decl>__thiscall|__cdecl|__stdcall|__fastcall|__clrcall) (?<ns>.*)\:\:(?<class>(.*)((<.*>)*))\:\:(?<method>(.*)((<.*>)*))\((?<params>((.*(<.*>)?)(,)?)*)\)
There are a few functions that it doesn't like to parse, but appear to match the pattern. I'm not worried about matching functions that aren't members of a class at the moment (can handle that later). The expression is used in a C# program, so the <label>s are for easily retrieving the groups.
I'm wondering if there is a standard regex to parse all functions, or how to improve mine to handle the odd exceptions?
C++ is notoriously hard to parse; it is impossible to write a regex that catches all cases. For example, there can be an unlimited number of nested parentheses, which shows that even this subset of the C++ language is not regular.
But it seems that you're going for practicality, not theoretical correctness. Just keep improving your regex until it catches the cases it needs to catch, and try to make it as stringent as possible so you don't get any false matches.
Without knowing the "odd exceptions" that it doesn't catch, it's hard to say how to improve the regex.
Take a look at Boost.Spirit, it is a boost library that allows the implementation of recursive descent parsers using only C++ code and no preprocessors. You have to specify a BNF Grammar, and then pass a string for it to parse. You can even generate an Abstract-Syntax Tree (AST), which is useful to process the parsed data.
The BNF specification for a list of integers or words separated by whitespace might look like:
using spirit::alpha_p;
using spirit::digit_p;
using spirit::anychar_p;
using spirit::end_p;
using spirit::space_p;
// Inside the definition...
integer = +digit_p; // One or more digits.
word = +alpha_p; // One or more letters.
token = integer | word; // An integer or a word.
token_list = token >> *(+space_p >> token); // A token, followed by 0 or more whitespace-separated tokens.
For more information refer to the documentation, the library is a bit complex at the beginning, but then it gets easier to use (and more powerful).
No. Even function prototypes can have arbitrary levels of nesting, so cannot be expressed with a single regular expression.
If you really are restricting yourself to things very close to your example (exactly 2 arguments, etc.), then could you provide an example of something that doesn't match?
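To make the nesting point concrete: the parameter list is the part where a flat regex tends to break first, because commas inside template arguments or function-pointer types are not separators. A minimal sketch (Python, with a hypothetical helper name split_parameters) that splits only on top-level commas by counting bracket depth:

```python
def split_parameters(params):
    """Split a parameter list on commas that are not nested inside (), <> or []."""
    openers, closers = "(<[", ")>]"
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(params):
        if ch in openers:
            depth += 1
        elif ch in closers:
            depth -= 1
            if depth < 0:
                raise ValueError(f"unbalanced {ch!r} at index {i}")
        elif ch == "," and depth == 0:
            # A comma at depth 0 really separates two parameters.
            parts.append(params[start:i].strip())
            start = i + 1
    if depth != 0:
        raise ValueError("unbalanced brackets in parameter list")
    tail = params[start:].strip()
    if tail or parts:
        parts.append(tail)
    return parts
```

This is only a sketch: it treats every < and > as a bracket, so default arguments or template arguments containing comparison or shift operators would confuse it. A regex could still be used to pull out the fixed prefix (access specifier, calling convention) before handing the rest to a counter like this.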
I am playing around with the System.ComponentModel.DataAnnotations namespace, with a view to getting some validation going on my ASP.NET MVC application.
I have already hit an issue with the RegularExpression annotation.
Because these annotations are attributes they require constant expressions.
OK, I can use a class filled with regex string constants.
The problem with that is I don't want to pollute my regex with escape characters required for the C# parser. My preference is to store the regex in a resources file.
The problem is I can't use those string resources in my data annotations, because they are not constants!
Is there any solution to this?
If not, this seems a significant limitation of using attributes for validation.
In C# there is only one escape code you need (double-quote)... if you use verbatim string literals:
@"like \this\ note \slash here does nothing only quote "" needs doubling
you can even use newline";
I always write regex with @"..." strings - avoids many headaches.
Apparently in .NET 4 there are constructor overloads for the DataAnnotations attributes that take a Func<string>, described as "The function that enables access to validation resources."
You could create a custom validation attribute like this as a proxy which would load the regular expressions from your resource file.
I'm looking to write a Truth Table Generator as a personal project.
There are several web-based online ones here and here.
(Example screenshot of an existing Truth Table Generator)
I have the following questions:
How should I go about parsing expressions like: ((P => Q) & (Q => R)) => (P => R)
Should I use a parser generator like ANTLR or YACC, or use straight regular expressions?
Once I have the expression parsed, how should I go about generating the truth table? Each section of the expression needs to be divided up into its smallest components and re-built from the left side of the table to the right. How would I evaluate something like that?
Can anyone provide me with tips concerning the parsing of these arbitrary expressions and eventually evaluating the parsed expression?
This sounds like a great personal project. You'll learn a lot about how the basic parts of a compiler work. I would skip trying to use a parser generator; if this is for your own edification, you'll learn more by doing it all from scratch.
The way such systems work is a formalization of how we understand natural languages. If I give you a sentence: "The dog, Rover, ate his food.", the first thing you do is break it up into words and punctuation. "The", "SPACE", "dog", "COMMA", "SPACE", "Rover", ... That's "tokenizing" or "lexing".
The next thing you do is analyze the token stream to see if the sentence is grammatical. The grammar of English is extremely complicated, but this sentence is pretty straightforward. SUBJECT-APPOSITIVE-VERB-OBJECT. This is "parsing".
Once you know that the sentence is grammatical, you can then analyze the sentence to actually get meaning out of it. For instance, you can see that there are three parts of this sentence -- the subject, the appositive, and the "his" in the object -- that all refer to the same entity, namely, the dog. You can figure out that the dog is the thing doing the eating, and the food is the thing being eaten. This is the semantic analysis phase.
Compilers then have a fourth phase that humans do not, which is they generate code that represents the actions described in the language.
So, do all that. Start by defining what the tokens of your language are: define a base class Token and a bunch of derived classes, one for each kind (IdentifierToken, OrToken, AndToken, ImpliesToken, RightParenToken...). Then write a method that takes a string and returns an IEnumerable<Token>. That's your lexer.
Second, figure out what the grammar of your language is, and write a recursive descent parser that breaks up an IEnumerable<Token> into an abstract syntax tree that represents grammatical entities in your language.
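As a rough illustration of the lexer step, here is a minimal sketch in Python rather than C#. A single Token tuple with a kind field stands in for the class hierarchy described above, and the generator plays the role of the method returning a sequence of tokens. The operator spellings | and ~ are assumptions; the question only shows => and &.

```python
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    kind: str    # e.g. "IDENT", "IMPLIES"
    text: str    # the matched source text
    index: int   # position in the input, handy for error messages

# Order matters only where patterns could overlap; here they do not.
TOKEN_SPEC = [
    ("IMPLIES", r"=>"),
    ("AND", r"&"),
    ("OR", r"\|"),
    ("NOT", r"~"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("SPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text: str) -> Iterator[Token]:
    """Yield tokens one at a time, skipping whitespace."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at index {pos}")
        if m.lastgroup != "SPACE":
            yield Token(m.lastgroup, m.group(), pos)
        pos = m.end()
```

Calling list(lex("((P => Q) & (Q => R)) => (P => R)")) gives the flat token stream that a recursive descent parser would then consume.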
Then write an analyzer that looks at that tree and figures stuff out, like "how many distinct free variables do I have?"
Then write a code generator that spits out the code necessary to evaluate the truth tables. Spitting IL seems like overkill, but if you wanted to be really buff, you could. It might be easier to let the expression tree library do that for you; you can transform your parse tree into an expression tree, and then turn the expression tree into a delegate, and evaluate the delegate.
Good luck!
I think a parser generator is overkill. You could use the idea of converting an expression to postfix and evaluating postfix expressions (or directly building an expression tree out of the infix expression and using that to generate the truth table) to solve this problem.
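A minimal sketch of the postfix idea, in Python, using the shunting-yard algorithm for the conversion. The operator set and precedences are assumptions for illustration (~ for not, then &, then |, then => as the loosest, with ~ and => right-associative), and the simple findall tokenizer silently ignores characters it does not recognise.

```python
import re

PRECEDENCE = {"~": 4, "&": 3, "|": 2, "=>": 1}
RIGHT_ASSOC = {"~", "=>"}

def to_postfix(expression):
    """Convert an infix boolean expression to a list of postfix tokens."""
    tokens = re.findall(r"=>|[~&|()]|[A-Za-z_]\w*", expression)
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators that bind tighter (or equally, if left-associative).
            while stack and stack[-1] != "(" and (
                PRECEDENCE[stack[-1]] > PRECEDENCE[tok]
                or (PRECEDENCE[stack[-1]] == PRECEDENCE[tok] and tok not in RIGHT_ASSOC)
            ):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise SyntaxError("unbalanced ')'")
            stack.pop()  # discard the matching "("
        else:
            output.append(tok)  # a variable name
    while stack:
        op = stack.pop()
        if op == "(":
            raise SyntaxError("unbalanced '('")
        output.append(op)
    return output

def eval_postfix(postfix, values):
    """Evaluate postfix tokens; 'values' maps variable names to booleans."""
    stack = []
    for tok in postfix:
        if tok == "~":
            stack.append(not stack.pop())
        elif tok in PRECEDENCE:
            right, left = stack.pop(), stack.pop()
            if tok == "&":
                stack.append(left and right)
            elif tok == "|":
                stack.append(left or right)
            else:  # "=>": false only when left is true and right is false
                stack.append((not left) or right)
        else:
            stack.append(values[tok])
    return stack.pop()
```

The postfix list only needs to be built once; each row of the truth table is then one call to eval_postfix with a different values dictionary.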
As Mehrdad mentions, you should be able to hand-roll the parsing in the same time as it would take to learn the syntax of a lexer/parser generator. The end result you want is an Abstract Syntax Tree (AST) of the expression you have been given.
You then need to build some input generator that creates the input combinations for the symbols defined in the expression.
Then iterate across the input set, generating the results for each input combo, given the rules (AST) you parsed in the first step.
How I would do it:
I could imagine using lambda functions to express the AST/rules as you parse the expression, building a symbol table as you go. You could then build the input set and pass the symbol table to the lambda expression tree to calculate the results.
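The input generator and the iteration step might look like the following minimal Python sketch. It assumes the parsed rules are already available as a callable that takes a dictionary of variable values (the symbol table); the helper name truth_table and the hand-written lambda are only there for illustration.

```python
from itertools import product

def truth_table(variables, formula):
    """Return one (inputs, result) row per combination of truth values."""
    rows = []
    for combo in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, combo))   # symbol table for this row
        rows.append((combo, formula(env)))
    return rows

# Stand-in for the lambda a parser would build for "P => Q".
implies = lambda env: (not env["P"]) or env["Q"]

for inputs, result in truth_table(["P", "Q"], implies):
    print(inputs, result)
```

With N variables this produces 2^N rows, which is fine for the handful of variables a typical truth table has.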
If your goal is processing boolean expressions, a parser generator and all the machinery that goes with it is a waste of time, unless you want to learn how they work (then any of them would be fine).
But it is easy to build a recursive-descent parser by hand for boolean expressions, one that computes and returns the results of "evaluating" the expression. Such a parser could be used on a first pass to determine the number of unique variables, where "evaluation" means "count 1 for each new variable name".
Writing a generator to produce all possible truth values for N variables is trivial; for each set of values, simply call the parser again and use it to evaluate the expression, where evaluate means "combine the values of the subexpressions according to the operator".
You need a grammar:
formula = disjunction ;
disjunction = conjunction
            | disjunction "or" conjunction ;
conjunction = term
            | conjunction "and" term ;
term = variable
     | "not" term
     | "(" formula ")" ;
Yours can be more complicated, but for boolean expressions it can't be that much more complicated.
For each grammar rule, write 1 subroutine that uses a global "scan" index into the string being parsed:
int disjunction()
// returns -1 ==> "not a disjunction" (syntax error)
// in mode 1:
//   returns 0 if the disjunction is false
//   returns 1 if the disjunction is true
{
    skipblanks();                          // advance scan past blanks
    int temp1 = conjunction();
    if (temp1 == -1) return -1;            // syntax error
    while (true)
    {
        skipblanks();
        if (matchinput("or") == false) return temp1;
        int temp2 = conjunction();
        if (temp2 == -1) return -1;        // "or" with no right operand: syntax error
        temp1 = temp1 || temp2;
    }
}

int term()
{
    skipblanks();
    if (inputmatchesvariablename())
    {
        variablename = getvariablenamefrominput();
        if (unique(variablename)) numberofvariables += 1;   // first pass: count new names
        return lookupvariablename(variablename);            // get truth-table value for name
    }
    ...
}
Each of your parse routines will be about this complicated. Seriously.
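For comparison, here is a minimal runnable sketch of the same scheme in Python: one method per grammar rule, a first pass that only discovers the variable names, and then one evaluation pass per row. The class and function names are made up for the example, and using a recording dictionary for the first pass is a simplification of the "count 1 for each new variable name" idea.

```python
import re
from itertools import product

KEYWORDS = {"or", "and", "not"}

class Recorder(dict):
    """First-pass 'values': records each new variable name, treating it as False."""
    def __missing__(self, name):
        self[name] = False
        return False

class FormulaParser:
    def __init__(self, text, values):
        # Assumption: variables are plain identifiers; other characters are ignored.
        self.tokens = re.findall(r"[A-Za-z_]\w*|[()]", text)
        self.pos = 0
        self.values = values

    def match(self, word):
        if self.pos < len(self.tokens) and self.tokens[self.pos] == word:
            self.pos += 1
            return True
        return False

    def parse(self):                 # formula = disjunction ;
        result = self.disjunction()
        if self.pos != len(self.tokens):
            raise SyntaxError(f"unexpected token {self.tokens[self.pos]!r}")
        return result

    def disjunction(self):           # conjunction { "or" conjunction }
        result = self.conjunction()
        while self.match("or"):
            right = self.conjunction()   # parse first, so the scan always advances
            result = result or right
        return result

    def conjunction(self):           # term { "and" term }
        result = self.term()
        while self.match("and"):
            right = self.term()
            result = result and right
        return result

    def term(self):                  # variable | "not" term | "(" formula ")"
        if self.match("not"):
            return not self.term()
        if self.match("("):
            result = self.disjunction()
            if not self.match(")"):
                raise SyntaxError("expected ')'")
            return result
        if self.pos == len(self.tokens) or self.tokens[self.pos] in KEYWORDS | {")"}:
            raise SyntaxError("expected a variable, 'not' or '('")
        name = self.tokens[self.pos]
        self.pos += 1
        return self.values[name]

def truth_table(text):
    seen = Recorder()
    FormulaParser(text, seen).parse()        # pass 1: discover the variables
    names = list(seen)
    rows = []
    for combo in product([False, True], repeat=len(names)):
        env = dict(zip(names, combo))
        rows.append((combo, FormulaParser(text, env).parse()))  # pass 2: evaluate
    return names, rows
```

The left-recursive rules in the grammar are written as loops here, which is the usual way to handle them in a hand-written recursive-descent parser.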
The source code of the pyttgen program is available at http://code.google.com/p/pyttgen/source/browse/#hg/src and it generates truth tables for logical expressions. The code is based on the PLY library, so it's very simple :)
I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an assembly language teaching tool which allows students to write, compile, and execute assembly-like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. For example:
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code (e.g. http://www.codeplex.com/NetMassDownloader to download the .NET source).
Change the code to have a read-only property with the failure index.
Make sure your code uses that Regex rather than Microsoft's.
I guess such an index would only have meaning in some simple case, like in your example.
If you take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", which index are you talking about?
It will depend on the algorithm used for matching...
It could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to tell where a regex fails, so you need to take a different approach: compare strings. Use a regex to remove all the parts that could vary, and compare the result with the string that you know does not change.
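Another way to keep error positions without hooks into the regex engine is to leave the regexes at the token level and describe each statement's syntax as data. The sketch below (Python, not the asker's C#) uses the token kinds from the question; the token patterns, the SYNTAX table and the StatementError class are assumptions made up for the example.

```python
import re

class StatementError(Exception):
    """Syntax error that remembers where in the line it happened."""
    def __init__(self, message, index):
        super().__init__(f"{message} at index {index}")
        self.index = index

# Hypothetical token patterns; a real tokenizer would have more kinds.
TOKEN_PATTERNS = [
    ("OPCODE", r"[A-Z]{3}"),
    ("LABEL", r":\w+:"),
    ("WHITESPACE", r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_PATTERNS))

# One expected token sequence per statement, keyed by opcode text.
SYNTAX = {"INP": ["OPCODE", "WHITESPACE", "LABEL", "EOL"]}

def tokenize(line):
    """Return (kind, text, index) triples, always ending with an EOL token."""
    tokens, pos = [], 0
    while pos < len(line):
        m = MASTER.match(line, pos)
        if m is None:
            raise StatementError(f"unrecognised input {line[pos]!r}", pos)
        tokens.append((m.lastgroup, m.group(), pos))
        pos = m.end()
    tokens.append(("EOL", "", len(line)))
    return tokens

def check_statement(line):
    """Raise StatementError at the first token that breaks the expected syntax."""
    tokens = tokenize(line)
    expected = SYNTAX.get(tokens[0][1])
    if expected is None:
        raise StatementError("unknown statement", tokens[0][2])
    for want, (kind, _text, index) in zip(expected, tokens):
        if kind != want:
            raise StatementError(f"expected {want}, found {kind}", index)
```

This replaces the chain of IF statements with a table lookup while still reporting both what was expected and where, which a single whole-line regex cannot do.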
I ran into the same problem, came across your answer, and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
Hope it helps.