Regex: C# method declaration parsing

Could somebody help me parse the following from a C# method declaration: scope, isStatic, name, return type, and the list of parameters with their types? So given a method declaration like this
public static SomeReturnType GetSomething(string param1, int param2)
etc. I need to be able to parse it and get the info above. So in this case
name = "GetSomething"
scope = "public"
isStatic = true
returnType = "SomeReturnType"
and then array of parameter type and name pairs.
Oh, I almost forgot the most important part. It has to account for all the other scopes (protected, private, internal, protected internal), the absence of "static", a void return type, etc.
Please note that REFLECTION is not a solution here. I need REGEX.
So far I have these two:
(?:(?:public)|(?:private)|(?:protected)|(?:internal)|(?:protected internal)\s+)*
(?:(?:static)\s+)*
I guess for rest of the problem I can just get away with string manipulation without regex.

Some thoughts on your problem:
A set of strings that can all be matched by a particular regular expression is called a regular language. The set of strings which are legal method declarations is not a regular language in any version of C#. If you are attempting to find a regular expression which matches every legal C# method declaration and rejects every illegal C# method declaration then you are out of luck.
More generally, regular expressions are almost always a bad idea for anything but the simplest matching problems. (Sorry Jeff.) A far better approach is to first write a lexer, which breaks up the string into a sequence of tokens. Then analyze the token sequence. (Using regular expressions as part of a lexer is not a terrible idea, though you can get by without them.)
I note also that you are glossing over rather a lot of complications in parsing method declarations. You did not mention:
generic/array/pointer/nullable return and formal parameter types
generic type parameter declarations
generic type parameter constraints
unsafe/extern/new/override/virtual/abstract/sealed methods
explicit interface implementation methods
method/parameter/return attributes
partial methods -- slightly tricky to parse, partial is a contextual keyword
comments
I also note that you've not said whether you are guaranteed that the method signature is already good, or if you need to identify bad ones and produce diagnostics as to why they're bad. That's a much harder problem.
Why do you want to do this in the first place? Doing this correctly is rather a lot of work. Perhaps there is an easier way to get what you want?
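The "lex first, then analyze the tokens" idea above could look roughly like the following. This is only a minimal sketch: SimpleDeclarationReader and MethodInfoLite are hypothetical names, and the sketch assumes the simple shape from the question (optional access modifiers, optional static, a plain return type, plain "Type name" parameters). None of the complications listed above are handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: supports only [access] [static] ReturnType Name(Type name, ...).
// Generics, arrays, attributes, ref/out and comments are deliberately out of scope.
public static class SimpleDeclarationReader
{
    public sealed class MethodInfoLite
    {
        public string Scope = "private"; // C# default accessibility for class members
        public bool IsStatic;
        public string ReturnType;
        public string Name;
        public List<KeyValuePair<string, string>> Parameters =
            new List<KeyValuePair<string, string>>();
    }

    // Lexer step: identifiers (optionally dotted) or one of the marks ( ) ,
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, @"[A-Za-z_][\w.]*|[(),]"))
            tokens.Add(m.Value);
        return tokens;
    }

    // Analysis step: walk the token list from left to right.
    public static MethodInfoLite Read(string declaration)
    {
        List<string> tokens = Tokenize(declaration);
        var result = new MethodInfoLite();
        int i = 0;

        var access = new List<string>();
        while (i < tokens.Count && IsAccess(tokens[i])) access.Add(tokens[i++]);
        if (access.Count > 0) result.Scope = string.Join(" ", access);

        if (i < tokens.Count && tokens[i] == "static") { result.IsStatic = true; i++; }

        // Expect: ReturnType Name (
        if (i + 2 >= tokens.Count || tokens[i + 2] != "(")
            throw new FormatException("Unsupported declaration shape.");
        result.ReturnType = tokens[i++];
        result.Name = tokens[i++];
        i++; // skip "("

        while (i < tokens.Count && tokens[i] != ")")
        {
            if (tokens[i] == ",") { i++; continue; }
            if (i + 1 >= tokens.Count)
                throw new FormatException("Parameter without a name.");
            result.Parameters.Add(
                new KeyValuePair<string, string>(tokens[i], tokens[i + 1]));
            i += 2;
        }
        return result;
    }

    private static bool IsAccess(string token)
    {
        return token == "public" || token == "private"
            || token == "protected" || token == "internal";
    }
}
```

Calling SimpleDeclarationReader.Read on the declaration from the question fills Scope, IsStatic, ReturnType, Name, and two type/name pairs. Each item on the list of complications would need its own token kinds and analysis rules, which is where the real work lies.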

I wouldn't bother with using Regex. When you get to the part of interpreting method parameters, it gets really messy (ref and out keywords for example). I don't know if you need support for attribute notation as well, but that would make it a complete mess.
Maybe a C# parser library can be of help. I've found a few on the internet:
http://www.codeplex.com/csparser (C# 1.0)
http://www.csharpparser.com/
Alternatively, you could first feed the code to the compiler at runtime, and then use reflection on the newly created assembly. It will be slower, but pretty much guaranteed to be correct. Even though you seem to be opposed to the idea of using reflection, this can be a viable solution.
Something like this:
List<string> referenceAssemblies = new List<string>()
{
"System.dll"
// ...
};
string source = "public abstract class TestClass {" + input + ";}";
CSharpCodeProvider codeProvider = new CSharpCodeProvider();
// No assembly name specified
CompilerParameters compilerParameters =
new CompilerParameters(referenceAssemblies.ToArray());
compilerParameters.GenerateExecutable = false;
compilerParameters.GenerateInMemory = false;
CompilerResults compilerResults = codeProvider.CompileAssemblyFromSource(
compilerParameters, source);
// Check for successful compilation here
Type testClass = compilerResults.CompiledAssembly.GetTypes().First();
Then use reflection on testClass.
Compiling should be safe without input validation, because you're not executing any of the code. You'd only need very basic checks, such as making sure only 1 method signature is entered.

Well, given the rules you've provided, it would probably be best to use a series of regular expressions rather than trying to come up with a single expression. That expression would be enormous.
If you're sold on a single expression, you'll need a regular expression that uses grouping, look-ahead and look-behind.
http://www.regular-expressions.info/lookaround.html
Even with the limited scope of what you're trying to parse out of it, you'll still need some very specific guidelines on all possibilities.

string test = @"public static SomeReturnType GetSomething(string param1, int param2)";
var match = Regex.Match(test, @"(?<scope>\w+)\s+(?<static>static\s+)?(?<return>\w+)\s+(?<name>\w+)\((?<parms>[^)]+)\)");
Console.WriteLine(match.Groups["scope"].Value);
Console.WriteLine(!string.IsNullOrEmpty(match.Groups["static"].Value));
Console.WriteLine(match.Groups["return"].Value);
Console.WriteLine(match.Groups["name"].Value);
List<string> parms = match.Groups["parms"].ToString().Split(',').ToList();
parms.ForEach(x => Console.WriteLine(x));
Console.Read();
This breaks for parameters whose types contain commas (for example generic types such as Dictionary<string, int>), but it's quite possible to handle that as well.
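One way to handle the comma case, sketched here under the assumption that only angle brackets and square brackets need to be respected, is to replace Split(',') with a small scanner that splits only on commas at nesting depth zero. ParameterSplitter is a hypothetical name, and default values containing commas inside string literals are not covered.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ParameterSplitter
{
    // Splits a parameter list on commas, ignoring commas nested
    // inside <...> (generic arguments) or [...] (array ranks, attributes).
    public static List<string> SplitTopLevel(string parms)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        int depth = 0;

        foreach (char c in parms)
        {
            if (c == '<' || c == '[') depth++;
            else if (c == '>' || c == ']') depth--;

            if (c == ',' && depth == 0)
            {
                // Top-level comma: close the current parameter.
                parts.Add(current.ToString().Trim());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        string last = current.ToString().Trim();
        if (last.Length > 0) parts.Add(last);
        return parts;
    }
}
```

With this in place, the parms group from the regex above can be passed to ParameterSplitter.SplitTopLevel instead of being split directly.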

(?<StringRepresentation>\A\s*(?:(?:(?<Comment>(?://.*\n)|(?:/\*(?:[\w\d!@#$%^&*()\[\]<>,.;\\"':|{}`~+=-_?\s]*)?\*/))|(\[\s*(?<Attributes>\w*)[^\[\]]*?\]))\s*)*?(?:(?:(?<Access>protected\s+internal|internal\s+protected|private|public|protected|internal)\s+)?(?:(?<InheritanceModifier>new|abstract|override|virtual)\s+)?(?:(?<Static>static)\s+)?(?:(?<Extern>extern)\s+)?(?:partial\s+)?)+(?:(?<Type>\w+(?:[\w,.\?\[\]])*?(?:\<.*>)*?)\s+)?(?<Operator>operator\s+)?\s*(?<Name>~?(?:[\w\=+\-\!\~\d\.])+?)\s*(?:\<(?:\w\.*\d*\,*\s*)+\>)*\s*\((?<Parameters>(?:[^()])*?)\)\s*(?:where\s+.+)?\s*(?:\:\s*(?:this|base)\s*(?:\(?[^\(\)]*(?:(?:(?:(?<OpenC>\()[^\(\)]*)+(?:(?<CloseC-OpenC>\))[^\(\)]*?)+)*(?(OpenC)(?!))\)))\s*)?(?:;|(?<ah>\{[^\{\}]*(?:(?:(?:(?<Open>\{)[^\{\}]*)+(?:(?<Close-Open>\})[^\{\}]*?)+)*(?(Open)(?!))\}))))
I can't personally take credit for this one, but the guy who made Regionerate (open source) came up with this and it works pretty well for parsing methods in general.

Related

Is it possible to statically verify structure of c# expression tree arguments?

I have a method
public static class PropertyLensMixins
{
public static ILens<Source> PropertyLens<O,Source>
( this O o
, Expression<Func<O, Source>> selector
)
where O: class, INotifyPropertyChanged
where Source: class, Immutable
{
return new PropertyLens<O, Source>(o, selector);
}
}
and the idea is to use it this way
this.PropertyLens(p=>p.MyProp)
however it is an error to create a nested expression even though the compiler will accept it
this.PropertyLens(p=>p.MyProp.NestProp)
now I can catch this at runtime by parsing the expression tree. For example
var names = ReactiveUI.Reflection.ExpressionToPropertyNames(selector).ToList();
if (names.Count > 1)
throw new ArgumentException("Selector may only be depth 1", "selector");
I was wondering however, is there any clever way to detect this at compile time? I doubt it because the compiler is happy with the type signature but I thought I might ask anyway.
I have also tried a Resharper pattern to match it as an error
$id0$.PropertyLens($id1$=>$id1$.$id2$.$id3$)
with all placeholders being identifiers but Resharper can't seem to match it.
There is no way to make the compiler reject such code.
One possible alternative would be to create a custom diagnostic using Roslyn. That way, all such errors will be marked by VS. Though it might be too much work for something like this.
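Short of a custom diagnostic, the runtime check from the question can also be written directly against the expression tree, without a helper library. The following is a minimal sketch; SelectorGuard and IsDepthOne are hypothetical names, and it assumes the selector body is a plain member access (a selector that needs a conversion, such as boxing a value type to object, is wrapped in a Convert node and would be rejected).

```csharp
using System;
using System.Linq.Expressions;

public static class SelectorGuard
{
    // True only for selectors of the form p => p.Member,
    // i.e. a member access whose target is the lambda parameter itself.
    public static bool IsDepthOne<T, TResult>(Expression<Func<T, TResult>> selector)
    {
        var member = selector.Body as MemberExpression;
        return member != null && member.Expression is ParameterExpression;
    }
}
```

PropertyLens could call SelectorGuard.IsDepthOne(selector) first and throw an ArgumentException when it returns false, which gives the same behavior as the names.Count check.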

What is 'this' used for in C# language?

I've read open source c# code and there is a lot of strange grammar (to me).
They declare method arguments with the this keyword like this:
this object @object
What does it mean?
If I remove the 'this' keyword in front of the data type, will it behave differently?
Sounds like an Extension Method.
The @ symbol allows the variable name to be the same as a C# keyword - I tend to avoid them like the plague personally.
If you remove the this keyword, it will no longer be an extension method, just a static method. Depending on the calling code syntax, it may no longer compile, for example:
public static class IntegerMethods
{
public static int Add(this int i, int value)
{
return i + value;
}
}
int i = 0;
// This is an "extension method" call, and will only compile against extension methods.
i = i.Add(2);
// This is a standard static method call.
i = IntegerMethods.Add(i, 2);
The compiler simply translates all "extension method calls" into standard static method calls anyway, but the extension call syntax only works against valid extension methods, that is, methods whose first parameter is declared with the "this TypeName" syntax.
Some guidelines
These are my own, but I find they are useful.
Discoverability of extension methods can be a problem, so be mindful of the namespace you choose to contain them in. We have very useful stuff under .NET namespaces such as System.Collections or whatever. Less useful but otherwise "common" stuff tends to go under Extensions.<namespace of extended type> such that discoverability is at least consistent via convention.
Try not to extend often used types in broad scope, you don't want MyFabulousExtensionMethod appearing on object throughout your app. If you need to, either constrain the scope (namespace) to be very specific, or bypass extension methods and use a static class directly - these won't pollute the type metadata in IntelliSense.
In extension methods, "this" can be null (due to how they compile into static method calls) so be careful and don't assume that "this" is not null (from the calling side this looks like a successful method call on a null target).
These are optional and not exhaustive, but I find they usually fall under the banner of "good" advice. YMMV.
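The point about a null "this" can be illustrated with a short sketch. StringExtensions and IsBlank are hypothetical names used only for this example.

```csharp
using System;

public static class StringExtensions
{
    // Hypothetical extension method, for illustration only.
    public static bool IsBlank(this string text)
    {
        // 'text' may be null here: the call site s.IsBlank() compiles to
        // StringExtensions.IsBlank(s), so no null check happens before the call.
        return text == null || text.Trim().Length == 0;
    }
}
```

With string s = null, the call s.IsBlank() returns true rather than throwing a NullReferenceException, which is exactly why the body has to guard against null itself.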
The 'this type name' syntax is used for extension methods.
For example, if I wanted to add an UnCamelCase method to string (so I could call "HelloWorld".UnCamelCase() to produce "Hello World"), I'd write this:
public static string UnCamelCase(this string text)
{
/*match any instances of a lower case character followed by an upper case
* one, and replace them with the same characters with a space between them*/
return Regex.Replace(text, "([a-z])([A-Z])", "$1 $2");
}
this string text means the specific instance of the string that you're working with, and text is the identifier for it.
The @ syntax allows for variable names that are ordinarily reserved.

Is there a programmatic way to identify c# reserved words?

I'm looking for a function like
public bool IsAReservedWord(string TestWord)
I know I could roll my own by grabbing a reserve word list from MSDN. However I was hoping there was something built into either the language or .NET reflection that could be relied upon so I wouldn't have to revisit the function when I move to newer versions of C#/.NET.
The reason I'm looking for this is I'm looking for a safeguard in .tt file code generation.
CSharpCodeProvider cs = new CSharpCodeProvider();
var test = cs.IsValidIdentifier("new"); // returns false
var test2 = cs.IsValidIdentifier("new1"); // returns true
The Microsoft.CSharp.CSharpCodeGenerator has an IsKeyword(string) method that does exactly that. However, the class is internal, so you have to use reflection to access it and there's no guarantee it will be available in future versions of the .NET framework. Please note that IsKeyword doesn't take care of different versions of C#.
The public method System.CodeDom.Compiler.ICodeGenerator.IsValidIdentifier(string) rejects keywords as well. The drawback is this method does some other validations as well, so other non-keyword strings are also rejected.
Update: If you just need to produce a valid identifier rather than decide if a particular string is a keyword, you can use ICodeGenerator.CreateValidIdentifier(string). This method takes care of strings with two leading underscores as well by prefixing them with one more underscore. The same holds for keywords. Note that ICodeGenerator.CreateEscapedIdentifier(string) prefixes such strings with the @ sign.
Identifiers starting with two leading underscores are reserved for the implementation (i.e. the C# compiler and associated code generators etc.), so avoiding such identifiers in your code is generally a good idea.
Update 2: The reason to prefer ICodeGenerator.CreateValidIdentifier over ICodeGenerator.CreateEscapedIdentifier is that __x and @__x are essentially the same identifier. The following won't compile:
int __x = 10;
int @__x = 20;
If the compiler generated and used a __x identifier, and the user used @__x as the result of a call to CreateEscapedIdentifier, a compilation error would occur. With CreateValidIdentifier this situation is prevented, because the custom identifier is turned into ___x (three underscores).
However I was hoping there was something built into either the language or .NET reflection that could be relied upon so I wouldn't have to revisit the function when I move to newer versions of C#/.NET.
Note that C# has never added a new reserved keyword since v1.0. Every new keyword has been an unreserved contextual keyword.
Though it is of course possible that we might add a new reserved keyword in the future, we have tried hard to avoid doing so.
For a list of all the reserved and contextual keywords up to C# 5, see
http://ericlippert.com/2009/05/11/reserved-and-contextual-keywords/
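The practical difference between reserved and contextual keywords can be seen in a short sketch. ContextualKeywordDemo is a hypothetical name; the identifiers chosen are a few well-known contextual keywords and one reserved keyword.

```csharp
using System;

public static class ContextualKeywordDemo
{
    public static int Sum()
    {
        // Contextual keywords are only special in particular positions,
        // so they remain legal as ordinary identifiers here.
        int var = 1;
        int async = 2;
        int yield = 3;
        int partial = 4;

        // A reserved keyword can only be used as an identifier with the @ prefix.
        int @new = 5;

        return var + async + yield + partial + @new;
    }
}
```

This is why code written before a contextual keyword was introduced keeps compiling, and why a generator mainly has to escape the reserved keywords.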
static System.CodeDom.Compiler.CodeDomProvider CSprovider =
Microsoft.CSharp.CSharpCodeProvider.CreateProvider("C#");
public static string QuoteName(string name)
{
return CSprovider.CreateEscapedIdentifier(name);
}
public static bool IsAReservedWord(string TestWord)
{
return QuoteName(TestWord) != TestWord;
}
Since the definition of CreateEscapedIdentifier is:
public string CreateEscapedIdentifier(string name)
{
if (!IsKeyword(name) && !IsPrefixTwoUnderscore(name))
{
return name;
}
return ("@" + name);
}
it will properly identify __ identifiers as reserved.

Finding methods in source code using regular expressions

I have a program which looks in source code, locates methods, and performs some calculations on the code inside of each method. I am trying to use regular expressions to do this, but this is my first time using them in C# and I am having difficulty testing the results.
If I use this regular expression to find the method signature:
((private)|(public)|(sealed)|(protected)|(virtual)|(internal))+([a-z]|[A-Z]|[0-9]|[\s])*([\()([a-z]|[A-Z]|[0-9]|[\s])*([\)|\{]+)
and then split the source code by this method, storing the results in an array of strings:
string[] MethodSignatureCollection = regularExpression.Split(SourceAsString);
would this get me what I want, i.e. a list of methods including the code inside of them?
I would strongly suggest using Reflection (if it is appropriate) or CSharpCodeProvider.Parse(...) (as recommended by rstevens)
It can be very difficult to write a regular expression that works in all cases.
Here are some cases you'd have to handle:
public /* comment */ void Foo(...) // Comments can be everywhere
string foo = "public void Foo(...){}"; // Don't match signatures in strings
private __fooClass _Foo() // Underscores are ugly, but legal
private void @while() // Identifier escaping
public override void Foo(...) // Have to recognize overrides
void Foo(); // Defaults to private
void IDisposable.Dispose() // Explicit implementation
public // More comments // Signatures can span lines
void Foo(...)
private void // Attributes
Foo([Description("Foo")] string foo)
#if(DEBUG) // Don't forget the pre-processor
private
#else
public
#endif
int Foo() { }
Notes:
The Split approach will throw away everything that it matches, so you will in fact lose all the "signatures" that you are splitting on.
Don't forget that signatures can have commas in them
{...} can be nested, your current regexp could consume more { than it should
There is a lot of other stuff (preprocessor commands, using statements, properties, comments, enum definitions, attributes) that can show up in code, so just because something is between two method signatures does not make it part of a method body.
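The note about nested {...} can be made concrete. Finding where a body ends requires counting brace depth, which is something a plain regular expression cannot do. The sketch below uses hypothetical names (BraceMatcher, FindMatchingBrace) and deliberately ignores braces inside string literals, character literals, and comments, so it is an illustration of the counting idea rather than a usable parser.

```csharp
using System;

public static class BraceMatcher
{
    // Returns the index of the '}' that closes the '{' at openIndex,
    // or -1 if the braces are unbalanced.
    // Assumption: braces inside strings, chars, and comments are NOT handled.
    public static int FindMatchingBrace(string source, int openIndex)
    {
        if (openIndex < 0 || openIndex >= source.Length || source[openIndex] != '{')
            throw new ArgumentException("openIndex must point at '{'.", "openIndex");

        int depth = 0;
        for (int i = openIndex; i < source.Length; i++)
        {
            if (source[i] == '{')
            {
                depth++;
            }
            else if (source[i] == '}')
            {
                depth--;
                if (depth == 0) return i; // back at the starting level
            }
        }
        return -1;
    }
}
```

Given the position of the brace that follows a matched signature, the substring up to the returned index is the candidate method body. Handling the ignored cases correctly is what pushes this toward a real lexer.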
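A better approach may be CSharpCodeProvider.Parse(), which is meant to parse C# source code into a CodeCompileUnit. Be aware that not every CodeDOM provider implements Parse, so verify that it actually works for C# in your framework version before relying on it.
You can then walk through the namespaces, types, and methods in that compile unit.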
PM> install-package ICSharpCode.NRefactory
using ICSharpCode.NRefactory.CSharp;
var parser = new CSharpParser();
var syntaxTree = parser.Parse(File.ReadAllText(sourceFilePath));
var result = syntaxTree.Descendants.OfType<MethodDeclaration>()
.FirstOrDefault(y => y.NameToken.Name == methodName);
if (result != null)
{
return result.ToString(FormattingOptionsFactory.CreateSharpDevelop()).Trim();
}
It is feasible, I guess, to get something working using regexes. However, this requires reading the C# language specification very carefully and understanding the C# grammar in depth; it is not a simple problem. I know you've said you want to store the methods as arrays of strings, but presumably there is something beyond that. Reflection has already been suggested; if that does not do what you want, you should consider ANTLR (ANother Tool for Language Recognition). ANTLR does have C# grammars available.
http://www.antlr.org/about.html
No, those access modifiers can also be used for internal classes and fields, among other things. You'd need to write a full C# parser to get it right.
You can do what you want using reflection. Try something like the following:
var methods = typeof (Foo).GetMethods();
foreach (var info in methods)
{
var body = info.GetMethodBody();
}
That probably has what you need for your calculations.
If you need the original C# source code you can't get it with reflection. Don't write your own parser. Use an existing one, listed here.

Is there an easy way to parse a (lambda expression) string into an Action delegate?

I have a method that alters an "Account" object based on the action delegate passed into it:
public static void AlterAccount(string AccountID, Action<Account> AccountAction) {
Account someAccount = accountRepository.GetAccount(AccountID);
AccountAction.Invoke(someAccount);
someAccount.Save();
}
This works as intended...
AlterAccount("Account1234", a => a.Enabled = false);
...but now what I'd like to try and do is have a method like this:
public static void AlterAccount(string AccountID, string AccountActionText) {
Account someAccount = accountRepository.GetAccount(AccountID);
Action<Account> AccountAction = MagicLibrary.ConvertMagically<Action<Account>>(AccountActionText);
AccountAction.Invoke(someAccount);
someAccount.Save();
}
It can then be used like:
AlterAccount("Account1234", "a => a.Enabled = false");
to disable account "Account1234".
I've had a look at the linq dynamic query library, which seems to do more or less what I want but for Func type delegates, and my knowledge of Expression trees etc isn't quite good enough to work out how to achieve what I want.
Is there an easy way to do what I want, or do I need to learn expressions properly and write a load of code?
(The reason I want to do this is to allow an easy way of bulk updating account objects from a powershell script where the user can specify a lambda expression to perform the changes.)
The Dynamic LINQ library is a fine choice, as it'll generate expressions you can compile to code in a lightweight fashion.
The example you provided actually produces a boolean -- so you should be able to ask for a Func and it might sort it out.
Edit: This of course is wrong, as Expressions don't have assignment in them at all.
So, another potential way is to take two lambdas. One to find the property you want, one to provide a value:
(a => a.AccountId), (a => true)
Then use reflection to set the property referenced in the first lambda with the result of the second one. Hackish, but it's still probably lightweight compared to invoking the C# compiler.
This way you don't have to do much codegen yourself - the expressions you get will contain most everything you need.
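The two-lambda idea could be sketched as follows. PropertySetter and DemoAccount are hypothetical names (DemoAccount stands in for the question's Account type), and the sketch assumes both lambdas are already compiled C#. Turning user-supplied strings into those lambdas is a separate step that this code does not cover.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public static class PropertySetter
{
    // Sets the property selected by 'property' to the value computed by 'value'.
    public static void Set<T, TValue>(
        T target,
        Expression<Func<T, TValue>> property,
        Func<T, TValue> value)
    {
        // The first lambda is only inspected, never executed.
        var member = property.Body as MemberExpression;
        PropertyInfo info = member == null ? null : member.Member as PropertyInfo;
        if (info == null || !info.CanWrite)
            throw new ArgumentException("Selector must be a writable property.", "property");

        // The second lambda is executed to produce the new value.
        info.SetValue(target, value(target), null);
    }
}

// Hypothetical stand-in for the question's Account type.
public class DemoAccount
{
    public bool Enabled { get; set; }
}
```

For example, PropertySetter.Set(account, a => a.Enabled, a => false) disables the account without needing an assignment inside an expression tree.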
You may try this: Dynamic Lambda Expressions Using An Isolated AppDomain
It compiles a lambda expression using the CodeDOM compiler. So that the in-memory assembly that gets created can be disposed of, the compiler runs in an isolated AppDomain. To pass the expression across the domain boundary, it has to be serialized. Alas, Expression<> is not serializable, so a trick has to be used. All the details are explained in the post.
I'm the author of that component, by the way. I would very much like to hear your feedback on it.
There is no general way to parse a string into a lambda expression without a full compilation, because lambda expressions can reference things that are defined outside the lambda expression. I know of no library that handles the specific case you want. There's a long discussion of this on a thread on a C# discussion group.
The easiest way to get what you want is to compile a method at runtime. You can write a function that takes in the string "a.Enabled = true; return a;" and sticks that in the middle of a function that takes an Account as a parameter. I would use this library as a starting point, but you can also use the function mentioned on another thread.
That's easy:
Use CodeDom to generate the module containing the "surrounding class" you'll use to build the expression; this class must implement the interface known to your application
Use CodeSnippetExpression to inject the expression into its member.
Use the Activator type to create an instance of this class at runtime.
Basically, you need to build the following class with CodeDom:
using System;
using MyNamespace1;
using ...
using MyNamespace[N];
namespace MyNamespace.GeneratedTypes
{
public class ExpressionContainer[M] : IHasAccountAction
{
public Action<Account> AccountAction {
get {
return [CodeSnippetExpression must be used here];
}
}
}
}
Assuming that IHasAccountAction is:
public interface IHasAccountAction {
// Interface members are implicitly public, so no modifier is written here.
Action<Account> AccountAction { get; }
}
If this is done, you can get the expression compiled from string with ease. If you need its expression tree representation, use Expression<Action<Account>> instead of Action<Account> in generated type.
