Background: I'm doing some complicated code generation that requires me to extract the methods within a C# interface file. I cannot simply use reflection because this code will feed a T4 template which will not have the compiled code to reflect upon. Thus I am attempting parsing. I can easily make my own parser, but it would be nice if there was a regular expression solution.
Question: Is there a regex pattern (and if so, what is it) that would match the method declarations, including the return types and parameters, in the string below, using C#'s regular expression library?
string testing = @"
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApplication1
{
public interface Service
{
int Test1(int a);
int Test2(int a, int b);
int Test3(
int a,
int b);
int Test4(out int a);
}
}
";
The regex pattern I desire should make four matches:
"int Test1(int a);"
"int Test2(int a, int b);"
"int Test3( int a, int b);" [note: #3 would be multi-line]
"int Test4(out int a);"
Solution Attempt: Here is possibly the closest I have come to a regex solution thus far:
string WhiteSpacePattern = @"\s+";
string PossibleWhiteSpacePattern = @"\s*";
string CsharpWordPattern = @"[a-zA-Z_]+";
string ParenthesesPattern = @"[(][\s\S]*?[)]";
string DoubleCsharpWordPattern = CsharpWordPattern + WhiteSpacePattern + CsharpWordPattern;
string MethodDeclarationPattern =
DoubleCsharpWordPattern +
PossibleWhiteSpacePattern +
ParenthesesPattern;
Pattern usage example:
MatchCollection tests = Regex.Matches(testing, MethodDeclarationPattern);
The individual patterns work perfectly (CsharpWordPattern, ParenthesesPattern, WhiteSpacePattern, and PossibleWhiteSpacePattern). However, when I put them all together into a single pattern (MethodDeclarationPattern), the full pattern fails.
How does MethodDeclarationPattern or my usage example need to be altered so that it will start matching the method declarations in the interface code?
To match literal parens, escape them with backslashes:
string ParenthesesPattern = @"\([\s\S]*?\)";
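That regex snippet matches an opening parenthesis, then as few characters as possible of any kind (since [\s\S] covers whitespace and non-whitespace alike, newlines included), then a closing parenthesis. You're putting it at the end of your overall regex.
Your complete concatenated regex looks like this:
[a-zA-Z_]+\s+[a-zA-Z_]+\s*[(][\s\S]*?[)]
Identifier, whitespace, identifier, optional whitespace, open paren, anything, close paren.
The identifier class [a-zA-Z_] contains no digits, though, so the second identifier stops at "Test" and the next character is a digit rather than "(". For that to match, the method name has to be free of digits, like this:
"int foo(int a)"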
I believe you'll have better success with something like this:
string openParenPattern = @"\([\s\S]*?";
string closeParenPattern = @"[\s\S]*?\)";
What you really need, conceptually, is this (leaving out space -- no need to clutter it up with that):
identifier
identifier
open paren
((ref|out)? identifier identifier comma)*
((ref|out)? identifier identifier)?
close paren
You know all the syntax for that, I think. You'll have nested groups. Looking at it, I'm really starting to warm up to your idea of putting sub-regexes in string variables and then concatenating them.
The following code matches all four method declarations in your test string:
// This has one bug: It matches "int foo(int a,)"
// Somebody good with regexes could fix that.
var methodPattern =
// return type
identPattern + spacePattern
// method name
+ identPattern + spacePattern
// open paren
+ openParenPattern + spacePattern
// Zero or more parameters followed by commas
+ "(" + paramPattern + spacePattern + "," + spacePattern + ")*" + spacePattern
// Final (or only) parameter not followed by a comma
+ "(" + paramPattern + spacePattern + ")?" + spacePattern
// Close paren
+ closeParenPattern;
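For reference, here is a self-contained sketch of how the pieces above could be defined and combined. The names (InterfaceMethodFinder, Ident, Space, Param, FindMethods) are made up for illustration, and it makes two assumptions that go beyond the code above: identifiers may contain digits after the first character, and a comma must always be followed by another parameter, which avoids matching "int foo(int a,)". It only handles simple types like the ones in the question; generics, arrays, and attributes would need more work.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class InterfaceMethodFinder
{
    // Hypothetical building blocks, concatenated the same way as in the question.
    const string Ident = @"[A-Za-z_]\w*";   // a letter or underscore, then letters, digits, underscores
    const string Space = @"\s*";            // optional whitespace, newlines included
    const string Param = @"(?:(?:ref|out)\s+)?" + Ident + @"\s+" + Ident;

    static readonly string MethodPattern =
        // return type and method name
        @"\b" + Ident + @"\s+" + Ident + Space
        // open paren
        + @"\(" + Space
        // optional parameter list: one parameter, then zero or more ", parameter"
        + "(?:" + Param + "(?:" + Space + "," + Space + Param + ")*)?"
        // close paren and terminating semicolon
        + Space + @"\)" + Space + ";";

    // Returns the text of every method declaration found in the source.
    public static string[] FindMethods(string source)
    {
        return Regex.Matches(source, MethodPattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

Calling InterfaceMethodFinder.FindMethods(testing) on the string from the question should give the four declarations, with the third one spanning several lines.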
I have a big file with a bunch of data in it, and I would like to grab only parts of it. Let me explain which parts I'm interested in:
(imagine "x" as an IP Address)
(imagine "?" as any alphanumeric string of any length)
(imagine "MD5" as an MD5 hash)
(Actual text file below, though not literally)
'xxx.xxx.xxx.xxx'
xxxxxxxxxx
'?'
'?'
'MD5'
Now my question is this: how could I identify the line
'xxx.xxx.xxx.xxx'
anywhere in the file, and then automatically write both of the '?' entries and the 'MD5' entry to another file for each IP address found?
So in a nutshell, the program should start at the beginning of the file and read the contents. If it hits an IP address (regex: '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b' works fine for me), it should skip the line below it, then copy the following data to another file until it hits the MD5 entry (regex: '[a-f0-9]{32}' works fine for me), then continue from that point looking for the next IP address, and so on until it reaches the end of the file.
I'm trying to do this myself, but I don't even know where to start or which methods to use.
You can use the following to match the content that you are looking for, and copy it to the desired location or file:
('\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')(\s*.+\s*)([\s\S]*?)('\b[a-f0-9]{32}\b')
And extract $1$3$4
Code:
String regex = "('\\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b')(\\s*.+\\s*)([\\s\\S]*?)('\\b[a-f0-9]{32}\\b')";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(myString);
while (m.find()) {
    System.out.println("IP: " + m.group(1));
    // group(2) is the line right after the IP address, which is skipped
    System.out.println("entries: " + m.group(3));
    System.out.println("MD5: " + m.group(4));
}
Given the fact that your file is machine generated and that the overall pattern is pretty specific, I don't think it's necessary to be overly specific with the IP address.
Matching it as "a bunch of digits and dots in single quotes" is probably enough, in the context of the rest of the pattern (*).
Here is an expression that matches your entire requirement into named groups:
^'(?<IP>[\d.]+)'\s+
^(?<ID>\w*)\s+
^'(?<line1>\w*)'\s+
^'(?<line2>\w*)'\s+
^'(?<MD5>[A-Fa-f0-9]{32})'
Use it with the Multiline and IgnorePatternWhitespace regex options (the latter means you can keep the regex layout for better readability).
(*) Besides, regex patterns for IP addresses are literally all over the place, in countless examples. Of course you can use something more sophisticated than '[\d.]+' if you think it's necessary.
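To show how the two options fit together, here is a minimal sketch. The class and method names (RecordExtractor, Extract) are only placeholders, and it assumes the whole file has already been read into one string:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RecordExtractor
{
    // The pattern keeps its multi-line layout; IgnorePatternWhitespace drops the
    // line breaks and indentation when the regex is compiled.
    const string RecordPattern = @"
        ^'(?<IP>[\d.]+)'\s+
        ^(?<ID>\w*)\s+
        ^'(?<line1>\w*)'\s+
        ^'(?<line2>\w*)'\s+
        ^'(?<MD5>[A-Fa-f0-9]{32})'";

    // Returns the two entries and the MD5 hash of every record, in file order.
    public static List<string> Extract(string text)
    {
        var options = RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace;
        var lines = new List<string>();
        foreach (Match m in Regex.Matches(text, RecordPattern, options))
        {
            lines.Add(m.Groups["line1"].Value);
            lines.Add(m.Groups["line2"].Value);
            lines.Add(m.Groups["MD5"].Value);
        }
        return lines;
    }
}
```

The returned list can then be written to the other file, for example with File.WriteAllLines, using whatever input and output paths apply in your case.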
I have tried this out in Java as below.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegex
{
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        String input = "assasasa 123.234.223.223 333 aad sddsf 343sdd sds23343 ssdfs33344 MD5=aas jjsjjdjd 143.234.223.223 333 aad sddsf 343sdd sds23343 ssdfs33344 MD5=asas";
        String regexPattern = "(\\b[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\b).*?([A-Z a-z]+[0-9]+=.*?\\s)";
        Pattern pattern = Pattern.compile(regexPattern);
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            System.out.println("start(): " + m.start());
            System.out.println("end(): " + m.end());
            System.out.println("match: " + m.group());
            System.out.println("IP: " + m.group(1));
            System.out.println("MD5 entry: " + m.group(2));
        }
    }
}
I am trying to parse a header and create method stubs from the interface/method declarations.
I want to take C++ COM method declarations like this:
STDMETHOD(GetCubeMapSurface)(THIS_ D3DCUBEMAP_FACES FaceType,UINT Level,IDirect3DSurface9** ppCubeMapSurface) PURE;
Then modify it to generate a C++ method stub from it like this:
HRESULT __stdcall WrapIDirect3DCubeTexture9::GetCubeMapSurface(D3DCUBEMAP_FACES FaceType, UINT Level, IDirect3DSurface9 * * ppCubeMapSurface)
{
}
I am a little unsure whether I should be using regex for this or .NET string functions, and I am confused about how exactly to implement it either way.
I have quite a few methods to do, so creating a tool seems like the right thing to do.
Can anyone help guide me in the right direction?
EDIT: I should have added that I was looking for some help on how I should be implementing it. I wasn't sure whether I should tokenize all words, special characters, and whitespace with a regex like this, and then parse and process the resulting tokens.
"(\d[x0-9a-fA-F.UL]*|\w+|\s+|"[^"]*"|.)"
Now that seems like overkill, and I think I was overanalyzing the whole thing. I quickly created an implementation with .NET string functions, and then saw that Caesay had helped me out in the regex direction. So I came up with two implementations.
I have decided to go with the regex implementation, since I will be doing some other advanced processing and parsing, and regex will make that easier. The implementations are below.
String based implementation:
if (line.StartsWith(" STDMETHOD"))
{
string newstr = line.Replace(" STDMETHOD(", "HRESULT __stdcall WrapIDirect3DCubeTexture9::");
newstr = StringExtensions.RemoveFirst(newstr, ")");
newstr = newstr.Replace("THIS_ ", "");
newstr = newstr.Replace(" PURE;", Environment.NewLine + "{ " + Environment.NewLine + Environment.NewLine + "}");
textBox2.AppendText(newstr + Environment.NewLine);
}
String extension class, taken from "C# - Simplest way to remove first occurrence of a substring from another string":
public static class StringExtensions
{
public static string RemoveFirst(this string source, string remove)
{
int index = source.IndexOf(remove);
return (index < 0)
? source
: source.Remove(index, remove.Length);
}
}
Now for the Regex implementation:
if (line.StartsWith(" STDMETHOD"))
{
Regex regex = new Regex(@"\(.*?\)");
MatchCollection matches = regex.Matches(line);
string newstr = String.Format(@"HRESULT __stdcall WrapIDirect3DCubeTexture9::{0}({1})", matches[0].Value.Trim('(', ')'), matches[1].Value.Trim('(', ')'));
newstr = newstr.Replace("THIS_ ", "");
textBox2.AppendText(newstr + Environment.NewLine + "{" + Environment.NewLine + Environment.NewLine + "}" + Environment.NewLine);
}
I will write you some code to help get you started.
If you start with a minimal output string containing the variables, it will be easier to see what needs to be done, so:
String.Format(@"HRESULT __stdcall WrapIDirect3DCubeTexture9::{0}({1})
{{
}}", "methodName", "arguments");
Here we can see there are two items we need to extract from the original string: the method name and the arguments. I would suggest using a regex to match what is in the parentheses in the original string. This will give you two matches, the method name and the arguments. You will need to do post-processing on the arguments string, but this will give you an idea.
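As a rough sketch of that idea, the two items can also be pulled out with one pattern and two capture groups, followed by a small clean-up of the arguments. StubGenerator and MakeStub are hypothetical names, and the clean-up shown (dropping the leading THIS_ or THIS marker and normalizing the spacing after commas) is just one guess at the post-processing you might want:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class StubGenerator
{
    // Group 1 is the method name, group 2 the raw argument list.
    static readonly Regex Declaration =
        new Regex(@"STDMETHOD\((\w+)\)\((.*?)\)\s*PURE;");

    // Builds an empty method stub for one STDMETHOD line, or returns null
    // when the line is not a STDMETHOD declaration.
    public static string MakeStub(string line, string className)
    {
        Match m = Declaration.Match(line);
        if (!m.Success)
            return null;

        // Post-processing: remove the leading THIS_ / THIS marker, then put
        // exactly one space after each comma.
        string args = Regex.Replace(m.Groups[2].Value, @"^\s*THIS_?\b\s*", "");
        args = string.Join(", ", args.Split(',').Select(a => a.Trim()));

        return string.Format("HRESULT __stdcall {0}::{1}({2})\n{{\n}}",
                             className, m.Groups[1].Value, args);
    }
}
```

The class name is passed in as a parameter so the same helper can be reused for other wrapper classes, not only WrapIDirect3DCubeTexture9.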
Example String
This is an important example about regex for my work.
I can extract "important example about regex" with this snippet: (?<=an).*?(?=for)
But I would like to extract the string from right to left. In terms of this question's example, the first position must be (for) and the second position must be (an).
I mean the extraction process should work backwards.
I tried to do what I want with the code below, in the else if case, but it doesn't work.
public string FnExtractString(string _QsString, string _QsStart, string _QsEnd, string _QsWay = "LR")
{
if (_QsWay == "LR")
return Regex.Match(_QsString, @"(?<=" + _QsStart + ").*?(?=" + _QsEnd + ")").Value;
else if (_QsWay == "RL")
return Regex.Match(_QsString, @"(?=" + _QsStart + ").*?(<=" + _QsEnd + ")").Value;
else
return _QsString;
}
Thanks in advance.
EDIT
My real example as below
#Var|First String|ID_303#Var|Second String|ID_304#Var|Third String|DI_t55
When I pass two strings to my method (for example "|ID_304" and "#Var|"), I would like to extract "Second String". This example is only a small piece of my real string, though, and my string varies.
No need for lookahead or lookbehind! You could just use:
\san\s(.*)\sfor\s
and read the text from capture group 1. The \s demands whitespace, so you don't match the "an" inside "important".
One potential problem in your current solution is that the strings passed in may contain special characters, which need to be escaped with Regex.Escape before concatenation:
return Regex.Match(_QsString, @"(?<=" + Regex.Escape(_QsStart) + ").*?(?=" + Regex.Escape(_QsEnd) + ")").Value;
As for matching right to left, I don't understand the requirement.
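One possible reading of the right-to-left requirement, going by the edited example in the question, is "find the end delimiter, then take the text back to the nearest start delimiter before it". If that is what is meant, the sketch below does it in a single left-to-right pattern by refusing to let the captured text run across another start delimiter. DelimitedText and Between are made-up names, and this is only a guess at the intended behavior:

```csharp
using System;
using System.Text.RegularExpressions;

static class DelimitedText
{
    // Returns the text between the start delimiter closest to the end
    // delimiter and the end delimiter itself, or "" when nothing matches.
    public static string Between(string input, string start, string end)
    {
        string s = Regex.Escape(start);
        string e = Regex.Escape(end);

        // (?:(?!s).)*? takes one character at a time and never steps onto
        // the beginning of another start delimiter.
        string pattern = s + "((?:(?!" + s + ").)*?)" + e;

        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

With the sample string from the question, passing "#Var|" as the start and "|ID_304" as the end should give "Second String" rather than everything from the first "#Var|" onward.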
I need a regular expression to replace text in a string:
string s="Insert into VERSION (ENTRYID,APPVERSION,PLATFORMVERSION,TIMESTAMPED,USERNAME,SQLSCRIPTNAME,COMMENTS)VALUES(SWS_Version_ID.\"NEXTVAL\",'[3.02.01P20]','[4.1.38orcl]',sysdate,null,null,null);";
I need to replace the 3.02.01P20 in square brackets with NEW_VERSION.
The version may be something other than 3.02.01P20, but as the line shows, it always follows the first opening square bracket.
Also, let me know what changes I would have to make if the version followed, say, the third opening square bracket ([), so that I won't have to write a separate question for each case.
using System;
using System.Text.RegularExpressions;
class Tester
{
public static void Main()
{
string s = "Insert into VERSION " +
"(ENTRYID,APPVERSION,PLATFORMVERSION,TIMESTAMPED,USERNAME,SQLSCRIPTNAME,COMMENTS)" +
"VALUES(SWS_Version_ID.\"NEXTVAL\",'[3.02.01P20]','[4.1.38orcl]',sysdate,null,null,null);";
Match m = (new Regex("^(.*)(\\[.*?\\])(.*?)(\\[.*?\\])(.*)$")).Match(s);
//Console.WriteLine("{0},{1}", m.Groups[2].Value, m.Groups[3].Value);
string[] parts = {
m.Groups[1].Value,
m.Groups[2].Value, //[3.02.01P20]
m.Groups[3].Value, //','
m.Groups[4].Value, //[4.1.38orcl]
m.Groups[5].Value //tail
};
parts[1] = "[NEW_VERSION]";
Console.WriteLine(string.Join("",parts));
}
}
You mean like this?
Try this:
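string output = Regex.Replace(s, @"(.*'\[)(.*)(\]'.*)('\[.*)", "${1}" + newVer + "$3$4");
(Writing ${1} rather than $1 keeps a new version that starts with a digit from being read as part of the group number.)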
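For the follow-up about a version that sits after the second or third opening bracket, one option is to count the bracketed values as they are matched and replace only the one you want. This is a sketch under that assumption; BracketReplacer and ReplaceNthBracket are hypothetical names, and n is 1-based:

```csharp
using System;
using System.Text.RegularExpressions;

static class BracketReplacer
{
    // Replaces the contents of the n-th [ ... ] pair (counting from 1) and
    // leaves every other bracketed value untouched.
    public static string ReplaceNthBracket(string input, int n, string newValue)
    {
        int seen = 0;
        return Regex.Replace(input, @"\[[^\]]*\]",
            m => ++seen == n ? "[" + newValue + "]" : m.Value);
    }
}
```

Because the replacement text is produced by a delegate instead of a substitution string, a new version containing "$" or starting with a digit needs no special handling.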
In the program I'm working on, I need to strip the tags around certain parts of a string, and then insert a comma after each character WITHIN the tag (but not after any other characters in the string). In case this doesn't make sense, here's an example of what needs to happen:
This is a string with a < a > tag < /a > (please ignore the spaces within the tag)
(needs to become)
This is a string with a t,a,g,.
Can anyone help me with this? I've managed to strip the tags using RegEx, but I can't figure out how to insert the commas only after the characters contained within the tag. If someone could help that would be great.
@Dour High Arch I'll elaborate a little bit. The code is for a text-to-speech app that won't recognize SSML tags. When the user enters a message for the text-to-speech app, they have the option of enclosing a word in a < a > tag to make the speaker say the word as an acronym. Because the acronym SSML tag won't work, I want to remove the < a > tag whenever present, and place commas after each character contained in the tag to fake it out (ex: < a > test< /a > becomes t,e,s,t,). All the non-tagged words in the string do not need commas after them, just those enclosed in tags (see my first example if need be).
If you have figured out the regex, I would imagine it would be simple to capture the inner text of the tag. Then it's a really simple operation to insert the commas:
var commaString = string.Join(",", capturedString.ToList());
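If it helps, the capture and the comma insertion can be done in one pass with a match evaluator. The following is only a sketch: AcronymTags and Expand are invented names, the tags are assumed to be written exactly as <a> and </a> with no attributes or nesting, and it puts a comma after every character (including the last) to match the output asked for in the question:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AcronymTags
{
    // Removes each <a>...</a> pair and puts a comma after every character
    // that was inside it; text outside the tags is left alone.
    public static string Expand(string input)
    {
        return Regex.Replace(input, @"<a>(.*?)</a>",
            m => string.Concat(m.Groups[1].Value.Select(c => c + ",")));
    }
}
```

For the example in the question, AcronymTags.Expand("This is a string with a <a>tag</a>.") should produce "This is a string with a t,a,g,.".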
Assuming you have your target string already parsed via your RegEx i.e. no tags around it...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication32
{
class Program
{
static void Main(string[] args)
{
// setup a test string
string stringToProcess = "Test";
// actual solution here
string result = String.Concat(stringToProcess.Select(c => c + ","));
// results: T,e,s,t,
Console.WriteLine(result);
}
}
}
Parsing XML is very problematic because you may have to deal with things like CDATA sections, nested elements, entities, surrogate characters, and on and on. I would use a state-based parser like ANTLR.
However, if you are just starting out with C# it is instructive to solve this using the built-in .NET string and array classes. No ANTLR, LINQ, or regular expressions needed:
using System;
class ReplaceAContentsWithCommaSeparatedChars
{
static readonly string acroStartTag = "<a>";
static readonly string acroEndTag = "</a>";
static void Main(string[] args)
{
string s = "Alpha <a>Beta</a> Gamma <a>Delta</a>";
while (true)
{
int start = s.IndexOf(acroStartTag);
if (start < 0)
break;
int end = s.IndexOf(acroEndTag, start + acroStartTag.Length);
// With no closing tag, treat the rest of the string as the tag contents.
int tailStart = (end < 0) ? s.Length : end + acroEndTag.Length;
if (end < 0)
end = s.Length;
string contents = s.Substring(start + acroStartTag.Length, end - start - acroStartTag.Length);
string[] chars = Array.ConvertAll<char, string>(contents.ToCharArray(), c => c.ToString());
s = s.Substring(0, start)
+ string.Join(",", chars)
+ s.Substring(tailStart);
}
Console.WriteLine(s);
}
}
Please be aware this does not deal with any of the issues I mentioned. But then, none of the other suggestions do either.