Regex: Match multiple balancing groups - c#

I'm searching for a regex to match all C# methods in a text, where the body of each method found (referenced as "Content") should be accessible via a group.
The C# regex below only gives the desired result if there is exactly ONE method in the text.
Source text:
void method1(){
if(a){ exec2(); }
else { exec3(); }
}
void method2(){
if(a){ exec4(); }
else { exec5(); }
}
The regex:
string pattern = "(?:[^{}]|(?<Open>{)|(?<Content-Open>}))+(?(Open)(?!))";
MatchCollection methods = Regex.Matches(source,pattern,RegexOptions.Multiline);
foreach (Match c in methods)
{
string body = c.Groups["Content"].Value; // = if(a){ exec2(); }else { exec3();}
//Edit: get the method name
Match mDef= Regex.Match(c.Value,"void ([\\w]+)");
string name = mDef.Groups[1].Captures[0].Value;
}
If source contains only method1, it works perfectly, but once method2 is added there is still only one Match, and you can no longer extract the individual method-body pairs.
How can I modify the regex to match multiple methods?

Assuming you only want to match basic code like those samples in your question, you can use
(?<method_name>\w+)\s*\((?s:.*?)\)\s*(?<method_body>\{(?>[^{}]+|\{(?<n>)|}(?<-n>))*(?(n)(?!))})
To access the values you need, use .Groups["method_name"].Value and .Groups["method_body"].Value.
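As a rough sketch of how that pattern might be used, the snippet below wraps it in a helper that returns name/body pairs. The class name MethodExtractor and the method Extract are hypothetical names chosen for illustration. Note that with this pattern the captured body includes the outer braces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer) wrapping the pattern above.
public static class MethodExtractor
{
    // Name, lazily matched parameter list, then a brace-balanced body.
    // The balancing group "n" is pushed on '{' and popped on '}'.
    private static readonly Regex MethodPattern = new Regex(
        @"(?<method_name>\w+)\s*\((?s:.*?)\)\s*" +
        @"(?<method_body>\{(?>[^{}]+|\{(?<n>)|}(?<-n>))*(?(n)(?!))})");

    // Returns one (name, body) pair per method found, in source order.
    public static List<KeyValuePair<string, string>> Extract(string source)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in MethodPattern.Matches(source))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["method_name"].Value,
                m.Groups["method_body"].Value));
        }
        return result;
    }
}
```

Calling MethodExtractor.Extract(source) on the two-method sample from the question should yield one entry per method, each with its own body.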

Related

Implement generic Regex.Matches with string tags

I have a function that gets the content inside 2 tags of a string:
string content = string.Empty;
foreach (Match match in Regex.Matches(stringSource, "<tag1>(.*?)</tag1>"))
{
content = match.Groups[1].Value;
}
I need to do this operation many times with different tags. I want to update the method so I can pass in the opening and closing tags, but I can't work out how to combine the tag parameters with the regular expression. When I pass these values to the new function, the expression does not work:
public string GetContent(string stringSource, string openTag, string closeTag)
{
string content = string.Empty;
foreach (Match match in Regex.Matches(stringSource, $"{openTag}(.*?){closeTag}"))
{
content = match.Groups[1].Value;
}
return content;
}
I want to use the function like this:
string content = GetContent(sourceString, "<tag1>", "</tag1>");
How can I make this work?
Try this:
public IEnumerable<string> GetContent(string stringSource, string tag)
{
foreach (Match match in Regex.Matches(stringSource, $"<{tag}>(.*?)</{tag}>"))
{
yield return match.Groups[1].Value;
}
}
// ...
var content = GetContent(sourceString, "tag1");
Note I also changed the return type. What you had before was the equivalent of calling this function like this: string content = GetContent(sourceString, "tag").LastOrDefault();
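If you would rather keep the two-parameter shape from the question, one possible sketch is to pass both delimiters through Regex.Escape before building the pattern, so that any regex metacharacters in them are treated literally. The class name TagContent is hypothetical, and RegexOptions.Singleline is an assumption on my part so that content spanning several lines is still captured.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: keeps the open/close parameters from the question.
public static class TagContent
{
    public static IEnumerable<string> GetContent(string source, string openTag, string closeTag)
    {
        // Escape both delimiters so characters such as '[' or '.' match literally.
        string pattern = Regex.Escape(openTag) + "(.*?)" + Regex.Escape(closeTag);

        // Singleline lets '.' match newlines (an assumption about the input).
        foreach (Match match in Regex.Matches(source, pattern, RegexOptions.Singleline))
        {
            yield return match.Groups[1].Value;
        }
    }
}
```

For example, TagContent.GetContent(sourceString, "<tag1>", "</tag1>") would enumerate the content of every tag1 pair.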
Also, regex is generally a poor choice for handling HTML and XML. There are all kinds of edge cases around this, such that regex really doesn't work that well.
You can make it seem to work if you can constrain your input to a subset of the language to limit edge cases, and that might get you by for a while, but usually someone will eventually want to use more of the features of the markup language and you'll start getting weird bugs and errors. You'll really do much better with a dedicated, purpose-built parser!
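As an illustration of the parser route, here is a minimal sketch using LINQ to XML from the .NET base class library. It assumes the input is well-formed XML with a single root element, which may not hold for your data. The class name XmlTagContent is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: assumes well-formed XML with a single root element.
public static class XmlTagContent
{
    public static List<string> GetContent(string xml, string tag)
    {
        // XDocument.Parse throws an XmlException if the input is not well-formed.
        return XDocument.Parse(xml)
            .Descendants(tag)        // every element with this name, at any depth
            .Select(e => e.Value)    // concatenated text content of the element
            .ToList();
    }
}
```

Unlike the regex version, this copes with attributes on the tag and with nested elements, at the cost of rejecting input that is not valid XML.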

How to get all files ending with the extension "_\<fileNum>of\<totalFileNum>" and sometimes without? [duplicate]

A user specifies a file name that can be either in the form "<name>_<fileNum>of<fileNumTotal>" or simply "<name>". I need to somehow extract the "<name>" part from the full file name.
Basically, I am looking for a solution to the method "ExtractName()" in the following example:
string fileName = "example_File"; // This var is specified by user
string extractedName = ExtractName(fileName); // Must return "example_File"
fileName = "example_File2_1of5";
extractedName = ExtractName(fileName); // Must return "example_File2"
fileName = "examp_File_3of15";
extractedName = ExtractName(fileName); // Must return "examp_File"
fileName = "example_12of15";
extractedName = ExtractName(fileName); // Must return "example"
Edit: Here's what I've tried so far:
string ExtractName(string fullName)
{
return fullName.Substring(0, fullName.LastIndexOf('_'));
}
But this clearly does not work for the case where the full name is just "<name>".
Thanks
This would be easier to parse using Regex, because you don't know how many digits either number will have.
var inputs = new[]
{
"example_File",
"example_File2_1of5",
"examp_File_3of15",
"example_12of15"
};
var pattern = new Regex(@"^(.+)(_\d+of\d+)$");
foreach (var input in inputs)
{
var match = pattern.Match(input);
if (!match.Success)
{
// file doesn't end with "#of#", so use the whole input
Console.WriteLine(input);
}
else
{
// it does end with "#of#", so use the first capture group
Console.WriteLine(match.Groups[1].Value);
}
}
This code returns:
example_File
example_File2
examp_File
example
The Regex pattern has three parts:
^ and $ are anchors to ensure you capture the entire string, not just a subset of characters.
(.+) - match everything, be as greedy as possible.
(_\d+of\d+) - match "_#of#", where "#" can be any number of consecutive digits.
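Packaged as the ExtractName method the question asks for, the same pattern could look roughly like this. The containing class name FileNames is hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical container class for the ExtractName method from the question.
public static class FileNames
{
    // Same pattern as above: greedy name, then a trailing "_<n>of<m>" suffix.
    private static readonly Regex Suffix = new Regex(@"^(.+)(_\d+of\d+)$");

    public static string ExtractName(string fileName)
    {
        Match match = Suffix.Match(fileName);

        // No "_#of#" suffix: the whole input is the name.
        return match.Success ? match.Groups[1].Value : fileName;
    }
}
```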

How to avoid large switch statements and/or regular expressions when converting code from one language to another

I have to convert a few hundred test cases written in Java to code in C#. At the moment all I could think of is define a set of regular expressions, try to match it on a line and do an action based on which regex matched.
Any better ideas? (This approach still stinks.)
An example of from and to:
Java:
Request request = new Request(testRunner)
request.setUsername("userName")
request.setPassword("password")
log.info(request.getRequest())
C#
var request = new LoginRequest(LoginParams);
request.Username = "userName";
request.Password = "password";
var LoginResponse = Account.ExecuteCall(request, pathToApi);
The source I'm trying to convert is from SoapUI and the bits of script involved are within TestSteps of a humongous XML file. Also, most of them are simply forming some sort of request and checking for a specific response so there shouldn't be too many types to implement.
What I ended up doing was defining a base class (Map) that has a Pattern property, a Success indicator, and the lines of Code that result from a successful match. In some cases a line can simply be replaced by another one, but in other cases (setUserName) I need to extract content from the original script to put in the C# code. In other cases, a single line might be replaced with more than one. The transformation is all defined in the Match function.
public class SetUserName : Map
{
internal override string Pattern { get { return @"request.setUsername\(""(.*)""\)"; } }
public override void Match(string line)
{
Match match = Regex.Match(line, Pattern);
if (match.Success)
{
Success = true;
CodeLines = new Code<CodeLine>
{new CodeLine("request.Username = \"" + match.Groups[1].Value + "\"")};
}
}
}
Then I put the maps in a list ordered by occurrence and loop through each line of script:
foreach (string scriptLine in scriptLines)
{
string line = Strip(scriptLine);
if (string.IsNullOrEmpty(line) || Regex.Match(line, @"^\s+$").Success)
{
continue;
}
Map[] RegExes =
{
new Request(),
new SetUserName(),
new SetPassword(),
new RunRequest()
};
foreach (Map map in RegExes)
{
map.Match(line);
if (map.Success)
{
codeList.AddRange(map.CodeLines);
break;
}
}
}
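One possible variation on the approach above, sketched here under my own assumptions, replaces the class-per-rule hierarchy with a table of (pattern, emit function) pairs, so adding a rule means adding one entry. The names LineRule, ScriptConverter and Translate are hypothetical, and only the two setter rules from the example are shown.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical rule type: a pattern plus a function producing the C# lines.
public sealed class LineRule
{
    public Regex Pattern { get; }
    public Func<Match, string[]> Emit { get; }

    public LineRule(string pattern, Func<Match, string[]> emit)
    {
        Pattern = new Regex(pattern);
        Emit = emit;
    }
}

public static class ScriptConverter
{
    // Rules are tried in order; only two illustrative rules are defined here.
    private static readonly LineRule[] Rules =
    {
        new LineRule(@"^request\.setUsername\(""(.*)""\)$",
            m => new[] { "request.Username = \"" + m.Groups[1].Value + "\";" }),
        new LineRule(@"^request\.setPassword\(""(.*)""\)$",
            m => new[] { "request.Password = \"" + m.Groups[1].Value + "\";" }),
    };

    public static List<string> Translate(IEnumerable<string> scriptLines)
    {
        var output = new List<string>();
        foreach (string raw in scriptLines)
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;   // skip blank lines

            foreach (LineRule rule in Rules)
            {
                Match m = rule.Pattern.Match(line);
                if (!m.Success) continue;
                output.AddRange(rule.Emit(m));
                break;                        // first matching rule wins
            }
            // Lines that match no rule are silently dropped in this sketch.
        }
        return output;
    }
}
```

The behaviour is the same first-match-wins loop as before. The difference is only that each rule is data rather than a subclass, which may or may not suit a larger rule set.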

Highlight a list of words using a regular expression in c#

I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expression which will allow me to replace all of the recognised abbreviations found in the content with some markup.
For example:
content: This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.
abbreviations: memb = Member; deb = Debut;
result: This is just a little test of the [a title="Member"]memb[/a] to see if it gets picked up.
[a title="Debut"]Deb[/a] of course should also be caught here.
(This is just example markup for simplicity).
Thanks.
EDIT:
CraigD's answer is nearly there, but there are issues. I only want to match whole words. I also want to keep the correct capitalisation of each word replaced, so that deb is still deb, and Deb is still Deb as per the original text. For example, this input:
This is just a little test of the memb.
And another memb, but not amemba.
Deb of course should also be caught here.deb!
First you would need to Regex.Escape() all the input strings.
Then you can look for them in the string, and iteratively replace them by the markup you have in mind:
string abbr = "memb";
string word = "Member";
string pattern = String.Format(@"\b{0}\b", Regex.Escape(abbr));
string substitute = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
string output = Regex.Replace(input, pattern, substitute);
EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.
You can go as far as building a single pattern from all your escaped input strings, like this:
\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b
and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.
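A minimal console-friendly sketch of that single-pattern idea might look like the following. The class name AbbreviationMarkup and method Apply are hypothetical, and the case-insensitive dictionary is my own choice for looking up matches such as "Deb" under the key "deb".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating one pattern plus a match evaluator.
public static class AbbreviationMarkup
{
    public static string Apply(string input, IDictionary<string, string> abbreviations)
    {
        if (abbreviations.Count == 0) return input;

        // Case-insensitive copy so "Deb" finds the entry stored under "deb".
        var lookup = new Dictionary<string, string>(abbreviations, StringComparer.OrdinalIgnoreCase);

        // \b(?:abbr_1|abbr_2|...)\b with every abbreviation escaped.
        string pattern = @"\b(?:" +
            string.Join("|", lookup.Keys.Select(k => Regex.Escape(k))) +
            @")\b";

        // The evaluator reuses m.Value, so the original casing is preserved.
        return Regex.Replace(
            input,
            pattern,
            m => string.Format("[a title=\"{0}\"]{1}[/a]", lookup[m.Value], m.Value),
            RegexOptions.IgnoreCase);
    }
}
```

Because the replacement text is built from the matched value rather than the dictionary key, the input is scanned once and "Deb" stays "Deb".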
Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?
Anyway, let me know if this is what you're after
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var input = @"This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.";
var dictionary = new Dictionary<string,string>
{
{"memb", "Member"}
,{"deb","Debut"}
};
var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
foreach (Match metamatch in Regex.Matches(input
, regex /*@"(memb)|(deb)"*/
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
input = input.Replace(metamatch.Value, dictionary[metamatch.Value.ToLower()]);
}
Console.Write (input);
Console.ReadLine();
}
}
}
For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving correct casing for matched strings.
public partial class Abbreviations : System.Web.UI.UserControl
{
private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();
protected void Page_Load(object sender, EventArgs e)
{
string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";
var regex = "\\b(?:" + String.Join("|", dictionary.Keys.ToArray()) + ")\\b";
MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);
input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);
litContent.Text = input;
}
private string GetExplanationMarkup(Match m)
{
return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLower()], m.Value);
}
}
The output looks like this (below). Note that it only matches full words, and that the casing is preserved from the original string:
This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!
I doubt it will perform better than a plain string.Replace, so if performance is critical, measure it (and consider refactoring a bit to use a compiled regex). You can write the regex version as:
var abbrsWithPipes = "(abbr1|abbr2)";
var regex = new Regex(abbrsWithPipes);
return regex.Replace(html, m => GetReplaceForAbbr(m.Value));
You need to implement GetReplaceForAbbr, which receives the specific abbr being matched.
I'm doing almost exactly what you're looking for in my application, and this works for me.
The parameter str is your content:
public static string GetGlossaryString(string str)
{
List<string> glossaryWords = GetGlossaryItems();//this collection would contain your abbreviations; you could just make it a Dictionary so you can have the abbreviation-full term pairs and use them in the loop below
str = string.Format(" {0} ", str);//quick and dirty way to also search the first and last word in the content.
foreach (string word in glossaryWords)
str = Regex.Replace(str, "([\\W])(" + word + ")([\\W])", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);
return str.Trim();
}

Regular expression that returns a constant value as part of a match

I have a regular expression to match 2 different number formats: \=(?<Value>[0-9]+)\?|\+(?<Value>[0-9]+)\?
This should return 9876543 as its Value for ;1234567890123456?+1234567890123456789012345123=9876543? and ;1234567890123456?+9876543?
What I would like is to be able to return another value along with the matched 'Value'.
So, for example, if the first string was matched, I'd like it to return:
Value:
9876543
Format:
LongFormat
And if matched in the second string:
Value:
9876543
Format:
ShortFormat
Is this possible?
Another option, which is not quite the solution you wanted, but saves you using two separate regexes, is to use named groups, if your implementation supports it.
Here is some C#:
var regex = new Regex(@"\=(?<Long>[0-9]+)\?|\+(?<Short>[0-9]+)\?");
string test1 = ";1234567890123456?+1234567890123456789012345123=9876543?";
string test2 = ";1234567890123456?+9876543?";
var match = regex.Match(test1);
Console.WriteLine("Long: {0}", match.Groups["Long"]); // 9876543
Console.WriteLine("Short: {0}", match.Groups["Short"]); // blank
match = regex.Match(test2);
Console.WriteLine("Long: {0}", match.Groups["Long"]); // blank
Console.WriteLine("Short: {0}", match.Groups["Short"]); // 9876543
Basically, just modify your regex to include the names, and then match.Groups[groupName] will either have a value or won't. You could even use the Success property of the group to tell which one matched (match.Groups["Long"].Success).
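To get back both the value and a constant format label, as the question asks, one option is to check which named group succeeded and pair the value with a label. The following is only a sketch. The class name TrackParser and the method Parse are hypothetical, and the labels are taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: maps the group that matched to a constant label.
public static class TrackParser
{
    private static readonly Regex Pattern =
        new Regex(@"\=(?<Long>[0-9]+)\?|\+(?<Short>[0-9]+)\?");

    // Returns (value, format), or null when neither format is found.
    public static Tuple<string, string> Parse(string input)
    {
        Match match = Pattern.Match(input);
        if (!match.Success) return null;

        // Exactly one of the two named groups participates in a successful match.
        Group longGroup = match.Groups["Long"];
        return longGroup.Success
            ? Tuple.Create(longGroup.Value, "LongFormat")
            : Tuple.Create(match.Groups["Short"].Value, "ShortFormat");
    }
}
```

The regex still only matches text that is present. The label is added by the calling code based on which alternative matched.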
UPDATE:
You can get the group name out of the match, with the following code:
static void Main(string[] args)
{
var regex = new Regex(@"\=(?<Long>[0-9]+)\?|\+(?<Short>[0-9]+)\?");
string test1 = ";1234567890123456?+1234567890123456789012345123=9876543?";
string test2 = ";1234567890123456?+9876543?";
ShowGroupMatches(regex, test1);
ShowGroupMatches(regex, test2);
Console.ReadLine();
}
private static void ShowGroupMatches(Regex regex, string testCase)
{
int i = 0;
foreach (Group grp in regex.Match(testCase).Groups)
{
if (grp.Success && i != 0)
{
Console.WriteLine(regex.GroupNameFromNumber(i) + " : " + grp.Value);
}
i++;
}
}
I'm ignoring the 0th group, because that is always the entire match in .NET
No, you can't match text that isn't there. The match can only return a substring of the target.
You essentially want to match against two patterns and take different actions in each case. See if you can separate them in your code:
if match(\=(?<Value>[0-9]+)\?) then
return 'Value: ' + match + 'Format: LongFormat'
else if match(\+(?<Value>[0-9]+)\?) then
return 'Value: ' + match + 'Format: ShortFormat'
(Excuse the dodgy pseudocode, but you get the idea.)
You can't match text that isn't there - but, depending on what language you're using, you can process what you match, and conditionally add text based on what is there.
With some implementations of regex, you can specify a "callback function" which allows you to run logic against each result.
Here's a pseudo-code example:
Input.replaceAll( /[+=][0-9]+(?=\?)/ , formatValue );
formatValue : function(match,groups)
{
switch( left(match,1) )
{
case '+' : Format = 'Short'; break;
case '=' : Format = 'Long'; break;
default : Format = 'Unknown'; break;
}
Value = match.replace( /[+=]/ , '' );
return 'Value: '+Value+' Format: ' + Format;
}
What that will do, in a language that supports regex callbacks, is execute the formatValue function every time it finds a match, and use the result of the function as the replacement text.
You haven't specified which implementation you're using, so this may or may not be possible for you, but it is definitely worth checking out.
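In .NET, for instance, the callback role can be played by a MatchEvaluator passed to Regex.Replace. The sketch below is a rough C# rendering of the pseudo-code above, under the assumption that the leading '+' or '=' decides the format. The class name FormatTagger and method Tag are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: C# rendering of the callback pseudo-code above.
public static class FormatTagger
{
    public static string Tag(string input)
    {
        // Match '+' or '=' followed by digits, only when a '?' comes next.
        return Regex.Replace(input, @"[+=][0-9]+(?=\?)", m =>
        {
            // The first character of the match decides the format label.
            string format = m.Value[0] == '+' ? "Short" : "Long";
            string value = m.Value.Substring(1);
            return "Value: " + value + " Format: " + format;
        });
    }
}
```

The lambda runs once per match, and its return value replaces the matched text in the result string.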
