Regular expressions - Equal number of characters in left and right - c#

So I have this regular expression
[a+][a-z-[a]]{1}[a+]
which will match string "aadaa"
but it will also match string "aaaaaaaadaa"
Is there any way to force it to match only those strings in which the number of a's on the left equals the number of a's on the right,
so that it matches "aadaa" but not "aaaaaaaadaa"?
Edit
With the help of Peter's answer I got it working. This is the version that meets my requirement:
(a+)[a-z-[a]]{1}\1

You can use a back reference, as follows:
console.log(check("ada"));
console.log(check("aadaa"));
console.log(check("aaaaaaaadaa"));
console.log(check("aaadaaaaaaa"));
function check(str) {
var re = /^(.*).\1$/;
return re.test(str);
}
Or to only match a's and d's:
console.log(check("aca"));
console.log(check("aadaa"));
console.log(check("aaaaaaaadaa"));
console.log(check("aaadaaaaaaa"));
function check(str) {
var re = /^(a*)d\1$/;
return re.test(str);
}
Or to only match a's that surround not-an-a:
console.log(check("aca"));
console.log(check("aadaa"));
console.log(check("aaaaaaaadaa"));
console.log(check("aaadaaaaaaa"));
function check(str) {
var re = /^(a*)[b-z]\1$/;
return re.test(str);
}
I realize all of the above is JavaScript, which was easy to demo quickly within the context of SO.
I made a working DotNetFiddle with the following C# code that is similar to all the above:
public static Regex re = new Regex(@"^(a+)[b-z]\1$");
public static void Main()
{
check("aca");
check("ada");
check("aadaa");
check("aaddaa");
check("aadcaa");
check("aaaaaaaadaa");
check("aadaaaaaaaa");
}
public static void check(string str)
{
Console.WriteLine(str + " -> " + re.IsMatch(str));
}
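To see why the back reference enforces equal counts, here is a small sketch in JavaScript, like the demos above. The helper name explainMatch is made up for illustration; it reports what group 1 captured, which is exactly the text that \1 has to repeat on the right-hand side.

```javascript
// Hypothetical helper: shows what the capturing group held when the pattern matches.
function explainMatch(str) {
  // (a+) captures the left run of a's; \1 must then match that same text again.
  const m = /^(a+)([b-z])\1$/.exec(str);
  if (m === null) {
    // The two runs differ in length, or the middle is not a single b-z letter.
    return null;
  }
  return { run: m[1], middle: m[2], count: m[1].length };
}

console.log(explainMatch("aadaa"));
console.log(explainMatch("aaaaaaaadaa"));
```

Because \1 repeats the captured text rather than the a+ sub-pattern, the right-hand run can never be longer or shorter than the left-hand one.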

You can also use the following regex for the same purpose, although I would prefer the one suggested by @PeterB:
console.log(check("aca"));
console.log(check("aadaa"));
console.log(check("aaaaaaaadaa"));
console.log(check("aaadaaaaaaa"));
function check(str) {
var re = /^(\w+)[A-Za-z]\1$/;
return re.test(str);
}
The code is similar to the one in Peter B's answer, but the regex is the one changed by me.

Related

Problems highlighting substring in a text

I have a problem with a code that I do not know very well how to solve.
The fact is that I want to highlight the substrings found in a text, for this I have developed the following code:
texts.ForEach((a) =>
{
if (a.Content.Contains(word))
{
RenderResultSubString(a, word);
}
});
The method with which I render the text is the following:
public TextGramaticaItemPublic RenderResultSubString(TextGramaticaItemPublic a, string substr) {
a.Content = Regex.Replace(a.Content, String.Format(@"\b{0}\b", substr), new MatchEvaluator(ReplaceKeyWords), RegexOptions.IgnoreCase);
return a;
}
And the delegate of the MatchEvaluator that adds the HTML to cause the highlighting effect is this:
public string ReplaceKeyWords(Match m) {
return "<mark><b>" + m.Value + "</b></mark>";
}
This works for whole words, but not for substrings inside a word. I think I'm on the right track, but something escapes me and I can't quite get it right.
I've done a lot of research, but I can't see my mistake! :(
SOLVED:
The code must be:
public TextGramaticaItemPublic RenderResultSubString(TextGramaticaItemPublic a, string substr) {
// Regex.Escape keeps characters such as '.' or '{' in substr from being read as regex syntax.
a.Content = Regex.Replace(a.Content, Regex.Escape(substr), new MatchEvaluator(ReplaceKeyWords), RegexOptions.IgnoreCase);
return a;
}
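As @Harshad Raval and @Jamiec pointed out in the comments, the @"\b{0}\b" expression passed to String.Format() was the problem: the \b word boundaries only allow whole-word matches, so substrings were never found. It was inherited code...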
Thanks to everybody!
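For comparison, here is a minimal sketch of the same idea in JavaScript. The helper names escapeRegExp and highlight, and the wholeWord switch, are made up for illustration and are not part of the original code. It shows the difference between matching with and without the \b anchors.

```javascript
// Escape regex metacharacters so the search term is treated as literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wraps every occurrence of term in <mark><b>...</b></mark>.
// wholeWord = true adds \b anchors, which is what prevented substring matches.
function highlight(content, term, wholeWord) {
  const core = escapeRegExp(term);
  const pattern = wholeWord ? "\\b" + core + "\\b" : core;
  // "g" replaces every occurrence, "i" ignores case; m keeps the original casing.
  return content.replace(new RegExp(pattern, "gi"), (m) => "<mark><b>" + m + "</b></mark>");
}

console.log(highlight("Grammar and grammatical", "gram", false));
console.log(highlight("Grammar and grammatical", "gram", true));
```

With wholeWord set to true the text comes back unchanged, because "gram" never appears as a word of its own.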

Regex: Match multiple balancing groups

I'm searching for a regex to match all C# methods in a text, and the body of each found method (referenced as "Content") should be accessible via a group.
The C# regex below only gives the desired result if exactly ONE method exists in the text.
Source text:
void method1(){
if(a){ exec2(); }
else { exec3(); }
}
void method2(){
if(a){ exec4(); }
else { exec5(); }
}
The regex:
string pattern = "(?:[^{}]|(?<Open>{)|(?<Content-Open>}))+(?(Open)(?!))";
MatchCollection methods = Regex.Matches(source,pattern,RegexOptions.Multiline);
foreach (Match c in methods)
{
string body = c.Groups["Content"].Value; // = if(a){ exec2(); }else { exec3();}
//Edit: get the method name
Match mDef= Regex.Match(c.Value,"void ([\\w]+)");
string name = mDef.Groups[1].Captures[0].Value;
}
If only method1 is contained in source, it works perfectly, but with the additional method2 there is still only one Match, and you can no longer extract the individual method-body pairs.
How can I modify the regex to match multiple methods?
Assuming you only want to match basic code like those samples in your question, you can use
(?<method_name>\w+)\s*\((?s:.*?)\)\s*(?<method_body>\{(?>[^{}]+|\{(?<n>)|}(?<-n>))*(?(n)(?!))})
To access the values you need, use .Groups["method_name"].Value and .Groups["method_body"].Value.
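Balancing groups are specific to .NET. As a rough illustration of what the (?<n>) / (?<-n>) counting does, here is a sketch in JavaScript that swaps in a plain brace-depth counter instead. The function name extractMethods is hypothetical, and it assumes braces never appear inside strings or comments.

```javascript
// Hypothetical helper: returns [{ name, body }] for simple "name(...) { ... }" methods.
// Assumes braces never appear inside strings or comments.
function extractMethods(source) {
  const methods = [];
  const header = /(\w+)\s*\([^)]*\)\s*\{/g; // method name, parameter list, opening brace
  let m;
  while ((m = header.exec(source)) !== null) {
    let depth = 1;             // the header already consumed one "{"
    let i = header.lastIndex;  // first character of the body
    const start = i;
    while (i < source.length && depth > 0) {
      if (source[i] === "{") depth++;
      else if (source[i] === "}") depth--;
      i++;
    }
    if (depth !== 0) break;    // unbalanced braces: stop rather than guess
    methods.push({ name: m[1], body: source.slice(start, i - 1).trim() });
    header.lastIndex = i;      // resume scanning after the closing brace
  }
  return methods;
}
```

The depth variable plays the role of the n stack in the regex above: it goes up on every opening brace, down on every closing brace, and the body ends when it returns to zero.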

How to split at every second quotation mark

I have a string that looks like this
2,"E2002084700801601390870F"
3,"E2002084700801601390870F"
1,"E2002084700801601390870F"
4,"E2002084700801601390870F"
3,"E2002084700801601390870F"
This is one whole string, you can imagine it being on one row.
And I want to split this in the way they stand right now like this
2,"E2002084700801601390870F"
I cannot change the way it is formatted, so my best bet is to split at every second quotation mark. But I haven't found any good way to do this. I've tried this https://stackoverflow.com/a/17892392/2914876 but I only get an error about invalid arguments.
Another issue is that this project is running .NET 2.0 so most LINQ functions aren't available.
Thank you.
Try this
var regex = new Regex(@"\d+\,"".*?""");
var lines = regex.Matches(txt).OfType<Match>().Select(m => m.Value).ToArray();
Use foreach instead of LINQ Select on .NET 2.0:
Regex regex = new Regex(@"\d+\,"".*?""");
foreach (Match m in regex.Matches(txt))
{
string curLine = m.Value;
}
I see three possibilities, none of them are particularly exciting.
As @dvnrrs suggests, if there's no comma where you have line-breaks, you should be in great shape. Replace ," with something novel. Replace the remaining "s with what you need. Replace the "something novel" with ," to restore them. This is probably the most solid--it solves the problem without much room for bugs.
Iterate through the string looking for the index of the next " from the previous index, and maintain a state machine to decide whether to manipulate it or not.
Split the string on "s and rejoin them in whatever way works the best for your application.
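The second option (walking the string and tracking state) can be sketched roughly as follows. This is JavaScript purely for illustration, and the function name splitAfterEverySecondQuote is made up; the same loop translates directly to C# on .NET 2.0 with IndexOf or a char loop. It assumes the quoted values never contain a quotation mark themselves.

```javascript
// Hypothetical helper: cuts the input after every second double quote.
// Assumes quoted values never contain a double quote themselves.
function splitAfterEverySecondQuote(text) {
  const records = [];
  let start = 0;             // where the current record begins
  let insideQuotes = false;  // the two-state "state machine"
  for (let i = 0; i < text.length; i++) {
    if (text[i] !== '"') continue;
    if (insideQuotes) {
      // Closing quote: the record ends here.
      records.push(text.slice(start, i + 1));
      start = i + 1;
    }
    insideQuotes = !insideQuotes;
  }
  return records; // any trailing text without a closing quote is dropped
}

const sample = '2,"E2002084700801601390870F"3,"E2002084700801601390870F"1,"E2002084700801601390870F"';
console.log(splitAfterEverySecondQuote(sample));
```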
I realize regular expressions will handle this but here's a pure 2.0 way to handle as well. It's much more readable and maintainable in my humble opinion.
using System;
using System.Collections.Generic;
namespace ConsoleApplication1
{
internal class Program
{
private static void Main(string[] args)
{
const string data = @"2,""E2002084700801601390870F""3,""E2002084700801601390870F""1,""E2002084700801601390870F""4,""E2002084700801601390870F""3,""E2002084700801601390870F""";
var parsedData = ParseData(data);
foreach (var parsedDatum in parsedData)
{
Console.WriteLine(parsedDatum);
}
Console.ReadLine();
}
private static IEnumerable<string> ParseData(string data)
{
var results = new List<string>();
var split = data.Split(new [] {'"'}, StringSplitOptions.RemoveEmptyEntries);
if (split.Length % 2 != 0)
{
throw new Exception("Data Formatting Error");
}
// Entries alternate between the prefix ("2,") and the quoted value, so walk them in pairs.
for (var index = 0; index < split.Length; index += 2)
{
results.Add(string.Format(@"{0}""{1}""", split[index], split[index + 1]));
}
return results;
}
}
}

match names with unicode chars

Can somebody help me match strings such as "BEREŽALINS" and "GŽIBOVSKIS" in C# and JS? I've tried
\A\w+\z
(?>\P{M}\p{M}*)+
^[-a-zA-Z\p{L}']{2,50}$
and so on, but nothing works.
Thanks
Just wrote a little console app to do it:
private static void Main(string[] args) {
var list = new List<string> {
"BEREŽALINS",
"GŽIBOVSKIS",
"TEST"
};
var pat = new Regex(#"[^\u0000-\u007F]");
foreach (var name in list) {
Console.WriteLine(string.Concat(name, " = ", pat.IsMatch(name) ? "Match" : "Not a Match"));
}
Console.ReadLine();
}
Works with the two examples you gave me, but not sure about all scenarios :)
Can you give an example of what it should not match?
Reading your question, it seems you just want to match a whole string (maybe on separate lines). If that's the case, just use
^.*$
In C# this becomes
foundMatch = Regex.IsMatch(SubjectString, "^.*$", RegexOptions.Multiline);
And in JavaScript this is
if (/^.*$/m.test(subject)) {
// Successful match
} else {
// Match attempt failed
}
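On the JavaScript side, one likely reason the \p{L} attempt failed is that JavaScript only interprets \p{...} when the regex carries the u flag. A minimal sketch, assuming a runtime with Unicode property escapes (ES2018 or later); the names namePattern and isName are made up for illustration:

```javascript
// \p{L} matches any Unicode letter; JavaScript only understands \p{...} with the u flag.
const namePattern = /^[-'\p{L}]{2,50}$/u;

// Hypothetical helper for illustration.
function isName(text) {
  // NFC turns "Z" + combining caron into the single letter "Ž" before testing.
  return namePattern.test(text.normalize("NFC"));
}

console.log(isName("BEREŽALINS")); // true
console.log(isName("GŽIBOVSKIS")); // true
console.log(isName("R2D2"));       // false, digits are not letters
```

Unlike ^.*$, this still rejects digits and punctuation other than the hyphen and apostrophe.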

Highlight a list of words using a regular expression in c#

I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expression which will allow me to replace all of the recognised abbreviations found in the content with some markup.
For example:
content: This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.
abbreviations: memb = Member; deb = Debut;
result: This is just a little test of the [a title="Member"]memb[/a] to see if it gets picked up.
[a title="Debut"]Deb[/a] of course should also be caught here.
(This is just example markup for simplicity).
Thanks.
EDIT:
CraigD's answer is nearly there, but there are issues. I only want to match whole words. I also want to keep the correct capitalisation of each word replaced, so that deb is still deb, and Deb is still Deb as per the original text. For example, this input:
This is just a little test of the memb.
And another memb, but not amemba.
Deb of course should also be caught here.deb!
First you would need to Regex.Escape() all the input strings.
Then you can look for them in the string, and iteratively replace them by the markup you have in mind:
string abbr = "memb";
string word = "Member";
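// The verbatim string (@) matters: in a regular C# string "\b" is a backspace, not a word boundary.
string pattern = String.Format(@"\b{0}\b", Regex.Escape(abbr));
string substitute = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
string output = Regex.Replace(input, pattern, substitute);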
EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.
You can go as far as building a single pattern from all your escaped input strings, like this:
\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b
and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.
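Here is a rough sketch of that single-pattern approach, written in JavaScript for brevity, where a replace callback plays the role of the .NET MatchEvaluator. The names abbreviations, escapeRegExp and markAbbreviations are made up for illustration; the sample data and markup come from the question.

```javascript
// Abbreviations and their explanations (sample data from the question).
const abbreviations = new Map([["memb", "Member"], ["deb", "Debut"]]);

// Escape regex metacharacters so each abbreviation is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: one pass over the input, whole words only, original casing kept.
function markAbbreviations(input) {
  const alternatives = [...abbreviations.keys()].map(escapeRegExp).join("|");
  const pattern = new RegExp("\\b(?:" + alternatives + ")\\b", "gi");
  return input.replace(pattern, (match) => {
    // Look up by lower-cased key, but emit the text exactly as it appeared.
    const explanation = abbreviations.get(match.toLowerCase());
    return '[a title="' + explanation + '"]' + match + "[/a]";
  });
}

console.log(markAbbreviations("And another memb, but not amemba. Deb here.deb!"));
```

The lookup assumes the map keys are stored in lower case, since the match is lower-cased before the lookup.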
Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?
Anyway, let me know if this is what you're after
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var input = @"This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.";
var dictionary = new Dictionary<string,string>
{
{"memb", "Member"}
,{"deb","Debut"}
};
var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
foreach (Match metamatch in Regex.Matches(input
, regex /*@"(memb)|(deb)"*/
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
input = input.Replace(metamatch.Value, dictionary[metamatch.Value.ToLower()]);
}
Console.Write (input);
Console.ReadLine();
}
}
}
For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving correct casing for matched strings.
public partial class Abbreviations : System.Web.UI.UserControl
{
private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();
protected void Page_Load(object sender, EventArgs e)
{
string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";
var regex = "\\b(?:" + String.Join("|", dictionary.Keys.ToArray()) + ")\\b";
MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);
input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);
litContent.Text = input;
}
private string GetExplanationMarkup(Match m)
{
return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLower()], m.Value);
}
}
The output looks like this (below). Note that it only matches full words, and that the casing is preserved from the original string:
This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!
I doubt it will perform better than a plain string.Replace, so if performance is critical, measure it (after refactoring a bit to use a compiled regex). You can do the regex version as:
var abbrsWithPipes = "(abbr1|abbr2)";
var regex = new Regex(abbrsWithPipes);
return regex.Replace(html, m => GetReplaceForAbbr(m.Value));
You need to implement GetReplaceForAbbr, which receives the specific abbr being matched.
I'm doing pretty much exactly what you're looking for in my application, and this works for me.
The parameter str is your content:
public static string GetGlossaryString(string str)
{
List<string> glossaryWords = GetGlossaryItems();//this collection would contain your abbreviations; you could just make it a Dictionary so you can have the abbreviation-full term pairs and use them in the loop below
str = string.Format(" {0} ", str);//quick and dirty way to also search the first and last word in the content.
foreach (string word in glossaryWords)
str = Regex.Replace(str, "([\\W])(" + word + ")([\\W])", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);
return str.Trim();
}
