Regular expression with backreferences - c#

I am attempting to write a regular expression in a C# application to find "{value}", along with a backreference to the text before it up to "[[", and another backreference to the text after it up to "]]". For example:
This is some text [[backreference one {value}
backreference two]]
Would match "[[backreference one ", "{value}", and "\r\nbackreference two]]".
I have tried modified versions of the following with no luck. I believe I am missing word boundaries, and may be having trouble because of "{" in the text I am trying to find.
\[\[(^[\{value\}]+)\{value\}(^\]\]+)\]\]
I'm not sure if it would be possible with regular expressions, but it would be ideal if it could find the matching closing bracket, for example the following would find "[[backreferenc[[e]] one ", "{value}", and "ba[[ckref[[e]]rence t]]wo]]":
This is some text [[backreferenc[[e]] one {value}
ba[[ckref[[e]]rence t]]wo]]

You need to use the MatchEvaluator on Regex replace. Also it would make your life easier by breaking up the matches into named capture groups to help with the match evaluator processing. Let me explain.
What the MatchEvaluator does, is it allows one to intercede in the match process with a C# delegate and return what should be replaced when a match happens by examining the actual match captured. That way you can do your text processing as needed.
Here is a basic example where it handles the sections in a basic way, but the structure is there to add your business logic:
string text = #"This is some text [[Name: {name}]] at [[Address: {address}]].";
Regex.Replace(text,
#"(?:\[\[)(?<Section>[^\:]+)(?:\:)(?<Data>[^\]]+)(?:\]\])",
new MatchEvaluator((mtch) =>
{
if (mtch.Groups["Section"].Value == "Name")
return "Jabberwocky";
return "120 Main";
}));
The result of Regex Replace is:
This is some text Jabberwocky at 120 Main.

To the first part of you question try this:
\[\[(.*)({value})(.*)\]\]

Related

C# replace by regular expression

I have below C# code to remove stop words from a string:
public static string RemoveStopWords(string Parameter)
{
Parameter = Regex.Replace(Parameter, #"(?<=(\A|\s|\.|,|!|\?))($|_|0|1|2|3|4|5|6|7|8|9|A|about|after|all|also|an|and|another|any|are|as|at|B|be|because|been|before|being|between|both|but|by|C|came|can|come|could|D|did|do|does|E|each|else|F|for|from|G|get|got|H|had|has|have|he|her|here|him|himself|his|how|I|if|in|into|is|it|its|J|just|K|L|like|M|make|many|me|might|more|most|much|must|my|N|never|no|not|now|O|of|on|only|or|other|our|out|over|P|Q|R|re|S|said|same|see|should|since|so|some|still|such|T|take|than|that|the|their|them|then|there|these|they|this|those|through|to|too|U|under|up|use|V|very|W|want|was|way|we|well|were|what|when|where|which|while|who|will|with|would|X|Y|you|your|Z)(?=(\s|\z|,|!|\?))([^.])", " ", RegexOptions.IgnoreCase);
return Parameter.Trim();
}
But when I run it, it works when the stop word in not at end of the string, for example:
about this book output is book
manager only output is manager only
only manager output is manager
Can anyone please guide?
The capture group at the end of the pattern ([^.]) requires a single char other than a dot. The looakhead preceding that (?=(\s|\z|,|!|\?)) limits that match to only one of the listed alternatives (it can not match a dot already as it is excluded by the lookahead).
If you want to keep that, you could omit that lookahead, and just match what you would allow to match like ([\s,!?]|\z) but it would still require at least 1 of the listed alternatives.
What you could so is only use the positive lookahead, and update it to (?=[\s,!?]|\z)
(?<=\A|[\s.,!?])(?:$|[A-Z0-9_]|about|after|all|also|and?|another|any|are|a[ts]|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|get|got|ha[ds]|have|her?|here|him|himself|his|how|i[nf]|into|i[st]|its|just|like|make|many|me|might|more|most|much|must|my|never|not?|now|o[fnr]|only|other|our|out|over|re|said|same|see|should|since|so|some|still|such|take|tha[tn]|the[nm]?|their|there|these|they|this|those|through|too?|under|up|use|very|want|wa[ys]|we|well|were|what|when|where|which|while|who|will|with|would|your?)(?=[\s,!?]|\z)
.NET regex demo
A few notes about the pattern
To shorten the alternation, you can for example a character class a[ts] to either match at or as or make a character optional and? to match either an or and
Inside the lookarounds, you don't have to add another grouping mechanism, so you can use (?=[\s,!?]|\z) instead of (?=(?:[\s,!?]|\z))
If you don't need the values of the capture groups () you can make them non capturing (?:)
The numbers 1|2|3 and the characters A|B|C can be shortened to [A-Z0-9] and also matching the underscore, you might even shorten it to \w

Match text not surrounded by & and ;

I am currently using the following regular expression:
(?<!&)[^&;]*(?!;)
To match text like this:
match1<match2>
And extract:
match1
match2
However, this seems to match an extra five empty strings. See Regex Storm.
How can I only match the two listed above?
Note the existing pattern ((?<=^|;)[^&]+) by #xanatos will only match matches 1 to 3 in the following string and not match4:
match1&lte;match2<match;3+match&4
Try changing the * to a +:
(?<!&)[^&;]+(?!;)
Test here
More correct regex:
(?<=^|;)[^&]+
Test here
The basic idea here is that a "good" substring starts at the beginning of the string (^) or right after the ;, and ends when you encounter a & ([^&]+).
Third version... But here we are showing how if you have a problem, and you decide to use regexes, now you have two problems:
(?<=^|;)([^&]|&(?=[^&;]*(?:&|$)))+
Test here
I have managed it with:
(?<Text>.+?)(?:&[^&;]*?;|$)
This seems to match all of the corner cases but it might not work with a case I can't think of at the moment.
This won't work if the string starts with a &...; pattern or is only that.
See Regex Storm.

Getting exact substring that satisfies regex's match

I want to get the indices of the regular expression match below:
input : ab
regex: a(?=b)
The Match object contains information on the actual matched part of the string(a) and does not include the zero-width assertions that were required for the match to succeed. I want to be able to capture the exact substring that satisfies this match. I don't want to have to expand the string manually to do so. It seems to me there should be a method somewhere in the FCL.
Edit:
Just to make things more clear as there are recommendations as to not using lookaheads. I am well aware that I shouldn't be using lookaheads when I want to actually match a part of the string. However, the application I am working on receives a series of regular expressions to be used in a preprocessing stage. These regular expressions are out of my control. I cannot guarantee that they properly match the zero-width assertions. In this stage the matched regular expressions are replaced with a piece of text. In order for the following regular expression replace procedure to work, I need to be able to capture the substring in the string that satisfies the regular expression. Consider the code below:
string input = "abcdefg";
Regex regex = new Regex("a(?=b)");
Match m = regex.Match(input);
regex.Replace(m.Value, "z").Dump();
First notice that I want the replacement to happen only in the portion of the input that the match occurred and not the entire input. This is very important as I don't want all the matches to be replaced just yet. The code above's output is 'a' and not 'z'. The reason for that is that m.Value is a and the regex wouldn't replace a single a with z. It would replace the a found in 'ab' with 'z'. I want to be able to pass 'ab' to the Replace function.
Hope this clears things up.
You are using a wrong API for controlling the replacement: rather than passing the match back to regex, use the four-argument overload of Replace that gives you tighter control over what is being replaced in the original string, and what parts of the string to consider for the replacement:
string input = "abcdefg";
Regex regex = new Regex("a(?=b)");
regex.Replace(input , "z", 1, 0).Dump();
Only the first match will be replaced, starting at the index zero. If you would like to continue replacing additional matches, change the last parameter to the new starting index. Keep the third parameter at 1, so as to make at most one replacement.

Should i use regular expression in this situation?

I have a xml file containing certain expressions like this :-
1. AAaaaaa-1111
2. AAaaa-1111-aaa
3. AA11111-11111
4. AA111-111-111111
(AA static text) (aaaa-Any alphabet only) then hyphen (1111 - any digit only)
I was thinking i should write regular expression for these I believe regex should be the right approach.
But this XML file is dynamic. User can remove or add different expressions in the list. So How can i use regular expression here? Is there any dynamic regular expression kind of thing. Show me the light here please.
UPDATE:- I am using these expressions to validate user input. So whatever user is entering in a box, it should be matched with any of these expressions from the list.
For Example:-
If user enters
AAabc-4567-trr
, then it should be validated coz it matches with 2nd expression in the list
Well,
What I assume from your question is that:
A is the letter A
a is any letter
1 is any number
That's the only way I see AAabc-4567-trr matches AAaaa-1111-aaa
Is that correct?
If it is correct, yes, you could use Regular Expressions. What you need to do is translate your patterns to regex patterns. Assuming you have a new pattern:
AAA-aaa-111
to obtain the regex that will recognize that pattern, all you have to do is translate that pattern into regex patterns. For example:
string xmlPattern = "AAA-aaa-111"
string regexPattern = xmlPattern.Replace("a", "[a-zA-Z]").Replace("1", #"\d");
Edit:
You should take in count other characters that have special meanings in Regular Expressions, and translate/encode them properly. Maybe classify them. For example, these characters:
., $, ^
can be easily translated to regex patterns just encoding them with a \ before, so they will become:
\., \$, \^, ...
If you can specify what is the format of the validation patterns you are storing in the XML files, I could help you a little more, but I'm just writing this answer kind of blind ;)
Regular expressions that match certain sets of characters in a certain order are fairly simple. For example, this will match #2 (AAaaa-1111-aaa):
[A-Z]{2}[a-z]{3}-[0-9]{4}-[a-z]{3}
Breaking it down:
[A-Z]: Any character from A to Z. So any alphabetic, uppercase character.
{2}: Two of the previous item.
The rest of it works in the same way. The hyphens between things are there to match the hyphens in your expected input.

Regex search and replace where the replacement is a mod of the search term

i'm having a hard time finding a solution to this and am pretty sure that regex supports it. i just can't recall the name of the concept in the world of regex.
i need to search and replace a string for a specific pattern but the patterns can be different and the replacement needs to "remember" what it's replacing.
For example, say i have an arbitrary string: 134kshflskj9809hkj
and i want to surround the numbers with parentheses,
so the result would be: (134)kshflskj(9809)hkj
Finding numbers is simple enough, but how to surround them?
Can anyone provide a sample or point me in the right direction?
In some various langauges:
// C#:
string result = Regex.Replace(input, #"(\d+)", "($1)");
// JavaScript:
thestring.replace(/(\d+)/g, '($1)');
// Perl:
s/(\d+)/($1)/g;
// PHP:
$result = preg_replace("/(\d+)/", '($1)', $input);
The parentheses around (\d+) make it a "group" specifically the first (and only in this case) group which can be backreferenced in the replacement string. The g flag is required in some implementations to make it match multiple times in a single string). The replacement string is fairly similar although some languages will use \1 instead of $1 and some will allow both.
Most regex replacement functions allow you to reference capture groups specified in the regex (a.k.a. backreferences), when defining your replacement string. For instance, using preg_replace() from PHP:
$var = "134kshflskj9809hkj";
$result = preg_replace('/(\d+)/', '(\1)', $var);
// $result now equals "(134)kshflskj(9809)hkj"
where \1 means "the first capture group in the regex".
Another somewhat generic solution is this:
search : /([\d]+)([^\d]*)/g
replace: ($1)$2
([\d]+): match a set of one or more digits and retain them in a group
([^\d]*): match a set of non-digits, and retain them as well. \D could work here, too.
g: indicate this is a global expression, to work multiple times on the input.
($1): in the replace block, parens have no special meaning, so output the first group, surrounding it with parens.
$2: output the second group
I used a pretty good online regex tool to test out my expression. The next step would be to apply it to the language that you are using, as each has its own implemention nuance.
Backreferences (grouping) are not necessary if you're just looking to search for numbers and replace with the found regex surrounded by parens. It is simpler to use the whole regex match in the replacement string.
e.g for perl
$text =~ s/\d+/($&)/g;
This searches for 1 or more digits and replaces with parens surrounding the match (specified by $&), with trailing g to find and replace all occurrences.
see http://www.regular-expressions.info/refreplace.html for the correct syntax for your regex language.
Depending on your language, you're looking to match groups.
So typically you'll make a pattern in the form of
([0-9]{1,})|([a-zA-Z]{1,})
Then, you'll iterate over the resulting groups in (specific to your language).

Categories