Getting exact substring that satisfies regex's match - c#

I want to get the indices of the regular expression match below:
input : ab
regex: a(?=b)
The Match object contains information on the actual matched part of the string(a) and does not include the zero-width assertions that were required for the match to succeed. I want to be able to capture the exact substring that satisfies this match. I don't want to have to expand the string manually to do so. It seems to me there should be a method somewhere in the FCL.
Edit:
To make things clearer, since some answers recommend not using lookaheads: I am well aware that I shouldn't use lookaheads when I actually want to match a part of the string. However, the application I am working on receives a series of regular expressions to be used in a preprocessing stage. These regular expressions are out of my control. I cannot guarantee that they properly match the zero-width assertions. In this stage, the text matched by each regular expression is replaced with a piece of text. For the following regular expression replace procedure to work, I need to be able to capture the substring of the input that satisfies the regular expression. Consider the code below:
string input = "abcdefg";
Regex regex = new Regex("a(?=b)");
Match m = regex.Match(input);
regex.Replace(m.Value, "z").Dump();
First, notice that I want the replacement to happen only in the portion of the input where the match occurred, not in the entire input. This is very important, as I don't want all the matches to be replaced just yet. The code above outputs 'a', not 'z'. The reason is that m.Value is 'a', and when the regex is run against that single 'a', the lookahead has no 'b' to see, so nothing is replaced. It would replace the 'a' found in 'ab' with 'z'. I want to be able to pass 'ab' to the Replace function.
Hope this clears things up.

You are using the wrong API to control the replacement. Rather than passing the match back to the regex, use the four-argument overload of Replace, which gives you tighter control over what is replaced in the original string and which part of the string is considered for the replacement:
string input = "abcdefg";
Regex regex = new Regex("a(?=b)");
regex.Replace(input, "z", 1, 0).Dump();
Only the first match is replaced, searching from index zero. To continue replacing additional matches, change the last parameter to the new starting index. Keep the third parameter at 1 so that at most one replacement is made.
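To make the effect of the count and start-index arguments concrete, here is a minimal sketch. The input "abcabc" is my own choice so that there are two matches to step through; it is not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9+); on older compilers place this in Main.
string input = "abcabc";
Regex regex = new Regex("a(?=b)");

// Replace at most one match, searching from index 0.
string first = regex.Replace(input, "z", 1, 0);

// Replace at most one match, searching from index 1, which skips the first "a".
string second = regex.Replace(input, "z", 1, 1);

// The start index can also come from a Match, here the second match.
Match next = regex.Match(input).NextMatch();
string viaIndex = regex.Replace(input, "z", 1, next.Index);

Console.WriteLine(first);    // zbcabc
Console.WriteLine(second);   // abczbc
Console.WriteLine(viaIndex); // abczbc
```

Because the regex always runs against the full input, the lookahead can still see the 'b' that follows each 'a'.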

Related

regular expression for match special character in particular index C#

I need to write a regex, in C#, that highlights an opening brace "{" only when it appears at the 3rd position of the input string.
For example:
hi{there
In the example above, I have added { at the 3rd position, so that "{" needs to be highlighted.
As I am new to regular expressions, I don't know how to express this condition.
If you need your pattern to match only if it comes at the N-th position, you may use positive lookbehind ((?<=...)) checking "start of string (^) followed by N-1 characters (.{N-1})" condition. In your particular case it's
(?<=^.{2})\{
See demo
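A quick way to check the lookbehind in C# is sketched below. The second test string, with braces at the 2nd and 4th positions, is made up for contrast.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9+); on older compilers place this in Main.
Regex thirdPositionBrace = new Regex(@"(?<=^.{2})\{");

// "{" is the 3rd character (index 2), so the lookbehind succeeds.
Match hit = thirdPositionBrace.Match("hi{there");

// Braces at the 2nd and 4th positions only, so nothing matches.
Match miss = thirdPositionBrace.Match("h{i{there");

Console.WriteLine(hit.Success);  // True
Console.WriteLine(hit.Index);    // 2
Console.WriteLine(miss.Success); // False
```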

UB: C#'s Regex.Match returns whole string instead of part when matching

Attention! This is NOT related to Regex problem, matches the whole string instead of a part
Hi all.
I try to do
Match y = Regex.Match(someHebrewContainingLine, @"^.{0,9} - \[(.*)?\s\d{1,3}");
Aside from the other VS Hebrew quirks (how do you like ] being swapped for [ when editing the string?), it occasionally returns crazy results:
Match.Captures.Count = 1;
Match.Captures[0] = whole string! (not expected)
Match.Groups.Count = 2; (not expected)
Match.Groups[0] = whole string again! (not expected)
Match.Groups[1] = (.*)? value (expected).
Regex.Matches() is acting same way.
What can be a general reason for such behaviour? Note: it's not acting this way on a simple test strings like Regex.Match("-היי45--", "-(.{1,5})-") (sample is displayed incorrectly!, please look to the page's source code), there must be something with the regex which makes it greedy. The matched string contains [ .... ], but simply adding them to test string doesn't causes the same effect.
I hit this problem when I first started using the .NET regex, too. The way to understand this is to understand that the Group member of Match is the nesting member. You have to traverse Groups in order to get down to lower captures. Groups also have Capture members. The Match is kind of like the top "Group" in that it represents the successful "match" of the whole string against your expression. The single input string can have multiple matches. The Captures member represents the match of your full expression.
Whenever you have a single capture group, as you have here, Groups[1] will always be the data you are interested in. Look at this page. The source code in examples 2 and 3 is hardcoded to print out Groups[1].
Remember that a single capture can capture multiple substrings in a single match operation. If this were the case then you would see Match.Groups[1].Captures.Count be greater than 1. Also, I think if you passed in multiple matching lines of text to the single Match call, then you would see Match.Captures.Count be greater than 1, but each top-level Match.Captures would be the full string matched by your full expression.
There is one capture group in the pattern; that is group 1.
There is always group 0, which is the entire match.
Therefore there are a total of 2 groups.
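The group numbering can be seen directly in a small sketch. It reuses the shape of the question's simple test pattern, but with an ASCII input of my own instead of the Hebrew sample.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9+); on older compilers place this in Main.
Match m = Regex.Match("-abc45--", "-(.{1,5})-");

Console.WriteLine(m.Groups.Count);     // 2
Console.WriteLine(m.Groups[0].Value);  // -abc45-
Console.WriteLine(m.Groups[1].Value);  // abc45

// Match.Captures holds a single capture: the whole match again.
Console.WriteLine(m.Captures.Count);   // 1
Console.WriteLine(m.Captures[0].Value == m.Value); // True
```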
My test regex was different from any others in the project's scope (that's what happens when a Perl guy comes to C#), as it had no lookaheads/lookbehinds. So this discovery took some time.
Now, why we should call Regex behaviour undocumented, not undefined:
let's do some matches against "1.234567890".
PCRE-like syntax: (.)\.2345678
lookahead syntax: (.)(?=\.\d)
When you do a normal match, the result is copied from the whole matched part of the line, no matter where you put the parentheses; when lookaheads are present, only the part that does not belong to them is copied.
So, the matches will return:
PCRE: 1.2345678 (at 2300, this looks like original string and I start yelling here at SO)
lookahead: 1
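The two results above can be reproduced with a short sketch. In both cases Groups[1] is the same; only the overall match value differs.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9+); on older compilers place this in Main.
string input = "1.234567890";

// Everything the pattern consumes ends up in the match value.
Match consuming = Regex.Match(input, @"(.)\.2345678");

// Text checked by the lookahead is asserted but not consumed.
Match lookahead = Regex.Match(input, @"(.)(?=\.\d)");

Console.WriteLine(consuming.Value);            // 1.2345678
Console.WriteLine(consuming.Groups[1].Value);  // 1
Console.WriteLine(lookahead.Value);            // 1
Console.WriteLine(lookahead.Groups[1].Value);  // 1
```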

Extending regular expression syntax to say 'does not contain text XYZ'

I have an app where users can specify regular expressions in a number of places. These are used while running the app to check if text (e.g. URLs and HTML) matches the regexes. Often the users want to be able to say where the text matches ABC and does not match XYZ. To make it easy for them to do this I am thinking of extending regular expression syntax within my app with a way to say 'and does not contain pattern'. Any suggestions on a good way to do this?
My app is written in C# .NET 3.5.
My plan (before I got the awesome answers to this question...)
Currently I'm thinking of using the ¬ character: anything before the ¬ character is a normal regular expression, anything after the ¬ character is a regular expression that can not match in the text to be tested.
So I might use some regexes like this (contrived) example:
on (this|that|these) day(s)?¬(every|all) day(s) ?
Which for example would match 'on this day the man said...' but would not match 'on this day and every day after there will be ...'.
In my code that processes the regex I'll simply split out the two parts of the regex and process them separately, e.g.:
public bool IsMatchExtended(string textToTest, string extendedRegex)
{
int notPosition = extendedRegex.IndexOf('¬');
// Just a normal regex:
if (notPosition==-1)
return Regex.IsMatch(textToTest, extendedRegex);
// Use a positive (normal) regex and a negative one
string positiveRegex = extendedRegex.Substring(0, notPosition);
string negativeRegex = extendedRegex.Substring(notPosition + 1, extendedRegex.Length - notPosition - 1);
return Regex.IsMatch(textToTest, positiveRegex) && !Regex.IsMatch(textToTest, negativeRegex);
}
Any suggestions on a better way to implement such an extension? I'd need to be slightly cleverer about splitting the string on the ¬ character to allow for it to be escaped, so wouldn't just use the simple Substring() splitting above. Anything else to consider?
Alternative plan
In writing this question I also came across this answer which suggests using something like this:
^(?=(?:(?!negative pattern).)*$).*?positive pattern
So, instead of my original plan, I could just advise people to use a pattern like this when they want to NOT match certain text.
Would that do the equivalent of my original plan? I think it's quite an expensive way to do it performance-wise, and since I'm sometimes parsing large HTML documents this might be an issue, whereas I suppose my original plan would be more performant. Any thoughts (besides the obvious: 'try both and measure them!')?
Possibly pertinent for performance: sometimes there will be several 'words' or a more complex regex that can not be in the text, like (every|all) in my example above but with a few more variations.
Why!?
I know my original approach seems weird, e.g. why not just have two regexes!? But in my particular application administrators provide the regular expressions and it would be rather difficult to give them the ability to provide two regular expressions everywhere they can currently provide one. Much easier in this case to have a syntax for NOT - just trust me on that point.
I have an app that lets administrators define regular expressions at various configuration points. The regular expressions are just used to check if text or URLs match a certain pattern; replacements aren't made and capture groups aren't used. However, often they would like to specify a pattern that says 'where ABC is not in the text'. It's notoriously difficult to do NOT matching in regular expressions, so the usual way is to have two regular expressions: one to specify a pattern that must be matched and one to specify a pattern that must not be matched. If the first is matched and the second is not then the text does match. In my application it would be a lot of work to add the ability to have a second regular expression at each place users can provide one now, so I would like to extend regular expression syntax with a way to say 'and does not contain pattern'.
You don't need to introduce a new symbol. There already is support for what you need in most regex engines. It's just a matter of learning it and applying it.
You have concerns about performance, but have you tested it? Have you measured and demonstrated those performance problems? It will probably be just fine.
Regex works for many many people, in many many different scenarios. It probably fits your requirements, too.
Also, the complicated regex you found on the other SO question, can be simplified. There are simple expressions for negative and positive lookaheads and lookbehinds.
?! ?<! ?= ?<=
Some examples
Suppose the sample text is <tr valign='top'><td>Albatross</td></tr>
Given the following regex's, these are the results you will see:
tr - match
td - match
^td - no match
^tr - no match
^<tr - match
^<tr>.*</tr> - no match
^<tr.*>.*</tr> - match
^<tr.*>.*</tr>(?<=tr>) - match
^<tr.*>.*</tr>(?<!tr>) - no match
^<tr.*>.*</tr>(?<!Albatross) - match
^<tr.*>.*</tr>(?<!.*Albatross.*) - no match
^(?!.*Albatross.*)<tr.*>.*</tr> - no match
Explanations
The first two match because the regex can apply anywhere in the sample (or test) string. The second two do not match, because the ^ says "start at the beginning", and the test string does not begin with td or tr - it starts with a left angle bracket.
The fifth example matches because the test string starts with <tr.
The sixth does not, because it wants the sample string to begin with <tr>, with a closing angle bracket immediately following the tr, but in the actual test string, the opening tr includes the valign attribute, so what follows tr is a space. The 7th regex shows how to allow the space and the attribute with wildcards.
The 8th regex applies a positive lookbehind assertion to the end of the regex, using ?<=. It says: match the entire regex only if what immediately precedes the cursor in the test string matches what's in the parens following the ?<=. In this case, that is tr>. After evaluating ^<tr.*>.*</tr>, the cursor in the test string is positioned at the end of the test string. Therefore, the tr> is matched against the end of the test string, which evaluates to TRUE. Therefore the positive lookbehind evaluates to true, and the overall regex matches.
The ninth example shows how to insert a negative lookbehind assertion, using ?<! . Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what follows ?<! in the parens", which in this case is tr>. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. The pattern tr> does match the end of the string, but this is a negative assertion, so it evaluates to FALSE, which means the 9th example is NOT a match.
The tenth example uses another negative lookbehind assertion. Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what's in the parens", in this case Albatross. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. Checking "Albatross" against the end of the string yields a negative match, because the test string ends in </tr>. Because the pattern inside the parens of the negative lookbehind does NOT match, the negative lookbehind evaluates to TRUE, which means the 10th example is a match.
The 11th example extends the negative lookbehind to include wildcards; in English, the result of the negative lookbehind is "only match if the preceding string does not include the word Albatross". In this case the test string DOES include the word, so the negative lookbehind evaluates to FALSE, and the 11th regex does not match.
The 12th example uses a negative lookahead assertion. Like lookbehinds, lookaheads are zero-width: they do not move the cursor within the test string for the purposes of string matching. The lookahead in this case rejects the string right away, because .*Albatross.* matches; because it is a negative lookahead, it evaluates to FALSE, which means the overall regex fails to match, and evaluation of the regex against the test string stops there.
Example 12 always evaluates to the same boolean value as example 11, but it behaves differently at runtime. In example 12, the negative check is performed first and stops immediately. In example 11, the full regex is applied and evaluates to TRUE before the lookbehind assertion is checked. So you can see that there may be performance differences when comparing lookaheads and lookbehinds. Which one is right for you depends on what you are matching on, and the relative complexity of the "positive match" pattern and the "negative match" pattern.
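If you would rather check the lookaround examples than take my word for it, a small sketch like this prints the verdict for examples 8 through 12 against the sample text. Note that the variable-length lookbehind in example 11 relies on .NET's regex engine, which allows it.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9+); on older compilers place this in Main.
string sample = "<tr valign='top'><td>Albatross</td></tr>";

string[] patterns =
{
    @"^<tr.*>.*</tr>(?<=tr>)",            // example 8:  match
    @"^<tr.*>.*</tr>(?<!tr>)",            // example 9:  no match
    @"^<tr.*>.*</tr>(?<!Albatross)",      // example 10: match
    @"^<tr.*>.*</tr>(?<!.*Albatross.*)",  // example 11: no match
    @"^(?!.*Albatross.*)<tr.*>.*</tr>",   // example 12: no match
};

foreach (string pattern in patterns)
{
    string verdict = Regex.IsMatch(sample, pattern) ? "match" : "no match";
    Console.WriteLine(pattern + " - " + verdict);
}
```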
For more on this stuff, read up at http://www.regular-expressions.info/
Or get a regex evaluator tool and try out some tests.
like this tool:
source and binary
You can easily accomplish your objectives using a single regex. Here is an example which demonstrates one way to do it. This regex matches a string containing "cat" AND "lion" AND "tiger", but does NOT contain "dog" OR "wolf" OR "hyena":
if (Regex.IsMatch(text, @"
# Match string containing all of one set of words but none of another.
^ # anchor to start of string.
# Positive look ahead assertions for required substrings.
(?=.*? cat ) # Assert string has: 'cat'.
(?=.*? lion ) # Assert string has: 'lion'.
(?=.*? tiger ) # Assert string has: 'tiger'.
# Negative look ahead assertions for not-allowed substrings.
(?!.*? dog ) # Assert string does not have: 'dog'.
(?!.*? wolf ) # Assert string does not have: 'wolf'.
(?!.*? hyena ) # Assert string does not have: 'hyena'.
",
RegexOptions.Singleline | RegexOptions.IgnoreCase |
RegexOptions.IgnorePatternWhitespace)) {
// Successful match
} else {
// Match attempt failed
}
You can see the needed pattern. When assembling the regex, be sure to run each of the user-provided substrings through the Regex.Escape() method to escape any metacharacters it may contain (e.g. (, ), | etc.). Also, the above regex is written in free-spacing mode for readability. Your production regex should NOT use this mode, otherwise whitespace within the user substrings would be ignored.
You may want to add \b word boundaries before and after each "word" in each assertion if the substrings consist of only real words.
Note also that the negative assertion can be made a bit more efficient using the following alternative syntax:
(?!.*?(?:dog|wolf|hyena))
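Assembling that pattern from user-supplied lists might look roughly like the sketch below. BuildPattern is a hypothetical helper name of mine, not a framework method, and it assumes the inputs are literal substrings rather than sub-patterns. It uses the combined negative lookahead and no free-spacing mode.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Top-level statements (C# 9+); on older compilers move this into a class.
// BuildPattern is a hypothetical helper: one positive lookahead per required
// substring, plus a single negative lookahead covering all forbidden ones.
string BuildPattern(string[] required, string[] forbidden)
{
    string positives = string.Concat(
        required.Select(w => "(?=.*?" + Regex.Escape(w) + ")"));
    string negatives = forbidden.Length == 0
        ? ""
        : "(?!.*?(?:" + string.Join("|", forbidden.Select(w => Regex.Escape(w))) + "))";
    return "^" + positives + negatives;
}

string pattern = BuildPattern(
    new[] { "cat", "lion", "tiger" },
    new[] { "dog", "wolf", "hyena" });
RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

Console.WriteLine(pattern);
Console.WriteLine(Regex.IsMatch("A cat, a lion and a tiger", pattern, options));          // True
Console.WriteLine(Regex.IsMatch("A cat, a lion, a tiger and a wolf", pattern, options));  // False
```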

Should i use regular expression in this situation?

I have an XML file containing certain expressions like this:
1. AAaaaaa-1111
2. AAaaa-1111-aaa
3. AA11111-11111
4. AA111-111-111111
(AA static text) (aaaa-Any alphabet only) then hyphen (1111 - any digit only)
I was thinking I should write regular expressions for these; I believe regex is the right approach.
But this XML file is dynamic: the user can remove or add expressions in the list. So how can I use regular expressions here? Is there some kind of dynamic regular expression? Show me the light here, please.
UPDATE:- I am using these expressions to validate user input. So whatever user is entering in a box, it should be matched with any of these expressions from the list.
For Example:-
If user enters
AAabc-4567-trr
, then it should be validated because it matches the 2nd expression in the list
Well,
What I assume from your question is that:
A is the letter A
a is any letter
1 is any number
That's the only way I see AAabc-4567-trr matches AAaaa-1111-aaa
Is that correct?
If it is correct, yes, you could use Regular Expressions. What you need to do is translate your patterns to regex patterns. Assuming you have a new pattern:
AAA-aaa-111
to obtain the regex that will recognize that pattern, all you have to do is translate that pattern into regex patterns. For example:
string xmlPattern = "AAA-aaa-111";
string regexPattern = xmlPattern.Replace("a", "[a-zA-Z]").Replace("1", @"\d");
Edit:
You should also take into account other characters that have special meanings in regular expressions, and translate/encode them properly. Maybe classify them. For example, these characters:
., $, ^
can be easily translated to regex patterns just encoding them with a \ before, so they will become:
\., \$, \^, ...
If you can specify what is the format of the validation patterns you are storing in the XML files, I could help you a little more, but I'm just writing this answer kind of blind ;)
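Putting the translation and the escaping together, one possible sketch is below. ToRegex is a hypothetical helper name, and it assumes my reading of the format: a is any letter, 1 is any digit, and everything else is literal. The ^ and $ anchors are my addition so that the whole input must fit the pattern.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Top-level statements (C# 9+); on older compilers move this into a class.
// ToRegex is a hypothetical helper that translates one XML pattern
// character by character.
string ToRegex(string xmlPattern)
{
    var sb = new StringBuilder("^");
    foreach (char c in xmlPattern)
    {
        if (c == 'a') sb.Append("[a-zA-Z]");              // any letter
        else if (c == '1') sb.Append(@"\d");              // any digit
        else sb.Append(Regex.Escape(c.ToString()));       // literal, escaped if special
    }
    return sb.Append('$').ToString();
}

string regexPattern = ToRegex("AAaaa-1111-aaa");

Console.WriteLine(Regex.IsMatch("AAabc-4567-trr", regexPattern)); // True
Console.WriteLine(Regex.IsMatch("AAabc-4567", regexPattern));     // False
```

To validate against the whole list, you would run the user input against each translated pattern and accept it if any of them matches.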
Regular expressions that match certain sets of characters in a certain order are fairly simple. For example, this will match #2 (AAaaa-1111-aaa):
[A-Z]{2}[a-z]{3}-[0-9]{4}-[a-z]{3}
Breaking it down:
[A-Z]: Any character from A to Z. So any alphabetic, uppercase character.
{2}: Two of the previous item.
The rest of it works in the same way. The hyphens between things are there to match the hyphens in your expected input.

Regex search and replace where the replacement is a mod of the search term

I'm having a hard time finding a solution to this and am pretty sure that regex supports it. I just can't recall the name of the concept in the world of regex.
I need to search a string for a specific pattern and replace it, but the patterns can be different and the replacement needs to "remember" what it's replacing.
For example, say i have an arbitrary string: 134kshflskj9809hkj
and i want to surround the numbers with parentheses,
so the result would be: (134)kshflskj(9809)hkj
Finding numbers is simple enough, but how to surround them?
Can anyone provide a sample or point me in the right direction?
In various languages:
// C#:
string result = Regex.Replace(input, @"(\d+)", "($1)");
// JavaScript:
thestring.replace(/(\d+)/g, '($1)');
// Perl:
s/(\d+)/($1)/g;
// PHP:
$result = preg_replace("/(\d+)/", '($1)', $input);
The parentheses around (\d+) make it a "group", specifically the first (and in this case only) group, which can be backreferenced in the replacement string. The g flag is required in some implementations to make it match multiple times in a single string. The replacement string is fairly similar across languages, although some use \1 instead of $1 and some allow both.
Most regex replacement functions allow you to reference capture groups specified in the regex (a.k.a. backreferences), when defining your replacement string. For instance, using preg_replace() from PHP:
$var = "134kshflskj9809hkj";
$result = preg_replace('/(\d+)/', '(\1)', $var);
// $result now equals "(134)kshflskj(9809)hkj"
where \1 means "the first capture group in the regex".
Another somewhat generic solution is this:
search : /([\d]+)([^\d]*)/g
replace: ($1)$2
([\d]+): match a set of one or more digits and retain them in a group
([^\d]*): match a set of non-digits, and retain them as well. \D could work here, too.
g: indicate this is a global expression, to work multiple times on the input.
($1): in the replace block, parens have no special meaning, so output the first group, surrounding it with parens.
$2: output the second group
I used a pretty good online regex tool to test out my expression. The next step would be to apply it to the language that you are using, as each has its own implementation nuances.
Backreferences (grouping) are not necessary if you're just looking to search for numbers and replace with the found regex surrounded by parens. It is simpler to use the whole regex match in the replacement string.
e.g for perl
$text =~ s/\d+/($&)/g;
This searches for 1 or more digits and replaces with parens surrounding the match (specified by $&), with trailing g to find and replace all occurrences.
see http://www.regular-expressions.info/refreplace.html for the correct syntax for your regex language.
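For comparison, the same whole-match idea in C# could be sketched as follows. .NET replacement strings also accept $& for the entire match, and a MatchEvaluator lambda is an alternative when the replacement needs more logic.

```csharp
using System;
using System.Text.RegularExpressions;

// Top-level statements (C# 9+); on older compilers place this in Main.
string input = "134kshflskj9809hkj";

// $& stands for the whole match, so no capture group is needed.
string viaSubstitution = Regex.Replace(input, @"\d+", "($&)");

// Same result with a MatchEvaluator lambda.
string viaEvaluator = Regex.Replace(input, @"\d+", m => "(" + m.Value + ")");

Console.WriteLine(viaSubstitution); // (134)kshflskj(9809)hkj
Console.WriteLine(viaEvaluator);    // (134)kshflskj(9809)hkj
```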
Depending on your language, you're looking to match groups.
So typically you'll make a pattern in the form of
([0-9]{1,})|([a-zA-Z]{1,})
Then you'll iterate over the resulting groups in whatever way your language provides.
