Regular expression to match a special character at a particular index in C#

I need to write a regex that highlights an opening brace "{" only when it appears at the 3rd position of a given input string, in C#.
For example:
hi{there
In the example above, the { is at the 3rd position, so it needs to be highlighted.
As I am new to regular expressions, I don't know how to express this condition.

If you need your pattern to match only if it comes at the N-th position, you may use positive lookbehind ((?<=...)) checking "start of string (^) followed by N-1 characters (.{N-1})" condition. In your particular case it's
(?<=^.{2})\{
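As a minimal sketch of the lookbehind approach (written as C# top-level statements, C# 9 or later), the snippet below builds the pattern for an arbitrary position. The helper name AtPosition is hypothetical and not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: builds a pattern that matches `literal` only when it
// starts at the n-th character (1-based) of the input.
Regex AtPosition(string literal, int n) =>
    new Regex("(?<=^.{" + (n - 1) + "})" + Regex.Escape(literal));

Regex braceAtThird = AtPosition("{", 3);   // same pattern as (?<=^.{2})\{

Match hit = braceAtThird.Match("hi{there");
Console.WriteLine(hit.Success + " at index " + hit.Index);   // True at index 2

Console.WriteLine(braceAtThird.IsMatch("hello{there"));      // False
```

Match.Index is zero-based, so the 3rd position is reported as index 2.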

Related

Getting exact substring that satisfies regex's match

I want to get the indices of the regular expression match below:
input : ab
regex: a(?=b)
The Match object contains information on the actual matched part of the string(a) and does not include the zero-width assertions that were required for the match to succeed. I want to be able to capture the exact substring that satisfies this match. I don't want to have to expand the string manually to do so. It seems to me there should be a method somewhere in the FCL.
Edit:
Just to make things more clear as there are recommendations as to not using lookaheads. I am well aware that I shouldn't be using lookaheads when I want to actually match a part of the string. However, the application I am working on receives a series of regular expressions to be used in a preprocessing stage. These regular expressions are out of my control. I cannot guarantee that they properly match the zero-width assertions. In this stage the matched regular expressions are replaced with a piece of text. In order for the following regular expression replace procedure to work, I need to be able to capture the substring in the string that satisfies the regular expression. Consider the code below:
string input = "abcdefg";
Regex regex = new Regex("a(?=b)");
Match m = regex.Match(input);
regex.Replace(m.Value, "z").Dump();
First notice that I want the replacement to happen only in the portion of the input that the match occurred and not the entire input. This is very important as I don't want all the matches to be replaced just yet. The code above's output is 'a' and not 'z'. The reason for that is that m.Value is a and the regex wouldn't replace a single a with z. It would replace the a found in 'ab' with 'z'. I want to be able to pass 'ab' to the Replace function.
Hope this clears things up.
You are using the wrong API for controlling the replacement: rather than passing the match back to the regex, use the four-argument overload of Replace (input, replacement, count, startat), which gives you tighter control over how many matches are replaced in the original string and where the search starts:
string input = "abcdefg";
Regex regex = new Regex("a(?=b)");
regex.Replace(input , "z", 1, 0).Dump();
Only the first match will be replaced, starting at the index zero. If you would like to continue replacing additional matches, change the last parameter to the new starting index. Keep the third parameter at 1, so as to make at most one replacement.
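To illustrate how the count and startat arguments interact, here is a small sketch (C# top-level statements) on an input with several matches. The input string is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "ab ab ab";
Regex regex = new Regex("a(?=b)");

// Replace only the first match, searching from index 0.
string first = regex.Replace(input, "z", 1, 0);
Console.WriteLine(first);    // zb ab ab

// Replace only the first match found at or after index 3;
// the text before that index is copied through unchanged.
string second = regex.Replace(input, "z", 1, 3);
Console.WriteLine(second);   // ab zb ab
```

Because the whole input is passed in, the lookahead still sees the b that follows each a, which is what passing m.Value alone loses.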

How to create a regular expression based on some condition

I want to create a regular expression to find and replace uppercase characters based on some conditions.
Find the first uppercase letter of each group of uppercase characters in a string, replace it with lowercase, and put a * before it.
If a lowercase letter follows an uppercase letter, replace that uppercase letter with lowercase and put a * before it.
input string : stackOVERFlow
expected output : stack*over*flow
I tried but could not get it working perfectly.
Any idea how to create such a regular expression?
Thanks
Well the expected inputs and outputs are slightly illogical: you're lower-casing the "f" in "flow" but not including it in the asterisk.
Anyway, the regex you want is pretty simple: @"[A-Z]+". This matches a run of one or more uppercase alpha characters. The quantifier has to be greedy: a lazy +? would stop after one character and match each uppercase letter separately.
Now, to do the find/replace, you would do something like the following:
Regex.Replace(inputString, @"([A-Z]+)", "*$1*").ToLower();
This simply finds all occurrences of one or more uppercase alpha characters, and wherever it finds a match it replaces it with itself surrounded by asterisks. This does the surrounding but not the lowercasing; .NET Regex doesn't provide for that kind of string modification. However, since the end result of the operation should be a string with all lowercase chars, just do exactly that with a ToLower() and you'll get the expected result.
KeithS's solution can be simplified a bit
Regex.Replace("stackOVERFlow","[A-Z]+","*$0*").ToLower()
However, this will yield stack*overf*low including the f between the stars. If you want to exclude the last upper case letter, use the following expression
Regex.Replace("stackOVERFlow","[A-Z]+(?=[A-Z])","*$0*").ToLower()
It will yield stack*over*flow
This uses the pattern find(?=suffix), which matches find only where it is immediately followed by suffix, without including the suffix in the match.
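For comparison, this short sketch (C# top-level statements) runs the greedy pattern, the lookahead variant, and a lazy quantifier on the question's input, so the difference between them is visible side by side.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "stackOVERFlow";

// Greedy run of uppercase letters: the trailing F ends up inside the stars.
string greedy = Regex.Replace(input, "[A-Z]+", "*$0*").ToLower();
Console.WriteLine(greedy);    // stack*overf*low

// The lookahead (?=[A-Z]) makes the match stop before the last uppercase letter.
string trimmed = Regex.Replace(input, "[A-Z]+(?=[A-Z])", "*$0*").ToLower();
Console.WriteLine(trimmed);   // stack*over*flow

// A lazy quantifier matches one letter at a time, so every letter is wrapped.
string lazy = Regex.Replace(input, "[A-Z]+?", "*$0*").ToLower();
Console.WriteLine(lazy);      // stack*o**v**e**r**f*low
```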

Extending regular expression syntax to say 'does not contain text XYZ'

I have an app where users can specify regular expressions in a number of places. These are used while running the app to check if text (e.g. URLs and HTML) matches the regexes. Often the users want to be able to say where the text matches ABC and does not match XYZ. To make it easy for them to do this I am thinking of extending regular expression syntax within my app with a way to say 'and does not contain pattern'. Any suggestions on a good way to do this?
My app is written in C# .NET 3.5.
My plan (before I got the awesome answers to this question...)
Currently I'm thinking of using the ¬ character: anything before the ¬ character is a normal regular expression, anything after the ¬ character is a regular expression that can not match in the text to be tested.
So I might use some regexes like this (contrived) example:
on (this|that|these) day(s)?¬(every|all) day(s) ?
Which for example would match 'on this day the man said...' but would not match 'on this day and every day after there will be ...'.
In my code that processes the regex I'll simply split out the two parts of the regex and process them separately, e.g.:
public bool IsMatchExtended(string textToTest, string extendedRegex)
{
    int notPosition = extendedRegex.IndexOf('¬');
    // Just a normal regex:
    if (notPosition == -1)
        return Regex.IsMatch(textToTest, extendedRegex);
    // Use a positive (normal) regex and a negative one
    string positiveRegex = extendedRegex.Substring(0, notPosition);
    string negativeRegex = extendedRegex.Substring(notPosition + 1, extendedRegex.Length - notPosition - 1);
    return Regex.IsMatch(textToTest, positiveRegex) && !Regex.IsMatch(textToTest, negativeRegex);
}
Any suggestions on a better way to implement such an extension? I'd need to be slightly cleverer about splitting the string on the ¬ character to allow for it to be escaped, so wouldn't just use the simple Substring() splitting above. Anything else to consider?
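One possible way to handle escaping is sketched below (C# top-level statements). It assumes a convention the question does not actually specify: a ¬ preceded by an odd number of backslashes is a literal character, and the first unescaped ¬ is the separator. FindSeparator is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

// Returns the index of the first '¬' that is not preceded by an odd number
// of backslashes, or -1 if there is none (assumed escaping convention).
int FindSeparator(string pattern)
{
    for (int i = 0; i < pattern.Length; i++)
    {
        if (pattern[i] != '¬') continue;
        int backslashes = 0;
        for (int j = i - 1; j >= 0 && pattern[j] == '\\'; j--) backslashes++;
        if (backslashes % 2 == 0) return i;
    }
    return -1;
}

bool IsMatchExtended(string text, string extendedRegex)
{
    int sep = FindSeparator(extendedRegex);
    if (sep == -1) return Regex.IsMatch(text, extendedRegex);

    string positive = extendedRegex.Substring(0, sep);
    string negative = extendedRegex.Substring(sep + 1);
    return Regex.IsMatch(text, positive) && !Regex.IsMatch(text, negative);
}

string rule = "on (this|that|these) days?¬(every|all) days?";
Console.WriteLine(IsMatchExtended("on this day the man said", rule));         // True
Console.WriteLine(IsMatchExtended("on this day and every day after", rule));  // False
Console.WriteLine(IsMatchExtended("a ¬ b", @"a \¬ b"));                       // True
```

The escaped form \¬ can be left in the pattern as is, since the .NET engine treats a backslash before a non-alphanumeric character as that literal character.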
Alternative plan
In writing this question I also came across this answer which suggests using something like this:
^(?=(?:(?!negative pattern).)*$).*?positive pattern
So, instead of my original plan, I could just advise people to use a pattern like this when they want to NOT match certain text.
Would that do the equivalent of my original plan? I think it's quite an expensive way to do it performance-wise, and since I'm sometimes parsing large HTML documents this might be an issue, whereas I suppose my original plan would be more performant. Any thoughts (besides the obvious: 'try both and measure them!')?
Possibly pertinent for performance: sometimes there will be several 'words' or a more complex regex that can not be in the text, like (every|all) in my example above but with a few more variations.
Why!?
I know my original approach seems weird, e.g. why not just have two regexes!? But in my particular application administrators provide the regular expressions and it would be rather difficult to give them the ability to provide two regular expressions everywhere they can currently provide one. Much easier in this case to have a syntax for NOT - just trust me on that point.
I have an app that lets administrators define regular expressions at various configuration points. The regular expressions are just used to check if text or URLs match a certain pattern; replacements aren't made and capture groups aren't used. However, often they would like to specify a pattern that says 'where ABC is not in the text'. It's notoriously difficult to do NOT matching in regular expressions, so the usual way is to have two regular expressions: one to specify a pattern that must be matched and one to specify a pattern that must not be matched. If the first is matched and the second is not then the text does match. In my application it would be a lot of work to add the ability to have a second regular expression at each place users can provide one now, so I would like to extend regular expression syntax with a way to say 'and does not contain pattern'.
You don't need to introduce a new symbol. There already is support for what you need in most regex engines. It's just a matter of learning it and applying it.
You have concerns about performance, but have you tested it? Have you measured and demonstrated those performance problems? It will probably be just fine.
Regex works for many many people, in many many different scenarios. It probably fits your requirements, too.
Also, the complicated regex you found on the other SO question, can be simplified. There are simple expressions for negative and positive lookaheads and lookbehinds.
?! ?<! ?= ?<=
Some examples
Suppose the sample text is <tr valign='top'><td>Albatross</td></tr>
Given the following regex's, these are the results you will see:
tr - match
td - match
^td - no match
^tr - no match
^<tr - match
^<tr>.*</tr> - no match
^<tr.*>.*</tr> - match
^<tr.*>.*</tr>(?<=tr>) - match
^<tr.*>.*</tr>(?<!tr>) - no match
^<tr.*>.*</tr>(?<!Albatross) - match
^<tr.*>.*</tr>(?<!.*Albatross.*) - no match
^(?!.*Albatross.*)<tr.*>.*</tr> - no match
Explanations
The first two match because the regex can apply anywhere in the sample (or test) string. The second two do not match, because the ^ says "start at the beginning", and the test string does not begin with td or tr - it starts with a left angle bracket.
The fifth example matches because the test string starts with <tr.
The sixth does not, because it wants the sample string to begin with <tr>, with a closing angle bracket immediately following the tr, but in the actual test string, the opening tr includes the valign attribute, so what follows tr is a space. The 7th regex shows how to allow the space and the attribute with wildcards.
The 8th regex applies a positive lookbehind assertion to the end of the regex, using ?<=. It says: match the entire regex only if what immediately precedes the cursor in the test string matches what's in the parens following the ?<=. In this case, that is tr>. After evaluating ^<tr.*>.*</tr>, the cursor in the test string is positioned at the end of the test string. Therefore tr> is matched against the end of the test string, which evaluates to TRUE. The positive lookbehind therefore evaluates to TRUE, and the overall regex matches.
The ninth example shows how to use a negative lookbehind assertion, written ?<!. Basically it says "allow the regex to match only if what's right behind the cursor at this point does not match what follows ?<! in the parens", which in this case is tr>. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string, and the pattern tr> does match the end of the string. Because this is a negative assertion, it evaluates to FALSE, which means the 9th example is NOT a match.
The tenth example uses another negative lookbehind assertion. Basically it says "allow the regex to match only if what's right behind the cursor at this point does not match what's in the parens", in this case Albatross. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. Checking "Albatross" against the end of the string yields a negative match, because the test string ends in </tr>. Because the pattern inside the parens of the negative lookbehind does NOT match, the negative lookbehind evaluates to TRUE, which means the 10th example is a match.
The 11th example extends the negative lookbehind to include wildcards; in English, the negative lookbehind says "only match if the preceding string does not include the word Albatross". In this case the test string DOES include the word, so the negative lookbehind evaluates to FALSE and the 11th regex does not match.
The 12th example uses a negative lookahead assertion. Like lookbehinds, lookaheads are zero-width: they do not move the cursor within the test string for the purposes of string matching. The lookahead in this case rejects the string right away, because .*Albatross.* matches; because it is a negative lookahead, it evaluates to FALSE, which means the overall regex fails to match, and evaluation of the regex against the test string stops there.
Example 12 always evaluates to the same boolean value as example 11, but it behaves differently at runtime. In example 12, the negative check is performed first, and evaluation stops immediately. In example 11, the full regex is applied and matches before the lookbehind assertion is checked. So you can see that there may be performance differences when comparing lookaheads and lookbehinds. Which one is right for you depends on what you are matching on, and the relative complexity of the "positive match" pattern and the "negative match" pattern.
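The lookaround examples can be checked directly. The sketch below (C# top-level statements) runs examples 7 through 12 against the sample text, with the positive lookbehind written in its (?<=...) form.

```csharp
using System;
using System.Text.RegularExpressions;

string sample = "<tr valign='top'><td>Albatross</td></tr>";

string[] patterns =
{
    @"^<tr.*>.*</tr>",                    // 7: plain match
    @"^<tr.*>.*</tr>(?<=tr>)",            // 8: positive lookbehind
    @"^<tr.*>.*</tr>(?<!tr>)",            // 9: negative lookbehind
    @"^<tr.*>.*</tr>(?<!Albatross)",      // 10: negative lookbehind
    @"^<tr.*>.*</tr>(?<!.*Albatross.*)",  // 11: variable-length lookbehind (.NET allows this)
    @"^(?!.*Albatross.*)<tr.*>.*</tr>",   // 12: negative lookahead
};

foreach (string p in patterns)
    Console.WriteLine(p + " - " + (Regex.IsMatch(sample, p) ? "match" : "no match"));
```

Note that the variable-length lookbehind in example 11 is accepted by the .NET engine but is rejected by many other regex flavors, so the lookahead form in example 12 is the more portable of the two.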
For more on this stuff, read up at http://www.regular-expressions.info/
Or get a regex evaluator tool and try out some tests.
You can easily accomplish your objectives using a single regex. Here is an example which demonstrates one way to do it. This regex matches a string containing "cat" AND "lion" AND "tiger", but does NOT contain "dog" OR "wolf" OR "hyena":
if (Regex.IsMatch(text, @"
# Match string containing all of one set of words but none of another.
^ # anchor to start of string.
# Positive look ahead assertions for required substrings.
(?=.*? cat ) # Assert string has: 'cat'.
(?=.*? lion ) # Assert string has: 'lion'.
(?=.*? tiger ) # Assert string has: 'tiger'.
# Negative look ahead assertions for not-allowed substrings.
(?!.*? dog ) # Assert string does not have: 'dog'.
(?!.*? wolf ) # Assert string does not have: 'wolf'.
(?!.*? hyena ) # Assert string does not have: 'hyena'.
",
RegexOptions.Singleline | RegexOptions.IgnoreCase |
RegexOptions.IgnorePatternWhitespace)) {
// Successful match
} else {
// Match attempt failed
}
You can see the needed pattern. When assembling the regex, be sure to run each of the user-provided substrings through the Regex.Escape() method to escape any metacharacters it may contain (e.g. (, ), |). Also, the above regex is written in free-spacing mode for readability. Your production regex should NOT use this mode; otherwise whitespace within the user substrings would be ignored.
You may want to add \b word boundaries before and after each "word" in each assertion if the substrings consist of only real words.
Note also that the negative assertion can be made a bit more efficient using the following alternative syntax:
(?!.*?(?:dog|wolf|hyena))
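A rough sketch of assembling such a pattern from user-supplied word lists follows (C# top-level statements). BuildFilter is a hypothetical helper, and the sample sentences are made up. It escapes every substring and does not use free-spacing mode.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: one positive lookahead per required substring and a
// single negative lookahead covering all forbidden substrings.
Regex BuildFilter(string[] required, string[] forbidden)
{
    string positives = string.Concat(
        required.Select(w => "(?=.*?" + Regex.Escape(w) + ")"));
    // An empty alternation would always match and reject everything,
    // so the negative part is omitted when nothing is forbidden.
    string negatives = forbidden.Length == 0
        ? ""
        : "(?!.*?(?:" + string.Join("|", forbidden.Select(Regex.Escape)) + "))";
    return new Regex("^" + positives + negatives,
                     RegexOptions.Singleline | RegexOptions.IgnoreCase);
}

Regex filter = BuildFilter(
    new[] { "cat", "lion", "tiger" },
    new[] { "dog", "wolf", "hyena" });

Console.WriteLine(filter.IsMatch("A lion, a tiger and a cat."));        // True
Console.WriteLine(filter.IsMatch("A lion, a tiger, a cat and a dog"));  // False
Console.WriteLine(filter.IsMatch("Only a cat and a lion"));             // False
```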

Should I use a regular expression in this situation?

I have a xml file containing certain expressions like this :-
1. AAaaaaa-1111
2. AAaaa-1111-aaa
3. AA11111-11111
4. AA111-111-111111
(AA static text) (aaaa-Any alphabet only) then hyphen (1111 - any digit only)
I was thinking I should write regular expressions for these; I believe regex is the right approach.
But this XML file is dynamic: users can add or remove expressions in the list. So how can I use regular expressions here? Is there some kind of dynamic regular expression? Show me the light here, please.
UPDATE: I am using these expressions to validate user input. Whatever the user enters in a box should match one of the expressions in the list.
For example, if the user enters
AAabc-4567-trr
then it should be accepted, because it matches the 2nd expression in the list.
Well,
What I assume from your question is that:
A is the letter A
a is any letter
1 is any number
That's the only way I see AAabc-4567-trr matches AAaaa-1111-aaa
Is that correct?
If it is correct, yes, you could use Regular Expressions. What you need to do is translate your patterns to regex patterns. Assuming you have a new pattern:
AAA-aaa-111
to obtain the regex that will recognize that pattern, all you have to do is translate that pattern into regex patterns. For example:
string xmlPattern = "AAA-aaa-111";
string regexPattern = xmlPattern.Replace("a", "[a-zA-Z]").Replace("1", @"\d");
Edit:
You should take into account other characters that have special meanings in regular expressions, and translate/encode them properly. Maybe classify them. For example, these characters:
., $, ^
can be translated to regex patterns simply by putting a \ before them, so they become:
\., \$, \^, ...
If you can specify what is the format of the validation patterns you are storing in the XML files, I could help you a little more, but I'm just writing this answer kind of blind ;)
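Putting the translation and the escaping together, a sketch might look like this (C# top-level statements). It relies on the convention guessed at above (a is any letter, 1 is any digit, everything else is literal), which is an assumption about the XML format. TemplateToRegex is a hypothetical name.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Translates a template into an anchored regex under the assumed convention:
// 'a' = any letter, '1' = any digit, any other character is taken literally.
string TemplateToRegex(string template)
{
    var sb = new StringBuilder("^");
    foreach (char c in template)
    {
        if (c == 'a') sb.Append("[a-zA-Z]");
        else if (c == '1') sb.Append(@"\d");
        else sb.Append(Regex.Escape(c.ToString()));  // escapes ., $, ^ and friends
    }
    return sb.Append('$').ToString();
}

string[] templates = { "AAaaaaa-1111", "AAaaa-1111-aaa", "AA11111-11111", "AA111-111-111111" };
string userInput = "AAabc-4567-trr";

// Only the second template should accept this input.
foreach (string t in templates)
    Console.WriteLine(t + " -> " + Regex.IsMatch(userInput, TemplateToRegex(t)));
```

Walking the template character by character avoids the chained Replace calls rewriting text that an earlier replacement inserted, and the ^ and $ anchors make the whole input match the template, not just part of it.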
Regular expressions that match certain sets of characters in a certain order are fairly simple. For example, this will match #2 (AAaaa-1111-aaa):
[A-Z]{2}[a-z]{3}-[0-9]{4}-[a-z]{3}
Breaking it down:
[A-Z]: Any character from A to Z. So any alphabetic, uppercase character.
{2}: Two of the previous item.
The rest of it works in the same way. The hyphens between things are there to match the hyphens in your expected input.
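When a pattern like this is used for validation, anchoring matters. The following sketch (C# top-level statements, with a made-up second input) shows that the bare pattern also accepts input with extra text around it, while wrapping it in ^ and $ does not.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = "[A-Z]{2}[a-z]{3}-[0-9]{4}-[a-z]{3}";
string anchored = "^" + pattern + "$";

Console.WriteLine(Regex.IsMatch("AAabc-4567-trr", pattern));       // True
Console.WriteLine(Regex.IsMatch("xxAAabc-4567-trrzz", pattern));   // True: matches a substring
Console.WriteLine(Regex.IsMatch("xxAAabc-4567-trrzz", anchored));  // False
Console.WriteLine(Regex.IsMatch("AAabc-4567-trr", anchored));      // True
```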

How can you match words with more than one character?

I would like to use a regular expression to match all words with more than one distinct character, as opposed to words made entirely of the same character.
This should not match: ttttt, rrrrr, ggggggggggggg
This should match: rttttttt, word, wwwwwwwwwu
The following expression will do the trick.
^(?<FIRST>[a-zA-Z])[a-zA-Z]*?(?!\k<FIRST>)[a-zA-Z]+$
capture the first character into the group FIRST
match some more characters (lazily to avoid backtracking)
ensure that the next character is different from FIRST using a negative lookahead assertion
match all remaining characters (at least one, due to the assertion)
Note that it is sufficient to look for a character that is different from the first one, because if no character is different from the first one, all characters are equal.
You can shorten the expression to the following.
^(\w)\w*?(?!\1)\w+$
This will match some more characters other than [a-zA-Z].
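A quick check of the shortened expression against the words from the question, as a sketch in C# top-level statements. The first two words should print False and the remaining three True.

```csharp
using System;
using System.Text.RegularExpressions;

Regex mixed = new Regex(@"^(\w)\w*?(?!\1)\w+$");

string[] words = { "ttttt", "rrrrr", "rttttttt", "word", "wwwwwwwwwu" };

foreach (string w in words)
    Console.WriteLine(w + ": " + mixed.IsMatch(w));
```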
I would add all the unique words to a list and then use this regex
\b(\w)\1+\b
to find the words made up of a single repeated character and get rid of them.
This doesn't use a regular expression, but I believe it will do what you require:
// Requires: using System.Linq;
public bool Match(string str)
{
    // True only if some character differs from the first one.
    return !string.IsNullOrEmpty(str)
        && str.Skip(1).Any(c => c != str[0]);
}
The following RE will do the opposite of what you're asking for: match where a word is composed of the same character. It may still be useful to you though.
\b(\w)\1*\b
\b\w*?(\w)\1*(?:(?!\1)\w)\w*\b
or
\b(\w)(?!\1*\b)\w*\b
This assumes you're plucking the words out of some larger text; that's why it needs the word boundaries and the padding. If you have a list of words and you're just trying to validate the ones that meet the criteria, a much simpler regex would probably do:
(.)(?:(?!\1).)
...because you already know each word contains only word characters. On the other hand, depending on your definition of "word" you might need to replace \w in the first two regexes with something more specific, like [A-Za-z].
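As an illustration of the word-boundary variant, this sketch (C# top-level statements) pulls the qualifying words out of a larger piece of text. The sample text is made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "ttttt word aaa wwwu a";
Regex mixedWord = new Regex(@"\b(\w)(?!\1*\b)\w*\b");

// Collect every word that contains at least one character
// different from its first character.
string[] found = mixedWord.Matches(text)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();

Console.WriteLine(string.Join(", ", found));   // word, wwwu
```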
