I am trying to figure out how to use C# regular expressions to remove all instances paired parentheses from a string. The parentheses and all text between them should be removed. The parentheses aren't always on the same line. Also, their might be nested parentheses. An example of the string would be
This is a (string). I would like all of the (parentheses
to be removed). This (is) a string. Nested ((parentheses) should) also
be removed. (Thanks) for your help.
The desired output should be as follows:
This is a . I would like all of the . This a string. Nested also
be removed. for your help.
Fortunately, .NET allows recursion in regexes (see Balancing Group Definitions):
Regex regexObj = new Regex(
#"\( # Match an opening parenthesis.
(?> # Then either match (possessively):
[^()]+ # any characters except parentheses
| # or
\( (?<Depth>) # an opening paren (and increase the parens counter)
| # or
\) (?<-Depth>) # a closing paren (and decrease the parens counter).
)* # Repeat as needed.
(?(Depth)(?!)) # Assert that the parens counter is at zero.
\) # Then match a closing parenthesis.",
RegexOptions.IgnorePatternWhitespace);
In case anyone is wondering: The "parens counter" may never go below zero (<?-Depth> will fail otherwise), so even if the parentheses are "balanced" but aren't correctly matched (like ()))((()), this regex will not be fooled.
For more information, read Jeffrey Friedl's excellent book "Mastering Regular Expressions" (p. 436)
You can repetitively replace /\([^\)\(]*\)/g with the empty string till no more matches are found, though.
How about this: Regex Replace seems to do the trick.
string Remove(string s, char begin, char end)
{
Regex regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
return regex.Replace(s, string.Empty);
}
string s = "Hello (my name) is (brian)"
s = Remove(s, '(', ')');
Output would be:
"Hello is"
Normally, it is not an option. However, Microsoft does have some extensions to standard regular expressions. You may be able to achieve this with Grouping Constructs even if it is faster to code as an algorithm than to read and understand Microsoft's explanation of their extension.
Related
I have a regular expression which parses a (very small) subset of the Razor template language. Recently, I added a few more rules to the regex which dramatically slowed its execution. I'm wondering: are there certain regex constructs that are known to be slow? Is there a restructuring of the pattern I'm using that would maintain readability and yet improve performance? Note: I've confirmed that this performance hit occurs post-compilation.
Here's the pattern:
new Regex(
#" (?<escape> \#\# )"
+ #"| (?<comment> \#\* ( ([^\*]\#) | (\*[^\#]) | . )* \*\# )"
+ #"| (?<using> \#using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"
// captures expressions of the form "foreach (var [var] in [expression]) { <text>"
/* ---> */ + #"| (?<foreach> \#foreach \s* \( \s* var \s+ (?<var> \w+ ) \s+ in \s+ (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"
// captures expressions of the form "if ([expression]) { <text>"
/* ---> */ + #"| (?<if> \#if \s* \( \s* (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"
// captures the close of a razor text block
+ #"| (?<endBlock> </text> \s* \} )"
// an expression of the form #([(int)] a.b.c)
+ #"| (?<parenAtExpression> \#\( \s* (?<castToInt> \(int\)\s* )? (?<expressionValue> [\w\.]+ ) \s* \) )"
+ #"| (?<atExpression> \# (?<expressionValue> [\w\.]+ ) )"
/* ---> */ + #"| (?<literal> ([^\#<]+|[^\#]) )",
RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);
/* ---> */ indicates the new "rules" that caused the slowdown.
As you are not anchoring the expression the engine will have to check each alternative sub-pattern at every position of the string before it can be sure that it can't find a match. This will always be time-consuming, but how can it be made less so?
Some thoughts:
I don't like the sub-pattern on the second line that tries to match comments and I don't think it will work correctly.
I can see what you're trying to do with the ( ([^\*]\#) | (\*[^\#]) | . )* - allow # and * within the comments as long as they are not preceded by * or followed by # respectively. But because of the group's * quantifier and the third option ., the sub-pattern will happily match *#, therefore rendering the other options redundant.
And assuming that the subset of Razor you are trying to match does not allow multiline comments, I suggest for the second line
+ #"| (?<comment> #\*.*?\*# )"
i.e. lazily match any characters (but newlines) until the first *# is encountered.
You are using RegexOptions.ExplicitCapture meaning only named groups are being captured, so the lack of () should not be a problem.
I also do not like the ([^\#<]+|[^\#]) sub-pattern in the last line, which equates to ([^\#<]+|<). The [^\#<]+ will greedily match to the end of the string unless it comes across a # or <.
I do not see any adjacent sub-patterns that will match the same text, which are the usual culprits for excessive backtracking, but all the \s* seem suspect because of their greed and flexibility, including matching nothing and newlines. Perhaps you could change some of the \s* to [ \t]* where you know you don't want to match newlines, for example, perhaps before the opening bracket following an if.
I notice that nhahtdh has suggested you use use atomic grouping to prevent the engine backtracking into the previously matched, and that is certainly something worth experimenting with as it is almost certainly the excessive backtracking caused when the engine can no longer find a match that is causing the slow-down.
What are you trying to achieve with the RegexOptions.Multiline option? You do not look to be using ^ or $ so it will have no effect.
The escaping of the # is unnecessary.
As others have mentioned, you can improve the readability by removing unnecessary escapes (such as escaping # or escaping characters aside from \ inside a character class; for example, using [^*] instead of [^\*]).
Here are some ideas for improving performance:
Order your different alternatives so that the most likely ones come first.
The regex engine will attempt to match each alternative in the order that they appear in the regex. If you put the ones that are more likely up front, then the engine will not have to waste time attempting to match against unlikely alternatives for the majority of cases.
Remove unnecessary backtracking
Not the ending of your "using" alternative: #"| (?<using> \#using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"
If for some reason you have a large amount of whitespace, but no closing ; at the end of a using line, the regex engine must backtrack through each whitespace character until it finally decides that it can't match (\s*;). In your case, (\s*;)? can be replaced with \s*;? to prevent backtracking in these scenarios.
In addition, you could use atomic groups (?>...) to prevent backtracking through quantifiers (e.g. * and +). This really helps improve performance when you don't find a match. For example, your "foreach" alternative contains \s* \( \s*. If you find the text "foreach var...", the "foreach" alternative will greedily match all of the whitespace after foreach, and then fail when it doesn't find an opening (. It will then backtrack, one whitespace-character at a time, and try to match ( at the previous position until it confirms that it cannot match that line. Using an atomic group (?>\s*)\( will cause the regex engine to not backtrack through \s* if it matches, allowing the regex to fail more quickly.
Be careful when using them though, as they can cause unintended failures when used at the wrong place (for instance, '(?>,*); will never match anything, due to the greedy .* matching all characters (including ;), and the atomic grouping (?>...) preventing the regex engine from backtracking one character to match the ending ;).
"Unroll the loop" on some of your alternatives, such as your "comment" alternative (also useful if you plan on adding an alternative for strings).
For example: #"| (?<comment> \#\* ( ([^\*]\#) | (\*[^\#]) | . )* \*\# )"
Could be replaced with #"| (?<comment> #\* [^*]* (\*+[^#][^*]*)* \*+# )"
The new regex boils down to:
#\*: Find the beginning of a comment #*
[^*]*: Read all "normal characters" (anything that's not a * because that could signify the end of the comment)
(\*+[^#][^*]*)*: include any non-terminal * inside the comment
(\*+[^#]: If we find a *, ensure that any string of *s doesn't end in a #
[^*]*: Go back to reading all "normal characters"
)*: Loop back to the beginning if we find another *
\*+#: Finally, grab the end of the comment *# being careful to include any extra *
You can find many more ideas for improving the performance of your regular expressions from Jeffrey Friedl's Mastering Regular Expressions (3rd Edition).
I am trying to create a RegEx expression that will successfully parse the following line:
"57" "testing123" 82 16 # 13 26 blah blah
What I want is to be able to do is identify the numbers in the line. Currently, what I'm using is this:
[0-9]+
which parses fine. However, where it gets tricky is if the number is in quotes, like "57" is or like "testing123" is, I do not want it to match.
In addition to that, anything after the hash sign (the '#"), I do not want to match anything at all after the hash sign.
So in this example, the matches I should be getting are "82" and "16". Nothing else should match.
Any help on this would be appreciated.
It should be easier for you to build 3 different regexes, and then create the logic that combines them:
Check, whether the string has #, and ignore everything after it.
Check, for all the matches of "\d+", and ignore all of them
Check everything that's left, whether it matches [0-9]+
.Net regular expression can rather easily parse this string. The following pattern should match everything until the comment:
\A # Start of the string
(?>
(?<Quoted> # A quoted string
"" # Open quotes
[^""\\]* # non quotes or backslashes
(?:\\.[^""\\]*)* # but allow escaped characters
"" # Close quotes
)
|
(?<Number> # A number
\d+ # some digits
)
|
\s+ # Whitespace separator
)*
If you also want to match the comment, add:
(?<Comment>
\# .*
)?
\z
You can get your numbers in a single Match, using all captures of the "Number" group:
Match parsed = Regex.Match(s, pattern, RegexOptions.IgnorePatternWhitespace);
CaptureCollection numbers = parsed.Groups["Number"].Captures;
Missing from this pattern is mainly unquoted string tokens, such as 4 8 this 15that, which can add some complexity, depending on how we'd want it to work.
I want to match a pattern [0-9][0-9]KK[a-z][a-z] which is not preceded by either of these words
http://
example
I have a RegEx which takes care of the first criteria, but not the second criteria.
Without OR operator
var body = Regex.Replace(body, "(?<!http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%
\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([0-9][0-9]KK[a-z][a-z])
(?!</a>)","replaced");
wth OR Operator
var body = Regex.Replace(body, "(?example)|(?<!http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#
\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([0-9][0-9]KK[a-
z][a-z])(?!</a>)","replaced");
The second one with OR operator throws an exception. How can I fix this?
It should not match either of these:
example99KKas
http://stack.com/99KKas
Here is one way to do it. Start at the beginning of the string and check that each character is not the start of 'http://' or 'example'. Do this lazily, and one character at a time so that we can spot the magic word once we reach it. Also, capture everything up to the magic word so that we can put it back in the replacement string. Here it is in commented free-spacing mode so that it can be comprehended by mere mortals:
var body = Regex.Replace(body,
#"# Match special word not preceded by 'http://' or 'example'
^ # Anchor to beginning of string
(?i) # Set case-insensitive mode.
( # $1: Capture everything up to special word.
(?: # Non-capture group for applying * quantifier.
(?!http://) # Assert this char is not start of 'http://'
(?!example) # Assert this char is not start of 'example'
. # Safe to match this one acceptable char.
)*? # Lazily match zero or more preceding chars.
) # End $1: Everything up to special word.
(?-i) # Set back to case-sensitive mode.
([0-9][0-9]KK[a-z][a-z]) # $2: Match our special word.
(?!</a>) # Assert not end of Anchor tag contents.
",
"$1replaced",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Note that this is case sensitive for the magic word but not for http:// and example. Note also that this is untested (I don't know C# - just its regex engine). The "var" in "var body = ..." looks kinda suspicious to me. ??
I wasn't able to get the second example working, it gave an ArgumentException of "Unrecognized grouping construct".
But I replaced the url matching and moved the first alternative group a bit and came up with this:
var body = Regex.Replace(body, "(?<!http\\://[a-zA-Z0-9\\-\\.]+\\.[a-zA-Z]{2,3}(/\\S*)?|example)
([0-9][0-9]KK[a-z][a-z])(?!</a>)","replaced");
You could use something like this:
body = Regex.Replace(body, #"(?<!\S)(?!(?i:http://|example))\S*\d\dKK[a-z]{2}\b", "replaced");
I want to match the first number/word/string in quotation marks/list in the input with Regex. For example, it should match those:
"hello world" gdfigjfoj sogjds
-14.5 fdhdfdfi dfjgdlf
test14 hfghdf hjgfjd
(a (c b 7)) (3 4) "hi"
Any ideas to a regex or how can I start?
Thank you.
If you want to match balanced parenthesis, regex is not the right tool for the job. Some regex implementations do facilitate recursive pattern matching (PHP and Perl, that I know of), but AFAIK, C# cannot do that (EDIT: see Steve's comment below: .NET can do this as well, after all).
You can match up to a certain depth using regex, but that very quickly explodes in your face. For example, this:
\(([^()]|\([^()]*\))*\)
meaning
\( # match the character '('
( # start capture group 1
[^()] # match any character from the set {'0x00'..''', '*'..'ÿ'}
| # OR
\( # match the character '('
[^()]* # match any character from the set {'0x00'..''', '*'..'ÿ'} and repeat it zero or more times
\) # match the character ')'
)* # end capture group 1 and repeat it zero or more times
\) # match the character ')'
will match single nested parenthesis like (a (c b 7)) and (a (x) b (y) c (z) d), but will fail to match (a(b(c))).
Any ideas to a regex or how can I start?
You can start with any tutorial on basic regex, such as this.
[Edit] I missed that you wanted to count parentheses. That cannot be done in regex - nothing that involves counting (except for non-standard lookaheads) can.
For first three cases, you could to use:
^("[^"]*"|[+-]?\d*(?:\.\d+)?|\w+)
For last one, I'm not sure if it's possible with regex to match that last closing parenthesis.
EDIT: using that suggested balanced matching for last one:
^\([^()]*(((?<Open>\()[^()]*)+((?<Close-Open>\))[^()]*)+)*(?(Open)(?!))\)
I need to replace character (say) x with character (say) P in a string, but only if it is contained in a quoted substring.
An example makes it clearer:
axbx'cxdxe'fxgh'ixj'k -> axbx'cPdPe'fxgh'iPj'k
Let's assume, for the sake of simplicity, that quotes always come in pairs.
The obvious way is to just process the string one character at a time (a simple state machine approach);
however, I'm wondering if regular expressions can be used to do all the processing in one go.
My target language is C#, but I guess my question pertains to any language having builtin or library support for regular expressions.
I converted Greg Hewgill's python code to C# and it worked!
[Test]
public void ReplaceTextInQuotes()
{
Assert.AreEqual("axbx'cPdPe'fxgh'iPj'k",
Regex.Replace("axbx'cxdxe'fxgh'ixj'k",
#"x(?=[^']*'([^']|'[^']*')*$)", "P"));
}
That test passed.
I was able to do this with Python:
>>> import re
>>> re.sub(r"x(?=[^']*'([^']|'[^']*')*$)", "P", "axbx'cxdxe'fxgh'ixj'k")
"axbx'cPdPe'fxgh'iPj'k"
What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string.
This relies on your assumption that the quotes are always balanced. This is also not very efficient.
A more general (and simpler) solution which allows non-paired quotes.
Find quoted string
Replace 'x' by 'P' in the string
#!/usr/bin/env python
import re
text = "axbx'cxdxe'fxgh'ixj'k"
s = re.sub("'.*?'", lambda m: re.sub("x", "P", m.group(0)), text)
print s == "axbx'cPdPe'fxgh'iPj'k", s
# -> True axbx'cPdPe'fxgh'iPj'k
The trick is to use non-capturing group to match the part of the string following the match (character x) we are searching for.
Trying to match the string up to x will only find either the first or the last occurence, depending whether non-greedy quantifiers are used.
Here's Greg's idea transposed to Tcl, with comments.
set strIn {axbx'cxdxe'fxgh'ixj'k}
set regex {(?x) # enable expanded syntax
# - allows comments, ignores whitespace
x # the actual match
(?= # non-matching group
[^']*' # match to end of current quoted substring
##
## assuming quotes are in pairs,
## make sure we actually were
## inside a quoted substring
## by making sure the rest of the string
## is what we expect it to be
##
(
[^']* # match any non-quoted substring
| # ...or...
'[^']*' # any quoted substring, including the quotes
)* # any number of times
$ # until we run out of string :)
) # end of non-matching group
}
#the same regular expression without the comments
set regexCondensed {(?x)x(?=[^']*'([^']|'[^']*')*$)}
set replRegex {P}
set nMatches [regsub -all -- $regex $strIn $replRegex strOut]
puts "$nMatches replacements. "
if {$nMatches > 0} {
puts "Original: |$strIn|"
puts "Result: |$strOut|"
}
exit
This prints:
3 replacements.
Original: |axbx'cxdxe'fxgh'ixj'k|
Result: |axbx'cPdPe'fxgh'iPj'k|
#!/usr/bin/perl -w
use strict;
# Break up the string.
# The spliting uses quotes
# as the delimiter.
# Put every broken substring
# into the #fields array.
my #fields;
while (<>) {
#fields = split /'/, $_;
}
# For every substring indexed with an odd
# number, search for x and replace it
# with P.
my $count;
my $end = $#fields;
for ($count=0; $count < $end; $count++) {
if ($count % 2 == 1) {
$fields[$count] =~ s/a/P/g;
}
}
Wouldn't this chunk do the job?
Not with plain regexp. Regular expressions have no "memory" so they cannot distinguish between being "inside" or "outside" quotes.
You need something more powerful, for example using gema it would be straighforward:
'<repl>'=$0
repl:x=P
Similar discussion about balanced text replaces: Can regular expressions be used to match nested patterns?
Although you can try this in Vim, but it works well only if the string is on one line, and there's only one pair of 's.
:%s:\('[^']*\)x\([^']*'\):\1P\2:gci
If there's one more pair or even an unbalanced ', then it could fail. That's way I included the c a.k.a. confirm flag on the ex command.
The same can be done with sed, without the interaction - or with awk so you can add some interaction.
One possible solution is to break the lines on pairs of 's then you can do with vim solution.
Pattern: (?s)\G((?:^[^']*'|(?<=.))(?:'[^']*'|[^'x]+)*+)x
Replacement: \1P
\G — Anchor each match at the end of the previous one, or the start of the string.
(?:^[^']*'|(?<=.)) — If it is at the beginning of the string, match up to the first quote.
(?:'[^']*'|[^'x]+)*+ — Match any block of unquoted characters, or any (non-quote) characters up to an 'x'.
One sweep trough the source string, except for a single character look-behind.
Sorry to break your hopes, but you need a push-down automata to do that. There is more info here:
Pushdown Automaton
In short, Regular expressions, which are finite state machines can only read and has no memory while pushdown automaton has a stack and manipulating capabilities.
Edit: spelling...