Optimize a variable regex

I'm matching the following strings:
watermark=testing
watermark=text-testing|position-24-50
watermark=text-testing|position-24-50|color-6aa6cc
watermark=text-testing|position-24-50|color-6aa6cc|size-48
using the following regex:
watermark=(text-\w+\|position-\d+-\d+\|color-([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})\|size-\d+|text-\w+\|position-\d+-\d+\|color-([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})|text-\w+\|position-\d+-\d+|\w+)
It works, but it's so ugly it makes me want to poke my eyes out with a hot stick. Would any of you regex gurus be willing to refactor it, with a brief explanation of your methods?

watermark=(text-\w+\|position-\d+-\d+(\|color-([0-9a-fA-F]{3}){1,2}(\|size-\d+)?)?|\w+)
Since I observed (from the examples and the original regex) that "size" implies all the preceding fields are present, and that "color" likewise implies all the preceding fields are present, I simply nested the optional groups:
(\|color-([0-9a-fA-F]{3}){1,2}
(\|size-\d+)?
)?
For ([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3}), I "simplified" to ([0-9a-fA-F]{3}){1,2}.
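A minimal sketch of how the nested-optional version could be checked against the sample strings from the question. The ^ and $ anchors, the non-capturing groups, and the last sample (a size without a color) are assumptions added for illustration; they are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// C# 9+ top-level program.
// Nested optional groups: size is only allowed after color,
// and color is only allowed after position.
var watermark = new Regex(
    @"^watermark=(?:text-\w+\|position-\d+-\d+(?:\|color-(?:[0-9a-fA-F]{3}){1,2}(?:\|size-\d+)?)?|\w+)$");

string[] samples =
{
    "watermark=testing",
    "watermark=text-testing|position-24-50",
    "watermark=text-testing|position-24-50|color-6aa6cc",
    "watermark=text-testing|position-24-50|color-6aa6cc|size-48",
    "watermark=text-testing|position-24-50|size-48", // hypothetical: size without color
};

foreach (var s in samples)
    Console.WriteLine($"{watermark.IsMatch(s),-5} {s}");
```

The first four samples should be accepted and the last one rejected, because the size group sits inside the optional color group.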

\bwatermark=(?:text-|)\w+(?:\|position-\d+-\d+(?:\|color-[0-9a-fA-F]+(?:\|size-\d+|)|)|)\b

You can run your regex through factoring software http://regexformat.com
Before:
watermark=(text-\w+\|position-\d+-\d+\|color-([A-Fa-f0-9]{3}(?:[A-Fa-f0-9]{3})?)\|size-\d+|text-\w+\|position-\d+-\d+\|color-([A-Fa-f0-9]{3}(?:[A-Fa-f0-9]{3})?)|text-\w+\|position-\d+-\d+|\w+)
After:
watermark=(?:text-\w+\|position-\d+-\d+(?:\|color-[A-Fa-f0-9]{3}(?:[A-Fa-f0-9]{3})?(?:\|size-\d+)?)?|\w+)
watermark=
(?:
text- \w+ \| position- \d+ - \d+
(?:
\| color- [A-Fa-f0-9]{3}
(?: [A-Fa-f0-9]{3} )?
(?: \| size- \d+ )?
)?
| \w+
)

Related

Regex - how to match a block comment

When it comes to regex I'm always lost. I have an editor written in C# for working with Papyrus scripts. Users have asked me to style block comments (";/ ... /;"); styling already works for single-line comments, which start with ";".
Here is the code I have so far
var inputData = @"comment test and this line not suppose to show
;/
comment line 1
comment line 2
comment line 3
/;
Not suppose to show";
PapyrusCommentRegex1 = new Regex(@"(;/\*.*?\/;)|(.*\/;)", RegexOptions.Singleline);
foreach (Match match in PapyrusCommentRegex1.Matches(inputData))
{
if (match.Success)
{
textBox1.AppendText(match.Value + Environment.NewLine);
}
}
The result I get is
comment test and this line not suppose to show
;/
comment line 1
comment line 2
comment line 3
/;
All the lines before the ";/" show up as well.
My question is: what am I doing wrong in my regex?
Thanks in advance to all
Edit:
To make it clearer: I need a regex pattern in C# that finds every block comment that starts with ";/" and ends with "/;", and the match needs to include the ";/" and "/;" delimiters.

Since you said you need to do this with a regex in .NET, I guess you may want a regex that uses balancing groups to match the block comment:
(?x) # ignore spaces and comments
(
;/ # open block comment
(?:
(?<open> ;/ )* # open++
.+
(?<-open> /; )* # open--
)+
/; # close
(?(open)(?!)) # fail if unbalanced: open > 0
)
This should give you what you want.
Some mentioned the problem of block comments in strings (and vice versa). This makes things a lot harder, especially since the (*SKIP)(*FAIL) backtracking verbs and \K are not available in .NET's regex engine. I would match both what you need and what you do not need, but capture only what you need.
The following matches your block comments and "..." strings. The trick is to look only at the block comment capture group (group 1):
(?x) # ignore spaces and comments
(
;/ # open block comment
(?:
(?<open> ;/ )* # open++
.+
(?<-open> /; )* # open--
)+
/; # close
(?(open)(?!)) # fail if unbalanced: open > 0
)
|
(?:(?<openq>")
[^"]*?
(?<-openq>")*
)+(?(openq)(?!))
I hope you can apply this in your code.

Try this:
(;/.*?/;)|(;.*?(?=$|[\r\n]))
Note that I'm still using the SingleLine mode.
The part before the | matches multiline comments; the part after the | matches single-line comments (comments that end at the end of the text, $, or at a new line, \r or \n). Note that the regex won't capture the line ending after a single-line comment, so in
;xyz\n
the \n won't be captured. To capture it:
(;/.*?/;)|(;.*?(?:$|\r\n?|\n))
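As a rough illustration, the first pattern above (the one that leaves the line ending out of the match) could be applied like this. The sample script text and the variable names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// C# 9+ top-level program.
// Singleline lets '.' cross line breaks, which the block-comment branch needs.
var commentRegex = new Regex(@"(;/.*?/;)|(;.*?(?=$|[\r\n]))", RegexOptions.Singleline);

// Hypothetical Papyrus-like input: one single-line comment, one block comment.
string script = "x = 1 ; trailing note\n;/\nline 1\nline 2\n/;\ny = 2";

var found = new List<string>();
foreach (Match m in commentRegex.Matches(script))
    found.Add(m.Value);

foreach (var c in found)
    Console.WriteLine("[" + c + "]");
```

The block-comment branch comes first in the alternation, so a ";/" is tried as a block comment before the single-line branch gets a chance to consume it.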

Regex to capture parenthesis with hash tag?

So far I have this perfectly working regex:
(?:(?<=\s)|^)#(\w*[A-Za-z_]+\w*)
It finds any word that starts with a hash tag (e.g. #lolz but not hsshs#jdjd).
The problem is that I also want it to match when parentheses are involved. So if I have any of these, it should match:
(#lolz wow)
or
(wow #cool)
or
(#cool)
Any idea on how can I make or use my regex to work like that?

The following seemed to work for me ...
\(?#(\w*[A-Za-z_]+\w*)\)?

The way you use the following is overkill:
\w*[A-Za-z_]\w*
\w alone matches word characters (a-z, A-Z, 0-9, _), and the non-capturing group (?: ... ) wrapped around your lookbehind assertion is not necessary here.
I believe the following would suffice by itself, provided that purely numeric tags such as #123 (which your pattern rejects) are acceptable:
(?<=^|\s)\(?#(\w+)\)?
Regular expression:
(?<= look behind to see if there is:
^ the beginning of the string
| OR
\s whitespace (\n, \r, \t, \f, and " ")
) end of look-behind
\(? '(' (optional (matching the most amount possible))
# '#'
( group and capture to \1:
\w+ word characters (a-z, A-Z, 0-9, _) (1 or more times)
) end of \1
\)? ')' (optional (matching the most amount possible))
You can also use a negative lookbehind here if you wanted to.
(?<![^\s])\(?#(\w+)\)?
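A small sketch of how the lookbehind version could be used to pull the tag names out of a string. The sample text is invented for illustration; the tag itself is read from capture group 1.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// C# 9+ top-level program.
// The lookbehind requires start-of-string or whitespace before the optional '('.
var hashtag = new Regex(@"(?<=^|\s)\(?#(\w+)\)?");

// Hypothetical input: "x#y" must not count, because '#' follows a word character.
string text = "(#lolz wow) and (wow #cool) but not x#y, then #ok";

var tags = new List<string>();
foreach (Match m in hashtag.Matches(text))
    tags.Add(m.Groups[1].Value); // group 1 is the tag without '#' or parentheses

Console.WriteLine(string.Join(", ", tags));
```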

How to set repeat regular expression?

I have the regular expression ^\d{5}$|^\d{5}-\d{4}*$, which checks a US ZIP code.
But I need to check a list such as "zip, zip, zip". How can I do this?
I tried ^(\d{5}$|^\d{5}-\d{4},)*$ but it does not work.

Try
((^|, )(\d{5}|\d{5}-\d{4}))*$
Tester: http://regexr.com?36297
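Each ZIP must be preceded by (^|, ), that is, by the beginning of the string or by a comma followed by a space. Be aware that the pattern has no leading ^ and the group may repeat zero times, so it can also match the empty string at the end of any input; the patterns in the other answers avoid this.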
Note that you shouldn't use the \d in .NET, because ٠١٢٣٤ are \d! (in .NET \d includes non-ASCII Unicode digits). [0-9] is normally better.
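The point about \d can be seen with a quick check. This is only a sketch; the variable names are arbitrary, and RegexOptions.ECMAScript is shown as one further way to restrict \d to ASCII digits.

```csharp
using System;
using System.Text.RegularExpressions;

// C# 9+ top-level program.
// Five Arabic-Indic digits (U+0660..U+0664), written with escapes to avoid encoding issues.
string arabicIndic = "\u0660\u0661\u0662\u0663\u0664";

bool unicodeDigits = Regex.IsMatch(arabicIndic, @"^\d{5}$");    // \d is Unicode-aware by default
bool asciiClass = Regex.IsMatch(arabicIndic, @"^[0-9]{5}$");    // explicit ASCII range
bool ecmaScript = Regex.IsMatch(arabicIndic, @"^\d{5}$", RegexOptions.ECMAScript); // \d behaves like [0-9]

Console.WriteLine($"\\d: {unicodeDigits}, [0-9]: {asciiClass}, ECMAScript \\d: {ecmaScript}");
```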

The expression you appear to need is:
^\d{5}(|-\d{4})(,\d{5}(|-\d{4}))*$
The one you were attempting to write was:
^(\d{5}|\d{5}-\d{4},)*$
but that would require every ZIP to have a trailing comma, which the very last one would not have had.
Breaking down the answer given,
\d{5}(|-\d{4}) is a variant of your original, but simply making the -1234 optional.
(,\d{5}(|-\d{4}))* is the first regular expression preceded by a comma, and allowed zero or more times.

I would use this for speed:
^\d{5}(?:-\d{4})?(?:,\s*\d{5}(?:-\d{4})?)*$
expanded
^
\d{5}
(?: - \d{4} )?
(?:
, \s* \d{5}
(?: - \d{4} )?
)*
$
and this for speed/flexibility:
^\s*\d{5}(?:\s*-\s*\d{4})?(?:\s*,\s*\d{5}(?:\s*-\s*\d{4})?)*\s*$
expanded
^
\s*
\d{5}
(?: \s* - \s* \d{4} )?
(?:
\s* , \s* \d{5}
(?: \s* - \s* \d{4} )?
)*
\s*
$
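For reference, the stricter of the two patterns above could be exercised like this. The sample inputs are made up for the sketch.

```csharp
using System;
using System.Text.RegularExpressions;

// C# 9+ top-level program.
// First ZIP, then zero or more ", ZIP" items; the +4 part is optional each time.
var zipList = new Regex(@"^\d{5}(?:-\d{4})?(?:,\s*\d{5}(?:-\d{4})?)*$");

string[] inputs =
{
    "12345",
    "12345, 12345-6789,54321",
    "12345,",   // trailing comma
    "1234",     // too short
};

foreach (var input in inputs)
    Console.WriteLine($"{zipList.IsMatch(input),-5} \"{input}\"");
```

Because the comma belongs to the repeated group rather than to each ZIP, the last item does not need a trailing comma.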

How can I improve the performance of a .NET regular expression?

I have a regular expression which parses a (very small) subset of the Razor template language. Recently, I added a few more rules to the regex which dramatically slowed its execution. I'm wondering: are there certain regex constructs that are known to be slow? Is there a restructuring of the pattern I'm using that would maintain readability and yet improve performance? Note: I've confirmed that this performance hit occurs post-compilation.
Here's the pattern:
new Regex(
@" (?<escape> \@\@ )"
+ @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
+ @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"
// captures expressions of the form "foreach (var [var] in [expression]) { <text>"
/* ---> */ + @"| (?<foreach> \@foreach \s* \( \s* var \s+ (?<var> \w+ ) \s+ in \s+ (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"
// captures expressions of the form "if ([expression]) { <text>"
/* ---> */ + @"| (?<if> \@if \s* \( \s* (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"
// captures the close of a razor text block
+ @"| (?<endBlock> </text> \s* \} )"
// an expression of the form @([(int)] a.b.c)
+ @"| (?<parenAtExpression> \@\( \s* (?<castToInt> \(int\)\s* )? (?<expressionValue> [\w\.]+ ) \s* \) )"
+ @"| (?<atExpression> \@ (?<expressionValue> [\w\.]+ ) )"
/* ---> */ + @"| (?<literal> ([^\@<]+|[^\@]) )",
RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);
/* ---> */ indicates the new "rules" that caused the slowdown.

As you are not anchoring the expression the engine will have to check each alternative sub-pattern at every position of the string before it can be sure that it can't find a match. This will always be time-consuming, but how can it be made less so?
Some thoughts:
I don't like the sub-pattern on the second line that tries to match comments and I don't think it will work correctly.
I can see what you're trying to do with the ( ([^\*]\@) | (\*[^\@]) | . )* - allow @ and * within the comments as long as they are not preceded by * or followed by @ respectively. But because of the group's * quantifier and the third option ., the sub-pattern will happily match *@, therefore rendering the other options redundant.
And assuming that the subset of Razor you are trying to match does not allow multiline comments, I suggest for the second line
+ @"| (?<comment> @\*.*?\*@ )"
i.e. lazily match any characters (but newlines) until the first *@ is encountered.
You are using RegexOptions.ExplicitCapture meaning only named groups are being captured, so the lack of () should not be a problem.
I also do not like the ([^\@<]+|[^\@]) sub-pattern in the last line, which equates to ([^\@<]+|<). The [^\@<]+ will greedily match to the end of the string unless it comes across a @ or <.
I do not see any adjacent sub-patterns that will match the same text, which are the usual culprits for excessive backtracking, but all the \s* seem suspect because of their greed and flexibility, including matching nothing and newlines. Perhaps you could change some of the \s* to [ \t]* where you know you don't want to match newlines, for example, perhaps before the opening bracket following an if.
I notice that nhahtdh has suggested you use atomic grouping to prevent the engine from backtracking into previously matched text. That is certainly worth experimenting with, as the slow-down is almost certainly caused by excessive backtracking when the engine can no longer find a match.
What are you trying to achieve with the RegexOptions.Multiline option? You do not look to be using ^ or $ so it will have no effect.
The escaping of the @ is unnecessary.

As others have mentioned, you can improve the readability by removing unnecessary escapes (such as escaping @ or escaping characters aside from \ inside a character class; for example, using [^*] instead of [^\*]).
Here are some ideas for improving performance:
Order your different alternatives so that the most likely ones come first.
The regex engine will attempt to match each alternative in the order that they appear in the regex. If you put the ones that are more likely up front, then the engine will not have to waste time attempting to match against unlikely alternatives for the majority of cases.
Remove unnecessary backtracking
Note the ending of your "using" alternative: @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"
If for some reason you have a large amount of whitespace, but no closing ; at the end of a using line, the regex engine must backtrack through each whitespace character until it finally decides that it can't match (\s*;). In your case, (\s*;)? can be replaced with \s*;? to prevent backtracking in these scenarios.
In addition, you could use atomic groups (?>...) to prevent backtracking through quantifiers (e.g. * and +). This really helps improve performance when you don't find a match. For example, your "foreach" alternative contains \s* \( \s*. If you find the text "foreach var...", the "foreach" alternative will greedily match all of the whitespace after foreach, and then fail when it doesn't find an opening (. It will then backtrack, one whitespace-character at a time, and try to match ( at the previous position until it confirms that it cannot match that line. Using an atomic group (?>\s*)\( will cause the regex engine to not backtrack through \s* if it matches, allowing the regex to fail more quickly.
Be careful when using them though, as they can cause unintended failures when used in the wrong place (for instance, (?>.*); will never match anything, because the greedy .* matches all characters (including ;) and the atomic grouping (?>...) prevents the regex engine from backtracking one character to match the ending ;).
"Unroll the loop" on some of your alternatives, such as your "comment" alternative (also useful if you plan on adding an alternative for strings).
For example: @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
Could be replaced with @"| (?<comment> @\* [^*]* (\*+[^@*][^*]*)* \*+@ )"
The new regex boils down to:
@\*: Find the beginning of a comment @*
[^*]*: Read all "normal characters" (anything that's not a * because that could signify the end of the comment)
(\*+[^@*][^*]*)*: Include any non-terminal * inside the comment
(\*+[^@*]: If we find a *, ensure that the run of *s is not followed by a @ (the class excludes * as well, so that backtracking cannot split a run such as **@ and read past the terminator)
[^*]*: Go back to reading all "normal characters"
)*: Loop back to the beginning if we find another *
\*+@: Finally, grab the end of the comment *@, being careful to include any extra *
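Two of the ideas above, atomic groups and loop unrolling, can be sketched in a few lines. The sample inputs are invented, and the unrolled pattern here uses the character class [^@*] (excluding both @ and *) so that a run of *s directly before @ is always treated as the terminator.

```csharp
using System;
using System.Text.RegularExpressions;

// C# 9+ top-level program.

// Atomic groups: once (?>...) has matched, the engine cannot backtrack into it.
bool greedyMatches = Regex.IsMatch("abc;", @".*;");      // .* gives the ';' back
bool atomicMatches = Regex.IsMatch("abc;", @"(?>.*);");  // .* keeps the ';', so the literal ';' never matches
bool atomicSpaces = Regex.IsMatch("@foreach   (var x in xs)", @"@foreach(?>\s*)\(");

// Unrolled comment loop: normal* (special normal*)* terminator.
var comment = new Regex(@"@\*[^*]*(?:\*+[^@*][^*]*)*\*+@");
MatchCollection comments = comment.Matches("@* a **@ b @* c *@");

Console.WriteLine($"greedy: {greedyMatches}, atomic: {atomicMatches}, atomic \\s*: {atomicSpaces}");
foreach (Match m in comments)
    Console.WriteLine("[" + m.Value + "]");
```

The atomic \s* still matches when the parenthesis is present; it only removes the pointless retries when it is absent.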
You can find many more ideas for improving the performance of your regular expressions from Jeffrey Friedl's Mastering Regular Expressions (3rd Edition).

C# Regex - How to remove multiple paired parentheses from string

I am trying to figure out how to use C# regular expressions to remove all instances of paired parentheses from a string. The parentheses and all text between them should be removed. The parentheses aren't always on the same line. Also, there might be nested parentheses. An example of the string would be
This is a (string). I would like all of the (parentheses
to be removed). This (is) a string. Nested ((parentheses) should) also
be removed. (Thanks) for your help.
The desired output should be as follows:
This is a . I would like all of the . This a string. Nested also
be removed. for your help.

Fortunately, .NET regexes can match nested constructs by means of balancing groups (see Balancing Group Definitions):
Regex regexObj = new Regex(
@"\( # Match an opening parenthesis.
(?> # Then either match (possessively):
[^()]+ # any characters except parentheses
| # or
\( (?<Depth>) # an opening paren (and increase the parens counter)
| # or
\) (?<-Depth>) # a closing paren (and decrease the parens counter).
)* # Repeat as needed.
(?(Depth)(?!)) # Assert that the parens counter is at zero.
\) # Then match a closing parenthesis.",
RegexOptions.IgnorePatternWhitespace);
In case anyone is wondering: The "parens counter" may never go below zero ((?<-Depth>) will fail otherwise), so even if the parentheses are "balanced" but aren't correctly matched (like ()))((()), this regex will not be fooled.
For more information, read Jeffrey Friedl's excellent book "Mastering Regular Expressions" (p. 436)
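A short usage sketch for the balancing-group pattern above, applied with Regex.Replace to a shortened version of the text from the question. The variable names are arbitrary.

```csharp
using System;
using System.Text.RegularExpressions;

// C# 9+ top-level program.
var balanced = new Regex(@"
    \(                   # opening parenthesis
    (?>
        [^()]+           # anything but parentheses (including newlines)
      | \( (?<Depth>)    # nested open: push onto the Depth stack
      | \) (?<-Depth>)   # nested close: pop from the Depth stack
    )*
    (?(Depth)(?!))       # fail if anything is still open
    \)                   # closing parenthesis",
    RegexOptions.IgnorePatternWhitespace);

string text = "This (is) a string. Nested ((parentheses) should) also\nbe removed. (Thanks) for your help.";
string cleaned = balanced.Replace(text, string.Empty);

Console.WriteLine(cleaned);
```

Only the parenthesized text is removed; the spaces that surrounded each group stay behind, so the result contains some doubled spaces.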

You can repeatedly replace /\([^\)\(]*\)/g with the empty string until no more matches are found, though.
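In C#, that repeated replacement might look roughly like the following. StripParens is a hypothetical helper name chosen for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

// C# 9+ top-level program.
Console.WriteLine(StripParens("Nested ((parentheses) should) also"));

// Removes innermost (...) pairs until a pass changes nothing,
// so nested pairs are peeled off from the inside out.
static string StripParens(string input)
{
    var innermost = new Regex(@"\([^()]*\)"); // a pair with no parentheses inside
    string previous;
    do
    {
        previous = input;
        input = innermost.Replace(input, string.Empty);
    } while (input != previous);
    return input;
}
```

Each pass only touches pairs that contain no other parentheses, so unbalanced leftovers such as a stray ")" are simply left in place.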

How about this: Regex Replace seems to do the trick.
string Remove(string s, char begin, char end)
{
Regex regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
return regex.Replace(s, string.Empty);
}
string s = "Hello (my name) is (brian)";
s = Remove(s, '(', ')');
Output would be (the spaces around the removed groups remain):
"Hello  is "

Normally, this is not possible with regular expressions, because standard regexes cannot match arbitrarily nested constructs. However, Microsoft does have some extensions to standard regular expressions. You may be able to achieve this with Grouping Constructs, even if it is faster to code as an algorithm than to read and understand Microsoft's explanation of their extension.
