Regular expression performance issue - c#

I've got a long string which contains about 100 parameters (string parameterName) matching the following pattern:
parameterName + "(Whitespace | CarriageReturn | Tabulation | Period | Underline | Digit | Letter | QuotationMark | Slash)* DATA Whitespace Hexadecimal"
I've tried to use this regular expression, but it takes way too long:
parameterName + "[\\s\\S]*DATA\\s0x[0-9A-F]{4,8}"
This messy one works a little better:
parameterName + "(\\s|\r|\n|\t|\\.|[_0-9A-z]|\"|/)*DATA\\s0x[0-9A-F]{4,8}"
I'd use ".*", however, it doesn't match "\n".
I've tried "(.|\n)", but it works even slower than "[\s\S]".
Is there any way to improve this regular expression?

You can use something like
(?>(?>[^D]+|D(?!ATA))*)DATA\\s0x[0-9A-F]{4,8}
(?>              # atomic grouping (no backtracking)
  (?>            # atomic grouping (no backtracking)
    [^D]+        # anything but a D
  |              # or
    D(?!ATA)     # a D not followed by ATA
  )*             # zero or more times
)
The idea
The idea is to get to DATA without asking ourselves any questions along the way, and without going past it only to backtrack to it.
If you use .*DATA on a string like DATA321, see what the regex engine does:
.* eats up all the string
There's no DATA to be found, so step by step the engine will backtrack and try these combinations: .* will eat only DATA32, then DATA3, then DATA... then nothing and that's when we find our match.
Same thing happens if you use .*?DATA on 123DATA: .*? will try to match nothing, then 1, then 12...
On each try we have to check there is no DATA after the place where .* stopped, and this is time consuming. With the [^D]+|D(?!ATA) we ensure we stop exactly when we need to - not before, not after.
Beware of backtracking
So why not use (?:[^D]|D(?!ATA)) instead of this weird atomic grouping?
This is all good and working fine when there is a match to be found. But what happens when there isn't? Before declaring failure, the regex engine has to try ALL possible combinations. And when you have something like (.*)*, at each character the regex engine can use either the inside * or the outside one.
Which means the number of combinations very rapidly becomes huge. We don't want to try all of these: we know we stopped at the right place; if we didn't find a match right away, we never will. Hence the atomic grouping (.NET doesn't support possessive quantifiers).
You can see what I mean over here: 80,000 steps to check a 15-character string that will never match.
This is discussed in more depth (and put better than I ever could) in this great article by Friedl, the regex guru, over here
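For reference, here is a minimal C# sketch (my addition, not part of the answer above) of how the suggested pattern might be wired up; the parameter name and sample string are placeholders:

string parameterName = "SOME_PARAMETER";                       // hypothetical name for illustration
string longInput = "SOME_PARAMETER value stuff DATA 0x1A2B";   // stand-in for the real 100-parameter string

string pattern = Regex.Escape(parameterName)
               + @"(?>(?>[^D]+|D(?!ATA))*)DATA\s0x[0-9A-F]{4,8}";
Regex regex = new Regex(pattern, RegexOptions.Compiled);
Console.WriteLine(regex.IsMatch(longInput));                   // True

Regex.Escape guards against any regex metacharacters in the parameter name; RegexOptions.Compiled is optional but usually helps when the same pattern is applied repeatedly.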

Related

Stop When <br> is Encountered In C# RegEx [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
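For illustration, a small C# sketch (my addition) contrasting the lazy and negated-class approaches on the sample line from the question:

string input = "<xxxx location=\"file path/level1/level2\" xxxx some=\"xxx\">";
Console.WriteLine(Regex.Match(input, "location=\"(.*?)\"").Groups[1].Value);    // file path/level1/level2
Console.WriteLine(Regex.Match(input, "location=\"([^\"]*)\"").Groups[1].Value); // file path/level1/level2

Both print the same value here; the [^"]* version has the advantage that it can never run past the closing quote, even when combined with other parts of a larger pattern.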
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of the lazy quantifier ? with no global flag is the answer.
E.g.,
If you had the global flag /g, it would have matched all of the shortest matches, not just the first one.
Here's another way.
Here's the one you want. This is lazy: [\s\S]*?
The first item:
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy: [\s\S]*
The last item:
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
There's only one difference between these two regular expressions, and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+?, etc.) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e. specify a character class which excludes the starting and ending delimiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using a quantified subpattern and, as described in the Perl documentation,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
    print(match.group(1))
Output:
Mary

substring with regular expression [duplicate]

What do the two terms "greedy" and "lazy" mean, explained in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier   Lazy quantifier   Description
*                   *?                Star Quantifier: 0 or more
+                   +?                Plus Quantifier: 1 or more
?                   ??                Optional Quantifier: 0 or 1
{n}                 {n}?              Quantifier: exactly n
{n,}                {n,}?             Quantifier: n or more
{n,m}               {n,m}?            Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy, i.e. lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflo (up to the last o)
lazy reg expression : s.*?o output: stacko (up to the first o)
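A quick C# check of this example (my addition):

Console.WriteLine(Regex.Match("stackoverflow", "s.*o").Value);   // stackoverflo  (greedy: up to the last o)
Console.WriteLine(Regex.Match("stackoverflow", "s.*?o").Value);  // stacko        (lazy: stops at the first o)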
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engines are greedy by default. Adding a question mark at the end of a quantifier will enable lazy matching.
As @Andre S mentioned in a comment:
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String args[]) {
        String money = "100000000999";
        String greedyRegex = "100(0*)";
        Pattern pattern = Pattern.compile(greedyRegex);
        Matcher matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
        }

        String lazyRegex = "100(0*?)";
        pattern = Pattern.compile(lazyRegex);
        matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
        }
    }
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken from www.regular-expressions.info
Greediness: A greedy quantifier first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: A lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can, e.g. with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS suddenly becomes non-greedy and returns as little as possible, i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-greedy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
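A minimal C# sketch reproducing the refund example (my addition; the pattern and input are the ones used above):

Match m = Regex.Match("$50,000", @"(.{2,5}?)([0-9]*)");
Console.WriteLine(m.Groups[1].Value);   // $5  (the lazy group takes as little as it can)
Console.WriteLine(m.Groups[2].Value);   // 0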
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will keep consuming input that matches your pattern until there is none left and it can look no further.
Lazy will stop as soon as it encounters the first match of the pattern you requested.
One common example that I often encounter is \s*-\s*? in a regex like ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is greedy because of the *: it will match as many whitespace characters as possible after the digits are encountered, and then look for the dash character "-". The second \s*? is lazy because of the *?, which means it will match as few whitespace characters as possible (none, if it can) before the following digits.
Best shown by example. String: 192.168.1.1 and a greedy regex: \b.+\b
You might think this would give you the 1st octet, but it actually matches against the whole string. Why? Because the .+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string were a 4 GB text file and 192.168.1.1 was at the start, you could easily see how this backtracking would cause an issue.
To make a regex non-greedy (lazy), put a question mark after your greedy search, e.g.
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
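To illustrate in C# (my addition; the lazy variant is included to show the contrast):

Console.WriteLine(Regex.Match("192.168.1.1", @"\b.+\b").Value);   // 192.168.1.1  (greedy: runs to the end, then backtracks)
Console.WriteLine(Regex.Match("192.168.1.1", @"\b.+?\b").Value);  // 192          (lazy: stops at the first boundary it can)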
To give extra clarification on laziness, here is one example which is maybe not intuitive at first look but explains the idea of "gradually expands the match" from Suganthan Madhavan Pillai's answer.
input -> some.email#domain.com#
regex -> ^.*?#$
The regex will have a match for this input. At first glance somebody could say the LAZY match (".*?#") will stop at the first #, after which it will check that the input string ends ("$"). Following this logic, someone would conclude there is no match because the input string doesn't end after the first #.
But as you can see this is not the case: the regex will keep going forward, even though we are using a non-greedy (lazy) search, until it hits the second # and has a MINIMAL match.
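A quick C# check of that behaviour (my addition):

Console.WriteLine(Regex.IsMatch("some.email#domain.com#", "^.*?#$"));  // True  (the lazy .*? expands past the first #)
Console.WriteLine(Regex.IsMatch("some.email#domain.com", "^.*?#$"));   // False (there is no # immediately before the end)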
Try to understand the following behavior (note that the . in .{0,1} is unescaped, so it matches any character, not just a literal dot):
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""

Regex: Matching all words EXCEPT those inside of parenthesis (C#)

So given:
COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5
How would I go about getting my matches as:
COLUMN_1
COLUMN_2
COLUMN_3
COLUMN_4
COLUMN_5
I've tried:
(?<!(\(.*?\)))(\w+)(,\s*\w+)*?
But I feel like I'm way off base :( I'm using regexstorm.net for testing.
Appreciate any help :)
You need a regex that keeps track of opening and closing parentheses and makes sure that a word is only matched if a balanced set of parentheses (or no parentheses at all) follow:
Regex regexObj = new Regex(
    @"\w+              # Match a word
    (?=                # only if it's possible to match the following:
      (?>              # Atomic group (used to avoid catastrophic backtracking):
        [^()]+         # Match any characters except parens
      |                # or
        \( (?<DEPTH>)  # a (, increasing the depth counter
      |                # or
        \) (?<-DEPTH>) # a ), decreasing the depth counter
      )*               # any number of times.
      (?(DEPTH)(?!))   # Then make sure the depth counter is zero again
      $                # at the end of the string.
    )                  # (End of lookahead assertion)",
    RegexOptions.IgnorePatternWhitespace);
I tried to provide a test link to regexstorm.net, but it was too long for StackOverflow. Apparently, SO also doesn't like URL shorteners, so I can't link this directly, but you should be able to recreate the link easily: http://bit[dot]ly/2cNZS0O
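Here is a minimal usage sketch (my addition, not part of the original answer), applying regexObj to the sample string from the question:

string input = "COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5";
foreach (Match m in regexObj.Matches(input))
    Console.WriteLine(m.Value);   // per the answer above, only words outside parentheses should be listed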
This should work:
(?<!\()COLUMN_[\d](?!\))
Try it: https://regex101.com/r/bC4D7n/1
Update:
Ok, then try to use this regular expression:
[\(]+[\w\s\W]+[\)]+
Demo here: https://regex101.com/r/bC4D7n/2
Matching all words except some set of them is one of the most difficult exercises you can do with regular expressions. The conceptually easy way is: construct the finite automaton that accepts your original, non-negated predicate about the strings it should accept, then turn all the accepting states into non-accepting ones (and vice versa), and finally construct a regular expression that is equivalent to the automaton just constructed. This is difficult to do by hand, so the easiest way to deal with it is to construct the regexp for the predicate you want to negate and pass your string through the regexp matcher; if it matches, just reject it.
The main problem is that, while this is easy to do with computers, constructing a regular expression from an automaton description is tedious and normally doesn't give you the result you want (it actually gives a huge one). Let me illustrate with an example:
You have asked for matching words, but only those words that don't appear in some set. Let's suppose we want the automaton that matches precisely that set of words, and suppose we have matched the first n-1 letters of one of them. This string should be matched, but only if you don't get the final letter next. So the proper regexp should match all the letters of the first word but the last... Now, we can skip this test if we have a word that matches all the letters in the first word but the last two, and so on, successively, back to the first letter (obviously, if your string doesn't begin with the first letter of the word, it doesn't match it anyway). Let's suppose the first word is BEGIN. A good regexp matching things that are not equal to BEGIN is something like this:
[^B]|B[^E]|BE[^G]|BEG[^I]|BEGI[^N]
A different scenario (that complicates things more) is to find a regexp that matches the string only if the word BEGIN is not contained anywhere in it. Let's start from the opposite predicate, finding a string that has the word BEGIN in it:
^.*BEGIN.*$
and let's construct its finite automata:
(0)---B--->(1)---E--->(2)---G--->(3)---I--->(4)---N--->((5))
 ^ \        |          |          |          |          ^ \
 | |        |          |          |          |          | |
 `-+<-------+<---------+<---------+<---------'          `-+
where the double parentheses indicate an accepting state. If you just
swap the accepting and non-accepting states, you'll get an automaton that accepts all the strings the first one didn't, and vice versa.
((0))--B-->((1))--E-->((2))--G-->((3))--I-->((4))--N-->(5)
  ^ \        |          |          |          |         ^ \
  | |        |          |          |          |         | |
  `-+<-------+<---------+<---------+<---------'         `-+
But converting this into a simple regular expression is far from easy (you can try, if you don't believe me)
And this is only with one word, so think about how to match any of the words: construct the automaton, and then switch the acceptance/non-acceptance status of each state.
In your case, we have something more to deal with, in addition to the fact that your predicate is not equivalent to the one I have formulated. My predicate is for matching expressions that have one word in them (which is the target for which regexps were conceived), but yours is for matching groups inside your regexp. If you try my example, you will find that a simple string such as "" (the empty string) matches the second regexp, as the starting ((0)) state is an accepting state (well, the empty string doesn't contain the word BEGIN), but you want your regexp to match words (and "" isn't a word), so we first need to define what a word is for you and construct the regular expression that matches a word:
[a-zA-Z][a-zA-Z]*
should be a good candidate. It would go in an automaton definition like this:
(0)---[a-zA-Z]--->((1))---[a-zA-Z]--.
 ^ \                | ^             |
 | *              * | |             |
 `--+<--------------' `-------------'
and you want an automaton that accepts both (1 - it must be a word, and 2 - it must not be in the set of words). Not being in the set of words is the same as not being the first word, and not being the second, and not being the third... You can construct it by first constructing an automaton that matches if it's the first word, or the second, or the third, ..., and then negating it. So: construct the first automaton, the second, and then construct an automaton that matches both. This, again, is easy for a computer to do with automata, but not for a person.
As I said, constructing an automaton from a regexp is an easy and direct thing for a computer, but not for a person. Constructing a regexp from an automaton is too, but it results in huge regular expressions, and because of this problem most implementations have ended up providing extended operators that match when some regexp doesn't, and the opposite.
CONCLUSION
Use the negation operators that allow you to get to the opposite predicate about the set of strings your regexp acceptor must accept, or just simply construct a regexp to do simple things and use the boolean algebra to do the rest.
Since you have nested parentheses, things get trickier. Although the .NET regex engine provides balancing group constructs, which use a stack, I'll go with a more general approach called a recursive match (note that the (?R) recursion used below is a PCRE feature, as in the linked demo, not a .NET one).
Regex:
\((?(?!\(|\)).|(?R))*\)|(\w+)
Live demo
All you need is in first capturing group.
Explanation of left side of alternation:
\(              # Match an opening bracket
(?(?!\(|\))     # If next character is not `(` or `)`
    .           # Then match it
 |              # Otherwise
    (?R)        # Recurse the whole pattern
)*              # As much as possible
\)              # Up to corresponding closing bracket

Regex extremely slow on large documents

When running the following code the CPU load goes way up and it takes a long time on larger documents:
string pattern = #"\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*";
Regex regex = new Regex(
pattern,
RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase);
MatchCollection matches = regex.Matches(input); // Here is where it takes time
MessageBox.Show(matches.Count.ToString());
foreach (Match match in matches)
{
...
}
Any idea how to speed it up?
Changing RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase to RegexOptions.Compiled yields the same results (since your pattern does not include any literal letters or ^/$).
On my machine this reduces the time taken on the sample document you linked from 46 seconds to 21 seconds (which still seems slow to me, but might be good enough for you).
EDIT: So I looked into this some more and have discovered the real issue.
The problem is with the first half of your regex: \w+([-+.]\w+)*#. This works fine when matching sections of the input that actually contain the # symbol, but for sections that match just \w+([-+.]\w+)* and are not followed by #, the regex engine wastes a lot of time backtracking and retrying from each position in the sequence (and failing again, because there is still no #!)
You can fix this by forcing the match to start at a word boundary using \b:
string pattern = #"\b\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*";
On your sample document, this produces the same 10 results in under 1 second.
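If you want to compare the two patterns yourself, here is a rough benchmark sketch (my addition; the repeated snippet is just a stand-in for your own document, and actual timings will vary):

using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

class RegexTiming
{
    static void Main()
    {
        // Lots of dotted words with no '#' after them: the worst case described above.
        string input = string.Concat(Enumerable.Repeat("some.dotted.words.with.no.match here ", 2000));

        Time(@"\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*", input);    // original pattern
        Time(@"\b\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*", input);  // with the leading \b
    }

    static void Time(string pattern, string input)
    {
        var sw = Stopwatch.StartNew();
        int count = Regex.Matches(input, pattern, RegexOptions.Multiline | RegexOptions.IgnoreCase).Count;
        Console.WriteLine($"{pattern}: {count} matches in {sw.ElapsedMilliseconds} ms");
    }
}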
Try using regex over streams: the Mono project's regex implementation and this article may be useful for .NET:
Building a Regular Expression Stream with the .NET Framework
and try to improve your regex's performance.
To answer how to change it, you need to tell us what it should match.
The problem is probably in the last part #\w+([-.]\w+)*\.\w+([-.]\w+)*. On a string "bla#a.b.c.d.e-f.g.h" it will have to try many possibilities, till it finds a match.
Could be a little bit of Catastrophic Backtracking.
So, you need to define your pattern in a better, more "unique" way. Do you really need "Dash/dot - dot - dash/dot"?

Extending regular expression syntax to say 'does not contain text XYZ'

I have an app where users can specify regular expressions in a number of places. These are used while running the app to check if text (e.g. URLs and HTML) matches the regexes. Often the users want to be able to say where the text matches ABC and does not match XYZ. To make it easy for them to do this I am thinking of extending regular expression syntax within my app with a way to say 'and does not contain pattern'. Any suggestions on a good way to do this?
My app is written in C# .NET 3.5.
My plan (before I got the awesome answers to this question...)
Currently I'm thinking of using the ¬ character: anything before the ¬ character is a normal regular expression, anything after the ¬ character is a regular expression that can not match in the text to be tested.
So I might use some regexes like this (contrived) example:
on (this|that|these) day(s)?¬(every|all) day(s) ?
Which for example would match 'on this day the man said...' but would not match 'on this day and every day after there will be ...'.
In my code that processes the regex I'll simply split out the two parts of the regex and process them separately, e.g.:
public bool IsMatchExtended(string textToTest, string extendedRegex)
{
    int notPosition = extendedRegex.IndexOf('¬');

    // Just a normal regex:
    if (notPosition == -1)
        return Regex.IsMatch(textToTest, extendedRegex);

    // Use a positive (normal) regex and a negative one
    string positiveRegex = extendedRegex.Substring(0, notPosition);
    string negativeRegex = extendedRegex.Substring(notPosition + 1, extendedRegex.Length - notPosition - 1);

    return Regex.IsMatch(textToTest, positiveRegex) && !Regex.IsMatch(textToTest, negativeRegex);
}
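For example, this is the usage I'd expect (my addition for illustration; the inputs echo the contrived example above):

string extended = "on (this|that|these) day(s)?¬(every|all) day(s)?";
Console.WriteLine(IsMatchExtended("on this day the man said hello", extended));                      // True
Console.WriteLine(IsMatchExtended("on this day and every day after there will be more", extended));  // False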
Any suggestions on a better way to implement such an extension? I'd need to be slightly cleverer about splitting the string on the ¬ character to allow for it to be escaped, so wouldn't just use the simple Substring() splitting above. Anything else to consider?
Alternative plan
In writing this question I also came across this answer which suggests using something like this:
^(?=(?:(?!negative pattern).)*$).*?positive pattern
So I could just advise people to use a pattern like that one, instead of my original plan, when they want to NOT match certain text.
Would that do the equivalent of my original plan? I think it's quite an expensive way to do it performance-wise, and since I'm sometimes parsing large HTML documents this might be an issue, whereas I suppose my original plan would be more performant. Any thoughts (besides the obvious: 'try both and measure them!')?
Possibly pertinent for performance: sometimes there will be several 'words' or a more complex regex that can not be in the text, like (every|all) in my example above but with a few more variations.
Why!?
I know my original approach seems weird, e.g. why not just have two regexes!? But in my particular application administrators provide the regular expressions and it would be rather difficult to give them the ability to provide two regular expressions everywhere they can currently provide one. Much easier in this case to have a syntax for NOT - just trust me on that point.
I have an app that lets administrators define regular expressions at various configuration points. The regular expressions are just used to check if text or URLs match a certain pattern; replacements aren't made and capture groups aren't used. However, often they would like to specify a pattern that says 'where ABC is not in the text'. It's notoriously difficult to do NOT matching in regular expressions, so the usual way is to have two regular expressions: one to specify a pattern that must be matched and one to specify a pattern that must not be matched. If the first is matched and the second is not then the text does match. In my application it would be a lot of work to add the ability to have a second regular expression at each place users can provide one now, so I would like to extend regular expression syntax with a way to say 'and does not contain pattern'.
You don't need to introduce a new symbol. There already is support for what you need in most regex engines. It's just a matter of learning it and applying it.
You have concerns about performance, but have you tested it? Have you measured and demonstrated those performance problems? It will probably be just fine.
Regex works for many many people, in many many different scenarios. It probably fits your requirements, too.
Also, the complicated regex you found in the other SO question can be simplified. There are simple expressions for negative and positive lookaheads and lookbehinds.
?! ?<! ?= ?<=
Some examples
Suppose the sample text is <tr valign='top'><td>Albatross</td></tr>
Given the following regex's, these are the results you will see:
tr - match
td - match
^td - no match
^tr - no match
^<tr - match
^<tr>.*</tr> - no match
^<tr.*>.*</tr> - match
^<tr.*>.*</tr>(?<=tr>) - match
^<tr.*>.*</tr>(?<!tr>) - no match
^<tr.*>.*</tr>(?<!Albatross) - match
^<tr.*>.*</tr>(?<!.*Albatross.*) - no match
^(?!.*Albatross.*)<tr.*>.*</tr> - no match
Explanations
The first two match because the regex can apply anywhere in the sample (or test) string. The second two do not match, because the ^ says "start at the beginning", and the test string does not begin with td or tr - it starts with a left angle bracket.
The fifth example matches because the test string starts with <tr.
The sixth does not, because it wants the sample string to begin with <tr>, with a closing angle bracket immediately following the tr, but in the actual test string, the opening tr includes the valign attribute, so what follows tr is a space. The 7th regex shows how to allow the space and the attribute with wildcards.
The 8th regex applies a positive lookbehind assertion to the end of the regex, using ?<=. It says: match the entire regex only if what immediately precedes the cursor in the test string matches what's in the parens, following the ?<=. In this case, what follows that is tr>. After evaluating ^<tr.*>.*</tr>, the cursor in the test string is positioned at the end of the test string. Therefore, the tr> is matched against the end of the test string, which evaluates to TRUE. Therefore the positive lookbehind evaluates to true, therefore the overall regex matches.
The ninth example shows how to insert a negative lookbehind assertion, using ?<!. Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what follows ?<! in the parens", which in this case is tr>. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. The pattern tr> does match the end of the string, but this is a negative assertion, therefore it evaluates to FALSE, which means the 9th example is NOT a match.
The tenth example uses another negative lookbehind assertion. Basically it says "allow the regex to match if what's right behind the cursor at this point, does not match what's in the parens, in this case Albatross. The bit of regex preceding the assertion, ^<tr.*>.*</tr> matches up to and including the end of the string. Checking "Albatross" against the end of the string yields a negative match, because the test string ends in </tr>. Because the pattern inside the parens of the negative lookbehind does NOT match, that means the negative lookbehind evaluates to TRUE, which means the 10th example is a match.
The 11th example extends the negative lookbehind to include wildcards; in english the result of the negative lookbehind is "only match if the preceding string does not include the word Albatross". In this case the test string DOES include the word, the negative lookbehind evaluates to FALSE, and the 11th regex does not match.
The 12th example uses a negative lookahead assertion. Like lookbehinds, lookaheads are zero-width - they do not move the cursor within the test string for the purposes of string matching. The lookahead in this case, rejects the string right away, because .*Albatross.* matches; because it is a negative lookahead, it evaluates to FALSE, which mean the overall regex fails to match, which means evaluation of the regex against the test string stops there.
Example 12 always evaluates to the same boolean value as example 11, but it behaves differently at runtime. In ex 12, the negative check is performed first, and evaluation stops immediately. In ex 11, the full regex is applied, and evaluates to TRUE, before the lookbehind assertion is checked. So you can see that there may be performance differences when comparing lookaheads and lookbehinds. Which one is right for you depends on what you are matching on, and the relative complexity of the "positive match" pattern and the "negative match" pattern.
For more on this stuff, read up at http://www.regular-expressions.info/
Or get a regex evaluator tool and try out some tests.
like this tool:
source and binary
You can easily accomplish your objectives using a single regex. Here is an example which demonstrates one way to do it. This regex matches a string containing "cat" AND "lion" AND "tiger", but does NOT contain "dog" OR "wolf" OR "hyena":
if (Regex.IsMatch(text, #"
# Match string containing all of one set of words but none of another.
^ # anchor to start of string.
# Positive look ahead assertions for required substrings.
(?=.*? cat ) # Assert string has: 'cat'.
(?=.*? lion ) # Assert string has: 'lion'.
(?=.*? tiger ) # Assert string has: 'tiger'.
# Negative look ahead assertions for not-allowed substrings.
(?!.*? dog ) # Assert string does not have: 'dog'.
(?!.*? wolf ) # Assert string does not have: 'wolf'.
(?!.*? hyena ) # Assert string does not have: 'hyena'.
",
RegexOptions.Singleline | RegexOptions.IgnoreCase |
RegexOptions.IgnorePatternWhitespace)) {
// Successful match
} else {
// Match attempt failed
}
You can see the needed pattern. When assembling the regex, be sure to run each of the user-provided substrings through the Regex.Escape() method to escape any metacharacters they may contain (i.e. (, ), |, etc.). Also, the above regex is written in free-spacing mode for readability. Your production regex should NOT use this mode, otherwise whitespace within the user substrings would be ignored.
You may want to add \b word boundaries before and after each "word" in each assertion if the substrings consist of only real words.
Note also that the negative assertion can be made a bit more efficient using the following alternative syntax:
(?!.*?(?:dog|wolf|hyena))
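Since the substrings come from your administrators at runtime, you would assemble the pattern in code. Here is a hypothetical sketch of that assembly (my addition; BuildPattern is not from the original post, and it escapes each word as recommended above):

using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static string BuildPattern(IEnumerable<string> required, IEnumerable<string> forbidden)
{
    var sb = new StringBuilder("^");
    foreach (string word in required)
        sb.Append("(?=.*?").Append(Regex.Escape(word)).Append(")");    // one positive lookahead per required word
    if (forbidden.Any())
        sb.Append("(?!.*?(?:")
          .Append(string.Join("|", forbidden.Select(Regex.Escape)))    // one combined negative lookahead
          .Append("))");
    return sb.ToString();
}

// Usage, with text being the string from the snippet above:
string pattern = BuildPattern(new[] { "cat", "lion", "tiger" }, new[] { "dog", "wolf", "hyena" });
bool ok = Regex.IsMatch(text, pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);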
