I have kind of a weird problem that I am attempting to resolve with some elegant regular expressions.
The system I am working on was initially designed to accept an incoming string and through a pattern matching method, alter the string which it then returns. A very simplistic example is:
Incoming string:
The dog & I went to the park and had a great time...
Outgoing string:
The dog {&} I went to the park and had a great time {...}
The punctuation mapper wraps key characters or phrases and wraps them in curly braces. The original implementation was a one way street and was never meant for how it is currently being applied and as a result, if it is called incorrectly, it is very easy for the system to "double" wrap a string as it is just doing a simple string replace.
I spun up Regex Hero this morning and started working on some pattern matches and having not written a regular expression in nearly a year, quickly hit a wall.
My first idea was to match a character (i.e. &) but only if it wasn't wrapped in braces and came up with [^\{]&[^\}], which is great but of course catches any instance of the ampersand so long as it is not preceded by a curly brace, including white spaces and would not work in a situation where there were two ampersands back to back (i.e. && would need to be {&}{&} in the outgoing string. To make matters more complicated, it is not always a single character as ellipsis (...) is also one of the mapped values.
Every solution I noodle over either hits a barrier because there is an unknown number of occurrences of a particular value in the string or that the capture groups will either be too greedy or finally, cannot compensate for multiple values back to back (i.e. a single period . vs ellipsis ...) which the original dev handled by processing ellipsis first which covered the period in the string replace implementation.
Are there any regex gurus out there that have any ideas on how I can detect the undecorated (unwrapped) values in a string and then perform their replacements in an ungreedy fashion that can also handle multiple repeated characters?
My datasource that I am working against is a simple key value pair that contains the value to be searched for and the value to replace it with.
Updated with example strings:
Undecorated:
Show Details...
Default Server:
"Smart" 2-Way
Show Lender's Information
Black & White
Decorated:
Show Details{...}
Default Server{:}
{"}Smart{"} 2-Way
Show Lender{'}s Information
Black {&} White
Updated With More Concrete Examples and Datasource
Datasource (SQL table, can grow at any time):
TaggedValue UntaggedValue
{:} :
{&} &
{<} <
{$} $
{'} '
{} \
{>} >
{"} "
{%} %
{...} ...
{...} …
{:} :
{"} “
{"} ”
{'} `
{'} ’
Broken String: This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&
String that needs decoration: Show Details... Default Server: "Smart" 2-Way Show Lender's Information Black & White
String that would pass through the method untouched (because it was already decorated): The dog {&} I went to the park and had a great time {...}
The other "gotcha" in moving to regex is the need to handle escaping, especially of backslashes elegantly due to their function in regular expressions.
Updated with output from #Ethan Brown
#Ethan Brown,
I am starting think that regex, while elegant might not be the way to go here. The updated code you provided, while closer still does not yield correct results and the number of variables involved may exceed the regex logics capability.
Using my example above:
'This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&'
yields
This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it {&} follow by {!} and {...} {_} {&} . {&&}&
Where the last group of ampersands which should come out as {&}{&}{&} actually comes out as {&&}&.
There is so much variability here (i.e. need to handle ellipsis and wide ellipsis from far east languages) and the need to utilize a database as the datasource is paramount.
I think I am just going to write a custom evaluator which I can easily enough write to perform this type of validation and shelve the regex route for now. I will grant you credit for your answer and work as soon as I get in front of a desktop browser.
This kind of problem can be really tough, but let me give you some ideas that might help out. One thing that's really going to give you headaches is handling the case where the punctuation appears at the beginning or end of the string. Certainly that's possible to handle in a regex with a construct like (^|[^{])&($|[^}]), but in addition to that being painfully hard to read, it also has efficiency issues. However, there's a simple way to "cheat" and get around this problem: just pad your input string with a space on either end:
var input = " " + originalInput + " ";
When you're done you can just trim. Of course if you care about preserving input at the beginning or end, you'll have to be more clever, but I'm going to assume for argument's sake that you don't.
So now on to the meat of the problem. Certainly, we can come up with some elaborate regular expressions to do what we're looking for, but often the answer is much much simpler if you use more than one regular expression.
Since you've updated your answer with more characters, and more problem inputs, I've updated this answer to be more flexible: hopefully it will meet your needs better as more characters get added.
Looking over your input space, and the expressions you need quoted, there are really three cases:
Single-character replacements (! becomes {!}, for example).
Multi-character replacements (... becomes {...}).
Slash replacement (\ becomes {})
Since the period is included in the single-character replacements, order matters: if you replace all the periods first, then you will miss ellipses.
Because I find the C# regex library a little clunky, I use the following extension method to make this more "fluent":
public static class StringExtensions {
public static string RegexReplace( this string s, string regex, string replacement ) {
return Regex.Replace( s, regex, replacement );
}
}
Now I can cover all of the cases:
// putting this into a const will make it easier to add new
// characters in the future
const string normalQuotedChars = #"\!_\\:&<\$'>""%:`";
var output = s
.RegexReplace( "(?<=[^{])\\.\\.\\.(?=[^}])", "{$&}" )
.RegexReplace( "(?<=[^{])[" + normalQuotedChars + "](?=[^}])", "{$&}" )
.RegexReplace( "\\\\", "{}" );
So let's break this solution down:
First we handle the ellipses (which will keep us from getting in trouble with periods later). Note that we use a zero-width assertions at the beginning and end of the expression to exclude expressions that are already quoted. The zero-width assertions are necessary, because without them, we'd get into trouble with quoted characters right next to each other. For example, if you have the regex ([^{])!([^}]), and your input string is foo !! bar, the match would include the space before the first exclamation point and the second exclamation point. A naive replacement of $1!$2 would therefore yield foo {!}! bar because the second exclamation point would have been consumed as part of the match. You'd have to end up doing an exhaustive match, and it's much easier to just use zero-width assertions, which are not consumed.
Then we handle all of the normal quoted characters. Note that we use zero-width assertions here for the same reasons as above.
Finally, we can find lone slashes (note we have to escape it twice: once for C# strings and again for regex metacharacters) and replace that with empty curly brackets.
I ran all of your test cases (and a few of my own invention) through this series of matches, and it all worked as expected.
I'm no regex god, so one simple way:
Get / construct the final replacement string(s) - ex. "{...}", "{&}"
Replace all occurrences of these in the input with a reserved char (unicode to the rescue)
Run your matching regex(es) and put "{" or whatever desired marker(s).
Replace reserved char(s) with the original string.
Ignoring the case where your original input string has a { or } character, a common way to avoid re-applying a regex to an already-escaped string is to look for the escape sequence and remove it from the string before applying your regex to the remainders. Here's an example regex to find things that are already escaped:
Regex escapedPattern = new Regex(#"\{[^{}]*\}"); // consider adding RegexOptions.Compiled
The basic idea of this negative-character class pattern comes from regular-expressions.info, a very helpful site for all thing regex. The pattern works because for any inner-most pair of braces, there must be a { followed by non {}'s followed by a }
Run the escapedPattern on the input string, find for each Match get the start and end indices in the original string and substring them out, then with the final cleaned string run your original pattern match again or use something like the following:
Regex punctPattern = new Regex(#"[^\w\d\s]+"); // this assumes all non-word,
// digit or space chars are punctuation, which may not be a correct
//assumption
And replace Match.Groups[1].Value for each match (groups are a 0 based array where 0 is the whole match, 1 is the first set of parentheses, 2 is the next etc.) with "{" + Match.Groups[1].Value + "}"
Related
I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print re.findall(r'"(.*?)"', string)
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
Lets see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise nor aesthetic, but to be efficient.
These ways use the first character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to discard quickly characters that are not quotes without to test the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously to deal with strings that haven't balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can choose too that a quoted part can be an opening quote until the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers, you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation but with ["'] at the beginning, in factor.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind is looking behind the current character to check for a quote, if found then start from there and then the lookahead is checking the character ahead for a quote and if found stop on that character. The lookbehind group (the ["']) is wrapped in brackets to create a group for whichever quote was found at the start, this is then used at the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of accepted answer returns the values including their sourrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
-
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
A very late answer, but like to answer
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job but I am concerned of its performances (it's not bad but could be better). Mine below it's ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution i used
\"([^\"]*?icon[^\"]*?)\"
TLDR;
replace the word icon with what your looking for in said quotes and voila!
The way this works is it looks for the keyword and doesn't care what else in between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters thats not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
I liked Axeman's more expansive version, but had some trouble with it (it didn't match for example
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly, so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
string = "\" foo bar\" \"loloo\""
print re.findall(r'"(.*?)"',string)
just try this out , works like a charm !!!
\ indicates skip character
Unlike Adam's answer, I have a simple but worked one:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answer above are good.... except they DOES NOT support all the unicode characters! at ECMA Script (Javascript)
If you are a Node users, you might want the the modified version of accepted answer that support all unicode characters :
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedy approach to continue matching everything zero or more times until it encounters ' or " at end of the string. After encountering such state, regex engine backtrack to previous matching character and here regex is over and will move to next regex.
\1 -> Matches to the character or string that have been matched earlier with the first capture group.
(?![^\s]) -> Negative lookahead to ensure there should not any non space character after the previous match
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match, no partial matching could should trigger a hit
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg%(needle), haystack, re.IGNORECASE):
print "winning..."
Hunter
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
A supplementary answer for the subset of Microsoft VBA coders only one uses the library Microsoft VBScript Regular Expressions 5.5 and this gives the following code
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub
I have an awful time with regular expressions, so I usually resort to lousy kludges and workarounds when parsing strings. I need to get better at using regex. This one seems simple to me, but I don't even know where to start.
Here's the string output from my device:
testString = IP:192.168.5.210\rPlaylist:1\rEnable:On\rMode:HDMI\rLineIn:unbal\r
Example:
I want to find if the device is off or on. I need to search for the string "Enable:" then locate the carriage return and determine if the word between Enable: and \r is off or on. It seems like that's what regex is for or do I totally misunderstand it.
Can someone point me in the right direction?
Additional information - Maybe I need to expand on the question.
Based on the answers, finding whether or not the device is Enabled appears to be fairly simple. Since I get a return string is similar to a key/value pair what's more vexing determining the substring between the : and the carriage return. A number of these pairs have a response with lengths that vary significantly, such as DeviceLocation, DeviceName, IPAddress. In fact, the device responds to every command sent to it by returning the entire status list, 48 key/value pairs, which I then must parse even if I only need to know one property.
Also based on your answers .... regular expressions is not the way to go.
Thanks for any help.
Norm
I would suggest for a simple line as shown, ask for one or the other, but verify as well. Based partially off Ken White's suggestions.
if(input.Contains(":On")){
//DoWork()
}else{
if(input.Contains(":Off"))
//DoOtherWork
}
This presumes that ":On" and ":Off" will not appear anywhere else in the string, even with a different string.
Consider the following code:
// This regular expression matches text 'Enabled: ' followed by one or more non '\r' followed by '\r'
// RegexOptions.Multiline is optional but MAY be necessary on other platforms.
// Also, '\r' is not a line break. '\n' is.
Regex regex = new Regex("Enable: ([^\r]+)\r", RegexOptions.Multiline);
string input = "IP:192.168.5.210\rPlaylist: 1\rEnable: On\rMode: HDMI\rLineIn: unbal\r";
var matches = regex.Match(input);
Debug.Assert(matches != Match.Empty);
// The match variable will contain 2 Groups:
// First will be 'Enabled: On\r'
// The other is 'On' since we enclosed ([^\r]+) in ().
Console.WriteLine(matches.Groups[1]);
I have a block of text as such.
google.sbox.p50 && google.sbox.p50(["how to",[["how to tie a tie",0],["how to train your dragon 2 trailer",0],["how to do the cup song",0],["how to get a six pack in 3 minutes",0],["how to make a paper gun that shoots",0],["how to basic",0],["how to love lil wayne",0],["how to sing like your favorite artist",0],["how to be a heartbreaker marina and the diamonds",0],["how to tame a horse in minecraft",0]],{"q":"XJW--0IKH6sqOp0ME-x5B7b_5wY","j":"5","k":1}])
Using \\[([^]]+)\\] I am able to get everything I need, but with a little extra that I don't. I do not need the ["how to",[[. I only need the blocks that are formatted like,
["how to tie a tie",0]
Can someone please help me modify my expression to only get what I need? I've been at it for hours and I can't grasp the idea of RegEx.
Put both the opening and closing square brackets in the negated character class?
\\[([^][]+)\\]
\\[ matches a literal [
\\] matches a literal ]
[^][] is a negated class, which for instance matches any character except ][. It might be a little difficult to see it, but it's equivalent to [^\\]\\[]. Here the double escapes are not required because you are using a character class (just like \\. is equivalent to [.])
([^][]+) captures everything within square brackets, making sure there's no ] or [ inside.
In C#, you can use the # symbol to avoid having to double escape everytime and using this makes the regex like that:
var regex = new Regex(#"\[([^][]+)\]");
Note: This regex will capture everything within square brackets. If you wish to specificly get the format ["how to tie a tie",0], you can be more precise. After all, the regex will only match stuff you make it match:
var regex = new Regex(#"\["[^"]+",0\]");
Here, we have another negated character class: [^"]. This will match any character which is not a quote character.
This one assumes that the digit is always 0, as depicted in your sample text block. If you have multiple possibilities of numbers, you can use the character class [0-9]+:
var regex = new Regex(#"\["[^"]+",[0-9]+\]");
You can use \d+ as well, but this character class also matches other characters which may or may not render the regex worse. If you want to be more even cautious by allowing possible spaces, tabs, newlines, form feeds in between the characters, you can use this regex:
var regex = new Regex(#"\[\s*"[^"]+"\s*,\s*[0-9]+\s*\]");
Conclusion, there might be many regexes which suit what you need, just make sure you know how your data is coming through so you can pick one which has the right amount of freeway.
I think this is what you are looking for to match the format of ["how to tie a tie",0]:
(\["[^"]+",\d\])
( ) - around the whole thing so it all gets captured in this group
\[" - find ["
[^"]+ - find one or more of anything except "
", - find ",
\d - find a number, if you want more than just a single digit, do \d+
\] - match the ending ]
The only variable things in this regex are whatever is within the quotes ([^"]+) and the number (\d+).
Demo
If you don't want the square brackets in the capture group, you can do it like this:
\[("[^"]+",\d+)\]
I assume you don't want to match if there are quotes within your quotes as it would probably break whatever purpose you are using it for, but if you do, this should work:
\[("[^[\]]+",\d+)\]
You must use this pattern
#"\[[^][]+\]"
More informations about square brackets here.
I think you need this one: (\[[^\[^]+?])
What you did mis is the ? (smallest match) and exclude any [ or ]
Seemingly the text in the outer brackets is a JSON representation of an object. Instead of a regular expression I'd just:
strip off the stuff before the bracket + first bracket (google.sbox.p50 && google.sbox.p50() plus strip off the trailing bracket ). There are more ways to do this, and it can be more efficient than regex.
JSON parse the remaining inner part.
From that point you have the object representation, you can leave out the first element of the array what you don't need, plus you have everything else in a traversable form.
There's the session information at the end along with parameters anyway (in {} brackets), so in the end you may end up parsing stuff anyway. Better not to reinvent the wheel (JSON parsing).
I have an app where users can specify regular expressions in a number of places. These are used while running the app to check if text (e.g. URLs and HTML) matches the regexes. Often the users want to be able to say where the text matches ABC and does not match XYZ. To make it easy for them to do this I am thinking of extending regular expression syntax within my app with a way to say 'and does not contain pattern'. Any suggestions on a good way to do this?
My app is written in C# .NET 3.5.
My plan (before I got the awesome answers to this question...)
Currently I'm thinking of using the ¬ character: anything before the ¬ character is a normal regular expression, anything after the ¬ character is a regular expression that can not match in the text to be tested.
So I might use some regexes like this (contrived) example:
on (this|that|these) day(s)?¬(every|all) day(s) ?
Which for example would match 'on this day the man said...' but would not match 'on this day and every day after there will be ...'.
In my code that processes the regex I'll simply split out the two parts of the regex and process them separately, e.g.:
public bool IsMatchExtended(string textToTest, string extendedRegex)
{
int notPosition = extendedRegex.IndexOf('¬');
// Just a normal regex:
if (notPosition==-1)
return Regex.IsMatch(textToTest, extendedRegex);
// Use a positive (normal) regex and a negative one
string positiveRegex = extendedRegex.Substring(0, notPosition);
string negativeRegex = extendedRegex.Substring(notPosition + 1, extendedRegex.Length - notPosition - 1);
return Regex.IsMatch(textToTest, positiveRegex) && !Regex.IsMatch(textToTest, negativeRegex);
}
Any suggestions on a better way to implement such an extension? I'd need to be slightly cleverer about splitting the string on the ¬ character to allow for it to be escaped, so wouldn't just use the simple Substring() splitting above. Anything else to consider?
Alternative plan
In writing this question I also came across this answer which suggests using something like this:
^(?=(?:(?!negative pattern).)*$).*?positive pattern
So I could just advise people to use a pattern like, instead of my original plan, when they want to NOT match certain text.
Would that do the equivalent of my original plan? I think it's quite an expensive way to do it peformance-wise, and since I'm sometimes parsing large html documents this might be an issue, whereas I suppose my original plan would be more performant. Any thoughts (besides the obvious: 'try both and measure them!')?
Possibly pertinent for performance: sometimes there will be several 'words' or a more complex regex that can not be in the text, like (every|all) in my example above but with a few more variations.
Why!?
I know my original approach seems weird, e.g. why not just have two regexes!? But in my particular application administrators provide the regular expressions and it would be rather difficult to give them the ability to provide two regular expressions everywhere they can currently provide one. Much easier in this case to have a syntax for NOT - just trust me on that point.
I have an app that lets administrators define regular expressions at various configuration points. The regular expressions are just used to check if text or URLs match a certain pattern; replacements aren't made and capture groups aren't used. However, often they would like to specify a pattern that says 'where ABC is not in the text'. It's notoriously difficult to do NOT matching in regular expressions, so the usual way is to have two regular expressions: one to specify a pattern that must be matched and one to specify a pattern that must not be matched. If the first is matched and the second is not then the text does match. In my application it would be a lot of work to add the ability to have a second regular expression at each place users can provide one now, so I would like to extend regular expression syntax with a way to say 'and does not contain
pattern'.
You don't need to introduce a new symbol. There already is support for what you need in most regex engines. It's just a matter of learning it and applying it.
You have concerns about performance, but have you tested it? Have you measured and demonstrated those performance problems? It will probably be just fine.
Regex works for many many people, in many many different scenarios. It probably fits your requirements, too.
Also, the complicated regex you found on the other SO question, can be simplified. There are simple expressions for negative and positive lookaheads and lookbehinds.
?! ?<! ?= ?<=
Some examples
Suppose the sample text is <tr valign='top'><td>Albatross</td></tr>
Given the following regex's, these are the results you will see:
tr - match
td - match
^td - no match
^tr - no match
^<tr - match
^<tr>.*</tr> - no match
^<tr.*>.*</tr> - match
^<tr.*>.*</tr>(?<tr>) - match
^<tr.*>.*</tr>(?<!tr>) - no match
^<tr.*>.*</tr>(?<!Albatross) - match
^<tr.*>.*</tr>(?<!.*Albatross.*) - no match
^(?!.*Albatross.*)<tr.*>.*</tr> - no match
Explanations
The first two match because the regex can apply anywhere in the sample (or test) string. The second two do not match, because the ^ says "start at the beginning", and the test string does not begin with td or tr - it starts with a left angle bracket.
The fifth example matches because the test string starts with <tr.
The sixth does not, because it wants the sample string to begin with <tr>, with a closing angle bracket immediately following the tr, but in the actual test string, the opening tr includes the valign attribute, so what follows tr is a space. The 7th regex shows how to allow the space and the attribute with wildcards.
The 8th regex applies a positive lookbehind assertion to the end of the regex, using ?<. It says, match the entire regex only if what immediately precedes the cursor in the test string, matches what's in the parens, following the ?<. In this case, what follows that is tr>. After evaluating ``^.*, the cursor in the test string is positioned at the end of the test string. Therefore, thetr>` is matched against the end of the test string, which evaluates to TRUE. Therefore the positive lookbehind evaluates to true, therefore the overall regex matches.
The ninth example shows how to insert a negative lookbehind assertion, using ?<! . Basically it says "allow the regex to match if what's right behind the cursor at this point, does not match what follows ?<! in the parens, which in this case is tr>. The bit of regex preceding the assertion, ^<tr.*>.*</tr> matches up to and including the end of the string. Because the pattern tr> does match the end of the string. But this is a negative assertion, therefore it evaluates to FALSE, which means the 9th example is NOT a match.
The tenth example uses another negative lookbehind assertion. Basically it says "allow the regex to match if what's right behind the cursor at this point, does not match what's in the parens, in this case Albatross. The bit of regex preceding the assertion, ^<tr.*>.*</tr> matches up to and including the end of the string. Checking "Albatross" against the end of the string yields a negative match, because the test string ends in </tr>. Because the pattern inside the parens of the negative lookbehind does NOT match, that means the negative lookbehind evaluates to TRUE, which means the 10th example is a match.
The 11th example extends the negative lookbehind to include wildcards; in english the result of the negative lookbehind is "only match if the preceding string does not include the word Albatross". In this case the test string DOES include the word, the negative lookbehind evaluates to FALSE, and the 11th regex does not match.
The 12th example uses a negative lookahead assertion. Like lookbehinds, lookaheads are zero-width - they do not move the cursor within the test string for the purposes of string matching. The lookahead in this case, rejects the string right away, because .*Albatross.* matches; because it is a negative lookahead, it evaluates to FALSE, which mean the overall regex fails to match, which means evaluation of the regex against the test string stops there.
example 12 always evaluates to the same boolean value as example 11, but it behaves differently at runtime. In ex 12, the negative check is performed first, at stops immediately. In ex 11, the full regex is applied, and evaluates to TRUE, before the lookbehind assertion is checked. So you can see that there may be performance differences when comparing lookaheads and lookbehinds. Which one is right for you depends on what you are matching on, and the relative complexity of the "positive match" pattern and the "negative match" pattern.
For more on this stuff, read up at http://www.regular-expressions.info/
Or get a regex evaluator tool and try out some tests.
like this tool:
source and binary
You can easily accomplish your objectives using a single regex. Here is an example which demonstrates one way to do it. This regex matches a string containing "cat" AND "lion" AND "tiger", but does NOT contain "dog" OR "wolf" OR "hyena":
if (Regex.IsMatch(text, #"
# Match string containing all of one set of words but none of another.
^ # anchor to start of string.
# Positive look ahead assertions for required substrings.
(?=.*? cat ) # Assert string has: 'cat'.
(?=.*? lion ) # Assert string has: 'lion'.
(?=.*? tiger ) # Assert string has: 'tiger'.
# Negative look ahead assertions for not-allowed substrings.
(?!.*? dog ) # Assert string does not have: 'dog'.
(?!.*? wolf ) # Assert string does not have: 'wolf'.
(?!.*? hyena ) # Assert string does not have: 'hyena'.
",
RegexOptions.Singleline | RegexOptions.IgnoreCase |
RegexOptions.IgnorePatternWhitespace)) {
// Successful match
} else {
// Match attempt failed
}
You can see the needed pattern. When assembling the regex, be sure to run each of the user provided sub-strings through the Regex.escape() method to escape any metacharacters it may contain (i.e. (, ), | etc). Also, the above regex is written in free-spacing mode for readability. Your production regex should NOT use this mode, otherwise whitespace within the user substrings would be ignored.
You may want to add \b word boundaries before and after each "word" in each assertion if the substrings consist of only real words.
Note also that the negative assertion can be made a bit more efficient using the following alternative syntax:
(?!.*?(?:dog|wolf|hyena))
I really know very little about regex's.
I'm trying to test a password validation.
Here's the regex that describes it (I didn't write it, and don't know what it means):
private static string passwordField = "[^A-Za-z0-9_.\\-!##$%^&*()=+;:'\"|~`<>?\\/{}]";
I've tried a password like "dfgbrk*", and my code, using the above regex, allowed it.
Is this consistent with what the regex defines as acceptable, or is it a problem with my code?
Can you give me an example of a string that validation using the above regex isn't suppose to allow?
Added: Here's how the original code uses this regex (and it works there):
public static bool ValidateTextExp(string regexp, string sText)
{
if ( sText == null)
{
Log.WriteWarning("ValidateTextExp got null text to validate against regExp {0} . returning false",regexp);
return false;
}
return (!Regex.IsMatch(sText, regexp));
}
It seems I'm doing something wrong..
Thanks.
Your regex matches a value that contains any single character which is not in that list.
Your test value matches because it has spaces in it, which do not appear to be in your expression.
The reason it's not is because your character class starts with ^. The reason it matches any value that contains any single character that is not that is because you did not specify the beginning or end of the string, or any quantifiers.
The above assumes I'm not missing the importance of any of the characters in the middle of the character soup :)
This answer is also dependent on how you actually use the Regex in code.
If your intention was for that Regex string to represent the only characters that are actually allowed in a password, you would change the regex like so:
string pattern = "^[A-Z0-9...etc...]+$";
The important parts there are:
The ^ has been removed from inside the bracket, to outside; where it signifies the start of the whole string.
The $ has been added to the end, where it signifies the end of the whole string.
Those are needed because otherwise, your pattern will match anything that contains the valid values anywhere inside - even if invalid values are also present.
finally, I've added the + quantifier, which means you want to find any one of those valid characters, one or more times. (this regex would not permit a 0-length password)
If you wanted to permit the ^ character also as part of the password, you would add it back in between the brackets, but just *not as the first thing right after the opening bracket [. So for example:
string pattern = "^[A-Z0-9^...etc...]+$";
The ^ has special meaning in different places at different times in Regexes.
[^A-Za-z0-9_.\-!##$%^&*()=+;:'\"|~`?\/{}]
----------------------^
Looks fine to me, at least in regards to your question title. I'm not clear yet on why the spaces in your sample don't trip it up.
Note that I'm assuming the purpose of this expression is to find invalid characters. Thus, if the expression is a positive match, you have a bad password that you must reject. Since there appears to be some confusion about this, perhaps I can clear it up with a little psuedo-code:
bool isGoodPassword = !Regex.IsMatch(#"[^A-Za-z0-9_.\-!...]", requestedPassword);
You could re-write this for a positive match (without the negation) like so:
bool isGoodPassword = Regex.IsMatch(#"^[A-Za-z0-9_.\-!...]+$", requestedPassword);
The new expression matches a string that from the beginning of the string is filled with one or more of any of the characters in the list all the way the way to end. Any character not in the list would cause the match to fail.
You regular expression is just an inverted character class and describes just one single character (but that can’t be *). So it depends on how you use that character class.
Depends on how you apply it. It describes exactly one character, however, the ^ in the beginning buggs me a little, as it prohibits every other character, so there is probably something terribly fishy there.
Edit: as pointed out in other answers, the reason for your string to match is the space, not the explanation that was replaced by this line.