Regex - Match pattern A unless it matches pattern B - c#

I'm trying to exclude some very specific routes from my MVC project.
More specifically I want to ignore all calls to .ashx pages, unless they match a certain pattern.
(?<!invoices\/(order|membership)\/(\d{5,})-([a-f0-9]{8}))\.ashx
This is the pattern I came up with, but since you can't use quantifiers in a negated lookbehind, it's not working.
Any ideas as to how I can achieve this so I can ignore my routes correctly with a call like this:
routes.Ignore("{*handlers}", new { handlers = "(?<!invoices/(order|membership)/(\\d{5,})-([a-f0-9]{8}))\\.ashx" });

.NET regex flavor does support infinite-width lookbehinds, so the only issue with your pattern is the double backslashes. Use \d instead of \\d and \. instead of \\., or just work around that with character classes [0-9] (a digit) and [.] (a literal dot):
(?<!invoices/(order|membership)/[0-9]{5,}-[a-f0-9]{8})[.]ashx
                                ^^^^^                 ^^^
You can also get rid of the lookbehind, and use a lookahead anchored at the start:
^(?!.*invoices/(order|membership)/[0-9]{5,}-[a-f0-9]{8}).*[.]ashx
The (?!.*invoices/(order|membership)/[0-9]{5,}-[a-f0-9]{8}) negative lookahead will fail the match if the string contains the invoices/(order|membership)/[0-9]{5,}-[a-f0-9]{8} pattern (remove the first .* to require that the string starts with it instead).
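The lookahead variant can be tried outside of routing with a short script. This is only a sketch in Python rather than C#: the helper name should_ignore and the sample paths are made up for illustration, and the pattern is written so that it ends at ashx with nothing required after it. Python's re module has no variable-width lookbehind, so only the lookahead form is shown.

```python
import re

# Lookahead variant: refuse to match when the invoice pattern occurs anywhere,
# otherwise match any path that contains ".ashx".
IGNORE_ASHX = re.compile(
    r"^(?!.*invoices/(?:order|membership)/[0-9]{5,}-[a-f0-9]{8}).*[.]ashx"
)


def should_ignore(path: str) -> bool:
    """Return True for .ashx paths that are not invoice handler URLs."""
    return IGNORE_ASHX.search(path) is not None


if __name__ == "__main__":
    # Hypothetical sample paths, not taken from the question.
    for path in (
        "handlers/upload.ashx",
        "invoices/order/12345-deadbeef.ashx",
        "invoices/order/123-deadbeef.ashx",
        "home/index",
    ):
        print(path, should_ignore(path))
```

Paths for which should_ignore returns True are the ones the route constraint would ignore; the invoice URL with five or more digits and eight hex characters is left for MVC to handle.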

Related

Stop When <br> is Encountered In C# RegEx [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
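To see the difference between the three forms side by side, here is a small Python sketch that runs them against the tag from the question. The variable names are illustrative only.

```python
import re

tag = '<xxxx location="file path/level1/level2" xxxx some="xxx">'

# Greedy: .* runs to the last quote on the line.
greedy = re.search(r'location="(.*)"', tag).group(1)

# Lazy: .*? stops at the first quote that lets the match succeed.
lazy = re.search(r'location="(.*?)"', tag).group(1)

# Negated class: [^"]* can never cross a quote in the first place.
negated = re.search(r'location="([^"]*)"', tag).group(1)

print(repr(greedy))   # 'file path/level1/level2" xxxx some="xxx'
print(repr(lazy))     # 'file path/level1/level2'
print(repr(negated))  # 'file path/level1/level2'
```

The negated class gives the same result as the lazy version here and also works in engines that have no lazy quantifiers.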
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Using the lazy quantifier ? without the global flag is the answer.
If you had the global flag /g, it would have matched every shortest-length match in the input instead of only the first one.
Here's another way.
Here's the one you want. This is lazy: [\s\S]*?
The first item:
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy: [\s\S]*
The last item:
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
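A quick way to check the first/last behaviour is to run both replacements on an input with two location attributes. This is a Python sketch with a made-up sample string; both patterns are written with a capture group around [^"]* so that the replacement has a group 1 to refer to, and Python spells the replacement \1 rather than $1.

```python
import re

# Hypothetical input containing two location attributes.
text = '<a location="first/path"> <b location="second/path">'

# Lazy prefix: stops at the first location="...".
FIRST = r'[\s\S]*?(?:location="([^"]*)")[\s\S]*'
# Greedy prefix: backtracks from the end, so it lands on the last one.
LAST = r'[\s\S]*(?:location="([^"]*)")[\s\S]*'

first = re.sub(FIRST, r"\1", text, count=1)
last = re.sub(LAST, r"\1", text, count=1)

print(first)  # first/path
print(last)   # second/path
```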
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+? etc.) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e., specify a character class which excludes the starting and ending delimiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using a quantified subpattern and, as described in the Perl documentation,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex  # third-party module; the standard re module behaves the same for this pattern
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, text):
    print(match.group(1))
Output:
Mary

Regex for alpha number string in c# accepting underscore and white spaces

I have already gone through many posts on SO, but I didn't find what I needed for my specific scenario.
I need a regex for an alphanumeric string
where the following conditions are met.
Valid string:
ameya123 (alphabets and numbers)
ameya (only alphabets)
AMeya12(Capital and normal alphabets and numbers)
Ameya_123 (alphabets and underscore and numbers)
Ameya_ 123 (alphabets, underscore and white spaces)
Invalid string:
123 (only numbers)
_ (only underscore)
" " (only white spaces)
any special character other than underscore
What I have tried so far:
(?=.*[a-zA-Z])(?=.*[0-9]*[\s]*[_]*)
The above regex works in an online regex editor, but it does not work in a data annotation in C#.
Please suggest.
Based on your requirements and not your attempt, what you are in need of is this:
^(?!(?:\d+|_+| +)$)[\w ]+$
The negative lookahead looks for undesired matches to fail the whole process. Those are strings containing digits only, underscores only or spaces only. If they never happen we want to have a match for ^[\w ]+$ which is nearly the same as ^[a-zA-Z0-9_ ]+$.
Explanation:
^ Start of line / string
(?! Start of negative lookahead
(?: Start of non-capturing group
\d+ Match digits
| Or
_+ Match underscores
| Or
[ ]+ Match spaces
)$ End of non-capturing group immediately followed by end of line / string (none of previous matches should be found)
) End of negative lookahead
[\w ]+$ Match a character inside the character set up to end of input string
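Note: in PCRE, \w is a shorthand for [a-zA-Z0-9_] unless the u modifier is set. In .NET, \w also matches Unicode letters and digits unless RegexOptions.ECMAScript is specified.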
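The pattern can be checked against the sample strings from the question. The sketch below uses Python instead of C#; re.ASCII is passed so that \w and \d are restricted to ASCII, the helper name is_valid is invented for the example, and "ameya!" is an extra sample standing in for "any special character".

```python
import re

# re.ASCII keeps \w equal to [a-zA-Z0-9_] and \d equal to [0-9].
PATTERN = re.compile(r"^(?!(?:\d+|_+| +)$)[\w ]+$", re.ASCII)


def is_valid(value: str) -> bool:
    """True when the value passes the lookahead-based pattern."""
    return PATTERN.match(value) is not None


valid = ["ameya123", "ameya", "AMeya12", "Ameya_123", "Ameya_ 123"]
invalid = ["123", "_", " ", "ameya!"]

for value in valid + invalid:
    print(repr(value), is_valid(value))
```

One thing this makes visible: the lookahead only rejects strings made of a single kind of character, so a mixed input without any letter, such as _1, still passes.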
One problem with your regex is that in annotations, the regex must match and consume the entire string input, while your pattern only contains lookarounds that do not consume any text.
You may use
^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$
Note that \w, when used for server-side validation and thus parsed by the .NET regex engine, also allows any Unicode letters, digits, and a few other characters, so I'd rather stick to [A-Za-z0-9_] to be consistent between server- and client-side validation.
Details
^ - start of string (not necessary here, but good to have when debugging)
(?!\d+$) - a negative lookahead that fails the match if the whole string consists of digits
(?![_\s]+$) - a negative lookahead that fails the match if the whole string consists of underscores and/or whitespace. NOTE: if you only want to disallow inputs like ____ or " " (all underscores or all whitespace, but not a mix of the two), split this lookahead into (?!_+$) and (?!\s+$)
[A-Za-z0-9\s_]+ - 1+ ASCII letters, digits, _ and whitespace chars
$ - end of string (not necessary here, but still good to have).
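The difference between the combined lookahead and the split version from the NOTE shows up only for inputs that mix underscores and whitespace. A small Python sketch, with invented names COMBINED and SPLIT and re.ASCII added so that \d and \s stay ASCII-only:

```python
import re

# One lookahead for "underscores and/or whitespace only".
COMBINED = re.compile(r"^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$", re.ASCII)
# Separate lookaheads: all-underscore and all-whitespace are rejected,
# but a mix of the two is let through.
SPLIT = re.compile(r"^(?!\d+$)(?!_+$)(?!\s+$)[A-Za-z0-9\s_]+$", re.ASCII)

for value in ["Ameya_ 123", "123", "___", "   ", "_ _"]:
    print(
        repr(value),
        COMBINED.match(value) is not None,
        SPLIT.match(value) is not None,
    )
```

Only the last sample, "_ _", is treated differently: the combined pattern rejects it and the split pattern accepts it.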
If I understand your requirements correctly, you need to match one or more letters (uppercase or lowercase), and possibly zero or more of digits, whitespace, or underscore. This implies the following pattern:
^[A-Za-z0-9\s_]*[A-Za-z][A-Za-z0-9\s_]*$
Demo
In the demo, I have replaced \s with \t \r, because \s was matching across all lines.
Unlike the answers given by @revo and @wiktor, I don't have a fancy looking explanation to the regex. I am beautiful even without my makeup on. Honestly, if you don't understand the pattern I gave, you might want to review a good regex tutorial.
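For anyone who wants to check the pattern rather than take it on faith, here is a minimal Python sketch. The name LETTER_REQUIRED and the extra samples "_1" and "1a" are not from the question; re.ASCII keeps \s to ASCII whitespace.

```python
import re

# Any run of allowed characters, with at least one ASCII letter somewhere.
LETTER_REQUIRED = re.compile(
    r"^[A-Za-z0-9\s_]*[A-Za-z][A-Za-z0-9\s_]*$", re.ASCII
)


def has_letter_and_only_allowed(value: str) -> bool:
    """True when every character is allowed and at least one is a letter."""
    return LETTER_REQUIRED.match(value) is not None


for value in ["ameya", "Ameya_ 123", "1a", "123", "_", " ", "_1"]:
    print(repr(value), has_letter_and_only_allowed(value))
```

Because a letter is mandatory, inputs such as 123, _ and _1 are all rejected without needing any lookahead.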
This simple RegEx should do it:
[a-zA-Z]+[0-9_ ]*
One or more letters, followed by zero or more digits, underscores, and spaces.
This one should be good:
[\w\s_]*[a-zA-Z]+[\w\s_]*

Regex to stop parsing after semicolon is encountered

I am using this regex to parse URL from a semicolon separated string.
\b(?:https?:|http?:|www\.)\S+\b
It is working fine if my input text is in these formats:
"Google;\"https://google.com\""
//output - https://google.com
"Yahoo;\"www.yahoo.com\""
//output - www.yahoo.com
but in this case it gives incorrect string
"https://google.com;\"https://google.com\""
//output - https://google.com;\"https://google.com
how can I stop the parsing when I encounter the ';' ?
Looking at your examples, I would just match any URL between quotation marks. Something like this:
(?<=")(?:https?:|www\.)[^"]*
Or as others have said, split the input string by the semicolon character using string.Split, and check each string sequentially for your desired match.
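Both suggestions can be sketched in a few lines. The example is in Python rather than C#, the function names are invented, and the split version assumes every line has the shape name;"url" with exactly one semicolon before the quoted part, which is an assumption based on the three samples.

```python
import re

# Match a URL only when it starts right after a double quote.
QUOTED_URL = re.compile(r'(?<=")(?:https?:|www\.)[^"]*')

samples = [
    'Google;"https://google.com"',
    'Yahoo;"www.yahoo.com"',
    'https://google.com;"https://google.com"',
]


def quoted_urls(line: str) -> list:
    """Regex approach: URLs that directly follow a quotation mark."""
    return QUOTED_URL.findall(line)


def second_field(line: str) -> str:
    """Split approach: take the part after the first ';' and drop the quotes."""
    return line.split(";", 1)[1].strip('"')


for line in samples:
    print(quoted_urls(line), second_field(line))
```

For the third sample the regex skips the unquoted URL before the semicolon and returns only the quoted one.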
For your example data you might use a positive lookahead (?=) and a positive lookbehind (?<=)
(?<=")(?:https?:|www\.).+?(?=;?\\")
That would match
(?<=") Positive lookbehind to assert that what is on the left side is a double quote
(?:https?:|www\.) Match either http with an optional s or www.
.+? Match any character one or more times non greedy
(?=;?\\") Positive lookahead which asserts that what follows is an optional ; followed by \"
I would personally just modify the regex to look specifically for URLs and make the https:// protocol and the www prefix optional. Using \S+ can be kind of iffy because it grabs every non-whitespace character, whereas a URL only allows a limited set of characters.
Something like this should work great for your particular needs.
(https?:\/{2})?([w]{3}\.)?\w+\.[a-zA-Z]+
This makes the http protocol (with an optional s) optional, and when it is present it must be immediately followed by ://. The www. prefix is optional as well. Then the pattern grabs as many letters, digits, and underscores as possible up to the ., followed by the final run of letters that ends the match. You can exchange the [a-zA-Z] character set for an explicit set of domains if you'd prefer.

Use OR in Regex Expression

I have a regex to match the following:
somedomain.com/services/something
Basically I need to ensure that /services is present.
The regex I am using and which is working is:
\/services*
But I need to match /services OR /servicos. I tried the following:
(\/services|\/servicos)*
But this shows 24 matches?! https://regex101.com/r/jvB1lr/1
How to create this regex?
The (\/services|\/servicos)* matches 0+ occurrences of /services or /servicos, and that means it can match an empty string anywhere inside the input string.
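The empty matches are easy to reproduce. The following Python sketch (variable names are illustrative) collects every match of the starred group on the URL from the question and then repeats the search without the star. The exact number of empty matches depends on the input and on the engine, so only the non-empty ones are printed.

```python
import re

text = "somedomain.com/services/something"

# With *, the group may repeat zero times, so the pattern also matches "".
all_matches = [m.group(0) for m in re.finditer(r"(/services|/servicos)*", text)]
non_empty = [m for m in all_matches if m]
empty_count = len(all_matches) - len(non_empty)

# Without *, only real occurrences are reported.
without_star = re.findall(r"/(?:services|servicos)", text)

print(non_empty)         # ['/services']
print(empty_count > 0)   # True
print(without_star)      # ['/services']
```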
You can group the alternatives like /(services|servicos) and remove the * quantifier, but for this case, it is much better to use a character class [oe] as the strings only differ in 1 char.
You want to use the following pattern:
/servic[eo]s
To make sure you match a whole subpart, you may append (?:/|$) at the pattern end, /servic[eo]s(?:/|$).
In C#, you may use Regex.IsMatch with the pattern to see if there is a match in a string:
var isFound = Regex.IsMatch(s, @"/servic[eo]s(?:/|$)");
Note that you do not need to escape / in a .NET regex as it is not a special regex metacharacter.
Pattern details
/ - a /
servic[eo]s - services or servicos
(?:/|$) - / or end of string.
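The effect of the trailing (?:/|$) can be checked with a few inputs. This is a Python sketch of the same pattern, not the C# call itself; the function name and the URLs other than the one from the question are made up.

```python
import re

SERVICES = re.compile(r"/servic[eo]s(?:/|$)")


def has_services_segment(url: str) -> bool:
    """True when /services or /servicos appears as a whole path segment."""
    return SERVICES.search(url) is not None


for url in (
    "somedomain.com/services/something",
    "somedomain.com/servicos",
    "somedomain.com/servicesx",
    "somedomain.com/service",
):
    print(url, has_services_segment(url))
```

Without (?:/|$), the third URL would also match, because /services is a prefix of /servicesx.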
Well the * quantifier means zero or more, so that is the problem. Remove that and it should work fine:
(\/services|\/servicos)
Keep in mind that in your example, you have a typo in the URL so it will correctly not match anything as it stands.
Here is an example with the typo in the URL fixed, so it shows 1 match as expected.
First off, you specify C# in this post (really, .NET is the library that provides regex, not the language), but regex101 in your example is set to PHP. That gives you misleading information, such as needing to escape a forward slash / as \/, which is unnecessary in .NET regular expressions. The regex language is largely the same, but different tools behave differently, and PHP's regex is not like .NET's.
Secondly, the star * on the ( ) says that the group may match nothing at all, so you get an empty match at every position where neither word matches.
Thirdly one does not need to split the whole word. I would just extract the commonality in the words into a set [ ]. That will allow the "or-ness" you need to match on either services or servicos. Such as
(/servic[oe]s)
Will inform you if services are found or not. Nothing else is needed.

Multiple RegEx negation matching

I have the following RegEx patterns:
"[0-9]{4,5}\.FU|[0-9]{4,5}\.NG|[0-9]{4,5}\.SP|[0-9]{4,5}\.T|JGB[A-Z][0-9]|JNI[A-Z][0-9]|JN4F[A-Z][0-9]|JNM[A-Z][0-9]|JTI[A-Z][0-9]|JTM[A-Z][0-9]|NIY[A-Z][0-9]|SSI[A-Z][0-9]|JNI[A-Z][0-9]-[A-Z][0-9]|JTI[A-Z][0-9]-[A-Z][0-9]" ===> matches 8411.T or JNID8
"[0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9]" ==> matches 9345.HK or HCEIU9-A9
".*\.SI|SFC[A-Z][0-9]" ==> matches 8345.SI or SFCX8
How can I obtain a RegEx from the negation of these patterns?
I want to match strings that match none of these 3 patterns:
e.g. I want to match 8411.ABC, but not any of the aforementioned strings (8411.T, HCEIU9-A9, 8345.SI, etc.).
I've tried (just to exclude 2 and 3 for instance, but it doesn't work):
^(?!((.*\.SI|SFC[A-Z][0-9])|([0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9])))
The main idea here is to place the patterns into (?!.*<pattern>) negative lookaheads anchored at the start of the string (^). The difficulty here is that your patterns contain unanchored alternations, and if they are not grouped, the .* before the patterns will only refer to the first alternative (i.e. all the subsequent alternatives will only be negated at the start of the string).
Thus, your pattern formula is ^(?!.*(?:<PATTERN1>))(?!.*(?:<PATTERN2>))(?!.*(?:<PATTERN3>)). Note that .+ or .* at the end is optional if you need to just get a boolean result. Note that in the last pattern, you need to remove the .* in the first alternative, it won't make sense to use .*.*.
Use
^(?!.*(?:[0-9]{4,5}\.FU|[0-9]{4,5}\.NG|[0-9]{4,5}\.SP|[0-9]{4,5}\.T|JGB[A-Z][0-9]|JNI[A-Z][0-9]|JN4F[A-Z][0-9]|JNM[A-Z][0-9]|JTI[A-Z][0-9]|JTM[A-Z][0-9]|NIY[A-Z][0-9]|SSI[A-Z][0-9]|JNI[A-Z][0-9]-[A-Z][0-9]|JTI[A-Z][0-9]-[A-Z][0-9]))(?!.*(?:[0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9]))(?!.*(?:\.SI|SFC[A-Z][0-9])).+
You may also contract the formula to ^(?!.*(?:<PATTERN1>|<PATTERN2>|<PATTERN3>)):
^(?!.*(?:[0-9]{4,5}\.FU|[0-9]{4,5}\.NG|[0-9]{4,5}\.SP|[0-9]{4,5}\.T|JGB[A-Z][0-9]|JNI[A-Z][0-9]|JN4F[A-Z][0-9]|JNM[A-Z][0-9]|JTI[A-Z][0-9]|JTM[A-Z][0-9]|NIY[A-Z][0-9]|SSI[A-Z][0-9]|JNI[A-Z][0-9]-[A-Z][0-9]|JTI[A-Z][0-9]-[A-Z][0-9]|[0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]|HCEI[A-Z][0-9]-[A-Z][0-9]|\.SI|SFC[A-Z][0-9])).+
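Since the combined pattern is long, it may be easier to build it from the three original patterns than to paste it by hand. The sketch below does that in Python; build_negation is a hypothetical helper, each pattern is wrapped in its own non-capturing group before joining, and the leading .* of the third pattern is dropped as described above.

```python
import re

# The three patterns from the question (third one without its leading .*).
PATTERNS = [
    r"[0-9]{4,5}\.FU|[0-9]{4,5}\.NG|[0-9]{4,5}\.SP|[0-9]{4,5}\.T"
    r"|JGB[A-Z][0-9]|JNI[A-Z][0-9]|JN4F[A-Z][0-9]|JNM[A-Z][0-9]"
    r"|JTI[A-Z][0-9]|JTM[A-Z][0-9]|NIY[A-Z][0-9]|SSI[A-Z][0-9]"
    r"|JNI[A-Z][0-9]-[A-Z][0-9]|JTI[A-Z][0-9]-[A-Z][0-9]",
    r"[0-9]{4,5}\.HK|HSI[A-Z][0-9]|HMH[A-Z][0-9]|HCEI[A-Z][0-9]"
    r"|HCEI[A-Z][0-9]-[A-Z][0-9]",
    r"\.SI|SFC[A-Z][0-9]",
]


def build_negation(patterns):
    """Compile ^(?!.*(?:P1|P2|...)).+ with every pattern grouped."""
    joined = "|".join("(?:" + p + ")" for p in patterns)
    return re.compile("^(?!.*(?:" + joined + ")).+")


NEGATION = build_negation(PATTERNS)

for value in ["8411.ABC", "8411.T", "JNID8", "9345.HK", "HCEIU9-A9", "8345.SI", "SFCX8"]:
    print(value, NEGATION.match(value) is not None)
```

Only 8411.ABC is reported as a match; every string covered by one of the three patterns is rejected.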
