Excluding certain patterns in a regex - c#

I'm working on a Regex in C# to exclude certain patterns within a string.
These are the types patterns I want to accept are: "%00" (Hex 00-FF) and any other character without a starting '%'. The patterns I would like to exclude are: "%0" (Values with a starting % and one character after) and/or characters "&<>'/".
So far I have this
Regex correctStringRegex = new Regex(#"(%[0-9a-fA-F]{2})|[^%&<>'/]|(^(%.))",
RegexOptions.IgnoreCase);
Below are examples of what I'm trying to pass and reject.
Passing String %02This is%0A%0Da string%03
Reject String %0%0Z%A&<%0a%
If a string doesn't pass all the requirements I would like to reject the whole string completely.
Any Help would be greatly appreciated!

I suggest this:
^(?:%[0-9a-f]{2}|[^%&<>'/])*$
Explanation:
^ # Start of string
(?: # Match either
%[0-9a-f]{2} # %xx
| # or
[^%&<>'/] # any character except the forbidden ones
)* # any number of times
$ # until end of string.
This ensures that % is only matched when followed by two hexadecimals. Since you're already compiling the regex with the IgnoreCase flag set, you don't need a-fA-F, either.

Hmm, given the comments so far, I think you need a different problem definition. You want to pass or fail a string, using regex, based on whether or not the string contains any invalid patterns. Im assuming a string will fail if there is ANY invalid pattern, rather than the reverse of a string passing if there is any valid pattern.
As such, I would use this regex: %(?![0-9a-f]{2})|[&<>'/]
You would then run this in such a way that a string is invalid if you GET a match, a valid string will not have any matches in this set.
A quick explanation of a rather odd regex. The format (?!) tells the regex "Match the previous symbol if the symbols in this set DONT follow it" ie: Match if suffix not present. So, what im telling it to look for is any instance of % that is not followed by 2 hex characters, or any other invalid character. The assumption is that anything that DOESN'T match this regex is a valid character entry.

Related

Regular expression in RegularExpressionAttribute behavior

I am using this regular expression: #"[ \]\[;\/\\\?:*""<>|+=]|^[.]|[.]$"
First part [ \]\[;\/\\\?:*""<>|+=] should match any of the characters inside the brackets.
Next part ^[.] should match if the string starts with a 'dot'
Last part [.]$ should match if the string ends with a 'dot'
This works perfectly fine if I use Regex.IsMatch() function. However if I use RegularExpressionAttribute in ASP.NET MVC, I always get invalid model. Does anyone have any clue why this behavior occurs?
Examples:
"abcdefg" should not match
".abcdefg" should match
"abc.defg" should not match
"abcdefg." should match
"abc[defg" should match
Thanks in advance!
EDIT:
The RegularExpressionAttribute Specifies that a data field value in ASP.NET Dynamic Data must match the specified regular expression..
Which means. I need the "abcdef" to match, and ".abcdefg" to not match. Basically negate the whole expression I have above.
You need to make sure the pattern matches the entire string.
In a general case, you may append/prepend the pattern with .*.
Here, you may use
.*[ \][;/\\?:*"<>|+=].*|^[.].*|.*[.]$
Or, to make it a bit more efficient (that is, to reduce backtracking in the first branch) a negated character class will perform better:
[^ \][;/\\?:*"<>|+=]*[ \][;\/\\?:*"<>|+=].*|^[.].*|.*[.]$
But it is best to put the branches matching text at the start/end of the string as first branches:
^[.].*|.*[.]$|[^ \][;/\\?:*"<>|+=]*[ \][;/\\?:*"<>|+=].*
NOTE: You do not have to escape / and ? chars inside the .NET regex since you can't use regex delimiters there.
C# declaration of the last pattern will look like
#"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*"
See this .NET regex demo.
RegularExpressionAttrubute:
[RegularExpression(
#"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*",
ErrorMessage = "Username cannot contain following characters: ] [ ; / \\ ? : * \" < > | + =")
]
Your regex is an alternation which matches 1 character out of 3 character classes, the first consisting of more than 1 characters, the second a dot at the start of the string and the third a dot at the end of the string.
It works fine because it does match one of the alternations, only not the whole string you want to match.
You could use 3 alternations where the first matches a dot followed by repeating the character class until the end of the string, the second the other way around but this time the dot is at the end of the string.
Or the third using a positive lookahead asserting that the string contains at least one of the characters [\][;\/\\?:*"<>|+=]
^\.[a-z \][;\/\\?:*"<>|+=]+$|^[a-z \][;\/\\?:*"<>|+=]+\.$|^(?=.*[\][;\/\\?:*"<>|+=])[a-z \][;\/\\?:*"<>|+=]+$
Regex demo

Use OR in Regex Expression

I have a regex to match the following:
somedomain.com/services/something
Basically I need to ensure that /services is present.
The regex I am using and which is working is:
\/services*
But I need to match /services OR /servicos. I tried the following:
(\/services|\/servicos)*
But this shows 24 matches?! https://regex101.com/r/jvB1lr/1
How to create this regex?
The (\/services|\/servicos)* matches 0+ occurrences of /services or /servicos, and that means it can match an empty string anywhere inside the input string.
You can group the alternatives like /(services|servicos) and remove the * quantifier, but for this case, it is much better to use a character class [oe] as the strings only differ in 1 char.
You want to use the following pattern:
/servic[eo]s
See the regex demo
To make sure you match a whole subpart, you may append (?:/|$) at the pattern end, /servic[eo]s(?:/|$).
In C#, you may use Regex.IsMatch with the pattern to see if there is a match in a string:
var isFound = Regex.IsMatch(s, #"/servic[eo]s(?:/|$)");
Note that you do not need to escape / in a .NET regex as it is not a special regex metacharacter.
Pattern details
/ - a /
servic[eo]s - services or servicos
(?:/|$) - / or end of string.
Well the * quantifier means zero or more, so that is the problem. Remove that and it should work fine:
(\/services|\/servicos)
Keep in mind that in your example, you have a typo in the URL so it will correctly not match anything as it stands.
Here is an example with the typo in the URL fixed, so it shows 1 match as expected.
First off you specify C# (really .Net is the library which holds regex not the language) in this post but regex101 in your example is set to PHP. That is providing you with invalid information such as needed to escape a forward slash / with \/ which is unnecessary in .Net regular expressions. The regex language is the same but there are different tools which behave differently and php is not like .Net regex.
Secondly the star * on the ( ) is saying that there may be nothing in the parenthesis and your match is getting null nothing matches on every word.
Thirdly one does not need to split the whole word. I would just extract the commonality in the words into a set [ ]. That will allow the "or-ness" you need to match on either services or servicos. Such as
(/servic[oe]s)
Will inform you if services are found or not. Nothing else is needed.

Why is this regex not allowing this text?

I have a username validator IsValidUsername, and I am testing "baconman" but it is failing, could someone please help me out with this regex?
if(!Regex.IsMatch(str, #"^[a-zA-Z]\\w+|[0-9][0-9_]*[a-zA-Z]+\\w*$")) {
isValid = false;
}
I want the restrictions to be: (It's very close)
Be between 5 & 17 characters long
contain at least one letter
no spaces
no special characters
You're escaping unnecessarily: if you write your regex as starting with # outside the string, you don't need both \ - just one is fine.
Either:
#"\w"
or
"\\w"
Edit: I didn't make this clear: right now due to the double escaping, you're looking for a \ in your regex and a w. So your match would need [some character]\w to match (example: "a\w" or "a\wwwwww" would match.
Your requirements are best taken care of in normal C#. They don't map well to a regular expression. Just code them up using LINQ which works on strings like it would on an IEnumerable<char>.
Also, understanding a query of a string is much easier than understanding a Regex with the requirements that you have.
It is possible to do everything as part of a Regex, however it is not pretty :-)
^(\w(?=\w*[a-zA-Z])|[a-zA-Z]|\w(?<=[a-zA-Z]\w*)){5,17}$
It does 3 checks that always results in 1 character being matched (so we can perform the length check in the end)
Either the character is any word character \w which is before [a-zA-Z]
Or it is [a-zA-Z]
Or it is any word character \w which is after [a-zA-Z]

How to check for special characters using regex

NET. I have created a regex validator to check for special characters means I donot want any special characters in username. The following is the code
Regex objAlphaPattern = new Regex(#"[a-zA-Z0-9_#.-]");
bool sts = objAlphaPattern.IsMatch(username);
If I provide username as $%^&asghf then the validator gives as invalid data format which is the result I want but If I provide a data s23_#.-^&()%^$# then my validator should block the data but my validator allows the data which is wrong
So how to not allow any special characters except a-z A-A 0-9 _ # .-
Thanks
Sunil Kumar Sahoo
There's a few things wrong with your expression. First you don't have the start string character ^ and end string character $ at the beginning and end of your expression meaning that it only has to find a match somewhere within your string.
Second, you're only looking for one character currently. To force a match of all the characters you'll need to use * Here's what it should be:
Regex objAlphaPattern = new Regex(#"^[a-zA-Z0-9_#.-]*$");
bool sts = objAlphaPattern.IsMatch(username);
Your pattern checks only if the given string contains any "non-special" character; it does not exclude the unwanted characters. You want to change two things; make it check that the whole string contains only allowed characters, and also make it check for more than one character:
^[a-zA-Z0-9_#.-]+$
Added ^ before the pattern to make it start matching at the beginning of the string. Also added +$ after, + to ensure that there is at least one character in the string, and $ to make sure that the string is matched to the end.
Change your regex to ^[a-zA-Z0-9_#.-]+$. Here ^ denotes the beginning of a string, $ is the end of the string.

why do these regex tests let certain characters pass?

I am checking a string with the following regexes:
[a-zA-Z0-9]+
[A-Za-z]+
For some reason, the characters:
.
-
_
are allowed to pass, why is that?
If you want to check that the complete string consists of only the wanted characters you need to anchor your regex like follows:
^[a-zA-Z0-9]+$
Otherwise every string will pass that contains a string of the allowed characters somewhere. The anchors essentially tell the regular expression engine to start looking for those characters at the start of the string and stop looking at the end of the string.
To clarify: If you just use [a-zA-Z0-9]+ as your regex, then the regex engine would rightfully reject the string -__-- as the regex doesn't match against that. There is no single character from the character class you defined.
However, with the string a-b it's different. The regular expression engine will match the first a here since that matches the expression you entered (at least one of the given characters) and won't care about the - or the b. It has done its job and successfully matched a substring according to your regular expression.
Similarly with _-abcdef- – the regex will match the substring abcdef just fine, because you didn't tell it to match only at the start or end of the string; and ignore the other characters.
So when using ^[a-zA-Z0-9]+$ as your regex you are telling the regex engine definitely that you are looking for one or more letters or digits, starting at the very beginning of the string right until the end of the string. There is no room for other characters to squeeze in or hide so this will do what you apparently want. But without the anchors, the match can be anywhere in your search string. For validation purposes you always want to use those anchors.
In regular expressions the + tells the engine to match one or more characters.
So this expression [A-Za-z]+ passes if the string contains a sequence of 1 or more alphabetic characters. The only strings that wouldn't pass are strings that contain no alphabetic characters at all.
The ^ symbol anchors the character class to the beginning of the string and the $ symbol anchors to the end of the string.
So ^[A-Za-z0-9]+ means 'match a string that begins with a sequence of one or more alphanumeric characters'. But would allow strings that include non-alphanumerics so long as those characters were not at the beginning of the string.
While ^[A-Za-z0-9]+$ means 'match a string that begins and ends with a sequence of one or more alphanumeric characters'. This is the only way to completely exclude non-alphanumerics from a string.

Categories