Regular expression and \n EOL on the end of tested string [duplicate] - c#

The following returns true
Regex.IsMatch("FooBar\n", "^([A-Z]([a-z][A-Z]?)+)$");
so does
Regex.IsMatch("FooBar\n", "^[A-Z]([a-z][A-Z]?)+$");
The RegEx is in SingleLine mode by default, so $ should not match \n. \n is not an allowed character.
This is to match a single ASCII PascalCaseWord (yes, it will match a trailing Cap)
Doesn't work with any combinations of RegexOptions.Multiline | RegexOptions.Singleline
What am I doing wrong?

In .NET regex, the $ anchor (as in Python, PCRE, and Perl, but not JavaScript) matches at the end of the string, or at the position before the final newline ("\n") character in the string.
See this documentation:
$ The match must occur at the end of the string or line, or before \n at the end of the string or line. For more information, see End of String or Line.
No modifier can redefine this in .NET regex (in PCRE, you can use the D modifier, PCRE_DOLLAR_ENDONLY).
You are probably looking for the \z anchor: it matches only at the very end of the string:
\z The match must occur at the end of the string only. For more information, see End of String Only.
A short test in C#:
Console.WriteLine(Regex.IsMatch("FooBar\n", @"^[A-Z]([a-z][A-Z]?)+$")); // => True
Console.WriteLine(Regex.IsMatch("FooBar\n", @"^[A-Z]([a-z][A-Z]?)+\z")); // => False
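The same idea can be wrapped in a small helper. This is only a minimal sketch: the class and method names (PascalCaseCheck, IsPascalCaseWord) are made up for illustration, and \A is used in place of ^ simply to make the start anchor equally explicit.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PascalCaseCheck
{
    // \A and \z anchor to the absolute start and end of the input,
    // so a trailing "\n" is no longer tolerated the way it is with $.
    private static readonly Regex Strict =
        new Regex(@"\A[A-Z]([a-z][A-Z]?)+\z");

    // Hypothetical helper name; returns true only for a bare PascalCase word.
    public static bool IsPascalCaseWord(string input) => Strict.IsMatch(input);
}
```

With this helper, "FooBar" is accepted and "FooBar\n" is rejected.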

From wikipedia:
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
So you are asking whether there is a capital letter at the start of the string, followed by one or more repetitions of a lowercase letter with an optional capital after it, followed by the end of the string or the position just before the final newline.
That all seems true.
And yes, there seems to be some mismatch between different documentation sources about what is regarded as newline and how $ works or should work exactly. It always brings to mind the wisdom:
Sometimes a man has a problem and he figures he will use a regex to solve it.
Now the man has two problems.

Related

Regex for alpha number string in c# accepting underscore and white spaces

I have already gone through many posts on SO but didn't find what I need for my specific scenario.
I need a regex for an alphanumeric string
where the following conditions should be matched
Valid string:
ameya123 (alphabets and numbers)
ameya (only alphabets)
AMeya12 (capital and lowercase letters and numbers)
Ameya_123 (alphabets and underscore and numbers)
Ameya_ 123 (alphabets, underscore and white spaces)
Invalid string:
123 (only numbers)
_ (only underscore)
(only space) (only white spaces)
any special character other than underscore
What I have tried so far:
(?=.*[a-zA-Z])(?=.*[0-9]*[\s]*[_]*)
The above regex works in an online regex editor, but it does not work in a data annotation in C#.
Please suggest.
Based on your requirements and not your attempt, what you are in need of is this:
^(?!(?:\d+|_+| +)$)[\w ]+$
The negative lookahead looks for undesired matches to fail the whole process. Those are strings containing digits only, underscores only or spaces only. If they never happen we want to have a match for ^[\w ]+$ which is nearly the same as ^[a-zA-Z0-9_ ]+$.
See live demo here
Explanation:
^ Start of line / string
(?! Start of negative lookahead
(?: Start of non-capturing group
\d+ Match digits
| Or
_+ Match underscores
| Or
[ ]+ Match spaces
)$ End of non-capturing group immediately followed by end of line / string (none of previous matches should be found)
) End of negative lookahead
[\w ]+$ Match a character inside the character set up to end of input string
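Note: in PCRE (used by the demo), \w is a shorthand for [a-zA-Z0-9_] unless the u modifier is set. In .NET, \w also matches Unicode letters and digits unless RegexOptions.ECMAScript is used.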
One problem with your regex is that in annotations, the regex must match and consume the entire string input, while your pattern only contains lookarounds that do not consume any text.
You may use
^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$
See the regex demo. Note that \w (when used for a server-side validation, and thus parsed with the .NET regex engine) will also allow any Unicode letters, digits and some more stuff when validating on the server side, so I'd rather stick to [A-Za-z0-9_] to be consistent with both server- and client-side validation.
Details
^ - start of string (not necessary here, but good to have when debugging)
(?!\d+$) - a negative lookahead that fails the match if the whole string consists of digits
(?![_\s]+$) - a negative lookahead that fails the match if the whole string consists of underscores and/or whitespaces. NOTE: if you plan to only disallow ____ or " " like inputs, you need to split this lookahead into (?!_+$) and (?!\s+$)
[A-Za-z0-9\s_]+ - 1+ ASCII letters, digits, _ and whitespace chars
$ - end of string (not necessary here, but still good to have).
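To check the pattern against the sample inputs from the question, something like the following could be used. It is a rough sketch: NameValidator and IsValid are hypothetical names, not part of any data-annotation API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameValidator
{
    // Same pattern as above: reject digit-only input and input made up
    // only of underscores/whitespace, then allow ASCII letters, digits,
    // underscores and whitespace.
    private static readonly Regex Pattern =
        new Regex(@"^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$");

    // Hypothetical helper for illustration.
    public static bool IsValid(string input) => Pattern.IsMatch(input);
}
```

The valid samples (ameya123, ameya, AMeya12, Ameya_123, "Ameya_ 123") pass, while 123, _, a single space, and anything containing a character such as $ are rejected.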
If I understand your requirements correctly, you need to match one or more letters (uppercase or lowercase), and possibly zero or more of digits, whitespace, or underscore. This implies the following pattern:
^[A-Za-z0-9\s_]*[A-Za-z][A-Za-z0-9\s_]*$
Demo
In the demo, I have replaced \s with \t \r, because \s was matching across all lines.
Unlike the answers given by @revo and @wiktor, I don't have a fancy looking explanation to the regex. I am beautiful even without my makeup on. Honestly, if you don't understand the pattern I gave, you might want to review a good regex tutorial.
This simple RegEx should do it:
[a-zA-Z]+[0-9_ ]*
One or more Alphabet, followed by zero or more numbers, underscore and Space.
This one should be good:
[\w\s_]*[a-zA-Z]+[\w\s_]*

.Net regex matching $ with the end of the string and not of line, even with multiline enabled

I'm trying to highlight markdown code, but am running into this weird behavior of the .NET regex multiline option.
The following expression: ^(#+).+$ works fine on any online regex testing tool:
But it refuses to work with .net:
It doesn't seem to take into account the $ tag, and just highlights everything until the end of the string, no matter what. This is my C#
RegExpression = new Regex(@"^(#+).+$", RegexOptions.Multiline);
What am I missing?
It is clear your text contains a linebreak other than LF. In .NET regex, a dot matches any char but LF (a newline char, \n).
See Multiline Mode MSDN regex reference
By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.
So, use
@"^(#+).+?\r?$"
The .+?\r?$ will match lazily any one or more chars other than LF up to the first CR (that is optional) right before a newline.
Or just use a negated character class:
@"^(#+)[^\r\n]+"
The [^\r\n]+ will match one or more chars other than CR/LF.
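A quick way to see the negated character class at work on CRLF text is sketched below. HeadingFinder and FindHeadings are illustrative names only, and the sample input is an assumption about what the markdown looks like.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeadingFinder
{
    // With Multiline, ^ matches at the start of every line; [^\r\n]+ then
    // stops before any CR or LF, so the match never runs past the line.
    private static readonly Regex Heading =
        new Regex(@"^(#+)[^\r\n]+", RegexOptions.Multiline);

    // Hypothetical helper: returns the text of each heading line found.
    public static string[] FindHeadings(string markdown) =>
        Heading.Matches(markdown).Cast<Match>().Select(m => m.Value).ToArray();
}
```

For the input "# Title\r\nbody text\r\n## Sub\r\n" this yields the two values "# Title" and "## Sub", with no trailing CR in either.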
What you have is good. The only thing you're missing is that . doesn't match newline characters, even with the multiline option. You can get around this in two different ways.
The easiest is to use the RegexOptions.Singleline flag, which causes newlines to be treated as ordinary characters. That way, ^ still matches the start of the string, $ matches the end of the string and . matches everything including newlines.
The other way to fix this (although I wouldn't recommend it for your use case) is to modify your regex to explicitly allow newlines. To do this you can just replace any . with (?:.|\n), which means either any character or a newline. For your example, you would end up with ^(#+)(?:.|\n)+$. If you want to ensure that there's a non-linebreak character first, add an extra dot: ^(#+).(?:.|\n)+$

Regex to match comma separated string with no comma at the end of the line

I am trying to write a regex that will allow input of all characters on the keyboard (even space) but will restrict the input of a comma at the end of the line. I have tried this, which includes all the possible characters, but it still does not give me the correct output:
[RegularExpression("^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+$", ErrorMessage = "Comma is not allowed at the end of {0} ")]
^.*[^,]$
.* matches any characters, so the long character class is not needed.
^([a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+,)*[a-zA-Z0-9\t\n ./<>?;:\"'!@#$%^&*()[]{}_+=|\\-]+(?<!,)$
^^
Just add lookbehind at the end.
a regex that will allow input of all characters on the keyboard(even space) but will restrict the input of comma at the end of the line.
Mind that you can type much more than what you typed using a keyboard. Basically, you want to allow any character but a comma at the end of the line.
So,
(?!,).(?=\r\n|\z)
This regex is checking each line (because of the (?=\r\n|\z) look-ahead), and the (?!,) look-ahead makes sure the last character (that we match using .) is not a comma. \z is an unambiguous string end anchor.
See regex demo
This will work even on a client side.
To also get the full line match, you can just add .* at the beginning of the pattern (as we are not using singleline flag, . does not match newline symbols):
.*(?!,).(?=\r\n|\z)
Or (making it faster with an atomic group or an inline multiline option with ^ start of line anchor, but will not work on the client side)
(?>.*)(?!,).(?=\r\n|\z)
(?m)^.*?(?!,).(?=\r\n|\z) // The fastest of the last three
See demo
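As a rough illustration of the shortest pattern above, the sketch below wraps it in a helper. TrailingCommaCheck and LastCharIsNotComma are invented names, and the helper is meant for single-line input: on multi-line text, IsMatch returns true as soon as any one line ends without a comma.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingCommaCheck
{
    // (?!,).        one character that is not a comma...
    // (?=\r\n|\z)   ...sitting right before a CRLF or the very end of input.
    private static readonly Regex LineEndsWithoutComma =
        new Regex(@"(?!,).(?=\r\n|\z)");

    // Hypothetical helper; intended for single-line input.
    public static bool LastCharIsNotComma(string input) =>
        LineEndsWithoutComma.IsMatch(input);
}
```

Here "a,b,c" is accepted, while "a,b," and "," are rejected. An empty string is rejected too, because the pattern needs at least one character to match.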

Regular expression oddity

I have a regular expression to check for valid identifiers in a script language. These start with a letter or underscore, and can be followed by 0 or more letters, underscores, digits and $ symbols. However, if I call
Util.IsValidIdentifier( "hello\n" );
it returns true. My regex is
const string IDENTIFIER_REGEX = @"^[A-Za-z_][A-Za-z0-9_\$]*$";
so how does the "\n" get through?
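The $ matches at the end of the string or just before a trailing \n (and, with RegexOptions.Multiline, at the end of every line). You need to use \z to match only at the very end of the text; it is not affected by RegexOptions.Multiline. You might also want to use \A instead of ^ to match the beginning of the text, not of the line.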
Also, you don't need to escape the $ in the character class.
Because $ is a valid metacharacter which means the end of the string (or the end of the line, just before the newline). From msdn:
$: The match must occur at the end of the string or before \n at the end of the line or string.
You should escape it: \$ (and add \z if you want to match the end of the string there).
Your result is true with hello\n because you don't need to escape the $ inside a character class, thus the backslash is matched because you have a backslash (seen as literal) inside the character class.
Try this:
const string IDENTIFIER_REGEX = @"^[A-Za-z_][A-Za-z0-9_$]*$";
Since you are testing variable names that are in one line, you can use $ as end of the string.
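If the trailing newline must be rejected, the \A and \z suggestion from the first answer could look roughly like this. It is a sketch under assumptions: IdentifierCheck is a stand-in name, since the body of the asker's Util.IsValidIdentifier is not shown.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdentifierCheck
{
    // \A and \z match only at the absolute start and end of the input,
    // so "hello\n" no longer slips through. The $ inside the character
    // class is a literal dollar sign and needs no escaping.
    private static readonly Regex Identifier =
        new Regex(@"\A[A-Za-z_][A-Za-z0-9_$]*\z");

    // Hypothetical stand-in for the asker's Util.IsValidIdentifier.
    public static bool IsValidIdentifier(string name) => Identifier.IsMatch(name);
}
```

With this version "hello" and "_a$1" are valid, while "hello\n" and "1abc" are not.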

Remove non-alphanumeric characters from start and end of string only

I am trying to clean up some data using a helper exe (C#).
I iterate through each string and I want to remove invalid characters from the start and end of the string i.e. remove the dollar symbols from $$$helloworld$$$.
This works fine using this regular expression: \W.
However, strings which contain invalid character in the middle should be left alone i.e. hello$$$$world is fine and my regular expression should not match this particular string.
So in essence, I am trying to figure out the syntax to match invalid characters at the start and the end of a string, but leave the strings which contain invalid characters in their body.
Thanks for your help!
This does it!
(^[\W_]*)|([\W_]*$)
This regex says match zero or more non word characters at the start(^) or(|) at the end($)
The following should work:
^\W+|\W+$
^ and $ are anchors to the beginning and end of the string respectively. The | in the middle is an OR, so this regex means "either match one or more non-word characters at the start of the string, or match one or more non-word characters at the end of the string".
Use ^ to match the start of string, and $ to match the end of string. C# Regex Cheat Sheet
Try this one,
(^[^\w]*)|([^\w]*$)
Use ^ to match 'beginning of line' and $ to match 'end of line', i.e. your code should match and remove ^\W* and \W*$
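Putting the ^\W+|\W+$ suggestion to work with Regex.Replace might look like the sketch below. EdgeTrimmer and TrimInvalid are illustrative names. Note that \W treats the underscore as a word character, so swap in [\W_] if underscores should be stripped as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EdgeTrimmer
{
    // Matches a run of non-word characters at the start of the string,
    // or a run of non-word characters at the end; runs in the middle
    // are not anchored to either end and are left alone.
    private static readonly Regex Edges = new Regex(@"^\W+|\W+$");

    // Hypothetical helper: removes invalid characters from both ends only.
    public static string TrimInvalid(string input) => Edges.Replace(input, "");
}
```

For example, "$$$helloworld$$$" becomes "helloworld", while "hello$$$$world" comes back unchanged.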
