Regular expression oddity - c#

I have a regular expression to check for valid identifiers in a script language. These start with a letter or underscore, and can be followed by 0 or more letters, underscores, digits and $ symbols. However, if I call
Util.IsValidIdentifier( "hello\n" );
it returns true. My regex is
const string IDENTIFIER_REGEX = @"^[A-Za-z_][A-Za-z0-9_\$]*$";
so how does the "\n" get through?

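By default, $ matches at the end of the string or just before a final newline; with RegexOptions.Multiline it matches at the end of every line. Use \z to match only at the very end of the text; no RegexOptions are needed for that. You might also want to use \A instead of ^ so the start anchor keeps meaning the beginning of the text, not of a line, even if RegexOptions.Multiline is set.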
Also, you don't need to escape the $ in the character class.
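A minimal sketch of the difference, written as C# top-level statements. The local function IsValidIdentifier is a hypothetical stand-in for the question's Util.IsValidIdentifier (its real body is not shown in the question), and no RegexOptions are passed:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for Util.IsValidIdentifier, using \A and \z
// so that the match must span the entire string.
bool IsValidIdentifier(string s) =>
    Regex.IsMatch(s, @"\A[A-Za-z_][A-Za-z0-9_$]*\z");

// The original pattern: $ also matches just before a final "\n".
Console.WriteLine(Regex.IsMatch("hello\n", @"^[A-Za-z_][A-Za-z0-9_\$]*$")); // True

// With \z the trailing newline is rejected.
Console.WriteLine(IsValidIdentifier("hello\n"));  // False
Console.WriteLine(IsValidIdentifier("hello$1"));  // True
```

The unescaped $ inside the character class still matches a literal dollar sign, which is why "hello$1" passes.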

Because $ is a valid metacharacter which means the end of the string (or the end of the line, just before the newline). From msdn:
$: The match must occur at the end of the string or before \n at the end of the line or string.
You should escape it: \$ (and add \z if you want to match the end of the string there).

You don't need to escape the $ inside a character class: there, \$ and $ both match a literal dollar sign, so the backslash is only noise. That is not what lets "hello\n" through, though. The trailing $ anchor also matches just before a final newline.
Try this:
const string IDENTIFIER_REGEX = @"^[A-Za-z_][A-Za-z0-9_$]*\z";
Since you are testing variable names that are on one line, \z makes sure the match ends exactly at the end of the string.

Related

Regular expression and \n EOL on the end of tested string [duplicate]

The following returns true
Regex.IsMatch("FooBar\n", "^([A-Z]([a-z][A-Z]?)+)$");
so does
Regex.IsMatch("FooBar\n", "^[A-Z]([a-z][A-Z]?)+$");
The RegEx is in SingleLine mode by default, so $ should not match \n. \n is not an allowed character.
This is to match a single ASCII PascalCaseWord (yes, it will match a trailing Cap)
Doesn't work with any combinations of RegexOptions.Multiline | RegexOptions.Singleline
What am I doing wrong?
In .NET regex, the $ anchor (as in PCRE, Python, and Perl, but not JavaScript) matches the end of the string, or the position before the final newline ("\n") character in the string.
See this documentation:
$ The match must occur at the end of the string or line, or before \n at the end of the string or line. For more information, see End of String or Line.
No modifier can redefine this in .NET regex (in PCRE, you can use the D modifier, PCRE_DOLLAR_ENDONLY).
You must be looking for \z anchor: it matches only at the very end of the string:
\z The match must occur at the end of the string only. For more information, see End of String Only.
A short test in C#:
Console.WriteLine(Regex.IsMatch("FooBar\n", @"^[A-Z]([a-z][A-Z]?)+$")); // => True
Console.WriteLine(Regex.IsMatch("FooBar\n", @"^[A-Z]([a-z][A-Z]?)+\z")); // => False
From wikipedia:
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
So you are asking whether there is a capital letter at the start of the string, followed by one or more repetitions of (a lowercase letter, then an optional capital letter), followed by the end of the string or the position just before the newline.
That all seems true.
And yes, there seems to be some mismatch between different documentation sources about what is regarded as newline and how $ works or should work exactly. It always brings to mind the wisdom:
Sometimes a man has a problem and he figures he will use a regex to solve it.
Now the man has two problems.

Regex to replace single quote by ignoring "already replaced single quote" and "beginning/ ending single quote"

I have a string like this:
var path = "'Ah'This is a 'sample\'e'";
In the above string, the beginning and ending single quotes (just inside the double quotes) are as expected,
i.e. "'...............'";
In the rest of the string there are single quotes, some already escaped (i.e. \') and some not. I need to escape every single quote that is not yet escaped; if a quote is already escaped, no action is needed. I am having a hard time finding a suitable regex for this.
After replacing, the string must look like this (please note that the beginning and ending single quotes must not be replaced):
"'Ah\'This is a \'sample\'e'";
Could someone please help?
You may use
s = Regex.Replace(s, @"(?<!\\)(?!^)'(?!$)", @"\'");
Details
(?<!\\) - a negative lookbehind that matches a location in string that is not immediately preceded with \
(?!^) - a negative lookahead that matches a location in string that is not immediately followed with start of string (it is just failing the match if the current position is the start of string)
' - a ' char
(?!$) - a negative lookahead that matches a location in string that is not immediately followed with the end of string (it is failing the match if the current position is the end of string).
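As a quick check, the replacement can be run against the sample string from the question. This is a sketch in C# top-level statements; the local function name EscapeInnerQuotes is made up for illustration. Note that a backslash has no special meaning in a .NET replacement string, so @"\'" inserts a literal backslash followed by a quote:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: escape every single quote that is not already
// escaped and is neither the first nor the last character of the string.
string EscapeInnerQuotes(string s) =>
    Regex.Replace(s, @"(?<!\\)(?!^)'(?!$)", @"\'");

// Verbatim string, so the backslash before the quote is a real character.
string input = @"'Ah'This is a 'sample\'e'";

Console.WriteLine(EscapeInnerQuotes(input)); // 'Ah\'This is a \'sample\'e'
```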

Remove non-alphanumeric characters from start and end of string only

I am trying to clean up some data using a helper exe (C#).
I iterate through each string and I want to remove invalid characters from the start and end of the string i.e. remove the dollar symbols from $$$helloworld$$$.
This works fine using this regular expression: \W.
However, strings which contain invalid character in the middle should be left alone i.e. hello$$$$world is fine and my regular expression should not match this particular string.
So in essence, I am trying to figure out the syntax to match invalid characters at the start and the end of a string, but leave alone the strings which contain invalid characters in their body.
Thanks for your help!
This does it!
(^[\W_]*)|([\W_]*$)
This regex says match zero or more non word characters at the start(^) or(|) at the end($)
The following should work:
^\W+|\W+$
^ and $ are anchors to the beginning and end of the string respectively. The | in the middle is an OR, so this regex means "either match one or more non-word characters at the start of the string, or match one or more non-word characters at the end of the string".
Use ^ to match the start of string, and $ to match the end of string. C# Regex Cheat Sheet
Try this one,
(^[^\w]*)|([^\w]*$)
Use ^ to match 'beginning of line' and $ to match 'end of line', i.e. your code should match and remove ^\W* and \W*$
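A short sketch of how the suggested patterns might be applied with Regex.Replace, written as C# top-level statements. The helper names TrimNonWord and TrimNonAlphanumeric are hypothetical. Since \W does not cover the underscore, the second helper uses the [\W_] variant for data where underscores should also be stripped:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: strip runs of non-word characters at either end.
string TrimNonWord(string s) =>
    Regex.Replace(s, @"^\W+|\W+$", "");

// Hypothetical variant that also treats '_' as a character to strip.
string TrimNonAlphanumeric(string s) =>
    Regex.Replace(s, @"^[\W_]+|[\W_]+$", "");

Console.WriteLine(TrimNonWord("$$$helloworld$$$"));      // helloworld
Console.WriteLine(TrimNonWord("hello$$$$world"));        // hello$$$$world
Console.WriteLine(TrimNonAlphanumeric("__$hello_x$__")); // hello_x
```

Characters in the middle of the string are left alone because each alternative is tied to one of the anchors.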

Simple C# regex

I have a regex I need to match against a path like so: "C:\Documents and Settings\User\My Documents\ScanSnap\382893.pd~". I need a regex that matches all paths except those ending in '~' or '.dat'. The problem I am having is that I don't understand how to match and negate the exact string '.dat' and only at the end of the path. i.e. I don't want to match {d,a,t} elsewhere in the path.
I have built the regex, but need to not match .dat
[\w\s:\.\\]*[^~]$[^\.dat]
[\w\s:\.\\]* This matches all word characters, whitespace, the colon, periods, and backslashes.
[^~]$[^\.dat]$ This causes matches ending in '~' to fail. It seems that I should be able to follow up with a negated match for '.dat', but the match fails in my regex tester.
I think my answer lies in grouping judging from what I've read, would someone point me in the right direction? I should add, I am using a file watching program that allows regex matching, I have only one line to specify the regex.
This entry seems similar: Regex to match multiple strings
You want to use a negative look-ahead:
^((?!\.dat$)[\w\s:\.\\])*$
By the way, your character group ([\w\s:\.\\]) doesn't allow a tilde (~) in it. Did you intend to allow a tilde in the filename if it wasn't at the end? If so:
^((?!~$|\.dat$)[\w\s:\.\\~])*$
The following regex:
^.*(?<!\.dat|~)$
matches any string that does NOT end with a '~' or with '.dat'.
^ # the start of the string
.* # gobble up the entire string (without line terminators!)
(?<!\.dat|~) # looking back, there should not be '.dat' or '~'
$ # the end of the string
In plain English: match a string only when looking behind from the end of the string, there is no sub-string '.dat' or '~'.
Edit: the reason why your attempt failed is because a negated character class, [^...] will just negate a single character. A character class always matches a single character. So when you do [^.dat], you're not negating the string ".dat" but you're matching a single character other than '.', 'd', 'a' or 't'.
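To illustrate both points, here is a small sketch in C# top-level statements. The helper name ShouldWatch and the second and third paths are made up for the example; only the first path comes from the question:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true for paths that do NOT end in '~' or '.dat'.
bool ShouldWatch(string path) =>
    Regex.IsMatch(path, @"^.*(?<!\.dat|~)$");

Console.WriteLine(ShouldWatch(@"C:\Documents and Settings\User\My Documents\ScanSnap\382893.pd~")); // False
Console.WriteLine(ShouldWatch(@"C:\data\file.dat")); // False
Console.WriteLine(ShouldWatch(@"C:\data\dat.txt"));  // True

// A negated character class matches exactly one character,
// not "anything except the string .dat".
Console.WriteLine(Regex.IsMatch("x", @"^[^.dat]$")); // True
Console.WriteLine(Regex.IsMatch("d", @"^[^.dat]$")); // False
```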
^((?!\.dat$)[\w\s:\.\\])*$
This is just a comment on an earlier answer suggestion:
. within a character class, [], is a literal . and does not need escaping.
^((?!\.dat$)[\w\s:.\\])*$
I'm sorry to post this as a new solution, but I apparently don't have enough credibility to simply comment on an answer yet.
I believe you are looking for this:
[\w\s:\.\\]*([^~]|[^\.dat])$
which finds, like before, all word chars, white space, periods (.), back slashes. Then matches for either tilde (~) or '.dat' at the end of the string. You may also want to add a caret (^) at the very beginning if you know that the string should be at the beginning of a new line.
^[\w\s:\.\\]*([^~]|[^\.dat])$

why do these regex tests let certain characters pass?

I am checking a string with the following regexes:
[a-zA-Z0-9]+
[A-Za-z]+
For some reason, the characters:
.
-
_
are allowed to pass, why is that?
If you want to check that the complete string consists of only the wanted characters, you need to anchor your regex as follows:
^[a-zA-Z0-9]+$
Otherwise every string will pass that contains a string of the allowed characters somewhere. The anchors essentially tell the regular expression engine to start looking for those characters at the start of the string and stop looking at the end of the string.
To clarify: If you just use [a-zA-Z0-9]+ as your regex, then the regex engine would rightfully reject the string -__-- as the regex doesn't match against that. There is no single character from the character class you defined.
However, with the string a-b it's different. The regular expression engine will match the first a here since that matches the expression you entered (at least one of the given characters) and won't care about the - or the b. It has done its job and successfully matched a substring according to your regular expression.
Similarly with _-abcdef- – the regex will match the substring abcdef just fine, because you didn't tell it to match only at the start or end of the string; and ignore the other characters.
So when using ^[a-zA-Z0-9]+$ as your regex you are telling the regex engine definitely that you are looking for one or more letters or digits, starting at the very beginning of the string right until the end of the string. There is no room for other characters to squeeze in or hide so this will do what you apparently want. But without the anchors, the match can be anywhere in your search string. For validation purposes you always want to use those anchors.
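The behaviour described above can be reproduced with a few calls, sketched here as C# top-level statements. The helper name IsAlphanumeric is hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical validation helper using the anchored pattern.
bool IsAlphanumeric(string s) =>
    Regex.IsMatch(s, @"^[a-zA-Z0-9]+$");

// Unanchored: succeeds as soon as any substring matches.
Console.WriteLine(Regex.IsMatch("-__--", "[a-zA-Z0-9]+"));         // False
Console.WriteLine(Regex.IsMatch("a-b", "[a-zA-Z0-9]+"));           // True
Console.WriteLine(Regex.Match("_-abcdef-", "[a-zA-Z0-9]+").Value); // abcdef

// Anchored: the whole string has to consist of the allowed characters.
Console.WriteLine(IsAlphanumeric("a-b"));    // False
Console.WriteLine(IsAlphanumeric("abc123")); // True
```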
In regular expressions the + tells the engine to match one or more occurrences of the preceding element (here, the character class).
So this expression [A-Za-z]+ passes if the string contains a sequence of 1 or more alphabetic characters. The only strings that wouldn't pass are strings that contain no alphabetic characters at all.
The ^ symbol anchors the character class to the beginning of the string and the $ symbol anchors to the end of the string.
So ^[A-Za-z0-9]+ means 'match a string that begins with a sequence of one or more alphanumeric characters'. But would allow strings that include non-alphanumerics so long as those characters were not at the beginning of the string.
While ^[A-Za-z0-9]+$ means 'match a string that begins and ends with a sequence of one or more alphanumeric characters'. This is the only way to completely exclude non-alphanumerics from a string.
