Remove non-alphanumeric characters from start and end of string only - c#

I am trying to clean up some data using a helper exe (C#).
I iterate through each string and I want to remove invalid characters from the start and end of the string i.e. remove the dollar symbols from $$$helloworld$$$.
This works fine using this regular expression: \W.
However, strings which contain invalid character in the middle should be left alone i.e. hello$$$$world is fine and my regular expression should not match this particular string.
So in essence, I am trying to figure out the syntax to match invalid characters at the start and the end of of a string, but leave the strings which contain invalid characters in their body.
Thanks for your help!

This does it!
(^[\W_]*)|([\W_]*$)
This regex says match zero or more non word characters at the start(^) or(|) at the end($)

The following should work:
^\W+|\W+$
^ and $ are anchors to the beginning and end of the string respectively. The | in the middle is an OR, so this regex means "either match one or more non-word characters at the start of the string, or match one or more non-word characters at the end of the string".

Use ^ to match the start of string, and $ to match the end of string. C# Regex Cheat Sheet

Try this one,
(^[^\w]*)|([^\w]*$)

Use ^ to match 'beginning of line' and $ to match 'end of line', i.e. you code should match and remove ^\W* and \W*$

Related

.Net regex matching $ with the end of the string and not of line, even with multiline enabled

I'm trying to highlight markdown code, but am running into this weird behavior of the .NET regex multiline option.
The following expression: ^(#+).+$ works fine on any online regex testing tool:
But it refuses to work with .net:
It doesn't seem to take into account the $ tag, and just highlights everything until the end of the string, no matter what. This is my C#
RegExpression = new Regex(#"^(#+).+$", RegexOptions.Multiline)
What am I missing?
It is clear your text contains a linebreak other than LF. In .NET regex, a dot matches any char but LF (a newline char, \n).
See Multiline Mode MSDN regex reference
By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.
So, use
#"^(#+).+?\r?$"
The .+?\r?$ will match lazily any one or more chars other than LF up to the first CR (that is optional) right before a newline.
Or just use a negated character class:
#"^(#+)[^\r\n]+"
The [^\r\n]+ will match one or more chars other than CR/LF.
What you have is good. The only thing you're missing is that . doesn't match newline characters, even with the multiline option. You can get around this in two different ways.
The easiest is to use the RegexOptions.Singleline flag which cause newlines to be treated as characters. That way, ^ still matches the start of the string, $ matches the end of the string and . matches everything including newlines.
The other way to fix this (although I wouldn't recomend it for your use case) is to modify your regex to explicitly allow newlines. To do this you can just replace any . with (?:.|\n) which means either anycharacter or a newline. For your example, you would end up with ^(#+)(?:.|\n)+$. If you want to ensure that there's a non-linebreak character first, add an extra dot: ^(#+).(?:.|\n)+$

Regular expression oddity

I have a regular expression to check for valid identifiers in a script language. These start with a letter or underscore, and can be followed by 0 or more letters, underscores, digits and $ symbols. However, if I call
Util.IsValidIdentifier( "hello\n" );
it returns true. My regex is
const string IDENTIFIER_REGEX = #"^[A-Za-z_][A-Za-z0-9_\$]*$";
so how does the "\n" get through?
The $ matches the end of lines. You need to use \z to match the end of the text, along with RegexOptions.Multiline. You might also want to use \A instead of ^ to match the beginning of the text, not of the line.
Also, you don't need to escape the $ in the character class.
Because $ is a valid metacharacter which means the end of the string (or the end of the line, just before the newline). From msdn:
$: The match must occur at the end of the string or before \n at the end of the line or string.
You should escape it: \$ (and add \z if you want to match the end of the string there).
Your result is true with hello\n because you don't need to escape the $ inside a character class, thus the backslash is matched because you have a backslash (seen as literal) inside the character class.
Try this:
const string IDENTIFIER_REGEX = #"^[A-Za-z_][A-Za-z0-9_$]*$";
Since you are testing variable names that are in one line, you can use $ as end of the string.

How to check for special characters using regex

NET. I have created a regex validator to check for special characters means I donot want any special characters in username. The following is the code
Regex objAlphaPattern = new Regex(#"[a-zA-Z0-9_#.-]");
bool sts = objAlphaPattern.IsMatch(username);
If I provide username as $%^&asghf then the validator gives as invalid data format which is the result I want but If I provide a data s23_#.-^&()%^$# then my validator should block the data but my validator allows the data which is wrong
So how to not allow any special characters except a-z A-A 0-9 _ # .-
Thanks
Sunil Kumar Sahoo
There's a few things wrong with your expression. First you don't have the start string character ^ and end string character $ at the beginning and end of your expression meaning that it only has to find a match somewhere within your string.
Second, you're only looking for one character currently. To force a match of all the characters you'll need to use * Here's what it should be:
Regex objAlphaPattern = new Regex(#"^[a-zA-Z0-9_#.-]*$");
bool sts = objAlphaPattern.IsMatch(username);
Your pattern checks only if the given string contains any "non-special" character; it does not exclude the unwanted characters. You want to change two things; make it check that the whole string contains only allowed characters, and also make it check for more than one character:
^[a-zA-Z0-9_#.-]+$
Added ^ before the pattern to make it start matching at the beginning of the string. Also added +$ after, + to ensure that there is at least one character in the string, and $ to make sure that the string is matched to the end.
Change your regex to ^[a-zA-Z0-9_#.-]+$. Here ^ denotes the beginning of a string, $ is the end of the string.

Simple C# regex

I have a regex I need to match against a path like so: "C:\Documents and Settings\User\My Documents\ScanSnap\382893.pd~". I need a regex that matches all paths except those ending in '~' or '.dat'. The problem I am having is that I don't understand how to match and negate the exact string '.dat' and only at the end of the path. i.e. I don't want to match {d,a,t} elsewhere in the path.
I have built the regex, but need to not match .dat
[\w\s:\.\\]*[^~]$[^\.dat]
[\w\s:\.\\]* This matches all words, whitespace, the colon, periods, and backspaces.
[^~]$[^\.dat]$ This causes matches ending in '~' to fail. It seems that I should be able to follow up with a negated match for '.dat', but the match fails in my regex tester.
I think my answer lies in grouping judging from what I've read, would someone point me in the right direction? I should add, I am using a file watching program that allows regex matching, I have only one line to specify the regex.
This entry seems similar: Regex to match multiple strings
You want to use a negative look-ahead:
^((?!\.dat$)[\w\s:\.\\])*$
By the way, your character group ([\w\s:\.\\]) doesn't allow a tilde (~) in it. Did you intend to allow a tilde in the filename if it wasn't at the end? If so:
^((?!~$|\.dat$)[\w\s:\.\\~])*$
The following regex:
^.*(?<!\.dat|~)$
matches any string that does NOT end with a '~' or with '.dat'.
^ # the start of the string
.* # gobble up the entire string (without line terminators!)
(?<!\.dat|~) # looking back, there should not be '.dat' or '~'
$ # the end of the string
In plain English: match a string only when looking behind from the end of the string, there is no sub-string '.dat' or '~'.
Edit: the reason why your attempt failed is because a negated character class, [^...] will just negate a single character. A character class always matches a single character. So when you do [^.dat], you're not negating the string ".dat" but you're matching a single character other than '.', 'd', 'a' or 't'.
^((?!\.dat$)[\w\s:\.\\])*$
This is just a comment on an earlier answer suggestion:
. within a character class, [], is a literal . and does not need escaping.
^((?!\.dat$)[\w\s:.\\])*$
I'm sorry to post this as a new solution, but I apparently don't have enough credibility to simply comment on an answer yet.
I believe you are looking for this:
[\w\s:\.\\]*([^~]|[^\.dat])$
which finds, like before, all word chars, white space, periods (.), back slashes. Then matches for either tilde (~) or '.dat' at the end of the string. You may also want to add a caret (^) at the very beginning if you know that the string should be at the beginning of a new line.
^[\w\s:\.\\]*([^~]|[^\.dat])$

why do these regex tests let certain characters pass?

I am checking a string with the following regexes:
[a-zA-Z0-9]+
[A-Za-z]+
For some reason, the characters:
.
-
_
are allowed to pass, why is that?
If you want to check that the complete string consists of only the wanted characters you need to anchor your regex like follows:
^[a-zA-Z0-9]+$
Otherwise every string will pass that contains a string of the allowed characters somewhere. The anchors essentially tell the regular expression engine to start looking for those characters at the start of the string and stop looking at the end of the string.
To clarify: If you just use [a-zA-Z0-9]+ as your regex, then the regex engine would rightfully reject the string -__-- as the regex doesn't match against that. There is no single character from the character class you defined.
However, with the string a-b it's different. The regular expression engine will match the first a here since that matches the expression you entered (at least one of the given characters) and won't care about the - or the b. It has done its job and successfully matched a substring according to your regular expression.
Similarly with _-abcdef- – the regex will match the substring abcdef just fine, because you didn't tell it to match only at the start or end of the string; and ignore the other characters.
So when using ^[a-zA-Z0-9]+$ as your regex you are telling the regex engine definitely that you are looking for one or more letters or digits, starting at the very beginning of the string right until the end of the string. There is no room for other characters to squeeze in or hide so this will do what you apparently want. But without the anchors, the match can be anywhere in your search string. For validation purposes you always want to use those anchors.
In regular expressions the + tells the engine to match one or more characters.
So this expression [A-Za-z]+ passes if the string contains a sequence of 1 or more alphabetic characters. The only strings that wouldn't pass are strings that contain no alphabetic characters at all.
The ^ symbol anchors the character class to the beginning of the string and the $ symbol anchors to the end of the string.
So ^[A-Za-z0-9]+ means 'match a string that begins with a sequence of one or more alphanumeric characters'. But would allow strings that include non-alphanumerics so long as those characters were not at the beginning of the string.
While ^[A-Za-z0-9]+$ means 'match a string that begins and ends with a sequence of one or more alphanumeric characters'. This is the only way to completely exclude non-alphanumerics from a string.

Categories