Simple C# regex - c#

I have a regex I need to match against a path like so: "C:\Documents and Settings\User\My Documents\ScanSnap\382893.pd~". I need a regex that matches all paths except those ending in '~' or '.dat'. The problem I am having is that I don't understand how to match and negate the exact string '.dat' and only at the end of the path. i.e. I don't want to match {d,a,t} elsewhere in the path.
I have built the regex, but need to not match .dat
[\w\s:\.\\]*[^~]$[^\.dat]
[\w\s:\.\\]* This matches word characters, whitespace, colons, periods, and backslashes.
[^~]$[^\.dat]$ This causes matches ending in '~' to fail. It seems that I should be able to follow up with a negated match for '.dat', but the match fails in my regex tester.
From what I've read, I think the answer lies in grouping. Could someone point me in the right direction? I should add that I am using a file-watching program that allows regex matching, and I have only one line in which to specify the regex.
This entry seems similar: Regex to match multiple strings

You want to use a negative look-ahead:
^((?!\.dat$)[\w\s:\.\\])*$
By the way, your character group ([\w\s:\.\\]) doesn't allow a tilde (~) in it. Did you intend to allow a tilde in the filename if it wasn't at the end? If so:
^((?!~$|\.dat$)[\w\s:\.\\~])*$
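To see how the look-ahead version behaves, here is a minimal C# sketch. The class name, method name, and sample paths are made up for illustration; only the pattern comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathFilterLookahead
{
    // Pattern from the answer above: a character is consumed only if the
    // remainder of the string is not exactly "~" or ".dat".
    private static readonly Regex Allowed =
        new Regex(@"^((?!~$|\.dat$)[\w\s:\.\\~])*$");

    // Hypothetical helper: true when the path should be processed.
    public static bool IsAllowed(string path) => Allowed.IsMatch(path);

    public static void Main()
    {
        // Sample paths are illustrative assumptions, not from the question.
        Console.WriteLine(IsAllowed(@"C:\ScanSnap\382893.pdf"));      // True
        Console.WriteLine(IsAllowed(@"C:\ScanSnap\382893.pd~"));      // False
        Console.WriteLine(IsAllowed(@"C:\data\settings.dat"));        // False
        Console.WriteLine(IsAllowed(@"C:\my.dat.files\report.txt"));  // True
    }
}
```

The last sample shows that '.dat' in the middle of the path is fine, because the look-ahead only fails when '.dat' is immediately followed by the end of the string.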

The following regex:
^.*(?<!\.dat|~)$
matches any string that does NOT end with a '~' or with '.dat'.
^ # the start of the string
.* # gobble up the entire string (without line terminators!)
(?<!\.dat|~) # looking back, there should not be '.dat' or '~'
$ # the end of the string
In plain English: match a string only when looking behind from the end of the string, there is no sub-string '.dat' or '~'.
Edit: the reason why your attempt failed is because a negated character class, [^...] will just negate a single character. A character class always matches a single character. So when you do [^.dat], you're not negating the string ".dat" but you're matching a single character other than '.', 'd', 'a' or 't'.
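A small C# sketch of the look-behind approach, assuming .NET's regex engine (which accepts look-behinds whose alternatives differ in length, such as '.dat' and '~'). The class name and sample paths are invented for the example. The last two lines illustrate the point about negated character classes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathFilterLookbehind
{
    // Pattern from the answer above: at the end of the string,
    // the preceding text must be neither ".dat" nor "~".
    private static readonly Regex Allowed = new Regex(@"^.*(?<!\.dat|~)$");

    // Hypothetical helper: true when the path should be processed.
    public static bool IsAllowed(string path) => Allowed.IsMatch(path);

    public static void Main()
    {
        // Sample paths are illustrative assumptions.
        Console.WriteLine(IsAllowed(@"C:\ScanSnap\382893.pdf"));  // True
        Console.WriteLine(IsAllowed(@"C:\ScanSnap\382893.pd~"));  // False
        Console.WriteLine(IsAllowed(@"C:\ScanSnap\index.dat"));   // False
        Console.WriteLine(IsAllowed(@"C:\dat\notes.txt"));        // True

        // [^.dat] matches ONE character that is not '.', 'd', 'a' or 't'.
        Console.WriteLine(Regex.IsMatch("x", @"^[^.dat]$"));      // True
        Console.WriteLine(Regex.IsMatch("d", @"^[^.dat]$"));      // False
    }
}
```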

^((?!\.dat$)[\w\s:\.\\])*$
This is just a comment on an earlier answer suggestion:
. within a character class, [], is a literal . and does not need escaping.
^((?!\.dat$)[\w\s:.\\])*$
I'm sorry to post this as a new solution, but I apparently don't have enough credibility to simply comment on an answer yet.

I believe you are looking for this:
[\w\s:\.\\]*([^~]|[^\.dat])$
which finds, like before, all word chars, white space, periods (.), back slashes. Then matches for either tilde (~) or '.dat' at the end of the string. You may also want to add a caret (^) at the very beginning if you know that the string should be at the beginning of a new line.
^[\w\s:\.\\]*([^~]|[^\.dat])$

Related

Regex - find matches not contained within pattern

I would like to use a regular expression to match all occurrences of a phrase where it's not contained within some delimiting characters. I tried putting one together but had some difficulty with the negative lookaheads.
My search phrase is "my phrase". The start delimiter tag is [[ and the end delimiter tag is ]]. The string I'd like to search is:
Here is a sentence with my phrase, here's another part which I don't want to match on [[my phrase]]. I would like to find this occurrence of my phrase.
From this string I would expect to find all occurrences of "my phrase" except the one contained within [[ ]].
I hope that makes sense, thanks in advance for any guidance.
[^#]my phrase[^#]
I have knocked up a RegEx that will do what you ask, this can be seen here.
It simply excludes # on either side of the phrase and allows any other character there. You can return the index of these results, but remember to strip off the first and last character of each match.
Note: This will not pick up a "my phrase" at the very start or end of the string, because the pattern requires a character on each side.
Edit - Seeing as you changed the scope while I was writing this answer,
here is the RegEx for the other delimiter:
[^[[]my phrase[^\]\]]
(?<=[^\[])my phrase(?=[^\]]*)
This will also eliminate the trailing punctuation marks.
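As an alternative sketch (not the pattern from the answer above), a negative look-behind and look-ahead can reject the phrase when it sits directly between [[ and ]]. This assumes the delimiters touch the phrase, as in the question's example. Text such as [[see my phrase]] would need a more elaborate pattern. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhraseFinder
{
    // Assumption: the phrase is excluded when immediately preceded by "[["
    // or immediately followed by "]]".
    private static readonly Regex Outside =
        new Regex(@"(?<!\[\[)my phrase(?!\]\])");

    // Hypothetical helper: number of occurrences outside [[ ]].
    public static int CountOutside(string text) => Outside.Matches(text).Count;

    public static void Main()
    {
        string text =
            "Here is a sentence with my phrase, here's another part which " +
            "I don't want to match on [[my phrase]]. I would like to find " +
            "this occurrence of my phrase.";

        foreach (Match m in Outside.Matches(text))
            Console.WriteLine($"Found at index {m.Index}");

        Console.WriteLine(CountOutside(text)); // 2
    }
}
```

Because the look-arounds consume no characters, each match is exactly "my phrase", so there is no need to strip surrounding characters or punctuation afterwards.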

Regex pattern in C# with empty space

I am having an issue with a regular expression and can't find the answer to my question.
I am trying to build a regex pattern that will pull in any matches that have # around them. For example, #match# or #mt# would both come back.
This works fine for that. #.*?#
However I don't want matches on ## to show up. Basically if there is nothing between the pound signs don't match.
Hope this makes sense.
Thanks.
Please use + to match 1 or more symbols:
#+.+#+
UPDATE:
If you want to only match substrings that are enclosed with single hash symbols, use:
(?<!#)#(?!#)[^#]+#(?!#)
See regex demo
Explanation:
(?<!#)#(?!#) - a # symbol that is not preceded with a # (due to the negative lookbehind (?<!#)) and not followed by a # (due to the negative lookahead (?!#))
[^#]+ - one or more symbols other than # (due to the negated character class [^#])
#(?!#) - a # symbol not followed with another # symbol.
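A quick C# comparison of the original pattern and the look-around version. The class name, helper, and sample string are assumptions made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HashTokens
{
    // Hypothetical helper: returns every match of the pattern as a string.
    public static string[] Find(string input, string pattern) =>
        Regex.Matches(input, pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        string input = "#match# and ## and #mt#"; // illustrative sample

        // Original pattern: also returns the empty pair "##".
        Console.WriteLine(string.Join(", ", Find(input, @"#.*?#")));
        // #match#, ##, #mt#

        // Look-around pattern: only single-hash-delimited, non-empty tokens.
        Console.WriteLine(string.Join(", ",
            Find(input, @"(?<!#)#(?!#)[^#]+#(?!#)")));
        // #match#, #mt#
    }
}
```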
Instead of using * to match between zero and unlimited characters, replace it with +, which will only match if there is at least one character between the #'s. The edited regex should look like this: #.+?#. Hope this helps!
Edit
Sorry for the incorrect regex, I had not expected multiple hash signs. This should work for your sentence: #+.+?#+
Edit 2
I am pretty sure I got it. Try this: (?<!#)#[^#].*?#. It might not work as expected with triple hashes though.
Try:
[^#]?#.+#[^#]?
The [^character_group] construction matches any single character not included in the character group. Using ? after it lets you match at the beginning/end of a string (since it matches the preceding element zero or one time). Check out the documentation here

Escaping hash and quote to regular expression

I am trying to define a regular expression to use with a regular expression validator that limits the content of a textbox to only alphanumeric characters, slash (/), hash (#), left and right parentheses (()), period (.), apostrophe ('), quote ("), hyphen (-) and spaces.
I am having troubles with the hash and quote, the other restrictions are working, but when I insert one of these chars the evaluation fails and I get the error message. I have tried to escape these characters without and also using verbatim which was my last attempt.
@"[ a-zA-ZÀ-ÿ/().\'-""#]"
Any thoughts on these? Thank you
The regex language is smart enough to understand that periods and parentheses within a character class actually refer to the characters and not to the patterns they usually do when they appear outside of character classes.
Within your character class, you need to escape the backslash (\) and the hyphen (-), but that's it:
@"[ a-zA-ZÀ-ÿ/().\\'\-""#]"
If you move your hyphen to the end of the character class, you won't even need to escape that:
@"[ a-zA-ZÀ-ÿ/().\\'""#-]"
And of course this still only matches a single character. If you want to ensure that the entire string consists only of these characters, you'll need to use start (^) and end ($) anchors and a quantifier (* or +) after your character class.
I believe your final pattern should look like this:
@"^[ a-zA-ZÀ-ÿ/().\\'""#-]*$"
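For reference, a minimal sketch of the final pattern used from C# code, assuming it is written as a verbatim string (@"...") so that "" stands for a single quote character. The class name and sample inputs are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextBoxValidator
{
    // Final pattern from the answer above. In a verbatim string, "" is a
    // literal quote; \\ is a literal backslash to the regex engine; the
    // trailing - is a literal hyphen because it is last in the class.
    private const string Pattern = @"^[ a-zA-ZÀ-ÿ/().\\'""#-]*$";

    // Hypothetical helper: true when every character is in the allowed set.
    public static bool IsValid(string input) => Regex.IsMatch(input, Pattern);

    public static void Main()
    {
        Console.WriteLine(IsValid("Apt #B / \"Main\" (rear)")); // True
        Console.WriteLine(IsValid("O'Neil-Smith Jr."));         // True
        Console.WriteLine(IsValid("a*b"));                      // False
        Console.WriteLine(IsValid("Unit 7"));                   // False
    }
}
```

The last line returns False because the class as written contains no digits. If "alphanumeric" is meant to include numbers, 0-9 would have to be added to the class.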

Remove non-alphanumeric characters from start and end of string only

I am trying to clean up some data using a helper exe (C#).
I iterate through each string and I want to remove invalid characters from the start and end of the string i.e. remove the dollar symbols from $$$helloworld$$$.
This works fine using this regular expression: \W.
However, strings which contain invalid character in the middle should be left alone i.e. hello$$$$world is fine and my regular expression should not match this particular string.
So in essence, I am trying to figure out the syntax to match invalid characters at the start and the end of a string, but leave alone the strings which contain invalid characters in their body.
Thanks for your help!
This does it!
(^[\W_]*)|([\W_]*$)
This regex says match zero or more non word characters at the start(^) or(|) at the end($)
The following should work:
^\W+|\W+$
^ and $ are anchors to the beginning and end of the string respectively. The | in the middle is an OR, so this regex means "either match one or more non-word characters at the start of the string, or match one or more non-word characters at the end of the string".
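A short C# sketch of how either pattern could be applied with Regex.Replace. The class and method names are hypothetical. The second method uses [\W_] from the first answer, since \W alone treats the underscore as a word character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EdgeTrimmer
{
    // Removes non-word characters at the start and end only.
    public static string TrimNonWord(string s) =>
        Regex.Replace(s, @"^\W+|\W+$", "");

    // Same idea, but also strips leading and trailing underscores.
    public static string TrimNonWordAndUnderscore(string s) =>
        Regex.Replace(s, @"^[\W_]+|[\W_]+$", "");

    public static void Main()
    {
        Console.WriteLine(TrimNonWord("$$$helloworld$$$"));         // helloworld
        Console.WriteLine(TrimNonWord("hello$$$$world"));           // hello$$$$world
        Console.WriteLine(TrimNonWord("__hello__"));                // __hello__
        Console.WriteLine(TrimNonWordAndUnderscore("__hello__"));   // hello
    }
}
```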
Use ^ to match the start of string, and $ to match the end of string. C# Regex Cheat Sheet
Try this one,
(^[^\w]*)|([^\w]*$)
Use ^ to match 'beginning of line' and $ to match 'end of line', i.e. your code should match and remove ^\W* and \W*$

why do these regex tests let certain characters pass?

I am checking a string with the following regexes:
[a-zA-Z0-9]+
[A-Za-z]+
For some reason, the characters:
.
-
_
are allowed to pass, why is that?
If you want to check that the complete string consists of only the wanted characters you need to anchor your regex like follows:
^[a-zA-Z0-9]+$
Otherwise every string will pass that contains a string of the allowed characters somewhere. The anchors essentially tell the regular expression engine to start looking for those characters at the start of the string and stop looking at the end of the string.
To clarify: If you just use [a-zA-Z0-9]+ as your regex, then the regex engine would rightfully reject the string -__-- as the regex doesn't match against that. There is no single character from the character class you defined.
However, with the string a-b it's different. The regular expression engine will match the first a here since that matches the expression you entered (at least one of the given characters) and won't care about the - or the b. It has done its job and successfully matched a substring according to your regular expression.
Similarly with _-abcdef- – the regex will match the substring abcdef just fine, because you didn't tell it to match only at the start or end of the string; and ignore the other characters.
So when using ^[a-zA-Z0-9]+$ as your regex you are telling the regex engine definitely that you are looking for one or more letters or digits, starting at the very beginning of the string right until the end of the string. There is no room for other characters to squeeze in or hide so this will do what you apparently want. But without the anchors, the match can be anywhere in your search string. For validation purposes you always want to use those anchors.
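The difference is easy to check in C#. This is only a sketch; the class and method names are made up, and the test strings are the ones discussed above. One .NET detail worth knowing: $ also matches just before a final newline, so \z is the stricter end anchor.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Unanchored: true if ANY substring consists of letters or digits.
    public static bool ContainsAlnum(string s) =>
        Regex.IsMatch(s, @"[a-zA-Z0-9]+");

    // Anchored: true only if the WHOLE string consists of letters or digits.
    public static bool IsAllAlnum(string s) =>
        Regex.IsMatch(s, @"^[a-zA-Z0-9]+$");

    public static void Main()
    {
        Console.WriteLine(ContainsAlnum("-__--"));     // False
        Console.WriteLine(ContainsAlnum("a-b"));       // True
        Console.WriteLine(ContainsAlnum("_-abcdef-")); // True

        Console.WriteLine(IsAllAlnum("a-b"));          // False
        Console.WriteLine(IsAllAlnum("abc123"));       // True

        // In .NET, $ matches before a trailing newline; \z does not.
        Console.WriteLine(IsAllAlnum("abc\n"));                          // True
        Console.WriteLine(Regex.IsMatch("abc\n", @"^[a-zA-Z0-9]+\z"));   // False
    }
}
```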
In regular expressions, the + quantifier tells the engine to match the preceding element one or more times.
So this expression [A-Za-z]+ passes if the string contains a sequence of 1 or more alphabetic characters. The only strings that wouldn't pass are strings that contain no alphabetic characters at all.
The ^ symbol anchors the character class to the beginning of the string and the $ symbol anchors to the end of the string.
So ^[A-Za-z0-9]+ means 'match a string that begins with a sequence of one or more alphanumeric characters'. But would allow strings that include non-alphanumerics so long as those characters were not at the beginning of the string.
While ^[A-Za-z0-9]+$ means 'match a string that consists entirely of one or more alphanumeric characters'. Anchoring both ends is what completely excludes non-alphanumerics from a string.
