C# Regex match on special characters - c#

I know this stuff has been talked about a lot, but I'm having a problem trying to match the following...
Example input: "test test 310-315"
I need a regex expression that recognizes a number followed by a dash, and returns 310. How do I include the dash in the regex expression though. So the final match result would be: "310".
Thanks a lot - kcross
EDIT: Also, how would I do the same thing but with the dash preceding, but also take into account that the number following the dash could be a negative number... didnt think of this one when I wrote the question immediately. for example: "test test 310--315" returns -315 and "test 310-315" returns 315.

Regex regex = new Regex(#"\d+(?=\-)");
\d+ - Looks for one or more digits
(?=\-) - Makes sure it is followed by a dash
The # just eliminates the need to escape the backslashes to keep the compiler happy.
Also, you may want this instead:
\d+(?=\-\d+)
This will check for a one or more numbers, followed by a dash, followed by one or more numbers, but only match the first set.
In response to your comment, here's a regex that will check for a number following a -, while accounting for potential negative (-) numbers:
Regex regex = new Regex(#"(?<=\-)\-?\d+");
(?<=\-) - Negative lookbehind which will check and make sure there is a preceding -
\-? - Checks for either zero or one dashes
\d+ - One or more digits

(?'number'\d+)- will work ( no need to escape ). In this example the group containing the single number is the named group 'number'.
if you want to match both groups with optional sign try:
#"(?'first'-?\d+)-(?'second'-?\d+)"
See it working here.
Just to describe, nothing complicated, just using -? to match an optional - and \d+ to match one or more digit. a literal - match itself.

here's some documentation that I use:
http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
in the comments section of that page, it suggests escaping the dash with '\-'
make sure you escape your escape character \

You would escape the special meaning of - in regex language (means range) using a backslash (\). Since backslash has a special meaning in C# literals to escape quotes or be part of some characters, you need to escape that with another backslash(\). So essentially it would be \d+\\-.

\b\d*(?=\-) you will want to look ahead for the dash
\b = is start at a word boundry
\d = match any decimal digit
* = match the previous as many times as needed
(?=\-) = look ahead for the dash
Edited for Formatting issue with the slash not showing after posting

Related

Regex for alpha number string in c# accepting underscore and white spaces

I already gone through many post on SO. I didn't find what I needed for my specific scenario.
I need a regex for alpha numeric string.
where following conditions should be matched
Valid string:
ameya123 (alphabets and numbers)
ameya (only alphabets)
AMeya12(Capital and normal alphabets and numbers)
Ameya_123 (alphabets and underscore and numbers)
Ameya_ 123 (alphabets underscore and white speces)
Invalid string:
123 (only numbers)
_ (only underscore)
(only space) (only white spaces)
any special charecter other than underscore
what i tried till now:
(?=.*[a-zA-Z])(?=.*[0-9]*[\s]*[_]*)
the above regex is working in Regex online editor however not working in data annotation in c#
please suggest.
Based on your requirements and not your attempt, what you are in need of is this:
^(?!(?:\d+|_+| +)$)[\w ]+$
The negative lookahead looks for undesired matches to fail the whole process. Those are strings containing digits only, underscores only or spaces only. If they never happen we want to have a match for ^[\w ]+$ which is nearly the same as ^[a-zA-Z0-9_ ]+$.
See live demo here
Explanation:
^ Start of line / string
(?! Start of negative lookahead
(?: Start of non-capturing group
\d+ Match digits
| Or
_+ Match underscores
| Or
[ ]+ Match spaces
)$ End of non-capturing group immediately followed by end of line / string (none of previous matches should be found)
) End of negative lookahead
[\w ]+$ Match a character inside the character set up to end of input string
Note: \w is a shorthand for [a-zA-Z0-9_] unless u modifier is set.
One problem with your regex is that in annotations, the regex must match and consume the entire string input, while your pattern only contains lookarounds that do not consume any text.
You may use
^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$
See the regex demo. Note that \w (when used for a server-side validation, and thus parsed with the .NET regex engine) will also allow any Unicode letters, digits and some more stuff when validating on the server side, so I'd rather stick to [A-Za-z0-9_] to be consistent with both server- and client-side validation.
Details
^ - start of string (not necessary here, but good to have when debugging)
(?!\d+$) - a negative lookahead that fails the match if the whole string consists of digits
(?![_\s]+$) - a negative lookahead that fails the match if the whole string consists of underscores and/or whitespaces. NOTE: if you plan to only disallow ____ or " " like inputs, you need to split this lookahead into (?!_+$) and (?!\s+$))
[A-Za-z0-9\s_]+ - 1+ ASCII letters, digits, _ and whitespace chars
$ - end of string (not necessary here, but still good to have).
If I understand your requirements correctly, you need to match one or more letters (uppercase or lowercase), and possibly zero or more of digits, whitespace, or underscore. This implies the following pattern:
^[A-Za-z0-9\s_]*[A-Za-z][A-Za-z0-9\s_]*$
Demo
In the demo, I have replaced \s with \t \r, because \s was matching across all lines.
Unlike the answers given by #revo and #wiktor, I don't have a fancy looking explanation to the regex. I am beautiful even without my makeup on. Honestly, if you don't understand the pattern I gave, you might want to review a good regex tutorial.
This simple RegEx should do it:
[a-zA-Z]+[0-9_ ]*
One or more Alphabet, followed by zero or more numbers, underscore and Space.
This one should be good:
[\w\s_]*[a-zA-Z]+[\w\s_]*

Regex to stop parsing after semicolon is encountered

I am using this regex to parse URL from a semicolon separated string.
\b(?:https?:|http?:|www\.)\S+\b
It is working fine if my input text is in these formats:
"Google;\"https://google.com\""
//output - https://google.com
"Yahoo;\"www.yahoo.com\""
//output - www.yahoo.com
but in this case it gives incorrect string
"https://google.com;\"https://google.com\""
//output - https://google.com;\"https://google.com
how can I stop the parsing when I encounter the ';' ?
Looking at your examples, I would just match any URL between quotation marks. Something like this:
(?<=")(?:https?:|www\.)[^"]*
You can try it out here
Or as others have said, split the input string by the semicolon character using string.Split, and check each string sequentially for your desired match.
For your example data you might use a positive lookahead (?=) and a positive lookbehind (?<=)
(?<=")(?:https?:|www\.).+?(?=;?\\")
That would match
(?<=") Positive lookbehind to assert that what is on the left side is a double quote
(?:https?:|www\.) Match either http with an optional s or www.
.+? Match any character one or more times non greedy
(?=;?\\") Positive lookahead which asserts that what follows is an optional ; followed by\"
I would personally just modify the regex to look specifically for URLs and add some conditionals to the https:// protocols and www quantifier. Using \S+ can be kind of iffy because it will grab every non whitespace character, in which in a URL, it's limited on the characters you can use.
Something like this should work great for your particular needs.
(https?:\/{2})?([w]{3}.)?\w+\.[a-zA-Z]+
This sets up a conditional on the http (s also optional) protocol which would then be immediately be followed by the ://. Then, it will grab all letters, numbers, and underscores as many as possible until the ., followed by the last set of characters to end it. You can exchange the [a-zA-Z] character set for a explicit set of domains if you'd prefer.

Regex pattern in C# with empty space

I am having issue with a reg ex expression and can't find the answer to my question.
I am trying to build a reg ex pattern that will pull in any matches that have # around them. for example #match# or #mt# would both come back.
This works fine for that. #.*?#
However I don't want matches on ## to show up. Basically if there is nothing between the pound signs don't match.
Hope this makes sense.
Thanks.
Please use + to match 1 or more symbols:
#+.+#+
UPDATE:
If you want to only match substrings that are enclosed with single hash symbols, use:
(?<!#)#(?!#)[^#]+#(?!#)
See regex demo
Explanation:
(?<!#)#(?!#) - a # symbol that is not preceded with a # (due to the negative lookbehind (?<!#)) and not followed by a # (due to the negative lookahead (?!#))
[^#]+ - one or more symbols other than # (due to the negated character class [^#])
#(?!#) - a # symbol not followed with another # symbol.
Instead of using * to match between zero and unlimited characters, replace it with +, which will only match if there is at least one character between the #'s. The edited regex should look like this: #.+?#. Hope this helps!
Edit
Sorry for the incorrect regex, I had not expected multiple hash signs. This should work for your sentence: #+.+?#+
Edit 2
I am pretty sure I got it. Try this: (?<!#)#[^#].*?#. It might not work as expected with triple hashes though.
Try:
[^#]?#.+#[^#]?
The [^ character_group] construction matches any single character not included in the character group. Using the ? after it will let you match at the beginning/end of a string (since it matches the preceeding character zero or more times. Check out the documentation here

Explain the Regex mentioned

Can any one please explain the regex below, this has been used in my application for a very long time even before I joined, and I am very new to regex's.
/^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$/
As far as I understand
this regex will validate
- for a minimum of 6 chars to a maximum of 10 characters
- will escape the characters like ^ and $
also, my basic need is that I want a regex for a minimum of 6 characters with 1 character being a digit and the other one being a special character.
^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$
^ is called an "anchor". It basically means that any following text must be immediately after the "start of the input". So ^B would match "B" but not "AB" because in the second "B" is not the first character.
.* matches 0 or more characters - any character except a newline (by default). This is what's known as a greedy quantifier - the regex engine will match ("consume") all of the characters to the end of the input (or the end of the line) and then work backwards for the rest of the expression (it "gives up" characters only when it must). In a regex, once a character is "matched" no other part of the expression can "match" it again (except for zero-width lookarounds, which is coming next).
(?=.{6,10}) is a lookahead anchor and it matches a position in the input. It finds a place in the input where there are 6 to 10 characters following, but it does not "consume" those characters, meaning that the following expressions are free to match them.
(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z]) is another lookahead anchor. It matches a position in the input where the following text contains four letters ([a-zA-Z] matches one lowercase or uppercase letter), but any number of other characters (including zero characters) may be between them. For example: "++a5b---C#D" would match. Again, being an anchor, it does not actually "consume" the matched characters - it only finds a position in the text where the following characters match the expression.
(?=.*\d.*\d) Another lookahead. This matches a position where two numbers follow (with any number of other characters in between).
.* Already covered this one.
$ This is another kind of anchor that matches the end of the input (or the end of a line - the position just before a newline character). It says that the preceding expression must match characters at the end of the string. When ^ and $ are used together, it means that the entire input must be matched (not just part of it). So /bcd/ would match "abcde", but /^bcd$/ would not match "abcde" because "a" and "e" could not be included in the match.
NOTE
This looks like a password validation regex. If it is, please note that it's broken. The .* at the beginning and end will allow the password to be arbitrarily longer than 10 characters. It could also be rewritten to be a bit shorter. I believe the following will be an acceptable (and slightly more readable) substitute:
^(?=(.*[a-zA-Z]){4})(?=(.*\d){2}).{6,10}$
Thanks to #nhahtdh for pointing out the correct way to implement the character length limit.
Check Cyborgx37's answer for the syntax explanation. I'll do some explanation on the meaning of the regex.
^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$
The first .* is redundant, since the rest are zero-width assertions that begins with any character ., and .* at the end.
The regex will match minimum 6 characters, due to the assertion (?=.{6,10}). However, there is no upper limit on the number of characters of the string that the regex can match. This is because of the .* at the end (the .* in the front also contributes).
This (?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z]) part asserts that there are at least 4 English alphabet character (uppercase or lowercase). And (?=.*\d.*\d) asserts that there are at least 2 digits (0-9). Since [a-zA-Z] and \d are disjoint sets, these 2 conditions combined makes the (?=.{6,10}) redundant.
The syntax of .*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z] is also needlessly verbose. It can be shorten with the use of repetition: (?:.*[a-zA-Z]){4}.
The following regex is equivalent your original regex. However, I really doubt your current one and this equivalent rewrite of your regex does what you want:
^(?=(?:.*[a-zA-Z]){4})(?=(?:.*\d){2}).*$
More explicit on the length, since clarity is always better. Meaning stay the same:
^(?=(?:.*[a-zA-Z]){4})(?=(?:.*\d){2}).{6,}$
Recap:
Minimum length = 6
No limit on maximum length
At least 4 English alphabet, lowercase or uppercase
At least 2 digits 0-9
REGEXPLANATION
/.../: slashes are often used to represent the area where the regex is defined
^: matches beginning of input string
.: this can match any character
*: matches the previous symbol 0 or more times
.{6,10}: matches .(any character) somewhere between 6 and 10 times
[a-zA-Z]: matches all characters between a and z and between A and Z
\d: matches a digit.
$: matches the end of input.
I think that just about does it for all the symbols in the regex you've posted
For your regex request, here is what you would use:
^(?=.{6,}$)(?=.*?\d)(?=.*?[!##$%&*()+_=?\^-]).*
And here it is unrolled for you:
^ // Anchor the beginning of the string (password).
(?=.{6,}$) // Look ahead: Six or more characters, then the end of the string.
(?=.*?\d) // Look ahead: Anything, then a single digit.
(?=.*?[!##$%&*()+_=?\^-]) // Look ahead: Anything, and a special character.
.* // Passes our look aheads, let's consume the entire string.
As you can see, the special characters have to be explicitly defined as there is not a reserved shorthand notation (like \w, \s, \d) for them. Here are the accepted ones (you can modify as you wish):
!, #, #, $, %, ^, &, *, (, ), -, +, _, =, ?
The key to understanding regex look aheads is to remember that they do not move the position of the parser. Meaning that (?=...) will start looking at the first character after the last pattern match, as will subsequent (?=...) look aheads.

Simple Regex Question

I am new to regex (15 minutes of experience) so I can't figure this one out. I just want something that will match an alphanumeric string with no spaces in it. For example:
"ThisIsMyName" should match, but
"This Is My Name" should not match.
^[a-zA-Z0-9]+$ will match any letters and any numbers with no spaces (or any punctuation) in the string. It will also require at least one alphanumeric character. This uses a character class for the matching. Breakdown:
^ #Match the beginning of the string
[ #Start of a character class
a-z #The range of lowercase letters
A-Z #The range of uppercase letters
0-9 #The digits 0-9
] #End of the character class
+ #Repeat the previous one or more times
$ #End of string
Further, if you want to "capture" the match so that it can be referenced later, you can surround the regex in parens (a capture group), like so:
^([a-zA-Z0-9]+)$
Even further: since you tagged this with C#, MSDN has a little howto for using regular expressions in .NET. It can be found here. You can also note the fact that if you run the regex with the RegexOptions.IgnoreCase flag then you can simplify it to:
^([a-z0-9])+$
this will match any sequence of non-space characters:
\S+
Take a look at this link for a good basic Regex information source: http://regexlib.com/CheatSheet.aspx
They also have a handy testing tool that I use quite a bit: http://regexlib.com/RETester.aspx
That said, #eldarerathis' or #Nicolas Bottarini's answers should work for you.
I have just written a blog entry about regex, maybe it's something you may find useful:)
http://blogs.appframe.com/erikv/2010-09-23-Regular-Expression
Try using this regex to see if it works: (\w+)

Categories