Regex for delimiters, except in numbers - c#

Using C#'s Regex.Split, I have a regex that finds delimiters between words:
[\b\s\p{P}]+
On "sample text. another:word" it works, and produces: sample | text | another | word. Great!
On "word 120,000 another word" it produces: word | 120 | 000 | another | word. Not great!
How can I change the regex so that commas inside numbers will not be matched? i.e, so that 120,000 will not break?

I believe you will find that doing this with Regex.Split is only going to get more complicated as additional requirements emerge. You might find it preferable to use Regex.Match instead to do the reverse (recognize "whole words" instead of logical "word boundaries").
Here's why:
((?<=\p{L})\p{P}(?=\p{L}))|(\p{Z}|(?<=[\p{Z}\p{P}])\p{P}|\p{P}(?=[\p{Z}\p{P}]))+
Not pretty, so let's explain it. First off, I have replaced \s with the \p{Z} class (visible/invisible whitespace) because why not. Second, this regex matches four different things:
#1: (?<=\p{L})\p{P}(?=\p{L})
This matches a punctuation character that is sandwiched between letters. Needed to match the : in another:word. It's also the only subpattern on which the + quantifier does not apply (it would make no sense). Positive lookaround is used to assert the presence of the letters but avoid matching them.
#2: \p{Z}
This matches a sequence of whitespace. All such sequences result in splits.
#3: (?<=[\p{Z}\p{P}])\p{P}
This matches a punctuation character that is preceded by anything other than punctuation or whitespace, using positive lookbehind.
#4: \p{P}(?=[\p{Z}\p{P}])
This is the converse of the above: it matches a punctuation character followed by anything other than punctuation or whitespace.
It follows that since the comma in 100,000 matches none of the above this regex will not split that token. But you can see where this is going: instead of specifying what symbols you want to keep together as one token, using Regex.Split you will have to specify... everything else.

Try this one:
(([\s\p{P}](?!\d))|((?<!\d)[\s\p{P}]))+
The first half
([\s\p{P}](?!\d))
matches any delimiters not followed by a number and the second - any delimiters, not following a number.

Related

Regex for alpha number string in c# accepting underscore and white spaces

I already gone through many post on SO. I didn't find what I needed for my specific scenario.
I need a regex for alpha numeric string.
where following conditions should be matched
Valid string:
ameya123 (alphabets and numbers)
ameya (only alphabets)
AMeya12(Capital and normal alphabets and numbers)
Ameya_123 (alphabets and underscore and numbers)
Ameya_ 123 (alphabets underscore and white speces)
Invalid string:
123 (only numbers)
_ (only underscore)
(only space) (only white spaces)
any special charecter other than underscore
what i tried till now:
(?=.*[a-zA-Z])(?=.*[0-9]*[\s]*[_]*)
the above regex is working in Regex online editor however not working in data annotation in c#
please suggest.
Based on your requirements and not your attempt, what you are in need of is this:
^(?!(?:\d+|_+| +)$)[\w ]+$
The negative lookahead looks for undesired matches to fail the whole process. Those are strings containing digits only, underscores only or spaces only. If they never happen we want to have a match for ^[\w ]+$ which is nearly the same as ^[a-zA-Z0-9_ ]+$.
See live demo here
Explanation:
^ Start of line / string
(?! Start of negative lookahead
(?: Start of non-capturing group
\d+ Match digits
| Or
_+ Match underscores
| Or
[ ]+ Match spaces
)$ End of non-capturing group immediately followed by end of line / string (none of previous matches should be found)
) End of negative lookahead
[\w ]+$ Match a character inside the character set up to end of input string
Note: \w is a shorthand for [a-zA-Z0-9_] unless u modifier is set.
One problem with your regex is that in annotations, the regex must match and consume the entire string input, while your pattern only contains lookarounds that do not consume any text.
You may use
^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$
See the regex demo. Note that \w (when used for a server-side validation, and thus parsed with the .NET regex engine) will also allow any Unicode letters, digits and some more stuff when validating on the server side, so I'd rather stick to [A-Za-z0-9_] to be consistent with both server- and client-side validation.
Details
^ - start of string (not necessary here, but good to have when debugging)
(?!\d+$) - a negative lookahead that fails the match if the whole string consists of digits
(?![_\s]+$) - a negative lookahead that fails the match if the whole string consists of underscores and/or whitespaces. NOTE: if you plan to only disallow ____ or " " like inputs, you need to split this lookahead into (?!_+$) and (?!\s+$))
[A-Za-z0-9\s_]+ - 1+ ASCII letters, digits, _ and whitespace chars
$ - end of string (not necessary here, but still good to have).
If I understand your requirements correctly, you need to match one or more letters (uppercase or lowercase), and possibly zero or more of digits, whitespace, or underscore. This implies the following pattern:
^[A-Za-z0-9\s_]*[A-Za-z][A-Za-z0-9\s_]*$
Demo
In the demo, I have replaced \s with \t \r, because \s was matching across all lines.
Unlike the answers given by #revo and #wiktor, I don't have a fancy looking explanation to the regex. I am beautiful even without my makeup on. Honestly, if you don't understand the pattern I gave, you might want to review a good regex tutorial.
This simple RegEx should do it:
[a-zA-Z]+[0-9_ ]*
One or more Alphabet, followed by zero or more numbers, underscore and Space.
This one should be good:
[\w\s_]*[a-zA-Z]+[\w\s_]*

Regex to find a string that does not contain a space while matching other conditions

I have a bunch of strings that may contains certain patterns. Specifically, the following 3.
Starts with (- followed by 10 digits followed by ).
E.g.:
(-1234567890)
Starts with (, ends with ), and may contain 1 or more characters, but NO spaces.
E.g.:
(ABC) or (AF33) or (2345)
Starts with (, ends with ), and may contain 1 or more characters, INCLUDING spaces.
E.g.:
(Some string)
The strings I work with may contain zero or more of the patterns above. My requirement is to match ONLY the second one from above in a given string, and I'd like to be able to use Regex class in C#.
For example, let's say following are five different strings I have.
This is some random text.
This is some (ABC) random (-1234567890) text.
This is some (XY12) random (-1234567890) text.
This is some (Contains space) random (-1234567890) text.
This is some () random text.
My Regex should match only the 2nd and 3rd strings from the above list.
So far, I've managed to write this following Regex, which excludes strings 1 and 5.
.*\((?!\-).+\).*
This matches 2nd, 3rd, AND 4th strings above. Now I'm not sure how I can get it to exclude the 4th, one which contains spaces inside parenthesis. I know that \S detects whitespaces, but how can I tell it to detect strings that do not contain spaces only within the parenthesis that don't contain a - after the first (?
EDIT 1:
There will never be nested parenthesis in my strings.
EDIT 2:
Here's a Regex Tester.
.*\(\w+\).*
If you use above regex, second and third strings are matches only
.* all characters
( pharantesis
\w+ all word characters (at least one)
) pharantesis
.* all characters
\(([^- ]+[^ ]*)\)
should work
Explanation:
[^- ]+ will first match one character that's neither - or This will make sure it contains at least one character
Then [^ ]* will match 0 or more none white space characters
This will work for any char set

Regex to match `xyz` in `abc|qw|xzy mno`

It's driving nuts.
The input strings are:
abc|qw|xzy mno
abc||xzy mno
abc|qw|xzy
abc|qw|
I need to extract the first word (if any) after the 2nd vertical bar, in all cases above xyz but in general words in multiple (natural) languages.
Also, all lines must be considered as a block so single line does not apply, iow, the EOL is the break to account for.
Thank you, guys.
You can use the following regexp with the RegexOptions.Multiline option.
(?<=^(?:[^|]*\|){2})\w+
(?<= begins a positive lookbehind, so this matches a word that must be preceded by the beginning of the line followed by two pipe-delimited sequences.

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how to say I can get a regex (C#) search of the first 3 letters of a full name?
Without the use of (.*)
I used (.**)but it scrolls the text far beyond the requested name, or
if it finds the first condition and after 100 words find the second condition he return a text that is not the look, so I have to limit in number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore names with less than 3 characters if they exist in name.
Search any name that contains the first 3 characters that start with:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero or more quantifier (*) is 'greedy' by default—that is, it will consume as many characters as possible in order to finding the remainder of the pattern. This is why Mar.*Jac will match the first Mar in the input and the last Jac and everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of non-whitespace characters containing at most two characters, each surrounded by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
I think what you want is this regular expression to check if it is true and is case insensitive
#"^[Mar|Jac|Rey]{3}"
Less specific:
#"^[\w]{3}"
If you want to capture the first three letters of every words of at least three characters words you could use something like :
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It will return you a serie of Match named "name", each one of them is a result.
Code sample :
Regex regex = new Regex(#"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var match = regex.Matches("Marck Jacobs L. S. Reynolds");
If you want capture also 3 characters words, you can replace the last "\w" by a space. In this case think to handle the last word of the phrase.

Explain the Regex mentioned

Can any one please explain the regex below, this has been used in my application for a very long time even before I joined, and I am very new to regex's.
/^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$/
As far as I understand
this regex will validate
- for a minimum of 6 chars to a maximum of 10 characters
- will escape the characters like ^ and $
also, my basic need is that I want a regex for a minimum of 6 characters with 1 character being a digit and the other one being a special character.
^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$
^ is called an "anchor". It basically means that any following text must be immediately after the "start of the input". So ^B would match "B" but not "AB" because in the second "B" is not the first character.
.* matches 0 or more characters - any character except a newline (by default). This is what's known as a greedy quantifier - the regex engine will match ("consume") all of the characters to the end of the input (or the end of the line) and then work backwards for the rest of the expression (it "gives up" characters only when it must). In a regex, once a character is "matched" no other part of the expression can "match" it again (except for zero-width lookarounds, which is coming next).
(?=.{6,10}) is a lookahead anchor and it matches a position in the input. It finds a place in the input where there are 6 to 10 characters following, but it does not "consume" those characters, meaning that the following expressions are free to match them.
(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z]) is another lookahead anchor. It matches a position in the input where the following text contains four letters ([a-zA-Z] matches one lowercase or uppercase letter), but any number of other characters (including zero characters) may be between them. For example: "++a5b---C#D" would match. Again, being an anchor, it does not actually "consume" the matched characters - it only finds a position in the text where the following characters match the expression.
(?=.*\d.*\d) Another lookahead. This matches a position where two numbers follow (with any number of other characters in between).
.* Already covered this one.
$ This is another kind of anchor that matches the end of the input (or the end of a line - the position just before a newline character). It says that the preceding expression must match characters at the end of the string. When ^ and $ are used together, it means that the entire input must be matched (not just part of it). So /bcd/ would match "abcde", but /^bcd$/ would not match "abcde" because "a" and "e" could not be included in the match.
NOTE
This looks like a password validation regex. If it is, please note that it's broken. The .* at the beginning and end will allow the password to be arbitrarily longer than 10 characters. It could also be rewritten to be a bit shorter. I believe the following will be an acceptable (and slightly more readable) substitute:
^(?=(.*[a-zA-Z]){4})(?=(.*\d){2}).{6,10}$
Thanks to #nhahtdh for pointing out the correct way to implement the character length limit.
Check Cyborgx37's answer for the syntax explanation. I'll do some explanation on the meaning of the regex.
^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$
The first .* is redundant, since the rest are zero-width assertions that begins with any character ., and .* at the end.
The regex will match minimum 6 characters, due to the assertion (?=.{6,10}). However, there is no upper limit on the number of characters of the string that the regex can match. This is because of the .* at the end (the .* in the front also contributes).
This (?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z]) part asserts that there are at least 4 English alphabet character (uppercase or lowercase). And (?=.*\d.*\d) asserts that there are at least 2 digits (0-9). Since [a-zA-Z] and \d are disjoint sets, these 2 conditions combined makes the (?=.{6,10}) redundant.
The syntax of .*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z] is also needlessly verbose. It can be shorten with the use of repetition: (?:.*[a-zA-Z]){4}.
The following regex is equivalent your original regex. However, I really doubt your current one and this equivalent rewrite of your regex does what you want:
^(?=(?:.*[a-zA-Z]){4})(?=(?:.*\d){2}).*$
More explicit on the length, since clarity is always better. Meaning stay the same:
^(?=(?:.*[a-zA-Z]){4})(?=(?:.*\d){2}).{6,}$
Recap:
Minimum length = 6
No limit on maximum length
At least 4 English alphabet, lowercase or uppercase
At least 2 digits 0-9
REGEXPLANATION
/.../: slashes are often used to represent the area where the regex is defined
^: matches beginning of input string
.: this can match any character
*: matches the previous symbol 0 or more times
.{6,10}: matches .(any character) somewhere between 6 and 10 times
[a-zA-Z]: matches all characters between a and z and between A and Z
\d: matches a digit.
$: matches the end of input.
I think that just about does it for all the symbols in the regex you've posted
For your regex request, here is what you would use:
^(?=.{6,}$)(?=.*?\d)(?=.*?[!##$%&*()+_=?\^-]).*
And here it is unrolled for you:
^ // Anchor the beginning of the string (password).
(?=.{6,}$) // Look ahead: Six or more characters, then the end of the string.
(?=.*?\d) // Look ahead: Anything, then a single digit.
(?=.*?[!##$%&*()+_=?\^-]) // Look ahead: Anything, and a special character.
.* // Passes our look aheads, let's consume the entire string.
As you can see, the special characters have to be explicitly defined as there is not a reserved shorthand notation (like \w, \s, \d) for them. Here are the accepted ones (you can modify as you wish):
!, #, #, $, %, ^, &, *, (, ), -, +, _, =, ?
The key to understanding regex look aheads is to remember that they do not move the position of the parser. Meaning that (?=...) will start looking at the first character after the last pattern match, as will subsequent (?=...) look aheads.

Categories