Match a string until it meets a '(' - c#

I've managed to get everything (well, all letters) up to a whitespace using the following:
@"^.*([A-Z][a-z].*)]\s"
However, I want to match up to a ( instead of a whitespace... how can I manage this?
Without having the '(' in the match

If what you want is to match any character up until the ( character, then this should work:
@"^.*?(?=\()"
If you want all letters, then this should do the trick:
@"^[a-zA-Z]*(?=\()"
Explanation:
^ Matches the beginning of the string
.*? Zero or more of any character (except a newline, by default). The trailing ? means 'non-greedy',
which means the minimum characters that match, rather than the maximum
(?= This means 'zero-width positive lookahead assertion'. That means that the
containing expression won't be included in the match.
\( Escapes the ( character (since it has special meaning in regular
expressions)
) Closes off the lookahead
[a-zA-Z]* Zero or more of any character from a to z, or from A to Z
Reference: Regular Expression Language - Quick Reference (MSDN)
EDIT: Actually, instead of using .*?, as Casimir has noted in his answer, it's probably easier to use [^\(]*. The ^ used inside a character class (a character class is the [...] construct) inverts the meaning, so instead of "any of these characters", it means "any except these characters". So the expression using that construct would be:
@"^[^\(]*(?=\()"
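A minimal C# sketch of the two patterns above may make the difference easier to see. The class name, method names and the sample input "Hello World (2013)" are made up for illustration and are not from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BeforeParen
{
    // Lazy version: expand from the start one character at a time
    // until the lookahead sees a '(' immediately ahead.
    public static string WithLookahead(string input) =>
        Regex.Match(input, @"^.*?(?=\()").Value;

    // Negated-class version: consume everything that is not '(',
    // then require that a '(' follows without including it.
    public static string WithNegatedClass(string input) =>
        Regex.Match(input, @"^[^\(]*(?=\()").Value;

    public static void Main()
    {
        // Hypothetical sample input; brackets make the trailing space visible.
        string sample = "Hello World (2013)";
        Console.WriteLine("[" + WithLookahead(sample) + "]");
        Console.WriteLine("[" + WithNegatedClass(sample) + "]");
    }
}
```

Both methods return an empty string when the input contains no ( at all, because the lookahead can never succeed. One difference to keep in mind: without RegexOptions.Singleline the dot does not match a newline, whereas [^\(] does, so the two versions can behave differently on multi-line input.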

Using a constraining character class is the best way
@"^[^(]*"
[^(] means any character except (
Note that you don't need a capture group, since what you want is the whole match.

You can use this pattern:
([A-Z][a-z][^(]*)\(
The group will match a capital Latin letter, followed by a lower-case Latin letter, followed by any number of characters other than an open parenthesis. Note that ^.* is not necessary.
Or this, which produces the same basic behavior but uses a non-greedy quantifier instead:
([A-Z][a-z].*?)\(

Related

C# Regex specify allowed start and end conditions

I'm trying to create a regex expression with the following requirements:
The value:
Must start with a-z or _, numbers are OK after the first character
Can have parentheses if they are opened and closed with a number inside at the end of the string, e.g. SomeVar(10) is OK, SomeVar(10 is not OK.
Can have a . but only one at a time, and only between letters or numbers. SomeVar.InnerVar is OK, SomeVar..InnerVar is not OK.
My try at the regex:
[a-zA-Z_]
??
??
Assuming you want to match an entire string, you may use something like the following:
^[a-zA-Z_](?:\w|(?<=\w)\.(?=\w))*(?:\(\d+\))?$
Demo.
If you want to match partial strings, you'd need to decide what boundaries are allowed. Otherwise, "SomeVar(10" would produce a match (namely the part that comes before the opening parenthesis).
Notes:
\w matches a lowercase/uppercase letter, a digit, or an underscore. But it also matches Unicode letters and numbers. If you don't want that, you could use [a-zA-Z0-9_] instead.
Similarly, \d matches any Unicode digit. You either use it or use [0-9] depending on your requirements.
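As a rough check, the whole-string pattern above can be run against the examples from the question. This is only a sketch: the class and method names are hypothetical, and "1abc" is an extra made-up input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdentifierCheck
{
    // Identifier start, then word characters or single dots that sit
    // between word characters, then an optional "(digits)" suffix.
    private static readonly Regex Pattern =
        new Regex(@"^[a-zA-Z_](?:\w|(?<=\w)\.(?=\w))*(?:\(\d+\))?$");

    public static bool IsValid(string value) => Pattern.IsMatch(value);

    public static void Main()
    {
        string[] samples =
        {
            "SomeVar(10)", "SomeVar(10", "SomeVar.InnerVar", "SomeVar..InnerVar", "1abc"
        };
        foreach (string s in samples)
            Console.WriteLine(s + " -> " + IsValid(s));
    }
}
```

The lookbehind and lookahead around the dot are what reject a doubled or trailing dot: each dot must have a word character on both sides.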
Use
^[a-zA-Z_][a-zA-Z0-9_]*(\.[a-zA-Z_][a-zA-Z0-9_]*)*(\([^()]*\))?$
See proof.
[a-zA-Z_][a-zA-Z0-9_]* - a letter or underscore, then zero or more letters, digits, underscores
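(\([^()]*\))? - optional group, parens may be present or absent. Note that [^()]* accepts any content without parentheses (including nothing); use \d+ instead if only a number is allowed.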
(\.[a-zA-Z_][a-zA-Z0-9_]*)* - dot is allowed between letter/digit/underscore.

Explain the Regex mentioned

Can anyone please explain the regex below? It has been used in my application for a very long time, since before I joined, and I am very new to regexes.
/^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$/
As far as I understand
this regex will validate
- for a minimum of 6 chars to a maximum of 10 characters
- will escape the characters like ^ and $
Also, my basic need is a regex for a minimum of 6 characters, with one character being a digit and another being a special character.
^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$
^ is called an "anchor". It basically means that any following text must be immediately after the "start of the input". So ^B would match "B" but not "AB", because in the second case "B" is not the first character.
.* matches 0 or more characters - any character except a newline (by default). This is what's known as a greedy quantifier - the regex engine will match ("consume") all of the characters to the end of the input (or the end of the line) and then work backwards for the rest of the expression (it "gives up" characters only when it must). In a regex, once a character is "matched" no other part of the expression can "match" it again (except for zero-width lookarounds, which are coming next).
(?=.{6,10}) is a lookahead anchor and it matches a position in the input. It finds a place in the input where there are 6 to 10 characters following, but it does not "consume" those characters, meaning that the following expressions are free to match them.
(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z]) is another lookahead anchor. It matches a position in the input where the following text contains four letters ([a-zA-Z] matches one lowercase or uppercase letter), but any number of other characters (including zero characters) may be between them. For example: "++a5b---C#D" would match. Again, being an anchor, it does not actually "consume" the matched characters - it only finds a position in the text where the following characters match the expression.
(?=.*\d.*\d) Another lookahead. This matches a position where two digits follow (with any number of other characters in between).
.* Already covered this one.
$ This is another kind of anchor that matches the end of the input (or the end of a line - the position just before a newline character). It says that the preceding expression must match characters at the end of the string. When ^ and $ are used together, it means that the entire input must be matched (not just part of it). So /bcd/ would match "abcde", but /^bcd$/ would not match "abcde" because "a" and "e" could not be included in the match.
NOTE
This looks like a password validation regex. If it is, please note that it's broken. The .* at the beginning and end will allow the password to be arbitrarily longer than 10 characters. It could also be rewritten to be a bit shorter. I believe the following will be an acceptable (and slightly more readable) substitute:
^(?=(.*[a-zA-Z]){4})(?=(.*\d){2}).{6,10}$
Thanks to @nhahtdh for pointing out the correct way to implement the character length limit.
Check Cyborgx37's answer for the syntax explanation. I'll do some explanation on the meaning of the regex.
^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$
The first .* is redundant, since everything that follows it is either a zero-width assertion that itself begins with . or .*, or the .* at the end.
The regex will match minimum 6 characters, due to the assertion (?=.{6,10}). However, there is no upper limit on the number of characters of the string that the regex can match. This is because of the .* at the end (the .* in the front also contributes).
This (?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z]) part asserts that there are at least 4 English alphabet characters (uppercase or lowercase). And (?=.*\d.*\d) asserts that there are at least 2 digits (0-9). Since [a-zA-Z] and \d are disjoint sets, these 2 conditions combined make the (?=.{6,10}) redundant.
The syntax of .*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z] is also needlessly verbose. It can be shortened with the use of repetition: (?:.*[a-zA-Z]){4}.
The following regex is equivalent to your original regex. However, I really doubt that your current one and this equivalent rewrite do what you want:
^(?=(?:.*[a-zA-Z]){4})(?=(?:.*\d){2}).*$
This version is more explicit about the length, since clarity is always better. The meaning stays the same:
^(?=(?:.*[a-zA-Z]){4})(?=(?:.*\d){2}).{6,}$
Recap:
Minimum length = 6
No limit on maximum length
At least 4 English letters, lowercase or uppercase
At least 2 digits 0-9
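The missing upper limit can be seen by running the original regex next to the bounded rewrite with .{6,10}$ quoted earlier in this thread. The following is only a sketch; the class name, method names and sample strings are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PasswordLength
{
    // The regex from the question, unchanged.
    private static readonly Regex Original = new Regex(
        @"^.*(?=.{6,10})(?=.*[a-zA-Z].*[a-zA-Z].*[a-zA-Z].*[a-zA-Z])(?=.*\d.*\d).*$");

    // The rewrite that consumes the whole input with .{6,10}, which
    // is what actually enforces the maximum length.
    private static readonly Regex Bounded = new Regex(
        @"^(?=(.*[a-zA-Z]){4})(?=(.*\d){2}).{6,10}$");

    public static bool MatchesOriginal(string s) => Original.IsMatch(s);
    public static bool MatchesBounded(string s) => Bounded.IsMatch(s);

    public static void Main()
    {
        // Hypothetical inputs: 6 characters, 14 characters, and 5 characters.
        foreach (string s in new[] { "abcd12", "abcd1234567890", "abc12" })
            Console.WriteLine(s + ": original=" + MatchesOriginal(s)
                                + ", bounded=" + MatchesBounded(s));
    }
}
```

The 14-character input is accepted by the original pattern but rejected by the bounded one, because only the bounded pattern ties the length check to both ends of the string.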
REGEXPLANATION
/.../: slashes are often used to represent the area where the regex is defined
^: matches beginning of input string
.: this can match any character
*: matches the previous symbol 0 or more times
.{6,10}: matches .(any character) somewhere between 6 and 10 times
[a-zA-Z]: matches all characters between a and z and between A and Z
\d: matches a digit.
$: matches the end of input.
I think that just about does it for all the symbols in the regex you've posted
For your regex request, here is what you would use:
^(?=.{6,}$)(?=.*?\d)(?=.*?[!@#$%&*()+_=?\^-]).*
And here it is unrolled for you:
^ // Anchor the beginning of the string (password).
(?=.{6,}$) // Look ahead: Six or more characters, then the end of the string.
(?=.*?\d) // Look ahead: Anything, then a single digit.
(?=.*?[!@#$%&*()+_=?\^-]) // Look ahead: Anything, and a special character.
.* // Passes our look aheads, let's consume the entire string.
As you can see, the special characters have to be explicitly defined as there is not a reserved shorthand notation (like \w, \s, \d) for them. Here are the accepted ones (you can modify as you wish):
!, @, #, $, %, ^, &, *, (, ), -, +, _, =, ?
The key to understanding regex lookaheads is to remember that they do not move the position of the parser. This means that (?=...) starts looking at the first character after the last pattern match, as do any subsequent (?=...) lookaheads.
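A small C# sketch of this kind of lookahead-based check follows. It assumes the intended special-character set is !@#$%^&*()-+_=? and uses made-up class, method and sample names, so treat it as an illustration rather than a finished password policy.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PasswordPolicy
{
    // Three lookaheads, all evaluated from position 0:
    //   at least six characters, at least one digit,
    //   at least one character from the assumed special set.
    private static readonly Regex Policy = new Regex(
        @"^(?=.{6,}$)(?=.*?\d)(?=.*?[!@#$%&*()+_=?\^-]).*");

    public static bool IsAcceptable(string password) => Policy.IsMatch(password);

    public static void Main()
    {
        // Hypothetical sample passwords.
        foreach (string p in new[] { "abc1!x", "abcdef1", "abcde!", "a1!" })
            Console.WriteLine(p + " -> " + IsAcceptable(p));
    }
}
```

Because none of the lookaheads consumes input, each one inspects the whole string independently, and the order in which the digit and the special character appear does not matter.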

Regular expression for matching php's constant definition

I wrote a regular expression for matching php's constant definition.
Example:
define('Symfony∞DI', SYS_DIRECTORY_PUBLIC . SYS_DIRECTORY_INCLUDES . SYS_DIRECTORY_CLASSES . SYS_DIRECTORY_EXTERNAL . 'symfony/di/');
Here is the regular expression:
define\((\"|\')+([\w-\.-∞]+)+(\"|\')+(,)+((\s)+(\"|\')+([\w-(\')-\\"-\.-∞-\s-(\\)-\/]+)+(\"|\')|(([\w-\s-\.-∞-(\\)-\/]+)))\);
When I executed with ActionScript it works fine. But when I executed with C# it gives me the following error:
parsing "define\((\"|\')+([\w-\.-∞]+)+(\"|\')+(,)+((\s)+(\"|\')+([\w-(\')-\\"-\.-∞-\s-(\\)-\/]+)+(\"|\')|(([\w-\s-\.-∞-(\\)-\/]+)))\);" - Cannot include class \s in character range.
Could you help me resolve this issue?
You seem to be using regexes in a completely convoluted way:
character classes: inside a class the - is special and is there to define a range. .NET rejects a range that ends in a shorthand class such as \s, as in ∞-\s (hence the error message), whereas ActionScript evidently tolerates it. Your character class should read [\w.∞] instead of [\w-.-∞], just to quote the first example;
no need to put a group around \s: \s+, not (\s)+; similarly, , instead of (,);
' is not special in a regex, and if you want to match one of two characters, use a character class, not a group + alternation: ['\"] instead of (\'|\") -- and note that the '"' is escaped only because you are in a double-quoted string;
your regex is not anchored at the beginning and it looks like you want to match define at the beginning of the input: ^define and not define.
The first point is probably the source of your problems.
Rewriting your regex with all of the above gives this (in double quotes):
"^define\(([\"'][\w.∞]+[\"'],(\s+[\"']+[\w'\".∞\s\\/]+)+[\"']|([\w\s.∞\\/]+))\);"
which definitely doesn't look like it will ever match your input...
Try this instead:
"^define\(\s*(['\"])[\w.∞]+\1\s*,\s*([\w/]+(\s*\.\s*[\w/]+)*)\s*\);$"
See fge's answer for the error you're having. Without knowing what you're trying to do, and without deviating too much from your original, here is an alternative regex:
define\(\s*(["'])\s*[\w.∞]+\s*\1(?:\s*[.,]\s*(["']?)\s*[\w/]+\s*\2)*\s*\);
define
\(
\s* (["'])
\s* [\w.∞]+
\s* \1
(?:
\s* [.,]
\s* (["']?)
\s* [\w/]+
\s* \2
)*
\s*
\);
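The expanded layout above can be used almost verbatim in C# with RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace in the pattern. The sketch below assumes that option is acceptable in your code base; the class and method names are hypothetical, and the sample input is the define(...) line from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DefineMatcher
{
    // Same pattern as above, laid out over several lines. In a C# verbatim
    // string a double quote is written as "" (two quote characters).
    private static readonly Regex Pattern = new Regex(@"
        define
        \(
        \s* ([""'])
        \s* [\w.∞]+
        \s* \1
        (?:
            \s* [.,]
            \s* ([""']?)
            \s* [\w/]+
            \s* \2
        )*
        \s*
        \);",
        RegexOptions.IgnorePatternWhitespace);

    public static bool IsDefine(string line) => Pattern.IsMatch(line);

    public static void Main()
    {
        string sample = "define('Symfony∞DI', SYS_DIRECTORY_PUBLIC . SYS_DIRECTORY_INCLUDES . "
                      + "SYS_DIRECTORY_CLASSES . SYS_DIRECTORY_EXTERNAL . 'symfony/di/');";
        Console.WriteLine(IsDefine(sample));
    }
}
```

The second capture group holds either a quote or an empty string, so the \2 backreference only requires a closing quote when an opening one was seen.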

Regexp Remove any non alphanumeric, but leave some special characters in one expression

I have this code that replaces all non-alphanumeric characters with the "-" char.
return Regex.Replace(strIn, @"[\W|_]+", "-", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
But I need to change it to let some special characters (one or more) through, for example: #,*,%
How do I change this regular expression?
Use
[^\p{L}\p{N}#*%]+
This matches one or more characters that are neither letters nor digits nor any of #, * or %.
Another option: you can use character class subtraction, for example to remove # from the character class:
[\W_-[#]]+
Just add other accepted special chars after the #. Live example here: http://rextester.com/rundotnet?code=YFQ40277
How about this one:
[^a-zA-Z0-9#*%]+
If you are using Unicode you can do this (as in Tim's answer):
[^\p{L}\p{N}#*%]+
Use this.
([^\w#*%]|_)
Add any other special characters after the %.
It is basically saying, match any character that is not (^) a word character(\w), #, * or % OR match _.
It seems this is the best solution for you:
@"(?!.*[^\w#*%])"
You can use set subtraction for that:
@"[\W_-[#*%]]+"
This matches the set of all non-word characters and the underscore, minus the set of #, * and %.
Note that you don't have to use | for "or" in a character class, since that's implied. In fact, the | in your regex just matches |.
Note also that in .NET, \w matches a few other "connector punctuation" characters besides the underscore. If you want to match the other characters too, you can use
@"[\W\p{Pc}-[#*%]]+"
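To compare the variants side by side, here is a minimal C# sketch. The class name, method names and the sample string "a#b c*d_e%f!" are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Sanitizer
{
    // Pattern from the question: every non-word character, '|' or '_' is replaced.
    public static string ReplaceAll(string s) =>
        Regex.Replace(s, @"[\W|_]+", "-");

    // .NET character class subtraction: non-word characters and '_',
    // minus the characters that should survive.
    public static string KeepSpecials(string s) =>
        Regex.Replace(s, @"[\W_-[#*%]]+", "-");

    // Negated class built from Unicode letter and number categories.
    public static string KeepSpecialsUnicode(string s) =>
        Regex.Replace(s, @"[^\p{L}\p{N}#*%]+", "-");

    public static void Main()
    {
        string sample = "a#b c*d_e%f!"; // hypothetical input
        Console.WriteLine(ReplaceAll(sample));
        Console.WriteLine(KeepSpecials(sample));
        Console.WriteLine(KeepSpecialsUnicode(sample));
    }
}
```

Character class subtraction with -[...] is specific to the .NET regex flavor (and a few others), so the second pattern is not portable to most other engines, while the third one is more widely supported.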

Problem with regex, how do I get all with \S up until a special character?

I've got the text:
192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)
And I'm trying to get the uniquePlayerReference and the videoId.
I've tried this regular expression:
(?<=uniquePlayerReference=)\S*
but it matches:
81781956||videoId=1)
And then I try and get the video id with this:
(?<=videoId=)\S*
But it matches the ) after the videoId.
My question is twofold:
1) How do I use the \S character and get it to stop at a character? (Essentially, what is the regex to do what I want?) I can't get it to stop at a defined character; I think I need to use a positive lookahead to match but not include the double pipe.
2) When should I use brackets?
The problem is the multiplicity operator you have here - the * - which means "as many as possible". If you have an explicit number in mind you can use the operator {a,b}, where a is the minimum and b the maximum number of matches, but if you have an unknown number, you can't use \S (which is too generic).
As for brackets, if you mean (), you use them to capture a part of a match for backreferencing. It's a bit complicated; I think you need to consult a reference for that.
I think you want something like this:
/uniquePlayerReference=(\d+)\|\|videoId=(\d+)/i
and then backreference to \1 and \2 respectively.
Given that both id's are numeric you are probably better off using \d instead of \S. \d only matches numeric digits whereas \S matches any non-whitespace character.
What you might also do is a non-greedy match up to the character you do not want to match, like so:
uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)
Note that I have escaped both the | and ) characters because otherwise they would have a special meaning inside a regex.
In C# you would use this like so (which also answers your question about what the brackets are for: they capture parts of the matched result):
Regex regex = new Regex(@"uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)");
Match match = regex.Match(
    "192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)");
if (match.Success)
{
    // Groups[0] is the whole match; Groups[1] and Groups[2] are the two (...) captures.
    string playerReference = match.Groups[1].Value;
    string videoId = match.Groups[2].Value;
    // Etc.
}
If the ID isn't just digits then you could use [^|] instead of \S, i.e.
(?<=uniquePlayerReference=)[^|]*
Then you can use
(?<=videoId=)[^)]*
For the video ID
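The two lookbehind patterns above can be tried against the log line from the question with a short C# sketch like this one. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenFields
{
    // Lookbehind anchors the match right after "uniquePlayerReference=";
    // [^|]* then stops at the first pipe.
    public static string PlayerReference(string line) =>
        Regex.Match(line, @"(?<=uniquePlayerReference=)[^|]*").Value;

    // Same idea, stopping at the closing parenthesis.
    public static string VideoId(string line) =>
        Regex.Match(line, @"(?<=videoId=)[^)]*").Value;

    public static void Main()
    {
        string line = "192.168.20.31 Url=/flash/56553550_hi.mp4?token="
                    + "(uniquePlayerReference=81781956||videoId=1)";
        Console.WriteLine(PlayerReference(line));
        Console.WriteLine(VideoId(line));
    }
}
```

With this approach the value is the whole match, so no capture groups are needed; the trade-off is one regex call per field instead of a single call that captures both.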
The \S means it matches any non-whitespace character, including the closing parenthesis. So if you had to use \S, you would have to explicitly say stop at the closing parenthesis, like this:
videoId=(\S+)\)
Therefore, you are better off using the \d, since what you are looking for are numeric:
uniquePlayerReference=(\d+)
videoId=(\d+)
