Regular expression for matching PHP's constant definition - C#

I wrote a regular expression for matching php's constant definition.
Example:
define('Symfony∞DI', SYS_DIRECTORY_PUBLIC . SYS_DIRECTORY_INCLUDES . SYS_DIRECTORY_CLASSES . SYS_DIRECTORY_EXTERNAL . 'symfony/di/');
Here is the regular expression:
define\((\"|\')+([\w-\.-∞]+)+(\"|\')+(,)+((\s)+(\"|\')+([\w-(\')-\\"-\.-∞-\s-(\\)-\/]+)+(\"|\')|(([\w-\s-\.-∞-(\\)-\/]+)))\);
It works fine when I run it in ActionScript, but when I run it in C# I get the following error:
parsing "define\((\"|\')+([\w-\.-∞]+)+(\"|\')+(,)+((\s)+(\"|\')+([\w-(\')-\\"-\.-∞-\s-(\\)-\/]+)+(\"|\')|(([\w-\s-\.-∞-(\\)-\/]+)))\);" - Cannot include class \s in character range.
Could you help me resolve this issue?

You seem to be using regexes in a completely convoluted way:
1. character classes: inside [...] the - is special and is there to define a range. .NET refuses a shorthand class such as \s as the endpoint of a range (that is what the error message says), whereas ActionScript apparently tolerates it. List the members without the hyphens: your character class should read [\w.∞] instead of [\w-\.-∞], just to quote the first example;
2. no need to put a group around \s: \s+, not (\s)+; similarly, , instead of (,);
3. ' is not special in a regex, and if you want to match one of two characters, use a character class, not a group + alternation: ['\"] instead of (\'|\") -- and note that the " is escaped only because you are in a double-quoted string;
4. your regex is not anchored at the beginning and it looks like you want to match define at the beginning of the input: ^define and not define.
Item 1 is the source of your error.
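The character-class point above can be checked in isolation. This is a minimal sketch, assuming a plain console program; the class name RangeCheck and the helper IsValidPattern are made up for illustration. It relies only on the Regex constructor throwing ArgumentException (or a subclass) when a pattern cannot be parsed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeCheck
{
    // Hypothetical helper: true if .NET can parse the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Thrown by the Regex constructor on a parse error.
            return false;
        }
    }

    public static void Main()
    {
        // "∞-\s" asks for a range whose endpoint is a shorthand class.
        Console.WriteLine(IsValidPattern(@"[\w.∞-\s]"));
        // The same members listed without a range.
        Console.WriteLine(IsValidPattern(@"[\w.∞\s]"));
    }
}
```

The first pattern should be rejected and the second accepted, which is the difference between the original character classes and the rewritten ones.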
Rewriting your regex with all of the above gives this (in double quotes):
"^define\(([\"'][\w.∞]+[\"'],(\s+[\"']+[\w'\".∞\s\\/]+)+[\"']|([\w\s.∞\\/]+))\);"
which definitely doesn't look like it will ever match your input...
Try this instead:
"^define\(\s*(['\"])[\w.∞]+\1\s*,\s*(['\"]?[\w/]+['\"]?(\s*\.\s*['\"]?[\w/]+['\"]?)*)\s*\);$"

See fge's answer for the error you're having. Without knowing what you're trying to do, and without deviating too much from your original, here is an alternative regex:
define\(\s*(["'])\s*[\w.∞]+\s*\1(?:\s*[.,]\s*(["']?)\s*[\w/]+\s*\2)*\s*\);
define
\(
\s* (["'])
\s* [\w.∞]+
\s* \1
(?:
\s* [.,]
\s* (["']?)
\s* [\w/]+
\s* \2
)*
\s*
\);
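As a quick check of the alternative regex against the sample line from the question, here is a small sketch. The class name DefineMatcher and the method IsDefine are illustrative assumptions; the pattern is the one above, written as a C# verbatim string, where "" stands for a single double quote.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DefineMatcher
{
    // Same pattern as above; only the quoting is adapted to a verbatim string.
    private static readonly Regex DefineRegex = new Regex(
        @"define\(\s*([""'])\s*[\w.∞]+\s*\1(?:\s*[.,]\s*([""']?)\s*[\w/]+\s*\2)*\s*\);");

    public static bool IsDefine(string line) => DefineRegex.IsMatch(line);

    public static void Main()
    {
        string sample =
            "define('Symfony∞DI', SYS_DIRECTORY_PUBLIC . SYS_DIRECTORY_INCLUDES . " +
            "SYS_DIRECTORY_CLASSES . SYS_DIRECTORY_EXTERNAL . 'symfony/di/');";

        Console.WriteLine(IsDefine(sample));                // sample from the question
        Console.WriteLine(IsDefine("define(Symfony, 1);")); // constant name not quoted
    }
}
```

The second call is there to show that the pattern insists on a quoted constant name, because the group (["']) is not optional.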

Related

Regular expression in RegularExpressionAttribute behavior

I am using this regular expression: @"[ \]\[;\/\\\?:*""<>|+=]|^[.]|[.]$"
First part [ \]\[;\/\\\?:*""<>|+=] should match any of the characters inside the brackets.
Next part ^[.] should match if the string starts with a 'dot'
Last part [.]$ should match if the string ends with a 'dot'
This works perfectly fine if I use Regex.IsMatch() function. However if I use RegularExpressionAttribute in ASP.NET MVC, I always get invalid model. Does anyone have any clue why this behavior occurs?
Examples:
"abcdefg" should not match
".abcdefg" should match
"abc.defg" should not match
"abcdefg." should match
"abc[defg" should match
Thanks in advance!
EDIT:
The RegularExpressionAttribute specifies that a data field value in ASP.NET Dynamic Data must match the specified regular expression.
Which means I need "abcdefg" to match and ".abcdefg" to not match. Basically, I need to negate the whole expression I have above.
You need to make sure the pattern matches the entire string.
In a general case, you may append/prepend the pattern with .*.
Here, you may use
.*[ \][;/\\?:*"<>|+=].*|^[.].*|.*[.]$
Or, to make it a bit more efficient (that is, to reduce backtracking in the first branch) a negated character class will perform better:
[^ \][;/\\?:*"<>|+=]*[ \][;\/\\?:*"<>|+=].*|^[.].*|.*[.]$
But it is best to put the branches matching text at the start/end of the string as first branches:
^[.].*|.*[.]$|[^ \][;/\\?:*"<>|+=]*[ \][;/\\?:*"<>|+=].*
NOTE: You do not have to escape / in a .NET regex, since .NET has no regex delimiters, and ? does not need escaping inside a character class.
C# declaration of the last pattern will look like
@"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*"
RegularExpressionAttribute:
[RegularExpression(
    @"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*",
    ErrorMessage = "Username cannot contain following characters: ] [ ; / \\ ? : * \" < > | + =")
]
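The difference between Regex.IsMatch() and the attribute can be reproduced without ASP.NET. The sketch below is an emulation, not the attribute itself: the hypothetical helper MatchesWholeString accepts a value only when the first match starts at index 0 and covers the entire input, which is the whole-string requirement described above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FullMatchDemo
{
    // Emulates the whole-string rule: the match must cover the entire input.
    public static bool MatchesWholeString(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success && m.Index == 0 && m.Length == input.Length;
    }

    public static void Main()
    {
        string original = @"[ \]\[;\/\\\?:*""<>|+=]|^[.]|[.]$";
        string padded = @"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*";

        string[] inputs = { "abcdefg", ".abcdefg", "abc.defg", "abcdefg.", "abc[defg" };
        foreach (string input in inputs)
        {
            Console.WriteLine(
                $"{input}: IsMatch={Regex.IsMatch(input, original)}, " +
                $"whole(original)={MatchesWholeString(original, input)}, " +
                $"whole(padded)={MatchesWholeString(padded, input)}");
        }
    }
}
```

For ".abcdefg" the original pattern matches only the leading dot, so IsMatch succeeds while the whole-string check fails; the padded pattern passes both.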
Your regex is an alternation of three branches, each matching a single character: one character from the character class, a dot at the start of the string, or a dot at the end of the string.
Regex.IsMatch() succeeds because one of the branches matches somewhere in the string, but no branch matches the whole string, which is what RegularExpressionAttribute requires.
You could use three alternatives that each span the whole string: the first matches a dot followed by the repeated character class up to the end of the string; the second is the other way around, with the dot at the end of the string.
The third uses a positive lookahead to assert that the string contains at least one of the characters [\][;\/\\?:*"<>|+=]:
^\.[a-z \][;\/\\?:*"<>|+=]+$|^[a-z \][;\/\\?:*"<>|+=]+\.$|^(?=.*[\][;\/\\?:*"<>|+=])[a-z \][;\/\\?:*"<>|+=]+$

Regex pattern in C# with empty space

I am having an issue with a regular expression and can't find the answer to my question.
I am trying to build a regex pattern that returns any matches that have # around them. For example, #match# or #mt# would both come back.
This works fine for that. #.*?#
However I don't want matches on ## to show up. Basically if there is nothing between the pound signs don't match.
Hope this makes sense.
Thanks.
Please use + to match 1 or more symbols:
#+.+#+
UPDATE:
If you want to only match substrings that are enclosed with single hash symbols, use:
(?<!#)#(?!#)[^#]+#(?!#)
Explanation:
(?<!#)#(?!#) - a # symbol that is not preceded with a # (due to the negative lookbehind (?<!#)) and not followed by a # (due to the negative lookahead (?!#))
[^#]+ - one or more symbols other than # (due to the negated character class [^#])
#(?!#) - a # symbol not followed with another # symbol.
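A short sketch of how the lookaround pattern above might be used from C#. The class name HashTokens, the method Find and the sample sentence are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HashTokens
{
    // Returns every substring enclosed in single # symbols, including the hashes.
    public static string[] Find(string text) =>
        Regex.Matches(text, @"(?<!#)#(?!#)[^#]+#(?!#)")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // The empty pair "##" in the middle is skipped.
        Console.WriteLine(string.Join(", ", Find("a #match# b ## c #mt#")));
        // prints: #match#, #mt#
    }
}
```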
Instead of using * to match between zero and unlimited characters, replace it with +, which will only match if there is at least one character between the #'s. The edited regex should look like this: #.+?#. Hope this helps!
Edit
Sorry for the incorrect regex, I had not expected multiple hash signs. This should work for your sentence: #+.+?#+
Edit 2
I am pretty sure I got it. Try this: (?<!#)#[^#].*?#. It might not work as expected with triple hashes though.
Try:
[^#]?#.+#[^#]?
The [^character_group] construction matches any single character not included in the character group. The ? after it makes that character optional (it matches the preceding element zero or one time), which lets the pattern match at the beginning/end of a string.

Match a string until it meets a '('

I've managed to get everything (well, all letters) up to a whitespace using the following:
@"^.*([A-Z][a-z].*)]\s"
However, I want to match up to a ( instead of a whitespace, without having the ( in the match. How can I manage this?
If what you want is to match any character up until the ( character, then this should work:
@"^.*?(?=\()"
If you want all letters, then this should do the trick:
@"^[a-zA-Z]*(?=\()"
Explanation:
^          Matches the beginning of the string
.*?        Zero or more of any character. The trailing ? means 'non-greedy',
           which means the minimum characters that match, rather than the maximum
(?=        This means 'zero-width positive lookahead assertion'. That means that the
           contained expression won't be included in the match.
\(         Escapes the ( character (since it has special meaning in regular
           expressions)
)          Closes off the lookahead
[a-zA-Z]*  Zero or more of any character from a to z, or from A to Z
Reference: Regular Expression Language - Quick Reference (MSDN)
EDIT: Actually, instead of using .*?, as Casimir has noted in his answer, it's probably easier to use [^\(]*. The ^ used inside a character class (a character class is the [...] construct) inverts the meaning, so instead of "any of these characters", it means "any except these characters". So the expression using that construct would be:
@"^[^\(]*(?=\()"
Using a negated character class is the best way:
@"^[^(]*"
[^(] means any character but (
Note that you don't need a capture group, since what you want is the whole match.
You can use this pattern:
([A-Z][a-z][^(]*)\(
The group will match a capital Latin letter, followed by a lower-case Latin letter, followed by any number of characters other than an open parenthesis. Note that ^.* is not necessary.
Or this, which produces the same basic behavior but uses a non-greedy quantifier instead:
([A-Z][a-z].*?)\(
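To compare the variants above, here is a small sketch. The class UpToParen and its two methods are hypothetical names; the patterns are the negated-class ones from the answers. The plain ^[^(]* version also succeeds when the input contains no ( at all, while the lookahead version only succeeds when a ( actually follows.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UpToParen
{
    // Everything from the start of the string up to, but not including, the first '('.
    public static string Prefix(string input) =>
        Regex.Match(input, @"^[^(]*").Value;

    // Same prefix, but the match only succeeds if a '(' follows it.
    public static bool TryPrefix(string input, out string prefix)
    {
        Match m = Regex.Match(input, @"^[^(]*(?=\()");
        prefix = m.Value; // empty string when the match fails
        return m.Success;
    }

    public static void Main()
    {
        Console.WriteLine($"[{Prefix("Hello World (test)")}]");
        Console.WriteLine($"[{Prefix("no parenthesis")}]");
        Console.WriteLine(TryPrefix("no parenthesis", out _));
    }
}
```

Which behaviour is preferable depends on whether input without a parenthesis should count as a match.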

C# Regex - How to remove multiple paired parentheses from string

I am trying to figure out how to use C# regular expressions to remove all instances of paired parentheses from a string. The parentheses and all text between them should be removed. The parentheses aren't always on the same line. Also, there might be nested parentheses. An example of the string would be
This is a (string). I would like all of the (parentheses
to be removed). This (is) a string. Nested ((parentheses) should) also
be removed. (Thanks) for your help.
The desired output should be as follows:
This is a . I would like all of the . This a string. Nested also
be removed. for your help.
Fortunately, .NET regexes can match nested constructs by means of balancing groups (see Balancing Group Definitions):
Regex regexObj = new Regex(
    @"\(              # Match an opening parenthesis.
      (?>             # Then either match (possessively):
       [^()]+         #  any characters except parentheses
      |               # or
       \( (?<Depth>)  #  an opening paren (and increase the parens counter)
      |               # or
       \) (?<-Depth>) #  a closing paren (and decrease the parens counter).
      )*              # Repeat as needed.
     (?(Depth)(?!))   # Assert that the parens counter is at zero.
     \)               # Then match a closing parenthesis.",
    RegexOptions.IgnorePatternWhitespace);
In case anyone is wondering: The "parens counter" may never go below zero ((?<-Depth>) will fail otherwise), so even if the parentheses are "balanced" but aren't correctly matched (like ()))((()), this regex will not be fooled.
For more information, read Jeffrey Friedl's excellent book "Mastering Regular Expressions" (p. 436)
Alternatively, you can repeatedly replace /\([^\)\(]*\)/g with the empty string until no more matches are found.
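The repeated-replacement idea could look like this in C#. This is a sketch with made-up names (ParenStripper, Strip): each pass removes the innermost pairs, and the loop stops once a pass changes nothing. Because [^()] also matches line breaks, pairs spanning several lines are removed as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParenStripper
{
    // One innermost pair: '(' ... ')' with no parentheses in between.
    private static readonly Regex Innermost = new Regex(@"\([^()]*\)");

    public static string Strip(string text)
    {
        string previous;
        do
        {
            previous = text;
            text = Innermost.Replace(text, string.Empty);
        } while (text != previous); // stop once a pass removes nothing

        return text;
    }

    public static void Main()
    {
        Console.WriteLine(Strip("Nested ((parentheses) should) also be removed."));
    }
}
```

The whitespace around a removed pair is left in place, so the result contains a double space where the nested group used to be.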
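How about this: Regex.Replace seems to do the trick.
string Remove(string s, char begin, char end)
{
    Regex regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
    return regex.Replace(s, string.Empty);
}
string s = "Hello (my name) is (brian)";
s = Remove(s, '(', ')');
Output would be:
"Hello  is "
Note that the spaces around the removed parentheses remain, and that this simple version handles neither nested parentheses nor parentheses that span several lines.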
Normally, it is not an option. However, Microsoft does have some extensions to standard regular expressions. You may be able to achieve this with Grouping Constructs even if it is faster to code as an algorithm than to read and understand Microsoft's explanation of their extension.

Problem with regex, how do I get all with \S up until a special character?

I've got the text:
192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)
And I'm trying to get the uniquePlayerReference and the videoId.
I've tried this regular expression:
(?<=uniquePlayerReference=)\S*
but it matches:
81781956||videoId=1)
And then I try and get the video id with this:
(?<=videoId=)\S*
But it matches the ) after the videoId.
My question is twofold:
1) How do I use \S and get it to stop at a defined character? (Essentially, what is the regex to do what I want?) I think I need a positive lookahead to match up to, but not include, the double pipe.
2) When should I use brackets?
The problem is the quantifier you have here, the *, which means "as many as possible". If you have an explicit number in mind you can use the quantifier {a,b}, where a is the minimum and b the maximum number of matches, but if the number is unknown, \S is too generic to stop where you want.
As for brackets, if you mean (), you use them to capture a part of a match for backreferencing. It's a bit complicated; I think you need to consult a reference for that.
I think you want something like this:
/uniquePlayerReference=(\d+)\|\|videoId=(\d+)/i
and then backreference to \1 and \2 respectively.
Given that both id's are numeric you are probably better off using \d instead of \S. \d only matches numeric digits whereas \S matches any non-whitespace character.
What you might also do is a non-greedy match up to the character you do not want to match, like so:
uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)
Note that I have escaped both the | and ) characters because otherwise they would have a special meaning inside a regex.
In C# you would use this like so: (which also answers your question what the brackets are for, they are meant to capture parts of the matched result).
Regex regex = new Regex(@"uniquePlayerReference=(.*?)\|\|videoId=(.*?)\)");
Match match = regex.Match(
    "192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)");
if (match.Success)
{
    string playerReference = match.Groups[1].Value;
    string videoId = match.Groups[2].Value;
    // Etc.
}
If the ID isn't just digits then you could use [^|] instead of \S, i.e.
(?<=uniquePlayerReference=)[^|]*
Then you can use
(?<=videoId=)[^)]*
For the video ID
The \S means it matches any non-whitespace character, including the closing parenthesis. So if you had to use \S, you would have to explicitly say stop at the closing parenthesis, like this:
videoId=(\S+)\)
Therefore, you are better off using \d, since what you are looking for is numeric:
uniquePlayerReference=(\d+)
videoId=(\d+)
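Putting the \d advice together, both IDs can be read in one pass. This is a minimal sketch, assuming both values are always numeric and always appear in this order; the class TokenParser and the group names player and video are illustrative choices, not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenParser
{
    // Named groups make the captures easier to read than Groups[1] / Groups[2].
    private static readonly Regex TokenRegex = new Regex(
        @"uniquePlayerReference=(?<player>\d+)\|\|videoId=(?<video>\d+)");

    public static bool TryParse(string line, out string player, out string video)
    {
        Match m = TokenRegex.Match(line);
        player = m.Groups["player"].Value; // empty when there is no match
        video = m.Groups["video"].Value;
        return m.Success;
    }

    public static void Main()
    {
        string line =
            "192.168.20.31 Url=/flash/56553550_hi.mp4?token=(uniquePlayerReference=81781956||videoId=1)";

        if (TryParse(line, out string player, out string video))
        {
            Console.WriteLine($"{player} {video}"); // prints: 81781956 1
        }
    }
}
```

Because \d+ stops at the first non-digit, neither the || separator nor the closing parenthesis ends up in the captured values.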
