I have a regular expression designed to extract a number from between two parentheses. It had been working fine until we made the input string customizable. Now, if a number is found somewhere else in the string, the last number is taken. My expression is below:
int icorrespid = Convert.ToInt32(Regex.Match(subject, @"(\d+)(?!.*\d)", RegexOptions.RightToLeft).Value);
If I send the string This (12) is a test, it works fine, extracting the 12. However, if I send This (12) is a test2, the result is 2. I realize I can change the RightToLeft to LeftToRight, which will fix this instance, but I only want to get the number between the parentheses.
I am sure this will be easy for anyone with any regular expression experience (which is obviously not me). I am hoping you could show me how to correct this to get what I want, but also give a brief explanation of what I am doing wrong so I can hopefully improve.
Thank you.
Additional Information
I appreciate all of the responses. I have taken the agreed upon advice, and tried each of these formats:
int icorrespid = Convert.ToInt32(Regex.Match(subject, @"(\(\d+\))(?!.*\d)", RegexOptions.RightToLeft).Value);
int icorrespid = Convert.ToInt32(Regex.Match(subject, @"(\(\d+\))", RegexOptions.RightToLeft).Value);
int icorrespid = Convert.ToInt32(Regex.Match(subject, @"\(\d+\)", RegexOptions.RightToLeft).Value);
Unfortunately, with each, I get an exception stating that the input string was not in a correct format. I did not get that before. I'm sure that I could resolve this without using a regular expression in a minute or two, but my stubbornness has kicked in.
Thank you everyone for your comments.
You need to escape parentheses in a regex, because they have a special meaning:
@"(\(\d+\))(?!.*\d)"
or, if you didn't actually intend your number to be caught in a group:
@"\(\d+\)(?!.*\d)"
Try this regular expression:
\((\d+)\)
The brackets are escaped \( and \) and inside them is the normal search for numbers.
If you use the .Value property, it will give you the number surrounded by brackets. Instead you need to use the Groups collection. So to use in your code, you do this: (now with added error checking!)
var match = Regex.Match("hgf", @"\((\d+)\)", RegexOptions.RightToLeft).Groups[1].Value;
if(!string.IsNullOrEmpty(match))
{
var icorrespid = Convert.ToInt32(match);
}
else
{
//No match found
}
Use:
\(\d+\)(?!.*\d)
( and ) are reserved characters that delimit a capture group, so they must be escaped to match literally.
Parentheses have a meaning in regex, so you need to escape them:
\(\d+\)
The actual meaning is to create a capture group, so if you're relying on a capture group in your code, you need another pair of parentheses like this:
\((\d+)\)
I'm not quite sure what the purpose of the (?!.*\d) part is from your question, but if you do need it, you can leave it where it is (just append it to the end of either of the versions above).
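To tie these answers together, here is a minimal sketch written as C# top-level statements. It also shows why the attempts in the question's update threw a format exception: .Value of the whole match includes the parentheses, so only the capture group is parsed here. The helper name TryGetParenthesizedNumber is made up for illustration and is not from the original posts.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns true and the number when the input contains "(digits)".
static bool TryGetParenthesizedNumber(string subject, out int number)
{
    // \( and \) match literal parentheses; (\d+) captures only the digits.
    Match match = Regex.Match(subject, @"\((\d+)\)", RegexOptions.RightToLeft);
    number = 0;
    // Groups[1] holds the digits; match.Value would be "(12)", which is not a valid integer.
    return match.Success && int.TryParse(match.Groups[1].Value, out number);
}

foreach (string subject in new[] { "This (12) is a test", "This (12) is a test2", "no number here" })
{
    if (TryGetParenthesizedNumber(subject, out int id))
        Console.WriteLine($"{subject} -> {id}");
    else
        Console.WriteLine($"{subject} -> no match");
}
```

With RightToLeft kept, the last parenthesized number in the string wins; drop the option if the first one is wanted instead.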
I have an awful time with regular expressions, so I usually resort to lousy kludges and workarounds when parsing strings. I need to get better at using regex. This one seems simple to me, but I don't even know where to start.
Here's the string output from my device:
testString = IP:192.168.5.210\rPlaylist:1\rEnable:On\rMode:HDMI\rLineIn:unbal\r
Example:
I want to find out if the device is off or on. I need to search for the string "Enable:", then locate the carriage return and determine if the word between Enable: and \r is Off or On. It seems like that's what regex is for, or do I totally misunderstand it?
Can someone point me in the right direction?
Additional information - Maybe I need to expand on the question.
Based on the answers, finding whether or not the device is enabled appears to be fairly simple. Since the return string is essentially a list of key/value pairs, what's more vexing is determining the substring between the : and the carriage return. A number of these pairs have responses whose lengths vary significantly, such as DeviceLocation, DeviceName, and IPAddress. In fact, the device responds to every command sent to it by returning the entire status list, 48 key/value pairs, which I then must parse even if I only need to know one property.
Also, based on your answers... regular expressions are not the way to go.
Thanks for any help.
Norm
I would suggest for a simple line as shown, ask for one or the other, but verify as well. Based partially off Ken White's suggestions.
if(input.Contains(":On")){
//DoWork()
}else{
if(input.Contains(":Off"))
//DoOtherWork
}
This presumes that ":On" and ":Off" will not appear anywhere else in the string, even with a different string.
Consider the following code:
// This regular expression matches the text 'Enable: ' followed by one or more non-'\r' characters, followed by '\r'.
// RegexOptions.Multiline is optional but MAY be necessary on other platforms.
// Also, '\r' is not a line break. '\n' is.
// Note: this sample input has a space after each colon; use "Enable: ?" if the device omits it.
Regex regex = new Regex("Enable: ([^\r]+)\r", RegexOptions.Multiline);
string input = "IP:192.168.5.210\rPlaylist: 1\rEnable: On\rMode: HDMI\rLineIn: unbal\r";
Match match = regex.Match(input);
Debug.Assert(match.Success);
// The match variable will contain 2 Groups:
// The first will be 'Enable: On\r' (the whole match).
// The other is 'On', since we enclosed [^\r]+ in ().
Console.WriteLine(match.Groups[1].Value);
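Since the device returns the whole status list every time, one option that needs no regex at all is to split the response once and look values up by key. This is only a sketch under the assumption that every entry has the form Key:Value terminated by \r, as in the question's sample string; the helper name ParseStatus is invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits "Key:Value\rKey:Value\r" into a dictionary.
static Dictionary<string, string> ParseStatus(string response)
{
    var status = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    foreach (string line in response.Split(new[] { '\r' }, StringSplitOptions.RemoveEmptyEntries))
    {
        // Split on the first colon only, so values may themselves contain colons.
        int colon = line.IndexOf(':');
        if (colon < 0) continue; // skip anything that is not a key/value pair
        status[line.Substring(0, colon).Trim()] = line.Substring(colon + 1).Trim();
    }
    return status;
}

string testString = "IP:192.168.5.210\rPlaylist:1\rEnable:On\rMode:HDMI\rLineIn:unbal\r";
Dictionary<string, string> device = ParseStatus(testString);
bool isOn = device.TryGetValue("Enable", out var enable)
            && enable.Equals("On", StringComparison.OrdinalIgnoreCase);
Console.WriteLine(isOn);          // True
Console.WriteLine(device["IP"]);  // 192.168.5.210
```

Values of any length (DeviceLocation, DeviceName and so on) come out the same way, because everything after the first colon is kept as the value.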
I really know very little about regex's.
I'm trying to test a password validation.
Here's the regex that describes it (I didn't write it, and don't know what it means):
private static string passwordField = "[^A-Za-z0-9_.\\-!@#$%^&*()=+;:'\"|~`<>?\\/{}]";
I've tried a password like "dfgbrk*", and my code, using the above regex, allowed it.
Is this consistent with what the regex defines as acceptable, or is it a problem with my code?
Can you give me an example of a string that validation using the above regex isn't supposed to allow?
Added: Here's how the original code uses this regex (and it works there):
public static bool ValidateTextExp(string regexp, string sText)
{
if ( sText == null)
{
Log.WriteWarning("ValidateTextExp got null text to validate against regExp {0} . returning false",regexp);
return false;
}
return (!Regex.IsMatch(sText, regexp));
}
It seems I'm doing something wrong..
Thanks.
Your regex matches a value that contains any single character which is not in that list.
Your test value matches because it has spaces in it, which do not appear to be in your expression.
It matches characters that are not in the list because your character class starts with ^. It matches any value that contains even a single such character because you did not specify the beginning or end of the string, or any quantifiers.
The above assumes I'm not missing the importance of any of the characters in the middle of the character soup :)
This answer is also dependent on how you actually use the Regex in code.
If your intention was for that Regex string to represent the only characters that are actually allowed in a password, you would change the regex like so:
string pattern = "^[A-Z0-9...etc...]+$";
The important parts there are:
The ^ has been removed from inside the bracket, to outside; where it signifies the start of the whole string.
The $ has been added to the end, where it signifies the end of the whole string.
Those are needed because otherwise, your pattern will match anything that contains the valid values anywhere inside - even if invalid values are also present.
Finally, I've added the + quantifier, which means you want to find any one of those valid characters, one or more times (this regex would not permit a 0-length password).
If you wanted to permit the ^ character also as part of the password, you would add it back in between the brackets, but just not as the first thing right after the opening bracket [. So for example:
string pattern = "^[A-Z0-9^...etc...]+$";
The ^ has special meaning in different places at different times in Regexes.
[^A-Za-z0-9_.\-!@#$%^&*()=+;:'\"|~`<>?\/{}]
----------------------^
Looks fine to me, at least in regards to your question title. I'm not clear yet on why the spaces in your sample don't trip it up.
Note that I'm assuming the purpose of this expression is to find invalid characters. Thus, if the expression is a positive match, you have a bad password that you must reject. Since there appears to be some confusion about this, perhaps I can clear it up with a little pseudo-code:
bool isGoodPassword = !Regex.IsMatch(requestedPassword, @"[^A-Za-z0-9_.\-!...]");
You could rewrite this for a positive match (without the negation) like so:
bool isGoodPassword = Regex.IsMatch(requestedPassword, @"^[A-Za-z0-9_.\-!...]+$");
The new expression matches a string that, from beginning to end, consists of one or more of the characters in the list. Any character not in the list would cause the match to fail.
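Here is a small sketch, written as C# top-level statements, that puts the two formulations side by side. The character list is copied from the question's pattern; the names allowed, IsValidByRejection and IsValidByWhitelist are invented for illustration. The two checks should agree for ordinary non-empty input, but they differ on the empty string, which the first accepts and the second rejects.

```csharp
using System;
using System.Text.RegularExpressions;

// Allowed characters, copied from the question's pattern ("" is an escaped quote in a verbatim string).
const string allowed = @"A-Za-z0-9_.\-!@#$%^&*()=+;:'""|~`<>?\/{}";

// Negative test: the password is bad if it contains any character outside the class.
bool IsValidByRejection(string password) =>
    !Regex.IsMatch(password, "[^" + allowed + "]");

// Positive test: every character between the start (^) and end ($) anchors must be in the class.
bool IsValidByWhitelist(string password) =>
    Regex.IsMatch(password, "^[" + allowed + "]+$");

foreach (string candidate in new[] { "dfgbrk*", "has space" })
{
    Console.WriteLine($"{candidate}: {IsValidByRejection(candidate)} / {IsValidByWhitelist(candidate)}");
}
```

"dfgbrk*" passes both checks because * is in the list, while "has space" fails both because the space is not.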
Your regular expression is just an inverted character class and describes just one single character (but that can’t be *). So it depends on how you use that character class.
Depends on how you apply it. It describes exactly one character; however, the ^ in the beginning bugs me a little, as it prohibits every other character, so there is probably something terribly fishy there.
Edit: as pointed out in other answers, the reason for your string to match is the space, not the explanation that was replaced by this line.
So if I write a regex and it matches, I can get the match itself or I can access its groups, and the first group is the whole match again. This seems counterintuitive, since the groups are defined in the expression with parentheses "(" and ")". It seems not only wrong but redundant. Does anyone know why?
Regex quickCheck = new Regex(@"(\D+)\d+");
string source = "abc123";
Match m = quickCheck.Match(source);
m.Value //Equals source
m.Groups.Count //Equals 2
m.Groups[0].Value //Equals source
m.Groups[1].Value //Equals "abc"
I agree - it is a little strange, however I think there are good reasons for it.
A Regex Match is itself a Group, which in turn is a Capture.
But the Match.Value (or Capture.Value as it actually is) is only valid when one match is present in the string - if you're matching multiple instances of a pattern, then by definition it can't return everything. In effect - the Value property on the Match is a convenience for when there is only one match.
But to clarify where this behaviour of passing the whole match into Groups[0] makes sense - consider this (contrived) example of a naive code unminifier:
[TestMethod]
public void UnMinifyExample()
{
string toUnMinify = "{int somevalue = 0; /*init the value*/} /* end */";
string result = Regex.Replace(toUnMinify, @"(;|})\s*(/\*[^*]*?\*/)?\s*", "$0\n");
Assert.AreEqual("{int somevalue = 0; /*init the value*/\n} /* end */\n", result);
}
The regex match will preserve /* */ comments at the end of a statement, placing a newline afterwards - but works for either ; or } line-endings.
Okay - you might wonder why you'd bother doing this with a regex - but humour me :)
If Groups[0] generated by the matches for this regex was not the whole capture - then a single-call replace would not be possible - and your question would probably be asking why doesn't the whole match get put into Groups[0] instead of the other way round!
The documentation for Match says that the first group is always the entire match so it's not an implementation detail.
It's historical is all. In Perl 5, the contents of capture groups are stored in the special variables $1, $2, etc., but C#, Java, and others instead store them in an array (or array-like structure). To preserve compatibility with Perl's naming convention (which has been copied by several other languages), the first group is stored in element number one, the second in element two, etc. That leaves element zero free, so why not store the full match there?
FYI, Perl 6 has adopted a new convention, in which the first capturing group is numbered zero instead of one. I'm sure it wasn't done just to piss us off. ;)
Most likely so that you can use "$0" to represent the match in a substitution expression, and "$1" for the first group match, etc.
I don't think there's really an answer other than the person who wrote this chose that as an implementation detail. As long as you remember that the first group will always equal the source string you should be ok :-)
Not sure why either. Note that even if you use named groups and set RegexOptions.ExplicitCapture, Groups[0] still holds the whole match; that option only stops unnamed parentheses from capturing.
It might be redundant, however it has some nice properties.
For example, it means the capture groups work the same way as other regex engines - the first capture group corresponds to "1", and so on.
Backreferences are one-based, e.g., \1 or $1 is the first parenthesized subexpression, and so on. As laid out, one maps to the other without any thought.
Also of note: m.Groups["0"] gives you the entire matched substring, so be sure to skip "0" if you're iterating over regex.GetGroupNames().
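The points made in these answers can be checked with a short sketch (C# top-level statements). The group name letters is an assumption added here so that GetGroupNames() has something other than "0" to return; the pattern is otherwise the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

Regex quickCheck = new Regex(@"(?<letters>\D+)\d+");
Match m = quickCheck.Match("abc123");

Console.WriteLine(m.Value);                    // abc123
Console.WriteLine(m.Groups[0].Value);          // abc123 (always the whole match)
Console.WriteLine(m.Groups["letters"].Value);  // abc

// GetGroupNames() includes "0", so skip it to visit only the real capture groups.
foreach (string name in quickCheck.GetGroupNames())
{
    if (name == "0") continue;
    Console.WriteLine($"{name} = {m.Groups[name].Value}");
}

// "$0" in a replacement string refers to the whole match, "$1" to the first group.
Console.WriteLine(Regex.Replace("abc123", @"(\D+)\d+", "[$0|$1]"));  // [abc123|abc]
```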
I have two simple questions about regular expressions.
Having the string $10/$50, I want to get the 50, which will always be at the end of the string. So I made: ([\\d]*$)
Having the string 50c/70c I want to get the 70, which will always be at the end of the string(i want it without the c), so I made: ([\\d]*)c$
Both seem to do what I want, but I actually would like to do 2 things with it:
a) I'd like to put both in the same string (is it possible?). I tried with the | but it didn't seem to work.
b) If indeed it is possible to do a), I'd like to know if it's possible to format the text. As you can see, both for dollars and cents, I will retrieve with the regular expression the value the string shows. But while in the first case we are dealing with dollars, in the second we're dealing with cents, so I'd like to transform 50 cents into 0,5. Is it possible, or will I have to code that by myself?
For matching both cases you're basically saying that the "c" is optional. So use the "?", which means "zero or one match of the preceding char". This should give you the following:
([\d]*)c?$
Hope that helps.
(a) is easy:
(\d+)c?$
(b) you can't do with regular expressions.
Providing these are Perl-style regexes (I don't actually know C#/.net):
(\d+(?=c$)|(?<=\$)\d+$)
Formatting the text would probably be something to do outside of the regex matching.
(\$?[\d]+)c?
This will match both in one hit from a paragraph. You can try it in the following tool with the example text below:
http://gskinner.com/RegExr/
Example Text
This is the string 50c/70c $10/$50
This is the string 50c/70c $9/$50
This is the string 50c/70c $8/$50
This is the string 50c/70c $7/$50
This is the string 50c/70c $6/$50
This is the string 50c/70c $5/$50
Your existing regexes can be simplified:
([\d]*$) ([\d]*)c$
You don't need the square brackets:
(\d*$) (\d*)c$
But I'd recommend that you demand at least one digit in your number, so use + instead of *:
(\d+$) (\d+)c$
Now you can join the two together:
(\d+)(c?)$
I wouldn't recommend doing the calculation inside the regex.
We captured the 'c' in the second parenthesis, so we can work with this information.
This is what the whole thing would look like in Perl; please adapt appropriately:
if( m/(\d+)(c?)$/ ) {
if( $2 eq 'c' ) {
$dollars = $1/100;
} else {
$dollars = $1;
}
print "$_ are $dollars dollars\n"
}
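Since the question is about C#, here is one possible adaptation of the Perl snippet above, written as top-level statements. It is a sketch rather than a definitive implementation, and the helper name TrailingAmountInDollars is invented for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the trailing amount in dollars, or null when nothing matches.
static decimal? TrailingAmountInDollars(string input)
{
    // (\d+) captures the number; (c?) captures an optional trailing "c" for cents.
    Match m = Regex.Match(input, @"(\d+)(c?)$");
    if (!m.Success) return null;

    decimal value = decimal.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
    return m.Groups[2].Value == "c" ? value / 100m : value;
}

Console.WriteLine(TrailingAmountInDollars("$10/$50"));  // 50
Console.WriteLine(TrailingAmountInDollars("50c/70c"));  // 0.7, printed with the current culture's decimal separator
```

The regex only extracts the pieces; the conversion from cents to dollars happens in ordinary code, which is also where any formatting such as 0,7 versus 0.7 would be decided.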
As I commented above: the calculation could be done in a regex substitution
s/.*?(\d+)(c?)$/$1*($2?0.01:1)/e
but that might be a bit obfuscated.
Others have pointed out that (\d+)(c?)$ satisfies a).
But I see no reason why you can't follow that with substitution statements to do the formatting:
s/(\d)0c/0,$1/
s/(\d\d)c/0,$1/
s/(\d)c/0,0$1/
I've created the following regex pattern in an attempt to match a string 6 characters in length ending in either "PRI" or "SEC", unless the string = "SIGSEC". For example, I want to match ABCPRI, XYZPRI, ABCSEC and XYZSEC, but not SIGSEC.
(\w{3}PRI$|[^SIG].*SEC$)
It is very close and sort of works (if I pass in "SINSEC", it returns a partial match on "NSEC"), but I don't have a good feeling about it in its current form. Also, I may have a need to add more exclusions besides "SIG" later and realize that this probably won't scale too well. Any ideas?
BTW, I'm using System.Text.RegularExpressions.Regex.Match() in C#
Thanks,
Rich
Assuming your regex engine supports negative lookaheads, try this:
((?!SIGSEC)\w{3}(?:SEC|PRI))
Edit: A commenter pointed out that .NET does support negative lookaheads, so this should work fine (thanks, Charlie).
To help break down Dan's (correct) answer, here's how it works:
( // outer capturing group to bind everything
(?!SIGSEC) // negative lookahead: a match only works if "SIGSEC" does not appear next
\w{3} // exactly three "word" characters
(?: // non-capturing group - we don't care which of the following things matched
SEC|PRI // either "SEC" or "PRI"
)
)
All together: ((?!SIGSEC)\w{3}(?:SEC|PRI))
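A quick way to convince yourself is to run the pattern against a few inputs. The sketch below (C# top-level statements) adds ^ and $ anchors, which are not part of the answer's pattern; they are an assumption made here so that only strings of exactly six characters match.

```csharp
using System;
using System.Text.RegularExpressions;

// ^ and $ are an addition here: they keep the pattern from matching inside a longer string.
Regex code = new Regex(@"^(?!SIGSEC)\w{3}(?:SEC|PRI)$");

foreach (string input in new[] { "ABCPRI", "XYZSEC", "SINSEC", "SIGSEC", "SIGPRI", "ABCDSEC" })
{
    Console.WriteLine($"{input}: {code.IsMatch(input)}");
}
// ABCPRI, XYZSEC, SINSEC and SIGPRI match; SIGSEC and ABCDSEC do not.
```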
You can try this one:
@"\w{3}(?:PRI|(?<!SIG)SEC)"
Matches 3 "word" characters
Matches PRI or SEC (but not after SIG, i.e. SIGSEC is excluded). (?<!x)y is a negative lookbehind (it matches y if it's not preceded by x).
Also, I may have a need to add more exclusions besides "SIG" later and realize that this probably won't scale too well
Using my code, you can easily add more exclusions; for example, the following pattern excludes SIGSEC and FOOSEC:
@"\w{3}(?:PRI|(?<!SIG|FOO)SEC)"
Why not use more readable code? In my opinion this is much more maintainable.
private Boolean HasValidEnding(String input)
{
if (input.EndsWith("SEC",StringComparison.Ordinal) || input.EndsWith("PRI",StringComparison.Ordinal))
{
if (!input.Equals("SIGSEC",StringComparison.Ordinal))
{
return true;
}
}
return false;
}
or in one line
private Boolean HasValidEnding(String input)
{
return (input.EndsWith("SEC",StringComparison.Ordinal) || input.EndsWith("PRI",StringComparison.Ordinal)) && !input.Equals("SIGSEC",StringComparison.Ordinal);
}
It's not that I don't use regular expressions, but in this case I wouldn't use them.
Personally, I'd be inclined to build-up the exclusion list using a second variable, then include it into the full expression - it's the approach I've used in the past when having to build any complex expression.
Something like exclude = 'someexpression'; prefix = 'list of prefixes'; suffix = 'list of suffixes'; expression = '{prefix}{exclude}{suffix}';
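One way this idea could look in C# is sketched below as top-level statements. The arrays excluded and suffixes are hypothetical inputs, and placing the exclusion as a negative lookahead in front of the three-character prefix is a choice made here, not something the answer above prescribes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical inputs: full codes to reject and the suffixes to accept.
string[] excluded = { "SIGSEC", "FOOSEC" };
string[] suffixes = { "PRI", "SEC" };

// Regex.Escape guards against metacharacters sneaking in through the lists.
string exclude = "(?!(?:" + string.Join("|", excluded.Select(s => Regex.Escape(s))) + ")$)";
string suffix = "(?:" + string.Join("|", suffixes.Select(s => Regex.Escape(s))) + ")";
string expression = "^" + exclude + @"\w{3}" + suffix + "$";

Console.WriteLine(expression);  // ^(?!(?:SIGSEC|FOOSEC)$)\w{3}(?:PRI|SEC)$
Console.WriteLine(Regex.IsMatch("ABCSEC", expression));  // True
Console.WriteLine(Regex.IsMatch("FOOSEC", expression));  // False
```

Adding another exclusion is then a one-line change to the excluded array rather than an edit to the pattern itself.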
You may not even want to do the exclusions in the regex. For example, if this were Perl (I don't know C#, but you can probably follow along), I'd do it like this
if ( ( $str =~ /^\w{3}(?:PRI|SEC)$/ ) && ( $str ne 'SIGSEC' ) )
to be clear. It's doing exactly what you wanted:
Three word characters, followed by PRI or SEC, and
It's not SIGSEC
Nobody says you have to force everything into one regex.