Simple Regular Expression Help in C#

I am not sure if it is possible, but I need to improve my RegExp. I have the following:
\b(?:http|https)://www\.domain\.co\.za/.*
It is fine for all my purposes, except that I would like it to also validate:
http://www.domain.co.za (no slash at the end)
But should NOT validate for:
http://www.domain.co.zaaaaa
And then this expression:
\b(?:https?)://[.0-9a-z-]*domain\.co\.za
should validate (currently working):
http://domain.co.za
http://sub1.domain.co.za
http://wwww.domain.co.za
But it should NOT validate for:
http://abcdomain.co.za

That's pretty easy:
\b(?:http|https)://www\.domain\.co\.za/?\b
.* is useless since it always matches, so I removed it, made the / optional, and added a \b.
The second case is similar:
\b(?:https?)://[.0-9a-z-]*\bdomain\.co\.za
Just use that magic \b :)
Or, if you want a more strict pattern, this would be better:
\b(?:https?)://(?>[0-9a-z-]+\.)*domain\.co\.za
because it enforces runs of characters separated by . for the subdomains. The atomic group ((?>...)) is here to avoid catastrophic backtracking.
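As a quick check, the patterns above can be run against the sample URLs from the question. This is only a sketch: the class and method names (DomainUrlCheck, IsWwwHost, IsDomainOrSubdomain) are made up for illustration, and the patterns are copied unchanged from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DomainUrlCheck
{
    // First pattern from the answer: optional trailing slash, then a word boundary.
    private static readonly Regex WwwHost =
        new Regex(@"\b(?:http|https)://www\.domain\.co\.za/?\b");

    // Stricter subdomain pattern from the answer, using an atomic group.
    private static readonly Regex AnyHost =
        new Regex(@"\b(?:https?)://(?>[0-9a-z-]+\.)*domain\.co\.za");

    public static bool IsWwwHost(string url) => WwwHost.IsMatch(url);

    public static bool IsDomainOrSubdomain(string url) => AnyHost.IsMatch(url);

    public static void Main()
    {
        string[] samples =
        {
            "http://www.domain.co.za",
            "http://www.domain.co.za/",
            "http://www.domain.co.zaaaaa",
            "http://domain.co.za",
            "http://sub1.domain.co.za",
            "http://abcdomain.co.za"
        };

        // Print the verdict of both patterns for every sample URL.
        foreach (string url in samples)
        {
            Console.WriteLine($"{url} -> www: {IsWwwHost(url)}, any host: {IsDomainOrSubdomain(url)}");
        }
    }
}
```

Both patterns use IsMatch, so they look for the pattern anywhere in the input. If the whole string has to be a URL, anchoring with ^ and $ would be a reasonable addition.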

Related

Use OR in Regex Expression

I have a regex to match the following:
somedomain.com/services/something
Basically I need to ensure that /services is present.
The regex I am using and which is working is:
\/services*
But I need to match /services OR /servicos. I tried the following:
(\/services|\/servicos)*
But this shows 24 matches?! https://regex101.com/r/jvB1lr/1
How to create this regex?
The (\/services|\/servicos)* matches 0+ occurrences of /services or /servicos, and that means it can match an empty string anywhere inside the input string.
You can group the alternatives like /(services|servicos) and remove the * quantifier, but for this case, it is much better to use a character class [oe] as the strings only differ in 1 char.
You want to use the following pattern:
/servic[eo]s
See the regex demo
To make sure you match a whole subpart, you may append (?:/|$) at the pattern end, /servic[eo]s(?:/|$).
In C#, you may use Regex.IsMatch with the pattern to see if there is a match in a string:
var isFound = Regex.IsMatch(s, @"/servic[eo]s(?:/|$)");
Note that you do not need to escape / in a .NET regex as it is not a special regex metacharacter.
Pattern details
/ - a /
servic[eo]s - services or servicos
(?:/|$) - / or end of string.
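To see the difference between the starred group and the character-class pattern, a small sketch like the following may help. The ServicePath class and its method names are hypothetical, and the sample strings are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ServicePath
{
    // Pattern from the answer: /services or /servicos as a whole path segment.
    public static bool HasServicesSegment(string s) =>
        Regex.IsMatch(s, @"/servic[eo]s(?:/|$)");

    // Counts how many matches of the given pattern are empty strings.
    public static int CountEmptyMatches(string s, string pattern)
    {
        int empty = 0;
        foreach (Match m in Regex.Matches(s, pattern))
        {
            if (m.Length == 0)
            {
                empty++;
            }
        }
        return empty;
    }

    public static void Main()
    {
        string url = "somedomain.com/services/something";

        // The starred group can match the empty string at many positions.
        Console.WriteLine(CountEmptyMatches(url, @"(/services|/servicos)*"));

        // Without the star, no match can be empty.
        Console.WriteLine(CountEmptyMatches(url, @"/servic[eo]s"));

        Console.WriteLine(HasServicesSegment(url));
        Console.WriteLine(HasServicesSegment("somedomain.com/servicesx/something"));
    }
}
```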
Well the * quantifier means zero or more, so that is the problem. Remove that and it should work fine:
(\/services|\/servicos)
Keep in mind that in your example, you have a typo in the URL so it will correctly not match anything as it stands.
Here is an example with the typo in the URL fixed, so it shows 1 match as expected.
First off, you specify C# in this post (really, .NET is the library that holds the regex engine, not the language), but regex101 in your example is set to PHP. That gives you misleading information, such as needing to escape a forward slash / as \/, which is unnecessary in .NET regular expressions. The regex language is largely the same, but different tools behave differently, and PHP regex is not like .NET regex.
Secondly, the star * on the ( ) says that the group may match nothing at all, so you get an empty match at every position.
Thirdly, there is no need to spell out the whole word twice. I would just put the one letter that differs into a set [ ]. That gives you the "or-ness" you need to match either services or servicos. For example:
(/servic[oe]s)
This will tell you whether services is found or not. Nothing else is needed.

Regex lookbehind only when the input contains a colon

Today I use the C# Regex.IsMatch function to match a key:value format.
I have some code that checks whether a string has the format key:value (like H:15).
The Regex pattern that I am using today is: [D,H,M,S]:[1-9]+\d?
I want to add support for a default key: when the input is 15, I would like to treat it as H:15.
So I need to improve my Regex to support either key:value or only a value (without a colon). H:15 is good, and 15 is also good.
I tried the regex OR operator (|), something like: ([D,H,M,S]:[1-9]+\d?)|([1-9]+\d?)
But now it matches more things, like :1 and H:01, which are bad input for me.
I also tried a lookbehind, without success.
Any help would be greatly appreciated,
Nadav.
This should do the trick:
\b(?:[DHMS]:|(?<!:))[1-9][0-9]*\b
So, either match [DHMS]: or a word boundary not preceded by :.
Also, [1-9]+\d? looks very suspicious to me, so I replaced it with [1-9][0-9]*. Note that in .NET \d is not equivalent to [0-9] because it includes Unicode digits as well.
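The remark about \d can be checked with a non-ASCII digit. The sketch below uses U+0663 (ARABIC-INDIC DIGIT THREE) as the test character; the DigitClasses class and its method names are invented for this example. It also shows RegexOptions.ECMAScript, which restricts \d to ASCII digits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClasses
{
    // \d in .NET matches any Unicode decimal digit by default.
    public static bool MatchesBackslashD(string s) => Regex.IsMatch(s, @"^\d$");

    // [0-9] matches ASCII digits only.
    public static bool MatchesAsciiRange(string s) => Regex.IsMatch(s, @"^[0-9]$");

    // With RegexOptions.ECMAScript, \d is restricted to ASCII digits.
    public static bool MatchesEcmaBackslashD(string s) =>
        Regex.IsMatch(s, @"^\d$", RegexOptions.ECMAScript);

    public static void Main()
    {
        string arabicIndicThree = "\u0663";
        Console.WriteLine(MatchesBackslashD(arabicIndicThree));
        Console.WriteLine(MatchesAsciiRange(arabicIndicThree));
        Console.WriteLine(MatchesEcmaBackslashD(arabicIndicThree));
    }
}
```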
Looks like Avinash just beat me to it, but I added word boundaries with this expression, which works well in tests.
\b(?<=[DHMS]:)?[1-9]\d*\b
Seems like you want something like this:
@"^(?:[DHMS]:)?[1-9]\d*$"
[DHMS] matches a single character from the given list. ? after the non-capturing group will turn the key part to an optional one. \d* matches zero or more digit characters.
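Building on the anchored pattern, a possible way to apply the default key from the question is sketched below. The named groups key and value, the KeyValueToken class, and the TryParse helper are assumptions added for illustration, and [0-9] is used instead of \d to keep the value ASCII-only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyValueToken
{
    // Default key from the question: a bare value is treated as H:value.
    private const char DefaultKey = 'H';

    // Same shape as the anchored pattern above, with named groups added.
    private static readonly Regex Token =
        new Regex(@"^(?:(?<key>[DHMS]):)?(?<value>[1-9][0-9]*)$");

    public static bool TryParse(string input, out char key, out int value)
    {
        key = DefaultKey;
        value = 0;

        Match m = Token.Match(input);
        if (!m.Success)
        {
            return false;
        }

        // The key group only succeeds when the optional "X:" prefix was present.
        if (m.Groups["key"].Success)
        {
            key = m.Groups["key"].Value[0];
        }

        // int.TryParse also rejects values too large for an int.
        return int.TryParse(m.Groups["value"].Value, out value);
    }

    public static void Main()
    {
        foreach (string input in new[] { "H:15", "15", "M:7", ":1", "H:01" })
        {
            bool ok = TryParse(input, out char key, out int value);
            Console.WriteLine($"{input}: ok={ok}, key={key}, value={value}");
        }
    }
}
```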

Regular expression not capturing matches in the middle of a string

The regular expression I'm starting with is:
^(((http|ftp|https|www)://)?([\w+?.\w+])+([a-zA-Z0-9\~!\@#\$\%\^\&*()_-\=+\/\?.\:\;\'\,]*)?)$
I'm using this to find URLs in the middle of user-supplied text and replace it with a hyperlink. This works fine and matches the following:
http://www.google.com
www.google.com
google.com
www.google.com?id=5
etc...
However, it doesn't find a match if there is any text on either side of it (kind of defeats the purpose of what I'm doing). :)
No match:
Go to www.google.com
www.google.com is the best.
I go to www.google.com all the time.
etc...
How can I change this so that it will match no matter where in the string it appears? I'm terrible with regular expressions...
You have a bug in your original regex. The square brackets make \w+?\.\w+ a character class:
(((http|ftp|https|www)://)?([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?)
                            ^         ^
After removing them (and the anchors ^ and $), your regex will not match obvious non-URLs.
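The two points above can be illustrated with a short sketch. Note the assumptions: the Url pattern below is a deliberately simplified stand-in, not the original expression, and the UrlLinker class and its methods are hypothetical. It shows that the bracketed part matches a single character and that an unanchored pattern lets Regex.Replace find URLs in the middle of text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlLinker
{
    // Simplified, hypothetical URL pattern (NOT the original one):
    // optional scheme, dotted host, optional path or query. No ^ or $ anchors.
    private static readonly Regex Url =
        new Regex(@"(?:(?:https?|ftp)://)?\w+(?:\.\w+)+(?:[/?]\S*)?");

    // Wraps every URL-like match in an anchor tag (no HTML encoding in this sketch).
    public static string Linkify(string text) =>
        Url.Replace(text, m => $"<a href=\"{m.Value}\">{m.Value}</a>");

    // The bracketed part of the original regex is a character class,
    // so it matches exactly one character such as '+', '?', '.' or a word character.
    public static bool MatchesBracketedPart(string s) =>
        Regex.IsMatch(s, @"^[\w+?.\w+]$");

    public static void Main()
    {
        Console.WriteLine(MatchesBracketedPart("?"));
        Console.WriteLine(MatchesBracketedPart("google.com"));
        Console.WriteLine(Linkify("Go to www.google.com"));
        Console.WriteLine(Linkify("www.google.com is the best."));
    }
}
```

The simplified pattern will also pick up things like version numbers (for example 3.14), so a production pattern would need to be stricter.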
I suggest using http://regexpal.com/ for testing regexes, as it has syntax highlighting within the regex.
I think you should use a positive lookahead that searches for the URL and covers two possibilities: it is either at the beginning or in the middle of the whole string.
You could use something like ^((?=url)?|.?(?=url).*?$))
That is just the beginning; I am not giving you an answer, just an idea.
I would do it, but at the moment I am lazy and your regex looks like it needs a 20-minute analysis.
Stack Overflow erased some things from my example.

How does non-backtracking subexpression work "(?>exp)"

I am trying to become better at regular expressions. I am having a hard time understanding what (?> expression ) means. Where can I find more info on non-backtracking subexpressions? The description of THIS link says:
Greedy subexpression, also known as a non-backtracking subexpression.
This is matched only once and then does not participate in
backtracking.
This other link, http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx, also has a definition of non-backtracking subexpression, but I am still having a hard time understanding what it means, and I cannot think of an example where I would use (?>exp).
As always, regular-expressions.info is a good place to start.
Use an atomic group if you want to make sure that whatever has once been matched will stay part of the match.
For example, to match a number of "words" that may or may not be separated by spaces, then followed by a colon, a user tried the regex:
(?:[A-Za-z0-9_.&,-]+\s*)+:
When there was a match, everything was fine. But when there wasn't, his PC would become non-responsive at 100% CPU load because of catastrophic backtracking: the regex engine would vainly try every way of splitting the input into "words" in search of a combination that would allow the following colon to match, which was of course impossible.
By using an atomic group, this could have been prevented:
(?>[A-Za-z0-9_.&,-]+\s*)+:
Now whatever has been matched stays matched - no backtracking and therefore fast failing times.
Another good example I recently came across:
If you want to match all numbers that are not followed by an ASCII letter, you might want to use the regex \d+(?![A-Za-z]). However, this will fail with inputs like 123a because the regex engine will happily return the match 12 by backtracking until the following character is no longer a letter. If you use (?>\d+)(?![A-Za-z]), this won't happen. (Of course, \d+(?![\dA-Za-z]) would also work)
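The 123a example can be reproduced directly. In this sketch the AtomicDigits class and the FirstMatch helper are hypothetical names; the three patterns are the ones quoted above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicDigits
{
    // Returns the text of the first match; a failed match has an empty Value.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        string input = "123a";

        // Backtracking lets \d+ give up the 3, so a shorter number is returned.
        Console.WriteLine($"[{FirstMatch(input, @"\d+(?![A-Za-z])")}]");

        // The atomic group keeps all digits, so the lookahead fails and nothing matches.
        Console.WriteLine($"[{FirstMatch(input, @"(?>\d+)(?![A-Za-z])")}]");

        // Alternative without an atomic group: also forbid a following digit.
        Console.WriteLine($"[{FirstMatch(input, @"\d+(?![\dA-Za-z])")}]");
    }
}
```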
The Regex Tutorial has a page on it here: http://www.regular-expressions.info/atomic.html
Basically, it discards backtracking information, meaning that a(?>bc|b)c matches abcc but not abc.
The reason it doesn't match the second string is because it finds a match with bc, and discards backtracking information about the bc|b alternation. It essentially forgets the |b part of it. Therefore, there is no c after the bc, and the match fails.
The most useful method of using atomic groups, as they are called, is to optimize slow regexes. You can find more information on the aforementioned page.
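A minimal sketch of the a(?>bc|b)c example, next to the same pattern with an ordinary non-capturing group for comparison (the AtomicAlternation class and method names are made up for this illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicAlternation
{
    // Atomic group: once bc has matched, the b alternative is no longer tried.
    public static bool MatchesAtomic(string s) => Regex.IsMatch(s, @"a(?>bc|b)c");

    // Ordinary group: the engine may backtrack and try the b alternative.
    public static bool MatchesNonAtomic(string s) => Regex.IsMatch(s, @"a(?:bc|b)c");

    public static void Main()
    {
        foreach (string s in new[] { "abcc", "abc" })
        {
            Console.WriteLine($"{s}: atomic={MatchesAtomic(s)}, non-atomic={MatchesNonAtomic(s)}");
        }
    }
}
```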
Read up on possessive quantifiers: [a-z]*+ makes the backtracking engine remember only the previous step that matched, not all of the previous steps that matched.
This is useful when a lot of acceptable steps are probable and they would eat up memory if each step were stored for possible backtracking.
Possessive quantifiers are a shorthand for atomic groups: X*+ is equivalent to (?>X*). Note that the .NET regex engine does not support the possessive syntax, so use the atomic group form there.

How to use non capture groups with the "or" character in regular expressions

So basically I have this giant regular expression pattern, and somewhere in the middle of it is the expression (?:\s(\d\d\d)|(\d\d\d\d)). At this part of the parse I want to capture either 3 digits that follow a space or 4 digits, but I don't want the capture that comes from using the parentheses around the whole thing (doesn't ?: make a group non-capturing?). I have to use parentheses so that the "or" logic works (I think).
So potential example inputs would be something like...
input1= giantexpression 123more characters after
input2= giantexpression1234blahblahblah
I tried (?:\s(\d\d\d)|(\d\d\d\d)) and it gave an extra capture at least in the case where I have 4 digits. So am I doing this right or am I messed up somewhere?
Edit:
To go into more detail... here's the current regular expression I'm working with.
pattern = @".?(\d{1,2})\s*(\w{2}).?.?.?(?:\s(\d\d\d)|(\d\d\d\d)).*"
There's a bit of parsing I have to do at the beginning. I think Sean Johnson's answer would still work because I wouldn't need to use "or". But is there a way to do it in which you DO use "or"? I think eventually I'll need that capability.
This should work:
(?:\s(\d{3,4}))
If you aren't doing any logic on that subpattern, you don't even need the parentheses surrounding it if all you want to do is capture the digits. The following pattern:
\s(\d{3,4})
will capture three or four digits directly following a space character.
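For the follow-up question about keeping the "or", here is a sketch of two options that are not part of the answer above. In the question's pattern the outer (?:...) is already non-capturing; the "extra" group comes from having one capture group per alternative, and only one of them succeeds. In .NET, both alternatives can also share one group name. The DigitCapture class, its method names, and the group name num are assumptions for illustration. Note that \s(\d{3,4}) requires the space, so it would not match input2.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitCapture
{
    // Pattern from the question: two numbered groups, one per alternative.
    private static readonly Regex TwoGroups =
        new Regex(@"(?:\s(\d\d\d)|(\d\d\d\d))");

    // .NET allows the same group name in both alternatives.
    private static readonly Regex SharedName =
        new Regex(@"(?:\s(?<num>\d{3})|(?<num>\d{4}))");

    // Picks whichever numbered group took part in the match.
    public static string ViaTwoGroups(string input)
    {
        Match m = TwoGroups.Match(input);
        if (!m.Success)
        {
            return "";
        }
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }

    // Reads the single named group, regardless of which alternative matched.
    public static string ViaSharedName(string input)
    {
        Match m = SharedName.Match(input);
        return m.Success ? m.Groups["num"].Value : "";
    }

    public static void Main()
    {
        string input1 = "giantexpression 123more characters after";
        string input2 = "giantexpression1234blahblahblah";

        Console.WriteLine(ViaTwoGroups(input1));
        Console.WriteLine(ViaTwoGroups(input2));
        Console.WriteLine(ViaSharedName(input1));
        Console.WriteLine(ViaSharedName(input2));
    }
}
```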
