Regex to validate domain name with port - c#

I am a new developer and don't have much exposure to regular expressions. Today I was assigned to fix a bug using a regex, but after a lot of effort I am unable to find the error.
Here is my requirement.
My code is:
string regex = "^([A-Za-z0-9\\-]+|[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}):([0-9]{1,5}|\\*)$";
Regex _hostEndPointRegex = new Regex(regex);
bool isTrue = _hostEndPointRegex.IsMatch(textBox1.Text);
It rejects a domain name like "nikhil-dev.in.abc.ni:8080".
I am not sure where the problem is.

Your regex is a bit redundant in that you "or" in some stuff that is already covered by the other alternative.
I just simplified what you had to
(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}
and it works just fine...
Note that the \\- in your character class is just an escaped hyphen (the C# string "\\-" becomes the regex \-), so it does not allow a backslash, which would not be valid in a host name anyway.
Your problem is that your or | breaks things up like this...
[A-Za-z0-9\\-]+
or
[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}
or
\*
Which, as the commenter said, does not include "-" in the second block.
So perhaps you intended
^((?:[A-Za-z0-9\-]+|[A-Za-z0-9]{1,3})\.[A-Za-z0-9]{1,3}\.[A-Za-z0-9]{1,3}\.[A-Za-z0-9]{1,3}):([0-9]{1,5}|\*)$
However, the first two or'ed items would be redundant, as + includes {1,3}.
I.e., [A-Za-z0-9\-]+ would also match anything that [A-Za-z0-9]{1,3} matches.
You can use this tool to help test your Regex:
http://regexpal.com/
Personally, I think every developer should have RegexBuddy.
The simplified regex above works, but it will also allow invalid host names.
It should be modified so that the first character cannot be punctuation.
So it should look like this:
(?:[A-Za-z0-9][A-Za-z0-9-]+\.)(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}
Also, in theory the host isn't allowed to end in a hyphen.
It is all so complicated that I would use the regex only to capture the parts and then use Uri.CheckHostName to check that the host name is actually valid.
Or you can just use the regex suggested by CodeCaster.
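A minimal sketch of the "capture with a regex, validate with Uri.CheckHostName" idea from the answer above. The class name HostEndpoint, the method IsValid and the loose capture pattern are assumptions made for illustration; they are not the pattern from the question. The 65535 upper bound on the port is also an added check.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostEndpoint
{
    // Loose capture only (hypothetical pattern): the host is anything without
    // a colon or whitespace, the port is 1-5 digits or "*".
    private static readonly Regex Parts =
        new Regex(@"^(?<host>[^:\s]+):(?<port>[0-9]{1,5}|\*)$");

    public static bool IsValid(string input)
    {
        Match m = Parts.Match(input);
        if (!m.Success) return false;

        // Let the framework decide whether the host is a DNS name or an IP address.
        UriHostNameType kind = Uri.CheckHostName(m.Groups["host"].Value);
        if (kind == UriHostNameType.Unknown) return false;

        string port = m.Groups["port"].Value;
        if (port == "*") return true;

        // At most five digits, so int.Parse cannot overflow here.
        return int.Parse(port) <= 65535;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("nikhil-dev.in.abc.ni:8080"));
        Console.WriteLine(IsValid("nikhil-dev.in.abc.ni:99999"));
        Console.WriteLine(IsValid("nikhil-dev.in.abc.ni"));
    }
}
```

Because the capture pattern excludes colons from the host, this sketch does not handle IPv6 literals.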

Related

Regular Expression; Finding a word(s) between 2 specific words in a single string

I have the following string:
Account has been reset and assigned the following temporary password:
NvSYVZeg
Please log into your account using your email address and the temporary password.
What I want to do is grab the temporary password (the NvSYVZeg part) and put it into its own string variable. I've looked for hours and tried multiple expressions that I've found online (tested them at RegexStorm.com & RegExr.com) and they never seem to locate the temp password. I tried:
(?<=password:\s)(.*?)(?=\sPlease)
and
(password:)(.*?)(Please)
But neither of them worked for me. Any ideas?
You have more than one whitespace character between the password and the leading/trailing delimiters. Thus, you need to use \s* (zero or more whitespace characters):
password:\s*(.*?)\s*Please
         ^^^     ^^^
or
(?<=password:\s*).*?(?=\s*Please)
             ^^^       ^^^
See the regex demo.
The first one might be used like
var m = Regex.Match(s, @"password:\s*(.*?)\s*Please");
if (m.Success)
{
    // Group 1 holds the text between the two delimiters.
    Console.WriteLine(m.Groups[1].Value);
}
If you use the second one, just access m.Value.
By using the regex option RightToLeft we tell the parser to work backwards instead of forwards, which in my mind makes it easier to craft a pattern. Note that although the parser works in reverse, the pattern is still written in the forward direction as usual.
If we mentally work backwards in our pattern from the anchor ^Please, we just have to eat all the spaces by specifying \s+. Once all the spaces are consumed, all we have to do is extract everything up to a newline. The pattern would be
(?<Password>[^\r\n]+)\s+^Please
Then extract the actual password in C# by using the named group myMatch.Groups["Password"].Value.
Note that with the ^ in the pattern we will also need the Multiline option.
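The RightToLeft approach described above might look like the sketch below. The class name TempPassword and the method Extract are made up for illustration, and the sample message assumes Windows line endings with a blank line on each side of the password.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TempPassword
{
    public static string Extract(string message)
    {
        // RightToLeft makes the engine scan from the end of the input;
        // Multiline lets ^ match at the start of the "Please" line.
        Match m = Regex.Match(
            message,
            @"(?<Password>[^\r\n]+)\s+^Please",
            RegexOptions.RightToLeft | RegexOptions.Multiline);

        return m.Success ? m.Groups["Password"].Value : null;
    }

    public static void Main()
    {
        string message =
            "Account has been reset and assigned the following temporary password:\r\n\r\n" +
            "NvSYVZeg\r\n\r\n" +
            "Please log into your account using your email address and the temporary password.";

        Console.WriteLine(Extract(message)); // NvSYVZeg
    }
}
```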

Do backreferences need to come after the group they reference?

While running some tests for this answer, I noticed the following unexpected behavior. This will remove all occurrences of <tag> after the first:
var input = "<text><text>extra<words><text><words><something>";
Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
// <text>extra<words><something>
But this will not:
Regex.Replace(input, @"(?<=\1.*)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>
Similarly, this will remove all occurrences of <tag> before the last:
Regex.Replace(input, @"(<[^>]+>)(?=.*\1)", "");
// extra<text><words><something>
But this will not:
Regex.Replace(input, @"(?=\1.*\1)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>
So this got me thinking…
In the .NET regular expression engine, does a backreference need to appear after the group it's referencing? Or is there something else going on with these patterns that's causing them not to work?
Your question got me thinking too, so I ran a few tests with RegexBuddy. To my surprise, the second regex, (?<=\1.*)(<[^>]+>), which you said didn't work, actually worked, and the others behaved exactly as you described. I then tried the same expression (the second one) in C# code, and there it didn't work, just as you reported.
This confused me until I noticed that my RegexBuddy version dates back to 2008, so there must have been some change in how the .NET engine works. It did shed light on something I find reasonable: it seems that before 2008, lookbehinds were evaluated after the rest of the expression had matched. That behavior seems somewhat acceptable for lookbehinds, since you need to match something before you can look behind it.
Nevertheless, engines these days seem to evaluate lookarounds as soon as they encounter them. I was able to confirm this with the following expression, which is the reverse of your case:
(?<=(\w))\1
As you can see, I captured a word character inside the lookbehind and referenced it outside. I tested this on the string hello and it matched at the second l character as expected, which proves that the lookbehind was executed before attempting to match the rest of the expression.
Conclusion: Yes, a backreference needs to appear after the group it references, or it will never match.
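To make the conclusion concrete, here is a small sketch that runs one working pattern and one non-working pattern from the question, plus the (?<=(\w))\1 test on "hello". The class and method names are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceOrder
{
    public const string Input = "<text><text>extra<words><text><words><something>";

    // Group 1 captures first and is then referenced inside the lookahead.
    public static string RemoveAllButLast(string s)
    {
        return Regex.Replace(s, @"(<[^>]+>)(?=.*\1)", "");
    }

    // \1 is evaluated before group 1 has captured anything, so the
    // lookbehind fails at every position and nothing is replaced.
    public static string ReferenceBeforeGroup(string s)
    {
        return Regex.Replace(s, @"(?<=\1.*)(<[^>]+>)", "");
    }

    public static void Main()
    {
        Console.WriteLine(RemoveAllButLast(Input));     // extra<text><words><something>
        Console.WriteLine(ReferenceBeforeGroup(Input)); // prints the input unchanged

        // The capture inside the lookbehind is available to the \1 after it.
        Match m = Regex.Match("hello", @"(?<=(\w))\1");
        Console.WriteLine(m.Index + " " + m.Value);     // 3 l
    }
}
```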

How to use regex in order to catch a power statement

How can I use regex to catch a power statement, here are some examples:
2^4
(2*5)^x
y^(y+1)
or more complex ones such as x^4+(x*2)^(x+1), in which case it has 2 matches ("x^4" and "(x*2)^(x+1)")
I managed to get it working without the parentheses using the expression:
Regex rPower = new Regex(@"\w\^\w");
But to deal with the possible existence of parentheses I was thinking of something along these lines, but it still isn't working...
Regex rPower = new Regex(@"(?(?=\()(.*?(?=\)))|(\w))\^(?(?=\()(.*?(?=\)))|(\w))");
Any help/explanation that includes the thought process behind it would be deeply appreciated since I don't know much about regex and I'm just now starting to learn it.
Thanks in advance
EDIT: For clarity what I intend to do is:
If in the string there is a substring which may start with a "(", in which case it should read everything from that "(" until it finds a ")", otherwise assume it's a "\w", separated by a "^" which in turn is followed by another pattern just like the one it started with.
Basically it will match the expression "(random_Expression)^(random_Expression)", but it may not actually be a complex expression; if it does not contain any parentheses I will assume it's a simple "\w".
I hope I made myself clear :S
You may use this:
(\([^)]*\)|\w)\^(\([^)]*\)|\w)
Sample matches:
2^2 matches 2^2
a+b^c matches b^c
(a+b)^(c+d) matches (a+b)^(c+d)
2^(a+b) matches 2^(a+b)
(a+b)^2 matches (a+b)^2
(a+b)^2+5^2-(3+2)^(2+3) matches (a+b)^2, 5^2, (3+2)^(2+3)
Obviously, you may run into bugs with this expression if things like nested operations are used. If you are going to work with complex expressions, I guess you will have to parse them carefully with a more elaborate method.
Could you please edit or reply with an explanation, even if brief, of how the expression is working?
It is similar to your original expression \w\^\w, but it replaces each \w with (\([^)]*\)|\w). If you look closely, that matches either "something inside parentheses" (given by \([^)]*\), which doesn't work for nested brackets) or "a simple word character" (\w).
Hope that helps a bit :)
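For reference, a short sketch that runs the suggested expression over one of the sample inputs. The PowerStatements class and its Find method are names invented for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PowerStatements
{
    // Each operand is either a parenthesised group (no nesting) or a single \w.
    private static readonly Regex Power =
        new Regex(@"(\([^)]*\)|\w)\^(\([^)]*\)|\w)");

    public static string[] Find(string expression)
    {
        return Power.Matches(expression)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Prints: (a+b)^2, 5^2, (3+2)^(2+3)
        Console.WriteLine(string.Join(", ", Find("(a+b)^2+5^2-(3+2)^(2+3)")));
    }
}
```

Since \w matches exactly one character, an input like 12^34 yields 2^3; replacing \w with \w+ would be one way to pick up multi-character operands.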

Regular expression not capturing matches in the middle of a string

The regular expression I'm starting with is:
^(((http|ftp|https|www)://)?([\w+?.\w+])+([a-zA-Z0-9\~!\@#\$\%\^\&*()_-\=+\/\?.\:\;\'\,]*)?)$
I'm using this to find URLs in the middle of user-supplied text and replace it with a hyperlink. This works fine and matches the following:
http://www.google.com
www.google.com
google.com
www.google.com?id=5
etc...
However, it doesn't find a match if there is any text on either side of it (kind of defeats the purpose of what I'm doing). :)
No match:
Go to www.google.com
www.google.com is the best.
I go to www.google.com all the time.
etc...
How can I change this so that it will match no matter where in the string it appears? I'm terrible with regular expressions...
You have a bug in your original regex. The square brackets make \w+?\.\w+ a character class:
(((http|ftp|https|www)://)?([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?)
                            ^         ^
After removing them (and the anchors ^ and $), your regex will not match obvious non-URLs.
I suggest using http://regexpal.com/ for testing regexes, as it has syntax highlighting within the regex.
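Here is a rough sketch of what matching URLs anywhere in the text can look like once the anchors and the accidental character class are gone. The pattern is deliberately simplified and is an assumption of this example, not the corrected form of the original regex; Linkifier and Linkify are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Simplified pattern (assumption): optional scheme, a dotted host,
    // and an optional path or query string. No ^ or $ anchors.
    private static readonly Regex Url = new Regex(
        @"\b(?:(?:https?|ftp)://)?\w+(?:\.\w+)+(?:[/?]\S*)?");

    public static string Linkify(string text)
    {
        // Wrap every match in an anchor tag, leaving the rest of the text alone.
        return Url.Replace(text, m => "<a href=\"" + m.Value + "\">" + m.Value + "</a>");
    }

    public static void Main()
    {
        Console.WriteLine(Linkify("I go to www.google.com all the time."));
    }
}
```

This loose pattern will also pick up things like "e.g" or file names with extensions, so a real implementation would need a stricter host part.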
I think you should use a positive lookahead that searches for the URL and checks two possibilities: it is either at the beginning or in the middle of the whole string.
You could use something like ^((?=url)?|.?(?=url).*?$))
That is just a starting point; I am not giving you an answer, just an idea.
I would work it out, but at the moment I am lazy and your regex looks like it needs a 20-minute analysis.
Stack Overflow erased some parts of my example.

Regex for ANY string except "www"? (subdomain)

I was wondering if someone out there could help me with a regex in C#. I think it's fairly simple but I've been wracking my brain over it and not quite sure why I'm having such a hard time. :)
I've found a few examples around but I can't seem to manipulate them to do what I need.
I just need to match ANY alphanumeric+dashes subdomain string that is not "www", and just up to the "."
Also, ideally, if someone were to type "www.subdomain.domain.com" I would like the www to be ignored if possible. If not, it's not a huge issue.
In other words, I would like to match:
(test).domain.com
(test2).domain.com
(wwwasdf).domain.com
(asdfwww).domain.com
(w).domain.com
(wwwwww).domain.com
(asfd-12345-www-bananas).domain.com
www.(subdomain).domain.com
And I don't want to match:
(www).domain.com
It seems to me like it should be easy, but I'm having troubles with the "not match" part.
For what it's worth, this is for use in the IIS 7 URL Rewrite Module, to rewrite for all non-www subdomains.
Thanks!
Is the remainder of the domain name constant, like .domain.com, as in your examples? Try this:
\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)
Explanation:
\w+(?:-\w+)* matches a generic domain-name component as you described (but a little more rigorously).
(?=\.domain\.com\b) makes sure it's the first subdomain (i.e., the last one before the actual domain name).
\b(?!www\.) makes sure it isn't www. (without the \b, it could skip over the first w and match just the ww.).
In my tests, this regex matches precisely the parts you highlighted in your examples, and does not match the www. in either of the last two examples.
EDIT: Here's another version which matches the whole name, capturing the pieces in different groups:
^((?:\w+(?:-\w+)*\.)*)((?!www\.)\w+(?:-\w+)*)(\.domain\.com)$
In most cases, group $1 will contain an empty string because there's nothing before the subdomain name, but here's how it breaks down www.subdomain.domain.com:
$1: "www."
$2: "subdomain"
$3: ".domain.com"
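A quick way to check the first pattern against the examples from the question is sketched below. The Subdomains class and Extract method are names made up for this sketch, and .domain.com is assumed to be constant as in the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Subdomains
{
    private static readonly Regex NonWww =
        new Regex(@"\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)");

    // Returns the subdomain directly before ".domain.com", or null for "www".
    public static string Extract(string host)
    {
        Match m = NonWww.Match(host);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string[] hosts =
        {
            "test.domain.com",
            "wwwasdf.domain.com",
            "asfd-12345-www-bananas.domain.com",
            "www.subdomain.domain.com",
            "www.domain.com"
        };

        foreach (string host in hosts)
        {
            Console.WriteLine(host + " -> " + (Extract(host) ?? "(no match)"));
        }
    }
}
```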
^www\.
And invert the logic for this bit, so if it matches, then your string does not meet your requirements.
This works:
^(?!www\.domain\.com)(?:[a-z\-\.]+\.domain\.com)$
Or, with the necessary backslashes for Java (or C#?) strings:
"^(?!www\\.domain\\.com)(?:[a-z\\-\\.]+\\.domain\\.com)$"
There may be a more concise way (i.e. only typing domain.com once), but this works ..
Just substitute the original with everything after the www, if present (pseudocode):
str = re.sub(r"(www\.)?(.+)", r"\2", str)
Or if you just want to match those which are "wrong" use this:
(www\.([^.]+)\.([^.]+))
And if you must match all those which are good use this:
(([^w]|w[^w]|ww[^w]|www[^.]|www\.([^.]+)\.([^.]+)\.).+)
Just thinking aloud here:
^(?:www\.)?([^\.]+)\.([^\.]+)\.
where...
(?:www\.)? looks for a possible "www" at the start, non-capturing
([^\.]+)\. looks for the sub-domain (anything except a dot at least once until a dot)
([^\.]+)\. looks for the domain, ending with a dot (anything except a dot at least once until a dot)
Note: This expression will not work with double sub-domains:
www.subsub.sub.domain.com
This:
^(?:www\.)?([^.]*)
It matches exactly what you put in parentheses in your question. You will find your answers sitting in group(1). You have to anchor it to the beginning of the line. Use this:
^(?:www\.)?(.*)
If you want everything in the URL except the "www.". One example you did not include in your test cases was "alpha.subdomain.domain.com". In the event you need to match everything, except "www.", that is not in the "domain.com" part of the string, use this:
^(?:www\.)?(.+)((?:\.(?:[^./\?]+)){2})
It will solve all of your cases, but in addition, will also return "alpha.subdomain" from my additional test case. And, for an encore, places ".domain.com" in group 2 and will not match beyond that if there are directories or parameters in the url.
I verified all of these responses here.
Finally, for the sake of overkill, if you want to reject addresses that begin with "www.", you can use negative lookbehind:
^....(?<!www\.).*
Thought I'd share this.
(\.[A-Za-z]{2,3}){1,2}$
Removes any '.com.au' '.co.uk' from the end. Then you can do an additional lookup to detect whether a URL contains a subdomain.
E.g.
subdomain1.sitea.com.au
subdomain2.siteb.co.uk
subdomain3.sitec.net.au
all become:
subdomain1.sitea
subdomain2.siteb
subdomain3.sitec
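As a sketch, the suffix stripping could be done with Regex.Replace as shown here. It uses [A-Za-z] for the letters, since a range written as [A-z] would also admit a few punctuation characters; SuffixStripper and Strip are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SuffixStripper
{
    // Removes one or two trailing labels of 2-3 letters, e.g. ".com.au" or ".net".
    public static string Strip(string host)
    {
        return Regex.Replace(host, @"(\.[A-Za-z]{2,3}){1,2}$", "");
    }

    public static void Main()
    {
        Console.WriteLine(Strip("subdomain1.sitea.com.au")); // subdomain1.sitea
        Console.WriteLine(Strip("subdomain2.siteb.co.uk"));  // subdomain2.siteb
        Console.WriteLine(Strip("subdomain3.sitec.net.au")); // subdomain3.sitec
    }
}
```

This is only a length-based heuristic: a short site name such as "bc" in "a.bc.com" would be stripped along with the suffix.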
