I was wondering if someone out there could help me with a regex in C#. I think it's fairly simple but I've been wracking my brain over it and not quite sure why I'm having such a hard time. :)
I've found a few examples around but I can't seem to manipulate them to do what I need.
I just need to match ANY alphanumeric+dashes subdomain string that is not "www", and just up to the "."
Also, ideally, if someone were to type "www.subdomain.domain.com" I would like the www to be ignored if possible. If not, it's not a huge issue.
In other words, I would like to match:
(test).domain.com
(test2).domain.com
(wwwasdf).domain.com
(asdfwww).domain.com
(w).domain.com
(wwwwww).domain.com
(asfd-12345-www-bananas).domain.com
www.(subdomain).domain.com
And I don't want to match:
(www).domain.com
It seems to me like it should be easy, but I'm having trouble with the "not match" part.
For what it's worth, this is for use in the IIS 7 URL Rewrite Module, to rewrite for all non-www subdomains.
Thanks!
Is the remainder of the domain name constant, like .domain.com, as in your examples? Try this:
\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)
Explanation:
\w+(?:-\w+)* matches a generic domain-name component as you described (but a little more rigorously).
(?=\.domain\.com\b) makes sure it's the first subdomain (i.e., the last one before the actual domain name).
\b(?!www\.) makes sure it isn't www. (without the \b, it could skip over the first w and match just the ww.).
In my tests, this regex matches precisely the parts you highlighted in your examples, and does not match the www. in either of the last two examples.
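For anyone who wants to try it outside of IIS, here is a minimal sketch in Python (used only for illustration; the pattern is unchanged). The names SUBDOMAIN and first_subdomain are made up for this example.

```python
import re

# Pattern copied from the answer above; SUBDOMAIN is a hypothetical name.
SUBDOMAIN = re.compile(r"\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)")

def first_subdomain(host):
    """Return the label directly before .domain.com, or None if there is no match."""
    m = SUBDOMAIN.search(host)
    return m.group(1) if m else None

for host in ("test.domain.com", "wwwwww.domain.com",
             "www.subdomain.domain.com", "www.domain.com"):
    print(host, "->", first_subdomain(host))
```

A bare www.domain.com gives None, while www.subdomain.domain.com gives subdomain.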
EDIT: Here's another version which matches the whole name, capturing the pieces in different groups:
^((?:\w+(?:-\w+)*\.)*)((?!www\.)\w+(?:-\w+)*)(\.domain\.com)$
In most cases, group $1 will contain an empty string because there's nothing before the subdomain name, but here's how it breaks down www.subdomain.domain.com:
$1: "www."
$2: "subdomain"
$3: ".domain.com"
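The group breakdown can be checked with a small Python sketch (Python is just my choice for illustration; FULL and split_host are hypothetical names):

```python
import re

# Whole-name pattern from the EDIT above.
FULL = re.compile(r"^((?:\w+(?:-\w+)*\.)*)((?!www\.)\w+(?:-\w+)*)(\.domain\.com)$")

def split_host(host):
    """Return the three captured groups as a tuple, or None if the host is rejected."""
    m = FULL.match(host)
    return m.groups() if m else None

print(split_host("www.subdomain.domain.com"))
print(split_host("test.domain.com"))
print(split_host("www.domain.com"))
```

The first group comes back as an empty string when nothing precedes the subdomain, and a bare www.domain.com is rejected.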
^www\.
And invert the logic for this bit, so if it matches, then your string does not meet your requirements.
This works:
^(?!www\.domain\.com)(?:[a-z0-9\-\.]+\.domain\.com)$
Or, with the necessary backslashes for Java (or C#?) strings:
"^(?!www\\.domain\\.com)(?:[a-z0-9\\-\\.]+\\.domain\\.com)$"
There may be a more concise way (i.e. only typing domain.com once), but this works ..
Just substitute the original with everything after the www, if present (Python):
s = re.sub(r"^(www\.)?(.+)", r"\2", s)
Or if you just want to match those which are "wrong" use this:
(www\.([^.]+)\.([^.]+))
And if you must match all those which are good use this:
(([^w]|w[^w]|ww[^w]|www[^.]|www\.([^.]+)\.([^.]+)\.).+)
Just thinking aloud here:
^(?:www\.)?([^\.]+)\.([^\.]+)\.
where...
(?:www\.)? looks for a possible "www" at the start, non-capturing
([^\.]+)\. looks for the sub-domain (anything except a dot at least once until a dot)
([^\.]+)\. looks for the domain, ending with a dot (anything except a dot at least once until a dot)
Note: This expression will not work with double sub-domains:
www.subsub.sub.domain.com
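A quick Python sketch of how this behaves (PATTERN and sub_and_domain are names I made up). Because the leading www. is optional, the engine can backtrack and capture "www" itself as the sub-domain for a bare www.domain.com, so the caller would still have to reject that value.

```python
import re

# Pattern from the answer above.
PATTERN = re.compile(r"^(?:www\.)?([^\.]+)\.([^\.]+)\.")

def sub_and_domain(host):
    """Return (sub-domain, domain) as captured, or None if nothing matches."""
    m = PATTERN.match(host)
    return m.groups() if m else None

print(sub_and_domain("test.domain.com"))
print(sub_and_domain("www.subdomain.domain.com"))
print(sub_and_domain("www.domain.com"))             # captures "www" as the sub-domain
print(sub_and_domain("www.subsub.sub.domain.com"))  # the double sub-domain case
```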
This:
^(?:www\.)?([^.]*)
It matches exactly what you put in parentheses in your question. You will find your answers sitting in group(1). You have to anchor it to the beginning of the line. Use this:
^(?:www\.)?(.*)
If you want everything in the URL except the "www.". One example you did not include in your test cases was "alpha.subdomain.domain.com". In the event you need to match everything, except "www.", that is not in the "domain.com" part of the string, use this:
^(?:www\.)?(.+)((?:\.(?:[^./\?]+)){2})
It will solve all of your cases, but in addition, will also return "alpha.subdomain" from my additional test case. And, for an encore, places ".domain.com" in group 2 and will not match beyond that if there are directories or parameters in the url.
I verified all of these responses here.
Finally, for the sake of overkill, if you want to reject addresses that begin with "www.", you can use negative lookbehind:
^....(?<!www\.).*
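A small Python sketch of the lookbehind version (NOT_WWW and accepts are hypothetical names). One thing to keep in mind: the four leading dots mean that any input shorter than four characters is rejected as well.

```python
import re

# Lookbehind pattern from the answer above.
NOT_WWW = re.compile(r"^....(?<!www\.).*")

def accepts(host):
    """True if the first four characters exist and are not 'www.'."""
    return NOT_WWW.match(host) is not None

print(accepts("test.domain.com"))
print(accepts("www.domain.com"))
print(accepts("a.b"))  # too short for the four dots
```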
Thought I'd share this.
(\\.[A-Za-z]{2,3}){1,2}$
Removes any '.com.au' '.co.uk' from the end. Then you can do an additional lookup to detect whether a URL contains a subdomain.
E.g.
subdomain1.sitea.com.au
subdomain2.siteb.co.uk
subdomain3.sitec.net.au
all become:
subdomain1.sitea
subdomain2.siteb
subdomain3.sitec
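A rough Python sketch of the stripping step, written here with an explicit [A-Za-z] class, since a range like A-z also takes in a few punctuation characters that sit between Z and a. SUFFIX and strip_suffix are hypothetical names. Note the caveat in the last call: a two- or three-letter site name is indistinguishable from a suffix part, so it gets stripped too.

```python
import re

# One or two trailing ".xx" / ".xxx" parts at the end of the host name.
SUFFIX = re.compile(r"(\.[A-Za-z]{2,3}){1,2}$")

def strip_suffix(host):
    """Remove a trailing suffix such as .com, .com.au or .co.uk."""
    return SUFFIX.sub("", host)

print(strip_suffix("subdomain1.sitea.com.au"))
print(strip_suffix("subdomain2.siteb.co.uk"))
print(strip_suffix("sub.abc.com"))  # "abc" is short enough to be eaten as well
```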
I'm trying to come up with an example where positive look-around works but non-capture groups won't work, to further understand their usages. The examples I'm coming up with all work with non-capture groups as well, so I feel like I'm not fully grasping the usage of positive look-around.
Here is a string (taken from an SO example) that uses positive lookahead in the answer. The user wanted to grab the second column value only if the value of the first column started with ABC and the last column had the value 'active'.
string ='''ABC1 1.1.1.1 20151118 active
ABC2 2.2.2.2 20151118 inactive
xxx x.x.x.x xxxxxxxx active'''
The solution given used positive lookahead, but I noticed that I could use non-capturing groups to arrive at the same answer.
So, I'm having trouble coming up with an example where positive look-around works and a non-capturing group doesn't.
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?=\S+\s+active)')  # solution
pattern = re.compile(r'ABC\w\s+(\S+)\s+(?:\S+\s+active)')  # solution w/out lookaround
If anyone would be kind enough to provide an example, I would be grateful.
Thanks.
The fundamental difference is that non-capturing groups still consume the part of the string they match, moving the cursor forward, while lookarounds do not.
One example where this matters is matching strings that are surrounded by certain boundaries when those boundaries can overlap. Sample task:
Match all a's in a given string that are surrounded by b's. The given string is bababaca. There should be two matches, at positions 2 and 4 (counting from 1).
Using lookarounds this is rather easy, you can use b(a)(?=b) or (?<=b)a(?=b) and match them. But (?:b)a(?:b) won't work - the first match will also consume the b at position 3, that is needed as boundary for the second match. (note: the non-capturing group isn't actually needed here)
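The overlap problem is easy to see in a short Python sketch. The variable names are mine, and the offsets Python reports are 0-based, so 1 and 3 correspond to positions 2 and 4.

```python
import re

text = "bababaca"

# Lookarounds check the surrounding b's without consuming them.
lookaround_hits = [m.start() for m in re.finditer(r"(?<=b)a(?=b)", text)]

# Non-capturing groups consume both b's, so the shared b is used up.
consuming_hits = [m.start() for m in re.finditer(r"(?:b)a(?:b)", text)]

print(lookaround_hits)  # two matches, one per surrounded a
print(consuming_hits)   # a single "bab" match starting at offset 0
```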
Another prominent example is password validation: checking that the password contains uppercase letters, lowercase letters, digits, and so on. You can use a bunch of alternations to match these, but lookaheads are much easier:
(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])
vs
(?:.*[a-z].*[A-Z].*[0-9].*[!?.])|(?:.*[A-Z].*[a-z].*[0-9].*[!?.])|(?:.*[0-9].*[a-z].*[A-Z].*[!?.])|(?:.*[!?.].*[a-z].*[A-Z].*[0-9])|(?:.*[A-Z].*[a-z].*[!?.].*[0-9])|...
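As a sketch of the lookahead version in Python (STRONG and is_strong are hypothetical names): each lookahead scans from the start of the string independently, so the order in which the character types appear does not matter.

```python
import re

# Four independent lookaheads, all evaluated at the start of the string.
STRONG = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!?.])")

def is_strong(password):
    """True if the password has a lowercase letter, an uppercase letter, a digit and one of !?."""
    return STRONG.match(password) is not None

print(is_strong("Passw0rd!"))
print(is_strong("1!aB"))        # same ingredients in a different order
print(is_strong("password1!"))  # no uppercase letter
```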
I'm trying to create a regular expression that would match files of this pattern:
Id_Name_processID_timestamp_logName.txt
Example of filename: abcd_Service_11234_15112013_Log.txt
I don't need perfect matching; something that would match anything_anything_anything_anything_anything.txt would work for me.
I haven't tried anything yet. I just lost a lot of time staring at this Regex Tutorial, and I don't know where to start :(
Go to this site: http://regexpal.com/
Put abcd_Service_11234_15112013_Log.txt in the lower box.
Start writing your regex in the top box until it matches (it's a simple one, really: chars, underscore, rinse and repeat) ... You'll be ok ...
My regex, a short simple one.
^\w+_\w+\.txt$
Edit:
I do agree with the first answer: you really need to try something on your own. But that website must be the least user-friendly page on regex, so you get my answer out of sympathy ;)
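Here is a sketch in Python comparing the short pattern (with the dot escaped and an end anchor) against a stricter one that insists on exactly five underscore-separated parts. The stricter pattern and the names LOOSE and STRICT are my own assumptions, not something from the question.

```python
import re

# Short version: \w also matches "_", so this accepts any number of parts >= 2.
LOOSE = re.compile(r"^\w+_\w+\.txt$")

# Hypothetical stricter version: five parts, none of them containing "_".
STRICT = re.compile(r"^[^_\W]+(?:_[^_\W]+){4}\.txt$")

name = "abcd_Service_11234_15112013_Log.txt"
print(bool(LOOSE.match(name)), bool(STRICT.match(name)))
print(bool(LOOSE.match("a_b.txt")), bool(STRICT.match("a_b.txt")))
```

If "anything with underscores ending in .txt" is good enough, the short pattern is all you need.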
First, I have spent three hours trying to solve this. Also, please don't suggest not using regex. I appreciate other comments and can easily use other methods, but I am practicing regex as much as possible.
I am using VB.Net
Example string:
"Hello world this is a string C:\Example\Test E:\AnotherExample"
Pattern:
"[A-Z]{1}:.+?[^ ]*"
Works fine. However, what if the directory name contains whitespace? I have tried to match all strings that start with one uppercase letter followed by a colon and then anything else. The match should run up until a whitespace followed by one uppercase letter and a colon, and then the same sequence should be matched again.
Hope I have made sense.
How about "[A-Z]{1}:((?![A-Z]{1}:).)*", which should stop before the next drive letter and colon?
That "?!" is a "negative lookaround" or "zero-width negative lookahead" which, according to Regular expression to match a line that doesn't contain a word? is the way to get around the lack of inverse matching in regexes.
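To see the lookahead at work, here is a rough sketch in Python rather than VB.Net (the pattern is the same apart from dropping the redundant {1} and making the group non-capturing; DRIVE_PATH and find_paths are hypothetical names). The match runs right up to the next drive letter, so it includes the separating space, which the sketch trims off.

```python
import re

# Consume characters one at a time as long as no "X:" starts at that position.
DRIVE_PATH = re.compile(r"[A-Z]:(?:(?![A-Z]:).)*")

def find_paths(text):
    """Return every drive-letter path found, with trailing whitespace removed."""
    return [m.group(0).rstrip() for m in DRIVE_PATH.finditer(text)]

print(find_paths(r"Hello world this is a string C:\Example\Test E:\AnotherExample"))
print(find_paths(r"C:\Program Files\App D:\My Docs"))
```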
Not to be too picky, but most filesystems disallow a small number of characters (like <>/\:?"), so a correct pattern for a file path would be more like [A-Z]:\\((?![A-Z]{1}:)[^<>/:?"])*.
The other important point that has been raised is how you expect to parse input like "hello path is c:\folder\file.extension this is not part of the path:P"? This is a problem you commonly run into when you start trying to parse without specifying the allowed range of inputs, or the grammar that a parser accepts. This particular problem seems pretty ad hoc and so I don't really expect you to come up with a grammar or to define how particular messages are encoded. But the next time you approach a parsing problem, see if you can first define what messages are allowed and what they mean (syntax and semantics). I think you'll find that once you've defined the structure of allowed messages, parsing can be almost trivial.
I am a new developer and don't have much exposure to regular expressions. Today I was assigned to fix a bug using regex, but after a lot of effort I am unable to find the error.
Here is my requirement.
My code is:
string regex = "^([A-Za-z0-9\\-]+|[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}):([0-9]{1,5}|\\*)$";
Regex _hostEndPointRegex = new Regex(regex);
bool isTrue = _hostEndPointRegex.IsMatch(textBox1.Text);
It's throwing an error for the domain name like "nikhil-dev.in.abc.ni:8080".
I am not sure where the problem is.
Your regex is a bit redundant: one side of the | (or) matches things that the other side already covers.
I just simplified what you had to
(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}
and it works just fine...
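A quick way to check the simplified pattern is a Python sketch like the one below. The ^ and $ anchors are my addition so that the whole input has to match, and HOST_PORT and is_host_port are hypothetical names. Note that this version requires at least one dot in the host and no longer accepts * as the port.

```python
import re

# Simplified pattern from above, anchored for whole-string validation (assumption).
HOST_PORT = re.compile(r"^(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}$")

def is_host_port(value):
    """True if value looks like dotted-host:port under the simplified pattern."""
    return HOST_PORT.match(value) is not None

print(is_host_port("nikhil-dev.in.abc.ni:8080"))
print(is_host_port("nikhil-dev.in.abc.ni"))  # no port
print(is_host_port("localhost:80"))          # no dot in the host
```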
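Note that the \\- in your character class is only an escaped hyphen (the doubled backslash is C# string escaping), so it does not allow \ in the host name. The escape is unnecessary when the hyphen comes last in the class.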
Your problem is that your or | breaks things up like this...
[A-Za-z0-9\\-]+
or
[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}
or
\*
Which, as the commenter said, does not include "-" in the 2nd block.
So perhaps you intended
^((?:[A-Za-z0-9\-]+|[A-Za-z0-9]{1,3})\.[A-Za-z0-9]{1,3}\.[A-Za-z0-9]{1,3}\.[A-Za-z0-9]{1,3}):([0-9]{1,5}|\*)$
However, the first two or'ed items would be redundant, as + includes {1,3}.
i.e. [A-Za-z0-9\-]+ would also match anything that [A-Za-z0-9]{1,3} matches.
You can use this tool to help test your Regex:
http://regexpal.com/
Personally I think every developer should have regexbuddy
The regex above, although it works, will allow invalid host names.
It should be modified to not allow punctuation in the first character.
So it should be modified to look like this.
(?:[A-Za-z0-9][A-Za-z0-9-]+\.)(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}
Also, in theory the host isn't allowed to end in a hyphen.
It is all so complicated that I would use the regex only to capture the parts and then use Uri.CheckHostName to actually check that the host name is valid.
Or you can just use the regex suggested by CodeCaster
I have a regex that I am using to validate email addresses. I like this regex because it is fairly relaxed and has proven to work quite well.
Here is the regex:
(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+#[^\.][\w\.\-]+\.[A-Za-z]{2,}>?
Ok great, basically all reasonably valid email addresses that you can throw at it will validate. I know that maybe even some invalid ones will fall through but that is ok for my specific use-case.
Now it happens to be the case that joe#x.com does not validate. And guess what: x.com is actually a domain name that exists (owned by PayPal).
Looking at the regex part that validates the domain name:
#[^\.][\w\.\-]+
It looks like this should be able to parse the x.com domain name, but it doesn't. The culprit is the part that checks that a domain name can not begin with a dot (such as test#.test.com)
#[^\.]
If I remove the [^.] part of my regex the domain x.com validates, but now the regex allows domain names beginning with a dot, such as .test.com; this is a little bit too relaxed for me ;-)
So my question is: how can the negated character class affect my single-character check? The way I am reading the regex is "make sure this string does not start with a dot", but apparently it does more.
Any help would be appreciated.
Regards,
Waseem
As Luis suggested, you can use [^\.][\w\.\-]* to match the domain name; however, it will now also match addresses like john#x.....com and john##.com. You might want to make sure that there is only one period at a time, and that the first character after the # is more restricted than just not being a period.
Match the domain name and the period (and subdomains and their periods) using:
([\w\-]+\.)+
So your pattern would be:
(['\"]{1,}.+['\"]{1,}\s+)?<?[\w\.\-]+#([\w\-]+\.)+[A-Za-z]{2,}>?
If you change [^\.][\w\.\-]+ to [^\.][\w\.\-]*, it will work as you expect!
The reason is: [^\.] will match a single character which is not a dot (in your case, the "x" in "x.com"). The pattern then tries to match 1 or more characters, and then a dot. It will match the dot after the x, and there are no more dots left to match. The * will match 0 or more characters after the first one, which is what you want.
Change the quantifier +, meaning one or more, to *, meaning zero or more.
Change #[^\.][\w\.\-]+ to #[^\.][\w\.\-]*
The reason you need this is that [^\.] matches a single character that is not a dot, here the x. The [\w\.\-]+ that follows must then consume at least one character, and the only way it can is by taking the dot, which leaves nothing for the \. that comes next. Changing the plus to a star fixes this.
Look at the broader context in your pattern:
#[^\.][\w\.\-]+\.[A-Za-z]{2,}
So for joe#x.com,
[^.] matches x
[\w.-]+ matches .
\. needs a dot but finds c
Change this part to #[^.][\w-]*\.[A-Za-z]{2,}
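To compare the suggestions side by side, here is a Python sketch that tests only the domain part of the pattern. The ^ and $ anchors are added just for this illustration, and ORIGINAL, STARRED, LABELS and check are hypothetical names.

```python
import re

ORIGINAL = re.compile(r"^[^.][\w.\-]+\.[A-Za-z]{2,}$")  # domain part from the question
STARRED = re.compile(r"^[^.][\w.\-]*\.[A-Za-z]{2,}$")   # + changed to *
LABELS = re.compile(r"^([\w\-]+\.)+[A-Za-z]{2,}$")      # one label and one dot at a time

def check(domain):
    """Return (original, starred, labels) match results for a domain string."""
    return tuple(p.match(domain) is not None for p in (ORIGINAL, STARRED, LABELS))

for domain in ("x.com", "test.com", "x.....com", ".test.com"):
    print(domain, check(domain))
```

Only the label-based version rejects runs of dots such as x.....com, while all three reject a leading dot.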