Regular Expression to validate a URL or domain name. - c#

Can someone please let me know what is wrong with my regular expression? I’m trying to just validate the beginning to URLs, mainly just host names (i.e. www.yahoo.com).
Regular Expression: ^(((ht|f)tp(s?))\:\/\/)?(www.)?([a-zA-Z0-9\-\.]{1,63})+\.([a-zA-Z]{2,5})$
Testing Values:
test.com – passes
test.c2om – fails
test.test.com – passes
test.test.c2om – fails
test.test.test.com – passes
test.test.test.c2om – INVALID REGEX PATTERN
This should return false, but instead returns nothing, both using javascript and c#… If you remove the {1,63} restriction on the size of the subdomain, it works…

You've created a catastrophic pattern - The engine will try to match ([a-zA-Z0-9\-\.]{1,63})+ in many ways until it fails. A simple solution is to remove {1,63}, as you've noted, it doesn't seem to be adding anything anyway.
Another option is to use the dots as anchors, so you cannot backtrack between them (this only gives you one way to match the text, and assumably, what you're trying to do):
([a-zA-Z0-9\-]{1,63}\.)*[a-zA-Z0-9\-]{1,63}
Keep in mind that isnt very correct anymore to assume all-ASCII-English letters in domain names. For example http://אתר.קום is a legal (and working) url.

Related

Why is this Email regex so slow on Mvc?

I am currently building a system using Asp.net, c#, Mvc2 which uses the following regex:
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*#([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$
This is an e-mail regex that validates a 'valid' e-mail address format. My code is as follows:
if (!Regex.IsMatch(model.Email, #"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*#([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$"))
ModelState.AddModelError("Email", "The field Email is invalid.");
The Regex works fine for validating e-mails however if a particularly long string is passed to the regex and it is invalid it causes the system to keep on 'working' without ever resolving the page. For instance, this is the data that I tried to pass:
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
The above string causes the system to essentially lock up. I would like to know why and if I can use a regex that accomplishes the same thing in maybe a simpler manner. My target is that an incorrectly formed e-mail address like for instance the following isn't passed:
host.#.host..com
You have nested repetition operators sharing the same characters, which is liable to cause catastrophic backtracking.
For example: ([-.\w]*[0-9a-zA-Z])*
This says: match 0 or more of -._0-9a-zA-Z followed by a single 0-9a-zA-Z, one or more times.
i falls in both of these classes.
Thus, when run on iiiiiiii... the regex is matching every possible permuation of (several "i"s followed by one "i") several times (which is a lot of permutations).
In general, validating email addresses with a regular expression is hard.

Can I add a regular expression into a .Net Assertion?

I'm trying to pull out page source from a set of pages and run an assertion on the results, this is a Test that runs to check that we are crawling specific pages in our site. Sometimes the results come back with a different case for the URL string, I'd like to account for that in the Assertion where I am checking page source. This is probably the wrong way to do this but I was wondering if there is a way to add in the .Net regex commands to the Assertion text. I have this as an assertion:
Assert.IsTrue(driver.PageSource.Contains("/explore"));
But is there a way to be sure that I can capture explore, Explore or EXPLORE? I though I could use (?i) here but that doesn't seem to work. I'm more used to Perl and it's regex capabilities but with C# and .Net I'm a little lost on where I can and can't use the inline regex commands.
Anthonys answer is valid, you don't really need regex. But if you do want to use it, you can use
Regex.IsMatch(driver.PageSource, "/explore", RegexOptions.IgnoreCase)
You don't need a regular expression to perform a case-insensitive check. Use IndexOf and compare that the result is greater than -1. IndexOf has overloads that allow you to specify if casing matters. Something like
bool containsExplore = driver.PageSource.IndexOf("/explore", StringComparison.InvariantCultureIgnoreCase) > -1;
Assert.IsTrue(containsExplore);
Try:
RegEx.Match("string", "regexp", RegExOptions.IgnoreCase).Success
How about using
StringAssert.Matches(string, regex);
In your case, that would translate to
StringAssert.Matches("drive.PageSource", "\/explore");

Optimistic RegEx Matching for User Text Entry

I'm working on a text entry application that uses regular expressions to validate user input. The goal is to allow keypresses that fit a certain RegEx while rejecting invalid characters. One issue I've run into is that when a user starts inputting information they may create a string that doesn't yet match the given regex, but could cause a match in the future. These strings get erroneously rejected. Here's an example - given the following regex for inputting date information:
(0?[1-9]|10|11|12)/(0?[1-9]|[12]\\d|30|31)/\\d{2}\\d{2}
A user may begin entering "1/" which could be a valid date, but RegEx.IsMatch() will return false and my code ends up rejecting the string. Is there a way to "optimistically" test strings against a regular expression so that possible or partial matches are allowed?
Bonus: For this RegEx in particular there are some sequences which cause required characters. For example, if the user types "2/15" the only possible valid character they could enter next is "/". Is it possible to detect those scenarios so that the required characters could be automatically entered for the user to ease input?
What you can do is anchor your RegExp (i.e. adding ^ and $, as in start/end of line) and make some component optionnal for validation, but strictly defined if present.
Something looking like this:
^(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\\d|30|31)(/(\\d{2}(\\d{2})?)?)?)?)?$
I do realize it looks horrible but as far as I know there is no way to tell the regexp engine to validate as long as the string satisfies the beginning of the regexp pattern.
In my opinion, the best way to achieve what you want to do is to create separate inputs for day/month/date and check their value when leaving the text field.
It also provides a better visibility and user-experience, as I believe no one likes to be prevented from typing certain characters into a text field with or without noticing them disappear as they type or having slashes inserted automatically and without notice.
Have you ever used and app or form that worked that way, simply refusing to accept any keypress it didn't like? If the answer is Yes, did it blow an electronic raspberry each time you pressed a wrong key?
If you really need to validate the input before the form is submitted, use a passive feedback mechanism like a red border around the textfield that disappears the regex matches the input. Also, make sure there's a Help button or a tooltip nearby to provide constructive feedback.
Of course, the best option would be to use a dedicated control like a date-entry widget. But whatever you do, don't do it in such a a way that it feels like you're playing guessing games with the user.

Regular expression to define format of backup filenames

In the application I am currently working on, I have an option to create automatic backups of a certain file on the hard disk. What I would like to do is offer the user the possibility to configure the name of the file and its extension.
For example, the backup filename could be something like : "backup_month_year_username.bak". I had the idea to save the format in the form of a regular expression. For the example above, the regexp would look like :
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w).(?<extension>bak)$"
I thought about using regex because I will also have to browse through the directory of backuped files to delete those older than a certain date. The main trouble I have now is how to create a filename using the regex. In a way I should replace the tags with the information. I could do that using regex.replace and another regex, but I feel it's a big weird doing that and it might be a better way.
Thanks
[Edit] Maybe I wasn't really clear in the first go, but the idea is of course that the user (in this case an admin that will know regex syntax) will have the possibility to modify the form of the filename, that's all the idea behind it[/Edit]
... and if the regex changes, it is next to impossible to reconstruct a string from a given regex.
Edit:
Create some predefined "place-holders": %u could be the user's name, %y could be the year, etc.:
backup_%m_%y_%u.bak
and then simple replace the %? with their actual values.
It sounds like you're trying to use the regular expression to create the file name from a pattern which the user should be able to specify.
Regular expressions can - AFAIK - not be used to create output, but only to validate input, so you'd have the user specify two things:
a file name production pattern like Bart suggested
a validation pattern in form of a regular expression that helps you split the file names into their parts
EDIT
By the way, your sample regex contains an error: The "." is use for "any character", also \w only matches one word character, so I guess you meant to write
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"
If the filename is always in this form, there is no reason for a regex, as it's easier to process with string.Split ...
With Bart's solution it is easy enough to split (using string.Split) the generated file name using underscore as the delimiter, to get back the information.
Ok, I think I have found a way to use only the regex. As I am using groups to get the information, I will use another regular expression to match the regular expression and replace the groups with the value:
Regex rgx = new Regex("\(\?\<Month\>.+?\)");
rgx.Replace("^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"
, DateTime.Now.Month.ToString());
Ok, it's really a hack, but at least it works and I have only one pattern defined by the user. It might not work if the regex is too complex, but I think I can deal with that problem.
What do you think?

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assmebly language teaching tool which allows students to write, compile, and execute assembly like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. I.E
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code
(e.g.
http://www.codeplex.com/NetMassDownloader
to download the .Net source).
Change the code to have a readonly
property with the failure index.
Make sure your code uses that Regex
rather than Microsoft's.
I guess such an index would only have meaning in some simple case, like in your example.
If you'll take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should be the index, you are talking about?
It will depend on the algorithm used for mathcing...
Could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to be able to tell where a regex fails. as a result you need to take a different approach. You need to compare strings. Use a regex to remove all the things that could vary and compare it with the string that you know it does not change.
I run into the same problem came up to your answer and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps

Categories