Given this regex:
^((https?|ftp):(\/{2}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}
|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1})
Reformatted for readability:
#"^((https?|ftp):(\/{2}))?" + // http://, https://, ftp:// - Protocol Optional
#"(" + // Begin URL payload format section
#"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" + // IPv4 Address support
#")|("+ // Delimit supported payload types
#"((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1}" + // FQDNs
#")"; // End URL payload format section
How can I make it fail (i.e. not match) on this "fail" test case?
http://www.google
As I am specifying {1} on the TLD section, I would think it would fail without the extension. Am I wrong?
Edit: These are my PASS conditions:
"http://www.zi255.com?Req=Post&PID=4",
"http://www.zi255.com?Req=Post&ID=4",
"http://www.zi255.com/?Req=Post&PID=4",
"http://www.zi255.com?Req=Post&PostID=4",
"http://www.zi255.com/?Req=Post&ID=4"
"http://www.zi255.com?Req=Post&Post=4",
"http://www.zi255.com?Req=Post&Entry=4",
"http://www.zi255.com?PID=4"
"http://www.zi255.com/Post.aspx?Req=Post&ID=4",
"http://www.zi255.com/Post.aspx?Req=Post&PID=4",
"http://www.zi255.com/Post.aspx?Req=Post&Post=4",
"http://www.zi255.com/Post.aspx?Req=Post&Title=Random%20Post%20Name"
"http://www.zi255.com/?Req=Post&Title=Random%20Post%20Name",
"http://www.zi255.com?Req=Post&Title=Random%20Post%20Name",
"http://www.zi255.com?Req=Post&PostID=4",
"http://www.zi255.com?Req=Post&Post=4",
"http://www.zi255.com?Req=Post&Entry=4",
"http://www.zi255.com?PID=4"
"http://www.zi255.com",
"http://www.damnednice.com"
These are my FAIL conditions:
"http://.com",
"http://.com/",
"http:/www.google.com",
"http:/www.google.com/",
"http://www.google",
"http://www.googlecom",
"http://www.google.c",
".com",
"https://www..."
I'll throw out an alternative suggestion. You may want to use a combination of the built-in System.Uri class's parsing and a couple of targeted regexes (or simple string checks where appropriate).
Example:
string uriString = "...";
Uri uri;
if (!Uri.TryCreate(uriString, UriKind.Absolute, out uri))
{
    // Uri is totally invalid!
}
else
{
    // validate the scheme
    if (!uri.Scheme.Equals("http", StringComparison.OrdinalIgnoreCase))
    {
        // not http!
    }

    // validate the authority ('www.blah.com:1234' portion),
    // e.g. with a targeted regex or a simple string check on uri.Authority
    // ...
}
Sometimes one catch-all regex is not the best solution, however tempting. While debugging this regex is feasible (see Greg Hewgill's answer), consider writing a couple of tests for the different categories of input, e.g. one test for numerical addresses and one test for named addresses.
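For illustration, a sketch of that split, assuming two simplified stand-in patterns (anchored approximations of the poster's two branches, not drop-in replacements):

using System.Text.RegularExpressions;

// Two separate, anchored tests: one for dotted-quad hosts, one for named hosts.
var ipUrl = new Regex(@"^((https?|ftp)://)?((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([/?].*)?$");
var namedUrl = new Regex(@"^((https?|ftp)://)?([a-zA-Z0-9-]+\.)+([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)([/?].*)?$");

bool isUrl = ipUrl.IsMatch("http://www.google") || namedUrl.IsMatch("http://www.google"); // false - no recognised TLD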
You need to force your regex to match up until the end of the string. Add a $ at the very end of it. Otherwise, your regex is probably just matching http://, or something else shorter than your whole string.
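A quick demonstration of the difference, using a simplified stand-in for the FQDN branch of the pattern:

using System;
using System.Text.RegularExpressions;

// Simplified stand-in for the poster's FQDN branch, not the full pattern.
const string body = @"((https?|ftp)://)?([a-zA-Z0-9]+\.)+([a-z]{2}|com|org|net)";

// Unanchored: succeeds, because a partial match such as "http://www.go" satisfies the pattern.
Console.WriteLine(Regex.IsMatch("http://www.google", body)); // True

// Anchored: the whole input must conform, so the missing TLD makes it fail.
Console.WriteLine(Regex.IsMatch("http://www.google", "^" + body + "$")); // False
Console.WriteLine(Regex.IsMatch("http://www.google.com", "^" + body + "$")); // True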
The "validate a url" problem has been solved* numerous times. I suggest you use the System.Uri class, it validates more cases than you can shake a stick at.
The code Uri uri = new Uri("http://whatever"); throws a UriFormatException if it fails validation. That is probably what you'd want.
*) Or kind of solved. It's actually pretty tricky to define what is a valid url.
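For completeness, a minimal sketch of that approach using Uri.TryCreate to avoid the exception (the sample input is one of the question's fail cases):

Uri uri;
bool wellFormed = Uri.TryCreate("http://www.google", UriKind.Absolute, out uri)
    && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
// Note: wellFormed is true here - System.Uri considers "http://www.google" a valid absolute URI,
// so the "must end in a known TLD" rule from the question still needs its own check on uri.Host.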
It's all about definitions. A "valid URL" should give you an IP address when you do a DNS lookup, the IP should accept a connection, and when a request is sent out you should get a reply in the form of HTML you can use.
So what we are really looking for is a "valid URL format", and that is where System.Uri comes in very handy. But if the URL is hidden in a large piece of text, you first need to find something that validates as a valid URL format.
The thing that distinguishes a URL from any other readable text is a dot not followed by whitespace. "123.com" could validate as a real URL.
Using the regex
[a-z_\.\-0-9]+\.[a-z]+[^ ]*
to find any possible URL in a text, then do a System.Uri check to see whether it is in a valid URL format, and then do a lookup. Only when the lookup gives you a result do you know the URL is valid.
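Putting those three steps together, a sketch (names are illustrative, and the DNS lookup obviously needs network access):

using System;
using System.Net;
using System.Net.Sockets;
using System.Text.RegularExpressions;

static class UrlFinder
{
    static readonly Regex Candidate = new Regex(@"[a-z_\.\-0-9]+\.[a-z]+[^ ]*", RegexOptions.IgnoreCase);

    // True only if the text looks like a URL, parses as one, and actually resolves in DNS.
    public static bool ResolvesAsUrl(string text)
    {
        if (!Candidate.IsMatch(text))
            return false;

        Uri uri;
        if (!Uri.TryCreate(text, UriKind.Absolute, out uri) &&
            !Uri.TryCreate("http://" + text, UriKind.Absolute, out uri)) // allow bare "123.com"
            return false;

        try
        {
            return Dns.GetHostEntry(uri.DnsSafeHost).AddressList.Length > 0;
        }
        catch (SocketException)
        {
            return false; // plausible format, but the host does not resolve
        }
    }
}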
Related
I have to download a file (using existing Flurl.Http endpoints [1]) whose name contains a "#", which of course has to be escaped to %23 so it doesn't conflict with URI fragment detection.
But Flurl escapes everything else except this character, resulting in a non-working URI where half of the path and all query params are missing because they get parsed as the URI fragment:
Url url = "http://server/api";
url.AppendPathSegment("item #123.txt");
Console.WriteLine(url.ToString());
Returns: http://server/api/item%20#123.txt
This means an HTTP request (using Flurl.Http) would only try to download the non-existent resource http://server/api/item%20.
Even when I pre-escape the segment, the result is exactly the same:
url.AppendPathSegment("item %23123.txt");
Console.WriteLine(url.ToString());
Again returns: http://server/api/item%20#123.txt.
Any way to stop this "magic" from happening?
[1] This means I have delegates/interfaces where input is an existing Flurl.Url instance which I have to modify.
It looks like you've uncovered a bug. Here are the documented encoding rules Flurl follows:
Query string values are fully URL-encoded.
For path segments, reserved characters such as / and % are not encoded.
For path segments, illegal characters such as spaces are encoded.
For path segments, the ? character is encoded, since query strings get special treatment.
According to the 2nd point, it shouldn't encode # in the path, so how it handles AppendPathSegment("item #123.txt") is correct. However, when you encode the # to %23 yourself, Flurl certainly shouldn't unencode it. But I've confirmed that's what's happening. I invite you to create an issue on GitHub and it'll be addressed.
In the meantime, you could write your own extension method to cover this case. Something like this should work (and you wouldn't even need to pre-encode #):
public static Url AppendFileName(this Url url, string fileName) {
    url.Path += "/" + WebUtility.UrlEncode(fileName);
    return url;
}
I ended up using Uri.EscapeDataString(foo) instead, because the suggested WebUtility.UrlEncode replaces spaces with '+', which I didn't want.
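For reference, the same extension adjusted to use Uri.EscapeDataString (a sketch based on the answer above, not part of Flurl itself), which encodes spaces as %20 rather than '+':

using System;
using Flurl;

public static class UrlExtensions
{
    public static Url AppendFileName(this Url url, string fileName)
    {
        url.Path += "/" + Uri.EscapeDataString(fileName);
        return url;
    }
}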
The MSDN page for UrlPathEncode states that UrlPathEncode shouldn't be used, and that I should use UrlEncode instead.
Do not use; intended only for browser compatibility. Use UrlEncode.
But UrlEncode does not do the same thing as UrlPathEncode.
My use case is that I want to encode a file system path so that a file can be downloaded. The spaces in a path need to be escaped, but not the forward slashes etc. UrlPathEncode does exactly this.
// given the path
string path = "Directory/Path to escape.exe";
Console.WriteLine(System.Web.HttpUtility.UrlPathEncode(path));
// returns "Directory/Path%20to%20escape.exe" <- This is what I require
Console.WriteLine(System.Web.HttpUtility.UrlEncode(path));
// returns "Directory%2fPath+to+escape.exe"
// none of these return what I require, either
Console.WriteLine(System.Web.HttpUtility.UrlEncode(path, Encoding.ASCII));
Console.WriteLine(System.Web.HttpUtility.UrlEncode(path, Encoding.BigEndianUnicode));
Console.WriteLine(System.Web.HttpUtility.UrlEncode(path, Encoding.Default));
Console.WriteLine(System.Web.HttpUtility.UrlEncode(path, Encoding.UTF32));
Console.WriteLine(System.Web.HttpUtility.UrlEncode(path, Encoding.UTF7));
Console.WriteLine(System.Web.HttpUtility.UrlEncode(path, Encoding.UTF8));
Console.WriteLine(System.Web.HttpUtility.UrlEncode(path, Encoding.Unicode));
Another method I've tried is using Uri.EscapeDataString, but this escapes the slashes.
// returns Directory%2FPath%20to%20escape.exe
Console.WriteLine(Uri.EscapeDataString(path));
Question:
If I'm not supposed to use UrlPathEncode, and UrlEncode doesn't produce the required output, what method is equivalent and recommended?
It's funny that when trying to write a question properly, you find your answer:
Uri.EscapeUriString(path);
Produces the required output.
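For example, with the path from the question:

string path = "Directory/Path to escape.exe";
// Escapes the space but leaves the slashes and the dot alone.
Console.WriteLine(Uri.EscapeUriString(path)); // Directory/Path%20to%20escape.exe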
I do think the MSDN page should reflect this, though.
Edit (2020-11-22)
I've recently come across this again, this time needing to URL-encode URLs containing special characters (rather than file names with spaces), but it's essentially the same thing. The approach I used this time was to instantiate the Uri class:
var urlWithSpecialChars = "https://www.example.net/something/contàins-spécial-chars?query-has=spécial-chars-as-well";
var uri = new Uri(urlWithSpecialChars);
// outputs "https://www.example.net/something/contàins-spécial-chars?query-has=spécial-chars-as-well"
Debug.WriteLine(uri.OriginalString);
// outputs "https://www.example.net/something/cont%C3%A0ins-sp%C3%A9cial-chars?query-has=sp%C3%A9cial-chars-as-well"
Debug.WriteLine(uri.AbsoluteUri);
// outputs "/something/cont%C3%A0ins-sp%C3%A9cial-chars?query-has=sp%C3%A9cial-chars-as-well"
Debug.WriteLine(uri.PathAndQuery);
This gives you quite a few useful Uri properties that are likely to cover most URI processing requirements.
In my application, I must read a URL and do something if the URL contains Basic authentication credentials. An example of such a URL is
http://username:password@example.com
Is the regular expression below a good fit for my task? I am to capture four groups into local variables. The URL is passed to another internal library that will do further work to ensure the URL is valid before opening a connection.
^(.+?//)(.+?):(.+?)@(.+)$
It looks ok, and I think that a regular expression is good to use in this case. A couple of suggestions:
1) I think that named groups would make your code more readable, i.e.:
^(?<protocol>.+?//)(?<username>.+?):(?<password>.+?)@(?<address>.+)$
Then you can simply write
Match match = Regex.Match(input, pattern);
if (match.Success) {
    string user = match.Groups["username"].Value;
    // ...
}
2) Then you could make the expression a little stricter, e.g. using \w where possible instead of .:
^(?<protocol>\w+://)...
Your regex seems OK, but why not use the thoroughly-tested and nearly-compliant Uri class? It's then trivial to access the pieces you want without worrying about spec-compatibility:
var url = new Uri("http://username:password@example.com");
var userInfo = url.UserInfo.Split(':');
var username = userInfo[0];
var password = userInfo[1];
How can I validate the following URLs with a single regular expression:
http://83.222.4.42:8880/listen.pls
http://www.my_site.com/listen.pls
http://www.my.site.com/listen.pls
so that all of them test as valid?
I see that I didn't formulate the question precisely, sorry, my mistake. The idea is that I want to validate valid URLs with the help of a regexp, whether the host is an external IP address or a domain name. That is the idea; other valid URLs can be considered:
http://93.122.34.342/
http://193.122.34.342/abc/1.html
http://www.my_site.com/listen2.pls
http://www.my.site.com/listen.php
and so on.
The road to hell is paved with string parsing.
URL parsing in particular is the source of many, many exploited security issues. Don't do it.
For example, do you want a URL with an uppercase scheme, such as HTTP://83.222.4.42:8880/listen.pls, to match?
Remember that some parts of a URL are case sensitive and some are not. Then there are encoding rules. Etc.
Start by using System.Uri to parse the URLs you provide:
var uri = new Uri("http://83.222.4.42:8880/listen.pls");
Then you can write things like:
if (uri.Scheme == "http" &&
uri.Host == "83.222.4.42" &&
uri.AbsolutePath == "/listen.pls"
)
{
// ...
}
^http://.+/listen\.pls$
If there are strictly only three of them, don't bother with a regular expression: there is not necessarily a good pattern to match when everything is already known exactly, and in fact you might accidentally match more than these three URLs, which becomes a problem if the URLs are intended for security purposes or something equally important. Instead, test the three cases directly, and maybe put them in a configuration file.
In the future if you want to add more URLs to the list you'll likely end up with an overly complicated regular expression that's increasingly hard to maintain and takes the place of a simpler check against a small list.
You won't necessarily get speed gains by running Regex to find these three strings - in fact it might be quite expensive.
Note: if you want URI regular expressions, also try sites hosting pattern libraries such as the Regex Library - there are many to pick and choose from if your needs change.
/^http:\/\/[-_a-zA-Z0-9.]+(:\d+)?\/listen\.pls$/
Do you mean any URL ending with /listen.pls? In that case try this:
^http://[^/]+/listen\.pls$
or if the protocol identifier must be optional:
^(http://)?[^/]+/listen\.pls$
Anyway take a look here, maybe it is useful for you: Url and Email validation using Regex
A modified version based upon Jay Bazuzi's solution above, since I can't post code in a comment. It checks against a blacklist of extensions (I do this only for demonstration purposes; you should strongly consider building a whitelist rather than a blacklist):
string myurl = "http://www.my_site.com/listen.pls";
Uri myUri = new Uri(myurl);
string[] invalidExtensions = {
".pls",
".abc"
};
foreach (string invalidExtension in invalidExtensions) {
    // Compare case-insensitively so ".PLS" and ".Pls" are caught as well.
    if (invalidExtension.Equals(System.IO.Path.GetExtension(myUri.AbsolutePath), StringComparison.OrdinalIgnoreCase)) {
        // Logic here
    }
}
I have an MVC route like this www.example.com/Find?Key= with the Key being a Base64 string. The problem is that the Base64 string sometimes has a trailing equal sign (=) such as:
huhsdfjbsdf2394=
When that happens, for some reason my route doesn't get hit anymore.
What should I do to resolve this?
My route:
routes.MapRoute(
"FindByKeyRoute",
"Find",
new { controller = "Search", action = "FindByKey" }
);
If I have http://www.example.com/Find?Key=bla then it works.
If I have http://www.example.com/Find?Key=bla= then it doesn't work anymore.
Important Addition:
I'm writing against an IIS7 instance that doesn't allow % or similar encoding. That's why I didn't use UrlEncode to begin with.
EDIT: Original suggestion which apparently doesn't work
I'm sure the reason is that it thinks it's a query parameter called Key. Could you make it a parameter, with that part being the value, e.g.
www.example.com/Find?Route=Key=
I expect that would work (as the parser would be looking for an & to start the next parameter) but it's possible it'll confuse things still.
Suggestion which I believe will work
Alternatively, replace "=" in the base64 encoded value with something else on the way out, and re-replace it on the way back in, if you see what I mean. Basically use a different base64 decodabet.
Alternative suggestion which should work
Before adding base64 to the URL:
private static readonly char[] Base64Padding = new char[] { '=' };
...
base64 = base64.TrimEnd(Base64Padding);
Then before calling Convert.FromBase64String() (which is what I assume you're doing) on the inbound request:
// Round up to a multiple of 4 characters.
int paddingLength = (4 - (base64.Length % 4)) % 4;
base64 = base64.PadRight(base64.Length + paddingLength, '=');
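For clarity, a quick round trip of that trim/pad approach (values are illustrative):

byte[] payload = { 1, 2, 3, 4, 5 };
string base64 = Convert.ToBase64String(payload); // "AQIDBAU="
string forUrl = base64.TrimEnd(Base64Padding);   // "AQIDBAU" - nothing for the route to choke on

// On the way back in, restore the padding before decoding.
int paddingLength = (4 - (forUrl.Length % 4)) % 4;
byte[] decoded = Convert.FromBase64String(forUrl.PadRight(forUrl.Length + paddingLength, '='));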
If you're passing data in the URL you should probably URL-encode it, which would take care of the trailing =.
http://www.albionresearch.com/misc/urlencode.php
UrlEncode the encrypted (it is encrypted, right?) parameter.
If it is an encrypted string, beware that spaces and the + character will also get in your way.
Ok, so IIS 7 won't allow some special characters as part of your path. However, it would allow them if they were part of the query string.
It is apparently possible to change this with a registry hack, but I wouldn't recommend that.
What I would suggest, then, is to use an alternate token, as suggested by Mr Skeet, or simply not put it in your path at all: pass it as a query string value, where you CAN URL-encode it.
If it is an encrypted string (you haven't verified whether it is or not), you may in some cases get other 'illegal' characters. The query string really would be the way to go.
Except your sample shows it as a query string... So what gives? Where did you find an IIS that won't allow standard URI encoding as part of the query string?
Ok then. Thanks for the update.
RequestFiltering?
I see. Still, that mentions double-encoded values that it blocks. Did someone create a URL sequence rule to deny any request containing the '%' character? At that point you might want to not use the encrypted string at all, but generate a GUID or something else that is guaranteed not to contain special characters yet is not trivial to guess.