How to parse and validate Domain & Subdomain roots in C#

I've been looking through some of the classes, but I'm having a hard time finding an efficient way to parse (or regex-match) domains, both root domains and subdomains, including suffixes like .co.uk.
Is there a function that can validate whether a string is a proper domain/URL without actually connecting to the site? My goal is to run this over a large list of URLs and grab everything up to and including the TLD.

You'll have to tweak the regex for your particular situation, but this gives you a starting point:
const string pattern = @"http[s]?://(?<Domain>([a-zA-Z0-9\-]+?\.)*([a-zA-Z0-9\-]+\.)*([a-zA-Z]{3,61}|[a-zA-Z]{1,}\.[a-zA-Z]{2,3}))";
var regex = new Regex(pattern, RegexOptions.IgnoreCase);
var match = regex.Match(myURL);
var domain = match.Groups["Domain"].Value;
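As an alternative to the regex, here is a minimal sketch that leans on System.Uri to pull out the host without opening a connection. The class and method names (HostExtractor, TryGetHost) are made up for illustration. Note that this only checks syntax and returns the full host; finding the registrable root for suffixes like .co.uk would additionally need something like the Public Suffix List, which this sketch does not attempt.

```csharp
using System;

public static class HostExtractor
{
    // Hypothetical helper: returns the host of an absolute http/https URL,
    // or null when the input does not parse. No network access is involved.
    public static string TryGetHost(string url)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri))
            return null;

        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return null;

        // CheckHostName classifies the host purely by syntax (Dns, IPv4, IPv6, ...).
        return Uri.CheckHostName(uri.Host) == UriHostNameType.Dns ? uri.Host : null;
    }
}
```

Calling HostExtractor.TryGetHost on each entry of the list and skipping the null results keeps the validation and the extraction in one pass.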

Related

Regex for HTTP URL with Basic authentication

In my application, I must read a URL and do something if the URL contains Basic authentication credentials. An example of such a URL is
http://username:password@example.com
Is the regular expression below a good fit for my task? I am to capture four groups into local variables. The URL is passed to another internal library that will do further work to ensure the URL is valid before opening a connection.
^(.+?//)(.+?):(.+?)@(.+)$

It looks ok, and I think that a regular expression is good to use in this case. A couple of suggestions:
1) I think that named groups would make your code more readable, i.e.:
^(?<protocol>.+?//)(?<username>.+?):(?<password>.+?)@(?<address>.+)$
Then you can simply write
Match match = Regex.Match(input, pattern);
if (match.Success) {
    string user = match.Groups["username"].Value;
}
2) Then you could make the expression a little more strict, e.g. using \w where possible instead of .:
^(?<protocol>\w+://)...

Your regex seems OK, but why not use the thoroughly-tested and nearly-compliant Uri class? It's then trivial to access the pieces you want without worrying about spec-compatibility:
var url = new Uri("http://username:password@example.com");
var userInfo = url.UserInfo.Split(':');
var username = userInfo[0];
var password = userInfo[1];
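A slightly more defensive sketch of the same Uri-based idea, for URLs that may carry no credentials at all or a password that itself contains a colon. The names BasicAuthUrl and TryGetCredentials are hypothetical, and whether you want the values percent-decoded (as done here) depends on what the downstream library expects.

```csharp
using System;

public static class BasicAuthUrl
{
    // Hypothetical helper: extracts "user:password" from the URL's user-info part.
    public static bool TryGetCredentials(string url, out string username, out string password)
    {
        username = null;
        password = null;

        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri) || string.IsNullOrEmpty(uri.UserInfo))
            return false;

        // Split on the first ':' only, so a password containing ':' stays intact.
        string[] parts = uri.UserInfo.Split(new[] { ':' }, 2);
        if (parts.Length != 2)
            return false;

        username = Uri.UnescapeDataString(parts[0]);
        password = Uri.UnescapeDataString(parts[1]);
        return true;
    }
}
```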

C# trying to extract URLs from a webpage that have extensions .com, .net and .org

I am trying to extract .com, .net and .org links from a single webpage that contains a varying number of them. I am just learning regex in C#, and I am not sure how to set up a pattern that looks for just the .com, .net and .org extensions and then prints the URLs with those endings. Any suggestions, or websites you can direct me to, would be great.
Here is what I have so far:
WebClient client = new WebClient();
string extPattern = @"?.com|?.net|?.org"; // but I think I am not doing this right.
string source = client.DownloadString(url); // read the URL and store the page.
// then not sure what to do.
Thanks

Try this regex:
string extPattern = @"(http://)?[a-z0-9\-\.]+(\.com|\.net|\.org)";
Anyway, this is not a perfect way to achieve your goal, because URLs vary widely (they may use http or https, with or without www).

It partially depends on the format you expect the input string to be in. The following pattern assumes each URL to be on a separate line:
(.+\.com|.+\.net|.+\.org)\s
This may or may not be what you need depending on the input format. You'll need to give more information if you want anything more useful.
Some decent online resources for testing .NET regexes are:
http://gskinner.com/RegExr/
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Or is the problem that you don't know how to use the .NET regex classes? There are many questions on this very site that could help you there.
If you're just looking for a regex to match a URL then you will find one on here:
http://regexlib.com/DisplayPatterns.aspx?cattabindex=1&categoryId=2

Convert your downloaded data into a string and use a regex like this:
Regex myRegex = new Regex(@"(http://)?[a-z0-9\-]+(\.com|\.net|\.org)");
MatchCollection collection = myRegex.Matches(downloadedData);
for (int i = 0; i < collection.Count; i++)
{
    // Print each match in turn (indexing with 0 would repeat the first one).
    Debug.WriteLine(collection[i]);
}
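Another way to sketch this is to match URLs loosely and leave the suffix check to System.Uri, so that paths or query strings containing ".com" do not cause false hits. LinkExtractor and its members are illustrative names, and the pattern only looks for absolute http/https URLs, which may be narrower than what the page actually contains.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Loose pattern for absolute http/https URLs; the suffix is checked afterwards.
    private static readonly Regex UrlPattern =
        new Regex(@"https?://[^\s""'<>]+", RegexOptions.IgnoreCase);

    private static readonly string[] WantedSuffixes = { ".com", ".net", ".org" };

    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in UrlPattern.Matches(html))
        {
            if (!Uri.TryCreate(m.Value, UriKind.Absolute, out Uri uri))
                continue;

            // Compare against the parsed host only, not the whole URL text.
            foreach (string suffix in WantedSuffixes)
            {
                if (uri.Host.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
                {
                    results.Add(m.Value);
                    break;
                }
            }
        }
        return results;
    }
}
```

The string returned by client.DownloadString(url) can be passed straight to LinkExtractor.Extract.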

Regex : domains separated by semicolon

I need a regex to use in a C# app for input with the following structure:
domains separated by semicolons
Valid example:
domain1.com;domain2.org;subdomain.domain.net
How can I do that with a single regex?

Given my aversion to regex in general, I am compelled to post an alternative (banking on the fact that you're going to need the individual domain representations as separate entities at some point anyway):
string[] domains = delimitedDomains.Split(';');
Where delimitedDomains is a string of domains in the format mentioned and the result being an array of each individual domain entry found.
It may well be that there are more cases to your scenario than precisely detailed, however, if this is the extent of your goal and the input then this should suffice quite well.

Just:
var domains = example.Split(";".ToCharArray(),
    StringSplitOptions.RemoveEmptyEntries);

Use this one:
[^\;]+

Could try
^([^\.;]+\.[^\.;]+;)*$
Depending on if you want specific domain names you may have to alter it

You can use [^;]+ but C# has a Split function which will work well for this (if you can avoid regex, I would do so):
http://coders-project.blogspot.com/2010/05/split-function-in-c.html

Building on Mr. Disappointment's suggestion: I don't know everything the Uri.IsWellFormedUriString method checks but, in theory, if you want to perform both steps (separate and validate) in one, I would think you could use LINQ to do something like this:
(edited to use Uri.CheckHostName instead of Uri.IsWellFormedUriString)
string src = "domain1.net;domain2.net; ...";
string[] domains = src.Split(';').Where(x => Uri.CheckHostName(x) != UriHostNameType.Unknown).ToArray();
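For completeness, here is the split-and-validate idea as a self-contained sketch that also trims whitespace around each entry and drops empty ones. DomainList.Parse is a made-up name, and restricting the result to UriHostNameType.Dns (rather than anything other than Unknown) is an assumption about wanting host names only, not IP addresses.

```csharp
using System;
using System.Linq;

public static class DomainList
{
    // Hypothetical helper: splits on ';' and keeps syntactically valid DNS names.
    public static string[] Parse(string delimited)
    {
        return delimited
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(d => d.Trim())
            .Where(d => Uri.CheckHostName(d) == UriHostNameType.Dns)
            .ToArray();
    }
}
```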

Conditional Regex Replace in C# without MatchEvaluator

So, I'm trying to make a program to rename some files. For the most part, I want them to look like this:
[Testing]StupidName - 2[720p].mkv
But I would like to be able to change the format if desired. If I use MatchEvaluators, I would have to recompile every time the format changes. That's why I don't want to use a MatchEvaluator.
The problem is that I don't know how, or whether it's possible, to tell Replace to include a string only if a group was found. The only syntax for this I have ever seen was something like (?<group>:data), but I can't get this to work. If anyone has an idea, I'm all for it.
EDIT:
Current Capture Regexes =
^(\[(?<FanSub>[^\]\)\}]+)\])?[. _]*(?<SeriesTitle>[\w. ]*?)[. _]*\-[. _]*(?<EpisodeNumber>\d+)[. _]*(\-[. _]*(?<EpisodeName>[\w. ]*?)[. _]*)?([\[\(\{](?<MiscInfo>[^\]\)\}]*)[\]\)\}][. _]*)*[\w. ]*(?<Extension>\.[a-zA-Z]+)$
^(?<SeriesTitle>[\w. ]*?)[. _]*[Ss](?<SeasonNumber>\d+)[Ee](?<EpisodeNumber>\d+).*?(?<Extension>\.[a-zA-Z]+)$
^(?<SeriesTitle>[\w. ]*?)[. _]*(?<SeasonNumber>\d)(?<EpisodeNumber>\d{2}).*?(?<Extension>\.[a-zA-Z]+)$
Current Replace Regex = [${FanSub}]${SeriesTitle} - ${EpisodeNumber} [${MiscInfo}]${Extension}
Using Regex.Replace on the file TestFile 101.mkv, I get []TestFile - 1[].mkv. What I want is for the [] to be included only if the group FanSub or MiscInfo was found.
I can solve this with a MatchEvaluator because I actually get to compile a function, but that would not be an easy solution for users of the program. The only other idea I have is to write my own Regex.Replace function that accepts special syntax.

It sounds like you want to be able to specify an arbitrary format dynamically rather than hard-code it into your code.
Perhaps one solution is to break your filename parts into specific groups then pass in a replacement pattern that takes advantage of those group names. This would give you the ability to pass in different replacement patterns which return the desired filename structure using the Regex.Replace method.
Since you didn't explain the categories of your filename I came up with some random groups to demonstrate. Here's a quick example:
string input = "Testing StupidName Number2 720p.mkv";
string pattern = @"^(?<Category>\w+)\s+(?<Name>.+?)\s+Number(?<Number>\d+)\s+(?<Resolution>\d+p)(?<Extension>\.mkv)$";
string[] replacePatterns =
{
"[${Category}]${Name} - ${Number}[${Resolution}]${Extension}",
"${Category} - ${Name} - ${Number} - ${Resolution}${Extension}",
"(${Number}) - [${Resolution}] ${Name} [${Category}]${Extension}"
};
foreach (string replacePattern in replacePatterns)
{
Console.WriteLine(Regex.Replace(input, pattern, replacePattern));
}
As shown in the sample, named groups in the pattern, specified as (?<Name>pattern), are referred to in the replacement pattern by ${Name}.
With this approach you would need to know the group names beforehand and pass these in to rearrange the pattern as needed.
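To get the conditional part the question asks about (emit the brackets only when a group matched), one possibility is a thin wrapper that understands a small optional-section syntax in the replacement string. The syntax <?Group:text?> and the names OptionalTemplate and Rename are invented for this sketch; they are not part of .NET. Internally it does use a lambda, but the format string stays plain data that users can edit without recompiling.

```csharp
using System.Text.RegularExpressions;

public static class OptionalTemplate
{
    // Invented syntax: <?Group:text?> keeps "text" only when Group captured something.
    private static readonly Regex OptionalSection =
        new Regex(@"<\?(?<name>\w+):(?<body>.*?)\?>");

    public static string Rename(string input, string pattern, string template)
    {
        Match match = Regex.Match(input, pattern);
        if (!match.Success)
            return input; // leave names that do not match untouched

        // Step 1: keep or drop each optional section based on the capture.
        string resolved = OptionalSection.Replace(template, section =>
        {
            Group g = match.Groups[section.Groups["name"].Value];
            return g.Success && g.Length > 0 ? section.Groups["body"].Value : string.Empty;
        });

        // Step 2: let the regular ${Name} substitution expand what is left.
        return match.Result(resolved);
    }
}
```

With a template such as <?FanSub:[${FanSub}]?>${SeriesTitle} - ${EpisodeNumber}${Extension}, the leading brackets appear only for files whose name actually carried a fansub tag.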

Need C# regexp for URL validation

How can I validate, with a single regular expression, that the URLs
http://83.222.4.42:8880/listen.pls
http://www.my_site.com/listen.pls
http://www.my.site.com/listen.pls
are valid?
I see that I did not formulate the question precisely, sorry, my mistake. The idea is that I want to use a regexp to validate URLs whose host is either an external IP address or a domain name. Other valid URLs to consider:
http://93.122.34.342/
http://193.122.34.342/abc/1.html
http://www.my_site.com/listen2.pls
http://www.my.site.com/listen.php
and so on.

The road to hell is paved with string parsing.
URL parsing in particular is the source of many, many exploited security issues. Don't do it.
For example, do you want this to match?
Note the uppercase scheme section. Remember that some parts of a URL are case sensitive, and some are not. Then there's encoding rules. Etc.
Start by using System.Uri to parse the URLs you provide:
var uri = new Uri("http://83.222.4.42:8880/listen.pls");
Then you can write things like:
if (uri.Scheme == "http" &&
uri.Host == "83.222.4.42" &&
uri.AbsolutePath == "/listen.pls"
)
{
// ...
}
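If the goal is a general yes/no check rather than a comparison against one known URL, the same System.Uri approach can be sketched as follows. StreamUrlValidator.IsHttpUrl is a hypothetical name, and accepting only DNS names and IPv4 addresses as hosts is an assumption drawn from the examples in the question. This is a syntax check only; how hosts containing underscores are classified is worth verifying on your target runtime.

```csharp
using System;

public static class StreamUrlValidator
{
    // Hypothetical helper: true for absolute http/https URLs whose host is
    // classified by System.Uri as a DNS name or an IPv4 address.
    public static bool IsHttpUrl(string candidate)
    {
        return Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
            && (uri.HostNameType == UriHostNameType.Dns
                || uri.HostNameType == UriHostNameType.IPv4);
    }
}
```

Because Uri normalizes the scheme to lower case, an input starting with HTTP:// passes the same check without extra handling.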

^http://.+/listen\.pls$

If there are strictly only 3 of them don't bother with a regular expression because there is not necessarily a good pattern match when everything is already strictly known - in fact you might accidentally match more than these three urls - which becomes a problem if the urls are intended for security purposes or something equally important. Instead, test the three cases directly - maybe put them in a configuration file.
In the future if you want to add more URLs to the list you'll likely end up with an overly complicated regular expression that's increasingly hard to maintain and takes the place of a simpler check against a small list.
You won't necessarily get speed gains by running Regex to find these three strings - in fact it might be quite expensive.
Note: If you want URI regular expressions, also try websites that host pattern libraries, such as Regex Library; there are many to pick and choose from if your needs change.

/^http:\/\/[-_a-zA-Z0-9.]+(:\d+)?\/listen\.pls$/

Do you mean any URL ending with /listen.pls? In that case try this:
^http://[^/]+/listen\.pls$
or, if the protocol identifier should be optional (note the group rather than a character class):
^(http://)?[^/]+/listen\.pls$
Anyway take a look here, maybe it is useful for you: Url and Email validation using Regex

A modified version based on Jay Bazuzi's solution above, posted as an answer since I can't put code in a comment. It checks against a blacklist of extensions (for demonstration purposes only; you should strongly consider building a whitelist rather than a blacklist):
string myurl = "http://www.my_site.com/listen.pls";
Uri myUri = new Uri(myurl);
string[] invalidExtensions = {
".pls",
".abc"
};
foreach (string invalidExtension in invalidExtensions) {
    // Compare case-insensitively so that ".PLS" is caught as well.
    if (string.Equals(invalidExtension, System.IO.Path.GetExtension(myUri.AbsolutePath), StringComparison.OrdinalIgnoreCase)) {
        //Logic here
    }
}
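The whitelist variant recommended above could look roughly like this. The set of allowed extensions and the names ExtensionPolicy and HasAllowedExtension are placeholders for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExtensionPolicy
{
    // Placeholder allow-list; replace with the extensions you actually accept.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pls", ".php", ".html" };

    public static bool HasAllowedExtension(string url)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri))
            return false;

        // AbsolutePath excludes the query string, so "?x=.exe" cannot interfere.
        string extension = Path.GetExtension(uri.AbsolutePath);
        return Allowed.Contains(extension);
    }
}
```

Anything not explicitly listed is rejected, including URLs with no extension at all.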
