Regular Expression Help Needed to Parse Domain Name - c#

I have a regular expression that returns the top level domain of a URL regardless of whether it is .com, .com.au, etc. and parses out any subdomains. I need to modify it to return both the top level domain and the first subdomain. So basically if I have for the input
http://test1.hello.mydomain.com.au
it should return
hello.mydomain
Can someone help me with this? Here is what I have for grabbing just the top level domain:
(?<=^(?:(?:ht|f)tps?)?://)[^/]+?(?=(?:\.(?:[a-z]{2,3}?\.[a-z]{2}|[a-z]{2,3}))(?:/|$))

This is not a problem that can be solved using regular expressions alone. You are looking for the Public Suffix List, which contains program-readable information about how to split up domain names in the way you describe.
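As a rough illustration of the lookup that answer describes, here is a minimal C# sketch. The class name, method name, and the tiny hard-coded suffix set are all hypothetical; a real program would load the full Public Suffix List, which also has wildcard and exception rules that this sketch ignores.

```csharp
using System;
using System.Collections.Generic;

public static class DomainSplitter
{
    // Hypothetical sample only: a real program would load the full Public Suffix List.
    private static readonly HashSet<string> SampleSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "com", "au", "com.au", "uk", "co.uk" };

    // Returns the first subdomain plus the registered name, without the public suffix.
    public static string FirstSubdomainAndName(string url)
    {
        string[] labels = new Uri(url).Host.Split('.');

        // Scan from the left so the longest suffix found in the list wins.
        for (int i = 0; i < labels.Length; i++)
        {
            string candidate = string.Join(".", labels, i, labels.Length - i);
            if (SampleSuffixes.Contains(candidate))
            {
                // Keep at most two labels immediately before the suffix.
                int start = Math.Max(0, i - 2);
                return string.Join(".", labels, start, i - start);
            }
        }
        return null; // no suffix from the sample set was found
    }
}
```

With the sample set above, DomainSplitter.FirstSubdomainAndName("http://test1.hello.mydomain.com.au") should yield hello.mydomain.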

Related

How to Exclude Subdomain From URI [duplicate]

I get url as
http://orders.mealsandyou.com/default.php
I don't want to use string functions to get the main domain, i.e.
mealsandyou.com
Is there any function in C# to do that? Uri.Authority and the like return the subdomain too.
Suggestions welcome, not workarounds.
.Net doesn't provide a built-in feature to extract specific parts from Uri.Host. You will have to use string manipulation or a regular expression yourself.
The only constant part of the domain string is the TLD. The TLD is the very last part of the domain string, e.g. .com, .net, .uk. Everything before it depends on the particular TLD for its position, so you can't assume the next-to-last part is the "domain name": for .co.uk it would be .co.
In any case I think you're taking the wrong approach. URL rewriting is far more suited to this sort of thing. Have a read of this: learn.iis.net/page.aspx/460/using-the-url-rewrite-module

Match expressions in Strings

I have a database here with certain rules I need to apply to a bunch of Strings; the rules are expressions that can occur within the Strings. They are expressed like
(word1 AND word2) OR (word3)
I can't hardcode those (because they may be changed in the database), so I thought about programmatically turning those expressions into Regex patterns.
Has anybody done such a task yet or has an idea on how to do this the best way?
I'm not quite sure how to deal with more complex expressions, how to take them apart and so on.
Edit: I'm using C# in VisualStudio / .NET.
The data is basically directory paths, a customer wants to get their documents organized, so the String I'm having are paths, the expressions in the DB could look like:
(office OR headquarter) AND (official OR confidential)
So if the file's directory path contains office and confidential, it should match.
Hope this makes it clearer.
EDIT2:
Here are some dummy examples:
The paths could look like:
c:\documents\official\johnmeyer\court\out\letter.doc
c:\documents\internal\appointments\court\in\september.doc
c:\documents\official\stevemiller\meeting\in\letter.doc
And the expressions like:
(meyer or miller) AND (court OR jail)
So this expression would match the 1st path/ file, but not the 2nd and 3rd one.
No answer, but a good hint:
The expressions you have are actual trees constructed by the parentheses. You need a stack machine to parse the text into a (binary) tree structure, where each node is an AND or OR element and the leaves are the words.
Afterwards, you can simply construct your regex in whatever language you need by walking the tree using depth first search and adding prefix and suffix data as needed before/after reading the subtree.
Consider an abstract class TreeNode having a method GenerateExpression(StringBuilder result).
Each actual TreeNode item will be either a CombinationTreeNode (with a CombinationMode And/Or) or a SearchTextTreeNode (with a SearchText property).
GenerateExpression(StringBuilder result) for CombinationTreeNode will look similar to this:
result.Append("(");
leftSubTree.GenerateExpression(result);
result.Append(") " + this.CombinationMode.ToString() + " (");
rightSubTree.GenerateExpression(result);
result.Append(")");
GenerateExpression(StringBuilder result) for SearchTextTreeNode is much easier:
result.Append(this.SearchText);
Of course, your code will produce a regular expression instead of the input text, as mine does.
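One possible way to make the tree emit a regular expression, sketched under some assumptions: each word becomes a lookahead, AND concatenates the lookaheads, and OR wraps them in an alternation. The ToPattern and IsMatch methods are hypothetical names, and the parser that builds the tree from the database text is not shown.

```csharp
using System;
using System.Text.RegularExpressions;

public abstract class TreeNode
{
    // Produces a zero-width pattern that succeeds when this subtree matches.
    public abstract string ToPattern();

    public bool IsMatch(string path) =>
        Regex.IsMatch(path, "^" + ToPattern(), RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

public sealed class SearchTextTreeNode : TreeNode
{
    private readonly string searchText;
    public SearchTextTreeNode(string searchText) { this.searchText = searchText; }

    // Lookahead: the word occurs somewhere in the input.
    public override string ToPattern() => "(?=.*" + Regex.Escape(searchText) + ")";
}

public enum CombinationMode { And, Or }

public sealed class CombinationTreeNode : TreeNode
{
    private readonly TreeNode left, right;
    private readonly CombinationMode mode;

    public CombinationTreeNode(TreeNode left, CombinationMode mode, TreeNode right)
    { this.left = left; this.mode = mode; this.right = right; }

    public override string ToPattern() =>
        mode == CombinationMode.And
            ? left.ToPattern() + right.ToPattern()                      // both lookaheads must succeed
            : "(?:" + left.ToPattern() + "|" + right.ToPattern() + ")"; // either one may succeed
}
```

For (meyer OR miller) AND (court OR jail), this would generate (?:(?=.*meyer)|(?=.*miller))(?:(?=.*court)|(?=.*jail)), which is then anchored at the start of the path.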

Regular Expression to validate a URL or domain name.

Can someone please let me know what is wrong with my regular expression? I’m trying to just validate the beginning to URLs, mainly just host names (i.e. www.yahoo.com).
Regular Expression: ^(((ht|f)tp(s?))\:\/\/)?(www.)?([a-zA-Z0-9\-\.]{1,63})+\.([a-zA-Z]{2,5})$
Testing Values:
test.com – passes
test.c2om – fails
test.test.com – passes
test.test.c2om – fails
test.test.test.com – passes
test.test.test.c2om – INVALID REGEX PATTERN
This should return false, but instead returns nothing, both using javascript and c#… If you remove the {1,63} restriction on the size of the subdomain, it works…
You've created a pattern prone to catastrophic backtracking: the engine will try to match ([a-zA-Z0-9\-\.]{1,63})+ in many different ways until it fails. A simple solution is to remove {1,63}; as you've noted, it doesn't seem to be adding anything anyway.
Another option is to use the dots as anchors, so the engine cannot backtrack between them (this leaves only one way to match the text, which is presumably what you're trying to do):
([a-zA-Z0-9\-]{1,63}\.)*[a-zA-Z0-9\-]{1,63}
Keep in mind that it is no longer correct to assume only ASCII English letters in domain names. For example, http://אתר.קום is a legal (and working) URL.
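A minimal C# sketch of the dot-anchored approach, assuming only the host name is checked and that the final label follows the {2,5} letter rule from the question. The HostNameCheck class is hypothetical, and the match timeout is an extra safeguard rather than part of the original answer. Like the original pattern, it covers ASCII names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostNameCheck
{
    // Labels are separated by literal dots, so each character can be consumed in only one way.
    private static readonly Regex HostPattern = new Regex(
        @"^(?:[a-zA-Z0-9\-]{1,63}\.)+[a-zA-Z]{2,5}$",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250)); // abort any match that runs unexpectedly long

    public static bool IsValid(string host)
    {
        try { return HostPattern.IsMatch(host); }
        catch (RegexMatchTimeoutException) { return false; }
    }
}
```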

Need C# regexp for URL validation

How to validate by a single regular expression the urls:
http://83.222.4.42:8880/listen.pls
http://www.my_site.com/listen.pls
http://www.my.site.com/listen.pls
to be true?
I see that I formulated the question not exactly :(, sorry my mistake. The idea is that I want to validate with the help of regexp valid urls, let it be an external ip address or the domain name. This is the idea, other valid urls can be considered:
http://93.122.34.342/
http://193.122.34.342/abc/1.html
http://www.my_site.com/listen2.pls
http://www.my.site.com/listen.php
and so on.
The road to hell is paved with string parsing.
URL parsing in particular is the source of many, many exploited security issues. Don't do it.
For example, do you want this to match?
Note the uppercase scheme section. Remember that some parts of a URL are case sensitive, and some are not. Then there's encoding rules. Etc.
Start by using System.Uri to parse the URLs you provide:
var uri = new Uri("http://83.222.4.42:8880/listen.pls");
Then you can write things like:
if (uri.Scheme == "http" &&
    uri.Host == "83.222.4.42" &&
    uri.AbsolutePath == "/listen.pls")
{
    // ...
}
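If the input may not be a well-formed URL at all, the Uri constructor throws, so a non-throwing check can be useful. A minimal sketch, assuming only http and https URLs should be accepted; the UrlCheck class and IsHttpUrl method are hypothetical names.

```csharp
using System;

public static class UrlCheck
{
    // Returns true when the text parses as an absolute http or https URL.
    public static bool IsHttpUrl(string text)
    {
        Uri uri;
        return Uri.TryCreate(text, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Further rules (allowed hosts, ports, or paths) can then be checked against the parsed Uri properties instead of the raw string.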
^http://.+/listen\.pls$
If there are strictly only 3 of them don't bother with a regular expression because there is not necessarily a good pattern match when everything is already strictly known - in fact you might accidentally match more than these three urls - which becomes a problem if the urls are intended for security purposes or something equally important. Instead, test the three cases directly - maybe put them in a configuration file.
In the future if you want to add more URLs to the list you'll likely end up with an overly complicated regular expression that's increasingly hard to maintain and takes the place of a simpler check against a small list.
You won't necessarily get speed gains by running Regex to find these three strings - in fact it might be quite expensive.
Note: If you want URI regular expressions, also try websites hosting pattern libraries, like Regex Library; there are many to pick and choose from if your needs change.
/^http:\/\/[-_a-zA-Z0-9.]+(:\d+)?\/listen\.pls$/
Do you mean any URL ending with /listen.pls? In that case try this:
^http://[^/]+/listen\.pls$
or if the protocol identifier must be optional:
^(http://)?[^/]+/listen\.pls$
Anyway take a look here, maybe it is useful for you: Url and Email validation using Regex
A modified version based on Jay Bazuzi's solution above, since I can't post code in a comment. It checks the URL against a blacklist of extensions (for demonstration purposes only; you should strongly consider building a whitelist rather than a blacklist):
string myurl = "http://www.my_site.com/listen.pls";
Uri myUri = new Uri(myurl);
string[] invalidExtensions = {
    ".pls",
    ".abc"
};
// Extension of the URL path, including the leading dot
string extension = System.IO.Path.GetExtension(myUri.AbsolutePath);
foreach (string invalidExtension in invalidExtensions) {
    // Compare case-insensitively so ".PLS" is caught as well
    if (invalidExtension.Equals(extension, StringComparison.OrdinalIgnoreCase)) {
        //Logic here
    }
}

Regular expression to define format of backup filenames

In the application I am currently working on, I have an option to create automatic backups of a certain file on the hard disk. What I would like to do is offer the user the possibility to configure the name of the file and its extension.
For example, the backup filename could be something like : "backup_month_year_username.bak". I had the idea to save the format in the form of a regular expression. For the example above, the regexp would look like :
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w).(?<extension>bak)$"
I thought about using a regex because I will also have to browse through the directory of backed-up files to delete those older than a certain date. The main trouble I have now is how to create a filename using the regex. In a way, I should replace the tags with the information. I could do that using Regex.Replace and another regex, but I feel it's a bit weird doing that, and there might be a better way.
Thanks
[Edit] Maybe I wasn't really clear at first, but the idea is of course that the user (in this case an admin who knows regex syntax) will be able to modify the form of the filename; that's the whole idea behind it[/Edit]
... and if the regex changes, it is next to impossible to reconstruct a string from a given regex.
Edit:
Create some predefined "place-holders": %u could be the user's name, %y could be the year, etc.:
backup_%m_%y_%u.bak
and then simply replace the %? with their actual values.
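A minimal C# sketch of that substitution, assuming %m is a two-digit month, %y a two-digit year, and %u the user name. The BackupNameFormatter class is a hypothetical name, and %u is replaced last so that a user name containing %m or %y is not expanded again.

```csharp
using System;
using System.Globalization;

public static class BackupNameFormatter
{
    // Expands %m (month), %y (two-digit year) and %u (user name) in a template.
    public static string Format(string template, DateTime date, string userName)
    {
        return template
            .Replace("%m", date.ToString("MM", CultureInfo.InvariantCulture))
            .Replace("%y", date.ToString("yy", CultureInfo.InvariantCulture))
            .Replace("%u", userName);
    }
}
```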
It sounds like you're trying to use the regular expression to create the file name from a pattern which the user should be able to specify.
Regular expressions can - AFAIK - not be used to create output, but only to validate input, so you'd have the user specify two things:
a file name production pattern like Bart suggested
a validation pattern in form of a regular expression that helps you split the file names into their parts
EDIT
By the way, your sample regex contains two errors: the unescaped "." matches any character, and \w matches only a single word character, so I guess you meant to write
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"
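For the validation side, the named groups in the corrected pattern can be read back directly. A minimal sketch, assuming the parts are wanted as strings; the BackupNameParser class and TryParse method are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class BackupNameParser
{
    private static readonly Regex Pattern = new Regex(
        @"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$");

    // Returns false when the file name does not follow the backup format.
    public static bool TryParse(string fileName, out string month, out string year, out string userName)
    {
        Match m = Pattern.Match(fileName);
        month = m.Success ? m.Groups["Month"].Value : null;
        year = m.Success ? m.Groups["Year"].Value : null;
        userName = m.Success ? m.Groups["Username"].Value : null;
        return m.Success;
    }
}
```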
If the filename is always in this form, there is no reason for a regex, as it's easier to process with string.Split ...
With Bart's solution it is easy enough to split (using string.Split) the generated file name using underscore as the delimiter, to get back the information.
Ok, I think I have found a way to use only the regex. As I am using groups to get the information, I will use another regular expression to match the regular expression and replace the groups with the value:
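// Verbatim strings (@"...") keep the backslashes intact for the regex engine
Regex rgx = new Regex(@"\(\?<Month>.+?\)");
string fileName = rgx.Replace(@"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"
, DateTime.Now.Month.ToString("00")); // two digits, to agree with \d{2}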
Ok, it's really a hack, but at least it works and I have only one pattern defined by the user. It might not work if the regex is too complex, but I think I can deal with that problem.
What do you think?
