How to Extract Domain name from string with Regex in C#?


I want to extract top-level domain names and country-code top-level domain names from a string with Regex. I tested many patterns, such as this code:
var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match m = linkParser.Match(Url);
Console.WriteLine(m.Value);
But none of them worked properly.
The string entered by the user can take any of the following forms:
jonasjohn.com
http://www.jonasjohn.de/snippets/csharp/
jonasjohn.de
www.jonasjohn.de/snippets/csharp/
http://www.answers.com/article/1194427/8-habits-of-extraordinarily-likeable-people
http://www.apple.com
https://www.cnn.com.au
http://www.downloads.news.com.au
https://ftp.android.co.nz
http://global.news.ca
https://www.apple.com/
https://ftp.android.co.nz/
http://global.news.ca/
https://www.apple.com/
https://johnsmith.eu
ftp://johnsmith.eu
johnsmith.gov.ae
johnsmith.eu
www.jonasjohn.de
www.jonasjohn.ac.ir/snippets/csharp
http://www.jonasjohn.de/
ftp://www.jonasjohn.de/
https://subdomain.abc.def.jonasjohn.de/test.htm
The Regex I tested:
^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)
\b(?:https?://|www\.)\S+\b
://(?<host>([a-z\\d][-a-z\\d]*[a-z\\d]\\.)*[a-z][-a-z\\d]+[a-z])
and many others.
I just need the domain name and I don't need a protocol or a subdomain.
Like:
Domainname.gTLD or DomainName.ccTLD or DomainName.xyz.ccTLD
I got the list of them from the Public Suffix List.
Of course, I've seen a lot of posts on stackoverflow.com, but none of it answered me.

You don't need a Regex to parse a URL. If you have a valid URL, you can use one of the Uri constructors or Uri.TryCreate to parse it:
if(Uri.TryCreate("http://google.com/asdfs",UriKind.RelativeOrAbsolute,out var uri))
{
Console.WriteLine(uri.Host);
}
www.jonasjohn.de/snippets/csharp/ and jonasjohn.de/snippets/csharp/ aren't valid URLs though. TryCreate can still parse them as relative URLs, but reading Host throws System.InvalidOperationException: This operation is not supported for a relative URI.
In that case you can use the UriBuilder class to parse and modify the URL, e.g.:
var bld=new UriBuilder("jonasjohn.com");
Console.WriteLine(bld.Host);
This prints
jonasjohn.com
Setting the Scheme property produces a valid, complete URL:
bld.Scheme="https";
Console.WriteLine(bld.Uri);
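This produces:
https://jonasjohn.com:80/
The port is still 80 because UriBuilder assumed http when it parsed the scheme-less string. Setting bld.Port = -1 makes it fall back to the default port of the new scheme.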
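As a minimal sketch of the approach above, the UriBuilder parsing can be wrapped in a helper that returns only the host. The class name HostParser and the method name GetHost are made up for illustration, and returning an empty string on malformed input is an assumption, not part of the original answer.

```csharp
using System;

public static class HostParser
{
    // Hypothetical helper: returns the host part of a full or scheme-less URL.
    // UriBuilder prepends "http://" when the input has no scheme.
    public static string GetHost(string input)
    {
        try
        {
            return new UriBuilder(input).Host;
        }
        catch (UriFormatException)
        {
            // Assumption: callers prefer an empty string over an exception.
            return string.Empty;
        }
    }
}
```

Note that the result still contains subdomains such as www, so it is only the first step towards the bare domain name.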

Based on Lidqy's answer, I wrote this function, which I think covers most of the possible inputs. If the input falls outside these cases, you can throw an exception instead of returning an empty string.
public static string ExtractDomainName(string Url)
{
// Strip an optional scheme and a leading "www.", then capture everything up to the first slash.
var regex = new Regex(@"^((https?|ftp)://)?(www\.)?(?<domain>[^/]+)(/|$)");
Match match = regex.Match(Url);
if (!match.Success)
{
return String.Empty;
}
string domain = match.Groups["domain"].Value;
// Drop leading labels until at most two dots remain (requires using System.Linq).
while (domain.Count(x => x == '.') > 2)
{
domain = domain.Split('.', 2)[1];
}
return domain;
}

var rx = new Regex(@"^((https?|ftp)://)?(www\.)?(?<domain>[^/]+)(/|$)");
var data = new[] { "jonasjohn.com",
"http://www.jonasjohn.de/snippets/csharp/",
"jonasjohn.de",
"www.jonasjohn.de/snippets/csharp/",
"http://www.answers.com/article/1194427/8-habits-of-extraordinarily-likeable-people",
"http://www.apple.com",
"https://www.cnn.com.au",
"http://www.downloads.news.com.au",
"https://ftp.android.co.nz",
"http://global.news.ca",
"https://www.apple.com/",
"https://ftp.android.co.nz/",
"http://global.news.ca/",
"https://www.apple.com/",
"https://johnsmith.eu",
"ftp://johnsmith.eu",
"johnsmith.gov.ae",
"johnsmith.eu",
"www.jonasjohn.de",
"www.jonasjohn.ac.ir/snippets/csharp",
"http://www.jonasjohn.de/",
"ftp://www.jonasjohn.de/",
"https://subdomain.abc.def.jonasjohn.de/test.htm"
};
foreach (var dat in data) {
var match = rx.Match(dat);
if (match.Success)
Console.WriteLine("{0} => {1}", dat, match.Groups["domain"].Value);
else {
Console.WriteLine("{0} => NO MATCH", dat);
}
}
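Counting dots cannot tell "ftp.android.co.nz" (subdomain plus two-label suffix) from "def.jonasjohn.de" (two subdomain levels), so one possible refinement is to check the host against a list of known multi-label suffixes. The following is only a sketch: the class and method names are hypothetical, and the four suffixes are a tiny hand-picked sample that a real program would replace with data loaded from the Public Suffix List mentioned in the question.

```csharp
using System;
using System.Collections.Generic;

public static class DomainExtractor
{
    // Assumption: a small sample of two-label public suffixes, for illustration only.
    private static readonly HashSet<string> MultiLabelSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com.au", "co.nz", "gov.ae", "ac.ir"
        };

    // Takes a host name (no scheme, no path) and keeps the suffix plus one label.
    public static string GetRegistrableDomain(string host)
    {
        string[] labels = host.Split('.');
        if (labels.Length < 2)
        {
            return host;
        }

        string lastTwo = labels[labels.Length - 2] + "." + labels[labels.Length - 1];
        // Keep three labels for a known two-label suffix, otherwise two.
        int keep = MultiLabelSuffixes.Contains(lastTwo) ? 3 : 2;
        if (labels.Length <= keep)
        {
            return host;
        }

        return string.Join(".", labels, labels.Length - keep, keep);
    }
}
```

The input is expected to be the host captured by the regex above (the "domain" group), not the full URL.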

Related

How to get all files ending with the extension "_<fileNum>of<totalFileNum>" and sometimes without? [duplicate]

A user specifies a file name that can be either in the form "<name>_<fileNum>of<fileNumTotal>" or simply "<name>". I need to somehow extract the "<name>" part from the full file name.
Basically, I am looking for a solution to the method "ExtractName()" in the following example:
string fileName = "example_File"; // This variable is specified by the user
string extractedName = ExtractName(fileName); // Must return "example_File"
fileName = "example_File2_1of5";
extractedName = ExtractName(fileName); // Must return "example_File2"
fileName = "examp_File_3of15";
extractedName = ExtractName(fileName); // Must return "examp_File"
fileName = "example_12of15";
extractedName = ExtractName(fileName); // Must return "example"
Edit: Here's what I've tried so far:
string ExtractName(string fullName)
{
return fullName.Substring(0, fullName.LastIndexOf('_'));
}
But this clearly does not work for the case where the full name is just "<name>".
Thanks

This would be easier to parse using Regex, because you don't know how many digits either number will have.
var inputs = new[]
{
"example_File",
"example_File2_1of5",
"examp_File_3of15",
"example_12of15"
};
var pattern = new Regex(@"^(.+)(_\d+of\d+)$");
foreach (var input in inputs)
{
var match = pattern.Match(input);
if (!match.Success)
{
// file doesn't end with "#of#", so use the whole input
Console.WriteLine(input);
}
else
{
// it does end with "#of#", so use the first capture group
Console.WriteLine(match.Groups[1].Value);
}
}
This code returns:
example_File
example_File2
examp_File
example
The Regex pattern has three parts:
^ and $ are anchors to ensure you capture the entire string, not just a subset of characters.
(.+) - match everything, be as greedy as possible.
(_\d+of\d+) - match "_#of#", where "#" can be any number of consecutive digits.
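The same pattern can be wrapped into the ExtractName() method the question asks for. This is a minimal sketch; the class name FileNameParser is made up for illustration, and the second capture group is dropped because only the name part is needed.

```csharp
using System.Text.RegularExpressions;

public static class FileNameParser
{
    // Greedy (.+) followed by a mandatory "_<digits>of<digits>" suffix at the end.
    private static readonly Regex SuffixPattern = new Regex(@"^(.+)_\d+of\d+$");

    public static string ExtractName(string fileName)
    {
        Match match = SuffixPattern.Match(fileName);
        // No "_#of#" suffix: the whole input is the name.
        return match.Success ? match.Groups[1].Value : fileName;
    }
}
```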

Heading identification with Regex

I'm wondering how I can identify headings with differing numerical marking styles with one or more regular expressions, assuming styles sometimes overlap between documents. The goal is to extract all the subheadings and data for a specific heading in each file, but these files aren't standardized. Are regular expressions even the right approach here?
I'm working on a program that parses a .pdf file and looks for a specific section. Once it finds the section it finds all subsections of that section and their content and stores it in a dictionary<string, string>. I start by reading the entire pdf into a string, and then use this function to locate the "marking" section.
private string GetMarkingSection(string text)
{
int startIndex = 0;
int endIndex = 0;
bool startIndexFound = false;
Regex rx = new Regex(HEADINGREGEX);
foreach (Match match in rx.Matches(text))
{
if (startIndexFound)
{
endIndex = match.Index;
break;
}
if (match.ToString().ToLower().Contains("marking"))
{
startIndex = match.Index;
startIndexFound = true;
}
}
return text.Substring(startIndex, (endIndex - startIndex));
}
Once the marking section is found, I use this to find subsections.
private Dictionary<string, string> GetSubsections(string text)
{
Dictionary<string, string> subsections = new Dictionary<string, string>();
string[] unprocessedSubSecs = Regex.Split(text, SUBSECTIONREGEX);
string title = "";
string content = "";
foreach(string s in unprocessedSubSecs)
{
if(s != "") //sometimes it pulls in empty strings
{
Match m = Regex.Match(s, SUBSECTIONREGEX);
if (m.Success)
{
title = s;
}
else
{
content = s;
if (!String.IsNullOrWhiteSpace(content) && !String.IsNullOrWhiteSpace(title))
{
subsections.Add(title, content);
}
}
}
}
return subsections;
}
Getting these methods to work the way I want them to isn't an issue, the problem is getting them to work with each of the documents. I'm working on a commercial application so any API that requires a license isn't going to work for me.
These documents are anywhere from 1-16 years old, so the formatting varies quite a bit. Here is a link to some sample headings and subheadings from various documents. But to make it easy, here are the regex patterns I'm using:
Heading: (?m)^(\d+\.\d+\s[ \w,\-]+)\r?$
Subheading: (?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$
Master Key: (?m)^(\d\.?[\d.]*? ?[ \-,:\w]+) ?\r?$
Since some headings use the subheading format in other documents I am unable to use the same heading regex for each file, and the same goes for my subheading regex.
My alternative to this was that I was going to write a master key (listed in the regex link) to identify all types of headings and then locate the last instance of a numeric character in each heading (5.1.X) and then look for 5.1.X+1 to find the end of that section.
That's when I ran into another problem. Some of these files have absolutely no proper structure. Most of them go from 5.2->7.1.5 (5.2->5.3/6.0 would be expected)
I'm trying to wrap my head around a solution for something like this, but I've got nothing... I am open to ideas not involving regex as well.
Here is my updated GetMarkingSection method:
private Dictionary<string, string> GetMarkingSection(string text)
{
var headingRegex = HEADING1REGEX;
var subheadingRegex = HEADING2REGEX;
Dictionary<string, string> markingSection = new Dictionary<string, string>();
if (Regex.Matches(text, HEADING1REGEX, RegexOptions.Multiline | RegexOptions.Singleline).Count > 0)
{
foreach (Match m in Regex.Matches(text, headingRegex, RegexOptions.Multiline | RegexOptions.Singleline))
{
if (Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
{
if (m.Groups[2].Value.ToLower().Contains("marking"))
{
var subheadings = Regex.Matches(m.ToString(), subheadingRegex, RegexOptions.Multiline | RegexOptions.Singleline);
foreach (Match s in subheadings)
{
markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
}
return markingSection;
}
}
}
}
else
{
headingRegex = HEADING2REGEX;
subheadingRegex = HEADING3REGEX;
foreach(Match m in Regex.Matches(text, headingRegex, RegexOptions.Multiline | RegexOptions.Singleline))
{
if(Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
{
if (m.Groups[2].Value.ToLower().Contains("marking"))
{
var subheadings = Regex.Matches(m.ToString(), subheadingRegex, RegexOptions.Multiline | RegexOptions.Singleline);
foreach (Match s in subheadings)
{
markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
}
return markingSection;
}
}
}
}
return null;
}
Here are some example PDF files:

See if this approach works:
var heading1Regex = @"^(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\s|\Z)";
var heading2Regex = @"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";
var heading3Regex = @"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)";
For each pdf file:
var headingRegex = heading1Regex;
var subHeadingRegex = heading2Regex;
if there are any matches for headingRegex
{
for each match, find matches for subHeadingRegex
}
else
{
headingRegex = heading2Regex;
subHeadingRegex = heading3Regex;
//repeat same steps
}
Edge case 1: after 5.2 comes 7.1.3
Get the main section match using heading2Regex.
Convert group 1 of the match to an integer:
int.TryParse(match.Groups[1].Value, out var headingIndex);
Get the subsection matches for heading3Regex.
For each subsection match, convert group 1 to an integer:
int.TryParse(subMatch.Groups[1].Value, out var subHeadingIndex);
Check whether headingIndex equals subHeadingIndex. If not, handle it accordingly.
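The index comparison for this edge case can be sketched as a small helper. The names HeadingNumbers, LeadingIndex and IsChildOf are hypothetical, and the sketch works on the heading text directly rather than on the match groups, purely to keep it self-contained.

```csharp
using System.Text.RegularExpressions;

public static class HeadingNumbers
{
    // Returns the first numeric component of a numbered heading, e.g. 5 for "5.2 Marking".
    public static int? LeadingIndex(string heading)
    {
        Match m = Regex.Match(heading, @"^\s*(\d+)");
        if (m.Success && int.TryParse(m.Groups[1].Value, out int index))
        {
            return index;
        }
        return null;
    }

    // True when both headings start with the same top-level number.
    public static bool IsChildOf(string subHeading, string heading)
    {
        int? parent = LeadingIndex(heading);
        int? child = LeadingIndex(subHeading);
        return parent.HasValue && child.HasValue && parent.Value == child.Value;
    }
}
```

With this, a subsection numbered 7.1.3 that appears after heading 5.2 is reported as not belonging to it, and the caller can decide whether to end the section there.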

How to avoid large switch statements and/or regular expressions when converting code from one language to another

I have to convert a few hundred test cases written in Java to code in C#. At the moment all I could think of is define a set of regular expressions, try to match it on a line and do an action based on which regex matched.
Any better ideas? (This still stinks.)
An example of from and to:
Java:
Request request = new Request(testRunner)
request.setUsername("userName")
request.setPassword("password")
log.info(request.getRequest())
C#
var request = new LoginRequest(LoginParams);
request.Username = "userName";
request.Password = "password";
var LoginResponse = Account.ExecuteCall(request, pathToApi);

The source I'm trying to convert is from SoapUI and the bits of script involved are within TestSteps of a humongous XML file. Also, most of them are simply forming some sort of request and checking for a specific response so there shouldn't be too many types to implement.
What I ended up doing was define a base class (Map) that has a Pattern property, a Success indicator and the lines of Code that result from a successful match. In some cases a line can simply be replaced by another one, but in other cases (setUsername) I need to extract content from the original script to put in the C# code. In other cases, a single line might be replaced with more than one. The transformation is all defined in the Match function.
public class SetUserName : Map
{
internal override string Pattern { get { return @"request.setUsername\(""(.*)""\)"; } }
public override void Match(string line)
{
Match match = Regex.Match(line, Pattern);
if (match.Success)
{
Success = true;
CodeLines = new Code<CodeLine>
{new CodeLine("request.Username = \"" + match.Groups[1].Value + "\"")};
}
}
}
Then I put the maps in a list ordered by occurrence and loop through each line of script:
foreach (string scriptLine in scriptLines)
{
string line = Strip(scriptLine);
if (string.IsNullOrEmpty(line) || Regex.Match(line, @"^\s+$").Success)
{
continue;
}
Map[] RegExes =
{
new Request(),
new SetUserName(),
new SetPassword(),
new RunRequest()
};
foreach (Map map in RegExes)
{
map.Match(line);
if (map.Success)
{
codeList.AddRange(map.CodeLines);
break;
}
}
}
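For comparison, the same idea can be expressed without one class per rule, as a list of pattern and emitter pairs. This is only a sketch under assumptions: LineTranslator, Rules and Translate are invented names, the two rules are illustrative, and the emitted lines end with a semicolon to match the C# target shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineTranslator
{
    // Each rule pairs a pattern with a function that builds the replacement lines.
    private static readonly List<(Regex Pattern, Func<Match, string[]> Emit)> Rules =
        new List<(Regex Pattern, Func<Match, string[]> Emit)>
        {
            (new Regex(@"^request\.setUsername\(""(.*)""\)$"),
                m => new[] { "request.Username = \"" + m.Groups[1].Value + "\";" }),
            (new Regex(@"^request\.setPassword\(""(.*)""\)$"),
                m => new[] { "request.Password = \"" + m.Groups[1].Value + "\";" }),
        };

    // Returns the translated lines, or an empty array when no rule matches.
    public static string[] Translate(string line)
    {
        string trimmed = line.Trim();
        foreach (var rule in Rules)
        {
            Match m = rule.Pattern.Match(trimmed);
            if (m.Success)
            {
                return rule.Emit(m);
            }
        }
        return Array.Empty<string>();
    }
}
```

Adding a new conversion then means adding one entry to the list; a rule that expands to several lines simply returns a longer array.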

RegEx to extract a sub level from url

I have the following set of URLs:
http://test/mediacenter/Photo Gallery/Conf 1/1.jpg
http://test/mediacenter/Photo Gallery/Conf 2/3.jpg
http://test/mediacenter/Photo Gallery/Conf 3/Conf 4/1.jpg
All I want to do is extract Conf 1, Conf 2 and Conf 3 from the URLs, i.e. the level after 'Photo Gallery'. (The URLs are not static; they share a common level, which is Photo Gallery.)
Any help is appreciated

Is it necessary to use Regex? You can get it without using Regex like this
string str = @"http://test/mediacenter/Photo Gallery/Conf 1/1.jpg";
var z = str.Split('/')[5]; // "Conf 1"
or
var x = new Uri(str).Segments[3]; // "Conf%201/" (escaped, with a trailing slash)

This ought to do you:
var s = @"http://test/mediacenter/Photo Gallery/Conf 11/1.jpg";
var regex = new Regex(@"(Conf \d*)");
var match = regex.Match(s);
Console.WriteLine(match.Groups[0].Value); // Prints "Conf 11"
Of course, you'd have to be confident the 'Conf x' (where x is a number) wasn't going to be elsewhere in the URL.
This will improve it slightly by stripping off multiple folders (Conf 3/Conf 4) in your example.
var regex = new Regex(@"((Conf \d*/*)+)");
It leaves the trailing / though.

No need for regex.
string testCase = "http://test/mediacenter/Photo Gallery/Conf 1/1.jpg";
string urlBase = "http://test/mediacenter/Photo Gallery/";
if(!testCase.StartsWith(urlBase))
{
throw new Exception("URL supplied doesn't belong to base URL.");
}
Uri uriTestCase = new Uri(testCase);
Uri uriBase = new Uri(urlBase);
if(uriTestCase.Segments.Length > uriBase.Segments.Length)
{
System.Console.Out.WriteLine(uriTestCase.Segments[uriBase.Segments.Length]);
}
else
{
Console.Out.WriteLine("No child segment...");
}

Try a RegEx like this.
Conf[^\/]*
This should give you all "Conf" Parts of the URLs.
I hope that helps.
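Since the question says only the 'Photo Gallery' level is fixed, one way to avoid hard-coded indexes is to look for that segment and return the one after it. The sketch below builds on the Uri.Segments answers above; the names UrlLevels and SegmentAfter are hypothetical, and returning null when the parent is not found is an assumption.

```csharp
using System;

public static class UrlLevels
{
    // Returns the path segment that directly follows parentName, or null if there is none.
    public static string SegmentAfter(string url, string parentName)
    {
        string[] segments = new Uri(url).Segments;
        for (int i = 0; i < segments.Length - 1; i++)
        {
            // Segments are escaped ("Photo%20Gallery/"), so unescape and trim the slash.
            string name = Uri.UnescapeDataString(segments[i]).TrimEnd('/');
            if (string.Equals(name, parentName, StringComparison.OrdinalIgnoreCase))
            {
                return Uri.UnescapeDataString(segments[i + 1]).TrimEnd('/');
            }
        }
        return null;
    }
}
```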

How can I split this string into an array?

My string is as follows:
smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;
I need back:
smtp:jblack@test.com
SMTP:jb@test.com
X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;
The problem is that semicolons separate the addresses and are also part of the X400 address. Can anyone suggest how best to split this?
P.S. I should have mentioned that the order differs, so it could be:
X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;;smtp:jblack@test.com;SMTP:jb@test.com
There can be more than three addresses (4, 5, ... 10, etc.), including an X500 address; however, they all start with one of smtp:, SMTP:, X400 or X500.

EDIT: With the updated information, this answer certainly won't do the trick - but it's still potentially useful, so I'll leave it here.
Will you always have three parts, and you just want to split on the first two semi-colons?
If so, just use the overload of Split which lets you specify the number of substrings to return:
string[] bits = text.Split(new char[]{';'}, 3);

May I suggest building a regular expression
(smtp|SMTP|X400|X500):((?!smtp:|SMTP:|X400:|X500:).)*;?
or protocol-less
.*?:((?![^:;]*:).)*;?
in other words find anything that starts with one of your protocols. Match the colon. Then continue matching characters as long as you're not matching one of your protocols. Finish with a semicolon (optionally).
You can then parse through the list of matches splitting on ':' and you'll have your protocols. Additionally if you want to add protocols, just add them to the list.
Likely however you're going to want to specify the whole thing as case-insensitive and only list the protocols in their uppercase or lowercase versions.
The protocol-less version doesn't care what the names of the protocols are. It just finds them all the same, by matching everything up to, but excluding a string followed by a colon or a semi-colon.

Split by the following regex pattern
string[] items = System.Text.RegularExpressions.Regex.Split(text, @";(?=\w+:)");
EDIT: a better one, which accepts more special characters in the protocol name:
string[] items = System.Text.RegularExpressions.Regex.Split(text, @";(?=[^;:]+:)");
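To see the lookahead split in action, here is a small self-contained sketch run against the first sample from the question. The names AddressSplitter and SplitAddresses are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressSplitter
{
    // Split on a semicolon only when the text up to the next ';' or ':' ends in ':',
    // i.e. when a new "protocol:" prefix starts right after it.
    public static string[] SplitAddresses(string text)
    {
        return Regex.Split(text, @";(?=[^;:]+:)");
    }

    public static void Main()
    {
        string input = "smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
        foreach (string address in SplitAddresses(input))
        {
            Console.WriteLine(address);
        }
    }
}
```

The semicolons inside the X400 address are followed by "key=value" parts without a colon, so the lookahead fails there and the address stays in one piece.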

http://msdn.microsoft.com/en-us/library/c1bs0eda.aspx
Check there: you can specify the number of substrings you want. So in your case you would do
string[] bits = text.Split(new char[]{';'}, 3);

Not the fastest if you are doing this a lot but it will work for all cases I believe.
string input1 = "smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
string input2 = "X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;;smtp:jblack@test.com;SMTP:jb@test.com";
Regex splitEmailRegex = new Regex(@"(?<key>\w+?):(?<value>.*?)(\w+:|$)");
List<string> sets = new List<string>();
while (input2.Length > 0)
{
Match m1 = splitEmailRegex.Matches(input2)[0];
string s1 = m1.Groups["key"].Value + ":" + m1.Groups["value"].Value;
sets.Add(s1);
input2 = input2.Substring(s1.Length);
}
foreach (var set in sets)
{
Console.WriteLine(set);
}
Console.ReadLine();
Of course many will claim Regex: Now you have two problems. There may even be a better regex answer than this.

You could always split on the colon and have a little logic to grab the key and value.
string[] bits = text.Split(':');
List<string> values = new List<string>();
for (int i = 1; i < bits.Length; i++)
{
string value = bits[i].Contains(';') ? bits[i].Substring(0, bits[i].LastIndexOf(';') + 1) : bits[i];
string key = bits[i - 1].Contains(';') ? bits[i - 1].Substring(bits[i - 1].LastIndexOf(';') + 1) : bits[i - 1];
values.Add(String.Concat(key, ":", value));
}
Tested it with both of your samples and it works fine.

This caught my curiosity .... So this code actually does the job, but again, wants tidying :)
My final attempt - stop changing what you need ;=)
static void Main(string[] args)
{
string fneh = "X400:C=US400;A= ;P=Test;O=Exchange;S=Jack;G=Black;x400:C=US400l;A= l;P=Testl;O=Exchangel;S=Jackl;G=Blackl;smtp:jblack@test.com;X500:C=US500;A= ;P=Test;O=Exchange;S=Jack;G=Black;SMTP:jb@test.com;";
string[] parts = fneh.Split(new char[] { ';' });
List<string> addresses = new List<string>();
StringBuilder address = new StringBuilder();
foreach (string part in parts)
{
if (part.Contains(":"))
{
if (address.Length > 0)
{
addresses.Add(semiColonCorrection(address.ToString()));
}
address = new StringBuilder();
address.Append(part);
}
else
{
address.AppendFormat(";{0}", part);
}
}
addresses.Add(semiColonCorrection(address.ToString()));
foreach (string emailAddress in addresses)
{
Console.WriteLine(emailAddress);
}
Console.ReadKey();
}
private static string semiColonCorrection(string address)
{
if ((address.StartsWith("x", StringComparison.InvariantCultureIgnoreCase)) && (!address.EndsWith(";")))
{
return string.Format("{0};", address);
}
else
{
return address;
}
}

Try these regexes. You can extract what you're looking for using named groups.
X400:(?<X400>.*?)(?:smtp|SMTP|$)
smtp:(?<smtp>.*?)(?:;+|$)
SMTP:(?<SMTP>.*?)(?:;+|$)
Make sure when constructing them you specify case insensitive. They seem to work with the samples you gave

Lots of attempts. Here is mine ;)
string src = "smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
Regex r = new Regex(@"
(?:^|;)smtp:(?<smtp>([^;]*(?=;|$)))|
(?:^|;)x400:(?<X400>.*?)(?=;x400|;x500|;smtp|$)|
(?:^|;)x500:(?<X500>.*?)(?=;x400|;x500|;smtp|$)",
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
foreach (Match m in r.Matches(src))
{
if (m.Groups["smtp"].Captures.Count != 0)
Console.WriteLine("smtp: {0}", m.Groups["smtp"]);
else if (m.Groups["X400"].Captures.Count != 0)
Console.WriteLine("X400: {0}", m.Groups["X400"]);
else if (m.Groups["X500"].Captures.Count != 0)
Console.WriteLine("X500: {0}", m.Groups["X500"]);
}
This finds all smtp, x400 or x500 addresses in the string in any order of appearance. It also identifies the type of address ready for further processing. The appearance of the text smtp, x400 or x500 in the addresses themselves will not upset the pattern.

This works!
string input =
"smtp:jblack@test.com;SMTP:jb@test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
string[] parts = input.Split(';');
List<string> output = new List<string>();
foreach(string part in parts)
{
if (part.Contains(":"))
{
output.Add(part + ";");
}
else if (part.Length > 0)
{
output[output.Count - 1] += part + ";";
}
}
foreach(string s in output)
{
Console.WriteLine(s);
}

Do the semicolon (;) split and then loop over the result, re-combining each element where there is no colon (:) with the previous element.
string input = "X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G="
+"Black;;smtp:jblack@test.com;SMTP:jb@test.com";
string[] rawSplit = input.Split(';');
List<string> result = new List<string>();
//now the fun begins
string buffer = string.Empty;
foreach (string s in rawSplit)
{
if (buffer == string.Empty)
{
buffer = s;
}
else if (s.Contains(':'))
{
result.Add(buffer);
buffer = s;
}
else
{
buffer += ";" + s;
}
}
result.Add(buffer);
foreach (string s in result)
Console.WriteLine(s);

Here is another possible solution.
string[] bits = text.Replace(";smtp", "|smtp").Replace(";SMTP", "|SMTP").Replace(";X400", "|X400").Split(new char[] { '|' });
bits[0],
bits[1], and
bits[2]
will then contain the three parts in the order from your original string.
