Remove subdomains from URI

I want to remove subdomain names from a URI.
Example:
I want to return 'baseurl.com' from the Uri "subdomain.sub2.baseurl.com".
Is there a way of doing this using URI class or is Regex the only solution?
Thank you.

This should get it done:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var tlds = new List<string>()
{
    // The second- and third-level TLDs you expect go here; set to null if working with single-level TLDs only.
    "co.uk"
};
Uri request = new Uri("http://subdomain.domain.co.uk");
string host = request.Host;
string hostWithoutPrefix = null;
if (tlds != null)
{
    foreach (var tld in tlds)
    {
        // Escape the TLD so that its dots match literally; [\w-] also allows hyphens in the domain label.
        Regex regex = new Regex($"[\\w-]+\\.{Regex.Escape(tld)}$");
        Match match = regex.Match(host);
        if (match.Success)
        {
            hostWithoutPrefix = match.Value;
            break;
        }
    }
}
// Second/third levels not provided or not found: try single-level.
if (string.IsNullOrWhiteSpace(hostWithoutPrefix))
{
    Regex regex = new Regex("[\\w-]+\\.[\\w-]+$");
    Match match = regex.Match(host);
    if (match.Success)
        hostWithoutPrefix = match.Value;
}
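
For comparison, the same idea can be sketched without a regex by splitting the host into labels. This is only an illustration under assumptions: StripSubdomains and its multiLevelTlds parameter are hypothetical names, the suffix list is supplied by the caller, and the code is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;

// Hypothetical helper; multiLevelTlds is an assumed, caller-supplied list such as "co.uk".
static string StripSubdomains(string host, string[] multiLevelTlds)
{
    string[] labels = host.Split('.');
    // Find a known multi-label suffix that the host ends with, if any.
    string suffix = multiLevelTlds.FirstOrDefault(
        tld => host.EndsWith("." + tld, StringComparison.OrdinalIgnoreCase));
    // Keep the suffix labels plus one; with no known suffix, keep the last two labels.
    int keep = suffix == null ? 2 : suffix.Split('.').Length + 1;
    return labels.Length <= keep ? host : string.Join(".", labels.Skip(labels.Length - keep));
}

Console.WriteLine(StripSubdomains(new Uri("http://subdomain.sub2.baseurl.com").Host, new[] { "co.uk" })); // baseurl.com
Console.WriteLine(StripSubdomains(new Uri("http://subdomain.domain.co.uk").Host, new[] { "co.uk" }));     // domain.co.uk
```

Either way, the result is only as good as the suffix list you provide; any multi-level TLD that is missing from it gets treated as a single-level one.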

Related

How can I pick out a number from a string in C#

I have this string:
http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&amp;sid=&amp;q=1007040&amp;a=2
How can I pick out the number between "q=" and "&amp" as an integer?
So in this case I want to get the number: 1007040

What you're actually doing is parsing a URI - so you can use the .NET library to do this properly as follows:
var str = "http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&amp;sid=&amp;q=1007040&amp;a=2";
var uri = new Uri(str);
var query = uri.Query;
var dict = System.Web.HttpUtility.ParseQueryString(query);
Console.WriteLine(dict["amp;q"]); // Outputs 1007040
If you want the numeric string as an integer then you'd need to parse it:
int number = int.Parse(dict["amp;q"]);

Consider using regular expressions:
String str = "http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&amp;sid=&amp;q=1007040&amp;a=2";
Match match = Regex.Match(str, @"q=\d+&amp");
if (match.Success)
{
    string resultStr = match.Value.Replace("q=", String.Empty).Replace("&amp", String.Empty);
    int.TryParse(resultStr, out int result); // result = 1007040
}

Seems like you want a query parameter from a URI that's HTML-encoded. You could do:
Uri uri = new Uri(HttpUtility.HtmlDecode("http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&amp;sid=&amp;q=1007040&amp;a=2"));
string q = HttpUtility.ParseQueryString(uri.Query).Get("q");
int qint = int.Parse(q);

A regex approach using groups:
public int GetInt(string str)
{
    var match = Regex.Match(str, @"q=(\d*)&amp");
    return int.Parse(match.Groups[1].Value);
}
Absolutely no error checking in that!
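
If you do want the error checking, a minimal sketch could look like the following. TryGetQ is a hypothetical name, the code is written as top-level statements (C# 9 or later), and the pattern assumes that "q" is preceded by "?", "&" or ";" so that it works on both the HTML-encoded and the decoded URL.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns false instead of throwing when "q" is missing or is not a valid int.
static bool TryGetQ(string str, out int value)
{
    value = 0;
    // \d+ requires at least one digit, so an empty "q=" is treated as no match.
    Match match = Regex.Match(str, @"[?&;]q=(\d+)");
    // int.TryParse also returns false when the digits overflow an int.
    return match.Success && int.TryParse(match.Groups[1].Value, out value);
}

string url = "http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&amp;sid=&amp;q=1007040&amp;a=2";
Console.WriteLine(TryGetQ(url, out int q) ? q.ToString() : "no match"); // 1007040
```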

HtmlAgilityPack can't find element

I need to parse a site and I know where to find the element I'm searching: it's a span with class="metadata_with_icon-tags-primary_tag".
My C# code:
var page = new HtmlWeb().Load(url).DocumentNode.Descendants("span").Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("metadata_with_icon-tags-primary_tag"));
Item that I need:

To get your span with class="metadata_with_icon-tags-primary_tag":
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//span[@class='metadata_with_icon-tags-primary_tag']");

Try this
HtmlWeb website = new HtmlWeb();
var html = website.Load("https://genius.com/Eminem-space-bound-lyrics").DocumentNode.InnerHtml;
Regex rgx = new Regex(@"<script\b[^>]*>([\s\S]*?)<\/script>", RegexOptions.IgnoreCase);
var matches = rgx.Matches(html);
var g = matches[14].Value;
Regex regex = new Regex(
    @"(\[{.*}\])",
    RegexOptions.Multiline
);
Match match = regex.Match(g);
var json = match.Value;

Working with Regex to get 2 strings out of a source code [duplicate]

I am using webrequest to download a source from a page and then I need to use Regex to grab the string and store it in a string:
U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
also need:
bpvsid=nvnN2JFJqJc.&dcz=1
Both out of:
<td style="cursor:pointer;" class="" onclick="NewWindow('U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..', 'bpvsid=nvnN2JFJqJc.&dcz=1', 'bpvstage_edit', '1200', '800')" onmouseout="HideHover();"><img src="gfx/info.gif" alt="" tipwidth="450" ajaxtip="openajax.php?target=modules/bpv/bpvstage_hover_info.php&rid=&oid=&bpvsid=&bpvname=" /></td>
It keeps giving me errors like "not enough )'s".
Thanks in advance.
Current code, probably wrong in every way. Really new to this:
Regex rx = new Regex("(?<=class=\"\" onclick=\"NewWindow(').*(?=')");
longId = (rx.Match(textBox2.Text).Value);
textBox1.Text = longId;

var match = Regex.Match(s, @"onclick=""NewWindow\('([^']*)',\s*'([^']*)',.*");
if (match.Success)
{
    string longId = match.Groups[1].Value;
    string other = match.Groups[2].Value;
}
That will give you two groups with values:
U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
bpvsid=nvnN2JFJqJc.&dcz=1

The regex NewWindow\('([^']*)', '([^']*) will match what you require. The two strings will be in Groups[1] and Groups[2].
var match = Regex.Match(textBox2.Text, @"NewWindow\('([^']*)', '([^']*)");
var id1 = match.Groups[1].Value;
var id2 = match.Groups[2].Value;

Note that you could also simply use string functions instead of a regex:
var s = "<td style=\"cursor:pointer;\" class=\"\" onclick=\"NewWindow('U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..', 'bpvsid=nvnN2JFJqJc.&dcz=1', 'bpvstage_edit', '1200', '800')\" onmouseout=\"HideHover();\"><img src=\"gfx/info.gif\" alt=\"\" tipwidth=\"450\" ajaxtip=\"openajax.php?target=modules/bpv/bpvstage_hover_info.php&rid=&oid=&bpvsid=&bpvname=\" /></td>";
var tmp = s.Substring(s.IndexOf("NewWindow('")).Split('\'');
var value1 = tmp[1]; // U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
var value2 = tmp[3]; // bpvsid=nvnN2JFJqJc.&dcz=1

I would use HtmlAgilityPack to parse HTML, then this non-regex approach works:
string html = // get your html ...
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // doc.Load can also consume a response-stream directly
var result = Enumerable.Empty<string>();
// SelectSingleNode returns the first match or null; SelectNodes returns null when nothing matches.
var firstTD = doc.DocumentNode.SelectSingleNode("//td");
if (firstTD != null)
{
    if (firstTD.Attributes.Contains("onclick"))
    {
        string onclick = firstTD.Attributes["onclick"].Value;
        int newWindowIndex = onclick.IndexOf("newWindow(", StringComparison.OrdinalIgnoreCase);
        if (newWindowIndex >= 0)
        {
            string functionBody = onclick.Substring(newWindowIndex + "newWindow(".Length);
            string[] tokens = functionBody.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
            result = tokens.Take(2).Select(s => s.Trim(' ', '\''));
        }
    }
}

Extract group name from DirectoryEntry

When I use the code below to get the list of groups, I get a long string representing each group name:
CN=group.xy.admin.si,OU=Other,OU=Groups,OU=03,OU=UWP Customers,DC=WIN,DC=CORP,DC=com
But I just want the group name, which in this case is group.xy.admin.si.
public static List<string> GetGroups(DirectoryEntry de)
{
    var memberGroups = de.Properties["memberOf"].Value;
    var groups = new List<string>();
    if (memberGroups != null)
    {
        if (memberGroups is string)
        {
            groups.Add((String)memberGroups);
        }
        else if (memberGroups.GetType().IsArray)
        {
            var memberGroupsEnumerable = memberGroups as IEnumerable;
            if (memberGroupsEnumerable != null)
            {
                foreach (var groupname in memberGroupsEnumerable)
                {
                    groups.Add(groupname.ToString());
                }
            }
        }
    }
    return groups;
}

There are two options here:
1. Use the distinguishedName you got to retrieve the group object from AD, then read its 'name' attribute.
2. Use a regex to extract the group name.
Pseudo-code for the regular expression:
string Pattern = @"^CN=(.*?)(?<!\\),.*";
string group = Regex.Replace(groupname.ToString(), Pattern, "$1");
groups.Add(group);
A name can contain "," escaped by "\", so this regex should work fine even if you have groups named "Foo, Bar".
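
To see how the negative lookbehind handles an escaped comma, here is a small sketch. GetCommonName is a hypothetical helper, written as top-level statements (C# 9 or later); unlike the Regex.Replace version, it returns null when the string does not start with "CN=". The second distinguished name is a made-up example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the leading CN value of a distinguished name, or null if there is none.
static string GetCommonName(string distinguishedName)
{
    // (?<!\\), only accepts a comma that is not preceded by a backslash.
    Match match = Regex.Match(distinguishedName, @"^CN=(.*?)(?<!\\),");
    return match.Success ? match.Groups[1].Value : null;
}

Console.WriteLine(GetCommonName("CN=group.xy.admin.si,OU=Other,OU=Groups,OU=03,OU=UWP Customers,DC=WIN,DC=CORP,DC=com")); // group.xy.admin.si
Console.WriteLine(GetCommonName(@"CN=Foo\, Bar,OU=Groups,DC=WIN,DC=CORP,DC=com")); // Foo\, Bar
```

Note that the escaping backslash stays in the result, so you may want to remove it before displaying the name.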

Find two strings using regex in any order

For example I have an input: "Test your Internet connection bandwidth. Test your Internet connection bandwidth." (two times repeated) and I want to search for strings internet and bandwidth.
string keyword = tbSearch.Text; // That holds the value: "internet bandwidth"
string input = "Test your Internet connection bandwidth. Test your Internet connection bandwidth.";
Regex r = new Regex(keyword.Replace(' ', '|'), RegexOptions.IgnoreCase);
if (r.Matches(input).Count == keyword.Split(' ').Length)
{
    //Do something
}
This doesn't work because it finds 2 "internet" and 2 "bandwidth", so the count is 4, but the keyword length is 2. What can I do?

var pattern = keyword.Split()
    .Aggregate(new StringBuilder(),
        (sb, s) => sb.AppendFormat(@"(?=.*\b{0}\b)", Regex.Escape(s)),
        sb => sb.ToString());
if (Regex.IsMatch(input, pattern, RegexOptions.IgnoreCase))
{
    // contains all keywords
}
The first part generates the pattern from your keywords. If there are two keywords, "internet bandwidth", the generated regex pattern will look like:
"(?=.*\binternet\b)(?=.*\bbandwidth\b)"
It will match the following inputs:
"Test your Internet connection bandwidth."
"Test your Internet connection bandwidth. Test your Internet bandwidth."
The following inputs will not match (not all words are contained):
"Test your Internet2 connection bandwidth bandwidth."
"Test your connection bandwidth."
Another option (verifying each keyword separately):
var allWordsContained = keyword.Split().All(word =>
    Regex.IsMatch(input, String.Format(@"\b{0}\b", Regex.Escape(word)), RegexOptions.IgnoreCase));

Not sure what you are trying to do, but you could try something like this:
public bool allWordsContained(string input, string keyword)
{
    bool result = true;
    string[] words = keyword.Split(' ');
    foreach (var word in words)
    {
        if (!input.Contains(word))
            result = false;
    }
    return result;
}
public bool atLeastOneWordContained(string input, string keyword)
{
    bool result = false;
    string[] words = keyword.Split(' ');
    foreach (var word in words)
    {
        if (input.Contains(word))
            result = true;
    }
    return result;
}
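
One caveat with the Contains-based methods above: string.Contains(string) is case-sensitive, so the keyword "internet" is not found in "Test your Internet connection bandwidth.". A possible case-insensitive variant is sketched below; AllWordsContainedIgnoreCase is a hypothetical name and the code is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;

// Hypothetical variant of allWordsContained that ignores case.
static bool AllWordsContainedIgnoreCase(string input, string keyword)
{
    return keyword
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        // IndexOf with OrdinalIgnoreCase does a case-insensitive substring search.
        .All(word => input.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0);
}

Console.WriteLine(AllWordsContainedIgnoreCase("Test your Internet connection bandwidth.", "internet bandwidth")); // True
Console.WriteLine(AllWordsContainedIgnoreCase("Test your connection bandwidth.", "internet bandwidth"));          // False
```

This is still a substring search, so "internet" would also be found inside "Internet2"; use the \b-based regex if you need whole words.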

Here is the solution. The trick is to collect the matches into a list and call Distinct() on it:
string keyword = "internet bandwidth";
string input = "Test your Internet connection bandwidth. Test your Internet connection bandwidth.";
Regex r = new Regex(keyword.Replace(' ', '|'), RegexOptions.IgnoreCase);
MatchCollection mc = r.Matches(input);
List<string> res = new List<string>();
for (int i = 0; i < mc.Count; i++)
{
    res.Add(mc[i].Value);
}
if (res.Distinct().Count() == keyword.Split(' ').Length)
{
    //Do something
}

Regex r = new Regex(keyword.Replace(' ', '|'), RegexOptions.IgnoreCase);
int distinctKeywordsFound = r.Matches(input)
    .Cast<Match>()
    .Select(m => m.Value)
    .Distinct()
    .Count();
if (distinctKeywordsFound == keyword.Split(' ').Length)
{
    //Do something
}
