Finding valid URLs - c#

I have a C# script that finds URLs and checks each one to see if it is valid. To be valid, it must have an IP address, which means it will return info when queried with nslookup. Not all valid URLs have a webpage, but they will have an IP address. That being the case, testing for a website will not work. I searched for solutions but did not find a simple one.
My current method does a system call to nslookup and places it into a List. Then I loop through the list and check for "Non-existent domain". This works but I prefer not to use system calls if a C# alternative is available.
I can use HtmlAgilityPack and do a call to "https://www.whois.com/whois/" but some foreign URLs are not listed there and it seems like a lot of overhead for this kind of search.
I've tried the following System.Net method but no matter what URL I use, it fails.
string validURL = "a good URL";
try {
Uri myUri = new Uri(validURL);
var ip = Dns.GetHostAddresses(myUri.Host)[0];
Console.WriteLine("Found it");
} catch {
Console.WriteLine("Failed");
}
What is a good low cost method to determine if a URL is valid or fails?

Got it figured out. Pretty simple.
string validURL = "any host name or IP address"; // a bare host, not a full URL with a scheme
try {
IPHostEntry hostEntry = Dns.GetHostEntry(validURL);
}
catch {
Console.WriteLine("Failed");
}
If I have a need for the Host name and IP addresses it's done with just a few extra lines of code in the try section.
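A minimal sketch of those few extra lines, assuming the input may be either a full URL or a bare host name. The names HostCheck, ExtractHost and TryResolve are made up for illustration; only Uri.TryCreate and Dns.GetHostEntry come from the framework.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class HostCheck
{
    // Hypothetical helper: Dns.GetHostEntry expects a host name or IP address,
    // so pull the host out when the input turns out to be a full URL.
    public static string ExtractHost(string urlOrHost)
    {
        Uri uri;
        bool isUrl = Uri.TryCreate(urlOrHost, UriKind.Absolute, out uri)
                     && !string.IsNullOrEmpty(uri.Host);
        return isUrl ? uri.Host : urlOrHost;
    }

    // Returns true and fills 'entry' when the name resolves; false otherwise.
    public static bool TryResolve(string urlOrHost, out IPHostEntry entry)
    {
        try
        {
            entry = Dns.GetHostEntry(ExtractHost(urlOrHost));
            return true;
        }
        catch (SocketException) { entry = null; return false; }   // lookup failed
        catch (ArgumentException) { entry = null; return false; } // null or malformed input
    }
}
```

When TryResolve succeeds, entry.HostName holds the resolved name and entry.AddressList holds the IP addresses.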

Related

How to maintain the right URL in C#/ASP.NET?

I was given some code. One of its pages shows search results, and after listing the items it lets the user click one of the records, which is expected to bring up a page where that record can be modified.
However, when it tries to bring up the page I get (in IE) "This page cannot be displayed".
It is obvious the URL is wrong, because first I see something like http://www.Something.org/Search.aspx and then it turns into http://localhost:61123/ProductPage.aspx
I searched the code and found the following line, which I think is the cause. Now, the question I have to ask:
What should I do to avoid using a static URL and make it dynamic so it always points to the right domain?
string url = string.Format("http://localhost:61123/ProductPage.aspx?BC={0}&From={1}", barCode, "Search");
Response.Redirect(url);
Thanks.
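Use HttpContext.Current.Request.Url in your controller to see the URL. Url contains many things, including Host, which is what you're looking for. Host alone has no scheme or port, so use GetLeftPart(UriPartial.Authority), which returns scheme, host and port together (for example http://www.something.org).
By the way, if you're using C# 6 or later (it shipped with Visual Studio 2015, alongside .NET 4.6) you can create the string like so:
string url = $"{HttpContext.Current.Request.Url.GetLeftPart(UriPartial.Authority)}/ProductPage.aspx?BC={barCode}&From=Search";
Or you can use string.Format
string root = HttpContext.Current.Request.Url.GetLeftPart(UriPartial.Authority);
string url = string.Format("{0}/ProductPage.aspx?BC={1}&From={2}", root, barCode, "Search");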
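You can store the host segment in the appSettings section of your Web.config file, with one value per configuration / environment:
Debug / Development Web.config: an appSettings entry with the key HostUrlSegment holding the local address, including the scheme (for example http://localhost:61123)
Production / Release Web.config: the same key, with a config override that replaces the localhost value with the something.org host
and then use it in your code like so.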
// Creates a URI using the HostUrlSegment set in the current web.config
Uri hostUri = new Uri(ConfigurationManager.AppSettings.Get("HostUrlSegment"));
// does something like Path.Combine(..) to construct a proper Url with the hostName
// and the other url segments. The $ is a new C# construct to do string interpolation
// (makes for readable code)
Uri fullUri = new Uri(hostUri, $"ProductPage.aspx?BC={barCode}&From=Search");
// fullUri.AbsoluteUri will contain the proper Url
Response.Redirect(fullUri.AbsoluteUri);
The Uri class has a lot of useful properties and methods that give you the relative URL, the absolute URL, the URL fragment, the host name, and so on.
This should do it.
string url = string.Format("ProductPage.aspx?BC={0}&From={1}", barCode, "Search");
Response.Redirect(url);
If you are using C# 6 or later (it shipped alongside .NET 4.6) you can also use this string interpolation version
string url = $"ProductPage.aspx?BC={barCode}&From=Search";
Response.Redirect(url);
You should just be able to omit the hostname to stay on the current domain.
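The answers above can be combined into one small helper. This is only a sketch: RedirectUrls and BuildProductUrl are hypothetical names, and escaping the bar code with Uri.EscapeDataString is an addition that the original code does not have.

```csharp
using System;

public static class RedirectUrls
{
    // Resolves the product page relative to whatever URL the current request
    // arrived on, so the scheme, host and port are never hard-coded.
    public static string BuildProductUrl(Uri currentUrl, string barCode)
    {
        string relative = "ProductPage.aspx?BC=" + Uri.EscapeDataString(barCode) + "&From=Search";
        return new Uri(currentUrl, relative).AbsoluteUri;
    }
}
```

In the page this could be called as Response.Redirect(RedirectUrls.BuildProductUrl(Request.Url, barCode)); the same code then works on localhost and on the production domain.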

How to get website name from domain name?

I fetch the domain from the URL as follows:
var uri = new Uri("Http://www.google.com");
var host = uri.Host;
//host ="www.google.com"
But I want only google.com in Host,
host = "google.com"
Given the accepted answer I guess the issue was not knowing how to manipulate strings rather than how to deal with uris... but for anyone else who ends up here:
The Uri class does not have this property so you will have to parse it yourself.
Presumably you do not know what the subdomain is ahead of time, so a simple replace may not be possible.
This is not trivial since the TLDs are so varied (http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains), and there may be multiple parts to the URL (e.g. http://pre.subdomain.domain.co.uk).
You will have to decide exactly what you want to get and how complex you want the solution to be.
simple - do a string replace, see ekad's answer
medium - regex that works most of the time, see Strip protocol and subdomain from a URL
or complex - refer to a list of suffixes in order to figure out what is subdomain and what is domain eg
Get the subdomain from a URL
If host begins with "www.", you can strip that prefix with Substring (String.Replace would also remove any later "www." inside the name) like this:
var uri = new Uri("Http://www.google.com");
var host = uri.Host.ToLower();
if (host.StartsWith("www."))
{
    host = host.Substring("www.".Length);
}
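For the "complex" option mentioned above, a rough sketch of the suffix-list idea follows. The class name DomainHelper and the five-entry suffix set are made up for illustration; a real implementation would load a complete list such as the Public Suffix List instead.

```csharp
using System;
using System.Collections.Generic;

public static class DomainHelper
{
    // Hypothetical, deliberately tiny suffix list used only for this example.
    private static readonly HashSet<string> KnownSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "com", "org", "net", "uk", "co.uk" };

    // Returns the known suffix plus the one label in front of it,
    // or the host unchanged when no suffix matches.
    public static string GetRegistrableDomain(string host)
    {
        string[] labels = host.TrimEnd('.').Split('.');
        // i is the index where the candidate suffix starts; longest candidates come first.
        for (int i = 1; i < labels.Length; i++)
        {
            string suffix = string.Join(".", labels, i, labels.Length - i);
            if (KnownSuffixes.Contains(suffix))
            {
                return string.Join(".", labels, i - 1, labels.Length - i + 1);
            }
        }
        return host;
    }
}
```

Trying the longest candidate first is what lets "co.uk" win over "uk", so pre.subdomain.domain.co.uk is reduced to domain.co.uk rather than co.uk.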

Parsing string for Domain / hostName

Our customers can enter website addresses as domain names. They can also enter the email addresses of their contacts.
Now we need to find the customers whose website domain can be matched to the domains of those email addresses.
So my idea is to extract the host from the web address and from the email address and compare them.
So what's the most reliable algorithm to get the hostname from a URL?
for example a host can be:
foo.com
www.foo.com
http://foo.com
https://foo.com
https://www.foo.com
The result should always be foo.com
Rather than relying on an unreliable regex, use System.Uri to do the parsing for you. Use code like this:
string uriStr = "www.foo.com";
if (!uriStr.Contains(Uri.SchemeDelimiter)) {
uriStr = string.Concat(Uri.UriSchemeHttp, Uri.SchemeDelimiter, uriStr);
}
Uri uri = new Uri(uriStr);
string domain = uri.Host; // will return www.foo.com
Note that GetLeftPart does not strip the "www." prefix; it returns the scheme and authority together:
string authority = uri.GetLeftPart( UriPartial.Authority ); // will return http://www.foo.com
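To reduce all five sample inputs to foo.com and compare them with an email domain, something like the following could work. It is a sketch only: HostNormalizer, NormalizeHost and EmailDomain are hypothetical names, and it assumes "www." is the only prefix that needs stripping.

```csharp
using System;

public static class HostNormalizer
{
    // Adds a scheme when missing so System.Uri can parse the input,
    // then drops a leading "www." from the parsed host.
    public static string NormalizeHost(string input)
    {
        string s = input.Trim();
        if (!s.Contains(Uri.SchemeDelimiter))
        {
            s = Uri.UriSchemeHttp + Uri.SchemeDelimiter + s;
        }
        string host = new Uri(s).Host;
        return host.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? host.Substring(4)
            : host;
    }

    // Returns the lower-cased part after the last '@', or null when there is none.
    public static string EmailDomain(string email)
    {
        int at = email.LastIndexOf('@');
        return at < 0 ? null : email.Substring(at + 1).Trim().ToLowerInvariant();
    }
}
```

A customer then matches a contact when NormalizeHost(website) equals EmailDomain(mailAddress).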
Here's a regular expression (JavaScript syntax) that will match the URLs you have provided. Basically http and https are optional, as is the www. Everything is then matched up to a possible path:
var expression = /(https?:\/\/)?(www\.)?([^\/]*)(\/.*)?$/;
This would mean that;
var result = 'https://www.foo.com.vu/blah'.replace(expression, '$3')
Would evaluate to
result === 'foo.com.vu'
There is already a url parser in c# for extracting this information
Here are some examples http://www.stev.org/post/2011/06/27/C-HowTo-Parse-a-URL.aspx
See this URL. The Host property, unlike Authority, will not include the port number.
http://msdn.microsoft.com/en-us/library/system.uri.host(v=vs.110).aspx

ASP.NET Site Redirection help

I am following the code over here https://web.archive.org/web/20211020203216/https://www.4guysfromrolla.com/articles/072810-1.aspx
to redirect http://somesite.com to http://www.somesite.com
protected void Application_BeginRequest(object sender, EventArgs e)
{
    if (Request.Url.Authority.StartsWith("www"))
        return;
    var url = string.Format("{0}://www.{1}{2}",
        Request.Url.Scheme,
        Request.Url.Authority,
        Request.Url.PathAndQuery);
    Response.RedirectPermanent(url, true);
}
How can I use this code to handle situations where http://abc.somesite.com should redirect to www.somesite.com?
I'd suggest the best way to handle this would be in the DNS record, if you have control of it.
If you don't know what the values will be ahead of time, you can use Substring with IndexOf on the host name to parse out the value you want and replace it.
If you do know what it is ahead of time, you can always just do Request.Url.Host.Replace("abc", "www");
You can also do a DNS check, as @aceinthehole suggested, after you have parsed what you need to make sure you haven't made any mistakes.
Assuming you have a string like http://abc.site.com and you want to turn abc into www, you could do something like:
string host = Request.Url.Host;
string pieceToReplace = host.Substring(0, host.IndexOf(".") + 1); // e.g. "abc."
// Here I use the scheme and the start of the URL to make sure we don't accidentally replace an "abc" that belongs later in the URL, like in a word "GHEabc.com" or something.
string newUrl = Request.Url.ToString().Replace(Request.Url.Scheme + "://" + pieceToReplace, Request.Url.Scheme + "://www.");
Response.Redirect(newUrl);
P.S. Request.Url.Scheme does not include the "://" (it returns just "http" or "https"), which is why it is added explicitly above.
I don't think you can do it without access to the DNS. It sounds like you need a wildcard DNS entry:
http://en.wikipedia.org/wiki/Wildcard_DNS_record
Along with IIS configured without host headers (IP only). Then you can use code similar to the above to do what you want.
if (!Request.Url.Host.StartsWith("www") && !Request.Url.IsLoopback)
    Response.Redirect("http://www.somesite.com");
Perhaps tighten it up some to prevent wwww.somesite.com from getting through. Anything that starts with www including wwwmonkeys.somesite.com would get through the above check. It is just an example.
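One way to tighten the check is to compare the whole host instead of its first three letters. The sketch below assumes the site's base domain is known up front; WwwRedirect and GetRedirectTarget are hypothetical names, not part of ASP.NET.

```csharp
using System;

public static class WwwRedirect
{
    // Returns the www URL to redirect to, or null when no redirect is needed
    // (already on www, or the host does not belong to baseDomain at all).
    public static string GetRedirectTarget(Uri url, string baseDomain)
    {
        string wwwHost = "www." + baseDomain;
        bool belongs = url.Host.Equals(baseDomain, StringComparison.OrdinalIgnoreCase)
                       || url.Host.EndsWith("." + baseDomain, StringComparison.OrdinalIgnoreCase);
        if (!belongs || url.Host.Equals(wwwHost, StringComparison.OrdinalIgnoreCase))
        {
            return null;
        }
        // Keep a non-default port, and carry the path and query over unchanged.
        string port = url.IsDefaultPort ? "" : ":" + url.Port;
        return url.Scheme + "://" + wwwHost + port + url.PathAndQuery;
    }
}
```

In Application_BeginRequest the result could be passed to Response.RedirectPermanent whenever it is not null. Hosts such as wwwmonkeys.somesite.com are redirected as well, because only an exact match on www.somesite.com is left alone.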
asp.net mvc: How to redirect a non www to www and vice versa

Why does Request["host"] == "dev.testhost.com:1234" whereas Request.Url.Host == "localhost"

Hi all, I seem to have found a discrepancy when testing ASP.NET applications locally on the built-in web server with Visual Studio 2008 (Cassini).
I've set up a host on my local machine associating dev.testhost.com with 127.0.0.1, since I have an application that needs to change its appearance depending on the host header used to call it.
However, when I request my test application using http://dev.testhost.com:1234/index.aspx, the value of Request.Url.Host is always "localhost". Whereas the value of Request.Headers["host"] is "dev.testhost.com:1234" (as I would expect them both to be).
I'm NOT concerned that the second value includes the port number, but I am mighty confused as to why the HOST NAMES are completely different! Does anyone know if this is a known issue, or by design? Or am I being an idiot?!
I'd rather use Request.Url.Host, since that avoids having to strip out the port number when testing... - Removed due to possibly causing confusion! - Sam
Request.Headers["host"] is the value received from the application that connects to the server, while the other value is the one the server gets when it tries to get the domain name.
The browser sends the domain name that was entered because that is what virtual hosting relies on. The server reports the name set in its own configuration, or the first one it finds.
EDIT: Looking at the code of Cassini to see if it uses some particular settings, I noticed the following code:
public string RootUrl {
    get {
        if (_port != 80) {
            return "http://localhost:" + _port + _virtualPath;
        }
        else {
            return "http://localhost" + _virtualPath;
        }
    }
}
//
// Socket listening
//
public void Start() {
    try {
        _socket = CreateSocketBindAndListen(AddressFamily.InterNetwork, IPAddress.Loopback, _port);
    }
    catch {
        _socket = CreateSocketBindAndListen(AddressFamily.InterNetworkV6, IPAddress.IPv6Loopback, _port);
    }
    // …
}
The explanation seems to be that Cassini makes explicit reference to localhost, and doesn't try to make a reverse DNS lookup. Otherwise, it would not use return "http://localhost" + _virtualPath;.
The Request.Headers["host"] is the host as specified in the http header from the browser. (e.g. this is what you'd see if you examined the traffic with Fiddler or HttpWatch)
However, ASP.NET loads this (and other request info) into a System.Uri instance, which parses the request string into its constituent parts. In this case, "Host" refers literally to the host machine part of the original request (with the TCP port held in the Port property).
This System.Uri class is a very useful helper class that takes all the pain out of splitting your request into its parts, whereas the "Host:" (and for that matter the "GET") from the HTTP header are just raw request data.
Although they both have the same name, they are not meant to be the same thing.
It's a matter of what the w3 specs say versus what the Microsoft Uri.Host property is supposed to contain. The naming does not imply an attempt by MS to provide identical functionality. The function that does include port numbers is Uri.Authority.
With the update you posted, you're still facing the same problem, just examining a different aspect of it. The Uri.Host property is not explicitly or implicitly stated to perform the same function as the headers that are defined in the w3 specs. In long form, here are some quotes from the Uri.Host MSDN page:
Uri.Host Property
Gets the host component of this instance.
Property Value
Type: System.String
A String that contains the host name. This is usually the DNS host name or IP address of the server.
There's no guarantee that this will match what is in the headers, just that it represents the host name in some form.
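If the application has to key off the name the browser actually sent, one option is to read the Host header and let System.Uri split off the port. The helper below is a sketch with made-up names (HostHeader, HostOnly), and it assumes the header value is well formed.

```csharp
using System;

public static class HostHeader
{
    // Wraps the raw header value in a dummy http:// URL so that Uri can
    // separate the host name from an optional ":port" suffix.
    public static string HostOnly(string hostHeader)
    {
        return new Uri(Uri.UriSchemeHttp + Uri.SchemeDelimiter + hostHeader).Host;
    }
}
```

For example, HostHeader.HostOnly(Request.Headers["host"]) gives the host without the port. Uri.Authority on the full request URL keeps the port when it is not the default one, while Uri.Host never includes it.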
