Check if a URL is valid by using HttpWebRequest - C#

I'd like to create a tool to check whether a URL is valid (valid: it returns a 200). I have two example check-in pages from airlines, and both work correctly in the browser. However, the British Airways one always throws an exception because of a 500 response. What is wrong with my code?
static void Main(string[] args)
{
    var testUrl1 = new Program().UrlIsValid("https://www.klm.com/ams/checkin/web/kl/nl/nl");
    var testUrl2 = new Program().UrlIsValid("https://www.britishairways.com/travel/olcilandingpageauthreq/public/en_gb");
    Console.WriteLine(testUrl1 + "\t - https://www.klm.com/ams/checkin/web/kl/nl/nl");
    Console.WriteLine(testUrl2 + "\t - https://www.britishairways.com/travel/olcilandingpageauthreq/public/en_gb");
}

public bool UrlIsValid(string onlineCheckInUrl)
{
    try
    {
        var request = (HttpWebRequest)WebRequest.Create(onlineCheckInUrl);
        request.Method = "GET";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return (response.StatusCode == HttpStatusCode.OK);
        }
    }
    catch (Exception)
    {
        return false;
    }
}

A lot of sites block obvious bot activity. The British Airways URL you show works for me if I set a valid User-Agent request header:
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0";
Keep in mind that 200 OK is not the only response that means the URL is valid, and your method of testing will always be unreliable at best. You may have to narrow your definition of what a valid URL means, or at least expect things to change on a site-by-site basis.
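For example, here is a minimal sketch of a looser check (one assumption about what "valid" could mean, not the only option): it sets a User-Agent and treats any response the server actually sends back, including an error status delivered via a WebException, as evidence that the URL at least resolves:
public bool UrlResponds(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    // Many sites reject requests that carry no User-Agent at all.
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0";
    try
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return (int)response.StatusCode < 400;
        }
    }
    catch (WebException ex)
    {
        var errorResponse = ex.Response as HttpWebResponse;
        if (errorResponse != null)
        {
            // The server answered, just not with a success code; decide for yourself
            // whether e.g. a 401 or 403 should still count as "valid".
            return (int)errorResponse.StatusCode < 500;
        }
        return false; // DNS failure, timeout, connection refused, etc.
    }
}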

Related

401 Unauthorised Error Received

I am using the Habbo API to check if a name is valid, but I'm receiving a 401 Unauthorised error.
Below is the code I'm using. It worked when I copied my Cookie header from Chrome and added it as a header, but is there another way, or an actual fix?
private void Form1_Load(object sender, EventArgs e)
{
    try
    {
        using (WebClient webClient = new WebClient())
        {
            webClient.Headers.Add("User-Agent", "Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30");
            MessageBox.Show(webClient.DownloadString("https://www.habbo.com/api/user/avatars/check-name?name=123"));
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.ToString());
    }
}
The API https://www.habbo.com/api/user/avatars/check-name that you are referring to won't load without a proper authorization token, since it is not publicly available.
To test further, use the public API https://www.habbo.com/api/public/users?name=
You will be able to get a response without any issues.
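For instance, a minimal sketch against that public endpoint (the name "123" is just a placeholder query value) could look like this:
using (WebClient webClient = new WebClient())
{
    // Some endpoints still expect a User-Agent header to be present.
    webClient.Headers.Add("User-Agent", "Mozilla/5.0");
    string json = webClient.DownloadString("https://www.habbo.com/api/public/users?name=123");
    MessageBox.Show(json);
}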
The 401 error indicates that you need to add (basic) authentication to your HTTP request (and remove the cookie that you added):
string username = "username";
string password = "password";
string credentials = Convert.ToBase64String(Encoding.ASCII.GetBytes(username + ":" + password));
webClient.Headers[HttpRequestHeader.Authorization] = "Basic " + credentials;

HttpClient request returns status 410 Gone

I'm practicing my skills with HttpClient in .NET 4.5, but I've run into some trouble: I'm trying to get the HTML content of a webpage that I can view with no problem in my browser, but I get a 410 Gone when I use HttpClient.
Here is the first page, which I can actually get with HttpClient:
https://selfservice.mypurdue.purdue.edu/prod/bwckschd.p_disp_detail_sched?term_in=201610&crn_in=20172
But when I try to access the links inside that page, like the URL behind "View Catalog Entry", with HttpClient, I get a 410 Gone.
I used a Tuple to set up the series of requests:
var requests = new List<Tuple<Httpmethod, string, FormUrlEncodedContent, string>>()
{
    new Tuple<Httpmethod, string, FormUrlEncodedContent, string>(Httpmethod.GET, "https://selfservice.mypurdue.purdue.edu/prod/bwckschd.p_disp_detail_sched?term_in=201610&crn_in=20172", null, ""),
    new Tuple<Httpmethod, string, FormUrlEncodedContent, string>(Httpmethod.GET, "https://selfservice.mypurdue.purdue.edu/prod/bwckctlg.p_display_courses?term_in=201610&one_subj=EPCS&sel_crse_strt=10100&sel_crse_end=10100&sel_subj=&sel_levl=&sel_schd=&sel_coll=&sel_divs=&sel_dept=&sel_attr=", null, ""),
    new Tuple<Httpmethod, string, FormUrlEncodedContent, string>(Httpmethod.GET, "https://selfservice.mypurdue.purdue.edu/prod/bwckctlg.p_disp_listcrse?term_in=201610&subj_in=EPCS&crse_in=10100&schd_in=%", null, "")
};
Then I use a foreach loop to pass each element of requests to a function:
HttpClientHandler handler = new HttpClientHandler()
{
    CookieContainer = cookies,
    AllowAutoRedirect = false,
    AutomaticDecompression = DecompressionMethods.GZip
};
HttpClient client = new HttpClient(handler)
{
    BaseAddress = new Uri(url),
    Timeout = TimeSpan.FromMilliseconds(20000)
};
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate, sdch");
client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4");
client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36");
client.DefaultRequestHeaders.Connection.Add("keep-alive");
if (referrer.Length > 0)
    client.DefaultRequestHeaders.Referrer = new Uri(referrer);
System.Diagnostics.Debug.WriteLine("Navigating to '" + url + "'...");
HttpResponseMessage result = null;
//List<Cookie> cook = GetAllCookies(cookies);
try
{
    switch (method)
    {
        case Httpmethod.POST:
            result = await client.PostAsync(url, post_content);
            break;
        case Httpmethod.GET:
            result = await client.GetAsync(url);
            break;
    }
}
catch (HttpRequestException ex)
{
    throw new ApplicationException(ex.Message);
}
I have no clue what is wrong with my method, since I have successfully handled requests that need a login with this same approach.
I intend to build some sort of web crawler using this.
Please help, thanks in advance.
I forgot to mention: the links inside the page above can be accessed by clicking on them in a web browser, but simply copying them into the address bar gets a 410 Gone as well.
EDIT: There is some JavaScript code I saw in the page source that creates an XMLHttpRequest to fetch the content, so it might be that the page only refreshes part of itself instead of loading a new page, even though the URL changes. How do I get HttpClient to do the "clicking" part?
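One thing that may be worth trying (a guess prompted by the EDIT above, not a confirmed fix) is to replay the request the page's own script would make: issue the same GET with the first page as the Referer and an X-Requested-With: XMLHttpRequest header, so the server sees it as the in-page request rather than a direct navigation:
// Sketch: replay the catalog link as if the page's own JavaScript had requested it.
// The exact headers the server checks are an assumption and may need adjusting.
client.DefaultRequestHeaders.Referrer = new Uri("https://selfservice.mypurdue.purdue.edu/prod/bwckschd.p_disp_detail_sched?term_in=201610&crn_in=20172");
client.DefaultRequestHeaders.TryAddWithoutValidation("X-Requested-With", "XMLHttpRequest");
HttpResponseMessage catalogResult = await client.GetAsync("https://selfservice.mypurdue.purdue.edu/prod/bwckctlg.p_display_courses?term_in=201610&one_subj=EPCS&sel_crse_strt=10100&sel_crse_end=10100&sel_subj=&sel_levl=&sel_schd=&sel_coll=&sel_divs=&sel_dept=&sel_attr=");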

c# screen scraping and getting all the cookies for secured access to a website

I'm trying to access a website through a C# program. There seem to be three cookies needed to access the website, yet I only receive two in my cookie container, so when I try to access other parts of the website I can't. I first do a GET and then a POST. I programmed it this way because, from the Chrome Dev tools, it looked like the site first uses a GET to set the first two cookies and then a POST to log in and set the third one. The POST shows a 302 Moved Temporarily and then, right after that, a redirect, which I believe is the reason I can't obtain the last cookie. Can anyone shed any light?
cookieJar = new CookieContainer();
string formParams = string.Format("USERNAME={0}&PASSWORD={1}", username, password);
Console.Write(" \n 1st count before anything : " + cookieJar.Count + "\n"); // 0 cookies

// First go to the login page to obtain cookies
HttpWebRequest loginRequest = (HttpWebRequest)HttpWebRequest.Create("https://server.com/login/login.jsp");
loginRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
//.Connection = "keep-alive";
loginRequest.Method = "GET";
loginRequest.UseDefaultCredentials = true;
loginRequest.CookieContainer = cookieJar;
loginRequest.AllowAutoRedirect = false;
HttpWebResponse loginResponse = (HttpWebResponse)loginRequest.GetResponse();
Console.Write(" \n 2nd count after first response : " + cookieJar.Count + "\n"); // Only 2 are recorded.

// Create another request to actually log into the website
HttpWebRequest doLogin = (HttpWebRequest)HttpWebRequest.Create("https://server.com/login/login.jsp");
doLogin.Method = "POST";
doLogin.ContentType = "application/x-www-form-urlencoded";
doLogin.AllowAutoRedirect = false;
doLogin.CookieContainer = cookieJar;
doLogin.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36";
doLogin.Referer = "https://server.com/login/login.jsp";
byte[] bytes = Encoding.ASCII.GetBytes(formParams);
doLogin.ContentLength = bytes.Length;
using (Stream os = doLogin.GetRequestStream())
{
    os.Write(bytes, 0, bytes.Length);
}
HttpWebResponse Response = (HttpWebResponse)doLogin.GetResponse();
Console.Write(" \n 3rd count after second response : " + cookieJar.Count + "\n"); // still two
HttpWebRequest has had a problem with cookies.
The problem was that a cookie assigned to "server.com" would be changed to ".server.com"; however, "server.com" does not match ".server.com".
If you are using a framework version older than (I think it is 3), you are probably experiencing this problem.
The workaround is to use e.g. "www.server.com" in your request; that will match cookies assigned to ".server.com".
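As a rough sketch of that workaround (assuming the site also answers on the www host, which you would need to verify), both requests simply target the www-prefixed URL so the stored cookie domains match:
// Workaround sketch: request the www host so cookies stored for ".server.com" still match.
HttpWebRequest loginRequest = (HttpWebRequest)WebRequest.Create("https://www.server.com/login/login.jsp");
loginRequest.CookieContainer = cookieJar;

HttpWebRequest doLogin = (HttpWebRequest)WebRequest.Create("https://www.server.com/login/login.jsp");
doLogin.CookieContainer = cookieJar;
doLogin.Referer = "https://www.server.com/login/login.jsp";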

Screen Scrape a page of a web app - Internal Server Error

I am trying to screen scrape a page of a web app that just contains text and is hosted by a 3rd party. It's not a properly formed HTML page; however, the text that is displayed will tell us whether the web app is up or down.
When I try to scrape the screen it returns an error on the WebRequest. The error is "The remote server returned an error: (500) Internal Server Error."
public void ScrapeScreen()
{
    try
    {
        var url = textBox1.Text;
        var request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            richTextBox1.Text = reader.ReadToEnd();
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
Any ideas how I can get the text from the page?
Some sites don't like the default UserAgent. Consider changing it to something real, like:
((HttpWebRequest)request).UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4";
First, try this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
However, if you're just looking for text and don't have to POST any data to the server, you may want to look at the WebClient class. It more closely resembles a real browser and takes care of a lot of the HTTP header stuff that you may otherwise end up having to tweak if you stick with the HttpWebRequest class.
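For example, a minimal WebClient sketch (the User-Agent string here is just an illustrative value) would be:
using (var client = new WebClient())
{
    // Some servers return 500 or otherwise block requests that carry no User-Agent.
    client.Headers[HttpRequestHeader.UserAgent] =
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4";
    richTextBox1.Text = client.DownloadString(textBox1.Text);
}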

Why Does my HttpWebRequest Return 400 Bad request?

The following code fails with a 400 Bad Request exception. My network connection is good and I can go to the site in a browser, but I cannot get this URI with HttpWebRequest.
private void button3_Click(object sender, EventArgs e)
{
    WebRequest req = HttpWebRequest.Create(@"http://www.youtube.com/");
    try
    {
        // returns a 400 bad request... Any ideas???
        WebResponse response = req.GetResponse();
    }
    catch (WebException ex)
    {
        Log(ex.Message);
    }
}
First, cast the WebRequest to an HttpWebRequest like this:
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(@"http://www.youtube.com/");
Then, add this line of code:
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";
Set UserAgent and Referer in your HttpWebRequest:
var request = (HttpWebRequest)WebRequest.Create(@"http://www.youtube.com/");
request.Referer = "http://www.youtube.com/"; // optional
request.UserAgent =
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; " +
"Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; " +
".NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30618; " +
"InfoPath.2; OfficeLiveConnector.1.3; OfficeLivePatch.0.0)";
try
{
    var response = (HttpWebResponse)request.GetResponse();
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        var html = reader.ReadToEnd();
    }
}
catch (WebException ex)
{
    Log(ex);
}
There could be many causes for this problem. Do you have any more details from the WebException?
One cause, which I've run into before, is a bad user agent string. Some websites (Google, for instance) check that requests are coming from known user agents to prevent automated bots from hitting their pages.
In fact, you may want to check that the user agreement for YouTube does not preclude you from doing what you're doing. If it does, what you're doing may be better accomplished by going through approved channels such as web services.
Maybe you've got a proxy server running, and you haven't set the Proxy property of the HttpWebRequest?
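If that turns out to be the case, a minimal sketch of wiring one up (the proxy host and port are placeholders for whatever your network actually uses) looks like:
var request = (HttpWebRequest)WebRequest.Create("http://www.youtube.com/");
// "myproxy.example.com:8080" is a placeholder - substitute your real proxy host and port.
request.Proxy = new WebProxy("http://myproxy.example.com:8080")
{
    Credentials = CredentialCache.DefaultCredentials // if the proxy requires your Windows credentials
};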
