How to get the favicon from a 403 page - c#

I am writing a tool that allows the user to input a URL, to which the program responds by attempting to show that website's favicon. I have this working for many sites, but one site that is giving me trouble is my self-hosted Trac site. It seems that Trac's normal behaviour, until the end user is authenticated, is to show a custom 403 page (Forbidden) inviting the user to log in. When I access Trac from a web browser, the favicon displays in the browser's tab even though I'm not logged in (and Firebug, for instance, shows a 403 for the page content). If I view source from the browser, the favicon's location is right there in the source. However, from my application, requesting the Trac website with request.GetResponse() throws a WebException containing a 403, giving me no opportunity to read the response stream that contains the vital information required to find the favicon.
I already have code to download a website's HTML and extract the location of its favicon. What I am stuck with is downloading a site's HTML even when it responds with a 403.
I played with various UserAgent, Accept and AcceptLanguage properties of the HttpWebRequest object but it didn't help. I also tried following any redirects myself as I read somewhere that .NET doesn't do them well. Still no luck.
Here's what I have:
public static MemoryStream DownloadHtml(
    string urlParam,
    int timeoutMs = DefaultHttpRequestTimeoutMs,
    string userAgent = "",
    bool silent = false
)
{
    MemoryStream result = null;
    HttpWebRequest request = null;
    HttpWebResponse response = null;
    try
    {
        Func<string, HttpWebRequest> createRequest = (urlForFunc) =>
        {
            var requestForAction = (HttpWebRequest)HttpWebRequest.Create(urlForFunc);
            // This step is now required by Wikipedia (and others?) to prevent periodic or
            // even constant 403's (Forbidden).
            requestForAction.UserAgent = userAgent;
            requestForAction.Accept = "text/html";
            requestForAction.AllowAutoRedirect = false;
            requestForAction.Timeout = timeoutMs;
            return requestForAction;
        };

        string urlFromResponse = "";
        string urlForRequest = "";
        do
        {
            if(response == null)
            {
                urlForRequest = urlParam;
            }
            else
            {
                urlForRequest = urlFromResponse;
                response.Close();
            }
            request = createRequest(urlForRequest);
            response = (HttpWebResponse)request.GetResponse();
            urlFromResponse = response.Headers[HttpResponseHeader.Location];
        }
        while(urlFromResponse != null
            && urlFromResponse.Length > 0
            && urlFromResponse != urlForRequest);

        using(var stream = response.GetResponseStream())
        {
            result = new MemoryStream();
            stream.CopyTo(result);
        }
    }
    catch(WebException ex)
    {
        // Things like 404 and, well, all other web-type exceptions.
        Debug.WriteLine(ex.Message);
        if(ex.InnerException != null) Debug.WriteLine(ex.InnerException.Message);
    }
    catch(System.Threading.ThreadAbortException)
    {
        // Let ac.Thread handle some cleanup.
        throw;
    }
    catch(Exception)
    {
        if(!silent) throw;
    }
    finally
    {
        if(response != null) response.Close();
    }
    return result;
}

The stream content is still available from the WebException object:
var resp = new StreamReader(ex.Response.GetResponseStream()).ReadToEnd();
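Applied to the DownloadHtml method above, the WebException handler could copy the 403 response body instead of discarding it; a sketch (error handling trimmed for brevity):

catch(WebException ex)
{
    // Even a 403 can carry a useful body (e.g. Trac's custom "please log in"
    // page, which still references the favicon).
    var errorResponse = ex.Response as HttpWebResponse;
    if(errorResponse != null)
    {
        using(var stream = errorResponse.GetResponseStream())
        {
            result = new MemoryStream();
            stream.CopyTo(result);
        }
    }
    Debug.WriteLine(ex.Message);
}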

Related

A simple reverse proxy using ASP.NET, C# with authentication

I am attempting to forward custom parameters to a RESTful API server and return the proxied response to the client-facing server. I don't want the client to have access to or be able to read the API HTTP request/response interactions, so I decided to perform this action using a reverse proxy. I have no problem forwarding the request and returning a response. The problem lies in the authentication. The client-facing server always wants to redirect to the login page because it doesn't believe the client is authenticated. I have tried using HTTPS and HTTP with similar results.
I have been researching this problem for quite some time and found quite a variety of answers, none of which seem to quite encompass my specific use case. I am following this example, which is the closest to what I specifically need. However, the credentials portion the author commented out (//request.Credentials = CredentialCache.DefaultCredentials;) doesn't seem to cover the authentication portion I am attempting to implement. Please help me understand this problem and solution.
Here is the code I am using from the controller:
public ActionResult ProxyEndpoint(string custom_string, string another_custom_string)
{
    // Bunch of code here to grab the remoteUrl from AppConfig and do stuff to the
    // parameters and store them in queryString, unnecessary to show here.
    // Here's the important bits:
    remoteUrl = remoteUrl + "?" + queryString; // create my remoteUrl
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(remoteUrl);
    request.Credentials = CredentialCache.DefaultCredentials;
    // Also tried this to no avail:
    request.Credentials = CredentialCache.DefaultNetworkCredentials;
    return new ProxyActionResult(request.GetResponse());
}
Here is the ProxyActionResult class:
public class ProxyActionResult : ActionResult
{
    WebResponse _response;

    public ProxyActionResult(WebResponse response)
    {
        _response = response;
    }

    public override void ExecuteResult(ControllerContext controllerContext)
    {
        HttpContextBase httpContext = controllerContext.HttpContext;
        WebResponse response = _response;
        // Read the byte stream from the response:
        Stream responseStream = response.GetResponseStream();
        // Pulled this next piece from http://www.codeproject.com/Articles/7135/Simple-HTTP-Reverse-Proxy-with-ASP-NET-and-IIS
        // Seemed to fit our use case.
        if ((response.ContentType.ToLower().IndexOf("html") >= 0) || (response.ContentType.ToLower().IndexOf("javascript") >= 0)) // || (response.ContentType.ToLower().IndexOf("image") >= 0))
        {
            // If the response is HTML content, parse it like HTML:
            StreamReader readStream = new StreamReader(responseStream, Encoding.Default);
            String content;
            content = ParseHtmlResponse(readStream.ReadToEnd(), httpContext.Request.ApplicationPath);
            // Write the updated HTML to the client (and then close the response):
            httpContext.Response.Write(content);
            httpContext.Response.ContentType = response.ContentType;
            response.Close();
            httpContext.Response.End();
        }
        else
        {
            // If the response is not HTML content, write the stream directly to the client:
            var buffer = new byte[1024];
            int bytes = 0;
            while ((bytes = responseStream.Read(buffer, 0, 1024)) > 0)
            {
                httpContext.Response.OutputStream.Write(buffer, 0, bytes);
            }
            // from http://www.dotnetperls.com/response-binarywrite
            httpContext.Response.ContentType = response.ContentType; // Set the appropriate content type of the response stream.
            // and close the stream:
            response.Close();
            httpContext.Response.End();
        }
        //throw new NotImplementedException();
    }

    // Debating whether we need this:
    public string ParseHtmlResponse(string html, string appPath)
    {
        html = html.Replace("\"/", "\"" + appPath + "/");
        html = html.Replace("'/", "'" + appPath + "/");
        html = html.Replace("=/", "=" + appPath + "/");
        return html;
    }
}
It turns out that nothing is wrong with the reverse proxy code. The remote server was an ArcGIS OpenLayers API and it had a setting that said crossOrigin: anonymous. I commented out this setting and it worked perfectly.
Check out the documentation if you have this particular ArcGIS OpenLayers problem:
http://openlayers.org/en/v3.14.2/apidoc/ol.source.ImageWMS.html

Passing data from WebResponse to different method

I am trying to pass HttpWebResponse data from a method that checks whether the web address entered by the user exists to another method that uses a StreamReader to read the HTML source code for later processing. Even though no error is shown, the source code never appears in the list box I prepared for it. There is also a button click event, which I am not including because it shouldn't have any impact on the problem.
protected bool ZkontrolujExistenciStranky(string WebovaStranka)
{
    try
    {
        var pozadavek = WebRequest.Create(WebovaStranka) as HttpWebRequest;
        pozadavek.Method = "HEAD";
        using (var odezva = (HttpWebResponse)pozadavek.GetResponse())
        {
            GetData(odezva);
            return odezva.StatusCode == HttpStatusCode.OK;
        }
    }
    catch
    {
        return false;
    }
}

protected void GetData(HttpWebResponse ziskanaOdezva)
{
    using (Stream strm = ziskanaOdezva.GetResponseStream())
    {
        StreamReader reader = new StreamReader(strm);
        string prochazec;
        while ((prochazec = reader.ReadLine()) != null)
        {
            listBox1.Items.Add(prochazec);
        }
    }
}
You are using the HEAD method, whose whole point is not to return a body; only headers are returned. Use GET if you want the body.
HTTP HEAD method:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request.
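For example, the checking method from the question could issue a normal GET when it also needs the body; a sketch of one way to adapt it:

protected bool ZkontrolujExistenciStranky(string WebovaStranka)
{
    try
    {
        var pozadavek = (HttpWebRequest)WebRequest.Create(WebovaStranka);
        // GET (the default) returns a body, so GetData actually has something to read.
        pozadavek.Method = "GET";
        using (var odezva = (HttpWebResponse)pozadavek.GetResponse())
        {
            GetData(odezva);
            return odezva.StatusCode == HttpStatusCode.OK;
        }
    }
    catch
    {
        return false;
    }
}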

Status code 301 not showing correctly in C#

I am able to get the status code numbers with the enum as suggested by dtb in Getting Http Status code number (200, 301, 404, etc.) from HttpWebRequest and HttpWebResponse. However, for a site that has moved permanently I am also getting 200 (OK); what I want to see instead is 301. Please help. My code is below. What could be wrong/needs to be corrected?
public int GetHeaders(string url)
{
    //HttpStatusCode result = default(HttpStatusCode);
    int result = 0;
    var request = HttpWebRequest.Create(url);
    request.Method = "HEAD";
    try
    {
        using (var response = request.GetResponse() as HttpWebResponse)
        {
            if (response != null)
            {
                result = (int)response.StatusCode; // response.StatusCode;
                response.Close();
            }
        }
    }
    catch (WebException we)
    {
        if (we.Response != null)
        {
            result = (int)((HttpWebResponse)we.Response).StatusCode;
        }
    }
    return result;
}
The tool where I am using this code is capable of showing 404s and non-existent domains, but it ignores redirects and shows the details of the redirect's target URL instead. For example, if I put my older domain easytipsandtricks.com in the text field, it shows the results for tipscow.com (if you check easytipsandtricks.com in any online checker tool, you will notice that it gives the correct redirect message - 301 Moved). Please help.
You need to set HttpWebRequest.AllowAutoRedirect to false (the default is true) so that it does not automatically follow redirects (30x responses).
If AllowAutoRedirect is set to false, all responses with an HTTP status code from 300 to 399 are returned to the application.
Sample:
var request = (HttpWebRequest)HttpWebRequest.Create(url);
request.Method = "HEAD";
request.AllowAutoRedirect = false;
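Folded into the GetHeaders method from the question, that looks roughly like this (a sketch):

public int GetHeaders(string url)
{
    int result = 0;
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "HEAD";
    // Don't follow 30x responses; report them to the caller instead.
    request.AllowAutoRedirect = false;
    try
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            result = (int)response.StatusCode; // e.g. 301 for a permanently moved page
        }
    }
    catch (WebException we)
    {
        if (we.Response != null)
        {
            result = (int)((HttpWebResponse)we.Response).StatusCode;
        }
    }
    return result;
}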

Screen scraping web page after delay

I'm trying to scrape a web page using C#, however after the page loads, it executes some JavaScript which loads more elements into the DOM which I need to scrape. A standard scraper simply grabs the html of the page on load and doesn't pick up the DOM changes made via JavaScript. How do I put in some sort of functionality to wait for a second or two and then grab the source?
Here is my current code:
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    HttpWebRequest request = null;
    HttpWebResponse response = null;
    Stream responseStream = null;
    StreamReader reader = null;
    string html = null;
    try
    {
        //create request (which supports http compression)
        request = (HttpWebRequest)WebRequest.Create(url);
        request.Pipelined = true;
        request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
        if (updateDate != null)
            request.IfModifiedSince = updateDate.Value;
        //get response.
        response = (HttpWebResponse)request.GetResponse();
        responseStream = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);
        //read html.
        reader = new StreamReader(responseStream, Encoding.Default);
        html = reader.ReadToEnd();
    }
    catch
    {
        throw;
    }
    finally
    {
        //dispose of objects.
        request = null;
        if (response != null)
        {
            response.Close();
            response = null;
        }
        if (responseStream != null)
        {
            responseStream.Close();
            responseStream.Dispose();
        }
        if (reader != null)
        {
            reader.Close();
            reader.Dispose();
        }
    }
    return html;
}
Here's a sample URL:
http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4
You'll see when the page first loads it says 134 listings found, then after a second it says 187 properties found.
To execute the JavaScript I use WebKit to render the page, which is the engine used by Chrome and Safari. Here is an example using its Python bindings.
WebKit also has .NET bindings, but I haven't used them.
The approach you have will not work regardless of how long you wait; you need a browser to execute the JavaScript (or something else that understands JavaScript).
Try this question:
What's a good tool to screen-scrape with Javascript support?
You would need to execute the javascript yourself to get this functionality. Currently, your code only receives whatever the server replies with at the URL you request. The rest of the listings are "showing up" because the browser downloads, parses, and executes the accompanying javascript.
The answer to this similar question says to use a web browser control to read the page in and process it before scraping it. Perhaps with some kind of timer delay to give the javascript some time to execute and return results.
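Roughly, that approach looks like the sketch below: a WinForms WebBrowser control pumping messages on an STA thread, with a fixed delay (an assumption; tune it for the site) to give the page's JavaScript time to add its elements before the DOM is read.

using System;
using System.Threading;
using System.Windows.Forms;

static string ScrapeRenderedHtml(string url, int delayMs = 2000)
{
    string html = null;
    var thread = new Thread(() =>
    {
        using (var browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true;
            browser.Navigate(url);
            // Pump messages until the document has loaded.
            while (browser.ReadyState != WebBrowserReadyState.Complete)
                Application.DoEvents();
            // Keep pumping a little longer so the page's JavaScript can run.
            var until = DateTime.UtcNow.AddMilliseconds(delayMs);
            while (DateTime.UtcNow < until)
                Application.DoEvents();
            // Read the current DOM rather than the original page source.
            html = (browser.Document != null && browser.Document.Body != null)
                ? browser.Document.Body.OuterHtml
                : browser.DocumentText;
        }
    });
    // The WebBrowser control only works on a single-threaded apartment thread.
    thread.SetApartmentState(ApartmentState.STA);
    thread.Start();
    thread.Join();
    return html;
}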

C# How can I check if a URL exists/is valid?

I am making a simple program in visual c# 2005 that looks up a stock symbol on Yahoo! Finance, downloads the historical data, and then plots the price history for the specified ticker symbol.
I know the exact URL that I need to acquire the data, and if the user inputs an existing ticker symbol (or at least one with data on Yahoo! Finance) it works perfectly fine. However, I have a run-time error if the user makes up a ticker symbol, as the program tries to pull data from a non-existent web page.
I am using the WebClient class, and using the DownloadString function. I looked through all the other member functions of the WebClient class, but didn't see anything I could use to test a URL.
How can I do this?
Here is another implementation of this solution:
using System.Net;

/// <summary>
/// Checks whether the remote file exists or not.
/// </summary>
/// <param name="url">The URL of the remote file.</param>
/// <returns>True if the file exists, false if the file does not exist.</returns>
private bool RemoteFileExists(string url)
{
    try
    {
        //Creating the HttpWebRequest
        HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
        //Setting the Request method HEAD, you can also use GET too.
        request.Method = "HEAD";
        //Getting the Web Response.
        HttpWebResponse response = request.GetResponse() as HttpWebResponse;
        //Returns TRUE if the status code == 200
        response.Close();
        return (response.StatusCode == HttpStatusCode.OK);
    }
    catch
    {
        //Any exception will return false.
        return false;
    }
}
From: http://www.dotnetthoughts.net/2009/10/14/how-to-check-remote-file-exists-using-c/
You could issue a "HEAD" request rather than a "GET"?
So to test a URL without the cost of downloading the content:
// using MyClient from linked post
using(var client = new MyClient()) {
client.HeadOnly = true;
// fine, no content downloaded
string s1 = client.DownloadString("http://google.com");
// throws 404
string s2 = client.DownloadString("http://google.com/silly");
}
You would try/catch around the DownloadString to check for errors; no error? It exists...
With C# 2.0 (VS2005):
private bool headOnly;
public bool HeadOnly {
    get { return headOnly; }
    set { headOnly = value; }
}

and

using(WebClient client = new MyClient())
{
    // code as before
}
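For completeness, MyClient in the linked post is a small WebClient subclass along these lines (a sketch from memory, so treat the exact shape as an assumption):

class MyClient : WebClient
{
    public bool HeadOnly { get; set; }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        // Switch GET requests to HEAD so the server sends headers only, no body.
        if (HeadOnly && request.Method == "GET")
        {
            request.Method = "HEAD";
        }
        return request;
    }
}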
These solutions are pretty good, but they are forgetting that there may be other status codes than 200 OK. This is a solution that I've used on production environments for status monitoring and such.
If there is a url redirect or some other condition on the target page, the return will be true using this method. Also, GetResponse() will throw an exception and hence you will not get a StatusCode for it. You need to trap the exception and check for a ProtocolError.
Any 400 or 500 status code will return false. All others return true.
This code is easily modified to suit your needs for specific status codes.
/// <summary>
/// This method will check a url to see that it does not return server or protocol errors
/// </summary>
/// <param name="url">The path to check</param>
/// <returns></returns>
public bool UrlIsValid(string url)
{
    try
    {
        HttpWebRequest request = HttpWebRequest.Create(url) as HttpWebRequest;
        request.Timeout = 5000; //set the timeout to 5 seconds to keep the user from waiting too long for the page to load
        request.Method = "HEAD"; //Get only the header information -- no need to download any content
        using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
        {
            int statusCode = (int)response.StatusCode;
            if (statusCode >= 100 && statusCode < 400) //Good requests
            {
                return true;
            }
            else if (statusCode >= 500 && statusCode <= 510) //Server Errors
            {
                //log.Warn(String.Format("The remote server has thrown an internal error. Url is not valid: {0}", url));
                Debug.WriteLine(String.Format("The remote server has thrown an internal error. Url is not valid: {0}", url));
                return false;
            }
        }
    }
    catch (WebException ex)
    {
        if (ex.Status == WebExceptionStatus.ProtocolError) //400 errors
        {
            return false;
        }
        else
        {
            log.Warn(String.Format("Unhandled status [{0}] returned for url: {1}", ex.Status, url), ex);
        }
    }
    catch (Exception ex)
    {
        log.Error(String.Format("Could not test url {0}.", url), ex);
    }
    return false;
}
If I understand your question correctly, you could use a small method like this to give you the results of your URL test:
WebRequest webRequest = WebRequest.Create(url);
WebResponse webResponse;
try
{
    webResponse = webRequest.GetResponse();
}
catch //If exception thrown then couldn't get response from address
{
    return 0;
}
return 1;
You could wrap the above code in a method and use it to perform validation. I hope this answers the question you were asking.
Try this (Make sure you use System.Net):
public bool checkWebsite(string URL) {
    try {
        WebClient wc = new WebClient();
        string HTMLSource = wc.DownloadString(URL);
        return true;
    }
    catch (Exception) {
        return false;
    }
}
When the checkWebsite() function gets called, it tries to get the source code of the URL passed into it. If it gets the source code, it returns true. If not, it returns false.
Code Example:
//The checkWebsite command will return true:
bool websiteExists = this.checkWebsite("https://www.google.com");
//The checkWebsite command will return false:
bool websiteExists = this.checkWebsite("https://www.thisisnotarealwebsite.com/fakepage.html");
I have always found exceptions to be much slower to handle.
Perhaps a less intensive way would yield a better, faster result?
public bool IsValidUri(Uri uri)
{
    using (HttpClient Client = new HttpClient())
    {
        HttpResponseMessage result = Client.GetAsync(uri).Result;
        HttpStatusCode StatusCode = result.StatusCode;
        switch (StatusCode)
        {
            case HttpStatusCode.Accepted:
                return true;
            case HttpStatusCode.OK:
                return true;
            default:
                return false;
        }
    }
}
Then just use:
IsValidUri(new Uri("http://www.google.com/censorship_algorithm"));
A lot of the answers are older than HttpClient (introduced with .NET 4.5) or don't use async/await, so I decided to post my own solution:
private static async Task<bool> DoesUrlExists(String url)
{
    try
    {
        using (HttpClient client = new HttpClient())
        {
            //Do only a HEAD request to avoid downloading the full file
            var response = await client.SendAsync(new HttpRequestMessage(HttpMethod.Head, url));
            if (response.IsSuccessStatusCode)
            {
                //Url is available if we have a SuccessStatusCode
                return true;
            }
            return false;
        }
    }
    catch
    {
        return false;
    }
}
I use HttpClient.SendAsync with HttpMethod.Head to make only a HEAD request and not download the whole file. As David and Marc already say, there is not only HTTP 200 for OK, so I use IsSuccessStatusCode to allow all success status codes.
WebRequest request = WebRequest.Create("http://www.google.com");
try
{
    request.GetResponse();
}
catch //If exception thrown then couldn't get response from address
{
    MessageBox.Show("The URL is incorrect");
}
This solution seems easy to follow:
public static bool isValidURL(string url) {
    WebRequest webRequest = WebRequest.Create(url);
    WebResponse webResponse;
    try
    {
        webResponse = webRequest.GetResponse();
    }
    catch //If exception thrown then couldn't get response from address
    {
        return false;
    }
    return true;
}
Here is another option
public static bool UrlIsValid(string url)
{
    bool br = false;
    try {
        // Note: this only checks that the host name resolves in DNS. Pass a host
        // name (e.g. "www.google.com"), not a full URL, and it says nothing about
        // whether the page itself exists.
        IPHostEntry ipHost = Dns.Resolve(url);
        br = true;
    }
    catch (SocketException se) {
        br = false;
    }
    return br;
}
A lot of other answers are using WebRequest which is now obsolete.
Here is a method that has minimal code and uses currently up-to-date classes and methods.
I have also tested the other most up-voted functions which can produce false positives.
I tested with these URLs, which point to the Visual Studio Community Installer, found on this page.
//Valid URL
https://aka.ms/vs/17/release/vs_community.exe
//Invalid URL, redirects. Produces false positive on other methods.
https://aka.ms/vs/14/release/vs_community.exe
using System.Net;
using System.Net.Http;

//HttpClient is not meant to be created and disposed frequently.
//Declare it statically in the class to be reused.
static HttpClient client = new HttpClient();

/// <summary>
/// Checks if a remote file at the <paramref name="url"/> exists, and if access is not restricted.
/// </summary>
/// <param name="url">URL to a remote file.</param>
/// <returns>True if the file at the <paramref name="url"/> is able to be downloaded, false if the file does not exist, or if the file is restricted.</returns>
public static bool IsRemoteFileAvailable(string url)
{
    //Checking if URI is well formed is optional
    Uri uri = new Uri(url);
    if (!uri.IsWellFormedOriginalString())
        return false;
    try
    {
        using (HttpRequestMessage request = new HttpRequestMessage(HttpMethod.Head, uri))
        using (HttpResponseMessage response = client.Send(request))
        {
            return response.IsSuccessStatusCode && response.Content.Headers.ContentLength > 0;
        }
    }
    catch
    {
        return false;
    }
}
Just note that this will not work with .NET Framework, as HttpClient.Send does not exist.
To get it working on .NET Framework you will need to change client.Send(request) to client.SendAsync(request).Result.
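In that case the relevant lines of the method above become (same logic, just blocking on the asynchronous call):

using (HttpRequestMessage request = new HttpRequestMessage(HttpMethod.Head, uri))
// Block on SendAsync, since .NET Framework's HttpClient has no synchronous Send.
using (HttpResponseMessage response = client.SendAsync(request).Result)
{
    return response.IsSuccessStatusCode && response.Content.Headers.ContentLength > 0;
}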
Web servers respond with an HTTP status code indicating the outcome of the request, e.g. 200 (sometimes 202) means success, 404 means not found, and so on (see here). Assuming the server address part of the URL is correct and you are not getting a socket timeout, the exception is most likely telling you the HTTP status code was other than 200. I would suggest checking the class of the exception and seeing if the exception carries the HTTP status code.
IIRC - The call in question throws a WebException or a descendant. Check the class name to see which one and wrap the call in a try block to trap the condition.
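A minimal sketch of that check (assuming url holds the address being tested):

try
{
    using (var response = (HttpWebResponse)WebRequest.Create(url).GetResponse())
    {
        // Got here: the server answered with a success status.
    }
}
catch (WebException ex)
{
    var httpResponse = ex.Response as HttpWebResponse;
    if (httpResponse != null)
    {
        // Non-success status, e.g. 404 Not Found or 403 Forbidden.
        Console.WriteLine((int)httpResponse.StatusCode);
    }
    // If Response is null, the failure was lower-level (DNS, socket timeout, ...).
}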
I have a simpler way to determine whether a URL is valid:
if (Uri.IsWellFormedUriString(uriString, UriKind.RelativeOrAbsolute))
{
    //...
}
Following on from the examples already given, I'd say it's best practice to also wrap the response in a using block, like this:
public bool IsValidUrl(string url)
{
    try
    {
        var request = WebRequest.Create(url);
        request.Timeout = 5000;
        request.Method = "HEAD";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            response.Close();
            return response.StatusCode == HttpStatusCode.OK;
        }
    }
    catch (Exception exception)
    {
        return false;
    }
}
