I'm trying to find the best way (in code) to determine the final destination of a shortened URL. For instance, a http://tinyurl.com link redirects to an eBay auction, and it's the URL of that auction I'm after. I want to do this from within .NET so I can compare multiple URLs and make sure there are no duplicates.
TIA
By the time I had spent a minute writing code to confirm this worked, the answer had already been posted, but here is the code anyway:
private static string GetRealUrl(string url)
{
    // Requires: using System.Net;
    WebRequest request = WebRequest.Create(url);
    request.Method = WebRequestMethods.Http.Head;
    using (WebResponse response = request.GetResponse())
    {
        // ResponseUri holds the final URI after any automatic redirects.
        return response.ResponseUri.ToString();
    }
}
This will work as long as the short url service does a regular redirect.
You should issue a HEAD request to the url using a HttpWebRequest instance. In the returned HttpWebResponse, check the ResponseUri.
Just make sure the AllowAutoRedirect is set to true on the HttpWebRequest instance (it is true by default).
One way would be to request the URL and look at the response code. If it's a 301 (permanent redirect), follow where it's taking you, and keep doing this until you reach a 200 (OK). With TinyURL you may go through several 301s before you reach a 200.
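If you'd rather follow the chain yourself than rely on AllowAutoRedirect, a rough sketch might look like this (the helper name and the 10-hop limit are my own choices; some services don't handle HEAD well, in which case a GET would be needed):
// Requires: using System; using System.Net;
private static string FollowRedirects(string url)
{
    // Follow Location headers by hand until the response is no longer a redirect.
    for (int hop = 0; hop < 10; hop++) // arbitrary safety limit against redirect loops
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        request.AllowAutoRedirect = false; // we want to see each 3xx ourselves
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            int code = (int)response.StatusCode;
            if (code < 300 || code >= 400)
                return url; // 200 (or any other non-redirect status): this is the final URL
            // Resolve a possibly relative Location header against the current URL.
            url = new Uri(new Uri(url), response.Headers["Location"]).ToString();
        }
    }
    throw new InvalidOperationException("Too many redirects");
}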
Assuming you don't want to actually follow the link, for TinyURL you can append /info to the end of the url:
http://tinyurl.com/unicycles/info
and it gives you a page showing where that TinyURL links to, which I assume would be easy to parse using XPath or similar.
Most other URL shortening services have similar features, but they all work differently.
I am automating a test for a page that contains a URL which then needs to be tested itself.
I created a method that I believed was giving me the HTTP status code:
public string ContentUrlHttpRequest()
{
HttpWebRequest protocolWebRequest = (HttpWebRequest)WebRequest.Create(ContentUrl());
protocolWebRequest.Method = "GET";
HttpWebResponse response = (HttpWebResponse)protocolWebRequest.GetResponse();
return response.Headers.ToString();
}
ContentUrl() is another method I created that finds the element on the page containing the URL to be tested and gets its value.
I have also tried return response.StatusCode.ToString(); but the response I received was "OK".
I know that the response from that URL needs to be 200. I have this assertion that compares the result of ContentUrlHttpRequest() to the expected value (200):
Assert.AreEqual("200", ContentUrlHttpRequest(), "The Url is not live. Http response = " + ContentUrlHttpRequest());
The response I am getting from ContentUrlHttpRequest() is not the status code but: "Date: Mon, 03 May 2021 09:07:13 GMT".
I understand why this is happening: it is returning the headers of the page it requests. But how can I get the status code? Is it possible with Selenium? Is there something wrong with my method, and should I be using something other than Headers?
Unfortunately I am not able to share the URLs I am testing, or the platform they come from, as they are confidential. Hopefully my issue is clear and you can give me some guidance.
You are not returning the response status code. You are returning the headers.
You should replace the return statement with this:
return ((int)response.StatusCode).ToString();
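Putting it together, the method from the question could then look like this (ContentUrl() is the question's own helper that returns the URL under test):
public string ContentUrlHttpRequest()
{
    HttpWebRequest protocolWebRequest = (HttpWebRequest)WebRequest.Create(ContentUrl());
    protocolWebRequest.Method = "GET";
    using (HttpWebResponse response = (HttpWebResponse)protocolWebRequest.GetResponse())
    {
        // Cast the HttpStatusCode enum to int so "200" is returned instead of "OK".
        return ((int)response.StatusCode).ToString();
    }
}
With that change, the Assert.AreEqual("200", ...) assertion from the question should pass for a live URL.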
Note that HttpWebResponse has no Status property; the relevant members are StatusCode (an HttpStatusCode enum value) and StatusDescription (the reason phrase).
response.StatusCode.ToString() returns the enum name, e.g. "OK" or "Unauthorized", not the number, which is exactly what you observed.
If you prefer a text comparison, you could return something like (int)response.StatusCode + " " + response.StatusDescription and then use Assert.True(ContentUrlHttpRequest().Contains("200")).
Otherwise, cast StatusCode to int as shown above; that gives you the numeric code on its own, without any additional text.
I have an array of image URLs. I want to know which of these URLs are valid and which are not, without using try-catch, and I want to do this as fast as possible.
I would think the only way to know which URLs are valid is to make an HTTP request to each one. If you have a lot of pictures, this will always take time. You can minimize that time by making a HEAD request (as opposed to a GET that downloads the whole response) and checking the status code of the response. If the status code is 200, you can reasonably assume you would get the picture you're looking for; if it is 404, you know the URL is incorrect.
Code might be something like:
// Requires: using System.Net;
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://example.com/");
req.Method = "HEAD";
// Note: GetResponse throws a WebException for 4xx/5xx responses,
// so a missing image will still surface as an exception unless you catch it.
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
HttpStatusCode statuscode = resp.StatusCode;
A note on getting a 200 response: even if you get a 200 response, you cannot be sure that you are actually getting the image you want. You might be getting something else, e.g. a redirect from the image URL.
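If you want a little more confidence that a 200 really points at an image, one option (my addition, not part of the original answer) is to also look at the Content-Type of the HEAD response:
// Continues from the snippet above; resp is the HttpWebResponse from the HEAD request.
string contentType = resp.ContentType; // e.g. "image/jpeg" or "text/html"
bool looksLikeImage = !string.IsNullOrEmpty(contentType) &&
                      contentType.ToLowerInvariant().StartsWith("image/");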
I know this question has been answered before in this thread, but I couldn't seem to find the details.
In my scenario, I am building a console application which will keep an eye on an HTML page's source for any changes. If any update/change occurs, I will perform further operations. I'll also issue a request every second, or as soon as the previous request finishes.
I can't figure out whether I should use HttpWebRequest or WebClient for downloading the HTML page source and performing the comparison. What do you think would be an ideal solution in my case? Speed and reliability both :)
I'd go with HttpWebRequest because it's not as abstracted and lets you fiddle with HTTP parameters quite a bit. It gives you the option to not download the entire page if the server says the file has not changed, for example.
If you add a parameter like IfModifiedSince to your request (it can be a HEAD or a GET request), the server may return response code 304 - Not Modified. Refer to the description of caching in HTTP for further explanation.
The point is to make sure that you only download the full page when it's actually modified since the last time you fetched it. Most of the time it won't be changed (I suppose, can't know for sure without knowing your domain), so you only need to get a lightweight response from server which simply states "nothing changed here".
Update: code sample demonstrating the use of IfModifiedSince property:
bool IsResourceModified(string url, DateTime dateTime) {
    try {
        var request = (HttpWebRequest)WebRequest.Create(new Uri(url));
        // Ask the server to reply only if the resource changed after dateTime.
        request.IfModifiedSince = dateTime;
        request.Method = "HEAD";
        using(var response = (HttpWebResponse)request.GetResponse()) {
            // Any success response means the resource was modified.
            return true;
        }
    }
    catch(WebException ex) {
        // A 304 Not Modified surfaces as a protocol-level WebException.
        if(ex.Status != WebExceptionStatus.ProtocolError)
            throw;
        var response = (HttpWebResponse)ex.Response;
        if(response.StatusCode != HttpStatusCode.NotModified)
            throw;
        return false;
    }
}
This method should return true if the page was modified since the dateTime date and false if it wasn't. GetResponse will throw a WebException if you make a HEAD request and the server returns 304 - Not Modified (which is kind of unfortunate). We have to make sure it's not some other web connection problem, which is why I check the status of the WebException and the HTTP status in the response. If anything else caused the exception, we just rethrow it.
Console.WriteLine(IsResourceModified("http://example.com", new DateTime(2009, 1, 1)));
Console.WriteLine(IsResourceModified("http://example.com", DateTime.Now));
This sample code produces the output:
True
False
Note: make sure to read Jim Mischel's addition to this answer, as he gives some good advice on this technique.
I was going to leave this as a comment to @Dyppl's answer, but it became too long.
Dyppl's response is generally good advice, and the way that I would approach this problem. However, there are a few things you should keep in mind.
First, there's no reason to do a HEAD request followed by a GET if the page has been modified. You can do a GET with the IfModifiedSince header set, and the server will either return the entire page or a 304. Doing the HEAD first, followed by the GET, ends up making two requests to the server, which defeats much of the purpose of the conditional request.
Second, you should set the IfModifiedSince property to the LastModified value returned by the previous response (i.e. HttpWebResponse.LastModified), because the server's clock might not be synchronized with your computer's. Also, I've found that a large percentage of sites, particularly those with generated content (like WordPress blogs), lie: they always return the current date/time in the Last-Modified header. As a result, there is no benefit to doing the If-Modified-Since check on those sites.
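A rough sketch of such a conditional GET (my own illustration, not code from either answer; lastModified is the value saved from the previous successful response):
// Requires: using System; using System.Net;
var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
request.IfModifiedSince = lastModified;
try
{
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        // 200: the server sent the full page; remember its LastModified for next time.
        lastModified = response.LastModified;
        // ... read and compare the body here ...
    }
}
catch (WebException ex)
{
    var resp = ex.Response as HttpWebResponse;
    if (resp == null || resp.StatusCode != HttpStatusCode.NotModified)
        throw;
    // 304 Not Modified: nothing changed, and no body was transferred.
}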
If you know that the site lies and always returns the current date/time, you can keep track of the ContentLength header that's returned from the page when you download it. Then, when you want to check to see if the page has changed, do a HEAD request and check the returned ContentLength header with the saved value. If they match, then it's unlikely that the page has changed. If they don't match, then do a GET request to update your copy of the page and keep the new ContentLength.
This technique does have the disadvantage of requiring two requests if the page has changed. It's also not 100% reliable on all servers. Some will return a different ContentLength for the HEAD request, and some don't return a valid ContentLength at all. That said, I've found it to be effective for a large number of sites.
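A rough sketch of that ContentLength bookkeeping (my own illustration; savedLength is whatever you stored from the last full download):
// HEAD request to compare the advertised length against the one saved earlier.
var request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "HEAD";
using (var response = (HttpWebResponse)request.GetResponse())
{
    // ContentLength is -1 when the server doesn't report it; treat that as "changed".
    bool probablyUnchanged = response.ContentLength >= 0 &&
                             response.ContentLength == savedLength;
    if (!probablyUnchanged)
    {
        // Do a full GET here, update the stored copy of the page, and save the new length.
    }
}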
I am reading from the history database, and for every URL read, I download it and store the data in a string. I want to be able to determine whether the link is a download link, e.g. an .exe or a .zip. I assume I need to read the headers to determine this, but I don't know how to do it with WebClient. Any suggestions?
while (sqlite_datareader.Read())
{
    noIndex = false;
    string url = (string)sqlite_datareader["url"];
    try
    {
        if (url.Contains("http") && (!url.Contains(".pdf")) && (!url.Contains(".jpg")) && (!url.Contains("https")) && !isInBlackList(url))
        {
            using (WebClient client = new WebClient())
            {
                client.Headers.Add("user-agent", "Only a test!");
                string htmlCode = client.DownloadString(url);
            }
        }
    }
    catch (WebException)
    {
        // A try block needs a catch or finally; ignore URLs that fail to download.
    }
}
Instead of loading the complete content behind the link, I would issue a HEAD request.
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
Quoted from http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
See these questions for C# examples
How to check if a file exists on a server using c# and the WebClient class
How to check if System.Net.WebClient.DownloadData is downloading a binary file?
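A minimal sketch of such a HEAD request (my own illustration; url comes from the question's loop, and the content types treated as "downloads" are just examples):
// Requires: using System.Net;
var request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "HEAD";
using (var response = (HttpWebResponse)request.GetResponse())
{
    string contentType = response.ContentType; // e.g. "application/zip" or "text/html"
    // A Content-Disposition: attachment header is another strong hint of a download.
    bool looksLikeDownload =
        contentType.StartsWith("application/zip") ||
        contentType.StartsWith("application/octet-stream") ||
        contentType.StartsWith("application/x-msdownload");
}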
You're on the right track; you'll need to examine the ResponseHeaders after a successful request:
var someType = "application/zip";
if (client.ResponseHeaders["Content-Type"].Contains(someType)) {
// this was a "download link"
}
The tricky part will be in determining what constitutes a download link since there are so many content types possible. For example, how would you decide whether XML data is a download link or not?
Try checking WebClient's ResponseHeaders collection to validate the response's file type.
In case anyone has the same problem: I ended up using an attribute in the history database (places.sqlite) which came in very handy!
places.sqlite contains a table called moz_historyvisits which has a column visit_type. According to [1], a visit_type of 7 is a download. Therefore, reading this value tells you whether a visit was a download without reading the response headers or even sending a HEAD request.
[1] http://www.firefoxforensics.com/research/moz_historyvisits.shtml
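For reference, a query along these lines should pull out only the download visits (a sketch based on the usual places.sqlite schema; verify the table and column names against your own copy):
// Hypothetical query: join visits to places and keep only downloads (visit_type = 7).
string sql =
    "SELECT p.url " +
    "FROM moz_historyvisits v " +
    "JOIN moz_places p ON p.id = v.place_id " +
    "WHERE v.visit_type = 7";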
I am trying to read the HTML of a page that contains a non-delayed redirect. The following snippet (C#) will give me the destination/redirected page, not the initial one I need to see:
using System.Net;
using System.Text;
public class SomeClass {
public static void Main() {
byte[] data = new WebClient().DownloadData("http://SomeUrl.com");
System.Console.WriteLine(Encoding.ASCII.GetString(data));
}
}
Is there a way to get the HTML of a redirecting page? (I prefer .NET but a snippet in Java or Python would be fine too. Thx!)
Unless the redirect is done on the client side (e.g. a meta refresh or JavaScript), you can't. If the redirect is done server side, then no HTML for the original page is actually generated for the client; the response simply carries a header pointing at the new URL.
It would take more work, but rather than using WebClient, use HttpWebRequest and set the AllowAutoRedirect property to false. The redirect response (301/302) is then handed back to you instead of being followed, so you can read any response text it carries (and some pages do have response text along with the redirect) and then issue another HttpWebRequest for the redirect URL (specified in the Location response header).
You might be able to do something similar with WebClient if you create a derived class, say MyWebClient, where you override the GetWebRequest method and set AllowAutoRedirect there. I don't know what kind of exception, if any, the DownloadData method will throw if you do something like that.
As somebody said previously, this will only work for redirects the client is told about (typically a 301 or 302 response). If the redirection happens entirely on the server (an internal transfer or rewrite), you'd never know it.
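A sketch of that HttpWebRequest approach (my own illustration; whether the redirect response carries any HTML body depends entirely on the server):
// Requires: using System.IO; using System.Net;
var request = (HttpWebRequest)WebRequest.Create("http://SomeUrl.com");
request.AllowAutoRedirect = false; // hand back the 301/302 instead of following it
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string bodyOfRedirectPage = reader.ReadToEnd();        // often empty or very short
    string redirectTarget = response.Headers["Location"];  // where the page points next
}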
Simplest answer would be to add the current page onto the QueryString component of the redirect when redirecting, for instance:
Response.Redirect(newPage + "?FromPage=" + Request.Url);
Then the new page can see where you came from by simply looking at Request.QueryString["FromPage"].
If you want to get the source of an html page you can use this tool:
http://www.selfseo.com/html_source_view.php